Nick Coghlan added the comment:

Ideally we'd have string modification support for all the translations we offer 
as codec error handlers:

* Unicode replacement character ('replace' on input)
* ASCII question mark ('replace' on output)
* Dropping them entirely ('ignore')
* XML character reference ('xmlcharrefreplace')
* Python escape sequence ('backslashreplace')

The reason it's beneficial to be able to do these as string transformations 
rather than only in the codecs is that you may just be contributing part of the 
output, with the actual encoding operation handled elsewhere (e.g. you may be 
storing it in a data structure that will later be encoded as JSON or XML, or my 
earlier example of generating a list of files to be included in an email). 
Surrogates are great when you're just passing data straight back to the 
operating system. They're not so great when you're passing them on to other 
parts of the application as text. I'd prefer to be able to deal with them 
closer to the point of origin, at least in some cases.

Now, some of these things *can* be done today using Serhiy's trick of encoding 
to UTF-8 and then decoding again:

    data.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')
    data.encode('utf-8', 'replace').decode('utf-8')
    data.encode('utf-8', 'ignore').decode('utf-8')
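
As a rough illustration of those three round trips, here is a minimal sketch. The sample bytes are made up for the example and stand in for a filename that is not valid UTF-8:

```python
# Hypothetical sample: raw bytes that are not valid UTF-8, decoded with
# surrogateescape (as happens for undecodable filenames and similar OS data).
raw = b'caf\xe9.txt'
data = raw.decode('utf-8', 'surrogateescape')  # 'caf\udce9.txt'

# 'replace' on input: the escaped byte becomes U+FFFD (possibly several,
# since the surrogate is passed through as a multi-byte sequence first).
replaced_in = data.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')

# 'replace' on output: the escaped byte becomes an ASCII question mark.
replaced_out = data.encode('utf-8', 'replace').decode('utf-8')

# 'ignore': the escaped byte is dropped entirely.
dropped = data.encode('utf-8', 'ignore').decode('utf-8')

print(ascii(replaced_in), replaced_out, dropped)
```

In each case the result no longer contains a lone surrogate, so it can be handed to code that expects well-formed text.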

However, these two don't work properly:

    data.encode('utf-8', 'xmlcharrefreplace').decode('utf-8')
    data.encode('utf-8', 'backslashreplace').decode('utf-8')

Those don't work because they'll encode the *surrogate escaped bytes* (the 
code points in the U+DC80 to U+DCFF range), rather than the original bytes.
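
To make the problem concrete, a small sketch using the same kind of made-up sample (the byte 0xE9 is only an illustrative value):

```python
# Hypothetical sample: the undecodable byte 0xE9 is escaped as U+DCE9.
data = b'caf\xe9.txt'.decode('utf-8', 'surrogateescape')

xml = data.encode('utf-8', 'xmlcharrefreplace').decode('utf-8')
bs = data.encode('utf-8', 'backslashreplace').decode('utf-8')

# Both results refer to the surrogate code point U+DCE9 (decimal 56553),
# not to the original byte 0xE9 (decimal 233).
print(xml)  # caf&#56553;.txt
print(bs)   # caf\udce9.txt
```

A consumer of that output has no way of knowing that U+DCE9 was standing in for the byte 0xE9 unless it also knows about surrogateescape.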

Mapping the escaped bytes to percent encoding has the same problem: you likely 
want to do a two step transformation (escaped surrogate -> original byte -> 
percent encoded value), rather than directly percent encoding the already 
escaped bytes.
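
One possible shape for that two step transformation is sketched below. The helper name `percent_encode_escaped` is hypothetical (it is not an existing stdlib function), and it only touches surrogate-escaped code points, leaving all other characters alone:

```python
def percent_encode_escaped(text):
    """Replace each surrogate-escaped byte with %XX of the original byte.

    Hypothetical helper, for illustration only.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Step 1: escaped surrogate -> original byte value.
            original_byte = cp - 0xDC00
            # Step 2: original byte -> percent encoded value.
            out.append('%{:02X}'.format(original_byte))
        else:
            out.append(ch)
    return ''.join(out)


data = b'caf\xe9.txt'.decode('utf-8', 'surrogateescape')
print(percent_encode_escaped(data))  # caf%E9.txt
```

The same loop could presumably emit an XML character reference or a `\xNN` escape for the original byte instead, simply by changing the format string.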

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18814>
_______________________________________