Nick Coghlan added the comment:
Ideally we'd have string modification support for all the translations we offer
as codec error handlers:
* Unicode replacement character ('replace' on input)
* ASCII question mark ('replace' on output)
* Dropping them entirely ('ignore')
* XML character reference ('xmlcharrefreplace')
* Python escape sequence ('backslashreplace')
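
For reference, a minimal sketch of what each of those handlers does today when it is applied by a codec. The sample strings are made up for illustration, and ASCII is used only because it makes the failures easy to trigger:

```python
# Illustrative sample data (not from the original discussion).
text = 'na\u00efve \u2603'   # two characters that ASCII cannot encode
raw = b'na\xefve'            # 0xEF is not a valid ASCII byte

# 'replace' on input: each undecodable byte becomes U+FFFD
decoded_replace = raw.decode('ascii', 'replace')

# 'replace' on output: each unencodable character becomes '?'
encoded_replace = text.encode('ascii', 'replace')

# 'ignore': the offending characters are dropped entirely
encoded_ignore = text.encode('ascii', 'ignore')

# 'xmlcharrefreplace': XML numeric character references
encoded_xml = text.encode('ascii', 'xmlcharrefreplace')

# 'backslashreplace': Python escape sequences
encoded_backslash = text.encode('ascii', 'backslashreplace')
```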
The reason it's beneficial to be able to do these as string transformations
rather than only in the codecs is that you may just be contributing part of the
output, with the actual encoding operation handled elsewhere (e.g. you may be
storing it in a data structure that will later be encoded as JSON or XML, or my
earlier example of generating a list of files to be included in an email).
Surrogates are great when you're just passing data straight back to the
operating system. They're not so great when you're passing them on to other
parts of the application as text. I'd prefer to be able to deal with them
closer to the point of origin, at least in some cases.
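
As a rough illustration of that difference (the filename bytes here are hypothetical, standing in for whatever the OS hands back):

```python
# Hypothetical filename bytes; 0xFF can never appear in valid UTF-8.
raw_name = b'report-\xff.txt'

# surrogateescape smuggles the undecodable byte through as the lone
# surrogate U+DCFF instead of raising an error.
name = raw_name.decode('utf-8', 'surrogateescape')

# Passing the string straight back to the OS round-trips losslessly.
round_tripped = name.encode('utf-8', 'surrogateescape')

# Any other part of the application that encodes it strictly will fail.
try:
    name.encode('utf-8')
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
```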
Now, some of these things *can* be done today using Serhiy's trick of encoding
to UTF-8 and then decoding again:
    data.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')
    data.encode('utf-8', 'replace').decode('utf-8')
    data.encode('utf-8', 'ignore').decode('utf-8')
However, these two don't work properly:
    data.encode('utf-8', 'xmlcharrefreplace').decode('utf-8')
    data.encode('utf-8', 'backslashreplace').decode('utf-8')
Those don't work because they'll encode the *surrogate escaped bytes* (the
U+DCxx code points), rather than the original bytes.
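
To make the difference concrete, here is a small sketch using a made-up filename containing one undecodable byte. The variable names are just for illustration:

```python
# Hypothetical input: 0xE9 is Latin-1 'e-acute', which is invalid as UTF-8 here.
data = b'caf\xe9.txt'.decode('utf-8', 'surrogateescape')   # 'caf\udce9.txt'

# These three give a result free of surrogates.
as_replacement = data.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')
as_question = data.encode('utf-8', 'replace').decode('utf-8')
as_dropped = data.encode('utf-8', 'ignore').decode('utf-8')

# These two run without error, but they describe the surrogate code point
# U+DCE9 (decimal 56553) rather than the original byte 0xE9.
as_xml = data.encode('utf-8', 'xmlcharrefreplace').decode('utf-8')
as_backslash = data.encode('utf-8', 'backslashreplace').decode('utf-8')
```

What you'd presumably want from the last two is something based on 0xE9 (e.g. `\xe9`), not on the escape code point.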
Mapping the escaped bytes to percent encoding has the same problem - you likely
want to do a two step transformation (escaped surrogate -> original byte ->
percent encoded value), rather than directly percent encoding the already
escaped bytes.
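
A minimal sketch of that two step transformation as a plain string operation. `percent_encode_escaped` is a hypothetical helper, not an existing API, and it assumes the surrogates in the string all came from surrogateescape (i.e. they lie in U+DC80..U+DCFF):

```python
import re

# surrogateescape maps each undecodable byte 0x80..0xFF to U+DC80..U+DCFF.
_ESCAPED_BYTE = re.compile('[\udc80-\udcff]')

def percent_encode_escaped(text):
    """Hypothetical helper: escaped surrogate -> original byte -> %XX."""
    def _render(match):
        original_byte = ord(match.group()) - 0xDC00   # step 1: recover the byte
        return '%{:02X}'.format(original_byte)        # step 2: percent encode it
    return _ESCAPED_BYTE.sub(_render, text)

data = b'caf\xe9.txt'.decode('utf-8', 'surrogateescape')
encoded = percent_encode_escaped(data)
```

This only touches the escaped bytes; it leaves everything else (including any literal '%' already in the string) alone, so it isn't a complete percent encoder. Swapping the format string in `_render` would give the equivalent backslash or XML character reference variants.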
----------
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue18814>
_______________________________________