Nick Coghlan added the comment:
My main use case is for passing data to other applications that *don't* have
their Unicode handling in order - I want to be able to use Python to do the
data scrubbing, but at the moment it requires intimate knowledge of the codec
error handling system to do it. (I had never even heard of surrogatepass until
this evening.)
Situation:
What I have: data decoded with surrogateescape
What I want: that same data with all the surrogates gone, replaced with either
the Unicode replacement character or an ASCII question mark (which one I want
depends on the exact situation)
Assume I am largely clueless about the codec system. I know nothing beyond the
fact that Python 3 strings may have smuggled bytes in them and I want to get
rid of them because they confuse the application I'm passing them to.
The concrete example that got me thinking about this again was the task of
writing filenames into a UTF-8 encoded email, and wanting to scrub the output
from os.listdir before writing the list into the email (s/email/web page/ also
works).
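
As an illustration only (this code is not from the issue), the filename
scenario above might look like the following sketch. The byte strings stand in
for os.listdir output on a POSIX system, and the helper name scrub_surrogates
is hypothetical; it is not an existing standard library function.

```python
import re

# Stand-in for os.listdir() results: one name is valid UTF-8, the other
# contains a stray 0xFF byte, which surrogateescape smuggles into the
# string as the lone surrogate U+DCFF.
raw_names = [b'report.txt', b'caf\xff.txt']
names = [n.decode('utf-8', 'surrogateescape') for n in raw_names]

# Encoding the smuggled name as strict UTF-8 fails, which is what goes
# wrong when the list is written into a UTF-8 email or web page.
try:
    names[1].encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Matches any surrogate code point (U+D800..U+DFFF). surrogateescape only
# ever produces U+DC80..U+DCFF, so this range is a superset.
_SURROGATES = re.compile('[\ud800-\udfff]')

def scrub_surrogates(text, replacement='\ufffd'):
    """Hypothetical helper: replace every surrogate in text."""
    return _SURROGATES.sub(replacement, text)

scrubbed = [scrub_surrogates(n) for n in names]
body = '\n'.join(scrubbed).encode('utf-8')  # now encodes cleanly
```

Note that this per-character substitution may yield a different number of
replacement characters than decoding with errors='replace' would have for a
multi-byte invalid sequence.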
For issue #22016 I actually suggested doing this as *another* codec error
handler ("surrogatereplace"), but Stephen Turnbull convinced me this original
idea was better: it should just be a pure data transformation pass on the
string, clearing the surrogates out, and leaving me with data that is identical
to what I would have had if "surrogatereplace" had been used instead of
"surrogateescape" in the first place.
As "errors='replace'" already covers the "ASCII ?" replacement case, that means
your proposed "redecode" based solution would cover the rest of my use case.
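
A minimal sketch of the two round trips described above, assuming the data was
originally decoded as UTF-8 with surrogateescape. The function names are
hypothetical and are not the API proposed in this issue; only the existing
str.encode and bytes.decode methods and the standard error handlers are used.

```python
def redecode_replace(text, encoding='utf-8'):
    """Hypothetical helper: turn smuggled bytes into U+FFFD.

    Encoding with surrogateescape restores the original raw bytes, and
    decoding again with 'replace' substitutes U+FFFD for invalid input.
    """
    return text.encode(encoding, 'surrogateescape').decode(encoding, 'replace')

def replace_with_question_mark(text, encoding='utf-8'):
    """Hypothetical helper: turn smuggled bytes into ASCII '?'.

    Encoding with 'replace' writes '?' for anything the codec cannot
    encode, which for UTF-8 means the lone surrogates.
    """
    return text.encode(encoding, 'replace').decode(encoding)

smuggled = b'caf\xff.txt'.decode('utf-8', 'surrogateescape')
with_fffd = redecode_replace(smuggled)
with_qmark = replace_with_question_mark(smuggled)
```

Both results can then be encoded as strict UTF-8 without error. The first
round trip only works for surrogates in the range that surrogateescape itself
produces; other lone surrogates would raise UnicodeEncodeError.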
----------
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue18814>
_______________________________________