Nick Coghlan added the comment:
The redecode thing is a distraction from my core concern here, so I've split
that out to issue #22264, a separate RFE for a "wsgiref.fix_encoding" function.
For this issue, my main concern is the function to *clean* a string of escaped
binary data, so it can be displayed easily, or otherwise purged of the escaped
characters. Preserving the data by default is good, but you have to know a
*lot* about how Python 3 works in order to be able to figure out how to clean it
out.
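
To make the difficulty concrete, here is a minimal sketch using only the existing str and bytes methods. The sample bytes and variable names are illustrative assumptions, not taken from the issue:

```python
raw = b'caf\xe9'  # illustrative input: a Latin-1 byte that is not valid ASCII

# Decoding with surrogateescape keeps the undecodable byte as a lone
# surrogate code point (the PEP 383 mechanism).
text = raw.decode('ascii', errors='surrogateescape')

# The original bytes can be recovered with the same error handler.
round_tripped = text.encode('ascii', errors='surrogateescape')

# A strict encode fails, because lone surrogates cannot be encoded.
try:
    text.encode('utf-8')
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# One way to purge the escapes today: go back to bytes, then decode
# again with errors='replace'.
cleaned = round_tripped.decode('ascii', errors='replace')
```

The last step works, but only if you already know that a round trip through bytes is what removes the escapes.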
For that, not knowing Unicode in general isn't the problem: it's not knowing
PEP 383. If we forget the idea of exposing the constant with the escaped values
(I agree that's not very useful), it suggests "codecs.clean_surrogate_escapes"
as a possible name:

import re

# Helper to ensure a string contains no escaped surrogates.
# This allows it to be safely encoded without surrogateescape.
_extended_ascii = bytes(range(128, 256))
_escaped_surrogates = _extended_ascii.decode('ascii',
                                             errors='surrogateescape')
_match_escaped = re.compile('[{}]'.format(_escaped_surrogates))

def clean_surrogate_escapes(s, repl='\ufffd'):
    return _match_escaped.sub(repl, s)

A more efficient implementation in C would also be fine; this is just an easy
way to define the exact semantics.
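
As a usage sketch of the proposed helper (the function is only a suggestion in this comment and does not exist in the codecs module; the file name below is an illustrative assumption), the definition is repeated so the example runs on its own:

```python
import re

# Code points U+DC80..U+DCFF, which surrogateescape uses for bytes 0x80..0xFF.
_escaped = bytes(range(128, 256)).decode('ascii', errors='surrogateescape')
_match_escaped = re.compile('[{}]'.format(_escaped))

def clean_surrogate_escapes(s, repl='\ufffd'):
    """Replace every surrogateescape'd code point in s with repl."""
    return _match_escaped.sub(repl, s)

# Hypothetical file name containing two undecodable bytes.
name = b'r\xe9sum\xe9.txt'.decode('ascii', errors='surrogateescape')

displayable = clean_surrogate_escapes(name)        # mark the bad bytes
stripped = clean_surrogate_escapes(name, repl='')  # or drop them entirely

# After cleaning, a strict encode no longer raises UnicodeEncodeError.
encoded = displayable.encode('utf-8')

# Ordinary non-ASCII text is left alone.
untouched = clean_surrogate_escapes('caf\xe9')
```

Passing repl='' covers the "purged" case, while the default U+FFFD covers the "displayed easily" case.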
(I also just noticed that unlike other error handlers, surrogateescape and
surrogatepass do not have corresponding codecs.surrogateescape_errors and
codecs.surrogatepass_errors functions)
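
A quick way to check this observation on a given interpreter is to compare the error handler registry with the module-level functions. This is a sketch that assumes the `<name>_errors` naming pattern described above; the results may differ between Python versions:

```python
import codecs

handler_names = ['strict', 'ignore', 'replace', 'xmlcharrefreplace',
                 'backslashreplace', 'surrogateescape', 'surrogatepass']

# Every name here should be resolvable through the handler registry.
registered = {name: callable(codecs.lookup_error(name))
              for name in handler_names}

# Whether the codecs module also exposes a matching <name>_errors function.
exposed = {name: hasattr(codecs, name + '_errors')
           for name in handler_names}

for name in handler_names:
    print('{:18} registered={} exposed={}'.format(
        name, registered[name], exposed[name]))
```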
----------
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue18814>
_______________________________________