[issue18814] Add tools for "cleaning" surrogate escaped strings

Nick Coghlan Wed, 27 Aug 2014 04:01:45 -0700

Nick Coghlan added the comment:

Note that pairing fsencode with 'utf-8' isn't guaranteed to do the right thing. 
It would work for the default C locale (since that's ASCII), but not in the 
general case.
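
To illustrate the mismatch, here is a minimal sketch that stands in for 
os.fsencode with an explicit codec. Latin-1 is only an assumed example of a 
non-UTF-8 filesystem encoding; the real value depends on the locale:

```python
# Assumption: the filesystem encoding is Latin-1, so os.fsencode() would
# behave like str.encode('latin-1', 'surrogateescape') on this system.
name = 'caf\u00e9'   # a perfectly valid string, no escaped surrogates

# Encode with the (assumed) filesystem encoding, then decode as UTF-8.
fs_bytes = name.encode('latin-1', 'surrogateescape')
mangled = fs_bytes.decode('utf-8', 'replace')   # the valid character is lost

# Using UTF-8 on both sides leaves valid text untouched.
utf8_bytes = name.encode('utf-8', 'surrogateescape')
intact = utf8_bytes.decode('utf-8', 'replace')
```

With mismatched codecs the valid U+00E9 is replaced, which is why the 
pattern needs the same encoding on both sides.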


Enhancing backslashreplace to also work on input is an interesting idea, but 
worth making its own RFE: http://bugs.python.org/issue22286

I also agree we can ignore xmlcharrefreplace here.

So that leaves the basic pattern as:

data.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')
data.encode('utf-8', 'surrogateescape').decode('utf-8', 'ignore')
data.encode('utf-8', 'surrogateescape').decode('utf-8', 'backslashreplace')

This wouldn't allow the option of substituting an ASCII question mark, but I'd 
be OK with that.

Possible function name and implementation:

    def convert_surrogateescape(data, errors='replace'):
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
    
Added bonus: pass "errors='strict'" and you'll get an exception if there were 
any surrogate escaped values in the string. (I take that emergent property as a 
sign that we're converging on a sensible design here)
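
A short sketch of how the proposed function would behave, assuming the 
implementation above. The sample bytes are made up for illustration, and the 
'backslashreplace' call relies on decoding support from issue22286 (available 
in Python 3.5 and later):

```python
def convert_surrogateescape(data, errors='replace'):
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

# A str holding an escaped surrogate, as produced when undecodable bytes
# are decoded with the surrogateescape handler (0xFF is never valid UTF-8).
escaped = b'caf\xff'.decode('utf-8', 'surrogateescape')

replaced = convert_surrogateescape(escaped)                 # U+FFFD substituted
ignored = convert_surrogateescape(escaped, 'ignore')        # bad byte dropped
escaped_repr = convert_surrogateescape(escaped, 'backslashreplace')

# 'strict' doubles as a check for escaped surrogates.
try:
    convert_surrogateescape(escaped, 'strict')
    had_escapes = False
except UnicodeDecodeError:
    had_escapes = True

# A string without escaped surrogates passes through unchanged.
clean = convert_surrogateescape('caf\u00e9', 'strict')
```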

Adding a fast path for keeping track of whether or not a string contains 
escaped surrogates would then be a separate RFE.

----------
dependencies: +Allow backslashreplace error handler to be used on input

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18814>
_______________________________________
