Nick Coghlan added the comment: Note that pairing fsencode with 'utf-8' isn't guaranteed to do the right thing. It would work for the default C locale (since that's ASCII), but not in the general case.
Enhancing backslashreplace to also work on input is an interesting idea, but worth making it's own RFE: http://bugs.python.org/issue22286 I also agree we can ignore xmlcharrefreplace here. So that leaves the basic pattern as: data.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace') data.encode('utf-8', 'surrogateescape').decode('utf-8', 'ignore') data.encode('utf-8', 'surrogateescape').decode('utf-8', 'backslashreplace') This wouldn't allow the option of substituting an ASCII question mark, but I'd be OK with that. Possible function name and implementation: def convert_surrogateescape(data, errors='replace'): return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors) Added bonus: pass "errors='strict'" and you'll get an exception if there were any surrogate escaped values in the string. (I take that emergent property as a sign that we're converging on a sensible design here) Adding a fast path for keeping track of whether or not a string contains escaped surrogates would then be a separate RFE. ---------- dependencies: +Allow backslashreplace error handler to be used on input _______________________________________ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue18814> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com