Nick Coghlan added the comment:

As RDM noted, avoiding the use of surrogateescape isn't feasible when we do it by default on all OS interfaces (including the standard streams when we detect 'ascii' as the filesystem encoding in 3.5+).
This *needs* to be a case that folks can handle without needing to spend years learning about encodings and error handlers first. That means being able to tell them "use this documented function to remove the surrogates" rather than "use this magic incantation that you don't understand, and that other people may not be able to read".

I know more about Unicode encodings than the average programmer at this point, yet I still needed to be schooled by true experts in this thread to learn how to solve the problem properly. Look at this as an opportunity to encapsulate that knowledge in executable form: while the code is short, it is conceptually *very* dense.

If there's a dedicated function, then replacing the encode/decode dance with a faster pure C alternative also becomes a future possibility (with only a recipe, there's no opportunity to ever optimise it).

With the additional clarification, it is also clear to me that Antoine is correct that the encoding needs to be configurable and should default to the appropriate setting to remove the surrogates from OS-provided data. With that change:

    import sys

    def convert_surrogates(data, encoding=None, errors='replace'):
        """Convert escaped surrogates by applying a different error handler

        If no encoding is given, defaults to sys.getfilesystemencoding()
        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        if encoding is None:
            encoding = sys.getfilesystemencoding()
        # Undo the escaping to recover the original bytes, then decode
        # them again with the requested error handler.
        return data.encode(encoding, 'surrogateescape').decode(encoding, errors)

Since it's primarily intended for cleaning OS-provided data, I agree os.convert_surrogates() could be a good choice. It would be appropriate to reference it from os.fsdecode() as a way to clean escaped data when the original binary data is no longer available to be decoded again with a different error handler.
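For illustration only, here is a small usage sketch of the function proposed above. It assumes the function is defined locally (it is a proposal in this thread, not an existing os or codecs API), and the sample bytes and variable names are made up for the example:

```python
import sys


def convert_surrogates(data, encoding=None, errors='replace'):
    """Local copy of the function proposed in this comment."""
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    return data.encode(encoding, 'surrogateescape').decode(encoding, errors)


# Hypothetical OS-provided data: Latin-1 bytes that are invalid as UTF-8.
raw = b'caf\xe9'

# What a surrogateescape-based interface hands back: the undecodable
# byte 0xE9 is smuggled through as the lone surrogate U+DCE9.
escaped = raw.decode('utf-8', 'surrogateescape')

# Default handler ('replace'): the surrogate becomes U+FFFD.
cleaned = convert_surrogates(escaped, 'utf-8')

# Any other decoding error handler can be requested instead.
dropped = convert_surrogates(escaped, 'utf-8', errors='ignore')

print(ascii(escaped))  # 'caf\udce9'
print(ascii(cleaned))  # 'caf\ufffd'
print(ascii(dropped))  # 'caf'
```

The encoding is passed explicitly here so the result does not depend on the platform's filesystem encoding; when it is omitted, the function falls back to sys.getfilesystemencoding() as described above.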
----------
title: Add codecs.convert_surrogateescape to "clean" surrogate escaped strings -> Add a convert_surrogates function to "clean" surrogate escaped strings

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18814>
_______________________________________