Nick Coghlan added the comment:

As RDM noted, avoiding the use of surrogateescape isn't feasible when we do it 
by default on all OS interfaces (including the standard streams when we detect 
'ascii' as the filesystem encoding in 3.5+).
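
For anyone who has not run into this yet, here is a minimal sketch of what an escaped surrogate looks like. It uses an explicit 'utf-8' codec rather than the real filesystem encoding so the result does not depend on the platform, and the variable names are purely illustrative:

```python
# Undecodable bytes become lone surrogates in the range U+DC80..U+DCFF.
raw = b'caf\xe9'  # Latin-1 encoded bytes, not valid UTF-8
escaped = raw.decode('utf-8', 'surrogateescape')
print(ascii(escaped))  # 'caf\udce9'

# The escape is lossless: encoding with the same handler restores the bytes.
round_tripped = escaped.encode('utf-8', 'surrogateescape')

# A strict encode rejects the lone surrogate, which is where users get stuck.
try:
    escaped.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

The string looks ordinary until something tries to encode it strictly, for example when writing it to a UTF-8 file or socket.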

This *needs* to be a case that folks can handle without needing to spend years 
learning about encodings and error handlers first. That means being able to 
tell them "use this documented function to remove the surrogates" rather than 
"use this magic incantation that you don't understand, and that other people 
may not be able to read".

I know more about Unicode encodings than the average programmer at this point, 
yet I still needed to be schooled by true experts in this thread to learn how 
to solve the problem properly.

Look at this as an opportunity to encapsulate that knowledge in executable 
form: while the code is short, it is conceptually *very* dense.

If there's a dedicated function, then replacing the encode/decode dance with a 
faster pure C alternative also becomes a future possibility (with only a 
recipe, there's no opportunity to ever optimise it).

With the additional clarification, it is also clear to me that Antoine is 
correct: the encoding needs to be configurable, and should default to the 
setting appropriate for removing the surrogates from OS-provided data.

With that change:

    import sys

    def convert_surrogates(data, encoding=None, errors='replace'):
        """Convert escaped surrogates by applying a different error handler.

        If no encoding is given, defaults to sys.getfilesystemencoding().
        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        if encoding is None:
            encoding = sys.getfilesystemencoding()
        # Re-encode to recover the original bytes, then decode them again
        # with the requested error handler.
        return data.encode(encoding, 'surrogateescape').decode(encoding, errors)

Since it's primarily intended for cleaning OS-provided data, I agree 
os.convert_surrogates() could be a good choice. It would be appropriate to 
reference it from os.fsdecode() as a way to clean escaped data when the 
original binary data is no longer available to be decoded again with a 
different error handler.
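
As a rough usage sketch of the function proposed above (it is a proposal, not an existing os or codecs API), the definition is repeated here so the snippet runs on its own. The encoding is passed explicitly as 'utf-8' only to keep the result independent of the local filesystem encoding:

```python
import sys

def convert_surrogates(data, encoding=None, errors='replace'):
    # Same logic as the proposal: recover the original bytes, then decode
    # them again with a different error handler.
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    return data.encode(encoding, 'surrogateescape').decode(encoding, errors)

# A string carrying one escaped byte (0xE9), as OS interfaces might return it.
escaped = 'caf\udce9'

# Default handler: the undecodable byte becomes U+FFFD.
cleaned = convert_surrogates(escaped, encoding='utf-8')
print(ascii(cleaned))  # 'caf\ufffd'

# Any other decoding error handler can be chosen, e.g. dropping the byte.
stripped = convert_surrogates(escaped, encoding='utf-8', errors='ignore')
print(ascii(stripped))  # 'caf'

# Text without escaped surrogates passes through unchanged.
untouched = convert_surrogates('plain text', encoding='utf-8')
```

After the conversion the result can be encoded strictly, at the cost of no longer being able to round-trip back to the original bytes.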

----------
title: Add codecs.convert_surrogateescape to "clean" surrogate escaped strings 
-> Add a convert_surrogates function to "clean" surrogate escaped strings

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18814>
_______________________________________