[issue18814] Add tools for "cleaning" surrogate escaped strings

Nick Coghlan Sat, 23 Aug 2014 20:00:28 -0700

Nick Coghlan added the comment:

Based on the latest round of bytes handling discussions on python-dev, I came 
up with this updated proposal:


    import re

    # Constant in the string module (akin to string.ascii_letters et al)
    escaped_surrogates = bytes(range(128, 256)).decode(
        'ascii', errors='surrogateescape')

    # Helper to ensure a string contains no escaped surrogates
    # This allows it to be safely encoded without surrogateescape
    _match_surrogates = re.compile('[{}]'.format(escaped_surrogates))
    def clean(s, repl='\ufffd'):
        return _match_surrogates.sub(repl, s)

    # Helper to redecode a string that was decoded incorrectly
    # For example, WSGI strings are passed from the server to the
    # framework as latin-1 by default and may need to be redecoded
    def redecode(s, encoding, errors='strict',
                 old_encoding='latin-1', old_errors='strict'):
        return s.encode(old_encoding, old_errors).decode(encoding, errors)
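
As a rough illustration of how these helpers might be used (a sketch only: the
sample byte strings and the variable names raw, escaped, cleaned, wsgi_value
and fixed are made up for this example, and the definitions are repeated so the
snippet runs on its own):

```python
import re

# Definitions repeated from the proposal so this snippet is standalone
escaped_surrogates = bytes(range(128, 256)).decode(
    'ascii', errors='surrogateescape')
_match_surrogates = re.compile('[{}]'.format(escaped_surrogates))

def clean(s, repl='\ufffd'):
    return _match_surrogates.sub(repl, s)

def redecode(s, encoding, errors='strict',
             old_encoding='latin-1', old_errors='strict'):
    return s.encode(old_encoding, old_errors).decode(encoding, errors)

# Hypothetical input: a byte string with one byte that is not valid UTF-8
raw = b'caf\xe9.txt'
escaped = raw.decode('utf-8', errors='surrogateescape')

# A strict encode of the escaped string fails because of the lone surrogate
try:
    escaped.encode('utf-8')
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# After cleaning, the surrogate is replaced by U+FFFD and encoding succeeds
cleaned = clean(escaped)
encoded = cleaned.encode('utf-8')

# Hypothetical WSGI-style value: UTF-8 bytes that were decoded as latin-1
wsgi_value = 'caf\u00e9'.encode('utf-8').decode('latin-1')
fixed = redecode(wsgi_value, 'utf-8')
```

Note that clean() is lossy by design: the original byte cannot be recovered
once it has been replaced, whereas redecode() round-trips the data exactly.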

In addition to the concrete use cases David describes, I think these will also 
serve a useful documentation purpose by highlighting the two main mechanisms 
for "smuggling" raw binary data through text APIs (i.e. surrogate escapes and 
latin-1 decoding).
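
To make the two mechanisms concrete, here is a minimal sketch (the payload
bytes and the variable names are illustrative assumptions, not part of the
proposal) showing that both round-trip arbitrary bytes, but through different
code points:

```python
# Arbitrary binary payload, including bytes outside the ASCII range
payload = bytes([0x00, 0x41, 0x80, 0xff])

# Mechanism 1: surrogate escapes. Undecodable bytes 0x80-0xFF become
# lone surrogates U+DC80-U+DCFF and are turned back into bytes on encode.
via_surrogates = payload.decode('ascii', errors='surrogateescape')
roundtrip_surrogates = via_surrogates.encode('ascii', errors='surrogateescape')

# Mechanism 2: latin-1 decoding. Every byte maps to the code point with
# the same value, so the round trip is lossless even with strict errors.
via_latin1 = payload.decode('latin-1')
roundtrip_latin1 = via_latin1.encode('latin-1')

# Both recover the original bytes, but the intermediate strings differ
surrogate_points = [hex(ord(c)) for c in via_surrogates]
latin1_points = [hex(ord(c)) for c in via_latin1]
print(surrogate_points)  # ['0x0', '0x41', '0xdc80', '0xdcff']
print(latin1_points)     # ['0x0', '0x41', '0x80', '0xff']
```

The surrogate form is detectable after the fact (which is what clean() relies
on), while latin-1 smuggled data is indistinguishable from ordinary text
unless the caller already knows how it was decoded.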

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue18814>
_______________________________________