Steven D'Aprano:

> For those cases where you do wish to take an arbitrary byte stream and
> round-trip it, Python now provides an error handler for that.
> py> import random
> py> b = bytes([random.randint(0, 255) for _ in range(10000)])
> py> s = b.decode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 0:
> invalid start byte
> py> s = b.decode('utf-8', errors='surrogateescape')
> py> s.encode('utf-8', errors='surrogateescape') == b
> True

That is indeed a valid workaround. With it we achieve

   b.decode('utf-8', errors='surrogateescape'). \
       encode('utf-8', errors='surrogateescape') == b

for any bytes b. The handler goes to great lengths to address the Linux
programmer's situation.
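
The round-trip claim can be spot-checked with a minimal sketch like the one below. The helper name roundtrip is made up for illustration, and checking all one- and two-byte inputs is only a sanity test, not a proof for arbitrary lengths.

```python
def roundtrip(b: bytes) -> bytes:
    """Decode as UTF-8 with surrogateescape, then encode back the same way."""
    s = b.decode('utf-8', errors='surrogateescape')
    return s.encode('utf-8', errors='surrogateescape')

# Every single byte value survives the round trip ...
all_single_ok = all(roundtrip(bytes([i])) == bytes([i]) for i in range(256))

# ... and so does every two-byte sequence.
all_pairs_ok = all(
    roundtrip(bytes([i, j])) == bytes([i, j])
    for i in range(256)
    for j in range(256)
)

# Undecodable bytes are smuggled through as lone surrogates U+DC80..U+DCFF.
escaped = b'\x80\xff'.decode('utf-8', errors='surrogateescape')

print(all_single_ok, all_pairs_ok, ascii(escaped))
```

Bytes that happen to form valid UTF-8 decode to ordinary characters; only the undecodable ones are mapped into the surrogate range.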


However:

 * it's not UTF-8 but a variant of it,

 * it sacrifices the ordering correspondence of UTF-8:

   >>> '\udc80' > 'ä'
   True
   >>> '\udc80'.encode('utf-8', errors='surrogateescape') > \
   ...        'ä'.encode('utf-8', errors='surrogateescape')
   False

 * it still isn't bijective between str and bytes:

   >>> '\udd00'.encode('utf-8', errors='surrogateescape')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   UnicodeEncodeError: 'utf-8' codec can't encode character 
   '\udd00' in position 0: surrogates not allowed
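
The three objections above can be reproduced in one place with a sketch along these lines. The names ESC and encodable are invented for this illustration, and the four code points tested in the last step are just sample surrogates outside U+DC80..U+DCFF.

```python
ESC = {'errors': 'surrogateescape'}

# 1. Not strict UTF-8: a string produced by the handler is rejected
#    by the plain 'utf-8' codec.
s = b'\x80'.decode('utf-8', **ESC)          # the lone surrogate U+DC80
try:
    s.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 2. Ordering: comparing the strings and comparing their encoded
#    bytes give opposite answers.
str_order = '\udc80' > 'ä'
bytes_order = '\udc80'.encode('utf-8', **ESC) > 'ä'.encode('utf-8', **ESC)

# 3. Not bijective: surrogates outside U+DC80..U+DCFF have no byte form.
def encodable(ch: str) -> bool:
    """Return True if ch can be encoded with utf-8 + surrogateescape."""
    try:
        ch.encode('utf-8', **ESC)
        return True
    except UnicodeEncodeError:
        return False

unencodable = [hex(cp) for cp in (0xD800, 0xDC7F, 0xDD00, 0xDFFF)
               if not encodable(chr(cp))]

print(strict_ok, str_order, bytes_order, unencodable)
```

So the handler gives an injection from bytes into str, but plenty of str values still have no bytes counterpart.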

