STINNER Victor added the comment:
I tested utf_16_32_surrogates_4.patch: surrogateescape does not work as
expected when decoding.
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'ignore')
'[]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'replace')
'[�]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'surrogateescape')
'[\udc80\udcdc\uffff'
=> I expected '[\udc80\udcdc]'.
Encoding with surrogateescape does not work either:
>>> '[\uDC80]'.encode('utf-16-le', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-le' codec can't encode character '\udc80' in position 1: surrogates not allowed
Following PEP 383, I expect data.decode(encoding, 'surrogateescape') to never
fail, and data.decode(encoding, 'surrogateescape').encode(encoding,
'surrogateescape') to give back data.
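
For reference, the round-trip property described above can be checked with a
small helper. This is only a sketch: the function name roundtrips and the
sample byte strings are made up for illustration, and it uses the utf-8 and
ascii codecs, which already support surrogateescape, rather than the patched
UTF-16/32 codecs under test.

```python
def roundtrips(data: bytes, encoding: str) -> bool:
    """Return True if decode + encode with surrogateescape restores data."""
    text = data.decode(encoding, "surrogateescape")
    return text.encode(encoding, "surrogateescape") == data


# Hypothetical samples: valid text, invalid bytes, and every byte value.
samples = [b"abc", b"\xff\xfe", b"caf\xe9", bytes(range(256))]
results = [roundtrips(sample, "utf-8") for sample in samples]
print(results)
```

The same helper could be pointed at 'utf-16-le' to see whether a given build
satisfies the property for that codec.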
--
With UTF-16, there is a corner case:
>>> b'[\x00\x00'.decode('utf-16-le', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/haypo/prog/python/default/Lib/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x00 in position 2: truncated data
>>> b'[\x00\x80'.decode('utf-16-le', 'surrogateescape')
'[\udc80'
The incomplete sequence b'\x00' raises a decoder error, whereas b'\x80' does
not. Should we extend PEP 383 to bytes in the range [0; 127]? Or should we
keep this behaviour?
Sorry, this question is unrelated to this issue.
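
The difference between the two trailing bytes seems to come from the error
handler itself rather than from the UTF-16 decoder. The sketch below calls the
registered handler directly on a hand-built UnicodeDecodeError, the way a
decoder would. The helper name try_escape is hypothetical, and the behaviour
shown is that of CPython's built-in surrogateescape handler as I understand
it: it escapes bytes in 0x80..0xFF and re-raises for bytes below 0x80.

```python
import codecs

handler = codecs.lookup_error("surrogateescape")


def try_escape(data: bytes, start: int, end: int):
    """Call the surrogateescape handler on data[start:end] like a decoder."""
    exc = UnicodeDecodeError("utf-16-le", data, start, end, "truncated data")
    try:
        # On success: (replacement string, position where decoding resumes).
        return handler(exc)
    except UnicodeDecodeError:
        # The handler refused to escape the byte and re-raised the error.
        return None


high = try_escape(b"[\x00\x80", 2, 3)  # trailing byte 0x80
low = try_escape(b"[\x00\x00", 2, 3)   # trailing byte 0x00
print(high, low)
```

Here the trailing 0x80 is escaped to a lone surrogate while the trailing 0x00
is refused, which matches the two results in the session above.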
----------
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue12892>
_______________________________________