Antoine Pitrou <pit...@free.fr> added the comment:

> I ran tests of utf16_error_handling-3.2_4.patch on Python 3.1. Two tests are
> failing:
> - b'\x00\xd8'.decode('utf-16le', 'replace')='\ufffd\ufffd' != '\ufffd'
> - b'\xd8\x00'.decode('utf-16be', 'replace')='\ufffd\ufffd' != '\ufffd'
>
> I don't think that the test is correct: UTF-16 should resynchronize as
> early as possible (ignore the first invalid byte and restart at the
> following byte), so '\ufffd\ufffd' is the correct answer.

UTF-16 units are 16-bit words, not bytes, so '\ufffd' sounds correct to me.
You resynchronize on the word boundary: the invalid word is skipped.

> - with UTF-8 decoder: (b'\xC3' +
> '\xe9'.encode('utf-8')).decode('utf-8', 'replace') returns
> '\ufffd\xe9'

That's because UTF-8 operates on bytes: the invalid byte is skipped.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue14579>
_______________________________________
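
For illustration, the word-boundary resynchronization argued for in the comment can be sketched as below. This is a minimal sketch, not the code from the patch: decode_utf16le_replace is a hypothetical helper written only for this example, and it assumes the simplest policy (one U+FFFD per invalid 16-bit unit, then resume at the next unit). It may not match CPython's decoder on every malformed input. The UTF-8 call uses the built-in codec and the same input quoted in the comment.

```python
import struct

REPLACEMENT = '\ufffd'


def decode_utf16le_replace(data: bytes) -> str:
    """Hypothetical UTF-16-LE decoder that resynchronizes on 16-bit units."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        if n - i < 2:
            # Trailing odd byte: a truncated unit, replaced once.
            out.append(REPLACEMENT)
            break
        (unit,) = struct.unpack_from('<H', data, i)
        i += 2
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: valid only if the next unit is a low surrogate.
            if n - i >= 2:
                (low,) = struct.unpack_from('<H', data, i)
                if 0xDC00 <= low <= 0xDFFF:
                    i += 2
                    code = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                    out.append(chr(code))
                    continue
            # Lone high surrogate: skip the whole word, not a single byte.
            out.append(REPLACEMENT)
        elif 0xDC00 <= unit <= 0xDFFF:
            # Lone low surrogate: also skipped as one word.
            out.append(REPLACEMENT)
        else:
            out.append(chr(unit))
    return ''.join(out)


# UTF-16: the invalid 16-bit word is skipped as a unit.
lone_surrogate = decode_utf16le_replace(b'\x00\xd8')
surrogate_then_a = decode_utf16le_replace(b'\x00\xd8A\x00')

# UTF-8: the built-in decoder skips only the invalid byte b'\xC3',
# then decodes the following two-byte sequence normally.
utf8_result = (b'\xC3' + '\xe9'.encode('utf-8')).decode('utf-8', 'replace')
```

Under this policy lone_surrogate holds a single replacement character, while utf8_result keeps the valid '\xe9' that follows the bad byte.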