Kang-Hao (Kenny) Lu <kennyl...@csail.mit.edu> added the comment:

Attached patch does the following beyond what the patch from haypo does:
* call the error handler
* reject 0xd800~0xdfff when decoding utf-32
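
To make the two points above concrete, here is a rough pure-Python sketch of the intended behaviour. It is not the C code in the patch: decode_utf32_le is a hypothetical helper written only for illustration, and it assumes little-endian input without a BOM. It rejects code points in 0xd800~0xdfff (and anything above 0x10ffff) and hands each bad unit to whatever handler codecs.lookup_error() returns for the requested error mode.

```python
import codecs
import struct


def decode_utf32_le(data: bytes, errors: str = "strict") -> str:
    """Hypothetical UTF-32-LE decoder, for illustration only."""
    handler = codecs.lookup_error(errors)  # e.g. strict, ignore, replace
    out = []
    pos = 0
    while pos + 4 <= len(data):
        (cp,) = struct.unpack_from("<I", data, pos)
        if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            # Surrogate or out-of-range value: let the error handler decide.
            exc = UnicodeDecodeError(
                "utf-32-le", data, pos, pos + 4,
                "surrogate or out-of-range code point",
            )
            replacement, pos = handler(exc)  # 'strict' raises here
            out.append(replacement)
            continue
        out.append(chr(cp))
        pos += 4
    if pos < len(data):
        # Trailing bytes that do not form a full 4-byte unit.
        exc = UnicodeDecodeError(
            "utf-32-le", data, pos, len(data), "truncated data"
        )
        replacement, pos = handler(exc)
        out.append(replacement)
    return "".join(out)


sample = struct.pack("<3I", 0x41, 0xD800, 0x42)  # "A", lone surrogate, "B"
print(ascii(decode_utf32_le(sample, "replace")))
print(ascii(decode_utf32_le(sample, "ignore")))
```

With 'strict' the same input raises UnicodeDecodeError, because the built-in strict handler re-raises the exception it is given.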
The following items are on my TODO list, although this patch doesn't depend on any of them and can be reviewed and landed separately:
* make the surrogatepass error handler work for utf-16 and utf-32. (I should be able to finish this by today)
* fix an error in the error handler for utf-16-le. (In Python 3.2, b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" instead of "A" for some reason)
* make unicode_encode_call_errorhandler return bytes so that we can simplify this patch. (This arguably belongs to a separate bug, so I'll file it when needed)

> All UTF codecs should reject lone surrogates in strict error mode,

Should we really reject lone surrogates for UTF-7? There's a test in test_codecs.py that checks that "\udc80" is encoded as b"+3IA-". Given that UTF-7 is not really part of the Unicode Standard and is more of a "data encoding" than a "text encoding" to me, I am not sure it's a good idea.

> but let them pass using the surrogatepass error handler (the UTF-8
> codec already does) and apply the usual error handling for ignore
> and replace.

For 'replace', the patch now emits b"\x00?" instead of b"?" so that the UTF-16 stream doesn't get corrupted. This is not "usual" and does not match

# Implements the ``replace`` error handling: malformed data is replaced
# with a suitable replacement character such as ``'?'`` in bytestrings
# and ``'\ufffd'`` in Unicode strings.

in the documentation. What do we do? Are there other encodings that Python supports, besides UTF-7, UTF-16 and UTF-32, that are not ASCII compatible? I think it would be better to use an encoded U+fffd whenever possible and fall back to '?'. What do you think?

Some other comments on my own patch:
* In the STORECHAR macro for utf-16 and utf-32, I changed all instances of "ch & 0xFF" to (unsigned char) ch. I don't have enough C knowledge to know whether this is actually better or whether it makes any difference at all.
* The code for utf-16 and utf-32 is a duplicate of the utf-8 code.
  That one's complexity comes from issue #8092. I'm not sure whether there are ways to simplify these. For example, are there suitable functions so that we don't need to check for integer overflow at these places?

----------
nosy: +kennyluck
Added file: http://bugs.python.org/file24368/utf-16&32_reject_surrogates.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12892>
_______________________________________