[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

Kang-Hao (Kenny) Lu Sun, 29 Jan 2012 23:30:08 -0800

Kang-Hao (Kenny) Lu <[email protected]> added the comment:

Attached patch does the following beyond what the patch from haypo does:
  * call the error handler
  * reject 0xd800~0xdfff when decoding utf-32
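
A minimal sketch of the behaviour these two points aim for. It assumes an interpreter in which this change has landed (CPython 3.4 or later, per the codecs documentation); the helper name is_rejected is made up for illustration and is not part of the patch.

```python
def is_rejected(func, *args):
    """Return True if calling func(*args) raises a UnicodeError."""
    try:
        func(*args)
    except UnicodeError:
        return True
    return False


# U+D800 laid out as a single little-endian UTF-32 code unit.
lone_surrogate_utf32 = b"\x00\xd8\x00\x00"

# Strict mode: the decoder and the encoder should both refuse the surrogate.
decode_rejected = is_rejected(lone_surrogate_utf32.decode, "utf-32-le")
encode_rejected = is_rejected("\udc80".encode, "utf-16-le")

# With an error handler installed, the bad unit is handed to the handler
# and decoding resumes with the next code unit.
ignored = (lone_surrogate_utf32 + "A".encode("utf-32-le")).decode(
    "utf-32-le", "ignore"
)

print(decode_rejected, encode_rejected, repr(ignored))
```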


The following are on my TODO list, although this patch doesn't depend on any 
of these and can be reviewed and landed separately:
  * make the surrogatepass error handler work for utf-16 and utf-32. (I should 
be able to finish this by today)
  * fix an error in the error handler for utf-16-be. (In Python 3.2, 
b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" instead of "A" 
for some reason.)
  * make unicode_encode_call_errorhandler return bytes so that we can simplify 
this patch. (This arguably belongs to a separate bug so I'll file it when 
needed)
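
A short sketch of what the first two TODO items would look like once done, again assuming an interpreter where surrogatepass is supported for utf-16 and utf-32 (CPython 3.4 or later, per the codecs documentation). The variable names are only for illustration.

```python
lone = "\udc80"

# surrogatepass should let a lone surrogate round-trip through both codecs.
round_trips = {}
for enc in ("utf-16-be", "utf-32-be"):
    raw = lone.encode(enc, "surrogatepass")
    round_trips[enc] = (raw, raw.decode(enc, "surrogatepass") == lone)

# 'ignore' should drop only the two bytes of the lone surrogate and then
# decode the following code unit 0x0041 normally.
ignored = b"\xdc\x80\x00\x41".decode("utf-16-be", "ignore")

# ascii() avoids printing a raw surrogate to stdout.
print(round_trips, ascii(ignored))
```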

> All UTF codecs should reject lone surrogates in strict error mode,

Should we really reject lone surrogates for UTF-7? There's a test in 
test_codecs.py that checks that "\udc80" is encoded as b"+3IA-". Given that UTF-7 
is not really part of the Unicode Standard and it is more like a "data 
encoding" than a "text encoding" to me, I am not sure it's a good idea.
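
To illustrate the "data encoding" point: the UTF-7 output for a lone surrogate is just the base64 form of its two UTF-16 bytes, wrapped in "+" and "-". This is a sketch of that observation, assuming the UTF-7 encoder still accepts lone surrogates as the test mentioned above expects.

```python
import base64

# UTF-7 encodes U+DC80 as a base64 run of its big-endian UTF-16 bytes.
utf7 = "\udc80".encode("utf-7")

# The same two bytes run through plain base64, with the padding removed.
payload = base64.b64encode(b"\xdc\x80").rstrip(b"=")

print(utf7, payload)
```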

> but let them pass using the surrogatepass error handler (the UTF-8
> codec already does) and apply the usual error handling for ignore
> and replace.

For 'replace', the patch now emits b"\x00?" instead of b"?" so that the UTF-16 
stream doesn't get corrupted. This is not "usual" and does not match

  # Implements the ``replace`` error handling: malformed data is replaced
  # with a suitable replacement character such as ``'?'`` in bytestrings 
  # and ``'\ufffd'`` in Unicode strings.

in the documentation. What do we do? Are there other encodings that Python 
supports, besides UTF-7, UTF-16 and UTF-32, that are not ASCII-compatible? I 
think it would be better to use the encoded U+FFFD whenever possible and fall 
back to '?'. What do you think?
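
A sketch of the two options for 'replace', assuming an interpreter where the encoders call the error handler for surrogates (CPython 3.4 or later, per the codecs documentation). It checks that the replacement is an encoded "?" occupying a whole code unit, and shows what an encoded U+FFFD would look like for comparison; the dictionary names are only for illustration.

```python
# Width in bytes of one code unit for each codec under discussion.
unit_size = {"utf-16-le": 2, "utf-16-be": 2, "utf-32-le": 4, "utf-32-be": 4}

replaced = {}
for enc in unit_size:
    # The lone surrogate is replaced; the brackets show the stream stays aligned.
    replaced[enc] = "[\udc80]".encode(enc, "replace")

# The alternative raised above: U+FFFD encoded in the target encoding.
fffd_utf16_be = "\ufffd".encode("utf-16-be")

for enc, data in replaced.items():
    print(enc, data)
print(fffd_utf16_be)
```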

Some other comments on my own patch:
  * In the STORECHAR macro for utf-16 and utf-32, I changed all instances of 
"ch & 0xFF" to "(unsigned char) ch". I don't have enough C knowledge to know 
whether this is actually better or whether it makes any difference at all.
  * The code for utf-16 and utf-32 duplicates the utf-8 code, whose 
complexity comes from issue #8092. I'm not sure whether there are ways to 
simplify these. For example, are there suitable functions so that we don't need 
to check for integer overflow at these places?

----------
nosy: +kennyluck
Added file: http://bugs.python.org/file24368/utf-16&32_reject_surrogates.patch

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue12892>
_______________________________________