[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

Serhiy Storchaka Mon, 10 Aug 2015 23:05:09 -0700

Serhiy Storchaka added the comment:

There are two reasons:


1. UTF-16 and UTF-32 are based on 2- and 4-byte units. If the surrogateescape
error handler supported UTF-16 and UTF-32, encoding could produce data that
can't be decoded back correctly. For example (in UTF-16-LE) '\udcac \udcac' ->
b'\xac\x20\x00\xac' -> '\u20ac\uac00' == '€가'.

2. ASCII bytes (0x00-0x7F) can't be escaped with surrogateescape. Ill-formed
UTF-16 and UTF-32 data can contain ASCII bytes (b'\xD8\x00' in UTF-16-BE,
b'abcd' in UTF-32). For the same reason surrogateescape is not compatible
with UTF-7 and CP037.
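
Both points can be checked with a small sketch. The helper
naive_encode_utf16le below is hypothetical: it only imitates what a
UTF-16-LE encoder would do if it accepted the single bytes that
surrogateescape returns, and it is not how CPython's codec behaves. The
helper is_rejected is likewise just an illustration that uses strict
decoding.

```python
def naive_encode_utf16le(text):
    """Hypothetical encoder: UTF-16-LE plus surrogateescape-style unescaping."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # surrogateescape maps U+DC80..U+DCFF back to single bytes 0x80..0xFF
            out.append(cp - 0xDC00)
        else:
            out += ch.encode('utf-16-le')
    return bytes(out)


# Point 1: the escaped single bytes break the 2-byte alignment.
original = '\udcac \udcac'
data = naive_encode_utf16le(original)
roundtrip = data.decode('utf-16-le')
print(data)                   # 4 bytes, which decode as two valid code units
print(ascii(roundtrip))       # not the original string
print(roundtrip == original)  # False

# Point 2: surrogateescape only covers bytes 0x80..0xFF.
escaped = b'\x80\xff'.decode('ascii', 'surrogateescape')
print(ascii(escaped))


def is_rejected(raw, encoding):
    """Return True if strict decoding of raw fails."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return True
    return False


# Ill-formed input whose bytes are partly or entirely in the ASCII range,
# so there is no surrogate to escape them to.
for raw, encoding in [(b'\xd8\x00', 'utf-16-be'), (b'abcd', 'utf-32-be')]:
    ascii_bytes = [b for b in raw if b < 0x80]
    print(encoding, is_rejected(raw, encoding), ascii_bytes)
```

In the first part the round trip silently yields different text instead of
an error, which is the data-corruption risk described in point 1.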

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12892>
_______________________________________