[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

STINNER Victor Sat, 07 May 2011 02:22:09 -0700

STINNER Victor <victor.stin...@haypocalc.com> added the comment:

_codecs_cn implements different multibyte encodings: gb2312, gbkext, gbcommon, 
gb18030ext, gbk, gb18030.


And there are other Asian multibyte encodings: the Big5 family (Big5, CP950, 
...), the ISO 2022 family, the JIS family, Korean encodings (KSX1001, EUC_KR, 
CP949, ...).

All of them drop every byte of a multibyte sequence if one byte of that 
sequence is invalid (for example 0xFF 0x0A is decoded as ? instead of ?\n with 
the replace error handler).
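
To check the behaviour described above on a given interpreter, a small sketch 
like the following decodes the same two bytes with several of these codecs. 
The codec selection and the names CODECS and decode_all are only illustrative 
choices, and the result depends on the Python version, so no output is shown.

```python
# Illustrative sketch: CODECS and decode_all are names chosen for this example.
CODECS = ["gb2312", "gbk", "gb18030", "big5", "euc_kr", "euc_jp"]


def decode_all(data, errors):
    """Decode data with every codec in CODECS using the given error handler."""
    return dict((name, data.decode(name, errors)) for name in CODECS)


if __name__ == "__main__":
    for errors in ("replace", "ignore"):
        for name, text in sorted(decode_all(b"\xff\n", errors).items()):
            # repr() makes it visible whether the trailing newline survived
            print("%-8s %-8s %r" % (name, errors, text))
```

If the newline is missing from a result, that codec swallowed the valid byte 
0x0A together with the invalid byte 0xFF.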

I don't think that you can/should patch only one encoding: we should use the 
same rule for all encodings.

By the way, do you have any document explaining which result is the correct 
one (? or ?\n)? For UTF-8, we have well-defined standards explaining exactly 
what to do with invalid byte sequences => see issue #8271. It is easy to fix 
the decoders, but I would like to be sure that your proposed change is the 
right way to decode these encodings.
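
For comparison, here is a minimal sketch of what the UTF-8 decoder does with 
the same input. The byte 0xFF can never appear in well-formed UTF-8, so only 
that byte is replaced or ignored and the following newline is kept. The 
variable names are only illustrative.

```python
data = b"\xff\n"

# Only the invalid byte 0xFF is affected; 0x0A is decoded normally.
replaced = data.decode("utf-8", "replace")
ignored = data.decode("utf-8", "ignore")

print(repr(replaced))
print(repr(ignored))
```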

Changing the multibyte decoders can also have security implications. See for 
example the section "Check byte strings before decoding them to character 
strings" of my book:
http://www.haypocalc.com/tmp/unicode-2011-03-25/html/issues.html#check-byte-strings-before-decoding-them-to-character-strings
(https://github.com/haypo/unicode_book/wiki)
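
One way to read that advice, as a rough sketch only: decode with the strict 
error handler first and reject the input if it is invalid, instead of relying 
on how ignore or replace treats the bytes around an invalid one. The helper 
name decode_checked is hypothetical and is not taken from the book.

```python
def decode_checked(data, encoding):
    """Return the decoded text, or None if data is not valid in encoding.

    Hypothetical helper: it refuses invalid input instead of silently
    dropping or replacing bytes.
    """
    try:
        return data.decode(encoding, "strict")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    print(repr(decode_checked(b"\xff\n", "gb2312")))  # invalid input
    print(repr(decode_checked(b"abc\n", "gb2312")))   # plain ASCII is valid
```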

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12016>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
