New submission from Ma Lin:

Hi,
There is a small bug in the GB18030 decoder.

For a 4-byte sequence, the legal ranges are:
0x81-0xFE for the 1st byte
0x30-0x39 for the 2nd byte
0x81-0xFE for the 3rd byte
0x30-0x39 for the 4th byte

The current code does not check the upper bound 0xFE of the 1st and 3rd bytes.
Therefore, 8630 illegal 4-byte sequences can be decoded by the GB18030 codec. Here is an example:

# legal sequence 0x81318130 is decoded to U+060A, it's fine.
b = bytes([0x81, 0x31, 0x81, 0x30])
uchar = b.decode('gb18030')
print(ord(uchar))

# illegal sequence 0x8130FF30 can be decoded to U+060A as well.
b = bytes([0x81, 0x30, 0xFF, 0x30])
uchar = b.decode('gb18030')
print(ord(uchar))

----------
components: Unicode
files: forpy27.patch
keywords: patch
messages: 242457
nosy: Ma Lin, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: A small bug in GB18030 decoder.
versions: Python 2.7, Python 3.4, Python 3.5
Added file: http://bugs.python.org/file39277/forpy27.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue24117>
_______________________________________
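
For illustration, the range check described in the report can be sketched in pure Python. This is only a sketch, not the code in forpy27.patch: the function names is_legal_four_byte and linear_index are hypothetical, and the linear offset formula is assumed to be the usual GB18030 one (10 * 126 * 10 sequences per lead byte). It shows why the illegal sequence collides with the legal one when the upper bound is not checked.

```python
def is_legal_four_byte(seq: bytes) -> bool:
    """Check a sequence against the four byte ranges listed in the report."""
    if len(seq) != 4:
        return False
    b1, b2, b3, b4 = seq
    return (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39)


def linear_index(seq: bytes) -> int:
    """Linear offset of a 4-byte sequence, computed with no range checks.

    Assumed formula: each 2nd/4th byte has 10 values and each 3rd byte has
    126 values, so an out-of-range 3rd byte spills into the next 2nd byte.
    """
    b1, b2, b3, b4 = seq
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


legal = bytes([0x81, 0x31, 0x81, 0x30])
illegal = bytes([0x81, 0x30, 0xFF, 0x30])

print(is_legal_four_byte(legal))     # True
print(is_legal_four_byte(illegal))   # False: 3rd byte 0xFF is above 0xFE
print(linear_index(legal), linear_index(illegal))  # same offset for both
```

Both sequences yield the same linear offset because 0xFF - 0x81 = 126, which is exactly one full step of the 2nd byte. A decoder that only checks the lower bound 0x81 therefore maps the illegal bytes onto the same character as the legal ones.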