[issue24117] A small bug in GB18030 decoder.

Ma Lin Sun, 03 May 2015 01:17:54 -0700

New submission from Ma Lin:

Hi,


There is a small bug in the GB18030 decoder.

For a 4-byte sequence, the legal ranges are:
0x81-0xFE for the 1st byte
0x30-0x39 for the 2nd byte
0x81-0xFE for the 3rd byte
0x30-0x39 for the 4th byte
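
The ranges above can be sketched as a standalone check. This is only an illustration, assuming Python 3 (where iterating over bytes yields integers); the name is_legal_four_byte is hypothetical and is not part of the codec. It tests byte ranges only, not whether the sequence maps to an assigned code point.

```python
def is_legal_four_byte(seq):
    """Return True if seq falls inside the GB18030 4-byte ranges.

    Hypothetical helper for illustration; checks structure only.
    """
    if len(seq) != 4:
        return False
    b1, b2, b3, b4 = seq
    return (0x81 <= b1 <= 0xFE and   # 1st byte
            0x30 <= b2 <= 0x39 and   # 2nd byte
            0x81 <= b3 <= 0xFE and   # 3rd byte
            0x30 <= b4 <= 0x39)      # 4th byte


print(is_legal_four_byte(bytes([0x81, 0x31, 0x81, 0x30])))  # True
print(is_legal_four_byte(bytes([0x81, 0x30, 0xFF, 0x30])))  # False
```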

The current code does not check the upper bound (0xFE) of the 1st and 3rd
bytes, so 0xFF is accepted in those positions. As a result, 8630 illegal
4-byte sequences can be decoded by the GB18030 codec. Here is an example:

# The legal sequence 0x81318130 is decoded to U+060A, which is fine.
b = bytes([0x81, 0x31, 0x81, 0x30])
uchar = b.decode('gb18030')
print(ord(uchar))

# The illegal sequence 0x8130FF30 is decoded to U+060A as well.
b = bytes([0x81, 0x30, 0xFF, 0x30])
uchar = b.decode('gb18030')
print(ord(uchar))
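
One way to see why the two sequences land on the same character is to compute the linear index of each. The sketch below assumes the decoder reduces a 4-byte sequence to an index using the customary GB18030 weights (12600, 1260, 10, 1); that is an assumption about the implementation, not something stated in this report, and the name linear_index is hypothetical.

```python
def linear_index(seq):
    """Map a 4-byte sequence to a linear index (assumed GB18030 weights)."""
    b1, b2, b3, b4 = seq
    return ((b1 - 0x81) * 12600 +
            (b2 - 0x30) * 1260 +
            (b3 - 0x81) * 10 +
            (b4 - 0x30))


legal = bytes([0x81, 0x31, 0x81, 0x30])
illegal = bytes([0x81, 0x30, 0xFF, 0x30])

# Without an upper-bound check, a 3rd byte of 0xFF contributes
# (0xFF - 0x81) * 10 = 1260, the same as adding one to the 2nd byte.
print(linear_index(legal))    # 1260
print(linear_index(illegal))  # 1260
```

With the upper bound enforced on the 1st and 3rd bytes, the second sequence would be rejected before any index is computed.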

----------
components: Unicode
files: forpy27.patch
keywords: patch
messages: 242457
nosy: Ma Lin, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: A small bug in GB18030 decoder.
versions: Python 2.7, Python 3.4, Python 3.5
Added file: http://bugs.python.org/file39277/forpy27.patch

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue24117>
_______________________________________