Ezio Melotti <ezio.melo...@gmail.com> added the comment:

The patch turned out to be less trivial than I initially thought.

The current algorithm checks for invalid continuation bytes in 4 places:
1) before the switch/case statement in Objects/unicodeobject.c, when it checks 
whether there are enough bytes in the string (e.g. if the current byte is the 
start byte of a 4-byte sequence and there are only 2 more bytes in the string, 
the sequence is invalid);
2) in the "case 2" of the switch, where it checks whether the second byte is a 
valid continuation byte;
3) in the "case 3" of the switch, where it checks whether the second and third 
bytes are valid continuation bytes, including the additional invalid ranges for 
the second byte;
4) in the "case 4" of the switch, where it checks whether the second, third, 
and fourth bytes are valid continuation bytes, including the additional invalid 
ranges for the second byte.

The correct algorithm should determine the maximal valid subpart of the 
sequence by finding the position of the first invalid continuation byte. 
Continuation bytes are all in the range 80..BF, except for the second byte of 
3-byte sequences that start with E0 or ED and the second byte of 4-byte 
sequences that start with F0 or F4 (the 3rd and 4th continuation bytes are 
always in the range 80..BF).
This means that the above 4 cases should be changed in this way:
1) if there aren't enough bytes left to complete the sequence, check for valid 
continuation bytes considering the special cases for the second byte (E0, ED, 
F0, F4), instead of using the naive check that only tests for continuation 
bytes in the range 80..BF;
2) the "case 2" is fine as is, because the second byte is always in range 
80..BF;
3) the "case 3" should check (pseudocode):
  if (second_byte_is_not_valid) max_subpart_len = 1
  else if (third_byte not in 80..BF) max_subpart_len = 2
  else  # the sequence is valid
the "second_byte_is_not_valid" part should consider the two special cases for 
E0 and ED.
4) the "case 4" should check (pseudocode):
  if (second_byte_is_not_valid) max_subpart_len = 1
  else if (third_byte not in 80..BF) max_subpart_len = 2
  else if (fourth_byte not in 80..BF) max_subpart_len = 3
  else  # the sequence is valid
here the "second_byte_is_not_valid" part should consider the two special cases 
for F0 and F4.

In order to avoid code duplication I was thinking of adding 2 macros (something 
like IS_VALID_3_SEQ_2ND_BYTE and IS_VALID_4_SEQ_2ND_BYTE) to be used in cases 
1) and 3), and 1) and 4) respectively.
The change shouldn't affect the decoding speed, but it will increase the length 
and complexity of the function.
Is this OK?

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8271>
_______________________________________