[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Ezio Melotti Thu, 08 Sep 2011 01:32:53 -0700

Ezio Melotti <[email protected]> added the comment:

So to summarize a bit, there are different possible levels of strictness:
  1) all the possible encodable values, including the ones >10FFFF;
  2) values in range 0..10FFFF;
  3) values in range 0..10FFFF except surrogates (aka scalar values);
  4) values in range 0..10FFFF except surrogates and noncharacters;


and this is what is currently available in Python:
  1) not available, and probably never will be;
  2) available through the 'surrogatepass' error handler;
  3) default behavior (i.e. with the 'strict' error handler);
  4) currently not available.

(note: this refers to the utf-8 codec in Python 3, but it should be true for 
the utf-16/32 codecs too once #12892 is fixed.  This whole message refers to 
codecs only and what they should (dis)allow.  What we use internally seems to 
work fine and doesn't need to be changed.)
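
A small sketch of levels 2 to 4 as they behave with the utf-8 codec in Python 3. The variable names are only illustrative; the error handler names are the ones mentioned above.

```python
# Sample code points: a lone surrogate and a noncharacter.
lone_surrogate = "\ud800"   # a surrogate, so not a scalar value
noncharacter = "\ufffe"     # a noncharacter, but still a scalar value

# Level 3 (default 'strict' handler): surrogates are rejected.
try:
    lone_surrogate.encode("utf-8")
    strict_rejects_surrogate = False
except UnicodeEncodeError:
    strict_rejects_surrogate = True

# Level 2 ('surrogatepass' handler): surrogates are let through.
passed = lone_surrogate.encode("utf-8", "surrogatepass")
roundtrip = passed.decode("utf-8", "surrogatepass")

# Level 4 is not available: 'strict' happily encodes noncharacters.
encoded_nonchar = noncharacter.encode("utf-8")
```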

Now, assume that we don't care about option 1 and want to implement the missing 
option 4 (which I'm still not 100% sure about).  The possible options are:
  * add a new codec (actually one for each UTF encoding);
  * add a new error handler that explicitly disallows noncharacters;
  * change the meaning of 'strict' to match option 4;

This depends on what should be the default behavior while dealing with 
noncharacters.  If they are rejected by default, then 'strict' should reject 
them.  However this would leave us without option 3 (something to encode all 
and only scalar values), and surrogatepass will be misnamed if it also allows 
noncharacters (and if it doesn't we will end up without option 2 too).  This is 
apparently what Perl does:
> Perl will never ever produce nor accept one of the 66 noncharacters
> on any stream marked as one of the 7 character encoding schemes. 


Implementation-wise, I think the 'strict' error handler should be the strictest 
one, because the codec must detect all the "problematic" chars and send them 
to the error handler that might then decide what to do with them.  I.e. if the 
codec detects noncharacters, sends them to the error handler, and the error 
handler is strict, an error will be raised; if it doesn't detect them, the 
error handler won't be able to do anything with them.  
Another option is to provide another codec that specifically detects them, but 
this means re-implementing a slightly different version of each codec (or 
possibly add an extra argument to the PyUnicode_{Encode,Decode}UTF* functions).
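
To make option 4 concrete, here is a rough sketch of what "detecting noncharacters" means. It is not a codec or an error handler: it just pre-checks the string before calling the normal strict codec, and both function names (is_noncharacter, encode_rejecting_noncharacters) are made up for this example.

```python
def is_noncharacter(cp):
    """Return True if code point cp is one of the 66 Unicode noncharacters."""
    # U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes
    # (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE)


def encode_rejecting_noncharacters(text, encoding="utf-8"):
    """Hypothetical helper: strict encoding that also refuses noncharacters."""
    for index, ch in enumerate(text):
        if is_noncharacter(ord(ch)):
            raise UnicodeEncodeError(
                encoding, text, index, index + 1, "noncharacter not allowed")
    # Surrogates are still rejected here by the regular 'strict' handler.
    return text.encode(encoding, "strict")


# Sanity check on the definition: count noncharacters in 0..10FFFF.
NONCHARACTER_COUNT = sum(is_noncharacter(cp) for cp in range(0x110000))
```

A real implementation would have to do this check inside the codec itself, so that the offending range can be handed to whatever error handler is in use.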

We could also decide to leave the handling of noncharacters as it is -- after 
all the Unicode standard doesn't seem to explicitly forbid them as it does with 
e.g. surrogates.


> We have a flavor of non-strict utf8, spelled "utf8" instead of "UTF-8",
> that can produce and accept illegal characters, although by default it
> is still going to generate a warning

How did Perl implement this?  With two (or more) slightly different versions of 
the same codec?
And how does Perl handle errors?  With a global option that turns (possibly 
specific) warnings into errors (like python -We)?

Python has different codecs that encode/decode str/bytes and whenever they find 
a "problematic" char they send it to the error handler that might decide to 
raise an error, remove the char, replace it with something else, send it 
back unchanged, generate a warning, and so on.  In this way you can have 
different combinations of codecs and error handlers to get the desired 
behaviors.  (and FWIW in Python 'utf8' is an alias for 'UTF-8'.)
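
For illustration, this is roughly how the codec/error handler combination looks from the user's side. The built-in handlers and codecs.register_error are real; the "mark" handler and its name are invented for this example.

```python
import codecs

text = "a\u20acb"   # the euro sign cannot be encoded as ASCII

# Same codec, different built-in error handlers.
ignored = text.encode("ascii", "ignore")
replaced = text.encode("ascii", "replace")
try:
    text.encode("ascii", "strict")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True


def mark_handler(exc):
    """Hypothetical handler: replace each bad char with a <U+XXXX> marker."""
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        replacement = "".join("<U+%04X>" % ord(ch) for ch in bad)
        # Return the replacement and the position where encoding resumes.
        return replacement, exc.end
    raise exc


codecs.register_error("mark", mark_handler)
marked = text.encode("ascii", "mark")

# 'utf8' and 'UTF-8' resolve to the same codec.
same_codec = codecs.lookup("utf8").name == codecs.lookup("UTF-8").name
```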

> I think a big problem here is that the Python culture doesn't use stream
> encodings enough.  People are always making their own repeated and tedious
> calls to encode and then sending stuff out a byte stream, by which time it
> is too late to check.
> [...]
> Anything that deals with streams should have an encoding argument.  But
> often/many? things in Python don't.
Several objects have .encoding and .errors attributes (e.g. sys.stdin/out), 
and they are used to encode/decode the text/bytes sent/read to/from them.  In 
other places we prefer the "explicit is better than implicit" approach and 
require the user (or some other higher-level layer) to encode/decode manually.

I'm not sure why you are saying that it's too late to check, and since the 
encoding/decoding happens only in a few places I don't think it's tedious at 
all (and often it's automatic too).
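
As a sketch of the "automatic" case: a text stream created with an encoding and an error handler does the encoding on write, so a bad character is caught at that point. io.BytesIO stands in for a real byte stream here, and the variable names are only illustrative.

```python
import io

raw = io.BytesIO()   # stand-in for a socket or a file opened in binary mode
stream = io.TextIOWrapper(raw, encoding="utf-8", errors="strict")

# Text written to the stream is encoded for us.
stream.write("caf\u00e9")
stream.flush()
written = raw.getvalue()

# The stream's error handler applies at write time.
try:
    stream.write("\ud800")   # lone surrogate, rejected by 'strict'
    write_raised = False
except UnicodeEncodeError:
    write_raised = True

stream_settings = (stream.encoding, stream.errors)
```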

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue12729>
_______________________________________