Tom Christiansen <tchr...@perl.com> added the comment:

>> Here's why I say that Python uses UTF-16 not UCS-2 on its narrow builds.
>> Perhaps someone could tell me why the Python documentation says it uses
>> UCS-2 on a narrow build.
> There's a disagreement on that point between several developers.
> See an example sub-thread at:
> http://mail.python.org/pipermail/python-dev/2010-November/105751.html

Some of those folks know what they're talking about, and some do not. Most of the postings miss the mark.

Python uses UTF-16 for its narrow builds. It does not use UCS-2.

The argument that it must be UCS-2 because it can store lone surrogates in memory is spurious. You have to read The Unicode Standard very *very* closely, but it is not necessary that all internal buffers always be in well-formed UTF-whatever. Otherwise it would be impossible to append a code unit at a time to a buffer. I could pull out the reference if I worked at it, because I've had to find it before. It's in there. Trust me. I know.

It is also spurious to pretend that because you can produce illegal output when telling it to generate something in UTF-16, it is somehow not using UTF-16. You have simply made a mistake. You have generated something that you promised you would not generate. I have more to say about this below.

Finally, it is spurious to argue against UTF-16 because of the code unit interface. Java does exactly the same thing as Python does *in all regards* here, and no one pretends that Java is UCS-2. Both are UTF-16. It is simply a design error to pretend that the number of characters is the number of code units instead of code points. A terrible and ugly one, but it does not mean you are UCS-2. You are not. Python uses UTF-16 on narrow builds.

That ugly, terrible design error is disgusting and wrong, just as much in Python as in Java, and perhaps more so because of the idiocy of narrow builds even existing. But it doesn't make them UCS-2.

If I could wave a magic wand, I would have Python undo its code unit blunder and go back to code points, no matter what. That means to stop talking about serialization schemes and start talking about logical code points. It means that slicing and indexing and length and everything else report true code points only. This horrible code unit botch from narrow builds is most easily cured by moving to wide builds only.

However, there is more. I haven't checked its UTF-16 codecs, but Python's UTF-8 codec is broken in a bunch of ways. You should be raising an exception in all kinds of places and you aren't. I can see I need to bug report this stuff too. I don't mean to be mean about this. HONEST! It's just the way it is.

Unicode currently reserves 66 code points as noncharacters, which it guarantees will never appear in a legal UTF-anything stream. I am not talking about surrogates, either.

To start with, no code point which, when bitwise ANDed with 0xFFFE, yields 0xFFFE can ever appear in a valid UTF-* stream, but Python allows these through without any error. That means that both 0xNN_FFFE and 0xNN_FFFF are illegal in all planes, where NN is 00 through 10 in hex. So that's 2 noncharacters times 17 planes = 34 code points that are illegal for interchange and that Python is passing through anyway.

The remaining 32 nonsurrogate code points illegal for open interchange are 0xFDD0 through 0xFDEF. Those are not allowed either, but Python doesn't seem to care.

You simply cannot say you are generating UTF-8 and then generate a byte sequence that UTF-8 guarantees can never occur. This is a violation.
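Here is a rough sketch, in Python 3, of the noncharacter test I mean. It is only an illustration of the ranges above, not a patch; the final encode() line shows the kind of thing a strict UTF-8 codec would have to refuse.

    # Sketch only: the 66 Unicode noncharacters described above.
    def is_noncharacter(cp):
        """True for the 66 code points Unicode reserves as noncharacters."""
        # 0xNN_FFFE and 0xNN_FFFF in every plane: the low 16 bits ANDed
        # with 0xFFFE give 0xFFFE.  That is 2 * 17 planes = 34 code points.
        if (cp & 0xFFFE) == 0xFFFE:
            return True
        # The remaining 32 are U+FDD0 through U+FDEF.
        return 0xFDD0 <= cp <= 0xFDEF

    # Exactly 66 noncharacters across all 17 planes.
    assert sum(is_noncharacter(cp) for cp in range(0x110000)) == 66

    # A codec that promises well-formed interchange should refuse these,
    # yet on the interpreters I have tried this encodes without complaint:
    chr(0xFFFE).encode("utf-8")    # no exception raised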
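And to make the code unit complaint above concrete, a minimal sketch, assuming a pre-3.3 CPython where narrow and wide builds still exist and sys.maxunicode tells them apart:

    # Sketch only: what the two build flavors report for one astral character.
    # U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
    import sys

    s = u"\U0001D11E"               # one code point

    if sys.maxunicode == 0xFFFF:
        # Narrow build: stored as a UTF-16 surrogate pair, and len(),
        # indexing, and slicing all count code units, not code points.
        print(len(s))               # -> 2
        print(repr(s[0]))           # -> the high surrogate, u'\ud834'
    else:
        # Wide build: one code point, one reported character.
        print(len(s))               # -> 1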
***SIGH***

--tom

----------
_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________