Tom Christiansen <tchr...@perl.com> added the comment:

>> Here's why I say that Python uses UTF-16 not UCS-2 on its narrow builds.
>> Perhaps someone could tell me why the Python documentation says it uses
>> UCS-2 on a narrow build.
> There's a disagreement on that point between several developers.
> See an example sub-thread at:
> http://mail.python.org/pipermail/python-dev/2010-November/105751.html

Some of those folks know what they're talking about, and some do not. Most of the postings miss the mark.

Python uses UTF-16 for its narrow builds. It does not use UCS-2.

The argument that it must be UCS-2 because it can store lone surrogates in memory is spurious. You have to read The Unicode Standard very *very* closely, but it is not necessary that all internal buffers always be in well-formed UTF-whatever. Otherwise it would be impossible to append a code unit at a time to a buffer. I could pull out the reference if I worked at it, because I've had to find it before. It's in there. Trust me. I know.

It is also spurious to pretend that because you can produce illegal output when telling it to generate something in UTF-16, it is somehow not using UTF-16. You have simply made a mistake. You have generated something that you promised you would not generate. I have more to say about this below.

Finally, it is spurious to argue against UTF-16 because of the code unit interface. Java does exactly the same thing as Python does *in all regards* here, and no one pretends that Java is UCS-2. Both are UTF-16. It is simply a design error to pretend that the number of characters is the number of code units instead of code points. A terrible and ugly one, but it does not mean you are UCS-2. You are not. Python uses UTF-16 on narrow builds.

That ugly, terrible design error is disgusting and wrong, just as much in Python as in Java, and perhaps more so because of the idiocy of narrow builds even existing. But it doesn't make them UCS-2.

If I could wave a magic wand, I would have Python undo its code unit blunder and go back to code points, no matter what. That means to stop talking about serialization schemes and start talking about logical code points. It means that slicing and indexing and length and everything else report true code points only. This horrible code unit botch from narrow builds is most easily cured by moving to wide builds only.

However, there is more. I haven't checked its UTF-16 codecs, but Python's UTF-8 codec is broken in a bunch of ways. You should be raising an exception in all kinds of places and you aren't. I can see I need to bug report this stuff too. I don't mean to be mean about this. HONEST! It's just the way it is.

Unicode currently reserves 66 code points as noncharacters, which it guarantees will never appear in a legal UTF-anything stream. I am not talking about surrogates, either.

To start with, no code point which, when bitwise ANDed with 0xFFFE, yields 0xFFFE can ever appear in a valid UTF-* stream, but Python allows these through without any error. That means that both 0xNN_FFFE and 0xNN_FFFF are illegal in all planes, where NN is 00 through 10 in hex. So that's 2 noncharacters times 17 planes = 34 code points that are illegal for interchange and that Python is passing through anyway.

The remaining 32 nonsurrogate code points illegal for open interchange are 0xFDD0 through 0xFDEF. Those are not allowed either, but Python doesn't seem to care.

You simply cannot say you are generating UTF-8 and then generate a byte sequence that UTF-8 guarantees can never occur. This is a violation.
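Here is a rough sketch, in Python 3, of the noncharacter test I mean. It is only an illustration of the ranges above, not a patch; the final encode() line shows the kind of thing a strict UTF-8 codec would have to refuse.

    # Sketch only: the 66 Unicode noncharacters described above.
    def is_noncharacter(cp):
        """True for the 66 code points Unicode reserves as noncharacters."""
        # 0xNN_FFFE and 0xNN_FFFF in every plane: the low 16 bits ANDed
        # with 0xFFFE give 0xFFFE.  That is 2 * 17 planes = 34 code points.
        if (cp & 0xFFFE) == 0xFFFE:
            return True
        # The remaining 32 are U+FDD0 through U+FDEF.
        return 0xFDD0 <= cp <= 0xFDEF

    # Exactly 66 noncharacters across all 17 planes.
    assert sum(is_noncharacter(cp) for cp in range(0x110000)) == 66

    # A codec that promises well-formed interchange should refuse these,
    # yet on the interpreters I have tried this encodes without complaint:
    chr(0xFFFE).encode("utf-8")    # no exception raised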
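And to make the code unit complaint above concrete, a minimal sketch, assuming a pre-3.3 CPython where narrow and wide builds still exist and sys.maxunicode tells them apart:

    # Sketch only: what the two build flavors report for one astral character.
    # U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
    import sys

    s = u"\U0001D11E"               # one code point

    if sys.maxunicode == 0xFFFF:
        # Narrow build: stored as a UTF-16 surrogate pair, and len(),
        # indexing, and slicing all count code units, not code points.
        print(len(s))               # -> 2
        print(repr(s[0]))           # -> the high surrogate, u'\ud834'
    else:
        # Wide build: one code point, one reported character.
        print(len(s))               # -> 1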
***SIGH***

--tom

----------
_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________