Terry Reedy writes:

> Excuse me for believing the fine 3.2 manual that says
> "Strings contain Unicode characters."
The manual is wrong, then, subject to a pronouncement to the contrary,
of course. I was on your side of the fence when this was discussed,
pre-release. I was wrong then. My bet is that we are still wrong now.

> For the purpose of my sentence, the same thing in that code points
> correspond to characters,

Not in Unicode, they do not. By definition, a small number of code
points (e.g. U+FFFF) *never* did and *never* will correspond to
characters. Since about Unicode 3.0, the same is true of surrogate
code points. Some restrictions have been placed on what can be done
with composed characters, so even with the PEP (which gives us code
point arrays) we do not really get arrays of Unicode characters that
fully conform to the model.

> strings are NOT code point sequences. They are 2-byte code *unit*
> sequences.

I stand corrected on Unicode terminology. "Code unit" is what I
meant, and what I understand Guido to have defined unicode objects as
arrays of.

> Any narrow build string with even 1 non-BMP char violates the
> standard.

Yup. That's by design.

> > Guido has made that absolutely clear on a number
> > of occasions.
>
> It is not clear what you mean, but recently on python-ideas he has
> reiterated that he intends bytes and strings to be conceptually
> different.

Sure. Nevertheless, practicality beat purity long ago, and that
decision has never been rescinded AFAIK.

> Bytes are computer-oriented binary arrays; strings are
> supposedly human-oriented character/codepoint arrays.

And indeed they are, in UCS-4 builds. But they are *not* in Unicode!
Unicode violates the array model, specifically in its handling of
composing characters, and in bidi, where arbitrary slicing of
direction control characters will result in garbled display.

The thing is that 90% of applications are not really going to care
about full conformance to the Unicode standard. Of the remaining 10%,
90% are not going to need both huge strings *and* ABI interoperability
with C modules compiled for UCS-2, so UCS-4 is satisfactory. The
remaining 1% of all applications, those that deal with huge strings
*and* need full Unicode conformance, need efficiency too, almost by
definition. They are probably going to want something more efficient
than either the UTF-16 or the UTF-32 representation can provide, and
will therefore need trickier, possibly app-specific, algorithms that
probably do not belong in an initial implementation.

> > And the reasons have very little to do with lack of
> > non-BMP characters to trip up the implementation. Changing those
> > semantics should have been done before the release of Python 3.
>
> The documentation was changed at least a bit for 3.0, and anyway, as
> indicated above, it is easy (especially for new users) to read the
> docs in a way that makes the current behavior buggy. I agree that
> the implementation should have been changed already.

I don't. I suspect Guido does not, even today.

> Currently, the meaning of Python code differs on narrow versus wide
> build, and in a way that few users would expect or want.

Let them become developers, then, and show us how to do it better.

> PEP 393 abolishes narrow builds as we now know them and changes
> semantics. I was answering a complaint about that change. If you do
> not like the PEP, fine.

No, I do like the PEP. However, it is only a step, and a rather
conservative one in some ways, toward conformance to the Unicode
character model. In particular, it does nothing to resolve the fact
that len() will give different answers for character count depending
on normalization, and that slicing and indexing will allow you to cut
characters in half (even in NFC, since not all composed characters
have fully composed forms).
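To make that concrete, an illustrative snippet (the sample characters
are arbitrary, and nothing here depends on the build):

    import unicodedata

    # The same user-perceived character, in two normalization forms.
    nfc = unicodedata.normalize('NFC', 'e\u0301')   # U+00E9, precomposed
    nfd = unicodedata.normalize('NFD', 'e\u0301')   # U+0065 + U+0301
    print(len(nfc), len(nfd))    # 1 2: len() counts code points

    # Slicing can cut a character in half: this keeps the base letter
    # and silently drops the combining acute accent.
    print(repr(nfd[:1]))         # 'e'

    # NFC does not make the problem go away, because some composed
    # characters have no precomposed form at all.
    print(len(unicodedata.normalize('NFC', 'q\u0302')))   # still 2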
> > It is not clear to me that it is a good idea to try to decide on
> > "the" correct implementation of Unicode strings in Python even
> > today.
>
> If the implementation is invisible to the Python user, as I believe
> it should be without special introspection, and mostly invisible in
> the C-API except for those who intentionally poke into the details,
> then the implementation can be changed as the consensus on best
> implementation changes.

A naive implementation of UTF-16 will be quite visible in terms of
performance, I suspect, and performance-oriented applications will "go
behind the API's back" to get it. We're already seeing that in the
people who insist that bytes are characters too, and that string APIs
should work on them just as they do on (Unicode) strings.

> > It's true that Python is going to need good libraries to provide
> > correct handling of Unicode strings (as opposed to unicode
> > objects).
>
> Given that 3.0 unicode (string) objects are defined as Unicode
> character strings, I do not see the opposition.

I think they're not; I think they're defined as Unicode code unit
arrays, and the documentation is in error. If the documentation is
correct, then Python 3.0 was released about 5 years too early, because
correct handling of those objects as arrays of Unicode characters has
never been implemented, or even discussed in terms of proposed code
that I know of.

Martin has long claimed that the fact that I/O is done in terms of
UTF-16 means that the internal representation is UTF-16, so I could be
wrong. But when issues of slicing, len() values, and so on have come
up in the past, Guido has always said "no, there will be no change in
semantics of builtins here".
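For the record, roughly what those builtin semantics look like today,
as a minimal sketch (the code point is an arbitrary non-BMP example,
and the commented results depend on the build as noted):

    import sys

    s = '\U00010384'                  # one non-BMP code point
    if sys.maxunicode == 0xFFFF:      # narrow build, e.g. 3.2 on Windows
        print(len(s))                 # 2: a UTF-16 surrogate pair
        print(repr(s[0]))             # '\ud800': a lone high surrogate
    else:                             # wide (UCS-4) build, or post-PEP-393
        print(len(s))                 # 1: one code point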