Terry Reedy writes:

 > The current UCS2 Unicode string implementation, by design, quickly gives
 > WRONG answers for len(), iteration, indexing, and slicing if a string
 > contains any non-BMP (surrogate pair) Unicode characters. That may have
 > been excusable when there essentially were no such extended chars, and
 > the few there were were almost never used.

Well, no, it gives the right answer according to the design. unicode
objects do not contain character strings. By design, they contain
code point strings. Guido has made that absolutely clear on a number
of occasions. And the reasons have very little to do with lack of
non-BMP characters to trip up the implementation. Changing those
semantics should have been done before the release of Python 3.

It is not clear to me that it is a good idea to try to decide on "the"
correct implementation of Unicode strings in Python even today. There
are a number of approaches that I can think of.

1. The "too bad if you can't take a joke" approach: do nothing and
   recommend UTF-32 to those who want len() to DTRT.

2. The "slope is slippery" approach: Implement UTF-16 objects as
   built-ins, and then try to fend off requests for correct treatment
   of unnormalized composed characters, normalization, compatibility
   substitutions, bidi, etc etc.

3. The "are we not hackers?" approach: Implement a transform that maps
   characters that are not represented by a single code point into
   Unicode private space, and then see if anybody really needs more
   than 6400 non-BMP characters. (Note that this would generalize to
   composed characters that don't have a one-code-point NFC form and
   similar non-standardized cases that nonstandard users might want
   handled.)

4. The "42" approach: sadly, I can't think deeply enough to explain it.

There are probably others.

It's true that Python is going to need good libraries to provide
correct handling of Unicode strings (as opposed to unicode objects).
But it's not clear to me given the wide variety of implementations I
can imagine that there will be one best implementation, let alone
which ones are good and Pythonic, and which not so.
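
To make approach 3 a little more concrete, here is a minimal sketch of what such a transform might look like. It is an illustration under stated assumptions, not a proposal: it assumes Python 3.3 or later (where str is indexed by code point), the names utf16_units and PuaMapper are hypothetical, and it simply refuses input that already uses the BMP private use area rather than trying to resolve the collision.

```python
# Sketch only: assumes Python 3.3+, where str is a sequence of code points.
# The names utf16_units and PuaMapper are hypothetical, for illustration.

PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF  # BMP Private Use Area: 6400 code points


def utf16_units(text):
    """Return the number of 16-bit code units in the UTF-16 form of text."""
    return len(text.encode("utf-16-le")) // 2


class PuaMapper:
    """Map each distinct non-BMP character to one private-use code point."""

    def __init__(self):
        self._to_pua = {}
        self._from_pua = {}

    def encode(self, text):
        out = []
        for ch in text:
            if PUA_FIRST <= ord(ch) <= PUA_LAST:
                # Assumption: the caller's text does not use the PUA itself.
                raise ValueError("input already uses the private use area")
            if ord(ch) > 0xFFFF:
                if ch not in self._to_pua:
                    slot = PUA_FIRST + len(self._to_pua)
                    if slot > PUA_LAST:
                        raise OverflowError("more than 6400 distinct non-BMP characters")
                    self._to_pua[ch] = chr(slot)
                    self._from_pua[chr(slot)] = ch
                ch = self._to_pua[ch]
            out.append(ch)
        return "".join(out)

    def decode(self, text):
        # Characters with no table entry pass through unchanged.
        return "".join(self._from_pua.get(ch, ch) for ch in text)


clef = "\U0001D11E"            # MUSICAL SYMBOL G CLEF, outside the BMP
text = "a" + clef + "b"

print(len(text))               # number of code points
print(utf16_units(text))       # number of 16-bit code units (surrogate pair counted twice)

mapper = PuaMapper()
packed = mapper.encode(text)
print(utf16_units(packed))     # one code unit per character after mapping
assert mapper.decode(packed) == text
```

The mapping table is per-mapper state, so the packed form is only meaningful together with the mapper that produced it. That is one reason a transform like this reads more like a library-level convention than a candidate for the built-in string type.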