On May 10, 2005, at 7:34 PM, James Y Knight wrote:

> If you're going to call python's implementation UTF-16, I'd consider
> all these very serious deficiencies:
The --enable-unicode option declares a character encoding form (CEF), not a character encoding scheme (CES). It is unfortunate that UTF-16 is a valid name for both of these things, but supporting the CEF does not imply supporting the CES. All of your complaints would be valid if we claimed that Python supported the UTF-16 CES, but the language itself only needs to support a CEF that everyone understands how to work with.

It is widely recognized, I believe, that the general level of Unicode support exposed to Python users is somewhat lacking when it comes to surrogate pairs. I'd love for us to fix that problem or, better yet, integrate something like ICU, but this isn't that discussion.

> - unicodedata doesn't work for 2-char strings containing a surrogate
>   pairs, nor integers. Therefore it is impossible to get any data on
>   chars > 0xFFFF.
> - there are no methods for determining if something is a surrogate
>   pair and turning it into a integer codepoint.
> - Given that unicodedata doesn't work, I doubt also that .toupper/etc
>   work right on surrogate pairs, although I haven't tested.
> - As has been noted before, the regexp engine doesn't properly treat
>   surrogate pairs as a single unit.
> - Is there a method that is like unichr but that will work for
>   codepoints > 0xFFFF?
>
> I'm sure there's more as well. I think it's a mistake to consider
> python to be implementing UTF-16 just because it properly
> encodes/decodes surrogate pairs in the UTF-8 codec.

Users should understand (and we should write documentation to help them understand) that using 2-byte-wide Unicode support in Python means that all operations are done on code units, not code points. Once you understand this, you can work with the data that is given to you, although it's certainly not as nice as what you would have come to expect from Python. (For example, you can correctly construct a regexp to find the surrogate pair you're looking for by using its constituent code units.)
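To make the code unit versus code point distinction concrete, here is a minimal sketch of the standard UTF-16 surrogate arithmetic. It works on plain integers so that it behaves the same on any build. The helper names are hypothetical and chosen for illustration only; none of them exist in Python itself.

```python
# Sketch only: these helpers are hypothetical, not part of Python's API.
# They operate on integer UTF-16 code units rather than on strings.

HIGH_START, HIGH_END = 0xD800, 0xDBFF   # high (leading) surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF     # low (trailing) surrogates


def is_high_surrogate(unit):
    return HIGH_START <= unit <= HIGH_END


def is_low_surrogate(unit):
    return LOW_START <= unit <= LOW_END


def combine_surrogates(high, low):
    """Turn a high/low surrogate pair into a single code point."""
    if not (is_high_surrogate(high) and is_low_surrogate(low)):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


def split_code_point(cp):
    """Turn a code point above 0xFFFF into its (high, low) code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary range")
    offset = cp - 0x10000
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def code_units_to_code_points(units):
    """Walk a list of code units, merging well-formed surrogate pairs.

    Unpaired surrogates are passed through unchanged.
    """
    points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if (is_high_surrogate(unit) and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            points.append(combine_surrogates(unit, units[i + 1]))
            i += 2
        else:
            points.append(unit)
            i += 1
    return points
```

For instance, split_code_point(0x1D11E) gives the two code units 0xD834 and 0xDD1E, which are the constituent units you would put in a regexp on a 2-byte build to match that one character.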
-- Nick