On 22.11.2010 11:48, Stephen J. Turnbull wrote:
> Raymond Hettinger writes:
>
> > > Neither UTF-16 nor UCS-2 is exactly correct anyway.
>
> From a standards lawyer point of view, UCS-2 is exactly correct, as
> far as I can tell upon rereading ISO 10646-1, especially Annexes H
> ("retransmitting devices") and Q ("UTF-16"). Annex Q makes it clear
> that UTF-16 was intentionally designed so that Python-style processing
> could be done in a UCS-2 context.

I could only find the FCD of 10646:2010, where annex H was integrated
into section 10:

http://www.itscj.ipsj.or.jp/sc2/open/02n4125/FCD10646-Main.pdf

There they have stopped using the term UCS-2, and added a note

# NOTE – Former editions of this standard included references to a
# two-octet BMP form called UCS-2 which would be a subset
# of the UTF-16 encoding form restricted to the BMP UCS scalar values.
# The UCS-2 form is deprecated.

I think they are now acknowledging that UCS-2 was a misleading term,
making it ambiguous whether this refers to a CCS, a CEF, or a CES;
like "ASCII", people have been using it for all three of them.

Apparently, the ISO WG interprets earlier revisions as saying that
UCS-2 is a CEF that restricted UTF-16 to the BMP. THIS IS NOT WHAT
PYTHON DOES. In a narrow Python build, the character set is *not*
restricted to the BMP. Instead, Unicode strings are meant to be
interpreted (by applications) as UTF-16.

> > For the "wide" build, the entire range of unicode is encoded at
> > 4 bytes per character and slicing/len operate correctly since
> > every character is the same length. This used to be called UCS-4
> > and is now UTF-32.
>
> That's inaccurate, I believe. UCS-4 is not a UTF, and doesn't satisfy
> the range restrictions of a UTF.

Not sure what it says in your copy; in mine, section 9.3 says

# 9.3 UTF-32 (UCS-4)
# UTF-32 (or UCS-4) is the UCS encoding form that assigns each UCS
# scalar value to a single unsigned 32-bit code unit. The terms UTF-32
# and UCS-4 can be used interchangeably to designate this encoding
# form.

so they (now) view the two as synonyms.

I think that when ISO 10646 started, they were also fairly confused
about these issues (as the group/plane/row/cell structure
demonstrates, IMO). This is not surprising, since the notion of
byte-based character sets had been ingrained for so long. It took
20 years to learn that a UCS scalar value really is *not* a sequence
of bytes, but a natural number.
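
To make the CCS / CEF / CES distinction concrete, here is a minimal
sketch. It assumes Python 3.3 or later, where str is no longer tied to
a narrow or wide build, so the code units have to be computed
explicitly. The helper names utf16_code_units and utf32_code_units are
made up for this illustration and are not part of any Python API.

```python
import struct


def utf16_code_units(s):
    """Return the 16-bit code units of s in the UTF-16 encoding form."""
    data = s.encode("utf-16-le")  # fixed byte order, no BOM
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def utf32_code_units(s):
    """Return the 32-bit code units of s in the UTF-32 encoding form."""
    data = s.encode("utf-32-le")  # fixed byte order, no BOM
    return list(struct.unpack("<%dI" % (len(data) // 4), data))


ch = "\U00010400"  # a character outside the BMP

# CCS level: the scalar value is a natural number, not a byte sequence.
scalar = ord(ch)

# CEF level: the same scalar value as a sequence of code units.
units16 = utf16_code_units(ch)  # a surrogate pair, two code units
units32 = utf32_code_units(ch)  # a single code unit

# CES level: the same UTF-16 code units serialized in two byte orders.
bytes_le = ch.encode("utf-16-le")
bytes_be = ch.encode("utf-16-be")

print(hex(scalar))
print([hex(u) for u in units16])
print([hex(u) for u in units32])
print(bytes_le, bytes_be)
```

The length of units16 is 2, which is what len() reported for such a
string on a narrow build, while the single entry in units32 matches
what a wide build reported.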

> However, I don't see how "narrow" tells us more than "UCS-2" does. If
> "UCS-2" is equally (or more) informative, I prefer it because it is
> the technically precise, already well-defined, term.

But it's not. It is a confusing term, one that the relevant standards
bodies are abandoning. After reading FCD 10646:2010, I could agree to
call the two implementations UTF-16 and UTF-32 (as these terms
designate CEFs). Unfortunately, they also designate CESs.

> If we have to document what the terms we choose mean anyway, why not
> document the existing terms and reduce entropy, rather than invent new
> ones and increase entropy?

Because the proposed existing term is deprecated.

Regards,
Martin