On 20.11.2010 05:11, Stephen J. Turnbull wrote:
> "Martin v. Löwis" writes:
>
>  > The term "UCS-2" is a character set that can encode only 65536
>  > characters; it thus refers to Unicode 1.1. According to the Unicode
>  > Consortium's FAQ, the term UCS-2 should be avoided these days.
>
> So what do you propose we call the Python implementation?

A technically correct description would be to say that Python uses either
16-bit code units or 32-bit code units; for brevity, these can be called
narrow and wide code units.

> Strictly speaking, internally Python only encodes 65536 characters in
> 2-octet builds. Its (Unicode) string-handling code does not know
> about surrogates at all, AFAIK

Here you are mistaken: it does indeed know about UTF-16 and surrogates
in several places, e.g. in the UTF-8 codec, or in the repr()
implementation; likewise in the parser.

> and therefore is not UTF-16 conforming.

I disagree. Python does "conform" to "UTF-16" (certainly in the sense
that no UTF-16 specification ever mandates a certain Python API, and
that Python follows all general requirements of the UTF-16
specification).

> AFAIK this was not supposed to change in Python 3; indexing and
> slicing go by code unit (isomorphic to UCS-n), not character, and due
> to PEP 383 4-octet builds do not conform (internally) to UTF-32, and
> can produce output that conforms to Unicode not at all (as a user
> option, of course, but it's still non-conformant).

What behavior specifically do you consider non-conforming, and what
specific specification do you think it is not conforming to? For
example, it *is* fully conforming with UTF-8.

Regards,
Martin
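
For readers who want to see the narrow/wide distinction concretely, here is a minimal sketch. It assumes a current Python 3 interpreter (3.3 or later), where narrow and wide builds no longer exist, so the two code-unit widths are simulated by going through the utf-16-le and utf-32-le codecs. The helper names utf16_code_units and utf32_code_units are made up for this illustration and are not part of any Python API. The last part shows the PEP 383 behaviour mentioned in the thread: the surrogateescape error handler maps an undecodable byte to a lone surrogate, which a strict UTF-8 encoder then refuses.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit ("narrow") code units of text (hypothetical helper)."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def utf32_code_units(text):
    """Return the 32-bit ("wide") code units of text (hypothetical helper)."""
    data = text.encode("utf-32-le")
    return list(struct.unpack("<%dI" % (len(data) // 4), data))


# U+0061 is in the BMP; U+10400 is outside it and needs a surrogate pair
# when expressed in 16-bit code units.
sample = "a\U00010400"

narrow = utf16_code_units(sample)
wide = utf32_code_units(sample)

print([hex(u) for u in narrow])  # ['0x61', '0xd801', '0xdc00']
print([hex(u) for u in wide])    # ['0x61', '0x10400']

# PEP 383: an undecodable byte becomes a lone low surrogate on decoding ...
raw = b"abc\xff"
decoded = raw.decode("utf-8", "surrogateescape")

# ... which round-trips with the same handler, but is rejected by the
# strict UTF-8 encoder because a lone surrogate is not a valid scalar value.
roundtrip = decoded.encode("utf-8", "surrogateescape")
try:
    decoded.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

Indexing by 16-bit code unit would report three items for the sample string, while indexing by 32-bit code unit reports two; that is the difference the terms "narrow" and "wide" are meant to capture.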