On Fri, Aug 26, 2011 at 1:54 AM, Guido van Rossum <gu...@python.org> wrote:
> On Wed, Aug 24, 2011 at 3:06 AM, Terry Reedy <tjre...@udel.edu> wrote: > > Excuse me for believing the fine 3.2 manual that says > > "Strings contain Unicode characters." (And to a naive reader, that > implies > > that string iteration and indexing should produce Unicode characters.) > > The naive reader also doesn't know the difference between characters, > code points and code units. It's the advanced, Unicode-aware reader > who is confused by this phrase in the docs. It should say code units; > or perhaps code units for narrow builds and code points for wide > builds. For UTF-16/32 (i.e. narrow/wide), talking about "code units"[0] should be correct. Also note that: * for both, every "code unit" has a specific "codepoint" (including lone surrogates), so it might be OK to talk about "codepoints" too, but * only for wide builds every "codepoints" is represented by a single, 32-bits "code unit". In narrow builds, non-BMP chars are represented by a "code unit sequence" of two elements (i.e. a "surrogate pair"). Since "code unit" refers to the *minimal* bit combination, in UTF-8 characters that needs 2/3/4 bytes, are represented with a "code unit sequence" made of 2/3/4 "code units" (so in UTF-8 "code units" and "code points" overlaps only for the ASCII range). > With PEP 393 we can unconditionally say code points, which is > much better. We should try to remove our use of "characters" -- or > else we should *define* our use of the term "characters" as "what the > Unicode standard calls code points". > Character usually works fine, especially for naive readers. Even Unicode-aware readers often confuse between the several terms, so using a simple term and pointing to a more accurate description sounds like a better idea to me. Note that there's also another important term[1]: """ *Unicode Scalar Value*. Any Unicode * code point<http://unicode.org/glossary/#code_point> * except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF16 and E00016 to 10FFFF16 inclusive. """ For example the UTF codecs produce sequences of "code units" (of 8, 16, 32 bits) that represent "scalar values"[2][3]: Chapter 3 [4] says: """ 3.9 Unicode Encoding Forms The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences. [...] D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points. • As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF and E000 to 10FFFF, inclusive. D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange. [...] D79 A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence. """ On the other hand, Python Unicode strings are not limited to scalar values, because they can also contain lone surrogates. I hope this helps clarify the terminology a bit and doesn't add more confusion, but if we want to use the Unicode terms we should get them right. (Also note that I might have misunderstood something, even if I've been careful with the terms and I double-checked and quoted the relevant parts of the Unicode standard.) Best Regards, Ezio Melotti [0]: From the chapter 3 [4], D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange. • Code units are particular units of computer storage. Other character encoding standards typically use code units defined as 8-bit units—that is, octets. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. [1]: http://unicode.org/glossary/#unicode_scalar_value [2]: Apparently Python 3 raises an error while encoding lone surrogates in UTF-8, but it doesn't for UTF-16 and UTF-32. >From the chapter 3 [4], D91: "Because surrogate code points are not Unicode scalar values, isolated UTF-16 code units in the range 0xD800..0xDFFF are ill-formed." D92: "Because surrogate code points are not included in the set of Unicode scalar values, UTF-32 code units in the range 0x0000D800..0x0000DFFF are ill-formed." I think this should be fixed. [3]: Note that I'm talking about codecs used to encode/decode Unicode strings to/from bytes here, it's perfectly fine for Python itself to represent lone surrogates in its *internal* representations, regardless of what encoding it's using. [4]: Chapter 3: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
_______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com