On Thu, Aug 25, 2011 at 6:40 PM, Ezio Melotti <ezio.melo...@gmail.com> wrote:
> On Fri, Aug 26, 2011 at 1:54 AM, Guido van Rossum <gu...@python.org> wrote:
>>
>> On Wed, Aug 24, 2011 at 3:06 AM, Terry Reedy <tjre...@udel.edu> wrote:
>> > Excuse me for believing the fine 3.2 manual that says
>> > "Strings contain Unicode characters." (And to a naive reader, that
>> > implies that string iteration and indexing should produce Unicode
>> > characters.)
>>
>> The naive reader also doesn't know the difference between characters,
>> code points and code units. It's the advanced, Unicode-aware reader
>> who is confused by this phrase in the docs. It should say code units;
>> or perhaps code units for narrow builds and code points for wide
>> builds.
>
> For UTF-16/32 (i.e. narrow/wide), talking about "code units"[0] should be
> correct. Also note that:
> * for both, every "code unit" has a specific "codepoint" (including lone
> surrogates), so it might be OK to talk about "codepoints" too, but
> * only for wide builds is every "codepoint" represented by a single,
> 32-bit "code unit". In narrow builds, non-BMP chars are represented by a
> "code unit sequence" of two elements (i.e. a "surrogate pair").
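The distinction drawn in the quoted bullets can be sketched in a few lines of Python 3. This is only an illustration, not part of the original thread: the helper names (`utf16_code_units`, `utf32_code_units`, `combine_surrogates`) are made up here, and the sketch goes through the codecs rather than inspecting any internal string representation.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of text in UTF-16 (hypothetical helper)."""
    data = text.encode("utf-16-le")  # little-endian, no BOM
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def utf32_code_units(text):
    """Return the 32-bit code units of text in UTF-32 (hypothetical helper)."""
    data = text.encode("utf-32-le")  # little-endian, no BOM
    return [int.from_bytes(data[i:i + 4], "little")
            for i in range(0, len(data), 4)]


def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair back to the code point it encodes."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


bmp = "\u00e9"          # U+00E9, inside the BMP
non_bmp = "\U00010400"  # U+10400, outside the BMP

# A BMP code point is a single code unit in both forms.
print([hex(u) for u in utf16_code_units(bmp)])      # ['0xe9']
# A non-BMP code point needs a surrogate pair in UTF-16 ...
print([hex(u) for u in utf16_code_units(non_bmp)])  # ['0xd801', '0xdc00']
# ... but still a single code unit in UTF-32.
print([hex(u) for u in utf32_code_units(non_bmp)])  # ['0x10400']
print(hex(combine_surrogates(0xD801, 0xDC00)))      # 0x10400
```

On a narrow build, indexing the string itself would expose the two surrogate code units; the sketch sidesteps that by always asking the codec for the code units.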

The more I think about it the more it seems to me that the biggest
problem is that in narrow builds it is ambiguous whether (unicode)
strings contain code units, i.e. are *encoded* code points, or whether
they contain (decoded) code points. In a sense this is repeating the
ambiguity of 8-bit strings in Python 2, which are sometimes assumed to
contain ASCII or Latin-1 (i.e., code points with a limited range) or
UTF-8 (i.e., code units).

I know that by now I am repeating myself, but I think it would be
really good if we could get rid of this ambiguity. PEP 393 seems the
best way forward, even if it doesn't directly address what to do for
IronPython or Jython, both of which have to deal with a pervasive
native string type that contains UTF-16.

IIUC, CPython on Windows will work just fine with PEP 393, even if it
means that there is a bit more translation between Python strings and
the OS native wchar_t[] type. I assume that the data volumes going
through the OS APIs are relatively constrained, since data actually
written to or read from a file will still be bytes, possibly run
through a codec (if it's a text file), and not go through one of the
wchar_t[] APIs -- the latter are used for things like filenames, which
are much smaller.

> Since "code unit" refers to the *minimal* bit combination, in UTF-8
> characters that need 2/3/4 bytes are represented with a "code unit
> sequence" made of 2/3/4 "code units" (so in UTF-8 "code units" and "code
> points" overlap only for the ASCII range).

Actually I think UTF-8 is best thought of as an encoding for code
points, not characters -- the subtle difference between these two
should be of no concern to the UTF-8 codec (unless it is a validating
codec).

>> With PEP 393 we can unconditionally say code points, which is
>> much better. We should try to remove our use of "characters" -- or
>> else we should *define* our use of the term "characters" as "what the
>> Unicode standard calls code points".
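As a small illustration of the point that UTF-8 encodes code points rather than characters, here is a Python 3 sketch (added for clarity, not from the original message). It assumes only the standard `unicodedata` module; the variable names are arbitrary.

```python
import unicodedata

# One user-perceived character, written two ways.
decomposed = "e\u0301"                               # 'e' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

# The codec sees code points only: two in one case, one in the other.
print([hex(ord(c)) for c in decomposed])  # ['0x65', '0x301']
print([hex(ord(c)) for c in composed])    # ['0xe9']

# Each code point is encoded independently; the codec neither knows nor
# cares that both strings display as the same character.
utf8_decomposed = decomposed.encode("utf-8")  # b'e\xcc\x81'
utf8_composed = composed.encode("utf-8")      # b'\xc3\xa9'

# Both forms round-trip exactly, without being normalized along the way.
assert utf8_decomposed.decode("utf-8") == decomposed
assert utf8_composed.decode("utf-8") == composed
```

Whether the two strings should compare equal is a question for normalization and equivalence, which is a layer above the codec.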
>
> Character usually works fine, especially for naive readers. Even
> Unicode-aware readers often confuse the several terms, so using a
> simple term and pointing to a more accurate description sounds like a better
> idea to me.

We may well have no choice -- there is just too much documentation
that naively refers to characters while really referring to code units
or code points.

> Note that there's also another important term[1]:
> """
> Unicode Scalar Value. Any Unicode code point except high-surrogate and
> low-surrogate code points. In other words, the ranges of integers 0 to
> D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive.
> """

This seems to involve validation. I think all validation should be
sequestered to specific APIs (e.g. certain codecs) and the string type
should not care about it. Depending on what they are doing,
applications may have to be aware of many subtleties in order to
always avoid generating "invalid" (or not well-formed -- what's the
difference?) strings.

> For example the UTF codecs produce sequences of "code units" (of 8, 16, 32
> bits) that represent "scalar values"[2][3]:
>
> Chapter 3 [4] says:
> """
> 3.9 Unicode Encoding Forms
> The Unicode Standard supports three character encoding forms: UTF-32,
> UTF-16, and UTF-8. Each encoding form maps the Unicode code points
> U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences. [...]

I really don't mind whether our codecs actually make exceptions for
surrogates (lone or otherwise). The only requirement I care about is
that surrogate-free strings round-trip correctly. Again, apps that
want to conform to the requirements regarding surrogates can implement
their own validation, and certainly at some point we should offer a
validation library as part of the stdlib -- but it should be up to the
app whether and when to use it.

> D76 Unicode scalar value: Any Unicode code point except high-surrogate and
> low-surrogate code points.
> • As a result of this definition, the set of Unicode scalar values
> consists of the ranges 0 to D7FF and E000 to 10FFFF, inclusive.
> D77 Code unit: The minimal bit combination that can represent a unit of
> encoded text for processing or interchange.
> [...]
> D79 A Unicode encoding form assigns each Unicode scalar value to a unique
> code unit sequence.
> """
>
> On the other hand, Python Unicode strings are not limited to scalar values,
> because they can also contain lone surrogates.

Right.

> I hope this helps clarify the terminology a bit and doesn't add more
> confusion, but if we want to use the Unicode terms we should get them
> right. (Also note that I might have misunderstood something, even if I've
> been careful with the terms and I double-checked and quoted the relevant
> parts of the Unicode standard.)

I'm not more confused than I was, but I think we should reduce the
number of Unicode terms we care about rather than increase them. If we
only ever had to talk about code points and encoded byte sequences I'd
be happy -- although in practice we also need to acknowledge the
existence of characters that may be represented by multiple code
points, since islower(), lower() etc. may need these (and also the re
module). Other concepts we may have to at least acknowledge include
various normal forms, equivalence, and collation sequences (which are
language-dependent?). It would be lovely if someone wrote up an
informational PEP so that we don't all have to lug around a copy of
the Unicode standard.

> Best Regards,
> Ezio Melotti
>
>
> [0]: From the chapter 3 [4],
> D77 Code unit: The minimal bit combination that can represent a unit of
> encoded text for processing or interchange.
> • Code units are particular units of computer storage. Other character
> encoding standards typically use code units defined as 8-bit units -- that is,
> octets.
> The Unicode Standard uses 8-bit code units in the UTF-8 encoding form,
> 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the
> UTF-32 encoding form.
> [1]: http://unicode.org/glossary/#unicode_scalar_value
> [2]: Apparently Python 3 raises an error while encoding lone surrogates in
> UTF-8, but it doesn't for UTF-16 and UTF-32.
> From the chapter 3 [4],
> D91: "Because surrogate code points are not Unicode scalar values, isolated
> UTF-16 code units in the range 0xD800..0xDFFF are ill-formed."
> D92: "Because surrogate code points are not included in the set of Unicode
> scalar values, UTF-32 code units in the range 0x0000D800..0x0000DFFF are
> ill-formed."
> I think this should be fixed.
> [3]: Note that I'm talking about codecs used to encode/decode Unicode
> strings to/from bytes here, it's perfectly fine for Python itself to
> represent lone surrogates in its *internal* representations, regardless of
> what encoding it's using.
> [4]: Chapter 3: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

--
--Guido van Rossum (python.org/~guido)
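The idea of leaving surrogate validation to the application, while requiring only that surrogate-free strings round-trip, could look roughly like the Python 3 sketch below. It is an illustration added here, not an existing stdlib API: `is_scalar_value_string` and `roundtrips` are hypothetical helper names. It only asserts the UTF-8 behaviour mentioned in footnote [2]; how the UTF-16 and UTF-32 codecs treat lone surrogates has differed between Python versions, so the sketch does not rely on it.

```python
def is_scalar_value_string(text):
    """True if text holds only Unicode scalar values, i.e. no code point
    in the surrogate range U+D800..U+DFFF (hypothetical helper)."""
    return not any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def roundtrips(text, encoding):
    """True if encoding and then decoding gives back the same string."""
    return text.encode(encoding).decode(encoding) == text


clean = "a\U00010400b"  # surrogate-free, includes a non-BMP code point
lone = "a\ud800b"       # contains a lone high surrogate

# Application-level validation, separate from the string type itself.
print(is_scalar_value_string(clean))  # True
print(is_scalar_value_string(lone))   # False

# Surrogate-free strings round-trip through all three UTF codecs.
print(all(roundtrips(clean, enc) for enc in ("utf-8", "utf-16", "utf-32")))  # True

# The strict UTF-8 codec refuses the lone surrogate ...
try:
    lone.encode("utf-8")
    utf8_rejects_lone_surrogate = False
except UnicodeEncodeError:
    utf8_rejects_lone_surrogate = True

# ... unless the caller opts in with the "surrogatepass" error handler.
passed_through = lone.encode("utf-8", "surrogatepass")
```

Keeping the check in a separate function leaves the decision of whether and when to validate with the caller, which matches the position taken in the reply.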