On Tue, 08 Oct 2013 21:28:25 -0400, Terry Reedy wrote:

> On 10/8/2013 6:30 PM, Steven D'Aprano wrote:
>> On Tue, 08 Oct 2013 15:14:33 +0000, Neil Cerutti wrote:
>>
>>> In any case, "\ud800\udc01" isn't a valid unicode string.
>>
>> I don't think this is correct. Can you show me where the standard says
>> that Unicode strings[1] may not contain surrogates? I think that is a
>
> see below.
>
>> critical point, and the FAQ conflates *encoded strings* (i.e. bytes
>> using one of the UTFs) with *Unicode strings*.
>>
>> The string you give above is a Unicode string containing two code
>> points, the surrogates U+D800 U+DC01, which as far as I am concerned is
>> a legal string (subject to somebody pointing me to a definitive source
>> that proves it is not). However, it *may or may not* be encodable to
>> bytes using UTF-8, -16 or -32.
>
> From chapter two of the standard.
>
> "Plain text is a pure sequence of character codes; plain Unicode-encoded
> text is therefore a sequence of Unicode character codes."
Also there are many valid non-characters in Unicode, including 66
explicitly defined non-characters, plus the many surrogates. So defining
Unicode strings in terms of characters is less than helpful, since it
excludes a whole bunch of strings which aren't "text" since they include
non-characters.

Also, "character" in the context of Unicode is ambiguous, due to
normalization and decomposition: a single character can have up to four
distinct forms.

http://www.macchiato.com/unicode/nfc-faq

*Code points* are rigorously defined, not characters, which is why I have
tried very hard to only refer to code points and bytes, not characters.

> http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708 "All three
> encoding forms can be used to represent the full range of encoded
> characters in the Unicode Standard; ... Each of the three Unicode
> encoding forms can be efficiently transformed into either of the other
> two without any loss of data."

This merely says "encodings encode characters". We know that encodings can
also encode non-characters, at least *some* non-characters. The question
is, can they encode surrogates?

> "Surrogates Area. The Surrogates Area contains only surrogate code
> points and no encoded characters. See Section 16.6, Surrogates Area, for
> more detail."
>
> Before utf-16, the surrogates area was, I believe, part of the Private
> Use Area (which now starts where surrogates end). I think it would have
> been better if they were no longer called code points, but simply utf-16
> code units.

Private Use is irrelevant, since strings certainly can contain Private Use
code points, and UTF encodings can encode them.

>> Just as there are byte sequences that cannot be generated by the UTFs,
>> possibly there are code point sequences that cannot be converted to
>> bytes using the UTFs.
>
> True, but not to the point.
> You switched from sequences of characters
> (unicode text), which is what both I and Neil are talking about, to
> sequences of codepoints which is a larger set when you include the
> non-character surrogate 'code points' that are not allowed in unicode
> text.

I never mentioned sequences of characters. I've always talked about code
points.

> http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf#G7404
>
> "The Unicode Standard supports three character encoding forms: UTF-32,
> UTF-16, and UTF-8. Each encoding form maps the Unicode code points
> U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences."

Ah! Now we're getting somewhere! I think you've hit the nail on the head:
the three UTF forms explicitly exclude the surrogates.

So I think we now have an answer: Surrogate code points can exist in
Unicode strings, but cannot be encoded to bytes using the standard UTF-8,
UTF-16 and UTF-32 encodings. There may be other encodings, or error
handlers, which are capable of handling surrogates, but they aren't UTF-8.

So I think this answers my question. (I reserve the right to change my mind
after reading more of the standard.)

Thank you to everyone who replied.

--
Steven
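
For anyone who wants to check the conclusion above in an interpreter, here is a minimal sketch. It assumes CPython 3.4 or later (earlier 3.x releases let the UTF-16 and UTF-32 encoders pass surrogates through). The helper name can_encode is made up for illustration; "surrogatepass" is a real Python error handler, and it is one example of the "other error handlers" mentioned above, not part of the UTFs themselves.

```python
# The two-code-point string from the thread: surrogates U+D800 and U+DC01.
s = "\ud800\udc01"

CODECS = ("utf-8", "utf-16", "utf-32")


def can_encode(text, codec, errors="strict"):
    """Hypothetical helper: True if text encodes under codec and handler."""
    try:
        text.encode(codec, errors)
    except UnicodeEncodeError:
        return False
    return True


# The str object itself is accepted: it simply holds two code points.
code_points = ["U+%04X" % ord(ch) for ch in s]

# Strict encoding with each of the three UTFs is refused.
strict_results = {codec: can_encode(s, codec) for codec in CODECS}

# The "surrogatepass" handler forces the surrogates through anyway;
# the resulting bytes are not well-formed UTF-8.
forced = s.encode("utf-8", "surrogatepass")
round_trip = forced.decode("utf-8", "surrogatepass")

if __name__ == "__main__":
    print(code_points)
    print(strict_results)
    print(forced, round_trip == s)
```

By contrast, a non-character such as U+FFFF or a Private Use code point such as U+E000 encodes without complaint under the same strict settings, which matches the distinction drawn in the thread between surrogates and other non-characters.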