On 10/8/2013 6:30 PM, Steven D'Aprano wrote:
On Tue, 08 Oct 2013 15:14:33 +0000, Neil Cerutti wrote:

In any case, "\ud800\udc01" isn't a valid unicode string.

I don't think this is correct. Can you show me where the standard says
that Unicode strings[1] may not contain surrogates? I think that is a

see below.

critical point, and the FAQ conflates *encoded strings* (i.e. bytes using
one of the UTFs) with *Unicode strings*.

The string you give above is a Unicode string containing two code
points, the surrogates U+D800 U+DC01, which as far as I am concerned is a
legal string (subject to somebody pointing me to a definitive source that
proves it is not). However, it *may or may not* be encodable to bytes
using UTF-8, -16 or -32.

From chapter two of the standard.

"Plain text is a pure sequence of character codes; plain Unicode-encoded text is therefore a sequence of Unicode character codes."

http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708
"All three encoding forms can be used to represent the full range of encoded characters in the Unicode Standard; ... Each of the three Unicode encoding forms can be efficiently transformed into either of the other two without any loss of data."

"Surrogates Area. The Surrogates Area contains only surrogate code points and no encoded characters. See Section 16.6, Surrogates Area, for more detail."

Before utf-16, the surrogates area was, I believe, part of the Private Use Area (which now starts where surrogates end). I think it would have been better if they were no longer called code points, but simply utf-16 code units.

Just as there are byte sequences that cannot be generated by the UTFs,
possibly there are code point sequences that cannot be converted to bytes
using the UTFs.

True, but not to the point. You switched from sequences of characters (unicode text), which is what both Neil and I are talking about, to sequences of code points, which is a larger set when you include the non-character surrogate 'code points' that are not allowed in unicode text.
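
For anyone who wants to see the distinction concretely, here is a minimal sketch, assuming a recent CPython 3 (3.4 or later, where the UTF-16 and UTF-32 codecs also reject surrogates). The helper name can_encode is mine, purely for illustration. A Python str happily holds the two surrogate code points, but none of the three UTF codecs will turn it into bytes under the default strict error handler.

```python
# A Python 3 str is a sequence of code points, so it can hold surrogates.
s = "\ud800\udc01"
print(len(s))  # two code points, not one character

def can_encode(text, encoding):
    """Hypothetical helper: True if text encodes under the strict handler."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Try each of the three Unicode encoding forms in turn.
results = {enc: can_encode(s, enc) for enc in ("utf-8", "utf-16", "utf-32")}
print(results)
```

So the str object exists, but it is not something the UTFs can represent, which is the sense in which it is not unicode text.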

http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf#G7404

"The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences."
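
The two ranges in that sentence are easy to check mechanically. The following is only a sketch; is_scalar_value and round_trips are names I made up for this post, not anything from the standard or the stdlib. It tests whether a code point falls in U+0000..U+D7FF or U+E000..U+10FFFF, and whether it survives a round trip through each of the three encoding forms as implemented by Python's codecs.

```python
def is_scalar_value(cp):
    """True if cp lies in U+0000..U+D7FF or U+E000..U+10FFFF."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF

def round_trips(cp):
    """True if chr(cp) survives encode/decode in all three encoding forms."""
    ch = chr(cp)
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        if ch.encode(enc).decode(enc) != ch:
            return False
    return True

# Code points at the edges of both ranges, plus one supplementary character.
samples = [0x41, 0xD7FF, 0xE000, 0x10001, 0x10FFFF]
for cp in samples:
    print(hex(cp), is_scalar_value(cp), round_trips(cp))

# The surrogate range U+D800..U+DFFF is exactly the gap between the ranges.
print(is_scalar_value(0xD800), is_scalar_value(0xDFFF))
```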

> [1] Sequences of Unicode code points.

This is not the Standard's definition of 'unicode text'. It is also not its definition of 'unicode string'.

"D80 Unicode string: A code unit sequence containing code units of a particular Unicode encoding form."

In other words, a Unicode string is a utf encoding of unicode text. The FSR adaptively uses a subset of possible sequences from all three, though only one utf is used for any particular string.
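
The adaptive storage can be observed indirectly, with the caveat that this is a CPython implementation detail (3.3 and later) and the exact byte counts vary by version and platform. A rough sketch: strings of the same length take more memory as the widest code point they contain grows, because each string is stored with 1, 2, or 4 bytes per code point.

```python
import sys

n = 1000
ascii_s = "a" * n            # fits in 1 byte per code point
bmp_s = "\u0100" * n         # needs 2 bytes per code point
astral_s = "\U00010001" * n  # needs 4 bytes per code point

# Only the ordering is meaningful here; absolute sizes are not guaranteed.
sizes = [sys.getsizeof(x) for x in (ascii_s, bmp_s, astral_s)]
print(sizes)
```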

--
D79 says what I claimed before: "The mapping of the set of Unicode scalar values to the set of code unit sequences for a Unicode encoding form is one-to-one."
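
As a worked example of that one-to-one mapping for UTF-16: the scalar value U+10001 maps to the code unit pair D800 DC01, which is where the two values in Neil's string come from. The sketch below uses the standard surrogate-pair arithmetic; the function names are mine and only illustrative.

```python
def to_surrogate_pair(cp):
    """Map a supplementary scalar value to its UTF-16 code unit pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

def from_surrogate_pair(high, low):
    """Inverse mapping: a UTF-16 code unit pair back to the scalar value."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

pair = to_surrogate_pair(0x10001)
print([hex(unit) for unit in pair])

# Python's own UTF-16 codec produces the same two code units.
print(chr(0x10001).encode("utf-16-be").hex())
```

Read this way, D800 DC01 is the UTF-16 encoding of one character, not a two-character text.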

--
Terry Jan Reedy
