Re: Encoding of surrogate code points to UTF-8

Terry Reedy Tue, 08 Oct 2013 18:31:03 -0700

On 10/8/2013 6:30 PM, Steven D'Aprano wrote:

On Tue, 08 Oct 2013 15:14:33 +0000, Neil Cerutti wrote:

In any case, "\ud800\udc01" isn't a valid unicode string.


I don't think this is correct. Can you show me where the standard says
that Unicode strings[1] may not contain surrogates? I think that is a


see below.

critical point, and the FAQ conflates *encoded strings* (i.e. bytes using
one of the UTFs) with *Unicode strings*.

The string you give above is a Unicode string containing two code
points, the surrogates U+D800 U+DC01, which as far as I am concerned is a
legal string (subject to somebody pointing me to a definitive source that
proves it is not). However, it *may or may not* be encodable to bytes
using UTF-8, -16 or -32.


From chapter two of the standard.

"Plain text is a pure sequence of character codes; plain Unicode-encoded text is therefore a sequence of Unicode character codes."


http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708

"All three encoding forms can be used to represent the full range of encoded characters in the Unicode Standard; ... Each of the three Unicode encoding forms can be efficiently transformed into either of the other two without any loss of data."

"Surrogates Area. The Surrogates Area contains only surrogate code points and no encoded characters. See Section 16.6, Surrogates Area, for more detail."

Before utf-16, the surrogates area was, I believe, part of the Private Use Area (which now starts where surrogates end). I think it would have been better if they were no longer called code points, but simply utf-16 code units.

Just as there are byte sequences that cannot be generated by the UTFs,
possibly there are code point sequences that cannot be converted to bytes
using the UTFs.

True, but not to the point. You switched from sequences of characters (unicode text), which is what both I and Neil are talking about, to sequences of code points, which is a larger set when you include the non-character surrogate 'code points' that are not allowed in unicode text.


http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf#G7404

"The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences."
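
A quick way to see that range restriction from Python is sketched below. It assumes a CPython 3.4 or later interpreter, where the strict utf-8, utf-16 and utf-32 codecs all reject lone surrogates; the helper name try_encode is made up for this illustration.

```python
# Surrogates (U+D800..U+DFFF) are code points but not scalar values,
# so the strict UTF codecs have nothing to map them to.
s = "\ud800\udc01"  # the two-surrogate string discussed in this thread


def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec rejects the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# Try all three encoding forms on the surrogate string.
results = {enc: try_encode(s, enc) for enc in ("utf-8", "utf-16", "utf-32")}

# A supplementary character such as U+10001 is a scalar value and encodes.
scalar_utf8 = "\U00010001".encode("utf-8")

print(len(s), results, scalar_utf8)
```

Python still lets the str object hold the two surrogates (its length is 2); it is only the conversion to a UTF that fails.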


> [1] Sequences of Unicode code points.

This is not the Standard's definition of 'unicode text'. It is also not its definition of 'unicode string'.

"D80 Unicode string: A code unit sequence containing code units of a particular Unicode encoding form."

In other words, a Unicode string is a utf encoding of unicode text. The FSR adaptively uses a subset of possible sequences from all three, though only one utf is used for any particular string.
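
The adaptive storage can be observed indirectly. The following is only a rough sketch that assumes CPython 3.3 or later, since sys.getsizeof reports implementation-specific sizes; bytes_per_char is a hypothetical helper, not part of any library.

```python
import sys


def bytes_per_char(ch):
    """Extra bytes needed to store one more copy of ch in a str."""
    return sys.getsizeof(ch * 2) - sys.getsizeof(ch)


# One sample character per storage width the FSR can choose.
widths = {
    "ascii": bytes_per_char("a"),
    "bmp": bytes_per_char("\u20ac"),
    "supplementary": bytes_per_char("\U00010001"),
}
print(widths)
```

On CPython the three values should come out as 1, 2 and 4 bytes per character; other implementations may report something different.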

--

D79 says what I claimed before: "The mapping of the set of Unicode scalar values to the set of code unit sequences for a Unicode encoding form is one-to-one."
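
As a small illustration of that one-to-one mapping (a sketch in Python 3, with sample characters chosen arbitrarily), each scalar value round-trips through every encoding form, and one form can be converted into another without loss:

```python
# A few scalar values: ASCII, Latin-1, BMP, and supplementary.
samples = ["A", "\u00e9", "\u20ac", "\U00010001"]
encodings = ("utf-8", "utf-16-le", "utf-32-le")

# Every scalar value survives a round trip through every encoding form.
round_trips = all(
    ch.encode(enc).decode(enc) == ch for ch in samples for enc in encodings
)

# Convert UTF-8 bytes to UTF-16 code units by way of the decoded text.
utf8 = "\U00010001".encode("utf-8")
utf16 = utf8.decode("utf-8").encode("utf-16-le")

print(round_trips, utf8, utf16)
```

In the UTF-16 result, U+10001 appears as the code units D800 DC01 (little-endian bytes 00 D8 01 DC). There the surrogates are code units of an encoding form, not characters in the text.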


--
Terry Jan Reedy

--
https://mail.python.org/mailman/listinfo/python-list
