Doug wrote me:
You're not confusing "code point" with "code unit," are you?
Thanks for the note.

If I understand you correctly, you are saying that I thought (or meant to write) "by first representing the sequence of scalar values in an encoding form and then counting [code points typecast from] code _units_". I think you are right, but some points of confusion remain; see below. Somehow I thought of a "surrogate pair" as a "pair of (surrogate) code points" instead of a "pair of (surrogate) code units". I guess that additional level of indirection makes my interpretation (b) unlikely ... Still, I think my statement is technically correct, because for UTF-16, counting code points and counting code units lead to the same count.
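To make the counting concrete, here is a minimal Python sketch. It assumes Python 3, where a `str` is a sequence of code points and may hold lone surrogate code points; the helper name `utf16_code_units` is made up for this illustration.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    # Each UTF-16 code unit is one big-endian unsigned 16-bit value.
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

s = "a\U0001F600"            # two scalar values: U+0061 and U+1F600
units = utf16_code_units(s)  # three code units; the last two form a surrogate pair

# Typecast each code unit to the code point with the same number.
# Python permits surrogate code points inside a str, so this works.
as_code_points = "".join(chr(u) for u in units)

scalar_count = len(s)                 # counts scalar values
unit_count = len(units)               # counts UTF-16 code units
typecast_count = len(as_code_points)  # counts code points typecast from code units
```

The last two counts agree with each other but differ from the scalar value count, which is the distinction at issue here.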

What's confusing is a term like "high-surrogate code point" (see glossary). If surrogate code points are not encoded, then they practically don't exist in the ontology of Unicode terms, aside from being holes in the scalar value range, if thought of as a subrange of the integers.

In detail: The glossary defines "surrogate code point" as: "A Unicode code point in the range U+D800..U+DFFF. Reserved _for use_ by UTF-16, where _a pair of surrogate code units_ (a high surrogate followed by a low surrogate) “stand in” for a supplementary code point." This definition doesn't say much; it says the code _points_ are "for _use_ by UTF-16", but then UTF-16 uses surrogate code units, not surrogate code points. C1 in TUS §3.2 says: "The high-surrogate and low-surrogate code _points_ _are designated for_ surrogate code _units_ in the UTF-16 character encoding form." But the actual definitions used for UTF-16 don't seem to conceptually _derive_ "surrogate code unit" from "surrogate code point". => ??

Still, I don't understand why people keep talking about code points. For me conceptually (albeit not historically) everything starts with scalar values (which are index values for certain abstract things). Scalar values are then encoded by encoding forms (and then serialized in encoding schemes). Why does everyone talk about the more generic "code point" instead of "scalar value", when non-scalar-value code points aren't used? (Because we're not using surrogate code point pairs, we're instead using surrogate code unit pairs.) Anyways, I understand that KenW and Mark Davis have pointed to earlier debates on this in an earlier thread.
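The pipeline I have in mind (scalar value, then encoding form, then encoding scheme) can be sketched roughly as follows. This is only an illustration in Python 3; the function names `is_scalar_value`, `encode_utf16_form`, and `serialize_utf16be` are hypothetical and not taken from the standard.

```python
def is_scalar_value(cp):
    """A scalar value is any code point except a surrogate code point."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

def encode_utf16_form(scalar):
    """Encoding form: map one scalar value to a list of 16-bit code units."""
    if not is_scalar_value(scalar):
        # Surrogate code points are never encoded themselves.
        raise ValueError("not a Unicode scalar value: U+%04X" % scalar)
    if scalar < 0x10000:
        return [scalar]
    v = scalar - 0x10000
    # High-surrogate code unit, then low-surrogate code unit.
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]

def serialize_utf16be(units):
    """Encoding scheme: serialize code units as bytes, big-endian."""
    return b"".join(u.to_bytes(2, "big") for u in units)

units = encode_utf16_form(0x1F600)  # a supplementary scalar value
data = serialize_utf16be(units)
```

Note that the surrogate range shows up twice: once as the hole in the input domain (`is_scalar_value`) and once as the values of the output code units. Nowhere is a surrogate code _point_ passed through the encoder.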

Stephan
