Re: Origin of Ellipsis (was: RE: Empty set)

Stephan Stiller Sun, 15 Sep 2013 21:44:09 -0700

Doug wrote me:

> You're not confusing "code point" with "code unit," are you?

Thanks for the note.

I think what you are saying is that I thought (or meant to write) "by first representing the sequence of scalar values in an encoding form and then counting [code points typecast from] code _units_". I think you are right, but there are some points of confusion; see below. Somehow I thought of "surrogate pair" as "pair of (surrogate) code points" instead of "pair of (surrogate) code units". I guess that additional level of indirection would make my interpretation (b) unlikely ... I think my statement is still technically correct, because counting code points for UTF-16 and counting code units for UTF-16 lead to the same count.
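To make the two counts concrete, here is a minimal sketch in Python (the language, the helper name utf16_code_units, and the sample string are illustrative assumptions, not anything from the thread). It counts scalar values directly, then counts UTF-16 code units, then reinterprets each code unit value as a code point, which is the reading under which the code unit count and the code point count agree.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers.

    Hypothetical helper: it encodes to big-endian UTF-16 bytes and
    regroups them into 16-bit values.
    """
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# One BMP character plus one supplementary character (U+1F600).
sample = "a\U0001F600"

# len() on a Python str counts code points; this sample contains no lone
# surrogates, so every code point here is also a scalar value.
scalar_count = len(sample)

# UTF-16 needs one code unit for U+0061 and a surrogate pair for U+1F600.
units = utf16_code_units(sample)

# Reinterpreting each code unit value as a code point cannot change the
# count: it is one code point per code unit.
typecast_code_points = [hex(u) for u in units]

print(scalar_count)          # 2
print(len(units))            # 3
print(typecast_code_points)  # ['0x61', '0xd83d', '0xde00']
```

The scalar value count (2) differs from the code unit count (3), while the count of code points typecast from code units equals the code unit count by construction.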

What's confusing is a term like "high-surrogate code point" (see glossary). If surrogate code points are not encoded, then they practically don't exist in the ontology of Unicode terms, aside from being holes in the scalar value range, if thought of as a subrange of the integers.

In detail: The glossary defines "surrogate code point" as: "A Unicode code point in the range U+D800..U+DFFF. Reserved _for use_ by UTF-16, where _a pair of surrogate code units_ (a high surrogate followed by a low surrogate) “stand in” for a supplementary code point." This definition doesn't say much; it says the code _points_ are "for _use_ by UTF-16", but then UTF-16 uses surrogate code units, not surrogate code points. C1 in TUS §3.2 says: "The high-surrogate and low-surrogate code _points_ _are designated for_ surrogate code _units_ in the UTF-16 character encoding form." But the actual definitions used for UTF-16 don't seem to conceptually _derive_ "surrogate code unit" from "surrogate code point". => ??
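The distinction can be illustrated with a short sketch in Python (the function names are hypothetical, chosen only for this example). The first function tests membership in the reserved code point range U+D800..U+DFFF; the second computes the UTF-16 surrogate code units for a supplementary scalar value using the standard UTF-16 arithmetic. The resulting code unit values happen to fall in the same numeric range as the surrogate code points, which is the only link between the two notions that the code needs.

```python
def is_surrogate_code_point(cp):
    """True if cp lies in the reserved range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def to_surrogate_pair(scalar):
    """Return (high, low) UTF-16 code units for a supplementary scalar value."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = scalar - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00

# The code unit values are numerically inside the surrogate code point range.
print(is_surrogate_code_point(high), is_surrogate_code_point(low))  # True True
```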

Still, I don't understand why people keep talking about code points. For me conceptually (albeit not historically) everything starts with scalar values (which are index values for certain abstract things). Scalar values are then encoded by encoding forms (and then serialized in encoding schemes). Why does everyone talk about the more generic "code point" instead of "scalar value", when non-scalar-value code points aren't used? (Because we're not using surrogate code point pairs, we're instead using surrogate code unit pairs.) Anyways, I understand that KenW and Mark Davis have pointed to earlier debates on this in an earlier thread.
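The pipeline from scalar value to encoding form to encoding scheme can be sketched in Python as follows. This is only an illustration under assumptions: Python's codecs fold the encoding form and the byte serialization into a single step, and the variable names are made up for the example. The last part shows that CPython's UTF-8 codec refuses a lone surrogate code point, which fits the view that only scalar values are actually encoded.

```python
# A supplementary scalar value, chosen arbitrarily for illustration.
scalar = 0x1F600
ch = chr(scalar)

# Each codec applies an encoding form and serializes it in one step.
utf8 = ch.encode("utf-8")
utf16_be = ch.encode("utf-16-be")   # big-endian encoding scheme
utf16_le = ch.encode("utf-16-le")   # little-endian encoding scheme
utf32_be = ch.encode("utf-32-be")

print(utf8.hex())      # f09f9880
print(utf16_be.hex())  # d83dde00
print(utf16_le.hex())  # 3dd800de
print(utf32_be.hex())  # 0001f600

# A surrogate code point is not a scalar value, so the codec rejects it.
try:
    chr(0xD800).encode("utf-8")
    lone_surrogate_encodable = True
except UnicodeEncodeError:
    lone_surrogate_encodable = False

print(lone_surrogate_encodable)  # False
```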


Stephan
