> Stephan Stiller wrote:
>> From the link it isn't entirely clear whether they
>> (a) count scalar values of NFC or
>> (b) count code points of NFC.
> Are they not the same thing, except for surrogates?
Conceptually no, but numerically yes – you are right in that regard, and I wasn't precise in my description of (b). If you read their description literally (they say they use UTF-8 internally), it follows that they're forbidding surrogates, because surrogate code points are invalid in UTF-8. (Is this what they're actually doing? I guess the answer wouldn't matter for someone who only produces Tweets properly composed as a sequence of scalar values.)
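
To make that concrete, here is a minimal sketch in Python (my own illustration, not anything Twitter has published; the sample string is made up): after NFC normalization, Python's len() counts scalar values, and the UTF-8 codec rejects a lone surrogate outright.

import unicodedata

# Hypothetical tweet text: "cafe" + combining acute accent + one astral emoji.
text = unicodedata.normalize("NFC", "cafe\u0301\U0001F600")

print(len(text))                  # 5 scalar values after NFC: c a f é + emoji
print(len(text.encode("utf-8")))  # 9 UTF-8 bytes: 1+1+1+2+4

# A lone surrogate is a code point but not a scalar value, and UTF-8
# cannot represent it -- encoding raises UnicodeEncodeError.
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError as e:
    print("surrogate rejected:", e)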

Then, when they write that "Twitter also counts the number of codepoints in the text rather than UTF-8 bytes", it makes me wonder whether they're handling the data in UTF-16 in the relevant length-checking procedure. For me, the elementary unit of abstract "text" is the scalar value. When they write "code point", they've implicitly typecast from "scalar value" to "code point", and the question is how that typecast was performed: by directly interpreting the scalar values as numbers of type "code point", or by first representing the sequence of scalar values in an encoding form and then counting code points there? My assumption would naturally be the former, which would also be consistent with vulgar :-) (popular) use of these terms – but I had to read Twitter's description a couple of times to make sense of it.
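
For what it's worth, the latter reading is easy to picture with UTF-16 data. Continuing the Python sketch above (again my own illustration, not anything from Twitter's description; the helper function is hypothetical): an astral character occupies two UTF-16 code units, so a length check that walks UTF-16 and wants code points has to merge each surrogate pair into one unit of count.

text = "caf\u00e9\U0001F600"  # NFC form of the example above

scalar_count = len(text)                           # 5
utf16_units  = len(text.encode("utf-16-le")) // 2  # 6: the emoji is a surrogate pair

# Hypothetical length checker over UTF-16: merge each high/low surrogate
# pair into one code point; otherwise astral characters would cost two.
def code_points_in_utf16(units: bytes) -> int:
    count = 0
    i = 0
    while i < len(units):
        unit = int.from_bytes(units[i:i+2], "little")
        # A high surrogate (0xD800-0xDBFF) starts a pair; skip its partner.
        i += 4 if 0xD800 <= unit <= 0xDBFF else 2
        count += 1
    return count

assert code_points_in_utf16(text.encode("utf-16-le")) == scalar_count
print(scalar_count, utf16_units)  # 5 6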

Stephan

