> Stephan Stiller wrote:
>> From the link it isn't entirely clear whether they
>> (a) count scalar values of NFC or
>> (b) count code points of NFC.
> Are they not the same thing, except for surrogates?
Conceptually no, but numerically yes – you are right in that regard, and I wasn't precise in my description of (b). If you read their description literally (they say they use UTF-8 internally), it follows that they're forbidding surrogates, because surrogate code points are invalid in UTF-8. (Is this what they're actually doing? I guess the answer wouldn't matter for someone who only produces Tweets properly composed as a sequence of scalar values.)
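
To make that concrete, here is a minimal sketch in Python (my own illustration, not anything Twitter has published; the sample string is made up): after NFC normalization, Python's len() counts scalar values, and the UTF-8 codec rejects a lone surrogate outright.

import unicodedata

# Hypothetical tweet text: "cafe" + combining acute accent + one astral emoji.
text = unicodedata.normalize("NFC", "cafe\u0301\U0001F600")

print(len(text))                  # 5 scalar values after NFC: c a f é + emoji
print(len(text.encode("utf-8")))  # 9 UTF-8 bytes: 1+1+1+2+4

# A lone surrogate is a code point but not a scalar value, and UTF-8
# cannot represent it -- encoding raises UnicodeEncodeError.
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError as e:
    print("surrogate rejected:", e)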

Then, when they write that "Twitter also counts the number of codepoints in the text rather than UTF-8 bytes", it makes me wonder whether they're handling the data in UTF-16 in the relevant length-checking procedure. For me, the elementary unit of abstract "text" is the scalar value. When they write "code point", they've implicitly typecast from "scalar value" to "code point", and the question is how that typecast was performed: by directly interpreting the scalar values as numbers of type "code point", or by first representing the sequence of scalar values in an encoding form and then counting code points there? My assumption would naturally be the former, which would also be consistent with vulgar :-) (popular) use of these terms – but I had to read Twitter's description a couple of times to make sense of it.
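
For what it's worth, the latter reading is easy to picture with UTF-16 data. Continuing the Python sketch above (again my own illustration, not anything from Twitter's description; the helper function is hypothetical): an astral character occupies two UTF-16 code units, so a length check that walks UTF-16 and wants code points has to merge each surrogate pair into one unit of count.

text = "caf\u00e9\U0001F600"  # NFC form of the example above

scalar_count = len(text)                           # 5
utf16_units  = len(text.encode("utf-16-le")) // 2  # 6: the emoji is a surrogate pair

# Hypothetical length checker over UTF-16: merge each high/low surrogate
# pair into one code point; otherwise astral characters would cost two.
def code_points_in_utf16(units: bytes) -> int:
    count = 0
    i = 0
    while i < len(units):
        unit = int.from_bytes(units[i:i+2], "little")
        # A high surrogate (0xD800-0xDBFF) starts a pair; skip its partner.
        i += 4 if 0xD800 <= unit <= 0xDBFF else 2
        count += 1
    return count

assert code_points_in_utf16(text.encode("utf-16-le")) == scalar_count
print(scalar_count, utf16_units)  # 5 6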

Stephan

