> Stephan Stiller wrote:
>> From the link it isn't entirely clear whether they
>> (a) count scalar values of NFC or
>> (b) count code points of NFC.
>
> Are they not the same thing, except for surrogates?
Conceptually no, but numerically yes – you are right in that regard, and
I wasn't precise in my description of (b). I suppose if you read their
description literally (they say they use UTF-8 internally), it follows
that they're forbidding surrogates, because these are invalid in UTF-8.
(Is this what they're doing? I suppose the answer wouldn't matter to
someone who only produces Tweets properly composed of a sequence of
scalar values.)
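
To make the surrogate point concrete, here's a tiny Python sketch (my
own illustration, nothing to do with Twitter's actual code): a lone
surrogate is a code point but not a scalar value, and a strict UTF-8
encoder must reject it.

    # A lone surrogate (here U+D800) is a code point but not a scalar
    # value; Python's strict UTF-8 codec refuses to encode it.
    try:
        "\ud800".encode("utf-8")
    except UnicodeEncodeError as err:
        print(err)  # 'utf-8' codec can't encode character '\ud800' ...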
Then, when they write that "Twitter also counts the number of codepoints
in the text rather than UTF-8 bytes", it makes me wonder whether they're
handling the data in UTF-16 in the relevant procedure that checks for
length. The elementary unit of abstract "text" is, for me, the scalar
value. When they write "code point", they've implicitly typecast from
"scalar value" to "code point", and the question is how the typecasting
was performed: by directly interpreting the scalar values as numbers of
type "code point", or by first representing the sequence of scalar
values in an encoding form and then counting its code units as if they
were code points? My assumption would naturally be the former, which
would also be consistent with vulgar :-) (popular) use of these terms –
but I had to read Twitter's description a couple of times to make sense
of it.
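
For concreteness, here is how the two readings come apart, again as a
Python sketch of my own (the function name is mine; I'm only assuming
the NFC-then-count behavior they describe):

    import unicodedata

    def tweet_lengths(s):
        """Count a string three ways after NFC normalization."""
        nfc = unicodedata.normalize("NFC", s)
        return {
            # len() counts code points; for surrogate-free text these
            # are exactly the scalar values
            "scalar values":     len(nfc),
            "UTF-16 code units": len(nfc.encode("utf-16-le")) // 2,
            "UTF-8 bytes":       len(nfc.encode("utf-8")),
        }

    # "e" + COMBINING ACUTE ACCENT composes to one scalar value under NFC:
    print(tweet_lengths("e\u0301"))
    # -> {'scalar values': 1, 'UTF-16 code units': 1, 'UTF-8 bytes': 2}

    # An astral character is one scalar value but two UTF-16 code units,
    # so the two readings of "count the code points" diverge here:
    print(tweet_lengths("\U0001F600"))
    # -> {'scalar values': 1, 'UTF-16 code units': 2, 'UTF-8 bytes': 4}

Under the first reading the emoji would cost one unit of Tweet length;
under the UTF-16 reading it would cost two.
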
Stephan