[twitter-dev] How to count to 140: or, characters vs. bytes vs. entities, third strike

leoboiko Fri, 15 May 2009 10:03:55 -0700

So.  Much twitter documentation talks about “140 characters” and “160
characters limit”.  But “character” is not a raw data type, it’s a
string type.  It has been observed[1][2][3][4] that 1) twitter expects
these characters to be encoded as UTF-8 (or ASCII, which is a strict
subset of UTF-8), and 2) the limit really is 140/160 *bytes*, not
characters (a UTF-8 character can take up to 4 bytes; two bytes per
character is common for European languages, and three to four bytes
for Asian scripts, Indic &c.).
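
A minimal sketch of the difference, in Python (chosen only for
illustration; the helper name utf8_len is made up and says nothing
about how Twitter itself counts):

```python
def utf8_len(text: str) -> int:
    """Return how many bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Each sample is a single character (code point), but the UTF-8 size differs.
# The last sample is U+1D11E, a character outside the Basic Multilingual Plane.
for ch in ["a", "ç", "日", "\U0001D11E"]:
    print(repr(ch), "characters:", len(ch), "UTF-8 bytes:", utf8_len(ch))
```

Note that len() here counts Unicode code points, which is only one of
several possible meanings of “character”.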

Later I intend to thoroughly replace “characters” with “bytes” (or
“UTF-8 byte count”, &c.) in the API wiki. Hope it’s ok with everyone.

* * *

Many twitter applications want to interactively count characters as
users type. Other than the byte/character confusion, there’s another
common source of errors: the fact that ‘<’ and ‘>’ are converted to,
and counted as, their respective HTML entities (&lt; and &gt;, four
bytes each)[1]. That in itself isn’t so bad, as long as it’s
deterministic and documented. It seems the conversion may take place
a few hours after the update is sent[4], which is unfortunate but
still acceptable. Much worse is the problem that, at least according
to the FAQ, other (unspecified) characters “may” be converted to (and
counted as) HTML entities[1]. That makes a twitter character-counting
function either a potential truncation trap (if it ignores HTML
entities), or exceptionally conservative (if it assumes ALL possible
characters will be HTML-entitied). Is this still the current
behavior? If so, I’m filing a bug =)
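
For the part that is documented, a counting function could look
roughly like the sketch below. It assumes that only ‘<’ and ‘>’ are
expanded; the ENTITIES table and the function name are hypothetical,
and any other characters the server “may” convert are not covered.

```python
# Assumption: only these two characters are turned into entities.
# The FAQ leaves open which other characters might be converted.
ENTITIES = {"<": "&lt;", ">": "&gt;"}


def entity_expanded_len(text: str) -> int:
    """UTF-8 byte count of `text` after replacing '<' and '>' with entities."""
    expanded = "".join(ENTITIES.get(ch, ch) for ch in text)
    return len(expanded.encode("utf-8"))


update = "I <3 you"
print(len(update.encode("utf-8")))   # raw UTF-8 byte count
print(entity_expanded_len(update))   # byte count with '<' counted as 4 bytes
```

If the set of converted characters really is open-ended, no function
of this shape can be exact, which is the problem described above.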

* * *

I’d like to fully understand the motivations for these limits
and counting algorithms. Alex Payne stated that as of now they’re
just using Ruby 1.8 String.count, which is equivalent to UTF-8 byte
count. However, AFAIK, the 140-bytes limit was originally intended to
support sending updates as SMS messages. Now, I have no SMS
experience at all, and it's true that SMS has a hard limit of 140
bytes, but AFAIK SMS text MUST be encoded in one of a few specific
encodings[5]:

[…]the default GSM 7-bit alphabet [i.e. character encoding], the 8-
bit data alphabet, and the 16-bit UTF-16/UCS-2 alphabet. Depending on
which alphabet the subscriber has configured in the handset, this
leads to the maximum individual Short Message sizes of 160 7-bit
characters, 140 8-bit characters, or 70 16-bit characters (including
spaces). Support of the GSM 7-bit alphabet is mandatory for GSM
handsets and network elements, but characters in languages such as
Arabic, Chinese, Korean, Japanese or Cyrillic alphabet languages (e.g.
Russian) must be encoded using the 16-bit UCS-2 character encoding.

Notice the absence of UTF-8.
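
The three sizes in the quote all follow from the same 140-byte
payload. A small sketch of the arithmetic (the names are made up for
the example):

```python
SMS_PAYLOAD_BITS = 140 * 8  # the 140-byte SMS payload, expressed in bits


def max_chars(bits_per_char: int) -> int:
    """Whole characters of the given width that fit in one SMS payload."""
    return SMS_PAYLOAD_BITS // bits_per_char


for name, bits in [("GSM 7-bit", 7), ("8-bit data", 8), ("UCS-2", 16)]:
    print(f"{name}: {max_chars(bits)} characters")
```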

That means Twitter’s “140 bytes” does not match SMS “140 bytes” at
all. A 140-byte UTF-8 Twitter update will take less than 140 bytes in
GSM 7-bit, so if you’re sending SMS as GSM you’re being too
pessimistic. And the same Twitter 140-byte string can take far more
bytes in UCS-2, so if you’re sending Unicode SMS you’re being too
optimistic. (Notice the GSM encoding supports very few characters;
it’s not possible to convert a Twitter update like “reação em japonês
é 反応” to GSM SMS, neither 7- nor 8-bit, so for those you’re stuck with
UCS-2 SMS.)
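
To put numbers on the mismatch, here is a rough sketch comparing byte
counts. It uses Python's "utf-16-be" codec as a stand-in for UCS-2
(the two agree for characters in the Basic Multilingual Plane) and
assumes 7-bit characters are packed tightly into bytes; the helper
names are invented for the example.

```python
def utf8_bytes(text: str) -> int:
    """Byte count as Twitter is described as counting it (UTF-8)."""
    return len(text.encode("utf-8"))


def ucs2_bytes(text: str) -> int:
    """Byte count as 16-bit units, without a byte-order mark."""
    return len(text.encode("utf-16-be"))


def gsm7_packed_bytes(septets: int) -> int:
    """Bytes needed to hold `septets` 7-bit characters, rounded up."""
    return (septets * 7 + 7) // 8


update = "reação em japonês é 反応"
print(utf8_bytes(update), ucs2_bytes(update))

# 140 bytes of plain ASCII need fewer bytes as packed GSM 7-bit.
print(gsm7_packed_bytes(140))

# 140 UTF-8 bytes, mostly ASCII plus one kanji, roughly double in UCS-2.
mixed = "a" * 137 + "日"
print(utf8_bytes(mixed), ucs2_bytes(mixed))
```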

Twitter doesn’t send SMS to my country so I have no way to test how
you deal with this. I suppose you take the most space-efficient
encoding that supports all characters in the message, and if it
results in more than 140 bytes, truncate the message. It would be
nice to have documentation on exactly what happens (if you already do,
hit me in the head for not finding it). In any case this seems to
complicate the “we sent your friends the short version” message.
Currently, that message means “your update, as UTF-8, is in the 141–
160 bytes range”, right? But that count means nothing to SMS — a
message with 160 UTF-8 bytes might wholly fit an SMS just fine (if the
characters are all included in GSM 7-bit), while one with 72 UTF-8
bytes might not (a single non-GSM character forces the whole message
into UCS-2, so one ‘ã’ followed by 70 ASCII letters takes 142 bytes).
I think instead the SMS-conversion function should propagate all the
way back its SMS-truncation exception, and the warning should be
phrased as “your update didn’t fit an SMS, so we sent the short
version”. Even better, only send the warning when at least one of
the subscribers is actually receiving it as SMS.
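
A sketch of what such an SMS-fit check could look like, under the
supposition above that the most compact usable encoding is chosen.
Everything here is hypothetical: the function names are invented, the
GSM character table is transcribed by hand from the GSM 03.38 default
alphabet and should be checked against the spec before any real use,
and nothing is known about what Twitter's gateway actually does.

```python
# GSM 03.38 default alphabet (hand-transcribed; verify before relying on it).
GSM_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Extension-table characters cost two septets each (escape + character).
GSM_EXTENDED = "^{}\\[~]|€\f"

SMS_MAX_BYTES = 140


def sms_payload_bytes(text: str) -> int:
    """Payload size under the most compact encoding that can hold `text`."""
    septets = 0
    for ch in text:
        if ch in GSM_BASIC:
            septets += 1
        elif ch in GSM_EXTENDED:
            septets += 2
        else:
            # One non-GSM character forces the whole message into 16-bit units.
            return len(text.encode("utf-16-be"))
    return (septets * 7 + 7) // 8  # packed 7-bit characters, rounded up


def fits_single_sms(text: str) -> bool:
    """True if `text` fits one SMS, so no “short version” warning is needed."""
    return sms_payload_bytes(text) <= SMS_MAX_BYTES
```

Under these assumptions, fits_single_sms("a" * 160) holds even though
the update is 160 UTF-8 bytes, while fits_single_sms("ã" + "a" * 70)
does not, even though that update is only 72 UTF-8 bytes.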

* * *

This discussion raises the question of why updates are limited
to 140/160 bytes in the first place. If the intention was to support
SMS, UTF-8 byte count doesn’t have any meaningful relationship with
it. It does work for the English world, since 160 ASCII characters
will convert to 140 GSM bytes (in 7-bit characters). But even in this
degenerate case, the 140 part feels weird (since it’s safe to just use
160, and they won’t be truncated). And as soon as you put a “ç” or a
“日” or a “—” in there, Twitter’s byte-count limit loses all meaning.

Twitter now has an existence that’s independent of SMS, but still
intertwined with it (at least for a few lucky countries). Until SMS
is finally phased out by mobile email and such, it will remain an
important part of the service. The 160-byte limit kind of does work
as a rule of thumb for SMS conversion, at least for Westerners using
the Latin alphabet with few non-ASCII chars. And, regardless of
motivation, by now the limit has taken meaning in the Twitter
culture. If it was just lifted, it might affect the service’s
identity — having such a strict message limit determines what kind of
usage and experiments people do with Twitter. But it is my humble
opinion that the limit could be safely extended to 160 UTF-8
*characters*, even though that could take up to 640 bytes. After all,
even a currently-supported 140-byte update might exceed an SMS, so why
not simplify things and just count characters anyway? The “soft” 140-
byte limit could be phased out entirely, and the “we sent your friends
the short version” warning would appear only when the update didn’t
fit an SMS, regardless of how many bytes or characters it originally
had. (The alternative is to require applications to send updates
encoded as SMSes (140 bytes in GSM or UCS-2), which I think everyone
agrees would suck.) And entity count IMHO could be completely
ignored: what’s the point of counting entity bytes in the message
limit, anyway? Shouldn’t those be strictly a property of the “view”
part of MVC? Entity size doesn’t mean anything at all to SMS and other
non-HTML Twitter clients.
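
As a sketch of the proposed rule (purely hypothetical; the constant
and function name are invented, and “character” is taken to mean a
Unicode code point), the check becomes independent of encoding and of
entities:

```python
PROPOSED_LIMIT = 160  # characters (code points), not bytes


def within_proposed_limit(text: str) -> bool:
    """True if `text` has at most PROPOSED_LIMIT code points."""
    return len(text) <= PROPOSED_LIMIT


# Worst case for storage: 160 characters that each need 4 UTF-8 bytes.
worst_case = "\U0001D11E" * PROPOSED_LIMIT
print(within_proposed_limit(worst_case), len(worst_case.encode("utf-8")))

# '<' counts as one character here; entity expansion is left to the view.
print(within_proposed_limit("<" * 160))
```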

* * *
References:

[1] http://apiwiki.twitter.com/Things-Every-Developer-Should-Know#7Encodingaffectsstatuscharactercount
[2] http://groups.google.com/group/twitter-development-talk/browse_thread/thread/f4b74d5ba883eb00
[3] http://groups.google.com/group/twitter-development-talk/browse_thread/thread/44be91d5ec5850fa
[4] http://groups.google.com/group/twitter-development-talk/browse_thread/thread/9d9d16d55e2e1e67
[5] http://en.wikipedia.org/wiki/SMS#Message_size

--
Leonardo Boiko
http://namakajiri.net
