So, much of the Twitter documentation talks about a “140 characters” or “160 characters” limit. But a “character” is not a raw data type, it’s a string type: how many bytes it occupies depends on the encoding. It has been observed[1][2][3][4] that 1) Twitter expects these characters to be encoded as UTF-8 (or ASCII, which is a strict subset of UTF-8), and 2) the limit really is 140/160 *bytes*, not characters (a UTF-8 character can use up to 4 bytes; two bytes per character is common for accented letters in European languages, and three to four bytes for Asian scripts such as CJK, Indic, &c.).
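To make the difference concrete, here is a minimal Python sketch comparing character count with UTF-8 byte count. The helper names `utf8_byte_count` and `fits_limit` are made up for illustration, and the 140-byte default simply mirrors the limit discussed above; this is not Twitter's actual code.

```python
def utf8_byte_count(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_limit(text: str, limit: int = 140) -> bool:
    """True if the UTF-8 encoding of text is at most `limit` bytes (hypothetical check)."""
    return utf8_byte_count(text) <= limit


# Compare code-point count (what len() gives in Python 3) with UTF-8 byte count.
for sample in ["hello", "reação", "反応"]:
    print(sample, "chars:", len(sample), "bytes:", utf8_byte_count(sample))
```

Note that `len()` counts Unicode code points, which is only an approximation of what a user perceives as a character (combining marks count separately).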
Later I intend to thoroughly replace “characters” with “bytes” (or “UTF-8 byte count”, &c.) in the API wiki. I hope that’s OK with everyone.

* * *

Many Twitter applications want to count characters interactively as users type. Besides the byte/character confusion, there’s another common source of errors: ‘<’ and ‘>’ are converted to, and counted as, their respective HTML entities (&lt; and &gt;, four bytes each)[1]. That in itself isn’t so bad, as long as it’s deterministic and documented. It seems the conversion may take place a few hours after the update is sent[4], which is unfortunate but still acceptable. Much worse is that, at least according to the FAQ, other (unspecified) characters “may” be converted to (and counted as) HTML entities[1]. That makes a Twitter character-counting function either a potential truncation trap (if it ignores HTML entities) or exceptionally conservative (if it assumes ALL possible characters will be HTML-entitied). Is this still the current behavior? If so, I’m filing a bug =)

* * *

I’d like to fully understand the motivations for these limits and counting algorithms. Alex Payne stated that as of now they’re just using Ruby 1.8 String.count, which is equivalent to UTF-8 byte count. However, AFAIK, the 140-byte limit was originally intended to support sending updates as SMS messages. Now, I have no SMS experience at all, and it’s true that SMS has a hard limit of 140 bytes, but AFAIK SMS text MUST be encoded in one of a few specific encodings:

[…]the default GSM 7-bit alphabet [i.e. character encoding], the 8-bit data alphabet, and the 16-bit UTF-16/UCS-2 alphabet. Depending on which alphabet the subscriber has configured in the handset, this leads to the maximum individual Short Message sizes of 160 7-bit characters, 140 8-bit characters, or 70 16-bit characters (including spaces).
Support of the GSM 7-bit alphabet is mandatory for GSM handsets and network elements, but characters in languages such as Arabic, Chinese, Korean, Japanese or Cyrillic alphabet languages (e.g. Russian) must be encoded using the 16-bit UCS-2 character encoding.[5]

Notice the absence of UTF-8. That means Twitter’s “140 bytes” does not match SMS’s “140 bytes” at all. A 140-byte UTF-8 Twitter update whose characters all exist in GSM 7-bit will typically take fewer than 140 bytes there, so if you’re sending SMS as GSM you’re being too pessimistic. And the same 140-byte Twitter string can take far more bytes in UCS-2, so if you’re sending Unicode SMS you’re being too optimistic. (Notice that the GSM encoding supports very few characters; it’s not possible to convert a Twitter update like “reação em japonês é 反応” to GSM SMS, either 7-bit or 8-bit, so for those you’re stuck with UCS-2 SMS.)

Twitter doesn’t send SMS to my country, so I have no way to test how you deal with this. I suppose you take the most space-efficient encoding that supports all characters in the message, and if it results in more than 140 bytes, truncate the message. It would be nice to have documentation on exactly what happens (if you already do, hit me in the head for not finding it).

In any case this seems to complicate the “we sent your friends the short version” message. Currently, that message means “your update, as UTF-8, is in the 141–160 bytes range”, right? But that count means nothing to SMS: a message with 160 UTF-8 bytes might fit wholly in an SMS just fine (if the characters are all included in GSM 7-bit), while one with just 71 characters might not (if they’re all non-GSM, say, ‘ç’ repeated 71 times, which is 142 bytes in both UTF-8 and UCS-2). I think instead the SMS-conversion function should propagate its SMS-truncation exception all the way back, and the warning should be phrased as “your update didn’t fit an SMS, so we sent the short version”. Even better, only send the warning when at least one of the subscribers is actually receiving it as SMS.
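As a rough illustration of the “most space-efficient encoding” guess above, here is a Python sketch that estimates how many payload bytes a text needs as a single SMS. Everything in it is an assumption on my part rather than documented Twitter behavior: the names `sms_size` and `fits_single_sms` are hypothetical, the character table is my transcription of the GSM 03.38 default alphabet and its extension table (no national shift tables), and the UCS-2 size is approximated by counting UTF-16 code units.

```python
# GSM 03.38 default alphabet (escape character omitted) and its extension table.
GSM_BASIC = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
GSM_EXTENDED = set("\f^{}\\[~]|€")  # each costs two septets (escape + code)

SMS_PAYLOAD_BYTES = 140  # hard limit of a single SMS payload


def sms_size(text):
    """Return (encoding_name, payload_bytes) for the most compact encoding that fits all characters."""
    if all(ch in GSM_BASIC or ch in GSM_EXTENDED for ch in text):
        septets = sum(2 if ch in GSM_EXTENDED else 1 for ch in text)
        # Septets are packed 7 bits each; round up to whole bytes.
        return "GSM 7-bit", (septets * 7 + 7) // 8
    # Fallback: two bytes per UTF-16 code unit (an approximation of UCS-2).
    return "UCS-2", len(text.encode("utf-16-be"))


def fits_single_sms(text):
    """True if the text fits in one SMS under the assumptions above."""
    return sms_size(text)[1] <= SMS_PAYLOAD_BYTES


for text in ("a" * 160, "ç" * 71, "reação em japonês é 反応"):
    print(sms_size(text), fits_single_sms(text))
```

Under these assumptions, 160 ASCII letters pack into exactly 140 bytes of GSM 7-bit, while 71 copies of ‘ç’ fall back to UCS-2 and need 142 bytes, so a UTF-8 byte count alone cannot predict either outcome.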
* * *

This discussion brings to mind the question of why updates are limited to 140/160 bytes in the first place. If the intention was to support SMS, UTF-8 byte count doesn’t have any meaningful relationship with it. It does work for the English-speaking world, since 160 ASCII characters will convert to 140 GSM bytes (as 7-bit characters). But even in this degenerate case, the 140 part feels weird (since it’s safe to just use 160, and those updates won’t be truncated). And as soon as you put a “ç” or a “日” or a “—” in there, Twitter’s byte-count limit loses all meaning.

Twitter now has an existence that’s independent of SMS, but still intertwined with it (at least for a few lucky countries). Until SMS is finally phased out by mobile email and such, it will remain an important part of the service. The 160-byte limit kind of does work as a rule of thumb for SMS conversion, at least for Westerners using the Latin alphabet with few non-ASCII chars. And, regardless of motivation, by now the limit has taken on meaning in Twitter culture. If it were just lifted, it might affect the service’s identity: having such a strict message limit determines what kind of usage and experiments people do with Twitter.

But it is my humble opinion that the limit could be safely extended to 160 UTF-8 *characters*, even though that could take up to 640 bytes. After all, even a currently-supported 140-byte update might exceed an SMS, so why not simplify things and just count characters anyway? The “soft” 140-byte limit could be phased out entirely, and the “we sent your friends the short version” warning would appear only when the update didn’t fit an SMS, regardless of how many bytes or characters it originally had. (The alternative is to require applications to send updates encoded as SMSes (140 bytes in GSM or UCS-2), which I think everyone agrees would suck.)

And entity count IMHO could be completely ignored. What’s the point of counting entity bytes in the message limit, anyway?
Shouldn’t those be strictly a property of the “view” part of MVC? Entity size doesn’t mean anything at all to SMS and other non-HTML Twitter clients.

* * *

References:

[1] http://apiwiki.twitter.com/Things-Every-Developer-Should-Know#7Encodingaffectsstatuscharactercount
[2] http://groups.google.com/group/twitter-development-talk/browse_thread/thread/f4b74d5ba883eb00
[3] http://groups.google.com/group/twitter-development-talk/browse_thread/thread/44be91d5ec5850fa
[4] http://groups.google.com/group/twitter-development-talk/browse_thread/thread/9d9d16d55e2e1e67
[5] http://en.wikipedia.org/wiki/SMS#Message_size

--
Leonardo Boiko
http://namakajiri.net