[twitter-dev] Re: How to count to 140: or, characters vs. bytes vs. entities, third strike

leoboiko Fri, 15 May 2009 12:02:06 -0700

> On May 15, 2:03 pm, leoboiko <leobo...@gmail.com> wrote:
> while one with 71 UTF-8
> bytes might not (if they’re all non-GSM, say, ‘ç’ repeated 71 times).


Sorry, that was a bad example: 71 ‘ç’s take up 142 bytes in UTF-8, not
71.
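
The corrected figure is easy to check. A minimal sketch in Python, assuming the precomposed ‘ç’ (U+00E7) rather than ‘c’ plus a combining cedilla; the variable names are only illustrative:

```python
# Assumption: the precomposed character U+00E7, which UTF-8 encodes as 2 bytes.
cedillas = "\u00e7" * 71

char_count = len(cedillas)                  # counts characters (code points)
byte_count = len(cedillas.encode("utf-8"))  # counts UTF-8 bytes

print(char_count)  # 71
print(byte_count)  # 142
```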

Consider instead 71 ‘^’ (or ‘\’, ‘[’, etc.).  Each of these takes one
byte in UTF-8, but its shortest encoding in SMS (in GSM) is two bytes.
So the 71-byte UTF-8 string would take more than 140 bytes as an SMS
and would not fit in one.

Why does that matter?  Consider a Twitter update like this:

    @d00d: in the console, type "cat ~/file.sql | tr [:upper:] [:lower:] | less".  then you cand read the sql commands without the annoying caps

That looks like a perfectly reasonable 140-character UTF-8 string, so
Twitter won't truncate it or warn about sending a short version.  But
its SMS encoding would take some 147 bytes, so the last words would be
truncated.
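
The two counts can be reproduced with a rough sketch like the one below. It assumes the escaped characters are those of the GSM 03.38 basic extension table, each costing two GSM code units (7-bit septets, the units counted as bytes above), and that everything else in the text is in the basic GSM alphabet; the names `GSM_EXTENSION`, `utf8_len` and `gsm_cost` are made up for illustration and say nothing about how Twitter counts.

```python
# Assumption: characters from the GSM 03.38 basic extension table. Each is
# sent as an escape code followed by the character code, so it costs 2 units.
GSM_EXTENSION = set("^{}\\[]~|\u20ac\x0c")


def utf8_len(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


def gsm_cost(text: str) -> int:
    """GSM code units needed for text.

    Assumes every character is representable in GSM; characters outside
    the alphabet are not detected and are simply counted as 1 unit.
    """
    return sum(2 if ch in GSM_EXTENSION else 1 for ch in text)


carets = "^" * 71
tweet = ('@d00d: in the console, type "cat ~/file.sql | tr [:upper:] '
         '[:lower:] | less".  then you cand read the sql commands '
         'without the annoying caps')

print(utf8_len(carets), gsm_cost(carets))           # 71 142
print(len(tweet), utf8_len(tweet), gsm_cost(tweet))  # 140 140 147
```

The seven extra units in the example update come from one ‘~’, two ‘|’, two ‘[’ and two ‘]’.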

--
Leonardo Boiko
http://namakajiri.net
