[twitter-dev] Re: How to count to 140: or, characters vs. bytes vs. entities, third strike

2009-06-02 Thread leoboiko

On Jun 1, 3:16 pm, Doug Williams d...@twitter.com wrote:
 Leo, This has been covered many times before. It's 140 UTF-8 characters.
 Please search the archives of this group for the complete conversation,

If you cared to read my messages, you’d have seen that they reference
half a dozen conversations in this list, with employee participation
as you put it, plus wiki pages to boot.  Further, they address several
different issues other than “it’s 140”:

 - Encoding/byte/character confusion _in the API wiki_, which is not
publicly editable.
 - The pointlessness of counting < and > as entities for the message limit,
instead of it being a view-specific property.
 - The dangerousness of converting other, unspecified characters to
HTML entities.
 - Lack of formal definition of what transformations are used for SMS
sending (SMS cannot use UTF-8; further, all possible combinations of
{140, 160} UTF-8 {bytes, characters} might or might not fit an SMS; 
other issues I listed above).

Abraham Williams wrote:
 if i worked at Twitter and was dealing with issues like unplanned down time
 [1]. Clarifying if users can post 140 bytes or 141 bytes is the last thing I
 would be dealing with.

Yeah, because everybody speaks English and m17n doesn’t matter.  If
you can’t see why unpredictable, silent message truncation is a
serious problem, we have different scales of seriousness.

 Or run your own laconi.ca server where you can make your own esoteric rules.

Yep, I’m moving to identi.ca — despite being based on actual free
software by volunteers, they at least take the time to discuss
contributions, instead of dismissing them without even reading as Doug
just did.


[twitter-dev] Re: How to count to 140: or, characters vs. bytes vs. entities, third strike

2009-05-20 Thread leoboiko

On May 15, 2:03 pm, leoboiko leobo...@gmail.com wrote:
 Later I intend to thoroughly replace “characters” by “bytes” (or
 “UTF-8 byte count”, &c.) in the API wiki.

My bad; it seems the API wiki is not publicly-writable? Could someone
with write access do this?

--
Leonardo Boiko
http://namakajiri.net


[twitter-dev] How to count to 140: or, characters vs. bytes vs. entities, third strike

2009-05-15 Thread leoboiko

So.  Much twitter documentation talks about “140 characters” and “160
characters limit”.  But “character” is not a raw data type, it’s a
string type.  It has been observed[1][2][3][4] that 1) twitter expects
these characters to be encoded as UTF-8 (or ASCII, which is a strict
subset of UTF-8), and 2) the limit really is 140/160 *bytes*, not
characters (UTF-8 characters can use up to 4 bytes each; two bytes per
character is common for European languages and three to four bytes
are common for Asian scripts, Indic &c.).
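
To make the byte/character distinction concrete, here is a small Python sketch. It assumes Python 3, where `len()` on a `str` counts code points; the helper name and the sample strings are illustrations only and do not come from any Twitter documentation.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


# 'abc', c-cedilla, the kanji for "day", and a musical G clef
# (the last one lies outside the Basic Multilingual Plane).
samples = ["abc", "\u00e7", "\u65e5", "\U0001d11e"]

for s in samples:
    print(repr(s), "characters:", len(s), "UTF-8 bytes:", utf8_len(s))
```

A counter that uses `len(text)` and a limit that is enforced on `utf8_len(text)` only agree for pure ASCII input.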

Later I intend to thoroughly replace “characters” by “bytes” (or
“UTF-8 byte count”, &c.) in the API wiki.  Hope it’s ok with everyone.

* * *

Many twitter applications want to interactively count characters as
users type.  Other than the byte/character confusion, there’s another
common source of errors: the fact that ‘<’ and ‘>’ are converted to,
and counted as, their respective HTML entities (&lt; and &gt;, four
bytes each)[1].  That in itself isn’t so bad, as long as it’s
deterministic and documented.  It seems the conversion may take place
a few hours after the update is sent[4], which is unfortunate but
still acceptable.  Much worse is the problem that, at least according
to the FAQ, other (unspecified) characters “may” be converted to (and
counted as) HTML entities[1].  That makes a twitter character-counting
function either a potential truncation trap (if it ignores HTML
entities), or exceptionally conservative (if it assumes ALL possible
characters will be HTML-entitied).  Is this still the current
behavior? If so, I’m filing a bug =)
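
A sketch of what a client-side counter would have to do if only the two documented conversions apply. This is an assumption: the FAQ says other characters "may" be converted too, and those are not modeled here. The function name is hypothetical.

```python
def entity_expanded_len(text: str) -> int:
    """UTF-8 byte count after expanding '<' and '>' to HTML entities.

    Assumption: only these two characters are converted. Any other
    conversion the server might apply is NOT accounted for.
    """
    expanded = text.replace("<", "&lt;").replace(">", "&gt;")
    return len(expanded.encode("utf-8"))


print(entity_expanded_len("a < b"))  # 5 typed characters, counted as 8 bytes
```

If more characters turn out to be converted, the replacement table has to grow accordingly, which is exactly why the undocumented "may" is a problem.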

* * *

I’d like to fully understand the motivations for these limits
and counting algorithms.  Alex Payne stated that as of now they’re
just using Ruby 1.8 String.count, which is equivalent to UTF-8 byte
count.  However, AFAIK, the 140-bytes limit was originally intended to
support sending updates as SMS messages.  Now, I have no SMS
experience at all, and it's true that SMS has a hard limit of 140
bytes, but AFAIK SMS text MUST be encoded in one of a few specific
encodings:

[…]the default GSM 7-bit alphabet [i.e. character encoding], the 8-
bit data alphabet, and the 16-bit UTF-16/UCS-2 alphabet. Depending on
which alphabet the subscriber has configured in the handset, this
leads to the maximum individual Short Message sizes of 160 7-bit
characters, 140 8-bit characters, or 70 16-bit characters (including
spaces). Support of the GSM 7-bit alphabet is mandatory for GSM
handsets and network elements, but characters in languages such as
Arabic, Chinese, Korean, Japanese or Cyrillic alphabet languages (e.g.
Russian) must be encoded using the 16-bit UCS-2 character encoding.

Notice the absence of UTF-8.

That means Twitter’s “140 bytes” does not match SMS “140 bytes” at
all.  A 140-byte UTF-8 Twitter update will take less than 140 bytes in
GSM 7-bit, so if you’re sending SMS as GSM you’re being too
pessimistic.  And the same Twitter 140-byte string can take far more
bytes in UCS-2, so if you’re sending Unicode SMS you’re being too
optimistic.  (Notice the GSM encoding supports very few characters;
it’s not possible to convert a Twitter update like “reação em japonês
é 反応” to GSM SMS, neither 7- nor 8-bit, so for those you’re stuck with
UCS-2 SMS).
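
A quick way to see the mismatch for that sample string, sketched in Python. The function name is made up; the text is normalized to NFC first so that accented letters are single code points, and UTF-16 big-endian without a BOM is used as a stand-in for UCS-2 (identical for characters in the Basic Multilingual Plane, which all of these are).

```python
import unicodedata


def sizes(text: str) -> tuple:
    """Return (characters, UTF-8 bytes, UCS-2 bytes) for the text."""
    text = unicodedata.normalize("NFC", text)
    chars = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    # UTF-16-BE without BOM; equals UCS-2 for BMP-only text.
    ucs2_bytes = len(text.encode("utf-16-be"))
    return chars, utf8_bytes, ucs2_bytes


print(sizes("reação em japonês é 反応"))
```

The UTF-8 count and the UCS-2 count diverge as soon as the text is mostly ASCII with a few non-GSM characters mixed in, since every character then costs two bytes in the SMS.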

Twitter doesn’t send SMS to my country so I have no way to test how
you deal with this.  I suppose you take the most space-efficient
encoding that supports all characters in the message, and if it
results in more than 140 bytes, truncate the message.  It would be
nice to have documentation on exactly what happens (if you already do,
hit me in the head for not finding it).  In any case this seems to
complicate the “we sent your friends the short version” message.
Currently, that message means “your update, as UTF-8, is in the 141–
160 bytes range”, right? But that count means nothing to SMS — a
message with 160 UTF-8 bytes might wholly fit an SMS just fine (if the
characters are all included in GSM 7-bit), while one with 71 UTF-8
bytes might not (if they’re all non-GSM, say, ‘ç’ repeated 71 times).
I think instead the SMS-conversion function should propagate all the
way back its SMS-truncation exception, and the warning should be
phrased as “your update didn’t fit an SMS, so we sent the short
version”.  Even better, only send the warning when at least one of
the subscribers is actually receiving it as SMS.

* * *

This discussion brings to mind the question of why updates are limited
to 140/160 bytes in the first place.  If the intention was to support
SMS, UTF-8 byte count doesn’t have any meaningful relationship with
it.  It does work for the English world, since 160 ASCII characters
will convert to 140 GSM bytes (in 7-bit characters).  But even in this
degenerate case, the 140 part feels weird (since it’s safe to just use
160, and they won’t be truncated).  And as soon as you put a “ç” or a
“日” or a “—” in there, Twitter’s byte-count limit loses all meaning.
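
The 160-to-140 figure is plain bit-packing arithmetic: 7-bit units packed back to back into 8-bit bytes. A minimal sketch (the function name is an illustration, not any library's API):

```python
def packed_bytes(units: int, bits_per_unit: int = 7) -> int:
    """Bytes needed to hold `units` code units packed with no padding
    between them (rounded up to a whole byte)."""
    return (units * bits_per_unit + 7) // 8


print(packed_bytes(160))      # 160 seven-bit characters
print(packed_bytes(70, 16))   # 70 sixteen-bit characters
```

Both calls land on the same 140-byte payload, which matches the sizes given in the quoted SMS description.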

Twitter now has an existence that’s 

[twitter-dev] Re: How to count to 140: or, characters vs. bytes vs. entities, third strike

2009-05-15 Thread leoboiko

 On May 15, 2:03 pm, leoboiko leobo...@gmail.com wrote:
 while one with 71 UTF-8
 bytes might not (if they’re all non-GSM, say, ‘ç’ repeated 71 times).

Sorry, that was a bad example: 71 ‘ç’s take up 142 bytes in UTF-8, not
71.

Consider instead 71 ‘^’ (or ‘\’, ‘[’ &c.).  These take one byte in
UTF-8, but their shortest encoding in SMS is two-byte (in GSM).  So
the 71-byte UTF-8 string would take more than 140 bytes as SMS and not
fit an SMS.
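
A rough sketch of that count in Python. It models only the GSM extension-table characters, each of which is sent as an escape code plus a second code (two 7-bit units), and it simply assumes every other character is in the basic table, which is not true in general. The names are hypothetical.

```python
# Characters from the GSM 7-bit extension table (form feed omitted).
GSM_EXTENSION = set("^{}\\[~]|\u20ac")


def gsm7_units(text: str) -> int:
    """Count 7-bit code units: two per extension character, one otherwise.

    Assumption: every non-extension character exists in the basic
    GSM table. No check is made, so this is an estimate only.
    """
    return sum(2 if ch in GSM_EXTENSION else 1 for ch in text)


carets = "^" * 71
print(len(carets.encode("utf-8")), gsm7_units(carets))
```

The UTF-8 byte count and the GSM unit count differ by exactly the number of extension characters in the text.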

Why does that matter? Consider a twitter update like this:

@d00d: in the console, type cat ~/file.sql | tr [:upper:]
[:lower:] | less.  then you cand read the sql commands without the
annoying caps

That looks like a perfectly reasonable 140-character UTF-8 string, so
Twitter won't truncate it or warn about sending a short version.  But
its SMS encoding would take some 147 bytes, so the last words would be
truncated.

--
Leonardo Boiko
http://namakajiri.net