I deployed a batch of explicit length checks this week to try to stop that madness. I didn't do the same for status text because it has a separate validation routine altogether. The Service Team should be able to help out with making that more sane.

— Matt

On Mar 6, 2009, at 11:38 AM, Craig Hockenberry wrote:


This truncation as data moves throughout your system occurs in other
places. I've seen the same behavior when setting a user's location and
bio, for example.

-ch

On Mar 6, 11:18 am, Alex Payne <a...@twitter.com> wrote:
I'm taking this email to our Service Team, the folks who work on the
back-end of the service. The whole "message body changing as it moves
from cache to backing store" thing is totally unacceptable. Answers
soon.

On Fri, Mar 6, 2009 at 09:43, Craig Hockenberry <craig.hockenbe...@gmail.com> wrote:

Some discussion about this thread popped up on Twitter yesterday:

<http://groups.google.com/group/twitter-development-talk/browse_thread/thread/44be91d5ec5850fa>

Alex states that it's 140 bytes per tweet. So, of course, Loren
Brichter and I tried to prove that. With the following results:

1) 140 characters, including some that are escaped as HTML entities:
<http://twitter.com/gnitset/status/1286202252>

At the time of posting, this tweet showed up on the site and in feeds
with all 140 characters. After a few hours, the "<" was converted to
"&lt;", increasing the count per character from one to four bytes and
decreasing the tweet length from 140 characters to 69. (You can see
this truncation at the end of the tweet: the "&" is from "&lt;")

Presumably, this happens as tweets in the memcache are written through
to the backing store.
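
To make that failure mode concrete, here is a minimal Python sketch. It assumes a hypothetical write path that escapes "<", ">" and "&" and then truncates to the limit. The `store` function and `LIMIT` are illustrative names only, not Twitter's actual code, and the escape-then-truncate order is a guess based on the behavior described above.

```python
import html

LIMIT = 140  # assumed storage limit, in characters of the escaped text


def store(text: str, limit: int = LIMIT) -> str:
    """Hypothetical write path: escape entities first, then truncate."""
    escaped = html.escape(text, quote=False)  # "<" becomes "&lt;"
    return escaped[:limit]


# 140 visible characters, the last of which is "<".
tweet = "x" * 139 + "<"
stored = store(tweet)

# The escaped form is 143 characters long, so truncation cuts inside
# the entity and leaves a dangling "&" at the end.
print(len(tweet), len(stored), repr(stored[-3:]))
```

Under these assumptions the stored text ends with a bare "&", which matches the tail of the tweet linked above.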

I also see a lot of Twitter clients that don't realize how special the
&lt; and &gt; entities are. It took me a LONG time to figure out what
was going on here.

2) 140 Unicode _multi-byte_ characters:
<http://twitter.com/atebits/status/1286199010>

What's curious is that Loren's example with 140 characters uses the
Unicode 27A1 glyph. It uses 3 bytes in UTF-8. Why didn't it get
truncated? This seems to contradict Alex's statement in the thread
mentioned above.
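
For reference, the byte count is easy to check in Python. This is only a sketch of the arithmetic, not a claim about how Twitter measures length. The variable names are mine.

```python
arrow = "\u27a1"  # the U+27A1 glyph from Loren's example

encoded = arrow.encode("utf-8")
bytes_per_glyph = len(encoded)  # 3 bytes: E2 9E A1

tweet = arrow * 140
char_count = len(tweet)                  # Python counts code points
byte_count = len(tweet.encode("utf-8"))  # size of the UTF-8 encoding

print(bytes_per_glyph, char_count, byte_count)
```

So 140 of these glyphs come to 420 bytes in UTF-8, three times what a 140-byte limit would allow.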

As people start to use things like Emoji, tinyarro.ws and generally
figure out that Unicode (UTF-8) is a valid type of data on Twitter,
our clients should adapt and display more accurate "characters
remaining" counts. I can count bytes instead of characters, but I'm
not sure if I should or not.
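
Here is a rough sketch of what a client-side "characters remaining" counter looks like under each possible rule. The `remaining` function and the three rule names are hypothetical. Which rule the service actually applies is exactly the open question in this thread.

```python
import html

LIMIT = 140  # the limit under discussion; its unit is the open question


def remaining(text: str, limit: int = LIMIT) -> dict:
    """Return the remaining count under three hypothetical counting rules."""
    return {
        # One unit per Unicode code point.
        "code_points": limit - len(text),
        # One unit per byte of the UTF-8 encoding.
        "utf8_bytes": limit - len(text.encode("utf-8")),
        # One unit per character after "<", ">" and "&" are escaped.
        "escaped_chars": limit - len(html.escape(text, quote=False)),
    }


counts = remaining("a < b \u27a1 c")
print(counts)
```

For this nine-character sample the three rules give 131, 129 and 128 remaining, so a client that picks the wrong rule will be off by a few units on even a short tweet.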

No one likes a truncated tweet: we need an explicit statement on how
to count and submit multi-byte characters and entities.

-ch

--
Alex Payne - API Lead, Twitter, Inc.
http://twitter.com/al3x
