2009/9/9 Matt Sanford <m...@twitter.com>

>
> Hi There,
>
>    I'm sorry this never got updated. Some changes have been made and
> are waiting to go out now. When I switched from working on the
> Platform (formerly API) team to my focus on international I took over
> this issue.
>    Once this current fix is deployed (probably in a week or so since
> I'm traveling at the moment) the definition of a character will be
> consistent throughout our API. The new change will always compute
> length based on the Unicode NFC [1] version of the string. Using the
> NFC form makes the 140 character limit based on the length as
> displayed rather than some under-the-cover byte arithmetic.
>    I more than agree with the above statement that a character is a
> character and Twitter shouldn't care. Data should be data. The main
> issue with that is that some clients compose characters and some
> don't. My common example of this is é. Depending on your client
> Twitter could get:
>
> é - 1 byte
>   - URL Encoded UTF-8: %C3%A9
>   - http://www.fileformat.info/info/unicode/char/00e9/index.htm
>
>
Isn't that 2 bytes in UTF-8? %C3%A9 is two percent-encoded bytes.



> -- or --
>
> é - 2 bytes
>   - URL Encoded UTF-8: %65%CC%81
>   - http://www.fileformat.info/info/unicode/char/0065/index.htm
>     + plus: http://www.fileformat.info/info/unicode/char/0301/index.htm
>
>
And isn't this three bytes? %65%CC%81 is three percent-encoded bytes.
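
For what it's worth, here is a quick way to check both counts. This is only a sketch in Python (my choice of language, nothing from Twitter's side), and `describe` is a helper name I made up. It reports code points, UTF-8 bytes, and the percent-encoded form for each spelling of é. Note that `urllib.parse.quote` leaves the plain "e" unescaped, so the second form comes out as e%CC%81 rather than %65%CC%81; the byte count is the same.

```python
import urllib.parse

# Precomposed form: a single code point, U+00E9.
composed = "\u00e9"
# Decomposed form: "e" (U+0065) followed by a combining acute accent (U+0301).
decomposed = "e\u0301"


def describe(s):
    """Return (code points, UTF-8 bytes, percent-encoded UTF-8) for s.

    Hypothetical helper, for illustration only.
    """
    encoded = s.encode("utf-8")
    return len(s), len(encoded), urllib.parse.quote(encoded, safe="")


print(describe(composed))    # (1, 2, '%C3%A9')
print(describe(decomposed))  # (2, 3, 'e%CC%81')
```

So the two forms are 1 and 2 code points, but 2 and 3 bytes once encoded as UTF-8.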



>    So, my fix will make it so that no matter the client if the user
> sees é it counts as a single character. I'll announce something in the
> change log once my fix is deployed.
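
If I read this right, a client could approximate the new count by normalizing to NFC before measuring. A minimal sketch, assuming the server counts code points of the NFC form (that is my reading of the description above, not something confirmed); `nfc_length` is a hypothetical helper, not part of any Twitter library:

```python
import unicodedata


def nfc_length(text):
    """Count code points after NFC normalization.

    Assumption: this mirrors the "length based on the NFC version of the
    string" rule described above. It is not Twitter's actual code.
    """
    return len(unicodedata.normalize("NFC", text))


# Both spellings of é collapse to one code point under NFC.
print(nfc_length("\u00e9"))   # 1
print(nfc_length("e\u0301"))  # 1
```

One caveat: NFC only helps where a precomposed code point exists. A base letter plus a combining mark with no precomposed form would still count as more than one.
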
>
> Thanks;
>  — Matt Sanford / @mzsanford
>
> [1] - http://www.unicode.org/reports/tr15/
>
> On Sep 9, 6:05 am, TjL <luo...@gmail.com> wrote:
> > It's been nearly 6 months. Has this question been answered? If so I
> > missed it.
> >
> >
> >
> > On Tue, Mar 24, 2009 at 9:36 PM, Alex Payne<a...@twitter.com> wrote:
> >
> > > Unfortunately, nothing definitive. We're still looking into this.
> >
> > > On Tue, Mar 24, 2009 at 07:56, Craig Hockenberry
> > > <craig.hockenbe...@gmail.com> wrote:
> >
> > >> Any news from the Service Team? I'd really like to get the counters
> > >> right in an upcoming release...
> >
> > >> -ch
> >
> > >> On Mar 6, 12:18 pm, Alex Payne <a...@twitter.com> wrote:
> > >>> I'm taking this email to our Service Team, the folks who work on the
> > >>> back-end of the service. The whole "message body changing as it moves
> > >>> from cache to backing store" thing is totally unacceptable. Answers
> > >>> soon.
> >
> > >>> On Fri, Mar 6, 2009 at 09:43, Craig Hockenberry
> > >>> <craig.hockenbe...@gmail.com> wrote:
> >
> > >>> > Some discussion about this thread popped up on Twitter yesterday:
> >
> > >>> > <http://groups.google.com/group/twitter-development-talk/browse_thread/thread/44be91d5ec5850fa>
> >
> > >>> > Alex states that it's 140 bytes per tweet. So, of course, Loren
> > >>> > Brichter and I tried to prove that. With the following results:
> >
> > >>> > 1) 140 characters, including ones that are converted to HTML entities:
> > >>> > <http://twitter.com/gnitset/status/1286202252>
> >
> > >>> > At the time of posting, this tweet showed up on the site and in
> > >>> > feeds with all 140 characters. After a few hours, the "<" was
> > >>> > converted to "&lt;", increasing the count per character from one
> > >>> > to four bytes and decreasing the tweet length from 140 characters
> > >>> > to 69. (You can see this truncation at the end of the tweet: the
> > >>> > "&" is from "&lt;")
> >
> > >>> > Presumably, this happens as tweets in the memcache are written
> > >>> > through to the backing store.
> >
> > >>> > I also see a lot of Twitter clients that don't realize how special
> > >>> > the &lt; and &gt; entities are. It took me a LONG time to figure
> > >>> > out what was going on here.
> >
> > >>> > 2) 140 Unicode _multi-byte_ characters:
> > >>> > <http://twitter.com/atebits/status/1286199010>
> >
> > >>> > What's curious is that Loren's example with 140 characters uses the
> > >>> > Unicode 27A1 glyph. It uses 3 bytes in UTF-8. Why didn't it get
> > >>> > truncated? This seems to contradict Alex's statement in the thread
> > >>> > mentioned above.
> >
> > >>> > As people start to use things like Emoji, tinyarro.ws and
> > >>> > generally figure out that Unicode (UTF-8) is a valid type of data
> > >>> > on Twitter, our clients should adapt and display more accurate
> > >>> > "characters remaining" counts. I can count bytes instead of
> > >>> > characters, but I'm not sure if I should or not.
> >
> > >>> > No one likes a truncated tweet: we need an explicit statement on
> > >>> > how to count and submit multi-byte characters and entities.
> >
> > >>> > -ch
> >
> > >>> --
> > >>> Alex Payne - API Lead, Twitter, Inc.
> > >>> http://twitter.com/al3x
> >
> > > --
> > > Alex Payne - API Lead, Twitter, Inc.
> > > > http://twitter.com/al3x
>
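
On Craig's first example near the bottom of the thread: the growth from "<" to "&lt;" is easy to reproduce client-side. A rough sketch, assuming the stored form escapes &, < and > the way Python's `html.escape` does (an assumption on my part; the thread does not say which characters the back end escapes), with `escaped_length` as a made-up helper name:

```python
import html


def escaped_length(text):
    """Length of text once &, < and > are replaced by HTML entities.

    Hypothetical helper; whether the server counts this escaped form
    is exactly the open question in the thread.
    """
    return len(html.escape(text, quote=False))


print(escaped_length("a<b"))      # 6, since "<" becomes "&lt;"
print(escaped_length("<" * 140))  # 560
```

A client that wanted to be conservative could compare this figure against the limit until there is an official answer.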
