On 11/08/2013 10:54, Joshua Landau wrote:
On 11 August 2013 07:24, Chris Angelico <ros...@gmail.com> wrote:
On Sun, Aug 11, 2013 at 7:17 AM, Joshua Landau <jos...@landau.ws> wrote:
Given tweet = b"caf\x65\xCC\x81".decode():

    >>> tweet
    'café'

But:

    >>> len(tweet)
    5

You're now looking at the difference between glyphs and codepoints: the
accent here is typed as a separate combining character. Twitter counts
combining characters, so when you build one "thing" out of several
separately-typed parts, it does count as more characters.
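
For example, you can see the extra codepoint directly (a quick check with
the standard unicodedata module, assuming the tweet value from above):

    >>> import unicodedata
    >>> [unicodedata.name(c) for c in tweet]
    ['LATIN SMALL LETTER C', 'LATIN SMALL LETTER A', 'LATIN SMALL LETTER F',
     'LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']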

From https://dev.twitter.com/docs/counting-characters#Definition_of_a_Character:
The "café" issue mentioned above raises the question of how you count
the characters in the Tweet string "café". To the human eye the length is
clearly four characters. Depending on how the data is represented this
could be either five or six UTF-8 bytes. Twitter does not want to penalize
a user for the fact we use UTF-8 or for the fact that the API client in
question used the longer representation. Therefore, Twitter does count
"café" as four characters no matter which representation is sent.

Which would imply that Twitter doesn't count combining characters
separately, even though the web interface seems to.
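
One plausible reading of that passage is that the count is done on an
NFC-normalised string, so the combining sequence collapses back to a single
codepoint; a minimal sketch of that interpretation (whether this is what
Twitter actually does server-side is an assumption on my part):

    >>> import unicodedata
    >>> len(unicodedata.normalize('NFC', tweet))
    4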

Read this article for some arguments on the subject, including a
number of references to Twitter itself:

http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

I read that *last* time you pointed it out :P. It's a good link, though.

--
Anyhow, it's good to know I haven't been obviously stupid with my
understanding of Unicode. I learnt it all from this list anyway;
wouldn't want to disappoint!

If Twitter counts characters, not codepoints, you could then ask
whether it passes the codepoints through as given. If it does, you could
experiment to see how much data you could send encoded as a sequence of
combining codepoints. (You might want to check the Terms of Use first,
though! :-))
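
Purely as a hypothetical sketch of such an experiment: pack each byte of a
payload into two combining marks (one per nibble) from the Combining
Diacritical Marks block and stack them all on a single base character. The
function names and the nibble scheme below are made up for illustration,
and whether Twitter would accept or preserve such a string is exactly the
open question:

    # Hypothetical encoding: one byte -> two combining marks (U+0300..U+030F)
    BASE = 0x0300

    def smuggle(data: bytes, carrier: str = "a") -> str:
        marks = []
        for byte in data:
            marks.append(chr(BASE + (byte >> 4)))   # high nibble
            marks.append(chr(BASE + (byte & 0xF)))  # low nibble
        return carrier + "".join(marks)

    def recover(text: str) -> bytes:
        nibbles = [ord(c) - BASE for c in text[1:]]          # skip the carrier
        return bytes((hi << 4) | lo
                     for hi, lo in zip(nibbles[0::2], nibbles[1::2]))

    payload = smuggle(b"hello")
    assert recover(payload) == b"hello"
    print(len(payload))   # 11 codepoints, but it renders as one (messy) glyph

Whether Twitter would count that payload as one character or eleven is
precisely what the experiment would probe.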
