Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
On Tue, 13 Aug 2013 15:34:45 +, Prasad, Ramit wrote: > Michael Torrie wrote: [...] >> However I know of no phone or network that won't let you use longer >> messages; multiple SMS packets are used and most phone paste them back >> together. So no there's nothing that anyone needs to change to use >> longer messages if they so chose. It's now just an arbitrary limit, >> part of the twitter culture. > > > True, but order of delivery is not guaranteed. I still sometimes get out > of order text message when multiple messages are sent at once. SMS delivery is not guaranteed *at all*. It's a best-effort delivery service, which means the telco can drop any SMSes it feels like, for any reason it likes, without notice or notification. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
RE: Could you verify this, Oh Great Unicode Experts of the Python-List?
Michael Torrie wrote:
> On 08/11/2013 11:54 PM, Gregory Ewing wrote:
> > Michael Torrie wrote:
> >> I've always wondered if the 160 character limit or whatever it is is a
> >> hard limit in their system, or if it's just a variable they could tweak
> >> if they felt like it.
> >
> > Isn't it for compatibility with SMS? Twitter could
> > probably change it, but persuading all the cell phone
> > networks to change at the same time might be rather
> > difficult.
>
> Yes I think you're correct about it being limited for SMS.
>
> However I know of no phone or network that won't let you use longer
> messages; multiple SMS packets are used and most phone paste them back
> together. So no there's nothing that anyone needs to change to use
> longer messages if they so chose. It's now just an arbitrary limit,
> part of the twitter culture.

True, but order of delivery is not guaranteed. I still sometimes get out of order text message when multiple messages are sent at once.

~Ramit
Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
On 11 August 2013 12:14, Steven D'Aprano wrote:
> On Sun, 11 Aug 2013 10:44:40 +0100, Joshua Landau wrote:
>>>> "café" will be in your Copy-Paste buffer, and you can paste it in to
>>>> the tweet-box. It takes 5 characters. So much for testing ;).
>>>
>>> How do you know that it takes 5 characters? Is that some Javascript
>>> widget? I'd blame buggy Javascript before Twitter.
>>
>> I go to twitter.com, log in and press that odd blue compose button in
>> the top-right. After pasting it says I have 135 (down from 140)
>> characters left.
>
> I'm pretty sure that will be a piece of Javascript running in your
> browser that reports the number of characters in the text box. So, I
> would expect that either:
>
> - Javascript doesn't provide a way to normalize text;
>
> - Twitter's Javascript developer(s) don't know how to normalize text, or
> can't be bothered to follow company policy (shame on them);
>
> - the Javascript just asks the browser, and the browser doesn't know how
> to count characters the Twitter way;
>
> etc. But of course posting to Twitter via your browser isn't the only way
> to post. Twitter provide an API to twit, and *that* is the ultimate test
> of whether Twitter's dev guide is lying or not.

Well, I've done some further testing and it seems you're right. It's just the javascript that's wrong. I guess they did it for better load-times.
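The gap between the two counters discussed above can be sketched in Python. This is only a stand-in for illustration: it is not Twitter's JavaScript or server code, the function names are made up, and the 140 limit is simply the figure quoted in this thread.

```python
import unicodedata

LIMIT = 140  # assumption: the limit quoted in this thread


def remaining_naive(text):
    """Hypothetical counter that counts raw code points, as the web box appears to."""
    return LIMIT - len(text)


def remaining_nfc(text):
    """Hypothetical counter that composes to NFC before counting."""
    return LIMIT - len(unicodedata.normalize("NFC", text))


pasted = "cafe\u0301"  # "cafe" followed by COMBINING ACUTE ACCENT

print(remaining_naive(pasted))  # 135, matching what the compose box reported
print(remaining_nfc(pasted))    # 136, what the dev guide's rule would give
```

A client that counts the naive way only ever under-reports the space left, which would explain why the mismatch is harmless for browser users.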
Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
On Tue, Aug 13, 2013 at 4:32 AM, MRAB wrote: > On 13/08/2013 04:20, Jason Friedman wrote: > > I've always wondered if the 160 character limit or whatever it is is a > hard limit in their system, or if it's just a variable they could tweak > if they felt like it. >> >> >> I thought it was 140 characters? >> https://twitter.com/about >> > He did say "or whatever". :-) I don't personally use the service, so I just followed the figure that people were bandying about in this thread. 140 it is, then. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
On 13/08/2013 04:20, Jason Friedman wrote:
>> I've always wondered if the 160 character limit or whatever it is is a
>> hard limit in their system, or if it's just a variable they could tweak
>> if they felt like it.
>
> I thought it was 140 characters?
> https://twitter.com/about

He did say "or whatever". :-)
Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
>>> I've always wondered if the 160 character limit or whatever it is is a >>> hard limit in their system, or if it's just a variable they could tweak >>> if they felt like it. I thought it was 140 characters? https://twitter.com/about -- http://mail.python.org/mailman/listinfo/python-list
Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
On Tue, Aug 13, 2013 at 2:48 AM, Michael Torrie wrote: > On 08/11/2013 11:54 PM, Gregory Ewing wrote: >> Michael Torrie wrote: >>> I've always wondered if the 160 character limit or whatever it is is a >>> hard limit in their system, or if it's just a variable they could tweak >>> if they felt like it. >> >> Isn't it for compatibility with SMS? Twitter could >> probably change it, but persuading all the cell phone >> networks to change at the same time might be rather >> difficult. > > Yes I think you're correct about it being limited for SMS. > > However I know of no phone or network that won't let you use longer > messages; multiple SMS packets are used and most phone paste them back > together. So no there's nothing that anyone needs to change to use > longer messages if they so chose. It's now just an arbitrary limit, > part of the twitter culture. It's unlikely to be changed; the limit demands brevity. 160 may be arbitrary now, but without strong argument for another cutoff, there's no reason to alter it. And that's my response, in 160 characters. :) ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
On 08/11/2013 11:54 PM, Gregory Ewing wrote: > Michael Torrie wrote: >> I've always wondered if the 160 character limit or whatever it is is a >> hard limit in their system, or if it's just a variable they could tweak >> if they felt like it. > > Isn't it for compatibility with SMS? Twitter could > probably change it, but persuading all the cell phone > networks to change at the same time might be rather > difficult. Yes I think you're correct about it being limited for SMS. However I know of no phone or network that won't let you use longer messages; multiple SMS packets are used and most phone paste them back together. So no there's nothing that anyone needs to change to use longer messages if they so chose. It's now just an arbitrary limit, part of the twitter culture. -- http://mail.python.org/mailman/listinfo/python-list
Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
Michael Torrie wrote:
> I've always wondered if the 160 character limit or whatever it is is a
> hard limit in their system, or if it's just a variable they could tweak
> if they felt like it.

Isn't it for compatibility with SMS? Twitter could probably change it, but persuading all the cell phone networks to change at the same time might be rather difficult.

--
Greg
Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
On 08/11/2013 09:34 AM, MRAB wrote: > If twitter counts characters, not codepoints, you could then ask > whether it passes the codepoints through as given. If it does, then you > experiment to see how much data you could send encoded as a sequence of > combining codepoints. (You might want to check the Term of Use first, > though! :-)) I've always wondered if the 160 character limit or whatever it is is a hard limit in their system, or if it's just a variable they could tweak if they felt like it. -- http://mail.python.org/mailman/listinfo/python-list
Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
On 11/08/2013 10:54, Joshua Landau wrote: On 11 August 2013 07:24, Chris Angelico wrote: On Sun, Aug 11, 2013 at 7:17 AM, Joshua Landau wrote: Given tweet = b"caf\x65\xCC\x81".decode(): >>> tweet 'café' But: >>> len(tweet) 5 You're now looking at the difference between glyphs and combining characters. Twitter counts combining characters, so when you build one "thing" out of lots of separately-typed parts, it does count as more characters. @https://dev.twitter.com/docs/counting-characters#Definition_of_a_Character The "café" issue mentioned above raises the question of how you count the characters in the Tweet string "café". To the human eye the length is clearly four characters. Depending on how the data is represented this could be either five or six UTF-8 bytes. Twitter does not want to penalize a user for the fact we use UTF-8 or for the fact that the API client in question used the longer representation. Therefore, Twitter does count "café" as four characters no matter which representation is sent. Which would imply that twitter doesn't count combining characters, even though the web interface seems to. Read this article for some arguments on the subject, including a number of references to Twitter itself: http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/ I read that *last* time you pointed it out :P. It's a good link, though. -- Anyhow, it's good to know I haven't been obviously stupid with my understanding of Unicode. I learnt it all from this list anyway; wouldn't want to disappoint! If twitter counts characters, not codepoints, you could then ask whether it passes the codepoints through as given. If it does, then you experiment to see how much data you could send encoded as a sequence of combining codepoints. (You might want to check the Term of Use first, though! :-)) -- http://mail.python.org/mailman/listinfo/python-list
Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
On 11 August 2013 13:51, wrote: > Le dimanche 11 août 2013 11:09:44 UTC+2, Steven D'Aprano a écrit : >> On Sun, 11 Aug 2013 07:17:42 +0100, Joshua Landau wrote: >> >> The reason some accented letters have single code point forms is to >> support legacy charsets; ... > > No. > > jmf > > PS Unicode normalization is failing expectedly very well > with the FSR. No. Joshua Landau PS Proper arguments are falling expectedly very well with the internet -- http://mail.python.org/mailman/listinfo/python-list
Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
Le dimanche 11 août 2013 11:09:44 UTC+2, Steven D'Aprano a écrit : > On Sun, 11 Aug 2013 07:17:42 +0100, Joshua Landau wrote: > > > > > The reason some accented letters have single code point forms is to > > support legacy charsets; ... No. jmf PS Unicode normalization is failing expectedly very well with the FSR. -- http://mail.python.org/mailman/listinfo/python-list
Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
On 11 August 2013 12:14, Steven D'Aprano wrote: > On Sun, 11 Aug 2013 10:44:40 +0100, Joshua Landau wrote: > >> On 11 August 2013 10:09, Steven D'Aprano >> wrote: >>> The reason some accented letters have single code point forms is to >>> support legacy charsets; the reason some only exist as combining >>> characters is due to the combinational explosion. Some languages allow >>> you to add up to five or six different accent on any of dozens of >>> different letters. If each combination needed its own unique code >>> point, there wouldn't be enough code points. For bonus points, if there >>> are five accents that can be placed in any combination of zero or more >>> on any of four characters, how many code points would be needed? >> >> 52? > > More than double that. > > Consider a single character. It can have 0 to 5 accents, in any > combination. Order doesn't matter, and there are no duplicates, so there > are: > > 0 accent: take 0 from 5 = 1 combination; > 1 accent: take 1 from 5 = 5 combinations; > 2 accents: take 2 from 5 = 5!/(2!*3!) = 10 combinations; > 3 accents: take 3 from 5 = 5!/(3!*2!) = 10 combinations; > 4 accents: take 4 from 5 = 5 combinations; > 5 accents: take 5 from 5 = 1 combination > > giving a total of 32 combinations for a single character. Since there are > four characters in this hypothetical language that take accents, that > gives a total of 4*32 = 128 distinct code points needed. I didn't see "four characters", and I did (1 + 5 + 10) * 2 and came up with 52... Maybe I should get more sleep. -- http://mail.python.org/mailman/listinfo/python-list
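For anyone who wants to check the quoted arithmetic rather than trust it, here is a small sketch. The constant names are invented for illustration; `math.comb` needs Python 3.8 or later.

```python
from math import comb  # available from Python 3.8

ACCENTS = 5  # accents in the hypothetical language
LETTERS = 4  # letters that can carry them

# "take k from 5" for k = 0..5, in the same order as the quoted table
per_letter = [comb(ACCENTS, k) for k in range(ACCENTS + 1)]

print(per_letter)                 # [1, 5, 10, 10, 5, 1]
print(sum(per_letter))            # 32 combinations for one letter
print(LETTERS * sum(per_letter))  # 128 code points for four letters
```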
Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
On Sun, Aug 11, 2013 at 12:14 PM, Steven D'Aprano wrote:
> Consider a single character. It can have 0 to 5 accents, in any
> combination. Order doesn't matter, and there are no duplicates, so there
> are:
>
> 0 accent: take 0 from 5 = 1 combination;
> 1 accent: take 1 from 5 = 5 combinations;
> 2 accents: take 2 from 5 = 5!/(2!*3!) = 10 combinations;
> 3 accents: take 3 from 5 = 5!/(3!*2!) = 10 combinations;
> 4 accents: take 4 from 5 = 5 combinations;
> 5 accents: take 5 from 5 = 1 combination
>
> giving a total of 32 combinations for a single character. Since there are
> four characters in this hypothetical language that take accents, that
> gives a total of 4*32 = 128 distinct code points needed.

There's an easy way to calculate it. Instead of the "take N from 5" notation, simply look at it as a set of independent bits - each of your accents may be either present or absent. So it's 1<<5 combinations for a single character, which is the same 32 figure you came up with, but easier to work with in the ridiculous case.

> In reality, Unicode has currently code points U+0300 to U+036F (112 code
> points) to combining characters. It's not really meaningful to combine
> all 112 of them, or even most of 112 of them...

If you *were* to use literally ANY combination, that would be 1<<112 which is... uhh... five billion yottacombinations. Don't bother working that one out by the "take N" method, it'll take you too long :)

Oh, and that's 1<<112 possible combining character combinations, so you then need to multiply that by the number of base characters you could use.

ChrisA
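The "independent bits" shortcut above is easy to confirm in Python, whose integers do not overflow. A minimal sketch, with variable names chosen only for this example:

```python
from math import comb  # available from Python 3.8

# Summing "take k from 5" over every k equals one bit per accent.
take_n_total = sum(comb(5, k) for k in range(6))
bits_total = 1 << 5
print(take_n_total, bits_total)  # 32 32

# The ridiculous case: any subset of the 112 combining marks.
all_marks = 1 << 112
print(all_marks)
print(f"{all_marks:.3e}")  # about 5.192e+33

# One "yotta" is 10**24, so this prints the count in billions of yotta.
print(all_marks // 10**24 // 10**9)  # 5
```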
Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
On Sun, 11 Aug 2013 10:44:40 +0100, Joshua Landau wrote:

> On 11 August 2013 10:09, Steven D'Aprano wrote:
>> The reason some accented letters have single code point forms is to
>> support legacy charsets; the reason some only exist as combining
>> characters is due to the combinational explosion. Some languages allow
>> you to add up to five or six different accent on any of dozens of
>> different letters. If each combination needed its own unique code
>> point, there wouldn't be enough code points. For bonus points, if there
>> are five accents that can be placed in any combination of zero or more
>> on any of four characters, how many code points would be needed?
>
> 52?

More than double that.

Consider a single character. It can have 0 to 5 accents, in any combination. Order doesn't matter, and there are no duplicates, so there are:

0 accent: take 0 from 5 = 1 combination;
1 accent: take 1 from 5 = 5 combinations;
2 accents: take 2 from 5 = 5!/(2!*3!) = 10 combinations;
3 accents: take 3 from 5 = 5!/(3!*2!) = 10 combinations;
4 accents: take 4 from 5 = 5 combinations;
5 accents: take 5 from 5 = 1 combination

giving a total of 32 combinations for a single character. Since there are four characters in this hypothetical language that take accents, that gives a total of 4*32 = 128 distinct code points needed.

In reality, Unicode currently assigns code points U+0300 to U+036F (112 code points) to combining characters. It's not really meaningful to combine all 112 of them, or even most of 112 of them, but let's assume that we can legitimately combine up to three of them on average (some languages will allow more, some less) on just six different letters. That gives us:

0 accent: 1 combination
1 accent: 112 combinations
2 accents: 112!/(2!*110!) = 6216 combinations
3 accents: 112!/(3!*109!) = 227920 combinations

giving 234249 combinations, by six base characters, = 1405494 code points. Which is comfortably more than the 1114112 code points Unicode has in total :-)

This calculation is horribly inaccurate, since you can't arbitrarily combine (say) accents from Greek with accents from IPA, but I reckon that the combinational explosion of accented letters is still real.

[...]
>> Of course, they might be lying when they say "Twitter counts the length
>> of a Tweet using the Normalization Form C (NFC) version of the text", I
>> have no idea. But they seem to have a good grasp of the issues involved,
>> and assuming they do what they say, at least Western European users
>> should be happy.
>
> They *don't* seem to be doing what they say.

[...]
>>> "café" will be in your Copy-Paste buffer, and you can paste it in to
>>> the tweet-box. It takes 5 characters. So much for testing ;).
>>
>> How do you know that it takes 5 characters? Is that some Javascript
>> widget? I'd blame buggy Javascript before Twitter.
>
> I go to twitter.com, log in and press that odd blue compose button in
> the top-right. After pasting it says I have 135 (down from 140)
> characters left.

I'm pretty sure that will be a piece of Javascript running in your browser that reports the number of characters in the text box. So, I would expect that either:

- Javascript doesn't provide a way to normalize text;

- Twitter's Javascript developer(s) don't know how to normalize text, or can't be bothered to follow company policy (shame on them);

- the Javascript just asks the browser, and the browser doesn't know how to count characters the Twitter way;

etc. But of course posting to Twitter via your browser isn't the only way to post. Twitter provide an API to twit, and *that* is the ultimate test of whether Twitter's dev guide is lying or not.

> My only question here is, since you can't post after 140 non-normalised
> characters, who cares if the server counts it as less?

People who bypass the browser and write their own Twitter client.

>> If this shows up in your application as café rather than café, it is a
>> bug in the text rendering engine. Some applications do not deal with
>> combining characters correctly.
>
> Why the rendering engine?

If the text renderer assumes it can draw one code point at a time, it will draw the "e", then reach the combining accent. It could, in principle, backspace and draw it over the "e", but more likely it will just draw it next to it. What the renderer should do is walk the string, collecting characters until it reaches one which is not a combining character, then draw them all at once one on top of each other.

A good font may have special glyphs, or at least hints, for combining accents. For instance, if you have a dot accent and a comma accent drawn one on top of the other, it looks like a comma; what you are supposed to do is move them side by side, so you have separate dot and comma glyphs.

>> (It's a hard problem to solve, and really needs support from the font.
>> In some languages, the same accent will appear in different places
>> depending on the character t
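The "walk the string, collecting characters" idea described in the message above can be sketched as follows. This is a deliberate simplification under stated assumptions: the generator name `clusters` is made up, it relies only on `unicodedata.combining()` (the canonical combining class), and it ignores the fuller grapheme cluster rules of Unicode Standard Annex #29, so it is a rough illustration rather than what a real renderer does.

```python
import unicodedata


def clusters(text):
    """Yield each base character together with the combining marks after it.

    Simplified: a code point is treated as combining when its canonical
    combining class is non-zero.
    """
    current = ""
    for ch in text:
        # A non-combining character starts a new cluster.
        if current and unicodedata.combining(ch) == 0:
            yield current
            current = ""
        current += ch
    if current:
        yield current


drawn = list(clusters("cafe\u0301"))
print(len(drawn))   # 4 units to draw, although the string has 5 code points
print(drawn[-1] == "e\u0301")  # True: the accent stays attached to its "e"
```

A renderer working on these units draws the "e" and its accent together instead of one code point at a time.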
Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
On 11 August 2013 07:24, Chris Angelico wrote: > On Sun, Aug 11, 2013 at 7:17 AM, Joshua Landau wrote: >> Given tweet = b"caf\x65\xCC\x81".decode(): >> >> >>> tweet >> 'café' >> >> But: >> >> >>> len(tweet) >> 5 > > You're now looking at the difference between glyphs and combining > characters. Twitter counts combining characters, so when you build one > "thing" out of lots of separately-typed parts, it does count as more > characters. @https://dev.twitter.com/docs/counting-characters#Definition_of_a_Character > The "café" issue mentioned above raises the question of how you count > the characters in the Tweet string "café". To the human eye the length is > clearly four characters. Depending on how the data is represented this > could be either five or six UTF-8 bytes. Twitter does not want to penalize > a user for the fact we use UTF-8 or for the fact that the API client in > question used the longer representation. Therefore, Twitter does count > "café" as four characters no matter which representation is sent. Which would imply that twitter doesn't count combining characters, even though the web interface seems to. > Read this article for some arguments on the subject, including a > number of references to Twitter itself: > > http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/ I read that *last* time you pointed it out :P. It's a good link, though. -- Anyhow, it's good to know I haven't been obviously stupid with my understanding of Unicode. I learnt it all from this list anyway; wouldn't want to disappoint! -- http://mail.python.org/mailman/listinfo/python-list
Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
On 11 August 2013 10:09, Steven D'Aprano wrote: > The reason some accented letters have single code point forms is to > support legacy charsets; the reason some only exist as combining > characters is due to the combinational explosion. Some languages allow > you to add up to five or six different accent on any of dozens of > different letters. If each combination needed its own unique code point, > there wouldn't be enough code points. For bonus points, if there are five > accents that can be placed in any combination of zero or more on any of > four characters, how many code points would be needed? 52? > Note that the form you used, b"caf\x65\xCC\x81", is the same as the first > except that you have shown "e" in hex for some reason: > > py> b'\x65' == b'e' > True Yeah.. I did that because the linked post did it. I'm not sure why either ;). > On Sun, 11 Aug 2013 07:17:42 +0100, Joshua Landau wrote: >> >> So the solution is: >> >> >>> import unicodedata >> >>> len(unicodedata.normalize("NFC", tweet)) >> 4 > > In this particular case, this will reduce the tweet to the normalised > form that Twitter uses. > > [...] >> After further testing (I don't actually use Twitter) it seems the whole >> thing was just smoke and mirrors. The linked article is a lie, at least >> on the user's end. > > Which linked article? The one on dev.twitter.com seems to be okay to me. That's the one. > Of course, they might be lying when they say "Twitter counts the length > of a Tweet using the Normalization Form C (NFC) version of the text", I > have no idea. But the seem to have a good grasp of the issues involved, > and assuming they do what they say, at least Western European users > should be happy. They *don't* seem to be doing what they say. 
>> On Linux you can prove this by running: >> >> >>> p = subprocess.Popen(['xsel', '-bi'], stdin=subprocess.PIPE) >> >>> p.communicate(input=b"caf\x65\xCC\x81") >> (None, None) >> >> "café" will be in your Copy-Paste buffer, and you can paste it in to >> the tweet-box. It takes 5 characters. So much for testing ;). > > How do you know that it takes 5 characters? Is that some Javascript > widget? I'd blame buggy Javascript before Twitter. I go to twitter.com, log in and press that odd blue compose button in the top-right. After pasting at says I have 135 (down from 140) characters left. My only question here is, since you can't post after 140 non-normalised characters, who cares if the server counts it as less? > If this shows up in your application as café rather than café, it is a > bug in the text rendering engine. Some applications do not deal with > combining characters correctly. Why the rendering engine? > (It's a hard problem to solve, and really needs support from the font. In > some languages, the same accent will appear in different places depending > on the character they are attached to, or the other accents there as > well. Or so I've been lead to believe.) > > >> ¹ https://dev.twitter.com/docs/counting- >> characters#Definition_of_a_Character > > Looks reasonable to me. No obvious errors to my eyes. *Not sure whether talking about the link or my post* -- http://mail.python.org/mailman/listinfo/python-list
Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
On Sun, 11 Aug 2013 07:17:42 +0100, Joshua Landau wrote:

> Basically, I think Twitter's broken.

Oh, in about a million ways, but apparently people like it :-(

> For my full discusion on the matter, see:
> http://www.reddit.com/r/learnpython/comments/1k2yrn/help_with_len_and_input_function_33/cbku5e8
>
> Here's the first post of mine, ineffectually edited for this list:
>
> """
> The obvious solution [to getting the length of a tweet]
> is wrong. Like, slightly wrong¹.
>
> Given tweet = b"caf\x65\xCC\x81".decode():

I assume you're using Python 3, where UTF-8 is the default encoding.

> >>> tweet
> 'café'
>
> But:
>
> >>> len(tweet)
> 5

Yes, that's correct. Unicode doesn't promise to have a single unique representation for all human-readable strings. In this case, the string "cafe" with an accent on the "e" can be generated by two sequences of code points:

LATIN SMALL LETTER C
LATIN SMALL LETTER A
LATIN SMALL LETTER F
LATIN SMALL LETTER E
COMBINING ACUTE ACCENT

or

LATIN SMALL LETTER C
LATIN SMALL LETTER A
LATIN SMALL LETTER F
LATIN SMALL LETTER E WITH ACUTE

The reason some accented letters have single code point forms is to support legacy charsets; the reason some only exist as combining characters is due to the combinational explosion. Some languages allow you to add up to five or six different accents on any of dozens of different letters. If each combination needed its own unique code point, there wouldn't be enough code points. For bonus points, if there are five accents that can be placed in any combination of zero or more on any of four characters, how many code points would be needed?

Neither form is "right" or "wrong", they are both equally valid. They encode differently, of course, since UTF-8 does guarantee that every sequence of code points has a unique byte representation:

py> tweet.encode('utf-8')
'cafe\xcc\x81'
py> u'café'.encode('utf-8')
'caf\xc3\xa9'

Note that the form you used, b"caf\x65\xCC\x81", is the same as the first except that you have shown "e" in hex for some reason:

py> b'\x65' == b'e'
True

> So the solution is:
>
> >>> import unicodedata
> >>> len(unicodedata.normalize("NFC", tweet))
> 4

In this particular case, this will reduce the tweet to the normalised form that Twitter uses.

[...]
> After further testing (I don't actually use Twitter) it seems the whole
> thing was just smoke and mirrors. The linked article is a lie, at least
> on the user's end.

Which linked article? The one on dev.twitter.com seems to be okay to me.

Of course, they might be lying when they say "Twitter counts the length of a Tweet using the Normalization Form C (NFC) version of the text", I have no idea. But they seem to have a good grasp of the issues involved, and assuming they do what they say, at least Western European users should be happy.

> On Linux you can prove this by running:
>
> >>> p = subprocess.Popen(['xsel', '-bi'], stdin=subprocess.PIPE)
> >>> p.communicate(input=b"caf\x65\xCC\x81")
> (None, None)
>
> "café" will be in your Copy-Paste buffer, and you can paste it in to
> the tweet-box. It takes 5 characters. So much for testing ;).

How do you know that it takes 5 characters? Is that some Javascript widget? I'd blame buggy Javascript before Twitter.

If this shows up in your application as café rather than café, it is a bug in the text rendering engine. Some applications do not deal with combining characters correctly.

(It's a hard problem to solve, and really needs support from the font. In some languages, the same accent will appear in different places depending on the character they are attached to, or the other accents there as well. Or so I've been led to believe.)

> ¹ https://dev.twitter.com/docs/counting-characters#Definition_of_a_Character

Looks reasonable to me. No obvious errors to my eyes.

--
Steven
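The two code point sequences described in the message above can be inspected directly with the standard `unicodedata` module. A short sketch for Python 3, where the variable names are only illustrative; note that Python 3 shows encoded results as bytes literals with a `b` prefix.

```python
import unicodedata

decomposed = "cafe\u0301"  # ends in LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
composed = "caf\u00e9"     # ends in LATIN SMALL LETTER E WITH ACUTE

for label, text in (("decomposed", decomposed), ("composed", composed)):
    # Print the official name of every code point, then the UTF-8 bytes.
    print(label, [unicodedata.name(ch) for ch in text])
    print(label, text.encode("utf-8"))

# Different code points, yet each normalises to the other form.
print(decomposed == composed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```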
Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
On Sun, Aug 11, 2013 at 7:17 AM, Joshua Landau wrote: > Given tweet = b"caf\x65\xCC\x81".decode(): > > >>> tweet > 'café' > > But: > > >>> len(tweet) > 5 You're now looking at the difference between glyphs and combining characters. Twitter counts combining characters, so when you build one "thing" out of lots of separately-typed parts, it does count as more characters. Read this article for some arguments on the subject, including a number of references to Twitter itself: http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/ ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Could you verify this, Oh Great Unicode Experts of the Python-List?
Basically, I think Twitter's broken.

For my full discussion on the matter, see:
http://www.reddit.com/r/learnpython/comments/1k2yrn/help_with_len_and_input_function_33/cbku5e8

Here's the first post of mine, ineffectually edited for this list:

"""
The obvious solution [to getting the length of a tweet] is wrong. Like, slightly wrong¹.

Given tweet = b"caf\x65\xCC\x81".decode():

>>> tweet
'café'

But:

>>> len(tweet)
5

So the solution is:

>>> import unicodedata
>>> len(unicodedata.normalize("NFC", tweet))
4

Read twitter's commentary¹ for proof.

There are additional complications I'm trying to sort out.

After further testing (I don't actually use Twitter) it seems the whole thing was just smoke and mirrors. The linked article is a lie, at least on the user's end.

On Linux you can prove this by running:

>>> import subprocess
>>> p = subprocess.Popen(['xsel', '-bi'], stdin=subprocess.PIPE)
>>> p.communicate(input=b"caf\x65\xCC\x81")
(None, None)

"café" will be in your Copy-Paste buffer, and you can paste it in to the tweet-box. It takes 5 characters. So much for testing ;).

¹ https://dev.twitter.com/docs/counting-characters#Definition_of_a_Character
"""

I know this isn't *really* Python-related, but there's Python involved and you're the sort of people who'll be able to tell me what I've done wrong, if anything.
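The two counts in the post can be reproduced as a standalone Python 3 script. This is a minimal sketch: the helper name `nfc_length` is invented here, and it only mirrors the rule quoted from Twitter's developer page (count code points after NFC), not anything confirmed about Twitter's actual implementation.

```python
import unicodedata

# "cafe" followed by U+0301 COMBINING ACUTE ACCENT, as in the post.
tweet = b"caf\x65\xCC\x81".decode("utf-8")


def nfc_length(text):
    """Hypothetical helper: count code points after composing to NFC."""
    return len(unicodedata.normalize("NFC", text))


print(len(tweet))         # 5: four letters plus the combining accent
print(nfc_length(tweet))  # 4: NFC composes "e" + U+0301 into U+00E9
```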