On Fri, 15 Nov 2013 14:43:17 +0000, Robin Becker wrote:

> Things went wrong when utf8 was not adopted as the standard encoding
> thus requiring two string types, it would have been easier to have a len
> function to count bytes as before and a glyphlen to count glyphs. Now as
> I understand it we have a complicated mess under the hood for unicode
> objects so they have a variable representation to approximate an 8 bit
> representation when suitable etc etc etc.
No no no! Glyphs are *pictures*, you know, the little blocks of pixels that you see on your monitor or printed on a page. Before you can count glyphs in a string, you need to know which typeface ("font") is being used, since fonts generally lack glyphs for some code points.

[Aside: there's another complication. Some fonts define alternate glyphs for the same code point, so that the design of (say) the letter "a" may vary within the one string according to whatever typographical rules the font supports and the application calls for. So the question is, when you "count glyphs", should you count "a" and "alternate a" as a single glyph or two?]

You don't actually mean counting glyphs, you mean counting code points (think characters, only with some complications that aren't important for the purposes of this discussion).

UTF-8 is utterly unsuited for in-memory storage of text strings, I don't care how many languages (Go, Haskell?) make that mistake. When you're dealing with text strings, the fundamental unit is the character, not the byte. Why do you care how many bytes a text string has? If you really need to know how much memory an object is using, that's where you use sys.getsizeof(), not len(). We don't say len({42: None}) to discover that the dict requires 136 bytes, so why would you use len("heåvy") to learn that it uses 23 bytes?

UTF-8 is a variable-width encoding, which means it's *rubbish* for the in-memory representation of strings. Counting characters is slow. Slicing is slow. If you have mutable strings, deleting or inserting characters is slow. Every operation effectively has to start at the beginning of the string and count forward, lest it split a multi-byte sequence down the middle. Or worse, the language doesn't give you any protection from this at all, so rather than slow string routines you have unsafe string routines, and it's your responsibility to detect UTF-8 boundaries yourself.
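To make both points concrete, here's a quick sketch. The exact sys.getsizeof() figures vary between Python versions and builds, so I won't quote precise numbers, and utf8_index() is my own illustrative helper, not anything from the standard library:

```python
import sys

s = "heåvy"

# len() counts code points; sys.getsizeof() reports memory. They answer
# different questions (exact sizes are implementation-dependent).
print(len(s))            # 5 code points
print(sys.getsizeof(s))  # some larger, build-dependent number of bytes

# In UTF-8 the same five characters occupy six bytes, because "å" needs two.
data = s.encode('utf-8')
print(len(data))         # 6 bytes

# To find the n-th character in UTF-8 bytes you must scan from the start,
# skipping continuation bytes (those of the form 0b10xxxxxx):
def utf8_index(data, n):
    """Return the byte offset of the n-th code point in UTF-8 data."""
    count = -1
    for i, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a new character
            count += 1
            if count == n:
                return i
    raise IndexError(n)

print(utf8_index(data, 3))  # byte offset 4: "v" sits after the two-byte "å"
```

That linear scan is the cost you pay on *every* indexing operation if your strings are UTF-8 bytes; with a constant-width representation the offset is a single multiplication.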
In case you aren't familiar with what I'm talking about, here's an example using Python 3.2, starting with a Unicode string and treating it as UTF-8 bytes:

py> u = "heåvy"
py> s = u.encode('utf-8')
py> for c in s:
...     print(chr(c))
...
h
e
Ã
¥
v
y

"Ã¥"? It didn't take long to get moji-bake in our output, and all I did was print the (byte) string one "character" at a time. It gets worse: we can easily end up with invalid UTF-8:

py> a, b = s[:len(s)//2], s[len(s)//2:]  # split the string in half
py> a.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 2: unexpected end of data
py> b.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte

No, UTF-8 is okay for writing to files, but it's not suitable for text strings. The in-memory representation of text strings should be constant width, based on characters not bytes, and should prevent the caller from accidentally ending up with moji-bake or invalid strings.

--
Steven

--
https://mail.python.org/mailman/listinfo/python-list