On Fri, Jul 12, 2013 at 8:45 PM, Devyn Collier Johnson <devyncjohn...@gmail.com> wrote: > Could you explain what you mean? What and where is the new Flexible String > Representation?
(You're top-posting again. Please put your text underneath what you're responding to - it helps maintain flow and structure.) Python versions up to and including 3.2 came in two varieties: narrow builds (commonly found on Windows) and wide builds (commonly found on Linux). Narrow builds internally represented Unicode strings in UTF-16, while wide builds used UTF-32. This is a problem, because it means that taking a program from one to another actually changes its behaviour: Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15) [GCC 4.4.1] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> len(u"\U00012345") 1 Python 2.7.4 (default, Apr 6 2013, 19:54:46) [MSC v.1500 32 bit (Intel)] on win32 >>> len(u"\U00012345") 2 In fact, the narrow builds are flat-out buggy, because you can put something in as a single character that simply isn't a single character. You can then pull that out as two characters and make a huge mess of things: >>> s=u"\U00012345" >>> s[0] u'\ud808' >>> s[1] u'\udf45' *Any* string indexing will be broken if there is a single character >U+FFFF ahead of it in the string. Now, this problem is not unique to Python. Heaps of other languages have the same issue, the same buggy behaviour, the same compromises. What's special about Python is that it actually managed to come back from that problem. (Google's V8 JavaScript engine, for instance, is stuck with it, because the ECMAScript specification demands UTF-16. I asked on an ECMAScript list and was told "can't change that, it'd break code". So it's staying buggy.) There are a number of languages that take the Texan RAM-guzzling approach of storing all strings in UTF-32; Python (since version 3.3) is among a *very* small number of languages that store strings in multiple different ways according to their content. That's described in PEP 393 [1], titled "Flexible String Representation". It details a means whereby a Python string will be represented in, effectively, UTF-32 with some of the leading zero bytes elided. Or if you prefer, in either Latin-1, UCS-2, or UCS-4, whichever's the smallest it can fit in. The difference between a string stored one-byte-per-character and a string stored four-bytes-per-character is almost invisible to a Python script; you can find out by checking the string's memory usage, but otherwise you don't need to worry about it. Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)] on win32 >>> sys.getsizeof("asdfasdfasdfasd") 40 >>> sys.getsizeof("asdfasdfasdfasdf") 41 Adding another character adds another 1 byte. (There's quite a bit of overhead for small strings - GC headers and such - but it gets dwarfed by the actual content after a while.) >>> sys.getsizeof("\u1000sdfasdfasdfasd") 68 >>> sys.getsizeof("\u1000sdfasdfasdfasdf") 70 Two bytes to add another character. >>> sys.getsizeof("\U00010001sdfasdfasdfasd") 100 >>> sys.getsizeof("\U00010001sdfasdfasdfasdf") 104 Four bytes. It uses only what it needs. Strings in Python are immutable, so there's no need to worry about up-grading or down-grading a string; there are a few optimizations that can't be done, but they're fairly trivial. Look, I'll pull a jmf and find a microbenchmark that makes 3.3 look worse: 2.7.4: >>> timeit.repeat('a=u"A"*100; a+=u"\u1000"') [0.8175005482540385, 0.789617954237201, 0.8152240019332098] >>> timeit.repeat('a=u"A"*100; a+=u"a"') [0.8088905154146744, 0.8123691698246631, 0.8172558244134365] 3.3.0: >>> timeit.repeat('a=u"A"*100; a+=u"\u1000"') [0.9623714745976031, 0.970628669281723, 0.9696310564468149] >>> timeit.repeat('a=u"A"*100; a+=u"a"') [0.7017891938739922, 0.7024725209339522, 0.6989539173082449] See? It's clearly worse on the newer Python! But actually, this is an extremely unusual situation, and 3.3 outperforms 2.7 on the more common case (where the two strings are of the same width). Python's PEP 393 strings are following the same sort of model as the native string type in a semantically-similar but syntactically-different language, Pike. In Pike (also free software, like Python), the string type can be indexed character by character, and each character can be anything in the Unicode range; and just as in Python 3.3, memory usage goes up by just one byte if every character in the string fits inside 8 bits. So it's not as if this is an untested notion; Pike has been running like this for years (I don't know how long it's had this functionality, but it won't be more than 18 years as Unicode didn't have multiple planes until 1996), and performance has been *just fine* for all that time. Pike tends to be run on servers, so memory usage and computation speed translate fairly directly into TPS. And there are some sizeable commercial entities using and developing Pike, so if the flexible string representation had turned out to be a flop, someone would have put in the coding time to rewrite it by now. And yet, despite all these excellent reasons for moving to this way of doing strings, jmf still sees his microbenchmarks as more important, and so he jumps in on threads like this to whine about how Python 3.3 is somehow US-centric because it more efficiently handles the entire Unicode range. I'd really like to take some highlights from Python and Pike and start recommending that other languages take up the ideas, but to be honest, I hesitate to inflict jmf on them all. ECMAScript may have the right idea after all - stay with UTF-16 and avoid answering jmf's stupid objections every week. [1] http://www.python.org/dev/peps/pep-0393/ ChrisA -- http://mail.python.org/mailman/listinfo/python-list