I'm digging out an old email which I saved as a draft almost a month ago but never got around to sending, because I think the new Unicode implementation in Python 3.3 is one of the coolest things ever.
On 03/12/12 16:56, eryksun wrote:
CPython 3.3 has a new implementation that angles for the best of all worlds, opting for a 1-byte, 2-byte, or 4-byte representation depending on the maximum code in the string. The internal representation doesn't use surrogates, so there's no more narrow vs wide build distinction.
The consequences of this may not be clear to some people. Here's the short version:

The full range of 1114112 Unicode code points (informally "characters") does not fit in the space available to two bytes. Two bytes can cover the values 0 through 65535 (0xFFFF in hexadecimal), while Unicode code points go up to 1114111 (0x10FFFF). So what to do? There are three obvious solutions:

1 If you store each character using four bytes, you can cover the entire Unicode range. The downside is that for English speakers and ASCII users, strings will use four times as much memory as you expect: e.g. the character 'A' will be stored as 0x00000041 (four bytes instead of one in pure ASCII).

When you compile the Python interpreter, you can set an option to do this. This is called a "wide" build.

2 Since "wide builds" use so much extra memory for the average ASCII string, hardly anyone uses them. Instead, the default setting for Python is a "narrow" build: characters use only two bytes, which is enough for most common characters. E.g. the character 'A' will be stored as 0x0041.

The less common characters can't be represented as a single two-byte character, so Unicode defines a *pair of characters* to indicate the extra (hopefully rare) characters. These are called "surrogate pairs". For example, Unicode code point 0x10859 is too large for a pair of bytes. So in Python 3.2, you get this:

py> c = chr(0x10859)  # IMPERIAL ARAMAIC NUMBER TWO
py> print(len(c), [hex(ord(x)) for x in c])
2 ['0xd802', '0xdc59']

Notice that instead of getting a single character, you get two characters. Your software is then supposed to manually check for such surrogate pairs. Unfortunately nobody does, because that's complicated and slow, so people end up with code that cannot handle strings with surrogate pairs safely. It's easy to break the pair up and get invalid strings that don't represent any actual character.
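To make the surrogate pair business concrete, here is a rough sketch of the arithmetic that turns a code point above 0xFFFF into two 16-bit values and back again. This is the standard UTF-16 calculation as I understand it; the function names are made up for illustration and are not part of Python's standard library.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit number
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into the original code point.

    Hypothetical helper for illustration only.
    """
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10859)
print(hex(high), hex(low))                  # 0xd802 0xdc59
print(hex(from_surrogate_pair(high, low)))  # 0x10859
```

Those are the same two values that Python 3.2 shows for chr(0x10859). Slice a narrow-build string between them and you are left holding half a character.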
In other words, Python *wide builds* use too much memory, and *narrow builds* are buggy and let you break strings. Oops.

3 Python 3.3 takes a third option: when you create a string object, the interpreter analyses the string, works out the largest character used, and only then decides how many bytes per character to use.

So in Python 3.3, the decision to use "wide" strings (4 bytes per character) or "narrow" strings (2 bytes) is no longer made when compiling the Python interpreter. It is made per string, with the added bonus that purely ASCII or Latin1 strings can use 1 byte per character. That means no more surrogate pairs, and every Unicode character is now a single character:

py> c = chr(0x10859)  # Python 3.3
py> print(len(c), [hex(ord(x)) for x in c])
1 ['0x10859']

and a good opportunity for large memory savings.

How big are the memory savings? They can be substantial. Purely Latin1 strings (so-called "extended ASCII") can be close to half the size of a narrow build:

[steve@ando ~]$ python3.2 -c "import sys; print(sys.getsizeof('ñ'*1000))"
2030
[steve@ando ~]$ python3.3 -c "import sys; print(sys.getsizeof('ñ'*1000))"
1037

I don't have a wide build to test, but the size would be roughly twice as big again, about 4060 bytes.

But more important than the memory savings, it means that for the first time Python's handling of Unicode strings is correct for the entire range of all one million plus characters, not just the first 65 thousand. And that, I think, is a really important step.

All we need now is better fonts that support more of the Unicode range so we can actually *see* the characters.

--
Steven
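
P.S. If you want a feel for how the per-string choice works, here is a rough model in pure Python. To be clear, this is my own sketch of the idea, not the actual C code inside CPython, and bytes_per_char is a made-up name. The second half measures how much a string grows per extra character, which on CPython 3.3 or later I would expect to agree with the model, though I make no promises for other implementations.

```python
import sys


def bytes_per_char(s):
    """Model of the per-string width choice: 1, 2 or 4 bytes per character.

    Hypothetical helper; it looks only at the largest code point in s.
    """
    largest = max(map(ord, s), default=0)
    if largest <= 0xFF:        # ASCII and Latin1
        return 1
    if largest <= 0xFFFF:      # everything else that fits in two bytes
        return 2
    return 4                   # the rest of the Unicode range


def measured_bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing two string sizes.

    Subtracting cancels out the fixed per-object overhead.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) / n


for ch in ['A', 'ñ', '€', chr(0x10859)]:
    print(hex(ord(ch)), bytes_per_char(ch), measured_bytes_per_char(ch))
```

Note that one wide character is enough to widen the whole string: bytes_per_char('A' * 1000 + chr(0x10859)) is 4, not 1.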