I'm digging out an old email which I saved as a draft almost a month ago
but never got around to sending, because I think the new Unicode
implementation in Python 3.3 is one of the coolest things ever.


On 03/12/12 16:56, eryksun wrote:

CPython 3.3 has a new implementation that angles for the best of all
worlds, opting for a 1-byte, 2-byte, or 4-byte representation
depending on the maximum code in the string. The internal
representation doesn't use surrogates, so there's no more narrow vs
wide build distinction.


The consequences of this may not be clear to some people. Here's the
short version:

The full range of 1114112 Unicode code points (informally "characters")
does not fit in the space available to two bytes. Two bytes can cover the
values 0 through 65535 (0xFFFF in hexadecimal), while Unicode code
points go up to 1114111 (0x10FFFF). So what to do? There are three
obvious solutions:

1 If you store each character using four bytes, you can cover the
  entire Unicode range. The downside is that for English speakers and
  ASCII users, strings will use four times as much memory as you
  expect: e.g. the character 'A' will be stored as 0x00000041 (four
  bytes instead of one in pure ASCII).

  When you compile the Python interpreter, you can set an option to
  do this. This is called a "wide" build.

2 Since "wide builds" use so much extra memory for the average ASCII
  string, hardly anyone uses them. Instead, the default setting for
  Python is a "narrow" build: characters use only two bytes, which is
  enough for most common characters. E.g. the character 'A' will
  be stored as 0x0041.

  The less common characters can't be represented as a single two-
  byte character, so Unicode defines a *pair of characters* to
  indicate the extra (hopefully rare) characters. These are called
  "surrogate pairs". For example, Unicode code point 0x10859 is too
  large for a pair of bytes. So in Python 3.2, you get this:

  py> c = chr(0x10859)  # IMPERIAL ARAMAIC NUMBER TWO
  py> print(len(c), [hex(ord(x)) for x in c])
  2 ['0xd802', '0xdc59']


  Notice that instead of getting a single character, you get two
  characters. Your software is then supposed to manually check for
  such surrogate pairs. Unfortunately nobody does, because that's
  complicated and slow, so people end up with code that cannot handle
  strings with surrogate pairs safely. It's easy to break the pair up
  and get invalid strings that don't represent any actual character.

  In other words, Python *wide builds* use too much memory, and
  *narrow builds* are buggy and let you break strings. Oops.

3 Python 3.3 takes a third option: when you create a string object,
  the interpreter examines the string, works out the largest code
  point used, and only then decides how many bytes per character to use.

  So in Python 3.3, the decision to use "wide" strings (4 bytes per
  character) or "narrow" strings (2 bytes) is no longer made when
  compiling the Python interpreter. It is made per string, with the
  added bonus that purely ASCII or Latin1 strings can use 1 byte
  per character. That means no more surrogate pairs, and every
  Unicode character is now a single character:

  py> c = chr(0x10859)  # Python 3.3
  py> print(len(c), [hex(ord(x)) for x in c])
  1 ['0x10859']

  and a good opportunity for large memory savings.
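
For anyone curious where the two values 0xd802 and 0xdc59 in the Python
3.2 example come from, here is a minimal sketch of the standard UTF-16
surrogate arithmetic. The function names to_surrogate_pair and
from_surrogate_pair are my own, made up for illustration; they are not
part of Python.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


pair = to_surrogate_pair(0x10859)
print([hex(unit) for unit in pair])       # ['0xd802', '0xdc59']
print(hex(from_surrogate_pair(*pair)))    # 0x10859
```

This is the bookkeeping that code running on a narrow build had to do by
hand whenever it wanted to treat such a pair as one character.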


How big are the memory savings? They can be substantial. Purely Latin1
strings (so-called "extended ASCII") can be close to half the size they
would be in a narrow build:


[steve@ando ~]$ python3.2 -c "import sys; print(sys.getsizeof('ñ'*1000))"
2030
[steve@ando ~]$ python3.3 -c "import sys; print(sys.getsizeof('ñ'*1000))"
1037


I don't have a wide build to test, but there the size would be roughly
double that of the narrow build, about 4060 bytes.
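
If you want to see the per-string choice of width for yourself, here is
a rough sketch that should work on CPython 3.3 or later. It grows a
string by one character and looks at how much sys.getsizeof changes. The
helper name bytes_per_char is mine, and the exact object sizes are a
CPython implementation detail, so other interpreters may report
something different.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by growing a string by one character."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xf1"),           # LATIN SMALL LETTER N WITH TILDE
    ("BMP", "\u20ac"),             # EURO SIGN
    ("astral", "\U00010859"),      # IMPERIAL ARAMAIC NUMBER TWO
]
for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On CPython 3.3 and later I would expect 1, 1, 2 and 4 respectively,
matching the 1-byte, 2-byte and 4-byte representations described above.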

But more important than the memory savings, it means that for the first
time Python's handling of Unicode strings is correct for the entire range
of one million plus code points, not just the first 65 thousand.

And that, I think, is a really important step. All we need now is better
fonts that support more of the Unicode range so we can actually *see* the
characters.
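
To make the correctness point concrete, here is a small sketch, assuming
Python 3.3 or later, showing that len(), indexing and slicing all count
whole code points even when a character lies outside the first 65536.
Surrogates only show up if you explicitly encode to UTF-16.

```python
astral = chr(0x10859)          # IMPERIAL ARAMAIC NUMBER TWO
s = "a" + astral + "b"

print(len(s))                  # 3, not 4 as on a narrow build
print(hex(ord(s[1])))          # 0x10859
print(s[::-1] == "b" + astral + "a")   # True: reversing cannot split a pair

# The surrogate pair still exists, but only in the UTF-16 encoded bytes.
encoded = astral.encode("utf-16-be")
print(encoded == b"\xd8\x02\xdc\x59")  # True
```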



--
Steven
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
