On Sun, Dec 2, 2012 at 8:34 AM, Albert-Jan Roskam <fo...@yahoo.com> wrote:
>
> As I emailed earlier today to Peter Otten, I thought unicode_internal means
> UCS-2 or UCS-4, depending on the size of sys.maxunicode? How is this related
> to UTF-16 and UTF-32?

UCS is the Universal Character Set (ISO/IEC 10646). Some highlights of the Basic
Multilingual Plane (BMP): U+0000-U+00FF is Latin-1 (including the C0
and C1 control codes). U+D800-U+DFFF is reserved for UTF-16 surrogate
pairs. U+E000-U+F8FF is reserved for private use. Most of
U+F900-U+FFFF is assigned. Notably U+FEFF (zero width no-break space)
doubles as the BOM/signature in the transformation formats.
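
To make the BOM remark concrete, here is a small sketch, assuming
Python 3 and only the standard codecs module. It checks which of the
stock encoders write U+FEFF as a signature; the variable names are
just for illustration.

```python
import codecs

text = "abc"

# The endian-neutral "utf-16" and "utf-32" codecs prepend a BOM in
# native byte order; "utf-8-sig" prepends the 3-byte UTF-8 signature.
utf16 = text.encode("utf-16")
utf32 = text.encode("utf-32")
utf8_sig = text.encode("utf-8-sig")
utf8_plain = text.encode("utf-8")

print(utf16.startswith(codecs.BOM_UTF16))
print(utf32.startswith(codecs.BOM_UTF32))
print(utf8_sig.startswith(codecs.BOM_UTF8))
print(utf8_plain.startswith(codecs.BOM_UTF8))

# The BOM is just U+FEFF run through the same encoder.
bom_be = "\ufeff".encode("utf-16-be")
print(bom_be == codecs.BOM_UTF16_BE)

# Decoding with plain "utf-8" keeps the signature as a character;
# "utf-8-sig" strips it.
kept = utf8_sig.decode("utf-8")
stripped = utf8_sig.decode("utf-8-sig")
```

Codecs with an explicit byte order, such as "utf-16-le", don't write
a BOM at all.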

UTF-16 encodes the supplementary planes by using two 16-bit code units
as a surrogate pair. This uses a reserved 11-bit block (U+D800-U+DFFF),
which is split into two 10-bit ranges: U+D800-U+DBFF for the lead
surrogate and U+DC00-U+DFFF for the trail surrogate. Together that's
the required 20 bits for the 16 supplementary planes. Including the
BMP, this scheme covers the complete UCS range of 17 * 2**16 ==
1114112 codes (on a wide build, that's sys.maxunicode + 1).
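
The surrogate arithmetic above can be sketched in a few lines. This
assumes Python 3; to_surrogates and from_surrogates are hypothetical
helper names, not functions from the standard library.

```python
def to_surrogates(cp):
    """Split a supplementary code point into (lead, trail) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % cp)
    offset = cp - 0x10000               # 20-bit value, 0 .. 0xFFFFF
    lead = 0xD800 + (offset >> 10)      # high 10 bits
    trail = 0xDC00 + (offset & 0x3FF)   # low 10 bits
    return lead, trail


def from_surrogates(lead, trail):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


lead, trail = to_surrogates(0x1F600)
print(hex(lead), hex(trail))  # 0xd83d 0xde00

# Cross-check against the built-in UTF-16 encoder (big endian, no BOM).
encoded = chr(0x1F600).encode("utf-16-be")
print(encoded == bytes([lead >> 8, lead & 0xFF, trail >> 8, trail & 0xFF]))
```

The first and last supplementary code points map to the corners of
the two ranges: U+10000 gives (0xD800, 0xDC00) and U+10FFFF gives
(0xDBFF, 0xDFFF).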

For encoding text, use one of the transformation formats such as
UTF-8, UTF-16, or UTF-32. Unless you have a requirement to use UTF-16
or UTF-32, it's best to stick to encoding to UTF-8. It's the default
for source files and for str.encode() and bytes.decode() in 3.x. It's
also usually the most compact representation (especially if there's a
lot of ASCII), and it's compatible with null-terminated byte strings
(i.e. a C array of char, terminated by NUL), since only U+0000 encodes
to a zero byte. Regardless of narrow vs wide build, you can always
encode to one of these formats. On a narrow build, the encoders for
UTF-8 and UTF-32 first recombine any surrogate pairs in the internal
representation.
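
As a rough illustration of the size and NUL-byte points, here is a
sketch assuming Python 3. The encoded_sizes helper and the sample
strings are made up for this example; the little-endian codecs are
used so that no BOM is counted.

```python
def encoded_sizes(s):
    """Return the encoded length in bytes for each transformation format."""
    return {enc: len(s.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


samples = {
    "ascii": "hello",
    "latin": "h\u00e9llo",          # e with acute accent
    "cjk": "\u65e5\u672c\u8a9e",    # three BMP ideographs
    "astral": "\U0001F600",         # one supplementary-plane character
}

for name, s in sorted(samples.items()):
    print(name, encoded_sizes(s))

# UTF-8 never emits a zero byte unless the text itself contains U+0000,
# whereas UTF-16 and UTF-32 pad ASCII with zero bytes.
mixed = samples["latin"] + samples["cjk"]
print(b"\x00" in mixed.encode("utf-8"))
print(b"\x00" in mixed.encode("utf-16-le"))
```

UTF-8 wins for ASCII-heavy text, but BMP text outside Latin ranges
(the CJK sample here) takes 3 bytes per character in UTF-8 versus 2
in UTF-16.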

CPython 3.3 has a new implementation (PEP 393) that angles for the
best of all worlds, opting for a 1-byte, 2-byte, or 4-byte
representation per character depending on the highest code point in
the string. The internal representation doesn't use surrogates, so
there's no more narrow vs wide build distinction.
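
One way to observe this, as a sketch that assumes CPython 3.3 or
later (sys.getsizeof reports implementation-specific sizes, and other
interpreters may not support it). bytes_per_char is a hypothetical
helper that measures how much one extra character adds to a string's
size.

```python
import sys


def bytes_per_char(ch):
    """Measure the storage cost of one more copy of ch (CPython-specific)."""
    # The fixed object header cancels out in the difference.
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)


print(bytes_per_char("a"))            # ASCII
print(bytes_per_char("\u00e9"))       # Latin-1 range
print(bytes_per_char("\u20ac"))       # rest of the BMP
print(bytes_per_char("\U0001F600"))   # supplementary planes

# With no narrow builds, a supplementary character is one code point
# and sys.maxunicode is always 0x10FFFF.
print(len("\U0001F600"), hex(sys.maxunicode))
```

On such a build the four measurements should come out as 1, 1, 2,
and 4 bytes.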
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
