"Kent Johnson" <ken...@tds.net> wrote in message news:1c2a2c590903300352t2bd3f1a7j5f37703cf1c3...@mail.gmail.com...
> On Mon, Mar 30, 2009 at 3:36 AM, spir <denis.s...@free.fr> wrote:
>> Everything is in the title ;-)
>> (Is it kind of integers representing the code point?)
>
> Unicode is represented as 16-bit integers. I'm not sure, but I don't
> think Python has support for surrogate pairs, i.e. characters outside
> the BMP.

Unicode itself is simply code points. How those code points are represented internally is another matter. The code below is from a 16-bit (narrow) Unicode build of Python, but it should look exactly the same on a 32-bit (wide) build; only the internal representation differs.

Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x = u'\U00012345'
>>> x.encode('utf8')
'\xf0\x92\x8d\x85'
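
If you want to check which kind of build you are running, sys.maxunicode tells you (the values below are from my narrow build; on a wide build I would expect 1114111, i.e. 0x10FFFF):

>>> import sys
>>> sys.maxunicode    # 65535 on a narrow build, 1114111 on a wide build
65535

A related symptom: on a narrow build unichr() refuses anything above 0xFFFF, so a u'\U00012345' literal is about the only convenient way to spell such a character.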

However, I wonder if this should be considered a bug. I would think the length of a Unicode string should be the number of code points in the string, which for my string above is 1, but on a narrow build len() exposes the internal representation as UTF-16. Does anyone have a 32-bit Unicode build of Python handy to compare?
>>> len(x)
2
>>> x[0]
u'\ud808'
>>> x[1]
u'\udf45'
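
If you really need the code-point count on a narrow build, you can collapse surrogate pairs yourself. Here is a rough sketch (code_point_len is just a name I made up, untested, so take it as an illustration rather than a recipe):

import sys

def code_point_len(u):
    # On a wide build len() already counts code points.
    if sys.maxunicode > 0xFFFF:
        return len(u)
    count = 0
    i = 0
    while i < len(u):
        # A high surrogate followed by a low surrogate is one code point.
        if (u'\ud800' <= u[i] <= u'\udbff' and i + 1 < len(u)
                and u'\udc00' <= u[i + 1] <= u'\udfff'):
            i += 2
        else:
            i += 1
        count += 1
    return count

With that, code_point_len(u'\U00012345') should give 1 on either kind of build.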


-Mark

