"Kent Johnson" <ken...@tds.net> wrote in message news:1c2a2c590903300352t2bd3f1a7j5f37703cf1c3...@mail.gmail.com...
> On Mon, Mar 30, 2009 at 3:36 AM, spir <denis.s...@free.fr> wrote:
>> Everything is in the title ;-)
>> (Is it kind of integers representing the code point?)
>
> Unicode is represented as 16-bit integers. I'm not sure, but I don't
> think Python has support for surrogate pairs, i.e. characters outside
> the BMP.

Unicode itself is simply code points. How those code points are represented internally is another matter. The code below is from a 16-bit (narrow) Unicode build of Python, but it should look exactly the same on a 32-bit (wide) build; only the internal representation differs.

Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x = u'\U00012345'
>>> x.encode('utf8')
'\xf0\x92\x8d\x85'
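
If you want to check which kind of build you are running, sys.maxunicode tells you (the values below are from my narrow build; on a wide build I would expect 1114111, i.e. 0x10FFFF):

>>> import sys
>>> sys.maxunicode    # 65535 on a narrow build, 1114111 on a wide build
65535

A related symptom: on a narrow build unichr() refuses anything above 0xFFFF, so a u'\U00012345' literal is about the only convenient way to spell such a character.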

However, I wonder if this should be considered a bug. I would think the length of a Unicode string should be the number of code points in the string, which for my string above is 1, but on a narrow build len() exposes the internal representation as UTF-16. Does anyone have a 32-bit Unicode build of Python handy to compare?
>>> len(x)
2
>>> x[0]
u'\ud808'
>>> x[1]
u'\udf45'
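
If you really need the code-point count on a narrow build, you can collapse surrogate pairs yourself. Here is a rough sketch (code_point_len is just a name I made up, untested, so take it as an illustration rather than a recipe):

import sys

def code_point_len(u):
    # On a wide build len() already counts code points.
    if sys.maxunicode > 0xFFFF:
        return len(u)
    count = 0
    i = 0
    while i < len(u):
        # A high surrogate followed by a low surrogate is one code point.
        if (u'\ud800' <= u[i] <= u'\udbff' and i + 1 < len(u)
                and u'\udc00' <= u[i + 1] <= u'\udfff'):
            i += 2
        else:
            i += 1
        count += 1
    return count

With that, code_point_len(u'\U00012345') should give 1 on either kind of build.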


-Mark

