On Tue, Mar 31, 2009 at 1:52 AM, Mark Tolonen <metolone+gm...@gmail.com> wrote:
> Unicode is simply code points. How the code points are represented
> internally is another matter. The code below is from a 16-bit Unicode
> build of Python but should look exactly the same on a 32-bit Unicode
> build; however, the internal representation is different.
>
> Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit (Intel)]
> on win32
> Type "help", "copyright", "credits" or "license" for more information.
>>>> x=u'\U00012345'
>>>> x.encode('utf8')
> '\xf0\x92\x8d\x85'
>
> However, I wonder if this should be considered a bug. I would think the
> length of a Unicode string should be the number of code points in the
> string, which for my string above should be 1. Anyone have a 32-bit
> Unicode build of Python handy? This exposes the implementation as UTF-16.
>>>> len(x)
> 2
>>>> x[0]
> u'\ud808'
>>>> x[1]
> u'\udf45'

In a standard Python build the internal representation of unicode is
16 bits per code unit, without correct handling of surrogate pairs
(which is what your string contains). I think this is called UCS-2,
not UTF-16. There is a compile switch to enable a 32-bit representation
of unicode. See PEP 261 and the "Internal Representation" section of
the second link below for more details.

http://www.python.org/dev/peps/pep-0261/
http://www.cmlenz.net/archives/2008/07/the-truth-about-unicode-in-python

Kent
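
P.S. For anyone who wants to poke at this, here is a minimal sketch in
Python 2 (matching the session above). It relies on the UTF-32 codecs
added in Python 2.6, and codepoint_len is just an illustrative helper
name, not a standard function:

# Count code points the same way on narrow and wide builds by
# encoding to UTF-32 (big-endian, to avoid a BOM): surrogate pairs
# are joined first, so every code point becomes exactly 4 bytes.
def codepoint_len(s):
    return len(s.encode('utf-32-be')) // 4

x = u'\U00012345'
print len(x)            # 2 on a narrow build, 1 on a wide build
print codepoint_len(x)  # 1 on either build

# The surrogate-pair arithmetic behind Mark's x[0] and x[1]:
# U+D808 and U+DF45 recombine to U+12345.
high, low = 0xD808, 0xDF45
combined = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print hex(combined)     # 0x12345

On a wide build the surrogate indexing in Mark's session never comes up:
x[0] is the full character and len(x) is already 1.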