"Alexandre Vassalotti" <[EMAIL PROTECTED]> wrote: > Hi, > > I was doing some testing on the new _string_io module, since I was > slightly skeptical on my handling of wide Unicode characters (32-bit > of length, instead of the usual 16-bit in UTF-16). So, I ran this > little test: > > >>> s = _string_io.StringIO() > >>> s.write(u'ð¯£') > >>> s.tell() > 2 > > Like I expected, wide Unicode characters count for two. However, I was > surprised that Python treats them as two characters as well: > > >>> len(u'ð¯£') > 2 > >>> u'ð¯£' > u'\ud87e\udccd' > > Is it a bug, or only an implementation choice?
If your Python is compiled as a UTF-16 ("narrow") build, then any character outside the Basic Multilingual Plane is stored as a surrogate pair, and Python will see it as two characters. If you are using a UCS-4 ("wide") build (it's the same as UTF-32), then you should see the single wide character as a single wide character.

The only exception to this rule is if you enter the wide character as a surrogate pair, in which case Python doesn't normalize it into the single wide character. To get a real wide character, you would need to use a proper escape, or decode it from an encoded byte string.
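For example (a minimal sketch under Python 2; the output shown is from a narrow build, where sys.maxunicode is 65535 -- on a wide build it is 1114111, len() of the escaped character is 1, and its repr is a single \U escape):

>>> import sys
>>> sys.maxunicode
65535
>>> u'\U0002F8CD'              # proper escape; stored as a surrogate pair on a narrow build
u'\ud87e\udccd'
>>> len(u'\U0002F8CD')
2
>>> '\xf0\xaf\xa3\x8d'.decode('utf-8')   # decoding the UTF-8 bytes gives the same character
u'\ud87e\udccd'
>>> len(u'\ud87e\udccd')       # an explicit surrogate pair is never normalized, on either build
2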

- Josiah