"Alexandre Vassalotti" <[EMAIL PROTECTED]> wrote: > Hi, > > I was doing some testing on the new _string_io module, since I was > slightly skeptical on my handling of wide Unicode characters (32-bit > of length, instead of the usual 16-bit in UTF-16). So, I ran this > little test: > > >>> s = _string_io.StringIO() > >>> s.write(u'ð¯£') > >>> s.tell() > 2 > > Like I expected, wide Unicode characters count for two. However, I was > surprised that Python treats them as two characters as well: > > >>> len(u'ð¯£') > 2 > >>> u'ð¯£' > u'\ud87e\udccd' > > Is it a bug, or only an implementation choice?
If your Python is compiled as a UTF-16 ("narrow") build, then any character outside the Basic Multilingual Plane is stored as a surrogate pair, and Python will see it as two characters. If you are using a UCS-4 ("wide") build (it's the same as UTF-32), then you should see the single wide character as a single wide character.

The only exception to this rule is if you enter the wide character as a surrogate pair, in which case Python doesn't normalize it into the single wide character. To get a real wide character, you would need to use a proper escape, or decode it from an encoded byte string.
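For example (a minimal sketch under Python 2; the output shown is from a narrow build, where sys.maxunicode is 65535 -- on a wide build it is 1114111, len() of the escaped character is 1, and its repr is a single \U escape):

>>> import sys
>>> sys.maxunicode
65535
>>> u'\U0002F8CD'              # proper escape; stored as a surrogate pair on a narrow build
u'\ud87e\udccd'
>>> len(u'\U0002F8CD')
2
>>> '\xf0\xaf\xa3\x8d'.decode('utf-8')   # decoding the UTF-8 bytes gives the same character
u'\ud87e\udccd'
>>> len(u'\ud87e\udccd')       # an explicit surrogate pair is never normalized, on either build
2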

- Josiah