[issue4678] Unicode: multiple chars for high code points

Eric Eisner Tue, 16 Dec 2008 15:25:32 -0800

New submission from Eric Eisner <[email protected]>:

I discovered this when trying to slice a string containing Unicode
code points higher than U+FFFF.

All examples were run on 32-bit Ubuntu Linux.

Python 2.5.2 (for comparison):
sys.maxunicode     # 1114111
len(unichr(66674)) # 1
len(u'\U00010472') # 1
len(u'𐑲')          # 2
unichr(66674)[0]   # u'\U00010472'


Python 3.0 (same behavior on Ubuntu's rc1 package and on my own build,
r67781, from svn):
sys.maxunicode    # 65535
len(chr(66674))   # 2
len('\U00010472') # 2
len('𐑲')          # 2
chr(66674)[0]     # '\ud801'
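
The '\ud801' above looks like the first half of a UTF-16 surrogate pair.
Here is a rough sketch of that arithmetic, assuming the 3.0 build stores
strings as UTF-16 code units (the function name is my own and not part of
Python):

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000      # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(66674)
print(hex(high), hex(low))  # 0xd801 0xdc72
```

If that assumption holds, it would explain both the length of 2 and the
first element being '\ud801'.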

I expect the nth element of a string to be the nth code point, regardless
of the build's Unicode settings. I don't know why maxunicode is configured
differently (both were compiled by Ubuntu), but is this the expected behavior?
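
To illustrate the indexing I have in mind, here is a minimal sketch of a
helper that returns code points rather than code units. It assumes that
high code points appear as surrogate pairs when maxunicode is 65535; the
name code_points is hypothetical, not an existing API:

```python
def code_points(s):
    """Return the code points of s as ints, joining surrogate pairs."""
    result = []
    i = 0
    while i < len(s):
        unit = ord(s[i])
        # A high surrogate followed by a low surrogate encodes one code point.
        if 0xD800 <= unit <= 0xDBFF and i + 1 < len(s):
            nxt = ord(s[i + 1])
            if 0xDC00 <= nxt <= 0xDFFF:
                result.append(0x10000 + ((unit - 0xD800) << 10) + (nxt - 0xDC00))
                i += 2
                continue
        # Anything else (including a lone surrogate) is passed through as-is.
        result.append(unit)
        i += 1
    return result


print(code_points('a\ud801\udc72'))  # [97, 66674]
```

With something like this, code_points(s)[n] gives the nth code point on
either kind of build, at the cost of walking the whole string.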

If this is the expected behavior, how can I configure a build of Python
to use the larger maxunicode value?

----------
components: Unicode
messages: 77940
nosy: ede
severity: normal
status: open
title: Unicode: multiple chars for high code points
versions: Python 3.0

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue4678>
_______________________________________