New submission from Eric Eisner <e...@mit.edu>:

I discovered this when trying to splice a string containing unicode codepoints higher than U+FFFF.
All examples on 32-bit Ubuntu Linux.

python 2.5.2 (for comparison):

    sys.maxunicode  # 1114111
    len(unichr(66674))  # 1
    len(u'\U00010472')  # 1
    len(u'𐑲')  # 2
    unichr(66674)[0]  # u'\U00010472'

python 3.0 (same behavior on ubuntu's rc1 package and my build (r67781) from svn):

    sys.maxunicode  # 65535
    len(chr(66674))  # 2
    len('\U00010472')  # 2
    len('𐑲')  # 2
    chr(66674)[0]  # '\ud801'

I expect the nth element of a string to be the nth codepoint, regardless of unicode settings. I don't know why the maxunicode is configured differently (both compiled by ubuntu), but is this the expected behavior? If this is actually the expected behavior, how can I configure a build of python to use the larger maxunicode value?

----------
components: Unicode
messages: 77940
nosy: ede
severity: normal
status: open
title: Unicode: multiple chars for high code points
versions: Python 3.0

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue4678>
_______________________________________
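
For readers trying to see where the length of 2 and the '\ud801' result come from, the following is a minimal sketch, not part of the original report. It assumes a Python 3 interpreter and uses two hypothetical helper names, `utf16_code_units` and `combine_surrogates`, to show how a code point above U+FFFF splits into a UTF-16 surrogate pair (what a build with sys.maxunicode == 65535 stores) and how the pair can be merged back.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text (hypothetical helper).

    This mirrors what a build with sys.maxunicode == 65535 stores
    for each element of the string.
    """
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def combine_surrogates(units):
    """Merge high/low surrogate pairs back into code points (hypothetical helper)."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            low = units[i + 1]
            # Standard UTF-16 decoding: 10 bits from each half, offset by 0x10000.
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit (or an unpaired surrogate) is passed through unchanged.
            code_points.append(unit)
            i += 1
    return code_points


units = utf16_code_units("\U00010472")
print([hex(u) for u in units])    # ['0xd801', '0xdc72']
print(combine_surrogates(units))  # [66674]
```

The first code unit, 0xd801, is the same value that chr(66674)[0] returns in the python 3.0 example, which suggests that indexing on that build operates on UTF-16 code units rather than on code points.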