Marc-Andre Lemburg <m...@egenix.com> added the comment:

On 2008-12-17 00:25, Eric Eisner wrote:
> New submission from Eric Eisner <e...@mit.edu>:
>
> I discovered this when trying to splice a string containing unicode
> codepoints higher than U+FFFF
>
>
> all examples on 32-bit Ubuntu Linux
>
> python 2.5.2 (for comparison):
> sys.maxunicode # 1114111
> len(unichr(66674)) # 1
> len(u'\U00010472') # 1
> len(u'𐑲') # 2
> unichr(66674)[0] # u'\U00010472'
>
>
> python 3.0: (same behavior on ubuntu's rc1 package and my build(r67781)
> from svn)
> sys.maxunicode # 65535
> len(chr(66674)) # 2
> len('\U00010472') # 2
> len('𐑲') # 2
> chr(66674)[0] # '\ud801'
>
> I expect the nth element of a string to be the nth codepoint, regardless
> of unicode settings. I don't know why the maxunicode is configured
> differently (both compiled by ubuntu), but is this the expected behavior?
>
> If this is actually the expected behavior, how can I configure a build
> of python to use the larger maxunicode value?
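
As an aside, the values reported above are consistent with UTF-16 surrogate pairs: on a build where sys.maxunicode is 65535, a code point above U+FFFF is stored as two code units, so len() returns 2 and index 0 yields the leading surrogate. The following is only a minimal sketch of that arithmetic, not the interpreter's actual code; the helper names to_utf16_units and is_narrow_build are made up for this illustration.

```python
import sys


def to_utf16_units(codepoint):
    """Return the UTF-16 code units for a code point as a list of ints.

    Hypothetical helper for illustration only.
    """
    if codepoint < 0x10000:
        # Code points in the Basic Multilingual Plane fit in one unit.
        return [codepoint]
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)     # leading (high) surrogate
    low = 0xDC00 + (offset & 0x3FF)    # trailing (low) surrogate
    return [high, low]


def is_narrow_build():
    """True if this interpreter reports sys.maxunicode == 65535."""
    return sys.maxunicode == 0xFFFF


units = to_utf16_units(66674)          # 66674 == 0x10472
print([hex(u) for u in units])         # ['0xd801', '0xdc72']
print("narrow build:", is_narrow_build())
```

The first unit, 0xd801, matches the '\ud801' that chr(66674)[0] returns in the Python 3.0 session quoted above.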

You are seeing the different behavior because you've probably built
Python 3.0 from source and used the Ubuntu default Python install for
comparison: the default Python 3.0 build creates a UCS2 build unless you
specify the --enable-unicode=ucs4 configure option. The Ubuntu Python
build (like those of many other Linux distros) uses this option by default.

----------
nosy: +lemburg

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue4678>
_______________________________________