On 9/20/06, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On 9/20/06, Michael Chermside <[EMAIL PROTECTED]> wrote:
> > I wrote:
> > >>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
> > >>> msg[35:-18]
> > u'"\U00010143"'
> > >>> greek_five = msg[36:-19]
> > >>> len(greek_five)
> > 2
> >
> > After posting, I realized that it's worse than that. I suspect that if
> > I tried this on a CPython compiled with wide characters, then
> > len(greek_five) would be 1.
> >
> > What should it be? 2? 1? Implementation-dependent?
>
> This has all been rehashed endlessly. It's implementation (and
> platform- and compilation options-) dependent because there are good
> reasons for both choices. Even if CPython 3.0 supports a dynamic
> choice (which some are proposing) then the *language* will still make
> it implementation dependent because of Jython and IronPython, where
> the only choice is UTF-16 (or UCS-2, depending the attitude towards
> surrogates).
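
For concreteness, here is a rough sketch of the counts being argued about. It assumes a modern Python 3 interpreter (which postdates this thread and indexes str by code point), and the helper name code_units is made up for illustration. It measures the character from the quoted session in UTF-16 code units (what a narrow build counts), UTF-32 code units (what a wide build counts), and, for comparison, UTF-8 bytes.

```python
# Illustration only: assumes Python 3, where str is indexed by code point.
greek_five = "\U00010143"  # the character from the quoted session


def code_units(text, encoding, unit_size):
    """Count code units of text in encoding, given unit_size bytes per unit.

    Hypothetical helper; the explicit "-le" codecs are used below so that
    no byte order mark is added to the encoded output.
    """
    return len(text.encode(encoding)) // unit_size


utf16_units = code_units(greek_five, "utf-16-le", 2)  # narrow-build view
utf32_units = code_units(greek_five, "utf-32-le", 4)  # wide-build view
utf8_units = code_units(greek_five, "utf-8", 1)       # byte view

print(utf16_units, utf32_units, utf8_units)  # prints: 2 1 4
```

The same single character is 2, 1, or 4 units long depending on which code unit the implementation indexes by.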

Wow, you really did mean code units. In that case I'm very tempted to
support UTF-8, with byte indexing (which is what code units are in its
case). It's ugly, but it technically works fine, and it's the de facto
standard on Linux. No more ugly than UTF-16 code units IMO, just more
obvious.

-- 
Adam Olsen, aka Rhamphoryncus
_______________________________________________
Python-3000 mailing list
http://mail.python.org/mailman/listinfo/python-3000
