Re: [Python-3000] How will unicode get used?

Adam Olsen Wed, 20 Sep 2006 16:02:56 -0700

On 9/20/06, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On 9/20/06, Michael Chermside <[EMAIL PROTECTED]> wrote:
> > I wrote:
> > >>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
> > >>> msg[35:-18]
> > u'"\U00010143"'
> > >>> greek_five = msg[36:-19]
> > >>> len(greek_five)
> > 2
> >
> >
> > After posting, I realized that it's worse than that. I suspect that if
> > I tried this on a CPython compiled with wide characters, then
> > len(greek_five) would be 1.
> >
> > What should it be? 2? 1? Implementation-dependent?
>
> This has all been rehashed endlessly. It's implementation (and
> platform- and compilation options-) dependent because there are good
> reasons for both choices. Even if CPython 3.0 supports a dynamic
> choice (which some are proposing) then the *language* will still make
> it implementation dependent because of Jython and IronPython, where
> the only choice is UTF-16 (or UCS-2, depending on the attitude towards
> surrogates).


Wow, you really did mean code units.  In that case I'm very tempted to
support UTF-8, with byte indexing (which is what code units are in its
case).  It's ugly, but it technically works fine, and it's the de
facto standard on Linux.  No more ugly than UTF-16 code units IMO,
just more obvious.
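
To make the comparison concrete, here is a minimal sketch (written in
Python 3 syntax, not taken from the original thread) that counts code
units for the same character under UTF-8, UTF-16 and UTF-32, and then
slices the UTF-8 form by byte offset. The helper name code_units and
the variable names are assumptions made up for this illustration.

```python
# Illustrative sketch only; `code_units` is a hypothetical helper.
msg = 'The ancient greeks used the letter "\U00010143" for the number 5.'
greek_five = "\U00010143"

def code_units(text, encoding):
    """Return how many code units `text` occupies in `encoding`."""
    # Size in bytes of one code unit for each encoding form.
    unit_size = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}[encoding]
    return len(text.encode(encoding)) // unit_size

utf8_units = code_units(greek_five, "utf-8")       # code units are bytes
utf16_units = code_units(greek_five, "utf-16-le")  # a surrogate pair
utf32_units = code_units(greek_five, "utf-32-le")  # one unit per code point

# Byte indexing into the UTF-8 form: find the character, slice it out.
encoded = msg.encode("utf-8")
start = encoded.index(greek_five.encode("utf-8"))
end = start + utf8_units
sliced = encoded[start:end].decode("utf-8")

print(utf8_units, utf16_units, utf32_units, start, sliced == greek_five)
```

With byte indexing the slice boundaries have to fall on character
boundaries; a slice that stops partway through the character's bytes
cannot be decoded, which is the UTF-8 analogue of splitting a UTF-16
surrogate pair.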

-- 
Adam Olsen, aka Rhamphoryncus
_______________________________________________
Python-3000 mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-3000
Unsubscribe: 
http://mail.python.org/mailman/options/python-3000/archive%40mail-archive.com
