Re: [Python-3000] How will unicode get used?

Adam Olsen Wed, 20 Sep 2006 11:21:10 -0700

On 9/20/06, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On 9/20/06, Adam Olsen <[EMAIL PROTECTED]> wrote:
> > Before we can decide on the internal representation of our unicode
> > objects, we need to decide on their external interface.  My thoughts
> > so far:
>
> Let me cut this short. The external string API in Py3k should not
> change or only very marginally so (like removing rarely used useless
> APIs or adding a few new conveniences). The plan is to keep the 2.x
> API that is supported (in 2.x) by both str and unicode, but merge the
> two string types into one. Anything else could be done just as easily
> before or after Py3k.


Thanks, but one thing remains unclear: is the indexing intended to
represent bytes, code points, or code units?  Note that C code
operating on UTF-16 would slice by code units, which can split
surrogate pairs.
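
To make the surrogate problem concrete, here is a rough sketch.  It
assumes Python 3 semantics, where str is indexed by code point, and
utf16_units is a made-up helper for illustration, not an existing API:

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints.

    Hypothetical helper, only meant to mimic what C code sees when it
    walks a UTF-16 buffer.
    """
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

s = "a\U0001D11Eb"        # 'a', MUSICAL SYMBOL G CLEF (U+1D11E), 'b'
units = utf16_units(s)    # the clef is outside the BMP, so it takes two units
head = units[:2]          # slicing by code unit keeps only the high surrogate

print(len(s), len(units))
print([hex(u) for u in units])
print([hex(u) for u in head])
```

The slice ends on a lone high surrogate, which is not a valid code
point on its own.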

As far as I can tell, CPython on Windows uses UTF-16 with code units.
Perhaps not intentionally, but by default (not throwing an error on
surrogates).

For those trying to make sense of this, a Code Point is anything in the
0 to 0x10FFFF range.  A Code Unit goes up to 0xFF for UTF-8, 0xFFFF for
UTF-16, and 0xFFFFFFFF for UTF-32.  One or more code units may be
needed to form a single code point.  Obviously, indexing by code units
exposes our internal implementation choice.
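
As a quick illustration of how the number of code units per code point
depends on the encoding, here is a small sketch.  It assumes Python 3,
and code_units is a name I made up for the example:

```python
def code_units(text, encoding, unit_size):
    """Count code units: encoded length in bytes divided by the unit size."""
    return len(text.encode(encoding)) // unit_size

# One code point each: U+0041, U+00E9, U+20AC, U+1D11E
samples = ["A", "\u00e9", "\u20ac", "\U0001D11E"]

table = {}
for ch in samples:
    table[ord(ch)] = (
        code_units(ch, "utf-8", 1),      # 8-bit code units
        code_units(ch, "utf-16-le", 2),  # 16-bit code units
        code_units(ch, "utf-32-le", 4),  # 32-bit code units
    )

for cp, counts in table.items():
    print("U+%04X" % cp, counts)
```

Only UTF-32 gives one code unit per code point in every case, so an
index in code units means something different for each encoding.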

-- 
Adam Olsen, aka Rhamphoryncus
