On 9/20/06, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On 9/20/06, Adam Olsen <[EMAIL PROTECTED]> wrote:
> > Before we can decide on the internal representation of our unicode
> > objects, we need to decide on their external interface. My thoughts
> > so far:
>
> Let me cut this short. The external string API in Py3k should not
> change or only very marginally so (like removing rarely used useless
> APIs or adding a few new conveniences). The plan is to keep the 2.x
> API that is supported (in 2.x) by both str and unicode, but merge the
> two string types into one. Anything else could be done just as easily
> before or after Py3k.

Thanks, but one thing remains unclear: is indexing intended to represent
bytes, code points, or code units? Note that C code operating on UTF-16
would slice by code unit, which can split surrogate pairs.

As far as I can tell, CPython on Windows uses UTF-16 indexed by code
units. Perhaps not intentionally, but by default (it does not raise an
error on surrogates).

For those trying to make sense of this: a code point is anything in the
0 to 0x10FFFF range. A code unit goes up to 0xFF for UTF-8, 0xFFFF for
UTF-16, and 0xFFFFFFFF for UTF-32. One or more code units may be needed
to form a single code point. Obviously, indexing by code units exposes
our internal implementation choice.

-- 
Adam Olsen, aka Rhamphoryncus
_______________________________________________
Python-3000 mailing list
http://mail.python.org/mailman/listinfo/python-3000
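
To make the code point versus code unit distinction concrete, here is a
minimal sketch. It assumes a Python 3.3 or later interpreter, where str
is indexed by code point; it does not reflect the 2.x builds discussed
above. The helper name code_units is hypothetical and exists only for
this illustration.

```python
def code_units(text, encoding):
    """Return the code units of `text` in one of three UTF encodings.

    `code_units` is a hypothetical helper for illustration only.
    """
    # Width of one code unit, in bytes, for each supported encoding.
    widths = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}
    width = widths[encoding]
    data = text.encode(encoding)
    # Group the encoded bytes into fixed-width little-endian integers.
    return [int.from_bytes(data[i:i + width], "little")
            for i in range(0, len(data), width)]


# 'a' followed by U+1D11E (MUSICAL SYMBOL G CLEF), which lies outside the BMP.
s = "a\U0001D11E"

utf8_units = code_units(s, "utf-8")
utf16_units = code_units(s, "utf-16-le")
utf32_units = code_units(s, "utf-32-le")

print(len(s))                         # number of code points
print([hex(u) for u in utf8_units])   # 8-bit code units
print([hex(u) for u in utf16_units])  # 16-bit code units
print([hex(u) for u in utf32_units])  # 32-bit code units

# Slicing by UTF-16 code unit can cut a surrogate pair in half,
# leaving a lone high surrogate at the end of the slice.
split = utf16_units[:2]
print([hex(u) for u in split])
```

The same two code points take five code units in UTF-8, three in UTF-16,
and two in UTF-32, so an index measured in code units means something
different under each representation.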
