On Thu, Jun 13, 2013 at 11:40 AM, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but
> that's not UTF-8, that's UTF-8-plus-extra-codepoints.

And a proper UTF-8 decoder will reject "\xC0\x80" and "\xED\xA0\x80", even though mechanically they would decode to U+0000 and U+D800 respectively: the first is an overlong encoding, and the second is a surrogate, which isn't a valid scalar value.

The UTF-16 *mechanism* can't reach past U+10FFFF, which is exactly where Unicode's code space ends, but I'm left wondering if that's actually the other way around - that Unicode planes were deemed to stop at the point where UTF-16 can't encode any more. Not that it matters; with most of the current planes completely unallocated, it seems unlikely we'll be needing more.

ChrisA
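
For anyone who wants to check both points, here is a minimal sketch, assuming Python 3 and its built-in strict "utf-8" codec. The helper names is_valid_utf8 and from_surrogates are made up for illustration; they are not from any library.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def from_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Overlong encoding of U+0000, and a UTF-8-style encoding of surrogate U+D800.
for raw in (b"\xC0\x80", b"\xED\xA0\x80"):
    print(raw, "accepted" if is_valid_utf8(raw) else "rejected")

# The largest possible surrogate pair marks the ceiling of what UTF-16 can encode.
print(hex(from_surrogates(0xDBFF, 0xDFFF)))
```

The last line combines the highest high surrogate with the highest low surrogate, so whatever it prints is the upper bound of the UTF-16 mechanism.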