On Thu, 13 Jun 2013 12:01:55 +1000, Chris Angelico wrote:

> On Thu, Jun 13, 2013 at 11:40 AM, Steven D'Aprano
> <steve+comp.lang.pyt...@pearwood.info> wrote:
>> The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but
>> that's not UTF-8, that's UTF-8-plus-extra-codepoints.
>
> And a proper UTF-8 decoder will reject "\xC0\x80" and "\xed\xa0\x80", even
> though mathematically they would translate into U+0000 and U+D800
> respectively. The UTF-16 *mechanism* is limited to no more than Unicode
> has currently used, but I'm left wondering if that's actually the other
> way around - that Unicode planes were deemed to stop at the point where
> UTF-16 can't encode any more.

Indeed. 5-byte and 6-byte sequences were originally part of the UTF-8
specification, allowing for 31 bits. Later revisions of the standard
imposed the UTF-16 limit of U+10FFFF on Unicode as a whole, which caps
UTF-8 at 4 bytes per code point.
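
For anyone who wants to check this at the interpreter, here is a rough
sketch, assuming Python 3 and its default strict UTF-8 codec. The helper
name is_valid_utf8 and the constant names are mine, purely for
illustration. It tries the two sequences Chris mentioned, a 5-byte form,
and the values on either side of the U+10FFFF limit, and redoes the bit
arithmetic behind "31 bits" and the UTF-16 ceiling.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


# A 6-byte sequence has a lead byte 1111110x (1 payload bit) followed by
# five continuation bytes 10xxxxxx (6 payload bits each).
SIX_BYTE_PAYLOAD_BITS = 1 + 5 * 6

# UTF-16 surrogate pairs carry 20 bits on top of the 16-bit BMP.
UTF16_MAX = 0x10000 + (1 << 20) - 1

samples = {
    "overlong U+0000": b"\xc0\x80",
    "surrogate U+D800": b"\xed\xa0\x80",
    "5-byte form of U+200000": b"\xf8\x88\x80\x80\x80",
    "U+10FFFF in 4 bytes": b"\xf4\x8f\xbf\xbf",
    "U+110000 in 4 bytes": b"\xf4\x90\x80\x80",
}

for label, raw in samples.items():
    print(f"{label}: {'accepted' if is_valid_utf8(raw) else 'rejected'}")

print("payload bits in a 6-byte sequence:", SIX_BYTE_PAYLOAD_BITS)
print("highest code point UTF-16 can reach:", hex(UTF16_MAX))
```

Only the U+10FFFF sample should be accepted. Note that U+110000 fits
the 4-byte bit pattern perfectly well, so the decoder is rejecting it
because of the Unicode limit, not because the mechanism runs out of room.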