On Wed, 12 Jun 2013 21:30:23 +0100, Nobody wrote:

> The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a
> total of 31 bits, but UTF-16 is limited to U+10FFFF (slightly more than
> 20 bits).

Same with UTF-8 and UTF-32, both of which are limited to U+10FFFF because
that is what Unicode is limited to.

The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but
that's not UTF-8, that's UTF-8-plus-extra-codepoints. Likewise the
mechanism of UTF-32 could go up to 0xFFFFFFFF, but doing so means you
don't have Unicode chars any more, and hence your byte-string is not
valid UTF-32:

py> b = b'\xFF'*8
py> b.decode('UTF-32')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3:
codepoint not in range(0x110000)


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list
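
P.S. To make the mechanism-versus-encoding distinction concrete for UTF-8 as well, here is a rough sketch. The helper names `encode_extended` and `is_valid_utf8` are made up for illustration and are not part of any library. The first one applies the original 1-to-6-byte bit layout to any value up to 31 bits; the second just asks Python's own 'utf-8' codec whether it accepts the result. It assumes Python 3.

```python
def encode_extended(cp):
    """Hypothetical helper: apply the original 1-to-6-byte UTF-8 bit
    layout to any integer below 2**31, ignoring the Unicode limit."""
    if cp < 0:
        raise ValueError("negative value")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper limit, lead-byte marker, continuation byte count)
    layouts = (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    )
    for limit, lead, n in layouts:
        if cp < limit:
            # The lead byte carries the top bits; each continuation
            # byte carries 6 more bits behind a 10xxxxxx marker.
            out = [lead | (cp >> (6 * n))]
            for i in range(n - 1, -1, -1):
                out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
            return bytes(out)
    raise ValueError("value needs more than 31 bits")


def is_valid_utf8(data):
    """Hypothetical helper: True if Python's 'utf-8' codec accepts data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for cp in (0x10FFFF, 0x110000, 0x7FFFFFFF):
        raw = encode_extended(cp)
        print(hex(cp), len(raw), "bytes, valid UTF-8:", is_valid_utf8(raw))
```

The bit-packing works all the way up to 0x7FFFFFFF, but the codec should accept only the output for values up to U+10FFFF, which is the same distinction as in the UTF-32 session above.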