Re: A few questions about encoding

Steven D'Aprano Wed, 12 Jun 2013 18:48:26 -0700

On Wed, 12 Jun 2013 21:30:23 +0100, Nobody wrote:

> The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a
> total of 31 bits, but UTF-16 is limited to U+10FFFF (slightly more than
> 20 bits).


Same with UTF-8 and UTF-32, both of which are limited to U+10FFFF because 
that is what Unicode is limited to.
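
For what it's worth, the U+10FFFF figure in the quoted text falls out of the 
surrogate-pair arithmetic. A rough sketch (the helper name from_surrogates 
is made up for this example, it is not a standard library function):

```python
def from_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one code point (hypothetical helper)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate contributes 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The largest possible pair gives the largest code point UTF-16 can express.
highest = from_surrogates(0xDBFF, 0xDFFF)
print(hex(highest))

# Cross-check against Python's own UTF-16 encoder.
encoded = chr(highest).encode('utf-16-be')
print(encoded)
```

The two 10-bit halves give 2**20 values on top of the 0x10000 already 
covered by single code units, which is where "slightly more than 20 bits" 
comes from.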

The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but 
that's not UTF-8, that's UTF-8-plus-extra-codepoints. Likewise the 
mechanism of UTF-32 could go up to 0xFFFFFFFF, but doing so means you 
don't have Unicode chars any more, and hence your byte-string is not 
valid UTF-32:

py> b = b'\xFF'*8
py> b.decode('UTF-32')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3: codepoint not in range(0x110000)
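
The same check can be made for UTF-8. This is only a sketch, assuming 
Python 3; try_decode is a hypothetical helper written for this example, and 
the byte strings are hand-built to follow the UTF-8 and UTF-32 bit layouts 
for values Unicode does not allow:

```python
def try_decode(data, encoding):
    """Return the decoded text, or None if the codec rejects the bytes.

    Hypothetical helper for this illustration only.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# U+10FFFF, the highest Unicode code point, round-trips in both encodings.
top = chr(0x10FFFF)
ok_utf8 = try_decode(top.encode('utf-8'), 'utf-8')
ok_utf32 = try_decode(top.encode('utf-32-le'), 'utf-32-le')

# 0x110000 written in the 4-byte UTF-8 bit pattern: one past the limit.
beyond_utf8 = try_decode(b'\xf4\x90\x80\x80', 'utf-8')

# 0x7FFFFFFF written in the 6-byte pattern the mechanism would allow.
six_byte = try_decode(b'\xfd\xbf\xbf\xbf\xbf\xbf', 'utf-8')

# 0x110000 stored as a plain little-endian 32-bit value.
beyond_utf32 = try_decode((0x110000).to_bytes(4, 'little'), 'utf-32-le')

print(ok_utf8 == top, ok_utf32 == top)
print(beyond_utf8, six_byte, beyond_utf32)
```

In each of the last three cases the bit pattern is well-formed as far as the 
mechanism goes, but the codec refuses it because the value is not a Unicode 
code point.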


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list