On 12/6/2013 11:30 PM, Nobody wrote:
On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:

So, how many bytes does UTF-8 store for code points > 127?

U+0000..U+007F  1 byte
U+0080..U+07FF  2 bytes
U+0800..U+FFFF  3 bytes
>=U+10000       4 bytes
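
You can check that table yourself in Python 3; a quick sketch (the boundary code points below are just arbitrary picks from each range):

# Encode one code point from each range and count the bytes.
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    encoded = chr(cp).encode('utf-8')
    print('U+%04X -> %d byte(s): %r' % (cp, len(encoded), encoded))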

'U+' stands for a Unicode code point, which means a character, right?

How can you tell up to which character UTF-8 needs 1 byte, 2 bytes, or 3? (See the sketch just below.)
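
One way to see where the cut-offs fall is to read them straight off the ranges in the table; a small helper (just a sketch, the name utf8_length is made up here):

def utf8_length(codepoint):
    """Bytes UTF-8 needs for one code point, per the ranges above."""
    if codepoint <= 0x7F:
        return 1
    if codepoint <= 0x7FF:
        return 2
    if codepoint <= 0xFFFF:
        return 3
    return 4

print(utf8_length(ord('A')), utf8_length(0x3B1), utf8_length(0x20AC))  # 1 2 3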


And some of the bytes' bits are used to tell where a code point's representation stops, right? I mean, if we have a code point that needs 2 bytes to be stored, the high bit must be set to 1 to signify that this character's encoding stops at 2 bytes.
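
It is not quite a single high bit: in the first byte, the count of leading 1 bits gives the total length (110xxxxx means "two bytes in all"), and every continuation byte starts with 10. To see it concretely, here are the bits of a 2-byte sequence, using U+03B1 (Greek alpha) as an arbitrary example:

# Print each byte of the 2-byte UTF-8 encoding of U+03B1 in binary.
for byte in chr(0x3B1).encode('utf-8'):
    print(format(byte, '08b'))
# prints:
# 11001110
# 10110001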

I just know that 2^8 = 256, which at first look is 256 places, which means 256 positions to hold a code point, which in turn means a character.

We take the high bit out and then we have 2^7, which is enough positions for 0-127 standard ASCII. The high bit is set to '0' to signify that the char is encoded in 1 byte.
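
That part matches what the encoder actually produces; a tiny check for the 1-byte case:

# 'A' is U+0041: one byte, high bit 0.
b = 'A'.encode('utf-8')
print(len(b), format(b[0], '08b'))   # 1 01000001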

Please tell me whether I understood correctly so far.

But how about for 2, 3, or 4 bytes?

Am I saying it correctly?
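
For 2, 3, and 4 bytes the first byte starts with 110, 1110, and 11110 respectively, and every following byte starts with 10. A rough sketch of how a decoder could read the length off the first byte alone (the name utf8_sequence_length is made up, and this skips full validation):

def utf8_sequence_length(first_byte):
    if first_byte < 0b10000000:   # 0xxxxxxx: plain ASCII, 1 byte
        return 1
    if first_byte < 0b11000000:   # 10xxxxxx: continuation, not a start byte
        raise ValueError('continuation byte')
    if first_byte < 0b11100000:   # 110xxxxx: 2-byte sequence
        return 2
    if first_byte < 0b11110000:   # 1110xxxx: 3-byte sequence
        return 3
    return 4                      # 11110xxx: 4-byte sequence

for ch in ('A', chr(0x3B1), chr(0x20AC), chr(0x1F600)):
    first = ch.encode('utf-8')[0]
    print('U+%04X  %s  %d bytes' % (ord(ch), format(first, '08b'),
                                    utf8_sequence_length(first)))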


