On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:

> So, how many bytes does UTF-8 store for code points > 127?

U+0000..U+007F      1 byte
U+0080..U+07FF      2 bytes
U+0800..U+FFFF      3 bytes
U+10000..U+10FFFF   4 bytes

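So, 1 byte for ASCII; 2 bytes for other Latin characters, Greek, Cyrillic,
Arabic, and Hebrew; 3 bytes for the rest of the Basic Multilingual Plane,
which includes Chinese/Japanese/Korean; and 4 bytes for the supplementary
planes: mostly historic scripts, rarer CJK ideographs, mathematical
alphanumeric symbols, and emoji.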
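If you want to check the table from the interpreter, here is a minimal
Python 3 sketch. The helper name utf8_len and the sample characters are
my own choices for illustration; nothing here is a standard library
function apart from str.encode() and ord().

```python
def utf8_len(codepoint):
    """Number of bytes UTF-8 uses for a code point, following the table above."""
    if codepoint < 0x80:
        return 1
    if codepoint < 0x800:
        return 2
    if codepoint < 0x10000:
        return 3
    return 4

# Sample characters, picked only for illustration.
samples = [
    "A",           # U+0041, ASCII
    "\u00e9",      # U+00E9, Latin small e with acute
    "\u03b1",      # U+03B1, Greek small alpha
    "\u044f",      # U+044F, Cyrillic small ya
    "\u4e2d",      # U+4E2D, CJK ideograph
    "\U0001d538",  # U+1D538, mathematical double-struck capital A
]

for ch in samples:
    encoded = ch.encode("utf-8")
    # The table and the real encoder should agree on the length.
    assert len(encoded) == utf8_len(ord(ch))
    print("U+%04X  %d byte(s)" % (ord(ch), len(encoded)))
```

In practice len(s.encode("utf-8")) is all you need; the helper only
spells out the ranges so they can be compared with the encoder.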

The mechanism originally designed for UTF-8 allows sequences of up to 6
bytes, for a total of 31 bits, but UTF-16 cannot go beyond U+10FFFF
(slightly more than 20 bits), so UTF-8 is now restricted to the same range
and never needs more than 4 bytes.
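The 31-bit figure can be worked out from the byte layout. The sketch
below assumes the usual UTF-8 bit patterns: a multi-byte sequence starts
with a lead byte holding one 1 bit per byte in the sequence followed by a
0, and every continuation byte (10xxxxxx) carries 6 payload bits. The
function name payload_bits is hypothetical, chosen just for this example.

```python
def payload_bits(nbytes):
    """Payload bits in an nbytes-long sequence under the original UTF-8 scheme."""
    if nbytes == 1:
        return 7                     # 0xxxxxxx
    lead = 7 - nbytes                # lead byte: nbytes ones, one zero, rest payload
    return lead + 6 * (nbytes - 1)   # each continuation byte adds 6 bits

for n in range(1, 7):
    print("%d byte(s): %2d bits" % (n, payload_bits(n)))

# The highest code point, U+10FFFF, fits in the 4-byte form.
print((0x10FFFF).bit_length(), "bits needed for U+10FFFF")
```

So 4 bytes already give 21 bits, which covers everything up to U+10FFFF;
the 5- and 6-byte forms are what you would need to reach 31 bits.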

-- 
http://mail.python.org/mailman/listinfo/python-list
