On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:

> So, how many bytes does UTF-8 stored for codepoints > 127 ?

  U+0000..U+007F      1 byte
  U+0080..U+07FF      2 bytes
  U+0800..U+FFFF      3 bytes
  U+10000..U+10FFFF   4 bytes

So, 1 byte for ASCII, 2 bytes for other Latin characters, Greek, Cyrillic,
Arabic, and Hebrew, 3 bytes for Chinese/Japanese/Korean, 4 bytes for dead
languages and mathematical symbols.

The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a total
of 31 bits, but UTF-16 cannot go beyond U+10FFFF (slightly more than 20
bits), so Unicode stops there and UTF-8 never needs more than 4 bytes in
practice.
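
If you want to check the table yourself, here is a minimal sketch, assuming Python 3. The helper name utf8_length and the sample characters are my own choices for illustration, not anything from the standard library. It applies the ranges above and compares the result with what str.encode() actually produces.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for one code point.

    Hypothetical helper for illustration: it only applies the range
    table and does not reject surrogates (U+D800..U+DFFF).
    """
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


# One sample per script mentioned above (arbitrary picks).
samples = ["A", "é", "α", "я", "中", "𝔘"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # The table and the real encoder should agree on the length.
    assert len(encoded) == utf8_length(ord(ch))
    print(f"U+{ord(ch):04X} {ch!r}: {len(encoded)} byte(s)")
```

The point is only that len(s.encode("utf-8")) gives the stored size, and that it depends on which range each code point falls in, not on the number of characters alone.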