>> (*) infact UTF8 also indicates the end of each character
> Up to a point. The initial byte encodes the length and the top few
> bits, but the subsequent octets aren’t distinguishable as final in
> isolation. 0x80-0xBF can all be either medial or final.

So the leading high bits are a directive that tells UTF-8 how many bytes are used to represent each character.

Do codepoints (characters) 0-127 use 1 bit to signify that they need 1 byte of storage, and the remaining 7 bits to actually store the character?

And do codepoints (characters) 128-256 use 2 bits to signify that they need 2 bytes of storage, and the remaining 14 bits to actually store the character? Isn't 14 bits way too many to store a character?
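
One way to check the layout asked about here is to print the encoded bytes in binary. The following is a minimal sketch, assuming Python 3; the helper names utf8_bits and payload_bits are made up for illustration and are not part of any library. It suggests that the marker is not a fixed 1 or 2 bits: a one-byte character starts with a single 0 bit and keeps 7 bits of data, while a two-byte character starts with 110 in the first byte and 10 in the second, which leaves 5 + 6 = 11 data bits rather than 14.

```python
def utf8_bits(ch):
    """Return the UTF-8 encoding of one character as 8-bit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]


def payload_bits(ch):
    """Count the bits that carry code point data (hypothetical helper).

    1 byte : 0xxxxxxx                     -> 7 data bits
    n bytes: lead byte has n ones and a 0 -> 7 - n data bits,
             each following byte is 10xxxxxx -> 6 data bits
    """
    n = len(ch.encode("utf-8"))
    if n == 1:
        return 7
    return (7 - n) + 6 * (n - 1)


# U+0041 (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes)
for ch in ["A", "\u00e9", "\u20ac"]:
    print("U+%04X" % ord(ch), utf8_bits(ch), payload_bits(ch), "data bits")
```

With 11 data bits, the two-byte form can hold codepoints up to 0x7FF (2047), not just up to 255, so it covers more than the 128-256 range mentioned above.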