Re: Why UTF-8/16 character encodings?

Diggory Sat, 25 May 2013 15:45:23 -0700

On Saturday, 25 May 2013 at 20:03:59 UTC, Joakim wrote:

I have noted from the beginning that these large alphabets have to be encoded to two bytes, so it is not a true constant-width encoding if you are mixing one of those languages into a single-byte encoded string. But this "variable length" encoding is so much simpler than UTF-8, there's no comparison.

All I can say is if you think that is simpler than UTF-8 then you have completely the wrong idea about UTF-8.


Let me explain:

1) Take the byte at a particular offset in the string
2) If it is ASCII then we're done

3) Otherwise count the number of '1's at the start of the byte - this is how many bytes make up the character (there's even an ASM instruction to do this)

4) This first byte will look like '1110xxxx' for a 3 byte character, '11110xxx' for a 4 byte character, etc.

5) All following bytes are of the form '10xxxxxx'

6) Now just concatenate all the 'x's together and add an offset to get the code point

Note that this is CONSTANT TIME, O(1), with minimal branching so well suited to pipelining (after the initial byte the other bytes can all be processed in parallel by the CPU), and only sequential memory access so no cache misses, and zero additional memory requirements.
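
For concreteness, the six steps above can be sketched in a few lines of Python. This is only an illustration, not a validating decoder: the function name `decode_utf8_at` is made up for this example, it assumes `i` points at the first byte of a character, and it does not reject overlong forms or surrogates. Note also that in standard UTF-8 the concatenated 'x' bits already are the code point, so the sketch adds no offset in step 6.

```python
def decode_utf8_at(data: bytes, i: int) -> tuple[int, int]:
    """Decode the character starting at data[i].

    Returns (code_point, number_of_bytes). Illustrative sketch only:
    it does not reject overlong encodings or surrogate code points.
    """
    first = data[i]                       # step 1: take the byte at the offset
    if first < 0x80:                      # step 2: ASCII, nothing more to do
        return first, 1

    # Step 3: count the leading 1 bits. Inverting the byte turns this into
    # a "count leading zeros", which is the single instruction many CPUs have.
    n = 8 - ((~first) & 0xFF).bit_length()
    if n < 2 or n > 4:
        raise ValueError("offset does not point at a valid lead byte")
    if i + n > len(data):
        raise ValueError("truncated sequence")

    # Step 4: the lead byte is n ones, a zero, then payload bits.
    code_point = first & (0x7F >> n)

    # Steps 5 and 6: each following byte is 10xxxxxx; append its six x bits.
    for b in data[i + 1 : i + n]:
        if b & 0xC0 != 0x80:
            raise ValueError("expected a continuation byte")
        code_point = (code_point << 6) | (b & 0x3F)

    return code_point, n
```

For example, with `s = "a€".encode("utf-8")`, calling `decode_utf8_at(s, 1)` yields the code point of '€' and a length of 3. The only memory touched is the one to four bytes of the character itself.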


Now compare your encoding:

1) Look up the offset in the header using binary search: O(log N), lots of branching

2) Look up the code page ID in a massive array of code pages to work out how many bytes per character

3) Hope this array hasn't been paged out and is still in the cache

4) Extract that many bytes from the string and combine them into a number

5) Look up this new number in yet another large array specific to the code page

6) Hope this array hasn't been paged out and is still in the cache too

This is O(log N), has lots of branching so no pipelining (every stage depends on the result of the stage before), lots of random memory access so lots of cache misses, lots of additional memory requirements to store all those tables, and an algorithm that isn't even any easier to understand.
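
To make the comparison concrete, here is a rough Python sketch of the lookup chain described in the steps above. Everything in it is hypothetical: the header layout (sorted run start offsets plus a parallel list of code page IDs), the `CODE_PAGES` table and its contents, and the name `decode_at` are all invented for illustration, since the actual format of the proposed encoding is not specified here.

```python
from bisect import bisect_right

# Hypothetical code page tables: id -> (bytes per character, value -> code point).
# The contents are made up purely for this example.
CODE_PAGES = {
    0: (1, {i: i for i in range(128)}),                  # ASCII-like page
    1: (1, {0x80 + i: 0x0410 + i for i in range(64)}),   # invented one-byte page
    2: (2, {0x0001: 0x4E2D, 0x0002: 0x6587}),            # invented two-byte page
}

def decode_at(run_starts, run_pages, payload, offset):
    """Decode the character at `offset` in a header-plus-payload string.

    run_starts: sorted byte offsets where each code page run begins (assumed layout)
    run_pages:  code page ID for each run, parallel to run_starts
    """
    # Step 1: binary search the header for the run containing the offset.
    run = bisect_right(run_starts, offset) - 1
    # Steps 2-3: fetch the code page to learn the character width.
    width, table = CODE_PAGES[run_pages[run]]
    # Step 4: combine that many bytes into a number.
    raw = int.from_bytes(payload[offset : offset + width], "big")
    # Steps 5-6: look the number up in the page-specific table.
    return table[raw]
```

Each step consumes the result of the one before it, and both the header and the per-page tables live outside the string data being decoded.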

Plus every other algorithm to operate on it except for decoding is insanely complicated.
