On Saturday, 25 May 2013 at 20:03:59 UTC, Joakim wrote:
I have noted from the beginning that these large alphabets have to be encoded to two bytes, so it is not a true constant-width encoding if you are mixing one of those languages into a single-byte encoded string. But this "variable length" encoding is so much simpler than UTF-8, there's no comparison.

All I can say is if you think that is simpler than UTF-8 then you have completely the wrong idea about UTF-8.

Let me explain:

1) Take the byte at a particular offset in the string
2) If it is ASCII then we're done
3) Otherwise count the number of '1's at the start of the byte - this is how many bytes make up the character (there's even an ASM instruction to do this) 4) This first byte will look like '1110xxxx' for a 3 byte character, '11110xxx' for a 4 byte character, etc.
5) All following bytes are of the form '10xxxxxx'
6) Now just concatenate all the 'x's together and add an offset to get the code point

Note that this is CONSTANT TIME, O(1) with minimal branching so well suited to pipelining (after the initial byte the other bytes can all be processed in parallel by the CPU) and only sequential memory access so no cache misses, and zero additional memory requirements

Now compare your encoding:
1) Look up the offset in the header using binary search: O(log N) lots of branching 2) Look up the code page ID in a massive array of code pages to work out how many bytes per character
3) Hope this array hasn't been paged out and is still in the cache
4) Extract that many bytes from the string and combine them into a number 5) Look up this new number in yet another large array specific to the code page 6) Hope this array hasn't been paged out and is still in the cache too

This is O(log N) has lots of branching so no pipelining (every stage depends on the result of the stage before), lots of random memory access so lots of cache misses, lots of additional memory requirements to store all those tables, and an algorithm that isn't even any easier to understand.

Plus every other algorithm to operate on it except for decoding is insanely complicated.

Reply via email to