On Saturday, 25 May 2013 at 18:56:42 UTC, Diggory wrote:
"limited success of UTF-8"

Becoming the de-facto standard encoding EVERYWHERE except for Windows, which uses UTF-16, is hardly a failure...
So you admit that UTF-8 hasn't been used on the vast majority of computers since the inception of Unicode. That's what I call limited success; thank you for agreeing with me. :)

I really don't understand your hatred for UTF-8 - it's simple to decode and encode, fast and space-efficient. Fixed-width encodings are not inherently fast; the only thing they are faster at is if you want to randomly access the Nth character instead of the Nth byte. In the rare cases that you need to do a lot of this kind of random access, there exists UTF-32...
Space-efficient? Do you even understand what a single-byte encoding is? Suffice it to say, a single-byte encoding beats UTF-8 on all these measures, not just one.

Any fixed-width encoding which can encode every Unicode character must use at least 3 bytes, and using 4 bytes is probably going to be faster because of alignment, so I don't see what the great improvement over UTF-32 is going to be.
Slaps head. You don't need "at least 3 bytes" because you're packing language info in the header. I don't think you even know what I'm talking about.

slicing does require decoding
Nope.
Of course it does, at least partially. There is no other way to know where the code points are.

I didn't mean that people are literally keeping code pages. I meant that there's not much of a difference between code pages with 2 bytes per char and the language character sets in UCS.

Unicode doesn't have "language character sets". The different planes only exist for organisational purposes; they don't affect how characters are encoded.
Nobody's talking about different planes. I'm talking about all the different language character sets in this list:

http://en.wikipedia.org/wiki/List_of_Unicode_characters

?! It's okay because you deem it "coherent in its scheme?" I deem headers much more coherent. :)

Sure, if you change the word "coherent" to mean something completely different... Coherent means that you store related things together, i.e. everything that you need to decode a character in the same place, not spread out between part of a character and a header.
Coherent means that the organizational pieces fit together and make sense conceptually, not that everything is stored together. My point is that putting the language info in a header seems much more coherent to me than ramming that info into every character.

but I suspect substring search not requiring decoding is the exception for UTF-8 algorithms, not the rule.
The only time you need to decode is when you need to do some transformation that depends on the code point, such as converting case or identifying which character class a particular character belongs to. Appending, slicing, copying, searching, replacing, etc.: basically all the most common text operations can be done without any encoding or decoding.
Slicing by byte, which is the only way to slice without decoding, is useless; I have to laugh that you even include it. :) All these basic operations can be done very fast, often faster than UTF-8, in a single-byte encoding. Once you start talking code points, it's no contest: UTF-8 flat out loses.
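
Just to make the slicing point concrete in D terms (an illustration only; the example string is made up):

import std.conv  : to;
import std.range : take;

unittest
{
    string s = "héllo";              // the 'é' occupies two bytes in UTF-8
    assert(s.length == 6);           // 6 bytes for 5 characters
    auto bad = s[0 .. 2];            // byte slice: cuts 'é' in half, not valid UTF-8
    assert(s.take(2).to!string == "hé"); // slicing by code point has to decode
}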

On Saturday, 25 May 2013 at 19:42:41 UTC, Diggory wrote:
All a code page is, is a table of mappings; UCS is just a much larger, standardized table of such mappings.

UCS has nothing to do with code pages; it was designed as a replacement for them. A code page is a strict subset of the possible characters; UCS is the entire set of possible characters.
"[I]t was designed as a replacement for them" by combining several of them into a master code page and removing redundancies. Functionally they are the same, and historically they maintain the same layout in at least some cases. To then say that UCS has "nothing to do with code pages" is just dense.

I see you've abandoned without note your claim that phobos doesn't convert UTF-8 to UTF-32 internally. Perhaps converting to UTF-32 is "as fast as any variable width encoding is going to get", but my claim is that single-byte encodings will be faster.

I haven't "abandoned my claim". It's a simple fact that phobos does not convert UTF-8 strings to UTF-32 strings before it uses them.

i.e. the difference between this:

import std.conv : to;

string mystr = ...;
// convert the whole string to UTF-32 up front, allocating a second copy
dstring temp = mystr.to!dstring;
for (size_t i = 0; i < temp.length; ++i)
    process(temp[i]);

and this:
import std.utf : decode;

string mystr = ...;
size_t i = 0;
while (i < mystr.length) {
    // decode one code point starting at byte offset i; decode advances i past it
    dchar current = decode(mystr, i);
    process(current);
}

And if you can't see why the latter example is far more efficient I give up...
I take your point that phobos often decodes char by char as it iterates, but there are still functions in std.string that convert the entire string, as in your first example. The point is that you are forced to decode everything to UTF-32, whether one char at a time or as the entire string. Your latter example may be marginally more efficient, but it is only useful for functions that start at the beginning and walk the string in one direction, which not all operations do.

- Multiple code pages per string
This just makes everything overly complicated and is far slower to decode what the actual character is than UTF-8.
I disagree; this would still be far faster than UTF-8, particularly if you designed your header right.

The cache misses alone caused by simply accessing the separate headers would be a larger overhead than decoding UTF-8, which takes a few assembly instructions, has perfect locality and can be efficiently pipelined by the CPU.
Lol, you think a few potential cache misses are going to be slower than repeatedly decoding, whether in assembly and pipelined or not, every single UTF-8 character? :D

Then there's all the extra processing involved in combining the headers when you concatenate strings. Plus you lose the one benefit a fixed-width encoding has, because random access is no longer possible without first finding out which header controls the location you want to access.
There would be a few arithmetic operations on substring indices when concatenating strings; hardly anything.

Random access is not only still possible, it is incredibly fast in most cases: you just have to check first whether the header lists any two-byte encodings. That check can be done once and cached as a property of the string (set a boolean no_two_byte_encoding when the string is built and have the slice operator consult it before going ahead), just as you could add a property to UTF-8 strings to allow quick random access when they happen to be pure ASCII.

The difference is that only strings containing the two-byte-encoded Korean/Chinese/Japanese characters would require a bit more calculation for slicing in my scheme, whereas _every_ non-ASCII UTF-8 string requires full decoding to allow random access. This is a clear win for my single-byte encoding, though maybe not the complete demolition of UTF-8 you were hoping for. ;)
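
To make this concrete, here is a rough sketch of the kind of layout I have in mind. Every name in it (LangRun, HeaderString, noTwoByteRuns, byteOffset) is made up for illustration, and it is only a sketch of the idea, not a worked-out design:

struct LangRun
{
    ushort langId;    // which alphabet / code page this run uses
    size_t startChar; // index of the first character in this run
    size_t startByte; // byte offset where this run begins in the data
    bool   twoByte;   // true only for the few two-byte alphabets
}

struct HeaderString
{
    LangRun[] header;        // usually a single entry, a handful at most
    immutable(ubyte)[] data; // the encoded characters
    bool noTwoByteRuns;      // cached once when the string is built

    // Byte offset of the i-th character.
    size_t byteOffset(size_t i) const
    {
        if (noTwoByteRuns)
            return i;        // character index == byte index, O(1)
        // Otherwise walk the short header to find the run containing i.
        size_t r = header.length - 1;
        while (header[r].startChar > i)
            --r;
        auto run = header[r];
        return run.startByte + (i - run.startChar) * (run.twoByte ? 2 : 1);
    }
}

Indexing and slicing work off byteOffset: when noTwoByteRuns is set, which is the common case, it degenerates to plain array indexing.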

No, it's not the same at all. The contents of an arbitrary-length file cannot be compressed to a single byte; you would have collisions galore. But since most non-English alphabets have fewer than 256 characters, they can all be uniquely encoded in a single byte per character, with the header determining which language's code page to use. I don't understand your analogy whatsoever.

It's very simple - the more information you have about the type of data you are compressing at the time of writing the algorithm, the better the compression ratio you can get, to the point that if you know exactly what the file is going to contain you can compress it to nothing. This is why you have specialised compression algorithms for images, video, audio, etc.
This may be mostly true in general, but your specific example of compressing down to a byte is nonsense. For any arbitrarily long data, there are always limits to compression. What any of this has to do with my single-byte encoding, I have no idea.

It doesn't matter how few characters non-English alphabets have - unless you know WHICH alphabet it is beforehand, you can't store it in a single byte. Since any given character could be in any alphabet, the best you can do is look at the probabilities of different characters appearing and use shorter representations for more common ones. (This is the basis for all lossless compression.) The English alphabet plus 0-9 and basic punctuation are by far the most common characters used on computers, so it makes sense to use one byte for those and multiple bytes for rarer characters.
How many times have I said that "you know WHICH alphabet it is beforehand" because that info is stored in the header? That is why I specifically said, from my first post, that multi-language strings would have more complex headers, which I later pointed out could list all the different language substrings within a multi-language string. Your silly exposition of how compression works makes me wonder if you understand anything about how a single-byte encoding would work.

Perhaps it made sense to use one byte for ASCII characters and relegate _every other language_ to multiple bytes two decades ago. It doesn't make sense today.

- As I've shown the only space-efficient way to do this is using a variable length encoding like UTF-8
You haven't shown this.
If you had thought through your suggestion of multiple code pages per string you would see that I had.
You are not packaging and transmitting the code pages with the string, just as you do not ship the entire UCS with every UTF-8 string. A single-byte encoding is going to be more space-efficient for the vast majority of strings; everybody knows this.

No, it does a very bad job of this. Every non-ASCII character takes at least two bytes to encode, whereas my single-byte encoding scheme would encode every alphabet with fewer than 256 characters in a single byte.
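
A quick worked example of the size difference (the header cost on the single-byte side is my assumption; the UTF-8 byte count is exact):

unittest
{
    string ru = "привет";    // 6 Cyrillic letters
    assert(ru.length == 12); // UTF-8 spends 2 bytes on every one of them
    // A single-byte scheme would need 6 bytes of text plus a small header
    // (say a few bytes naming the alphabet) -- roughly half the size.
}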

And strings with mixed characters would use lots of memory and be extremely slow. Common when using proper names, quotes, inline translations, graphical characters, etc. etc. Not to mention the added complexity to actually implement the algorithms.
Ah, you have finally stumbled across the path to a good argument, though I'm not sure how, given your seeming ignorance of how single-byte encodings work. :) There _is_ a degenerate case with my particular single-byte encoding (not the ones you list, which would still be faster and use less memory than UTF-8): strings that use many, if not all, character sets. So the worst case scenario might be something like a string that had 100 characters, every one from a different language. In that case, I think it would still be smaller than the equivalent UTF-8 string, but not by much.

There might be some complexity in implementing the algorithms, but on net, likely less than UTF-8, while being much more usable for most programmers.

On Saturday, 25 May 2013 at 22:41:59 UTC, Diggory wrote:
1) Take the byte at a particular offset in the string
2) If it is ASCII then we're done
3) Otherwise count the number of '1's at the start of the byte - this is how many bytes make up the character (there's even an ASM instruction to do this)
4) This first byte will look like '1110xxxx' for a 3 byte character, '11110xxx' for a 4 byte character, etc.
5) All following bytes are of the form '10xxxxxx'
6) Now just concatenate all the 'x's together and add an offset to get the code point
Not sure why you chose to write this basic UTF-8 stuff out, other than to bluster on without much use.

Note that this is CONSTANT TIME, O(1), with minimal branching so it is well suited to pipelining (after the initial byte the other bytes can all be processed in parallel by the CPU), only sequential memory access so no cache misses, and zero additional memory requirements
It is constant time _per character_. You have to do it for _every_ non-ASCII character in your string, so the decoding adds up.
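
Spelled out as code, that per-character work looks something like the sketch below. This is only an illustration, not phobos' actual std.utf.decode, and it skips all validation; now imagine running it for every non-ASCII character in the string:

// Reads the code point starting at str[i] and advances i past it.
// No validation of continuation bytes, overlong forms or surrogates.
dchar decodeOne(string str, ref size_t i)
{
    uint b = str[i++];
    if (b < 0x80)
        return cast(dchar) b;                // ASCII: a single byte, done
    // 11110xxx -> 3 continuation bytes, 1110xxxx -> 2, 110xxxxx -> 1
    int extra = b >= 0xF0 ? 3 : b >= 0xE0 ? 2 : 1;
    uint cp = b & (0x3F >> extra);           // payload bits of the lead byte
    foreach (_; 0 .. extra)
        cp = (cp << 6) | (str[i++] & 0x3F);  // six more payload bits per byte
    return cast(dchar) cp;
}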

Now compare your encoding:
1) Look up the offset in the header using binary search: O(log N) lots of branching
It is difficult to reason about the header, because it all depends on the number of languages used and how many substrings there are. There are worst-case scenarios that could approach something like O(log n), but they are extremely unlikely in real-world use. Most of the time, this would be O(1).

2) Look up the code page ID in a massive array of code pages to work out how many bytes per character
Hardly; this could be done by a lookup function that simply checks whether the language is one of the few alphabets that require two bytes.

3) Hope this array hasn't been paged out and is still in the cache
4) Extract that many bytes from the string and combine them into a number
Lol, I love how you think this is worth listing as a separate step for the few two-byte encodings, yet have no problem with doing this for every non-ASCII character in UTF-8.

5) Look up this new number in yet another large array specific to the code page
Why? The language byte and that number uniquely specify the character, just like your Unicode code point above. If you were simply encoding the UCS in a single-byte encoding, you would arrange your scheme so that these two bytes trivially generate the UCS code point.
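
In other words, something along these lines, where the per-language base value is hypothetical:

// Hypothetical: each language ID carries a base code point chosen so that
// base + byte reproduces the UCS code point directly.
dchar toCodePoint(dchar languageBase, ubyte b)
{
    return cast(dchar)(languageBase + b); // e.g. a Cyrillic run based near U+0400
}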

This is O(log N), has lots of branching so no pipelining (every stage depends on the result of the stage before), lots of random memory access so lots of cache misses, lots of additional memory requirements to store all those tables, and an algorithm that isn't even any easier to understand.
Wrong on practically every count, as detailed above.

Plus every other algorithm to operate on it except for decoding is insanely complicated.
They are still much _less_ complicated than UTF-8; that's the comparison that matters.
