Re: Why UTF-8/16 character encodings?

Diggory Sat, 25 May 2013 12:45:26 -0700

On Saturday, 25 May 2013 at 19:02:43 UTC, Joakim wrote:

On Saturday, 25 May 2013 at 18:09:26 UTC, Diggory wrote:
On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
I think you are a little confused about what Unicode actually is... Unicode has nothing to do with code pages, and nobody uses code pages any more except for compatibility with legacy applications (with good reason!).
Incorrect.
"Unicode is an effort to include all characters from previous code pages into a single character enumeration that can be used with a number of encoding schemes... In practice the various Unicode character set encodings have simply been assigned their own code page numbers, and all the other code pages have been technically redefined as encodings for various subsets of Unicode."
http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode
That confirms exactly what I just said...
No, that directly _contradicts_ what you said about Unicode having "nothing to do with code pages." All UCS did is take a bunch of existing code pages and standardize them into one massive character set. For example, ISCII was a pre-existing single-byte encoding and Unicode "largely preserves the ISCII layout within each block."
http://en.wikipedia.org/wiki/ISCII
All a code page is is a table of mappings; UCS is just a much larger, standardized table of such mappings.

UCS does have nothing to do with code pages; it was designed as a replacement for them. A code page is a strict subset of the possible characters; UCS is the entire set of possible characters.

You said that phobos converts UTF-8 strings to UTF-32 before operating on them but that's not true. As it iterates over UTF-8 strings it iterates over dchars rather than chars, but that's not in any way inefficient so I don't really see the problem.
And what's a dchar?  Let's check:

dchar : unsigned 32 bit UTF-32
http://dlang.org/type.html
Of course that's inefficient, you are translating your whole encoding over to a 32-bit encoding every time you need to process it. Walter as much as said so up above.
Given that all the machine registers are at least 32 bits already it doesn't make the slightest difference. The only additional operations on top of ASCII are when it's a multi-byte character, and even then it's some simple bit manipulation which is as fast as any variable width encoding is going to get.
I see you've abandoned without note your claim that phobos doesn't convert UTF-8 to UTF-32 internally. Perhaps converting to UTF-32 is "as fast as any variable width encoding is going to get" but my claim is that single-byte encodings will be faster.

I haven't "abandoned my claim". It's a simple fact that phobos does not convert UTF-8 strings to UTF-32 strings before it uses them.


i.e. the difference between this:
import std.conv : to;
string mystr = ...;
dstring temp = mystr.to!dstring;  // allocates and fills a complete UTF-32 copy
for (size_t i = 0; i < temp.length; ++i)
    process(temp[i]);

and this:
import std.utf : decode;
string mystr = ...;
size_t i = 0;
while (i < mystr.length) {
    dchar current = decode(mystr, i);  // decodes one code point and advances i
    process(current);
}

And if you can't see why the latter example is far more efficient I give up...
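For reference, the "simple bit manipulation" involved in decoding one UTF-8 character can be sketched roughly as below. This is only an illustration, not the Phobos implementation: `decodeOne` is a name made up for this example, and it assumes well-formed input and does no validation at all.

```d
// Hypothetical helper, NOT the Phobos implementation: decodes one code point
// from well-formed UTF-8 starting at s[i] and advances i past it.
// No validation is performed (overlong forms, truncated input, etc.).
dchar decodeOne(const(char)[] s, ref size_t i)
{
    immutable uint b0 = s[i++];
    if (b0 < 0x80)                  // 0xxxxxxx: ASCII, a single byte
        return cast(dchar) b0;

    uint cp;                        // code point being assembled
    int extra;                      // number of continuation bytes to read
    if ((b0 & 0xE0) == 0xC0)      { cp = b0 & 0x1F; extra = 1; }  // 110xxxxx
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }  // 1110xxxx
    else                          { cp = b0 & 0x07; extra = 3; }  // 11110xxx

    foreach (_; 0 .. extra)
        cp = (cp << 6) | (s[i++] & 0x3F);  // 10xxxxxx: six payload bits each
    return cast(dchar) cp;
}
```

The ASCII case returns after a single comparison; the multi-byte cases add a few masks and shifts over bytes that sit next to each other in memory.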

The only alternatives to a variable width encoding I can see are:
- Single code page per string
This is completely useless because now you can't concatenate strings of different code pages.
I wouldn't be so fast to ditch this. There is a real argument to be made that strings of different languages are sufficiently different that there should be no multi-language strings. Is this the best route? I'm not sure, but I certainly wouldn't dismiss it out of hand.
- Multiple code pages per string
This just makes everything overly complicated and is far slower to decode what the actual character is than UTF-8.
I disagree, this would still be far faster than UTF-8, particularly if you designed your header right.

The cache misses alone caused by simply accessing the separate headers would be a larger overhead than decoding UTF-8, which takes a few assembly instructions, has perfect locality and can be efficiently pipelined by the CPU.

Then there's all the extra processing involved in combining the headers when you concatenate strings. Plus you lose the one benefit a fixed width encoding has, because random access is no longer possible without first finding out which header controls the location you want to access.
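As a rough illustration of that random-access cost, here is a sketch of what a lookup might look like under such a scheme. The `Run` struct, its fields and `codePageAt` are all hypothetical; nobody in this thread has specified an actual header format, so treat this as one possible layout rather than the proposal itself.

```d
// Hypothetical layout for a "multiple code pages per string" scheme:
// one byte per character, plus a header listing where each code page begins.
struct Run
{
    size_t start;     // index of the first byte covered by this run
    ushort codePage;  // made-up code page identifier
}

// Finds the code page governing byte `index` by binary search over the runs.
// Assumes runs is non-empty, sorted by start, and runs[0].start == 0.
ushort codePageAt(const(Run)[] runs, size_t index)
{
    size_t lo = 0, hi = runs.length;
    while (hi - lo > 1)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (runs[mid].start <= index)
            lo = mid;   // run mid starts at or before index
        else
            hi = mid;   // run mid starts after index
    }
    return runs[lo].codePage;
}
```

Even in this best case, every indexed access pays for a search through a separate table before the byte can be interpreted, and concatenation has to merge two such tables and shift the offsets of the second.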

- String with escape sequences to change code page
Can no longer access characters in the middle or end of the string, you have to parse the entire string every time which completely negates the benefit of a fixed width encoding.
I didn't think of this possibility, but you may be right that it's sub-optimal.
Also your complaint that UTF-8 reserves the short characters for the English alphabet is not really relevant - the characters with longer encodings tend to be rarer (such as special symbols) or carry more information (such as Chinese characters where the same sentence takes only about 1/3 the number of characters).
The vast majority of non-English alphabets in UCS can be encoded in a single byte. It is your exceptions that are not relevant.
Well obviously... That's like saying "if you know what the exact contents of a file are going to be anyway you can compress it to a single byte!"
i.e. It's possible to devise an encoding which will encode any given string to an arbitrarily small size. It's still completely useless because you'd have to know the string in advance...
No, it's not the same at all. The contents of an arbitrary-length file cannot be compressed to a single byte, you would have collisions galore. But since most non-English alphabets are less than 256 characters, they can all be uniquely encoded in a single byte per character, with the header determining what language's code page to use. I don't understand your analogy whatsoever.

It's very simple - the more information about the type of data you are compressing you have at the time of writing the algorithm, the better compression ratio you can get, to the point that if you know exactly what the file is going to contain you can compress it to nothing. This is why you have specialised compression algorithms for images, video, audio, etc.

It doesn't matter how few characters non-English alphabets have - unless you know WHICH alphabet it is beforehand you can't store it in a single byte. Since any given character could be in any alphabet the best you can do is look at the probabilities of different characters appearing and use shorter representations for more common ones. (This is the basis for all lossless compression.) The English alphabet plus 0-9 and basic punctuation are by far the most common characters used on computers so it makes sense to use one byte for those and multiple bytes for rarer characters.
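To put rough numbers on the byte counts being argued about, here is a small sketch that totals the UTF-8 size of a string one code point at a time. `utf8Bytes` is a name made up for this example and the sample strings are my own; `codeLength` is the `std.utf` function that reports how many code units a given code point needs.

```d
import std.utf : codeLength;

// Hypothetical helper: bytes needed to store `text` as UTF-8,
// summed per code point.
size_t utf8Bytes(const(dchar)[] text)
{
    size_t total = 0;
    foreach (dchar c; text)
        total += codeLength!char(c);  // 1 to 4 code units per code point
    return total;
}
```

With this, `utf8Bytes("hello"d)` is 5 (one byte per ASCII letter), `utf8Bytes("привет"d)` is 12 (two bytes per Cyrillic letter) and `utf8Bytes("你好"d)` is 6 (three bytes per Chinese character).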

- A useful encoding has to be able to handle every Unicode character
- As I've shown the only space-efficient way to do this is using a variable length encoding like UTF-8
You haven't shown this.

If you had thought through your suggestion of multiple code pages per string you would see that I had.

- Given the frequency distribution of Unicode characters, UTF-8 does a pretty good job at encoding higher frequency characters in fewer bytes.
No, it does a very bad job of this. Every non-ASCII character takes at least two bytes to encode, whereas my single-byte encoding scheme would encode every alphabet with less than 256 characters in a single byte.

And strings with mixed characters would use lots of memory and be extremely slow. Such strings are common when using proper names, quotes, inline translations, graphical characters, etc. Not to mention the added complexity to actually implement the algorithms.
