On Saturday, 25 May 2013 at 18:09:26 UTC, Diggory wrote:
> On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
>> On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
>>> I think you are a little confused about what Unicode actually
>>> is... Unicode has nothing to do with code pages and nobody
>>> uses code pages any more except for compatibility with legacy
>>> applications (with good reason!).
>> Incorrect.
>> "Unicode is an effort to include all characters from previous
>> code pages into a single character enumeration that can be
>> used with a number of encoding schemes... In practice the
>> various Unicode character set encodings have simply been
>> assigned their own code page numbers, and all the other code
>> pages have been technically redefined as encodings for various
>> subsets of Unicode."
>> http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode
> That confirms exactly what I just said...
No, that directly _contradicts_ what you said about Unicode
having "nothing to do with code pages." All UCS did was take a
bunch of existing code pages and standardize them into one
massive character set. For example, ISCII was a pre-existing
single-byte encoding, and Unicode "largely preserves the ISCII
layout within each block."
http://en.wikipedia.org/wiki/ISCII
A code page is just a table of mappings; UCS is simply a much
larger, standardized table of such mappings.
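
To make that concrete, here is a rough sketch in D of what a
code page amounts to. The table below is hypothetical, invented
purely for illustration, not any actual standard's layout:

// A minimal sketch: a code page is nothing more than a
// byte-to-code-point table. The non-ASCII entry below is
// hypothetical, for illustration only.
immutable dchar[256] hypotheticalPage = () {
    dchar[256] t;
    foreach (i; 0 .. 128)
        t[i] = cast(dchar) i;   // the ASCII range passes through unchanged
    t[0xA1] = '\u0901';         // hypothetical mapping into some alphabet
    // ... remaining upper-half entries elided ...
    return t;
}();

dchar lookup(ubyte b) { return hypotheticalPage[b]; }

UCS simply merges many such tables into one enumeration large
enough that every table's characters get distinct code points.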
>>> You said that Phobos converts UTF-8 strings to UTF-32 before
>>> operating on them, but that's not true. As it iterates over
>>> UTF-8 strings it iterates over dchars rather than chars, but
>>> that's not in any way inefficient so I don't really see the
>>> problem.
>> And what's a dchar? Let's check:
>> dchar : unsigned 32 bit UTF-32
>> http://dlang.org/type.html
>> Of course that's inefficient: you are translating your whole
>> encoding over to a 32-bit encoding every time you need to
>> process it. Walter as much as said so up above.
> Given that all the machine registers are at least 32 bits
> already, it doesn't make the slightest difference. The only
> additional operations on top of ASCII are when it's a
> multi-byte character, and even then it's some simple bit
> manipulation, which is as fast as any variable-width encoding
> is going to get.
I see you've quietly abandoned your claim that Phobos doesn't
convert UTF-8 to UTF-32 internally. Perhaps converting to
UTF-32 is "as fast as any variable width encoding is going to
get," but my claim is that single-byte encodings will be faster.
> The only alternatives to a variable width encoding I can see
> are:
> - Single code page per string
> This is completely useless because now you can't concatenate
> strings of different code pages.
I wouldn't be so fast to ditch this. There is a real argument to
be made that strings of different languages are sufficiently
different that there should be no multi-language strings. Is
this the best route? I'm not sure, but I certainly wouldn't
dismiss it out of hand.
> - Multiple code pages per string
> This just makes everything overly complicated, and decoding
> the actual character is far slower than with UTF-8.
I disagree; this would still be far faster than UTF-8,
particularly if you designed your header right.
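
A hypothetical layout, purely my own invention for the sake of
illustration, might look like this:

// Each run of characters shares one code page; a small header
// records the runs. A sketch, not a worked-out format.
struct Run
{
    ubyte codePage;   // which 256-entry table applies to this run
    uint  length;     // number of single-byte characters in the run
}

struct PagedString
{
    Run[]   runs;     // typically one entry per language used
    ubyte[] data;     // exactly one byte per character throughout
}

Indexing stays O(1) within a run, and finding the right run is
a scan over a handful of header entries rather than a decode
pass over the text itself.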
> - String with escape sequences to change code page
> You can no longer access characters in the middle or end of
> the string; you have to parse the entire string every time,
> which completely negates the benefit of a fixed-width encoding.
I didn't think of this possibility, but you may be right that
it's sub-optimal.
>>> Also your complaint that UTF-8 reserves the short characters
>>> for the English alphabet is not really relevant - the
>>> characters with longer encodings tend to be rarer (such as
>>> special symbols) or carry more information (such as Chinese
>>> characters, where the same sentence takes only about 1/3 the
>>> number of characters).
>> The vast majority of non-English alphabets in UCS can be
>> encoded in a single byte. It is your exceptions that are not
>> relevant.
> Well obviously... That's like saying "if you know what the
> exact contents of a file are going to be anyway you can
> compress it to a single byte!"
> I.e., it's possible to devise an encoding which will encode
> any given string to an arbitrarily small size. It's still
> completely useless because you'd have to know the string in
> advance...
No, it's not the same at all. The contents of an
arbitrary-length file cannot be compressed to a single byte;
you would have collisions galore. But since most non-English
alphabets have fewer than 256 characters, they can all be
uniquely encoded in a single byte per character, with the
header determining which language's code page to use. I don't
understand your analogy whatsoever.
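
Decoding such a string is then a straight table lookup. A
sketch, assuming the header has already identified which
256-entry table applies:

// One output character per input byte; no multi-byte sequences
// to parse.
dchar[] decodePage(const(ubyte)[] data, const(dchar)[] page)
{
    assert(page.length == 256);   // one entry per possible byte
    auto result = new dchar[](data.length);
    foreach (i, b; data)
        result[i] = page[b];      // plain array indexing, no branches
    return result;
}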
> - A useful encoding has to be able to handle every Unicode
> character
> - As I've shown, the only space-efficient way to do this is
> using a variable length encoding like UTF-8
You haven't shown this.
> - Given the frequency distribution of Unicode characters,
> UTF-8 does a pretty good job at encoding higher frequency
> characters in fewer bytes.
No, it does a very bad job of this. Every non-ASCII character
takes at least two bytes to encode, whereas my single-byte
encoding scheme would encode every alphabet with fewer than
256 characters in a single byte.
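
This is easy to verify with std.utf: every code point at or
above U+0080 costs at least two bytes in UTF-8.

import std.stdio;
import std.utf : codeLength;

void main()
{
    // U+0061 'a', U+0434 Cyrillic de, U+4E2D CJK "middle"
    foreach (dchar c; "aд中")
        writefln("U+%04X -> %s byte(s) in UTF-8", c, codeLength!char(c));
    // prints 1, 2, and 3 bytes respectively
}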
> - Yes you COULD encode non-English alphabets in a single byte,
> but doing so would be inefficient because it would mean the
> more frequently used characters take more bytes to encode.
Not sure what you mean by this.