On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
I think you are a little confused about what Unicode actually is... Unicode has nothing to do with code pages, and nobody uses code pages any more except for compatibility with legacy applications (with good reason!).
Incorrect.

"Unicode is an effort to include all characters from previous code pages into a single character enumeration that can be used with a number of encoding schemes... In practice the various Unicode character set encodings have simply been assigned their own code page numbers, and all the other code pages have been technically redefined as encodings for various subsets of Unicode."
http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode


That confirms exactly what I just said...

You said that Phobos converts UTF-8 strings to UTF-32 before operating on them, but that's not true. When it iterates over UTF-8 strings it yields dchars rather than chars, but that's not in any way inefficient, so I don't really see the problem.
And what's a dchar?  Let's check:

dchar : unsigned 32 bit UTF-32
http://dlang.org/type.html

Of course that's inefficient: you are translating your whole encoding over to a 32-bit encoding every time you need to process it. Walter as much as said so up above.

Given that all the machine registers are at least 32 bits already, it doesn't make the slightest difference. The only additional operations on top of ASCII are for multi-byte characters, and even then it's some simple bit manipulation, which is as fast as any variable-width encoding is going to get.
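
To make that concrete, here's a minimal D sketch (the string literal is just an example) contrasting iteration over the raw UTF-8 code units with iteration by dchar: the latter decodes each code point to 32 bits on the fly, it does not build a separate UTF-32 copy of the string.

    import std.stdio;

    void main()
    {
        string s = "héllo";              // stored as UTF-8; 'é' is two code units

        // Raw code units: no decoding at all, 'é' shows up as the bytes C3 A9.
        foreach (char c; s)
            writef("%02X ", cast(ubyte) c);
        writeln();

        // Decoded on the fly: each iteration yields a 32-bit dchar (code point),
        // but the string itself stays UTF-8 and no UTF-32 array is allocated.
        foreach (dchar d; s)
            writef("U+%04X ", cast(uint) d);
        writeln();
    }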

The only alternatives to a variable-width encoding I can see are:
- Single code page per string
This is completely useless because now you can't concatenate strings of different code pages (see the sketch after this list).

- Multiple code pages per string
This just makes everything overly complicated, and decoding what the actual character is becomes far slower than with UTF-8.

- String with escape sequences to change code page
You can no longer access characters in the middle or at the end of the string; you have to parse the entire string every time, which completely negates the benefit of a fixed-width encoding.

- An encoding wide enough to store every character
This is just UTF-32.
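
For contrast with those alternatives, here's a small sketch (the sample strings are arbitrary) of how the variable-width approach handles the mixed-script case that trips up the single-code-page idea: strings in different scripts concatenate as plain arrays of code units, and the result is still one valid UTF-8 string.

    import std.stdio;
    import std.utf : validate;
    import std.range : walkLength;

    void main()
    {
        string english = "hello ";
        string greek   = "γειά ";
        string chinese = "你好";

        // Concatenation is just appending code units; no code-page bookkeeping.
        string mixed = english ~ greek ~ chinese;
        validate(mixed);    // would throw if the result were not valid UTF-8

        writefln("%s bytes, %s code points", mixed.length, mixed.walkLength);
    }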


Also, your complaint that UTF-8 reserves the short encodings for the English alphabet is not really relevant - the characters with longer encodings tend to be rarer (such as special symbols) or to carry more information (such as Chinese characters, where the same sentence takes only about 1/3 the number of characters).
The vast majority of non-English alphabets in UCS can be encoded in a single byte. It is your exceptions that are not relevant.

Well obviously... That's like saying "if you know what the exact contents of a file are going to be anyway you can compress it to a single byte!"

I.e., it's possible to devise an encoding which will encode any given string to an arbitrarily small size. It's still completely useless because you'd have to know the string in advance...

- A useful encoding has to be able to handle every Unicode character.
- As I've shown, the only space-efficient way to do this is a variable-length encoding like UTF-8.
- Given the frequency distribution of Unicode characters, UTF-8 does a pretty good job of encoding higher-frequency characters in fewer bytes.
- Yes, you COULD encode non-English alphabets in a single byte, but doing so would be inefficient because it would mean the more frequently used characters take more bytes to encode.
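
As a rough illustration of that last point, a small sketch (sample strings chosen arbitrarily) comparing UTF-8 byte counts with code-point counts across a few scripts: the ASCII-range text costs one byte per character, while the CJK text costs three bytes per code point but each code point also carries more information.

    import std.stdio;
    import std.range : walkLength;

    void main()
    {
        // UTF-8 bytes vs. number of code points for a few sample strings
        foreach (s; ["hello world", "grüß dich", "日本語の文"])
            writefln("%s: %s bytes, %s code points", s, s.length, s.walkLength);
    }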
