Re: Why UTF-8/16 character encodings?

Joakim Sat, 25 May 2013 01:10:26 -0700

On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:

I think you are a little confused about what Unicode actually is... Unicode has nothing to do with code pages and nobody uses code pages any more except for compatibility with legacy applications (with good reason!).

Incorrect.

"Unicode is an effort to include all characters from previous code pages into a single character enumeration that can be used with a number of encoding schemes... In practice the various Unicode character set encodings have simply been assigned their own code page numbers, and all the other code pages have been technically redefined as encodings for various subsets of Unicode."

http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode

Unicode is:
1) A standardised numbering of a large number of characters
2) A set of standardised algorithms for operating on these characters
3) A set of standardised encodings for efficiently encoding sequences of these characters

What makes you think I'm unaware of this? I have repeatedly differentiated between UCS (1) and UTF-8 (3).

You said that Phobos converts UTF-8 strings to UTF-32 before operating on them but that's not true. As it iterates over UTF-8 strings it iterates over dchars rather than chars, but that's not in any way inefficient so I don't really see the problem.

And what's a dchar? Let's check:

dchar : unsigned 32 bit UTF-32
http://dlang.org/type.html

Of course that's inefficient, you are translating your whole encoding over to a 32-bit encoding every time you need to process it. Walter as much as said so up above.
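To make concrete what iterating a UTF-8 string by code point involves, here is a rough Python sketch. It is not Phobos code: the function name decode_utf8 is hypothetical, and it assumes well-formed input with no validation.

```python
def decode_utf8(data: bytes):
    """Yield one code point (an int, comparable to a D dchar) per UTF-8 sequence.

    Assumption: input is well-formed UTF-8. Continuation bytes, overlong
    forms and surrogates are not validated.
    """
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:        # 0xxxxxxx: 1-byte sequence (ASCII)
            cp, extra = lead, 0
        elif lead < 0xE0:      # 110xxxxx: 2-byte sequence
            cp, extra = lead & 0x1F, 1
        elif lead < 0xF0:      # 1110xxxx: 3-byte sequence
            cp, extra = lead & 0x0F, 2
        else:                  # 11110xxx: 4-byte sequence
            cp, extra = lead & 0x07, 3
        for b in data[i + 1 : i + 1 + extra]:
            cp = (cp << 6) | (b & 0x3F)  # each continuation byte adds 6 payload bits
        yield cp
        i += 1 + extra


# Sample text: 'a', U+00E9, U+6587, U+1F600 (1-, 2-, 3- and 4-byte sequences).
text = "a\u00e9\u6587\U0001F600"
code_points = list(decode_utf8(text.encode("utf-8")))
print([hex(cp) for cp in code_points])
```

In this sketch the stored bytes stay UTF-8, and each code point is widened to a single integer only as it is visited, at the cost of a few shifts and masks per character. Whether that per-character cost counts as "inefficient" is the point in dispute here.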

Also your complaint that UTF-8 reserves the short characters for the English alphabet is not really relevant - the characters with longer encodings tend to be rarer (such as special symbols) or carry more information (such as Chinese characters where the same sentence takes only about 1/3 the number of characters).

The vast majority of non-English alphabets in UCS can be encoded in a single byte. It is your exceptions that are not relevant.
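For reference, a small Python sketch comparing byte counts for the same short strings in UTF-8 and in a legacy single-byte code page. The sample words and the choice of code pages (cp1251 for Cyrillic, cp1253 for Greek) are my own illustration, not something taken from the thread.

```python
# (sample text, single-byte code page that covers its alphabet)
samples = {
    "English": ("hello", "ascii"),
    "Russian": ("\u043f\u0440\u0438\u0432\u0435\u0442", "cp1251"),  # "привет"
    "Greek": ("\u03b3\u03b5\u03b9\u03b1", "cp1253"),                # "γεια"
}


def byte_counts(text, code_page):
    """Return (characters, bytes in UTF-8, bytes in the given code page)."""
    return len(text), len(text.encode("utf-8")), len(text.encode(code_page))


sizes = {name: byte_counts(text, cp) for name, (text, cp) in samples.items()}
for name, (chars, utf8, single) in sizes.items():
    print(f"{name}: {chars} chars, {utf8} bytes in UTF-8, {single} bytes in code page")

# Chinese characters in the basic plane take 3 bytes each in UTF-8.
chinese = "\u4f60\u597d"  # "你好"
chinese_utf8 = len(chinese.encode("utf-8"))
print(f"Chinese: {len(chinese)} chars, {chinese_utf8} bytes in UTF-8")
```

For these samples the Cyrillic and Greek text doubles in size under UTF-8 compared with its single-byte code page, while the English text is unchanged.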
