On Monday, 30 May 2016 at 17:35:36 UTC, Chris wrote:
On Monday, 30 May 2016 at 16:03:03 UTC, Marco Leise wrote:

*** http://site.icu-project.org/home#TOC-What-is-ICU-

I was actually talking about ICU with a colleague today. Could it be that Unicode itself is broken? I've often heard criticism of Unicode but never looked into it.

Part of it is the complexity of written language, part of it is bad technical decisions. Building the default string type in D around the horrible UTF-8 encoding was a fundamental mistake, both in terms of efficiency and complexity. I noted this in one of my first threads in this forum, and as Andrei said at the time, nobody agreed with me, with a lot of hand-waving about how efficiency wasn't an issue or that UTF-8 arrays were fine. Fast-forward a few years, and exactly the issues I raised are now causing pain.

UTF-8 is an antiquated hack that needs to be eradicated. It forces text in every language other than English to be twice as long, for no good reason. Have fun with that when you're downloading text on a 2G connection in the developing world. It is unnecessarily inefficient, which is precisely why auto-decoding is a problem. It is only a matter of time till UTF-8 is ditched.
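For anyone who wants to check the size overhead for themselves, here is a rough sketch in Python. The sample words and the helper name `utf8_bytes_per_char` are just illustrative choices, not anything from D or Phobos; the point is only to count how many bytes UTF-8 spends per code point in a few scripts.

```python
# Count UTF-8 bytes per code point for short samples in a few scripts.
# The sample words are arbitrary illustrations, not a benchmark.
samples = {
    "English": "hello",
    "Russian": "привет",
    "Chinese": "你好",
}


def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in text."""
    return len(text.encode("utf-8")) / len(text)


for language, text in samples.items():
    encoded = text.encode("utf-8")
    print(f"{language}: {len(text)} code points, {len(encoded)} bytes, "
          f"{utf8_bytes_per_char(text):.1f} bytes per code point")
```

With these samples, the ASCII text costs one byte per code point, the Cyrillic text two, and the Chinese text three.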

D devs should lead the way in getting rid of the UTF-8 encoding, not bicker about how to make it more palatable. I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_
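To make the header idea concrete, here is a minimal sketch in Python. Everything about the format is an assumption for illustration: the one-byte header, the made-up page IDs in `CODE_PAGES`, and the helper names `encode_tagged` and `decode_tagged` are hypothetical, and the existing ISO 8859 code pages merely stand in for whatever single-byte tables a real design would define.

```python
# Hypothetical scheme: one header byte selects a single-byte code page,
# and the payload then uses exactly one byte per character.
# The page IDs below are invented for this sketch.
CODE_PAGES = {
    1: "latin_1",    # Western European
    2: "iso8859_5",  # Cyrillic
    3: "iso8859_7",  # Greek
}


def encode_tagged(text: str, page_id: int) -> bytes:
    """Prefix the single-byte payload with the ID of its code page."""
    return bytes([page_id]) + text.encode(CODE_PAGES[page_id])


def decode_tagged(data: bytes) -> str:
    """Read the header byte, then decode the rest with that code page."""
    return data[1:].decode(CODE_PAGES[data[0]])


russian = "привет"
tagged = encode_tagged(russian, 2)

print(len(tagged))                    # 7: one header byte + six payload bytes
print(len(russian.encode("utf-8")))   # 12
assert decode_tagged(tagged) == russian
```

In this sketch a character outside the selected page raises `UnicodeEncodeError`; that is the point where a double-byte page or a second tagged run would have to take over.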

The common string-handling use case, by far, is strings with only one language, with a distant second some substrings in a second language, yet here we are putting the overhead into every character to allow inserting characters from an arbitrary language! This is madness.

Yes, the complexity of diacritics and combining characters will remain, but that is complexity that is inherent to the variety of written language. UTF-8 is not: it is just a bad technical decision, likely chosen for ASCII compatibility and some misguided notion that being able to combine arbitrary language strings with no other metadata was worthwhile. It is not.
