On Monday, 30 May 2016 at 17:35:36 UTC, Chris wrote:
On Monday, 30 May 2016 at 16:03:03 UTC, Marco Leise wrote:
*** http://site.icu-project.org/home#TOC-What-is-ICU-
I was actually talking about ICU with a colleague today. Could
it be that Unicode itself is broken? I've often heard criticism
of Unicode but never looked into it.
Part of it is the complexity of written language, part of it is
bad technical decisions. Building the default string type in D
around the horrible UTF-8 encoding was a fundamental mistake,
both in terms of efficiency and complexity. I noted this in one
of my first threads in this forum, and as Andrei said at the
time, nobody agreed with me, with a lot of hand-waving about how
efficiency wasn't an issue or that UTF-8 arrays were fine.
Fast-forward a few years and exactly the issues I raised are now
causing pain.
UTF-8 is an antiquated hack that needs to be eradicated. It
forces text in most languages other than English to take two or
more bytes per character, for no good reason. Have fun with that
when you're downloading text on a 2G connection in the developing
world. It is unnecessarily inefficient, which is precisely why
auto-decoding is a problem. It is only a matter of time till
UTF-8 is ditched.
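To put rough numbers on the size overhead, here is a small Python
sketch. The sample strings and the helper name utf8_bytes_per_char
are my own picks for illustration, not a benchmark. It just counts
how many bytes UTF-8 spends per code point: ASCII text costs one
byte each, Cyrillic two, and Devanagari or Chinese three.

```python
# Bytes per code point under UTF-8 for a few illustrative sample strings.
samples = {
    "English": "hello",
    "Russian": "привет",
    "Hindi": "नमस्ते",
    "Chinese": "你好",
}

def utf8_bytes_per_char(text):
    """Return the UTF-8 encoded size divided by the number of code points."""
    return len(text.encode("utf-8")) / len(text)

for language, text in samples.items():
    encoded_size = len(text.encode("utf-8"))
    print(f"{language}: {len(text)} code points, {encoded_size} bytes, "
          f"{utf8_bytes_per_char(text):.1f} bytes per code point")
```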
D devs should lead the way in getting rid of the UTF-8 encoding,
instead of bickering about how to make it more palatable. I
suggested a single-byte encoding for most languages, with
double-byte for the ones which wouldn't fit in a byte. Use some
kind of header or other metadata to combine strings of different
languages, _rather than encoding the language into every
character!_
The most common string-handling use case, by far, is strings in
only one language, with strings containing a few substrings in a
second language a distant second. Yet here we are putting the
overhead into every character to allow inserting characters from
an arbitrary language! This is madness.
Yes, the complexity of diacritics and combining characters will
remain, but that is complexity that is inherent to the variety of
written language. The complexity of UTF-8 is not: it is just a
bad technical decision, likely chosen for ASCII compatibility and
some misguided notion that being able to combine arbitrary
language strings with no other metadata was worthwhile. It is not.