Re: The Case Against Autodecode

Joakim via Digitalmars-d Tue, 31 May 2016 09:32:32 -0700

On Monday, 30 May 2016 at 17:35:36 UTC, Chris wrote:

On Monday, 30 May 2016 at 16:03:03 UTC, Marco Leise wrote:
*** http://site.icu-project.org/home#TOC-What-is-ICU-
I was actually talking about ICU with a colleague today. Could it be that Unicode itself is broken? I've often heard criticism of Unicode but never looked into it.

Part of it is the complexity of written language, part of it is bad technical decisions. Building the default string type in D around the horrible UTF-8 encoding was a fundamental mistake, both in terms of efficiency and complexity. I noted this in one of my first threads in this forum, and as Andrei said at the time, nobody agreed with me, with a lot of hand-waving about how efficiency wasn't an issue or that UTF-8 arrays were fine. Fast-forward years later and exactly the issues I raised are now causing pain.

UTF-8 is an antiquated hack that needs to be eradicated. It forces every language other than English to be twice as long, for no good reason. Have fun with that when you're downloading text on a 2G connection in the developing world. It is unnecessarily inefficient, which is precisely why auto-decoding is a problem. It is only a matter of time till UTF-8 is ditched.
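
For a rough sense of the size overhead, here is a minimal Python sketch that encodes the same short text in UTF-8 and in a legacy single-byte code page. The sample strings, the `byte_lengths` helper, and the choice of ISO 8859-5 as the single-byte stand-in are illustrative assumptions, not anything proposed in this thread.

```python
# Compare the byte length of the same text in UTF-8 and in a
# single-byte code page for that script (illustrative samples only).
SAMPLES = {
    "English": ("hello world", "ascii"),
    "Russian": ("привет мир", "iso8859_5"),
}


def byte_lengths(text, single_byte_codec):
    """Return (utf8_len, single_byte_len) for the given text."""
    utf8_len = len(text.encode("utf-8"))
    single_len = len(text.encode(single_byte_codec))
    return utf8_len, single_len


for name, (text, codec) in SAMPLES.items():
    utf8_len, single_len = byte_lengths(text, codec)
    print(f"{name}: {len(text)} chars, "
          f"UTF-8 {utf8_len} bytes, {codec} {single_len} bytes")
```

The 10-character Russian sample takes 19 bytes in UTF-8 (two per Cyrillic letter, one for the space) against 10 in ISO 8859-5, while the English sample is 11 bytes either way.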

D devs should lead the way in getting rid of the UTF-8 encoding, not bickering about how to make it more palatable. I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_
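
To make the header idea concrete, here is one way it could be sketched in Python. Everything about the layout is an assumption made up for illustration: each run is stored as a one-byte language tag, a one-byte length, and then one byte per character. The tag values are hypothetical, and Python's built-in ISO 8859 codecs merely stand in for the per-language single-byte tables; double-byte languages are left out.

```python
# Hypothetical tag values; the ISO 8859 codecs stand in for
# per-language single-byte tables.
CODECS = {0: "ascii", 1: "iso8859_5", 2: "iso8859_7"}


def encode_runs(runs):
    """Pack (tag, text) runs as: tag byte, length byte, payload bytes."""
    out = bytearray()
    for tag, text in runs:
        data = text.encode(CODECS[tag])
        if len(data) > 255:
            raise ValueError("run too long for a one-byte length field")
        out += bytes([tag, len(data)]) + data
    return bytes(out)


def decode_runs(blob):
    """Inverse of encode_runs: recover the list of (tag, text) runs."""
    runs = []
    i = 0
    while i < len(blob):
        tag, length = blob[i], blob[i + 1]
        start = i + 2
        runs.append((tag, blob[start:start + length].decode(CODECS[tag])))
        i = start + length
    return runs


mixed = [(0, "hello "), (1, "мир")]
assert decode_runs(encode_runs(mixed)) == mixed
```

In this layout the language costs two bytes per run rather than extra bytes per character, so a single-language string such as `[(1, "привет мир")]` packs into 12 bytes. The trade-off is that very short mixed runs pay the header each time.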

The common string-handling use case, by far, is strings with only one language, with a distant second some substrings in a second language, yet here we are putting the overhead into every character to allow inserting characters from an arbitrary language! This is madness.

Yes, the complexity of diacritics and combining characters will remain, but that is complexity that is inherent to the variety of written language. UTF-8 is not: it is just a bad technical decision, likely chosen for ASCII compatibility and some misguided notion that being able to combine arbitrary language strings with no other metadata was worthwhile. It is not.
