On Tue, 31 May 2016 16:29:33 +0000, Joakim <dl...@joakim.fea.st> wrote:
> Part of it is the complexity of written language, part of it is
> bad technical decisions. Building the default string type in D
> around the horrible UTF-8 encoding was a fundamental mistake,
> both in terms of efficiency and complexity. I noted this in one
> of my first threads in this forum, and as Andrei said at the
> time, nobody agreed with me, with a lot of hand-waving about how
> efficiency wasn't an issue or that UTF-8 arrays were fine.
> Fast-forward years later and exactly the issues I raised are now
> causing pain.

Maybe you can dig up your old post and we can look at each of your
complaints in detail.

> UTF-8 is an antiquated hack that needs to be eradicated. It
> forces all other languages than English to be twice as long, for
> no good reason. Have fun with that when you're downloading text
> on a 2G connection in the developing world. It is unnecessarily
> inefficient, which is precisely why auto-decoding is a problem.
> It is only a matter of time till UTF-8 is ditched.

You don't download twice the data. First of all, some languages had
two-byte encodings before UTF-8, and second, web content is full of
HTML syntax and gzip-compressed afterwards. Take this Thai Wikipedia
entry for example:

https://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2

The download of the gzipped HTML is 11% larger in UTF-8 than in the
Thai single-byte encoding TIS-620, and that is dwarfed by the size of
JS + images. (I don't have the numbers, but I expect the effective
overhead to be ~2%.) Ironically, a lot of symbols we take for granted
would then have to be implemented as HTML entities using their Unicode
code points (sic!). Among them are basic stuff like dashes, the degree
(°) and minute (′) signs, accents in names, non-breaking spaces and
footnote marks (↑).

> D devs should lead the way in getting rid of the UTF-8 encoding,
> not bickering about how to make it more palatable. I suggested a
> single-byte encoding for most languages, with double-byte for the
> ones which wouldn't fit in a byte. Use some kind of header or
> other metadata to combine strings of different languages, _rather
> than encoding the language into every character!_

That would have put D on an island. "Some kind of header" would be a
horrible mess to have in strings, because you have to account for it
when concatenating strings and scan for such headers all the time to
see whether there is some interspersed two-byte encoding in the
stream. That's hardly better than UTF-8. And yes, a huge number of
websites mix scripts, and a lot of other text uses the available
extra symbols like ° or α, β, γ. Two quick D sketches follow: one
putting a number on the actual per-character overhead, one showing
what header metadata does to plain string concatenation.
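To put a number on the per-character overhead, here is a minimal
sketch in D (the Thai sample word is my own pick, not taken from that
page):

import std.stdio : writefln;
import std.utf : count;

void main()
{
    // "Thailand" in Thai script: 9 characters, 1 byte each in TIS-620
    string thai = "ประเทศไทย";
    writefln("UTF-8 bytes: %s", thai.length); // 27 (3 bytes per code point)
    writefln("code points: %s", thai.count);  // 9
    // 3x the raw bytes of TIS-620, yet after HTML markup and gzip the
    // measured difference for the whole page was only ~11%.
}

So the raw penalty for Thai is real, but it is the penalty on the text
alone, before markup and compression dilute it.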
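And here is a sketch of why "some kind of header" is a mess. The
TaggedString type below is entirely hypothetical, invented only to
illustrate what happens to the simplest operation there is:

import std.stdio : writeln;

// Hypothetical single/double-byte string with a code-page header,
// purely illustrative, not a real proposal.
struct TaggedString
{
    ubyte codepage;             // header: which code page the payload uses
    immutable(ubyte)[] payload; // single- or double-byte encoded text
}

// Concatenation only stays simple while both sides share a code page.
TaggedString concat(TaggedString a, TaggedString b)
{
    if (a.codepage == b.codepage)
        return TaggedString(a.codepage, a.payload ~ b.payload);

    // Different code pages: the result can no longer be one flat byte
    // array. You need escape sequences or a list of (header, payload)
    // segments, and every consumer must scan for them from then on.
    assert(0, "cross-codepage concat needs per-segment metadata");
}

void main()
{
    // With UTF-8 the same operation is a plain append, and the result
    // is valid UTF-8 no matter which scripts are mixed:
    string mixed = "température " ~ "อุณหภูมิ " ~ "α β γ";
    writeln(mixed);
}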
> The common string-handling use case, by far, is strings with only
> one language, with a distant second some substrings in a second
> language, yet here we are putting the overhead into every
> character to allow inserting characters from an arbitrary
> language! This is madness.

No thanks. Madness was when we couldn't reliably open text files
because the encoding was stored nowhere, and when you had to compile
programs for each of a dozen code pages so that localized text would
render correctly. And your retro code-page system won't convince the
world to drop Unicode either.

> Yes, the complexity of diacritics and combining characters will
> remain, but that is complexity that is inherent to the variety of
> written language. UTF-8 is not: it is just a bad technical
> decision, likely chosen for ASCII compatibility and some misguided
> notion that being able to combine arbitrary language strings with
> no other metadata was worthwhile. It is not.

The web proves you wrong: scripts do get mixed often, be it on
Wikipedia, on a foreign-language learning site, or with mathematical
symbols.

--
Marco