On Tuesday, 31 May 2016 at 18:34:54 UTC, Jonathan M Davis wrote:
> On Tuesday, May 31, 2016 16:29:33 Joakim via Digitalmars-d wrote:
>> UTF-8 is an antiquated hack that needs to be eradicated. It forces
>> all languages other than English to be twice as long, for no good
>> reason. Have fun with that when you're downloading text on a 2G
>> connection in the developing world. It is unnecessarily inefficient,
>> which is precisely why auto-decoding is a problem. It is only a
>> matter of time till UTF-8 is ditched.
> Considering that *nix land uses UTF-8 almost exclusively, and many C
> libraries do even on Windows, I very much doubt that UTF-8 is going
> anywhere anytime soon - if ever. The Win32 API does use UTF-16, and
> Java and C# do, but the vast sea of code that is C or C++ generally
> uses UTF-8, as do plenty of other programming languages.
I agree that both UTF encodings are somewhat popular now.
> And even aside from English, most European languages are going to be
> more efficient with UTF-8, because they're still primarily ASCII even
> if they contain characters that are not. Stuff like Chinese is
> definitely worse in UTF-8 than it would be in UTF-16, but there are a
> lot of languages other than English which are going to encode better
> with UTF-8 than UTF-16 - let alone UTF-32.
And there are a lot more languages that will be twice as long as
English, i.e. ASCII.
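The size tradeoffs being argued here are easy to check. A quick sketch (in Python rather than D, purely to measure encoded sizes; the sample strings are my own, not from the thread):

```python
# Compare encoded sizes of the same text in UTF-8 vs UTF-16.
# Byte counts exclude any BOM ("-le" selects a fixed byte order, no BOM).
samples = {
    "English": "hello world",  # ASCII: 1 byte/char in UTF-8, 2 in UTF-16
    "Russian": "привет мир",   # Cyrillic letters: 2 bytes/char in both
    "Chinese": "你好，世界",    # CJK: 3 bytes/char in UTF-8, 2 in UTF-16
}
for lang, text in samples.items():
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))
    print(f"{lang:8} chars={len(text):2} utf8={u8:2} utf16={u16:2}")
```

This bears out both sides: ASCII-heavy text is half the size in UTF-8 (11 vs 22 bytes), Cyrillic is nearly a wash (19 vs 20), and CJK text pays a 50% penalty in UTF-8 (15 vs 10).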
> Regardless, UTF-8 isn't going anywhere anytime soon. _Way_ too much
> uses it for it to be going anywhere, and most folks have no problem
> with that. Any attempt to get rid of it would be a huge, uphill
> battle.
I disagree; it is inevitable. Any tech this complex and inefficient
cannot last long.
> But D supports UTF-8, UTF-16, _and_ UTF-32 natively - even without
> involving the standard library - so anyone who wants to avoid UTF-8
> is free to do so.
Yes, but not by using UTF-16/32, which use too much memory. I've
suggested a single-byte encoding for most languages instead, both
in my last post and the earlier thread.
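The memory objection is easy to quantify. A quick check (again in Python for illustration; the same arithmetic applies to D's char/wchar/dchar arrays):

```python
# For mostly-ASCII text, UTF-16 doubles and UTF-32 quadruples the storage,
# since UTF-32 spends a fixed 4 bytes on every code point.
text = "The quick brown fox jumps over the lazy dog"
print(len(text.encode("utf-8")))      # 43 bytes: 1 per ASCII character
print(len(text.encode("utf-16-le")))  # 86 bytes
print(len(text.encode("utf-32-le")))  # 172 bytes
```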
D could use this new encoding internally, while keeping its
current UTF-8/16 strings around for any outside UTF-8/16 data
passed in. Any of that data run through algorithms that don't
require decoding could be kept in UTF-8, but the moment any
decoding is required, D would translate UTF-8 to the new
encoding, which would be much easier for programmers to
understand and manipulate. If UTF-8 output is needed, you'd have
to encode back again.
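The proposed encoding itself isn't specified in this thread, so as a rough illustration only, here is a toy version of the translation layer described above - entirely my own construction, not the actual proposal. It tags a buffer with a legacy single-byte code page chosen per string and transcodes at the UTF-8 boundary:

```python
# Toy sketch of the translation layer described above -- NOT the actual
# proposed encoding. Incoming UTF-8 is transcoded to a tagged single-byte
# buffer (one byte per character for languages that fit a legacy code
# page), manipulated there, and encoded back to UTF-8 on output.
CODE_PAGES = ["latin-1", "iso8859-5", "iso8859-7"]  # Western, Cyrillic, Greek

def to_single_byte(utf8_bytes: bytes):
    """Pick the first code page that represents the text losslessly."""
    text = utf8_bytes.decode("utf-8")
    for page in CODE_PAGES:
        try:
            return page, text.encode(page)  # 1 byte per character
        except UnicodeEncodeError:
            continue
    return "utf-8", utf8_bytes  # fallback: leave the data as UTF-8

def to_utf8(page: str, data: bytes) -> bytes:
    """Encode a tagged single-byte buffer back to UTF-8 for output."""
    return data.decode(page).encode("utf-8")

page, buf = to_single_byte("привет мир".encode("utf-8"))
print(page, len(buf))  # iso8859-5 10 -- vs. 19 bytes as UTF-8
assert to_utf8(page, buf).decode("utf-8") == "привет мир"
```

Inside the single-byte buffer, indexing and slicing are O(1) per character, which is the manipulation win being claimed; the cost is exactly the boundary transcoding step described above.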
Yes, this translation layer would be a bit of a pain, but the new
encoding would be so much more efficient and understandable that it
would be worth it - and you're already decoding and encoding back to
UTF-8 for those algorithms now. All that would change is that the
default decoded form would be this new encoding rather than dchar. If
it succeeds in D, it could then be sold more widely as a replacement
for UTF-8/16.
I think this would be the right path forward, rather than navigating
the UTF-8/16 mess further.