On Tuesday, 31 May 2016 at 18:34:54 UTC, Jonathan M Davis wrote:
> On Tuesday, May 31, 2016 16:29:33 Joakim via Digitalmars-d wrote:
>> UTF-8 is an antiquated hack that needs to be eradicated. It forces all languages other than English to be twice as long, for no good reason. Have fun with that when you're downloading text on a 2G connection in the developing world. It is unnecessarily inefficient, which is precisely why auto-decoding is a problem. It is only a matter of time till UTF-8 is ditched.

> Considering that *nix land uses UTF-8 almost exclusively, and many C libraries do even on Windows, I very much doubt that UTF-8 is going anywhere anytime soon - if ever. The Win32 API does use UTF-16, and Java and C# do, but the vast sea of code that is C or C++ generally uses UTF-8, as do plenty of other programming languages.

I agree that both UTF encodings are somewhat popular now.

> And even aside from English, most European languages are going to be more efficient with UTF-8, because they're still primarily ASCII even if they contain characters that are not. Stuff like Chinese is definitely worse in UTF-8 than it would be in UTF-16, but there are a lot of languages other than English which are going to encode better with UTF-8 than UTF-16 - let alone UTF-32.

And there are a lot more languages whose text ends up twice as long in UTF-8 as English, i.e. ASCII, text.
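To put rough numbers on that, here's a quick byte-count comparison; the sample words are arbitrary, and the only Phobos facts relied on are that string.length counts UTF-8 code units and wstring.length counts UTF-16 code units:

import std.conv : to;
import std.stdio : writefln;

void main()
{
    // string.length = UTF-8 code units (bytes); wstring.length = UTF-16
    // code units, so multiplying by 2 gives bytes.
    foreach (sample; ["hello", "привет", "你好"])
    {
        writefln("%s: %s bytes as UTF-8, %s bytes as UTF-16",
                 sample, sample.length, sample.to!wstring.length * 2);
    }
    // hello:   5 bytes as UTF-8, 10 bytes as UTF-16
    // привет: 12 bytes as UTF-8, 12 bytes as UTF-16 (2x its ASCII-length equivalent)
    // 你好:     6 bytes as UTF-8,  4 bytes as UTF-16
}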

> Regardless, UTF-8 isn't going anywhere anytime soon. _Way_ too much uses it for it to be going anywhere, and most folks have no problem with that. Any attempt to get rid of it would be a huge, uphill battle.

I disagree; its replacement is inevitable. Any tech this complex and inefficient cannot last long.

> But D supports UTF-8, UTF-16, _and_ UTF-32 natively - even without involving the standard library - so anyone who wants to avoid UTF-8 is free to do so.

Yes, but not by using UTF-16/32, which use too much memory. I've suggested a single-byte encoding for most languages instead, both in my last post and the earlier thread.

D could use this new encoding internally, while keeping its current UTF-8/16 strings around for any outside UTF-8/16 data passed in. Any of that data run through algorithms that don't require decoding could be kept in UTF-8, but the moment any decoding is required, D would translate UTF-8 to the new encoding, which would be much easier for programmers to understand and manipulate. If UTF-8 output is needed, you'd have to encode back again.
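To make the shape of that layer concrete, here's a minimal sketch; the ScriptString type, the header byte, and the toScriptString/toUTF8 names are all hypothetical, not a concrete proposal, and only the ASCII script is actually mapped:

import std.utf : decode;

// Hypothetical internal form: a header byte names the script, and each
// payload byte indexes one character within that script.
struct ScriptString
{
    ubyte script;   // which script the payload bytes belong to
    ubyte[] data;   // one byte per character within that script
}

// Translate incoming UTF-8 into the internal single-byte form.  A real
// implementation would carry per-script mapping tables; this sketch only
// handles the trivial ASCII case.
ScriptString toScriptString(string utf8)
{
    ScriptString result;
    result.script = 0;                       // 0 = ASCII in this sketch
    for (size_t i = 0; i < utf8.length; )
    {
        dchar c = decode(utf8, i);           // existing Phobos decoding
        assert(c < 0x80, "non-ASCII would need a mapping table");
        result.data ~= cast(ubyte) c;
    }
    return result;
}

// Re-encode to UTF-8 whenever output for the outside world is needed.
string toUTF8(ScriptString s)
{
    // For the ASCII script, the payload bytes are already valid UTF-8.
    return cast(string) s.data.idup;
}

unittest
{
    auto internal = toScriptString("hello");
    assert(internal.data.length == 5);       // one byte per character
    assert(toUTF8(internal) == "hello");     // round-trips losslessly
}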

Yes, this translation layer would be a bit of a pain, but the new encoding would be so much more efficient and understandable that it would be worth it. You're already decoding to dchar and encoding back to UTF-8 for those algorithms now; all that changes is that the default intermediate encoding would be the new one rather than dchar. If it succeeds for D, it could then be sold more widely as a replacement for UTF-8/16.
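For comparison, this is what the existing round trip through dchar looks like today; nothing here is new, it just shows the decode/encode steps the proposal would redirect:

import std.conv : to;
import std.range : walkLength;

void main()
{
    string s = "héllo";            // 6 bytes of UTF-8
    assert(s.length == 6);         // .length counts UTF-8 code units
    assert(s.walkLength == 5);     // auto-decoding iterates 5 dchars

    dstring d = s.to!dstring;      // the explicit decode step
    assert(d.length == 5);
    assert(d.to!string == s);      // and the re-encode back to UTF-8
}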

I think this would be the right path forward, rather than continuing to navigate this UTF-8/16 mess.
