On Tuesday, 31 May 2016 at 18:34:54 UTC, Jonathan M Davis wrote:
> On Tuesday, May 31, 2016 16:29:33 Joakim via Digitalmars-d wrote:
>> UTF-8 is an antiquated hack that needs to be eradicated. It forces
>> all languages other than English to be twice as long, for no good
>> reason. Have fun with that when you're downloading text on a 2G
>> connection in the developing world. It is unnecessarily inefficient,
>> which is precisely why auto-decoding is a problem. It is only a
>> matter of time till UTF-8 is ditched.
> Considering that *nix land uses UTF-8 almost exclusively, and many C
> libraries do even on Windows, I very much doubt that UTF-8 is going
> anywhere anytime soon - if ever. The Win32 API does use UTF-16, and
> Java and C# do, but the vast sea of code that is C or C++ generally
> uses UTF-8, as do plenty of other programming languages.
I agree that both UTF encodings are somewhat popular now.
> And even aside from English, most European languages are going to be
> more efficient with UTF-8, because they're still primarily ASCII even
> if they contain characters that are not. Stuff like Chinese is
> definitely worse in UTF-8 than it would be in UTF-16, but there are a
> lot of languages other than English which are going to encode better
> with UTF-8 than UTF-16 - let alone UTF-32.
And there are a lot more languages that will be twice as long as
English, i.e. ASCII.
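The size tradeoffs being argued here are easy to check. A quick sketch (in Python rather than D, purely to measure encoded sizes; the sample strings are my own, not from the thread):

```python
# Compare encoded sizes of the same text in UTF-8 vs UTF-16.
# Byte counts exclude any BOM ("-le" selects a fixed byte order, no BOM).
samples = {
    "English": "hello world",  # ASCII: 1 byte/char in UTF-8, 2 in UTF-16
    "Russian": "привет мир",   # Cyrillic letters: 2 bytes/char in both
    "Chinese": "你好，世界",    # CJK: 3 bytes/char in UTF-8, 2 in UTF-16
}
for lang, text in samples.items():
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))
    print(f"{lang:8} chars={len(text):2} utf8={u8:2} utf16={u16:2}")
```

This bears out both sides: ASCII-heavy text is half the size in UTF-8 (11 vs 22 bytes), Cyrillic is nearly a wash (19 vs 20), and CJK text pays a 50% penalty in UTF-8 (15 vs 10).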
> Regardless, UTF-8 isn't going anywhere anytime soon. _Way_ too much
> uses it for it to be going anywhere, and most folks have no problem
> with that. Any attempt to get rid of it would be a huge, uphill
> battle.
I disagree; it is inevitable. Any tech this complex and inefficient
cannot last long.
> But D supports UTF-8, UTF-16, _and_ UTF-32 natively - even without
> involving the standard library - so anyone who wants to avoid UTF-8
> is free to do so.
Yes, but not by using UTF-16/32, which use too much memory. I've
suggested a single-byte encoding for most languages instead, both
in my last post and the earlier thread.
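The memory objection is easy to quantify. A quick check (again in Python for illustration; the same arithmetic applies to D's char/wchar/dchar arrays):

```python
# For mostly-ASCII text, UTF-16 doubles and UTF-32 quadruples the storage,
# since UTF-32 spends a fixed 4 bytes on every code point.
text = "The quick brown fox jumps over the lazy dog"
print(len(text.encode("utf-8")))      # 43 bytes: 1 per ASCII character
print(len(text.encode("utf-16-le")))  # 86 bytes
print(len(text.encode("utf-32-le")))  # 172 bytes
```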
D could use this new encoding internally, while keeping its
current UTF-8/16 strings around for any outside UTF-8/16 data
passed in. Any of that data run through algorithms that don't
require decoding could be kept in UTF-8, but the moment any
decoding is required, D would translate UTF-8 to the new
encoding, which would be much easier for programmers to
understand and manipulate. If UTF-8 output is needed, you'd have
to encode back again.
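The proposed encoding itself isn't specified in this thread, so as a rough illustration only, here is a toy version of the translation layer described above - entirely my own construction, not the actual proposal. It tags a buffer with a legacy single-byte code page chosen per string and transcodes at the UTF-8 boundary:

```python
# Toy sketch of the translation layer described above -- NOT the actual
# proposed encoding. Incoming UTF-8 is transcoded to a tagged single-byte
# buffer (one byte per character for languages that fit a legacy code
# page), manipulated there, and encoded back to UTF-8 on output.
CODE_PAGES = ["latin-1", "iso8859-5", "iso8859-7"]  # Western, Cyrillic, Greek

def to_single_byte(utf8_bytes: bytes):
    """Pick the first code page that represents the text losslessly."""
    text = utf8_bytes.decode("utf-8")
    for page in CODE_PAGES:
        try:
            return page, text.encode(page)  # 1 byte per character
        except UnicodeEncodeError:
            continue
    return "utf-8", utf8_bytes  # fallback: leave the data as UTF-8

def to_utf8(page: str, data: bytes) -> bytes:
    """Encode a tagged single-byte buffer back to UTF-8 for output."""
    return data.decode(page).encode("utf-8")

page, buf = to_single_byte("привет мир".encode("utf-8"))
print(page, len(buf))  # iso8859-5 10 -- vs. 19 bytes as UTF-8
assert to_utf8(page, buf).decode("utf-8") == "привет мир"
```

Inside the single-byte buffer, indexing and slicing are O(1) per character, which is the manipulation win being claimed; the cost is exactly the boundary transcoding step described above.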
Yes, this translation layer would be a bit of a pain, but the new
encoding would be so much more efficient and understandable that it
would be worth it - and you're already decoding and encoding back to
UTF-8 for those algorithms now. All that would change is that the
default decoded form would be this new encoding rather than dchar. If
it succeeds in D, it could then be sold more widely as a replacement
for UTF-8/16.
I think this would be the right path forward, rather than navigating
the UTF-8/16 mess further.