On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
> One of the first, and best, decisions I made for D was it would
> be Unicode front to back.
That is why I asked this question here. I think D is still one
of the few programming languages with this level of Unicode support.
> This is more a problem with the algorithms taking the easy way
> than a problem with UTF-8. You can do all the string
> algorithms, including regex, by working with the UTF-8 directly
> rather than converting to UTF-32. Then the algorithms work at
> full speed.
I call BS on this. There's no way that working with a variable-width
encoding can be as "full speed" as working with a constant-width one.
Perhaps you mean that the slowdown is minimal, but I doubt that
too.
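To make the random-access half of that concrete, here is a rough Python sketch (the function names are mine and purely illustrative, not from any library). Finding the n-th code point in UTF-8 means scanning from the start of the string, while a fixed-width encoding such as UTF-32 gets there with one multiplication. It says nothing about searching or regex, only about indexing.

```python
def utf8_offset_of(data: bytes, n: int) -> int:
    """Byte offset of the n-th code point (0-based) in valid UTF-8 data."""
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes have the form 0b10xxxxxx; any other byte
        # starts a new code point.
        if byte & 0xC0 != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("code point index out of range")


def utf32_offset_of(data: bytes, n: int) -> int:
    """Byte offset of the n-th code point (0-based) in BOM-less UTF-32 data."""
    if n < 0 or 4 * n >= len(data):
        raise IndexError("code point index out of range")
    # Every code point occupies exactly four bytes, so no scan is needed.
    return 4 * n


text = "naïve 日本語"
utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")  # little-endian, no byte order mark

# Index 6 is the first CJK character in the sample string.
start8 = utf8_offset_of(utf8, 6)
start32 = utf32_offset_of(utf32, 6)
print(start8, start32)
```

The UTF-8 version touches every byte before the target, so its cost grows with the index; the UTF-32 version does the same amount of work wherever the target is.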
> That was the go-to solution in the 1980's, they were called
> "code pages". A disaster.
My understanding is that code pages were a "disaster" because
they weren't standardized and often badly implemented. If you
used UCS with a single-byte encoding, you wouldn't have that
problem.
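For anyone who hasn't been bitten by it, the ambiguity I mean looks roughly like this in Python (a small sketch; the variable names are just for illustration). The same untagged byte decodes to a different character under each of three legacy code pages, so the reader has to guess which one the writer used.

```python
raw = bytes([0xE9])  # one byte, with nothing saying which code page it came from

# Three legacy single-byte code pages that ship with Python's codecs module.
code_pages = ["cp1252", "cp437", "cp1251"]
decoded = {name: raw.decode(name) for name in code_pages}

for name, char in decoded.items():
    # Show the Unicode code point each code page assigns to the same byte.
    print(f"{name}: U+{ord(char):04X} {char}")
```

With a single agreed-upon character set behind the bytes, that guessing step goes away, which is the distinction I'm drawing.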
>> with the few exceptional languages with more than 256
>> characters encoded in two bytes.
> Like those rare languages Japanese, Korean, Chinese, etc. This
> too was done in the 80's with "Shift-JIS" for Japanese, and
> some other wacky scheme for Korean, and a third nutburger one
> for Chinese.
Of course, you have to have more than one byte for those
languages, because they have more than 256 characters. So there
will be no compression gain over UTF-8/16 there, but a big
reduction in parsing complexity with a simpler encoding, particularly
when dealing with multi-language strings.
> I've had the misfortune of supporting all that in the old
> Zortech C++ compiler. It's AWFUL. If you think it's simpler,
> all I can say is you've never tried to write internationalized
> code with it.
Heh, I'm not saying "let's go back to badly defined code pages"
just because I'm saying "let's go back to single-byte encodings."
The two are separate arguments.
> UTF-8 is heavenly in comparison. Your code is automatically
> internationalized. It's awesome.
At what cost? Most programmers completely punt on Unicode,
because they just don't want to deal with the complexity.
Perhaps you can deal with it and don't mind the performance loss,
but I suspect you're in the minority.