On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
One of the first, and best, decisions I made for D was it would be Unicode front to back.
That is why I asked this question here. I think D is still one of the few programming languages with such Unicode support.

This is more a problem with the algorithms taking the easy way than a problem with UTF-8. You can do all the string algorithms, including regex, by working with the UTF-8 directly rather than converting to UTF-32. Then the algorithms work at full speed.
I call BS on this. There's no way working on a variable-width encoding can be as "full speed" as working on a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that too.
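For readers following the exchange, here is a minimal sketch of what "working with the UTF-8 directly" can look like for one algorithm, substring search. It assumes both inputs are valid UTF-8. The helper names find_utf8 and count_code_points are made up for illustration and are not from D or Phobos. The sketch only shows that the byte-level search is correct without decoding. It measures nothing about speed, and operations such as indexing by character position still need a scan, which is the cost the reply above is worried about.

```python
def find_utf8(haystack: bytes, needle: bytes) -> int:
    """Return the byte offset of needle in haystack, or -1 if it is absent.

    Assumes both arguments hold valid UTF-8. A lead byte can never be
    mistaken for a continuation byte, so a byte-level match of a complete
    needle always starts on a code point boundary.
    """
    return haystack.find(needle)


def count_code_points(data: bytes) -> int:
    """Count code points by skipping continuation bytes (0b10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


text = "naïve café, 日本語 text".encode("utf-8")
needle = "日本語".encode("utf-8")

byte_offset = find_utf8(text, needle)              # found without decoding
char_offset = count_code_points(text[:byte_offset])  # needs a linear scan
print(byte_offset, char_offset)
```

The search itself never decodes anything. Converting the byte offset into a character offset is the step that costs a pass over the data, and a fixed-width encoding would not need it.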

That was the go-to solution in the 1980's, they were called "code pages". A disaster.
My understanding is that code pages were a "disaster" because they weren't standardized and often badly implemented. If you used UCS with a single-byte encoding, you wouldn't have that problem.

> with the few exceptional languages with more than 256 characters encoded in two bytes.

Like those rare languages Japanese, Korean, Chinese, etc. This too was done in the 80's with "Shift-JIS" for Japanese, and some other wacky scheme for Korean, and a third nutburger one for Chinese.
Of course, you have to have more than one byte for those languages, because they have more than 256 characters. So there will be no compression gain over UTF-8/16 there, but parsing would still be much simpler with a simpler encoding, particularly when dealing with multi-language strings.
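To put rough numbers on the size argument, here is a small sketch that compares encoded lengths. The single-byte scheme proposed in this thread does not exist, so existing legacy code pages (latin-1, cp1251, shift_jis) stand in for it. That substitution is an assumption made for illustration, as are the sample strings and the helper name encoded_sizes.

```python
# Sample text paired with a legacy codec standing in for the hypothetical
# single-byte (or, for Japanese, double-byte) scheme discussed above.
SAMPLES = {
    "English": ("hello", "latin-1"),
    "Russian": ("привет", "cp1251"),
    "Japanese": ("こんにちは", "shift_jis"),
}


def encoded_sizes(text: str, legacy_codec: str) -> dict:
    """Return the encoded length in bytes under three encodings."""
    return {
        "legacy": len(text.encode(legacy_codec)),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # little-endian, no BOM
    }


for language, (text, codec) in SAMPLES.items():
    print(language, encoded_sizes(text, codec))
```

For these samples the Russian string takes half the bytes in cp1251 that it takes in UTF-8 or UTF-16. The Japanese string takes the same number of bytes in Shift-JIS as in UTF-16, and fewer than in UTF-8.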

I've had the misfortune of supporting all that in the old Zortech C++ compiler. It's AWFUL. If you think it's simpler, all I can say is you've never tried to write internationalized code with it.
Heh, saying "let's go back to single-byte encodings" is not the same as saying "let's go back to badly defined code pages." The two are separate arguments.

UTF-8 is heavenly in comparison. Your code is automatically internationalized. It's awesome.
At what cost? Most programmers completely punt on Unicode, because they just don't want to deal with the complexity. Perhaps you can deal with it and don't mind the performance loss, but I suspect you're in the minority.
