Re: Why UTF-8/16 character encodings?

Joakim Sat, 25 May 2013 00:35:30 -0700

On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:

One of the first, and best, decisions I made for D was it would be Unicode front to back.

That is why I asked this question here. I think D is still one of the few programming languages with such unicode support.

This is more a problem with the algorithms taking the easy way than a problem with UTF-8. You can do all the string algorithms, including regex, by working with the UTF-8 directly rather than converting to UTF-32. Then the algorithms work at full speed.

I call BS on this. There's no way working on a variable-width encoding can be as "full speed" as a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that also.
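A minimal sketch of the two positions, written in Python rather than D for illustration. The function names are hypothetical and no timings are implied. Substring search can run on the raw UTF-8 bytes without decoding, while locating the n-th code point needs a linear scan in UTF-8 and only a multiplication in UTF-32.

```python
def utf8_find(haystack: bytes, needle: bytes) -> int:
    """Return the byte offset of needle in haystack, or -1.

    No decoding is needed: lead bytes and continuation bytes use
    disjoint bit patterns, so a valid UTF-8 needle can only match
    at a code point boundary of a valid UTF-8 haystack.
    """
    return haystack.find(needle)


def utf8_offset_of_code_point(data: bytes, n: int) -> int:
    """Return the byte offset of the n-th code point (0-based).

    Variable width means every earlier byte has to be inspected.
    """
    count = 0
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:  # not a continuation byte: a code point starts here
            if count == n:
                return i
            count += 1
    raise IndexError(n)


def utf32_offset_of_code_point(n: int) -> int:
    """Constant width: the offset is a single multiplication."""
    return 4 * n


text = "naïve café 日本語"
data = text.encode("utf-8")

print(utf8_find(data, "café".encode("utf-8")))
print(utf8_offset_of_code_point(data, 11))
print(utf32_offset_of_code_point(11))
```

Which of the two operations dominates depends on the algorithm: searching and splitting rarely need random access by code point, while indexing by character position does.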

That was the go-to solution in the 1980's, they were called "code pages". A disaster.

My understanding is that code pages were a "disaster" because they weren't standardized and often badly implemented. If you used UCS with a single-byte encoding, you wouldn't have that problem.

> with the few exceptional languages with more than 256 characters encoded in two bytes.

Like those rare languages Japanese, Korean, Chinese, etc. This too was done in the 80's with "Shift-JIS" for Japanese, and some other wacky scheme for Korean, and a third nutburger one for Chinese.

Of course, you have to have more than one byte for those languages, because they have more than 256 characters. So there will be no compression gain over UTF-8/16 there, but a big gain in parsing complexity with a simpler encoding, particularly when dealing with multi-language strings.
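For a rough sense of the sizes involved, here is a small Python sketch that counts encoded bytes for a short Japanese sample. Shift-JIS is used only as a stand-in for "some two-byte scheme"; it is not the encoding proposed in this thread, and the sample strings and function name are illustrative assumptions.

```python
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le", "shift_jis"]


def encoded_sizes(text: str) -> dict:
    """Map each encoding name to the number of bytes text occupies in it."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


for sample in ["日本語", "D言語"]:
    print(sample, encoded_sizes(sample))
```

For these kanji a two-byte scheme matches UTF-16 at two bytes per character, while UTF-8 spends three bytes on each.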

I've had the misfortune of supporting all that in the old Zortech C++ compiler. It's AWFUL. If you think it's simpler, all I can say is you've never tried to write internationalized code with it.

Heh, I'm not saying "let's go back to badly defined code pages" because I'm saying "let's go back to single-byte encodings." The two are separate arguments.

UTF-8 is heavenly in comparison. Your code is automatically internationalized. It's awesome.

At what cost? Most programmers completely punt on unicode, because they just don't want to deal with the complexity. Perhaps you can deal with it and don't mind the performance loss, but I suspect you're in the minority.
