Re: Wide characters support in D

Ruslan Nikolaev Mon, 07 Jun 2010 23:50:56 -0700

> 
> Maybe "lousy" is too strong a word, but aside from
> compatibility with other 
> libs/software that use it (which I'll address separately),
> UTF-16 is not 
> particularly useful compared to UTF-8 and UTF-32:
...
>


I tried to avoid commenting on this because I am afraid we'll stray from the
main point (which is not a discussion of which Unicode encoding is better). But
in short I would say: "Not quite right". UTF-16, as already mentioned, is
generally faster for non-Latin letters (reading 2 bytes of aligned data takes
the same time as reading 1 byte). Although I am not familiar with Asian
languages, I believe that UTF-16 requires just 2 bytes instead of 3 for most
symbols. That is one of the reasons they don't like UTF-8. UTF-32 doesn't have
any advantage except for being fixed length. It carries a lot of unnecessary
memory, cache, etc. overhead (4 bytes for every character, which is the worst
case for both UTF-8 and UTF-16), and that is not justified for any language.
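
To make the size argument concrete, here is a minimal sketch in Python (chosen
only for brevity; it is not D code). The sample characters and the helper name
encoded_sizes are my own picks for illustration. It counts the bytes each
encoding needs for one code point:

```python
# Sample characters chosen for illustration (an assumption, not from the thread).
SAMPLES = {
    "Latin A": "A",
    "Cyrillic YA": "\u042f",
    "CJK U+65E5": "\u65e5",
    "Non-BMP U+1D11E": "\U0001D11E",
}


def encoded_sizes(ch):
    """Return (utf8, utf16, utf32) byte counts for a single code point."""
    # The "-le" codecs are used so that no byte order mark is counted.
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )


for name, ch in SAMPLES.items():
    u8, u16, u32 = encoded_sizes(ch)
    print(f"{name}: UTF-8={u8} UTF-16={u16} UTF-32={u32}")
```

The CJK sample takes 3 bytes in UTF-8 and 2 in UTF-16, while the non-BMP sample
takes 4 bytes in all three encodings. UTF-32 always pays that 4-byte cost.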

> 
> First of all, it's not exactly unheard of for big projects
> to make a 
> sub-optimal decision.

I would say the decision was quite optimal for many reasons, including that
"lousy programming" will not cause as many problems as it does in the case of
UTF-8.

> 
> Secondly, Java and Windows adapted 16-bit encodings back
> when many people 
> were still under the mistaken impression that would allow
> them to hold any 
> character in one code-unit. If that had been true, then it

I doubt that it was the only reason. UTF-8 was already available before Windows
NT was released. It would have been much easier to use UTF-8 in place of ANSI
than to create a parallel API. Nonetheless, UTF-16 was chosen. In addition, C#
was released when UTF-16 had already become variable length. I doubt that
conversion overhead (which is small compared to that of the VM) was the main
reason to preserve UTF-16.


Concerning why I say that it's good to have conversion to UTF-32 (you asked 
somewhere):

I think you did not correctly understand what I meant. It is a very common
practice, and in fact a required one, to convert from both UTF-8 and UTF-16 to
UTF-32 when you need to do character analysis (e.g. mbtowc() in C). In fact, it
is the only place where UTF-32 is commonly used and useful.
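
A rough sketch of that practice, again in Python rather than C or D: both
inputs are decoded to a list of code points (the UTF-32 view), and the analysis
runs on that list only. The helper names to_code_points and count_letters and
the sample string are hypothetical. This is an analogue of what mbtowc() is
used for, not a wrapper around it:

```python
import unicodedata


def to_code_points(data, encoding):
    """Decode encoded bytes into a list of code points (the UTF-32 view)."""
    return [ord(ch) for ch in data.decode(encoding)]


def count_letters(code_points):
    """Count code points whose Unicode general category is a letter (L*)."""
    return sum(
        1 for cp in code_points
        if unicodedata.category(chr(cp)).startswith("L")
    )


# Sample text (an assumption): ASCII, Cyrillic, and one non-BMP symbol.
text = "D \u044f\u0437\u044b\u043a \U0001D11E"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

from_utf8 = to_code_points(utf8_bytes, "utf-8")
from_utf16 = to_code_points(utf16_bytes, "utf-16-le")

# Both encodings yield the same code points, so the analysis code is shared.
print(from_utf8 == from_utf16, count_letters(from_utf8))
```

The variable-length details (multi-byte sequences, surrogate pairs) are handled
once, in the decoding step. The analysis never sees them.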
