Re: Nicest UTF

Andrew C. West Fri, 03 Dec 2004 03:15:35 -0800

On Thu, 2 Dec 2004 21:56:28 -0800, "Doug Ewell" wrote:
> 
> This thread amuses me.
>


Me too, but then most threads on this list do ;)

> 
> I also think that as more and more Han characters are encoded in the
> supplementary space, corresponding to the ever-growing repertoires of
> Eastern standards, the story that UTF-16 is virtually a fixed-width
> encoding because "supplementary code points are very rare in most text"
> will gradually go away.
> 

More and more mostly very obscure and rarely used Han ideographs. It does not
matter how many tens of thousands of additional CJK ideographs you add to the
supplementary planes, the vast majority of CJK users will still get by quite
happily with only CJK and CJK-A, which, as they are inherited from the important
legacy CJK encoding standards, are what most CJK users have been living with for
many years now. Of course people on this list, such as Richard Cook and myself,
find endless use for obscure and archaic ideographs, but in writing day-to-day
Chinese/Japanese/Korean there is no need to resort to CJK-B or CJK-C, except for
certain idiosyncratic (U+24B62 CEI4 is my personal favourite) or dialectal
usages, which are not typical.

Now that the number of allocated characters in planes 1, 2 and 14 (45,718
characters) is only a little smaller than the number of allocated characters in
the BMP (57,129) (and soon it will be greater), it is of course ridiculous to claim that
Unicode is basically a standard for 16-bit characters, but despite the large
number of supra-BMP characters they are, by definition, rarely used, and IMHO it
will remain true that "supplementary code points are very rare in most text". 
That is not to say that I think that it is OK for people to be lazy, and just
ignore everything outside the BMP. I strongly agree that all Unicode
implementations should cover all of Unicode, and not just the BMP, and it really
annoys me when they don't; but suggesting that you need to implement supra-BMP
characters because they are going to start popping up all over the place is
wrong in my opinion (not that Doug suggested that, but that's my extrapolation
of his point). Software developers need to implement supra-BMP characters
because some users (probably very few) will from time to time want to use them,
and software should allow people to do what they want.
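
For anyone who wants to check how rare supplementary code points are in their own text, here is a minimal sketch in Python 3. It assumes a Python 3 str, which iterates by code point rather than by UTF-16 code unit; the function names are hypothetical and the sample string is only an illustration.

```python
def supplementary_stats(text: str) -> tuple[int, int]:
    """Return (total code points, code points outside the BMP)."""
    total = len(text)
    # Anything above U+FFFF lives in a supplementary plane.
    supra = sum(1 for ch in text if ord(ch) > 0xFFFF)
    return total, supra


def utf16_code_units(text: str) -> int:
    """Return the number of 16-bit code units the text needs in UTF-16."""
    # Encode without a BOM; each code unit is two bytes.
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    # Two BMP ideographs followed by U+24B62 from CJK-B.
    sample = "\u6f22\u5b57\U00024B62"
    total, supra = supplementary_stats(sample)
    print(f"code points: {total}, supplementary: {supra}")
    print(f"UTF-16 code units: {utf16_code_units(sample)}")
```

The count of UTF-16 code units exceeds the count of code points only when supplementary characters are present, so software that treats one code unit as one character will miscount exactly those texts.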

Andrew
