On Thu, 2 Dec 2004 21:56:28 -0800, "Doug Ewell" wrote:
>
> This thread amuses me.
>
Me too, but then most threads on this list do ;)

>
> I also think that as more and more Han characters are encoded in the
> supplementary space, corresponding to the ever-growing repertoires of
> Eastern standards, the story that UTF-16 is virtually a fixed-width
> encoding because "supplementary code points are very rare in most text"
> will gradually go away.
>

More and more mostly very obscure and rarely used Han ideographs. It does not matter how many tens of thousands of additional CJK ideographs you add to the supplementary planes; the vast majority of CJK users will still get by quite happily with only CJK and CJK-A, which, as they are inherited from the important legacy CJK encoding standards, are what most CJK users have been living with for many years now.

Of course people on this list, such as Richard Cook and myself, find endless use for obscure and archaic ideographs, but in writing day-to-day Chinese/Japanese/Korean there is no need to resort to CJK-B or CJK-C, except for certain idiosyncratic (U+24B62 CEI4 is my personal favourite) or dialectal usages, which are not typical.

Now that the number of allocated characters in planes 1, 2 and 14 (45,718 characters) is only a little smaller than the number of allocated characters in the BMP (57,129), and will soon be greater, it is of course ridiculous to claim that Unicode is basically a standard for 16-bit characters. But despite the large number of supra-BMP characters, they are, by definition, rarely used, and IMHO it will remain true that "supplementary code points are very rare in most text".

That is not to say that I think that it is OK for people to be lazy, and just ignore everything outside the BMP.
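
To make the "virtually fixed-width" point concrete, here is a minimal Python sketch (not taken from any particular implementation) of what goes wrong if software assumes one 16-bit unit per character. The helper names `utf16_units`, `supplementary_count` and `surrogate_pair` are made up for illustration, and the sample string is just two BMP ideographs followed by U+24B62.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def supplementary_count(text: str) -> int:
    """Number of code points in text that lie outside the BMP (above U+FFFF)."""
    return sum(1 for ch in text if ord(ch) > 0xFFFF)


def surrogate_pair(cp: int) -> tuple:
    """Split a supplementary code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# Two BMP ideographs followed by one CJK-B ideograph (U+24B62).
sample = "\u4e2d\u6587" + chr(0x24B62)

print(len(sample))                  # code points: 3
print(utf16_units(sample))          # UTF-16 code units: 4
print(supplementary_count(sample))  # supra-BMP code points: 1
print([hex(u) for u in surrogate_pair(0x24B62)])  # ['0xd852', '0xdf62']
```

Code that indexes or truncates by 16-bit unit works for the first two characters and then splits the third in half, which is exactly the sort of rare-but-real failure that BMP-only software produces.
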
I strongly agree that all Unicode implementations should cover all of Unicode, and not just the BMP, and it really annoys me when they don't; but suggesting that you need to implement supra-BMP characters because they are going to start popping up all over the place is wrong in my opinion (not that Doug suggested that, but that's my extrapolation of his point). Software developers need to implement supra-BMP characters because some users (probably very few) will from time to time want to use them, and software should allow people to do what they want.

Andrew