From: "Paul Prescod" <[EMAIL PROTECTED]>
> > Why will they bother screaming loud enough? Unicode doesn't do what they want
> > and JIS/SJIS/EUC/whatever does.
>
> But where do they get their software?
>
> Microsoft and Sun are making it near impossible to use any character set
> other than Unicode internally with their recent APIs. So I'd like to know
> more about whether Japanese and Chinese people are really using
> something other than Unicode or whether they are just using variant
> encodings for data that their software treats internally as Unicode. I
> know for sure that Java and Mozilla treat everything inside as Unicode.
> I have strong reason to believe that goes for most Microsoft software
> also.
>
> If every bit of modern software only uses non-Unicode encodings but
> always works with the Unicode character set (And I don't know if that's
> true!) then why would it matter if programming languages had intrinsic
> support for non-Unicode character sets?
There are at least three real issues for Unicode v. regional encodings:
1) sort order: a regional encoding may have a code order that fits regional
conventions better. But the more rare characters one's data contains, the
less likely the local sort order is to match the code order of any
encoding or character set.
2) repertoire: Unicode has more characters. But there are no reliably
implemented conventions for round-tripping user-defined characters through
Unicode systems. So mixed local/Unicode systems probably create extra
problems for people who need extra characters. The magical view
of characters found in some Chinese societies, for example, places
a strong emphasis on not unifying characters: so there is a religious or
progressive aspect to repertoire as well.
3) file size: UTF-8 takes 50% more space than the common CJK encodings
for the common characters (three bytes per character rather than two),
while UTF-16 takes the same amount. However, compression shrinks text in
any of these encodings to pretty similar sizes.
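
The file-size figures in point 3 can be checked with a minimal sketch like
the one below. The sample string and the helper name encoded_sizes are
illustrative assumptions, not taken from any real data set; the codec names
are those of Python's standard library. It only counts raw bytes and does
not attempt to measure compression.

```python
# Count the bytes needed for the same text in several encodings.
# The sample string is an arbitrary choice: six common Japanese characters.
text = "日本語の文字"

def encoded_sizes(s, encodings=("shift_jis", "euc_jp", "utf-16-le", "utf-8")):
    """Return a dict mapping each encoding name to the encoded byte length."""
    return {name: len(s.encode(name)) for name in encodings}

sizes = encoded_sizes(text)
for name, size in sizes.items():
    print(f"{name:10s} {size} bytes")
```

For this string the two regional encodings and UTF-16 (without a byte order
mark) need two bytes per character, and UTF-8 needs three.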
On top of that, there is the bogus issue that Unicode is a Western imposition,
unrequested by the rest of the world--one could say the same thing about
computers or gasoline I suppose.
Anyway, the key is to separate encoding issues from character repertoire
issues: if you want to use one encoding that should be your business. But,
for any system where communication is key, the bottom line should be that
we can interoperate...and Unicode (the character set, not the UTF-* encodings)
provides the only feasible foundation for that.
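
As a rough illustration of keeping the encoding separate from the character
set, the sketch below converts between two regional encodings by passing
through Unicode strings. The function name transcode and the sample text are
hypothetical; only the standard Python codecs are assumed.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one encoding to another via a Unicode string."""
    # Decoding yields Unicode characters regardless of the source encoding.
    characters = data.decode(source)
    return characters.encode(target)

text = "日本語"
sjis_bytes = text.encode("shift_jis")
euc_bytes = text.encode("euc_jp")

# The byte sequences differ, but both decode to the same Unicode characters.
same_characters = sjis_bytes.decode("shift_jis") == euc_bytes.decode("euc_jp")

# Converting Shift_JIS to EUC-JP through Unicode gives the EUC-JP bytes.
converted = transcode(sjis_bytes, "shift_jis", "euc_jp")
print(same_characters, converted == euc_bytes)
```

This works only for characters that both encodings and Unicode agree on;
user-defined characters, as noted in point 2, are not covered by it.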
Cheers
Rick Jelliffe