The day somebody asks you why Java needs to be replaced, one answer will be 'it only supports 16-bit chars'. Laughable as it might seem, it's true.
Yes, people, a Unicode char is not 16 bits (as I always thought!) but 32!!
This is a misconception. Unicode is an odd mixture: at the same time it defines codepoints for representing characters and "surrogate characters" for encoding non-baseplane characters (whose codepoints don't fit into 16 bits).
ISO 10646 originally intended to use a full 32 bits for 2^32 characters. Because of slow progress and complaints about "wasting space", the Unicode consortium was formed, which made quick progress on specifying a 16-bit character set. The surrogate characters were built in in case more than 2^16 characters ever turned up, and to give people plenty of room to experiment for themselves in the "private areas" there. Meanwhile, ISO 10646 and Unicode converged: ISO limited the charset to 0x110000 characters, which should be enough for everyone, and Unicode dropped the "16-bit charset" notion; they just define codepoints. Unfortunately for them, they can't undo the surrogate character mess and other wicked problems they would now like to get rid of (singletons, certain compatibility characters, some presentation forms, ligatures).
A Java "char" variable can't hold non-baseplane Unicode charaters, but Java strings can. For Sun JVMs, they are basically a UTF-16 encoded Unicode strings. BTW there are JVMs out there which use UTF-8 in Java Strings, the same way strings are stored in class files.
The point is of course: can the runtime libraries handle non-baseplane characters? The java.text.BreakIterator can, but that's no magic. I have no idea whether, for example, the AWT display routines can display non-baseplane characters, mainly because I have yet to get an appropriate font. The TTF Unicode mapping tables allocate, lo and behold, 16 bits for the character. Who's complaining about Java?
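For illustration, a sketch of the BreakIterator point, assuming a JDK whose character-break rules group surrogate pairs: the clef comes back as one two-char unit, not as two broken halves.

    import java.text.BreakIterator;

    // Sketch: iterating character boundaries over a string that contains
    // a non-baseplane character. The surrogate pair is reported as a
    // single unit by the character-break iterator.
    public class BreakDemo {
        public static void main(String[] args) {
            String s = "a" + new String(Character.toChars(0x1D11E)) + "b";
            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                System.out.println(s.substring(start, end)); // "a", the clef pair, "b"
            }
        }
    }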
BTW Mozilla can't deal with non-baseplane characters either, to the chagrin of the MathML folks who use them for mathematical presentation forms. Guess what's the main reason, beside fonts: C's wchar_t is 16 bit too.
Now, if you thought you could take the characters() SAX event and create a String out of it and do something useful with it (like print it, for example), forget it. The result will very likely not be the one you expect.
That's an interesting observation. I never had problems in this area, but that may have something to do with the fact that I never went outside the Unicode baseplane with my chars.
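To spell out why the quoted complaint can bite once you do leave the baseplane: a parser is allowed to deliver the text content of an element over several characters() callbacks, and nothing stops the split from landing in the middle of a surrogate pair. A hedged sketch of the usual workaround (generic DefaultHandler code, not anything from a particular parser): buffer the chars and only build the String once the element ends.

    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    // Sketch: don't turn each characters() call into a String on its own;
    // a single call may end with half a surrogate pair. Buffer first.
    public class TextCollector extends DefaultHandler {
        private final StringBuffer buffer = new StringBuffer();

        public void characters(char[] ch, int start, int length) {
            buffer.append(ch, start, length);      // just collect the raw chars
        }

        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            String text = buffer.toString();       // complete UTF-16 text of the element
            System.out.println(text);              // now safe to print or process
            buffer.setLength(0);
        }
    }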
Another reason not to use Strings at all.
Strings are bad, of course :-) Seriously, though, Strings are another matter. In fact, Strings should be preferred over char arrays because they can hide the actual representation of the Unicode string. If you use char arrays, you have to deal with surrogate character pairs yourself. A substring() could be implemented to deal with non-baseplane characters correctly. Of course, Java was invented when people thought of Unicode as a 16-bit charset, and the standardized behaviour is that the String methods operate on the internal char array.
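As a sketch of what such a codepoint-aware substring() would have to get right, using offsetByCodePoints() and codePointCount(), which are JDK 5 additions and not part of the original String behaviour described above:

    // Sketch: naive char-based indexing can cut a surrogate pair in half,
    // while codepoint-based indexing keeps the representation hidden.
    public class SubstringDemo {
        public static void main(String[] args) {
            String s = new String(Character.toChars(0x1D11E)) + "x"; // clef + 'x'
            System.out.println(s.substring(0, 1));    // lone high surrogate: broken text
            int end = s.offsetByCodePoints(0, 1);     // char index after the first codepoint
            System.out.println(s.substring(0, end));  // the whole clef, two chars
            System.out.println(s.codePointCount(0, s.length())); // 2 codepoints in 3 chars
        }
    }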
J.Pietschmann
