jsona laio <[EMAIL PROTECTED]> wrote on Thu, 30 Oct 2003 07:55:17 +0000:
>however, lately i want to participate a porject, in
>which involves developing encoding like CCCII (CJK
>based characters set for asian characters, which
>defines more characters than unicode supports in the
>parts of CJK). however, as i know, java vm is based
>upon unicode.

What exactly do you mean when you say "Unicode"? It seems that you think
that Java uses only U+0000 .. U+FFFF (the "Basic Multilingual Plane",
which is the range that UCS-2 covers). As far as I know, that was true in
the past, but the restriction has since been lifted. Nowadays, Java uses
the full Unicode character set. The following is quoted from [1]:

>The native coded character set of the Java programming language is that
>of the first seventeen planes of the Unicode version 3.0 character set;
>that is, it consists in the basic multilingual plane (BMP) of Unicode
>version 1 plus the next sixteen planes of Unicode version 3. This is
>because the language's internal representation of characters uses the
>UTF-16 encoding, which encodes the BMP directly and uses surrogate pairs,
>a simple escape mechanism, to encode the other planes. Hence a charset in
>the Java platform defines a mapping between sequences of sixteen-bit
>values in UTF-16 and sequences of bytes.

Basically, you are free to use any Unicode code point that can be mapped
to and from UTF-16.

You might also want to have a look at Unicode 4.0, which adds many more
code points for CJK ideographs. (By the way, CCCII is listed among the
"Source Standards and Specifications" of Unicode 4.0, chapter R.1, page
1385; also available online from the Unicode site.)

But if you really need ideographs that are not covered by Unicode, there
are the "Private Use Areas". These are large ranges of code points that
Unicode reserves for private use and will never assign to standard
characters.
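To make the surrogate mechanism and the private use ranges concrete, here
is a minimal sketch. The class and method names (SurrogateDemo, toUtf16,
isPrivateUse) are made up for illustration and are not part of any
library; the arithmetic follows the UTF-16 definition, and the three
ranges are the private use ranges of the Unicode standard.

```java
class SurrogateDemo {
    /** Encodes one Unicode scalar value as one or two UTF-16 code units. */
    static char[] toUtf16(int codePoint) {
        boolean surrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
        if (codePoint < 0 || codePoint > 0x10FFFF || surrogate) {
            throw new IllegalArgumentException("not a Unicode scalar value");
        }
        if (codePoint < 0x10000) {
            // The BMP is encoded directly as a single 16-bit value.
            return new char[] { (char) codePoint };
        }
        // Planes 1..16: split the 20-bit offset into two 10-bit halves.
        int offset = codePoint - 0x10000;
        char high = (char) (0xD800 + (offset >> 10));
        char low = (char) (0xDC00 + (offset & 0x3FF));
        return new char[] { high, low };
    }

    /** True if the code point lies in one of the three private use ranges. */
    static boolean isPrivateUse(int codePoint) {
        return (codePoint >= 0xE000 && codePoint <= 0xF8FF)
            || (codePoint >= 0xF0000 && codePoint <= 0xFFFFD)
            || (codePoint >= 0x100000 && codePoint <= 0x10FFFD);
    }

    public static void main(String[] args) {
        // U+20000 lies outside the BMP, so it needs a surrogate pair.
        char[] units = toUtf16(0x20000);
        System.out.println(Integer.toHexString(units[0]) + " "
            + Integer.toHexString(units[1])); // prints "d840 dc00"
    }
}
```

An ideograph that Unicode lacks could be given a code point for which
isPrivateUse returns true; both sides of the exchange then have to agree
on what that code point means.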
If you want to write a converter between some character encoding and
Unicode (possibly using code points from a Private Use Area, if Unicode
does not provide a code point for a specific ideograph), please have a
look at the java.nio.charset package. I think that GNU Classpath would be
glad to accept converters.

The distinction between character sets and character encodings can cause
a lot of confusion. A helpful introduction is [2].

[1] http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html
[2] http://www.unicode.org/standard/principles.html

Best regards,

-- Sascha

Sascha Brawer, [EMAIL PROTECTED], http://www.dandelis.ch/people/brawer/
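As a rough illustration of what a converter built on java.nio.charset
looks like, here is a decode-only sketch. Everything specific in it is
an assumption for the sake of the example: the name "X-TOY-CJK", the
class ToyCharset, and the two-byte mapping rules are invented and have
nothing to do with the real CCCII tables. Only the Charset,
CharsetDecoder and CoderResult classes come from the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

/** Decode-only sketch of a made-up two-byte encoding (not CCCII). */
class ToyCharset extends Charset {
    // Assumption: ideographs missing from Unicode are parked at U+E000 and up.
    static final int PRIVATE_BASE = 0xE000;

    ToyCharset() {
        super("X-TOY-CJK", new String[0]); // hypothetical name, no aliases
    }

    public boolean contains(Charset cs) {
        return cs instanceof ToyCharset;
    }

    public boolean canEncode() {
        return false; // encoding is left out to keep the sketch short
    }

    public CharsetEncoder newEncoder() {
        throw new UnsupportedOperationException("decode-only sketch");
    }

    public CharsetDecoder newDecoder() {
        // Two bytes yield one char: 0.5 on average; 1.0 is a safe upper bound.
        return new CharsetDecoder(this, 0.5f, 1.0f) {
            protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (in.remaining() >= 2) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    int start = in.position();
                    int hi = in.get(start) & 0xFF;
                    int lo = in.get(start + 1) & 0xFF;
                    char c;
                    if (hi == 0x00) {
                        c = (char) lo;                  // made-up rule: 00 xx -> U+00xx
                    } else if (hi == 0x80) {
                        c = (char) (PRIVATE_BASE + lo); // made-up rule: 80 xx -> U+E0xx
                    } else {
                        return CoderResult.unmappableForLength(2);
                    }
                    out.put(c);
                    in.position(start + 2); // consume the pair only after output
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }
}
```

With this sketch, new ToyCharset().decode(ByteBuffer.wrap(bytes)) turns
the bytes 00 41 80 05 into the two chars U+0041 and U+E005. A real
converter would also need an encoder and table-driven mappings, and
would normally be made available through a
java.nio.charset.spi.CharsetProvider.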

