But then, it's my day to be an idiot... Of course an int can store more than 16 bits. It's char that's defined as 0..65535 in Java. An int will work fine in the APIs. It's the chars that are a problem.
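To make the char/int distinction concrete, here is a minimal sketch. The class and method names are hypothetical, chosen only for illustration; it relies on nothing beyond java.lang.

```java
public class CharVersusInt {

    // True when the scalar value fits in a single 16-bit Java char.
    static boolean fitsInChar(int scalar) {
        return scalar >= 0 && scalar <= Character.MAX_VALUE;
    }

    // Narrowing an int to char keeps only the low 16 bits.
    static int truncateToChar(int scalar) {
        return (char) scalar;
    }

    public static void main(String[] args) {
        // char is an unsigned 16-bit type, so its largest value is 0xFFFF.
        System.out.println((int) Character.MAX_VALUE);   // prints 65535

        // int is a signed 32-bit type, so U+10000 fits in an int but not in a char.
        int scalar = 0x10000;
        System.out.println(fitsInChar(scalar));          // prints false
        System.out.println(truncateToChar(scalar));      // prints 0
    }
}
```

So an API that passes an int around can carry U+10000 intact, while anything that stores it in a char silently loses the high bits.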
Must be the heat. ;-)

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.
432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)
+1 408.210.3659 (mobile)
-------------------------------------------------
Internationalization is an architecture. It is not a feature.

-----Original Message-----
From: Addison Phillips [wM] [mailto:[EMAIL PROTECTED]]
Sent: Monday, October 01, 2001 6:24 PM
To: Yung-Fong Tang; [EMAIL PROTECTED]
Subject: RE: surrogate at java's property file

Java doesn't define any characters beyond Unicode 2.1.8 at the moment. It's stuck in a time-warp. JDK 1.4 will update to Unicode 3.0... neither of these versions has defined characters in the supplemental planes.

In Java, a java.lang.Character object is closely tied to the definition of an "int", the 16-bit numeric type. Many classes and objects make no distinction (or worse, conflate a character with an int---many methods are defined to take and return ints for "Characters"). As a result, the Java character model appears to be tied to UCS-2 (and I don't mean UTF-16). A surrogate character *is* recognized to be a surrogate, but a high-low pair is not recognized as representing a character, nor can you retrieve the character properties of the matched pair.

So to property files. The java.lang.Character sequence U+D800 U+DC00 is represented by the sequence "\ud800\udc00". This sequence does NOT represent U+10000. It represents TWO Characters, which happen to be surrogates that form a valid pair.

I should point out that Java is slightly clever. For example, the UTF-8 converter knows that U+D800 U+DC00 represents the scalar value U+10000 and encodes it as a valid four-byte sequence: f0-90-80-80 (and vice versa, of course).

However, it is unclear how Unicode 3.1 support is going to make it into JDK 1.4++. The APIs are going to have to change to support the supplemental planes, and the ripple effects on various APIs seem like an interesting problem.
Perhaps they'll redefine an int to be a 32-bit value and switch Java to UTF-32 (yeah, sure.....)

Best Regards,

Addison

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Yung-Fong Tang
Sent: Monday, October 01, 2001 5:10 PM
To: [EMAIL PROTECTED]
Subject: surrogate at java's property file

Does anyone know how Java handles surrogate pairs in property files? Java's property files use the \u encoding for non-ASCII characters, so U+00A5 is written \u00A5. How does it handle a surrogate pair? Is U+10000 (0xd800 0xdc00) encoded as "\u10000" or as "\ud800\udc00"? (I think it should be \u10000.) Or can property files not handle them at all?
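The two candidate spellings from the question can be tried directly. The following is a rough sketch, not a statement of how any particular JDK release behaves: it assumes a current JDK (java.nio.charset.StandardCharsets is newer than this thread), and the class name, helper names and the property name "key" are all made up for illustration. It loads each spelling through java.util.Properties and reports how many chars come back, then runs the surrogate pair through the UTF-8 converter mentioned in the reply.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class SurrogatePropertyDemo {

    // Loads an in-memory property file and returns the value of the
    // hypothetical property "key".
    static String loadValue(String fileText) {
        Properties props = new Properties();
        try {
            // Properties.load(InputStream) reads ISO-8859-1 text and decodes
            // backslash-u escapes of exactly four hex digits.
            props.load(new ByteArrayInputStream(
                    fileText.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props.getProperty("key");
    }

    // Formats the UTF-8 encoding of a string as hyphen-separated hex bytes.
    static String utf8Hex(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append('-');
            String hex = Integer.toHexString(bytes[i] & 0xFF);
            if (hex.length() == 1) sb.append('0');
            sb.append(hex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two escapes in the file: the value comes back as two chars,
        // the high and low surrogate.
        String pair = loadValue("key=\\ud800\\udc00\n");
        System.out.println(pair.length());                      // prints 2
        System.out.println(Integer.toHexString(pair.charAt(0))); // prints d800
        System.out.println(utf8Hex(pair));                      // prints f0-90-80-80

        // Five hex digits: only the first four are consumed by the escape,
        // so the value is U+1000 followed by the digit '0'.
        String five = loadValue("key=\\u10000\n");
        System.out.println(Integer.toHexString(five.charAt(0))); // prints 1000
        System.out.println(five.charAt(1));                      // prints 0
    }
}
```

Under those assumptions, "\ud800\udc00" is the spelling that survives a round trip through a property file, while "\u10000" is read as a different two-character string.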