I've seen similar differences between other VMs around the handling of UTF-8 encoded data, especially between the Sun and IBM VMs. For example, if you read a file as UTF-8 and it contains invalid bytes, the IBM VM will throw an IOException, but the Sun VM will convert the invalid bytes into the Unicode replacement character, U+FFFD (the black-diamond question mark).
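Application code can pick either behavior explicitly instead of relying on VM defaults. A minimal sketch, assuming a modern `java.nio.charset` implementation: the `String` constructor silently substitutes U+FFFD, while a `CharsetDecoder` configured with `CodingErrorAction.REPORT` raises an exception on the same input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class InvalidUtf8Demo {
    public static void main(String[] args) {
        // 0xFF can never occur in well-formed UTF-8
        byte[] invalid = {(byte) 0x41, (byte) 0xFF, (byte) 0x42};

        // Lenient: the invalid byte becomes U+FFFD (the replacement character)
        String lenient = new String(invalid, StandardCharsets.UTF_8);
        System.out.println(lenient);

        // Strict: malformed input is reported as an exception
        CharsetDecoder strict = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(invalid));
            System.out.println("decoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("strict decoder rejected the input: " + e);
        }
    }
}
```

The exact replacement behavior of the `String` constructor is what differs between VMs, as described above; the explicit `CharsetDecoder` route is the portable way to get predictable error handling.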
Personally, I prefer VMs that stick strictly to Unicode and the various encodings and explicitly signal error conditions.

-Nathan

> -----Original Message-----
> From: Stepan Mishura [mailto:[EMAIL PROTECTED]
> Sent: Friday, March 24, 2006 12:57 AM
> To: harmony-dev
> Subject: [bug-to-bug] UTF-8: interpreting non-shortest forms
>
> According to the Unicode standard 4.0 (since 3.0), interpretation of
> non-shortest forms is forbidden for UTF-8. So if a byte sequence is not in
> the table of well-formed UTF-8 byte sequences, it is considered ill-formed
> and treated as an error. Harmony follows the Unicode spec, but the RI
> doesn't. I didn't find an explanation in the spec, but I assume the RI's
> behavior is kept for backward compatibility.
>
> The following example demonstrates the difference. Code point 1071 should
> be represented by the UTF-8 byte sequence <D0 AF>, but it may also be
> represented as the three-byte sequence <E0 90 AF>, which is its
> non-shortest form. So the following code prints "ERROR" on the Harmony
> implementation and "Ok with non-shortest forms" on the RI:
>
> String s1 = new String(new byte[]{(byte) 0xE0, (byte) 0x90,
>         (byte) 0xAF}, "UTF-8");
> String s2 = new String(new char[]{1071});
>
> if (s1.equals(s2)) {
>     System.out.println("Ok with non-shortest forms");
> } else {
>     System.out.println("ERROR");
> }
>
> We should decide whether we are going to be compatible with the RI or with
> the Unicode spec.
>
> Thanks,
> Stepan Mishura
> Intel Middleware Products Division
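For what it's worth, a strict `CharsetDecoder` gives the Unicode-conformant behavior regardless of what the `String` constructor on a given VM does. A minimal sketch, assuming a decoder that implements the well-formedness table (the overlong <E0 90 AF> is ill-formed because the second byte after E0 must be in the range A0-BF):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class NonShortestDemo {
    public static void main(String[] args) {
        // Non-shortest (overlong) encoding of code point 1071 (U+042F),
        // whose only well-formed encoding is <D0 AF>
        byte[] overlong = {(byte) 0xE0, (byte) 0x90, (byte) 0xAF};

        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(overlong));
            System.out.println("Ok with non-shortest forms");
        } catch (CharacterCodingException e) {
            System.out.println("non-shortest form rejected: " + e);
        }
    }
}
```

A conformant decoder prints the "rejected" branch; a decoder that accepts non-shortest forms, as Stepan describes for the RI's `String` constructor, would print the other.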