On 3/27/06, Richard Liang wrote:
>
> Nathan Beyer wrote:
> > I've seen similar differences between other VMs around the handling of
> > UTF-8 encoded data, especially between Sun and IBM VMs. For example, if
> > you read a file with a UTF-8 encoding that contains an invalid byte(s),
> > the IBM VM will throw an IOException, but the Sun VM will convert the
> > invalid byte(s) into the Unicode unknown character
> > (diamond-backed-question-mark).
> >
> > Personally, I prefer VMs that explicitly stick to Unicode and the various
> > encodings and indicate error conditions.
> >
>
> Hello Nathan,
>
> +1, we shall stick to Unicode and various encodings.

For me it is not obvious, and I cannot make the choice. Consider the
following hypothetical situation: suppose the next Unicode spec update or
corrigendum requires a change that breaks Harmony's backward compatibility.
Should we follow the new Unicode version or stay backward compatible?

Thanks,
Stepan.

> -Nathan
> >
> >> -----Original Message-----
> >> From: Stepan Mishura [mailto:[EMAIL PROTECTED]
> >> Sent: Friday, March 24, 2006 12:57 AM
> >> To: harmony-dev
> >> Subject: [bug-to-bug] UTF-8: interpreting non-shortest forms
> >>
> >> According to the Unicode Standard 4.0 (since 3.0), interpretation of
> >> non-shortest forms is forbidden for UTF-8. So if a byte sequence is not
> >> in the table of well-formed UTF-8 byte sequences, it is considered
> >> ill-formed and treated as an error. Harmony follows the Unicode spec,
> >> but the RI doesn't. I didn't find an explanation in the spec, but I
> >> assume it is caused by backward compatibility.
> >>
> >> The following example demonstrates the difference. Code point '1071'
> >> should be represented by the UTF-8 byte sequence <D0 AF>, but it may
> >> also be represented as the 3-byte sequence <E0 90 AF>, which is its
> >> non-shortest form. So the following code prints "ERROR" on the Harmony
> >> implementation and "Ok with non-shortest forms" on the RI:
> >>
> >>     String s1 = new String(new byte[]{(byte) 0xE0, (byte) 0x90,
> >>             (byte) 0xAF}, "UTF-8");
> >>     String s2 = new String(new char[]{1071});
> >>
> >>     if (s1.equals(s2)) {
> >>         System.out.println("Ok with non-shortest forms");
> >>     } else {
> >>         System.out.println("ERROR");
> >>     }
> >>
> >> We should decide whether we are going to be compatible with the RI or
> >> with the Unicode spec.
> >>
> >> Thanks,
> >> Stepan Mishura
> >> Intel Middleware Products Division
> >>

--
Thanks,
Stepan Mishura
Intel Middleware Products Division
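
The two behaviours discussed in this thread (report an ill-formed sequence as an error, or silently substitute a replacement character) can both be selected explicitly through java.nio.charset, independent of a VM's default. The following is only a minimal sketch, assuming a JDK that provides StandardCharsets (Java 7 or later). The class name Utf8StrictDecode and its method names are hypothetical and are not taken from Harmony or the RI. On a current JDK I would expect the non-shortest form <E0 90 AF> to be reported as ill-formed; the 2006 RI described above accepted it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
public class Utf8StrictDecode {

    // Strict decoding: any sequence the decoder considers malformed
    // raises a CharacterCodingException instead of being replaced.
    public static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Lenient decoding: this String constructor never throws; malformed
    // input is replaced with the charset's default replacement string.
    public static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Convenience wrapper that turns the strict decode into a yes/no check.
    public static boolean isWellFormed(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Shortest form of code point 1071 (U+042F).
        byte[] shortest = {(byte) 0xD0, (byte) 0xAF};
        // Non-shortest (overlong) three-byte form of the same code point.
        byte[] nonShortest = {(byte) 0xE0, (byte) 0x90, (byte) 0xAF};

        System.out.println("shortest well-formed:     " + isWellFormed(shortest));
        System.out.println("non-shortest well-formed: " + isWellFormed(nonShortest));
    }
}
```

Code that must behave the same on every VM can call the strict variant and handle the exception itself, so that it does not depend on which policy `new String(bytes, "UTF-8")` happens to apply.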