Re: [bug-to-bug] UTF-8: interpreting non-shortest forms

Stepan Mishura Mon, 27 Mar 2006 03:16:09 -0800

On 3/27/06, Richard Liang wrote:
>
> Nathan Beyer wrote:
> > I've seen similar differences between other VMs around the handling of
> > UTF-8 encoded data, especially between Sun and IBM VMs.  For example, if
> > you read a file with a UTF-8 encoding that contains an invalid byte(s),
> > the IBM VM will throw an IOException, but the Sun VM will convert the
> > invalid byte(s) into the Unicode unknown character
> > (diamond-backed-question-mark).
> >
> > Personally, I prefer VMs that explicitly stick to Unicode and the various
> > encodings and indicate error conditions.
> >
> >
> Hello Nathan,
>
> +1, we shall stick to Unicode and various encodings.
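
To make the two behaviours Nathan describes concrete, here is a minimal
sketch using java.nio.charset.CharsetDecoder. It is only an illustration and
not how either VM is implemented; the class name Utf8ErrorPolicy and the
decode helper are made up for this example. CodingErrorAction.REPORT makes
the decoder throw a CharacterCodingException (a subclass of IOException),
while CodingErrorAction.REPLACE substitutes U+FFFD for the bad input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class, for illustration only.
class Utf8ErrorPolicy {
    /** Decodes bytes as UTF-8, applying the given action to malformed input. */
    static String decode(byte[] bytes, CodingErrorAction onMalformed)
            throws CharacterCodingException {
        return Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(onMalformed)
                .onUnmappableCharacter(onMalformed)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        // 0xFF never occurs in well-formed UTF-8.
        byte[] bad = {'a', (byte) 0xFF, 'b'};
        try {
            // Lenient: the bad byte becomes U+FFFD, giving 'a', U+FFFD, 'b'.
            System.out.println(decode(bad, CodingErrorAction.REPLACE));
            // Strict: this call throws instead of returning a string.
            System.out.println(decode(bad, CodingErrorAction.REPORT));
        } catch (CharacterCodingException e) {
            System.out.println("strict decoder rejected input: " + e);
        }
    }
}
```

With a decoder like this the caller picks the policy explicitly, so the
question becomes which action the default String and Reader paths should use.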

For me it is not so obvious, and I cannot make the choice yet.
Consider a hypothetical situation: the next Unicode spec. update or
corrigendum requires a change that breaks Harmony's backward
compatibility. Should we then follow the new Unicode version or stay
backward compatible?

Thanks,
Stepan.

> -Nathan
> >
> >
> >> -----Original Message-----
> >> From: Stepan Mishura [mailto:[EMAIL PROTECTED]
> >> Sent: Friday, March 24, 2006 12:57 AM
> >> To: harmony-dev
> >> Subject: [bug-to-bug] UTF-8: interpreting non-shortest forms
> >>
> >> According to the Unicode Standard 4.0 (and since 3.0), interpreting
> >> non-shortest forms is forbidden for UTF-8. So if a byte sequence is not
> >> in the table of well-formed UTF-8 byte sequences, it is considered
> >> ill-formed and treated as an error. Harmony follows the Unicode spec.
> >> but the RI doesn't. I didn't find an explanation in the spec., but I
> >> assume it is caused by backward compatibility.
> >>
> >> The following example demonstrates the difference. Code point 1071
> >> (U+042F) should be represented by the UTF-8 byte sequence <D0 AF>, but
> >> it may also be written as the 3-byte sequence <E0 90 AF>, which is its
> >> non-shortest form. The following code prints "ERROR" on the Harmony
> >> implementation and "Ok with non-shortest forms" on the RI:
> >>
> >>         String s1 = new String(
> >>                 new byte[]{(byte) 0xE0, (byte) 0x90, (byte) 0xAF}, "UTF-8");
> >>         String s2 = new String(new char[]{1071});
> >>
> >>         if(s1.equals(s2)){
> >>             System.out.println("Ok with non-shortest forms");
> >>         } else {
> >>             System.out.println("ERROR");
> >>         }
> >>
> >> We should decide whether we are going to be compatible with the RI or
> >> with the Unicode spec.
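
For reference, the bit arithmetic behind the <D0 AF> versus <E0 90 AF>
example can be sketched as follows. This is a simplified illustration, not
the decoder of Harmony or the RI: the class ShortestFormCheck and its methods
are hypothetical, it handles only 2- and 3-byte sequences, and it does not
validate the continuation bytes. It extracts the payload bits of a sequence
and compares the sequence length with the minimal length for that code point.

```java
// Hypothetical helper class, for illustration only.
class ShortestFormCheck {
    /** Minimum number of UTF-8 bytes needed to encode the given code point. */
    static int minimalLength(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    /** Extracts the payload bits of a 2- or 3-byte sequence without validating it. */
    static int payload(int... b) {
        if (b.length == 2) {
            return ((b[0] & 0x1F) << 6) | (b[1] & 0x3F);
        }
        if (b.length == 3) {
            return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F);
        }
        throw new IllegalArgumentException("sketch handles 2- and 3-byte sequences only");
    }

    /** True if the sequence uses the fewest bytes possible for its payload. */
    static boolean isShortestForm(int... b) {
        return minimalLength(payload(b)) == b.length;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(payload(0xD0, 0xAF)));       // 42f
        System.out.println(Integer.toHexString(payload(0xE0, 0x90, 0xAF))); // 42f
        System.out.println(isShortestForm(0xD0, 0xAF));                     // true
        System.out.println(isShortestForm(0xE0, 0x90, 0xAF));               // false
    }
}
```

Both sequences carry the same payload, 0x42F (decimal 1071), which is why a
lenient decoder can map them to the same character while a strict one
rejects the longer form.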
> >>
> >> Thanks,
> >> Stepan Mishura
> >> Intel Middleware Products Division
> >>
> >
> >
> >
>
>
>


--
Thanks,
Stepan Mishura
Intel Middleware Products Division
