Frank Tang asked:

> >>  This discussion has been centered around UTF-8.  But I hope the
> >>corresponding rules apply to UTF-16 and UTF-32 for Unicode 4.0:
> >>
> >>. for UTF-32: occurrences of 'surrogates' are ill-formed.
> >>
> How about a UTF-32 sequence in which the 4 bytes represent a value > U+10FFFF?
> Are they considered ill-formed? Should they be?

Yes, they are ill-formed.

Since all the encoding forms are based on the Unicode scalar values,
and since the Unicode scalar values are *defined* to be the
ranges 0x0000..0xD7FF and 0xE000..0x10FFFF, any attempt to represent
a code point higher than U+10FFFF in *any* encoding form is
ill-formed.
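
To make the rule concrete, here is a minimal sketch in Python of what a
well-formedness check for UTF-32 code units could look like. The names
is_scalar_value and first_ill_formed_utf32 are hypothetical, chosen for
illustration only; they are not taken from the Unicode text or from any
library, and the code units are assumed to be already decoded to integers
(byte order is not handled here).

```python
from typing import Iterable, Optional

MAX_SCALAR = 0x10FFFF       # highest Unicode scalar value
SURROGATE_FIRST = 0xD800    # start of the surrogate range
SURROGATE_LAST = 0xDFFF     # end of the surrogate range


def is_scalar_value(cp: int) -> bool:
    """True if cp lies in 0x0000..0xD7FF or 0xE000..0x10FFFF."""
    if cp < 0 or cp > MAX_SCALAR:
        return False
    return not (SURROGATE_FIRST <= cp <= SURROGATE_LAST)


def first_ill_formed_utf32(units: Iterable[int]) -> Optional[int]:
    """Return the index of the first ill-formed code unit, or None.

    In UTF-32 each code unit stands for exactly one scalar value, so a
    unit is ill-formed if it is a surrogate or exceeds 0x10FFFF.
    """
    for index, unit in enumerate(units):
        if not is_scalar_value(unit):
            return index
    return None
```

For example, first_ill_formed_utf32([0x41, 0x1F600, 0x110000]) reports
index 2, because 0x110000 is above U+10FFFF, and a lone 0xD800 is
rejected for being a surrogate.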

This will be called out explicitly in the Unicode 4.0 text, in
case anyone still has the question:

" * Any UTF-32 code unit greater than 0010FFFF (hex) is
    ill-formed."
    
I can keep answering these questions, but I can also assure
everyone that the UTC worked *very* hard this time around to
make the character encoding model much clearer in the Unicode 4.0
text, and to anticipate all these edge cases.

--Ken

