Frank Tang asked:

> >> This discussion has been centered around UTF-8. But I hope the
> >> corresponding rules apply to UTF-16 and UTF-32 for Unicode 4.0:
> >>
> >> . for UTF-32: occurrences of 'surrogates' are ill-formed.
> >>
> How about UTF-32 sequence which the 4 bytes represent value > U+10FFFF ?
> Are they considered ill-formed? Should they?

Yes, they are ill-formed.

Since all the encoding forms are based on the Unicode scalar values, and since the Unicode scalar values are *defined* to be the range 0x0000..0xD7FF, 0xE000..0x10FFFF, any attempt to represent a code point higher than U+10FFFF in *any* encoding form is ill-formed.

This will be called out explicitly in the Unicode 4.0 text, in case anyone still has the question:

  " * Any UTF-32 code unit greater than 0010FFFF₁₆ is ill-formed."

I can keep answering these questions, but I can also assure everyone that the UTC worked *very* hard this time around to make the character encoding model much clearer in the Unicode 4.0 text, and to anticipate all these edge cases.

--Ken
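
For illustration only, the scalar-value rule described above can be sketched as a small check over UTF-32 code units. This is a minimal sketch, not text from the standard: the function names `is_scalar_value` and `first_ill_formed_utf32` are hypothetical, and the input is assumed to be a sequence of integers that have already been decoded from their 4-byte form.

```python
from typing import Iterable, Optional


def is_scalar_value(value: int) -> bool:
    """True if value lies in 0x0000..0xD7FF or 0xE000..0x10FFFF."""
    return 0x0000 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF


def first_ill_formed_utf32(units: Iterable[int]) -> Optional[int]:
    """Return the index of the first ill-formed UTF-32 code unit, or None.

    Assumption: each element is one 32-bit code unit given as an int.
    A unit is ill-formed if it is a surrogate (0xD800..0xDFFF) or is
    greater than 0x10FFFF.
    """
    for index, unit in enumerate(units):
        if not is_scalar_value(unit):
            return index
    return None
```

With this sketch, a sequence containing 0x110000 or 0xD800 is reported as ill-formed at the position of that unit, while a sequence ending in 0x10FFFF passes.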