Frank Tang continued:

> > If you read through those definitions from Unicode 4.0 carefully,
> > you will see that UTF-8 representing a noncharacter is perfectly
> > valid, but UTF-8 representing an unpaired surrogate code point
> > is ill-formed (and therefore disallowed).
>
> I see a hole here. What about UTF-8 representing a pair of surrogate
> code points as two 3-octet sequences instead of a single 4-octet UTF-8
> sequence? It should be ill-formed, since it is non-shortest form as
> well, right? But we really need to watch the language used there so we
> don't create a new problem. I do NOT want people to think that a single
> 3-octet UTF-8 low or high surrogate is ill-formed but that a 3-octet
> UTF-8 high surrogate followed by a 3-octet UTF-8 low surrogate is
> legal.
This is old news. Unicode 3.0 defined non-shortest UTF-8 as *irregular*
code value sequences. There were two types:

a. 0xC0 0x80 for U+0000 (instead of 0x00)
b. 0xED 0xA0 0x80 0xED 0xB0 0x80 for U+10000
   (instead of 0xF0 0x90 0x80 0x80)

Type (b), encoding two surrogate code points as if they were characters,
instead of encoding the code point of the character itself (using the
4-byte form of UTF-8), is what has come to be documented as "CESU-8",
but it has never been allowed for UTF-8. Cf. Unicode 2.0, p. A-8:

"When converting Unicode values to UTF-8, always use the shortest form
that can represent those values. ..."

Such language was carried forward into Unicode 3.0, p. 47, strengthened
to make the point:

"When converting a Unicode scalar value to UTF-8, the shortest form that
can represent those values shall be used. ..."

The problem in Unicode 3.0 was that it allowed a loophole for
*interpretation* of both kinds of non-shortest forms, on the assumption
that interpretation of non-shortest forms would be harmless. That was
criticized as a security hole, and was addressed in Unicode 3.1 (and
tweaked further in Unicode 3.2). Unicode 3.2 stated, in C12:

"Conformant processes cannot interpret ill-formed code unit
sequences..."

And that is what (a) and (b) above are, namely ill-formed code unit
sequences.

The Unicode 4.0 text further strengthens Conformance Clause C12, to make
this crystal clear:

"C12 When a process generates a code unit sequence which purports to be
in a Unicode character encoding form, it shall not emit ill-formed code
unit sequences.

"C12a When a process interprets a code unit sequence which purports to
be in a Unicode character encoding form, it shall treat ill-formed code
unit sequences as an error condition, and shall not interpret such
sequences as characters."
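[As a quick illustration of C12a in practice (my addition, not part of Ken's post): CPython's strict UTF-8 decoder treats both irregular sequences from Unicode 3.0 as ill-formed, while accepting the correct 4-byte form.]

```python
# Both Unicode 3.0 "irregular" sequences are rejected by a conformant
# strict UTF-8 decoder; the proper 4-byte encoding of U+10000 is accepted.

overlong_nul = b'\xc0\x80'                  # (a) non-shortest form of U+0000
cesu8_u10000 = b'\xed\xa0\x80\xed\xb0\x80'  # (b) surrogate pair as two 3-byte sequences
utf8_u10000  = b'\xf0\x90\x80\x80'          # correct 4-byte UTF-8 for U+10000

for label, seq in [("overlong NUL", overlong_nul),
                   ("CESU-8 U+10000", cesu8_u10000)]:
    try:
        seq.decode('utf-8')
        print(f"{label}: decoded (should not happen)")
    except UnicodeDecodeError as e:
        print(f"{label}: rejected as ill-formed ({e.reason})")

print(utf8_u10000.decode('utf-8') == '\U00010000')  # True
```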
And just in case anyone still has any trouble reading the painfully
detailed specification of the UTF-8 encoding form, an explicit note is
included there:

"* Because surrogate code points are not Unicode scalar values, any
UTF-8 byte sequence that would otherwise map to code points D800..DFFF
is ill-formed."

So I don't think there is any hole here. If anyone still thinks that
they can use these 3-octet/3-octet encodings of supplementary characters
and call it UTF-8, then they are either engaging in wishful thinking or
are not reading the standard carefully enough.

--Ken
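[Editor's sketch, not part of the original message: the note quoted above follows mechanically from the table of well-formed UTF-8 byte ranges in the standard. A minimal byte-level checker (the function name is hypothetical) makes the surrogate exclusion visible: a 3-byte sequence with lead byte 0xED restricts its second byte to 0x80..0x9F, which is exactly what rules out U+D800..U+DFFF.]

```python
# Minimal UTF-8 well-formedness check following the standard's byte
# ranges. For each lead byte, the first continuation byte has a
# restricted range; the 0xED row is what makes any encoding of a
# surrogate code point (D800..DFFF) ill-formed.

def utf8_well_formed(data: bytes) -> bool:
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 <= 0x7F:                        # 1 byte: U+0000..U+007F
            i += 1
            continue
        if 0xC2 <= b0 <= 0xDF:                # 2 bytes (C0/C1 would be overlong)
            need, lo, hi = 1, 0x80, 0xBF
        elif b0 == 0xE0:                      # 3 bytes, excludes overlong forms
            need, lo, hi = 2, 0xA0, 0xBF
        elif 0xE1 <= b0 <= 0xEC or 0xEE <= b0 <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b0 == 0xED:                      # 3 bytes, excludes D800..DFFF
            need, lo, hi = 2, 0x80, 0x9F
        elif b0 == 0xF0:                      # 4 bytes, excludes overlong forms
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b0 <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xF4:                      # 4 bytes, excludes > U+10FFFF
            need, lo, hi = 3, 0x80, 0x8F
        else:
            return False                      # invalid lead byte
        if i + need >= n:                     # truncated sequence
            return False
        if not (lo <= data[i + 1] <= hi):     # restricted first continuation
            return False
        for j in range(i + 2, i + need + 1):  # remaining continuations
            if not (0x80 <= data[j] <= 0xBF):
                return False
        i += need + 1
    return True

print(utf8_well_formed(b'\xed\xa0\x80\xed\xb0\x80'))  # False: CESU-8 pair
print(utf8_well_formed(b'\xf0\x90\x80\x80'))          # True: proper U+10000
```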