Frank Tang continued:

> >If you read through those definitions from Unicode 4.0 carefully,
> >you will see that UTF-8 representing a noncharacter is perfectly
> >valid, but UTF-8 representing an unpaired surrogate code point
> >is ill-formed (and therefore disallowed).
> >
> I see a hole here. What about UTF-8 representing a pair of surrogate 
> code points as two 3-octet sequences instead of one 4-octet UTF-8 
> sequence? It should be ill-formed, since it is a non-shortest form as 
> well, right? But we really need to watch the language used there so we 
> won't create a new problem. I DO NOT want people to think that a single 
> 3-octet UTF-8 low or high surrogate is ill-formed but that a 3-octet 
> UTF-8 high surrogate followed by a 3-octet UTF-8 low surrogate is legal.

This is old news.

Unicode 3.0 defined non-shortest UTF-8 as *irregular* code value
sequences. There were two types:

   a. 0xC0 0x80 for U+0000 (instead of 0x00)
   b. 0xED 0xA0 0x80 0xED 0xB0 0x80 for U+10000 (instead of 0xF0 0x90 0x80 0x80)
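
As a quick sanity check, a modern strict UTF-8 decoder rejects both of
those irregular sequences while accepting the 4-byte form. A minimal
Python sketch, using the byte sequences from (a) and (b) above:

```python
# (a) Overlong encoding of U+0000 (0xC0 0x80) -- ill-formed UTF-8.
overlong_nul = b'\xc0\x80'

# (b) CESU-8 style surrogate-pair encoding of U+10000 -- also ill-formed.
cesu8_pair = b'\xed\xa0\x80\xed\xb0\x80'

# The shortest (and only well-formed) UTF-8 encoding of U+10000.
shortest = b'\xf0\x90\x80\x80'

for bad in (overlong_nul, cesu8_pair):
    try:
        bad.decode('utf-8')
        print('unexpectedly decoded:', bad)
    except UnicodeDecodeError:
        print('rejected as ill-formed:', bad)

# Only the 4-byte form maps to the character U+10000.
print(shortest.decode('utf-8') == '\U00010000')
```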
   
Type (b), encoding two surrogate code points as if they were
characters, instead of encoding the code point of the character
itself (using the 4-byte form of UTF-8), is what has come to
be documented as "CESU-8", but it has never been allowed for
UTF-8. Cf. Unicode 2.0, p. A-8:

   "When converting Unicode values to UTF-8, always use the shortest
    form that can represent those values. ..."
    
Such language was carried forward into Unicode 3.0, p. 47,
strengthened to make the point:

   "When converting a Unicode scalar value to UTF-8, the shortest
    form that can represent those values shall be used. ..."
    
The problem in Unicode 3.0 was that it allowed a loophole for
*interpretation* of both kinds of non-shortest forms, on the
assumption that interpretation of non-shortest forms would be
harmless. That was criticized as a security hole, and was
addressed in Unicode 3.1 (and tweaked further in Unicode 3.2).

Unicode 3.2 stated, in C12:

   "Conformant processes cannot interpret ill-formed code
    unit sequences..."
    
And that is what (a) and (b) above are, namely ill-formed code
unit sequences.

The Unicode 4.0 text further strengthens Conformance Clause
C12, to make this crystal clear:

   "C12 When a process generates a code unit sequence which
    purports to be in a Unicode character encoding form, it shall
    not emit ill-formed code unit sequences.
    
   "C12a When a process interprets a code unit sequence which
    purports to be in a Unicode character encoding form, it
    shall treat ill-formed code unit sequences as an error
    condition, and shall not interpret such sequences as
    characters."
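
For concreteness, here is how C12a plays out in a typical strict
decoder. A Python sketch; the U+FFFD substitution shown at the end is
one common error-handling strategy, not wording mandated by the clause:

```python
ill_formed = b'abc\xed\xa0\x80def'  # a lone high surrogate embedded in ASCII

# C12a: treat the ill-formed sequence as an error condition ...
try:
    ill_formed.decode('utf-8')
except UnicodeDecodeError as e:
    print('error at byte offset', e.start)

# ... and do not interpret it as characters. A decoder may instead
# substitute U+FFFD REPLACEMENT CHARACTER; that is a recovery action,
# not an interpretation of the ill-formed bytes as a surrogate.
print(ill_formed.decode('utf-8', errors='replace'))
```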
    
And just in case anyone still has any trouble reading the
painfully detailed specification of the UTF-8
encoding form, an explicit note is included there:

   "* Because surrogate code points are not Unicode scalar
      values, any UTF-8 byte sequence that would otherwise
      map to code points D800..DFFF is ill-formed."
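
That byte-level exclusion can be validated without ever decoding: in
the Unicode 4.0 well-formed byte-range table, a lead byte 0xED may only
be followed by a continuation byte in 0x80..0x9F, which covers
U+D000..U+D7FF and skips the surrogates entirely. A sketch of such a
checker (the function name `utf8_well_formed` is mine, and this is an
illustration of the table, not an exhaustively tested validator):

```python
def utf8_well_formed(data: bytes) -> bool:
    """Check UTF-8 well-formedness per the Unicode 4.0 byte-range table."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 <= 0x7F:                       # U+0000..U+007F
            i += 1
            continue
        if 0xC2 <= b0 <= 0xDF:               # U+0080..U+07FF
            need, lo, hi = 1, 0x80, 0xBF
        elif b0 == 0xE0:                     # U+0800..U+0FFF (no overlongs)
            need, lo, hi = 2, 0xA0, 0xBF
        elif 0xE1 <= b0 <= 0xEC or 0xEE <= b0 <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b0 == 0xED:                     # only 0x80..0x9F: excludes D800..DFFF
            need, lo, hi = 2, 0x80, 0x9F
        elif b0 == 0xF0:                     # U+10000..U+3FFFF (no overlongs)
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b0 <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xF4:                     # caps the range at U+10FFFF
            need, lo, hi = 3, 0x80, 0x8F
        else:                                # 0x80..0xC1, 0xF5..0xFF: never legal
            return False
        if i + need >= n:
            return False                     # truncated sequence
        if not (lo <= data[i + 1] <= hi):
            return False
        for j in range(i + 2, i + 1 + need):
            if not (0x80 <= data[j] <= 0xBF):
                return False
        i += 1 + need
    return True
```

On this check, both a lone 3-octet surrogate and the CESU-8-style pair
fail at the second byte of the 0xED sequence, while the 4-byte form of
the same supplementary character passes.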
      
So I don't think there is any hole here. If anyone still
thinks that they can use these 3-octet/3-octet encodings
of supplementary characters and call it UTF-8, then they
are either engaging in wishful thinking or are not reading
the standard carefully enough.

--Ken

