Doug,

>
> It is true that the *specific* irregular UTF-8 sequences introduced (and
> required) by CESU-8 decode to characters above 0xFFFF when interpreted as
> CESU-8, and to pairs of surrogate code points when (incorrectly)
> interpreted as UTF-8.  Since definition D29, arguably my least favorite
> part of Unicode, requires that all UTFs (including UTF-8) be able to
> represent unpaired surrogates, the character count for the same chunk of
> data could be different depending on whether it is interpreted as CESU-8
> or UTF-8.  That's a potential security hole.

From TR27:

D36 (a) UTF-8 is the Unicode Transformation Format that serializes a Unicode
code point as a sequence of one to four bytes, as specified in Table 3.1,
UTF-8 Bit Distribution.
(b) An illegal UTF-8 code unit sequence is any byte sequence that does not
match the patterns listed in Table 3.1B, Legal UTF-8 Byte Sequences.
(c) An irregular UTF-8 code unit sequence is a six-byte sequence where the
first three bytes correspond to a high surrogate, and the next three bytes
correspond to a low surrogate. As a consequence of C12, these irregular
UTF-8 sequences shall not be generated by a conformant process.
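
To make D36 (c) and the count mismatch concrete, here is a rough Python
sketch. It is not taken from TR27; the function names are made up for
illustration, and the "lenient UTF-8" reading is approximated with Python's
"surrogatepass" error handler, which decodes each three-byte surrogate form
as its own code point.

```python
def is_high_surrogate_triple(b):
    # ED A0..AF 80..BF is the three-byte form of U+D800..U+DBFF
    return (len(b) == 3 and b[0] == 0xED
            and 0xA0 <= b[1] <= 0xAF and 0x80 <= b[2] <= 0xBF)


def is_low_surrogate_triple(b):
    # ED B0..BF 80..BF is the three-byte form of U+DC00..U+DFFF
    return (len(b) == 3 and b[0] == 0xED
            and 0xB0 <= b[1] <= 0xBF and 0x80 <= b[2] <= 0xBF)


def find_irregular_sequences(data):
    """Return the start offsets of six-byte high+low surrogate sequences."""
    offsets = []
    i = 0
    while i + 6 <= len(data):
        if (is_high_surrogate_triple(data[i:i + 3])
                and is_low_surrogate_triple(data[i + 3:i + 6])):
            offsets.append(i)
            i += 6
        else:
            i += 1
    return offsets


def count_as_lenient_utf8(data):
    # Each three-byte surrogate form counts as one code point.
    return len(data.decode("utf-8", "surrogatepass"))


def count_as_cesu8(data):
    # Each six-byte pair counts as one supplementary character.
    return count_as_lenient_utf8(data) - len(find_irregular_sequences(data))


# "A" followed by U+10000 written the CESU-8 way (D800 DC00 in UTF-16).
sample = b"A" + bytes.fromhex("eda080edb080")
print(find_irregular_sequences(sample))  # [1]
print(count_as_lenient_utf8(sample))     # 3
print(count_as_cesu8(sample))            # 2
```

A strict decoder sidesteps the ambiguity by rejecting the data outright:
sample.decode("utf-8") raises UnicodeDecodeError in Python 3.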

Add to Appendix C: Relationship to ISO/IEC 10646, Section C.3: UCS
Transformation Formats, at the end of the subsection UTF-8:

The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for
the use of five- and six-byte sequences to encode characters that are
outside the range of the Unicode character set; those five- and six-byte
sequences are illegal for the use of UTF-8 as a transformation of Unicode
characters. ISO/IEC 10646 does not allow mapping of unpaired surrogates, nor
U+FFFE and U+FFFF (but it does allow other noncharacters).
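
As a small illustration of the five- and six-byte point, the sketch below
maps a lead byte to the sequence length it announces under the ISO/IEC
10646-1:2000 Annex D scheme and flags the lengths that are illegal for
Unicode. The function names are hypothetical, and the check looks only at
the announced length; it does not validate continuation bytes, overlong
forms, or the code point range.

```python
def iso_sequence_length(lead):
    """Length announced by a lead byte, or None if it cannot start a sequence."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return None  # continuation byte (80..BF) or FE/FF


def is_five_or_six_byte_lead(lead):
    # These lead bytes start sequences that are illegal for Unicode's UTF-8.
    length = iso_sequence_length(lead)
    return length is not None and length > 4


print(iso_sequence_length(0xF0), is_five_or_six_byte_lead(0xF0))  # 4 False
print(iso_sequence_length(0xF8), is_five_or_six_byte_lead(0xF8))  # 5 True
print(iso_sequence_length(0xFC), is_five_or_six_byte_lead(0xFC))  # 6 True
```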

Carl

