Yung-Fong Tang <ftang at netscape dot com> wrote:

> I read the RFC 2279 again (
> http://www.cis.ohio-state.edu/cs/Services/rfc/rfc-text/rfc2279.txt )
> 1. I cannot find any text in it mentioned about. non short form is
> invalid UTF8, and
First, we've already established that a revision to RFC 2279 is in the works. That said, the existing RFC 2279 says the following:

"Encoding from UCS-4 to UTF-8 proceeds as follows:

"1) Determine the number of octets required from the character value and the first column of the table above. It is important to note that the rows of the table are mutually exclusive, i.e. there is only one valid way to encode a given UCS-4 character."

The phrase "only one valid way" makes it very clear, at least to me, that non-shortest forms are invalid. And in the "Security Considerations" section, overlong sequences are referred to as "illegal UTF-8 sequences." This has not changed in the draft replacement, probably because it is already sufficient.

> 3. It mentioned about how to encode surrogate pair to UTF-8. But it
> does not say the UTF8 sequence mapping directly to Surrogate High and
> Surrogate Low are illegal

Again, from RFC 2279:

"UTF-16 is a scheme for transforming a subset of the UCS-4 repertoire into pairs of UCS-2 values from a reserved range. UTF-16 impacts UTF-8 in that UCS-2 values from the reserved range must be treated specially in the UTF-8 transformation."

and again:

"The algorithm for encoding UCS-2 (or Unicode) to UTF-8 can be obtained from the above, in principle, by simply extending each UCS-2 character with two zero-valued octets. However, pairs of UCS-2 values between D800 and DFFF (surrogate pairs in Unicode parlance), being actually UCS-4 characters transformed through UTF-16, need special treatment: the UTF-16 transformation must be undone, yielding a UCS-4 character that is then transformed as above."

It's pretty hard to read these paragraphs and come away with the impression that it's OK to map directly between UTF-8 and UTF-16 code units. Only by ignoring the existence of UTF-16 and these passages in RFC 2279, and treating every 16-bit code unit as a character (as some database vendors evidently did), would this even be necessary.
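As a rough illustration of the two passages quoted above (my own sketch, not text from the RFC), the following Python picks exactly one row of the encoding table per character value and undoes the UTF-16 transformation before encoding. The names encode_utf8 and utf16_to_utf8 are hypothetical, and the encoder stops at U+10FFFF, the highest value reachable through UTF-16, so the 5- and 6-octet rows of the RFC 2279 table are left out.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one character number using the single table row that fits it.

    Illustrative only: covers the 1- to 4-octet rows (up to U+10FFFF).
    """
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogate values are UTF-16 code units, not characters.
        raise ValueError("cannot encode surrogate value %04X directly" % cp)
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("value out of range for this sketch: %r" % cp)
    # The ranges are mutually exclusive, so only the shortest form is produced.
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_to_utf8(units) -> bytes:
    """Convert a sequence of 16-bit code units to UTF-8.

    Surrogate pairs are first combined into one character number
    (the UTF-16 transformation is undone), then encoded as above.
    """
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError("unpaired high surrogate %04X" % u)
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate %04X" % u)
        else:
            cp = u
            i += 1
        out += encode_utf8(cp)
    return bytes(out)
```

With this, utf16_to_utf8([0xD800, 0xDC00]) yields the four octets F0 90 80 80 for U+10000, rather than two 3-octet sequences, and encode_utf8(0xD800) raises an error instead of emitting ED A0 80.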
The only "shortcoming" in the RFC is that it doesn't use the word "illegal" to describe this. The draft replacement adds the following, which should remove all doubt:

"The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters. When encoding in UTF-8 from UTF-16 data, it is necessary to first decode the UTF-16 data to obtain character numbers, which are then encoded in UTF-8 as described above."

Side note: I'm a little disappointed that the draft replacement goes on to include a description of CESU-8, which is basically a perversion of UTF-8 for processes that are ignorant of UTF-16, and which the RFC later (and correctly) refers to as "a naive implementation." CESU-8 is best kept in a dark closet and used internally only by processes that have no choice, and not publicized any more than necessary.

-Doug Ewell
 Fullerton, California