Re: RFC, 5-6 octets sequence in UTF8, non short form in UTF8

Doug Ewell Wed, 19 Feb 2003 09:30:13 -0800

Yung-Fong Tang <ftang at netscape dot com> wrote:

> I read the RFC 2279 again (
> http://www.cis.ohio-state.edu/cs/Services/rfc/rfc-text/rfc2279.txt )
> 1.  I cannot find any text in it mentioned about. non short form is
> invalid UTF8, and


First, we've already established that a revision to RFC 2279 is in the
works.

That said, the existing RFC 2279 says the following:

"Encoding from UCS-4 to UTF-8 proceeds as follows:

"1) Determine the number of octets required from the character value
    and the first column of the table above.  It is important to note
    that the rows of the table are mutually exclusive, i.e. there is
    only one valid way to encode a given UCS-4 character."

The phrase "only one valid way" makes it very clear, at least to me,
that non-shortest forms are invalid.  And in the "Security
Considerations" section, overlong sequences are referred to as "illegal
UTF-8 sequences."  This has not changed in the draft replacement,
probably because it is already sufficient.
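
To make the "only one valid way" point concrete, here is a minimal Python sketch. It is an illustration, not text from the RFC. The names ROW_LIMITS, octets_required and is_shortest_form are hypothetical; the limits are the upper bounds of the six rows of the RFC 2279 table. Because the rows do not overlap, each character value has exactly one permitted length, and anything longer is a non-shortest form.

```python
# Upper bound of each row of the RFC 2279 table; index + 1 is the octet count.
ROW_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def octets_required(code_point):
    """Return the single octet count the table allows for a UCS-4 value."""
    if code_point < 0:
        raise ValueError("negative character value")
    for count, limit in enumerate(ROW_LIMITS, start=1):
        if code_point <= limit:
            return count
    raise ValueError("value exceeds the UCS-4 range")


def is_shortest_form(code_point, octets_used):
    """True only if the sequence length matches the table row for the value."""
    return octets_used == octets_required(code_point)


# C0 AF is a two-octet sequence whose payload bits spell U+002F ("/").
overlong = bytes([0xC0, 0xAF])
payload = ((overlong[0] & 0x1F) << 6) | (overlong[1] & 0x3F)

print(hex(payload), is_shortest_form(payload, len(overlong)))

# Python's built-in decoder also refuses the overlong sequence.
try:
    overlong.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

The check is deliberately separate from decoding: a decoder can extract the payload first and then compare the length it consumed against the one length the table allows.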

> 3. It mentioned about how to encode surrogate pair to UTF-8. But it
> does not say the UTF8 sequence mapping directly to Surrogate High and
> Surrogate Low are illegal

Again, from RFC 2279:

"UTF-16 is a scheme for transforming a subset of the UCS-4 repertoire
into pairs of UCS-2 values from a reserved range.  UTF-16 impacts
UTF-8 in that UCS-2 values from the reserved range must be treated
specially in the UTF-8 transformation."

and again:

"The algorithm for encoding UCS-2 (or Unicode) to UTF-8 can be
obtained from the above, in principle, by simply extending each
UCS-2 character with two zero-valued octets.  However, pairs of
UCS-2 values between D800 and DFFF (surrogate pairs in Unicode
parlance), being actually UCS-4 characters transformed through
UTF-16, need special treatment: the UTF-16 transformation must be
undone, yielding a UCS-4 character that is then transformed as
above."

It's pretty hard to read these paragraphs and come away with the
impression that it's OK to map directly between UTF-8 and UTF-16 code
units.  Only by ignoring the existence of UTF-16 and these passages in
RFC 2279, and treating every 16-bit code unit as a character (as some
database vendors evidently did), would such a direct mapping even seem
necessary.  The only "shortcoming" in the RFC is that it doesn't use the
word "illegal" to describe this.
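
As a rough illustration of what "treating every 16-bit code unit as a character" produces, the Python sketch below applies the three-octet pattern to each half of a surrogate pair without undoing the UTF-16 transformation. The function name naive_three_octets is hypothetical, and Python's built-in decoder stands in for "a strict UTF-8 decoder"; other decoders may behave differently.

```python
def naive_three_octets(code_unit):
    """Apply the three-octet UTF-8 pattern to one 16-bit value, unchecked."""
    return bytes([
        0xE0 | (code_unit >> 12),
        0x80 | ((code_unit >> 6) & 0x3F),
        0x80 | (code_unit & 0x3F),
    ])


# U+10000 in UTF-16 is the surrogate pair D800 DC00.
pair = [0xD800, 0xDC00]
naive = b"".join(naive_three_octets(unit) for unit in pair)
print(naive.hex())  # six octets instead of the four that U+10000 requires

# Python's UTF-8 decoder rejects octets that map directly to surrogates.
try:
    naive.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False
print("accepted by strict decoder:", accepted)
```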

The draft replacement adds the following, which should remove all doubt:

"The definition of UTF-8 prohibits encoding character numbers between
U+D800 and U+DFFF, which are reserved for use with the UTF-16
encoding form (as surrogate pairs) and do not directly represent
characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
to first decode the UTF-16 data to obtain character numbers, which
are then encoded in UTF-8 as described above."
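
The two-step procedure in that passage can be sketched in Python as follows. This is only an illustration under stated assumptions: the names decode_surrogate_pair and encode_utf8 are hypothetical, and the encoder stops at four octets (U+10FFFF) because that is the highest value a surrogate pair can reach, even though RFC 2279 itself defines sequences of up to six octets.

```python
def decode_surrogate_pair(high, low):
    """Undo the UTF-16 transformation, yielding one character number."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first code unit is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second code unit is not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def encode_utf8(code_point):
    """Encode one character number, refusing surrogate values."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the range handled by this sketch")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values must not be encoded in UTF-8")
    if code_point <= 0x7F:
        return bytes([code_point])
    if code_point <= 0x7FF:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# The pair D800 DC00 is first decoded to U+10000, then encoded once.
character = decode_surrogate_pair(0xD800, 0xDC00)
print(hex(character), encode_utf8(character).hex())
```

Keeping the surrogate check inside the encoder means a caller that skips the first step gets an error rather than a six-octet sequence.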

Side note:  I'm a little disappointed that the draft replacement goes on
to include a description of CESU-8, which is basically a perversion of
UTF-8 for processes that are ignorant of UTF-16, and which the RFC later
(and correctly) refers to as "a naive implementation."  CESU-8 is best
kept in a dark closet and used internally only by processes that have no
choice, and not publicized any more than necessary.

-Doug Ewell
 Fullerton, California
