Hello,

> > Robert Haas <robertmh...@gmail.com> writes:
> >> - Why does the second byte need special handling for 0xED and 0xF4?
> >
> > http://www.faqs.org/rfcs/rfc3629.html
> >
> > See section 4 in particular.  The underlying requirement is to disallow
> > multiple representations of the same Unicode code point.
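
For reference, the second-byte restriction asked about above follows
from the UTF8-3 and UTF8-4 productions in section 4 of the RFC.  Below
is a minimal sketch in C.  It is not taken from the PostgreSQL source;
second_byte_ok(), decode3() and decode4() are hypothetical names used
only for illustration, and the decoders do plain bit extraction with no
validity checks.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not PostgreSQL code.  Returns true if "second" is
 * an allowed second byte after the multi-byte lead byte "first", following
 * the UTF8-2, UTF8-3 and UTF8-4 productions of RFC 3629 section 4.  The
 * caller is assumed to pass a lead byte in the range 0xC2..0xF4.
 */
bool
second_byte_ok(unsigned char first, unsigned char second)
{
    unsigned char lo = 0x80;    /* ordinary continuation range is 0x80..0xBF */
    unsigned char hi = 0xBF;

    switch (first)
    {
        case 0xE0:
            lo = 0xA0;          /* lower values are overlong forms of < U+0800 */
            break;
        case 0xED:
            hi = 0x9F;          /* 0xA0..0xBF would encode U+D800..U+DFFF */
            break;
        case 0xF0:
            lo = 0x90;          /* lower values are overlong forms of < U+10000 */
            break;
        case 0xF4:
            hi = 0x8F;          /* 0x90..0xBF would encode U+110000 and above */
            break;
        default:
            break;
    }
    return second >= lo && second <= hi;
}

/* Raw bit extraction from a 3-byte sequence; performs no validation. */
unsigned int
decode3(const unsigned char *s)
{
    return ((unsigned int) (s[0] & 0x0F) << 12) |
           ((unsigned int) (s[1] & 0x3F) << 6) |
           (unsigned int) (s[2] & 0x3F);
}

/* Raw bit extraction from a 4-byte sequence; performs no validation. */
unsigned int
decode4(const unsigned char *s)
{
    return ((unsigned int) (s[0] & 0x07) << 18) |
           ((unsigned int) (s[1] & 0x3F) << 12) |
           ((unsigned int) (s[2] & 0x3F) << 6) |
           (unsigned int) (s[3] & 0x3F);
}
```

With these definitions, the bytes ED A0 80 decode to 0xD800 (the first
surrogate) and F4 90 80 80 decode to 0x110000 (the first value past
U+10FFFF), which is why second_byte_ok() rejects 0xA0 after 0xED and
0x90 after 0xF4.
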

The special handling skips the UTF-8 byte sequences that correspond to
the UCS-4 regions U+D800 - U+DFFF and U+110000 - U+13FFFF.  The former
is reserved for use with the UTF-16 encoding form as surrogate pairs
and does not directly represent characters, as described in section 3
of RFC 3629.  The latter is outside the UTF-8 range as defined in the
same section.

former> The definition of UTF-8 prohibits encoding character
former> numbers between U+D800 and U+DFFF, which are reserved for
former> use with the UTF-16 encoding form (as surrogate pairs)
former> and do not directly represent characters.

latter> In UTF-8, characters from the U+0000..U+10FFFF range (the
latter> UTF-16 accessible range) are encoded using sequences of 1
latter> to 4 octets.

# However, at first I wrote this exception simply by mimicking the
# behavior of pg_utf8_verifier().

The RFC must be the basis of the behavior of pg_utf8_verifier(), and
pg_utf8_increment() has inherited it.  It may be good to describe the
origin of this special handling in a comment on these functions, to
avoid this sort of confusion.

> I'm still confused.  The input string is already known to be valid
> UTF-8, so the second byte (if there is one) must be between 0x80 and
> 0xBF.  Therefore it will be neither 0xED nor 0xF4.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center