Hello, I feel at a loss what to do...
I thought that code was looking for 0xED/0xF4 in the second position,
but it's actually looking for them in the first position, which makes
vastly more sense. Whee!
Anyway, I'll try to describe another aspect of this code for now.
The switch block in pg_utf8_increment is a folded form of five
individual manipulations, one per byte length of the
sequence. The folding presupposes that the input bytes and length
form a valid UTF-8 sequence.
For a character of 5 bytes or more, it returns false.
For 4 bytes, the sequence ranges between U+10000 and U+10ffff.
If charptr[3] is less than 0xbf, increment it and return true.
Else assign 0x80 to charptr[3] and then if charptr[2] is less
than 0xbf increment it and return true.
Else assign 0x80 to charptr[2] and then,
if (charptr[1] is less than 0x8f when charptr[0] == 0xf4) or
(charptr[1] is less than 0xbf when charptr[0] != 0xf4)
increment it and return true.
Else assign 0x80 to charptr[1] and then if charptr[0] is not
0xf4 increment it and return true.
Else the input sequence must be 0xf4 0x8f 0xbf 0xbf, which
represents U+10ffff, and this is the upper limit of the UTF-8
representation. Restore the sequence and return false.
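
As a rough check of the boundary arithmetic in the 4-byte case, here is
a small sketch that decodes a 4-byte sequence into its code point. The
name decode_utf8_4byte_sketch is my own invention for illustration, not
something from the patch, and it assumes the input is already a valid
4-byte UTF-8 sequence.

```c
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: decode a 4-byte UTF-8
 * sequence into a code point.  Assumes s[0..3] is already valid.
 */
uint32_t
decode_utf8_4byte_sketch(const unsigned char *s)
{
    /* the lead byte carries 3 payload bits, each trail byte carries 6 */
    return ((uint32_t) (s[0] & 0x07) << 18) |
           ((uint32_t) (s[1] & 0x3f) << 12) |
           ((uint32_t) (s[2] & 0x3f) << 6) |
           (uint32_t) (s[3] & 0x3f);
}
```

With this, 0xf4 0x8f 0xbf 0xbf decodes to 0x10ffff and 0xf0 0x90 0x80
0x80 decodes to 0x10000, the two ends of the 4-byte range.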
For 3 bytes, the sequence ranges between U+800 and U+ffff.
If charptr[2] is less than 0xbf increment it and return true.
Else assign 0x80 to charptr[2] and then,
if (charptr[1] is less than 0x9f when charptr[0] == 0xed) or
(charptr[1] is less than 0xbf when charptr[0] != 0xed)
increment it and return true.
The sequence 0xed 0x9f 0xbf, which represents U+d7ff, will be
incremented to 0xee 0x80 0x80 (U+e000) by the next step, which
skips the surrogate range.
Else assign 0x80 to charptr[1] and then if charptr[0] is not
0xef increment it and return true.
Else the input sequence must be 0xef 0xbf 0xbf, which represents
U+ffff, and the next UTF-8 sequence has a length of 4. Restore
the sequence and return false.
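
The 3-byte case, taken out of the folded switch and written on its own,
might look roughly like the sketch below. The function name
increment_3byte_sketch is hypothetical, and unlike the description
above it postpones the writes until it knows which byte gets
incremented, so there is nothing to restore on failure. It assumes the
input is a valid 3-byte UTF-8 sequence.

```c
#include <stdbool.h>

/*
 * Hypothetical unfolded form of the 3-byte case, for illustration only.
 * Assumes charptr[0..2] is a valid 3-byte UTF-8 sequence.
 */
bool
increment_3byte_sketch(unsigned char *charptr)
{
    unsigned char limit;

    /* easy case: the last byte still has room */
    if (charptr[2] < 0xbf)
    {
        charptr[2]++;
        return true;
    }

    /* a lead byte of 0xed must stop at 0x9f to stay below the surrogates */
    limit = (charptr[0] == 0xed) ? 0x9f : 0xbf;
    if (charptr[1] < limit)
    {
        charptr[2] = 0x80;
        charptr[1]++;
        return true;
    }

    /* carry into the lead byte unless it is already the last one */
    if (charptr[0] != 0xef)
    {
        charptr[2] = 0x80;
        charptr[1] = 0x80;
        charptr[0]++;
        return true;
    }

    /* 0xef 0xbf 0xbf: the next character needs 4 bytes; input untouched */
    return false;
}
```

Feeding it 0xed 0x9f 0xbf gives 0xee 0x80 0x80, and feeding it
0xef 0xbf 0xbf returns false with the bytes unchanged.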
For 2 bytes, the sequence ranges between U+80 and U+7ff.
If charptr[1] is less than 0xbf increment it and return true.
Else assign 0x80 to charptr[1] and then if charptr[0] is not
0xdf increment it and return true.
Else the input sequence must be 0xdf 0xbf, which represents
U+7ff, and the next UTF-8 sequence has a length of 3. Restore the
sequence and return false.
For 1 byte, the byte ranges between U+0 and U+7f.
If charptr[0] is less than 0x7f increment it and return true.
Else the input sequence must be 0x7f, which represents U+7f, and
the next UTF-8 sequence has a length of 2. Restore the sequence
and return false.
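
Putting the cases above together, the folded switch could be sketched
as follows. This is my own reconstruction from the description, not the
actual source: the name utf8_increment_sketch and the saved[] buffer
used for the restore are assumptions, and it relies on the input being
a valid UTF-8 sequence of the given length.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical sketch of the folded switch, for illustration only.
 * Assumes charptr[0..length-1] is a valid UTF-8 sequence.
 */
bool
utf8_increment_sketch(unsigned char *charptr, int length)
{
    unsigned char saved[4];
    unsigned char limit;

    if (length < 1 || length > 4)
        return false;           /* 5 bytes or more: not handled */

    /* keep a copy so the input can be restored on failure */
    memcpy(saved, charptr, (size_t) length);

    switch (length)
    {
        case 4:
            if (charptr[3] < 0xbf)
            {
                charptr[3]++;
                return true;
            }
            charptr[3] = 0x80;
            /* FALL THROUGH */
        case 3:
            if (charptr[2] < 0xbf)
            {
                charptr[2]++;
                return true;
            }
            charptr[2] = 0x80;
            /* FALL THROUGH */
        case 2:
            /* the limit of the second byte depends on the lead byte */
            if (charptr[0] == 0xed)
                limit = 0x9f;   /* stay below the surrogate range */
            else if (charptr[0] == 0xf4)
                limit = 0x8f;   /* stay at or below U+10ffff */
            else
                limit = 0xbf;
            if (charptr[1] < limit)
            {
                charptr[1]++;
                return true;
            }
            charptr[1] = 0x80;
            /* FALL THROUGH */
        case 1:
            /* last lead byte of each length: cannot go further */
            if (charptr[0] == 0x7f || charptr[0] == 0xdf ||
                charptr[0] == 0xef || charptr[0] == 0xf4)
            {
                memcpy(charptr, saved, (size_t) length);
                return false;
            }
            charptr[0]++;
            return true;
    }

    return false;               /* not reached */
}
```

Each case only touches its own byte and then falls into the case for
the next shorter length, which is why one switch can cover all of the
individual manipulations.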
--
Kyotaro Horiguchi
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)