One thing that needs to be clarified here is that there is no four-byte
encoding in the UTF-8S proposal: a four-byte sequence is illegal there, not
irregular. Because everything in UTF-8S is a perfect match for UTF-16, any
criticism of this proposal applies equally to the UTF-16 encoding form.

Regards,
Jianping.

Kenneth Whistler wrote:

> Case I. Code points U-0000D800..U-0000DFFF excluded
> from the UTF's. "The way God intended it to be"
>
>     code point              UTF-8              UTF-16     UTF-32
>
> a.  00000000           <=>  00                 0000       00000000
> b.  0000D7FF           <=>  ED 9F BF           D7FF       0000D7FF
> g.  0000E000           <=>  EE 80 80           E000       0000E000
> h.  0000FFFF           <=>  EF BF BF           FFFF       0000FFFF
> i.  00010000           <=>  F0 90 80 80        D800 DC00  00010000
> j.  0010FFFF           <=>  F4 8F BF BF        DBFF DFFF  0010FFFF
>
> [Commentary by Ken: UTF-16 does not define the same
> binary ordering as UTF-8 or UTF-32. Big whoop.]
>
> ===========================================================
>
> Case II. Code points U-0000D800..U-0000DFFF included
> in the UTF's. "Mark's hard look at the real
> world, where the angels have fallen."
> http://www.macchiato.com/utc/utf_comparison.htm
>
>     code point              UTF-8              UTF-16     UTF-32
>
> a.  00000000           <=>  00                 0000       00000000
> b.  0000D7FF           <=>  ED 9F BF           D7FF       0000D7FF
> g.  0000E000           <=>  EE 80 80           E000       0000E000
> h.  0000FFFF           <=>  EF BF BF           FFFF       0000FFFF
> i.  00010000           <=>  F0 90 80 80        D800 DC00  00010000
> j.  0010FFFF           <=>  F4 8F BF BF        DBFF DFFF  0010FFFF
>
> Round-tripping isolated surrogate code points (when not
> appropriately paired):
>
> c.  0000D800           <=>  ED A0 80           D800       0000D800
> d.  0000DBFF           <=>  ED AF BF           DBFF       0000DBFF
> e.  0000DC00           <=>  ED B0 80           DC00       0000DC00
> f.  0000DFFF           <=>  ED BF BF           DFFF       0000DFFF
>
> Code point sequences that do not round-trip from UTF code
> unit sequences. [Could be termed "irregular code point
> sequences" --Ken]:
>
> k.  0000D800 0000DC00  =>   F0 90 80 80        D800 DC00  00010000
> l.  0000DBFF 0000DFFF  =>   F4 8F BF BF        DBFF DFFF  0010FFFF
>
> UTF code unit sequences that do not round-trip from code
> points. (Irregular code unit sequences):
>
> m.  00010000           <=   ED A0 80 ED B0 80  ----       0000D800 0000DC00
> n.  0010FFFF           <=   ED AF BF ED BF BF  ----       0000DBFF 0000DFFF
>
> [Commentary by Ken: k and l are a real problem here,
> since the conditional handling of "surrogate code points",
> where they convert to a single UTF-32 code unit when isolated,
> but *also* convert to a single UTF-32 code unit when paired,
> breaks the 1-to-1 relationship, character==>code unit, implicit
> for UTF-32. m and n have the same problem in reverse for UTF-32.
> I don't think either can be considered a correct specification
> for UTF-32.]
>
> ===========================================================
>
> Case III. Code points U-0000D800..U-0000DFFF included
> in the UTF's, using UTF-8s "The vision provided
> by the Oracle."
>
>     code point              UTF-8s             UTF-16     UTF-32
>
> a.  00000000           <=>  00                 0000       00000000
> b.  0000D7FF           <=>  ED 9F BF           D7FF       0000D7FF
> g.  0000E000           <=>  EE 80 80           E000       0000E000
> h.  0000FFFF           <=>  EF BF BF           FFFF       0000FFFF
> i.  00010000           <=>  ED A0 80 ED B0 80  D800 DC00  00010000
> j.  0010FFFF           <=>  ED AF BF ED BF BF  DBFF DFFF  0010FFFF
>
> Round-tripping isolated surrogate code points:
>
> c.  0000D800           <=>  ED A0 80           D800       0000D800
> d.  0000DBFF           <=>  ED AF BF           DBFF       0000DBFF
> e.  0000DC00           <=>  ED B0 80           DC00       0000DC00
> f.  0000DFFF           <=>  ED BF BF           DFFF       0000DFFF
>
> Code point sequences that do not round-trip from all UTF code
> unit sequences. (Could be termed "irregular code point
> sequences" --Ken):
>
> k.  0000D800 0000DC00  =>   ED A0 80 ED B0 80  D800 DC00  0000D800 0000DC00
> l.  0000DBFF 0000DFFF  =>   ED AF BF ED BF BF  DBFF DFFF  0000DBFF 0000DFFF
>
> UTF code unit sequences that do not round-trip from code
> points. (Irregular code unit sequences):
>
> m.  00010000           <=   F0 90 80 80        ----       ???
> n.  0010FFFF           <=   F4 8F BF BF        ----       ???
>
> [Commentary by Ken: The UTF-8s proposal reverses the
> sense of the irregular UTF-8 code unit sequences, making
> them regular for UTF-8s and making the regular UTF-8
> code unit sequences for supplementary characters *irregular*
> for UTF-8s. The proposal suffers the same nagging problem
> about what to do for UTF-32 for the odd cases of k, l, m, n.
> The UTF-32 *does* round-trip for k and l, but the UTF-8
> and UTF-16 do not. This leads to a conversion conundrum
> for UTF-32:
>
> <0000D800 0000DC00> => <U+D800, U+DC00> ==>
> <ED A0 80 ED B0 80> => U+10000 != <U+D800, U+DC00>
>
> Further note: To think about this Case the way Oracle does,
> recast everything in terms of UTF-8s <==> UTF-16 conversions.
> This vision of UTF-8s is really the extrapolation of the
> original UTF-2, as a transform on UCS-2, seeking not to
> special-case the handling of surrogate code units that
> were introduced in UTF-16. ]
>
> ===========================================================
>
> Case IV. Code points U-0000D800..U-0000DFFF included
> in the UTF's, using UTF-8s and adding UTF-32s.
> "Let them order UTF-16 cake."
>
>     code point              UTF-8s             UTF-16     UTF-32s
>
> a.  00000000           <=>  00                 0000       00000000
> b.  0000D7FF           <=>  ED 9F BF           D7FF       0000D7FF
> g.  0000E000           <=>  EE 80 80           E000       0011E000
> h.  0000FFFF           <=>  EF BF BF           FFFF       0011FFFF
> i.  00010000           <=>  ED A0 80 ED B0 80  D800 DC00  00010000
> j.  0010FFFF           <=>  ED AF BF ED BF BF  DBFF DFFF  0010FFFF
>
> (and everything else follows the Oracle Case III.)
>
> [Commentary by Ken: This one is *too* weird. UTF-32s
> now has the same binary order as UTF-16 and UTF-8s, but
> it breaks the numeric relationship between code point
> and UTF-32 code unit value, which is sure to break lots
> of code. Use of code unit values greater than 0x10FFFF would
> also break code that assumed the UTF-32 structure. Otherwise
> this has the same imprecision regarding irregular UTF-32
> for surrogate pairs as Case III.]
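
The UTF-8s column of Case III can be reproduced with a few lines of Python. This is only a sketch, assuming that UTF-8s is nothing more than the ordinary three-byte UTF-8 pattern applied to each UTF-16 code unit separately, which is what the table suggests; the function name utf8s_encode is hypothetical and is not taken from any proposal text.

```python
import struct


def utf8s_encode(text: str) -> bytes:
    """Encode text the way the Case III table suggests (assumed rule):
    convert to UTF-16 first, then UTF-8-encode every 16-bit code unit
    on its own, so a surrogate pair becomes two three-byte sequences."""
    # "surrogatepass" lets isolated surrogates through instead of raising.
    utf16 = text.encode("utf-16-be", "surrogatepass")
    # Split the byte string into big-endian 16-bit code units.
    units = struct.unpack(f">{len(utf16) // 2}H", utf16)
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


def utf16_units(text: str) -> tuple:
    """Return the UTF-16 code units of text, for comparing sort orders."""
    utf16 = text.encode("utf-16-be", "surrogatepass")
    return struct.unpack(f">{len(utf16) // 2}H", utf16)
```

With these helpers, utf8s_encode("\U00010000").hex(" ") gives "ed a0 80 ed b0 80" (row i), and sorting U+E000 and U+10000 by utf8s_encode puts them in the same order as sorting by utf16_units, whereas sorting by standard UTF-8 bytes puts them in the opposite order.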
>
> ===========================================================
>
> Case V. Code points U-0000D800..U-0000DFFF included
> in the UTF's, using UTF-16x. "Huh?"
>
>     code point              UTF-8              UTF-16x    UTF-32
>
> a.  00000000           <=>  00                 0000       00000000
> b.  0000D7FF           <=>  ED 9F BF           D7FF       0000D7FF
> g.  0000E000           <=>  EE 80 80           D800       0000E000
> h.  0000FFFF           <=>  EF BF BF           F7FF       0000FFFF
> i.  00010000           <=>  F0 90 80 80        F800 FC00  00010000
> j.  0010FFFF           <=>  F4 8F BF BF        FBFF FFFF  0010FFFF
>
> (And it isn't clear what else to do with this, as I
> haven't seen a complete specification yet.)
>
> [Commentary by Ken: This one is *even* weirder, if
> I have interpreted what people have in mind. Mark already
> ruled it "impossible". While obtaining the goal of
> binary order compatibility between the three UTF's, it
> would trash interoperability with existing UTF-16 data and
> API's.]
>
> ===========================================================
>
> Case VI. "Ken's Horrible Vision of the Future with
> UTF-8 *and* UTF-8s"
>
>     code point              UTF-8/8s           UTF-16     UTF-32
>
> a.  00000000           <=>  00                 0000       00000000
> b.  0000D7FF           <=>  ED 9F BF           D7FF       0000D7FF
> g.  0000E000           <=>  EE 80 80           E000       0000E000
> h.  0000FFFF           <=>  EF BF BF           FFFF       0000FFFF
>
>     code point              UTF-8              UTF-16     UTF-32
>
> i.  00010000           <=>  F0 90 80 80        D800 DC00  00010000
> j.  0010FFFF           <=>  F4 8F BF BF        DBFF DFFF  0010FFFF
>
>     code point              UTF-8s             UTF-16     UTF-32
>
> i.  00010000           <=>  ED A0 80 ED B0 80  D800 DC00  00010000
> j.  0010FFFF           <=>  ED AF BF ED BF BF  DBFF DFFF  0010FFFF
>
> Round-tripping isolated surrogate code points:
>
>     code point              UTF-8/8s           UTF-16     UTF-32
>
> c.  0000D800           <=>  ED A0 80           D800       0000D800
> d.  0000DBFF           <=>  ED AF BF           DBFF       0000DBFF
> e.  0000DC00           <=>  ED B0 80           DC00       0000DC00
> f.  0000DFFF           <=>  ED BF BF           DFFF       0000DFFF
>
> Code point sequences that do not round-trip from UTF code
> unit sequences. [Commentary by Ken: These also have to
> map from irregular UTF-32 code unit sequences, as currently
> defined.]:
>
>     code point              UTF-8              UTF-32
>
> k.  0000D800 0000DC00  =>   F0 90 80 80        0000D800 0000DC00
> l.  0000DBFF 0000DFFF  =>   F4 8F BF BF        0000DBFF 0000DFFF
>
>     code point              UTF-8s             UTF-32
>
> k.  0000D800 0000DC00  =>   ED A0 80 ED B0 80  0000D800 0000DC00
> l.  0000DBFF 0000DFFF  =>   ED AF BF ED BF BF  0000DBFF 0000DFFF
>
> UTF code unit sequences that do not round-trip from code
> points. (Irregular UTF-8/8s code unit sequences):
>
>     code point              UTF-8
>
> m.  00010000           <=   ED A0 80 ED B0 80
> n.  0010FFFF           <=   ED AF BF ED BF BF
>
>     code point              UTF-8s
>
> m.  00010000           <=   F0 90 80 80
> n.  0010FFFF           <=   F4 8F BF BF
>
> [Commentary by Ken: All generic UTF-8 handlers will have
> to be armed with the expectation that they may run into
> supplementary characters encoded either as UTF-8 or as UTF-8s.
> All processing of UTF-8 will necessitate normalization
> between the two forms, to avoid inconsistencies, round-trip
> failures, and security issues. The actual API's that people
> want to write: UTF8toUTF16, UTF16toUTF8, UTF8toUTF32,
> UTF32toUTF8, etc., will be greatly complicated by this
> situation, compared to the situation for Case I, "The way
> God intended it to be."]
>
> --Ken
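
The normalization step that Ken's Case VI commentary says every generic UTF-8 handler would need might look roughly like the following. It is a sketch under stated assumptions rather than anyone's actual API: the name utf8s_to_utf8 is hypothetical, and it relies on Python's "surrogatepass" error handler to accept the three-byte surrogate sequences that a strict UTF-8 decoder rejects.

```python
def utf8s_to_utf8(data: bytes) -> bytes:
    """Fold six-byte UTF-8s surrogate pairs into four-byte UTF-8.

    Input that is already standard UTF-8 passes through unchanged,
    so the function accepts either form (the Case VI situation).
    """
    # Step 1: decode leniently; a UTF-8s pair such as ED A0 80 ED B0 80
    # arrives as two separate surrogate code points.
    text = data.decode("utf-8", "surrogatepass")
    # Step 2: round-trip through UTF-16 so that adjacent high/low
    # surrogates are combined into one supplementary code point.
    text = text.encode("utf-16-be", "surrogatepass").decode(
        "utf-16-be", "surrogatepass"
    )
    # Step 3: re-encode; supplementary characters now take four bytes.
    return text.encode("utf-8", "surrogatepass")
```

For example, utf8s_to_utf8(bytes.fromhex("eda080edb080")) returns the bytes F0 90 80 80, matching row i of the tables. Note that this folding is exactly what makes rows k and l fail to round-trip: once a pair of surrogate code points has been merged, the original two-code-point sequence cannot be recovered.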