Re: And Visions of Sugar Plum UTF-8's Dance in Their Heads

Jianping Yang Tue, 12 Jun 2001 17:53:33 -0700

One thing that needs to be clarified here is that there is no four-byte
encoding in the UTF-8S proposal; a four-byte encoding is illegal, not
irregular. As everything in UTF-8S is a perfect match to UTF-16, any blame
directed at this proposal also applies to the UTF-16 encoding form.
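
To make the "perfect match" concrete, here is a minimal Python sketch. It
assumes UTF-8S simply applies the ordinary one- to three-byte UTF-8 bit layout
to each UTF-16 code unit in turn. The function names are illustrative only and
do not come from the proposal text.

```python
def utf16_units(cp):
    """Return the UTF-16 code units for one code point (no validation)."""
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


def encode_unit(u):
    """Encode one 16-bit code unit with the 1- to 3-byte UTF-8 bit layout."""
    if u < 0x80:
        return bytes([u])
    if u < 0x800:
        return bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
    return bytes([0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F)])


def utf8s_encode(code_points):
    """Assumed UTF-8S: every UTF-16 code unit is encoded on its own."""
    return b"".join(encode_unit(u) for cp in code_points for u in utf16_units(cp))


# U+10000 becomes two three-byte sequences; no four-byte form is produced.
print(utf8s_encode([0x10000]).hex(" ").upper())  # ED A0 80 ED B0 80

# Sorting by UTF-8S bytes gives the same order as sorting by UTF-16 code units.
sample = [0xE000, 0x10000, 0xFFFF, 0x10FFFF, 0xD7FF]
by_utf8s = sorted(sample, key=lambda cp: utf8s_encode([cp]))
by_utf16 = sorted(sample, key=utf16_units)
print(by_utf8s == by_utf16)  # True
```

Under that assumption the byte-wise order of UTF-8S and the code-unit order of
UTF-16 agree by construction, which is the sense of "perfect match" above.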

Regards,
Jianping.

Kenneth Whistler wrote:

> Case I. Code points U-0000D800..U-0000DFFF excluded
>         from the UTF's. "The way God intended it to be"
>
>    code point     UTF-8              UTF-16     UTF-32
>
> a. 00000000  <=>  00                 0000       00000000
> b. 0000D7FF  <=>  ED 9F BF           D7FF       0000D7FF
> g. 0000E000  <=>  EE 80 80           E000       0000E000
> h. 0000FFFF  <=>  EF BF BF           FFFF       0000FFFF
> i. 00010000  <=>  F0 90 80 80        D800 DC00  00010000
> j. 0010FFFF  <=>  F4 8F BF BF        DBFF DFFF  0010FFFF
>
> [Commentary by Ken: UTF-16 does not define the same
>  binary ordering as UTF-8 or UTF-32. Big whoop.]
>
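
The ordering difference in Case I can be checked with Python's built-in
codecs. This is only a sketch: the helper name order_under is made up for
illustration, and the big-endian codecs are used so that byte-wise comparison
matches code-unit comparison.

```python
def order_under(encoding, chars):
    """Sort characters by the raw bytes of their encoded form."""
    return sorted(chars, key=lambda ch: ch.encode(encoding))


# Rows g and i of the table: U+E000 and U+10000.
chars = ["\U00010000", "\uE000"]

by_utf8 = order_under("utf-8", chars)       # EE 80 80 < F0 90 80 80
by_utf16 = order_under("utf-16-be", chars)  # D800 DC00 < E000
by_utf32 = order_under("utf-32-be", chars)  # 0000E000 < 00010000

print(by_utf8 == by_utf32)  # True: both follow code point order
print(by_utf8 == by_utf16)  # False: UTF-16 sorts U+10000 before U+E000
```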
> ===========================================================
>
> Case II. Code points U-0000D800..U-0000DFFF included
>         in the UTF's. "Mark's hard look at the real
>         world, where the angels have fallen."
>         http://www.macchiato.com/utc/utf_comparison.htm
>
>    code point     UTF-8              UTF-16     UTF-32
>
> a. 00000000  <=>  00                 0000       00000000
> b. 0000D7FF  <=>  ED 9F BF           D7FF       0000D7FF
> g. 0000E000  <=>  EE 80 80           E000       0000E000
> h. 0000FFFF  <=>  EF BF BF           FFFF       0000FFFF
> i. 00010000  <=>  F0 90 80 80        D800 DC00  00010000
> j. 0010FFFF  <=>  F4 8F BF BF        DBFF DFFF  0010FFFF
>
> Round-tripping isolated surrogate code points (when not
> appropriately paired):
>
> c. 0000D800  <=>  ED A0 80           D800       0000D800
> d. 0000DBFF  <=>  ED AF BF           DBFF       0000DBFF
> e. 0000DC00  <=>  ED B0 80           DC00       0000DC00
> f. 0000DFFF  <=>  ED BF BF           DFFF       0000DFFF
>
> Code point sequences that do not round-trip from UTF code
> unit sequences. [Could be termed "irregular code point
> sequences" --Ken]:
>
> k. 0000D800 0000DC00  =>  F0 90 80 80  D800 DC00  00010000
> l. 0000DBFF 0000DFFF  =>  F4 8F BF BF  DBFF DFFF  0010FFFF
>
> UTF code unit sequences that do not round-trip from code
> points. (Irregular code unit sequences):
>
> m. 00010000  <=   ED A0 80 ED B0 80   ----      0000D800 0000DC00
> n. 0010FFFF  <=   ED AF BF ED BF BF   ----      0000DBFF 0000DFFF
>
> [Commentary by Ken: k and l are a real problem here,
>  since the conditional handling of "surrogate code points",
>  where they convert to a single UTF-32 code unit when isolated,
>  but *also* convert to a single UTF-32 code unit when paired,
>  breaks the 1-to-1 relationship, character==>code unit, implicit
>  for UTF-32. m and n have the same problem in reverse for UTF-32.
>  I don't think either can be considered a correct specification
>  for UTF-32.]
>
> ===========================================================
>
> Case III. Code points U-0000D800..U-0000DFFF included
>         in the UTF's, using UTF-8s "The vision provided
>         by the Oracle."
>
>    code point     UTF-8s             UTF-16     UTF-32
>
> a. 00000000  <=>  00                 0000       00000000
> b. 0000D7FF  <=>  ED 9F BF           D7FF       0000D7FF
> g. 0000E000  <=>  EE 80 80           E000       0000E000
> h. 0000FFFF  <=>  EF BF BF           FFFF       0000FFFF
> i. 00010000  <=>  ED A0 80 ED B0 80  D800 DC00  00010000
> j. 0010FFFF  <=>  ED AF BF ED BF BF  DBFF DFFF  0010FFFF
>
> Round-tripping isolated surrogate code points:
>
> c. 0000D800  <=>  ED A0 80           D800       0000D800
> d. 0000DBFF  <=>  ED AF BF           DBFF       0000DBFF
> e. 0000DC00  <=>  ED B0 80           DC00       0000DC00
> f. 0000DFFF  <=>  ED BF BF           DFFF       0000DFFF
>
> Code point sequences that do not round-trip from all UTF code
> unit sequences. (Could be termed "irregular code point
> sequences" --Ken):
>
> k. 0000D800 0000DC00  =>  ED A0 80 ED B0 80  D800 DC00  0000D800 0000DC00
> l. 0000DBFF 0000DFFF  =>  ED AF BF ED BF BF  DBFF DFFF  0000DBFF 0000DFFF
>
> UTF code unit sequences that do not round-trip from code
> points. (Irregular code unit sequences):
>
> m. 00010000  <=   F0 90 80 80        ----      ???
> n. 0010FFFF  <=   F4 8F BF BF        ----      ???
>
> [Commentary by Ken: The UTF-8s proposal reverses the
>  sense of the irregular UTF-8 code unit sequences, making
>  them regular for UTF-8s and making the regular UTF-8
>  code unit sequences for supplementary characters *irregular*
>  for UTF-8s. The proposal suffers the same nagging problem
>  about what to do for UTF-32 for the odd cases of k, l, m, n.
>  The UTF-32 *does* round-trip for k and l, but the UTF-8
>  and UTF-16 do not. This leads to a conversion conundrum
>  for UTF-32:
>
>  <0000D800 0000DC00> => <U+D800, U+DC00> ==>
>       <ED A0 80 ED B0 80> => U+10000 != <U+D800, U+DC00>
>
>  Further note: To think about this Case the way Oracle does,
>  recast everything in terms of UTF-8s <==> UTF-16 conversions.
>  This vision of UTF-8s is really the extrapolation of the
>  original UTF-2, as a transform on UCS-2, seeking not to
>  special-case the handling of surrogate code units that
>  were introduced in UTF-16. ]
>
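
The conundrum in Case III can be reproduced with a small Python sketch. It
assumes a UTF-8s decoder that first recovers 16-bit code units and then pairs
surrogates the way UTF-16 does. All function names are hypothetical, and no
validation of malformed input is attempted.

```python
def encode_unit(u):
    """Encode one 16-bit value with the 1- to 3-byte UTF-8 bit layout."""
    if u < 0x80:
        return bytes([u])
    if u < 0x800:
        return bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
    return bytes([0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F)])


def decode_units(data):
    """Decode 1- to 3-byte sequences back into 16-bit units (no validation)."""
    units, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            units.append(b)
            i += 1
        elif b < 0xE0:
            units.append(((b & 0x1F) << 6) | (data[i + 1] & 0x3F))
            i += 2
        else:
            units.append(((b & 0x0F) << 12)
                         | ((data[i + 1] & 0x3F) << 6)
                         | (data[i + 2] & 0x3F))
            i += 3
    return units


def pair_surrogates(units):
    """Combine high+low surrogate units into supplementary code points."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)  # isolated surrogates pass through unchanged
            i += 1
    return out


# UTF-32 input holding two separate surrogate code points (row k).
utf32_in = [0xD800, 0xDC00]
encoded = b"".join(encode_unit(cp) for cp in utf32_in)
decoded = pair_surrogates(decode_units(encoded))

print(encoded.hex(" ").upper())  # ED A0 80 ED B0 80
print(decoded == utf32_in)       # False: the pair comes back as U+10000
```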
> ===========================================================
>
> Case IV. Code points U-0000D800..U-0000DFFF included
>         in the UTF's, using UTF-8s and adding UTF-32s.
>         "Let them order UTF-16 cake."
>
>    code point     UTF-8s             UTF-16     UTF-32s
>
> a. 00000000  <=>  00                 0000       00000000
> b. 0000D7FF  <=>  ED 9F BF           D7FF       0000D7FF
> g. 0000E000  <=>  EE 80 80           E000       0011E000
> h. 0000FFFF  <=>  EF BF BF           FFFF       0011FFFF
> i. 00010000  <=>  ED A0 80 ED B0 80  D800 DC00  00010000
> j. 0010FFFF  <=>  ED AF BF ED BF BF  DBFF DFFF  0010FFFF
>
> (and everything else follows the Oracle Case III.)
>
> [Commentary by Ken: This one is *too* weird. UTF-32s
>  now has the same binary order as UTF-16 and UTF-8s, but
>  it breaks the numeric relationship between code point
>  and UTF-32 code unit value, which is sure to break lots
>  of code. Use of code unit values greater than 0x10FFFF would
>  also break code that assumed the UTF-32 structure. Otherwise
>  this has the same imprecision regarding irregular UTF-32
>  for surrogate pairs as Case III.]
>
> ===========================================================
>
> Case V. Code points U-0000D800..U-0000DFFF included
>         in the UTF's, using UTF-16x. "Huh?"
>
>    code point     UTF-8              UTF-16x    UTF-32
>
> a. 00000000  <=>  00                 0000       00000000
> b. 0000D7FF  <=>  ED 9F BF           D7FF       0000D7FF
> g. 0000E000  <=>  EE 80 80           D800       0000E000
> h. 0000FFFF  <=>  EF BF BF           F7FF       0000FFFF
> i. 00010000  <=>  F0 90 80 80        F800 FC00  00010000
> j. 0010FFFF  <=>  F4 8F BF BF        FBFF FFFF  0010FFFF
>
> (And it isn't clear what else to do with this, as I
>  haven't seen a complete specification yet.)
>
> [Commentary by Ken: This one is *even* weirder, if
>  I have interpreted what people have in mind. Mark already
>  ruled it "impossible". While obtaining the goal of
>  binary order compatibility between the three UTF's, it
>  would trash interoperability with existing UTF-16 data and
>  API's.]
>
> ===========================================================
>
> Case VI. "Ken's Horrible Vision of the Future with
>     UTF-8 *and* UTF-8s"
>
>    code point     UTF-8/8s           UTF-16     UTF-32
>
> a. 00000000  <=>  00                 0000       00000000
> b. 0000D7FF  <=>  ED 9F BF           D7FF       0000D7FF
> g. 0000E000  <=>  EE 80 80           E000       0000E000
> h. 0000FFFF  <=>  EF BF BF           FFFF       0000FFFF
>
>    code point     UTF-8              UTF-16     UTF-32
>
> i. 00010000  <=>  F0 90 80 80        D800 DC00  00010000
> j. 0010FFFF  <=>  F4 8F BF BF        DBFF DFFF  0010FFFF
>
>    code point     UTF-8s             UTF-16     UTF-32
>
> i. 00010000  <=>  ED A0 80 ED B0 80  D800 DC00  00010000
> j. 0010FFFF  <=>  ED AF BF ED BF BF  DBFF DFFF  0010FFFF
>
> Round-tripping isolated surrogate code points:
>
>    code point     UTF-8/8s           UTF-16     UTF-32
>
> c. 0000D800  <=>  ED A0 80           D800       0000D800
> d. 0000DBFF  <=>  ED AF BF           DBFF       0000DBFF
> e. 0000DC00  <=>  ED B0 80           DC00       0000DC00
> f. 0000DFFF  <=>  ED BF BF           DFFF       0000DFFF
>
> Code point sequences that do not round-trip from UTF code
> unit sequences. [Commentary by Ken: These also have to
> map from irregular UTF-32 code unit sequences, as currently
> defined.]:
>
>    code point             UTF-8              UTF-32
>
> k. 0000D800 0000DC00  =>  F0 90 80 80        0000D800 0000DC00
> l. 0000DBFF 0000DFFF  =>  F4 8F BF BF        0000DBFF 0000DFFF
>
>    code point             UTF-8s
>
> k. 0000D800 0000DC00  =>  ED A0 80 ED B0 80  0000D800 0000DC00
> l. 0000DBFF 0000DFFF  =>  ED AF BF ED BF BF  0000DBFF 0000DFFF
>
> UTF code unit sequences that do not round-trip from code
> points. (Irregular UTF-8/8s code unit sequences):
>
>    code point     UTF-8
>
> m. 00010000  <=   ED A0 80 ED B0 80
> n. 0010FFFF  <=   ED AF BF ED BF BF
>
>    code point     UTF-8s
>
> m. 00010000  <=   F0 90 80 80
> n. 0010FFFF  <=   F4 8F BF BF
>
> [Commentary by Ken: All generic UTF-8 handlers will have
> to be armed with the expectation that they may run into
> supplementary characters encoded either as UTF-8 or as UTF-8s.
> All processing of UTF-8 will necessitate normalization
> between the two forms, to avoid inconsistencies, round-trip
> failures, and security issues. The actual API's that people
> want to write: UTF8toUTF16, UTF16toUTF8, UTF8toUTF32,
> UTF32toUTF8, etc., will be greatly complicated by this
> situation, compared to the situation for Case I, "The way
> God intended it to be."]
>
> --Ken
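
The normalization step that Case VI would force on generic handlers might
look like the Python sketch below. It leans on the standard "surrogatepass"
error handler of the built-in codecs. The function name normalize_to_utf8 is
hypothetical, and the approach is one possible way to do it, not something
specified in either proposal.

```python
def normalize_to_utf8(data):
    """Rewrite six-byte surrogate-pair sequences as four-byte UTF-8.

    Input may mix regular UTF-8 and UTF-8s-style encoded surrogate pairs.
    """
    # Step 1: decode, letting three-byte encoded surrogates through as
    # individual surrogate code points.
    text = data.decode("utf-8", "surrogatepass")
    # Step 2: round-trip through UTF-16 so adjacent high/low surrogates are
    # combined into one supplementary code point. The strict decode is
    # expected to reject input that still contains an unpaired surrogate.
    text = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    # Step 3: emit regular UTF-8.
    return text.encode("utf-8")


utf8s_form = bytes.fromhex("EDA080EDB080")  # U+10000 as in Case III
utf8_form = bytes.fromhex("F0908080")       # U+10000 as in Case I

print(normalize_to_utf8(utf8s_form) == utf8_form)  # True
print(normalize_to_utf8(utf8_form) == utf8_form)   # True
```

Every consumer of "UTF-8" would need a step of this kind before comparing,
hashing, or validating data, which is the extra cost the commentary on
Case VI describes.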

