Kenneth Whistler <kenw at sybase dot com> wrote:

> I do not think this is a proposal to amend UTF-8 to allow
> invalid sequences. So we should get that off the table.

I hope you are right.

> Apparently Lars is currently using PUA U+E080..U+E0FF
> (or U+EE80..U+EEFF ?) for this purpose, enabling the round-tripping
> of byte values uninterpretable as characters to be converted, and
> is asking for standard Unicode values for this purpose, instead.

If I understand correctly, he is using these PUA values when the data
is in UTF-16, and using bare high-bit bytes (i.e. invalid UTF-8
sequences) when the data is in UTF-8, and expecting to convert between
the two. That has at least two bad implications:

(1) the PUA characters would not round-trip from UTF-8 to UTF-16 to
    UTF-8, but would be converted to the bare high-bit bytes, and

(2) the bare high-bit bytes might or might not accidentally form valid
    UTF-8 sequences, which means they might not round-trip either.

> Say a process gets handed a "UTF-8" string that contains the
> byte sequence <61 62 63 93 4D D0 B0 E4 BA 8C F0 90 8C 82 94>.
>                         ^^                               ^^
>
> The 93 and 94 are just corrupt data -- it cannot be interpreted
> as UTF-8, and may have been introduced by some process that
> screwed up smart quotes from Code Page 1252 and UTF-8, for
> example. Interpreting the string, we have:
>
> <U+0061, U+0062, U+0063, ???, U+004D, U+0430, U+4E8C, U+10302, ???>
>
> Now *if* I am interpreting Lars correctly, he is using 128
> PUA code points to *validly* contain any such byte, so that
> it can be retained. If the range he is using is U+EE80..U+EEFF,
> then the string would be reinterpreted as:
>
> <U+0061, U+0062, U+0063, U+EE93, U+004D, U+0430, U+4E8C, U+10302,
> U+EE94>
>
> which in UTF-8 would be the byte sequence:
>
> <61 62 63 EE BA 93 4D D0 B0 E4 BA 8C F0 90 8C 82 EE BA 94>
>           ^^^^^^^^                               ^^^^^^^^
>
> This is now well-formed UTF-8, which anybody could deal with.
> And if you interpret U+EE93 as meaning "a placeholder for the
> uninterpreted or corrupt byte 0x93 in the original source",
> and so on, you could use this representation to exactly
> preserve the original information, including corruptions,
> which you could feed back out, byte-for-byte, if you reversed
> the conversion.

Oh, how I hope that is all he is asking for.

> Now moving from interpretation to critique, I think it unlikely
> that the UTC would actually want to encode 128 such characters
> to represent byte values -- and the reasons would be similar to
> those adduced for rejecting the earlier proposal. Effectively,
> in either case, these are proposals for enabling representation
> of arbitrary, embedded binary data (byte streams) in plain text.
> And that concept is pretty fundamentally antithetical to the
> Unicode concept of plain text.

Isn't this an excellent use for the PUA? These characters are private
anyway; they are defined by some standard other than Unicode, which is
not evident in the Unicode data.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
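
For anyone who wants to experiment with the scheme Ken describes, here
is a minimal Python sketch. It is not Lars's implementation, which I
have not seen. It assumes the U+EE80..U+EEFF range (one of the two
ranges Ken mentions), and the names pua_escape, escape_bytes and
unescape_text are made up for illustration. It decodes UTF-8, parks each
uninterpretable byte 0xNN at U+EENN, and reverses the mapping on output.

```python
import codecs

# Assumption: byte 0xNN (0x80..0xFF) is parked at U+EENN, as in Ken's example.
PUA_BASE = 0xEE00


def _escape_invalid(exc):
    """Decode-error handler: map each undecodable byte to a PUA code point."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Resume decoding right after the bytes that could not be interpreted.
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


# "pua_escape" is a hypothetical handler name chosen for this sketch.
codecs.register_error("pua_escape", _escape_invalid)


def escape_bytes(raw: bytes) -> str:
    """Decode UTF-8, keeping uninterpretable bytes as U+EE80..U+EEFF."""
    return raw.decode("utf-8", errors="pua_escape")


def unescape_text(text: str) -> bytes:
    """Reverse the conversion: PUA placeholders become the original bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEE80 <= cp <= 0xEEFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    raw = bytes.fromhex("61 62 63 93 4D D0 B0 E4 BA 8C F0 90 8C 82 94")
    text = escape_bytes(raw)
    print(["U+%04X" % ord(c) for c in text])
    print(text.encode("utf-8").hex())
    print(unescape_text(text) == raw)
```

Note that the sketch preserves corrupt input byte-for-byte, but a
genuine U+EE93 in the source (bytes EE BA 93) comes back out as the
single byte 93. So the PUA characters themselves do not round-trip,
unless the scheme also escapes them somehow.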

