Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

Doug Ewell Tue, 07 Dec 2004 22:30:28 -0800

Kenneth Whistler <kenw at sybase dot com> wrote:

> I do not think this is a proposal to amend UTF-8 to allow
> invalid sequences. So we should get that off the table.


I hope you are right.

> Apparently Lars is currently using PUA U+E080..U+E0FF
> (or U+EE80..U+EEFF ?) for this purpose, enabling the round-tripping
> of byte values uninterpretable as characters to be converted, and
> is asking for standard Unicode values for this purpose, instead.

If I understand correctly, he is using these PUA values when the data is
in UTF-16, and using bare high-bit bytes (i.e. invalid UTF-8 sequences)
when the data is in UTF-8, and expecting to convert between the two.
That has at least two bad implications:

(1) the PUA characters would not round-trip from UTF-8 to UTF-16 to
UTF-8, but would be converted to the bare high-bit bytes, and

(2) the bare high-bit bytes might or might not accidentally form valid
UTF-8 sequences, which means they might not round-trip either.
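
Both problems can be reproduced with a small Python sketch. It models the scheme only as I understand it from this thread, so the details are assumptions: the escape range is taken to be U+EE80..U+EEFF, and the helper names `utf8_to_utf16_side` and `utf16_side_to_utf8` are hypothetical. Python's built-in `surrogateescape` error handler is borrowed to find the undecodable bytes; it is not part of what Lars described.

```python
# Hypothetical model of the scheme as described above. The escape range
# U+EE80..U+EEFF is an assumption (the thread also mentions U+E080..U+E0FF).
PUA_BASE = 0xEE00
SURROGATE_BASE = 0xDC00  # base used by Python's "surrogateescape" handler


def utf8_to_utf16_side(data: bytes) -> str:
    """Decode UTF-8, turning each undecodable byte 0xNN into U+EENN."""
    text = data.decode("utf-8", errors="surrogateescape")
    return "".join(
        chr(ord(c) - SURROGATE_BASE + PUA_BASE) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in text
    )


def utf16_side_to_utf8(text: str) -> bytes:
    """Encode to UTF-8, turning each U+EENN back into the bare byte 0xNN."""
    shifted = "".join(
        chr(ord(c) - PUA_BASE + SURROGATE_BASE) if 0xEE80 <= ord(c) <= 0xEEFF else c
        for c in text
    )
    return shifted.encode("utf-8", errors="surrogateescape")


# (1) A genuine PUA character does not survive UTF-8 -> UTF-16 -> UTF-8.
original = "\uee93".encode("utf-8")            # well-formed: EE BA 93
round_tripped = utf16_side_to_utf8(utf8_to_utf16_side(original))

# (2) Two escaped bytes that end up adjacent can form a valid sequence.
escaped = utf8_to_utf16_side(b"\xc3x\xa9")     # U+EEC3, 'x', U+EEA9
edited = escaped.replace("x", "")              # U+EEC3, U+EEA9
as_utf8 = utf16_side_to_utf8(edited)           # bare bytes C3 A9
reread = utf8_to_utf16_side(as_utf8)           # now reads as U+00E9
```

In case (1) the three-byte sequence EE BA 93 comes back as the single byte 93. In case (2) the two escapes come back as one ordinary character, U+00E9.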

> Say a process gets handed a "UTF-8" string that contains the
> byte sequence <61 62 63 93 4D D0 B0 E4 BA 8C F0 90 8C 82 94>.
>                         ^^                               ^^
>
> The 93 and 94 are just corrupt data -- it cannot be interpreted
> as UTF-8, and may have been introduced by some process that
> screwed up smart quotes from Code Page 1252 and UTF-8, for
> example. Interpreting the string, we have:
>
> <U+0061, U+0062, U+0063, ???, U+004D, U+0430, U+4E8C, U+10302, ???>
>
> Now *if* I am interpreting Lars correctly, he is using 128
> PUA code points to *validly* contain any such byte, so that
> it can be retained. If the range he is using is U+EE80..U+EEFF,
> then the string would be reinterpreted as:
>
> <U+0061, U+0062, U+0063, U+EE93, U+004D, U+0430, U+4E8C, U+10302,
> U+EE94>
>
> which in UTF-8 would be the byte sequence:
>
> <61 62 63 EE BA 93 4D D0 B0 E4 BA 8C F0 90 8C 82 EE BA 94>
>           ^^^^^^^^                               ^^^^^^^^
>
> This is now well-formed UTF-8, which anybody could deal with.
> And if you interpret U+EE93 as meaning "a placeholder for the
> uninterpreted or corrupt byte 0x93 in the original source",
> and so on, you could use this representation to exactly
> preserve the original information, including corruptions,
> which you could feed back out, byte-for-byte, if you reversed
> the conversion.

Oh, how I hope that is all he is asking for.
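
For reference, Ken's example can be checked with a minimal Python sketch. It assumes the U+EE80..U+EEFF range from his message; the error-handler name `pua_escape` and the two helper functions are hypothetical names chosen for illustration, not anything Lars has published.

```python
import codecs

PUA_BASE = 0xEE00  # assumption: byte 0xNN is escaped as U+EENN


def _pua_escape(exc):
    """Error handler: replace each undecodable byte with its PUA escape."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua_escape", _pua_escape)


def escape_invalid_bytes(data: bytes) -> str:
    """Interpret data as UTF-8, keeping corrupt bytes as U+EE80..U+EEFF."""
    return data.decode("utf-8", errors="pua_escape")


def unescape_to_bytes(text: str) -> bytes:
    """Reverse the conversion, emitting the original bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEE80 <= cp <= 0xEEFF:
            out.append(cp - PUA_BASE)      # placeholder -> original byte
        else:
            out += ch.encode("utf-8")      # ordinary character
    return bytes(out)


corrupt = bytes.fromhex("616263934DD0B0E4BA8CF0908C8294")
text = escape_invalid_bytes(corrupt)
well_formed = text.encode("utf-8")         # safe to hand to any UTF-8 consumer
restored = unescape_to_bytes(text)         # byte-for-byte original
```

Here `well_formed` is the 19-byte sequence Ken shows, and `restored` equals the corrupt input, provided the input did not already contain characters in the escape range.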

> Now moving from interpretation to critique, I think it unlikely
> that the UTC would actually want to encode 128 such characters
> to represent byte values -- and the reasons would be similar to
> those adduced for rejecting the earlier proposal. Effectively,
> in either case, these are proposals for enabling representation
> of arbitrary, embedded binary data (byte streams) in plain text.
> And that concept is pretty fundamentally antithetical to the
> Unicode concept of plain text.

Isn't this an excellent use for the PUA?  These characters are private
anyway; they are defined by some standard other than Unicode, which is
not evident in the Unicode data.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
