Re: Roundtripping in Unicode

Philippe Verdy Mon, 13 Dec 2004 17:16:11 -0800

That's exactly the same response and idea as Ken's that I gave to Lars, for the case where he wants valid code points. But I also argued that this does not offer roundtripping, only a better substitution than U+FFFD. The conversion is not completely lossless, because those private substitution conventions become indistinguishable from legal input that contained no encoding error:

If you convert invalid input bytes nn to U+EEnn, then you can't reverse U+EEnn back to bytes nn without also converting correctly encoded U+EEnn that would have been present on the original input stream.
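To make the collision concrete, here is a minimal Python sketch. The handler name "ee-substitute" and the two helper functions are my own illustrative names, not part of any standard or library; only the U+EEnn convention comes from the discussion above.

```python
import codecs

def _ee_substitute(exc):
    # Hypothetical error handler: map each undecodable byte nn to U+EEnn.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(0xEE00 + b) for b in bad), exc.end

codecs.register_error("ee-substitute", _ee_substitute)

def decode_substituting(data):
    """Decode UTF-8, replacing invalid bytes nn with U+EEnn."""
    return data.decode("utf-8", errors="ee-substitute")

def encode_reversing(text):
    """Attempt the reverse mapping: U+EE80..U+EEFF back to raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEE80 <= cp <= 0xEEFF:
            out.append(cp - 0xEE00)   # assumed to be a substituted byte
        else:
            out += ch.encode("utf-8")
    return bytes(out)

# An invalid byte survives the round trip...
invalid_input = b"a\xffb"
restored_invalid = encode_reversing(decode_substituting(invalid_input))

# ...but a correctly encoded U+EEFF in the input does not.
valid_input = "\uEEFF".encode("utf-8")
restored_valid = encode_reversing(decode_substituting(valid_input))
```

Here restored_invalid equals the original bytes, while restored_valid comes back as the single byte 0xFF instead of the three-byte encoding of U+EEFF that was actually on the input.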

So I don't call that "roundtripping" (the conversion is not fully bijective), but "substitution", as this conversion CANNOT be safely reversed. Such substitution is one-way only.

The only way to perform roundtripping of invalid input bytes to internal code units is to convert these bytes to invalid sequences of code units for internal processing. This way you are certain that the internal code units (even though they are invalid) will never be equal to valid internal code units, which could otherwise be reversed illegally to invalid output bytes (doing so would!

So if an input can contain invalid bytes in the UTF-8 stream, these bytes must be converted (if full roundtripping is needed) to invalid sequences of code units. With an extended UTF-16 internal processing, one can use 0xFFFE and 0xFFFF as markers before an isolated trailing surrogate; with an extended UTF-32 internal processing, one can use code units above 0x10FFFF. Doing this does not even require any private agreement.
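As a rough sketch of the 32-bit variant, the following Python code keeps the internal form as a list of integers and maps each invalid byte nn to 0x110000 + nn. That particular offset and the function names are assumptions chosen for illustration; any range above 0x10FFFF would serve.

```python
INVALID_BASE = 0x110000  # assumed marker range, just above the last code point

def decode_to_units(data):
    """Decode UTF-8 bytes to 32-bit units; invalid bytes become out-of-range units."""
    units = []
    pos = 0
    while pos < len(data):
        try:
            units.extend(ord(c) for c in data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice data[pos:].
            good = data[pos:pos + exc.start].decode("utf-8")
            units.extend(ord(c) for c in good)
            bad = data[pos + exc.start:pos + exc.end]
            units.extend(INVALID_BASE + b for b in bad)
            pos += exc.end
    return units

def encode_from_units(units):
    """Reverse the mapping: out-of-range units go back to their raw bytes."""
    out = bytearray()
    for u in units:
        if u >= INVALID_BASE:
            out.append(u - INVALID_BASE)
        else:
            out += chr(u).encode("utf-8")
    return bytes(out)

# Mixed input: ASCII, an invalid byte, a valid U+EEFF, a truncated lead byte.
mixed_input = b"a\xff" + "\uEEFF".encode("utf-8") + b"\xc3"
mixed_units = decode_to_units(mixed_input)
mixed_output = encode_from_units(mixed_units)
```

Because no valid input can ever decode to a unit above 0x10FFFF, the reverse step cannot mistake legitimate text for a substituted byte, and mixed_output is identical to mixed_input.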

The same applies when processing UTF-16BE or UTF-16LE input streams with invalid byte sequences: the internal processing can be performed in UTF-8 or UTF-32 using invalid sequences of 8-bit or 32-bit code units.

----- Original Message -----
From: "Mark Davis" <[EMAIL PROTECTED]>
To: "Kenneth Whistler" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Monday, December 13, 2004 11:04 PM
Subject: Re: Roundtripping in Unicode

Ken is absolutely right. It would be theoretically possible to add 128 code points that would allow one to roundtrip a bytestream after passing through a UTF-8 <=> UTF-32 conversion. (For that matter, it would be possible to add 2048 code points that would allow the same for a 16-bit data stream.)
