If you convert invalid input bytes nn to U+EEnn, then you can't reverse U+EEnn back to bytes nn without also converting correctly encoded U+EEnn that would have been present on the original input stream.
So I don't call that "roundtripping" (the conversion is not bijective) but "substitution", because this conversion CANNOT be safely reversed. Such substitution is one-way only.
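To make the collision concrete, here is a minimal Python sketch, assuming a hypothetical error handler named "ee-substitute" that maps each undecodable byte nn to U+EEnn. The handler name and the helper function are illustrative only; `codecs.register_error` is the standard hook for this.

```python
import codecs

def _ee_substitute(exc):
    # Hypothetical handler: map every undecodable byte nn to U+EEnn.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(0xEE00 + b) for b in bad), exc.end

codecs.register_error("ee-substitute", _ee_substitute)

def decode_substituting(data: bytes) -> str:
    """Decode UTF-8, substituting U+EEnn for each invalid byte nn."""
    return data.decode("utf-8", errors="ee-substitute")

invalid_input = b"\x80"                 # a stray continuation byte
valid_input = "\uee80".encode("utf-8")  # a correctly encoded U+EE80

# Two different byte streams produce the same internal text, so no
# reverse mapping can recover both of them.
same = decode_substituting(invalid_input) == decode_substituting(valid_input)
```

Since `same` is true while the two inputs differ, the mapping is not injective, which is exactly why it cannot be reversed safely.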
The only way to roundtrip invalid input bytes through internal code units is to convert these bytes to invalid sequences of code units for internal processing. This way you are certain that the internal code units (even if they are invalid) will not be equal to other valid internal code units that could be reversed illegally to invalid output bytes (doing so would!
So if an input can contain invalid bytes in the UTF-8 stream, these bytes must be converted (if full roundtripping is needed) to invalid sequences of code units (with an extended UTF-16 internal processing, one can use 0xFFFE and 0xFFFF as markers before an isolated trailing surrogate; with an extended UTF-32 internal processing, one can use code units above 0x10FFFF). Doing this does not even require any private agreement.
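As a rough sketch of the 32-bit variant, assume the internal form is a list of integer code units and that an invalid byte nn is stored as 0x110000 + nn. That offset and the function names are my own illustrative choices, not anything standardized.

```python
from typing import List

ESCAPE_BASE = 0x110000  # assumption: first value above the Unicode range

def decode_to_units(data: bytes) -> List[int]:
    """UTF-8 bytes -> 32-bit units; invalid byte nn becomes ESCAPE_BASE + nn."""
    units: List[int] = []
    pos = 0
    while pos < len(data):
        try:
            units.extend(ord(c) for c in data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are offsets into the slice data[pos:].
            good = data[pos:pos + exc.start].decode("utf-8")
            units.extend(ord(c) for c in good)
            bad = data[pos + exc.start:pos + exc.end]
            units.extend(ESCAPE_BASE + b for b in bad)
            pos += exc.end
    return units

def encode_from_units(units: List[int]) -> bytes:
    """Inverse mapping: escaped units become the original raw bytes again."""
    out = bytearray()
    for u in units:
        if u >= ESCAPE_BASE:
            out.append(u - ESCAPE_BASE)
        else:
            out.extend(chr(u).encode("utf-8"))
    return bytes(out)

mixed = b"ok \x80 " + "\uee80".encode("utf-8") + b"\xff"
roundtrips = encode_from_units(decode_to_units(mixed)) == mixed
```

Here the stray byte 0x80 and a correctly encoded U+EE80 stay distinct internally (0x110080 versus 0xEE80), so the reverse conversion is unambiguous.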
The same applies when processing UTF-16BE or UTF-16LE input streams with invalid byte sequences: the internal processing can be performed in UTF-8 or UTF-32 using invalid sequences of 8-bit or 32-bit code units.
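For the UTF-16 case, one way to sketch this in Python is with the standard "surrogatepass" error handler, which carries an unpaired surrogate through as a byte sequence that strict UTF-8 rejects. It only covers unpaired surrogates, not other damage such as an odd trailing byte, and the function names are illustrative.

```python
def utf16le_to_internal(data: bytes) -> bytes:
    """UTF-16LE -> 8-bit internal form; lone surrogates become invalid UTF-8."""
    text = data.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogatepass")

def internal_to_utf16le(internal: bytes) -> bytes:
    """Inverse conversion, restoring any lone surrogate exactly."""
    text = internal.decode("utf-8", errors="surrogatepass")
    return text.encode("utf-16-le", errors="surrogatepass")

# "a", an isolated trailing surrogate 0xDC00, then "b", in UTF-16LE.
broken = b"a\x00\x00\xdcb\x00"
internal = utf16le_to_internal(broken)

try:
    internal.decode("utf-8")  # strict decoding refuses the internal form
    internal_is_valid_utf8 = True
except UnicodeDecodeError:
    internal_is_valid_utf8 = False

restored = internal_to_utf16le(internal)
```

Because the internal byte sequence for the lone surrogate is not valid UTF-8, it cannot collide with anything a valid input stream could have produced.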
----- Original Message -----
From: "Mark Davis" <[EMAIL PROTECTED]>
To: "Kenneth Whistler" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Monday, December 13, 2004 11:04 PM
Subject: Re: Roundtripping in Unicode
Ken is absolutely right. It would be theoretically possible to add 128 code
points that would allow one to roundtrip a bytestream after passing through
a UTF-8 <=> UTF-32 conversion. (For that matter, it would be possible to add
2048 code points that would allow the same for a 16-bit data stream.)