Title: RE: Roundtripping in Unicode

> Ken is absolutely right. It would be theoretically possible to add 128 code
> points that would allow one to roundtrip a bytestream after passing through
> a UTF-8 <=> UTF-32 conversion. (For that matter, it would be possible to add
> 2048 code points that would allow the same for a 16-bit data stream.)
You don't really need to add anything for 16-bit <=> UTF-32. There is no real-life need to have that roundtrip guaranteed, whereas for 8-bit data there is. And even for 16-bit <=> UTF-32, you could do it simply by defining how surrogates should be processed. I am not saying it should be done, only showing that it could be. But for UTF-8 <=> UTF-32 it cannot be done without 128 new codepoints, which is why I often compare these 128 codepoints to the surrogates. With one difference: they should be valid characters.
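
For illustration only, here is a minimal Python sketch of the 128-codepoint idea. It is not the proposal itself: it relies on Python's existing "surrogateescape" error handler (PEP 383), which maps each undecodable byte 0x80..0xFF to a lone surrogate U+DC80..U+DCFF rather than to new valid characters. The function names to_code_points and to_bytes are made up for this example.

```python
# Sketch: round-trip arbitrary bytes through a UTF-8 -> code points -> UTF-8
# conversion using 128 escape code points.
# Assumption: Python's "surrogateescape" handler stands in for the 128 new
# code points; it uses lone surrogates U+DC80..U+DCFF, which are NOT valid
# characters, unlike the code points proposed in this thread.

def to_code_points(raw: bytes) -> str:
    """Decode UTF-8, escaping each invalid byte as one code point."""
    return raw.decode("utf-8", errors="surrogateescape")

def to_bytes(text: str) -> bytes:
    """Encode to UTF-8, turning each escape code point back into its byte."""
    return text.encode("utf-8", errors="surrogateescape")

# Valid UTF-8 ("caf" + e-acute + space) followed by two stray 8-bit bytes.
raw = b"caf\xc3\xa9 \xe9\xff"
text = to_code_points(raw)

# The two stray bytes become escape code points; everything else is normal.
escapes = [hex(ord(c)) for c in text if 0xDC80 <= ord(c) <= 0xDCFF]
print(escapes)                # ['0xdce9', '0xdcff']
print(to_bytes(text) == raw)  # True
```

Only 128 escape values are needed because bytes 0x00..0x7F are always valid UTF-8 on their own.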

>
> However, these new code points would really be no better than private use
> code points, since their interpretation would depend entirely
Oh yes, they would be better. Anyone might already be using those same PUA codepoints for something completely different.

> on whatever was assumed to be the interpretation of the original bytestream.
> If X converted a bytestream that was assumed to be a mixture of 8859-7 with
> UTF-8 into Unicode with these new characters, and handed it off to Y, who
> converted the bytestream back assuming that the odd bytes were to be
> iso-8859-9, you would get data corruption. X and Y would have
Nope. No data corruption. You just get the odd bytes back, and achieve exactly the same result as if X had passed the data directly to Y. Y doesn't convert from UTF-8 to iso-8859-9, nor does it convert the odd bytes to iso-8859-9. It converts UTF-8 back to the original byte stream and ONLY THEN interprets it as iso-8859-9. So the result is the same as if it had received the data directly.
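
The X and Y scenario can be sketched the same way. This is a hedged illustration, again assuming Python's "surrogateescape" handler as a stand-in for the 128 codepoints; x_to_unicode and y_back_to_bytes are hypothetical names, and the sample bytes are arbitrary.

```python
# Sketch of the X -> Y hand-off. X and Y never need to agree on what the
# odd bytes mean; they only need to agree on the escape mechanism.
# Assumption: "surrogateescape" (PEP 383) plays the role of the 128 code points.

def x_to_unicode(raw: bytes) -> str:
    """X: convert the byte stream to Unicode, escaping non-UTF-8 bytes."""
    return raw.decode("utf-8", errors="surrogateescape")

def y_back_to_bytes(text: str) -> bytes:
    """Y: first restore the original byte stream, nothing else."""
    return text.encode("utf-8", errors="surrogateescape")

# ASCII text followed by two bytes that are not valid UTF-8.
raw = b"abc \xfe\xdd"

restored = y_back_to_bytes(x_to_unicode(raw))

# Y gets exactly the bytes X started with ...
print(restored == raw)
# ... so interpreting them as iso-8859-9 afterwards gives the same result
# as if Y had received the bytes directly from X.
print(restored.decode("iso-8859-9") == raw.decode("iso-8859-9"))
```

Whatever X assumed about the odd bytes never enters the picture, because the escape step records only the byte values.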


Lars
