Ken is absolutely right. It would be theoretically possible to add 128 code
points that would allow one to roundtrip a bytestream after passing through
a UTF-8 <=> UTF-32 conversion. (For that matter, it would be possible to add
2048 code points that would allow the same for a 16-bit data stream.)
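
To make the mechanics concrete, here is a minimal Python sketch of such a
roundtrip. It is an illustration only, not anything the standard defines: it
assumes the 128 stand-in code points are U+EE80..U+EEFF (the private-use
range Ken mentions below), and the names `ESCAPE_BASE`, `pua_escape`,
`bytes_to_text` and `text_to_bytes` are made up for the example.

```python
import codecs

# Assumption for this sketch: byte 0x80..0xFF maps to U+EE80..U+EEFF.
ESCAPE_BASE = 0xEE00


def _escape_bad_bytes(err):
    """Decode error handler: replace each undecodable byte with a stand-in."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(chr(ESCAPE_BASE + b) for b in bad), err.end


codecs.register_error("pua_escape", _escape_bad_bytes)


def bytes_to_text(data: bytes) -> str:
    """Decode UTF-8, keeping invalid bytes as stand-in code points."""
    return data.decode("utf-8", errors="pua_escape")


def text_to_bytes(text: str) -> bytes:
    """Encode to UTF-8, turning stand-in code points back into raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEE80 <= cp <= 0xEEFF:
            out.append(cp - ESCAPE_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


# Two stray bytes (0xE9, 0xFF) mixed into otherwise valid UTF-8.
data = b"abc\xe9def\xff" + "\u00e9".encode("utf-8")
text = bytes_to_text(data)
restored = text_to_bytes(text)
```

Note that this sketch does not roundtrip input that already contains
U+EE80..U+EEFF as valid UTF-8: those characters would come back out as
single raw bytes.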

However, these new code points would really be no better than private use
code points, since their interpretation would depend entirely on whatever
was assumed to be the interpretation of the original bytestream. If X
converted a bytestream that was assumed to be a mixture of ISO-8859-7 and
UTF-8 into Unicode with these new characters, and handed it off to Y, who
converted the bytestream back assuming that the odd bytes were
ISO-8859-9, you would get data corruption. X and Y would have to agree on
the interpretation of these odd bytes to avoid that corruption, so it is
really no different than private use (where they also have to agree on the
interpretation).
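
A small Python sketch of that X/Y mismatch, again assuming (purely for
illustration) that the odd bytes travel as U+EE80..U+EEFF; the helper
`interpret` is a made-up name for the example:

```python
# Assumption for this sketch: byte 0x80..0xFF is carried as U+EE80..U+EEFF.
ESCAPE_BASE = 0xEE00


def interpret(text: str, legacy_codec: str) -> str:
    """Resolve each stand-in code point by decoding the byte it stands for."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xEE80 <= cp <= 0xEEFF:
            out.append(bytes([cp - ESCAPE_BASE]).decode(legacy_codec))
        else:
            out.append(ch)
    return "".join(out)


# X had a UTF-8 alpha followed by a stray byte 0xE1, which X understood
# as ISO-8859-7 (also alpha). X hands off this string:
handed_off = "\u03b1 = \uEEE1"

as_x_meant = interpret(handed_off, "iso8859_7")  # X's assumption
as_y_reads = interpret(handed_off, "iso8859_9")  # Y's assumption

print(as_x_meant)
print(as_y_reads)
```

The bytes themselves survive, but the same stand-in code point turns into a
different character depending on which legacy charset the receiver assumes.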

--Mark

----- Original Message ----- 
From: "Kenneth Whistler" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Monday, December 13, 2004 13:04
Subject: RE: Roundtripping in Unicode


> Lars Kristan stated:
>
> > I said, the choice is yours. My proposal does not prevent you from
> > doing it your way. You don't need to change anything and it will still
> > work the way it worked before. OK? I just want 128 codepoints so I can
> > make my own choice.
>
> You have them: U+EE80..U+EEFF, which are yours to use (or abuse)
> in an application as you see fit. Just don't expect others outside
> your application to interpret them as you do.
>
> > And once and for all, you can treat those 128 codepoints just as you
> > do today.
>
> A number of people on the list have patiently explained why what
> you are proposing to do fundamentally breaks UTF-8 and its
> relationship to other Unicode encoding forms.
>
> The chances that you will get the standard extended to incorporate
> these 128 code points and define their mapping to invalid byte
> values in UTF-8 is somewhere between zilch, nada, and nil.
>
> --Ken
>
>
>

