From: "Doug Ewell" <[EMAIL PROTECTED]>
Lars Kristan wrote:
I am sure one of the standardizers will find a Unicodally
correct way of putting it.

I can't even understand that paragraph, let alone paraphrase it.

My understanding of his question, and my response to his problem, is that you MUST NOT use VALID Unicode codepoints to represent INVALID byte sequences found in some text with an alleged UTF encoding.


The only way is to use INVALID codepoints, outside the Unicode codespace, and then to design an encoding scheme that contains and extends the Unicode UTF, while making sure that there can be no interaction between such encoded binary data and encoded plain text. In other words, the conversion between the byte-stream encoding scheme and the in-memory encoding form (code units or codepoints) must be fully bijective. This is hard to design if you also have to support multiple UTF encoding schemes, because the invalid byte sequences of those UTF schemes are not the same, and must then be represented by distinct invalid codepoints or code units for each external UTF!
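To illustrate, here is a minimal sketch in Python of such a scheme, assuming purely for illustration that each undecodable byte B of a UTF-8 stream is escaped as the out-of-Unicode value 0x110000 + B (the function names and the choice of base are hypothetical; nothing here is standardized):

    # A minimal sketch, not a standard: each byte that cannot be decoded
    # as UTF-8 is escaped as the out-of-Unicode value 0x110000 + byte.
    # Results are kept as plain ints, since such values cannot exist in a
    # real Unicode string.

    INVALID_BASE = 0x110000  # first value past U+10FFFF (hypothetical)

    def _utf8_len(b0: int) -> int:
        """Expected sequence length for a UTF-8 lead byte, 0 if invalid."""
        if b0 <= 0x7F: return 1
        if 0xC2 <= b0 <= 0xDF: return 2
        if 0xE0 <= b0 <= 0xEF: return 3
        if 0xF0 <= b0 <= 0xF4: return 4
        return 0

    def decode_extended_utf8(data: bytes) -> list[int]:
        """Decode UTF-8; map each undecodable byte to INVALID_BASE + byte."""
        result, i = [], 0
        while i < len(data):
            n = _utf8_len(data[i])
            cp = None
            if n and len(data) - i >= n:
                try:
                    # Strict decoding rejects overlongs, surrogates,
                    # and bad trail bytes.
                    cp = ord(data[i:i + n].decode('utf-8'))
                except UnicodeDecodeError:
                    pass
            if cp is None:  # invalid byte: escape it, advance by one
                cp, n = INVALID_BASE + data[i], 1
            result.append(cp)
            i += n
        return result

    def encode_extended_utf8(codes: list[int]) -> bytes:
        """Inverse mapping: escaped values become their original bytes."""
        out = bytearray()
        for cp in codes:
            if cp >= INVALID_BASE:
                out.append(cp - INVALID_BASE)
            else:
                out += chr(cp).encode('utf-8')
        return bytes(out)

    # The round trip restores any byte stream exactly, valid or not:
    raw = b'abc \xc3\xa9 \xff\xfe xyz'   # contains invalid bytes FF FE
    assert encode_extended_utf8(decode_extended_utf8(raw)) == raw

Note that the mapping is only bijective in the byte-stream-to-codes direction: an arbitrary code sequence containing 0x110000 + B for a byte B that decodes validly would not round-trip, which is one more reason such escape values must stay outside any valid use.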

I won't support the idea of reserving some valid codepoint in the Unicode space for storing what is already considered invalid character data, notably because the Unicode standard is evolving: such a private encoding form might work now but become incompatible with a later version of the Unicode standard, or with a later standardized Unicode encoding scheme, meaning that interoperability would be lost...

The only thing for which you have a guarantee that Unicode will never assign a mandatory behavior is the codepoint space after U+10FFFF. (I'm not sure about the permanent invalidity of some code unit sequences in the UTF-8 and UTF-16 encoding forms; I'm also not sure that there will be enough free space in later standard encoding forms or schemes, see for example SCSU or BOCU-1, or in other private encoding forms already in use, such as the "modified UTF-8" extended encoding scheme defined by Sun in Java.)
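As one concrete illustration of that last point, still assuming the hypothetical escaping scheme sketched above: the byte pair C0 80 is ill-formed in standard UTF-8 (an overlong encoding of U+0000), yet it is exactly how Sun's "modified UTF-8" in Java serializes U+0000, so the two sides disagree about what the escaped data means:

    java_nul = b'\xc0\x80'   # modified UTF-8 (Java) for U+0000

    try:
        java_nul.decode('utf-8')          # strict standard UTF-8
    except UnicodeDecodeError as e:
        print('standard UTF-8 rejects C0 80:', e.reason)

    # Reusing decode_extended_utf8 from the sketch above: the pair becomes
    # two escaped values, not the single U+0000 a Java reader would see.
    print([hex(cp) for cp in decode_extended_utf8(java_nul)])
    # -> ['0x1100c0', '0x110080']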
