From: "Arcane Jill" <[EMAIL PROTECTED]>
Lars's current implementation of this scheme is that his "f" "escapes" the binary octet 1bbbbbbb to 11101110 1011101b 10bbbbbb (or equivalently, byte x becomes the character U+EE00 + x). He is unhappy with this because characters in the range U+EE80 to U+EEFF might be found in real text. So you and I have, between us, suggested three alternative escaping functions, in an attempt to find an escape sequence with a vanishingly small probability of being found in real text. I'm not quite sure why Lars isn't happy with these suggestions - maybe his goal has still not been clearly stated - but either way, since nobody is proposing an amendment to UTFs, it surely isn't the business of the UTC.

What Lars wants has a name: a "transfer-encoding-syntax", a way to transport arbitrary code unit sequences through a more restricted environment. This is not a new idea, but it is not something specified by Unicode.


It is specified in particular interfaces or APIs, as part of a protocol agreed between two compliant parties. Such Transfer-Encoding-Syntaxes are used:
- in MIME for transporting non-plain-text documents: Base64, UUEncoding, Hex, Quoted-Printable...
- in programming languages: the special "\" prefix used to escape some characters (including '\' itself) with a sequence whose meaning is specified by the language itself, or the doubling of single quotes in quoted SQL string constants (a sketch of this kind of escaping follows this list).
- in many protocols: notably COBS (which allows escaping any restricted byte, such as 0x00); many variations of the COBS technique are in use
- in HTML: for example "&quot;" to escape the double-quote character
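
To make the shape of such a pair of functions concrete, here is a toy sketch in Python, in the spirit of the backslash example above (the function names and the choice of escaped characters are mine, not taken from any particular language):

    # f(): '\' becomes '\\' and NUL becomes '\0', so the encoded text
    # never contains a raw NUL; the order of the two replacements matters.
    def tes_encode(text: str) -> str:
        return text.replace('\\', '\\\\').replace('\x00', '\\0')

    # g(): the exact inverse of tes_encode().
    def tes_decode(text: str) -> str:
        out, i = [], 0
        while i < len(text):
            if text[i] == '\\':
                out.append('\x00' if text[i + 1] == '0' else text[i + 1])
                i += 2
            else:
                out.append(text[i])
                i += 1
        return ''.join(out)

    sample = 'a\x00b\\c'
    assert tes_decode(tes_encode(sample)) == sample   # x == g(f(x))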


Remember that all this is a notation. What makes it a Transfer-Encoding-Syntax is that the notation is published and easily implementable by various processes, because the specification is well known and can be agreed between two distinct processes under a well-defined name.

A Transfer-Encoding-Syntax does not alter the meaning or encoding of the original document, and it is by necessity completely bijective: given an arbitrary code unit sequence x in a value set F, it transforms it into a valid code unit sequence y=f(x) in a value set G, which is reversible back to x with a second "decoding" function g, so that x=g(y)=g(f(x)).

A Transfer-Encoding-Syntax is fully bijective between the two definition domains of f() and g(): any valid code unit sequence y in G (the definition domain of g) MUST be decodable without error to F (the definition domain of f), so that y=f(g(y)) for any valid y (in G).

Note that F and G are almost always distinct, even if, often but not always, F includes G (F will not include G, for example, if f() transforms any sequence of bytes F="[\x00-\xFF]*" into a sequence of valid UTF-32 code units G="[\U00000000-\U0010FFFF]*").
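
These two round-trip laws are easy to check on a well-known Transfer-Encoding-Syntax such as Base64 (mentioned above for MIME); a minimal check in Python with the standard base64 module (the second law holds for canonical Base64 text, i.e. without ignored whitespace):

    import base64, os

    x = os.urandom(32)        # an arbitrary code unit sequence x in F (here: bytes)
    y = base64.b64encode(x)   # f(): encode into G (valid Base64 text)

    assert base64.b64decode(y) == x                    # x == g(f(x))
    assert base64.b64encode(base64.b64decode(y)) == y  # y == f(g(y))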

There's a way to create such a pair of functions f() and g() (a sketch in Python follows this list):
- G must be the complete valid value range of Unicode codepoints as indicated above.
- F must be the complete valid value range of bytes as indicated above.
- f() transforms each invalid byte '\xnn' into codepoint U+EEnn (note that as all \x00-\x7F are valid, only U+EE80 to U+EEFF will be used).
- f() MUST also transform any valid byte sequence that normally encodes a codepoint U+EE80 to U+EEFF, by mapping each of these VALID bytes '\xnn' to the codepoint U+EEnn.
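
A minimal sketch of this f()/g() pair in Python (the function names are mine, and the longest-match loop is only one possible way to recognize valid UTF-8 sequences):

    # f(): escape an arbitrary byte sequence into a valid codepoint sequence.
    def escape(data: bytes) -> str:
        out, i = [], 0
        while i < len(data):
            # Look for one valid UTF-8 encoded character at position i, longest first.
            for n in (4, 3, 2, 1):
                chunk = data[i:i + n]
                try:
                    ch = chunk.decode('utf-8')
                except UnicodeDecodeError:
                    continue
                if len(ch) == 1 and not 0xEE80 <= ord(ch) <= 0xEEFF:
                    out.append(ch)           # valid UTF-8, outside the escape range
                    i += len(chunk)
                    break
            else:
                # Invalid byte, or one byte of a sequence encoding U+EE80..U+EEFF:
                # escape the single byte '\xnn' as the codepoint U+EEnn.
                out.append(chr(0xEE00 + data[i]))
                i += 1
        return ''.join(out)

    # g(): the exact inverse of escape().
    def unescape(text: str) -> bytes:
        out = bytearray()
        for ch in text:
            if 0xEE80 <= ord(ch) <= 0xEEFF:
                out.append(ord(ch) - 0xEE00)   # escaped byte
            else:
                out.extend(ch.encode('utf-8'))
        return bytes(out)

    raw = b'\x20\xc0\x80\x21\xee\xba\x80\x22'   # the example used below
    assert unescape(escape(raw)) == raw          # x == g(f(x))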


Note that the UTF-8 encoding of U+EE80 to U+EEFF is:
source bits:        1110   11101b   bbbbbb
UTF-8 bits:     11101110 1011101b 10bbbbbb
UTF-8 bytes: [\xEE] [\xBA-\xBB] [\x80-\xBF]
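
You can confirm this byte pattern directly in Python:

    # The escape range U+EE80..U+EEFF occupies exactly [\xEE][\xBA-\xBB][\x80-\xBF]:
    assert chr(0xEE80).encode('utf-8') == b'\xee\xba\x80'
    assert chr(0xEEFF).encode('utf-8') == b'\xee\xbb\xbf'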

For example, consider this NOT-UTF-8 sequence of bytes:
   \x20\xC0\x80\x21\xEE\xBA\x80\x22

You want to escape it to valid UTF-8. It decomposes as:
- \x20 : valid UTF-8,
   no change, code as \x20 (which encodes U+0020 in UTF-8)
- \xC0\x80 : not UTF-8, escape it as:
   \xC0 becomes \xEE\xBB\x80 (which encodes U+EEC0 in UTF-8)
   \x80 becomes \xEE\xBA\x80 (which encodes U+EE80 in UTF-8)
- \x21: valid UTF-8,
   no change, code as \x21 (which encodes U+0021 in UTF-8)
- \xEE\xBA\x80: valid UTF-8, but it would encode U+EE80, escape it:
   \xEE becomes \xEE\xBB\xAE (which encodes U+EEEE in UTF-8)
   \xBA becomes \xEE\xBA\xBA (which encodes U+EEBA in UTF-8)
   \x80 becomes \xEE\xBA\x80 (which encodes U+EE80 in UTF-8)
- \x22: valid UTF-8,
   no change, code as \x22 (which encodes U+0022 in UTF-8)

The generated sequence is 10 bytes longer, but it is now all valid UTF-8. To get back the original NON-UTF-8 sequence, you just need to convert any occurrence of [\xEE][\xBA-\xBB][\x80-\xBF] back to a byte in [\x80-\xFF].
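
That reverse conversion is a one-liner with a bytes regex; a standalone sketch in Python, tested against the 18-byte result of the example above:

    import re

    # Map each [\xEE][\xBA-\xBB][\x80-\xBF] triple back to its byte 1bbbbbbb:
    # bit b6 is the low bit of the second byte, b5..b0 the low 6 bits of the third.
    def unescape(encoded: bytes) -> bytes:
        return re.sub(
            b'\xee([\xba\xbb])([\x80-\xbf])',
            lambda m: bytes([0x80 | ((m.group(1)[0] & 1) << 6) | (m.group(2)[0] & 0x3f)]),
            encoded)

    escaped = (b'\x20\xee\xbb\x80\xee\xba\x80\x21'
               b'\xee\xbb\xae\xee\xba\xba\xee\xba\x80\x22')
    assert unescape(escaped) == b'\x20\xc0\x80\x21\xee\xba\x80\x22'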

You could just as well have generated valid UTF-16 or UTF-32 with the SAME algorithm:

* Escaping to UTF-16:
- \x20 : valid UTF-8,
   no change, code as \u0020 (which encodes U+0020 in UTF-16)
- \xC0\x80 : not UTF-8, escape it as:
   \xC0 becomes \uEEC0 (which encodes U+EEC0 in UTF-16)
   \x80 becomes \uEE80 (which encodes U+EE80 in UTF-16)
- \x21: valid UTF-8,
   no change, code as \u0021 (which encodes U+0021 in UTF-16)
- \xEE\xBA\x80: valid UTF-8, but it would encode U+EE80, escape it:
   \xEE becomes \uEEEE (which encodes U+EEEE in UTF-16)
   \xBA becomes \uEEBA (which encodes U+EEBA in UTF-16)
   \x80 becomes \uEE80 (which encodes U+EE80 in UTF-16)
- \x22: valid UTF-8,
   no change, code as \u0022 (which encodes U+0022 in UTF-16)

* Escaping to valid UTF-32:
- Just replace all occurrences of "\u" and "UTF-16" in the previous paragraph by "\U0000" and "UTF-32".
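
In Python terms, the escaped result is just a sequence of codepoints, so the three forms are only different serializations of the same thing; a self-contained check using the example above:

    # The escaped result of the example, written directly as codepoints:
    s = '\u0020\uEEC0\uEE80\u0021\uEEEE\uEEBA\uEE80\u0022'

    assert s.encode('utf-8') == (b'\x20\xee\xbb\x80\xee\xba\x80\x21'
                                 b'\xee\xbb\xae\xee\xba\xba\xee\xba\x80\x22')
    assert len(s.encode('utf-16-be')) == 2 * len(s)   # one 16-bit unit per codepoint here
    assert len(s.encode('utf-32-be')) == 4 * len(s)   # one 32-bit unit per codepoint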


But note that any occurrences of U+EE80 to U+EEFF in the original NON-UTF-8 "text" are escaped, even though they are valid Unicode. However, choosing U+EE80 to U+EEFF is not a problem, because these PUAs are very unlikely to be present in valid source texts, in the absence of a prior PUA agreement.

Remember that this is only a Transfer-Encoding-Syntax, not a new encoding. It does not require ANY new codepoint allocation by Unicode!




