Roundtripping Solved

Arcane Jill Wed, 15 Dec 2004 03:52:22 -0800

I followed (and understood) Lars's explanation as to why the NOT-xxxx solution wouldn't work for him. Shame really - but here's another bash at a solution, again without breaking the Unicode model. If I have understood this correctly, these are Lars's requirements:

1) There exists a function, f(), which maps an arbitrary octet stream to a sequence of Unicode characters.
2) A required property of f() is that, if any substring of its input is valid UTF-8, then f() must convert that substring to the sequence of Unicode characters which would have been obtained by UTF-8 itself.
3) There exists an inverse function, g(), such that g(a) == b if and only if f(b) == a.

As Unicoders have pointed out, these goals appear to be mutually contradictory, unless we assume the following corollary, which I shall call "requirement 4".

4) A second required property of f() is that, if any octet of its input is not part of a valid UTF-8 substring, then f() must convert that octet to a Unicode character string /which cannot possibly appear in Unicode plain text/.

It is for reasons of requirement (4) that Lars proposes the introduction of 128 BMP codepoints. His intention is that they be marked as "reserved - do not use", so that requirement 4 is met. Naturally, this proposal has met with a lot of resistance, and almost certainly would never get approved by the UC. Therefore, I propose an alternative solution, as follows:

DEFINITION - "f" is a function which maps an arbitrary octet stream to a sequence of Unicode characters, such that (1) any substring which happens to be valid UTF-8 is mapped to the sequence of Unicode characters which would have been produced by UTF-8, and (2) all remaining single octets, xx (with xx necessarily such that 0x80 <= xx <= 0xFF), are each mapped to the sequence: { U+0C55E3, U+01ED7A, U+05FDCB, U+09C351, U+07E168, U+0BBC80, U+107C09, U+0BA458, U+064188, U+048375, U+08ACE0, U+031DEF, U+00xx } (I got those numbers from a true random number generator).
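
For anyone who wants to experiment, here is a minimal sketch of f() in Python, assuming Python's strict UTF-8 decoder is an acceptable definition of "valid UTF-8". The names ESCAPE, f and the error-handler name "roundtrip_escape" are my own inventions for illustration; nothing here is a standard API beyond codecs.register_error and bytes.decode.

```python
import codecs

# The twelve fixed code points from the definition of f above.
ESCAPE_CODEPOINTS = (
    0x0C55E3, 0x01ED7A, 0x05FDCB, 0x09C351, 0x07E168, 0x0BBC80,
    0x107C09, 0x0BA458, 0x064188, 0x048375, 0x08ACE0, 0x031DEF,
)
ESCAPE = "".join(chr(cp) for cp in ESCAPE_CODEPOINTS)


def _escape_invalid_octet(err):
    """Decode-error handler: escape one offending octet, then resume."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start]  # always in 0x80..0xFF for UTF-8 errors
    # Emit the escape prefix followed by U+00xx, and restart decoding at
    # the very next octet so each invalid octet is escaped individually.
    return ESCAPE + chr(bad), err.start + 1


# "roundtrip_escape" is a hypothetical handler name chosen for this sketch.
codecs.register_error("roundtrip_escape", _escape_invalid_octet)


def f(octets: bytes) -> str:
    """Map an arbitrary octet stream to a sequence of Unicode characters."""
    return octets.decode("utf-8", errors="roundtrip_escape")
```

With this sketch, f(b"a\xffb") gives "a", then the twelve escape characters, then U+00FF, then "b"; input that is already valid UTF-8 decodes exactly as it would under plain UTF-8.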

OBSERVATION - Requirement (4) is not met absolutely; however, the probability of the UTF-8 encoding of this sequence occurring "accidentally" at an arbitrary offset in an arbitrary octet stream is approximately one in 2^384 (the twelve fixed characters encode to 48 octets, that is, 384 bits); the probability of its occurring in /plain text/ is even smaller. This means that if your application were capable of processing one terabyte of data per second, you would expect to encounter this sequence by accident roughly once every 2^319 years. (For reference, the Universe is somewhere around 2^34 years old). This means that requirement 4 is "effectively met", even if not actually met.
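
The arithmetic behind that estimate can be checked with a short back-of-envelope script. This is only a sketch under my own assumptions: octets are treated as uniformly random and independent, a terabyte is taken as 2^40 octets, a year as 365.25 days, and only the twelve fixed characters are counted (the trailing U+00xx is ignored). The variable names are mine.

```python
import math

# The twelve fixed code points of the escape prefix.
ESCAPE_CODEPOINTS = (
    0x0C55E3, 0x01ED7A, 0x05FDCB, 0x09C351, 0x07E168, 0x0BBC80,
    0x107C09, 0x0BA458, 0x064188, 0x048375, 0x08ACE0, 0x031DEF,
)
prefix = "".join(chr(cp) for cp in ESCAPE_CODEPOINTS).encode("utf-8")

# Each code point lies above U+FFFF, so each takes 4 octets in UTF-8.
prefix_bits = 8 * len(prefix)

# Assumed throughput: one terabyte (2**40 octets) per second.
octets_per_second = 2 ** 40
seconds_per_year = 365.25 * 24 * 3600
offsets_per_year = octets_per_second * seconds_per_year

# Expected waiting time is about 2**prefix_bits / offsets_per_year years;
# work in log2 to keep the numbers manageable.
log2_years = prefix_bits - math.log2(offsets_per_year)

print(f"prefix length: {len(prefix)} octets, {prefix_bits} bits")
print(f"expected wait: about 2^{log2_years:.1f} years")
```

The exact exponent shifts a little with the assumptions (a decimal terabyte, say), but not enough to change the conclusion.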

DEFINITION - "g" is the inverse function of f. By the observation above, f is not strictly injective (an input which already contains the UTF-8 encoding of the escape sequence collides with an escaped octet), so in the event of ambiguity, the sequence { U+0C55E3, U+01ED7A, U+05FDCB, U+09C351, U+07E168, U+0BBC80, U+107C09, U+0BA458, U+064188, U+048375, U+08ACE0, U+031DEF, U+00xx } is /always/ assumed to map to the single octet xx. The probability of this choice being wrong is as stated above.
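
A matching sketch of g() in Python might look like the following. Again, the names ESCAPE, _ESCAPED and g are mine, and I am assuming the input contains no lone surrogates (otherwise the UTF-8 encoding step raises an error). Every occurrence of the twelve-character prefix followed by a character in U+0080..U+00FF is turned back into a single octet; everything else is encoded as ordinary UTF-8.

```python
import re

# The twelve fixed code points from the definition of f.
ESCAPE = "".join(chr(cp) for cp in (
    0x0C55E3, 0x01ED7A, 0x05FDCB, 0x09C351, 0x07E168, 0x0BBC80,
    0x107C09, 0x0BA458, 0x064188, 0x048375, 0x08ACE0, 0x031DEF,
))

# Escape prefix followed by exactly one character in U+0080..U+00FF.
_ESCAPED = re.compile(re.escape(ESCAPE) + "([\u0080-\u00ff])")


def g(text: str) -> bytes:
    """Map a sequence of Unicode characters back to an octet stream."""
    out = bytearray()
    pos = 0
    for match in _ESCAPED.finditer(text):
        # Ordinary text between escapes is encoded as plain UTF-8.
        out += text[pos:match.start()].encode("utf-8")
        # The escaped character U+00xx becomes the single octet xx.
        out.append(ord(match.group(1)))
        pos = match.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)
```

Note that the prefix followed by anything outside U+0080..U+00FF is left alone and simply encoded as UTF-8, which is the "always assume an escape" rule applied only where an escape is actually possible.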

Now everything will work. Unicode is not broken. All UTFs are interchangeable as before; Lars's "escape aware" applications can use the functions f() and g() instead of UTF-8 transformations; all other Unicode applications will retain Lars's data uncorrupted, and he can "unescape" it (that is, apply function g()) at the appropriate time to recover the original data.

That do?
Jill
