On 15/12/2004 11:11, Arcane Jill wrote:

I followed (and understood) Lars' explanation as to why the NOT-xxxx solution wouldn't work for him. Shame really - but here's another bash at a solution, again without breaking the Unicode model. If I have understood this correctly, these are Lars' requirements:

1) There exists a function, f(), which maps an arbitrary octet stream to a sequence of Unicode characters
2) A required property of f() is that, if any substring of its input is valid UTF-8, then f() must convert that substring to the sequence of Unicode characters which would have been obtained by UTF-8 itself.
3) There exists an inverse function, g(), such that g(a) == b if and only if f(b) == a.


Lars seems to have extended the requirement here such that a can be any sequence of 16-bit words, just as b can be any sequence of octets, i.e. he requires not only that g(f(b)) == b for all b, but also that f(g(a)) == a for all a. That may make things much harder! There is at least a need to deal with unpaired surrogates.


As Unicoders have pointed out, these goals appear to be mutually contradictory, unless we assume the following corollary, which I shall call "requirement 4".


4) A second required property of f() is that, if any octet of its input is not part of a valid UTF-8 substring, then f() must convert that octet to a Unicode character string /which cannot possibly appear in Unicode plain text/.

It is for reasons of requirement (4) that Lars proposes the introduction of 128 BMP codepoints. His intention is that they be marked as "reserved - do not use", so that requirement 4 is met. Naturally, this proposal has met with a lot of resistance, and almost certainly would never get approved by the UC. Therefore, I propose an alternative solution, as follows:

...

Now everything will work. Unicode is not broken. All UTFs are interchangeable as before; Lars's "escape aware" applications can use the functions f() and g() instead of UTF-8 transformations; all other Unicode applications will retain Lars's data uncorrupted, and he can "unescape" it (that is, apply function g()) at the appropriate time to recover the original data.

That do?
Jill

Jill, again your solution is ingenious. But would it not work just as well for Lars' purposes to use, instead of your string of random characters, just ONE reserved code point followed by U+0xx? Instead of asking the UTC to allocate a specific code point for this (which it probably will not do), he can use either U+FFFE or U+FFFF, which "are intended for process internal uses, but are not permitted for interchange." Let's call the one non-character chosen INVALID.

Of course a problem arises if the original filename consists of a string which is the UTF-8 representation of INVALID. Does this in fact count as valid UTF-8? (If it does, an alternative might be to use an unpaired surrogate for INVALID, because the UTF-8 representation of a surrogate is invalid UTF-8.) Even if it does, it does not represent valid Unicode, and so the conversion routine can convert the UTF-8 for INVALID as if it was three invalid bytes.
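
To make the idea concrete, here is a rough Python sketch of f() and g() along these lines. It is only an illustration under assumptions of my own choosing: INVALID is taken to be U+FFFF, each escaped octet xx is written as INVALID followed by the character with code point xx, and the error-handler name "invalid_escape" is made up for this example. The UTF-8 form of INVALID itself (EF BF BF) is escaped as three invalid bytes, as suggested above.

```python
import codecs

# Assumption: U+FFFF plays the role of INVALID; U+FFFE would work the same way.
INVALID = "\uffff"
INVALID_UTF8 = INVALID.encode("utf-8")  # the three octets EF BF BF


def _escape_bytes(raw: bytes) -> str:
    """Write each octet xx as INVALID followed by the character U+00xx."""
    return "".join(INVALID + chr(b) for b in raw)


def _handler(err):
    """Decoding error handler: escape the offending octets, then resume."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    return _escape_bytes(err.object[err.start:err.end]), err.end


# "invalid_escape" is a hypothetical handler name chosen for this sketch.
codecs.register_error("invalid_escape", _handler)


def f(data: bytes) -> str:
    """Map any octet stream to text; valid UTF-8 decodes as usual."""
    # Split out the UTF-8 form of INVALID first, so that it is escaped as
    # three invalid bytes instead of colliding with the escape marker.
    pieces = data.split(INVALID_UTF8)
    decoded = [p.decode("utf-8", "invalid_escape") for p in pieces]
    return _escape_bytes(INVALID_UTF8).join(decoded)


def g(text: str) -> bytes:
    """Inverse of f() for any string that f() can produce."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == INVALID:
            # The character after INVALID carries the original octet value;
            # a missing or out-of-range follower raises an exception.
            out.append(ord(text[i + 1]))
            i += 2
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)


raw = b"caf\xe9.txt"  # a Latin-1 filename, not valid UTF-8
assert g(f(raw)) == raw
```

This only demonstrates g(f(b)) == b. As written, g() rejects strings in which INVALID is not followed by a character in the range U+0000 to U+00FF, so it makes no attempt at the stronger requirement that f(g(a)) == a for every a.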

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



