On 15/12/2004 14:36, Arcane Jill wrote:

Yes, but only if you can have some reasonable assurance that the byte sequence emitted by UTF(c,x) (where c is the single reserved codepoint you suggest, and x is U+00xx, the value to be escaped expressed as a character) will not occur in plain text. This is theoretically checkable - the total number of legal Unix locales is large, but finite. I don't know how many there are, but, in principle at least, one could examine each of them in turn and determine the probability of any given byte sequence occurring in each locale's encoding.


You don't need this kind of assurance. Suppose my chosen INVALID character would normally become <0xpp, 0xqq, 0xrr> according to the UTF-8 algorithm, and 0xyy is an octet which cannot be interpreted as part of UTF-8.

My proposed conversion from the NOT-UTF-8 of the filename to NOT-Unicode would be that 0xyy is mapped to <INVALID, U+00yy> - which can be represented in NOT-UTF-16 and in NOT-UTF-32 (actually maybe in UTF-16 and UTF-32 if these forms are defined as able to represent the noncharacter INVALID). And this conversion is reversible, as long as no one attempts to pass noncharacters through it for any other reason.

Then suppose the NOT-UTF-8 filename includes the octet sequence <0xpp, 0xqq, 0xrr>. A regular UTF-8 conversion would convert this sequence to INVALID, and 0xyy perhaps to REPLACEMENT CHARACTER. But my alternative NOT-UTF-8 conversion would (as well as converting 0xyy to <INVALID, U+00yy>) recognise that the sequence <0xpp, 0xqq, 0xrr> does not represent a valid Unicode character (but rather a noncharacter), and so convert it octet by octet to <INVALID, U+00pp, INVALID, U+00qq, INVALID, U+00rr>. This conversion is reversible.

I think that meets the requirement that g(f(b)) == b for all b. It also requires a little extra complexity in my NOT-UTF-8 conversion, which has to refuse to decode noncharacters and escape their octets instead.

This is not reversible in the other direction: f(g(a)) == a does not hold for all a. For example <INVALID, U+0020> becomes 0x20 in NOT-UTF-8, which of course is converted back to simply U+0020; or else it becomes <0xpp, 0xqq, 0xrr, 0x20>, which is converted back to <INVALID, U+00pp, INVALID, U+00qq, INVALID, U+00rr, U+0020>. But Lars confirmed that this is not a requirement.
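
The two conversions can be sketched in Python roughly as follows. This is only an illustration under assumptions of my own choosing: U+FFFF stands in for INVALID (no particular noncharacter is fixed above), f and g are the names used above for octets-to-text and text-to-octets, and the helper names are hypothetical. In g, a pair <INVALID, U+00yy> is taken to become the single octet 0xyy, which is the first of the two behaviours mentioned above.

```python
# Sketch only: INVALID = U+FFFF is an assumed choice of noncharacter.
INVALID = "\uffff"


def is_noncharacter(ch):
    """True for U+FDD0..U+FDEF and for U+xxFFFE / U+xxFFFF in any plane."""
    cp = ord(ch)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def decode_one(raw, i):
    """Return (char, length) for a valid UTF-8 sequence at raw[i:], else None."""
    for length in range(1, 5):
        chunk = raw[i:i + length]
        if len(chunk) < length:
            return None  # ran off the end of the input
        try:
            return chunk.decode("utf-8"), length
        except UnicodeDecodeError:
            continue  # try a longer sequence
    return None


def f(raw):
    """NOT-UTF-8 octets -> NOT-Unicode text; undecodable octets are escaped."""
    out = []
    i = 0
    while i < len(raw):
        hit = decode_one(raw, i)
        if hit is not None and not is_noncharacter(hit[0]):
            out.append(hit[0])
            i += hit[1]
        else:
            # Invalid octet, or the start of an encoded noncharacter:
            # escape this one octet as <INVALID, U+00yy>.
            out.append(INVALID + chr(raw[i]))
            i += 1
    return "".join(out)


def g(text):
    """NOT-Unicode text -> NOT-UTF-8 octets; <INVALID, U+00yy> becomes 0xyy."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == INVALID and i + 1 < len(text) and ord(text[i + 1]) <= 0xFF:
            out.append(ord(text[i + 1]))
            i += 2
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)


if __name__ == "__main__":
    b = b"abc\xff\xef\xbf\xbfz\xc3\xa9"
    assert g(f(b)) == b                # octets survive the round trip
    a = INVALID + " "
    assert f(g(a)) == " "              # but f(g(a)) != a for this a
```

The octets 0xEF 0xBF 0xBF (the UTF-8 form of U+FFFF) come out of f as three escaped pairs, so INVALID never appears in the output of f except as the start of an escape; that is what makes g(f(b)) == b hold.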

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



