Yes, but only if you can have some reasonable assurance that the byte sequence emitted by UTF(c,x) (where c is the single reserved codepoint you suggest, and x is U+00xx, the value to be escaped expressed as a character) will not occur in plain text. This is theoretically checkable: the total number of legal Unix locales is large, but finite. I don't know how many there are, but, in principle at least, one could examine each of them in turn and determine the probability of any given byte sequence occurring in each locale's encoding.
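
To make the concern concrete, here is a rough Python sketch of the single-codepoint scheme, restricted to UTF-8 for simplicity. The function names are made up for illustration, and the choice of U+001A for c is only an assumption; any reserved codepoint would behave the same way. It is a sketch of the idea, not anyone's actual implementation.

```python
ESC = "\u001a"  # assumed choice of c; purely illustrative


def decode_with_escapes(raw: bytes, esc: str = ESC) -> str:
    """Decode UTF-8, turning each undecodable byte 0xNN into esc + U+00NN."""
    # surrogateescape maps an undecodable byte 0xNN to the lone surrogate U+DCNN.
    text = raw.decode("utf-8", errors="surrogateescape")
    return "".join(
        esc + chr(ord(ch) - 0xDC00) if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in text
    )


def encode_with_escapes(text: str, esc: str = ESC) -> bytes:
    """Inverse mapping: esc followed by U+0080..U+00FF becomes one raw byte."""
    out = bytearray()
    i = 0
    while i < len(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if text[i] == esc and nxt and 0x80 <= ord(nxt) <= 0xFF:
            out.append(ord(nxt))  # restore the original invalid byte
            i += 2
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)


# An invalid byte survives the round trip.
broken = b"caf\xe9"
assert encode_with_escapes(decode_with_escapes(broken)) == broken

# But valid plain text that already contains c followed by U+00E9 does not.
plain = (ESC + "\u00e9").encode("utf-8")
assert encode_with_escapes(decode_with_escapes(plain)) != plain
```

The second assertion is the whole problem in miniature: the scheme is only safe if that two-character sequence never turns up in genuine plain text.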

Another good choice for c would be U+001A, preserving the original meaning of the old ASCII SUB character. My understanding is that, back in the days of teletypes, SUB originally caused the following character to be displayed in red ink instead of black ink, until smarter printers came along, after which time SUB caused the following character to be selected from an alternative character set. Of course, all that changed when the 8th bit started to be used. Now the C0 control codepoints (apart from TAB, CR, LF and FF) are nothing but an ancient historical legacy which (in my opinion) could be re-used for something else. (That won't happen, of course, because of stability guarantees).

But it's the "knowing" part that's the problem. Can you really "know" that any given byte sequence won't appear in plain text? That's the only reason I thought of pushing the probability of incorrect identification down to an astronomically low level.

Jill

-----Original Message-----
From: Peter Kirk [mailto:[EMAIL PROTECTED]
Sent: 15 December 2004 12:54
To: Arcane Jill
Cc: Unicode
Subject: Re: Roundtripping Solved

But would it not work just as
well for Lars' purposes to use, instead of your string of random
characters, just ONE reserved code point followed by U+0xx?




