On 16/12/2004 11:36, Lars Kristan wrote:
> > can use either U+FFFE or U+FFFF, which "are
> > intended for process internal uses, but are not permitted for
> > interchange." Let's call the one non-character chosen INVALID.
>
> Can't. I DO want the resulting UTF-16 to be valid for interchange. This is the whole purpose. And increasing the overhead is also not desired.

But this last requirement provides the proof that you can't have what you want.
The current situation is:
1. for all valid UTF-8 strings s8, f(s8) is a valid UTF-16 string and g(f(s8)) = s8
2. for all valid UTF-16 strings s16, g(s16) is a valid UTF-8 string and f(g(s16)) = s16
Your requirements are apparently:
3. for all INVALID UTF-8 strings t8, f(t8) is a valid UTF-16 string and g(f(t8)) = t8
But if f(t8) is a valid UTF-16 string, by rule 2 g(f(t8)) is a valid UTF-8 string, and by rule 3 g(f(t8)) = t8. But we have already stated that t8 is an INVALID UTF-8 string. So there is a mathematically proved inconsistency in your requirements.
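The argument can be checked concretely. The following is only a minimal sketch in Python, not part of any proposal: the helper names `is_valid_utf8` and `g` and the sample strings are my own choices for illustration, and UTF-16 is taken to be little-endian without a BOM. It shows that the standard g always produces valid UTF-8, so it can never return an invalid t8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def g(s16: bytes) -> bytes:
    """Standard conversion of valid UTF-16 (little-endian, no BOM) to UTF-8."""
    return s16.decode("utf-16-le").encode("utf-8")


# An invalid UTF-8 string: the byte 0xFF never occurs in valid UTF-8.
t8 = b"abc\xff"
assert not is_valid_utf8(t8)

# Illustrative valid strings, including a PUA character and a
# supplementary-plane character (a surrogate pair in UTF-16).
samples = ["abc", "\u00e9", "\ufffd", "\ue0ff", "\U0001f600"]

for text in samples:
    s16 = text.encode("utf-16-le")
    out = g(s16)
    # Rule 2: g of a valid UTF-16 string is always valid UTF-8 ...
    assert is_valid_utf8(out)
    # ... so it can never equal the invalid string t8.
    assert out != t8
```

A finite list of samples proves nothing by itself, of course. The point is that the first assertion in the loop holds for every valid UTF-16 input by definition of the conversion, and that is all the argument needs.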
The only way round this is to break the functionality of g so that it does not correctly convert all valid UTF-16 strings to UTF-8. That will certainly be unacceptable to the UTC. The most you might get away with is a private function which does some non-standard conversion of PUA characters, but then you risk messing up PUA characters used by agreement between end users, or in filenames as UTF-8.
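To make the PUA risk concrete, here is a rough sketch of such a private pair of functions. Everything in it is an assumption made for illustration: the mapping of an undecodable byte 0xNN to U+E0NN, the names `f_private` and `g_private`, and the error handler name `pua_escape_demo` are all hypothetical, and a Python `str` stands in for the sequence of code points that would be stored as UTF-16.

```python
import codecs

PUA_BASE = 0xE000  # hypothetical scheme: bad byte 0xNN becomes U+E0NN


def _pua_escape(exc):
    """Decode error handler: map each undecodable byte into the PUA."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua_escape_demo", _pua_escape)


def f_private(data: bytes) -> str:
    """Non-standard f: UTF-8 to text, escaping invalid bytes into the PUA."""
    return data.decode("utf-8", "pua_escape_demo")


def g_private(text: str) -> bytes:
    """Non-standard g: turn U+E080..U+E0FF back into raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if PUA_BASE + 0x80 <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


# The invalid string now round-trips ...
t8 = b"abc\xff"
assert g_private(f_private(t8)) == t8

# ... but a valid UTF-8 string that really contains U+E0FF no longer does.
s8 = "\ue0ff".encode("utf-8")  # b"\xee\x83\xbf"
assert g_private(f_private(s8)) != s8
```

The second assertion is the problem described above: once g treats some PUA characters specially, rule 1 fails for any valid text that already uses those characters.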
Alternatively, you need to relax your requirement that f(t8) is a valid UTF-16 string, and instead allow it to be a UTF-16-like string that includes something invalid, such as a noncharacter or an unpaired surrogate. This will not be technically valid for interchange, of course. But my suggestion of using a noncharacter as an escape is one way in which this could be done.
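As an illustration of the relaxed requirement, Python's built-in `surrogateescape` error handler takes the unpaired-surrogate route (not the noncharacter escape I suggested): each undecodable byte 0xNN becomes the lone surrogate U+DCNN. The sketch below assumes little-endian UTF-16 without a BOM, and the variable names are mine.

```python
t8 = b"abc\xff"  # invalid UTF-8

# Relaxed f: undecodable bytes become unpaired low surrogates.
escaped = t8.decode("utf-8", "surrogateescape")
assert escaped == "abc\udcff"

# The result is not valid UTF-16: a strict encoder refuses it.
try:
    escaped.encode("utf-16-le")
    strictly_valid = True
except UnicodeEncodeError:
    strictly_valid = False
assert not strictly_valid

# It can still be stored as a UTF-16-like string if lone surrogates
# are explicitly allowed through.
utf16_like = escaped.encode("utf-16-le", "surrogatepass")

# Relaxed g: read it back and restore the original bytes.
restored = (
    utf16_like.decode("utf-16-le", "surrogatepass")
    .encode("utf-8", "surrogateescape")
)
assert restored == t8
```

The round trip works precisely because the intermediate string is not valid UTF-16, so the standard f and g are left untouched for all valid input.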
-- 
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/