On 16/12/2004 11:36, Lars Kristan wrote:

...

> can use either U+FFFE or U+FFFF, which "are
> intended for process internal uses, but are not permitted for
> interchange." Let's call the one non-character chosen INVALID.
Can't. I DO want the resulting UTF-16 to be valid for interchange. This is the whole purpose. And increasing the overhead is also not desired.



But this last requirement provides the proof that you can't have what you want.

The current situation is:

1. for all valid UTF-8 strings s8, f(s8) is a valid UTF-16 string and g(f(s8)) = s8
2. for all valid UTF-16 strings s16, g(s16) is a valid UTF-8 string and f(g(s16)) = s16


Your requirements are apparently:

3. for all INVALID UTF-8 strings t8, f(t8) is a valid UTF-16 string and g(f(t8)) = t8

But if f(t8) is a valid UTF-16 string, then by rule 2 g(f(t8)) is a valid UTF-8 string, while by rule 3 g(f(t8)) = t8. So t8 would have to be a valid UTF-8 string, but we have already stated that t8 is an INVALID UTF-8 string. So your requirements are provably inconsistent.
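To make the argument concrete, here is a rough Python sketch. It assumes f and g are simply the standard strict conversions (I have picked little-endian UTF-16 without a BOM, which is my choice and not something stated above), and is_valid_utf8 and the sample strings are made up for illustration. It checks rules 1 and 2 on a sample and then shows that whatever valid UTF-16 string one might nominate as f(t8), g returns valid UTF-8, which therefore cannot equal t8.

```python
def f(s8: bytes) -> bytes:
    """Strict UTF-8 -> UTF-16LE conversion; raises on invalid UTF-8."""
    return s8.decode("utf-8").encode("utf-16-le")


def g(s16: bytes) -> bytes:
    """Strict UTF-16LE -> UTF-8 conversion; raises on invalid UTF-16."""
    return s16.decode("utf-16-le").encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sample = "na\u00efve \U0001F600"

# Rule 1: valid UTF-8 round-trips through f then g.
s8 = sample.encode("utf-8")
assert g(f(s8)) == s8

# Rule 2: valid UTF-16 round-trips through g then f.
s16 = sample.encode("utf-16-le")
assert f(g(s16)) == s16

# An invalid UTF-8 string: 0xFF never occurs in well-formed UTF-8.
t8 = b"abc\xff"
assert not is_valid_utf8(t8)

# Whatever valid UTF-16 string is proposed as f(t8), g maps it to
# valid UTF-8, so the result can never be the invalid t8.
candidates = ["abc\ufffd", "abc\u00ff", "abc\ue0ff"]
for text in candidates:
    out = g(text.encode("utf-16-le"))
    assert is_valid_utf8(out)
    assert out != t8
```

The three candidates are only examples, of course; the point of the proof is that no candidate can work as long as g obeys rule 2.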

The only way round this is to break the functionality of g so that it does not correctly convert all valid UTF-16 strings to UTF-8. That will certainly be unacceptable to the UTC. The most you might get away with is a private function which does some non-standard conversion of PUA characters, but then you risk messing up PUA characters which are used by agreement between end users, or which appear as UTF-8 in filenames.
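A sketch of what such a private conversion might look like, and how it goes wrong, assuming a purely hypothetical mapping of each invalid byte b to the PUA code point U+E000 + b. The names f_private, g_private, PUA_BASE and the error handler name are all invented here; nothing of the kind is standardised.

```python
import codecs

PUA_BASE = 0xE000  # hypothetical: invalid byte b is escaped as U+E000 + b


def _escape_to_pua(err):
    """Decode error handler: turn each undecodable byte into a PUA character."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(chr(PUA_BASE + b) for b in bad), err.end


codecs.register_error("pua_escape_demo", _escape_to_pua)


def f_private(t8: bytes) -> bytes:
    """Non-standard UTF-8 -> UTF-16LE: invalid bytes become PUA characters."""
    return t8.decode("utf-8", "pua_escape_demo").encode("utf-16-le")


def g_private(s16: bytes) -> bytes:
    """Non-standard UTF-16LE -> UTF-8: escape characters become raw bytes."""
    out = bytearray()
    for ch in s16.decode("utf-16-le"):
        cp = ord(ch)
        if PUA_BASE + 0x80 <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)   # restore the original invalid byte
        else:
            out += ch.encode("utf-8")   # ordinary character
    return bytes(out)


# The invalid string now round-trips through valid UTF-16 ...
t8 = b"abc\xff"
assert g_private(f_private(t8)) == t8

# ... but a valid UTF-8 string that genuinely contains U+E0FF does not:
# g_private mistakes the real PUA character for an escaped byte.
s8 = "\ue0ff".encode("utf-8")        # b"\xee\x83\xbf", perfectly valid
assert g_private(f_private(s8)) == b"\xff"
assert g_private(f_private(s8)) != s8
```

So rule 3 is bought at the price of rules 1 and 2: any text that already uses those PUA code points is silently corrupted.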

Alternatively, you need to relax your requirement that f(t8) is a valid UTF-16 string, and instead allow it to be a UTF-16-like string which includes something invalid, such as a noncharacter or an unpaired surrogate. This will not be technically valid for interchange, of course. But my suggestion of using a noncharacter as an escape is one way in which this could be done.
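As an illustration of the relaxed version, Python's built-in "surrogateescape" error handler happens to take the unpaired-surrogate route: each undecodable byte becomes a lone low surrogate in the range U+DC80..U+DCFF. This is only a sketch of that one variant, not of the noncharacter escape; the variable names are mine, and "surrogatepass" is needed precisely because the result is not valid UTF-16.

```python
t8 = b"abc\xff"  # invalid UTF-8

# Invalid byte 0xFF is smuggled through as the lone surrogate U+DCFF.
text = t8.decode("utf-8", "surrogateescape")
assert text == "abc\udcff"

# A strict UTF-16 encoder refuses it: the string is not valid for interchange.
try:
    text.encode("utf-16-le")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
assert not strict_ok

# With "surrogatepass" we get a UTF-16-like string containing an
# unpaired surrogate code unit (0xDCFF, stored little-endian).
s16_like = text.encode("utf-16-le", "surrogatepass")
assert s16_like == b"a\x00b\x00c\x00\xff\xdc"

# Converting back restores the original invalid bytes exactly.
back = s16_like.decode("utf-16-le", "surrogatepass").encode(
    "utf-8", "surrogateescape"
)
assert back == t8
```

Because the escape lives outside the set of valid UTF-16 strings, it cannot collide with anything valid, and rules 1 and 2 remain intact for the standard conversions.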

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



