Title: RE: Roundtripping in Unicode

Marcin 'Qrczak' Kowalczyk wrote:
> If one application switches from standard UTF-8 to your modification,
> and another application continues to use standard UTF-8, then the
> ability to pass arbitrary Unicode strings between them by serializing
> them to UTF-8 is lost. So you can't claim that does not affect
> programs which don't adopt it. It would have to be adopted by all
> programs which currently use UTF-8, or data exchange would break.

I don't think so. If I produce UTF-8 data from filenames and give it to a UTF-8 application, nothing can be lost in the portion of this architecture that deals with Unicode data. Now, if you expect that you can give me Unicode data and that I should store it in a filesystem (as a filename), then you're in error. It is definitely true that you can create a sequence of valid Unicode characters from my range that I will not be able to give back. But I will also have to reject any '/' characters you feed me. You are misusing my application.

If some application chooses to use my conversion and loses or misinterprets your data, then it is broken: it shouldn't use that conversion, or it should not declare that particular interface as a Unicode interface.

> But it's not a viable replacement of UTF-8. Even if both applications
> use your modification, the ability to serialize arbitrary sequences
> of valid code points (i.e. not surrogates) through UTF-8 is lost: the
> mapping to modified UTF-8 is not injective.
Yes, that is true. But there are people who would be willing to accept that, since it only happens if those 128 codepoints are used. Those people can use the conversion; others needn't.
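
To make the trade-off concrete, here is a rough Python sketch of such a conversion. The range U+E000..U+E07F and the names ESC_BASE, to_unicode and to_bytes are placeholders chosen for illustration, not the actual 128 codepoints or any existing API. It shows the non-injective case: two different Unicode strings end up as the same bytes.

```python
ESC_BASE = 0xE000  # hypothetical range: byte 0x80+i is escaped as U+E000+i


def to_unicode(raw: bytes) -> str:
    """Decode UTF-8; each undecodable byte becomes one escape codepoint."""
    out = []
    # Python's standard "surrogateescape" handler marks a bad byte b as
    # U+DC00+b. It is used here only to locate the bad bytes, which are
    # then moved into the hypothetical escape range.
    for c in raw.decode("utf-8", errors="surrogateescape"):
        cp = ord(c)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(chr(ESC_BASE + cp - 0xDC80))
        else:
            out.append(c)
    return "".join(out)


def to_bytes(text: str) -> bytes:
    """Encode to UTF-8, turning escape codepoints back into raw bytes."""
    out = bytearray()
    for c in text:
        cp = ord(c)
        if ESC_BASE <= cp < ESC_BASE + 128:
            out.append(0x80 + cp - ESC_BASE)
        else:
            out += c.encode("utf-8")
    return bytes(out)


# UX => Win => UX: an invalid filename (Latin-1 e-acute) survives.
raw = b"caf\xe9"
assert to_bytes(to_unicode(raw)) == raw

# Win => UX: two different strings collide on the same bytes.
plain = "\u00e9"                                        # UTF-8: C3 A9
crafted = chr(ESC_BASE + 0x43) + chr(ESC_BASE + 0x29)   # escapes of C3, A9
assert plain != crafted
assert to_bytes(plain) == to_bytes(crafted) == b"\xc3\xa9"
```

The collision only shows up when the input already contains codepoints from the escape range, which is the case the quoted objection is about.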

OK, there is one problem that I *do* see with the use of my conversion. I map a file from UX to Win. You then use another application, not mine, which copies the file back from Win to UX (and that is easier, so you *can* use this application). Now the invalid sequence is already escaped. If I map this new file to Win again, I need to escape the escape, and the escapes can start piling up.

Of course, you can recognize the problem and simply rename the file: you can undo the over-escaping (no data is ever lost!) and probably rename the file to valid UTF-8, which is what you want anyway. And you can do it even from the Windows system. If you prevent my solution, you will not have my program in the first place, meaning you will need to go to the UNIX system to rename the file, or even just to access it.

Actually, there are two subflavors of my conversion possible (I can hear you say "oh, noooo"). One does escape the escapes, the other doesn't. This second flavor can be used by applications that need to make UTF-8 from an arbitrary input, but do not need to re-create the original byte sequence. Basically, they preserve all the data except for the information about how many times the original invalid sequences were escaped. There may be a need for such applications, and they would in fact reduce the re-escaping problem.
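
Here is how the two flavors might be sketched in Python. The range U+E000..U+E07F and the names (ESC_BASE, to_unicode, to_bytes, the reescape flag) are made up for illustration, and treating an already-escaped codepoint as three bytes that are each escaped individually is just one possible way to "escape the escape".

```python
ESC_BASE = 0xE000  # hypothetical range: byte 0x80+i is escaped as U+E000+i


def to_unicode(raw: bytes, reescape: bool = True) -> str:
    """Decode UTF-8, escaping undecodable bytes.

    reescape=True  (flavor 1): escape codepoints already present in the
                   input are escaped again, byte by byte.
    reescape=False (flavor 2): they are passed through unchanged.
    """
    out = []
    # "surrogateescape" is used only to locate undecodable bytes.
    for c in raw.decode("utf-8", errors="surrogateescape"):
        cp = ord(c)
        if 0xDC80 <= cp <= 0xDCFF:                      # undecodable byte
            out.append(chr(ESC_BASE + cp - 0xDC80))
        elif reescape and ESC_BASE <= cp < ESC_BASE + 128:
            # All three UTF-8 bytes of an escape codepoint are >= 0x80.
            out.extend(chr(ESC_BASE + b - 0x80) for b in c.encode("utf-8"))
        else:
            out.append(c)
    return "".join(out)


def to_bytes(text: str) -> bytes:
    """Encode to UTF-8, turning escape codepoints back into raw bytes."""
    out = bytearray()
    for c in text:
        cp = ord(c)
        if ESC_BASE <= cp < ESC_BASE + 128:
            out.append(0x80 + cp - ESC_BASE)
        else:
            out += c.encode("utf-8")
    return bytes(out)


raw = b"caf\xe9"                    # invalid UTF-8 name on UX
name = to_unicode(raw)              # what Win sees
copied = name.encode("utf-8")       # another app copies Win => UX as plain UTF-8

strict = to_unicode(copied, reescape=True)
assert to_bytes(strict) == copied   # exact roundtrip, but the name grew
assert len(strict) == len(name) + 2 # one escape became three

relaxed = to_unicode(copied, reescape=False)
assert relaxed == name              # no pile-up ...
assert to_bytes(relaxed) == raw     # ... but the bytes are raw, not copied
```

In this sketch the first flavor reproduces every byte sequence exactly at the cost of growing names, while the second forgets only how many times a sequence had been escaped.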


> Which means that UTF-8 can't be replaced with your modification.
> If they coexisted, expect trouble when the two slightly incompatible
> encodings meet.
Or, expect trouble when dealing with data that is not guaranteed to be UTF-8. Or hope that there will be no such data in the near future, and I mean none.


> > Using my conversion, Windows can access any file on UNIX, because my
> > conversion guarantees roundtrip UX=>Win=>UX
>
> Well, with or without your conversion it's not true, because there
> are various characters which are valid in Unix filenames but not in
> Windows (e.g. ? * : \ and control characters). So if all filenames are
> to be accessible, they have to introduce some escaping. And as soon
> as an escaping scheme is used, it can be extended to encode isolated
> bytes with high bit set.
Good point. But you are assuming I copy the files to a Windows filesystem. I don't. I have no problems if you specify your filename with any of the above characters, even from Windows.

And, BTW, suppose UTF-8 validation is introduced (as an option) on UNIX filesystems. The characters you mention (and some others; I can tell you exactly which ones don't work on Windows) could again be (optionally) rejected on UNIX filesystems.

> > Win=>UX=>Win roundtrip is not guaranteed.
>
> Currently it breaks only for isolated surrogates (assuming the Unix
> is configured to use UTF-8). If Windows filenames are specified to be
> UTF-16, the error is clearly on the Windows side and this side should
> be fixed.
And in my case, it would break for some malicious sequences of the 128 codepoints. Equally rare, and with equally minor consequences. Ummmm, and it can be fixed, too. Such malicious sequences could be forbidden in contexts where we fear they might cause problems.
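
A check for such sequences could be as simple as the Python sketch below: a string is acceptable if converting it to bytes and back returns the same string. The escape range U+E000..U+E07F and all function names are placeholders of mine, not part of any standard, and the conversion shown is the flavor that does not re-escape.

```python
ESC_BASE = 0xE000  # hypothetical range: byte 0x80+i is escaped as U+E000+i


def to_unicode(raw: bytes) -> str:
    """Decode UTF-8; each undecodable byte becomes one escape codepoint."""
    # "surrogateescape" is used only to locate undecodable bytes.
    decoded = raw.decode("utf-8", errors="surrogateescape")
    return "".join(
        chr(ESC_BASE + ord(c) - 0xDC80) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in decoded
    )


def to_bytes(text: str) -> bytes:
    """Encode to UTF-8, turning escape codepoints back into raw bytes."""
    out = bytearray()
    for c in text:
        cp = ord(c)
        if ESC_BASE <= cp < ESC_BASE + 128:
            out.append(0x80 + cp - ESC_BASE)
        else:
            out += c.encode("utf-8")
    return bytes(out)


def is_roundtrip_safe(text: str) -> bool:
    """True if Win => UX => Win gives the same string back."""
    return to_unicode(to_bytes(text)) == text


# A lone escape codepoint is harmless: it stands for one invalid byte.
assert is_roundtrip_safe("caf" + chr(ESC_BASE + 0x69))

# Escapes that spell out a valid UTF-8 sequence (C3 A9) are not.
assert not is_roundtrip_safe(chr(ESC_BASE + 0x43) + chr(ESC_BASE + 0x29))
```

An interface that fears such input could reject any string for which this check fails.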


Lars
