Title: RE: Roundtripping in Unicode

Marcin 'Qrczak' Kowalczyk wrote:
> You are trying to stick with processing byte sequences, carefully
> preserving the storage format instead of preserving the meaning in
> terms of Unicode characters. This leads to less robust software
> which is not certain about the encoding of texts it processes and
> thus can't apply algorithms like case mapping without risking doing
> a meaningless damage to the text.
I am not proposing that this approach is better or that it should be used generally. What I am saying is that this approach is, unfortunately, needed to make the transition easier. The fact is that data exists today that cannot be converted easily. Over-robust software, in my opinion, can be impractical and might not be welcomed with open arms. We should acknowledge that some products will choose a different path. You can say those applications will be less robust, but we should really give the users a choice and let them decide what they want.

> Conversion should signal an error by default. Replacing errors by
> U+FFFD should be done only when the data is processed purely for
> showing it to the user, without any further processing, i.e. when it's
> better to show the text partially even if we know that it's corrupted.
I think showing it to the user is not the only case where you need U+FFFD. A text viewer can do the replacement when reading the file and do all further processing in Unicode. But an editor cannot: keeping the text in its original binary form is far from practical and opens numerous possibilities for bugs. As I said before, you can do it with UTF-8: you simply keep the invalid sequences as they are and handle them differently only when you actually process or display them. But you cannot do this in UTF-16, since you cannot preserve all the data.
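To illustrate (a minimal Python sketch; the file name bytes are made up): U+FFFD replacement is fine for a viewer, because it only has to display the text, but it is lossy, so an editor that used it could never write the original bytes back:

    # Hypothetical file name bytes: 0xE9 is ISO-8859-1, invalid as UTF-8.
    data = b"caf\xe9.txt"

    # A viewer can replace and display; the result is readable.
    shown = data.decode("utf-8", errors="replace")
    print(shown)  # 'caf\ufffd.txt'

    # But an editor cannot recover the original bytes when saving.
    print(shown.encode("utf-8") == data)  # False: byte 0xE9 is lost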

As for signalling - in some cases signalling is impossible. Listing files in a directory should not signal anything. It MUST return all files, and it should return them in a form that can be used to access each of the files.
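As a concrete illustration of one way an API can meet this requirement (Python on a POSIX system; the current directory is just an example):

    import os

    # Listing with a bytes path returns raw bytes names: nothing is
    # rejected or signalled, even names that are not valid UTF-8.
    names = os.listdir(b".")

    # os.fsdecode/os.fsencode roundtrip each name losslessly, so the
    # decoded form can always be turned back into an accessible path.
    for raw in names:
        assert os.fsencode(os.fsdecode(raw)) == raw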

>
> > Either you do everything in UTF-8, or everything in UTF-16. Not
> > always, but typically. If comparisons are not always done in the
> > same UTF, then you need to validate. And not validate while
> > converting, but validate on its own. And now many designers will
> > remember that they didn't. So, all UTF-8 programs (of that kind)
> > will need to be fixed. Well, might as well adopt my broken
> > conversion and fix all UTF-16 programs. Again, of that kind, not all
> > in general, so there are few. And even those would not be all
> > affected. It would depend on which conversion is used where. Things
> > could be worked out. Even if we would start changing all the
> > conversions. Even more so if a new conversion is added and only used
> > when specifically requested.
>
> I don't understand anything of this.
Let's start with UTF-8 usernames. This is a likely scenario, since I think UTF-8 will typically be used in network communication. If you store the usernames in UTF-16, the conversion will signal an error, so you will not have any users with invalid UTF-8 sequences, nor will any invalid sequence be able to match a user. If you later start comparing usernames somewhere else, in UTF-8, then you must not only strcmp them but also validate each string. This is simply a fact, and I am not complaining about it.
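A small sketch of why the UTF-8 side must validate (Python; the usernames are hypothetical): two distinct invalid byte sequences look different to strcmp, yet the conversion would have rejected both - or collapsed both to the same U+FFFD form - so the two layers disagree unless the UTF-8 comparisons validate first:

    # Hypothetical usernames; neither is valid UTF-8.
    a = b"user\xff"
    b = b"user\xfe"

    # Raw byte comparison sees two different strings...
    print(a == b)  # False

    # ...but a lossy conversion collapses both to the same text.
    print(a.decode("utf-8", "replace") == b.decode("utf-8", "replace"))  # True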

In the opposite case - UTF-8 storage and UTF-16 communication, with all comparisons done in UTF-16 - you again need to validate the UTF-16 strings.
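Validating UTF-16 here means checking for unpaired surrogates. A minimal sketch (Python, where encoding doubles as the check):

    def is_valid_utf16(s: str) -> bool:
        # An unpaired surrogate cannot be encoded, so a successful
        # encode is equivalent to a validity check.
        try:
            s.encode("utf-16")
            return True
        except UnicodeEncodeError:
            return False

    print(is_valid_utf16("abc"))        # True
    print(is_valid_utf16("ab\ud800c"))  # False: lone high surrogate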

Now I am supposing that there are such applications already out there. And that some of them do not validate (or validate only in conversion, but not when comparing or otherwise processing native strings).

They should be analyzed and fixed. At the time I wrote the above paragraph, I thought UTF-16 programs don't need to validate, but that is not true, so all such applications need to be fixed, if they are not already validating.

Now, suppose my 'broken' conversion is standardized - as an option, not as a replacement for the standard UTF-8 to UTF-16 conversion. If you don't start using it, the existing rules apply.

The interesting thing is that if you do start using my conversion, you can actually get rid of the need to validate UTF-8 strings in the first scenario. That of course means you will allow usernames containing invalid UTF-8 sequences, but if one determines that this is acceptable (or even desired), then it makes things easier. The choice is yours.
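Python's 'surrogateescape' error handler (PEP 383) implements a comparable optional scheme and serves as a sketch of the roundtrip:

    # surrogateescape maps each invalid byte to a lone surrogate in
    # U+DC80..U+DCFF and restores it exactly on re-encoding.
    data = b"user\xff"  # not valid UTF-8

    text = data.decode("utf-8", errors="surrogateescape")
    print(text)  # 'user\udcff'

    # Lossless roundtrip: no validation needed just to preserve data.
    assert text.encode("utf-8", errors="surrogateescape") == data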

For the second scenario, things do indeed become a bit more complicated, but they can be solved. There are still a number of choices you can make about the level of validation - and, again, one of them is to keep using the existing conversion and the existing validation.

>
> > I cannot afford not to access the files.
>
> Then you have two choices:
> - Don't use Unicode.
As soon as a Windows system enters the picture, it is practically impossible not to use Unicode. The same is true once a UNIX user adopts a UTF-8 locale.

> - Pretend that filenames are encoded in ISO-8859-1, and represent them
>   as a sequence of code points U+0001..U+00FF. They will not be
>   displayed correctly but the information will be preserved.
Been there, done that. It works in one direction, but not in the other, and it becomes increasingly less useful as more and more data is in UTF-8.
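The direction that works is byte preservation; what fails is display and interoperability once the data really is UTF-8. A sketch (Python, hypothetical file name):

    # A genuinely UTF-8 file name, viewed through the ISO-8859-1 trick.
    utf8_name = "naïve.txt".encode("utf-8")

    as_latin1 = utf8_name.decode("latin-1")
    print(as_latin1)  # 'naÃ¯ve.txt' -- bytes preserved, display garbled

    # The roundtrip direction always works: every byte maps to a
    # code point and back.
    assert as_latin1.encode("latin-1") == utf8_name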


Lars
