Title: RE: Roundtripping in Unicode

Marcin 'Qrczak' Kowalczyk wrote:
> Lars Kristan <[EMAIL PROTECTED]> writes:
>
> > All assigned codepoints do roundtrip even in my concept.
> > But unassigned codepoints are not valid data.
>
> Please make up your mind: either they are valid and programs are
> required to accept them, or they are invalid and programs are required
> to reject them.
I don't know what they should be called. The facts are that there shouldn't be any, and that current software should treat them as valid. So they are not valid, but they cannot (and must not) be validated. As stupid as that sounds, I am sure one of the standardizers will find a Unicodally correct way of putting it.

>
> > Furthermore, I was proposing this concept to be used, but not
> > unconditionally. So, you can, possibly even should, keep using
> > whatever you are using.
>
> So you prefer to make programs misbehave in unpredictable ways
> (when they pass the data from a component which uses relaxed rules
> to a component which uses strict rules) rather than have a clear and
> unambiguous notion of a valid UTF-8?
I am not particularly thrilled about it. In fact, it should be discussed, constructively; simply assuming everything will break is not helpful. But if you want an answer: yes, I would go for it.

Actually, there are fewer concerns involved than people think. Security is definitely an issue, but again, one shouldn't assume it breaks just like that. Let me risk a bold statement: security is typically implicitly centralized. And if comparison is always done in the same UTF, it won't break. The simple fact that two different UTF-16 strings compare equal in UTF-8 (after the relaxed conversion) does not introduce a security issue. Today, two invalid UTF-8 strings compare equal in UTF-16 after a valid conversion (using a single replacement character, U+FFFD), yet they compare different in their original form if you use strcmp. But you probably don't: either you do everything in UTF-8, or everything in UTF-16. Not always, but typically.

If comparisons are not always done in the same UTF, then you need to validate. And not validate while converting, but validate on its own. And now many designers will remember that they didn't. So all UTF-8 programs (of that kind) would need to be fixed. Well, we might as well adopt my "broken" conversion and fix all UTF-16 programs instead. Again, only programs of that kind, not all in general, so there are few, and even those would not all be affected; it would depend on which conversion is used where. Things could be worked out, even if we started changing all the conversions. Even more so if a new conversion is added and only used when specifically requested.
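
To make the comparison point concrete, here is a minimal sketch in Python (the byte values are only illustrative): two byte strings that are both invalid UTF-8 differ when compared raw, the way strcmp sees them, but become identical after a lossy conversion that replaces each bad byte with U+FFFD.

    # Two different byte strings, neither well-formed UTF-8
    # (0xFE and 0xFF can never appear in valid UTF-8).
    raw1 = b"file\xfe"
    raw2 = b"file\xff"

    # Compared as raw bytes (what strcmp does): they differ.
    assert raw1 != raw2

    # Lossy conversion to Unicode: each bad byte becomes U+FFFD.
    u1 = raw1.decode("utf-8", errors="replace")
    u2 = raw2.decode("utf-8", errors="replace")

    # Both are now 'file\ufffd', so they compare equal after conversion.
    assert u1 == u2

Whether that matters depends on where the comparison is done, which is exactly the point about security being centralized.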

There is cost and there are risks. Nothing should be done hastily. But let's go back and ask ourselves what the benefits are, and evaluate the whole.

>
> > Perhaps I can convert mine, but I cannot convert all filenames on
> > a user's system.
>
> Then you can't access his files.
Yes, this is where it all started. I cannot afford not to access the files. I am not writing a notepad.

>
> With your proposal you couldn't as well, because you don't make them
> valid unconditionally. Some programs would access them and some would
> break, and it's not clear what should be fixed: programs or filenames.
It is important to have a way to write programs that can access them. And there is definitely nothing to be fixed about the filenames: they are there and nobody will bother to change them. It is the programs that need to be fixed. And if Unicode needs to be fixed to allow that, then that is what is supposed to happen. Eventually.
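
For illustration only, here is a minimal Python sketch of an escape scheme in a similar spirit: the "surrogateescape" error handler maps each invalid byte to a reserved codepoint and back, so such filenames survive the round trip. This is Python's own later mechanism, not the proposal under discussion, and the filename bytes below are made up.

    # A filename whose bytes are not valid UTF-8, as POSIX permits.
    raw = b"report\xff.txt"

    # surrogateescape maps each undecodable byte 0xNN to the lone
    # surrogate U+DCNN, so nothing is lost in the conversion.
    name = raw.decode("utf-8", errors="surrogateescape")
    assert name == "report\udcff.txt"

    # Encoding back with the same handler restores the exact bytes,
    # so the name round-trips and the file can still be opened.
    assert name.encode("utf-8", errors="surrogateescape") == raw

Python applies this handler in os.listdir() and open() on POSIX systems, which is how a program written that way can reach every file on a user's system even when some names are not valid UTF-8.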

Lars
