Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

Doug Ewell wrote:
> How do file names work when the user changes from one SBCS to another
> (let's ignore UTF-8 for now) where the interpretation is different?  For
> example, byte C3 is U+00C3, A with tilde (Ã) in ISO 8859-1, but U+0102,
> A with breve (Ă) in ISO 8859-2.  If a file name contains byte C3, is its
> name different depending on the current locale?
It displays differently, but compares the same. Whether or not it is the same name is a philosophical question.

>  Is it accessible in all locales?
Typically, yes for all SBCS, but it is not really guaranteed for all MBCS. It depends on whether you validate the string or not. The way UNIX is developing, those files are typically still accessible, since the programs are still working with 8-bit strings. And that is what I am saying: a UTF-8 program (a hypothetical 'UNIX Commander 8') would have no problems accessing the files. A UTF-16 program (a hypothetical 'UNIX Commander 16'), on the other hand, would have problems.

>  (Not every SBCS defines a character at every code point.
> There's no C3 in ISO 8859-3, for example.)
It works just like unassigned code points in Unicode work: how they are displayed is not defined, but they can be passed around and compared for equality. Collation is again not defined, but simple sorting still gives useful results.
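To make that concrete, here is a tiny sketch (the file names and byte values are my own hypothetical examples): a name held as raw bytes compares for equality and sorts byte-wise, regardless of which SBCS the current locale would use to display it.

    # Hypothetical file names held as raw bytes, the way a UNIX file system stores them.
    name_a = b"file\xc3name"   # byte C3: A with tilde in Latin-1, A with breve in Latin-2
    name_b = b"file\xc3name"

    # Equality and ordering are pure byte operations, independent of the locale.
    print(name_a == name_b)              # True
    print(sorted([b"zebra", name_a]))    # simple byte-wise sort still gives a usable order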

>
> Does this work with MBCS other than UTF-8?  I know you said other MBCS,
> like Shift-JIS, are not often used alongside other encodings except
> ASCII, but we can't guarantee that since we're not in a perfect world.
> :-)  What if they were?
I don't know whether and how much they were. But I am assuming UTF-8 would be used alongside other encodings on a much larger scale. At least that's what we are hoping for, isn't it? Of course it would be even better if we were all using only UTF-8 (or any other Unicode format), but the transition has to come first.

> I fear Ken is not correct when he says you are not arguing for the
> legalization of invalid UTF-8 sequences.
I am arguing for a mechanism that allows processing invalid UTF-8 sequences, for those who need to do so. You can still think of them as invalid. Exactly what they will be called and to what extent they will be discouraged still needs to be investigated and defined.

> This isn't about UTF-8 versus other encoding forms.  UTF-8-based
> programs will reject these invalid sequences because they don't map to
> code points, and because they are supposed to reject them.
The problem is that, until now, a text editor typically preserved all data if a file was opened and saved immediately. Even binary data. And the data could be interpreted as Latin 1, Latin 2, ...  But you cannot interpret the data as UTF-8 and preserve all the data at the same time. Well, actually it is possible, which is exactly what I am saying is the advantage of UTF-8. But if you insist on validation, you break it. Fine, you get your Unicode world, and UTF-16 is then just as good as UTF-8. But you are now losing data where previously it wasn't lost. Well, you had better remember to put a disclaimer in your license agreement...
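To show what I mean by "actually it is possible", here is a minimal sketch of a byte-preserving roundtrip. The escape range (PUA_BASE) and the function names are purely my own illustration, not part of any standard; also note that a genuine PUA character in the input would collide with the escape range, and that ambiguity is one of the things that would still have to be defined.

    # Minimal sketch: decode a byte stream as UTF-8, but escape every byte that is
    # not part of a valid UTF-8 sequence into a private-use code point, so that
    # re-encoding reproduces the original bytes exactly.
    PUA_BASE = 0xEF00   # bytes 0x80..0xFF map to U+EF80..U+EFFF (arbitrary choice)

    def decode_preserving(data: bytes) -> str:
        out = []
        i = 0
        while i < len(data):
            for length in (4, 3, 2, 1):              # longest valid sequence first
                chunk = data[i:i + length]
                try:
                    out.append(chunk.decode("utf-8"))
                    i += len(chunk)
                    break
                except UnicodeDecodeError:
                    continue
            else:
                out.append(chr(PUA_BASE + data[i]))  # invalid byte: escape into the PUA
                i += 1
        return "".join(out)

    def encode_preserving(text: str) -> bytes:
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if PUA_BASE + 0x80 <= cp <= PUA_BASE + 0xFF:
                out.append(cp - PUA_BASE)            # restore the original invalid byte
            else:
                out.extend(ch.encode("utf-8"))
        return bytes(out)

    # "Open and save immediately": the bytes survive unchanged, even though
    # C3 followed by a space and the byte FC are not valid UTF-8.
    raw = b"caf\xc3 and M\xfcnchen"
    assert encode_preserving(decode_preserving(raw)) == raw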

> > Besides, surrogates are not completely interchangeable.  Frankly, they
> > are, but do not need to be, right?
>
> They are not completely.  In UTF-8 and UTF-32, they are not allowed at
> all.  In UTF-16, they may only occur in the proper context: a high
> surrogate may only occur before a low surrogate, and a low surrogate may
> only appear after a high surrogate.  No other usage of surrogates is
> permitted, because if unpaired surrogates could be interpreted, the
> interpretation would be ambiguous.
Well, yes, that's the theory. But as usual, I look at how things behave even where they are not yet defined. Judging from the algorithms, unpaired surrogates convert pretty well. Unless they start to pair up, of course. But there are cases where one knows they cannot (no concatenation is done).

Let me bring up one issue again. I want to standardize a mechanism that allows a roundtrip for 8-bit data. And I already stated that by doing that, you lose the roundtrip for 16-bit data. Now I ask myself again, is that true? Yes and no. For the case I mentioned above (no concatenation), the roundtrip is currently really possible. But generally speaking, it is not always possible. And last but not least, you don't even care about it, right? Good, because that means my proposal doesn't make anything worse.

> I admit my error with regard to the handling of file names by Unix-style
> file systems, and I appreciate being set straight.

Sorry for rubbing it in, but ...... could it be that a lot of the conclusions you have drawn about what Unicode should or should not be are also wrong, if they were based on such incorrect assumptions?

>
> I think preserving transmission errors is carrying things too far.  Your
> Unix file system already doesn't guarantee that; if a byte gets changed
> to 00 or 2F, you will have problems.
>
Like this one. Transmission, disk, and memory errors (unless the data is compressed) are typically single-bit errors. And one case where things go really wrong doesn't invalidate the importance of the many cases where things remain within certain limits.

> On the other hand, if the user is typing UTF-8 bytes directly into a
> non-UTF-8-aware editor, then of course anything is possible.  But that
> seems like a bad way to live.
On UNIX, files are also concatenated and assembled in many other ways. By scripts, by the system... Again, eventually it will all be UTF-8. But whether there will be problems in the transition period ..... hmmmm, who knows.

> Now we're getting somewhere.  We are no longer talking about a
> mysterious, unknown encoding in arbitrary text, but about file names
> known to be in Latin-1 instead of UTF-8.  If the security risk is
> determined to be low, you *may* be able to get away with interpreting
> invalid UTF-8 as Latin-1.  But in that case, the bytes need to be
> converted to real Unicode characters in the range U+0080..U+00FF, not to
> PUA characters, and they must not be written back as invalid UTF-8.
No, that is not what I am talking about.
* First, there were never any mysterious encodings. I was always referring to existing, well-defined encodings (except when I was talking about transmission errors).

* The 'unknown encoding' stood for the fact that there is no information about WHICH encoding was used. And in the above example, this encoding is Latin 1. You know it, I know it. We can see it. But the computer doesn't, because there is no information about it, because it is plain text.

* The assumption is that most other data is already UTF-8 (or the user chose to set the locale to UTF-8 in order to start using it). Hence, the program will attempt to interpret the data as UTF-8, not Latin 1.

> Maybe not.  I think your scheme involves converting invalid UTF-8 to PUA
> code points in UTF-16, and back to invalid UTF-8.  I'm saying the PUA
> part is sensible, and the invalid-UTF-8 part is not.  (I know... only if
> I'm afraid to break some eggs...)
>
Well, if the invalid-UTF-8 part is not sensible, then why have the PUA part at all, right? But remember file systems: are you saying that having UTF-8 filenames on the same disk as legacy-encoded filenames is not sensible? How do you suppose users will get from one state to the other? Or should they switch their locale each time they want to work with the other group of filenames? And what if they want to work with both at the same time? Backup?

Well, you might as well force them to never mix the two. Let them buy a new machine for UTF-8 and keep the eggs intact.
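For illustration, here is a small, purely hypothetical example (the directory contents are assumed): a program that keeps file names as bytes can list and open everything, while one that insists on strict UTF-8 validation has to do something else with the legacy names before it can even display them.

    import os

    # Hypothetical directory mixing UTF-8 and legacy-encoded (e.g. Latin 1) names.
    for raw_name in os.listdir(b"."):         # byte interface: every name is returned
        try:
            shown = raw_name.decode("utf-8")  # strict validation works for UTF-8 names
        except UnicodeDecodeError:
            shown = repr(raw_name)            # legacy name: still accessible via the bytes
        print(shown)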


> > Was that sarcastic or.....
>
> Yes, and I apologize for that.
It's not about my feelings; it almost led to a misunderstanding.

> But I disagree that following the
> standard -- even if you think it is flawed -- constitutes a "serious
> security issue" in my design.
A security issue is a security issue. If it is a result of following a standard, this doesn't make it less of an issue. Rather more, I would say.

> As a programmer, I will say:
>
> - Validating conversion is *part* of supporting Unicode, not a frill.
> - Validating conversion is one of the easiest parts of supporting
> Unicode, not a major source of struggle.
> - The standard is very clear about validation; there is no controversy
> over where to start and where to end.
> - Strict validation is required by the standard, and not that difficult.
> - Validation of conversion can be very efficient.
> - Validation of conversion from a well-defined charset is
> straightforward, and can easily be guaranteed.

As a software architect, I will say:

Validating the conversion, yes. But for UTF-8 programs to work like UTF-16 programs, you need to validate data even when no conversion is done.

By 'where to start and where to end' I meant: how do you tell programmers WHERE to validate? Input? Yes. Output? Maybe. Oh, but what is input and what is output? Of the program? Of a function?

>
> > Can you, please, provide a description specific problems with my
> > design? I mean other than that it violates certain rules, clauses or
> > whatever.
>
> Well, there's that.  That's not trivial, is it?
I know it isn't. That's why I am addressing this mailing list.

> Why don't you write a proposal for this to the UTC?  They may be able to
> provide you with a more satisfactory answer than I can.  Be sure to be
> thorough in describing what you want.
Maybe I will. But as long as the general opinion is that this is complete nonsense and has been dealt with before, I don't stand a chance, do I?


Lars
