Title: RE: Roundtripping in Unicode

Philippe Verdy wrote:
> An implementation that uses UTF-8 for valid strings could use the invalid
> ranges for lead bytes to encapsulate invalid byte values. Note however that
> the invalid bytes you would need to represent have 256 possible values, but
> the UTF-8 lead bytes have only 2 reserved values (0xC0 and 0xC1), each for
> 64 codes, if you want to use an encoding on two bytes. The alternative
> would be to use the UTF-8 lead byte values which were initially assigned to
> byte sequences longer than 4 bytes, and that are now unassigned/invalid in
> standard UTF-8. For example: {0xF8+(n/64); 0x80+(n%64)}. Here also it will
> be a private encoding, that should NOT be named UTF-8, and the application
> should clearly document that it will not only accept any valid Unicode
> string, but also some invalid data which will have some roundtrip
> compatibility.
Now you are devising an algorithm to store invalid sequences with other invalid sequences. In UTF-8. Why not simply stick with the original invalid sequences?

And the whole purpose of what I am trying to do is to get VALID sequences, in order to be able to store and manipulate Unicode strings.
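
For reference, a rough sketch of the two-byte encapsulation Philippe describes (the lead bytes 0xF8..0xFB follow his example, the function names are mine, and as he says the result must not be called UTF-8):

    # Sketch of the private two-byte encapsulation described in the quote above.
    # 0xF8..0xFB were originally lead bytes for sequences longer than 4 bytes
    # and are unassigned/invalid in standard UTF-8, so this is a private encoding.

    def encapsulate_byte(n: int) -> bytes:
        """Wrap one arbitrary byte value n (0..255) as {0xF8 + n/64, 0x80 + n%64}."""
        return bytes([0xF8 + n // 64, 0x80 + n % 64])

    def unencapsulate(seq: bytes) -> int:
        """Recover the original byte value from such a two-byte sequence."""
        return (seq[0] - 0xF8) * 64 + (seq[1] - 0x80)

    assert unencapsulate(encapsulate_byte(0xC0)) == 0xC0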

> So what is the problem: suppose that the application, internally, starts to
> generate strings containing any occurrences of such private sequences; then
> it will be possible for the application to generate on its output a byte
> stream that would NOT have roundtrip compatibility back to the private
> representation. So roundtripping would only be guaranteed for streams
> converted FROM a UTF-8 where some invalid sequences are present and must be
> preserved by the internal representation. So the transformation is not
> bijective as you would think, and this potentially creates lots of possible
> security issues.
Yes, it does. An application that uses my approach needs to be designed accordingly. *IF* the security issues apply. For a UTF-16 text editor this probably doesn't apply (in terms of data, not filenames). And this is just an example: with a text editor you can perhaps force the user to select a different encoding, but there are cases where that cannot be done, yet the data still needs to be preserved.

So far, many people have suggested that there is no need to preserve 'invalid data'. After some argumentation and a couple of examples, the need is acknowledged. But then they question the way it is done. They see the codepoint approach as unsuitable or unneeded, and suggest using some form of escaping. Now, any escaping has exactly the same problems you are mentioning, and some on top. And it actually represents invalid data with valid codepoints (except more than one per invalid byte), which you say is a definite no-no.
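
A toy illustration of that ambiguity (the '%XX' scheme below is purely hypothetical, and for brevity it treats every high byte as invalid):

    # A hypothetical escaping scheme: write an "invalid" byte n as the text "%XX".
    def escape_invalid(data: bytes) -> str:
        out = []
        for b in data:
            if b < 0x80:
                out.append(chr(b))          # plain ASCII passes through
            else:
                out.append('%%%02X' % b)    # high byte escaped as %XX
        return ''.join(out)

    # Two different inputs collapse to the same escaped string, so the mapping
    # is not bijective -- the very roundtrip problem raised above -- and the
    # escape itself is built entirely from valid codepoints.
    assert escape_invalid(b'%C0') == escape_invalid(b'\xc0')   # both give '%C0'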

And on top of all, the approach I am proposing is NOT intended to be used everywhere. It should only be used when interfacing to a system that cannot guarantee valid UTF-8, but does use UTF-8. For example, a UNIX filesystem. And, actually, if the security is entirely done by the filesystem, then it doesn't even matter if two UTF-16 strings map to the same filename. They will open the same file. Or both be denied. Which is exactly what is required. A Windows filesystem is case preserving but case insensitive. Did it ever bother you that you can use either an upper case or a lower case filename to open a file? Does it introduce security issues? Typically no, because you leave the security to the filesystem. And those checks are always done in the same UTF.

This is a simple example of something that doesn't even need to be fixed. There are cases where validation would really need to be fixed. But then again, only if you use the new conversion. If you don't, your security remains exactly where it is today.

We should be analyzing the security aspects. Learning where it can break, and in which cases. Get to know the enemy. And once we understand that things are manageable and not as frightening as they seem at first, then we can stop using this as an argument against introducing 128 codepoints. People who will find them useful should and will bother with the consequences. Others don't need to and can roundtrip them as they do today.

So, interpreting the 128 codepoints as 'recreate the original byte sequence' is an option. If you convert from UTF-16 to UTF-8, then you do exactly as you do now. Even I will do the same where I just want to represent Unicode in UTF-8. I will only use this conversion in certain places. The fact that my conversion produces valid UTF-8 for most Unicode code points does not mean the result is UTF-8. The result is just a byte sequence; the same one that I started with when I was replacing invalid sequences with the 128 codepoints. And this is not limited to conversion from 'byte sequence that is mostly UTF-8' to UTF-16. I can (and even should) convert from this byte sequence to UTF-8, preserving most of it and replacing each byte of an invalid sequence with the several bytes that encode the corresponding codepoint in UTF-8.
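
Here is a minimal sketch of that conversion, assuming, purely as a placeholder, that the 128 codepoints sit at U+EE80..U+EEFF (the actual proposal would allocate dedicated codepoints; the helper names are mine):

    # Sketch only: decode a byte sequence that is "mostly UTF-8" into a string,
    # mapping each byte of an invalid sequence to one of 128 codepoints, and back.
    # ASSUMPTION: the 128 codepoints are placed at U+EE80..U+EEFF purely as a
    # placeholder; the proposal is to allocate dedicated codepoints for this.
    ESCAPE_BASE = 0xEE80

    def bytes_to_str(data: bytes) -> str:
        """Decode, replacing each invalid byte with one escape codepoint."""
        out = []
        i = 0
        while i < len(data):
            for length in (4, 3, 2, 1):            # longest valid sequence first
                chunk = data[i:i + length]
                try:
                    out.append(chunk.decode('utf-8'))
                    i += len(chunk)
                    break
                except UnicodeDecodeError:
                    continue
            else:
                # Invalid byte: only 0x80..0xFF can ever be invalid in UTF-8,
                # so 128 codepoints suffice.
                out.append(chr(ESCAPE_BASE + (data[i] - 0x80)))
                i += 1
        return ''.join(out)

    def str_to_bytes(text: str) -> bytes:
        """Re-encode, turning each escape codepoint back into its original byte."""
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if ESCAPE_BASE <= cp <= ESCAPE_BASE + 0x7F:
                out.append(0x80 + (cp - ESCAPE_BASE))
            else:
                out.extend(ch.encode('utf-8'))
        return bytes(out)

    raw = b'valid \xc3\xa9, invalid \xc0\xaf'       # ends with an overlong sequence
    assert str_to_bytes(bytes_to_str(raw)) == raw    # roundtrips byte for byte

If the input already happens to contain a valid UTF-8 encoding of one of these escape codepoints, the mapping stops being bijective; that is the caveat discussed above, and it is why an application using this conversion needs to be designed accordingly.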

> So the best thing you can do to secure your application is to REJECT/IGNORE
> all files whose names do not match the strict UTF-8 encoding rules that
> your application expects (everything will happen as if those files were not
> present, but this may still create security problems if an application that
> does not see
Some situations favor security over preserving data; others (far more common) favor preserving data and have no security aspects at all.

> any file in a directory wants to delete that directory, assuming it is
> empty... In that case the application must be ready to accept the presence
> of directories without any content, and must not depend on the presence of
> a directory to determine that it has some contents; anyway, on secured
> filesystems, such things could happen due to access restrictions,
> completely unrelated to the encoding of filenames, and it is not
> unreasonable to prepare the application so that it will behave correctly
> when faced with inaccessible files or directories, so that the application
> will also correctly handle the fact that the same filesystem will contain
> non-plain-text and inaccessible filenames).
Inaccessible filenames are something we shouldn't accept. All your discussion of non-empty empty directories is just approaching the problem from the wrong end. One should fix the root cause, not the consequences. And you would be fixing just that, the consequences; the fact would remain that there are inaccessible files. Isn't that a problem on its own? Why not fix that and get rid of a plethora of problems?

> Notably, the concept of filenames is a legacy and badly designed concept,
> inherited from times when storage space was very limited, and the designers
> wanted to create a compact (but often cryptic) representation.
About as bad as a post-it label that you put on a box when you take the box to the attic. I don't understand what is bad about them. And even if it is bad, what is one supposed to do? We have them, and we need to process them.


Lars
