Joel Rees writes:
> 2014/12/03 22:23 "Dmitrij D. Czarkoff" <czark...@gmail.com>:
> >
> > First of all, I really don't believe that preservation of non-canonical
> > form should be a consideration for any software.
> 
> There is no particular canonical form for some kinds of software.
> 
> Unix, in particular, happens to have file name limitations that are
> compatible with all versions of Unicode past 2.0, at least, in UTF-8, but
> it has no native encoding.

To me, the current state of affairs--where filenames can contain
anything and the same filename can and does get interpreted differently
by different programs--feels extremely dangerous. Moving to a single,
well-defined encoding for filenames would make things simpler and
safer. Well, it might. That's why we're discussing this carefully, to
figure out if something like this is actually workable.

There are two kinds of features being discussed:

1) Unicode normalization. This is analogous to case insensitivity:
   multiple filenames map to the same (normalized) filename.

2) Disallowing particular characters. Bytes 1-31 (the ASCII control
   characters) and invalid UTF-8 sequences are popular examples.

Maybe one is workable. Maybe both are, or neither.
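
To make the second feature concrete, here's roughly what such a check
could look like. This is just my sketch, not anything that exists;
valid_name() and its exact rules are invented for illustration:

    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Hypothetical check for feature 2: reject bytes 1-31 and
     * invalid UTF-8. Overlongs, surrogates, and values above
     * U+10FFFF all count as invalid.
     */
    bool
    valid_name(const unsigned char *s, size_t len)
    {
        size_t i = 0;

        while (i < len) {
            unsigned char c = s[i];

            if (c < 0x20)           /* NUL and bytes 1-31 */
                return false;
            if (c < 0x80) {         /* plain ASCII */
                i++;
                continue;
            }

            size_t n;
            unsigned long cp, min;

            if ((c & 0xe0) == 0xc0) {
                n = 2; cp = c & 0x1f; min = 0x80;
            } else if ((c & 0xf0) == 0xe0) {
                n = 3; cp = c & 0x0f; min = 0x800;
            } else if ((c & 0xf8) == 0xf0) {
                n = 4; cp = c & 0x07; min = 0x10000;
            } else
                return false;       /* stray continuation, 0xfe/0xff */

            if (i + n > len)
                return false;       /* truncated sequence */
            for (size_t j = 1; j < n; j++) {
                if ((s[i + j] & 0xc0) != 0x80)
                    return false;   /* bad continuation byte */
                cp = (cp << 6) | (s[i + j] & 0x3f);
            }
            if (cp < min)
                return false;       /* overlong encoding */
            if (cp >= 0xd800 && cp <= 0xdfff)
                return false;       /* UTF-16 surrogate */
            if (cp > 0x10ffff)
                return false;       /* beyond Unicode range */
            i += n;
        }
        return true;
    }

Note that "invalid UTF-8" has to cover overlongs and surrogates too,
or the check can be trivially bypassed.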

Say I have a hypothetical machine with the above two features
(normalizing to NFC, disallowing 1-31/invalid UTF-8). Now I log into a
typical Unix "anything but \0 or /" machine, via SFTP or whatever. What
are the failure modes?

The first kind is that I could type "get x" followed by "get y",
where x and y are canonically equivalent in Unicode but represented
differently because they're not normalized on the remote host. I would
expect this to work smoothly: first I download x to NFC(x), and then
y overwrites it, since NFC(y) is the same name.
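
To make that concrete with an example of my own: "é" has two
canonically equivalent spellings, and a plain byte comparison, which
is all an ordinary Unix filesystem does, sees two different names:

    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        /* Two canonically equivalent spellings of the filename
         * "é", standing in for x and y above. */
        const char x[] = "\xc3\xa9";   /* U+00E9, precomposed (NFC) */
        const char y[] = "e\xcc\x81";  /* U+0065 U+0301, decomposed */

        /* Distinct byte strings: an ordinary Unix filesystem
         * stores two separate files. */
        printf("strcmp(x, y) = %d\n", strcmp(x, y));

        /* A filesystem normalizing to NFC maps both names to the
         * precomposed bytes, so the second "get" overwrites the
         * first. */
        return 0;
    }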

The second kind is that I could type "get z", where z contains an invalid
character. How should my system handle this? Error as if I had asked for
a filename that's too long? Come up with a new errno? I don't know, but
in this hypothetical machine it should fail somehow.
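
For what it's worth, errno.h already has EILSEQ ("Illegal byte
sequence"), which seems like the least surprising fit. A userland
sketch of the idea; open_checked() and the deliberately minimal
name_ok() are mine, not a real interface:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdbool.h>

    /* Stand-in for a full check like valid_name() above. */
    static bool
    name_ok(const char *path)
    {
        for (; *path != '\0'; path++)
            if ((unsigned char)*path < 0x20)
                return false;
        return true;
    }

    /*
     * Hypothetical wrapper: refuse a forbidden name up front and
     * reuse the existing EILSEQ errno rather than inventing one.
     */
    int
    open_checked(const char *path, int flags)
    {
        if (!name_ok(path)) {
            errno = EILSEQ;
            return -1;
        }
        return open(path, flags);
    }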

But creating new files is only part of the problem. If invalid names
are still allowed on existing files, we lose all the security and
robustness benefits and just annoy ourselves with pointless
restrictions.

So say I mount a filesystem containing the same files x, y, and z. What
happens?

 - Fail to mount? (Simultaneously simplest, safest, and least useful)
 - Hide the files? (Seems potentially unsafe)
 - Try to escape the filenames? (Seems crazy)

Is it currently possible to take a hex editor and add "/" to a filename
(as opposed to a pathname) inside a disk image? If that's possible, how
do systems currently deal with it? Because it's the same problem.

FAT32 has both case insensitivity and disallowed characters. How well
does OpenBSD handle those restrictions? If not optimally, how could its
handling be improved? And if it already handles them with aplomb, does
that approach apply to the scenarios above?

-- 
Anthony J. Bentley
