On Sun, Jan 02, 2011 at 03:14:41PM -0200, Henrique de Moraes Holschuh wrote:
> 1. Create unlinked file fd (benefits from kernel support, but doesn't
> require it). If a filesystem cannot support this or the boundary
> conditions are unacceptable, fail. Needs to know the destination name to
> do the unlinked create on the right fs and directory (otherwise attempts
> to link the file later would have to fail if the fs is different).
This is possible. It would be specific only to file systems that support inodes (i.e., ix-nay for NFS, FAT, etc.). Some file systems would want to know a likely directory where the file would be linked, so that their inode and block allocation policies can optimize the inode and block placement.

> 2. fd works as any normal fd to an unlinked regular file.
>
> 3. create a link() that can do unlink+link atomically. Maybe this already
> exists, otherwise needs kernel support.
>
> The behaviour of (3) should allow synchronous wait of a fsync() and a
> sync of the metadata of the parent dir. It doesn't matter much if it does
> everything, or just calling fsync(), or creating a fclose() variant that
> does it.

OK, so this is where things get tricky. The first issue is that you are asking for the ability to take a file descriptor and link it into some directory. The inode associated with the fd might or might not already be linked into some other directory, and it might or might not be owned by the user trying to do the link. The latter could get problematic if quota is enabled, since it does open up a new potential security exposure. A user might pass a file descriptor to another process in a different security domain, and that process could create a link to some directory which the original user doesn't have access to. The user would no longer be able to delete the file and drop quota, and the process would retain permanent access to the file, which it might not otherwise have if the inode was protected by a parent directory's permissions.

It's for the same reason that we can't just implement open-by-inode-number; even if you use the inode's permissions and ACLs to do an access check, this allows someone to bypass security controls based on the containing directory's permissions. It might not be a security exposure, but for some scenarios (e.g., a mode 600 ~/Private directory that contains world-readable files), it changes the accessibility of some files.
We could control for this by only allowing the link to happen if the user executing this new system call owns the inode being linked, so this particular problem is addressable. The larger problem is that this doesn't give you any performance benefit over simply creating a temporary file, fsync'ing it, and then doing the rename. And it doesn't solve the problem that userspace is responsible for copying over the extended attributes and ACL information. So in exchange for doing something non-portable and Linux-specific, which won't work on FAT, NFS, and other non-inode-based file systems at all, and which requires special modifications even for inode-based file systems, the only real benefit you get is that the temp file gets cleaned up automatically if you crash before the new magical link/unlink system call completes. Is it worth it? I'm not at all convinced.

Can this be fixed? Well, I suppose we could have this magical link/unlink system call also magically copy over the xattrs and ACLs. And if you don't care about when things happen, you could have the kernel fork off a kernel thread which does the fsync, followed by the magic ACL and xattr copying, and once all of this completes, does the magic link/unlink. So we could bundle all of this into a system call. *Theoretically*. But then someone else will say that they want to know when this magic link/unlink system call actually completes. Others might say that they don't care about the fsync happening right away, but would rather wait some arbitrary time and let the system writeback algorithms write back the file *whenever*; and only when the file has been written back, *whenever* that is, should the rest of the magical link/unlink happen. So now we have an explosion of complexity, with all sorts of different variants.
And there's also the problem that if you don't make the system call synchronous (where it does an fsync() and waits for it to complete), you lose the ability to report errors back to userspace.

Which gets me back to the question of use cases. When are we going to be using this monster? For many use cases (the ones where we originally told people they were doing it wrong), the risk was losing data. But if you don't do things synchronously with fsync(), you also end up risking data loss because you won't know about write failures; specifically, your program may have long exited by the time the write failure is noticed by the kernel. But if you make the system call synchronous, then there's no performance advantage over simply doing the fsync() and rename() in userspace. And if we do this using O_ATOMIC, or your scheme with unlinked file descriptors and the magic link/unlink-by-fd system call, application programmers have to modify their programs anyway; so why not modify them to use a userspace library that does safe writing?

So is all of this effort really worth it at the end of the day? When you sum it all up, the only way it makes sense is if one of the following applies:

1) You care about data loss in the case of power failure, but not in the case of hard drive or storage failure, *AND* you are writing tons and tons of tiny 3-4 byte files, so you are worried about performance because you're doing something insane with a large number of small files.

2) You are specifically worried about the case where you are replacing the contents of a file owned by a different uid than the user doing the data update, in a safe way where you don't want a partially written file to replace the old, complete file, *AND* you care about the file's ownership after the update.
3) You care about the temp file used by the userspace library or application (which is doing the write-temp-file, fsync(), rename() scheme) being automatically deleted in the case of a system crash, or of a process getting sent an uncatchable signal and getting terminated.

Against these possible scenarios where some new kernel code might be a win, you have to weigh:

A) Lack of OS portability to other POSIX operating systems: Mac OS X, Solaris, FreeBSD, AIX, etc.

B) Lack of portability to file systems that don't use inodes as a basis for their design.

C) Lack of portability to file systems that haven't been hacked to support this new scheme, even if they are inode-based.

Is it worth it? I'd say no, and I'd suggest that someone who really cares should create a userspace application helper library first, since you'll need it as a fallback for the cases listed above where this scheme won't work. (Even if you do the fallback in the kernel, you'll still need a userspace fallback for non-Linux systems, and for when the application is run on an older Linux kernel that doesn't have all of this O_ATOMIC or link/unlink magic.)

The scheme you suggested is certainly *technically feasible* in terms of something that could be implemented. Whether it would be worth it, given that portable applications won't be able to count on it being present, is a very different question entirely. The reality is that we've lived without this capability in Unix and Linux systems for something like three decades. I suspect we can live without it for the next couple of decades without it being the end of the world.

- Ted

Archive: http://lists.debian.org/20110103040405.gd11...@thunk.org