Re: [nameidata 1/2] Don't pass NULL nameidata to vfs_create
On Monday 16 April 2007 18:21, Christoph Hellwig wrote: > On Mon, Apr 16, 2007 at 06:11:30PM +0200, Andreas Gruenbacher wrote: > > On Thursday 12 April 2007 12:06, Christoph Hellwig wrote: > > > Once again very strong NACK. Every conditional passing of vfsmounts get > > > my veto. As mentioned last time if you really want this send a patch > > > series first that passed the vfsmount consistantly. > > > > I don't consider it fair to NACK this patch just because some other part > > of the kernel uses vfs_create() in the wrong way -- those are really > > independent issues. The bug is not that hard to fix though; here is a > > proposed patch on top of the other AppArmor patches. > > Note that it's not just this particular hunk, all these kinds of conditional > passing are wrong. Yes, the vfs uses or passes through a NULL nameidata or vfsmount pointer and no intent information in several places: (1) vfs_create() is called with a NULL nameidata in two places, (2) lookup_one_len() and lookup_one_len_kern() pass a NULL nameidata to permission(), d_op->d_revalidate(), and i_op->lookup(), (3) file_permission() passes a NULL nameidata to permission(). (4) permission() is directly called with NULL nameidata in some places, and In general it is incorrect to use NULL nameidata or vfsmounts because then the vfsmount->mnt_flags cannot be checked, and no intent information is available. Intents are an optional optimization for remote filesystems; we can simply ignore them for local filesystems. There are some cases where we are passing NULL which are real bugs, but there are also other cases in which the operations are not mount related: for example, filesystems like pipefs have the MS_NOUSER flag set which indicates that they cannot be mounted at all. On filesystems like sysfs, files are created and removed whether or not that filesystem is mounted. The places where we should pass a proper nameidata / vfsmount but don't should be fixed. Those are long-standing issues in the vfs though, and at least some are not easy to correct. AppArmor is affected by some of the NULL vfsmount passing, and we found a few more problems when looking into this more closely, but most of that NULL passing doesn't affect AppArmor: * In some places, vfs functions are called without nameidata / vfsmount, followed by calling lsm hooks with a vfsmount. In those cases, the vfs functions cannot check the vfsmount flags, but the lsm hooks will still do the appropriate checks. * In the AppArmor security model, directory searching is always allowed, and the access check are done once the leaf dentry + vfsmount has been looked up. For example, granting write access to /var/tmp/foo in a profile is sufficient to allow that access; search access to /var and /var/tmp does not need to be specified. (The regular inode permission checks are of course still performed.) * For filesystems like nfsd, the concept of pathname based access control doesn't make sense because of properties of the protocol: there is no reliable mapping from an nfs file handle to a particular file; aliases to the same file cannot be distinguished. (The AppArmor techdoc [1] describes this in more detail.) Not allowing to confine nfsd by AppArmor doesn't hurt much: nfsd provides its own export access control mechanisms. Also, nfsd cannot really be confined in a secure way because of how it is implemented in the kernel. Gfs2 may have similar properties; I didn't actually look into the protocol. So I propose to separate the vfs NULL nameidata / vfsmount passing discussion from the AppArmor discussion. I have no problem working on the vfs fixes -- we could some nice cleanups there -- but that really is independent of AppArmor. We almost have the revised AppArmor patches ready. We'll include all the fixes that are truly necessary, and everything beyond that will be posted separately. > But anyway, creating fake nameidata structures is not really helpful. > If there is a nameidata passed people expect it to be complete, and > if you pass them to an LSM people will e.g. try to look into lookup > intents. Splitting up nameidata helps somewhat here. In cases where we don't have intent information available we can zero out the intent flags, and so in the worst case, we won't get the intent optimizations. Thanks, Andreas - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Fri, May 11, 2007 at 03:46:41AM -0600, Valerie Henson wrote: > On Wed, May 09, 2007 at 02:51:41PM -0500, Matt Mackall wrote: > > > > We will, unfortunately, need to be able to check an entire directory > > at once. There's no other efficient way to assure that there are no > > duplicate names in a directory, for instance. > > I don't see that being a major problem for the vast majority of > workloads. > > > In summary, checking a tile requires trivial checks on all the inodes > > and directories that point into a tile. Inodes, directories, and data > > that are inside a tile get checked more thoroughly but still don't > > need to do much pointer chasing. > > Okay, I'm totally convinced - checking a tile-at-a-time works! I'm > going to steal as many of your ideas as possible and write ChileFS. :) > Per-block inode rmap in particular has so many advantages that I'm > ranking it up with checksums as a must-have feature. > > Now for the hard part: repair. If you find an indirect block or > extent with a bad checksum, how much of the file system are you going > to have to read to fix the dangling blocks? I can see a speed-up by > reading just the rmaps and looking for the associated inode number. > What about an inode that has been corrupted? You could at least get > the inode number out of the rmap, but your pointers to your first > level indirect blocks are gone. A directory block? No way to get > useful information out of the rmap there. Ad nauseum. This is where > I really like having the encapsulation of chunkfs, despite all the > nasty continuation inode bits. There are a few interesting possibilities here. First we've got several new sources of integrity checking. If a block points to an inode and the inode doesn't point back to it, we follow the inode's corresponding pointer forward and back. If that turns out to be consistent, we know that the original block is in the wrong. We can also check the tile header and inode CRCs if they disagree and neither pointer checks out. Failing that, stray blocks can get attached to an inode in lost+found, and we can reattach them later during a full filesystem sweep. Another possibility is that we can compare forward and backward pointers at runtime. This is probably a good idea anyway: we've got to read tile headers for runtime CRC checks, so might as well check the pointers too. If we discover a pointer mismatch, we can either orphan data blocks -or recover orphans- on the fly. So doing a filesystem-wide backup will also do most of an fsck. -- Mathematics is the supreme nostalgia of our time. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote: > On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote: > > On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote: > > > On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote: > > > > I have the updated patches ready which take care of Andrew's comments. > > > > Will run some tests and post them soon. > > > > > > > > But, before submitting these patches, I think it will be better to > > > > finalize on certain things which might be worth some discussion here: > > > > > > > > 1) Should the file size change when preallocation is done beyond EOF ? > > > > - Andreas and Chris Wedgwood are in favor of not changing the file size > > > > in this case. I also tend to agree with them. Does anyone has an > > > > argument in favor of changing the filesize ? If not, I will remove the > > > > code which changes the filesize, before I resubmit the concerned ext4 > > > > patch. > > > > > > I think there needs to be both. If we don't have a mechanism to atomically > > > change the file size with the preallocation, then applications that use > > > stat() to work out if they need to preallocate more space will end up > > > racing. > > > > By "both" above, do you mean we should give user the flexibility if it wants > > the filesize changed or not ? It can be done by having *two* modes for > > preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we > > use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not > > change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate() > > will change the filesize if required (i.e. when allocation is beyond EOF) > > and also update [cm]time. This way, the application can decide what it > > wants. > > Yes, that's right. > > > This will be helpfull for the partial allocation scenario also. Think of the > > case when we do not change the filesize in fallocate() and expect > > applications/posix_fallocate() to do ftruncate() after fallocate() for this. > > Now if fallocate() results in a partial allocation with -ENOSPC error > > returned, applications/posix_fallocate() will not know for what length > > ftruncate() has to be called. :( > > Well, posix_fallocate() either gets all the space or it fails. If > you truncate to extend the file size after an ENOSPC, then that is > a buggy implementation. > > The same could be said for any application, or even the fallocate() > call itself if it changes the filesize without having completely > preallocated the space asked > > > Hence it may be a good idea to give user the flexibility if it wants to > > atomically change the file size with preallocation or not. But, with more > > flexibility there comes inconsistency in behavior, which is worth > > considering. > > We've got different modes to specify different behaviour. That's > what the mode field was put there for in the first place - the > interface is *designed* to support different preallocation > behaviours > > > > > 2) For FA_UNALLOCATE mode, should the file system allow unallocation of > > > > normal (non-preallocated) blocks (blocks allocated via regular > > > > write/truncate operations) also (i.e. work as punch()) ? > > > > > > Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what > > > i did for FA_UNALLOCATE as well. > > > > Ok. But, some people may not expect/like this. I think, we can keep it on > > the backburner for a while, till other issues are sorted out. > > How can it be a "backburner" issue when it defines the > implementation? I've already implemented some thing in XFS that > sort of does what I think that the interface is supposed to do, but > I need that interface to be nailed down before proceeding any > further. > > All I'm really interested in right now is that the fallocate > _interface_ can be used as a *complete replacement* for the > pre-existing XFS-specific ioctls that are already used by > applications. What ext4 can or can't do right now is irrelevant to > this discussion - the interface definition needs to take priority > over implementation Would you like to write up an interface definition description (likely man page) and post it for review, possibly with a mention of apps using it today ? One reason for introducing the mode parameter was to allow the interface to evolve incrementally as more options / semantic questions are proposed, so that we don't have to make all the decisions right now. So it would be good to start with a *minimal* definition, even just one mode. The rest could follow as subsequent patches, each being reviewed and debated separately. Otherwise this discussion can drag on for a long time. Regards Suparna > > Cheers, > > Dave, > -- > Dave Chinner > Principal Engineer > SGI Australian Software Group > - > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Wed, May 09, 2007 at 02:51:41PM -0500, Matt Mackall wrote: > > We will, unfortunately, need to be able to check an entire directory > at once. There's no other efficient way to assure that there are no > duplicate names in a directory, for instance. I don't see that being a major problem for the vast majority of workloads. > In summary, checking a tile requires trivial checks on all the inodes > and directories that point into a tile. Inodes, directories, and data > that are inside a tile get checked more thoroughly but still don't > need to do much pointer chasing. Okay, I'm totally convinced - checking a tile-at-a-time works! I'm going to steal as many of your ideas as possible and write ChileFS. :) Per-block inode rmap in particular has so many advantages that I'm ranking it up with checksums as a must-have feature. Now for the hard part: repair. If you find an indirect block or extent with a bad checksum, how much of the file system are you going to have to read to fix the dangling blocks? I can see a speed-up by reading just the rmaps and looking for the associated inode number. What about an inode that has been corrupted? You could at least get the inode number out of the rmap, but your pointers to your first level indirect blocks are gone. A directory block? No way to get useful information out of the rmap there. Ad nauseum. This is where I really like having the encapsulation of chunkfs, despite all the nasty continuation inode bits. -VAL P.S. Note the email address change - I'm leaving Intel as of Tuesday and my [EMAIL PROTECTED] address will probably either bounce or black hole - not sure which. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] AFS: Fix interminable loop in afs_write_back_from_locked_page()
Andrew Morton <[EMAIL PROTECTED]> wrote: > Yes, it's a shame that there doesn't seem to be a fine-grained way of > turning on -W's useful bits. You can turn off -W's undesirable bits. For net/rxrpc/ and fs/afs/ at least, adding: CFLAGS += -W -Wno-unused-parameter to the Makefile generates no warnings. Perhaps this should be added to the master Makefile. Adding -Wsign-compare finds some stuff that I will fix. It also finds some stuff in the main and the networking headers. This is a really useful option and found some tricky bugs in CacheFiles. I would endorse adding this generally too. David - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] AFS: Fix interminable loop in afs_write_back_from_locked_page()
On Fri, 11 May 2007 10:49:23 +0100 David Howells <[EMAIL PROTECTED]> wrote: > Andrew Morton <[EMAIL PROTECTED]> wrote: > > > > Following bug was uncovered by compiling with '-W' flag: > > > > gcc -W finds a number of fairly scary bugs. > > Do you mean in my code specifically? Or in the kernel in general? In general. > As far as > I can tell -W only finds an eye-glazingly large quantity of 'unused parameter' > warnings in AFS and AF_RXRPC. Yes, it's a shame that there doesn't seem to be a fine-grained way of turning on -W's useful bits. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] AFS: Fix interminable loop in afs_write_back_from_locked_page()
Andrew Morton <[EMAIL PROTECTED]> wrote: > More than one would expect, given that it is recommended in > Documentation/SubmitChecklist, which everyone reads ;) Which states incorrectly: | 22: Newly-added code has been compiled with `gcc -W'. This will generate | lots of noise, but is good for finding bugs like "warning: comparison | between signed and unsigned". -W does not imply -Wsign-compare, at least not on my gcc. David - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] AFS: Fix interminable loop in afs_write_back_from_locked_page()
Andrew Morton <[EMAIL PROTECTED]> wrote: > > Following bug was uncovered by compiling with '-W' flag: > > gcc -W finds a number of fairly scary bugs. Do you mean in my code specifically? Or in the kernel in general? As far as I can tell -W only finds an eye-glazingly large quantity of 'unused parameter' warnings in AFS and AF_RXRPC. David - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html