Re: [nameidata 1/2] Don't pass NULL nameidata to vfs_create

2007-05-11 Thread Andreas Gruenbacher
On Monday 16 April 2007 18:21, Christoph Hellwig wrote:
> On Mon, Apr 16, 2007 at 06:11:30PM +0200, Andreas Gruenbacher wrote:
> > On Thursday 12 April 2007 12:06, Christoph Hellwig wrote:
> > > Once again very strong NACK.  Every conditional passing of vfsmounts get 
> > > my veto.  As mentioned last time if you really want this send a patch
> > > series first that passed the vfsmount consistantly.
> > 
> > I don't consider it fair to NACK this patch just because some other part
> > of the kernel uses vfs_create() in the wrong way -- those are really
> > independent issues. The bug is not that hard to fix though; here is a
> > proposed patch on top of the other AppArmor patches.
> 
> Note that it's not just this particular hunk, all these kinds of conditional
> passing are wrong.

Yes, the vfs uses or passes through a NULL nameidata or vfsmount pointer and 
no intent information in several places:

 (1) vfs_create() is called with a NULL nameidata in two places,

 (2) lookup_one_len() and lookup_one_len_kern() pass a NULL nameidata to
 permission(), d_op->d_revalidate(), and i_op->lookup(),

 (3) file_permission() passes a NULL nameidata to permission().

 (4) permission() is directly called with NULL nameidata in some places, and

In general it is incorrect to use NULL nameidata or vfsmounts because then the
vfsmount->mnt_flags cannot be checked, and no intent information is available. 
Intents are an optional optimization for remote filesystems; we can simply 
ignore them for local filesystems. There are some cases where we are passing 
NULL which are real bugs, but there are also other cases in which the 
operations are not mount related: for example, filesystems like pipefs have 
the MS_NOUSER flag set which indicates that they cannot be mounted at all. On 
filesystems like sysfs, files are created and removed whether or not that 
filesystem is mounted.

The places where we should pass a proper nameidata / vfsmount but don't should 
be fixed. Those are long-standing issues in the vfs though, and at least some 
are not easy to correct. AppArmor is affected by some of the NULL vfsmount 
passing, and we found a few more problems when looking into this more 
closely, but most of that NULL passing doesn't affect AppArmor:

 * In some places, vfs functions are called without nameidata / vfsmount,
   followed by calling lsm hooks with a vfsmount. In those cases, the vfs
   functions cannot check the vfsmount flags, but the lsm hooks will still
   do the appropriate checks.

 * In the AppArmor security model, directory searching is always allowed, and
   the access check are done once the leaf dentry + vfsmount has been looked
   up. For example, granting write access to /var/tmp/foo in a profile is
   sufficient to allow that access; search access to /var and /var/tmp does
   not need to be specified. (The regular inode permission checks are of
   course still performed.)

 * For filesystems like nfsd, the concept of pathname based access control
   doesn't make sense because of properties of the protocol: there is no
   reliable mapping from an nfs file handle to a particular file; aliases
   to the same file cannot be distinguished. (The AppArmor techdoc [1]
   describes this in more detail.)

   Not allowing to confine nfsd by AppArmor doesn't hurt much: nfsd provides
   its own export access control mechanisms. Also, nfsd cannot really be
   confined in a secure way because of how it is implemented in the kernel.
   Gfs2 may have similar properties; I didn't actually look into the protocol.

So I propose to separate the vfs NULL nameidata / vfsmount passing discussion 
from the AppArmor discussion. I have no problem working on the vfs fixes -- 
we could some nice cleanups there -- but that really is independent of 
AppArmor.

We almost have the revised AppArmor patches ready. We'll include all the fixes 
that are truly necessary, and everything beyond that will be posted 
separately.

> But anyway, creating fake nameidata structures is not really helpful.
> If there is a nameidata passed people expect it to be complete, and
> if you pass them to an LSM people will e.g. try to look into lookup
> intents.

Splitting up nameidata helps somewhat here. In cases where we don't have 
intent information available we can zero out the intent flags, and so in the 
worst case, we won't get the intent optimizations.

Thanks,
Andreas
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-11 Thread Matt Mackall
On Fri, May 11, 2007 at 03:46:41AM -0600, Valerie Henson wrote:
> On Wed, May 09, 2007 at 02:51:41PM -0500, Matt Mackall wrote:
> >
> > We will, unfortunately, need to be able to check an entire directory
> > at once. There's no other efficient way to assure that there are no
> > duplicate names in a directory, for instance.
> 
> I don't see that being a major problem for the vast majority of
> workloads.
> 
> > In summary, checking a tile requires trivial checks on all the inodes
> > and directories that point into a tile. Inodes, directories, and data
> > that are inside a tile get checked more thoroughly but still don't
> > need to do much pointer chasing.
> 
> Okay, I'm totally convinced - checking a tile-at-a-time works!  I'm
> going to steal as many of your ideas as possible and write ChileFS. :)
> Per-block inode rmap in particular has so many advantages that I'm
> ranking it up with checksums as a must-have feature.
> 
> Now for the hard part: repair.  If you find an indirect block or
> extent with a bad checksum, how much of the file system are you going
> to have to read to fix the dangling blocks?  I can see a speed-up by
> reading just the rmaps and looking for the associated inode number.
> What about an inode that has been corrupted?  You could at least get
> the inode number out of the rmap, but your pointers to your first
> level indirect blocks are gone.  A directory block?  No way to get
> useful information out of the rmap there.  Ad nauseum.  This is where
> I really like having the encapsulation of chunkfs, despite all the
> nasty continuation inode bits.

There are a few interesting possibilities here.

First we've got several new sources of integrity checking. If a block
points to an inode and the inode doesn't point back to it, we follow
the inode's corresponding pointer forward and back. If that turns out
to be consistent, we know that the original block is in the wrong. We
can also check the tile header and inode CRCs if they disagree and
neither pointer checks out.

Failing that, stray blocks can get attached to an inode in lost+found,
and we can reattach them later during a full filesystem sweep.

Another possibility is that we can compare forward and backward
pointers at runtime. This is probably a good idea anyway: we've got to
read tile headers for runtime CRC checks, so might as well check the
pointers too. If we discover a pointer mismatch, we can either orphan
data blocks -or recover orphans- on the fly. So doing a
filesystem-wide backup will also do most of an fsck.

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-11 Thread Suparna Bhattacharya
On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
> On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote:
> > On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
> > > On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
> > > > I have the updated patches ready which take care of Andrew's comments.
> > > > Will run some tests and post them soon.
> > > > 
> > > > But, before submitting these patches, I think it will be better to
> > > > finalize on certain things which might be worth some discussion here:
> > > > 
> > > > 1) Should the file size change when preallocation is done beyond EOF ?
> > > > - Andreas and Chris Wedgwood are in favor of not changing the file size
> > > > in this case. I also tend to agree with them. Does anyone has an
> > > > argument in favor of changing the filesize ?  If not, I will remove the
> > > > code which changes the filesize, before I resubmit the concerned ext4
> > > > patch.
> > > 
> > > I think there needs to be both. If we don't have a mechanism to atomically
> > > change the file size with the preallocation, then applications that use
> > > stat() to work out if they need to preallocate more space will end up
> > > racing.
> > 
> > By "both" above, do you mean we should give user the flexibility if it wants
> > the filesize changed or not ? It can be done by having *two* modes for
> > preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we
> > use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not
> > change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate()
> > will change the filesize if required (i.e.  when allocation is beyond EOF)
> > and also update [cm]time.  This way, the application can decide what it
> > wants.
> 
> Yes, that's right.
> 
> > This will be helpfull for the partial allocation scenario also. Think of the
> > case when we do not change the filesize in fallocate() and expect
> > applications/posix_fallocate() to do ftruncate() after fallocate() for this.
> > Now if fallocate() results in a partial allocation with -ENOSPC error
> > returned, applications/posix_fallocate() will not know for what length
> > ftruncate() has to be called.  :(
> 
> Well, posix_fallocate() either gets all the space or it fails. If
> you truncate to extend the file size after an ENOSPC, then that is
> a buggy implementation.
> 
> The same could be said for any application, or even the fallocate()
> call itself if it changes the filesize without having completely
> preallocated the space asked
> 
> > Hence it may be a good idea to give user the flexibility if it wants to
> > atomically change the file size with preallocation or not. But, with more
> > flexibility there comes inconsistency in behavior, which is worth
> > considering.
> 
> We've got different modes to specify different behaviour. That's
> what the mode field was put there for in the first place - the
> interface is *designed* to support different preallocation
> behaviours
> 
> > > > 2) For FA_UNALLOCATE mode, should the file system allow unallocation of
> > > > normal (non-preallocated) blocks (blocks allocated via regular
> > > > write/truncate operations) also (i.e. work as punch()) ?
> > > 
> > > Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what
> > > i did for FA_UNALLOCATE as well.
> > 
> > Ok. But, some people may not expect/like this. I think, we can keep it on
> > the backburner for a while, till other issues are sorted out.
> 
> How can it be a "backburner" issue when it defines the
> implementation?  I've already implemented some thing in XFS that
> sort of does what I think that the interface is supposed to do, but
> I need that interface to be nailed down before proceeding any
> further.
> 
> All I'm really interested in right now is that the fallocate
> _interface_ can be used as a *complete replacement* for the
> pre-existing XFS-specific ioctls that are already used by
> applications.  What ext4 can or can't do right now is irrelevant to
> this discussion - the interface definition needs to take priority
> over implementation

Would you like to write up an interface definition description (likely
man page) and post it for review, possibly with a mention of apps using
it today ?

One reason for introducing the mode parameter was to allow the interface to
evolve incrementally as more options / semantic questions are proposed, so
that we don't have to make all the decisions right now. 
So it would be good to start with a *minimal* definition, even just one mode.
The rest could follow as subsequent patches, each being reviewed and debated
separately. Otherwise this discussion can drag on for a long time.

Regards
Suparna

> 
> Cheers,
> 
> Dave,
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at

Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-11 Thread Valerie Henson
On Wed, May 09, 2007 at 02:51:41PM -0500, Matt Mackall wrote:
>
> We will, unfortunately, need to be able to check an entire directory
> at once. There's no other efficient way to assure that there are no
> duplicate names in a directory, for instance.

I don't see that being a major problem for the vast majority of
workloads.

> In summary, checking a tile requires trivial checks on all the inodes
> and directories that point into a tile. Inodes, directories, and data
> that are inside a tile get checked more thoroughly but still don't
> need to do much pointer chasing.

Okay, I'm totally convinced - checking a tile-at-a-time works!  I'm
going to steal as many of your ideas as possible and write ChileFS. :)
Per-block inode rmap in particular has so many advantages that I'm
ranking it up with checksums as a must-have feature.

Now for the hard part: repair.  If you find an indirect block or
extent with a bad checksum, how much of the file system are you going
to have to read to fix the dangling blocks?  I can see a speed-up by
reading just the rmaps and looking for the associated inode number.
What about an inode that has been corrupted?  You could at least get
the inode number out of the rmap, but your pointers to your first
level indirect blocks are gone.  A directory block?  No way to get
useful information out of the rmap there.  Ad nauseum.  This is where
I really like having the encapsulation of chunkfs, despite all the
nasty continuation inode bits.

-VAL

P.S. Note the email address change - I'm leaving Intel as of Tuesday
and my [EMAIL PROTECTED] address will probably either bounce
or black hole - not sure which.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] AFS: Fix interminable loop in afs_write_back_from_locked_page()

2007-05-11 Thread David Howells
Andrew Morton <[EMAIL PROTECTED]> wrote:

> Yes, it's a shame that there doesn't seem to be a fine-grained way of
> turning on -W's useful bits.

You can turn off -W's undesirable bits.  For net/rxrpc/ and fs/afs/ at least,
adding:

CFLAGS += -W -Wno-unused-parameter

to the Makefile generates no warnings.  Perhaps this should be added to the
master Makefile.

Adding -Wsign-compare finds some stuff that I will fix.  It also finds some
stuff in the main and the networking headers.  This is a really useful option
and found some tricky bugs in CacheFiles.  I would endorse adding this
generally too.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] AFS: Fix interminable loop in afs_write_back_from_locked_page()

2007-05-11 Thread Andrew Morton
On Fri, 11 May 2007 10:49:23 +0100 David Howells <[EMAIL PROTECTED]> wrote:

> Andrew Morton <[EMAIL PROTECTED]> wrote:
> 
> > > Following bug was uncovered by compiling with '-W' flag:
> > 
> > gcc -W finds a number of fairly scary bugs.
> 
> Do you mean in my code specifically?  Or in the kernel in general?

In general.

>  As far as
> I can tell -W only finds an eye-glazingly large quantity of 'unused parameter'
> warnings in AFS and AF_RXRPC.

Yes, it's a shame that there doesn't seem to be a fine-grained way of
turning on -W's useful bits.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] AFS: Fix interminable loop in afs_write_back_from_locked_page()

2007-05-11 Thread David Howells
Andrew Morton <[EMAIL PROTECTED]> wrote:

> More than one would expect, given that it is recommended in
> Documentation/SubmitChecklist, which everyone reads ;)

Which states incorrectly:

| 22: Newly-added code has been compiled with `gcc -W'.  This will generate
| lots of noise, but is good for finding bugs like "warning: comparison
| between signed and unsigned".

-W does not imply -Wsign-compare, at least not on my gcc.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] AFS: Fix interminable loop in afs_write_back_from_locked_page()

2007-05-11 Thread David Howells
Andrew Morton <[EMAIL PROTECTED]> wrote:

> > Following bug was uncovered by compiling with '-W' flag:
> 
> gcc -W finds a number of fairly scary bugs.

Do you mean in my code specifically?  Or in the kernel in general?  As far as
I can tell -W only finds an eye-glazingly large quantity of 'unused parameter'
warnings in AFS and AF_RXRPC.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html