date:20070425

Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread David Chinner

On Wed, Apr 25, 2007 at 04:03:44PM -0700, Valerie Henson wrote:
> On Wed, Apr 25, 2007 at 08:54:34PM +1000, David Chinner wrote:
> > On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> > > 
> > > The structure looks like this:
> > > 
> > >  ----------       ----------
> > > | cnode 0  |---->| cnode 0  |----> to another cnode or NULL
> > >  ----------       ----------
> > > | cnode 1  |---  | cnode 1  |---
> > >  ----------    |  ----------    |
> > > | cnode 2  |-  | | cnode 2  |-  |
> > >  ----------  | |  ----------  | |
> > > | cnode 3  | | | | cnode 3  | | |
> > >  ----------  | |  ----------  | |
> > >      |       | |      |       | |
> > > 
> > >    inodes           inodes or NULL
> > 
> > How do you recover if fsfuzzer takes out a cnode in the chain? The
> > chunk is marked clean, but clearly corrupted and needs fixing and
> > you don't know what it was pointing at.  Hence you have a pointer to
> > a trashed cnode *somewhere* that you need to find and fix, and a
> > bunch of orphaned cnodes that nobody points to *somewhere else* in
> > the filesystem that you have to find. That's a full scan fsck case,
> > isn't it?
> 
> Excellent question.  This is one of the trickier aspects of chunkfs -
> the orphan inode problem (tricky, but solvable).  The problem is what
> if you smash/lose/corrupt an inode in one chunk that has a
> continuation inode in another chunk?  A back pointer does you no good
> if the back pointer is corrupted.

*nod*

> What you do is keep tabs on whether you see damage that looks like
> this has occurred - e.g., inode use/free counts wrong, you had to zero
> a corrupted inode - and when this happens, you do a scan of all
> continuation inodes in chunks that have links to the corrupted chunk.

This assumes that you know a chunk has been corrupted, though.
How do you find that out?

> What you need to make this go fast is (1) a pre-made list of which
> chunks have links with which other chunks,

So you add a new on-disk structure that needs to be kept up to
date? How do you trust that structure to be correct if you are
not journalling it? What happens if fsfuzzer trashes part
of this table as well and you can't trust it?

> (2) a fast way to read all
> of the continuation inodes in a chunk (ignoring chunk-local inodes).
> This stage is O(fs size) approximately, but it should be quite swift.

Assuming you can trust this list. If not, finding cnodes is going
to be rather slow.

> > It seems that any sort of damage to the underlying storage (e.g.
> > media error, I/O error or user brain explosion) results in the need
> > to do a full fsck and hence chunkfs gives you no benefit in this
> > case.
> 
> I worry about this but so far haven't found something which couldn't
> be cut down significantly with just a little extra work.  It might be
> helpful to look at an extreme case.
> 
> Let's say we're incredibly paranoid.  We could be justified in running
> a full fsck on the entire file system in between every single I/O.
> After all, something *might* have been silently corrupted.  But this
> would be ridiculously slow.  We could instead never check the file
> system.  But then we would end up panicking and corrupting the file
> system a lot.  So what's a good compromise?
> 
> In the chunkfs case, here's my rules of thumb so far:
> 
> 1. Detection: All metadata has magic numbers and checksums.
> 2. Scrubbing: Random check of chunks when possible.
> 3. Repair: When we detect corruption, either by checksum error, file
>    system code assertion failure, or hardware tells us we have a bug,
>    check the chunk containing the error and any outside-chunk
>    information that could be affected by it.

So if you end up with a corruption in a "clean" part of the
filesystem, you may not find out about the corruption on reboot and
fsck?  You need to trip over the corruption first before fsck can be
told it needs to check/repair a given chunk? Or do you need to force
a "check everything" fsck in this case?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread Valerie Henson

On Wed, Apr 25, 2007 at 05:38:34AM -0600, Andreas Dilger wrote:
> 
> The case where only a fsck of the corrupt chunk is done would not find the
> cnode references.  Maybe there needs to be per-chunk info which contains
> a list/bitmap of other chunks that have cnodes shared with each chunk?

Yes, exactly.  One might almost think you had solved this problem
before. :):):)

-VAL

Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread Valerie Henson

On Wed, Apr 25, 2007 at 08:54:34PM +1000, David Chinner wrote:
> On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> > 
> > The structure looks like this:
> > 
> >  ----------       ----------
> > | cnode 0  |---->| cnode 0  |----> to another cnode or NULL
> >  ----------       ----------
> > | cnode 1  |---  | cnode 1  |---
> >  ----------    |  ----------    |
> > | cnode 2  |-  | | cnode 2  |-  |
> >  ----------  | |  ----------  | |
> > | cnode 3  | | | | cnode 3  | | |
> >  ----------  | |  ----------  | |
> >      |       | |      |       | |
> > 
> >    inodes           inodes or NULL
> 
> How do you recover if fsfuzzer takes out a cnode in the chain? The
> chunk is marked clean, but clearly corrupted and needs fixing and
> you don't know what it was pointing at.  Hence you have a pointer to
> a trashed cnode *somewhere* that you need to find and fix, and a
> bunch of orphaned cnodes that nobody points to *somewhere else* in
> the filesystem that you have to find. That's a full scan fsck case,
> isn't it?

Excellent question.  This is one of the trickier aspects of chunkfs -
the orphan inode problem (tricky, but solvable).  The problem is what
if you smash/lose/corrupt an inode in one chunk that has a
continuation inode in another chunk?  A back pointer does you no good
if the back pointer is corrupted.

What you do is keep tabs on whether you see damage that looks like
this has occurred - e.g., inode use/free counts wrong, you had to zero
a corrupted inode - and when this happens, you do a scan of all
continuation inodes in chunks that have links to the corrupted chunk.
What you need to make this go fast is (1) a pre-made list of which
chunks have links with which other chunks, (2) a fast way to read all
of the continuation inodes in a chunk (ignoring chunk-local inodes).
This stage is O(fs size) approximately, but it should be quite swift.
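
A minimal sketch of point (1), assuming the "pre-made list" is kept as one
bitmap of linked chunks per chunk. The names chunk_links, link_chunks and
chunks_to_scan, and the 64-chunk limit, are made up for illustration and are
not taken from the chunkfs patch.

```c
#include <stddef.h>
#include <stdint.h>

#define NCHUNKS 64  /* hypothetical: lets one 64-bit word describe a chunk */

/* links[c] has bit j set when chunk c shares a continuation inode with chunk j. */
struct chunk_links {
    uint64_t links[NCHUNKS];
};

/* Record a continuation link between chunks a and b (kept symmetric). */
static void link_chunks(struct chunk_links *cl, unsigned a, unsigned b)
{
    cl->links[a] |= UINT64_C(1) << b;
    cl->links[b] |= UINT64_C(1) << a;
}

/*
 * Fill 'out' with the chunks whose continuation inodes would need scanning
 * after damage in chunk 'bad'. Returns the number of entries written.
 */
static size_t chunks_to_scan(const struct chunk_links *cl, unsigned bad,
                             unsigned out[NCHUNKS])
{
    size_t n = 0;

    for (unsigned j = 0; j < NCHUNKS; j++)
        if (j != bad && ((cl->links[bad] >> j) & 1))
            out[n++] = j;
    return n;
}
```

With links recorded between chunks 3 and 7 and between 3 and 12, damage in
chunk 3 would send the scan to chunks 7 and 12 only. The sketch says nothing
about how the bitmap itself is protected on disk.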

> It seems that any sort of damage to the underlying storage (e.g.
> media error, I/O error or user brain explosion) results in the need
> to do a full fsck and hence chunkfs gives you no benefit in this
> case.

I worry about this but so far haven't found something which couldn't
be cut down significantly with just a little extra work.  It might be
helpful to look at an extreme case.

Let's say we're incredibly paranoid.  We could be justified in running
a full fsck on the entire file system in between every single I/O.
After all, something *might* have been silently corrupted.  But this
would be ridiculously slow.  We could instead never check the file
system.  But then we would end up panicking and corrupting the file
system a lot.  So what's a good compromise?

In the chunkfs case, here's my rules of thumb so far:

1. Detection: All metadata has magic numbers and checksums.
2. Scrubbing: Random check of chunks when possible.
3. Repair: When we detect corruption, either by checksum error, file
   system code assertion failure, or hardware tells us we have a bug,
   check the chunk containing the error and any outside-chunk
   information that could be affected by it.
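
Rule 1 could look roughly like this. It is only a sketch: the header layout,
the magic value and the running-sum checksum (in the spirit of Fletcher's, not
a standard variant) are assumptions for illustration, not the chunkfs on-disk
format.

```c
#include <stddef.h>
#include <stdint.h>

#define META_MAGIC 0x43484e4bu  /* hypothetical value, ASCII "CHNK" */

/* Hypothetical metadata header; the layout is illustrative only. */
struct meta_hdr {
    uint32_t magic;
    uint32_t csum;   /* checksum over the payload that follows */
};

/* Simple two-sum checksum standing in for whatever a real format would use. */
static uint32_t meta_csum(const uint8_t *buf, size_t len)
{
    uint32_t a = 0, b = 0;

    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % 65535u;
        b = (b + a) % 65535u;
    }
    return (b << 16) | a;
}

enum meta_status { META_OK, META_BAD_MAGIC, META_BAD_CSUM };

/* Detection step: reject a block whose magic or checksum does not match. */
static enum meta_status meta_check(const struct meta_hdr *h,
                                   const uint8_t *payload, size_t len)
{
    if (h->magic != META_MAGIC)
        return META_BAD_MAGIC;
    if (h->csum != meta_csum(payload, len))
        return META_BAD_CSUM;
    return META_OK;
}
```

Any status other than META_OK would be the trigger for rule 3: check the
chunk holding the block and whatever outside-chunk information depends on it.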

-VAL

Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread Valerie Henson

On Wed, Apr 25, 2007 at 03:34:03PM +0400, Nikita Danilov wrote:
> 
> What is more important, design puts (as far as I can see) no upper limit
> on the number of continuation inodes, and hence, even if _average_ fsck
> time is greatly reduced, occasionally it can take more time than ext2 of
> the same size. This is clearly unacceptable in many situations (HA,
> etc.).

Actually, there is an upper limit on the number of continuation
inodes.  Each file can have a maximum of one continuation inode per
chunk. (This is why we need to support sparse files.)

-VAL

Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread Valerie Henson

On Tue, Apr 24, 2007 at 11:34:48PM +0400, Nikita Danilov wrote:
> 
> Maybe I failed to describe the problem precisely.
> 
> Suppose that all chunks have been checked. After that, for every inode
> I0 having continuations I1, I2, ... In, one has to check that every
> logical block is present in at most one of these inodes. For this one
> has to read I0, with all its indirect (double-indirect, triple-indirect)
> blocks, then read I1 with all its indirect blocks, etc. And to repeat
> this for every inode with continuations.
> 
> In the worst case (every inode has a continuation in every chunk) this
> obviously is as bad as un-chunked fsck. But even in the average case,
> total amount of io necessary for this operation is proportional to the
> _total_ file system size, rather than to the chunk size.

Fsck in chunkfs is still going to have an element that is proportional
to the file system size for certain cases.  However, that element will
be a great deal smaller than in a regular file system, except in the
most pathological cases.  If those pathological cases happen often,
then it's back to the drawing board.  My hunch is that they won't be
common.
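
For reference, the cross-chunk check described in the quoted text can be
sketched as below, assuming the logical block ranges mapped by I0, I1, ..., In
have already been collected into one array. The struct and function names are
hypothetical. Collecting the ranges is where the I/O goes; the comparison
itself is cheap.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* One logical block range mapped by an inode or one of its continuations. */
struct lrange {
    uint64_t start;  /* first logical block */
    uint64_t len;    /* number of blocks, assumed > 0 and not overflowing */
};

static int cmp_start(const void *a, const void *b)
{
    const struct lrange *x = a, *y = b;

    return (x->start > y->start) - (x->start < y->start);
}

/*
 * Returns true when no logical block is claimed by more than one range.
 * The array is sorted in place; after sorting, any overlap shows up
 * between two neighbouring entries.
 */
static bool ranges_disjoint(struct lrange *r, size_t n)
{
    qsort(r, n, sizeof(*r), cmp_start);
    for (size_t i = 1; i < n; i++)
        if (r[i - 1].start + r[i - 1].len > r[i].start)
            return false;
    return true;
}
```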

-VAL

Re: [PATCH 00/16] AF_RXRPC socket family and AFS rewrite [try #3]

2007-04-25 Thread David Miller

From: David Howells <[EMAIL PROTECTED]>
Date: Wed, 25 Apr 2007 20:56:47 +0100

> David Miller <[EMAIL PROTECTED]> wrote:
> 
> > Then please generate your patches against my net-2.6.21 GIT
> > tree.  Most of your initial patches in the series (the SKB
> > routine one for example) are already in my tree.
> 
> Do you mean your net-2.6.22 GIT tree?
> 
> Do you want me to make it available as a GIT tree for you to pull?  Or would
> you prefer patches?

Just patches is perfectly fine.

Also, if it's easier to diff against -mm, that works too
since Andrew integrates my net-2.6.22 tree into -mm most
of the time.

Re: [PATCH 00/16] AF_RXRPC socket family and AFS rewrite [try #3]

2007-04-25 Thread David Howells

David Miller <[EMAIL PROTECTED]> wrote:

> Then please generate your patches against my net-2.6.21 GIT
> tree.  Most of your initial patches in the series (the SKB
> routine one for example) are already in my tree.

Do you mean your net-2.6.22 GIT tree?

Do you want me to make it available as a GIT tree for you to pull?  Or would
you prefer patches?

David

Re: [patch] unprivileged mounts update

2007-04-25 Thread Miklos Szeredi

> I'll be dropping all the unprivileged-mounts stuff - it looks like
> it was a bit early, and that a new patch series against 2.6.27-rc1

Yeah, I guess we can wait a few more years ;)   -^^^

Miklos

Re: [PATCH 00/16] AF_RXRPC socket family and AFS rewrite [try #3]

2007-04-25 Thread David Miller

From: David Howells <[EMAIL PROTECTED]>
Date: Wed, 25 Apr 2007 14:38:32 +0100

> I think the idea is for them (or at least some of them) to go
> through one of DaveM's net git trees anyway.

Then please generate your patches against my net-2.6.21 GIT
tree.  Most of your initial patches in the series (the SKB
routine one for example) are already in my tree.

Re: [patch] unprivileged mounts update

2007-04-25 Thread Miklos Szeredi

> Right, I figure if the normal action is to always do
> mnt->user = current->fsuid, then for the special case we
> pass a uid in someplace.  Of course...  do we not have a
> place to do that?  Would it be a no-no to use 'data' for
> a non-fs-specific arg?

I guess it would be OK for bind, but not for new- and remounts, where
'data' is already used.

Maybe it's best to stay with fsuid after all, and live with having to
restore capabilities.  It's not so bad after all, this seems to do the
trick:

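/* needs <sys/capability.h> from libcap and <sys/fsuid.h> */
cap_t cap = cap_get_proc();   /* save the current capability sets */
setfsuid(uid);                /* this is where the fs capabilities get dropped */
cap_set_proc(cap);            /* put the saved sets back */
cap_free(cap);                /* release the copy made by cap_get_proc() */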

Unfortunately these functions are not in libc, but in a separate
"libcap" library.  Ugh.

Miklos

Re: [patch] unprivileged mounts update

2007-04-25 Thread Andrew Morton

On Wed, 25 Apr 2007 17:18:12 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> > From: Miklos Szeredi <[EMAIL PROTECTED]>
> > 
> > - refine adding "nosuid" and "nodev" flags for unprivileged mounts:
> > o add "nosuid", only if mounter doesn't have CAP_SETUID capability
> > o add "nodev", only if mounter doesn't have CAP_MKNOD capability
> > 
> > - allow unprivileged forced unmount, but only for FS_SAFE filesystems
> > 
> > - allow mounting over special files, but not symlinks
> > 
> > - for mounting and umounting check "fsuid" instead of "ruid"
> 
> Andrew, please skip this patch, for now.

I'll be dropping all the unprivileged-mounts stuff - it looks like it
was a bit early, and that a new patch series against 2.6.27-rc1 or thereabouts
would be best.
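
For the record, the nosuid/nodev rule from the quoted patch description can
be sketched as a pure function. The function name and the two boolean inputs
are assumptions for illustration; only MS_NOSUID and MS_NODEV come from
<sys/mount.h>, and this is not the code from the patch.

```c
#include <stdbool.h>
#include <sys/mount.h>

/*
 * Extra flags forced onto an unprivileged mount: nosuid unless the mounter
 * has CAP_SETUID, nodev unless it has CAP_MKNOD. The booleans are assumed
 * to be the result of a capability check done elsewhere.
 */
static unsigned long forced_mount_flags(bool has_cap_setuid, bool has_cap_mknod)
{
    unsigned long flags = 0;

    if (!has_cap_setuid)
        flags |= MS_NOSUID;
    if (!has_cap_mknod)
        flags |= MS_NODEV;
    return flags;
}
```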

Re: [patch] unprivileged mounts update

2007-04-25 Thread Serge E. Hallyn

Quoting Eric W. Biederman ([EMAIL PROTECTED]):
> "Serge E. Hallyn" <[EMAIL PROTECTED]> writes:
> 
> > Quoting Eric W. Biederman ([EMAIL PROTECTED]):
> >> 
> >> Are there other permission checks that mount is doing that we
> >> care about?
> >
> > Not mount itself, but in looking up /share/fa/root/home/fa,
> > user fa doesn't have the rights to read /share, and by setting
> > fsuid to fa and dropping CAP_DAC_READ_SEARCH the mount action fails.
> 
> Got it. 
> 
> I'm not certain this is actually a problem; it may be a feature.
> But it does fly in the face of the general principle of just
> getting out of root's way so things can get done.
> 
> I think we can solve your basic problem by simply doing something like
> chdir(/share); mount(.); to avoid the permission problem.
> 
> The practical question is how much do we care.
> 
> > But the solution you outlined in your previous post would work around
> > this perfectly.
> 
> If we are not using the usual permissions, which user do we use? current->uid?
> Or do we pass that user someplace?

Right, I figure if the normal action is to always do
mnt->user = current->fsuid, then for the special case we
pass a uid in someplace.  Of course...  do we not have a
place to do that?  Would it be a no-no to use 'data' for
a non-fs-specific arg?

> >> > If it were really the equivalent then I could keep my capabilities :)
> >> > after changing it.
> >> 
> >> We drop all capabilities after we change the euid.
> >
> > Not if we've done prctl(PR_SET_KEEPCAPS, 1)
> 
> Ah cap_clear doesn't do the obvious thing.
> 
> Eric

Re: [patch] unprivileged mounts update

2007-04-25 Thread Eric W. Biederman

"Serge E. Hallyn" <[EMAIL PROTECTED]> writes:

> Quoting Eric W. Biederman ([EMAIL PROTECTED]):
>> 
>> Are there other permission checks that mount is doing that we
>> care about?
>
> Not mount itself, but in looking up /share/fa/root/home/fa,
> user fa doesn't have the rights to read /share, and by setting
> fsuid to fa and dropping CAP_DAC_READ_SEARCH the mount action fails.

Got it. 

I'm not certain this is actually a problem; it may be a feature.
But it does fly in the face of the general principle of just
getting out of root's way so things can get done.

I think we can solve your basic problem by simply doing something like
chdir(/share); mount(.); to avoid the permission problem.

The practical question is how much do we care.

> But the solution you outlined in your previous post would work around
> this perfectly.

If we are not using the usual permissions, which user do we use? current->uid?
Or do we pass that user someplace?

>> > If it were really the equivalent then I could keep my capabilities :)
>> > after changing it.
>> 
>> We drop all capabilities after we change the euid.
>
> Not if we've done prctl(PR_SET_KEEPCAPS, 1)

Ah cap_clear doesn't do the obvious thing.

Eric

Re: [patch] unprivileged mounts update

2007-04-25 Thread Serge E. Hallyn

Quoting Eric W. Biederman ([EMAIL PROTECTED]):
> "Serge E. Hallyn" <[EMAIL PROTECTED]> writes:
> 
> > Quoting H. Peter Anvin ([EMAIL PROTECTED]):
> >> Miklos Szeredi wrote:
> >> > 
> >> > Andrew, please skip this patch, for now.
> >> > 
> >> > Serge found a problem with the fsuid approach: setfsuid(nonzero) will
> >> > remove filesystem related capabilities.  So even if root is trying to
> >> > set the "user=UID" flag on a mount, access to the target (and in case
> >> > of bind, the source) is checked with user privileges.
> >> > 
> >> > Root should be able to set this flag on any mountpoint, _regardless_
> >> > of permissions.
> >> > 
> >> 
> >> Right, if you're using fsuid != 0, you're not running as root 
> >
> > Sure, but what I'm not clear on is why, if I've done a
> > prctl(PR_SET_KEEPCAPS, 1) before the setfsuid, I still lose the
> > CAP_FS_MASK perms.  I see the special case handling in
> > cap_task_post_setuid().  I'm sure there was a reason for it, but
> > this is a piece of the capability implementation I don't understand
> > right now.
> 
> So we drop CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_DAC_READ_SEARCH,
> CAP_FOWNER, and CAP_FSETID
> 
> Since we are checking CAP_SETUID or CAP_SYS_ADMIN how is that
> a problem?
> 
> Are there other permission checks that mount is doing that we
> care about?

Not mount itself, but in looking up /share/fa/root/home/fa,
user fa doesn't have the rights to read /share, and by setting
fsuid to fa and dropping CAP_DAC_READ_SEARCH the mount action fails.

But the solution you outlined in your previous post would work around
this perfectly.

> >> (fsuid is
> >> the equivalent to euid for the filesystem.)
> >
> > If it were really the equivalent then I could keep my capabilities :)
> > after changing it.
> 
> We drop all capabilities after we change the euid.

Not if we've done prctl(PR_SET_KEEPCAPS, 1)

> >> I fail to see how ruid should have *any* impact on mount(2).  That seems
> >> to be a design flaw.
> >
> > May be, but just using fsuid at this point stops me from enabling user
> > mounts under /share if /share is chmod 000 (which it is).
> 
> I'm dense today.  If we can't work out the details we can always use a flag.
> But what is the problem with fsuid?

See above.

> You are not trying to test this using a non-default security model are you?

Nope, at the moment CONFIG_SECURITY=n so I'm running with capabilities
only.

thanks,
-serge

Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread Amit Gud


Andreas Dilger wrote:

> > How do you recover if fsfuzzer takes out a cnode in the chain? The
> > chunk is marked clean, but clearly corrupted and needs fixing and
> > you don't know what it was pointing at.  Hence you have a pointer to
> > a trashed cnode *somewhere* that you need to find and fix, and a
> > bunch of orphaned cnodes that nobody points to *somewhere else* in
> > the filesystem that you have to find. That's a full scan fsck case,
> > isn't it?
>
> Presumably, the cnodes in the other chunks contain forward and back
> references.  Those need to contain at minimum inode + generation + chunk
> to avoid the problem of pointing to a _different_ inode after such
> corruption caused the old inode to be deleted and a new one allocated
> in its place.
>
> If the cnode in each chunk is more than just a singly-linked list, the
> file as a whole could survive multiple chunk corruptions, though there
> would now be holes in the file.
>
> > It seems that any sort of damage to the underlying storage (e.g.
> > media error, I/O error or user brain explosion) results in the need
> > to do a full fsck and hence chunkfs gives you no benefit in this
> > case.

Yes, what came out of the discussions on #linuxfs is that redundancy is
required for cnodes, to avoid scanning the entire file system
in search of a dangling cnode reference or the "parent" of a cnode.
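
As a rough sketch of what such a redundant reference might carry, here is the
inode + generation + chunk triple mentioned in the quoted text, with a check
that refuses a slot that has been freed and reused. The struct and function
names are hypothetical and not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reference from one cnode to its neighbour in another chunk. */
struct cnode_ref {
    uint64_t ino;         /* inode number */
    uint32_t generation;  /* changes when the inode number is reused */
    uint32_t chunk;       /* chunk holding the referenced inode */
};

/* What fsck finds at the referenced location. */
struct found_inode {
    uint64_t ino;
    uint32_t generation;
    uint32_t chunk;
    bool     in_use;
};

/*
 * A reference is trusted only when all three fields match a live inode.
 * A different generation means the old inode was deleted and a new one
 * allocated in its place.
 */
static bool ref_is_valid(const struct cnode_ref *ref,
                         const struct found_inode *found)
{
    return found->in_use &&
           found->ino == ref->ino &&
           found->generation == ref->generation &&
           found->chunk == ref->chunk;
}
```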


If corruption, for whatever reason, occurs in any other part of the file
system, it would be localized to that chunk. Even if a full fsck is
needed, which should be rare, a full fsck of a chunked file system is
no worse than fsck of a non-chunked file system. Passes 3, 4, and 5 of
fsck take only 10-15% of the total fsck run time, and the almost I/O-free
Pass 6 for chunkfs wouldn't add a whole lot.



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud


Re: [patch] unprivileged mounts update

2007-04-25 Thread Eric W. Biederman

"Serge E. Hallyn" <[EMAIL PROTECTED]> writes:

> Quoting H. Peter Anvin ([EMAIL PROTECTED]):
>> Miklos Szeredi wrote:
>> > 
>> > Andrew, please skip this patch, for now.
>> > 
>> > Serge found a problem with the fsuid approach: setfsuid(nonzero) will
>> > remove filesystem related capabilities.  So even if root is trying to
>> > set the "user=UID" flag on a mount, access to the target (and in case
>> > of bind, the source) is checked with user privileges.
>> > 
>> > Root should be able to set this flag on any mountpoint, _regardless_
>> > of permissions.
>> > 
>> 
>> Right, if you're using fsuid != 0, you're not running as root 
>
> Sure, but what I'm not clear on is why, if I've done a
> prctl(PR_SET_KEEPCAPS, 1) before the setfsuid, I still lose the
> CAP_FS_MASK perms.  I see the special case handling in
> cap_task_post_setuid().  I'm sure there was a reason for it, but
> this is a piece of the capability implementation I don't understand
> right now.

So we drop CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_DAC_READ_SEARCH,
CAP_FOWNER, and CAP_FSETID

Since we are checking CAP_SETUID or CAP_SYS_ADMIN how is that
a problem?
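
To make the point concrete, here is a small sketch that builds the mask of
the five capabilities named above from the constants in <linux/capability.h>
and clears it from a capability word. The helper names are made up, and the
32-bit word is a simplification of the real capability sets.

```c
#include <linux/capability.h>

/* Bitmask of the five filesystem-related capabilities named in this mail. */
static unsigned int fs_caps_mask(void)
{
    return (1u << CAP_CHOWN) |
           (1u << CAP_DAC_OVERRIDE) |
           (1u << CAP_DAC_READ_SEARCH) |
           (1u << CAP_FOWNER) |
           (1u << CAP_FSETID);
}

/* What is left of an effective set once the fs-related bits are cleared. */
static unsigned int drop_fs_caps(unsigned int effective)
{
    return effective & ~fs_caps_mask();
}
```

CAP_SETUID and CAP_SYS_ADMIN are outside this mask, so a check on either of
them still passes after the drop.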

Are there other permission checks that mount is doing that we
care about?


>> (fsuid is
>> the equivalent to euid for the filesystem.)
>
> If it were really the equivalent then I could keep my capabilities :)
> after changing it.

We drop all capabilities after we change the euid.

>> I fail to see how ruid should have *any* impact on mount(2).  That seems
>> to be a design flaw.
>
> May be, but just using fsuid at this point stops me from enabling user
> mounts under /share if /share is chmod 000 (which it is).

I'm dense today.  If we can't work out the details we can always use a flag.
But what is the problem with fsuid?

You are not trying to test this using a non-default security model are you?


Eric

Re: [patch] unprivileged mounts update

2007-04-25 Thread Serge E. Hallyn

Quoting Eric W. Biederman ([EMAIL PROTECTED]):
> Miklos Szeredi <[EMAIL PROTECTED]> writes:
> 
> >> From: Miklos Szeredi <[EMAIL PROTECTED]>
> >> 
> >> - refine adding "nosuid" and "nodev" flags for unprivileged mounts:
> >> o add "nosuid", only if mounter doesn't have CAP_SETUID capability
> >> o add "nodev", only if mounter doesn't have CAP_MKNOD capability
> >> 
> >> - allow unprivileged forced unmount, but only for FS_SAFE filesystems
> >> 
> >> - allow mounting over special files, but not symlinks
> >> 
> >> - for mounting and umounting check "fsuid" instead of "ruid"
> >
> > Andrew, please skip this patch, for now.
> >
> > Serge found a problem with the fsuid approach: setfsuid(nonzero) will
> > remove filesystem related capabilities.  So even if root is trying to
> > set the "user=UID" flag on a mount, access to the target (and in case
> > of bind, the source) is checked with user privileges.
> 
> I do have a major problem with this patchset though.  We still have
> the unnecessary concept of user mounts.  That seems only needed now
> for the /proc/mounts output.
> 
> All mounts should have an owner.  Prior to the unprivileged mount work
> root owns all mounts.
> 
> > Root should be able to set this flag on any mountpoint, _regardless_
> > of permissions.
> 
> We don't need a flag, and thinking of it in the context of a flag
> is clearly the wrong thing.  Yes, if we have the proper capability we
> should be able to explicitly specify the owner of the mount.
> 
> > It is possible to restore filesystem capabilities after setting fsuid,
> > but the interfaces are rather horrible at all levels.  mount(8) can
> > probably live with these, but I'm not sure that using "fsuid" over
> > "ruid" has enough advantages to force this.
> >
> > Why did we want to use fsuid, exactly?
> 
> - Because ruid is completely the wrong thing; we want mounts owned
>   by the user whose permissions we are using to perform the mount.
> 
> 
> There are two basic cases.
> - Mounting a filesystem as who we are.
>   This can use fsuid with no problems.  If we are suid to root to perform
>   the mount by default we want root to own the mount so that is correct.
> 
> - Mounting a filesystem as another user.
>   This is the tricky, rare case needed in setup.  If we aren't
>   jumping through too many hoops to make it work when using fsuid, it
>   sounds like the right thing here as well.
> 
>   How hard is it to set fsuid to a different value?  I.e., what hoops
>   does root have to jump through?
> 
> Further when using fsuid we don't need an extra flag to mount.
> 
> Plus things are a little more consistent with the rest of the
> linux/unix interface.
> 
> Now I can see doing something like using a special flag and not using
> fsuid for the one case where we explicitly want to mount a filesystem
> as someone else.  However, if only user space has to special-case this
> (as it does anyway) and we don't have to special-case it in the
> kernel, so much the better.

Yes, what you describe (or my reading of it :) would simplify the
implementation, and solve the capability problem.

So in general, when you mount something, the mount is owned by you.

To mount something as you, either the mountpoint's mount is owned by
you, or you have some capability, maybe CAP_SYS_ADMIN.

So, before any non-root user can do a mount, root must mount an ancestor
mount in the name of that user.  This would be a new mount flag, so

mount -o user=some_user /share/$USER/home/$USER /share/$USER/home/$USER

as root.  Mount does not change the fsuid, it simply passes the user=
flag into do_loopback(), which sets the mnt->user flag.  And now, even
though I have /share as chmod 000, root didn't have to setfsuid so we
have the necessary caps.

(clearly, -o user requires CAP_SYS_ADMIN or something)
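
Put as code, the rule I have in mind would be something like the sketch
below. struct mnt_model is a made-up stand-in for the real mount structure,
and 'privileged' stands for whatever capability we settle on, so treat this
as a model of the policy rather than kernel code.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical, stripped-down stand-in for the mount structure. */
struct mnt_model {
    uid_t user;  /* owner of the mount, as in mnt->user */
};

/*
 * May a task with the given fsuid mount on top of 'parent'?
 * Either it owns the parent mount or it holds the capability.
 */
static bool may_mount_on(const struct mnt_model *parent, uid_t fsuid,
                         bool privileged)
{
    return privileged || parent->user == fsuid;
}

/*
 * Owner of the new mount: the caller, unless a privileged caller passed
 * an explicit user= value.
 */
static uid_t new_mount_owner(uid_t fsuid, bool privileged,
                             bool has_user_opt, uid_t user_opt)
{
    return (privileged && has_user_opt) ? user_opt : fsuid;
}
```

So root creates the first mount with user=1000, after which uid 1000 may
mount on it without any capability, while a user= request from an
unprivileged caller is simply ignored.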

-serge

Re: [patch] unprivileged mounts update

2007-04-25 Thread Eric W. Biederman

Miklos Szeredi <[EMAIL PROTECTED]> writes:

>> From: Miklos Szeredi <[EMAIL PROTECTED]>
>> 
>> - refine adding "nosuid" and "nodev" flags for unprivileged mounts:
>> o add "nosuid", only if mounter doesn't have CAP_SETUID capability
>> o add "nodev", only if mounter doesn't have CAP_MKNOD capability
>> 
>> - allow unprivileged forced unmount, but only for FS_SAFE filesystems
>> 
>> - allow mounting over special files, but not symlinks
>> 
>> - for mounting and umounting check "fsuid" instead of "ruid"
>
> Andrew, please skip this patch, for now.
>
> Serge found a problem with the fsuid approach: setfsuid(nonzero) will
> remove filesystem related capabilities.  So even if root is trying to
> set the "user=UID" flag on a mount, access to the target (and in case
> of bind, the source) is checked with user privileges.

I do have a major problem with this patchset though.  We still have
the unnecessary concept of user mounts.  That seems only needed now
for the /proc/mounts output.

All mounts should have an owner.  Prior to the unprivileged mount work
root owns all mounts.

> Root should be able to set this flag on any mountpoint, _regardless_
> of permissions.

We don't need a flag, and thinking of it in the context of a flag
is clearly the wrong thing.  Yes, if we have the proper capability we
should be able to explicitly specify the owner of the mount.

> It is possible to restore filesystem capabilities after setting fsuid,
> but the interfaces are rather horrible at all levels.  mount(8) can
> probably live with these, but I'm not sure that using "fsuid" over
> "ruid" has enough advantages to force this.
>
> Why did we want to use fsuid, exactly?

- Because ruid is completely the wrong thing; we want mounts owned
  by the user whose permissions we are using to perform the mount.

There are two basic cases.
- Mounting a filesystem as who we are.
  This can use fsuid with no problems.  If we are suid to root to perform
  the mount by default we want root to own the mount so that is correct.

- Mounting a filesystem as another user.
  This is the tricky, rare case needed in setup.  If we aren't
  jumping through too many hoops to make it work when using fsuid, it
  sounds like the right thing here as well.

  How hard is it to set fsuid to a different value?  I.e., what hoops
  does root have to jump through?

Further when using fsuid we don't need an extra flag to mount.

Plus things are a little more consistent with the rest of the
linux/unix interface.

Now I can see doing something like using a special flag and not using
fsuid for the one case where we explicitly want to mount a filesystem
as someone else.  However, if only user space has to special-case this
(as it does anyway) and we don't have to special-case it in the
kernel, so much the better.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] unprivileged mounts update

2007-04-25 Thread Serge E. Hallyn

Quoting H. Peter Anvin ([EMAIL PROTECTED]):
> Miklos Szeredi wrote:
> > 
> > Andrew, please skip this patch, for now.
> > 
> > Serge found a problem with the fsuid approach: setfsuid(nonzero) will
> > remove filesystem related capabilities.  So even if root is trying to
> > set the "user=UID" flag on a mount, access to the target (and in case
> > of bind, the source) is checked with user privileges.
> > 
> > Root should be able to set this flag on any mountpoint, _regardless_
> > of permissions.
> > 
> 
> Right, if you're using fsuid != 0, you're not running as root 

Sure, but what I'm not clear on is why, if I've done a
prctl(PR_SET_KEEPCAPS, 1) before the setfsuid, I still lose the
CAP_FS_MASK perms.  I see the special case handling in
cap_task_post_setuid().  I'm sure there was a reason for it, but
this is a piece of the capability implementation I don't understand
right now.

I would send in a patch to make it honor current->keep_capabilities,
but I have a feeling there was a good reason not to do so in the
first place.
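
A toy user-space model of the behaviour described above may help.  It is a
sketch, not kernel code: struct model_task, model_setfsuid() and
MODEL_CAP_FS_MASK are invented names, and treating the filesystem mask as
capability bits 0-4 is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed filesystem-related capability bits (bits 0..4). */
#define MODEL_CAP_FS_MASK 0x1fu

struct model_task {
	uint32_t fsuid;
	uint32_t cap_permitted;
	uint32_t cap_effective;
	int keep_capabilities;	/* deliberately NOT consulted below */
};

/* Model of the fsuid fixup: leaving fsuid 0 clears the filesystem
 * capabilities from the effective set; returning to fsuid 0 restores
 * whichever of them are still in the permitted set. */
void model_setfsuid(struct model_task *t, uint32_t new_fsuid)
{
	uint32_t old_fsuid;

	assert(t != NULL);
	old_fsuid = t->fsuid;
	t->fsuid = new_fsuid;

	if (old_fsuid == 0 && new_fsuid != 0)
		t->cap_effective &= ~MODEL_CAP_FS_MASK;
	if (old_fsuid != 0 && new_fsuid == 0)
		t->cap_effective |= t->cap_permitted & MODEL_CAP_FS_MASK;
}
```

In this model the keep_capabilities field makes no difference to the
outcome, which is the behaviour being questioned here.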

> (fsuid is
> the equivalent to euid for the filesystem.)

If it were really the equivalent then I could keep my capabilities :)
after changing it.

> I fail to see how ruid should have *any* impact on mount(2).  That seems
> to be a design flaw.

Maybe, but just using fsuid at this point stops me from enabling user
mounts under /share if /share is chmod 000 (which it is).

thanks,
-serge

Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread David Lang

On Wed, 25 Apr 2007, Nikita Danilov wrote:

> David Lang writes:
> > On Tue, 24 Apr 2007, Nikita Danilov wrote:
> >
> > > David Lang writes:
> > > > On Tue, 24 Apr 2007, Nikita Danilov wrote:
> > > >
> > > > > Amit Gud writes:
> > > > >
> > > > > Hello,
> > > > >
> > > > > >
> > > > > > This is an initial implementation of ChunkFS technique, briefly
> > > > > > discussed at: http://lwn.net/Articles/190222 and
> > > > > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
> > > > >
> > > > > I have a couple of questions about chunkfs repair process.
> > > > >
> > > > > First, as I understand it, each continuation inode is a sparse file,
> > > > > mapping some subset of logical file blocks into block numbers. Then it
> > > > > seems, that during "final phase" fsck has to check that these partial
> > > > > mappings are consistent, for example, that no two different continuation
> > > > > inodes for a given file contain a block number for the same offset. This
> > > > > check requires scan of all chunks (rather than of only "active during
> > > > > crash"), which seems to return us back to the scalability problem
> > > > > chunkfs tries to address.
> > > >
> > > > not quite.
> > > >
> > > > this checking is a O(n^2) or worse problem, and it can eat a lot of
> > > > memory in the process. with chunkfs you divide the problem by a large
> > > > constant (100 or more) for the checks of individual chunks. after those
> > > > are done then the final pass checking the cross-chunk links doesn't have
> > > > to keep track of everything, it only needs to check those links and what
> > > > they point to
> > >
> > > Maybe I failed to describe the problem precisely.
> > >
> > > Suppose that all chunks have been checked. After that, for every inode
> > > I0 having continuations I1, I2, ... In, one has to check that every
> > > logical block is presented in at most one of these inodes. For this one
> > > has to read I0, with all its indirect (double-indirect, triple-indirect)
> > > blocks, then read I1 with all its indirect blocks, etc. And to repeat
> > > this for every inode with continuations.
> > >
> > > In the worst case (every inode has a continuation in every chunk) this
> > > obviously is as bad as un-chunked fsck. But even in the average case,
> > > total amount of io necessary for this operation is proportional to the
> > > _total_ file system size, rather than to the chunk size.
> >
> > actually, it should be proportional to the number of continuation nodes. The
> > expectation (and design) is that they are rare.
>
> Indeed, but total size of meta-data pertaining to all continuation
> inodes is still proportional to the total file system size, and so is
> fsck time: O(total_file_system_size).

correct, but remember that in the real world O(total_file_system_size) does not 
mean that it can't work well. it just means that larger filesystems will take 
longer to check.

they aren't out to eliminate the need for fsck, just to be able to divide the 
time it currently takes by a large value so that as the filesystems continue to 
get larger it is still reasonable to check them

> What is more important, design puts (as far as I can see) no upper limit
> on the number of continuation inodes, and hence, even if _average_ fsck
> time is greatly reduced, occasionally it can take more time than ext2 of
> the same size. This is clearly unacceptable in many situations (HA,
> etc.).

in a pathological situation you are correct, it would take longer. however 
before declaring that this is completely unacceptable why don't you wait and see 
if the pathological situation is at all likely?

when you are doing ha with shared storage you tend to be doing things like 
databases, every database that I know about splits its data files into many 
pieces of a fixed size. Postgres for example does 1M files. if you do a chunk 
size of 1G it's very unlikely that more than a couple files out of every thousand 
will end up with continuation nodes.

remember that the current thinking on chunk size is to make the chunks be ~1% of 
your filesystem, so on a 1TB filesystem your chunk size would be 10G (which, in 
the example above would mean just a couple files out of every ten thousand would 
have continuation nodes).
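
to put rough numbers on that, here is a back-of-the-envelope sketch. it
assumes equal-sized files laid out back to back from some starting offset,
which is a simplification and not how any real allocator (or chunkfs)
places files; straddling_files() and the MiB/GiB macros are names made up
for the example.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (UINT64_C(1) << 20)
#define GiB (UINT64_C(1) << 30)

/* Count the files that cross a chunk boundary (and so would need a
 * continuation inode) when nfiles files of file_size bytes each are
 * placed contiguously, beginning at byte offset start. */
uint64_t straddling_files(uint64_t nfiles, uint64_t file_size,
			  uint64_t chunk_size, uint64_t start)
{
	uint64_t i, count = 0;

	assert(file_size > 0 && chunk_size > 0);
	for (i = 0; i < nfiles; i++) {
		uint64_t first = start + i * file_size;	/* first byte */
		uint64_t last = first + file_size - 1;	/* last byte  */

		if (first / chunk_size != last / chunk_size)
			count++;
	}
	return count;
}
```

under those assumptions, 10240 files of 1M each starting half a megabyte in
give 10 straddling files with 1G chunks and 1 with 10G chunks, i.e. roughly
one per thousand and one per ten thousand.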

with the current filesystems it's _possible_ for a file to be spread out across 
the disk such that its first block is at the beginning of the disk, the second 
at the end of the disk, the third back at the beginning, the fourth at the end, 
etc. but users don't worry about this when using the filesystems because the 
odds of this happening under normal use are vanishingly small (and the 
filesystem designers work to make the odds this small). similarly the chunkfs 
designers are working to make the odds of every file having a continuation node 
vanishingly small as well.

David Lang

Re: [patch] unprivileged mounts update

2007-04-25 Thread H. Peter Anvin

Miklos Szeredi wrote:
> 
> Andrew, please skip this patch, for now.
> 
> Serge found a problem with the fsuid approach: setfsuid(nonzero) will
> remove filesystem related capabilities.  So even if root is trying to
> set the "user=UID" flag on a mount, access to the target (and in case
> of bind, the source) is checked with user privileges.
> 
> Root should be able to set this flag on any mountpoint, _regardless_
> of permissions.
> 

Right, if you're using fsuid != 0, you're not running as root (fsuid is
the equivalent to euid for the filesystem.)

I fail to see how ruid should have *any* impact on mount(2).  That seems
to be a design flaw.

-hpa

[PATCH 01/16] AF_RXRPC: Move generic skbuff stuff from XFRM code to generic code [try #4]

2007-04-25 Thread David Howells

Move generic skbuff stuff from XFRM code to generic code so that AF_RXRPC can
use it too.

The kdoc comments I've attached to the functions need to be checked by whoever
wrote the functions, as I had to make some guesses about how they work.
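
For anyone reviewing those comments, the walk that skb_to_sgvec() performs
over the head and page fragments can be sketched in user space roughly as
follows.  This is an illustration only: struct seg, struct sg_entry and
segs_to_sgvec() are invented names, the buffer is modelled as a flat array
of segments, and the frag_list recursion and BUG_TRAP() checks are left out.

```c
#include <assert.h>
#include <stddef.h>

struct seg {			/* one contiguous piece of the buffer */
	const char *base;
	int size;
};

struct sg_entry {		/* one scatter-gather element */
	const char *addr;
	int length;
};

/* Fill sg[] with pointers covering bytes [offset, offset + len) of the
 * logical buffer formed by concatenating segs[].  Returns the number of
 * sg entries used.  The caller must provide at least nsegs entries. */
int segs_to_sgvec(const struct seg *segs, int nsegs,
		  struct sg_entry *sg, int offset, int len)
{
	int i, start = 0, elt = 0;

	assert(offset >= 0 && len >= 0);
	for (i = 0; i < nsegs && len > 0; i++) {
		int end = start + segs[i].size;	/* logical end of segment */
		int copy = end - offset;	/* bytes usable from it   */

		if (copy > 0) {
			if (copy > len)
				copy = len;
			sg[elt].addr = segs[i].base + (offset - start);
			sg[elt].length = copy;
			elt++;
			len -= copy;
			offset += copy;
		}
		start = end;
	}
	return elt;
}
```

Segments that end before the requested offset are skipped, and the last
entry is trimmed to the remaining length.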

Signed-Off-By: David Howells <[EMAIL PROTECTED]>
---

 include/linux/skbuff.h |6 ++
 include/net/esp.h  |2 -
 net/core/skbuff.c  |  188 
 net/xfrm/xfrm_algo.c   |  169 ---
 4 files changed, 194 insertions(+), 171 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5992f65..c905d42 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -83,6 +83,7 @@
  */
 
 struct net_device;
+struct scatterlist;
 
 #ifdef CONFIG_NETFILTER
 struct nf_conntrack {
@@ -361,6 +362,11 @@ extern struct sk_buff *skb_realloc_headroom(struct sk_buff *skb,
 extern struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
   int newheadroom, int newtailroom,
   gfp_t priority);
+extern int skb_to_sgvec(struct sk_buff *skb,
+   struct scatterlist *sg, int offset,
+   int len);
+extern int skb_cow_data(struct sk_buff *skb, int tailbits,
+   struct sk_buff **trailer);
 extern int skb_pad(struct sk_buff *skb, int pad);
 #define dev_kfree_skb(a)   kfree_skb(a)
 extern void  skb_over_panic(struct sk_buff *skb, int len,
diff --git a/include/net/esp.h b/include/net/esp.h
index 713d039..d05d8d2 100644
--- a/include/net/esp.h
+++ b/include/net/esp.h
@@ -40,8 +40,6 @@ struct esp_data
} auth;
 };
 
-extern int skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len);
-extern int skb_cow_data(struct sk_buff *skb, int tailbits, struct sk_buff **trailer);
 extern void *pskb_put(struct sk_buff *skb, struct sk_buff *tail, int len);
 
 static inline int esp_mac_digest(struct esp_data *esp, struct sk_buff *skb,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 336958f..aa02bd4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -55,6 +55,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -2005,6 +2006,190 @@ void __init skb_init(void)
NULL, NULL);
 }
 
+/**
+ * skb_to_sgvec - Fill a scatter-gather list from a socket buffer
+ * @skb: Socket buffer containing the buffers to be mapped
+ * @sg: The scatter-gather list to map into
+ * @offset: The offset into the buffer's contents to start mapping
+ * @len: Length of buffer space to be mapped
+ *
+ * Fill the specified scatter-gather list with mappings/pointers into a
+ * region of the buffer space attached to a socket buffer.
+ */
+int
+skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len)
+{
+   int start = skb_headlen(skb);
+   int i, copy = start - offset;
+   int elt = 0;
+
+   if (copy > 0) {
+   if (copy > len)
+   copy = len;
+   sg[elt].page = virt_to_page(skb->data + offset);
+   sg[elt].offset = (unsigned long)(skb->data + offset) % PAGE_SIZE;
+   sg[elt].length = copy;
+   elt++;
+   if ((len -= copy) == 0)
+   return elt;
+   offset += copy;
+   }
+
+   for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+   int end;
+
+   BUG_TRAP(start <= offset + len);
+
+   end = start + skb_shinfo(skb)->frags[i].size;
+   if ((copy = end - offset) > 0) {
+   skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+
+   if (copy > len)
+   copy = len;
+   sg[elt].page = frag->page;
+   sg[elt].offset = frag->page_offset+offset-start;
+   sg[elt].length = copy;
+   elt++;
+   if (!(len -= copy))
+   return elt;
+   offset += copy;
+   }
+   start = end;
+   }
+
+   if (skb_shinfo(skb)->frag_list) {
+   struct sk_buff *list = skb_shinfo(skb)->frag_list;
+
+   for (; list; list = list->next) {
+   int end;
+
+   BUG_TRAP(start <= offset + len);
+
+   end = start + list->len;
+   if ((copy = end - offset) > 0) {
+   if (copy > len)
+   copy = len;
+   elt += skb_to_sgvec(list, sg+elt, offset - start, copy);
+   if ((len -= copy) == 0)
+   return elt;
+

[PATCH 04/16] AF_RXRPC: Make it possible to merely try to cancel timers from a module [try #4]

2007-04-25 Thread David Howells

Export try_to_del_timer_sync() for use by the AF_RXRPC module.

Signed-Off-By: David Howells <[EMAIL PROTECTED]>
---

 kernel/timer.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/timer.c b/kernel/timer.c
index dd6c2c1..b22bd39 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -505,6 +505,8 @@ out:
return ret;
 }
 
+EXPORT_SYMBOL(try_to_del_timer_sync);
+
 /**
  * del_timer_sync - deactivate a timer and wait for the handler to finish.
  * @timer: the timer to be deactivated


[PATCH 14/16] AFS: Add support for the CB.GetCapabilities operation [try #4]

2007-04-25 Thread David Howells

Add support for the CB.GetCapabilities operation with which the fileserver can
ask the client for the following information:

 (1) The list of network interfaces it has available as IPv4 address + netmask
 plus the MTUs.

 (2) The client's UUID.

 (3) The extended capabilities of the client, for which the only current one
 is unified error mapping (abort code interpretation).

To support this, the patch adds the following routines to AFS:

 (1) A function to iterate through all the network interfaces using RTNETLINK
 to extract IPv4 addresses and MTUs.

 (2) A function to iterate through all the network interfaces using RTNETLINK
 to pull out the MAC address of the lowest index interface to use in UUID
 construction.
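
As an illustration of the interface part of the reply, the marshalling
amounts to writing big-endian 32-bit words.  The following user-space sketch
is not the code from this patch: the model_* names are invented, addresses
are taken in host byte order for simplicity, and the UUID and capability
words are omitted.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_MAX_IFS 32

struct model_iface {		/* all fields in host byte order */
	uint32_t addr;
	uint32_t netmask;
	uint32_t mtu;
};

struct model_ifaddr_reply {	/* all fields in network byte order */
	uint32_t nifs;
	uint32_t ifaddr[MODEL_MAX_IFS];
	uint32_t netmask[MODEL_MAX_IFS];
	uint32_t mtu[MODEL_MAX_IFS];
};

/* Pack up to MODEL_MAX_IFS interfaces into the reply.  A negative count
 * (lookup failure) is reported as zero interfaces.  Returns the number of
 * interfaces actually packed. */
int model_pack_interfaces(struct model_ifaddr_reply *r,
			  const struct model_iface *ifs, int nifs)
{
	int i;

	assert(r != NULL);
	if (nifs < 0 || ifs == NULL)
		nifs = 0;
	if (nifs > MODEL_MAX_IFS)
		nifs = MODEL_MAX_IFS;

	memset(r, 0, sizeof(*r));
	r->nifs = htonl((uint32_t)nifs);
	for (i = 0; i < nifs; i++) {
		r->ifaddr[i] = htonl(ifs[i].addr);
		r->netmask[i] = htonl(ifs[i].netmask);
		r->mtu[i] = htonl(ifs[i].mtu);
	}
	return nifs;
}
```

Unused slots stay zeroed, so the reply has a fixed size regardless of how
many interfaces were found.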

Signed-Off-By: David Howells <[EMAIL PROTECTED]>
---

 fs/afs/Makefile|1 
 fs/afs/afs_cm.h|3 
 fs/afs/cmservice.c |   98 ++
 fs/afs/internal.h  |   42 
 fs/afs/main.c  |   49 +
 fs/afs/rxrpc.c |   39 
 fs/afs/use-rtnetlink.c |  473 
 7 files changed, 705 insertions(+), 0 deletions(-)

diff --git a/fs/afs/Makefile b/fs/afs/Makefile
index cca198b..01545eb 100644
--- a/fs/afs/Makefile
+++ b/fs/afs/Makefile
@@ -18,6 +18,7 @@ kafs-objs := \
security.o \
server.o \
super.o \
+   use-rtnetlink.o \
vlclient.o \
vlocation.o \
vnode.o \
diff --git a/fs/afs/afs_cm.h b/fs/afs/afs_cm.h
index 7c8e3d4..d4bd201 100644
--- a/fs/afs/afs_cm.h
+++ b/fs/afs/afs_cm.h
@@ -23,6 +23,9 @@ enum AFS_CM_Operations {
CBGetCE = 208,  /* get cache file description */
CBGetXStatsVersion  = 209,  /* get version of extended statistics */
CBGetXStats = 210,  /* get contents of extended statistics data */
+   CBGetCapabilities   = 65538, /* get client capabilities */
 };
 
+#define AFS_CAP_ERROR_TRANSLATION  0x1
+
 #endif /* AFS_FS_H */
diff --git a/fs/afs/cmservice.c b/fs/afs/cmservice.c
index 7e184bb..f8ad36b 100644
--- a/fs/afs/cmservice.c
+++ b/fs/afs/cmservice.c
@@ -22,6 +22,8 @@ static int afs_deliver_cb_init_call_back_state(struct afs_call *,
   struct sk_buff *, bool);
 static int afs_deliver_cb_probe(struct afs_call *, struct sk_buff *, bool);
 static int afs_deliver_cb_callback(struct afs_call *, struct sk_buff *, bool);
+static int afs_deliver_cb_get_capabilities(struct afs_call *, struct sk_buff *,
+  bool);
 static void afs_cm_destructor(struct afs_call *);
 
 /*
@@ -55,6 +57,16 @@ static const struct afs_call_type afs_SRXCBProbe = {
 };
 
 /*
+ * CB.GetCapabilities operation type
+ */
+static const struct afs_call_type afs_SRXCBGetCapabilites = {
+   .name   = "CB.GetCapabilities",
+   .deliver= afs_deliver_cb_get_capabilities,
+   .abort_to_error = afs_abort_to_error,
+   .destructor = afs_cm_destructor,
+};
+
+/*
  * route an incoming cache manager call
  * - return T if supported, F if not
  */
@@ -74,6 +86,9 @@ bool afs_cm_incoming_call(struct afs_call *call)
case CBProbe:
call->type = &afs_SRXCBProbe;
return true;
+   case CBGetCapabilities:
+   call->type = &afs_SRXCBGetCapabilites;
+   return true;
default:
return false;
}
@@ -328,3 +343,86 @@ static int afs_deliver_cb_probe(struct afs_call *call, struct sk_buff *skb,
schedule_work(&call->work);
return 0;
 }
+
+/*
+ * allow the fileserver to ask about the cache manager's capabilities
+ */
+static void SRXAFSCB_GetCapabilities(struct work_struct *work)
+{
+   struct afs_interface *ifs;
+   struct afs_call *call = container_of(work, struct afs_call, work);
+   int loop, nifs;
+
+   struct {
+   struct /* InterfaceAddr */ {
+   __be32 nifs;
+   __be32 uuid[11];
+   __be32 ifaddr[32];
+   __be32 netmask[32];
+   __be32 mtu[32];
+   } ia;
+   struct /* Capabilities */ {
+   __be32 capcount;
+   __be32 caps[1];
+   } cap;
+   } reply;
+
+   _enter("");
+
+   nifs = 0;
+   ifs = kcalloc(32, sizeof(*ifs), GFP_KERNEL);
+   if (ifs) {
+   nifs = afs_get_ipv4_interfaces(ifs, 32, false);
+   if (nifs < 0) {
+   kfree(ifs);
+   ifs = NULL;
+   nifs = 0;
+   }
+   }
+
+   memset(&reply, 0, sizeof(reply));
+   reply.ia.nifs = htonl(nifs);
+
+   reply.ia.uuid[0] = htonl(afs_uuid.time_low);
+   reply.ia.uuid[1] = htonl(afs_uuid.time_mid);
+   reply.ia.uuid[2] = htonl(afs_uuid.time_hi_and_version);
+   reply.ia.uuid[3] = htonl((s8) afs_uuid.clock_se

[PATCH 12/16] AFS: Update the AFS fs documentation [try #4]

2007-04-25 Thread David Howells

Update the AFS fs documentation.

Signed-Off-By: David Howells <[EMAIL PROTECTED]>
---

 Documentation/filesystems/afs.txt |  214 +++--
 1 files changed, 154 insertions(+), 60 deletions(-)

diff --git a/Documentation/filesystems/afs.txt b/Documentation/filesystems/afs.txt
index 2f4237d..12ad6c7 100644
--- a/Documentation/filesystems/afs.txt
+++ b/Documentation/filesystems/afs.txt
@@ -1,31 +1,82 @@
+====================
 kAFS: AFS FILESYSTEM
 ====================
 
-ABOUT
-=====
+Contents:
+
+ - Overview.
+ - Usage.
+ - Mountpoints.
+ - Proc filesystem.
+ - The cell database.
+ - Security.
+ - Examples.
+
+
+========
+OVERVIEW
+========
 
-This filesystem provides a fairly simple AFS filesystem driver. It is under
-development and only provides very basic facilities. It does not yet support
-the following AFS features:
+This filesystem provides a fairly simple secure AFS filesystem driver. It is
+under development and does not yet provide the full feature set.  The features
+it does support include:
 
-   (*) Write support.
-   (*) Communications security.
-   (*) Local caching.
-   (*) pioctl() system call.
-   (*) Automatic mounting of embedded mountpoints.
+ (*) Security (currently only AFS kaserver and KerberosIV tickets).
 
+ (*) File reading.
 
+ (*) Automounting.
+
+It does not yet support the following AFS features:
+
+ (*) Write support.
+
+ (*) Local caching.
+
+ (*) pioctl() system call.
+
+
+===========
+COMPILATION
+===========
+
+The filesystem should be enabled by turning on the kernel configuration
+options:
+
+   CONFIG_AF_RXRPC - The RxRPC protocol transport
+   CONFIG_RXKAD- The RxRPC Kerberos security handler
+   CONFIG_AFS  - The AFS filesystem
+
+Additionally, the following can be turned on to aid debugging:
+
+   CONFIG_AF_RXRPC_DEBUG   - Permit AF_RXRPC debugging to be enabled
+   CONFIG_AFS_DEBUG- Permit AFS debugging to be enabled
+
+They permit the debugging messages to be turned on dynamically by manipulating
+the masks in the following files:
+
+   /sys/module/af_rxrpc/parameters/debug
+   /sys/module/afs/parameters/debug
+
+
+=====
 USAGE
 =====
 
 When inserting the driver modules the root cell must be specified along with a
 list of volume location server IP addresses:
 
-   insmod rxrpc.o
+   insmod af_rxrpc.o
+   insmod rxkad.o
insmod kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
 
-The first module is a driver for the RxRPC remote operation protocol, and the
-second is the actual filesystem driver for the AFS filesystem.
+The first module is the AF_RXRPC network protocol driver.  This provides the
+RxRPC remote operation protocol and may also be accessed from userspace.  See:
+
+   Documentation/networking/rxrpc.txt
+
+The second module is the kerberos RxRPC security driver, and the third module
+is the actual filesystem driver for the AFS filesystem.
 
 Once the module has been loaded, more modules can be added by the following
 procedure:
@@ -33,7 +84,7 @@ procedure:
echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells
 
 Where the parameters to the "add" command are the name of a cell and a list of
-volume location servers within that cell.
+volume location servers within that cell, with the latter separated by colons.
 
 Filesystems can be mounted anywhere by commands similar to the following:
 
@@ -42,11 +93,6 @@ Filesystems can be mounted anywhere by commands similar to the following:
mount -t afs "#root.afs." /afs
mount -t afs "#root.cell." /afs/cambridge
 
-  NB: When using this on Linux 2.4, the mount command has to be different,
-  since the filesystem doesn't have access to the device name argument:
-
-   mount -t afs none /afs -ovol="#root.afs."
-
 Where the initial character is either a hash or a percent symbol depending on
 whether you definitely want a R/W volume (hash) or whether you'd prefer a R/O
 volume, but are willing to use a R/W volume instead (percent).
@@ -60,55 +106,66 @@ named volume will be looked up in the cell specified during insmod.
 Additional cells can be added through /proc (see later section).
 
 
+===========
 MOUNTPOINTS
 ===========
 
-AFS has a concept of mountpoints. These are specially formatted symbolic links
-(of the same form as the "device name" passed to mount). kAFS presents these
-to the user as directories that have special properties:
+AFS has a concept of mountpoints. In AFS terms, these are specially formatted
+symbolic links (of the same form as the "device name" passed to mount).  kAFS
+presents these to the user as directories that have a follow-link capability
+(ie: symbolic link semantics).  If anyone attempts to access them, they will
+automatically cause the target volume to be mounted (if possible) on that site.
 
-  (*

[PATCH 03/16] AF_RXRPC: Key facility changes for AF_RXRPC [try #4]

2007-04-25 Thread David Howells

Export the keyring key type definition and document its availability.

Add alternative types into the key's type_data union to make it more useful.
Not all users necessarily want to use it as a list_head (AF_RXRPC doesn't, for
example), so make it clear that it can be used in other ways.
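
As a rough illustration of the alternative views, here is a user-space
sketch.  It is not kernel code: struct list_head is redeclared locally, and
union model_type_data and model_set_ptrs() are invented names.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Same storage, three ways of looking at it. */
union model_type_data {
	struct list_head link;	/* the existing list linkage view */
	unsigned long x[2];	/* two integers */
	void *p[2];		/* two opaque pointers */
};

/* A key type that does not need list linkage could stash two private
 * pointers in the space instead. */
void model_set_ptrs(union model_type_data *td, void *a, void *b)
{
	assert(td != NULL);
	td->p[0] = a;
	td->p[1] = b;
}
```

On the usual Linux ABIs, where unsigned long and pointers have the same
size, the extra members do not make the union any larger than the
list_head alone.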

Signed-Off-By: David Howells <[EMAIL PROTECTED]>
---

 Documentation/keys.txt  |   12 
 include/linux/key.h |2 ++
 security/keys/keyring.c |2 ++
 3 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/Documentation/keys.txt b/Documentation/keys.txt
index 60c665d..81d9aa0 100644
--- a/Documentation/keys.txt
+++ b/Documentation/keys.txt
@@ -859,6 +859,18 @@ payload contents" for more information.
void unregister_key_type(struct key_type *type);
 
 
+Under some circumstances, it may be desirable to deal with a
+bundle of keys.  The facility provides access to the keyring type for managing
+such a bundle:
+
+   struct key_type key_type_keyring;
+
+This can be used with a function such as request_key() to find a specific
+keyring in a process's keyrings.  A keyring thus found can then be searched
+with keyring_search().  Note that it is not possible to use request_key() to
+search a specific keyring, so using keyrings in this way is of limited utility.
+
+
 ===================================
 NOTES ON ACCESSING PAYLOAD CONTENTS
 ===================================
diff --git a/include/linux/key.h b/include/linux/key.h
index 169f05e..a9220e7 100644
--- a/include/linux/key.h
+++ b/include/linux/key.h
@@ -160,6 +160,8 @@ struct key {
 */
union {
struct list_headlink;
+   unsigned long   x[2];
+   void*p[2];
} type_data;
 
/* key data
diff --git a/security/keys/keyring.c b/security/keys/keyring.c
index ad45ce7..88292e3 100644
--- a/security/keys/keyring.c
+++ b/security/keys/keyring.c
@@ -66,6 +66,8 @@ struct key_type key_type_keyring = {
.read   = keyring_read,
 };
 
+EXPORT_SYMBOL(key_type_keyring);
+
 /*
  * semaphore to serialise link/link calls to prevent two link calls in parallel
  * introducing a cycle


[PATCH 13/16] commit ad495d7b6cfcd1bc2eaf06c42699be0bb5d84234 [try #4]

2007-04-25 Thread David Howells

[NETLINK]: Mirror UDP MSG_TRUNC semantics.

If the user passes MSG_TRUNC in via msg_flags, return
the full packet size not the truncated size.
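
The UDP behaviour being mirrored can be seen from user space with a small
test.  This is a sketch under stated assumptions: it relies on Linux
semantics for MSG_TRUNC on UDP sockets and on a working loopback interface,
and recv_with_msg_trunc() is a name made up for the example.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Send one UDP datagram of payload_len bytes over loopback and receive it
 * into a buf_len-byte buffer with MSG_TRUNC set.  Returns what recv()
 * returned, or -1 if the sockets could not be set up. */
long recv_with_msg_trunc(size_t payload_len, size_t buf_len)
{
	char payload[2048], buf[2048];
	struct sockaddr_in addr;
	socklen_t alen = sizeof(addr);
	struct timeval tv = { 2, 0 };	/* do not block forever */
	long ret = -1;
	int rx, tx;

	assert(payload_len <= sizeof(payload) && buf_len <= sizeof(buf));

	rx = socket(AF_INET, SOCK_DGRAM, 0);
	tx = socket(AF_INET, SOCK_DGRAM, 0);
	if (rx < 0 || tx < 0)
		goto out;
	setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

	/* bind the receiver to an ephemeral loopback port */
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;
	if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    getsockname(rx, (struct sockaddr *)&addr, &alen) < 0)
		goto out;

	memset(payload, 'x', payload_len);
	if (sendto(tx, payload, payload_len, 0,
		   (struct sockaddr *)&addr, sizeof(addr)) < 0)
		goto out;

	/* with MSG_TRUNC the return value is the full datagram length */
	ret = recv(rx, buf, buf_len, MSG_TRUNC);
out:
	if (rx >= 0)
		close(rx);
	if (tx >= 0)
		close(tx);
	return ret;
}
```

A caller can compare the return value against its buffer size to detect
that the message was cut short.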

Idea from Herbert Xu and Thomas Graf.

Signed-off-by: David S. Miller <[EMAIL PROTECTED]>
---

 net/netlink/af_netlink.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index c48b0f4..5890210 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1242,6 +1242,9 @@ static int netlink_recvmsg(struct kiocb *kiocb, struct socket *sock,
 
scm_recv(sock, msg, siocb->scm, flags);
 
+   if (flags & MSG_TRUNC)
+   copied = skb->len;
+
 out:
netlink_rcv_wake(sk);
return err ? : copied;


[PATCH 15/16] AFS: Implement the CB.InitCallBackState3 operation [try #4]

2007-04-25 Thread David Howells

Implement the CB.InitCallBackState3 operation for the fileserver to call.
This reduces the amount of network traffic because if this op is aborted, the
fileserver will then attempt a CB.InitCallBackState operation.

Signed-Off-By: David Howells <[EMAIL PROTECTED]>
---

 fs/afs/afs_cm.h|1 +
 fs/afs/cmservice.c |   46 ++
 2 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/fs/afs/afs_cm.h b/fs/afs/afs_cm.h
index d4bd201..7b4d4fa 100644
--- a/fs/afs/afs_cm.h
+++ b/fs/afs/afs_cm.h
@@ -23,6 +23,7 @@ enum AFS_CM_Operations {
CBGetCE = 208,  /* get cache file description */
CBGetXStatsVersion  = 209,  /* get version of extended statistics */
CBGetXStats = 210,  /* get contents of extended statistics data */
+   CBInitCallBackState3= 213,  /* initialise callback state, version 3 */
CBGetCapabilities   = 65538, /* get client capabilities */
 };
 
diff --git a/fs/afs/cmservice.c b/fs/afs/cmservice.c
index f8ad36b..32deb04 100644
--- a/fs/afs/cmservice.c
+++ b/fs/afs/cmservice.c
@@ -20,6 +20,8 @@ struct workqueue_struct *afs_cm_workqueue;
 
 static int afs_deliver_cb_init_call_back_state(struct afs_call *,
   struct sk_buff *, bool);
+static int afs_deliver_cb_init_call_back_state3(struct afs_call *,
+   struct sk_buff *, bool);
 static int afs_deliver_cb_probe(struct afs_call *, struct sk_buff *, bool);
 static int afs_deliver_cb_callback(struct afs_call *, struct sk_buff *, bool);
 static int afs_deliver_cb_get_capabilities(struct afs_call *, struct sk_buff *,
@@ -47,6 +49,16 @@ static const struct afs_call_type afs_SRXCBInitCallBackState = {
 };
 
 /*
+ * CB.InitCallBackState3 operation type
+ */
+static const struct afs_call_type afs_SRXCBInitCallBackState3 = {
+   .name   = "CB.InitCallBackState3",
+   .deliver= afs_deliver_cb_init_call_back_state3,
+   .abort_to_error = afs_abort_to_error,
+   .destructor = afs_cm_destructor,
+};
+
+/*
  * CB.Probe operation type
  */
 static const struct afs_call_type afs_SRXCBProbe = {
@@ -83,6 +95,9 @@ bool afs_cm_incoming_call(struct afs_call *call)
case CBInitCallBackState:
call->type = &afs_SRXCBInitCallBackState;
return true;
+   case CBInitCallBackState3:
+   call->type = &afs_SRXCBInitCallBackState3;
+   return true;
case CBProbe:
call->type = &afs_SRXCBProbe;
return true;
@@ -312,6 +327,37 @@ static int afs_deliver_cb_init_call_back_state(struct afs_call *call,
 }
 
 /*
+ * deliver request data to a CB.InitCallBackState3 call
+ */
+static int afs_deliver_cb_init_call_back_state3(struct afs_call *call,
+   struct sk_buff *skb,
+   bool last)
+{
+   struct afs_server *server;
+   struct in_addr addr;
+
+   _enter(",{%u},%d", skb->len, last);
+
+   if (!last)
+   return 0;
+
+   /* no unmarshalling required */
+   call->state = AFS_CALL_REPLYING;
+
+   /* we'll need the file server record as that tells us which set of
+* vnodes to operate upon */
+   memcpy(&addr, &skb->nh.iph->saddr, 4);
+   server = afs_find_server(&addr);
+   if (!server)
+   return -ENOTCONN;
+   call->server = server;
+
+   INIT_WORK(&call->work, SRXAFSCB_InitCallBackState);
+   schedule_work(&call->work);
+   return 0;
+}
+
+/*
  * allow the fileserver to see if the cache manager is still alive
  */
 static void SRXAFSCB_Probe(struct work_struct *work)


[PATCH 10/16] AFS: Handle multiple mounts of an AFS superblock correctly [try #4]

2007-04-25 Thread David Howells

Handle multiple mounts of an AFS superblock correctly, checking to see whether
the superblock is already initialised after calling sget() rather than just
unconditionally stamping all over it.

Also delete the "silent" parameter to afs_fill_super() as it's not used and
can, in any case, be obtained from sb->s_flags.
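
The control flow this introduces can be modelled in user space as below.
It is a sketch only: struct model_sb, model_fill_super() and model_get_sb()
are invented stand-ins for the superblock, afs_fill_super() and the tail of
afs_get_sb(), and locking and error unwinding are left out.

```c
#include <assert.h>

struct model_sb {
	int has_root;	/* stands in for sb->s_root != NULL    */
	int active;	/* stands in for MS_ACTIVE in s_flags  */
	int fill_calls;	/* how many times the sb was filled in */
};

/* Stand-in for afs_fill_super(): sets up the root once. */
static int model_fill_super(struct model_sb *sb)
{
	sb->fill_calls++;
	sb->has_root = 1;
	return 0;
}

/* Stand-in for the code after sget(): fill the superblock only if it is
 * new; otherwise reuse it as it stands. */
int model_get_sb(struct model_sb *sb)
{
	if (!sb->has_root) {
		/* initial superblock/root creation */
		int ret = model_fill_super(sb);

		if (ret < 0)
			return ret;
		sb->active = 1;
	} else {
		/* reuse: must already be fully set up */
		assert(sb->active);
	}
	return 0;
}
```

Mounting the same volume twice then fills the superblock exactly once.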

Signed-Off-By: David Howells <[EMAIL PROTECTED]>
---

 fs/afs/super.c |   26 --
 1 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/fs/afs/super.c b/fs/afs/super.c
index efc4fe6..77e6875 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -212,7 +212,7 @@ static int afs_test_super(struct super_block *sb, void *data)
 /*
  * fill in the superblock
  */
-static int afs_fill_super(struct super_block *sb, void *data, int silent)
+static int afs_fill_super(struct super_block *sb, void *data)
 {
struct afs_mount_params *params = data;
struct afs_super_info *as = NULL;
@@ -319,17 +319,23 @@ static int afs_get_sb(struct file_system_type *fs_type,
goto error;
}
 
-   sb->s_flags = flags;
-
-   ret = afs_fill_super(sb, &params, flags & MS_SILENT ? 1 : 0);
-   if (ret < 0) {
-   up_write(&sb->s_umount);
-   deactivate_super(sb);
-   goto error;
+   if (!sb->s_root) {
+   /* initial superblock/root creation */
+   _debug("create");
+   sb->s_flags = flags;
+   ret = afs_fill_super(sb, &params);
+   if (ret < 0) {
+   up_write(&sb->s_umount);
+   deactivate_super(sb);
+   goto error;
+   }
+   sb->s_flags |= MS_ACTIVE;
+   } else {
+   _debug("reuse");
+   ASSERTCMP(sb->s_flags, &, MS_ACTIVE);
}
-   sb->s_flags |= MS_ACTIVE;
-   simple_set_mnt(mnt, sb);
 
+   simple_set_mnt(mnt, sb);
afs_put_volume(params.volume);
afs_put_cell(params.default_cell);
_leave(" = 0 [%p]", sb);


[PATCH 02/16] cancel_delayed_work: use del_timer() instead of del_timer_sync() [try #4]

2007-04-25 Thread David Howells

del_timer_sync() buys nothing for cancel_delayed_work(), but it is less
efficient since it locks the timer unconditionally, and may wait for the
completion of the delayed_work_timer_fn().

cancel_delayed_work() == 0 means:

before this patch:
work->func may still be running or queued

after this patch:
work->func may still be running or queued, or
delayed_work_timer_fn->__queue_work() in progress.

The latter doesn't differ from the caller's POV,
delayed_work_timer_fn() is called with _PENDING
bit set.

cancel_delayed_work() == 1 with this patch adds a new possibility:

delayed_work->work was cancelled, but delayed_work_timer_fn
is still running (this is only possible for the re-arming
works on single-threaded workqueue).

In this case the timer was re-started by work->func(), nobody
else can do this. This in turn means that delayed_work_timer_fn
has already passed __queue_work() (and won't touch delayed_work)
because nobody else can queue delayed_work->work.

Signed-off-by: Oleg Nesterov <[EMAIL PROTECTED]>
Signed-Off-By: David Howells <[EMAIL PROTECTED]>
---

 include/linux/workqueue.h |7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 2a7b38d..b8abfc7 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -191,14 +191,15 @@ int execute_in_process_context(work_func_t fn, struct execute_work *);
 
 /*
  * Kill off a pending schedule_delayed_work().  Note that the work callback
- * function may still be running on return from cancel_delayed_work().  Run
- * flush_scheduled_work() to wait on it.
+ * function may still be running on return from cancel_delayed_work(), unless
+ * it returns 1 and the work doesn't re-arm itself. Run flush_workqueue() or
+ * cancel_work_sync() to wait on it.
  */
 static inline int cancel_delayed_work(struct delayed_work *work)
 {
 	int ret;
 
-	ret = del_timer_sync(&work->timer);
+	ret = del_timer(&work->timer);
 	if (ret)
 		work_release(&work->work);
 	return ret;
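
The return-value cases described in the changelog above can be sketched
as a small userspace model. This is not kernel code: the toy_* names and
the two boolean flags are assumptions made for illustration, and only the
"release the pending bit when a pending timer was removed" logic mirrors
the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct delayed_work; all names are hypothetical. */
struct toy_delayed_work {
	bool timer_pending;	/* timer armed and not yet fired */
	bool work_pending;	/* models the _PENDING bit on the work item */
};

/* Models del_timer(): deactivate the timer without waiting for a
 * running handler; returns 1 only if the timer was still pending. */
static int toy_del_timer(struct toy_delayed_work *dw)
{
	assert(dw != NULL);
	if (!dw->timer_pending)
		return 0;
	dw->timer_pending = false;
	return 1;
}

/* Models cancel_delayed_work() as it looks after the patch. */
static int toy_cancel_delayed_work(struct toy_delayed_work *dw)
{
	int ret = toy_del_timer(dw);

	if (ret)
		dw->work_pending = false;	/* models work_release() */
	return ret;
}
```

A return of 0 leaves work_pending set in this model, which corresponds
to the "work->func may still be running or queued" case.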


[PATCH 00/16] AF_RXRPC socket family and AFS rewrite [try #4]

2007-04-25 Thread David Howells


The first of these patches together provide secure client-side RxRPC
connectivity as a Linux kernel socket family.  Only the RxRPC transport/session
side is supplied - the presentation side (marshalling the data) is left to the
client.  Copies of the patches can be found here:

http://people.redhat.com/~dhowells/rxrpc/series
http://people.redhat.com/~dhowells/rxrpc/01-move-skb-generic.diff
http://people.redhat.com/~dhowells/rxrpc/02-cancel_delayed_work.diff
http://people.redhat.com/~dhowells/rxrpc/03-keys.diff
http://people.redhat.com/~dhowells/rxrpc/04-timer-exports.diff
http://people.redhat.com/~dhowells/rxrpc/05-af_rxrpc.diff

Further patches make the in-kernel AFS filesystem use AF_RXRPC and delete the
old RxRPC implementation:

http://people.redhat.com/~dhowells/rxrpc/06-afs-cleanup.diff
http://people.redhat.com/~dhowells/rxrpc/07-af_rxrpc-kernel.diff
http://people.redhat.com/~dhowells/rxrpc/08-af_rxrpc-afs.diff
http://people.redhat.com/~dhowells/rxrpc/09-af_rxrpc-delete-old.diff

And then the rest of the patches extend AFS to provide automatic unmounting of
automount trees, security support and directory-level write support (create,
mkdir, etc.):

http://people.redhat.com/~dhowells/rxrpc/10-afs-multimount.diff
http://people.redhat.com/~dhowells/rxrpc/11-afs-security.diff
http://people.redhat.com/~dhowells/rxrpc/12-afs-doc.diff

http://people.redhat.com/~dhowells/rxrpc/13-netlink-support-MSG_TRUNC.diff
http://people.redhat.com/~dhowells/rxrpc/14-afs-get-capabilities.diff
http://people.redhat.com/~dhowells/rxrpc/15-afs-initcallbackstate3.diff
http://people.redhat.com/~dhowells/rxrpc/16-afs-dir-write-support.diff

Note that file-level write support is not yet complete and so is not included
in this patch set.


The userspace access methods make use of the control data passed to/by
sendmsg() and recvmsg().  See the three simple test programs:

http://people.redhat.com/~dhowells/rxrpc/klog.c
http://people.redhat.com/~dhowells/rxrpc/rxrpc.c
http://people.redhat.com/~dhowells/rxrpc/listen.c

The klog program is provided to go and get a Kerberos IV key from the AFS
kaserver.  Currently it must be edited before compiling to note the right
server IP address and the appropriate credentials.

These programs can be compiled by:

make klog rxrpc listen CFLAGS="-Wall -g" LDLIBS="-lcrypto -lcrypt -lkrb4 -lkeyutils"

Then a ticket can be obtained by:

./klog

If a security key is acquired in this way, then all subsequent AFS operations -
including VL lookups and mounts - performed with that session keyring will be
authenticated using that key.  The key can be viewed like so:

[EMAIL PROTECTED] ~]# keyctl show
Session Keyring
   -3 --alswrv  0 0  keyring: _ses.3268
2 --alswrv  0 0   \_ keyring: _uid.0
111416553 --als--v  0 0   \_ rxrpc: [EMAIL PROTECTED]

TODO:

 (*) Make certain parameters (such as connection timeouts) userspace
 configurable.

 (*) Make userspace utilities use it; librxrpc.

 (*) Userspace documentation.

 (*) KerberosV security.

Changes:

 (*) SOCK_RPC has been removed.  SOCK_DGRAM is now used instead.

 (*) I've added a facility whereby calls can be made to destinations other than
 the connect() address of a client socket by making use of msg_name in the
 msghdr struct when using sendmsg() to send the first data packet of a
 call.  Indeed, a client socket need not be connected before being used
 so.

 (*) I've also added a facility whereby client calls may also be made on
 server sockets, again by using msg_name in the msghdr struct.  In such a
 case, the server's local transport endpoint is used.

 (*) I've made the write buffer space check available to various callers
 (sk_write_space) and implemented poll support.

 (*) Rewrote rxrpc_recvmsg().  It now concatenates adjacent data messages from
 the same call when delivering them.

 (*) Updated the documentation to include notes on recvmsg, cover control
 messages and cover SOL_RXRPC-level socket options.

 (*) Provided an in-kernel interface to give in-kernel utilities easier access
 to the facility.

 (*) Made fs/afs/ use it.

 (*) Deleted the old contents of net/rxrpc/.

 (*) Use the scatterlist interface to the crypto API for now.  The patch that
 added the direct access interface conflicts with patches Herbert Xu is
 producing, so I've dropped it for the moment.

 (*) Moved a bug fix to make secure connection reuse work from the
 af_rxrpc-kernel patch to the af_rxrpc main patch.

 (*) Make RxRPC use its own private work queues rather than keventd's to avoid
 deadlocks when AFS tries to use keventd too.  This also puts encryption
 in the private work queue rather than keventd's queue as that might take
 a relatively long time t
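
The msg_name mechanism mentioned in the change list above can be
sketched with a plain UDP socket standing in for an AF_RXRPC socket.
That substitution is an assumption made so the example is
self-contained; the helper names are hypothetical, and an RxRPC call
would use its own address type and control messages. The point shown is
only that an unconnected socket can name its destination per message.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send len bytes to dst from an unconnected socket by naming the
 * destination in msg_name.  Returns the sendmsg() result. */
static ssize_t send_to_named_peer(int fd, const struct sockaddr_in *dst,
				  const void *buf, size_t len)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	struct msghdr msg;

	assert(dst != NULL);
	memset(&msg, 0, sizeof(msg));
	msg.msg_name = (void *)dst;
	msg.msg_namelen = sizeof(*dst);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	return sendmsg(fd, &msg, 0);
}

/* Loopback demo: returns the number of bytes received, or -1. */
static ssize_t demo_msg_name(void)
{
	struct timeval tv = { 1, 0 };	/* do not block forever in recv() */
	struct sockaddr_in addr;
	socklen_t alen = sizeof(addr);
	char in[16];
	ssize_t n = -1;
	int rx = socket(AF_INET, SOCK_DGRAM, 0);
	int tx = socket(AF_INET, SOCK_DGRAM, 0);

	if (rx < 0 || tx < 0)
		goto out;
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;		/* let the kernel pick a port */
	if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    getsockname(rx, (struct sockaddr *)&addr, &alen) < 0 ||
	    setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
		goto out;
	if (send_to_named_peer(tx, &addr, "ping", 4) != 4)
		goto out;
	n = recv(rx, in, sizeof(in), 0);
out:
	if (rx >= 0)
		close(rx);
	if (tx >= 0)
		close(tx);
	return n;
}
```

The sending socket is never passed to connect(); the destination comes
entirely from the msghdr.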

Re: [patch] unprivileged mounts update

2007-04-25 Thread Miklos Szeredi

> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> - refine adding "nosuid" and "nodev" flags for unprivileged mounts:
> o add "nosuid", only if mounter doesn't have CAP_SETUID capability
> o add "nodev", only if mounter doesn't have CAP_MKNOD capability
> 
> - allow unprivileged forced unmount, but only for FS_SAFE filesystems
> 
> - allow mounting over special files, but not symlinks
> 
> - for mounting and umounting check "fsuid" instead of "ruid"
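
A minimal sketch of the flag refinement listed above, assuming the
capability checks have already been reduced to two booleans. The helper
name and its parameters are hypothetical; only MS_NOSUID and MS_NODEV
are real mount flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mount.h>

/* Hypothetical helper: force nosuid/nodev on a mount only when the
 * mounter lacks the matching capability. */
static unsigned long unpriv_mount_flags(unsigned long flags,
					bool has_cap_setuid,
					bool has_cap_mknod)
{
	if (!has_cap_setuid)
		flags |= MS_NOSUID;
	if (!has_cap_mknod)
		flags |= MS_NODEV;
	return flags;
}
```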

Andrew, please skip this patch, for now.

Serge found a problem with the fsuid approach: setfsuid(nonzero) will
remove filesystem related capabilities.  So even if root is trying to
set the "user=UID" flag on a mount, access to the target (and in case
of bind, the source) is checked with user privileges.

Root should be able to set this flag on any mountpoint, _regardless_
of permissions.

It is possible to restore filesystem capabilities after setting fsuid,
but the interfaces are rather horrible at all levels.  mount(8) can
probably live with these, but I'm not sure that using "fsuid" over
"ruid" has enough advantages to force this.

Why did we want to use fsuid, exactly?

Thanks,
Miklos

Re: [PATCH 00/16] AF_RXRPC socket family and AFS rewrite [try #3]

2007-04-25 Thread David Howells

Andrew Morton <[EMAIL PROTECTED]> wrote:

> I'm ducking all feature and cleanup patches now, and probably shall
> continue to do so for some weeks.  The priority (which I believe to be
> increasingly urgent) is to fix the 2.6.21 regressions and to stabilise
> the things which we presently have queued for 2.6.22.  Not to
> mention the 1000ish unaddressed bug reports in bugzilla and elsewhere.

Fair enough.  I think the idea is for them (or at least some of them) to go
through one of DaveM's net git trees anyway.

David

Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread Andreas Dilger

On Apr 25, 2007  20:54 +1000, David Chinner wrote:
> On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> > Right now, there is no distinction between an inode and continuation 
> > inode (also referred to as 'cnode' below), except for the 
> > EXT2_IS_CONT_FL flag. Every inode holds a list of static number of 
> > inodes, currently limited to 4.
> > 
> > The structure looks like this:
> > 
> >  -- --
> > | cnode 0  |-->| cnode 0  |--> to another cnode or NULL
> >  -- --
> > | cnode 1  |-  | cnode 1  |-
> >  -- |   --  |
> > | cnode 2  |-- |  | cnode 2  |--   |
> >  --  | |--  |   |
> > | cnode 3  | | |  | cnode 3  | |   |
> >  --  | |--  |   |
> >   |  |  ||  |   |
> > 
> >inodes   inodes or NULL
> 
> How do you recover if fsfuzzer takes out a cnode in the chain? The
> chunk is marked clean, but clearly corrupted and needs fixing and
> you don't know what it was pointing at.  Hence you have a pointer to
> a trashed cnode *somewhere* that you need to find and fix, and a
> bunch of orphaned cnodes that nobody points to *somewhere else* in
> the filesystem that you have to find. That's a full scan fsck case,
> isn't it?

Presumably, the cnodes in the other chunks contain forward and back
references.  Those need to contain at minimum inode + generation + chunk
to avoid the problem of pointing to a _different_ inode after such corruption
caused the old inode to be deleted and a new one allocated in its place.
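
A rough sketch of such a reference check, assuming a reference carries
chunk, inode number and generation as suggested above. The struct
layouts and names are invented for illustration and are not taken from
the chunkfs patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cross-chunk reference: chunk + inode + generation. */
struct cnode_ref {
	uint32_t chunk;
	uint32_t ino;
	uint32_t generation;
};

/* Hypothetical in-core view of the inode found at (chunk, ino). */
struct toy_inode {
	uint32_t chunk;
	uint32_t ino;
	uint32_t generation;	/* bumped each time the slot is reused */
	bool in_use;
};

/* A reference is trusted only if it lands on a live inode with the
 * same generation; otherwise the slot was freed or reallocated. */
static bool cnode_ref_valid(const struct cnode_ref *ref,
			    const struct toy_inode *target)
{
	assert(ref != NULL && target != NULL);
	return target->in_use &&
	       target->chunk == ref->chunk &&
	       target->ino == ref->ino &&
	       target->generation == ref->generation;
}
```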

If the cnode in each chunk is more than just a singly-linked list, the
file as a whole could survive multiple chunk corruptions, though there
would now be holes in the file.

> It seems that any sort of damage to the underlying storage (e.g.
> media error, I/O error or user brain explosion) results in the need
> to do a full fsck and hence chunkfs gives you no benefit in this
> case.

There are several cases where such corruption could be found:
- file access from the "parent" cnode will find the corrupted cnode missing,
  probably causing a fsck of both the source and target chunks
- a fsck of the source chunk would find the dangling cnode reference
  and cause a fsck of the corrupt chunk
- a fsck of the later cnode chunks would find the dangling cnode reference
  and cause a fsck of the corrupt chunk
- a fsck of the corrupt chunk would find the original corruption

The case where only a fsck of the corrupt chunk is done would not find the
cnode references.  Maybe there needs to be per-chunk info which contains
a list/bitmap of other chunks that have cnodes shared with each chunk?
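
One way to picture that per-chunk bitmap, as a sketch only: the chunk
count, the one-word bitmap and every name below are assumptions, not
something from the chunkfs code.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_CHUNKS 8		/* assumed small, so one 32-bit word is enough */

/* links[c] has bit d set when chunk c shares a cnode with chunk d. */
struct toy_chunk_links {
	uint32_t links[TOY_NR_CHUNKS];
};

/* Record a continuation between two chunks, in both directions. */
static void toy_link_chunks(struct toy_chunk_links *cl,
			    unsigned int a, unsigned int b)
{
	assert(a < TOY_NR_CHUNKS && b < TOY_NR_CHUNKS);
	cl->links[a] |= 1u << b;
	cl->links[b] |= 1u << a;
}

/* Chunks to check when `corrupt` is damaged: itself plus neighbours. */
static uint32_t toy_chunks_to_fsck(const struct toy_chunk_links *cl,
				   unsigned int corrupt)
{
	assert(corrupt < TOY_NR_CHUNKS);
	return cl->links[corrupt] | (1u << corrupt);
}
```

With this, a fsck of only the corrupt chunk can still name the other
chunks whose cnode references need to be revisited.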

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread Nikita Danilov

David Lang writes:
 > On Tue, 24 Apr 2007, Nikita Danilov wrote:
 > 
 > > David Lang writes:
 > > > On Tue, 24 Apr 2007, Nikita Danilov wrote:
 > > >
 > > > > Amit Gud writes:
 > > > >
 > > > > Hello,
 > > > >
 > > > > >
 > > > > > This is an initial implementation of ChunkFS technique, briefly
 > > > > > discussed at: http://lwn.net/Articles/190222 and
 > > > > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
 > > > >
 > > > > I have a couple of questions about chunkfs repair process.
 > > > >
 > > > > First, as I understand it, each continuation inode is a sparse file,
 > > > > mapping some subset of logical file blocks into block numbers. Then it
 > > > > seems, that during "final phase" fsck has to check that these partial
 > > > > mappings are consistent, for example, that no two different
 > > > > continuation inodes for a given file contain a block number for the
 > > > > same offset. This check requires scan of all chunks (rather than of
 > > > > only "active during crash"), which seems to return us back to the
 > > > > scalability problem chunkfs tries to address.
 > > >
 > > > not quite.
 > > >
 > > > this checking is a O(n^2) or worse problem, and it can eat a lot of
 > > > memory in the process. with chunkfs you divide the problem by a large
 > > > constant (100 or more) for the checks of individual chunks. after those
 > > > are done then the final pass checking the cross-chunk links doesn't have
 > > > to keep track of everything, it only needs to check those links and what
 > > > they point to
 > >
 > > Maybe I failed to describe the problem precisely.
 > >
 > > Suppose that all chunks have been checked. After that, for every inode
 > > I0 having continuations I1, I2, ... In, one has to check that every
 > > logical block is presented in at most one of these inodes. For this one
 > > has to read I0, with all its indirect (double-indirect, triple-indirect)
 > > blocks, then read I1 with all its indirect blocks, etc. And to repeat
 > > this for every inode with continuations.
 > >
 > > In the worst case (every inode has a continuation in every chunk) this
 > > obviously is as bad as un-chunked fsck. But even in the average case,
 > > total amount of io necessary for this operation is proportional to the
 > > _total_ file system size, rather than to the chunk size.
 > 
 > actually, it should be proportional to the number of continuation nodes. The 
 > expectation (and design) is that they are rare.

Indeed, but the total size of the meta-data pertaining to all continuation
inodes is still proportional to the total file system size, and so is
fsck time: O(total_file_system_size).

What is more important, the design puts (as far as I can see) no upper limit
on the number of continuation inodes, and hence, even if the _average_ fsck
time is greatly reduced, occasionally it can take more time than fsck of an
ext2 file system of the same size. This is clearly unacceptable in many
situations (HA, etc.).
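
The cross-continuation check under discussion (every logical block
present in at most one of I0, I1, ... In) can be sketched as follows.
The bitmap representation, the 64-block limit and the names are
assumptions for illustration; a real fsck would walk indirect blocks
instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one inode (I0) or continuation (I1..In): the
 * set of logical file blocks it maps, limited to 64 blocks here. */
struct toy_mapping {
	uint64_t mapped;	/* bit b set: this inode maps logical block b */
};

/* Returns true if no logical block is mapped by more than one inode.
 * Every mapping has to be read once, which is the I/O cost in question. */
static bool toy_mappings_disjoint(const struct toy_mapping *m, size_t n)
{
	uint64_t seen = 0;
	size_t i;

	assert(m != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (seen & m[i].mapped)
			return false;	/* block claimed twice */
		seen |= m[i].mapped;
	}
	return true;
}
```

The loop is linear in the number of continuation inodes of one file,
and it has to be repeated for every file that has continuations.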

Nikita.

Re: Testing framework

2007-04-25 Thread Karuna sagar K


On 4/23/07, Avishay Traeger <[EMAIL PROTECTED]> wrote:
> On Mon, 2007-04-23 at 02:16 +0530, Karuna sagar K wrote:
>
> You may want to check out the paper "EXPLODE: A Lightweight, General
> System for Finding Serious Storage System Errors" from OSDI 2006 (if you
> haven't already).  The idea sounds very similar to me, although I
> haven't read all the details of your proposal.

EXPLODE is a more generic tool, i.e. it is used to find a larger set of
errors/bugs in file systems than the Test framework, which focuses on
the repair of file systems.

The Test framework is focused on the repairability of file systems. It
doesn't use the model checking concept; it uses a replayable corruption
mechanism and is a user space implementation. That's the reason why it
is not similar to EXPLODE.

> Avishay

Thanks,
Karuna

[PATCH 15/16] AFS: Implement the CB.InitCallBackState3 operation [try #3]

2007-04-25 Thread David Howells

Implement the CB.InitCallBackState3 operation for the fileserver to call.
This reduces the amount of network traffic because if this op is aborted, the
fileserver will then attempt a CB.InitCallBackState operation.

Signed-Off-By: David Howells <[EMAIL PROTECTED]>
---

 fs/afs/AFS_CM.h    |    1 +
 fs/afs/cmservice.c |   46 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/fs/afs/AFS_CM.h b/fs/afs/AFS_CM.h
index d4bd201..7b4d4fa 100644
--- a/fs/afs/AFS_CM.h
+++ b/fs/afs/AFS_CM.h
@@ -23,6 +23,7 @@ enum AFS_CM_Operations {
CBGetCE = 208,  /* get cache file description */
CBGetXStatsVersion  = 209,  /* get version of extended statistics */
CBGetXStats = 210,  /* get contents of extended statistics data */
+   CBInitCallBackState3= 213,  /* initialise callback state, version 3 */
CBGetCapabilities   = 65538, /* get client capabilities */
 };
 
diff --git a/fs/afs/cmservice.c b/fs/afs/cmservice.c
index 5139723..3d58861 100644
--- a/fs/afs/cmservice.c
+++ b/fs/afs/cmservice.c
@@ -20,6 +20,8 @@ struct workqueue_struct *afs_cm_workqueue;
 
 static int afs_deliver_cb_init_call_back_state(struct afs_call *,
   struct sk_buff *, bool);
+static int afs_deliver_cb_init_call_back_state3(struct afs_call *,
+   struct sk_buff *, bool);
 static int afs_deliver_cb_probe(struct afs_call *, struct sk_buff *, bool);
 static int afs_deliver_cb_callback(struct afs_call *, struct sk_buff *, bool);
 static int afs_deliver_cb_get_capabilities(struct afs_call *, struct sk_buff *,
@@ -47,6 +49,16 @@ static const struct afs_call_type afs_SRXCBInitCallBackState = {
 };
 
 /*
+ * CB.InitCallBackState3 operation type
+ */
+static const struct afs_call_type afs_SRXCBInitCallBackState3 = {
+   .name   = "CB.InitCallBackState3",
+   .deliver= afs_deliver_cb_init_call_back_state3,
+   .abort_to_error = afs_abort_to_error,
+   .destructor = afs_cm_destructor,
+};
+
+/*
  * CB.Probe operation type
  */
 static const struct afs_call_type afs_SRXCBProbe = {
@@ -83,6 +95,9 @@ bool afs_cm_incoming_call(struct afs_call *call)
case CBInitCallBackState:
call->type = &afs_SRXCBInitCallBackState;
return true;
+   case CBInitCallBackState3:
+   call->type = &afs_SRXCBInitCallBackState3;
+   return true;
case CBProbe:
call->type = &afs_SRXCBProbe;
return true;
@@ -312,6 +327,37 @@ static int afs_deliver_cb_init_call_back_state(struct afs_call *call,
 }
 
 /*
+ * deliver request data to a CB.InitCallBackState3 call
+ */
+static int afs_deliver_cb_init_call_back_state3(struct afs_call *call,
+   struct sk_buff *skb,
+   bool last)
+{
+   struct afs_server *server;
+   struct in_addr addr;
+
+   _enter(",{%u},%d", skb->len, last);
+
+   if (!last)
+   return 0;
+
+   /* no unmarshalling required */
+   call->state = AFS_CALL_REPLYING;
+
+   /* we'll need the file server record as that tells us which set of
+* vnodes to operate upon */
+   memcpy(&addr, &skb->nh.iph->saddr, 4);
+   server = afs_find_server(&addr);
+   if (!server)
+   return -ENOTCONN;
+   call->server = server;
+
+   INIT_WORK(&call->work, SRXAFSCB_InitCallBackState);
+   schedule_work(&call->work);
+   return 0;
+}
+
+/*
  * allow the fileserver to see if the cache manager is still alive
  */
 static void SRXAFSCB_Probe(struct work_struct *work)


[PATCH 14/16] AFS: Add support for the CB.GetCapabilities operation [try #3]

2007-04-25 Thread David Howells

Add support for the CB.GetCapabilities operation with which the fileserver can
ask the client for the following information:

 (1) The list of network interfaces it has available as IPv4 address + netmask
 plus the MTUs.

 (2) The client's UUID.

 (3) The extended capabilities of the client, for which the only current one
 is unified error mapping (abort code interpretation).

To support this, the patch adds the following routines to AFS:

 (1) A function to iterate through all the network interfaces using RTNETLINK
 to extract IPv4 addresses and MTUs.

 (2) A function to iterate through all the network interfaces using RTNETLINK
 to pull out the MAC address of the lowest index interface to use in UUID
 construction.

Signed-Off-By: David Howells <[EMAIL PROTECTED]>
---

 fs/afs/AFS_CM.h|3 
 fs/afs/Makefile|1 
 fs/afs/cmservice.c |   98 ++
 fs/afs/internal.h  |   42 
 fs/afs/main.c  |   49 +
 fs/afs/rxrpc.c |   39 
 fs/afs/use-rtnetlink.c |  473 
 7 files changed, 705 insertions(+), 0 deletions(-)

diff --git a/fs/afs/AFS_CM.h b/fs/afs/AFS_CM.h
index 7c8e3d4..d4bd201 100644
--- a/fs/afs/AFS_CM.h
+++ b/fs/afs/AFS_CM.h
@@ -23,6 +23,9 @@ enum AFS_CM_Operations {
CBGetCE = 208,  /* get cache file description */
CBGetXStatsVersion  = 209,  /* get version of extended statistics */
CBGetXStats = 210,  /* get contents of extended statistics data */
+   CBGetCapabilities   = 65538, /* get client capabilities */
 };
 
+#define AFS_CAP_ERROR_TRANSLATION  0x1
+
 #endif /* AFS_FS_H */
diff --git a/fs/afs/Makefile b/fs/afs/Makefile
index cca198b..01545eb 100644
--- a/fs/afs/Makefile
+++ b/fs/afs/Makefile
@@ -18,6 +18,7 @@ kafs-objs := \
security.o \
server.o \
super.o \
+   use-rtnetlink.o \
vlclient.o \
vlocation.o \
vnode.o \
diff --git a/fs/afs/cmservice.c b/fs/afs/cmservice.c
index 9cb3ac5..5139723 100644
--- a/fs/afs/cmservice.c
+++ b/fs/afs/cmservice.c
@@ -22,6 +22,8 @@ static int afs_deliver_cb_init_call_back_state(struct afs_call *,
   struct sk_buff *, bool);
 static int afs_deliver_cb_probe(struct afs_call *, struct sk_buff *, bool);
 static int afs_deliver_cb_callback(struct afs_call *, struct sk_buff *, bool);
+static int afs_deliver_cb_get_capabilities(struct afs_call *, struct sk_buff *,
+  bool);
 static void afs_cm_destructor(struct afs_call *);
 
 /*
@@ -55,6 +57,16 @@ static const struct afs_call_type afs_SRXCBProbe = {
 };
 
 /*
+ * CB.GetCapabilities operation type
+ */
+static const struct afs_call_type afs_SRXCBGetCapabilites = {
+   .name   = "CB.GetCapabilities",
+   .deliver= afs_deliver_cb_get_capabilities,
+   .abort_to_error = afs_abort_to_error,
+   .destructor = afs_cm_destructor,
+};
+
+/*
  * route an incoming cache manager call
  * - return T if supported, F if not
  */
@@ -74,6 +86,9 @@ bool afs_cm_incoming_call(struct afs_call *call)
case CBProbe:
call->type = &afs_SRXCBProbe;
return true;
+   case CBGetCapabilities:
+   call->type = &afs_SRXCBGetCapabilites;
+   return true;
default:
return false;
}
@@ -328,3 +343,86 @@ static int afs_deliver_cb_probe(struct afs_call *call, struct sk_buff *skb,
schedule_work(&call->work);
return 0;
 }
+
+/*
+ * allow the fileserver to ask about the cache manager's capabilities
+ */
+static void SRXAFSCB_GetCapabilities(struct work_struct *work)
+{
+   struct afs_interface *ifs;
+   struct afs_call *call = container_of(work, struct afs_call, work);
+   int loop, nifs;
+
+   struct {
+   struct /* InterfaceAddr */ {
+   __be32 nifs;
+   __be32 uuid[11];
+   __be32 ifaddr[32];
+   __be32 netmask[32];
+   __be32 mtu[32];
+   } ia;
+   struct /* Capabilities */ {
+   __be32 capcount;
+   __be32 caps[1];
+   } cap;
+   } reply;
+
+   _enter("");
+
+   nifs = 0;
+   ifs = kcalloc(32, sizeof(*ifs), GFP_KERNEL);
+   if (ifs) {
+   nifs = afs_get_ipv4_interfaces(ifs, 32, false);
+   if (nifs < 0) {
+   kfree(ifs);
+   ifs = NULL;
+   nifs = 0;
+   }
+   }
+
+   memset(&reply, 0, sizeof(reply));
+   reply.ia.nifs = htonl(nifs);
+
+   reply.ia.uuid[0] = htonl(afs_uuid.time_low);
+   reply.ia.uuid[1] = htonl(afs_uuid.time_mid);
+   reply.ia.uuid[2] = htonl(afs_uuid.time_hi_and_version);
+   reply.ia.uuid[3] = htonl((s8) afs_uuid.clock_se

[PATCH 12/16] AFS: Update the AFS fs documentation [try #3]

2007-04-25 Thread David Howells

Update the AFS fs documentation.

Signed-Off-By: David Howells <[EMAIL PROTECTED]>
---

 Documentation/filesystems/afs.txt |  214 +++--
 1 files changed, 154 insertions(+), 60 deletions(-)

diff --git a/Documentation/filesystems/afs.txt b/Documentation/filesystems/afs.txt
index 2f4237d..12ad6c7 100644
--- a/Documentation/filesystems/afs.txt
+++ b/Documentation/filesystems/afs.txt
@@ -1,31 +1,82 @@
+
 kAFS: AFS FILESYSTEM
 
 
-ABOUT
-=
+Contents:
+
+ - Overview.
+ - Usage.
+ - Mountpoints.
+ - Proc filesystem.
+ - The cell database.
+ - Security.
+ - Examples.
+
+
+
+OVERVIEW
+
 
-This filesystem provides a fairly simple AFS filesystem driver. It is under
-development and only provides very basic facilities. It does not yet support
-the following AFS features:
+This filesystem provides a fairly simple secure AFS filesystem driver. It is
+under development and does not yet provide the full feature set.  The features
+it does support include:
 
-   (*) Write support.
-   (*) Communications security.
-   (*) Local caching.
-   (*) pioctl() system call.
-   (*) Automatic mounting of embedded mountpoints.
+ (*) Security (currently only AFS kaserver and KerberosIV tickets).
 
+ (*) File reading.
 
+ (*) Automounting.
+
+It does not yet support the following AFS features:
+
+ (*) Write support.
+
+ (*) Local caching.
+
+ (*) pioctl() system call.
+
+
+===
+COMPILATION
+===
+
+The filesystem should be enabled by turning on the kernel configuration
+options:
+
+   CONFIG_AF_RXRPC - The RxRPC protocol transport
+   CONFIG_RXKAD- The RxRPC Kerberos security handler
+   CONFIG_AFS  - The AFS filesystem
+
+Additionally, the following can be turned on to aid debugging:
+
+   CONFIG_AF_RXRPC_DEBUG   - Permit AF_RXRPC debugging to be enabled
+   CONFIG_AFS_DEBUG- Permit AFS debugging to be enabled
+
+They permit the debugging messages to be turned on dynamically by manipulating
+the masks in the following files:
+
+   /sys/module/af_rxrpc/parameters/debug
+   /sys/module/afs/parameters/debug
+
+
+=
 USAGE
 =
 
 When inserting the driver modules the root cell must be specified along with a
 list of volume location server IP addresses:
 
-   insmod rxrpc.o
+   insmod af_rxrpc.o
+   insmod rxkad.o
insmod kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
 
-The first module is a driver for the RxRPC remote operation protocol, and the
-second is the actual filesystem driver for the AFS filesystem.
+The first module is the AF_RXRPC network protocol driver.  This provides the
+RxRPC remote operation protocol and may also be accessed from userspace.  See:
+
+   Documentation/networking/rxrpc.txt
+
+The second module is the kerberos RxRPC security driver, and the third module
+is the actual filesystem driver for the AFS filesystem.
 
 Once the module has been loaded, more modules can be added by the following
 procedure:
@@ -33,7 +84,7 @@ procedure:
echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells
 
 Where the parameters to the "add" command are the name of a cell and a list of
-volume location servers within that cell.
+volume location servers within that cell, with the latter separated by colons.
 
 Filesystems can be mounted anywhere by commands similar to the following:
 
@@ -42,11 +93,6 @@ Filesystems can be mounted anywhere by commands similar to the following:
mount -t afs "#root.afs." /afs
mount -t afs "#root.cell." /afs/cambridge
 
-  NB: When using this on Linux 2.4, the mount command has to be different,
-  since the filesystem doesn't have access to the device name argument:
-
-   mount -t afs none /afs -ovol="#root.afs."
-
 Where the initial character is either a hash or a percent symbol depending on
 whether you definitely want a R/W volume (hash) or whether you'd prefer a R/O
 volume, but are willing to use a R/W volume instead (percent).
@@ -60,55 +106,66 @@ named volume will be looked up in the cell specified during insmod.
 Additional cells can be added through /proc (see later section).
 
 
+===
 MOUNTPOINTS
 ===
 
-AFS has a concept of mountpoints. These are specially formatted symbolic links
-(of the same form as the "device name" passed to mount). kAFS presents these
-to the user as directories that have special properties:
+AFS has a concept of mountpoints. In AFS terms, these are specially formatted
+symbolic links (of the same form as the "device name" passed to mount).  kAFS
+presents these to the user as directories that have a follow-link capability
+(ie: symbolic link semantics).  If anyone attempts to access them, they will
+automatically cause the target volume to be mounted (if possible) on that site.
 
-  (*

Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread David Chinner

On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> Nikita Danilov wrote:
> >Maybe I failed to describe the problem precisely.
> >
> >Suppose that all chunks have been checked. After that, for every inode
> >I0 having continuations I1, I2, ... In, one has to check that every
> >logical block is presented in at most one of these inodes. For this one
> >has to read I0, with all its indirect (double-indirect, triple-indirect)
> >blocks, then read I1 with all its indirect blocks, etc. And to repeat
> >this for every inode with continuations.
> >
> >In the worst case (every inode has a continuation in every chunk) this
> >obviously is as bad as un-chunked fsck. But even in the average case,
> >total amount of io necessary for this operation is proportional to the
> >_total_ file system size, rather than to the chunk size.
> >
> 
> Perhaps, I should talk about how continuation inodes are managed / 
> located on disk. (This is how it is in my current implementation)
> 
> Right now, there is no distinction between an inode and continuation 
> inode (also referred to as 'cnode' below), except for the 
> EXT2_IS_CONT_FL flag. Every inode holds a list of static number of 
> inodes, currently limited to 4.
> 
> The structure looks like this:
> 
>  ----------        ----------
> | cnode 0  |----->| cnode 0  |-----> to another cnode or NULL
>  ----------        ----------
> | cnode 1  |----  | cnode 1  |----
>  ----------    |   ----------    |
> | cnode 2  |-- |  | cnode 2  |-- |
>  ----------  | |   ----------  | |
> | cnode 3  | | |  | cnode 3  | | |
>  ----------  | |   ----------  | |
>      |       | |       |       | |
> 
>    inodes           inodes or NULL

How do you recover if fsfuzzer takes out a cnode in the chain? The
chunk is marked clean, but clearly corrupted and needs fixing and
you don't know what it was pointing at.  Hence you have a pointer to
a trashed cnode *somewhere* that you need to find and fix, and a
bunch of orphaned cnodes that nobody points to *somewhere else* in
the filesystem that you have to find. That's a full scan fsck case,
isn't it?

It seems that any sort of damage to the underlying storage (e.g.
media error, I/O error or user brain explosion) results in the need
to do a full fsck and hence chunkfs gives you no benefit in this
case.
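
To make the orphaning problem concrete, here is a toy model of a cnode
chain. It is one simplified reading of the diagram (four slots plus a
link to the next cnode); the field names and the `corrupted` flag are
invented and do not come from the chunkfs patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CNODE_SLOTS 4	/* "currently limited to 4" in the description */

/* Hypothetical in-memory cnode, loosely mirroring the diagram. */
struct toy_cnode {
	int inode_slot[TOY_CNODE_SLOTS];
	bool corrupted;			/* stands in for failed validation */
	struct toy_cnode *next;
};

/* Count cnodes reachable from head; stop at the first corrupted one,
 * since its next pointer cannot be trusted. */
static size_t toy_reachable_cnodes(const struct toy_cnode *head)
{
	size_t n = 0;

	for (; head && !head->corrupted; head = head->next)
		n++;
	return n;
}
```

Everything behind the corrupted cnode becomes unreachable from the head
of the chain, so finding it again needs some other index or a scan.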

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

[PATCH 13/16] commit ad495d7b6cfcd1bc2eaf06c42699be0bb5d84234 [try #3]

2007-04-25 Thread David Howells

[NETLINK]: Mirror UDP MSG_TRUNC semantics.

If the user passes MSG_TRUNC in via msg_flags, return
the full packet size not the truncated size.

Idea from Herbert Xu and Thomas Graf.

Signed-off-by: David S. Miller <[EMAIL PROTECTED]>
---

 net/netlink/af_netlink.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index c48b0f4..5890210 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1242,6 +1242,9 @@ static int netlink_recvmsg(struct kiocb *kiocb, struct socket *sock,
 
scm_recv(sock, msg, siocb->scm, flags);
 
+   if (flags & MSG_TRUNC)
+   copied = skb->len;
+
 out:
netlink_rcv_wake(sk);
return err ? : copied;
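
The MSG_TRUNC receive semantics that this patch mirrors can be observed
from userspace. The sketch below uses an AF_UNIX datagram socket pair as
a stand-in for a netlink socket, which is an assumption made to keep it
self-contained; it relies on Linux behaviour (AF_UNIX datagram sockets
only honour MSG_TRUNC on reasonably recent kernels), and the helper name
is hypothetical.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send a full-byte datagram and receive it into a small-byte buffer
 * with MSG_TRUNC set; returns what recv() reports, or -1 on error. */
static ssize_t toy_recv_trunc(size_t full, size_t small)
{
	char out[64] = { 0 }, in[64];
	ssize_t n = -1;
	int sv[2];

	if (full > sizeof(out) || small > sizeof(in))
		return -1;
	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;
	if (send(sv[0], out, full, 0) == (ssize_t)full)
		n = recv(sv[1], in, small, MSG_TRUNC);
	close(sv[0]);
	close(sv[1]);
	return n;
}
```

With the flag set, the return value is the full datagram length even
though only `small` bytes were copied, so the caller can tell that the
buffer was too short.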


[PATCH 10/16] AFS: Handle multiple mounts of an AFS superblock correctly [try #3]

2007-04-25 Thread David Howells

Handle multiple mounts of an AFS superblock correctly, checking to see whether
the superblock is already initialised after calling sget() rather than just
unconditionally stamping all over it.

Also delete the "silent" parameter to afs_fill_super() as it's not used and
can, in any case, be obtained from sb->s_flags.

Signed-Off-By: David Howells <[EMAIL PROTECTED]>
---

 fs/afs/super.c |   26 ++++++++++++++++----------
 1 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/fs/afs/super.c b/fs/afs/super.c
index efc4fe6..77e6875 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -212,7 +212,7 @@ static int afs_test_super(struct super_block *sb, void *data)
 /*
  * fill in the superblock
  */
-static int afs_fill_super(struct super_block *sb, void *data, int silent)
+static int afs_fill_super(struct super_block *sb, void *data)
 {
 	struct afs_mount_params *params = data;
 	struct afs_super_info *as = NULL;
@@ -319,17 +319,23 @@ static int afs_get_sb(struct file_system_type *fs_type,
 		goto error;
 	}
 
-	sb->s_flags = flags;
-
-	ret = afs_fill_super(sb, &params, flags & MS_SILENT ? 1 : 0);
-	if (ret < 0) {
-		up_write(&sb->s_umount);
-		deactivate_super(sb);
-		goto error;
+	if (!sb->s_root) {
+		/* initial superblock/root creation */
+		_debug("create");
+		sb->s_flags = flags;
+		ret = afs_fill_super(sb, &params);
+		if (ret < 0) {
+			up_write(&sb->s_umount);
+			deactivate_super(sb);
+			goto error;
+		}
+		sb->s_flags |= MS_ACTIVE;
+	} else {
+		_debug("reuse");
+		ASSERTCMP(sb->s_flags, &, MS_ACTIVE);
 	}
-	sb->s_flags |= MS_ACTIVE;
-	simple_set_mnt(mnt, sb);
 
+	simple_set_mnt(mnt, sb);
 	afs_put_volume(params.volume);
 	afs_put_cell(params.default_cell);
 	_leave(" = 0 [%p]", sb);
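
The control flow this patch introduces (look the superblock up, fill it in
only if it is fresh, otherwise reuse it) can be modelled in userspace.  The
sketch below is not kernel code: struct model_sb, model_sget() and
model_get_sb() are hypothetical stand-ins for struct super_block, sget() and
afs_get_sb(), and the fixed-size table is an assumption made to keep the
example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_ACTIVE 0x1u	/* stands in for MS_ACTIVE */
#define MODEL_SLOTS  4

/* Hypothetical, much-simplified stand-in for struct super_block. */
struct model_sb {
	const char *volume;	/* key used to match an existing superblock */
	void *root;		/* NULL until the first mount fills it in */
	unsigned int flags;
	int fill_count;		/* how many times the fill step ran */
};

static struct model_sb model_table[MODEL_SLOTS];

/* Model of sget(): return the matching entry, else claim a blank one. */
struct model_sb *model_sget(const char *volume)
{
	size_t i;

	for (i = 0; i < MODEL_SLOTS; i++)
		if (model_table[i].volume &&
		    strcmp(model_table[i].volume, volume) == 0)
			return &model_table[i];
	for (i = 0; i < MODEL_SLOTS; i++)
		if (!model_table[i].volume) {
			model_table[i].volume = volume;
			return &model_table[i];
		}
	return NULL;
}

/* Model of the patched mount path: initialise only a fresh superblock. */
struct model_sb *model_get_sb(const char *volume, unsigned int flags)
{
	struct model_sb *sb = model_sget(volume);

	if (!sb)
		return NULL;
	if (!sb->root) {
		/* initial creation: stamp flags and fill in the root once */
		sb->flags = flags;
		sb->root = sb;	/* any non-NULL marker will do here */
		sb->fill_count++;
		sb->flags |= MODEL_ACTIVE;
	} else {
		/* reuse: leave the existing state alone */
		assert(sb->flags & MODEL_ACTIVE);
	}
	return sb;
}
```

Mounting the same volume twice returns the same object with fill_count still
at 1, and the flags from the first mount are not overwritten by the second.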


[PATCH 00/16] AF_RXRPC socket family and AFS rewrite [try #3]

2007-04-25 Thread David Howells


The first of these patches together provide secure client-side RxRPC
connectivity as a Linux kernel socket family.  Only the RxRPC transport/session
side is supplied - the presentation side (marshalling the data) is left to the
client.  Copies of the patches can be found here:

http://people.redhat.com/~dhowells/rxrpc/series
http://people.redhat.com/~dhowells/rxrpc/01-move-skb-generic.diff
http://people.redhat.com/~dhowells/rxrpc/02-cancel_delayed_work.diff
http://people.redhat.com/~dhowells/rxrpc/03-keys.diff
http://people.redhat.com/~dhowells/rxrpc/04-timer-exports.diff
http://people.redhat.com/~dhowells/rxrpc/05-af_rxrpc.diff

Further patches make the in-kernel AFS filesystem use AF_RXRPC and delete the
old RxRPC implementation:

http://people.redhat.com/~dhowells/rxrpc/06-afs-cleanup.diff
http://people.redhat.com/~dhowells/rxrpc/07-af_rxrpc-kernel.diff
http://people.redhat.com/~dhowells/rxrpc/08-af_rxrpc-afs.diff
http://people.redhat.com/~dhowells/rxrpc/09-af_rxrpc-delete-old.diff

And then the rest of the patches extend AFS to provide automatic unmounting of
automount trees, security support and directory-level write support (create,
mkdir, etc.):

http://people.redhat.com/~dhowells/rxrpc/10-afs-multimount.diff
http://people.redhat.com/~dhowells/rxrpc/11-afs-security.diff
http://people.redhat.com/~dhowells/rxrpc/12-afs-doc.diff

http://people.redhat.com/~dhowells/rxrpc/13-netlink-support-MSG_TRUNC.diff
http://people.redhat.com/~dhowells/rxrpc/14-afs-get-capabilities.diff
http://people.redhat.com/~dhowells/rxrpc/15-afs-initcallbackstate3.diff
http://people.redhat.com/~dhowells/rxrpc/16-afs-dir-write-support.diff

Note that file-level write support is not yet complete and so is not included
in this patch set.


The userspace access methods make use of the control data passed to/by
sendmsg() and recvmsg().  See the three simple test programs:

http://people.redhat.com/~dhowells/rxrpc/klog.c
http://people.redhat.com/~dhowells/rxrpc/rxrpc.c
http://people.redhat.com/~dhowells/rxrpc/listen.c

The klog program is provided to go and get a Kerberos IV key from the AFS
kaserver.  Currently it must be edited before compiling to note the right
server IP address and the appropriate credentials.

These programs can be compiled by:

make klog rxrpc listen CFLAGS="-Wall -g" LDLIBS="-lcrypto -lcrypt -lkrb4 -lkeyutils"

Then a ticket can be obtained by:

./klog

If a security key is acquired in this way, then all subsequent AFS operations -
including VL lookups and mounts - performed with that session keyring will be
authenticated using that key.  The key can be viewed like so:

[EMAIL PROTECTED] ~]# keyctl show
Session Keyring
       -3 --alswrv      0     0  keyring: _ses.3268
        2 --alswrv      0     0   \_ keyring: _uid.0
111416553 --als--v      0     0   \_ rxrpc: [EMAIL PROTECTED]

TODO:

 (*) Make certain parameters (such as connection timeouts) userspace
 configurable.

 (*) Make userspace utilities use it; librxrpc.

 (*) Userspace documentation.

 (*) KerberosV security.

Changes:

 (*) SOCK_RPC has been removed.  SOCK_DGRAM is now used instead.

 (*) I've added a facility whereby calls can be made to destinations other than
 the connect() address of a client socket by making use of msg_name in the
 msghdr struct when using sendmsg() to send the first data packet of a
 call.  Indeed, a client socket need not be connected before being used
 so.

 (*) I've also added a facility whereby client calls may also be made on
 server sockets, again by using msg_name in the msghdr struct.  In such a
 case, the server's local transport endpoint is used.

 (*) I've made the write buffer space check available to various callers
 (sk_write_space) and implemented poll support.

 (*) Rewrote rxrpc_recvmsg().  It now concatenates adjacent data messages from
 the same call when delivering them.

 (*) Updated the documentation to include notes on recvmsg, cover control
 messages and cover SOL_RXRPC-level socket options.

 (*) Provided an in-kernel interface to give in-kernel utilities easier access
 to the facility.

 (*) Made fs/afs/ use it.

 (*) Deleted the old contents of net/rxrpc/.

 (*) Use the scatterlist interface to the crypto API for now.  The patch that
 added the direct access interface conflicts with patches Herbert Xu is
 producing, so I've dropped it for the moment.

 (*) Moved a bug fix to make secure connection reuse work from the
 af_rxrpc-kernel patch to the af_rxrpc main patch.

 (*) Make RxRPC use its own private work queues rather than keventd's to avoid
 deadlocks when AFS tries to use keventd too.  This also puts encryption
 in the private work queue rather than keventd's queue as that might take
 a relatively long time t

[PATCH 02/16] cancel_delayed_work: use del_timer() instead of del_timer_sync() [try #3]

2007-04-25 Thread David Howells

del_timer_sync() buys nothing for cancel_delayed_work(), but it is less
efficient since it locks the timer unconditionally, and may wait for the
completion of the delayed_work_timer_fn().

cancel_delayed_work() == 0 means:

before this patch:
work->func may still be running or queued

after this patch:
work->func may still be running or queued, or
delayed_work_timer_fn->__queue_work() in progress.

The latter doesn't differ from the caller's POV,
delayed_work_timer_fn() is called with _PENDING
bit set.

cancel_delayed_work() == 1 with this patch adds a new possibility:

delayed_work->work was cancelled, but delayed_work_timer_fn
is still running (this is only possible for the re-arming
works on single-threaded workqueue).

In this case the timer was re-started by work->func(), nobody
else can do this. This in turn means that delayed_work_timer_fn
has already passed __queue_work() (and won't touch delayed_work)
because nobody else can queue delayed_work->work.

Signed-off-by: Oleg Nesterov <[EMAIL PROTECTED]>
Signed-Off-By: David Howells <[EMAIL PROTECTED]>
---

 include/linux/workqueue.h |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 2a7b38d..b8abfc7 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -191,14 +191,15 @@ int execute_in_process_context(work_func_t fn, struct execute_work *);
 
 /*
  * Kill off a pending schedule_delayed_work().  Note that the work callback
- * function may still be running on return from cancel_delayed_work().  Run
- * flush_scheduled_work() to wait on it.
+ * function may still be running on return from cancel_delayed_work(), unless
+ * it returns 1 and the work doesn't re-arm itself. Run flush_workqueue() or
+ * cancel_work_sync() to wait on it.
  */
 static inline int cancel_delayed_work(struct delayed_work *work)
 {
 	int ret;
 
-	ret = del_timer_sync(&work->timer);
+	ret = del_timer(&work->timer);
 	if (ret)
 		work_release(&work->work);
 	return ret;


[PATCH 04/16] AF_RXRPC: Make it possible to merely try to cancel timers from a module [try #3]

2007-04-25 Thread David Howells

Export try_to_del_timer_sync() for use by the AF_RXRPC module.

Signed-Off-By: David Howells <[EMAIL PROTECTED]>
---

 kernel/timer.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/timer.c b/kernel/timer.c
index dd6c2c1..b22bd39 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -505,6 +505,8 @@ out:
 	return ret;
 }
 
+EXPORT_SYMBOL(try_to_del_timer_sync);
+
 /**
  * del_timer_sync - deactivate a timer and wait for the handler to finish.
  * @timer: the timer to be deactivated


[PATCH 03/16] AF_RXRPC: Key facility changes for AF_RXRPC [try #3]

2007-04-25 Thread David Howells

Export the keyring key type definition and document its availability.

Add alternative types into the key's type_data union to make it more useful.
Not all users necessarily want to use it as a list_head (AF_RXRPC doesn't, for
example), so make it clear that it can be used in other ways.

Signed-Off-By: David Howells <[EMAIL PROTECTED]>
---

 Documentation/keys.txt  |   12 ++++++++++++
 include/linux/key.h     |    2 ++
 security/keys/keyring.c |    2 ++
 3 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/Documentation/keys.txt b/Documentation/keys.txt
index 60c665d..81d9aa0 100644
--- a/Documentation/keys.txt
+++ b/Documentation/keys.txt
@@ -859,6 +859,18 @@ payload contents" for more information.
void unregister_key_type(struct key_type *type);
 
 
+Under some circumstances, it may be desirable to deal with a
+bundle of keys.  The facility provides access to the keyring type for managing
+such a bundle:
+
+   struct key_type key_type_keyring;
+
+This can be used with a function such as request_key() to find a specific
+keyring in a process's keyrings.  A keyring thus found can then be searched
+with keyring_search().  Note that it is not possible to use request_key() to
+search a specific keyring, so using keyrings in this way is of limited utility.
+
+
 ===
 NOTES ON ACCESSING PAYLOAD CONTENTS
 ===
diff --git a/include/linux/key.h b/include/linux/key.h
index 169f05e..a9220e7 100644
--- a/include/linux/key.h
+++ b/include/linux/key.h
@@ -160,6 +160,8 @@ struct key {
 	 */
 	union {
 		struct list_head	link;
+		unsigned long		x[2];
+		void			*p[2];
 	} type_data;
 
 	/* key data
diff --git a/security/keys/keyring.c b/security/keys/keyring.c
index ad45ce7..88292e3 100644
--- a/security/keys/keyring.c
+++ b/security/keys/keyring.c
@@ -66,6 +66,8 @@ struct key_type key_type_keyring = {
.read   = keyring_read,
 };
 
+EXPORT_SYMBOL(key_type_keyring);
+
 /*
  * semaphore to serialise link/link calls to prevent two link calls in parallel
  * introducing a cycle
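
To illustrate the type_data change: the two new members overlay the same
storage as the list_head, so a key type that does not need a list link can
keep two words or two pointers there instead.  The following userspace
sketch copies the shape of the union; the helper names td_set_counters()
and td_sum() are hypothetical, and the size check in the usage note assumes
a platform where unsigned long and pointers have the same width, as on
Linux.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace copy of the kernel's doubly linked list node. */
struct list_head {
	struct list_head *next, *prev;
};

/* Same shape as the patched type_data union in struct key. */
union type_data {
	struct list_head link;	/* for key types that chain keys together */
	unsigned long x[2];	/* or two integer words */
	void *p[2];		/* or two opaque pointers */
};

/* Hypothetical key type that stores two counters instead of a list link. */
void td_set_counters(union type_data *td, unsigned long a, unsigned long b)
{
	assert(td != NULL);
	td->x[0] = a;
	td->x[1] = b;
}

/* Read the two counters back and add them up. */
unsigned long td_sum(const union type_data *td)
{
	assert(td != NULL);
	return td->x[0] + td->x[1];
}
```

On such a platform sizeof(union type_data) stays at two pointers, so the
extra members cost nothing in struct key.  A given key type should pick one
view of the union and stick to it, since all three share the same bytes.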


[PATCH 01/16] AF_RXRPC: Move generic skbuff stuff from XFRM code to generic code [try #3]

2007-04-25 Thread David Howells

Move generic skbuff stuff from XFRM code to generic code so that AF_RXRPC can
use it too.

The kdoc comments I've attached to the functions need to be checked by whoever
wrote them as I had to make some guesses about the workings of these functions.

Signed-Off-By: David Howells <[EMAIL PROTECTED]>
---

 include/linux/skbuff.h |6 ++
 include/net/esp.h  |2 -
 net/core/skbuff.c  |  188 
 net/xfrm/xfrm_algo.c   |  169 ---
 4 files changed, 194 insertions(+), 171 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5992f65..c905d42 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -83,6 +83,7 @@
  */
 
 struct net_device;
+struct scatterlist;
 
 #ifdef CONFIG_NETFILTER
 struct nf_conntrack {
@@ -361,6 +362,11 @@ extern struct sk_buff *skb_realloc_headroom(struct sk_buff *skb,
 extern struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
 				       int newheadroom, int newtailroom,
 				       gfp_t priority);
+extern int	       skb_to_sgvec(struct sk_buff *skb,
+				    struct scatterlist *sg, int offset,
+				    int len);
+extern int	       skb_cow_data(struct sk_buff *skb, int tailbits,
+				    struct sk_buff **trailer);
 extern int	       skb_pad(struct sk_buff *skb, int pad);
 #define dev_kfree_skb(a)	kfree_skb(a)
 extern void	      skb_over_panic(struct sk_buff *skb, int len,
diff --git a/include/net/esp.h b/include/net/esp.h
index 713d039..d05d8d2 100644
--- a/include/net/esp.h
+++ b/include/net/esp.h
@@ -40,8 +40,6 @@ struct esp_data
} auth;
 };
 
-extern int skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len);
-extern int skb_cow_data(struct sk_buff *skb, int tailbits, struct sk_buff **trailer);
 extern void *pskb_put(struct sk_buff *skb, struct sk_buff *tail, int len);
 
 static inline int esp_mac_digest(struct esp_data *esp, struct sk_buff *skb,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 336958f..aa02bd4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -55,6 +55,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -2005,6 +2006,190 @@ void __init skb_init(void)
NULL, NULL);
 }
 
+/**
+ * skb_to_sgvec - Fill a scatter-gather list from a socket buffer
+ * @skb: Socket buffer containing the buffers to be mapped
+ * @sg: The scatter-gather list to map into
+ * @offset: The offset into the buffer's contents to start mapping
+ * @len: Length of buffer space to be mapped
+ *
+ * Fill the specified scatter-gather list with mappings/pointers into a
+ * region of the buffer space attached to a socket buffer.
+ */
+int
+skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len)
+{
+	int start = skb_headlen(skb);
+	int i, copy = start - offset;
+	int elt = 0;
+
+	if (copy > 0) {
+		if (copy > len)
+			copy = len;
+		sg[elt].page = virt_to_page(skb->data + offset);
+		sg[elt].offset = (unsigned long)(skb->data + offset) % PAGE_SIZE;
+		sg[elt].length = copy;
+		elt++;
+		if ((len -= copy) == 0)
+			return elt;
+		offset += copy;
+	}
+
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		int end;
+
+		BUG_TRAP(start <= offset + len);
+
+		end = start + skb_shinfo(skb)->frags[i].size;
+		if ((copy = end - offset) > 0) {
+			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+
+			if (copy > len)
+				copy = len;
+			sg[elt].page = frag->page;
+			sg[elt].offset = frag->page_offset+offset-start;
+			sg[elt].length = copy;
+			elt++;
+			if (!(len -= copy))
+				return elt;
+			offset += copy;
+		}
+		start = end;
+	}
+
+	if (skb_shinfo(skb)->frag_list) {
+		struct sk_buff *list = skb_shinfo(skb)->frag_list;
+
+		for (; list; list = list->next) {
+			int end;
+
+			BUG_TRAP(start <= offset + len);
+
+			end = start + list->len;
+			if ((copy = end - offset) > 0) {
+				if (copy > len)
+					copy = len;
+				elt += skb_to_sgvec(list, sg+elt, offset - start, copy);
+				if ((len -= copy) == 0)
+					return elt;
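
(The archived copy of this patch is cut off at this point.)  The offset
bookkeeping in skb_to_sgvec() can be followed more easily in a flat
userspace model.  The sketch below is not the kernel function: it treats
the buffer as a plain array of segment lengths, with no pages and no nested
frag_list, and the names struct seg_ref and map_region() are made up for
illustration.  It keeps the same start/end/copy arithmetic as the loops
above.

```c
#include <assert.h>
#include <stddef.h>

/* One output entry: a byte range inside a single segment. */
struct seg_ref {
	int seg;	/* index of the segment the entry points into */
	int offset;	/* byte offset inside that segment */
	int length;	/* bytes covered by this entry */
};

/*
 * Map bytes [offset, offset + len) of a buffer made of consecutive
 * segments (sizes in seg_len[0..nsegs-1]) onto entries in out[].
 * out[] must have room for nsegs entries.  Returns the entries used.
 */
int map_region(const int *seg_len, int nsegs, int offset, int len,
	       struct seg_ref *out)
{
	int start = 0;	/* logical offset where the current segment begins */
	int elt = 0;
	int i;

	assert(offset >= 0 && len >= 0);

	for (i = 0; i < nsegs && len > 0; i++) {
		int end = start + seg_len[i];
		int copy = end - offset;	/* bytes left in this segment */

		if (copy > 0) {
			if (copy > len)
				copy = len;
			out[elt].seg = i;
			out[elt].offset = offset - start;
			out[elt].length = copy;
			elt++;
			len -= copy;
			offset += copy;
		}
		start = end;
	}
	return elt;
}
```

With segments of 100, 50 and 50 bytes, a request for 60 bytes at offset 80
yields two entries: the last 20 bytes of segment 0 and the first 40 bytes
of segment 1.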
+

Re: ChunkFS - measuring cross-chunk references

2007-04-25 Thread Suparna Bhattacharya

On Wed, Apr 25, 2007 at 05:50:55AM +0530, Karuna sagar K wrote:
> On 4/24/07, Theodore Tso <[EMAIL PROTECTED]> wrote:
> >On Mon, Apr 23, 2007 at 02:53:33PM -0600, Andreas Dilger wrote:
> .
> >It would also be good to distinguish between directories referencing
> >files in another chunk, and directories referencing subdirectories in
> >another chunk (which would be simpler to handle, given the topological
> >restrictions on directories, as compared to files and hard links).
> >
> 
> Modified the tool to distinguish between
> 1. cross references between directories and files
> 2. cross references between directories and sub directories
> 3. cross references within a file (due to huge file size)

One more set of numbers to calculate would be an estimate of cross-references
when chunks span several block groups -- 1 (=128MB), 2 (=256MB), 4 (=512MB) and
8 (=1GB) -- as suggested by Kalpak.
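
A rough sketch of how such an estimate could be computed, assuming the tool
can emit each reference as a pair of block group numbers (that output
format, struct ref and count_cross_chunk() are all assumptions, not
something the current tool is known to provide):

```c
#include <assert.h>
#include <stddef.h>

/* One reference from an object in block group `from` to one in `to`. */
struct ref {
	unsigned int from;
	unsigned int to;
};

/*
 * Count references whose endpoints land in different chunks when each
 * chunk spans groups_per_chunk consecutive block groups.
 */
size_t count_cross_chunk(const struct ref *refs, size_t nrefs,
			 unsigned int groups_per_chunk)
{
	size_t i, cross = 0;

	assert(groups_per_chunk > 0);

	for (i = 0; i < nrefs; i++)
		if (refs[i].from / groups_per_chunk !=
		    refs[i].to / groups_per_chunk)
			cross++;
	return cross;
}
```

Running this with groups_per_chunk set to 1, 2, 4 and 8 over the same list
of references would give the four numbers in one pass over the filesystem.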

Once we have that, it would be nice if we can get data on results with
the tool from other people, especially with larger filesystem sizes.

Regards
Suparna

> 
> Below is the result from / partition of ext3 file system:
> 
> Number of files = 221794
> Number of directories = 24457
> Total size = 8193116 KB
> Total data stored = 7187392 KB
> Size of block groups = 131072 KB
> Number of inodes per block group = 16288
> No. of cross references between directories and sub-directories = 7791
> No. of cross references between directories and file = 657
> Total no. of cross references = 62018 (dir ref = 8448, file ref = 53570)
> 
> Thanks for the suggestions.
> 
> >There may also be special things we will need to do to handle
> >scenarios such as BackupPC, where if it looks like a directory
> >contains a huge number of hard links to a particular chunk, we'll need
> >to make sure that directory is either created in the right chunk
> >(possibly with hints from the application) or migrated to the right
> >chunk (but this might cause the inode number of the directory to
> >change --- maybe we allow this as long as the directory has never been
> >stat'ed, so that the inode number has never been observed).
> >
> >The other thing which we should consider is that chunkfs really
> >requires a 64-bit inode number space, which means either we only allow
> >it on 64-bit systems, or we need to consider a migration so that even
> >on 32-bit platforms, stat() functions like stat64(), insofar that it
> >uses a stat structure which returns a 64-bit ino_t.
> >
> >   - Ted
> >
> 
> 
> Thanks,
> Karuna



-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India


Re: [patch 0/8] mount ownership and unprivileged mount syscall (v4)

2007-04-25 Thread Karel Zak

On Wed, Apr 25, 2007 at 09:18:28AM +0200, Miklos Szeredi wrote:
> > > The following extra security measures are taken for unprivileged
> > > mounts:
> > > 
> > >  - usermounts are limited by a sysctl tunable
> > >  - force "nosuid,nodev" mount options on the created mount
> > 
> >  The original userspace "user=" solution also implies the "noexec"
> >  option by default (you can override the default by "exec" option).
> 
> Unlike "nosuid" and "nodev", I don't think "noexec" has real security
> benefits.

 Yes. I agree. 

> >  It means the kernel based solution is not fully compatible ;-(
> 
> Oh, I don't think that matters.  For traditional /etc/fstab based user
> mounts, mount(8) will have to remain suid-root, the kernel can't
> replace the fstab check.

 Ok, that makes sense. You're right that the fstab check is the more
 important part for mount(8).

 Please prepare a mount(8) patch; with the patch it will be clearer.

> We could add a new "nosubmount" or similar flag, to prevent
> submounting, but that again would go against the simplicity of the
> current approach, so I'm not sure it's worth it.

 The "nosubmount" flag is probably a good idea.

 The patches seem much better in v4. I'm a fan of having the feature in the
 kernel (and also of every change that makes mtab more and more
 obsolete :-).

Karel

> 
> Miklos

-- 
 Karel Zak  <[EMAIL PROTECTED]>

 Red Hat Czech s.r.o.
 Purkynova 99/71, 612 45 Brno, Czech Republic
 Reg.id: CZ27690016

[patch] unprivileged mounts update

2007-04-25 Thread Miklos Szeredi

From: Miklos Szeredi <[EMAIL PROTECTED]>

- refine adding "nosuid" and "nodev" flags for unprivileged mounts:
o add "nosuid", only if mounter doesn't have CAP_SETUID capability
o add "nodev", only if mounter doesn't have CAP_MKNOD capability

- allow unprivileged forced unmount, but only for FS_SAFE filesystems

- allow mounting over special files, but not symlinks

- for mounting and umounting check "fsuid" instead of "ruid"
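
The rules listed above can be summarised in a small userspace model.  This
is a sketch only: the SK_* flag values, struct mounter and both helper
functions are hypothetical stand-ins for the kernel's MNT_* flags,
capable() checks and permit_umount(), and capabilities are reduced to plain
booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch; not the kernel's constants. */
#define SK_MNT_NOSUID 0x1u
#define SK_MNT_NODEV  0x2u
#define SK_MNT_USER   0x4u

/* The bits of the mounting task that the rules look at. */
struct mounter {
	unsigned int fsuid;
	bool cap_setuid;	/* stands in for capable(CAP_SETUID) */
	bool cap_mknod;		/* stands in for capable(CAP_MKNOD) */
};

/* Flags forced onto a user mount: each restriction is tied to the one
 * capability that would let the mounter bypass it anyway. */
unsigned int user_mount_flags(const struct mounter *m)
{
	unsigned int flags = SK_MNT_USER;

	assert(m != NULL);
	if (!m->cap_setuid)
		flags |= SK_MNT_NOSUID;
	if (!m->cap_mknod)
		flags |= SK_MNT_NODEV;
	return flags;
}

/* Unprivileged umount: user mounts only, forced umount only on "safe"
 * filesystems, and the owner is compared against fsuid. */
bool may_umount(unsigned int mnt_flags, unsigned int mnt_uid, bool fs_safe,
		bool force, const struct mounter *m)
{
	assert(m != NULL);
	if (!(mnt_flags & SK_MNT_USER))
		return false;
	if (force && !fs_safe)
		return false;
	return mnt_uid == m->fsuid;
}
```

A mounter with neither capability gets both restrictions, while one holding
only the MKNOD capability gets "nosuid" but not "nodev".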

Thanks to everyone for the comments, with special thanks to Serge
Hallyn and Eric Biederman.

For testing the new functionality provided by this patchset a simple
tool similar in syntax to mount(8) is available from:

  http://www.kernel.org/pub/linux/kernel/people/mszeredi/mmount

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-22 17:48:18.0 +0200
+++ linux/fs/namespace.c2007-04-22 18:19:51.0 +0200
@@ -252,10 +252,12 @@ static int reserve_user_mount(void)
 static void __set_mnt_user(struct vfsmount *mnt)
 {
 	BUG_ON(mnt->mnt_flags & MNT_USER);
-	mnt->mnt_uid = current->uid;
+	mnt->mnt_uid = current->fsuid;
 	mnt->mnt_flags |= MNT_USER;
-	if (!capable(CAP_SYS_ADMIN))
-		mnt->mnt_flags |= MNT_NOSUID | MNT_NODEV;
+	if (!capable(CAP_SETUID))
+		mnt->mnt_flags |= MNT_NOSUID;
+	if (!capable(CAP_MKNOD))
+		mnt->mnt_flags |= MNT_NODEV;
 }
 
 static void set_mnt_user(struct vfsmount *mnt)
@@ -725,10 +727,10 @@ static bool permit_umount(struct vfsmoun
 	if (!(mnt->mnt_flags & MNT_USER))
 		return false;
 
-	if (flags & MNT_FORCE)
+	if ((flags & MNT_FORCE) && !(mnt->mnt_sb->s_type->fs_flags & FS_SAFE))
 		return false;
 
-	return mnt->mnt_uid == current->uid;
+	return mnt->mnt_uid == current->fsuid;
 }
 
 /*
@@ -792,13 +794,13 @@ static bool permit_mount(struct nameidat
 	if (type && !(type->fs_flags & FS_SAFE))
 		return false;
 
-	if (!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode))
+	if (S_ISLNK(inode->i_mode))
 		return false;
 
 	if (!(nd->mnt->mnt_flags & MNT_USER))
 		return false;
 
-	if (nd->mnt->mnt_uid != current->uid)
+	if (nd->mnt->mnt_uid != current->fsuid)
 		return false;
 
 	*flags |= MS_SETUSER;

Re: [patch 0/8] mount ownership and unprivileged mount syscall (v4)

2007-04-25 Thread Miklos Szeredi

> > The following extra security measures are taken for unprivileged
> > mounts:
> > 
> >  - usermounts are limited by a sysctl tunable
> >  - force "nosuid,nodev" mount options on the created mount
> 
>  The original userspace "user=" solution also implies the "noexec"
>  option by default (you can override the default by "exec" option).

Unlike "nosuid" and "nodev", I don't think "noexec" has real security
benefits.

>  It means the kernel based solution is not fully compatible ;-(

Oh, I don't think that matters.  For traditional /etc/fstab based user
mounts, mount(8) will have to remain suid-root, the kernel can't
replace the fstab check.

In fact the latest patches don't even support these "legacy" user
mounts too well: setting the owner of a mount gives not only umount
privilege, but the ability to submount.  This is not necessarily a
good thing for these kinds of user mounts.

We could add a new "nosubmount" or similar flag, to prevent
submounting, but that again would go against the simplicity of the
current approach, so I'm not sure it's worth it.

Miklos
