BUG: Dentry still in use during umount in 2.6.21-rc5-git6

2007-04-20 Thread Andi Kleen

One of my autoboot test clients gave me this during shutdown. It used
reiserfs and autofs and NFS heavily.

Unmounting file systems
BUG: Dentry 8100f3693a40{i=2352220,n=xattrs} still in use (1) [unmount of reiserfs sda9]
------------[ cut here ]------------
kernel BUG at /mnt/dm-2/newautoboot/autoboot/lsrc/mainline/linux/fs/dcache.c:623!
invalid opcode:  [1] SMP 
CPU 1 
Modules linked in:
Pid: 15791, comm: umount Not tainted 2.6.21-rc5-git6 #44
RIP: 0010:[]  [] shrink_dcache_for_umount_subtree+0x178/0x250
RSP: 0018:8100f5f67e18  EFLAGS: 00010292
RAX: 0060 RBX: 8100f3693a40 RCX: 5207
RDX:  RSI: 0046 RDI: 00014661
RBP: 8100f6dc9cc0 R08: 00a0 R09: 0005
R10:  R11:  R12: 8100f3693aa0
R13: 00014661 R14: 0050ea70 R15: 0050ead0
FS:  2adc863a86d0() GS:8100f7fdc1c0() knlGS:b7be38d0
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 2adc8626a688 CR3: f628b000 CR4: 06e0
Process umount (pid: 15791, threadinfo 8100f5f66000, task 8100f7a08100)
Stack:  810004dab218 810004dab000 80558860 810004dab000
  8028815b 810004dab000 8027a1a5
  8100f6c50980 806c1600 8027a2a4
Call Trace:
 [] shrink_dcache_for_umount+0x2f/0x3d
 [] generic_shutdown_super+0x19/0xf2
 [] kill_block_super+0x26/0x3b
 [] deactivate_super+0x47/0x60
 [] sys_umount+0x1f7/0x22a
 [] sys_newstat+0x19/0x31
 [] system_call+0x7e/0x83


Code: 0f 0b eb fe 48 8b 6b 28 48 39 dd 75 04 31 ed eb 04 f0 ff 4d 
RIP  [] shrink_dcache_for_umount_subtree+0x178/0x250
 RSP 
/etc/init.d/boot.d/K14boot.localfs: line 93: 15791 Segmentation fault  umount -avt noproc,nonfs,nonfs4,nosmbfs,nocifs,notmpfs



-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [d_path 0/7] Fixes to d_path: Respin

2007-04-20 Thread Miklos Szeredi
> > I gave a chroot example that showed that in the current
> > implementation, you can get pretty random clashes between mounts; there are
> > other cases with lazy unmounts as well.
> 
> Irrelevant as well.  If you create chroot problems it's your problem.
> 
> The fact is that if you have a normal setup the code works fine.  All
> other situations cannot be handled with the current kernel interface.
> 
> This does not give anybody the right to say "since the code doesn't
> always work we can break it completely".  That's completely
> unacceptable.

I'm not sure I understand the situation completely.  What exactly is
broken in libc by removing unreachable mounts from /proc/mounts?

Is it the situation when
 - file descriptor is opened
 - process does chroot
 - process does fstatvfs on file descriptor
?

In that case currently fstatvfs() _usually_ gives the correct results,
but can give wrong results if mount paths accidentally clash in
/proc/mounts?

Also, isn't it the case that fstatvfs() or statvfs() performed within
the chroot could also give an incorrect result for a _reachable_ mount
if it clashes with an unreachable mount?

If this is the case, I would think that removing the unreachable
mounts from /proc/mounts would actually be fixing this second case,
which is more likely to be used anyway.

BTW, this patch, or at least a predecessor, is in -mm, and it very much
feels like the Right Thing(tm).  The /proc/mounts under a chroot
environment actually looks sane, instead of the random crap that it
showed previously.

While we should make every effort to keep the kernel interfaces
stable, this shouldn't prevent us from fixing bugs.  And this one is
clearly a bug, even if not a very serious one.

Miklos


Re: AppArmor FAQ

2007-04-20 Thread Karl MacMillan
On Fri, 2007-04-20 at 11:45 -0700, David Lang wrote:
> On Thu, 19 Apr 2007, Stephen Smalley wrote:
> 
> > already happened to integrate such support into userland.
> >
> > To look at it in a slightly different way, the AA emphasis on not
> > modifying applications could be viewed as a limitation.  Ultimately,
> > users have security goals that go beyond just what the OS can directly
> > enforce and at least some applications (notably things like X, D-BUS,
> > PostgreSQL, etc) need to likewise support strong domain separation and
> > controlled information flow through their own internal objects and
> > operations.  SELinux provides APIs and infrastructure for such
> > applications, and has already done quite a bit of work in that space
> > (D-BUS support, XACE/XSELinux, SE-PostgreSQL), whereas AA seems to have
> > no interest in going there (and would have to recant its emphasis on no
> > application mods to do so).  If you actually want to truly confine a
> > desktop application, you can't limit yourself to the kernel.  And the
>^^^
> 
> > label model provides a unifying abstraction for dealing with all of
> > these various objects, whereas the path/"natural abstraction" model has
> > no unifying abstraction at all.
> 
> 
> AA isn't aimed at confining desktop applications. It's aimed at confining 
> server applications. This really is an easier task (if it happens to be useful 
> for some desktop apps as well, so much the better)
> 

Steve's point holds equally well for server applications - SE-PostgreSQL
is a good example.

Karl



Re: AppArmor FAQ

2007-04-20 Thread David Lang

On Thu, 19 Apr 2007, Stephen Smalley wrote:


> already happened to integrate such support into userland.
>
> To look at it in a slightly different way, the AA emphasis on not
> modifying applications could be viewed as a limitation.  Ultimately,
> users have security goals that go beyond just what the OS can directly
> enforce and at least some applications (notably things like X, D-BUS,
> PostgreSQL, etc) need to likewise support strong domain separation and
> controlled information flow through their own internal objects and
> operations.  SELinux provides APIs and infrastructure for such
> applications, and has already done quite a bit of work in that space
> (D-BUS support, XACE/XSELinux, SE-PostgreSQL), whereas AA seems to have
> no interest in going there (and would have to recant its emphasis on no
> application mods to do so).  If you actually want to truly confine a
> desktop application, you can't limit yourself to the kernel.  And the
  ^^^
> label model provides a unifying abstraction for dealing with all of
> these various objects, whereas the path/"natural abstraction" model has
> no unifying abstraction at all.



AA isn't aimed at confining desktop applications. It's aimed at confining 
server applications. This really is an easier task (if it happens to be useful 
for some desktop apps as well, so much the better)


David Lang


Re: [d_path 0/7] Fixes to d_path: Respin

2007-04-20 Thread Ulrich Drepper

On 4/20/07, Andreas Gruenbacher <[EMAIL PROTECTED]> wrote:

> The code also seems to stop at the first matching mount point. You can have
> the same device mounted on the same mount point multiple times but with
> different mount options, e.g., [...]


You can unfortunately do many stupid things.  That's the user's
problem.  The point is that everything works fine in an environment
which does not have such bogus mounts.  Namespaces are also the
problem of somebody else.  The people who came up with them didn't
think about the ramifications.  None of these problems can be
reasonably and reliably fixed with more support from the kernel.



> I gave a chroot example that showed that in the current
> implementation, you can get pretty random clashes between mounts; there are
> other cases with lazy unmounts as well.


Irrelevant as well.  If you create chroot problems it's your problem.

The fact is that if you have a normal setup the code works fine.  All
other situations cannot be handled with the current kernel interface.

This does not give anybody the right to say "since the code doesn't
always work we can break it completely".  That's completely
unacceptable.

If you want to improve the situation, do it.  Provide a solution for
the problems we are having in implementing statvfs.  Then we can talk
about no longer using /proc/mounts for statvfs and you can change it
in a way which would harm the old implementation.  That's *my* view,
but I know there will be lots of people who would even object to that.
The /proc filesystem is part of the kernel API and cannot be lightly
broken.


Re: [d_path 0/7] Fixes to d_path: Respin

2007-04-20 Thread Andreas Gruenbacher
On Friday 20 April 2007 17:24, Ulrich Drepper wrote:
> On 4/20/07, Andreas Gruenbacher <[EMAIL PROTECTED]> wrote:
> > Yes, that one, sorry. The values it obtains that way are not reliable.
>
> Why should the mount point info together with the filesystem type not
> be reliable?

Ah ... I overlooked that fstatvfs() also checks the device number, not only 
the name, and so then it can find the right device from that.

So for how glibc uses /proc/mounts in fstatvfs(), hiding unreachable mount 
points from /proc/mounts wouldn't improve things. The heuristic already 
doesn't work for file descriptors from other namespaces, so it's already 
broken unfortunately. A more robust mechanism for glibc to use would be nice; 
not sure it would be worth it only for fstatvfs though.
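
A rough userspace sketch of the device-number heuristic discussed above, for
illustration only. It is not the glibc source; the function name
mount_opts_by_dev and its return convention are made up for this example. It
uses the standard setmntent()/getmntent() interface and stat(), and the
skipped-entry branch shows where a path listed in /proc/mounts that does not
exist in the caller's root simply drops out of the search:

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Sketch only (hypothetical helper, not glibc code): find the first entry
 * in `table` (e.g. "/proc/mounts") whose mount directory is on device `dev`
 * and copy its mount options into `opts`.
 * Returns 1 if found, 0 if nothing matched, -1 if the table is unreadable.
 */
int mount_opts_by_dev(const char *table, dev_t dev,
		      char *opts, size_t opts_len)
{
	struct mntent *ent;
	struct stat st;
	int found = 0;
	FILE *fp;

	assert(table && opts && opts_len > 0);

	fp = setmntent(table, "r");
	if (!fp)
		return -1;

	while ((ent = getmntent(fp)) != NULL) {
		/* The listed path may not exist in this root or namespace. */
		if (stat(ent->mnt_dir, &st) != 0)
			continue;
		if (st.st_dev == dev) {
			snprintf(opts, opts_len, "%s", ent->mnt_opts);
			found = 1;
			break;	/* first match wins */
		}
	}
	endmntent(fp);
	return found;
}
```

Because the comparison is made against whatever the listed path resolves to
in the caller's root, a path that resolves to a different mount of the same
device may yield that mount's options instead.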

The code also seems to stop at the first matching mount point. You can have 
the same device mounted on the same mount point multiple times but with 
different mount options, e.g.,

$ dd if=/dev/zero of=/var/tmp/ext2 bs=4096 count=16384
$ mkfs.ext2 -F /var/tmp/ext2
$ mount -o loop /var/tmp/ext2 /mnt
$ mount -o loop,ro /var/tmp/ext2 /mnt
$ tail -n 2 /proc/mounts
/dev/loop0 /mnt ext2 rw 0 0
/dev/loop1 /mnt ext2 ro 0 0

The topmost mount point appears last in /proc/mounts, and so unless I am 
overlooking something else, that's another minor problem.
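
To make the first-match versus last-match difference concrete, here is a
small sketch that scans text in /proc/mounts format. The helper
lookup_mount_opts and its want_last switch are hypothetical, and the parser
assumes well-formed lines without octal escapes such as \040:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not glibc code: scan `mounts` (text in /proc/mounts
 * format, at least four fields per line) for entries mounted on `dir` and
 * copy the options of one of them into `opts`.  With want_last == 0 the
 * first matching entry is kept; otherwise later entries override earlier
 * ones, so the topmost mount wins.  Returns the number of matching entries.
 */
int lookup_mount_opts(const char *mounts, const char *dir, int want_last,
		      char *opts, size_t opts_len)
{
	char dev[256], mnt[256], type[64], o[256];
	const char *line = mounts;
	int matches = 0;

	assert(mounts && dir && opts && opts_len > 0);
	opts[0] = '\0';

	while (*line) {
		const char *eol = strchr(line, '\n');

		if (sscanf(line, "%255s %255s %63s %255s",
			   dev, mnt, type, o) == 4 &&
		    strcmp(mnt, dir) == 0) {
			if (matches == 0 || want_last)
				snprintf(opts, opts_len, "%s", o);
			matches++;
		}
		if (!eol)
			break;
		line = eol + 1;
	}
	return matches;
}
```

Fed the two /mnt lines shown above, the first-match mode reports "rw" while
the last-match mode reports "ro", which is the option set of the topmost
mount.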

The third problem, as I already tried to argue several times now, is that the 
mount point paths that /proc/mounts reports may or may not actually exist. 
That's a problem for glibc, and you should be one of the first to notice and 
acknowledge that. I gave a chroot example that showed that in the current 
implementation, you can get pretty random clashes between mounts; there are 
other cases with lazy unmounts as well.

> You're trying to find an excuse to break things, that seems all there is.

Now what makes you think that??

Andreas


Re: [patch 0/8] mount ownership and unprivileged mount syscall (v4)

2007-04-20 Thread Eric W. Biederman
"Serge E. Hallyn" <[EMAIL PROTECTED]> writes:

> Quoting Miklos Szeredi ([EMAIL PROTECTED]):
>> This patchset has now been pared down to the "lowest common denominator"
>> that everybody can agree on.  Or at least there weren't any objections
>> to this proposal.
>> 
>> Andrew, please consider it for -mm.
>> 
>> Thanks,
>> Miklos
>> 
>> 
>> v3 -> v4:
>> 
>>  - simplify interface as much as possible, now only a single option
>>    ("user=UID") is used to control everything
>>  - no longer allow/deny mounting based on file/directory permissions,
>>    that approach does not always make sense
>> 
>> 
>> This patchset adds support for keeping mount ownership information in
>> the kernel, and allows unprivileged mount(2) and umount(2) in certain
>> cases.
>> 
>> The mount owner has the following privileges:
>> 
>>   - unmount the owned mount
>>   - create a submount under the owned mount
>> 
>> The sysadmin can set the owner explicitly on mount and remount.  When
>> an unprivileged user creates a mount, then the owner is automatically
>> set to the user.
>> 
>> The following use cases are envisioned:
>> 
>> 1) Private namespace, with selected mounts owned by user.
>>    E.g. /home/$USER is a good candidate for allowing unpriv mounts and
>>    unmounts within.
>> 
>> 2) Private namespace, with all mounts owned by user and having the
>>    "nosuid" flag.  User can mount and umount anywhere within the
>>    namespace, but suid programs will not work.
>> 
>> 3) Global namespace, with a designated directory, which is a mount
>>    owned by the user.  E.g. /mnt/users/$USER is set up so that it is
>>    bind mounted onto itself, and set to be owned by $USER.  The user
>>    can add/remove mounts only under this directory.
>> 
>> The following extra security measures are taken for unprivileged
>> mounts:
>> 
>>  - usermounts are limited by a sysctl tunable
>>  - force "nosuid,nodev" mount options on the created mount
>
> Very nice.  I like these semantics.
>
> I'll try to rework my laptop in the next few days to use this patchset
> as a test.

Agreed.  It appears the approach of adding ownership information to
mount points and using that to control what may happen with them
with regard to mount/unmount is the only workable approach in the
Unix environment.

Now to dig into the details and ensure that they are correct.

Eric


Re: [d_path 0/7] Fixes to d_path: Respin

2007-04-20 Thread Ulrich Drepper

On 4/20/07, Andreas Gruenbacher <[EMAIL PROTECTED]> wrote:

> Possibly for fstatfs(): fstatfs() has no way of looking up mount points per
> path name in /proc/mounts, and so it resorts to mapping from the numeric
> statfs->f_type to the filesystem name (e.g., "ext3"), looks up the first
> mount point with that name, and sets the statfs->f_flag flags based on that
> entry. This field may change from one arbitrary value to another.


What are you talking about?  fstatfs is a syscall, we do nothing but
copy values around at userlevel.

statvfs on the other hand does use /proc/mounts.  And it most
certainly does look at the mount point before looking at the
filesystem type.


Re: Interface for the new fallocate() system call

2007-04-20 Thread Jakub Jelinek
On Fri, Apr 20, 2007 at 07:21:46PM +0530, Amit K. Arora wrote:
> Ok.
> In this case we may have to consider the following things:
> 
> 1) Obviously, for this glibc will have to call fallocate() syscall with
> different arguments on s390, than other archs. I think this should be
> doable and should not be an issue with glibc folks (right?).

glibc can cope with this easily; it will just add
sysdeps/unix/sysv/linux/s390/fallocate.c or something similar to override
the generic Linux implementation.
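
As an illustration of that kind of override, here is a sketch of an fd-first
wrapper around an entry point that takes the two 64-bit arguments first.
Everything here is hypothetical: raw_fallocate_wide_first only records its
arguments and stands in for the real system call, whose number and final
argument order were still under discussion, and my_fallocate is just a name
for this example:

```c
#include <assert.h>
#include <stdint.h>

/* Arguments as seen by the stand-in for the kernel entry point. */
struct raw_args {
	int64_t offset;
	int64_t len;
	int fd;
	int mode;
};

static struct raw_args last_raw;	/* filled in by the stand-in below */

/*
 * Hypothetical stand-in for a raw syscall whose ABI wants the two 64-bit
 * arguments first.  A real wrapper would invoke the system call here.
 */
long raw_fallocate_wide_first(int64_t offset, int64_t len, int fd, int mode)
{
	last_raw = (struct raw_args){ offset, len, fd, mode };
	return 0;
}

/* Conventional fd-first interface exported to applications. */
long my_fallocate(int fd, int mode, int64_t offset, int64_t len)
{
	return raw_fallocate_wide_first(offset, len, fd, mode);
}

/* Usage: the caller never sees the reordering. */
void my_fallocate_demo(void)
{
	long ret = my_fallocate(3, 0, 4096, 8192);

	assert(ret == 0);
	assert(last_raw.fd == 3 && last_raw.len == 8192);
}
```

Only the wrapper knows about the per-architecture order, so applications
keep calling the same fd-first prototype everywhere.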

> 2) we also need to see how strace behaves in this case. With the little
> knowledge that I have of strace, I don't think it should depend on
> argument ordering of a system call on different archs (since it uses
> ptrace internally and that should take care of it). But, it will be
> nice if someone can confirm this.

strace would solve this with an #ifdef mess; it already does that in many
places, so I guess another few lines won't make it significantly worse.

Jakub


Re: [d_path 0/7] Fixes to d_path: Respin

2007-04-20 Thread Ulrich Drepper

On 4/20/07, Andreas Gruenbacher <[EMAIL PROTECTED]> wrote:

> Yes, that one, sorry. The values it obtains that way are not reliable.


Why should the mount point info together with the filesystem type not
be reliable?  You're trying to find an excuse to break things, that
seems all there is.


Re: Interface for the new fallocate() system call

2007-04-20 Thread Amit K. Arora
On Wed, Apr 18, 2007 at 07:06:00AM -0600, Andreas Dilger wrote:
> On Apr 17, 2007  18:25 +0530, Amit K. Arora wrote:
> > On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote:
> > > Wouldn't
> > > int fallocate(loff_t offset, loff_t len, int fd, int mode)
> > > work on both s390 and ppc/arm?  glibc will certainly wrap it and
> > > reorder the arguments as needed, so there is no need to keep fd first.
> > 
> > I think more people are comfortable with this approach.
> 
> Really?  I thought from the last postings that "fd first, wrap on s390"
> was better.
> 
> > Since glibc
> > will wrap the system call and export the "conventional" interface
> > (with fd first) to applications, we may not worry about keeping fd first
> > in kernel code. I am personally fine with this approach.
> 
> It would seem to make more sense to wrap the syscall on those architectures
> that can't handle the "conventional" interface (fd first).

Ok.
In this case we may have to consider the following things:

1) Obviously, for this glibc will have to call fallocate() syscall with
different arguments on s390, than other archs. I think this should be
doable and should not be an issue with glibc folks (right?).

2) we also need to see how strace behaves in this case. With the little
knowledge that I have of strace, I don't think it should depend on
argument ordering of a system call on different archs (since it uses
ptrace internally and that should take care of it). But, it will be
nice if someone can confirm this.

Thanks!
--
Regards,
Amit Arora


Re: [d_path 0/7] Fixes to d_path: Respin

2007-04-20 Thread Andreas Gruenbacher
On Friday 20 April 2007 17:15, Ulrich Drepper wrote:
> On 4/20/07, Andreas Gruenbacher <[EMAIL PROTECTED]> wrote:
> > Possibly for fstatfs(): fstatfs() has no way of looking up mount points
> > per path name in /proc/mounts, and so it resorts to mapping from the
> > numeric statfs->f_type to the filesystem name (e.g., "ext3"), looks up
> > the first mount point with that name, and sets the statfs->f_flag flags
> > based on that entry. This field may change from one arbitrary value to
> > another.
>
> What are you talking about?  fstatfs is a syscall, we do nothing but
> copy values around at userlevel.
>
> statvfs on the other hand does use /proc/mounts.  And it most
> certainly does look at the mount point before looking at the
> filesystem type.

Yes, that one, sorry. The values it obtains that way are not reliable.

Thanks,
Andreas


Re: [patch 0/8] mount ownership and unprivileged mount syscall (v4)

2007-04-20 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> This patchset has now been pared down to the "lowest common denominator"
> that everybody can agree on.  Or at least there weren't any objections
> to this proposal.
> 
> Andrew, please consider it for -mm.
> 
> Thanks,
> Miklos
> 
> 
> v3 -> v4:
> 
>  - simplify interface as much as possible, now only a single option
>    ("user=UID") is used to control everything
>  - no longer allow/deny mounting based on file/directory permissions,
>    that approach does not always make sense
> 
> 
> This patchset adds support for keeping mount ownership information in
> the kernel, and allows unprivileged mount(2) and umount(2) in certain
> cases.
> 
> The mount owner has the following privileges:
> 
>   - unmount the owned mount
>   - create a submount under the owned mount
> 
> The sysadmin can set the owner explicitly on mount and remount.  When
> an unprivileged user creates a mount, then the owner is automatically
> set to the user.
> 
> The following use cases are envisioned:
> 
> 1) Private namespace, with selected mounts owned by user.
>    E.g. /home/$USER is a good candidate for allowing unpriv mounts and
>    unmounts within.
> 
> 2) Private namespace, with all mounts owned by user and having the
>    "nosuid" flag.  User can mount and umount anywhere within the
>    namespace, but suid programs will not work.
> 
> 3) Global namespace, with a designated directory, which is a mount
>    owned by the user.  E.g. /mnt/users/$USER is set up so that it is
>    bind mounted onto itself, and set to be owned by $USER.  The user
>    can add/remove mounts only under this directory.
> 
> The following extra security measures are taken for unprivileged
> mounts:
> 
>  - usermounts are limited by a sysctl tunable
>  - force "nosuid,nodev" mount options on the created mount

Very nice.  I like these semantics.

I'll try to rework my laptop in the next few days to use this patchset
as a test.

thanks,
-serge


Re: [d_path 0/7] Fixes to d_path: Respin

2007-04-20 Thread Andreas Gruenbacher
On Friday 20 April 2007 11:30, Alan Cox wrote:
> > As far as I can see, glibc internally looks at /proc/mounts (or else
> > mtab) to find out where tmpfs is mounted for opening files there, and to
> > look up filesystem information for statfs(), while accessing that path,
> > too. Fstatfs() also looks into the same files, but it only matches by
> > filesystem type, so this is only a very unreliable heuristic, anyway.
> >
> > So judging from that, glibc users should be fine.
>
> So glibc does use it and you will change behaviour

Not for statfs(), shm_open(), and sem_open().

Possibly for fstatfs(): fstatfs() has no way of looking up mount points per 
path name in /proc/mounts, and so it resorts to mapping from the numeric 
statfs->f_type to the filesystem name (e.g., "ext3"), looks up the first 
mount point with that name, and sets the statfs->f_flag flags based on that 
entry. This field may change from one arbitrary value to another.

Andreas


relayFS question: calling relay_write() from interrupt context does not write to file - why ?

2007-04-20 Thread Rami Rosen

I am trying to use relayFS from an interrupt context. I read the
documentation and downloaded and ran successfully the examples from
http://relayfs.sourceforge.net.

I am running on a Fedora 6 machine with the 2.6.18-1.2798.fc6 kernel (no
patches). I have two years of experience in Linux kernel programming.

I have successfully mounted debugfs on /debug.

According to relay.txt from the Linux documentation,
relay_write() should be used if you might be logging from interrupt context.

My module needs to write from interrupt context (in fact, it is a soft
interrupt) so I tried using relay_write.

My user space application tries to read the relayFS files using read(), not
mmap(), following the read-mod kernel module from the examples (which I
tried successfully).

What happens is this: when I try to read (with cat, which eventually uses
read()) the relayFS files that my module generated from interrupt context,
I get nothing, whereas files generated from a kernel thread with
relay_write() can be read with cat without problems.

I am attaching the code for the test module I wrote and I hope you can
take a look at it; I simplified it and removed everything which is
not directly connected to my question.

Concretely: when calling start_test_thread_new(), we DO succeed in reading
the file with cat /debug/testTree/cpu0, whereas with the second
alternative, calling test_thread_new() directly from main_hook() (which
runs in software interrupt context), we do not read anything
(cat /debug/testTree/cpu0 shows nothing; there is, however, no segfault
when running "cat /debug/testTree/cpu0" in that case).

Below is test.c, a short module which I wrote demonstrating this
problem.


Any ideas?

Regards,
Rami Rosen



// test.c

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 


#define MAX_EVENT_SIZE 256


struct dentry *dir;
struct rchan *channel;
static struct completion done;
static int hooksRegistered = 0;
static struct task_struct *kthread_thread;

static int test_thread_new(void *unused)
{
	int i, count;
	char buf[MAX_EVENT_SIZE + 1];

	for (i = 0; i < 10; i++) {
		count = snprintf(buf, MAX_EVENT_SIZE,
				 "[%08i]test event\n", i);

		relay_write(channel, buf, count);
	}
	return 0;
}


static void start_test_thread_new(void)
{
	int cpu = 0;
	struct task_struct *p;

	printk("in start_test_thread_new\n");
	init_completion(&done);

	p = kthread_create(test_thread_new, NULL, "%s/%d", "test", cpu);
	if (IS_ERR(p))
		return;
	if (p) {
		kthread_bind(p, cpu);
		wake_up_process(p);
		kthread_thread = p;
	}
}


static int test_subbuf_start_callback(struct rchan_buf *buf, void *subbuf,
				      void *prev_subbuf, size_t prev_padding)
{
	if (relay_buf_full(buf)) {
		printk("buffer full\n");
		return 0;
	}

	return 1;
}

static int test_remove_buf_file_callback(struct dentry *dentry)
{
 debugfs_remove(dentry);
 return 0;
}


static struct dentry *test_create_buf_file_callback(const char *filename,
   struct dentry *parent,
   int mode,
   struct rchan_buf *buf,
   int *is_global)
{
 return debugfs_create_file(filename, mode, parent, buf,
   &relay_file_operations);
}

static struct nf_hook_ops netfilter_ops_in;

static int inithook(void);

//  


unsigned int main_hook(unsigned int hooknum,
		       struct sk_buff **skb,
		       const struct net_device *in,
		       const struct net_device *out,
		       int (*okfn)(struct sk_buff *))
{
	static int counter = 0;
	void *unused = NULL;

	counter++;
	printk("in main_hook counter=%d\n", counter);
	test_thread_new(unused);
	return NF_ACCEPT;
}

//

static int inithook()
{
int i;
printk("starting i

[patch 0/8] mount ownership and unprivileged mount syscall (v4)

2007-04-20 Thread Miklos Szeredi
This patchset has now been pared down to the "lowest common denominator"
that everybody can agree on.  Or at least there weren't any objections
to this proposal.

Andrew, please consider it for -mm.

Thanks,
Miklos


v3 -> v4:

 - simplify interface as much as possible, now only a single option
   ("user=UID") is used to control everything
 - no longer allow/deny mounting based on file/directory permissions,
   that approach does not always make sense


This patchset adds support for keeping mount ownership information in
the kernel, and allows unprivileged mount(2) and umount(2) in certain
cases.

The mount owner has the following privileges:

  - unmount the owned mount
  - create a submount under the owned mount

The sysadmin can set the owner explicitly on mount and remount.  When
an unprivileged user creates a mount, then the owner is automatically
set to the user.

The following use cases are envisioned:

1) Private namespace, with selected mounts owned by user.
   E.g. /home/$USER is a good candidate for allowing unpriv mounts and
   unmounts within.

2) Private namespace, with all mounts owned by user and having the
   "nosuid" flag.  User can mount and umount anywhere within the
   namespace, but suid programs will not work.

3) Global namespace, with a designated directory, which is a mount
   owned by the user.  E.g. /mnt/users/$USER is set up so that it is
   bind mounted onto itself, and set to be owned by $USER.  The user
   can add/remove mounts only under this directory.

The following extra security measures are taken for unprivileged
mounts:

 - usermounts are limited by a sysctl tunable
 - force "nosuid,nodev" mount options on the created mount
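
The rules above can be modelled in a few lines of userspace C. This is only
a sketch under assumptions: the sk_ names and flag values are invented for
illustration, the sysadmin path for creating mounts is left out, and the
real checks live in the patches themselves:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits for this sketch; not the kernel's values. */
#define SK_NOSUID 0x1
#define SK_NODEV  0x2
#define SK_USER   0x4		/* mount has an owner */

struct sk_mount {
	unsigned int flags;
	unsigned int owner;	/* meaningful only if SK_USER is set */
};

int sk_nr_user_mounts;		/* current number of user mounts */
int sk_max_user_mounts = 1024;	/* models the sysctl tunable */

/* The owner of a mount (or the sysadmin) may unmount it. */
bool sk_may_umount(const struct sk_mount *m, unsigned int uid, bool admin)
{
	assert(m);
	if (admin)
		return true;
	return (m->flags & SK_USER) && m->owner == uid;
}

/*
 * Unprivileged user `uid` creates a submount under `parent`.
 * Returns false if the request is refused.
 */
bool sk_user_mount(const struct sk_mount *parent, unsigned int uid,
		   unsigned int flags, struct sk_mount *out)
{
	assert(parent && out);
	if (!(parent->flags & SK_USER) || parent->owner != uid)
		return false;	/* not under a mount owned by the user */
	if (sk_nr_user_mounts >= sk_max_user_mounts)
		return false;	/* user mount limit reached */

	/* "nosuid,nodev" is forced, and the creator becomes the owner. */
	out->flags = flags | SK_USER | SK_NOSUID | SK_NODEV;
	out->owner = uid;
	sk_nr_user_mounts++;
	return true;
}
```

In this model a user with uid 1000 who owns /home/$USER can create and later
remove a submount there, while any other unprivileged uid is refused.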

--


[patch 3/8] account user mounts

2007-04-20 Thread Miklos Szeredi
From: Miklos Szeredi <[EMAIL PROTECTED]>

Add sysctl variables for accounting and limiting the number of user
mounts.

The maximum number of user mounts is set to 1024 by default.  This
won't in itself enable user mounts; a mount must first be set to be
owned by a user.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/include/linux/sysctl.h
===
--- linux.orig/include/linux/sysctl.h   2007-04-20 11:55:02.0 +0200
+++ linux/include/linux/sysctl.h        2007-04-20 11:55:07.0 +0200
@@ -818,6 +818,8 @@ enum
FS_AIO_NR=18,   /* current system-wide number of aio requests */
FS_AIO_MAX_NR=19,   /* system-wide maximum number of aio requests */
FS_INOTIFY=20,  /* inotify submenu */
+   FS_NR_USER_MOUNTS=21,   /* int:current number of user mounts */
+   FS_MAX_USER_MOUNTS=22,  /* int:maximum number of user mounts */
FS_OCFS2=988,   /* ocfs2 */
 };
 
Index: linux/kernel/sysctl.c
===
--- linux.orig/kernel/sysctl.c  2007-04-20 11:55:02.0 +0200
+++ linux/kernel/sysctl.c   2007-04-20 11:55:07.0 +0200
@@ -1063,6 +1063,22 @@ static ctl_table fs_table[] = {
 #endif 
 #endif
{
+   .ctl_name   = FS_NR_USER_MOUNTS,
+   .procname   = "nr_user_mounts",
+   .data   = &nr_user_mounts,
+   .maxlen = sizeof(int),
+   .mode   = 0444,
+   .proc_handler   = &proc_dointvec,
+   },
+   {
+   .ctl_name   = FS_MAX_USER_MOUNTS,
+   .procname   = "max_user_mounts",
+   .data   = &max_user_mounts,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = &proc_dointvec,
+   },
+   {
.ctl_name   = KERN_SETUID_DUMPABLE,
.procname   = "suid_dumpable",
.data   = &suid_dumpable,
Index: linux/Documentation/filesystems/proc.txt
===
--- linux.orig/Documentation/filesystems/proc.txt   2007-04-20 11:55:02.0 +0200
+++ linux/Documentation/filesystems/proc.txt        2007-04-20 11:55:07.0 +0200
@@ -923,6 +923,15 @@ reaches aio-max-nr then io_setup will fa
 raising aio-max-nr does not result in the pre-allocation or re-sizing
 of any kernel data structures.
 
+nr_user_mounts and max_user_mounts
+--
+
+These represent the number of "user" mounts and the maximum number of
+"user" mounts respectively.  User mounts may be created by
+unprivileged users.  User mounts may also be created with sysadmin
+privileges on behalf of a user, in which case nr_user_mounts may
+exceed max_user_mounts.
+
 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
 ---
 
Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-20 11:55:06.0 +0200
+++ linux/fs/namespace.c        2007-04-20 11:55:07.0 +0200
@@ -39,6 +39,9 @@ static int hash_mask __read_mostly, hash
 static struct kmem_cache *mnt_cache __read_mostly;
 static struct rw_semaphore namespace_sem;
 
+int nr_user_mounts;
+int max_user_mounts = 1024;
+
 /* /sys/fs */
 decl_subsys(fs, NULL, NULL);
 EXPORT_SYMBOL_GPL(fs_subsys);
@@ -227,11 +230,30 @@ static struct vfsmount *skip_mnt_tree(st
return p;
 }
 
+static void dec_nr_user_mounts(void)
+{
+   spin_lock(&vfsmount_lock);
+   nr_user_mounts--;
+   spin_unlock(&vfsmount_lock);
+}
+
 static void set_mnt_user(struct vfsmount *mnt)
 {
BUG_ON(mnt->mnt_flags & MNT_USER);
mnt->mnt_uid = current->uid;
mnt->mnt_flags |= MNT_USER;
+   spin_lock(&vfsmount_lock);
+   nr_user_mounts++;
+   spin_unlock(&vfsmount_lock);
+}
+
+static void clear_mnt_user(struct vfsmount *mnt)
+{
+   if (mnt->mnt_flags & MNT_USER) {
+   mnt->mnt_uid = 0;
+   mnt->mnt_flags &= ~MNT_USER;
+   dec_nr_user_mounts();
+   }
 }
 
 static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
@@ -283,6 +305,7 @@ static inline void __mntput(struct vfsmo
 {
struct super_block *sb = mnt->mnt_sb;
dput(mnt->mnt_root);
+   clear_mnt_user(mnt);
free_vfsmnt(mnt);
deactivate_super(sb);
 }
@@ -1023,6 +1046,7 @@ static int do_remount(struct nameidata *
down_write(&sb->s_umount);
err = do_remount_sb(sb, flags, data, 0);
if (!err) {
+   clear_mnt_user(nd->mnt);
nd->mnt->mnt_flags = mnt_flags;
if (flags & MS_SETUSER)
set_mnt_user(nd->mnt);
Index: linux/include/linux/fs.h

[patch 7/8] allow unprivileged mounts

2007-04-20 Thread Miklos Szeredi
From: Miklos Szeredi <[EMAIL PROTECTED]>

Define a new fs flag, FS_SAFE, which denotes that unprivileged
mounting of this filesystem may not constitute a security problem.

Since most filesystems haven't been designed with unprivileged
mounting in mind, a thorough audit is needed before setting this flag.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-20 11:55:10.0 +0200
+++ linux/fs/namespace.c   2007-04-20 11:55:13.0 +0200
@@ -781,13 +781,17 @@ asmlinkage long sys_oldumount(char __use
  * - mountpoint is not a symlink or special file
  * - mountpoint is in a mount owned by the user
  */
-static bool permit_mount(struct nameidata *nd, int *flags)
+static bool permit_mount(struct nameidata *nd, struct file_system_type *type,
+int *flags)
 {
struct inode *inode = nd->dentry->d_inode;
 
if (capable(CAP_SYS_ADMIN))
return true;
 
+   if (type && !(type->fs_flags & FS_SAFE))
+   return false;
+
if (!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode))
return false;
 
@@ -1021,7 +1025,7 @@ static int do_loopback(struct nameidata 
struct vfsmount *mnt = NULL;
int err;
 
-   if (!permit_mount(nd, &flags))
+   if (!permit_mount(nd, NULL, &flags))
return -EPERM;
if (!old_name || !*old_name)
return -EINVAL;
@@ -1182,26 +1186,46 @@ out:
  * create a new mount for userspace and request it to be added into the
  * namespace's tree
  */
-static int do_new_mount(struct nameidata *nd, char *type, int flags,
+static int do_new_mount(struct nameidata *nd, char *fstype, int flags,
int mnt_flags, char *name, void *data)
 {
+   int err;
struct vfsmount *mnt;
+   struct file_system_type *type;
 
-   if (!type || !memchr(type, 0, PAGE_SIZE))
+   if (!fstype || !memchr(fstype, 0, PAGE_SIZE))
return -EINVAL;
 
-   /* we need capabilities... */
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
-
-   mnt = do_kern_mount(type, flags & ~MS_SETUSER, name, data);
-   if (IS_ERR(mnt))
+   type = get_fs_type(fstype);
+   if (!type)
+   return -ENODEV;
+
+   err = -EPERM;
+   if (!permit_mount(nd, type, &flags))
+   goto out_put_filesystem;
+
+   if (flags & MS_SETUSER) {
+   err = reserve_user_mount();
+   if (err)
+   goto out_put_filesystem;
+   }
+
+   mnt = vfs_kern_mount(type, flags & ~MS_SETUSER, name, data);
+   put_filesystem(type);
+   if (IS_ERR(mnt)) {
+   if (flags & MS_SETUSER)
+   dec_nr_user_mounts();
return PTR_ERR(mnt);
+   }
 
if (flags & MS_SETUSER)
-   set_mnt_user(mnt);
+   __set_mnt_user(mnt);
 
return do_add_mount(mnt, nd, mnt_flags, NULL);
+
+ out_put_filesystem:
+   put_filesystem(type);
+   return err;
 }
 
 /*
@@ -1231,7 +1255,7 @@ int do_add_mount(struct vfsmount *newmnt
if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode))
goto unlock;
 
-   /* MNT_USER was set earlier */
+   /* some flags may have been set earlier */
newmnt->mnt_flags |= mnt_flags;
if ((err = graft_tree(newmnt, nd)))
goto unlock;
Index: linux/include/linux/fs.h
===
--- linux.orig/include/linux/fs.h   2007-04-20 11:55:11.0 +0200
+++ linux/include/linux/fs.h   2007-04-20 11:55:13.0 +0200
@@ -96,6 +96,7 @@ extern int dir_notify_enable;
 #define FS_REQUIRES_DEV 1 
 #define FS_BINARY_MOUNTDATA 2
 #define FS_HAS_SUBTYPE 4
+#define FS_SAFE 8  /* Safe to mount by unprivileged users */
 #define FS_REVAL_DOT   16384   /* Check the paths ".", ".." for staleness */
 #define FS_RENAME_DOES_D_MOVE  32768   /* FS will handle d_move()
 * during rename() internally.

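
A minimal userspace sketch of the filesystem-type check that patch 7/8
adds to permit_mount().  The flag values are copied from the fs.h hunk
above; struct fs_type_model and type_permits_mount() are hypothetical
names used only for illustration, and the is_admin parameter stands in
for capable(CAP_SYS_ADMIN).  This is a model under those assumptions,
not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values as they appear in the fs.h hunk of this patch. */
#define FS_REQUIRES_DEV     1
#define FS_BINARY_MOUNTDATA 2
#define FS_HAS_SUBTYPE      4
#define FS_SAFE             8

/* FS_SAFE has to be a bit of its own for the mask test below to work. */
static_assert((FS_SAFE & (FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA |
			  FS_HAS_SUBTYPE)) == 0,
	      "FS_SAFE must not overlap the other flags");

/* Hypothetical stand-in for struct file_system_type. */
struct fs_type_model {
	const char *name;
	int fs_flags;
};

/*
 * Models only the filesystem-type part of permit_mount(): a privileged
 * caller is always allowed, a NULL type (the bind mount case) skips the
 * check, and otherwise the type must carry FS_SAFE.
 */
static bool type_permits_mount(const struct fs_type_model *type,
			       bool is_admin)
{
	if (is_admin)
		return true;
	if (type != NULL && !(type->fs_flags & FS_SAFE))
		return false;
	return true;
}
```

With these definitions a type flagged FS_HAS_SUBTYPE | FS_SAFE is
accepted for an unprivileged caller, while one flagged only
FS_REQUIRES_DEV is refused unless is_admin is true.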


[patch 5/8] allow unprivileged bind mounts

2007-04-20 Thread Miklos Szeredi
From: Miklos Szeredi <[EMAIL PROTECTED]>

Allow unprivileged users to create bind mounts if the following
conditions are met:

  - mountpoint is not a symlink or special file
  - parent mount is owned by the user
  - the number of user mounts is below the maximum

Unprivileged mounts imply MS_SETUSER, and will also have the "nosuid"
and "nodev" mount flags set.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-20 11:55:09.0 +0200
+++ linux/fs/namespace.c   2007-04-20 11:55:10.0 +0200
@@ -237,11 +237,30 @@ static void dec_nr_user_mounts(void)
spin_unlock(&vfsmount_lock);
 }
 
-static void set_mnt_user(struct vfsmount *mnt)
+static int reserve_user_mount(void)
+{
+   int err = 0;
+   spin_lock(&vfsmount_lock);
+   if (nr_user_mounts >= max_user_mounts && !capable(CAP_SYS_ADMIN))
+   err = -EPERM;
+   else
+   nr_user_mounts++;
+   spin_unlock(&vfsmount_lock);
+   return err;
+}
+
+static void __set_mnt_user(struct vfsmount *mnt)
 {
BUG_ON(mnt->mnt_flags & MNT_USER);
mnt->mnt_uid = current->uid;
mnt->mnt_flags |= MNT_USER;
+   if (!capable(CAP_SYS_ADMIN))
+   mnt->mnt_flags |= MNT_NOSUID | MNT_NODEV;
+}
+
+static void set_mnt_user(struct vfsmount *mnt)
+{
+   __set_mnt_user(mnt);
spin_lock(&vfsmount_lock);
nr_user_mounts++;
spin_unlock(&vfsmount_lock);
@@ -260,9 +279,16 @@ static struct vfsmount *clone_mnt(struct
int flag)
 {
struct super_block *sb = old->mnt_sb;
-   struct vfsmount *mnt = alloc_vfsmnt(old->mnt_devname);
+   struct vfsmount *mnt;
+
+   if (flag & CL_SETUSER) {
+   int err = reserve_user_mount();
+   if (err)
+   return ERR_PTR(err);
+   }
+   mnt = alloc_vfsmnt(old->mnt_devname);
if (!mnt)
-   return ERR_PTR(-ENOMEM);
+   goto alloc_failed;
 
mnt->mnt_flags = old->mnt_flags;
atomic_inc(&sb->s_active);
@@ -274,7 +300,7 @@ static struct vfsmount *clone_mnt(struct
/* don't copy the MNT_USER flag */
mnt->mnt_flags &= ~MNT_USER;
if (flag & CL_SETUSER)
-   set_mnt_user(mnt);
+   __set_mnt_user(mnt);
 
if (flag & CL_SLAVE) {
list_add(&mnt->mnt_slave, &old->mnt_slave_list);
@@ -299,6 +325,11 @@ static struct vfsmount *clone_mnt(struct
spin_unlock(&vfsmount_lock);
}
return mnt;
+
+ alloc_failed:
+   if (flag & CL_SETUSER)
+   dec_nr_user_mounts();
+   return ERR_PTR(-ENOMEM);
 }
 
 static inline void __mntput(struct vfsmount *mnt)
@@ -745,22 +776,29 @@ asmlinkage long sys_oldumount(char __use
 
 #endif
 
-static int mount_is_safe(struct nameidata *nd)
+/*
+ * Conditions for unprivileged mounts are:
+ * - mountpoint is not a symlink or special file
+ * - mountpoint is in a mount owned by the user
+ */
+static bool permit_mount(struct nameidata *nd, int *flags)
 {
+   struct inode *inode = nd->dentry->d_inode;
+
if (capable(CAP_SYS_ADMIN))
-   return 0;
-   return -EPERM;
-#ifdef notyet
-   if (S_ISLNK(nd->dentry->d_inode->i_mode))
-   return -EPERM;
-   if (nd->dentry->d_inode->i_mode & S_ISVTX) {
-   if (current->uid != nd->dentry->d_inode->i_uid)
-   return -EPERM;
-   }
-   if (vfs_permission(nd, MAY_WRITE))
-   return -EPERM;
-   return 0;
-#endif
+   return true;
+
+   if (!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode))
+   return false;
+
+   if (!(nd->mnt->mnt_flags & MNT_USER))
+   return false;
+
+   if (nd->mnt->mnt_uid != current->uid)
+   return false;
+
+   *flags |= MS_SETUSER;
+   return true;
 }
 
 static int lives_below_in_same_fs(struct dentry *d, struct dentry *dentry)
@@ -981,9 +1019,10 @@ static int do_loopback(struct nameidata 
int clone_flags;
struct nameidata old_nd;
struct vfsmount *mnt = NULL;
-   int err = mount_is_safe(nd);
-   if (err)
-   return err;
+   int err;
+
+   if (!permit_mount(nd, &flags))
+   return -EPERM;
if (!old_name || !*old_name)
return -EINVAL;
err = path_lookup(old_name, LOOKUP_FOLLOW, &old_nd);

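
The conditions listed in patch 5/8 can be sketched in userspace roughly
as follows.  This is only a model: struct mount_model, enum target_kind,
the MODEL_MNT_USER bit value and the *_model function names are
hypothetical, is_admin stands in for capable(CAP_SYS_ADMIN), and the
vfsmount_lock spinlock is left out because the sketch is single-threaded.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MNT_USER 0x1	/* hypothetical bit value, not the kernel's */

/* What the mountpoint is; only directories and regular files qualify. */
enum target_kind { TARGET_DIR, TARGET_REG, TARGET_SYMLINK, TARGET_SPECIAL };

/* Hypothetical stand-in for the ownership fields of struct vfsmount. */
struct mount_model {
	int flags;
	unsigned int uid;
};

static int nr_user_mounts;
static int max_user_mounts = 1024;

/*
 * Mirrors the order of tests in permit_mount(): admin first, then the
 * mountpoint type, then ownership of the parent mount.  *set_user is
 * raised only on the unprivileged path, like MS_SETUSER in the patch.
 */
static bool permit_bind_model(const struct mount_model *parent,
			      enum target_kind kind, unsigned int caller_uid,
			      bool is_admin, bool *set_user)
{
	assert(parent != NULL && set_user != NULL);

	if (is_admin)
		return true;
	if (kind != TARGET_DIR && kind != TARGET_REG)
		return false;
	if (!(parent->flags & MODEL_MNT_USER))
		return false;
	if (parent->uid != caller_uid)
		return false;
	*set_user = true;
	return true;
}

/* Mirrors reserve_user_mount(): only an admin may exceed the limit. */
static int reserve_user_mount_model(bool is_admin)
{
	if (nr_user_mounts >= max_user_mounts && !is_admin)
		return -EPERM;
	nr_user_mounts++;
	return 0;
}
```

Note that the reservation is taken before the mount is allocated, so a
failed allocation has to give the slot back, which is what the
alloc_failed path in clone_mnt() does with dec_nr_user_mounts().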


[patch 8/8] allow unprivileged fuse mounts

2007-04-20 Thread Miklos Szeredi
From: Miklos Szeredi <[EMAIL PROTECTED]>

Use FS_SAFE for "fuse" fs type, but not for "fuseblk".

FUSE was designed from the beginning to be safe for unprivileged
users.  This has also been verified in practice over many years.  In
addition, unprivileged mounts require the parent mount to be owned by
the user, which is stricter than the current userspace policy.

This will enable future installations to remove the suid-root
fusermount utility.

Don't require the "user_id=" and "group_id=" options for unprivileged
mounts, but if they are present, verify them for sanity.

Disallow the "allow_other" option for unprivileged mounts.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/fuse/inode.c
===
--- linux.orig/fs/fuse/inode.c  2007-04-20 11:55:01.0 +0200
+++ linux/fs/fuse/inode.c   2007-04-20 11:55:14.0 +0200
@@ -311,6 +311,19 @@ static int parse_fuse_opt(char *opt, str
d->max_read = ~0;
d->blksize = 512;
 
+   /*
+* For unprivileged mounts use current uid/gid.  Still allow
+* "user_id" and "group_id" options for compatibility, but
+* only if they match these values.
+*/
+   if (!capable(CAP_SYS_ADMIN)) {
+   d->user_id = current->uid;
+   d->user_id_present = 1;
+   d->group_id = current->gid;
+   d->group_id_present = 1;
+
+   }
+
while ((p = strsep(&opt, ",")) != NULL) {
int token;
int value;
@@ -339,6 +352,8 @@ static int parse_fuse_opt(char *opt, str
case OPT_USER_ID:
if (match_int(&args[0], &value))
return 0;
+   if (d->user_id_present && d->user_id != value)
+   return 0;
d->user_id = value;
d->user_id_present = 1;
break;
@@ -346,6 +361,8 @@ static int parse_fuse_opt(char *opt, str
case OPT_GROUP_ID:
if (match_int(&args[0], &value))
return 0;
+   if (d->group_id_present && d->group_id != value)
+   return 0;
d->group_id = value;
d->group_id_present = 1;
break;
@@ -536,6 +553,10 @@ static int fuse_fill_super(struct super_
if (!parse_fuse_opt((char *) data, &d, is_bdev))
return -EINVAL;
 
+   /* This is a privileged option */
+   if ((d.flags & FUSE_ALLOW_OTHER) && !capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
if (is_bdev) {
 #ifdef CONFIG_BLOCK
if (!sb_set_blocksize(sb, d.blksize))
@@ -639,6 +660,6 @@ static struct file_system_type fuse_fs_t
-   .fs_flags   = FS_HAS_SUBTYPE,
+   .fs_flags   = FS_HAS_SUBTYPE | FS_SAFE,
.get_sb = fuse_get_sb,
.kill_sb= kill_anon_super,
 };
 
 #ifdef CONFIG_BLOCK

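
The option handling described in patch 8/8 can be modelled in userspace
along these lines.  Everything here is an illustrative assumption:
struct fuse_opts_model, parse_uint() and parse_opts_model() are
hypothetical names, is_admin stands in for capable(CAP_SYS_ADMIN), and
the allow_other test (which the patch places in fuse_fill_super()) is
folded into the parser for brevity.  Options other than the three named
in the changelog are simply ignored.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct fuse_opts_model {
	unsigned int user_id;
	unsigned int group_id;
	bool user_id_present;
	bool group_id_present;
	bool allow_other;
};

/* Parse a plain decimal number; reject empty or trailing garbage. */
static bool parse_uint(const char *s, unsigned int *out)
{
	char *end;
	unsigned long v;

	if (!isdigit((unsigned char)*s))
		return false;
	v = strtoul(s, &end, 10);
	if (*end != '\0')
		return false;
	*out = (unsigned int)v;
	return true;
}

/* Returns true if the comma-separated option string is acceptable. */
static bool parse_opts_model(const char *opts, unsigned int uid,
			     unsigned int gid, bool is_admin,
			     struct fuse_opts_model *d)
{
	const char *p = opts;
	unsigned int value;

	assert(opts != NULL && d != NULL);
	memset(d, 0, sizeof(*d));

	/* Unprivileged callers start from their own uid/gid. */
	if (!is_admin) {
		d->user_id = uid;
		d->user_id_present = true;
		d->group_id = gid;
		d->group_id_present = true;
	}

	while (*p != '\0') {
		char tok[64];
		size_t len = strcspn(p, ",");

		if (len >= sizeof(tok))
			return false;
		memcpy(tok, p, len);
		tok[len] = '\0';

		if (strncmp(tok, "user_id=", 8) == 0) {
			if (!parse_uint(tok + 8, &value))
				return false;
			/* An explicit value must match what is already set. */
			if (d->user_id_present && d->user_id != value)
				return false;
			d->user_id = value;
			d->user_id_present = true;
		} else if (strncmp(tok, "group_id=", 9) == 0) {
			if (!parse_uint(tok + 9, &value))
				return false;
			if (d->group_id_present && d->group_id != value)
				return false;
			d->group_id = value;
			d->group_id_present = true;
		} else if (strcmp(tok, "allow_other") == 0) {
			d->allow_other = true;
		}

		p += len;
		if (*p == ',')
			p++;
	}

	/* allow_other is a privileged option. */
	return !(d->allow_other && !is_admin);
}
```

Under this model an unprivileged caller with uid 1000 may pass
"user_id=1000" or omit it entirely, but "user_id=0" and "allow_other"
are both refused.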


[patch 6/8] put declaration of put_filesystem() in fs.h

2007-04-20 Thread Miklos Szeredi
From: Miklos Szeredi <[EMAIL PROTECTED]>

Declarations go into headers.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/super.c
===
--- linux.orig/fs/super.c   2007-04-20 11:55:02.0 +0200
+++ linux/fs/super.c   2007-04-20 11:55:11.0 +0200
@@ -40,10 +40,6 @@
 #include 
 
 
-void get_filesystem(struct file_system_type *fs);
-void put_filesystem(struct file_system_type *fs);
-struct file_system_type *get_fs_type(const char *name);
-
 LIST_HEAD(super_blocks);
 DEFINE_SPINLOCK(sb_lock);
 
Index: linux/include/linux/fs.h
===
--- linux.orig/include/linux/fs.h   2007-04-20 11:55:07.0 +0200
+++ linux/include/linux/fs.h   2007-04-20 11:55:11.0 +0200
@@ -1918,6 +1918,8 @@ extern int vfs_fstat(unsigned int, struc
 
 extern int vfs_ioctl(struct file *, unsigned int, unsigned int, unsigned long);
 
+extern void get_filesystem(struct file_system_type *fs);
+extern void put_filesystem(struct file_system_type *fs);
 extern struct file_system_type *get_fs_type(const char *name);
 extern struct super_block *get_super(struct block_device *);
 extern struct super_block *user_get_super(dev_t);



[patch 4/8] propagate error values from clone_mnt

2007-04-20 Thread Miklos Szeredi
From: Miklos Szeredi <[EMAIL PROTECTED]>

Allow clone_mnt() to return errors other than ENOMEM.  This will be
used for returning a different error value when the number of user
mounts goes over the limit.

Fix copy_tree() to return EPERM for unbindable mounts.

Don't propagate the error further from dup_mnt_ns(), as that call to
copy_tree() can only fail with -ENOMEM.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-20 11:55:07.0 +0200
+++ linux/fs/namespace.c   2007-04-20 11:55:09.0 +0200
@@ -261,42 +261,42 @@ static struct vfsmount *clone_mnt(struct
 {
struct super_block *sb = old->mnt_sb;
struct vfsmount *mnt = alloc_vfsmnt(old->mnt_devname);
+   if (!mnt)
+   return ERR_PTR(-ENOMEM);
 
-   if (mnt) {
-   mnt->mnt_flags = old->mnt_flags;
-   atomic_inc(&sb->s_active);
-   mnt->mnt_sb = sb;
-   mnt->mnt_root = dget(root);
-   mnt->mnt_mountpoint = mnt->mnt_root;
-   mnt->mnt_parent = mnt;
-
-   /* don't copy the MNT_USER flag */
-   mnt->mnt_flags &= ~MNT_USER;
-   if (flag & CL_SETUSER)
-   set_mnt_user(mnt);
-
-   if (flag & CL_SLAVE) {
-   list_add(&mnt->mnt_slave, &old->mnt_slave_list);
-   mnt->mnt_master = old;
-   CLEAR_MNT_SHARED(mnt);
-   } else {
-   if ((flag & CL_PROPAGATION) || IS_MNT_SHARED(old))
-   list_add(&mnt->mnt_share, &old->mnt_share);
-   if (IS_MNT_SLAVE(old))
-   list_add(&mnt->mnt_slave, &old->mnt_slave);
-   mnt->mnt_master = old->mnt_master;
-   }
-   if (flag & CL_MAKE_SHARED)
-   set_mnt_shared(mnt);
+   mnt->mnt_flags = old->mnt_flags;
+   atomic_inc(&sb->s_active);
+   mnt->mnt_sb = sb;
+   mnt->mnt_root = dget(root);
+   mnt->mnt_mountpoint = mnt->mnt_root;
+   mnt->mnt_parent = mnt;
+
+   /* don't copy the MNT_USER flag */
+   mnt->mnt_flags &= ~MNT_USER;
+   if (flag & CL_SETUSER)
+   set_mnt_user(mnt);
 
-   /* stick the duplicate mount on the same expiry list
-* as the original if that was on one */
-   if (flag & CL_EXPIRE) {
-   spin_lock(&vfsmount_lock);
-   if (!list_empty(&old->mnt_expire))
-   list_add(&mnt->mnt_expire, &old->mnt_expire);
-   spin_unlock(&vfsmount_lock);
-   }
+   if (flag & CL_SLAVE) {
+   list_add(&mnt->mnt_slave, &old->mnt_slave_list);
+   mnt->mnt_master = old;
+   CLEAR_MNT_SHARED(mnt);
+   } else {
+   if ((flag & CL_PROPAGATION) || IS_MNT_SHARED(old))
+   list_add(&mnt->mnt_share, &old->mnt_share);
+   if (IS_MNT_SLAVE(old))
+   list_add(&mnt->mnt_slave, &old->mnt_slave);
+   mnt->mnt_master = old->mnt_master;
+   }
+   if (flag & CL_MAKE_SHARED)
+   set_mnt_shared(mnt);
+
+   /* stick the duplicate mount on the same expiry list
+* as the original if that was on one */
+   if (flag & CL_EXPIRE) {
+   spin_lock(&vfsmount_lock);
+   if (!list_empty(&old->mnt_expire))
+   list_add(&mnt->mnt_expire, &old->mnt_expire);
+   spin_unlock(&vfsmount_lock);
}
return mnt;
 }
@@ -781,11 +781,11 @@ struct vfsmount *copy_tree(struct vfsmou
struct nameidata nd;
 
if (!(flag & CL_COPY_ALL) && IS_MNT_UNBINDABLE(mnt))
-   return NULL;
+   return ERR_PTR(-EPERM);
 
res = q = clone_mnt(mnt, dentry, flag);
-   if (!q)
-   goto Enomem;
+   if (IS_ERR(q))
+   goto error;
q->mnt_mountpoint = mnt->mnt_mountpoint;
 
p = mnt;
@@ -806,8 +806,8 @@ struct vfsmount *copy_tree(struct vfsmou
nd.mnt = q;
nd.dentry = p->mnt_mountpoint;
q = clone_mnt(p, p->mnt_root, flag);
-   if (!q)
-   goto Enomem;
+   if (IS_ERR(q))
+   goto error;
spin_lock(&vfsmount_lock);
list_add_tail(&q->mnt_list, &res->mnt_list);
attach_mnt(q, &nd);
@@ -815,7 +815,7 @@ struct vfsmount *copy_tree(struct vfsmou
}
}
return res;
-Enomem:
+ error:
if (res) {
LIST_HEAD(umount_list);
spin_lock(&vfsmount_lock);
@@ -823,7 +823,7 @@ Enomem:
s
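
For readers unfamiliar with the ERR_PTR() convention that patch 4/8
switches clone_mnt() to, here is a small userspace model of it.  The
model_* helpers imitate the kernel's ERR_PTR(), PTR_ERR() and IS_ERR();
their names, struct mnt_model and clone_model() are hypothetical and
exist only for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static void *model_err_ptr(long err)
{
	assert(err < 0 && err >= -MODEL_MAX_ERRNO);
	return (void *)(intptr_t)err;
}

/* Recover the errno value from an error pointer. */
static long model_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if p lies in the top MODEL_MAX_ERRNO bytes of the address space. */
static bool model_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

struct mnt_model {
	int id;
};

/*
 * Like the reworked clone_mnt(): never returns NULL, so callers test
 * the result with model_is_err() and can tell -EPERM from -ENOMEM.
 */
static struct mnt_model *clone_model(bool over_limit, bool fail_alloc)
{
	struct mnt_model *m;

	if (over_limit)
		return model_err_ptr(-EPERM);
	m = fail_alloc ? NULL : malloc(sizeof(*m));
	if (m == NULL)
		return model_err_ptr(-ENOMEM);
	m->id = 1;
	return m;
}
```

The point of the convention is visible in the copy_tree() hunks: a
plain "if (!q)" test can only say that something failed, whereas
IS_ERR(q) lets the caller pass the specific error code upwards.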

[patch 1/8] add user mounts to the kernel

2007-04-20 Thread Miklos Szeredi
From: Miklos Szeredi <[EMAIL PROTECTED]>

Add ownership information to mounts.

A new mount flag, MS_SETUSER, is used to make a mount owned by a user.
If this flag is specified, the owner is set to the current real user
ID and the mount is marked with the MNT_USER flag.  On remount, the
previous owner is not preserved, and MS_SETUSER is treated as for a
new mount.  The MS_SETUSER flag is ignored on mount move.

The MNT_USER flag is not copied on any kind of mount cloning:
namespace creation, binding or propagation.  For bind mounts the
cloned mount(s) are set to MNT_USER depending on the MS_SETUSER mount
flag.  In all the other cases MNT_USER is always cleared.

For MNT_USER mounts a "user=UID" option is added to /proc/PID/mounts.
This is compatible with how mount ownership is stored in /etc/mtab.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-20 11:55:02.0 +0200
+++ linux/fs/namespace.c   2007-04-20 11:55:05.0 +0200
@@ -227,6 +227,13 @@ static struct vfsmount *skip_mnt_tree(st
return p;
 }
 
+static void set_mnt_user(struct vfsmount *mnt)
+{
+   BUG_ON(mnt->mnt_flags & MNT_USER);
+   mnt->mnt_uid = current->uid;
+   mnt->mnt_flags |= MNT_USER;
+}
+
 static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
int flag)
 {
@@ -241,6 +248,11 @@ static struct vfsmount *clone_mnt(struct
mnt->mnt_mountpoint = mnt->mnt_root;
mnt->mnt_parent = mnt;
 
+   /* don't copy the MNT_USER flag */
+   mnt->mnt_flags &= ~MNT_USER;
+   if (flag & CL_SETUSER)
+   set_mnt_user(mnt);
+
if (flag & CL_SLAVE) {
list_add(&mnt->mnt_slave, &old->mnt_slave_list);
mnt->mnt_master = old;
@@ -403,6 +415,8 @@ static int show_vfsmnt(struct seq_file *
if (mnt->mnt_flags & fs_infop->flag)
seq_puts(m, fs_infop->str);
}
+   if (mnt->mnt_flags & MNT_USER)
+   seq_printf(m, ",user=%i", mnt->mnt_uid);
if (mnt->mnt_sb->s_op->show_options)
err = mnt->mnt_sb->s_op->show_options(m, mnt);
seq_puts(m, " 0 0\n");
@@ -920,8 +934,9 @@ static int do_change_type(struct nameida
 /*
  * do loopback mount.
  */
-static int do_loopback(struct nameidata *nd, char *old_name, int recurse)
+static int do_loopback(struct nameidata *nd, char *old_name, int flags)
 {
+   int clone_flags;
struct nameidata old_nd;
struct vfsmount *mnt = NULL;
int err = mount_is_safe(nd);
@@ -941,11 +956,12 @@ static int do_loopback(struct nameidata 
if (!check_mnt(nd->mnt) || !check_mnt(old_nd.mnt))
goto out;
 
+   clone_flags = (flags & MS_SETUSER) ? CL_SETUSER : 0;
err = -ENOMEM;
-   if (recurse)
-   mnt = copy_tree(old_nd.mnt, old_nd.dentry, 0);
+   if (flags & MS_REC)
+   mnt = copy_tree(old_nd.mnt, old_nd.dentry, clone_flags);
else
-   mnt = clone_mnt(old_nd.mnt, old_nd.dentry, 0);
+   mnt = clone_mnt(old_nd.mnt, old_nd.dentry, clone_flags);
 
if (!mnt)
goto out;
@@ -987,8 +1003,11 @@ static int do_remount(struct nameidata *
 
down_write(&sb->s_umount);
err = do_remount_sb(sb, flags, data, 0);
-   if (!err)
+   if (!err) {
nd->mnt->mnt_flags = mnt_flags;
+   if (flags & MS_SETUSER)
+   set_mnt_user(nd->mnt);
+   }
up_write(&sb->s_umount);
if (!err)
security_sb_post_remount(nd->mnt, flags, data);
@@ -1093,10 +1112,13 @@ static int do_new_mount(struct nameidata
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
 
-   mnt = do_kern_mount(type, flags, name, data);
+   mnt = do_kern_mount(type, flags & ~MS_SETUSER, name, data);
if (IS_ERR(mnt))
return PTR_ERR(mnt);
 
+   if (flags & MS_SETUSER)
+   set_mnt_user(mnt);
+
return do_add_mount(mnt, nd, mnt_flags, NULL);
 }
 
@@ -1127,7 +1149,8 @@ int do_add_mount(struct vfsmount *newmnt
if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode))
goto unlock;
 
-   newmnt->mnt_flags = mnt_flags;
+   /* MNT_USER was set earlier */
+   newmnt->mnt_flags |= mnt_flags;
if ((err = graft_tree(newmnt, nd)))
goto unlock;
 
@@ -1447,7 +1470,7 @@ long do_mount(char *dev_name, char *dir_
retval = do_remount(&nd, flags & ~MS_REMOUNT, mnt_flags,
data_page);
else if (flags & MS_BIND)
-   retval = do_loopback(&nd, dev_name, flags & MS_REC);
+   retval = do_loopback(&nd, dev_name, flags);
else if (flags & (MS_
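
Two rules from the patch 1/8 changelog, sketched in userspace: MNT_USER
is never inherited by a clone, and an owned mount gets ",user=UID"
appended to its options as in the show_vfsmnt() hunk.  The
MODEL_MNT_USER bit value and both function names are hypothetical;
set_user plays the role of CL_SETUSER.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MODEL_MNT_USER 0x1	/* hypothetical bit value, not the kernel's */

/*
 * Flags of a cloned mount: the ownership bit of the original is always
 * dropped, and set again only if the caller asked for it.
 */
static int clone_flags_model(int old_flags, bool set_user)
{
	int flags = old_flags & ~MODEL_MNT_USER;

	if (set_user)
		flags |= MODEL_MNT_USER;
	return flags;
}

/*
 * Build the option field of a mounts line.  Returns what snprintf()
 * returns, i.e. the length the full string needs.
 */
static int format_opts_model(char *buf, size_t size, const char *base,
			     int flags, int uid)
{
	assert(buf != NULL && base != NULL);

	if (flags & MODEL_MNT_USER)
		return snprintf(buf, size, "%s,user=%i", base, uid);
	return snprintf(buf, size, "%s", base);
}
```

For a mount owned by uid 1000 with base options "rw,nosuid,nodev", this
produces "rw,nosuid,nodev,user=1000"; a mount without the ownership bit
is printed unchanged.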

[patch 2/8] allow unprivileged umount

2007-04-20 Thread Miklos Szeredi
From: Miklos Szeredi <[EMAIL PROTECTED]>

The owner of a mount doesn't need sysadmin capabilities to call umount().

This is similar to the behavior of umount(8) on mounts that have the
"user=UID" option in /etc/mtab.  The difference is that umount(8) also
checks /etc/fstab, presumably to exclude another mount on the same
mountpoint.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-20 11:55:05.0 +0200
+++ linux/fs/namespace.c   2007-04-20 11:55:06.0 +0200
@@ -659,6 +659,25 @@ static int do_umount(struct vfsmount *mn
 }
 
 /*
+ * umount is permitted for
+ *  - sysadmin
+ *  - mount owner, if not forced umount
+ */
+static bool permit_umount(struct vfsmount *mnt, int flags)
+{
+   if (capable(CAP_SYS_ADMIN))
+   return true;
+
+   if (!(mnt->mnt_flags & MNT_USER))
+   return false;
+
+   if (flags & MNT_FORCE)
+   return false;
+
+   return mnt->mnt_uid == current->uid;
+}
+
+/*
  * Now umount can handle mount points as well as block devices.
  * This is important for filesystems which use unnamed block devices.
  *
@@ -681,7 +700,7 @@ asmlinkage long sys_umount(char __user *
goto dput_and_out;
 
retval = -EPERM;
-   if (!capable(CAP_SYS_ADMIN))
+   if (!permit_umount(nd.mnt, flags))
goto dput_and_out;
 
retval = do_umount(nd.mnt, flags);

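
The decision made by permit_umount() in patch 2/8 is small enough to
model directly.  In this sketch struct mount_model, the MODEL_MNT_USER
bit value and permit_umount_model() are hypothetical names; is_admin
stands in for capable(CAP_SYS_ADMIN) and forced for the MNT_FORCE flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MNT_USER 0x1	/* hypothetical bit value, not the kernel's */

/* Hypothetical stand-in for the ownership fields of struct vfsmount. */
struct mount_model {
	int flags;
	unsigned int uid;
};

/*
 * umount is permitted for the sysadmin, or for the owner of a user
 * mount as long as the umount is not forced.
 */
static bool permit_umount_model(const struct mount_model *mnt,
				unsigned int caller_uid, bool is_admin,
				bool forced)
{
	assert(mnt != NULL);

	if (is_admin)
		return true;
	if (!(mnt->flags & MODEL_MNT_USER))
		return false;
	if (forced)
		return false;
	return mnt->uid == caller_uid;
}
```

The ownership bit is tested before the uid, so a mount that was never
marked as a user mount cannot be removed by an unprivileged caller even
if its uid field happens to match.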


Re: [d_path 6/7] Filter out disconnected paths from /proc/mounts

2007-04-20 Thread Alan Cox
> There is some disagreement what /proc/mounts should include. Currently it
> reports all mounts from the current namespace and doesn't include lazy
> unmounts. This leads to ambiguities with the rootfs (which is an internal
> mount irrelevant to user-space except in the initrd), and in chroots.
> 
> With this and the next patch, /proc/mounts only reports the mounts reachable
> for the current process, which makes a lot more sense IMO.  If the current
> process is rooted in the namespace root (which it usually is), it will see all
> mounts except for the rootfs.
> 
> Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]>

This change in behaviour appears to be fine for glibc (except when trying
to find the name of a file from a namespace we are not in, which wouldn't
have come out right before either)

Acked-by: Alan Cox <[EMAIL PROTECTED]>

(but still NAK on the getcwd change)


Re: [d_path 1/7] Fix __d_path() for lazy unmounts and make it unambiguous

2007-04-20 Thread Alan Cox
On Fri, 20 Apr 2007 01:23:04 +0200
Andreas Gruenbacher <[EMAIL PROTECTED]> wrote:

> First, when __d_path() hits a lazily unmounted mount point, it tries to
> prepend the name of the lazily unmounted dentry to the path name.  It gets
> this wrong, and also overwrites the slash that separates the name from the
> following pathname component. This patch fixes that; if a process was in
> directory /foo/bar and /foo got lazily unmounted, the old result was
> ``foobar'' (note the missing slash), while the new result with this patch is
> ``foo/bar''.

ACK the fix

> of ``foobar'' in the example described above.  Subsequent patches propose to
> make getcwd() fail instead of reporting unreachable paths like this one and
> hide unreachable mount points from /proc/mounts.

NAK that change of behaviour on the following patches.
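
To make the ``foobar'' versus ``foo/bar'' example above concrete, here
is a userspace sketch of building a path right to left in a buffer,
one component at a time.  It does not reproduce __d_path(); the
function names, the names[] array and the connected flag are
assumptions made for illustration.  A disconnected path (as after a
lazy unmount) simply gets no leading slash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy str in front of *end and move *end left.  Returns -1 if the
 * remaining room is too small, leaving the buffer untouched.
 */
static int prepend_model(char **end, size_t *room, const char *str)
{
	size_t len = strlen(str);

	if (len > *room)
		return -1;
	*room -= len;
	*end -= len;
	memcpy(*end, str, len);
	return 0;
}

/*
 * names[] lists the components from the top of the path down to the
 * leaf.  The result points into buf, or is NULL if buf is too small.
 */
static const char *build_path_model(char *buf, size_t size,
				    const char *const *names, size_t n,
				    bool connected)
{
	char *end;
	size_t room;
	size_t i;

	assert(buf != NULL && names != NULL && n >= 1);
	if (size == 0)
		return NULL;

	end = buf + size - 1;
	*end = '\0';
	room = size - 1;

	for (i = n; i-- > 0; ) {
		if (prepend_model(&end, &room, names[i]) != 0)
			return NULL;
		/*
		 * A separator goes in front of every component except
		 * the topmost one of a disconnected path.
		 */
		if ((i > 0 || connected) &&
		    prepend_model(&end, &room, "/") != 0)
			return NULL;
	}
	return end;
}
```

The bug described in the quoted text corresponds to losing the
separator between two components; in this sketch the separator is
written by its own prepend step, so a later component never lands on
top of it.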


Re: [d_path 0/7] Fixes to d_path: Respin

2007-04-20 Thread Alan Cox
> As far as I can see, glibc internally looks at /proc/mounts (or else mtab) to
> find out where tmpfs is mounted for opening files there, and to look up
> filesystem information for statfs(), while accessing that path, too. Fstatfs()
> also looks into the same files, but it only matches by filesystem type, so
> this is only a very unreliable heuristic, anyway.
> 
> So judging from that, glibc users should be fine.

So glibc does use it and you will change behaviour

> > I disagree - firstly because of not breaking stuff, and secondly because
> > it separates two discussions - merging AppArmor being one of them, and
> > the correct behaviour for getcwd & /proc/mounts being the other.
> 
> I agree with the separation of discussion argument. Here are patches that
> change getcwd() and /proc/mounts independent of the changes that AppArmor
> depends on.

More useful would be AppArmor without the changes to getcwd
and /proc/mounts