Re: [PATCH 00/37] Permit filesystem local caching

2008-02-20 Thread Serge E. Hallyn
Quoting David Howells ([EMAIL PROTECTED]):
> 
> 
> These patches add local caching for network filesystems such as NFS.
> 
> The patches can roughly be broken down into a number of sets:
> 
>   (*) 01-keys-inc-payload.diff
>   (*) 02-keys-search-keyring.diff
>   (*) 03-keys-callout-blob.diff
> 
>   Three patches to the keyring code made to help the CIFS people.
>   Included because of patches 05-08.
> 
>   (*) 04-keys-get-label.diff
> 
>   A patch to allow the security label of a key to be retrieved.
>   Included because of patches 05-08.
> 
>   (*) 05-security-current-fsugid.diff
>   (*) 06-security-separate-task-bits.diff

Seems *really* weird that every time you send this, patch 6 doesn't seem
to reach me in any of my mailboxes...  (did get it from the url
you listed)

I'm sorry if I miss where you explicitly state this, but is it safe to
assume, as perusing the patches suggests, that

1. tsk->sec never changes other than in task_alloc_security()?  

2. tsk->act_as is only ever dereferenced from (a) current->
   except (b) in do_coredump?

(thereby carefully avoiding locking issues)
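
The two assumptions above can be pictured with a small userspace model. This is only a sketch: struct task_model and the helper names are hypothetical, and only the field names sec and act_as come from the patches. It assumes the objective pointer is set once at allocation and that the subjective pointer is only ever swapped by the task on itself, which is why no locking is modeled.

```c
/* Hypothetical stand-in for the security details moved out of task_struct. */
struct task_security_model {
    unsigned int fsuid;
    unsigned int fsgid;
};

struct task_model {
    struct task_security_model *sec;    /* objective: how others may affect this task */
    struct task_security_model *act_as; /* subjective: how this task affects objects */
};

/* Models allocation: both pointers start out referring to the same struct. */
void task_model_init(struct task_model *t, struct task_security_model *own)
{
    t->sec = own;
    t->act_as = own;
}

/* Swap the subjective pointer and hand back the previous one for a later revert. */
struct task_security_model *
override_act_as(struct task_model *t, struct task_security_model *tmp)
{
    struct task_security_model *old = t->act_as;

    t->act_as = tmp;
    return old;
}

/* Restore the subjective pointer saved by override_act_as(). */
void revert_act_as(struct task_model *t, struct task_security_model *old)
{
    t->act_as = old;
}
```

In this model t->sec never moves after task_model_init(), so other tasks can read it without caring about an override in progress.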

I'd still like to see some performance numbers.  Not to object to
these patches, just to make sure there's no need to try and optimize
more of the dereferences away when they're not needed.

Oh, manually copied from patch 6, I see you have in the task_security
struct definition:

kernel_cap_t cap_bset;   /* ? */

That comment can be filled in with 'capability bounding set' (for this
task and all its future descendants).

thanks,
-serge

>   (*) 07-security-subjective.diff
>   (*) 08-security-kernel_service-class.diff
>   (*) 09-security-kernel-service.diff
>   (*) 10-security-nfsd.diff
> 
>   Patches to permit the subjective security of a task to be overridden.
>   All the security details in task_struct are decanted into a new struct
>   that task_struct then has two pointers to: one that defines the
>   objective security of that task (how other tasks may affect it) and one
>   that defines the subjective security (how it may affect other objects).
> 
>   Note that I have dropped the idea of struct cred for the moment.  With
>   the amount of stuff that was excluded from it, it wasn't actually any
>   use to me.  However, it can be added later.
> 
>   Required for cachefiles.
> 
>   (*) 11-release-page.diff
>   (*) 12-fscache-page-flags.diff
>   (*) 13-add_wait_queue_tail.diff
>   (*) 14-fscache.diff
> 
>   Patches to provide a local caching facility for network filesystems.
> 
>   (*) 15-cachefiles-ia64.diff
>   (*) 16-cachefiles-ext3-f_mapping.diff
>   (*) 17-cachefiles-write.diff
>   (*) 18-cachefiles-monitor.diff
>   (*) 19-cachefiles-export.diff
>   (*) 20-cachefiles.diff
> 
>   Patches to provide a local cache in a directory of an already mounted
>   filesystem.
> 
>   (*) 21-nfs-comment.diff
>   (*) 22-nfs-fscache-option.diff
>   (*) 23-nfs-fscache-kconfig.diff
>   (*) 24-nfs-fscache-top-index.diff
>   (*) 25-nfs-fscache-server-obj.diff
>   (*) 26-nfs-fscache-super-obj.diff
>   (*) 27-nfs-fscache-inode-obj.diff
>   (*) 28-nfs-fscache-use-inode.diff
>   (*) 29-nfs-fscache-invalidate-pages.diff
>   (*) 30-nfs-fscache-iostats.diff
>   (*) 31-nfs-fscache-page-management.diff
>   (*) 32-nfs-fscache-read-context.diff
>   (*) 33-nfs-fscache-read-fallback.diff
>   (*) 34-nfs-fscache-read-from-cache.diff
>   (*) 35-nfs-fscache-store-to-cache.diff
>   (*) 36-nfs-fscache-mount.diff
>   (*) 37-nfs-fscache-display.diff
> 
>   Patches to provide NFS with local caching.
> 
>   A couple of questions on the NFS iostat changes: (1) Should I update the
>   iostat version number; (2) is it permitted to have conditional iostats?
> 
> 
> I've brought the patchset up to date with respect to the 2.6.25-rc1 merge
> window, in particular altering Smack to handle the split in objective and
> subjective security in the task_struct.
> 
> --
> A tarball of the patches is available at:
> 
>   
> http://people.redhat.com/~dhowells/fscache/patches/nfs+fscache-30.tar.bz2
> 
> 
> To use this version of CacheFiles, the cachefilesd-0.9 is also required.  It
> is available as an SRPM:
> 
>   http://people.redhat.com/~dhowells/fscache/cachefilesd-0.9-1.fc7.src.rpm
> 
> Or as individual bits:
> 
>   http://people.redhat.com/~dhowells/fscache/cachefilesd-0.9.tar.bz2
>   http://people.redhat.com/~dhowells/fscache/cachefilesd.fc
>   http://people.redhat.com/~dhowells/fscache/cachefilesd.if
>   http://people.redhat.com/~dhowells/fscache/cachefilesd.te
>   http://people.redhat.com/~dhowells/fscache/cachefilesd.spec
> 
> The .fc, .if and .te files are for manipulating SELinux.
> 
> David
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property

2008-02-07 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > > > Maybe sysctls just need to check capabilities, instead of uids.  I
> > > > > think that would make a lot of sense anyway.
> > > > 
> > > > Would it be as simple as tagging the inodes with capability sets?  One
> > > > set for writing, or one each for reading and writing?
> > > 
> > > Yes, or something even simpler, like mapping the owner permission bits
> > > to CAP_SYS_ADMIN.  There seem to be very few different permissions
> > > under /proc/sys:
> > > 
> > > > --w-------
> > > > -r--r--r--
> > > > -rw-------
> > > > -rw-r--r--
> > > 
> > > As long as the group and other bits are always the same, and we accept
> > > that the owner bits really mean CAP_SYS_ADMIN and not something else,
> > 
> > But I would assume some things under /proc/sys/net/ipv4 or
> > /proc/sys/net/ath0 require CAP_NET_ADMIN rather than CAP_SYS_ADMIN?
> 
> I guess so.  I'm not very familiar with the different capabilities :)
> 
> How about this patch then: a hybrid solution between just relying on
> permission bits, and specifying separate capability sets for read and
> write in addition to the permission bits.
> 
> Untested, the 'cap' field obviously still needs to be filled in where
> appropriate.
> 
> Miklos
> 
> 
> Index: linux/include/linux/sysctl.h
> ===
> --- linux.orig/include/linux/sysctl.h 2008-02-04 12:29:01.0 +0100
> +++ linux/include/linux/sysctl.h  2008-02-07 15:19:06.0 +0100
> @@ -1041,6 +1041,7 @@ struct ctl_table 
>   void *data;
>   int maxlen;
>   mode_t mode;
> + int cap;/* Capability needed to read/write */
>   struct ctl_table *child;
>   struct ctl_table *parent;   /* Automatically set */
>   proc_handler *proc_handler; /* Callback for text formatting */
> Index: linux/kernel/sysctl.c
> ===
> --- linux.orig/kernel/sysctl.c2008-02-05 22:17:05.0 +0100
> +++ linux/kernel/sysctl.c 2008-02-07 15:30:45.0 +0100
> @@ -1527,14 +1527,26 @@ out:
>   * some sysctl variables are readonly even to root.
>   */
> 
> -static int test_perm(int mode, int op)
> +static int test_perm(struct ctl_table *table, int op)
>  {
> - if (!current->euid)
> - mode >>= 6;
> - else if (in_egroup_p(0))
> - mode >>= 3;
> + int cap = table->cap;
> + mode_t mode = table->mode;
> +
> + if (!cap)
> + cap = CAP_SYS_ADMIN;
> +
> + if ((op & MAY_READ) && !(mode & S_IRUGO))
> + return -EACCES;
> +
> + if ((op & MAY_WRITE) && !(mode & S_IWUGO))
> + return -EACCES;
> +
> + if (capable(cap))
> + return 0;
> +
>   if ((mode & op & 0007) == op)
>   return 0;
> +
>   return -EACCES;

I like how simple it appears to be :)

At first I missed the fact that owning uid is always 0 so I thought the
uid processing wasn't quite enough.  But since it's always 0, the only
question is whether there are any /proc/sys files whose users currently
depend on being setgid 0 and setgid non-0 with no capabilities.

On my laptop, 'find /proc/sys -type f -perm -020' gives me no results,
so that is promising.

So this certainly seems like a good first step.  In fact, combined with
/proc/sys/ being partially remounted per container like /proc/sys/net is
doing, we may not even need to do anything with CAP_NS_OVERRIDE.
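
For reference, the control flow of the proposed check can be exercised in userspace. The sketch below mirrors the quoted hunk, under a few assumptions: the constants are local copies named MODEL_*, and the result of capable(cap) is passed in as a plain boolean since there is no kernel to ask.

```c
#include <errno.h>
#include <stdbool.h>

/* Local copies of the kernel constants used by the quoted patch. */
#define MODEL_MAY_WRITE 2
#define MODEL_MAY_READ  4
#define MODEL_S_IRUGO   0444
#define MODEL_S_IWUGO   0222

int test_perm_model(unsigned int mode, int op, bool has_cap)
{
    /* A file nobody may read or write stays off limits even with the capability. */
    if ((op & MODEL_MAY_READ) && !(mode & MODEL_S_IRUGO))
        return -EACCES;
    if ((op & MODEL_MAY_WRITE) && !(mode & MODEL_S_IWUGO))
        return -EACCES;

    /* The capability takes the place of the old euid 0 and egid 0 checks. */
    if (has_cap)
        return 0;

    /* Everyone else is judged by the "other" permission bits only. */
    if ((mode & (unsigned int)op & 0007) == (unsigned int)op)
        return 0;

    return -EACCES;
}
```

With mode 0644, a write succeeds only when has_cap is true, while a read succeeds for anyone; with mode 0444, a write fails even with the capability.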

thanks,
-serge

>  }
> 
> @@ -1544,7 +1556,7 @@ int sysctl_perm(struct ctl_table *table,
>   error = security_sysctl(table, op);
>   if (error)
>   return error;
> - return test_perm(table->mode, op);
> + return test_perm(table, op);
>  }
> 
>  #ifdef CONFIG_SYSCTL_SYSCALL


Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property

2008-02-07 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > Maybe sysctls just need to check capabilities, instead of uids.  I
> > > think that would make a lot of sense anyway.
> > 
> > Would it be as simple as tagging the inodes with capability sets?  One
> > set for writing, or one each for reading and writing?
> 
> Yes, or something even simpler, like mapping the owner permission bits
> to CAP_SYS_ADMIN.  There seem to be very few different permissions
> under /proc/sys:
> 
> --w-------
> -r--r--r--
> -rw-------
> -rw-r--r--
> 
> As long as the group and other bits are always the same, and we accept
> that the owner bits really mean CAP_SYS_ADMIN and not something else,

But I would assume some things under /proc/sys/net/ipv4 or
/proc/sys/net/ath0 require CAP_NET_ADMIN rather than CAP_SYS_ADMIN?
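
To make the distinction concrete, here is a rough userspace sketch of picking the capability by subtree. The function name and the prefix rule are hypothetical; only the two capability numbers are taken from linux/capability.h.

```c
#include <string.h>

/* Values as defined in linux/capability.h. */
#define MODEL_CAP_NET_ADMIN 12
#define MODEL_CAP_SYS_ADMIN 21

/* Hypothetical rule: anything under /proc/sys/net/ is a networking tunable. */
int required_cap_model(const char *path)
{
    static const char net_prefix[] = "/proc/sys/net/";

    if (strncmp(path, net_prefix, sizeof(net_prefix) - 1) == 0)
        return MODEL_CAP_NET_ADMIN;

    return MODEL_CAP_SYS_ADMIN;
}
```

A real implementation would more likely store the capability per table entry than match on paths; the prefix test just keeps the sketch short.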

> then the permission check would not need to look at uids or gids at
> all.
> 
> Miklos


Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property

2008-02-06 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > + t->table[0].mode = 0644;
> > 
> > Yikes, this could be a problem for containers, as it's simply tied to
> > uid 0, whereas tying it to a capability would let us solve it with
> > capability bounds.
> > 
> > This might mean more urgency to get user namespaces working at least
> > with sysfs, else this is a quick way around having CAP_SYS_ADMIN taken
> > out of a container's capability bounding set.
> 
> I think I understand the problem, but not the solution.  How are user
> namespaces going to help?

Well it somewhat depends on how we implement userns for filesystems
in the first place, and whether we end up splitting sysfs into
sub-filesystems as I think Eric Biederman has been advocating.  My
thoughts had been running along the lines of just tagging vfsmounts
with userns of the mounting process.  A task from outside the mounting
process' namespace would get the "other" permissions whether or not
its uid was the owning uid or uid 0 (unless the task had CAP_NS_OVERRIDE).
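
As a rough illustration of that idea, the userspace sketch below picks which rwx triple applies to a task. Everything in it is hypothetical (the struct names, integer namespace ids, a boolean standing in for CAP_NS_OVERRIDE), and group permissions are left out to keep it short.

```c
#include <stdbool.h>

struct userns_mount {
    int userns_id;          /* namespace of the process that did the mount */
};

struct userns_task {
    int userns_id;
    unsigned int uid;
    bool cap_ns_override;   /* stands in for CAP_NS_OVERRIDE */
};

/* Returns the rwx triple (0..7) that applies to the task for this file. */
unsigned int applicable_bits(const struct userns_mount *m,
                             const struct userns_task *t,
                             unsigned int mode, unsigned int owner_uid)
{
    bool same_ns = (t->userns_id == m->userns_id);

    /* Outsiders are treated as "other", even uid 0 or the owning uid. */
    if (!same_ns && !t->cap_ns_override)
        return mode & 07;

    if (t->uid == owner_uid)
        return (mode >> 6) & 07;

    /* Group handling is left out of this sketch. */
    return mode & 07;
}
```

For a 0644 file owned by uid 0, a uid 0 task from another namespace would get read-only access here, while the same task with the override flag would get the owner bits.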

But really it gets more complicated for sysfs than something like ext2
since we really want to be able to filter files and directories for
different namespaces...  Handling sysfs user namespaces before we sort
out the rest of the sysfs stuff (being hashed out with network
namespaces) seems like jumping the gun a bit.

> Maybe sysctls just need to check capabilities, instead of uids.  I
> think that would make a lot of sense anyway.

Would it be as simple as tagging the inodes with capability sets?  One
set for writing, or one each for reading and writing?

thanks,
-serge


Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property

2008-02-06 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> Add the following:
> 
>   /proc/sys/fs/types/${FS_TYPE}/usermount_safe
> 
> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>

Thanks, Miklos, good explanations in the docs.

Acked-by: Serge Hallyn <[EMAIL PROTECTED]>

One comment inline, but not imo your problem :)

> ---
> 
> Index: linux/fs/filesystems.c
> ===
> --- linux.orig/fs/filesystems.c   2008-02-04 23:47:46.0 +0100
> +++ linux/fs/filesystems.c2008-02-04 23:48:04.0 +0100
> @@ -12,6 +12,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
> 
>  /*
> @@ -51,6 +52,57 @@ static struct file_system_type **find_fi
>   return p;
>  }
> 
> +#define MAX_FILESYSTEM_VARS 1
> +
> +struct filesystem_sysctl_table {
> + struct ctl_table_header *header;
> + struct ctl_table table[MAX_FILESYSTEM_VARS + 1];
> +};
> +
> +/*
> + * Create /sys/fs/types/${FSNAME} directory with per fs-type tunables.
> + */
> +static int filesystem_sysctl_register(struct file_system_type *fs)
> +{
> + struct filesystem_sysctl_table *t;
> + struct ctl_path path[] = {
> + { .procname = "fs", .ctl_name = CTL_FS },
> + { .procname = "types", .ctl_name = CTL_UNNUMBERED },
> + { .procname = fs->name, .ctl_name = CTL_UNNUMBERED },
> + { }
> + };
> +
> + t = kzalloc(sizeof(*t), GFP_KERNEL);
> + if (!t)
> + return -ENOMEM;
> +
> +
> + t->table[0].ctl_name = CTL_UNNUMBERED;
> + t->table[0].procname = "usermount_safe";
> + t->table[0].maxlen = sizeof(int);
> + t->table[0].data = &fs->fs_safe;
> + t->table[0].mode = 0644;

Yikes, this could be a problem for containers, as it's simply tied to
uid 0, whereas tying it to a capability would let us solve it with
capability bounds.

This might mean more urgency to get user namespaces working at least
with sysfs, else this is a quick way around having CAP_SYS_ADMIN taken
out of a container's capability bounding set.
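
The concern is easier to see against the uid-based check that a 0644 sysctl entry is fed into. Below is a userspace model of the existing test_perm() logic; as an assumption of the sketch, euid and membership of group 0 are passed in as parameters instead of being read from current. Nothing in it consults capabilities, so a uid 0 task whose bounding set lacks CAP_SYS_ADMIN would still be allowed to write.

```c
#include <errno.h>
#include <stdbool.h>

/* op uses the kernel's MAY_WRITE (2) and MAY_READ (4) values. */
int old_test_perm_model(int mode, int op, unsigned int euid, bool in_group0)
{
    if (euid == 0)
        mode >>= 6;         /* uid 0 is judged by the owner bits */
    else if (in_group0)
        mode >>= 3;         /* members of gid 0 by the group bits */

    if ((mode & op & 0007) == op)
        return 0;

    return -EACCES;
}
```

With mode 0644, old_test_perm_model(0644, 2, 0, false) returns 0 no matter which capabilities the caller holds, whereas an ordinary uid gets -EACCES for the same write.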

> + t->table[0].proc_handler = &proc_dointvec;
> +
> + t->header = register_sysctl_paths(path, t->table);
> + if (!t->header) {
> + kfree(t);
> + return -ENOMEM;
> + }
> +
> + fs->sysctl_table = t;
> +
> + return 0;
> +}
> +
> +static void filesystem_sysctl_unregister(struct file_system_type *fs)
> +{
> + struct filesystem_sysctl_table *t = fs->sysctl_table;
> +
> + unregister_sysctl_table(t->header);
> + kfree(t);
> +}
> +
>  /**
>   *   register_filesystem - register a new filesystem
>   *   @fs: the file system structure
> @@ -80,6 +132,13 @@ int register_filesystem(struct file_syst
>   else
>   *p = fs;
>   write_unlock(&file_systems_lock);
> +
> + if (res == 0) {
> + res = filesystem_sysctl_register(fs);
> + if (res != 0)
> + unregister_filesystem(fs);
> + }
> +
>   return res;
>  }
> 
> @@ -108,6 +167,7 @@ int unregister_filesystem(struct file_sy
>   *tmp = fs->next;
>   fs->next = NULL;
>   write_unlock(&file_systems_lock);
> + filesystem_sysctl_unregister(fs);
>   return 0;
>   }
>   tmp = &(*tmp)->next;
> Index: linux/include/linux/fs.h
> ===
> --- linux.orig/include/linux/fs.h 2008-02-04 23:48:02.0 +0100
> +++ linux/include/linux/fs.h  2008-02-04 23:48:04.0 +0100
> @@ -1444,6 +1444,7 @@ struct file_system_type {
>   struct module *owner;
>   struct file_system_type * next;
>   struct list_head fs_supers;
> + struct filesystem_sysctl_table *sysctl_table;
> 
>   struct lock_class_key s_lock_key;
>   struct lock_class_key s_umount_key;
> Index: linux/Documentation/filesystems/proc.txt
> ===
> --- linux.orig/Documentation/filesystems/proc.txt 2008-02-04 23:47:58.0 +0100
> +++ linux/Documentation/filesystems/proc.txt  2008-02-04 23:48:04.0 +0100
> @@ -44,6 +44,7 @@ Table of Contents
>2.14   /proc//io - Display the IO accounting fields
>2.15   /proc//coredump_filter - Core dump filtering settings
>2.16   /proc//mountinfo - Information about mounts
> +  2.17   /proc/s

Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property

2008-01-22 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > > What do you think about doing this only if FS_SAFE is also set,
> > > > so for instance at first only FUSE would allow itself to be
> > > > made user-mountable?
> > > > 
> > > > A safe thing to do, or overly intrusive?
> > > 
> > > It goes somewhat against the "no policy in kernel" policy ;).  I think
> > > the warning in the documentation should be enough to make sysadmins
> > > think twice before doing anything foolish:
> > 
> > Warning in which documentation?  A sysadmin considering setting fs_safe
> > for ext2 or xfs isn't going to be looking at fuse docs, which I think is
> > what you're talking about.  Are you going to add a file under
> > Documentation/filesystems?
> 
> Yes, I meant documentation of the new sysctl tunable in
> Documentation/filesystems/proc.txt:

Argh, sorry.

> > Index: linux/Documentation/filesystems/proc.txt
> > ===
> > --- linux.orig/Documentation/filesystems/proc.txt   2008-01-16 13:25:07.0 +0100
> > +++ linux/Documentation/filesystems/proc.txt2008-01-16 13:25:09.0 +0100
> > @@ -43,6 +43,7 @@ Table of Contents
> >2.13 /proc//oom_score - Display current oom-killer score
> >2.14 /proc//io - Display the IO accounting fields
> >2.15 /proc//coredump_filter - Core dump filtering settings
> > +  2.16 /proc/sys/fs/types - File system type specific parameters
> >  
> >  
> > --
> >  Preface
> > @@ -2283,4 +2284,21 @@ For example:
> >$ echo 0x7 > /proc/self/coredump_filter
> >$ ./some_program
> >  
> > +2.16 /proc/sys/fs/types/ - File system type specific parameters
> > +
> > +
> > +There's a separate directory /proc/sys/fs/types// for each
> > +filesystem type, containing the following files:
> > +
> > +usermount_safe
> > +--
> > +
> > +Setting this to non-zero will allow filesystems of this type to be
> > +mounted by unprivileged users (note, that there are other
> > +prerequisites as well).
> > +
> > +Care should be taken when enabling this, since most
> > +filesystems haven't been designed with unprivileged mounting
> > +in mind.
> > +
> >  
> > --
> > 
> 
> Do you think this is enough?  Or do we need something more, to prevent
> sysadmin inadvertently setting this for an unsafe filesystem?

I would think something more would be good.  First explaining
that fuse should be safe modulo warnings in the fuse documentation,
procfs and sysfs may be safe, while other filesystems are not known safe
at all.

Then explaining the dangers with not-known-safe filesystems and what is
needed to make them safe.  Clearly making sure input validation is
properly done so for instance getsb() doesn't turn into a buffer
overflow, etc.
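
As an example of the kind of item such a checklist might contain, here is a userspace sketch of validating an on-disk superblock before any field is used. The layout, the magic number and the limits are all made up for illustration and do not correspond to any real filesystem.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAGIC      0x44454d4fu
#define DEMO_LABEL_MAX  16u

/* Hypothetical on-disk superblock, fully under the control of the mounting user. */
struct demo_raw_sb {
    uint32_t magic;
    uint32_t block_size;
    uint32_t block_count;
    uint32_t label_len;
    char     label[DEMO_LABEL_MAX];
};

bool demo_sb_valid(const struct demo_raw_sb *sb, uint64_t device_bytes)
{
    if (sb->magic != DEMO_MAGIC)
        return false;

    /* Block size must be a power of two between 512 and 65536. */
    if (sb->block_size < 512 || sb->block_size > 65536 ||
        (sb->block_size & (sb->block_size - 1)) != 0)
        return false;

    /* The claimed size must fit the device; 64-bit math avoids overflow. */
    if ((uint64_t)sb->block_size * sb->block_count > device_bytes)
        return false;

    /* Never trust a length that will later index a fixed-size buffer. */
    if (sb->label_len > DEMO_LABEL_MAX)
        return false;

    return true;
}
```

The point is that every field read from a user-supplied image is checked against a bound before it is used as a size, count or index.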

Such a checklist also would be useful for holding a meaningful discussion
about the other filesystems and maybe turning some people loose on
an audit of other filesystems.

thanks,
-serge


Re: [RFC][PATCH] VFS: create /proc/<pid>/mountinfo

2008-01-22 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > On Mon, 2008-01-21 at 22:25 +0100, Miklos Szeredi wrote:
> > > > You have removed the code that checked if the peer or
> > > > master mount was in the same namespace before reporting their
> > > > corresponding mount-ids. One downside of that approach is the
> > > > user will see a mount_id in the output with no corresponding
> > > > line to explain the details of the mount_id.  
> > > 
> > > Before the change, the peer and master ID's were basically randomly
> > > chosen from the peers, which means, it wasn't possible to always
> > > determine, that two mounts were peers, or that they were slaves to the
> > > same peer group.
> > > 
> > > After the change, this is possible, since the peer ID will be the same
> > > for all mounts which are peers.  This means, that even though the peer
> > > ID might be in a different namespace, it is possible to determine all
> > > peers within the same namespace by comparing their peer ID's.
> > 
> > 
> >  I agree with your reasoning on the random id; showing a single
> >  id avoids clutter. But my point is, why not show an
> >  id for the master or peer residing in the same namespace?
> 
> Because this way it is possible to see propagation between different
> namespaces as well, by looking at the mount information for processes
> in the different namespaces.  Of course, this is only possible with
> sufficient privileges.

Gotta say I agree with Miklos this would be useful.  I'd far prefer to
see the id than a -1.
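
A quick sketch of why a stable id is more useful than -1: with one peer id per group, a tool can find peers inside its own namespace by plain comparison, and an id with no match simply means the rest of the group lives elsewhere. The struct below is hypothetical and is not the mountinfo format itself; treating 0 as "not shared" is also an assumption.

```c
#include <stddef.h>

/* Hypothetical parsed view of one mount visible in the current namespace. */
struct mountinfo_entry {
    int mount_id;
    int peer_id;    /* 0 means "not shared" in this sketch */
};

/* Count the other visible mounts that belong to the same peer group as e[self]. */
size_t count_peers(const struct mountinfo_entry *e, size_t n, size_t self)
{
    size_t i, peers = 0;

    if (e[self].peer_id == 0)
        return 0;

    for (i = 0; i < n; i++) {
        if (i != self && e[i].peer_id == e[self].peer_id)
            peers++;
    }
    return peers;
}
```

A mount whose peer id matches nothing in the list still carries information: it is shared, just with mounts in another namespace.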

thanks,
-serge

> >  Showing an id with no corresponding entry for that id, can be
> >  intriguing.
> 
> Not if it's clearly documented (will add documentation for the next
> submission).
> 
> Miklos


Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property

2008-01-22 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > What do you think about doing this only if FS_SAFE is also set,
> > so for instance at first only FUSE would allow itself to be
> > made user-mountable?
> > 
> > A safe thing to do, or overly intrusive?
> 
> It goes somewhat against the "no policy in kernel" policy ;).  I think
> the warning in the documentation should be enough to make sysadmins
> think twice before doing anything foolish:

Warning in which documentation?  A sysadmin considering setting fs_safe
for ext2 or xfs isn't going to be looking at fuse docs, which I think is
what you're talking about.  Are you going to add a file under
Documentation/filesystems?

> > +Care should be taken when enabling this, since most
> > +filesystems haven't been designed with unprivileged mounting
> > +in mind.
> > +
> 
> BTW, filesystems like 'proc' and 'sysfs' should also be safe, although
> the only use for them being marked safe is if the users are allowed to
> umount them from their private namespace (otherwise a 'mount --bind'
> has the same effect as a new mount).
> 
> Thanks,
> Miklos


Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property

2008-01-21 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> Add the following:
> 
>   /proc/sys/fs/types/${FS_TYPE}/usermount_safe
> 
> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
> ---
> 
> Index: linux/fs/filesystems.c
> ===
> --- linux.orig/fs/filesystems.c   2008-01-16 13:24:52.0 +0100
> +++ linux/fs/filesystems.c2008-01-16 13:25:09.0 +0100
> @@ -12,6 +12,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
> 
>  /*
> @@ -51,6 +52,57 @@ static struct file_system_type **find_fi
>   return p;
>  }
> 
> +#define MAX_FILESYSTEM_VARS 1
> +
> +struct filesystem_sysctl_table {
> + struct ctl_table_header *header;
> + struct ctl_table table[MAX_FILESYSTEM_VARS + 1];
> +};
> +
> +/*
> + * Create /sys/fs/types/${FSNAME} directory with per fs-type tunables.
> + */
> +static int filesystem_sysctl_register(struct file_system_type *fs)
> +{
> + struct filesystem_sysctl_table *t;
> + struct ctl_path path[] = {
> + { .procname = "fs", .ctl_name = CTL_FS },
> + { .procname = "types", .ctl_name = CTL_UNNUMBERED },
> + { .procname = fs->name, .ctl_name = CTL_UNNUMBERED },
> + { }
> + };
> +
> + t = kzalloc(sizeof(*t), GFP_KERNEL);
> + if (!t)
> + return -ENOMEM;
> +
> +
> + t->table[0].ctl_name = CTL_UNNUMBERED;
> + t->table[0].procname = "usermount_safe";
> + t->table[0].maxlen = sizeof(int);
> + t->table[0].data = &fs->fs_safe;
> + t->table[0].mode = 0644;
> + t->table[0].proc_handler = &proc_dointvec;
> +
> + t->header = register_sysctl_paths(path, t->table);
> + if (!t->header) {
> + kfree(t);
> + return -ENOMEM;
> + }
> +
> + fs->sysctl_table = t;
> +
> + return 0;
> +}
> +
> +static void filesystem_sysctl_unregister(struct file_system_type *fs)
> +{
> + struct filesystem_sysctl_table *t = fs->sysctl_table;
> +
> + unregister_sysctl_table(t->header);
> + kfree(t);
> +}
> +
>  /**
>   *   register_filesystem - register a new filesystem
>   *   @fs: the file system structure
> @@ -80,6 +132,13 @@ int register_filesystem(struct file_syst
>   else
>   *p = fs;
>   write_unlock(&file_systems_lock);
> +
> + if (res == 0) {
> + res = filesystem_sysctl_register(fs);

What do you think about doing this only if FS_SAFE is also set,
so for instance at first only FUSE would allow itself to be
made user-mountable?

A safe thing to do, or overly intrusive?

> + if (res != 0)
> + unregister_filesystem(fs);
> + }
> +
>   return res;
>  }
> 
> @@ -108,6 +167,7 @@ int unregister_filesystem(struct file_sy
>   *tmp = fs->next;
>   fs->next = NULL;
>   write_unlock(&file_systems_lock);
> + filesystem_sysctl_unregister(fs);
>   return 0;
>   }
>   tmp = &(*tmp)->next;
> Index: linux/include/linux/fs.h
> ===
> --- linux.orig/include/linux/fs.h 2008-01-16 13:25:09.0 +0100
> +++ linux/include/linux/fs.h  2008-01-16 13:25:09.0 +0100
> @@ -1437,6 +1437,7 @@ struct file_system_type {
>   struct module *owner;
>   struct file_system_type * next;
>   struct list_head fs_supers;
> + struct filesystem_sysctl_table *sysctl_table;
> 
>   struct lock_class_key s_lock_key;
>   struct lock_class_key s_umount_key;
> Index: linux/Documentation/filesystems/proc.txt
> ===
> --- linux.orig/Documentation/filesystems/proc.txt 2008-01-16 13:25:07.0 +0100
> +++ linux/Documentation/filesystems/proc.txt  2008-01-16 13:25:09.0 +0100
> @@ -43,6 +43,7 @@ Table of Contents
>2.13   /proc//oom_score - Display current oom-killer score
>2.14   /proc//io - Display the IO accounting fields
>2.15   /proc//coredump_filter - Core dump filtering settings
> +  2.16   /proc/sys/fs/types - File system type specific parameters
> 
>  
> --
>  Preface
> @@ -2283,4 +2284,21 @@ For example:
>$ echo 0x7 > /proc/self/coredump_filter
>$ ./some_program
> 
> +2.16 /proc/sys/fs/types/ - File system type specific parameters
> +
> +
> +There's a separate directory /proc/sys/fs/types// for each
> +filesystem type, containing the following files:
> +
> +usermount_safe
> +--
> +
> +Setting this to non-zero will allow filesystems of this type to be
> +mounted by unprivileged users (note, that there are other
> +prerequisites as well).
> +
> +Care should be taken when enabling this, since most
> +filesystems haven

Re: [patch 09/10] unprivileged mounts: propagation: inherit owner from parent

2008-01-21 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> On mount propagation, let the owner of the clone be inherited from the
> parent into which it has been propagated.
> 
> If the parent has the "nosuid" flag, set this flag for the child as
> well.  This is needed for the suid-less namespace (use case #2 in the
> first patch header), where all mounts are owned by the user and have
> the nosuid flag set.  In this case the propagated mount needs to have
> nosuid, otherwise a suid executable may be misused by the user.
> 
> Similar treatment is not needed for "nodev", because devices can't be
> abused this way: the user is not able to gain privileges to devices by
> rearranging the mount namespace.
> 
> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>

As discussed many months ago this does seem like the most appropriate
behavior for propagation.
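
For anyone skimming, the rule from the description boils down to something like the userspace sketch below. It is a simplified model: the flag values and the helper propagation_clone_flags() are hypothetical, and the capability checks done by the real __set_mnt_user() are left out.

```c
/* Flag values are local to this model and do not match the kernel's. */
#define MODEL_MNT_NOSUID 0x01u
#define MODEL_MNT_USER   0x02u
#define MODEL_CL_SETUSER 0x01u
#define MODEL_CL_NOSUID  0x02u

struct mnt_model {
    unsigned int flags;
    unsigned int uid;
};

/* Derive clone flags and owner from the parent that receives the propagation. */
unsigned int propagation_clone_flags(const struct mnt_model *parent,
                                     unsigned int *owner)
{
    unsigned int cl = 0;

    *owner = 0;
    if (parent->flags & MODEL_MNT_USER) {
        cl |= MODEL_CL_SETUSER;
        *owner = parent->uid;
    }
    if (parent->flags & MODEL_MNT_NOSUID)
        cl |= MODEL_CL_NOSUID;

    return cl;
}

/* Clone a mount: MNT_USER is never copied, only set again on request. */
struct mnt_model clone_mnt_model(const struct mnt_model *old,
                                 unsigned int cl, unsigned int owner)
{
    struct mnt_model m = { old->flags & ~MODEL_MNT_USER, 0 };

    if (cl & MODEL_CL_SETUSER) {
        m.uid = owner;
        m.flags |= MODEL_MNT_USER;
    }
    if (cl & MODEL_CL_NOSUID)
        m.flags |= MODEL_MNT_NOSUID;

    return m;
}
```

So a mount propagated into a user-owned nosuid parent comes out owned by that user and nosuid, regardless of how the source mount was flagged.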

Acked-by: Serge Hallyn <[EMAIL PROTECTED]>

> ---
> 
> Index: linux/fs/namespace.c
> ===
> --- linux.orig/fs/namespace.c 2008-01-16 13:25:09.0 +0100
> +++ linux/fs/namespace.c  2008-01-16 13:25:11.0 +0100
> @@ -506,10 +506,10 @@ static int reserve_user_mount(void)
>   return err;
>  }
> 
> -static void __set_mnt_user(struct vfsmount *mnt)
> +static void __set_mnt_user(struct vfsmount *mnt, uid_t owner)
>  {
>   WARN_ON(mnt->mnt_flags & MNT_USER);
> - mnt->mnt_uid = current->fsuid;
> + mnt->mnt_uid = owner;
>   mnt->mnt_flags |= MNT_USER;
> 
>   if (!capable(CAP_SETUID))
> @@ -520,7 +520,7 @@ static void __set_mnt_user(struct vfsmou
> 
>  static void set_mnt_user(struct vfsmount *mnt)
>  {
> - __set_mnt_user(mnt);
> + __set_mnt_user(mnt, current->fsuid);
>   spin_lock(&vfsmount_lock);
>   nr_user_mounts++;
>   spin_unlock(&vfsmount_lock);
> @@ -536,7 +536,7 @@ static void clear_mnt_user(struct vfsmou
>  }
> 
>  static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
> - int flag)
> + int flag, uid_t owner)
>  {
>   struct super_block *sb = old->mnt_sb;
>   struct vfsmount *mnt;
> @@ -560,7 +560,10 @@ static struct vfsmount *clone_mnt(struct
>   /* don't copy the MNT_USER flag */
>   mnt->mnt_flags &= ~MNT_USER;
>   if (flag & CL_SETUSER)
> - __set_mnt_user(mnt);
> + __set_mnt_user(mnt, owner);
> +
> + if (flag & CL_NOSUID)
> + mnt->mnt_flags |= MNT_NOSUID;
> 
>   if (flag & CL_SLAVE) {
>   list_add(&mnt->mnt_slave, &old->mnt_slave_list);
> @@ -1066,7 +1069,7 @@ static int lives_below_in_same_fs(struct
>  }
> 
>  struct vfsmount *copy_tree(struct vfsmount *mnt, struct dentry *dentry,
> - int flag)
> + int flag, uid_t owner)
>  {
>   struct vfsmount *res, *p, *q, *r, *s;
>   struct nameidata nd;
> @@ -1074,7 +1077,7 @@ struct vfsmount *copy_tree(struct vfsmou
>   if (!(flag & CL_COPY_ALL) && IS_MNT_UNBINDABLE(mnt))
>   return ERR_PTR(-EPERM);
> 
> - res = q = clone_mnt(mnt, dentry, flag);
> + res = q = clone_mnt(mnt, dentry, flag, owner);
>   if (IS_ERR(q))
>   goto error;
>   q->mnt_mountpoint = mnt->mnt_mountpoint;
> @@ -1096,7 +1099,7 @@ struct vfsmount *copy_tree(struct vfsmou
>   p = s;
>   nd.path.mnt = q;
>   nd.path.dentry = p->mnt_mountpoint;
> - q = clone_mnt(p, p->mnt_root, flag);
> + q = clone_mnt(p, p->mnt_root, flag, owner);
>   if (IS_ERR(q))
>   goto error;
>   spin_lock(&vfsmount_lock);
> @@ -1121,7 +1124,7 @@ struct vfsmount *collect_mounts(struct v
>  {
>   struct vfsmount *tree;
>   down_read(&namespace_sem);
> - tree = copy_tree(mnt, dentry, CL_COPY_ALL | CL_PRIVATE);
> + tree = copy_tree(mnt, dentry, CL_COPY_ALL | CL_PRIVATE, 0);
>   up_read(&namespace_sem);
>   return tree;
>  }
> @@ -1292,7 +1295,8 @@ static int do_change_type(struct nameida
>   */
>  static int do_loopback(struct nameidata *nd, char *old_name, int flags)
>  {
> - int clone_fl;
> + int clone_fl = 0;
> + uid_t owner = 0;
>   struct nameidata old_nd;
>   struct vfsmount *mnt = NULL;
>   int err;
> @@ -1313,11 +1317,17 @@ static i

Re: [patch 08/10] unprivileged mounts: make fuse safe

2008-01-21 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> Don't require the "user_id=" and "group_id=" options for unprivileged mounts,
> but if they are present, verify them for sanity.
> 
> Disallow the "allow_other" option for unprivileged mounts.
> 
> FUSE was designed from the beginning to be safe for unprivileged
> users.  This has also been verified in practice over many years, with
> some distributions enabling unprivileged FUSE mounts by default.
> 
> However there are some properties of FUSE that could make it unsafe
> for certain situations (e.g. multiuser system with untrusted users):
> 
>  - It is not always possible to use kill(2) (not even with SIGKILL) to
>    terminate a process using a FUSE filesystem.  However it is
>    possible to use any of the following instead:
>      o kill the filesystem daemon
>      o use forced umounting
>      o use the "fusectl" control filesystem
> 
>  - As a special case of the above, killing a self-deadlocked FUSE
>    process is not possible, and even killall5 will not terminate it.
> 
>  - Due to the design of the process freezer, a hanging (due to network
>    problems, etc) or malicious filesystem may prevent suspend to
>    RAM or hibernation from succeeding.  This is not actually unique to
>    FUSE, as any hanging network filesystem will have the same effect.
> 
> If the above could pose a threat to the system, it is recommended
> that the '/proc/sys/fs/types/fuse/safe' sysctl tunable is not turned
> on, and/or '/dev/fuse' is not made world-readable and writable.
> 
> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>

I was going to say "this should of course be acked by a fuse
maintainer", then I look at MAINTAINERS :)  So never mind.

Acked-by: Serge Hallyn <[EMAIL PROTECTED]>

> ---
> 
> Index: linux/fs/fuse/inode.c
> ===
> --- linux.orig/fs/fuse/inode.c 2008-01-16 13:24:52.0 +0100
> +++ linux/fs/fuse/inode.c 2008-01-16 13:25:10.0 +0100
> @@ -357,6 +357,19 @@ static int parse_fuse_opt(char *opt, str
>   d->max_read = ~0;
>   d->blksize = 512;
> 
> + /*
> +  * For unprivileged mounts use current uid/gid.  Still allow
> +  * "user_id" and "group_id" options for compatibility, but
> +  * only if they match these values.
> +  */
> + if (!capable(CAP_SYS_ADMIN)) {
> + d->user_id = current->uid;
> + d->user_id_present = 1;
> + d->group_id = current->gid;
> + d->group_id_present = 1;
> +
> + }
> +
>   while ((p = strsep(&opt, ",")) != NULL) {
>   int token;
>   int value;
> @@ -385,6 +398,8 @@ static int parse_fuse_opt(char *opt, str
>   case OPT_USER_ID:
>   if (match_int(&args[0], &value))
>   return 0;
> + if (d->user_id_present && d->user_id != value)
> + return 0;
>   d->user_id = value;
>   d->user_id_present = 1;
>   break;
> @@ -392,6 +407,8 @@ static int parse_fuse_opt(char *opt, str
>   case OPT_GROUP_ID:
>   if (match_int(&args[0], &value))
>   return 0;
> + if (d->group_id_present && d->group_id != value)
> + return 0;
>   d->group_id = value;
>   d->group_id_present = 1;
>   break;
> @@ -596,6 +613,10 @@ static int fuse_fill_super(struct super_
>   if (!parse_fuse_opt((char *) data, &d, is_bdev))
>   return -EINVAL;
> 
> + /* This is a privileged option */
> + if ((d.flags & FUSE_ALLOW_OTHER) && !capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
>   if (is_bdev) {
>  #ifdef CONFIG_BLOCK
>   if (!sb_set_blocksize(sb, d.blksize))
> 
> --
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
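
The option checks in the patch above can be modeled in userspace. This is
only a minimal sketch under stated assumptions: struct mount_opts,
parse_uint() and check_fuse_opts() are hypothetical names, the return codes
are invented for illustration, and the real code lives in parse_fuse_opt()
and fuse_fill_super() with capable(CAP_SYS_ADMIN) in place of the
"privileged" argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the option state filled in by parse_fuse_opt(). */
struct mount_opts {
    unsigned user_id, group_id;
    bool user_id_present, group_id_present;
    bool allow_other;
};

/* Parse a plain decimal number; reject empty or non-numeric input. */
static bool parse_uint(const char *s, unsigned *out)
{
    char *end;

    if (*s < '0' || *s > '9')
        return false;
    *out = (unsigned)strtoul(s, &end, 10);
    return *end == '\0';
}

/*
 * Returns 0 if the options are acceptable, -1 if an option is malformed or
 * an id does not match an already recorded value, and -2 if an unprivileged
 * caller asked for allow_other.
 */
int check_fuse_opts(const char *opts, bool privileged,
                    unsigned uid, unsigned gid, struct mount_opts *d)
{
    char buf[256];
    char *p, *next;
    unsigned value;

    assert(opts != NULL && d != NULL);
    if (strlen(opts) >= sizeof(buf))
        return -1;
    strcpy(buf, opts);
    memset(d, 0, sizeof(*d));

    /* Unprivileged callers start with their own ids already filled in. */
    if (!privileged) {
        d->user_id = uid;
        d->user_id_present = true;
        d->group_id = gid;
        d->group_id_present = true;
    }

    for (p = buf; p != NULL && *p != '\0'; p = next) {
        next = strchr(p, ',');
        if (next != NULL)
            *next++ = '\0';

        if (strncmp(p, "user_id=", 8) == 0) {
            if (!parse_uint(p + 8, &value))
                return -1;
            if (d->user_id_present && d->user_id != value)
                return -1;
            d->user_id = value;
            d->user_id_present = true;
        } else if (strncmp(p, "group_id=", 9) == 0) {
            if (!parse_uint(p + 9, &value))
                return -1;
            if (d->group_id_present && d->group_id != value)
                return -1;
            d->group_id = value;
            d->group_id_present = true;
        } else if (strcmp(p, "allow_other") == 0) {
            d->allow_other = true;
        }
        /* Any other option is ignored in this sketch. */
    }

    /* allow_other is a privileged option. */
    if (d->allow_other && !privileged)
        return -2;
    return 0;
}
```

With this model an unprivileged caller may still pass "user_id=" and
"group_id=", but only with its own ids, which is the compatibility behaviour
the commit message describes.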


Re: [TOMOYO #6 retry 02/21] Add struct vfsmount to struct task_struct.

2008-01-16 Thread Serge E. Hallyn
Quoting Kentaro Takeda ([EMAIL PROTECTED]):
> Hello.
> 
> Serge E. Hallyn wrote:
> > I must say I personally prefer the apparmor approach.
> No problem.
> 
> > But I'd recommend
> > you get together and get this piece pushed on its own, whichever version
> > you can agree on.
> TOMOYO can use AppArmor's patch.

Right, but one will be preferred by the community - and while I have my
own preference, I wouldn't put too much faith in that.  Rather, talk with
the apparmor folks, look over the lkml logs for previous submissions,
and then decide.

> > Yes it needs a user, but at this point I would think
> > both tomoyo and apparmor have had enough visibility that everyone knows
> > the intended users.
> Not only AppArmor and TOMOYO but also SELinux want to use "vfsmount".
> (http://marc.info/?l=selinux&m=120005904211942&w=2)
> 
> > It seems to me you're both being held up by this piece, and getting
> > another full posting of either tomoyo or apparmor isn't going to help,
> > so hopefully you can combine your efforts to get this solved.
> We welcome AppArmor's vfsmount patches, but I wonder why AppArmor's
> vfsmount patches are not merged yet.
> 
> What prevents AppArmor's vfsmount patches from merging into -mm tree?

I don't recall what objections remained at the last posting.  Far as I
know there may have simply been no responses due to patch fatigue.  (it
happens)

-serge


Re: [patch 8/9] unprivileged mounts: propagation: inherit owner from parent

2008-01-15 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > > > On mount propagation, let the owner of the clone be inherited from the
> > > > > parent into which it has been propagated.  Also if the parent has the
> > > > > "nosuid" flag, set this flag for the child as well.
> > > > 
> > > > What about nodev?
> > > 
> > > Hmm, I think the nosuid thing is meant to prevent suid mounts being
> > > introduced into a "suidless" namespace.  This doesn't apply to dev
> > > mounts, which are quite safe in a suidless environment, as long as the
> > > user is not able to create devices.  But that should be taken care of
> > > by capability tests.
> > > 
> > > I'll update the description.
> > 
> > Hmm,
> > 
> > Part of me wants to say the safest thing for now would be to refuse
> > mounts propagation from non-user mounts to user mounts.
> > 
> > I assume you're thinking about a fully user-mounted chroot, where
> > the user would still want to be able to stick in a cdrom and have
> > it automounted under /mnt/cdrom, propagated from the root mounts ns?
> 
> Right.
> 
> > But then are there no devices which the user could create on a floppy
> > while inserted into his own laptop, owned by his own uid, then insert
> > into this machine, and use the device under the auto-mounted /dev/floppy
> > to gain inappropriate access?
> 
> I assume, that the floppy and cdrom are already mounted with
> nosuid,nodev.

Yeah, of course, what I'm saying is no different whether the upper mount
is a user mount or not.  You're right.

> The problem case, I think, is if a sysadmin does some mounting in the
> initial namespace, and this is propagated into the fully user-mounted
> namespace (or chroot), so that a mount with suid binaries slips in.
> Which is bad, because the user may be able to rearrange the namespace to
> trick the suid program into doing something it should not do.

And really this shouldn't be an issue at all - the usermount chroot
would be set up under something like /share/hallyn/root, so the admin
would have to purposely set up propagation into that tree, so this
won't be happening by accident.

> OTOH, a mount with devices can't be abused this way, since it is not
> possible to gain privileges to files/devices just by rearranging the
> mounts.

Thanks for humoring me,

-serge


Re: [patch 8/9] unprivileged mounts: propagation: inherit owner from parent

2008-01-15 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > From: Miklos Szeredi <[EMAIL PROTECTED]>
> > > 
> > > On mount propagation, let the owner of the clone be inherited from the
> > > parent into which it has been propagated.  Also if the parent has the
> > > "nosuid" flag, set this flag for the child as well.
> > 
> > What about nodev?
> 
> Hmm, I think the nosuid thing is meant to prevent suid mounts being
> introduced into a "suidless" namespace.  This doesn't apply to dev
> mounts, which are quite safe in a suidless environment, as long as the
> user is not able to create devices.  But that should be taken care of
> by capability tests.
> 
> I'll update the description.

Hmm,

Part of me wants to say the safest thing for now would be to refuse
mounts propagation from non-user mounts to user mounts.

I assume you're thinking about a fully user-mounted chroot, where
the user would still want to be able to stick in a cdrom and have
it automounted under /mnt/cdrom, propagated from the root mounts ns?

But then are there no devices which the user could create on a floppy
while inserted into his own laptop, owned by his own uid, then insert
into this machine, and use the device under the auto-mounted /dev/floppy
to gain inappropriate access?

-serge


Re: [patch 9/9] unprivileged mounts: add "no submounts" flag

2008-01-15 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > Why not "nosubmnt"?
> 
> Why not indeed.  Maybe I should try to use my brain sometime.

Well it really should have 'user' or 'unpriv' in the name
somewhere.  'nosubmnt' is more confusing than 'nomnt' because
"no submounts" really sounds like a reasonable thing in
itself...

But I never win naming arguments, so I accept that I have poor
naming judgement  :)

-serge


Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts

2008-01-15 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > Sounds like a sysctl to enable FS_SAFE for fuse will make this patch
> > acceptable to everyone?
> 
> I think the most generic approach is to be able to set "safeness" for
> any fs type, not just fuse (Karel's suggestion).
> 
> E.g:
> 
>   echo 1 > /proc/sys/fs/types/cifs/safe
> 
> This would also provide a way to query the FS_SAFE flag.

That sounds good.

-serge
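
A rough model of what such a per-type tunable would do, assuming it simply
sets or clears FS_SAFE on the named filesystem type. Everything here is
hypothetical: the table, the helper names, and the FS_SAFE bit value are
made up for illustration, and no /proc file is touched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define FS_SAFE 0x1u    /* hypothetical bit value, for this sketch only */

/* Hypothetical reduced view of a registered filesystem type. */
struct fs_type_model {
    const char *name;
    unsigned fs_flags;
};

static struct fs_type_model fs_types[] = {
    { "fuse", 0 },
    { "cifs", 0 },
    { "ext3", 0 },
};

static struct fs_type_model *find_fs(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(fs_types) / sizeof(fs_types[0]); i++)
        if (strcmp(fs_types[i].name, name) == 0)
            return &fs_types[i];
    return NULL;
}

/* Models "echo 0|1 > .../types/<name>/safe"; returns -1 for unknown types. */
int set_fs_safe(const char *name, bool safe)
{
    struct fs_type_model *t;

    assert(name != NULL);
    t = find_fs(name);
    if (t == NULL)
        return -1;
    if (safe)
        t->fs_flags |= FS_SAFE;
    else
        t->fs_flags &= ~FS_SAFE;
    return 0;
}

/* Models reading the tunable back: 1, 0, or -1 for unknown types. */
int get_fs_safe(const char *name)
{
    struct fs_type_model *t;

    assert(name != NULL);
    t = find_fs(name);
    if (t == NULL)
        return -1;
    return (t->fs_flags & FS_SAFE) ? 1 : 0;
}
```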


Re: [patch 9/9] unprivileged mounts: add "no submounts" flag

2008-01-14 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> Add a new mount flag "nomnt", which denies submounts for the owner.
> This would be useful, if we want to support traditional /etc/fstab
> based user mounts.
> 
> In this case mount(8) would still have to be suid-root, to check the
> mountpoint against the user/users flag in /etc/fstab, but /etc/mtab
> would no longer be mandatory for storing the actual owner of the
> mount.

Ah, I see, so the floppy drive could be mounted as a MNT_NOMNT but
MNT_USER mount with mnt_owner set.  Makes sense.  I'd ask for a better
name than 'nomnt', but I can't think of one myself.

> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>

Acked-by: Serge Hallyn <[EMAIL PROTECTED]>

> ---
> 
> Index: linux/fs/namespace.c
> ===
> --- linux.orig/fs/namespace.c 2008-01-04 13:49:52.0 +0100
> +++ linux/fs/namespace.c  2008-01-04 13:50:28.0 +0100
> @@ -694,6 +694,7 @@ static int show_vfsmnt(struct seq_file *
>   { MNT_NOATIME, ",noatime" },
>   { MNT_NODIRATIME, ",nodiratime" },
>   { MNT_RELATIME, ",relatime" },
> + { MNT_NOMNT, ",nomnt" },
>   { 0, NULL }
>   };
>   struct proc_fs_info *fs_infop;
> @@ -1044,6 +1045,9 @@ static bool permit_mount(struct nameidat
>   if (S_ISLNK(inode->i_mode))
>   return false;
> 
> + if (nd->path.mnt->mnt_flags & MNT_NOMNT)
> + return false;
> +
>   if (!is_mount_owner(nd->path.mnt, current->fsuid))
>   return false;
> 
> @@ -1888,9 +1892,11 @@ long do_mount(char *dev_name, char *dir_
>   mnt_flags |= MNT_RELATIME;
>   if (flags & MS_RDONLY)
>   mnt_flags |= MNT_READONLY;
> + if (flags & MS_NOMNT)
> + mnt_flags |= MNT_NOMNT;
> 
> - flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
> -MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT);
> + flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_NOATIME |
> + MS_NODIRATIME | MS_RELATIME | MS_KERNMOUNT | MS_NOMNT);
> 
>   /* ... and get the mountpoint */
>   retval = path_lookup(dir_name, LOOKUP_FOLLOW, &nd);
> Index: linux/include/linux/fs.h
> ===
> --- linux.orig/include/linux/fs.h 2008-01-04 13:49:12.0 +0100
> +++ linux/include/linux/fs.h  2008-01-04 13:49:58.0 +0100
> @@ -130,6 +130,7 @@ extern int dir_notify_enable;
>  #define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
>  #define MS_I_VERSION (1<<23) /* Update inode I_version field */
>  #define MS_SETUSER   (1<<24) /* set mnt_uid to current user */
> +#define MS_NOMNT (1<<25) /* don't allow unprivileged submounts */
>  #define MS_ACTIVE    (1<<30)
>  #define MS_NOUSER    (1<<31)
> 
> Index: linux/include/linux/mount.h
> ===
> --- linux.orig/include/linux/mount.h  2008-01-04 13:45:45.0 +0100
> +++ linux/include/linux/mount.h   2008-01-04 13:49:58.0 +0100
> @@ -30,6 +30,7 @@ struct mnt_namespace;
>  #define MNT_NODIRATIME   0x10
>  #define MNT_RELATIME 0x20
>  #define MNT_READONLY 0x40 /* does the user want this to be r/o? */
> +#define MNT_NOMNT    0x80
> 
>  #define MNT_SHRINKABLE   0x100
>  #define MNT_IMBALANCED_WRITE_COUNT   0x200 /* just for debugging */
> 
> --


Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts

2008-01-14 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> Use FS_SAFE for "fuse" fs type, but not for "fuseblk".
> 
> FUSE was designed from the beginning to be safe for unprivileged users.  This
> has also been verified in practice over many years.  In addition unprivileged
> mounts require the parent mount to be owned by the user, which is more strict
> than the current userspace policy.
> 
> This will enable future installations to remove the suid-root fusermount
> utility.
> 
> Don't require the "user_id=" and "group_id=" options for unprivileged mounts,
> but if they are present, verify them for sanity.
> 
> Disallow the "allow_other" option for unprivileged mounts.
> 
> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>

Sounds like a sysctl to enable FS_SAFE for fuse will make this patch
acceptable to everyone?

> ---
> 
> Index: linux/fs/fuse/inode.c
> ===
> --- linux.orig/fs/fuse/inode.c 2008-01-03 17:13:13.0 +0100
> +++ linux/fs/fuse/inode.c 2008-01-03 21:28:01.0 +0100
> @@ -357,6 +357,19 @@ static int parse_fuse_opt(char *opt, str
>   d->max_read = ~0;
>   d->blksize = 512;
> 
> + /*
> +  * For unprivileged mounts use current uid/gid.  Still allow
> +  * "user_id" and "group_id" options for compatibility, but
> +  * only if they match these values.
> +  */
> + if (!capable(CAP_SYS_ADMIN)) {
> + d->user_id = current->uid;
> + d->user_id_present = 1;
> + d->group_id = current->gid;
> + d->group_id_present = 1;
> +
> + }
> +
>   while ((p = strsep(&opt, ",")) != NULL) {
>   int token;
>   int value;
> @@ -385,6 +398,8 @@ static int parse_fuse_opt(char *opt, str
>   case OPT_USER_ID:
>   if (match_int(&args[0], &value))
>   return 0;
> + if (d->user_id_present && d->user_id != value)
> + return 0;
>   d->user_id = value;
>   d->user_id_present = 1;
>   break;
> @@ -392,6 +407,8 @@ static int parse_fuse_opt(char *opt, str
>   case OPT_GROUP_ID:
>   if (match_int(&args[0], &value))
>   return 0;
> + if (d->group_id_present && d->group_id != value)
> + return 0;
>   d->group_id = value;
>   d->group_id_present = 1;
>   break;
> @@ -596,6 +613,10 @@ static int fuse_fill_super(struct super_
>   if (!parse_fuse_opt((char *) data, &d, is_bdev))
>   return -EINVAL;
> 
> + /* This is a privileged option */
> + if ((d.flags & FUSE_ALLOW_OTHER) && !capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
>   if (is_bdev) {
>  #ifdef CONFIG_BLOCK
>   if (!sb_set_blocksize(sb, d.blksize))
> @@ -696,9 +717,9 @@ static int fuse_get_sb(struct file_syste
>  static struct file_system_type fuse_fs_type = {
>   .owner  = THIS_MODULE,
>   .name   = "fuse",
> - .fs_flags   = FS_HAS_SUBTYPE,
>   .get_sb = fuse_get_sb,
>   .kill_sb    = kill_anon_super,
> + .fs_flags   = FS_HAS_SUBTYPE | FS_SAFE,
>  };
> 
>  #ifdef CONFIG_BLOCK
> 
> --


Re: [patch 8/9] unprivileged mounts: propagation: inherit owner from parent

2008-01-14 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> On mount propagation, let the owner of the clone be inherited from the
> parent into which it has been propagated.  Also if the parent has the
> "nosuid" flag, set this flag for the child as well.

What about nodev?

thanks,
-serge

> 
> This makes sense for example, when propagation is set up from the
> initial namespace into a per-user namespace, where some or all of the
> mounts may be owned by the user.
> 
> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
> ---
> 
> Index: linux/fs/namespace.c
> ===
> --- linux.orig/fs/namespace.c 2008-01-04 13:48:14.0 +0100
> +++ linux/fs/namespace.c  2008-01-04 13:49:52.0 +0100
> @@ -500,10 +500,10 @@ static int reserve_user_mount(void)
>   return err;
>  }
> 
> -static void __set_mnt_user(struct vfsmount *mnt)
> +static void __set_mnt_user(struct vfsmount *mnt, uid_t owner)
>  {
>   BUG_ON(mnt->mnt_flags & MNT_USER);
> - mnt->mnt_uid = current->fsuid;
> + mnt->mnt_uid = owner;
>   mnt->mnt_flags |= MNT_USER;
> 
>   if (!capable(CAP_SETUID))
> @@ -514,7 +514,7 @@ static void __set_mnt_user(struct vfsmou
> 
>  static void set_mnt_user(struct vfsmount *mnt)
>  {
> - __set_mnt_user(mnt);
> + __set_mnt_user(mnt, current->fsuid);
>   spin_lock(&vfsmount_lock);
>   nr_user_mounts++;
>   spin_unlock(&vfsmount_lock);
> @@ -530,7 +530,7 @@ static void clear_mnt_user(struct vfsmou
>  }
> 
>  static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
> - int flag)
> + int flag, uid_t owner)
>  {
>   struct super_block *sb = old->mnt_sb;
>   struct vfsmount *mnt;
> @@ -554,7 +554,10 @@ static struct vfsmount *clone_mnt(struct
>   /* don't copy the MNT_USER flag */
>   mnt->mnt_flags &= ~MNT_USER;
>   if (flag & CL_SETUSER)
> - __set_mnt_user(mnt);
> + __set_mnt_user(mnt, owner);
> +
> + if (flag & CL_NOSUID)
> + mnt->mnt_flags |= MNT_NOSUID;
> 
>   if (flag & CL_SLAVE) {
>   list_add(&mnt->mnt_slave, &old->mnt_slave_list);
> @@ -1060,7 +1063,7 @@ static int lives_below_in_same_fs(struct
>  }
> 
>  struct vfsmount *copy_tree(struct vfsmount *mnt, struct dentry *dentry,
> - int flag)
> + int flag, uid_t owner)
>  {
>   struct vfsmount *res, *p, *q, *r, *s;
>   struct nameidata nd;
> @@ -1068,7 +1071,7 @@ struct vfsmount *copy_tree(struct vfsmou
>   if (!(flag & CL_COPY_ALL) && IS_MNT_UNBINDABLE(mnt))
>   return ERR_PTR(-EPERM);
> 
> - res = q = clone_mnt(mnt, dentry, flag);
> + res = q = clone_mnt(mnt, dentry, flag, owner);
>   if (IS_ERR(q))
>   goto error;
>   q->mnt_mountpoint = mnt->mnt_mountpoint;
> @@ -1090,7 +1093,7 @@ struct vfsmount *copy_tree(struct vfsmou
>   p = s;
>   nd.path.mnt = q;
>   nd.path.dentry = p->mnt_mountpoint;
> - q = clone_mnt(p, p->mnt_root, flag);
> + q = clone_mnt(p, p->mnt_root, flag, owner);
>   if (IS_ERR(q))
>   goto error;
>   spin_lock(&vfsmount_lock);
> @@ -1115,7 +1118,7 @@ struct vfsmount *collect_mounts(struct v
>  {
>   struct vfsmount *tree;
>   down_read(&namespace_sem);
> - tree = copy_tree(mnt, dentry, CL_COPY_ALL | CL_PRIVATE);
> + tree = copy_tree(mnt, dentry, CL_COPY_ALL | CL_PRIVATE, 0);
>   up_read(&namespace_sem);
>   return tree;
>  }
> @@ -1286,7 +1289,8 @@ static int do_change_type(struct nameida
>   */
>  static int do_loopback(struct nameidata *nd, char *old_name, int flags)
>  {
> - int clone_fl;
> + int clone_fl = 0;
> + uid_t owner = 0;
>   struct nameidata old_nd;
>   struct vfsmount *mnt = NULL;
>   int err;
> @@ -1307,11 +1311,17 @@ static int do_loopback(struct nameidata 
>   if (!check_mnt(nd->path.mnt) || !check_mnt(old_nd.path.mnt))
>   goto out;
> 
> - clone_fl = (flags & MS_SETUSER) ? CL_SETUSER : 0;
> + if (flags & MS_SETUSER) {
> + clone_fl |= CL_SETUSER;
> + owner = current->fsuid;
> + }
> +
>   if (flags & MS_REC)
> - mnt = copy_tree(old_nd.path.mnt, old_nd.path.dentry, clone_fl);
> + mnt = copy_tree(old_nd.path.mnt, old_nd.path.dentry, clone_fl,
> + owner);
>   else
> - mnt = clone_mnt(old_nd.path.mnt, old_nd.path.dentry, clone_fl);
> + mnt = clone_mnt(old_nd.path.mnt, old_nd.path.dentry, clone_fl,
> + owner);
> 
>   err = PTR_ERR(mnt);
>   if (IS_ERR(mnt))
> @@ -1535,7 +1545,7 @@ static int do_new_mount(struct nameidata
>   

Re: [patch 6/9] unprivileged mounts: allow unprivileged mounts

2008-01-14 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> Define a new fs flag FS_SAFE, which denotes that unprivileged mounting of
> this filesystem may not constitute a security problem.
> 
> Since most filesystems haven't been designed with unprivileged mounting in
> mind, a thorough audit is needed before setting this flag.
> 
> For "safe" filesystems also allow unprivileged forced unmounting.
> 
> Move subtype handling from do_kern_mount() into do_new_mount().  All
> other callers are kernel-internal and do not need subtype support.
> 
> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>

This patch itself doesn't assign FS_SAFE to any filesystems, so
presuming that there is such a thing as an fs safe for users to
mount, and/or users sign their systems away through a sysctl,
this patch in itself appears right.

Acked-by: Serge Hallyn <[EMAIL PROTECTED]>

> ---
> 
> Index: linux/fs/namespace.c
> ===
> --- linux.orig/fs/namespace.c 2008-01-03 21:20:11.0 +0100
> +++ linux/fs/namespace.c  2008-01-03 21:21:06.0 +0100
> @@ -960,14 +960,16 @@ static bool is_mount_owner(struct vfsmou
>  /*
>   * umount is permitted for
>   *  - sysadmin
> - *  - mount owner, if not forced umount
> + *  - mount owner
> + *    o if not forced umount,
> + *    o if forced umount, and filesystem is "safe"
>   */
>  static bool permit_umount(struct vfsmount *mnt, int flags)
>  {
>   if (capable(CAP_SYS_ADMIN))
>   return true;
> 
> - if (flags & MNT_FORCE)
> + if ((flags & MNT_FORCE) && !(mnt->mnt_sb->s_type->fs_flags & FS_SAFE))
>   return false;
> 
>   return is_mount_owner(mnt, current->fsuid);
> @@ -1025,13 +1027,17 @@ asmlinkage long sys_oldumount(char __use
>   * - mountpoint is not a symlink
>   * - mountpoint is in a mount owned by the user
>   */
> -static bool permit_mount(struct nameidata *nd, int *flags)
> +static bool permit_mount(struct nameidata *nd, struct file_system_type *type,
> +  int *flags)
>  {
>   struct inode *inode = nd->path.dentry->d_inode;
> 
>   if (capable(CAP_SYS_ADMIN))
>   return true;
> 
> + if (type && !(type->fs_flags & FS_SAFE))
> + return false;
> +
>   if (S_ISLNK(inode->i_mode))
>   return false;
> 
> @@ -1285,7 +1291,7 @@ static int do_loopback(struct nameidata 
>   struct vfsmount *mnt = NULL;
>   int err;
> 
> - if (!permit_mount(nd, &flags))
> + if (!permit_mount(nd, NULL, &flags))
>   return -EPERM;
>   if (!old_name || !*old_name)
>   return -EINVAL;
> @@ -1466,30 +1472,76 @@ out:
>   return err;
>  }
> 
> +static struct vfsmount *fs_set_subtype(struct vfsmount *mnt, const char *fstype)
> +{
> + int err;
> + const char *subtype = strchr(fstype, '.');
> + if (subtype) {
> + subtype++;
> + err = -EINVAL;
> + if (!subtype[0])
> + goto err;
> + } else
> + subtype = "";
> +
> + mnt->mnt_sb->s_subtype = kstrdup(subtype, GFP_KERNEL);
> + err = -ENOMEM;
> + if (!mnt->mnt_sb->s_subtype)
> + goto err;
> + return mnt;
> +
> + err:
> + mntput(mnt);
> + return ERR_PTR(err);
> +}
> +
>  /*
>   * create a new mount for userspace and request it to be added into the
>   * namespace's tree
>   */
> -static int do_new_mount(struct nameidata *nd, char *type, int flags,
> +static int do_new_mount(struct nameidata *nd, char *fstype, int flags,
>   int mnt_flags, char *name, void *data)
>  {
> + int err;
>   struct vfsmount *mnt;
> + struct file_system_type *type;
> 
> - if (!type || !memchr(type, 0, PAGE_SIZE))
> + if (!fstype || !memchr(fstype, 0, PAGE_SIZE))
>   return -EINVAL;
> 
> - /* we need capabilities... */
> - if (!capable(CAP_SYS_ADMIN))
> - return -EPERM;
> -
> - mnt = do_kern_mount(type, flags & ~MS_SETUSER, name, data);
> - if (IS_ERR(mnt))
> + type = get_fs_type(fstype);
> + if (!type)
> + return -ENODEV;
> +
> + err = -EPERM;
> + if (!permit_mount(nd, type, &flags))
> + goto out_put_filesystem;
> +
> + if (flags & MS_SETUSER) {
> + err = reserve_user_mount();
> + if (err)
> + goto ou

Re: [patch 5/9] unprivileged mounts: allow unprivileged bind mounts

2008-01-14 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> Allow bind mounts to unprivileged users if the following conditions are met:
> 
>   - mountpoint is not a symlink
>   - parent mount is owned by the user
>   - the number of user mounts is below the maximum
> 
> Unprivileged mounts imply MS_SETUSER, and will also have the "nosuid" and
> "nodev" mount flags set.
> 
> In particular, if mounting process doesn't have CAP_SETUID capability,
> then the "nosuid" flag will be added, and if it doesn't have CAP_MKNOD
> capability, then the "nodev" flag will be added.
> 
> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>

Acked-by: Serge Hallyn <[EMAIL PROTECTED]>

> ---
> 
> Index: linux/fs/namespace.c
> ===
> --- linux.orig/fs/namespace.c 2008-01-04 13:47:49.0 +0100
> +++ linux/fs/namespace.c  2008-01-04 13:48:01.0 +0100
> @@ -487,11 +487,34 @@ static void dec_nr_user_mounts(void)
>   spin_unlock(&vfsmount_lock);
>  }
> 
> -static void set_mnt_user(struct vfsmount *mnt)
> +static int reserve_user_mount(void)
> +{
> + int err = 0;
> +
> + spin_lock(&vfsmount_lock);
> + if (nr_user_mounts >= max_user_mounts && !capable(CAP_SYS_ADMIN))
> + err = -EPERM;
> + else
> + nr_user_mounts++;
> + spin_unlock(&vfsmount_lock);
> + return err;
> +}
> +
> +static void __set_mnt_user(struct vfsmount *mnt)
>  {
>   BUG_ON(mnt->mnt_flags & MNT_USER);
>   mnt->mnt_uid = current->fsuid;
>   mnt->mnt_flags |= MNT_USER;
> +
> + if (!capable(CAP_SETUID))
> + mnt->mnt_flags |= MNT_NOSUID;
> + if (!capable(CAP_MKNOD))
> + mnt->mnt_flags |= MNT_NODEV;
> +}
> +
> +static void set_mnt_user(struct vfsmount *mnt)
> +{
> + __set_mnt_user(mnt);
>   spin_lock(&vfsmount_lock);
>   nr_user_mounts++;
>   spin_unlock(&vfsmount_lock);
> @@ -510,10 +533,16 @@ static struct vfsmount *clone_mnt(struct
>   int flag)
>  {
>   struct super_block *sb = old->mnt_sb;
> - struct vfsmount *mnt = alloc_vfsmnt(old->mnt_devname);
> + struct vfsmount *mnt;
> 
> + if (flag & CL_SETUSER) {
> + int err = reserve_user_mount();
> + if (err)
> + return ERR_PTR(err);
> + }
> + mnt = alloc_vfsmnt(old->mnt_devname);
>   if (!mnt)
> - return ERR_PTR(-ENOMEM);
> + goto alloc_failed;
> 
>   mnt->mnt_flags = old->mnt_flags;
>   atomic_inc(&sb->s_active);
> @@ -525,7 +554,7 @@ static struct vfsmount *clone_mnt(struct
>   /* don't copy the MNT_USER flag */
>   mnt->mnt_flags &= ~MNT_USER;
>   if (flag & CL_SETUSER)
> - set_mnt_user(mnt);
> + __set_mnt_user(mnt);
> 
>   if (flag & CL_SLAVE) {
>   list_add(&mnt->mnt_slave, &old->mnt_slave_list);
> @@ -550,6 +579,11 @@ static struct vfsmount *clone_mnt(struct
>   spin_unlock(&vfsmount_lock);
>   }
>   return mnt;
> +
> + alloc_failed:
> + if (flag & CL_SETUSER)
> + dec_nr_user_mounts();
> + return ERR_PTR(-ENOMEM);
>  }
> 
>  static inline void __mntput(struct vfsmount *mnt)
> @@ -986,22 +1020,26 @@ asmlinkage long sys_oldumount(char __use
> 
>  #endif
> 
> -static int mount_is_safe(struct nameidata *nd)
> +/*
> + * Conditions for unprivileged mounts are:
> + * - mountpoint is not a symlink
> + * - mountpoint is in a mount owned by the user
> + */
> +static bool permit_mount(struct nameidata *nd, int *flags)
>  {
> + struct inode *inode = nd->path.dentry->d_inode;
> +
>   if (capable(CAP_SYS_ADMIN))
> - return 0;
> - return -EPERM;
> -#ifdef notyet
> - if (S_ISLNK(nd->path.dentry->d_inode->i_mode))
> - return -EPERM;
> - if (nd->path.dentry->d_inode->i_mode & S_ISVTX) {
> - if (current->uid != nd->path.dentry->d_inode->i_uid)
> - return -EPERM;
> - }
> - if (vfs_permission(nd, MAY_WRITE))
> - return -EPERM;
> - return 0;
> -#endif
> + return true;
> +
> + if (S_ISLNK(inode->i_mode))
> + return false;
> +
> + if (!is_mount_owner(nd->path.mnt, current->fsuid))
> + return false;
> +
> + *flags |= MS_S

Re: [patch 4/9] unprivileged mounts: propagate error values from clone_mnt

2008-01-14 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> Allow clone_mnt() to return errors other than ENOMEM.  This will be used for
> returning a different error value when the number of user mounts goes over the
> limit.
> 
> Fix copy_tree() to return EPERM for unbindable mounts.
> 
> Don't propagate further from dup_mnt_ns(), as the copy_tree() call there
> can only fail with -ENOMEM.

I see what you're saying, but it just seems like it's bound to be more
confusing this way.

What's the reason to insist on doing this?  To force people to think
about it as a form of documentation?

Still,

> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>

Acked-by: Serge Hallyn <[EMAIL PROTECTED]>

> ---
> 
> Index: linux/fs/namespace.c
> ===
> --- linux.orig/fs/namespace.c 2008-01-04 13:47:09.0 +0100
> +++ linux/fs/namespace.c  2008-01-04 13:47:49.0 +0100
> @@ -512,41 +512,42 @@ static struct vfsmount *clone_mnt(struct
>   struct super_block *sb = old->mnt_sb;
>   struct vfsmount *mnt = alloc_vfsmnt(old->mnt_devname);
> 
> - if (mnt) {
> - mnt->mnt_flags = old->mnt_flags;
> - atomic_inc(&sb->s_active);
> - mnt->mnt_sb = sb;
> - mnt->mnt_root = dget(root);
> - mnt->mnt_mountpoint = mnt->mnt_root;
> - mnt->mnt_parent = mnt;
> -
> - /* don't copy the MNT_USER flag */
> - mnt->mnt_flags &= ~MNT_USER;
> - if (flag & CL_SETUSER)
> - set_mnt_user(mnt);
> -
> - if (flag & CL_SLAVE) {
> - list_add(&mnt->mnt_slave, &old->mnt_slave_list);
> - mnt->mnt_master = old;
> - CLEAR_MNT_SHARED(mnt);
> - } else if (!(flag & CL_PRIVATE)) {
> - if ((flag & CL_PROPAGATION) || IS_MNT_SHARED(old))
> - list_add(&mnt->mnt_share, &old->mnt_share);
> - if (IS_MNT_SLAVE(old))
> - list_add(&mnt->mnt_slave, &old->mnt_slave);
> - mnt->mnt_master = old->mnt_master;
> - }
> - if (flag & CL_MAKE_SHARED)
> - set_mnt_shared(mnt);
> + if (!mnt)
> + return ERR_PTR(-ENOMEM);
> 
> - /* stick the duplicate mount on the same expiry list
> -  * as the original if that was on one */
> - if (flag & CL_EXPIRE) {
> - spin_lock(&vfsmount_lock);
> - if (!list_empty(&old->mnt_expire))
> - list_add(&mnt->mnt_expire, &old->mnt_expire);
> - spin_unlock(&vfsmount_lock);
> - }
> + mnt->mnt_flags = old->mnt_flags;
> + atomic_inc(&sb->s_active);
> + mnt->mnt_sb = sb;
> + mnt->mnt_root = dget(root);
> + mnt->mnt_mountpoint = mnt->mnt_root;
> + mnt->mnt_parent = mnt;
> +
> + /* don't copy the MNT_USER flag */
> + mnt->mnt_flags &= ~MNT_USER;
> + if (flag & CL_SETUSER)
> + set_mnt_user(mnt);
> +
> + if (flag & CL_SLAVE) {
> + list_add(&mnt->mnt_slave, &old->mnt_slave_list);
> + mnt->mnt_master = old;
> + CLEAR_MNT_SHARED(mnt);
> + } else if (!(flag & CL_PRIVATE)) {
> + if ((flag & CL_PROPAGATION) || IS_MNT_SHARED(old))
> + list_add(&mnt->mnt_share, &old->mnt_share);
> + if (IS_MNT_SLAVE(old))
> + list_add(&mnt->mnt_slave, &old->mnt_slave);
> + mnt->mnt_master = old->mnt_master;
> + }
> + if (flag & CL_MAKE_SHARED)
> + set_mnt_shared(mnt);
> +
> + /* stick the duplicate mount on the same expiry list
> +  * as the original if that was on one */
> + if (flag & CL_EXPIRE) {
> + spin_lock(&vfsmount_lock);
> + if (!list_empty(&old->mnt_expire))
> + list_add(&mnt->mnt_expire, &old->mnt_expire);
> + spin_unlock(&vfsmount_lock);
>   }
>   return mnt;
>  }
> @@ -1021,11 +1022,11 @@ struct vfsmount *copy_tree(struct vfsmou
>   struct nameidata nd;
> 
>   if (!(flag & CL_COPY_ALL) && IS_MNT_UNBINDABLE(mnt))
> - return NULL;

Re: [patch 3/9] unprivileged mounts: account user mounts

2008-01-14 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> Add sysctl variables for accounting and limiting the number of user
> mounts.
> 
> The maximum number of user mounts is set to 1024 by default.  This
> won't in itself enable user mounts; a mount must first be set to be
> owned by a user.
> 
> [akpm]
>  - don't use enumerated sysctls
> 
> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>

Seems sane enough, given your responses to Dave.

Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
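
(A note for readers of the archive: the accounting rule is easy to model
outside the kernel.  The sketch below is only an illustration, assuming a
pthread mutex in place of vfsmount_lock; struct mnt_model, MODEL_MNT_USER
and the model_* helpers are made-up names, not part of the patch.  It shows
why clear_mnt_user() tests the flag first: clearing a mount that is not a
user mount must leave the counter alone.)

```c
#include <assert.h>
#include <pthread.h>

#define MODEL_MNT_USER 0x1u	/* made-up stand-in for MNT_USER */

struct mnt_model {		/* made-up stand-in for struct vfsmount */
	unsigned int flags;
	unsigned int uid;
};

/* Stands in for vfsmount_lock; protects nr_user_mounts_model. */
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static int nr_user_mounts_model;

/* Mark a mount as owned by fsuid and account for it. */
static void model_set_user(struct mnt_model *m, unsigned int fsuid)
{
	assert(!(m->flags & MODEL_MNT_USER));	/* mirrors the BUG_ON() */
	m->uid = fsuid;
	m->flags |= MODEL_MNT_USER;
	pthread_mutex_lock(&count_lock);
	nr_user_mounts_model++;
	pthread_mutex_unlock(&count_lock);
}

/* Drop ownership; a no-op for mounts that are not user mounts. */
static void model_clear_user(struct mnt_model *m)
{
	if (!(m->flags & MODEL_MNT_USER))
		return;
	m->uid = 0;
	m->flags &= ~MODEL_MNT_USER;
	pthread_mutex_lock(&count_lock);
	nr_user_mounts_model--;
	pthread_mutex_unlock(&count_lock);
}

/* Read the counter under the lock. */
static int model_count(void)
{
	int n;

	pthread_mutex_lock(&count_lock);
	n = nr_user_mounts_model;
	pthread_mutex_unlock(&count_lock);
	return n;
}
```

Because the second model_clear_user() call on the same mount does nothing,
the real clear_mnt_user() can be called unconditionally from __mntput() and
do_remount(), as the patch does.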

> ---
> 
> Index: linux/Documentation/filesystems/proc.txt
> ===
> --- linux.orig/Documentation/filesystems/proc.txt 2008-01-03 17:12:58.0 +0100
> +++ linux/Documentation/filesystems/proc.txt  2008-01-03 21:15:35.0 +0100
> @@ -1012,6 +1012,15 @@ reaches aio-max-nr then io_setup will fa
>  raising aio-max-nr does not result in the pre-allocation or re-sizing
>  of any kernel data structures.
> 
> +nr_user_mounts and max_user_mounts
> +--
> +
> +These represent the number of "user" mounts and the maximum number of
> +"user" mounts respectively.  User mounts may be created by
> +unprivileged users.  User mounts may also be created with sysadmin
> +privileges on behalf of a user, in which case nr_user_mounts may
> +exceed max_user_mounts.
> +
>  2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
>  ---
> 
> Index: linux/fs/namespace.c
> ===
> --- linux.orig/fs/namespace.c 2008-01-03 21:14:16.0 +0100
> +++ linux/fs/namespace.c  2008-01-03 21:15:35.0 +0100
> @@ -44,6 +44,9 @@ static struct list_head *mount_hashtable
>  static struct kmem_cache *mnt_cache __read_mostly;
>  static struct rw_semaphore namespace_sem;
> 
> +int nr_user_mounts;
> +int max_user_mounts = 1024;
> +
>  /* /sys/fs */
>  struct kobject *fs_kobj;
>  EXPORT_SYMBOL_GPL(fs_kobj);
> @@ -477,11 +480,30 @@ static struct vfsmount *skip_mnt_tree(st
>   return p;
>  }
> 
> +static void dec_nr_user_mounts(void)
> +{
> + spin_lock(&vfsmount_lock);
> + nr_user_mounts--;
> + spin_unlock(&vfsmount_lock);
> +}
> +
>  static void set_mnt_user(struct vfsmount *mnt)
>  {
>   BUG_ON(mnt->mnt_flags & MNT_USER);
>   mnt->mnt_uid = current->fsuid;
>   mnt->mnt_flags |= MNT_USER;
> + spin_lock(&vfsmount_lock);
> + nr_user_mounts++;
> + spin_unlock(&vfsmount_lock);
> +}
> +
> +static void clear_mnt_user(struct vfsmount *mnt)
> +{
> + if (mnt->mnt_flags & MNT_USER) {
> + mnt->mnt_uid = 0;
> + mnt->mnt_flags &= ~MNT_USER;
> + dec_nr_user_mounts();
> + }
>  }
> 
>  static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
> @@ -542,6 +564,7 @@ static inline void __mntput(struct vfsmo
>*/
>   WARN_ON(atomic_read(&mnt->__mnt_writers));
>   dput(mnt->mnt_root);
> + clear_mnt_user(mnt);
>   free_vfsmnt(mnt);
>   deactivate_super(sb);
>  }
> @@ -1306,6 +1329,7 @@ static int do_remount(struct nameidata *
>   else
>   err = do_remount_sb(sb, flags, data, 0);
>   if (!err) {
> + clear_mnt_user(nd->path.mnt);
>   nd->path.mnt->mnt_flags = mnt_flags;
>   if (flags & MS_SETUSER)
>   set_mnt_user(nd->path.mnt);
> Index: linux/include/linux/fs.h
> ===
> --- linux.orig/include/linux/fs.h 2008-01-03 20:52:38.0 +0100
> +++ linux/include/linux/fs.h  2008-01-03 21:15:35.0 +0100
> @@ -50,6 +50,9 @@ extern struct inodes_stat_t inodes_stat;
> 
>  extern int leases_enable, lease_break_time;
> 
> +extern int nr_user_mounts;
> +extern int max_user_mounts;
> +
>  #ifdef CONFIG_DNOTIFY
>  extern int dir_notify_enable;
>  #endif
> Index: linux/kernel/sysctl.c
> ===
> --- linux.orig/kernel/sysctl.c2008-01-03 17:13:22.0 +0100
> +++ linux/kernel/sysctl.c 2008-01-03 21:15:35.0 +0100
> @@ -1288,6 +1288,22 @@ static struct ctl_table fs_table[] = {
>  #endif   
>  #endif
>   {
> + .ctl_name   = CTL_UNNUMBERED,
> + .procname   = "nr_user_mounts",
> + .data   = &nr_use

Re: [patch 2/9] unprivileged mounts: allow unprivileged umount

2008-01-14 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> The owner doesn't need sysadmin capabilities to call umount().
> 
> Similar behavior as umount(8) on mounts having "user=UID" option in /etc/mtab.
> The difference is that umount also checks /etc/fstab, presumably to exclude
> another mount on the same mountpoint.
> 
> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>

Acked-by: Serge Hallyn <[EMAIL PROTECTED]>

> ---
> 
> Index: linux/fs/namespace.c
> ===
> --- linux.orig/fs/namespace.c 2008-01-03 20:52:38.0 +0100
> +++ linux/fs/namespace.c  2008-01-03 21:14:16.0 +0100
> @@ -894,6 +894,27 @@ static int do_umount(struct vfsmount *mn
>   return retval;
>  }
> 
> +static bool is_mount_owner(struct vfsmount *mnt, uid_t uid)
> +{
> + return (mnt->mnt_flags & MNT_USER) && mnt->mnt_uid == uid;
> +}
> +
> +/*
> + * umount is permitted for
> + *  - sysadmin
> + *  - mount owner, if not forced umount
> + */
> +static bool permit_umount(struct vfsmount *mnt, int flags)
> +{
> + if (capable(CAP_SYS_ADMIN))
> + return true;
> +
> + if (flags & MNT_FORCE)
> + return false;
> +
> + return is_mount_owner(mnt, current->fsuid);
> +}
> +
>  /*
>   * Now umount can handle mount points as well as block devices.
>   * This is important for filesystems which use unnamed block devices.
> @@ -917,7 +938,7 @@ asmlinkage long sys_umount(char __user *
>   goto dput_and_out;
> 
>   retval = -EPERM;
> - if (!capable(CAP_SYS_ADMIN))
> + if (!permit_umount(nd.path.mnt, flags))
>   goto dput_and_out;
> 
>   retval = do_umount(nd.path.mnt, flags);
> 
> --
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
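
(Archive note: the decision made by permit_umount() in the patch above can
be written out as a small truth table.  This is a userspace sketch under
assumptions, not kernel code: the capability check is replaced by a plain
boolean, and struct owned_mnt, struct umount_req and the MODEL_* constants
are hypothetical names with arbitrary values.)

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MNT_USER	0x1u	/* stand-in for MNT_USER */
#define MODEL_FORCE	0x1	/* stand-in for the MNT_FORCE umount flag */

struct owned_mnt {		/* hypothetical stand-in for struct vfsmount */
	unsigned int flags;
	unsigned int uid;
};

struct umount_req {		/* hypothetical description of the caller */
	bool has_sys_admin;	/* result of capable(CAP_SYS_ADMIN) */
	unsigned int fsuid;
	int flags;
};

static bool model_is_mount_owner(const struct owned_mnt *m, unsigned int uid)
{
	return (m->flags & MODEL_MNT_USER) && m->uid == uid;
}

/* sysadmin: always; owner: only if the umount is not forced. */
static bool model_permit_umount(const struct owned_mnt *m,
				const struct umount_req *req)
{
	if (req->has_sys_admin)
		return true;
	if (req->flags & MODEL_FORCE)
		return false;
	return model_is_mount_owner(m, req->fsuid);
}

static void model_umount_examples(void)
{
	const struct owned_mnt user_mnt = { MODEL_MNT_USER, 1000 };
	const struct owned_mnt legacy_mnt = { 0, 0 };
	const struct umount_req owner = { false, 1000, 0 };
	const struct umount_req owner_forced = { false, 1000, MODEL_FORCE };
	const struct umount_req root_no_cap = { false, 0, 0 };
	const struct umount_req admin_forced = { true, 0, MODEL_FORCE };

	assert(model_permit_umount(&user_mnt, &owner));
	assert(!model_permit_umount(&user_mnt, &owner_forced));
	/* fsuid 0 without the capability does not own a legacy mount */
	assert(!model_permit_umount(&legacy_mnt, &root_no_cap));
	assert(model_permit_umount(&legacy_mnt, &admin_forced));
}
```

The third case is the one that matters for legacy mounts: without MNT_USER
set, a matching uid of 0 is not enough.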


Re: [patch 1/9] unprivileged mounts: add user mounts to the kernel

2008-01-14 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> This patchset adds support for keeping mount ownership information in the
> kernel, and allows unprivileged mount(2) and umount(2) in certain cases.
> 
> The mount owner has the following privileges:
> 
>   - unmount the owned mount
>   - create a submount under the owned mount
> 
> The sysadmin can set the owner explicitly on mount and remount.  When an
> unprivileged user creates a mount, then the owner is automatically set to the
> user.
> 
> The following use cases are envisioned:
> 
> 1) Private namespace, with selected mounts owned by user.  E.g.
>/home/$USER is a good candidate for allowing unpriv mounts and unmounts
>within.
> 
> 2) Private namespace, with all mounts owned by user and having the "nosuid"
>flag.  User can mount and umount anywhere within the namespace, but suid
>programs will not work.
> 
> 3) Global namespace, with a designated directory, which is a mount owned by
>the user.  E.g.  /mnt/users/$USER is set up so that it is bind mounted onto
>itself, and set to be owned by $USER.  The user can add/remove mounts only
>under this directory.
> 
> The following extra security measures are taken for unprivileged mounts:
> 
>  - usermounts are limited by a sysctl tunable
>  - force "nosuid,nodev" mount options on the created mount
> 
> For testing unprivileged mounts (and for other purposes) simple
> mount/umount utilities are available from:
> 
>   http://www.kernel.org/pub/linux/kernel/people/mszeredi/mmount/
> 
> After this series I'll be posting a preliminary patch for util-linux-ng,
> to add the same functionality to mount(8) and umount(8).
> 
> This patch:
> 
> A new mount flag, MS_SETUSER is used to make a mount owned by a user.  If this
> flag is specified, then the owner will be set to the current fsuid and the
> mount will be marked with the MNT_USER flag.  On remount don't preserve
> previous owner, and treat MS_SETUSER as for a new mount.  The MS_SETUSER flag
> is ignored on mount move.
> 
> The MNT_USER flag is not copied on any kind of mount cloning: namespace
> creation, binding or propagation.  For bind mounts the cloned mount(s) are set
> to MNT_USER depending on the MS_SETUSER mount flag.  In all the other cases
> MNT_USER is always cleared.
> 
> For MNT_USER mounts a "user=UID" option is added to /proc/PID/mounts.  This is
> compatible with how mount ownership is stored in /etc/mtab.
> 
> The rationale for using MS_SETUSER and MNT_USER to distinguish "user"
> mounts from "non-user" or "legacy" mounts is as follows:
> 
>   a) Mount(2) and umount(2) on legacy mounts always need CAP_SYS_ADMIN
>  capability, as opposed to user mounts, which only require that
>  the mount owner matches the current fsuid.  So a process
>  with fsuid=0 should not be able to mount/umount legacy mounts
>  without the CAP_SYS_ADMIN capability.
> 
>   b) Legacy userspace programs may set fsuid to nonzero before calling
>  mount(2).  In such an unlikely case, this patchset would cause
>  an unintended side effect of making the mount owned by the fsuid.
> 
>   c) For legacy mounts, no "user=UID" option should be shown in
>  /proc/mounts for backwards compatibility.
> 
> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>

This looks good to me.

Acked-by: Serge Hallyn <[EMAIL PROTECTED]>

thanks,
-serge
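
(Archive note: the rule that MNT_USER is never inherited by a clone, only
re-established on request, fits in a few lines.  A minimal sketch, assuming
made-up flag values; MODEL_MNT_*, MODEL_CL_SETUSER and model_clone_flags()
are illustrative names, and the real clone_mnt() does much more than this.)

```c
#include <assert.h>

#define MODEL_MNT_NOSUID	0x01u	/* arbitrary illustrative values */
#define MODEL_MNT_USER		0x10u
#define MODEL_CL_SETUSER	0x01u

/*
 * Compute the flags of a cloned mount: copy everything from the
 * original, drop MNT_USER, then set it again only if the caller asked.
 */
static unsigned int model_clone_flags(unsigned int old_mnt_flags,
				      unsigned int clone_flag)
{
	unsigned int flags = old_mnt_flags;

	flags &= ~MODEL_MNT_USER;	/* don't copy the MNT_USER flag */
	if (clone_flag & MODEL_CL_SETUSER)
		flags |= MODEL_MNT_USER;
	return flags;
}

static void model_clone_examples(void)
{
	/* namespace creation or propagation: ownership is not inherited */
	assert(model_clone_flags(MODEL_MNT_USER | MODEL_MNT_NOSUID, 0)
	       == MODEL_MNT_NOSUID);
	/* bind mount with MS_SETUSER: the clone gets a fresh owner */
	assert(model_clone_flags(MODEL_MNT_NOSUID, MODEL_CL_SETUSER)
	       == (MODEL_MNT_NOSUID | MODEL_MNT_USER));
}
```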

> ---
> 
> Index: linux/fs/namespace.c
> ===
> --- linux.orig/fs/namespace.c 2008-01-03 22:10:10.0 +0100
> +++ linux/fs/namespace.c  2008-01-04 13:46:33.0 +0100
> @@ -477,6 +477,13 @@ static struct vfsmount *skip_mnt_tree(st
>   return p;
>  }
> 
> +static void set_mnt_user(struct vfsmount *mnt)
> +{
> + BUG_ON(mnt->mnt_flags & MNT_USER);
> + mnt->mnt_uid = current->fsuid;
> + mnt->mnt_flags |= MNT_USER;
> +}
> +
>  static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
>   int flag)
>  {
> @@ -491,6 +498,11 @@ static struct vfsmount *clone_mnt(struct
>   mnt->mnt_mountpoint = mnt->mnt_root;
>   mnt->mnt_parent = mnt;
> 
> + /* don't copy the MNT_USER flag */
> + mnt->mnt_flags &= ~MNT_USER;
> + if (flag & CL_SETUSER)
> + set_mnt_user(mnt);
> +
>   if (flag & CL_SLAVE) {
>   list_add(&mnt->mnt_slave, &old->mnt_slave_list

Re: [patch 5/9] unprivileged mounts: allow unprivileged bind mounts

2008-01-09 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> Allow bind mounts to unprivileged users if the following conditions are met:
> 
>   - mountpoint is not a symlink
>   - parent mount is owned by the user
>   - the number of user mounts is below the maximum
> 
> Unprivileged mounts imply MS_SETUSER, and will also have the "nosuid" and
> "nodev" mount flags set.
> 
> In particular, if mounting process doesn't have CAP_SETUID capability,
> then the "nosuid" flag will be added, and if it doesn't have CAP_MKNOD
> capability, then the "nodev" flag will be added.

That little part by itself is really needed in order to make the ability
to remove CAP_MKNOD from a process tree's bounding set meaningful.
Else instead of creating /dev/hda1, the user can just mount a filesystem
with hda1 existing on it.  (Which is why I was surprised when one day
I found this code missing :)

But of course I'm a fan of the patchset altogether.  I plan to review in
more detail early next week, but since I liked the previous submission I
don't see myself having any complaints, so I'm glad to see the reviews
by others.

thanks,
-serge
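
(Archive note: a userspace sketch of the two checks discussed above, under
stated assumptions.  Capabilities are modelled as booleans in a made-up
struct caps_model, the counter is passed in by pointer instead of living
under vfsmount_lock, and the MODEL_* flag values are arbitrary.  It is meant
to show the shape of reserve_user_mount() and __set_mnt_user(), not to
replace reading them.)

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_MNT_NOSUID	0x01u	/* arbitrary illustrative values */
#define MODEL_MNT_NODEV		0x02u
#define MODEL_MNT_USER		0x10u

struct caps_model {		/* hypothetical: results of capable() */
	bool sys_admin;
	bool setuid;
	bool mknod;
};

/* Take one slot, refusing unprivileged callers once the limit is hit. */
static int model_reserve_user_mount(int *nr, int max,
				    const struct caps_model *caps)
{
	if (*nr >= max && !caps->sys_admin)
		return -EPERM;
	(*nr)++;
	return 0;
}

/* A user mount is nosuid/nodev unless the caller holds the capability. */
static unsigned int model_user_mount_flags(unsigned int flags,
					   const struct caps_model *caps)
{
	flags |= MODEL_MNT_USER;
	if (!caps->setuid)
		flags |= MODEL_MNT_NOSUID;
	if (!caps->mknod)
		flags |= MODEL_MNT_NODEV;
	return flags;
}

static void model_bind_examples(void)
{
	const struct caps_model plain = { false, false, false };
	const struct caps_model admin = { true, true, true };
	int nr = 1;

	/* limit reached: an unprivileged caller is refused ... */
	assert(model_reserve_user_mount(&nr, 1, &plain) == -EPERM);
	assert(nr == 1);
	/* ... but a sysadmin may push the count past the maximum */
	assert(model_reserve_user_mount(&nr, 1, &admin) == 0);
	assert(nr == 2);

	assert(model_user_mount_flags(0, &plain) ==
	       (MODEL_MNT_USER | MODEL_MNT_NOSUID | MODEL_MNT_NODEV));
	assert(model_user_mount_flags(0, &admin) == MODEL_MNT_USER);
}
```

The nodev half is the part that makes dropping CAP_MKNOD from a bounding
set meaningful: without it, device nodes could simply arrive on a mounted
filesystem.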

> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
> ---
> 
> Index: linux/fs/namespace.c
> ===
> --- linux.orig/fs/namespace.c 2008-01-04 13:47:49.0 +0100
> +++ linux/fs/namespace.c  2008-01-04 13:48:01.0 +0100
> @@ -487,11 +487,34 @@ static void dec_nr_user_mounts(void)
>   spin_unlock(&vfsmount_lock);
>  }
> 
> -static void set_mnt_user(struct vfsmount *mnt)
> +static int reserve_user_mount(void)
> +{
> + int err = 0;
> +
> + spin_lock(&vfsmount_lock);
> + if (nr_user_mounts >= max_user_mounts && !capable(CAP_SYS_ADMIN))
> + err = -EPERM;
> + else
> + nr_user_mounts++;
> + spin_unlock(&vfsmount_lock);
> + return err;
> +}
> +
> +static void __set_mnt_user(struct vfsmount *mnt)
>  {
>   BUG_ON(mnt->mnt_flags & MNT_USER);
>   mnt->mnt_uid = current->fsuid;
>   mnt->mnt_flags |= MNT_USER;
> +
> + if (!capable(CAP_SETUID))
> + mnt->mnt_flags |= MNT_NOSUID;
> + if (!capable(CAP_MKNOD))
> + mnt->mnt_flags |= MNT_NODEV;
> +}
> +
> +static void set_mnt_user(struct vfsmount *mnt)
> +{
> + __set_mnt_user(mnt);
>   spin_lock(&vfsmount_lock);
>   nr_user_mounts++;
>   spin_unlock(&vfsmount_lock);
> @@ -510,10 +533,16 @@ static struct vfsmount *clone_mnt(struct
>   int flag)
>  {
>   struct super_block *sb = old->mnt_sb;
> - struct vfsmount *mnt = alloc_vfsmnt(old->mnt_devname);
> + struct vfsmount *mnt;
> 
> + if (flag & CL_SETUSER) {
> + int err = reserve_user_mount();
> + if (err)
> + return ERR_PTR(err);
> + }
> + mnt = alloc_vfsmnt(old->mnt_devname);
>   if (!mnt)
> - return ERR_PTR(-ENOMEM);
> + goto alloc_failed;
> 
>   mnt->mnt_flags = old->mnt_flags;
>   atomic_inc(&sb->s_active);
> @@ -525,7 +554,7 @@ static struct vfsmount *clone_mnt(struct
>   /* don't copy the MNT_USER flag */
>   mnt->mnt_flags &= ~MNT_USER;
>   if (flag & CL_SETUSER)
> - set_mnt_user(mnt);
> + __set_mnt_user(mnt);
> 
>   if (flag & CL_SLAVE) {
>   list_add(&mnt->mnt_slave, &old->mnt_slave_list);
> @@ -550,6 +579,11 @@ static struct vfsmount *clone_mnt(struct
>   spin_unlock(&vfsmount_lock);
>   }
>   return mnt;
> +
> + alloc_failed:
> + if (flag & CL_SETUSER)
> + dec_nr_user_mounts();
> + return ERR_PTR(-ENOMEM);
>  }
> 
>  static inline void __mntput(struct vfsmount *mnt)
> @@ -986,22 +1020,26 @@ asmlinkage long sys_oldumount(char __use
> 
>  #endif
> 
> -static int mount_is_safe(struct nameidata *nd)
> +/*
> + * Conditions for unprivileged mounts are:
> + * - mountpoint is not a symlink
> + * - mountpoint is in a mount owned by the user
> + */
> +static bool permit_mount(struct nameidata *nd, int *flags)
>  {
> + struct inode *inode = nd->path.dentry->d_inode;
> +
>   if (capable(CAP_SYS_ADMIN))
> - return 0;
> - return -EPERM;
> -#ifdef notyet
> - if (S_ISLNK(nd->path.dentry->d_inode->i_mode))
> - return -EPERM;
> - if (nd->path.dentry->d_inode->i_mode & S_ISVTX) {
> - if (current->uid != nd->path.dentry->d_inode->i_uid)
> - return -EPERM;
> - }
> - if (vfs_permission(nd, MAY_WRITE))
> - return -EPERM;
> - return 0;
> -#endif
> + return true;
> +
> + if (S_ISLNK(inode->i_mode))
> + return false;
> +
> + if (!is_mount_owner(nd->path.mnt, current->fsuid))
> + return false;
> +
> + *flags |= MS_SETUSER;
> + return true;
>  }
> 
>  static int lives_below_in_same_fs(struct dentry *d, struct dentry *dentry)
> @@ -1245,9 +1283

Re: [PATCH][RFC] Simple tamper-proof device filesystem.

2008-01-09 Thread Serge E. Hallyn
Quoting Indan Zupancic ([EMAIL PROTECTED]):
> Hello,
> 
> On Wed, January 9, 2008 05:39, Tetsuo Handa wrote:
> > Hello.
> >
> > Indan Zupancic wrote:
> >> I think you focus too much on your way of enforcing filename/attributes
> >> pairs.
> > So?
> 
> So that you miss alternatives and don't see the bigger picture.

These emails again are getting really long, but I think the gist of
Indan's suggestion can be concisely summarized:

"To confine process P3 to /dev/hda2 being 'b 3 2', create
/dev/p3, launch P3 in a new mounts namespace, mount --bind
/dev/p3 /dev, exec what you want p3 running, and have
MAC prevent umount /dev/p3."

This is a neat idea, but Tetsuo's rebuttal is

"P3 may be legacy code needing to create or delete
/dev/floppy, where -EPERM confuses P3 and prevents
it working correctly."

Indan's idea is interesting and I like it, but is there an answer to
Tetsuo's problem with it?

thanks,
-serge

PS - Indan, you also said in essence "if P3 can be trusted to create
/dev/floppy why can't it be trusted to create /dev/hda1".  I trust that,
phrased that way, the question answers itself?


Re: [PATCH][RFC] Simple tamper-proof device filesystem.

2007-12-31 Thread Serge E. Hallyn
Quoting Tetsuo Handa ([EMAIL PROTECTED]):
> Hello.
> 
> Thank you for attending discussion for previous posting
> (starting from http://lkml.org/lkml/2007/12/16/23 ).
> 
> The previous posting was for feasibility test to know
> whether this kind of trivial filesystem is acceptable for mainline.
> 
> Now, it seems that there is a little chance for accepting.
> Therefore I rebased the patch using the -mm tree.
> 
> Regards.
> --
> Subject: Simple tamper-proof device filesystem.
> 
> The goal of this filesystem is to guarantee that
> "applications using well-known device locations under /dev
> get the device they want" (e.g. an application that accesses /dev/null can
> always get a character special device with major=1 and minor=3).
> 
> This idea sounds silly? Indeed, if you think the root can do whatever
> he/she wants to do. But this filesystem makes sense when used with
> access control mechanisms like MAC (mandatory access control).
> I want to use this filesystem in case where a process with root privilege was
> hijacked but the behavior of the hijacked process is still restricted by MAC.
> 
> Why not use FUSE?
> 
>   Because /dev has to be available through the lifetime of the kernel.
>   It is not acceptable if /dev stops working due to SIGKILL or OOM-killer.
> 
> Why not use SELinux?
> 
>   Because SELinux doesn't guarantee filename and its attribute.
>   As far as I know, no MAC implementation can handle filename and its
>   attribute.
>   I guess this is because
> 
> Filename and its attributes pairs are conventionally considered as
> constant and reliable.
> 
> It makes the MAC's policy syntax complicated to describe this attribute
> enforcement information in MAC's policy.
> 
>   I want to add functionality that the MACs are missing.
>   Instead of adding this functionality per MAC,
>   I propose to add it as ground work, to be combined with any MAC.
> 
> Why not drop CAP_MKNOD?
> 
>   Dropping CAP_MKNOD is not enough for emulating this filesystem because
>   a process can still rename()/unlink() to break filename and its attributes
>   handling (e.g. mv /dev/sda1 /dev/sda1.tmp; mv /dev/sda2 /dev/sda1;
>   mv /dev/sda1.tmp /dev/sda2 or unlink /dev/null; touch /dev/null ).
> 
> This time, I'm implementing this filesystem as an extension to tmpfs
> because what this filesystem does are nothing but check filename and
> its attributes in addition to what tmpfs does.
> 
> Signed-off-by: Tetsuo Handa <[EMAIL PROTECTED]>
> ---
>  fs/ramfs/inode.c   |  101 -
>  fs/ramfs/syaoran.h | 1066 
> +
>  2 files changed, 1160 insertions(+), 7 deletions(-)
> 
> --- linux-2.6-mm.orig/fs/ramfs/inode.c
> +++ linux-2.6-mm/fs/ramfs/inode.c
> @@ -35,6 +35,7 @@
>  #include 
>  #include 
>  #include "internal.h"
> +#include "syaoran.h"
> 
>  /* some random number */
>  #define RAMFS_MAGIC  0x858458f6
> @@ -49,7 +50,8 @@ static struct backing_dev_info ramfs_bac
> BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | 
> BDI_CAP_EXEC_MAP,
>  };
> 
> -struct inode *ramfs_get_inode(struct super_block *sb, int mode, dev_t dev)
> +struct inode *__ramfs_get_inode(struct super_block *sb, int mode, dev_t dev,
> + const int mac)
>  {
>   struct inode * inode = new_inode(sb);
> 
> @@ -65,10 +67,19 @@ struct inode *ramfs_get_inode(struct sup
>   switch (mode & S_IFMT) {
>   default:
>   init_special_inode(inode, mode, dev);
> + if (mac) {
> + if (S_ISBLK(mode))
> + inode->i_fop = &wrapped_def_blk_fops;
> + else if (S_ISCHR(mode))
> + inode->i_fop = &wrapped_def_chr_fops;
> + inode->i_op = &syaoran_file_inode_operations;
> + }
>   break;
>   case S_IFREG:
>   inode->i_op = &ramfs_file_inode_operations;
>   inode->i_fop = &ramfs_file_operations;
> + if (mac)
> + inode->i_op = &syaoran_file_inode_operations;
>   break;
>   case S_IFDIR:
>   inode->i_op = &ramfs_dir_inode_operations;
> @@ -79,12 +90,19 @@ struct inode *ramfs_get_inode(struct sup
>   break;
>   case S_IFLNK:
>   inode->i_op = &page_symlink_inode_operations;
> + if (mac)
> + inode->i_op = &syaoran_symlink_inode_operations;
>   break;
>   }
>   }
>   return inode;
>  }
> 
> +struct inode *ramfs_get_inode(struct super_block *sb, int mode, dev_t dev)
> +{
> + return __ramfs_get_inode(sb, mode, dev, 0);
> +}
> +
>  /*
>   * File creation. Allocate an inode, and we're done..
>   */
> @@ -92,9 +110,17 @@ struct inode 

Re: [PATCH] Pid namespaces vs locks interaction

2007-12-21 Thread Serge E. Hallyn
Quoting Vitaliy Gusev ([EMAIL PROTECTED]):
> fcntl(F_GETLK, ...) can return a pid that is not valid in the caller's pid
> namespace (if the lock owner belongs to several namespaces). The same is
> true of the pids in /proc/locks. So the correct behavior is to save a
> pointer to the struct pid of the lock owner.
> 
> Assigned-off-by: Vitaliy Gusev <[EMAIL PROTECTED]>
> Acked-by: Serge Hallyn <[EMAIL PROTECTED]>

Thanks, Vitaliy.

-serge
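
(Archive note: for readers who have not used F_GETLK, this is the userspace
interface whose l_pid field the patch is about.  A minimal sketch, assuming
a writable /tmp and a single pid namespace; getlk_owner_seen_by_child() is a
made-up name.  The parent takes a write lock, and a forked child asks who
holds it.  Inside one namespace the answer is simply the parent's pid; the
patch is about keeping that answer meaningful when the asker lives in a
different namespace.)

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 0 if the child saw the parent's pid as the lock owner,
 * 1 if it saw something else, 2 if F_GETLK failed, -1 on setup errors.
 */
static int getlk_owner_seen_by_child(void)
{
	char path[] = "/tmp/getlk-demo-XXXXXX";
	/* l_start and l_len are zero: lock the whole file */
	struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
	int status;
	pid_t child;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);		/* the open descriptor keeps the file alive */

	if (fcntl(fd, F_SETLK, &fl) != 0) {
		close(fd);
		return -1;
	}

	child = fork();
	if (child < 0) {
		close(fd);
		return -1;
	}
	if (child == 0) {
		/* POSIX locks are not inherited, so this would conflict */
		struct flock query = { .l_type = F_WRLCK,
				       .l_whence = SEEK_SET };

		if (fcntl(fd, F_GETLK, &query) != 0)
			_exit(2);
		_exit(query.l_type == F_WRLCK &&
		      query.l_pid == getppid() ? 0 : 1);
	}

	if (waitpid(child, &status, 0) != child) {
		close(fd);
		return -1;
	}
	close(fd);
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

static void getlk_demo(void)
{
	assert(getlk_owner_seen_by_child() == 0);
}
```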

> fs/locks.c |   26 +-
>  include/linux/fs.h |3 ++-
>  2 files changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index 8b8388e..14989fa 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -125,6 +125,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include 
>  #include 
> @@ -185,6 +186,7 @@ void locks_init_lock(struct file_lock *fl)
>   fl->fl_fasync = NULL;
>   fl->fl_owner = NULL;
>   fl->fl_pid = 0;
> + fl->fl_nspid = NULL;
>   fl->fl_file = NULL;
>   fl->fl_flags = 0;
>   fl->fl_type = 0;
> @@ -553,6 +555,8 @@ static void locks_insert_lock(struct file_lock **pos, struct file_lock *fl)
>  {
>   list_add(&fl->fl_link, &file_lock_list);
> 
> + fl->fl_nspid = get_pid(task_tgid(current));
> +
>   /* insert into file's list */
>   fl->fl_next = *pos;
>   *pos = fl;
> @@ -584,6 +588,9 @@ static void locks_delete_lock(struct file_lock **thisfl_p)
>   if (fl->fl_ops && fl->fl_ops->fl_remove)
>   fl->fl_ops->fl_remove(fl);
> 
> + put_pid(fl->fl_nspid);
> + fl->fl_nspid = NULL;
> +
>   locks_wake_up_blocks(fl);
>   locks_free_lock(fl);
>  }
> @@ -673,9 +680,12 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
>   if (posix_locks_conflict(fl, cfl))
>   break;
>   }
> - if (cfl)
> + if (cfl) {
>   __locks_copy_lock(fl, cfl);
> - else
> + if (cfl->fl_nspid)
> + fl->fl_pid = pid_nr_ns(cfl->fl_nspid, 
> + task_active_pid_ns(current));
> + } else
>   fl->fl_type = F_UNLCK;
>   unlock_kernel();
>   return;
> @@ -2084,6 +2094,12 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
>   int id, char *pfx)
>  {
>   struct inode *inode = NULL;
> + unsigned int fl_pid;
> +
> + if (fl->fl_nspid)
> + fl_pid = pid_nr_ns(fl->fl_nspid, task_active_pid_ns(current));
> + else
> + fl_pid = fl->fl_pid;
> 
>   if (fl->fl_file != NULL)
>   inode = fl->fl_file->f_path.dentry->d_inode;
> @@ -2124,16 +2140,16 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
>   }
>   if (inode) {
>  #ifdef WE_CAN_BREAK_LSLK_NOW
> - seq_printf(f, "%d %s:%ld ", fl->fl_pid,
> + seq_printf(f, "%d %s:%ld ", fl_pid,
>   inode->i_sb->s_id, inode->i_ino);
>  #else
>   /* userspace relies on this representation of dev_t ;-( */
> - seq_printf(f, "%d %02x:%02x:%ld ", fl->fl_pid,
> + seq_printf(f, "%d %02x:%02x:%ld ", fl_pid,
>   MAJOR(inode->i_sb->s_dev),
>   MINOR(inode->i_sb->s_dev), inode->i_ino);
>  #endif
>   } else {
> - seq_printf(f, "%d :0 ", fl->fl_pid);
> + seq_printf(f, "%d :0 ", fl_pid);
>   }
>   if (IS_POSIX(fl)) {
>   if (fl->fl_end == OFFSET_MAX)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b3ec4a4..1fb952f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -869,7 +869,8 @@ struct file_lock {
>   struct list_head fl_link;   /* doubly linked list of all locks */
>   struct list_head fl_block;  /* circular list of blocked processes */
>   fl_owner_t fl_owner;
> - unsigned int fl_pid;
> + unsigned int fl_pid;/* unique id and sometimes global pid */
> + struct pid *fl_nspid;   /* to calculate owner pid_nr for userspace */
>   wait_queue_head_t fl_wait;
>   struct file *fl_file;
>   unsigned char fl_flags;
> 

> diff --git a/fs/locks.c b/fs/locks.c
> index 8b8388e..d2d3d75 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -125,6 +125,7 @@
>  #include 
>  #include 
>  #include 
> +#inc

Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.

2007-12-20 Thread Serge E. Hallyn
Quoting Pavel Emelyanov ([EMAIL PROTECTED]):
> Oren Laadan wrote:
> > 
> > Serge E. Hallyn wrote:
> >> Quoting Pavel Emelyanov ([EMAIL PROTECTED]):
> >>> Oren Laadan wrote:
> >>>> Serge E. Hallyn wrote:
> >>>>> Quoting Oren Laadan ([EMAIL PROTECTED]):
> >>>>>> I hate to bring this again, but what if the admin in the container
> >>>>>> mounts an external file system (eg. nfs, usb, loop mount from a file,
> >>>>>> or via fuse), and that file system already has a device that we would
> >>>>>> like to ban inside that container ?
> >>>>> Miklos' user mount patches enforced that if !capable(CAP_MKNOD),
> >>>>> then mnt->mnt_flags |= MNT_NODEV.  So that's no problem.
> >>>> Yes, that works to disallow all device files from a mounted file system.
> >>>>
> >>>> But it's a black and white thing: either they are all banned or allowed;
> >>>> you can't have some devices allowed and others not, depending on type.
> >>>> A scenario where this may be useful is, for instance, if we want some
> >>>> apps in the container to execute within a pre-made chroot (sub)tree
> >>>> within that container.
> >>>>
> >>>>> But that's been pulled out of -mm! ?  Crap.
> >>>>>
> >>>>>> Since anyway we will have to keep a white- (or black-) list of devices
> >>>>>> that are permitted in a container, and that list may even change
> >>>>>> per container -- why not enforce the access control at the VFS layer ?
> >>>>>> It's safer in the long run.
> >>>>> By that you mean more along the lines of Pavel's patch than my whitelist
> >>>>> LSM, or you actually mean Tetsuo's filesystem (i assume you don't mean
> >>>>> that by 'vfs layer' :), or something different entirely?
> >>>> :)
> >>>>
> >>>> By 'vfs' I mean at open() time, and not at mount(), or mknod() time.
> >>>> Either yours or Pavel's; I tend to prefer not to use LSM as it may
> >>>> collide with future security modules.
> >>> Oren, AFAIS you've seen my patches for device access controller, right?
> > 
> > If you mean this one:
> > http://openvz.org/pipermail/devel/2007-September/007647.html
> > then ack :)
> 
> Great! Thanks.
> 
> >>> Maybe we can revisit the issue then and try to come to agreement on what
> >>> kind of model and implementation we all want?
> >> That would be great, Pavel.  I do prefer your solution over my LSM, so
> >> if we can get an elegant block device control right in the vfs code that
> >> would be my preference.
> > 
> > I concur.
> > 
> > So it seems to me that we are all in favor of the model where open()
> > of a device will consult a black/white-list. Also, we are all in favor
> > of a non-LSM implementation, Pavel's code being a good example.
> 
> Thank you, Oren and Serge! I will revisit this issue then, but
> I have a vacation the next week and, after this, we have a New
> Year and Christmas holidays in Russia. So I will be able to go
> on with it only after the 7th January :( Hope this is OK for you.
> 
> Besides, Andrew said that he would pay little attention to new
> features till the 2.6.24 release, so I'm afraid we won't have this
> even in -mm in the coming months :(
> 
> Thanks,
> Pavel

Cool, let me know any way I can help when you get started.

thanks,
-serge
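
(Archive note: the model agreed on above, where open() of a device consults
a white-list, can be sketched as a default-deny lookup keyed on type and
device numbers.  This is purely illustrative and is not Pavel's
implementation: struct dev_rule, ANY_MINOR and dev_open_allowed() are
made-up names, and a real per-container list would need locking and a way
to update it.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum dev_kind { DEV_KIND_CHAR, DEV_KIND_BLOCK };

#define ANY_MINOR (~0u)		/* hypothetical wildcard */

struct dev_rule {		/* hypothetical white-list entry */
	enum dev_kind kind;
	unsigned int major;
	unsigned int minor;	/* ANY_MINOR matches every minor */
};

/* Default deny: a device may be opened only if some rule matches it. */
static bool dev_open_allowed(const struct dev_rule *rules, size_t n,
			     enum dev_kind kind,
			     unsigned int major, unsigned int minor)
{
	size_t i;

	for (i = 0; i < n; i++) {
		const struct dev_rule *r = &rules[i];

		if (r->kind != kind || r->major != major)
			continue;
		if (r->minor == ANY_MINOR || r->minor == minor)
			return true;
	}
	return false;
}

static void dev_whitelist_examples(void)
{
	static const struct dev_rule rules[] = {
		{ DEV_KIND_CHAR, 1, 3 },		/* /dev/null */
		{ DEV_KIND_CHAR, 136, ANY_MINOR },	/* Unix98 pty slaves */
	};
	const size_t n = sizeof(rules) / sizeof(rules[0]);

	assert(dev_open_allowed(rules, n, DEV_KIND_CHAR, 1, 3));
	assert(dev_open_allowed(rules, n, DEV_KIND_CHAR, 136, 7));
	assert(!dev_open_allowed(rules, n, DEV_KIND_CHAR, 1, 5));  /* zero */
	assert(!dev_open_allowed(rules, n, DEV_KIND_BLOCK, 3, 1)); /* hda1 */
}
```

Checking at open() time means a device node that arrives on a mounted
filesystem is covered by the same list as one created with mknod().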


Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.

2007-12-19 Thread Serge E. Hallyn
Quoting Tetsuo Handa ([EMAIL PROTECTED]):
> A brief description about SYAORAN:
> 
>  SYAORAN stands for "Simple Yet All-important Object Realizing Abiding
>  Nexus". SYAORAN is a filesystem for /dev with Mandatory Access Control.

I apologize if I'm committing a faux pas by asking this, but any chance
of renaming this to something like strictdev or sdev, or at least with
'dev' in it somewhere?

Maybe the fs will sell like hotcakes and everyone will know what SYAORAN
means by next year, but just in case that doesn't happen, there is
absolutely nothing in the name that would tell me I should bother to
look at it...

>  /dev needs to be writable, but this means that files on /dev might be
>  tampered with. SYAORAN can restrict combinations of (pathname, attribute)
>  that the system can create. The attribute is one of directory, regular
>  file, FIFO, UNIX domain socket, symbolic link, character or block device
>  file with major/minor device numbers.
> 
>  SYAORAN can ensure /dev/null is a character device file with major=1 minor=3.
> 
>  Policy specifications for this filesystem is at
>  http://tomoyo.sourceforge.jp/en/1.5.x/policy-syaoran.html
> 
> Why not use FUSE?
> 
>  Because /dev has to be available through the lifetime of the kernel.
>  It is not acceptable if /dev stops working due to SIGKILL or OOM-killer.
> 
> Why not use SELinux?
> 
>  Because SELinux doesn't guarantee filename and its attribute.
>  The purpose of this filesystem is to ensure filename and its attribute
>  (e.g. /dev/null is guaranteed to be a character device file
>  with major=1 and minor=3).
> 
> Signed-off-by:  Tetsuo Handa <[EMAIL PROTECTED]>
> ---
>  fs/syaoran/syaoran.c |  338 +
>  fs/syaoran/syaoran.h |  964 
> +++
>  2 files changed, 1302 insertions(+)
> 
> --- /dev/null
> +++ linux-2.6.24-rc5/fs/syaoran/syaoran.c
> @@ -0,0 +1,338 @@
> +/*
> + * fs/syaoran/syaoran.c
> + *
> + * Implementation of the Tamper-Proof Device Filesystem.
> + *
> + * Portions Copyright (C) 2005-2007  NTT DATA CORPORATION
> + *
> + * Version: 1.5.3-pre   2007/12/16
> + *
> + * This filesystem is developed using the ramfs implementation.
> + *
> + */
> +/*
> + * Resizable simple ram filesystem for Linux.
> + *
> + * Copyright (C) 2000 Linus Torvalds.
> + *   2000 Transmeta Corp.
> + *
> + * Usage limits added by David Gibson, Linuxcare Australia.
> + * This file is released under the GPL.
> + */
> +
> +/*
> + * NOTE! This filesystem is probably most useful
> + * not as a real filesystem, but as an example of
> + * how virtual filesystems can be written.
> + *
> + * It doesn't get much simpler than this. Consider
> + * that this file implements the full semantics of
> + * a POSIX-compliant read-write filesystem.
> + *
> + * Note in particular how the filesystem does not
> + * need to implement any data structures of its own
> + * to keep track of the virtual data: using the VFS
> + * caches is sufficient.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static struct super_operations syaoran_ops;
> +static struct address_space_operations syaoran_aops;
> +static struct inode_operations syaoran_file_inode_operations;
> +static struct inode_operations syaoran_dir_inode_operations;
> +static struct inode_operations syaoran_symlink_inode_operations;
> +static struct file_operations syaoran_file_operations;
> +
> +static struct backing_dev_info syaoran_backing_dev_info = {
> + .ra_pages = 0,/* No readahead */
> + .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK |
> + BDI_CAP_MAP_DIRECT | BDI_CAP_MAP_COPY |
> + BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | BDI_CAP_EXEC_MAP,
> +};
> +
> +#include "syaoran.h"
> +
> +static struct inode *syaoran_get_inode(struct super_block *sb, int mode,
> +dev_t dev)
> +{
> + struct inode *inode = new_inode(sb);
> +
> + if (inode) {
> + struct timespec now = CURRENT_TIME;
> + inode->i_mode = mode;
> + inode->i_uid = current->fsuid;
> + inode->i_gid = current->fsgid;
> + inode->i_blocks = 0;
> + inode->i_mapping->a_ops = &syaoran_aops;
> + inode->i_mapping->backing_dev_info = &syaoran_backing_dev_info;
> + inode->i_atime = now;
> + inode->i_mtime = now;
> + inode->i_ctime = now;
> + switch (mode & S_IFMT) {
> + default:
> + init_special_inode(inode, mode, dev);
> + if (S_ISBLK(mode))
> + inode->i_fop = &wrapped_def_blk_fops;
> + else if (S_ISCHR(mode))
> + inode->i_fop = &wrapped_def_chr_fops;
> + inode->i_op = &syaoran_file_inode_operations;
> + break;

Re: [RFC/PATCH 2/8] revoke: inode revoke lock V7

2007-12-19 Thread Serge E. Hallyn
Quoting Pekka J Enberg ([EMAIL PROTECTED]):
> Hi,
> 
> On Wed, 19 Dec 2007, Serge E. Hallyn wrote:
> > > I assume you mean S_REVOKE_LOCK and not ->i_mutex, right?
> > 
> > No I did mean the i_mutex since you take the i_mutex when you set
> > S_REVOKE_LOCK.  So between that and the comment above do_lookup(),
> > I assumed you were trying to lock out concurrent do_lookups() returning
> > an inode whose revoke is starting at the same time.
> 
> No, I only use ->i_mutex for synchronizing the write to ->i_flags.

duh.

> On Wed, 19 Dec 2007, Serge E. Hallyn wrote:
> > > The caller is supposed to block open(2) with chmod(2)/chattr(2) so while 
> > > revoke is in progress, you can get references to the _revoked inode_, 
> > > which is fine (operations on it will fail with EBADFS). The 
> > > ->i_revoke_wait bits are there to make sure that while we revoke, you 
> > > can't get a _new reference_ to the inode until we're done.
> > 
> > And a new reference means through iget(), so if revoke starts
> > between the IS_REVOKE_LOCKED() check in do_lookup and its return,
> > it's ok bc we'll get a reference later on?
> 
> Yes, as soon as we unhash the dentries and the inode, do_lookup() will try 
> to find a new inode with iget() but we need to wait before writeback on 
> the revoked inode is finished.

Ok, that makes sense.  I'll let that sit for a short while and look
again :)

thanks,
-serge

> On Wed, 19 Dec 2007, Serge E. Hallyn wrote:
> > I'm a little confused but i'll keep looking.
> 
> I don't blame you. The patch is missing the following "minor detail" which 
> is needed to avoid fs corruption...
> 
>   Pekka
> 
> Index: 2.6/fs/revoke.c
> ===
> --- 2.6.orig/fs/revoke.c  2007-12-16 19:57:40.0 +0200
> +++ 2.6/fs/revoke.c   2007-12-19 18:03:13.0 +0200
> @@ -426,6 +426,8 @@   int err = 0;
>   make_revoked_inode(inode);
>   remove_inode_hash(inode);
>   revoke_aliases(inode);
> +
> + err = write_inode_now(inode, 1);
>  failed:
>   revoke_unlock(inode);
>   wake_up(&inode->i_revoke_wait);
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH 2/8] revoke: inode revoke lock V7

2007-12-19 Thread Serge E. Hallyn
Quoting Pekka J Enberg ([EMAIL PROTECTED]):
> Hi Serge,
> 
> (Thanks for looking at this. I appreciate the review!)
> 
> On Mon, 17 Dec 2007, [EMAIL PROTECTED] wrote:
> > >   struct vfsmount *mnt = nd->mnt;
> > > - struct dentry *dentry = __d_lookup(nd->dentry, name);
> > > + struct dentry *dentry;
> > >  
> > > +again:
> > > + dentry  = __d_lookup(nd->dentry, name);
> > >   if (!dentry)
> > >   goto need_lookup;
> > > +
> > > + if (dentry->d_inode && IS_REVOKE_LOCKED(dentry->d_inode)) {
> > 
> > not sure whether this is a problem or not, but dentry->d_inode isn't
> > locked here, right?  So nothing is keeping do_lookup() returning
> > with an inode which gets revoked between here and the return 0
> > a few lines down?
> 
> I assume you mean S_REVOKE_LOCK and not ->i_mutex, right?

No I did mean the i_mutex since you take the i_mutex when you set
S_REVOKE_LOCK.  So between that and the comment above do_lookup(),
I assumed you were trying to lock out concurrent do_lookups() returning
an inode whose revoke is starting at the same time.

But based on your next paragraph it sounds like I misunderstand your
locking.

> The caller is supposed to block open(2) with chmod(2)/chattr(2) so while 
> revoke is in progress, you can get references to the _revoked inode_, 
> which is fine (operations on it will fail with EBADFS). The 
> ->i_revoke_wait bits are there to make sure that while we revoke, you 
> can't get a _new reference_ to the inode until we're done.

And a new reference means through iget(), so if revoke starts
between the IS_REVOKE_LOCKED() check in do_lookup and its return,
it's ok bc we'll get a reference later on?

I'm a little confused but i'll keep looking.

thanks,
-serge


Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.

2007-12-19 Thread Serge E. Hallyn
Quoting Oren Laadan ([EMAIL PROTECTED]):
> 
> Serge E. Hallyn wrote:
> > Quoting Oren Laadan ([EMAIL PROTECTED]):
> >> I hate to bring this again, but what if the admin in the container
> >> mounts an external file system (eg. nfs, usb, loop mount from a file,
> >> or via fuse), and that file system already has a device that we would
> >> like to ban inside that container ?
> > 
> > Miklos' user mount patches enforced that if !capable(CAP_MKNOD),
> > then mnt->mnt_flags |= MNT_NODEV.  So that's no problem.
> 
> Yes, that works to disallow all device files from a mounted file system.
> 
> But it's a black and white thing: either they are all banned or allowed;
> you can't have some devices allowed and others not, depending on type.
> A scenario where this may be useful is, for instance, if we want some apps in
> the container to execute within a pre-made chroot (sub)tree within that
> container.

Yes, it's workable short-term, and we've always said that a more
complete solution would be worked on later, as people have time.

> > But that's been pulled out of -mm! ?  Crap.
> > 
> >> Since anyway we will have to keep a white- (or black-) list of devices
> >> that are permitted in a container, and that list may even change
> >> per container -- why not enforce the access control at the VFS layer ?
> >> It's safer in the long run.
> > 
> > By that you mean more along the lines of Pavel's patch than my whitelist
> > LSM, or you actually mean Tetsuo's filesystem (i assume you don't mean that
> > by 'vfs layer' :), or something different entirely?
> 
> :)
> 
> By 'vfs' I mean at open() time, and not at mount(), or mknod() time.
> Either yours or Pavel's; I tend to prefer not to use LSM as it may
> collide with future security modules.

Yeah I keep waffling.  The LSM is so simple...  but i do prefer Pavel's
patch.  Let's keep pursuing that.

-serge


Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.

2007-12-19 Thread Serge E. Hallyn
Quoting Pavel Emelyanov ([EMAIL PROTECTED]):
> Oren Laadan wrote:
> > Serge E. Hallyn wrote:
> >> Quoting Oren Laadan ([EMAIL PROTECTED]):
> >>> I hate to bring this again, but what if the admin in the container
> >>> mounts an external file system (eg. nfs, usb, loop mount from a file,
> >>> or via fuse), and that file system already has a device that we would
> >>> like to ban inside that container ?
> >> Miklos' user mount patches enforced that if !capable(CAP_MKNOD),
> >> then mnt->mnt_flags |= MNT_NODEV.  So that's no problem.
> > 
> > Yes, that works to disallow all device files from a mounted file system.
> > 
> > But it's a black and white thing: either they are all banned or allowed;
> > you can't have some devices allowed and others not, depending on type.
> > A scenario where this may be useful is, for instance, if we want some apps in
> > the container to execute within a pre-made chroot (sub)tree within that
> > container.
> > 
> >> But that's been pulled out of -mm! ?  Crap.
> >>
> >>> Since anyway we will have to keep a white- (or black-) list of devices
> >>> that are permitted in a container, and that list may even change
> >>> per container -- why not enforce the access control at the VFS layer ?
> >>> It's safer in the long run.
> >> By that you mean more along the lines of Pavel's patch than my whitelist
> >> LSM, or you actually mean Tetsuo's filesystem (i assume you don't mean that
> >> by 'vfs layer' :), or something different entirely?
> > 
> > :)
> > 
> > By 'vfs' I mean at open() time, and not at mount(), or mknod() time.
> > Either yours or Pavel's; I tend to prefer not to use LSM as it may
> > collide with future security modules.
> 
> Oren, AFAIS you've seen my patches for device access controller, right?
> 
> Maybe we can revisit the issue then and try to come to agreement on what
> kind of model and implementation we all want?

That would be great, Pavel.  I do prefer your solution over my LSM, so
if we can get an elegant block device control right in the vfs code that
would be my preference.

The only thing that makes me keep wanting to go back to an LSM is the
fact that the code defining the whitelist seems out of place in the vfs.
But I guess that's actually separated into a modular cgroup, with the
actual enforcement built in at the vfs.  So that's really the best
solution.

-serge


Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.

2007-12-17 Thread Serge E. Hallyn
Quoting Oren Laadan ([EMAIL PROTECTED]):
> 
> I hate to bring this again, but what if the admin in the container
> mounts an external file system (eg. nfs, usb, loop mount from a file,
> or via fuse), and that file system already has a device that we would
> like to ban inside that container ?

Miklos' user mount patches enforced that if !capable(CAP_MKNOD),
then mnt->mnt_flags |= MNT_NODEV.  So that's no problem.
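The rule itself is small enough to sketch in userspace C. This is only an illustration of the check described above, not code from Miklos' patches: the `sketch_*` names and the `SKETCH_MNT_NODEV` value are made up for the example, and `has_cap_mknod` stands in for `capable(CAP_MKNOD)`.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit only; the kernel defines the real MNT_NODEV itself. */
#define SKETCH_MNT_NODEV 0x02u

/* has_cap_mknod stands in for capable(CAP_MKNOD) on the mounting task. */
static unsigned int sketch_user_mount_flags(unsigned int requested,
					    bool has_cap_mknod)
{
	if (!has_cap_mknod)
		requested |= SKETCH_MNT_NODEV;	/* forced, whatever was asked for */
	return requested;
}

/* Device nodes on a nodev mount cannot be opened as devices. */
static bool sketch_dev_open_allowed(unsigned int mnt_flags)
{
	return (mnt_flags & SKETCH_MNT_NODEV) == 0;
}

/* Usage: a mount made without CAP_MKNOD always ends up nodev. */
static void sketch_nodev_demo(void)
{
	assert(!sketch_dev_open_allowed(sketch_user_mount_flags(0, false)));
	assert(sketch_dev_open_allowed(sketch_user_mount_flags(0, true)));
}
```

Note that this is all-or-nothing per mount: every device node on the mounted filesystem is disabled, or none is.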

But that's been pulled out of -mm! ?  Crap.

> Since anyway we will have to keep a white- (or black-) list of devices
> that are permitted in a container, and that list may even change
> per container -- why not enforce the access control at the VFS layer ?
> It's safer in the long run.

By that you mean more along the lines of Pavel's patch than my whitelist
LSM, or you actually mean Tetsuo's filesystem (i assume you don't mean that
by 'vfs layer' :), or something different entirely?

thanks,
-serge


Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.

2007-12-17 Thread Serge E. Hallyn
Quoting Serge E. Hallyn ([EMAIL PROTECTED]):
> Quoting Tetsuo Handa ([EMAIL PROTECTED]):
> > Hello.
> > 
> > Serge E. Hallyn wrote:
> > > CAP_MKNOD will be removed from its capability
> > I think it is not enough because the root can rename/unlink device files
> > (mv /dev/sda1 /dev/tmp; mv /dev/sda2 /dev/sda1; mv /dev/tmp /dev/sda2).
> 
> Sure but that doesn't bother us :)
> 
> The admin in the container has his own /dev directory and can do what he
> likes with the devices he's allowed to have.  He just shouldn't have
> access to others.  If he wants to rename /dev/sda1 to /dev/sda5 that's
> his choice.
> 
> > > To use your approach, i guess we would have to use selinux (or tomoyo)
> > > to enforce that devices may only be created under /dev?
> > Everyone can use this filesystem alone.
> 
> Sure but it is worthless alone.
> 
> No?

Oh, no, I'm sorry - I was thinking in terms of my requirements again.
But your requirements are to ensure that an application accessing a
device at a well-known location gets what it expects.

So then the main question is still the one I think Al had asked - what
keeps a rogue CAP_SYS_MOUNT process from doing
mount --bind /dev/hda1 /dev/null ?

thanks,
-serge

> What will keep the container admin from doing 'mknod /root/hda1 b 3 1'?
> 
> > But use with MAC (or whatever access control mechanisms that prevent
> > attackers from unmounting/overlaying this filesystem) is recommended.
> 
> -serge


Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.

2007-12-17 Thread Serge E. Hallyn
Quoting Tetsuo Handa ([EMAIL PROTECTED]):
> Hello.
> 
> Serge E. Hallyn wrote:
> > CAP_MKNOD will be removed from its capability
> I think it is not enough because the root can rename/unlink device files
> (mv /dev/sda1 /dev/tmp; mv /dev/sda2 /dev/sda1; mv /dev/tmp /dev/sda2).

Sure but that doesn't bother us :)

The admin in the container has his own /dev directory and can do what he
likes with the devices he's allowed to have.  He just shouldn't have
access to others.  If he wants to rename /dev/sda1 to /dev/sda5 that's
his choice.

> > To use your approach, i guess we would have to use selinux (or tomoyo)
> > to enforce that devices may only be created under /dev?
> Everyone can use this filesystem alone.

Sure but it is worthless alone.

No?

What will keep the container admin from doing 'mknod /root/hda1 b 3 1'?

> But use with MAC (or whatever access control mechanisms that prevent
> attackers from unmounting/overlaying this filesystem) is recommended.

-serge


Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.

2007-12-17 Thread Serge E. Hallyn
Quoting Tetsuo Handa ([EMAIL PROTECTED]):
> A brief description about SYAORAN:
> 
>  SYAORAN stands for "Simple Yet All-important Object Realizing Abiding
>  Nexus". SYAORAN is a filesystem for /dev with Mandatory Access Control.
> 
>  /dev needs to be writable, but this means that files on /dev might be
>  tampered with. SYAORAN can restrict combinations of (pathname, attribute)
>  that the system can create. The attribute is one of directory, regular
>  file, FIFO, UNIX domain socket, symbolic link, character or block device
>  file with major/minor device numbers.
> 
>  SYAORAN can ensure /dev/null is a character device file with major=1 minor=3.
> 
>  Policy specifications for this filesystem are at
>  http://tomoyo.sourceforge.jp/en/1.5.x/policy-syaoran.html
> 
> Why not use FUSE?
> 
>  Because /dev has to be available through the lifetime of the kernel.
>  It is not acceptable if /dev stops working due to SIGKILL or OOM-killer.
> 
> Why not use SELinux?
> 
>  Because SELinux doesn't guarantee filename and its attribute.
>  The purpose of this filesystem is to ensure filename and its attribute
>  (e.g. /dev/null is guaranteed to be a character device file
>  with major=1 and minor=3).
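To make the (pathname, attribute) idea concrete, here is a rough userspace sketch of such a check. It assumes a default-deny whitelist, which is my reading of the description and not necessarily how SYAORAN behaves; the table layout and every `sketch_*` name are invented for illustration and have nothing to do with the real policy syntax at the URL above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum sketch_type { SK_DIR, SK_REG, SK_FIFO, SK_SOCK, SK_SYMLINK, SK_CHR, SK_BLK };

struct sketch_rule {
	const char *path;
	enum sketch_type type;
	unsigned int maj, min;	/* only meaningful for SK_CHR and SK_BLK */
};

/* Hypothetical policy: each pathname may exist only with the listed attribute. */
static const struct sketch_rule sketch_policy[] = {
	{ "/dev",      SK_DIR, 0, 0 },
	{ "/dev/null", SK_CHR, 1, 3 },
	{ "/dev/hda1", SK_BLK, 3, 1 },
};

static bool sketch_may_create(const char *path, enum sketch_type type,
			      unsigned int maj, unsigned int min)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_policy) / sizeof(sketch_policy[0]); i++) {
		const struct sketch_rule *r = &sketch_policy[i];

		if (strcmp(r->path, path) != 0 || r->type != type)
			continue;
		if (type == SK_CHR || type == SK_BLK)
			return r->maj == maj && r->min == min;
		return true;
	}
	return false;	/* assumed default: unlisted combinations are refused */
}

/* Usage: /dev/null may only ever be the character device 1:3. */
static void sketch_policy_demo(void)
{
	assert(sketch_may_create("/dev/null", SK_CHR, 1, 3));
	assert(!sketch_may_create("/dev/null", SK_BLK, 3, 1));
}
```

The point of keying on the pathname is that a name such as /dev/null cannot be recreated as some other device, which a label-based check alone would not guarantee.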

We need something similar for system containers (like vservers).  We
will likely want root in a container to be confined to a certain set
of devices.

For starters we expect to use the capability bounding sets (see
http://lkml.org/lkml/2007/11/26/206).  So a container will have a static
/dev predefined, and CAP_MKNOD will be removed from its capability
bounding set so that root in a container cannot create any more new
devices.
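A much simplified model of that plan, as a userspace sketch: the bounding set is inherited, and a root exec can regain at most what the bounding set still contains. Inheritable sets and file capabilities are ignored here, and the `sketch_*` names are made up; only the number 27 is taken from the kernel's definition of CAP_MKNOD.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_CAP_MKNOD 27	/* same number as CAP_MKNOD in linux/capability.h */

struct sketch_caps {
	uint64_t bset;		/* capability bounding set */
	uint64_t permitted;	/* permitted set */
};

static uint64_t sketch_cap_bit(int cap)
{
	assert(cap >= 0 && cap < 64);
	return UINT64_C(1) << cap;
}

/* Dropping from the bounding set does not touch the current permitted set. */
static void sketch_drop_from_bset(struct sketch_caps *c, int cap)
{
	c->bset &= ~sketch_cap_bit(cap);
}

/*
 * Simplified exec by root: the program would normally be granted every
 * capability, so the new permitted set is whatever the bounding set allows.
 */
static struct sketch_caps sketch_exec_as_root(const struct sketch_caps *old)
{
	struct sketch_caps next = { old->bset, old->bset };

	return next;
}

static bool sketch_capable(const struct sketch_caps *c, int cap)
{
	return (c->permitted & sketch_cap_bit(cap)) != 0;
}
```

In this model the container's init drops the bit once before exec, and no descendant can get CAP_MKNOD back, root or not.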

For future more sophisticated device controls, two similar approaches
have been suggested (one by me, see
https://lists.linux-foundation.org/pipermail/containers/2007-September/007423.html
and
https://lists.linux-foundation.org/pipermail/containers/2007-November/008589.html
).  Both actually control the devices a process can create period,
rather than trying to control at the filesystem.  And yes, these both
lack the feature in your solution that for instance 'c 1 3' must be
called null, which appears to be the kind of guarantee apparmor likes to
provide.

To use your approach, i guess we would have to use selinux (or tomoyo)
to enforce that devices may only be created under /dev?

-serge


Re: [RFC][PATCH] Pid namespaces vs locks interaction

2007-12-13 Thread Serge E. Hallyn
Quoting Vitaliy Gusev ([EMAIL PROTECTED]):
> On 12 December 2007 21:42:25 Serge E. Hallyn wrote:
> > Ok sorry - by letting this thread sit a few days I lost track of where
> > we were.
> >
> > I see now, so you're saying fl_pid for nfs is not in fact a task pid.
> > It's a magically derived unique id.  (And you say it is unique across
> > all the nfs clients?)
> 
> It is unique for pair client,server.
> 
> >
> > So does the p in fl_pid stand for something, or could we rename it to
> > fl_id or fl_uniqueid?
> 
> If fl_pid will be renamed with fl_uniqueid or something, it still need 
> accessing from fs/locks.c: cat /proc/locks shows pids which also are NFS  
> pids (unique id). 
> 
> For example, let's look the /proc/locks in my system (NFS-server) when do 
> flock on a NFS client:
> 
> 1: POSIX  ADVISORY  WRITE 2 08:06:63116 0 EOF
> 2: POSIX  ADVISORY  WRITE 7047 08:09:1899694 0 EOF
> 3: FLOCK  ADVISORY  WRITE 3334 08:06:110497 0 EOF
> 4: FLOCK  ADVISORY  WRITE 3265 08:06:94786 0 EOF
> 5: POSIX  ADVISORY  WRITE 2582 08:06:110462 0 EOF
> 
> It indicates that process with pid 2 has a posix lock. Really it is a NFS 
> unique id. Problem can be solved by using pid of lockd.
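For anyone wanting to pull that owner column out of /proc/locks, a small userspace parser for lines in the format shown above might look like this. It is a sketch that assumes exactly that layout; the struct and function names are invented, and it makes no attempt to tell a real pid from an NFS unique id, which is the whole problem being discussed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct sketch_lock_line {
	int id;
	char cls[16];		/* POSIX or FLOCK */
	char mode[16];		/* ADVISORY or MANDATORY */
	char type[16];		/* READ or WRITE */
	int owner;		/* a pid, or an NFS unique id */
	unsigned int dev_major, dev_minor;
	unsigned long ino;
};

/* Returns true only if all eight leading fields were parsed. */
static bool sketch_parse_lock_line(const char *line,
				   struct sketch_lock_line *out)
{
	assert(line && out);
	return sscanf(line, "%d: %15s %15s %15s %d %x:%x:%lu",
		      &out->id, out->cls, out->mode, out->type, &out->owner,
		      &out->dev_major, &out->dev_minor, &out->ino) == 8;
}
```

Fed the first line of the listing above, this yields owner 2 on device 08:06, inode 63116.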
> 
> > Maybe that's too much bother, but so long as we're bothering with a pid
> > cleanup at all it seems worth it to me.  On the other hand maybe
> > J. Bruce Fields was right and we should accept the fact that the
> > flock->fl_pid shouldn't be taken too seriously, and leave it be.
> 
> Mix pids from some namespaces is not good. We can store process pid seen from 

Agreed, and that was the basis for my earlier objection.

It sounds like it's clear to all people smarter than I that fl_pid is
not really a pid, so there is no reason for changing the name.  And your
patch (contrary to my earlier read of it) only translates fl_nspid into
fl_pids in temporary flocks being passed to userspace, through fcntl and
/proc/locks.

So I completely withdraw my objection.

Except, for the sake of other cognitively challenged types like myself,
could you add a comment by fl_pid and fl_nspid in fs.h, to the effect of

unsigned int fl_pid;  /* unique id and sometimes global pid */
struct pid *fl_nspid; /* to calculate owner pid_nr for userspace */

(or something more accurate if I'm off)?
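To spell out how I read the two fields being used when a lock is reported to userspace, here is a userspace sketch. The `sketch_*` types are stand-ins, with `struct pid` reduced to two numbers for a task that has one pid in the initial namespace and another in its own; the numbers in the usage note are purely illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct pid: one task, two views of its number. */
struct sketch_pid {
	unsigned int global_nr;		/* as seen from the initial namespace */
	unsigned int private_nr;	/* as seen from the task's own namespace */
};

struct sketch_file_lock {
	unsigned int fl_pid;			/* unique id and sometimes global pid */
	const struct sketch_pid *fl_nspid;	/* NULL if no owning task was recorded */
};

/* The number a viewer should be shown for this lock's owner. */
static unsigned int sketch_reported_pid(const struct sketch_file_lock *fl,
					bool viewer_in_private_ns)
{
	assert(fl != NULL);
	if (fl->fl_nspid)
		return viewer_in_private_ns ? fl->fl_nspid->private_nr
					    : fl->fl_nspid->global_nr;
	return fl->fl_pid;	/* passed through untranslated */
}
```

So a lock whose owner is pid 1237 globally and pid 5 in its container shows up as 5 inside the container, while a lock with no fl_nspid is reported with fl_pid as is.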

So after all that,

Acked-by: Serge Hallyn <[EMAIL PROTECTED]>

(sorry)

thanks,
-serge

> init namespace to the flock->fl_pid (instead pid from the current namespace). 
> Thus fcntl(F_GETLK,...) and  "cat /proc/locks" will show global pids.   But 
> some LTP tests can fail.




Re: [RFC][PATCH] Pid namespaces vs locks interaction

2007-12-12 Thread Serge E. Hallyn
Quoting Vitaliy Gusev ([EMAIL PROTECTED]):
> On 12 December 2007 20:31:15 Serge E. Hallyn wrote:
> > Quoting Vitaliy Gusev ([EMAIL PROTECTED]):
> > > Hello
> > >
> > > On 6 December 2007 18:51:30 Serge E. Hallyn wrote:
> > > > > fl_pid is used by nfs, fuse and gfs2. For instance nfs  keeps in
> > > > > fl_pid some unique id to identify locking process between hosts - it
> > > > > is not a process pid.
> > > >
> > > > Ok, but so the struct user_flock->fl_pid is being set to the task's
> > > > virtual pid, while the struct kernel_flock->fl_pid is being set to
> > > > task->tgid for nfsd use.
> > > >
> > > > Why can't nfs just generate a uniqueid from the struct pid when it
> > > > needs it?
> > >
> > > I think it is hard. lockd uses struct nlm_host to get process unique id
> > > (see __nlm_alloc_pid() function).
> >
> > Looks pretty simple though...  That whole set of code could even stay
> > the same except for in __nlm_alloc_pid():
> >
> > option 1: compare struct pid* instead of uint32_t pid
> > option 2: use the "global pid" out of the stored struct pid,
> > something like pid->numbers[0].nr.
> 
> We can't use process pid. Process pid is circulated!  NFS (lockd)  needs 
> unique process id between hosts which can't repeat oneself.

Ok sorry - by letting this thread sit a few days I lost track of where
we were.

I see now, so you're saying fl_pid for nfs is not in fact a task pid.
It's a magically derived unique id.  (And you say it is unique across
all the nfs clients?)
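As I understand the idea (and I have not checked this against the real __nlm_alloc_pid()), the id comes from a per-host counter and never from the task pid. A userspace sketch of that kind of allocator, with all names and the fixed-size table invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_OWNERS 8

struct sketch_host {
	uint32_t next_id;			/* per-host counter */
	uint32_t in_use[SKETCH_MAX_OWNERS];	/* ids held by live lock owners */
	size_t nr_in_use;
};

static bool sketch_id_busy(const struct sketch_host *h, uint32_t id)
{
	size_t i;

	for (i = 0; i < h->nr_in_use; i++)
		if (h->in_use[i] == id)
			return true;
	return false;
}

/* Hand out the next counter value that no live lock owner is using. */
static uint32_t sketch_alloc_owner_id(struct sketch_host *h)
{
	uint32_t id;

	assert(h->nr_in_use < SKETCH_MAX_OWNERS);
	do {
		id = h->next_id++;
	} while (sketch_id_busy(h, id));
	h->in_use[h->nr_in_use++] = id;
	return id;
}
```

An id handed out this way says nothing about which task holds the lock, which is why it cannot be translated between pid namespaces.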

So does the p in fl_pid stand for something, or could we rename it to
fl_id or fl_uniqueid?

Maybe that's too much bother, but so long as we're bothering with a pid
cleanup at all it seems worth it to me.  On the other hand maybe
J. Bruce Fields was right and we should accept the fact that the
flock->fl_pid shouldn't be taken too seriously, and leave it be.

-serge

> > > > Fuse just seems to copy the pid to report it to userspace, so it would
> > > > just copy pid_vnr(kernel_flock->pid) into user_flock->fl_pid.
> > > >
> > > > Anyway I haven't looked at all the uses of struct fl_pid, but you
> > > > can always get the pidnr back from the struct pid if needed so there
> > > > should be no problem.
> > > >
> > > > The split definitely seems worthwhile to me, so that
> > > > user_flock->fl_pidnr can always be said to be the pid in the acting
> > > > process' namespace, and flock->fl_pid can always be a struct pid,
> > > > rather than having fl_pid sometimes be current->tgid, or sometimes
> > > > pid_vnr(flock->fl_nspid)...
> > > >
> > > > -serge
> > >
> > > --
> > > Thank,
> > > Vitaliy Gusev
> >
> 
> 
> 
> -- 
> Thank,
> Vitaliy Gusev


Re: [RFC][PATCH] Pid namespaces vs locks interaction

2007-12-12 Thread Serge E. Hallyn
Quoting Vitaliy Gusev ([EMAIL PROTECTED]):
> Hello
> 
> On 6 December 2007 18:51:30 Serge E. Hallyn wrote:
> > > fl_pid is used by nfs, fuse and gfs2. For instance nfs  keeps in fl_pid
> > > some unique id to identify locking process between hosts - it is not a
> > > process pid.
> >
> > Ok, but so the struct user_flock->fl_pid is being set to the task's
> > virtual pid, while the struct kernel_flock->fl_pid is being set to
> > task->tgid for nfsd use.
> >
> > Why can't nfs just generate a uniqueid from the struct pid when it
> > needs it?
> 
> I think it is hard. lockd uses struct nlm_host to get process unique id (see 
> __nlm_alloc_pid() function).

Looks pretty simple though...  That whole set of code could even stay
the same except for in __nlm_alloc_pid():

option 1: compare struct pid* instead of uint32_t pid
option 2: use the "global pid" out of the stored struct pid,
something like pid->numbers[0].nr.

> > Fuse just seems to copy the pid to report it to userspace, so it would
> > just copy pid_vnr(kernel_flock->pid) into user_flock->fl_pid.
> >
> > Anyway I haven't looked at all the uses of struct fl_pid, but you
> > can always get the pidnr back from the struct pid if needed so there
> > should be no problem.
> >
> > The split definitely seems worthwhile to me, so that
> > user_flock->fl_pidnr can always be said to be the pid in the acting
> > process' namespace, and flock->fl_pid can always be a struct pid,
> > rather than having fl_pid sometimes be current->tgid, or sometimes
> > pid_vnr(flock->fl_nspid)...
> >
> > -serge
> 
> 
> 
> -- 
> Thank,
> Vitaliy Gusev


Re: [RFC][PATCH] Pid namespaces vs locks interaction

2007-12-07 Thread Serge E. Hallyn
Quoting J. Bruce Fields ([EMAIL PROTECTED]):
> On Thu, Dec 06, 2007 at 03:57:29PM +0300, Vitaliy Gusev wrote:
> > I am working on pid namespaces vs locks interaction and want to evaluate 
> > the 
> > idea.
> > fcntl(F_GETLK,..) can return pid of process for not current pid namespace 
> > (if 
> > process is belonged to the several namespaces). It is true also for pids 
> > in /proc/locks. So correct behavior is saving pointer to the struct pid of 
> > the process lock owner.
> 
> Forgive me, I'm not familiar with pid namespaces.  Exactly what bug does
> this patch aim to fix?

When a task is created inside a private pid namespace, it may know
itself as pid 5, while its "global" pid is 1237.  So if it owned a
lock, it would be reported as being owned by 1237.

The patch replaces the pid number, which may signify different tasks in
different namespaces, with the 'struct pid', which uniquely identifies
a task.

> > @@ -673,14 +682,16 @@ posix_test_lock(struct file *filp, struct file_lock 
> > *fl)
> > if (posix_locks_conflict(fl, cfl))
> > break;
> > }
> > -   if (cfl)
> > +   if (cfl) {
> > __locks_copy_lock(fl, cfl);
> > -   else
> > +   if (cfl->fl_nspid)
> > +   fl->fl_pid = pid_nr_ns(cfl->fl_nspid, 
> > +   task_active_pid_ns(current));
> 
> What does pid_nr_ns() do?  I took a quick look at the implementation and
> didn't get it.

For the given 'struct pid', which is a unique light-weight task
identifier, it returns the pid number by which it is known in the pid
namespace sent as the second argument.  So if a process in the initial
pid namespace queries the process id of task 1237 mentioned above,
pid_nr_ns will return 1237, while a task in the private namespace will
get 5.
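In case it helps, the lookup can be modelled in a few lines of userspace C. This is a sketch of the idea rather than the kernel code: the `sketch_*` types are simplified stand-ins, with one number per namespace level and a back-pointer to the namespace that number belongs to.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_LEVEL 4

struct sketch_pid_ns {
	int level;			/* 0 for the initial namespace */
};

struct sketch_upid {
	int nr;				/* the number used at this level */
	const struct sketch_pid_ns *ns;	/* the namespace it is valid in */
};

struct sketch_pid {
	int level;			/* deepest level the task lives at */
	struct sketch_upid numbers[SKETCH_MAX_LEVEL];
};

/* Number by which pid is known in ns, or 0 if it is not visible there. */
static int sketch_pid_nr_ns(const struct sketch_pid *pid,
			    const struct sketch_pid_ns *ns)
{
	assert(ns != NULL && ns->level >= 0 && ns->level < SKETCH_MAX_LEVEL);
	if (pid && ns->level <= pid->level) {
		const struct sketch_upid *u = &pid->numbers[ns->level];

		if (u->ns == ns)
			return u->nr;
	}
	return 0;
}
```

With the task from the example above, asking from the initial namespace gives 1237, asking from its private namespace gives 5, and asking from an unrelated namespace at the same depth gives 0.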

> I tend to think that the pid returned by fcntl(.,F_GETLK,.) shouldn't be
> taken too seriously--it may be helpful when debugging--e.g. it might
> help an administrator looking for clues as to who's holding some
> annoying lock.  But it probably shouldn't be depended on for the
> correctness of an application.  Maybe I'm wrong and there's some reason
> we should worry about it more.
> 
> It's also likely to be wrong in the presence of locks held on behalf of
> nfs clients.  

Your stance sounds sane.  So I'm ok leaving it as is, or doing the hard
work to replace pid_t fl_pid with struct pid fl_pid altogether and
having a separate struct user_flock which has a pid number.  The problem
with the patch as it stands is that at any point you now don't know
whether fl_pid is simply unused, is the global pid, or is the pid in a
private namespace.  Sounds impossible to maintain.

-serge


Re: [RFC][PATCH] Pid namespaces vs locks interaction

2007-12-06 Thread Serge E. Hallyn
Quoting Vitaliy Gusev ([EMAIL PROTECTED]):
> On 6 December 2007 17:53:40 Serge E. Hallyn wrote:
> > Quoting Vitaliy Gusev ([EMAIL PROTECTED]):
> > > Hello!
> > >
> > > I am working on pid namespaces vs locks interaction and want to evaluate
> > > the idea.
> > > fcntl(F_GETLK,..) can return pid of process for not current pid namespace
> > > (if process is belonged to the several namespaces). It is true also for
> > > pids in /proc/locks. So correct behavior is saving pointer to the struct
> > > pid of the process lock owner.
> > > --
> > > Thank,
> > > Vitaliy Gusev
> > >
> > > diff --git a/fs/locks.c b/fs/locks.c
> > > index 8b8388e..d2d3d75 100644
> > > --- a/fs/locks.c
> > > +++ b/fs/locks.c
> > > @@ -125,6 +125,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >
> > >  #include 
> > >  #include 
> > > @@ -185,6 +186,7 @@ void locks_init_lock(struct file_lock *fl)
> > >   fl->fl_fasync = NULL;
> > >   fl->fl_owner = NULL;
> > >   fl->fl_pid = 0;
> > > + fl->fl_nspid = NULL;
> >
> > The idea seems right, but why are you keeping fl->fl_pid around?
> >
> > Seems like the safer thing to do would be to have a separate
> > struct user_flock, with an integer pid, for communicating to userspace,
> > and a struct flock, with struct pid, for kernel use?  Then fcntl_getlk()
> > and fcntl_setlk() do the appropriate conversions.
> 
> fl_pid is used by nfs, fuse and gfs2. For instance nfs  keeps in fl_pid some 
> unique id to identify locking process between hosts - it is not a process 
> pid.

Ok, but so the struct user_flock->fl_pid is being set to the task's
virtual pid, while the struct kernel_flock->fl_pid is being set to
task->tgid for nfsd use.

Why can't nfs just generate a uniqueid from the struct pid when it
needs it?

Fuse just seems to copy the pid to report it to userspace, so it would
just copy pid_vnr(kernel_flock->pid) into user_flock->fl_pid.

Anyway I haven't looked at all the uses of struct fl_pid, but you
can always get the pidnr back from the struct pid if needed so there
should be no problem.

The split definitely seems worthwhile to me, so that
user_flock->fl_pidnr can always be said to be the pid in the acting
process' namespace, and flock->fl_pid can always be a struct pid,
rather than having fl_pid sometimes be current->tgid, or sometimes
pid_vnr(flock->fl_nspid)...

-serge


Re: [RFC][PATCH] Pid namespaces vs locks interaction

2007-12-06 Thread Serge E. Hallyn
Quoting Vitaliy Gusev ([EMAIL PROTECTED]):
> Hello!
> 
> I am working on pid namespaces vs locks interaction and want to evaluate the 
> idea.
> fcntl(F_GETLK,..) can return pid of process for not current pid namespace (if 
> process is belonged to the several namespaces). It is true also for pids 
> in /proc/locks. So correct behavior is saving pointer to the struct pid of 
> the process lock owner.
> -- 
> Thank,
> Vitaliy Gusev

> diff --git a/fs/locks.c b/fs/locks.c
> index 8b8388e..d2d3d75 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -125,6 +125,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include 
>  #include 
> @@ -185,6 +186,7 @@ void locks_init_lock(struct file_lock *fl)
>   fl->fl_fasync = NULL;
>   fl->fl_owner = NULL;
>   fl->fl_pid = 0;
> + fl->fl_nspid = NULL;

The idea seems right, but why are you keeping fl->fl_pid around?

Seems like the safer thing to do would be to have a separate
struct user_flock, with an integer pid, for communicating to userspace,
and a struct flock, with struct pid, for kernel use?  Then fcntl_getlk()
and fcntl_setlk() do the appropriate conversions.

thanks,
-serge

>   fl->fl_file = NULL;
>   fl->fl_flags = 0;
>   fl->fl_type = 0;
> @@ -553,6 +555,8 @@ static void locks_insert_lock(struct file_lock **pos, 
> struct file_lock *fl)
>  {
>   list_add(&fl->fl_link, &file_lock_list);
> 
> + fl->fl_nspid = get_pid(task_tgid(current));
> +
>   /* insert into file's list */
>   fl->fl_next = *pos;
>   *pos = fl;
> @@ -584,6 +588,11 @@ static void locks_delete_lock(struct file_lock 
> **thisfl_p)
>   if (fl->fl_ops && fl->fl_ops->fl_remove)
>   fl->fl_ops->fl_remove(fl);
> 
> + if (fl->fl_nspid) {
> + put_pid(fl->fl_nspid);
> + fl->fl_nspid = NULL;
> + }
> +
>   locks_wake_up_blocks(fl);
>   locks_free_lock(fl);
>  }
> @@ -673,14 +682,16 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
>   if (posix_locks_conflict(fl, cfl))
>   break;
>   }
> - if (cfl)
> + if (cfl) {
>   __locks_copy_lock(fl, cfl);
> - else
> + if (cfl->fl_nspid)
> + fl->fl_pid = pid_nr_ns(cfl->fl_nspid, 
> + task_active_pid_ns(current));
> + } else
>   fl->fl_type = F_UNLCK;
>   unlock_kernel();
>   return;
>  }
> -
>  EXPORT_SYMBOL(posix_test_lock);
> 
>  /* This function tests for deadlock condition before putting a process to
> @@ -2084,6 +2095,12 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
>   int id, char *pfx)
>  {
>   struct inode *inode = NULL;
> + unsigned int fl_pid;
> +
> + if (fl->fl_nspid)
> + fl_pid = pid_nr_ns(fl->fl_nspid, task_active_pid_ns(current));
> + else
> + fl_pid = fl->fl_pid;
> 
>   if (fl->fl_file != NULL)
>   inode = fl->fl_file->f_path.dentry->d_inode;
> @@ -2124,16 +2141,16 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
>   }
>   if (inode) {
>  #ifdef WE_CAN_BREAK_LSLK_NOW
> - seq_printf(f, "%d %s:%ld ", fl->fl_pid,
> + seq_printf(f, "%d %s:%ld ", fl_pid,
>   inode->i_sb->s_id, inode->i_ino);
>  #else
>   /* userspace relies on this representation of dev_t ;-( */
> - seq_printf(f, "%d %02x:%02x:%ld ", fl->fl_pid,
> + seq_printf(f, "%d %02x:%02x:%ld ", fl_pid,
>   MAJOR(inode->i_sb->s_dev),
>   MINOR(inode->i_sb->s_dev), inode->i_ino);
>  #endif
>   } else {
> - seq_printf(f, "%d :0 ", fl->fl_pid);
> + seq_printf(f, "%d :0 ", fl_pid);
>   }
>   if (IS_POSIX(fl)) {
>   if (fl->fl_end == OFFSET_MAX)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b3ec4a4..5876f68 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -870,6 +870,7 @@ struct file_lock {
>   struct list_head fl_block;  /* circular list of blocked processes */
>   fl_owner_t fl_owner;
>   unsigned int fl_pid;
> + struct pid *fl_nspid;
>   wait_queue_head_t fl_wait;
>   struct file *fl_file;
>   unsigned char fl_flags;



Re: [PATCH 2/2] VFS: Reorder vfs_getxattr to avoid unnecessary calls to the LSM

2007-11-01 Thread Serge E. Hallyn
Quoting David P. Quigley ([EMAIL PROTECTED]):
> Originally vfs_getxattr would pull the security xattr variable using
> the inode getxattr handle and then proceed to clobber it with a subsequent
> call to the LSM. This patch reorders the two operations such that when the
> xattr requested is in the security namespace it first attempts to grab the
> value from the LSM directly. If it fails to obtain the value because there
> is no module present or the module does not support the operation it will
> fall back to using the inode getxattr operation. In the event that both are
> inaccessible it returns EOPNOTSUPP.
> 
> Signed-off-by: David P. Quigley <[EMAIL PROTECTED]>

No change from last time, so again

Acked-by: Serge Hallyn <[EMAIL PROTECTED]>

thanks,
-serge

> ---
>  fs/xattr.c |   15 ---
>  1 files changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 56b5b88..91c7929 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -145,11 +145,6 @@ vfs_getxattr(struct dentry *dentry, char *name, void *value, size_t size)
>   if (error)
>   return error;
> 
> - if (inode->i_op->getxattr)
> - error = inode->i_op->getxattr(dentry, name, value, size);
> - else
> - error = -EOPNOTSUPP;
> -
>   if (!strncmp(name, XATTR_SECURITY_PREFIX,
>   XATTR_SECURITY_PREFIX_LEN)) {
>   const char *suffix = name + XATTR_SECURITY_PREFIX_LEN;
> @@ -158,9 +153,15 @@ vfs_getxattr(struct dentry *dentry, char *name, void *value, size_t size)
>* Only overwrite the return value if a security module
>* is actually active.
>*/
> - if (ret != -EOPNOTSUPP)
> - error = ret;
> + if (ret == -EOPNOTSUPP)
> + goto nolsm;
> + return ret;
>   }
> +nolsm:
> + if (inode->i_op->getxattr)
> + error = inode->i_op->getxattr(dentry, name, value, size);
> + else
> + error = -EOPNOTSUPP;
> 
>   return error;
>  }
> -- 
> 1.5.3.4
> 
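
The lookup order described in the changelog quoted above can be modelled outside the kernel. This is only a sketch: `getter_fn`, `model_getxattr`, `demo_lsm` and `demo_fs` are hypothetical names, and the handlers return a byte count or a negative errno, as the kernel functions in the patch do.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SECURITY_PREFIX "security."
#define SECURITY_PREFIX_LEN (sizeof(SECURITY_PREFIX) - 1)

/* Hypothetical handler type: returns bytes copied or a negative errno. */
typedef long (*getter_fn)(const char *name, void *value, size_t size);

/* Model of the reordered lookup: ask the LSM first for security.* names,
 * and fall back to the filesystem handler only when the LSM reports
 * -EOPNOTSUPP (no module, or attribute not handled). */
static long model_getxattr(getter_fn lsm, getter_fn fs,
                           const char *name, void *value, size_t size)
{
    if (strncmp(name, SECURITY_PREFIX, SECURITY_PREFIX_LEN) == 0 && lsm) {
        long ret = lsm(name + SECURITY_PREFIX_LEN, value, size);

        if (ret != -EOPNOTSUPP)
            return ret;     /* LSM answered, or failed for another reason */
    }
    if (fs)
        return fs(name, value, size);
    return -EOPNOTSUPP;     /* neither source can supply the attribute */
}

/* Example LSM handler: only knows the "selinux" suffix. */
static long demo_lsm(const char *suffix, void *value, size_t size)
{
    if (strcmp(suffix, "selinux") != 0)
        return -EOPNOTSUPP;
    if (size < 3)
        return -ERANGE;
    memcpy(value, "lsm", 3);
    return 3;
}

/* Example filesystem handler: answers every name with the same value. */
static long demo_fs(const char *name, void *value, size_t size)
{
    (void)name;
    if (size < 2)
        return -ERANGE;
    memcpy(value, "fs", 2);
    return 2;
}
```

In this model `security.selinux` is served by `demo_lsm`, while `security.other` and `user.note` fall through to `demo_fs`. An LSM error other than -EOPNOTSUPP (for example -ERANGE) is returned as-is and does not trigger the fallback.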


Re: [PATCH 1/2] VFS/Security: Rework inode_getsecurity and callers to return resulting buffer

2007-11-01 Thread Serge E. Hallyn
Quoting David P. Quigley ([EMAIL PROTECTED]):
> This patch modifies the interface to inode_getsecurity to have the function
> return a buffer containing the security blob and its length via parameters
> instead of relying on the calling function to give it an appropriately sized
> buffer. Security blobs obtained with this function should be freed using the
> release_secctx LSM hook. This alleviates the problem of the caller having to
> guess a length and preallocate a buffer for this function allowing it to be
> used elsewhere for Labeled NFS. The patch also removed the unused err
> parameter. The conversion is similar to the one performed by Al Viro for the
> security_getprocattr hook.
> 
> Signed-off-by: David P. Quigley <[EMAIL PROTECTED]>

Looks good.  Looks like it's already hit -mm, but anyway

Acked-by: Serge Hallyn <[EMAIL PROTECTED]>

thanks,
-serge

> ---
>  fs/xattr.c   |   30 --
>  include/linux/security.h |   21 +
>  include/linux/xattr.h|1 +
>  mm/shmem.c   |3 +--
>  security/dummy.c |2 +-
>  security/security.c  |4 ++--
>  security/selinux/hooks.c |   45 -
>  7 files changed, 58 insertions(+), 48 deletions(-)
> 
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 6645b73..56b5b88 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -105,6 +105,33 @@ out:
>  EXPORT_SYMBOL_GPL(vfs_setxattr);
> 
>  ssize_t
> +xattr_getsecurity(struct inode *inode, const char *name, void *value,
> + size_t size)
> +{
> + void *buffer = NULL;
> + ssize_t len;
> +
> + if (!value || !size) {
> + len = security_inode_getsecurity(inode, name, &buffer, false);
> + goto out_noalloc;
> + }
> +
> + len = security_inode_getsecurity(inode, name, &buffer, true);
> + if (len < 0)
> + return len;
> + if (size < len) {
> + len = -ERANGE;
> + goto out;
> + }
> + memcpy(value, buffer, len);
> +out:
> + security_release_secctx(buffer, len);
> +out_noalloc:
> + return len;
> +}
> +EXPORT_SYMBOL_GPL(xattr_getsecurity);
> +
> +ssize_t
>  vfs_getxattr(struct dentry *dentry, char *name, void *value, size_t size)
>  {
>   struct inode *inode = dentry->d_inode;
> @@ -126,8 +153,7 @@ vfs_getxattr(struct dentry *dentry, char *name, void *value, size_t size)
>   if (!strncmp(name, XATTR_SECURITY_PREFIX,
>   XATTR_SECURITY_PREFIX_LEN)) {
>   const char *suffix = name + XATTR_SECURITY_PREFIX_LEN;
> - int ret = security_inode_getsecurity(inode, suffix, value,
> -  size, error);
> + int ret = xattr_getsecurity(inode, suffix, value, size);
>   /*
>* Only overwrite the return value if a security module
>* is actually active.
> diff --git a/include/linux/security.h b/include/linux/security.h
> index ac05083..3c4c91e 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -404,15 +404,12 @@ struct request_sock;
>   *   identified by @name for @dentry.
>   *   Return 0 if permission is granted.
>   * @inode_getsecurity:
> - *   Copy the extended attribute representation of the security label 
> - *   associated with @name for @inode into @buffer.  @buffer may be
> - *   NULL to request the size of the buffer required.  @size indicates
> - *   the size of @buffer in bytes.  Note that @name is the remainder
> - *   of the attribute name after the security. prefix has been removed.
> - *   @err is the return value from the preceding fs getxattr call,
> - *   and can be used by the security module to determine whether it
> - *   should try and canonicalize the attribute value.
> - *   Return number of bytes used/required on success.
> + *   Retrieve a copy of the extended attribute representation of the
> + *   security label associated with @name for @inode via @buffer.  Note that
> + *   @name is the remainder of the attribute name after the security prefix
> + *   has been removed. @alloc is used to specify if the call should return a
> + *   value via the buffer or just the value length. Return size of buffer on
> + *   success.
>   * @inode_setsecurity:
>   *   Set the security label associated with @name for @inode from the
>   *   extended attribute value @value.  @size indicates the size of the
> @@ -1275,7 +1272,7 @@ struct security_operations {
>   int (*inode_removexattr) (struct dentry *dentry, char *name);
>   int (*inode_need_killpriv) (struct dentry *dentr

Re: [PATCH 1/2] VFS/Security: Rework inode_getsecurity and callers to return resulting buffer

2007-10-26 Thread Serge E. Hallyn
Quoting David P. Quigley ([EMAIL PROTECTED]):
> On Fri, 2007-10-26 at 10:02 -0500, Serge E. Hallyn wrote:
> > Quoting David P. Quigley ([EMAIL PROTECTED]):
> > > On Thu, 2007-10-25 at 19:02 -0500, Serge E. Hallyn wrote:
> > > > Quoting David P. Quigley ([EMAIL PROTECTED]):
> > > > >  static int task_alloc_security(struct task_struct *task)
> > > > > @@ -2423,14 +2397,22 @@ static const char 
> > > > > *selinux_inode_xattr_getsuffix(void)
> > > > >   *
> > > > >   * Permission check is handled by selinux_inode_getxattr hook.
> > > > >   */
> > > > > -static int selinux_inode_getsecurity(const struct inode *inode, 
> > > > > const char *name, void *buffer, size_t size, int err)
> > > > > +static int selinux_inode_getsecurity(const struct inode *inode,
> > > > > + const char *name,
> > > > > + void **buffer)
> > > > >  {
> > > > > + u32 size;
> > > > > + int error;
> > > > >   struct inode_security_struct *isec = inode->i_security;
> > > > >  
> > > > >   if (strcmp(name, XATTR_SELINUX_SUFFIX))
> > > > >   return -EOPNOTSUPP;
> > > > >  
> > > > > - return selinux_getsecurity(isec->sid, buffer, size);
> > > > > + error = security_sid_to_context(isec->sid, (char **)buffer, 
> > > > > &size);
> > > > 
> > > > The only other downside I see here is that when the user just passes in
> > > > NULL for a buffer, security_sid_to_context() will still
> > > > kmalloc the buffer only to have it immediately freed by
> > > > xattr_getsecurity() through release_secctx().  I trust that isn't seen
> > > > as any major performance impact?
> > > 
> > > There is no way to avoid this in the SELinux case. SELinux doesn't store
> > > the sid to string mapping directly. Rather it takes the sid and then
> > > builds the string from fields in the related structure. So regardless
> > > this data is being allocated internally. The only issue I potentially
> > > see is that if someone passes in null expecting just to get the length
> > > we are actually returning a value. However we are changing the semantics
> > > of the function so the old semantics are no longer valid.
> > 
> > Hmm?  Which semantics are no longer valid?
> > 
> > > You're changing the semantics of the in-kernel API, but userspace can
> > still send in NULL to query the length of the buffer needed.  So if
> > userspace does two getxattrs, one to get the length, then another to get
> > the value, selinux will be kmallocing twice.
> > 
> > For a file manager doing a listing on a huge directory and wanting to
> > list the selinux type, i could see that being a performance issue.  Of
> > course they could get around that by sending in a 'reasonably large'
> > buffer for a first try.
> > 
> 
> OK, let's start this line of thought over again since it has been a while
> since I wrote the patches and got almost no sleep last night. 
> 
> Your concerns are that we are double allocating buffers one of which we
> are just going to immediately free after a copy. So inside the SELinux
> helper function there was what I saw as generic code for handling
> xattrs. This can be seen in the new function xattr_getsecurity, which used
> to be internal to SELinux (selinux_getsecurity). What we are doing is
> grabbing the string which internally is being allocated anyway and if
> our buffer passed in for the copy is null we just goto out returning the
> length and freeing the buffer. So here is our standard null handling
> that we had before. In LSMs where there is no internal allocation to
> handle the getsecurity call this should introduce almost no overhead.

Ah, thanks, you reminded me of what I was trying to point out.

SMACK won't do allocations so it's ok.  SELinux will do allocations
in any case so it's ok.  So in terms of current users it's fine, so I
don't want to complain too loudly.

But the now-generic xattr_getsecurity() call passes in 'buffer' from its
stack, with no indication to the LSM of whether userspace passed in NULL
or a buffer.  So if there *were* an lsm which had to allocate space to
return data, but didn't want to do so when the user just asked for the
length of the data, then that LSM would be out of luck.
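
A hypothetical LSM of the kind described above, one that allocates only when asked to, could be modelled as follows. The names `model_inode_getsecurity`, `model_query` and the `allocations` counter are illustrative assumptions. The `alloc` parameter mirrors the extra argument that the 2007-11-01 revision of this patch passes to security_inode_getsecurity().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

static int allocations;   /* counts how often the model hook allocates */

/* Hypothetical LSM hook: when alloc is false, report the length only. */
static long model_inode_getsecurity(const char *label, void **buffer,
                                    bool alloc)
{
    size_t len = strlen(label) + 1;   /* include the trailing NUL */

    if (!alloc)
        return (long)len;
    *buffer = malloc(len);
    if (*buffer == NULL)
        return -ENOMEM;
    allocations++;
    memcpy(*buffer, label, len);
    return (long)len;
}

/* Generic caller: a NULL value or zero size means "length query". */
static long model_query(const char *label, void *value, size_t size)
{
    void *buffer = NULL;
    long len;

    if (value == NULL || size == 0)
        return model_inode_getsecurity(label, &buffer, false);

    len = model_inode_getsecurity(label, &buffer, true);
    if (len < 0)
        return len;
    if (size < (size_t)len) {
        free(buffer);
        return -ERANGE;
    }
    memcpy(value, buffer, (size_t)len);
    free(buffer);
    return len;
}
```

A length query such as `model_query("floor", NULL, 0)` leaves the `allocations` counter untouched, which is the property the paragraph above is asking for.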

So would you object to passing in a boolean telling the LSM whethe

Re: [PATCH 1/2] VFS/Security: Rework inode_getsecurity and callers to return resulting buffer

2007-10-26 Thread Serge E. Hallyn
Quoting Stephen Smalley ([EMAIL PROTECTED]):
> On Fri, 2007-10-26 at 10:02 -0500, Serge E. Hallyn wrote:
> > Quoting David P. Quigley ([EMAIL PROTECTED]):
> > > On Thu, 2007-10-25 at 19:02 -0500, Serge E. Hallyn wrote:
> > > > Quoting David P. Quigley ([EMAIL PROTECTED]):
> > > > >  static int task_alloc_security(struct task_struct *task)
> > > > > @@ -2423,14 +2397,22 @@ static const char 
> > > > > *selinux_inode_xattr_getsuffix(void)
> > > > >   *
> > > > >   * Permission check is handled by selinux_inode_getxattr hook.
> > > > >   */
> > > > > -static int selinux_inode_getsecurity(const struct inode *inode, 
> > > > > const char *name, void *buffer, size_t size, int err)
> > > > > +static int selinux_inode_getsecurity(const struct inode *inode,
> > > > > + const char *name,
> > > > > + void **buffer)
> > > > >  {
> > > > > + u32 size;
> > > > > + int error;
> > > > >   struct inode_security_struct *isec = inode->i_security;
> > > > >  
> > > > >   if (strcmp(name, XATTR_SELINUX_SUFFIX))
> > > > >   return -EOPNOTSUPP;
> > > > >  
> > > > > - return selinux_getsecurity(isec->sid, buffer, size);
> > > > > + error = security_sid_to_context(isec->sid, (char **)buffer, 
> > > > > &size);
> > > > 
> > > > The only other downside I see here is that when the user just passes in
> > > > NULL for a buffer, security_sid_to_context() will still
> > > > kmalloc the buffer only to have it immediately freed by
> > > > xattr_getsecurity() through release_secctx().  I trust that isn't seen
> > > > as any major performance impact?
> > > 
> > > There is no way to avoid this in the SELinux case. SELinux doesn't store
> > > the sid to string mapping directly. Rather it takes the sid and then
> > > builds the string from fields in the related structure. So regardless
> > > this data is being allocated internally. The only issue I potentially
> > > see is that if someone passes in null expecting just to get the length
> > > we are actually returning a value. However we are changing the semantics
> > > of the function so the old semantics are no longer valid.
> > 
> > Hmm?  Which semantics are no longer valid?
> > 
> > > You're changing the semantics of the in-kernel API, but userspace can
> > still send in NULL to query the length of the buffer needed.  So if
> > userspace does two getxattrs, one to get the length, then another to get
> > the value, selinux will be kmallocing twice.
> > 
> > For a file manager doing a listing on a huge directory and wanting to
> > list the selinux type, i could see that being a performance issue.  Of
> > course they could get around that by sending in a 'reasonably large'
> > buffer for a first try.
> 
> That's what current userland does. libselinux always tries with an
> initial buffer first (and usually succeeds), thereby avoiding the second
> call to getxattr in the common case.

Ok - I figured for doing thousands of these in one directory listing
that could waste quite a bit of memory, but since (as I check) selinux
has always done a kmalloc for every getsecurity call, I guess it's a
fair tradeoff.

thanks,
-serge
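
The "initial buffer first" strategy discussed in this exchange can be sketched in userspace. The getter below is a hypothetical stand-in that assumes getxattr(2)-like semantics (size 0 returns the required length; a too-small buffer fails with -1 and errno set to ERANGE). `read_attr`, `demo_getter` and `getter_calls` are illustrative names, not part of any library.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical getter with getxattr(2)-like return conventions. */
typedef ssize_t (*attr_getter)(void *value, size_t size);

static int getter_calls;   /* how many times the getter was invoked */

static ssize_t demo_getter(void *value, size_t size)
{
    static const char label[] = "user_u:object_r:user_home_t";

    getter_calls++;
    if (size == 0)
        return (ssize_t)sizeof(label);   /* length query */
    if (size < sizeof(label)) {
        errno = ERANGE;
        return -1;
    }
    memcpy(value, label, sizeof(label));
    return (ssize_t)sizeof(label);
}

/* Try a guessed size first; only on ERANGE ask for the real length.
 * Returns a malloc()ed buffer the caller frees, or NULL on failure. */
static char *read_attr(attr_getter get, size_t guess, ssize_t *len_out)
{
    char *buf;
    ssize_t len;

    if (guess == 0)
        return NULL;
    buf = malloc(guess);
    if (buf == NULL)
        return NULL;
    len = get(buf, guess);
    if (len < 0 && errno == ERANGE) {
        char *bigger;

        len = get(NULL, 0);              /* query the needed size */
        if (len < 0) {
            free(buf);
            return NULL;
        }
        bigger = realloc(buf, len > 0 ? (size_t)len : 1);
        if (bigger == NULL) {
            free(buf);
            return NULL;
        }
        buf = bigger;
        len = get(buf, (size_t)len);
    }
    if (len < 0) {
        free(buf);
        return NULL;
    }
    *len_out = len;
    return buf;
}
```

With a guess of 64 bytes the demo getter is called once. With a guess of 4 bytes it is called three times (failed attempt, length query, retry), which is the extra cost the guessed buffer avoids in the common case.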


Re: [PATCH 1/2] VFS/Security: Rework inode_getsecurity and callers to return resulting buffer

2007-10-26 Thread Serge E. Hallyn
Quoting David P. Quigley ([EMAIL PROTECTED]):
> On Thu, 2007-10-25 at 19:02 -0500, Serge E. Hallyn wrote:
> > Quoting David P. Quigley ([EMAIL PROTECTED]):
> > >   This patch modifies the interface to inode_getsecurity to have the
> > > function return a buffer containing the security blob and its length via
> > > parameters instead of relying on the calling function to give it an
> > > appropriately sized buffer. Security blobs obtained with this function
> > > should be freed using the release_secctx LSM hook. This alleviates the
> > > problem of the caller having to guess a length and preallocate a buffer
> > > for this function allowing it to be used elsewhere for Labeled NFS. The
> > > patch also removed the unused err parameter. The conversion is similar
> > > to the one performed by Al Viro for the security_getprocattr hook.
> > > 
> > > Signed-off-by: David P. Quigley <[EMAIL PROTECTED]>
> > > ---
> > >  fs/xattr.c   |   26 --
> > >  include/linux/security.h |   27 ++-
> > >  include/linux/xattr.h|1 +
> > >  mm/shmem.c   |3 +--
> > >  security/dummy.c |4 +++-
> > >  security/selinux/hooks.c |   38 ++
> > 
> > (Hmm, I was about to ask if this diffstat could be complete, as it
> > doesn't have for instance security/security.c, but I guess this predates
> > the staticlsm patch...)
> 
> It wouldn't be much effort to rebase this patch against Linus's latest
> tree. I am assuming that the static lsm patch is in there based on the
> recent discussion on LKML?

Oh, sorry for the two emails.

Yeah it's in 2.6.24.  So a rebase will be necessary anyway.  I was just
saying I was too lazy to find another tree against which to check that
you didn't miss any getsecurity calls (hidden under some exotic .config)
to change their arguments  :)

-serge


Re: [PATCH 1/2] VFS/Security: Rework inode_getsecurity and callers to return resulting buffer

2007-10-26 Thread Serge E. Hallyn
Quoting David P. Quigley ([EMAIL PROTECTED]):
> On Thu, 2007-10-25 at 19:02 -0500, Serge E. Hallyn wrote:
> > Quoting David P. Quigley ([EMAIL PROTECTED]):
> > >  static int task_alloc_security(struct task_struct *task)
> > > @@ -2423,14 +2397,22 @@ static const char 
> > > *selinux_inode_xattr_getsuffix(void)
> > >   *
> > >   * Permission check is handled by selinux_inode_getxattr hook.
> > >   */
> > > -static int selinux_inode_getsecurity(const struct inode *inode, const 
> > > char *name, void *buffer, size_t size, int err)
> > > +static int selinux_inode_getsecurity(const struct inode *inode,
> > > + const char *name,
> > > + void **buffer)
> > >  {
> > > + u32 size;
> > > + int error;
> > >   struct inode_security_struct *isec = inode->i_security;
> > >  
> > >   if (strcmp(name, XATTR_SELINUX_SUFFIX))
> > >   return -EOPNOTSUPP;
> > >  
> > > - return selinux_getsecurity(isec->sid, buffer, size);
> > > + error = security_sid_to_context(isec->sid, (char **)buffer, &size);
> > 
> > The only other downside I see here is that when the user just passes in
> > NULL for a buffer, security_sid_to_context() will still
> > kmalloc the buffer only to have it immediately freed by
> > xattr_getsecurity() through release_secctx().  I trust that isn't seen
> > as any major performance impact?
> 
> There is no way to avoid this in the SELinux case. SELinux doesn't store
> the sid to string mapping directly. Rather it takes the sid and then
> builds the string from fields in the related structure. So regardless
> this data is being allocated internally. The only issue I potentially
> see is that if someone passes in null expecting just to get the length
> we are actually returning a value. However we are changing the semantics
> of the function so the old semantics are no longer valid.

Hmm?  Which semantics are no longer valid?

You're changing the semantics of the in-kernel API, but userspace can
still send in NULL to query the length of the buffer needed.  So if
userspace does two getxattrs, one to get the length, then another to get
the value, selinux will be kmallocing twice.

For a file manager doing a listing on a huge directory and wanting to
list the selinux type, i could see that being a performance issue.  Of
course they could get around that by sending in a 'reasonably large'
buffer for a first try.

-serge


Re: [PATCH 1/2] VFS/Security: Rework inode_getsecurity and callers to return resulting buffer

2007-10-25 Thread Serge E. Hallyn
Quoting David P. Quigley ([EMAIL PROTECTED]):
>   This patch modifies the interface to inode_getsecurity to have the
> function return a buffer containing the security blob and its length via
> parameters instead of relying on the calling function to give it an
> appropriately sized buffer. Security blobs obtained with this function
> should be freed using the release_secctx LSM hook. This alleviates the
> problem of the caller having to guess a length and preallocate a buffer
> for this function allowing it to be used elsewhere for Labeled NFS. The
> patch also removed the unused err parameter. The conversion is similar
> to the one performed by Al Viro for the security_getprocattr hook.
> 
> Signed-off-by: David P. Quigley <[EMAIL PROTECTED]>
> ---
>  fs/xattr.c   |   26 --
>  include/linux/security.h |   27 ++-
>  include/linux/xattr.h|1 +
>  mm/shmem.c   |3 +--
>  security/dummy.c |4 +++-
>  security/selinux/hooks.c |   38 ++

(Hmm, I was about to ask if this diffstat could be complete, as it
doesn't have for instance security/security.c, but I guess this predates
the staticlsm patch...)

>  6 files changed, 53 insertions(+), 46 deletions(-)
> 
> diff --git a/fs/xattr.c b/fs/xattr.c
> index a44fd92..d45c7ef 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -105,6 +105,29 @@ out:
>  EXPORT_SYMBOL_GPL(vfs_setxattr);
>  
>  ssize_t
> +xattr_getsecurity(struct inode *inode, const char *name, void *value,
> + size_t size)
> +{
> + void *buffer = NULL;
> + ssize_t len;
> +
> + len = security_inode_getsecurity(inode, name, &buffer);
> + if (len < 0)
> + return len;
> + if (!value || !size)
> + goto out;
> + if (size < len) {
> + len = -ERANGE;
> + goto out;
> + }
> + memcpy(value, buffer, len);
> +out:
> + security_release_secctx(buffer, len);

This is mighty misleading in -ERANGE case :)  I realize that
selinux_release_secctx() ignores len anyway.  But given the description
in security.h, I'd say either you need to keep the actual length
allocated and pass that in here, or (probably better) have another patch
remove the second argument from security_release_secctx().
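
A sketch of the first option mentioned above (keep the allocated length and pass that to the release hook), written as a userspace model. `model_getsecurity`, `model_release`, `model_xattr_getsecurity` and `last_released_len` are hypothetical stand-ins, and the label string is made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the LSM call that allocates the label. */
static long model_getsecurity(void **buffer)
{
    static const char label[] = "system_u:object_r:etc_t";
    size_t len = sizeof(label);          /* includes the trailing NUL */

    *buffer = malloc(len);
    if (*buffer == NULL)
        return -ENOMEM;
    memcpy(*buffer, label, len);
    return (long)len;
}

static size_t last_released_len;         /* records what release was told */

/* Hypothetical stand-in for the release hook. */
static void model_release(void *buffer, size_t len)
{
    last_released_len = len;
    free(buffer);
}

/* Keep the allocated length in its own variable so the error path
 * cannot hand a negative errno to the release hook. */
static long model_xattr_getsecurity(void *value, size_t size)
{
    void *buffer = NULL;
    long len = model_getsecurity(&buffer);
    long ret = len;

    if (len < 0)
        return len;
    if (value != NULL && size != 0) {
        if (size < (size_t)len)
            ret = -ERANGE;
        else
            memcpy(value, buffer, (size_t)len);
    }
    model_release(buffer, (size_t)len);  /* always the real length */
    return ret;
}
```

Here the return value and the length given to the release function are tracked separately, so a too-small caller buffer yields -ERANGE while the release function still sees the true allocation size.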

> + return len;
> +}
> +EXPORT_SYMBOL_GPL(xattr_getsecurity);
> +
> +ssize_t
>  vfs_getxattr(struct dentry *dentry, char *name, void *value, size_t size)
>  {
>   struct inode *inode = dentry->d_inode;
> @@ -126,8 +149,7 @@ vfs_getxattr(struct dentry *dentry, char *name, void *value, size_t size)
>   if (!strncmp(name, XATTR_SECURITY_PREFIX,
>   XATTR_SECURITY_PREFIX_LEN)) {
>   const char *suffix = name + XATTR_SECURITY_PREFIX_LEN;
> - int ret = security_inode_getsecurity(inode, suffix, value,
> -  size, error);
> + int ret = xattr_getsecurity(inode, suffix, value, size);
>   /*
>* Only overwrite the return value if a security module
>* is actually active.
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 1a15526..8658929 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -391,15 +391,11 @@ struct request_sock;
>   *   identified by @name for @dentry.
>   *   Return 0 if permission is granted.
>   * @inode_getsecurity:
> - *   Copy the extended attribute representation of the security label 
> - *   associated with @name for @inode into @buffer.  @buffer may be
> - *   NULL to request the size of the buffer required.  @size indicates
> - *   the size of @buffer in bytes.  Note that @name is the remainder
> - *   of the attribute name after the security. prefix has been removed.
> - *   @err is the return value from the preceding fs getxattr call,
> - *   and can be used by the security module to determine whether it
> - *   should try and canonicalize the attribute value.
> - *   Return number of bytes used/required on success.
> + *   Retrieve a copy of the extended attribute representation of the
> + *   security label associated with @name for @inode via @buffer.  Note that
> + *   @name is the remainder of the attribute name after the security prefix
> + *   has been removed.
> + *   Return size of buffer on success.
>   * @inode_setsecurity:
>   *   Set the security label associated with @name for @inode from the
>   *   extended attribute value @value.  @size indicates the size of the
> @@ -1233,7 +1229,8 @@ struct security_operations {
>   int (*inode_listxattr) (struct dentry *dentry);
>   int (*inode_removexattr) (struct dentry *dentry, char *name);
>   const char *(*inode_xattr_getsuffix) (void);
> - int (*inode_getsecurity)(const struct inode *inode, const char *name, void *buffer, size_t size, int err);
> + int (*inode_getsecurity)(const struct inode *ino

Re: [PATCH 2/2] VFS: Reorder vfs_getxattr to avoid unnecessary calls to the LSM

2007-10-25 Thread Serge E. Hallyn
Quoting James Morris ([EMAIL PROTECTED]):
> On Mon, 22 Oct 2007, David P. Quigley wrote:
> 
> > Originally vfs_getxattr would pull the security xattr variable using
> > the inode getxattr handle and then proceed to clobber it with a subsequent
> > call to the LSM. This patch reorders the two operations such that when the
> > xattr requested is in the security namespace it first attempts to grab the
> > value from the LSM directly. If it fails to obtain the value because there
> > is no module present or the module does not support the operation it will
> > fall back to using the inode getxattr operation. In the event that both are
> > inaccessible it returns EOPNOTSUPP.
> > 
> > Signed-off-by: David P. Quigley <[EMAIL PROTECTED]>
> 
> Acked-by: James Morris <[EMAIL PROTECTED]>

(not that it matters much, esp with selinux being the only current user,
but)

Acked-by: Serge Hallyn <[EMAIL PROTECTED]>

Makes sense and looks good.

thanks,
-serge


Re: [RFC PATCH 4/4] VFS: allow filesystem to override mknod capability checks

2007-08-10 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > From: Miklos Szeredi <[EMAIL PROTECTED]>
> > > 
> > > Add a new filesystem flag, that results in the VFS not checking if the
> > > current process has enough privileges to do an mknod().
> > > 
> > > This is needed on filesystems, where an unprivileged user may be able
> > > to create a device node, without causing security problems.
> > > 
> > > One such example is "mountlo" a loopback mount utility implemented
> > > with fuse and UML, which runs as an unprivileged userspace process.
> > > In this case the user does in fact have the right to create device
> > > nodes within the filesystem image, as long as the user has write
> > > access to the image.  Since the filesystem is mounted with "nodev",
> > > adding device nodes is not a security concern.
> > 
> > Could we enforce at do_new_mount() that if
> > type->fs_flags&FS_MKNOD_CHECKS_PERM then mnt_flags |= MS_NODEV?
> 
> Well, the problem with that is, there will be fuse filesystems which
> will want devices to work

Crud, sorry, I forgot all fuse filesystems will have the same fs_flags.

> and for those the capability checks will be
> reenabled inside ->mknod().  In fact, for backward compatibility all
> filesystems will have the mknod checks, except ones which explicitly
> request to turn it off.
> 
> Since unprivileged fuse mounts always have "nodev", the only way

Ah yes, I'd forgotten that we do if (!capable(mknod)) mnt_flags |= MNT_NODEV

No objections then anyway.  Thanks for indulging me :)

> security could be screwed up, is if a filesystem running with
> privileges disabled the mknod checks.
> 
> I will probably add some safety guards against that into the fuse
> library, but of course there's no way to stop a privileged user from
> screwing up security anyway.

Agreed.

> If for example there's a loop mount, where the disk image file is
> writable by a user, and root mounts it without "nodev", the user can
> still create device nodes (by modifying the image) even if the mknod
> checks are enabled.

thanks,
-serge
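
The VFS-side check from the patch under discussion can be modelled as a pure function. This is a sketch: `model_vfs_mknod_check` and its `has_cap_mknod` parameter are hypothetical, and only the flag value and the shape of the condition are taken from the patch.

```c
#define _XOPEN_SOURCE 700         /* for S_IFCHR and friends under strict -std */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/stat.h>

#define FS_MKNOD_CHECKS_PERM 16   /* value taken from the quoted patch */

/* Model of the check in the vfs_mknod() hunk: the VFS demands CAP_MKNOD
 * for character and block device nodes unless the filesystem type has
 * declared that it performs the check itself. Returns 0 or -EPERM. */
static int model_vfs_mknod_check(int fs_flags, mode_t mode, bool has_cap_mknod)
{
    if (!(fs_flags & FS_MKNOD_CHECKS_PERM) &&
        (S_ISCHR(mode) || S_ISBLK(mode)) && !has_cap_mknod)
        return -EPERM;
    return 0;
}
```

Without the flag, an unprivileged caller creating a character device gets -EPERM. With the flag set, the VFS lets the request through and the decision is left to the filesystem's own ->mknod(). Regular files are never affected by this check.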



Re: [RFC PATCH 4/4] VFS: allow filesystem to override mknod capability checks

2007-08-09 Thread Serge E. Hallyn
Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]):
> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> Add a new filesystem flag, that results in the VFS not checking if the
> current process has enough privileges to do an mknod().
> 
> This is needed on filesystems, where an unprivileged user may be able
> to create a device node, without causing security problems.
> 
> One such example is "mountlo" a loopback mount utility implemented
> with fuse and UML, which runs as an unprivileged userspace process.
> In this case the user does in fact have the right to create device
> nodes within the filesystem image, as long as the user has write
> access to the image.  Since the filesystem is mounted with "nodev",
> adding device nodes is not a security concern.

Could we enforce at do_new_mount() that if
type->fs_flags&FS_MKNOD_CHECKS_PERM then mnt_flags |= MS_NODEV?

> This feature is basically "fuse-only", so it does not make sense to
> change the semantics of ->mknod().
> 
> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
> ---
> 
> Index: linux/fs/namei.c
> ===
> --- linux.orig/fs/namei.c 2007-08-09 16:49:07.0 +0200
> +++ linux/fs/namei.c  2007-08-09 16:49:12.0 +0200
> @@ -1921,7 +1921,8 @@ int vfs_mknod(struct inode *dir, struct 
>   if (error)
>   return error;
>  
> - if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
> + if (!(dir->i_sb->s_type->fs_flags & FS_MKNOD_CHECKS_PERM) &&
> + (S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
>   return -EPERM;
>  
>   if (!dir->i_op || !dir->i_op->mknod)
> Index: linux/include/linux/fs.h
> ===
> --- linux.orig/include/linux/fs.h 2007-08-09 16:49:07.0 +0200
> +++ linux/include/linux/fs.h  2007-08-09 16:49:12.0 +0200
> @@ -97,6 +97,7 @@ extern int dir_notify_enable;
>  #define FS_BINARY_MOUNTDATA 2
>  #define FS_HAS_SUBTYPE 4
>  #define FS_SAFE 8 /* Safe to mount by unprivileged users */
> +#define FS_MKNOD_CHECKS_PERM 16 /* FS checks if device creation is allowed */
>  #define FS_REVAL_DOT 16384 /* Check the paths ".", ".." for staleness */
>  #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move()
>   * during rename() internally.
> 
> --


Re: Adding subroot information to /proc/mounts, or obtaining that through other means

2007-06-21 Thread Serge E. Hallyn
Quoting H. Peter Anvin ([EMAIL PROTECTED]):
> Al Viro wrote:
> > On Wed, Jun 20, 2007 at 01:57:33PM -0700, H. Peter Anvin wrote:
> >> ... or, alternatively, add a subfield to the first field (which would
> >> entail escaping whatever separator we choose):
> >>
> >> /dev/md6 /export ext3 rw,data=ordered 0 0
> >> /dev/md6:/users/foo /home/foo ext3 rw,data=ordered 0 0
> >> /dev/md6:/users/bar /home/bar ext3 rw,data=ordered 0 0
> > 
> > Hell, no.  The first field is in principle impossible to parse unless
> > you know the fs type.
> > 
> > How about making a new file with sane format?  From the very
> > beginning.  E.g. mountpoint + ID + relative path + type + options,
> > where ID uniquely identifies superblock (e.g. numeric st_dev)
> > and backing device (if any) is sitting among the options...
> 
> The more I'm thinking about this, I think it's simplest to just add
> fields to the right of the existing /proc/*/mounts.  Yes, the format is
> ugly, and it will end up being uglier still, but it's also ugly to have
> a bunch of different chunks of information formatted in different ways.

Since we're defining the order "arbitrarily" in any case, I really don't
think it's all that ugly.

Are there any existing tools which would not be able to handle the extra
fields?

(suppose it's easiest to just add the fields, try a few distros, and see
which balk)
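
For what it's worth, a parser that only consumes the six classic fields and ignores the rest of the line would keep working. Below is a minimal sketch of such a tolerant reader; the struct, the function name and the example trailing fields are assumptions for illustration, not a format anyone has agreed on.

```c
#include <stdio.h>
#include <string.h>

struct mount_entry {
    char devname[256];
    char path[256];
    char fstype[64];
    char options[256];
    int freq;
    int passno;
    const char *extra;   /* any trailing fields, or "" if there are none */
};

/* Returns 0 on success, -1 if the six classic fields are not all present. */
static int parse_mount_line(const char *line, struct mount_entry *e)
{
    int consumed = 0;
    int n = sscanf(line, "%255s %255s %63s %255s %d %d%n",
                   e->devname, e->path, e->fstype, e->options,
                   &e->freq, &e->passno, &consumed);

    if (n != 6)
        return -1;

    /* Whatever follows the sixth field is kept but not interpreted. */
    e->extra = line + consumed;
    while (*e->extra == ' ' || *e->extra == '\t')
        e->extra++;
    return 0;
}
```

Tools that instead split the whole line and insist on exactly six tokens are the ones likely to balk.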

> So, the existing fields are:
> 
> mnt_devname mnt_path filesystem_type options 0 0
> 
> ... and we'd want to add ...
> 
> mnt_id propagation_info sb_dev path_to_fs_root
> 
> As previously stated, in order to avoid having to expose kernel
> addresses to userspace, I suggest we simply add a counter field to
> struct vfsmount and use that for mnt_id.

Agreed - even if it weren't frowned upon to expose the kernel addresses,
it would just be much nicer to have easier to remember ids.  Somehow
with the kernel address, even with just a set of 5 of them printed in
front of me it takes me 2 minutes to figure out which ones are the
same...
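
A minimal userspace sketch of the counter idea, assuming a single global counter and no locking (real code would need to serialize this against concurrent mounts). The struct and function names are hypothetical.

```c
/* Hypothetical stand-in for the part of struct vfsmount that matters here. */
struct vfsmount_sketch {
    unsigned int mnt_id;
    const char *devname;
};

/* Next id to hand out; starts at 1 so that 0 can mean "no id yet". */
static unsigned int next_mnt_id = 1;

/* Give the mount a small, human-comparable id instead of an address. */
static void assign_mnt_id(struct vfsmount_sketch *mnt)
{
    mnt->mnt_id = next_mnt_id++;
}
```

Ids handed out this way are never reused within one boot unless the counter wraps, which is one reason a real implementation might prefer an allocator that recycles them.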

> I'm not all that up on what is needed for propagation_info.  I presume
> we want to be able to deduce the full mount lattice.  One particularly

I think Ram's existing patches just provided "PEER (next-peer-id)" or
"SLAVE (master-id)".

> important thing in my mind is to be able to distinguish overmounted
> filesystems (which I think is possible in the current setup only by

What exactly do you mean here?  Do you mean information about stackable
filesystems - i.e. ecryptfs, unionfs, etc?

If so, maybe a last column which the fs itself can fill in with such
information is the best way to go then?  Ecryptfs would have just one
pathname to fill in (the location of the encrypted dir), unionfs might
have several (the full stack of unioned directories).

> ordering -- the filesystem on top I believe will end up last in
> /proc/mounts, but I don't know if there actually is anything that
> enforces that.)

Hmm, or do you actually mean that if I'd done

mount --bind /tmp/a /tmp
mount --bind /tmp/b /tmp
mount --bind /tmp/c /tmp

that you would want to see information about the first two mounts?
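
To make the ordering question concrete, here is a small sketch of what userspace can do today: scan the table in order and treat the last entry for a given mount point as the visible one. It assumes later lines are always on top, which is exactly the property the quoted text says nothing enforces. The struct and the source strings are placeholders.

```c
#include <stddef.h>
#include <string.h>

struct mount_line {
    const char *source;
    const char *mountpoint;
};

/*
 * Return the index of the last entry mounted on `path`, or -1 if none.
 * Relies on file order reflecting stacking order (an assumption).
 */
static int topmost_mount(const struct mount_line *tab, size_t n,
                         const char *path)
{
    int found = -1;

    for (size_t i = 0; i < n; i++)
        if (strcmp(tab[i].mountpoint, path) == 0)
            found = (int)i;
    return found;
}
```

A mount id plus parent id in each line would make the same answer available without depending on ordering.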

-serge


Re: [AppArmor 38/45] AppArmor: Module and LSM hooks

2007-06-12 Thread Serge E. Hallyn
Quoting Karl MacMillan ([EMAIL PROTECTED]):
> On Tue, 2007-06-12 at 10:34 -0500, Serge E. Hallyn wrote:
> > Quoting Stephen Smalley ([EMAIL PROTECTED]):
> 
> [...]
> 
> > > 
> > > If we added support for named type transitions to SELinux, as proposed
> > > earlier by Kyle Moffett during this discussion, wouldn't that address
> > > that issue without needing a DTE-like approach?  The concept is to add
> > 
> > Haven't read his message, but based on what you laid out here sure, that
> > sounds good.  It still, like my dte approach, might have some trouble
> > with the wildcard/regex rules AA allows.  And while it might perfectly
> > reproduce my original DTE behavior, I don't think it does what AA wants
> > on bind mounts.  (Whether what AA wants for bind mounts makes sense I'm
> > still not convinced, especially with user mounts coming soon (or already
> > here?), but I'm staying out of that discussion for now)
> > 
> > > the last component name as a further input to the labeling decision for
> > > new files, in addition to the existing use of the creating process'
> > > label, the parent directory label, and the kind of file.  Then, you
> > > could have something like:
> > > type_transition  var_log_hosts_t:file "messages" messages_t;
> > > 
> > > The last component name is already available, so that doesn't require
> > > any changes to LSM, and it would be a straightforward extension of
> > > SELinux to support the above - it doesn't change the model at all, just
> > > adds a further input to the new file labeling logic.
> > 
> > And eliminates the need for restorecond?
> > 
> 
> Unlikely in the short term - restorecond is also used to reset contexts
> on critical files in /etc that might lose the context because tools
> used to update them are not correctly preserving contexts
> (e.g., /etc/mtab, /etc/resolv.conf).

Confused - why wouldn't the new type_transition rule extension handle
that?
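
To spell out what I have in mind, here is a rough userspace sketch of a labeling decision that takes the last component name as an extra input. It leaves out the creating process' label for brevity, and the rule table layout and function name are made up; the fallback of inheriting the parent directory's type is the usual default when no transition rule matches.

```c
#include <stddef.h>
#include <string.h>

/* One named transition rule: (parent type, class, name) -> new type. */
struct name_transition {
    const char *parent_type;
    const char *object_class;
    const char *name;
    const char *new_type;
};

/* Pick the type for a new object; fall back to the parent's type. */
static const char *label_new_file(const struct name_transition *rules,
                                  size_t n, const char *parent_type,
                                  const char *object_class, const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(rules[i].parent_type, parent_type) == 0 &&
            strcmp(rules[i].object_class, object_class) == 0 &&
            strcmp(rules[i].name, name) == 0)
            return rules[i].new_type;
    }
    return parent_type;
}
```

If a tool recreates the file under the same name in the same directory, a rule like this would fire again at creation time, which is why I'd expect it to cover those cases.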

thanks,
-serge

> Actually - this whole notion of restorecond as a critical component of
> SELinux because of a "new file problem" is pretty overblown. The default
> config file ships with:
> 
> /etc/resolv.conf
> /etc/samba/secrets.tdb
> /etc/mtab
> /var/run/utmp
> /var/log/wtmp
> ~/public_html
> ~/.mozilla/plugins/libflashplayer.so
> 
> So the only things that would be helped by type_transition rules with a 
> name component would be public_html and libflashplayer.so.
> 
> Karl
> 


Re: [AppArmor 38/45] AppArmor: Module and LSM hooks

2007-06-12 Thread Serge E. Hallyn
Quoting Stephen Smalley ([EMAIL PROTECTED]):
> On Mon, 2007-06-11 at 14:02 -0500, Serge E. Hallyn wrote:
> > Quoting Andreas Gruenbacher ([EMAIL PROTECTED]):
> > > On Monday 11 June 2007 16:33, Stephen Smalley wrote:
> > > > On Mon, 2007-06-11 at 01:10 +0200, Andreas Gruenbacher wrote:
> > > > > On Wednesday 06 June 2007 15:09, Stephen Smalley wrote:
> > > > > > On Mon, 2007-06-04 at 16:30 +0200, Andreas Gruenbacher wrote:
> > > > > > > On Monday 04 June 2007 15:12, Pavel Machek wrote:
> > > > > > > > How will kernel work with very long paths? I'd suspect some
> > > > > > > > problems, if path is 1MB long and I attempt to print it in /proc
> > > > > > > > somewhere.
> > > > > > >
> > > > > > > Pathnames are only used for informational purposes in the kernel,
> > > > > > > except in AppArmor of course.
> > > > > >
> > > > > > I don't mean this as a flame, but isn't the above statement the very
> > > > > > crux of this discussion?
> > > > >
> > > > > I think the question at the core of it all is, shall a pathname based
> > > > > security mechanism be allowed. I was under the impression that this
> > > > > question had already been answered affirmatively. If the answer here was
> > 
> > That was the decision at ksummit last year, yes.
> > 
> > > > > no, then we could stop the entire discussion right there.
> > > >
> > > > There is a difference between using the pathname at the kernel/userland
> > > > interface as part of configuring a security mechanism and using it as
> > > > the basis for the runtime checking itself.
> > > 
> > > Yes, there is a difference. When I say pathname based security mechanism, I
> > > literally mean a pathname based security mechanism, meaning the pathnames
> > > determine the outcome of the decision. This includes designs that are based on
> > > different abstractions internally, but the behavior observable from
> > > user-space must be the same (or else, it's a different model).
> > > 
> > > Unfortunately, translating pathnames to labels destroys this fundamental 
> > > abstraction. We explained why this is so in the following postings:
> > > 
> > >   http://lkml.org/lkml/2007/6/9/94
> > >   http://lkml.org/lkml/2007/6/10/141
> > > 
> > > > Further, there is a difference between generating and matching full
> > > > pathnames on each access vs. caching information in the parent dentry and
> > > > making decisions based on that cached information and the last
> > > > component-name.
> > > 
> > > Wait, you are mixing two issues here: access checks on existing files, and the
> > > creation of new files. For AppArmor as it stands today the two are the same,
> > > but when looking at emulating AppArmor using labels, they are not. Let's look
> > > at things one at a time.
> > > 
> > > Generating and matching full pathnames on each access takes time, no question
> > > about that. (In fact we are not checking on each access but only when
> > > pathnames are involved, such as on open. Filehandle based operations do not
> > > require access checks, but that's not a very important difference at this
> > > level of discussion.) That's a quantitative statement though, not a
> > > qualitative one: retrieving xattrs and checking labels also takes time;
> > > additional checks are never for free. We find that doing the pathname checks
> > > is easily fast enough. You may disagree, but then you don't have to use
> > > AppArmor, and we are not standing in your way.
> > > 
> > > As far as new files are concerned, basing decisions on the parent dentry and
> > > component name requires that you know where in the filesystem hierarchy this
> > > dentry is located: with bind mounts, the same dentry shows up in multiple
> > > locations in a process's namespace, corresponding to different pathnames. In
> > > other words, to make the right decision, the dentry alone is not enough; it
> > > takes a <dentry, vfsmount> pair. So there we are again.
> > > 

Re: [AppArmor 38/45] AppArmor: Module and LSM hooks

2007-06-11 Thread Serge E. Hallyn
Quoting Andreas Gruenbacher ([EMAIL PROTECTED]):
> On Monday 11 June 2007 16:33, Stephen Smalley wrote:
> > On Mon, 2007-06-11 at 01:10 +0200, Andreas Gruenbacher wrote:
> > > On Wednesday 06 June 2007 15:09, Stephen Smalley wrote:
> > > > On Mon, 2007-06-04 at 16:30 +0200, Andreas Gruenbacher wrote:
> > > > > On Monday 04 June 2007 15:12, Pavel Machek wrote:
> > > > > > How will kernel work with very long paths? I'd suspect some
> > > > > > problems, if path is 1MB long and I attempt to print it in /proc
> > > > > > somewhere.
> > > > >
> > > > > Pathnames are only used for informational purposes in the kernel,
> > > > > except in AppArmor of course.
> > > >
> > > > I don't mean this as a flame, but isn't the above statement the very
> > > > crux of this discussion?
> > >
> > > I think the question at the core of it all is, shall a pathname based
> > > security mechanism be allowed. I was under the impression that this
> > > question had already been answered affirmatively. If the answer here was

That was the decision at ksummit last year, yes.

> > > no, then we could stop the entire discussion right there.
> >
> > There is a difference between using the pathname at the kernel/userland
> > interface as part of configuring a security mechanism and using it as
> > the basis for the runtime checking itself.
> 
> Yes, there is a difference. When I say pathname based security mechanism, I
> literally mean a pathname based security mechanism, meaning the pathnames
> determine the outcome of the decision. This includes designs that are based on
> different abstractions internally, but the behavior observable from
> user-space must be the same (or else, it's a different model).
> 
> Unfortunately, translating pathnames to labels destroys this fundamental 
> abstraction. We explained why this is so in the following postings:
> 
>   http://lkml.org/lkml/2007/6/9/94
>   http://lkml.org/lkml/2007/6/10/141
> 
> > Further, there is a difference between generating and matching full
> > pathnames on each access vs. caching information in the parent dentry and
> > making decisions based on that cached information and the last
> > component-name.
> 
> Wait, you are mixing two issues here: access checks on existing files, and the
> creation of new files. For AppArmor as it stands today the two are the same,
> but when looking at emulating AppArmor using labels, they are not. Let's look
> at things one at a time.
> 
> Generating and matching full pathnames on each access takes time, no question 
> about that. (In fact we are not checking on each access but only when 
> pathnames are involved, such as on open. Filehandle based operations do not 
> require access checks, but that's not a very important difference at this 
> level of discussion.) That's a quantitative statement though, not a 
> qualitative one: retrieving xattrs and checking labels also takes time; 
> additional checks are never for free. We find that doing the pathname checks 
> is easily fast enough. You may disagree, but then you don't have to use 
> AppArmor, and we are not standing in your way.
> 
> As far as new files are concerned, basing decisions on the parent dentry and 
> component name requires that you know where in the filesystem hierarchy this 
> dentry is located: with bind mounts, the same dentry shows up in multiple 
> locations in a process's namespace, corresponding to different pathnames. In 
> other words, to make the right decision, the dentry alone is not enough; it
> takes a <dentry, vfsmount> pair. So there we are again.
> 
> From the point on where you have a <dentry, vfsmount> pair of objects, you can
> do two things: you can compute the full pathname and base your decision on
> that, or you can do some caching to hopefully cut some of that work short
> frequently enough to offset the additional cost. The difference between the
> two approaches is quantitative -- if there is a difference in results, then
> that's obviously a bug. I believe that caching could speed up things
> measurably, but up to this point, neither I nor anybody else had the time to
> look into it, and so we are not doing it -- not yet, and perhaps never at
> all. It may be counter to your intuition, but doing those checks is not a big
> issue.

My approach in DTE, which used pathnames to assign TE labels in the
kernel, was to use a 'shadow tree' to the vfsmnt+dentry tree, filled out
at policy load time to the depth of the deepest policy rule.  For the
behavior I was after, bind mounts were handled by storing pointers from
the new mount to the original, but the behavior AA wants would be
different.  Namespace clones are easily handled this way by copying the
cached pointers to the shadow tree.  And it was pretty fast.
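
A heavily simplified userspace sketch of the shadow-tree lookup, to give a feel for it. This is not the DTE code: it ignores mounts entirely, uses a fixed-size child array, and every name in it is invented for illustration. The idea shown is only that the tree is as deep as the deepest rule and that a path takes the label of the deepest matching node.

```c
#include <stddef.h>
#include <string.h>

#define MAX_CHILDREN 8

struct shadow_node {
    const char *name;
    const char *label;                     /* NULL: inherit from parent */
    struct shadow_node *children[MAX_CHILDREN];
    size_t nchildren;
};

/* Find the child whose name equals the `len` bytes at `name`. */
static struct shadow_node *find_child(struct shadow_node *n,
                                      const char *name, size_t len)
{
    for (size_t i = 0; i < n->nchildren; i++) {
        const char *c = n->children[i]->name;

        if (strlen(c) == len && memcmp(c, name, len) == 0)
            return n->children[i];
    }
    return NULL;
}

/* Walk the path one component at a time, keeping the deepest label seen. */
static const char *shadow_lookup(struct shadow_node *root, const char *path)
{
    struct shadow_node *node = root;
    const char *label = root->label;

    while (*path) {
        while (*path == '/')
            path++;
        size_t len = strcspn(path, "/");
        if (len == 0)
            break;
        node = find_child(node, path, len);
        if (!node)
            break;              /* below the deepest rule: label is final */
        if (node->label)
            label = node->label;
        path += len;
    }
    return label;
}
```

Lookups stop as soon as the path leaves the tree, so the cost is bounded by the depth of the policy rather than the depth of the filesystem.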

When I talked about this with Tony last year, the biggest shortcoming to
use this for AA was the wildcards.  For instance, if there is a rule for
/var/log/HOSTS/*/messages, then when /var/log/HOSTS/sergelap/ is
created, a new rule would have to be cre

Re: [RFC][PATCH 5/14] Introduce union stack

2007-05-23 Thread Serge E. Hallyn
Quoting Paul Dickson ([EMAIL PROTECTED]):
> On Mon, 14 May 2007 13:23:06 -0700, Badari Pulavarty wrote:
> 
> > > +       while (fs) {
> > > +               locked = union_trylock(fs->root);
> > > +               if (!locked)
> > > +                       goto loop1;
> > > +               locked = union_trylock(fs->altroot);
> > > +               if (!locked)
> > > +                       goto loop2;
> > > +               locked = union_trylock(fs->pwd);
> > > +               if (!locked)
> > > +                       goto loop3;
> > > +               break;
> > > +       loop3:
> > > +               union_unlock(fs->altroot);
> > > +       loop2:
> > > +               union_unlock(fs->root);
> > > +       loop1:
> > > +               read_unlock(&fs->lock);
> > > +               UM_DEBUG_LOCK("Failed to get all semaphores in fs_struct!\n");
> > > +               cpu_relax();
> > > +               read_lock(&fs->lock);
> > > +               continue;
> > 
> > Nit.. why "continue" ?
> > 
> > > +       }
> > > +       BUG_ON(!fs);
> 
> How about getting rid of the gotos:
> 
>   while (fs) {
>           locked = union_trylock(fs->root);
>           if (locked) {
>                   locked = union_trylock(fs->altroot);
>                   if (locked) {
>                           locked = union_trylock(fs->pwd);
>                           if (locked)
>                                   break;
>                           else {
>                                   union_unlock(fs->altroot);
>                                   union_unlock(fs->root);
>                           }
>                   } else
>                           union_unlock(fs->root);
>           }
>           read_unlock(&fs->lock);
>           UM_DEBUG_LOCK("Failed to get all semaphores in fs_struct!\n");
>           cpu_relax();
>           read_lock(&fs->lock);
>   }
>   BUG_ON(!fs);
> 
> It's the same number of lines.  Shorter if you get rid of the "locked"
> variable.

I dunno, I thought the goto version was cleaner and easier to tell that
the right locks are getting unlocked.  The worst part in the second
version is the break in the middle!
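
To illustrate what I mean, here is the same "take all three or none" shape as a standalone userspace sketch, with the retry loop left out. The fake_lock type and its helpers are stand-ins invented for this example; they are not the union_trylock()/union_unlock() from the patch.

```c
/* Toy lock: held is 1 while somebody owns it. */
struct fake_lock {
    int held;
};

static int fake_trylock(struct fake_lock *l)
{
    if (l->held)
        return 0;
    l->held = 1;
    return 1;
}

static void fake_unlock(struct fake_lock *l)
{
    l->held = 0;
}

/*
 * Returns 1 with all three locks held, or 0 with none of them held.
 * Each label undoes exactly the locks taken before the failing step.
 */
static int trylock_all(struct fake_lock *root, struct fake_lock *altroot,
                       struct fake_lock *pwd)
{
    if (!fake_trylock(root))
        goto fail_root;
    if (!fake_trylock(altroot))
        goto fail_altroot;
    if (!fake_trylock(pwd))
        goto fail_pwd;
    return 1;

fail_pwd:
    fake_unlock(altroot);
fail_altroot:
    fake_unlock(root);
fail_root:
    return 0;
}
```

The unlocks appear once each, in reverse order of acquisition, so it is easy to check by eye that nothing is leaked or dropped twice.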

-serge


Re: [PATCH 0/2] file capabilities: Introduction

2007-05-17 Thread Serge E. Hallyn
Quoting Suparna Bhattacharya ([EMAIL PROTECTED]):
> On Mon, May 14, 2007 at 08:00:11PM +, Pavel Machek wrote:
> > Hi!
> > 
> > > "Serge E. Hallyn" <[EMAIL PROTECTED]> wrote:
> > > 
> > > > Following are two patches which have been sitting for some time in -mm.
> > > 
> > > Where "some time" == "nearly six months".
> > > 
> > > We need help considering, reviewing and testing this code, please.
> > 
> > I did quick scan, and it looks ok. Plus, it means we can finally start
> > using that old capabilities subsystem... so I think we should do it.
> 
> FWIW, I looked through it recently as well, and it looked reasonable enough
> to me, though I'm not a security expert. I did have a question about
> testing corner cases etc, which Serge has tried to address.
> 
> Serge, are you planning to post an update without STRICTXATTR ? That should
> simplify the second patch.

Sorry, I did but I guess I didn't cc: you on that reply.

It is at http://lkml.org/lkml/2007/5/14/276

thanks,
-serge


Re: [RFC][PATCH 8/14] Union-mount lookup

2007-05-16 Thread Serge E. Hallyn
Quoting Jan Engelhardt ([EMAIL PROTECTED]):
> 
> On May 16 2007 10:38, Bharata B Rao wrote:
> >> 
> >> >+lookup_union:
> >> >+       do {
> >> >+               struct vfsmount *mnt = find_mnt(topmost);
> >> >+               UM_DEBUG_DCACHE("name=\"%s\", inode=%p, device=%s\n",
> >> >+                               topmost->d_name.name, topmost->d_inode,
> >> >+                               mnt->mnt_devname);
> >> >+               mntput(mnt);
> >> >+       } while (0);
> >> 
> >> Why the extra do{}while? [elsewhere too]
> >
> >Not sure, may be to get a scope to define 'mnt' here. Jan ?
> 
> What I was implicitly suggesting was that mnt could be moved into the
> normal 'function scope'.
> 
> 
>   Jan

This code can't stay anyway so it's kind of moot.  find_mnt() is bogus,
and the topmost and overlaid mappings need to be changed from
dentry->dentry to (vfsmnt,dentry)->(vfsmnt,dentry) in order to cope with
bind mounts and mount namespaces.
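
A tiny userspace sketch of why the key has to be the pair. The types and the comparison helper are invented for illustration and are not taken from the union mount patches.

```c
/* Hypothetical stand-ins for struct vfsmount and struct dentry. */
struct mnt_sketch {
    int id;
};

struct dentry_sketch {
    const char *name;
};

/* A location in the namespace: which mount, and which dentry within it. */
struct path_key {
    const struct mnt_sketch *mnt;
    const struct dentry_sketch *dentry;
};

/* Two locations are the same only if both halves match. */
static int path_key_equal(struct path_key a, struct path_key b)
{
    return a.mnt == b.mnt && a.dentry == b.dentry;
}
```

With a bind mount the same dentry is reachable through two different mounts, so a map keyed on the dentry alone cannot tell the two locations apart.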

-serge


Re: [PATCH 2/2] file capabilities: accomodate >32 bit capabilities

2007-05-14 Thread Serge E. Hallyn
Quoting Suparna Bhattacharya ([EMAIL PROTECTED]):
> On Thu, May 10, 2007 at 01:01:27PM -0700, Andreas Dilger wrote:
> > On May 08, 2007  16:49 -0500, Serge E. Hallyn wrote:
> > > Quoting Andreas Dilger ([EMAIL PROTECTED]):
> > > > One of the important use cases I can see today is the ability to
> > > > split the heavily-overloaded e.g. CAP_SYS_ADMIN into much more fine
> > > > grained attributes.
> > > 
> > > Sounds plausible, though it suffers from both making capabilities far
> > > more cumbersome (i.e. finding the right capability for what you wanted
> > > to do) and backward compatibility.  Perhaps at that point we should
> > > introduce security.capabilityv2 xattrs.  A binary can then carry
> > > security.capability=CAP_SYS_ADMIN=p, and
> > > security.capabilityv2=cap_may_clone_mntns=p.
> > 
> > Well, the overhead of each EA is non-trivial (16 bytes/EA) for storing
> > 12 bytes worth of data, so it is probably just better to keep extending
> > the original capability fields as was in the proposal.
> > 
> > > > What we definitely do NOT want to happen is an application that needs
> > > > > privileged access (e.g. e2fsck, mount) to stop running because the
> > > > new capabilities _would_ have been granted by the new kernel and are
> > > > not by the old kernel and STRICTXATTR is used.
> > > > 
> > > > To me it would seem that having extra capabilities on an old kernel
> > > > is relatively harmless if the old kernel doesn't know what they are.
> > > > It's like having a key to a door that you don't know where it is.
> > > 
> > > If we ditch the STRICTXATTR option do the semantics seem sane to you?
> > 
> > Seems reasonable.
> 
> It would simplify the code as well, which is good.
> 
> This does mean no sanity checking of fcaps, am not sure if that matters,
> I'm guessing it should be similar to the case for other security attributes.

which is to trust the xattr...

So here is a new consolidated patch without the STRICTXATTR config
option.

-serge

From: Serge E. Hallyn <[EMAIL PROTECTED]>
Subject: [PATCH] Implement file posix capabilities

Implement file posix capabilities.  This allows programs to be given a
subset of root's powers regardless of who runs them, without having to use
setuid and giving the binary all of root's powers.

This version works with Kaigai Kohei's userspace tools, found at
http://www.kaigai.gr.jp/index.php.  For more information on how to use this
patch, Chris Friedhoff has posted a nice page at
http://www.friedhoff.org/fscaps.html.

Changelog:
May 14:
Remove STRICTXATTR support which could make newer binaries
unusable on older kernels, and combine the two patches
into one.

[recent]:
1. Enable the CONFIG_SECURITY_FS_CAPABILITIES option
when CONFIG_SECURITY=n.
2. Rename CONFIG_SECURITY_FS_CAPABILITIES to
CONFIG_SECURITY_FILE_CAPABILITIES
3. To accomodate 64-bit caps, specify that capabilities are
stored as
u32 version; u32 eff0; u32 perm0; u32 inh0;
u32 eff1; u32 perm1; u32 inh1; (etc)

Nov 27:
Incorporate fixes from Andrew Morton
(security-introduce-file-caps-tweaks and
security-introduce-file-caps-warning-fix)
Fix Kconfig dependency.
Fix change signaling behavior when file caps are not compiled in.

Nov 13:
Integrate comments from Alexey: Remove CONFIG_ ifdef from
capability.h, and use %zd for printing a size_t.

Nov 13:
Fix endianness warnings by sparse as suggested by Alexey
Dobriyan.

Nov 09:
Address warnings of unused variables at cap_bprm_set_security
when file capabilities are disabled, and simultaneously clean
up the code a little, by pulling the new code into a helper
function.

Nov 08:
For pointers to required userspace tools and how to use
them, see http://www.friedhoff.org/fscaps.html.

Nov 07:
Fix the calculation of the highest bit checked in
check_cap_sanity().

Nov 07:
Allow file caps to be enabled without CONFIG_SECURITY, since
capabilities are the default.
Hook cap_task_setscheduler when !CONFIG_SECURITY.
Move capable(TASK_KILL) to end of cap_task_kill to reduce
audit messages.

Nov 05:
Add secondary calls in selinux/hooks.c to task_setioprio and
task_setscheduler so that selinux and capabilities with file
cap support can be stacked.

Sep 05:
As Seth Arnold points out, uid checks are out of place
for capability code.

Sep 

Re: [PATCH 2/2] file capabilities: accomodate >32 bit capabilities

2007-05-08 Thread Serge E. Hallyn
Quoting Andreas Dilger ([EMAIL PROTECTED]):
> On May 08, 2007  14:17 -0500, Serge E. Hallyn wrote:
> > As the capability set changes and distributions start tagging
> > binaries with capabilities, we would like for running an older
> > kernel to not necessarily make those binaries unusable.
> > 
> > (0. Enable the CONFIG_SECURITY_FS_CAPABILITIES option
> >when CONFIG_SECURITY=n.)
> > (1. Rename CONFIG_SECURITY_FS_CAPABILITIES to
> >CONFIG_SECURITY_FILE_CAPABILITIES)
> > 2. Introduce CONFIG_SECURITY_FILE_CAPABILITIES_STRICTXATTR
> >which, when set, prevents loading binaries with capabilities
> >set which the kernel doesn't know about.  When not set,
> >such capabilities run, ignoring the unknown caps.
> > 3. To accomodate 64-bit caps, specify that capabilities are
> >stored as
> > u32 version; u32 eff0; u32 perm0; u32 inh0;
> > u32 eff1; u32 perm1; u32 inh1; (etc)
> 
> Have you considered how such capabilities will be used in the future?

There have been all sorts of suggestions, including very fine-grained
breakdowns of existing capabilities as well as capabilities for
non-privileged operations.

Other candidates for upcoming capabilities will be to satisfy
containers/vserver/openvz, where a distinction needs to be made between
CAP_DAC_OVERRIDE inside the user namespace, and the global
CAP_DAC_OVERRIDE.  Although the path I've been pursuing (for which I
should really send out the prelim patches I've been sitting on)
follows David Howells's and Eric Biederman's suggestions of using the
keyrings to store capabilities to other user namespaces.  Still, new
capabilities may be desirable to guard CLONE_NEWNS etc (rather than
CAP_SYS_ADMIN).

> One of the important use cases I can see today is the ability to
> split the heavily-overloaded e.g. CAP_SYS_ADMIN into much more fine
> grained attributes.

Sounds plausible, though it suffers from both making capabilities far
more cumbersome (i.e. finding the right capability for what you wanted
to do) and backward compatibility.  Perhaps at that point we should
introduce security.capabilityv2 xattrs.  A binary can then carry
security.capability=CAP_SYS_ADMIN=p, and
security.capabilityv2=cap_may_clone_mntns=p.

> What we definitely do NOT want to happen is an application that needs
> privileged access (e.g. e2fsck, mount) to stop running because the
> new capabilities _would_ have been granted by the new kernel and are
> not by the old kernel and STRICTXATTR is used.
> 
> To me it would seem that having extra capabilities on an old kernel
> is relatively harmless if the old kernel doesn't know what they are.
> It's like having a key to a door that you don't know where it is.

If we ditch the STRICTXATTR option do the semantics seem sane to you?
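
For reference, the two behaviours under discussion can be sketched like this for a single 32-bit word. This is a userspace illustration with a made-up function name and a caller-supplied mask of known capabilities; it is not the code from the patch.

```c
#include <errno.h>
#include <stdint.h>

/*
 * strict != 0: refuse a file that carries any capability bit this
 *              kernel does not know about (the STRICTXATTR behaviour).
 * strict == 0: silently drop the unknown bits and carry on.
 */
static int filter_file_caps(uint32_t on_disk, uint32_t known_mask,
                            int strict, uint32_t *out)
{
    uint32_t unknown = on_disk & ~known_mask;

    if (unknown && strict)
        return -EINVAL;

    *out = on_disk & known_mask;
    return 0;
}
```

Dropping the option means always taking the second branch, so a binary tagged for a newer kernel still runs, just without the capabilities the older kernel has never heard of.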

thanks,
-serge


[PATCH 2/2] file capabilities: accomodate >32 bit capabilities

2007-05-08 Thread Serge E. Hallyn
From: Serge E. Hallyn <[EMAIL PROTECTED]>
Subject: [PATCH 2/2] file capabilities: accomodate >32 bit capabilities

(Changelog: fixed syntax error in dummy version of check_cap_sanity())

As the capability set changes and distributions start tagging
binaries with capabilities, we would like for running an older
kernel to not necessarily make those binaries unusable.

(0. Enable the CONFIG_SECURITY_FS_CAPABILITIES option
   when CONFIG_SECURITY=n.)
(1. Rename CONFIG_SECURITY_FS_CAPABILITIES to
   CONFIG_SECURITY_FILE_CAPABILITIES)
2. Introduce CONFIG_SECURITY_FILE_CAPABILITIES_STRICTXATTR
   which, when set, prevents loading binaries with capabilities
   set which the kernel doesn't know about.  When not set,
   such capabilities run, ignoring the unknown caps.
3. To accomodate 64-bit caps, specify that capabilities are
   stored as
u32 version; u32 eff0; u32 perm0; u32 inh0;
u32 eff1; u32 perm1; u32 inh1; (etc)
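
As an illustration of the layout in item 3, here is a small userspace decoder that reads the little-endian words and folds up to two (eff, perm, inh) triples into 64-bit sets. The struct, the function names and the choice to ignore any triples beyond the second are assumptions made for this sketch; the kernel code in the patch below is organized differently.

```c
#include <stddef.h>
#include <stdint.h>

struct cap_sets {
    uint64_t effective;
    uint64_t permitted;
    uint64_t inheritable;
};

/* Read one little-endian 32-bit word regardless of host byte order. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Decode "version, eff0, perm0, inh0, eff1, perm1, inh1, ...".
 * Returns 0 on success, -1 if the buffer is not a version word
 * followed by one or more complete triples.
 */
static int decode_caps(const unsigned char *buf, size_t len,
                       uint32_t *version, struct cap_sets *out)
{
    size_t words = len / 4;

    if (len % 4 != 0 || words < 4 || (words - 1) % 3 != 0)
        return -1;

    size_t triples = (words - 1) / 3;

    *version = read_le32(buf);
    out->effective = out->permitted = out->inheritable = 0;

    /* Triple i holds capability bits 32*i .. 32*i+31. */
    for (size_t i = 0; i < triples && i < 2; i++) {
        const unsigned char *t = buf + 4 + 12 * i;

        out->effective   |= (uint64_t)read_le32(t)     << (32 * i);
        out->permitted   |= (uint64_t)read_le32(t + 4) << (32 * i);
        out->inheritable |= (uint64_t)read_le32(t + 8) << (32 * i);
    }
    return 0;
}
```

Because each triple is self-contained, an older reader can stop after the words it understands and still get correct values for the capabilities it knows.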

Signed-off-by: Serge E. Hallyn <[EMAIL PROTECTED]>
Cc: Stephen Smalley <[EMAIL PROTECTED]>
Cc: James Morris <[EMAIL PROTECTED]>
Cc: Chris Wright <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 include/linux/capability.h |   23 -
 security/Kconfig   |   14 ++-
 security/commoncap.c   |  157 ++-
 3 files changed, 132 insertions(+), 62 deletions(-)

diff -puN include/linux/capability.h~file-capabilities-accomodate-future-64-bit-caps include/linux/capability.h
--- a/include/linux/capability.h~file-capabilities-accomodate-future-64-bit-caps
+++ a/include/linux/capability.h
@@ -44,11 +44,28 @@ typedef struct __user_cap_data_struct {
 
 #define XATTR_CAPS_SUFFIX "capability"
 #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
+
+/* size of caps that we work with */
+#define XATTR_CAPS_SZ (4*sizeof(__le32))
+
+/*
+ * data[] is organized as:
+ *   effective[0]
+ *   permitted[0]
+ *   inheritable[0]
+ *   effective[1]
+ *   ...
+ * this way we can just read as much of the on-disk capability as
+ * we know should exist and know we'll get the data we'll need.
+ */
 struct vfs_cap_data_disk {
__le32 version;
-   __le32 effective;
-   __le32 permitted;
-   __le32 inheritable;
+   __le32 data[];  /* eff[0], perm[0], inh[0], eff[1], ... */
+};
+
+struct vfs_cap_data_disk_v1 {
+   __le32 version;
+   __le32 data[3];  /* eff[0], perm[0], inh[0] */
 };
 
 #ifdef __KERNEL__
diff -puN security/commoncap.c~file-capabilities-accomodate-future-64-bit-caps security/commoncap.c
--- a/security/commoncap.c~file-capabilities-accomodate-future-64-bit-caps
+++ a/security/commoncap.c
@@ -110,36 +110,73 @@ void cap_capset_set (struct task_struct 
target->cap_permitted = *permitted;
 }
 
-#ifdef CONFIG_SECURITY_FS_CAPABILITIES
-static inline void cap_from_disk(struct vfs_cap_data_disk *dcap,
-   struct vfs_cap_data *cap)
+#ifdef CONFIG_SECURITY_FILE_CAPABILITIES
+
+#ifdef CONFIG_SECURITY_FILE_CAPABILITIES_STRICTXATTR
+static int check_cap_sanity(struct vfs_cap_data_disk *dcap, int size)
 {
-   cap->version = le32_to_cpu(dcap->version);
-   cap->effective = le32_to_cpu(dcap->effective);
-   cap->permitted = le32_to_cpu(dcap->permitted);
-   cap->inheritable = le32_to_cpu(dcap->inheritable);
+   int word, bit;
+   u32 eff, inh, perm;
+   int sz = (size-1)/3;
+
+   word = CAP_NUMCAPS / 32;
+   bit = CAP_NUMCAPS % 32;
+
+   eff  = le32_to_cpu(dcap->data[3*word]);
+   perm = le32_to_cpu(dcap->data[3*word+1]);
+   inh  = le32_to_cpu(dcap->data[3*word+2]);
+
+   while (word < sz) {
+       if (bit == 32) {
+           bit = 0;
+           word++;
+           if (word >= sz)
+               break;
+           eff  = le32_to_cpu(dcap->data[3*word]);
+           perm = le32_to_cpu(dcap->data[3*word+1]);
+           inh  = le32_to_cpu(dcap->data[3*word+2]);
+           continue;
+       }
+       if (eff & CAP_TO_MASK(bit))
+           return -EINVAL;
+       if (inh & CAP_TO_MASK(bit))
+           return -EINVAL;
+       if (perm & CAP_TO_MASK(bit))
+           return -EINVAL;
+       bit++;
+   }
+
+   return 0;
 }
+#else
+static int check_cap_sanity(struct vfs_cap_data_disk *dcap, int sz)
+{ return 0; }
+#endif
 
-static int check_cap_sanity(struct vfs_cap_data *cap)
+static inline int cap_from_disk(struct vfs_cap_data_disk *dcap,
+   struct linux_binprm *bprm, int size)
 {
-   int i;
+   int rc, version;
 
-   if (cap->vers

[PATCH 1/2] file capabilities: implement file capabilities

2007-05-08 Thread Serge E. Hallyn
From: Serge E. Hallyn <[EMAIL PROTECTED]>
Subject: [PATCH 1/2] file capabilities: implement file capabilities

Implement file posix capabilities.  This allows programs to be given a
subset of root's powers regardless of who runs them, without having to use
setuid and giving the binary all of root's powers.

This version works with Kaigai Kohei's userspace tools, found at
http://www.kaigai.gr.jp/index.php.  For more information on how to use this
patch, Chris Friedhoff has posted a nice page at
http://www.friedhoff.org/fscaps.html.

Changelog:
Nov 27:
Incorporate fixes from Andrew Morton
(security-introduce-file-caps-tweaks and
security-introduce-file-caps-warning-fix)
Fix Kconfig dependency.
Fix change signaling behavior when file caps are not compiled in.

Nov 13:
Integrate comments from Alexey: Remove CONFIG_ ifdef from
capability.h, and use %zd for printing a size_t.

Nov 13:
Fix endianness warnings by sparse as suggested by Alexey
Dobriyan.

Nov 09:
Address warnings of unused variables at cap_bprm_set_security
when file capabilities are disabled, and simultaneously clean
up the code a little, by pulling the new code into a helper
function.

Nov 08:
For pointers to required userspace tools and how to use
them, see http://www.friedhoff.org/fscaps.html.

Nov 07:
Fix the calculation of the highest bit checked in
check_cap_sanity().

Nov 07:
Allow file caps to be enabled without CONFIG_SECURITY, since
capabilities are the default.
Hook cap_task_setscheduler when !CONFIG_SECURITY.
Move capable(TASK_KILL) to end of cap_task_kill to reduce
audit messages.

Nov 05:
Add secondary calls in selinux/hooks.c to task_setioprio and
task_setscheduler so that selinux and capabilities with file
cap support can be stacked.

Sep 05:
As Seth Arnold points out, uid checks are out of place
for capability code.

Sep 01:
Define task_setscheduler, task_setioprio, cap_task_kill, and
task_setnice to make sure a user cannot affect a process in which
they called a program with some fscaps.

One remaining question is the note under task_setscheduler: are we
ok with CAP_SYS_NICE being sufficient to confine a process to a
cpuset?

It is a semantic change, as without fscaps, attach_task doesn't
allow CAP_SYS_NICE to override the uid equivalence check.  But since
it uses security_task_setscheduler, which elsewhere is used where
CAP_SYS_NICE can be used to override the uid equivalence check,
fixing it might be tough.

 task_setscheduler
 note: this also controls cpuset:attach_task.  Are we ok with
 CAP_SYS_NICE being used to confine to a cpuset?
 task_setioprio
 task_setnice
 sys_setpriority uses this (through set_one_prio) for another
 process.  Need same checks as setrlimit

Aug 21:
Updated secureexec implementation to reflect the fact that
euid and uid might be the same and nonzero, but the process
might still have elevated caps.

Aug 15:
Handle endianness of xattrs.
Enforce capability version match between kernel and disk.
Enforce that no bits beyond the known max capability are
set, else return -EPERM.
With this extra processing, it may be worth reconsidering
doing all the work at bprm_set_security rather than
d_instantiate.

Aug 10:
Always call getxattr at bprm_set_security, rather than
caching it at d_instantiate.

Signed-off-by: Serge E. Hallyn <[EMAIL PROTECTED]>
Cc: Stephen Smalley <[EMAIL PROTECTED]>
Cc: James Morris <[EMAIL PROTECTED]>
Cc: Chris Wright <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 include/linux/capability.h |   20 +++
 include/linux/security.h   |   12 +-
 security/Kconfig           |   10 +
 security/capability.c      |    4 
 security/commoncap.c       |  194 +--
 security/selinux/hooks.c   |   12 ++
 6 files changed, 241 insertions(+), 11 deletions(-)

diff -puN include/linux/capability.h~implement-file-posix-capabilities include/linux/capability.h
--- a/include/linux/capability.h~implement-file-posix-capabilities
+++ a/include/linux/capability.h
@@ -40,11 +40,29 @@ typedef struct __user_cap_data_struct {
 __u32 inheritable;
 } __user *cap_user_data_t;
   
+
+
+#define XATTR_CAPS_SUFFIX "capability"
+#define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
+struct vfs_cap_data_disk {
+   __le32 version;
+   __le32 effective;
+   __le32 permitted;
+   _

[PATCH 0/2] file capabilities: Introduction

2007-05-08 Thread Serge E. Hallyn
Following are two patches which have been sitting for some time in -mm.
The first implements file capabilities, the second changes the format a
bit to accommodate potential future 64-bit capabilities.

We are hoping to get a few more eyes on the code before deciding whether
this is safe to finally push up.  There are no real objections to the
code at the moment, but the lack of serious review, especially by
filesystems experts, is somewhat worrying.  If you have some time,
please do take a look.

Appended to this email are two programs which can be used for testing.
One is the actual test program, and one is a victim who gets his file
capabilities set by, and executed by, the main test program.

Compile using
gcc -o testfscaps testfscaps.c -lcap
gcc -o print_caps print_caps.c -lcap

then run
./testfscaps 0
./testfscaps 1 eff
./testfscaps 1 perm
./testfscaps 1 inh
./testfscaps 2

Test 0 makes sure that non-root can't write file capability xattrs.
Test 1 checks various edge cases of xattr lengths and values
Test 2 checks valid xattr values and makes sure the binary with
  those values runs with the expected caps.  Compare the value which
  testfscaps says it set on print_caps with the values printed by
  print_caps.
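
As a rough illustration of what test 1 exercises, the sketch below decodes
a file-capability xattr value and rejects the malformed cases: wrong
length, version mismatch, and bits set beyond the highest known
capability.  The layout (version, effective, permitted, inheritable as
little-endian 32-bit words) follows struct vfs_cap_data_disk from patch
1/2.  The names fscap_decode, FSCAP_VERSION and FSCAP_NUMCAPS are
hypothetical, and the constants are assumptions taken from the test
program (_LINUX_CAPABILITY_VERSION and __CAP_BITS 31), not from the kernel
code itself.

```c
#include <stddef.h>
#include <stdint.h>

#define FSCAP_VERSION 0x19980330u /* assumed: _LINUX_CAPABILITY_VERSION */
#define FSCAP_NUMCAPS 31          /* assumed: bits 0..30 are known caps */

struct fscap {
    uint32_t version, effective, permitted, inheritable;
};

/* Read one little-endian 32-bit word, independent of host byte order. */
static uint32_t get_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Returns 0 on success, -1 if the xattr value is malformed. */
int fscap_decode(const unsigned char *buf, size_t len, struct fscap *out)
{
    /* Mask of every bit above the highest known capability. */
    const uint32_t unknown = ~((UINT32_C(1) << FSCAP_NUMCAPS) - 1);

    if (len != 4 * sizeof(uint32_t))
        return -1;                      /* wrongly sized xattr */

    out->version     = get_le32(buf);
    out->effective   = get_le32(buf + 4);
    out->permitted   = get_le32(buf + 8);
    out->inheritable = get_le32(buf + 12);

    if (out->version != FSCAP_VERSION)
        return -1;                      /* kernel/disk version mismatch */
    if ((out->effective | out->permitted | out->inheritable) & unknown)
        return -1;                      /* bit beyond the known maximum */
    return 0;
}
```

The kernel patch does the equivalent work in cap_from_disk() and
check_cap_sanity(); this user-space version only mirrors the idea.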

thanks,
-serge

=
begin print_caps.c
=
/*
 * Copyright (C) IBM Corporation, 2007
 * Author: Serge Hallyn <[EMAIL PROTECTED]>
 *
 * Prints out the capabilities with which it is running.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/capability.h>

int main(int argc, char *argv[])
{
cap_t cap = cap_get_proc();

if (!cap) {
perror("print_caps - cap_get_proc");
exit(1);
}

printf("%s: running with caps %s\n", argv[0], cap_to_text(cap, NULL));

cap_free(cap);

return 0;
}
=

=
begin testfscaps.c
=
/*
 * Copyright (C) IBM Corporation, 2007
 * Author: Serge Hallyn <[EMAIL PROTECTED]>

 * Perform several tests of file capabilities:
 *  1. try setting caps without CAP_SYS_ADMIN
 *  2. try setting wrongly-sized sets of caps
 *   for eff, inh, perm, or all of the above
 * Then run the executable
 *  3. try setting valid caps, drop rights, and run the executable,
 * make sure we get the rights
 */
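#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <endian.h>
#include <byteswap.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/xattr.h>
#include <sys/capability.h>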

void usage(char *me)
{
printf("Usage: %s <0|1|2> [arg]\n", me);
printf("  0: set file caps without CAP_SYS_ADMIN\n");
printf("  1: set bogus file caps\n");
printf(" arg=eff: for effective caps\n");
printf(" arg=inh: for inheritable caps\n");
printf(" arg=perm: for permitted caps\n");
printf("  2: test that file caps are set correctly on exec\n");
exit(1);
}

int drop_root()
{
int ret;
ret = setresuid(1000, 1000, 1000);
if (ret) {
perror("setresuid");
exit(4);
}
return 1;
}

#if BYTE_ORDER == LITTLE_ENDIAN
#define le32_to_cpu(x)  (x)
#define le16_to_cpu(x)  (x)
#define cpu_to_le32(x)  (x)
#define cpu_to_le16(x)  (x)
#else
#define le32_to_cpu(x)  bswap_32(x)
#define le16_to_cpu(x)  bswap_16(x)
#define cpu_to_le32(x)  bswap_32(x)
#define cpu_to_le16(x)  bswap_16(x)
#endif

#define TSTPATH "./print_caps"
#define CAPNAME "security.capability"
#ifndef __CAP_BITS
#define __CAP_BITS 31
#endif

int perms_test(void)
{
int ret;
unsigned int value[4];

drop_root();
value[0] = cpu_to_le32(_LINUX_CAPABILITY_VERSION);
value[1] = 1;
value[2] = 1;
value[3] = 1;
ret = setxattr(TSTPATH, CAPNAME, value, 4*sizeof(unsigned int), 0);
if (ret) {
perror("setxattr");
printf("PASS: could not set capabilities as non-root\n");
ret = 0;
} else {
printf("FAIL: could set capabilities as non-root\n");
ret = 1;
}

return ret;
}

static inline int getcapflag(int w)
{
switch (w) {
case 0: return CAP_EFFECTIVE;
case 1: return CAP_PERMITTED;
case 2: return CAP_INHERITABLE;
default: exit(10);
}
}

int fork_drop_and_exec(void)
{
int pid = fork();
int ret, status;

if (pid < 0) {
perror("fork");
exit(1);
}
   

Re: [patch] unprivileged mounts update

2007-04-26 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > > So then as far as you're concerned, the patches which were in -mm will
> > > > remain unchanged?
> > > 
> > > Basically yes. I've merged the update patch, which was not yet added
> > > to -mm, did some cosmetic code changes, and updated the patch headers.
> > > 
> > > There's one open point, that I think we haven't really explored, and
> > > that is the propagation semantics.  I think you had the idea, that a
> > > propagated mount should inherit ownership from the parent into which
> > > it was propagated.
> > 
> > Don't think that was me.  I stayed out of those early discussions
> > because I wasn't comfortable guessing at the proper semantics yet.
> 
> Yes, sorry, it was Eric's suggestion.
> 
> > But really, I, as admin, have to set up both propagation and user mounts
> > for a particular subtree, so why would I *not* want user mounts to be
> > propagated?
> > 
> > So, in my own situation, I have done
> > 
> > make / rshared
> > mount --bind /share /share
> > make /share unbindable
> > for u in $users; do
> > mount --rbind / /share/$u/root
> > make /share/$u/root rslave
> > make /share/$u/root rshared
> > mount --bind -o user=$u /share/$u/root/home/$u /share/$u/root/home/$u
> > done
> > 
> > All users get chrooted into /share/$USER/root, some also get their own
> > namespace.  Clearly if a user in a new namespace does
> > 
> > mount --bind -o user=me ~/somedir ~/otherdir
> > 
> > then logs out, and logs back in, I want the ~/otherdir in the new
> > namespace (and the one in the 'init' namespace) to also be owned by
> > 'me'.
> > 
> > > That sounds good if everyone agrees?
> > 
> > I've shown where I think propagating the mount owner is useful.  Can you
> > detail a scenario where doing so would be bad?  Then we can work toward
> > semantics that make sense...
> 
> But in your example, the "propagated mount inherits ownership from
> parent mount" would also work, since in all namespaces the owner of
> the parent would necessarily be "me".

true.

> The "inherits parent" semantics would work better for example in the
> "all nosuid" namespace, where the user is free to modify its mount
> namespace. 
> 
> If for example propagation is set up from the initial namespace to
> this user's namespace and a new mount is added to the initial
> namespace, it would be nice if the propagated new mount would also be
> owned by the user (and be "nosuid" of course).

ok, so in the example I gave, this would be the admin in the
initial namespace mounting something under /home/$USER/, which
gets propagated to slave /share/$USER/root/home/$USER, where
we would want a different mount owner.

> Does the above make sense?  I'm not sure I've explained clearly
> enough.

I think I see.  Sounds like inherit from parent does the right thing
all around, at least in cases we've thought of so far.
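
The "inherit from parent" rule can be sketched as a tiny user-space
model.  Everything here is hypothetical (struct prop_mnt,
propagate_inherit); in particular, treating the copy as nosuid whenever
either the source or the destination parent is nosuid is an assumption
drawn from Miklos's "and be nosuid of course", not from the patches.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical model of the two mount attributes discussed here. */
struct prop_mnt {
    uid_t owner;
    bool nosuid;
};

/*
 * Model of the copy created when `src` is propagated underneath
 * `dest_parent`: ownership comes from the destination parent rather
 * than from the source mount.
 */
struct prop_mnt propagate_inherit(const struct prop_mnt *src,
                                  const struct prop_mnt *dest_parent)
{
    struct prop_mnt copy = *src;

    copy.owner = dest_parent->owner;
    /* Assumption: nosuid is sticky from either side. */
    copy.nosuid = src->nosuid || dest_parent->nosuid;
    return copy;
}
```

With this rule an admin mount (owner 0) propagated under a parent owned
by uid 1000 ends up owned by uid 1000, while a mount by 'me' propagated
under another parent owned by 'me' stays owned by 'me'.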

thanks,
-serge
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] unprivileged mounts update

2007-04-26 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > So then as far as you're concerned, the patches which were in -mm will
> > remain unchanged?
> 
> Basically yes. I've merged the update patch, which was not yet added
> to -mm, did some cosmetic code changes, and updated the patch headers.
> 
> There's one open point, that I think we haven't really explored, and
> that is the propagation semantics.  I think you had the idea, that a
> propagated mount should inherit ownership from the parent into which
> it was propagated.

Don't think that was me.  I stayed out of those early discussions
because I wasn't comfortable guessing at the proper semantics yet.

But really, I, as admin, have to set up both propagation and user mounts
for a particular subtree, so why would I *not* want user mounts to be
propagated?

So, in my own situation, I have done

make / rshared
mount --bind /share /share
make /share unbindable
for u in $users; do
mount --rbind / /share/$u/root
make /share/$u/root rslave
make /share/$u/root rshared
mount --bind -o user=$u /share/$u/root/home/$u /share/$u/root/home/$u
done

All users get chrooted into /share/$USER/root, some also get their own
namespace.  Clearly if a user in a new namespace does

mount --bind -o user=me ~/somedir ~/otherdir

then logs out, and logs back in, I want the ~/otherdir in the new
namespace (and the one in the 'init' namespace) to also be owned by
'me'.
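
For readers more used to mount(2) than to the shorthand above, the table
below sketches the flags each step would presumably translate to.  The
mapping is an assumption: "make X rshared" is read as a recursive
propagation change, and the final "-o user=$u" step is left out because
it depends on the interface of the patchset under discussion.  The names
mount_step and flags_for are hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <sys/mount.h>

struct mount_step {
    const char *cmd;      /* command as written in the message */
    unsigned long flags;  /* assumed mount(2) flags for that command */
};

static const struct mount_step steps[] = {
    { "make / rshared",                 MS_SHARED | MS_REC },
    { "mount --bind /share /share",     MS_BIND },
    { "make /share unbindable",         MS_UNBINDABLE },
    { "mount --rbind / /share/$u/root", MS_BIND | MS_REC },
    { "make /share/$u/root rslave",     MS_SLAVE | MS_REC },
    { "make /share/$u/root rshared",    MS_SHARED | MS_REC },
};

/* Look up the assumed mount(2) flags for one command; 0 if unknown. */
unsigned long flags_for(const char *cmd)
{
    for (size_t i = 0; i < sizeof(steps) / sizeof(steps[0]); i++)
        if (strcmp(steps[i].cmd, cmd) == 0)
            return steps[i].flags;
    return 0;
}
```

Nothing is mounted here; the table only records which flags a real
mount(2) call would carry for each step.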

> That sounds good if everyone agrees?

I've shown where I think propagating the mount owner is useful.  Can you
detail a scenario where doing so would be bad?  Then we can work toward
semantics that make sense...

-serge


Re: [patch] unprivileged mounts update

2007-04-26 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > > Right, I figure if the normal action is to always do
> > > > mnt->user = current->fsuid, then for the special case we
> > > > pass a uid in someplace.  Of course...  do we not have a
> > > > place to do that?  Would it be a no-no to use 'data' for
> > > > a non-fs-specific arg?
> > > 
> > > I guess it would be OK for bind, but not for new- and remounts, where
> > > 'data' is already used.
> > > 
> > > Maybe it's best to stay with fsuid after all, and live with having to
> > > restore capabilities.  It's not so bad after all, this seems to do the
> > > trick:
> > > 
> > >   cap_t cap = cap_get_proc();
> > >   setfsuid(uid);
> > >   cap_set_proc(cap);
> > > 
> > > Unfortunately these functions are not in libc, but in a separate
> > > "libcap" library.  Ugh.
> > 
> > Ok, are you still planning to nix the MS_SETUSER flag, though, as
> > Eric suggested?  I think it's cleanest - always set the mnt->user
> > field to current->fsuid, and require CAP_SYS_ADMIN if the
> > mountpoint->mnt->user != current->fsuid.
> 
> It would be a nice cleanup, but I think it's unworkable for the
> following reasons:
> 
> Up till now mount(2) and umount(2) always required CAP_SYS_ADMIN, and
> we must make sure, that unless there's some explicit action by the
> sysadmin, these rules are still enforced.
> 
> For example, with just a check for mnt->mnt_uid == current->fsuid, a
> fsuid=0 process could umount or submount all the "legacy" mounts even
> without CAP_SYS_ADMIN.
>
> This is a fundamental security problem, with getting rid of MS_SETUSER
> and MNT_USER.
> 
> Another, rather unlikely situation is if an existing program sets
> fsuid to non-zero before calling mount, hence unintentionally making that
> mount owned by some user after these patches.
> 
> Also adding "user=0" to the options in /proc/mounts would be an
> interface breakage, that is probably harmless, but people wouldn't like
> it.  Special casing the zero uid for this case is more ugly IMO, than
> the problem we are trying to solve.
> 
> If we didn't have existing systems to deal with, then of course I'd
> agree with Eric's suggestion.
> 
> Miklos
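
The objection about "legacy" mounts can be made concrete with a small
user-space model.  All names are hypothetical (MODEL_MNT_USER stands in
for MNT_USER, and the two structs are not kernel types); the sketch only
compares a check on the owner alone with a check that also requires the
user-mount flag.

```c
#include <stdbool.h>
#include <sys/types.h>

#define MODEL_MNT_USER 0x1u /* hypothetical stand-in for MNT_USER */

struct owned_mnt {
    uid_t owner;
    unsigned int flags;
};

struct task_model {
    uid_t fsuid;
    bool cap_sys_admin;
};

/* Owner check only: what dropping MS_SETUSER/MNT_USER would leave. */
bool may_umount_fsuid_only(const struct task_model *t,
                           const struct owned_mnt *m)
{
    return t->cap_sys_admin || m->owner == t->fsuid;
}

/* Owner check gated on the mount being marked as a user mount. */
bool may_umount_with_flag(const struct task_model *t,
                          const struct owned_mnt *m)
{
    return t->cap_sys_admin ||
           ((m->flags & MODEL_MNT_USER) && m->owner == t->fsuid);
}
```

A pre-existing mount modelled as { owner 0, no flag } can be unmounted by
a fsuid=0 task without CAP_SYS_ADMIN under the first check, but not under
the second, which is the behaviour change Miklos describes.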

So then as far as you're concerned, the patches which were in -mm will
remain unchanged?

-serge



Re: [patch] unprivileged mounts update

2007-04-26 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > Right, I figure if the normal action is to always do
> > mnt->user = current->fsuid, then for the special case we
> > pass a uid in someplace.  Of course...  do we not have a
> > place to do that?  Would it be a no-no to use 'data' for
> > a non-fs-specific arg?
> 
> I guess it would be OK for bind, but not for new- and remounts, where
> 'data' is already used.
> 
> Maybe it's best to stay with fsuid after all, and live with having to
> restore capabilities.  It's not so bad after all, this seems to do the
> trick:
> 
>   cap_t cap = cap_get_proc();
>   setfsuid(uid);
>   cap_set_proc(cap);
> 
> Unfortunately these functions are not in libc, but in a separate
> "libcap" library.  Ugh.
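
For reference, the same save/setfsuid/restore sequence can be sketched
without libcap by calling the capget and capset system calls directly.
This is a different technique from the three libcap calls quoted above,
not a transcription of them.  The helper name setfsuid_keep_caps is
hypothetical, and the use of _LINUX_CAPABILITY_VERSION_3 assumes kernel
headers newer than the ones this thread was written against.

```c
#define _GNU_SOURCE
#include <linux/capability.h>
#include <sys/fsuid.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: change fsuid, then put back the capability sets
 * that were in place before the change.  Returns 0 on success, -1 if
 * either capability call fails.
 */
int setfsuid_keep_caps(uid_t uid)
{
    struct __user_cap_header_struct hdr = {
        .version = _LINUX_CAPABILITY_VERSION_3, /* assumed ABI version */
        .pid = 0,                               /* 0 = calling thread */
    };
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];

    /* Save the current effective/permitted/inheritable sets. */
    if (syscall(SYS_capget, &hdr, data) != 0)
        return -1;

    /* setfsuid() returns the previous fsuid, not an error code. */
    setfsuid(uid);

    /* Restore the saved sets, including any fs-related bits. */
    if (syscall(SYS_capset, &hdr, data) != 0)
        return -1;
    return 0;
}
```

The restore works because the permitted set is not reduced by the fsuid
change, so re-raising the effective bits is still allowed.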

Ok, are you still planning to nix the MS_SETUSER flag, though, as Eric
suggested?  I think it's cleanest - always set the mnt->user field to
current->fsuid, and require CAP_SYS_ADMIN if the mountpoint->mnt->user !=
current->fsuid.

-serge


Re: [patch] unprivileged mounts update

2007-04-25 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
> "Serge E. Hallyn" <[EMAIL PROTECTED]> writes:
> 
> > Quoting Eric W. Biederman ([EMAIL PROTECTED]):
> >> 
> >> Are there other permission checks that mount is doing that we
> >> care about.
> >
> > Not mount itself, but in looking up /share/fa/root/home/fa,
> > user fa doesn't have the rights to read /share, and by setting
> > fsuid to fa and dropping CAP_DAC_READ_SEARCH the mount action fails.
> 
> Got it. 
> 
> I'm not certain this is actually a problem; it may be a feature.
> But it does fly in the face of the general principle of just
> getting out of roots way so things can get done.
> 
> I think we can solve your basic problem by simply doing like:
> chdir(/share); mount(.);  To simply avoid the permission problem.
> 
> The practical question is how much do we care.
> 
> > But the solution you outlined in your previous post would work around
> > this perfectly.
> 
> If we are not using the usual permissions, which user do we use?  current->uid?
> Or do we pass that user someplace?

Right, I figure if the normal action is to always do
mnt->user = current->fsuid, then for the special case we
pass a uid in someplace.  Of course...  do we not have a
place to do that?  Would it be a no-no to use 'data' for
a non-fs-specific arg?

> >> > If it were really the equivalent then I could keep my capabilities :)
> >> > after changing it.
> >> 
> >> We drop all capabilities after we change the euid.
> >
> > Not if we've done prctl(PR_SET_KEEPCAPS, 1)
> 
> Ah cap_clear doesn't do the obvious thing.
> 
> Eric


Re: [patch] unprivileged mounts update

2007-04-25 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
> "Serge E. Hallyn" <[EMAIL PROTECTED]> writes:
> 
> > Quoting H. Peter Anvin ([EMAIL PROTECTED]):
> >> Miklos Szeredi wrote:
> >> > 
> >> > Andrew, please skip this patch, for now.
> >> > 
> >> > Serge found a problem with the fsuid approach: setfsuid(nonzero) will
> >> > remove filesystem related capabilities.  So even if root is trying to
> >> > set the "user=UID" flag on a mount, access to the target (and in case
> >> > of bind, the source) is checked with user privileges.
> >> > 
> >> > Root should be able to set this flag on any mountpoint, _regardless_
> >> > of permissions.
> >> > 
> >> 
> >> Right, if you're using fsuid != 0, you're not running as root 
> >
> > Sure, but what I'm not clear on is why, if I've done a
> > prctl(PR_SET_KEEPCAPS, 1) before the setfsuid, I still lose the
> > CAP_FS_MASK perms.  I see the special case handling in
> > cap_task_post_setuid().  I'm sure there was a reason for it, but
> > this is a piece of the capability implementation I don't understand
> > right now.
> 
> So we drop CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_DAC_READ_SEARCH,
> CAP_FOWNER, and CAP_FSETID
> 
> Since we are checking CAP_SETUID or CAP_SYS_ADMIN how is that
> a problem?
> 
> Are there other permission checks that mount is doing that we
> care about.

Not mount itself, but in looking up /share/fa/root/home/fa,
user fa doesn't have the rights to read /share, and by setting
fsuid to fa and dropping CAP_DAC_READ_SEARCH the mount action fails.

But the solution you outlined in your previous post would work around
this perfectly.

> >> (fsuid is
> >> the equivalent to euid for the filesystem.)
> >
> > If it were really the equivalent then I could keep my capabilities :)
> > after changing it.
> 
> We drop all capabilities after we change the euid.

Not if we've done prctl(PR_SET_KEEPCAPS, 1)

> >> I fail to see how ruid should have *any* impact on mount(2).  That seems
> >> to be a design flaw.
> >
> > May be, but just using fsuid at this point stops me from enabling user
> > mounts under /share if /share is chmod 000 (which it is).
> 
> I'm dense today.  If we can't work out the details we can always use a flag.
> But what is the problem with fsuid?

See above.

> You are not trying to test this using a non-default security model are you?

Nope, at the moment CONFIG_SECURITY=n so I'm running with capabilities
only.

thanks,
-serge


Re: [patch] unprivileged mounts update

2007-04-25 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
> Miklos Szeredi <[EMAIL PROTECTED]> writes:
> 
> >> From: Miklos Szeredi <[EMAIL PROTECTED]>
> >> 
> >> - refine adding "nosuid" and "nodev" flags for unprivileged mounts:
> >> o add "nosuid", only if mounter doesn't have CAP_SETUID capability
> >> o add "nodev", only if mounter doesn't have CAP_MKNOD capability
> >> 
> >> - allow unprivileged forced unmount, but only for FS_SAFE filesystems
> >> 
> >> - allow mounting over special files, but not symlinks
> >> 
> >> - for mounting and umounting check "fsuid" instead of "ruid"
> >
> > Andrew, please skip this patch, for now.
> >
> > Serge found a problem with the fsuid approach: setfsuid(nonzero) will
> > remove filesystem related capabilities.  So even if root is trying to
> > set the "user=UID" flag on a mount, access to the target (and in case
> > of bind, the source) is checked with user privileges.
> 
> I do have a major problem with this patchset though.  We still have
> the unnecessary concept of user mounts.  That seems only needed now
> for the /proc/mounts output.
> 
> All mounts should have an owner.  Prior to the unprivileged mount work
> root owns all mounts.
> 
> > Root should be able to set this flag on any mountpoint, _regardless_
> > of permissions.
> 
> We don't need a flag, and thinking of it in the context of a flag
> is clearly the wrong thing.  Yes if we have the proper capability we
> should be able to explicitly specify  the owner of the mount
> 
> > It is possible to restore filesystem capabilities after setting fsuid,
> > but the interfaces are rather horrible at all levels.  mount(8) can
> > probably live with these, but I'm not sure that using "fsuid" over
> > "ruid" has enough advantages to force this.
> >
> > Why did we want to use fsuid, exactly?
> 
> - Because ruid is completely the wrong thing; we want mounts owned
>   by whomever's permissions we are using to perform the mount.
> 
> 
> There are two basic cases.
> - Mounting a filesystem as who we are.
>   This can use fsuid with no problems.  If we are suid to root to perform
>   the mount by default we want root to own the mount so that is correct.
> 
> - Mounting a filesystem as another user.
>   This is the tricky, rare case needed in setup.  If we aren't
>   jumping through too many hoops to make it work when using fsuid it
>   sounds like the right thing here as well.
> 
>   How hard is it to set fsuid to a different value?  I.e. what hoops
>   does root have to jump through?
> 
> Further when using fsuid we don't need an extra flag to mount.
> 
> Plus things are a little more consistent with the rest of the
> linux/unix interface.
> 
> Now I can see doing something like using a special flag and not using
> fsuid for the one case where we explicitly want to mount a filesystem
> as someone else.  However, if only user space has to special case this
> (as it does anyway) and we don't have to special case it in the
> kernel, so much the better.

Yes, what you describe (or my reading of it :) would simplify the
implementation, and solve the capability problem.

So in general, when you mount something, the mount is owned by you.

To mount something as you, either the mountpoint's mount is owned by
you, or you have some capability, maybe CAP_SYS_ADMIN.

So, before any non-root user can do a mount, root must mount an ancestor
mount in the name of that user.  This would be a new mount flag, so

mount -o user=some_user /share/$USER/home/$USER /share/$USER/home/$USER

as root.  Mount does not change the fsuid, it simply passes the user=
flag into do_loopback(), which sets the mnt->user flag.  And now, even
though I have /share as chmod 000, root didn't have to setfsuid so we
have the necessary caps.

(clearly, -o user requires CAP_SYS_ADMIN or something)

-serge


Re: [patch] unprivileged mounts update

2007-04-25 Thread Serge E. Hallyn
Quoting H. Peter Anvin ([EMAIL PROTECTED]):
> Miklos Szeredi wrote:
> > 
> > Andrew, please skip this patch, for now.
> > 
> > Serge found a problem with the fsuid approach: setfsuid(nonzero) will
> > remove filesystem related capabilities.  So even if root is trying to
> > set the "user=UID" flag on a mount, access to the target (and in case
> > of bind, the source) is checked with user privileges.
> > 
> > Root should be able to set this flag on any mountpoint, _regardless_
> > of permissions.
> > 
> 
> Right, if you're using fsuid != 0, you're not running as root 

Sure, but what I'm not clear on is why, if I've done a
prctl(PR_SET_KEEPCAPS, 1) before the setfsuid, I still lose the
CAP_FS_MASK perms.  I see the special case handling in
cap_task_post_setuid().  I'm sure there was a reason for it, but
this is a piece of the capability implementation I don't understand
right now.

I would send in a patch to make it honor current->keep_capabilities,
but I have a feeling there was a good reason not to do so in the
first place.
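
For completeness, setting and reading back the keep-capabilities flag
looks roughly like this.  The helper name enable_keepcaps is
hypothetical.  Note that, as described above, the flag does not stop
setfsuid() from dropping the CAP_FS_MASK bits; the sketch only shows the
prctl side.

```c
#include <sys/prctl.h>

/*
 * Hypothetical helper: turn on the keep-capabilities flag and return
 * its value as reported by the kernel (1 when set), or -1 on error.
 */
int enable_keepcaps(void)
{
    if (prctl(PR_SET_KEEPCAPS, 1L, 0L, 0L, 0L) != 0)
        return -1;
    return prctl(PR_GET_KEEPCAPS, 0L, 0L, 0L, 0L);
}
```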

> (fsuid is
> the equivalent to euid for the filesystem.)

If it were really the equivalent then I could keep my capabilities :)
after changing it.

> I fail to see how ruid should have *any* impact on mount(2).  That seems
> to be a design flaw.

May be, but just using fsuid at this point stops me from enabling user
mounts under /share if /share is chmod 000 (which it is).

thanks,
-serge


Re: [patch 0/8] mount ownership and unprivileged mount syscall (v4)

2007-04-20 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> This patchset has now been pared down to the "lowest common denominator"
> that everybody can agree on.  Or at least there weren't any objections
> to this proposal.
> 
> Andrew, please consider it for -mm.
> 
> Thanks,
> Miklos
> 
> 
> v3 -> v4:
> 
>  - simplify interface as much as possible, now only a single option
>    ("user=UID") is used to control everything
>  - no longer allow/deny mounting based on file/directory permissions,
>    that approach does not always make sense
> 
> 
> This patchset adds support for keeping mount ownership information in
> the kernel, and allows unprivileged mount(2) and umount(2) in certain
> cases.
> 
> The mount owner has the following privileges:
> 
>   - unmount the owned mount
>   - create a submount under the owned mount
> 
> The sysadmin can set the owner explicitly on mount and remount.  When
> an unprivileged user creates a mount, then the owner is automatically
> set to the user.
> 
> The following use cases are envisioned:
> 
> 1) Private namespace, with selected mounts owned by user.
>    E.g. /home/$USER is a good candidate for allowing unpriv mounts and
>    unmounts within.
> 
> 2) Private namespace, with all mounts owned by user and having the
>    "nosuid" flag.  User can mount and umount anywhere within the
>    namespace, but suid programs will not work.
> 
> 3) Global namespace, with a designated directory, which is a mount
>    owned by the user.  E.g. /mnt/users/$USER is set up so that it is
>    bind mounted onto itself, and set to be owned by $USER.  The user
>    can add/remove mounts only under this directory.
> 
> The following extra security measures are taken for unprivileged
> mounts:
> 
>  - usermounts are limited by a sysctl tunable
>  - force "nosuid,nodev" mount options on the created mount
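
The two extra security measures listed above can be sketched in user
space as follows.  The struct usermount_limit and the function
unpriv_mount_prepare are hypothetical stand-ins for the sysctl-limited
counter and for the kernel-side flag handling; only MS_NOSUID and
MS_NODEV are real mount(2) flags.

```c
#include <sys/mount.h>

/* Hypothetical stand-in for the sysctl-controlled user mount limit. */
struct usermount_limit {
    int count; /* user mounts currently in place */
    int max;   /* tunable upper bound */
};

/*
 * Prepare an unprivileged mount: refuse it when the limit is reached,
 * otherwise account for it and force nosuid,nodev onto the flags.
 * Returns 0 on success, -1 if the limit would be exceeded.
 */
int unpriv_mount_prepare(struct usermount_limit *lim, unsigned long *flags)
{
    if (lim->count >= lim->max)
        return -1;

    lim->count++;
    *flags |= MS_NOSUID | MS_NODEV;
    return 0;
}
```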

Very nice.  I like these semantics.

I'll try to rework my laptop in the next few days to use this patchset
as a test.

thanks,
-serge


Re: [RFC][PATCH 5/15] Introduce union stack

2007-04-17 Thread Serge E. Hallyn
Quoting Bharata B Rao ([EMAIL PROTECTED]):
> From: Jan Blunck <[EMAIL PROTECTED]>
> Subject: Introduce union stack.
> 
> Adds union stack infrastructure to the dentry structure and provides
> locking routines to walk the union stack.
> 
> Signed-off-by: Jan Blunck <[EMAIL PROTECTED]>
> Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
> ---
>  fs/Makefile                  |    2 
>  fs/dcache.c                  |    5 
>  fs/union.c                   |   53 +
>  include/linux/dcache.h       |   11 +
>  include/linux/dcache_union.h |  243 +++
>  5 files changed, 314 insertions(+)
> 
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -49,6 +49,8 @@ obj-$(CONFIG_FS_POSIX_ACL)  += posix_acl.
>  obj-$(CONFIG_NFS_COMMON) += nfs_common/
>  obj-$(CONFIG_GENERIC_ACL)+= generic_acl.o
> 
> +obj-$(CONFIG_UNION_MOUNT)+= union.o
> +
>  obj-$(CONFIG_QUOTA)  += dquot.o
>  obj-$(CONFIG_QFMT_V1)+= quota_v1.o
>  obj-$(CONFIG_QFMT_V2)+= quota_v2.o
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -936,6 +936,11 @@ struct dentry *d_alloc(struct dentry * p
>  #ifdef CONFIG_PROFILING
>   dentry->d_cookie = NULL;
>  #endif
> +#ifdef CONFIG_UNION_MOUNT
> + dentry->d_overlaid = NULL;
> + dentry->d_topmost = NULL;
> + dentry->d_union = NULL;
> +#endif
>   INIT_HLIST_NODE(&dentry->d_hash);
>   INIT_LIST_HEAD(&dentry->d_lru);
>   INIT_LIST_HEAD(&dentry->d_subdirs);
> --- /dev/null
> +++ b/fs/union.c
> @@ -0,0 +1,53 @@
> +/*
> + * VFS based union mount for Linux
> + *
> + * Copyright © 2004-2007 IBM Corporation
> + *   Author(s): Jan Blunck ([EMAIL PROTECTED])
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the Free
> + * Software Foundation; either version 2 of the License, or (at your option)
> + * any later version.
> + */
> +
> +#include 
> +
> +struct union_info * union_alloc(void)
> +{
> + struct union_info *info;
> +
> + info = kmalloc(sizeof(*info), GFP_ATOMIC);
> + if (!info)
> + return NULL;
> +
> + mutex_init(&info->u_mutex);
> + mutex_lock(&info->u_mutex);
> + atomic_set(&info->u_count, 1);
> + UM_DEBUG_LOCK("allocate union %p\n", info);
> + return info;
> +}
> +
> +struct union_info * union_get(struct union_info *info)
> +{
> + BUG_ON(!info);
> + BUG_ON(!atomic_read(&info->u_count));
> + atomic_inc(&info->u_count);
> + UM_DEBUG_LOCK("get union %p (count=%d)\n", info,
> +   atomic_read(&info->u_count));
> + return info;
> +}

The locking here needs to be laid out.  It looks like union_get() needs
to be called under union_lock(), while union_get2() (horrible name)
grabs that lock itself, and returns with the lock held?

Similarly union_put clearly needs to be called under union_lock(), so
that should be commented here.

> +void union_put(struct union_info *info)
> +{
> +	BUG_ON(!info);
> +	UM_DEBUG_LOCK("put union %p (count=%d)\n", info,
> +		      atomic_read(&info->u_count));
> +	atomic_dec(&info->u_count);
> +
> +	if (!atomic_read(&info->u_count)) {
> +		UM_DEBUG_LOCK("free union %p\n", info);
> +		kfree(info);
> +	}
> +
> +	return;
> +}
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -93,6 +93,12 @@ struct dentry {
>   struct dentry *d_parent;/* parent directory */
>   struct qstr d_name;
> 
> +#ifdef CONFIG_UNION_MOUNT
> + struct dentry *d_overlaid;  /* overlaid directory */
> + struct dentry *d_topmost;   /* topmost directory */
> + struct union_info *d_union; /* union directory info */
> +#endif
> +
>   struct list_head d_lru; /* LRU list */
>   /*
>* d_child and d_rcu can share memory
> @@ -325,6 +331,11 @@ static inline struct dentry *dget(struct
>   return dentry;
>  }
> 
> +/*
> + * Reference counting for union mounts
> + */
> +#include 
> +
>  extern struct dentry * dget_locked(struct dentry *);
> 
>  /**
> --- /dev/null
> +++ b/include/linux/dcache_union.h
> @@ -0,0 +1,243 @@
> +/*
> + * VFS based union mount for Linux
> + *
> + * Copyright © 2004-2007 IBM Corporation
> + *   Author(s): Jan Blunck ([EMAIL PROTECTED])
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the Free
> + * Software Foundation; either version 2 of the License, or (at your option)
> + * any later version.
> + *
> + */
> +#ifndef __LINUX_DCACHE_UNION_H
> +#define __LINUX_DCACHE_UNION_H
> +#ifdef __KERNEL__
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#ifdef CONFIG_UNION_MOUNT
> +
> +/*
> + * This is the union info object, that describes general information about this
> + * union directory
> + *
> + * u_mutex protects the union stack against modification. You can reach it
> 

Re: [Devel] Re: [patch 05/10] add "permit user mounts in new namespace" clone flag

2007-04-17 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > I'm a bit lost about what is currently done and who advocates for what.
> > 
> > It seems to me the MNT_ALLOWUSERMNT (or whatever :) flag should be
> > propagated.  In the /share rbind+chroot example, I assume the admin
> > would start by doing
> > 
> > mount --bind /share /share
> > mount --make-slave /share
> > mount --bind -o allow_user_mounts /share (or whatever)
> > mount --make-shared /share
> > 
> > then on login, pam does
> > 
> > chroot /share/$USER
> > 
> > or some sort of
> > 
> > mount --bind /share /home/$USER/root
> > chroot /home/$USER/root
> > 
> > or whatever.  In any case, the user cannot make user mounts except under
> > /share, and any cloned namespaces will still allow user mounts.
> 
> I don't quite understand your method.  This is how I think of it:
> 
> mount --make-rshared /
> mkdir -p /mnt/ns/$USER
> mount --rbind / /mnt/ns/$USER
> mount --make-rslave /mnt/ns/$USER

This was my main point - that the tree in which users can mount will be
a slave of /, so that propagating the "are user mounts allowed" flag
among peers is safe and intuitive.
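
To illustrate why I think that direction is safe, here is a toy userspace
model, not kernel code.  It assumes the flag would follow the same path as
mount events: to peers and down to slaves, never from a slave back up to
its master.  The struct, the group numbering and the allow_user_mnt field
are all made up for the sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model of a propagation tree.  Each mount belongs to a peer group;
 * a group may have one master group from which it receives events (it is
 * then a slave of that group).
 */
#define NO_MASTER (-1)

struct toy_mount {
	int group;		/* peer group id */
	int master;		/* master group id, or NO_MASTER */
	bool allow_user_mnt;	/* hypothetical "user mounts allowed" flag */
};

/* Is group 'g' equal to, or a (transitive) slave of, group 'src'? */
static bool receives_from(const struct toy_mount *m, size_t n, int g, int src)
{
	/* Walk up the master chain; the hop limit guards against cycles. */
	for (size_t hops = 0; hops <= n; hops++) {
		int master = NO_MASTER;

		if (g == src)
			return true;
		for (size_t i = 0; i < n; i++) {
			if (m[i].group == g) {
				master = m[i].master;
				break;
			}
		}
		if (master == NO_MASTER)
			return false;
		g = master;
	}
	return false;
}

/* Set the flag on 'target' and on every mount that receives from it. */
void set_allow_user_mnt(struct toy_mount *m, size_t n, size_t target)
{
	assert(target < n);
	int src = m[target].group;

	for (size_t i = 0; i < n; i++)
		if (receives_from(m, n, m[i].group, src))
			m[i].allow_user_mnt = true;
}
```

With / in one group and /mnt/ns/$USER (plus any clone of it) in a slave
group, setting the flag on the slave side reaches its peers but leaves /
untouched.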

> mount --set-flags --recursive -oallowusermnt /mnt/ns/$USER
> chroot /mnt/ns/$USER
> su - $USER
> 
> I did actually try something equivalent (without the fancy mount
> commands though), and it worked fine.  The only "problem" is the
> proliferation of mounts in /proc/mounts.  There was a recently posted
> patch in AppArmor, that at least hides unreachable mounts from
> /proc/mounts, so the user wouldn't see all those.  But it could still
> be pretty confusing to the sysadmin.
> 
> So in that sense doing it the complicated way, by first cloning the
> namespace, and then copying and sharing mounts individually which need
> to be shared could relieve this somewhat.

True.  But the kernel functionality you provide enables both ways so no
problem in either case :)

> Another point: user mounts under /proc and /sys shouldn't be allowed.
> There are files there (at least in /proc) that are seemingly writable
> by the user, but they are still not writable in the sense, that
> "normal" files are.

Good point.

> Anyway, there are lots of userspace policy issues, but those don't
> impact the kernel part.

Though it might make sense to enforce /proc and /sys not allowing user
mounts under them in the kernel.

> As for the original question of propagating the "allowusermnt" flag, I
> think it doesn't matter, as long as it's consistent and documented.
> 
> Propagating some mount flags and not propagating others is
> inconsistent and confusing, so I wouldn't want that.  Currently
> remount doesn't propagate mount flags, that may be a bug, dunno.

Dave, any thoughts on safety of propagating the vfsmount read-only
flags?

-serge
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Devel] Re: [patch 05/10] add "permit user mounts in new namespace" clone flag

2007-04-17 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > > > > Also for bind-mount and remount operations the flag has to be
> > > > > > propagated down its propagation tree.  Otherwise a unpriviledged
> > > > > > mount in a shared mount wont get reflected in its peers and
> > > > > > slaves, leading to unidentical shared-subtrees.
> > > > > 
> > > > > That's an interesting question.  Do we want shared mounts to be
> > > > > totally identical, including mnt_flags?  It doesn't look as if
> > > > > do_remount() guarantees that currently.
> > > > 
> > > > Depends on the semantics of each of the flags. Some flags like of the
> > > > read/write flag, would not interfere with the propagation semantics
> > > > AFAICT.  But this one certainly seems to interfere.
> > > 
> > > That depends.  Current patches check the "unprivileged submounts
> > > allowed under this mount" flag only on the requested mount and not on
> > > the propagated mounts.  Do you see a problem with this?
> > 
> > Don't see a problem if the flag is propagated to all peers and slave
> > mounts. 
> > 
> > If not, I see a problem. What if the propagated mount has its flag set
> > to not do un-priviledged mounts, whereas the requested mount has it
> > allowed?
> 
> Then the mount is allowed.
> 
> It is up to the sysadmin/distro to design set up the propagations in a
> way that this is not a problem.
> 
> I think it would be much less clear conceptually, if unprivileged
> mounting would have to check propagations as well.
> 
> Miklos

I'm a bit lost about what is currently done and who advocates for what.

It seems to me the MNT_ALLOWUSERMNT (or whatever :) flag should be
propagated.  In the /share rbind+chroot example, I assume the admin
would start by doing

mount --bind /share /share
mount --make-slave /share
mount --bind -o allow_user_mounts /share (or whatever)
mount --make-shared /share

then on login, pam does

chroot /share/$USER

or some sort of

mount --bind /share /home/$USER/root
chroot /home/$USER/root

or whatever.  In any case, the user cannot make user mounts except under
/share, and any cloned namespaces will still allow user mounts.

Or are you guys talking about something else?

-serge


Re: [patch 05/10] Add "permit user submounts" flag to vfsmount

2007-04-17 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > > MNT_USER and MNT_USERMNT?  I claim no way will people keep those
> > > > straight.  How about MNT_ALLOWUSER and MNT_USER?
> > > 
> > > Umm, is "allowuser" more clear than "usermnt"?  What is allowed to the
> > 
> > I think so, yes.  One makes it clear that we're talking about allowing
> > user (somethings :), one might just as well mean "this is a user mount."
> > 
> > > user?  "allowusermnt" may be more descriptive, but it's a bit too
> > > long.
> > 
> > Yes, if it weren't too long it would by far have been my preference.
> > Maybe despite the length we should still go with it...
> > 
> > > I don't think it matters all that much, the user will have to look up
> > > the semantics in the manpage anyway.  Is "nosuid" descriptive?  Not
> > > very much, but we got used to it.
> > 
> > nosuid is quite clear.
> 
> Is it?  Shouldn't these be "allowsuid", "noallowsuid", "allowexec",
> "noallowexec"?
> 
> See, we mentally add the "allow" quite easily.

But they aren't accompanied by a flag meaning "don't allow any
non-nosuid mounts below this point".  *That* is what causes the problem
here.

> > MNT_USER and MNT_USERMNT are so confusing that in the time I go from
> > quitting the manpage to foregrounding my editor, I may have already
> > forgotten which was which.
> 
> Well, to the user they are always in the form "user=123" and
> "usermnt", so they are not as easy to confuse.

It still makes the kernel code harder to read, but for the user yes that
is helpful.

> But I feel a bit stupid bickering about this, because it isn't so
> important.  "allowuser" or "allowusermnt" are fine by me if you think
> they are substantially better than "usermnt".

Thanks, I really really do  :)

-serge


Re: [Devel] Re: [patch 05/10] add "permit user mounts in new namespace" clone flag

2007-04-17 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
> "Serge E. Hallyn" <[EMAIL PROTECTED]> writes:
> >> 
> >> Why are directory permissions not sufficient to allow/deny non-priveleged
> >> mounts?
> >> I don't understand that contention yet.
> >
> > The same scenarios laid out previously in this thread.  I.e.
> >
> > 1. user hallyn does mount --bind / /home/hallyn/root
> > 2. (...)
> > 3. admin does "deluser hallyn"
> >
> > and deluser starts wiping out root
> >
> > Or,
> >
> > 1. user hallyn does mount --bind / /home/hallyn/root
> > 2. backup daemon starts backing up /home/hallyn/root/home/hallyn/root/home...
> >
> > So we started down the path of forcing users to clone a new namespace
> > before doing user mounts, which is what the clone flag was about.  Using
> > per-mount flags also suffices as you had pointed out, which is being
> > done here.  But directory permissions are inadequate.
> 
> Interesting
> 
> So far even today these things can happen, however they are sufficiently
> unlikely the tools don't account for them.
> 
> Once a hostile user can cause them things are more of a problem.
> 
> > (Unless you want to tackle each problem legacy tool one at a time to
> > remove problems - i.e. deluser should umount everything under
> > /home/hallyn before deleting, backup should be spawned from it's own
> > namespace cloned right after boot or just back up on one filesystem,
> > etc.)
> 
> I don't see a way that backup and deluser won't need to be modified
> to work properly in a system where non-priveleged mounts are allowed,
> at least they will need to account for /share.

Yes, all the tools need to avoid /share.  Though at least it's a single
location we can avoid, and it is purely a system configuration issue,
whereas fixing deluser to watch for user mounts under /home involves (I
assume) rewriting a part of it.

> That said it is clearly a hazard if we enable this functionality by
> default.
> 
> If we setup a pam module that triggers on login and perhaps when
> cron and at jobs run to setup an additional mount namespace I think
> keeping applications locked away in their own mount namespace is
> sufficient to avoid hostile users from doing unexpected things to
> the initial mount namespace.  So unless I am mistake it should be
> relatively simple to prevent user space from encountering problems.
> 
> That still leaves the question of how we handle systems with an old
> user space that is insufficiently robust to deal with mounts occurring
> at unexpected locations.
> 
> 
>   I think a simple sysctl to enable/disable of non-priveleged mounts 
>   defaulting to disabled is enough.
> 
> Am I correct or will it be more difficult than just a little pam
> module to ensure non-trusted users never run in the initial mount
> namespace?

The danger with relying on the pam module is that you have to plug it in
all the right places.  For instance, if we're talking about malicious
users, now we have to start worrying about an ftp daemon with user login
that isn't using pam, and happens to have an exploitable bug.

So it seems to me the per-mount flag you suggested really is the best
solution.  Now the pam module is still needed, but only to set things up
so that the user *can* do user mounts.  If there's a way to log in
bypassing the pam module, then the user simply won't be able to do user
mounts anywhere but under /share, and as Miklos suggested the perms on
/share can probably be set to 000.

-serge


Re: [patch 05/10] Add "permit user submounts" flag to vfsmount

2007-04-17 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > From: Miklos Szeredi <[EMAIL PROTECTED]>
> > > 
> > > If MNT_USERMNT flag is not set in the target vfsmount, then
> > 
> > MNT_USER and MNT_USERMNT?  I claim no way will people keep those
> > straight.  How about MNT_ALLOWUSER and MNT_USER?
> 
> Umm, is "allowuser" more clear than "usermnt"?  What is allowed to the

I think so, yes.  One makes it clear that we're talking about allowing
user (somethings :), one might just as well mean "this is a user mount."

> user?  "allowusermnt" may be more descriptive, but it's a bit too
> long.

Yes, if it weren't too long it would by far have been my preference.
Maybe despite the length we should still go with it...

> I don't think it matters all that much, the user will have to look up
> the semantics in the manpage anyway.  Is "nosuid" descriptive?  Not
> very much, but we got used to it.

nosuid is quite clear.  MNT_USER and MNT_USERMNT are so confusing that
in the time I go from quitting the manpage to foregrounding my editor, I
may have already forgotten which was which.

-serge


Re: [Devel] Re: [patch 05/10] add "permit user mounts in new namespace" clone flag

2007-04-17 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
> "Serge E. Hallyn" <[EMAIL PROTECTED]> writes:
> >> 
> >> Why are directory permissions not sufficient to allow/deny non-priveleged
> >> mounts?
> >> I don't understand that contention yet.
> >
> > The same scenarios laid out previously in this thread.  I.e.
> >
> > 1. user hallyn does mount --bind / /home/hallyn/root
> > 2. (...)
> > 3. admin does "deluser hallyn"
> >
> > and deluser starts wiping out root
> >
> > Or,
> >
> > 1. user hallyn does mount --bind / /home/hallyn/root
> > 2. backup daemon starts backing up /home/hallyn/root/home/hallyn/root/home...
> >
> > So we started down the path of forcing users to clone a new namespace
> > before doing user mounts, which is what the clone flag was about.  Using
> > per-mount flags also suffices as you had pointed out, which is being
> > done here.  But directory permissions are inadequate.
> 
> Interesting
> 
> So far even today these things can happen, however they are sufficiently
> unlikely the tools don't account for them.
> 
> Once a hostile user can cause them things are more of a problem.
> 
> > (Unless you want to tackle each problem legacy tool one at a time to
> > remove problems - i.e. deluser should umount everything under
> > /home/hallyn before deleting, backup should be spawned from it's own
> > namespace cloned right after boot or just back up on one filesystem,
> > etc.)
> 
> I don't see a way that backup and deluser won't need to be modified
> to work properly in a system where non-priveleged mounts are allowed,
> at least they will need to account for /share.
> 
> That said it is clearly a hazard if we enable this functionality by
> default.
> 
> If we setup a pam module that triggers on login and perhaps when
> cron and at jobs run to setup an additional mount namespace I think
> keeping applications locked away in their own mount namespace is
> sufficient to avoid hostile users from doing unexpected things to
> the initial mount namespace.  So unless I am mistake it should be
> relatively simple to prevent user space from encountering problems.
> 
> That still leaves the question of how we handle systems with an old
> user space that is insufficiently robust to deal with mounts occurring
> at unexpected locations.
> 
> 
>   I think a simple sysctl to enable/disable of non-priveleged mounts 
>   defaulting to disabled is enough.

There is a sysctl for max_user_mounts which can be set to 0.

So a separate on/off sysctl is not strictly necessary.  But given that
admins might wonder whether 0 means infinite :), and since I agree an
explicit on/off switch is important, a second one wouldn't hurt.
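
Roughly what I have in mind, as a userspace sketch rather than a patch.
Only max_user_mounts comes from the existing patchset; the struct, the
"enabled" switch and the function name are hypothetical, and I am assuming
0 means "no user mounts" rather than "unlimited":

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pair of knobs; only max_user_mounts exists in the thread. */
struct user_mount_policy {
	bool enabled;		/* hypothetical explicit on/off switch */
	unsigned int max;	/* max_user_mounts: assumed 0 == none allowed */
};

/* May one more user mount be created, given the current count? */
bool user_mount_allowed(const struct user_mount_policy *p,
			unsigned int nr_user_mounts)
{
	assert(p);
	return p->enabled && nr_user_mounts < p->max;
}
```

With both knobs, "off" is unambiguous no matter what value the limit
happens to hold.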

> Am I correct or will it be more difficult than just a little pam
> module to ensure non-trusted users never run in the initial mount
> namespace?
> 
> Eric


Re: [Devel] Re: [patch 05/10] add "permit user mounts in new namespace" clone flag

2007-04-16 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
> Miklos Szeredi <[EMAIL PROTECTED]> writes:
> 
> >> > That depends.  Current patches check the "unprivileged submounts
> >> > allowed under this mount" flag only on the requested mount and not on
> >> > the propagated mounts.  Do you see a problem with this?
> >> 
> >> I think privileges of this sort should propagate.  If I read what you
> >> just said correctly if I have a private mount namespace I won't be able
> >> to mount anything unless when it was setup the unprivileged submount
> >> command was explicitly set.
> >
> > By design yes.  Why is that a problem?
> 
> It certainly doesn't match my intuition.
> 
> Why are directory permissions not sufficient to allow/deny non-priveleged 
> mounts?
> I don't understand that contention yet.

The same scenarios laid out previously in this thread.  I.e.

1. user hallyn does mount --bind / /home/hallyn/root
2. (...)
3. admin does "deluser hallyn"

and deluser starts wiping out root

Or,

1. user hallyn does mount --bind / /home/hallyn/root
2. backup daemon starts backing up /home/hallyn/root/home/hallyn/root/home...

So we started down the path of forcing users to clone a new namespace
before doing user mounts, which is what the clone flag was about.  Using
per-mount flags also suffices as you had pointed out, which is being
done here.  But directory permissions are inadequate.

(Unless you want to tackle each problem legacy tool one at a time to
remove problems - i.e. deluser should umount everything under
/home/hallyn before deleting, backup should be spawned from its own
namespace cloned right after boot or just back up on one filesystem,
etc.)
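
As an example of what "fixing the tool" might involve: a device-number
check alone would not notice a bind mount of the same filesystem, since
st_dev stays the same.  A walker could instead remember the (st_dev,
st_ino) of each directory it is currently inside and refuse to descend
into one it has already entered.  A minimal sketch of just that check,
with made-up names and no actual directory walking:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Identity of a directory as stat(2) reports it. */
struct dir_id {
	dev_t dev;
	ino_t ino;
};

/*
 * Return true if 'candidate' is already on the stack of ancestor
 * directories, i.e. descending into it would revisit a directory the
 * walker is currently inside (as with "mount --bind / /home/hallyn/root").
 */
bool would_loop(const struct dir_id *ancestors, size_t depth,
		struct dir_id candidate)
{
	assert(ancestors || depth == 0);
	for (size_t i = 0; i < depth; i++)
		if (ancestors[i].dev == candidate.dev &&
		    ancestors[i].ino == candidate.ino)
			return true;
	return false;
}
```

That stops the infinite recursion in the backup case, but it does nothing
for deluser, which is part of why I'd rather not rely on per-tool fixes.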

-serge

> I should probably go back and look and see how plan9 handles mount/unmount
> permissions.  Plan9 gets away with a lot more because it doesn't have
> a suid bit and mount namespaces were always present, so they don't have
> backwards compatibility problems.
> 
> My best guess at the moment is that plan9 treated mount/unmount as
> completely unprivileged and used the mount namespaces to limit the
> scope of what would be affected by a mount/unmount operation.  I think
> that may be reasonable in linux as well but it will require the
> presence of a mount namespace to limit the affects of what a user can
> do.
> 
> So short of a more thorough audit I believe the final semantics should
> be: 
> - mount/unmount for non-priveleged processes should only be limited
>   by the mount namespace and directory permissions.
> - CLONE_NEWNS should not be a privileged operation. 
> 
> What prevents us from allowing these things?
> 
> - Unprivileged CLONE_NEWNS and unprivileged mounts needs resource
>   accounting so we don't have a denial of service attack.
> 
> - Unprivileged mounts must be limited to directories that we have
>   permission to modify in a way that we could get the same effect
>   as the mount or unmount operation in terms of what files are visible
>   otherwise we can mess up SUID executables.
> 
> - Anything else?
> 
> There are user space issues such as a reasonable pam module and how
> to do backups.  However those are user space issues.
> 
> What am I missing that requires us to add MNT_USER and MNT_USERMNT?
> 
> Eric


Re: [patch 05/10] Add "permit user submounts" flag to vfsmount

2007-04-16 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> If MNT_USERMNT flag is not set in the target vfsmount, then

MNT_USER and MNT_USERMNT?  I claim no way will people keep those
straight.  How about MNT_ALLOWUSER and MNT_USER?
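
To show what I mean about readability, here is how a pair of checks might
read with the names I'm suggesting.  The bit values are the ones in your
patch; the helper functions are made up for illustration, and I'm assuming
MNT_USER marks a mount that was made by an unprivileged user:

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values taken from the patch; the first name is the proposed one. */
#define MNT_ALLOWUSER	0x40	/* MNT_USERMNT in the patch */
#define MNT_USER	0x200	/* assumed: mount made by an unprivileged user */

static_assert((MNT_ALLOWUSER & MNT_USER) == 0, "flags must not overlap");

/* May an unprivileged user mount under a vfsmount with these flags? */
bool may_user_mount_under(int mnt_flags)
{
	return (mnt_flags & MNT_ALLOWUSER) != 0;
}

/* Was this vfsmount itself created by an unprivileged user? */
bool is_user_mount(int mnt_flags)
{
	return (mnt_flags & MNT_USER) != 0;
}
```

Reading the call sites, nobody has to stop and think about which of the
two flags grants permission and which one just records ownership.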

-serge

> unprivileged mounts will be denied.
> 
> By default this flag is cleared, and can be set on new mounts, on
> remounts or with the MS_SETFLAGS option.
> 
> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
> ---
> 
> Index: linux/fs/namespace.c
> ===
> --- linux.orig/fs/namespace.c 2007-04-13 13:20:12.0 +0200
> +++ linux/fs/namespace.c  2007-04-13 13:35:40.0 +0200
> @@ -411,6 +411,7 @@ static int show_vfsmnt(struct seq_file *
>  		{ MNT_NOATIME, ",noatime" },
>  		{ MNT_NODIRATIME, ",nodiratime" },
>  		{ MNT_RELATIME, ",relatime" },
> +		{ MNT_USERMNT, ",usermnt" },
>  		{ 0, NULL }
>  	};
>  	struct proc_fs_info *fs_infop;
> @@ -1505,9 +1506,11 @@ long do_mount(char *dev_name, char *dir_
>  		mnt_flags |= MNT_NODIRATIME;
>  	if (flags & MS_RELATIME)
>  		mnt_flags |= MNT_RELATIME;
> +	if (flags & MS_USERMNT)
> +		mnt_flags |= MNT_USERMNT;
> 
>  	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
> -		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME);
> +		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_USERMNT);
> 
>  	/* ... and get the mountpoint */
>  	retval = path_lookup(dir_name, LOOKUP_FOLLOW, &nd);
> Index: linux/include/linux/mount.h
> ===
> --- linux.orig/include/linux/mount.h  2007-04-13 13:17:08.0 +0200
> +++ linux/include/linux/mount.h   2007-04-13 13:22:17.0 +0200
> @@ -28,6 +28,7 @@ struct mnt_namespace;
>  #define MNT_NOATIME  0x08
>  #define MNT_NODIRATIME   0x10
>  #define MNT_RELATIME 0x20
> +#define MNT_USERMNT  0x40
> 
>  #define MNT_SHRINKABLE   0x100
>  #define MNT_USER 0x200
> Index: linux/include/linux/fs.h
> ===
> --- linux.orig/include/linux/fs.h 2007-04-13 13:23:05.0 +0200
> +++ linux/include/linux/fs.h  2007-04-13 13:35:34.0 +0200
> @@ -130,6 +130,7 @@ extern int dir_notify_enable;
>  #define MS_SETFLAGS	(1<<23) /* set specified mount flags */
>  #define MS_CLEARFLAGS	(1<<24) /* clear specified mount flags */
>  /* MS_SETFLAGS | MS_CLEARFLAGS: change mount flags to specified */
> +#define MS_USERMNT	(1<<25) /* permit unpriv. submounts under this mount */
>  #define MS_ACTIVE	(1<<30)
>  #define MS_NOUSER	(1<<31)
> 
> 
> --


Re: How to query mount propagation state?

2007-04-16 Thread Serge E. Hallyn
Quoting Ram Pai ([EMAIL PROTECTED]):
> On Mon, 2007-04-16 at 12:34 +0200, Miklos Szeredi wrote:
> > Currently one of the difficulties with mount propagations is that
> > there's no way to know the current state of the propagation tree.
> > 
> > Has anyone thought about how this info could be queried from
> > userspace?
> 
> I am attaching two patches that I had done way back in Oct 2006 
> with Al Viro. I had sent these patches to Al Viro. But I forgot to
> follow them up, I guess so did Al Viro.
> 
> The first patch disambiguates multiple mount-instances of the same
> filesystem (or part of the same filesystem), by introducing a new
> interface /proc/mounts_new. 
> 
> The second patch introduces a new proc interface that exposes all the
> propagation trees within a namespace.  It does not show propagated
> mounts residing in a different namespace (for privacy reasons). Maybe
> one could modify the patch a little, to allow it; if the user has
> root priviledges. 
> 
> RP
> 
> PS: Sorry these are attachments instead of inline patches. I am scared
> of inlining in evolution. If needed I can send inline patches through
> mutt.
> 
> > 
> > Thanks,
> > Miklos

> This patch disambiguates multiple mount-instances of the same
> filesystem (or part of the same filesystem), by introducing a new
> interface /proc/mounts_new. The interface has the following format.
> 
> 
> FSID  mntpt  root-dentry  fstype fs-options
> 
> 
> NOTE: root-dentry is the path to the dentry w.r.t to the root dentry of the
> same filesystem.
> 
> for example: lets say we attempt the following commands
> mount --bind /var /mnt
> mount --bind /mnt/tmp /tmp1
> 
> 'cat /proc/mounts' shows the following:
> /dev/root /mnt ext2 rw 0 0
> /dev/root /tmp1 ext2 rw 0 0
> 
> NOTE: The above mount entries, do not indicate that /tmp1 contains the same
> directory tree as /var/tmp.
> 
> But 'cat /proc/mounts_new' shows us the following:
> 0x6200 /mnt /var ext2 rw 0 0
> 0x6200 /tmp1 /var/tmp ext2 rw 0 0
> 
> The above entries clearly indicates that /var/tmp directory of the ext2
> filesystem with fsid=0x6200 is the directory tree that resides under /tmp1
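
For what it's worth, the format described above looks easy enough to
consume from userspace.  A minimal sketch of a parser for one line,
assuming exactly the field order "FSID mntpt root-dentry fstype
fs-options" shown in the example; the struct, the buffer sizes and the
function names are mine, not part of the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of the proposed /proc/mounts_new (hypothetical layout). */
struct mounts_new_entry {
	unsigned int fsid;
	char mntpt[256];
	char root[256];		/* path of the mount's root within its fs */
	char fstype[32];
	char opts[256];
};

/* Parse the first five fields.  Returns 0 on success, -1 if malformed. */
int parse_mounts_new_line(const char *line, struct mounts_new_entry *e)
{
	assert(line && e);
	int n = sscanf(line, "%x %255s %255s %31s %255s",
		       &e->fsid, e->mntpt, e->root, e->fstype, e->opts);

	return n == 5 ? 0 : -1;
}

/* Do two entries expose the same directory tree of the same filesystem? */
bool same_subtree(const struct mounts_new_entry *a,
		  const struct mounts_new_entry *b)
{
	return a->fsid == b->fsid && strcmp(a->root, b->root) == 0;
}
```

The same_subtree() comparison is exactly the question that plain
/proc/mounts cannot answer for the /mnt and /tmp1 example.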
> 
> Signed-off-by: Ram Pai <[EMAIL PROTECTED]>
> 
> ---
>  fs/dcache.c              |   53
>  fs/namespace.c           |   35
>  fs/proc/base.c           |   32
>  fs/proc/proc_misc.c      |    1
>  fs/seq_file.c            |   77
>  include/linux/dcache.h   |    1
>  include/linux/seq_file.h |    1
>  7 files changed, 172 insertions(+), 28 deletions(-)
> 
> Index: linux-2.6.17.10/fs/proc/base.c
> ===
> --- linux-2.6.17.10.orig/fs/proc/base.c
> +++ linux-2.6.17.10/fs/proc/base.c
> @@ -104,6 +104,7 @@ enum pid_directory_inos {
>   PROC_TGID_MAPS,
>   PROC_TGID_NUMA_MAPS,
>   PROC_TGID_MOUNTS,
> + PROC_TGID_MOUNTS_NEW,
>   PROC_TGID_MOUNTSTATS,
>   PROC_TGID_WCHAN,
>  #ifdef CONFIG_MMU
> @@ -145,6 +146,7 @@ enum pid_directory_inos {
>   PROC_TID_MAPS,
>   PROC_TID_NUMA_MAPS,
>   PROC_TID_MOUNTS,
> + PROC_TID_MOUNTS_NEW,
>   PROC_TID_MOUNTSTATS,
>   PROC_TID_WCHAN,
>  #ifdef CONFIG_MMU
> @@ -203,6 +205,7 @@ static struct pid_entry tgid_base_stuff[
>   E(PROC_TGID_ROOT,  "root",S_IFLNK|S_IRWXUGO),
>   E(PROC_TGID_EXE,   "exe", S_IFLNK|S_IRWXUGO),
>   E(PROC_TGID_MOUNTS,"mounts",  S_IFREG|S_IRUGO),
> + E(PROC_TGID_MOUNTS_NEW,"mounts_new",  S_IFREG|S_IRUGO),
>   E(PROC_TGID_MOUNTSTATS, "mountstats", S_IFREG|S_IRUSR),
>  #ifdef CONFIG_MMU
>   E(PROC_TGID_SMAPS, "smaps",   S_IFREG|S_IRUGO),
> @@ -246,6 +249,7 @@ static struct pid_entry tid_base_stuff[]
>   E(PROC_TID_ROOT,   "root",S_IFLNK|S_IRWXUGO),
>   E(PROC_TID_EXE,"exe", S_IFLNK|S_IRWXUGO),
>   E(PROC_TID_MOUNTS, "mounts",  S_IFREG|S_IRUGO),
> + E(PROC_TID_MOUNTS_NEW, "mounts_new",  S_IFREG|S_IRUGO),
>  #ifdef CONFIG_MMU
>   E(PROC_TID_SMAPS,  "smaps",   S_IFREG|S_IRUGO),
>  #endif
> @@ -692,13 +696,13 @@ static struct file_operations proc_smaps
>  };
>  #endif
> 
> -extern struct seq_operations mounts_op;
>  struct proc_mounts {
>   struct seq_file m;
>   int event;
>  };
> 
> -static int mounts_open(struct inode *inode, struct file *file)
> +static int __mounts_open(struct inode *inode, struct file *file,
> + struct seq_operations *mounts_op)
>  {
>   struct task_struct *task = proc_task(inode);
>   struct namespace *namespace;
> @@ -716,7 +720,7 @@ static int mounts_open(struct inode *ino
>   p = kmalloc(sizeof(struct proc_mounts), GFP_KERNEL);
>   if (p) {
>   file->private_data = &p->m;
> - ret = seq_open(fil

Re: [patch 0/8] unprivileged mount syscall

2007-04-15 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > > Agreed on desired behavior, but not on chroot sufficing.  It actually
> > > > sounds like you want exactly what was outlined in the OLS paper.
> > > > 
> > > > Users still need to be in a different mounts namespace from the admin
> > > > user so long as we consider the deluser and backup problems
> > > 
> > > I don't think it matters, because /share/$USER duplicates a part or
> > > the whole of the user's namespace.
> > > 
> > > So backup would have to be taught about /share anyway, and deluser
> > > operates on /home/$USER and not on /share/*, so there shouldn't be any
> > > problem.
> > 
> > In what I was thinking of, /share/$USER is bind mounted to
> > ~$USER/share, so it would have to be done in a private namespace in
> > order for deluser to not be tricked.
> 
> But /share/$USER is surely not bind mounted to ~$USER/share in the
> _global_ namespace, is it?  I can't see any sense in that.

No it's not, only in the private namespace.

> > > There's actually very little difference between rbind+chroot, and
> > > CLONE_NEWNS.  In a private namespace:
> > > 
> > >   1) when no more processes reference the namespace, the tree will be
> > > disbanded
> > > 
> > >   2) the mount tree won't be accessible from outside the namespace
> > 
> > But it *can* be, if properly set up.  That's part of the point of the
> > example in the OLS paper.  When a user logs in, sshd clones a new
> > namespace, then bind-mounts /share/$USER into ~$USER/share.  So assuming
> > that /share/$USER was --make-shared'd, it and ~$USER are now in the
> > same peer group, and any changes made by the user under ~$USER will
> > be reflected back into /share/$USER.
> 
> I acknowledge, that it can be done.  My point was that it can be done
> more simply _without_ using CLONE_NS.

Seems like a matter of preference, but I see what you're saying.

> > > Wanting a persistent namespace contradicts 1).
> > 
> > Not necessarily, see above.
> > 
> > > Wanting a per-user (as opposed to per-session) namespace contradicts
> > > 2).  The namespace _has_ to be accessible from outside, so that a new
> > > session can access/copy it.
> > 
> > Again, I *think* you are wrong that private namespace contradicts this
> > requirement.
> 
> I'm not saying there's any contradiction, I'm saying rbind+chroot is a
> better fit.

Ok, I see.

> I haven't yet heard a single reason why a per-session namespace with
> parts shared per-user is better than just a per-user namespace.

In fact I suspect we could show that they are functionally equivalent
(for your purposes) by drawing the fs tree and peer groups from
current->fs->root on up for both methods.

And not using private namespaces leaves the admin (at least for now)
better able to diagnose the state of the system.

-serge


Re: [patch 0/8] unprivileged mount syscall

2007-04-13 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > Thinking a bit more about this, I'm quite sure most users wouldn't
> > > even want private namespaces.  It would be enough to
> > > 
> > >   chroot /share/$USER
> > > 
> > > and be done with it.
> > > 
> > > Private namespaces are only good for keeping a bunch of mounts
> > > referenced by a group of processes.  But my guess is, that the natural
> > > behavior for users is to see a persistent set of mounts.
> > > 
> > > If for example they mount something on a remote machine, then log out
> > > from the ssh session and later log back in, they would want to see
> > > their previous mount still there.
> > > 
> > > Miklos
> > 
> > Agreed on desired behavior, but not on chroot sufficing.  It actually
> > sounds like you want exactly what was outlined in the OLS paper.
> > 
> > Users still need to be in a different mounts namespace from the admin
> > user so long as we consider the deluser and backup problems
> 
> I don't think it matters, because /share/$USER duplicates a part or
> the whole of the user's namespace.
> 
> So backup would have to be taught about /share anyway, and deluser
> operates on /home/$USER and not on /share/*, so there shouldn't be any
> problem.

In what I was thinking of, /share/$USER is bind mounted to
~$USER/share, so it would have to be done in a private namespace in
order for deluser to not be tricked.

> There's actually very little difference between rbind+chroot, and
> CLONE_NEWNS.  In a private namespace:
> 
>   1) when no more processes reference the namespace, the tree will be
> disbanded
> 
>   2) the mount tree won't be accessible from outside the namespace

But it *can* be, if properly set up.  That's part of the point of the
example in the OLS paper.  When a user logs in, sshd clones a new
namespace, then bind-mounts /share/$USER into ~$USER/share.  So assuming
that /share/$USER was --make-shared'd, it and ~$USER are now in the
same peer group, and any changes made by the user under ~$USER will
be reflected back into /share/$USER.
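
A rough sketch of how one could check that the two mounts really ended up
in the same peer group.  It assumes a kernel that provides
/proc/self/mountinfo (which did not exist when this thread was written),
where every shared mount carries a shared:N tag and mounts with the same N
are peers.  The helper name peer_group and the sample lines for a user
"alice" are made up for illustration.

```shell
#!/bin/sh
# peer_group MOUNTPOINT: read mountinfo-format lines on stdin and print the
# peer group id (the N in "shared:N") of each mount at MOUNTPOINT.
# Hypothetical helper, not part of any existing tool.
peer_group() {
    awk -v mp="$1" '
        $5 == mp {
            # Optional fields start at field 7 and end at the lone "-".
            for (i = 7; i <= NF && $i != "-"; i++)
                if ($i ~ /^shared:/)
                    print substr($i, 8)
        }'
}

# Made-up sample in mountinfo format; on a real system feed the function
# from /proc/self/mountinfo instead.
sample='36 25 8:1 / /share/alice rw shared:7 - ext4 /dev/sda1 rw
52 40 8:1 / /home/alice/share rw shared:7 - ext4 /dev/sda1 rw
60 40 0:31 / /tmp rw - tmpfs tmpfs rw'

printf '%s\n' "$sample" | peer_group /share/alice
printf '%s\n' "$sample" | peer_group /home/alice/share
```

If both calls print the same id, changes under one mount propagate to the
other; a private mount prints nothing.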

> Wanting a persistent namespace contradicts 1).

Not necessarily, see above.

> Wanting a per-user (as opposed to per-session) namespace contradicts
> 2).  The namespace _has_ to be accessible from outside, so that a new
> session can access/copy it.

Again, I *think* you are wrong that private namespace contradicts this
requirement.

> So both requirements point to the rbind/chroot solution.

It all points to a combination of the two  :-)

-serge


Re: [patch 05/10] add "permit user mounts in new namespace" clone flag

2007-04-13 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > Given the existence of shared subtrees allowing/denying this at the mount
> > namespace level is silly and wrong.
> > 
> > If we need more than just the filesystem permission checks can we
> > make it a mount flag settable with mount and remount that allows
> > non-privileged users the ability to create mount points under it
> > in directories they have full read/write access to.
> 
> OK, that makes sense.
> 
> > I don't like the use of clone flags for this purpose but in this
> > case the shared subtrees are a much more fundamental reason for not
> > doing this at the namespace level.
> 
> I'll drop the clone flag, and add a mount flag instead.
> 
> Thanks,
> Miklos

Makes sense, so then on login pam has to spawn a new user namespace and
construct a root fs with no shared subtrees and with the
user-mounts-allowed flag specified?

-serge


Re: [patch 0/8] unprivileged mount syscall

2007-04-13 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > On Wed, 2007-04-11 at 12:44 +0200, Miklos Szeredi wrote:
> > > > 1. clone the master namespace.
> > > > 
> > > > 2. in the new namespace
> > > > 
> > > > move the tree under /share/$me to /
> > > > for each ($user, $what, $how) {
> > > > move /share/$user/$what to /$what
> > > > if ($how == slave) {
> > > >  make the mount tree under /$what as slave
> > > > }
> > > > }
> > > > 
> > > > > 3. in the new namespace make the tree under
> > > > >    /share as private and unmount /share
> > > 
> > > Thanks.  I get the basic idea now: the namespace itself need not be
> > > shared between the sessions, it is enough if "share" propagation is
> > > set up between the different namespaces of a user.
> > > 
> > > I don't yet see either in your or Viro's description how the trees
> > > under /share/$USER are initialized.  I guess they are recursively
> > > bound from /, and are made slaves.
> > 
> > yes. I suppose, when a userid is created one of the steps would be
> > 
> > mount --rbind / /share/$USER
> > mount --make-rslave /share/$USER
> > mount --make-rshared /share/$USER
> 
> Thinking a bit more about this, I'm quite sure most users wouldn't
> even want private namespaces.  It would be enough to
> 
>   chroot /share/$USER
> 
> and be done with it.
> 
> Private namespaces are only good for keeping a bunch of mounts
> referenced by a group of processes.  But my guess is, that the natural
> behavior for users is to see a persistent set of mounts.
> 
> If for example they mount something on a remote machine, then log out
> from the ssh session and later log back in, they would want to see
> their previous mount still there.
> 
> Miklos

Agreed on desired behavior, but not on chroot sufficing.  It actually
sounds like you want exactly what was outlined in the OLS paper.

Users still need to be in a different mounts namespace from the admin
user so long as we consider the deluser and backup problems to be
legitimate problems (well, so long as user mounts are allowed).  So,
when they log in, pam gives them a new namespace and chroots them into
/share/$USER.
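
Roughly, and only as a sketch: the functions below print the commands
rather than running them, since the real thing needs root.  The three
mount lines are the ones quoted above; the helper names, the user name
"alice" and the use of unshare(1) from a current util-linux are
assumptions, not something that exists in pam today.

```shell
#!/bin/sh
# Dry run: print the commands instead of executing them (they need root).
# plan_user_tree and plan_login are hypothetical helper names.

# One-time setup when the account is created.
plan_user_tree() {
    echo "mount --rbind / /share/$1"
    echo "mount --make-rslave /share/$1"
    echo "mount --make-rshared /share/$1"
}

# What a login hook would do for each session: new mounts namespace, then
# chroot into the user's tree.  "--propagation unchanged" keeps unshare(1)
# from marking every mount private, which would cut the session off from
# the shared tree.
plan_login() {
    echo "unshare --mount --propagation unchanged chroot /share/$1 /bin/sh"
}

plan_user_tree alice
plan_login alice
```

Each session gets its own namespace, but because the tree under
/share/$USER stays shared, mounts made in one session show up in the next.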

Assuming I'm thinking clearly  :)

-serge


Re: [patch 05/10] add "permit user mounts in new namespace" clone flag

2007-04-12 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> If CLONE_NEWNS and CLONE_NEWNS_USERMNT are given to clone(2) or
> unshare(2), then allow user mounts within the new namespace.
> 
> This is not flexible enough, because user mounts can't be enabled for
> the initial namespace.
> 
> The remaining clone bits are also getting dangerously few...
> 
> Alternatives are:
> 
>   - prctl() flag
>   - setting through the containers filesystem

Sorry, I know I had mentioned it, but this is definitely my least
favorite approach.

Curious whether there are any other suggestions/opinions from the
containers list?

thanks,
-serge

> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
> ---
> 
> Index: linux/fs/namespace.c
> ===
> --- linux.orig/fs/namespace.c 2007-04-12 13:46:19.0 +0200
> +++ linux/fs/namespace.c  2007-04-12 13:54:36.0 +0200
> @@ -1617,6 +1617,8 @@ struct mnt_namespace *copy_mnt_ns(int fl
>   return ns;
> 
>   new_ns = dup_mnt_ns(ns, new_fs);
> +	if (new_ns && (flags & CLONE_NEWNS_USERMNT))
> +		new_ns->flags |= MNT_NS_PERMIT_USERMOUNTS;
> 
>   put_mnt_ns(ns);
>   return new_ns;
> Index: linux/include/linux/sched.h
> ===
> --- linux.orig/include/linux/sched.h  2007-04-12 13:26:48.0 +0200
> +++ linux/include/linux/sched.h   2007-04-12 13:54:36.0 +0200
> @@ -26,6 +26,7 @@
>  #define CLONE_STOPPED        0x02000000  /* Start in stopped state */
>  #define CLONE_NEWUTS         0x04000000  /* New utsname group? */
>  #define CLONE_NEWIPC         0x08000000  /* New ipcs */
> +#define CLONE_NEWNS_USERMNT  0x10000000  /* Allow user mounts in ns? */
> 
>  /*
>   * Scheduling policies
> Index: linux/kernel/fork.c
> ===
> --- linux.orig/kernel/fork.c  2007-04-11 18:27:46.0 +0200
> +++ linux/kernel/fork.c   2007-04-12 13:59:10.0 +0200
> @@ -1586,7 +1586,7 @@ asmlinkage long sys_unshare(unsigned lon
>   err = -EINVAL;
>   if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
>   CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
> - CLONE_NEWUTS|CLONE_NEWIPC))
> + CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNS_USERMNT))
>   goto bad_unshare_out;
> 
>   if ((err = unshare_thread(unshare_flags)))
> 


Re: [patch 0/8] unprivileged mount syscall

2007-04-11 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > Not objecting to prctl(), but two other options would be
> > 
> > 1. add a CLONE_NEW_NS_USERMNT flag - kind of ugly, but that is
> >the time at which the ns is created, so in that sense it
> >makes sense.
> 
> Yes, I thought about this, but there's no easy way to set the flag for
> the initial namespace, and a second flag CLONE_NEW_NS_NOUSERMNT would
> be needed to turn off the flag.

Not mentioning it would 'turn it off' for the cloned ns, but the default
value for the initial namespace is still a problem.

> > 2. use the nsproxy container subsystem (see Paul Menage's
> >containers patchset) to set this using, e.g.,
> > 
> > echo 1 > /containers/vserver1/mounts/usermount
> 
> That again would lose some flexibility: only namespaces which
> are part of a container could be manipulated.

In the nsproxy subsystem, every namespace gets a container so
long as the nsproxy subsystem is mounted.

> Does that exclude the
> initial namespace?

No, the initial namespace is tied to the root dentry - so if, as my
example was assuming, you've done

mount -t container -o ns none /containers

then to change the setting for the initial namespace you would

echo 0 > /containers/mounts/usermount

> Also how would a process find out which vserver it is running in?

cat /proc/$$/container

-serge


Re: [patch 0/8] unprivileged mount syscall

2007-04-11 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > It would be nice in general if we could avoid any sort of checks for
> > (mnt->mnt_ns == init_nsproxy.mnt_ns).  Maybe that won't be possible,
> > but, taking the two listed examples:
> 
> [snip]
> 
> It's probably worthwhile going after these problematic cases, and
> fixing them, OTOH it's not easy to audit a complete system for holes
> arising from user mounts in the global namespace.
> 
> So why not move this decision out from the kernel?  How about adding a
> boolean flag to namespaces, which specifies whether unprivileged
> mounts are allowed or not.  This would give complete flexibility to
> distro builders and sysadmins.
> 
> The biggest problem I see is how to set this flag.  There's no easy
> way to represent namespaces in /proc or /sys, and this is sufficiently
> obscure not to warrant a new syscall.  Adding a new flag to prctl()
> could do the trick.  Does that sound OK?

Not objecting to prctl(), but two other options would be

1. add a CLONE_NEW_NS_USERMNT flag - kind of ugly, but that is
   the time at which the ns is created, so in that sense it
   makes sense.
2. use the nsproxy container subsystem (see Paul Menage's
   containers patchset) to set this using, e.g.,

echo 1 > /containers/vserver1/mounts/usermount

The prctl() method has a huge advantage of being implementable right
now.

-serge


Re: [patch 0/8] unprivileged mount syscall

2007-04-11 Thread Serge E. Hallyn
Quoting Ian Kent ([EMAIL PROTECTED]):
> On Wed, 2007-04-11 at 09:26 -0500, Serge E. Hallyn wrote:
> > Quoting Ian Kent ([EMAIL PROTECTED]):
> > > On Wed, 2007-04-11 at 12:48 +0200, Miklos Szeredi wrote:
> > > > > > >>
> > > > > > >> - users can use bind mounts without having to pre-configure them in
> > > > > > >>   /etc/fstab
> > > > > > >>
> > > > > > 
> > > > > > This is by far the biggest concern I see.  I think the security 
> > > > > > implication of allowing anyone to do bind mounts are poorly 
> > > > > > understood.
> > > > > 
> > > > > And especially so since there is no way for a filesystem module to
> > > > > veto such requests.
> > > > 
> > > > The filesystem can't veto initial mounts based on destination either.
> > > > I don't think it's up to the filesystem to police bind/move mounts in
> > > > any way.
> > > 
> > > But if a filesystem can't or the developer thinks that it shouldn't for
> > > some reason, support bind/move mounts then there should be a way for the
> > 
> > Can you list some valid reasons why an fs could care where it is
> > mounted?  The only thing I could think of is a stackable fs, but it
> > shouldn't care whether it is overlay-mounted or not.
> 
> For my part, autofs and autofs4.

Ah, thanks.

I can see I'm going to have to start using autofs to get to know the
implementation, because it seems clear we'll run into it in the
containers work again (beyond the struct pid conv) at some point.

> Moving or binding isn't valid.
> I tried to design that limitation out of version 5 but wasn't able to.
> In time I probably can but couldn't continue to support older versions.

thanks,
-serge

> > 
> > thanks,
> > -serge
> > 
> > > filesystem to tell the kernel that.
> > > 
> > > Surely a filesystem is in a good position to be able to decide if a
> > > > mount request "for it" should be allowed to continue based on its "own
> > > situation and capabilities".
> > > 
> > > Ian
> > > 
> > > 
> > > 
> > > -
> > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" 
> > > in
> > > the body of a message to [EMAIL PROTECTED]
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 0/8] unprivileged mount syscall

2007-04-11 Thread Serge E. Hallyn
Quoting Ian Kent ([EMAIL PROTECTED]):
> On Wed, 2007-04-11 at 12:48 +0200, Miklos Szeredi wrote:
> > > > >>
> > > > >> - users can use bind mounts without having to pre-configure them in
> > > > >>   /etc/fstab
> > > > >>
> > > > 
> > > > This is by far the biggest concern I see.  I think the security 
> > > > implication of allowing anyone to do bind mounts are poorly understood.
> > > 
> > > And especially so since there is no way for a filesystem module to veto
> > > such requests.
> > 
> > The filesystem can't veto initial mounts based on destination either.
> > I don't think it's up to the filesystem to police bind/move mounts in
> > any way.
> 
> But if a filesystem can't or the developer thinks that it shouldn't for
> some reason, support bind/move mounts then there should be a way for the

Can you list some valid reasons why an fs could care where it is
mounted?  The only thing I could think of is a stackable fs, but it
shouldn't care whether it is overlay-mounted or not.

thanks,
-serge

> filesystem to tell the kernel that.
> 
> Surely a filesystem is in a good position to be able to decide if a
> mount request "for it" should be allowed to continue based on its "own
> situation and capabilities".
> 
> Ian


Re: [patch 0/8] unprivileged mount syscall

2007-04-09 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> This patchset adds support for keeping mount ownership information in
> the kernel, and allow unprivileged mount(2) and umount(2) in certain
> cases.

Well, I'd like to feel all smart and point out some bugs, but the code
all reads very nicely, seems to work as advertised, and while I won't
have ltp results until tomorrow, boot test results in so far are all
successful.

Looks good.

-serge

> This can be useful for the following reasons:
> 
> - mount(8) can store ownership ("user=XY" option) in the kernel
>   instead, or in addition to storing it in /etc/mtab.  For example if
>   private namespaces are used with mount propagations /etc/mtab
>   becomes unworkable, but using /proc/mounts works fine
> 
> - fuse won't need a special suid-root mount/umount utility.  Plain
>   umount(8) can easily be made to work with unprivileged fuse mounts
> 
> - users can use bind mounts without having to pre-configure them in
>   /etc/fstab
> 
> All this is done in a secure way, and unprivileged bind and fuse
> mounts are disabled by default and can be enabled through sysctl or
> /proc/sys.
> 
> One thing that is missing from this series is the ability to restrict
> user mounts to private namespaces.  The reason is that private
> namespaces have still not gained the momentum and support needed for
> painless user experience.  So such a feature would not yet get enough
> attention and testing.  However adding such an optional restriction
> can be done with minimal changes in the future, once private
> namespaces have matured.
> 
> An earlier version of these patches have been discussed here:
> 
>   http://lkml.org/lkml/2005/5/3/64


Re: [patch 0/8] unprivileged mount syscall

2007-04-09 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > One thing that is missing from this series is the ability to restrict
> > > user mounts to private namespaces.  The reason is that private
> > > namespaces have still not gained the momentum and support needed for
> > > painless user experience.  So such a feature would not yet get enough
> > > attention and testing.  However adding such an optional restriction
> > > can be done with minimal changes in the future, once private
> > > namespaces have matured.
> > 
> > What is the main reason for that feature?  Would it be to prevent things
> > like login from being tricked by user mounts?  Isn't it sufficient, in
> > fact, better, to require that the target of the mount be owned by the
> > user doing the mount?
> 
> It's been discussed later in that thread.  Basically you can fool a

I see now, sorry.

> lot of system programs (like backup) with mounting/binding in the
> global namespace.  Restricting the destination doesn't always help.
> 
> Miklos

It would be nice in general if we could avoid any sort of checks for
(mnt->mnt_ns == init_nsproxy.mnt_ns).  Maybe that won't be possible,
but, taking the two listed examples:

1. mount --bind / ~/bindns;  (later) userdel hallyn

I assume userdel does a simple stupid rm -rf without first umounting,
then?  So (1) it seems wise to have userdel umount anything under ~user
first anyway, and (2) if $USER does a mount --bind from a source he
doesn't own, should we make the resulting mount read-only?  (realizing
the read-only bind mount patches are still under development :)  Or is
that overly restrictive somehow for fuse?

2. backups

Is this just a 'he's going to fill up the whole disk' issue?  Frankly,
it seems wise to have cron or whatever is spawning the backup start in
its own namespace right at boot.  Generally when I think back on sites
where I've dealt with backup, backups were done on a separate server
which didn't allow user logins anyway, so it wouldn't be a problem.  But
I'm sure that's a limited (==erroneous) POV.
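
For point 1, a rough sketch of the "umount first" idea: list every mount
at or below the home directory, deepest first, so that each can be
unmounted before the rm -rf.  The helper name mounts_under and the sample
mount table are made up for illustration; a real tool would read
/proc/mounts and would also have to cope with escaped characters in paths.

```shell
#!/bin/sh
# mounts_under DIR: read /proc/mounts-format lines on stdin and print every
# mount point at or below DIR, deepest first, i.e. in a safe umount order.
# Hypothetical helper name.
mounts_under() {
    awk -v dir="$1" '$2 == dir || index($2, dir "/") == 1 { print $2 }' |
        LC_ALL=C sort -r
}

# Made-up sample in /proc/mounts format.
sample='/dev/sda1 / ext4 rw 0 0
/dev/sda1 /home/hallyn/bindns ext4 rw 0 0
proc /home/hallyn/bindns/proc proc rw 0 0
tmpfs /home/hallyn2 tmpfs rw 0 0'

# Print what would have to be unmounted before removing /home/hallyn.
printf '%s\n' "$sample" | mounts_under /home/hallyn
```

Reverse byte-order sorting is enough here because a parent mount point is
always a strict prefix of its children, so it sorts after them.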

I do realize that the real problem with corner cases isn't addressing
these two little ones, but the fact that there are more we haven't thought of.
So are there any currently known use cases where requiring a CLONE_NEWNS
before user mounts is unacceptable?

thanks,
-serge


Re: [patch 0/8] unprivileged mount syscall

2007-04-09 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> This patchset adds support for keeping mount ownership information in
> the kernel, and allow unprivileged mount(2) and umount(2) in certain
> cases.
> 
> This can be useful for the following reasons:
> 
> - mount(8) can store ownership ("user=XY" option) in the kernel
>   instead, or in addition to storing it in /etc/mtab.  For example if
>   private namespaces are used with mount propagations /etc/mtab
>   becomes unworkable, but using /proc/mounts works fine
> 
> - fuse won't need a special suid-root mount/umount utility.  Plain
>   umount(8) can easily be made to work with unprivileged fuse mounts
> 
> - users can use bind mounts without having to pre-configure them in
>   /etc/fstab
> 
> All this is done in a secure way, and unprivileged bind and fuse
> mounts are disabled by default and can be enabled through sysctl or
> /proc/sys.
> 
> One thing that is missing from this series is the ability to restrict
> user mounts to private namespaces.  The reason is that private
> namespaces have still not gained the momentum and support needed for
> painless user experience.  So such a feature would not yet get enough
> attention and testing.  However adding such an optional restriction
> can be done with minimal changes in the future, once private
> namespaces have matured.

What is the main reason for that feature?  Would it be to prevent things
like login from being tricked by user mounts?  Isn't it sufficient, in
fact, better, to require that the target of the mount be owned by the
user doing the mount?

-serge   (who's pretty sure he's missing something)

> An earlier version of these patches have been discussed here:
> 
>   http://lkml.org/lkml/2005/5/3/64


Re: [patch 0/8] unprivileged mount syscall

2007-04-09 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > > > One thing that is missing from this series is the ability to restrict
> > > > > user mounts to private namespaces.  The reason is that private
> > > > > namespaces have still not gained the momentum and support needed for
> > > > > painless user experience.  So such a feature would not yet get enough
> > > > > attention and testing.  However adding such an optional restriction
> > > > > can be done with minimal changes in the future, once private
> > > > > namespaces have matured.
> > > > 
> > > > I suspect the people who developed and maintain nsproxy would disagree 
> > > > ;)
> > > 
> > > Well, they better show me some working and simple-to-use userspace
> > > code, because I've not seen anything like that related to mount
> > > namespaces.
> > 
> > If you mean to test/exploit them, see
> > http://lxc.sourceforge.net/patches/2.6.20/2.6.20-lxc8/broken-out/tests/
> > 
> > Compile the ns_exec.c program and do
> > 
> > ns_exec -m /bin/sh
> > 
> > to get a shell in a new mounts namespace.
> 
> Cool, thanks.  This is a very nice utility for testing, but for the
> end user rather useless:

Well that depends on which end-user.  Those wanting to create a vserver
or checkpoint-restart job will want this, but clearly we have a long way
to go for that upstream anyway.

>   - user starts up a private namespace in a shell, mounts something
> 
>   - then opens app from menu, tries to access mount, but the mount is
> not there
> 
>   - user unhappy
> 
> BTW, looking at -mm unshare() on namespace is not privileged any more.
> Why is that?  Or rather, what's the reason, that clone() is privileged
> and unshare() is not?

The check is still there - see kernel/nsproxy.c:unshare_nsproxy_namespaces().

> > > pam_namespace.so is one example of a non-working, but probably-not-too-
> > > hard-to-fix one.
> > 
> > Non-working?  I sure hope the one used for LSPP certification is
> > working...  As is the ugly version I wrote 18 mounts ago and use on my
> > laptop.
> 
> The one in pam-0.99.6.3-29.1 in opensuse-10.2 is totally broken.  Are
> you interested in the details?  I can reproduce it, but forgot to note
> down the details of the brokenness.

I don't know how far removed that is from the one being used by redhat,
but assuming it's the same, then redhat-lspp@redhat.com will be
very interested.

> > > I'm just saying this is not yet something that Joe Blow would just
> > > enable by ticking a box in their desktop setup wizard, and it would
> > > all work flawlessly thereafter.  There's still a _long_ way towards
> > > that, and mostly in userspace.
> > 
> > I'm not sure there's that long a way to go, but clearly we need to be
> > showing users what they can do, or they'll never work their way towards
> > there.
> 
> There _is_ a long way to go.  Random things that spring to my mind:
> 
>  - using /etc/mtab is broken with private namespaces, using
>    /proc/mounts is missing various functionality, that /etc/mtab has,
>    for example the "user" option, which this patchset adds

Agreed those need fixing.

>  - need to set up mount propagation from global namespace to private
>    ones, mount(8) does not yet have options to configure propagation

Hmm, I guess I get lost using my own little systems, and just assumed
that shared subtree functionality was making its way up into mount(8).
Ram, have you been working on that?

>  - user namespace setup: what if user has multiple sessions?
> 
>    1) namespaces are shared?  That's tricky because the session needs to
>    be a child of a namespace server, not of login.  I'm not sure PAM
>    can handle this
> 
>    2) or mounts are copied on login?  That's not possible currently,
>    as there's no way to send a mount between namespaces.  Also it's
>    tricky to make sure that new mounts are also shared

See toward the end of the 'shared subtrees' OLS paper from last year for
a suggestion on how to let users effectively 'log in to' an existing
private mounts ns.

> > For instance, as you say, a user admin gui with a checkmark and text
> > boxes saying 'enter new namespace on login', 'create private /tmp',
> > and 'create private dmcrypted /home' would be trivial right now.
> 
> Trivial modulo the above slightly non-trivial exemptions ;)

Ok, so it can use some very non-trivial fine-tuning...

But I've been using the above - minus the trivial gui - for over a year
without ever worrying about any of these short-comings.

> Miklos

-serge


Re: [patch 0/8] unprivileged mount syscall

2007-04-09 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > This patchset adds support for keeping mount ownership information in
> > > the kernel, and allow unprivileged mount(2) and umount(2) in certain
> > > cases.
> > 
> > No replies, huh?
> 
> All we need is a comment from Andrew, and the replies come flooding in ;)
> 
> > My knowledge of the code which you're touching is not strong, and my spare
> > reviewing capacity is not high.  And this work does need close review by
> > people who are familiar with the code which you're changing.
> > 
> > So could I suggest that you go for a dig through the git history, identify
> > some individuals who look like they know this code, then do a resend,
> > cc'ing those people?  Please also cc linux-kernel on that resend.
> 
> OK.
> 
> > > One thing that is missing from this series is the ability to restrict
> > > user mounts to private namespaces.  The reason is that private
> > > namespaces have still not gained the momentum and support needed for
> > > painless user experience.  So such a feature would not yet get enough
> > > attention and testing.  However adding such an optional restriction
> > > can be done with minimal changes in the future, once private
> > > namespaces have matured.
> > 
> > I suspect the people who developed and maintain nsproxy would disagree ;)
> 
> Well, they better show me some working and simple-to-use userspace
> code, because I've not seen anything like that related to mount
> namespaces.

If you mean to test/exploit them, see
http://lxc.sourceforge.net/patches/2.6.20/2.6.20-lxc8/broken-out/tests/

Compile the ns_exec.c program and do

ns_exec -m /bin/sh

to get a shell in a new mounts namespace.

> pam_namespace.so is one example of a non-working, but probably-not-too-
> hard-to-fix one.

Non-working?  I sure hope the one used for LSPP certification is
working...  As is the ugly version I wrote 18 mounts ago and use on my
laptop.

> I'm just saying this is not yet something that Joe Blow would just
> enable by ticking a box in their desktop setup wizard, and it would
> all work flawlessly thereafter.  There's still a _long_ way towards
> that, and mostly in userspace.

I'm not sure there's that long a way to go, but clearly we need to be
showing users what they can do, or they'll never work their way towards
there.

For instance, as you say, a user admin gui with a checkmark and text
boxes saying 'enter new namespace on login', 'create private /tmp',
and 'create private dmcrypted /home' would be trivial right now.

-serge

