from:"Serge Hallyn"

Re: [PATCH] userns: Allow init_user_ns to be used from non-gpl modules

2016-05-23 Thread Serge Hallyn

Quoting Nikolay Borisov (n.borisov.l...@gmail.com):
> This patch changes the export attributes of the init_user_ns from
> GPL-only to any modules. This is needed so that non-gpl modules, such as
> ZFS, can utilize functions like i_(uid|gid)_(read|write).
> 
> Signed-off-by: Nikolay Borisov 

Seems reasonable to me,

Acked-by: Serge E. Hallyn 

but it seems clear the decision belongs to Eric.

thanks,
-serge

> ---
>  kernel/user.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/user.c b/kernel/user.c
> index b069ccbfb0b0..8bbd4e628b6e 100644
> --- a/kernel/user.c
> +++ b/kernel/user.c
> @@ -60,7 +60,7 @@ struct user_namespace init_user_ns = {
>   __RWSEM_INITIALIZER(init_user_ns.persistent_keyring_register_sem),
>  #endif
>  };
> -EXPORT_SYMBOL_GPL(init_user_ns);
> +EXPORT_SYMBOL(init_user_ns);
>  
>  /*
>   * UID task count cache, to get fast user lookup in "alloc_uid"
> -- 
> 2.7.4
>

Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2016-05-16 Thread Serge Hallyn

Hey James,

I probably did something wrong - but I applied your patch onto 4.6,
compiled in shiftfs, did

mount -t shiftfs -o uidmap=0:10:65536,gidmap=0:10:65536 /home/ubuntu /mnt

and ls segfaults and gives me kernel syslog msgs like:


[ 1089.744726] ===
[ 1089.748851] [ INFO: suspicious RCU usage. ]
[ 1089.752901] 4.6.0-rc5+ #10 Not tainted
[ 1089.756315] ---
[ 1089.760021] include/linux/rcupdate.h:569 Illegal context switch in RCU read-side critical section!
[ 1089.767348]
   other info that might help us debug this:

[ 1089.773401]
   rcu_scheduler_active = 1, debug_locks = 0
[ 1089.778417] 1 lock held by ls/3053:
[ 1089.781112]  #0:  (rcu_read_lock){..}, at: [] path_init+0x667/0x770
[ 1089.787492]
   stack backtrace:
[ 1089.790827] CPU: 0 PID: 3053 Comm: ls Not tainted 4.6.0-rc5+ #10
[ 1089.795304] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 1089.801376]  0286 5ed87b3e 88007a70bb10 
8145daa3
[ 1089.807098]  88007a688000 0001 88007a70bb40 
810e7587
[ 1089.812793]   81ca8baf 0184 
88007d08f640
[ 1089.818320] Call Trace:
[ 1089.820205]  [] dump_stack+0x85/0xc2
[ 1089.824046]  [] lockdep_rcu_suspicious+0xd7/0x110
[ 1089.828871]  [] ___might_sleep+0xa7/0x230
[ 1089.833024]  [] __might_sleep+0x49/0x80
[ 1089.837118]  [] kmem_cache_alloc+0x1d9/0x2d0
[ 1089.841725]  [] prepare_creds+0x3a/0x130
[ 1089.845827]  [] shiftfs_new_creds+0x17/0x120
[ 1089.850170]  [] shiftfs_permission+0x42/0xd0
[ 1089.854507]  [] __inode_permission+0x6b/0xb0
[ 1089.858925]  [] inode_permission+0x14/0x50
[ 1089.863190]  [] link_path_walk+0x7d/0x510
[ 1089.867454]  [] ? path_init+0x52b/0x770
[ 1089.871570]  [] ? path_init+0x667/0x770
[ 1089.875577]  [] path_lookupat+0x7c/0x110
[ 1089.879830]  [] filename_lookup+0xb1/0x180
[ 1089.883937]  [] ? getname_flags+0x56/0x1f0
[ 1089.888042]  [] ? rcu_read_lock_sched_held+0x6d/0x80
[ 1089.892841]  [] ? kmem_cache_alloc+0x263/0x2d0
[ 1089.897282]  [] ? getname_flags+0x72/0x1f0
[ 1089.901483]  [] user_path_at_empty+0x36/0x40
[ 1089.905768]  [] vfs_fstatat+0x66/0xc0
[ 1089.909596]  [] SYSC_newlstat+0x31/0x60
[ 1089.913616]  [] ? __might_fault+0x96/0xa0
[ 1089.917684]  [] ? __might_fault+0x4d/0xa0
[ 1089.922750]  [] ? trace_hardirqs_on_caller+0x129/0x1b0
[ 1089.928605]  [] ? trace_hardirqs_on_thunk+0x1b/0x1d
[ 1089.934347]  [] SyS_newlstat+0xe/0x10
[ 1089.939193]  [] entry_SYSCALL_64_fastpath+0x23/0xc1
[ 1089.945045] BUG: sleeping function called from invalid context at mm/slab.h:388
[ 1089.951474] in_atomic(): 1, irqs_disabled(): 0, pid: 3053, name: ls
[ 1089.957214] INFO: lockdep is turned off.
[ 1089.961166] CPU: 0 PID: 3053 Comm: ls Not tainted 4.6.0-rc5+ #10
[ 1089.966739] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 1089.973975]  0286 5ed87b3e 88007a70bb40 
8145daa3
[ 1089.980644]  88007a688000 81ca8baf 88007a70bb68 
810bb069
[ 1089.987297]  81ca8baf 0184  
88007a70bb90
[ 1089.994180] Call Trace:
[ 1089.997097]  [] dump_stack+0x85/0xc2
[ 1090.002051]  [] ___might_sleep+0x179/0x230
[ 1090.007255]  [] __might_sleep+0x49/0x80
[ 1090.012290]  [] kmem_cache_alloc+0x1d9/0x2d0
[ 1090.017679]  [] prepare_creds+0x3a/0x130
[ 1090.022736]  [] shiftfs_new_creds+0x17/0x120
[ 1090.028090]  [] shiftfs_permission+0x42/0xd0
[ 1090.033454]  [] __inode_permission+0x6b/0xb0
[ 1090.039006]  [] inode_permission+0x14/0x50
[ 1090.044304]  [] link_path_walk+0x7d/0x510
[ 1090.049593]  [] ? path_init+0x52b/0x770
[ 1090.054795]  [] ? path_init+0x667/0x770
[ 1090.059950]  [] path_lookupat+0x7c/0x110
[ 1090.065218]  [] filename_lookup+0xb1/0x180
[ 1090.070629]  [] ? getname_flags+0x56/0x1f0
[ 1090.076265]  [] ? rcu_read_lock_sched_held+0x6d/0x80
[ 1090.082559]  [] ? kmem_cache_alloc+0x263/0x2d0
[ 1090.088153]  [] ? getname_flags+0x72/0x1f0
[ 1090.093478]  [] user_path_at_empty+0x36/0x40
[ 1090.099164]  [] vfs_fstatat+0x66/0xc0
[ 1090.104236]  [] SYSC_newlstat+0x31/0x60
[ 1090.109449]  [] ? __might_fault+0x96/0xa0
[ 1090.115506]  [] ? __might_fault+0x4d/0xa0
[ 1090.120418]  [] ? trace_hardirqs_on_caller+0x129/0x1b0
[ 1090.126325]  [] ? trace_hardirqs_on_thunk+0x1b/0x1d
[ 1090.133230]  [] SyS_newlstat+0xe/0x10
[ 1090.138320]  [] entry_SYSCALL_64_fastpath+0x23/0xc1
[ 1090.146513] [ cut here ]
[ 1090.151061] kernel BUG at include/linux/fs.h:2574!
[ 1090.155883] invalid opcode:  [#1] SMP
[ 1090.160131] Modules linked in: binfmt_misc veth ip6t_MASQUERADE 
nf_nat_masquerade_ipv6 ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 
nf_nat_ipv6 ip6_tables xt_CHECKSUM iptable_mangle ipt_MASQUERADE 
nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 
nf_nat nf_conntrack xt_tcpudp bridge stp llc iptable_filter ip_t

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-09 Thread Serge Hallyn

Quoting Djalal Harouni (tix...@gmail.com):
> Hi,
> 
> On Wed, May 04, 2016 at 11:30:09PM +0000, Serge Hallyn wrote:
> > Quoting Djalal Harouni (tix...@gmail.com):
> > > This is version 2 of the VFS:userns support portable root filesystems
> > > RFC. Changes since version 1:
> > > 
> > > * Update documentation and remove some ambiguity about the feature.
> > >   Based on Josh Triplett comments.
> > > * Use a new email address to send the RFC :-)
> > > 
> > > 
> > > This RFC tries to explore how to support filesystem operations inside
> > > user namespace using only VFS and a per mount namespace solution. This
> > > allows to take advantage of user namespace separations without
> > > introducing any change at the filesystems level. All this is handled
> > > with the virtual view of mount namespaces.
> > 
> > Given your use case, is there any way we could work in some tradeoffs
> > to protect the host?  What I'm thinking is that containers can all
> > share devices uid-mapped at will, however any device mounted with
> > uid shifting cannot be used by the initial user namespace.  Or maybe
> > just non-executable in that case, as you'll need enough access to
> > the fs to set up the containers you want to run.
> > 
> > So if /dev/sda1 is your host /, you have to use /dev/sda2 as the
> > container rootfs source.  Mount it under /containers with uid
> > shifting.  Now all containers regardless of uid mappings see
> > the shifted fs contents.  But the host root cannot be tricked by
> > files on it, as /dev/sda2 is non-executable as far as it is
> > concerned.
> Of course the whole setup is based on the container manager to setup
> the right mount namespace, clean mounts, etc then pivot root, boot or
> whatever...
> 
> Now I guess we can achieve what you want with MS_SLAVE|MS_REC on / ?
> 
> You create a new mount/pid... namespaces with shift flags, but you are still
> in init_user_ns, you remount your / with MS_SLAVE|MS_REC, then you
> create new mount/pid namespaces with shift flag (two mount namespaces
> here if you don't want to race setting MS_SLAVE flag and creating mount
> namespace and you don't trust other processes... or you want the same nested
> setup...)
> 
> This second new secure mount namespace will be the one that you will use
> to setup the container, device nodes, loops...  fs that you want into the
> container (probably with shift options) and also filesystems that you can't
> mount inside user namespaces nor want them to show up or propagate into
> host, you may also want to umount stuff too or remount to change mount
> options too.., etc anyway here call it the cleaning of the mount namespace.
> 
> Now during this phase, when you mount and prepare these file systems,
> mount them with noexec flag first, then remount later with exec, or delay
> the mounting just before you do a new clone(CLONE_NEWUSER...). During this
> phase the container manager should get the device that you want to be
> shared from input or argument, and it will only mount it and prepare
> it inside new mount namespaces or containers and make sure that it will
> never be propagated back...
> 
> After clone(CLONE_NEWUSER|CLONE_NEWNS|CLONE_MNTNS_SHIFT_UIDGID), setup
> the user namespace mapping, I guess you drop capabilities, do setuid()
> or whatever and start the PID 1 or the app of the container.
> 
> Now and to not confuse more Dave, since he doesn't like the idea of
> a shared backing device, and me neither for obvious reasons! the shared
> device should not be used for a rootfs, maybe for read-only user shared
> data, or shared config, that's it... but for real rootfs they should have
> their own *different* backing device! unless you know what you are doing
> hehe I don't want to confuse people, and I just lack time, will also
> respond to Dave email.

Yes.  We're saying slightly different things.  You're saying that the admin
should assign different backing stores for containers.  I'm saying perhaps
the kernel should enforce that, because $leaks.  Let's say the host admin
did a perfect setup of a container with shifted uids.  Now he wants to
run a quick ps in the container...  he does it in a way that leaks a
/proc/pid reference into the container so that (evil) container root can
use /proc/pid/root/ to get a toehold into the host /.  Does he now have
shifted access to that?

I think if we say "this blockdev will have shifted uids in /proc/$pid/ns/user",
then immediately that blockdev becomes not-readable (or not-executable)
in any namespace which does not have /proc/$pid/ns/user as an ancestor.
With obvious check as in

Re: [PATCH 2/2] net: Use ns_capable_noaudit() when determining net sysctl permissions

2016-05-08 Thread Serge Hallyn

Quoting Tyler Hicks (tyhi...@canonical.com):
> The capability check should not be audited since it is only being used
> to determine the inode permissions. A failed check does not indicate a
> violation of security policy but, when an LSM is enabled, a denial audit
> message was being generated.
> 
> The denial audit message caused confusion for some application authors
> because root-running Go applications always triggered the denial. To
> prevent this confusion, the capability check in net_ctl_permissions() is
> switched to the noaudit variant.
> 
> BugLink: https://launchpad.net/bugs/1465724
> 
> Signed-off-by: Tyler Hicks 

Acked-by: Serge E. Hallyn 

> ---
>  net/sysctl_net.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sysctl_net.c b/net/sysctl_net.c
> index ed98c1f..46a71c7 100644
> --- a/net/sysctl_net.c
> +++ b/net/sysctl_net.c
> @@ -46,7 +46,7 @@ static int net_ctl_permissions(struct ctl_table_header *head,
>   kgid_t root_gid = make_kgid(net->user_ns, 0);
>  
>   /* Allow network administrator to have same access as root. */
> - if (ns_capable(net->user_ns, CAP_NET_ADMIN) ||
> + if (ns_capable_noaudit(net->user_ns, CAP_NET_ADMIN) ||
>   uid_eq(root_uid, current_euid())) {
>   int mode = (table->mode >> 6) & 7;
>   return (mode << 6) | (mode << 3) | mode;
> -- 
> 2.7.4
>
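
For readers following the hunk above: the privileged branch takes the owner's
permission bits from the table mode and reports them for owner, group and
other alike. A minimal userspace sketch of just that arithmetic follows. The
function name net_ctl_mode_model and its "privileged" flag are hypothetical;
in the kernel the flag corresponds to the ns_capable_noaudit()/uid_eq() test,
and the unprivileged path here is a simplification because the rest of
net_ctl_permissions() is not shown in the hunk.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of the privileged branch quoted in the hunk.
 * net_ctl_mode_model and "privileged" are hypothetical names.
 */
int net_ctl_mode_model(int table_mode, bool privileged)
{
	/* This model only handles the nine rwx permission bits. */
	assert((table_mode & ~0777) == 0);

	if (privileged) {
		/* Take the owner's rwx bits and replicate them to group and other. */
		int mode = (table_mode >> 6) & 7;
		return (mode << 6) | (mode << 3) | mode;
	}

	/* Simplification: the real function has further checks not shown above. */
	return table_mode;
}
```

With a table mode of 0644, a privileged caller is reported 0666, while the
model leaves the mode untouched for an unprivileged caller.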

Re: [PATCH 1/2] kernel: Add noaudit variant of ns_capable()

2016-05-08 Thread Serge Hallyn

Quoting Tyler Hicks (tyhi...@canonical.com):
> When checking the current cred for a capability in a specific user
> namespace, it isn't always desirable to have the LSMs audit the check.
> This patch adds a noaudit variant of ns_capable() for when those
> situations arise.
> 
> The common logic between ns_capable() and the new ns_capable_noaudit()
> is moved into a single, shared function to keep duplicated code to a
> minimum and ease maintainability.
> 
> Signed-off-by: Tyler Hicks 

Acked-by: Serge E. Hallyn 

> ---
>  include/linux/capability.h |  5 +
>  kernel/capability.c| 46 
> --
>  2 files changed, 41 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index 00690ff..5f3c63d 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -206,6 +206,7 @@ extern bool has_ns_capability_noaudit(struct task_struct *t,
> struct user_namespace *ns, int cap);
>  extern bool capable(int cap);
>  extern bool ns_capable(struct user_namespace *ns, int cap);
> +extern bool ns_capable_noaudit(struct user_namespace *ns, int cap);
>  #else
>  static inline bool has_capability(struct task_struct *t, int cap)
>  {
> @@ -233,6 +234,10 @@ static inline bool ns_capable(struct user_namespace *ns, int cap)
>  {
>   return true;
>  }
> +static inline bool ns_capable_noaudit(struct user_namespace *ns, int cap)
> +{
> + return true;
> +}
>  #endif /* CONFIG_MULTIUSER */
>  extern bool capable_wrt_inode_uidgid(const struct inode *inode, int cap);
>  extern bool file_ns_capable(const struct file *file, struct user_namespace *ns, int cap);
> diff --git a/kernel/capability.c b/kernel/capability.c
> index 45432b5..00411c8 100644
> --- a/kernel/capability.c
> +++ b/kernel/capability.c
> @@ -361,6 +361,24 @@ bool has_capability_noaudit(struct task_struct *t, int cap)
>   return has_ns_capability_noaudit(t, &init_user_ns, cap);
>  }
>  
> +static bool ns_capable_common(struct user_namespace *ns, int cap, bool audit)
> +{
> + int capable;
> +
> + if (unlikely(!cap_valid(cap))) {
> + pr_crit("capable() called with invalid cap=%u\n", cap);
> + BUG();
> + }
> +
> + capable = audit ? security_capable(current_cred(), ns, cap) :
> +   security_capable_noaudit(current_cred(), ns, cap);
> + if (capable == 0) {
> + current->flags |= PF_SUPERPRIV;
> + return true;
> + }
> + return false;
> +}
> +
>  /**
>   * ns_capable - Determine if the current task has a superior capability in effect
>   * @ns:  The usernamespace we want the capability in
> @@ -374,19 +392,27 @@ bool has_capability_noaudit(struct task_struct *t, int cap)
>   */
>  bool ns_capable(struct user_namespace *ns, int cap)
>  {
> - if (unlikely(!cap_valid(cap))) {
> - pr_crit("capable() called with invalid cap=%u\n", cap);
> - BUG();
> - }
> -
> - if (security_capable(current_cred(), ns, cap) == 0) {
> - current->flags |= PF_SUPERPRIV;
> - return true;
> - }
> - return false;
> + return ns_capable_common(ns, cap, true);
>  }
>  EXPORT_SYMBOL(ns_capable);
>  
> +/**
> + * ns_capable_noaudit - Determine if the current task has a superior capability
> + * (unaudited) in effect
> + * @ns:  The usernamespace we want the capability in
> + * @cap: The capability to be tested for
> + *
> + * Return true if the current task has the given superior capability currently
> + * available for use, false if not.
> + *
> + * This sets PF_SUPERPRIV on the task if the capability is available on the
> + * assumption that it's about to be used.
> + */
> +bool ns_capable_noaudit(struct user_namespace *ns, int cap)
> +{
> + return ns_capable_common(ns, cap, false);
> +}
> +EXPORT_SYMBOL(ns_capable_noaudit);
>  
>  /**
>   * capable - Determine if the current task has a superior capability in effect
> -- 
> 2.7.4
>
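
The refactoring in this patch (one shared helper taking an "audit" flag, with
two thin public wrappers) can be modelled in userspace as below. This is only
a sketch under stated assumptions: granted_caps, audit_denials,
mock_security_capable() and the model_* names are hypothetical stand-ins for
the credential state and LSM hooks, and the PF_SUPERPRIV bookkeeping is left
out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for credential state and the audit log. */
unsigned long granted_caps;  /* bit N set => capability N is granted */
int audit_denials;           /* number of denial records "logged" */

/* Mock of the LSM hook: 0 on success, -1 on denial; logs only when audited. */
int mock_security_capable(int cap, bool audit)
{
	if (granted_caps & (1UL << cap))
		return 0;
	if (audit)
		audit_denials++;
	return -1;
}

/* Shared logic, parameterised on whether a denial should be audited. */
bool model_capable_common(int cap, bool audit)
{
	assert(cap >= 0 && cap < 32);  /* model-only validity check */
	return mock_security_capable(cap, audit) == 0;
}

/* Thin wrappers, mirroring ns_capable() and ns_capable_noaudit(). */
bool model_capable(int cap)
{
	return model_capable_common(cap, true);
}

bool model_capable_noaudit(int cap)
{
	return model_capable_common(cap, false);
}
```

In this model a denied model_capable_noaudit() call leaves audit_denials
unchanged, while a denied model_capable() call increments it, which is the
behavioural difference the patch description asks for.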

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-04 Thread Serge Hallyn

Quoting Djalal Harouni (tix...@gmail.com):
> This is version 2 of the VFS:userns support portable root filesystems
> RFC. Changes since version 1:
> 
> * Update documentation and remove some ambiguity about the feature.
>   Based on Josh Triplett comments.
> * Use a new email address to send the RFC :-)
> 
> 
> This RFC tries to explore how to support filesystem operations inside
> user namespace using only VFS and a per mount namespace solution. This
> allows to take advantage of user namespace separations without
> introducing any change at the filesystems level. All this is handled
> with the virtual view of mount namespaces.

Given your use case, is there any way we could work in some tradeoffs
to protect the host?  What I'm thinking is that containers can all
share devices uid-mapped at will, however any device mounted with
uid shifting cannot be used by the initial user namespace.  Or maybe
just non-executable in that case, as you'll need enough access to
the fs to set up the containers you want to run.

So if /dev/sda1 is your host /, you have to use /dev/sda2 as the
container rootfs source.  Mount it under /containers with uid
shifting.  Now all containers regardless of uid mappings see
the shifted fs contents.  But the host root cannot be tricked by
files on it, as /dev/sda2 is non-executable as far as it is
concerned.

Just a thought.

Re: [RFC v2 PATCH 3/8] fs: Treat foreign mounts as nosuid

2016-05-04 Thread Serge Hallyn

Quoting Djalal Harouni (tix...@gmail.com):
> If a process gets access to a mount from a different user
> namespace, that process should not be able to take advantage of
> setuid files or selinux entrypoints from that filesystem.  Prevent
> this by treating mounts from other mount namespaces and those not
> owned by current_user_ns() or an ancestor as nosuid.
> 
> This patch was just adapted from the original one that was written
> by Andy Lutomirski 
> https://www.redhat.com/archives/dm-devel/2016-April/msg00374.html

I'm not sure that this makes sense given what you're doing.  In the
case of Seth's set, a filesystem is mounted specifically (and privately)
in a user namespace.  We don't want for instance the initial user ns
to find a link to a setuid-root exploit left in the container-mounted
filesystem.

But you are having a parent user namespace mount the fs so that its
children can all access the fs, uid-shifted for convenience.  Not
allowing the child namespaces to make use of setuid-root does not
seem applicable here.

> Signed-off-by: Djalal Harouni 
> ---
>  fs/exec.c  |  2 +-
>  fs/namespace.c | 15 +++
>  include/linux/mount.h  |  1 +
>  include/linux/user_namespace.h |  8 
>  kernel/user_namespace.c| 13 +
>  security/commoncap.c   |  2 +-
>  security/selinux/hooks.c   |  2 +-
>  7 files changed, 40 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index c4010b8..706088d 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1391,7 +1391,7 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
>   bprm->cred->euid = current_euid();
>   bprm->cred->egid = current_egid();
>  
> - if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
> + if (!mnt_may_suid(bprm->file->f_path.mnt))
>   return;
>  
>   if (task_no_new_privs(current))
> diff --git a/fs/namespace.c b/fs/namespace.c
> index de02b39..a8820fb 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3374,6 +3374,21 @@ found:
>   return visible;
>  }
>  
> +bool mnt_may_suid(struct vfsmount *mnt)
> +{
> + struct mount *m = real_mount(mnt);
> +
> + /*
> +  * Foreign mounts (accessed via fchdir or through /proc
> +  * symlinks) are always treated as if they are nosuid. This
> +  * prevents namespaces from trusting potentially unsafe
> +  * suid/sgid bits, file caps, or security labels that originate
> +  * in other namespaces.
> +  */
> + return !(mnt->mnt_flags & MNT_NOSUID) && check_mnt(m) &&
> +  in_userns(current_user_ns(), m->mnt_ns->user_ns);
> +}
> +
>  static struct ns_common *mntns_get(struct task_struct *task)
>  {
>   struct ns_common *ns = NULL;
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index f822c3c..54a594d 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -81,6 +81,7 @@ extern void mntput(struct vfsmount *mnt);
>  extern struct vfsmount *mntget(struct vfsmount *mnt);
>  extern struct vfsmount *mnt_clone_internal(struct path *path);
>  extern int __mnt_is_readonly(struct vfsmount *mnt);
> +extern bool mnt_may_suid(struct vfsmount *mnt);
>  
>  struct path;
>  extern struct vfsmount *clone_private_mount(struct path *path);
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 8297e5b..a43faa7 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -72,6 +72,8 @@ extern ssize_t proc_projid_map_write(struct file *, const char __user *, size_t,
>  extern ssize_t proc_setgroups_write(struct file *, const char __user *, size_t, loff_t *);
>  extern int proc_setgroups_show(struct seq_file *m, void *v);
>  extern bool userns_may_setgroups(const struct user_namespace *ns);
> +extern bool in_userns(const struct user_namespace *ns,
> +   const struct user_namespace *target_ns);
>  #else
>  
>  static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
> @@ -100,6 +102,12 @@ static inline bool userns_may_setgroups(const struct user_namespace *ns)
>  {
>   return true;
>  }
> +
> +static inline bool in_userns(const struct user_namespace *ns,
> +  const struct user_namespace *target_ns)
> +{
> + return true;
> +}
>  #endif
>  
>  #endif /* _LINUX_USER_H */
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 9bafc21..9a496a8 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -938,6 +938,19 @@ bool userns_may_setgroups(const struct user_namespace *ns)
>   return allowed;
>  }
>  
> +/*
> + * Returns true if @ns is the same namespace as or a descendant of
> + * @target_ns.
> + */
> +bool in_userns(const struct user_namespace *ns,
> +const struct user_namespace *target_ns)
> +{
> + for (; ns; ns = ns->parent) {
> + if (ns == target_ns)
> + return true;
> + }
> +}
> +
>
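
A note on in_userns() as quoted above: the loop returns true on a match, but
there is no return statement after the loop, so the non-void function falls
off its end when @ns is not @target_ns or one of its descendants. A minimal
userspace sketch of the ancestor walk follows, assuming the intended result in
that case is false (which is what the comment above the function implies).
struct user_ns_model and in_userns_model() are hypothetical names, not kernel
API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a user namespace: only the parent link matters here. */
struct user_ns_model {
	const struct user_ns_model *parent;  /* NULL for the initial namespace */
};

/*
 * Returns true if ns is the same namespace as, or a descendant of,
 * target_ns. Walks the parent chain until it runs out.
 */
bool in_userns_model(const struct user_ns_model *ns,
		     const struct user_ns_model *target_ns)
{
	for (; ns; ns = ns->parent) {
		if (ns == target_ns)
			return true;
	}
	/* Assumed intent: no ancestor matched, so ns is not inside target_ns. */
	return false;
}
```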

namespaced file capabilities

2016-04-22 Thread serge . hallyn

Hi,

I've sent a few patches and emails over the past months about supporting
file capabilities in user namespace confined containers.  A few of the
requirements as I see them are:

1. Root in a user namespace should be able to set file capabilities on a binary
for use by any user mapped into his namespace.

2. Any uid not mapped into the user namespace whose root user set file
capabilities should not gain privileges when running an executable which only
has file capabilities set by this root user.

3. Existing calls to cap_set_file(3) and cap_get_file(3) as well as
setcap(8) and getcap(8) should transparently work.  This would allow
package managers to simply set file capabilities in postinst.

Below is a kernel patch which implements a new security.nscapability
extended attribute.  Setting this xattr on a file requires cap_setfcap
against the current user namespace, and for the file to be owned by
a uid and gid mapped into that namespace.  When found on a file,
the capabilities will take effect only if the file is owned by the
root uid in the caller's namespace, or the root uid in any ancestor
namespace.
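
The ownership rule in the paragraph above can be sketched in userspace C.
This is only a model under stated assumptions: struct ns_model, root_kuid and
nscap_honored() are hypothetical names, kuids are plain integers, and the
actual patch performs the walk inside get_vfs_ns_caps_from_disk() using kernel
types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: each namespace records the kuid mapped to its root. */
struct ns_model {
	unsigned int root_kuid;         /* global kuid of uid 0 in this namespace */
	const struct ns_model *parent;  /* NULL for the initial namespace */
};

/*
 * The xattr is honored only if the file's owner is the root kuid of the
 * caller's namespace or of one of its ancestors.
 */
bool nscap_honored(const struct ns_model *caller_ns,
		   unsigned int file_owner_kuid)
{
	for (const struct ns_model *ns = caller_ns; ns; ns = ns->parent) {
		if (ns->root_kuid == file_owner_kuid)
			return true;
	}
	return false;
}
```

Under this model a file owned by a container's root kuid is honored inside
that container and its nested namespaces, but not in a sibling container and
not in the parent, which matches the "unrelated namespaces" limitation
described next.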

While this design supports nested namespaces, it does not support
use of file capabilities by users in unrelated namespaces.  So if
the same file is linked into two namespaces N1 and N2 which do not
share the same root kuid, then the only way for N1 and N2 to both
execute the file while respecting security.nscapability is to have
a common ancestor namespace write the capability.  The only reasonable
way we could handle this case would be to use a securityfs interface
to set file capabilities.  The capability.ko module could then
do the work of keeping a list of uid ranges for which file capabilities
should be honored.  I don't think that flexibility is really called
for.
 
The kernel patch follows, and can be found at
https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/log/?h=2016-04-22/nsfscaps

The libcap patch can be found at
https://git.kernel.org/cgit/linux/kernel/git/sergeh/libcap.git/log/?h=2016-04-22/nscaps

Comments/conversation/suggestions greatly appreciated.

thanks,
-serge

[PATCH 1/1] simplified security.nscapability xattr

2016-04-22 Thread serge . hallyn

From: Serge Hallyn 

This can only be set by root in his own namespace, and will
only be respected by namespaces with that same root kuid
mapped as root, or namespaces descended from it.

This allows a simple setxattr to work, allows tar/untar to
work, and allows us to tar in one namespace and untar in
another while preserving the capability, without risking
leaking privilege into a parent namespace.

Signed-off-by: Serge Hallyn 
---
 include/linux/capability.h  |5 ++-
 include/uapi/linux/capability.h |   18 
 include/uapi/linux/xattr.h  |3 ++
 security/commoncap.c|   91 +--
 4 files changed, 112 insertions(+), 5 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index 00690ff..cf533ff 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -13,7 +13,7 @@
 #define _LINUX_CAPABILITY_H
 
 #include 
-
+#include 
 
 #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
 #define _KERNEL_CAPABILITY_U32S    _LINUX_CAPABILITY_U32S_3
@@ -31,6 +31,9 @@ struct cpu_vfs_cap_data {
kernel_cap_t inheritable;
 };
 
+#define NS_CAPS_VERSION(x) (x & 0xFF)
+#define NS_CAPS_FLAGS(x) ((x >> 8) & 0xFF)
+
 #define _USER_CAP_HEADER_SIZE  (sizeof(struct __user_cap_header_struct))
 #define _KERNEL_CAP_T_SIZE (sizeof(kernel_cap_t))
 
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 12c37a1..f0b4a66 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -62,6 +62,9 @@ typedef struct __user_cap_data_struct {
 #define VFS_CAP_U32_2   2
 #define XATTR_CAPS_SZ_2 (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
 
+/* version number for security.nscapability xattrs hdr->hdr_info */
+#define VFS_NS_CAP_REVISION 1
+
 #define XATTR_CAPS_SZ   XATTR_CAPS_SZ_2
 #define VFS_CAP_U32 VFS_CAP_U32_2
 #define VFS_CAP_REVISION   VFS_CAP_REVISION_2
@@ -74,6 +77,21 @@ struct vfs_cap_data {
} data[VFS_CAP_U32];
 };
 
+#define VFS_NS_CAP_EFFECTIVE    0x1
+/*
+ * 32-bit hdr_info contains
+ * 16 leftmost: reserved
+ * next 8: flags (only VFS_NS_CAP_EFFECTIVE so far)
+ * last 8: version
+ */
+struct vfs_ns_cap_data {
+   __le32 magic_etc;
+   struct {
+   __le32 permitted;/* Little endian */
+   __le32 inheritable;  /* Little endian */
+   } data[VFS_CAP_U32];
+};
+
 #ifndef __KERNEL__
 
 /*
diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
index 1590c49..67c80ab 100644
--- a/include/uapi/linux/xattr.h
+++ b/include/uapi/linux/xattr.h
@@ -68,6 +68,9 @@
 #define XATTR_CAPS_SUFFIX "capability"
 #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
 
+#define XATTR_NS_CAPS_SUFFIX "nscapability"
+#define XATTR_NAME_NS_CAPS XATTR_SECURITY_PREFIX XATTR_NS_CAPS_SUFFIX
+
 #define XATTR_POSIX_ACL_ACCESS  "posix_acl_access"
 #define XATTR_NAME_POSIX_ACL_ACCESS XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_ACCESS
 #define XATTR_POSIX_ACL_DEFAULT  "posix_acl_default"
diff --git a/security/commoncap.c b/security/commoncap.c
index 48071ed..8f3f34a 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -313,6 +313,10 @@ int cap_inode_need_killpriv(struct dentry *dentry)
if (!inode->i_op->getxattr)
   return 0;
 
+   error = inode->i_op->getxattr(dentry, XATTR_NAME_NS_CAPS, NULL, 0);
+   if (error > 0)
+   return 1;
+
error = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
if (error <= 0)
return 0;
@@ -330,11 +334,17 @@ int cap_inode_need_killpriv(struct dentry *dentry)
 int cap_inode_killpriv(struct dentry *dentry)
 {
struct inode *inode = d_backing_inode(dentry);
+   int ret1, ret2;
 
if (!inode->i_op->removexattr)
   return 0;
 
-   return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
+   ret1 = inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
+   ret2 = inode->i_op->removexattr(dentry, XATTR_NAME_NS_CAPS);
+
+   if (ret1 != 0)
+   return ret1;
+   return ret2;
 }
 
 /*
@@ -438,6 +448,65 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
return 0;
 }
 
+int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
+{
+   struct inode *inode = d_backing_inode(dentry);
+   unsigned i;
+   u32 magic_etc;
+   ssize_t size;
+   struct vfs_ns_cap_data nscap;
+   bool foundroot = false;
+   struct user_namespace *ns;
+
+   memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
+
+   if (!inode || !inode->i_op->getxattr)
+   return -ENODATA;
+
+   /* verify that current or ancestor userns root owns this file */
+   for (ns = current_user_ns(); ; ns = ns->pa

[PATCH 2/2] mountinfo: implement show_path for kernfs and cgroup

2016-04-17 Thread serge . hallyn

From: Serge Hallyn 

When showing a cgroupfs entry in mountinfo, show the
path of the mount root dentry relative to the reader's
cgroup namespace root.

Signed-off-by: Serge Hallyn 
---
 fs/kernfs/mount.c  | 14 ++
 include/linux/kernfs.h |  2 ++
 kernel/cgroup.c| 35 +++
 3 files changed, 51 insertions(+)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f73541f..3b78724 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "kernfs-internal.h"
 
@@ -40,6 +41,18 @@ static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
return 0;
 }
 
+static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry)
+{
+   struct kernfs_node *node = dentry->d_fsdata;
+   struct kernfs_root *root = kernfs_root(node);
+   struct kernfs_syscall_ops *scops = root->syscall_ops;
+
+   if (scops && scops->show_path)
+   return scops->show_path(sf, node, root);
+
+   return seq_dentry(sf, dentry, " \t\n\\");
+}
+
 const struct super_operations kernfs_sops = {
.statfs = simple_statfs,
.drop_inode = generic_delete_inode,
@@ -47,6 +60,7 @@ const struct super_operations kernfs_sops = {
 
.remount_fs = kernfs_sop_remount_fs,
.show_options   = kernfs_sop_show_options,
+   .show_path  = kernfs_sop_show_path,
 };
 
 /**
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index c06c442..30f089e 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -152,6 +152,8 @@ struct kernfs_syscall_ops {
int (*rmdir)(struct kernfs_node *kn);
int (*rename)(struct kernfs_node *kn, struct kernfs_node *new_parent,
  const char *new_name);
+   int (*show_path)(struct seq_file *sf, struct kernfs_node *kn,
+struct kernfs_root *root);
 };
 
 struct kernfs_root {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 671dc05..9a0d7b3 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1593,6 +1593,40 @@ static int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
return 0;
 }
 
+static int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
+   struct kernfs_root *kf_root)
+{
+   int len = 0, ret = 0;
+   char *buf = NULL;
+   struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
+   struct cgroup_root *kf_cgroot = cgroup_root_from_kf(kf_root);
+   struct cgroup *ns_cgroup;
+
+   mutex_lock(&cgroup_mutex);
+   spin_lock_bh(&css_set_lock);
+   ns_cgroup = cset_cgroup_from_root(ns->root_cset, kf_cgroot);
+   len = kernfs_path_from_node(kf_node, ns_cgroup->kn, NULL, 0);
+   if (len > 0)
+   buf = kmalloc(len + 1, GFP_ATOMIC);
+   if (buf)
+   ret = kernfs_path_from_node(kf_node, ns_cgroup->kn, buf, len + 1);
+
+   spin_unlock_bh(&css_set_lock);
+   mutex_unlock(&cgroup_mutex);
+
+   if (len <= 0)
+   return len;
+   if (!buf)
+   return -ENOMEM;
+   if (ret == len) {
+   seq_escape(sf, buf, " \t\n\\");
+   ret = 0;
+   } else if (ret >= 0)
+   ret = -EINVAL;
+   kfree(buf);
+   return ret;
+}
+
 static int cgroup_show_options(struct seq_file *seq,
   struct kernfs_root *kf_root)
 {
@@ -5430,6 +5464,7 @@ static struct kernfs_syscall_ops cgroup_kf_syscall_ops = {
.mkdir  = cgroup_mkdir,
.rmdir  = cgroup_rmdir,
.rename = cgroup_rename,
+   .show_path  = cgroup_show_path,
 };
 
 static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
-- 
2.7.4

Show virtualized dentry root in mountinfo for cgroupfs

2016-04-17 Thread serge . hallyn

With the current cgroup namespace patches, the root dentry path of a
mount as shown in /proc/self/mountinfo is the full global cgroup
path.  It is common for userspace to use /proc/self/mountinfo to
search for cgroup mountpoints, and expect the root dentry path to
relate to the cgroup paths in /proc/self/cgroup.  Patch 2 in this
set therefore virtualizes the root dentry path relative to the
reader's cgroup namespace root.

Patch 1 fixes a bug in kernfs_path_from_node_locked() which is
exposed by patch 2.
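
For userspace that matches mountinfo entries against /proc/self/cgroup,
the field in question is the fourth one on each mountinfo line.  A minimal
sketch of extracting it follows.  The helper name mountinfo_root() is made
up for illustration, and the sketch does not decode octal escapes such as
\040; it only relies on fields being separated by single spaces.

```c
#include <string.h>

/*
 * Copy the root dentry path (4th space-separated field) of one
 * /proc/self/mountinfo line into root.  Returns 0 on success, -1 if
 * the line is too short or the field does not fit in rootlen bytes.
 */
int mountinfo_root(const char *line, char *root, size_t rootlen)
{
	size_t n;
	int field;

	/* Skip mount ID, parent ID and major:minor. */
	for (field = 0; field < 3; field++) {
		line = strchr(line, ' ');
		if (!line)
			return -1;
		line++;
	}

	n = strcspn(line, " \n");
	if (n == 0 || n >= rootlen)
		return -1;
	memcpy(root, line, n);
	root[n] = '\0';
	return 0;
}
```

The extracted string is what a tool would compare with the path column of
/proc/self/cgroup once both are relative to the same namespace root.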

[PATCH 1/2] kernfs_path_from_node_locked: don't overwrite nlen

2016-04-17 Thread serge . hallyn

From: Serge Hallyn 

We've calculated @len to be the bytes we need for '/..' entries from
@kn_from to the common ancestor, and calculated @nlen to be the extra
bytes we need to get from the common ancestor to @kn_to.  We use them
as such at the end.  But in the loop copying the actual entries, we
overwrite @nlen.  Use a temporary variable for that instead.

Without this, the return length, when the buffer is large enough, is
wrong.  (When the buffer is NULL or too small, the returned value is
correct. The buffer contents are also correct.)

Interestingly, no callers of this function are affected by this as of
yet.  However the upcoming cgroup_show_path() will be.

Signed-off-by: Serge Hallyn 
---
 fs/kernfs/dir.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 03b688d..37f9678 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -153,9 +153,9 @@ static int kernfs_path_from_node_locked(struct kernfs_node *kn_to,
p = buf + len + nlen;
*p = '\0';
for (kn = kn_to; kn != common; kn = kn->parent) {
-   nlen = strlen(kn->name);
-   p -= nlen;
-   memcpy(p, kn->name, nlen);
+   size_t tmp = strlen(kn->name);
+   p -= tmp;
+   memcpy(p, kn->name, tmp);
*(--p) = '/';
}
 
-- 
2.7.4
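
The effect of reusing @nlen inside the copy loop can be reproduced in a
small userspace model.  This is only a sketch under simplifying
assumptions: join_clobbered() and join_fixed() are hypothetical names, the
"/.." prefix part (@len) is left out, and the buffer is assumed to be
large enough.

```c
#include <string.h>

/*
 * Write "/a/b/..." for names[0..n) into buf and return the length.
 * This variant reuses nlen in the loop, like the code before the fix.
 */
size_t join_clobbered(const char *const *names, size_t n, char *buf)
{
	size_t nlen = 0, i;
	char *p;

	for (i = 0; i < n; i++)
		nlen += strlen(names[i]) + 1;	/* total bytes needed */

	p = buf + nlen;
	*p = '\0';
	for (i = n; i-- > 0; ) {
		nlen = strlen(names[i]);	/* the total is lost here */
		p -= nlen;
		memcpy(p, names[i], nlen);
		*(--p) = '/';
	}
	return nlen;	/* length of names[0] only */
}

/* Same walk, but with a temporary for the per-entry length. */
size_t join_fixed(const char *const *names, size_t n, char *buf)
{
	size_t nlen = 0, i;
	char *p;

	for (i = 0; i < n; i++)
		nlen += strlen(names[i]) + 1;

	p = buf + nlen;
	*p = '\0';
	for (i = n; i-- > 0; ) {
		size_t tmp = strlen(names[i]);

		p -= tmp;
		memcpy(p, names[i], tmp);
		*(--p) = '/';
	}
	return nlen;	/* still the total */
}
```

Both variants fill the buffer identically; only the returned length
differs, which is why callers that ignore the return value never noticed.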

Re: [PATCH] exec: clarify reasoning for euid/egid reset

2016-04-12 Thread Serge Hallyn

Quoting Kees Cook (keesc...@chromium.org):
> This section of code initially looks redundant, but is required. This
> improves the comment to explain more clearly why the reset is needed.
> 
> Signed-off-by: Kees Cook 

Thanks, Kees.

Acked-by: Serge E. Hallyn 

> ---
>  fs/exec.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index c4010b8207a1..889221bbfdb3 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1387,7 +1387,12 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
>   kuid_t uid;
>   kgid_t gid;
>  
> - /* clear any previous set[ug]id data from a previous binary */
> + /*
> +  * Since this can be called multiple times (via prepare_binprm),
> +  * we must clear any previous work done when setting set[ug]id
> +  * bits from any earlier bprm->file uses (for example when run
> +  * first for a script then for its interpreter).
> +  */
>   bprm->cred->euid = current_euid();
>   bprm->cred->egid = current_egid();
>  
> -- 
> 2.6.3
> 
> 
> -- 
> Kees Cook
> Chrome OS & Brillo Security

Re: [PATCH] devpts: Make ptmx be owned by the userns owner instead of userns-local 0

2016-03-14 Thread Serge Hallyn

Quoting Andy Lutomirski (l...@kernel.org):
> We used to have ptmx be owned by the inner uid and gid 0.  Change
> this: if the owner and group are both mapped but are not both 0,
> then use the owner instead.
> 
> For container-style namespaces (LXC, etc), this should have no
> effect -- UID 0 will either be the owner or will be unmapped.

This doesn't seem right - it's often the case that the owner is mapped
in as non-0 uid, safe or not.  The actual namespace root uid should be
the owner (so long as it exists).

Why not reverse the cases?  If 0 is not mapped, then check whether the
current_user_ns()->owner is mapped?
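
The reversed ordering (prefer the namespace's own uid 0, and fall back to
the namespace owner only when 0 is unmapped) can be modelled in userspace
roughly as follows.  This is a sketch, not kernel code: struct
userns_model and pick_ptmx_owner() are hypothetical stand-ins, and plain
longs stand in for kuid_t.

```c
#include <stdbool.h>

#define ID_INVALID (-1L)

/* Hypothetical model of the few facts the decision depends on. */
struct userns_model {
	long kid_of_inner_zero;	/* kernel id that inner id 0 maps to, or ID_INVALID */
	long owner_kid;		/* kernel id of the namespace owner */
	bool owner_is_mapped;	/* does the owner have an id inside the namespace? */
};

/* Returns the kernel id that should own ptmx, or ID_INVALID to give up. */
long pick_ptmx_owner(const struct userns_model *ns)
{
	/* Container case: inner uid 0 exists, so it owns ptmx. */
	if (ns->kid_of_inner_zero != ID_INVALID)
		return ns->kid_of_inner_zero;

	/* Sandbox case: no inner 0, use the owner if it is mapped. */
	if (ns->owner_is_mapped)
		return ns->owner_kid;

	return ID_INVALID;
}
```

The same check would be repeated for the gid; it is omitted here to keep
the sketch short.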

> The important behavior change is for sandboxes: many sandboxes
> intentionally do not create an inner uid 0.  Without this patch,
> mounting devpts in such a sandbox is awkward.  With this patch, it
> will just work and ptmx will be owned by the namespace owner.
> 
> Cc: Alexander Larsson 
> Cc: mcla...@redhat.com
> Cc: "Eric W. Biederman" 
> Cc: Linux Containers 
> Signed-off-by: Andy Lutomirski 
> ---
>  fs/devpts/inode.c | 34 ++
>  1 file changed, 30 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
> index 655f21f99160..d6fa2d1beee3 100644
> --- a/fs/devpts/inode.c
> +++ b/fs/devpts/inode.c
> @@ -27,6 +27,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define DEVPTS_DEFAULT_MODE 0600
>  /*
> @@ -250,10 +251,35 @@ static int mknod_ptmx(struct super_block *sb)
>   kuid_t root_uid;
>   kgid_t root_gid;
>  
> - root_uid = make_kuid(current_user_ns(), 0);
> - root_gid = make_kgid(current_user_ns(), 0);
> - if (!uid_valid(root_uid) || !gid_valid(root_gid))
> - return -EINVAL;
> + /*
> +  * For a new devpts instance, ptmx is owned by the creating user
> +  * namespace's owner.  Usually, that will be 0 as seen by the
> +  * user namespace, but for unprivileged sandbox namespaces,
> +  * there may not be a uid 0 or gid 0 at all.
> +  */
> + root_uid = current_user_ns()->owner;
> + root_gid = current_user_ns()->group;
> +
> + if (!uid_valid(root_uid) || !gid_valid(root_gid)) {
> + /*
> +  * It's very unlikely for us to get here if the userns
> +  * owner is not mapped, but it's possible -- we'd have
> +  * to be running in the userns with capabilities granted
> +  * by unshare or setns, since there is no inner
> +  * privileged user.  Nonetheless, this could happen, and
> +  * we don't want ptmx to be owned by an unmapped user or
> +  * group.
> +  *
> +  * If this happens fall back to historical behavior:
> +  * try to have ptmx be owned by 0:0.
> +  */
> + root_uid = make_kuid(current_user_ns(), 0);
> + root_gid = make_kgid(current_user_ns(), 0);
> +
> + /* If this still doesn't work, give up. */
> + if (!uid_valid(root_uid) || !gid_valid(root_gid))
> + return -EINVAL;
> + }
>  
>   inode_lock(d_inode(root));
>  
> -- 
> 2.5.0
> 

Re: [PATCH 0/2] Fix debugfs bind mount regression

2016-03-09 Thread Serge Hallyn

Quoting Eric W. Biederman (ebied...@xmission.com):
> Seth Forshee  writes:
> 
> > Some full-OS container software bind mounts debugfs into containers to
> > satisfy the assumptions of older userspaces which expect to be able to
> > mount debugfs. This regressed in 4.1 due to the addition of tracefs,
> > which gets automounted in the tracing subdirectory of debugfs. In a
> > cloned mount namespace the bind mount now fails because the tracefs
> > mount is a locked child of the debugfs mount.
> >
> > For new mounts we already make an exception to the "locked child mount"
> > rule. Directories in pseudo filesystems created for the sole purpose of
> > being mountpoints are created as permanently empty directories which can
> > never contain any entries, therefore the kernel can know that any mounts
> > on these directories are not for security purposes. These mounts are
> > then excluded from locked mount tests in some circumstances.
> >
> > The same logic clearly applies to directories created in
> > debugfs_create_automount(). The following patches update this function
> > to create permanently empty directories for mountpoints and add an
> > exclusion to the tests for bind mounts to exclude child mounts on
> > permanently empty directories.
> 
> So I don't know that this approach is bad.  However in reading through
> your patch descriptions I do not see any consideration of using
> "mount --rbind"  instead of "mount --bind".  AKA adding the MS_REC flag
> to your bind mount.
> 
> I would think simply using MS_REC would solve this problem, without
> needing any additional kernel support.  Am I missing something?

That's what we're doing to work around it fwiw, but it would be nice to
not have to.
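
For reference, the workaround amounts to passing MS_REC together with
MS_BIND to mount(2).  A minimal sketch, assuming Linux and a caller with
the needed privileges; bind_mount_flags() and bind_mount() are
illustrative helper names, not an existing API:

```c
#include <stddef.h>
#include <sys/mount.h>

/* Flags for a plain or recursive bind mount. */
unsigned long bind_mount_flags(int recursive)
{
	return MS_BIND | (recursive ? MS_REC : 0);
}

/*
 * Bind src onto dst.  With recursive set, child mounts such as the
 * tracefs automount below debugfs are carried along.
 * Returns 0 on success, -1 with errno set on failure.
 */
int bind_mount(const char *src, const char *dst, int recursive)
{
	return mount(src, dst, NULL, bind_mount_flags(recursive), NULL);
}
```

A container manager would call something like
bind_mount("/sys/kernel/debug", target, 1), where target is the debugfs
path inside the container's root filesystem.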

Re: [lxc-devel] CGroup Namespaces (v10)

2016-02-26 Thread Serge Hallyn

Quoting Alban Crequy (alban.cre...@gmail.com):
> Hi,
> 
> On 29 January 2016 at 09:54,   wrote:
> > Hi,
> >
> > following is a revised set of the CGroup Namespace patchset which Aditya
> > Kali has previously sent.  The code can also be found in the cgroupns.v10
> > branch of
> >
> > https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/
> >
> > To summarize the semantics:
> >
> > 1. CLONE_NEWCGROUP re-uses 0x02000000, which was previously CLONE_STOPPED
> 
> What's the best way for a userspace application to test at run-time
> whether the kernel supports cgroup namespaces? Would you recommend to
> test if the file /proc/self/ns/cgroup exists?

Yup.
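
A minimal sketch of that run-time check, for anyone who wants it in C.
The function names are made up for illustration; the only real interface
used is access(2).

```c
#include <unistd.h>

/* Returns 1 if path exists, 0 otherwise. */
int ns_file_exists(const char *path)
{
	return access(path, F_OK) == 0;
}

/* Returns 1 if the running kernel exposes cgroup namespaces. */
int kernel_has_cgroup_ns(void)
{
	return ns_file_exists("/proc/self/ns/cgroup");
}
```

This only says the kernel knows about cgroup namespaces; whether the
caller may unshare one is a separate permission question.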

Re: [PATCH v2] openvswitch: allow management from inside user namespaces

2016-02-02 Thread Serge Hallyn

Quoting Tycho Andersen (tycho.ander...@canonical.com):
> Operations with the GENL_ADMIN_PERM flag fail permissions checks because
> this flag means we call netlink_capable, which uses the init user ns.
> 
> Instead, let's introduce a new flag, GENL_UNS_ADMIN_PERM for operations
> which should be allowed inside a user namespace.
> 
> The motivation for this is to be able to run openvswitch in unprivileged
> containers. I've tested this and it seems to work, but I really have no
> idea about the security consequences of this patch, so thoughts would be
> much appreciated.
> 
> v2: use the GENL_UNS_ADMIN_PERM flag instead of a check in each function
> 
> Reported-by: James Page 
> Signed-off-by: Tycho Andersen 
> CC: Eric Biederman 
> CC: Pravin Shelar 
> CC: Justin Pettit 
> CC: "David S. Miller" 
> ---
>  include/uapi/linux/genetlink.h |  1 +
>  net/netlink/genetlink.c|  6 --
>  net/openvswitch/datapath.c | 20 ++--
>  3 files changed, 15 insertions(+), 12 deletions(-)
> 
> diff --git a/include/uapi/linux/genetlink.h b/include/uapi/linux/genetlink.h
> index c3363ba..5512c90 100644
> --- a/include/uapi/linux/genetlink.h
> +++ b/include/uapi/linux/genetlink.h
> @@ -21,6 +21,7 @@ struct genlmsghdr {
>  #define GENL_CMD_CAP_DO  0x02
>  #define GENL_CMD_CAP_DUMP0x04
>  #define GENL_CMD_CAP_HASPOL  0x08
> +#define GENL_UNS_ADMIN_PERM  0x10
>  
>  /*
>   * List of reserved static generic netlink identifiers:
> diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
> index f830326..6bbb3eb 100644
> --- a/net/netlink/genetlink.c
> +++ b/net/netlink/genetlink.c
> @@ -576,8 +576,10 @@ static int genl_family_rcv_msg(struct genl_family *family,
>   if (ops == NULL)
>   return -EOPNOTSUPP;
>  
> - if ((ops->flags & GENL_ADMIN_PERM) &&
> - !netlink_capable(skb, CAP_NET_ADMIN))
> + if (((ops->flags & GENL_ADMIN_PERM) &&
> + !netlink_capable(skb, CAP_NET_ADMIN)) ||

Seems like this would be a lot clearer if you split it up, i.e.:

/* CAP_NET_ADMIN required against initial user_ns */
if ((ops->flags & GENL_ADMIN_PERM) &&
!netlink_capable(skb, CAP_NET_ADMIN))
return -EPERM;

/* CAP_NET_ADMIN required against device user_ns */
if ((ops->flags & GENL_UNS_ADMIN_PERM) &&
!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
return -EPERM;

> + ((ops->flags & GENL_UNS_ADMIN_PERM) &&
> + !netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN)))
>   return -EPERM;
>  
>   if ((nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP) {
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index deadfda..d6f7fe9 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -654,7 +654,7 @@ static const struct nla_policy packet_policy[OVS_PACKET_ATTR_MAX + 1] = {
>  
>  static const struct genl_ops dp_packet_genl_ops[] = {
>   { .cmd = OVS_PACKET_CMD_EXECUTE,
> -   .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
> +   .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */

Hm, I'd like to suggest adding 'over netns', but I guess that breaks 80 cols...

> .policy = packet_policy,
> .doit = ovs_packet_cmd_execute
>   }
> @@ -1391,12 +1391,12 @@ static const struct nla_policy flow_policy[OVS_FLOW_ATTR_MAX + 1] = {
>  
>  static const struct genl_ops dp_flow_genl_ops[] = {
>   { .cmd = OVS_FLOW_CMD_NEW,
> -   .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
> +   .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
> .policy = flow_policy,
> .doit = ovs_flow_cmd_new
>   },
>   { .cmd = OVS_FLOW_CMD_DEL,
> -   .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
> +   .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
> .policy = flow_policy,
> .doit = ovs_flow_cmd_del
>   },
> @@ -1407,7 +1407,7 @@ static const struct genl_ops dp_flow_genl_ops[] = {
> .dumpit = ovs_flow_cmd_dump
>   },
>   { .cmd = OVS_FLOW_CMD_SET,
> -   .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
> +   .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
> .policy = flow_policy,
> .doit = ovs_flow_cmd_set,
>   },
> @@ -1777,12 +1777,12 @@ static const struct nla_policy datapath_policy[OVS_DP_ATTR_MAX + 1] = {
>  
>  static const struct genl_ops dp_datapath_genl_ops[] = {
>   { .cmd = OVS_DP_CMD_NEW,
> -   .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
> +   .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
> .policy = datapath_policy,
> .doit = ovs_dp_cmd_new
>   },
>   { .cmd = OVS_DP_CMD_DEL,
> -   .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
> +

[PATCH 3/8] cgroup: introduce cgroup namespaces

2016-01-29 Thread serge . hallyn

From: Aditya Kali 

Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as cgroupns-root).
The main purpose of a cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
they will see a relative path from their cgroupns-root).
For a correctly set up container this enables container tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking the system-level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali 
Signed-off-by: Serge Hallyn 
---
Changelog: 2015-11-24
- move cgroup_namespace.c into cgroup.c (and .h)
- reformatting
- make get_cgroup_ns return void
- rename ns->root_cgrps to root_cset.
Changelog: 2015-12-08
- Move init_cgroup_ns to other variable declarations
- Remove accidental conversion of put-css_set to inline
- Drop BUG_ON(NULL)
- Remove unneeded pre declaration of struct cgroupns_operations.
- cgroup.h: collect common ns declarations
Changelog: 2015-12-09
- cgroup.h: move ns declarations to bottom
- cgroup.c: undo all accidental conversions to inline
Changelog: 2015-12-22
- update for new kernfs_path_from_node() return value.  Since
  cgroup_path was already gpl-exported, I abstained from updating
  its return value.
Changelog: 2015-12-23
- cgroup_path(): use init_cgroup_ns when in interrupt context.
Changelog: 2015-01-02
- move to_cg_ns definition forward in patch series
- cgroup_release_agent: grab css_set_lock around cgroup_path()
- leave cgroup_path non-namespaced, use cgroup_path_ns when
  namespaced path is desired.
Changelog: 2015-01-04
- cgroup_path: continue to use kernfs_path.  Since cgroup_path is
  non-namespaced, use the old version.
- make cgroup_path_ns_locked() static.
Changelog: 2015-01-05
- don't namespace the path printed in debugfs.
Changelog: 2015-01-27
- remove unneeded NULL check before put_cgroup_ns()
Changelog: 2015-01-28
- lock around task_css_set in copy_cgroup_ns, and don't
  take rcu lock around copy_cgroup_ns call in cpuset.
---
 fs/proc/namespaces.c|3 +
 include/linux/cgroup.h  |   49 +
 include/linux/nsproxy.h |2 +
 include/linux/proc_ns.h |4 ++
 kernel/cgroup.c |  176 ++-
 kernel/cpuset.c |8 +--
 kernel/fork.c   |2 +-
 kernel/nsproxy.c|   19 -
 8 files changed, 253 insertions(+), 10 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 276f124..72cb26f 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -28,6 +28,9 @@ static const struct proc_ns_operations *ns_entries[] = {
&userns_operations,
 #endif
&mntns_operations,
+#ifdef CONFIG_CGROUPS
+   &cgroupns_operations,
+#endif
 };
 
 static const char *proc_ns_get_link(struct dentry *dentry,
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 2162dca..1773af0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -17,6 +17,11 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
 
 #include 
 
@@ -611,4 +616,48 @@ static inline void cgroup_sk_free(struct sock_cgroup_data *skcd) {}
 
 #endif /* CONFIG_CGROUP_DATA */
 
+struct cgroup_namespace {
+   atomic_t                count;
+   struct ns_common        ns;
+   struct user_namespace   *user_ns;
+   struct css_set          *root_cset;
+};
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+#ifdef CONFIG_CGROUPS
+
+void free_cgroup_ns(struct cgroup_namespace *ns);
+
+struct cgroup_namespace *
+copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
+  struct cgroup_namespace *old_ns);
+
+char *cgroup_path_ns(struct cgroup *cgrp, char *buf, size_t buflen,
+struct cgroup_namespace *ns);
+
+#else /* !CONFIG_CGROUPS */
+
+static inline void free_cgroup_ns(struct cgroup_namespace *ns) { }
+static inline struct cgroup_namespace *
+copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
+  struct cgroup_namespace *old_ns)
+{
+   return old_ns;
+}
+
+#endif /* !CONFIG_CGROUPS */
+
+static inline void get_cgroup_ns(struct cgroup_namespace *ns)
+{
+   if (ns)
+   atomic_inc(&ns->count);
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+   if (ns && atomic_dec_and_test(&ns->count))
+   free_cgroup_ns(ns);
+}
+
 #endif /*

CGroup Namespaces (v10)

2016-01-29 Thread serge . hallyn

Hi,

following is a revised set of the CGroup Namespace patchset which Aditya
Kali has previously sent.  The code can also be found in the cgroupns.v10
branch of

https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/

To summarize the semantics:

1. CLONE_NEWCGROUP re-uses 0x02000000, which was previously CLONE_STOPPED

2. unsharing a cgroup namespace makes all your current cgroups your new
cgroup root.

3. /proc/pid/cgroup always shows cgroup paths relative to the reader's
cgroup namespace root.  A task outside of your cgroup looks like

8:memory:/../../..

4. when a task mounts a cgroupfs, the cgroup which shows up as root depends
on the mounting task's cgroup namespace.

5. setns to a cgroup namespace switches your cgroup namespace but not
your cgroups.
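
The path shape described in item 3 can be illustrated with a string-only
userspace sketch.  It is not the kernfs implementation (which walks
kernfs_node parents); ns_relative_path() is a hypothetical helper that
assumes both arguments are normalized absolute cgroup paths without a
trailing slash.

```c
#include <string.h>

/* Append s at offset len if it fits; always return the new total length. */
static size_t append(char *buf, size_t buflen, size_t len, const char *s)
{
	size_t n = strlen(s);

	if (buf && len + n < buflen)
		memcpy(buf + len, s, n + 1);
	return len + n;
}

/*
 * Express target as seen from a namespace rooted at ns_root: one "/.."
 * per level from ns_root up to the common ancestor, then the remainder
 * of target.  Returns the length needed; if that is >= buflen the
 * buffer contents are unspecified.
 */
int ns_relative_path(const char *ns_root, const char *target,
		     char *buf, size_t buflen)
{
	size_t i = 0, common = 0, len = 0;
	const char *p;

	/* Treat the root "/" as the empty component list. */
	if (strcmp(ns_root, "/") == 0)
		ns_root = "";
	if (strcmp(target, "/") == 0)
		target = "";
	if (buf && buflen)
		buf[0] = '\0';

	/* Find the longest shared prefix that ends on a component boundary. */
	for (;;) {
		int re = ns_root[i] == '\0' || ns_root[i] == '/';
		int te = target[i] == '\0' || target[i] == '/';

		if (re && te)
			common = i;
		if (ns_root[i] == '\0' || ns_root[i] != target[i])
			break;
		i++;
	}

	for (p = ns_root + common; *p; p++)
		if (*p == '/')
			len = append(buf, buflen, len, "/..");
	len = append(buf, buflen, len, target + common);
	if (len == 0)
		len = append(buf, buflen, len, "/");
	return (int)len;
}
```

With ns_root "/n1/n2/n3/n4" and target "/n1/n2/n5" this produces
"/../../n5", which has the same shape as the "8:memory:/../../.." example
in item 3.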

With this, using github.com/hallyn/lxc #2015-11-09/cgns (and
github.com/hallyn/lxcfs #2015-11-10/cgns) we can start a container in a full
proper cgroup namespace, avoiding either cgmanager or lxcfs cgroup bind mounts.

This is completely backward compatible and will be completely invisible
to any existing cgroup users (except for those running inside a cgroup
namespace and looking at /proc/pid/cgroup of tasks outside their
namespace.)

Changes from V9:
1. Update to latest Linus tree
2. A few locking fixes

Changes from V8:
1. Incorporate updated documentation from tj.
2. Put lookup_one_len() under inode lock
3. Make cgroup_path non-namespaced, so only calls to cgroup_path_ns() are
   namespaced.
4. Make cgroup_path{,_ns} take the needed locks, since external callers cannot
   do so.
5. Fix the bisectability problem of to_cg_ns() being defined after use

Changes from V7:
1. Rework kernfs_path_from_node_locked to return the string length
2. Rename and reorder args to kernfs_path_from_node
3. cgroup.c: undo accidental conversions to inline
4. cgroup.h: move ns declarations to bottom.
5. Rework the documentation to fit the style of the rest of cgroup.txt

Changes from V6:
1. Switch to some WARN_ONs to provide stack traces
2. Rename kernfs_node_distance to kernfs_depth
3. Make sure kernfs_common_ancestor() nodes are from same root
4. Split kernfs changes for cgroup_mount into separate patch
5. Rename kernfs_obtain_root to kernfs_node_dentry
(And more, see patch changelogs)

Changes from V5:
1. To get a root dentry for cgroup namespace mount, walk the path from the
   kernfs root dentry.

Changes from V4:
1. Move the FS_USERNS_MOUNT flag to last patch
2. Rebase onto cgroup/for-4.5
3. Don't allow non-init user namespaces to bind new subsystems when mounting.
4. Address feedback from Tejun (thanks).  Specifically, not addressed:
   . kernfs_obtain_root - walking dentry from kernfs root.
 (I think that's the only piece)
5. Dropped unused get_task_cgroup fn/patch.
6. Reworked kernfs_path_from_node_locked() to try to simplify the logic.
   It now finds a common ancestor, walks from the source to it, then back
   up to the target.

Changes from V3:
1. Rebased onto latest cgroup changes.  In particular switch to
   css_set_lock and ns_common.
2. Support all hierarchies.

Changes from V2:
1. Added documentation in Documentation/cgroups/namespace.txt
2. Fixed a bug that caused crash
3. Incorporated some other suggestions from last patchset:
   - removed use of threadgroup_lock() while creating new cgroupns
   - use task_lock() instead of rcu_read_lock() while accessing
 task->nsproxy
   - optimized setns() to own cgroupns
   - simplified code around sane-behavior mount option parsing
4. Restored ACKs from Serge Hallyn from v1 on few patches that have
   not changed since then.

Changes from V1:
1. No pinning of processes within cgroupns. Tasks can be freely moved
   across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
   apply as before.
2. Path in /proc/<pid>/cgroup is now always shown and is relative to
   cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
   of the reader and cgroup of <pid>.
3. setns() does not require the process to first move under target
   cgroupns-root.

Changes from RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
   mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
   anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
   your cgroupns-root.

[PATCH 1/8] kernfs: Add API to generate relative kernfs path

2016-01-29 Thread serge . hallyn

From: Aditya Kali 

The new function kernfs_path_from_node() generates and returns kernfs
path of a given kernfs_node relative to a given parent kernfs_node.

Signed-off-by: Aditya Kali 
Signed-off-by: Serge E. Hallyn 
Acked-by: Greg Kroah-Hartman 
---
Changelog 20151125:
  - Fully-wing multiline comments
  - Rework kernfs_path_from_node_locked() logic
  - Replace BUG_ONs with returning NULL
  - Use a const char* for /.. and precalculate its size
Changelog 20151130:
  - Update kernfs_path_from_node_locked comment
Changelog 20151208:
  - kernfs_node_distance:
* Remove BUG_ON(NULL)s
* Rename kernfs_node_distance to kernfs_depth
  - kernfs_common_ancestor:
* Remove useless checks for depth == 0
* Add check to ensure nodes are from same root
  - kernfs_path_from_node_locked:
* Remove needless __must_check
* Put p;len on its own decl line.
* Fix wrong WARN_ONCE usage
Changelog 20151209:
  - kernfs_path_from_node: change arguments to 'to' and 'from', and
change their order.
Changelog 20151222:
  - kernfs_path_from_node{,_locked}: return the string length.
kernfs_path is gpl-exported, so changing their return value seemed
ill-advised, but if no one minds I can update it too.
Changelog 20151223:
  - don't allocate memory in pr_cont_kernfs_path() under spinlock
---
 fs/kernfs/dir.c|  192 
 include/linux/kernfs.h |9 ++-
 2 files changed, 166 insertions(+), 35 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 996b774..38fa03a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,28 +44,123 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
- size_t buflen)
+/* kernfs_node_depth - compute depth from @from to @to */
+static size_t kernfs_depth(struct kernfs_node *from, struct kernfs_node *to)
 {
-   char *p = buf + buflen;
-   int len;
+   size_t depth = 0;
 
-   *--p = '\0';
+   while (to->parent && to != from) {
+   depth++;
+   to = to->parent;
+   }
+   return depth;
+}
 
-   do {
-   len = strlen(kn->name);
-   if (p - buf < len + 1) {
-   buf[0] = '\0';
-   p = NULL;
-   break;
-   }
-   p -= len;
-   memcpy(p, kn->name, len);
-   *--p = '/';
-   kn = kn->parent;
-   } while (kn && kn->parent);
+static struct kernfs_node *kernfs_common_ancestor(struct kernfs_node *a,
+ struct kernfs_node *b)
+{
+   size_t da, db;
+   struct kernfs_root *ra = kernfs_root(a), *rb = kernfs_root(b);
 
-   return p;
+   if (ra != rb)
+   return NULL;
+
+   da = kernfs_depth(ra->kn, a);
+   db = kernfs_depth(rb->kn, b);
+
+   while (da > db) {
+   a = a->parent;
+   da--;
+   }
+   while (db > da) {
+   b = b->parent;
+   db--;
+   }
+
+   /* worst case b and a will be the same at root */
+   while (b != a) {
+   b = b->parent;
+   a = a->parent;
+   }
+
+   return a;
+}
+
+/**
+ * kernfs_path_from_node_locked - find a pseudo-absolute path to @kn_to,
+ * where kn_from is treated as root of the path.
+ * @kn_from: kernfs node which should be treated as root for the path
+ * @kn_to: kernfs node to which path is needed
+ * @buf: buffer to copy the path into
+ * @buflen: size of @buf
+ *
+ * We need to handle couple of scenarios here:
+ * [1] when @kn_from is an ancestor of @kn_to at some level
+ * kn_from: /n1/n2/n3
+ * kn_to:   /n1/n2/n3/n4/n5
+ * result:  /n4/n5
+ *
+ * [2] when @kn_from is on a different hierarchy and we need to find common
+ * ancestor between @kn_from and @kn_to.
+ * kn_from: /n1/n2/n3/n4
+ * kn_to:   /n1/n2/n5
+ * result:  /../../n5
+ * OR
+ * kn_from: /n1/n2/n3/n4/n5   [depth=5]
+ * kn_to:   /n1/n2/n3 [depth=3]
+ * result:  /../..
+ *
+ * return value: length of the string.  If greater than buflen,
+ * then contents of buf are undefined.  On error, -1 is returned.
+ */
+static int
+kernfs_path_from_node_locked(struct kernfs_node *kn_to,
+struct kernfs_node *kn_from, char *buf,
+size_t buflen)
+{
+   struct kernfs_node *kn, *common;
+   const char parent_str[] = "/..";
+   size_t depth_from, depth_to, len = 0, nlen = 0;
+   char *p;
+   int i;
+
+   if (!kn_from)
+   kn_from = kernfs_root(kn_to)->kn;
+
+   if (kn_from == kn_to)
+   return strlcpy(buf, "/", buflen);
+
+   common = kernfs_common_ancestor(kn_from, kn_to);
+   if (WARN_ON(!common))
+   return -1

[PATCH 7/8] cgroup: Add documentation for cgroup namespaces

2016-01-29 Thread serge . hallyn

From: Serge Hallyn 

Signed-off-by: Aditya Kali 
Signed-off-by: Serge Hallyn 
Signed-off-by: Tejun Heo 

---
Changelog (2015-12-08):
  Merge into Documentation/cgroup.txt
Changelog (2015-12-22):
  Reformat to try to follow the style of the rest of the cgroup.txt file.
Changelog (2015-12-22):
  tj: Reorganized to better fit the documentation.
---
 Documentation/cgroup-v2.txt |  147 +++
 1 file changed, 147 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 65b3eac..eee9012 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -47,6 +47,11 @@ CONTENTS
   5-3. IO
 5-3-1. IO Interface Files
 5-3-2. Writeback
+6. Namespace
+  6-1. Basics
+  6-2. The Root and Views
+  6-3. Migration and setns(2)
+  6-4. Interaction with Other Namespaces
 P. Information on Kernel Programming
   P-1. Filesystem Support for Writeback
 D. Deprecated v1 Core Features
@@ -1085,6 +1090,148 @@ writeback as follows.
vm.dirty[_background]_ratio.
 
 
+6. Namespace
+
+6-1. Basics
+
+cgroup namespace provides a mechanism to virtualize the view of the
+"/proc/$PID/cgroup" file and cgroup mounts.  The CLONE_NEWCGROUP clone
+flag can be used with clone(2) and unshare(2) to create a new cgroup
+namespace.  The process running inside the cgroup namespace will have
+its "/proc/$PID/cgroup" output restricted to cgroupns root.  The
+cgroupns root is the cgroup of the process at the time of creation of
+the cgroup namespace.
+
+Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
+complete path of the cgroup of a process.  In a container setup where
+a set of cgroups and namespaces are intended to isolate processes the
+"/proc/$PID/cgroup" file may leak potential system level information
+to the isolated processes.  For Example:
+
+  # cat /proc/self/cgroup
+  0::/batchjobs/container_id1
+
+The path '/batchjobs/container_id1' can be considered as system-data
+and undesirable to expose to the isolated processes.  cgroup namespace
+can be used to restrict visibility of this path.  For example, before
+creating a cgroup namespace, one would see:
+
+  # ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
+  # cat /proc/self/cgroup
+  0::/batchjobs/container_id1
+
+After unsharing a new namespace, the view changes.
+
+  # ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
+  # cat /proc/self/cgroup
+  0::/
+
+When some thread from a multi-threaded process unshares its cgroup
+namespace, the new cgroupns gets applied to the entire process (all
+the threads).  This is natural for the v2 hierarchy; however, for the
+legacy hierarchies, this may be unexpected.
+
+A cgroup namespace is alive as long as there are processes inside or
+mounts pinning it.  When the last usage goes away, the cgroup
+namespace is destroyed.  The cgroupns root and the actual cgroups
+remain.
+
+
+6-2. The Root and Views
+
+The 'cgroupns root' for a cgroup namespace is the cgroup in which the
+process calling unshare(2) is running.  For example, if a process in
+/batchjobs/container_id1 cgroup calls unshare, cgroup
+/batchjobs/container_id1 becomes the cgroupns root.  For the
+init_cgroup_ns, this is the real root ('/') cgroup.
+
+The cgroupns root cgroup does not change even if the namespace creator
+process later moves to a different cgroup.
+
+  # ~/unshare -c # unshare cgroupns in some cgroup
+  # cat /proc/self/cgroup
+  0::/
+  # mkdir sub_cgrp_1
+  # echo 0 > sub_cgrp_1/cgroup.procs
+  # cat /proc/self/cgroup
+  0::/sub_cgrp_1
+
+Each process gets its namespace-specific view of "/proc/$PID/cgroup"
+
+Processes running inside the cgroup namespace will be able to see
+cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
+From within an unshared cgroupns:
+
+  # sleep 10 &
+  [1] 7353
+  # echo 7353 > sub_cgrp_1/cgroup.procs
+  # cat /proc/7353/cgroup
+  0::/sub_cgrp_1
+
+From the initial cgroup namespace, the real cgroup path will be
+visible:
+
+  $ cat /proc/7353/cgroup
+  0::/batchjobs/container_id1/sub_cgrp_1
+
+From a sibling cgroup namespace (that is, a namespace rooted at a
+different cgroup), the cgroup path relative to its own cgroup
+namespace root will be shown.  For instance, if PID 7353's cgroup
+namespace root is at '/batchjobs/container_id2', then it will see
+
+  # cat /proc/7353/cgroup
+  0::/../container_id2/sub_cgrp_1
+
+Note that the relative path always starts with '/' to indicate that
+its relative to the cgroup namespace root of the caller.
+
+
+6-3. Migration and setns(2)
+
+Processes inside a cgroup namespace can move into and out of the
+namespace root if they have proper access to external cgroups.  For
+example, from inside a namespace with cgroupns r

[PATCH 6/8] cgroup: mount cgroupns-root when inside non-init cgroupns

2016-01-29 Thread serge . hallyn

From: Serge Hallyn 

This patch enables cgroup mounting inside userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container-setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.

Signed-off-by: Serge Hallyn 
---
Changelog:
20151116 - Don't allow user namespaces to bind new subsystems
20151118 - postpone the FS_USERNS_MOUNT flag until the
   last patch, until we can convince ourselves it
   is safe.
20151207 - Switch to walking up the kernfs path from kn root.
 - Group initialized variables
 - Explain the capable(CAP_SYS_ADMIN) check
 - Style fixes
20160104 - kernfs_node_dentry: lock inode for lookup_one_len()
20160128 - grab needed lock in mount

Signed-off-by: Serge Hallyn 
---
 kernel/cgroup.c |   48 +++-
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 96e3dab..3e04df0 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1983,6 +1983,7 @@ static struct dentry *cgroup_mount(struct 
file_system_type *fs_type,
 {
bool is_v2 = fs_type == &cgroup2_fs_type;
struct super_block *pinned_sb = NULL;
+   struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
struct cgroup_subsys *ss;
struct cgroup_root *root;
struct cgroup_sb_opts opts;
@@ -1991,6 +1992,14 @@ static struct dentry *cgroup_mount(struct 
file_system_type *fs_type,
int i;
bool new_sb;
 
+   get_cgroup_ns(ns);
+
+   /* Check if the caller has permission to mount. */
+   if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+   put_cgroup_ns(ns);
+   return ERR_PTR(-EPERM);
+   }
+
/*
 * The first time anyone tries to mount a cgroup, enable the list
 * linking each css_set to its tasks and fix up all existing tasks.
@@ -2001,6 +2010,7 @@ static struct dentry *cgroup_mount(struct 
file_system_type *fs_type,
if (is_v2) {
if (data) {
pr_err("cgroup2: unknown option \"%s\"\n", (char 
*)data);
+   put_cgroup_ns(ns);
return ERR_PTR(-EINVAL);
}
cgrp_dfl_root_visible = true;
@@ -2106,6 +2116,16 @@ static struct dentry *cgroup_mount(struct 
file_system_type *fs_type,
goto out_unlock;
}
 
+   /*
+* We know this subsystem has not yet been bound.  Users in a non-init
+* user namespace may only mount hierarchies with no bound subsystems,
+* i.e. 'none,name=user1'
+*/
+   if (!opts.none && !capable(CAP_SYS_ADMIN)) {
+   ret = -EPERM;
+   goto out_unlock;
+   }
+
root = kzalloc(sizeof(*root), GFP_KERNEL);
if (!root) {
ret = -ENOMEM;
@@ -2124,12 +2144,37 @@ out_free:
kfree(opts.release_agent);
kfree(opts.name);
 
-   if (ret)
+   if (ret) {
+   put_cgroup_ns(ns);
return ERR_PTR(ret);
+   }
 out_mount:
dentry = kernfs_mount(fs_type, flags, root->kf_root,
  is_v2 ? CGROUP2_SUPER_MAGIC : CGROUP_SUPER_MAGIC,
  &new_sb);
+
+   /*
+* In non-init cgroup namespace, instead of root cgroup's
+* dentry, we return the dentry corresponding to the
+* cgroupns->root_cgrp.
+*/
+   if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {
+   struct dentry *nsdentry;
+   struct cgroup *cgrp;
+
+   mutex_lock(&cgroup_mutex);
+   spin_lock_bh(&css_set_lock);
+
+   cgrp = cset_cgroup_from_root(ns->root_cset, root);
+
+   spin_unlock_bh(&css_set_lock);
+   mutex_unlock(&cgroup_mutex);
+
+   nsdentry = kernfs_node_dentry(cgrp->kn, dentry->d_sb);
+   dput(dentry);
+   dentry = nsdentry;
+   }
+
if (IS_ERR(dentry) || !new_sb)
cgroup_put(&root->cgrp);
 
@@ -2142,6 +2187,7 @@ out_mount:
deactivate_super(pinned_sb);
}
 
+   put_cgroup_ns(ns);
return dentry;
 }
 
-- 
1.7.9.5

[PATCH 8/8] Add FS_USERNS_FLAG to cgroup fs

2016-01-29 Thread serge . hallyn

From: Serge Hallyn 

Set FS_USERNS_MOUNT on the cgroup filesystem types, allowing root in a
non-init user namespace to mount them.  This should now be safe, because

1. non-init-root cannot mount a previously unbound subsystem
2. the task doing the mount must be privileged with respect to the
   user namespace owning the cgroup namespace
3. the mounted subsystem will have its current cgroup as the root dentry.
   the permissions will be unchanged, so tasks will receive no new
   privilege over the cgroups which they did not have on the original
   mounts.

Signed-off-by: Serge Hallyn 
---
 kernel/cgroup.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3e04df0..7a58749 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2216,12 +2216,14 @@ static struct file_system_type cgroup_fs_type = {
.name = "cgroup",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
+   .fs_flags = FS_USERNS_MOUNT,
 };
 
 static struct file_system_type cgroup2_fs_type = {
.name = "cgroup2",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
+   .fs_flags = FS_USERNS_MOUNT,
 };
 
 static char *
-- 
1.7.9.5

[PATCH 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace

2016-01-29 Thread serge . hallyn

From: Aditya Kali 

CLONE_NEWCGROUP will be used to create a new cgroup namespace.

Signed-off-by: Aditya Kali 
Signed-off-by: Serge Hallyn 
---
 include/uapi/linux/sched.h |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index cc89dde..5f0fe01 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED		0x00400000	/* Unused, ignored */
 #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_NEWCGROUP		0x02000000	/* New cgroup namespace */
 #define CLONE_NEWUTS		0x04000000	/* New utsname namespace */
 #define CLONE_NEWIPC		0x08000000	/* New ipc namespace */
 #define CLONE_NEWUSER		0x10000000	/* New user namespace */
-- 
1.7.9.5

[PATCH 5/8] kernfs: define kernfs_node_dentry

2016-01-29 Thread serge . hallyn

From: Aditya Kali 

Add a new kernfs API to look up the dentry for a particular
kernfs path.
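
As a rough userspace illustration (not part of the patch), the lookup can
be pictured as a top-down walk: find the root, then repeatedly find the
next ancestor on the way to the target, resolving one path component per
step.  In the sketch below, struct node, next_ancestor() and walk_down()
are hypothetical stand-ins for kernfs_node and the kernel helpers, and
string concatenation stands in for the per-component dentry lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct kernfs_node: a name and a parent. */
struct node {
	const char *name;
	struct node *parent;
};

/*
 * Return the ancestor of @child (or @child itself) whose parent is @parent.
 * With @parent == NULL this yields the root of the tree.
 */
static struct node *next_ancestor(struct node *child, struct node *parent)
{
	assert(child != parent);
	while (child->parent != parent) {
		if (!child->parent)
			return NULL;
		child = child->parent;
	}
	return child;
}

/*
 * Walk from the root down to @target one component at a time, appending
 * each name to @buf.  Returns 0 on success, -1 if @buf is too small or
 * the walk fails.
 */
static int walk_down(struct node *target, char *buf, size_t buflen)
{
	struct node *cur = next_ancestor(target, NULL);	/* the root */
	size_t used = 0;

	if (!cur || buflen == 0)
		return -1;
	buf[0] = '\0';

	while (cur != target) {
		struct node *next = next_ancestor(target, cur);
		int n;

		if (!next)
			return -1;
		n = snprintf(buf + used, buflen - used, "/%s", next->name);
		if (n < 0 || (size_t)n >= buflen - used)
			return -1;
		used += (size_t)n;
		cur = next;
	}

	/* The target is the root itself: report it as "/". */
	if (used == 0) {
		if (buflen < 2)
			return -1;
		strcpy(buf, "/");
	}
	return 0;
}
```

For a chain root -> a -> b -> c, walk_down() on c visits a, b and c in
that order and builds "/a/b/c".  Each step rescans the parent chain from
the target, which keeps the model simple at the cost of quadratic work
in the depth.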

Signed-off-by: Aditya Kali 
Signed-off-by: Serge E. Hallyn 
Acked-by: Greg Kroah-Hartman 
---
Changelog:
20151116 - Don't allow user namespaces to bind new subsystems
20151118 - postpone the FS_USERNS_MOUNT flag until the
   last patch, until we can convince ourselves it
   is safe.
20151207 - Switch to walking up the kernfs path from kn root.
20151208 - Split out the kernfs change
 - Style changes
 - Switch from pr_crit to WARN_ON
 - Reorder arguments to kernfs_obtain_root
 - rename kernfs_obtain_root to kernfs_node_dentry
20160104 - kernfs_node_dentry: lock inode for lookup_one_len()
---
 fs/kernfs/mount.c  |   69 
 include/linux/kernfs.h |2 ++
 2 files changed, 71 insertions(+)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 8eaf417..074bb8b 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "kernfs-internal.h"
 
@@ -62,6 +63,74 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block 
*sb)
return NULL;
 }
 
+/*
+ * find the next ancestor in the path down to @child, where @parent was the
+ * ancestor whose descendant we want to find.
+ *
+ * Say the path is /a/b/c/d.  @child is d, @parent is NULL.  We return the root
+ * node.  If @parent is b, then we return the node for c.
+ * Passing in d as @parent is not ok.
+ */
+static struct kernfs_node *
+find_next_ancestor(struct kernfs_node *child, struct kernfs_node *parent)
+{
+	if (child == parent) {
+		pr_crit_once("BUG in find_next_ancestor: called with parent == child");
+		return NULL;
+	}
+
+	while (child->parent != parent) {
+		if (!child->parent)
+			return NULL;
+		child = child->parent;
+	}
+
+	return child;
+}
+
+/**
+ * kernfs_node_dentry - get a dentry for the given kernfs_node
+ * @kn: kernfs_node for which a dentry is needed
+ * @sb: the kernfs super_block
+ */
+struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
+				  struct super_block *sb)
+{
+	struct dentry *dentry;
+	struct kernfs_node *knparent = NULL;
+
+	BUG_ON(sb->s_op != &kernfs_sops);
+
+	dentry = dget(sb->s_root);
+
+	/* Check if this is the root kernfs_node */
+	if (!kn->parent)
+		return dentry;
+
+	knparent = find_next_ancestor(kn, NULL);
+	if (WARN_ON(!knparent))
+		return ERR_PTR(-EINVAL);
+
+	do {
+		struct dentry *dtmp;
+		struct kernfs_node *kntmp;
+
+		if (kn == knparent)
+			return dentry;
+		kntmp = find_next_ancestor(kn, knparent);
+		if (WARN_ON(!kntmp))
+			return ERR_PTR(-EINVAL);
+		mutex_lock(&d_inode(dentry)->i_mutex);
+		dtmp = lookup_one_len(kntmp->name, dentry, strlen(kntmp->name));
+		mutex_unlock(&d_inode(dentry)->i_mutex);
+		dput(dentry);
+		if (IS_ERR(dtmp))
+			return dtmp;
+		knparent = kntmp;
+		dentry = dtmp;
+	} while (1);
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 716bfde..c06c442 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -284,6 +284,8 @@ struct kernfs_node *kernfs_node_from_dentry(struct dentry 
*dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
 struct inode *kernfs_get_inode(struct super_block *sb, struct kernfs_node *kn);
 
+struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
+ struct super_block *sb);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
   unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
-- 
1.7.9.5

[PATCH 4/8] cgroup: cgroup namespace setns support

2016-01-29 Thread serge . hallyn

From: Aditya Kali 

setns on a cgroup namespace is allowed only if the task has
CAP_SYS_ADMIN in its current user namespace and over the user
namespace associated with the target cgroupns.  Attaching to another
cgroupns causes no implicit cgroup changes; it is expected that
someone moves the attaching process under the target cgroupns-root.
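
From userspace, joining another task's cgroup namespace would look
roughly like the sketch below.  This is an illustration only, not part
of the patch: join_cgroup_ns() is a hypothetical helper, and it assumes
the namespace is exposed as /proc/<pid>/ns/cgroup and that setns(2)
accepts CLONE_NEWCGROUP as the namespace type.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* value from the uapi sched.h */
#endif

/*
 * Join the cgroup namespace of @pid (hypothetical helper).
 * Returns 0 on success, -1 with errno set otherwise; without
 * CAP_SYS_ADMIN in both user namespaces the kernel returns EPERM.
 * The caller's cgroup membership is not changed by this call.
 */
static int join_cgroup_ns(pid_t pid)
{
	char path[64];
	int fd, ret, saved;

	if (pid <= 0) {
		errno = EINVAL;
		return -1;
	}

	snprintf(path, sizeof(path), "/proc/%ld/ns/cgroup", (long)pid);
	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	ret = setns(fd, CLONE_NEWCGROUP);
	saved = errno;		/* close() may clobber errno */
	close(fd);
	errno = saved;
	return ret;
}
```

After a successful call the task still sits in its old cgroups, so a
manager would normally also write the task's PID to cgroup.procs under
the target cgroupns-root.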

Signed-off-by: Aditya Kali 
Signed-off-by: Serge E. Hallyn 
---
 kernel/cgroup.c |   19 ---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d828e1f..96e3dab 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -6004,10 +6004,23 @@ static inline struct cgroup_namespace *to_cg_ns(struct ns_common *ns)
 	return container_of(ns, struct cgroup_namespace, ns);
 }
 
-static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+static int cgroupns_install(struct nsproxy *nsproxy, struct ns_common *ns)
 {
-	pr_info("setns not supported for cgroup namespace");
-	return -EINVAL;
+	struct cgroup_namespace *cgroup_ns = to_cg_ns(ns);
+
+	if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+	    !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* Don't need to do anything if we are attaching to our own cgroupns. */
+	if (cgroup_ns == nsproxy->cgroup_ns)
+		return 0;
+
+	get_cgroup_ns(cgroup_ns);
+	put_cgroup_ns(nsproxy->cgroup_ns);
+	nsproxy->cgroup_ns = cgroup_ns;
+
+	return 0;
 }
 
 static struct ns_common *cgroupns_get(struct task_struct *task)
-- 
1.7.9.5

Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled

2016-01-26 Thread Serge Hallyn

Quoting Josh Boyer (jwbo...@fedoraproject.org):
> On Tue, Jan 26, 2016 at 9:46 AM, Austin S. Hemmelgarn
>  wrote:
> > On 2016-01-26 09:38, Josh Boyer wrote:
> >>
> >> On Mon, Jan 25, 2016 at 11:57 PM, Eric W. Biederman
> >>  wrote:
> >>>
> >>> Kees Cook  writes:
> >>>
> >>>> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
> >>>>  wrote:
> >>>>>
> >>>>> Kees Cook  writes:
> >>>>>>
> >>>>>>
> >>>>>> Well, I don't know about less weird, but it would leave an unneeded
> >>>>>> hole in the permission checks.
> >>>>>
> >>>>>
> >>>>> To be clear the current patch has my:
> >>>>>
> >>>>> Nacked-by: "Eric W. Biederman" 
> >>>>>
> >>>>> The code is buggy, and poorly thought through.  Your lack of interest in
> >>>>> fixing the bugs in your patch is distressing.
> >>>>
> >>>>
> >>>> I'm not sure where you see me having a "lack of interest". The
> >>>> existing cap-checking sysctls have a corner-case bug, which is
> >>>> orthogonal to this change.
> >>>
> >>>
> >>> That certainly doesn't sound like you have any plans to change anything
> >>> there.
> >>>
> >>>>> So broken code, not willing to fix.  No. We are not merging this
> >>>>> sysctl.
> >>>>
> >>>>
> >>>> I think you're jumping to conclusions. :)
> >>>
> >>>
> >>> I think I am the maintainer.
> >>>
> >>> What you are proposing is very much something that is only of interest to
> >>> people who are not using user namespaces.  It is fatally flawed as
> >>> a way to avoid new attack surfaces for people who don't care as the
> >>> sysctl leaves user namespaces enabled by default.  It is fatally flawed
> >>> as remediation to recommend to people to change if a new user namespace
> >>> related bug is discovered.  Any running process that happens to be
> >>> created while user namespace creation was enabled will continue to
> >>> exist.  Effectively a reboot will be required as part of a mitigation.
> >>> Many sysadmins will get that wrong.
> >>>
> >>> I can't possibly see your sysctl as proposed achieving its goals.  A
> >>> person has to be entirely too aware of subtlety and nuance to use it
> >>> effectively.
> >>
> >>
> >> What you're saying is true for the "oh crap" case of a new userns
> >> related CVE being found.  However, there is the case where sysadmins
> >> know for a fact that a set of machines should not allow user
> >> namespaces to be enabled.  Currently they have 2 choices, 1) use their
> >> distro kernel as-is, which may not meet their goal of having userns
> >> disabled, or 2) rebuild their kernel to disable it, which may
> >> invalidate any support contracts they have.
> >>
> >> I tend to agree with you on the lack of value around runtime
> >> mitigation, but allowing an admin to toggle this as a blatant on/off
> >> switch on reboot does have value.
> >>
> >>>> This feature is already implemented by two distros, and likely wanted
> >>>> by others. We cannot ignore that. The sysctl default doesn't change
> >>>> the existing behavior, so this doesn't get in your way at all. Can you
> >>>> please respond to my earlier email where I rebutted each of your
> >>>> arguments against it? Just saying "no" and putting words in my mouth
> >>>> isn't very productive.
> >>>
> >>>
> >>> Calling people who make mistakes insane is not a rebuttal.  In security
> >>> usability matters, and your sysctl has low usability.
> >>>
> >>> Further you seem to have missed something crucial in your understanding.
> >>> As was explained earlier the sysctl was added to ubuntu to allow early
> >>> adopters to experiment not as a long term way of managing user
> >>> namespaces.
> >>>
> >>>
> >>> What sounds like a generally useful feature that would cover your use
> >>> case and many others is a per user limit on the number of user
> >>> namespaces users may create.
> >>
> >>
> >> Where that number may be zero?  I don't see how that is really any
> >> better than a sysctl.  Could you elaborate?
> >
> > It's a better option because it would allow better configurability. Take for
> > example a single user desktop system with some network daemons.  On such a
> > system, the actual login used for the graphical environment by the user
> > should be allowed at least a few user namespaces, because some software
> > depends on them for security (Chrome for example, as well as some distro's
> > build systems), but system users should be limited to at most one if they
> > need it, and ideally zero, so that remote exploits couldn't give access to a
> > user namespace.
> >
> > Conversely, on a server system, it's not unreasonable to completely disable
> > user namespaces for almost everything, except for giving one to services
> > that use them properly for sand-boxing.
> 
> OK, so better granularity.  Fine.
> 
> > I will state though that I only feel this is a better solution given that
> > two criteria are met:
> > 1. You can set 0 as the limit.
> > 2. You can configure this without needing some special software (this in
> > particular means that seccomp is not an option).
> 
> I

Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled

2016-01-26 Thread Serge Hallyn

Quoting Josh Boyer (jwbo...@fedoraproject.org):
> On Mon, Jan 25, 2016 at 11:57 PM, Eric W. Biederman
>  wrote:
> > Kees Cook  writes:
> >
> >> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
> >>  wrote:
> >>> Kees Cook  writes:
> >>>>
> >>>> Well, I don't know about less weird, but it would leave an unneeded
> >>>> hole in the permission checks.
> >>>
> >>> To be clear the current patch has my:
> >>>
> >>> Nacked-by: "Eric W. Biederman" 
> >>>
> >>> The code is buggy, and poorly thought through.  Your lack of interest in
> >>> fixing the bugs in your patch is distressing.
> >>
> >> I'm not sure where you see me having a "lack of interest". The
> >> existing cap-checking sysctls have a corner-case bug, which is
> >> orthogonal to this change.
> >
> > That certainly doesn't sound like you have any plans to change anything
> > there.
> >
> >>> So broken code, not willing to fix.  No. We are not merging this sysctl.
> >>
> >> I think you're jumping to conclusions. :)
> >
> > I think I am the maintainer.
> >
> > What you are proposing is very much something that is only of interest to
> > people who are not using user namespaces.  It is fatally flawed as
> > a way to avoid new attack surfaces for people who don't care as the
> > sysctl leaves user namespaces enabled by default.  It is fatally flawed
> > as remediation to recommend to people to change if a new user namespace
> > related bug is discovered.  Any running process that happens to be
> > created while user namespace creation was enabled will continue to
> > exist.  Effectively a reboot will be required as part of a mitigation.
> > Many sysadmins will get that wrong.
> >
> > I can't possibly see your sysctl as proposed achieving its goals.  A
> > person has to be entirely too aware of subtlety and nuance to use it
> > effectively.
> 
> What you're saying is true for the "oh crap" case of a new userns
> related CVE being found.  However, there is the case where sysadmins
> know for a fact that a set of machines should not allow user
> namespaces to be enabled.  Currently they have 2 choices, 1) use their

Hi - can you give a specific example of this?  (Where users really should
not be able to use them - not where they might not need them)  I think
it'll help the discussion tremendously.  Because so far the only good
arguments I've seen have been about actual bugs in the user namespaces,
which would not warrant a designed-in permanent disable switch.  If
there are good use cases where such a disable switch will always be
needed (and compiling out can't satisfy) that'd be helpful.

thanks,
-serge

Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled

2016-01-25 Thread Serge Hallyn

Quoting Kees Cook (keesc...@chromium.org):
> On Fri, Jan 22, 2016 at 7:02 PM, Eric W. Biederman
> > So I have concerns about both efficacy and usability with the proposed
> > sysctl.
> 
> Two distros already have this sysctl because it was so strongly
> requested by their users. This needs to be upstream so we can manage
> the effects correctly.

Which two distros?  Was it in fact requested by their users?

My opinion remains that long-term this is a bad thing.  If we're going to
have this upstream, it should be clearly marked so as to be easily
removable at some point down the road.  Userspace that cannot count on a
feature (in the best case) won't use it or (much worse) will fall back
to broken behavior in one case or the other.

Re: [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled

2016-01-22 Thread Serge Hallyn

Quoting Kees Cook (keesc...@chromium.org):
> On Fri, Jan 22, 2016 at 2:55 PM, Robert Święcki  wrote:
> > 2016-01-22 23:50 GMT+01:00 Kees Cook :
> >
> >>> Seems that Debian and some older Ubuntu versions are already using
> >>>
> >>> $ sysctl -a | grep usern
> >>> kernel.unprivileged_userns_clone = 0
> >>>
> >>> Shall we be consistent with it?
> >>
> >> Oh! I didn't see that on systems I checked. On which version did you find 
> >> that?
> >
> > $ uname -a
> > Linux bc1 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.3-5~bpo8+1
> > (2016-01-07) x86_64 GNU/Linux
> > $ cat /etc/debian_version
> > 8.2
> 
> Ah-ha, Debian only, though it looks like this was just committed to
> the Ubuntu kernel tree too:
> 
> 
> > IIRC some older kernels delivered with Ubuntu Precise were also using
> > it (but maybe I'm mistaken)
> 
> I don't see it there.
> 
> I think my patch is more complete, but I'm happy to change the name if
> this sysctl has already started to enter the global consciousness. ;)
> 
> Serge, Ben, what do you think?

Oh, sorry - as for the name of it, what is the alternative you are proposing?

Re: [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled

2016-01-22 Thread Serge Hallyn

Quoting Kees Cook (keesc...@chromium.org):
> On Fri, Jan 22, 2016 at 2:55 PM, Robert Święcki  wrote:
> > 2016-01-22 23:50 GMT+01:00 Kees Cook :
> >
> >>> Seems that Debian and some older Ubuntu versions are already using
> >>>
> >>> $ sysctl -a | grep usern
> >>> kernel.unprivileged_userns_clone = 0
> >>>
> >>> Shall we be consistent with it?
> >>
> >> Oh! I didn't see that on systems I checked. On which version did you find 
> >> that?
> >
> > $ uname -a
> > Linux bc1 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.3-5~bpo8+1
> > (2016-01-07) x86_64 GNU/Linux
> > $ cat /etc/debian_version
> > 8.2
> 
> Ah-ha, Debian only, though it looks like this was just committed to
> the Ubuntu kernel tree too:
> 
> 
> > IIRC some older kernels delivered with Ubuntu Precise were also using
> > it (but maybe I'm mistaken)
> 
> I don't see it there.
> 
> I think my patch is more complete, but I'm happy to change the name if
> this sysctl has already started to enter the global consciousness. ;)
> 
> Serge, Ben, what do you think?
> 
> -Kees

Hey,

I had originally written this for Ubuntu when userns was still new
and not upstream.  Then we dropped it when it got upstream.

The reason we are re-adding it is because we're going to be pushing the
envelope again wrt unprivileged userns usage.  Seth has been working on
supporting mounts of fuse, for instance.  When everything is upstream,
(or we drop it :) we'll drop the patch again.

-serge

[PATCH 1/8] kernfs: Add API to generate relative kernfs path

2016-01-04 Thread serge . hallyn

From: Aditya Kali 

The new function kernfs_path_from_node() generates and returns kernfs
path of a given kernfs_node relative to a given parent kernfs_node.

Signed-off-by: Aditya Kali 
Signed-off-by: Serge E. Hallyn 
Acked-by: Greg Kroah-Hartman 
---
Changelog 20151125:
  - Fully-wing multiline comments
  - Rework kernfs_path_from_node_locked() logic
  - Replace BUG_ONs with returning NULL
  - Use a const char* for /.. and precalculate its size
Changelog 20151130:
  - Update kernfs_path_from_node_locked comment
Changelog 20151208:
  - kernfs_node_distance:
* Remove BUG_ON(NULL)s
* Rename kernfs_node_distance to kernfs_depth
  - kernfs_common_ancestor:
* Remove useless checks for depth == 0
* Add check to ensure nodes are from same root
  - kernfs_path_from_node_locked:
* Remove needless __must_check
* Put p;len on its own decl line.
* Fix wrong WARN_ONCE usage
Changelog 20151209:
  - kernfs_path_from_node: change arguments to 'to' and 'from', and
change their order.
Changelog 20151222:
  - kernfs_path_from_node{,_locked}: return the string length.
kernfs_path is gpl-exported, so changing their return value seemed
ill-advised, but if no one minds I can update it too.
Changelog 20151223:
  - don't allocate memory in pr_cont_kernfs_path() under spinlock
---
 fs/kernfs/dir.c|  192 
 include/linux/kernfs.h |9 ++-
 2 files changed, 166 insertions(+), 35 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 742bf4a..f2b2187 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,28 +44,123 @@ static int kernfs_name_locked(struct kernfs_node *kn, char 
*buf, size_t buflen)
return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char 
*buf,
- size_t buflen)
+/* kernfs_depth - compute depth from @from to @to */
+static size_t kernfs_depth(struct kernfs_node *from, struct kernfs_node *to)
 {
-   char *p = buf + buflen;
-   int len;
+   size_t depth = 0;
 
-   *--p = '\0';
+   while (to->parent && to != from) {
+   depth++;
+   to = to->parent;
+   }
+   return depth;
+}
 
-   do {
-   len = strlen(kn->name);
-   if (p - buf < len + 1) {
-   buf[0] = '\0';
-   p = NULL;
-   break;
-   }
-   p -= len;
-   memcpy(p, kn->name, len);
-   *--p = '/';
-   kn = kn->parent;
-   } while (kn && kn->parent);
+static struct kernfs_node *kernfs_common_ancestor(struct kernfs_node *a,
+ struct kernfs_node *b)
+{
+   size_t da, db;
+   struct kernfs_root *ra = kernfs_root(a), *rb = kernfs_root(b);
+
+   if (ra != rb)
+   return NULL;
+
+   da = kernfs_depth(ra->kn, a);
+   db = kernfs_depth(rb->kn, b);
+
+   while (da > db) {
+   a = a->parent;
+   da--;
+   }
+   while (db > da) {
+   b = b->parent;
+   db--;
+   }
+
+   /* worst case b and a will be the same at root */
+   while (b != a) {
+   b = b->parent;
+   a = a->parent;
+   }
+
+   return a;
+}
+
+/**
+ * kernfs_path_from_node_locked - find a pseudo-absolute path to @kn_to,
+ * where kn_from is treated as root of the path.
+ * @kn_from: kernfs node which should be treated as root for the path
+ * @kn_to: kernfs node to which path is needed
+ * @buf: buffer to copy the path into
+ * @buflen: size of @buf
+ *
+ * We need to handle a couple of scenarios here:
+ * [1] when @kn_from is an ancestor of @kn_to at some level
+ * kn_from: /n1/n2/n3
+ * kn_to:   /n1/n2/n3/n4/n5
+ * result:  /n4/n5
+ *
+ * [2] when @kn_from is on a different hierarchy and we need to find common
+ * ancestor between @kn_from and @kn_to.
+ * kn_from: /n1/n2/n3/n4
+ * kn_to:   /n1/n2/n5
+ * result:  /../../n5
+ * OR
+ * kn_from: /n1/n2/n3/n4/n5   [depth=5]
+ * kn_to:   /n1/n2/n3 [depth=3]
+ * result:  /../..
+ *
+ * return value: length of the string.  If greater than buflen,
+ * then contents of buf are undefined.  On error, -1 is returned.
+ */
+static int
+kernfs_path_from_node_locked(struct kernfs_node *kn_to,
+struct kernfs_node *kn_from, char *buf,
+size_t buflen)
+{
+   struct kernfs_node *kn, *common;
+   const char parent_str[] = "/..";
+   size_t depth_from, depth_to, len = 0, nlen = 0;
+   char *p;
+   int i;
+
+   if (!kn_from)
+   kn_from = kernfs_root(kn_to)->kn;
+
+   if (kn_from == kn_to)
+   return strlcpy(buf, "/", buflen);
+
+   common = kernfs_common_ancestor(kn_from, kn_to);
+   if (WARN_ON(!common))
+   return -1;
+
+   depth_

CGroup Namespaces (v9)

2016-01-04 Thread serge . hallyn

Hi,

following is a revised set of the CGroup Namespace patchset which Aditya
Kali has previously sent.  The code can also be found in the cgroupns.v9
branch of

https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/

To summarize the semantics:

1. CLONE_NEWCGROUP re-uses 0x02000000, which was previously CLONE_STOPPED

2. unsharing a cgroup namespace makes all your current cgroups your new
cgroup root.

3. /proc/pid/cgroup always shows cgroup paths relative to the reader's
cgroup namespace root.  A task outside of your cgroup looks like

8:memory:/../../..

4. when a task mounts a cgroupfs, the cgroup which shows up as root depends
on the mounting task's cgroup namespace.

5. setns to a cgroup namespace switches your cgroup namespace but not
your cgroups.

With this, using github.com/hallyn/lxc #2015-11-09/cgns (and
github.com/hallyn/lxcfs #2015-11-10/cgns) we can start a container in a full
proper cgroup namespace, avoiding either cgmanager or lxcfs cgroup bind mounts.

This is completely backward compatible and will be completely invisible
to any existing cgroup users (except for those running inside a cgroup
namespace and looking at /proc/pid/cgroup of tasks outside their
namespace.)
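
For tools that read /proc/pid/cgroup, one practical consequence is that
a path may now start with "/..".  The following userspace sketch shows
one way a reader could parse a line and detect that case.  It is an
illustration, not part of the patchset: struct cgroup_entry,
parse_cgroup_line() and outside_ns_root() are hypothetical names, and
the line format is assumed to be "hierarchy-id:controllers:path".

```c
#include <stdlib.h>
#include <string.h>

/* One parsed line of /proc/<pid>/cgroup (hypothetical helper type). */
struct cgroup_entry {
	int hierarchy_id;
	char controllers[64];
	char path[256];
};

/* Parse "id:controllers:path"; returns 0 on success, -1 on bad input. */
static int parse_cgroup_line(const char *line, struct cgroup_entry *e)
{
	const char *c1 = strchr(line, ':');
	const char *c2 = c1 ? strchr(c1 + 1, ':') : NULL;
	size_t clen, plen;
	char *end;
	long id;

	if (!c2)
		return -1;

	id = strtol(line, &end, 10);
	if (end == line || end != c1 || id < 0)
		return -1;

	clen = (size_t)(c2 - c1 - 1);
	plen = strcspn(c2 + 1, "\n");	/* drop a trailing newline */
	if (clen >= sizeof(e->controllers) || plen >= sizeof(e->path))
		return -1;

	memcpy(e->controllers, c1 + 1, clen);
	e->controllers[clen] = '\0';
	memcpy(e->path, c2 + 1, plen);
	e->path[plen] = '\0';
	e->hierarchy_id = (int)id;
	return 0;
}

/* True if the path climbs above the reader's cgroup namespace root. */
static int outside_ns_root(const struct cgroup_entry *e)
{
	return strncmp(e->path, "/..", 3) == 0 &&
	       (e->path[3] == '/' || e->path[3] == '\0');
}
```

Given the example line above, "8:memory:/../../..", the parser yields
hierarchy 8, controller list "memory", and a path that outside_ns_root()
flags as lying outside the reader's namespace root.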

Changes from V8:
1. Incorporate updated documentation from tj.
2. Put lookup_one_len() under inode lock
3. Make cgroup_path non-namespaced, so only calls to cgroup_path_ns() are
   namespaced.
4. Make cgroup_path{,_ns} take the needed locks, since external callers cannot
   do so.
5. Fix the bisectability problem of to_cg_ns() being defined after use

Changes from V7:
1. Rework kernfs_path_from_node_locked to return the string length
2. Rename and reorder args to kernfs_path_from_node
3. cgroup.c: undo accidental conversions to inline
4. cgroup.h: move ns declarations to bottom.
5. Rework the documentation to fit the style of the rest of cgroup.txt

Changes from V6:
1. Switch to some WARN_ONs to provide stack traces
2. Rename kernfs_node_distance to kernfs_depth
3. Make sure kernfs_common_ancestor() nodes are from same root
4. Split kernfs changes for cgroup_mount into separate patch
5. Rename kernfs_obtain_root to kernfs_node_dentry
(And more, see patch changelogs)

Changes from V5:
1. To get a root dentry for cgroup namespace mount, walk the path from the
   kernfs root dentry.

Changes from V4:
1. Move the FS_USERNS_MOUNT flag to last patch
2. Rebase onto cgroup/for-4.5
3. Don't allow non-init user namespaces to bind new subsystems when mounting.
4. Address feedback from Tejun (thanks).  Specifically, not addressed:
   . kernfs_obtain_root - walking dentry from kernfs root.
 (I think that's the only piece)
5. Dropped unused get_task_cgroup fn/patch.
6. Reworked kernfs_path_from_node_locked() to try to simplify the logic.
   It now finds a common ancestor, walks from the source to it, then back
   up to the target.

Changes from V3:
1. Rebased onto latest cgroup changes.  In particular switch to
   css_set_lock and ns_common.
2. Support all hierarchies.

Changes from V2:
1. Added documentation in Documentation/cgroups/namespace.txt
2. Fixed a bug that caused crash
3. Incorporated some other suggestions from last patchset:
   - removed use of threadgroup_lock() while creating new cgroupns
   - use task_lock() instead of rcu_read_lock() while accessing
 task->nsproxy
   - optimized setns() to own cgroupns
   - simplified code around sane-behavior mount option parsing
4. Restored ACKs from Serge Hallyn from v1 on few patches that have
   not changed since then.

Changes from V1:
1. No pinning of processes within cgroupns. Tasks can be freely moved
   across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
   apply as before.
2. Path in /proc//cgroup is now always shown and is relative to
   cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
   of the reader and cgroup of .
3. setns() does not require the process to first move under target
   cgroupns-root.

Changes from RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup ' from inside a cgroupns now
   mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc//cgroup is further restricted by not showing
   anything if the  is in a sibling cgroupns and its cgroup falls outside
   your cgroupns-root.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 6/8] cgroup: mount cgroupns-root when inside non-init cgroupns

2016-01-04 Thread serge . hallyn

From: Serge Hallyn 

This patch enables cgroup mounting inside userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container-setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.

Signed-off-by: Serge Hallyn 
---
Changelog:
20151116 - Don't allow user namespaces to bind new subsystems
20151118 - postpone the FS_USERNS_MOUNT flag until the
   last patch, until we can convince ourselves it
   is safe.
20151207 - Switch to walking up the kernfs path from kn root.
 - Group initialized variables
 - Explain the capable(CAP_SYS_ADMIN) check
 - Style fixes
20160104 - kernfs_node_dentry: lock inode for lookup_one_len()

---
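As a reading aid, the two permission checks this patch adds to
cgroup_mount() can be summarised as a small user-space decision function.
This is only a sketch under assumptions: struct mount_request and
cgroup_mount_permitted() are hypothetical names, and the booleans stand in
for the results of ns_capable()/capable() and for whether a matching
hierarchy already exists.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical summary of the facts cgroup_mount() consults; not a kernel type. */
struct mount_request {
	bool admin_in_cgroupns_userns; /* ns_capable(ns->user_ns, CAP_SYS_ADMIN) */
	bool admin_in_init_userns;     /* capable(CAP_SYS_ADMIN) */
	bool hierarchy_exists;         /* a matching hierarchy is already set up */
	bool opts_none;                /* mounted with "none", e.g. none,name=user1 */
};

/* Returns 0 if the mount may proceed, -EPERM otherwise. */
static int cgroup_mount_permitted(const struct mount_request *req)
{
	/* First check: privilege over the user namespace owning the cgroupns. */
	if (!req->admin_in_cgroupns_userns)
		return -EPERM;

	/* Re-mounting an existing hierarchy binds nothing new. */
	if (req->hierarchy_exists)
		return 0;

	/* A new hierarchy that binds subsystems needs init-userns privilege. */
	if (!req->opts_none && !req->admin_in_init_userns)
		return -EPERM;

	return 0;
}
```

A container root (privileged only inside its own user namespace) would
therefore get 0 for 'none,name=user1' and -EPERM when asking to bind a
previously unbound controller.
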
 fs/kernfs/mount.c |2 ++
 kernel/cgroup.c   |   40 +++-
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 7224296..074bb8b 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -120,7 +120,9 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
kntmp = find_next_ancestor(kn, knparent);
if (WARN_ON(!kntmp))
return ERR_PTR(-EINVAL);
+   mutex_lock(&d_inode(dentry)->i_mutex);
dtmp = lookup_one_len(kntmp->name, dentry, strlen(kntmp->name));
+   mutex_unlock(&d_inode(dentry)->i_mutex);
dput(dentry);
if (IS_ERR(dtmp))
return dtmp;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2bb58a1..d0bed8f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1983,6 +1983,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 {
bool is_v2 = fs_type == &cgroup2_fs_type;
struct super_block *pinned_sb = NULL;
+   struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
struct cgroup_subsys *ss;
struct cgroup_root *root;
struct cgroup_sb_opts opts;
@@ -1991,6 +1992,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int i;
bool new_sb;
 
+   get_cgroup_ns(ns);
+
+   /* Check if the caller has permission to mount. */
+   if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+   put_cgroup_ns(ns);
+   return ERR_PTR(-EPERM);
+   }
+
/*
 * The first time anyone tries to mount a cgroup, enable the list
 * linking each css_set to its tasks and fix up all existing tasks.
@@ -2106,6 +2115,16 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
goto out_unlock;
}
 
+   /*
+* We know this subsystem has not yet been bound.  Users in a non-init
+* user namespace may only mount hierarchies with no bound subsystems,
+* i.e. 'none,name=user1'
+*/
+   if (!opts.none && !capable(CAP_SYS_ADMIN)) {
+   ret = -EPERM;
+   goto out_unlock;
+   }
+
root = kzalloc(sizeof(*root), GFP_KERNEL);
if (!root) {
ret = -ENOMEM;
@@ -2124,12 +2143,30 @@ out_free:
kfree(opts.release_agent);
kfree(opts.name);
 
-   if (ret)
+   if (ret) {
+   put_cgroup_ns(ns);
return ERR_PTR(ret);
+   }
 out_mount:
dentry = kernfs_mount(fs_type, flags, root->kf_root,
  is_v2 ? CGROUP2_SUPER_MAGIC : CGROUP_SUPER_MAGIC,
  &new_sb);
+
+   /*
+* In non-init cgroup namespace, instead of root cgroup's
+* dentry, we return the dentry corresponding to the
+* cgroupns->root_cgrp.
+*/
+   if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {
+   struct dentry *nsdentry;
+   struct cgroup *cgrp;
+
+   cgrp = cset_cgroup_from_root(ns->root_cset, root);
+   nsdentry = kernfs_node_dentry(cgrp->kn, dentry->d_sb);
+   dput(dentry);
+   dentry = nsdentry;
+   }
+
if (IS_ERR(dentry) || !new_sb)
cgroup_put(&root->cgrp);
 
@@ -2142,6 +2179,7 @@ out_mount:
deactivate_super(pinned_sb);
}
 
+   put_cgroup_ns(ns);
return dentry;
 }
 
-- 
1.7.9.5


[PATCH 8/8] Add FS_USERNS_FLAG to cgroup fs

2016-01-04 Thread serge . hallyn

From: Serge Hallyn 

Set FS_USERNS_MOUNT on the cgroup filesystem types, allowing root in a
non-init user namespace to mount them.  This should now be safe, because

1. non-init-root cannot mount a previously unbound subsystem
2. the task doing the mount must be privileged with respect to the
   user namespace owning the cgroup namespace
3. the mounted subsystem will have its current cgroup as the root dentry.
   The permissions will be unchanged, so tasks will receive no new
   privilege over the cgroups which they did not have on the original
   mounts.

Signed-off-by: Serge Hallyn 
---
 kernel/cgroup.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d0bed8f..f2c47c1 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2208,12 +2208,14 @@ static struct file_system_type cgroup_fs_type = {
.name = "cgroup",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
+   .fs_flags = FS_USERNS_MOUNT,
 };
 
 static struct file_system_type cgroup2_fs_type = {
.name = "cgroup2",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
+   .fs_flags = FS_USERNS_MOUNT,
 };
 
 char *
-- 
1.7.9.5


[PATCH 7/8] cgroup: Add documentation for cgroup namespaces

2016-01-04 Thread serge . hallyn

From: Serge Hallyn 

Signed-off-by: Aditya Kali 
Signed-off-by: Serge Hallyn 
Signed-off-by: Tejun Heo 

---
Changelog (2015-12-08):
  Merge into Documentation/cgroup.txt
Changelog (2015-12-22):
  Reformat to try to follow the style of the rest of the cgroup.txt file.
Changelog (2015-12-22):
  tj: Reorganized to better fit the documentation.
---
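The documentation below shows "/proc/$PID/cgroup" lines such as
"0::/batchjobs/container_id1".  For readers who want to consume that file
from a program, here is a minimal user-space sketch that splits one line
into hierarchy ID, controller list and path.  parse_cgroup_line() is a
hypothetical helper, not an existing API, and it assumes the
"ID:controllers:path" layout shown in the examples.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Split one "/proc/$PID/cgroup" line of the form "ID:controllers:path".
 * Returns 0 on success, -1 if the line is malformed or a buffer is too small.
 */
static int parse_cgroup_line(const char *line, int *id,
			     char *ctrl, size_t ctrl_len,
			     char *path, size_t path_len)
{
	const char *c1 = strchr(line, ':');
	const char *c2 = c1 ? strchr(c1 + 1, ':') : NULL;
	char *end;
	long val;
	size_t n;

	if (!c2)
		return -1;

	/* The hierarchy ID must be a number that ends exactly at the first ':'. */
	val = strtol(line, &end, 10);
	if (end == line || end != c1)
		return -1;
	*id = (int)val;

	/* Controller list: empty for the v2 hierarchy ("0::..."). */
	n = (size_t)(c2 - c1 - 1);
	if (n >= ctrl_len)
		return -1;
	memcpy(ctrl, c1 + 1, n);
	ctrl[n] = '\0';

	/* Path: everything after the second ':' up to the newline. */
	n = strcspn(c2 + 1, "\n");
	if (n >= path_len)
		return -1;
	memcpy(path, c2 + 1, n);
	path[n] = '\0';

	return 0;
}
```

Inside a cgroup namespace the path returned this way is already relative
to the cgroupns root, so a tool does not need to strip any host prefix.
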
 Documentation/cgroup.txt |  147 ++
 1 file changed, 147 insertions(+)

diff --git a/Documentation/cgroup.txt b/Documentation/cgroup.txt
index 31d1f7b..983ba63 100644
--- a/Documentation/cgroup.txt
+++ b/Documentation/cgroup.txt
@@ -47,6 +47,11 @@ CONTENTS
   5-3. IO
 5-3-1. IO Interface Files
 5-3-2. Writeback
+6. Namespace
+  6-1. Basics
+  6-2. The Root and Views
+  6-3. Migration and setns(2)
+  6-4. Interaction with Other Namespaces
 P. Information on Kernel Programming
   P-1. Filesystem Support for Writeback
 D. Deprecated v1 Core Features
@@ -1013,6 +1018,148 @@ writeback as follows.
vm.dirty[_background]_ratio.
 
 
+6. Namespace
+
+6-1. Basics
+
+cgroup namespace provides a mechanism to virtualize the view of the
+"/proc/$PID/cgroup" file and cgroup mounts.  The CLONE_NEWCGROUP clone
+flag can be used with clone(2) and unshare(2) to create a new cgroup
+namespace.  The process running inside the cgroup namespace will have
+its "/proc/$PID/cgroup" output restricted to cgroupns root.  The
+cgroupns root is the cgroup of the process at the time of creation of
+the cgroup namespace.
+
+Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
+complete path of the cgroup of a process.  In a container setup where
+a set of cgroups and namespaces are intended to isolate processes the
+"/proc/$PID/cgroup" file may leak potential system level information
+to the isolated processes.  For example:
+
+  # cat /proc/self/cgroup
+  0::/batchjobs/container_id1
+
+The path '/batchjobs/container_id1' can be considered as system-data
+and undesirable to expose to the isolated processes.  cgroup namespace
+can be used to restrict visibility of this path.  For example, before
+creating a cgroup namespace, one would see:
+
+  # ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
+  # cat /proc/self/cgroup
+  0::/batchjobs/container_id1
+
+After unsharing a new namespace, the view changes.
+
+  # ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
+  # cat /proc/self/cgroup
+  0::/
+
+When some thread from a multi-threaded process unshares its cgroup
+namespace, the new cgroupns gets applied to the entire process (all
+the threads).  This is natural for the v2 hierarchy; however, for the
+legacy hierarchies, this may be unexpected.
+
+A cgroup namespace is alive as long as there are processes inside or
+mounts pinning it.  When the last usage goes away, the cgroup
+namespace is destroyed.  The cgroupns root and the actual cgroups
+remain.
+
+
+6-2. The Root and Views
+
+The 'cgroupns root' for a cgroup namespace is the cgroup in which the
+process calling unshare(2) is running.  For example, if a process in
+/batchjobs/container_id1 cgroup calls unshare, cgroup
+/batchjobs/container_id1 becomes the cgroupns root.  For the
+init_cgroup_ns, this is the real root ('/') cgroup.
+
+The cgroupns root cgroup does not change even if the namespace creator
+process later moves to a different cgroup.
+
+  # ~/unshare -c # unshare cgroupns in some cgroup
+  # cat /proc/self/cgroup
+  0::/
+  # mkdir sub_cgrp_1
+  # echo 0 > sub_cgrp_1/cgroup.procs
+  # cat /proc/self/cgroup
+  0::/sub_cgrp_1
+
+Each process gets its namespace-specific view of "/proc/$PID/cgroup"
+
+Processes running inside the cgroup namespace will be able to see
+cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
+From within an unshared cgroupns:
+
+  # sleep 10 &
+  [1] 7353
+  # echo 7353 > sub_cgrp_1/cgroup.procs
+  # cat /proc/7353/cgroup
+  0::/sub_cgrp_1
+
+From the initial cgroup namespace, the real cgroup path will be
+visible:
+
+  $ cat /proc/7353/cgroup
+  0::/batchjobs/container_id1/sub_cgrp_1
+
+From a sibling cgroup namespace (that is, a namespace rooted at a
+different cgroup), the cgroup path relative to its own cgroup
+namespace root will be shown.  For instance, if PID 7353's cgroup
+namespace root is at '/batchjobs/container_id2', then it will see
+
+  # cat /proc/7353/cgroup
+  0::/../container_id2/sub_cgrp_1
+
+Note that the relative path always starts with '/' to indicate that
+it is relative to the cgroup namespace root of the caller.
+
+
+6-3. Migration and setns(2)
+
+Processes inside a cgroup namespace can move into and out of the
+namespace root if they have proper access to external cgroups.  For
+example, from inside a namespace with cgroupns root at
+/ba

[PATCH 5/8] kernfs: define kernfs_node_dentry

2016-01-04 Thread serge . hallyn

From: Aditya Kali 

Add a new kernfs API to look up the dentry for a particular
kernfs path.

Signed-off-by: Aditya Kali 
Signed-off-by: Serge E. Hallyn 
Acked-by: Greg Kroah-Hartman 
---
Changelog:
20151116 - Don't allow user namespaces to bind new subsystems
20151118 - postpone the FS_USERNS_MOUNT flag until the
   last patch, until we can convince ourselves it
   is safe.
20151207 - Switch to walking up the kernfs path from kn root.
20151208 - Split out the kernfs change
 - Style changes
 - Switch from pr_crit to WARN_ON
 - Reorder arguments to kernfs_obtain_root
 - rename kernfs_obtain_root to kernfs_node_dentry
---
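The walk that kernfs_node_dentry() performs (find the root, then
repeatedly ask for "the next ancestor below the one I already have") can
be tried out in user space.  The sketch below is an assumption-laden
model: struct node is a hypothetical stand-in for struct kernfs_node with
only a name and a parent pointer, and path_steps() records the visited
nodes instead of calling lookup_one_len().

```c
#include <stddef.h>

/* Hypothetical stand-in for struct kernfs_node: only name and parent. */
struct node {
	const char *name;
	struct node *parent;
};

/*
 * Return the ancestor of @child (or @child itself) whose parent is @parent.
 * A NULL @parent asks for the root.  Returns NULL if @parent is not a
 * proper ancestor of @child.
 */
static struct node *next_ancestor(struct node *child, struct node *parent)
{
	if (child == parent)
		return NULL;

	while (child->parent != parent) {
		if (!child->parent)
			return NULL;
		child = child->parent;
	}
	return child;
}

/*
 * Walk from the root down to @target, storing each node below the root
 * in @out.  Returns the number of steps, or -1 on error or overflow.
 */
static int path_steps(struct node *target, struct node **out, int max)
{
	struct node *cur = next_ancestor(target, NULL); /* the root */
	int n = 0;

	while (cur && cur != target) {
		cur = next_ancestor(target, cur);
		if (!cur || n == max)
			return -1;
		out[n++] = cur;
	}
	return cur ? n : -1;
}
```

Each step rescans the parent chain from the target, so the walk is
quadratic in the depth; for the shallow trees involved here that seems an
acceptable trade for not needing any extra storage.
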
 fs/kernfs/mount.c  |   67 
 include/linux/kernfs.h |2 ++
 2 files changed, 69 insertions(+)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 8eaf417..7224296 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "kernfs-internal.h"
 
@@ -62,6 +63,72 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
return NULL;
 }
 
+/*
+ * find the next ancestor in the path down to @child, where @parent was the
+ * ancestor whose descendant we want to find.
+ *
+ * Say the path is /a/b/c/d.  @child is d, @parent is NULL.  We return the root
+ * node.  If @parent is b, then we return the node for c.
+ * Passing in d as @parent is not ok.
+ */
+static struct kernfs_node *
+find_next_ancestor(struct kernfs_node *child, struct kernfs_node *parent)
+{
+	if (child == parent) {
+		pr_crit_once("BUG in find_next_ancestor: called with parent == child");
+		return NULL;
+	}
+
+	while (child->parent != parent) {
+		if (!child->parent)
+			return NULL;
+		child = child->parent;
+	}
+
+	return child;
+}
+
+/**
+ * kernfs_node_dentry - get a dentry for the given kernfs_node
+ * @kn: kernfs_node for which a dentry is needed
+ * @sb: the kernfs super_block
+ */
+struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
+				  struct super_block *sb)
+{
+	struct dentry *dentry;
+	struct kernfs_node *knparent = NULL;
+
+	BUG_ON(sb->s_op != &kernfs_sops);
+
+	dentry = dget(sb->s_root);
+
+	/* Check if this is the root kernfs_node */
+	if (!kn->parent)
+		return dentry;
+
+	knparent = find_next_ancestor(kn, NULL);
+	if (WARN_ON(!knparent))
+		return ERR_PTR(-EINVAL);
+
+	do {
+		struct dentry *dtmp;
+		struct kernfs_node *kntmp;
+
+		if (kn == knparent)
+			return dentry;
+		kntmp = find_next_ancestor(kn, knparent);
+		if (WARN_ON(!kntmp))
+			return ERR_PTR(-EINVAL);
+		dtmp = lookup_one_len(kntmp->name, dentry, strlen(kntmp->name));
+		dput(dentry);
+		if (IS_ERR(dtmp))
+			return dtmp;
+		knparent = kntmp;
+		dentry = dtmp;
+	} while (1);
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 716bfde..c06c442 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -284,6 +284,8 @@ struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
 struct inode *kernfs_get_inode(struct super_block *sb, struct kernfs_node *kn);
 
+struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
+ struct super_block *sb);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
   unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
-- 
1.7.9.5


[PATCH 4/8] cgroup: cgroup namespace setns support

2016-01-04 Thread serge . hallyn

From: Aditya Kali 

setns on a cgroup namespace is allowed only if the
task has CAP_SYS_ADMIN in its current user-namespace and
over the user-namespace associated with the target cgroupns.
No implicit cgroup changes happen when attaching to another
cgroupns. It is expected that someone moves the attaching
process under the target cgroupns-root.

Signed-off-by: Aditya Kali 
Signed-off-by: Serge E. Hallyn 
---
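The rules in the description above (two capability checks, a no-op when
attaching to the current namespace, otherwise a reference swap) can be
modelled in a few lines of user-space C.  Everything here is a sketch
under assumptions: struct cg_ns, struct ns_proxy and cgns_install() are
hypothetical names, plain integers stand in for the kernel refcount, and
the booleans stand in for ns_capable() results.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space stand-ins; "count" models the kernel refcount. */
struct cg_ns {
	int count;
	bool caller_is_admin; /* models ns_capable(ns->user_ns, CAP_SYS_ADMIN) */
};

struct ns_proxy {
	struct cg_ns *cgroup_ns;
};

/* Returns 0 on success (including the no-op case), -EPERM otherwise. */
static int cgns_install(struct ns_proxy *proxy, struct cg_ns *target,
			bool admin_in_current_userns)
{
	/* Both capability checks must pass. */
	if (!admin_in_current_userns || !target->caller_is_admin)
		return -EPERM;

	/* Attaching to our own cgroupns changes nothing. */
	if (target == proxy->cgroup_ns)
		return 0;

	target->count++;           /* take a reference on the new namespace */
	proxy->cgroup_ns->count--; /* drop the old one; freeing at zero omitted */
	proxy->cgroup_ns = target;
	return 0;
}
```

Note that nothing in this path touches the task's cgroup membership,
which matches the statement that no implicit cgroup changes happen.
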
 kernel/cgroup.c |   19 ---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 60270b1..2bb58a1 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5952,10 +5952,23 @@ static inline struct cgroup_namespace *to_cg_ns(struct ns_common *ns)
return container_of(ns, struct cgroup_namespace, ns);
 }
 
-static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+static int cgroupns_install(struct nsproxy *nsproxy, struct ns_common *ns)
 {
-	pr_info("setns not supported for cgroup namespace");
-	return -EINVAL;
+	struct cgroup_namespace *cgroup_ns = to_cg_ns(ns);
+
+	if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+	    !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* Don't need to do anything if we are attaching to our own cgroupns. */
+	if (cgroup_ns == nsproxy->cgroup_ns)
+		return 0;
+
+	get_cgroup_ns(cgroup_ns);
+	put_cgroup_ns(nsproxy->cgroup_ns);
+	nsproxy->cgroup_ns = cgroup_ns;
+
+	return 0;
 }
 
 static struct ns_common *cgroupns_get(struct task_struct *task)
-- 
1.7.9.5


[PATCH 3/8] cgroup: introduce cgroup namespaces

2016-01-04 Thread serge . hallyn

From: Aditya Kali 

Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as cgroupns-root).
The main purpose of cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
 they will see a relative path from their cgroupns-root).
For a correctly set up container this enables container-tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking the system-level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali 
Signed-off-by: Serge Hallyn 
---
Changelog: 2015-11-24
- move cgroup_namespace.c into cgroup.c (and .h)
- reformatting
- make get_cgroup_ns return void
- rename ns->root_cgrps to root_cset.
Changelog: 2015-12-08
- Move init_cgroup_ns to other variable declarations
- Remove accidental conversion of put-css_set to inline
- Drop BUG_ON(NULL)
- Remove unneeded pre declaration of struct cgroupns_operations.
- cgroup.h: collect common ns declarations
Changelog: 2015-12-09
- cgroup.h: move ns declarations to bottom
- cgroup.c: undo all accidental conversions to inline
Changelog: 2015-12-22
- update for new kernfs_path_from_node() return value.  Since
  cgroup_path was already gpl-exported, I abstained from updating
  its return value.
Changelog: 2015-12-23
- cgroup_path(): use init_cgroup_ns when in interrupt context.
Changelog: 2015-01-02
- move to_cg_ns definition forward in patch series
- cgroup_release_agent: grab css_set_lock around cgroup_path()
- leave cgroup_path non-namespaced, use cgroup_path_ns when
  namespaced path is desired.
---
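The get_cgroup_ns()/put_cgroup_ns() pair in this patch follows the usual
"free on last put" reference-counting pattern.  A user-space sketch of the
same pattern with C11 atomics is shown here for illustration only; struct
demo_ns, the demo_ns_*() helpers and the demo_ns_freed counter are
hypothetical and exist just to make the behaviour observable.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical refcounted object standing in for struct cgroup_namespace. */
struct demo_ns {
	atomic_int count;
};

static int demo_ns_freed; /* counts frees so the sketch is observable */

/* Allocate an object holding one reference for the caller. */
static struct demo_ns *demo_ns_alloc(void)
{
	struct demo_ns *ns = malloc(sizeof(*ns));

	if (ns)
		atomic_init(&ns->count, 1);
	return ns;
}

/* Take a reference; NULL is tolerated, as in the kernel helper. */
static void demo_ns_get(struct demo_ns *ns)
{
	if (ns)
		atomic_fetch_add(&ns->count, 1);
}

/* Drop a reference and free the object when the last one goes away. */
static void demo_ns_put(struct demo_ns *ns)
{
	/* fetch_sub returns the previous value: 1 means this was the last ref. */
	if (ns && atomic_fetch_sub(&ns->count, 1) == 1) {
		free(ns);
		demo_ns_freed++;
	}
}
```

Every demo_ns_get() must be balanced by exactly one demo_ns_put(), which
is the same discipline the later patches in this series follow on each
exit path of cgroup_mount().
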
 fs/proc/namespaces.c|3 +
 include/linux/cgroup.h  |   56 +--
 include/linux/nsproxy.h |2 +
 include/linux/proc_ns.h |4 ++
 kernel/cgroup.c |  177 ++-
 kernel/cpuset.c |3 +-
 kernel/fork.c   |2 +-
 kernel/nsproxy.c|   21 +-
 kernel/sched/debug.c|3 +-
 9 files changed, 257 insertions(+), 14 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index f6e8354..bd61075 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -28,6 +28,9 @@ static const struct proc_ns_operations *ns_entries[] = {
&userns_operations,
 #endif
&mntns_operations,
+#ifdef CONFIG_CGROUPS
+   &cgroupns_operations,
+#endif
 };
 
 static const char *proc_ns_follow_link(struct dentry *dentry, void **cookie)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9d70b48..149ae0a 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -17,6 +17,11 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
 
 #include 
 
@@ -532,12 +537,6 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
return kernfs_name(cgrp->kn, buf, buflen);
 }
 
-static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
- size_t buflen)
-{
-   return kernfs_path(cgrp->kn, buf, buflen);
-}
-
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
 {
pr_cont_kernfs_name(cgrp->kn);
@@ -570,4 +569,49 @@ static inline int cgroup_init(void) { return 0; }
 
 #endif /* !CONFIG_CGROUPS */
 
+struct cgroup_namespace {
+   atomic_t count;
+   struct ns_common ns;
+   struct user_namespace   *user_ns;
+   struct css_set  *root_cset;
+};
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+#ifdef CONFIG_CGROUPS
+
+void free_cgroup_ns(struct cgroup_namespace *ns);
+
+struct cgroup_namespace *
+copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
+  struct cgroup_namespace *old_ns);
+
+char *cgroup_path_ns(struct cgroup *cgrp, char *buf, size_t buflen,
+struct cgroup_namespace *ns);
+char *cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen);
+
+#else /* !CONFIG_CGROUPS */
+
+static inline void free_cgroup_ns(struct cgroup_namespace *ns) { }
+static inline struct cgroup_namespace *
+copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
+  struct cgroup_namespace *old_ns)
+{
+   return old_ns;
+}
+
+#endif /* !CONFIG_CGROUPS */
+
+static inline void get_cgroup_ns(struct cgroup_namespace *ns)
+{
+   if (ns)
+   atomic_inc(&ns->count);
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+   if (ns && atomic_dec_and_test(&ns->

[PATCH 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace

2016-01-04 Thread serge . hallyn

From: Aditya Kali 

CLONE_NEWCGROUP will be used to create a new cgroup namespace.

Signed-off-by: Aditya Kali 
Signed-off-by: Serge Hallyn 
---
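For illustration, this is roughly how user space would request a new
cgroup namespace with the flag.  It is a sketch under assumptions:
unshare_cgroup_ns() is a hypothetical helper, the raw syscall is used so
that the example does not depend on the libc exposing the flag, and the
fallback #define simply repeats the value introduced by this patch.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000 /* value from this patch; older headers lack it */
#endif

/*
 * Ask the kernel to move the caller into a new cgroup namespace.
 * Returns 0 on success or a negative errno value, typically -EPERM
 * without CAP_SYS_ADMIN or -EINVAL on kernels that do not know the flag.
 */
static int unshare_cgroup_ns(void)
{
	if (syscall(SYS_unshare, (unsigned long)CLONE_NEWCGROUP) == -1)
		return -errno;
	return 0;
}
```

After a successful call, reading /proc/self/cgroup should show paths
relative to the cgroup the caller was in at the time of the call.
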
 include/uapi/linux/sched.h |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index cc89dde..5f0fe01 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED		0x00400000	/* Unused, ignored */
 #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_NEWCGROUP		0x02000000	/* New cgroup namespace */
 #define CLONE_NEWUTS		0x04000000	/* New utsname namespace */
 #define CLONE_NEWIPC		0x08000000	/* New ipc namespace */
 #define CLONE_NEWUSER		0x10000000	/* New user namespace */
-- 
1.7.9.5


Re: [PATCH 7/8] cgroup: Add documentation for cgroup namespaces

2015-12-28 Thread Serge Hallyn

On Mon Dec 28 2015 09:47:35 AM PST, Tejun Heo  wrote:

> Hello,
> 
> I did some heavy editing of the documentation.  How does this look?

Thanks Tejun, just three things (which come from my version):

> Did I miss anything?
> 
> Thanks.
> ---
>  Documentation/cgroup.txt |  146 +++
>  1 file changed, 146 insertions(+)
> 
> --- a/Documentation/cgroup.txt
> +++ b/Documentation/cgroup.txt
> @@ -47,6 +47,11 @@ CONTENTS
>      5-3. IO
>          5-3-1. IO Interface Files
>          5-3-2. Writeback
> +6. Namespace
> +  6-1. Basics
> +  6-2. The Root and Views
> +  6-3. Migration and setns(2)
> +  6-4. Interaction with Other Namespaces
>  P. Information on Kernel Programming
>      P-1. Filesystem Support for Writeback
>  D. Deprecated v1 Core Features
> @@ -1013,6 +1018,147 @@ writeback as follows.
>      vm.dirty[_background]_ratio.
>  
>  
> +6. Namespace
> +
> +6-1. Basics
> +
> +cgroup namespace provides a mechanism to virtualize the view of the
> +"/proc/$PID/cgroup" file

and cgroup mounts

>.  The CLONE_NEWCGROUP clone flag can be used
> +with clone(2) and unshare(2) to create a new cgroup namespace.  The
> +process running inside the cgroup namespace will have its
> +"/proc/$PID/cgroup" output restricted to cgroupns root.  The cgroupns
> +root is the cgroup of the process at the time of creation of the
> +cgroup namespace.
> +
> +Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
> +complete path of the cgroup of a process.  In a container setup where
> +a set of cgroups and namespaces are intended to isolate processes the
> +"/proc/$PID/cgroup" file may leak potential system level information
> +to the isolated processes.  For Example:
> +
> +  # cat /proc/self/cgroup
> +  0::/batchjobs/container_id1
> +
> +The path '/batchjobs/container_id1' can be considered as system-data
> +and undesirable to expose to the isolated processes.  cgroup namespace
> +can be used to restrict visibility of this path.  For example, before
> +creating a cgroup namespace, one would see:
> +
> +  # ls -l /proc/self/ns/cgroup
> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
> +  # cat /proc/self/cgroup
> +  0::/batchjobs/container_id1
> +
> +After unsharing a new namespace, the view changes.
> +
> +  # ls -l /proc/self/ns/cgroup
> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
> +  # cat /proc/self/cgroup
> +  0::/
> +
> +When some thread from a multi-threaded process unshares its cgroup
> +namespace, the new cgroupns gets applied to the entire process (all
> +the threads).  This is natural for the v2 hierarchy; however, for the
> +legacy hierarchies, this may be unexpected.
> +
> +A cgroup namespace is alive as long as there are processes inside it.

Or mounts pinning it.

> +When the last process exits

or the last mount is umounted,

>, the cgroup namespace is destroyed.  The
> +cgroupns root and the actual cgroups remain.
> +
> +
> +6-2. The Root and Views
> +
> +The 'cgroupns root' for a cgroup namespace is the cgroup in which the
> +process calling unshare(2) is running.  For example, if a process in
> +/batchjobs/container_id1 cgroup calls unshare, cgroup
> +/batchjobs/container_id1 becomes the cgroupns root.  For the
> +init_cgroup_ns, this is the real root ('/') cgroup.
> +
> +The cgroupns root cgroup does not change even if the namespace creator
> +process later moves to a different cgroup.
> +
> +  # ~/unshare -c # unshare cgroupns in some cgroup
> +  # cat /proc/self/cgroup
> +  0::/
> +  # mkdir sub_cgrp_1
> +  # echo 0 > sub_cgrp_1/cgroup.procs
> +  # cat /proc/self/cgroup
> +  0::/sub_cgrp_1
> +
> +Each process gets its namespace-specific view of "/proc/$PID/cgroup"
> +
> +Processes running inside the cgroup namespace will be able to see
> +cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
> +From within an unshared cgroupns:
> +
> +  # sleep 10 &
> +  [1] 7353
> +  # echo 7353 > sub_cgrp_1/cgroup.procs
> +  # cat /proc/7353/cgroup
> +  0::/sub_cgrp_1
> +
> +From the initial cgroup namespace, the real cgroup path will be
> +visible:
> +
> +  $ cat /proc/7353/cgroup
> +  0::/batchjobs/container_id1/sub_cgrp_1
> +
> +From a sibling cgroup namespace (that is, a namespace rooted at a
> +different cgroup), the cgroup path relative to its own cgroup
> +namespace root will be shown.  For instance, if PID 7353's cgroup
> +namespace root is at '/batchjobs/container_id2', then it will see
> +
> +  # cat /proc/7353/cgroup
> +  0::/../container_id2/sub_cgrp_1
> +
> +Note that the relative path always starts with '/' to indicate that
> +its relative to the cgroup namespace root of the caller.
> +
> +
> +6-3. Migration and setns(2)
> +
> +Processes insi

[PATCH 3/8] cgroup: introduce cgroup namespaces

2015-12-22 Thread serge . hallyn

From: Aditya Kali 

Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as cgroupns-root).
The main purpose of cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
 they will see a relative path from their cgroupns-root).
For a correctly set up container this enables container-tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking the system-level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali 
Signed-off-by: Serge Hallyn 
---
Changelog: 2015-11-24
- move cgroup_namespace.c into cgroup.c (and .h)
- reformatting
- make get_cgroup_ns return void
- rename ns->root_cgrps to root_cset.
Changelog: 2015-12-08
- Move init_cgroup_ns to other variable declarations
- Remove accidental conversion of put-css_set to inline
- Drop BUG_ON(NULL)
- Remove unneeded pre declaration of struct cgroupns_operations.
- cgroup.h: collect common ns declarations
Changelog: 2015-12-09
- cgroup.h: move ns declarations to bottom
- cgroup.c: undo all accidental conversions to inline
Changelog: 2015-12-22
- update for new kernfs_path_from_node() return value.  Since
  cgroup_path was already gpl-exported, I abstained from updating
  its return value.
---
 fs/proc/namespaces.c|3 +
 include/linux/cgroup.h  |   54 --
 include/linux/nsproxy.h |2 +
 include/linux/proc_ns.h |4 ++
 kernel/cgroup.c |  144 +++
 kernel/fork.c   |2 +-
 kernel/nsproxy.c|   21 ++-
 7 files changed, 221 insertions(+), 9 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index f6e8354..bd61075 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -28,6 +28,9 @@ static const struct proc_ns_operations *ns_entries[] = {
&userns_operations,
 #endif
&mntns_operations,
+#ifdef CONFIG_CGROUPS
+   &cgroupns_operations,
+#endif
 };
 
 static const char *proc_ns_follow_link(struct dentry *dentry, void **cookie)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9d70b48..6d0992f 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -17,6 +17,11 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
 
 #include 
 
@@ -532,12 +537,6 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
return kernfs_name(cgrp->kn, buf, buflen);
 }
 
-static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
- size_t buflen)
-{
-   return kernfs_path(cgrp->kn, buf, buflen);
-}
-
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
 {
pr_cont_kernfs_name(cgrp->kn);
@@ -570,4 +569,47 @@ static inline int cgroup_init(void) { return 0; }
 
 #endif /* !CONFIG_CGROUPS */
 
+struct cgroup_namespace {
+   atomic_t count;
+   struct ns_common ns;
+   struct user_namespace   *user_ns;
+   struct css_set  *root_cset;
+};
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+#ifdef CONFIG_CGROUPS
+
+void free_cgroup_ns(struct cgroup_namespace *ns);
+
+struct cgroup_namespace *
+copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
+  struct cgroup_namespace *old_ns);
+
+char *cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen);
+
+#else /* !CONFIG_CGROUPS */
+
+static inline void free_cgroup_ns(struct cgroup_namespace *ns) { }
+static inline struct cgroup_namespace *
+copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
+  struct cgroup_namespace *old_ns)
+{
+   return old_ns;
+}
+
+#endif /* !CONFIG_CGROUPS */
+
+static inline void get_cgroup_ns(struct cgroup_namespace *ns)
+{
+   if (ns)
+   atomic_inc(&ns->count);
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+   if (ns && atomic_dec_and_test(&ns->count))
+   free_cgroup_ns(ns);
+}
+
 #endif /* _LINUX_CGROUP_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct cgroup_namespace;
 struct fs_struct;
 
 /*
@@ -33,6 +34,7 @@ struct nsproxy {
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net

[PATCH 1/8] kernfs: Add API to generate relative kernfs path

2015-12-22 Thread serge . hallyn

From: Aditya Kali 

The new function kernfs_path_from_node() generates and returns kernfs
path of a given kernfs_node relative to a given parent kernfs_node.

Signed-off-by: Aditya Kali 
Signed-off-by: Serge E. Hallyn 
---
Changelog 20151125:
  - Fully-wing multiline comments
  - Rework kernfs_path_from_node_locked() logic
  - Replace BUG_ONs with returning NULL
  - Use a const char* for /.. and precalculate its size
Changelog 20151130:
  - Update kernfs_path_from_node_locked comment
Changelog 20151208:
  - kernfs_node_distance:
* Remove BUG_ON(NULL)s
* Rename kernfs_node_distance to kernfs_depth
  - kernfs_common_ancestor:
* Remove useless checks for depth == 0
* Add check to ensure nodes are from same root
  - kernfs_path_from_node_locked:
* Remove needless __must_check
* Put p;len on its own decl line.
* Fix wrong WARN_ONCE usage
Changelog 20151209:
  - kernfs_path_from_node: change arguments to 'to' and 'from', and
change their order.
Changelog 20151222:
  - kernfs_path_from_node{,_locked}: return the string length.
kernfs_path is gpl-exported, so changing their return value seemed
ill-advised, but if no one minds I can update it too.
---
 fs/kernfs/dir.c|  205 
 include/linux/kernfs.h |9 ++-
 2 files changed, 179 insertions(+), 35 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 742bf4a..e82b9a1 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,28 +44,123 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
- size_t buflen)
+/* kernfs_node_depth - compute depth from @from to @to */
+static size_t kernfs_depth(struct kernfs_node *from, struct kernfs_node *to)
 {
-   char *p = buf + buflen;
-   int len;
+   size_t depth = 0;
 
-   *--p = '\0';
+   while (to->parent && to != from) {
+   depth++;
+   to = to->parent;
+   }
+   return depth;
+}
 
-   do {
-   len = strlen(kn->name);
-   if (p - buf < len + 1) {
-   buf[0] = '\0';
-   p = NULL;
-   break;
-   }
-   p -= len;
-   memcpy(p, kn->name, len);
-   *--p = '/';
-   kn = kn->parent;
-   } while (kn && kn->parent);
+static struct kernfs_node *kernfs_common_ancestor(struct kernfs_node *a,
+ struct kernfs_node *b)
+{
+   size_t da, db;
+   struct kernfs_root *ra = kernfs_root(a), *rb = kernfs_root(b);
+
+   if (ra != rb)
+   return NULL;
+
+   da = kernfs_depth(ra->kn, a);
+   db = kernfs_depth(rb->kn, b);
+
+   while (da > db) {
+   a = a->parent;
+   da--;
+   }
+   while (db > da) {
+   b = b->parent;
+   db--;
+   }
+
+   /* worst case b and a will be the same at root */
+   while (b != a) {
+   b = b->parent;
+   a = a->parent;
+   }
+
+   return a;
+}
+
+/**
+ * kernfs_path_from_node_locked - find a pseudo-absolute path to @kn_to,
+ * where kn_from is treated as root of the path.
+ * @kn_from: kernfs node which should be treated as root for the path
+ * @kn_to: kernfs node to which path is needed
+ * @buf: buffer to copy the path into
+ * @buflen: size of @buf
+ *
+ * We need to handle couple of scenarios here:
+ * [1] when @kn_from is an ancestor of @kn_to at some level
+ * kn_from: /n1/n2/n3
+ * kn_to:   /n1/n2/n3/n4/n5
+ * result:  /n4/n5
+ *
+ * [2] when @kn_from is on a different hierarchy and we need to find common
+ * ancestor between @kn_from and @kn_to.
+ * kn_from: /n1/n2/n3/n4
+ * kn_to:   /n1/n2/n5
+ * result:  /../../n5
+ * OR
+ * kn_from: /n1/n2/n3/n4/n5   [depth=5]
+ * kn_to:   /n1/n2/n3 [depth=3]
+ * result:  /../..
+ *
+ * return value: length of the string.  If greater than buflen,
+ * then contents of buf are undefined.  On error, -1 is returned.
+ */
+static int
+kernfs_path_from_node_locked(struct kernfs_node *kn_to,
+struct kernfs_node *kn_from, char *buf,
+size_t buflen)
+{
+   struct kernfs_node *kn, *common;
+   const char parent_str[] = "/..";
+   size_t depth_from, depth_to, len = 0, nlen = 0;
+   char *p;
+   int i;
+
+   if (!kn_from)
+   kn_from = kernfs_root(kn_to)->kn;
+
+   if (kn_from == kn_to)
+   return strlcpy(buf, "/", buflen);
+
+   common = kernfs_common_ancestor(kn_from, kn_to);
+   if (WARN_ON(!common))
+   return -1;
+
+   depth_to = kernfs_depth(common, kn_to);
+   depth_from = kernfs_depth(common, kn_from);
+
+   if (buf)
+

[PATCH 6/8] cgroup: mount cgroupns-root when inside non-init cgroupns

2015-12-22 Thread serge . hallyn

From: Serge Hallyn 

This patch enables cgroup mounting inside userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container-setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.

Signed-off-by: Serge Hallyn 
---
Changelog:
20151116 - Don't allow user namespaces to bind new subsystems
20151118 - postpone the FS_USERNS_MOUNT flag until the
   last patch, until we can convince ourselves it
   is safe.
20151207 - Switch to walking up the kernfs path from kn root.
 - Group initialized variables
 - Explain the capable(CAP_SYS_ADMIN) check
 - Style fixes
---
 kernel/cgroup.c |   40 +++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e85fbf9..99c4443 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1983,6 +1983,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 {
bool is_v2 = fs_type == &cgroup2_fs_type;
struct super_block *pinned_sb = NULL;
+   struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
struct cgroup_subsys *ss;
struct cgroup_root *root;
struct cgroup_sb_opts opts;
@@ -1991,6 +1992,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int i;
bool new_sb;
 
+   get_cgroup_ns(ns);
+
+   /* Check if the caller has permission to mount. */
+   if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+   put_cgroup_ns(ns);
+   return ERR_PTR(-EPERM);
+   }
+
/*
 * The first time anyone tries to mount a cgroup, enable the list
 * linking each css_set to its tasks and fix up all existing tasks.
@@ -2106,6 +2115,16 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
goto out_unlock;
}
 
+   /*
+* We know this subsystem has not yet been bound.  Users in a non-init
+* user namespace may only mount hierarchies with no bound subsystems,
+* i.e. 'none,name=user1'
+*/
+   if (!opts.none && !capable(CAP_SYS_ADMIN)) {
+   ret = -EPERM;
+   goto out_unlock;
+   }
+
root = kzalloc(sizeof(*root), GFP_KERNEL);
if (!root) {
ret = -ENOMEM;
@@ -2124,12 +2143,30 @@ out_free:
kfree(opts.release_agent);
kfree(opts.name);
 
-   if (ret)
+   if (ret) {
+   put_cgroup_ns(ns);
return ERR_PTR(ret);
+   }
 out_mount:
dentry = kernfs_mount(fs_type, flags, root->kf_root,
  is_v2 ? CGROUP2_SUPER_MAGIC : CGROUP_SUPER_MAGIC,
  &new_sb);
+
+   /*
+* In non-init cgroup namespace, instead of root cgroup's
+* dentry, we return the dentry corresponding to the
+* cgroupns->root_cgrp.
+*/
+   if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {
+   struct dentry *nsdentry;
+   struct cgroup *cgrp;
+
+   cgrp = cset_cgroup_from_root(ns->root_cset, root);
+   nsdentry = kernfs_node_dentry(cgrp->kn, dentry->d_sb);
+   dput(dentry);
+   dentry = nsdentry;
+   }
+
if (IS_ERR(dentry) || !new_sb)
cgroup_put(&root->cgrp);
 
@@ -2142,6 +2179,7 @@ out_mount:
deactivate_super(pinned_sb);
}
 
+   put_cgroup_ns(ns);
return dentry;
 }
 
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 8/8] Add FS_USERNS_FLAG to cgroup fs

2015-12-22 Thread serge . hallyn

From: Serge Hallyn 

allowing root in a non-init user namespace to mount it.  This should
now be safe, because

1. non-init-root cannot mount a previously unbound subsystem
2. the task doing the mount must be privileged with respect to the
   user namespace owning the cgroup namespace
3. the mounted subsystem will have its current cgroup as the root dentry.
   the permissions will be unchanged, so tasks will receive no new
   privilege over the cgroups which they did not have on the original
   mounts.

Signed-off-by: Serge Hallyn 
---
 kernel/cgroup.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 99c4443..587247e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2208,12 +2208,14 @@ static struct file_system_type cgroup_fs_type = {
.name = "cgroup",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
+   .fs_flags = FS_USERNS_MOUNT,
 };
 
 static struct file_system_type cgroup2_fs_type = {
.name = "cgroup2",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
+   .fs_flags = FS_USERNS_MOUNT,
 };
 
 static int cgroup_path_ns(struct cgroup *cgrp, char *buf, size_t buflen,
-- 
1.7.9.5


[PATCH 5/8] kernfs: define kernfs_node_dentry

2015-12-22 Thread serge . hallyn

From: Aditya Kali 

Add a new kernfs API to look up the dentry for a particular
kernfs path.

Signed-off-by: Aditya Kali 
Signed-off-by: Serge E. Hallyn 
---
Changelog:
20151116 - Don't allow user namespaces to bind new subsystems
20151118 - postpone the FS_USERNS_MOUNT flag until the
   last patch, until we can convince ourselves it
   is safe.
20151207 - Switch to walking up the kernfs path from kn root.
20151208 - Split out the kernfs change
 - Style changes
 - Switch from pr_crit to WARN_ON
 - Reorder arguments to kernfs_obtain_root
 - rename kernfs_obtain_root to kernfs_node_dentry
---
 fs/kernfs/mount.c  |   67 
 include/linux/kernfs.h |2 ++
 2 files changed, 69 insertions(+)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 8eaf417..7224296 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "kernfs-internal.h"
 
@@ -62,6 +63,72 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
return NULL;
 }
 
+/*
+ * find the next ancestor in the path down to @child, where @parent was the
+ * ancestor whose descendant we want to find.
+ *
+ * Say the path is /a/b/c/d.  @child is d, @parent is NULL.  We return the root
+ * node.  If @parent is b, then we return the node for c.
+ * Passing in d as @parent is not ok.
+ */
+static struct kernfs_node *
+find_next_ancestor(struct kernfs_node *child, struct kernfs_node *parent)
+{
+   if (child == parent) {
+   pr_crit_once("BUG in find_next_ancestor: called with parent == child");
+   return NULL;
+   }
+
+   while (child->parent != parent) {
+   if (!child->parent)
+   return NULL;
+   child = child->parent;
+   }
+
+   return child;
+}
+
+/**
+ * kernfs_node_dentry - get a dentry for the given kernfs_node
+ * @kn: kernfs_node for which a dentry is needed
+ * @sb: the kernfs super_block
+ */
+struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
+ struct super_block *sb)
+{
+   struct dentry *dentry;
+   struct kernfs_node *knparent = NULL;
+
+   BUG_ON(sb->s_op != &kernfs_sops);
+
+   dentry = dget(sb->s_root);
+
+   /* Check if this is the root kernfs_node */
+   if (!kn->parent)
+   return dentry;
+
+   knparent = find_next_ancestor(kn, NULL);
+   if (WARN_ON(!knparent))
+   return ERR_PTR(-EINVAL);
+
+   do {
+   struct dentry *dtmp;
+   struct kernfs_node *kntmp;
+
+   if (kn == knparent)
+   return dentry;
+   kntmp = find_next_ancestor(kn, knparent);
+   if (WARN_ON(!kntmp))
+   return ERR_PTR(-EINVAL);
+   dtmp = lookup_one_len(kntmp->name, dentry, strlen(kntmp->name));
+   dput(dentry);
+   if (IS_ERR(dtmp))
+   return dtmp;
+   knparent = kntmp;
+   dentry = dtmp;
+   } while (1);
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 716bfde..c06c442 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -284,6 +284,8 @@ struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
 struct inode *kernfs_get_inode(struct super_block *sb, struct kernfs_node *kn);
 
+struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
+ struct super_block *sb);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
   unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
-- 
1.7.9.5
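
The top-down walk that kernfs_node_dentry() performs can be tried out in
userspace. What follows is a minimal sketch, not the kernel code: struct node,
next_ancestor() and path_from_root() are hypothetical stand-ins for struct
kernfs_node and find_next_ancestor(). Where the patch resolves each step to a
dentry with lookup_one_len(), the sketch merely records the node.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct kernfs_node: only the fields the walk needs. */
struct node {
	const char *name;
	struct node *parent;	/* NULL for the root */
};

/*
 * Return the ancestor of @child (or @child itself) whose parent is @parent,
 * i.e. the next step on the way down from @parent to @child.  With
 * @parent == NULL this is the root.  Returns NULL if @parent is not a
 * proper ancestor of @child.
 */
struct node *next_ancestor(struct node *child, struct node *parent)
{
	if (child == parent)
		return NULL;

	while (child->parent != parent) {
		if (!child->parent)
			return NULL;
		child = child->parent;
	}
	return child;
}

/*
 * Record every node on the path from the root down to @target in @steps.
 * Returns the number of nodes stored, or -1 if @steps is too small or the
 * walk fails.
 */
int path_from_root(struct node *target, struct node **steps, int max)
{
	struct node *cur = NULL;	/* NULL means "above the root" */
	int n = 0;

	while (cur != target) {
		cur = next_ancestor(target, cur);
		if (!cur || n == max)
			return -1;
		steps[n++] = cur;
	}
	return n;
}
```

For a chain root/a/b/c/d, path_from_root(&d, steps, 8) stores root, a, b, c, d
in that order. Each step re-walks the parent pointers from the target, so the
cost is quadratic in the depth, which seems acceptable for a mount-time
operation.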


[PATCH 7/8] cgroup: Add documentation for cgroup namespaces

2015-12-22 Thread serge . hallyn

From: Aditya Kali 

Signed-off-by: Aditya Kali 
Signed-off-by: Serge Hallyn 
---
Changelog (2015-12-08):
  Merge into Documentation/cgroup.txt
Changelog (2015-12-22):
  Reformat to try to follow the style of the rest of the cgroup.txt file.

Signed-off-by: Serge Hallyn 
---
 Documentation/cgroup.txt |  150 ++
 1 file changed, 150 insertions(+)

diff --git a/Documentation/cgroup.txt b/Documentation/cgroup.txt
index 31d1f7b..03ad757 100644
--- a/Documentation/cgroup.txt
+++ b/Documentation/cgroup.txt
@@ -47,6 +47,7 @@ CONTENTS
   5-3. IO
 5-3-1. IO Interface Files
 5-3-2. Writeback
+6. Namespaces
 P. Information on Kernel Programming
   P-1. Filesystem Support for Writeback
 D. Deprecated v1 Core Features
@@ -1013,6 +1014,155 @@ writeback as follows.
vm.dirty[_background]_ratio.
 
 
+6. Cgroup Namespaces
+
+Cgroup namespaces provide a mechanism to virtualize the view of the
+"/proc/$PID/cgroup" file. The CLONE_NEWCGROUP clone flag can be used with
+clone() and unshare() syscalls to create a new cgroup namespace.  The process
+running inside the cgroup namespace will have its "/proc/$PID/cgroup" output
+restricted to cgroupns root.  The cgroupns root is the cgroup of the process at
+the time of creation of the cgroup namespace.
+
+Prior to cgroup namespaces, the "/proc/$PID/cgroup" file showed the complete
+path of the cgroup of a process. In a container setup where a set of cgroups
+and namespaces are intended to isolate processes the "/proc/$PID/cgroup" file
+may leak potential system level information to the isolated processes.
+
+For Example:
+  # cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+The path '/batchjobs/container_id1' can generally be considered system data,
+and it is desirable not to expose it to the isolated process.
+
+Cgroup namespaces can be used to restrict visibility of this path.
+For example, before creating a cgroup namespace, one would see:
+
+  # ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
+  # cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+After unsharing a new namespace, the view has changed.
+
+  # ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
+  # cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+
+A task in the global cgroup namespace, meanwhile, still sees the full path:
+
+  # cat /proc/$PID/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+If the user and mount namespaces are also unshared, then a subsequent cgroupfs
+mount will have the task's cgroup as its root.
+
+  # lxc-usernsexec --unshare -m -c
+  # mount -t cgroup cgroup /tmp/cgroup
+  # ls -l /tmp/cgroup
+  total 0
+  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
+  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
+  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
+  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
+
+The cgroupns root (/batchjobs/container_id1 in above example) becomes the
+filesystem root for the namespace specific cgroupfs mount.
+
+The virtualization of /proc/self/cgroup file combined with restricting
+the view of cgroup hierarchy by namespace-private cgroupfs mount
+should provide a completely isolated cgroup view inside the container.
+
+In its current form, the cgroup namespaces patchset provides the following
+behavior:
+
+(1) The 'cgroupns root' for a cgroup namespace is the cgroup in which
+the process calling unshare is running.
+For ex. if a process in /batchjobs/container_id1 cgroup calls unshare,
+cgroup /batchjobs/container_id1 becomes the cgroupns root.
+For the init_cgroup_ns, this is the real root ('/') cgroup
+(identified in code as cgrp_dfl_root.cgrp).
+
+(2) The cgroupns root cgroup does not change even if the namespace
+creator process later moves to a different cgroup.
+# ~/unshare -c # unshare cgroupns in some cgroup
+ # cat /proc/self/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+ # mkdir sub_cgrp_1
+ # echo 0 > sub_cgrp_1/cgroup.procs
+ # cat /proc/self/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(3) Each process gets its namespace-specific view of "/proc/$PID/cgroup"
+
+(a) Processes running inside the cgroup namespace will be able to see
+cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
+From within an unshared cgroupns:
+# sleep 10 &
+[1] 7353
+# echo 7353 > sub_cgrp_1/cgroup.procs
+# cat /proc/7353/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(b) From the initial cgrou

[PATCH 4/8] cgroup: cgroup namespace setns support

2015-12-22 Thread serge . hallyn

From: Aditya Kali 

setns on a cgroup namespace is allowed only if
task has CAP_SYS_ADMIN in its current user-namespace and
over the user-namespace associated with target cgroupns.
No implicit cgroup changes happen with attaching to another
cgroupns. It is expected that someone moves the attaching
process under the target cgroupns-root.

Signed-off-by: Aditya Kali 
Signed-off-by: Serge E. Hallyn 
---
 kernel/cgroup.c |   24 +---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 72336f5..e85fbf9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5925,10 +5925,28 @@ err_out:
return ERR_PTR(err);
 }
 
-static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+static inline struct cgroup_namespace *to_cg_ns(struct ns_common *ns)
 {
-   pr_info("setns not supported for cgroup namespace");
-   return -EINVAL;
+   return container_of(ns, struct cgroup_namespace, ns);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, struct ns_common *ns)
+{
+   struct cgroup_namespace *cgroup_ns = to_cg_ns(ns);
+
+   if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+   !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+   return -EPERM;
+
+   /* Don't need to do anything if we are attaching to our own cgroupns. */
+   if (cgroup_ns == nsproxy->cgroup_ns)
+   return 0;
+
+   get_cgroup_ns(cgroup_ns);
+   put_cgroup_ns(nsproxy->cgroup_ns);
+   nsproxy->cgroup_ns = cgroup_ns;
+
+   return 0;
 }
 
 static struct ns_common *cgroupns_get(struct task_struct *task)
-- 
1.7.9.5
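
From userspace, the setns path added above would typically be exercised through
the namespace file in /proc. The sketch below is an illustration under
assumptions, not part of the patch: cgroupns_path() and join_cgroupns() are
hypothetical helper names, and the CLONE_NEWCGROUP fallback of 0x02000000 is
the value from mainline <linux/sched.h> for libcs that do not define it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* fallback for older libc headers */
#endif

/* Format the cgroup namespace file path for @pid; returns snprintf()'s result. */
int cgroupns_path(char *buf, size_t buflen, pid_t pid)
{
	return snprintf(buf, buflen, "/proc/%ld/ns/cgroup", (long)pid);
}

/*
 * Join the cgroup namespace of @pid.  Returns 0 on success, -1 on error
 * (errno is set by open() or setns(), e.g. EPERM when the caller lacks
 * CAP_SYS_ADMIN in either user namespace).  As the commit message notes,
 * this does not move the caller into a different cgroup.
 */
int join_cgroupns(pid_t pid)
{
	char path[64];
	int fd, ret;

	cgroupns_path(path, sizeof(path), pid);
	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	ret = setns(fd, CLONE_NEWCGROUP);
	close(fd);
	return ret;
}
```

A caller that wants its /proc/self/cgroup to read sensibly afterwards would
still have to write its pid into a cgroup.procs file under the target
cgroupns-root, since setns leaves the cgroup membership untouched.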


[PATCH 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace

2015-12-22 Thread serge . hallyn

From: Aditya Kali 

CLONE_NEWCGROUP will be used to create new cgroup namespace.

Signed-off-by: Aditya Kali 
Signed-off-by: Serge Hallyn 
---
 include/uapi/linux/sched.h |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index cc89dde..5f0fe01 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED 0x00400000  /* Unused, ignored */
 #define CLONE_UNTRACED 0x00800000  /* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID 0x01000000  /* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_NEWCGROUP 0x02000000  /* New cgroup namespace */
 #define CLONE_NEWUTS   0x04000000  /* New utsname namespace */
 #define CLONE_NEWIPC   0x08000000  /* New ipc namespace */
 #define CLONE_NEWUSER  0x10000000  /* New user namespace */
-- 
1.7.9.5
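
A short userspace sketch of how the new flag might be used, assuming a kernel
with this series applied. The helper names cgroup_line_path() and
unshare_and_show() are hypothetical, and the fallback definition uses
0x02000000, the value CLONE_NEWCGROUP has in mainline <linux/sched.h>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* fallback for older libc headers */
#endif

/*
 * Return the path part of one /proc/<pid>/cgroup line, which has the form
 * "hierarchy-id:controller-list:path", or NULL if the line is malformed.
 */
const char *cgroup_line_path(const char *line)
{
	const char *p = strchr(line, ':');

	if (!p)
		return NULL;
	p = strchr(p + 1, ':');
	return p ? p + 1 : NULL;
}

/*
 * Unshare the cgroup namespace, then print the caller's cgroup paths as
 * seen from the new namespace.  Returns 0 on success, -1 on failure
 * (for instance EPERM without CAP_SYS_ADMIN, or EINVAL on a kernel that
 * lacks cgroup namespaces).
 */
int unshare_and_show(void)
{
	char line[4096];
	FILE *f;

	if (unshare(CLONE_NEWCGROUP) < 0) {
		perror("unshare(CLONE_NEWCGROUP)");
		return -1;
	}

	f = fopen("/proc/self/cgroup", "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		const char *path = cgroup_line_path(line);

		if (path)
			fputs(path, stdout);	/* line still ends in '\n' */
	}
	fclose(f);
	return 0;
}
```

With the semantics described in this series, every path printed after a
successful unshare should be "/", because the caller's current cgroups become
the roots of its new namespace.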


CGroup Namespaces (v8)

2015-12-22 Thread serge . hallyn

Hi,

following is a revised set of the CGroup Namespace patchset which Aditya
Kali has previously sent.  The code can also be found in the cgroupns.v8
branch of

https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/

To summarize the semantics:

1. CLONE_NEWCGROUP re-uses 0x02000000, which was previously CLONE_STOPPED

2. unsharing a cgroup namespace makes all your current cgroups your new
cgroup root.

3. /proc/pid/cgroup always shows cgroup paths relative to the reader's
cgroup namespace root.  A task outside of your cgroup looks like

8:memory:/../../..

4. when a task mounts a cgroupfs, the cgroup which shows up as root depends
on the mounting task's cgroup namespace.

5. setns to a cgroup namespace switches your cgroup namespace but not
your cgroups.
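
To make point 3 concrete, here is a string-based userspace sketch of how a
path relative to the reader's namespace root comes out. It is an illustration
only: rel_cgroup_path() is a hypothetical helper, and the kernel works on
kernfs nodes (kernfs_path_from_node) rather than on path strings. Both inputs
are assumed to be absolute, normalized paths with no trailing slash.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Write the path of @target as seen from namespace root @root into @buf.
 * Returns the length of the result, or -1 if it does not fit in @buflen
 * bytes including the terminating NUL.
 */
int rel_cgroup_path(const char *root, const char *target,
		    char *buf, size_t buflen)
{
	/* Treat "/" as the empty prefix so component counting stays uniform. */
	size_t rlen = strcmp(root, "/") ? strlen(root) : 0;
	size_t tlen = strcmp(target, "/") ? strlen(target) : 0;
	size_t common = 0, ups = 0, len = 0, rest, i;

	assert(root[0] == '/' && target[0] == '/');

	/* Longest shared prefix that ends on a component boundary. */
	for (i = 0; i <= rlen && i <= tlen; i++) {
		int rend = (i == rlen) || root[i] == '/';
		int tend = (i == tlen) || target[i] == '/';

		if (rend && tend)
			common = i;
		if (i == rlen || i == tlen || root[i] != target[i])
			break;
	}

	/* One "/.." for each component of @root below the common ancestor. */
	for (i = common; i < rlen; i++)
		if (root[i] == '/')
			ups++;

	for (i = 0; i < ups; i++) {
		if (len + 3 >= buflen)
			return -1;
		memcpy(buf + len, "/..", 3);
		len += 3;
	}

	/* Then the part of @target below the common ancestor. */
	rest = tlen - common;
	if (len + rest >= buflen)
		return -1;
	memcpy(buf + len, target + common, rest);
	len += rest;

	/* Identical paths: the target is the namespace root itself. */
	if (len == 0) {
		if (buflen < 2)
			return -1;
		buf[len++] = '/';
	}
	buf[len] = '\0';
	return (int)len;
}
```

With a namespace root of /a/b/c, a task in the real root cgroup is reported as
/../../.., and a task in /a/b/c/d as /d.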

With this, using github.com/hallyn/lxc #2015-11-09/cgns (and
github.com/hallyn/lxcfs #2015-11-10/cgns) we can start a container in a full
proper cgroup namespace, avoiding either cgmanager or lxcfs cgroup bind mounts.

This is completely backward compatible and will be completely invisible
to any existing cgroup users (except for those running inside a cgroup
namespace and looking at /proc/pid/cgroup of tasks outside their
namespace.)

Changes from V7:
1. Rework kernfs_path_from_node_locked to return the string length
2. Rename and reorder args to kernfs_path_from_node
3. cgroup.c: undo accidental conversions to inline
4. cgroup.h: move ns declarations to bottom.
5. Rework the documentation to fit the style of the rest of cgroup.txt

Changes from V6:
1. Switch to some WARN_ONs to provide stack traces
2. Rename kernfs_node_distance to kernfs_depth
3. Make sure kernfs_common_ancestor() nodes are from same root
4. Split kernfs changes for cgroup_mount into separate patch
5. Rename kernfs_obtain_root to kernfs_node_dentry
(And more, see patch changelogs)

Changes from V5:
1. To get a root dentry for cgroup namespace mount, walk the path from the
   kernfs root dentry.

Changes from V4:
1. Move the FS_USERNS_MOUNT flag to last patch
2. Rebase onto cgroup/for-4.5
3. Don't allow non-init user namespaces to bind new subsystems when mounting.
4. Address feedback from Tejun (thanks).  Specifically, not addressed:
   . kernfs_obtain_root - walking dentry from kernfs root.
 (I think that's the only piece)
5. Dropped unused get_task_cgroup fn/patch.
6. Reworked kernfs_path_from_node_locked() to try to simplify the logic.
   It now finds a common ancestor, walks from the source to it, then back
   up to the target.

Changes from V3:
1. Rebased onto latest cgroup changes.  In particular switch to
   css_set_lock and ns_common.
2. Support all hierarchies.

Changes from V2:
1. Added documentation in Documentation/cgroups/namespace.txt
2. Fixed a bug that caused crash
3. Incorporated some other suggestions from last patchset:
   - removed use of threadgroup_lock() while creating new cgroupns
   - use task_lock() instead of rcu_read_lock() while accessing
 task->nsproxy
   - optimized setns() to own cgroupns
   - simplified code around sane-behavior mount option parsing
4. Restored ACKs from Serge Hallyn from v1 on few patches that have
   not changed since then.

Changes from V1:
1. No pinning of processes within cgroupns. Tasks can be freely moved
   across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
   apply as before.
2. Path in /proc/<pid>/cgroup is now always shown and is relative to
   cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
   of the reader and cgroup of <pid>.
3. setns() does not require the process to first move under target
   cgroupns-root.

Changes from RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup ' from inside a cgroupns now
   mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
   anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
   your cgroupns-root.


Re: [PATCH] [RFC] selftests/cgroupns: new test for cgroup namespaces

2015-12-22 Thread Serge Hallyn

Quoting Alban Crequy (alban.cre...@gmail.com):
> From: Alban Crequy 
> 
> This adds the selftest "cgroupns_test" in order to test the CGroup
> Namespace patchset.
> 
> cgroupns_test creates two child processes. They perform a list of
> actions defined by the array cgroupns_test. This array can easily be
> extended to more scenarios without adding much code. They are
> synchronized with eventfds to ensure only one action is performed at a
> time.
> 
> The memory is shared between the processes (CLONE_VM) so each child
> process can know the pid of their siblings without extra IPC.
> 
> The output explains the scenario being played. Short extract:
> 
> > current cgroup: /user.slice/user-0.slice/session-1.scope
> > child process #0: check that process #self (pid=482) has cgroup 
> > /user.slice/user-0.slice/session-1.scope
> > child process #0: unshare cgroupns
> > child process #0: move process #self (pid=482) to cgroup 
> > cgroup-a/subcgroup-a
> > child process #0: join parent cgroupns
> 
> The test does not change the mount namespace and does not mount any
> new cgroup2 filesystem. Therefore this does not test that the cgroup2
> mount is correctly rooted to the cgroupns root at mount time.
> 
> Signed-off-by: Alban Crequy 
> 
> ---
> 
> This patch is available in the cgroupns.v7-tests branch of
> https://github.com/kinvolk/linux.git
> It is based on top of Serge Hallyn's cgroupns.v7 branch of
> https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/
> 
> I see Linux does not have a lot of selftests and there are more Linux
> container tests in Linux Test Project:
> https://github.com/linux-test-project/ltp/tree/master/testcases/kernel/containers
> 
> Is it better to send this test here or to LTP?

Thanks, I'm working on the next iteration today, will give this a shot too.

> ---
>  tools/testing/selftests/Makefile |   1 +
>  tools/testing/selftests/cgroupns/Makefile|  11 +
>  tools/testing/selftests/cgroupns/cgroupns_test.c | 378 +++
>  3 files changed, 390 insertions(+)
>  create mode 100644 tools/testing/selftests/cgroupns/Makefile
>  create mode 100644 tools/testing/selftests/cgroupns/cgroupns_test.c
> 
> diff --git a/tools/testing/selftests/Makefile 
> b/tools/testing/selftests/Makefile
> index c8edff6..694325a 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -1,4 +1,5 @@
>  TARGETS = breakpoints
> +TARGETS += cgroupns
>  TARGETS += cpu-hotplug
>  TARGETS += efivarfs
>  TARGETS += exec
> diff --git a/tools/testing/selftests/cgroupns/Makefile 
> b/tools/testing/selftests/cgroupns/Makefile
> new file mode 100644
> index 000..0fdbe0a
> --- /dev/null
> +++ b/tools/testing/selftests/cgroupns/Makefile
> @@ -0,0 +1,11 @@
> +CFLAGS += -I../../../../usr/include/
> +CFLAGS += -I../../../../include/uapi/
> +
> +all: cgroupns_test
> +
> +TEST_PROGS := cgroupns_test
> +
> +include ../lib.mk
> +
> +clean:
> + $(RM) cgroupns_test
> diff --git a/tools/testing/selftests/cgroupns/cgroupns_test.c 
> b/tools/testing/selftests/cgroupns/cgroupns_test.c
> new file mode 100644
> index 000..d45017c
> --- /dev/null
> +++ b/tools/testing/selftests/cgroupns/cgroupns_test.c
> @@ -0,0 +1,378 @@
> +#define _GNU_SOURCE
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +
> +#include "../kselftest.h"
> +
> +#define STACK_SIZE 65536
> +
> +static char root_cgroup[4096];
> +
> +#define CHILDREN_COUNT 2
> +typedef struct {
> + int pid;
> + uint8_t *stack;
> + int start_semfd;
> + int end_semfd;
> +} cgroupns_child_t;
> +cgroupns_child_t children[CHILDREN_COUNT];
> +
> +typedef enum {
> + UNSHARE_CGROUPNS,
> + JOIN_CGROUPNS,
> + CHECK_CGROUP,
> + CHECK_CGROUP_WITH_ROOT_PREFIX,
> + MOVE_CGROUP,
> + MOVE_CGROUP_WITH_ROOT_PREFIX,
> +} cgroupns_action_t;
> +
> +static const struct {
> + int actor_pid;
> + cgroupns_action_t action;
> + int target_pid;
> + char *path;
> +} cgroupns_tests[] = {
> + { 0, CHECK_CGROUP_WITH_ROOT_PREFIX, -1, ""},
> + { 0, CHECK_CGROUP_WITH_ROOT_PREFIX, 0, ""},
> + { 0, CHECK_CGROUP_WITH_ROOT_PREFIX, 1, ""},
> + { 1, CHECK_CGROUP_WITH_ROOT_PREFIX, -1, ""},
> + { 1, CHECK_CGROUP_WITH_ROOT_PREFIX, 0, ""},
> + { 1, CHECK_CGROUP_WITH_ROOT_PREFIX, 1, ""},
> +
> + { 0, UNSHARE_CGROUPNS, -1, NULL},
> +
> + { 0, CHECK_CGROUP, -1, "/"},
> + { 0, CHECK_CGROUP, 0, "/"},
> + { 0, CHECK_CGROUP, 1, "/"},
> + { 1, CHECK_CGROUP_WITH_ROOT_PREFIX, -1, ""},
> + { 1, CHECK_CGROUP_WITH_ROOT_PREFIX, 0, ""},
> + { 1, CHECK_CGROUP_WITH_ROOT_PREFIX, 1, ""},
> +
> + { 1, UNSHARE_CGROUPNS, -1, NULL},
> +
> + { 0, CHECK_CGROUP, -1, "/"},
> + { 0, CHECK_CGROUP, 0, "/"},
> + { 0, CHECK_CGROUP, 1, "/"},
> + { 1, CHECK_CGROUP, -1, "/"},
> +

Re: [PATCH 1/8] kernfs: Add API to generate relative kernfs path

2015-12-09 Thread Serge Hallyn

Quoting Tejun Heo (t...@kernel.org):
> Hello, Serge.
> 
> On Wed, Dec 09, 2015 at 01:28:54PM -0600, serge.hal...@ubuntu.com wrote:
> > +/* kernfs_node_depth - compute depth from @from to @to */
> > +static size_t kernfs_depth(struct kernfs_node *from, struct kernfs_node *to)
> ...
> > +char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
> > +{
> > +   return kernfs_path_from_node(NULL, kn, buf, buflen);
> > +}
> ...
> > diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> > index 5d4e9c4..d025ebd 100644
> > --- a/include/linux/kernfs.h
> > +++ b/include/linux/kernfs.h
> > @@ -267,6 +267,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
> >  
> >  int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
> >  size_t kernfs_path_len(struct kernfs_node *kn);
> > +char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
> > + struct kernfs_node *kn, char *buf,
> > + size_t buflen);
> 
> I think I commented on the same thing before, but I think it'd make
> more sense to put @from after @to

Oh.  You said that for kernfs_path_from_node_locked(), and those were
changed.  kernfs_path_from_node() is a different fn, but

> and the prototype is using @root_kn
> which is a bit confusing.

we can rename root_kn to from here if you think that's clearer (and
change the order here as well).

> Was converting the path functions to return
> length too much work?  If so, that's fine but please explain what
> decisions were made.

Yes, I had replied saying:

 |I can change that, but the callers right now don't re-try with
 |larger buffer anyway, so this would actually complicate them just
 |a smidgeon.  Would you want them changed to do that?  (pr_cont_kernfs_path
 |right now writes into a static char[] for instance)

I can still make that change if you like.

> I skimmed through the series and spotted several other review points
> which didn't get addressed.  Can you please go over the previous
> review cycle and address the review points?

I did go through every email twice, once while making changes (one
branch per response) and once while making changelog for each patch,
sorry about whatever I missed.  I'll go through each again.

I'm going to be out for a while after today, so the next version will
unfortunately take a while.

thanks,
-serge
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

CGroup Namespaces (v7)

2015-12-09 Thread serge . hallyn

Hi,

following is a revised set of the CGroup Namespace patchset which Aditya
Kali has previously sent.  The code can also be found in the cgroupns.v7
branch of

https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/

To summarize the semantics:

1. CLONE_NEWCGROUP re-uses 0x02000000, which was previously CLONE_STOPPED

2. unsharing a cgroup namespace makes all your current cgroups your new
cgroup root.

3. /proc/pid/cgroup always shows cgroup paths relative to the reader's
cgroup namespace root.  A task outside of your cgroup looks like

8:memory:/../../..

4. when a task mounts a cgroupfs, the cgroup which shows up as root depends
on the mounting task's cgroup namespace.

5. setns to a cgroup namespace switches your cgroup namespace but not
your cgroups.
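
As a rough illustration of point 3, here is a small userspace sketch of how a
path relative to a namespace root can be derived from two absolute cgroup
paths.  It is not the kernel code: rel_cgroup_path() and common_len() are
hypothetical names, and the sketch assumes both inputs are normalized absolute
paths with no trailing slash.

```c
#include <stddef.h>
#include <string.h>

/* Length of the longest shared prefix of a and b ending on a '/' boundary. */
static size_t common_len(const char *a, const char *b)
{
	size_t i = 0, last = 0;

	for (;;) {
		int a_end = (a[i] == '\0' || a[i] == '/');
		int b_end = (b[i] == '\0' || b[i] == '/');

		/* The first i characters matched and both stop on a boundary. */
		if (a_end && b_end)
			last = i;
		if (a[i] == '\0' || a[i] != b[i])
			break;
		i++;
	}
	return last;
}

/*
 * Hypothetical helper: write the path of @target as seen from @ns_root
 * into @buf.  Returns 0 on success, -1 if @buf is too small.
 */
int rel_cgroup_path(const char *ns_root, const char *target,
		    char *buf, size_t buflen)
{
	/* Treat the root "/" as the empty path so prefixes line up. */
	const char *from = strcmp(ns_root, "/") ? ns_root : "";
	const char *to = strcmp(target, "/") ? target : "";
	size_t common = common_len(from, to);
	size_t ups = 0, rest, need;
	const char *p;
	char *out = buf;

	/* One "/.." per component of ns_root below the common ancestor. */
	for (p = from + common; *p; p++)
		if (*p == '/')
			ups++;

	rest = strlen(to + common);
	need = ups * 3 + rest;
	if (need == 0)
		need = 1;		/* the result is just "/" */
	if (need + 1 > buflen)
		return -1;

	if (ups == 0 && rest == 0) {
		strcpy(buf, "/");
		return 0;
	}
	while (ups--) {
		memcpy(out, "/..", 3);
		out += 3;
	}
	memcpy(out, to + common, rest + 1);	/* copies the trailing NUL */
	return 0;
}
```

With a namespace root of /n1/n2/n3/n4 and a target of /n1/n2/n5 this yields
/../../n5, the same shape as the 8:memory:/../../.. line above.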

With this, using github.com/hallyn/lxc #2015-11-09/cgns (and
github.com/hallyn/lxcfs #2015-11-10/cgns) we can start a container in a full
proper cgroup namespace, avoiding either cgmanager or lxcfs cgroup bind mounts.

This is completely backward compatible and will be completely invisible
to any existing cgroup users (except for those running inside a cgroup
namespace and looking at /proc/pid/cgroup of tasks outside their
namespace.)

Changes from V6:
1. Switch to some WARN_ONs to provide stack traces
2. Rename kernfs_node_distance to kernfs_depth
3. Make sure kernfs_common_ancestor() nodes are from same root
4. Split kernfs changes for cgroup_mount into separate patch
5. Rename kernfs_obtain_root to kernfs_node_dentry
(And more, see patch changelogs)

Changes from V5:
1. To get a root dentry for cgroup namespace mount, walk the path from the
   kernfs root dentry.

Changes from V4:
1. Move the FS_USERNS_MOUNT flag to last patch
2. Rebase onto cgroup/for-4.5
3. Don't allow non-init user namespaces to bind new subsystems when mounting.
4. Address feedback from Tejun (thanks).  Specifically, not addressed:
   . kernfs_obtain_root - walking dentry from kernfs root.
 (I think that's the only piece)
5. Dropped unused get_task_cgroup fn/patch.
6. Reworked kernfs_path_from_node_locked() to try to simplify the logic.
   It now finds a common ancestor, walks from the source to it, then back
   up to the target.

Changes from V3:
1. Rebased onto latest cgroup changes.  In particular switch to
   css_set_lock and ns_common.
2. Support all hierarchies.

Changes from V2:
1. Added documentation in Documentation/cgroups/namespace.txt
2. Fixed a bug that caused crash
3. Incorporated some other suggestions from last patchset:
   - removed use of threadgroup_lock() while creating new cgroupns
   - use task_lock() instead of rcu_read_lock() while accessing
 task->nsproxy
   - optimized setns() to own cgroupns
   - simplified code around sane-behavior mount option parsing
4. Restored ACKs from Serge Hallyn from v1 on few patches that have
   not changed since then.

Changes from V1:
1. No pinning of processes within cgroupns. Tasks can be freely moved
   across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
   apply as before.
2. Path in /proc/<pid>/cgroup is now always shown and is relative to
   cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
   of the reader and cgroup of <pid>.
3. setns() does not require the process to first move under target
   cgroupns-root.

Changes from RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
   mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
   anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
   your cgroupns-root.



[PATCH 3/8] cgroup: introduce cgroup namespaces

2015-12-09 Thread serge . hallyn

From: Aditya Kali 

Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as cgroupns-root).
The main purpose of a cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
 they will see a relative path from their cgroupns-root).
For a correctly set up container this enables container tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking the system-level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali 
Signed-off-by: Serge Hallyn 
---
Changelog: 2015-11-24
- move cgroup_namespace.c into cgroup.c (and .h)
- reformatting
- make get_cgroup_ns return void
- rename ns->root_cgrps to root_cset.
Changelog: 2015-12-08
- Move init_cgroup_ns to other variable declarations
- Remove accidental conversion of put-css_set to inline
- Drop BUG_ON(NULL)
- Remove unneeded pre declaration of struct cgroupns_operations.
- cgroup.h: collect common ns declarations
---
 fs/proc/namespaces.c|3 +
 include/linux/cgroup.h  |   54 --
 include/linux/nsproxy.h |2 +
 include/linux/proc_ns.h |4 ++
 kernel/cgroup.c |  146 ++-
 kernel/fork.c   |2 +-
 kernel/nsproxy.c|   21 ++-
 7 files changed, 220 insertions(+), 12 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index f6e8354..bd61075 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -28,6 +28,9 @@ static const struct proc_ns_operations *ns_entries[] = {
&userns_operations,
 #endif
&mntns_operations,
+#ifdef CONFIG_CGROUPS
+   &cgroupns_operations,
+#endif
 };
 
 static const char *proc_ns_follow_link(struct dentry *dentry, void **cookie)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 2b3e2314..906e348 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -17,9 +17,57 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
 
 #include 
 
+struct cgroup_namespace {
+   atomic_t                count;
+   struct ns_common        ns;
+   struct user_namespace   *user_ns;
+   struct css_set          *root_cset;
+};
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+#ifdef CONFIG_CGROUPS
+
+void free_cgroup_ns(struct cgroup_namespace *ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+   struct user_namespace *user_ns,
+   struct cgroup_namespace *old_ns);
+
+char * cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen);
+
+#else /* !CONFIG_CGROUPS */
+
+static inline void free_cgroup_ns(struct cgroup_namespace *ns) { }
+static inline struct cgroup_namespace *
+copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
+  struct cgroup_namespace *old_ns)
+{
+   return old_ns;
+}
+
+#endif /* !CONFIG_CGROUPS */
+
+static inline void get_cgroup_ns(struct cgroup_namespace *ns)
+{
+   if (ns)
+   atomic_inc(&ns->count);
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+   if (ns && atomic_dec_and_test(&ns->count))
+   free_cgroup_ns(ns);
+}
+
 #ifdef CONFIG_CGROUPS
 
 /*
@@ -509,12 +557,6 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
return kernfs_name(cgrp->kn, buf, buflen);
 }
 
-static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
- size_t buflen)
-{
-   return kernfs_path(cgrp->kn, buf, buflen);
-}
-
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
 {
pr_cont_kernfs_name(cgrp->kn);
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct cgroup_namespace;
 struct fs_struct;
 
 /*
@@ -33,6 +34,7 @@ struct nsproxy {
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net   *net_ns;
+   struct cgroup_namespace *cgroup_ns;
 };
 extern struct nsproxy init_nsproxy;
 
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 42dfc61..de0e771 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -9,6 +9,8 @@
 struct pid_namespace;
 struct nsproxy;
 struct path;
+struct task_struct;
+struct inode;
 
 struct proc_ns_operations {
const char *name;

[PATCH 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace

2015-12-09 Thread serge . hallyn

From: Aditya Kali 

CLONE_NEWCGROUP will be used to create new cgroup namespace.
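
For readers who want to try the flag from userspace, the following is a
minimal sketch, not part of the patch.  It assumes a kernel carrying this
series and a caller with sufficient privilege; the fallback #define is only
there because older libc headers do not know the flag, and
cgroup_line_path() and unshare_and_show() are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* value taken from this series */
#endif

/*
 * Return the path field of one "id:controllers:path" line as found in
 * /proc/<pid>/cgroup, or NULL if the line does not have two colons.
 */
const char *cgroup_line_path(const char *line)
{
	const char *p = strchr(line, ':');

	if (!p)
		return NULL;
	p = strchr(p + 1, ':');
	return p ? p + 1 : NULL;
}

/*
 * Unshare the cgroup namespace, then print the cgroup paths the caller
 * now sees.  Returns 0 on success, -1 on failure (errno is set by the
 * failing call).
 */
int unshare_and_show(void)
{
	char line[4096];
	FILE *f;

	if (unshare(CLONE_NEWCGROUP) < 0)
		return -1;

	f = fopen("/proc/self/cgroup", "r");
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		const char *path = cgroup_line_path(line);

		if (path)
			fputs(path, stdout);	/* line keeps its newline */
	}
	fclose(f);
	return 0;
}
```

Going by the semantics described in the cover letter, each printed path
should be "/" right after a successful unshare, since the caller's current
cgroups become its new cgroup roots.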

Signed-off-by: Aditya Kali 
Signed-off-by: Serge Hallyn 
---
 include/uapi/linux/sched.h |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index cc89dde..5f0fe01 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED 0x00400000  /* Unused, ignored */
 #define CLONE_UNTRACED 0x00800000  /* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID 0x01000000  /* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_NEWCGROUP 0x02000000  /* New cgroup namespace */
 #define CLONE_NEWUTS   0x04000000  /* New utsname namespace */
 #define CLONE_NEWIPC   0x08000000  /* New ipc namespace */
 #define CLONE_NEWUSER  0x10000000  /* New user namespace */
-- 
1.7.9.5


[PATCH 8/8] Add FS_USERNS_FLAG to cgroup fs

2015-12-09 Thread serge . hallyn

From: Serge Hallyn 

Set FS_USERNS_MOUNT on the cgroup filesystems, allowing root in a non-init
user namespace to mount them.  This should
now be safe, because

1. non-init-root cannot mount a previously unbound subsystem
2. the task doing the mount must be privileged with respect to the
   user namespace owning the cgroup namespace
3. the mounted subsystem will have its current cgroup as the root dentry.
   the permissions will be unchanged, so tasks will receive no new
   privilege over the cgroups which they did not have on the original
   mounts.

Signed-off-by: Serge Hallyn 
---
 kernel/cgroup.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b92b3fd..7d5d7e1 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2231,12 +2231,14 @@ static struct file_system_type cgroup_fs_type = {
.name = "cgroup",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
+   .fs_flags = FS_USERNS_MOUNT,
 };
 
 static struct file_system_type cgroup2_fs_type = {
.name = "cgroup2",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
+   .fs_flags = FS_USERNS_MOUNT,
 };
 
 static char * cgroup_path_ns(struct cgroup *cgrp, char *buf, size_t buflen,
-- 
1.7.9.5


[PATCH 5/8] kernfs: define kernfs_node_dentry

2015-12-09 Thread serge . hallyn

From: Aditya Kali 

Add a new kernfs API to look up the dentry for a particular
kernfs path.
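
The lookup walks from the root down to the requested node one ancestor at a
time.  The sketch below models only that walk in plain userspace C, with a toy
struct node standing in for struct kernfs_node; next_ancestor() mirrors the
shape of find_next_ancestor() in the diff, while walk_down() is a hypothetical
stand-in for the dentry lookup loop and records names instead of calling
lookup_one_len().

```c
#include <stddef.h>

/* Toy stand-in for struct kernfs_node: only the parent link matters here. */
struct node {
	const char *name;
	struct node *parent;
};

/*
 * Return the ancestor of @child (or @child itself) whose parent is @parent.
 * With @parent == NULL this is the root.  Returns NULL if @parent is not an
 * ancestor of @child, or if @child == @parent.
 */
struct node *next_ancestor(struct node *child, struct node *parent)
{
	if (child == parent)
		return NULL;

	while (child->parent != parent) {
		if (!child->parent)
			return NULL;
		child = child->parent;
	}
	return child;
}

/*
 * Walk from the root down to @kn, storing each visited name in @names.
 * Returns the number of nodes visited, or -1 on error or if more than
 * @max entries would be needed.
 */
int walk_down(struct node *kn, const char **names, int max)
{
	struct node *cur = NULL;
	int n = 0;

	while (cur != kn) {
		cur = next_ancestor(kn, cur);
		if (!cur || n == max)
			return -1;
		names[n++] = cur->name;
	}
	return n;
}
```

Each call to next_ancestor() re-walks the parent chain from @kn, so the walk is
quadratic in the depth.  That looks acceptable for the shallow trees involved,
though that is a judgment about this sketch rather than a claim from the patch.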

Signed-off-by: Aditya Kali 
Signed-off-by: Serge E. Hallyn 
---
Changelog:
20151116 - Don't allow user namespaces to bind new subsystems
20151118 - postpone the FS_USERNS_MOUNT flag until the
   last patch, until we can convince ourselves it
   is safe.
20151207 - Switch to walking up the kernfs path from kn root.
20151208 - Split out the kernfs change
 - Style changes
 - Switch from pr_crit to WARN_ON
 - Reorder arguments to kernfs_obtain_root
 - rename kernfs_obtain_root to kernfs_node_dentry
---
 fs/kernfs/mount.c  |   67 
 include/linux/kernfs.h |2 ++
 2 files changed, 69 insertions(+)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 8eaf417..7224296 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "kernfs-internal.h"
 
@@ -62,6 +63,72 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
return NULL;
 }
 
+/*
+ * find the next ancestor in the path down to @child, where @parent was the
+ * ancestor whose descendant we want to find.
+ *
+ * Say the path is /a/b/c/d.  @child is d, @parent is NULL.  We return the root
+ * node.  If @parent is b, then we return the node for c.
+ * Passing in d as @parent is not ok.
+ */
+static struct kernfs_node *
+find_next_ancestor(struct kernfs_node *child, struct kernfs_node *parent)
+{
+   if (child == parent) {
+   pr_crit_once("BUG in find_next_ancestor: called with parent == child");
+   return NULL;
+   }
+
+   while (child->parent != parent) {
+   if (!child->parent)
+   return NULL;
+   child = child->parent;
+   }
+
+   return child;
+}
+
+/**
+ * kernfs_node_dentry - get a dentry for the given kernfs_node
+ * @kn: kernfs_node for which a dentry is needed
+ * @sb: the kernfs super_block
+ */
+struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
+ struct super_block *sb)
+{
+   struct dentry *dentry;
+   struct kernfs_node *knparent = NULL;
+
+   BUG_ON(sb->s_op != &kernfs_sops);
+
+   dentry = dget(sb->s_root);
+
+   /* Check if this is the root kernfs_node */
+   if (!kn->parent)
+   return dentry;
+
+   knparent = find_next_ancestor(kn, NULL);
+   if (WARN_ON(!knparent))
+   return ERR_PTR(-EINVAL);
+
+   do {
+   struct dentry *dtmp;
+   struct kernfs_node *kntmp;
+
+   if (kn == knparent)
+   return dentry;
+   kntmp = find_next_ancestor(kn, knparent);
+   if (WARN_ON(!kntmp))
+   return ERR_PTR(-EINVAL);
+   dtmp = lookup_one_len(kntmp->name, dentry, strlen(kntmp->name));
+   dput(dentry);
+   if (IS_ERR(dtmp))
+   return dtmp;
+   knparent = kntmp;
+   dentry = dtmp;
+   } while (1);
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index d025ebd..6eba888 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -284,6 +284,8 @@ struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
 struct inode *kernfs_get_inode(struct super_block *sb, struct kernfs_node *kn);
 
+struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
+ struct super_block *sb);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
   unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
-- 
1.7.9.5


[PATCH 6/8] cgroup: mount cgroupns-root when inside non-init cgroupns

2015-12-09 Thread serge . hallyn

From: Serge Hallyn 

This patch enables cgroup mounting inside userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container-setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.

Signed-off-by: Serge Hallyn 
---
Changelog:
20151116 - Don't allow user namespaces to bind new subsystems
20151118 - postpone the FS_USERNS_MOUNT flag until the
   last patch, until we can convince ourselves it
   is safe.
20151207 - Switch to walking up the kernfs path from kn root.
 - Group initialized variables
 - Explain the capable(CAP_SYS_ADMIN) check
 - Style fixes
---
 kernel/cgroup.c |   40 +++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f34551a..b92b3fd 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2006,6 +2006,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 {
bool is_v2 = fs_type == &cgroup2_fs_type;
struct super_block *pinned_sb = NULL;
+   struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
struct cgroup_subsys *ss;
struct cgroup_root *root;
struct cgroup_sb_opts opts;
@@ -2014,6 +2015,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int i;
bool new_sb;
 
+   get_cgroup_ns(ns);
+
+   /* Check if the caller has permission to mount. */
+   if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+   put_cgroup_ns(ns);
+   return ERR_PTR(-EPERM);
+   }
+
/*
 * The first time anyone tries to mount a cgroup, enable the list
 * linking each css_set to its tasks and fix up all existing tasks.
@@ -2129,6 +2138,16 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
goto out_unlock;
}
 
+   /*
+* We know this subsystem has not yet been bound.  Users in a non-init
+* user namespace may only mount hierarchies with no bound subsystems,
+* i.e. 'none,name=user1'
+*/
+   if (!opts.none && !capable(CAP_SYS_ADMIN)) {
+   ret = -EPERM;
+   goto out_unlock;
+   }
+
root = kzalloc(sizeof(*root), GFP_KERNEL);
if (!root) {
ret = -ENOMEM;
@@ -2147,12 +2166,30 @@ out_free:
kfree(opts.release_agent);
kfree(opts.name);
 
-   if (ret)
+   if (ret) {
+   put_cgroup_ns(ns);
return ERR_PTR(ret);
+   }
 out_mount:
dentry = kernfs_mount(fs_type, flags, root->kf_root,
  is_v2 ? CGROUP2_SUPER_MAGIC : CGROUP_SUPER_MAGIC,
  &new_sb);
+
+   /*
+* In non-init cgroup namespace, instead of root cgroup's
+* dentry, we return the dentry corresponding to the
+* cgroupns->root_cgrp.
+*/
+   if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {
+   struct dentry *nsdentry;
+   struct cgroup *cgrp;
+
+   cgrp = cset_cgroup_from_root(ns->root_cset, root);
+   nsdentry = kernfs_node_dentry(cgrp->kn, dentry->d_sb);
+   dput(dentry);
+   dentry = nsdentry;
+   }
+
if (IS_ERR(dentry) || !new_sb)
cgroup_put(&root->cgrp);
 
@@ -2165,6 +2202,7 @@ out_mount:
deactivate_super(pinned_sb);
}
 
+   put_cgroup_ns(ns);
return dentry;
 }
 
-- 
1.7.9.5


[PATCH 4/8] cgroup: cgroup namespace setns support

2015-12-09 Thread serge . hallyn

From: Aditya Kali 

setns on a cgroup namespace is allowed only if the
task has CAP_SYS_ADMIN in its current user namespace and
over the user namespace associated with the target cgroupns.
No implicit cgroup changes happen when attaching to another
cgroupns. It is expected that someone moves the attaching
process under the target cgroupns-root.
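
From userspace, attaching to another task's cgroup namespace would look
roughly like the sketch below.  It assumes a kernel with this series applied
and a caller holding the capabilities described above; cgroup_ns_path() and
join_cgroup_ns() are hypothetical helper names, and the fallback #define
covers libc headers that predate the flag.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* value taken from this series */
#endif

/* Build "/proc/<pid>/ns/cgroup" in @buf; returns -1 if it does not fit. */
int cgroup_ns_path(pid_t pid, char *buf, size_t buflen)
{
	int n = snprintf(buf, buflen, "/proc/%ld/ns/cgroup", (long)pid);

	return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/*
 * Switch the caller into the cgroup namespace of @pid.  Only the
 * namespace changes; the caller's cgroup membership is left alone.
 * Returns 0 on success, -1 on failure.
 */
int join_cgroup_ns(pid_t pid)
{
	char path[64];
	int fd, ret;

	if (cgroup_ns_path(pid, path, sizeof(path)) < 0)
		return -1;

	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	/* Passing the type makes setns() reject any other kind of ns fd. */
	ret = setns(fd, CLONE_NEWCGROUP);
	close(fd);
	return ret;
}
```

Because no implicit cgroup move happens, a caller that wants to end up under
the target cgroupns-root still has to be moved into the matching cgroup
separately.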

Signed-off-by: Aditya Kali 
Signed-off-by: Serge E. Hallyn 
---
 kernel/cgroup.c |   24 +---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 92db64c..f34551a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5904,10 +5904,28 @@ err_out:
return ERR_PTR(err);
 }
 
-static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+static inline struct cgroup_namespace *to_cg_ns(struct ns_common *ns)
 {
-   pr_info("setns not supported for cgroup namespace");
-   return -EINVAL;
+   return container_of(ns, struct cgroup_namespace, ns);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, struct ns_common *ns)
+{
+   struct cgroup_namespace *cgroup_ns = to_cg_ns(ns);
+
+   if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+   !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+   return -EPERM;
+
+   /* Don't need to do anything if we are attaching to our own cgroupns. */
+   if (cgroup_ns == nsproxy->cgroup_ns)
+   return 0;
+
+   get_cgroup_ns(cgroup_ns);
+   put_cgroup_ns(nsproxy->cgroup_ns);
+   nsproxy->cgroup_ns = cgroup_ns;
+
+   return 0;
 }
 
 static struct ns_common *cgroupns_get(struct task_struct *task)
-- 
1.7.9.5


[PATCH 1/8] kernfs: Add API to generate relative kernfs path

2015-12-09 Thread serge . hallyn

From: Aditya Kali 

The new function kernfs_path_from_node() generates and returns kernfs
path of a given kernfs_node relative to a given parent kernfs_node.

Signed-off-by: Aditya Kali 
Signed-off-by: Serge E. Hallyn 
---
Changelog 20151125:
  - Fully-winged multiline comments
  - Rework kernfs_path_from_node_locked() logic
  - Replace BUG_ONs with returning NULL
  - Use a const char* for /.. and precalculate its size
Changelog 20151130:
  - Update kernfs_path_from_node_locked comment
Changelog 20151208:
  - kernfs_node_distance:
* Remove BUG_ON(NULL)s
* Rename kernfs_node_distance to kernfs_depth
  - kernfs_common_ancestor:
* Remove useless checks for depth == 0
* Add check to ensure nodes are from same root
  - kernfs_path_from_node_locked:
* Remove needless __must_check
* Put p;len on its own decl line.
* Fix wrong WARN_ONCE usage
---
 fs/kernfs/dir.c|  177 
 include/linux/kernfs.h |3 +
 2 files changed, 153 insertions(+), 27 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 91e0045..d1a001a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,28 +44,129 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
- size_t buflen)
+/* kernfs_node_depth - compute depth from @from to @to */
+static size_t kernfs_depth(struct kernfs_node *from, struct kernfs_node *to)
 {
-   char *p = buf + buflen;
-   int len;
+   size_t depth = 0;
 
-   *--p = '\0';
+   while (to->parent && to != from) {
+   depth++;
+   to = to->parent;
+   }
+   return depth;
+}
 
-   do {
-   len = strlen(kn->name);
-   if (p - buf < len + 1) {
-   buf[0] = '\0';
-   p = NULL;
-   break;
-   }
-   p -= len;
-   memcpy(p, kn->name, len);
-   *--p = '/';
-   kn = kn->parent;
-   } while (kn && kn->parent);
+static struct kernfs_node *kernfs_common_ancestor(struct kernfs_node *a,
+   struct kernfs_node *b)
+{
+   size_t da, db;
+   struct kernfs_root *ra = kernfs_root(a), *rb = kernfs_root(b);
 
-   return p;
+   if (ra != rb)
+   return NULL;
+
+   da = kernfs_depth(ra->kn, a);
+   db = kernfs_depth(rb->kn, b);
+
+   while (da > db) {
+   a = a->parent;
+   da--;
+   }
+   while (db > da) {
+   b = b->parent;
+   db--;
+   }
+
+   /* worst case b and a will be the same at root */
+   while (b != a) {
+   b = b->parent;
+   a = a->parent;
+   }
+
+   return a;
+}
+
+/**
+ * kernfs_path_from_node_locked - find a pseudo-absolute path to @kn_to,
+ * where kn_from is treated as root of the path.
+ * @kn_from: kernfs node which should be treated as root for the path
+ * @kn_to: kernfs node to which path is needed
+ * @buf: buffer to copy the path into
+ * @buflen: size of @buf
+ *
+ * We need to handle couple of scenarios here:
+ * [1] when @kn_from is an ancestor of @kn_to at some level
+ * kn_from: /n1/n2/n3
+ * kn_to:   /n1/n2/n3/n4/n5
+ * result:  /n4/n5
+ *
+ * [2] when @kn_from is on a different hierarchy and we need to find common
+ * ancestor between @kn_from and @kn_to.
+ * kn_from: /n1/n2/n3/n4
+ * kn_to:   /n1/n2/n5
+ * result:  /../../n5
+ * OR
+ * kn_from: /n1/n2/n3/n4/n5   [depth=5]
+ * kn_to:   /n1/n2/n3 [depth=3]
+ * result:  /../..
+ */
+static char *
+kernfs_path_from_node_locked(struct kernfs_node *kn_to,
+struct kernfs_node *kn_from, char *buf,
+size_t buflen)
+{
+   char *p = buf;
+   struct kernfs_node *kn, *common;
+   const char parent_str[] = "/..";
+   int i;
+   size_t depth_from, depth_to, len = 0, nlen = 0;
+   size_t plen = sizeof(parent_str) - 1;
+
+   /* We atleast need 2 bytes to write "/\0". */
+   if (buflen < 2)
+   return NULL;
+
+   if (!kn_from)
+   kn_from = kernfs_root(kn_to)->kn;
+
+   if (kn_from == kn_to) {
+   *p = '/';
+   *(++p) = '\0';
+   return buf;
+   }
+
+   common = kernfs_common_ancestor(kn_from, kn_to);
+   if (WARN_ON(!common))
+   return NULL;
+
+   depth_to = kernfs_depth(common, kn_to);
+   depth_from = kernfs_depth(common, kn_from);
+
+   for (i = 0; i < depth_from; i++) {
+   if (len + plen + 1 > buflen)
+   return NULL;
+   strcpy(p, parent_str);
+   p += plen;
+   len += plen;
+   }
+
+   /* Calculate how many bytes we need for the rest

[PATCH 7/8] cgroup: Add documentation for cgroup namespaces

2015-12-09 Thread serge . hallyn

From: Aditya Kali 

Signed-off-by: Aditya Kali 
Signed-off-by: Serge Hallyn 
---
Changelog (2015-12-08): Merge into Documentation/cgroup.txt
---
 Documentation/cgroup.txt |  144 ++
 1 file changed, 144 insertions(+)

diff --git a/Documentation/cgroup.txt b/Documentation/cgroup.txt
index 31d1f7b..ca42df4 100644
--- a/Documentation/cgroup.txt
+++ b/Documentation/cgroup.txt
@@ -47,6 +47,7 @@ CONTENTS
   5-3. IO
 5-3-1. IO Interface Files
 5-3-2. Writeback
+6. Namespaces
 P. Information on Kernel Programming
   P-1. Filesystem Support for Writeback
 D. Deprecated v1 Core Features
@@ -1013,6 +1014,149 @@ writeback as follows.
vm.dirty[_background]_ratio.
 
 
+6. CGroup Namespaces
+
+CGroup Namespace provides a mechanism to virtualize the view of the
+/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
+clone() and unshare() syscalls to create a new cgroup namespace.
+The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
+output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
+at the time of creation of the cgroup namespace.
+
+Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
+path of the cgroup of a process. In a container setup (where a set of cgroups
+and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
+may leak potential system level information to the isolated processes.
+
+For Example:
+  $ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+The path '/batchjobs/container_id1' can generally be considered as system-data
+and it is desirable to not expose it to the isolated process.
+
+CGroup Namespaces can be used to restrict visibility of this path.
+For Example:
+  # Before creating cgroup namespace
+  $ ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
+  $ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
+  $ ~/unshare -c
+  [ns]$ ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
+  # From within new cgroupns, process sees that it is in the root cgroup
+  [ns]$ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+
+  # From global cgroupns:
+  $ cat /proc/<pid>/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+  # Unshare cgroupns along with userns and mountns
+  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
+  # sets up uid/gid map and execs /bin/bash
+  $ ~/unshare -c -u -m
+  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
+  # hierarchy.
+  [ns]$ mount -t cgroup cgroup /tmp/cgroup
+  [ns]$ ls -l /tmp/cgroup
+  total 0
+  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
+  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
+  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
+  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
+
+The cgroupns-root (/batchjobs/container_id1 in above example) becomes the
+filesystem root for the namespace specific cgroupfs mount.
+
+The virtualization of /proc/self/cgroup file combined with restricting
+the view of cgroup hierarchy by namespace-private cgroupfs mount
+should provide a completely isolated cgroup view inside the container.
+
+In its current form, the cgroup namespaces patchset provides the following
+behavior:
+
+(1) The 'cgroupns-root' for a cgroup namespace is the cgroup in which
+the process calling unshare is running.
+For ex. if a process in /batchjobs/container_id1 cgroup calls unshare,
+cgroup /batchjobs/container_id1 becomes the cgroupns-root.
+For the init_cgroup_ns, this is the real root ('/') cgroup
+(identified in code as cgrp_dfl_root.cgrp).
+
+(2) The cgroupns-root cgroup does not change even if the namespace
+creator process later moves to a different cgroup.
+$ ~/unshare -c # unshare cgroupns in some cgroup
+[ns]$ cat /proc/self/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+[ns]$ mkdir sub_cgrp_1
+[ns]$ echo 0 > sub_cgrp_1/cgroup.procs
+[ns]$ cat /proc/self/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
+(a) Processes running inside the cgroup namespace will be able to see
+cgroup paths (in /proc/self/cgroup) only inside their root cgroup
+[ns]$ sleep 10 &  # From within unshared cgroupns
+[1] 7353
+[ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
+[ns]$ cat /proc/7353/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(b) From global cgroupns, the real cgroup path will be visible:
+

CGroup Namespaces (v6)

2015-12-07 Thread serge . hallyn

Hi,

following is a revised set of the CGroup Namespace patchset which Aditya
Kali has previously sent.  The code can also be found in the cgroupns.v6
branch of

https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/

To summarize the semantics:

1. CLONE_NEWCGROUP re-uses 0x02000000, which was previously CLONE_STOPPED

2. unsharing a cgroup namespace makes all your current cgroups your new
cgroup root.

3. /proc/pid/cgroup always shows cgroup paths relative to the reader's
cgroup namespace root.  A task outside of your cgroup looks like

8:memory:/../../..

4. when a task mounts a cgroupfs, the cgroup which shows up as root depends
on the mounting task's  cgroup namespace.

5. setns to a cgroup namespace switches your cgroup namespace but not
your cgroups.
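
As a rough userspace illustration of item 3 (this is not code from the
patchset; the name ns_relative_path() and the string-based approach are
assumptions made only for this sketch), the path a reader sees can be modelled
as "the target path rewritten relative to the reader's namespace root", with
one "/.." for every level the root sits below the common ancestor:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace helper, not part of the patchset.
 * Writes into buf the path of cgroup `to` as seen from a cgroup namespace
 * rooted at `from`.  Both inputs are absolute, normalized paths ("/" or
 * "/a/b", no trailing slash).  Returns 0 on success, -1 if buf is too small.
 */
int ns_relative_path(const char *from, const char *to, char *buf, size_t buflen)
{
	size_t i = 0, common = 0, ups = 0, restlen, need;
	const char *s;
	char *p = buf;

	/* Treat the root as an empty list of components. */
	if (strcmp(from, "/") == 0)
		from = "";
	if (strcmp(to, "/") == 0)
		to = "";

	/* Find the longest common prefix that ends on a component boundary. */
	for (;;) {
		int from_end = (from[i] == '\0' || from[i] == '/');
		int to_end = (to[i] == '\0' || to[i] == '/');

		if (from_end && to_end)
			common = i;
		if (from[i] == '\0' || to[i] == '\0' || from[i] != to[i])
			break;
		i++;
	}

	/* One "/.." for every component of `from` below the common ancestor. */
	for (s = from + common; *s; s++)
		if (*s == '/')
			ups++;

	restlen = strlen(to + common);
	need = ups * 3 + restlen;
	if (need == 0)
		need = 1;	/* the namespace root itself is shown as "/" */
	if (need + 1 > buflen)
		return -1;

	while (ups--) {
		memcpy(p, "/..", 3);
		p += 3;
	}
	memcpy(p, to + common, restlen);
	p += restlen;
	if (p == buf)
		*p++ = '/';
	*p = '\0';
	return 0;
}
```

With a namespace root of /n1/n2/n3/n4, a task in /n1/n2/n5 would show up as
/../../n5 under this model, and a task in the namespace root itself as /.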

With this, using github.com/hallyn/lxc #2015-11-09/cgns (and
github.com/hallyn/lxcfs #2015-11-10/cgns) we can start a container in a full
proper cgroup namespace, avoiding either cgmanager or lxcfs cgroup bind mounts.

This is completely backward compatible and will be completely invisible
to any existing cgroup users (except for those running inside a cgroup
namespace and looking at /proc/pid/cgroup of tasks outside their
namespace.)

Changes from V5:
1. To get a root dentry for cgroup namespace mount, walk the path from the
   kernfs root dentry.

Changes from V4:
1. Move the FS_USERNS_MOUNT flag to last patch
2. Rebase onto cgroup/for-4.5
3. Don't allow non-init user namespaces to bind new subsystems when mounting.
4. Address feedback from Tejun (thanks).  Specifically, not addressed:
   . kernfs_obtain_root - walking dentry from kernfs root.
 (I think that's the only piece)
5. Dropped unused get_task_cgroup fn/patch.
6. Reworked kernfs_path_from_node_locked() to try to simplify the logic.
   It now finds a common ancestor, walks from the source to it, then back
   up to the target.
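
For readers who want the shape of that common-ancestor walk without reading the
kernfs code, here is a small standalone model.  It is only a sketch: struct
node is a made-up stand-in for struct kernfs_node, and returning NULL for nodes
under different roots is a choice made for this example.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct kernfs_node; only the parent link matters. */
struct node {
	struct node *parent;	/* NULL for the root */
	const char *name;
};

/* Number of parent links between n and its root. */
static size_t node_depth(const struct node *n)
{
	size_t d = 0;

	while (n->parent) {
		n = n->parent;
		d++;
	}
	return d;
}

/*
 * Deepest node that is an ancestor of (or equal to) both a and b.
 * Returns NULL if the two nodes sit under different roots.
 */
struct node *common_ancestor(struct node *a, struct node *b)
{
	size_t da = node_depth(a), db = node_depth(b);

	/* Bring the deeper node up to the level of the shallower one. */
	while (da > db) {
		a = a->parent;
		da--;
	}
	while (db > da) {
		b = b->parent;
		db--;
	}

	/* Walk both up in lockstep until they meet. */
	while (a != b) {
		a = a->parent;
		b = b->parent;
		if (!a || !b)
			return NULL;
	}
	return a;
}
```

The number of steps from the source up to the returned node gives the count of
"/.." components, and the names on the way back to the target give the rest of
the path.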

Changes from V3:
1. Rebased onto latest cgroup changes.  In particular switch to
   css_set_lock and ns_common.
2. Support all hierarchies.

Changes from V2:
1. Added documentation in Documentation/cgroups/namespace.txt
2. Fixed a bug that caused crash
3. Incorporated some other suggestions from last patchset:
   - removed use of threadgroup_lock() while creating new cgroupns
   - use task_lock() instead of rcu_read_lock() while accessing
 task->nsproxy
   - optimized setns() to own cgroupns
   - simplified code around sane-behavior mount option parsing
4. Restored ACKs from Serge Hallyn from v1 on few patches that have
   not changed since then.

Changes from V1:
1. No pinning of processes within cgroupns. Tasks can be freely moved
   across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
   apply as before.
2. Path in /proc/<pid>/cgroup is now always shown and is relative to
   cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
   of the reader and cgroup of <pid>.
3. setns() does not require the process to first move under target
   cgroupns-root.

Changes from RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup ' from inside a cgroupns now
   mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
   anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
   your cgroupns-root.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 3/7] cgroup: introduce cgroup namespaces

2015-12-07 Thread serge . hallyn

From: Aditya Kali 

Introduce the ability to create new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as cgroupns-root).
The main purpose of cgroup namespace is to virtualize the contents
of /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
 they will see a relative path from their cgroupns-root).
For a correctly set up container this enables container-tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking system level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Changelog: 2015-11-24
- move cgroup_namespace.c into cgroup.c (and .h)
- reformatting
- make get_cgroup_ns return void
- rename ns->root_cgrps to root_cset.

Signed-off-by: Aditya Kali 
Signed-off-by: Serge Hallyn 
---
 fs/proc/namespaces.c|3 +
 include/linux/cgroup.h  |   51 
 include/linux/nsproxy.h |2 +
 include/linux/proc_ns.h |4 ++
 kernel/cgroup.c |  151 +--
 kernel/fork.c   |2 +-
 kernel/nsproxy.c|   21 ++-
 7 files changed, 217 insertions(+), 17 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index f6e8354..bd61075 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -28,6 +28,9 @@ static const struct proc_ns_operations *ns_entries[] = {
&userns_operations,
 #endif
&mntns_operations,
+#ifdef CONFIG_CGROUPS
+   &cgroupns_operations,
+#endif
 };
 
 static const char *proc_ns_follow_link(struct dentry *dentry, void **cookie)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 2b3e2314..906f240 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -17,11 +17,36 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
 
 #include 
 
+struct cgroup_namespace {
+   atomic_t                count;
+   struct ns_common        ns;
+   struct user_namespace   *user_ns;
+   struct css_set          *root_cset;
+};
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline void get_cgroup_ns(struct cgroup_namespace *ns)
+{
+   if (ns)
+   atomic_inc(&ns->count);
+}
+
 #ifdef CONFIG_CGROUPS
 
+void free_cgroup_ns(struct cgroup_namespace *ns);
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+   struct user_namespace *user_ns,
+   struct cgroup_namespace *old_ns);
+
 /*
  * All weight knobs on the default hierarhcy should use the following min,
  * default and max values.  The default value is the logarithmic center of
@@ -105,6 +130,10 @@ void cgroup_free(struct task_struct *p);
 int cgroup_init_early(void);
 int cgroup_init(void);
 
+char * __must_check cgroup_path_ns(struct cgroup *cgrp, char *buf,
+  size_t buflen, struct cgroup_namespace *ns);
+char * __must_check cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen);
+
 /*
  * Iteration helpers and macros.
  */
@@ -272,10 +301,6 @@ void css_task_iter_end(struct css_task_iter *it);
;   \
else
 
-/*
- * Inline functions.
- */
-
 /**
  * css_get - obtain a reference on the specified css
  * @css: target css
@@ -509,12 +534,6 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
return kernfs_name(cgrp->kn, buf, buflen);
 }
 
-static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
- size_t buflen)
-{
-   return kernfs_path(cgrp->kn, buf, buflen);
-}
-
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
 {
pr_cont_kernfs_name(cgrp->kn);
@@ -527,6 +546,12 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
 
 #else /* !CONFIG_CGROUPS */
 
+static inline void free_cgroup_ns(struct cgroup_namespace *ns) { }
+static inline struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+   struct user_namespace *user_ns,
+   struct cgroup_namespace *old_ns)
+{ return old_ns; }
+
 struct cgroup_subsys_state;
 
 static inline void css_put(struct cgroup_subsys_state *css) {}
@@ -547,4 +572,10 @@ static inline int cgroup_init(void) { return 0; }
 
 #endif /* !CONFIG_CGROUPS */
 
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+   if (ns && atomic_dec_and_test(&ns->count))
+   free_cgroup_ns(ns);
+}
+
 #endif /* _LINUX_CGROUP_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/li

[PATCH 7/7] Add FS_USERNS_FLAG to cgroup fs

2015-12-07 Thread serge . hallyn

From: Serge Hallyn 

Allow root in a non-init user namespace to mount cgroupfs.  This should
now be safe, because

1. non-init-root cannot mount a previously unbound subsystem
2. the task doing the mount must be privileged with respect to the
   user namespace owning the cgroup namespace
3. the mounted subsystem will have its current cgroup as the root dentry.
   the permissions will be unchanged, so tasks will receive no new
   privilege over the cgroups which they did not have on the original
   mounts.

Signed-off-by: Serge Hallyn 
---
 kernel/cgroup.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 09cd718..5419ef7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2228,12 +2228,14 @@ static struct file_system_type cgroup_fs_type = {
.name = "cgroup",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
+   .fs_flags = FS_USERNS_MOUNT,
 };
 
 static struct file_system_type cgroup2_fs_type = {
.name = "cgroup2",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
+   .fs_flags = FS_USERNS_MOUNT,
 };
 
 char * __must_check
-- 
1.7.9.5
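
For anyone wanting to exercise this from userspace, a minimal sketch of the
mount call follows.  The helper name mount_cgroup2_at() is made up for this
example, and it assumes the caller is already root in a user namespace that
owns its cgroup namespace, as described in the commit message above.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: mount the "cgroup2" filesystem type at target.
 * Inside a non-init cgroup namespace the mount is expected to be rooted
 * at the cgroupns-root.  Returns 0 on success or a negative errno value.
 */
int mount_cgroup2_at(const char *target)
{
	if (mount("cgroup2", target, "cgroup2", 0, NULL) == -1)
		return -errno;
	return 0;
}
```

A caller without the required privilege should simply get a negative errno
back, so the helper is safe to try before deciding on a fallback.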


[PATCH 6/7] cgroup: Add documentation for cgroup namespaces

2015-12-07 Thread serge . hallyn

From: Aditya Kali 

Signed-off-by: Aditya Kali 
Signed-off-by: Serge Hallyn 
---
 Documentation/cgroups/namespace.txt |  142 +++
 1 file changed, 142 insertions(+)
 create mode 100644 Documentation/cgroups/namespace.txt

diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
new file mode 100644
index 0000000..a5b80e8
--- /dev/null
+++ b/Documentation/cgroups/namespace.txt
@@ -0,0 +1,142 @@
+   CGroup Namespaces
+
+CGroup Namespace provides a mechanism to virtualize the view of the
+/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
+clone() and unshare() syscalls to create a new cgroup namespace.
+The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
+output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
+at the time of creation of the cgroup namespace.
+
+Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
+path of the cgroup of a process. In a container setup (where a set of cgroups
+and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
+may leak potential system level information to the isolated processes.
+
+For Example:
+  $ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+The path '/batchjobs/container_id1' can generally be considered as system-data
+and it is desirable to not expose it to the isolated process.
+
+CGroup Namespaces can be used to restrict visibility of this path.
+For Example:
+  # Before creating cgroup namespace
+  $ ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
+  $ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
+  $ ~/unshare -c
+  [ns]$ ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
+  # From within new cgroupns, process sees that it is in the root cgroup
+  [ns]$ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+
+  # From global cgroupns:
+  $ cat /proc/<pid>/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+  # Unshare cgroupns along with userns and mountns
+  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
+  # sets up uid/gid map and execs /bin/bash
+  $ ~/unshare -c -u -m
+  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
+  # hierarchy.
+  [ns]$ mount -t cgroup cgroup /tmp/cgroup
+  [ns]$ ls -l /tmp/cgroup
+  total 0
+  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
+  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
+  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
+  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
+
+The cgroupns-root (/batchjobs/container_id1 in above example) becomes the
+filesystem root for the namespace specific cgroupfs mount.
+
+The virtualization of /proc/self/cgroup file combined with restricting
+the view of cgroup hierarchy by namespace-private cgroupfs mount
+should provide a completely isolated cgroup view inside the container.
+
+In its current form, the cgroup namespaces patchset provides the following
+behavior:
+
+(1) The 'cgroupns-root' for a cgroup namespace is the cgroup in which
+the process calling unshare is running.
+For ex. if a process in /batchjobs/container_id1 cgroup calls unshare,
+cgroup /batchjobs/container_id1 becomes the cgroupns-root.
+For the init_cgroup_ns, this is the real root ('/') cgroup
+(identified in code as cgrp_dfl_root.cgrp).
+
+(2) The cgroupns-root cgroup does not change even if the namespace
+creator process later moves to a different cgroup.
+$ ~/unshare -c # unshare cgroupns in some cgroup
+[ns]$ cat /proc/self/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+[ns]$ mkdir sub_cgrp_1
+[ns]$ echo 0 > sub_cgrp_1/cgroup.procs
+[ns]$ cat /proc/self/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
+(a) Processes running inside the cgroup namespace will be able to see
+cgroup paths (in /proc/self/cgroup) only inside their root cgroup
+[ns]$ sleep 10 &  # From within unshared cgroupns
+[1] 7353
+[ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
+[ns]$ cat /proc/7353/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(b) From global cgroupns, the real cgroup path will be visible:
+$ cat /proc/7353/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1/sub_cgrp_1
+
+(c) From a sibling cgroupns (cgroupns root-ed at a different cgroup), cgroup
+path relative to its own cgroupns-roo

[PATCH 5/7] cgroup: mount cgroupns-root when inside non-init cgroupns

2015-12-07 Thread serge . hallyn

From: Aditya Kali 

This patch enables cgroup mounting inside userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container-setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs api is added to lookup the
dentry for the cgroupns-root.

Changelog:
20151116 - Don't allow user namespaces to bind new subsystems
20151118 - postpone the FS_USERNS_MOUNT flag until the
   last patch, until we can convince ourselves it
   is safe.
20151207 - Switch to walking up the kernfs path from kn root.

Signed-off-by: Aditya Kali 
Acked-by: Serge E. Hallyn 
---
 fs/kernfs/mount.c  |   74 
 include/linux/kernfs.h |2 ++
 kernel/cgroup.c|   39 -
 3 files changed, 114 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 8eaf417..9219444 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "kernfs-internal.h"
 
@@ -62,6 +63,79 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
return NULL;
 }
 
+/*
+ * find the next ancestor in the path down to @child, where @parent was the
+ * parent whose child we want to find.
+ *
+ * Say the path is /a/b/c/d.  @child is d, @parent is NULL.  We return the root
+ * node.  If @parent is b, then we return the node for c.
+ * Passing in d as @parent is not ok.
+ */
+static struct kernfs_node *
+find_next_ancestor(struct kernfs_node *child, struct kernfs_node *parent)
+{
+   if (child == parent) {
+   pr_crit_once("BUG in find_next_ancestor: called with parent == child");
+   return NULL;
+   }
+
+   while (child->parent != parent) {
+   if (!child->parent)
+   return NULL;
+   child = child->parent;
+   }
+
+   return child;
+}
+
+/**
+ * kernfs_obtain_root - get a dentry for the given kernfs_node
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can be used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn)
+{
+   struct dentry *dentry;
+   struct kernfs_node *knparent = NULL;
+
+   BUG_ON(sb->s_op != &kernfs_sops);
+
+   dentry = dget(sb->s_root);
+   if (!kn->parent) // this is the root
+   return dentry;
+
+   knparent = find_next_ancestor(kn, NULL);
+   if (!knparent) {
+   pr_crit("BUG: find_next_ancestor for root dentry returned NULL\n");
+   return ERR_PTR(-EINVAL);
+   }
+
+   do {
+   struct dentry *dtmp;
+   struct kernfs_node *kntmp;
+
+   if (kn == knparent)
+   return dentry;
+   kntmp = find_next_ancestor(kn, knparent);
+   if (!kntmp) {
+   pr_crit("BUG: find_next_ancestor returned NULL for node\n");
+   return ERR_PTR(-EINVAL);
+   }
+   dtmp = lookup_one_len(kntmp->name, dentry, strlen(kntmp->name));
+   dput(dentry);
+   if (IS_ERR(dtmp))
+   return dtmp;
+   knparent = kntmp;
+   dentry = dtmp;
+   } while (1);
+
+   // notreached
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index d025ebd..1903777 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -284,6 +284,8 @@ struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
 struct inode *kernfs_get_inode(struct super_block *sb, struct kernfs_node *kn);
 
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
   unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index a5ab74d..09cd718 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2011,6 +2011,15 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int ret;
int i;
bool new_sb;
+   struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
+
+   get_cgroup_ns(ns);
+
+   /* Check if the caller has permission to mount. */
+   if (!n

[PATCH 2/7] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace

2015-12-07 Thread serge . hallyn

From: Aditya Kali 

CLONE_NEWCGROUP will be used to create new cgroup namespace.

Signed-off-by: Aditya Kali 
Acked-by: Serge Hallyn 
---
 include/uapi/linux/sched.h |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index cc89dde..5f0fe01 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED      0x00400000  /* Unused, ignored */
 #define CLONE_UNTRACED      0x00800000  /* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID  0x01000000  /* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_NEWCGROUP     0x02000000  /* New cgroup namespace */
 #define CLONE_NEWUTS        0x04000000  /* New utsname namespace */
 #define CLONE_NEWIPC        0x08000000  /* New ipc namespace */
 #define CLONE_NEWUSER       0x10000000  /* New user namespace */
-- 
1.7.9.5
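
To show how the new flag would be used from userspace, here is a minimal
sketch.  The helper name unshare_cgroupns() is made up for this example, the
fallback #define simply repeats the value from the patch above for libc headers
that do not carry it, and the raw syscall is used only so the sketch does not
depend on a libc wrapper.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Value proposed by this patch; newer libc headers may already define it. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

/*
 * Hypothetical helper: move the calling process into a new cgroup namespace.
 * Returns 0 on success or a negative errno value (for example -EPERM without
 * the required privilege, or -EINVAL on a kernel that rejects the flag).
 */
int unshare_cgroupns(void)
{
	if (syscall(SYS_unshare, CLONE_NEWCGROUP) == -1)
		return -errno;
	return 0;
}
```

After a successful call, reading /proc/self/cgroup from the same process should
show paths relative to the cgroup it was in at the time of the call.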


[PATCH 4/7] cgroup: cgroup namespace setns support

2015-12-07 Thread serge . hallyn

From: Aditya Kali 

setns on a cgroup namespace is allowed only if
task has CAP_SYS_ADMIN in its current user-namespace and
over the user-namespace associated with target cgroupns.
No implicit cgroup changes happen with attaching to another
cgroupns. It is expected that someone moves the attaching
process under the target cgroupns-root.

Signed-off-by: Aditya Kali 
Acked-by: Serge E. Hallyn 
---
 kernel/cgroup.c |   24 +---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 4fd07b5a..a5ab74d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5907,10 +5907,28 @@ err_out:
return ERR_PTR(err);
 }
 
-static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+static inline struct cgroup_namespace *to_cg_ns(struct ns_common *ns)
 {
-   pr_info("setns not supported for cgroup namespace");
-   return -EINVAL;
+   return container_of(ns, struct cgroup_namespace, ns);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, struct ns_common *ns)
+{
+   struct cgroup_namespace *cgroup_ns = to_cg_ns(ns);
+
+   if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+   !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+   return -EPERM;
+
+   /* Don't need to do anything if we are attaching to our own cgroupns. */
+   if (cgroup_ns == nsproxy->cgroup_ns)
+   return 0;
+
+   get_cgroup_ns(cgroup_ns);
+   put_cgroup_ns(nsproxy->cgroup_ns);
+   nsproxy->cgroup_ns = cgroup_ns;
+
+   return 0;
 }
 
 static struct ns_common *cgroupns_get(struct task_struct *task)
-- 
1.7.9.5
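
A userspace caller would reach cgroupns_install() roughly as sketched below.
The helper name join_cgroupns_of() is made up for this example; it assumes the
/proc/<pid>/ns/cgroup entry added earlier in this series and uses the raw
setns syscall so that no libc wrapper is needed.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Value from patch 2/7; newer libc headers may already define it. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

/*
 * Hypothetical helper: attach the caller to the cgroup namespace of pid.
 * Only the namespace is switched; the caller's cgroups are left untouched.
 * Returns 0 on success or a negative errno value.
 */
int join_cgroupns_of(long pid)
{
	char path[64];
	int fd, ret = 0;

	snprintf(path, sizeof(path), "/proc/%ld/ns/cgroup", pid);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	/* Passing CLONE_NEWCGROUP makes the kernel check the fd's ns type. */
	if (syscall(SYS_setns, fd, CLONE_NEWCGROUP) == -1)
		ret = -errno;
	close(fd);
	return ret;
}
```

Since no implicit cgroup change happens, a container manager would still have
to write the task into cgroup.procs under the target cgroupns-root itself.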


[PATCH 1/7] kernfs: Add API to generate relative kernfs path

2015-12-07 Thread serge . hallyn

From: Aditya Kali 

The new function kernfs_path_from_node() generates and returns kernfs
path of a given kernfs_node relative to a given parent kernfs_node.

Changelog 20151125:
  - Fully-wing multiline comments
  - Rework kernfs_path_from_node_locked() logic
  - Replace BUG_ONs with returning NULL
  - Use a const char* for /.. and precalculate its size
Changelog 20151130:
  - Update kernfs_path_from_node_locked comment

Signed-off-by: Aditya Kali 
Acked-by: Serge E. Hallyn 
---
 fs/kernfs/dir.c|  182 +---
 include/linux/kernfs.h |3 +
 2 files changed, 158 insertions(+), 27 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 91e0045..7cd4bb4 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,28 +44,134 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
- size_t buflen)
+/* kernfs_node_distance - compute depth from @from to @to */
+static size_t kernfs_node_distance(struct kernfs_node *from, struct kernfs_node *to)
 {
-   char *p = buf + buflen;
-   int len;
+   size_t depth = 0;
 
-   *--p = '\0';
+   BUG_ON(!to);
+   BUG_ON(!from);
 
-   do {
-   len = strlen(kn->name);
-   if (p - buf < len + 1) {
-   buf[0] = '\0';
-   p = NULL;
-   break;
-   }
-   p -= len;
-   memcpy(p, kn->name, len);
-   *--p = '/';
-   kn = kn->parent;
-   } while (kn && kn->parent);
+   while (to->parent && to != from) {
+   depth++;
+   to = to->parent;
+   }
+   return depth;
+}
 
-   return p;
+static struct kernfs_node *kernfs_common_ancestor(struct kernfs_node *a,
+   struct kernfs_node *b)
+{
+   size_t da = kernfs_node_distance(kernfs_root(a)->kn, a);
+   size_t db = kernfs_node_distance(kernfs_root(b)->kn, b);
+
+   if (da == 0)
+   return a;
+   if (db == 0)
+   return b;
+
+   while (da > db) {
+   a = a->parent;
+   da--;
+   }
+   while (db > da) {
+   b = b->parent;
+   db--;
+   }
+
+   /* worst case b and a will be the same at root */
+   while (b != a) {
+   b = b->parent;
+   a = a->parent;
+   }
+
+   return a;
+}
+
+/**
+ * kernfs_path_from_node_locked - find a pseudo-absolute path to @kn_to,
+ * where kn_from is treated as root of the path.
+ * @kn_from: kernfs node which should be treated as root for the path
+ * @kn_to: kernfs node to which path is needed
+ * @buf: buffer to copy the path into
+ * @buflen: size of @buf
+ *
+ * We need to handle a couple of scenarios here:
+ * [1] when @kn_from is an ancestor of @kn_to at some level
+ * kn_from: /n1/n2/n3
+ * kn_to:   /n1/n2/n3/n4/n5
+ * result:  /n4/n5
+ *
+ * [2] when @kn_from is on a different hierarchy and we need to find common
+ * ancestor between @kn_from and @kn_to.
+ * kn_from: /n1/n2/n3/n4
+ * kn_to:   /n1/n2/n5
+ * result:  /../../n5
+ * OR
+ * kn_from: /n1/n2/n3/n4/n5   [depth=5]
+ * kn_to:   /n1/n2/n3 [depth=3]
+ * result:  /../..
+ */
+static char *
+__must_check kernfs_path_from_node_locked(struct kernfs_node *kn_from,
+ struct kernfs_node *kn_to, char *buf,
+ size_t buflen)
+{
+   char *p = buf;
+   struct kernfs_node *kn, *common;
+   const char parent_str[] = "/..";
+   int i;
+   size_t depth_from, depth_to, len = 0, nlen = 0,
+  plen = sizeof(parent_str) - 1;
+
+   /* We need at least 2 bytes to write "/\0". */
+   if (buflen < 2)
+   return NULL;
+
+   if (!kn_from)
+   kn_from = kernfs_root(kn_to)->kn;
+
+   if (kn_from == kn_to) {
+   *p = '/';
+   *(++p) = '\0';
+   return buf;
+   }
+
+   common = kernfs_common_ancestor(kn_from, kn_to);
+   if (!common) {
+   WARN_ONCE(1, "%s: kn_from and kn_to on different roots\n",
+   __func__);
+   return NULL;
+   }
+
+   depth_to = kernfs_node_distance(common, kn_to);
+   depth_from = kernfs_node_distance(common, kn_from);
+
+   for (i = 0; i < depth_from; i++) {
+   if (len + plen + 1 > buflen)
+   return NULL;
+   strcpy(p, parent_str);
+   p += plen;
+   len += plen;
+   }
+
+   /* Calculate how many bytes we need for the rest */
+   for (kn = kn_to; kn != common; kn = kn->parent)
+   nlen += strlen(kn->name) + 1;
+
+   if (len + nlen + 1 > buflen)
+   return NULL

Re: [PATCH 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns

2015-12-07 Thread Serge Hallyn

Quoting Tejun Heo (t...@kernel.org):
> Hello, Serge.
> 
> On Thu, Dec 03, 2015 at 04:47:06PM -0600, Serge E. Hallyn wrote:
> ...
> > +   dentry = dget(sb->s_root);
> > +   if (!kn->parent) // this is the root
> > +   return dentry;
> > +
> > +   knparent = find_kn_ancestor_below(kn, NULL);
> > +   BUG_ON(!knparent);
> 
> Doing WARN_ON() and returning failure is better, I think.  Failing ns
> mount is an okay failure mode and a lot better than crashing the
> system.

Ok - this shouldn't be user-triggerable, so if it happens it really
is a bug in our code, but I'll change it,

> Also, how about find_next_ancestor() for the name of the
> function?

Yeah it's static anyway :)

will change, squash, and resend the set.

> > +   do {
> > +   struct dentry *dtmp;
> > +   struct kernfs_node *kntmp;
> > +
> > +   if (kn == knparent)
> > +   return dentry;
> > +   kntmp = find_kn_ancestor_below(kn, knparent);
> > +   BUG_ON(!kntmp);
> > +   dtmp = lookup_one_len(kntmp->name, dentry, strlen(kntmp->name));
> > +   dput(dentry);
> > +   if (IS_ERR(dtmp))
> > +   return dtmp;
> > +   knparent = kntmp;
> > +   dentry = dtmp;
> > +   } while (1);
> 
> Other than the nitpicks, looks good to me.
> 
> Thanks.
> 
> -- 
> tejun

[PATCH 2/7] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace

2015-11-27 Thread serge . hallyn

From: Aditya Kali 

CLONE_NEWCGROUP will be used to create new cgroup namespace.

Signed-off-by: Aditya Kali 
Acked-by: Serge Hallyn 
---
 include/uapi/linux/sched.h |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index cc89dde..5f0fe01 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED      0x00400000  /* Unused, ignored */
 #define CLONE_UNTRACED      0x00800000  /* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID  0x01000000  /* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_NEWCGROUP     0x02000000  /* New cgroup namespace */
 #define CLONE_NEWUTS        0x04000000  /* New utsname namespace */
 #define CLONE_NEWIPC        0x08000000  /* New ipc namespace */
 #define CLONE_NEWUSER       0x10000000  /* New user namespace */
-- 
1.7.9.5


[PATCH 3/7] cgroup: introduce cgroup namespaces

2015-11-27 Thread serge . hallyn

From: Aditya Kali 

Introduce the ability to create new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as cgroupns-root).
The main purpose of cgroup namespace is to virtualize the contents
of /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
 they will see a relative path from their cgroupns-root).
For a correctly set up container this enables container-tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking system level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Changelog: 2015-11-24
- move cgroup_namespace.c into cgroup.c (and .h)
- reformatting
- make get_cgroup_ns return void
- rename ns->root_cgrps to root_cset.

Signed-off-by: Aditya Kali 
Signed-off-by: Serge Hallyn 
---
 fs/proc/namespaces.c|3 +
 include/linux/cgroup.h  |   51 
 include/linux/nsproxy.h |2 +
 include/linux/proc_ns.h |4 ++
 kernel/cgroup.c |  151 +--
 kernel/fork.c   |2 +-
 kernel/nsproxy.c|   21 ++-
 7 files changed, 217 insertions(+), 17 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index f6e8354..bd61075 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -28,6 +28,9 @@ static const struct proc_ns_operations *ns_entries[] = {
&userns_operations,
 #endif
&mntns_operations,
+#ifdef CONFIG_CGROUPS
+   &cgroupns_operations,
+#endif
 };
 
 static const char *proc_ns_follow_link(struct dentry *dentry, void **cookie)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index f640830..ea03c83 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -17,11 +17,36 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
 
 #include 
 
+struct cgroup_namespace {
+   atomic_t                count;
+   struct ns_common        ns;
+   struct user_namespace   *user_ns;
+   struct css_set  *root_cset;
+};
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline void get_cgroup_ns(struct cgroup_namespace *ns)
+{
+   if (ns)
+   atomic_inc(&ns->count);
+}
+
 #ifdef CONFIG_CGROUPS
 
+void free_cgroup_ns(struct cgroup_namespace *ns);
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+   struct user_namespace *user_ns,
+   struct cgroup_namespace *old_ns);
+
 /*
  * All weight knobs on the default hierarhcy should use the following min,
  * default and max values.  The default value is the logarithmic center of
@@ -108,6 +133,10 @@ void cgroup_free(struct task_struct *p);
 int cgroup_init_early(void);
 int cgroup_init(void);
 
+char * __must_check cgroup_path_ns(struct cgroup *cgrp, char *buf,
+  size_t buflen, struct cgroup_namespace *ns);
+char * __must_check cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen);
+
 /*
  * Iteration helpers and macros.
  */
@@ -264,10 +293,6 @@ void css_task_iter_end(struct css_task_iter *it);
;   \
else
 
-/*
- * Inline functions.
- */
-
 /**
  * css_get - obtain a reference on the specified css
  * @css: target css
@@ -501,12 +526,6 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
return kernfs_name(cgrp->kn, buf, buflen);
 }
 
-static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
- size_t buflen)
-{
-   return kernfs_path(cgrp->kn, buf, buflen);
-}
-
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
 {
pr_cont_kernfs_name(cgrp->kn);
@@ -519,6 +538,12 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
 
 #else /* !CONFIG_CGROUPS */
 
+static inline void free_cgroup_ns(struct cgroup_namespace *ns) { }
+static inline struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+   struct user_namespace *user_ns,
+   struct cgroup_namespace *old_ns)
+{ return old_ns; }
+
 struct cgroup_subsys_state;
 
 static inline void css_put(struct cgroup_subsys_state *css) {}
@@ -543,4 +568,10 @@ static inline int cgroup_init(void) { return 0; }
 
 #endif /* !CONFIG_CGROUPS */
 
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+   if (ns && atomic_dec_and_test(&ns->count))
+   free_cgroup_ns(ns);
+}
+
 #endif /* _LINUX_CGROUP_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/li

[PATCH 4/7] cgroup: cgroup namespace setns support

2015-11-27 Thread serge . hallyn

From: Aditya Kali 

setns on a cgroup namespace is allowed only if
task has CAP_SYS_ADMIN in its current user-namespace and
over the user-namespace associated with target cgroupns.
No implicit cgroup changes happen with attaching to another
cgroupns. It is expected that someone moves the attaching
process under the target cgroupns-root.
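
(Illustration only, not part of this patch.) The rule above can be tried out
in userspace with a minimal sketch, assuming a toy model in which the two
CAP_SYS_ADMIN checks are reduced to booleans and the reference count is a
plain int. All toy_* names are hypothetical. The point it shows is the
ordering: take the reference on the target before dropping the old one, and
change nothing else about the task.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct cgroup_namespace. */
struct toy_cgroup_ns {
	int count;             /* reference count */
	bool caller_has_admin; /* models CAP_SYS_ADMIN over the owning userns */
	bool freed;            /* set when the last reference is dropped */
};

/* Hypothetical stand-in for struct nsproxy. */
struct toy_nsproxy {
	struct toy_cgroup_ns *cgroup_ns;
};

static void toy_get_ns(struct toy_cgroup_ns *ns)
{
	ns->count++;
}

static void toy_put_ns(struct toy_cgroup_ns *ns)
{
	if (--ns->count == 0)
		ns->freed = true;
}

/*
 * Switch proxy to the target namespace. Both privilege checks must pass.
 * Attaching to the namespace we are already in is a no-op. The caller's
 * cgroup membership is not modelled because it is not touched.
 */
static int toy_cgroupns_install(struct toy_nsproxy *proxy,
				struct toy_cgroup_ns *target,
				bool admin_in_current_userns)
{
	if (!admin_in_current_userns || !target->caller_has_admin)
		return -EPERM;

	if (target == proxy->cgroup_ns)
		return 0;

	toy_get_ns(target);
	toy_put_ns(proxy->cgroup_ns);
	proxy->cgroup_ns = target;
	return 0;
}
```

A caller would start with a toy_nsproxy pointing at one namespace and call
toy_cgroupns_install() with another; on success the old namespace loses one
reference and the new one gains one.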

Signed-off-by: Aditya Kali 
Acked-by: Serge E. Hallyn 
---
 kernel/cgroup.c |   24 +---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index c570957..0afed6b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5892,10 +5892,28 @@ err_out:
return ERR_PTR(err);
 }
 
-static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+static inline struct cgroup_namespace *to_cg_ns(struct ns_common *ns)
 {
-   pr_info("setns not supported for cgroup namespace");
-   return -EINVAL;
+   return container_of(ns, struct cgroup_namespace, ns);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, struct ns_common *ns)
+{
+   struct cgroup_namespace *cgroup_ns = to_cg_ns(ns);
+
+   if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+   !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+   return -EPERM;
+
+   /* Don't need to do anything if we are attaching to our own cgroupns. */
+   if (cgroup_ns == nsproxy->cgroup_ns)
+   return 0;
+
+   get_cgroup_ns(cgroup_ns);
+   put_cgroup_ns(nsproxy->cgroup_ns);
+   nsproxy->cgroup_ns = cgroup_ns;
+
+   return 0;
 }
 
 static struct ns_common *cgroupns_get(struct task_struct *task)
-- 
1.7.9.5


[PATCH 6/7] cgroup: Add documentation for cgroup namespaces

2015-11-27 Thread serge . hallyn

From: Aditya Kali 

Signed-off-by: Aditya Kali 
Signed-off-by: Serge Hallyn 
---
 Documentation/cgroups/namespace.txt |  142 +++
 1 file changed, 142 insertions(+)
 create mode 100644 Documentation/cgroups/namespace.txt

diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
new file mode 100644
index 000..a5b80e8
--- /dev/null
+++ b/Documentation/cgroups/namespace.txt
@@ -0,0 +1,142 @@
+   CGroup Namespaces
+
+CGroup Namespace provides a mechanism to virtualize the view of the
+/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
+clone() and unshare() syscalls to create a new cgroup namespace.
+The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
+output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
+at the time of creation of the cgroup namespace.
+
+Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
+path of the cgroup of a process. In a container setup (where a set of cgroups
+and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
+may leak potential system level information to the isolated processes.
+
+For Example:
+  $ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+The path '/batchjobs/container_id1' can generally be considered as system-data
+and it's desirable to not expose it to the isolated process.
+
+CGroup Namespaces can be used to restrict visibility of this path.
+For Example:
+  # Before creating cgroup namespace
+  $ ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
+  $ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
+  $ ~/unshare -c
+  [ns]$ ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
+  # From within new cgroupns, process sees that it's in the root cgroup
+  [ns]$ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+
+  # From global cgroupns:
+  $ cat /proc/<pid>/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+  # Unshare cgroupns along with userns and mountns
+  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
+  # sets up uid/gid map and execs /bin/bash
+  $ ~/unshare -c -u -m
+  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
+  # hierarchy.
+  [ns]$ mount -t cgroup cgroup /tmp/cgroup
+  [ns]$ ls -l /tmp/cgroup
+  total 0
+  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
+  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
+  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
+  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
+
+The cgroupns-root (/batchjobs/container_id1 in above example) becomes the
+filesystem root for the namespace specific cgroupfs mount.
+
+The virtualization of /proc/self/cgroup file combined with restricting
+the view of cgroup hierarchy by namespace-private cgroupfs mount
+should provide a completely isolated cgroup view inside the container.
+
+In its current form, the cgroup namespaces patchset provides the following
+behavior:
+
+(1) The 'cgroupns-root' for a cgroup namespace is the cgroup in which
+the process calling unshare is running.
+For ex. if a process in /batchjobs/container_id1 cgroup calls unshare,
+cgroup /batchjobs/container_id1 becomes the cgroupns-root.
+For the init_cgroup_ns, this is the real root ('/') cgroup
+(identified in code as cgrp_dfl_root.cgrp).
+
+(2) The cgroupns-root cgroup does not change even if the namespace
+creator process later moves to a different cgroup.
+$ ~/unshare -c # unshare cgroupns in some cgroup
+[ns]$ cat /proc/self/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+[ns]$ mkdir sub_cgrp_1
+[ns]$ echo 0 > sub_cgrp_1/cgroup.procs
+[ns]$ cat /proc/self/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
+(a) Processes running inside the cgroup namespace will be able to see
+cgroup paths (in /proc/self/cgroup) only inside their root cgroup
+[ns]$ sleep 10 &  # From within unshared cgroupns
+[1] 7353
+[ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
+[ns]$ cat /proc/7353/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(b) From global cgroupns, the real cgroup path will be visible:
+$ cat /proc/7353/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1/sub_cgrp_1
+
+(c) From a sibling cgroupns (cgroupns root-ed at a different cgroup), cgroup
+path relative to its own cgroupns-roo

CGroup Namespaces (v5)

2015-11-27 Thread serge . hallyn

Hi,

following is a revised set of the CGroup Namespace patchset which Aditya
Kali has previously sent.  The code can also be found in the cgroupns.v5
branch of

https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/

To summarize the semantics:

1. CLONE_NEWCGROUP re-uses 0x0200, which was previously CLONE_STOPPED

2. unsharing a cgroup namespace makes all your current cgroups your new
cgroup root.

3. /proc/pid/cgroup always shows cgroup paths relative to the reader's
cgroup namespace root.  A task outside of your cgroup looks like

8:memory:/../../..

4. when a task mounts a cgroupfs, the cgroup which shows up as root depends
on the mounting task's cgroup namespace.

5. setns to a cgroup namespace switches your cgroup namespace but not
your cgroups.
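
(Illustration only, not part of the patchset.) To make point 3 concrete, here
is a minimal userspace sketch, assuming each /proc/pid/cgroup line has the
"hierarchy-ID:controllers:path" shape shown above. It splits one line into
its fields and counts the leading "/.." components, i.e. how far above the
reader's cgroup namespace root the task sits. The struct and function names
are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of /proc/pid/cgroup. */
struct cgroup_entry {
	int hierarchy_id;
	char controllers[128];
	char path[256];
};

/*
 * Parse a single "ID:controllers:path" line (a trailing newline is allowed).
 * Returns 0 on success, -1 if the line is malformed or a field is too long.
 */
static int parse_cgroup_line(const char *line, struct cgroup_entry *e)
{
	const char *c1 = strchr(line, ':');
	const char *c2 = c1 ? strchr(c1 + 1, ':') : NULL;
	size_t clen, plen;

	if (!c2 || sscanf(line, "%d", &e->hierarchy_id) != 1)
		return -1;

	clen = (size_t)(c2 - c1 - 1);
	plen = strcspn(c2 + 1, "\n");
	if (clen >= sizeof(e->controllers) || plen >= sizeof(e->path))
		return -1;

	memcpy(e->controllers, c1 + 1, clen);
	e->controllers[clen] = '\0';
	memcpy(e->path, c2 + 1, plen);
	e->path[plen] = '\0';
	return 0;
}

/* Count leading "/.." components: levels above the reader's namespace root. */
static int levels_above_root(const char *path)
{
	int n = 0;

	while (strncmp(path, "/..", 3) == 0 &&
	       (path[3] == '/' || path[3] == '\0')) {
		n++;
		path += 3;
	}
	return n;
}
```

A tool inside a container could feed each line read from /proc/pid/cgroup
through parse_cgroup_line() and treat a non-zero levels_above_root() as "this
task lives outside my cgroup namespace root".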

With this, using github.com/hallyn/lxc #2015-11-09/cgns (and
github.com/hallyn/lxcfs #2015-11-10/cgns) we can start a container in a full
proper cgroup namespace, avoiding either cgmanager or lxcfs cgroup bind mounts.

This is completely backward compatible and will be completely invisible
to any existing cgroup users (except for those running inside a cgroup
namespace and looking at /proc/pid/cgroup of tasks outside their
namespace.)

Changes from V4:
1. Move the FS_USERNS_MOUNT flag to last patch
2. Rebase onto cgroup/for-4.5
3. Don't allow non-init user namespaces to bind new subsystems when mounting.
4. Address feedback from Tejun (thanks).  Specifically, not addressed:
   . kernfs_obtain_root - walking dentry from kernfs root.
 (I think that's the only piece)
5. Dropped unused get_task_cgroup fn/patch.
6. Reworked kernfs_path_from_node_locked() to try to simplify the logic.
   It now finds a common ancestor, walks from the source to it, then back
   up to the target.

Changes from V3:
1. Rebased onto latest cgroup changes.  In particular switch to
   css_set_lock and ns_common.
2. Support all hierarchies.

Changes from V2:
1. Added documentation in Documentation/cgroups/namespace.txt
2. Fixed a bug that caused crash
3. Incorporated some other suggestions from last patchset:
   - removed use of threadgroup_lock() while creating new cgroupns
   - use task_lock() instead of rcu_read_lock() while accessing
 task->nsproxy
   - optimized setns() to own cgroupns
   - simplified code around sane-behavior mount option parsing
4. Restored ACKs from Serge Hallyn from v1 on few patches that have
   not changed since then.

Changes from V1:
1. No pinning of processes within cgroupns. Tasks can be freely moved
   across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
   apply as before.
2. Path in /proc/<pid>/cgroup is now always shown and is relative to
   cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
   of the reader and cgroup of <pid>.
3. setns() does not require the process to first move under target
   cgroupns-root.

Changes from RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup ' from inside a cgroupns now
   mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
   anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
   your cgroupns-root.


[PATCH 5/7] cgroup: mount cgroupns-root when inside non-init cgroupns

2015-11-27 Thread serge . hallyn

From: Aditya Kali 

This patch enables cgroup mounting inside userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container-setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs api is added to lookup the
dentry for the cgroupns-root.

Changelog:
20151116 - Don't allow user namespaces to bind new subsystems
20151118 - postpone the FS_USERNS_MOUNT flag until the
   last patch, until we can convince ourselves it
   is safe.

Signed-off-by: Aditya Kali 
Acked-by: Serge E. Hallyn 
---
 fs/kernfs/mount.c  |   50 
 include/linux/kernfs.h |2 ++
 kernel/cgroup.c|   39 -
 3 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 8eaf417..cc41fe1 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,56 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
return NULL;
 }
 
+/**
+ * kernfs_obtain_root - get a dentry for the given kernfs_node
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can be used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn)
+{
+   struct dentry *dentry;
+   struct inode *inode;
+
+   BUG_ON(sb->s_op != &kernfs_sops);
+
+   /* inode for the given kernfs_node should already exist. */
+   inode = kernfs_get_inode(sb, kn);
+   if (!inode) {
+   pr_debug("kernfs: could not get inode for '");
+   pr_cont_kernfs_path(kn);
+   pr_cont("'.\n");
+   return ERR_PTR(-EINVAL);
+   }
+
+   /* instantiate and link root dentry */
+   dentry = d_obtain_root(inode);
+   if (!dentry) {
+   pr_debug("kernfs: could not get dentry for '");
+   pr_cont_kernfs_path(kn);
+   pr_cont("'.\n");
+   return ERR_PTR(-ENOMEM);
+   }
+
+   /*
+* If this is a new dentry, set it up. We need kernfs_mutex because
+* this may be called by callers other than kernfs_fill_super.
+*/
+   mutex_lock(&kernfs_mutex);
+   if (!dentry->d_fsdata) {
+   kernfs_get(kn);
+   dentry->d_fsdata = kn;
+   } else {
+   WARN_ON(dentry->d_fsdata != kn);
+   }
+   mutex_unlock(&kernfs_mutex);
+
+   return dentry;
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index d025ebd..1903777 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -284,6 +284,8 @@ struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
 struct inode *kernfs_get_inode(struct super_block *sb, struct kernfs_node *kn);
 
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
   unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0afed6b..2f487a4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2005,6 +2005,15 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int ret;
int i;
bool new_sb;
+   struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
+
+   get_cgroup_ns(ns);
+
+   /* Check if the caller has permission to mount. */
+   if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+   put_cgroup_ns(ns);
+   return ERR_PTR(-EPERM);
+   }
 
/*
 * The first time anyone tries to mount a cgroup, enable the list
@@ -2121,6 +2130,11 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
goto out_unlock;
}
 
+   if (!opts.none && !capable(CAP_SYS_ADMIN)) {
+   ret = -EPERM;
+   goto out_unlock;
+   }
+
root = kzalloc(sizeof(*root), GFP_KERNEL);
if (!root) {
ret = -ENOMEM;
@@ -2139,12 +2153,34 @@ out_free:
kfree(opts.release_agent);
kfree(opts.name);
 
-   if (ret)
+   if (ret) {
+   put_cgroup_ns(ns);
return ERR_PTR(ret);
+   }
+
 out_mount:
dentry = kernfs_mount(fs_type, flags, root->kf_root,

[PATCH 1/7] kernfs: Add API to generate relative kernfs path

2015-11-27 Thread serge . hallyn

From: Aditya Kali 

The new function kernfs_path_from_node() generates and returns kernfs
path of a given kernfs_node relative to a given parent kernfs_node.

Changelog 20151125:
  - Fully-wing multiline comments
  - Rework kernfs_path_from_node_locked() logic
  - Replace BUG_ONs with returning NULL
  - Use a const char* for /.. and precalculate its size

Note - kernfs_path_from_node_locked from x to x will return '/'.  This
is precisely what we want for cgroup namespaces.  If in general '.' is
preferred, we could switch that and change it back for /proc/$$/cgroups.
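
(Illustration only, not the code in this patch.) The relative-path rule can
be checked in userspace with a minimal sketch, assuming a toy node type with
only a name and a parent pointer, and assuming both nodes live in the same
tree. struct node, node_depth(), common_ancestor() and rel_path() are
hypothetical names. It follows the same idea as described in the changelog:
find the common ancestor, emit one "/.." per level from the source up to it,
then append the names down to the target.

```c
#include <string.h>

/* Toy stand-in for struct kernfs_node: just a name and a parent link. */
struct node {
	const char *name;
	struct node *parent;
};

/* Number of parent links between n and the root of its tree. */
static size_t node_depth(const struct node *n)
{
	size_t d = 0;

	while (n->parent) {
		d++;
		n = n->parent;
	}
	return d;
}

/* Deepest node that is an ancestor of (or equal to) both a and b. */
static const struct node *common_ancestor(const struct node *a,
					  const struct node *b)
{
	size_t da = node_depth(a), db = node_depth(b);

	for (; da > db; da--)
		a = a->parent;
	for (; db > da; db--)
		b = b->parent;
	while (a != b) {
		a = a->parent;
		b = b->parent;
	}
	return a;
}

/*
 * Write the path of 'to' relative to 'from' into buf.
 * Returns buf, or NULL if buf is too small.
 */
static char *rel_path(const struct node *from, const struct node *to,
		      char *buf, size_t buflen)
{
	const struct node *common, *n;
	size_t len = 0, nlen = 0, pos;

	if (buflen < 2)
		return NULL;
	if (from == to) {
		strcpy(buf, "/");
		return buf;
	}

	common = common_ancestor(from, to);

	/* One "/.." for every level between 'from' and the common ancestor. */
	for (n = from; n != common; n = n->parent) {
		if (len + 3 + 1 > buflen)
			return NULL;
		memcpy(buf + len, "/..", 3);
		len += 3;
	}

	/* Bytes needed for "/name" of every level below the ancestor. */
	for (n = to; n != common; n = n->parent)
		nlen += strlen(n->name) + 1;
	if (len + nlen + 1 > buflen)
		return NULL;

	/* Fill the names in back to front, walking up from 'to'. */
	pos = len + nlen;
	buf[pos] = '\0';
	for (n = to; n != common; n = n->parent) {
		size_t l = strlen(n->name);

		pos -= l;
		memcpy(buf + pos, n->name, l);
		buf[--pos] = '/';
	}
	return buf;
}
```

With a chain root -> n1 -> n2 -> n3 -> n4 plus a second child n5 under n2,
rel_path(n4, n5, ...) walks two levels up and one down, and a node relative
to itself gives "/", as the note above says.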

Signed-off-by: Aditya Kali 
Acked-by: Serge E. Hallyn 
---
 fs/kernfs/dir.c|  181 
 include/linux/kernfs.h |3 +
 2 files changed, 157 insertions(+), 27 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 91e0045..d456896 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,28 +44,133 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
- size_t buflen)
+/* kernfs_node_distance - compute depth from @from to @to */
+static size_t kernfs_node_distance(struct kernfs_node *from, struct kernfs_node *to)
 {
-   char *p = buf + buflen;
-   int len;
+   size_t depth = 0;
 
-   *--p = '\0';
+   BUG_ON(!to);
+   BUG_ON(!from);
 
-   do {
-   len = strlen(kn->name);
-   if (p - buf < len + 1) {
-   buf[0] = '\0';
-   p = NULL;
-   break;
-   }
-   p -= len;
-   memcpy(p, kn->name, len);
-   *--p = '/';
-   kn = kn->parent;
-   } while (kn && kn->parent);
+   while (to->parent && to != from) {
+   depth++;
+   to = to->parent;
+   }
+   return depth;
+}
 
-   return p;
+static struct kernfs_node *kernfs_common_ancestor(struct kernfs_node *a,
+   struct kernfs_node *b)
+{
+   size_t da = kernfs_node_distance(kernfs_root(a)->kn, a);
+   size_t db = kernfs_node_distance(kernfs_root(b)->kn, b);
+
+   if (da == 0)
+   return a;
+   if (db == 0)
+   return b;
+
+   while (da > db) {
+   a = a->parent;
+   da--;
+   }
+   while (db > da) {
+   b = b->parent;
+   db--;
+   }
+
+   /* worst case b and a will be the same at root */
+   while (b != a) {
+   b = b->parent;
+   a = a->parent;
+   }
+
+   return a;
+}
+
+/**
+ * kernfs_path_from_node_locked - find a relative path from @kn_from to @kn_to
+ * @kn_from: reference node of the path
+ * @kn_to: kernfs node to which path is needed
+ * @buf: buffer to copy the path into
+ * @buflen: size of @buf
+ *
+ * We need to handle couple of scenarios here:
+ * [1] when @kn_from is an ancestor of @kn_to at some level
+ * kn_from: /n1/n2/n3
+ * kn_to:   /n1/n2/n3/n4/n5
+ * result:  /n4/n5
+ *
+ * [2] when @kn_from is on a different hierarchy and we need to find common
+ * ancestor between @kn_from and @kn_to.
+ * kn_from: /n1/n2/n3/n4
+ * kn_to:   /n1/n2/n5
+ * result:  /../../n5
+ * OR
+ * kn_from: /n1/n2/n3/n4/n5   [depth=5]
+ * kn_to:   /n1/n2/n3 [depth=3]
+ * result:  /../..
+ */
+static char *
+__must_check kernfs_path_from_node_locked(struct kernfs_node *kn_from,
+ struct kernfs_node *kn_to, char *buf,
+ size_t buflen)
+{
+   char *p = buf;
+   struct kernfs_node *kn, *common;
+   const char parent_str[] = "/..";
+   int i;
+   size_t depth_from, depth_to, len = 0, nlen = 0,
+  plen = sizeof(parent_str) - 1;
+
+   /* We need at least 2 bytes to write "/\0". */
+   if (buflen < 2)
+   return NULL;
+
+   if (!kn_from)
+   kn_from = kernfs_root(kn_to)->kn;
+
+   if (kn_from == kn_to) {
+   *p = '/';
+   *(++p) = '\0';
+   return buf;
+   }
+
+   common = kernfs_common_ancestor(kn_from, kn_to);
+   if (!common) {
+   WARN_ONCE(1, "%s: kn_from and kn_to on different roots\n",
+   __func__);
+   return NULL;
+   }
+
+   depth_to = kernfs_node_distance(common, kn_to);
+   depth_from = kernfs_node_distance(common, kn_from);
+
+   for (i = 0; i < depth_from; i++) {
+   if (len + plen + 1 > buflen)
+   return NULL;
+   strcpy(p, parent_str);
+   p += plen;
+   len += plen;
+   }
+
+   /* Calculate how many bytes we need for the rest */
+   for (kn = kn_to; kn != common; kn = kn->parent)
+   nlen += strlen(kn->nam

[PATCH 7/7] Add FS_USERNS_FLAG to cgroup fs

2015-11-27 Thread serge . hallyn

From: Serge Hallyn 

allowing root in a non-init user namespace to mount it.  This should
now be safe, because

1. non-init-root cannot mount a previously unbound subsystem
2. the task doing the mount must be privileged with respect to the
   user namespace owning the cgroup namespace
3. the mounted subsystem will have its current cgroup as the root dentry.
   the permissions will be unchanged, so tasks will receive no new
   privilege over the cgroups which they did not have on the original
   mounts.

Signed-off-by: Serge Hallyn 
---
 kernel/cgroup.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2f487a4..ff81303 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -,12 +,14 @@ static struct file_system_type cgroup_fs_type = {
.name = "cgroup",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
+   .fs_flags = FS_USERNS_MOUNT,
 };
 
 static struct file_system_type cgroup2_fs_type = {
.name = "cgroup2",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
+   .fs_flags = FS_USERNS_MOUNT,
 };
 
 char * __must_check
-- 
1.7.9.5


Re: [PATCH 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns

2015-11-25 Thread Serge Hallyn

Quoting Tejun Heo (t...@kernel.org):
> Hello, Serge.
> 
> On Wed, Nov 25, 2015 at 12:01:56AM -0600, Serge E. Hallyn wrote:
> > that was my goal with 
> > https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/commit/?h=cgroupns.v4&id=8eb75d2bb24df59e262f050dce567d2332adc5f3
> > (which was sent inline earlier in this thread in response to Eric)  Does
> > that look sufficient?
> 
> Hmmm... but that wouldn't work with non-root and user ns.  I think

Are you sure?  IIUC that code block is only hit when we didn't find
an already-mounted subsystem.

> what's necessary is ensuring that namespace scoped mount never creates
> a new hierarchy but always reuses an existing one.
> 
> Thanks.
> 
> -- 
> tejun
> ___
> Containers mailing list
> contain...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

Re: [PATCH v3 0/7] User namespace mount updates

2015-11-18 Thread Serge Hallyn

Quoting Theodore Ts'o (ty...@mit.edu):
> On Tue, Nov 17, 2015 at 12:34:44PM -0600, Seth Forshee wrote:
> > On Tue, Nov 17, 2015 at 05:55:06PM +, Al Viro wrote:
> > > On Tue, Nov 17, 2015 at 11:25:51AM -0600, Seth Forshee wrote:
> > > 
> > > > Shortly after that I plan to follow with support for ext4. I've been
> > > > fuzzing ext4 for a while now and it has held up well, and I'm currently
> > > > working on hand-crafted attacks. Ted has commented privately (to others,
> > > > not to me personally) that he will fix bugs for such attacks, though I
> > > > haven't seen any public comments to that effect.
> > > 
> > > _Static_ attacks, or change-image-under-mounted-fs attacks?
> > 
> > Right now only static attacks, change-image-under-mounted-fs attacks
> > will be next.
> 
> I will fix bugs about static attacks.  That is, it's interesting to me
> that a buggy file system (no matter how it is created), not cause the
> kernel to crash --- and privilege escalation attacks tend to be
> strongly related to those bugs where we're not doing strong enough
> checking.
> 
> Protecting against a malicious user which changes the image under the
> file system is a whole other kettle of fish.  I am not at all sure you
> can do this without completely sacrificing performance or making the
> code impossible to maintain.  So my comments do *not* extend to
> protecting against a malicious user who is changing the block device
> underneath the kernel.

Yup, thanks, Ted.  I think the only sane thing to do is work on making the
mounted files immutable.  Guarding against under-mounted-writes seems
crazy.  Well, actually it seems like a fascinating problem, and maybe
solvable without fs changes, but not in scope here.

> If you want to submit patches to make the kernel more robust against
> these attacks, I'm certainly willing to look at the patches.  But I'm
> certainly not guaranteeing that they will go in, and I'm certainly not
> promising to fix all vulnerabilities that you might find that are
> caused by a malicious block device.  Sorry, that's too much buying a
> pig in a poke
> 
>   - Ted
>   

Re: [RFC] namei: prevent sgid-hardlinks for unmapped gids

2015-11-03 Thread Serge Hallyn

Quoting Dirk Steinmetz (pub...@rsjtdrjgfuzkfg.com):
> In order to hardlink to a sgid-executable, it is sufficient to be the
> file's owner. When hardlinking within an unprivileged user namespace, the
> users of that namespace could thus use hardlinks to pin setgid binaries
> owned by themselves (or any mapped uid, with CAP_FOWNER) and a gid outside
> of the namespace. This is a possible security risk.
> 
> This change prevents hardlinking of sgid-executables within user
> namespaces, if the file is not owned by a mapped gid.
> 
> Signed-off-by: Dirk Steinmetz 

Hey,

Hoping this gets a close review by Kees, but this looks good to me, thanks!

Acked-by: Serge E. Hallyn 

> ---
> 
> MISSING: Documentation/sysctl/fs.txt not updated, as this patch is
> intended for discussion.
> 
> If there are no further misunderstandings on my side, this patch is what
> Serge and I agree on (modulo my not-that-much-linux-kernel-experience
> codestyle, feel free to suggest improvements!).
> 
> The new condition for sgid-executables is equivalent to
> > inode_owner_or_capable(inode) && kgid_has_mapping(ns, inode->i_gid)
> which, as recommended by Serge, does not change the behaviour for the init
> namespace. It fixes the problem of pinning parent namespace's gids.
> 
> However, I think the "same" security issue is also valid within any
> namespace, for regular users pinning other gids within the same namespace.
> I already presented an example for that in a previous mail:
> - A file has the setgid and user/group executable bits set, and is owned
>   by user:group.
> - The user 'user' is not in the group 'group', and does not have any
>   capabilities.
> - The user 'user' hardlinks the file. The permission check will succeed,
>   as the user is the owner of the file.
> - The file is replaced with a newer version (for example fixing a security
>   issue)
> - Now user can still use the hardlink-pinned version to execute the file
>   as 'user:group' (and for example exploit the security issue).
> 
> To prevent that, the condition would need to be changed to something like
> inode_group_or_capable, resembling inode_owner_or_capable, but checking
> that the caller is in the group the inode belongs to or has some
> capability (for consistency with former behaviour, CAP_FOWNER? for
> consistency with the documentation, CAP_FSETID?). However, this would
> change userland behaviour outside of userns. Thus my main question:
> Is the scenario above bad enough to change userland behaviour?
> 
> I'd appreciate your comments.
> 
> - Dirk
> 
> 
> Diffstat:
>  fs/namei.c | 47 ++-
>  1 file changed, 38 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 29fc6a6..9c6c2e2 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -913,18 +913,19 @@ static inline int may_follow_link(struct nameidata *nd)
>  }
>  
>  /**
> - * safe_hardlink_source - Check for safe hardlink conditions
> + * safe_hardlink_source_uid - Check for safe hardlink conditions not dependent
> + * on the inode's group. These conditions may be overridden by inode ownership
> + * or CAP_FOWNER with respect to the inode's uid
>   * @inode: the source inode to hardlink from
>   *
>   * Return false if at least one of the following conditions:
>   *- inode is not a regular file
>   *- inode is setuid
> - *- inode is setgid and group-exec
>   *- access failure for read and write
>   *
>   * Otherwise returns true.
>   */
> -static bool safe_hardlink_source(struct inode *inode)
> +static bool safe_hardlink_source_uid(struct inode *inode)
>  {
>   umode_t mode = inode->i_mode;
>  
> @@ -936,10 +937,6 @@ static bool safe_hardlink_source(struct inode *inode)
>   if (mode & S_ISUID)
>   return false;
>  
> - /* Executable setgid files should not get pinned to the filesystem. */
> - if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP))
> - return false;
> -
>   /* Hardlinking to unreadable or unwritable sources is dangerous. */
>   if (inode_permission(inode, MAY_READ | MAY_WRITE))
>   return false;
> @@ -948,30 +945,62 @@ static bool safe_hardlink_source(struct inode *inode)
>  }
>  
>  /**
> + * safe_hardlink_source_gid - Check for safe hardlink conditions dependent
> + * on the inode's group. These conditions may be overridden by inode ownership
> + * or CAP_FOWNER with respect to the inode's gid
> + * @inode: the source inode to hardlink from
> + *
> + * Return false if inode is setgid and group-exec
> + *
> + * Otherwise returns true.
> + */
> +static bool safe_hardlink_source_gid(struct inode *inode)
> +{
> + umode_t mode = inode->i_mode;
> +
> + /* Executable setgid files should not get pinned to the filesystem. */
> + if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP))
> + return false;
> +
> + return true;
> +}
> +
> +/**
>   * may_linkat - Check permissions for creating a hardlink
>   * @link: th

Re: [PATCH] namei: permit linking with CAP_FOWNER in userns

2015-11-02 Thread Serge Hallyn

Quoting Dirk Steinmetz (pub...@rsjtdrjgfuzkfg.com):
> On Wed, 28 Oct 2015 17:33:10 +0000, Serge Hallyn wrote:
> > Quoting Dirk Steinmetz (pub...@rsjtdrjgfuzkfg.com):
> > > On Tue, 27 Oct 2015 20:28:02 +0000, Serge Hallyn wrote:
> > > > Quoting Dirk Steinmetz (pub...@rsjtdrjgfuzkfg.com):
> > > > > On Tue, 27 Oct 2015 09:33:44 -0500, Seth Forshee wrote:
> > > > > > I did want to point out what seems to be an inconsistency in how
> > > > > > capabilities in user namespaces are handled with respect to inodes. When
> > > > > > I started looking at this my initial thought was to replace
> > > > > > capable(CAP_FOWNER) with capable_wrt_inode_uidgid(inode, CAP_FOWNER). On
> > > > > > the face of it this should be equivalent to what's done here, but it
> > > > > > turns out that capable_wrt_inode_uidgid requires that the inode's uid
> > > > > > and gid are both mapped into the namespace whereas
> > > > > > inode_owner_or_capable only requires the uid be mapped. I'm not sure how
> > > > > > significant that is, but it seems a bit odd.
> > > > > 
> > > > > I agree that this seems odd. I've chosen inode_owner_or_capable over
> > > > > capable_wrt_inode_uidgid(inode, CAP_FOWNER) as it seemed consistent:
> > > > > a privileged user (with CAP_SETUID) can impersonate the owner UID and thus
> > > > > bypass the check completely; this also matches the documented behavior of
> > > > > CAP_FOWNER: "Bypass permission checks on operations that normally require
> > > > > the filesystem UID of the process to match the UID of the file".
> > > > >
> > > > > However, thinking about it I get the feeling that checking the gid seems
> > > > > reasonable as well. This is, however, independently of user namespaces.
> > > > > Consider the following scenario in any namespace, including the init one:
> > > > > - A file has the setgid and user/group executable bits set, and is owned
> > > > >   by user:group.
> > > > > - The user 'user' is not in the group 'group', and does not have any
> > > > >   capabilities.
> > > > > - The user 'user' hardlinks the file. The permission check will succeed,
> > > > >   as the user is the owner of the file.
> > > > > - The file is replaced with a newer version (for example fixing a security
> > > > >   issue)
> > > > > - Now user can still use the hardlink-pinned version to execute the file
> > > > >   as 'user:group' (and for example exploit the security issue).
> > > > > I would have expected the user to not be able to hardlink, as he lacks
> > > > > CAP_FSETID, and thus is not allowed to chmod, change or move the file
> > > > > without losing the setgid bit. So it is impossible for him to make a
> > > > > non-hardlink copy with the setgid bit set -- why should he be able to make a
> > > > > hardlinked one?
> > > > 
> > > > Yeah, this sounds sensible.  It allows a user without access to 'disk',
> > > > for instance, to become that group.
> > > > 
> > > > > It seems to me as if may_linkat would additionally require a check
> > > > > verifying that either
> > > > > - not both setgid and group executable bit set
> > > > > - fsgid == owner gid
> > > > > - capable_wrt_inode_uidgid(CAP_FSETID) -- or CAP_FOWNER, depending on
> > > > >   whether to adapt chmod's behavior or keeping everything hardlink-
> > > > >   related in CAP_FOWNER; I don't feel qualified enough to pick ;)
> > > > 
> > > > In particular just changing it is not ok since people who are using file
> > > > capabilities to grant what they currently need would be stuck with a
> > > > mysterious new failure.
> > > 
> > > Is there any use case (besides exploiting hardlinks with malicious intent)
> > > that would be broken w

Re: [PATCH] namei: permit linking with CAP_FOWNER in userns

2015-10-28 Thread Serge Hallyn

Quoting Dirk Steinmetz (pub...@rsjtdrjgfuzkfg.com):
> On Tue, 27 Oct 2015 20:28:02 +0000, Serge Hallyn wrote:
> > Quoting Dirk Steinmetz (pub...@rsjtdrjgfuzkfg.com):
> > > On Tue, 27 Oct 2015 09:33:44 -0500, Seth Forshee wrote:
> > > > I did want to point out what seems to be an inconsistency in how
> > > > capabilities in user namespaces are handled with respect to inodes. When
> > > > I started looking at this my initial thought was to replace
> > > > capable(CAP_FOWNER) with capable_wrt_inode_uidgid(inode, CAP_FOWNER). On
> > > > the face of it this should be equivalent to what's done here, but it
> > > > turns out that capable_wrt_inode_uidgid requires that the inode's uid
> > > > and gid are both mapped into the namespace whereas
> > > > inode_owner_or_capable only requires the uid be mapped. I'm not sure how
> > > > significant that is, but it seems a bit odd.
> > > 
> > > I agree that this seems odd. I've chosen inode_owner_or_capable over
> > > capable_wrt_inode_uidgid(inode, CAP_FOWNER) as it seemed consistent:
> > > a privileged user (with CAP_SETUID) can impersonate the owner UID and thus
> > > bypass the check completely; this also matches the documented behavior of
> > > CAP_FOWNER: "Bypass permission checks on operations that normally require
> > > the filesystem UID of the process to match the UID of the file".
> > > 
> > > However, thinking about it I get the feeling that checking the gid seems
> > > reasonable as well. This is, however, independently of user namespaces.
> > > Consider the following scenario in any namespace, including the init one:
> > > - A file has the setgid and user/group executable bits set, and is owned
> > >   by user:group.
> > > - The user 'user' is not in the group 'group', and does not have any
> > >   capabilities.
> > > - The user 'user' hardlinks the file. The permission check will succeed,
> > >   as the user is the owner of the file.
> > > - The file is replaced with a newer version (for example fixing a security
> > >   issue)
> > > - Now user can still use the hardlink-pinned version to execute the file
> > >   as 'user:group' (and for example exploit the security issue).
> > > I would have expected the user to not be able to hardlink, as he lacks
> > > CAP_FSETID, and thus is not allowed to chmod, change or move the file
> > > > without losing the setgid bit. So it is impossible for him to make a
> > > > non-hardlink copy with the setgid bit set -- why should he be able to make a
> > > hardlinked one?
> > 
> > Yeah, this sounds sensible.  It allows a user without access to 'disk',
> > for instance, to become that group.
> > 
> > > It seems to me as if may_linkat would additionally require a check
> > > verifying that either
> > > - not both setgid and group executable bit set
> > > - fsgid == owner gid
> > > - capable_wrt_inode_uidgid(CAP_FSETID) -- or CAP_FOWNER, depending on
> > >   whether to adapt chmod's behavior or keeping everything hardlink-
> > >   related in CAP_FOWNER; I don't feel qualified enough to pick ;)
> > 
> > In particular just changing it is not ok since people who are using file
> > capabilities to grant what they currently need would be stuck with a
> > mysterious new failure.
> 
> Is there any use case (besides exploiting hardlinks with malicious intent)
> that would be broken when changing this? There are some (imho) rather
> unlikely conditions to be met in order to observe changed behavior:

The simplest example would be if I wanted to run a very quick program to
just add the hard link.  Let's say the link /usr/sbin/uuidd were owned
by root:disk and setuid and setgid.  The proposed change would force me
to bind in both the root user and disk group, whereas without it I can
just bind in only the root user.

We've already dealt with such regressions and iirc agreed that they were
worthwhile.

> - a user owns an executable setgid-file belonging to a group he is not in
> - the user does not have CAP_FSETID (or CAP_FOWNER, depending on which one
>   is chosen to be required)
> - the user is for some legitimate reason supposed to hardlink the file
> If these conditions are not met in practice, the change would not break
> anything. In that case, it would be imho better to not provide
> backward-compatibility to reduce complexity in these checks. Else, I'd
> propose adding a new possible value '2' for
> /proc/sys/f

Re: [PATCH] namei: permit linking with CAP_FOWNER in userns

2015-10-27 Thread Serge Hallyn

Quoting Dirk Steinmetz (pub...@rsjtdrjgfuzkfg.com):
> On Tue, 27 Oct 2015 09:33:44 -0500, Seth Forshee wrote:
> > On Tue, Oct 20, 2015 at 04:09:19PM +0200, Dirk Steinmetz wrote:
> > > Attempting to hardlink to an unsafe file (e.g. a setuid binary) from
> > > within an unprivileged user namespace fails, even if CAP_FOWNER is held
> > > within the namespace. This may cause various failures, such as a gentoo
> > > installation within a lxc container failing to build and install specific
> > > packages.
> > > 
> > > This change permits hardlinking of files owned by mapped uids, if
> > > > CAP_FOWNER is held for that namespace. Furthermore, it improves consistency
> > > by using the existing inode_owner_or_capable(), which is aware of
> > > namespaced capabilities as of 23adbe12ef7d3 ("fs,userns: Change
> > > inode_capable to capable_wrt_inode_uidgid").
> > > 
> > > Signed-off-by: Dirk Steinmetz 
> > 
> > Tested-by: Seth Forshee 
> > 
> > This is hitting us in Ubuntu during some dpkg upgrades in containers.
> > When upgrading a file dpkg creates a hard link to the old file to back
> > it up before overwriting it. When packages upgrade suid files owned by a
> > non-root user the link isn't permitted, and the package upgrade fails.
> > This patch fixes our problem.
> > 
> > I did want to point out what seems to be an inconsistency in how
> > capabilities in user namespaces are handled with respect to inodes. When
> > I started looking at this my initial thought was to replace
> > capable(CAP_FOWNER) with capable_wrt_inode_uidgid(inode, CAP_FOWNER). On
> > the face of it this should be equivalent to what's done here, but it
> > turns out that capable_wrt_inode_uidgid requires that the inode's uid
> > and gid are both mapped into the namespace whereas
> > inode_owner_or_capable only requires the uid be mapped. I'm not sure how
> > significant that is, but it seems a bit odd.
> 
> I agree that this seems odd. I've chosen inode_owner_or_capable over
> capable_wrt_inode_uidgid(inode, CAP_FOWNER) as it seemed consistent:
> a privileged user (with CAP_SETUID) can impersonate the owner UID and thus
> bypass the check completely; this also matches the documented behavior of
> CAP_FOWNER: "Bypass permission checks on operations that normally require
> the filesystem UID of the process to match the UID of the file".
> 
> However, thinking about it I get the feeling that checking the gid seems
> reasonable as well. This is, however, independently of user namespaces.
> Consider the following scenario in any namespace, including the init one:
> - A file has the setgid and user/group executable bits set, and is owned
>   by user:group.
> - The user 'user' is not in the group 'group', and does not have any
>   capabilities.
> - The user 'user' hardlinks the file. The permission check will succeed,
>   as the user is the owner of the file.
> - The file is replaced with a newer version (for example fixing a security
>   issue)
> - Now user can still use the hardlink-pinned version to execute the file
>   as 'user:group' (and for example exploit the security issue).
> I would have expected the user to not be able to hardlink, as he lacks
> CAP_FSETID, and thus is not allowed to chmod, change or move the file
> without losing the setgid bit. So it is impossible for him to make a
> non-hardlink copy with the setgid bit set -- why should he be able to make a
> hardlinked one?

Yeah, this sounds sensible.  It allows a user without access to 'disk',
for instance, to become that group.

> It seems to me as if may_linkat would additionally require a check
> verifying that either
> - not both setgid and group executable bit set
> - fsgid == owner gid
> - capable_wrt_inode_uidgid(CAP_FSETID) -- or CAP_FOWNER, depending on
>   whether to adapt chmod's behavior or keeping everything hardlink-
>   related in CAP_FOWNER; I don't feel qualified enough to pick ;)

In particular just changing it is not ok since people who are using file
capabilities to grant what they currently need would be stuck with a
mysterious new failure.

> This would change documented behavior (at least man proc.5's description
> of /proc/sys/fs/protected_hardlinks), and I'd consider it a separate
> issue, if any (as I'm unsure how realistic that scenario is). I'd
> appreciate comments on that.
> 
> For other situations than setgid-executable files I do not see issues with
> not checking the group id's mapping, as linking would be permitted without
> privileges outside of the user namespace (disregarding namespace-internal
> setuid bits).
> 
> Independently of that, it might be reasonable to consider switching
> inode_owner_or_capable towards checking the gid as well and define
> something along "uid checks in user namespaces with uid/gid maps require
> the file's uid and gid to be mapped, else they will fail" for consistency.
> 
> Dirk
> 

Re: [PATCH RFC] pidns: introduce syscall getvpid

2015-09-15 Thread Serge Hallyn

Quoting Stéphane Graber (stgra...@ubuntu.com):
> On Tue, Sep 15, 2015 at 06:01:38PM +0300, Konstantin Khlebnikov wrote:
> > On 15.09.2015 17:27, Eric W. Biederman wrote:
> > >Konstantin Khlebnikov  writes:
> > >
> > >>pid_t getvpid(pid_t pid, pid_t source, pid_t target);
> > >>
> > >>This syscall converts pid from one pid-ns into pid in another pid-ns:
> > >>it takes @pid in namespace of @source task (zero for current) and
> > >>returns related pid in namespace of @target task (zero for current too).
> > >>If pid is unreachable from target pid-ns then it returns zero.
> > >
> > >This interface as presented is inherently racy.  It would be better
> > >if source and target were file descriptors referring to the namespaces
> > >you wish to translate between.
> > 
> > Yep, it's racy. As well as any operation with non-child pids.
> > With file descriptors for source/target result will be racy anyway.
> > 
> > >
> > >>Such conversion is required for interaction between processes from
> > >>different pid-namespaces. For example when system service talks with
> > >>client from isolated container via socket about task in container:
> > >
> > >Sockets are already supported.  At least the metadata of sockets is.
> > >
> > >Maybe we need this but I am not convinced of its utility.
> > >
> > >What are you trying to do that motivates this?
> > 
> > I'm working on hierarchical container management system which
> > allows to create and control nested sub-containers from containers
> > ( https://github.com/yandex/porto ). Main server works in host and
> > have to interact with all levels of nested namespaces. This syscall
> > makes some operations much easier: server must remember only pid in
> > host pid namespace and convert it into right vpid on demand.
> 
> Note that as Eric said earlier, sending a PID inside a ucred through a
> unix socket will have the pid translated.
> 
> So while your solution certainly should be faster, you can already achieve
> what you want today by doing:
> 
> == Translate PID in container to PID in host
>  - open a socket
>  - setns to container's pidns
>  - send ucred from that container containing the requested container PID
>  - host sees the host PID
> 
> == Translate PID on host to PID in container
>  - open a socket
>  - setns to container's pidns
>  - send ucred from the host containing the request host PID
>(send will fail if the host PID isn't part of that container)
>  - container sees the container PID

In addition, since commit e4bc332451 ("/proc/PID/status: show all sets of pid
according to ns") we now also have 'NSpid' etc. in /proc/$$/status.
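
A minimal sketch of reading that field, assuming the line has the form
"NSpid:" followed by whitespace-separated pids, one per nested pid namespace
that the task is a member of. parse_nspid_line() is a hypothetical helper
name, not an existing API.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Parse one line of /proc/<pid>/status. If it is the "NSpid:" line, store up
 * to `max` pid values in `out`, in the order they appear on the line, and
 * return how many were stored. Return 0 for any other line.
 */
static int parse_nspid_line(const char *line, long *out, int max)
{
    static const char key[] = "NSpid:";
    const char *p;
    int n = 0;

    if (strncmp(line, key, sizeof(key) - 1) != 0)
        return 0;

    p = line + sizeof(key) - 1;
    while (n < max) {
        char *end;
        long v = strtol(p, &end, 10);   /* skips leading tabs and spaces */

        if (end == p)                   /* no more numbers on the line */
            break;
        out[n++] = v;
        p = end;
    }
    return n;
}
```

A caller would feed it each line read with fgets() from /proc/<pid>/status
until it returns a nonzero count. A task in one nested pid namespace would be
expected to yield two values.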

-serge
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/7] Initial support for user namespace owned mounts

2015-07-30 Thread Serge Hallyn

Quoting Amir Goldstein (a...@cellrox.com):
> On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
>  wrote:
> >
> > On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
> > > > This is what I currently think you want for user ns mounts:
> > > >
> > > >  1. smk_root and smk_default are assigned the label of the backing
> > > > device.
> 
> Seth,
> 
> There were 2 main concerns discussed in this thread:
> 1. trusting LSM labels outside the namespace
> 2. trusting the content of the image file/loopdev
> 
> While your approach addresses the first concern, I suspect it may be placing
> an obstacle in the way of resolving the second concern.
> 
> A viable security policy to mitigate the second concern could be:
> - Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
> - Allow mount only of 'Loopback' images
> 
> This should allow the system as a whole to trust unprivileged mounts based on
> the trust of the entities that had raw access to the fs layout.

Just to be sure I understand right, you're looking for a way to let
the host admin trust that the kernel's superblock parsers aren't being
fed trash or an exploit?

> Alas, if you choose to propagate the backing dev label to contained files,
> they would all share the designated 'Loopback' label and render the policy
> above useless.
> 
> Any thoughts on how to reconcile this conflict?
> 
> Amir.
> 
> 
> > > >  2. s_root is assigned the transmute property.
> > > >  3. For existing files:
> > > > a. Files with the same label as the backing device are accessible.
> > > > b. Files with any other label are not accessible.
> > >
> > > That's right. Accept correct data, reject anything that's not right.
> > >
> > > > If this is right, there are a couple lingering questions in my mind.
> > > >
> > > > First, what happens with files created in directories with the same
> > > > label as the backing device but without the transmute property set? The
> > > > inode for the new file will initially be labeled with smk_of_current(),
> > > > but then during d_instantiate it will get smk_default and thus end up
> > > > with the label we want. So that seems okay.
> > >
> > > Yes.
> > >
> > > > The second is whether files with the SMACK64EXEC attribute is still a
> > > > problem. It seems it is, for files with the same label as the backing
> > > > store at least. I think we can simply skip the code that reads out this
> > > > xattr and sets smk_task for user ns mounts, or else skip assigning the
> > > > label to the new task in bprm_set_creds. The latter seems more
> > > > consistent with the approach you've suggested for dealing with labels
> > > > from disk.
> > >
> > > Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
> > > smack_d_instantiate for unprivileged mounts would do the trick.
> > >
> > > > So I guess all of that seems okay, though perhaps a bit restrictive
> > > > given that the user who mounted the filesystem already has full access
> > > > to the backing store.
> > >
> > > In truth, there is no reason to expect that the "user" who did the
> > > mount will ever have a Smack label that differs from the label of
> > > the backing store. If what we've got here seems restrictive, it's
> > > because you've got access from someone other than the "user".
> > >
> > > > Please let me know whether or not this matches up with what you are
> > > > thinking, then I can proceed with the implementation.
> > >
> > > My current mindset is that, if you're going to allow unprivileged
> > > mounts of user defined backing stores, this is as safe as we can
> > > make it.
> >
> > All right, I've got a patch which I think does this, and I've managed to
> > do some testing to confirm that it behaves like I expect. How does this
> > look?
> >
> > What's missing is getting the label from the block device inode; as
> > Stephen discovered the inode that I thought we could get the label from
> > turned out to be the wrong one. Afaict we would need a new hook in order
> > to do that, so for now I'm using the label of the process calling
> > mount.
> >
> > ---
> >
> > diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> > index a143328f75eb..8e631a66b03c 100644
> > --- a/security/smack/smack_lsm.c
> > +++ b/security/smack/smack_lsm.c
> > @@ -662,6 +662,8 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
> > skp = smk_of_current();
> > sp->smk_root = skp;
> > sp->smk_default = skp;
> > +   if (sb_in_userns(sb))
> > +   transmute = 1;
> > }
> > /*
> >  * Initialize the root inode.
> > @@ -1023,6 +1025,12 @@ static int smack_inode_permission(struct inode *inode, int mask)
> > if (mask == 0)
> > return 0;
> >
> > +   if (sb_in_userns(inode->i_sb)) {
> > +   struct superblock_smack *sbsp = inode->i_sb->s_security;
> > +   if

Re: [PATCH v4] seccomp: add ptrace options for suspend/resume

2015-06-10 Thread Serge Hallyn

Quoting Andy Lutomirski (l...@amacapital.net):
> On Wed, Jun 10, 2015 at 9:31 AM, Oleg Nesterov  wrote:
> > On 06/09, Andy Lutomirski wrote:
> >>
> >> On Tue, Jun 9, 2015 at 5:49 PM, Tycho Andersen
> >> >
> >> > @@ -556,6 +556,15 @@ static int ptrace_setoptions(struct task_struct *child, unsigned long data)
> >> > if (data & ~(unsigned long)PTRACE_O_MASK)
> >> > return -EINVAL;
> >> >
> >> > +   if (unlikely(data & PTRACE_O_SUSPEND_SECCOMP)) {
> >
> > Well, we should do this if
> >
> > (data & O_SUSPEND) && !(flags & O_SUSPEND)
> >
> > or at least if
> >
> > (data ^ flags) & O_SUSPEND
> >
> >
> >> > +   if (!config_enabled(CONFIG_CHECKPOINT_RESTORE) ||
> >> > +   !config_enabled(CONFIG_SECCOMP))
> >> > +   return -EINVAL;
> >> > +
> >> > +   if (!capable(CAP_SYS_ADMIN))
> >> > +   return -EPERM;
> >>
> >> I tend to think that we should also require that current not be using
> >> seccomp.  Otherwise, in principle, there's a seccomp bypass for
> >> privileged-but-seccomped programs.
> >
> > Andy, I simply can't understand why do we need any security check at all.
> >
> > OK, yes, in theory we can have a seccomped CAP_SYS_ADMIN process, seccomp
> > doesn't filter ptrace, you hack that process and force it to attach to
> > another CAP_SYS_ADMIN/seccomped process, etc, etc... Looks too paranoid
> > to me.
> 
> I've sometimes considered having privileged processes I write fork and
> seccomp their child.  Of course, if you're allowing ptrace through
> your seccomp filter, you open a giant can of worms, but I think we
> should take the more paranoid approach to start and relax it later as

I really do intend to look at your old proposed tree for improving that...
have only done a once-over so far, though.

> needed.  After all, for the intended use of this patch, stuff will
> break regardless of what we do if the ptracer is itself seccomped.
> 
> I could be convinced that if the ptracer is outside seccomp then we
> shouldn't need the CAP_SYS_ADMIN check.  That would at least make this
> work in a user namespace.
> 
> >> > @@ -590,6 +590,10 @@ void secure_computing_strict(int this_syscall)
> >> >  {
> >> > int mode = current->seccomp.mode;
> >> >
> >> > +   if (config_enabled(CONFIG_CHECKPOINT_RESTORE) &&
> >> > +   unlikely(current->ptrace & PT_SUSPEND_SECCOMP))
> >> > +   return;
> >> > +
> >> > if (mode == 0)
> >> > return;
> >> > else if (mode == SECCOMP_MODE_STRICT)
> >> > @@ -691,6 +695,10 @@ u32 seccomp_phase1(struct seccomp_data *sd)
> >> > int this_syscall = sd ? sd->nr :
> >> > syscall_get_nr(current, task_pt_regs(current));
> >> >
> >> > +   if (config_enabled(CONFIG_CHECKPOINT_RESTORE) &&
> >> > +   unlikely(current->ptrace & PT_SUSPEND_SECCOMP))
> >> > +   return SECCOMP_PHASE1_OK;
> >> > +
> >>
> >> If it's not hard, it might still be nice to try to fold this into
> >> mode.  This code is rather hot.  If it would be a mess, then don't
> >> worry about it for now.
> >
> > IMO, this would be a mess ;) At least compared to this simple patch.
> >
> > Suppose we add SECCOMP_MODE_SUSPENDED. Not only this adds the problems
> > with detach if the tracer dies.
> >
> > We need to change copy_seccomp(). And it is not clear what should we
> > do if the child is traced too.
> >
> > We need to change prctl_set_seccomp() paths.
> >
> > And even the "tracee->seccomp.mode = SECCOMP_MODE_SUSPENDED" code needs
> > some locking even if the tracee is stopped, we need to avoid the races
> > with SECCOMP_FILTER_FLAG_TSYNC from other threads.
> >
> 
> Agreed.  Let's hold off until this becomes a problem (if it ever does).
> 
> --Andy

Re: [PATCH v2 1/2] capabilities: Ambient capabilities

2015-05-23 Thread Serge Hallyn

Thanks very much, Andy.  Comments and ack below.

Quoting Andy Lutomirski (l...@kernel.org):
> Credit where credit is due: this idea comes from Christoph Lameter
> with a lot of valuable input from Serge Hallyn.  This patch is
> heavily based on Christoph's patch.
> 
> = The status quo =
> 
> On Linux, there are a number of capabilities defined by the kernel.
> To perform various privileged tasks, processes can wield
> capabilities that they hold.
> 
> Each task has four capability masks: effective (pE), permitted (pP),
> inheritable (pI), and a bounding set (X).  When the kernel checks
> for a capability, it checks pE.  The other capability masks serve to
> modify what capabilities can be in pE.
> 
> Any task can remove capabilities from pE, pP, or pI at any time.  If
> a task has a capability in pP, it can add that capability to pE
> and/or pI.  If a task has CAP_SETPCAP, then it can add any
> capability to pI, and it can remove capabilities from X.
> 
> Tasks are not the only things that can have capabilities; files can
> also have capabilities.  A file can have no capability information at
> all [1].  If a file has capability information, then it has a
> permitted mask (fP) and an inheritable mask (fI) as well as a single
> effective bit (fE) [2].  File capabilities modify the capabilities
> of tasks that execve(2) them.
> 
> A task that successfully calls execve has its capabilities modified
> for the file ultimately being executed (i.e. the binary itself if
> that binary is ELF or for the interpreter if the binary is a
> script.) [3] In the capability evolution rules, for each mask Z, pZ
> represents the old value and pZ' represents the new value.  The
> rules are:
> 
>   pP' = (X & fP) | (pI & fI)
>   pI' = pI
>   pE' = (fE ? pP' : 0)
>   X is unchanged
> 
> For setuid binaries, fP, fI, and fE are modified by a moderately
> complicated set of rules that emulate POSIX behavior.  Similarly, if
> euid == 0 or ruid == 0, then fP, fI, and fE are modified differently
> (primarily, fP and fI usually end up being the full set).  For nonroot
> users executing binaries with neither setuid nor file caps, fI and
> fP are empty and fE is false.
> 
> As an extra complication, if you execute a process as nonroot and fE
> is set, then the "secure exec" rules are in effect: AT_SECURE gets
> set, LD_PRELOAD doesn't work, etc.
> 
> This is rather messy.  We've learned that making any changes is
> dangerous, though: if a new kernel version allows an unprivileged
> program to change its security state in a way that persists across
> execution of a setuid program or a program with file caps, this
> persistent state is surprisingly likely to allow setuid or
> file-capped programs to be exploited for privilege escalation.
> 
> = The problem =
> 
> Capability inheritance is basically useless.
> 
> If you aren't root and you execute an ordinary binary, fI is zero,
> so your capabilities have no effect whatsoever on pP'.  This means
> that you can't usefully execute a helper process or a shell command
> with elevated capabilities if you aren't root.
> 
> On current kernels, you can sort of work around this by setting fI
> to the full set for most or all non-setuid executable files.  This
> causes pP' = pI for nonroot, and inheritance works.  No one does
> this because it's a PITA and it isn't even supported on most
> filesystems.
> 
> If you try this, you'll discover that every nonroot program ends up
> with secure exec rules, breaking many things.

pI would have worked great if most programs wanting privilege were
self-contained and compiled.  Shell scripts and lots of fork+execing
make pI [much less useful] [completely useless].  See also golang's
predisposition to fork+exec.
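
To make the point concrete, here is a small user-space model of the execve
rules quoted above. It is only a sketch: capset_t, struct task_caps, struct
file_caps, CAP_BIT() and caps_after_execve() are hypothetical names for
illustration, and the special handling for setuid binaries and for root is
deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t capset_t;                  /* one bit per capability */
#define CAP_BIT(n) ((capset_t)1 << (n))     /* e.g. 10 is CAP_NET_BIND_SERVICE */

struct task_caps {
    capset_t permitted;     /* pP */
    capset_t inheritable;   /* pI */
    capset_t effective;     /* pE */
    capset_t bounding;      /* X  */
};

struct file_caps {
    capset_t permitted;     /* fP */
    capset_t inheritable;   /* fI */
    bool     effective;     /* fE */
};

/*
 * pP' = (X & fP) | (pI & fI)
 * pI' = pI
 * pE' = (fE ? pP' : 0)
 * X is unchanged
 */
static struct task_caps caps_after_execve(struct task_caps p, struct file_caps f)
{
    struct task_caps n = p;   /* carries pI and X over unchanged */

    n.permitted = (p.bounding & f.permitted) | (p.inheritable & f.inheritable);
    n.effective = f.effective ? n.permitted : 0;
    return n;
}
```

With an ordinary binary (fP and fI empty, fE false) the result has empty pP'
and pE' no matter what pI held, which is why a fork+exec of a helper loses
everything.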

> This is a problem that has bitten many people who have tried to use
> capabilities for anything useful.
> 
> = The proposed change =
> 
> This patch adds a fifth capability mask called the ambient mask
> (pA).  pA does what pI should have done.

Or at least what most people want it to do.

> pA obeys the invariant that no bit can ever be set in pA if it is
> not set in both pP and pI.  Dropping a bit from pP or pI drops that
> bit from pA.  This ensures that existing programs that try to drop
> capabilities still do so, with a complication.  Because capability
> inheritance is so broken, setting KEEPCAPS, using setresuid to

Sorry, did you mean "... setting KEEPCAPS and then either using
setresuid to a nonroot uid or calling execve ..." ?

> switch to nonroot uids, or calling execve effectively drops
> capabilities.  Therefore, setresuid from root to

Re: [RFC] capabilities: Ambient capabilities

2015-04-24 Thread Serge Hallyn

Quoting Andy Lutomirski (l...@amacapital.net):
> On Apr 24, 2015 2:15 PM, "Serge E. Hallyn"  wrote:
> >
> > On Fri, Apr 24, 2015 at 01:18:44PM -0700, Andy Lutomirski wrote:
> > > On Fri, Apr 24, 2015 at 1:13 PM, Christoph Lameter  wrote:
> > > > On Fri, 24 Apr 2015, Andy Lutomirski wrote:
> > > >
> > > >> That's sort of what my patch does -- you need CAP_SETPCAP to switch
> > > >> the securebit.
> > > >>
> > > >> But Christoph's patch required it to add caps to the ambient set, 
> > > >> right?
> > > >
> > > > > Yes but you seem to be just adding one additional step without
> > > > > too much of a benefit because you still need CAP_SETPCAP.
> > > >
> > >
> > > No, because I set the default to on :)
> >
> > Right - I definitely prefer
> >
> > . default off
> > . CAP_SETPCAP required to turn it on (for self and children)
> > . once on, anyone can copy bits from (whatever we decided) to pA.
> >
> 
> Why default off?  If there's some weird reason that switching it on
> could cause a security problem, then I'd agree, but I haven't spotted
> a reason yet.

Cause it's less scary?

I'll just wait for the new patchset :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] capabilities: Ambient capabilities

2015-04-24 Thread Serge Hallyn

Quoting Andy Lutomirski (l...@amacapital.net):
> On Fri, Apr 24, 2015 at 10:53 AM, Serge Hallyn wrote:
> > Quoting Christoph Lameter (c...@linux.com):
> >> On Thu, 9 Apr 2015, Christoph Lameter wrote:
> >>
> >> > > I'll submit a new version this week with the securebits.  Sorry
> >> > > for the delay.
> >> > Are we going to get a new version?
> >>
> >> Replying to my own here. Can't we simply use the SETPCAP approach as per
> >> the patch I posted?
> >
> > Andy had objections to that, but it seems ok to me.
> >
> 
> I object because CAP_SETPCAP is very powerful whereas
> CAP_NET_BIND_SERVICE, for example, isn't.  I'm fine with having a
> switch to turn off ambient caps, but requiring the "on" state to give

CAP_SETPCAP would only really be needed for the initial 'enable ambient caps
for this process tree', though.  Once that was set, adding caps to or removing
them from the ambient set wouldn't need to require it.

> processes superpowers seems unfortunate.
> 
> Sorry for the huge delay.  I got caught up with travel and the merge
> window.  Here's a sneak peek:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=cap_ambient
> 
> I need to write the user code to go with it and test it a bit before
> sending it out for real.

Ok, thanks

-serge

Re: [RFC] capabilities: Ambient capabilities

2015-04-24 Thread Serge Hallyn

Quoting Christoph Lameter (c...@linux.com):
> On Thu, 9 Apr 2015, Christoph Lameter wrote:
> 
> > > I'll submit a new version this week with the securebits.  Sorry for
> > > the delay.
> > Are we going to get a new version?
> 
> Replying to my own here. Can't we simply use the SETPCAP approach as per
> the patch I posted?

Andy had objections to that, but it seems ok to me.

-serge

Re: [PATCH] devpts: Add ptmx_uid and ptmx_gid options

2015-04-02 Thread Serge Hallyn

Quoting Andy Lutomirski (l...@amacapital.net):
> On Thu, Apr 2, 2015 at 7:29 AM, Alexander Larsson  wrote:
> > On Thu, 2015-04-02 at 07:06 -0700, Andy Lutomirski wrote:
> >> On Thu, Apr 2, 2015 at 3:12 AM, James Bottomley
> >>  wrote:
> >> > On Tue, 2015-03-31 at 16:17 +0200, Alexander Larsson wrote:
> >> >> On tis, 2015-03-31 at 17:08 +0300, James Bottomley wrote:
> >> >> > On Tue, 2015-03-31 at 06:59 -0700, Andy Lutomirski wrote:
> >> >> > >
> >> >> > > I don't think that this is correct.  That user can already create a
> >> >> > > nested userns and map themselves as 0 inside it.  Then they can mount
> >> >> > > devpts.
> >> >> >
> >> >> > I don't mind if they create a container and control the isolated ttys
> >> >> > in that sub container in the VPS; that's fine.  I do mind if they get
> >> >> > access to the ttys in the VPS.
> >> >> >
> >> >> > If you can convince me (and the rest of Linux) that the tty subsystem
> >> >> > should be mountable by an unprivileged user generally, then what you
> >> >> > propose is OK.
> >> >>
> >> >> That is controlled by the general rights to mount stuff. I.e. unless you
> >> >> have CAP_SYS_ADMIN in the VPS container you will not be able to mount
> >> >> devpts there. You can only do it in a subcontainer where you got
> >> >> permissions to mount via using user namespaces.
> >> >
> >> > OK let me try again.  Fine, if you want to speak capabilities, you've
> >> > given a non-root user an unexpected capability (the capability of
> >> > creating a ptmx device).  But you haven't used a capability separation
> >> > to do this, you've just hard coded it via a mount parameter mechanism.
> >> >
> >> > If you want to do this thing, do it properly, so it's acceptable to the
> >> > whole of Linux, not a special corner case for one particular type of
> >> > container.
> >> >
> >> > Security breaches are created when people code in special, little used,
> >> > corner cases because they don't get as thoroughly tested and inspected
> >> > as generally applicable mechanisms.
> >> >
> >> > What you want is to be able to use the tty subsystem as a non root user:
> >> > fine, but set that up globally, don't hide it in containers so a lot
> >> > fewer people care.
> >>
> >> I tend to agree, and not just for the tty subsystem.  This is an
> >> attack surface issue.  With unprivileged user namespaces, unprivileged
> >> users can create mount namespaces (probably a good thing for bind
> >> mounts, etc), network namespaces (reasonably safe by themselves),
> >> network interfaces and iptables rules (scary), fresh
> >> instances/superblocks of some filesystems (scariness depends on the fs
> >> -- tmpfs is probably fine), and more.
> >>
> >> I think we should have real controls for this, and this is mostly
> >> Eric's domain.  Eric?  A silly issue that sometimes prevents devpts
> >> from being mountable isn't a real control, though.
> >
> > I'm honestly surprised that non-root is allowed to mount things in
> > general with user namespaces. This was long disabled for non-root use in
> > Fedora, but it is now enabled.
> >
> > For instance, using loopback mounted files you could probably attack
> > some of the less well tested filesystem implementations by feeding them
> > fuzzed data.
> >
> 
> You actually can't do that right now.  Filesystems have to opt in to
> being mounted in unprivileged user namespaces, and no filesystems with
> backing stores have opted in.  devpts has, but it's buggy without this
> patch IMO.
> 
> > Anyway, I don't see how this affects devpts though. If you're running in
> > a container (or uncontained), as a regular user with no mount
> > capabilities you can already mount a devpts filesystem if you create a
> > subcontainer with user namespaces and map your uid to 0 in the
> > subcontainer. Then you get a new ptmx device that you can do whatever
> > you want with. The mount option would let you do the same, except be
> > your regular uid in the subcontainer.
> >
> > The only difference outside of the subcontainer arises if the outer
> > container has no uid 0 mapped, yet the user has CAP_SYS_ADMIN rights in
> > that container. Then he can mount devpts in the outer container, where
> > before he could only mount it in an inner container.
> >
> 
> Agreed.  Also, devpts doesn't seem scary at all to me from a userns
> perspective.  Regular users on normal systems can already use ptmx,
> and AFAICS basically all of the attack surface is already available
> through the normal /dev/ptmx node.

I've been ignoring this thread because I was pretty sure I had acked the
original patch.  If you don't have a record of that (or I'm plain wrong
and never did) please let me know.

Re: [PATCH] capabilities: Ambient capability set V1

2015-02-24 Thread Serge Hallyn

Quoting Christoph Lameter (c...@linux.com):
> On Tue, 24 Feb 2015, Serge Hallyn wrote:
> 
> > Unless I'm misunderstanding what you are saying, apps do have surprises.
> > They drop capabilities, execute a file, and the result has capabilities
> > which the app couldn't have expected.  At least if the bits have to be
> > in fI to become part of pP', the app has a clue.
> 
> Well yes but the surprises do not occur in the cap bits they are
> manipulating or inspecting via prctl.
> 
> > To be clear, I'm suggesting that the rules at exec become:
> >
> > pI' = pI
> 
> Ok that is new and on its own may solve the issue?

No that's not new.

> > pA' = pA  (pA is ambient)
> 
> That's what this patch does
> 
> > pP' = (X & fP) | (pI & (fI | pA))
> 
> Hmmm... fP and fI are empty for a file not having caps, so
> 
> pP' = pI & pA

Right.

> > pE' = pP' & fE
> 
> fE? So the inherited caps are not effective? fE would be empty for a file
> not having caps thus the ambient caps would not be available in the child.

Yeah we could make this

pE' = pP' & (fE | pA)
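
A small sketch of the case Christoph raises, with capability sets modeled as
integer bitmasks.  The function names and the FULL constant are hypothetical,
and fE is treated as a mask exactly as in the formulas above, which is an
assumption of this sketch.  For a file carrying no capabilities
(fP = fI = fE = 0) the original rule leaves pE' empty, while the amended rule
makes the ambient bits effective.

```python
CAP_NET_BIND_SERVICE = 10  # value as in linux/capability.h
FULL = (1 << 64) - 1       # stand-in for a full capability set


def new_permitted(X, fP, pI, fI, pA):
    # pP' = (X & fP) | (pI & (fI | pA))
    return (X & fP) | (pI & (fI | pA))


def new_effective_original(pP_new, fE):
    # pE' = pP' & fE
    return pP_new & fE


def new_effective_amended(pP_new, fE, pA):
    # pE' = pP' & (fE | pA)
    return pP_new & (fE | pA)


if __name__ == "__main__":
    bit = 1 << CAP_NET_BIND_SERVICE
    # A file with no capabilities at all: fP = fI = fE = 0.
    pP_new = new_permitted(X=FULL, fP=0, pI=bit, fI=0, pA=bit)
    print("pP' keeps the bit:", pP_new == bit)
    print("original pE':", new_effective_original(pP_new, fE=0))
    print("amended pE' keeps the bit:",
          new_effective_amended(pP_new, fE=0, pA=bit) == bit)
```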

-serge

Re: [PATCH] capabilities: Ambient capability set V1

2015-02-24 Thread Serge Hallyn

Quoting Christoph Lameter (c...@linux.com):
> On Tue, 24 Feb 2015, Serge E. Hallyn wrote:
> 
> > The other way to look at it then is that it's basically as though the
> > privileged task (which has CAP_SETFCAP) could've just added fI=full to
> > all binaries on the filesystem;  instead it's using the ambient set
> > so that the risk from fI=full is contained to its own process tree.
> 
> The way that our internal patch works is to leave these things alone and
> just check the ambient mask in the *capable*() functions. That way the
> behavior of the existing cap bits does not change but the ambient caps
> stay available. Apps have no surprises.

Unless I'm misunderstanding what you are saying, apps do have surprises.
They drop capabilities, execute a file, and the result has capabilities
which the app couldn't have expected.  At least if the bits have to be
in fI to become part of pP', the app has a clue.

To be clear, I'm suggesting that the rules at exec become:

pI' = pI
pA' = pA  (pA is ambient)
pP' = (X & fP) | (pI & (fI | pA))
pE' = pP' & fE
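
To make the four rules concrete, here is a minimal sketch that applies them
across a chain of execs, with capability sets modeled as integer bitmasks.
Task, File and exec_transform are hypothetical names used only for
illustration, and fE is treated as a mask as in the rules above.

```python
from typing import NamedTuple

CAP_NET_BIND_SERVICE = 10  # value as in linux/capability.h


class Task(NamedTuple):
    pI: int  # inheritable
    pA: int  # ambient
    pP: int  # permitted
    pE: int  # effective


class File(NamedTuple):
    fP: int = 0
    fI: int = 0
    fE: int = 0


def exec_transform(task, file, X):
    """Apply the suggested exec rules; X is the bounding set."""
    pP = (X & file.fP) | (task.pI & (file.fI | task.pA))
    return Task(pI=task.pI, pA=task.pA, pP=pP, pE=pP & file.fE)


if __name__ == "__main__":
    bit = 1 << CAP_NET_BIND_SERVICE
    plain = File()  # e.g. a shell or script interpreter with no file caps
    task = Task(pI=bit, pA=bit, pP=bit, pE=bit)
    for _ in range(3):  # fork+exec chain through capability-less binaries
        task = exec_transform(task, plain, X=bit)
    print(task)
```

With pA set, pP' survives every hop through a capability-less binary; with
pA empty the same chain loses the bit at the first exec.  Under these rules
pE' is still empty after the first hop, since fE is empty.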

Re: [PATCH] capabilities: Ambient capability set V1

2015-02-23 Thread Serge Hallyn

Quoting Christoph Lameter (c...@linux.com):
> On Mon, 23 Feb 2015, Serge E. Hallyn wrote:
> 
> > > I do not see a problem with dropping privilege since the ambient set
> > > is supposed to be preserved across a drop of privilege.
> >
> > Because you're tricking the program into thinking it has dropped
> > the privilege, when in fact it has not.
> 
> So the cap was dropped from the cap perm set but it is still active
> in the ambient set?

Right, and the legacy program doesn't know to check the new set.

Re: [PATCH] capabilities: Ambient capability set V1

2015-02-23 Thread Serge Hallyn

Quoting Christoph Lameter (c...@linux.com):
> Ok 4.0-rc1 is out and this patch has been sitting here for a couple of
> weeks without comment after an intensive discussion about the RFCs.
> 
> Since there were no objections: Is there any chance to get this into -next
> somehow?

Andrew Morgan and Andy Lutomirski appear to have similar concerns
but competing ideas on how to address them.  We need them to agree
on an approach.

The core concern for amorgan is that an unprivileged user not be
able to cause a privileged program to run in a way that it fails to
drop privilege before running unprivileged-user-provided code.

Andy Lutomirski's concern is simply that code which is currently
doing the right thing to drop privilege not be run in a way that
it thinks it is dropping privilege, but in fact is not.

(Please correct me where I've mis-spoken or misunderstood)

Since your desire is precisely for a mode where dropping privilege
works as usual, but exec then re-gains some or all of that privilege,
we need to either agree on a way to enter that mode that ordinary
use cases can't be tricked into using, or find a way for legacy
users to be tipped off as to what's going on (without having to be
re-written).

-serge

cpusets in non-unified hierarchy broken?

2015-02-12 Thread Serge Hallyn

Hi,

as of some point in 3.18, cpuset.cpus doesn't seem to be
enforced any more.  I don't see an obvious reason in the
code, but it seems likely to be related to the effective_cpus.

If I mount -t cgroup -o cpuset cpuset /mnt and then mkdir /mnt/lxc,
then /mnt/lxc has:


ubuntu@cpuset1:~$ cat /mnt/lxc/cpuset.effective_cpus

ubuntu@cpuset1:~$ cat /mnt/lxc/cpuset.cpus
0-3


while


ubuntu@cpuset1:~$ cat /mnt/cpuset.effective_cpus
0-3
ubuntu@cpuset1:~$ cat /mnt/cpuset.cpus
0-3


My understanding is that effective_cpus in /lxc should be
the /cpuset.effective_cpus & /lxc/cpuset.cpus.  But that
doesn't seem to be the case.  So then, when I start a
container confined to a single cpu (which will use cgroup
/lxc/v1, for instance) then it looks like:


ubuntu@cpuset1:~$ cat /mnt/lxc/v1/cpuset.effective_cpus

ubuntu@cpuset1:~$ cat /mnt/lxc/v1/cpuset.cpus
1


While the /proc/self/status inside that container and cgroup
shows:


root@v1:~# grep -i cpu /proc/self/status
Cpus_allowed:   f
Cpus_allowed_list:  0-3


Christian, who originally found this and reported it at
https://github.com/lxc/lxc/issues/427 , also tested that in fact
the tasks are not confined (so it's not just an issue of
improper reporting, it seems)
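
For reference, the relationship I expected (child effective_cpus = parent
effective_cpus & child cpuset.cpus) can be sketched as below.  The helper
names are hypothetical; the inputs are the strings shown in the cat output
above, and the mask format is the hex form printed as Cpus_allowed.

```python
def parse_cpulist(text):
    """Parse a cpulist such as '0-3' or '0,2-3' into a set of cpu ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file yields an empty set
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def expected_effective(parent_effective, child_cpus):
    """Intersection of the parent's effective cpus and the child's cpus."""
    return parse_cpulist(parent_effective) & parse_cpulist(child_cpus)


def mask_to_cpus(hexmask):
    """Decode a Cpus_allowed style hex mask into a set of cpu ids."""
    value = int(hexmask.replace(",", ""), 16)
    return {cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1}


if __name__ == "__main__":
    print(sorted(expected_effective("0-3", "0-3")))  # expected for /mnt/lxc
    print(sorted(expected_effective("0-3", "1")))    # expected for /mnt/lxc/v1
    print(sorted(mask_to_cpus("f")))                 # what the task reports
```

By this reading /mnt/lxc/v1 should be limited to cpu 1, whereas the
Cpus_allowed mask of f decodes to cpus 0-3.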

thanks,
-serge

Re: [capabilities] Allow normal inheritance for a configurable set of capabilities

2015-02-02 Thread Serge Hallyn

Quoting Andy Lutomirski (l...@amacapital.net):
> On Mon, Feb 2, 2015 at 9:12 AM, Serge Hallyn  wrote:
> > A key concept behind posix capabilities is that the privilege comes from
> > both the person and the file being executed.  As you say below basically
> > anything can be executed by the program so that is completely violated.
> >
> > Still, it's not that different from mmapping some arbitrary code and
> > jumping into it while retaining caps.
> >
> > If we were to support such a feature, I'm thinking I'd prefer we do
> > it somewhat analogously to the capability bounding set.  Perhaps add an
> > ambient_inh_caps set or something.  Empty by default.  To add caps to it you
> > must have the cap in your permitted set already.  (Ok to do in a user
> > namespace).  Then at exec,
> >
> > pP' = (X & fP) | (pI & fI) | (pI & pA)
> >
> > pA being your ambient_inh set
> >
> > Not saying this is a good idea necessarily, but worth thinking about.
> 
> This isn't obviously a bad formulation.  We could control pA using some 
> syscall.

My first thought was prctl (since we have PR_CAPBSET_DROP)

> Another formulation would be a single per-user-ns or
> inherited-per-process bit that sets fI to the full set regardless of
> file caps.  Dealing with the file effective bit will be an added
> complication, as will dealing with setuid binaries.
> 
> How many of you will be at LSF/MM?  This might be a decent topic.

I'm not scheduled to be there.
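
As a sanity check on the quoted formula, here is a short sketch that
exhaustively compares the three-term form
pP' = (X & fP) | (pI & fI) | (pI & pA) with the factored form
pP' = (X & fP) | (pI & (fI | pA)) over small bitmasks.  The function names
are hypothetical; since the operators act on each bit independently, a
two-bit width is assumed to be enough to cover every case.

```python
from itertools import product


def permitted_three_term(X, fP, pI, fI, pA):
    return (X & fP) | (pI & fI) | (pI & pA)


def permitted_factored(X, fP, pI, fI, pA):
    return (X & fP) | (pI & (fI | pA))


def forms_agree(width=2):
    """Compare both forms for every combination of width-bit masks."""
    values = range(1 << width)
    return all(
        permitted_three_term(*masks) == permitted_factored(*masks)
        for masks in product(values, repeat=5)
    )


if __name__ == "__main__":
    print(forms_agree())
```

The two agree by distributivity of & over |, so either spelling can be used
when discussing the rule.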

Re: [capabilities] Allow normal inheritance for a configurable set of capabilities

2015-02-02 Thread Serge Hallyn

Quoting Casey Schaufler (ca...@schaufler-ca.com):
> I'm game to participate in such an effort. The POSIX scheme
> is workable, but given that it's 20 years old and hasn't
> developed real traction it's hard to call it successful.

Over the years we've several times discussed possible reasons for this
and how to help.  I personally think it's two things:  1. Lack of
toolchain and fs support.  The fact that we cannot to this day enable
ping using capabilities by default because of cpio, tar and non-xattr
filesystems is disheartening.  2. It's hard for users and applications
to know what caps they need.  Yes, the API is a bear to use, but we can
hide that behind fancier libraries.  But using capabilities requires too
much in-depth knowledge of precisely what caps you might need for
whatever operations a library may now do when you asked for something.
