Re: [Devel] Device Namespaces

2013-10-01 Thread Serge E. Hallyn
Quoting Andy Lutomirski (l...@amacapital.net):
> On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen  
> wrote:
> > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
> >  wrote:
> >
> >>> - We can relay a call of /sbin/hotplug from outside of a container
> >>>   to inside of a container based on policy.
> >>>   (But no one uses /sbin/hotplug anymore).
> >>
> >> That's right, they should be listening to libudev events, so why can't
> >> your daemon shuffle them off to the proper container, all in userspace?
> >
> > Which reminds me, one potential reason being..
> > http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html
> >
> 
> Can't the daemon live outside the container and shuffle stuff in?

That's exactly what Michael Warfield is suggesting, fwiw.

-serge
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 9/9] userns: check user namespace for task->file uid equivalence checks

2011-02-23 Thread Serge E. Hallyn
Quoting Andrew Morton (a...@linux-foundation.org):
> On Thu, 17 Feb 2011 15:04:07 +
> "Serge E. Hallyn"  wrote:
> 
> > Cheat for now and say all files belong to init_user_ns.  Next
> > step will be to let superblocks belong to a user_ns, and derive
> > inode_userns(inode) from inode->i_sb->s_user_ns.  Finally we'll
> > introduce more flexible arrangements.
> > 
> >
> > ...
> >
> > +
> > +/*
> > + * return 1 if current either has CAP_FOWNER to the
> > + * file, or owns the file.
> > + */
> > +int is_owner_or_cap(const struct inode *inode)
> > +{
> > +   struct user_namespace *ns = inode_userns(inode);
> > +
> > +   if (current_user_ns() == ns && current_fsuid() == inode->i_uid)
> > +   return 1;
> > +   if (ns_capable(ns, CAP_FOWNER))
> > +   return 1;
> > +   return 0;
> > +}
> 
> bool?
> 
> > +EXPORT_SYMBOL(is_owner_or_cap);
> 
> There's a fairly well adhered to convention that global symbols (and
> often static symbols) have a prefix which identifies the subsystem to
> which they belong.  This patchset rather scorns that convention.
> 
> Most of these identifiers are pretty obviously from the capability
> subsystem, but still...

Would 'inode_owner_or_capable' be better and make sense?
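
For illustration only, here is a user-space sketch of the check under the
proposed name, returning bool as Andrew suggests.  The struct layouts and
the cap_fowner flag are stand-ins invented for this example; they are not
the kernel's types, and the flag merely models what
ns_capable(ns, CAP_FOWNER) would report.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for kernel types; not the real definitions. */
struct user_namespace { int id; };

struct task {
	struct user_namespace *user_ns;
	unsigned int fsuid;
	bool cap_fowner;	/* models ns_capable(ns, CAP_FOWNER) */
};

struct inode {
	struct user_namespace *i_user_ns;	/* models inode_userns(inode) */
	unsigned int i_uid;
};

/* Return true if the task owns the inode or has CAP_FOWNER towards it. */
static bool inode_owner_or_capable(const struct task *t,
				   const struct inode *inode)
{
	/* Ownership only counts when both sides share a user namespace. */
	if (t->user_ns == inode->i_user_ns && t->fsuid == inode->i_uid)
		return true;
	return t->cap_fowner;
}
```

The point of the sketch is the ordering: an equal uid in a different user
namespace does not count as ownership, so only the capability path remains.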

-serge
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers


[Devel] Re: [PATCH] userns: ptrace: incorporate feedback from Eric

2011-02-23 Thread Serge E. Hallyn
Quoting Andrew Morton (a...@linux-foundation.org):
> On Thu, 24 Feb 2011 00:49:01 +
> "Serge E. Hallyn"  wrote:
> 
> > same_or_ancestor_user_ns() was not an appropriate check to
> > constrain cap_issubset.  Rather, cap_issubset() is only
> > meaningful when both capsets are in the same user_ns.
> 
> I queued this as a fix against
> userns-allow-ptrace-from-non-init-user-namespaces.patch, but I get the
> feeling that it would be better to just drop everything and start
> again?

Sure, I'll rebase and resend.  I wonder if I should trim the Cc list
for the next round.

thanks,
-serge


[Devel] [PATCH] userns: ptrace: incorporate feedback from Eric

2011-02-23 Thread Serge E. Hallyn
same_or_ancestor_user_ns() was not an appropriate check to
constrain cap_issubset.  Rather, cap_issubset() is only
meaningful when both capsets are in the same user_ns.

Signed-off-by: Serge E. Hallyn 
---
 include/linux/user_namespace.h |9 -
 kernel/user_namespace.c|   16 
 security/commoncap.c   |   28 ++--
 3 files changed, 10 insertions(+), 43 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 862fc59..faf4679 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -39,9 +39,6 @@ static inline void put_user_ns(struct user_namespace *ns)
 uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, uid_t uid);
 gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t gid);
 
-int same_or_ancestor_user_ns(struct task_struct *task,
-   struct task_struct *victim);
-
 #else
 
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
@@ -69,12 +66,6 @@ static inline gid_t user_ns_map_gid(struct user_namespace *to,
return gid;
 }
 
-static inline int same_or_ancestor_user_ns(struct task_struct *task,
-   struct task_struct *victim)
-{
-   return 1;
-}
-
 #endif
 
 #endif /* _LINUX_USER_H */
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 0ef2258..9da289c 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -129,22 +129,6 @@ gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t
return overflowgid;
 }
 
-int same_or_ancestor_user_ns(struct task_struct *task,
-   struct task_struct *victim)
-{
-   struct user_namespace *u1 = task_cred_xxx(task, user)->user_ns;
-   struct user_namespace *u2 = task_cred_xxx(victim, user)->user_ns;
-   for (;;) {
-   if (u1 == u2)
-   return 1;
-   if (u1 == &init_user_ns)
-   return 0;
-   u1 = u1->creator->user_ns;
-   }
-   /* We never get here */
-   return 0;
-}
-
 static __init int user_namespaces_init(void)
 {
user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
diff --git a/security/commoncap.c b/security/commoncap.c
index 12ff65c..526106f 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -142,19 +142,15 @@ int cap_settime(struct timespec *ts, struct timezone *tz)
 int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
 {
int ret = 0;
-   const struct cred *cred, *tcred;
+   const struct cred *cred, *child_cred;
 
rcu_read_lock();
cred = current_cred();
-   tcred = __task_cred(child);
-   /*
-* The ancestor user_ns check may be gratuitous, as I think
-* we've already guaranteed that in kernel/ptrace.c.
-*/
-   if (same_or_ancestor_user_ns(current, child) &&
-   cap_issubset(tcred->cap_permitted, cred->cap_permitted))
+   child_cred = __task_cred(child);
+   if (cred->user->user_ns == child_cred->user->user_ns &&
+   cap_issubset(child_cred->cap_permitted, cred->cap_permitted))
goto out;
-   if (ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE))
+   if (ns_capable(child_cred->user->user_ns, CAP_SYS_PTRACE))
goto out;
ret = -EPERM;
 out:
@@ -178,19 +174,15 @@ out:
 int cap_ptrace_traceme(struct task_struct *parent)
 {
int ret = 0;
-   const struct cred *cred, *tcred;
+   const struct cred *cred, *child_cred;
 
rcu_read_lock();
cred = __task_cred(parent);
-   tcred = current_cred();
-   /*
-* The ancestor user_ns check may be gratuitous, as I think
-* we've already guaranteed that in kernel/ptrace.c.
-*/
-   if (same_or_ancestor_user_ns(parent, current) &&
-   cap_issubset(tcred->cap_permitted, cred->cap_permitted))
+   child_cred = current_cred();
+   if (cred->user->user_ns == child_cred->user->user_ns &&
+   cap_issubset(child_cred->cap_permitted, cred->cap_permitted))
goto out;
-   if (has_ns_capability(parent, tcred->user->user_ns, CAP_SYS_PTRACE))
+   if (has_ns_capability(parent, child_cred->user->user_ns, CAP_SYS_PTRACE))
goto out;
ret = -EPERM;
 out:
-- 
1.7.0.4
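
As a rough user-space model of the rule this patch settles on: the capset
comparison is consulted only when tracer and child share a user namespace;
otherwise only a capability targeted at the child's namespace helps.  The
types below, the 64-bit capset, and the ns_cap_sys_ptrace parameter (which
stands in for the ns_capable()/has_ns_capability() call) are assumptions
made for the sketch, not kernel definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for kernel types. */
struct user_namespace { int id; };

struct cred {
	struct user_namespace *user_ns;
	uint64_t cap_permitted;	/* one bit per capability */
};

/* True if every capability in a is also present in set. */
static bool cap_issubset(uint64_t a, uint64_t set)
{
	return (a & ~set) == 0;
}

/*
 * ns_cap_sys_ptrace models "tracer has CAP_SYS_PTRACE targeted at the
 * child's user namespace".  Returns 0 if tracing is allowed and -1
 * otherwise (the kernel code returns -EPERM).
 */
static int ptrace_access_check(const struct cred *tracer,
			       const struct cred *child,
			       bool ns_cap_sys_ptrace)
{
	/* Capsets are only comparable inside one user namespace. */
	if (tracer->user_ns == child->user_ns &&
	    cap_issubset(child->cap_permitted, tracer->cap_permitted))
		return 0;
	if (ns_cap_sys_ptrace)
		return 0;
	return -1;
}
```

With this shape, a child in another namespace whose capset happens to be a
numeric subset of the tracer's is still refused unless the targeted
capability is present.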



[Devel] Re: [PATCH 4/9] allow killing tasks in your own or child userns

2011-02-23 Thread Serge E. Hallyn
Quoting Andrew Morton (a...@linux-foundation.org):
> On Thu, 17 Feb 2011 15:03:25 +
> "Serge E. Hallyn"  wrote:
> 
> >  /*
> > + * called with RCU read lock from check_kill_permission()
> > + */
> > +static inline int kill_ok_by_cred(struct task_struct *t)
> > +{
> > +   const struct cred *cred = current_cred();
> > +   const struct cred *tcred = __task_cred(t);
> > +
> > +   if (cred->user->user_ns == tcred->user->user_ns &&
> > +   (cred->euid == tcred->suid ||
> > +cred->euid == tcred->uid ||
> > +cred->uid  == tcred->suid ||
> > +cred->uid  == tcred->uid))
> > +   return 1;
> > +
> > +   if (ns_capable(tcred->user->user_ns, CAP_KILL))
> > +   return 1;
> > +
> > +   return 0;
> > +}
> 
> The compiler will inline this for us.

Is that simply true with everything (worth inlining) nowadays, or is
there a particular implicit hint to the compiler that'll make that
happen?

Not that I guess it's even particularly important in this case.

From: Serge E. Hallyn 
Date: Thu, 24 Feb 2011 00:26:02 +
Subject: [PATCH 1/2] userns: let compiler inline kill_ok_by_cred (per akpm)

Signed-off-by: Serge E. Hallyn 
---
 kernel/signal.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index ffe4bdf..12702b4 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -638,7 +638,7 @@ static inline bool si_fromuser(const struct siginfo *info)
 /*
  * called with RCU read lock from check_kill_permission()
  */
-static inline int kill_ok_by_cred(struct task_struct *t)
+static int kill_ok_by_cred(struct task_struct *t)
 {
const struct cred *cred = current_cred();
const struct cred *tcred = __task_cred(t);
-- 
1.7.0.4



[Devel] Re: [PATCH 5/9] Allow ptrace from non-init user namespaces

2011-02-23 Thread Serge E. Hallyn
Quoting Andrew Morton (a...@linux-foundation.org):
> On Thu, 17 Feb 2011 15:03:33 +
> "Serge E. Hallyn"  wrote:
> 
> > ptrace is allowed to tasks in the same user namespace according to
> > the usual rules (i.e. the same rules as for two tasks in the init
> > user namespace).  ptrace is also allowed to a user namespace to
> > which the current task has the CAP_SYS_PTRACE capability.
> > 
> >
> > ...
> >
> > --- a/include/linux/capability.h
> > +++ b/include/linux/capability.h
> > @@ -546,6 +546,8 @@ extern const kernel_cap_t __cap_init_eff_set;
> >   */
> >  #define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0)
> >  
> > +#define has_ns_capability(t, ns, cap) (security_real_capable((t), (ns), (cap)) == 0)
> 
> macroitis.

Thanks for the review, Andrew.  Unfortunately this one is hard to turn
into a function because it uses security_real_capable(), which is
sometimes defined in security/security.c as a real function, and
other times as a static inline in include/linux/security.h.  So
I'd have to #include security.h in capability.h, but security.h
already #includes capability.h.

All the other comments affect same_or_ancestor_user_ns(), which
following Eric's feedback is going away.
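
A small stand-alone sketch of why the macro sidesteps the include cycle: a
macro body is expanded at the use site, so the function it calls only has
to be declared by then, whereas a static inline wrapper would need the
declaration at the point where the wrapper itself is defined.  The names
has_ns_cap, real_capable and may_ptrace, and the toy policy, are made up
for the example.

```c
#include <stdbool.h>

/* "capability.h" part: no declaration of real_capable() is visible yet. */
#define has_ns_cap(t, ns, cap) (real_capable((t), (ns), (cap)) == 0)

/*
 * A static inline wrapper placed here would rely on an implicit
 * declaration of real_capable(), which C99 and later reject:
 *
 *   static inline bool has_ns_cap_fn(int t, int ns, int cap)
 *   { return real_capable(t, ns, cap) == 0; }
 */

/* "security.h" part: the definition only arrives later. */
static int real_capable(int task, int ns, int cap)
{
	/* Toy policy: task 1 holds every capability, in namespace 0 only. */
	(void)cap;
	return (task == 1 && ns == 0) ? 0 : -1;
}

/* Use site: by now both the macro and the function are visible. */
static bool may_ptrace(int task, int ns)
{
	return has_ns_cap(task, ns, 19 /* CAP_SYS_PTRACE */);
}
```

The cost is the usual one for macros: arguments are not type-checked until
expansion, which is what the "macroitis" remark is getting at.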

> >  /**
> >   * has_capability_noaudit - Determine if a task has a superior capability available (unaudited)
> >   * @t: The task in question
> > diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> > index faf4679..862fc59 100644
> > --- a/include/linux/user_namespace.h
> > +++ b/include/linux/user_namespace.h
> > @@ -39,6 +39,9 @@ static inline void put_user_ns(struct user_namespace *ns)
> >  uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, uid_t uid);
> >  gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t gid);
> >  
> > +int same_or_ancestor_user_ns(struct task_struct *task,
> > +   struct task_struct *victim);
> 
> bool.
> 
> >  #else
> >  
> >  static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
> >
> > ...
> >
> > --- a/kernel/user_namespace.c
> > +++ b/kernel/user_namespace.c
> > @@ -129,6 +129,22 @@ gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t
> > return overflowgid;
> >  }
> >  
> > +int same_or_ancestor_user_ns(struct task_struct *task,
> > +   struct task_struct *victim)
> > +{
> > +   struct user_namespace *u1 = task_cred_xxx(task, user)->user_ns;
> > +   struct user_namespace *u2 = task_cred_xxx(victim, user)->user_ns;
> > +   for (;;) {
> > +   if (u1 == u2)
> > +   return 1;
> > +   if (u1 == &init_user_ns)
> > +   return 0;
> > +   u1 = u1->creator->user_ns;
> > +   }
> > +   /* We never get here */
> > +   return 0;
> 
> Remove?
> 
> > +}
> > +
> >  static __init int user_namespaces_init(void)
> >  {
> > user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
> >
> > ...
> >
> >  int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
> >  {
> > int ret = 0;
> > +   const struct cred *cred, *tcred;
> >  
> > rcu_read_lock();
> > -   if (!cap_issubset(__task_cred(child)->cap_permitted,
> > - current_cred()->cap_permitted) &&
> > -   !capable(CAP_SYS_PTRACE))
> > -   ret = -EPERM;
> > +   cred = current_cred();
> > +   tcred = __task_cred(child);
> > +   /*
> > +* The ancestor user_ns check may be gratuitous, as I think
> > +* we've already guaranteed that in kernel/ptrace.c.
> > +*/
> 
> ?
> 
> > +   if (same_or_ancestor_user_ns(current, child) &&
> > +   cap_issubset(tcred->cap_permitted, cred->cap_permitted))
> > +   goto out;
> > +   if (ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE))
> > +   goto out;
> > +   ret = -EPERM;
> > +out:
> > rcu_read_unlock();
> > return ret;
> >  }
> >
> > ...
> >
> 



[Devel] [PATCH 5/4] Clean up capability.h and capability.c

2011-02-23 Thread Serge E. Hallyn
Convert macros to functions to let type safety do its thing.  Switch
some functions from ints to more appropriate bool.  Move all forward
declarations together to top of the #ifdef __KERNEL__ section.  Use
kernel-doc format for comments.

Some macros couldn't be converted because they use functions from
security.h which sometimes are extern and sometimes static inline,
and we don't want to #include security.h in capability.h.

Also add a real current_user_ns() function (and convert the existing
macro to _current_user_ns()) so we can use it in capability.h
without #including cred.h.

Signed-off-by: Serge E. Hallyn 
---
 include/linux/capability.h |   38 ++
 include/linux/cred.h   |4 +++-
 kernel/capability.c|   20 
 kernel/cred.c  |5 +
 4 files changed, 46 insertions(+), 21 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index bc0f262..688462f 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -368,6 +368,17 @@ struct cpu_vfs_cap_data {
 
 #ifdef __KERNEL__
 
+struct dentry;
+struct user_namespace;
+
+extern struct user_namespace init_user_ns;
+
+struct user_namespace *current_user_ns(void);
+
+extern const kernel_cap_t __cap_empty_set;
+extern const kernel_cap_t __cap_full_set;
+extern const kernel_cap_t __cap_init_eff_set;
+
 /*
  * Internal kernel functions only
  */
@@ -530,10 +541,6 @@ static inline kernel_cap_t cap_raise_nfsd_set(const kernel_cap_t a,
   cap_intersect(permitted, __cap_nfsd_set));
 }
 
-extern const kernel_cap_t __cap_empty_set;
-extern const kernel_cap_t __cap_full_set;
-extern const kernel_cap_t __cap_init_eff_set;
-
 /**
  * has_capability - Determine if a task has a superior capability available
  * @t: The task in question
@@ -560,18 +567,25 @@ extern const kernel_cap_t __cap_init_eff_set;
  * Note that this does not set PF_SUPERPRIV on the task.
  */
 #define has_capability_noaudit(t, cap) \
-   (security_real_capable_noaudit((t), &init_user_ns, (cap)) == 0)
+   (security_real_capable_noaudit((t), &init_user_ns, (cap)) == 0)
 
-struct user_namespace;
-extern struct user_namespace init_user_ns;
-extern int capable(int cap);
-extern int ns_capable(struct user_namespace *ns, int cap);
-extern int task_ns_capable(struct task_struct *t, int cap);
+extern bool capable(int cap);
+extern bool ns_capable(struct user_namespace *ns, int cap);
+extern bool task_ns_capable(struct task_struct *t, int cap);
 
-#define nsown_capable(cap) (ns_capable(current_user_ns(), (cap)))
+/**
+ * nsown_capable - Check superior capability to one's own user_ns
+ * @cap: The capability in question
+ *
+ * Return true if the current task has the given superior capability
+ * targeted at its own user namespace.
+ */
+static inline bool nsown_capable(int cap)
+{
+   return ns_capable(current_user_ns(), cap);
+}
 
 /* audit system wants to get cap info from files as well */
-struct dentry;
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
 
 #endif /* __KERNEL__ */
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 4aaeab3..9aeeb0b 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -354,9 +354,11 @@ static inline void put_cred(const struct cred *_cred)
 #define current_fsgid()(current_cred_xxx(fsgid))
 #define current_cap()  (current_cred_xxx(cap_effective))
 #define current_user() (current_cred_xxx(user))
-#define current_user_ns()  (current_cred_xxx(user)->user_ns)
+#define _current_user_ns() (current_cred_xxx(user)->user_ns)
 #define current_security() (current_cred_xxx(security))
 
+extern struct user_namespace *current_user_ns(void);
+
 #define current_uid_gid(_uid, _gid)\
 do {   \
const struct cred *__cred;  \
diff --git a/kernel/capability.c b/kernel/capability.c
index 916658c..0a3d2c8 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -300,7 +300,7 @@ error:
  * This sets PF_SUPERPRIV on the task if the capability is available on the
  * assumption that it's about to be used.
  */
-int capable(int cap)
+bool capable(int cap)
 {
return ns_capable(&init_user_ns, cap);
 }
@@ -317,7 +317,7 @@ EXPORT_SYMBOL(capable);
  * This sets PF_SUPERPRIV on the task if the capability is available on the
  * assumption that it's about to be used.
  */
-int ns_capable(struct user_namespace *ns, int cap)
+bool ns_capable(struct user_namespace *ns, int cap)
 {
if (unlikely(!cap_valid(cap))) {
printk(KERN_CRIT "capable() called with invalid cap=%u\n", cap);
@@ -326,17 +326,21 @@ int ns_capable(struct user_namespace *ns, int cap)
 
if (security_capable(ns, current_cred(), cap) == 0) {
current->flags |= PF_SUPERPRIV;
-   return 1;

[Devel] [PATCH 2/4] userns: let copy_ipcs handle setting ipc_ns->user_ns

2011-02-23 Thread Serge E. Hallyn
To do that, we have to pass in the task_struct of the task which
will own the ipc_ns, so we can assign its user_ns.

Changelog:
Feb 23: As per Oleg's comment, just pass in tsk.  To get the
ipc_ns from the nsproxy we need to include nsproxy.h

Signed-off-by: Serge E. Hallyn 
---
 include/linux/ipc_namespace.h |7 ---
 ipc/namespace.c   |   13 -
 kernel/nsproxy.c  |7 +--
 3 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 46d2eb4..c079d09 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * ipc namespace events
@@ -93,7 +94,7 @@ static inline int mq_init_ns(struct ipc_namespace *ns) { return 0; }
 
 #if defined(CONFIG_IPC_NS)
 extern struct ipc_namespace *copy_ipcs(unsigned long flags,
-  struct ipc_namespace *ns);
+  struct task_struct *tsk);
 static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
 {
if (ns)
@@ -104,12 +105,12 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
 extern void put_ipc_ns(struct ipc_namespace *ns);
 #else
 static inline struct ipc_namespace *copy_ipcs(unsigned long flags,
-   struct ipc_namespace *ns)
+ struct task_struct *tsk)
 {
if (flags & CLONE_NEWIPC)
return ERR_PTR(-EINVAL);
 
-   return ns;
+   return tsk->nsproxy->ipc_ns;
 }
 
 static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
diff --git a/ipc/namespace.c b/ipc/namespace.c
index aa18899..3c3e522 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -15,7 +15,8 @@
 
 #include "util.h"
 
-static struct ipc_namespace *create_ipc_ns(struct ipc_namespace *old_ns)
+static struct ipc_namespace *create_ipc_ns(struct task_struct *tsk,
+  struct ipc_namespace *old_ns)
 {
struct ipc_namespace *ns;
int err;
@@ -44,17 +45,19 @@ static struct ipc_namespace *create_ipc_ns(struct ipc_namespace *old_ns)
ipcns_notify(IPCNS_CREATED);
register_ipcns_notifier(ns);
 
-   ns->user_ns = old_ns->user_ns;
-   get_user_ns(ns->user_ns);
+   ns->user_ns = get_user_ns(task_cred_xxx(tsk, user)->user_ns);
 
return ns;
 }
 
-struct ipc_namespace *copy_ipcs(unsigned long flags, struct ipc_namespace *ns)
+struct ipc_namespace *copy_ipcs(unsigned long flags,
+   struct task_struct *tsk)
 {
+   struct ipc_namespace *ns = tsk->nsproxy->ipc_ns;
+
if (!(flags & CLONE_NEWIPC))
return get_ipc_ns(ns);
-   return create_ipc_ns(ns);
+   return create_ipc_ns(tsk, ns);
 }
 
 /*
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ac8a56e..a05d191 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -75,16 +75,11 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
goto out_uts;
}
 
-   new_nsp->ipc_ns = copy_ipcs(flags, tsk->nsproxy->ipc_ns);
+   new_nsp->ipc_ns = copy_ipcs(flags, tsk);
if (IS_ERR(new_nsp->ipc_ns)) {
err = PTR_ERR(new_nsp->ipc_ns);
goto out_ipc;
}
-   if (new_nsp->ipc_ns != tsk->nsproxy->ipc_ns) {
-   put_user_ns(new_nsp->ipc_ns->user_ns);
-   new_nsp->ipc_ns->user_ns = task_cred_xxx(tsk, user)->user_ns;
-   get_user_ns(new_nsp->ipc_ns->user_ns);
-   }
 
new_nsp->pid_ns = copy_pid_ns(flags, task_active_pid_ns(tsk));
if (IS_ERR(new_nsp->pid_ns)) {
-- 
1.7.0.4
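
The effect of the patch can be modelled in user space roughly as follows:
copy_ipcs() takes the task, looks up the task's current ipc_ns itself, and
a newly created ipc_ns takes its user_ns (with a reference) from the
task's credentials instead of being patched up afterwards by the caller.
The struct layouts and plain integer refcounts are simplifications assumed
for the sketch; only the CLONE_NEWIPC value matches the kernel's.

```c
#include <stddef.h>
#include <stdlib.h>

#define CLONE_NEWIPC 0x08000000	/* same value as the kernel flag */

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct user_namespace { int refcount; };
struct ipc_namespace { int refcount; struct user_namespace *user_ns; };
struct task { struct user_namespace *user_ns; struct ipc_namespace *ipc_ns; };

/* Take a reference and return the namespace, so it fits in an assignment. */
static struct user_namespace *get_user_ns(struct user_namespace *ns)
{
	if (ns)
		ns->refcount++;
	return ns;
}

static struct ipc_namespace *create_ipc_ns(struct task *tsk)
{
	struct ipc_namespace *ns = calloc(1, sizeof(*ns));

	if (!ns)
		return NULL;
	ns->refcount = 1;
	/* The new ipc_ns belongs to the creating task's user namespace. */
	ns->user_ns = get_user_ns(tsk->user_ns);
	return ns;
}

static struct ipc_namespace *copy_ipcs(unsigned long flags, struct task *tsk)
{
	struct ipc_namespace *ns = tsk->ipc_ns;

	if (!(flags & CLONE_NEWIPC)) {
		ns->refcount++;		/* share the existing namespace */
		return ns;
	}
	return create_ipc_ns(tsk);
}
```

Keeping the user_ns assignment inside create_ipc_ns() means the caller in
nsproxy.c no longer needs the put/get fix-up block that the last hunk
removes.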



[Devel] [PATCH 1/4] userns: let clone_uts_ns() handle setting uts->user_ns

2011-02-23 Thread Serge E. Hallyn
To do so we need to pass in the task_struct who'll get the utsname,
so we can get its user_ns.

Changelog:
Feb 23: As per Oleg's comment, just pass in tsk.

Signed-off-by: Serge E. Hallyn 
---
 include/linux/utsname.h |6 +++---
 kernel/nsproxy.c|7 +--
 kernel/utsname.c|   12 +++-
 3 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 85171be..21b4566 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -53,7 +53,7 @@ static inline void get_uts_ns(struct uts_namespace *ns)
 }
 
 extern struct uts_namespace *copy_utsname(unsigned long flags,
-   struct uts_namespace *ns);
+ struct task_struct *tsk);
 extern void free_uts_ns(struct kref *kref);
 
 static inline void put_uts_ns(struct uts_namespace *ns)
@@ -70,12 +70,12 @@ static inline void put_uts_ns(struct uts_namespace *ns)
 }
 
 static inline struct uts_namespace *copy_utsname(unsigned long flags,
-   struct uts_namespace *ns)
+struct task_struct *tsk)
 {
if (flags & CLONE_NEWUTS)
return ERR_PTR(-EINVAL);
 
-   return ns;
+   return tsk->nsproxy->uts_ns;
 }
 #endif
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index b6dbff2..ac8a56e 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -69,16 +69,11 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
goto out_ns;
}
 
-   new_nsp->uts_ns = copy_utsname(flags, tsk->nsproxy->uts_ns);
+   new_nsp->uts_ns = copy_utsname(flags, tsk);
if (IS_ERR(new_nsp->uts_ns)) {
err = PTR_ERR(new_nsp->uts_ns);
goto out_uts;
}
-   if (new_nsp->uts_ns != tsk->nsproxy->uts_ns) {
-   put_user_ns(new_nsp->uts_ns->user_ns);
-   new_nsp->uts_ns->user_ns = task_cred_xxx(tsk, user)->user_ns;
-   get_user_ns(new_nsp->uts_ns->user_ns);
-   }
 
new_nsp->ipc_ns = copy_ipcs(flags, tsk->nsproxy->ipc_ns);
if (IS_ERR(new_nsp->ipc_ns)) {
diff --git a/kernel/utsname.c b/kernel/utsname.c
index a7b3a8d..4464617 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -31,7 +31,8 @@ static struct uts_namespace *create_uts_ns(void)
  * @old_ns: namespace to clone
  * Return NULL on error (failure to kmalloc), new ns otherwise
  */
-static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns)
+static struct uts_namespace *clone_uts_ns(struct task_struct *tsk,
+ struct uts_namespace *old_ns)
 {
struct uts_namespace *ns;
 
@@ -41,8 +42,7 @@ static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns)
 
down_read(&uts_sem);
memcpy(&ns->name, &old_ns->name, sizeof(ns->name));
-   ns->user_ns = old_ns->user_ns;
-   get_user_ns(ns->user_ns);
+   ns->user_ns = get_user_ns(task_cred_xxx(tsk, user)->user_ns);
up_read(&uts_sem);
return ns;
 }
@@ -53,8 +53,10 @@ static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns)
  * utsname of this process won't be seen by parent, and vice
  * versa.
  */
-struct uts_namespace *copy_utsname(unsigned long flags, struct uts_namespace *old_ns)
+struct uts_namespace *copy_utsname(unsigned long flags,
+  struct task_struct *tsk)
 {
+   struct uts_namespace *old_ns = tsk->nsproxy->uts_ns;
struct uts_namespace *new_ns;
 
BUG_ON(!old_ns);
@@ -63,7 +65,7 @@ struct uts_namespace *copy_utsname(unsigned long flags, struct uts_namespace *ol
if (!(flags & CLONE_NEWUTS))
return old_ns;
 
-   new_ns = clone_uts_ns(old_ns);
+   new_ns = clone_uts_ns(tsk, old_ns);
 
put_uts_ns(old_ns);
return new_ns;
-- 
1.7.0.4



[Devel] Re: User namespaces and keys

2011-02-23 Thread Serge E. Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> David Howells  writes:
> 
> > Serge E. Hallyn  wrote:
> >
> >> > I guess we need to look at how to mix keys and namespaces again.
> >> 
> >> From strictly kernel pov, at the moment, keys are strictly usable only
> >> by the user in your own user namespace.
> >
> > I'm not sure that's currently completely true.  Key quota maintenance is
> > namespaced, and the key's owner UID/GID belong to that namespace, so that's
> > okay, but:
> >
> >  (*) key_task_permission() does not distinguish UIDs and GIDs from different
> >  namespaces.
> >
> >  (*) A key can be referred to by its serial number, no matter whose
> >  namespace it is in, and will yield up its given UID/GID, even if these
> >  aren't actually meaningful in your namespace.
> >
> >  This means request_key() can successfully upcall at the moment.
> >
> > I wonder if I should make the following changes:
> >
> >  (1) If the key and the accessor are in different user namespaces, then skip
> >  the UID and GID comparisons in key_task_permission().  That means that
> >  to be able to access the key you'd have to possess the key and the key
> >  would have to grant you Possessor access, or the key would have to grant
> >  you Other access.
> >
> >  (2) If the key and someone viewing the key description are in different
> >  namespaces, then indicate that the UID and the GID are -1, irrespective
> >  of the actual values.
> >
> >  (3) When an upcall is attempting to instantiate a key, it is allowed to
> >  access the keys of requestor using the requestor's credentials (UID, GID,
> >  groups, security label).  Ensure that this will be done in the requestor's
> >  user namespace.
> >
> >  Nothing should need to be done here, since search_process_keyrings()
> >  switches to the requestor's creds.
> >
> > Oh, and are security labels user-namespaced?
> 
> Not at this time.  The user namespace as currently merged is little more
> than a place holder for a proper implementation.  Serge is busily
> fleshing out that proper implementation.
> 
> Until we reach the point where all checks that have historically been
> "if (uid1 == uid2)" become "if ((uidns1 == uidns2) && (uid1 == uid2))"
> there will be problems.
> 
> The security labels and probably lsm's in general need to be per user
> namespace but we simply have not gotten that far.  For the short term I
> will be happy when we get a minimally usable user namespace.
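
The comparison Eric describes, together with David's suggestion (2) of
reporting -1 to viewers in a different namespace, can be sketched in user
space as below.  struct ns_uid and both helper names are invented for
illustration; the kernel has no such type in this patch set.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a user namespace. */
struct user_namespace { int id; };

/* A uid is only meaningful together with the namespace it lives in. */
struct ns_uid {
	struct user_namespace *ns;
	unsigned int uid;
};

/* "if ((uidns1 == uidns2) && (uid1 == uid2))" as a helper. */
static bool same_user(struct ns_uid a, struct ns_uid b)
{
	return a.ns == b.ns && a.uid == b.uid;
}

/* UID as reported to a viewer: -1 when the viewer is in another namespace. */
static unsigned int uid_as_seen_from(struct ns_uid owner,
				     struct user_namespace *viewer_ns)
{
	return owner.ns == viewer_ns ? owner.uid : (unsigned int)-1;
}
```

Every bare "uid1 == uid2" left in the tree is a place where uid 500 in one
namespace is silently treated as uid 500 in another, which is the problem
being described.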

Note also that when Eric brought this up at the LSM mini-conf two or three
years ago, there was pretty general, strong objection to the idea.

Like Eric says, I think that'll have to wait.  In the meantime, isolating
user namespace sandboxes (or containers) using simple LSM configurations
is a very good idea.

-serge


[Devel] Re: User namespaces and keys

2011-02-23 Thread Serge E. Hallyn
Quoting David Howells (dhowe...@redhat.com):
> 
> I guess we need to look at how to mix keys and namespaces again.

From strictly kernel pov, at the moment, keys are strictly usable only
by the user in your own user namespace.

We may want to look at this again, but for now I think that would be a
safe enough default.  Later, we'll probably want the user creating a
child_user_ns to allow his keys to be inherited by the child user_ns.
Though, as I type that, it seems to me that that'll just become a
maintenance pain, and it's just plain safer to have the user re-enter
his keys, sharing them over a file if needed.

I'm going to not consider the TPM at the moment :)

> Possibly the trickiest problem with keys is how to upcall key construction to
> /sbin/request-key when the keys may be of a different user namespace.

Hm, jinkeys, yes.

-serge


[Devel] Re: [PATCH 2/9] security: Make capabilities relative to the user namespace.

2011-02-23 Thread Serge E. Hallyn
Quoting David Howells (dhowe...@redhat.com):
> David Howells  wrote:
> 
> > >   int (*capable) (struct task_struct *tsk, const struct cred *cred,
> > > - int cap, int audit);
> > > + struct user_namespace *ns, int cap, int audit);
> > 
> > Hmmm...  A chunk of the contents of the cred struct are user-namespaced.
> > Could you add the user_namespace pointer to the cred struct and thus avoid
> > passing it as an argument to other things.
> 
> Ah, no...  Ignore that, I think I see that you do need it.

Thanks for reviewing, David.

> > +int cap_capable(struct task_struct *tsk, const struct cred *cred,
> > +   struct user_namespace *targ_ns, int cap, int audit)
> >  {
> > -   return cap_raised(cred->cap_effective, cap) ? 0 : -EPERM;
> > +   for (;;) {
> > +   /* The creator of the user namespace has all caps. */
> > +   if (targ_ns != &init_user_ns && targ_ns->creator == cred->user)
> > +   return 0;
> 
> Why is that last comment so?  Why should the creating namespace sport all
> possible capabilities?  Do you have to have all capabilities available to you
> to be permitted to create a new user namespace?

It's not the creating namespace, but the creating user, which has all caps.

So for instance, if uid 500 in init_user_ns creates a namespace, then:
  a. uid 500 in init_user_ns has all caps to the child user_ns, so it can
 kill the tasks in the userns, clean up, etc.
  b. uid X in any other child user_ns has no caps to the child user_ns.
  c. root in init_user_ns has whatever capabilities are in his pE to the
 child user_ns.  Again, this is so that the admin in any user_ns can
 clean up any messes made by users in his user_ns.

One of the goals of user namespaces is to make it safe to allow
unprivileged users to create child user namespaces in which they have
targeted privilege.  Anything which happens in that child user namespace
should be:

  a. constrained to resources which the user can control anyway
  b. able to be cleaned up by the user
  c. (especially) able to be cleaned up by the privileged user in the
 parent user_ns.

> Also, would it be worth having a separate cap_ns_capable()?  Wouldn't most
> calls to cap_capable() only be checking the caps granted in the current user
> namespace?

Hm.  There is nsown_capable(), which is targeted to current_user_ns(), but
that still needs to enable the caps for the privileged ancestors as
described above.

thanks,
-serge
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 1/4] userns: let clone_uts_ns() handle setting uts->user_ns

2011-02-21 Thread Serge E. Hallyn
Quoting Oleg Nesterov (o...@redhat.com):
> On 02/21, Daniel Lezcano wrote:
> >
> > On 02/21/2011 05:01 AM, Serge E. Hallyn wrote:
> >> To do so we need to pass in the task_struct who'll get the utsname,
> >> so we can get its user_ns.
> >>
> >> -extern struct uts_namespace *copy_utsname(unsigned long flags,
> >> -  struct uts_namespace *ns);
> >> +extern struct uts_namespace *copy_utsname(struct task_struct *tsk,
> >> +unsigned long flags,
> >> +struct uts_namespace *ns);
> >
> > Why don't we pass 'user_ns' instead of 'tsk'? That would look
> > semantically clearer for the caller, no?
> > (example below).
> > ...
> >
> > new_nsp->uts_ns = copy_utsname(flags, tsk->nsproxy->uts_ns, 
> > task_cred_xxx(tsk, user)->user_ns);
> 
> To me tsk looks more readable, I mean
> 
>   new_nsp->uts_ns = copy_utsname(flags, tsk);
> 
> copy_utsname() can find both uts_ns and user_ns looking at task_struct.

Uh, yeah.  I should remove the 'ns' argument there, shouldn't I?

Daniel, does that sway your opinion then?

thanks,
-serge


[Devel] [PATCH 4/4] userns: uts and ipc: fix checkpatch warning

2011-02-20 Thread Serge E. Hallyn
As pointed out by Andrew Morton (and checkpatch), init/version.c
(and ipc/msgutil.c) should not have an extern declaration for
init_user_ns.  Instead, move those to ipc_namespace.h and utsname.h.

Signed-off-by: Serge E. Hallyn 
---
 include/linux/ipc_namespace.h |3 ++-
 include/linux/utsname.h   |1 +
 init/version.c|1 -
 ipc/msgutil.c |2 --
 4 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 9974429..ebd4a93 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -15,6 +15,8 @@
 
 #define IPCNS_CALLBACK_PRI 0
 
+struct user_namespace;
+extern struct user_namespace init_user_ns;
 
 struct ipc_ids {
int in_use;
@@ -24,7 +26,6 @@ struct ipc_ids {
struct idr ipcs_idr;
 };
 
-struct user_namespace;
 struct ipc_namespace {
atomic_t count;
struct ipc_ids  ids[3];
diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 165b17b..69957ca 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -38,6 +38,7 @@ struct new_utsname {
 #include 
 
 struct user_namespace;
+extern struct user_namespace init_user_ns;
 
 struct uts_namespace {
struct kref kref;
diff --git a/init/version.c b/init/version.c
index 97bb86f..86fe0cc 100644
--- a/init/version.c
+++ b/init/version.c
@@ -21,7 +21,6 @@ extern int version_string(LINUX_VERSION_CODE);
 int version_string(LINUX_VERSION_CODE);
 #endif
 
-extern struct user_namespace init_user_ns;
 struct uts_namespace init_uts_ns = {
.kref = {
.refcount   = ATOMIC_INIT(2),
diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index d91ff4b..8b5ce5d 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -20,8 +20,6 @@
 
 DEFINE_SPINLOCK(mq_lock);
 
-extern struct user_namespace init_user_ns;
-
 /*
  * The next 2 defines are here bc this is the only file
  * compiled when either CONFIG_SYSVIPC and CONFIG_POSIX_MQUEUE
-- 
1.7.0.4



[Devel] [PATCH 3/4] Add the required user_ns parameter to security_capable

2011-02-20 Thread Serge E. Hallyn
Fixes a compile failure.

Signed-off-by: Serge E. Hallyn 
---
 drivers/pci/pci-sysfs.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index ea25e5b..90a6b04 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -369,7 +369,7 @@ pci_read_config(struct file *filp, struct kobject *kobj,
u8 *data = (u8*) buf;
 
/* Several chips lock up trying to read undefined config space */
-   if (security_capable(filp->f_cred, CAP_SYS_ADMIN) == 0) {
+   if (security_capable(&init_user_ns, filp->f_cred, CAP_SYS_ADMIN) == 0) {
size = dev->cfg_size;
} else if (dev->hdr_type == PCI_HEADER_TYPE_CARDBUS) {
size = 128;
-- 
1.7.0.4



[Devel] [PATCH 2/4] userns: let copy_ipcs handle setting ipc_ns->user_ns

2011-02-20 Thread Serge E. Hallyn
To do that, we have to pass in the task_struct of the task which
will own the ipc_ns, so we can assign its user_ns.

Signed-off-by: Serge E. Hallyn 
---
 include/linux/ipc_namespace.h |8 +---
 ipc/namespace.c   |   12 +++-
 kernel/nsproxy.c  |7 +--
 3 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 46d2eb4..9974429 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -92,7 +92,8 @@ static inline int mq_init_ns(struct ipc_namespace *ns) { 
return 0; }
 #endif
 
 #if defined(CONFIG_IPC_NS)
-extern struct ipc_namespace *copy_ipcs(unsigned long flags,
+extern struct ipc_namespace *copy_ipcs(struct task_struct *tsk,
+  unsigned long flags,
   struct ipc_namespace *ns);
 static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
 {
@@ -103,8 +104,9 @@ static inline struct ipc_namespace *get_ipc_ns(struct 
ipc_namespace *ns)
 
 extern void put_ipc_ns(struct ipc_namespace *ns);
 #else
-static inline struct ipc_namespace *copy_ipcs(unsigned long flags,
-   struct ipc_namespace *ns)
+static inline struct ipc_namespace *copy_ipcs(struct task_struct *tsk,
+ unsigned long flags,
+ struct ipc_namespace *ns)
 {
if (flags & CLONE_NEWIPC)
return ERR_PTR(-EINVAL);
diff --git a/ipc/namespace.c b/ipc/namespace.c
index aa18899..ee84882 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -15,7 +15,8 @@
 
 #include "util.h"
 
-static struct ipc_namespace *create_ipc_ns(struct ipc_namespace *old_ns)
+static struct ipc_namespace *create_ipc_ns(struct task_struct *tsk,
+  struct ipc_namespace *old_ns)
 {
struct ipc_namespace *ns;
int err;
@@ -44,17 +45,18 @@ static struct ipc_namespace *create_ipc_ns(struct 
ipc_namespace *old_ns)
ipcns_notify(IPCNS_CREATED);
register_ipcns_notifier(ns);
 
-   ns->user_ns = old_ns->user_ns;
-   get_user_ns(ns->user_ns);
+   ns->user_ns = get_user_ns(task_cred_xxx(tsk, user)->user_ns);
 
return ns;
 }
 
-struct ipc_namespace *copy_ipcs(unsigned long flags, struct ipc_namespace *ns)
+struct ipc_namespace *copy_ipcs(struct task_struct *tsk,
+   unsigned long flags,
+   struct ipc_namespace *ns)
 {
if (!(flags & CLONE_NEWIPC))
return get_ipc_ns(ns);
-   return create_ipc_ns(ns);
+   return create_ipc_ns(tsk, ns);
 }
 
 /*
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ffa6b67..b905ecc 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -75,16 +75,11 @@ static struct nsproxy *create_new_namespaces(unsigned long 
flags,
goto out_uts;
}
 
-   new_nsp->ipc_ns = copy_ipcs(flags, tsk->nsproxy->ipc_ns);
+   new_nsp->ipc_ns = copy_ipcs(tsk, flags, tsk->nsproxy->ipc_ns);
if (IS_ERR(new_nsp->ipc_ns)) {
err = PTR_ERR(new_nsp->ipc_ns);
goto out_ipc;
}
-   if (new_nsp->ipc_ns != tsk->nsproxy->ipc_ns) {
-   put_user_ns(new_nsp->ipc_ns->user_ns);
-   new_nsp->ipc_ns->user_ns = task_cred_xxx(tsk, user)->user_ns;
-   get_user_ns(new_nsp->ipc_ns->user_ns);
-   }
 
new_nsp->pid_ns = copy_pid_ns(flags, task_active_pid_ns(tsk));
if (IS_ERR(new_nsp->pid_ns)) {
-- 
1.7.0.4



[Devel] [PATCH 1/4] userns: let clone_uts_ns() handle setting uts->user_ns

2011-02-20 Thread Serge E. Hallyn
To do so we need to pass in the task_struct who'll get the utsname,
so we can get its user_ns.
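
As a rough illustration of the idea (a userspace sketch, not the
kernel code): the clone helper takes the task and pins the task's
user namespace itself, so the caller no longer has to fix up the
reference afterwards.  All demo_* names below are hypothetical, and
the refcount is a plain int rather than a kref.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for user_namespace, task_struct, uts_namespace. */
struct demo_user_ns { int refcount; };
struct demo_task    { struct demo_user_ns *user_ns; };
struct demo_uts_ns  { struct demo_user_ns *user_ns; char nodename[65]; };

/* Take a reference and hand the pointer back, so it can be assigned inline. */
struct demo_user_ns *demo_get_user_ns(struct demo_user_ns *ns)
{
    if (ns)
        ns->refcount++;
    return ns;
}

/* Copy the name from old_ns, but take the owner from the task. */
struct demo_uts_ns *demo_clone_uts_ns(const struct demo_task *tsk,
                                      const struct demo_uts_ns *old_ns)
{
    struct demo_uts_ns *ns;

    assert(tsk != NULL && old_ns != NULL);
    ns = malloc(sizeof(*ns));
    if (!ns)
        return NULL;
    memcpy(ns->nodename, old_ns->nodename, sizeof(ns->nodename));
    /* The new utsname belongs to the task's user namespace, not old_ns's. */
    ns->user_ns = demo_get_user_ns(tsk->user_ns);
    return ns;
}
```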

Signed-off-by: Serge E. Hallyn 
---
 include/linux/utsname.h |   10 ++
 kernel/nsproxy.c|7 +--
 kernel/utsname.c|   12 +++-
 3 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 85171be..165b17b 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -52,8 +52,9 @@ static inline void get_uts_ns(struct uts_namespace *ns)
kref_get(&ns->kref);
 }
 
-extern struct uts_namespace *copy_utsname(unsigned long flags,
-   struct uts_namespace *ns);
+extern struct uts_namespace *copy_utsname(struct task_struct *tsk,
+ unsigned long flags,
+ struct uts_namespace *ns);
 extern void free_uts_ns(struct kref *kref);
 
 static inline void put_uts_ns(struct uts_namespace *ns)
@@ -69,8 +70,9 @@ static inline void put_uts_ns(struct uts_namespace *ns)
 {
 }
 
-static inline struct uts_namespace *copy_utsname(unsigned long flags,
-   struct uts_namespace *ns)
+static inline struct uts_namespace *copy_utsname(struct task_struct *tsk,
+unsigned long flags,
+struct uts_namespace *ns)
 {
if (flags & CLONE_NEWUTS)
return ERR_PTR(-EINVAL);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index b6dbff2..ffa6b67 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -69,16 +69,11 @@ static struct nsproxy *create_new_namespaces(unsigned long 
flags,
goto out_ns;
}
 
-   new_nsp->uts_ns = copy_utsname(flags, tsk->nsproxy->uts_ns);
+   new_nsp->uts_ns = copy_utsname(tsk, flags, tsk->nsproxy->uts_ns);
if (IS_ERR(new_nsp->uts_ns)) {
err = PTR_ERR(new_nsp->uts_ns);
goto out_uts;
}
-   if (new_nsp->uts_ns != tsk->nsproxy->uts_ns) {
-   put_user_ns(new_nsp->uts_ns->user_ns);
-   new_nsp->uts_ns->user_ns = task_cred_xxx(tsk, user)->user_ns;
-   get_user_ns(new_nsp->uts_ns->user_ns);
-   }
 
new_nsp->ipc_ns = copy_ipcs(flags, tsk->nsproxy->ipc_ns);
if (IS_ERR(new_nsp->ipc_ns)) {
diff --git a/kernel/utsname.c b/kernel/utsname.c
index a7b3a8d..9462580 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -31,7 +31,8 @@ static struct uts_namespace *create_uts_ns(void)
  * @old_ns: namespace to clone
  * Return NULL on error (failure to kmalloc), new ns otherwise
  */
-static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns)
+static struct uts_namespace *clone_uts_ns(struct task_struct *tsk,
+ struct uts_namespace *old_ns)
 {
struct uts_namespace *ns;
 
@@ -41,8 +42,7 @@ static struct uts_namespace *clone_uts_ns(struct 
uts_namespace *old_ns)
 
down_read(&uts_sem);
memcpy(&ns->name, &old_ns->name, sizeof(ns->name));
-   ns->user_ns = old_ns->user_ns;
-   get_user_ns(ns->user_ns);
+   ns->user_ns = get_user_ns(task_cred_xxx(tsk, user)->user_ns);
up_read(&uts_sem);
return ns;
 }
@@ -53,7 +53,9 @@ static struct uts_namespace *clone_uts_ns(struct 
uts_namespace *old_ns)
  * utsname of this process won't be seen by parent, and vice
  * versa.
  */
-struct uts_namespace *copy_utsname(unsigned long flags, struct uts_namespace 
*old_ns)
+struct uts_namespace *copy_utsname(struct task_struct *tsk,
+  unsigned long flags,
+  struct uts_namespace *old_ns)
 {
struct uts_namespace *new_ns;
 
@@ -63,7 +65,7 @@ struct uts_namespace *copy_utsname(unsigned long flags, 
struct uts_namespace *ol
if (!(flags & CLONE_NEWUTS))
return old_ns;
 
-   new_ns = clone_uts_ns(old_ns);
+   new_ns = clone_uts_ns(tsk, old_ns);
 
put_uts_ns(old_ns);
return new_ns;
-- 
1.7.0.4



[Devel] Re: [PATCH 1/1][3rd resend] sys_unshare: remove the dead CLONE_THREAD/SIGHAND/VM code

2011-02-20 Thread Serge E. Hallyn
Quoting Oleg Nesterov (o...@redhat.com):
> Cleanup: kill the dead code which does nothing but complicates the code
> and confuses the reader.
> 
> sys_unshare(CLONE_THREAD/SIGHAND/VM) is not really implemented, and I doubt
> very much it will ever work. At least, nobody even tried since the original
> "unshare system call -v5: system call handler function" commit
> 99d1419d96d7df9cfa56bc977810be831bd5ef64 was applied more than 4 years ago.
> 
> And the code is not consistent. unshare_thread() always fails unconditionally,
> while unshare_sighand() and unshare_vm() pretend to work if there is nothing
> to unshare.
> 
> Remove unshare_thread(), unshare_sighand(), unshare_vm() helpers and related
> variables and add a simple CLONE_THREAD | CLONE_SIGHAND| CLONE_VM check into
> check_unshare_flags().
> 
> Also, move the "CLONE_NEWNS needs CLONE_FS" check from check_unshare_flags()
> to sys_unshare(). This looks more consistent and matches the similar
> do_sysvsem check in sys_unshare().
> 
> Note: with or without this patch "atomic_read(mm->mm_users) > 1" can give
> a false positive due to get_task_mm().
> 
> Signed-off-by: Oleg Nesterov 
> Acked-by: Roland McGrath 

Yes, please.

Acked-by: Serge Hallyn 

thanks,
-serge
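
For readers skimming the thread, the simplified flag check can be
sketched in userspace roughly as follows.  This is only an
illustration: the DEMO_* bits are made-up values (the real CLONE_*
constants live in <sched.h>), and mm_users is passed in as a plain
integer instead of being read from current->mm.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative flag bits only; not the real CLONE_* values. */
enum {
    DEMO_CLONE_VM      = 1 << 0,
    DEMO_CLONE_FS      = 1 << 1,
    DEMO_CLONE_FILES   = 1 << 2,
    DEMO_CLONE_SIGHAND = 1 << 3,
    DEMO_CLONE_THREAD  = 1 << 4,
    DEMO_CLONE_NEWNS   = 1 << 5,
};

/* Returns 0 if the flags are acceptable, -EINVAL otherwise. */
int demo_check_unshare_flags(unsigned long flags, int mm_users)
{
    const unsigned long known = DEMO_CLONE_VM | DEMO_CLONE_FS |
                                DEMO_CLONE_FILES | DEMO_CLONE_SIGHAND |
                                DEMO_CLONE_THREAD | DEMO_CLONE_NEWNS;
    const unsigned long unimplemented = DEMO_CLONE_THREAD |
                                        DEMO_CLONE_SIGHAND | DEMO_CLONE_VM;

    assert(mm_users >= 1);
    /* Reject any flag this sketch does not know about. */
    if (flags & ~known)
        return -EINVAL;
    /* Not implemented: only succeed when there is nothing to unshare. */
    if ((flags & unimplemented) && mm_users > 1)
        return -EINVAL;
    return 0;
}
```

A single-threaded caller (mm_users == 1) asking for DEMO_CLONE_VM gets
0, since nothing is shared; the same request with mm_users == 2 gets
-EINVAL.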


> ---
> 
>  kernel/fork.c |  123 
> +++---
>  1 file changed, 25 insertions(+), 98 deletions(-)
> 
> --- 2.6.37/kernel/fork.c~unshare-killcrap 2010-11-05 18:03:28.0 
> +0100
> +++ 2.6.37/kernel/fork.c  2010-11-05 18:09:52.0 +0100
> @@ -1522,38 +1522,24 @@ void __init proc_caches_init(void)
>  }
>  
>  /*
> - * Check constraints on flags passed to the unshare system call and
> - * force unsharing of additional process context as appropriate.
> + * Check constraints on flags passed to the unshare system call.
>   */
> -static void check_unshare_flags(unsigned long *flags_ptr)
> +static int check_unshare_flags(unsigned long unshare_flags)
>  {
> + if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
> + CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
> + CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET))
> + return -EINVAL;
>   /*
> -  * If unsharing a thread from a thread group, must also
> -  * unshare vm.
> -  */
> - if (*flags_ptr & CLONE_THREAD)
> - *flags_ptr |= CLONE_VM;
> -
> - /*
> -  * If unsharing vm, must also unshare signal handlers.
> -  */
> - if (*flags_ptr & CLONE_VM)
> - *flags_ptr |= CLONE_SIGHAND;
> -
> - /*
> -  * If unsharing namespace, must also unshare filesystem information.
> +  * Not implemented, but pretend it works if there is nothing to
> +  * unshare. Note that unsharing CLONE_THREAD or CLONE_SIGHAND
> +  * needs to unshare vm.
>*/
> - if (*flags_ptr & CLONE_NEWNS)
> - *flags_ptr |= CLONE_FS;
> -}
> -
> -/*
> - * Unsharing of tasks created with CLONE_THREAD is not supported yet
> - */
> -static int unshare_thread(unsigned long unshare_flags)
> -{
> - if (unshare_flags & CLONE_THREAD)
> - return -EINVAL;
> + if (unshare_flags & (CLONE_THREAD | CLONE_SIGHAND | CLONE_VM)) {
> + /* FIXME: get_task_mm() increments ->mm_users */
> +         if (atomic_read(&current->mm->mm_users) > 1)
> + return -EINVAL;
> + }
>  
>   return 0;
>  }
> @@ -1580,34 +1566,6 @@ static int unshare_fs(unsigned long unsh
>  }
>  
>  /*
> - * Unsharing of sighand is not supported yet
> - */
> -static int unshare_sighand(unsigned long unshare_flags, struct 
> sighand_struct **new_sighp)
> -{
> - struct sighand_struct *sigh = current->sighand;
> -
> - if ((unshare_flags & CLONE_SIGHAND) && atomic_read(&sigh->count) > 1)
> - return -EINVAL;
> - else
> - return 0;
> -}
> -
> -/*
> - * Unshare vm if it is being shared
> - */
> -static int unshare_vm(unsigned long unshare_flags, struct mm_struct 
> **new_mmp)
> -{
> - struct mm_struct *mm = current->mm;
> -
> - if ((unshare_flags & CLONE_VM) &&
> - (mm && atomic_read(&mm->mm_users) > 1)) {
> - return -EINVAL;
> - }
> -
> - return 0;
> -}
> -
> -/*
>   * Unshare file descriptor table if it is being shared
>   */
>  static int unshare_fd(unsigned long unshare_flags, struct files_struct 
> **new_fdp)
> @@ -1635,45 +1593,37 @@ static int unshare_fd(unsigned long unsh
>   */
>  SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
>  {
> - int err = 0;
>   struct fs_struct *fs, *new_fs = NULL;
> - struct sighand_struct *new_sigh = NULL;
> - struct mm_struct *mm, *new_mm = NULL, *active_mm = NULL;
>   struct files_struct *fd, *new_fd = NULL;
>   struct nsproxy *new_nsproxy = NULL;
>   int do_sysvsem = 0;
> + int err;
>  
> - check_unshare_flags(&unshare_flags);
> -
> - /* Return -EINVAL for all unsupported flags */
> - err = -EINVAL;
> - i

[Devel] Re: [PATCH] Reduce uidhash lock hold time when lookup succeeds

2011-02-18 Thread Serge E. Hallyn
Quoting Matt Helsley (matth...@us.ibm.com):
> When lookup succeeds we don't need the "new" user struct which hasn't
> been linked into the uidhash. So we can immediately drop the lock and
> then free "new" rather than free it with the lock held.
> 
> Signed-off-by: Matt Helsley 
> Cc: David Howells 
> Cc: Pavel Emelyanov 
> Cc: Alexey Dobriyan 
> Cc: "Serge E. Hallyn" 

Acked-by: Serge E. Hallyn 

And might I say that the label 'out_unlock' in that function is
sadly named :)
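
The pattern in the patch, reduced to a userspace sketch (an
illustration under assumptions, not the kernel code): do the lookup or
insert under the lock, drop the lock, and only then free the unused
candidate.  The demo_* names are hypothetical, a C11 atomic_flag
stands in for uidhash_lock, and unlike alloc_uid() this sketch always
allocates the candidate up front.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_user {
    int uid;
    struct demo_user *next;
};

static atomic_flag demo_lock = ATOMIC_FLAG_INIT;  /* stand-in spinlock */
static struct demo_user *demo_head;               /* one-bucket "hash" */

static void demo_lock_table(void)
{
    while (atomic_flag_test_and_set(&demo_lock))
        ;  /* spin */
}

static void demo_unlock_table(void)
{
    atomic_flag_clear(&demo_lock);
}

/* Caller must hold the lock. */
static struct demo_user *demo_find(int uid)
{
    struct demo_user *u;

    for (u = demo_head; u != NULL; u = u->next)
        if (u->uid == uid)
            return u;
    return NULL;
}

struct demo_user *demo_alloc_uid(int uid)
{
    struct demo_user *fresh, *up;

    assert(uid >= 0);
    fresh = malloc(sizeof(*fresh));
    if (!fresh)
        return NULL;
    fresh->uid = uid;
    fresh->next = NULL;

    demo_lock_table();
    up = demo_find(uid);
    if (!up) {
        /* Lookup failed: link the candidate in while still locked. */
        fresh->next = demo_head;
        demo_head = fresh;
        up = fresh;
    }
    demo_unlock_table();

    /* Lookup succeeded: the candidate was never linked, so free it unlocked. */
    if (up != fresh)
        free(fresh);
    return up;
}
```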

> Cc: contain...@lists.linux-foundation.org
> ---
>  kernel/user.c |   12 +++-
>  1 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/user.c b/kernel/user.c
> index 5c598ca..4ea8e58 100644
> --- a/kernel/user.c
> +++ b/kernel/user.c
> @@ -157,16 +157,18 @@ struct user_struct *alloc_uid(struct user_namespace 
> *ns, uid_t uid)
>*/
>   spin_lock_irq(&uidhash_lock);
>   up = uid_hash_find(uid, hashent);
> - if (up) {
> + if (!up) {
> + uid_hash_insert(new, hashent);
> + up = new;
> + }
> + spin_unlock_irq(&uidhash_lock);
> +
> + if (up != new) {
>   put_user_ns(ns);
>   key_put(new->uid_keyring);
>   key_put(new->session_keyring);
>   kmem_cache_free(uid_cachep, new);
> - } else {
> - uid_hash_insert(new, hashent);
> - up = new;
>   }
> - spin_unlock_irq(&uidhash_lock);
>   }
>  
>   return up;
> -- 
> 1.6.3.3
> 
> ___
> Containers mailing list
> contain...@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers


[Devel] Re: [PATCH 5/9] Allow ptrace from non-init user namespaces

2011-02-17 Thread Serge E. Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> "Serge E. Hallyn"  writes:
> 
> > ptrace is allowed to tasks in the same user namespace according to
> > the usual rules (i.e. the same rules as for two tasks in the init
> > user namespace).  ptrace is also allowed to a user namespace to
> > which the current task has the CAP_SYS_PTRACE capability.
> 
> 
> I don't see how it can go wrong at the moment, but
> same_or_ancestor_user_ns is too permissive and potentially inefficient.
> Can you please replace it with a simple user namespace equality check?
> 
> Eric
> 
> 
> > Changelog:
> > Dec 31: Address feedback by Eric:
> > . Correct ptrace uid check
> > . Rename may_ptrace_ns to ptrace_capable
> > . Also fix the cap_ptrace checks.
> > Jan  1: Use const cred struct
> > Jan 11: use task_ns_capable() in place of ptrace_capable().
> >
> > Signed-off-by: Serge E. Hallyn 
> > ---
> >  include/linux/capability.h |2 +
> >  include/linux/user_namespace.h |9 +++
> >  kernel/ptrace.c|   27 --
> >  kernel/user_namespace.c|   16 +
> >  security/commoncap.c   |   48 
> > +--
> >  5 files changed, 82 insertions(+), 20 deletions(-)
> >
> > diff --git a/include/linux/capability.h b/include/linux/capability.h
> > index cb3d2d9..bc0f262 100644
> > --- a/include/linux/capability.h
> > +++ b/include/linux/capability.h
> > @@ -546,6 +546,8 @@ extern const kernel_cap_t __cap_init_eff_set;
> >   */
> >  #define has_capability(t, cap) (security_real_capable((t), &init_user_ns, 
> > (cap)) == 0)
> >  
> > +#define has_ns_capability(t, ns, cap) (security_real_capable((t), (ns), 
> > (cap)) == 0)
> > +
> >  /**
> >   * has_capability_noaudit - Determine if a task has a superior capability 
> > available (unaudited)
> >   * @t: The task in question
> > diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> > index faf4679..862fc59 100644
> > --- a/include/linux/user_namespace.h
> > +++ b/include/linux/user_namespace.h
> > @@ -39,6 +39,9 @@ static inline void put_user_ns(struct user_namespace *ns)
> >  uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, 
> > uid_t uid);
> >  gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, 
> > gid_t gid);
> >  
> > +int same_or_ancestor_user_ns(struct task_struct *task,
> > +   struct task_struct *victim);
> > +
> >  #else
> >  
> >  static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
> > @@ -66,6 +69,12 @@ static inline gid_t user_ns_map_gid(struct 
> > user_namespace *to,
> > return gid;
> >  }
> >  
> > +static inline int same_or_ancestor_user_ns(struct task_struct *task,
> > +   struct task_struct *victim)
> > +{
> > +   return 1;
> > +}
> > +
> >  #endif
> >  
> >  #endif /* _LINUX_USER_H */
> > diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> > index 1708b1e..cde4655 100644
> > --- a/kernel/ptrace.c
> > +++ b/kernel/ptrace.c
> > @@ -134,21 +134,24 @@ int __ptrace_may_access(struct task_struct *task, 
> > unsigned int mode)
> > return 0;
> > rcu_read_lock();
> > tcred = __task_cred(task);
> > -   if ((cred->uid != tcred->euid ||
> > -cred->uid != tcred->suid ||
> > -cred->uid != tcred->uid  ||
> > -cred->gid != tcred->egid ||
> > -cred->gid != tcred->sgid ||
> > -cred->gid != tcred->gid) &&
> > -   !capable(CAP_SYS_PTRACE)) {
> > -   rcu_read_unlock();
> > -   return -EPERM;
> > -   }
> > +   if (cred->user->user_ns == tcred->user->user_ns &&
> > +   (cred->uid == tcred->euid &&
> > +cred->uid == tcred->suid &&
> > +cred->uid == tcred->uid  &&
> > +cred->gid == tcred->egid &&
> > +cred->gid == tcred->sgid &&
> > +cred->gid == tcred->gid))
> > +   goto ok;
> > +   if (ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE))
> > +   goto ok;
> > +   rcu_read_unlock();
> > +   return -EPERM;
> > +ok:
> > rcu_read_unlock();
> > smp_rmb();
> > if (task->mm)
> > dumpable = 

[Devel] Re: userns: targeted capabilities v5

2011-02-17 Thread Serge E. Hallyn
Quoting Andrew Morton (a...@linux-foundation.org):
> On Thu, 17 Feb 2011 15:02:24 +
> "Serge E. Hallyn"  wrote:
> 
> > Here is a repost of my previous user namespace patch, ported onto
> > last night's git head.
> > 
> > It fixes several things I was doing wrong in the last (v4)
> > posting, in particular:
> > 
> > 1. don't set uts_ns->user_ns to current's when !CLONE_NEWUTS
> > 2. add an ipc_ns->user_ns which owns ipc_ns, and use that to
> >    decide CAP_IPC_OWNER
> > 3. fix logic flaw caused by bad parentheses
> > 4. allow do_prlimit to current
> > 5. don't always give root full privs to init_user_ns
> > 
> > The expected course of development for user namespaces is laid out
> > at https://wiki.ubuntu.com/UserNamespace.
> 
> Seems like a nice feature to be developing.
> 
> I worry about the maturity of it all at this stage.  How far along is
> it *really*?
> 
> Is anyone else working with you on developing and reviewing this work?

Thanks, Andrew.  I'm not sure what definition of 'maturity' you were
looking for here.  If you meant completeness of the feature, it's
definitely not there.  Of the goals for user namespaces, sandboxing
will be the quickest to mature.  Completing that will largely be an
exercise in walking the breadth of the kernel looking for simple
uid/gid comparisons and making them namespace aware.

The design has been discussed (publicly) on and off for many
years by Eric and me.  This particular patchset has gotten some great
reviews from Eric Biederman and Bastian Blank (to whom, unfortunately,
to this day I cannot send a direct email - they're always bounced).

As Eric said, this feature will have to go in incrementally.
Furthermore, each piece touches scary code so it's likely to go
pretty slowly.  My hope is less than a year for sandboxing, and
two years for containers.  It might go way faster, but experience
tells me that's unlikely  :)

thanks,
-serge


[Devel] [PATCH 5/9] Allow ptrace from non-init user namespaces

2011-02-17 Thread Serge E. Hallyn
ptrace is allowed to tasks in the same user namespace according to
the usual rules (i.e. the same rules as for two tasks in the init
user namespace).  ptrace is also allowed to a user namespace to
which the current task has the CAP_SYS_PTRACE capability.
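
As a rough userspace illustration of that rule (not the kernel code):
uids and gids are compared only when both tasks are in the same user
namespace; otherwise a capability targeted at the victim's namespace
is required.  struct demo_cred and demo_ptrace_may_access() are
hypothetical, and the capability lookup is reduced to a boolean the
caller supplies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical credentials; user_ns_id stands in for a user_ns pointer. */
struct demo_cred {
    int user_ns_id;
    unsigned int uid, euid, suid;
    unsigned int gid, egid, sgid;
};

/*
 * cap_in_target_ns: whether the tracer holds CAP_SYS_PTRACE targeted at
 * the target's user namespace (assumed to be computed elsewhere).
 */
bool demo_ptrace_may_access(const struct demo_cred *tracer,
                            const struct demo_cred *target,
                            bool cap_in_target_ns)
{
    assert(tracer != NULL && target != NULL);
    /* Ids are only comparable inside one user namespace. */
    if (tracer->user_ns_id == target->user_ns_id &&
        tracer->uid == target->euid &&
        tracer->uid == target->suid &&
        tracer->uid == target->uid &&
        tracer->gid == target->egid &&
        tracer->gid == target->sgid &&
        tracer->gid == target->gid)
        return true;
    /* Otherwise only the targeted capability can grant access. */
    return cap_in_target_ns;
}
```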

Changelog:
Dec 31: Address feedback by Eric:
. Correct ptrace uid check
. Rename may_ptrace_ns to ptrace_capable
. Also fix the cap_ptrace checks.
Jan  1: Use const cred struct
Jan 11: use task_ns_capable() in place of ptrace_capable().

Signed-off-by: Serge E. Hallyn 
---
 include/linux/capability.h |2 +
 include/linux/user_namespace.h |9 +++
 kernel/ptrace.c|   27 --
 kernel/user_namespace.c|   16 +
 security/commoncap.c   |   48 +--
 5 files changed, 82 insertions(+), 20 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index cb3d2d9..bc0f262 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -546,6 +546,8 @@ extern const kernel_cap_t __cap_init_eff_set;
  */
 #define has_capability(t, cap) (security_real_capable((t), &init_user_ns, 
(cap)) == 0)
 
+#define has_ns_capability(t, ns, cap) (security_real_capable((t), (ns), (cap)) 
== 0)
+
 /**
  * has_capability_noaudit - Determine if a task has a superior capability 
available (unaudited)
  * @t: The task in question
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index faf4679..862fc59 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -39,6 +39,9 @@ static inline void put_user_ns(struct user_namespace *ns)
 uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, 
uid_t uid);
 gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, 
gid_t gid);
 
+int same_or_ancestor_user_ns(struct task_struct *task,
+   struct task_struct *victim);
+
 #else
 
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
@@ -66,6 +69,12 @@ static inline gid_t user_ns_map_gid(struct user_namespace 
*to,
return gid;
 }
 
+static inline int same_or_ancestor_user_ns(struct task_struct *task,
+   struct task_struct *victim)
+{
+   return 1;
+}
+
 #endif
 
 #endif /* _LINUX_USER_H */
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 1708b1e..cde4655 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -134,21 +134,24 @@ int __ptrace_may_access(struct task_struct *task, 
unsigned int mode)
return 0;
rcu_read_lock();
tcred = __task_cred(task);
-   if ((cred->uid != tcred->euid ||
-cred->uid != tcred->suid ||
-cred->uid != tcred->uid  ||
-cred->gid != tcred->egid ||
-cred->gid != tcred->sgid ||
-cred->gid != tcred->gid) &&
-   !capable(CAP_SYS_PTRACE)) {
-   rcu_read_unlock();
-   return -EPERM;
-   }
+   if (cred->user->user_ns == tcred->user->user_ns &&
+   (cred->uid == tcred->euid &&
+cred->uid == tcred->suid &&
+cred->uid == tcred->uid  &&
+cred->gid == tcred->egid &&
+cred->gid == tcred->sgid &&
+cred->gid == tcred->gid))
+   goto ok;
+   if (ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE))
+   goto ok;
+   rcu_read_unlock();
+   return -EPERM;
+ok:
rcu_read_unlock();
smp_rmb();
if (task->mm)
dumpable = get_dumpable(task->mm);
-   if (!dumpable && !capable(CAP_SYS_PTRACE))
+   if (!dumpable && !task_ns_capable(task, CAP_SYS_PTRACE))
return -EPERM;
 
return security_ptrace_access_check(task, mode);
@@ -198,7 +201,7 @@ int ptrace_attach(struct task_struct *task)
goto unlock_tasklist;
 
task->ptrace = PT_PTRACED;
-   if (capable(CAP_SYS_PTRACE))
+   if (task_ns_capable(task, CAP_SYS_PTRACE))
task->ptrace |= PT_PTRACE_CAP;
 
__ptrace_link(task, current);
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 9da289c..0ef2258 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -129,6 +129,22 @@ gid_t user_ns_map_gid(struct user_namespace *to, const 
struct cred *cred, gid_t
return overflowgid;
 }
 
+int same_or_ancestor_user_ns(struct task_struct *task,
+   struct task_struct *victim)
+{
+   struct user_namespace *u1 = task_cred_xxx(task, user)->user_ns;
+   struct user_namespace *u2 = task_cred_xxx(victim, user)->user_ns;
+   for (;;) {
+   if (u1 == u2)
+   return 1;
+   if (u1 == &init_user_ns

[Devel] [PATCH 8/9] user namespaces: convert several capable() calls

2011-02-17 Thread Serge E. Hallyn
CAP_IPC_OWNER and CAP_IPC_LOCK can be checked against current_user_ns(),
because the resource comes from current's own ipc namespace.

setuid/setgid operate on uids/gids in the caller's own namespace, so
again the checks can be against current_user_ns().

Changelog:
Jan 11: Use task_ns_capable() in place of sched_capable().
Jan 11: Use nsown_capable() as suggested by Bastian Blank.
Jan 11: Clarify (hopefully) some logic in futex and sched.c
Feb 15: use ns_capable for ipc, not nsown_capable

Signed-off-by: Serge E. Hallyn 
---
 ipc/shm.c |2 +-
 ipc/util.c|5 +++--
 kernel/futex.c|   11 ++-
 kernel/futex_compat.c |   11 ++-
 kernel/groups.c   |2 +-
 kernel/sched.c|9 ++---
 kernel/uid16.c|2 +-
 7 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/ipc/shm.c b/ipc/shm.c
index 7d3bb22..e91e2e9 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -773,7 +773,7 @@ SYSCALL_DEFINE3(shmctl, int, shmid, int, cmd, struct 
shmid_ds __user *, buf)
 
audit_ipc_obj(&(shp->shm_perm));
 
-   if (!capable(CAP_IPC_LOCK)) {
+   if (!ns_capable(ns->user_ns, CAP_IPC_LOCK)) {
uid_t euid = current_euid();
err = -EPERM;
if (euid != shp->shm_perm.uid &&
diff --git a/ipc/util.c b/ipc/util.c
index 69a0cc1..8e7ec6a 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -627,7 +627,7 @@ int ipcperms (struct kern_ipc_perm *ipcp, short flag)
granted_mode >>= 3;
/* is there some bit set in requested_mode but not in granted_mode? */
if ((requested_mode & ~granted_mode & 0007) && 
-   !capable(CAP_IPC_OWNER))
+   !ns_capable(current->nsproxy->ipc_ns->user_ns, CAP_IPC_OWNER))
return -1;
 
return security_ipc_permission(ipcp, flag);
@@ -800,7 +800,8 @@ struct kern_ipc_perm *ipcctl_pre_down(struct ipc_ids *ids, 
int id, int cmd,
 
euid = current_euid();
if (euid == ipcp->cuid ||
-   euid == ipcp->uid  || capable(CAP_SYS_ADMIN))
+   euid == ipcp->uid  ||
+   ns_capable(current->nsproxy->ipc_ns->user_ns, CAP_SYS_ADMIN))
return ipcp;
 
err = -EPERM;
diff --git a/kernel/futex.c b/kernel/futex.c
index b766d28..1e876f1 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2421,10 +2421,19 @@ SYSCALL_DEFINE3(get_robust_list, int, pid,
goto err_unlock;
ret = -EPERM;
pcred = __task_cred(p);
+   /* If victim is in different user_ns, then uids are not
+  comparable, so we must have CAP_SYS_PTRACE */
+   if (cred->user->user_ns != pcred->user->user_ns) {
+   if (!ns_capable(pcred->user->user_ns, CAP_SYS_PTRACE))
+   goto err_unlock;
+   goto ok;
+   }
+   /* If victim is in same user_ns, then uids are comparable */
if (cred->euid != pcred->euid &&
cred->euid != pcred->uid &&
-   !capable(CAP_SYS_PTRACE))
+   !ns_capable(pcred->user->user_ns, CAP_SYS_PTRACE))
goto err_unlock;
+ok:
head = p->robust_list;
rcu_read_unlock();
}
diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c
index a7934ac..5f9e689 100644
--- a/kernel/futex_compat.c
+++ b/kernel/futex_compat.c
@@ -153,10 +153,19 @@ compat_sys_get_robust_list(int pid, compat_uptr_t __user *head_ptr,
goto err_unlock;
ret = -EPERM;
pcred = __task_cred(p);
+   /* If victim is in different user_ns, then uids are not
+  comparable, so we must have CAP_SYS_PTRACE */
+   if (cred->user->user_ns != pcred->user->user_ns) {
+   if (!ns_capable(pcred->user->user_ns, CAP_SYS_PTRACE))
+   goto err_unlock;
+   goto ok;
+   }
+   /* If victim is in same user_ns, then uids are comparable */
if (cred->euid != pcred->euid &&
cred->euid != pcred->uid &&
-   !capable(CAP_SYS_PTRACE))
+   !ns_capable(pcred->user->user_ns, CAP_SYS_PTRACE))
goto err_unlock;
+ok:
head = p->compat_robust_list;
rcu_read_unlock();
}
diff --git a/kernel/groups.c b/kernel/groups.c
index 253dc0f..1cc476d 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -233,7 +233,7 @@ SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
struct group_info *group_info;
int retval;
 
-   

[Devel] [PATCH 4/9] allow killing tasks in your own or child userns

2011-02-17 Thread Serge E. Hallyn
Changelog:
Dec  8: Fixed bug in my check_kill_permission pointed out by
Eric Biederman.
Dec 13: Apply Eric's suggestion to pass target task into
kill_ok_by_cred() for clarity
Dec 31: address comment by Eric Biederman:
don't need cred/tcred in check_kill_permission.
Jan  1: use const cred struct.
Jan 11: Per Bastian Blank's advice, clean up kill_ok_by_cred().
Feb 16: kill_ok_by_cred: fix bad parentheses

Signed-off-by: Serge E. Hallyn 
---
 kernel/signal.c |   30 ++
 1 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 4e3cff1..ffe4bdf 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -636,13 +636,33 @@ static inline bool si_fromuser(const struct siginfo *info)
 }
 
 /*
+ * called with RCU read lock from check_kill_permission()
+ */
+static inline int kill_ok_by_cred(struct task_struct *t)
+{
+   const struct cred *cred = current_cred();
+   const struct cred *tcred = __task_cred(t);
+
+   if (cred->user->user_ns == tcred->user->user_ns &&
+   (cred->euid == tcred->suid ||
+cred->euid == tcred->uid ||
+cred->uid  == tcred->suid ||
+cred->uid  == tcred->uid))
+   return 1;
+
+   if (ns_capable(tcred->user->user_ns, CAP_KILL))
+   return 1;
+
+   return 0;
+}
+
+/*
  * Bad permissions for sending the signal
  * - the caller must hold the RCU read lock
  */
 static int check_kill_permission(int sig, struct siginfo *info,
 struct task_struct *t)
 {
-   const struct cred *cred, *tcred;
struct pid *sid;
int error;
 
@@ -656,14 +676,8 @@ static int check_kill_permission(int sig, struct siginfo *info,
if (error)
return error;
 
-   cred = current_cred();
-   tcred = __task_cred(t);
if (!same_thread_group(current, t) &&
-   (cred->euid ^ tcred->suid) &&
-   (cred->euid ^ tcred->uid) &&
-   (cred->uid  ^ tcred->suid) &&
-   (cred->uid  ^ tcred->uid) &&
-   !capable(CAP_KILL)) {
+   !kill_ok_by_cred(t)) {
switch (sig) {
case SIGCONT:
sid = task_session(t);
-- 
1.7.0.4

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 6/9] user namespaces: convert all capable checks in kernel/sys.c

2011-02-17 Thread Serge E. Hallyn
This allows setuid/setgid in containers.  It also fixes some
corner cases where kernel logic foregoes capability checks when
uids are equivalent.  The latter will need to be done throughout
the whole kernel.
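
The uid corner case described above can be illustrated with a small
userspace model.  This is only a sketch, not kernel code: struct
model_cred, struct model_userns and model_ns_capable() are hypothetical
stand-ins, and the capability test is deliberately reduced to "caller
lives in the target namespace and has the bit raised".  It follows the
shape of set_one_prio_perm() from this patch: uids are compared only when
both tasks are in the same user namespace, otherwise a capability
targeted at the victim's namespace is required.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_namespace. */
struct model_userns {
	int id;
};

/* Hypothetical stand-in for struct cred; caps is a plain bitmask. */
struct model_cred {
	unsigned int uid;
	unsigned int euid;
	const struct model_userns *ns;
	unsigned int caps;
};

#define MODEL_CAP_SYS_NICE (1u << 0)	/* illustrative bit, not the kernel value */

/*
 * Simplifying assumption: a capability applies to a target namespace only
 * when the caller lives in that namespace and has the bit raised.  The
 * real ns_capable() takes more than this into account.
 */
static bool model_ns_capable(const struct model_cred *cred,
			     const struct model_userns *target,
			     unsigned int cap)
{
	assert(cred != NULL && target != NULL);
	return cred->ns == target && (cred->caps & cap) != 0;
}

/* Same shape as set_one_prio_perm(): a uid match counts only within one ns. */
static bool model_set_one_prio_perm(const struct model_cred *cred,
				    const struct model_cred *pcred)
{
	if (pcred->ns == cred->ns &&
	    (pcred->uid == cred->euid || pcred->euid == cred->euid))
		return true;
	return model_ns_capable(cred, pcred->ns, MODEL_CAP_SYS_NICE);
}
```

With two tasks that both carry uid 1000 but sit in different user
namespaces, the model refuses the request unless the caller holds the
capability in the victim's namespace, which is the behaviour the
converted check is meant to have.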

Changelog:
Jan 11: Use nsown_capable() as suggested by Bastian Blank.
Jan 11: Fix logic errors in uid checks pointed out by Bastian.
Feb 15: allow prlimit to current (was regression in previous version)

Signed-off-by: Serge E. Hallyn 
---
 kernel/sys.c |   74 -
 1 files changed, 47 insertions(+), 27 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 7a1bbad..075370d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -118,17 +118,29 @@ EXPORT_SYMBOL(cad_pid);
 
 void (*pm_power_off_prepare)(void);
 
+/* called with rcu_read_lock, creds are safe */
+static inline int set_one_prio_perm(struct task_struct *p)
+{
+   const struct cred *cred = current_cred(), *pcred = __task_cred(p);
+
+   if (pcred->user->user_ns == cred->user->user_ns &&
+   (pcred->uid  == cred->euid ||
+pcred->euid == cred->euid))
+   return 1;
+   if (ns_capable(pcred->user->user_ns, CAP_SYS_NICE))
+   return 1;
+   return 0;
+}
+
 /*
  * set the priority of a task
  * - the caller must hold the RCU read lock
  */
 static int set_one_prio(struct task_struct *p, int niceval, int error)
 {
-   const struct cred *cred = current_cred(), *pcred = __task_cred(p);
int no_nice;
 
-   if (pcred->uid  != cred->euid &&
-   pcred->euid != cred->euid && !capable(CAP_SYS_NICE)) {
+   if (!set_one_prio_perm(p)) {
error = -EPERM;
goto out;
}
@@ -502,7 +514,7 @@ SYSCALL_DEFINE2(setregid, gid_t, rgid, gid_t, egid)
if (rgid != (gid_t) -1) {
if (old->gid == rgid ||
old->egid == rgid ||
-   capable(CAP_SETGID))
+   nsown_capable(CAP_SETGID))
new->gid = rgid;
else
goto error;
@@ -511,7 +523,7 @@ SYSCALL_DEFINE2(setregid, gid_t, rgid, gid_t, egid)
if (old->gid == egid ||
old->egid == egid ||
old->sgid == egid ||
-   capable(CAP_SETGID))
+   nsown_capable(CAP_SETGID))
new->egid = egid;
else
goto error;
@@ -546,7 +558,7 @@ SYSCALL_DEFINE1(setgid, gid_t, gid)
old = current_cred();
 
retval = -EPERM;
-   if (capable(CAP_SETGID))
+   if (nsown_capable(CAP_SETGID))
new->gid = new->egid = new->sgid = new->fsgid = gid;
else if (gid == old->gid || gid == old->sgid)
new->egid = new->fsgid = gid;
@@ -613,7 +625,7 @@ SYSCALL_DEFINE2(setreuid, uid_t, ruid, uid_t, euid)
new->uid = ruid;
if (old->uid != ruid &&
old->euid != ruid &&
-   !capable(CAP_SETUID))
+   !nsown_capable(CAP_SETUID))
goto error;
}
 
@@ -622,7 +634,7 @@ SYSCALL_DEFINE2(setreuid, uid_t, ruid, uid_t, euid)
if (old->uid != euid &&
old->euid != euid &&
old->suid != euid &&
-   !capable(CAP_SETUID))
+   !nsown_capable(CAP_SETUID))
goto error;
}
 
@@ -670,7 +682,7 @@ SYSCALL_DEFINE1(setuid, uid_t, uid)
old = current_cred();
 
retval = -EPERM;
-   if (capable(CAP_SETUID)) {
+   if (nsown_capable(CAP_SETUID)) {
new->suid = new->uid = uid;
if (uid != old->uid) {
retval = set_user(new);
@@ -712,7 +724,7 @@ SYSCALL_DEFINE3(setresuid, uid_t, ruid, uid_t, euid, uid_t, suid)
old = current_cred();
 
retval = -EPERM;
-   if (!capable(CAP_SETUID)) {
+   if (!nsown_capable(CAP_SETUID)) {
if (ruid != (uid_t) -1 && ruid != old->uid &&
ruid != old->euid  && ruid != old->suid)
goto error;
@@ -776,7 +788,7 @@ SYSCALL_DEFINE3(setresgid, gid_t, rgid, gid_t, egid, gid_t, sgid)
old = current_cred();
 
retval = -EPERM;
-   if (!capable(CAP_SETGID)) {
+   if (!nsown_capable(CAP_SETGID)) {
if (rgid != (gid_t) -1 && rgid != old->gid &&
rgid != old->egid  && rgid != old->sgid)
goto error;
@@ -836,7 +848,7 @@ SYSCALL_DEFINE1(setfsuid, uid_t, uid)
 
if (uid == old->uid  || uid == old->euid  ||
uid == old->suid || uid == o

[Devel] [PATCH 3/9] allow sethostname in a container

2011-02-17 Thread Serge E. Hallyn
Signed-off-by: Serge E. Hallyn 
---
 kernel/sys.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 18da702..7a1bbad 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1177,7 +1177,7 @@ SYSCALL_DEFINE2(sethostname, char __user *, name, int, len)
int errno;
char tmp[__NEW_UTS_LEN];
 
-   if (!capable(CAP_SYS_ADMIN))
+   if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN))
return -EPERM;
if (len < 0 || len > __NEW_UTS_LEN)
return -EINVAL;
-- 
1.7.0.4


[Devel] [PATCH 1/9] Add a user_namespace as creator/owner of uts_namespace

2011-02-17 Thread Serge E. Hallyn
copy_process() handles CLONE_NEWUSER before the rest of the
namespaces.  So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS)
the new uts namespace will have the new user namespace as its
owner.  That is what we want, since we want root in that new
userns to be able to have privilege over it.
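
The ownership rule above, together with the Feb 15 fix noted in the
changelog, can be sketched as a tiny userspace model.  Everything here is
hypothetical scaffolding (the model_* names, the flag value, the plain
integer refcounts); it only mirrors the order of operations visible in
the clone_uts_ns() and create_new_namespaces() hunks of this patch: a
cloned uts namespace first inherits its parent's owner, then is handed to
the task's user namespace, and a shared uts namespace keeps its owner.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_CLONE_NEWUTS 0x1u	/* hypothetical flag value, not the kernel's */

struct model_userns {
	int refcount;
};

struct model_uts_ns {
	int refcount;
	struct model_userns *user_ns;	/* owning user namespace */
};

static void model_get_user_ns(struct model_userns *ns) { ns->refcount++; }
static void model_put_user_ns(struct model_userns *ns) { ns->refcount--; }

/*
 * Without the flag the old uts ns is shared and its owner is untouched.
 * With the flag, 'fresh' becomes a clone that is re-owned by the task's
 * user namespace.  'fresh' is caller-provided storage so the model needs
 * no allocation.
 */
static struct model_uts_ns *model_copy_uts(unsigned int flags,
					   struct model_uts_ns *old,
					   struct model_uts_ns *fresh,
					   struct model_userns *task_user_ns)
{
	assert(old != NULL && fresh != NULL && task_user_ns != NULL);

	if (!(flags & MODEL_CLONE_NEWUTS)) {
		old->refcount++;
		return old;
	}

	/* clone step: copy the owner and take a reference on it */
	fresh->refcount = 1;
	fresh->user_ns = old->user_ns;
	model_get_user_ns(fresh->user_ns);

	/* re-own step: drop the inherited owner, pin the task's user ns */
	model_put_user_ns(fresh->user_ns);
	fresh->user_ns = task_user_ns;
	model_get_user_ns(fresh->user_ns);
	return fresh;
}
```

The put/get pairing keeps the reference counts balanced: the inherited
owner ends up with the count it started with, and the new owner gains
exactly one reference for the new uts namespace.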

Changelog:
Feb 15: don't set uts_ns->user_ns if we didn't create
a new uts_ns.

Signed-off-by: Serge E. Hallyn 
---
 include/linux/utsname.h |3 +++
 init/version.c  |2 ++
 kernel/nsproxy.c|5 +
 kernel/user.c   |8 ++--
 kernel/utsname.c|4 
 5 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 69f3997..85171be 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -37,9 +37,12 @@ struct new_utsname {
 #include 
 #include 
 
+struct user_namespace;
+
 struct uts_namespace {
struct kref kref;
struct new_utsname name;
+   struct user_namespace *user_ns;
 };
 extern struct uts_namespace init_uts_ns;
 
diff --git a/init/version.c b/init/version.c
index adff586..97bb86f 100644
--- a/init/version.c
+++ b/init/version.c
@@ -21,6 +21,7 @@ extern int version_string(LINUX_VERSION_CODE);
 int version_string(LINUX_VERSION_CODE);
 #endif
 
+extern struct user_namespace init_user_ns;
 struct uts_namespace init_uts_ns = {
.kref = {
.refcount   = ATOMIC_INIT(2),
@@ -33,6 +34,7 @@ struct uts_namespace init_uts_ns = {
.machine= UTS_MACHINE,
.domainname = UTS_DOMAINNAME,
},
+   .user_ns = &init_user_ns,
 };
 EXPORT_SYMBOL_GPL(init_uts_ns);
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f74e6c0..034dc2e 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -74,6 +74,11 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
err = PTR_ERR(new_nsp->uts_ns);
goto out_uts;
}
+   if (new_nsp->uts_ns != tsk->nsproxy->uts_ns) {
+   put_user_ns(new_nsp->uts_ns->user_ns);
+   new_nsp->uts_ns->user_ns = task_cred_xxx(tsk, user)->user_ns;
+   get_user_ns(new_nsp->uts_ns->user_ns);
+   }
 
new_nsp->ipc_ns = copy_ipcs(flags, tsk->nsproxy->ipc_ns);
if (IS_ERR(new_nsp->ipc_ns)) {
diff --git a/kernel/user.c b/kernel/user.c
index 5c598ca..9e03e9c 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -17,9 +17,13 @@
 #include 
 #include 
 
+/*
+ * userns count is 1 for root user, 1 for init_uts_ns,
+ * and 1 for... ?
+ */
 struct user_namespace init_user_ns = {
.kref = {
-   .refcount   = ATOMIC_INIT(2),
+   .refcount   = ATOMIC_INIT(3),
},
.creator = &root_user,
 };
@@ -47,7 +51,7 @@ static struct kmem_cache *uid_cachep;
  */
 static DEFINE_SPINLOCK(uidhash_lock);
 
-/* root_user.__count is 2, 1 for init task cred, 1 for init_user_ns->creator */
+/* root_user.__count is 2, 1 for init task cred, 1 for init_user_ns->user_ns */
 struct user_struct root_user = {
.__count= ATOMIC_INIT(2),
.processes  = ATOMIC_INIT(1),
diff --git a/kernel/utsname.c b/kernel/utsname.c
index 8a82b4b..a7b3a8d 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct uts_namespace *create_uts_ns(void)
 {
@@ -40,6 +41,8 @@ static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns)
 
down_read(&uts_sem);
memcpy(&ns->name, &old_ns->name, sizeof(ns->name));
+   ns->user_ns = old_ns->user_ns;
+   get_user_ns(ns->user_ns);
up_read(&uts_sem);
return ns;
 }
@@ -71,5 +74,6 @@ void free_uts_ns(struct kref *kref)
struct uts_namespace *ns;
 
ns = container_of(kref, struct uts_namespace, kref);
+   put_user_ns(ns->user_ns);
kfree(ns);
 }
-- 
1.7.0.4


[Devel] [PATCH 7/9] add a user namespace owner of ipc ns

2011-02-17 Thread Serge E. Hallyn
Changelog:
Feb 15: Don't set new ipc->user_ns if we didn't create a new
ipc_ns.

Signed-off-by: Serge E. Hallyn 
---
 include/linux/ipc_namespace.h |3 +++
 ipc/msgutil.c |3 +++
 ipc/namespace.c   |9 +++--
 kernel/nsproxy.c  |5 +
 4 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 5195298..46d2eb4 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -24,6 +24,7 @@ struct ipc_ids {
struct idr ipcs_idr;
 };
 
+struct user_namespace;
 struct ipc_namespace {
atomic_t count;
struct ipc_ids  ids[3];
@@ -56,6 +57,8 @@ struct ipc_namespace {
unsigned int mq_msg_max;  /* initialized to DFLT_MSGMAX */
unsigned int mq_msgsize_max;  /* initialized to DFLT_MSGSIZEMAX */
 
+   /* user_ns which owns the ipc ns */
+   struct user_namespace *user_ns;
 };
 
 extern struct ipc_namespace init_ipc_ns;
diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index f095ee2..d91ff4b 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -20,6 +20,8 @@
 
 DEFINE_SPINLOCK(mq_lock);
 
+extern struct user_namespace init_user_ns;
+
 /*
  * The next 2 defines are here bc this is the only file
  * compiled when either CONFIG_SYSVIPC and CONFIG_POSIX_MQUEUE
@@ -32,6 +34,7 @@ struct ipc_namespace init_ipc_ns = {
.mq_msg_max  = DFLT_MSGMAX,
.mq_msgsize_max  = DFLT_MSGSIZEMAX,
 #endif
+   .user_ns = &init_user_ns,
 };
 
 atomic_t nr_ipc_ns = ATOMIC_INIT(1);
diff --git a/ipc/namespace.c b/ipc/namespace.c
index a1094ff..aa18899 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -11,10 +11,11 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "util.h"
 
-static struct ipc_namespace *create_ipc_ns(void)
+static struct ipc_namespace *create_ipc_ns(struct ipc_namespace *old_ns)
 {
struct ipc_namespace *ns;
int err;
@@ -43,6 +44,9 @@ static struct ipc_namespace *create_ipc_ns(void)
ipcns_notify(IPCNS_CREATED);
register_ipcns_notifier(ns);
 
+   ns->user_ns = old_ns->user_ns;
+   get_user_ns(ns->user_ns);
+
return ns;
 }
 
@@ -50,7 +54,7 @@ struct ipc_namespace *copy_ipcs(unsigned long flags, struct ipc_namespace *ns)
 {
if (!(flags & CLONE_NEWIPC))
return get_ipc_ns(ns);
-   return create_ipc_ns();
+   return create_ipc_ns(ns);
 }
 
 /*
@@ -105,6 +109,7 @@ static void free_ipc_ns(struct ipc_namespace *ns)
 * order to have a correct value when recomputing msgmni.
 */
ipcns_notify(IPCNS_REMOVED);
+   put_user_ns(ns->user_ns);
 }
 
 /*
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 034dc2e..b6dbff2 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -85,6 +85,11 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
err = PTR_ERR(new_nsp->ipc_ns);
goto out_ipc;
}
+   if (new_nsp->ipc_ns != tsk->nsproxy->ipc_ns) {
+   put_user_ns(new_nsp->ipc_ns->user_ns);
+   new_nsp->ipc_ns->user_ns = task_cred_xxx(tsk, user)->user_ns;
+   get_user_ns(new_nsp->ipc_ns->user_ns);
+   }
 
new_nsp->pid_ns = copy_pid_ns(flags, task_active_pid_ns(tsk));
if (IS_ERR(new_nsp->pid_ns)) {
-- 
1.7.0.4


[Devel] [PATCH 9/9] userns: check user namespace for task->file uid equivalence checks

2011-02-17 Thread Serge E. Hallyn
Cheat for now and say all files belong to init_user_ns.  Next
step will be to let superblocks belong to a user_ns, and derive
inode_userns(inode) from inode->i_sb->s_user_ns.  Finally we'll
introduce more flexible arrangements.
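
As a rough illustration of what "all files belong to init_user_ns" means
for permission checks, here is a userspace sketch of how the rwx triple
is chosen.  It is an assumption-laden model, not the kernel function:
the model_* types are hypothetical, in_group stands in for the kernel's
group membership test, and ACLs are ignored.  It follows the
other_perms shortcut added to acl_permission_check() in this patch: a
caller from a different user namespace is never treated as owner or
group member, so only the "other" bits apply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_namespace. */
struct model_userns {
	int id;
};

/* Hypothetical stand-in for the few inode fields the check needs. */
struct model_inode {
	unsigned int i_mode;		/* permission bits, e.g. 0640 */
	unsigned int i_uid;
	const struct model_userns *ns;	/* what inode_userns() would return */
};

/* Return the rwx triple (0..7) that applies to the caller. */
static unsigned int model_perm_bits(const struct model_inode *inode,
				    const struct model_userns *caller_ns,
				    unsigned int fsuid, bool in_group)
{
	unsigned int mode;

	assert(inode != NULL && caller_ns != NULL);
	mode = inode->i_mode;

	if (caller_ns == inode->ns) {
		if (fsuid == inode->i_uid)
			mode >>= 6;	/* owner bits */
		else if (in_group)
			mode >>= 3;	/* group bits */
	}
	/* Different namespace: uids are not comparable, keep "other" bits. */
	return mode & 07;
}
```

For a 0640 file owned by uid 1000 in the initial namespace, a task with
fsuid 1000 in a child namespace gets no access from the mode bits alone,
even though the numeric uid matches.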

Changelog:
Feb 15: make is_owner_or_cap take const struct inode

Signed-off-by: Serge E. Hallyn 
---
 fs/inode.c |   17 +
 fs/namei.c |   20 +++-
 include/linux/fs.h |9 +++--
 3 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index da85e56..1930b45 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * This is needed for the following functions:
@@ -1722,3 +1723,19 @@ void inode_init_owner(struct inode *inode, const struct inode *dir,
inode->i_mode = mode;
 }
 EXPORT_SYMBOL(inode_init_owner);
+
+/*
+ * return 1 if current either has CAP_FOWNER to the
+ * file, or owns the file.
+ */
+int is_owner_or_cap(const struct inode *inode)
+{
+   struct user_namespace *ns = inode_userns(inode);
+
+   if (current_user_ns() == ns && current_fsuid() == inode->i_uid)
+   return 1;
+   if (ns_capable(ns, CAP_FOWNER))
+   return 1;
+   return 0;
+}
+EXPORT_SYMBOL(is_owner_or_cap);
diff --git a/fs/namei.c b/fs/namei.c
index 9e701e2..cfac5b4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -176,6 +176,9 @@ static int acl_permission_check(struct inode *inode, int mask, unsigned int flag
 
mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
 
+   if (current_user_ns() != inode_userns(inode))
+   goto other_perms;
+
if (current_fsuid() == inode->i_uid)
mode >>= 6;
else {
@@ -189,6 +192,7 @@ static int acl_permission_check(struct inode *inode, int mask, unsigned int flag
mode >>= 3;
}
 
+other_perms:
/*
 * If the DACs are ok we don't need any capability check.
 */
@@ -230,7 +234,7 @@ int generic_permission(struct inode *inode, int mask, unsigned int flags,
 * Executable DACs are overridable if at least one exec bit is set.
 */
if (!(mask & MAY_EXEC) || execute_ok(inode))
-   if (capable(CAP_DAC_OVERRIDE))
+   if (ns_capable(inode_userns(inode), CAP_DAC_OVERRIDE))
return 0;
 
/*
@@ -238,7 +242,7 @@ int generic_permission(struct inode *inode, int mask, unsigned int flags,
 */
mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
if (mask == MAY_READ || (S_ISDIR(inode->i_mode) && !(mask & MAY_WRITE)))
-   if (capable(CAP_DAC_READ_SEARCH))
+   if (ns_capable(inode_userns(inode), CAP_DAC_READ_SEARCH))
return 0;
 
return -EACCES;
@@ -675,6 +679,7 @@ force_reval_path(struct path *path, struct nameidata *nd)
 static inline int exec_permission(struct inode *inode, unsigned int flags)
 {
int ret;
+   struct user_namespace *ns = inode_userns(inode);
 
if (inode->i_op->permission) {
ret = inode->i_op->permission(inode, MAY_EXEC, flags);
@@ -687,7 +692,7 @@ static inline int exec_permission(struct inode *inode, unsigned int flags)
if (ret == -ECHILD)
return ret;
 
-   if (capable(CAP_DAC_OVERRIDE) || capable(CAP_DAC_READ_SEARCH))
+   if (ns_capable(ns, CAP_DAC_OVERRIDE) || ns_capable(ns, CAP_DAC_READ_SEARCH))
goto ok;
 
return ret;
@@ -1940,11 +1945,15 @@ static inline int check_sticky(struct inode *dir, struct inode *inode)
 
if (!(dir->i_mode & S_ISVTX))
return 0;
+   if (current_user_ns() != inode_userns(inode))
+   goto other_userns;
if (inode->i_uid == fsuid)
return 0;
if (dir->i_uid == fsuid)
return 0;
-   return !capable(CAP_FOWNER);
+
+other_userns:
+   return !ns_capable(inode_userns(inode), CAP_FOWNER);
 }
 
 /*
@@ -2635,7 +2644,8 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev)
if (error)
return error;
 
-   if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
+   if ((S_ISCHR(mode) || S_ISBLK(mode)) &&
+   !ns_capable(inode_userns(dir), CAP_MKNOD))
return -EPERM;
 
if (!dir->i_op->mknod)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bd32159..c84417a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1446,8 +1446,13 @@ enum {
 #define put_fs_excl() atomic_dec(&current->fs_excl)
 #define has_fs_excl() atomic_read(&current->fs_excl)
 
-#define is_owner_or_cap(inode) \
-   ((current_fsuid() == (inode)->i_uid) || capable(CAP_FOWNER))
+/*
+ * until VFS tracks user namespaces for inodes, just make all files
+ * belong to init_user_ns

[Devel] [PATCH 2/9] security: Make capabilities relative to the user namespace.

2011-02-17 Thread Serge E. Hallyn
- Introduce ns_capable to test for a capability in a non-default
  user namespace.
- Teach cap_capable to handle capabilities in a non-default
  user namespace.

The motivation is to get to the unprivileged creation of new
namespaces.  It looks like this gets us 90% of the way there, with
only potential uid confusion issues left.

I still need to handle getting all caps after creation but otherwise I
think I have a good starter patch that achieves all of your goals.

Changelog:
11/05/2010: [serge] add apparmor
12/14/2010: [serge] fix capabilities to created user namespaces
Without this, if user serge creates a user_ns, he won't have
capabilities to the user_ns he created.  This is because we
were first checking whether his effective caps had the caps
he needed and returning -EPERM if not, and THEN checking whether
he was the creator.  Reverse those checks.
12/16/2010: [serge] security_real_capable needs ns argument in !security case
01/11/2011: [serge] add task_ns_capable helper
01/11/2011: [serge] add nsown_capable() helper per Bastian Blank suggestion
02/16/2011: [serge] fix a logic bug: the root user is always creator of
init_user_ns, but should not always have capabilities to
it!  Fix the check in cap_capable().

Signed-off-by: Eric W. Biederman 
Signed-off-by: Serge E. Hallyn 
---
 include/linux/capability.h |   10 --
 include/linux/security.h   |   25 ++---
 kernel/capability.c|   32 ++--
 security/apparmor/lsm.c|5 +++--
 security/commoncap.c   |   40 +---
 security/security.c|   16 ++--
 security/selinux/hooks.c   |   14 +-
 7 files changed, 107 insertions(+), 35 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index fb16a36..cb3d2d9 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -544,7 +544,7 @@ extern const kernel_cap_t __cap_init_eff_set;
  *
  * Note that this does not set PF_SUPERPRIV on the task.
  */
-#define has_capability(t, cap) (security_real_capable((t), (cap)) == 0)
+#define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0)
 
 /**
  * has_capability_noaudit - Determine if a task has a superior capability available (unaudited)
@@ -558,9 +558,15 @@ extern const kernel_cap_t __cap_init_eff_set;
  * Note that this does not set PF_SUPERPRIV on the task.
  */
 #define has_capability_noaudit(t, cap) \
-   (security_real_capable_noaudit((t), (cap)) == 0)
+   (security_real_capable_noaudit((t), &init_user_ns, (cap)) == 0)
 
+struct user_namespace;
+extern struct user_namespace init_user_ns;
 extern int capable(int cap);
+extern int ns_capable(struct user_namespace *ns, int cap);
+extern int task_ns_capable(struct task_struct *t, int cap);
+
+#define nsown_capable(cap) (ns_capable(current_user_ns(), (cap)))
 
 /* audit system wants to get cap info from files as well */
 struct dentry;
diff --git a/include/linux/security.h b/include/linux/security.h
index b2b7f97..6bbee08 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -46,13 +46,14 @@
 
 struct ctl_table;
 struct audit_krule;
+struct user_namespace;
 
 /*
  * These functions are in security/capability.c and are used
  * as the default capabilities functions
  */
 extern int cap_capable(struct task_struct *tsk, const struct cred *cred,
-  int cap, int audit);
+  struct user_namespace *ns, int cap, int audit);
 extern int cap_settime(struct timespec *ts, struct timezone *tz);
 extern int cap_ptrace_access_check(struct task_struct *child, unsigned int mode);
 extern int cap_ptrace_traceme(struct task_struct *parent);
@@ -1254,6 +1255,7 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
  * credentials.
  * @tsk contains the task_struct for the process.
  * @cred contains the credentials to use.
+ *  @ns contains the user namespace we want the capability in
  * @cap contains the capability .
  * @audit: Whether to write an audit message or not
  * Return 0 if the capability is granted for @tsk.
@@ -1382,7 +1384,7 @@ struct security_operations {
   const kernel_cap_t *inheritable,
   const kernel_cap_t *permitted);
int (*capable) (struct task_struct *tsk, const struct cred *cred,
-   int cap, int audit);
+   struct user_namespace *ns, int cap, int audit);
int (*sysctl) (struct ctl_table *table, int op);
int (*quotactl) (int cmds, int type, int id, struct super_block *sb);
int (*quota_on) (struct dentry *dentry);
@@ -1662,9 +1664,9 @@ int security_capset(struct cred *new, const struct cred *old,
const kernel_cap

[Devel] userns: targeted capabilities v5

2011-02-17 Thread Serge E. Hallyn
Here is a repost of my previous user namespace patch, ported onto
last night's git head.

It fixes several things I was doing wrong in the last (v4)
posting, in particular:

1. don't set uts_ns->user_ns to current's when !CLONE_NEWUTS
2. add a ipc_ns->user_ns which owns ipc_ns, and use that to
   decide CAP_IPC_OWNER
3. fix logic flaw caused by bad parentheses
4. allow do_prlimit to current
5. don't always give root full privs to init_user_ns

The expected course of development for user namespaces is laid out
at https://wiki.ubuntu.com/UserNamespace.  Bugs aside, this
patchset is not supposed to affect systems which are not
actively using user namespaces at all; it only restricts what tasks
in a child user namespace can do.  The patches begin to limit
privilege to a user namespace, so that root in a container cannot
kill or ptrace tasks in the parent user namespace, and can only get
world access rights to files.  Since all files currently belong
to the initial user namespace, that means that child user
namespaces can only get world access rights to *all* files.
While this temporarily makes user namespaces bad for system
containers, it starts to get useful for some sandboxing.
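
To make the "targeted capability" idea concrete, here is a minimal
userspace sketch of the decision the series describes.  It is not the
patch's cap_capable(): the model_* types are hypothetical, a NULL creator
is used purely as this model's marker for the initial namespace, and the
walk up to the parent namespace is an assumption about how privilege in
a parent extends to its children.  It encodes points from the
changelogs: the creator of a child user namespace has capabilities over
it, the creator rule does not apply to the initial namespace, and inside
one's own namespace the effective set decides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_userns;

/* Hypothetical stand-in for struct user_struct. */
struct model_user {
	const struct model_userns *user_ns;	/* namespace the user lives in */
};

/* Hypothetical stand-in for struct user_namespace. */
struct model_userns {
	const struct model_user *creator;	/* NULL marks the initial ns here */
};

/* Hypothetical stand-in for struct cred; cap_effective is a bitmask. */
struct model_cred {
	const struct model_user *user;
	unsigned int cap_effective;
};

/* Does cred hold capability bit 'cap' with respect to namespace 'target'? */
static bool model_ns_capable(const struct model_cred *cred,
			     const struct model_userns *target,
			     unsigned int cap)
{
	assert(cred != NULL && cred->user != NULL && target != NULL);

	for (;;) {
		/* The creator of a non-initial namespace has all caps over it. */
		if (target->creator != NULL && target->creator == cred->user)
			return true;

		/* In our own namespace the effective set decides. */
		if (target == cred->user->user_ns)
			return (cred->cap_effective & cap) != 0;

		/* Reached the initial namespace without a match. */
		if (target->creator == NULL)
			return false;

		/* Assumed rule: retry against the namespace of the creator. */
		target = target->creator->user_ns;
	}
}
```

In this model root inside a container is capable toward its own
namespace but never toward the initial one, while a capable task in the
initial namespace is also capable toward the container.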

I've run the 'runltplite.sh' with and without this patchset and
found no difference.  So all in all, this is the first version
of this patchset for which I feel comfortable asking:  please
consider applying.

thanks,
-serge

[Devel] Re: [PATCH 3/3] procfs: kill the global proc_mnt variable

2011-02-16 Thread Serge E. Hallyn
Quoting Daniel Lezcano (daniel.lezc...@free.fr):
> From: Oleg Nesterov 
> 
> After the previous cleanup in proc_get_sb() the global proc_mnt has
> no reason to exist; kill it.
> 
> Signed-off-by: Oleg Nesterov 
> Signed-off-by: Eric W. Biederman 
> Signed-off-by: Daniel Lezcano 

Acked-by: Serge E. Hallyn 

> ---
>  fs/proc/inode.c|2 --
>  fs/proc/internal.h |1 -
>  fs/proc/root.c |7 ---
>  3 files changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/proc/inode.c b/fs/proc/inode.c
> index 176ce4c..ee0f802 100644
> --- a/fs/proc/inode.c
> +++ b/fs/proc/inode.c
> @@ -42,8 +42,6 @@ static void proc_evict_inode(struct inode *inode)
>   sysctl_head_put(PROC_I(inode)->sysctl);
>  }
>  
> -struct vfsmount *proc_mnt;
> -
>  static struct kmem_cache * proc_inode_cachep;
>  
>  static struct inode *proc_alloc_inode(struct super_block *sb)
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index 9ad561d..c03e8d3 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -107,7 +107,6 @@ static inline struct proc_dir_entry *pde_get(struct proc_dir_entry *pde)
>  }
>  void pde_put(struct proc_dir_entry *pde);
>  
> -extern struct vfsmount *proc_mnt;
>  int proc_fill_super(struct super_block *);
>  struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
>  
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index e5e2bfa..a9000e9 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -90,19 +90,20 @@ static struct file_system_type proc_fs_type = {
>  
>  void __init proc_root_init(void)
>  {
> + struct vfsmount *mnt;
>   int err;
>  
>   proc_init_inodecache();
>   err = register_filesystem(&proc_fs_type);
>   if (err)
>   return;
> - proc_mnt = kern_mount_data(&proc_fs_type, &init_pid_ns);
> - if (IS_ERR(proc_mnt)) {
> + mnt = kern_mount_data(&proc_fs_type, &init_pid_ns);
> + if (IS_ERR(mnt)) {
>   unregister_filesystem(&proc_fs_type);
>   return;
>   }
>  
> - init_pid_ns.proc_mnt = proc_mnt;
> + init_pid_ns.proc_mnt = mnt;
>   proc_symlink("mounts", NULL, "self/mounts");
>  
>   proc_net_init();
> -- 
> 1.7.1
> 

[Devel] Re: [PATCH 2/3] pidns: Call pid_ns_prepare_proc from create_pid_namespace

2011-02-16 Thread Serge E. Hallyn
Quoting Daniel Lezcano (daniel.lezc...@free.fr):
> From: Eric W. Biederman 
> 
> Reorganize proc_get_sb so it can be called before the struct pid
> of the first process is allocated.
> 
> Signed-off-by: Eric W. Biederman 
> Signed-off-by: Daniel Lezcano 

Acked-by: Serge E. Hallyn 

> ---
>  fs/proc/root.c |   25 +++--
>  kernel/fork.c  |6 --
>  kernel/pid_namespace.c |   11 +--
>  3 files changed, 16 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index ef9fa8e..e5e2bfa 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -43,17 +43,6 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
>   struct pid_namespace *ns;
>   struct proc_inode *ei;
>  
> - if (proc_mnt) {
> - /* Seed the root directory with a pid so it doesn't need
> -  * to be special in base.c.  I would do this earlier but
> -  * the only task alive when /proc is mounted the first time
> -  * is the init_task and it doesn't have any pids.
> -  */
> - ei = PROC_I(proc_mnt->mnt_sb->s_root->d_inode);
> - if (!ei->pid)
> - ei->pid = find_get_pid(1);
> - }
> -
>   if (flags & MS_KERNMOUNT)
>   ns = (struct pid_namespace *)data;
>   else
> @@ -71,16 +60,16 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
>   return ERR_PTR(err);
>   }
>  
> - ei = PROC_I(sb->s_root->d_inode);
> - if (!ei->pid) {
> - rcu_read_lock();
> - ei->pid = get_pid(find_pid_ns(1, ns));
> - rcu_read_unlock();
> - }
> -
>   sb->s_flags |= MS_ACTIVE;
>   }
>  
> + ei = PROC_I(sb->s_root->d_inode);
> + if (!ei->pid) {
> + rcu_read_lock();
> + ei->pid = get_pid(find_pid_ns(1, ns));
> + rcu_read_unlock();
> + }
> +
>   return dget(sb->s_root);
>  }
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c9f0784..e7a5907 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1180,12 +1180,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>   pid = alloc_pid(p->nsproxy->pid_ns);
>   if (!pid)
>   goto bad_fork_cleanup_io;
> -
> - if (clone_flags & CLONE_NEWPID) {
> - retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
> - if (retval < 0)
> - goto bad_fork_free_pid;
> - }
>   }
>  
>   p->pid = pid_nr(pid);
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index a5aff94..e9c9adc 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define BITS_PER_PAGE (PAGE_SIZE*8)
>  
> @@ -72,7 +73,7 @@ static struct pid_namespace *create_pid_namespace(struct pid_namespace *parent_p
>  {
>   struct pid_namespace *ns;
>   unsigned int level = parent_pid_ns->level + 1;
> - int i;
> + int i, err = -ENOMEM;
>  
>   ns = kmem_cache_zalloc(pid_ns_cachep, GFP_KERNEL);
>   if (ns == NULL)
> @@ -96,14 +97,20 @@ static struct pid_namespace *create_pid_namespace(struct pid_namespace *parent_p
>   for (i = 1; i < PIDMAP_ENTRIES; i++)
>   atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
>  
> + err = pid_ns_prepare_proc(ns);
> + if (err)
> + goto out_put_parent_pid_ns;
> +
>   return ns;
>  
> +out_put_parent_pid_ns:
> + put_pid_ns(parent_pid_ns);
>  out_free_map:
>   kfree(ns->pidmap[0].page);
>  out_free:
>   kmem_cache_free(pid_ns_cachep, ns);
>  out:
> - return ERR_PTR(-ENOMEM);
> + return ERR_PTR(err);
>  }
>  
>  static void destroy_pid_namespace(struct pid_namespace *ns)
> -- 
> 1.7.1
> 

[Devel] Re: [PATCH 1/3] pid: Remove the child_reaper special case in init/main.c

2011-02-16 Thread Serge E. Hallyn
Quoting Daniel Lezcano (daniel.lezc...@free.fr):
> From: Eric W. Biederman 
> 
> It turns out that the existing assignment in copy_process of
> the child_reaper can handle the initial assignment of child_reaper
> we just need to generalize the test in kernel/fork.c
> 
> Signed-off-by: Eric W. Biederman 
> Signed-off-by: Daniel Lezcano 

Acked-by: Serge E. Hallyn 

> ---
>  include/linux/pid.h |   11 +++
>  init/main.c |9 -
>  kernel/fork.c   |2 +-
>  3 files changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index 49f1c2f..efceda0 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -141,6 +141,17 @@ static inline struct pid_namespace *ns_of_pid(struct pid *pid)
>  }
>  
>  /*
> + * is_child_reaper returns true if the pid is the init process
> + * of the current namespace. As this one could be checked before
> + * pid_ns->child_reaper is assigned in copy_process, we check
> + * with the pid number.
> + */
> +static inline bool is_child_reaper(struct pid *pid)
> +{
> + return pid->numbers[pid->level].nr == 1;
> +}
> +
> +/*
>   * the helpers to get the pid's id seen from different namespaces
>   *
>   * pid_nr(): global id, i.e. the id seen from the init namespace;
> diff --git a/init/main.c b/init/main.c
> index 33c37c3..793ebfd 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -875,15 +875,6 @@ static int __init kernel_init(void * unused)
>* init can run on any cpu.
>*/
>   set_cpus_allowed_ptr(current, cpu_all_mask);
> - /*
> -  * Tell the world that we're going to be the grim
> -  * reaper of innocent orphaned children.
> -  *
> -  * We don't want people to have to make incorrect
> -  * assumptions about where in the task array this
> -  * can be found.
> -  */
> - init_pid_ns.child_reaper = current;
>  
>   cad_pid = task_pid(current);
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 25e4291..c9f0784 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1289,7 +1289,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>   tracehook_finish_clone(p, clone_flags, trace);
>  
>   if (thread_group_leader(p)) {
> - if (clone_flags & CLONE_NEWPID)
> + if (is_child_reaper(pid))
>   p->nsproxy->pid_ns->child_reaper = p;
>  
>   p->signal->leader_pid = pid;
> -- 
> 1.7.1
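
The is_child_reaper() test above looks at the pid number in the deepest
namespace the pid belongs to, so it holds for the host init and for every
container init alike.  A minimal userspace sketch, assuming simplified
struct upid and struct pid stand-ins and an arbitrary MAX_LEVEL (the demo_*
helpers and the pid values are made up for illustration):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_LEVEL 4    /* arbitrary nesting limit for this sketch */

/* Simplified stand-ins for the kernel structures of the same name. */
struct upid {
    int nr;
};

struct pid {
    unsigned int level;                    /* depth of the owning namespace */
    struct upid numbers[MAX_LEVEL + 1];    /* one number per ancestor level */
};

/* True if the pid is number 1 in the namespace it was created in. */
bool is_child_reaper(const struct pid *pid)
{
    assert(pid->level <= MAX_LEVEL);
    return pid->numbers[pid->level].nr == 1;
}

/* A container init: pid 1 at level 1, seen as 4242 from the host. */
bool demo_container_init(void)
{
    struct pid p = { .level = 1, .numbers = { { 4242 }, { 1 } } };

    return is_child_reaper(&p);
}

/* An ordinary task in the same container: pid 2 at level 1. */
bool demo_ordinary_task(void)
{
    struct pid p = { .level = 1, .numbers = { { 4243 }, { 2 } } };

    return is_child_reaper(&p);
}
```

Because the test depends only on the pid itself, it can be evaluated in
copy_process() before pid_ns->child_reaper has been assigned.
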
> 


[Devel] [linux-cr PATCH 1/1] update x86-64 eclone and cr syscall numbers

2011-02-14 Thread Serge E. Hallyn
(for ckpt-v23-rc1-pids branch)

Signed-off-by: Serge E. Hallyn 
---
 arch/x86/include/asm/unistd_64.h |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 706d90a..f5d1b9e 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -671,9 +671,9 @@ __SYSCALL(__NR_fanotify_mark, sys_fanotify_mark)
 __SYSCALL(__NR_prlimit64, sys_prlimit64)
 #define __NR_eclone 303
 __SYSCALL(__NR_eclone, stub_eclone)
-#define __NR_checkpoint 301
+#define __NR_checkpoint 304
 __SYSCALL(__NR_checkpoint, stub_checkpoint)
-#define __NR_restart   302
+#define __NR_restart   305
 __SYSCALL(__NR_restart, stub_restart)
 
 #ifndef __NO_STUBS
-- 
1.7.0.4



[Devel] [user-cr PATCH 1/1] Fix x86-64 syscall numbers

2011-02-14 Thread Serge E. Hallyn
Signed-off-by: Serge Hallyn 
---
 clone_x86_64.c |2 +-
 include/linux/checkpoint.h |4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/clone_x86_64.c b/clone_x86_64.c
index 5a22093..6750786 100644
--- a/clone_x86_64.c
+++ b/clone_x86_64.c
@@ -26,7 +26,7 @@
 #include "eclone.h"
 
 #ifndef __NR_eclone
-#define __NR_eclone 300
+#define __NR_eclone 303
 #endif
 
 int eclone(int (*fn)(void *), void *fn_arg, int clone_flags_low,
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index b6ac12d..3688ae1 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -54,11 +54,11 @@
 #elif __x86_64__
 
 #  ifndef __NR_checkpoint
-#  define __NR_checkpoint 301
+#  define __NR_checkpoint 304
 #  endif
 
 #  ifndef __NR_restart
-#  define __NR_restart 302
+#  define __NR_restart 305
 #  endif
 
 #else
-- 
1.7.2.3



[Devel] cr: fix trivial compile error

2011-02-13 Thread Serge E. Hallyn
diff --git a/fs/sysv/dir.c b/fs/sysv/dir.c
index 5bff486..6261993 100644
--- a/fs/sysv/dir.c
+++ b/fs/sysv/dir.c
@@ -25,7 +25,6 @@ const struct file_operations sysv_dir_operations = {
.read   = generic_read_dir,
.readdir= sysv_readdir,
.fsync  = generic_file_fsync,
-   .fsync  = simple_fsync,
 #ifdef CONFIG_CHECKPOINT
.checkpoint = generic_file_checkpoint,
 #endif


[Devel] if you use user namespaces

2011-02-06 Thread Serge E. Hallyn
Please let me know.  lxc does not use them right now.  Libvirt uses them
for lxc containers if they are available, but I hope we can essentially
have it stop for a while.  In addition, there's tons of software out
there that I don't know about, and fear of breaking their use of current
user namespaces has been keeping me from pushing further userns patches.

I've outlined how I see user namespaces developing at
https://wiki.ubuntu.com/UserNamespace .  Note there is nothing new
in there - some of it goes a year back, much of it more than two
years.  Nothing actually new.

Currently user namespaces are not very useful, but they do provide
separate uid accounting, and simply tossing CLONE_NEWUSER in with
CLONE_NEWNS and friends has until now been safe to do.  As you can
see, that is going to change.  So if that would cause you pain that
you can't work around, please get back to me.  Otherwise, I'd like
to get serious soon about expanding upon, and pushing upstream, the
patches to make CLONE_NEWUSER more useful for sandboxing.

thanks,
-serge


[Devel] Re: [PATCH 03/08] allow sethostname in a container

2011-02-04 Thread Serge E. Hallyn
Quoting Serge E. Hallyn (se...@hallyn.com):
> Quoting Serge E. Hallyn (se...@hallyn.com):
> > Signed-off-by: Serge E. Hallyn 
> > ---
> >  kernel/sys.c |2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 2745dcd..9b9b03b 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -1171,7 +1171,7 @@ SYSCALL_DEFINE2(sethostname, char __user *, name, int, len)
> > int errno;
> > char tmp[__NEW_UTS_LEN];
> >  
> > -   if (!capable(CAP_SYS_ADMIN))
> > +   if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN))
> > return -EPERM;
> > if (len < 0 || len > __NEW_UTS_LEN)
> > return -EINVAL;
> > -- 
> > 1.7.0.4
> 
> An interesting note here is that since the task doing ns_exec (and
> therefore in the init_user_ns) requires CAP_SYS_ADMIN to unshare,
> this check will actually always be true if uts_ns was not unshared.

No one ever called me on this, so for the sake of posterity reading the
m-l archives:  what I said above is not true.  If uts_ns was not
unshared, then current->nsproxy->uts_ns->user_ns != current_user_ns(),
so current should not have ns_capable(current->nsproxy->uts_ns->user_ns,
CAP_SYS_ADMIN).  So the check will always return false.
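
The correction above can be checked against a toy model of the capability
walk.  This is only a sketch under stated assumptions: struct user_ns,
struct task and toy_ns_capable() are illustrative names, the task is assumed
to sit in a child user namespace, and the real check has further rules (for
instance for the creator of a namespace) that are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these types and names are illustrative, not kernel API. */
struct user_ns {
    const struct user_ns *parent;    /* NULL for the initial namespace */
};

struct task {
    const struct user_ns *user_ns;   /* namespace the task lives in */
    bool cap_sys_admin;              /* effective in its own namespace */
};

/* A task is capable toward target if it holds the capability in its own
 * namespace and target is that namespace or one of its descendants. */
bool toy_ns_capable(const struct task *t, const struct user_ns *target)
{
    const struct user_ns *ns;

    assert(t != NULL && target != NULL);
    for (ns = target; ns != NULL; ns = ns->parent)
        if (ns == t->user_ns)
            return t->cap_sys_admin;
    return false;    /* target is an ancestor of, or unrelated to, the task's ns */
}

/* Container root, uts_ns not unshared: uts_ns->user_ns is the initial ns. */
bool demo_shared_uts(void)
{
    struct user_ns init_ns = { NULL }, child = { &init_ns };
    struct task root_in_child = { &child, true };

    return toy_ns_capable(&root_in_child, &init_ns);
}

/* Container root, uts_ns unshared together with the user namespace. */
bool demo_unshared_uts(void)
{
    struct user_ns init_ns = { NULL }, child = { &init_ns };
    struct task root_in_child = { &child, true };

    return toy_ns_capable(&root_in_child, &child);
}

/* Host root acting on a uts_ns owned by a child user namespace. */
bool demo_host_root(void)
{
    struct user_ns init_ns = { NULL }, child = { &init_ns };
    struct task host_root = { &init_ns, true };

    return toy_ns_capable(&host_root, &child);
}
```

In this model demo_shared_uts() is false, matching the corrected statement,
while demo_unshared_uts() and demo_host_root() are both true.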

> If uts is unshared, then regular capabilities semantics in the
> child user_ns apply (that is, root can do sethostname, unpriv user
> cannot).  The intent is that user namespaces will eventually allow
> unprivileged users to unshare, after which this will make much more
> sense.
> 
> -serge


[Devel] Re: udev in containers

2011-01-28 Thread Serge E. Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> "Serge E. Hallyn"  writes:
> 
> > Hi,
> >
> > Now that we are allowing udev to run in containers, Daniel has
> > noticed that updates to sysfs uevent files will trigger a flurry
> > of activity in all containers on the host.  While not a problem
> > with just a few containers, this can severely impact performance
> > with hundreds or more containers.
> >
> > (Daniel, would it be possible for you to get some measurements
> > on host and in a container versus # of active containers, with
> > and without udev?  Do you have an otherwise unused machine you
> > could try that on?)
> >
> > Is there anything we can/should do about this?
> >
> > Two approaches, neither sufficiently thought out yet, would be
> > to generalize the directory tagging currently used for
> > /sys/class/net, and full-fledged implementation of a device
> > namespace.
> >
> > The directory tagging would probably only work if we can assign
> > multiple tags to a device, but we could for instance make
> > /sys/block tagged, and really no container probably needs to see
> > /sys/block/sda.
> >
> > The device namespace would be similar, except I suspect it
> > would not only hide certain devices from certain namespaces,
> > but it would actually virtualize the device major:minor
> > mapping, for checkpoint/restart, so that /dev/sda could be
> > redirected to another device more completely than simply
> > fudging the nodes under /dev.
> >
> > Comments?  Designs?  Plans?
> 
> To answer your earlier question: What did I expect the device namespace
> to look like.
> 
> - Only purely virtual devices like /dev/pts, /dev/null, /dev/nbd and
>   /dev/loop0 present.
> - Fully virtualized major/minor look up preventing us from even talking
>   about devices in other namespaces.

What does an interface look like for hooking up /dev/sdc on the host
to 'b 8:0' in a container?
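
Purely to make the question concrete, one shape such a hookup could take is
a per-namespace translation table from the device number a container sees
to the host device behind it.  Everything in this sketch is hypothetical:
struct devnum, struct dev_ns and dev_ns_translate() do not exist anywhere,
the block/char distinction is ignored, and the host /dev/sdc is assumed to
carry the conventional number 8:32.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical major:minor pair (block vs. char is ignored here). */
struct devnum {
    unsigned int major, minor;
};

/* One mapping: what the container sees, and the host device behind it. */
struct dev_map_entry {
    struct devnum virt;
    struct devnum host;
};

/* Hypothetical device namespace: just a translation table. */
struct dev_ns {
    const struct dev_map_entry *map;
    size_t count;
};

/* Returns 0 and fills *host if virt is mapped in ns; returns -1 if the
 * device does not exist from this namespace's point of view. */
int dev_ns_translate(const struct dev_ns *ns, struct devnum virt,
                     struct devnum *host)
{
    size_t i;

    assert(ns != NULL && host != NULL);
    for (i = 0; i < ns->count; i++) {
        if (ns->map[i].virt.major == virt.major &&
            ns->map[i].virt.minor == virt.minor) {
            *host = ns->map[i].host;
            return 0;
        }
    }
    return -1;
}

/* Container view: 8:0 is backed by host 8:32.  Returns the host minor
 * for the given container device, or -1 if it is not visible. */
int demo_lookup_minor(unsigned int major, unsigned int minor)
{
    static const struct dev_map_entry map[] = {
        { { 8, 0 }, { 8, 32 } },
    };
    const struct dev_ns ns = { map, sizeof(map) / sizeof(map[0]) };
    struct devnum host;

    if (dev_ns_translate(&ns, (struct devnum){ major, minor }, &host) != 0)
        return -1;
    return (int)host.minor;
}
```

With such a table, unmapped numbers simply do not resolve, which is one way
to read "preventing us from even talking about devices in other namespaces".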

> - Support from the user/security namespace so that mknod and mount are safe.
> 
> I get a certain uncomfortable feeling about mknod and mount running free
> in a container without restrictions that make container without 
> restrictions...

-serge


[Devel] udev in containers

2011-01-28 Thread Serge E. Hallyn
Hi,

Now that we are allowing udev to run in containers, Daniel has
noticed that updates to sysfs uevent files will trigger a flurry
of activity in all containers on the host.  While not a problem
with just a few containers, this can severely impact performance
with hundreds or more containers.

(Daniel, would it be possible for you to get some measurements
on host and in a container versus # of active containers, with
and without udev?  Do you have an otherwise unused machine you
could try that on?)

Is there anything we can/should do about this?

Two approaches, neither sufficiently thought out yet, would be
to generalize the directory tagging currently used for
/sys/class/net, and full-fledged implementation of a device
namespace.

The directory tagging would probably only work if we can assign
multiple tags to a device, but we could for instance make
/sys/block tagged, and really no container probably needs to see
/sys/block/sda.

The device namespace would be similar, except I suspect it
would not only hide certain devices from certain namespaces,
but it would actually virtualize the device major:minor
mapping, for checkpoint/restart, so that /dev/sda could be
redirected to another device more completely than simply
fudging the nodes under /dev.

Comments?  Designs?  Plans?

thanks,
-serge


[Devel] Re: device(s) namespace

2011-01-28 Thread Serge E. Hallyn
Quoting Oren Laadan (or...@cs.columbia.edu):
> Hi,
> 
> I vaguely recall some discussions/ideas about the possibility
> of a devices namespace, its pros and cons, and alternatives.
> Related to that is also device virtualization, and isolation
> of devices in containers.
> 
> Any thoughts and/or pointers to past/current discussions are
> welcome :)
> 
> Thanks,

What I find in my old containers folders when grepping for
'device namespace' is mainly vague references to devicens as
a more ideal solution to other patches being submitted (namely
the device cgroup, the sysfs directory tagging, and ptyns).

A link which was referenced in one of those emails:

https://lists.linux-foundation.org/pipermail/containers/2008-April/010810.html

So while I'm pretty sure I have in the past seen discussions on what
the device namespace would look like, they must have been on irc or
in person.

-serge


[Devel] Re: device(s) namespace

2011-01-27 Thread Serge E. Hallyn
Quoting Oren Laadan (or...@cs.columbia.edu):
> Hi,
> 
> I vaguely recall some discussions/ideas about the possibility
> of a devices namespace, its pros and cons, and alternatives.
> Related to that is also device virtualization, and isolation
> of devices in containers.
> 
> Any thoughts and/or pointers to past/current discussions are
> welcome :)

I'm hoping to get my archive disk out this weekend or Monday
and search for these, if no one else finds them before that.

-serge


[Devel] Re: [PATCH 4/7] allow killing tasks in your own or child userns

2011-01-15 Thread Serge E. Hallyn
Quoting Bastian Blank (bast...@waldi.eu.org):
> On Sat, Jan 15, 2011 at 12:31:14AM +0000, Serge E. Hallyn wrote:
> > Quoting Bastian Blank (bast...@waldi.eu.org):
> > > On Tue, Jan 11, 2011 at 01:31:52AM +0000, Serge E. Hallyn wrote:
> > > > Quoting Bastian Blank (bast...@waldi.eu.org):
> > > > > What is this flag used for anyway? I only see it used in the accounting
> > > > > stuff, and if every user can get it, it is no longer useful.
> > > > hm, I'm not sure...  maybe no one is using it!
> > > However with your patches (or at least the goal), everyone is super-user
> > > in derived namespaces.
> > 
> > No, a task just sitting in a derived ns won't necessarily need/use
> > super-user privileges...  (and, if we ever get far enough along, it
> > won't even necessarily have CAP_SYS_ADMIN/etc targeted to the parent
> > userns, bc it won't need those to do the unshares).
> 
> The goal, as I understand it, is that everyone can create derived user
> namespaces. However, the creator automatically has all the capabilities
> in the derived namespace.

Yes.  While he is root in that namespace.  But then he can drop
privileges and spawn new tasks, which have no capabilities in any
namespace.

Even if he didn't do that, the PF_SUPERPRIV flag doesn't say that
you have capabilities, but that you used them.  So if I start a task
as root which never does anything but ls /root, then the flag should
not be set.
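
The "used, not held" distinction can be shown with a toy model.  This is a
sketch only: struct task, toy_capable() and TOY_PF_SUPERPRIV are made-up
names with an arbitrary flag value, modelled on my understanding that the
flag is set when a capability check succeeds rather than when a capability
is granted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PF_SUPERPRIV 0x1u    /* illustrative value, not the kernel's */

/* Toy task: whether it holds the capability, plus accounting flags. */
struct task {
    bool has_cap;
    unsigned int flags;
};

/* Permission check that records *use* of privilege on success. */
bool toy_capable(struct task *t)
{
    assert(t != NULL);
    if (!t->has_cap)
        return false;
    t->flags |= TOY_PF_SUPERPRIV;
    return true;
}

/* Root task that never needs a capability check (the "ls /root" case). */
bool demo_idle_root_flagged(void)
{
    struct task t = { true, 0 };

    return (t.flags & TOY_PF_SUPERPRIV) != 0;
}

/* Root task that performs one privileged operation. */
bool demo_active_root_flagged(void)
{
    struct task t = { true, 0 };

    (void)toy_capable(&t);
    return (t.flags & TOY_PF_SUPERPRIV) != 0;
}

/* Unprivileged task whose check fails: nothing is recorded. */
bool demo_unpriv_flagged(void)
{
    struct task t = { false, 0 };

    (void)toy_capable(&t);
    return (t.flags & TOY_PF_SUPERPRIV) != 0;
}
```

Only demo_active_root_flagged() reports the flag, so a task that merely
holds every capability in a derived namespace stays unflagged until it
actually exercises one.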

> So he can use them to gain this super-user
> flag. Even killing a task in the derived namespace needs the capability
> already.

-serge


[Devel] Re: [PATCH 4/7] allow killing tasks in your own or child userns

2011-01-14 Thread Serge E. Hallyn
Quoting Bastian Blank (bast...@waldi.eu.org):
> On Tue, Jan 11, 2011 at 01:31:52AM +0000, Serge E. Hallyn wrote:
> > Quoting Bastian Blank (bast...@waldi.eu.org):
> > > What is this flag used for anyway? I only see it used in the accounting
> > > stuff, and if every user can get it, it is no longer useful.
> > hm, I'm not sure...  maybe no one is using it!
> 
> This flag is from pre-git.
> 
> The only information is:
> | #define ASU 0x02 /* ... used super-user privileges */
> 
> However with your patches (or at least the goal), everyone is super-user
> in derived namespaces.

No, a task just sitting in a derived ns won't necessarily need/use
super-user privileges...  (and, if we ever get far enough along, it
won't even necessarily have CAP_SYS_ADMIN/etc targeted to the parent
userns, bc it won't need those to do the unshares).

thanks,
-serge


[Devel] Re: [PATCH 03/08] allow sethostname in a container

2011-01-11 Thread Serge E. Hallyn
Quoting Serge E. Hallyn (se...@hallyn.com):
> Signed-off-by: Serge E. Hallyn 
> ---
>  kernel/sys.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 2745dcd..9b9b03b 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1171,7 +1171,7 @@ SYSCALL_DEFINE2(sethostname, char __user *, name, int, len)
>   int errno;
>   char tmp[__NEW_UTS_LEN];
>  
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN))
>   return -EPERM;
>   if (len < 0 || len > __NEW_UTS_LEN)
>   return -EINVAL;
> -- 
> 1.7.0.4

An interesting note here is that since the task doing ns_exec (and
therefore in the init_user_ns) requires CAP_SYS_ADMIN to unshare,
this check will actually always be true if uts_ns was not unshared.
If uts is unshared, then regular capabilities semantics in the
child user_ns apply (that is, root can do sethostname, unpriv user
cannot).  The intent is that user namespaces will eventually allow
unprivileged users to unshare, after which this will make much more
sense.

-serge


[Devel] [PATCH 08/08] userns: check user namespace for task->file uid equivalence checks

2011-01-10 Thread Serge E. Hallyn
Cheat for now and say all files belong to init_user_ns.  Next
step will be to let superblocks belong to a user_ns, and derive
inode_userns(inode) from inode->i_sb->s_user_ns.  Finally we'll
introduce more flexible arrangements.

Signed-off-by: Serge E. Hallyn 
---
 fs/inode.c |   17 +
 fs/namei.c |   20 +++-
 include/linux/fs.h |9 +++--
 3 files changed, 39 insertions(+), 7 deletions(-)
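
The effect of the acl_permission_check() change is that owner and group
bits only count when the caller and the inode share a user namespace;
otherwise only the "other" bits apply.  A minimal userspace sketch,
assuming toy types with an integer namespace id, a single-group membership
test and no ACLs (all names below are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAY_EXEC  1u
#define MAY_WRITE 2u
#define MAY_READ  4u

/* Toy stand-ins; userns is just an id here, not a kernel pointer. */
struct toy_inode {
    unsigned int mode;    /* classic rwxrwxrwx bits */
    unsigned int uid, gid;
    int userns;
};

struct toy_cred {
    unsigned int fsuid, fsgid;
    int userns;
};

/* True if the rwx bits selected for this caller grant every bit in mask. */
bool toy_dac_allows(const struct toy_cred *c, const struct toy_inode *i,
                    unsigned int mask)
{
    unsigned int mode;

    assert(c != NULL && i != NULL);
    mode = i->mode;
    mask &= MAY_READ | MAY_WRITE | MAY_EXEC;

    if (c->userns == i->userns) {
        if (c->fsuid == i->uid)
            mode >>= 6;    /* owner bits */
        else if (c->fsgid == i->gid)
            mode >>= 3;    /* group bits */
    }
    /* Different namespace: ids are not comparable, "other" bits apply. */
    return (mask & ~mode) == 0;
}

/* Access to a 0640 file owned by uid 1000, gid 100 in namespace 0. */
bool demo_access(int userns, unsigned int uid, unsigned int gid,
                 unsigned int mask)
{
    const struct toy_inode file = { 0640, 1000, 100, 0 };
    const struct toy_cred cred = { uid, gid, userns };

    return toy_dac_allows(&cred, &file, mask);
}
```

For example, demo_access(0, 1000, 100, MAY_READ | MAY_WRITE) succeeds,
while the same numeric uid in namespace 1 is refused even read access.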

diff --git a/fs/inode.c b/fs/inode.c
index 4a924c7..ebb2b85 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * This is needed for the following functions:
@@ -1710,6 +1711,22 @@ void inode_init_owner(struct inode *inode, const struct inode *dir,
 }
 EXPORT_SYMBOL(inode_init_owner);
 
+/*
+ * return 1 if current either has CAP_FOWNER to the
+ * file, or owns the file.
+ */
+int is_owner_or_cap(struct inode *inode)
+{
+   struct user_namespace *ns = inode_userns(inode);
+
+   if (current_user_ns() == ns && current_fsuid() == inode->i_uid)
+   return 1;
+   if (ns_capable(ns, CAP_FOWNER))
+   return 1;
+   return 0;
+}
+EXPORT_SYMBOL(is_owner_or_cap);
+
 #define CREATE_TRACE_POINTS
 #include 
 
diff --git a/fs/namei.c b/fs/namei.c
index b020c45..d956ce3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -176,6 +176,9 @@ static int acl_permission_check(struct inode *inode, int mask,
 
mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
 
+   if (current_user_ns() != inode_userns(inode))
+   goto other_perms;
+
if (current_fsuid() == inode->i_uid)
mode >>= 6;
else {
@@ -189,6 +192,7 @@ static int acl_permission_check(struct inode *inode, int mask,
mode >>= 3;
}
 
+other_perms:
/*
 * If the DACs are ok we don't need any capability check.
 */
@@ -225,7 +229,7 @@ int generic_permission(struct inode *inode, int mask,
 * Executable DACs are overridable if at least one exec bit is set.
 */
if (!(mask & MAY_EXEC) || execute_ok(inode))
-   if (capable(CAP_DAC_OVERRIDE))
+   if (ns_capable(inode_userns(inode), CAP_DAC_OVERRIDE))
return 0;
 
/*
@@ -233,7 +237,7 @@ int generic_permission(struct inode *inode, int mask,
 */
mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
if (mask == MAY_READ || (S_ISDIR(inode->i_mode) && !(mask & MAY_WRITE)))
-   if (capable(CAP_DAC_READ_SEARCH))
+   if (ns_capable(inode_userns(inode), CAP_DAC_READ_SEARCH))
return 0;
 
return -EACCES;
@@ -463,6 +467,7 @@ force_reval_path(struct path *path, struct nameidata *nd)
 static int exec_permission(struct inode *inode)
 {
int ret;
+   struct user_namespace *ns = inode_userns(inode);
 
if (inode->i_op->permission) {
ret = inode->i_op->permission(inode, MAY_EXEC);
@@ -474,7 +479,7 @@ static int exec_permission(struct inode *inode)
if (!ret)
goto ok;
 
-   if (capable(CAP_DAC_OVERRIDE) || capable(CAP_DAC_READ_SEARCH))
+   if (ns_capable(ns, CAP_DAC_OVERRIDE) || ns_capable(ns, CAP_DAC_READ_SEARCH))
goto ok;
 
return ret;
@@ -1262,11 +1267,15 @@ static inline int check_sticky(struct inode *dir, struct inode *inode)
 
if (!(dir->i_mode & S_ISVTX))
return 0;
+   if (current_user_ns() != inode_userns(inode))
+   goto other_userns;
if (inode->i_uid == fsuid)
return 0;
if (dir->i_uid == fsuid)
return 0;
-   return !capable(CAP_FOWNER);
+
+other_userns:
+   return !ns_capable(inode_userns(inode), CAP_FOWNER);
 }
 
 /*
@@ -1954,7 +1963,8 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev)
if (error)
return error;
 
-   if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
+   if ((S_ISCHR(mode) || S_ISBLK(mode)) &&
+   !ns_capable(inode_userns(dir), CAP_MKNOD))
return -EPERM;
 
if (!dir->i_op->mknod)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 090f0ea..674c06a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1437,8 +1437,13 @@ enum {
 #define put_fs_excl() atomic_dec(&current->fs_excl)
 #define has_fs_excl() atomic_read(&current->fs_excl)
 
-#define is_owner_or_cap(inode) \
-   ((current_fsuid() == (inode)->i_uid) || capable(CAP_FOWNER))
+/*
+ * until VFS tracks user namespaces for inodes, just make all files
+ * belong to init_user_ns
+ */
+extern struct user_namespace init_user_ns;
+#define inode_userns(inode) (&init_user_ns)
+extern int is_owner_or_cap(struct inode *inode);
 
 /* not quite ready to be deprecated, but... */
 extern vo

[Devel] [PATCH 06/08] user namespaces: convert all capable checks in kernel/sys.c

2011-01-10 Thread Serge E. Hallyn
This allows setuid/setgid in containers.  It also fixes some
corner cases where kernel logic foregoes capability checks when
uids are equivalent.  The latter will need to be done throughout
the whole kernel.

Changelog:
Jan 11: Use nsown_capable() as suggested by Bastian Blank.
Jan 11: Fix logic errors in uid checks pointed out by Bastian.

Signed-off-by: Serge E. Hallyn 
---
 kernel/sys.c |   67 +++--
 1 files changed, 41 insertions(+), 26 deletions(-)
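
As I read the series, nsown_capable(cap) checks the capability against the
caller's own user namespace, which is what lets root inside a container
change ids there.  The setgid() logic visible in the hunk below can be
sketched in userspace as follows, assuming a toy credential struct and a
boolean that stands in for the result of nsown_capable(CAP_SETGID) (the
toy_* and demo_* names are illustrative):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy credentials: only the four group ids matter for this sketch. */
struct toy_gids {
    unsigned int gid, egid, sgid, fsgid;
};

/* cap_setgid_own_ns stands in for nsown_capable(CAP_SETGID). */
int toy_setgid(struct toy_gids *c, unsigned int gid, bool cap_setgid_own_ns)
{
    assert(c != NULL);
    if (cap_setgid_own_ns)
        c->gid = c->egid = c->sgid = c->fsgid = gid;
    else if (gid == c->gid || gid == c->sgid)
        c->egid = c->fsgid = gid;
    else
        return -EPERM;
    return 0;
}

/* Starts from gid 100, egid 200, sgid 300 and returns the effective gid
 * after the call, or -1 if the call was refused. */
int demo_egid_after_setgid(unsigned int gid, bool cap)
{
    struct toy_gids c = { 100, 200, 300, 200 };

    if (toy_setgid(&c, gid, cap) != 0)
        return -1;
    return (int)c.egid;
}
```

Without the capability, the caller may only switch to its real or saved
gid; with it, any gid is accepted and all four ids change together.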

diff --git a/kernel/sys.c b/kernel/sys.c
index 9b9b03b..b68cd67 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -116,17 +116,29 @@ EXPORT_SYMBOL(cad_pid);
 
 void (*pm_power_off_prepare)(void);
 
+/* called with rcu_read_lock, creds are safe */
+static inline int set_one_prio_perm(struct task_struct *p)
+{
+   const struct cred *cred = current_cred(), *pcred = __task_cred(p);
+
+   if (pcred->user->user_ns == cred->user->user_ns &&
+   (pcred->uid  == cred->euid ||
+pcred->euid == cred->euid))
+   return 1;
+   if (ns_capable(pcred->user->user_ns, CAP_SYS_NICE))
+   return 1;
+   return 0;
+}
+
 /*
  * set the priority of a task
  * - the caller must hold the RCU read lock
  */
 static int set_one_prio(struct task_struct *p, int niceval, int error)
 {
-   const struct cred *cred = current_cred(), *pcred = __task_cred(p);
-   int no_nice;
+   int ret, no_nice;
 
-   if (pcred->uid  != cred->euid &&
-   pcred->euid != cred->euid && !capable(CAP_SYS_NICE)) {
+   if (!set_one_prio_perm(p)) {
error = -EPERM;
goto out;
}
@@ -496,7 +508,7 @@ SYSCALL_DEFINE2(setregid, gid_t, rgid, gid_t, egid)
if (rgid != (gid_t) -1) {
if (old->gid == rgid ||
old->egid == rgid ||
-   capable(CAP_SETGID))
+   nsown_capable(CAP_SETGID))
new->gid = rgid;
else
goto error;
@@ -505,7 +517,7 @@ SYSCALL_DEFINE2(setregid, gid_t, rgid, gid_t, egid)
if (old->gid == egid ||
old->egid == egid ||
old->sgid == egid ||
-   capable(CAP_SETGID))
+   nsown_capable(CAP_SETGID))
new->egid = egid;
else
goto error;
@@ -540,7 +552,7 @@ SYSCALL_DEFINE1(setgid, gid_t, gid)
old = current_cred();
 
retval = -EPERM;
-   if (capable(CAP_SETGID))
+   if (nsown_capable(CAP_SETGID))
new->gid = new->egid = new->sgid = new->fsgid = gid;
else if (gid == old->gid || gid == old->sgid)
new->egid = new->fsgid = gid;
@@ -607,7 +619,7 @@ SYSCALL_DEFINE2(setreuid, uid_t, ruid, uid_t, euid)
new->uid = ruid;
if (old->uid != ruid &&
old->euid != ruid &&
-   !capable(CAP_SETUID))
+   !nsown_capable(CAP_SETUID))
goto error;
}
 
@@ -616,7 +628,7 @@ SYSCALL_DEFINE2(setreuid, uid_t, ruid, uid_t, euid)
if (old->uid != euid &&
old->euid != euid &&
old->suid != euid &&
-   !capable(CAP_SETUID))
+   !nsown_capable(CAP_SETUID))
goto error;
}
 
@@ -664,7 +676,7 @@ SYSCALL_DEFINE1(setuid, uid_t, uid)
old = current_cred();
 
retval = -EPERM;
-   if (capable(CAP_SETUID)) {
+   if (nsown_capable(CAP_SETUID)) {
new->suid = new->uid = uid;
if (uid != old->uid) {
retval = set_user(new);
@@ -706,7 +718,7 @@ SYSCALL_DEFINE3(setresuid, uid_t, ruid, uid_t, euid, uid_t, suid)
old = current_cred();
 
retval = -EPERM;
-   if (!capable(CAP_SETUID)) {
+   if (!nsown_capable(CAP_SETUID)) {
if (ruid != (uid_t) -1 && ruid != old->uid &&
ruid != old->euid  && ruid != old->suid)
goto error;
@@ -770,7 +782,7 @@ SYSCALL_DEFINE3(setresgid, gid_t, rgid, gid_t, egid, gid_t, sgid)
old = current_cred();
 
retval = -EPERM;
-   if (!capable(CAP_SETGID)) {
+   if (!nsown_capable(CAP_SETGID)) {
if (rgid != (gid_t) -1 && rgid != old->gid &&
rgid != old->egid  && rgid != old->sgid)
goto error;
@@ -830,7 +842,7 @@ SYSCALL_DEFINE1(setfsuid, uid_t, uid)
 
if (uid == old->uid  || uid == old->euid  ||
uid == old->suid || uid == old->fsuid ||
-   capable(CAP_SETUID)) {
+  

[Devel] [PATCH 05/08] Allow ptrace from non-init user namespaces

2011-01-10 Thread Serge E. Hallyn
ptrace is allowed to tasks in the same user namespace according to
the usual rules (i.e. the same rules as for two tasks in the init
user namespace).  ptrace is also allowed into a user namespace over
which the current task has the CAP_SYS_PTRACE capability.

Changelog:
Dec 31: Address feedback by Eric:
. Correct ptrace uid check
. Rename may_ptrace_ns to ptrace_capable
. Also fix the cap_ptrace checks.
Jan  1: Use const cred struct
Jan 11: use task_ns_capable() in place of ptrace_capable().

Signed-off-by: Serge E. Hallyn 
---
 include/linux/capability.h |2 +
 include/linux/user_namespace.h |9 +++
 kernel/ptrace.c|   27 --
 kernel/user_namespace.c|   16 +
 security/commoncap.c   |   48 +--
 5 files changed, 82 insertions(+), 20 deletions(-)
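
The rewritten test in __ptrace_may_access() is the De Morgan dual of the old
one: instead of "any id differs and no capability means deny", it reads
"same namespace and every id equal means allow, otherwise a capability over
the target's namespace is needed".  A minimal userspace sketch, assuming toy
credentials with an integer namespace id and a boolean that stands in for
ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy credentials; userns is just an id here, not a kernel pointer. */
struct toy_cred {
    int userns;
    unsigned int uid, euid, suid;
    unsigned int gid, egid, sgid;
};

/* All of the target's uids and gids must equal the tracer's real ids,
 * and the comparison only makes sense inside one user namespace. */
static bool creds_match(const struct toy_cred *c, const struct toy_cred *t)
{
    return c->userns == t->userns &&
           c->uid == t->euid && c->uid == t->suid && c->uid == t->uid &&
           c->gid == t->egid && c->gid == t->sgid && c->gid == t->gid;
}

/* cap_over_target_ns stands in for the CAP_SYS_PTRACE check against the
 * target's user namespace. */
bool toy_ptrace_may_access(const struct toy_cred *c, const struct toy_cred *t,
                           bool cap_over_target_ns)
{
    assert(c != NULL && t != NULL);
    return creds_match(c, t) || cap_over_target_ns;
}

/* Target is uid/gid 1000 in namespace 1; the tracer uses one id value
 * for all of its uids and gids. */
bool demo_ptrace(int tracer_ns, unsigned int tracer_id, bool cap)
{
    const struct toy_cred target = { 1, 1000, 1000, 1000, 1000, 1000, 1000 };
    const struct toy_cred tracer = { tracer_ns, tracer_id, tracer_id,
                                     tracer_id, tracer_id, tracer_id,
                                     tracer_id };

    return toy_ptrace_may_access(&tracer, &target, cap);
}
```

A tracer with the same numeric ids but in a different namespace is refused
unless it holds the capability over the target's namespace.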

diff --git a/include/linux/capability.h b/include/linux/capability.h
index 1711ff5..9095fdc 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -543,6 +543,8 @@ extern const kernel_cap_t __cap_init_eff_set;
  */
 #define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0)
 
+#define has_ns_capability(t, ns, cap) (security_real_capable((t), (ns), (cap)) == 0)
+
 /**
  * has_capability_noaudit - Determine if a task has a superior capability available (unaudited)
  * @t: The task in question
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 8178156..91c4f10 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -39,6 +39,9 @@ static inline void put_user_ns(struct user_namespace *ns)
 uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, uid_t uid);
 gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t gid);
 
+int same_or_ancestor_user_ns(struct task_struct *task,
+   struct task_struct *victim);
+
 #else
 
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
@@ -66,6 +69,12 @@ static inline gid_t user_ns_map_gid(struct user_namespace *to,
return gid;
 }
 
+static inline int same_or_ancestor_user_ns(struct task_struct *task,
+   struct task_struct *victim)
+{
+   return 1;
+}
+
 #endif
 
 #endif /* _LINUX_USER_H */
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 99bbaa3..ec7605d 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -134,21 +134,24 @@ int __ptrace_may_access(struct task_struct *task, unsigned int mode)
return 0;
rcu_read_lock();
tcred = __task_cred(task);
-   if ((cred->uid != tcred->euid ||
-cred->uid != tcred->suid ||
-cred->uid != tcred->uid  ||
-cred->gid != tcred->egid ||
-cred->gid != tcred->sgid ||
-cred->gid != tcred->gid) &&
-   !capable(CAP_SYS_PTRACE)) {
-   rcu_read_unlock();
-   return -EPERM;
-   }
+   if (cred->user->user_ns == tcred->user->user_ns &&
+   (cred->uid == tcred->euid &&
+cred->uid == tcred->suid &&
+cred->uid == tcred->uid  &&
+cred->gid == tcred->egid &&
+cred->gid == tcred->sgid &&
+cred->gid == tcred->gid))
+   goto ok;
+   if (ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE))
+   goto ok;
+   rcu_read_unlock();
+   return -EPERM;
+ok:
rcu_read_unlock();
smp_rmb();
if (task->mm)
dumpable = get_dumpable(task->mm);
-   if (!dumpable && !capable(CAP_SYS_PTRACE))
+   if (!dumpable && !task_ns_capable(task, CAP_SYS_PTRACE))
return -EPERM;
 
return security_ptrace_access_check(task, mode);
@@ -198,7 +201,7 @@ int ptrace_attach(struct task_struct *task)
goto unlock_tasklist;
 
task->ptrace = PT_PTRACED;
-   if (capable(CAP_SYS_PTRACE))
+   if (task_ns_capable(task, CAP_SYS_PTRACE))
task->ptrace |= PT_PTRACE_CAP;
 
__ptrace_link(task, current);
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 2591583..4b70999 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -126,3 +126,19 @@ gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t
/* No useful relationship so no mapping */
return overflowgid;
 }
+
+int same_or_ancestor_user_ns(struct task_struct *task,
+   struct task_struct *victim)
+{
+   struct user_namespace *u1 = task_cred_xxx(task, user)->user_ns;
+   struct user_namespace *u2 = task_cred_xxx(victim, user)->user_ns;
+   for (;;) {
+   if (u1 == u2)
+   return 1;
+

[Devel] [PATCH 04/08] allow killing tasks in your own or child userns

2011-01-10 Thread Serge E. Hallyn
Changelog:
Dec  8: Fixed bug in my check_kill_permission pointed out by
Eric Biederman.
Dec 13: Apply Eric's suggestion to pass target task into
kill_ok_by_cred() for clarity
Dec 31: address comment by Eric Biederman:
don't need cred/tcred in check_kill_permission.
Jan  1: use const cred struct.
Jan 11: Per Bastian Blank's advice, clean up kill_ok_by_cred().

Signed-off-by: Serge E. Hallyn 
Reviewed-by: "Eric W. Biederman" 
---
 kernel/signal.c |   30 ++
 1 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 4e3cff1..b64d06e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -636,13 +636,33 @@ static inline bool si_fromuser(const struct siginfo *info)
 }
 
 /*
+ * called with RCU read lock from check_kill_permission()
+ */
+static inline int kill_ok_by_cred(struct task_struct *t)
+{
+   const struct cred *cred = current_cred();
+   const struct cred *tcred = __task_cred(t);
+
+   if (cred->user->user_ns == tcred->user->user_ns &&
+   (cred->euid == tcred->suid ||
+cred->euid == tcred->uid ||
+cred->uid  == tcred->suid ||
+cred->uid  == tcred->uid))
+   return 1;
+
+   if (ns_capable(tcred->user->user_ns, CAP_KILL))
+   return 1;
+
+   return 0;
+}
+
+/*
  * Bad permissions for sending the signal
  * - the caller must hold the RCU read lock
  */
 static int check_kill_permission(int sig, struct siginfo *info,
 struct task_struct *t)
 {
-   const struct cred *cred, *tcred;
struct pid *sid;
int error;
 
@@ -656,14 +676,8 @@ static int check_kill_permission(int sig, struct siginfo *info,
if (error)
return error;
 
-   cred = current_cred();
-   tcred = __task_cred(t);
if (!same_thread_group(current, t) &&
-   (cred->euid ^ tcred->suid) &&
-   (cred->euid ^ tcred->uid) &&
-   (cred->uid  ^ tcred->suid) &&
-   (cred->uid  ^ tcred->uid) &&
-   !capable(CAP_KILL)) {
+   !kill_ok_by_cred(t)) {
switch (sig) {
case SIGCONT:
sid = task_session(t);
-- 
1.7.0.4

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 03/08] allow sethostname in a container

2011-01-10 Thread Serge E. Hallyn
Signed-off-by: Serge E. Hallyn 
---
 kernel/sys.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 2745dcd..9b9b03b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1171,7 +1171,7 @@ SYSCALL_DEFINE2(sethostname, char __user *, name, int, len)
int errno;
char tmp[__NEW_UTS_LEN];
 
-   if (!capable(CAP_SYS_ADMIN))
+   if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN))
return -EPERM;
if (len < 0 || len > __NEW_UTS_LEN)
return -EINVAL;
-- 
1.7.0.4


[Devel] [PATCH 02/08] security: Make capabilities relative to the user namespace.

2011-01-10 Thread Serge E. Hallyn
- Introduce ns_capable to test for a capability in a non-default
  user namespace.
- Teach cap_capable to handle capabilities in a non-default
  user namespace.

The motivation is to get to the unprivileged creation of new
namespaces.  It looks like this gets us 90% of the way there, with
only potential uid confusion issues left.

I still need to handle getting all caps after creation but otherwise I
think I have a good starter patch that achieves all of your goals.

Changelog:
11/05/2010: [serge] add apparmor
12/14/2010: [serge] fix capabilities to created user namespaces
Without this, if user serge creates a user_ns, he won't have
capabilities to the user_ns he created.  This is because we
were first checking whether his effective caps had the caps
he needed and returning -EPERM if not, and THEN checking whether
he was the creator.  Reverse those checks.
12/16/2010: [serge] security_real_capable needs ns argument in !security case
01/11/2011: [serge] add task_ns_capable helper
01/11/2011: [serge] add nsown_capable() helper per Bastian Blank suggestion

Signed-off-by: Eric W. Biederman 
Signed-off-by: Serge E. Hallyn 

fold into 2

Signed-off-by: Serge E. Hallyn 
---
 include/linux/capability.h |   10 --
 include/linux/security.h   |   22 --
 kernel/capability.c|   32 ++--
 security/apparmor/lsm.c|5 +++--
 security/commoncap.c   |   40 +---
 security/security.c|   12 ++--
 security/selinux/hooks.c   |   14 +-
 7 files changed, 101 insertions(+), 34 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index 90012b9..1711ff5 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -541,7 +541,7 @@ extern const kernel_cap_t __cap_init_eff_set;
  *
  * Note that this does not set PF_SUPERPRIV on the task.
  */
-#define has_capability(t, cap) (security_real_capable((t), (cap)) == 0)
+#define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0)
 
 /**
  * has_capability_noaudit - Determine if a task has a superior capability available (unaudited)
@@ -555,9 +555,15 @@ extern const kernel_cap_t __cap_init_eff_set;
  * Note that this does not set PF_SUPERPRIV on the task.
  */
 #define has_capability_noaudit(t, cap) \
-   (security_real_capable_noaudit((t), (cap)) == 0)
+   (security_real_capable_noaudit((t), &init_user_ns, (cap)) == 0)
 
+struct user_namespace;
+extern struct user_namespace init_user_ns;
 extern int capable(int cap);
+extern int ns_capable(struct user_namespace *ns, int cap);
+extern int task_ns_capable(struct task_struct *t, int cap);
+
+#define nsown_capable(cap) (ns_capable(current_user_ns(), (cap)))
 
 /* audit system wants to get cap info from files as well */
 struct dentry;
diff --git a/include/linux/security.h b/include/linux/security.h
index 39f5b7e..2141f5a 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -46,13 +46,14 @@
 
 struct ctl_table;
 struct audit_krule;
+struct user_namespace;
 
 /*
  * These functions are in security/capability.c and are used
  * as the default capabilities functions
  */
 extern int cap_capable(struct task_struct *tsk, const struct cred *cred,
-  int cap, int audit);
+  struct user_namespace *ns, int cap, int audit);
 extern int cap_settime(struct timespec *ts, struct timezone *tz);
 extern int cap_ptrace_access_check(struct task_struct *child, unsigned int mode);
 extern int cap_ptrace_traceme(struct task_struct *parent);
@@ -1258,6 +1259,7 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
  * credentials.
  * @tsk contains the task_struct for the process.
  * @cred contains the credentials to use.
+ *  @ns contains the user namespace we want the capability in
  * @cap contains the capability .
  * @audit: Whether to write an audit message or not
  * Return 0 if the capability is granted for @tsk.
@@ -1386,7 +1388,7 @@ struct security_operations {
   const kernel_cap_t *inheritable,
   const kernel_cap_t *permitted);
int (*capable) (struct task_struct *tsk, const struct cred *cred,
-   int cap, int audit);
+   struct user_namespace *ns, int cap, int audit);
int (*sysctl) (struct ctl_table *table, int op);
int (*quotactl) (int cmds, int type, int id, struct super_block *sb);
int (*quota_on) (struct dentry *dentry);
@@ -1668,9 +1670,9 @@ int security_capset(struct cred *new, const struct cred *old,
const kernel_cap_t *effective,
const kernel_cap_t *inheritable,
const kernel_cap_t *permitted);
-int security_capable(int cap);
-int security_real_capable(struc

[Devel] [PATCH 07/08] user namespaces: convert several capable() calls

2011-01-10 Thread Serge E. Hallyn
CAP_IPC_OWNER and CAP_IPC_LOCK can be checked against current_user_ns(),
because the resource comes from current's own ipc namespace.

setuid/setgid are to uids in own namespace, so again checks can be
against current_user_ns().

Changelog:
Jan 11: Use task_ns_capable() in place of sched_capable().
Jan 11: Use nsown_capable() as suggested by Bastian Blank.
Jan 11: Clarify (hopefully) some logic in futex and sched.c

Signed-off-by: Serge E. Hallyn 
---
 ipc/shm.c |2 +-
 ipc/util.c|5 +++--
 kernel/futex.c|   11 ++-
 kernel/futex_compat.c |   11 ++-
 kernel/groups.c   |2 +-
 kernel/sched.c|9 ++---
 kernel/uid16.c|2 +-
 7 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/ipc/shm.c b/ipc/shm.c
index 7d3bb22..b5a0c2b 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -773,7 +773,7 @@ SYSCALL_DEFINE3(shmctl, int, shmid, int, cmd, struct shmid_ds __user *, buf)
 
audit_ipc_obj(&(shp->shm_perm));
 
-   if (!capable(CAP_IPC_LOCK)) {
+   if (!nsown_capable(CAP_IPC_LOCK)) {
uid_t euid = current_euid();
err = -EPERM;
if (euid != shp->shm_perm.uid &&
diff --git a/ipc/util.c b/ipc/util.c
index 69a0cc1..0bb65a6 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -627,7 +627,7 @@ int ipcperms (struct kern_ipc_perm *ipcp, short flag)
granted_mode >>= 3;
/* is there some bit set in requested_mode but not in granted_mode? */
if ((requested_mode & ~granted_mode & 0007) && 
-   !capable(CAP_IPC_OWNER))
+   !nsown_capable(CAP_IPC_OWNER))
return -1;
 
return security_ipc_permission(ipcp, flag);
@@ -800,7 +800,8 @@ struct kern_ipc_perm *ipcctl_pre_down(struct ipc_ids *ids, int id, int cmd,
 
euid = current_euid();
if (euid == ipcp->cuid ||
-   euid == ipcp->uid  || capable(CAP_SYS_ADMIN))
+   euid == ipcp->uid  ||
+   nsown_capable(CAP_SYS_ADMIN))
return ipcp;
 
err = -EPERM;
diff --git a/kernel/futex.c b/kernel/futex.c
index 40a8777..f02cb1c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2401,10 +2401,19 @@ SYSCALL_DEFINE3(get_robust_list, int, pid,
goto err_unlock;
ret = -EPERM;
pcred = __task_cred(p);
+   /* If victim is in different user_ns, then uids are not
+  comparable, so we must have CAP_SYS_PTRACE */
+   if (cred->user->user_ns != pcred->user->user_ns) {
+   if (!ns_capable(pcred->user->user_ns, CAP_SYS_PTRACE))
+   goto err_unlock;
+   goto ok;
+   }
+   /* If victim is in same user_ns, then uids are comparable */
if (cred->euid != pcred->euid &&
cred->euid != pcred->uid &&
-   !capable(CAP_SYS_PTRACE))
+   !ns_capable(pcred->user->user_ns, CAP_SYS_PTRACE))
goto err_unlock;
+ok:
head = p->robust_list;
rcu_read_unlock();
}
diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c
index a7934ac..5f9e689 100644
--- a/kernel/futex_compat.c
+++ b/kernel/futex_compat.c
@@ -153,10 +153,19 @@ compat_sys_get_robust_list(int pid, compat_uptr_t __user *head_ptr,
goto err_unlock;
ret = -EPERM;
pcred = __task_cred(p);
+   /* If victim is in different user_ns, then uids are not
+  comparable, so we must have CAP_SYS_PTRACE */
+   if (cred->user->user_ns != pcred->user->user_ns) {
+   if (!ns_capable(pcred->user->user_ns, CAP_SYS_PTRACE))
+   goto err_unlock;
+   goto ok;
+   }
+   /* If victim is in same user_ns, then uids are comparable */
if (cred->euid != pcred->euid &&
cred->euid != pcred->uid &&
-   !capable(CAP_SYS_PTRACE))
+   !ns_capable(pcred->user->user_ns, CAP_SYS_PTRACE))
goto err_unlock;
+ok:
head = p->compat_robust_list;
rcu_read_unlock();
}
diff --git a/kernel/groups.c b/kernel/groups.c
index 253dc0f..1cc476d 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -233,7 +233,7 @@ SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
struct group_info *group_info;
int retval;
 
-   if (!capable(CAP_SETGID))
+   if (!nsown_capable(CAP_SETGID))
return -EPERM;
if ((unsigned)gidsetsize > NGROUPS_MAX)
 

[Devel] [PATCH 01/08] Add a user_namespace as creator/owner of uts_namespace

2011-01-10 Thread Serge E. Hallyn
copy_process() handles CLONE_NEWUSER before the rest of the
namespaces.  So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS)
the new uts namespace will have the new user namespace as its
owner.  That is what we want, since we want root in that new
userns to be able to have privilege over it.
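
A minimal userspace sketch of the ownership hand-off described above, assuming a plain integer refcount. The toy_* names are hypothetical stand-ins and not the kernel's types; the sketch only mirrors the put/assign/get sequence that the nsproxy.c hunk below performs on a freshly copied uts namespace.

```c
/* Toy stand-ins; the names are hypothetical, not the kernel's types. */
struct toy_userns {
	int refcount;
};

struct toy_uts {
	struct toy_userns *user_ns;   /* owner; holds one reference */
};

/* Copy a uts namespace: the copy starts with the same owner as the original. */
static void toy_clone_uts(struct toy_uts *new_uts, const struct toy_uts *old_uts)
{
	new_uts->user_ns = old_uts->user_ns;
	new_uts->user_ns->refcount++;
}

/* Re-home a freshly copied uts namespace under the cloning task's userns. */
static void toy_uts_set_owner(struct toy_uts *uts, struct toy_userns *owner)
{
	owner->refcount++;            /* take the new reference first ... */
	uts->user_ns->refcount--;     /* ... so owner == old owner stays safe */
	uts->user_ns = owner;
}
```

Because the user namespace is set up first, the owner passed to toy_uts_set_owner() in the clone(CLONE_NEWUSER|CLONE_NEWUTS) case would be the new user namespace, which is what lets root in that namespace hold privilege over the new uts namespace.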

Signed-off-by: Serge E. Hallyn 
---
 include/linux/utsname.h |3 +++
 init/version.c  |2 ++
 kernel/nsproxy.c|3 +++
 kernel/user.c   |8 ++--
 kernel/utsname.c|4 
 5 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 69f3997..85171be 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -37,9 +37,12 @@ struct new_utsname {
 #include 
 #include 
 
+struct user_namespace;
+
 struct uts_namespace {
struct kref kref;
struct new_utsname name;
+   struct user_namespace *user_ns;
 };
 extern struct uts_namespace init_uts_ns;
 
diff --git a/init/version.c b/init/version.c
index 79fb8c2..9eb19fb 100644
--- a/init/version.c
+++ b/init/version.c
@@ -21,6 +21,7 @@ extern int version_string(LINUX_VERSION_CODE);
 int version_string(LINUX_VERSION_CODE);
 #endif
 
+extern struct user_namespace init_user_ns;
 struct uts_namespace init_uts_ns = {
.kref = {
.refcount   = ATOMIC_INIT(2),
@@ -33,6 +34,7 @@ struct uts_namespace init_uts_ns = {
.machine= UTS_MACHINE,
.domainname = UTS_DOMAINNAME,
},
+   .user_ns = &init_user_ns,
 };
 EXPORT_SYMBOL_GPL(init_uts_ns);
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f74e6c0..5a22dcf 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -74,6 +74,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
err = PTR_ERR(new_nsp->uts_ns);
goto out_uts;
}
+   put_user_ns(new_nsp->uts_ns->user_ns);
+   new_nsp->uts_ns->user_ns = task_cred_xxx(tsk, user)->user_ns;
+   get_user_ns(new_nsp->uts_ns->user_ns);
 
new_nsp->ipc_ns = copy_ipcs(flags, tsk->nsproxy->ipc_ns);
if (IS_ERR(new_nsp->ipc_ns)) {
diff --git a/kernel/user.c b/kernel/user.c
index 5c598ca..9e03e9c 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -17,9 +17,13 @@
 #include 
 #include 
 
+/*
+ * userns count is 1 for root user, 1 for init_uts_ns,
+ * and 1 for... ?
+ */
 struct user_namespace init_user_ns = {
.kref = {
-   .refcount   = ATOMIC_INIT(2),
+   .refcount   = ATOMIC_INIT(3),
},
.creator = &root_user,
 };
@@ -47,7 +51,7 @@ static struct kmem_cache *uid_cachep;
  */
 static DEFINE_SPINLOCK(uidhash_lock);
 
-/* root_user.__count is 2, 1 for init task cred, 1 for init_user_ns->creator */
+/* root_user.__count is 2, 1 for init task cred, 1 for init_user_ns->user_ns */
 struct user_struct root_user = {
.__count= ATOMIC_INIT(2),
.processes  = ATOMIC_INIT(1),
diff --git a/kernel/utsname.c b/kernel/utsname.c
index 8a82b4b..a7b3a8d 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct uts_namespace *create_uts_ns(void)
 {
@@ -40,6 +41,8 @@ static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns)
 
down_read(&uts_sem);
memcpy(&ns->name, &old_ns->name, sizeof(ns->name));
+   ns->user_ns = old_ns->user_ns;
+   get_user_ns(ns->user_ns);
up_read(&uts_sem);
return ns;
 }
@@ -71,5 +74,6 @@ void free_uts_ns(struct kref *kref)
struct uts_namespace *ns;
 
ns = container_of(kref, struct uts_namespace, kref);
+   put_user_ns(ns->user_ns);
kfree(ns);
 }
-- 
1.7.0.4


[Devel] userns: targeted capabilities v4

2011-01-10 Thread Serge E. Hallyn
This version addresses feedback from and bugs pointed out by
Bastian.  It also adds user namespace checks in fs/namei.c.
If a task reads a file owned by another user_ns, it gets the
world access rights to that file.  Since inodes don't yet have
a user namespace, we just declare that init_user_ns owns them all.
So if you are root in a child user namespace, you effectively
are roaming the system as user nobody.  See 
http://www.spinics.net/lists/linux-containers/msg09716.html
and
http://www.spinics.net/lists/linux-containers/msg08486.html
for prior discussions.
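
To make the "world access rights" rule concrete, here is a small userspace sketch. It is not the fs/namei.c change itself; the toy_* types, the integer namespace ids, and the TOY_MAY_* constants are assumptions made for illustration. The only point it shows is that the owner and group bits are consulted only when the caller and the inode are in the same user namespace.

```c
#include <stdbool.h>

#define TOY_MAY_READ  4
#define TOY_MAY_WRITE 2
#define TOY_MAY_EXEC  1

struct toy_inode {
	int owner_ns;        /* id of the owning user namespace (hypothetical) */
	int uid, gid;
	unsigned int mode;   /* classic rwxrwxrwx permission bits */
};

struct toy_cred {
	int ns;              /* id of the caller's user namespace */
	int fsuid, fsgid;
};

/* Return true if every bit requested in @mask is granted. */
static bool toy_permission(const struct toy_cred *c,
			   const struct toy_inode *i, unsigned int mask)
{
	unsigned int mode = i->mode;

	if (c->ns == i->owner_ns) {
		if (c->fsuid == i->uid)
			mode >>= 6;      /* owner bits */
		else if (c->fsgid == i->gid)
			mode >>= 3;      /* group bits */
	}
	/* Different namespace: ids are not comparable, so only the
	 * "other" bits (the low three) are ever consulted. */
	return (mask & ~mode & 7) == 0;
}
```

With every inode declared to belong to init_user_ns, uid 0 in a child namespace never reaches the owner branch, which is why it ends up with the access of user nobody.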

[ Intro to v3 follows ]

The core of the set is patch 2, originally conceived and
implemented by Eric Biederman.  The concept is to target
capabilities at user namespaces.  A task's capabilities are
now contextualized as follows (previously, capabilities had
no context):

1. For a task in the initial user namespace, the calculated
capabilities (pi, pe, pp) are available to act upon any
user namespace.

2. For a task in a child user namespace, the calculated
capabilities are available to act only on its own or any
descendent user namespace.  It has no capabilities to any
parent or unrelated user namespaces.

3. If a user A creates a new user namespace, that user has
all capabilities to that new user namespace and any of its
descendents.  (Contrast this with a user namespace created
by another user B in the same user namespace, to which this
user A has only his calculated capabilities)
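
The three rules above can be sketched as a userspace toy model. This is not the cap_capable() code from patch 2; the toy_* structures, the parent pointer, and the single effective bitmask are assumptions chosen to keep the example short.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; these types are illustrative, not the kernel's. */
struct toy_userns {
	struct toy_userns *parent;   /* NULL for the initial namespace */
	int creator_uid;             /* uid, in the parent ns, that created it */
};

struct toy_task {
	struct toy_userns *ns;       /* namespace the task lives in */
	int uid;
	unsigned long effective;     /* bitmask of calculated capabilities */
};

/* Return true if @t holds capability bit @cap targeted at @target. */
static bool toy_ns_capable(const struct toy_task *t,
			   const struct toy_userns *target, int cap)
{
	const struct toy_userns *ns;

	for (ns = target; ns != NULL; ns = ns->parent) {
		/* Rules 1 and 2: target is the task's own namespace or a
		 * descendant of it, so the calculated set applies. */
		if (ns == t->ns)
			return (t->effective >> cap) & 1UL;
		/* Rule 3: the task's user created this namespace (the
		 * target or one of its ancestors): all capabilities. */
		if (ns->parent == t->ns && ns->creator_uid == t->uid)
			return true;
	}
	/* Parent or unrelated namespace: no capabilities at all. */
	return false;
}
```

Note that the creator test runs before the walk reaches the task's own namespace, so a creator with an empty effective set still gets access; that is the same ordering the 12/14/2010 changelog entry of patch 2 describes.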

All existing 'capable' checks are automatically converted to
checks against the initial user namespace.  The rest of the
patches begin to enable capabilities in child user namespaces
to setuid, setgid, set hostnames, kill tasks, and do ptrace.

My next step would be to re-introduce a part of a several year
old patchset which assigns a userns to a superblock (and hence
to inodes), and grants 'user other' permissions to any task
whose uid does not map to the target userns.  (By default, this
will be all but the initial userns)

thanks,
-serge

[Devel] Re: [PATCH 6/7] user namespaces: convert all capable checks in kernel/sys.c

2011-01-10 Thread Serge E. Hallyn
Quoting Bastian Blank (bast...@waldi.eu.org):
> On Mon, Jan 10, 2011 at 09:14:07PM +0000, Serge E. Hallyn wrote:
> > -   if (pcred->uid  != cred->euid &&
> > -   pcred->euid != cred->euid && !capable(CAP_SYS_NICE)) {
> > +   if (pcred->user->user_ns != cred->user->user_ns &&
> > +   pcred->uid  != cred->euid &&
> > +   pcred->euid != cred->euid &&
> > +   !ns_capable(pcred->user->user_ns, CAP_SYS_NICE)) {
> 
> I don't think this is correct. This would not error out if both
> userns are the same. Because the same pattern (check uid if same userns,
> otherwise only capability) shows up in several parts of the code, maybe
> this should be factored out.

Yeah, I'd really like to factor this out because it shows up everywhere
and I have to think about it every time I look at it.  But each time it
shows up, the uids being compared slightly change.  There must be some
clever way of doing it, hopefully it'll fall out soon.

Eric's ns_capable() has already simplified this quite a bit - which is
part of why I've sometimes not been thinking about it right, it's now
simpler than it used to be :)

thanks,
-serge
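
One possible shape for the factoring discussed in this message, as a userspace sketch only. Nothing here is from the patch set: toy_may_act_on(), the callback types, and the stub capability functions are hypothetical, and the idea is simply to pass the per-call-site uid comparison in as a function while keeping the namespace test and the capability fallback in one place.

```c
#include <stdbool.h>

#define TOY_CAP_SYS_NICE 23   /* numeric value used only as a label here */

struct toy_cred {
	int user_ns;          /* namespace id, hypothetical */
	int uid, euid, suid;
};

/* Per-call-site uid rule: the part that changes from site to site. */
typedef bool (*toy_uid_match_fn)(const struct toy_cred *cred,
				 const struct toy_cred *tcred);

/* Stand-in for ns_capable(); a real caller would ask the kernel. */
typedef bool (*toy_cap_fn)(int target_ns, int cap);

/* Compare uids only within one namespace, else fall back to the capability. */
static bool toy_may_act_on(const struct toy_cred *cred,
			   const struct toy_cred *tcred,
			   toy_uid_match_fn match,
			   toy_cap_fn capable, int cap)
{
	if (cred->user_ns == tcred->user_ns && match(cred, tcred))
		return true;
	return capable(tcred->user_ns, cap);
}

/* The uid rule that set_one_prio() applies in the quoted hunk. */
static bool toy_nice_match(const struct toy_cred *cred,
			   const struct toy_cred *tcred)
{
	return tcred->uid == cred->euid || tcred->euid == cred->euid;
}

/* Stub capability checks for trying the helper out. */
static bool toy_cap_never(int target_ns, int cap)
{
	(void)target_ns;
	(void)cap;
	return false;
}

static bool toy_cap_always(int target_ns, int cap)
{
	(void)target_ns;
	(void)cap;
	return true;
}
```

Written this way, a caller in the same namespace with non-matching uids and no capability is refused, which is the case the quoted condition lets through.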

[Devel] Re: [PATCH 4/7] allow killing tasks in your own or child userns

2011-01-10 Thread Serge E. Hallyn
Quoting Oren Laadan (or...@cs.columbia.edu):
...
> > If permission is granted based on userids and the capability
> > isn't needed, then we don't want to needlessly set PF_SUPERPRIV.
> 
> A bit off-topic: does this mean that c/r needs to save and
> restore this process flag?

It should, yeah.  (Until we decide to nuke the flag)

thanks,
-serge

[Devel] Re: [PATCH 4/7] allow killing tasks in your own or child userns

2011-01-10 Thread Serge E. Hallyn
Quoting Bastian Blank (bast...@waldi.eu.org):
> On Mon, Jan 10, 2011 at 04:51:51PM -0600, Serge Hallyn wrote:
> > Quoting Bastian Blank (bast...@waldi.eu.org):
> > > Isn't that equal to this?
> > > 
> > >   if (ns_capable(tcred->user->user_ns, CAP_KILL))
> > >   return 1;
> > > 
> > >   if (cred->user->user_ns == tcred->user->user_ns &&
> > >   (cred->euid == tcred->suid ||
> > >cred->euid == tcred->uid ||
> > >cred->uid == tcred->suid ||
> > >cred->uid == tcred->uid))
> > >   return 1;
> > > 
> > >   return 0;
> > > 
> > > I would consider this much easier to read.
> > 
> > Unfortunately, it's actually not equivalent.  when capable()
> > returns success, then it sets the current->flags |= PF_SUPERPRIV.
> > If permission is granted based on userids and the capability
> > isn't needed, then we don't want to needlessly set PF_SUPERPRIV.
> 
> Well, then switch the two if-clauses.

hup, will do, much nicer, thanks.
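
A userspace sketch of why the order of the two tests matters, assuming a toy capability check that, like capable(), records PF_SUPERPRIV on success. The toy_* names and the flag value are illustrative and not taken from the patch.

```c
#include <stdbool.h>

#define TOY_PF_SUPERPRIV 0x100   /* illustrative value only */

struct toy_task {
	int user_ns;             /* namespace id, hypothetical */
	int uid, euid, suid;
	bool has_cap_kill;
	unsigned int flags;
};

/* Like capable(): success has a side effect on the caller's flags. */
static bool toy_ns_capable_kill(struct toy_task *cur, int target_ns)
{
	(void)target_ns;  /* the toy ignores which namespace is targeted */
	if (!cur->has_cap_kill)
		return false;
	cur->flags |= TOY_PF_SUPERPRIV;
	return true;
}

/* Uid test first, so the flag is only set when privilege was needed. */
static bool toy_kill_ok(struct toy_task *cur, const struct toy_task *t)
{
	if (cur->user_ns == t->user_ns &&
	    (cur->euid == t->suid || cur->euid == t->uid ||
	     cur->uid == t->suid || cur->uid == t->uid))
		return true;
	return toy_ns_capable_kill(cur, t->user_ns);
}
```

A task that owns the target and also happens to hold the capability returns from the first branch with its flags untouched; only a task that actually relied on the capability gets marked.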

> What is this flag used for anyway? I only see it used in the accounting
> stuff, and if every user can get it, it is no longer useful.

hm, I'm not sure...  maybe no one is using it!

[Devel] [PATCH 7/7] user namespaces: convert several capable() calls

2011-01-10 Thread Serge E. Hallyn
CAP_IPC_OWNER and CAP_IPC_LOCK can be checked against current_user_ns(),
because the resource comes from current's own ipc namespace.

setuid/setgid are to uids in own namespace, so again checks can be
against current_user_ns().

Signed-off-by: Serge E. Hallyn 
---
 ipc/shm.c |2 +-
 ipc/util.c|5 +++--
 kernel/futex.c|   11 ++-
 kernel/futex_compat.c |   11 ++-
 kernel/groups.c   |2 +-
 kernel/sched.c|   25 ++---
 kernel/uid16.c|2 +-
 7 files changed, 48 insertions(+), 10 deletions(-)

diff --git a/ipc/shm.c b/ipc/shm.c
index 7d3bb22..13891f8 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -773,7 +773,7 @@ SYSCALL_DEFINE3(shmctl, int, shmid, int, cmd, struct shmid_ds __user *, buf)
 
audit_ipc_obj(&(shp->shm_perm));
 
-   if (!capable(CAP_IPC_LOCK)) {
+   if (!ns_capable(current_user_ns(), CAP_IPC_LOCK)) {
uid_t euid = current_euid();
err = -EPERM;
if (euid != shp->shm_perm.uid &&
diff --git a/ipc/util.c b/ipc/util.c
index 69a0cc1..0e832b9 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -627,7 +627,7 @@ int ipcperms (struct kern_ipc_perm *ipcp, short flag)
granted_mode >>= 3;
/* is there some bit set in requested_mode but not in granted_mode? */
if ((requested_mode & ~granted_mode & 0007) && 
-   !capable(CAP_IPC_OWNER))
+   !ns_capable(current->cred->user->user_ns, CAP_IPC_OWNER))
return -1;
 
return security_ipc_permission(ipcp, flag);
@@ -800,7 +800,8 @@ struct kern_ipc_perm *ipcctl_pre_down(struct ipc_ids *ids, int id, int cmd,
 
euid = current_euid();
if (euid == ipcp->cuid ||
-   euid == ipcp->uid  || capable(CAP_SYS_ADMIN))
+   euid == ipcp->uid  ||
+   ns_capable(current->cred->user->user_ns, CAP_SYS_ADMIN))
return ipcp;
 
err = -EPERM;
diff --git a/kernel/futex.c b/kernel/futex.c
index 3019b92..1025fd7 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2387,10 +2387,19 @@ SYSCALL_DEFINE3(get_robust_list, int, pid,
goto err_unlock;
ret = -EPERM;
pcred = __task_cred(p);
+   /* If victim is in different user_ns, then uids are not
+  comparable, so we must have CAP_SYS_PTRACE */
+   if (cred->user->user_ns != pcred->user->user_ns) {
+   if (ns_capable(pcred->user->user_ns, CAP_SYS_PTRACE))
+   goto ok;
+   goto err_unlock;
+   }
+   /* If victim is in same user_ns, then uids are comparable */
if (cred->euid != pcred->euid &&
cred->euid != pcred->uid &&
-   !capable(CAP_SYS_PTRACE))
+   !ns_capable(pcred->user->user_ns, CAP_SYS_PTRACE))
goto err_unlock;
+ok:
head = p->robust_list;
rcu_read_unlock();
}
diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c
index a7934ac..f84cb9a 100644
--- a/kernel/futex_compat.c
+++ b/kernel/futex_compat.c
@@ -153,10 +153,19 @@ compat_sys_get_robust_list(int pid, compat_uptr_t __user *head_ptr,
goto err_unlock;
ret = -EPERM;
pcred = __task_cred(p);
+   /* If victim is in different user_ns, then uids are not
+  comparable, so we must have CAP_SYS_PTRACE */
+   if (cred->user->user_ns != pcred->user->user_ns) {
+   if (ns_capable(pcred->user->user_ns, CAP_SYS_PTRACE))
+   goto ok;
+   goto err_unlock;
+   }
+   /* If victim is in same user_ns, then uids are comparable */
if (cred->euid != pcred->euid &&
cred->euid != pcred->uid &&
-   !capable(CAP_SYS_PTRACE))
+   !ns_capable(pcred->user->user_ns, CAP_SYS_PTRACE))
goto err_unlock;
+ok:
head = p->compat_robust_list;
rcu_read_unlock();
}
diff --git a/kernel/groups.c b/kernel/groups.c
index 253dc0f..335586a 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -233,7 +233,7 @@ SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
struct group_info *group_info;
int retval;
 
-   if (!capable(CAP_SETGID))
+   if (!ns_capable(current_user_ns(), CAP_SETGID))
return -EPERM;
if ((unsigned)gidsetsize > NGROUPS_MAX)
return -EINVAL;
diff --git a/kernel/sched.c b/kernel/sched.c
index a0eb094..107

[Devel] [PATCH 6/7] user namespaces: convert all capable checks in kernel/sys.c

2011-01-10 Thread Serge E. Hallyn
This allows setuid/setgid in containers.  It also fixes some
corner cases where kernel logic foregoes capability checks when
uids are equivalent.  The latter will need to be done throughout
the whole kernel.

Signed-off-by: Serge E. Hallyn 
---
 kernel/sys.c |   35 ---
 1 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 9b9b03b..2278e87 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -125,8 +125,10 @@ static int set_one_prio(struct task_struct *p, int niceval, int error)
const struct cred *cred = current_cred(), *pcred = __task_cred(p);
int no_nice;
 
-   if (pcred->uid  != cred->euid &&
-   pcred->euid != cred->euid && !capable(CAP_SYS_NICE)) {
+   if (pcred->user->user_ns != cred->user->user_ns &&
+   pcred->uid  != cred->euid &&
+   pcred->euid != cred->euid &&
+   !ns_capable(pcred->user->user_ns, CAP_SYS_NICE)) {
error = -EPERM;
goto out;
}
@@ -496,7 +498,7 @@ SYSCALL_DEFINE2(setregid, gid_t, rgid, gid_t, egid)
if (rgid != (gid_t) -1) {
if (old->gid == rgid ||
old->egid == rgid ||
-   capable(CAP_SETGID))
+   ns_capable(current_user_ns(), CAP_SETGID))
new->gid = rgid;
else
goto error;
@@ -505,7 +507,7 @@ SYSCALL_DEFINE2(setregid, gid_t, rgid, gid_t, egid)
if (old->gid == egid ||
old->egid == egid ||
old->sgid == egid ||
-   capable(CAP_SETGID))
+   ns_capable(current_user_ns(), CAP_SETGID))
new->egid = egid;
else
goto error;
@@ -540,7 +542,7 @@ SYSCALL_DEFINE1(setgid, gid_t, gid)
old = current_cred();
 
retval = -EPERM;
-   if (capable(CAP_SETGID))
+   if (ns_capable(current_user_ns(), CAP_SETGID))
new->gid = new->egid = new->sgid = new->fsgid = gid;
else if (gid == old->gid || gid == old->sgid)
new->egid = new->fsgid = gid;
@@ -607,7 +609,7 @@ SYSCALL_DEFINE2(setreuid, uid_t, ruid, uid_t, euid)
new->uid = ruid;
if (old->uid != ruid &&
old->euid != ruid &&
-   !capable(CAP_SETUID))
+   !ns_capable(current_user_ns(), CAP_SETUID))
goto error;
}
 
@@ -616,7 +618,7 @@ SYSCALL_DEFINE2(setreuid, uid_t, ruid, uid_t, euid)
if (old->uid != euid &&
old->euid != euid &&
old->suid != euid &&
-   !capable(CAP_SETUID))
+   !ns_capable(current_user_ns(), CAP_SETUID))
goto error;
}
 
@@ -664,7 +666,7 @@ SYSCALL_DEFINE1(setuid, uid_t, uid)
old = current_cred();
 
retval = -EPERM;
-   if (capable(CAP_SETUID)) {
+   if (ns_capable(current_user_ns(), CAP_SETUID)) {
new->suid = new->uid = uid;
if (uid != old->uid) {
retval = set_user(new);
@@ -706,7 +708,7 @@ SYSCALL_DEFINE3(setresuid, uid_t, ruid, uid_t, euid, uid_t, suid)
old = current_cred();
 
retval = -EPERM;
-   if (!capable(CAP_SETUID)) {
+   if (!ns_capable(current_user_ns(), CAP_SETUID)) {
if (ruid != (uid_t) -1 && ruid != old->uid &&
ruid != old->euid  && ruid != old->suid)
goto error;
@@ -770,7 +772,7 @@ SYSCALL_DEFINE3(setresgid, gid_t, rgid, gid_t, egid, gid_t, sgid)
old = current_cred();
 
retval = -EPERM;
-   if (!capable(CAP_SETGID)) {
+   if (!ns_capable(current_user_ns(), CAP_SETGID)) {
if (rgid != (gid_t) -1 && rgid != old->gid &&
rgid != old->egid  && rgid != old->sgid)
goto error;
@@ -830,7 +832,7 @@ SYSCALL_DEFINE1(setfsuid, uid_t, uid)
 
if (uid == old->uid  || uid == old->euid  ||
uid == old->suid || uid == old->fsuid ||
-   capable(CAP_SETUID)) {
+   ns_capable(current_user_ns(), CAP_SETUID)) {
if (uid != old_fsuid) {
new->fsuid = uid;
if (security_task_fix_setuid(new, old, LSM_SETID_FS) == 0)
@@ -863,7 +865,7 @@ SYSCALL_DEFINE1(setfsgid, gid_t, gid)
 
if (gid == old->gid  || gid == old->egid  ||
gid == old->sgid || gid == old->fsgid ||
-   capable(CAP_SETGID)) {
+   ns_capable(current_user_ns(), CAP_SETGID)) {
  

[Devel] [PATCH 5/7] Allow ptrace from non-init user namespaces

2011-01-10 Thread Serge E. Hallyn
ptrace is allowed to tasks in the same user namespace according to
the usual rules (i.e. the same rules as for two tasks in the init
user namespace).  ptrace is also allowed to a user namespace to
which the current task has the CAP_SYS_PTRACE capability.

Changelog:
Dec 31: Address feedback by Eric:
. Correct ptrace uid check
. Rename may_ptrace_ns to ptrace_capable
. Also fix the cap_ptrace checks.
Jan 1: Use const cred struct

Signed-off-by: Serge E. Hallyn 
---
 include/linux/capability.h |2 +
 include/linux/user_namespace.h |9 +++
 kernel/ptrace.c|   40 +++--
 kernel/user_namespace.c|   16 +
 security/commoncap.c   |   48 +--
 5 files changed, 95 insertions(+), 20 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index 7b3be11..501b8c9 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -546,6 +546,8 @@ extern const kernel_cap_t __cap_init_eff_set;
  */
 #define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0)
 
+#define has_ns_capability(t, ns, cap) (security_real_capable((t), (ns), (cap)) == 0)
+
 /**
  * has_capability_noaudit - Determine if a task has a superior capability available (unaudited)
  * @t: The task in question
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 8178156..91c4f10 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -39,6 +39,9 @@ static inline void put_user_ns(struct user_namespace *ns)
 uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, uid_t uid);
 gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t gid);
 
+int same_or_ancestor_user_ns(struct task_struct *task,
+   struct task_struct *victim);
+
 #else
 
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
@@ -66,6 +69,12 @@ static inline gid_t user_ns_map_gid(struct user_namespace *to,
return gid;
 }
 
+static inline int same_or_ancestor_user_ns(struct task_struct *task,
+   struct task_struct *victim)
+{
+   return 1;
+}
+
 #endif
 
 #endif /* _LINUX_USER_H */
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 99bbaa3..88e3fb3 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -116,6 +116,19 @@ int ptrace_check_attach(struct task_struct *child, int kill)
return ret;
 }
 
+static inline int ptrace_capable(struct task_struct *t)
+{
+   struct user_namespace *ns;
+   int ret;
+
+   rcu_read_lock();
+   ns = task_cred_xxx(t, user)->user_ns;
+   ret = ns_capable(ns, CAP_SYS_PTRACE);
+   rcu_read_unlock();
+
+   return ret;
+}
+
 int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 {
const struct cred *cred = current_cred(), *tcred;
@@ -134,21 +147,24 @@ int __ptrace_may_access(struct task_struct *task, unsigned int mode)
return 0;
rcu_read_lock();
tcred = __task_cred(task);
-   if ((cred->uid != tcred->euid ||
-cred->uid != tcred->suid ||
-cred->uid != tcred->uid  ||
-cred->gid != tcred->egid ||
-cred->gid != tcred->sgid ||
-cred->gid != tcred->gid) &&
-   !capable(CAP_SYS_PTRACE)) {
-   rcu_read_unlock();
-   return -EPERM;
-   }
+   if (cred->user->user_ns == tcred->user->user_ns &&
+   (cred->uid == tcred->euid &&
+cred->uid == tcred->suid &&
+cred->uid == tcred->uid  &&
+cred->gid == tcred->egid &&
+cred->gid == tcred->sgid &&
+cred->gid == tcred->gid))
+   goto ok;
+   if (ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE))
+   goto ok;
+   rcu_read_unlock();
+   return -EPERM;
+ok:
rcu_read_unlock();
smp_rmb();
if (task->mm)
dumpable = get_dumpable(task->mm);
-   if (!dumpable && !capable(CAP_SYS_PTRACE))
+   if (!dumpable && !ptrace_capable(task))
return -EPERM;
 
return security_ptrace_access_check(task, mode);
@@ -198,7 +214,7 @@ int ptrace_attach(struct task_struct *task)
goto unlock_tasklist;
 
task->ptrace = PT_PTRACED;
-   if (capable(CAP_SYS_PTRACE))
+   if (ptrace_capable(task))
task->ptrace |= PT_PTRACE_CAP;
 
__ptrace_link(task, current);
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 2591583..4b70999 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -126,3 +126,19 @@ gid_t user_ns_map_gid(struct user_namespace *to, const

[Devel] [PATCH 2/7] security: Make capabilities relative to the user namespace.

2011-01-10 Thread Serge E. Hallyn

- Introduce ns_capable to test for a capability in a non-default
  user namespace.
- Teach cap_capable to handle capabilities in a non-default
  user namespace.

The motivation is to get to the unprivileged creation of new
namespaces.  It looks like this gets us 90% of the way there, with
only potential uid confusion issues left.

I still need to handle getting all caps after creation but otherwise I
think I have a good starter patch that achieves all of your goals.

Changelog:
11/05/2010: [serge] add apparmor
12/14/2010: [serge] fix capabilities to created user namespaces
   Without this, if user serge creates a user_ns, he won't have
   capabilities to the user_ns he created.  This is because we
   were first checking whether his effective caps had the caps
   he needed and returning -EPERM if not, and THEN checking whether
   he was the creator.  Reverse those checks.
12/16/2010: [serge] security_real_capable needs ns argument in
   !security case

Signed-off-by: Eric W. Biederman 
Signed-off-by: Serge E. Hallyn 
---
 include/linux/capability.h |7 +--
 include/linux/security.h   |   22 --
 kernel/capability.c|   22 --
 security/apparmor/lsm.c|5 +++--
 security/commoncap.c   |   40 +---
 security/security.c|   12 ++--
 security/selinux/hooks.c   |   14 +-
 7 files changed, 88 insertions(+), 34 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index fb16a36..7b3be11 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -544,7 +544,7 @@ extern const kernel_cap_t __cap_init_eff_set;
  *
  * Note that this does not set PF_SUPERPRIV on the task.
  */
-#define has_capability(t, cap) (security_real_capable((t), (cap)) == 0)
+#define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0)
 
 /**
  * has_capability_noaudit - Determine if a task has a superior capability available (unaudited)
@@ -558,9 +558,12 @@ extern const kernel_cap_t __cap_init_eff_set;
  * Note that this does not set PF_SUPERPRIV on the task.
  */
 #define has_capability_noaudit(t, cap) \
-   (security_real_capable_noaudit((t), (cap)) == 0)
+   (security_real_capable_noaudit((t), &init_user_ns, (cap)) == 0)
 
+struct user_namespace;
+extern struct user_namespace init_user_ns;
 extern int capable(int cap);
+extern int ns_capable(struct user_namespace *ns, int cap);
 
 /* audit system wants to get cap info from files as well */
 struct dentry;
diff --git a/include/linux/security.h b/include/linux/security.h
index c642bb8..f31bffd 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -46,13 +46,14 @@
 
 struct ctl_table;
 struct audit_krule;
+struct user_namespace;
 
 /*
  * These functions are in security/capability.c and are used
  * as the default capabilities functions
  */
 extern int cap_capable(struct task_struct *tsk, const struct cred *cred,
-  int cap, int audit);
+  struct user_namespace *ns, int cap, int audit);
 extern int cap_settime(struct timespec *ts, struct timezone *tz);
 extern int cap_ptrace_access_check(struct task_struct *child, unsigned int mode);
 extern int cap_ptrace_traceme(struct task_struct *parent);
@@ -1254,6 +1255,7 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
  * credentials.
  * @tsk contains the task_struct for the process.
  * @cred contains the credentials to use.
+ *  @ns contains the user namespace we want the capability in
  * @cap contains the capability .
  * @audit: Whether to write an audit message or not
  * Return 0 if the capability is granted for @tsk.
@@ -1382,7 +1384,7 @@ struct security_operations {
   const kernel_cap_t *inheritable,
   const kernel_cap_t *permitted);
int (*capable) (struct task_struct *tsk, const struct cred *cred,
-   int cap, int audit);
+   struct user_namespace *ns, int cap, int audit);
int (*sysctl) (struct ctl_table *table, int op);
int (*quotactl) (int cmds, int type, int id, struct super_block *sb);
int (*quota_on) (struct dentry *dentry);
@@ -1662,9 +1664,9 @@ int security_capset(struct cred *new, const struct cred *old,
const kernel_cap_t *effective,
const kernel_cap_t *inheritable,
const kernel_cap_t *permitted);
-int security_capable(int cap);
-int security_real_capable(struct task_struct *tsk, int cap);
-int security_real_capable_noaudit(struct task_struct *tsk, int cap);
+int security_capable(struct user_namespace *ns, int cap);
+int security_real_capable(struct task_struct *tsk, struct user_namespace *ns, int cap);
+int security_real_capable_noaudit(struct task_struct *tsk, struct 
user_names

[Devel] [PATCH 4/7] allow killing tasks in your own or child userns

2011-01-10 Thread Serge E. Hallyn
Changelog:
Dec 8: Fixed bug in my check_kill_permission pointed out by
   Eric Biederman.
Dec 13: Apply Eric's suggestion to pass target task into
   kill_ok_by_cred() for clarity.
Dec 31: Address comment by Eric Biederman:
   don't need cred/tcred in check_kill_permission.
Jan 1: Use const cred struct.

Signed-off-by: Serge E. Hallyn 
Reviewed-by: "Eric W. Biederman" 
---
 kernel/signal.c |   36 
 1 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 4e3cff1..6a12eae 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -636,13 +636,39 @@ static inline bool si_fromuser(const struct siginfo *info)
 }
 
 /*
+ * called with RCU read lock from check_kill_permission()
+ */
+static inline int kill_ok_by_cred(struct task_struct *t)
+{
+   const struct cred *cred = current_cred();
+   const struct cred *tcred = __task_cred(t);
+
+   if (cred->user->user_ns != tcred->user->user_ns) {
+   /* userids are not equivalent - either you have the
+  capability to the target user ns or you don't */
+   if (ns_capable(tcred->user->user_ns, CAP_KILL))
+   return 1;
+   return 0;
+   }
+
+   /* same user namespace - usual credentials checks apply */
+   if ((cred->euid ^ tcred->suid) &&
+   (cred->euid ^ tcred->uid) &&
+   (cred->uid  ^ tcred->suid) &&
+   (cred->uid  ^ tcred->uid) &&
+   !ns_capable(tcred->user->user_ns, CAP_KILL))
+   return 0;
+
+   return 1;
+}
+
+/*
  * Bad permissions for sending the signal
  * - the caller must hold the RCU read lock
  */
 static int check_kill_permission(int sig, struct siginfo *info,
 struct task_struct *t)
 {
-   const struct cred *cred, *tcred;
struct pid *sid;
int error;
 
@@ -656,14 +682,8 @@ static int check_kill_permission(int sig, struct siginfo *info,
if (error)
return error;
 
-   cred = current_cred();
-   tcred = __task_cred(t);
if (!same_thread_group(current, t) &&
-   (cred->euid ^ tcred->suid) &&
-   (cred->euid ^ tcred->uid) &&
-   (cred->uid  ^ tcred->suid) &&
-   (cred->uid  ^ tcred->uid) &&
-   !capable(CAP_KILL)) {
+   !kill_ok_by_cred(t)) {
switch (sig) {
case SIGCONT:
sid = task_session(t);
-- 
1.7.0.4

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 3/7] allow sethostname in a container

2011-01-10 Thread Serge E. Hallyn

Signed-off-by: Serge E. Hallyn 
---
 kernel/sys.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 2745dcd..9b9b03b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1171,7 +1171,7 @@ SYSCALL_DEFINE2(sethostname, char __user *, name, int, len)
int errno;
char tmp[__NEW_UTS_LEN];
 
-   if (!capable(CAP_SYS_ADMIN))
+   if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN))
return -EPERM;
if (len < 0 || len > __NEW_UTS_LEN)
return -EINVAL;
-- 
1.7.0.4



[Devel] [PATCH 1/7] Add a user_namespace as creator/owner of uts_namespace

2011-01-10 Thread Serge E. Hallyn
copy_process() handles CLONE_NEWUSER before the rest of the
namespaces.  So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS)
the new uts namespace will have the new user namespace as its
owner.  That is what we want, since we want root in that new
userns to be able to have privilege over it.
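
The ordering argument above can be sketched in user space. This is only a toy model under stated assumptions, not the kernel's copy_process(): `model_userns`, `model_uts`, `model_task`, `model_clone` and the `MODEL_*` flags are hypothetical names, and the behaviour for a clone without the uts flag (the child keeps sharing the parent's uts namespace and its owner) is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for CLONE_NEWUSER / CLONE_NEWUTS. */
enum { MODEL_NEWUSER = 1 << 0, MODEL_NEWUTS = 1 << 1 };

struct model_userns { int id; };
struct model_uts { const struct model_userns *owner; };
struct model_task {
	struct model_userns *user_ns;
	struct model_uts *uts_ns;
};

/*
 * Toy clone: the user namespace is switched first, so a uts namespace
 * created in the same call records the NEW user namespace as its owner.
 * fresh_userns and fresh_uts are caller-provided storage for the new
 * namespaces (the model does no allocation or refcounting).
 */
static void model_clone(const struct model_task *parent, int flags,
			struct model_userns *fresh_userns,
			struct model_uts *fresh_uts,
			struct model_task *child)
{
	assert(parent && fresh_userns && fresh_uts && child);
	*child = *parent;

	/* Step 1: user namespace is handled before everything else. */
	if (flags & MODEL_NEWUSER)
		child->user_ns = fresh_userns;

	/* Step 2: a new uts namespace is owned by the child's user ns. */
	if (flags & MODEL_NEWUTS) {
		fresh_uts->owner = child->user_ns;
		child->uts_ns = fresh_uts;
	}
}
```

With both flags set, the owner of the child's uts namespace is the freshly created user namespace, which is what lets root in that user namespace pass a check such as the ns_capable() test in the sethostname patch.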

Signed-off-by: Serge E. Hallyn 
---
 include/linux/utsname.h |3 +++
 init/version.c  |2 ++
 kernel/nsproxy.c|3 +++
 kernel/user.c   |8 ++--
 kernel/utsname.c|4 
 5 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 69f3997..85171be 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -37,9 +37,12 @@ struct new_utsname {
 #include 
 #include 
 
+struct user_namespace;
+
 struct uts_namespace {
struct kref kref;
struct new_utsname name;
+   struct user_namespace *user_ns;
 };
 extern struct uts_namespace init_uts_ns;
 
diff --git a/init/version.c b/init/version.c
index adff586..97bb86f 100644
--- a/init/version.c
+++ b/init/version.c
@@ -21,6 +21,7 @@ extern int version_string(LINUX_VERSION_CODE);
 int version_string(LINUX_VERSION_CODE);
 #endif
 
+extern struct user_namespace init_user_ns;
 struct uts_namespace init_uts_ns = {
.kref = {
.refcount   = ATOMIC_INIT(2),
@@ -33,6 +34,7 @@ struct uts_namespace init_uts_ns = {
.machine= UTS_MACHINE,
.domainname = UTS_DOMAINNAME,
},
+   .user_ns = &init_user_ns,
 };
 EXPORT_SYMBOL_GPL(init_uts_ns);
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f74e6c0..5a22dcf 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -74,6 +74,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
err = PTR_ERR(new_nsp->uts_ns);
goto out_uts;
}
+   put_user_ns(new_nsp->uts_ns->user_ns);
+   new_nsp->uts_ns->user_ns = task_cred_xxx(tsk, user)->user_ns;
+   get_user_ns(new_nsp->uts_ns->user_ns);
 
new_nsp->ipc_ns = copy_ipcs(flags, tsk->nsproxy->ipc_ns);
if (IS_ERR(new_nsp->ipc_ns)) {
diff --git a/kernel/user.c b/kernel/user.c
index 5c598ca..9e03e9c 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -17,9 +17,13 @@
 #include 
 #include 
 
+/*
+ * userns count is 1 for root user, 1 for init_uts_ns,
+ * and 1 for... ?
+ */
 struct user_namespace init_user_ns = {
.kref = {
-   .refcount   = ATOMIC_INIT(2),
+   .refcount   = ATOMIC_INIT(3),
},
.creator = &root_user,
 };
@@ -47,7 +51,7 @@ static struct kmem_cache *uid_cachep;
  */
 static DEFINE_SPINLOCK(uidhash_lock);
 
-/* root_user.__count is 2, 1 for init task cred, 1 for init_user_ns->creator */
+/* root_user.__count is 2, 1 for init task cred, 1 for init_user_ns->user_ns */
 struct user_struct root_user = {
.__count= ATOMIC_INIT(2),
.processes  = ATOMIC_INIT(1),
diff --git a/kernel/utsname.c b/kernel/utsname.c
index 8a82b4b..a7b3a8d 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct uts_namespace *create_uts_ns(void)
 {
@@ -40,6 +41,8 @@ static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns)
 
down_read(&uts_sem);
memcpy(&ns->name, &old_ns->name, sizeof(ns->name));
+   ns->user_ns = old_ns->user_ns;
+   get_user_ns(ns->user_ns);
up_read(&uts_sem);
return ns;
 }
@@ -71,5 +74,6 @@ void free_uts_ns(struct kref *kref)
struct uts_namespace *ns;
 
ns = container_of(kref, struct uts_namespace, kref);
+   put_user_ns(ns->user_ns);
kfree(ns);
 }
-- 
1.7.0.4



[Devel] userns: targeted capabilities v3

2011-01-10 Thread Serge E. Hallyn
Following is the next version of my user namespace patchset.

The core of the set is patch 2, originally conceived and
implemented by Eric Biederman.  The concept is to target
capabilities at user namespaces.  A task's capabilities are
now contextualized as follows (previously, capabilities had
no context):

1. For a task in the initial user namespace, the calculated
capabilities (pi, pe, pp) are available to act upon any
user namespace.

2. For a task in a child user namespace, the calculated
capabilities are available to act only on its own or any
descendent user namespace.  It has no capabilities to any
parent or unrelated user namespaces.

3. If a user A creates a new user namespace, that user has
all capabilities to that new user namespace and any of its
descendents.  (Contrast this with a user namespace created
by another user B in the same user namespace, to which this
user A has only his calculated capabilities)
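
The three rules above can be read as a walk from the target namespace up through its ancestors. What follows is a minimal user-space sketch of that reading, not the cap_capable() code from patch 2: `model_userns`, `model_task` and `model_ns_capable` are hypothetical names, the creator is reduced to a plain uid, and the capability sets are reduced to one bitmask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; these are not the kernel structures. */
struct model_userns {
	struct model_userns *parent;	/* NULL for the initial namespace */
	int creator_uid;		/* uid, in the parent, that created it */
};

struct model_task {
	struct model_userns *ns;	/* namespace the task lives in */
	int uid;
	unsigned long effective;	/* calculated capabilities as a bitmask */
};

/* Does task t hold capability 'cap' with respect to namespace 'target'? */
static bool model_ns_capable(const struct model_task *t,
			     const struct model_userns *target, int cap)
{
	assert(t && target);

	for (const struct model_userns *ns = target; ns != NULL; ns = ns->parent) {
		/*
		 * Rules 1 and 2: the walk reached the task's own namespace,
		 * so target is that namespace or one of its descendants and
		 * the task's calculated capabilities apply.
		 */
		if (ns == t->ns)
			return ((t->effective >> cap) & 1UL) != 0;

		/*
		 * Rule 3: the task's user created this namespace, so it has
		 * all capabilities to it and to everything below it.
		 */
		if (ns->parent == t->ns && ns->creator_uid == t->uid)
			return true;
	}

	/* Parent or unrelated namespace: no capabilities at all. */
	return false;
}
```

In this model a task inside a child namespace never gains anything over the initial namespace, however many capability bits it holds, because the upward walk from the initial namespace never reaches the task's own namespace.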

All existing 'capable' checks are automatically converted to
checks against the initial user namespace.  The rest of the
patches begin to enable capabilities in child user namespaces
to setuid, setgid, set hostnames, kill tasks, and do ptrace.

My next step would be to re-introduce a part of a several year
old patchset which assigns a userns to a superblock (and hence
to inodes), and grants 'user other' permissions to any task
whose uid does not map to the target userns.  (By default, this
will be all but the initial userns)

thanks,
-serge


[Devel] Re: [RFC 4/5] user namespaces: allow killing tasks in your own or child userns

2011-01-02 Thread Serge E. Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> > +static inline int kill_ok_by_cred(struct task_struct *t)
> > +{
> > +   struct cred *cred = current_cred();
> > +   struct cred *tcred = __task_cred(t);

(Note, here and elsewhere these should have been const.  My next posting,
with some new patches, will have that fix)

thanks,
-serge


[Devel] Re: [RFC 5/5] user namespaces: Allow ptrace from non-init user namespaces

2010-12-31 Thread Serge E. Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> "Serge E. Hallyn"  writes:
> 
> > ptrace is allowed to tasks in the same user namespace according to
> > the usual rules (i.e. the same rules as for two tasks in the init
> > user namespace).  ptrace is also allowed to a user namespace to
> > which the current task the has CAP_SYS_PTRACE capability.
> 
> The uid equality check below is broken.

Thanks for the review, Eric.  Updated version appended.  Assuming there
are no big problems with this version, I hope to do setuid/setgid and
start the simplest vfs access controls next.

Subject: [PATCH 5/5] Allow ptrace from non-init user namespaces

ptrace is allowed to tasks in the same user namespace according to
the usual rules (i.e. the same rules as for two tasks in the init
user namespace).  ptrace is also allowed to a user namespace to
which the current task has the CAP_SYS_PTRACE capability.

Changelog:
Dec 31: Address feedback by Eric:
. Correct ptrace uid check
. Rename may_ptrace_ns to ptrace_capable
. Also fix the cap_ptrace checks.
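
The "Correct ptrace uid check" item comes down to negating a condition properly. The pre-patch code denies access when any of the target's ids differs from the tracer's, so the positive "goto ok" form has to require that all of them match; joining the equality tests with || instead lets a single matching id through. A small user-space sketch of the difference, assuming a simplified id set (`model_ids` and the three helper names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified credential ids for illustration only. */
struct model_ids {
	unsigned int uid, euid, suid;
	unsigned int gid, egid, sgid;
};

/* Pre-patch form: deny (absent the capability) if ANY id differs. */
static bool any_mismatch(const struct model_ids *tracer,
			 const struct model_ids *target)
{
	assert(tracer && target);
	return tracer->uid != target->euid ||
	       tracer->uid != target->suid ||
	       tracer->uid != target->uid  ||
	       tracer->gid != target->egid ||
	       tracer->gid != target->sgid ||
	       tracer->gid != target->gid;
}

/* Correct positive form: allow only if ALL ids match. */
static bool all_match(const struct model_ids *tracer,
		      const struct model_ids *target)
{
	assert(tracer && target);
	return tracer->uid == target->euid &&
	       tracer->uid == target->suid &&
	       tracer->uid == target->uid  &&
	       tracer->gid == target->egid &&
	       tracer->gid == target->sgid &&
	       tracer->gid == target->gid;
}

/* Broken positive form: allows as soon as ONE id matches. */
static bool any_match(const struct model_ids *tracer,
		      const struct model_ids *target)
{
	assert(tracer && target);
	return tracer->uid == target->euid ||
	       tracer->uid == target->suid ||
	       tracer->uid == target->uid  ||
	       tracer->gid == target->egid ||
	       tracer->gid == target->sgid ||
	       tracer->gid == target->gid;
}
```

For a target whose real uid equals the tracer's but whose effective and saved uids are 0, all_match() refuses while any_match() accepts, which is the kind of case the corrected check is meant to exclude.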

Signed-off-by: Serge E. Hallyn 
---
 include/linux/capability.h |2 +
 include/linux/user_namespace.h |9 +++
 kernel/ptrace.c|   40 +++--
 kernel/user_namespace.c|   16 +
 security/commoncap.c   |   48 +--
 5 files changed, 95 insertions(+), 20 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index cc3e976..777a166 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -543,6 +543,8 @@ extern const kernel_cap_t __cap_init_eff_set;
  */
 #define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0)
 
+#define has_ns_capability(t, ns, cap) (security_real_capable((t), (ns), (cap)) == 0)
+
 /**
  * has_capability_noaudit - Determine if a task has a superior capability available (unaudited)
  * @t: The task in question
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 8178156..91c4f10 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -39,6 +39,9 @@ static inline void put_user_ns(struct user_namespace *ns)
 uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, uid_t uid);
 gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t gid);
 
+int same_or_ancestor_user_ns(struct task_struct *task,
+   struct task_struct *victim);
+
 #else
 
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
@@ -66,6 +69,12 @@ static inline gid_t user_ns_map_gid(struct user_namespace *to,
return gid;
 }
 
+static inline int same_or_ancestor_user_ns(struct task_struct *task,
+   struct task_struct *victim)
+{
+   return 1;
+}
+
 #endif
 
 #endif /* _LINUX_USER_H */
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 99bbaa3..88e3fb3 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -116,6 +116,19 @@ int ptrace_check_attach(struct task_struct *child, int kill)
return ret;
 }
 
+static inline int ptrace_capable(struct task_struct *t)
+{
+   struct user_namespace *ns;
+   int ret;
+
+   rcu_read_lock();
+   ns = task_cred_xxx(t, user)->user_ns;
+   ret = ns_capable(ns, CAP_SYS_PTRACE);
+   rcu_read_unlock();
+
+   return ret;
+}
+
 int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 {
const struct cred *cred = current_cred(), *tcred;
@@ -134,21 +147,24 @@ int __ptrace_may_access(struct task_struct *task, unsigned int mode)
return 0;
rcu_read_lock();
tcred = __task_cred(task);
-   if ((cred->uid != tcred->euid ||
-cred->uid != tcred->suid ||
-cred->uid != tcred->uid  ||
-cred->gid != tcred->egid ||
-cred->gid != tcred->sgid ||
-cred->gid != tcred->gid) &&
-   !capable(CAP_SYS_PTRACE)) {
-   rcu_read_unlock();
-   return -EPERM;
-   }
+   if (cred->user->user_ns == tcred->user->user_ns &&
+   (cred->uid == tcred->euid &&
+cred->uid == tcred->suid &&
+cred->uid == tcred->uid  &&
+cred->gid == tcred->egid &&
+cred->gid == tcred->sgid &&
+cred->gid == tcred->gid))
+   goto ok;
+   if (ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE))
+   goto ok;
+   rcu_read_unlock();
+   return -EPERM;
+ok:
rcu_read_unlock();
smp_rmb();
if (task->mm)
dumpable = get_dumpable(task->mm);
-   if (!dumpable && !capable(CAP_SYS_PTRACE))
+   i

[Devel] Re: [RFC 4/5] user namespaces: allow killing tasks in your own or child userns

2010-12-31 Thread Serge E. Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> "Serge E. Hallyn"  writes:
> 
> > Quoting Eric W. Biederman (ebied...@xmission.com):
> >> > --- a/kernel/signal.c
> >> > +++ b/kernel/signal.c
> >> > @@ -659,11 +686,7 @@ static int check_kill_permission(int sig, struct siginfo *info,
> >> >  cred = current_cred();
> >> >  tcred = __task_cred(t);
> >> Nit pick  you don't need to compute cred and tcred here now.
> >
> > Just to make sure I understand right: you mean wait until after the
> > same_thread_group() check to save calculation in that case, right?
> 
> I mean cred and tcred are only use in kill_ok_by_cred.
> So we can eliminate those two variables from check_kill_permission.

Thanks for the review.  Here is an updated version.

Subject: [PATCH 4/5] allow killing tasks in your own or child userns

Changelog:
Dec 8: Fixed bug in my check_kill_permission pointed out by
   Eric Biederman.
Dec 13: Apply Eric's suggestion to pass target task into
   kill_ok_by_cred() for clarity.
Dec 31: Address comment by Eric Biederman:
   don't need cred/tcred in check_kill_permission.

Signed-off-by: Serge E. Hallyn 
---
 kernel/signal.c |   36 
 1 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 4e3cff1..d890c99 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -636,13 +636,39 @@ static inline bool si_fromuser(const struct siginfo *info)
 }
 
 /*
+ * called with RCU read lock from check_kill_permission()
+ */
+static inline int kill_ok_by_cred(struct task_struct *t)
+{
+   struct cred *cred = current_cred();
+   struct cred *tcred = __task_cred(t);
+
+   if (cred->user->user_ns != tcred->user->user_ns) {
+   /* userids are not equivalent - either you have the
+  capability to the target user ns or you don't */
+   if (ns_capable(tcred->user->user_ns, CAP_KILL))
+   return 1;
+   return 0;
+   }
+
+   /* same user namespace - usual credentials checks apply */
+   if ((cred->euid ^ tcred->suid) &&
+   (cred->euid ^ tcred->uid) &&
+   (cred->uid  ^ tcred->suid) &&
+   (cred->uid  ^ tcred->uid) &&
+   !ns_capable(tcred->user->user_ns, CAP_KILL))
+   return 0;
+
+   return 1;
+}
+
+/*
  * Bad permissions for sending the signal
  * - the caller must hold the RCU read lock
  */
 static int check_kill_permission(int sig, struct siginfo *info,
 struct task_struct *t)
 {
-   const struct cred *cred, *tcred;
struct pid *sid;
int error;
 
@@ -656,14 +682,8 @@ static int check_kill_permission(int sig, struct siginfo *info,
if (error)
return error;
 
-   cred = current_cred();
-   tcred = __task_cred(t);
if (!same_thread_group(current, t) &&
-   (cred->euid ^ tcred->suid) &&
-   (cred->euid ^ tcred->uid) &&
-   (cred->uid  ^ tcred->suid) &&
-   (cred->uid  ^ tcred->uid) &&
-   !capable(CAP_KILL)) {
+   !kill_ok_by_cred(t)) {
switch (sig) {
case SIGCONT:
sid = task_session(t);
-- 
1.7.0.4



[Devel] Re: [RFC 4/5] user namespaces: allow killing tasks in your own or child userns

2010-12-17 Thread Serge E. Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> >> >  cred = current_cred();
> >> >  tcred = __task_cred(t);
> >> Nit pick  you don't need to compute cred and tcred here now.
> >
> > Just to make sure I understand right: you mean wait until after the
> > same_thread_group() check to save calculation in that case, right?
> 
> I mean cred and tcred are only use in kill_ok_by_cred.
> So we can eliminate those two variables from check_kill_permission.

D'oh.  Should've looked at the original tree, not the context.  Got it,
thanks.

-serge


[Devel] Re: [RFC 4/5] user namespaces: allow killing tasks in your own or child userns

2010-12-17 Thread Serge E. Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -659,11 +686,7 @@ static int check_kill_permission(int sig, struct siginfo *info,
> > cred = current_cred();
> > tcred = __task_cred(t);
> Nit pick  you don't need to compute cred and tcred here now.

Just to make sure I understand right: you mean wait until after the
same_thread_group() check to save calculation in that case, right?

> > if (!same_thread_group(current, t) &&
> > -   (cred->euid ^ tcred->suid) &&
> > -   (cred->euid ^ tcred->uid) &&
> > -   (cred->uid  ^ tcred->suid) &&
> > -   (cred->uid  ^ tcred->uid) &&
> > -   !capable(CAP_KILL)) {
> > +   !kill_ok_by_cred(t)) {
> > switch (sig) {
> > case SIGCONT:
> > sid = task_session(t);


[Devel] [RFC 5/5] user namespaces: Allow ptrace from non-init user namespaces

2010-12-17 Thread Serge E. Hallyn
ptrace is allowed to tasks in the same user namespace according to
the usual rules (i.e. the same rules as for two tasks in the init
user namespace).  ptrace is also allowed to a user namespace to
which the current task has the CAP_SYS_PTRACE capability.

Signed-off-by: Serge E. Hallyn 
---
 include/linux/capability.h |2 ++
 kernel/ptrace.c|   40 
 security/commoncap.c   |   26 +-
 3 files changed, 51 insertions(+), 17 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index cc3e976..777a166 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -543,6 +543,8 @@ extern const kernel_cap_t __cap_init_eff_set;
  */
 #define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0)
 
+#define has_ns_capability(t, ns, cap) (security_real_capable((t), (ns), (cap)) == 0)
+
 /**
  * has_capability_noaudit - Determine if a task has a superior capability available (unaudited)
  * @t: The task in question
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 99bbaa3..aed24eb 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -116,6 +116,19 @@ int ptrace_check_attach(struct task_struct *child, int kill)
return ret;
 }
 
+static inline int may_ptrace_ns(struct task_struct *t)
+{
+   struct user_namespace *ns;
+   int ret;
+
+   rcu_read_lock();
+   ns = task_cred_xxx(t, user)->user_ns;
+   ret = ns_capable(ns, CAP_SYS_PTRACE);
+   rcu_read_unlock();
+
+   return ret;
+}
+
 int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 {
const struct cred *cred = current_cred(), *tcred;
@@ -134,21 +147,24 @@ int __ptrace_may_access(struct task_struct *task, unsigned int mode)
return 0;
rcu_read_lock();
tcred = __task_cred(task);
-   if ((cred->uid != tcred->euid ||
-cred->uid != tcred->suid ||
-cred->uid != tcred->uid  ||
-cred->gid != tcred->egid ||
-cred->gid != tcred->sgid ||
-cred->gid != tcred->gid) &&
-   !capable(CAP_SYS_PTRACE)) {
-   rcu_read_unlock();
-   return -EPERM;
-   }
+   if (cred->user->user_ns == tcred->user->user_ns &&
+   (cred->uid == tcred->euid ||
+cred->uid == tcred->suid ||
+cred->uid == tcred->uid  ||
+cred->gid == tcred->egid ||
+cred->gid == tcred->sgid ||
+cred->gid == tcred->gid))
+   goto ok;
+   if (ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE))
+   goto ok;
+   rcu_read_unlock();
+   return -EPERM;
+ok:
rcu_read_unlock();
smp_rmb();
if (task->mm)
dumpable = get_dumpable(task->mm);
-   if (!dumpable && !capable(CAP_SYS_PTRACE))
+   if (!dumpable && !may_ptrace_ns(task))
return -EPERM;
 
return security_ptrace_access_check(task, mode);
@@ -198,7 +214,7 @@ int ptrace_attach(struct task_struct *task)
goto unlock_tasklist;
 
task->ptrace = PT_PTRACED;
-   if (capable(CAP_SYS_PTRACE))
+   if (may_ptrace_ns(task))
task->ptrace |= PT_PTRACE_CAP;
 
__ptrace_link(task, current);
diff --git a/security/commoncap.c b/security/commoncap.c
index 9d910e6..bd0bcc6 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -136,12 +136,20 @@ int cap_settime(struct timespec *ts, struct timezone *tz)
 int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
 {
int ret = 0;
+   struct cred *cred, *tcred;
 
rcu_read_lock();
-   if (!cap_issubset(__task_cred(child)->cap_permitted,
- current_cred()->cap_permitted) &&
+   cred = current_cred();
+   tcred = __task_cred(child);
+   if (cred->user->user_ns != tcred->user->user_ns) {
+   if (!ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE))
+   ret = -EPERM;
+   goto out;
+   }
+   if (!cap_issubset(tcred->cap_permitted, cred->cap_permitted) &&
!capable(CAP_SYS_PTRACE))
ret = -EPERM;
+out:
rcu_read_unlock();
return ret;
 }
@@ -156,12 +164,20 @@ int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
 int cap_ptrace_traceme(struct task_struct *parent)
 {
int ret = 0;
+   struct cred *cred, *tcred;
 
rcu_read_lock();
-   if (!cap_issubset(current_cred()->cap_permitted,
- __task_cred(parent)->cap_permitted) &&
-   !has_capability(parent, CAP_SYS_PTRACE))
+   cred = __task_cred(parent);
+   tcred = current_cred();
+  

[Devel] [RFC 4/5] user namespaces: allow killing tasks in your own or child userns

2010-12-17 Thread Serge E. Hallyn
Changelog:
Dec 8: Fixed bug in my check_kill_permission pointed out by
   Eric Biederman.
Dec 13: Apply Eric's suggestion to pass target task into
   kill_ok_by_cred() for clarity.

Signed-off-by: Serge E. Hallyn 
---
 kernel/signal.c |   33 -
 1 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 4e3cff1..499bd36 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -636,6 +636,33 @@ static inline bool si_fromuser(const struct siginfo *info)
 }
 
 /*
+ * called with RCU read lock from check_kill_permission()
+ */
+static inline int kill_ok_by_cred(struct task_struct *t)
+{
+   struct cred *cred = current_cred();
+   struct cred *tcred = __task_cred(t);
+
+   if (cred->user->user_ns != tcred->user->user_ns) {
+   /* userids are not equivalent - either you have the
+  capability to the target user ns or you don't */
+   if (ns_capable(tcred->user->user_ns, CAP_KILL))
+   return 1;
+   return 0;
+   }
+
+   /* same user namespace - usual credentials checks apply */
+   if ((cred->euid ^ tcred->suid) &&
+   (cred->euid ^ tcred->uid) &&
+   (cred->uid  ^ tcred->suid) &&
+   (cred->uid  ^ tcred->uid) &&
+   !ns_capable(tcred->user->user_ns, CAP_KILL))
+   return 0;
+
+   return 1;
+}
+
+/*
  * Bad permissions for sending the signal
  * - the caller must hold the RCU read lock
  */
@@ -659,11 +686,7 @@ static int check_kill_permission(int sig, struct siginfo *info,
cred = current_cred();
tcred = __task_cred(t);
if (!same_thread_group(current, t) &&
-   (cred->euid ^ tcred->suid) &&
-   (cred->euid ^ tcred->uid) &&
-   (cred->uid  ^ tcred->suid) &&
-   (cred->uid  ^ tcred->uid) &&
-   !capable(CAP_KILL)) {
+   !kill_ok_by_cred(t)) {
switch (sig) {
case SIGCONT:
sid = task_session(t);
-- 
1.7.0.4



[Devel] [RFC 3/5] user namespaces: allow sethostname in a container

2010-12-17 Thread Serge E. Hallyn
Signed-off-by: Serge E. Hallyn 
---
 kernel/sys.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 2745dcd..9b9b03b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1171,7 +1171,7 @@ SYSCALL_DEFINE2(sethostname, char __user *, name, int, len)
int errno;
char tmp[__NEW_UTS_LEN];
 
-   if (!capable(CAP_SYS_ADMIN))
+   if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN))
return -EPERM;
if (len < 0 || len > __NEW_UTS_LEN)
return -EINVAL;
-- 
1.7.0.4



[Devel] [RFC 2/5] user namespaces: make capabilities relative to the user namespace.

2010-12-17 Thread Serge E. Hallyn
- Introduce ns_capable to test for a capability in a non-default
  user namespace.
- Teach cap_capable to handle capabilities in a non-default
  user namespace.

The motivation is to get to the unprivileged creation of new
namespaces.  It looks like this gets us 90% of the way there, with
only potential uid confusion issues left.

I still need to handle getting all caps after creation but otherwise I
think I have a good starter patch that achieves all of your goals.

Changelog:
11/05/2010: [serge] add apparmor
12/14/2010: [serge] fix capabilities to created user namespaces
Without this, if user serge creates a user_ns, he won't have
capabilities to the user_ns he created.  This is because we
were first checking whether his effective caps had the caps
he needed and returning -EPERM if not, and THEN checking whether
he was the creator.  Reverse those checks.
12/16/2010: [serge] security_real_capable needs ns argument in !security case

Signed-off-by: Eric W. Biederman 
Signed-off-by: Serge E. Hallyn 
---
 include/linux/capability.h |7 +--
 include/linux/security.h   |   22 --
 kernel/capability.c|   22 --
 security/apparmor/lsm.c|5 +++--
 security/commoncap.c   |   40 +---
 security/security.c|   12 ++--
 security/selinux/hooks.c   |   14 +-
 7 files changed, 88 insertions(+), 34 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index 90012b9..cc3e976 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -541,7 +541,7 @@ extern const kernel_cap_t __cap_init_eff_set;
  *
  * Note that this does not set PF_SUPERPRIV on the task.
  */
-#define has_capability(t, cap) (security_real_capable((t), (cap)) == 0)
+#define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0)
 
 /**
  * has_capability_noaudit - Determine if a task has a superior capability available (unaudited)
@@ -555,9 +555,12 @@ extern const kernel_cap_t __cap_init_eff_set;
  * Note that this does not set PF_SUPERPRIV on the task.
  */
 #define has_capability_noaudit(t, cap) \
-   (security_real_capable_noaudit((t), (cap)) == 0)
+   (security_real_capable_noaudit((t), &init_user_ns, (cap)) == 0)
 
+struct user_namespace;
+extern struct user_namespace init_user_ns;
 extern int capable(int cap);
+extern int ns_capable(struct user_namespace *ns, int cap);
 
 /* audit system wants to get cap info from files as well */
 struct dentry;
diff --git a/include/linux/security.h b/include/linux/security.h
index 39f5b7e..2141f5a 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -46,13 +46,14 @@
 
 struct ctl_table;
 struct audit_krule;
+struct user_namespace;
 
 /*
  * These functions are in security/capability.c and are used
  * as the default capabilities functions
  */
 extern int cap_capable(struct task_struct *tsk, const struct cred *cred,
-  int cap, int audit);
+  struct user_namespace *ns, int cap, int audit);
 extern int cap_settime(struct timespec *ts, struct timezone *tz);
 extern int cap_ptrace_access_check(struct task_struct *child, unsigned int mode);
 extern int cap_ptrace_traceme(struct task_struct *parent);
@@ -1258,6 +1259,7 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
  * credentials.
  * @tsk contains the task_struct for the process.
  * @cred contains the credentials to use.
+ *  @ns contains the user namespace we want the capability in
  * @cap contains the capability .
  * @audit: Whether to write an audit message or not
  * Return 0 if the capability is granted for @tsk.
@@ -1386,7 +1388,7 @@ struct security_operations {
   const kernel_cap_t *inheritable,
   const kernel_cap_t *permitted);
int (*capable) (struct task_struct *tsk, const struct cred *cred,
-   int cap, int audit);
+   struct user_namespace *ns, int cap, int audit);
int (*sysctl) (struct ctl_table *table, int op);
int (*quotactl) (int cmds, int type, int id, struct super_block *sb);
int (*quota_on) (struct dentry *dentry);
@@ -1668,9 +1670,9 @@ int security_capset(struct cred *new, const struct cred *old,
const kernel_cap_t *effective,
const kernel_cap_t *inheritable,
const kernel_cap_t *permitted);
-int security_capable(int cap);
-int security_real_capable(struct task_struct *tsk, int cap);
-int security_real_capable_noaudit(struct task_struct *tsk, int cap);
+int security_capable(struct user_namespace *ns, int cap);
+int security_real_capable(struct task_struct *tsk, struct user_namespace *ns, int cap);
+int security_real_capable_noaudit(struct task_struct *tsk, struct user_namespace

[Devel] [RFC 1/5] user namespaces: Add a user_namespace as creator/owner of uts_namespace

2010-12-17 Thread Serge E. Hallyn
copy_process() handles CLONE_NEWUSER before the rest of the
namespaces.  So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS)
the new uts namespace will have the new user namespace as its
owner.  That is what we want, since we want root in that new
userns to be able to have privilege over it.

Signed-off-by: Serge E. Hallyn 
---
 include/linux/utsname.h |3 +++
 init/version.c  |2 ++
 kernel/nsproxy.c|3 +++
 kernel/user.c   |8 ++--
 kernel/utsname.c|4 
 5 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 69f3997..85171be 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -37,9 +37,12 @@ struct new_utsname {
 #include 
 #include 
 
+struct user_namespace;
+
 struct uts_namespace {
struct kref kref;
struct new_utsname name;
+   struct user_namespace *user_ns;
 };
 extern struct uts_namespace init_uts_ns;
 
diff --git a/init/version.c b/init/version.c
index 79fb8c2..9eb19fb 100644
--- a/init/version.c
+++ b/init/version.c
@@ -21,6 +21,7 @@ extern int version_string(LINUX_VERSION_CODE);
 int version_string(LINUX_VERSION_CODE);
 #endif
 
+extern struct user_namespace init_user_ns;
 struct uts_namespace init_uts_ns = {
.kref = {
.refcount   = ATOMIC_INIT(2),
@@ -33,6 +34,7 @@ struct uts_namespace init_uts_ns = {
.machine= UTS_MACHINE,
.domainname = UTS_DOMAINNAME,
},
+   .user_ns = &init_user_ns,
 };
 EXPORT_SYMBOL_GPL(init_uts_ns);
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f74e6c0..5a22dcf 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -74,6 +74,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
err = PTR_ERR(new_nsp->uts_ns);
goto out_uts;
}
+   put_user_ns(new_nsp->uts_ns->user_ns);
+   new_nsp->uts_ns->user_ns = task_cred_xxx(tsk, user)->user_ns;
+   get_user_ns(new_nsp->uts_ns->user_ns);
 
new_nsp->ipc_ns = copy_ipcs(flags, tsk->nsproxy->ipc_ns);
if (IS_ERR(new_nsp->ipc_ns)) {
diff --git a/kernel/user.c b/kernel/user.c
index 2c7d8d5..5125681 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -17,9 +17,13 @@
 #include 
 #include 
 
+/*
+ * userns count is 1 for root user, 1 for init_uts_ns,
+ * and 1 for... ?
+ */
 struct user_namespace init_user_ns = {
.kref = {
-   .refcount   = ATOMIC_INIT(2),
+   .refcount   = ATOMIC_INIT(3),
},
.creator = &root_user,
 };
@@ -47,7 +51,7 @@ static struct kmem_cache *uid_cachep;
  */
 static DEFINE_SPINLOCK(uidhash_lock);
 
-/* root_user.__count is 2, 1 for init task cred, 1 for init_user_ns->creator */
+/* root_user.__count is 2, 1 for init task cred, 1 for init_user_ns->user_ns */
 struct user_struct root_user = {
.__count= ATOMIC_INIT(2),
.processes  = ATOMIC_INIT(1),
diff --git a/kernel/utsname.c b/kernel/utsname.c
index 8a82b4b..a7b3a8d 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct uts_namespace *create_uts_ns(void)
 {
@@ -40,6 +41,8 @@ static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns)
 
down_read(&uts_sem);
memcpy(&ns->name, &old_ns->name, sizeof(ns->name));
+   ns->user_ns = old_ns->user_ns;
+   get_user_ns(ns->user_ns);
up_read(&uts_sem);
return ns;
 }
@@ -71,5 +74,6 @@ void free_uts_ns(struct kref *kref)
struct uts_namespace *ns;
 
ns = container_of(kref, struct uts_namespace, kref);
+   put_user_ns(ns->user_ns);
kfree(ns);
 }
-- 
1.7.0.4
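
For readers following the reference counting in this patch, here is a minimal userspace sketch of the same get/put pairing. It is a toy model, not the kernel code: the toy_* types and functions are made up for illustration, the counter is a plain int rather than a kref, and toy_set_owner() takes the new reference before dropping the old one (the hunk in create_new_namespaces() does the put first).

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins for struct user_namespace and struct uts_namespace. */
struct toy_user_ns {
    int refcount;
};

struct toy_uts_ns {
    struct toy_user_ns *user_ns;    /* owner, pinned by a reference */
};

static struct toy_user_ns *toy_get_user_ns(struct toy_user_ns *ns)
{
    ns->refcount++;
    return ns;
}

static void toy_put_user_ns(struct toy_user_ns *ns)
{
    assert(ns->refcount > 0);
    ns->refcount--;
}

/* Like clone_uts_ns(): the copy inherits the owner and pins it. */
static struct toy_uts_ns *toy_clone_uts_ns(const struct toy_uts_ns *old)
{
    struct toy_uts_ns *ns = malloc(sizeof(*ns));

    if (!ns)
        return NULL;
    ns->user_ns = toy_get_user_ns(old->user_ns);
    return ns;
}

/* Like the create_new_namespaces() hunk: hand the copy to a new owner. */
static void toy_set_owner(struct toy_uts_ns *ns, struct toy_user_ns *owner)
{
    toy_get_user_ns(owner);         /* new reference first */
    toy_put_user_ns(ns->user_ns);   /* then drop the inherited one */
    ns->user_ns = owner;
}

/* Like free_uts_ns(): dropping the uts namespace releases its owner. */
static void toy_free_uts_ns(struct toy_uts_ns *ns)
{
    toy_put_user_ns(ns->user_ns);
    free(ns);
}
```

Every path that stores a user_ns pointer in a uts namespace takes a reference, and the single free path drops it, so the counts balance across clone, re-own and free.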



[Devel] [RFC 0/5] user namespaces: start clamping down

2010-12-17 Thread Serge E. Hallyn
Following is the next version of the user namespace patchset.

The core of the set is patch 2, originally conceived of and
implemented by Eric Biederman.  The concept is to target
capabilities at user namespaces.  A task's capabilities are
now contextualized as follows (previously, capabilities had
no context):

1. For a task in the initial user namespace, the calculated
capabilities (pi, pe, pp) are available to act upon any
user namespace.

2. For a task in a child user namespace, the calculated
capabilities are available to act only on its own or any
descendent user namespace.  It has no capabilities to any
parent or unrelated user namespaces.

3. If a user A creates a new user namespace, that user has
all capabilities to that new user namespace and any of its
descendents.  (Contrast this with a user namespace created
by another user B in the same user namespace, to which this
user A has only his calculated capabilities)

All existing 'capable' checks are automatically converted to
checks against the initial user namespace.  This means that
by default, root in a child user namespace is powerless.
Patches 3-5 begin to enable capabilities in child user
namespaces to set hostnames, kill tasks, and do ptrace.
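
To make rules 1-3 concrete, here is a minimal userspace sketch of the lookup. It is a toy model and not the cap_capable() code from patch 2: the toy_ns, toy_user and toy_cred types and the toy_ns_capable() name are made up for illustration, and capabilities are reduced to a single bitmask. It assumes each namespace records its creator and that the initial namespace is the only one without a creator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_ns;

/* A user, living in exactly one user namespace. */
struct toy_user {
    struct toy_ns *ns;
};

/* A user namespace; creator is NULL only for the initial namespace. */
struct toy_ns {
    struct toy_user *creator;
};

/* A task's credentials: who it is, plus its calculated effective caps. */
struct toy_cred {
    struct toy_user *user;
    unsigned long effective;    /* bit N set => capability N is raised */
};

static bool toy_ns_capable(const struct toy_cred *cred,
                           const struct toy_ns *target, int cap)
{
    assert(cap >= 0 && cap < (int)(8 * sizeof(unsigned long)));

    for (;;) {
        /* Rule 3: the creator has all capabilities to the namespace,
         * checked before looking at the effective set. */
        if (target->creator != NULL && target->creator == cred->user)
            return true;
        /* Rules 1 and 2: calculated capabilities count in the task's
         * own namespace, and (through the walk below) in descendants. */
        if (target == cred->user->ns)
            return (cred->effective >> cap) & 1UL;
        /* Reached the initial namespace without a match: the target
         * was a parent or an unrelated namespace. */
        if (target->creator == NULL)
            return false;
        /* Otherwise retry against the parent of the target. */
        target = target->creator->ns;
    }
}
```

In this model, a user who created a child namespace gets every capability there even with an empty mask, while root inside the child gets nothing against the initial namespace.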

My near-term next goals will be to enable setuid and setgid,
and to provide a way for the filesystem to be usable in child
user namespaces.  At the very least I'd like a fresh loopback
or LVM mount and proc mounts to be supported.

thanks,
-serge


[Devel] Re: cgroup tasks file error

2010-12-14 Thread Serge E. Hallyn
Quoting ccmail111 (ccmail...@yahoo.com):
> Hi Li,
> 
> uid is already root:
> 
> [host:/dev/cgroup]$ id
> uid=0(root) gid=0(root) groups=0(root)
> 
> 
> [host:/dev/cgroup]$ echo 580 > tasks
> -bash: echo: write error: Operation not permitted
> 
> [host:/dev/cgroup]$ cat hello/tasks
> 580
> 610
> 2104
> [host:/dev/cgroup]$

Could you give:
ls -lZd /dev/cgroup /dev/cgroup/tasks
id -Z
ls -laFZ /dev/cgroup/
cat /dev/cgroup/*
mount

I'm wondering whether an lsm is stopping you.

-serge


[Devel] Re: cgroup tasks file error

2010-12-13 Thread Serge E. Hallyn
Quoting ccmail111 (ccmail...@yahoo.com):
> 
> I see error:
> [host:/dev/cgroup]$ echo 693 > hello-test/tasks
> -bash: echo: write error: No space left on device
> [host:/dev/cgroup]$ pwd
> /dev/cgroup
> 
> But the user process is up and running..
> 
> [host:/dev/cgroup]$ ps aux | grep proc
> root       693  0.0  0.4  34720  1112 ttyS0    Sl   19:11   0:00 /opt/bin/myproc -ext
> 
> Also the cgroup exists and valid..
> 
> [host:/dev/cgroup]$ ls | grep hello-test
> hello-test
> 
> What above error mean and any suggestions ?
> Please email.

Which cgroups do you have composed on that mount?  I'm guessing you
have cpuset, and you need to set the cpuset.mems and cpuset.cpus.
Until you do that, no tasks can be assigned to it.
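
A minimal sketch of that ordering in C, assuming the cpuset controller is on the mount and its files carry the usual cpuset. prefix (a hierarchy mounted with noprefix would use plain cpus and mems instead). The helper names cg_write and cpuset_attach are hypothetical, and which cpus and memory nodes to assign is up to you. As far as I know the cpuset attach check returns ENOSPC while either file is empty, which would match the "No space left on device" above.

```c
#include <assert.h>
#include <stdio.h>

/* Write one value plus a newline to <dir>/<file>.
 * Returns 0 on success, -1 on any failure. */
static int cg_write(const char *dir, const char *file, const char *value)
{
    char path[4096];
    FILE *f;
    int n, ok;

    assert(dir && file && value);

    n = snprintf(path, sizeof(path), "%s/%s", dir, file);
    if (n < 0 || n >= (int)sizeof(path))
        return -1;

    f = fopen(path, "w");
    if (!f)
        return -1;
    ok = fprintf(f, "%s\n", value) > 0;
    /* Write errors can surface only when the buffer is flushed,
     * so the result of fclose() matters too. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Populate cpus and mems first, and only then move the task in. */
static int cpuset_attach(const char *cgroup_dir, const char *cpus,
                         const char *mems, long pid)
{
    char buf[32];

    snprintf(buf, sizeof(buf), "%ld", pid);
    if (cg_write(cgroup_dir, "cpuset.cpus", cpus) != 0 ||
        cg_write(cgroup_dir, "cpuset.mems", mems) != 0)
        return -1;
    return cg_write(cgroup_dir, "tasks", buf);
}
```

For the report above that would be something like cpuset_attach("/dev/cgroup/hello-test", "0", "0", 693), with the two "0" values replaced by whatever cpus and memory nodes the group should get.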

-serge


[Devel] Re: C/R of termios

2010-12-13 Thread Serge E. Hallyn
Quoting Sukadev Bhattiprolu (suka...@linux.vnet.ibm.com):
> Oren,
> 
> Any reason we only checkpoint/restore a 'struct termio' instead of a
> 'struct termios' ?
> 
> AFAICT, 'struct termios' seems to supersede 'struct termio' (i.e includes
> all fields of 'struct termio' plus more). The TCGETS ioctl and tcgetattr()
> interface return a 'struct termios' to user space.
> 
> The man page termio(7) says the 'struct termio' interface is obsolete.
> 
> The kernel uses 'struct ktermios' to represent the attributes internally.
> So shouldn't we checkpoint/restore the 'struct ktermios' object ?
> 
> If application uses legacy interface (TCGETA/TCSETA with 'struct termio')
> the kernel converts the 'struct ktermios' to the 'struct termio' in
> kernel_termios_to_user_termio(). So we should be fine if we C/R the
> ktermios object right ?
> 
> Here is a quick hack, just for reference. With this hack, I can C/R an
> app that does tcgetattr() of a pty before/after checkpoint.

Looks reasonable to me.


[Devel] Re: [RFC PATCH 4/4] allow killing tasks in your own or child userns

2010-12-10 Thread Serge E. Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> "Serge E. Hallyn"  writes:
> > +static inline int kill_ok_by_cred(struct cred *cred, struct cred *tcred)
> > +{
> Nit: You should just pass in the target task here.
> Making it abundantly clear where current and tcred come from.
> ns_capable implicitly uses current which is a little surprising
> when everything else is being passed in, but makes perfect sense
> in this context.

Thanks, that makes sense, will do.

If the set seems fine overall, then I'll also look at adding ptrace
controls, and hopefully send the result out next week.

thanks,
-serge


[Devel] [RFC PATCH 4/4] allow killing tasks in your own or child userns

2010-12-09 Thread Serge E. Hallyn
Changelog:
Dec 8: Fixed bug in my check_kill_permission pointed out by
   Eric Biederman.

To test:
1. Test killing tasks as usual.  No change.
2. Clone a new user namespace without a new pidns.
   a. You CAN kill -CONT tasks in your thread group but outside
  your user ns.
   b. You can NOT otherwise kill tasks outside your user_ns.
   c. Inside your new userns, signal semantics are as normal
  with respect to userids, CAP_KILL, and thread groups.

Signed-off-by: Serge E. Hallyn 
---
 kernel/signal.c |   27 ++-
 1 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 4e3cff1..677025c 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -635,6 +635,27 @@ static inline bool si_fromuser(const struct siginfo *info)
(!is_si_special(info) && SI_FROMUSER(info));
 }
 
+static inline int kill_ok_by_cred(struct cred *cred, struct cred *tcred)
+{
+   if (cred->user->user_ns != tcred->user->user_ns) {
+   /* userids are not equivalent - either you have the
+  capability to the target user ns or you don't */
+   if (ns_capable(tcred->user->user_ns, CAP_KILL))
+   return 1;
+   return 0;
+   }
+
+   /* same user namespace - usual credentials checks apply */
+   if ((cred->euid ^ tcred->suid) &&
+   (cred->euid ^ tcred->uid) &&
+   (cred->uid  ^ tcred->suid) &&
+   (cred->uid  ^ tcred->uid) &&
+   !ns_capable(tcred->user->user_ns, CAP_KILL))
+   return 0;
+
+   return 1;
+}
+
 /*
  * Bad permissions for sending the signal
  * - the caller must hold the RCU read lock
@@ -659,11 +680,7 @@ static int check_kill_permission(int sig, struct siginfo *info,
cred = current_cred();
tcred = __task_cred(t);
if (!same_thread_group(current, t) &&
-   (cred->euid ^ tcred->suid) &&
-   (cred->euid ^ tcred->uid) &&
-   (cred->uid  ^ tcred->suid) &&
-   (cred->uid  ^ tcred->uid) &&
-   !capable(CAP_KILL)) {
+   !kill_ok_by_cred(cred, tcred)) {
switch (sig) {
case SIGCONT:
sid = task_session(t);
-- 
1.7.2.3
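
For anyone who finds the XOR chain hard to read, here is a rough userspace restatement of the decision made by kill_ok_by_cred() above. It is only a sketch: toy_cred and toy_kill_ok_by_cred() are made-up names, the namespace is reduced to an opaque pointer, and the result of ns_capable(tcred->user->user_ns, CAP_KILL) is passed in as a plain flag. The same_thread_group() and SIGCONT session exceptions live in check_kill_permission() and are not modelled.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_cred {
    const void *user_ns;            /* identity of the user namespace */
    unsigned int uid, euid, suid;
};

static bool toy_kill_ok_by_cred(const struct toy_cred *cred,
                                const struct toy_cred *tcred,
                                bool cap_kill_in_target_ns)
{
    assert(cred && tcred);

    /* Different user namespaces: uids are not comparable, so only
     * CAP_KILL towards the target's namespace can allow the signal. */
    if (cred->user_ns != tcred->user_ns)
        return cap_kill_in_target_ns;

    /* Same namespace: (a ^ b) is nonzero exactly when a != b, so the
     * XOR chain in the patch fails only if none of these pairs match. */
    if (cred->euid == tcred->suid || cred->euid == tcred->uid ||
        cred->uid == tcred->suid || cred->uid == tcred->uid)
        return true;

    /* No uid match: fall back to the capability. */
    return cap_kill_in_target_ns;
}
```

In this model a sender with uid 1000 cannot signal a task that also has uid 1000 but lives in another user namespace unless the capability flag is set, which is the behaviour item 2b of the test list is meant to exercise.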



[Devel] [RFC PATCH 3/4] allow sethostname in a container

2010-12-09 Thread Serge E. Hallyn
To test this, you can:
1. clone a new user namespace without a new uts namespace.
   You can NOT set hostname.
2. clone both a new user and uts namespace.  You can set
   hostname.

Signed-off-by: Serge E. Hallyn 
---
 kernel/sys.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 2745dcd..9b9b03b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1171,7 +1171,7 @@ SYSCALL_DEFINE2(sethostname, char __user *, name, int, len)
int errno;
char tmp[__NEW_UTS_LEN];
 
-   if (!capable(CAP_SYS_ADMIN))
+   if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN))
return -EPERM;
if (len < 0 || len > __NEW_UTS_LEN)
return -EINVAL;
-- 
1.7.2.3
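
The two test cases from the description can be tried with a small helper like the one below. This is a sketch under assumptions: it uses unshare(2) in a forked child rather than clone(2), try_sethostname() is a made-up name, and whether unshare(CLONE_NEWUSER) is permitted at all depends on the kernel and on the caller's privileges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child, unshare the namespaces named in flags inside it, then
 * try to set the hostname there.
 *
 * Returns 0 if sethostname() succeeded, 1 if unshare() failed,
 * 2 if sethostname() failed, and -1 if the child could not be run.
 */
static int try_sethostname(int flags, const char *name)
{
    pid_t pid;
    int status;

    assert(name != NULL);

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: any new namespaces go away again when it exits. */
        if (unshare(flags) != 0)
            _exit(1);
        if (sethostname(name, strlen(name)) != 0)
            _exit(2);
        _exit(0);
    }
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

With this patch applied, try_sethostname(CLONE_NEWUSER, "test") should report a sethostname() failure (case 1), and try_sethostname(CLONE_NEWUSER | CLONE_NEWUTS, "test") should report success (case 2), provided the unshare itself is allowed.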



[Devel] [RFC PATCH 2/4] security: Make capabilities relative to the user namespace.

2010-12-09 Thread Serge E. Hallyn
- Introduce ns_capable to test for a capability in a non-default
  user namespace.
- Teach cap_capable to handle capabilities in a non-default
  user namespace.

The motivation is to get to the unprivileged creation of new
namespaces.  It looks like this gets us 90% of the way there, with
only potential uid confusion issues left.

I still need to handle getting all caps after creation but otherwise I
think I have a good starter patch that achieves all of your goals.

Changelog:
11/05/2010: [serge] add apparmor

Signed-off-by: Eric W. Biederman 
Acked-by: Serge E. Hallyn 
---
 include/linux/capability.h |7 +--
 include/linux/security.h   |   12 +++-
 kernel/capability.c|   22 --
 security/apparmor/lsm.c|5 +++--
 security/commoncap.c   |   40 +---
 security/security.c|   12 ++--
 security/selinux/hooks.c   |   14 +-
 7 files changed, 83 insertions(+), 29 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index 90012b9..cc3e976 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -541,7 +541,7 @@ extern const kernel_cap_t __cap_init_eff_set;
  *
  * Note that this does not set PF_SUPERPRIV on the task.
  */
-#define has_capability(t, cap) (security_real_capable((t), (cap)) == 0)
+#define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0)
 
 /**
  * has_capability_noaudit - Determine if a task has a superior capability available (unaudited)
@@ -555,9 +555,12 @@ extern const kernel_cap_t __cap_init_eff_set;
  * Note that this does not set PF_SUPERPRIV on the task.
  */
 #define has_capability_noaudit(t, cap) \
-   (security_real_capable_noaudit((t), (cap)) == 0)
+   (security_real_capable_noaudit((t), &init_user_ns, (cap)) == 0)
 
+struct user_namespace;
+extern struct user_namespace init_user_ns;
 extern int capable(int cap);
+extern int ns_capable(struct user_namespace *ns, int cap);
 
 /* audit system wants to get cap info from files as well */
 struct dentry;
diff --git a/include/linux/security.h b/include/linux/security.h
index 39f5b7e..9e05b08 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -46,13 +46,14 @@
 
 struct ctl_table;
 struct audit_krule;
+struct user_namespace;
 
 /*
  * These functions are in security/capability.c and are used
  * as the default capabilities functions
  */
 extern int cap_capable(struct task_struct *tsk, const struct cred *cred,
-  int cap, int audit);
+  struct user_namespace *ns, int cap, int audit);
 extern int cap_settime(struct timespec *ts, struct timezone *tz);
 extern int cap_ptrace_access_check(struct task_struct *child, unsigned int mode);
 extern int cap_ptrace_traceme(struct task_struct *parent);
@@ -1258,6 +1259,7 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
  * credentials.
  * @tsk contains the task_struct for the process.
  * @cred contains the credentials to use.
+ *  @ns contains the user namespace we want the capability in
  * @cap contains the capability .
  * @audit: Whether to write an audit message or not
  * Return 0 if the capability is granted for @tsk.
@@ -1386,7 +1388,7 @@ struct security_operations {
   const kernel_cap_t *inheritable,
   const kernel_cap_t *permitted);
int (*capable) (struct task_struct *tsk, const struct cred *cred,
-   int cap, int audit);
+   struct user_namespace *ns, int cap, int audit);
int (*sysctl) (struct ctl_table *table, int op);
int (*quotactl) (int cmds, int type, int id, struct super_block *sb);
int (*quota_on) (struct dentry *dentry);
@@ -1668,9 +1670,9 @@ int security_capset(struct cred *new, const struct cred *old,
const kernel_cap_t *effective,
const kernel_cap_t *inheritable,
const kernel_cap_t *permitted);
-int security_capable(int cap);
-int security_real_capable(struct task_struct *tsk, int cap);
-int security_real_capable_noaudit(struct task_struct *tsk, int cap);
+int security_capable(struct user_namespace *ns, int cap);
+int security_real_capable(struct task_struct *tsk, struct user_namespace *ns, int cap);
+int security_real_capable_noaudit(struct task_struct *tsk, struct user_namespace *ns, int cap);
 int security_sysctl(struct ctl_table *table, int op);
 int security_quotactl(int cmds, int type, int id, struct super_block *sb);
 int security_quota_on(struct dentry *dentry);
diff --git a/kernel/capability.c b/kernel/capability.c
index 2f05303..744dd6e 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -301,15 +302,32 @@ error:
  */
 int capable(int cap)
 {
+   return ns_capable(&init_

[Devel] [RFC PATCH 1/4] Add a user_namespace as creator/owner of uts_namespace

2010-12-09 Thread Serge E. Hallyn
copy_process() handles CLONE_NEWUSER before the rest of the
namespaces.  So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS)
the new uts namespace will have the new user namespace as its
owner.  That is what we want, since we want root in that new
userns to be able to have privilege over it.

Signed-off-by: Serge E. Hallyn 
---
 include/linux/utsname.h |3 +++
 init/version.c  |2 ++
 kernel/nsproxy.c|3 +++
 kernel/user.c   |8 ++--
 kernel/utsname.c|4 
 5 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 69f3997..85171be 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -37,9 +37,12 @@ struct new_utsname {
 #include 
 #include 
 
+struct user_namespace;
+
 struct uts_namespace {
struct kref kref;
struct new_utsname name;
+   struct user_namespace *user_ns;
 };
 extern struct uts_namespace init_uts_ns;
 
diff --git a/init/version.c b/init/version.c
index 79fb8c2c..9eb19fb 100644
--- a/init/version.c
+++ b/init/version.c
@@ -21,6 +21,7 @@ extern int version_string(LINUX_VERSION_CODE);
 int version_string(LINUX_VERSION_CODE);
 #endif
 
+extern struct user_namespace init_user_ns;
 struct uts_namespace init_uts_ns = {
.kref = {
.refcount   = ATOMIC_INIT(2),
@@ -33,6 +34,7 @@ struct uts_namespace init_uts_ns = {
.machine= UTS_MACHINE,
.domainname = UTS_DOMAINNAME,
},
+   .user_ns = &init_user_ns,
 };
 EXPORT_SYMBOL_GPL(init_uts_ns);
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f74e6c0..5a22dcf 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -74,6 +74,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
err = PTR_ERR(new_nsp->uts_ns);
goto out_uts;
}
+   put_user_ns(new_nsp->uts_ns->user_ns);
+   new_nsp->uts_ns->user_ns = task_cred_xxx(tsk, user)->user_ns;
+   get_user_ns(new_nsp->uts_ns->user_ns);
 
new_nsp->ipc_ns = copy_ipcs(flags, tsk->nsproxy->ipc_ns);
if (IS_ERR(new_nsp->ipc_ns)) {
diff --git a/kernel/user.c b/kernel/user.c
index 2c7d8d5..5125681 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -17,9 +17,13 @@
 #include 
 #include 
 
+/*
+ * userns count is 1 for root user, 1 for init_uts_ns,
+ * and 1 for... ?
+ */
 struct user_namespace init_user_ns = {
.kref = {
-   .refcount   = ATOMIC_INIT(2),
+   .refcount   = ATOMIC_INIT(3),
},
.creator = &root_user,
 };
@@ -47,7 +51,7 @@ static struct kmem_cache *uid_cachep;
  */
 static DEFINE_SPINLOCK(uidhash_lock);
 
-/* root_user.__count is 2, 1 for init task cred, 1 for init_user_ns->creator */
+/* root_user.__count is 2, 1 for init task cred, 1 for init_user_ns->user_ns */
 struct user_struct root_user = {
.__count= ATOMIC_INIT(2),
.processes  = ATOMIC_INIT(1),
diff --git a/kernel/utsname.c b/kernel/utsname.c
index 8a82b4b..a7b3a8d 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct uts_namespace *create_uts_ns(void)
 {
@@ -40,6 +41,8 @@ static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns)
 
down_read(&uts_sem);
memcpy(&ns->name, &old_ns->name, sizeof(ns->name));
+   ns->user_ns = old_ns->user_ns;
+   get_user_ns(ns->user_ns);
up_read(&uts_sem);
return ns;
 }
@@ -71,5 +74,6 @@ void free_uts_ns(struct kref *kref)
struct uts_namespace *ns;
 
ns = container_of(kref, struct uts_namespace, kref);
+   put_user_ns(ns->user_ns);
kfree(ns);
 }
-- 
1.7.2.3



[Devel] Re: trying to build simple checkpoint/restart recipes

2010-12-08 Thread Serge E. Hallyn
Quoting Rob Landley (rland...@parallels.com):
> > > The restoration of the mounts is not scriptable however. It involves
> > > parsing the mountinfo file and coordinating the mounts with those done by
> > > lxc itself during lxc-restart. I honestly haven't looked at that closely
> >
> > I'd be fine with requiring some bit of hand-parsing.  But right, even
> > once we get a list of the mounts to be restored, I don't know of any
> > good way to get those mounts re-created at the right time.
> 
> Mount code is one of my old stomping grounds from back when I wrote
> the busybox mount and switch_root commands and had to learn more
> implementation details about it than I ever wanted to know. :)
> 
> I never could find a proper mount spec, and kept meaning to write one,
> but I blathered about some of the less obvious details here:
> 
>   http://www.mail-archive.com/busy...@busybox.net/msg07013.html

Bookmarked :)

> There are four top level categories of filesystem:  Block backed, ram backed,
> pipe backed (network and fuse and so on), and synthetic (sysfs, procfs,
> devtmpfs...).  And that's not counting bind mounts (which are internal
> to the VFS and not really a filesystem), and loopback devices (which are
> sort of the _opposite_ of a filesystem)...

Right, for starters handling only bind mounts would be useful.  It's feasible
for userspace to rsync the contents of tmpfs filesystems during checkpoint
and before restart - but it's harder to find the right place for the bind
mounts to get re-attached if done in userspace, because we don't want to
do it too early and risk having mount leaks (so we can't checkpoint later),
and it's hard to coordinate doing it later since someone inside the container
has to do it (unless, again, we have leaks - well, or maybe having
MNT_SHARED / for the container would suffice).

> > I suppose I could hack lxc-restart to do it.  But I'm sort of hoping we
> > can get something less hacked and more true to the 'real' upstream
> > code.
> 
> Which upstream code?

Heh, I should have said upstream-destined code.  Referring to
lxc.sf.net and the kernel at www.linux-cr.org.

> > So do you know of anyone who's been working on re-creation of mounts
> > in the kernel?  If not, what have you been doing, hand-scripting
> > all container creation, checkpoint, and restart?
> 
> I express interest in this topic.

Awesome.  Note that we've had lots of prior discussions about the
topic, it's just that we never came to a conclusion, so some fresh
experienced blood would be very helpful.

The last time I went into detail on the topic was at
http://www.mail-archive.com/devel@openvz.org/msg21418.html
while some older notes on the simpler topics are at
https://ckpt.wiki.kernel.org/index.php/Mounts

thanks,
-serge


[Devel] Re: Containers HOWTO? (Where do I start?)

2010-12-08 Thread Serge E. Hallyn
Quoting Rob Landley (rland...@parallels.com):
> But how does pivot_root enter into this when you haven't got an initrd to
> free?  I thought when you killed a container's init process that killed all

But pivot_root isn't just for initrd.  At this point I think both
libvirt-lxc and lxc.sf.net use pivot_root in favor of chroot for
creating containers.  Of course there are some stringent rules about the
pre-existing old (put) and new roots regarding sharing - you can best
see those in fs/namespace.c:pivot_root, I don't know that they're well
documented anywhere.

> the children and freed the resources, so how does pivot_root enter
> into this?  (You don't reparent existing processes, you spawn new ones,
> right?)

Right.  And you do the pivot_root only for the container, not the
whole system.  Sorry, I'm missing something about what you're saying
about killing the container.

-serge


[Devel] Re: Containers HOWTO? (Where do I start?)

2010-12-08 Thread Serge E. Hallyn
A few places to start since you want to start from the ground up:

1. man clone
2. man pivot_root
3. git clone git://git.sr71.net/~hallyn/cr_tests;
   cd cr_tests
   git co hs_exec
   vi ns_exec.c
4. http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=tree;f=Documentation/cgroups;h=8c6b3f6c41a929f8db38b51a39442387ecbd5986;hb=HEAD
5. http://www.mnis.fr/france/services/virtualisation/pdf/cr.pdf

-serge


[Devel] Re: trying to build simple checkpoint/restart recipes

2010-12-08 Thread Serge E. Hallyn
Quoting Matt Helsley (matth...@us.ibm.com):
> > So far, so good.  Note that I couldn't use upstart for my init bc upstart
> > uses inotify, which we don't yet checkpoint.  The kernel is compiled without
> 
> Interesting, I didn't know that. What does upstart use inotify for?

Dunno :)  I was quite put out though.

> > There are two issues:
> > 
> > 1. how to re-create the mounts.  Kernel doesn't do it yet.  There
> >isn't (that I know of) a clean way to hook lxc-restart to do it.
> >Comments?
> 
> It's incomplete but I think you can save the most important portions of
> a mount namespace with a simple 1-line command:
> 
> lxc-attach -n cr1 cat /proc/self/mountinfo > cr1.mountinfo
> 
> It's incomplete because:
> 
>   1. It does not adequately address cross-mount-ns bind mounts (IIRC).
> 
>   2. It won't work for nested containers (though I don't know if
>   lxc supports this already it's not *too* far fetched
>   to expect folks will ask for it in the future). We can
>   extend the hack to deal with this by making a small
>   change in sys_checkpoint but I can't see how to fix #1
>   without doing it all in-kernel anyway.

Heck, for these examples I don't mind just having a sort of dummy
fstab file which both the dummy init and restart use.

> The restoration of the mounts is not scriptable however. It involves
> parsing the mountinfo file and coordinating the mounts with those done by
> lxc itself during lxc-restart. I honestly haven't looked at that closely

I'd be fine with requiring some bit of hand-parsing.  But right, even
once we get a list of the mounts to be restored, I don't know of any
good way to get those mounts re-created at the right time.

I suppose I could hack lxc-restart to do it.  But I'm sort of hoping we
can get something less hacked and more true to the 'real' upstream
code.

> enough yet to say how pretty/ugly that'd be but it entails
> modifications to lxc-restart itself. And since #1 above would still
> be an issue I'm not sure it's worth doing it that way.

So do you know of anyone who's been working on re-creation of mounts
in the kernel?  If not, what have you been doing, hand-scripting
all container creation, checkpoint, and restart?

thanks,
-serge


[Devel] trying to build simple checkpoint/restart recipes

2010-12-07 Thread Serge E. Hallyn
What I've done so far:

created a KVM vm and installed up-to-date maverick
add-apt-repository ppa:appcr/ppa
apt-get update && apt-get dist-upgrade
apt-get install libvirt-bin lxc linux-image-2.6.34-1cr4
sed -i 's/GRUB_DEFAULT=0/GRUB_DEFAULT="Ubuntu, with Linux 2.6.34-1cr4-generic"/' /etc/default/grub
update-grub

replaced 122 with 123 (i.e. moved libvirt's default 192.168.122.x network to
192.168.123.x) in /etc/libvirt/qemu/networks/default.xml and
/var/lib/libvirt/network/default.xml
reboot

# The following should go into an upstart script shipped with the appcr
# packages, as they must be done on each boot
chmod 666 /dev/pts/ptmx
rm /dev/ptmx
ln -s /dev/pts/ptmx /dev/ptmx
mkdir -p /cgroup
mount -t cgroup cgroup /cgroup/
echo /bin/remove_dead_cgroup.sh > /cgroup/release_agent
echo 1 > /cgroup/notify_on_release
#
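
For anyone wondering what the release agent has to do: the contents of /bin/remove_dead_cgroup.sh aren't shown here, so the following is only a guess at an equivalent, written in Python.  It relies on the documented behaviour that the kernel runs the release agent with one argument, the path of the now-empty cgroup relative to the hierarchy's mount point.  The function name and the safety check are assumptions of this sketch.

```python
#!/usr/bin/env python3
"""Hypothetical stand-in for /bin/remove_dead_cgroup.sh (real contents unknown)."""
import os
import sys

CGROUP_ROOT = "/cgroup"  # mount point used in the commands above


def remove_dead_cgroup(rel_path, root=CGROUP_ROOT):
    """Remove the cgroup directory at rel_path under root.

    Returns True if the directory was removed, False otherwise.
    """
    root_norm = os.path.normpath(root)
    target = os.path.normpath(os.path.join(root_norm, rel_path.lstrip("/")))
    # Refuse to touch the hierarchy root itself or anything outside it.
    if target == root_norm or not target.startswith(root_norm + os.sep):
        return False
    try:
        # On a cgroup filesystem, rmdir succeeds once the group has no
        # tasks and no child groups left.
        os.rmdir(target)
        return True
    except OSError:
        return False


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(0 if remove_dead_cgroup(sys.argv[1]) else 1)
```

With notify_on_release set to 1 as above, the kernel would call this as, for example, "remove_dead_cgroup /cr1" once the last task in that group exits.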

cat > /etc/lxc-basic.conf << EOF
lxc.network.type=veth
lxc.network.link=virbr0
lxc.network.flags=up
EOF

lxc-create -f /etc/lxc-basic.conf -n cr1 -t ubuntu
cd /var/lib/lxc/cr1/rootfs/sbin
mv init upstart

cat > init << EOF
#!/bin/sh
rm -f /shutdown
hostname cr1

exec 0<&-
exec 0</dev/null
exec 1>&-
exec 1>nohup.out
exec 2>&-
exec 2>nohup.out

mkdir -p /tmp2
mount --bind /tmp2 /tmp

mount -a
mount -t proc proc /proc
mount -t tmpfs varrun /var/run
mkdir /var/run/network
mkdir /var/run/sshd
ifconfig eth0 192.168.123.21 up
screen -A -d -m -S console

/usr/sbin/sshd
while [ ! -f /shutdown ]; do
  sleep 4s
done
EOF

lxc-start -n cr1

(in another console)
ssh 192.168.123.21
  screen -r
  ps
  ctrl-a d
exit

lxc-freeze -n cr1
lxc-checkpoint -n cr1 -S /root/cr1.s1

So far, so good.  Note that I couldn't use upstart for my init bc upstart
uses inotify, which we don't yet checkpoint.  The kernel is compiled without
ipv6 bc that was also causing a problem (though I thought ipv6 was supported
for checkpoint?) and therefore I needed a custom libvirt package which didn't
break when ipv6 is not there.

The problem now is when attempting to restart:

lxc-stop -n cr1
lxc-restart -n cr1 -S /root/cr1.s1

There are two issues:

1. how to re-create the mounts.  Kernel doesn't do it yet.  There
   isn't (that I know of) a clean way to hook lxc-restart to do it.
   Comments?

2. likewise there *may* end up being a question of where to best hook
   the backup/snapshot and restore of filesystems.  Though for these
   examples (screen and next vncserver) it shouldn't be necessary.

I'm trying to do this using lxc-checkpoint and lxc-restart so as to
keep the instructions as simple as possible.  The hope is in the next
few weeks to have a few recipes that people can try out.  But if it's
not (cleanly) possible using lxc-restart, then I guess I can switch to
using user-cr and some container tarballs, which seems a lot easier in
the short term but less useful in the long term.

thanks,
-serge


[Devel] Re: [PATCH] user_ns: Improve the user_ns on-the-slab packaging

2010-12-07 Thread Serge E. Hallyn
Quoting Pavel Emelyanov (xe...@parallels.com):
> On 12/07/2010 05:27 PM, Serge E. Hallyn wrote:
> > Quoting Pavel Emelyanov (xe...@parallels.com):
> >> Currently on 64-bit arch the user_namespace is 2096 and when
> >> being kmalloc-ed it resides on a 4k slab wasting 2003 bytes.
> >>
> >> If we allocate a separate cache for it and reduce the hash size
> >> from 128 to 64 chains the packaging becomes *much* better - the
> > 
> > Hey Pavel,
> > 
> > I trust you've done some performance tests and found no
> > regressions with a few hundred users?
> 
> How many hundreds are you interested in? :) 128 users didn't
> reveal any regressions.

I have no good guess, would have said 500, 128 sounds good :)  So
long as actual benchmarks showed no regression within a 95%
confidence interval.

Thanks for the patch, the memory savings are impressive.

Acked-by: Serge E. Hallyn 

-serge

PS - I'm hoping to send out a version of the targeted capabilities
(based on userns) patchset later this week.


[Devel] Re: [PATCH] user_ns: Improve the user_ns on-the-slab packaging

2010-12-07 Thread Serge E. Hallyn
Quoting Pavel Emelyanov (xe...@parallels.com):
> Currently on 64-bit arch the user_namespace is 2096 and when
> being kmalloc-ed it resides on a 4k slab wasting 2003 bytes.
> 
> If we allocate a separate cache for it and reduce the hash size
> from 128 to 64 chains the packaging becomes *much* better - the

Hey Pavel,

I trust you've done some performance tests and found no
regressions with a few hundred users?

-serge


[Devel] Re: [Lxc-users] regular lxc development call?

2010-12-02 Thread Serge E. Hallyn
Quoting Daniel Lezcano (daniel.lezc...@free.fr):
> On 12/02/2010 03:21 PM, Serge E. Hallyn wrote:
> >Quoting Daniel Lezcano (daniel.lezc...@free.fr):
> >>On 11/30/2010 04:06 AM, Serge E. Hallyn wrote:
> >>>Quoting Daniel Lezcano (daniel.lezc...@free.fr):
> >>>Looks like we'll be starting small anyway, so let's just try skype.  Anyone
> >>>interested in joining, please send me your skype id.
> >>>
> >>>What is a good time?  I'll just toss thursday at 9:30am US Central time
> >>>(15:30 UTC) out there.
> >>>
> >>Ok for me.
> >>
> >>Do we begin January, 6th ?
> >I'm feeling like time is passing us by far too quickly.  I realize today is
> >thursday, and really I wouldn't mind a first call today just to get everyone
> >a sense of what everyone else is working on.  Otherwise, can we start next
> >week?  Or is december just a wash?  :(
> 
> Ok for next week.

Ok.

> Do you want me to create a google calendar event ?

That'd be great, thanks!

-serge


[Devel] Re: [Lxc-users] regular lxc development call?

2010-12-02 Thread Serge E. Hallyn
Quoting Daniel Lezcano (daniel.lezc...@free.fr):
> On 11/30/2010 04:06 AM, Serge E. Hallyn wrote:
> > Quoting Daniel Lezcano (daniel.lezc...@free.fr):
> > Looks like we'll be starting small anyway, so let's just try skype.  Anyone
> > interested in joining, please send me your skype id.
> >
> > What is a good time?  I'll just toss thursday at 9:30am US Central time
> > (15:30 UTC) out there.
> >
> 
> Ok for me.
> 
> Do we begin January, 6th ?

I'm feeling like time is passing us by far too quickly.  I realize today is
thursday, and really I wouldn't mind a first call today just to get everyone
a sense of what everyone else is working on.  Otherwise, can we start next
week?  Or is december just a wash?  :(


[Devel] Re: regular lxc development call?

2010-11-29 Thread Serge E. Hallyn
Quoting Daniel Lezcano (daniel.lezc...@free.fr):
> On 11/29/2010 03:53 PM, Serge E. Hallyn wrote:
> > Hi,
> >
> > at UDS-N we had a session on 'fine-tuning containers'.  The focus was
> > things we can do in the next few months to improve containers.  The
> > meeting proceedings can be found at
> > https://wiki.ubuntu.com/UDSProceedings/N/CloudInfrastructure#Make%20LXC%20ready%20for%20production
> >
> > We have a few work items written down at
> > https://blueprints.edge.launchpad.net/ubuntu/+spec/cloud-server-n-containers-finetune
> > The list is flexible fwiw, but we thought it might help to have a regular
> > call, perhaps every other week, to discuss work items, their design,
> > and their progress.  For some features like reboot/shutdown, I think
> > design still needs discussion.  For other things, it's more important
> > that we just discuss who's doing what and what's been done.
> >
> > Is there interest in having such a call?
> >
> 
> Yep, IMO it is a good idea.
> 
> > I suspect most of the containers work now is purely volunteer driven,
> > so a free venue seems worthwhile.  Should we do this over skype?  IRC?
> > Does someone want to set up a conference number?
> >
> 
> I don't have a conf number, if anyone has one that will be great, 
> otherwise I am fine with skype or irc.

Looks like we'll be starting small anyway, so let's just try skype.  Anyone
interested in joining, please send me your skype id.

What is a good time?  I'll just toss thursday at 9:30am US Central time
(15:30 UTC) out there.

-serge


[Devel] regular lxc development call?

2010-11-29 Thread Serge E. Hallyn
Hi,

at UDS-N we had a session on 'fine-tuning containers'.  The focus was
things we can do in the next few months to improve containers.  The
meeting proceedings can be found at
https://wiki.ubuntu.com/UDSProceedings/N/CloudInfrastructure#Make%20LXC%20ready%20for%20production

We have a few work items written down at
https://blueprints.edge.launchpad.net/ubuntu/+spec/cloud-server-n-containers-finetune
The list is flexible fwiw, but we thought it might help to have a regular
call, perhaps every other week, to discuss work items, their design,
and their progress.  For some features like reboot/shutdown, I think
design still needs discussion.  For other things, it's more important
that we just discuss who's doing what and what's been done.

Is there interest in having such a call?

I suspect most of the containers work now is purely volunteer driven,
so a free venue seems worthwhile.  Should we do this over skype?  IRC?
Does someone want to set up a conference number?

thanks,
-serge


[Devel] Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

2010-11-17 Thread Serge E. Hallyn
Quoting Tejun Heo (t...@kernel.org):
> Hello, Oren.
> 
> On 11/07/2010 10:59 PM, Oren Laadan wrote:
> > We could work to add ABIs and APIs for each and every possible piece
> > of state that affects userspace. And for each we'll argue forever
> > about the design and some time later regret that it wasn't designed
> > correctly :p
> 
> I'm sorry but in-kernel CR already looks like a major misdesign to me.

By this do you mean the very idea of having CR support in the kernel?
Or our design of it in the kernel?  Let's go back to July 2008, at the
containers mini-summit, where it was unanimously agreed upon that the
kernel was the right place (Checkpoint/Restart [CR] under
http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that
we would start by supporting a single task with no resources.  Was that
whole discussion effectively misguided, in your opinion?  Or do you
feel that since the first steps outlined in that discussion we've
either "gone too far" or strayed in the subsequent design?

-serge


[Devel] Re: [PATCH 2/2] Kconfig : default all the namespaces to 'yes'

2010-10-14 Thread Serge E. Hallyn
Quoting Andrew Morton (a...@linux-foundation.org):
> On Wed, 13 Oct 2010 09:44:30 -0500
> "Serge E. Hallyn"  wrote:
> 
> > Quoting Daniel Lezcano (daniel.lezc...@free.fr):
> > > On 10/12/2010 07:16 PM, Serge E. Hallyn wrote:
> > > >Quoting Matt Helsley (matth...@us.ibm.com):
> > > >>On Thu, Oct 07, 2010 at 03:15:33PM +0200, Daniel Lezcano wrote:
> > > >>>As the different namespaces depend on 'CONFIG_NAMESPACES', it is
> > > >>>logical to enable all the namespaces when we enable NAMESPACES.
> > > >>>
> > > >>>Signed-off-by: Daniel Lezcano
> > > >>Subject of the patch email is a little confusing as it's not
> > > >>quite what happens. I'm mostly OK with it but I'm not sure we
> > > >>should enable user-ns by default just yet.
> > > >>
> > > >>Acked-By: Matt Helsley
> > > >In fact, perhaps we should keep the experimental tag on user namespaces.
> > > 
> > > The experimental tag is kept on the user namespace. This one is
> > > defaulting to yes when the namespaces and experimental are selected.
> > 
> > Oh, sounds good
> > 
> 
> My attention flagged.  Can we please confirm that the current patch is
> still good?

Yup, the patch below only sets USER_NS=y when EXPERIMENTAL=y, which I'd
failed to notice the first time.

Acked-by: Serge Hallyn 

> From: Daniel Lezcano 
> 
> As the different namespaces depend on 'CONFIG_NAMESPACES', it is logical
> to enable all the namespaces when we enable NAMESPACES.
> 
> Signed-off-by: Daniel Lezcano 
> Cc: "Eric W. Biederman" 
> Cc: David Miller 
> Acked-By: Matt Helsley 
> Signed-off-by: Andrew Morton 
> ---
> 
>  init/Kconfig |7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff -puN init/Kconfig~namespaces-default-all-the-namespaces-to-yes-when-config_namespaces-is-selected init/Kconfig
> --- a/init/Kconfig~namespaces-default-all-the-namespaces-to-yes-when-config_namespaces-is-selected
> +++ a/init/Kconfig
> @@ -739,6 +739,7 @@ config NAMESPACES
>  config UTS_NS
>   bool "UTS namespace"
>   depends on NAMESPACES
> + default y
>   help
> In this namespace tasks see different info provided with the
> uname() system call
> @@ -746,6 +747,7 @@ config UTS_NS
>  config IPC_NS
>   bool "IPC namespace"
>   depends on NAMESPACES && (SYSVIPC || POSIX_MQUEUE)
> + default y
>   help
> In this namespace tasks work with IPC ids which correspond to
> different IPC objects in different namespaces.
> @@ -753,6 +755,7 @@ config IPC_NS
>  config USER_NS
>   bool "User namespace (EXPERIMENTAL)"
>   depends on NAMESPACES && EXPERIMENTAL
> + default y
>   help
> This allows containers, i.e. vservers, to use user namespaces
> to provide different user info for different servers.
> @@ -760,8 +763,8 @@ config USER_NS
>  
>  config PID_NS
>   bool "PID Namespaces"
> - default n
>   depends on NAMESPACES
> + default y
>   help
> Support process id namespaces.  This allows having multiple
> processes with the same pid as long as they are in different
> @@ -769,8 +772,8 @@ config PID_NS
>  
>  config NET_NS
>   bool "Network namespace"
> - default n
>   depends on NAMESPACES && NET
> + default y
>   help
> Allow user space to create what appear to be multiple instances
> of the network stack.
> _


[Devel] Re: [PATCH 2/2] Kconfig : default all the namespaces to 'yes'

2010-10-13 Thread Serge E. Hallyn
Quoting Daniel Lezcano (daniel.lezc...@free.fr):
> On 10/12/2010 07:16 PM, Serge E. Hallyn wrote:
> >Quoting Matt Helsley (matth...@us.ibm.com):
> >>On Thu, Oct 07, 2010 at 03:15:33PM +0200, Daniel Lezcano wrote:
> >>>As the different namespaces depend on 'CONFIG_NAMESPACES', it is
> >>>logical to enable all the namespaces when we enable NAMESPACES.
> >>>
> >>>Signed-off-by: Daniel Lezcano
> >>Subject of the patch email is a little confusing as it's not
> >>quite what happens. I'm mostly OK with it but I'm not sure we
> >>should enable user-ns by default just yet.
> >>
> >>Acked-By: Matt Helsley
> >In fact, perhaps we should keep the experimental tag on user namespaces.
> 
> The experimental tag is kept on the user namespace. This one is
> defaulting to yes when the namespaces and experimental are selected.

Oh, sounds good

-serge


[Devel] Re: Fwd: Re: lxc-performance?

2010-10-12 Thread Serge E. Hallyn
Quoting MALATTAR (mouhannad.alat...@univ-fcomte.fr):
> 
> 
>  Original message 
> Subject: Re: lxc-performance
> Date: Thu, 07 Oct 2010 16:56:05 +0200
> From: MALATTAR
> To: MALATTAR
> 
> 
> 
> On 07/10/2010 16:43, MALATTAR wrote:
> > 
> > 06.10.2010 23:41, MALATTAR wrote:
> >
> > > the container dora1, where I launch an instance of my IDS, does not take
> > > more than 70 MB as memory even though the memory limit for it is much
> > > bigger than this value,
> > How do you measure memory usage?
> 
> by using the command:
> lxc-cgroup -n dora1 memory.usage_in_bytes
> >   What's in memory.max_usage_in_bytes of
> > the container's cgroup?
> executing the next command lxc-cgroup -n dora1 memory.max_usage_in_bytes
> gave me 70193152 bytes

Ah, linux/Documentation/cgroups/memory.txt explains that max_usage_in_bytes
reports the maximum memory usage recorded, not the limit.  For the limit, look at
lxc-cgroup -n dora1 memory.limit_in_bytes
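
If it helps to see the three counters side by side, here is a small sketch that reads them straight from the container's memory cgroup directory.  The file names are the cgroup v1 memory controller ones from memory.txt; the directory path (something like /cgroup/dora1) depends on where the hierarchy is mounted and is an assumption, as are the function names.

```python
import os

# cgroup v1 memory controller files, keyed by what each one reports.
MEMORY_FILES = (
    ("usage", "memory.usage_in_bytes"),     # current usage
    ("peak", "memory.max_usage_in_bytes"),  # highest usage recorded so far
    ("limit", "memory.limit_in_bytes"),     # configured limit
)


def read_memory_stats(cgroup_dir):
    """Return a dict with current usage, peak usage and limit, in bytes."""
    stats = {}
    for key, name in MEMORY_FILES:
        with open(os.path.join(cgroup_dir, name)) as handle:
            stats[key] = int(handle.read().strip())
    return stats


def describe(stats):
    """Format the three values so the peak is not mistaken for the limit."""
    return "usage=%d peak=%d limit=%d (bytes)" % (
        stats["usage"], stats["peak"], stats["limit"])
```

A peak well below the limit, as in the report above, just means the container never needed more memory, not that something is capping it.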

-serge

