from:"Daniel Lezcano"

[Devel] Re: containers and cgroups mini-summit @ Linux Plumbers

2012-07-30 Thread Daniel Lezcano

On 07/11/2012 11:41 PM, Kir Kolyshkin wrote:
 Gentlemen,
 
 We are organizing containers mini-summit during next Linux Plumbers (San
 Diego, August 29-31).
 The idea is to gather and discuss everything relevant to namespaces,
 cgroups, resource management,
 checkpoint-restore and so on.
 
 We are trying to come up with a list of topics to discuss, so please
 reply with topic suggestions, and
 let me know if you are going to come.
 
 I probably forgot a few more people (such as, I am not sure who else
 from Google is working
 on cgroups stuff), so feel free to forward this to anyone you believe
 should go,
 or just let me know whom I missed.

Hi Kir,

I have a presentation for LPC and I am awaiting the approval for the
funding. If it is accepted I will be there.

One topic to address could be time virtualization.

Thanks
  -- Daniel

-- 
 http://www.linaro.org/ Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  http://www.facebook.com/pages/Linaro Facebook |
http://twitter.com/#!/linaroorg Twitter |
http://www.linaro.org/linaro-blog/ Blog

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel

[Devel] Re: [PATCH] allow a task to join a pid namespace

2012-06-05 Thread Daniel Lezcano

On 06/04/2012 03:33 PM, Glauber Costa wrote:
 Currently, it is possible for a process  to join existing
 net, uts and ipc namespaces. This patch allows a process to join an
 existing pid namespace as well.
 
 For that to remain sane, some restrictions are made in the calling process:
 
 * It needs to be in the parent namespace of the namespace it wants to jump to
 * It needs to sit in its own session and group as a leader.
 
 The rationale for that, is that people want to trigger actions in a Container
 from the outside. For instance, mainstream linux recently gained the ability
 to safely reboot a container. It would be desirable, however, that this
 action is triggered from an admin in the outside world, very much like a
 power switch in a physical box.
 
 This would also allow us to connect a console to the container, provide a
 repair mode for setups without networking (or with a broken one), etc.

Hi Glauber,

I am in favor of this patch, but I think the pidns support won't be
complete and some corner cases are not handled.

Maybe you can look at Eric's patchset [1] where, IMO, everything is
taken into account. Some of the patches may already be upstream.

Thanks
  -- Daniel

[1]
http://git.kernel.org/?p=linux/kernel/git/ebiederm/linux-namespace-control-devel.git;a=summary

[Devel] Re: [PATCH] allow a task to join a pid namespace

2012-06-05 Thread Daniel Lezcano

On 06/04/2012 06:51 PM, Oleg Nesterov wrote:
 On 06/04, Glauber Costa wrote:

 Currently, it is possible for a process  to join existing
 net, uts and ipc namespaces. This patch allows a process to join an
 existing pid namespace as well.
 
 I can't understand this patch... but probably I missed something,
 I never really understood setns.

Hi Oleg,

let me clarify why setns is needed. In the container world, setns
allows a container to be administered from outside. One good example is
shutting the container down. Users set up their hosts so that the init
services start the containers when the system boots, but they have no
way to invoke 'shutdown' inside the container when the system goes
down, other than some tricks with signals. The setns syscall with pid
namespace support will make that possible.

Complete setns support will also make it possible to write
administrative tools that give a global view of the separate resources
running in several containers.

For example, if you are the administrator of the host and you have
hundreds of containers running on it, you can use setns to run netstat
within each container and build a view of the different network stacks.
The same applies to 'ps' or 'top'.
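
A minimal sketch of the netstat case, assuming a kernel and glibc that
provide setns(2) and the /proc/<pid>/ns/net files. The helper name
run_in_netns() and the choice to identify a container by the pid of one
of its processes are assumptions made for illustration only.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run argv[0] inside the network namespace of process 'pid'.
 * Returns the exit status of the command, or -1 on error.
 * run_in_netns() is a hypothetical helper, not an existing API.
 */
int run_in_netns(pid_t pid, char *const argv[])
{
	char path[64];
	int fd, status;
	pid_t child;

	snprintf(path, sizeof(path), "/proc/%ld/ns/net", (long)pid);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	child = fork();
	if (child < 0) {
		close(fd);
		return -1;
	}
	if (child == 0) {
		/* Only the child switches; the caller keeps its own namespace. */
		if (setns(fd, CLONE_NEWNET) < 0)
			_exit(127);
		execvp(argv[0], argv);
		_exit(127);
	}

	close(fd);
	if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

Because only the network namespace is switched, the binary is taken
from the host filesystem, so the tool does not have to be installed in
the container. Calling setns() normally requires CAP_SYS_ADMIN.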

Without setns, things are much more complicated and, in some cases,
impossible. For instance, you can run a daemon inside the container,
send commands to it and redirect its output to a fifo, but that
increases the number of processes and has some limitations. It also
means the command you want to run must be present in the container's FS.

The setns syscall is badly needed for VRF, where a single process
can handle thousands of network namespaces and switch from one network
namespace to another with a single syscall. Holding the file
descriptors pins each namespace and prevents it from being destroyed
while the process switches from one namespace to another.
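
The pinning idea can be sketched as follows, assuming setns(2) is
available. The function names pin_netns(), switch_netns() and
unpin_netns() are hypothetical and only serve the illustration.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/* Open a namespace file; the open fd keeps the namespace alive. */
int pin_netns(const char *path)
{
	return open(path, O_RDONLY | O_CLOEXEC);
}

/* Move the calling thread into the pinned namespace with one syscall. */
int switch_netns(int nsfd)
{
	return setns(nsfd, CLONE_NEWNET);
}

/* Drop the pin; the namespace may go away once nothing else uses it. */
void unpin_netns(int nsfd)
{
	close(nsfd);
}
```

A VRF-like daemon would keep one such descriptor per namespace (for
example opened from /proc/<pid>/ns/net), call switch_netns() before
creating a socket, and keep using that socket afterwards: a socket
stays attached to the namespace it was created in.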

In other words, +1 for pid ns support with setns :)

  -- Daniel

Re: [Devel] Re: [PATCH] allow a task to join a pid namespace

2012-06-05 Thread Daniel Lezcano

On 06/05/2012 12:00 PM, Glauber Costa wrote:
 On 06/05/2012 01:37 PM, Glauber Costa wrote:
 On 06/05/2012 01:36 PM, Daniel Lezcano wrote:
 On 06/04/2012 03:33 PM, Glauber Costa wrote:
 Currently, it is possible for a process to join existing
 net, uts and ipc namespaces. This patch allows a process to join an
 existing pid namespace as well.

 For that to remain sane, some restrictions are made in the calling
 process:

 * It needs to be in the parent namespace of the namespace it wants to
 jump to
 * It needs to sit in its own session and group as a leader.

 The rationale for that, is that people want to trigger actions in a
 Container
 from the outside. For instance, mainstream linux recently gained the
 ability
 to safely reboot a container. It would be desirable, however, that
 this
 action is triggered from an admin in the outside world, very much
 like a
 power switch in a physical box.

 This would also allow us to connect a console to the container,
 provide a
 repair mode for setups without networking (or with a broken one), etc.

 Hi Glauber,

 I am in favor of this patch but I think the pidns support won't be
 complete and some corner-cases are not handled.

 May be you can look at Eric's patchset [1] where, IMO, everything is
 taken into account. Some of the patches may be already upstream.

 Thanks
 -- Daniel

 I don't remember seeing such patchset in the mailing lists, but that
 might be my fault, due to traffic...

 I'll take a look. If it does what I need, I can just drop this.


 Ok. In a quick look, it does not seem to go all the way. This is just
 by reading, but your reboot patch, for instance, is unlikely to work
 with that, since if it doesn't alter pid-level, things like task
 ns_of_pid won't work.

 Running the test scripts I wrote for my testing of that patch also
 doesn't seem to produce the expected result:

 after doing setns, the pid won't show up in that namespace.

Yes, AFAIR the pid won't show up; you have to do a fork-exec.
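
A minimal sketch of that fork-exec step, assuming a kernel where
setns() accepts a pid namespace descriptor (in mainline this arrived
with Linux 3.8, after this thread). spawn_in_pidns() is a hypothetical
helper name.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * 'nsfd' is an open fd on a pid namespace file (e.g. /proc/<pid>/ns/pid).
 * The caller keeps its own pid; only the forked child is created
 * inside the target namespace. Returns the child's exit status or -1.
 */
int spawn_in_pidns(int nsfd, char *const argv[])
{
	pid_t child;
	int status;

	if (setns(nsfd, CLONE_NEWPID) < 0)
		return -1;

	child = fork();
	if (child < 0)
		return -1;
	if (child == 0) {
		execvp(argv[0], argv);
		_exit(127);
	}
	if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

With this model the caller never sees its own pid change, which is the
property discussed in the thread; the price is the extra fork.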


Re: [Devel] Re: [PATCH] allow a task to join a pid namespace

2012-06-05 Thread Daniel Lezcano

On 06/05/2012 02:53 PM, Glauber Costa wrote:
 On 06/05/2012 04:52 PM, Daniel Lezcano wrote:
 On 06/05/2012 12:00 PM, Glauber Costa wrote:
 On 06/05/2012 01:37 PM, Glauber Costa wrote:
 On 06/05/2012 01:36 PM, Daniel Lezcano wrote:
 On 06/04/2012 03:33 PM, Glauber Costa wrote:
 Currently, it is possible for a process to join existing
 net, uts and ipc namespaces. This patch allows a process to join an
 existing pid namespace as well.

 For that to remain sane, some restrictions are made in the calling
 process:

 * It needs to be in the parent namespace of the namespace it
 wants to
 jump to
 * It needs to sit in its own session and group as a leader.

 The rationale for that, is that people want to trigger actions in a
 Container
 from the outside. For instance, mainstream linux recently gained the
 ability
 to safely reboot a container. It would be desirable, however, that
 this
 action is triggered from an admin in the outside world, very much
 like a
 power switch in a physical box.

 This would also allow us to connect a console to the container,
 provide a
 repair mode for setups without networking (or with a broken one),
 etc.

 Hi Glauber,

 I am in favor of this patch but I think the pidns support won't be
 complete and some corner-cases are not handled.

 May be you can look at Eric's patchset [1] where, IMO, everything is
 taken into account. Some of the patches may be already upstream.

 Thanks
 -- Daniel

 I don't remember seeing such patchset in the mailing lists, but that
 might be my fault, due to traffic...

 I'll take a look. If it does what I need, I can just drop this.


 Ok. In a quick look, it does not seem to go all the way. This is just
 by reading, but your reboot patch, for instance, is unlikely to work
 with that, since if it doesn't alter pid-level, things like task
 ns_of_pid won't work.

 Running the test scripts I wrote for my testing of that patch also
 doesn't seem to produce the expected result:

 after doing setns, the pid won't show up in that namespace.

 Yes, AFAIR, pid won't show up, you have to do fork-exec.

 Ah, so you mean the kid will show up... Well, ok.

 That's acceptable, but how about the behavior I am proposing ? (in the
 patch I sent as a reply to this thread).

Let me look at the patch closely.

 I believe it to be saner, even though there is a price tag attached to
 it. None of the other setns calls require you to do any such trickery...

Yeah, but the pidns is different from the other namespaces: it is not
supposed to be unshared.
I remember we had a discussion about this and Eric had some good reasons
to do it this way. One of them is the pid cached by glibc. Also, we
don't want our pid to change inside our application.

You may find more information in [1].

Thanks
  -- Daniel

[1] http://thread.gmane.org/gmane.linux.network/153200

[Devel] Re: [PATCH 1/4] userns: let clone_uts_ns() handle setting uts->user_ns

2011-02-24 Thread Daniel Lezcano

On 02/24/2011 01:21 AM, Serge E. Hallyn wrote:
 To do so we need to pass in the task_struct who'll get the utsname,
 so we can get its user_ns.

 Changelog:
   Feb 23: As per Oleg's comment, just pass in tsk.

 Signed-off-by: Serge E. Hallyn <serge.hal...@canonical.com>

Acked-by: Daniel Lezcano <daniel.lezc...@free.fr>

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

[Devel] Re: [PATCH 2/4] userns: let copy_ipcs handle setting ipc_ns->user_ns

2011-02-24 Thread Daniel Lezcano

On 02/24/2011 01:22 AM, Serge E. Hallyn wrote:
 To do that, we have to pass in the task_struct of the task which
 will own the ipc_ns, so we can assign its user_ns.

 Changelog:
   Feb 23: As per Oleg comment, just pass in tsk.  To get the
   ipc_ns from the nsproxy we need to include nsproxy.h

 Signed-off-by: Serge E. Hallyn <serge.hal...@canonical.com>
 ---

Acked-by: Daniel Lezcano <daniel.lezc...@free.fr>

[Devel] Re: [PATCH 5/4] Clean up capability.h and capability.c

2011-02-24 Thread Daniel Lezcano

On 02/24/2011 01:22 AM, Serge E. Hallyn wrote:
 Convert macros to functions to let type safety do its thing.  Switch
 some functions from ints to more appropriate bool.  Move all forward
 declarations together to top of the #ifdef __KERNEL__ section.  Use
 kernel-doc format for comments.

 Some macros couldn't be converted because they use functions from
 security.h which sometimes are extern and sometimes static inline,
 and we don't want to #include security.h in capability.h.

 Also add a real current_user_ns function (and convert the existing
 macro to _current_user_ns() so we can use it in capability.h
 without #including cred.h.

 Signed-off-by: Serge E. Hallyn <serge.hal...@canonical.com>
 ---

Acked-by: Daniel Lezcano <daniel.lezc...@free.fr>

[Devel] Re: [PATCH 1/4] userns: let clone_uts_ns() handle setting uts->user_ns

2011-02-21 Thread Daniel Lezcano

On 02/21/2011 05:01 AM, Serge E. Hallyn wrote:
 To do so we need to pass in the task_struct who'll get the utsname,
 so we can get its user_ns.

 Signed-off-by: Serge E. Hallyn <serge.hal...@canonical.com>
 ---


   include/linux/utsname.h |   10 ++
   kernel/nsproxy.c        |    7 +--
   kernel/utsname.c        |   12 +++-
   3 files changed, 14 insertions(+), 15 deletions(-)

 diff --git a/include/linux/utsname.h b/include/linux/utsname.h
 index 85171be..165b17b 100644
 --- a/include/linux/utsname.h
 +++ b/include/linux/utsname.h
 @@ -52,8 +52,9 @@ static inline void get_uts_ns(struct uts_namespace *ns)
   kref_get(&ns->kref);
   }

 -extern struct uts_namespace *copy_utsname(unsigned long flags,
 - struct uts_namespace *ns);
 +extern struct uts_namespace *copy_utsname(struct task_struct *tsk,
 +   unsigned long flags,
 +   struct uts_namespace *ns);

Why don't we pass 'user_ns' instead of 'tsk'? That would be
semantically clearer for the caller, no?
(example below).

   extern void free_uts_ns(struct kref *kref);

   static inline void put_uts_ns(struct uts_namespace *ns)
 @@ -69,8 +70,9 @@ static inline void put_uts_ns(struct uts_namespace *ns)
   {
   }

 -static inline struct uts_namespace *copy_utsname(unsigned long flags,
 - struct uts_namespace *ns)
 +static inline struct uts_namespace *copy_utsname(struct task_struct *tsk,
 +  unsigned long flags,
 +  struct uts_namespace *ns)
   {
   if (flags & CLONE_NEWUTS)
   return ERR_PTR(-EINVAL);
 diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
 index b6dbff2..ffa6b67 100644
 --- a/kernel/nsproxy.c
 +++ b/kernel/nsproxy.c
 @@ -69,16 +69,11 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
   goto out_ns;
   }

 - new_nsp->uts_ns = copy_utsname(flags, tsk->nsproxy->uts_ns);
 + new_nsp->uts_ns = copy_utsname(tsk, flags, tsk->nsproxy->uts_ns);
   if (IS_ERR(new_nsp->uts_ns)) {
   err = PTR_ERR(new_nsp->uts_ns);
   goto out_uts;
   }

...

new_nsp->uts_ns = copy_utsname(flags, tsk->nsproxy->uts_ns,
                               task_cred_xxx(tsk, user)->user_ns);

...

 - if (new_nsp->uts_ns != tsk->nsproxy->uts_ns) {
 - put_user_ns(new_nsp->uts_ns->user_ns);
 - new_nsp->uts_ns->user_ns = task_cred_xxx(tsk, user)->user_ns;
 - get_user_ns(new_nsp->uts_ns->user_ns);
 - }

   new_nsp->ipc_ns = copy_ipcs(flags, tsk->nsproxy->ipc_ns);
   if (IS_ERR(new_nsp->ipc_ns)) {
 diff --git a/kernel/utsname.c b/kernel/utsname.c
 index a7b3a8d..9462580 100644
 --- a/kernel/utsname.c
 +++ b/kernel/utsname.c
 @@ -31,7 +31,8 @@ static struct uts_namespace *create_uts_ns(void)
* @old_ns: namespace to clone
* Return NULL on error (failure to kmalloc), new ns otherwise
*/
 -static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns)
 +static struct uts_namespace *clone_uts_ns(struct task_struct *tsk,
 +   struct uts_namespace *old_ns)
   {
   struct uts_namespace *ns;

 @@ -41,8 +42,7 @@ static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns)

   down_read(&uts_sem);
   memcpy(&ns->name, &old_ns->name, sizeof(ns->name));
 - ns->user_ns = old_ns->user_ns;
 - get_user_ns(ns->user_ns);
 + ns->user_ns = get_user_ns(task_cred_xxx(tsk, user)->user_ns);
   up_read(&uts_sem);
   return ns;
   }
 @@ -53,7 +53,9 @@ static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns)
* utsname of this process won't be seen by parent, and vice
* versa.
*/
 -struct uts_namespace *copy_utsname(unsigned long flags, struct uts_namespace *old_ns)
 +struct uts_namespace *copy_utsname(struct task_struct *tsk,
 +unsigned long flags,
 +struct uts_namespace *old_ns)
   {
   struct uts_namespace *new_ns;

 @@ -63,7 +65,7 @@ struct uts_namespace *copy_utsname(unsigned long flags, struct uts_namespace *ol
   if (!(flags & CLONE_NEWUTS))
   return old_ns;

 - new_ns = clone_uts_ns(old_ns);
 + new_ns = clone_uts_ns(tsk, old_ns);

   put_uts_ns(old_ns);
   return new_ns;

[Devel] Re: [PATCH 2/4] userns: let copy_ipcs handle setting ipc_ns->user_ns

2011-02-21 Thread Daniel Lezcano

On 02/21/2011 05:02 AM, Serge E. Hallyn wrote:
 To do that, we have to pass in the task_struct of the task which
 will own the ipc_ns, so we can assign its user_ns.

 Signed-off-by: Serge E. Hallyn <serge.hal...@canonical.com>
 ---
   include/linux/ipc_namespace.h |8 +---
   ipc/namespace.c   |   12 +++-
   kernel/nsproxy.c  |7 +--
   3 files changed, 13 insertions(+), 14 deletions(-)

 diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
 index 46d2eb4..9974429 100644
 --- a/include/linux/ipc_namespace.h
 +++ b/include/linux/ipc_namespace.h
 @@ -92,7 +92,8 @@ static inline int mq_init_ns(struct ipc_namespace *ns) { return 0; }
   #endif

   #if defined(CONFIG_IPC_NS)
 -extern struct ipc_namespace *copy_ipcs(unsigned long flags,
 +extern struct ipc_namespace *copy_ipcs(struct task_struct *tsk,
 +unsigned long flags,
  struct ipc_namespace *ns);

Same comment as for patch 1/4.

[Devel] Re: [PATCH 1/4] userns: let clone_uts_ns() handle setting uts->user_ns

2011-02-21 Thread Daniel Lezcano

On 02/21/2011 02:58 PM, Serge E. Hallyn wrote:
 Quoting Oleg Nesterov (o...@redhat.com):
 On 02/21, Daniel Lezcano wrote:
 On 02/21/2011 05:01 AM, Serge E. Hallyn wrote:
 To do so we need to pass in the task_struct who'll get the utsname,
 so we can get its user_ns.

 -extern struct uts_namespace *copy_utsname(unsigned long flags,
 -  struct uts_namespace *ns);
 +extern struct uts_namespace *copy_utsname(struct task_struct *tsk,
 +unsigned long flags,
 +struct uts_namespace *ns);
 Why don't we pass 'user_ns' instead of 'tsk' ? that will look
 semantically clearer for the caller no ?
 (example below).
 ...

 new_nsp->uts_ns = copy_utsname(flags, tsk->nsproxy->uts_ns,
 task_cred_xxx(tsk, user)->user_ns);
 To me tsk looks more readable, I mean

  new_nsp->uts_ns = copy_utsname(flags, tsk);

 copy_utsname() can find both uts_ns and user_ns looking at task_struct.
 Uh, yeah.  I should remove the 'ns' argument there shouldn't I.

 Daniel, does that sway your opinion then?

Well, I prefer to pass a function only the parameters it needs. AFAICS,
what is really needed here is 'user_ns', not 'tsk'.
But it is a detail, so if passing the tsk parameter to the other copy_*
functions helps the cleanup, that will be consistent.
So I am fine with that.

[Devel] Re: [PATCH 4/9] allow killing tasks in your own or child userns

2011-02-19 Thread Daniel Lezcano

On 02/17/2011 04:03 PM, Serge E. Hallyn wrote:
 Changelog:
   Dec  8: Fixed bug in my check_kill_permission pointed out by
   Eric Biederman.
   Dec 13: Apply Eric's suggestion to pass target task into 
 kill_ok_by_cred()
   for clarity
   Dec 31: address comment by Eric Biederman:
   don't need cred/tcred in check_kill_permission.
   Jan  1: use const cred struct.
   Jan 11: Per Bastian Blank's advice, clean up kill_ok_by_cred().
   Feb 16: kill_ok_by_cred: fix bad parentheses

 Signed-off-by: Serge E. Hallyn <serge.hal...@canonical.com>

Acked-by: Daniel Lezcano <daniel.lezc...@free.fr>

[Devel] Re: [PATCH 5/9] Allow ptrace from non-init user namespaces

2011-02-19 Thread Daniel Lezcano

On 02/17/2011 04:03 PM, Serge E. Hallyn wrote:
 ptrace is allowed to tasks in the same user namespace according to
 the usual rules (i.e. the same rules as for two tasks in the init
 user namespace).  ptrace is also allowed to a user namespace to
 which the current task the has CAP_SYS_PTRACE capability.

 Changelog:
   Dec 31: Address feedback by Eric:
   . Correct ptrace uid check
   . Rename may_ptrace_ns to ptrace_capable
   . Also fix the cap_ptrace checks.
   Jan  1: Use const cred struct
   Jan 11: use task_ns_capable() in place of ptrace_capable().

 Signed-off-by: Serge E. Hallyn <serge.hal...@canonical.com>
Acked-by: Daniel Lezcano <daniel.lezc...@free.fr>

[Devel] Re: [PATCH 6/9] user namespaces: convert all capable checks in kernel/sys.c

2011-02-19 Thread Daniel Lezcano

On 02/17/2011 04:03 PM, Serge E. Hallyn wrote:
 This allows setuid/setgid in containers.  It also fixes some
 corner cases where kernel logic foregoes capability checks when
 uids are equivalent.  The latter will need to be done throughout
 the whole kernel.

 Changelog:
   Jan 11: Use nsown_capable() as suggested by Bastian Blank.
   Jan 11: Fix logic errors in uid checks pointed out by Bastian.
   Feb 15: allow prlimit to current (was regression in previous version)

 Signed-off-by: Serge E. Hallyn <serge.hal...@canonical.com>

Acked-by: Daniel Lezcano <daniel.lezc...@free.fr>


 - if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN))
 + if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN)) {
 + printk(KERN_NOTICE "%s: did not have CAP_SYS_ADMIN\n", __func__);
   return -EPERM;
 + }
 + printk(KERN_NOTICE "%s: did have CAP_SYS_ADMIN\n", __func__);

A couple of leftover debug printks here.


[Devel] Re: [PATCH 7/9] add a user namespace owner of ipc ns

2011-02-19 Thread Daniel Lezcano

On 02/17/2011 04:03 PM, Serge E. Hallyn wrote:
 Changelog:
   Feb 15: Don't set new ipc->user_ns if we didn't create a new
   ipc_ns.

 Signed-off-by: Serge E. Hallyn <serge.hal...@canonical.com>
 ---

Acked-by: Daniel Lezcano <daniel.lezc...@free.fr>


[ ... ]

 + ns->user_ns = old_ns->user_ns;
 + get_user_ns(ns->user_ns);

A trivial nit, this could be written as:

ns->user_ns = get_user_ns(old_ns->user_ns);
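
The one-line form works because get_user_ns() takes a reference and
returns the pointer it was given. A user-space sketch of the idiom,
where the structures are simplified stand-ins (a plain counter instead
of a kref) and the inherit_owner_*() names are hypothetical:

```c
/* User-space stand-ins for the kernel types, for illustration only. */
struct user_namespace {
	int refcount;
};

struct ipc_ns_sketch {
	struct user_namespace *user_ns;
};

/* Takes a reference and hands the same pointer back to the caller. */
struct user_namespace *get_user_ns(struct user_namespace *ns)
{
	if (ns)
		ns->refcount++;
	return ns;
}

/* Two-statement form, as in the quoted patch. */
void inherit_owner_long(struct ipc_ns_sketch *ns,
			const struct ipc_ns_sketch *old_ns)
{
	ns->user_ns = old_ns->user_ns;
	get_user_ns(ns->user_ns);
}

/* One-statement form, as suggested in the review. */
void inherit_owner_short(struct ipc_ns_sketch *ns,
			 const struct ipc_ns_sketch *old_ns)
{
	ns->user_ns = get_user_ns(old_ns->user_ns);
}
```

Both forms leave the same pointer in place and take exactly one
reference; the second simply keeps the assignment and the reference
together.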


[Devel] Re: [PATCH 8/9] user namespaces: convert several capable() calls

2011-02-19 Thread Daniel Lezcano

On 02/17/2011 04:03 PM, Serge E. Hallyn wrote:
 CAP_IPC_OWNER and CAP_IPC_LOCK can be checked against current_user_ns(),
 because the resource comes from current's own ipc namespace.

 setuid/setgid are to uids in own namespace, so again checks can be
 against current_user_ns().

 Changelog:
   Jan 11: Use task_ns_capable() in place of sched_capable().
   Jan 11: Use nsown_capable() as suggested by Bastian Blank.
   Jan 11: Clarify (hopefully) some logic in futex and sched.c
   Feb 15: use ns_capable for ipc, not nsown_capable

 Signed-off-by: Serge E. Hallyn <serge.hal...@canonical.com>
 ---

Acked-by: Daniel Lezcano <daniel.lezc...@free.fr>

[Devel] Re: [PATCH 9/9] userns: check user namespace for task->file uid equivalence checks

2011-02-19 Thread Daniel Lezcano

On 02/17/2011 04:04 PM, Serge E. Hallyn wrote:
 Cheat for now and say all files belong to init_user_ns.  Next
 step will be to let superblocks belong to a user_ns, and derive
 inode_userns(inode) from inode->i_sb->s_user_ns.  Finally we'll
 introduce more flexible arrangements.

 Changelog:
   Feb 15: make is_owner_or_cap take const struct inode

 Signed-off-by: Serge E. Hallyn <serge.hal...@canonical.com>
 ---

Acked-by: Daniel Lezcano <daniel.lezc...@free.fr>

[Devel] Re: [PATCH 1/9] Add a user_namespace as creator/owner of uts_namespace

2011-02-18 Thread Daniel Lezcano

On 02/17/2011 04:02 PM, Serge E. Hallyn wrote:
 copy_process() handles CLONE_NEWUSER before the rest of the
 namespaces.  So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS)
 the new uts namespace will have the new user namespace as its
 owner.  That is what we want, since we want root in that new
 userns to be able to have privilege over it.

 Changelog:
   Feb 15: don't set uts_ns->user_ns if we didn't create
   a new uts_ns.

 Signed-off-by: Serge E. Hallyn <serge.hal...@canonical.com>

Acked-by: Daniel Lezcano <daniel.lezc...@free.fr>

A couple of comments.

 ---
   include/linux/utsname.h |3 +++
   init/version.c  |2 ++
   kernel/nsproxy.c|5 +
   kernel/user.c   |8 ++--
   kernel/utsname.c|4 
   5 files changed, 20 insertions(+), 2 deletions(-)

 diff --git a/include/linux/utsname.h b/include/linux/utsname.h
 index 69f3997..85171be 100644
 --- a/include/linux/utsname.h
 +++ b/include/linux/utsname.h
 @@ -37,9 +37,12 @@ struct new_utsname {
   #include <linux/nsproxy.h>
   #include <linux/err.h>

 +struct user_namespace;
 +
   struct uts_namespace {
   struct kref kref;
   struct new_utsname name;
 + struct user_namespace *user_ns;
   };
   extern struct uts_namespace init_uts_ns;

 diff --git a/init/version.c b/init/version.c
 index adff586..97bb86f 100644
 --- a/init/version.c
 +++ b/init/version.c
 @@ -21,6 +21,7 @@ extern int version_string(LINUX_VERSION_CODE);
   int version_string(LINUX_VERSION_CODE);
   #endif

 +extern struct user_namespace init_user_ns;
   struct uts_namespace init_uts_ns = {
   .kref = {
   .refcount   = ATOMIC_INIT(2),
 @@ -33,6 +34,7 @@ struct uts_namespace init_uts_ns = {
   .machine= UTS_MACHINE,
   .domainname = UTS_DOMAINNAME,
   },
 + .user_ns = &init_user_ns,
   };
   EXPORT_SYMBOL_GPL(init_uts_ns);

 diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
 index f74e6c0..034dc2e 100644
 --- a/kernel/nsproxy.c
 +++ b/kernel/nsproxy.c
 @@ -74,6 +74,11 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
   err = PTR_ERR(new_nsp->uts_ns);
   goto out_uts;
   }
 + if (new_nsp->uts_ns != tsk->nsproxy->uts_ns) {
 + put_user_ns(new_nsp->uts_ns->user_ns);
 + new_nsp->uts_ns->user_ns = task_cred_xxx(tsk, user)->user_ns;
 + get_user_ns(new_nsp->uts_ns->user_ns);
 + }

IMO you should add a comment saying that this code assumes
create_user_ns was called before (via copy_creds).


   new_nsp->ipc_ns = copy_ipcs(flags, tsk->nsproxy->ipc_ns);
   if (IS_ERR(new_nsp->ipc_ns)) {

[ ... ]

   static struct uts_namespace *create_uts_ns(void)
   {
 @@ -40,6 +41,8 @@ static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns)

   down_read(&uts_sem);
   memcpy(&ns->name, &old_ns->name, sizeof(ns->name));
 + ns->user_ns = old_ns->user_ns;
 + get_user_ns(ns->user_ns);

ns->user_ns = get_user_ns(old_ns->user_ns);

[Devel] Re: [PATCH 2/9] security: Make capabilities relative to the user namespace.

2011-02-18 Thread Daniel Lezcano

On 02/17/2011 04:03 PM, Serge E. Hallyn wrote:
 - Introduce ns_capable to test for a capability in a non-default
user namespace.
 - Teach cap_capable to handle capabilities in a non-default
user namespace.

 The motivation is to get to the unprivileged creation of new
 namespaces.  It looks like this gets us 90% of the way there, with
 only potential uid confusion issues left.

 I still need to handle getting all caps after creation but otherwise I
 think I have a good starter patch that achieves all of your goals.

 Changelog:
   11/05/2010: [serge] add apparmor
   12/14/2010: [serge] fix capabilities to created user namespaces
   Without this, if user serge creates a user_ns, he won't have
   capabilities to the user_ns he created.  This is because we
   were first checking whether his effective caps had the caps
   he needed and returning -EPERM if not, and THEN checking whether
   he was the creator.  Reverse those checks.
   12/16/2010: [serge] security_real_capable needs ns argument in 
 !security case
   01/11/2011: [serge] add task_ns_capable helper
   01/11/2011: [serge] add nsown_capable() helper per Bastian Blank 
 suggestion
   02/16/2011: [serge] fix a logic bug: the root user is always creator of
   init_user_ns, but should not always have capabilities to
   it!  Fix the check in cap_capable().

 Signed-off-by: Eric W. Biederman <ebied...@xmission.com>
 Signed-off-by: Serge E. Hallyn <serge.hal...@canonical.com>
 ---

Acked-by: Daniel Lezcano <daniel.lezc...@free.fr>

[Devel] Re: [PATCH 3/9] allow sethostname in a container

2011-02-18 Thread Daniel Lezcano

On 02/17/2011 04:03 PM, Serge E. Hallyn wrote:
 Signed-off-by: Serge E. Hallyn <serge.hal...@canonical.com>
 ---
Acked-by: Daniel Lezcano <daniel.lezc...@free.fr>

[Devel] Re: [PATCH 1/2] pidns: Don't allow new pids after the namespace is dead.

2011-02-17 Thread Daniel Lezcano

On 02/15/2011 07:30 PM, Oleg Nesterov wrote:
 On 02/15, Daniel Lezcano wrote:
 In the case of unsharing or joining a pid namespace, it becomes
 possible to attempt to allocate a pid after zap_pid_namespace has
 killed everything in the namespace.  Close the hole for now by simply
 not allowing any of those pid allocations to succeed.
 Daniel, please explain more. It seems, a long ago I knew the reason
 for this patch, but now I can't recall and can't understand this change.

The idea behind unsharing the pid namespace is that the current pid is
not mapped in the newly created pid namespace and appears there as
pid 0. When the process forks, the child becomes the init process of
the new pid namespace. When this pid namespace dies because its init
exited, the parent process (aka pid 0) can no longer fork because the
pid namespace is flagged dead. This is what this patch does.

The next patch allows a single process to spawn different processes in
different pid namespaces. You can argue we can already do that with
clone(CLONE_NEWPID). That's true. But once we are able to unshare the
pid namespace, the next patchset (which will come right after this one)
will allow attaching a process to a namespace, and the implementation
will be very simple and consistent with attaching to any other namespace.

 --- a/include/linux/pid_namespace.h
 +++ b/include/linux/pid_namespace.h
 @@ -20,6 +20,7 @@ struct pid_namespace {
  struct kref kref;
  struct pidmap pidmap[PIDMAP_ENTRIES];
  int last_pid;
 +atomic_t dead;
 Why atomic_t? It is used as a plain boolean.

 And I can't unde

I think Eric used an atomic because the flag is accessed without a lock
by both alloc_pid and zap_pid_ns_processes.

 --- a/kernel/pid.c
 +++ b/kernel/pid.c
 @@ -282,6 +282,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
  struct pid_namespace *tmp;
  struct upid *upid;

 +pid = NULL;
 +if (atomic_read(&ns->dead))
 +goto out;
 +
 So why this is needed?

 If we see ns->dead != 0 we are already killed by zap_pid_ns_processes()
 which sets ns->dead = 1.

The current process unshares the pid namespace.
When it forks, the child process becomes pid 1 of the new namespace.
When that child exits, zap_pid_ns_processes is called and tags the pid
namespace as dead. From then on the current process can no longer fork.
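
The sequence described here can be probed from userspace. What follows is
only a sketch, assuming a kernel that accepts CLONE_NEWPID in unshare()
and a caller with CAP_SYS_ADMIN; the helper name first_child_is_init() is
made up for this illustration and is not part of the patch.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): unshare the pid
 * namespace, fork once and report whether the child saw itself as
 * pid 1.
 *
 * Returns 1 if the first child was pid 1, 0 if it was not, and -1 if
 * unshare(), fork() or waitpid() failed (unshare() typically fails
 * with EPERM when the caller lacks CAP_SYS_ADMIN).
 */
static int first_child_is_init(void)
{
	pid_t child;
	int status;

	if (unshare(CLONE_NEWPID))
		return -1;

	child = fork();
	if (child < 0)
		return -1;

	if (child == 0)
		/* Runs in the new namespace: exit 0 only if we are pid 1. */
		_exit(getpid() == 1 ? 0 : 1);

	if (waitpid(child, &status, 0) < 0)
		return -1;

	if (!WIFEXITED(status))
		return -1;

	return WEXITSTATUS(status) == 0;
}
```

Note that the caller itself keeps its original pid; only the child is
created inside the new namespace.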

   -- Daniel

[Devel] Re: [PATCH 2/2] pidns: Support unsharing the pid namespace.

2011-02-17 Thread Daniel Lezcano

On 02/15/2011 08:01 PM, Oleg Nesterov wrote:
 On 02/15, Daniel Lezcano wrote:
 - Pass both nsproxy->pid_ns and task_active_pid_ns to copy_pid_ns
As they can now be different.
 But since they can be different we have to convert some users of
 current->nsproxy first? But that patch was dropped.

 Unsharing of the pid namespace unlike unsharing of other namespaces
 does not take effect immediately.  Instead it affects the children
 created with fork and clone.
 IOW, unshare(CLONE_NEWPID) implicitly affects the subsequent fork(),
 using the very subtle way.

 I have to admit, I can't say I like this very much. OK, if we need
 this, can't we just put something into, say, signal->flags so that
 copy_process can check and create the new namespace.

 Also. I remember, I already saw something like this and google found
 my questions. I didn't actually read the new version, perhaps my
 concerns were already answered...

   But what if the task T does unshare(CLONE_NEWPID) and then, say,
   pthread_create() ? Unless I missed something, the new thread won't
   be able to see T ?

Right. Is it really a problem? It is a weird use case that leads to a
weird situation, and I suppose we can get the same weird combination
with clone. IMHO, userspace is responsible for how it uses the
syscalls. As long as the system stays safe, everything is ok, no?

   and, in this case the exiting sub-namespace init also kills its
   parent?

I don't think so because the zap_pid_ns_processes does not hit the 
parent process when it browses the pidmap.

I tried the following program without problem:

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <pthread.h>
#include <unistd.h>

/* Thread body: print the pid as seen by the new thread. */
void *routine(void *data)
{
	printf("pid %d!\n", getpid());
	return NULL;
}

int main(int argc, char *argv[])
{
	pthread_t t;

	if (unshare(CLONE_NEWPID)) {
		perror("unshare");
		return -1;
	}

	if (pthread_create(&t, NULL, routine, NULL)) {
		perror("pthread_create");
		return -1;
	}

	if (pthread_join(t, NULL)) {
		perror("pthread_join");
		return -1;
	}

	printf("joined\n");

	return 0;
}

   OK, suppose it does fork() after unshare(), then another fork().
   In this case the second child lives in the same namespace with
   init created by the 1st fork, but it is not descendant ? This means
   in particular that if the new init exits, zap_pid_ns_processes()->
   do_wait() can't work.

Hmm, good question. IMO, we should prevent such a case for now in the same
way we added the 'dead' flag, IOW by adding a 'busy' flag for example.


[Devel] [PATCH 1/2] pidns: Don't allow new pids after the namespace is dead.

2011-02-16 Thread Daniel Lezcano

From: Eric W. Biederman ebied...@xmission.com

In the case of unsharing or joining a pid namespace, it becomes
possible to attempt to allocate a pid after zap_pid_namespace has
killed everything in the namespace.  Close the hole for now by simply
not allowing any of those pid allocations to succeed.  At least for
now it is too strange to think about.

Signed-off-by: Eric W. Biederman ebied...@xmission.com
Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
---
 include/linux/pid_namespace.h |    1 +
 kernel/pid.c                  |    4 ++++
 kernel/pid_namespace.c        |    2 ++
 3 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 38d1032..b447d37 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -20,6 +20,7 @@ struct pid_namespace {
 	struct kref kref;
 	struct pidmap pidmap[PIDMAP_ENTRIES];
 	int last_pid;
+	atomic_t dead;
 	struct task_struct *child_reaper;
 	struct kmem_cache *pid_cachep;
 	unsigned int level;
diff --git a/kernel/pid.c b/kernel/pid.c
index 39b65b6..e996950 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -282,6 +282,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	struct pid_namespace *tmp;
 	struct upid *upid;
 
+	pid = NULL;
+	if (atomic_read(&ns->dead))
+		goto out;
+
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid)
 		goto out;
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index e9c9adc..e8ea25d 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -90,6 +90,7 @@ static struct pid_namespace *create_pid_namespace(struct pid_namespace *parent_p
 	kref_init(&ns->kref);
 	ns->level = level;
 	ns->parent = get_pid_ns(parent_pid_ns);
+	atomic_set(&ns->dead, 0);
 
 	set_bit(0, ns->pidmap[0].page);
 	atomic_set(&ns->pidmap[0].nr_free, BITS_PER_PAGE - 1);
@@ -164,6 +165,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	 *
 	 */
 	read_lock(&tasklist_lock);
+	atomic_set(&pid_ns->dead, 1);
 	nr = next_pidmap(pid_ns, 1);
 	while (nr > 0) {
 		rcu_read_lock();
-- 
1.7.1
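
For readers who want to see the effect of the 'dead' flag in isolation,
here is a small userspace model of the patch. It is only a sketch: the
model_* names and the struct layout are assumptions made for this
illustration, C11 atomics stand in for the kernel's atomic_t, and pid
allocation is reduced to a counter.

```c
#include <stdatomic.h>

/* Simplified stand-in for the kernel's struct pid_namespace. */
struct pid_namespace_model {
	int last_pid;
	atomic_int dead;	/* 0 while alive, 1 once init has exited */
};

/* Models create_pid_namespace(): a fresh namespace starts alive. */
static void model_init_ns(struct pid_namespace_model *ns)
{
	ns->last_pid = 0;
	atomic_init(&ns->dead, 0);
}

/* Models zap_pid_ns_processes(): flag the namespace as dead. */
static void model_zap_ns(struct pid_namespace_model *ns)
{
	atomic_store(&ns->dead, 1);
}

/*
 * Models the check added to alloc_pid(): returns the new pid number,
 * or -1 when the namespace is dead and no pid may be allocated.
 */
static int model_alloc_pid(struct pid_namespace_model *ns)
{
	if (atomic_load(&ns->dead))
		return -1;
	return ++ns->last_pid;
}
```

Once model_zap_ns() has run, every later model_alloc_pid() call fails,
which mirrors the parent (pid 0) no longer being able to fork.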


[Devel] [PATCH 2/2] pidns: Support unsharing the pid namespace.

2011-02-16 Thread Daniel Lezcano

From: Eric W. Biederman ebied...@xmission.com

- Allow CLONE_NEWPID into unshare.
- Pass both nsproxy->pid_ns and task_active_pid_ns to copy_pid_ns
  As they can now be different.

Unsharing of the pid namespace unlike unsharing of other namespaces
does not take effect immediately.  Instead it affects the children
created with fork and clone.  The first of these children becomes the init
process of the new pid namespace, the rest become oddball children
of pid 0.  From the point of view of the new pid namespace the process
that created it is pid 0, as its pid does not map.

A couple of different semantics were considered but this one was
settled on because it is easy to implement and it is usable from
pam modules, the core reason for the existence of unshare.

I took a survey of the callers of pam modules and the following
appears to be a representative sample of their logic.
{
setup stuff include pam
child = fork();
if (!child) {
setuid()
exec /bin/bash
}
waitpid(child);

pam and other cleanup
}

As you can see there is a fork to create the unprivileged user
space process, which means that the unprivileged user space
process will appear as pid 1 in the new pid namespace.  Furthermore,
most login processes do not cope with extraneous children, which
means shifting the duty of reaping extraneous child processes to
the creator of those extraneous children makes the system more
comprehensible.

The practical reason for this set of pid namespace semantics is
that it is simple to implement and verify they work correctly.
Whereas an implementation that requires changing the struct
pid on a process comes with a lot more races and pain.  Not
the least of which is that glibc caches getpid().

These semantics are implemented by having two notions
of the pid namespace of a process.  There is task_active_pid_ns,
which is the pid namespace the process was created with
and the pid namespace that all pids are presented to
that process in.  The task_active_pid_ns is stored
in the struct pid of the task.

There is also the pid namespace that will be used for children;
that pid namespace is stored in task->nsproxy->pid_ns.

There is one really nasty corner case in all of this.  Which
pid namespace are you in if your parent unshared its pid
namespace and then on clone you also unshare the pid namespace?
To me there are only two possible answers.  Either the case
is so bizarre that we deny it completely, or the new pid
namespace is a descendant of our parent's active pid namespace,
and we ignore the task->nsproxy->pid_ns.

To that end I have modified copy_pid_ns to take both of these
pid namespaces: the active pid namespace and the default
pid namespace of children.  This allows me to simply implement
unsharing a pid namespace in clone after already unsharing
a pid namespace with unshare.
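
The two notions of pid namespace described above can be modelled in a
few lines of userspace C. This is a sketch under stated assumptions, not
the kernel code: every model_* name, the struct layouts and the
MODEL_NEWPID flag value are invented for the illustration, and the rule
that a new namespace hangs off the active namespace is taken from the
second answer given in the text.

```c
#include <stdlib.h>

#define MODEL_NEWPID 0x1UL	/* illustrative flag, not CLONE_NEWPID */

struct ns_model {
	struct ns_model *parent;
	unsigned int level;
	int last_pid;
};

struct task_model {
	struct ns_model *active_ns;	/* namespace the task lives in */
	struct ns_model *children_ns;	/* namespace its children get */
	int pid;			/* pid as seen in active_ns */
};

static struct ns_model *model_create_ns(struct ns_model *parent)
{
	struct ns_model *ns = calloc(1, sizeof(*ns));

	if (!ns)
		return NULL;
	ns->parent = parent;
	ns->level = parent->level + 1;
	return ns;
}

/* Shape of the new copy_pid_ns(): new namespaces hang off active_ns. */
static struct ns_model *model_copy_pid_ns(unsigned long flags,
					  struct ns_model *default_ns,
					  struct ns_model *active_ns)
{
	if (!(flags & MODEL_NEWPID))
		return default_ns;
	return model_create_ns(active_ns);
}

/* unshare(): only the namespace used for children changes. */
static int model_unshare(struct task_model *t)
{
	struct ns_model *ns;

	ns = model_copy_pid_ns(MODEL_NEWPID, t->children_ns, t->active_ns);
	if (!ns)
		return -1;
	t->children_ns = ns;
	return 0;
}

/* fork(): the child becomes active in the parent's children_ns. */
static struct task_model model_fork(const struct task_model *parent)
{
	struct task_model child;

	child.active_ns = parent->children_ns;
	child.children_ns = parent->children_ns;
	child.pid = ++parent->children_ns->last_pid;
	return child;
}
```

In this model the first child forked after model_unshare() gets pid 1 in
the new namespace and later children get 2, 3 and so on, while the
parent's own active_ns never changes.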

Signed-off-by: Eric W. Biederman ebied...@xmission.com
Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
---
 include/linux/pid_namespace.h |   14 +++++++++-----
 kernel/fork.c                 |    3 ++-
 kernel/nsproxy.c              |    5 +++--
 kernel/pid_namespace.c        |    8 +++++---
 4 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index b447d37..4316347 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -43,7 +43,10 @@ static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
 	return ns;
 }
 
-extern struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *ns);
+extern struct pid_namespace *copy_pid_ns(unsigned long flags,
+					 struct pid_namespace *default_ns,
+					 struct pid_namespace *active_ns);
+
 extern void free_pid_ns(struct kref *kref);
 extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
 
@@ -61,12 +64,13 @@ static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
 	return ns;
 }
 
-static inline struct pid_namespace *
-copy_pid_ns(unsigned long flags, struct pid_namespace *ns)
+static inline struct pid_namespace *copy_pid_ns(unsigned long flags,
+						struct pid_namespace *default_ns,
+						struct pid_namespace *active_ns)
 {
 	if (flags & CLONE_NEWPID)
-		ns = ERR_PTR(-EINVAL);
-	return ns;
+		return ERR_PTR(-EINVAL);
+	return default_ns;
 }
 
 static inline void put_pid_ns(struct pid_namespace *ns)
diff --git a/kernel/fork.c b/kernel/fork.c
index e7a5907..4b019f1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1633,7 +1633,8 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
 	err = -EINVAL;
 	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
CLONE_VM|CLONE_FILES|CLONE_SYSVSEM

[Devel] [PATCH 2/3] pidns: Call pid_ns_prepare_proc from create_pid_namespace

2011-02-14 Thread Daniel Lezcano

From: Eric W. Biederman ebied...@xmission.com

Reorganize proc_get_sb so it can be called before the struct pid
of the first process is allocated.

Signed-off-by: Eric W. Biederman ebied...@xmission.com
Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
---
 fs/proc/root.c         |   25 +++++++------------------
 kernel/fork.c          |    6 ------
 kernel/pid_namespace.c |   11 +++++++++--
 3 files changed, 16 insertions(+), 26 deletions(-)

diff --git a/fs/proc/root.c b/fs/proc/root.c
index ef9fa8e..e5e2bfa 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -43,17 +43,6 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
 	struct pid_namespace *ns;
 	struct proc_inode *ei;
 
-	if (proc_mnt) {
-		/* Seed the root directory with a pid so it doesn't need
-		 * to be special in base.c.  I would do this earlier but
-		 * the only task alive when /proc is mounted the first time
-		 * is the init_task and it doesn't have any pids.
-		 */
-		ei = PROC_I(proc_mnt->mnt_sb->s_root->d_inode);
-		if (!ei->pid)
-			ei->pid = find_get_pid(1);
-	}
-
 	if (flags & MS_KERNMOUNT)
 		ns = (struct pid_namespace *)data;
 	else
@@ -71,16 +60,16 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
 			return ERR_PTR(err);
 		}
 
-		ei = PROC_I(sb->s_root->d_inode);
-		if (!ei->pid) {
-			rcu_read_lock();
-			ei->pid = get_pid(find_pid_ns(1, ns));
-			rcu_read_unlock();
-		}
-
 		sb->s_flags |= MS_ACTIVE;
 	}
 
+	ei = PROC_I(sb->s_root->d_inode);
+	if (!ei->pid) {
+		rcu_read_lock();
+		ei->pid = get_pid(find_pid_ns(1, ns));
+		rcu_read_unlock();
+	}
+
 	return dget(sb->s_root);
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index c9f0784..e7a5907 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1180,12 +1180,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		pid = alloc_pid(p->nsproxy->pid_ns);
 		if (!pid)
 			goto bad_fork_cleanup_io;
-
-		if (clone_flags & CLONE_NEWPID) {
-			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
-			if (retval < 0)
-				goto bad_fork_free_pid;
-		}
 	}
 
 	p->pid = pid_nr(pid);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index a5aff94..e9c9adc 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -14,6 +14,7 @@
 #include <linux/err.h>
 #include <linux/acct.h>
 #include <linux/slab.h>
+#include <linux/proc_fs.h>
 
 #define BITS_PER_PAGE		(PAGE_SIZE*8)
 
@@ -72,7 +73,7 @@ static struct pid_namespace *create_pid_namespace(struct pid_namespace *parent_p
 {
 	struct pid_namespace *ns;
 	unsigned int level = parent_pid_ns->level + 1;
-	int i;
+	int i, err = -ENOMEM;
 
 	ns = kmem_cache_zalloc(pid_ns_cachep, GFP_KERNEL);
 	if (ns == NULL)
@@ -96,14 +97,20 @@ static struct pid_namespace *create_pid_namespace(struct pid_namespace *parent_p
 	for (i = 1; i < PIDMAP_ENTRIES; i++)
 		atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
 
+	err = pid_ns_prepare_proc(ns);
+	if (err)
+		goto out_put_parent_pid_ns;
+
 	return ns;
 
+out_put_parent_pid_ns:
+	put_pid_ns(parent_pid_ns);
 out_free_map:
 	kfree(ns->pidmap[0].page);
 out_free:
 	kmem_cache_free(pid_ns_cachep, ns);
 out:
-	return ERR_PTR(-ENOMEM);
+	return ERR_PTR(err);
 }
 }
 
 static void destroy_pid_namespace(struct pid_namespace *ns)
-- 
1.7.1
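
The error path this patch adds to create_pid_namespace follows the usual
goto-unwind pattern, with 'err' preset to -ENOMEM and overwritten by the
later failure. A userspace sketch of that pattern follows; all sketch_*
names, the live_objects counter and the fixed 64-byte map are
assumptions made for the illustration and do not exist in the kernel.

```c
#include <errno.h>
#include <stdlib.h>

struct ns_sketch {
	void *map;	/* stands in for ns->pidmap[0].page */
};

static int live_objects;	/* allocations not yet released */

static void *sketch_alloc(size_t size)
{
	void *p = malloc(size);

	if (p)
		live_objects++;
	return p;
}

static void sketch_free(void *p)
{
	free(p);
	live_objects--;
}

/* Hypothetical stand-in for pid_ns_prepare_proc(): fails on request. */
static int sketch_prepare_proc(int fail_with)
{
	return fail_with;
}

/*
 * Returns 0 and stores the namespace in *out on success, or a negative
 * errno value after releasing everything acquired so far.
 */
static int sketch_create_ns(int prepare_error, struct ns_sketch **out)
{
	struct ns_sketch *ns;
	int err = -ENOMEM;	/* default for the allocation failures */

	ns = sketch_alloc(sizeof(*ns));
	if (!ns)
		goto out;

	ns->map = sketch_alloc(64);
	if (!ns->map)
		goto out_free;

	err = sketch_prepare_proc(prepare_error);
	if (err)
		goto out_free_map;

	*out = ns;
	return 0;

	/* Labels unwind in the reverse order of acquisition. */
out_free_map:
	sketch_free(ns->map);
out_free:
	sketch_free(ns);
out:
	return err;
}
```

Each label releases exactly one resource and falls through to the next,
so a failure at any step frees what was acquired before it and nothing
else.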


[Devel] [PATCH 1/3] pid: Remove the child_reaper special case in init/main.c

2011-02-14 Thread Daniel Lezcano

From: Eric W. Biederman ebied...@xmission.com

It turns out that the existing assignment in copy_process of
the child_reaper can handle the initial assignment of child_reaper;
we just need to generalize the test in kernel/fork.c.

Signed-off-by: Eric W. Biederman ebied...@xmission.com
Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
---
 include/linux/pid.h |   11 +++++++++++
 init/main.c         |    9 ---------
 kernel/fork.c       |    2 +-
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 49f1c2f..efceda0 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -141,6 +141,17 @@ static inline struct pid_namespace *ns_of_pid(struct pid *pid)
 }
 
 /*
+ * is_child_reaper returns true if the pid is the init process
+ * of the current namespace. As this one could be checked before
+ * pid_ns->child_reaper is assigned in copy_process, we check
+ * with the pid number.
+ */
+static inline bool is_child_reaper(struct pid *pid)
+{
+	return pid->numbers[pid->level].nr == 1;
+}
+
+/*
  * the helpers to get the pid's id seen from different namespaces
  *
  * pid_nr(): global id, i.e. the id seen from the init namespace;
diff --git a/init/main.c b/init/main.c
index 33c37c3..793ebfd 100644
--- a/init/main.c
+++ b/init/main.c
@@ -875,15 +875,6 @@ static int __init kernel_init(void * unused)
 	 * init can run on any cpu.
 	 */
 	set_cpus_allowed_ptr(current, cpu_all_mask);
-	/*
-	 * Tell the world that we're going to be the grim
-	 * reaper of innocent orphaned children.
-	 *
-	 * We don't want people to have to make incorrect
-	 * assumptions about where in the task array this
-	 * can be found.
-	 */
-	init_pid_ns.child_reaper = current;
 
 	cad_pid = task_pid(current);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 25e4291..c9f0784 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1289,7 +1289,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	tracehook_finish_clone(p, clone_flags, trace);
 
 	if (thread_group_leader(p)) {
-		if (clone_flags & CLONE_NEWPID)
+		if (is_child_reaper(pid))
 			p->nsproxy->pid_ns->child_reaper = p;
 
 		p->signal->leader_pid = pid;
-- 
1.7.1
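
The test in is_child_reaper relies on a struct pid carrying one number
per namespace level, with numbers[level] being the number seen in the
namespace that owns the pid. A userspace sketch of that lookup is shown
here; the *_model types and MODEL_MAX_LEVEL are simplified stand-ins
assumed for the illustration, not the kernel definitions.

```c
#include <stdbool.h>

#define MODEL_MAX_LEVEL 4

/* One pid number per namespace level, like the kernel's struct upid. */
struct upid_model {
	int nr;
};

/* Simplified struct pid: level is the depth of the owning namespace. */
struct pid_model {
	unsigned int level;
	struct upid_model numbers[MODEL_MAX_LEVEL + 1];
};

/* True when the pid is number 1 in its own (deepest) namespace. */
static bool model_is_child_reaper(const struct pid_model *pid)
{
	return pid->numbers[pid->level].nr == 1;
}
```

A container init that is, say, pid 4321 on the host and pid 1 in its own
namespace passes the test, and so does the host init at level 0, which
is why the special case in init/main.c can go away.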


[Devel] [PATCH 0/3] procfs wrt pid namespace cleanups

2011-02-14 Thread Daniel Lezcano

This patchset is a cleanup and a preparation to unshare the pid
namespace. These prerequisites prepare Eric's next patchset,
which gives a file descriptor to a namespace and allows joining an
existing namespace.

The initial authors of this patchset are Eric Biederman and Oleg
Nesterov.

Changelog:

02/14/11 - daniel.lezc...@free.fr
* patch 2/3 : fix return ENOMEM and put_pid_ns on error
* removed buggy patch #4 from the initial patchset

01/31/11 - daniel.lezc...@free.fr:
* patch 1/4 : wrapped test in a function
* patch 2/4 : handle proc_pid_ns_prepare_proc error
* patch 2/4 : put parent pid_ns on error
* other patches : refreshed against linux-next



[Devel] [PATCH 3/3] procfs: kill the global proc_mnt variable

2011-02-14 Thread Daniel Lezcano

From: Oleg Nesterov o...@redhat.com

After the previous cleanup in proc_get_sb() the global proc_mnt has
no reason to exist; kill it.

Signed-off-by: Oleg Nesterov o...@redhat.com
Signed-off-by: Eric W. Biederman ebied...@xmission.com
Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
---
 fs/proc/inode.c    |    2 --
 fs/proc/internal.h |    1 -
 fs/proc/root.c     |    7 ++++---
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 176ce4c..ee0f802 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -42,8 +42,6 @@ static void proc_evict_inode(struct inode *inode)
 		sysctl_head_put(PROC_I(inode)->sysctl);
 }
 
-struct vfsmount *proc_mnt;
-
 static struct kmem_cache * proc_inode_cachep;
 
 static struct inode *proc_alloc_inode(struct super_block *sb)
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 9ad561d..c03e8d3 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -107,7 +107,6 @@ static inline struct proc_dir_entry *pde_get(struct proc_dir_entry *pde)
 }
 void pde_put(struct proc_dir_entry *pde);
 
-extern struct vfsmount *proc_mnt;
 int proc_fill_super(struct super_block *);
 struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index e5e2bfa..a9000e9 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -90,19 +90,20 @@ static struct file_system_type proc_fs_type = {
 
 void __init proc_root_init(void)
 {
+	struct vfsmount *mnt;
 	int err;
 
 	proc_init_inodecache();
 	err = register_filesystem(&proc_fs_type);
 	if (err)
 		return;
-	proc_mnt = kern_mount_data(&proc_fs_type, &init_pid_ns);
-	if (IS_ERR(proc_mnt)) {
+	mnt = kern_mount_data(&proc_fs_type, &init_pid_ns);
+	if (IS_ERR(mnt)) {
 		unregister_filesystem(&proc_fs_type);
 		return;
 	}
 
-	init_pid_ns.proc_mnt = proc_mnt;
+	init_pid_ns.proc_mnt = mnt;
 	proc_symlink("mounts", NULL, "self/mounts");
 
 	proc_net_init();
-- 
1.7.1


[Devel] Re: [PATCH 4/4] pidns: Use task_active_pid_ns where appropriate

2011-02-01 Thread Daniel Lezcano

On 01/31/2011 04:36 PM, Oleg Nesterov wrote:
 On 01/31, Daniel Lezcano wrote:
 On 01/31/2011 12:26 PM, Alexey Dobriyan wrote:
 This task_active_pid_ns() is misnamed(?) because it does matter only
 for dead tasks?
 Actually this function is later used, for the unshare, to get the pid_ns
 of a specific task, not the current one.
 Well, it is already used to get the pid_ns of !current task.

 http://kerneltrap.org/mailarchive/linux-kernel/2010/6/20/4585095
 The actual need for this change is that you are going to complicate
 the things so that current->proxy->pid_ns != task_active_pid_ns().
 This makes me cry ;)

Mmh, ok that makes sense.

 Please don't forget, this patch is buggy, iirc...

Ok, I will resend the cleanup patchset without this patch.

[Devel] Prepare for the unshare support of the pid namespace

2011-01-31 Thread Daniel Lezcano

This patchset is a cleanup and a preparation to unshare the pid
namespace. These prerequisites prepare Eric's next patchset,
which gives a file descriptor to a namespace and allows joining an
existing namespace.

The initial authors of this patchset are Eric Biederman and Oleg
Nesterov.

Changelog:
01/31/11 - Daniel Lezcano daniel.lezc...@free.fr
* patch 1/4 : wrapped test in a function
* other patches : refreshed against linux-next



[Devel] [PATCH 2/4] pidns: Call pid_ns_prepare_proc from create_pid_namespace

2011-01-31 Thread Daniel Lezcano

From: Eric W. Biederman ebied...@xmission.com

Reorganize proc_get_sb so it can be called before the struct pid
of the first process is allocated.

Signed-off-by: Eric W. Biederman ebied...@xmission.com
Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
---
 fs/proc/root.c         |   25 +++++++------------------
 kernel/fork.c          |    6 ------
 kernel/pid_namespace.c |    4 ++++
 3 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/fs/proc/root.c b/fs/proc/root.c
index ef9fa8e..e5e2bfa 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -43,17 +43,6 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
 	struct pid_namespace *ns;
 	struct proc_inode *ei;
 
-	if (proc_mnt) {
-		/* Seed the root directory with a pid so it doesn't need
-		 * to be special in base.c.  I would do this earlier but
-		 * the only task alive when /proc is mounted the first time
-		 * is the init_task and it doesn't have any pids.
-		 */
-		ei = PROC_I(proc_mnt->mnt_sb->s_root->d_inode);
-		if (!ei->pid)
-			ei->pid = find_get_pid(1);
-	}
-
 	if (flags & MS_KERNMOUNT)
 		ns = (struct pid_namespace *)data;
 	else
@@ -71,16 +60,16 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
 			return ERR_PTR(err);
 		}
 
-		ei = PROC_I(sb->s_root->d_inode);
-		if (!ei->pid) {
-			rcu_read_lock();
-			ei->pid = get_pid(find_pid_ns(1, ns));
-			rcu_read_unlock();
-		}
-
 		sb->s_flags |= MS_ACTIVE;
 	}
 
+	ei = PROC_I(sb->s_root->d_inode);
+	if (!ei->pid) {
+		rcu_read_lock();
+		ei->pid = get_pid(find_pid_ns(1, ns));
+		rcu_read_unlock();
+	}
+
 	return dget(sb->s_root);
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index c9f0784..e7a5907 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1180,12 +1180,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		pid = alloc_pid(p->nsproxy->pid_ns);
 		if (!pid)
 			goto bad_fork_cleanup_io;
-
-		if (clone_flags & CLONE_NEWPID) {
-			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
-			if (retval < 0)
-				goto bad_fork_free_pid;
-		}
 	}
 
 	p->pid = pid_nr(pid);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index a5aff94..b90e4ab 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -14,6 +14,7 @@
 #include <linux/err.h>
 #include <linux/acct.h>
 #include <linux/slab.h>
+#include <linux/proc_fs.h>
 
 #define BITS_PER_PAGE		(PAGE_SIZE*8)
 
@@ -96,6 +97,9 @@ static struct pid_namespace *create_pid_namespace(struct pid_namespace *parent_p
 	for (i = 1; i < PIDMAP_ENTRIES; i++)
 		atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
 
+	if (pid_ns_prepare_proc(ns))
+		goto out_free_map;
+
 	return ns;
 
 out_free_map:
-- 
1.7.1


[Devel] [PATCH 1/4] pid: Remove the child_reaper special case in init/main.c

2011-01-31 Thread Daniel Lezcano

From: Eric W. Biederman ebied...@xmission.com

It turns out that the existing assignment in copy_process of
the child_reaper can handle the initial assignment of child_reaper;
we just need to generalize the test in kernel/fork.c.

Signed-off-by: Eric W. Biederman ebied...@xmission.com
Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
---
 include/linux/pid.h |   11 +++++++++++
 init/main.c         |    9 ---------
 kernel/fork.c       |    2 +-
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 49f1c2f..efceda0 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -141,6 +141,17 @@ static inline struct pid_namespace *ns_of_pid(struct pid *pid)
 }
 
 /*
+ * is_child_reaper returns true if the pid is the init process
+ * of the current namespace. As this one could be checked before
+ * pid_ns->child_reaper is assigned in copy_process, we check
+ * with the pid number.
+ */
+static inline bool is_child_reaper(struct pid *pid)
+{
+	return pid->numbers[pid->level].nr == 1;
+}
+
+/*
  * the helpers to get the pid's id seen from different namespaces
  *
  * pid_nr(): global id, i.e. the id seen from the init namespace;
diff --git a/init/main.c b/init/main.c
index 33c37c3..793ebfd 100644
--- a/init/main.c
+++ b/init/main.c
@@ -875,15 +875,6 @@ static int __init kernel_init(void * unused)
 	 * init can run on any cpu.
 	 */
 	set_cpus_allowed_ptr(current, cpu_all_mask);
-	/*
-	 * Tell the world that we're going to be the grim
-	 * reaper of innocent orphaned children.
-	 *
-	 * We don't want people to have to make incorrect
-	 * assumptions about where in the task array this
-	 * can be found.
-	 */
-	init_pid_ns.child_reaper = current;
 
 	cad_pid = task_pid(current);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 25e4291..c9f0784 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1289,7 +1289,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	tracehook_finish_clone(p, clone_flags, trace);
 
 	if (thread_group_leader(p)) {
-		if (clone_flags & CLONE_NEWPID)
+		if (is_child_reaper(pid))
 			p->nsproxy->pid_ns->child_reaper = p;
 
 		p->signal->leader_pid = pid;
-- 
1.7.1


[Devel] [PATCH 3/4] procfs: kill the global proc_mnt variable

2011-01-31 Thread Daniel Lezcano

From: Oleg Nesterov o...@redhat.com

After the previous cleanup in proc_get_sb() the global proc_mnt has
no reason to exist; kill it.

Signed-off-by: Oleg Nesterov o...@redhat.com
Signed-off-by: Eric W. Biederman ebied...@xmission.com
Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
---
 fs/proc/inode.c    |    2 --
 fs/proc/internal.h |    1 -
 fs/proc/root.c     |    7 ++++---
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 176ce4c..ee0f802 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -42,8 +42,6 @@ static void proc_evict_inode(struct inode *inode)
 		sysctl_head_put(PROC_I(inode)->sysctl);
 }
 
-struct vfsmount *proc_mnt;
-
 static struct kmem_cache * proc_inode_cachep;
 
 static struct inode *proc_alloc_inode(struct super_block *sb)
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 9ad561d..c03e8d3 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -107,7 +107,6 @@ static inline struct proc_dir_entry *pde_get(struct proc_dir_entry *pde)
 }
 void pde_put(struct proc_dir_entry *pde);
 
-extern struct vfsmount *proc_mnt;
 int proc_fill_super(struct super_block *);
 struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index e5e2bfa..a9000e9 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -90,19 +90,20 @@ static struct file_system_type proc_fs_type = {
 
 void __init proc_root_init(void)
 {
+	struct vfsmount *mnt;
 	int err;
 
 	proc_init_inodecache();
 	err = register_filesystem(&proc_fs_type);
 	if (err)
 		return;
-	proc_mnt = kern_mount_data(&proc_fs_type, &init_pid_ns);
-	if (IS_ERR(proc_mnt)) {
+	mnt = kern_mount_data(&proc_fs_type, &init_pid_ns);
+	if (IS_ERR(mnt)) {
 		unregister_filesystem(&proc_fs_type);
 		return;
 	}
 
-	init_pid_ns.proc_mnt = proc_mnt;
+	init_pid_ns.proc_mnt = mnt;
 	proc_symlink("mounts", NULL, "self/mounts");
 
 	proc_net_init();
-- 
1.7.1


[Devel] [PATCH 4/4] pidns: Use task_active_pid_ns where appropriate

2011-01-31 Thread Daniel Lezcano

From: Eric W. Biederman ebied...@xmission.com

The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
aka ns_of_pid(task_pid(tsk)) should have the same number of
cache line misses with the practical difference that
ns_of_pid(task_pid(tsk)) is released later in a process's life.

Furthermore by using task_active_pid_ns it becomes trivial
to write an unshare implementation for the pid namespace.

So I have used task_active_pid_ns everywhere I can.

Signed-off-by: Eric W. Biederman ebied...@xmission.com
Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
---
 arch/powerpc/platforms/cell/spufs/sched.c |    2 +-
 arch/um/drivers/mconsole_kern.c           |    2 +-
 fs/proc/root.c                            |    2 +-
 kernel/cgroup.c                           |    3 +--
 kernel/perf_event.c                       |    2 +-
 kernel/pid.c                              |    8 ++++----
 kernel/signal.c                           |    9 ++++-----
 kernel/sysctl_binary.c                    |    2 +-
 8 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c 
b/arch/powerpc/platforms/cell/spufs/sched.c
index 0b04662..82e26a0 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -1095,7 +1095,7 @@ static int show_spu_loadavg(struct seq_file *s, void 
*private)
LOAD_INT(c), LOAD_FRAC(c),
count_active_contexts(),
atomic_read(nr_spu_contexts),
-   current-nsproxy-pid_ns-last_pid);
+   task_active_pid_ns(current)-last_pid);
return 0;
 }
 
diff --git a/arch/um/drivers/mconsole_kern.c b/arch/um/drivers/mconsole_kern.c
index 975613b..edac0da 100644
--- a/arch/um/drivers/mconsole_kern.c
+++ b/arch/um/drivers/mconsole_kern.c
@@ -125,7 +125,7 @@ void mconsole_log(struct mc_request *req)
 void mconsole_proc(struct mc_request *req)
 {
struct nameidata nd;
-   struct vfsmount *mnt = current-nsproxy-pid_ns-proc_mnt;
+   struct vfsmount *mnt = task_active_pid_ns(current)-proc_mnt;
struct file *file;
int n, err;
char *ptr = req-request.data, *buf;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index a9000e9..9ea237e 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -46,7 +46,7 @@ static struct dentry *proc_mount(struct file_system_type 
*fs_type,
if (flags  MS_KERNMOUNT)
ns = (struct pid_namespace *)data;
else
-   ns = current-nsproxy-pid_ns;
+   ns = task_active_pid_ns(current);
 
sb = sget(fs_type, proc_test_super, proc_set_super, ns);
if (IS_ERR(sb))
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b24d702..5cb4ae7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2741,8 +2741,7 @@ static struct cgroup_pidlist *cgroup_pidlist_find(struct 
cgroup *cgrp,
 {
struct cgroup_pidlist *l;
/* don't need task_nsproxy() if we're looking at ourself */
-   struct pid_namespace *ns = current->nsproxy->pid_ns;
-
+   struct pid_namespace *ns = task_active_pid_ns(current);
/*
 * We can't drop the pidlist_mutex before taking the l->mutex in case
 * the last ref-holder is trying to remove l from the list at the same
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 852ae8c..42bdb40 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -5581,7 +5581,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 
event->parent   = parent_event;
 
-   event->ns   = get_pid_ns(current->nsproxy->pid_ns);
+   event->ns   = get_pid_ns(task_active_pid_ns(current));
event->id   = atomic64_inc_return(&perf_event_id);
 
event->state= PERF_EVENT_STATE_INACTIVE;
diff --git a/kernel/pid.c b/kernel/pid.c
index 39b65b6..b45189d 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -339,7 +339,7 @@ EXPORT_SYMBOL_GPL(find_pid_ns);
 
 struct pid *find_vpid(int nr)
 {
-   return find_pid_ns(nr, current->nsproxy->pid_ns);
+   return find_pid_ns(nr, task_active_pid_ns(current));
 }
 EXPORT_SYMBOL_GPL(find_vpid);
 
@@ -422,7 +422,7 @@ struct task_struct *find_task_by_pid_ns(pid_t nr, struct 
pid_namespace *ns)
 
 struct task_struct *find_task_by_vpid(pid_t vnr)
 {
-   return find_task_by_pid_ns(vnr, current->nsproxy->pid_ns);
+   return find_task_by_pid_ns(vnr, task_active_pid_ns(current));
 }
 
 struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
@@ -474,7 +474,7 @@ pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
 
 pid_t pid_vnr(struct pid *pid)
 {
-   return pid_nr_ns(pid, current->nsproxy->pid_ns);
+   return pid_nr_ns(pid, task_active_pid_ns(current));
 }
 EXPORT_SYMBOL_GPL(pid_vnr);
 
@@ -485,7 +485,7 @@ pid_t __task_pid_nr_ns(struct task_struct *task, enum 
pid_type type,
 
rcu_read_lock();
if (!ns)
-   ns = current->nsproxy->pid_ns;
+   ns

[Devel] Re: [PATCH 2/4] pidns: Call pid_ns_prepare_proc from create_pid_namespace

2011-01-31 Thread Daniel Lezcano

On 01/31/2011 02:22 PM, Oleg Nesterov wrote:
 On 01/31, Daniel Lezcano wrote:
 @@ -96,6 +97,9 @@ static struct pid_namespace *create_pid_namespace(struct 
 pid_namespace *parent_p
  for (i = 1; i < PIDMAP_ENTRIES; i++)
  atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);

 +if (pid_ns_prepare_proc(ns))
 +goto out_free_map;
 +
  return ns;
 This is not right, afaics. I already sent the similar patches, but
 they were ignored ;)

 Please see http://marc.info/?l=linux-kernel&m=127697484000334

 If pid_ns_prepare_proc() fails we shouldn't blindly return ENOMEM
 and, more importantly, we need put_pid_ns(parent_ns).

Oh, ok. Right. Thanks for the pointer.

Are you ok if I replace the patch 2/4 with your patch ?
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel

[Devel] Re: [PATCH 2/4] pidns: Call pid_ns_prepare_proc from create_pid_namespace

2011-01-31 Thread Daniel Lezcano

On 01/31/2011 03:02 PM, Oleg Nesterov wrote:
 On 01/31, Daniel Lezcano wrote:
 On 01/31/2011 02:22 PM, Oleg Nesterov wrote:
 On 01/31, Daniel Lezcano wrote:
 @@ -96,6 +97,9 @@ static struct pid_namespace *create_pid_namespace(struct 
 pid_namespace *parent_p
for (i = 1; i < PIDMAP_ENTRIES; i++)
atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);

 +  if (pid_ns_prepare_proc(ns))
 +  goto out_free_map;
 +
return ns;
 This is not right, afaics. I already sent the similar patches, but
 they were ignored ;)

 Please see http://marc.info/?l=linux-kernel&m=127697484000334

 If pid_ns_prepare_proc() fails we shouldn't blindly return ENOMEM
 and, more importantly, we need put_pid_ns(parent_ns).
 Oh, ok. Right. Thanks for the pointer.

 Are you ok if I replace the patch 2/4 with your patch ?
 My patch depends on 1/4, http://marc.info/?l=linux-kernel&m=127697468632667

 Your change looks very similar to 1/4 + 3/4. Just fix the problem
 in create_pid_namespace(), no need to replace.

Ok, will do.

Thanks Oleg.

   -- Daniel

[Devel] Re: [PATCH 4/4] pidns: Use task_active_pid_ns where appropriate

2011-01-31 Thread Daniel Lezcano

On 01/31/2011 12:26 PM, Alexey Dobriyan wrote:
 On Mon, Jan 31, 2011 at 12:25 PM, Daniel Lezcano <daniel.lezc...@free.fr> wrote:
 The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
 aka ns_of_pid(task_pid(tsk)) should have the same number of
 cache line misses with the practical difference that
 ns_of_pid(task_pid(tsk)) is released later in a process's life.

 Furthermore by using task_active_pid_ns it becomes trivial
 to write an unshare implementation for the pid namespace.

 So I have used task_active_pid_ns everywhere I can.
 Yet current->nsproxy->pid_ns is way clearer.
 Because live current always has pid_ns.

 This task_active_pid_ns() is misnamed(?) because it does matter only
 for dead tasks?

Actually this function is later used, for the unshare, to get the pid_ns 
of a specific task, not the current one.

http://kerneltrap.org/mailarchive/linux-kernel/2010/6/20/4585095

Do you suggest task_pid_ns(struct task_struct *tsk) would be a better name ?
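
For readers following the naming discussion, the two lookups can be compared in a small userspace model. This is only a sketch, not kernel code: the struct layouts are heavily trimmed assumptions and model_exit_namespaces() is a made-up helper; only the names ns_of_pid(), task_pid() and task_active_pid_ns() come from the thread. It illustrates why the pid-based lookup still answers after the task has let go of its nsproxy.

```c
#include <stddef.h>

/* Toy userspace model; the real kernel structures carry many more fields. */
struct pid_namespace { int level; };

struct upid { int nr; struct pid_namespace *ns; };

struct pid {
    unsigned int level;        /* depth of the namespace the pid was created in */
    struct upid numbers[2];    /* one entry per level; two levels suffice here */
};

struct nsproxy { struct pid_namespace *pid_ns; };

struct task_struct {
    struct nsproxy *nsproxy;   /* dropped while the task is exiting */
    struct pid *pid;           /* stays attached until the task is reaped */
};

static struct pid *task_pid(struct task_struct *tsk)
{
    return tsk->pid;
}

/* The namespace a pid was allocated in is the one at its deepest level. */
static struct pid_namespace *ns_of_pid(struct pid *pid)
{
    return pid ? pid->numbers[pid->level].ns : NULL;
}

static struct pid_namespace *task_active_pid_ns(struct task_struct *tsk)
{
    return ns_of_pid(task_pid(tsk));
}

/* Hypothetical helper: mimics the point in exit where nsproxy is cleared. */
static void model_exit_namespaces(struct task_struct *tsk)
{
    tsk->nsproxy = NULL;
}
```

In this model both expressions agree for a live task; after model_exit_namespaces() only task_active_pid_ns() still returns the namespace.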

 -   current->nsproxy->pid_ns->last_pid);
 +   task_active_pid_ns(current)->last_pid);
 I thought of doing exactly opposite patch :-)


[Devel] Re: [PATCH] cgroup : remove the ns_cgroup

2011-01-27 Thread Daniel Lezcano

On 01/27/2011 02:45 AM, Andrew Morton wrote:
 On Thu, 27 Jan 2011 09:08:51 +0800 Li Zefan <l...@cn.fujitsu.com> wrote:

 Andrew Morton wrote:
 On Tue, 25 Jan 2011 10:39:48 +0100
 Daniel Lezcano <daniel.lezc...@free.fr> wrote:

 This patch removes the ns_cgroup as suggested in the following thread:
 I had this patch queued up in September last year, but dropped it.  Why
 did I do that?
 Because you wanted to wait for some time for users (if any) to notice this
 coming change.

 Author: Daniel Lezcano <daniel.lezc...@free.fr>
 Date:   Wed Oct 27 15:33:38 2010 -0700

  cgroup: notify ns_cgroup deprecated

  The ns_cgroup will be removed very soon.  Let's warn, for this version,
  ns_cgroup is deprecated.

  Make ns_cgroup and clone_children exclusive.  If the clone_children is set
  and the ns_cgroup is mounted, let's fail with EINVAL when the ns_cgroup
  subsys is created (a printk will help the user to understand why the
  creation fails).

  Update the feature remove schedule file with the deprecated ns_cgroup.

  Signed-off-by: Daniel Lezcano <daniel.lezc...@free.fr>
  Acked-by: Paul Menage <men...@google.com>
  Signed-off-by: Andrew Morton <a...@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torva...@linux-foundation.org>
 ooh, that was clever of me.

 Here is the text which was missing from the changelog:

This is a userspace-visible change.  Commit 45531757b45c ("cgroup:
notify ns_cgroup deprecated") (merged into 2.6.37) caused the kernel
to emit a printk warning users that the feature is planned for
removal.  Since that time we have heard from XXX users who were
affected by this.

 Please provide XXX.

Ok, AFAIK nobody makes use of the ns_cgroup except the LXC userspace 
tools, which I maintain and where backward compatibility with both the 
ns_cgroup and the clone_children flag is already implemented.
So far nobody seems to be affected by this.

I Cc'ed the libvirt mailing list.

 How do we know that 2.6.37-2.6.38 is long enough?  Will any major
 distros be released containing this warning in that timeframe?  I doubt
 it.

Hmm, maybe it is too short, but I don't think anyone will complain about 
this feature removal.
Google Chromium uses namespaces, hence a lot of cgroups are 
created on the system. vsftpd and some PAM modules use 
namespaces too.
I won't be surprised if one of these applications fails with 'clone' 
returning EEXIST ...

[Devel] Re: [PATCH 0/6] Unshare support for the pid namespace.

2011-01-26 Thread Daniel Lezcano


Hi All,

I refreshed this patchset on top of linux-next and I would like to 
resend it for inclusion if nobody is opposed.
I read the looong thread about this patchset and I am a bit lost as to 
whether the patches are right or not.
I will be glad to contribute to this patchset if it makes sense to 
improve some parts.

Can someone enlighten me ?

Thanks
   -- Daniel


[Devel] [PATCH] cgroup : remove the ns_cgroup

2011-01-25 Thread Daniel Lezcano

The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier
and leads to some problems:

* cgroup creation is out-of-control
* cgroup name can conflict when pids are looping
* it is not possible to have a single process handling
a lot of namespaces without falling into an exponential creation time
* we may want to create a namespace without creating a cgroup

The ns_cgroup was replaced by a compatibility flag 'clone_children',
where a newly created cgroup will copy the parent cgroup values.
The userspace has to manually create a cgroup and add a task to
the 'tasks' file.
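
The manual step described above (create a cgroup, then add a task to its 'tasks' file) could look roughly like the following sketch. The function name and the example path are hypothetical; it assumes a cgroup hierarchy is already mounted and that the caller is allowed to write to it.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Create the cgroup directory `cgroup_dir` if it is missing, then move
 * `pid` into it by writing the pid to the 'tasks' file.
 * Returns 0 on success, -1 on error.
 */
static int cgroup_create_and_attach(const char *cgroup_dir, pid_t pid)
{
    char path[4096];
    FILE *f;

    if (mkdir(cgroup_dir, 0755) < 0 && errno != EEXIST)
        return -1;

    if (snprintf(path, sizeof(path), "%s/tasks", cgroup_dir) >= (int)sizeof(path)) {
        errno = ENAMETOOLONG;
        return -1;
    }

    f = fopen(path, "w");
    if (!f)
        return -1;

    if (fprintf(f, "%d\n", (int)pid) < 0) {
        fclose(f);
        return -1;
    }

    /* The write only reaches the cgroup file when the stream is flushed. */
    return fclose(f) == 0 ? 0 : -1;
}
```

A container launcher would call something like cgroup_create_and_attach("/cgroup/mycontainer", child_pid) right after clone(); the mount point /cgroup is only an example.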

This patch removes the ns_cgroup as suggested in the following thread:

https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

The 'cgroup_clone' function is removed because it is no longer used.

Signed-off-by: Daniel Lezcano <daniel.lezc...@free.fr>
Signed-off-by: Serge E. Hallyn <serge.hal...@canonical.com>
Cc: Eric W. Biederman <ebied...@xmission.com>
Cc: Jamal Hadi Salim <h...@cyberus.ca>
Reviewed-by: Li Zefan <l...@cn.fujitsu.com>
Acked-by: Paul Menage <men...@google.com>
Acked-by: Matt Helsley <matth...@us.ibm.com>
---
 Documentation/cgroups/cgroups.txt  |2 +-
 arch/mips/configs/bcm47xx_defconfig|1 -
 arch/mn10300/configs/asb2364_defconfig |1 -
 arch/powerpc/configs/ppc6xx_defconfig  |1 -
 arch/powerpc/configs/pseries_defconfig |1 -
 arch/sh/configs/apsh4ad0a_defconfig|1 -
 arch/sh/configs/sdk7786_defconfig  |1 -
 arch/sh/configs/se7206_defconfig   |1 -
 arch/sh/configs/shx3_defconfig |1 -
 arch/sh/configs/urquell_defconfig  |1 -
 arch/x86/configs/i386_defconfig|1 -
 arch/x86/configs/x86_64_defconfig  |1 -
 include/linux/cgroup.h |3 -
 include/linux/cgroup_subsys.h  |6 --
 include/linux/nsproxy.h|9 ---
 init/Kconfig   |8 --
 kernel/Makefile|1 -
 kernel/cgroup.c|  116 ---
 kernel/cpuset.c|7 +-
 kernel/fork.c  |6 --
 kernel/ns_cgroup.c |  118 
 kernel/nsproxy.c   |4 -
 22 files changed, 4 insertions(+), 287 deletions(-)
 delete mode 100644 kernel/ns_cgroup.c

diff --git a/Documentation/cgroups/cgroups.txt 
b/Documentation/cgroups/cgroups.txt
index 44b8b7a..ac759b6 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -618,7 +618,7 @@ always handled well.
 void post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp)
 (cgroup_mutex held by caller)
 
-Called at the end of cgroup_clone() to do any parameter
+Called during cgroup_create() to do any parameter
 initialization which might be required before a task could attach.  For
 example in cpusets, no task may attach before 'cpus' and 'mems' are set
 up.
diff --git a/arch/mips/configs/bcm47xx_defconfig 
b/arch/mips/configs/bcm47xx_defconfig
index 927d58b..c4338e0 100644
--- a/arch/mips/configs/bcm47xx_defconfig
+++ b/arch/mips/configs/bcm47xx_defconfig
@@ -16,7 +16,6 @@ CONFIG_TASK_IO_ACCOUNTING=y
 CONFIG_AUDIT=y
 CONFIG_TINY_RCU=y
 CONFIG_CGROUPS=y
-CONFIG_CGROUP_NS=y
 CONFIG_CGROUP_CPUACCT=y
 CONFIG_RELAY=y
 CONFIG_BLK_DEV_INITRD=y
diff --git a/arch/mn10300/configs/asb2364_defconfig 
b/arch/mn10300/configs/asb2364_defconfig
index 83ce2f2..d38391a 100644
--- a/arch/mn10300/configs/asb2364_defconfig
+++ b/arch/mn10300/configs/asb2364_defconfig
@@ -8,7 +8,6 @@ CONFIG_TASK_XACCT=y
 CONFIG_TASK_IO_ACCOUNTING=y
 CONFIG_LOG_BUF_SHIFT=14
 CONFIG_CGROUPS=y
-CONFIG_CGROUP_NS=y
 CONFIG_CGROUP_FREEZER=y
 CONFIG_CGROUP_DEVICE=y
 CONFIG_CGROUP_CPUACCT=y
diff --git a/arch/powerpc/configs/ppc6xx_defconfig 
b/arch/powerpc/configs/ppc6xx_defconfig
index 9d64a68..9b253f6 100644
--- a/arch/powerpc/configs/ppc6xx_defconfig
+++ b/arch/powerpc/configs/ppc6xx_defconfig
@@ -10,7 +10,6 @@ CONFIG_TASK_XACCT=y
 CONFIG_TASK_IO_ACCOUNTING=y
 CONFIG_AUDIT=y
 CONFIG_CGROUPS=y
-CONFIG_CGROUP_NS=y
 CONFIG_CGROUP_DEVICE=y
 CONFIG_CGROUP_CPUACCT=y
 CONFIG_RESOURCE_COUNTERS=y
diff --git a/arch/powerpc/configs/pseries_defconfig 
b/arch/powerpc/configs/pseries_defconfig
index f87f0e1..972587f 100644
--- a/arch/powerpc/configs/pseries_defconfig
+++ b/arch/powerpc/configs/pseries_defconfig
@@ -15,7 +15,6 @@ CONFIG_AUDITSYSCALL=y
 CONFIG_IKCONFIG=y
 CONFIG_IKCONFIG_PROC=y
 CONFIG_CGROUPS=y
-CONFIG_CGROUP_NS=y
 CONFIG_CGROUP_FREEZER=y
 CONFIG_CGROUP_DEVICE=y
 CONFIG_CPUSETS=y
diff --git a/arch/sh/configs/apsh4ad0a_defconfig 
b/arch/sh/configs/apsh4ad0a_defconfig
index e71a531..f722a3d 100644
--- a/arch/sh/configs/apsh4ad0a_defconfig
+++ b/arch/sh/configs/apsh4ad0a_defconfig
@@ -7,7 +7,6 @@ CONFIG_IKCONFIG=y
 CONFIG_IKCONFIG_PROC=y
 CONFIG_LOG_BUF_SHIFT=14
 CONFIG_CGROUPS=y
-CONFIG_CGROUP_NS=y
 CONFIG_CGROUP_FREEZER=y

[Devel] Re: [PATCH] Add linux-next specific files for 20110121

2011-01-25 Thread Daniel Lezcano

On 01/25/2011 10:39 AM, Daniel Lezcano wrote:
 From: Stephen Rothwell <s...@canb.auug.org.au>

 Signed-off-by: Stephen Rothwell <s...@canb.auug.org.au>
 ---

Oops ! wrong patch.
Please ignore.

sorry for the noise.

   -- Daniel

[Devel] Re: [RFD] reboot / shutdown of a container

2011-01-14 Thread Daniel Lezcano

On 01/15/2011 12:11 AM, Bruno Prémont wrote:
 On Thu, 13 January 2011 Daniel Lezcano <daniel.lezc...@free.fr> wrote:
 On 01/13/2011 10:50 PM, Bruno Prémont wrote:
 On Thu, 13 January 2011 Daniel Lezcano <daniel.lezc...@free.fr> wrote:
 On 01/13/2011 09:09 PM, Bruno Prémont wrote:
 On Thu, 13 January 2011 Daniel Lezcano <daniel.lezc...@free.fr> wrote:
 in the container implementation, we are facing the problem of a process
 calling the sys_reboot syscall which of course makes the host to
 poweroff/reboot.

 If we drop the cap_sys_reboot capability, sys_reboot fails and the
 container reach a shutdown state but the init process stay there, hence
 the container becomes stuck waiting indefinitely the process '1' to exit.

 The current implementation to make the shutdown / reboot of the
 container to work is we watch, from a process outside of the container,
 the rootfs/var/run/utmp file and check the runlevel each time the file
 changes. When the 'reboot' or 'shutdown' level is detected, we wait for
 a single remaining process in the container and then we kill it.

 That works but this is not efficient in case of a large number of
 containers as we will have to watch a lot of utmp files. In addition,
 the /var/run directory must *not* mounted as tmpfs in the distro.
 Unfortunately, it is the default setup on most of the distros and tends
 to generalize. That implies, the rootfs init's scripts must be modified
 for the container when we put in place its rootfs and as /var/run is
 supposed to be a tmpfs, most of the applications do not cleanup the
 directory, so we need to add extra services to wipeout the files.

 More problems arise when we do an upgrade of the distro inside the
 container, because all the setup we made at creation time will be lost.
 The upgrade overwrite the scripts, the fstab and so on.

 We did what was possible to solve the problem from userspace but we
 reach always a limit because there are different implementations of the
 'init' process and the init's scripts differ from a distro to another
 and the same with the versions.

 We think this problem can only be solved from the kernel.

 The idea was to send a signal SIGPWR to the parent of the pid '1' of the
 pid namespace when the sys_reboot is called. Of course that won't occur
 for the init pid namespace.
 Wouldn't sending SIGKILL to the pid '1' process of the originating PID
 namespace be sufficient (that would trigger a SIGCHLD for the parent
 process in the outer PID namespace.
 This is already the case. The question is : when do we send this signal ?
 We have to wait for the container system shutdown before killing it.
 I meant that sys_reboot() would kill the namespace's init if it's not
 called from boot namespace.

 See below

 (as far as I remember the PID namespace is killed when its 'init' exits,
 if this is not the case all other processes in the given namespace would
 have to be killed as well)
 Yes, absolutely but this is not the point, reaping the container is not
 a problem.

 What we are trying to achieve is to shutdown properly the container from
 inside (from outside will be possible too with the setns syscall).

 Assuming the process '1234' creates a new process in a new namespace set
 and wait for it.

 The new process '1' will exec /sbin/init and the system will boot up.
 But, when the system is shutdown or rebooted, after the down scripts are
 executed the kill -15 -1 will be invoked, killing all the processes
 except the process '1' and the caller. This one will then call
 'sys_reboot' and exit. Hence we still have the init process idle and its
 parent '1234' waiting for it to die.
 This call to sys_reboot() would kill new process '1' instead of trying to
 operate on the HW box.
 This also has the advantage that a container would not require an informed
 parent monitoring it from outside (though it would not be restarted even if
 requested without such informed outside parent).
 Oh, ok. Sorry I misunderstood.

 Yes, that could be better than crossing the namespace boundaries.

 If we are able to receive the information in the process '1234' : the
 sys_reboot was called in the child pid namespace, we can take then kill
 our child pid.  If this information is raised via a signal sent by the
 kernel with the proper information in the siginfo_t (eg. si_code
 contains LINUX_REBOOT_CMD_RESTART, LINUX_REBOOT_CMD_HALT, ... ), the
 solution will be generic for all the shutdown/reboot of any kind of
 container and init version.
 Could this be passed for a SIGCHLD? (when namespace is reaped, and received
 by 1234 from above example assuming sys_reboot() kills the new process '1')
 Yes, that sounds a good idea.

 Looks like yes, but with the need to define new values for si_code (reusing
 LINUX_REBOOT_CMD_* would certainly clash, no matter which signal is chosen).
 CLD_REBOOT_CMD_RESTART
 CLD_REBOOT_CMD_HALT
 CLD_REBOOT_CMD_POWER_OFF
 I would just map both to the same thing...

 CLD_REBOOT_CMD_RESTART2 (what about the cmd

[Devel] [RFD] reboot / shutdown of a container

2011-01-13 Thread Daniel Lezcano


Hi all,

in the container implementation, we are facing the problem of a process 
calling the sys_reboot syscall, which of course makes the host 
power off or reboot.

If we drop the CAP_SYS_BOOT capability, sys_reboot fails and the 
container reaches a shutdown state, but the init process stays there; hence 
the container becomes stuck, waiting indefinitely for process '1' to exit.

The current way to make shutdown / reboot of the container work is to 
watch, from a process outside of the container, the 
rootfs/var/run/utmp file and check the runlevel each time the file 
changes. When the 'reboot' or 'shutdown' level is detected, we wait until 
a single process remains in the container and then we kill it.
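
As an illustration of the runlevel check described above, here is a minimal sketch of what the outside watcher could do each time the utmp file changes. It assumes a sysvinit-style utmp where the RUN_LVL record keeps the current runlevel in the low byte of ut_pid; the function names are hypothetical and this is not the actual LXC code.

```c
#include <stdio.h>
#include <utmp.h>

/*
 * Scan a utmp file and return the current runlevel character found in
 * the last RUN_LVL record, or -1 if the file cannot be read or holds no
 * such record.
 */
static int utmp_current_runlevel(const char *utmp_path)
{
    struct utmp entry;
    int runlevel = -1;
    FILE *f = fopen(utmp_path, "r");

    if (!f)
        return -1;

    while (fread(&entry, sizeof(entry), 1, f) == 1) {
        /* Assumption: sysvinit encoding, current runlevel in the low byte. */
        if (entry.ut_type == RUN_LVL)
            runlevel = entry.ut_pid % 256;
    }

    fclose(f);
    return runlevel;
}

/* Map a runlevel to what the watcher should do with the container. */
static const char *runlevel_action(int runlevel)
{
    switch (runlevel) {
    case '0': return "shutdown";
    case '6': return "reboot";
    default:  return "running";
    }
}
```

The watcher would call utmp_current_runlevel() on the container's rootfs/var/run/utmp whenever a file notification fires, which is exactly the per-container cost discussed next.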

That works, but it is not efficient with a large number of 
containers, as we have to watch a lot of utmp files. In addition, 
the /var/run directory must *not* be mounted as tmpfs in the distro. 
Unfortunately, that is the default setup on most distros and it tends 
to become general. This implies that the rootfs init scripts must be 
modified for the container when we put its rootfs in place and, as 
/var/run is supposed to be a tmpfs, most applications do not clean up 
the directory, so we need to add extra services to wipe out the files.

More problems arise when we upgrade the distro inside the 
container, because all the setup we made at creation time is lost. 
The upgrade overwrites the scripts, the fstab and so on.

We did what was possible to solve the problem from userspace, but we 
always reach a limit because there are different implementations of the 
'init' process, and the init scripts differ from one distro to another 
and between versions of the same distro.

We think this problem can only be solved from the kernel.

The idea is to send a SIGPWR signal to the parent of pid '1' of the 
pid namespace when sys_reboot is called. Of course that won't occur 
for the init pid namespace.

Does it make sense ?

Any idea is very welcome :)

   -- Daniel




[Devel] Re: [RFD] reboot / shutdown of a container

2011-01-13 Thread Daniel Lezcano

On 01/13/2011 10:50 PM, Bruno Prémont wrote:
 On Thu, 13 January 2011 Daniel Lezcano <daniel.lezc...@free.fr> wrote:

 On 01/13/2011 09:09 PM, Bruno Prémont wrote:
 On Thu, 13 January 2011 Daniel Lezcano <daniel.lezc...@free.fr> wrote:
 in the container implementation, we are facing the problem of a process
 calling the sys_reboot syscall which of course makes the host to
 poweroff/reboot.

 If we drop the cap_sys_reboot capability, sys_reboot fails and the
 container reach a shutdown state but the init process stay there, hence
 the container becomes stuck waiting indefinitely the process '1' to exit.

 The current implementation to make the shutdown / reboot of the
 container to work is we watch, from a process outside of the container,
 the rootfs/var/run/utmp file and check the runlevel each time the file
 changes. When the 'reboot' or 'shutdown' level is detected, we wait for
 a single remaining process in the container and then we kill it.

 That works but this is not efficient in case of a large number of
 containers as we will have to watch a lot of utmp files. In addition,
 the /var/run directory must *not* mounted as tmpfs in the distro.
 Unfortunately, it is the default setup on most of the distros and tends
 to generalize. That implies, the rootfs init's scripts must be modified
 for the container when we put in place its rootfs and as /var/run is
 supposed to be a tmpfs, most of the applications do not cleanup the
 directory, so we need to add extra services to wipeout the files.

 More problems arise when we do an upgrade of the distro inside the
 container, because all the setup we made at creation time will be lost.
 The upgrade overwrite the scripts, the fstab and so on.

 We did what was possible to solve the problem from userspace but we
 reach always a limit because there are different implementations of the
 'init' process and the init's scripts differ from a distro to another
 and the same with the versions.

 We think this problem can only be solved from the kernel.

 The idea was to send a signal SIGPWR to the parent of the pid '1' of the
 pid namespace when the sys_reboot is called. Of course that won't occur
 for the init pid namespace.
 Wouldn't sending SIGKILL to the pid '1' process of the originating PID
 namespace be sufficient (that would trigger a SIGCHLD for the parent
 process in the outer PID namespace.
 This is already the case. The question is : when do we send this signal ?
 We have to wait for the container system shutdown before killing it.
 I meant that sys_reboot() would kill the namespace's init if it's not
 called from boot namespace.

 See below

 (as far as I remember the PID namespace is killed when its 'init' exits,
 if this is not the case all other processes in the given namespace would
 have to be killed as well)
 Yes, absolutely but this is not the point, reaping the container is not
 a problem.

 What we are trying to achieve is to shutdown properly the container from
 inside (from outside will be possible too with the setns syscall).

 Assuming the process '1234' creates a new process in a new namespace set
 and wait for it.

 The new process '1' will exec /sbin/init and the system will boot up.
 But, when the system is shutdown or rebooted, after the down scripts are
 executed the kill -15 -1 will be invoked, killing all the processes
 except the process '1' and the caller. This one will then call
 'sys_reboot' and exit. Hence we still have the init process idle and its
 parent '1234' waiting for it to die.
 This call to sys_reboot() would kill new process '1' instead of trying to
 operate on the HW box.
 This also has the advantage that a container would not require an informed
 parent monitoring it from outside (though it would not be restarted even if
 requested without such informed outside parent).

Oh, ok. Sorry I misunderstood.

Yes, that could be better than crossing the namespace boundaries.

 If we are able to receive the information in the process '1234' : the
 sys_reboot was called in the child pid namespace, we can take then kill
 our child pid.  If this information is raised via a signal sent by the
 kernel with the proper information in the siginfo_t (eg. si_code
 contains LINUX_REBOOT_CMD_RESTART, LINUX_REBOOT_CMD_HALT, ... ), the
 solution will be generic for all the shutdown/reboot of any kind of
 container and init version.
 Could this be passed for a SIGCHLD? (when namespace is reaped, and received
 by 1234 from above example assuming sys_reboot() kills the new process '1')

Yes, that sounds like a good idea.

 Looks like yes, but with the need to define new values for si_code (reusing
 LINUX_REBOOT_CMD_* would certainly clash, no matter which signal is chosen).

CLD_REBOOT_CMD_RESTART
CLD_REBOOT_CMD_HALT
CLD_REBOOT_CMD_POWER_OFF
CLD_REBOOT_CMD_RESTART2 (what about the cmd buffer, shall we ignore it ?)
CLD_REBOOT_CMD_KEXEC (?)
CLD_REBOOT_CMD_SW_SUSPEND (useful for the future checkpoint/restart)

LINUX_REBOOT_CMD_CAD_ON and LINUX_REBOOT_CMD_CAD_OFF

[Devel] Re: [RFD] reboot / shutdown of a container

2011-01-13 Thread Daniel Lezcano

On 01/13/2011 09:09 PM, Bruno Prémont wrote:
 On Thu, 13 January 2011 Daniel Lezcano <daniel.lezc...@free.fr> wrote:
 in the container implementation, we are facing the problem of a process
 calling the sys_reboot syscall which of course makes the host to
 poweroff/reboot.

 If we drop the cap_sys_reboot capability, sys_reboot fails and the
 container reach a shutdown state but the init process stay there, hence
 the container becomes stuck waiting indefinitely the process '1' to exit.

 The current implementation to make the shutdown / reboot of the
 container to work is we watch, from a process outside of the container,
 the rootfs/var/run/utmp file and check the runlevel each time the file
 changes. When the 'reboot' or 'shutdown' level is detected, we wait for
 a single remaining process in the container and then we kill it.

 That works but this is not efficient in case of a large number of
 containers as we will have to watch a lot of utmp files. In addition,
 the /var/run directory must *not* mounted as tmpfs in the distro.
 Unfortunately, it is the default setup on most of the distros and tends
 to generalize. That implies, the rootfs init's scripts must be modified
 for the container when we put in place its rootfs and as /var/run is
 supposed to be a tmpfs, most of the applications do not cleanup the
 directory, so we need to add extra services to wipeout the files.

 More problems arise when we do an upgrade of the distro inside the
 container, because all the setup we made at creation time will be lost.
 The upgrade overwrite the scripts, the fstab and so on.

 We did what was possible to solve the problem from userspace but we
 reach always a limit because there are different implementations of the
 'init' process and the init's scripts differ from a distro to another
 and the same with the versions.

 We think this problem can only be solved from the kernel.

 The idea was to send a signal SIGPWR to the parent of the pid '1' of the
 pid namespace when the sys_reboot is called. Of course that won't occur
 for the init pid namespace.
 Wouldn't sending SIGKILL to the pid '1' process of the originating PID
 namespace be sufficient (that would trigger a SIGCHLD for the parent
 process in the outer PID namespace.

This is already the case. The question is : when do we send this signal ?
We have to wait for the container system shutdown before killing it.

 (as far as I remember the PID namespace is killed when its 'init' exits,
 if this is not the case all other processes in the given namespace would
 have to be killed as well)

Yes, absolutely but this is not the point, reaping the container is not 
a problem.

What we are trying to achieve is to shut the container down properly from 
inside (from outside will be possible too with the setns syscall).

Assume the process '1234' creates a new process in a new set of 
namespaces and waits for it.

The new process '1' will exec /sbin/init and the system will boot up. 
But when the system is shut down or rebooted, after the down scripts are 
executed, kill -15 -1 will be invoked, killing all the processes 
except the process '1' and the caller. The caller will then call 
'sys_reboot' and exit. Hence we still have the init process idle and its 
parent '1234' waiting for it to die.

If the process '1234' can be told that sys_reboot was called in the 
child pid namespace, it can then kill its child. If this information 
is raised via a signal sent by the kernel with the proper information 
in the siginfo_t (e.g. si_code contains LINUX_REBOOT_CMD_RESTART, 
LINUX_REBOOT_CMD_HALT, ...), the solution will be generic for the 
shutdown/reboot of any kind of container and init version.

 Only issue is how to differentiate the various reboot() modes (restart,
 power-off/halt) from outside, though that one also exists with the SIGPWR
 signal.


[Devel] Re: Two newbie questions on containers

2011-01-12 Thread Daniel Lezcano

On 01/12/2011 01:02 AM, Nathan Lynch wrote:
 Hi Timur,

 On Tue, 2011-01-11 at 16:08 -0600, Timur Tabi wrote:
 I'm in the process of learning about Linux containers, including cgroups, and
 the learning curve seems pretty steep to me.  So I have a couple newbie
 questions for all of you.  Any detailed answers are greatly appreciated.

 1) For the PowerPC architecture, is there anything that is missing?  I can't
 really tell how much of cgroups and lxc is architecture-specific, and there
 appears to be PowerPC support for both already.  I'd like to know if this is
 another one of those areas, like KVM, where x86 is fully implemented and
 PowerPC support is lagging.
 cgroups and the lxc utilities work just as well on powerpc as they do on
 x86; there's nothing arch-specific about them.

On PowerPC, the physical network driver (at the hypervisor level) will 
prevent different LPARs from communicating on the same blade if the 
network is virtualized.

 2) Given a random device driver, like a driver for a serial port, is there an
 opportunity for the driver to be enhanced to support cgroups or lxc?  Doing a
 simple search of the kernel source code, I don't really see any drivers making
 calls into any cgroup code, so I don't understand how to restrict device access
 to a specific container or cgroup.
 Not totally sure about this myself, but have you read
 Documentation/cgroups/devices.txt in the kernel source yet?



[Devel] Re: Containers and /proc/sys/vm/drop_caches

2011-01-05 Thread Daniel Lezcano

On 01/05/2011 10:40 AM, Mike Hommey wrote:
 [Copy/pasted from a previous message to lkml, where it was suggested to
   try contain...@]

 Hi,

 I noticed that from within a lxc container, writing 3 to
 /proc/sys/vm/drop_caches would flush the host page cache. That sounds a
 little dangerous for VPS offerings that would be based on lxc, as in one
 VPS instance root user could impact the overall performance of the host.
 I don't know about other containers but I've been told openvz isn't
 subject to this problem.
 I only tested the current Debian Squeeze kernel, which is based on
 2.6.32.27.

There is definitely a lot of work to do on /proc.

Some files should not be accessible (/proc/sys/vm/drop_caches, 
/proc/sys/kernel/sysrq, ...) and some others should be virtualized 
(/proc/meminfo, /proc/cpuinfo, ...).

Serge suggested creating something similar to the cgroup device 
whitelist but for /proc; it may be a good approach for denying access 
to specific /proc files.
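
A rough sketch of what such a policy could look like, written as a 
userspace model only. Nothing like this exists in the kernel discussed 
here; the table format, the verdict names and the helper proc_policy() 
are all hypothetical, and the entries are just the examples given above.

```python
import fnmatch

# Hypothetical policy table, in the spirit of the cgroup device whitelist.
# The first matching pattern wins.
PROC_POLICY = [
    ("/proc/sys/vm/drop_caches", "deny"),
    ("/proc/sys/kernel/sysrq", "deny"),
    ("/proc/meminfo", "virtualize"),
    ("/proc/cpuinfo", "virtualize"),
]


def proc_policy(path, default="allow"):
    """Return the verdict for a /proc path as seen from a container."""
    for pattern, verdict in PROC_POLICY:
        if fnmatch.fnmatchcase(path, pattern):
            return verdict
    return default
```

Using patterns rather than exact paths would let a single entry cover a 
whole subtree, e.g. "/proc/sys/vm/*".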

[Devel] Re: Mapping PIDs from parent-child namespaces

2011-01-04 Thread Daniel Lezcano

On 01/04/2011 12:02 AM, Mike Heffner wrote:
 Hi,

 Is it possible for a process running in a parent PID namespace to map
 the PID of a process running in a child's namespace from the
 parent-child's namespace? For example, if I span the process myproc
 with CLONE_NEWPID then a call to getpid() inside myproc will return 1
 whereas in the parent's namespace that process could actually be PID
 23495. I'd like to be able to know that 23495 maps to 1 in the new NS.
 Obviously, just mapping the first PID is straightforward since I can
 just look at the result of clone(). However, mapping the PIDs of
 processes subsequently forked from myproc -- in this example -- I
 haven't been able to figure out.

AFAIK, it is not possible.

It would be very nice to show the pid <-> vpid association.

procfs is a good candidate for showing this information.

It would make sense for /proc/<pid>/status to show the pids relative to 
the namespace the file is read from.

Let me give an example:

Assuming the process '1234' creates a new pid namespace, and the child 
which is '1' in the new namespace has the real pid '4321'. This one 
mounts its /proc.

If the process '1234' looks at /proc/4321/root/proc/1/status, it sees:

...
Tgid:   1
Pid:    1
PPid:   0
...


It could be:

...
Tgid:   4321
Pid:    4321
PPid:   1234
...

as the file is inspected from the parent namespace. Of course, if the 
file is looked at from the child namespace context, we will see '1', '1' 
and '0'.
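
For illustration, a small sketch of how a tool could read these three 
fields from the text of a status file. The helper parse_status_ids() is 
hypothetical, and the second sample below is the proposed output, not 
what the kernel shows today.

```python
def parse_status_ids(text):
    """Extract Tgid, Pid and PPid from the text of a /proc/<pid>/status file."""
    wanted = {"Tgid", "Pid", "PPid"}
    ids = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in wanted:
            ids[key] = int(value.split()[0])
    return ids


# What the child namespace sees for its init process.
seen_from_child = parse_status_ids("Tgid:\t1\nPid:\t1\nPPid:\t0\n")

# What the parent namespace would see under the proposal above.
proposed_from_parent = parse_status_ids("Tgid:\t4321\nPid:\t4321\nPPid:\t1234\n")
```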

I suppose the kernel patch should also be very small.

Thoughts ?

Thanks.
   -- Daniel

[Devel] Re: Mapping PIDs from parent-child namespaces

2011-01-04 Thread Daniel Lezcano

On 01/04/2011 08:57 PM, Mike Heffner wrote:
 On 01/04/2011 11:04 AM, Daniel Lezcano wrote:
 On 01/04/2011 12:02 AM, Mike Heffner wrote:
 Hi,

 Is it possible for a process running in a parent PID namespace to map
 the PID of a process running in a child's namespace from the
 parent-child's namespace? For example, if I span the process myproc
 with CLONE_NEWPID then a call to getpid() inside myproc will return 1
 whereas in the parent's namespace that process could actually be PID
 23495. I'd like to be able to know that 23495 maps to 1 in the new 
 NS.
 Obviously, just mapping the first PID is straightforward since I can
 just look at the result of clone(). However, mapping the PIDs of
 processes subsequently forked from myproc -- in this example -- I
 haven't been able to figure out.

 AFAIK, it is not possible.

 That would be very nice to show the pid-  vpid association.

 The procfs is a good candidate to show these informations.

 That would makes sense to show the content of /proc/pid/status with
 the pid relatively to the namespace.

 Let me give an example:

 Assuming the process '1234' creates a new pid namespace, and the child
 which is '1' in the new namespace has the real pid '4321'. This one
 mounts its /proc.

 If the process '1234' looks at /proc/4321/root/proc/1/status, it sees:

 ...
 Tgid:   1
 Pid:    1
 PPid:   0
 ...


 It could be:

 ...
 Tgid:   4321
 Pid:    4321
 PPid:   1234
 ...

 as the file is inspected from the parent namespace. Of course, if the
 file is looked from the child namespace context, we will see '1', '1'
 and '0'.

 I suppose the patch in the kernel should very small also.

 Thoughts ?

 Would that mean that finding the pid-vpid association for a real PID 
 X requires checking all files /proc/X/root/proc/Y/status where Y 
 is all vpids until you find the one where Pid == X? It would be nice 
 to have a way to check a single file for the association where 
 vpid is not known beforehand -- unless I'm misunderstanding your 
 solution.

Hmm, right. But how do you know that a pid belongs to a specific pid 
namespace? I mean, a single process can create several pid 
namespaces. So while looking at /proc/<pid>/status, you could see 
the same vpid several times, no?
I am not sure what kind of information you want to collect, but it is not 
really a problem to build an association table from userspace by 
browsing /proc/<pid>/root/proc/<vpid> and reading the corresponding pid 
from the 'status' file.
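
As a sketch of that userspace approach: the helper below walks the 
container's /proc as seen from the parent (for example 
/proc/4321/root/proc) and builds a vpid -> pid table. It assumes the 
behaviour proposed earlier in this thread, i.e. that the Pid field read 
from the parent namespace is the real pid; the function name 
build_pid_table() is hypothetical.

```python
import os


def build_pid_table(container_proc):
    """Map vpid -> pid by reading <container_proc>/<vpid>/status."""
    table = {}
    for entry in os.listdir(container_proc):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'sys' or 'meminfo'
        try:
            with open(os.path.join(container_proc, entry, "status")) as f:
                for line in f:
                    if line.startswith("Pid:"):
                        table[int(entry)] = int(line.split()[1])
                        break
        except OSError:
            continue  # the task exited while we were scanning
    return table
```

The cost is one file read per task in the container, which is the 
linear scan Mike mentions above.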

Do you have an example of a pid <-> vpid association obtained without 
looking at more information from /proc?


[Devel] Re: regular lxc development call?

2010-12-02 Thread Daniel Lezcano

On 11/30/2010 04:06 AM, Serge E. Hallyn wrote:
 Quoting Daniel Lezcano (daniel.lezc...@free.fr):

 On 11/29/2010 03:53 PM, Serge E. Hallyn wrote:

 Hi,

 at UDS-N we had a session on 'fine-tuning containers'. The focus was
 things we can do in the next few months to improve containers. The
 meeting proceedings can be found at
 https://wiki.ubuntu.com/UDSProceedings/N/CloudInfrastructure#Make%20LXC%20ready%20for%20production

 We have a few work items written down at
 https://blueprints.edge.launchpad.net/ubuntu/+spec/cloud-server-n-containers-finetune
 The list is flexible fwiw, but we thought it might help to have a regular
 call, perhaps every other week, to discuss work items, their design,
 and their progress. For some features like reboot/shutdown, I think
 design still needs discussion. For other things, it's more important
 that we just discuss who's doing what and what's been done.

 Is there interest in having such a call?

 Yep, IMO it is a good idea.

 I suspect most of the containers work now is purely volunteer driven,
 so a free venue seems worthwhile. Should we do this over skype? IRC?
 Does someone want to set up a conference number?

 I don't have a conf number, if anyone has one that will be great,
 otherwise I am fine with skype or irc.

 Looks like we'll be starting small anyway, so let's just try skype. Anyone
 interested in joining, please send me your skype id.

 What is a good time? I'll just toss thursday at 9:30am US Central time
 (15:30 UTC) out there.

Ok for me.

Do we begin on January 6th?


[Devel] Re: [Lxc-users] regular lxc development call?

2010-12-02 Thread Daniel Lezcano

On 12/02/2010 03:21 PM, Serge E. Hallyn wrote:
 Quoting Daniel Lezcano (daniel.lezc...@free.fr):

 On 11/30/2010 04:06 AM, Serge E. Hallyn wrote:
  
 Quoting Daniel Lezcano (daniel.lezc...@free.fr):
 Looks like we'll be starting small anyway, so let's just try skype.  Anyone
 interested in joining, please send me your skype id.

 What is a good time?  I'll just toss thursday at 9:30am US Central time
 (15:30 UTC) out there.


 Ok for me.

 Do we begin January, 6th ?
  
 I'm feeling like time is passing us by far too quickly.  I realize today is
 thursday, and really I wouldn't mind a first call today just to get everyone
 a sense of what everyone else is working on.  Otherwise, can we start next
 week?  Or is december just a wash?  :(


Ok for next week.

Do you want me to create a google calendar event ?

   -- Daniel

[Devel] Re: regular lxc development call?

2010-11-29 Thread Daniel Lezcano

On 11/29/2010 03:53 PM, Serge E. Hallyn wrote:
 Hi,

 at UDS-N we had a session on 'fine-tuning containers'.  The focus was
 things we can do in the next few months to improve containers.  The
 meeting proceedings can be found at
 https://wiki.ubuntu.com/UDSProceedings/N/CloudInfrastructure#Make%20LXC%20ready%20for%20production

 We have a few work items written down at
 https://blueprints.edge.launchpad.net/ubuntu/+spec/cloud-server-n-containers-finetune
 The list is flexible fwiw, but we thought it might help to have a regular
 call, perhaps every other week, to discuss work items, their design,
 and their progress.  For some features like reboot/shutdown, I think
 design still needs discussion.  For other things, it's more important
 that we just discuss who's doing what and what's been done.

 Is there interest in having such a call?


Yep, IMO it is a good idea.

 I suspect most of the containers work now is purely volunteer driven,
 so a free venue seems worthwhile.  Should we do this over skype?  IRC?
 Does someone want to set up a conference number?


I don't have a conf number, if anyone has one that will be great, 
otherwise I am fine with skype or irc.

Thanks

   -- Daniel

[Devel] Re: [PATCH 1/4] Kconfig: make namespace a submenu

2010-10-20 Thread Daniel Lezcano

On 10/13/2010 11:28 AM, Daniel Lezcano wrote:
 Make the namespaces config option a submenu.

 Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
 ---


Hi Andrew,

do you plan to take this patchset ?

Thanks
   -- Daniel



[Devel] [PATCH 2/4] Kconfig: remove pointless cgroup dependency

2010-10-13 Thread Daniel Lezcano

The different cgroup subsystems are under the cgroup submenu, which is
already enclosed in 'if CGROUPS'. An explicit dependency on CGROUPS in
each subsystem is therefore pointless.

Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
---
 init/Kconfig |   14 --
 1 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 14c84e7..335ce89 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -511,7 +511,6 @@ if CGROUPS
 
 config CGROUP_DEBUG
    bool "Example debug cgroup subsystem"
-   depends on CGROUPS
    default n
    help
      This option enables a simple cgroup subsystem that
@@ -522,7 +521,6 @@ config CGROUP_DEBUG
 
 config CGROUP_NS
    bool "Namespace cgroup subsystem"
-   depends on CGROUPS
    help
      Provides a simple namespace cgroup subsystem to
      provide hierarchical naming of sets of namespaces,
@@ -531,21 +529,19 @@ config CGROUP_NS
 
 config CGROUP_FREEZER
    bool "Freezer cgroup subsystem"
-   depends on CGROUPS
    help
      Provides a way to freeze and unfreeze all tasks in a
      cgroup.
 
 config CGROUP_DEVICE
    bool "Device controller for cgroups"
-   depends on CGROUPS && EXPERIMENTAL
+   depends on EXPERIMENTAL
    help
      Provides a cgroup implementing whitelists for devices which
      a process in the cgroup can mknod or open.
 
 config CPUSETS
    bool "Cpuset support"
-   depends on CGROUPS
    help
      This option will let you create and manage CPUSETs which
      allow dynamically partitioning a system into sets of CPUs and
@@ -561,7 +557,6 @@ config PROC_PID_CPUSET
 
 config CGROUP_CPUACCT
    bool "Simple CPU accounting cgroup subsystem"
-   depends on CGROUPS
    help
      Provides a simple Resource Controller for monitoring the
      total CPU consumed by the tasks in a cgroup.
@@ -571,11 +566,10 @@ config RESOURCE_COUNTERS
    help
      This option enables controller independent resource accounting
      infrastructure that works with cgroups.
-   depends on CGROUPS
 
 config CGROUP_MEM_RES_CTLR
    bool "Memory Resource Controller for Control Groups"
-   depends on CGROUPS && RESOURCE_COUNTERS
+   depends on RESOURCE_COUNTERS
    select MM_OWNER
    help
      Provides a memory resource controller that manages both anonymous
@@ -616,7 +610,7 @@ config CGROUP_MEM_RES_CTLR_SWAP
 
 menuconfig CGROUP_SCHED
    bool "Group CPU scheduler"
-   depends on EXPERIMENTAL && CGROUPS
+   depends on EXPERIMENTAL
    default n
    help
      This feature lets CPU scheduler recognize task groups and control CPU
@@ -645,7 +639,7 @@ endif #CGROUP_SCHED
 
 config BLK_CGROUP
    tristate "Block IO controller"
-   depends on CGROUPS && BLOCK
+   depends on BLOCK
    default n
    ---help---
    Generic block IO controller cgroup interface. This is the common
-- 
1.7.0.4


[Devel] [PATCH 1/4] Kconfig: make namespace a submenu

2010-10-13 Thread Daniel Lezcano

Make the namespaces config option a submenu.

Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
---
 init/Kconfig |   14 --
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index a7fe61e..14c84e7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -727,7 +727,7 @@ config RELAY
 
      If unsure, say N.
 
-config NAMESPACES
+menuconfig NAMESPACES
    bool "Namespaces support" if EMBEDDED
    default !EMBEDDED
    help
@@ -736,9 +736,10 @@ config NAMESPACES
      or same user id or pid may refer to different tasks when used in
      different namespaces.
 
+if NAMESPACES
+
 config UTS_NS
    bool "UTS namespace"
-   depends on NAMESPACES
    default y
    help
      In this namespace tasks see different info provided with the
@@ -746,7 +747,7 @@ config UTS_NS
 
 config IPC_NS
    bool "IPC namespace"
-   depends on NAMESPACES && (SYSVIPC || POSIX_MQUEUE)
+   depends on (SYSVIPC || POSIX_MQUEUE)
    default y
    help
      In this namespace tasks work with IPC ids which correspond to
@@ -754,7 +755,7 @@ config IPC_NS
 
 config USER_NS
    bool "User namespace (EXPERIMENTAL)"
-   depends on NAMESPACES && EXPERIMENTAL
+   depends on EXPERIMENTAL
    default y
    help
      This allows containers, i.e. vservers, to use user namespaces
@@ -763,7 +764,6 @@ config USER_NS
 
 config PID_NS
    bool "PID Namespaces"
-   depends on NAMESPACES
    default y
    help
      Support process id namespaces.  This allows having multiple
@@ -772,12 +772,14 @@ config PID_NS
 
 config NET_NS
    bool "Network namespace"
-   depends on NAMESPACES && NET
+   depends on NET
    default y
    help
      Allow user space to create what appear to be multiple instances
      of the network stack.
 
+endif # NAMESPACES
+
 config BLK_DEV_INITRD
    bool "Initial RAM filesystem and RAM disk (initramfs/initrd) support"
    depends on BROKEN || !FRV
-- 
1.7.0.4


[Devel] [PATCH 3/4] Kconfig: remove the cgroup device whitelist experimental tag

2010-10-13 Thread Daniel Lezcano

This subsystem was merged a long time ago; I think we can
consider it mature enough.

Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
---
 init/Kconfig |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 335ce89..806f7d9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -535,7 +535,6 @@ config CGROUP_FREEZER
 
 config CGROUP_DEVICE
    bool "Device controller for cgroups"
-   depends on EXPERIMENTAL
    help
      Provides a cgroup implementing whitelists for devices which
      a process in the cgroup can mknod or open.
-- 
1.7.0.4


[Devel] [PATCH 4/4] Kconfig: move namespace menu location after the cgroup

2010-10-13 Thread Daniel Lezcano

The namespaces are now a menuconfig, like the cgroup.
Cgroups and namespaces are the two basic building blocks of containers.

It is more logical to put the namespace menu right after the cgroup menu.

Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
---
 init/Kconfig |  104 +-
 1 files changed, 52 insertions(+), 52 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 806f7d9..3445d11 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -668,58 +668,6 @@ config DEBUG_BLK_CGROUP
 
 endif # CGROUPS
 
-config MM_OWNER
-   bool
-
-config SYSFS_DEPRECATED
-   bool "enable deprecated sysfs features to support old userspace tools"
-   depends on SYSFS
-   default n
-   help
- This option adds code that switches the layout of the block class
- devices, to not show up in /sys/class/block/, but only in
- /sys/block/.
-
- This switch is only active when the sysfs.deprecated=1 boot option is
- passed or the SYSFS_DEPRECATED_V2 option is set.
-
- This option allows new kernels to run on old distributions and tools,
- which might get confused by /sys/class/block/. Since 2007/2008 all
- major distributions and tools handle this just fine.
-
- Recent distributions and userspace tools after 2009/2010 depend on
- the existence of /sys/class/block/, and will not work with this
- option enabled.
-
- Only if you are using a new kernel on an old distribution, you might
- need to say Y here.
-
-config SYSFS_DEPRECATED_V2
-   bool "enabled deprecated sysfs features by default"
-   default n
-   depends on SYSFS
-   depends on SYSFS_DEPRECATED
-   help
- Enable deprecated sysfs by default.
-
- See the CONFIG_SYSFS_DEPRECATED option for more details about this
- option.
-
- Only if you are using a new kernel on an old distribution, you might
- need to say Y here. Even then, odds are you would not need it
- enabled, you can always pass the boot option if absolutely necessary.
-
-config RELAY
-   bool "Kernel->user space relay support (formerly relayfs)"
-   help
- This option enables support for relay interface support in
- certain file systems (such as debugfs).
- It is designed to provide an efficient mechanism for tools and
- facilities to relay large amounts of data from kernel space to
- user space.
-
- If unsure, say N.
-
 menuconfig NAMESPACES
bool "Namespaces support" if EMBEDDED
default !EMBEDDED
@@ -773,6 +721,58 @@ config NET_NS
 
 endif # NAMESPACES
 
+config MM_OWNER
+   bool
+
+config SYSFS_DEPRECATED
+   bool "enable deprecated sysfs features to support old userspace tools"
+   depends on SYSFS
+   default n
+   help
+ This option adds code that switches the layout of the block class
+ devices, to not show up in /sys/class/block/, but only in
+ /sys/block/.
+
+ This switch is only active when the sysfs.deprecated=1 boot option is
+ passed or the SYSFS_DEPRECATED_V2 option is set.
+
+ This option allows new kernels to run on old distributions and tools,
+ which might get confused by /sys/class/block/. Since 2007/2008 all
+ major distributions and tools handle this just fine.
+
+ Recent distributions and userspace tools after 2009/2010 depend on
+ the existence of /sys/class/block/, and will not work with this
+ option enabled.
+
+ Only if you are using a new kernel on an old distribution, you might
+ need to say Y here.
+
+config SYSFS_DEPRECATED_V2
+   bool "enabled deprecated sysfs features by default"
+   default n
+   depends on SYSFS
+   depends on SYSFS_DEPRECATED
+   help
+ Enable deprecated sysfs by default.
+
+ See the CONFIG_SYSFS_DEPRECATED option for more details about this
+ option.
+
+ Only if you are using a new kernel on an old distribution, you might
+ need to say Y here. Even then, odds are you would not need it
+ enabled, you can always pass the boot option if absolutely necessary.
+
+config RELAY
+   bool "Kernel->user space relay support (formerly relayfs)"
+   help
+ This option enables support for relay interface support in
+ certain file systems (such as debugfs).
+ It is designed to provide an efficient mechanism for tools and
+ facilities to relay large amounts of data from kernel space to
+ user space.
+
+ If unsure, say N.
+
 config BLK_DEV_INITRD
bool "Initial RAM filesystem and RAM disk (initramfs/initrd) support"
depends on BROKEN || !FRV
-- 
1.7.0.4


[Devel] Re: [RFC V1] Replace pid_t in autofs4 with struct pid reference.

2010-10-12 Thread Daniel Lezcano

On 10/12/2010 05:47 PM, Serge E. Hallyn wrote:
 Quoting Daniel Lezcano (daniel.lezc...@free.fr):

 I resurrected and refreshed this old patch from
 https://lists.linux-foundation.org/pipermail/containers/2007-February/003726.html

 This patch makes automount work within a container.

 Make autofs4 container-friendly by caching a struct pid reference rather
 than a pid_t and using pid_nr() to retrieve a task's pid_t.

 ChangeLog:

 V1:
  - fixed pgrp option in parse_options
  - used get_task_pid(current, PIDTYPE_PGID) instead of task_pgrp
  - fixed how is passed the 'pgrp' argument autofs4_fill_super
  - fixed bad pid conversion, was pid_vnr not pid_nr in autofs4_wait
 V0:
  - Refreshed against linux-next (added dev-ioctl.c)
  - Fix Eric Biederman's comments - Use find_get_pid() to hold a
reference to oz_pgrp and release while unmounting; separate out
changes to autofs and autofs4.
  - Also rollback my earlier change to autofs_wait_queue (pid and tgid
in the wait queue are just used to write to a userspace daemon's
pipe).
  - Fix Cedric's comments: retain old prototype of parse_options()
and move necessary change to its caller.

 Signed-off-by: Sukadev Bhattiprolu suka...@us.ibm.com
 Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
 Cc: Ian Kent ra...@themaw.net
 Cc: Cedric Le Goater c...@fr.ibm.com
 Cc: Dave Hansen haveb...@us.ibm.com
 Cc: Serge E. Hallyn serge.hal...@canonical.com

 Acked-by: Serge E. Hallyn serge.hal...@canonical.com

 Thanks, Daniel, this looks good!  Thanks for pushing this needed fix.


Thanks for reviewing. I ran some stress tests and there appears to be 
a deadlock between two containers running the automount daemon. I 
will fix it and send a new version.

   -- Daniel

[Devel] Re: [PATCH 2/2] Kconfig : default all the namespaces to 'yes'

2010-10-12 Thread Daniel Lezcano

On 10/12/2010 07:16 PM, Serge E. Hallyn wrote:
 Quoting Matt Helsley (matth...@us.ibm.com):

 On Thu, Oct 07, 2010 at 03:15:33PM +0200, Daniel Lezcano wrote:
  
 As the different namespaces depend on 'CONFIG_NAMESPACES', it is
 logical to enable all the namespaces when we enable NAMESPACES.

 Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr

 Subject of the patch email is a little confusing as it's not
 quite what happens. I'm mostly OK with it but I'm not sure we
 should enable user-ns by default just yet.

 Acked-By: Matt Helsley matth...@us.ibm.com
  
 In fact, perhaps we should keep the experimental tag on user namespaces.


The experimental tag is kept on the user namespace. It defaults 
to yes when NAMESPACES and EXPERIMENTAL are selected.

[Devel] Re: [PATCH 1/2] Kconfig : remove pid_ns and net_ns experimental

2010-10-12 Thread Daniel Lezcano

On 10/12/2010 07:53 PM, Oren Laadan wrote:
 Daniel,

 Maybe you can throw this on in the series as well ?
 http://www.mail-archive.com/linux-...@vger.kernel.org/msg01431.html

 It's a one-liner to move the namespaces options into its own
 sub-menu under 'General Setup'.

Right, that makes sense.

   -- Daniel

[Devel] [PATCH 2/2] Kconfig : default all the namespaces to 'yes'

2010-10-07 Thread Daniel Lezcano

As the different namespaces depend on 'CONFIG_NAMESPACES', it is
logical to enable all the namespaces when we enable NAMESPACES.

Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
---
 init/Kconfig |7 +--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index a52124e..a7fe61e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -739,6 +739,7 @@ config NAMESPACES
 config UTS_NS
    bool "UTS namespace"
    depends on NAMESPACES
+   default y
    help
      In this namespace tasks see different info provided with the
      uname() system call
@@ -746,6 +747,7 @@ config UTS_NS
 config IPC_NS
    bool "IPC namespace"
    depends on NAMESPACES && (SYSVIPC || POSIX_MQUEUE)
+   default y
    help
      In this namespace tasks work with IPC ids which correspond to
      different IPC objects in different namespaces.
@@ -753,6 +755,7 @@ config IPC_NS
 config USER_NS
    bool "User namespace (EXPERIMENTAL)"
    depends on NAMESPACES && EXPERIMENTAL
+   default y
    help
      This allows containers, i.e. vservers, to use user namespaces
      to provide different user info for different servers.
@@ -760,8 +763,8 @@ config USER_NS
 
 config PID_NS
    bool "PID Namespaces"
-   default n
    depends on NAMESPACES
+   default y
    help
      Support process id namespaces.  This allows having multiple
      processes with the same pid as long as they are in different
@@ -769,8 +772,8 @@ config PID_NS
 
 config NET_NS
    bool "Network namespace"
-   default n
    depends on NAMESPACES && NET
+   default y
    help
      Allow user space to create what appear to be multiple instances
      of the network stack.
-- 
1.7.0.4


[Devel] [PATCH 1/2] Kconfig : remove pid_ns and net_ns experimental

2010-10-07 Thread Daniel Lezcano

The pid namespace has been in the kernel since 2.6.27 and the net_ns
since 2.6.29. They are enabled by default in the distros and used by
userspace components. They are mature enough to remove the 'experimental'
label.

Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
---
 init/Kconfig |9 +++--
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index a175935..a52124e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -759,21 +759,18 @@ config USER_NS
      If unsure, say N.
 
 config PID_NS
-   bool "PID Namespaces (EXPERIMENTAL)"
+   bool "PID Namespaces"
    default n
-   depends on NAMESPACES && EXPERIMENTAL
+   depends on NAMESPACES
    help
      Support process id namespaces.  This allows having multiple
      processes with the same pid as long as they are in different
      pid namespaces.  This is a building block of containers.
 
-     Unless you want to work with an experimental feature
-     say N here.
-
 config NET_NS
    bool "Network namespace"
    default n
-   depends on NAMESPACES && EXPERIMENTAL && NET
+   depends on NAMESPACES && NET
    help
      Allow user space to create what appear to be multiple instances
      of the network stack.
-- 
1.7.0.4


[Devel] [RFC V1] Replace pid_t in autofs4 with struct pid reference.

2010-10-05 Thread Daniel Lezcano

I resurrected and refreshed this old patch from
https://lists.linux-foundation.org/pipermail/containers/2007-February/003726.html

This patch makes automount work within a container.

Make autofs4 container-friendly by caching a struct pid reference rather
than a pid_t and using pid_nr() to retrieve a task's pid_t.

ChangeLog:

V1:
- fixed pgrp option in parse_options
- used get_task_pid(current, PIDTYPE_PGID) instead of task_pgrp
- fixed how the 'pgrp' argument is passed to autofs4_fill_super
- fixed bad pid conversion, was pid_vnr not pid_nr in autofs4_wait
V0:
- Refreshed against linux-next (added dev-ioctl.c)
- Fix Eric Biederman's comments - Use find_get_pid() to hold a
  reference to oz_pgrp and release while unmounting; separate out
  changes to autofs and autofs4.
- Also rollback my earlier change to autofs_wait_queue (pid and tgid
  in the wait queue are just used to write to a userspace daemon's
  pipe).
- Fix Cedric's comments: retain old prototype of parse_options()
  and move necessary change to its caller.

Signed-off-by: Sukadev Bhattiprolu suka...@us.ibm.com
Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
Cc: Ian Kent ra...@themaw.net
Cc: Cedric Le Goater c...@fr.ibm.com
Cc: Dave Hansen haveb...@us.ibm.com
Cc: Serge E. Hallyn serge.hal...@canonical.com
Cc: Eric Biederman ebied...@xmission.com
Cc: Helmut Lichtenberg h...@tzv.fal.de
---
 fs/autofs4/autofs_i.h  |   31 +--
 fs/autofs4/dev-ioctl.c |   32 +++-
 fs/autofs4/inode.c |   35 +--
 fs/autofs4/root.c  |3 ++-
 fs/autofs4/waitq.c |4 ++--
 5 files changed, 65 insertions(+), 40 deletions(-)

diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index 3d283ab..bf2f9b1 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -39,25 +39,25 @@
 /* #define DEBUG */
 
 #ifdef DEBUG
-#define DPRINTK(fmt, args...)  \
-do {   \
-   printk(KERN_DEBUG "pid %d: %s: " fmt "\n",  \
-   current->pid, __func__, ##args);\
+#define DPRINTK(fmt, args...)   \
+do {\
+   printk(KERN_DEBUG "pid %d: %s: " fmt "\n",   \
+  pid_nr(task_pid(current)), __func__, ##args); \
 } while (0)
 #else
 #define DPRINTK(fmt, args...) do {} while (0)
 #endif
 
-#define AUTOFS_WARN(fmt, args...)  \
-do {   \
-   printk(KERN_WARNING "pid %d: %s: " fmt "\n",\
-   current->pid, __func__, ##args);\
+#define AUTOFS_WARN(fmt, args...)   \
+do {\
+   printk(KERN_WARNING "pid %d: %s: " fmt "\n", \
+  pid_nr(task_pid(current)), __func__, ##args); \
 } while (0)
 
-#define AUTOFS_ERROR(fmt, args...) \
-do {   \
-   printk(KERN_ERR "pid %d: %s: " fmt "\n",\
-   current->pid, __func__, ##args);\
+#define AUTOFS_ERROR(fmt, args...)  \
+do {\
+   printk(KERN_ERR "pid %d: %s: " fmt "\n", \
+  pid_nr(task_pid(current)), __func__, ##args); \
 } while (0)
 
 /* Unified info structure.  This is pointed to by both the dentry and
@@ -122,7 +122,7 @@ struct autofs_sb_info {
u32 magic;
int pipefd;
struct file *pipe;
-   pid_t oz_pgrp;
+   struct pid *oz_pgrp;
int catatonic;
int version;
int sub_version;
@@ -156,7 +156,10 @@ static inline struct autofs_info *autofs4_dentry_ino(struct dentry *dentry)
filesystem without magic.) */
 
 static inline int autofs4_oz_mode(struct autofs_sb_info *sbi) {
-   return sbi->catatonic || task_pgrp_nr(current) == sbi->oz_pgrp;
+   struct pid *pgrp = get_task_pid(current, PIDTYPE_PGID);
+   bool oz_mode = sbi->catatonic || sbi->oz_pgrp == pgrp;
+   put_pid(pgrp);
+   return oz_mode;
 }
 
 /* Does a dentry have some pending activity? */
diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c
index eff9a41..7db3b73 100644
--- a/fs/autofs4/dev-ioctl.c
+++ b/fs/autofs4/dev-ioctl.c
@@ -360,6 +360,7 @@ static int autofs_dev_ioctl_setpipefd(struct file *fp,
 {
int pipefd;
int err = 0;
+   struct file *pipe;
 
if (param->setpipefd.pipefd == -1)
return -EINVAL;
@@ -368,22 +369,27 @@ static int autofs_dev_ioctl_setpipefd(struct file *fp,
 
mutex_lock(&sbi->wq_mutex);
if (!sbi->catatonic) {
-   mutex_unlock(&sbi->wq_mutex);
-   return -EBUSY;
-   } else {
-   struct file *pipe = fget(pipefd

[Devel] Re: [PATCH 8/8] net: Implement socketat.

2010-10-04 Thread Daniel Lezcano

On 10/03/2010 03:44 PM, jamal wrote:
 Hi Daniel,

 Thanks for clarifying this ..

 On Sat, 2010-10-02 at 23:13 +0200, Daniel Lezcano wrote:

 Just to clarify this point. You enter the namespace, create the socket
 and go back to the initial namespace (or create a new one). Further
 operations can be made against this fd because it is the network
 namespace stored in the sock struct which is used, not the current
 process network namespace which is used at the socket creation only.

 We can actually already do that by unsharing and then create a
 socket.
 This socket will pin the namespace and can be used as a control socket
 for the namespace (assuming the socket domain will be ok for all the
 operations).

 Jamal, I don't know what kind of application you want to use but if I
 assume you want to create a process controlling 1024 netns,
  
 At the moment i am looking at 8K on a Nehalem with lots of RAM. They
 will mostly be created at startup but some could be created afterwards.
 Each will have its own netdevs etc. also created at startup (and some
 other config that may happen later).
 Because startup time may accumulate, it is clearly important to me
 to pick whatever scheme that reduces the number of calls...


8K ! whow ! :)


 let's try to identify what happens with setns and with socketat :

 With setns:

   * open /proc/self/ns/net (1)
   * unshare the netns
   * open /proc/self/ns/net (2)
   * setns (1)
   * create a virtual network device
   * move the virtual device to (2) (using the set netns by fd)
   * unshare the netns
   ...

 With socketat:

   * open a socket (1)
   * unshare the netns
   * open a netlink with socketat(1) =  (2)
   * create a virtual device using (2) (at this point it is
 init_net_ns)
   * move the virtual device to the current netns (using the set
 netns
 by pid)
   * open a socket (3)
   * unshare the netns
   ...

 We have the same number of file descriptors kept opened. Except, with
 setns we can bind mount the directory somewhere, that will pin the
 namespace and then we can close the /proc/self/ns/net file descriptors
 and reopen them later.

  
 Ok, so a wrapper such as: create_socket_on(namespaceid)
 will have generally less system calls with socketat()


Yes, I think so.

 If your application has to do a lot of specific network processing,
 during its life cycle, in different namespaces, the socketat syscall
 will be better because it will reduce the number of syscalls but at
 the cost of keeping the file descriptors opened (potentially a big
 number). Otherwise, setns should fit your needs.
  
 Makes sense.

 One thing still confuses me...
 The app control point is in namespace0. I still want to be able to
 boot namespaces first and maybe a few seconds later do a socketat()...
 and create devices, tcp sockets etc. I suspect create_ns(namespace-name)
 would involve:
   * open /proc/self/ns/net (namespace-name)
   * unshare the netns
 Is this correct?


Maybe I am misunderstanding, but if you are trying to save some syscalls,
you should use socketat only and keep the app control namespace0 socket
for it. The process will be in the last netns you unshared (maybe you can
use one setns syscall here to return to namespace0).

 (1) socketat :
 * pros : 1 syscall to create a socket
 * cons : a file descriptor per namespace, namespace is only
   manageable via a socket

 (2) setns :
 * pros : namespace is fully manageable with a generic code
 * cons : 2 syscalls (or 3 if we want to return to the initial netns)
   to create a socket (setns + socket [ + setns ]), a file descriptor
   per namespace

 (3) setns + bind mount :
 * pros : no file descriptor needs to be kept open
 * cons : longer startup (unshare + mount --bind), 4 syscalls to create
   a socket in the namespace (open, setns, socket, close), (maybe 5
   syscalls if we want to return to the initial netns).
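
To put numbers on the three schemes above, here is a small sketch that only does the arithmetic from the list. The enum and function names are made up for the illustration; the per-socket counts are the ones given above (1 for socketat, 2 for setns, 4 for setns + bind mount, plus one optional setns to return to the initial netns).

```c
#include <assert.h>

/* Hypothetical names, used only for this illustration. */
enum ns_scheme {
	SCHEME_SOCKETAT,         /* (1) socketat() on a per-namespace socket */
	SCHEME_SETNS,            /* (2) setns() on a namespace fd kept open */
	SCHEME_SETNS_BIND_MOUNT  /* (3) setns() on a bind-mounted namespace file */
};

/* Syscalls needed to create one socket in a target namespace. */
static int syscalls_per_socket(enum ns_scheme scheme, int return_to_initial)
{
	int n = 0;

	switch (scheme) {
	case SCHEME_SOCKETAT:
		return 1;	/* socketat only, the caller never moves */
	case SCHEME_SETNS:
		n = 2;		/* setns + socket */
		break;
	case SCHEME_SETNS_BIND_MOUNT:
		n = 4;		/* open + setns + socket + close */
		break;
	}
	/* optional extra setns to go back to the initial netns */
	return n + (return_to_initial ? 1 : 0);
}

/* Total syscalls to create nr_sockets sockets, one per namespace. */
static long total_syscalls(enum ns_scheme scheme, int return_to_initial,
			   long nr_sockets)
{
	assert(nr_sockets >= 0);
	return nr_sockets * syscalls_per_socket(scheme, return_to_initial);
}
```

For the 8K namespaces mentioned earlier, with a return to the initial netns each time, this gives 8192 syscalls for socketat, 24576 for setns and 40960 for setns + bind mount.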

Depending on the scheme you choose, the startup will be:

 (1) socketat :
  * open /proc/self/ns/net (one time to 'save' and pin the
    initial netns)
 and then

 int create_ns(void)
 {
         unshare(CLONE_NEWNET);
         return socket(...);
 }

 and,

 for (i = 0; i < 8192; i++)
         mynsfd[i] = create_ns();

 (2) setns :
  * open /proc/self/ns/net (one time to 'save' and pin the
    initial netns)
 and then

 int create_ns(void)
 {
         unshare(CLONE_NEWNET);
         return open("/proc/self/ns/net", O_RDONLY);
 }

 and,

 for (i = 0; i < 8192; i++)
         mynsfd[i] = create_ns();

 (3) setns + mount :

  * open /proc/self/ns/net (one time to 'save' and pin the 
initial netns)
   and then

 int create_ns(const char *nspath)
 {
unshare

[Devel] Re: [PATCH 8/8] net: Implement socketat.

2010-10-02 Thread Daniel Lezcano

On 09/23/2010 01:53 PM, Pavel Emelyanov wrote:
 On 09/23/2010 03:40 PM, jamal wrote:

 On Thu, 2010-09-23 at 15:33 +0400, Pavel Emelyanov wrote:

  
 This particular usecase is unneeded once you have the enter ability.

 Is that cheaper from a syscall count/cost?
  
 Why does it matter? You told, that the usage scenario was to
 add routes to container. If I do 2 syscalls instead of 1, is
 it THAT worse?


 i.e do I have to enter every time i want to write/read this fd?
  
 No - you enter once, create a socket and do whatever you need
 within the entered namespace.


Just to clarify this point. You enter the namespace, create the socket 
and go back to the initial namespace (or create a new one). Further 
operations can be made against this fd because it is the network 
namespace stored in the sock struct which is used, not the current 
process network namespace which is used at the socket creation only.

We can actually already do that by unsharing and then create a socket. 
This socket will pin the namespace and can be used as a control socket 
for the namespace (assuming the socket domain will be ok for all the 
operations).

Jamal, I don't know what kind of application you want to use, but if I
assume you want to create a process controlling 1024 netns, let's try to
identify what happens with setns and with socketat :

With setns:

 * open /proc/self/ns/net (1)
 * unshare the netns
 * open /proc/self/ns/net (2)
 * setns (1)
 * create a virtual network device
 * move the virtual device to (2) (using the set netns by fd)
 * unshare the netns
 ...

With socketat:

 * open a socket (1)
 * unshare the netns
 * open a netlink with socketat(1) = (2)
 * create a virtual device using (2) (at this point it is init_net_ns)
 * move the virtual device to the current netns (using the set netns 
by pid)
 * open a socket (3)
 * unshare the netns
 ...

We have the same number of file descriptors kept opened. Except, with 
setns we can bind mount the directory somewhere, that will pin the 
namespace and then we can close the /proc/self/ns/net file descriptors 
and reopen them later.

If your application has to do a lot of specific network processing, 
during its life cycle, in different namespaces, the socketat syscall 
will be better because it will reduce the number of syscalls but at the 
cost of keeping the file descriptors opened (potentially a big number). 
Otherwise, setns should fit your needs.
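
As an illustration of the setns path, here is a minimal sketch of "enter, create the socket, go back". It assumes the setns(fd, nstype) interface as it is exposed by current glibc (the syscall was merged after this thread, so the proposed patch may differ in details), and the helper name socket_in_netns is made up. Switching network namespace needs CAP_SYS_ADMIN, so an unprivileged caller simply gets -1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a socket inside the network namespace referred to by ns_path
 * (for example a bind-mounted copy of /proc/<pid>/ns/net), then switch
 * back to the namespace the caller started in.
 * Returns the socket fd, or -1 on error.
 */
static int socket_in_netns(const char *ns_path, int domain, int type,
			   int protocol)
{
	int saved_ns, target_ns, sock = -1;

	assert(ns_path != NULL);

	/* Pin the current netns so that we can come back to it. */
	saved_ns = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
	if (saved_ns < 0)
		return -1;

	target_ns = open(ns_path, O_RDONLY | O_CLOEXEC);
	if (target_ns < 0)
		goto out_saved;

	if (setns(target_ns, CLONE_NEWNET) == 0) {
		/* The socket keeps a reference on the netns it is created in. */
		sock = socket(domain, type, protocol);

		/*
		 * If switching back fails, the process stays in the target
		 * netns; close the socket and report the error.
		 */
		if (setns(saved_ns, CLONE_NEWNET) != 0 && sock >= 0) {
			close(sock);
			sock = -1;
		}
	}
	close(target_ns);
out_saved:
	close(saved_ns);
	return sock;
}
```

This is the open + setns + socket + setns + close sequence, so it trades file descriptors kept open for more syscalls per socket.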



 How does poll/select work in that enter scenario?
  
 Just like it used to before the enter.


 cheers,
 jamal


  



___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel

[Devel] Re: [PATCH] Replace pid_t in autofs4 with struct pid reference.

2010-10-01 Thread Daniel Lezcano

On 10/01/2010 12:36 AM, Serge Hallyn wrote:
 Quoting Daniel Lezcano (daniel.lezc...@free.fr):

 I resurrected and refreshed this old patch from
 https://lists.linux-foundation.org/pipermail/containers/2007-February/003726.html

 This patch makes automount work within a container.

 Make autofs4 container-friendly by caching struct pid reference rather
 than pid_t and using pid_nr() to retrieve a task's pid_t.

 ChangeLog:
  - Refreshed against linux-next (added dev-ioctl.c)
  - Fix Eric Biederman's comments - Use find_get_pid() to hold a
reference to oz_pgrp and release while unmounting; separate out
changes to autofs and autofs4.
  - Also rollback my earlier change to autofs_wait_queue (pid and tgid
in the wait queue are just used to write to a userspace daemon's
pipe).
  - Fix Cedric's comments: retain old prototype of parse_options()
and move necessary change to its caller.

 Signed-off-by: Sukadev Bhattiprolusuka...@us.ibm.com
 Signed-off-by: Daniel Lezcanodaniel.lezc...@free.fr
 Cc: Ian Kentra...@themaw.net
 Cc: Cedric Le Goaterc...@fr.ibm.com
 Cc: Dave Hansenhaveb...@us.ibm.com
 Cc: Serge E. Hallynserge.hal...@canonical.com
 Cc: Eric Biedermanebied...@xmission.com
 Cc: Helmut Lichtenbergh...@tzv.fal.de
 ---
  

[ cut ]

 @@ -133,7 +133,7 @@ static int autofs4_show_options(struct seq_file *m, struct vfsmount *mnt)
  seq_printf(m, ",uid=%u", root_inode->i_uid);
  if (root_inode->i_gid != 0)
  seq_printf(m, ",gid=%u", root_inode->i_gid);
 -seq_printf(m, ",pgrp=%d", sbi->oz_pgrp);
 +seq_printf(m, ",pgrp=%d", pid_nr(sbi->oz_pgrp));
  seq_printf(m, ",timeout=%lu", sbi->exp_timeout/HZ);
  seq_printf(m, ",minproto=%d", sbi->min_proto);
  seq_printf(m, ",maxproto=%d", sbi->max_proto);
 @@ -263,6 +263,7 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)
  int pipefd;
  struct autofs_sb_info *sbi;
  struct autofs_info *ino;
 +pid_t pgid;

  sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
  if (!sbi)
 @@ -275,7 +276,7 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)
  sbi->pipe = NULL;
  sbi->catatonic = 1;
  sbi->exp_timeout = 0;
 -sbi->oz_pgrp = task_pgrp_nr(current);
 +sbi->oz_pgrp = task_pgrp(current);
  sbi->sb = s;
  sbi->version = 0;
  sbi->sub_version = 0;
 @@ -314,7 +315,7 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)

  /* Can this call block? */
  if (parse_options(data, &pipefd, &root_inode->i_uid, &root_inode->i_gid,
 -&sbi->oz_pgrp, &sbi->type, &sbi->min_proto,
 +&pgid, &sbi->type, &sbi->min_proto,
  &sbi->max_proto)) {
  printk("autofs: called with bogus options\n");
  goto fail_dput;
 @@ -342,12 +343,19 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)
  sbi->version = sbi->max_proto;
  sbi->sub_version = AUTOFS_PROTO_SUBVERSION;

 -DPRINTK("pipe fd = %d, pgrp = %u", pipefd, sbi->oz_pgrp);
 +DPRINTK("pipe fd = %d, pgrp = %u", pipefd, pgid);
 +
 +sbi->oz_pgrp = find_get_pid(pgid);
  
 This is a little backward.  You first get current's pgid pid, but don't
 take a reference;  then parse_options gets current's pgid pid_nr (and
 keeps that if no pgid was specified), passes that back here, and here we
 get the pid_nr and take a ref.  I was actually first going to say that
 I didn't want to block this patch on this, but it should be cleaned up
 at some point (i.e. at top of this function get the struct pid and get
 a ref, pass that to parse_options, and have parse_options get the
 specified pgid instead if a valid one was passed in.


I agree, I will cleanup this part.

Also, I noticed the:

 ...
 case Opt_pgrp:
 if (match_int(args, &option))
 return 1;
 *pgrp = option;
 break;
 ...

ouch !


 But now I'm wondering whether this actually is unsafe, because I'm not
 quite sure how to read the comment above task_pgrp() (in sched.h), which
 says not to dereference this if it wasn't obtained under task_lock or
 rcu_read_lock. Which this isn't.  So is this actually unsafe?


Good point.

task_pgrp_nr calls __task_pid_nr_ns which does rcu_read_lock.
task_pgrp does not take any lock.

So you are right, replacing task_pgrp_nr by task_pgrp is unsafe.

I suppose get_task_pid(current, PIDTYPE_PGID) is the right call.
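
To make the difference concrete, here is a userspace toy model (this is not kernel code, and the toy_* names are invented for the illustration). It only mimics the reference counting idea: a pointer obtained without taking a reference can be freed behind our back, while a pointer obtained with a "get" stays valid until the matching "put", which is what caching oz_pgrp in the superblock info needs.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for struct pid: only the reference count matters here. */
struct toy_pid {
	int count;	/* number of holders */
	int nr;		/* numeric id, for display only */
};

/* Allocate a toy pid owned by its creator (count == 1). */
static struct toy_pid *toy_alloc_pid(int nr)
{
	struct toy_pid *pid = malloc(sizeof(*pid));

	if (pid) {
		pid->count = 1;
		pid->nr = nr;
	}
	return pid;
}

/* Take an extra reference, in the spirit of get_pid()/get_task_pid(). */
static struct toy_pid *toy_get_pid(struct toy_pid *pid)
{
	if (pid)
		pid->count++;
	return pid;
}

/*
 * Drop a reference, in the spirit of put_pid().
 * Returns 1 if this was the last reference and the object was freed.
 */
static int toy_put_pid(struct toy_pid *pid)
{
	if (!pid)
		return 0;
	assert(pid->count > 0);
	if (--pid->count == 0) {
		free(pid);
		return 1;
	}
	return 0;
}
```

In this model, a cached pointer taken with toy_get_pid() survives the owner's toy_put_pid(); a pointer merely copied (the task_pgrp() case) would be dangling at that point.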


Thanks for looking at the patch.

   -- Daniel



[Devel] [PATCH] Replace pid_t in autofs4 with struct pid reference.

2010-09-30 Thread Daniel Lezcano

I resurrected and refreshed this old patch from
https://lists.linux-foundation.org/pipermail/containers/2007-February/003726.html

This patch makes automount work within a container.

Make autofs4 container-friendly by caching struct pid reference rather
than pid_t and using pid_nr() to retrieve a task's pid_t.

ChangeLog:
- Refreshed against linux-next (added dev-ioctl.c)
- Fix Eric Biederman's comments - Use find_get_pid() to hold a
  reference to oz_pgrp and release while unmounting; separate out
  changes to autofs and autofs4.
- Also rollback my earlier change to autofs_wait_queue (pid and tgid
  in the wait queue are just used to write to a userspace daemon's
  pipe).
- Fix Cedric's comments: retain old prototype of parse_options()
  and move necessary change to its caller.

Signed-off-by: Sukadev Bhattiprolu suka...@us.ibm.com
Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
Cc: Ian Kent ra...@themaw.net
Cc: Cedric Le Goater c...@fr.ibm.com
Cc: Dave Hansen haveb...@us.ibm.com
Cc: Serge E. Hallyn serge.hal...@canonical.com
Cc: Eric Biederman ebied...@xmission.com
Cc: Helmut Lichtenberg h...@tzv.fal.de
---
 fs/autofs4/autofs_i.h  |   28 ++--
 fs/autofs4/dev-ioctl.c |2 +-
 fs/autofs4/inode.c |   22 --
 fs/autofs4/root.c  |3 ++-
 fs/autofs4/waitq.c |4 ++--
 5 files changed, 35 insertions(+), 24 deletions(-)

diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index 3d283ab..e7298a1 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -39,25 +39,25 @@
 /* #define DEBUG */
 
 #ifdef DEBUG
-#define DPRINTK(fmt, args...)  \
-do {   \
-   printk(KERN_DEBUG "pid %d: %s: " fmt "\n",  \
-   current->pid, __func__, ##args);\
+#define DPRINTK(fmt, args...)   \
+do {\
+   printk(KERN_DEBUG "pid %d: %s: " fmt "\n",   \
+  pid_nr(task_pid(current)), __func__, ##args); \
 } while (0)
 #else
 #define DPRINTK(fmt, args...) do {} while (0)
 #endif
 
-#define AUTOFS_WARN(fmt, args...)  \
-do {   \
-   printk(KERN_WARNING "pid %d: %s: " fmt "\n",\
-   current->pid, __func__, ##args);\
+#define AUTOFS_WARN(fmt, args...)   \
+do {\
+   printk(KERN_WARNING "pid %d: %s: " fmt "\n", \
+  pid_nr(task_pid(current)), __func__, ##args); \
 } while (0)
 
-#define AUTOFS_ERROR(fmt, args...) \
-do {   \
-   printk(KERN_ERR "pid %d: %s: " fmt "\n",\
-   current->pid, __func__, ##args);\
+#define AUTOFS_ERROR(fmt, args...)  \
+do {\
+   printk(KERN_ERR "pid %d: %s: " fmt "\n", \
+  pid_nr(task_pid(current)), __func__, ##args); \
 } while (0)
 
 /* Unified info structure.  This is pointed to by both the dentry and
@@ -122,7 +122,7 @@ struct autofs_sb_info {
u32 magic;
int pipefd;
struct file *pipe;
-   pid_t oz_pgrp;
+   struct pid *oz_pgrp;
int catatonic;
int version;
int sub_version;
@@ -156,7 +156,7 @@ static inline struct autofs_info *autofs4_dentry_ino(struct dentry *dentry)
filesystem without magic.) */
 
 static inline int autofs4_oz_mode(struct autofs_sb_info *sbi) {
-   return sbi->catatonic || task_pgrp_nr(current) == sbi->oz_pgrp;
+   return sbi->catatonic || task_pgrp(current) == sbi->oz_pgrp;
 }
 
 /* Does a dentry have some pending activity? */
diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c
index eff9a41..94a523a 100644
--- a/fs/autofs4/dev-ioctl.c
+++ b/fs/autofs4/dev-ioctl.c
@@ -377,7 +377,7 @@ static int autofs_dev_ioctl_setpipefd(struct file *fp,
fput(pipe);
goto out;
}
-   sbi->oz_pgrp = task_pgrp_nr(current);
+   sbi->oz_pgrp = task_pgrp(current);
sbi->pipefd = pipefd;
sbi->pipe = pipe;
sbi->catatonic = 0;
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index 821b2b9..b36af5a 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -111,7 +111,7 @@ void autofs4_kill_sb(struct super_block *sb)
 
/* Free wait queues, close pipe */
autofs4_catatonic_mode(sbi);
-
+   put_pid(sbi->oz_pgrp);
sb->s_fs_info = NULL;
kfree(sbi);
 
@@ -133,7 +133,7 @@ static int autofs4_show_options(struct seq_file *m, struct vfsmount *mnt)
seq_printf(m, ",uid=%u", root_inode->i_uid);
if (root_inode->i_gid != 0

[Devel] [PATCH] cgroup: notify ns_cgroup deprecated

2010-09-29 Thread Daniel Lezcano

The ns_cgroup will be removed very soon. For this version, let's warn
that ns_cgroup is deprecated.

Make ns_cgroup and clone_children mutually exclusive. If clone_children
is set and the ns_cgroup is mounted, fail with EINVAL when the
ns_cgroup subsys is created (a printk helps the user understand why
the creation fails).

Update the feature removal schedule file with the deprecated ns_cgroup.

Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
---
 Documentation/feature-removal-schedule.txt |   17 +
 kernel/ns_cgroup.c |8 
 2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index b32911b..a39e2f3 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -545,4 +545,21 @@ Why:This is a legacy interface which have been 
replaced by a more
 Who:NeilBrown ne...@suse.de
 
 
+
+What:   namespace cgroup (ns_cgroup)
+When:   2.6.38
+Why:The ns_cgroup leads to some problems:
+   * cgroup creation is out-of-control
+   * cgroup name can conflict when pids are looping
+   * it is not possible to have a single process handling
+   a lot of namespaces without falling in a exponential creation time
+   * we may want to create a namespace without creating a cgroup
+
+   The ns_cgroup is replaced by a compatibility flag 'clone_children',
+   where a newly created cgroup will copy the parent cgroup values.
+   The userspace has to manually create a cgroup and add a task to
+   the 'tasks' file.
+Who:Daniel Lezcano daniel.lezc...@free.fr
+
+

diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
index 2a5dfec..2c98ad9 100644
--- a/kernel/ns_cgroup.c
+++ b/kernel/ns_cgroup.c
@@ -85,6 +85,14 @@ static struct cgroup_subsys_state *ns_create(struct cgroup_subsys *ss,
return ERR_PTR(-EPERM);
if (!cgroup_is_descendant(cgroup, current))
return ERR_PTR(-EPERM);
+   if (test_bit(CGRP_CLONE_CHILDREN, &cgroup->flags)) {
+   printk("ns_cgroup can't be created with parent "
+  "'clone_children' set.\n");
+   return ERR_PTR(-EINVAL);
+   }
+
+   printk_once("ns_cgroup deprecated: consider using the "
+   "'clone_children' flag without the ns_cgroup.\n");
 
ns_cgroup = kzalloc(sizeof(*ns_cgroup), GFP_KERNEL);
if (!ns_cgroup)
-- 
1.7.0.4


[Devel] Re: [PATCH 0/3][V2] remove the ns_cgroup

2010-09-28 Thread Daniel Lezcano

On 09/27/2010 10:46 PM, Andrew Morton wrote:
 On Mon, 27 Sep 2010 15:36:58 -0500
 Serge E. Hallynserge.hal...@canonical.com  wrote:


 This patchset removes the ns_cgroup by adding a new flag to the cgroup
 and the cgroupfs mount option. It enables the copy of the parent cgroup
 when a child cgroup is created. We can then safely remove the ns_cgroup as
 this flag brings a compatibility. We have now to manually create and add 
 the
 task to a cgroup, which is consistent with the cgroup framework.
  
 So this is a non-backward-compatible userspace-visible change?

 Yes, it is.

 Patch 1 is needed to let lxc and libvirt both control containers with
 same cgroup setup.  Patch 3 however isn't *necessary* for that.  Daniel,
 what do you think about holding off on patch 3?
  
 One way of handling this would be to merge patches 1&2 which add the
 new interface and also arrange for usage of the old interface(s) to
 emit a printk, telling people that they're using a feature which is
 scheduled for removal.


Right, that makes sense.

Will you take patches #1 and #2 and drop patch #3, so that I send a
new patch with the printk warning?
Or shall I resend all of them?

Thanks
   -- Daniel

[Devel] [PATCH 0/3][V2] remove the ns_cgroup

2010-09-27 Thread Daniel Lezcano

The ns_cgroup is a control group interacting with the namespaces.
When a new namespace is created, a corresponding cgroup is
automatically created too. The cgroup name is the pid of the process
that did 'unshare' or of the child of 'clone'.

This cgroup is tied to the namespace because it prevents a process
from escaping the control group, and it uses the post_clone callback
so that the child cgroup inherits the values of the parent cgroup.

Unfortunately, the more we use this cgroup, the more problems we face
with it:

 (1) when a process unshares, the cgroup name may conflict with a previous
 cgroup with the same pid, so unshare or clone returns -EEXIST

 (2) the cgroup creation is out of control because an application may
 create several namespaces, in which case the system automatically
 creates several cgroups behind its back and leaves them on the cgroupfs
 (eg. a vrf based on the network namespace).

 (3) the mix of (1) and (2) forces an administrator to regularly check and
 clean these cgroups.

This patchset removes the ns_cgroup by adding a new flag to the cgroup
and the cgroupfs mount option. It enables the copy of the parent cgroup
when a child cgroup is created. We can then safely remove the ns_cgroup as
this flag provides compatibility. We now have to manually create a cgroup
and add the task to it, which is consistent with the cgroup framework.
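
For illustration, the manual step from userspace could look like the sketch below: create the cgroup directory and write the pid to its 'tasks' file. The helper name cgroup_create_and_attach is made up, and the directory is whatever path the cgroupfs is mounted on (eg. somewhere under /cgroup); nothing here is specific to this patchset.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Create the cgroup directory cgroup_dir (if it does not exist yet) and
 * attach pid to it by writing the pid to its 'tasks' file.
 * Returns 0 on success, -1 on error.
 */
static int cgroup_create_and_attach(const char *cgroup_dir, pid_t pid)
{
	char path[4096];
	FILE *f;
	int n;

	assert(cgroup_dir != NULL);

	/* On a cgroupfs, mkdir is what creates the child cgroup. */
	if (mkdir(cgroup_dir, 0755) != 0 && errno != EEXIST)
		return -1;

	n = snprintf(path, sizeof(path), "%s/tasks", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path)) {
		errno = ENAMETOOLONG;
		return -1;
	}

	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fprintf(f, "%ld\n", (long)pid) < 0) {
		fclose(f);
		return -1;
	}
	/* The write reaches the kernel at flush time, so check fclose(). */
	return fclose(f) == 0 ? 0 : -1;
}
```

A container launcher would call this with the pid of the child right after clone(), instead of relying on the ns_cgroup to do it implicitly.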

Changelog:
=

 * V2 
Changed the following as Paul Menage suggested:
* removed the clone_children flag from the cgroupfs_root
* used the 'top_cgroup' to check whether 'clone_children' is set
  in the mount option
* improved the description of the patch 2/3

* removed CONFIG_CGROUP_NS against new default configs
 * V1 
initial post

Daniel Lezcano (3):
  cgroup : add clone_children control file
  cgroup : make the mount options parsing more accurate
  cgroup : remove the ns_cgroup

 Documentation/cgroups/cgroups.txt  |   16 ++-
 arch/arm/configs/tegra_defconfig   |1 -
 arch/mips/configs/bcm47xx_defconfig|1 -
 arch/powerpc/configs/ppc6xx_defconfig  |1 -
 arch/powerpc/configs/pseries_defconfig |1 -
 arch/s390/defconfig|1 -
 arch/sh/configs/sdk7786_defconfig  |1 -
 arch/sh/configs/se7206_defconfig   |1 -
 arch/sh/configs/shx3_defconfig |1 -
 arch/sh/configs/urquell_defconfig  |1 -
 arch/x86/configs/i386_defconfig|1 -
 arch/x86/configs/x86_64_defconfig  |1 -
 include/linux/cgroup.h |7 +-
 include/linux/cgroup_subsys.h  |6 -
 include/linux/nsproxy.h|9 --
 init/Kconfig   |9 --
 kernel/Makefile|1 -
 kernel/cgroup.c|  243 +---
 kernel/cpuset.c|7 +-
 kernel/fork.c  |6 -
 kernel/ns_cgroup.c |  110 --
 kernel/nsproxy.c   |4 -
 22 files changed, 118 insertions(+), 311 deletions(-)
 delete mode 100644 kernel/ns_cgroup.c


[Devel] [PATCH 2/3][V2] cgroup : make the mount options parsing more accurate

2010-09-27 Thread Daniel Lezcano

Current behavior:
=

(1) When we mount a cgroup, we can specify the 'all' option which means
to enable all the cgroup subsystems. This is the default option when
no option is specified.

(2) If we want to mount a cgroup with a subset of the supported cgroup
subsystems, we have to specify a subsystems name list for the mount
option.

(3) If we specify another option like 'noprefix' or 'release_agent', the
current code also requires the 'all' option or a subsystem name.
This is not critical, but it is unfriendly, as we should assume (1) in
this case.

(4) Logically, the 'all' option is mutually exclusive with a subsystem
name, but this is not detected.

In other words:
 succeed : mount -t cgroup -o all,freezer cgroup /cgroup
=> is it 'all' or 'freezer' ?
 fails : mount -t cgroup -o noprefix cgroup /cgroup
=> succeed if we do '-o noprefix,all'

The following changes consolidate the mount options check.

New behavior:
=

(1) untouched
(2) untouched
(3) the 'all' option becomes the default when only options other than
a subsystem name are specified
(4) raises an error

In other words:
 fails   : mount -t cgroup -o all,freezer cgroup /cgroup
 succeed : mount -t cgroup -o noprefix cgroup /cgroup
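
The new rules can be sketched in userspace as below. This is only a model of the behavior described above, not the kernel function: the subsystem list, the struct and the function names are hypothetical, and unknown tokens are rejected with -ENOENT as an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical subsystem list, for illustration only. */
static const char *const known_subsys[] = { "cpuset", "freezer", "memory" };

struct parsed_opts {
	bool all_ss;		/* mount every subsystem */
	bool one_ss;		/* at least one subsystem named explicitly */
	bool none;
	bool noprefix;
	bool clone_children;
};

/* Compare a token of length len (not NUL-terminated) with name. */
static bool tok_is(const char *tok, size_t len, const char *name)
{
	return strlen(name) == len && !strncmp(tok, name, len);
}

static bool is_subsys(const char *tok, size_t len)
{
	size_t i;

	for (i = 0; i < sizeof(known_subsys) / sizeof(known_subsys[0]); i++)
		if (tok_is(tok, len, known_subsys[i]))
			return true;
	return false;
}

/* Returns 0 on success or a negative errno value. */
static int parse_opts(const char *data, struct parsed_opts *opts)
{
	const char *tok = data;

	assert(opts != NULL);
	memset(opts, 0, sizeof(*opts));

	while (tok != NULL) {
		const char *comma = strchr(tok, ',');
		size_t len = comma ? (size_t)(comma - tok) : strlen(tok);

		if (len == 0)
			return -EINVAL;		/* empty token */
		if (tok_is(tok, len, "none")) {
			opts->none = true;
		} else if (tok_is(tok, len, "all")) {
			if (opts->one_ss)	/* rule (4) */
				return -EINVAL;
			opts->all_ss = true;
		} else if (tok_is(tok, len, "noprefix")) {
			opts->noprefix = true;
		} else if (tok_is(tok, len, "clone_children")) {
			opts->clone_children = true;
		} else if (is_subsys(tok, len)) {
			if (opts->all_ss)	/* rule (4) */
				return -EINVAL;
			opts->one_ss = true;
		} else {
			return -ENOENT;		/* unknown option or subsystem */
		}
		tok = comma ? comma + 1 : NULL;
	}

	/* rule (3): no subsystem named and no 'none' means 'all' */
	if (!opts->one_ss && !opts->none)
		opts->all_ss = true;
	return 0;
}
```

With this model, "all,freezer" is rejected with -EINVAL, while "noprefix" alone succeeds and behaves as if 'all' had been given.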

For the sake of readability, the if ... then ... else ... if ...
indentation when parsing the options has been changed to:
if ... then
...
continue
fi

Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
Signed-off-by: Serge E. Hallyn serge.hal...@canonical.com
Reviewed-by: Li Zefan l...@cn.fujitsu.com
Reviewed-by: Paul Menage men...@google.com
---
 kernel/cgroup.c |   90 --
 1 files changed, 60 insertions(+), 30 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7b17c3e..9eace43 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1073,7 +1073,8 @@ struct cgroup_sb_opts {
  */
 static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 {
-   char *token, *o = data ?: "all";
+   char *token, *o = data;
+   bool all_ss = false, one_ss = false;
unsigned long mask = (unsigned long)-1;
int i;
bool module_pin_failed = false;
@@ -1089,24 +1090,27 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
while ((token = strsep(&o, ",")) != NULL) {
if (!*token)
return -EINVAL;
-   if (!strcmp(token, "all")) {
-   /* Add all non-disabled subsystems */
-   opts->subsys_bits = 0;
-   for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
-   struct cgroup_subsys *ss = subsys[i];
-   if (ss == NULL)
-   continue;
-   if (!ss->disabled)
-   opts->subsys_bits |= 1ul << i;
-   }
-   } else if (!strcmp(token, "none")) {
+   if (!strcmp(token, "none")) {
/* Explicitly have no subsystems */
opts->none = true;
-   } else if (!strcmp(token, "noprefix")) {
+   continue;
+   }
+   if (!strcmp(token, "all")) {
+   /* Mutually exclusive option 'all' + subsystem name */
+   if (one_ss)
+   return -EINVAL;
+   all_ss = true;
+   continue;
+   }
+   if (!strcmp(token, "noprefix")) {
set_bit(ROOT_NOPREFIX, &opts->flags);
-   } else if (!strcmp(token, "clone_children")) {
+   continue;
+   }
+   if (!strcmp(token, "clone_children")) {
opts->clone_children = true;
-   } else if (!strncmp(token, "release_agent=", 14)) {
+   continue;
+   }
+   if (!strncmp(token, "release_agent=", 14)) {
/* Specifying two release agents is forbidden */
if (opts->release_agent)
return -EINVAL;
@@ -1114,7 +1118,9 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
kstrndup(token + 14, PATH_MAX - 1, GFP_KERNEL);
if (!opts->release_agent)
return -ENOMEM;
-   } else if (!strncmp(token, "name=", 5)) {
+   continue;
+   }
+   if (!strncmp(token, "name=", 5)) {
const char *name = token + 5;
/* Can't specify an empty name */
if (!strlen(name))
@@ -1136,20 +1142,44 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
  GFP_KERNEL

[Devel] [PATCH 1/3][V2] cgroup : add clone_children control file

2010-09-27 Thread Daniel Lezcano

This patch is sent as an answer to a previous thread around the ns_cgroup.

https://lists.linux-foundation.org/pipermail/containers/2009-June/018627.html

It adds a control file 'clone_children' for a cgroup.
This control file is a boolean specifying if the child cgroup should
be a clone of the parent cgroup or not. The default value is 'false'.

This flag makes the child cgroup call the post_clone callback of every
subsystem that provides one.

At present, cpuset is the only subsystem that implements the post_clone
callback.

The option can be set at mount time by specifying the 'clone_children' mount
option.
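
As a usage sketch, setting the flag on an already mounted cgroup from userspace is a plain write of "1" or "0" to the control file. The helper name is made up, and the file name 'cgroup.clone_children' is an assumption based on how the control file appears in a cgroup directory in mainline; adjust it if your tree names it differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Enable (enable != 0) or disable the clone_children flag of the cgroup
 * whose directory is cgroup_dir.
 * Returns 0 on success, -1 on error.
 */
static int cgroup_set_clone_children(const char *cgroup_dir, int enable)
{
	char path[4096];
	FILE *f;
	int n;

	assert(cgroup_dir != NULL);

	/* Assumed control file name, see the note above. */
	n = snprintf(path, sizeof(path), "%s/cgroup.clone_children",
		     cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path)) {
		errno = ENAMETOOLONG;
		return -1;
	}

	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(enable ? "1\n" : "0\n", f) == EOF) {
		fclose(f);
		return -1;
	}
	/* The write reaches the kernel at flush time, so check fclose(). */
	return fclose(f) == 0 ? 0 : -1;
}
```

Cgroups created under that directory afterwards get the post_clone treatment; cgroups that already exist are not changed by the write.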

Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
Signed-off-by: Serge E. Hallyn serge.hal...@canonical.com
Cc: Eric W. Biederman ebied...@xmission.com
Cc: Paul Menage men...@google.com
Reviewed-by: Li Zefan l...@cn.fujitsu.com
---
 Documentation/cgroups/cgroups.txt |   14 +++-
 include/linux/cgroup.h|4 +++
 kernel/cgroup.c   |   39 +
 3 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index b34823f..190018b 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -18,7 +18,8 @@ CONTENTS:
   1.2 Why are cgroups needed ?
   1.3 How are cgroups implemented ?
   1.4 What does notify_on_release do ?
-  1.5 How do I use cgroups ?
+  1.5 What does clone_children do ?
+  1.6 How do I use cgroups ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Attaching processes
@@ -293,7 +294,16 @@ notify_on_release in the root cgroup at system boot is 
disabled
 value of their parents notify_on_release setting. The default value of
 a cgroup hierarchy's release_agent path is empty.
 
-1.5 How do I use cgroups ?
+1.5 What does clone_children do ?
+-
+
+If the clone_children flag is enabled (1) in a cgroup, then all
+cgroups created beneath will call the post_clone callbacks for each
+subsystem of the newly created cgroup. Usually when this callback is
+implemented for a subsystem, it copies the values of the parent
+subsystem, this is the case for the cpuset.
+
+1.6 How do I use cgroups ?
 --
 
 To start a new job that is to be contained within a cgroup, using
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 709dfb9..ed4ba11 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -154,6 +154,10 @@ enum {
 * A thread in rmdir() is wating for this cgroup.
 */
CGRP_WAIT_ON_RMDIR,
+   /*
+* Clone cgroup values when creating a new child cgroup
+*/
+   CGRP_CLONE_CHILDREN,
 };
 
 /* which pidlist file are we talking about? */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7b69b8d..7b17c3e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -243,6 +243,11 @@ static int notify_on_release(const struct cgroup *cgrp)
return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
 }
 
+static int clone_children(const struct cgroup *cgrp)
+{
+   return test_bit(CGRP_CLONE_CHILDREN, &cgrp->flags);
+}
+
 /*
  * for_each_subsys() allows you to iterate on each subsystem attached to
  * an active hierarchy
@@ -1039,6 +1044,8 @@ static int cgroup_show_options(struct seq_file *seq, struct vfsmount *vfs)
seq_puts(seq, ",noprefix");
if (strlen(root->release_agent_path))
seq_printf(seq, ",release_agent=%s", root->release_agent_path);
+   if (clone_children(&root->top_cgroup))
+   seq_puts(seq, ",clone_children");
if (strlen(root->name))
seq_printf(seq, ",name=%s", root->name);
mutex_unlock(&cgroup_mutex);
@@ -1049,6 +1056,7 @@ struct cgroup_sb_opts {
unsigned long subsys_bits;
unsigned long flags;
char *release_agent;
+   bool clone_children;
char *name;
/* User explicitly requested empty subsystem */
bool none;
@@ -1096,6 +1104,8 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
opts->none = true;
} else if (!strcmp(token, "noprefix")) {
set_bit(ROOT_NOPREFIX, &opts->flags);
+   } else if (!strcmp(token, "clone_children")) {
+   opts->clone_children = true;
} else if (!strncmp(token, "release_agent=", 14)) {
/* Specifying two release agents is forbidden */
if (opts->release_agent)
@@ -1354,6 +1364,8 @@ static struct cgroupfs_root *cgroup_root_from_opts(struct cgroup_sb_opts *opts)
strcpy(root->release_agent_path, opts->release_agent);
if (opts->name)
strcpy(root->name, opts->name);
+   if (opts->clone_children)
+   set_bit(CGRP_CLONE_CHILDREN, &root->top_cgroup.flags);
return root;
 }
 
@@ -3172,6 +3184,23 @@ fail:
return ret

[Devel] [PATCH 3/3][V2] cgroup : remove the ns_cgroup

2010-09-27 Thread Daniel Lezcano

The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier.

For example, a single process cannot handle a large number of namespaces
without interacting with this cgroup and falling into an exponential creation
time due to the nested cgroup directory depth (eg. /cgroup/<pid>/.../<pid>/...).

That was spotted when creating a single process using multiple network
namespaces. The objective was 4096 network namespaces, but at 820 netns the
creation became dramatically slow: the creation time for one namespace
increased from 10msec to 10sec. After five hours, the expected number of
netns was still not reached. Without the ns_cgroup interaction, 4K netns are
created in 2 minutes.

In order to work around that, we have to mount the cgroup with all the
subsystems except the ns_cgroup. This is a little weird and hard to manage
from an administration point of view, because we have to know which cgroup
subsystems are available on the system and we can't do a simple
'mount -t cgroup cgroup /cgroup'.

With the previous patch, which adds a 'clone_children' parameter to a cgroup,
we should be able to remove the ns_cgroup and manually manage the creation of
a cgroup and the adding of a task to it, consistently with the rest of the
subsystems.

This patch removes the ns_cgroup as suggested in the following thread:

https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

The 'cgroup_clone' function is removed because it is no longer used.

Signed-off-by: Daniel Lezcano <daniel.lezc...@free.fr>
Signed-off-by: Serge E. Hallyn <serge.hal...@canonical.com>
Cc: Eric W. Biederman <ebied...@xmission.com>
Cc: Jamal Hadi Salim <h...@cyberus.ca>
Reviewed-by: Li Zefan <l...@cn.fujitsu.com>
Acked-by: Paul Menage <men...@google.com>
Acked-by: Matt Helsley <matth...@us.ibm.com>
---
 Documentation/cgroups/cgroups.txt      |    2 +-
 arch/arm/configs/tegra_defconfig       |    1 -
 arch/mips/configs/bcm47xx_defconfig    |    1 -
 arch/powerpc/configs/ppc6xx_defconfig  |    1 -
 arch/powerpc/configs/pseries_defconfig |    1 -
 arch/s390/defconfig                    |    1 -
 arch/sh/configs/sdk7786_defconfig      |    1 -
 arch/sh/configs/se7206_defconfig       |    1 -
 arch/sh/configs/shx3_defconfig         |    1 -
 arch/sh/configs/urquell_defconfig      |    1 -
 arch/x86/configs/i386_defconfig        |    1 -
 arch/x86/configs/x86_64_defconfig      |    1 -
 include/linux/cgroup.h                 |    3 -
 include/linux/cgroup_subsys.h          |    6 --
 include/linux/nsproxy.h                |    9 ---
 init/Kconfig                           |    9 ---
 kernel/Makefile                        |    1 -
 kernel/cgroup.c                        |  116
 kernel/cpuset.c                        |    7 +-
 kernel/fork.c                          |    6 --
 kernel/ns_cgroup.c                     |  110 --
 kernel/nsproxy.c                       |    4 -
 22 files changed, 4 insertions(+), 280 deletions(-)
 delete mode 100644 kernel/ns_cgroup.c

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 190018b..6a5ba63 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -618,7 +618,7 @@ always handled well.
 void post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp)
 (cgroup_mutex held by caller)
 
-Called at the end of cgroup_clone() to do any parameter
+Called during cgroup_create() to do any parameter
 initialization which might be required before a task could attach.  For
 example in cpusets, no task may attach before 'cpus' and 'mems' are set
 up.
diff --git a/arch/arm/configs/tegra_defconfig b/arch/arm/configs/tegra_defconfig
index c81b6d9..ebb8c55 100644
--- a/arch/arm/configs/tegra_defconfig
+++ b/arch/arm/configs/tegra_defconfig
@@ -65,7 +65,6 @@ CONFIG_IKCONFIG_PROC=y
 CONFIG_LOG_BUF_SHIFT=17
 CONFIG_CGROUPS=y
 CONFIG_CGROUP_DEBUG=y
-# CONFIG_CGROUP_NS is not set
 CONFIG_CGROUP_FREEZER=y
 # CONFIG_CGROUP_DEVICE is not set
 # CONFIG_CPUSETS is not set
diff --git a/arch/mips/configs/bcm47xx_defconfig b/arch/mips/configs/bcm47xx_defconfig
index 927d58b..c4338e0 100644
--- a/arch/mips/configs/bcm47xx_defconfig
+++ b/arch/mips/configs/bcm47xx_defconfig
@@ -16,7 +16,6 @@ CONFIG_TASK_IO_ACCOUNTING=y
 CONFIG_AUDIT=y
 CONFIG_TINY_RCU=y
 CONFIG_CGROUPS=y
-CONFIG_CGROUP_NS=y
 CONFIG_CGROUP_CPUACCT=y
 CONFIG_RELAY=y
 CONFIG_BLK_DEV_INITRD=y
diff --git a/arch/powerpc/configs/ppc6xx_defconfig b/arch/powerpc/configs/ppc6xx_defconfig
index 9d64a68..9b253f6 100644
--- a/arch/powerpc/configs/ppc6xx_defconfig
+++ b/arch/powerpc/configs/ppc6xx_defconfig
@@ -10,7 +10,6 @@ CONFIG_TASK_XACCT=y
 CONFIG_TASK_IO_ACCOUNTING=y
 CONFIG_AUDIT=y
 CONFIG_CGROUPS=y
-CONFIG_CGROUP_NS=y
 CONFIG_CGROUP_DEVICE=y
 CONFIG_CGROUP_CPUACCT=y
 CONFIG_RESOURCE_COUNTERS=y
diff --git a/arch/powerpc/configs/pseries_defconfig b/arch/powerpc/configs/pseries_defconfig
index f87f0e1..972587f 100644
--- a/arch/powerpc/configs/pseries_defconfig
+++ b/arch/powerpc/configs

[Devel] Re: [PATCH 0/3][V2] remove the ns_cgroup

2010-09-27 Thread Daniel Lezcano

On 09/27/2010 09:57 PM, Andrew Morton wrote:
 On Mon, 27 Sep 2010 12:14:10 +0200
 Daniel Lezcano <daniel.lezc...@free.fr> wrote:


 The ns_cgroup is a control group interacting with the namespaces.
 When a new namespace is created, a corresponding cgroup is
 automatically created too. The cgroup name is the pid of the process
 which did 'unshare' or of the child of 'clone'.

 This cgroup is tied to the namespace because it prevents a process
 from escaping the control group, and it uses the post_clone callback
 so that the child cgroup inherits the values of the parent cgroup.

 Unfortunately, the more we use this cgroup, the more problems we face
 with it:

   (1) when a process unshares, the cgroup name may conflict with a previous
   cgroup with the same pid, so unshare or clone returns -EEXIST

   (2) the cgroup creation is out of control because an application may
   create several namespaces, for which the system automatically creates
   several cgroups behind its back and leaves them on the cgroupfs (eg. a
   vrf based on the network namespace).

   (3) the mix of (1) and (2) forces an administrator to regularly check and
   clean these cgroups.

 This patchset removes the ns_cgroup by adding a new flag to the cgroup
 and to the cgroupfs mount options. It enables the copy of the parent cgroup
 when a child cgroup is created. We can then safely remove the ns_cgroup as
 this flag provides compatibility. We now have to manually create a cgroup
 and add the task to it, which is consistent with the cgroup framework.
  
 So this is a non-backward-compatible userspace-visible change?

 What are the implications of this?


An application will have to create a directory in the cgroup directory
and write the pid to the tasks file, instead of assuming the cgroup is
automatically created by the unshare/clone. The cgroupfs should be
mounted with the 'clone_children' option set.

AFAIK, I am the only one, with the lxc tools, using the ns_cgroup and I
will be happy to get rid of it. People are used to changing the default
cgroup mount options to mount all the subsystems except the ns_cgroup
(for example this is needed for libvirt, if I am not wrong). IMHO, very
few people will be impacted, if any.
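
As a rough illustration of the manual steps described above (create a
directory in the cgroup hierarchy, then write the pid to its tasks file),
here is a minimal userspace sketch in C. The helper name cgroup_join and its
parameters are assumptions made for the example; it is not code from the lxc
tools or from the patchset, and it expects the hierarchy to be mounted
already, for instance with the 'clone_children' option.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of the patchset): create the cgroup
 * directory <mnt>/<name> if needed, then move <pid> into it by writing
 * the pid to the 'tasks' file. Returns 0 or a negative errno value.
 */
static int cgroup_join(const char *mnt, const char *name, pid_t pid)
{
	char path[4096];
	FILE *f;
	int n, err;

	n = snprintf(path, sizeof(path), "%s/%s", mnt, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -ENAMETOOLONG;

	/* Creating the directory is what creates the cgroup */
	if (mkdir(path, 0755) && errno != EEXIST)
		return -errno;

	n = snprintf(path, sizeof(path), "%s/%s/tasks", mnt, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -ENAMETOOLONG;

	f = fopen(path, "w");
	if (!f)
		return -errno;

	/* Writing one pid attaches that task to the cgroup */
	err = fprintf(f, "%ld\n", (long)pid) < 0 ? -EIO : 0;
	if (fclose(f) && !err)
		err = -errno;

	return err;
}
```

A caller would typically do something like cgroup_join("/cgroup", "mycontainer", getpid())
around the unshare/clone; /cgroup is only the mount point used in the
examples of this thread.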

   -- Daniel

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel

[Devel] Re: [PATCH 7/8] net: Allow setting the network namespace by fd

2010-09-24 Thread Daniel Lezcano

On 09/23/2010 10:51 AM, Eric W. Biederman wrote:

 Take advantage of the new abstraction and allow network devices
 to be placed in any network namespace that we have a fd to talk
 about.

 Signed-off-by: Eric W. Biederman <ebied...@xmission.com>
 ---

[ ... ]

 +struct net *get_net_ns_by_fd(int fd)
 +{
 + struct proc_inode *ei;
 + struct file *file;
 + struct net *net;
 +
 + file = NULL;
 + net = ERR_PTR(-EINVAL);
 + file = proc_ns_fget(fd);
 + if (!fd)
 + goto out;
 + return ERR_PTR(-EINVAL);
 +
 + ei = PROC_I(file->f_dentry->d_inode);
 + if (ei->ns_ops != &netns_operations)
 + goto out;

Is this check necessary here ? proc_ns_fget checks file->f_op !=
&ns_file_operations, no ?

[Devel] Re: [ABI REVIEW][PATCH 0/8] Namespace file descriptors

2010-09-24 Thread Daniel Lezcano

On 09/24/2010 03:02 PM, Andrew Lutomirski wrote:
 Eric W. Biederman wrote:
 Introduce file for manipulating namespaces and related syscalls.
 files:
 /proc/self/ns/<nstype>

 syscalls:
 int setns(unsigned long nstype, int fd);
 socketat(int nsfd, int family, int type, int protocol);


 How does security work?  Are there different kinds of fd that give (say) 
 pin-the-namespace permission, socketat permission, and setns permission?

AFAICS, socketat, setns and set netns by fd only accept an fd from
/proc/<pid>/ns/<ns>.

setns does:

	file = proc_ns_fget(fd);
	if (IS_ERR(file))
		return PTR_ERR(file);

and proc_ns_fget checks if (file->f_op != &ns_file_operations).

socketat and set netns by fd do:

	net = get_net_ns_by_fd(fd);

and this one calls proc_ns_fget.

So we have the guarantee that the fd results from an open of the file
with the right permissions.

Another way to pin the namespace would be to mount --bind
/proc/<pid>/ns/<ns>, but we have to be root to do that ...

[Devel] Re: [PATCH 2/3] cgroup : make the mount options parsing more accurate

2010-09-07 Thread Daniel Lezcano

On 09/07/2010 09:38 PM, Paul Menage wrote:
 On Sat, Sep 4, 2010 at 12:31 AM, Daniel Lezcano <daniel.lezc...@free.fr> wrote:

 The current code does not detect 'all' being specified together with a
 subsystem name, although the two are IMHO mutually exclusive. Also, when
 an option is specified, even if it is not a subsystem name, we have to
 specify the 'all' option along with it.
 eg:
   not detected : mount -t cgroup -o all,freezer cgroup /cgroup
   not flexible : mount -t cgroup -o noprefix,all cgroup /cgroup

 This patch fixes this and makes the code a bit clearer by replacing
 the 'else if' chain with 'continue' blocks in the loop.
  
 Can you fix this description to be clearer about the new behaviour of the 
 code?

 Reviewed-by: Paul Menage <men...@google.com>


Sure no problem.

Thanks for the review.
   -- Daniel

[Devel] Re: [PATCH 1/3] cgroup : add clone_children control file

2010-09-07 Thread Daniel Lezcano

On 09/07/2010 09:34 PM, Paul Menage wrote:
 On Sat, Sep 4, 2010 at 12:31 AM, Daniel Lezcano <daniel.lezc...@free.fr> wrote:

 @@ -229,6 +229,7 @@ inline int cgroup_is_removed(const struct cgroup *cgrp)
   /* bits in struct cgroupfs_root flags field */
   enum {
 	ROOT_NOPREFIX, /* mounted subsystems have no named prefix */
 +	ROOT_CLONE_CHILDREN, /* mounted subsystems will inherit from parent */
   };
  
 This bit is awkward - you're storing the original value of the
 clone_children flag to report in the mount options, but this isn't
 necessarily the current state. Might it be better to not store this
 and just report the current value of the root cgroup's
 CGRP_CLONE_CHILDREN flag?


Sure. Shall I do the same as the release agent mount option ?

Thanks
   -- Daniel

[Devel] Re: [PATCH 1/3] cgroup : add clone_children control file

2010-09-07 Thread Daniel Lezcano

On 09/07/2010 10:26 PM, Paul Menage wrote:
 On Tue, Sep 7, 2010 at 1:23 PM, Daniel Lezcano <daniel.lezc...@free.fr> wrote:

 This bit is awkward - you're storing the original value of the
 clone_children flag to report in the mount options, but this isn't
 necessarily the current state. Might it be better to not store this
 and just report the current value of the root cgroup's
 CGRP_CLONE_CHILDREN flag?


 Sure. Shall I do the same as the release agent mount option ?
  
 I think so - the slight problem with this new flag is that it's
 possible for the root cgroup to have one setting for clone_children
 and all its children to have a different setting. But I guess we can
 live with that. (Or maybe simply not make the default clone_children
 value a mount option, but require it to be set on the root cgroup
 after mounting?)


The clone_children option behaves like the release_agent mount option,
no ? We can mount with a specific release agent and change it at runtime.
IMHO it would be better to give the administrator a chance to set up his
system with the mount option instead of forcing him to write post-mount
scripts. An alternative would be to set this cgroup option *only* via
the mount option, but I am not sure it is a good idea, as it may be an
unresolvable constraint for a system wanting to use the cgroups both with
and without this option (the same kind of constraint we have with the
ns_cgroup).

I am in favor of keeping the mount option and the ability to change it
for another cgroup.

   -- Daniel

[Devel] Re: [PATCH 1/3] cgroup : add clone_children control file

2010-09-07 Thread Daniel Lezcano

On 09/07/2010 11:22 PM, Paul Menage wrote:
 On Tue, Sep 7, 2010 at 2:13 PM, Daniel Lezcano <daniel.lezc...@free.fr> wrote:

 The clone_children option behaves like the release_agent mount option, no ?
  
 Not quite, since it can be controlled on a per-cgroup basis.


 We can mount with a specific release agent and change it at runtime. IMHO it
 would be better to give the administrator a chance to set up his system with
 the mount option instead of forcing him to write post-mount scripts. An
 alternative would be to set this cgroup option *only* via the mount option,
 but I am not sure it is a good idea, as it may be an unresolvable constraint
 for a system wanting to use the cgroups both with and without this option
 (the same kind of constraint we have with the ns_cgroup).

 I am in favor of keeping the mount option and the ability to change it for
 another cgroup.
  
 OK, let's mostly keep the current patch, but lose the flag stored at
 mount-time and just report the mount option based on the current value
 of the root cgroup's flag.


Ok, will resend a new version.
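
To make the agreed change concrete: the mount option gets reported from the
root cgroup's current flag rather than from a copy saved at mount time. The
following is only a userspace model of that idea in C. The struct and
function names are invented for the example; the real code works on
struct cgroupfs_root and a seq_file.

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-ins for struct cgroup and struct cgroupfs_root */
struct cgroup_model {
	bool clone_children;	/* current value, writable at runtime */
};

struct root_model {
	struct cgroup_model top_cgroup;
};

/*
 * Build the option string from the live state of the root cgroup.
 * Nothing is cached at mount time, so the output always reflects
 * the current value of the flag.
 */
static int show_options(const struct root_model *root, char *buf, size_t len)
{
	return snprintf(buf, len, "rw%s",
			root->top_cgroup.clone_children ?
			",clone_children" : "");
}
```

With this model, clearing the flag on the root cgroup after mounting makes
the option disappear from the reported mount options, which is the behaviour
discussed above.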

Thanks for reviewing the patchset.

   -- Daniel

[Devel] [PATCH 1/3] cgroup : add clone_children control file

2010-09-04 Thread Daniel Lezcano

This patch is sent as an answer to a previous thread around the ns_cgroup.

https://lists.linux-foundation.org/pipermail/containers/2009-June/018627.html

It adds a control file 'clone_children' for a cgroup.
This control file is a boolean specifying whether a child cgroup should
be a clone of the parent cgroup or not. The default value is 'false'.

When this flag is set, a newly created child cgroup calls the post_clone
callback of each subsystem, if it is available.

At present, cpuset is the only subsystem which implements the post_clone
callback.

The option can be set at mount time by specifying the 'clone_children' mount
option.

Signed-off-by: Daniel Lezcano <daniel.lezc...@free.fr>
Signed-off-by: Serge E. Hallyn <serge.hal...@canonical.com>
Cc: Eric W. Biederman <ebied...@xmission.com>
Cc: Paul Menage <men...@google.com>
Reviewed-by: Li Zefan <l...@cn.fujitsu.com>
---
 Documentation/cgroups/cgroups.txt |   14 +++-
 include/linux/cgroup.h            |    4 +++
 kernel/cgroup.c                   |   39 +
 3 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index b34823f..190018b 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -18,7 +18,8 @@ CONTENTS:
   1.2 Why are cgroups needed ?
   1.3 How are cgroups implemented ?
   1.4 What does notify_on_release do ?
-  1.5 How do I use cgroups ?
+  1.5 What does clone_children do ?
+  1.6 How do I use cgroups ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Attaching processes
@@ -293,7 +294,16 @@ notify_on_release in the root cgroup at system boot is disabled
 value of their parents notify_on_release setting. The default value of
 a cgroup hierarchy's release_agent path is empty.
 
-1.5 How do I use cgroups ?
+1.5 What does clone_children do ?
+---------------------------------
+
+If the clone_children flag is enabled (1) in a cgroup, then all
+cgroups created beneath will call the post_clone callbacks for each
+subsystem of the newly created cgroup. Usually when this callback is
+implemented for a subsystem, it copies the values of the parent
+subsystem, this is the case for the cpuset.
+
+1.6 How do I use cgroups ?
 --------------------------
 
 To start a new job that is to be contained within a cgroup, using
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3cb7d04..d01543b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -154,6 +154,10 @@ enum {
 	 * A thread in rmdir() is wating for this cgroup.
 	 */
 	CGRP_WAIT_ON_RMDIR,
+	/*
+	 * Clone cgroup values when creating a new child cgroup
+	 */
+	CGRP_CLONE_CHILDREN,
 };
 
 /* which pidlist file are we talking about? */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e5c5497..0473a9a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -229,6 +229,7 @@ inline int cgroup_is_removed(const struct cgroup *cgrp)
 /* bits in struct cgroupfs_root flags field */
 enum {
 	ROOT_NOPREFIX, /* mounted subsystems have no named prefix */
+	ROOT_CLONE_CHILDREN, /* mounted subsystems will inherit from parent */
 };
 
 static int cgroup_is_releasable(const struct cgroup *cgrp)
@@ -244,6 +245,11 @@ static int notify_on_release(const struct cgroup *cgrp)
 	return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
 }
 
+static int clone_children(const struct cgroup *cgrp)
+{
+	return test_bit(CGRP_CLONE_CHILDREN, &cgrp->flags);
+}
+
 /*
  * for_each_subsys() allows you to iterate on each subsystem attached to
  * an active hierarchy
@@ -1038,6 +1044,8 @@ static int cgroup_show_options(struct seq_file *seq, struct vfsmount *vfs)
 		seq_printf(seq, ",%s", ss->name);
 	if (test_bit(ROOT_NOPREFIX, &root->flags))
 		seq_puts(seq, ",noprefix");
+	if (test_bit(ROOT_CLONE_CHILDREN, &root->flags))
+		seq_puts(seq, ",clone_children");
 	if (strlen(root->release_agent_path))
 		seq_printf(seq, ",release_agent=%s", root->release_agent_path);
 	if (strlen(root->name))
@@ -1097,6 +1105,8 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 			opts->none = true;
 		} else if (!strcmp(token, "noprefix")) {
 			set_bit(ROOT_NOPREFIX, &opts->flags);
+		} else if (!strcmp(token, "clone_children")) {
+			set_bit(ROOT_CLONE_CHILDREN, &opts->flags);
 		} else if (!strncmp(token, "release_agent=", 14)) {
 			/* Specifying two release agents is forbidden */
 			if (opts->release_agent)
@@ -1357,6 +1367,8 @@ static struct cgroupfs_root *cgroup_root_from_opts(struct cgroup_sb_opts *opts)
 		strcpy(root->release_agent_path, opts->release_agent);
 	if (opts->name)
 		strcpy(root->name, opts->name);
+	if (test_bit(ROOT_CLONE_CHILDREN, &opts->flags

[Devel] [PATCH 3/3] cgroup : remove the ns_cgroup

2010-09-04 Thread Daniel Lezcano

The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier.

For example, a single process cannot handle a large number of namespaces
without interacting with this cgroup and falling into an exponential creation
time due to the nested cgroup directory depth (eg. /cgroup/<pid>/.../<pid>/...).

That was spotted when creating a single process using multiple network
namespaces. The objective was 4096 network namespaces, but at 820 netns the
creation became dramatically slow: the creation time for one namespace
increased from 10msec to 10sec. After five hours, the expected number of
netns was still not reached. Without the ns_cgroup interaction, 4K netns are
created in 2 minutes.

In order to work around that, we have to mount the cgroup with all the
subsystems except the ns_cgroup. This is a little weird and hard to manage
from an administration point of view, because we have to know which cgroup
subsystems are available on the system and we can't do a simple
'mount -t cgroup cgroup /cgroup'.

With the previous patch, which adds a 'clone_children' parameter to a cgroup,
we should be able to remove the ns_cgroup and manually manage the creation of
a cgroup and the adding of a task to it, consistently with the rest of the
subsystems.

This patch removes the ns_cgroup as suggested in the following thread:

https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

The 'cgroup_clone' function is removed because it is no longer used.

Changelog:
* Sep 1 (dle): refreshed CONFIG_CGROUP_NS references
* Jul 29 (seh): remove references to ns_cgroup_clone(), fix up
   some documentation, and remove CONFIG_CGROUP_NS references.

Signed-off-by: Daniel Lezcano <daniel.lezc...@free.fr>
Signed-off-by: Serge E. Hallyn <serge.hal...@canonical.com>
Cc: Eric W. Biederman <ebied...@xmission.com>
Cc: Jamal Hadi Salim <h...@cyberus.ca>
Reviewed-by: Li Zefan <l...@cn.fujitsu.com>
Acked-by: Paul Menage <men...@google.com>
Acked-by: Matt Helsley <matth...@us.ibm.com>
---
 Documentation/cgroups/cgroups.txt      |    2 +-
 arch/mips/configs/bcm47xx_defconfig    |    1 -
 arch/powerpc/configs/ppc6xx_defconfig  |    1 -
 arch/powerpc/configs/pseries_defconfig |    1 -
 arch/s390/defconfig                    |    1 -
 arch/sh/configs/sdk7786_defconfig      |    1 -
 arch/sh/configs/se7206_defconfig       |    1 -
 arch/sh/configs/shx3_defconfig         |    1 -
 arch/sh/configs/urquell_defconfig      |    1 -
 arch/x86/configs/i386_defconfig        |    1 -
 arch/x86/configs/x86_64_defconfig      |    1 -
 include/linux/cgroup.h                 |    3 -
 include/linux/cgroup_subsys.h          |    6 --
 include/linux/nsproxy.h                |    9 ---
 init/Kconfig                           |    9 ---
 kernel/Makefile                        |    1 -
 kernel/cgroup.c                        |  116
 kernel/cpuset.c                        |    7 +-
 kernel/fork.c                          |    6 --
 kernel/ns_cgroup.c                     |  110 --
 kernel/nsproxy.c                       |    4 -
 21 files changed, 4 insertions(+), 279 deletions(-)
 delete mode 100644 kernel/ns_cgroup.c

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 190018b..6a5ba63 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -618,7 +618,7 @@ always handled well.
 void post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp)
 (cgroup_mutex held by caller)
 
-Called at the end of cgroup_clone() to do any parameter
+Called during cgroup_create() to do any parameter
 initialization which might be required before a task could attach.  For
 example in cpusets, no task may attach before 'cpus' and 'mems' are set
 up.
diff --git a/arch/mips/configs/bcm47xx_defconfig b/arch/mips/configs/bcm47xx_defconfig
index 927d58b..c4338e0 100644
--- a/arch/mips/configs/bcm47xx_defconfig
+++ b/arch/mips/configs/bcm47xx_defconfig
@@ -16,7 +16,6 @@ CONFIG_TASK_IO_ACCOUNTING=y
 CONFIG_AUDIT=y
 CONFIG_TINY_RCU=y
 CONFIG_CGROUPS=y
-CONFIG_CGROUP_NS=y
 CONFIG_CGROUP_CPUACCT=y
 CONFIG_RELAY=y
 CONFIG_BLK_DEV_INITRD=y
diff --git a/arch/powerpc/configs/ppc6xx_defconfig b/arch/powerpc/configs/ppc6xx_defconfig
index 9d64a68..9b253f6 100644
--- a/arch/powerpc/configs/ppc6xx_defconfig
+++ b/arch/powerpc/configs/ppc6xx_defconfig
@@ -10,7 +10,6 @@ CONFIG_TASK_XACCT=y
 CONFIG_TASK_IO_ACCOUNTING=y
 CONFIG_AUDIT=y
 CONFIG_CGROUPS=y
-CONFIG_CGROUP_NS=y
 CONFIG_CGROUP_DEVICE=y
 CONFIG_CGROUP_CPUACCT=y
 CONFIG_RESOURCE_COUNTERS=y
diff --git a/arch/powerpc/configs/pseries_defconfig b/arch/powerpc/configs/pseries_defconfig
index f87f0e1..972587f 100644
--- a/arch/powerpc/configs/pseries_defconfig
+++ b/arch/powerpc/configs/pseries_defconfig
@@ -15,7 +15,6 @@ CONFIG_AUDITSYSCALL=y
 CONFIG_IKCONFIG=y
 CONFIG_IKCONFIG_PROC=y
 CONFIG_CGROUPS=y
-CONFIG_CGROUP_NS=y
 CONFIG_CGROUP_FREEZER=y
 CONFIG_CGROUP_DEVICE=y
 CONFIG_CPUSETS=y
diff --git a/arch/s390/defconfig b/arch/s390

[Devel] [PATCH 2/3] cgroup : make the mount options parsing more accurate

2010-09-04 Thread Daniel Lezcano

The current code does not detect 'all' being specified together with a
subsystem name, although the two are IMHO mutually exclusive. Also, when
an option is specified, even if it is not a subsystem name, we have to
specify the 'all' option along with it.
eg:
 not detected : mount -t cgroup -o all,freezer cgroup /cgroup
 not flexible : mount -t cgroup -o noprefix,all cgroup /cgroup

This patch fixes this and makes the code a bit clearer by replacing
the 'else if' chain with 'continue' blocks in the loop.
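
The intended behaviour can be sketched outside the kernel as follows. This
is a reduced userspace model in C, not the patch itself: the struct
mount_choice, the hard-coded subsystem list and the function names are
assumptions made for the example, the 'none', 'release_agent=' and 'name='
options are left out, and the default to 'all' when no subsystem is named is
my reading of the "not flexible" case above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced result of the parsing (not struct cgroup_sb_opts) */
struct mount_choice {
	bool all;		/* mount every available subsystem */
	bool noprefix;
};

/* Example subsystem names, hard-coded for the sketch */
static const char *const known_subsys[] = { "cpuset", "freezer", "devices" };

static bool is_subsys(const char *token)
{
	size_t i;

	for (i = 0; i < sizeof(known_subsys) / sizeof(known_subsys[0]); i++)
		if (!strcmp(token, known_subsys[i]))
			return true;
	return false;
}

/* 'data' is a comma-separated option string, modified in place */
static int check_options(char *data, struct mount_choice *out)
{
	bool all_ss = false, one_ss = false;
	char *token, *next;

	memset(out, 0, sizeof(*out));

	for (token = data; token; token = next) {
		char *comma = strchr(token, ',');

		next = comma ? comma + 1 : NULL;
		if (comma)
			*comma = '\0';

		if (!*token)
			return -EINVAL;
		if (!strcmp(token, "all")) {
			/* 'all' after a subsystem name is rejected */
			if (one_ss)
				return -EINVAL;
			all_ss = true;
			continue;
		}
		if (!strcmp(token, "noprefix")) {
			out->noprefix = true;
			continue;
		}
		if (!is_subsys(token))
			return -ENOENT;
		/* a subsystem name after 'all' is rejected too */
		if (all_ss)
			return -EINVAL;
		one_ss = true;
	}

	/* Assumption: with no subsystem named, behave as if 'all' was given */
	out->all = all_ss || !one_ss;
	return 0;
}
```

With this model "all,freezer" is refused with -EINVAL, while "noprefix" alone
is accepted and selects all the subsystems.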

Signed-off-by: Daniel Lezcano <daniel.lezc...@free.fr>
Signed-off-by: Serge E. Hallyn <serge.hal...@canonical.com>
Cc: Eric W. Biederman <ebied...@xmission.com>
Cc: Paul Menage <men...@google.com>
Reviewed-by: Li Zefan <l...@cn.fujitsu.com>
---
 kernel/cgroup.c |   91 +--
 1 files changed, 61 insertions(+), 30 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0473a9a..ca2314f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1074,7 +1074,8 @@ struct cgroup_sb_opts {
  */
 static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 {
-	char *token, *o = data ?: "all";
+	char *token, *o = data;
+	bool all_ss = false, one_ss = false;
 	unsigned long mask = (unsigned long)-1;
 	int i;
 	bool module_pin_failed = false;
@@ -1088,26 +1089,30 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 	memset(opts, 0, sizeof(*opts));
 
 	while ((token = strsep(&o, ",")) != NULL) {
+
 		if (!*token)
 			return -EINVAL;
-		if (!strcmp(token, "all")) {
-			/* Add all non-disabled subsystems */
-			opts->subsys_bits = 0;
-			for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
-				struct cgroup_subsys *ss = subsys[i];
-				if (ss == NULL)
-					continue;
-				if (!ss->disabled)
-					opts->subsys_bits |= 1ul << i;
-			}
-		} else if (!strcmp(token, "none")) {
+		if (!strcmp(token, "none")) {
 			/* Explicitly have no subsystems */
 			opts->none = true;
-		} else if (!strcmp(token, "noprefix")) {
+			continue;
+		}
+		if (!strcmp(token, "all")) {
+			/* Mutually exclusive option 'all' + subsystem name */
+			if (one_ss)
+				return -EINVAL;
+			all_ss = true;
+			continue;
+		}
+		if (!strcmp(token, "noprefix")) {
 			set_bit(ROOT_NOPREFIX, &opts->flags);
-		} else if (!strcmp(token, "clone_children")) {
+			continue;
+		}
+		if (!strcmp(token, "clone_children")) {
 			set_bit(ROOT_CLONE_CHILDREN, &opts->flags);
-		} else if (!strncmp(token, "release_agent=", 14)) {
+			continue;
+		}
+		if (!strncmp(token, "release_agent=", 14)) {
 			/* Specifying two release agents is forbidden */
 			if (opts->release_agent)
 				return -EINVAL;
@@ -1115,7 +1120,9 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 				kstrndup(token + 14, PATH_MAX - 1, GFP_KERNEL);
 			if (!opts->release_agent)
 				return -ENOMEM;
-		} else if (!strncmp(token, "name=", 5)) {
+			continue;
+		}
+		if (!strncmp(token, "name=", 5)) {
 			const char *name = token + 5;
 			/* Can't specify an empty name */
 			if (!strlen(name))
@@ -1137,20 +1144,44 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 					     GFP_KERNEL);
 			if (!opts->name)
 				return -ENOMEM;
-		} else {
-			struct cgroup_subsys *ss;
-			for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
-				ss = subsys[i];
-				if (ss == NULL)
-					continue;
-				if (!strcmp(token, ss->name)) {
-					if (!ss->disabled)
-						set_bit(i, &opts->subsys_bits);
-					break;
-				}
-			}
-			if (i == CGROUP_SUBSYS_COUNT)
-				return -ENOENT;
+
+			continue

[Devel] Re: LXC on ARM (RISC)

2010-09-02 Thread Daniel Lezcano

On 07/08/2010 09:54 PM, Mihamina Rakotomandimby wrote:
 Manao ahoana, Hello, Bonjour,

 Will LXC work in this ARM (RISC) CPU?
 http://www.tilera.com/products/TILEPro64.php

 Misaotra, Thanks, Merci.


LXC is not arch-dependent. It should work if the network driver on your
system is compatible with the network virtualization.

[Devel] unshare pidns and setns syscall

2010-08-23 Thread Daniel Lezcano


Hi Eric,

Do you plan to send a new version of the 'setns' patchset ? I will be
happy to test it if you have a more recent one.

In the meantime, we kept porting your patchset on top of the latest kernels.
http://lxc.sourceforge.net/patches/linux/2.6.35/2.6.35-lxc1/patches/

Thanks
   -- Daniel

[Devel] pid namespace isolation broken with powertop

2010-07-29 Thread Daniel Lezcano

Hi all,

I noticed that all the tasks of the host are listed in /proc/timer_stats.
This information is neither virtualized nor isolated within a container.

I was expecting to see only the tasks of the container, with the
corresponding pids.

I am not sure this is critical, but using powertop in the container
shows all the tasks of the system.

Looking at the code in kernel/time/timer.c, it is not obvious how to
fix this isolation: only the pid number is stored in a list, so there is
not enough information to tell which pid namespace an entry belongs to
and compare it against the current one.

I am wondering:

  1) is it worth isolating this information ? (IMHO, yes).
  2) should the stats be stored per pid namespace, or should we add a hash
value + pid namespace as a key in the timer stats list ?
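
To illustrate the second option, here is a small userspace sketch in C where
each timer statistics entry carries a pid namespace identifier next to the
pid, so that a reader only sees the entries of its own namespace. All the
names (timer_entry, ns_id, visible_entries) are made up for the
illustration; this is not how the kernel timer stats code is structured,
and nested pid namespaces are ignored.

```c
#include <stddef.h>

/* Hypothetical entry: the key is (ns_id, pid) instead of the pid alone */
struct timer_entry {
	unsigned int ns_id;	/* identifies the owning pid namespace */
	int pid;		/* pid as seen inside that namespace */
	unsigned long count;	/* number of timer events */
};

/*
 * Copy into 'out' the entries belonging to the reader's namespace.
 * Returns the number of entries copied; 'out' must have room for 'n'.
 */
static size_t visible_entries(const struct timer_entry *all, size_t n,
			      unsigned int reader_ns,
			      struct timer_entry *out)
{
	size_t i, kept = 0;

	for (i = 0; i < n; i++)
		if (all[i].ns_id == reader_ns)
			out[kept++] = all[i];

	return kept;
}
```

A real implementation would also have to decide what the init pid namespace
sees, since the host is expected to keep a view of all the tasks.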

Thanks
   -- Daniel


[Devel] Re: pid namespace isolation broken with powertop

2010-07-29 Thread Daniel Lezcano

On 07/29/2010 04:35 PM, Nathan Lynch wrote:
 On Thu, 2010-07-29 at 14:30 +0200, Daniel Lezcano wrote:

 Hi all,

 I noticed all the tasks of the host are listed in /proc/timer_stats
 These information is not virtualized neither isolated within a container.

 I was expecting to see only the tasks in the container with the
 corresponding pids.

 I am not sure this is something critical, but the usage of powertop in
 the container shows all the tasks of the system.

 Looking at the code in kernel/time/timer.c, it is not obvious how to
 fix this isolation: only the pid number is stored in the list, so
 there is not enough information to tell whether an entry belongs to
 the current pid namespace.

 I am wondering if:

 1) is it worth isolating this information? (IMHO, yes).
 2) should the stats be stored per pid namespace, or should we add a
 hash value + pid namespace as a key in the timer stats list?
  
 Well, powertop is used for monitoring and modifying global system
 characteristics (e.g. processor C states, USB autosuspend) that don't
 make sense to virtualize.  Many events in /proc/timer_stats are
 accounted to pid 0 (swapper/idle).  I think the question is whether a
 pidns-relative slice of timer events will be useful or just confusing.


IMHO it is confusing to see, from a container, the names and pids of 
all the applications running on the whole system (host + containers). 
Even if these applications are not accessible, this tells the 
container what is running on the system, and I think we should 
consider that a security breach.

We could simply hide the content of this file in any pid namespace 
other than the init pid namespace, but that would remove the 
possibility of using powertop to investigate the consumption of a 
specific appliance, however accurate that may be...

Thanks
   -- Daniel


[Devel] Re: [Lxc-users] PHYS type lxc not working

2010-07-19 Thread Daniel Lezcano

On 07/19/2010 09:21 AM, Sabdar wrote:
 Hi,
   I tried the same after upgrading to the latest ip utility and it
 works fine as you specified. With the old ip utility the link add command
 fails. Thanks for the help.

 I tried PHYS with kernel version 2.6.35-rc4 and phys seems to work, but
 with a problem. The working configuration is as follows:

 lxc.utsname = gamma
 lxc.network.type = phys
 lxc.network.flags = up
 *lxc.network.link = eth0*
 *lxc.network.name = eth0*
 lxc.network.ipv4 = 192.168.10.1

 The same configuration, changed as below, fails to release the host
 interface after lxc is stopped:

 lxc.utsname = gamma
 lxc.network.type = phys
 lxc.network.flags = up
 *lxc.network.link = eth1*
 *lxc.network.name = eth0*
 lxc.network.ipv4 = 192.168.10.1

 Instead I see an interface named dev3 on my host after lxc is stopped. Is it
 necessary for network.link and network.name to be the same in the
 case of PHYS?


When the container exits, the kernel moves the physical interface from 
the container's namespace back to the host's.
But within the container the physical interface is named eth0, so when 
it goes back to the host the name conflicts with the host's own eth0 
interface. The kernel therefore creates a new name (e.g. dev3).

When no name is specified in the configuration file, the default name is 
eth0. You are probably right that it would make sense to keep the 
original name of the physical interface instead of defaulting to eth0, 
but at the cost of having to modify generic scripts within the container 
to match the physical interface name.

IMHO, the kernel should reassign the previous name when the interface 
goes back to the host namespace, and if it conflicts then create a new 
name like dev%d.
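
The policy I have in mind can be sketched in a few lines of Python. This is only an illustration of the naming rule, not kernel code; the function name name_on_return and the idea of passing the host names as a set are assumptions made for the example.

```python
import itertools


def name_on_return(previous_name, host_names):
    """Pick the name an interface gets when it moves back to the host.

    previous_name: the name the interface had before entering the container
    host_names: names currently in use in the host namespace
    """
    if previous_name not in host_names:
        return previous_name
    # Conflict: fall back to the first free dev%d name.
    for i in itertools.count():
        candidate = "dev%d" % i
        if candidate not in host_names:
            return candidate


# eth1 was renamed to eth0 inside the container; its old name is still free.
print(name_on_return("eth1", {"lo", "eth0"}))
# The old name is taken on the host, so a dev%d name is generated.
print(name_on_return("eth0", {"lo", "eth0", "dev0"}))
```

In your second configuration the interface would then come back as eth1 instead of dev3.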

Thanks
   -- Daniel

[Devel] Re: CPU and Memory Quota for LXC

2010-07-10 Thread Daniel Lezcano

On 07/09/2010 05:57 AM, Abi Wilson wrote:

 Hi,


Hi Abi,

actually the quota is working well but the /proc information is not 
virtualized, so you will see the memory available on your host, not the 
memory you set with the cgroup.
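
As a rough illustration, a tool that wants the real figure has to combine both sources itself. The Python sketch below takes the text of /proc/meminfo and of the cgroup's memory.limit_in_bytes file and returns the smaller value. The function names are made up for the example, and where the cgroup file lives depends on where the cgroup filesystem is mounted on your host, so no path is hard-coded here.

```python
def meminfo_total_bytes(meminfo_text):
    """Extract MemTotal (reported in kB) from /proc/meminfo content."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def effective_limit_bytes(meminfo_text, cgroup_limit_text):
    """Memory the container can really use: min(host RAM, cgroup limit)."""
    return min(meminfo_total_bytes(meminfo_text), int(cgroup_limit_text.strip()))


# Sample values taken from the figures in your report.
meminfo = "MemTotal:        1028904 kB\nMemFree:          520264 kB\n"
limit = "67108864\n"
print(effective_limit_bytes(meminfo, limit) // (1024 * 1024), "MiB")
```

With your configuration this gives 64 MiB, even though free inside the container still reports the host total.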

 I tried to allocate a quota (CPU and memory) for each container, using the
 lxc config file below. With this config I am able to create and start the
 container, but the RAM displayed is the same as on the host. Do I need to
 modify anything to get quotas working? Does lxc support quota management?
 I am using a 2.6.29 kernel with the following config for lxc, and
 lxc-0.6.5.

 CONFIG_CGROUPS=y
 CONFIG_CGROUP_DEBUG=y
 CONFIG_CGROUP_NS=y
 CONFIG_CGROUP_FREEZER=y
 CONFIG_CGROUP_DEVICE=y
 CONFIG_CPUSETS=y
 CONFIG_PROC_PID_CPUSET=y
 CONFIG_CGROUP_CPUACCT=y
 CONFIG_RESOURCE_COUNTERS=y
 CONFIG_CGROUP_MEM_RES_CTLR=y
 CONFIG_NAMESPACES=y
 CONFIG_UTS_NS=y
 CONFIG_IPC_NS=y
 CONFIG_USER_NS=y
 CONFIG_PID_NS=y
 CONFIG_NET_NS=y


 # veth pair virtual network devices
 lxc.utsname = vsys1

 # === Internal control bridge
 lxc.network.type = veth
 lxc.network.flags = up
 lxc.network.link = br0
 lxc.network.name = eth0
 lxc.network.hwaddr = 00:09:00:09:00:01
 lxc.network.ipv4 = 192.168.18.25

 lxc.cgroup.memory.limit_in_bytes = 67108864
 #lxc.cgroup.shares=5

 *RAM memory of Host:*

              total       used       free     shared    buffers     cached
 Mem:       1028904     472380     556524          0        352     350960
 -/+ buffers/cache:     121068     907836
 Swap:            0          0          0

 *RAM memory of container:*

 Mem:       1028904     508640     520264          0        352     384608
 -/+ buffers/cache:     123680     905224
 Swap:            0          0          0


[Devel] Re: How do containers tie to multiple IP's on a NIC?

2010-07-05 Thread Daniel Lezcano

On 07/05/2010 04:07 PM, Whit Blauvelt wrote:
 On Mon, Jul 05, 2010 at 05:50:38PM +0800, Pavel Labushev wrote:


 What exactly are you trying to achieve? A transparent packet forwarding
 between containers and external networks?
  
 I'm trying to get the overview of what can be achieved, and how. Unless I've
 missed it, there's not much documentation on even moderately complex use of
 containers. Since the capabilities are rapidly advancing, maybe I'm just
 asking the question a few months too early? From the outside, as someone new
 to containers, it looks like a maze where there are a number of entrances,
 each of which may lead approximately to the goal, but some of which may be
 dead ends.

Hi Whit,

maybe this document can help you:

http://lxc.sourceforge.net/doc/sigops/appcr.pdf



[Devel] Re: How do containers tie to multiple IP's on a NIC?

2010-07-04 Thread Daniel Lezcano

On 07/04/2010 05:40 AM, Whit Blauvelt wrote:
 Hi,

 In the containerless world, I often have multiple IPs assigned to a NIC. The
 scant documentation I can find on running containers only ever speaks of
 single IP assignment schemes. Can I have for example a box with a single NIC
 with 8 IPs assigned to it, where the host gets one IP, or perhaps
 alternately can see all 8 to run iptables across, but each of the containers
 can see only whichever IP or IPs are assigned to it?


What container userspace command are you using ? libvirt ? liblxc ? 
unshare --net ?

Thanks
   -- Daniel



[Devel] Re: How do containers tie to multiple IP's on a NIC?

2010-07-04 Thread Daniel Lezcano

On 07/04/2010 09:18 PM, Whit Blauvelt wrote:
 On Sun, Jul 04, 2010 at 06:51:34PM +0200, Daniel Lezcano wrote:


 What container userspace command are you using ? libvirt ? liblxc ?
 unshare --net ?
  
 Which one do you recommend, considering what I'm trying to do with multiple
 IPs on a NIC? I haven't committed to one yet. Which utility do you expect
 future development will favor most? I'll be happy to use any tool which gets
 the job done, preferably one that has a future.


Well  ... please don't consider what I will suggest as preaching for 
its parish ;)
(not sure it is a correct expression. It is a direct translation from 
French)

I would recommend using the lxc tools, preferably version 0.7.1. 
These tools let you do what you expect, that is, assign several IP 
addresses to the same virtual NIC.

They are available at:

http://lxc.sourceforge.net/download/lxc/lxc-0.7.1.tar.gz

An older version is certainly available in your distro.

As a quick start:

Write a configuration file (e.g. lxc.conf):

lxc.network.type=macvlan
lxc.network.link=eth0
lxc.network.flags=up
lxc.network.ipv4=1.2.3.4/24
lxc.network.ipv4=192.168.1.123/24
lxc.network.ipv4=10.0.0.23
lxc.network.ipv4=172.2.1.3

And then lxc-execute -n foo -f lxc.conf /bin/bash

In your shell you should have a new network stack with one interface and 
several IP addresses.

You can create much more complex configurations, but I will let you check 
whether these tools fit your needs.

Thanks
   -- Daniel


[Devel] Re: [PATCH 4/8] af_unix: Allow SO_PEERCRED to work across namespaces.

2010-06-14 Thread Daniel Lezcano

On 06/13/2010 03:30 PM, Eric W. Biederman wrote:
 Use struct pid and struct cred to store the peer credentials on struct
 sock.  This gives enough information to convert the peer credential
 information to a value relative to whatever namespace the socket is in
 at the time.

 This removes nasty surprises when using SO_PEERCRED on socket
 connections where the processes on either side are in different pid and
 user namespaces.

 Signed-off-by: Eric W. Biederman ebied...@xmission.com


Acked-by: Daniel Lezcano daniel.lezc...@free.fr


[Devel] Re: [PATCH 5/8] af_netlink: Add needed scm_destroy after scm_send.

2010-06-14 Thread Daniel Lezcano

On 06/13/2010 03:31 PM, Eric W. Biederman wrote:
 scm_send occasionally allocates state in the scm_cookie, so I have
 modified netlink_sendmsg to guarantee that when scm_send succeeds
 scm_destroy will be called to free that state.

 Signed-off-by: Eric W. Biederman ebied...@xmission.com
 ---


Reviewed-by: Daniel Lezcano daniel.lezc...@free.fr

[Devel] Re: [PATCH 8/8] af_unix: Allow connecting to sockets in other network namespaces.

2010-06-14 Thread Daniel Lezcano

On 06/13/2010 03:35 PM, Eric W. Biederman wrote:
 Remove the restriction that only allows connecting to a unix domain
 socket identified by unix path that is in the same network namespace.

 Crossing network namespaces is always tricky and we did not support
 this at first, because of a strict policy of don't mix the namespaces.
 Later after Pavel proposed this we did not support this because no one
 had performed the audit to make certain using unix domain sockets
 across namespaces is safe.

 What fundamentally makes connecting to af_unix sockets in other
 namespaces safe is that you have to have the proper permissions on
 the unix domain socket inode that lives in the filesystem.  If you
 want strict isolation you just don't create inodes where unfriendlys
 can get at them, or with permissions that allow unfriendlys to open
 them.  All nicely handled for us by the mount namespace and other
 standard file system facilities.

 I looked through unix domain sockets and they are a very controlled
 environment so none of the work that goes on in dev_forward_skb to
 make crossing namespaces safe appears needed, we are not losing
 control of the skb and so do not need to set up the skb to look like
 it is coming in fresh from the outside world.  Further the fields in
 struct unix_skb_parms should not have any problems crossing network
 namespaces.

 Now that we handle SCM_CREDENTIALS in a way that gives usable values
 across namespaces, there do not appear to be any operational
 problems with encouraging the use of unix domain sockets across
 containers either.

 Signed-off-by: Eric W. Biederman ebied...@xmission.com


Acked-by: Daniel Lezcano daniel.lezc...@free.fr


[Devel] Re: VRF-like use of Network Namespaces

2010-06-13 Thread Daniel Lezcano

On 06/13/2010 11:59 AM, Eric W. Biederman wrote:
 Daniel Lezcano daniel.lezc...@free.fr writes:


 On 06/11/2010 04:47 PM, Mathieu Peresse wrote:
  
 Hi,

 [this is related to the use of Eric Biederman's new set of patches for named
 netns / netns switching]

 ok so I successfully modified /sbin/ip. I can now:
 - add/del a new netns by name: ip netns {addns,delns} ns_name
 -  The namespace files are mounted on /var/run/netns/ns_name (so you have to
 mkdir /var/run/netns/ for this to work).


 IMHO, the ip command is not suitable for this, it does not write
 anything to the fs.
  
 It does configuration by all kinds of means.  As far as it goes I
 think the ip command is perfectly suitable in this particular
 situation.  Having a vrf functionality in linux is very desirable.


I agree it would be preferable to centralize everything in the ip command.

But the approach proposed by Mathieu relies on the filesystem. I don't 
think there is another solution, but having the ip command mount, write 
to and read from this directory is a bit weird IMHO, maybe because it 
does not do that today (or I missed something).

For this reason only, I find the ip command not suitable for this.
But I am perfectly fine with the idea in general.

That makes me feel that maybe a 'netnsfs' is missing. IMHO, it is as if 
we forked and stored the pid in /var/run/pid/1234.
On the other hand, the 'ip' command is run as root, so we can assume the 
user knows what he is doing, like the 'mount' command writing to /etc/mtab.

 Getting this into ip has the major advantage that we will have a defacto
 standard, and using IFLA_NET_NS_FD makes a lot more sense if everything
 is in ip.


Sure, if the netdev guys are ok with writing into /var/run/netns, I 
won't argue against it.

 You should write your own command, which can be a perl script using the
 'unshare' command (util-linux package on my distro).

 vrf create name
 vrf delete name
 vrf attach name
 vrf list

 vrf create will bind mount the ns at the place you decided in the script
 (eg. a tmpfs in order to keep the directory consistent across (unclean)
 reboots).

  
 - list netns: ip netns show
 - use /sbin/ip in any named netns: ip -netns ns_name link show

 (rough patch against current git tree attached)

 I want now to move devices across namespaces using their filesystem names
 (instead of using PIDs...). I'm not sure I can do it in userspace with the
 current code yet, can I ?


 No, you can do that only with pids, but why don't you move the devices
 at the create time ?
 You have all the latitude to do that, no ?
  
 Does my published tree not have IFLA_NET_NS_FD in it?

Hmm, AFAICS no.
 I saw there was a rtnetlink attribute to set the netns of a device but it
 uses the PID of a namespace owner to do so... within 'ip' i can refer to
 only one namespace (i.e. the one that 'ip' task_struct-ns_proxy currently
 points to), so I won't be able to move an interface from outside my
 namespace to my namespace...
 I hope my explanation is clear and that this will get some interest... :)


 Your 'create' command can open a fd to its current  netns, unshare a new
 namespace, bind mount it, and then return to the previously saved netns.

  
 BTW is this the right ML to post this on ?


 Well, this is related to a subsystem of the containers, so it is of
 some interest here, but I would suggest sending it to the netdev@ mailing
 list (net...@vger.kernel.org), maybe cc'ing this mailing list.
  
 Anyway it looks like time to post the core of my patchset for review,
 and get things moving on this.

Reviewing in progress ... ;)

Thanks
   -- Daniel

[Devel] Re: VRF-like use of Network Namespaces

2010-06-11 Thread Daniel Lezcano

On 06/11/2010 04:47 PM, Mathieu Peresse wrote:
 Hi,

 [this is related to the use of Eric Biederman's new set of patches for named
 netns / netns switching]

 ok so I successfully modified /sbin/ip. I can now:
 - add/del a new netns by name: ip netns {addns,delns} ns_name
 -  The namespace files are mounted on /var/run/netns/ns_name (so you have to
 mkdir /var/run/netns/ for this to work).


IMHO, the ip command is not suitable for this: it does not write 
anything to the fs.
You should write your own command, which can be a perl script using the 
'unshare' command (util-linux package on my distro).

vrf create name
vrf delete name
vrf attach name
vrf list

vrf create will bind mount the ns at the place you decided in the script 
(eg. a tmpfs in order to keep the directory consistent across (unclean) 
reboots).

 - list netns: ip netns show
 - use /sbin/ip in any named netns: ip -netns ns_name link show

 (rough patch against current git tree attached)

 I want now to move devices across namespaces using their filesystem names
 (instead of using PIDs...). I'm not sure I can do it in userspace with the
 current code yet, can I ?

No, you can do that only with pids, but why don't you move the devices 
at creation time?
You have all the latitude to do that, no?

 I saw there was a rtnetlink attribute to set the netns of a device but it
 uses the PID of a namespace owner to do so... within 'ip' i can refer to
 only one namespace (i.e. the one that 'ip' task_struct-ns_proxy currently
 points to), so I won't be able to move an interface from outside my
 namespace to my namespace...
 I hope my explanation is clear and that this will get some interest... :)


Your 'create' command can open a fd to its current  netns, unshare a new 
namespace, bind mount it, and then return to the previously saved netns.

 BTW is this the right ML to post this on ?


Well, this is related to a subsystem of the containers, so it is of 
some interest here, but I would suggest sending it to the netdev@ mailing 
list (net...@vger.kernel.org), maybe cc'ing this mailing list.


[Devel] Re: VRF-like use of Network Namespaces

2010-06-08 Thread Daniel Lezcano

On 06/08/2010 05:23 PM, Mathieu Peresse wrote:
 Hi all,

 I saw this post from Oct 2008:
 https://lists.linux-foundation.org/pipermail/containers/2008-October/013917.html,
 discussing how to manipulate network namespaces like we do with VRFs
 on
 Cisco routers (e.g. using normal network commands, plus appending vrf
 vrf_name at the end to manipulate the desired VRF), without the need to
 have processes bound to network namespaces.

 Are there any activities on this subject ?


There is a prototype here:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6.33-nsfd-v5.git

The description of what it does:

http://git.kernel.org/?p=linux/kernel/git/ebiederm/linux-2.6.33-nsfd-v5.git;a=commit;h=9c2f86a44d9ca93e78fd8e81a4e2a8c2a4cdb054

I don't know what the status of this patchset is, or whether Eric is 
willing to push it for the next kernel version.

Thanks
   -- Daniel

[Devel] Re: VRF-like use of Network Namespaces

2010-06-08 Thread Daniel Lezcano

On 06/08/2010 07:12 PM, Mathieu Peresse wrote:
 Looks good, thanks ! Has anyone worked to make 'ip' use these facilities ?

 If I understand correctly, from a network resource configuration
 perspective:

 - Creating a persisting namespace ('VRF') is equivalent to: create a
 namespace (using clone()),  which creates a proc entry for that namespace,
 and then bind mount the file so that it stays open.


From the same process: unshare (using unshare()), open 
/proc/self/ns/net, store the fd, unshare again, open /proc/self/ns/net, 
store the fd, ...
This way a single process handles several network namespaces.

To switch from one namespace to another, just use the setns syscall.

Well, this is one example of how to use it; AFAIK you are looking for 
this very specific usage, no?

Thanks
   -- Daniel



[Devel] Re: LXC bringup issue on Fedora

2010-06-02 Thread Daniel Lezcano

On 06/02/2010 07:09 AM, Nirmal Guhan wrote:
 Hi,

 Am trying to get lenny (latest debian from http://ftp.us.debian.org/debian)
 run as a container on Fedora12 with 2.6.32.13 kernel and running into below
 error :

 lxc-start -n lennycont
 SELinux:  Could not open policy file=
 /etc/selinux/targeted/policy/policy.24:  No such file or directory
 INIT: version 2.86 booting
 INIT: Entering runlevel: 2
 Starting enhanced syslogd: rsyslogd.
 Starting periodic command scheduler: crond.

 INIT: Id 4 respawning too fast: disabled for 5 minutes
 INIT: Id 2 respawning too fast: disabled for 5 minutes
 INIT: Id T1 respawning too fast: disabled for 5 minutes
 INIT: Id 1 respawning too fast: disabled for 5 minutes
 INIT: Id 5 respawning too fast: disabled for 5 minutes
 INIT: Id 3 respawning too fast: disabled for 5 minutes
 INIT: Id T0 respawning too fast: disabled for 5 minutes
 INIT: Id 6 respawning too fast: disabled for 5 minutes
 INIT: no more processes left in this runlevel

 My config file is as below :

 lxc.utsname = lennycont
 lxc.network.type = veth
 lxc.network.flags = up
 lxc.network.link = br0
 lxc.network.ipv4 = 128.107.159.180/22
 lxc.network.name = eth0
 lxc.rootfs = /lxc/lenny-chroot
 lxc.mount = /lxc/lenny.fstab
 lxc.tty = 1

 fstab :
 none /lxc/lenny-chroot/dev/pts devpts defaults 0 0
 none /lxc/lenny-chroot/proc    proc   defaults 0 0
 none /lxc/lenny-chroot/sys sysfs  defaults 0 0
 none /lxc/lenny-chroot/dev/shm tmpfs  defaults 0 0

 I googled and found some solutions but none of them worked for me :-( Could
 you please help?


Hi,

It is likely that the number of ttys in your container configuration 
does not match the number of ttys used by the container.
Please ask on lxc-us...@lists.sourceforge.net

Thanks
   -- Daniel
