Re: [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes()

2014-11-25 Thread Oleg Nesterov
forgot to mention...

On 11/25, Oleg Nesterov wrote:
>
> On 11/24, Eric W. Biederman wrote:
> >
> > - We won't wait for an injected process in a pid namespace,
> >   or a processes debugged with gdb to be reaped before the pid
> >   init process exits if we don't wait.
>
> Yes, and I do not see why this is bad, but this is off-topic.
>
> Again, lets discuss this in another thread. This patch doesn't try to
> document the desired semantics, it only tries to explain why zap_pid_ns
> _has_ to wait until EXIT_DEAD (and in fact EXIT_ZOMBIE) tasks go away.

Just in case, as for "processes debugged with gdb" I obviously agree,
but this has nothing to do with EXIT_DEAD threads. zap_pid_ns_processes()
will sleep in sys_wait4() until debugger detaches the tracee and reaps
it due to ignored SIGCHLD.

But again, this is off-topic now.

> Eric. I'd really ask you to take another look. But in any case: thanks for
> looking at this.

Yes, please.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes()

2014-11-25 Thread Oleg Nesterov
On 11/24, Eric W. Biederman wrote:
>
> Oleg Nesterov  writes:
>
> > The comments in zap_pid_ns_processes() are simply wrong, we need
> > to explain how this code actually works.
> >
> > 1. "Ignore SIGCHLD" looks like optimization but it is not, we also
> >need this for correctness.
> >
> > 2. The comment above sys_wait4() could be more clear.
> >
> > 3. The comment about TASK_DEAD children is outdated. Contrary to
> >what it says we do not need to make sure they all go away.
>
> We absolutely do need to wait until they all go away.

I can easily miss something, and that is why I asked you to review this
change. But if I missed something, perhaps we should update the comments?

This "wait until TASK_DEAD children" loop was added to ensure that
child_reaper can't be reaped before other tasks in this pid namespace.
6347e90091041e "pidns: guarantee that the pidns init will be the last pidns
process reaped".

But this was needed because proc_flush_task(pid == 1) called
kern_unmount(proc_mnt). After 0a01f2cc390e10633 "pidns: Make the pidns
proc mount/umount logic obvious" we can rely on schedule_work() in
free_pid(nr_hashed == 0).

So in any case I think that the current comment is outdated.


> - rusage will be wrong if we don't wait.

I already answered in 0/2, let me quote myself:

Do you mean cstime/cutime/c* accounting?

Firstly it is not clear what makes child_reaper special in _this_ 
sense, but
this doesn't matter at all.

The auotoreaping/EXIT_DEAD children are not accounted, only 
wait_task_zombie()
accumulates these counters. (just in case, accounting in 
__exit_signal() is
another thing).


> - We won't wait for an injected process in a pid namespace,
>   or a processes debugged with gdb to be reaped before the pid
>   init process exits if we don't wait.

Yes, and I do not see why this is bad, but this is off-topic.

Again, lets discuss this in another thread. This patch doesn't try to
document the desired semantics, it only tries to explain why zap_pid_ns
_has_ to wait until EXIT_DEAD (and in fact EXIT_ZOMBIE) tasks go away.


> The user visible semantics go from weird to completely insane if we
> relax the rule that the init process is the last process in a pid
> namespace.
>
> I don't see anything approaching a good reason for changing the user
> space semantics.

See above.

> > @@ -190,7 +190,11 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> > /* Don't allow any more processes into the pid namespace */
> > disable_pid_allocation(pid_ns);
>
> 
>
> The above line disables injections of new processes in this pid
> namespace.

Yes, sure. See below.

> > -   /* Ignore SIGCHLD causing any terminated children to autoreap */
> > +   /*
> > +* Ignore SIGCHLD causing any terminated children to autoreap.
> > +* This speeds up the namespace shutdown, plus see the comment
> > +* below.
> > +*/
>
>   This speeds up the namespace shutdown and ensures that after we
>   have waited for all existing zombies there will be no more
>   zombies to wait for.

Afaics, no, see below.

> > -* sys_wait4() above can't reap the TASK_DEAD children.
> > -* Make sure they all go away, see free_pid().
> > +* sys_wait4() above can't reap the TASK_DEAD children but we do not
> > +* really care, we could reparent them to the global init.
>
> We do care.

See above. I only meant that nothing bad can happen from the kernel
perspective.

> > +* But this namespace can also have other tasks injected by setns().
> > +* Again, we do not really need to wait until they are all reaped,
>
> We do, and setns does not matter here.  Injection was stopped way up above.

I think that setns() does matter.

Yes injection was stopped. But a task T can be already injected before
disable_pid_allocation() was called.

Now it is killed (or it could even exit before). To simplify, suppose it
is already EXIT_ZOMBIE, although this doesn't really matter.

The sys_wait4() loop can't see it, it is not our child.

Now suppose that its parent doesn't do wait() but exits. This means that
the exiting parent will try to reparent T to pid_ns->child_reaper.

child_reaper already sleeps in "wait for nr_hashed == init_pids" loop, it
can do nothing with T.

So we rely on autoreaping (forced by ignored SIGCHLD), and this is the
(technical) reason why child_reaper can't exit: pid_ns->child_reaper should
be valid.

No?

> > +* We rely on ignored SIGCHLD, an injected zombie must be autoreaped
> > +* if reparented.
>
> Your new comment is about 90% wrong.

See above.

Eric. I'd really ask you to take another look. But in any case: thanks for
looking at this.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  

Re: [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes()

2014-11-25 Thread Oleg Nesterov
On 11/24, Eric W. Biederman wrote:

 Oleg Nesterov o...@redhat.com writes:

  The comments in zap_pid_ns_processes() are simply wrong, we need
  to explain how this code actually works.
 
  1. Ignore SIGCHLD looks like optimization but it is not, we also
 need this for correctness.
 
  2. The comment above sys_wait4() could be more clear.
 
  3. The comment about TASK_DEAD children is outdated. Contrary to
 what it says we do not need to make sure they all go away.

 We absolutely do need to wait until they all go away.

I can easily miss something, and that is why I asked you to review this
change. But if I missed something, perhaps we should update the comments?

This wait until TASK_DEAD children loop was added to ensure that
child_reaper can't be reaped before other tasks in this pid namespace.
6347e90091041e pidns: guarantee that the pidns init will be the last pidns
process reaped.

But this was needed because proc_flush_task(pid == 1) called
kern_unmount(proc_mnt). After 0a01f2cc390e10633 pidns: Make the pidns
proc mount/umount logic obvious we can rely on schedule_work() in
free_pid(nr_hashed == 0).

So in any case I think that the current comment is outdated.


 - rusage will be wrong if we don't wait.

I already answered in 0/2, let me quote myself:

Do you mean cstime/cutime/c* accounting?

Firstly it is not clear what makes child_reaper special in _this_ 
sense, but
this doesn't matter at all.

The auotoreaping/EXIT_DEAD children are not accounted, only 
wait_task_zombie()
accumulates these counters. (just in case, accounting in 
__exit_signal() is
another thing).


 - We won't wait for an injected process in a pid namespace,
   or a processes debugged with gdb to be reaped before the pid
   init process exits if we don't wait.

Yes, and I do not see why this is bad, but this is off-topic.

Again, lets discuss this in another thread. This patch doesn't try to
document the desired semantics, it only tries to explain why zap_pid_ns
_has_ to wait until EXIT_DEAD (and in fact EXIT_ZOMBIE) tasks go away.


 The user visible semantics go from weird to completely insane if we
 relax the rule that the init process is the last process in a pid
 namespace.

 I don't see anything approaching a good reason for changing the user
 space semantics.

See above.

  @@ -190,7 +190,11 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
  /* Don't allow any more processes into the pid namespace */
  disable_pid_allocation(pid_ns);

 

 The above line disables injections of new processes in this pid
 namespace.

Yes, sure. See below.

  -   /* Ignore SIGCHLD causing any terminated children to autoreap */
  +   /*
  +* Ignore SIGCHLD causing any terminated children to autoreap.
  +* This speeds up the namespace shutdown, plus see the comment
  +* below.
  +*/

   This speeds up the namespace shutdown and ensures that after we
   have waited for all existing zombies there will be no more
   zombies to wait for.

Afaics, no, see below.

  -* sys_wait4() above can't reap the TASK_DEAD children.
  -* Make sure they all go away, see free_pid().
  +* sys_wait4() above can't reap the TASK_DEAD children but we do not
  +* really care, we could reparent them to the global init.

 We do care.

See above. I only meant that nothing bad can happen from the kernel
perspective.

  +* But this namespace can also have other tasks injected by setns().
  +* Again, we do not really need to wait until they are all reaped,

 We do, and setns does not matter here.  Injection was stopped way up above.

I think that setns() does matter.

Yes injection was stopped. But a task T can be already injected before
disable_pid_allocation() was called.

Now it is killed (or it could even exit before). To simplify, suppose it
is already EXIT_ZOMBIE, although this doesn't really matter.

The sys_wait4() loop can't see it, it is not our child.

Now suppose that its parent doesn't do wait() but exits. This means that
the exiting parent will try to reparent T to pid_ns-child_reaper.

child_reaper already sleeps in wait for nr_hashed == init_pids loop, it
can do nothing with T.

So we rely on autoreaping (forced by ignored SIGCHLD), and this is the
(technical) reason why child_reaper can't exit: pid_ns-child_reaper should
be valid.

No?

  +* We rely on ignored SIGCHLD, an injected zombie must be autoreaped
  +* if reparented.

 Your new comment is about 90% wrong.

See above.

Eric. I'd really ask you to take another look. But in any case: thanks for
looking at this.

Oleg.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes()

2014-11-25 Thread Oleg Nesterov
forgot to mention...

On 11/25, Oleg Nesterov wrote:

 On 11/24, Eric W. Biederman wrote:
 
  - We won't wait for an injected process in a pid namespace,
or a processes debugged with gdb to be reaped before the pid
init process exits if we don't wait.

 Yes, and I do not see why this is bad, but this is off-topic.

 Again, lets discuss this in another thread. This patch doesn't try to
 document the desired semantics, it only tries to explain why zap_pid_ns
 _has_ to wait until EXIT_DEAD (and in fact EXIT_ZOMBIE) tasks go away.

Just in case, as for processes debugged with gdb I obviously agree,
but this has nothing to do with EXIT_DEAD threads. zap_pid_ns_processes()
will sleep in sys_wait4() until debugger detaches the tracee and reaps
it due to ignored SIGCHLD.

But again, this is off-topic now.

 Eric. I'd really ask you to take another look. But in any case: thanks for
 looking at this.

Yes, please.

Oleg.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes()

2014-11-24 Thread Eric W. Biederman
Oleg Nesterov  writes:

> The comments in zap_pid_ns_processes() are simply wrong, we need
> to explain how this code actually works.
>
> 1. "Ignore SIGCHLD" looks like optimization but it is not, we also
>need this for correctness.
>
> 2. The comment above sys_wait4() could be more clear.
>
> 3. The comment about TASK_DEAD children is outdated. Contrary to
>what it says we do not need to make sure they all go away.

We absolutely do need to wait until they all go away.

- rusage will be wrong if we don't wait.
- We won't wait for an injected process in a pid namespace,
  or a processes debugged with gdb to be reaped before the pid
  init process exits if we don't wait.

>At the same time, we do need to wait for nr_hashed == init_pids,
>but the reason is quite different and not obvious.

The user visible semantics go from weird to completely insane if we
relax the rule that the init process is the last process in a pid
namespace.

I don't see anything approaching a good reason for changing the user
space semantics.

> Signed-off-by: Oleg Nesterov 
> ---
>  kernel/pid_namespace.c |   23 +++
>  1 files changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index db95d8e..1519b02 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -190,7 +190,11 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>   /* Don't allow any more processes into the pid namespace */
>   disable_pid_allocation(pid_ns);



The above line disables injections of new processes in this pid
namespace.

>  
> - /* Ignore SIGCHLD causing any terminated children to autoreap */
> + /*
> +  * Ignore SIGCHLD causing any terminated children to autoreap.
> +  * This speeds up the namespace shutdown, plus see the comment
> +  * below.
> +  */

This speeds up the namespace shutdown and ensures that after we
have waited for all existing zombies there will be no more
zombies to wait for.

>   spin_lock_irq(>sighand->siglock);
>   me->sighand->action[SIGCHLD - 1].sa.sa_handler = SIG_IGN;
>   spin_unlock_irq(>sighand->siglock);
> @@ -223,15 +227,26 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>   }
>   read_unlock(_lock);
>  
> - /* Firstly reap the EXIT_ZOMBIE children we may have. */
> + /* Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD */
>   do {
>   clear_thread_flag(TIF_SIGPENDING);
>   rc = sys_wait4(-1, NULL, __WALL, NULL);
>   } while (rc != -ECHILD);
>  
>   /*
> -  * sys_wait4() above can't reap the TASK_DEAD children.
> -  * Make sure they all go away, see free_pid().
> +  * sys_wait4() above can't reap the TASK_DEAD children but we do not
> +  * really care, we could reparent them to the global init. 

We do care.


> We could
> +  * exit and reap ->child_reaper even if it is not the last thread in
> +  * this pid_ns, free_pid(nr_hashed == 0) calls proc_cleanup_work(),
> +  * pid_ns can not go away until proc_kill_sb() drops the reference.


> +  * But this namespace can also have other tasks injected by setns().
> +  * Again, we do not really need to wait until they are all reaped,

We do, and setns does not matter here.  Injection was stopped way up above.


> +  * but they can be reparented to us and thus we need to ensure that
> +  * pid->child_reaper is valid until they all go away.


> +  * We rely on ignored SIGCHLD, an injected zombie must be autoreaped
> +  * if reparented.

Your new comment is about 90% wrong.

>*/
>   for (;;) {
>   set_current_state(TASK_UNINTERRUPTIBLE);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes()

2014-11-24 Thread Oleg Nesterov
On 11/24, Oleg Nesterov wrote:
>
> +  * sys_wait4() above can't reap the TASK_DEAD children but we do not
> +  * really care, we could reparent them to the global init. We could
> +  * exit and reap ->child_reaper even if it is not the last thread in
> +  * this pid_ns, free_pid(nr_hashed == 0) calls proc_cleanup_work(),
> +  * pid_ns can not go away until proc_kill_sb() drops the reference.
> +  *
> +  * But this namespace can also have other tasks injected by setns().
^
I meant setns() + fork(), but I hope this is clear.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes()

2014-11-24 Thread Oleg Nesterov
The comments in zap_pid_ns_processes() are simply wrong, we need
to explain how this code actually works.

1. "Ignore SIGCHLD" looks like optimization but it is not, we also
   need this for correctness.

2. The comment above sys_wait4() could be more clear.

3. The comment about TASK_DEAD children is outdated. Contrary to
   what it says we do not need to make sure they all go away.

   At the same time, we do need to wait for nr_hashed == init_pids,
   but the reason is quite different and not obvious.

Signed-off-by: Oleg Nesterov 
---
 kernel/pid_namespace.c |   23 +++
 1 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index db95d8e..1519b02 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -190,7 +190,11 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
/* Don't allow any more processes into the pid namespace */
disable_pid_allocation(pid_ns);
 
-   /* Ignore SIGCHLD causing any terminated children to autoreap */
+   /*
+* Ignore SIGCHLD causing any terminated children to autoreap.
+* This speeds up the namespace shutdown, plus see the comment
+* below.
+*/
spin_lock_irq(>sighand->siglock);
me->sighand->action[SIGCHLD - 1].sa.sa_handler = SIG_IGN;
spin_unlock_irq(>sighand->siglock);
@@ -223,15 +227,26 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
}
read_unlock(_lock);
 
-   /* Firstly reap the EXIT_ZOMBIE children we may have. */
+   /* Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD */
do {
clear_thread_flag(TIF_SIGPENDING);
rc = sys_wait4(-1, NULL, __WALL, NULL);
} while (rc != -ECHILD);
 
/*
-* sys_wait4() above can't reap the TASK_DEAD children.
-* Make sure they all go away, see free_pid().
+* sys_wait4() above can't reap the TASK_DEAD children but we do not
+* really care, we could reparent them to the global init. We could
+* exit and reap ->child_reaper even if it is not the last thread in
+* this pid_ns, free_pid(nr_hashed == 0) calls proc_cleanup_work(),
+* pid_ns can not go away until proc_kill_sb() drops the reference.
+*
+* But this namespace can also have other tasks injected by setns().
+* Again, we do not really need to wait until they are all reaped,
+* but they can be reparented to us and thus we need to ensure that
+* pid->child_reaper is valid until they all go away.
+*
+* We rely on ignored SIGCHLD, an injected zombie must be autoreaped
+* if reparented.
 */
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
-- 
1.5.5.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes()

2014-11-24 Thread Oleg Nesterov
The comments in zap_pid_ns_processes() are simply wrong, we need
to explain how this code actually works.

1. Ignore SIGCHLD looks like optimization but it is not, we also
   need this for correctness.

2. The comment above sys_wait4() could be more clear.

3. The comment about TASK_DEAD children is outdated. Contrary to
   what it says we do not need to make sure they all go away.

   At the same time, we do need to wait for nr_hashed == init_pids,
   but the reason is quite different and not obvious.

Signed-off-by: Oleg Nesterov o...@redhat.com
---
 kernel/pid_namespace.c |   23 +++
 1 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index db95d8e..1519b02 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -190,7 +190,11 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
/* Don't allow any more processes into the pid namespace */
disable_pid_allocation(pid_ns);
 
-   /* Ignore SIGCHLD causing any terminated children to autoreap */
+   /*
+* Ignore SIGCHLD causing any terminated children to autoreap.
+* This speeds up the namespace shutdown, plus see the comment
+* below.
+*/
spin_lock_irq(me-sighand-siglock);
me-sighand-action[SIGCHLD - 1].sa.sa_handler = SIG_IGN;
spin_unlock_irq(me-sighand-siglock);
@@ -223,15 +227,26 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
}
read_unlock(tasklist_lock);
 
-   /* Firstly reap the EXIT_ZOMBIE children we may have. */
+   /* Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD */
do {
clear_thread_flag(TIF_SIGPENDING);
rc = sys_wait4(-1, NULL, __WALL, NULL);
} while (rc != -ECHILD);
 
/*
-* sys_wait4() above can't reap the TASK_DEAD children.
-* Make sure they all go away, see free_pid().
+* sys_wait4() above can't reap the TASK_DEAD children but we do not
+* really care, we could reparent them to the global init. We could
+* exit and reap -child_reaper even if it is not the last thread in
+* this pid_ns, free_pid(nr_hashed == 0) calls proc_cleanup_work(),
+* pid_ns can not go away until proc_kill_sb() drops the reference.
+*
+* But this namespace can also have other tasks injected by setns().
+* Again, we do not really need to wait until they are all reaped,
+* but they can be reparented to us and thus we need to ensure that
+* pid-child_reaper is valid until they all go away.
+*
+* We rely on ignored SIGCHLD, an injected zombie must be autoreaped
+* if reparented.
 */
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
-- 
1.5.5.1

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes()

2014-11-24 Thread Oleg Nesterov
On 11/24, Oleg Nesterov wrote:

 +  * sys_wait4() above can't reap the TASK_DEAD children but we do not
 +  * really care, we could reparent them to the global init. We could
 +  * exit and reap -child_reaper even if it is not the last thread in
 +  * this pid_ns, free_pid(nr_hashed == 0) calls proc_cleanup_work(),
 +  * pid_ns can not go away until proc_kill_sb() drops the reference.
 +  *
 +  * But this namespace can also have other tasks injected by setns().
^
I meant setns() + fork(), but I hope this is clear.

Oleg.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes()

2014-11-24 Thread Eric W. Biederman
Oleg Nesterov o...@redhat.com writes:

 The comments in zap_pid_ns_processes() are simply wrong, we need
 to explain how this code actually works.

 1. Ignore SIGCHLD looks like optimization but it is not, we also
need this for correctness.

 2. The comment above sys_wait4() could be more clear.

 3. The comment about TASK_DEAD children is outdated. Contrary to
what it says we do not need to make sure they all go away.

We absolutely do need to wait until they all go away.

- rusage will be wrong if we don't wait.
- We won't wait for an injected process in a pid namespace,
  or a processes debugged with gdb to be reaped before the pid
  init process exits if we don't wait.

At the same time, we do need to wait for nr_hashed == init_pids,
but the reason is quite different and not obvious.

The user visible semantics go from weird to completely insane if we
relax the rule that the init process is the last process in a pid
namespace.

I don't see anything approaching a good reason for changing the user
space semantics.

 Signed-off-by: Oleg Nesterov o...@redhat.com
 ---
  kernel/pid_namespace.c |   23 +++
  1 files changed, 19 insertions(+), 4 deletions(-)

 diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
 index db95d8e..1519b02 100644
 --- a/kernel/pid_namespace.c
 +++ b/kernel/pid_namespace.c
 @@ -190,7 +190,11 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
   /* Don't allow any more processes into the pid namespace */
   disable_pid_allocation(pid_ns);



The above line disables injections of new processes in this pid
namespace.

  
 - /* Ignore SIGCHLD causing any terminated children to autoreap */
 + /*
 +  * Ignore SIGCHLD causing any terminated children to autoreap.
 +  * This speeds up the namespace shutdown, plus see the comment
 +  * below.
 +  */

This speeds up the namespace shutdown and ensures that after we
have waited for all existing zombies there will be no more
zombies to wait for.

   spin_lock_irq(me-sighand-siglock);
   me-sighand-action[SIGCHLD - 1].sa.sa_handler = SIG_IGN;
   spin_unlock_irq(me-sighand-siglock);
 @@ -223,15 +227,26 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
   }
   read_unlock(tasklist_lock);
  
 - /* Firstly reap the EXIT_ZOMBIE children we may have. */
 + /* Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD */
   do {
   clear_thread_flag(TIF_SIGPENDING);
   rc = sys_wait4(-1, NULL, __WALL, NULL);
   } while (rc != -ECHILD);
  
   /*
 -  * sys_wait4() above can't reap the TASK_DEAD children.
 -  * Make sure they all go away, see free_pid().
 +  * sys_wait4() above can't reap the TASK_DEAD children but we do not
 +  * really care, we could reparent them to the global init. 

We do care.


 We could
 +  * exit and reap -child_reaper even if it is not the last thread in
 +  * this pid_ns, free_pid(nr_hashed == 0) calls proc_cleanup_work(),
 +  * pid_ns can not go away until proc_kill_sb() drops the reference.


 +  * But this namespace can also have other tasks injected by setns().
 +  * Again, we do not really need to wait until they are all reaped,

We do, and setns does not matter here.  Injection was stopped way up above.


 +  * but they can be reparented to us and thus we need to ensure that
 +  * pid-child_reaper is valid until they all go away.


 +  * We rely on ignored SIGCHLD, an injected zombie must be autoreaped
 +  * if reparented.

Your new comment is about 90% wrong.

*/
   for (;;) {
   set_current_state(TASK_UNINTERRUPTIBLE);
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/