Kirill Tkhai <ktk...@virtuozzo.com> writes: > This patch prohibits pid allocation till child_reaper > of pid namespace is set, and it makes possible and safe > to get just unshared pid_ns from "/proc/[pid]/ns/pid_for_children" > file. This may be useful to determine user_ns of such a created > pid_ns, which is not possible now. > > It was prohibited till now, because the architecture of pid namespaces > assumes child reaper is the firstly created process of the namespace, > and it initializes pid_namespace::proc_mnt. Child reaper creation > mustn't race with creation of another processes from this namespace, > otherwise a process with pid > 1 may die before pid_namespace::proc_mnt > is populated and it will get a null pointer dereference in proc_flush_task(). > Also, child reaper mustn't die before processes from the namespace.
This patch introduces the possibility that two or more processes may have the same pid namespace (with no processes) as pid_ns_for_children. Which means you can now have a race for the first pid in alloc_pid. Making it indeterminant who allocates the init process. Which is not acceptable. It is not acceptable on two grounds. 1) It is a bogus user space semantic. Because userspace needs to know who allocates init. 2) It is horrible for maintenance becuase now the code has to be very clever to deal with a case that no one cares about. Which is a general formula for buggy code. Eric > The patch prevents such races. It allows to alloc_pid() only if > ns->child_reaper is already set, and it guarantees, that > pid_namespace::proc_mnt is already populated. Also, we do the assignment > under the tasklist_lock in the copy_process() stage, when it can't fail. > This guarantees the child_reaper will be hashed before the concurrent > process, so the concurrent process can't die before it. When child reaper > dies before the concurrent hashes to task list, fork() of the concurrent > will aborts as it's prohibited after commit 3fd372262166: > "pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes()". > So, we can't safely allow to get "/proc/[pid]/ns/pid_for_children" > since it's created. > > Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com> > CC: Andrew Morton <a...@linux-foundation.org> > CC: "Eric W. Biederman" <ebied...@xmission.com> > CC: Oleg Nesterov <o...@redhat.com> > CC: Andy Lutomirski <l...@kernel.org> > CC: Serge Hallyn <se...@hallyn.com> > CC: Michal Hocko <mho...@suse.com> > CC: Andrei Vagin <ava...@openvz.org> > CC: Cyrill Gorcunov <gorcu...@openvz.org> > CC: Mike Rapoport <r...@linux.vnet.ibm.com> > CC: Ingo Molnar <mi...@kernel.org> > CC: Peter Zijlstra <pet...@infradead.org> > --- > kernel/pid.c | 3 ++- > kernel/pid_namespace.c | 9 --------- > 2 files changed, 2 insertions(+), 10 deletions(-) > > diff --git a/kernel/pid.c b/kernel/pid.c > index fd1cde1e4576..eeeb01fdd87c 100644 > --- a/kernel/pid.c > +++ b/kernel/pid.c > @@ -334,7 +334,8 @@ struct pid *alloc_pid(struct pid_namespace *ns) > > upid = pid->numbers + ns->level; > spin_lock_irq(&pidmap_lock); > - if (!(ns->nr_hashed & PIDNS_HASH_ADDING)) > + if (!(ns->nr_hashed & PIDNS_HASH_ADDING) || > + (!ns->child_reaper && !is_child_reaper(pid))) > goto out_unlock; > for ( ; upid >= pid->numbers; --upid) { > hlist_add_head_rcu(&upid->pid_chain, > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c > index 74a5a7255b4d..51dd1d490542 100644 > --- a/kernel/pid_namespace.c > +++ b/kernel/pid_namespace.c > @@ -385,15 +385,6 @@ static struct ns_common *pidns_for_children_get(struct > task_struct *task) > } > task_unlock(task); > > - if (ns) { > - read_lock(&tasklist_lock); > - if (!ns->child_reaper) { > - put_pid_ns(ns); > - ns = NULL; > - } > - read_unlock(&tasklist_lock); > - } > - > return ns ? &ns->ns : NULL; > } >