On Thu, May 11, 2017 at 03:23:17PM -0500, Eric W. Biederman wrote: > Guenter Roeck <li...@roeck-us.net> writes: > > > On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote: > >> Guenter Roeck <li...@roeck-us.net> writes: > >> > >> > Hi all, > >> > > >> > the test program attached below almost always results in one of the child > >> > processes being stuck in zap_pid_ns_processes(). When this happens, I can > >> > see from test logs that nr_hashed == 2 and init_pids==1, but there is > >> > only > >> > a single thread left in the pid namespace (the one that is stuck). > >> > Traceback from /proc/<pid>/stack is > >> > > >> > [<ffffffff811c385e>] zap_pid_ns_processes+0x1ee/0x2a0 > >> > [<ffffffff810c1ba4>] do_exit+0x10d4/0x1330 > >> > [<ffffffff810c1ee6>] do_group_exit+0x86/0x130 > >> > [<ffffffff810d4347>] get_signal+0x367/0x8a0 > >> > [<ffffffff81046e73>] do_signal+0x83/0xb90 > >> > [<ffffffff81004475>] exit_to_usermode_loop+0x75/0xc0 > >> > [<ffffffff810055b6>] syscall_return_slowpath+0xc6/0xd0 > >> > [<ffffffff81ced488>] entry_SYSCALL_64_fastpath+0xab/0xad > >> > [<ffffffffffffffff>] 0xffffffffffffffff > >> > > >> > After 120 seconds, I get the "hung task" message. > >> > > >> > Example from v4.11: > >> > > >> > ... > >> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds. > >> > [ 3263.379561] Not tainted 4.11.0+ #1 > >> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > >> > disables this message. > >> > [ 3263.379577] clone D 0 27910 27909 0x00000000 > >> > [ 3263.379587] Call Trace: > >> > [ 3263.379608] __schedule+0x677/0xda0 > >> > [ 3263.379621] ? pci_mmcfg_check_reserved+0xc0/0xc0 > >> > [ 3263.379634] ? task_stopped_code+0x70/0x70 > >> > [ 3263.379643] schedule+0x4d/0xd0 > >> > [ 3263.379653] zap_pid_ns_processes+0x1ee/0x2a0 > >> > [ 3263.379659] ? copy_pid_ns+0x4d0/0x4d0 > >> > [ 3263.379670] do_exit+0x10d4/0x1330 > >> > ... > >> > > >> > The problem is seen in all kernels up to v4.11. > >> > > >> > Any idea what might be going on and how to fix the problem ? > >> > >> Let me see. Reading the code it looks like we have three tasks > >> let's call them main, child1, and child2. > >> > >> child1 and child2 are started using CLONE_THREAD and are > >> thus clones of one another. > >> > >> child2 exits first but is ptraced by main so is not reaped. > >> Further child2 calls do_group_exit forcing child1 to > >> exit making for fun races. > >> > >> A ptread_exit() or syscall(SYS_exit, 0); would skip > >> the group exit and make the window larger. > >> > >> child1 exits next and calls zap_pid_ns_processes and is > >> waiting for child2 to be reaped by main. > >> > >> main is just sitting around doing nothing for 3600 seconds > >> not reaping anyone. > >> > >> I would expect that when main exits everything would be cleaned up > >> and the only real issue is that we have a hung task warning. > >> > >> Does everything cleanup when main exits? > >> > > Yes, it does. > > > > Problem is that if the main task doesn't exit, it hangs forever. > > Chrome OS (where we see the problem in the field, and the application > > is chrome) is configured to reboot on hung tasks - if a task is hung > > for 120 seconds on those systems, it tends to be in a bad shape. > > This makes it a quite severe problem for us. > > We definitely need to change the code in such a way as to not > trigger the hung task warning. > > > I browsed through a number of stack traces - the ones I looked at > > all have the same traceback from do_group_exit(), though the > > caller of do_group_exit() is not always the same (that may depend > > on the kernel version, though). Sometimes it is __wake_up_parent(), > > sometimes it is get_signal_to_deliver(), sometimes it is > > get_signal(). > > > > Are there other conditions besides ptrace where a task isn't reaped ? > > The problem here is that the parent was outside of the pid namespace > and was choosing not to reap the child. The main task could easily > have reaped the child when it received SIGCHLD. > > Arranging for forked children to run in a pid namespace with setns > can also create processes with parents outside the pid namespace, > that may or may not be reaped. > > I suspect it is a bug to allow PTRACE_TRACEME if either the parent > or the child has called exec. That is not enough to block your test > case but I think it would block real world problems. > > For getting to the root cause of this I would see if I could get > a full process list. It would be interesting to know what > the unreaped zombie is. > root 11579 11472 0 13:38 pts/15 00:00:00 sudo ./clone root 11580 11579 0 13:38 pts/15 00:00:00 ./clone root 11581 11580 0 13:38 pts/15 00:00:00 [clone]
I assume this means that it is child1 ? What other information would be needed ? Thanks, Guenter > In the meantime I am going to see what it takes to change the code > so it doesn't trigger the hung task warning. I don't want to change > the actual observed behavior of the code but I can avoid triggering > a warning that does not apply. > > Eric > > > > > > Thanks, > > Guenter > > > >> Eric > >> > >> > >> > > >> > Thanks, > >> > Guenter > >> > > >> > --- > >> > This test program was kindly provided by Vovo Yang <vo...@google.com>. > >> > > >> > Note that the ptrace() call in child1() is not necessary for the problem > >> > to be seen, though it seems to make it a bit more likely. > >> > >> That would appear to just slow things down a smidge. As there is > >> nothing substantial that happens ptrace wise except until after > >> zap_pid_ns_processes. > >> > >> > >> > --- > >> > > >> > #define _GNU_SOURCE > >> > #include <stdio.h> > >> > #include <stdlib.h> > >> > #include <unistd.h> > >> > #include <sys/ptrace.h> > >> > #include <errno.h> > >> > #include <string.h> > >> > #include <sched.h> > >> > > >> > #define STACK_SIZE 65536 > >> > > >> > int child1(void* arg); > >> > int child2(void* arg); > >> > > >> > int main(int argc, char **argv) > >> > { > >> > int child_pid; > >> > char* child_stack = malloc(STACK_SIZE); > >> > char* stack_top = child_stack + STACK_SIZE; > >> > char command[256]; > >> > > >> > child_pid = clone(&child1, stack_top, CLONE_NEWPID, NULL); > >> > if (child_pid == -1) { > >> > printf("parent: clone failed: %s\n", strerror(errno)); > >> > return EXIT_FAILURE; > >> > } > >> > printf("parent: child1_pid: %d\n", child_pid); > >> > > >> > sleep(2); > >> > printf("child state, if it's D (disk sleep), the child process is > >> > hung\n"); > >> > sprintf(command, "cat /proc/%d/status | grep State:", child_pid); > >> > system(command); > >> > sleep(3600); > >> > return EXIT_SUCCESS; > >> > } > >> > > >> > int child1(void* arg) > >> > { > >> > int flags = CLONE_FILES | CLONE_FS | CLONE_VM | CLONE_SIGHAND | > >> > CLONE_THREAD; > >> > char* child_stack = malloc(STACK_SIZE); > >> > char* stack_top = child_stack + STACK_SIZE; > >> > long ret; > >> > > >> > ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL); > >> > if (ret == -1) { > >> > printf("child1: ptrace failed: %s\n", strerror(errno)); > >> > return EXIT_FAILURE; > >> > } > >> > > >> > ret = clone(&child2, stack_top, flags, NULL); > >> > if (ret == -1) { > >> > printf("child1: clone failed: %s\n", strerror(errno)); > >> > return EXIT_FAILURE; > >> > } > >> > printf("child1: child2 pid: %ld\n", ret); > >> > > >> > sleep(1); > >> > printf("child1: end\n"); > >> > return EXIT_SUCCESS; > >> > } > >> > > >> > int child2(void* arg) > >> > { > >> > long ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL); > >> > if (ret == -1) { > >> > printf("child2: ptrace failed: %s\n", strerror(errno)); > >> > return EXIT_FAILURE; > >> > } > >> > > >> > printf("child2: end\n"); > >> > return EXIT_SUCCESS; > >> > }