Guenter Roeck <li...@roeck-us.net> writes:

> Hi all,
>
> the test program attached below almost always results in one of the child
> processes being stuck in zap_pid_ns_processes(). When this happens, I can
> see from test logs that nr_hashed == 2 and init_pids==1, but there is only
> a single thread left in the pid namespace (the one that is stuck).
> Traceback from /proc/<pid>/stack is
>
> [<ffffffff811c385e>] zap_pid_ns_processes+0x1ee/0x2a0
> [<ffffffff810c1ba4>] do_exit+0x10d4/0x1330
> [<ffffffff810c1ee6>] do_group_exit+0x86/0x130
> [<ffffffff810d4347>] get_signal+0x367/0x8a0
> [<ffffffff81046e73>] do_signal+0x83/0xb90
> [<ffffffff81004475>] exit_to_usermode_loop+0x75/0xc0
> [<ffffffff810055b6>] syscall_return_slowpath+0xc6/0xd0
> [<ffffffff81ced488>] entry_SYSCALL_64_fastpath+0xab/0xad
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> After 120 seconds, I get the "hung task" message.
>
> Example from v4.11:
>
> ...
> [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
> [ 3263.379561] Not tainted 4.11.0+ #1
> [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3263.379577] clone D 0 27910 27909 0x00000000
> [ 3263.379587] Call Trace:
> [ 3263.379608] __schedule+0x677/0xda0
> [ 3263.379621] ? pci_mmcfg_check_reserved+0xc0/0xc0
> [ 3263.379634] ? task_stopped_code+0x70/0x70
> [ 3263.379643] schedule+0x4d/0xd0
> [ 3263.379653] zap_pid_ns_processes+0x1ee/0x2a0
> [ 3263.379659] ? copy_pid_ns+0x4d0/0x4d0
> [ 3263.379670] do_exit+0x10d4/0x1330
> ...
>
> The problem is seen in all kernels up to v4.11.
>
> Any idea what might be going on and how to fix the problem ?

Let me see. Reading the code, it looks like we have three tasks; let's call
them main, child1, and child2.

child1 and child2 are in the same thread group, since child1 starts child2
with CLONE_THREAD.

child2 exits first but is ptraced by main, so it is not reaped. Further,
child2 calls do_group_exit, forcing child1 to exit, which makes for fun
races.

A pthread_exit() or syscall(SYS_exit, 0); would skip the group exit and
make the window larger.

child1 exits next, calls zap_pid_ns_processes, and waits there for child2
to be reaped by main.

main is just sitting around doing nothing for 3600 seconds, not reaping
anyone.

I would expect that when main exits everything would be cleaned up, and
that the only real issue is the hung task warning.

Does everything clean up when main exits?

Eric

>
> Thanks,
> Guenter
>
> ---
> This test program was kindly provided by Vovo Yang <vo...@google.com>.
>
> Note that the ptrace() call in child1() is not necessary for the problem
> to be seen, though it seems to make it a bit more likely.

That would appear to just slow things down a smidge, as nothing substantial
happens ptrace-wise until after zap_pid_ns_processes.

> ---
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/ptrace.h>
> #include <errno.h>
> #include <string.h>
> #include <sched.h>
>
> #define STACK_SIZE 65536
>
> int child1(void* arg);
> int child2(void* arg);
>
> int main(int argc, char **argv)
> {
>     int child_pid;
>     char* child_stack = malloc(STACK_SIZE);
>     char* stack_top = child_stack + STACK_SIZE;
>     char command[256];
>
>     child_pid = clone(&child1, stack_top, CLONE_NEWPID, NULL);
>     if (child_pid == -1) {
>         printf("parent: clone failed: %s\n", strerror(errno));
>         return EXIT_FAILURE;
>     }
>     printf("parent: child1_pid: %d\n", child_pid);
>
>     sleep(2);
>     printf("child state, if it's D (disk sleep), the child process is hung\n");
>     sprintf(command, "cat /proc/%d/status | grep State:", child_pid);
>     system(command);
>     sleep(3600);
>     return EXIT_SUCCESS;
> }
>
> int child1(void* arg)
> {
>     int flags = CLONE_FILES | CLONE_FS | CLONE_VM | CLONE_SIGHAND |
>                 CLONE_THREAD;
>     char* child_stack = malloc(STACK_SIZE);
>     char* stack_top = child_stack + STACK_SIZE;
>     long ret;
>
>     ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
>     if (ret == -1) {
>         printf("child1: ptrace failed: %s\n", strerror(errno));
>         return EXIT_FAILURE;
>     }
>
>     ret = clone(&child2, stack_top, flags, NULL);
>     if (ret == -1) {
>         printf("child1: clone failed: %s\n", strerror(errno));
>         return EXIT_FAILURE;
>     }
>     printf("child1: child2 pid: %ld\n", ret);
>
>     sleep(1);
>     printf("child1: end\n");
>     return EXIT_SUCCESS;
> }
>
> int child2(void* arg)
> {
>     long ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
>     if (ret == -1) {
>         printf("child2: ptrace failed: %s\n", strerror(errno));
>         return EXIT_FAILURE;
>     }
>
>     printf("child2: end\n");
>     return EXIT_SUCCESS;
> }
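
For illustration only: the reading above is that main never reaps anything while it sleeps. A minimal sketch of what reaping in main could look like follows. The helper names reap_all_children() and spawn_exiting_child() are hypothetical and not part of the test program, and whether this actually clears the hang in the reported case has not been verified here. __WALL is Linux-specific and asks waitpid() to consider children of every type, including those created with clone().

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not in the original test program): reap every
 * child of the caller until none are left.
 * Returns the number of children that exited or were killed,
 * or -1 on an unexpected waitpid() error.
 * Assumption: ptrace stops are not handled; a stop report is simply
 * skipped, which is enough for a sketch but not for a real tracer.
 */
int reap_all_children(void)
{
    int reaped = 0;

    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, __WALL);

        if (pid > 0) {
            if (WIFEXITED(status) || WIFSIGNALED(status))
                reaped++;
            continue;
        }
        if (errno == EINTR)
            continue;       /* interrupted by a signal, retry */
        if (errno == ECHILD)
            return reaped;  /* nothing left to wait for */
        return -1;
    }
}

/* Hypothetical helper for demonstration: fork a child that exits at once. */
pid_t spawn_exiting_child(int code)
{
    pid_t pid = fork();

    if (pid == 0)
        _exit(code);
    return pid;             /* -1 if fork() failed */
}
```

In the test program, calling something like reap_all_children() in main in place of the sleep(3600) would be one way to check whether the stuck task goes away once its zombies are collected.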