On Thu, Jun 01, 2017 at 02:36:38PM -0500, Eric W. Biederman wrote: > Guenter Roeck <li...@roeck-us.net> writes: > > > On Thu, Jun 01, 2017 at 12:08:58PM -0500, Eric W. Biederman wrote: > >> Guenter Roeck <li...@roeck-us.net> writes: > >> > > >> > I think you nailed it. If I drop CLONE_NEWPID from the reproducer I get > >> > a zombie process. > >> > > >> > I guess the only question left is if zap_pid_ns_processes() should (or > >> > could) > >> > somehow detect that situation and return instead of waiting forever. > >> > What do you think ? > >> > >> Any chance you can point me at the chromium code that is performing the > >> ptrace? > >> > >> I want to conduct a review of the kernel semantics to see if the current > >> semantics make it unnecessarily easy to get into hang situations. If > >> the semantics make it really easy to get into a hang situation I want > >> to see if there is anything we can do to delicately change the semantics > >> to avoid the hangs without breaking existing userspace. > >> > > The internal bug should be accessible to you. > > > > https://bugs.chromium.org/p/chromium/issues/detail?id=721298&desc=2 > > > > It has some additional information, and points to the following code in > > Chrome. > > > > https://cs.chromium.org/chromium/src/breakpad/src/client/linux/minidump_writer/linux_ptrace_dumper.cc?rcl=47e51739fd00badbceba5bc26b8abc8bbd530989&l=85 > > > > With the information we have, I don't really have a good idea what we could > > or > > should change in Chrome to make the problem disappear, so I just concluded > > that > > we'll have to live with the forever-sleeping task. > > I believe I see what is happening. The code makes the assumption that a > thread will stay stopped and will not go away once ptrace attach > completes. > > Unfortunately if someone sends SIGKILL to the process or exec sends > SIGKILL to the individual thread then PTRACE_DETACH will fail. > > At which point you can use waitpid to reap the zombie and detach > from the thread. > > So I think the forever-sleeping can be fixed with something as simple > as changing ResumeThread to say: > > // Resumes a thread by detaching from it. > static bool ResumeThread(pid_t pid) { > if (sys_ptrace(PTRACE_DETACH, pid, NULL, NULL) >= 0) > return true; > /* Someone killed the thread? */ > return waitpid(pid, NULL, 0) == pid; > } > > It almost certainly makes sense to fix PTRACE_DETACH in the kernel to > allow this case to work. And odds are good that we could make that > change without breaking anyone. So it is worth a try. >
Do I interpret this correctly as "the above code should work, but currently doesn't" ? Thanks, Guenter