Guenter Roeck <li...@roeck-us.net> writes: > Hi Eric, > > On Fri, May 12, 2017 at 12:33:01PM -0500, Eric W. Biederman wrote: >> Guenter Roeck <li...@roeck-us.net> writes: >> >> > Hi Eric, >> > >> > On Fri, May 12, 2017 at 08:26:27AM -0500, Eric W. Biederman wrote: >> >> Vovo Yang <vo...@google.com> writes: >> >> >> >> > On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman >> >> > <ebied...@xmission.com> wrote: >> >> >> Guenter Roeck <li...@roeck-us.net> writes: >> >> >> >> >> >>> What I know so far is >> >> >>> - We see this condition on a regular basis in the field. Regular is >> >> >>> relative, of course - let's say maybe 1 in a Milion Chromebooks >> >> >>> per day reports a crash because of it. That is not that many, >> >> >>> but it adds up. >> >> >>> - We are able to reproduce the problem with a performance benchmark >> >> >>> which opens 100 chrome tabs. While that is a lot, it should not >> >> >>> result in a kernel hang/crash. >> >> >>> - Vovo proviced the test code last night. I don't know if this is >> >> >>> exactly what is observed in the benchmark, or how it relates to the >> >> >>> benchmark in the first place, but it is the first time we are >> >> >>> actually >> >> >>> able to reliably create a condition where the problem is seen. >> >> >> >> >> >> Thank you. I will be interesting to hear what is happening in the >> >> >> chrome perfomance benchmark that triggers this. >> >> >> >> >> > What's happening in the benchmark: >> >> > 1. A chrome renderer process was created with CLONE_NEWPID >> >> > 2. The process crashed >> >> > 3. Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to >> >> > every >> >> > threads of the crashed process to dump info >> >> > 4. When breakpad detach the crashed process, the crashed process stuck >> >> > in >> >> > zap_pid_ns_processes() >> >> >> >> Very interesting thank you. >> >> >> >> So the question is specifically which interaction is causing this. >> >> >> >> In the test case provided it was a sibling task in the pid namespace >> >> dying and not being reaped. Which may be what is happening with >> >> breakpad. So far I have yet to see kernel bug but I won't rule one out. >> >> >> > >> > I am trying to understand what you are looking for. I would have thought >> > that both the test application as well as the Chrome functionality >> > described above show that there are situations where zap_pid_ns_processes() >> > can get stuck and cause hung task timeouts in conjunction with the use of >> > ptrace(). >> > >> > Your last sentence seems to suggest that you believe that the kernel might >> > do what it is expected to do. Assuming this is the case, what else would >> > you like to see ? A test application which matches exactly the Chrome use >> > case ? We can try to provide that, but I don't entirely understand how >> > that would change the situation. After all, we already know that it is >> > possible to get a thread into this condition, and we already have one >> > means to reproduce it. >> > >> > Replacing TASK_UNINTERRUPTIBLE with TASK_INTERRUPTABLE works for both the >> > test application and the Chrome benchmark. The thread is still stuck in >> > zap_pid_ns_processes(), but it is now in S (sleep) state instead of D, >> > and no longer results in a hung task timeout. It remains in that state >> > until the parent process terminates. I am not entirely happy with it >> > since the processes are still stuck and may pile up over time, but at >> > least it solves the immediate problem for us. >> > >> > Question now is what to do with that solution. We can of course apply >> > it locally to Chrome OS, but I would rather have it upstream - especially >> > since we have to assume that any users of Chrome on Linux, or more >> > generically anyone using ptrace in conjunction with CLONE_NEWPID, may >> > experience the same problem. Right now I have no idea how to get there, >> > though. Can you provide some guidance ? >> >> Apologies for not being clear. I intend to send a pull request with the > > No worries. > >> the TASK_UINTERRUPTIBLE to TASK_INTERRUPTIBLE change to Linus in the >> next week or so with a Cc stable and an appropriate Fixes tag. So the >> fix can be backported. >> > Great, that is good to know. > >> I have a more comprehensive change queued I will probably merge for 4.13 >> already but it just changes what kind of zombies you see. It won't >> remove the ``stuck'' zombies. >> >> So what I am looking for now is: >> Why are things getting stuck in your benchmark? >> >> - Is it a userspace bug? >> >> In which case we can figure out what userspace (aka breakpad) needs >> to do to avoid the problem. >> > I spent some time trying to understand what breakpad is doing. I don't > really see how it can be blamed. It attaches itself to a crashing thread, > gets the information it is looking for, and detaches itself. At the > same time, whatever it does, it seems to me that userspace should not > be able to create such a situation in the first place.
I suspect what is happening is that the thread breakpad is attached to is killed by zap_pid_ns_processes. At which point PTRACE_DETACH won't work because the thread has been killed and waitpid(thread_pid,...) needs to be used instead. Certainly that is what is happening in the reproducer. So I suspect breakpad just needs a waitpid whend PTRACE_DETACH fails or possible just a general loop that listens for SIGCHLD and calls some variant of wait. >> - Is it a kernel bug with ptrace? >> >> There have been a lot of little subtle bugs with ptrace over the >> years so one more would not surprise >> > Good question. I suspect that PTRACE_ATTACH (or PTRACE_TRACEME) > changes the reaper to be the calling process, and a subsequent > PTRACE_DETACH or exit of the process calling PTRACE_TRACEME doesn't > change it back or changes it to the process group owner. Is that > what happens, and if so is it what should happen ? I don't know (yet). Not the reaper but ptrace does change the parent to be the tracer. PTRACE_DETACH won't work if the process is not stopped waiting for the tracer. And being killed and being a zombie is the ultimate not waiting for the tracer state. What needs to happen is for the tracer (aka breakpad) to call waitpid. >> So I am just looking to make certain we fix the root issue not just >> the hung task timeout warning. >> > Makes sense. At least with TASK_INTERRUPTIBLE, I suspect we are just > fixing the symptom, but on the other side that isn't really my area > of expertise, so I don't really know. > > One additional data point: If I time the threads in the test code by > adding various calls to sleep(), it is still more or less random if the > task status ends up in Z or D, meaning the child thread is sometimes > reaped and sometimes not. I would have expected it to be well defined. > I may be wrong, but that smells racy to me. If you have the threads exit with the exit system call rather than what in the kernel winds up being do_group_exit it won't be racy. The easy version is to use pthreads from child1 to create child2 and have them exit with pthread_exit. Alternatively you can wrap call syscall(SYS_exit, exit_code); which should also do the trick. > Anyway, I'll spend some more time on this. I'll let you know if I find > anything. Thanks. Eric