Re: Threads stuck in zap_pid_ns_processes()

2017-06-01 Thread Eric W. Biederman
Guenter Roeck  writes:

> On Thu, Jun 01, 2017 at 02:36:38PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck  writes:
>> 
>> > On Thu, Jun 01, 2017 at 12:08:58PM -0500, Eric W. Biederman wrote:
>> >> Guenter Roeck  writes:
>> >> >
>> >> > I think you nailed it. If I drop CLONE_NEWPID from the reproducer I get
>> >> > a zombie process.
>> >> >
>> >> > I guess the only question left is if zap_pid_ns_processes() should
>> >> > (or could) somehow detect that situation and return instead of
>> >> > waiting forever.
>> >> > What do you think ?
>> >> 
>> >> Any chance you can point me at the chromium code that is performing the
>> >> ptrace?
>> >> 
>> >> I want to conduct a review of the kernel semantics to see if the current
>> >> semantics make it unnecessarily easy to get into hang situations.  If
>> >> the semantics make it really easy to get into a hang situation I want
>> >> to see if there is anything we can do to delicately change the semantics
>> >> to avoid the hangs without breaking existing userspace.
>> >> 
>> > The internal bug should be accessible to you.
>> >
>> > https://bugs.chromium.org/p/chromium/issues/detail?id=721298=2
>> >
>> > It has some additional information, and points to the following code in 
>> > Chrome.
>> >
>> > https://cs.chromium.org/chromium/src/breakpad/src/client/linux/minidump_writer/linux_ptrace_dumper.cc?rcl=47e51739fd00badbceba5bc26b8abc8bbd530989=85
>> >
>> > With the information we have, I don't really have a good idea what we
>> > could or should change in Chrome to make the problem disappear, so I
>> > just concluded that we'll have to live with the forever-sleeping task.
>> 
>> I believe I see what is happening.  The code makes the assumption that a
>> thread will stay stopped and will not go away once ptrace attach
>> completes.
>> 
>> Unfortunately if someone sends SIGKILL to the process or exec sends
>> SIGKILL to the individual thread then PTRACE_DETACH will fail.
>> 
>> At which point you can use waitpid to reap the zombie and detach
>> from the thread.
>> 
>> So I think the forever-sleeping can be fixed with something as simple
>> as changing ResumeThread to say:
>> 
>> // Resumes a thread by detaching from it.
>> static bool ResumeThread(pid_t pid) {
>>   if (sys_ptrace(PTRACE_DETACH, pid, NULL, NULL) >= 0)
>>     return true;
>>   /* Someone killed the thread? */
>>   return waitpid(pid, NULL, 0) == pid;
>> }
>> 
>> It almost certainly makes sense to fix PTRACE_DETACH in the kernel to
>> allow this case to work.  And odds are good that we could make that
>> change without breaking anyone.  So it is worth a try.
>> 
>
> Do I interpret this correctly as "the above code should work, but currently
> doesn't" ?

I added the early exit and the fallback waitpid clause.  So I am saying
with a trivial modification the code can be made to work.

Eric



Re: Threads stuck in zap_pid_ns_processes()

2017-06-01 Thread Guenter Roeck
On Thu, Jun 01, 2017 at 02:36:38PM -0500, Eric W. Biederman wrote:
> Guenter Roeck  writes:
> 
> > On Thu, Jun 01, 2017 at 12:08:58PM -0500, Eric W. Biederman wrote:
> >> Guenter Roeck  writes:
> >> >
> >> > I think you nailed it. If I drop CLONE_NEWPID from the reproducer I get
> >> > a zombie process.
> >> >
> >> > I guess the only question left is if zap_pid_ns_processes() should
> >> > (or could) somehow detect that situation and return instead of
> >> > waiting forever.
> >> > What do you think ?
> >> 
> >> Any chance you can point me at the chromium code that is performing the
> >> ptrace?
> >> 
> >> I want to conduct a review of the kernel semantics to see if the current
> >> semantics make it unnecessarily easy to get into hang situations.  If
> >> the semantics make it really easy to get into a hang situation I want
> >> to see if there is anything we can do to delicately change the semantics
> >> to avoid the hangs without breaking existing userspace.
> >> 
> > The internal bug should be accessible to you.
> >
> > https://bugs.chromium.org/p/chromium/issues/detail?id=721298=2
> >
> > It has some additional information, and points to the following code in 
> > Chrome.
> >
> > https://cs.chromium.org/chromium/src/breakpad/src/client/linux/minidump_writer/linux_ptrace_dumper.cc?rcl=47e51739fd00badbceba5bc26b8abc8bbd530989=85
> >
> > With the information we have, I don't really have a good idea what we
> > could or should change in Chrome to make the problem disappear, so I
> > just concluded that we'll have to live with the forever-sleeping task.
> 
> I believe I see what is happening.  The code makes the assumption that a
> thread will stay stopped and will not go away once ptrace attach
> completes.
> 
> Unfortunately if someone sends SIGKILL to the process or exec sends
> SIGKILL to the individual thread then PTRACE_DETACH will fail.
> 
> At which point you can use waitpid to reap the zombie and detach
> from the thread.
> 
> So I think the forever-sleeping can be fixed with something as simple
> as changing ResumeThread to say:
> 
> // Resumes a thread by detaching from it.
> static bool ResumeThread(pid_t pid) {
>   if (sys_ptrace(PTRACE_DETACH, pid, NULL, NULL) >= 0)
>     return true;
>   /* Someone killed the thread? */
>   return waitpid(pid, NULL, 0) == pid;
> }
> 
> It almost certainly makes sense to fix PTRACE_DETACH in the kernel to
> allow this case to work.  And odds are good that we could make that
> change without breaking anyone.  So it is worth a try.
> 

Do I interpret this correctly as "the above code should work, but currently
doesn't" ?

Thanks,
Guenter



Re: Threads stuck in zap_pid_ns_processes()

2017-06-01 Thread Eric W. Biederman
Guenter Roeck  writes:

> On Thu, Jun 01, 2017 at 12:08:58PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck  writes:
>> >
>> > I think you nailed it. If I drop CLONE_NEWPID from the reproducer I get
>> > a zombie process.
>> >
> >> > I guess the only question left is if zap_pid_ns_processes() should
> >> > (or could) somehow detect that situation and return instead of
> >> > waiting forever.
>> > What do you think ?
>> 
>> Any chance you can point me at the chromium code that is performing the
>> ptrace?
>> 
>> I want to conduct a review of the kernel semantics to see if the current
>> semantics make it unnecessarily easy to get into hang situations.  If
>> the semantics make it really easy to get into a hang situation I want
>> to see if there is anything we can do to delicately change the semantics
>> to avoid the hangs without breaking existing userspace.
>> 
> The internal bug should be accessible to you.
>
> https://bugs.chromium.org/p/chromium/issues/detail?id=721298=2
>
> It has some additional information, and points to the following code in 
> Chrome.
>
> https://cs.chromium.org/chromium/src/breakpad/src/client/linux/minidump_writer/linux_ptrace_dumper.cc?rcl=47e51739fd00badbceba5bc26b8abc8bbd530989=85
>
> With the information we have, I don't really have a good idea what we could or
> > should change in Chrome to make the problem disappear, so I just
> > concluded that we'll have to live with the forever-sleeping task.

I believe I see what is happening.  The code makes the assumption that a
thread will stay stopped and will not go away once ptrace attach
completes.

Unfortunately if someone sends SIGKILL to the process or exec sends
SIGKILL to the individual thread then PTRACE_DETACH will fail.

At which point you can use waitpid to reap the zombie and detach
from the thread.

So I think the forever-sleeping can be fixed with something as simple
as changing ResumeThread to say:

// Resumes a thread by detaching from it.
static bool ResumeThread(pid_t pid) {
  if (sys_ptrace(PTRACE_DETACH, pid, NULL, NULL) >= 0)
    return true;
  /* Someone killed the thread? */
  return waitpid(pid, NULL, 0) == pid;
}

It almost certainly makes sense to fix PTRACE_DETACH in the kernel to
allow this case to work.  And odds are good that we could make that
change without breaking anyone.  So it is worth a try.

Eric




Re: Threads stuck in zap_pid_ns_processes()

2017-06-01 Thread Guenter Roeck
On Thu, Jun 01, 2017 at 12:08:58PM -0500, Eric W. Biederman wrote:
> Guenter Roeck  writes:
> >
> > I think you nailed it. If I drop CLONE_NEWPID from the reproducer I get
> > a zombie process.
> >
> > I guess the only question left is if zap_pid_ns_processes() should
> > (or could) somehow detect that situation and return instead of
> > waiting forever.
> > What do you think ?
> 
> Any chance you can point me at the chromium code that is performing the
> ptrace?
> 
> I want to conduct a review of the kernel semantics to see if the current
> semantics make it unnecessarily easy to get into hang situations.  If
> the semantics make it really easy to get into a hang situation I want
> to see if there is anything we can do to delicately change the semantics
> to avoid the hangs without breaking existing userspace.
> 
The internal bug should be accessible to you.

https://bugs.chromium.org/p/chromium/issues/detail?id=721298=2

It has some additional information, and points to the following code in Chrome.

https://cs.chromium.org/chromium/src/breakpad/src/client/linux/minidump_writer/linux_ptrace_dumper.cc?rcl=47e51739fd00badbceba5bc26b8abc8bbd530989=85

With the information we have, I don't really have a good idea what we could or
should change in Chrome to make the problem disappear, so I just concluded that
we'll have to live with the forever-sleeping task.

Thanks,
Guenter



Re: Threads stuck in zap_pid_ns_processes()

2017-06-01 Thread Eric W. Biederman
Guenter Roeck  writes:
>
> I think you nailed it. If I drop CLONE_NEWPID from the reproducer I get
> a zombie process.
>
> I guess the only question left is if zap_pid_ns_processes() should (or could)
> somehow detect that situation and return instead of waiting forever.
> What do you think ?

Any chance you can point me at the chromium code that is performing the
ptrace?

I want to conduct a review of the kernel semantics to see if the current
semantics make it unnecessarily easy to get into hang situations.  If
the semantics make it really easy to get into a hang situation I want
to see if there is anything we can do to delicately change the semantics
to avoid the hangs without breaking existing userspace.

We have a real problem in exec which has similar semantics and as long
as I am looking at one I figure I should look at the other.

Eric
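[Editor's note] For context, the wait these mails discuss sits at the tail of zap_pid_ns_processes(). A paraphrased sketch of that loop (based on kernel/pid_namespace.c of this era, not verbatim kernel source): the pid-namespace init reaps its children and then sleeps until every pid in the namespace is gone, so a zombie that only an outside tracer can release keeps it asleep indefinitely.

```c
/* Paraphrased sketch of the tail of zap_pid_ns_processes(),
 * kernel/pid_namespace.c -- not verbatim kernel source. */
for (;;) {
        /* TASK_UNINTERRUPTIBLE here is what produces the D state and
         * the hung task warnings; the fix discussed below changes it
         * to TASK_INTERRUPTIBLE. */
        set_current_state(TASK_UNINTERRUPTIBLE);
        if (pid_ns->nr_hashed == init_pids)
                break;          /* every task in the namespace is reaped */
        schedule();             /* wake again when a child is reaped */
}
__set_current_state(TASK_RUNNING);
```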



Re: Threads stuck in zap_pid_ns_processes()

2017-05-13 Thread Eric W. Biederman
Guenter Roeck  writes:

> On 05/12/2017 01:03 PM, Eric W. Biederman wrote:
>> Guenter Roeck  writes:
>>
>>> Hi Eric,
>>>
>>> On Fri, May 12, 2017 at 12:33:01PM -0500, Eric W. Biederman wrote:
 Guenter Roeck  writes:

> Hi Eric,
>
> On Fri, May 12, 2017 at 08:26:27AM -0500, Eric W. Biederman wrote:
>> Vovo Yang  writes:
>>
>>> On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
>>>  wrote:
 Guenter Roeck  writes:

> What I know so far is
> - We see this condition on a regular basis in the field. Regular is
>   relative, of course - let's say maybe 1 in a Million Chromebooks
>   per day reports a crash because of it. That is not that many,
>   but it adds up.
> - We are able to reproduce the problem with a performance benchmark
>   which opens 100 chrome tabs. While that is a lot, it should not
>   result in a kernel hang/crash.
> - Vovo provided the test code last night. I don't know if this is
>   exactly what is observed in the benchmark, or how it relates to the
>   benchmark in the first place, but it is the first time we are
>   actually able to reliably create a condition where the problem is seen.

 Thank you.  It will be interesting to hear what is happening in the
 chrome performance benchmark that triggers this.

>>> What's happening in the benchmark:
>>> 1. A chrome renderer process was created with CLONE_NEWPID
>>> 2. The process crashed
>>> 3. Chrome's breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach
>>>   to every thread of the crashed process to dump info
>>> 4. When breakpad detaches from the crashed process, the crashed process
>>>   gets stuck in zap_pid_ns_processes()
>>
>> Very interesting thank you.
>>
>> So the question is specifically which interaction is causing this.
>>
>> In the test case provided it was a sibling task in the pid namespace
>> dying and not being reaped.  Which may be what is happening with
>> breakpad.  So far I have yet to see a kernel bug, but I won't rule one out.
>>
>
> I am trying to understand what you are looking for. I would have thought
> that both the test application as well as the Chrome functionality
> described above show that there are situations where
> zap_pid_ns_processes() can get stuck and cause hung task timeouts in
> conjunction with the use of ptrace().
>
> Your last sentence seems to suggest that you believe that the kernel might
> do what it is expected to do. Assuming this is the case, what else would
> you like to see ? A test application which matches exactly the Chrome use
> case ? We can try to provide that, but I don't entirely understand how
> that would change the situation. After all, we already know that it is
> possible to get a thread into this condition, and we already have one
> means to reproduce it.
>
> Replacing TASK_UNINTERRUPTIBLE with TASK_INTERRUPTIBLE works for both the
> test application and the Chrome benchmark. The thread is still stuck in
> zap_pid_ns_processes(), but it is now in S (sleep) state instead of D,
> and no longer results in a hung task timeout. It remains in that state
> until the parent process terminates. I am not entirely happy with it
> since the processes are still stuck and may pile up over time, but at
> least it solves the immediate problem for us.
>
> Question now is what to do with that solution. We can of course apply
> it locally to Chrome OS, but I would rather have it upstream - especially
> since we have to assume that any users of Chrome on Linux, or more
> generically anyone using ptrace in conjunction with CLONE_NEWPID, may
> experience the same problem. Right now I have no idea how to get there,
> though. Can you provide some guidance ?

 Apologies for not being clear.  I intend to send a pull request with the
>>>
>>> No worries.
>>>
 the TASK_UNINTERRUPTIBLE to TASK_INTERRUPTIBLE change to Linus in the
 next week or so with a Cc stable and an appropriate Fixes tag.  So the
 fix can be backported.

>>> Great, that is good to know.
>>>
 I have a more comprehensive change queued that I will probably merge for 4.13
 already but it just changes what kind of zombies you see.  It won't
 remove the ``stuck'' zombies.

 So what I am looking for now is:
 Why are things getting stuck in your benchmark?

 -  Is it a userspace bug?

   In which case we can figure out what userspace (aka breakpad) needs
to do to avoid the problem.

>>> I spent some time trying to understand what breakpad is doing. I don't
>>> really see how it can be blamed. It attaches itself to a crashing thread,
>>> gets the information it is looking for, and detaches itself. At

remove the ``stuck'' zombies.

So what I am looking for now is:
Why are things getting stuck in your benchmark?

-  Is it a userspace bug?

  In which case we can figure out what userspace (aka breakpad) needs
   to do to avoid the problem.


I spent some time trying to understand what breakpad is doing. I don't
really see how it can be blamed. It attaches itself to a crashing thread,
gets the information it is looking for, and detaches itself. At the
same time, whatever it does, it seems to me that userspace should not
be able to create such a situation in the first place.


I suspect what is happening is that the thread breakpad is attached to
is killed by zap_pid_ns_processes.  At which point PTRACE_DETACH won't
work because the thread has been killed and waitpid(thread_pid,...)
needs to be used instead.



I think you nailed it. If I drop CLONE_NEWPID from the reproducer I get
a zombie process.

I guess the only question left is if zap_pid_ns_processes() should (or could)
somehow detect that situation and return 

Re: Threads stuck in zap_pid_ns_processes()

2017-05-13 Thread Guenter Roeck

On 05/12/2017 01:03 PM, Eric W. Biederman wrote:

Guenter Roeck  writes:


Hi Eric,

On Fri, May 12, 2017 at 12:33:01PM -0500, Eric W. Biederman wrote:

Guenter Roeck  writes:


Hi Eric,

On Fri, May 12, 2017 at 08:26:27AM -0500, Eric W. Biederman wrote:

Vovo Yang  writes:


On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
 wrote:

Guenter Roeck  writes:


What I know so far is
- We see this condition on a regular basis in the field. Regular is
  relative, of course - let's say maybe 1 in a million Chromebooks
  per day reports a crash because of it. That is not that many,
  but it adds up.
- We are able to reproduce the problem with a performance benchmark
  which opens 100 chrome tabs. While that is a lot, it should not
  result in a kernel hang/crash.
- Vovo provided the test code last night. I don't know if this is
  exactly what is observed in the benchmark, or how it relates to the
  benchmark in the first place, but it is the first time we are actually
  able to reliably create a condition where the problem is seen.


Thank you.  It will be interesting to hear what is happening in the
Chrome performance benchmark that triggers this.


What's happening in the benchmark:
1. A chrome renderer process was created with CLONE_NEWPID
2. The process crashed
3. Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to every
  thread of the crashed process to dump info
4. When breakpad detaches from the crashed process, the crashed process gets
  stuck in zap_pid_ns_processes()


Very interesting thank you.

So the question is specifically which interaction is causing this.

In the test case provided it was a sibling task in the pid namespace
dying and not being reaped.  Which may be what is happening with
breakpad.  So far I have yet to see a kernel bug but I won't rule one out.



I am trying to understand what you are looking for. I would have thought
that both the test application as well as the Chrome functionality
described above show that there are situations where zap_pid_ns_processes()
can get stuck and cause hung task timeouts in conjunction with the use of
ptrace().

Your last sentence seems to suggest that you believe that the kernel might
do what it is expected to do. Assuming this is the case, what else would
you like to see ? A test application which matches exactly the Chrome use
case ? We can try to provide that, but I don't entirely understand how
that would change the situation. After all, we already know that it is
possible to get a thread into this condition, and we already have one
means to reproduce it.

Replacing TASK_UNINTERRUPTIBLE with TASK_INTERRUPTIBLE works for both the
test application and the Chrome benchmark. The thread is still stuck in
zap_pid_ns_processes(), but it is now in S (sleep) state instead of D,
and no longer results in a hung task timeout. It remains in that state
until the parent process terminates. I am not entirely happy with it
since the processes are still stuck and may pile up over time, but at
least it solves the immediate problem for us.

Question now is what to do with that solution. We can of course apply
it locally to Chrome OS, but I would rather have it upstream - especially
since we have to assume that any users of Chrome on Linux, or more
generically anyone using ptrace in conjunction with CLONE_NEWPID, may
experience the same problem. Right now I have no idea how to get there,
though. Can you provide some guidance ?


Apologies for not being clear.  I intend to send a pull request with the


No worries.


TASK_UNINTERRUPTIBLE to TASK_INTERRUPTIBLE change to Linus in the
next week or so with a Cc stable and an appropriate Fixes tag.  So the
fix can be backported.


Great, that is good to know.


I have a more comprehensive change queued I will probably merge for 4.13
already but it just changes what kind of zombies you see.  It won't
remove the ``stuck'' zombies.

So what I am looking for now is:
Why are things getting stuck in your benchmark?

-  Is it a userspace bug?

  In which case we can figure out what userspace (aka breakpad) needs
   to do to avoid the problem.


I spent some time trying to understand what breakpad is doing. I don't
really see how it can be blamed. It attaches itself to a crashing thread,
gets the information it is looking for, and detaches itself. At the
same time, whatever it does, it seems to me that userspace should not
be able to create such a situation in the first place.


I suspect what is happening is that the thread breakpad is attached to
is killed by zap_pid_ns_processes.  At which point PTRACE_DETACH won't
work because the thread has been killed and waitpid(thread_pid,...)
needs to be used instead.



I think you nailed it. If I drop CLONE_NEWPID from the reproducer I get
a zombie process.

I guess the only question left is if zap_pid_ns_processes() should (or could)
somehow detect that situation and return instead of waiting forever.
What do you think ?

Thanks,
Guenter



Re: Threads stuck in zap_pid_ns_processes()

2017-05-12 Thread Eric W. Biederman
Guenter Roeck  writes:

> Hi Eric,
>
> On Fri, May 12, 2017 at 12:33:01PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck  writes:
>> 
>> > Hi Eric,
>> >
>> > On Fri, May 12, 2017 at 08:26:27AM -0500, Eric W. Biederman wrote:
>> >> Vovo Yang  writes:
>> >> 
>> >> > On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
>> >> >  wrote:
>> >> >> Guenter Roeck  writes:
>> >> >>
>> >> >>> What I know so far is
>> >> >>> - We see this condition on a regular basis in the field. Regular is
>> >> >>>   relative, of course - let's say maybe 1 in a million Chromebooks
>> >> >>>   per day reports a crash because of it. That is not that many,
>> >> >>>   but it adds up.
>> >> >>> - We are able to reproduce the problem with a performance benchmark
>> >> >>>   which opens 100 chrome tabs. While that is a lot, it should not
>> >> >>>   result in a kernel hang/crash.
>> >> >>> - Vovo provided the test code last night. I don't know if this is
>> >> >>>   exactly what is observed in the benchmark, or how it relates to the
>> >> >>>   benchmark in the first place, but it is the first time we are 
>> >> >>> actually
>> >> >>>   able to reliably create a condition where the problem is seen.
>> >> >>
>> >> >> Thank you.  It will be interesting to hear what is happening in the
>> >> >> Chrome performance benchmark that triggers this.
>> >> >>
>> >> > What's happening in the benchmark:
>> >> > 1. A chrome renderer process was created with CLONE_NEWPID
>> >> > 2. The process crashed
>> >> > 3. Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to
>> >> > every thread of the crashed process to dump info
>> >> > 4. When breakpad detaches from the crashed process, the crashed process
>> >> > gets stuck in zap_pid_ns_processes()
>> >> 
>> >> Very interesting thank you.
>> >> 
>> >> So the question is specifically which interaction is causing this.
>> >> 
>> >> In the test case provided it was a sibling task in the pid namespace
>> >> dying and not being reaped.  Which may be what is happening with
>> >> breakpad.  So far I have yet to see kernel bug but I won't rule one out.
>> >> 
>> >
>> > I am trying to understand what you are looking for. I would have thought
>> > that both the test application as well as the Chrome functionality
>> > described above show that there are situations where zap_pid_ns_processes()
>> > can get stuck and cause hung task timeouts in conjunction with the use of
>> > ptrace().
>> >
>> > Your last sentence seems to suggest that you believe that the kernel might
>> > do what it is expected to do. Assuming this is the case, what else would
>> > you like to see ? A test application which matches exactly the Chrome use
>> > case ? We can try to provide that, but I don't entirely understand how
>> > that would change the situation. After all, we already know that it is
>> > possible to get a thread into this condition, and we already have one
>> > means to reproduce it.
>> >
>> > Replacing TASK_UNINTERRUPTIBLE with TASK_INTERRUPTIBLE works for both the
>> > test application and the Chrome benchmark. The thread is still stuck in
>> > zap_pid_ns_processes(), but it is now in S (sleep) state instead of D,
>> > and no longer results in a hung task timeout. It remains in that state
>> > until the parent process terminates. I am not entirely happy with it
>> > since the processes are still stuck and may pile up over time, but at
>> > least it solves the immediate problem for us.
>> >
>> > Question now is what to do with that solution. We can of course apply
>> > it locally to Chrome OS, but I would rather have it upstream - especially
>> > since we have to assume that any users of Chrome on Linux, or more
>> > generically anyone using ptrace in conjunction with CLONE_NEWPID, may
>> > experience the same problem. Right now I have no idea how to get there,
>> > though. Can you provide some guidance ?
>> 
>> Apologies for not being clear.  I intend to send a pull request with the
>
> No worries.
>
>> TASK_UNINTERRUPTIBLE to TASK_INTERRUPTIBLE change to Linus in the
>> next week or so with a Cc stable and an appropriate Fixes tag.  So the
>> fix can be backported.
>> 
> Great, that is good to know.
>
>> I have a more comprehensive change queued I will probably merge for 4.13
>> already but it just changes what kind of zombies you see.  It won't
>> remove the ``stuck'' zombies.
>> 
>> So what I am looking for now is:
>> Why are things getting stuck in your benchmark?
>> 
>> -  Is it a userspace bug?
>> 
>>   In which case we can figure out what userspace (aka breakpad) needs
>>to do to avoid the problem.
>>
> I spent some time trying to understand what breakpad is doing. I don't
> really see how it can be blamed. It attaches itself to a crashing thread,
> gets the information it is looking for, and detaches itself. At the
> same time, whatever it does, it seems to me that userspace should not
> be able to create such a situation in the first place.

I suspect what is happening is that the thread breakpad is attached to
is killed by zap_pid_ns_processes.  At which point PTRACE_DETACH won't
work because the thread has been killed and waitpid(thread_pid,...)
needs to be used instead.

Re: Threads stuck in zap_pid_ns_processes()

2017-05-12 Thread Guenter Roeck
Hi Eric,

On Fri, May 12, 2017 at 12:33:01PM -0500, Eric W. Biederman wrote:
> Guenter Roeck  writes:
> 
> > Hi Eric,
> >
> > On Fri, May 12, 2017 at 08:26:27AM -0500, Eric W. Biederman wrote:
> >> Vovo Yang  writes:
> >> 
> >> > On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
> >> >  wrote:
> >> >> Guenter Roeck  writes:
> >> >>
> >> >>> What I know so far is
> >> >>> - We see this condition on a regular basis in the field. Regular is
> >> >>>   relative, of course - let's say maybe 1 in a million Chromebooks
> >> >>>   per day reports a crash because of it. That is not that many,
> >> >>>   but it adds up.
> >> >>> - We are able to reproduce the problem with a performance benchmark
> >> >>>   which opens 100 chrome tabs. While that is a lot, it should not
> >> >>>   result in a kernel hang/crash.
> >> >>> - Vovo provided the test code last night. I don't know if this is
> >> >>>   exactly what is observed in the benchmark, or how it relates to the
> >> >>>   benchmark in the first place, but it is the first time we are 
> >> >>> actually
> >> >>>   able to reliably create a condition where the problem is seen.
> >> >>
> >> >> Thank you.  It will be interesting to hear what is happening in the
> >> >> Chrome performance benchmark that triggers this.
> >> >>
> >> > What's happening in the benchmark:
> >> > 1. A chrome renderer process was created with CLONE_NEWPID
> >> > 2. The process crashed
> >> > 3. Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to
> >> > every thread of the crashed process to dump info
> >> > 4. When breakpad detaches from the crashed process, the crashed process
> >> > gets stuck in zap_pid_ns_processes()
> >> 
> >> Very interesting thank you.
> >> 
> >> So the question is specifically which interaction is causing this.
> >> 
> >> In the test case provided it was a sibling task in the pid namespace
> >> dying and not being reaped.  Which may be what is happening with
> >> breakpad.  So far I have yet to see kernel bug but I won't rule one out.
> >> 
> >
> > I am trying to understand what you are looking for. I would have thought
> > that both the test application as well as the Chrome functionality
> > described above show that there are situations where zap_pid_ns_processes()
> > can get stuck and cause hung task timeouts in conjunction with the use of
> > ptrace().
> >
> > Your last sentence seems to suggest that you believe that the kernel might
> > do what it is expected to do. Assuming this is the case, what else would
> > you like to see ? A test application which matches exactly the Chrome use
> > case ? We can try to provide that, but I don't entirely understand how
> > that would change the situation. After all, we already know that it is
> > possible to get a thread into this condition, and we already have one
> > means to reproduce it.
> >
> > Replacing TASK_UNINTERRUPTIBLE with TASK_INTERRUPTIBLE works for both the
> > test application and the Chrome benchmark. The thread is still stuck in
> > zap_pid_ns_processes(), but it is now in S (sleep) state instead of D,
> > and no longer results in a hung task timeout. It remains in that state
> > until the parent process terminates. I am not entirely happy with it
> > since the processes are still stuck and may pile up over time, but at
> > least it solves the immediate problem for us.
> >
> > Question now is what to do with that solution. We can of course apply
> > it locally to Chrome OS, but I would rather have it upstream - especially
> > since we have to assume that any users of Chrome on Linux, or more
> > generically anyone using ptrace in conjunction with CLONE_NEWPID, may
> > experience the same problem. Right now I have no idea how to get there,
> > though. Can you provide some guidance ?
> 
> Apologies for not being clear.  I intend to send a pull request with the

No worries.

> TASK_UNINTERRUPTIBLE to TASK_INTERRUPTIBLE change to Linus in the
> next week or so with a Cc stable and an appropriate Fixes tag.  So the
> fix can be backported.
> 
Great, that is good to know.

> I have a more comprehensive change queued I will probably merge for 4.13
> already but it just changes what kind of zombies you see.  It won't
> remove the ``stuck'' zombies.
> 
> So what I am looking for now is:
> Why are things getting stuck in your benchmark?
> 
> -  Is it a userspace bug?
> 
>   In which case we can figure out what userspace (aka breakpad) needs
>to do to avoid the problem.
>
I spent some time trying to understand what breakpad is doing. I don't
really see how it can be blamed. It attaches itself to a crashing thread,
gets the information it is looking for, and detaches itself. At the
same time, whatever it does, it seems to me that userspace should not
be able to create such a situation in the first place.

> -  Is it a kernel bug with ptrace?
> 
>There have been a lot of little subtle bugs with ptrace over the
>years so one more would not surprise
> 
Good question. I suspect that 

Re: Threads stuck in zap_pid_ns_processes()

2017-05-12 Thread Eric W. Biederman
Guenter Roeck  writes:

> Hi Eric,
>
> On Fri, May 12, 2017 at 08:26:27AM -0500, Eric W. Biederman wrote:
>> Vovo Yang  writes:
>> 
>> > On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
>> >  wrote:
>> >> Guenter Roeck  writes:
>> >>
>> >>> What I know so far is
>> >>> - We see this condition on a regular basis in the field. Regular is
>> >>>   relative, of course - let's say maybe 1 in a million Chromebooks
>> >>>   per day reports a crash because of it. That is not that many,
>> >>>   but it adds up.
>> >>> - We are able to reproduce the problem with a performance benchmark
>> >>>   which opens 100 chrome tabs. While that is a lot, it should not
>> >>>   result in a kernel hang/crash.
>> >>> - Vovo provided the test code last night. I don't know if this is
>> >>>   exactly what is observed in the benchmark, or how it relates to the
>> >>>   benchmark in the first place, but it is the first time we are actually
>> >>>   able to reliably create a condition where the problem is seen.
>> >>
>> >> Thank you.  It will be interesting to hear what is happening in the
>> >> Chrome performance benchmark that triggers this.
>> >>
>> > What's happening in the benchmark:
>> > 1. A chrome renderer process was created with CLONE_NEWPID
>> > 2. The process crashed
>> > 3. Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to
>> > every thread of the crashed process to dump info
>> > 4. When breakpad detaches from the crashed process, the crashed process
>> > gets stuck in zap_pid_ns_processes()
>> 
>> Very interesting thank you.
>> 
>> So the question is specifically which interaction is causing this.
>> 
>> In the test case provided it was a sibling task in the pid namespace
>> dying and not being reaped.  Which may be what is happening with
>> breakpad.  So far I have yet to see kernel bug but I won't rule one out.
>> 
>
> I am trying to understand what you are looking for. I would have thought
> that both the test application as well as the Chrome functionality
> described above show that there are situations where zap_pid_ns_processes()
> can get stuck and cause hung task timeouts in conjunction with the use of
> ptrace().
>
> Your last sentence seems to suggest that you believe that the kernel might
> do what it is expected to do. Assuming this is the case, what else would
> you like to see ? A test application which matches exactly the Chrome use
> case ? We can try to provide that, but I don't entirely understand how
> that would change the situation. After all, we already know that it is
> possible to get a thread into this condition, and we already have one
> means to reproduce it.
>
> Replacing TASK_UNINTERRUPTIBLE with TASK_INTERRUPTIBLE works for both the
> test application and the Chrome benchmark. The thread is still stuck in
> zap_pid_ns_processes(), but it is now in S (sleep) state instead of D,
> and no longer results in a hung task timeout. It remains in that state
> until the parent process terminates. I am not entirely happy with it
> since the processes are still stuck and may pile up over time, but at
> least it solves the immediate problem for us.
>
> Question now is what to do with that solution. We can of course apply
> it locally to Chrome OS, but I would rather have it upstream - especially
> since we have to assume that any users of Chrome on Linux, or more
> generically anyone using ptrace in conjunction with CLONE_NEWPID, may
> experience the same problem. Right now I have no idea how to get there,
> though. Can you provide some guidance ?

Apologies for not being clear.  I intend to send a pull request with
the TASK_UNINTERRUPTIBLE to TASK_INTERRUPTIBLE change to Linus in the
next week or so with a Cc stable and an appropriate Fixes tag.  So the
fix can be backported.

I have a more comprehensive change queued I will probably merge for 4.13
already but it just changes what kind of zombies you see.  It won't
remove the ``stuck'' zombies.

So what I am looking for now is:
Why are things getting stuck in your benchmark?

-  Is it a userspace bug?

  In which case we can figure out what userspace (aka breakpad) needs
   to do to avoid the problem.
   
-  Is it a kernel bug with ptrace?

   There have been a lot of little subtle bugs with ptrace over the
   years so one more would not surprise

So I am just looking to make certain we fix the root issue not just
the hung task timeout warning.

Eric


Re: Threads stuck in zap_pid_ns_processes()

2017-05-12 Thread Eric W. Biederman
Guenter Roeck  writes:

> Hi Eric,
>
> On Fri, May 12, 2017 at 08:26:27AM -0500, Eric W. Biederman wrote:
>> Vovo Yang  writes:
>> 
>> > On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
>> >  wrote:
>> >> Guenter Roeck  writes:
>> >>
>> >>> What I know so far is
>> >>> - We see this condition on a regular basis in the field. Regular is
>> >>>   relative, of course - let's say maybe 1 in a million Chromebooks
>> >>>   per day reports a crash because of it. That is not that many,
>> >>>   but it adds up.
>> >>> - We are able to reproduce the problem with a performance benchmark
>> >>>   which opens 100 chrome tabs. While that is a lot, it should not
>> >>>   result in a kernel hang/crash.
>> >>> - Vovo provided the test code last night. I don't know if this is
>> >>>   exactly what is observed in the benchmark, or how it relates to the
>> >>>   benchmark in the first place, but it is the first time we are actually
>> >>>   able to reliably create a condition where the problem is seen.
>> >>
>> >> Thank you.  It will be interesting to hear what is happening in the
>> >> Chrome performance benchmark that triggers this.
>> >>
>> > What's happening in the benchmark:
>> > 1. A chrome renderer process was created with CLONE_NEWPID
>> > 2. The process crashed
>> > 3. Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to
>> >   every thread of the crashed process to dump info
>> > 4. When breakpad detaches from the crashed process, the crashed process gets
>> >   stuck in zap_pid_ns_processes()
>> 
>> Very interesting thank you.
>> 
>> So the question is specifically which interaction is causing this.
>> 
>> In the test case provided it was a sibling task in the pid namespace
>> dying and not being reaped.  Which may be what is happening with
>> breakpad.  So far I have yet to see a kernel bug, but I won't rule one out.
>> 
>
> I am trying to understand what you are looking for. I would have thought
> that both the test application as well as the Chrome functionality
> described above show that there are situations where zap_pid_ns_processes()
> can get stuck and cause hung task timeouts in conjunction with the use of
> ptrace().
>
> Your last sentence seems to suggest that you believe that the kernel might
> do what it is expected to do. Assuming this is the case, what else would
> you like to see ? A test application which matches exactly the Chrome use
> case ? We can try to provide that, but I don't entirely understand how
> that would change the situation. After all, we already know that it is
> possible to get a thread into this condition, and we already have one
> means to reproduce it.
>
> Replacing TASK_UNINTERRUPTIBLE with TASK_INTERRUPTIBLE works for both the
> test application and the Chrome benchmark. The thread is still stuck in
> zap_pid_ns_processes(), but it is now in S (sleep) state instead of D,
> and no longer results in a hung task timeout. It remains in that state
> until the parent process terminates. I am not entirely happy with it
> since the processes are still stuck and may pile up over time, but at
> least it solves the immediate problem for us.
>
> Question now is what to do with that solution. We can of course apply
> it locally to Chrome OS, but I would rather have it upstream - especially
> since we have to assume that any users of Chrome on Linux, or more
> generically anyone using ptrace in conjunction with CLONE_NEWPID, may
> experience the same problem. Right now I have no idea how to get there,
> though. Can you provide some guidance ?

Apologies for not being clear.  I intend to send a pull request with
the TASK_UNINTERRUPTIBLE to TASK_INTERRUPTIBLE change to Linus in the
next week or so, with a Cc to stable and an appropriate Fixes tag, so the
fix can be backported.

I have a more comprehensive change queued that I will probably merge for
4.13, but it just changes what kind of zombies you see.  It won't
remove the ``stuck'' zombies.

So what I am looking for now is:
Why are things getting stuck in your benchmark?

-  Is it a userspace bug?

   In which case we can figure out what userspace (aka breakpad) needs
   to do to avoid the problem.

-  Is it a kernel bug with ptrace?

   There have been a lot of little subtle bugs with ptrace over the
   years, so one more would not surprise me.

So I am just looking to make certain we fix the root issue not just
the hung task timeout warning.

Eric
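
For reference, the sleep being discussed is the wait loop at the end of
zap_pid_ns_processes(); against v4.11 the proposed one-line change looks
roughly like this (a sketch of the idea, not the actual commit):

```diff
 	for (;;) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+		set_current_state(TASK_INTERRUPTIBLE);
 		if (pid_ns->nr_hashed == init_pids)
 			break;
 		schedule();
 	}
 	__set_current_state(TASK_RUNNING);
```

A task sleeping in TASK_INTERRUPTIBLE is reported as S rather than D, so the
hung task detector (which only watches TASK_UNINTERRUPTIBLE tasks) no longer
fires, and the sleep stops inflating the load average.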


Re: Threads stuck in zap_pid_ns_processes()

2017-05-12 Thread Guenter Roeck
Hi Eric,

On Fri, May 12, 2017 at 08:26:27AM -0500, Eric W. Biederman wrote:
> Vovo Yang  writes:
> 
> > On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
> >  wrote:
> >> Guenter Roeck  writes:
> >>
> >>> What I know so far is
> >>> - We see this condition on a regular basis in the field. Regular is
> >>>   relative, of course - let's say maybe 1 in a million Chromebooks
> >>>   per day reports a crash because of it. That is not that many,
> >>>   but it adds up.
> >>> - We are able to reproduce the problem with a performance benchmark
> >>>   which opens 100 chrome tabs. While that is a lot, it should not
> >>>   result in a kernel hang/crash.
> >>> - Vovo provided the test code last night. I don't know if this is
> >>>   exactly what is observed in the benchmark, or how it relates to the
> >>>   benchmark in the first place, but it is the first time we are actually
> >>>   able to reliably create a condition where the problem is seen.
> >>
> >> Thank you.  It will be interesting to hear what is happening in the
> >> Chrome performance benchmark that triggers this.
> >>
> > What's happening in the benchmark:
> > 1. A chrome renderer process was created with CLONE_NEWPID
> > 2. The process crashed
> > 3. Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to
> >   every thread of the crashed process to dump info
> > 4. When breakpad detaches from the crashed process, the crashed process gets
> >   stuck in zap_pid_ns_processes()
> 
> Very interesting thank you.
> 
> So the question is specifically which interaction is causing this.
> 
> In the test case provided it was a sibling task in the pid namespace
> dying and not being reaped.  Which may be what is happening with
> breakpad.  So far I have yet to see a kernel bug, but I won't rule one out.
> 

I am trying to understand what you are looking for. I would have thought
that both the test application as well as the Chrome functionality
described above show that there are situations where zap_pid_ns_processes()
can get stuck and cause hung task timeouts in conjunction with the use of
ptrace().

Your last sentence seems to suggest that you believe that the kernel might
do what it is expected to do. Assuming this is the case, what else would
you like to see ? A test application which matches exactly the Chrome use
case ? We can try to provide that, but I don't entirely understand how
that would change the situation. After all, we already know that it is
possible to get a thread into this condition, and we already have one
means to reproduce it.

Replacing TASK_UNINTERRUPTIBLE with TASK_INTERRUPTIBLE works for both the
test application and the Chrome benchmark. The thread is still stuck in
zap_pid_ns_processes(), but it is now in S (sleep) state instead of D,
and no longer results in a hung task timeout. It remains in that state
until the parent process terminates. I am not entirely happy with it
since the processes are still stuck and may pile up over time, but at
least it solves the immediate problem for us.

Question now is what to do with that solution. We can of course apply
it locally to Chrome OS, but I would rather have it upstream - especially
since we have to assume that any users of Chrome on Linux, or more
generically anyone using ptrace in conjunction with CLONE_NEWPID, may
experience the same problem. Right now I have no idea how to get there,
though. Can you provide some guidance ?

Thanks,
Guenter


Re: Threads stuck in zap_pid_ns_processes()

2017-05-12 Thread Eric W. Biederman
Vovo Yang  writes:

> On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
>  wrote:
>> Guenter Roeck  writes:
>>
>>> What I know so far is
>>> - We see this condition on a regular basis in the field. Regular is
>>>   relative, of course - let's say maybe 1 in a million Chromebooks
>>>   per day reports a crash because of it. That is not that many,
>>>   but it adds up.
>>> - We are able to reproduce the problem with a performance benchmark
>>>   which opens 100 chrome tabs. While that is a lot, it should not
>>>   result in a kernel hang/crash.
>>> - Vovo provided the test code last night. I don't know if this is
>>>   exactly what is observed in the benchmark, or how it relates to the
>>>   benchmark in the first place, but it is the first time we are actually
>>>   able to reliably create a condition where the problem is seen.
>>
>> Thank you.  It will be interesting to hear what is happening in the
>> Chrome performance benchmark that triggers this.
>>
> What's happening in the benchmark:
> 1. A chrome renderer process was created with CLONE_NEWPID
> 2. The process crashed
> 3. Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to every
>   thread of the crashed process to dump info
> 4. When breakpad detaches from the crashed process, the crashed process gets
>   stuck in zap_pid_ns_processes()

Very interesting thank you.

So the question is specifically which interaction is causing this.

In the test case provided it was a sibling task in the pid namespace
dying and not being reaped.  Which may be what is happening with
breakpad.  So far I have yet to see a kernel bug, but I won't rule one out.

Eric


Re: Threads stuck in zap_pid_ns_processes()

2017-05-12 Thread Vovo Yang
On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
 wrote:
> Guenter Roeck  writes:
>
>> On Thu, May 11, 2017 at 04:25:23PM -0500, Eric W. Biederman wrote:
>>> Guenter Roeck  writes:
>>>
>>> > On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
>>> >> Guenter Roeck  writes:
>>> >>
>>> >> > Hi all,
>>> >> >
>>> >> > the test program attached below almost always results in one of the 
>>> >> > child
>>> >> > processes being stuck in zap_pid_ns_processes(). When this happens, I 
>>> >> > can
>>> >> > see from test logs that nr_hashed == 2 and init_pids==1, but there is 
>>> >> > only
>>> >> > a single thread left in the pid namespace (the one that is stuck).
>>> >> > Traceback from /proc//stack is
>>> >> >
>>> >> > [] zap_pid_ns_processes+0x1ee/0x2a0
>>> >> > [] do_exit+0x10d4/0x1330
>>> >> > [] do_group_exit+0x86/0x130
>>> >> > [] get_signal+0x367/0x8a0
>>> >> > [] do_signal+0x83/0xb90
>>> >> > [] exit_to_usermode_loop+0x75/0xc0
>>> >> > [] syscall_return_slowpath+0xc6/0xd0
>>> >> > [] entry_SYSCALL_64_fastpath+0xab/0xad
>>> >> > [] 0x
>>> >> >
>>> >> > After 120 seconds, I get the "hung task" message.
>>> >> >
>>> >> > Example from v4.11:
>>> >> >
>>> >> > ...
>>> >> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 
>>> >> > seconds.
>>> >> > [ 3263.379561]   Not tainted 4.11.0+ #1
>>> >> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>> >> > disables this message.
>>> >> > [ 3263.379577] clone   D0 27910  27909 0x
>>> >> > [ 3263.379587] Call Trace:
>>> >> > [ 3263.379608]  __schedule+0x677/0xda0
>>> >> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
>>> >> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
>>> >> > [ 3263.379643]  schedule+0x4d/0xd0
>>> >> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
>>> >> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
>>> >> > [ 3263.379670]  do_exit+0x10d4/0x1330
>>> >> > ...
>>> >> >
>>> >> > The problem is seen in all kernels up to v4.11.
>>> >> >
>>> >> > Any idea what might be going on and how to fix the problem ?
>>> >>
>>> >> Let me see.  Reading the code it looks like we have three tasks;
>>> >> let's call them main, child1, and child2.
>>> >>
>>> >> child1 and child2 are started using CLONE_THREAD and are
>>> >> thus clones of one another.
>>> >>
>>> >> child2 exits first but is ptraced by main so is not reaped.
>>> >>Further child2 calls do_group_exit forcing child1 to
>>> >>exit making for fun races.
>>> >>
>>> >>A pthread_exit() or syscall(SYS_exit, 0); would skip
>>> >>the group exit and make the window larger.
>>> >>
>>> >> child1 exits next and calls zap_pid_ns_processes and is
>>> >>waiting for child2 to be reaped by main.
>>> >>
>>> >> main is just sitting around doing nothing for 3600 seconds
>>> >> not reaping anyone.
>>> >>
>>> >> I would expect that when main exits everything would be cleaned up
>>> >> and the only real issue is that we have a hung task warning.
>>> >>
>>> >> Does everything cleanup when main exits?
>>> >>
>>> >
>>> > As an add-on to my previous mail: I added a function to count
>>> > the number of threads in the pid namespace, using next_pidmap().
>>> > Even though nr_hashed == 2, only the hanging thread is still
>>> > present.
>>>
>>> For your testcase?  I suspect you copied the code from
>>> zap_pid_ns_processes and skipped pid 1.  It is going to be pid 1 that is
>>> calling zap_pid_ns_processes.
>>>
>>
>> Almost. Something along the line of
>>
>>   count = 0;
>>   nr = next_pidmap(pid_ns, 0);
>>   while (nr > 0) {
>>   count++;
>>   nr = next_pidmap(pid_ns, nr);
>>   }
>>
>> only I also call sched_show_task() for each thread, and the only
>> one printed is the one that hangs in zap_pid_ns_processes().
>>
>>> > Is there maybe a better way to terminate the wait loop than
>>> > with "if (pid_ns->nr_hashed == init_pids)" ?
>>>
>>> Not right now.  nr_hashed counts the number of "struct pid"s that are
>>> attached to processes and in the hash table for looking up tasks.
>>>
>>> Waiting until that drops to just the last task in the pid namespace
>>> really does do a good job of counting what we are looking for.  It
>>> is a little tricky because if our task_pid(p) != task_tgid(p) then
>>> nr_hashed will be 2.  For various reasons.
>>>
>>> That said it isn't as obvious as it should be and I would like
>>> to improve that.
>>>
>>> I think the immediate solution here is to just use TASK_INTERRUPTIBLE
>>> instead of TASK_UNINTERRUPTIBLE.  This really is not a short term
>>> disk sleep nor anything guaranteed to complete in any sort of
>>> finite time.  Plus the hung task timeout warning only triggers
>>> if your task state is TASK_UNINTERRUPTIBLE so this will prevent
>>> the warning and keep a long delay in zap_pid_ns_processes from
>>> bumping up your load average.
>>>
>>> That still leaves the question of what strange state
>>> you are getting into but 

Re: Threads stuck in zap_pid_ns_processes()

2017-05-11 Thread Eric W. Biederman
Guenter Roeck  writes:

> On Thu, May 11, 2017 at 04:25:23PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck  writes:

>> > As an add-on to my previous mail: I added a function to count
>> > the number of threads in the pid namespace, using next_pidmap().
>> > Even though nr_hashed == 2, only the hanging thread is still
>> > present.
>> 
>> For your testcase?  I suspect you copied the code from
>> zap_pid_ns_processes and skipped pid 1.  It is going to be pid 1 that is
>> calling zap_pid_ns_processes.
>> 
>
> Almost. Something along the line of
>
>   count = 0;
>   nr = next_pidmap(pid_ns, 0);
>   while (nr > 0) {
>   count++;
>   nr = next_pidmap(pid_ns, nr);
>   }
>
> only I also call sched_show_task() for each thread, and the only
> one printed is the one that hangs in zap_pid_ns_processes().

The function sched_show_task() does:

	if (!try_get_task_stack(p))
		return;

Which won't work on a zombie that has already released its stack.
Which is exactly what child2 should be at that point.

Eric



Re: Threads stuck in zap_pid_ns_processes()

2017-05-11 Thread Eric W. Biederman
Guenter Roeck  writes:

> On Thu, May 11, 2017 at 04:25:23PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck  writes:
>> 
>> > On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
>> >> Guenter Roeck  writes:
>> >> 
>> >> > Hi all,
>> >> >
>> >> > the test program attached below almost always results in one of the 
>> >> > child
>> >> > processes being stuck in zap_pid_ns_processes(). When this happens, I 
>> >> > can
>> >> > see from test logs that nr_hashed == 2 and init_pids==1, but there is 
>> >> > only
>> >> > a single thread left in the pid namespace (the one that is stuck).
>> >> > Traceback from /proc//stack is
>> >> >
>> >> > [] zap_pid_ns_processes+0x1ee/0x2a0
>> >> > [] do_exit+0x10d4/0x1330
>> >> > [] do_group_exit+0x86/0x130
>> >> > [] get_signal+0x367/0x8a0
>> >> > [] do_signal+0x83/0xb90
>> >> > [] exit_to_usermode_loop+0x75/0xc0
>> >> > [] syscall_return_slowpath+0xc6/0xd0
>> >> > [] entry_SYSCALL_64_fastpath+0xab/0xad
>> >> > [] 0x
>> >> >
>> >> > After 120 seconds, I get the "hung task" message.
>> >> >
>> >> > Example from v4.11:
>> >> >
>> >> > ...
>> >> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
>> >> > [ 3263.379561]   Not tainted 4.11.0+ #1
>> >> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>> >> > disables this message.
>> >> > [ 3263.379577] clone   D0 27910  27909 0x
>> >> > [ 3263.379587] Call Trace:
>> >> > [ 3263.379608]  __schedule+0x677/0xda0
>> >> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
>> >> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
>> >> > [ 3263.379643]  schedule+0x4d/0xd0
>> >> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
>> >> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
>> >> > [ 3263.379670]  do_exit+0x10d4/0x1330
>> >> > ...
>> >> >
>> >> > The problem is seen in all kernels up to v4.11.
>> >> >
>> >> > Any idea what might be going on and how to fix the problem ?
>> >> 
>> >> Let me see.  Reading the code it looks like we have three tasks;
>> >> let's call them main, child1, and child2.
>> >> 
>> >> child1 and child2 are started using CLONE_THREAD and are
>> >> thus clones of one another.
>> >> 
>> >> child2 exits first but is ptraced by main so is not reaped.
>> >>Further child2 calls do_group_exit forcing child1 to
>> >>exit making for fun races.
>> >> 
>> >>A pthread_exit() or syscall(SYS_exit, 0); would skip
>> >>the group exit and make the window larger.
>> >> 
>> >> child1 exits next and calls zap_pid_ns_processes and is
>> >>waiting for child2 to be reaped by main.
>> >> 
>> >> main is just sitting around doing nothing for 3600 seconds
>> >> not reaping anyone.
>> >> 
>> >> I would expect that when main exits everything would be cleaned up
>> >> and the only real issue is that we have a hung task warning.
>> >> 
>> >> Does everything cleanup when main exits?
>> >> 
>> >
>> > As an add-on to my previous mail: I added a function to count
>> > the number of threads in the pid namespace, using next_pidmap().
>> > Even though nr_hashed == 2, only the hanging thread is still
>> > present.
>> 
>> For your testcase?  I suspect you copied the code from
>> zap_pid_ns_processes and skipped pid 1.  It is going to be pid 1 that is
>> calling zap_pid_ns_processes.
>> 
>
> Almost. Something along the line of
>
>   count = 0;
>   nr = next_pidmap(pid_ns, 0);
>   while (nr > 0) {
>   count++;
>   nr = next_pidmap(pid_ns, nr);
>   }
>
> only I also call sched_show_task() for each thread, and the only
> one printed is the one that hangs in zap_pid_ns_processes().
>
>> > Is there maybe a better way to terminate the wait loop than
>> > with "if (pid_ns->nr_hashed == init_pids)" ?
>> 
>> Not right now.  nr_hashed counts the number of "struct pid"s that are
>> attached to processes and in the hash table for looking up tasks.
>> 
>> Waiting until that drops to just the last task in the pid namespace
>> really does do a good job of counting what we are looking for.  It
>> is a little tricky because if our task_pid(p) != task_tgid(p) then
>> nr_hashed will be 2.  For various reasons.
>> 
>> That said it isn't as obvious as it should be and I would like
>> to improve that.
>> 
>> I think the immediate solution here is to just use TASK_INTERRUPTIBLE
>> instead of TASK_UNINTERRUPTIBLE.  This really is not a short term
>> disk sleep nor anything guaranteed to complete in any sort of
>> finite time.  Plus the hung task timeout warning only triggers
>> if your task state is TASK_UNINTERRUPTIBLE so this will prevent
>> the warning and keep a long delay in zap_pid_ns_processes from
>> bumping up your load average.
>> 
>> That still leaves the question of what strange state
>> you are getting into but this should at least prevent the unexpected
>> reboots.
>> 
>
> What I know so far is 
> - We see this condition on 

Re: Threads stuck in zap_pid_ns_processes()

2017-05-11 Thread Eric W. Biederman
Guenter Roeck  writes:

> On Thu, May 11, 2017 at 04:25:23PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck  writes:
>> 
>> > On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
>> >> Guenter Roeck  writes:
>> >> 
>> >> > Hi all,
>> >> >
>> >> > the test program attached below almost always results in one of the 
>> >> > child
>> >> > processes being stuck in zap_pid_ns_processes(). When this happens, I 
>> >> > can
>> >> > see from test logs that nr_hashed == 2 and init_pids==1, but there is 
>> >> > only
>> >> > a single thread left in the pid namespace (the one that is stuck).
>> >> > Traceback from /proc//stack is
>> >> >
>> >> > [] zap_pid_ns_processes+0x1ee/0x2a0
>> >> > [] do_exit+0x10d4/0x1330
>> >> > [] do_group_exit+0x86/0x130
>> >> > [] get_signal+0x367/0x8a0
>> >> > [] do_signal+0x83/0xb90
>> >> > [] exit_to_usermode_loop+0x75/0xc0
>> >> > [] syscall_return_slowpath+0xc6/0xd0
>> >> > [] entry_SYSCALL_64_fastpath+0xab/0xad
>> >> > [] 0x
>> >> >
>> >> > After 120 seconds, I get the "hung task" message.
>> >> >
>> >> > Example from v4.11:
>> >> >
>> >> > ...
>> >> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
>> >> > [ 3263.379561]   Not tainted 4.11.0+ #1
>> >> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>> >> > disables this message.
>> >> > [ 3263.379577] clone   D0 27910  27909 0x
>> >> > [ 3263.379587] Call Trace:
>> >> > [ 3263.379608]  __schedule+0x677/0xda0
>> >> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
>> >> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
>> >> > [ 3263.379643]  schedule+0x4d/0xd0
>> >> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
>> >> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
>> >> > [ 3263.379670]  do_exit+0x10d4/0x1330
>> >> > ...
>> >> >
>> >> > The problem is seen in all kernels up to v4.11.
>> >> >
>> >> > Any idea what might be going on and how to fix the problem ?
>> >> 
>> >> Let me see.  Reading the code it looks like we have three tasks
>> >> let's call them main, child1, and child2.
>> >> 
>> >> child1 and child2 are started using CLONE_THREAD and are
>> >> thus clones of one another.
>> >> 
>> >> child2 exits first but is ptraced by main so is not reaped.
>> >>Further child2 calls do_group_exit forcing child1 to
>> >>exit making for fun races.
>> >> 
>> >>A pthread_exit() or syscall(SYS_exit, 0); would skip
>> >>the group exit and make the window larger.
>> >> 
>> >> child1 exits next and calls zap_pid_ns_processes and is
>> >>waiting for child2 to be reaped by main.
>> >> 
>> >> main is just sitting around doing nothing for 3600 seconds
>> >> not reaping anyone.
>> >> 
>> >> I would expect that when main exits everything would be cleaned up
>> >> and the only real issue is that we have a hung task warning.
>> >> 
>> >> Does everything cleanup when main exits?
>> >> 
>> >
>> > As an add-on to my previous mail: I added a function to count
>> > the number of threads in the pid namespace, using next_pidmap().
>> > Even though nr_hashed == 2, only the hanging thread is still
>> > present.
>> 
>> For your testcase?  I suspect you copied the code from
>> zap_pid_ns_processes and skipped pid 1.  It is going to be pid 1 that is
>> calling zap_pid_ns_processes.
>> 
>
> Almost. Something along the line of
>
>   count = 0;
>   nr = next_pidmap(pid_ns, 0);
>   while (nr > 0) {
>   count++;
>   nr = next_pidmap(pid_ns, nr);
>   }
>
> only I also call sched_show_task() for each thread, and the only
> one printed is the one that hangs in zap_pid_ns_processes().
>
>> > Is there maybe a better way to terminate the wait loop than
>> > with "if (pid_ns->nr_hashed == init_pids)" ?
>> 
>> Not right now.  nr_hashed counts the number of "struct pid"s that are
>> attached to processes and in the hash table for looking up tasks.
>> 
>> Waiting until that drops to just the last task in the pid namespace
>> really does do a good job of counting what we are looking for.  It
>> is a little tricky because if our task_pid(p) != task_tgid(p) then
>> nr_hashed will be 2.  For various reasons.
>> 
>> That said it isn't as obvious as it should be and I would like
>> to improve that.
>> 
>> I think the immediate solution here is to just use TASK_INTERRUPTIBLE
>> instead of TASK_UNINTERRUPTIBLE.  This really is not a short term
>> disk sleep nor anything guaranteed to complete in any sort of
>> finite time.  Plus the hung task timeout warning only triggers
>> if your task state is TASK_UNINTERRUPTIBLE so this will prevent
>> the warning and keep a long delay in zap_pid_ns_processes from
>> bumping up your load average.
>> 
>> That still leaves the question of what strange state
>> you are getting into but this should at least prevent the unexpected
>> reboots.
>> 
>
> What I know so far is 
> - We see this condition on a regular basis in the field. Regular is
>   relative, of 

Re: Threads stuck in zap_pid_ns_processes()

2017-05-11 Thread Guenter Roeck
On Thu, May 11, 2017 at 04:25:23PM -0500, Eric W. Biederman wrote:
> Guenter Roeck  writes:
> 
> > On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
> >> Guenter Roeck  writes:
> >> 
> >> > Hi all,
> >> >
> >> > the test program attached below almost always results in one of the child
> >> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
> >> > see from test logs that nr_hashed == 2 and init_pids==1, but there is 
> >> > only
> >> > a single thread left in the pid namespace (the one that is stuck).
> >> > Traceback from /proc//stack is
> >> >
> >> > [] zap_pid_ns_processes+0x1ee/0x2a0
> >> > [] do_exit+0x10d4/0x1330
> >> > [] do_group_exit+0x86/0x130
> >> > [] get_signal+0x367/0x8a0
> >> > [] do_signal+0x83/0xb90
> >> > [] exit_to_usermode_loop+0x75/0xc0
> >> > [] syscall_return_slowpath+0xc6/0xd0
> >> > [] entry_SYSCALL_64_fastpath+0xab/0xad
> >> > [] 0x
> >> >
> >> > After 120 seconds, I get the "hung task" message.
> >> >
> >> > Example from v4.11:
> >> >
> >> > ...
> >> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
> >> > [ 3263.379561]   Not tainted 4.11.0+ #1
> >> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> >> > disables this message.
> >> > [ 3263.379577] clone   D0 27910  27909 0x
> >> > [ 3263.379587] Call Trace:
> >> > [ 3263.379608]  __schedule+0x677/0xda0
> >> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
> >> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
> >> > [ 3263.379643]  schedule+0x4d/0xd0
> >> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
> >> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
> >> > [ 3263.379670]  do_exit+0x10d4/0x1330
> >> > ...
> >> >
> >> > The problem is seen in all kernels up to v4.11.
> >> >
> >> > Any idea what might be going on and how to fix the problem ?
> >> 
> >> Let me see.  Reading the code it looks like we have three tasks
> >> let's call them main, child1, and child2.
> >> 
> >> child1 and child2 are started using CLONE_THREAD and are
> >> thus clones of one another.
> >> 
> >> child2 exits first but is ptraced by main so is not reaped.
> >>Further child2 calls do_group_exit forcing child1 to
> >>exit making for fun races.
> >> 
> >>A pthread_exit() or syscall(SYS_exit, 0); would skip
> >>the group exit and make the window larger.
> >> 
> >> child1 exits next and calls zap_pid_ns_processes and is
> >>waiting for child2 to be reaped by main.
> >> 
> >> main is just sitting around doing nothing for 3600 seconds
> >> not reaping anyone.
> >> 
> >> I would expect that when main exits everything would be cleaned up
> >> and the only real issue is that we have a hung task warning.
> >> 
> >> Does everything cleanup when main exits?
> >> 
> >
> > As an add-on to my previous mail: I added a function to count
> > the number of threads in the pid namespace, using next_pidmap().
> > Even though nr_hashed == 2, only the hanging thread is still
> > present.
> 
> For your testcase?  I suspect you copied the code from
> zap_pid_ns_processes and skipped pid 1.  It is going to be pid 1 that is
> calling zap_pid_ns_processes.
> 

Almost. Something along the line of

count = 0;
nr = next_pidmap(pid_ns, 0);
while (nr > 0) {
count++;
nr = next_pidmap(pid_ns, nr);
}

only I also call sched_show_task() for each thread, and the only
one printed is the one that hangs in zap_pid_ns_processes().

> > Is there maybe a better way to terminate the wait loop than
> > with "if (pid_ns->nr_hashed == init_pids)" ?
> 
> Not right now.  nr_hashed counts the number of "struct pid"s that are
> attached to processes and in the hash table for looking up tasks.
> 
> Waiting until that drops to just the last task in the pid namespace
> really does do a good job of counting what we are looking for.  It
> is a little tricky because if our task_pid(p) != task_tgid(p) then
> nr_hashed will be 2.  For various reasons.
> 
> That said it isn't as obvious as it should be and I would like
> to improve that.
> 
> I think the immediate solution here is to just use TASK_INTERRUPTIBLE
> instead of TASK_UNINTERRUPTIBLE.  This really is not a short term
> disk sleep nor anything guaranteed to complete in any sort of
> finite time.  Plus the hung task timeout warning only triggers
> if your task state is TASK_UNINTERRUPTIBLE so this will prevent
> the warning and keep a long delay in zap_pid_ns_processes from
> bumping up your load average.
> 
> That still leaves the question of what strange state
> you are getting into but this should at least prevent the unexpected
> reboots.
> 

What I know so far is 
- We see this condition on a regular basis in the field. Regular is
  relative, of course - let's say maybe 1 in a Million Chromebooks
  per day reports a crash because of it. That is not that many,
  but it adds up.
- We are able to reproduce the 

Re: Threads stuck in zap_pid_ns_processes()

2017-05-11 Thread Eric W. Biederman
Guenter Roeck  writes:

> On Thu, May 11, 2017 at 03:23:17PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck  writes:
>> 
>> > On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
>> >> Guenter Roeck  writes:
>> >> 
>> >> > Hi all,
>> >> >
>> >> > the test program attached below almost always results in one of the 
>> >> > child
>> >> > processes being stuck in zap_pid_ns_processes(). When this happens, I 
>> >> > can
>> >> > see from test logs that nr_hashed == 2 and init_pids==1, but there is 
>> >> > only
>> >> > a single thread left in the pid namespace (the one that is stuck).
>> >> > Traceback from /proc//stack is
>> >> >
>> >> > [] zap_pid_ns_processes+0x1ee/0x2a0
>> >> > [] do_exit+0x10d4/0x1330
>> >> > [] do_group_exit+0x86/0x130
>> >> > [] get_signal+0x367/0x8a0
>> >> > [] do_signal+0x83/0xb90
>> >> > [] exit_to_usermode_loop+0x75/0xc0
>> >> > [] syscall_return_slowpath+0xc6/0xd0
>> >> > [] entry_SYSCALL_64_fastpath+0xab/0xad
>> >> > [] 0x
>> >> >
>> >> > After 120 seconds, I get the "hung task" message.
>> >> >
>> >> > Example from v4.11:
>> >> >
>> >> > ...
>> >> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
>> >> > [ 3263.379561]   Not tainted 4.11.0+ #1
>> >> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>> >> > disables this message.
>> >> > [ 3263.379577] clone   D0 27910  27909 0x
>> >> > [ 3263.379587] Call Trace:
>> >> > [ 3263.379608]  __schedule+0x677/0xda0
>> >> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
>> >> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
>> >> > [ 3263.379643]  schedule+0x4d/0xd0
>> >> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
>> >> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
>> >> > [ 3263.379670]  do_exit+0x10d4/0x1330
>> >> > ...
>> >> >
>> >> > The problem is seen in all kernels up to v4.11.
>> >> >
>> >> > Any idea what might be going on and how to fix the problem ?
>> >> 
>> >> Let me see.  Reading the code it looks like we have three tasks
>> >> let's call them main, child1, and child2.
>> >> 
>> >> child1 and child2 are started using CLONE_THREAD and are
>> >> thus clones of one another.
>> >> 
>> >> child2 exits first but is ptraced by main so is not reaped.
>> >>Further child2 calls do_group_exit forcing child1 to
>> >>exit making for fun races.
>> >> 
> >>A pthread_exit() or syscall(SYS_exit, 0); would skip
>> >>the group exit and make the window larger.
>> >> 
>> >> child1 exits next and calls zap_pid_ns_processes and is
>> >>waiting for child2 to be reaped by main.
>> >> 
>> >> main is just sitting around doing nothing for 3600 seconds
>> >> not reaping anyone.
>> >> 
>> >> I would expect that when main exits everything would be cleaned up
>> >> and the only real issue is that we have a hung task warning.
>> >> 
>> >> Does everything cleanup when main exits?
>> >> 
>> > Yes, it does.
>> >
>> > Problem is that if the main task doesn't exit, it hangs forever.
>> > Chrome OS (where we see the problem in the field, and the application
>> > is chrome) is configured to reboot on hung tasks - if a task is hung
>> > for 120 seconds on those systems, it tends to be in a bad shape.
>> > This makes it a quite severe problem for us.
>> 
>> We definitely need to change the code in such a way as to not
>> trigger the hung task warning.
>> 
>> > I browsed through a number of stack traces - the ones I looked at
>> > all have the same traceback from do_group_exit(), though the
>> > caller of do_group_exit() is not always the same (that may depend
>> > on the kernel version, though). Sometimes it is __wake_up_parent(),
>> > sometimes it is get_signal_to_deliver(), sometimes it is
>> > get_signal().
>> >
>> > Are there other conditions besides ptrace where a task isn't reaped ?
>> 
>> The problem here is that the parent was outside of the pid namespace
>> and was choosing not to reap the child.  The main task could easily
>> have reaped the child when it received SIGCHLD.
>> 
>> Arranging for forked children to run in a pid namespace with setns
>> can also create processes with parents outside the pid namespace,
>> that may or may not be reaped.
>> 
>> I suspect it is a bug to allow PTRACE_TRACEME if either the parent
>> or the child has called exec.  That is not enough to block your test
>> case but I think it would block real world problems.
>> 
>> For getting to the root cause of this I would see if I could get
>> a full process list.  It would be interesting to know what
>> the unreaped zombie is.
>> 
> root 11579 11472  0 13:38 pts/15   00:00:00 sudo ./clone
> root 11580 11579  0 13:38 pts/15   00:00:00 ./clone
> root 11581 11580  0 13:38 pts/15   00:00:00 [clone]
>
> I assume this means that it is child1 ? What other information
> would be needed ?

My question is does the test case cover the real life cause of this?
I assume it is simplified from what you are seeing on chromeOS.

More 

Re: Threads stuck in zap_pid_ns_processes()

2017-05-11 Thread Eric W. Biederman
Guenter Roeck  writes:

> On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck  writes:
>> 
>> > Hi all,
>> >
>> > the test program attached below almost always results in one of the child
>> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
>> > see from test logs that nr_hashed == 2 and init_pids==1, but there is only
>> > a single thread left in the pid namespace (the one that is stuck).
>> > Traceback from /proc//stack is
>> >
>> > [] zap_pid_ns_processes+0x1ee/0x2a0
>> > [] do_exit+0x10d4/0x1330
>> > [] do_group_exit+0x86/0x130
>> > [] get_signal+0x367/0x8a0
>> > [] do_signal+0x83/0xb90
>> > [] exit_to_usermode_loop+0x75/0xc0
>> > [] syscall_return_slowpath+0xc6/0xd0
>> > [] entry_SYSCALL_64_fastpath+0xab/0xad
>> > [] 0x
>> >
>> > After 120 seconds, I get the "hung task" message.
>> >
>> > Example from v4.11:
>> >
>> > ...
>> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
>> > [ 3263.379561]   Not tainted 4.11.0+ #1
>> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
>> > this message.
>> > [ 3263.379577] clone   D0 27910  27909 0x
>> > [ 3263.379587] Call Trace:
>> > [ 3263.379608]  __schedule+0x677/0xda0
>> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
>> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
>> > [ 3263.379643]  schedule+0x4d/0xd0
>> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
>> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
>> > [ 3263.379670]  do_exit+0x10d4/0x1330
>> > ...
>> >
>> > The problem is seen in all kernels up to v4.11.
>> >
>> > Any idea what might be going on and how to fix the problem ?
>> 
>> Let me see.  Reading the code it looks like we have three tasks
>> let's call them main, child1, and child2.
>> 
>> child1 and child2 are started using CLONE_THREAD and are
>> thus clones of one another.
>> 
>> child2 exits first but is ptraced by main so is not reaped.
>>Further child2 calls do_group_exit forcing child1 to
>>exit making for fun races.
>> 
>>A pthread_exit() or syscall(SYS_exit, 0); would skip
>>the group exit and make the window larger.
>> 
>> child1 exits next and calls zap_pid_ns_processes and is
>>waiting for child2 to be reaped by main.
>> 
>> main is just sitting around doing nothing for 3600 seconds
>> not reaping anyone.
>> 
>> I would expect that when main exits everything would be cleaned up
>> and the only real issue is that we have a hung task warning.
>> 
>> Does everything cleanup when main exits?
>> 
>
> As an add-on to my previous mail: I added a function to count
> the number of threads in the pid namespace, using next_pidmap().
> Even though nr_hashed == 2, only the hanging thread is still
> present.

For your testcase?  I suspect you copied the code from
zap_pid_ns_processes and skipped pid 1.  It is going to be pid 1 that is
calling zap_pid_ns_processes.

> Is there maybe a better way to terminate the wait loop than
> with "if (pid_ns->nr_hashed == init_pids)" ?

Not right now.  nr_hashed counts the number of "struct pid"s that are
attached to processes and in the hash table for looking up tasks.

Waiting until that drops to just the last task in the pid namespace
really does do a good job of counting what we are looking for.  It
is a little tricky because if our task_pid(p) != task_tgid(p) then
nr_hashed will be 2.  For various reasons.

That said it isn't as obvious as it should be and I would like
to improve that.

I think the immediate solution here is to just use TASK_INTERRUPTIBLE
instead of TASK_UNINTERRUPTIBLE.  This really is not a short term
disk sleep nor anything guaranteed to complete in any sort of
finite time.  Plus the hung task timeout warning only triggers
if your task state is TASK_UNINTERRUPTIBLE so this will prevent
the warning and keep a long delay in zap_pid_ns_processes from
bumping up your load average.

That still leaves the question of what strange state
you are getting into but this should at least prevent the unexpected
reboots.

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 77403c157157..971e7bc6939b 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -277,7 +277,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	 * if reparented.
 	 */
 	for (;;) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+		set_current_state(TASK_INTERRUPTIBLE);
 		if (pid_ns->nr_hashed == init_pids)
 			break;
 		schedule();


Eric


Re: Threads stuck in zap_pid_ns_processes()

2017-05-11 Thread Eric W. Biederman
Guenter Roeck  writes:

> On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck  writes:
>> 
>> > Hi all,
>> >
>> > the test program attached below almost always results in one of the child
>> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
>> > see from test logs that nr_hashed == 2 and init_pids==1, but there is only
>> > a single thread left in the pid namespace (the one that is stuck).
>> > Traceback from /proc//stack is
>> >
>> > [] zap_pid_ns_processes+0x1ee/0x2a0
>> > [] do_exit+0x10d4/0x1330
>> > [] do_group_exit+0x86/0x130
>> > [] get_signal+0x367/0x8a0
>> > [] do_signal+0x83/0xb90
>> > [] exit_to_usermode_loop+0x75/0xc0
>> > [] syscall_return_slowpath+0xc6/0xd0
>> > [] entry_SYSCALL_64_fastpath+0xab/0xad
>> > [] 0x
>> >
>> > After 120 seconds, I get the "hung task" message.
>> >
>> > Example from v4.11:
>> >
>> > ...
>> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
>> > [ 3263.379561]   Not tainted 4.11.0+ #1
>> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
>> > this message.
>> > [ 3263.379577] clone   D0 27910  27909 0x
>> > [ 3263.379587] Call Trace:
>> > [ 3263.379608]  __schedule+0x677/0xda0
>> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
>> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
>> > [ 3263.379643]  schedule+0x4d/0xd0
>> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
>> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
>> > [ 3263.379670]  do_exit+0x10d4/0x1330
>> > ...
>> >
>> > The problem is seen in all kernels up to v4.11.
>> >
>> > Any idea what might be going on and how to fix the problem ?
>> 
>> Let me see.  Reading the code it looks like we have three tasks
>> let's call them main, child1, and child2.
>> 
>> child1 and child2 are started using CLONE_THREAD and are
>> thus clones of one another.
>> 
>> child2 exits first but is ptraced by main so is not reaped.
>>Further child2 calls do_group_exit forcing child1 to
>>exit making for fun races.
>> 
>>A ptread_exit() or syscall(SYS_exit, 0); would skip
>>the group exit and make the window larger.
>> 
>> child1 exits next and calls zap_pid_ns_processes and is
>>waiting for child2 to be reaped by main.
>> 
>> main is just sitting around doing nothing for 3600 seconds
>> not reaping anyone.
>> 
>> I would expect that when main exits everything would be cleaned up
>> and the only real issue is that we have a hung task warning.
>> 
>> Does everything cleanup when main exits?
>> 
>
> As an add-on to my previous mail: I added a function to count
> the number of threads in the pid namespace, using next_pidmap().
> Even though nr_hashed == 2, only the hanging thread is still
> present.

For your testcase?  I suspect you copied the code from
zap_pid_ns_processes and skipped pid 1.  It is going to be pid 1 that is
calling zap_pid_ns_processes.

> Is there maybe a better way to terminate the wait loop than
> with "if (pid_ns->nr_hashed == init_pids)" ?

Not right now.  nr_hashed counts the number of "struct pid"s that are
attached to processes and in the hash table for looking up tasks.

Waiting until that drops to just the last task in the pid namespace
really does do a good job of counting what we are looking for.  It
is a little tricky: if our task_pid(p) != task_tgid(p), then
nr_hashed will be 2, for various reasons.

That said it isn't as obvious as it should be and I would like
to improve that.

I think the immediate solution here is to just use TASK_INTERRUPTIBLE
instead of TASK_UNINTERRUPTIBLE.  This really is not a short-term
disk sleep nor anything guaranteed to complete in any sort of
finite time.  Plus the hung task timeout warning only triggers
if your task state is TASK_UNINTERRUPTIBLE so this will prevent
the warning and keep a long delay in zap_pid_ns_processes from
bumping up your load average.

That still leaves the question of what strange state you are
getting into, but this should at least prevent the unexpected
reboots.

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 77403c157157..971e7bc6939b 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -277,7 +277,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
	 * if reparented.
	 */
	for (;;) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+		set_current_state(TASK_INTERRUPTIBLE);
		if (pid_ns->nr_hashed == init_pids)
			break;
		schedule();

Eric


Re: Threads stuck in zap_pid_ns_processes()

2017-05-11 Thread Guenter Roeck
On Thu, May 11, 2017 at 03:23:17PM -0500, Eric W. Biederman wrote:
> Guenter Roeck  writes:
> 
> > On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
> >> Guenter Roeck  writes:
> >> 
> >> > Hi all,
> >> >
> >> > the test program attached below almost always results in one of the child
> >> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
> >> > see from test logs that nr_hashed == 2 and init_pids==1, but there is 
> >> > only
> >> > a single thread left in the pid namespace (the one that is stuck).
> >> > Traceback from /proc//stack is
> >> >
> >> > [] zap_pid_ns_processes+0x1ee/0x2a0
> >> > [] do_exit+0x10d4/0x1330
> >> > [] do_group_exit+0x86/0x130
> >> > [] get_signal+0x367/0x8a0
> >> > [] do_signal+0x83/0xb90
> >> > [] exit_to_usermode_loop+0x75/0xc0
> >> > [] syscall_return_slowpath+0xc6/0xd0
> >> > [] entry_SYSCALL_64_fastpath+0xab/0xad
> >> > [] 0x
> >> >
> >> > After 120 seconds, I get the "hung task" message.
> >> >
> >> > Example from v4.11:
> >> >
> >> > ...
> >> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
> >> > [ 3263.379561]   Not tainted 4.11.0+ #1
> >> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> >> > disables this message.
> >> > [ 3263.379577] clone   D0 27910  27909 0x
> >> > [ 3263.379587] Call Trace:
> >> > [ 3263.379608]  __schedule+0x677/0xda0
> >> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
> >> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
> >> > [ 3263.379643]  schedule+0x4d/0xd0
> >> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
> >> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
> >> > [ 3263.379670]  do_exit+0x10d4/0x1330
> >> > ...
> >> >
> >> > The problem is seen in all kernels up to v4.11.
> >> >
> >> > Any idea what might be going on and how to fix the problem ?
> >> 
> >> Let me see.  Reading the code it looks like we have three tasks
> >> let's call them main, child1, and child2.
> >> 
> >> child1 and child2 are started using CLONE_THREAD and are
> >> thus clones of one another.
> >> 
> >> child2 exits first but is ptraced by main so is not reaped.
> >>Further child2 calls do_group_exit forcing child1 to
> >>exit making for fun races.
> >> 
> >>A pthread_exit() or syscall(SYS_exit, 0); would skip
> >>the group exit and make the window larger.
> >> 
> >> child1 exits next and calls zap_pid_ns_processes and is
> >>waiting for child2 to be reaped by main.
> >> 
> >> main is just sitting around doing nothing for 3600 seconds
> >> not reaping anyone.
> >> 
> >> I would expect that when main exits everything would be cleaned up
> >> and the only real issue is that we have a hung task warning.
> >> 
> >> Does everything cleanup when main exits?
> >> 
> > Yes, it does.
> >
> > Problem is that if the main task doesn't exit, it hangs forever.
> > Chrome OS (where we see the problem in the field, and the application
> > is chrome) is configured to reboot on hung tasks - if a task is hung
> > for 120 seconds on those systems, it tends to be in a bad shape.
> > This makes it a quite severe problem for us.
> 
> We definitely need to change the code in such a way as to not
> trigger the hung task warning.
> 
> > I browsed through a number of stack traces - the ones I looked at
> > all have the same traceback from do_group_exit(), though the
> > caller of do_group_exit() is not always the same (that may depend
> > on the kernel version, though). Sometimes it is __wake_up_parent(),
> > sometimes it is get_signal_to_deliver(), sometimes it is
> > get_signal().
> >
> > Are there other conditions besides ptrace where a task isn't reaped ?
> 
> The problem here is that the parent was outside of the pid namespace
> and was choosing not to reap the child.  The main task could easily
> have reaped the child when it received SIGCHLD.
> 
> Arranging for forked children to run in a pid namespace with setns
> can also create processes with parents outside the pid namespace,
> that may or may not be reaped.
> 
> I suspect it is a bug to allow PTRACE_TRACEME if either the parent
> or the child has called exec.  That is not enough to block your test
> case but I think it would block real world problems.
> 
> For getting to the root cause of this I would see if I could get
> a full process list.  It would be interesting to know what
> the unreaped zombie is.
> 
root 11579 11472  0 13:38 pts/15   00:00:00 sudo ./clone
root 11580 11579  0 13:38 pts/15   00:00:00 ./clone
root 11581 11580  0 13:38 pts/15   00:00:00 [clone]

I assume this means that it is child1?  What other information
would be needed ?

Thanks,
Guenter

> In the meantime I am going to see what it takes to change the code
> so it doesn't trigger the hung task warning.  I don't want to change
> the actual observed behavior of the code but I can avoid triggering
> a warning that does not apply.
> 

Re: Threads stuck in zap_pid_ns_processes()

2017-05-11 Thread Eric W. Biederman
Guenter Roeck  writes:

> On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck  writes:
>> 
>> > Hi all,
>> >
>> > the test program attached below almost always results in one of the child
>> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
>> > see from test logs that nr_hashed == 2 and init_pids==1, but there is only
>> > a single thread left in the pid namespace (the one that is stuck).
>> > Traceback from /proc//stack is
>> >
>> > [] zap_pid_ns_processes+0x1ee/0x2a0
>> > [] do_exit+0x10d4/0x1330
>> > [] do_group_exit+0x86/0x130
>> > [] get_signal+0x367/0x8a0
>> > [] do_signal+0x83/0xb90
>> > [] exit_to_usermode_loop+0x75/0xc0
>> > [] syscall_return_slowpath+0xc6/0xd0
>> > [] entry_SYSCALL_64_fastpath+0xab/0xad
>> > [] 0x
>> >
>> > After 120 seconds, I get the "hung task" message.
>> >
>> > Example from v4.11:
>> >
>> > ...
>> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
>> > [ 3263.379561]   Not tainted 4.11.0+ #1
>> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
>> > this message.
>> > [ 3263.379577] clone   D0 27910  27909 0x
>> > [ 3263.379587] Call Trace:
>> > [ 3263.379608]  __schedule+0x677/0xda0
>> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
>> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
>> > [ 3263.379643]  schedule+0x4d/0xd0
>> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
>> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
>> > [ 3263.379670]  do_exit+0x10d4/0x1330
>> > ...
>> >
>> > The problem is seen in all kernels up to v4.11.
>> >
>> > Any idea what might be going on and how to fix the problem ?
>> 
>> Let me see.  Reading the code it looks like we have three tasks
>> let's call them main, child1, and child2.
>> 
>> child1 and child2 are started using CLONE_THREAD and are
>> thus clones of one another.
>> 
>> child2 exits first but is ptraced by main so is not reaped.
>>Further child2 calls do_group_exit forcing child1 to
>>exit making for fun races.
>> 
>>A pthread_exit() or syscall(SYS_exit, 0); would skip
>>the group exit and make the window larger.
>> 
>> child1 exits next and calls zap_pid_ns_processes and is
>>waiting for child2 to be reaped by main.
>> 
>> main is just sitting around doing nothing for 3600 seconds
>> not reaping anyone.
>> 
>> I would expect that when main exits everything would be cleaned up
>> and the only real issue is that we have a hung task warning.
>> 
>> Does everything cleanup when main exits?
>> 
> Yes, it does.
>
> Problem is that if the main task doesn't exit, it hangs forever.
> Chrome OS (where we see the problem in the field, and the application
> is chrome) is configured to reboot on hung tasks - if a task is hung
> for 120 seconds on those systems, it tends to be in a bad shape.
> This makes it a quite severe problem for us.

We definitely need to change the code in such a way as to not
trigger the hung task warning.

> I browsed through a number of stack traces - the ones I looked at
> all have the same traceback from do_group_exit(), though the
> caller of do_group_exit() is not always the same (that may depend
> on the kernel version, though). Sometimes it is __wake_up_parent(),
> sometimes it is get_signal_to_deliver(), sometimes it is
> get_signal().
>
> Are there other conditions besides ptrace where a task isn't reaped ?

The problem here is that the parent was outside of the pid namespace
and was choosing not to reap the child.  The main task could easily
have reaped the child when it received SIGCHLD.

Arranging for forked children to run in a pid namespace with setns
can also create processes with parents outside the pid namespace,
that may or may not be reaped.

I suspect it is a bug to allow PTRACE_TRACEME if either the parent
or the child has called exec.  That is not enough to block your test
case but I think it would block real world problems.

For getting to the root cause of this I would see if I could get
a full process list.  It would be interesting to know what
the unreaped zombie is.

In the meantime I am going to see what it takes to change the code
so it doesn't trigger the hung task warning.  I don't want to change
the actual observed behavior of the code but I can avoid triggering
a warning that does not apply.

Eric


>
> Thanks,
> Guenter
>
>> Eric
>> 
>> 
>> >
>> > Thanks,
>> > Guenter
>> >
>> > ---
>> > This test program was kindly provided by Vovo Yang .
>> >
>> > Note that the ptrace() call in child1() is not necessary for the problem
>> > to be seen, though it seems to make it a bit more likely.
>> 
>> That would appear to just slow things down a smidge.  As there is
>> nothing substantial that happens ptrace-wise until after
>> zap_pid_ns_processes.
>> 
>> 
>> > ---
>> >
>> > #define _GNU_SOURCE
>> > #include 
>> > #include 
>> > #include 

Re: Threads stuck in zap_pid_ns_processes()

2017-05-11 Thread Guenter Roeck
On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
> Guenter Roeck  writes:
> 
> > Hi all,
> >
> > the test program attached below almost always results in one of the child
> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
> > see from test logs that nr_hashed == 2 and init_pids==1, but there is only
> > a single thread left in the pid namespace (the one that is stuck).
> > Traceback from /proc//stack is
> >
> > [] zap_pid_ns_processes+0x1ee/0x2a0
> > [] do_exit+0x10d4/0x1330
> > [] do_group_exit+0x86/0x130
> > [] get_signal+0x367/0x8a0
> > [] do_signal+0x83/0xb90
> > [] exit_to_usermode_loop+0x75/0xc0
> > [] syscall_return_slowpath+0xc6/0xd0
> > [] entry_SYSCALL_64_fastpath+0xab/0xad
> > [] 0x
> >
> > After 120 seconds, I get the "hung task" message.
> >
> > Example from v4.11:
> >
> > ...
> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
> > [ 3263.379561]   Not tainted 4.11.0+ #1
> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> > this message.
> > [ 3263.379577] clone   D0 27910  27909 0x
> > [ 3263.379587] Call Trace:
> > [ 3263.379608]  __schedule+0x677/0xda0
> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
> > [ 3263.379643]  schedule+0x4d/0xd0
> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
> > [ 3263.379670]  do_exit+0x10d4/0x1330
> > ...
> >
> > The problem is seen in all kernels up to v4.11.
> >
> > Any idea what might be going on and how to fix the problem ?
> 
> Let me see.  Reading the code it looks like we have three tasks
> let's call them main, child1, and child2.
> 
> child1 and child2 are started using CLONE_THREAD and are
> thus clones of one another.
> 
> child2 exits first but is ptraced by main so is not reaped.
>Further child2 calls do_group_exit forcing child1 to
>exit making for fun races.
> 
>A pthread_exit() or syscall(SYS_exit, 0); would skip
>the group exit and make the window larger.
> 
> child1 exits next and calls zap_pid_ns_processes and is
>waiting for child2 to be reaped by main.
> 
> main is just sitting around doing nothing for 3600 seconds
> not reaping anyone.
> 
> I would expect that when main exits everything would be cleaned up
> and the only real issue is that we have a hung task warning.
> 
> Does everything cleanup when main exits?
> 

As an add-on to my previous mail: I added a function to count
the number of threads in the pid namespace, using next_pidmap().
Even though nr_hashed == 2, only the hanging thread is still
present.

Is there maybe a better way to terminate the wait loop than
with "if (pid_ns->nr_hashed == init_pids)" ?

Thanks,
Guenter


Re: Threads stuck in zap_pid_ns_processes()

2017-05-11 Thread Guenter Roeck
On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
> Guenter Roeck  writes:
> 
> > Hi all,
> >
> > the test program attached below almost always results in one of the child
> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
> > see from test logs that nr_hashed == 2 and init_pids==1, but there is only
> > a single thread left in the pid namespace (the one that is stuck).
> > Traceback from /proc//stack is
> >
> > [] zap_pid_ns_processes+0x1ee/0x2a0
> > [] do_exit+0x10d4/0x1330
> > [] do_group_exit+0x86/0x130
> > [] get_signal+0x367/0x8a0
> > [] do_signal+0x83/0xb90
> > [] exit_to_usermode_loop+0x75/0xc0
> > [] syscall_return_slowpath+0xc6/0xd0
> > [] entry_SYSCALL_64_fastpath+0xab/0xad
> > [] 0x
> >
> > After 120 seconds, I get the "hung task" message.
> >
> > Example from v4.11:
> >
> > ...
> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
> > [ 3263.379561]   Not tainted 4.11.0+ #1
> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> > this message.
> > [ 3263.379577] clone   D0 27910  27909 0x
> > [ 3263.379587] Call Trace:
> > [ 3263.379608]  __schedule+0x677/0xda0
> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
> > [ 3263.379643]  schedule+0x4d/0xd0
> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
> > [ 3263.379670]  do_exit+0x10d4/0x1330
> > ...
> >
> > The problem is seen in all kernels up to v4.11.
> >
> > Any idea what might be going on and how to fix the problem ?
> 
> Let me see.  Reading the code it looks like we have three tasks
> let's call them main, child1, and child2.
> 
> child1 and child2 are started using CLONE_THREAD and are
> thus clones of one another.
> 
> child2 exits first but is ptraced by main so is not reaped.
>Further child2 calls do_group_exit forcing child1 to
>exit making for fun races.
> 
>A pthread_exit() or syscall(SYS_exit, 0); would skip
>the group exit and make the window larger.
> 
> child1 exits next and calls zap_pid_ns_processes and is
>waiting for child2 to be reaped by main.
> 
> main is just sitting around doing nothing for 3600 seconds
> not reaping anyone.
> 
> I would expect that when main exits everything would be cleaned up
> and the only real issue is that we have a hung task warning.
> 
> Does everything cleanup when main exits?
> 
Yes, it does.

Problem is that if the main task doesn't exit, it hangs forever.
Chrome OS (where we see the problem in the field, and the application
is chrome) is configured to reboot on hung tasks - if a task is hung
for 120 seconds on those systems, it tends to be in a bad shape.
This makes it a quite severe problem for us.

I browsed through a number of stack traces - the ones I looked at
all have the same traceback from do_group_exit(), though the
caller of do_group_exit() is not always the same (that may depend
on the kernel version, though). Sometimes it is __wake_up_parent(),
sometimes it is get_signal_to_deliver(), sometimes it is
get_signal().

Are there other conditions besides ptrace where a task isn't reaped ?

Thanks,
Guenter

> Eric
> 
> 
> >
> > Thanks,
> > Guenter
> >
> > ---
> > This test program was kindly provided by Vovo Yang .
> >
> > Note that the ptrace() call in child1() is not necessary for the problem
> > to be seen, though it seems to make it a bit more likely.
> 
> That would appear to just slow things down a smidge.  As there is
> nothing substantial that happens ptrace-wise until after
> zap_pid_ns_processes.
> 
> 
> > ---
> >
> > #define _GNU_SOURCE
> > #include <sched.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <errno.h>
> > #include <unistd.h>
> > #include <sys/ptrace.h>
> >
> > #define STACK_SIZE 65536
> >
> > int child1(void* arg);
> > int child2(void* arg);
> >
> > int main(int argc, char **argv)
> > {
> >   int child_pid;
> >   char* child_stack = malloc(STACK_SIZE);
> >   char* stack_top = child_stack + STACK_SIZE;
> >   char command[256];
> >
> >   child_pid = clone(child1, stack_top, CLONE_NEWPID, NULL);
> >   if (child_pid == -1) {
> > printf("parent: clone failed: %s\n", strerror(errno));
> > return EXIT_FAILURE;
> >   }
> >   printf("parent: child1_pid: %d\n", child_pid);
> >
> >   sleep(2);
> >   printf("child state, if it's D (disk sleep), the child process is hung\n");
> >   sprintf(command, "cat /proc/%d/status | grep State:", child_pid);
> >   system(command);
> >   sleep(3600);
> >   return EXIT_SUCCESS;
> > }
> >
> > int child1(void* arg)
> > {
> >   int flags = CLONE_FILES | CLONE_FS | CLONE_VM | CLONE_SIGHAND | 
> > CLONE_THREAD;
> >   char* child_stack = malloc(STACK_SIZE);
> >   char* stack_top = child_stack + STACK_SIZE;
> >   long ret;
> >
> >   ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
> >   if (ret == -1) {
> > printf("child1: ptrace failed: %s\n", 

Re: Threads stuck in zap_pid_ns_processes()

2017-05-11 Thread Guenter Roeck
On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
> Guenter Roeck  writes:
> 
> > Hi all,
> >
> > the test program attached below almost always results in one of the child
> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
> > see from test logs that nr_hashed == 2 and init_pids==1, but there is only
> > a single thread left in the pid namespace (the one that is stuck).
> > Traceback from /proc/<pid>/stack is
> >
> > [] zap_pid_ns_processes+0x1ee/0x2a0
> > [] do_exit+0x10d4/0x1330
> > [] do_group_exit+0x86/0x130
> > [] get_signal+0x367/0x8a0
> > [] do_signal+0x83/0xb90
> > [] exit_to_usermode_loop+0x75/0xc0
> > [] syscall_return_slowpath+0xc6/0xd0
> > [] entry_SYSCALL_64_fastpath+0xab/0xad
> > [] 0x
> >
> > After 120 seconds, I get the "hung task" message.
> >
> > Example from v4.11:
> >
> > ...
> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
> > [ 3263.379561]   Not tainted 4.11.0+ #1
> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> > this message.
> > [ 3263.379577] clone   D0 27910  27909 0x
> > [ 3263.379587] Call Trace:
> > [ 3263.379608]  __schedule+0x677/0xda0
> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
> > [ 3263.379643]  schedule+0x4d/0xd0
> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
> > [ 3263.379670]  do_exit+0x10d4/0x1330
> > ...
> >
> > The problem is seen in all kernels up to v4.11.
> >
> > Any idea what might be going on and how to fix the problem?
> 
> Let me see.  Reading the code it looks like we have three tasks;
> let's call them main, child1, and child2.
> 
> child1 and child2 are started using CLONE_THREAD and are
> thus clones of one another.
> 
> child2 exits first but is ptraced by main so is not reaped.
>    Further, child2 calls do_group_exit(), forcing child1 to
>    exit and making for fun races.
> 
>    A pthread_exit() or syscall(SYS_exit, 0) would skip
>    the group exit and make the window larger.
> 
> child1 exits next, calls zap_pid_ns_processes(), and is
>    waiting for child2 to be reaped by main.
> 
> main is just sitting around doing nothing for 3600 seconds
> not reaping anyone.
> 
> I would expect that when main exits everything would be cleaned up
> and the only real issue is that we have a hung task warning.
> 
> Does everything cleanup when main exits?
> 
Yes, it does.

The problem is that if the main task doesn't exit, the child hangs
forever. Chrome OS (where we see the problem in the field, and the
application is Chrome) is configured to reboot on hung tasks - if a
task has been hung for 120 seconds on those systems, the system tends
to be in bad shape. This makes it quite a severe problem for us.

I browsed through a number of stack traces - the ones I looked at
all have the same traceback from do_group_exit(), though the
caller of do_group_exit() is not always the same (that may depend
on the kernel version). Sometimes it is __wake_up_parent(),
sometimes it is get_signal_to_deliver(), sometimes it is
get_signal().

Are there other conditions besides ptrace where a task isn't reaped?

Thanks,
Guenter

> Eric
> 
> 
> >
> > Thanks,
> > Guenter
> >
> > ---
> > This test program was kindly provided by Vovo Yang .
> >
> > Note that the ptrace() call in child1() is not necessary for the problem
> > to be seen, though it seems to make it a bit more likely.
> 
> That would appear to just slow things down a smidge, as there is
> nothing substantial that happens ptrace-wise until after
> zap_pid_ns_processes().
> 
> 
> > ---
> >
> > #define _GNU_SOURCE
> > #include <errno.h>
> > #include <sched.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <sys/ptrace.h>
> > #include <unistd.h>
> >
> > #define STACK_SIZE 65536
> >
> > int child1(void* arg);
> > int child2(void* arg);
> >
> > int main(int argc, char **argv)
> > {
> >   int child_pid;
> >   char* child_stack = malloc(STACK_SIZE);
> >   char* stack_top = child_stack + STACK_SIZE;
> >   char command[256];
> >
> >   child_pid = clone(child1, stack_top, CLONE_NEWPID, NULL);
> >   if (child_pid == -1) {
> >     printf("parent: clone failed: %s\n", strerror(errno));
> >     return EXIT_FAILURE;
> >   }
> >   printf("parent: child1_pid: %d\n", child_pid);
> >
> >   sleep(2);
> >   printf("child state, if it's D (disk sleep), the child process is hung\n");
> >   sprintf(command, "cat /proc/%d/status | grep State:", child_pid);
> >   system(command);
> >   sleep(3600);
> >   return EXIT_SUCCESS;
> > }
> >
> > int child1(void* arg)
> > {
> >   int flags = CLONE_FILES | CLONE_FS | CLONE_VM | CLONE_SIGHAND | CLONE_THREAD;
> >   char* child_stack = malloc(STACK_SIZE);
> >   char* stack_top = child_stack + STACK_SIZE;
> >   long ret;
> >
> >   ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
> >   if (ret == -1) {
> >     printf("child1: ptrace failed: %s\n", strerror(errno));
> >     return EXIT_FAILURE;

Re: Threads stuck in zap_pid_ns_processes()

2017-05-11 Thread Eric W. Biederman
Guenter Roeck  writes:

> Hi all,
>
> the test program attached below almost always results in one of the child
> processes being stuck in zap_pid_ns_processes(). When this happens, I can
> see from test logs that nr_hashed == 2 and init_pids==1, but there is only
> a single thread left in the pid namespace (the one that is stuck).
> Traceback from /proc/<pid>/stack is
>
> [] zap_pid_ns_processes+0x1ee/0x2a0
> [] do_exit+0x10d4/0x1330
> [] do_group_exit+0x86/0x130
> [] get_signal+0x367/0x8a0
> [] do_signal+0x83/0xb90
> [] exit_to_usermode_loop+0x75/0xc0
> [] syscall_return_slowpath+0xc6/0xd0
> [] entry_SYSCALL_64_fastpath+0xab/0xad
> [] 0x
>
> After 120 seconds, I get the "hung task" message.
>
> Example from v4.11:
>
> ...
> [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
> [ 3263.379561]   Not tainted 4.11.0+ #1
> [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [ 3263.379577] clone   D0 27910  27909 0x
> [ 3263.379587] Call Trace:
> [ 3263.379608]  __schedule+0x677/0xda0
> [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
> [ 3263.379634]  ? task_stopped_code+0x70/0x70
> [ 3263.379643]  schedule+0x4d/0xd0
> [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
> [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
> [ 3263.379670]  do_exit+0x10d4/0x1330
> ...
>
> The problem is seen in all kernels up to v4.11.
>
> Any idea what might be going on and how to fix the problem?

Let me see.  Reading the code it looks like we have three tasks;
let's call them main, child1, and child2.

child1 and child2 are started using CLONE_THREAD and are
thus clones of one another.

child2 exits first but is ptraced by main so is not reaped.
   Further, child2 calls do_group_exit(), forcing child1 to
   exit and making for fun races.

   A pthread_exit() or syscall(SYS_exit, 0) would skip
   the group exit and make the window larger.

child1 exits next, calls zap_pid_ns_processes(), and is
   waiting for child2 to be reaped by main.

main is just sitting around doing nothing for 3600 seconds
not reaping anyone.

I would expect that when main exits everything would be cleaned up
and the only real issue is that we have a hung task warning.

Does everything cleanup when main exits?

Eric


>
> Thanks,
> Guenter
>
> ---
> This test program was kindly provided by Vovo Yang .
>
> Note that the ptrace() call in child1() is not necessary for the problem
> to be seen, though it seems to make it a bit more likely.

That would appear to just slow things down a smidge, as there is
nothing substantial that happens ptrace-wise until after
zap_pid_ns_processes().


> ---
>
> #define _GNU_SOURCE
> #include <errno.h>
> #include <sched.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/ptrace.h>
> #include <unistd.h>
>
> #define STACK_SIZE 65536
>
> int child1(void* arg);
> int child2(void* arg);
>
> int main(int argc, char **argv)
> {
>   int child_pid;
>   char* child_stack = malloc(STACK_SIZE);
>   char* stack_top = child_stack + STACK_SIZE;
>   char command[256];
>
>   child_pid = clone(child1, stack_top, CLONE_NEWPID, NULL);
>   if (child_pid == -1) {
>     printf("parent: clone failed: %s\n", strerror(errno));
>     return EXIT_FAILURE;
>   }
>   printf("parent: child1_pid: %d\n", child_pid);
>
>   sleep(2);
>   printf("child state, if it's D (disk sleep), the child process is hung\n");
>   sprintf(command, "cat /proc/%d/status | grep State:", child_pid);
>   system(command);
>   sleep(3600);
>   return EXIT_SUCCESS;
> }
>
> int child1(void* arg)
> {
>   int flags = CLONE_FILES | CLONE_FS | CLONE_VM | CLONE_SIGHAND | CLONE_THREAD;
>   char* child_stack = malloc(STACK_SIZE);
>   char* stack_top = child_stack + STACK_SIZE;
>   long ret;
>
>   ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
>   if (ret == -1) {
>     printf("child1: ptrace failed: %s\n", strerror(errno));
>     return EXIT_FAILURE;
>   }
>
>   ret = clone(child2, stack_top, flags, NULL);
>   if (ret == -1) {
>     printf("child1: clone failed: %s\n", strerror(errno));
>     return EXIT_FAILURE;
>   }
>   printf("child1: child2 pid: %ld\n", ret);
>
>   sleep(1);
>   printf("child1: end\n");
>   return EXIT_SUCCESS;
> }
>
> int child2(void* arg)
> {
>   long ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
>   if (ret == -1) {
>     printf("child2: ptrace failed: %s\n", strerror(errno));
>     return EXIT_FAILURE;
>   }
>
>   printf("child2: end\n");
>   return EXIT_SUCCESS;
> }



Threads stuck in zap_pid_ns_processes()

2017-05-11 Thread Guenter Roeck
Hi all,

the test program attached below almost always results in one of the child
processes being stuck in zap_pid_ns_processes(). When this happens, I can
see from test logs that nr_hashed == 2 and init_pids==1, but there is only
a single thread left in the pid namespace (the one that is stuck).
Traceback from /proc/<pid>/stack is

[] zap_pid_ns_processes+0x1ee/0x2a0
[] do_exit+0x10d4/0x1330
[] do_group_exit+0x86/0x130
[] get_signal+0x367/0x8a0
[] do_signal+0x83/0xb90
[] exit_to_usermode_loop+0x75/0xc0
[] syscall_return_slowpath+0xc6/0xd0
[] entry_SYSCALL_64_fastpath+0xab/0xad
[] 0x

After 120 seconds, I get the "hung task" message.

Example from v4.11:

...
[ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
[ 3263.379561]   Not tainted 4.11.0+ #1
[ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 3263.379577] clone   D0 27910  27909 0x
[ 3263.379587] Call Trace:
[ 3263.379608]  __schedule+0x677/0xda0
[ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
[ 3263.379634]  ? task_stopped_code+0x70/0x70
[ 3263.379643]  schedule+0x4d/0xd0
[ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
[ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
[ 3263.379670]  do_exit+0x10d4/0x1330
...

The problem is seen in all kernels up to v4.11.

Any idea what might be going on and how to fix the problem?

Thanks,
Guenter

---
This test program was kindly provided by Vovo Yang .

Note that the ptrace() call in child1() is not necessary for the problem
to be seen, though it seems to make it a bit more likely.
---

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ptrace.h>
#include <unistd.h>

#define STACK_SIZE 65536

int child1(void* arg);
int child2(void* arg);

int main(int argc, char **argv)
{
  int child_pid;
  char* child_stack = malloc(STACK_SIZE);
  char* stack_top = child_stack + STACK_SIZE;
  char command[256];

  child_pid = clone(child1, stack_top, CLONE_NEWPID, NULL);
  if (child_pid == -1) {
    printf("parent: clone failed: %s\n", strerror(errno));
    return EXIT_FAILURE;
  }
  printf("parent: child1_pid: %d\n", child_pid);

  sleep(2);
  printf("child state, if it's D (disk sleep), the child process is hung\n");
  sprintf(command, "cat /proc/%d/status | grep State:", child_pid);
  system(command);
  sleep(3600);
  return EXIT_SUCCESS;
}

int child1(void* arg)
{
  int flags = CLONE_FILES | CLONE_FS | CLONE_VM | CLONE_SIGHAND | CLONE_THREAD;
  char* child_stack = malloc(STACK_SIZE);
  char* stack_top = child_stack + STACK_SIZE;
  long ret;

  ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
  if (ret == -1) {
    printf("child1: ptrace failed: %s\n", strerror(errno));
    return EXIT_FAILURE;
  }

  ret = clone(child2, stack_top, flags, NULL);
  if (ret == -1) {
    printf("child1: clone failed: %s\n", strerror(errno));
    return EXIT_FAILURE;
  }
  printf("child1: child2 pid: %ld\n", ret);

  sleep(1);
  printf("child1: end\n");
  return EXIT_SUCCESS;
}

int child2(void* arg)
{
  long ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
  if (ret == -1) {
    printf("child2: ptrace failed: %s\n", strerror(errno));
    return EXIT_FAILURE;
  }

  printf("child2: end\n");
  return EXIT_SUCCESS;
}
}



