Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-06-30 Thread Paul E. McKenney
On Sat, May 07, 2016 at 08:25:01AM -0700, Paul E. McKenney wrote:
> On Fri, May 06, 2016 at 04:25:16PM +1000, Ross Green wrote:
> > On Sun, Apr 3, 2016 at 6:18 PM, Paul E. McKenney
> >  wrote:

[ . . . ]

> > Thought I would update a few runs with the linux-4.6-rc kernels.
> > 
> > I have attached log outputs through dmesg showing rcu_preempt stall 
> > warnings.
> > 
> > 
> > Thought it might be interesting for someone else to look at.
> > 
> > Currently running linux-4.6-rc6 in testing.
> 
> Thank you for sending these, I will look them over!
> 
> Still working to reproduce this quickly enough to do real debug...  :-/

And Peter Zijlstra's patch looks to have hugely reduced the rate of
occurrence of this bug in my testing:

lkml.kernel.org/r/20160523091907.gd15...@worktop.ger.corp.intel.com

Almost all of the issues I am seeing now are transient, do not trigger
RCU CPU stall warnings, and would not have been visible to me before
I upgraded my testing scripts and in-kernel code.  In all the tests I
have run with Peter's fix, I have seen only one run with RCU CPU stall
warnings, and all the stalls in that run were transient, unlike those
that I was seeing before his fix.

I will of course be tracking this stuff down, but the low reproduction
rates will make it slow going.

I am guessing that you no longer see this issue?

Thanx, Paul



Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-05-09 Thread Ross Green
On Sun, May 8, 2016 at 1:25 AM, Paul E. McKenney
 wrote:
> On Fri, May 06, 2016 at 04:25:16PM +1000, Ross Green wrote:
>> On Sun, Apr 3, 2016 at 6:18 PM, Paul E. McKenney
>>  wrote:
>> > On Thu, Mar 31, 2016 at 08:42:55AM -0700, Paul E. McKenney wrote:
>> >> On Wed, Mar 30, 2016 at 07:55:47AM -0700, Paul E. McKenney wrote:
>> >> > On Tue, Mar 29, 2016 at 06:49:08AM -0700, Paul E. McKenney wrote:
>> >> > > On Mon, Mar 28, 2016 at 05:28:14PM -0700, Paul E. McKenney wrote:
>> >> > > > On Mon, Mar 28, 2016 at 05:25:18PM -0700, Paul E. McKenney wrote:
>> >> > > > > On Mon, Mar 28, 2016 at 06:08:41AM -0700, Paul E. McKenney wrote:
>> >> > > > > > On Mon, Mar 28, 2016 at 08:25:47AM +0200, Peter Zijlstra wrote:
>> >> > > > > > > On Sun, Mar 27, 2016 at 02:06:41PM -0700, Paul E. McKenney 
>> >> > > > > > > wrote:
>> >> > > > >
>> >> > > > > [ . . . ]
>> >> > > > >
>> >> > > > > > > > OK, so I should instrument migration_call() if I get the 
>> >> > > > > > > > repro rate up?
>> >> > > > > > >
>> >> > > > > > > Can do, maybe try the below first. (yes I know how long it 
>> >> > > > > > > all takes :/)
>> >> > > > > >
>> >> > > > > > OK, will run this today, then run calibration for last night's 
>> >> > > > > > run this
>> >> > > > > > evening.
>> >> > >
>> >> > > And of 18 two-hour runs, there were five failures, or about 28%.
>> >> > > That said, I don't have even one significant digit on the failure 
>> >> > > rate,
>> >> > > as 5 of 18 is within the 95% confidence limits for a failure 
>> >> > > probability
>> >> > > as low as 12.5% and as high as 47%.
>> >> >
>> >> > And after last night's run, this is narrowed down to between 23% and 
>> >> > 38%,
>> >> > which is close enough.  Average is 30%, 18 failures in 60 runs.
>> >> >
>> >> > Next step is to test Peter's patch some more.  Might take a couple of
>> >> > night's worth of runs to get statistical significance.  After which
>> >> > it will be time to rebase to 4.6-rc1.
>> >>
>> >> And the first night was not so good: 6 failures out of 24 runs.  Adding
>> >> this to the 1-of-10 earlier gets 7 failures out of 34.  Here is how things
>> >> stack up given the range of base failure estimates:
>> >>
>> >> Low 95% bound of 23%: 84% confidence.
>> >>
>> >> Actual measurement of 30%: 92% confidence.
>> >>
>> >> High 95% bound of 38%: 98% confidence.
>> >>
>> >> So there is still some chance that Peter's patch is helping.  I will
>> >> run for one more evening, after which it will be time to move forward
>> >> to 4.6-rc1.
>> >
>> > And no luck reducing bounds.  However, moving to 4.6-rc1 did get some
>> > of the trace_printk() to print.  The ftrace_dump()s resulted in RCU
>> > CPU stall warnings, and the dumps were truncated due to test timeouts
>> > in my scripting.  (I need to make my scripts more patient when they
>> > see an ftrace dump in progress, I guess.)
>> >
>> > Here are the results:
>> >
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.1.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.2.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.3.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.4.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.5.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.6.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.7.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.8.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.9.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.11.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.12.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.13.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.14.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.15.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.16.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.17.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.18.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.19.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.20.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.21.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.22.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.23.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.24.console.log.tgz
>> >
>> > The config is here:
>> >
>> > http://www2.rdrop.com/users/paulmck/submission/config.tgz
>> >
>> > More runs to measure 4.6-rc1 base error rate...
>> >
>> >

Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-05-07 Thread Paul E. McKenney
On Fri, May 06, 2016 at 04:25:16PM +1000, Ross Green wrote:
> On Sun, Apr 3, 2016 at 6:18 PM, Paul E. McKenney
>  wrote:
> > On Thu, Mar 31, 2016 at 08:42:55AM -0700, Paul E. McKenney wrote:
> >> On Wed, Mar 30, 2016 at 07:55:47AM -0700, Paul E. McKenney wrote:
> >> > On Tue, Mar 29, 2016 at 06:49:08AM -0700, Paul E. McKenney wrote:
> >> > > On Mon, Mar 28, 2016 at 05:28:14PM -0700, Paul E. McKenney wrote:
> >> > > > On Mon, Mar 28, 2016 at 05:25:18PM -0700, Paul E. McKenney wrote:
> >> > > > > On Mon, Mar 28, 2016 at 06:08:41AM -0700, Paul E. McKenney wrote:
> >> > > > > > On Mon, Mar 28, 2016 at 08:25:47AM +0200, Peter Zijlstra wrote:
> >> > > > > > > On Sun, Mar 27, 2016 at 02:06:41PM -0700, Paul E. McKenney 
> >> > > > > > > wrote:
> >> > > > >
> >> > > > > [ . . . ]
> >> > > > >
> >> > > > > > > > OK, so I should instrument migration_call() if I get the 
> >> > > > > > > > repro rate up?
> >> > > > > > >
> >> > > > > > > Can do, maybe try the below first. (yes I know how long it all 
> >> > > > > > > takes :/)
> >> > > > > >
> >> > > > > > OK, will run this today, then run calibration for last night's 
> >> > > > > > run this
> >> > > > > > evening.
> >> > >
> >> > > And of 18 two-hour runs, there were five failures, or about 28%.
> >> > > That said, I don't have even one significant digit on the failure rate,
> >> > > as 5 of 18 is within the 95% confidence limits for a failure 
> >> > > probability
> >> > > as low as 12.5% and as high as 47%.
> >> >
> >> > And after last night's run, this is narrowed down to between 23% and 38%,
> >> > which is close enough.  Average is 30%, 18 failures in 60 runs.
> >> >
> >> > Next step is to test Peter's patch some more.  Might take a couple of
> >> > night's worth of runs to get statistical significance.  After which
> >> > it will be time to rebase to 4.6-rc1.
> >>
> >> And the first night was not so good: 6 failures out of 24 runs.  Adding
> >> this to the 1-of-10 earlier gets 7 failures out of 34.  Here is how things
> >> stack up given the range of base failure estimates:
> >>
> >> Low 95% bound of 23%: 84% confidence.
> >>
> >> Actual measurement of 30%: 92% confidence.
> >>
> >> High 95% bound of 38%: 98% confidence.
> >>
> >> So there is still some chance that Peter's patch is helping.  I will
> >> run for one more evening, after which it will be time to move forward
> >> to 4.6-rc1.
> >
> > And no luck reducing bounds.  However, moving to 4.6-rc1 did get some
> > of the trace_printk() to print.  The ftrace_dump()s resulted in RCU
> > CPU stall warnings, and the dumps were truncated due to test timeouts
> > in my scripting.  (I need to make my scripts more patient when they
> > see an ftrace dump in progress, I guess.)
> >
> > Here are the results:
> >
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.1.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.2.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.3.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.4.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.5.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.6.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.7.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.8.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.9.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.11.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.12.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.13.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.14.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.15.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.16.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.17.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.18.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.19.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.20.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.21.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.22.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.23.console.log.tgz
> > http://www2.rdrop.com/users/paulmck/submission/TREE03.24.console.log.tgz
> >
> > The config is here:
> >
> > http://www2.rdrop.com/users/paulmck/submission/config.tgz
> >
> > More runs to measure 4.6-rc1 base error rate...
> >
> > Thanx, Paul
> >
> G'day Paul,
> 
> 
> Thought I would update a few runs with the linux-4.6-rc kernels.
> 
> I have attached log outputs through dmesg showing rcu_preempt 

Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-04-03 Thread Paul E. McKenney
On Thu, Mar 31, 2016 at 08:42:55AM -0700, Paul E. McKenney wrote:
> On Wed, Mar 30, 2016 at 07:55:47AM -0700, Paul E. McKenney wrote:
> > On Tue, Mar 29, 2016 at 06:49:08AM -0700, Paul E. McKenney wrote:
> > > On Mon, Mar 28, 2016 at 05:28:14PM -0700, Paul E. McKenney wrote:
> > > > On Mon, Mar 28, 2016 at 05:25:18PM -0700, Paul E. McKenney wrote:
> > > > > On Mon, Mar 28, 2016 at 06:08:41AM -0700, Paul E. McKenney wrote:
> > > > > > On Mon, Mar 28, 2016 at 08:25:47AM +0200, Peter Zijlstra wrote:
> > > > > > > On Sun, Mar 27, 2016 at 02:06:41PM -0700, Paul E. McKenney wrote:
> > > > > 
> > > > > [ . . . ]
> > > > > 
> > > > > > > > OK, so I should instrument migration_call() if I get the repro 
> > > > > > > > rate up?
> > > > > > > 
> > > > > > > Can do, maybe try the below first. (yes I know how long it all 
> > > > > > > takes :/)
> > > > > > 
> > > > > > OK, will run this today, then run calibration for last night's run 
> > > > > > this
> > > > > > evening.
> > > 
> > > And of 18 two-hour runs, there were five failures, or about 28%.
> > > That said, I don't have even one significant digit on the failure rate,
> > > as 5 of 18 is within the 95% confidence limits for a failure probability
> > > as low as 12.5% and as high as 47%.
> > 
> > And after last night's run, this is narrowed down to between 23% and 38%,
> > which is close enough.  Average is 30%, 18 failures in 60 runs.
> > 
> > Next step is to test Peter's patch some more.  Might take a couple of
> > night's worth of runs to get statistical significance.  After which
> > it will be time to rebase to 4.6-rc1.
> 
> And the first night was not so good: 6 failures out of 24 runs.  Adding
> this to the 1-of-10 earlier gets 7 failures out of 34.  Here is how things
> stack up given the range of base failure estimates:
> 
> Low 95% bound of 23%: 84% confidence.
> 
> Actual measurement of 30%: 92% confidence.
> 
> High 95% bound of 38%: 98% confidence.
> 
> So there is still some chance that Peter's patch is helping.  I will
> run for one more evening, after which it will be time to move forward
> to 4.6-rc1.

And no luck reducing bounds.  However, moving to 4.6-rc1 did get some
of the trace_printk() to print.  The ftrace_dump()s resulted in RCU
CPU stall warnings, and the dumps were truncated due to test timeouts
in my scripting.  (I need to make my scripts more patient when they
see an ftrace dump in progress, I guess.)
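
For illustration, here is a minimal sketch of what "more patient" could look
like, assuming the scripting decides when to give up based on a console log
file.  This is a sketch only, not the actual test scripts: the idea is to
defer the kill while the log is still growing, so an in-progress ftrace dump
gets a chance to finish.

# Sketch only: wait past the run's deadline, but keep waiting while the
# console log is still growing (for example, while a dump is being emitted).
import os
import time

def wait_for_quiesce(console_log, deadline, grace=120):
    last_size = -1
    last_change = time.time()
    while True:
        size = os.path.getsize(console_log) if os.path.exists(console_log) else 0
        now = time.time()
        if size != last_size:
            last_size, last_change = size, now
        # Stop only once the deadline has passed AND the log has been idle
        # for `grace` seconds.
        if now >= deadline and now - last_change >= grace:
            return
        time.sleep(5)

# Example: allow a two-hour run, then wait out any trailing ftrace dump.
# wait_for_quiesce("console.log", time.time() + 2 * 60 * 60)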

Here are the results:

http://www2.rdrop.com/users/paulmck/submission/TREE03.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.1.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.2.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.3.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.4.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.5.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.6.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.7.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.8.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.9.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.11.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.12.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.13.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.14.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.15.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.16.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.17.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.18.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.19.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.20.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.21.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.22.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.23.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.24.console.log.tgz

The config is here:

http://www2.rdrop.com/users/paulmck/submission/config.tgz

More runs to measure 4.6-rc1 base error rate...

Thanx, Paul



Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-31 Thread Paul E. McKenney
On Wed, Mar 30, 2016 at 07:55:47AM -0700, Paul E. McKenney wrote:
> On Tue, Mar 29, 2016 at 06:49:08AM -0700, Paul E. McKenney wrote:
> > On Mon, Mar 28, 2016 at 05:28:14PM -0700, Paul E. McKenney wrote:
> > > On Mon, Mar 28, 2016 at 05:25:18PM -0700, Paul E. McKenney wrote:
> > > > On Mon, Mar 28, 2016 at 06:08:41AM -0700, Paul E. McKenney wrote:
> > > > > On Mon, Mar 28, 2016 at 08:25:47AM +0200, Peter Zijlstra wrote:
> > > > > > On Sun, Mar 27, 2016 at 02:06:41PM -0700, Paul E. McKenney wrote:
> > > > 
> > > > [ . . . ]
> > > > 
> > > > > > > OK, so I should instrument migration_call() if I get the repro 
> > > > > > > rate up?
> > > > > > 
> > > > > > Can do, maybe try the below first. (yes I know how long it all 
> > > > > > takes :/)
> > > > > 
> > > > > OK, will run this today, then run calibration for last night's run 
> > > > > this
> > > > > evening.
> > 
> > And of 18 two-hour runs, there were five failures, or about 28%.
> > That said, I don't have even one significant digit on the failure rate,
> > as 5 of 18 is within the 95% confidence limits for a failure probability
> > as low as 12.5% and as high as 47%.
> 
> And after last night's run, this is narrowed down to between 23% and 38%,
> which is close enough.  Average is 30%, 18 failures in 60 runs.
> 
> Next step is to test Peter's patch some more.  Might take a couple of
> night's worth of runs to get statistical significance.  After which
> it will be time to rebase to 4.6-rc1.

And the first night was not so good: 6 failures out of 24 runs.  Adding
this to the 1-of-10 earlier gets 7 failures out of 34.  Here is how things
stack up given the range of base failure estimates:

Low 95% bound of 23%:   84% confidence.

Actual measurement of 30%:  92% confidence.

High 95% bound of 38%:  98% confidence.

So there is still some chance that Peter's patch is helping.  I will
run for one more evening, after which it will be time to move forward
to 4.6-rc1.

Thanx, Paul

> > However, the previous night's runs gave 7 failures in 24 two-hour runs,
> > for about a 29% failure rate.  There is thus a good probability that my
> > disabling of TIF_POLLING_NRFLAG had no effect whatsoever, tantalizing
> > though that possibility might have been.
> > 
> > (FWIW, I use the pdf_binomial() and quantile_binomial() functions in
> > maxima for computing this stuff.  Similar stuff is no doubt available
> > in other math/stat packages as well.)
> > 
> > So we have bugs, but not much idea where they are.  Situation normal.
> > 
> > Other thoughts?
> > 
> > Thanx, Paul
> > 
> > > > And there was one failure out of ten runs.  If last night's failure rate
> > > > was typical (7 of 24), then I believe we can be about 87% confident that
> > > > this change helped.  That isn't all that confident, but...
> > > 
> > > And, as Murphy would have it, the instrumentation didn't trigger.  I just
> > > got the usual stall-warning messages with a starving RCU grace-period
> > > kthread.
> > > 
> > >   Thanx, Paul
> > > 
> > > > Tested-by: Paul E. McKenney 
> > > > 
> > > > So what to run tonight?
> > > > 
> > > > The most sane approach would be to run stock in order to get a baseline
> > > > failure rate.  It is tempting to run more of Peter's patch, but part of
> > > > the problem is that we don't know the current baseline.
> > > > 
> > > > So baseline it is...
> > > > 
> > > > Thanx, Paul
> > > > 
> > > > > Speaking of which, last night's run (disabling TIF_POLLING_NRFLAG)
> > > > > consisted of 24 two-hour runs.  Six of them had hard hangs, and 
> > > > > another
> > > > > had a hang that eventually unhung of its own accord.  I believe that 
> > > > > this
> > > > > is significantly fewer failures than from a stock kernel, but I could
> > > > > be wrong, and it will take some serious testing to give statistical
> > > > > confidence for whatever conclusion is correct.
> > > > > 
> > > > > > > > The other interesting case would be resched_cpu(), which uses
> > > > > > > > set_nr_and_not_polling() to kick a remote cpu to call 
> > > > > > > > schedule(). It
> > > > > > > > atomically sets TIF_NEED_RESCHED and returns if 
> > > > > > > > TIF_POLLING_NRFLAG was
> > > > > > > > not set. If indeed not, it will send an IPI.
> > > > > > > > 
> > > > > > > > This assumes the idle 'exit' path will do the same as the IPI 
> > > > > > > > does; and
> > > > > > > > if you look at cpu_idle_loop() it does indeed do both
> > > > > > > > preempt_fold_need_resched() and sched_ttwu_pending().
> > > > > > > > 
> > > > > > > > Note that one cannot rely on irq_enter()/irq_exit() being 
> > > > > > > > called for the
> > > > > > > > scheduler IPI.
> > > > > > > 
> > > > > > > OK, thank you for the info!  Any specific debug actions?
> > > > > > 
> > > > > > Dunno, something 

Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-30 Thread Paul E. McKenney
On Tue, Mar 29, 2016 at 06:49:08AM -0700, Paul E. McKenney wrote:
> On Mon, Mar 28, 2016 at 05:28:14PM -0700, Paul E. McKenney wrote:
> > On Mon, Mar 28, 2016 at 05:25:18PM -0700, Paul E. McKenney wrote:
> > > On Mon, Mar 28, 2016 at 06:08:41AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Mar 28, 2016 at 08:25:47AM +0200, Peter Zijlstra wrote:
> > > > > On Sun, Mar 27, 2016 at 02:06:41PM -0700, Paul E. McKenney wrote:
> > > 
> > > [ . . . ]
> > > 
> > > > > > OK, so I should instrument migration_call() if I get the repro rate 
> > > > > > up?
> > > > > 
> > > > > Can do, maybe try the below first. (yes I know how long it all takes 
> > > > > :/)
> > > > 
> > > > OK, will run this today, then run calibration for last night's run this
> > > > evening.
> 
> And of 18 two-hour runs, there were five failures, or about 28%.
> That said, I don't have even one significant digit on the failure rate,
> as 5 of 18 is within the 95% confidence limits for a failure probability
> as low as 12.5% and as high as 47%.

And after last night's run, this is narrowed down to between 23% and 38%,
which is close enough.  Average is 30%, 18 failures in 60 runs.

Next step is to test Peter's patch some more.  Might take a couple of
night's worth of runs to get statistical significance.  After which
it will be time to rebase to 4.6-rc1.
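
As a rough illustration of the arithmetic behind "statistical significance"
here (a sketch only, assuming a one-sided 95% test against the ~30% baseline;
not necessarily the exact convention used for the numbers in this thread):

# Sketch: given a baseline failure probability, how few failures would a
# batch of patched-kernel runs need to show before the improvement is
# statistically significant?
from scipy.stats import binom

BASELINE = 0.30          # stock-kernel failure rate estimate (18 of 60 runs)

def max_failures_for_significance(runs, baseline=BASELINE, alpha=0.05):
    # Largest k with P(X <= k) < alpha under the baseline rate; seeing k or
    # fewer failures would then be evidence that the patch helps.
    k = 0
    while binom.cdf(k, runs, baseline) < alpha:
        k += 1
    return k - 1     # -1 means even zero failures would not be significant

for runs in (12, 24, 36, 48):        # roughly one to four nights of 2h runs
    k = max_failures_for_significance(runs)
    print("%2d runs: at most %d failures" % (runs, k))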

Thanx, Paul

> However, the previous night's runs gave 7 failures in 24 two-hour runs,
> for about a 29% failure rate.  There is thus a good probability that my
> disabling of TIF_POLLING_NRFLAG had no effect whatsoever, tantalizing
> though that possibility might have been.
> 
> (FWIW, I use the pdf_binomial() and quantile_binomial() functions in
> maxima for computing this stuff.  Similar stuff is no doubt available
> in other math/stat packages as well.)
> 
> So we have bugs, but not much idea where they are.  Situation normal.
> 
> Other thoughts?
> 
>   Thanx, Paul
> 
> > > And there was one failure out of ten runs.  If last night's failure rate
> > > was typical (7 of 24), then I believe we can be about 87% confident that
> > > this change helped.  That isn't all that confident, but...
> > 
> > And, as Murphy would have it, the instrumentation didn't trigger.  I just
> > got the usual stall-warning messages with a starving RCU grace-period
> > kthread.
> > 
> > Thanx, Paul
> > 
> > > Tested-by: Paul E. McKenney 
> > > 
> > > So what to run tonight?
> > > 
> > > The most sane approach would be to run stock in order to get a baseline
> > > failure rate.  It is tempting to run more of Peter's patch, but part of
> > > the problem is that we don't know the current baseline.
> > > 
> > > So baseline it is...
> > > 
> > >   Thanx, Paul
> > > 
> > > > Speaking of which, last night's run (disabling TIF_POLLING_NRFLAG)
> > > > consisted of 24 two-hour runs.  Six of them had hard hangs, and another
> > > > had a hang that eventually unhung of its own accord.  I believe that 
> > > > this
> > > > is significantly fewer failures than from a stock kernel, but I could
> > > > be wrong, and it will take some serious testing to give statistical
> > > > confidence for whatever conclusion is correct.
> > > > 
> > > > > > > The other interesting case would be resched_cpu(), which uses
> > > > > > > set_nr_and_not_polling() to kick a remote cpu to call schedule(). 
> > > > > > > It
> > > > > > > atomically sets TIF_NEED_RESCHED and returns if 
> > > > > > > TIF_POLLING_NRFLAG was
> > > > > > > not set. If indeed not, it will send an IPI.
> > > > > > > 
> > > > > > > This assumes the idle 'exit' path will do the same as the IPI 
> > > > > > > does; and
> > > > > > > if you look at cpu_idle_loop() it does indeed do both
> > > > > > > preempt_fold_need_resched() and sched_ttwu_pending().
> > > > > > > 
> > > > > > > Note that one cannot rely on irq_enter()/irq_exit() being called 
> > > > > > > for the
> > > > > > > scheduler IPI.
> > > > > > 
> > > > > > OK, thank you for the info!  Any specific debug actions?
> > > > > 
> > > > > Dunno, something like the below should bring visibility into the
> > > > > (lockless) wake_list thingy.
> > > > > 
> > > > > So these trace_printk()s should happen between trace_sched_waking() 
> > > > > and
> > > > > trace_sched_wakeup() (I've not fully read the thread, but ISTR you had
> > > > > some traces with these here thingies on).
> > > > > 
> > > > > ---
> > > > >  arch/x86/include/asm/bitops.h | 6 --
> > > > >  kernel/sched/core.c   | 9 +
> > > > >  2 files changed, 13 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/arch/x86/include/asm/bitops.h 
> > > > > b/arch/x86/include/asm/bitops.h
> > > > > index 7766d1cf096e..5345784d5e41 100644
> > > > > --- a/arch/x86/include/asm/bitops.h
> > > > > +++ b/arch/x86/include/asm/bitops.h
> > > > > 

Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-29 Thread Paul E. McKenney
On Mon, Mar 28, 2016 at 05:28:14PM -0700, Paul E. McKenney wrote:
> On Mon, Mar 28, 2016 at 05:25:18PM -0700, Paul E. McKenney wrote:
> > On Mon, Mar 28, 2016 at 06:08:41AM -0700, Paul E. McKenney wrote:
> > > On Mon, Mar 28, 2016 at 08:25:47AM +0200, Peter Zijlstra wrote:
> > > > On Sun, Mar 27, 2016 at 02:06:41PM -0700, Paul E. McKenney wrote:
> > 
> > [ . . . ]
> > 
> > > > > OK, so I should instrument migration_call() if I get the repro rate 
> > > > > up?
> > > > 
> > > > Can do, maybe try the below first. (yes I know how long it all takes :/)
> > > 
> > > OK, will run this today, then run calibration for last night's run this
> > > evening.

And of 18 two-hour runs, there were five failures, or about 28%.
That said, I don't have even one significant digit on the failure rate,
as 5 of 18 is within the 95% confidence limits for a failure probability
as low as 12.5% and as high as 47%.

However, the previous night's runs gave 7 failures in 24 two-hour runs,
for about a 29% failure rate.  There is thus a good probability that my
disabling of TIF_POLLING_NRFLAG had no effect whatsoever, tantalizing
though that possibility might have been.

(FWIW, I use the pdf_binomial() and quantile_binomial() functions in
maxima for computing this stuff.  Similar stuff is no doubt available
in other math/stat packages as well.)
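
For reference, a rough Python/SciPy analogue of that Maxima calculation (a
sketch only; the exact convention behind the 12.5% and 47% figures above may
differ slightly).  The idea is that a candidate true failure rate p remains
plausible if the observed count falls inside the central 95% of Binomial(n, p):

from scipy.stats import binom

runs, failures = 18, 5       # five failures in 18 two-hour runs

def plausible(p, k=failures, n=runs):
    lo = binom.ppf(0.025, n, p)    # 2.5% quantile, like quantile_binomial()
    hi = binom.ppf(0.975, n, p)    # 97.5% quantile
    return lo <= k <= hi

for p in (0.08, 0.125, 0.28, 0.47, 0.55):
    print("p = %.3f: %s" % (p, "plausible" if plausible(p) else "ruled out"))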

So we have bugs, but not much idea where they are.  Situation normal.

Other thoughts?

Thanx, Paul

> > And there was one failure out of ten runs.  If last night's failure rate
> > was typical (7 of 24), then I believe we can be about 87% confident that
> > this change helped.  That isn't all that confident, but...
> 
> And, as Murphy would have it, the instrumentation didn't trigger.  I just
> got the usual stall-warning messages with a starving RCU grace-period
> kthread.
> 
>   Thanx, Paul
> 
> > Tested-by: Paul E. McKenney 
> > 
> > So what to run tonight?
> > 
> > The most sane approach would be to run stock in order to get a baseline
> > failure rate.  It is tempting to run more of Peter's patch, but part of
> > the problem is that we don't know the current baseline.
> > 
> > So baseline it is...
> > 
> > Thanx, Paul
> > 
> > > Speaking of which, last night's run (disabling TIF_POLLING_NRFLAG)
> > > consisted of 24 two-hour runs.  Six of them had hard hangs, and another
> > > had a hang that eventually unhung of its own accord.  I believe that this
> > > is significantly fewer failures than from a stock kernel, but I could
> > > be wrong, and it will take some serious testing to give statistical
> > > confidence for whatever conclusion is correct.
> > > 
> > > > > > The other interesting case would be resched_cpu(), which uses
> > > > > > set_nr_and_not_polling() to kick a remote cpu to call schedule(). It
> > > > > > atomically sets TIF_NEED_RESCHED and returns if TIF_POLLING_NRFLAG 
> > > > > > was
> > > > > > not set. If indeed not, it will send an IPI.
> > > > > > 
> > > > > > This assumes the idle 'exit' path will do the same as the IPI does; 
> > > > > > and
> > > > > > if you look at cpu_idle_loop() it does indeed do both
> > > > > > preempt_fold_need_resched() and sched_ttwu_pending().
> > > > > > 
> > > > > > Note that one cannot rely on irq_enter()/irq_exit() being called 
> > > > > > for the
> > > > > > scheduler IPI.
> > > > > 
> > > > > OK, thank you for the info!  Any specific debug actions?
> > > > 
> > > > Dunno, something like the below should bring visibility into the
> > > > (lockless) wake_list thingy.
> > > > 
> > > > So these trace_printk()s should happen between trace_sched_waking() and
> > > > trace_sched_wakeup() (I've not fully read the thread, but ISTR you had
> > > > some traces with these here thingies on).
> > > > 
> > > > ---
> > > >  arch/x86/include/asm/bitops.h | 6 --
> > > >  kernel/sched/core.c   | 9 +
> > > >  2 files changed, 13 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/include/asm/bitops.h 
> > > > b/arch/x86/include/asm/bitops.h
> > > > index 7766d1cf096e..5345784d5e41 100644
> > > > --- a/arch/x86/include/asm/bitops.h
> > > > +++ b/arch/x86/include/asm/bitops.h
> > > > @@ -112,11 +112,13 @@ clear_bit(long nr, volatile unsigned long *addr)
> > > > if (IS_IMMEDIATE(nr)) {
> > > > asm volatile(LOCK_PREFIX "andb %1,%0"
> > > > : CONST_MASK_ADDR(nr, addr)
> > > > -   : "iq" ((u8)~CONST_MASK(nr)));
> > > > +   : "iq" ((u8)~CONST_MASK(nr))
> > > > +   : "memory");
> > > > } else {
> > > > asm volatile(LOCK_PREFIX "btr %1,%0"
> > > > : BITOP_ADDR(addr)
> > > > -   : "Ir" (nr));
> > > > +   : "Ir" (nr)
> > > > +   

Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-29 Thread Paul E. McKenney
On Mon, Mar 28, 2016 at 05:28:14PM -0700, Paul E. McKenney wrote:
> On Mon, Mar 28, 2016 at 05:25:18PM -0700, Paul E. McKenney wrote:
> > On Mon, Mar 28, 2016 at 06:08:41AM -0700, Paul E. McKenney wrote:
> > > On Mon, Mar 28, 2016 at 08:25:47AM +0200, Peter Zijlstra wrote:
> > > > On Sun, Mar 27, 2016 at 02:06:41PM -0700, Paul E. McKenney wrote:
> > 
> > [ . . . ]
> > 
> > > > > OK, so I should instrument migration_call() if I get the repro rate 
> > > > > up?
> > > > 
> > > > Can do, maybe try the below first. (yes I know how long it all takes :/)
> > > 
> > > OK, will run this today, then run calibration for last night's run this
> > > evening.

And of 18 two-hour runs, there were five failures, or about 28%.
That said, I don't have even one significant digit on the failure rate,
as 5 of 18 is within the 95% confidence limits for a failure probability
as low as 12.5% and as high as 47%.

However, the previous night's runs gave 7 failures in 24 two-hour runs,
for about a 29% failure rate.  There is thus a good probability that my
disabling of TIF_POLLING_NRFLAG had no effect whatsoever, tantalizing
though that possibility might have been.

(FWIW, I use the pdf_binomial() and quantile_binomial() functions in
maxima for computing this stuff.  Similar stuff is no doubt available
in other math/stat packages as well.)
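
(Editorial aside, not part of the original message: for readers without
maxima handy, the quoted endpoints can be sanity-checked with a few lines
of standalone C.  The exact interval depends on the convention used, so
this only confirms that failure probabilities of 12.5% and 47% are both
consistent, at the two-sided 95% level, with seeing 5 failures in 18 runs.
binom_pmf() and binom_cdf() are ad-hoc helpers for this sketch, not maxima
or kernel APIs.)

/* Build with: gcc -O2 binom-check.c -lm */
#include <stdio.h>
#include <math.h>

/* P(X == k) for X ~ Binomial(n, p), using lgamma() to avoid overflowing factorials */
static double binom_pmf(int n, int k, double p)
{
	double logc = lgamma(n + 1.0) - lgamma(k + 1.0) - lgamma(n - k + 1.0);

	return exp(logc + k * log(p) + (n - k) * log(1.0 - p));
}

/* P(X <= k) for X ~ Binomial(n, p) */
static double binom_cdf(int n, int k, double p)
{
	double sum = 0.0;
	int i;

	for (i = 0; i <= k; i++)
		sum += binom_pmf(n, i, p);
	return sum;
}

int main(void)
{
	const int n = 18, k = 5;

	/* Lower endpoint: would 5 or more failures be plausible at p = 0.125? */
	printf("P(X >= 5 | p = 0.125) = %.3f (want >= 0.025)\n",
	       1.0 - binom_cdf(n, k - 1, 0.125));

	/* Upper endpoint: would 5 or fewer failures be plausible at p = 0.47? */
	printf("P(X <= 5 | p = 0.47)  = %.3f (want >= 0.025)\n",
	       binom_cdf(n, k, 0.47));
	return 0;
}

Both probabilities come out well above 0.025 (roughly 0.07 and 0.08), so
the quoted endpoints are at least in the right ballpark for 5-of-18.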

So we have bugs, but not much idea where they are.  Situation normal.

Other thoughts?

Thanx, Paul

> > And there was one failure out of ten runs.  If last night's failure rate
> > was typical (7 of 24), then I believe we can be about 87% confident that
> > this change helped.  That isn't all that confident, but...
> 
> And, as Murphy would have it, the instrumentation didn't trigger.  I just
> got the usual stall-warning messages with a starving RCU grace-period
> kthread.
> 
>   Thanx, Paul
> 
> > Tested-by: Paul E. McKenney 
> > 
> > So what to run tonight?
> > 
> > The most sane approach would be to run stock in order to get a baseline
> > failure rate.  It is tempting to run more of Peter's patch, but part of
> > the problem is that we don't know the current baseline.
> > 
> > So baseline it is...
> > 
> > Thanx, Paul
> > 
> > > Speaking of which, last night's run (disabling TIF_POLLING_NRFLAG)
> > > consisted of 24 two-hour runs.  Six of them had hard hangs, and another
> > > had a hang that eventually unhung of its own accord.  I believe that this
> > > is significantly fewer failures than from a stock kernel, but I could
> > > be wrong, and it will take some serious testing to give statistical
> > > confidence for whatever conclusion is correct.
> > > 
> > > > > > The other interesting case would be resched_cpu(), which uses
> > > > > > set_nr_and_not_polling() to kick a remote cpu to call schedule(). It
> > > > > > atomically sets TIF_NEED_RESCHED and returns if TIF_POLLING_NRFLAG 
> > > > > > was
> > > > > > not set. If indeed not, it will send an IPI.
> > > > > > 
> > > > > > This assumes the idle 'exit' path will do the same as the IPI does; 
> > > > > > and
> > > > > > if you look at cpu_idle_loop() it does indeed do both
> > > > > > preempt_fold_need_resched() and sched_ttwu_pending().
> > > > > > 
> > > > > > Note that one cannot rely on irq_enter()/irq_exit() being called 
> > > > > > for the
> > > > > > scheduler IPI.
> > > > > 
> > > > > OK, thank you for the info!  Any specific debug actions?
> > > > 
> > > > Dunno, something like the below should bring visibility into the
> > > > (lockless) wake_list thingy.
> > > > 
> > > > So these trace_printk()s should happen between trace_sched_waking() and
> > > > trace_sched_wakeup() (I've not fully read the thread, but ISTR you had
> > > > some traces with these here thingies on).
> > > > 
> > > > ---
> > > >  arch/x86/include/asm/bitops.h | 6 --
> > > >  kernel/sched/core.c   | 9 +
> > > >  2 files changed, 13 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/include/asm/bitops.h 
> > > > b/arch/x86/include/asm/bitops.h
> > > > index 7766d1cf096e..5345784d5e41 100644
> > > > --- a/arch/x86/include/asm/bitops.h
> > > > +++ b/arch/x86/include/asm/bitops.h
> > > > @@ -112,11 +112,13 @@ clear_bit(long nr, volatile unsigned long *addr)
> > > > if (IS_IMMEDIATE(nr)) {
> > > > asm volatile(LOCK_PREFIX "andb %1,%0"
> > > > : CONST_MASK_ADDR(nr, addr)
> > > > -   : "iq" ((u8)~CONST_MASK(nr)));
> > > > +   : "iq" ((u8)~CONST_MASK(nr))
> > > > +   : "memory");
> > > > } else {
> > > > asm volatile(LOCK_PREFIX "btr %1,%0"
> > > > : BITOP_ADDR(addr)
> > > > -   : "Ir" (nr));
> > > > +   : "Ir" (nr)
> > > > +   : "memory");
> > > >

Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-28 Thread Paul E. McKenney
On Mon, Mar 28, 2016 at 05:25:18PM -0700, Paul E. McKenney wrote:
> On Mon, Mar 28, 2016 at 06:08:41AM -0700, Paul E. McKenney wrote:
> > On Mon, Mar 28, 2016 at 08:25:47AM +0200, Peter Zijlstra wrote:
> > > On Sun, Mar 27, 2016 at 02:06:41PM -0700, Paul E. McKenney wrote:
> 
> [ . . . ]
> 
> > > > OK, so I should instrument migration_call() if I get the repro rate up?
> > > 
> > > Can do, maybe try the below first. (yes I know how long it all takes :/)
> > 
> > OK, will run this today, then run calibration for last night's run this
> > evening.
> 
> And there was one failure out of ten runs.  If last night's failure rate
> was typical (7 of 24), then I believe we can be about 87% confident that
> this change helped.  That isn't all that confident, but...

And, as Murphy would have it, the instrumentation didn't trigger.  I just
got the usual stall-warning messages with a starving RCU grace-period
kthread.

Thanx, Paul

> Tested-by: Paul E. McKenney 
> 
> So what to run tonight?
> 
> The most sane approach would be to run stock in order to get a baseline
> failure rate.  It is tempting to run more of Peter's patch, but part of
> the problem is that we don't know the current baseline.
> 
> So baseline it is...
> 
>   Thanx, Paul
> 
> > Speaking of which, last night's run (disabling TIF_POLLING_NRFLAG)
> > consisted of 24 two-hour runs.  Six of them had hard hangs, and another
> > had a hang that eventually unhung of its own accord.  I believe that this
> > is significantly fewer failures than from a stock kernel, but I could
> > be wrong, and it will take some serious testing to give statistical
> > confidence for whatever conclusion is correct.
> > 
> > > > > The other interesting case would be resched_cpu(), which uses
> > > > > set_nr_and_not_polling() to kick a remote cpu to call schedule(). It
> > > > > atomically sets TIF_NEED_RESCHED and returns if TIF_POLLING_NRFLAG was
> > > > > not set. If indeed not, it will send an IPI.
> > > > > 
> > > > > This assumes the idle 'exit' path will do the same as the IPI does; 
> > > > > and
> > > > > if you look at cpu_idle_loop() it does indeed do both
> > > > > preempt_fold_need_resched() and sched_ttwu_pending().
> > > > > 
> > > > > Note that one cannot rely on irq_enter()/irq_exit() being called for 
> > > > > the
> > > > > scheduler IPI.
> > > > 
> > > > OK, thank you for the info!  Any specific debug actions?
> > > 
> > > Dunno, something like the below should bring visibility into the
> > > (lockless) wake_list thingy.
> > > 
> > > So these trace_printk()s should happen between trace_sched_waking() and
> > > trace_sched_wakeup() (I've not fully read the thread, but ISTR you had
> > > some traces with these here thingies on).
> > > 
> > > ---
> > >  arch/x86/include/asm/bitops.h | 6 --
> > >  kernel/sched/core.c   | 9 +
> > >  2 files changed, 13 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> > > index 7766d1cf096e..5345784d5e41 100644
> > > --- a/arch/x86/include/asm/bitops.h
> > > +++ b/arch/x86/include/asm/bitops.h
> > > @@ -112,11 +112,13 @@ clear_bit(long nr, volatile unsigned long *addr)
> > >   if (IS_IMMEDIATE(nr)) {
> > >   asm volatile(LOCK_PREFIX "andb %1,%0"
> > >   : CONST_MASK_ADDR(nr, addr)
> > > - : "iq" ((u8)~CONST_MASK(nr)));
> > > + : "iq" ((u8)~CONST_MASK(nr))
> > > + : "memory");
> > >   } else {
> > >   asm volatile(LOCK_PREFIX "btr %1,%0"
> > >   : BITOP_ADDR(addr)
> > > - : "Ir" (nr));
> > > + : "Ir" (nr)
> > > + : "memory");
> > >   }
> > >  }
> > 
> > Is the above addition of "memory" strictly for the debug below, or is
> > it also a potential fix?
> > 
> > Starting it up regardless, but figured I should ask!
> > 
> > Thanx, Paul
> > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 0b21e7a724e1..b446f73c530d 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -1669,6 +1669,7 @@ void sched_ttwu_pending(void)
> > >   while (llist) {
> > >   p = llist_entry(llist, struct task_struct, wake_entry);
> > >   llist = llist_next(llist);
> > > + trace_printk("waking %d\n", p->pid);
> > >   ttwu_do_activate(rq, p, 0);
> > >   }
> > > 
> > > @@ -1719,6 +1720,7 @@ static void ttwu_queue_remote(struct task_struct 
> > > *p, int cpu)
> > >   struct rq *rq = cpu_rq(cpu);
> > > 
> > >   if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
> > > + trace_printk("queued %d for waking on %d\n", p->pid, cpu);
> > >   if (!set_nr_if_polling(rq->idle))
> > >   smp_send_reschedule(cpu);
> > >   else
> > > @@ -5397,10 

Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-28 Thread Paul E. McKenney
On Mon, Mar 28, 2016 at 06:08:41AM -0700, Paul E. McKenney wrote:
> On Mon, Mar 28, 2016 at 08:25:47AM +0200, Peter Zijlstra wrote:
> > On Sun, Mar 27, 2016 at 02:06:41PM -0700, Paul E. McKenney wrote:

[ . . . ]

> > > OK, so I should instrument migration_call() if I get the repro rate up?
> > 
> > Can do, maybe try the below first. (yes I know how long it all takes :/)
> 
> OK, will run this today, then run calibration for last night's run this
> evening.

And there was one failure out of ten runs.  If last night's failure rate
was typical (7 of 24), then I believe we can be about 87% confident that
this change helped.  That isn't all that confident, but...
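
(Editorial aside, not from the original message: a back-of-the-envelope
version of that figure.  If the baseline failure probability really were
p = 7/24 ≈ 0.29, the chance of seeing at most one failure in ten
independent two-hour runs would be

    P(X <= 1) = (1 - p)^10 + 10 p (1 - p)^9 ≈ 0.03 + 0.13 ≈ 0.16,

so a result this good would happen by luck only about 16% of the time,
i.e. roughly 84% confidence that the change helped; close to the ~87%
quoted above, with the exact number depending on the convention used.)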

Tested-by: Paul E. McKenney 

So what to run tonight?

The most sane approach would be to run stock in order to get a baseline
failure rate.  It is tempting to run more of Peter's patch, but part of
the problem is that we don't know the current baseline.

So baseline it is...

Thanx, Paul

> Speaking of which, last night's run (disabling TIF_POLLING_NRFLAG)
> consisted of 24 two-hour runs.  Six of them had hard hangs, and another
> had a hang that eventually unhung of its own accord.  I believe that this
> is significantly fewer failures than from a stock kernel, but I could
> be wrong, and it will take some serious testing to give statistical
> confidence for whatever conclusion is correct.
> 
> > > > The other interesting case would be resched_cpu(), which uses
> > > > set_nr_and_not_polling() to kick a remote cpu to call schedule(). It
> > > > atomically sets TIF_NEED_RESCHED and returns if TIF_POLLING_NRFLAG was
> > > > not set. If indeed not, it will send an IPI.
> > > > 
> > > > This assumes the idle 'exit' path will do the same as the IPI does; and
> > > > if you look at cpu_idle_loop() it does indeed do both
> > > > preempt_fold_need_resched() and sched_ttwu_pending().
> > > > 
> > > > Note that one cannot rely on irq_enter()/irq_exit() being called for the
> > > > scheduler IPI.
> > > 
> > > OK, thank you for the info!  Any specific debug actions?
> > 
> > Dunno, something like the below should bring visibility into the
> > (lockless) wake_list thingy.
> > 
> > So these trace_printk()s should happen between trace_sched_waking() and
> > trace_sched_wakeup() (I've not fully read the thread, but ISTR you had
> > some traces with these here thingies on).
> > 
> > ---
> >  arch/x86/include/asm/bitops.h | 6 --
> >  kernel/sched/core.c   | 9 +
> >  2 files changed, 13 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> > index 7766d1cf096e..5345784d5e41 100644
> > --- a/arch/x86/include/asm/bitops.h
> > +++ b/arch/x86/include/asm/bitops.h
> > @@ -112,11 +112,13 @@ clear_bit(long nr, volatile unsigned long *addr)
> > if (IS_IMMEDIATE(nr)) {
> > asm volatile(LOCK_PREFIX "andb %1,%0"
> > : CONST_MASK_ADDR(nr, addr)
> > -   : "iq" ((u8)~CONST_MASK(nr)));
> > +   : "iq" ((u8)~CONST_MASK(nr))
> > +   : "memory");
> > } else {
> > asm volatile(LOCK_PREFIX "btr %1,%0"
> > : BITOP_ADDR(addr)
> > -   : "Ir" (nr));
> > +   : "Ir" (nr)
> > +   : "memory");
> > }
> >  }
> 
> Is the above addition of "memory" strictly for the debug below, or is
> it also a potential fix?
> 
> Starting it up regardless, but figured I should ask!
> 
>   Thanx, Paul
> 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 0b21e7a724e1..b446f73c530d 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1669,6 +1669,7 @@ void sched_ttwu_pending(void)
> > while (llist) {
> > p = llist_entry(llist, struct task_struct, wake_entry);
> > llist = llist_next(llist);
> > +   trace_printk("waking %d\n", p->pid);
> > ttwu_do_activate(rq, p, 0);
> > }
> > 
> > @@ -1719,6 +1720,7 @@ static void ttwu_queue_remote(struct task_struct *p, 
> > int cpu)
> > struct rq *rq = cpu_rq(cpu);
> > 
> > if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
> > +   trace_printk("queued %d for waking on %d\n", p->pid, cpu);
> > if (!set_nr_if_polling(rq->idle))
> > smp_send_reschedule(cpu);
> > else
> > @@ -5397,10 +5399,17 @@ migration_call(struct notifier_block *nfb, unsigned 
> > long action, void *hcpu)
> > migrate_tasks(rq);
> > BUG_ON(rq->nr_running != 1); /* the migration thread */
> > raw_spin_unlock_irqrestore(&rq->lock, flags);
> > +
> > +   /* really bad m'kay */
> > +   WARN_ON(!llist_empty(&rq->wake_list));
> > +
> > break;
> > 
> > case CPU_DEAD:
> > calc_load_migrate(rq);
> > +
> > +   /* more bad */
> > +   

Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-28 Thread Paul E. McKenney
On Mon, Mar 28, 2016 at 04:12:19PM +, Mathieu Desnoyers wrote:
> - On Mar 28, 2016, at 11:56 AM, Paul E. McKenney 
> paul...@linux.vnet.ibm.com wrote:
> 
> > On Mon, Mar 28, 2016 at 03:07:36PM +, Mathieu Desnoyers wrote:
> >> - On Mar 28, 2016, at 9:29 AM, Paul E. McKenney 
> >> paul...@linux.vnet.ibm.com
> >> wrote:
> >> 
> >> > On Mon, Mar 28, 2016 at 08:28:51AM +0200, Peter Zijlstra wrote:
> >> >> On Sun, Mar 27, 2016 at 02:09:14PM -0700, Paul E. McKenney wrote:
> >> >> 
> >> >> > > Does that system have MONITOR/MWAIT errata?
> >> >> > 
> >> >> > On the off-chance that this question was also directed at me,
> >> >> 
> >> >> Hehe, it wasn't, however, since we're here..
> >> >> 
> >> >> > here is
> >> >> > what I am running on.  I am running in a qemu/KVM virtual machine, in
> >> >> > case that matters.
> >> >> 
> >> >> Have you actually tried on real proper hardware? Does it still reproduce
> >> >> there?
> >> > 
> >> > Ross has, but I have not, given that I have a shared system on the one
> >> > hand and a single-socket (four core, eight hardware thread) laptop on
> >> > the other that has even longer reproduction times.  The repeat-by is
> >> > as follows:
> >> > 
> >> > o  Build a kernel with the following Kconfigs:
> >> > 
> >> >  CONFIG_SMP=y
> >> >  CONFIG_NR_CPUS=16
> >> >  CONFIG_PREEMPT_NONE=n
> >> >  CONFIG_PREEMPT_VOLUNTARY=n
> >> >  CONFIG_PREEMPT=y
> >> >  # This should result in CONFIG_PREEMPT_RCU=y
> >> >  CONFIG_HZ_PERIODIC=y
> >> >  CONFIG_NO_HZ_IDLE=n
> >> >  CONFIG_NO_HZ_FULL=n
> >> >  CONFIG_RCU_TRACE=y
> >> >  CONFIG_HOTPLUG_CPU=y
> >> >  CONFIG_RCU_FANOUT=2
> >> >  CONFIG_RCU_FANOUT_LEAF=2
> >> >  CONFIG_RCU_NOCB_CPU=n
> >> >  CONFIG_DEBUG_LOCK_ALLOC=n
> >> >  CONFIG_RCU_BOOST=y
> >> >  CONFIG_RCU_KTHREAD_PRIO=2
> >> >  CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
> >> >  CONFIG_RCU_EXPERT=y
> >> >  CONFIG_RCU_TORTURE_TEST=y
> >> >  CONFIG_PRINTK_TIME=y
> >> >  CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP=y
> >> >  CONFIG_RCU_TORTURE_TEST_SLOW_INIT=y
> >> >  CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT=y
> >> > 
> >> >  If desired, you can instead build with CONFIG_RCU_TORTURE_TEST=m
> >> >  and modprobe/insmod the module manually.
> >> > 
> >> > o  Find a two-socket x86 system or larger, with at least 16 CPUs.
> >> > 
> >> > o  Boot the kernel with the following kernel boot parameters:
> >> > 
> >> >  rcutorture.onoff_interval=1 rcutorture.onoff_holdoff=30
> >> > 
> >> >  The onoff_holdoff is only needed for CONFIG_RCU_TORTURE_TEST=y.
> >> >  When manually setting up the module, you get the holdoff for
> >> >  free, courtesy of human timescales.
> >> > 
> >> > In the absence of instrumentation, I get failures usually within a
> >> > couple of hours, though sometimes much longer.  With instrumentation,
> >> > the sky appears to be the limit.  :-/
> >> > 
> >> > Ross is running on bare metal with no CPU hotplug, so perhaps his setup
> >> > is of more immediate interest.  He is seeing the same symptoms that I am,
> >> > namely a task being repeatedly awakened without actually coming out of
> >> > TASK_INTERRUPTIBLE state, let alone running.  As you pointed out earlier,
> >> > he cannot be seeing the same bug that my crude patch suppresses, but
> >> > given that I still see a few failures with that crude patch, it is quite
> >> > possible that there is still a common bug.
> >> 
> >> With respect to bare metal vs KVM guest, I've reported an issue with
> >> inaccurate detection of TSC as being an unreliable time source on a
> >> KVM guest. The basic setup is to overcommit the CPU use across the
> >> entire host, thus leading to preemption of the guest. The guest TSC
> >> watchdog then falsely assumes that TSC is unreliable, because it gets
> >> preempted for a long time (e.g. 0.5 second) between reading the HPET
> >> and the TSC.
> >> 
> >> Ref. http://lkml.iu.edu/hypermail/linux/kernel/1509.1/00379.html
> >> 
> >> I'm wondering if what Paul is observing in the KVM setup might be
> >> caused by long preemption by the host. One way to stress test this
> >> is to run parallel kernel builds on the host (or in another guest)
> >> while the guest is running, thus over-committing the CPU use.
> >> 
> >> Thoughts ?
> > 
> > If I run NO_HZ_FULL, I do get warnings about unstable timesources.
> > 
> > And certainly guest VCPUs can be preempted.  However, if they were
> > preempted for the lengths of time I am seeing, I should also see
> > softlockup warnings on the host, which I do not see.
> 
> Why would you see softlockup warning on the host ?
> 
> I expect the priority at which the kvm vcpu runs is much lower than
> the priority of the rcu worker threads on the host. Therefore, you
> might very well have long preemption delays for kvm vcpus while the
> rcu worker threads run fine on the host kernel because they have
> a higher priority.
> 
> Am I missing something ?

Right, host/guest confusion on my part.  I should expect softlockups
on the -guest- because rcutorture runs almost entirely 

Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-28 Thread Mathieu Desnoyers
- On Mar 28, 2016, at 11:56 AM, Paul E. McKenney paul...@linux.vnet.ibm.com 
wrote:

> On Mon, Mar 28, 2016 at 03:07:36PM +, Mathieu Desnoyers wrote:
>> - On Mar 28, 2016, at 9:29 AM, Paul E. McKenney 
>> paul...@linux.vnet.ibm.com
>> wrote:
>> 
>> > On Mon, Mar 28, 2016 at 08:28:51AM +0200, Peter Zijlstra wrote:
>> >> On Sun, Mar 27, 2016 at 02:09:14PM -0700, Paul E. McKenney wrote:
>> >> 
>> >> > > Does that system have MONITOR/MWAIT errata?
>> >> > 
>> >> > On the off-chance that this question was also directed at me,
>> >> 
>> >> Hehe, it wasn't, however, since we're here..
>> >> 
>> >> > here is
>> >> > what I am running on.  I am running in a qemu/KVM virtual machine, in
>> >> > case that matters.
>> >> 
>> >> Have you actually tried on real proper hardware? Does it still reproduce
>> >> there?
>> > 
>> > Ross has, but I have not, given that I have a shared system on the one
>> > hand and a single-socket (four core, eight hardware thread) laptop on
>> > the other that has even longer reproduction times.  The repeat-by is
>> > as follows:
>> > 
>> > o  Build a kernel with the following Kconfigs:
>> > 
>> >CONFIG_SMP=y
>> >CONFIG_NR_CPUS=16
>> >CONFIG_PREEMPT_NONE=n
>> >CONFIG_PREEMPT_VOLUNTARY=n
>> >CONFIG_PREEMPT=y
>> ># This should result in CONFIG_PREEMPT_RCU=y
>> >CONFIG_HZ_PERIODIC=y
>> >CONFIG_NO_HZ_IDLE=n
>> >CONFIG_NO_HZ_FULL=n
>> >CONFIG_RCU_TRACE=y
>> >CONFIG_HOTPLUG_CPU=y
>> >CONFIG_RCU_FANOUT=2
>> >CONFIG_RCU_FANOUT_LEAF=2
>> >CONFIG_RCU_NOCB_CPU=n
>> >CONFIG_DEBUG_LOCK_ALLOC=n
>> >CONFIG_RCU_BOOST=y
>> >CONFIG_RCU_KTHREAD_PRIO=2
>> >CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
>> >CONFIG_RCU_EXPERT=y
>> >CONFIG_RCU_TORTURE_TEST=y
>> >CONFIG_PRINTK_TIME=y
>> >CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP=y
>> >CONFIG_RCU_TORTURE_TEST_SLOW_INIT=y
>> >CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT=y
>> > 
>> >If desired, you can instead build with CONFIG_RCU_TORTURE_TEST=m
>> >and modprobe/insmod the module manually.
>> > 
>> > o  Find a two-socket x86 system or larger, with at least 16 CPUs.
>> > 
>> > o  Boot the kernel with the following kernel boot parameters:
>> > 
>> >rcutorture.onoff_interval=1 rcutorture.onoff_holdoff=30
>> > 
>> >The onoff_holdoff is only needed for CONFIG_RCU_TORTURE_TEST=y.
>> >When manually setting up the module, you get the holdoff for
>> >free, courtesy of human timescales.
>> > 
>> > In the absence of instrumentation, I get failures usually within a
>> > couple of hours, though sometimes much longer.  With instrumentation,
>> > the sky appears to be the limit.  :-/
>> > 
>> > Ross is running on bare metal with no CPU hotplug, so perhaps his setup
>> > is of more immediate interest.  He is seeing the same symptoms that I am,
>> > namely a task being repeatedly awakened without actually coming out of
>> > TASK_INTERRUPTIBLE state, let alone running.  As you pointed out earlier,
>> > he cannot be seeing the same bug that my crude patch suppresses, but
>> > given that I still see a few failures with that crude patch, it is quite
>> > possible that there is still a common bug.
>> 
>> With respect to bare metal vs KVM guest, I've reported an issue with
>> inaccurate detection of TSC as being an unreliable time source on a
>> KVM guest. The basic setup is to overcommit the CPU use across the
>> entire host, thus leading to preemption of the guest. The guest TSC
>> watchdog then falsely assumes that TSC is unreliable, because it gets
>> preempted for a long time (e.g. 0.5 second) between reading the HPET
>> and the TSC.
>> 
>> Ref. http://lkml.iu.edu/hypermail/linux/kernel/1509.1/00379.html
>> 
>> I'm wondering if what Paul is observing in the KVM setup might be
>> caused by long preemption by the host. One way to stress test this
>> is to run parallel kernel builds on the host (or in another guest)
>> while the guest is running, thus over-committing the CPU use.
>> 
>> Thoughts ?
> 
> If I run NO_HZ_FULL, I do get warnings about unstable timesources.
> 
> And certainly guest VCPUs can be preempted.  However, if they were
> preempted for the lengths of time I am seeing, I should also see
> softlockup warnings on the host, which I do not see.

Why would you see softlockup warning on the host ?

I expect the priority at which the kvm vcpu runs is much lower than
the priority of the rcu worker threads on the host. Therefore, you
might very well have long preemption delays for kvm vcpus while the
rcu worker threads run fine on the host kernel because they have
a higher priority.

Am I missing something ?

Thanks,

Mathieu

> 
> That said, perhaps I should cobble together something to force short
> repeated preemptions at the host level.  Maybe that would get the
> reproduction rate sufficiently high to enable less-dainty debugging.
> 
>   Thanx, Paul

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-28 Thread Paul E. McKenney
On Mon, Mar 28, 2016 at 03:07:36PM +, Mathieu Desnoyers wrote:
> - On Mar 28, 2016, at 9:29 AM, Paul E. McKenney 
> paul...@linux.vnet.ibm.com wrote:
> 
> > On Mon, Mar 28, 2016 at 08:28:51AM +0200, Peter Zijlstra wrote:
> >> On Sun, Mar 27, 2016 at 02:09:14PM -0700, Paul E. McKenney wrote:
> >> 
> >> > > Does that system have MONITOR/MWAIT errata?
> >> > 
> >> > On the off-chance that this question was also directed at me,
> >> 
> >> Hehe, it wasn't, however, since we're here..
> >> 
> >> > here is
> >> > what I am running on.  I am running in a qemu/KVM virtual machine, in
> >> > case that matters.
> >> 
> >> Have you actually tried on real proper hardware? Does it still reproduce
> >> there?
> > 
> > Ross has, but I have not, given that I have a shared system on the one
> > hand and a single-socket (four core, eight hardware thread) laptop on
> > the other that has even longer reproduction times.  The repeat-by is
> > as follows:
> > 
> > o   Build a kernel with the following Kconfigs:
> > 
> > CONFIG_SMP=y
> > CONFIG_NR_CPUS=16
> > CONFIG_PREEMPT_NONE=n
> > CONFIG_PREEMPT_VOLUNTARY=n
> > CONFIG_PREEMPT=y
> > # This should result in CONFIG_PREEMPT_RCU=y
> > CONFIG_HZ_PERIODIC=y
> > CONFIG_NO_HZ_IDLE=n
> > CONFIG_NO_HZ_FULL=n
> > CONFIG_RCU_TRACE=y
> > CONFIG_HOTPLUG_CPU=y
> > CONFIG_RCU_FANOUT=2
> > CONFIG_RCU_FANOUT_LEAF=2
> > CONFIG_RCU_NOCB_CPU=n
> > CONFIG_DEBUG_LOCK_ALLOC=n
> > CONFIG_RCU_BOOST=y
> > CONFIG_RCU_KTHREAD_PRIO=2
> > CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
> > CONFIG_RCU_EXPERT=y
> > CONFIG_RCU_TORTURE_TEST=y
> > CONFIG_PRINTK_TIME=y
> > CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP=y
> > CONFIG_RCU_TORTURE_TEST_SLOW_INIT=y
> > CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT=y
> > 
> > If desired, you can instead build with CONFIG_RCU_TORTURE_TEST=m
> > and modprobe/insmod the module manually.
> > 
> > o   Find a two-socket x86 system or larger, with at least 16 CPUs.
> > 
> > o   Boot the kernel with the following kernel boot parameters:
> > 
> > rcutorture.onoff_interval=1 rcutorture.onoff_holdoff=30
> > 
> > The onoff_holdoff is only needed for CONFIG_RCU_TORTURE_TEST=y.
> > When manually setting up the module, you get the holdoff for
> > free, courtesy of human timescales.
> > 
> > In the absence of instrumentation, I get failures usually within a
> > couple of hours, though sometimes much longer.  With instrumentation,
> > the sky appears to be the limit.  :-/
> > 
> > Ross is running on bare metal with no CPU hotplug, so perhaps his setup
> > is of more immediate interest.  He is seeing the same symptoms that I am,
> > namely a task being repeatedly awakened without actually coming out of
> > TASK_INTERRUPTIBLE state, let alone running.  As you pointed out earlier,
> > he cannot be seeing the same bug that my crude patch suppresses, but
> > given that I still see a few failures with that crude patch, it is quite
> > possible that there is still a common bug.
> 
> With respect to bare metal vs KVM guest, I've reported an issue with
> inaccurate detection of TSC as being an unreliable time source on a
> KVM guest. The basic setup is to overcommit the CPU use across the
> entire host, thus leading to preemption of the guest. The guest TSC
> watchdog then falsely assumes that TSC is unreliable, because it gets
> preempted for a long time (e.g. 0.5 second) between reading the HPET
> and the TSC.
> 
> Ref. http://lkml.iu.edu/hypermail/linux/kernel/1509.1/00379.html
> 
> I'm wondering if what Paul is observing in the KVM setup might be
> caused by long preemption by the host. One way to stress test this
> is to run parallel kernel builds on the host (or in another guest)
> while the guest is running, thus over-committing the CPU use.
> 
> Thoughts ?

If I run NO_HZ_FULL, I do get warnings about unstable timesources.

And certainly guest VCPUs can be preempted.  However, if they were
preempted for the lengths of time I am seeing, I should also see
softlockup warnings on the host, which I do not see.

That said, perhaps I should cobble together something to force short
repeated preemptions at the host level.  Maybe that would get the
reproduction rate sufficiently high to enable less-dainty debugging.
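
(Editorial aside, not from the original message: one crude way to force
short repeated preemptions at the host level is to run a spin-then-sleep
hog per host CPU, e.g. one instance per CPU under taskset, while the guest
runs.  The program below is only a sketch; the burn/rest durations are
arbitrary knobs, not values from the thread.)

/* Build with: gcc -O2 hog.c -o hog ; run as: ./hog [burn_ms] [rest_ms] */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static long elapsed_ms(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1000L +
	       (b->tv_nsec - a->tv_nsec) / 1000000L;
}

int main(int argc, char **argv)
{
	long burn = argc > 1 ? atol(argv[1]) : 5;	/* ms spent spinning */
	long rest = argc > 2 ? atol(argv[2]) : 5;	/* ms spent sleeping */
	struct timespec nap = { rest / 1000, (rest % 1000) * 1000000L };

	for (;;) {
		struct timespec start, now;

		clock_gettime(CLOCK_MONOTONIC, &start);
		do {					/* burn the CPU for 'burn' ms */
			clock_gettime(CLOCK_MONOTONIC, &now);
		} while (elapsed_ms(&start, &now) < burn);

		nanosleep(&nap, NULL);			/* then yield the CPU */
	}
	return 0;
}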

Thanx, Paul



Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-28 Thread Mathieu Desnoyers
- On Mar 28, 2016, at 9:29 AM, Paul E. McKenney paul...@linux.vnet.ibm.com 
wrote:

> On Mon, Mar 28, 2016 at 08:28:51AM +0200, Peter Zijlstra wrote:
>> On Sun, Mar 27, 2016 at 02:09:14PM -0700, Paul E. McKenney wrote:
>> 
>> > > Does that system have MONITOR/MWAIT errata?
>> > 
>> > On the off-chance that this question was also directed at me,
>> 
>> Hehe, it wasn't, however, since we're here..
>> 
>> > here is
>> > what I am running on.  I am running in a qemu/KVM virtual machine, in
>> > case that matters.
>> 
>> Have you actually tried on real proper hardware? Does it still reproduce
>> there?
> 
> Ross has, but I have not, given that I have a shared system on the one
> hand and a single-socket (four core, eight hardware thread) laptop on
> the other that has even longer reproduction times.  The repeat-by is
> as follows:
> 
> o Build a kernel with the following Kconfigs:
> 
>   CONFIG_SMP=y
>   CONFIG_NR_CPUS=16
>   CONFIG_PREEMPT_NONE=n
>   CONFIG_PREEMPT_VOLUNTARY=n
>   CONFIG_PREEMPT=y
>   # This should result in CONFIG_PREEMPT_RCU=y
>   CONFIG_HZ_PERIODIC=y
>   CONFIG_NO_HZ_IDLE=n
>   CONFIG_NO_HZ_FULL=n
>   CONFIG_RCU_TRACE=y
>   CONFIG_HOTPLUG_CPU=y
>   CONFIG_RCU_FANOUT=2
>   CONFIG_RCU_FANOUT_LEAF=2
>   CONFIG_RCU_NOCB_CPU=n
>   CONFIG_DEBUG_LOCK_ALLOC=n
>   CONFIG_RCU_BOOST=y
>   CONFIG_RCU_KTHREAD_PRIO=2
>   CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
>   CONFIG_RCU_EXPERT=y
>   CONFIG_RCU_TORTURE_TEST=y
>   CONFIG_PRINTK_TIME=y
>   CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP=y
>   CONFIG_RCU_TORTURE_TEST_SLOW_INIT=y
>   CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT=y
> 
>   If desired, you can instead build with CONFIG_RCU_TORTURE_TEST=m
>   and modprobe/insmod the module manually.
> 
> o Find a two-socket x86 system or larger, with at least 16 CPUs.
> 
> o Boot the kernel with the following kernel boot parameters:
> 
>   rcutorture.onoff_interval=1 rcutorture.onoff_holdoff=30
> 
>   The onoff_holdoff is only needed for CONFIG_RCU_TORTURE_TEST=y.
>   When manually setting up the module, you get the holdoff for
>   free, courtesy of human timescales.
> 
> In the absence of instrumentation, I get failures usually within a
> couple of hours, though sometimes much longer.  With instrumentation,
> the sky appears to be the limit.  :-/
> 
> Ross is running on bare metal with no CPU hotplug, so perhaps his setup
> is of more immediate interest.  He is seeing the same symptoms that I am,
> namely a task being repeatedly awakened without actually coming out of
> TASK_INTERRUPTIBLE state, let alone running.  As you pointed out earlier,
> he cannot be seeing the same bug that my crude patch suppresses, but
> given that I still see a few failures with that crude patch, it is quite
> possible that there is still a common bug.

With respect to bare metal vs KVM guest, I've reported an issue with
inaccurate detection of TSC as being an unreliable time source on a
KVM guest. The basic setup is to overcommit the CPU use across the
entire host, thus leading to preemption of the guest. The guest TSC
watchdog then falsely assumes that TSC is unreliable, because it gets
preempted for a long time (e.g. 0.5 second) between reading the HPET
and the TSC.

Ref. http://lkml.iu.edu/hypermail/linux/kernel/1509.1/00379.html

I'm wondering if what Paul is observing in the KVM setup might be
caused by long preemption by the host. One way to stress test this
is to run parallel kernel builds on the host (or in another guest)
while the guest is running, thus over-committing the CPU use.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-28 Thread Mathieu Desnoyers
- On Mar 28, 2016, at 2:13 AM, Peter Zijlstra pet...@infradead.org wrote:

> On Mon, Mar 28, 2016 at 02:23:45AM +0000, Mathieu Desnoyers wrote:
> 
>> >> But, you need hotplug for this to happen, right?
>> > 
>> > My understanding is that this seems to be detection of failures to be
>> > awakened for a long time on idle CPUs. It therefore seems to be more
>> > idle-related than cpu hotplug-related. I'm not saying that there is
>> > no issue with hotplug, just that the investigation so far seems to
>> > target mostly idle systems, AFAIK without stressing hotplug.
> 
> Paul has stated that without hotplug he cannot trigger this.
> 
>> > set_nr_if_polling() returns true if the ti->flags read has the
>> > _TIF_NEED_RESCHED bit set, which will skip the IPI.
> 
> POLLING_NR, as per your later comment
> 
>> > But it seems weird. The side that calls set_nr_if_polling()
>> > does the following:
>> > 1) llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)
>> > 2) set_nr_if_polling(rq->idle)
>> > 3) (don't do smp_send_reschedule(cpu) since set_nr_if_polling() returned
>> >   true)
>> > 
>> > The idle loop does:
>> > 1) __current_set_polling()
>> > 2) __current_clr_polling()
>> > 3) smp_mb__after_atomic()
>> > 4) sched_ttwu_pending()
>> > 5) schedule_preempt_disabled()
>> >   -> This will clear the TIF_NEED_RESCHED flag
>> > 
>> > While the idle loop is in sched_ttwu_pending(), after
>> > it has done the llist_del_all() (thus has grabbed all the
>> > list entries), TIF_NEED_RESCHED is still set.
> 
>> > If both list_all and
> 
> llist_add() ?

Yes, indeed.

> 
>> > set_nr_if_polling() are called right after the llist_del_all(), we
>> > will end up in a situation where we have an entry in the list, but
>> > there won't be any reschedule sent on the idle CPU until something
>> > else awakens it. On a _very_ idle CPU, this could take some time.
> 
> Can't happen, as per clearing of POLLING_NR before doing llist_del_all()
> and the latter being a full memory barrier.
> 
>> > set_nr_and_not_polling() don't seem to have the same issue, because
>> > it does not return true if TIF_NEED_RESCHED is observed as being
>> > already set: it really just depends on the state of the TIF_POLLING_NRFLAG
>> > bit.
>> > 
>> > Am I missing something important ?
>> 
>> Well, it seems that the test for _TIF_POLLING_NRFLAG in set_nr_if_polling()
>> just before the test for _TIF_NEED_RESCHED should take care of it: while in
>> sched_ttwu_pending within the idle loop, the TIF_POLLING_NRFLAG should be
>> cleared, thus causing set_nr_if_polling to return false.
> 
> Right, clue in the name: Set NEED_RESCHED _IF_ POLLING_NR (is set).
> 
>> I'm slightly concerned about the lack of smp_mb__after_atomic()
>> between the TIF_NEED_RESCHED flag being cleared within 
>> schedule_preempt_disabled
>> and the TIF_POLLING_NRFLAG being set in the following loop. Indeed, 
>> clear_bit()
>> does not have a compiler barrier,
> 
> Urgh, it really should, as all atomic ops. set_bit() very much has a
> memory clobber in, see below.

Yes, I'd be more comfortable with the memory clobber in the clear_bit
too, but theoretically it *should* not matter, because we have a clobber
in set_bit, and clear_bit has a +m memory operand.
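
For what it's worth, a hypothetical x86-64 side-by-side (not the kernel's
clear_bit()) of the point being discussed; the only difference is the "memory"
clobber that also makes the asm a compiler barrier:

/* Both variants tell the compiler that *addr changes via "+m"; only the
 * second also forbids caching/reordering of *other* memory across the asm.
 */
static inline void clear_bit_no_clobber(long nr, volatile unsigned long *addr)
{
        asm volatile("lock; btrq %1, %0"
                     : "+m" (*addr)
                     : "r" (nr));
}

static inline void clear_bit_with_clobber(long nr, volatile unsigned long *addr)
{
        asm volatile("lock; btrq %1, %0"
                     : "+m" (*addr)
                     : "r" (nr)
                     : "memory");       /* compiler barrier as well */
}

int main(void)
{
        volatile unsigned long flags = 0xffUL;

        clear_bit_no_clobber(3, &flags);        /* flags == 0xf7 */
        clear_bit_with_clobber(5, &flags);      /* flags == 0xd7 */
        return 0;
}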

> 
>> nor processor-level memory barriers
>> (of course, the processor memory barrier should not really matter on
>> x86-64 due to lock prefix).
> 
> Right.
> 
>> Moreover, TIF_NEED_RESCHED is bit 3 on x86-64,
>> whereas TIF_POLLING_NRFLAG is bit 21. Those are in two different bytes of
>> the thread flags, and thus set/cleared as different addresses by clear_bit()
>> acting on an immediate "nr" argument.
>> 
>> If we have any state where TIF_POLLING_NRFLAG is set before TIF_NEED_RESCHED
>> is cleared within the idle thread, we could end up missing a needed resched 
>> IPI.
> 
> Yes, that would be bad. No objection to adding smp_mb__before_atomic()
> before the initial __current_set_polling(). Although that's not going to
> make a difference for x86_64 as you already noted.

Yep.

> 
>> Another question: why are set_nr_if_polling and set_nr_and_not_polling two
>> different implementations ?
> 
> Because they're fundamentally two different things. The one
> conditionally sets NEED_RESCHED, the other unconditionally sets it.

Got it, makes sense.

Thanks!

Mathieu

> 
>> Could they be combined ?
> 
> Can, yes, will not be pretty nor clear code though.
> 
> 
> ---
> arch/x86/include/asm/bitops.h | 6 --
> 1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index 7766d1cf096e..5345784d5e41 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -112,11 +112,13 @@ clear_bit(long nr, volatile unsigned long *addr)
>   if (IS_IMMEDIATE(nr)) {
>   asm volatile(LOCK_PREFIX "andb %1,%0"
>   : CONST_MASK_ADDR(nr, addr)
> - : "iq" ((u8)~CONST_MASK(nr)));
> + : "iq" 

Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-28 Thread Paul E. McKenney
On Mon, Mar 28, 2016 at 08:13:51AM +0200, Peter Zijlstra wrote:
> On Mon, Mar 28, 2016 at 02:23:45AM +0000, Mathieu Desnoyers wrote:
> 
> > >> But, you need hotplug for this to happen, right?
> > > 
> > > My understanding is that this seems to be detection of failures to be
> > > awakened for a long time on idle CPUs. It therefore seems to be more
> > > idle-related than cpu hotplug-related. I'm not saying that there is
> > > no issue with hotplug, just that the investigation so far seems to
> > > target mostly idle systems, AFAIK without stressing hotplug.
> 
> Paul has stated that without hotplug he cannot trigger this.

Which means either that hotplug is absolutely necessary or that
hotplug increases the probability of failure.  Ross's experience is
without hotplug on a mostly idle system, so I am currently betting on
"increases the probability".  The set of bugs does seem to have gotten
worse somewhere between v4.1 and v4.2, but not sufficiently to allow
reasonable bisection (yes, I did try, several times).

> > > set_nr_if_polling() returns true if the ti->flags read has the
> > > _TIF_NEED_RESCHED bit set, which will skip the IPI.
> 
> POLLING_NR, as per your later comment
> 
> > > But it seems weird. The side that calls set_nr_if_polling()
> > > does the following:
> > > 1) llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)
> > > 2) set_nr_if_polling(rq->idle)
> > > 3) (don't do smp_send_reschedule(cpu) since set_nr_if_polling() returned
> > >   true)
> > > 
> > > The idle loop does:
> > > 1) __current_set_polling()
> > > 2) __current_clr_polling()
> > > 3) smp_mb__after_atomic()
> > > 4) sched_ttwu_pending()
> > > 5) schedule_preempt_disabled()
> > >   -> This will clear the TIF_NEED_RESCHED flag
> > > 
> > > While the idle loop is in sched_ttwu_pending(), after
> > > it has done the llist_del_all() (thus has grabbed all the
> > > list entries), TIF_NEED_RESCHED is still set.
> 
> > > If both list_all and
> 
> llist_add() ?
> 
> > > set_nr_if_polling() are called right after the llist_del_all(), we
> > > will end up in a situation where we have an entry in the list, but
> > > there won't be any reschedule sent on the idle CPU until something
> > > else awakens it. On a _very_ idle CPU, this could take some time.
> 
> Can't happen, as per clearing of POLLING_NR before doing llist_del_all()
> and the latter being a full memory barrier.
> 
> > > set_nr_and_not_polling() don't seem to have the same issue, because
> > > it does not return true if TIF_NEED_RESCHED is observed as being
> > > already set: it really just depends on the state of the TIF_POLLING_NRFLAG
> > > bit.
> > > 
> > > Am I missing something important ?
> > 
> > Well, it seems that the test for _TIF_POLLING_NRFLAG in set_nr_if_polling()
> > just before the test for _TIF_NEED_RESCHED should take care of it: while in
> > sched_ttwu_pending within the idle loop, the TIF_POLLING_NRFLAG should be
> > cleared, thus causing set_nr_if_polling to return false.
> 
> Right, clue in the name: Set NEED_RESCHED _IF_ POLLING_NR (is set).
> 
> > I'm slightly concerned about the lack of smp_mb__after_atomic()
> > between the TIF_NEED_RESCHED flag being cleared within 
> > schedule_preempt_disabled
> > and the TIF_POLLING_NRFLAG being set in the following loop. Indeed, 
> > clear_bit()
> > does not have a compiler barrier,
> 
> Urgh, it really should, as all atomic ops. set_bit() very much has a
> memory clobber in, see below.

And this is one of the changes in your patch that I am now testing, correct?
(Looks that way to me, but..)

Thanx, Paul

> > nor processor-level memory barriers
> > (of course, the processor memory barrier should not really matter on
> > x86-64 due to lock prefix).
> 
> Right.
> 
> > Moreover, TIF_NEED_RESCHED is bit 3 on x86-64,
> > whereas TIF_POLLING_NRFLAG is bit 21. Those are in two different bytes of
> > the thread flags, and thus set/cleared as different addresses by clear_bit()
> > acting on an immediate "nr" argument.
> > 
> > If we have any state where TIF_POLLING_NRFLAG is set before TIF_NEED_RESCHED
> > is cleared within the idle thread, we could end up missing a needed resched 
> > IPI.
> 
> Yes, that would be bad. No objection to adding smp_mb__before_atomic()
> before the initial __current_set_polling(). Although that's not going to
> make a difference for x86_64 as you already noted.
> 
> > Another question: why are set_nr_if_polling and set_nr_and_not_polling two
> > different implementations ?
> 
> Because they're fundamentally two different things. The one
> conditionally sets NEED_RESCHED, the other unconditionally sets it.
> 
> > Could they be combined ?
> 
> Can, yes, will not be pretty nor clear code though.
> 
> 
> ---
>  arch/x86/include/asm/bitops.h | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index 7766d1cf096e..5345784d5e41 100644
> --- 

Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-28 Thread Paul E. McKenney
On Mon, Mar 28, 2016 at 08:28:51AM +0200, Peter Zijlstra wrote:
> On Sun, Mar 27, 2016 at 02:09:14PM -0700, Paul E. McKenney wrote:
> 
> > > Does that system have MONITOR/MWAIT errata?
> > 
> > On the off-chance that this question was also directed at me,
> 
> Hehe, it wasn't, however, since we're here..
> 
> > here is
> > what I am running on.  I am running in a qemu/KVM virtual machine, in
> > case that matters.
> 
> Have you actually tried on real proper hardware? Does it still reproduce
> there?

Ross has, but I have not, given that I have a shared system on the one
hand and a single-socket (four core, eight hardware thread) laptop on
the other that has even longer reproduction times.  The repeat-by is
as follows:

o   Build a kernel with the following Kconfigs:

CONFIG_SMP=y
CONFIG_NR_CPUS=16
CONFIG_PREEMPT_NONE=n
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=y
# This should result in CONFIG_PREEMPT_RCU=y
CONFIG_HZ_PERIODIC=y
CONFIG_NO_HZ_IDLE=n
CONFIG_NO_HZ_FULL=n
CONFIG_RCU_TRACE=y
CONFIG_HOTPLUG_CPU=y
CONFIG_RCU_FANOUT=2
CONFIG_RCU_FANOUT_LEAF=2
CONFIG_RCU_NOCB_CPU=n
CONFIG_DEBUG_LOCK_ALLOC=n
CONFIG_RCU_BOOST=y
CONFIG_RCU_KTHREAD_PRIO=2
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y
CONFIG_RCU_TORTURE_TEST=y
CONFIG_PRINTK_TIME=y
CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP=y
CONFIG_RCU_TORTURE_TEST_SLOW_INIT=y
CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT=y

If desired, you can instead build with CONFIG_RCU_TORTURE_TEST=m
and modprobe/insmod the module manually.

o   Find a two-socket x86 system or larger, with at least 16 CPUs.

o   Boot the kernel with the following kernel boot parameters:

rcutorture.onoff_interval=1 rcutorture.onoff_holdoff=30

The onoff_holdoff is only needed for CONFIG_RCU_TORTURE_TEST=y.
When manually setting up the module, you get the holdoff for
free, courtesy of human timescales.

In the absence of instrumentation, I get failures usually within a
couple of hours, though sometimes much longer.  With instrumentation,
the sky appears to be the limit.  :-/

Ross is running on bare metal with no CPU hotplug, so perhaps his setup
is of more immediate interest.  He is seeing the same symptoms that I am,
namely a task being repeatedly awakened without actually coming out of
TASK_INTERRUPTIBLE state, let alone running.  As you pointed out earlier,
he cannot be seeing the same bug that my crude patch suppresses, but
given that I still see a few failures with that crude patch, it is quite
possible that there is still a common bug.

Thanx, Paul



Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-28 Thread Paul E. McKenney
On Mon, Mar 28, 2016 at 08:25:47AM +0200, Peter Zijlstra wrote:
> On Sun, Mar 27, 2016 at 02:06:41PM -0700, Paul E. McKenney wrote:
> 
> > > But, you need hotplug for this to happen, right?
> > 
> > I do, but Ross Green is seeing something that looks similar, and without
> > CPU hotplug.
> 
> Yes, but that's two differences so far, you need hotplug and he's on ARM
> (which doesn't have TIF_POLLING_NR).
> 
> So either we're all looking at the wrong thing or these really are two
> different issues.

Given that this failure has grown more probable over the past several
releases, it does seem quite likely that we have more than one bug.
Or maybe a few bugs and additional innocent-bystander commits that make
one or more of the bugs more probable.

> > > We should not be migrating towards, or waking on, CPUs no longer present
> > > in cpu_active_map, and there is a rcu/sched_sync() after clearing that
> > > bit. Furthermore, migration_call() does a sched_ttwu_pending() (waking
> > > any remaining stragglers) before we migrate all runnable tasks off the
> > > dying CPU.
> > 
> > OK, so I should instrument migration_call() if I get the repro rate up?
> 
> Can do, maybe try the below first. (yes I know how long it all takes :/)

OK, will run this today, then run calibration for last night's run this
evening.

Speaking of which, last night's run (disabling TIF_POLLING_NRFLAG)
consisted of 24 two-hour runs.  Six of them had hard hangs, and another
had a hang that eventually unhung of its own accord.  I believe that this
is significantly fewer failures than from a stock kernel, but I could
be wrong, and it will take some serious testing to give statistical
confidence for whatever conclusion is correct.

> > > The other interesting case would be resched_cpu(), which uses
> > > set_nr_and_not_polling() to kick a remote cpu to call schedule(). It
> > > atomically sets TIF_NEED_RESCHED and returns if TIF_POLLING_NRFLAG was
> > > not set. If indeed not, it will send an IPI.
> > > 
> > > This assumes the idle 'exit' path will do the same as the IPI does; and
> > > if you look at cpu_idle_loop() it does indeed do both
> > > preempt_fold_need_resched() and sched_ttwu_pending().
> > > 
> > > Note that one cannot rely on irq_enter()/irq_exit() being called for the
> > > scheduler IPI.
> > 
> > OK, thank you for the info!  Any specific debug actions?
> 
> Dunno, something like the below should bring visibility into the
> (lockless) wake_list thingy.
> 
> So these trace_printk()s should happen between trace_sched_waking() and
> trace_sched_wakeup() (I've not fully read the thread, but ISTR you had
> some traces with these here thingies on).
> 
> ---
>  arch/x86/include/asm/bitops.h | 6 --
>  kernel/sched/core.c   | 9 +
>  2 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index 7766d1cf096e..5345784d5e41 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -112,11 +112,13 @@ clear_bit(long nr, volatile unsigned long *addr)
>   if (IS_IMMEDIATE(nr)) {
>   asm volatile(LOCK_PREFIX "andb %1,%0"
>   : CONST_MASK_ADDR(nr, addr)
> - : "iq" ((u8)~CONST_MASK(nr)));
> + : "iq" ((u8)~CONST_MASK(nr))
> + : "memory");
>   } else {
>   asm volatile(LOCK_PREFIX "btr %1,%0"
>   : BITOP_ADDR(addr)
> - : "Ir" (nr));
> + : "Ir" (nr)
> + : "memory");
>   }
>  }

Is the above addition of "memory" strictly for the debug below, or is
it also a potential fix?

Starting it up regardless, but figured I should ask!

Thanx, Paul

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0b21e7a724e1..b446f73c530d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1669,6 +1669,7 @@ void sched_ttwu_pending(void)
>   while (llist) {
>   p = llist_entry(llist, struct task_struct, wake_entry);
>   llist = llist_next(llist);
> + trace_printk("waking %d\n", p->pid);
>   ttwu_do_activate(rq, p, 0);
>   }
> 
> @@ -1719,6 +1720,7 @@ static void ttwu_queue_remote(struct task_struct *p, 
> int cpu)
>   struct rq *rq = cpu_rq(cpu);
> 
>   if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
> + trace_printk("queued %d for waking on %d\n", p->pid, cpu);
>   if (!set_nr_if_polling(rq->idle))
>   smp_send_reschedule(cpu);
>   else
> @@ -5397,10 +5399,17 @@ migration_call(struct notifier_block *nfb, unsigned 
> long action, void *hcpu)
>   migrate_tasks(rq);
>   BUG_ON(rq->nr_running != 1); /* the migration thread */
>   raw_spin_unlock_irqrestore(&rq->lock, flags);
> +
> + /* really bad m'kay */
> + 

Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-28 Thread Peter Zijlstra
On Sun, Mar 27, 2016 at 02:09:14PM -0700, Paul E. McKenney wrote:

> > Does that system have MONITOR/MWAIT errata?
> 
> On the off-chance that this question was also directed at me,

Hehe, it wasn't, however, since we're here..

> here is
> what I am running on.  I am running in a qemu/KVM virtual machine, in
> case that matters.

Have you actually tried on real proper hardware? Does it still reproduce
there?


Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-28 Thread Peter Zijlstra
On Sun, Mar 27, 2016 at 02:06:41PM -0700, Paul E. McKenney wrote:

> > But, you need hotplug for this to happen, right?
> 
> I do, but Ross Green is seeing something that looks similar, and without
> CPU hotplug.

Yes, but that's two differences so far, you need hotplug and he's on ARM
(which doesn't have TIF_POLLING_NR).

So either we're all looking at the wrong thing or these really are two
different issues.

> > We should not be migrating towards, or waking on, CPUs no longer present
> > in cpu_active_map, and there is a rcu/sched_sync() after clearing that
> > bit. Furthermore, migration_call() does a sched_ttwu_pending() (waking
> > any remaining stragglers) before we migrate all runnable tasks off the
> > dying CPU.
> 
> OK, so I should instrument migration_call() if I get the repro rate up?

Can do, maybe try the below first. (yes I know how long it all takes :/)

> > The other interesting case would be resched_cpu(), which uses
> > set_nr_and_not_polling() to kick a remote cpu to call schedule(). It
> > atomically sets TIF_NEED_RESCHED and returns if TIF_POLLING_NRFLAG was
> > not set. If indeed not, it will send an IPI.
> > 
> > This assumes the idle 'exit' path will do the same as the IPI does; and
> > if you look at cpu_idle_loop() it does indeed do both
> > preempt_fold_need_resched() and sched_ttwu_pending().
> > 
> > Note that one cannot rely on irq_enter()/irq_exit() being called for the
> > scheduler IPI.
> 
> OK, thank you for the info!  Any specific debug actions?

Dunno, something like the below should bring visibility into the
(lockless) wake_list thingy.

So these trace_printk()s should happen between trace_sched_waking() and
trace_sched_wakeup() (I've not fully read the thread, but ISTR you had
some traces with these here thingies on).

---
 arch/x86/include/asm/bitops.h | 6 ++++--
 kernel/sched/core.c           | 9 +++++++++
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 7766d1cf096e..5345784d5e41 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -112,11 +112,13 @@ clear_bit(long nr, volatile unsigned long *addr)
if (IS_IMMEDIATE(nr)) {
asm volatile(LOCK_PREFIX "andb %1,%0"
: CONST_MASK_ADDR(nr, addr)
-   : "iq" ((u8)~CONST_MASK(nr)));
+   : "iq" ((u8)~CONST_MASK(nr))
+   : "memory");
} else {
asm volatile(LOCK_PREFIX "btr %1,%0"
: BITOP_ADDR(addr)
-   : "Ir" (nr));
+   : "Ir" (nr)
+   : "memory");
}
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0b21e7a724e1..b446f73c530d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1669,6 +1669,7 @@ void sched_ttwu_pending(void)
while (llist) {
p = llist_entry(llist, struct task_struct, wake_entry);
llist = llist_next(llist);
+   trace_printk("waking %d\n", p->pid);
ttwu_do_activate(rq, p, 0);
}
 
@@ -1719,6 +1720,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu)
struct rq *rq = cpu_rq(cpu);
 
 if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
+   trace_printk("queued %d for waking on %d\n", p->pid, cpu);
if (!set_nr_if_polling(rq->idle))
smp_send_reschedule(cpu);
else
@@ -5397,10 +5399,17 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
migrate_tasks(rq);
BUG_ON(rq->nr_running != 1); /* the migration thread */
 raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+   /* really bad m'kay */
+   WARN_ON(!llist_empty(&rq->wake_list));
+
break;
 
case CPU_DEAD:
calc_load_migrate(rq);
+
+   /* more bad */
+   WARN_ON(!llist_empty(&rq->wake_list));
break;
 #endif
}


Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-28 Thread Peter Zijlstra
On Mon, Mar 28, 2016 at 02:23:45AM +0000, Mathieu Desnoyers wrote:

> >> But, you need hotplug for this to happen, right?
> > 
> > My understanding is that this seems to be detection of failures to be
> > awakened for a long time on idle CPUs. It therefore seems to be more
> > idle-related than cpu hotplug-related. I'm not saying that there is
> > no issue with hotplug, just that the investigation so far seems to
> > target mostly idle systems, AFAIK without stressing hotplug.

Paul has stated that without hotplug he cannot trigger this.

> > set_nr_if_polling() returns true if the ti->flags read has the
> > _TIF_NEED_RESCHED bit set, which will skip the IPI.

POLLING_NR, as per your later comment

> > But it seems weird. The side that calls set_nr_if_polling()
> > does the following:
> > 1) llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)
> > 2) set_nr_if_polling(rq->idle)
> > 3) (don't do smp_send_reschedule(cpu) since set_nr_if_polling() returned
> >   true)
> > 
> > The idle loop does:
> > 1) __current_set_polling()
> > 2) __current_clr_polling()
> > 3) smp_mb__after_atomic()
> > 4) sched_ttwu_pending()
> > 5) schedule_preempt_disabled()
> >   -> This will clear the TIF_NEED_RESCHED flag
> > 
> > While the idle loop is in sched_ttwu_pending(), after
> > it has done the llist_del_all() (thus has grabbed all the
> > list entries), TIF_NEED_RESCHED is still set.

> > If both list_all and

llist_add() ?

> > set_nr_if_polling() are called right after the llist_del_all(), we
> > will end up in a situation where we have an entry in the list, but
> > there won't be any reschedule sent on the idle CPU until something
> > else awakens it. On a _very_ idle CPU, this could take some time.

Can't happen, as per clearing of POLLING_NR before doing llist_del_all()
and the latter being a full memory barrier.

> > set_nr_and_not_polling() don't seem to have the same issue, because
> > it does not return true if TIF_NEED_RESCHED is observed as being
> > already set: it really just depends on the state of the TIF_POLLING_NRFLAG
> > bit.
> > 
> > Am I missing something important ?
> 
> Well, it seems that the test for _TIF_POLLING_NRFLAG in set_nr_if_polling()
> just before the test for _TIF_NEED_RESCHED should take care of it: while in
> sched_ttwu_pending within the idle loop, the TIF_POLLING_NRFLAG should be
> cleared, thus causing set_nr_if_polling to return false.

Right, clue in the name: Set NEED_RESCHED _IF_ POLLING_NR (is set).
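
For anyone following along, the conditional kick looks roughly like this
(paraphrased from memory; see kernel/sched/core.c for the real thing):

/* Sketch: try to set TIF_NEED_RESCHED, but only while the idle task is
 * polling; the return value tells the caller whether the IPI can be skipped.
 */
static bool set_nr_if_polling(struct task_struct *p)
{
        struct thread_info *ti = task_thread_info(p);
        typeof(ti->flags) old, val = READ_ONCE(ti->flags);

        for (;;) {
                if (!(val & _TIF_POLLING_NRFLAG))
                        return false;   /* not polling: caller must send the IPI */
                if (val & _TIF_NEED_RESCHED)
                        return true;    /* already set: idle loop will notice it */
                old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
                if (old == val)
                        break;          /* set NEED_RESCHED while still polling */
                val = old;
        }
        return true;
}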

> I'm slightly concerned about the lack of smp_mb__after_atomic()
> between the TIF_NEED_RESCHED flag being cleared within 
> schedule_preempt_disabled
> and the TIF_POLLING_NRFLAG being set in the following loop. Indeed, 
> clear_bit()
> does not have a compiler barrier,

Urgh, it really should, as all atomic ops. set_bit() very much has a
memory clobber in, see below.

> nor processor-level memory barriers
> (of course, the processor memory barrier should not really matter on
> x86-64 due to lock prefix).

Right.

> Moreover, TIF_NEED_RESCHED is bit 3 on x86-64,
> whereas TIF_POLLING_NRFLAG is bit 21. Those are in two different bytes of
> the thread flags, and thus set/cleared as different addresses by clear_bit()
> acting on an immediate "nr" argument.
> 
> If we have any state where TIF_POLLING_NRFLAG is set before TIF_NEED_RESCHED
> is cleared within the idle thread, we could end up missing a needed resched 
> IPI.

Yes, that would be bad. No objection to adding smp_mb__before_atomic()
before the initial __current_set_polling(). Although that's not going to
make a difference for x86_64 as you already noted.

> Another question: why are set_nr_if_polling and set_nr_and_not_polling two
> different implementations ?

Because they're fundamentally two different things. The one
conditionally sets NEED_RESCHED, the other unconditionally sets it.
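
Again paraphrased rather than quoted, the unconditional variant is essentially
a one-liner:

/* Sketch: always set TIF_NEED_RESCHED; report whether the target was NOT
 * polling, i.e. whether resched_cpu() still needs to send the IPI.
 */
static bool set_nr_and_not_polling(struct task_struct *p)
{
        struct thread_info *ti = task_thread_info(p);

        return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
}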

> Could they be combined ?

Can, yes, will not be pretty nor clear code though.


---
 arch/x86/include/asm/bitops.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 7766d1cf096e..5345784d5e41 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -112,11 +112,13 @@ clear_bit(long nr, volatile unsigned long *addr)
if (IS_IMMEDIATE(nr)) {
asm volatile(LOCK_PREFIX "andb %1,%0"
: CONST_MASK_ADDR(nr, addr)
-   : "iq" ((u8)~CONST_MASK(nr)));
+   : "iq" ((u8)~CONST_MASK(nr))
+   : "memory");
} else {
asm volatile(LOCK_PREFIX "btr %1,%0"
: BITOP_ADDR(addr)
-   : "Ir" (nr));
+   : "Ir" (nr)
+   : "memory");
}
 }
 


Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-27 Thread Mathieu Desnoyers
- On Mar 27, 2016, at 9:44 PM, Mathieu Desnoyers 
mathieu.desnoy...@efficios.com wrote:

> - On Mar 27, 2016, at 4:45 PM, Peter Zijlstra pet...@infradead.org wrote:
> 
>> On Sun, Mar 27, 2016 at 08:40:18AM -0700, Paul E. McKenney wrote:
>>> Oh, and the patch I am running with is below.  I am running x86, and so
>>> some other architectures would of course need the corresponding patch
>>> on that architecture.
>> 
>>> -#define TIF_POLLING_NRFLAG 21  /* idle is polling for TIF_NEED_RESCHED 
>>> */
>>> +/* #define TIF_POLLING_NRFLAG  21   idle is polling for 
>>> TIF_NEED_RESCHED */
>> 
>> x86 is the only arch that really uses this heavily IIRC.
>> 
>> Most of the other archs need interrupts to wake up remote cores.
>> 
>> So what we try to do is avoid sending IPIs when the CPU is idle, for the
>> remote wakeup case we use set_nr_if_polling() which sets
>> TIF_NEED_RESCHED if TIF_POLLING_NRFLAG was set. If it wasn't, we'll send
>> the IPI. Otherwise we rely on the idle loop to do sched_ttwu_pending()
>> when it breaks out of loop due to TIF_NEED_RESCHED.
>> 
>> But, you need hotplug for this to happen, right?
> 
> My understanding is that this seems to be detection of failures to be
> awakened for a long time on idle CPUs. It therefore seems to be more
> idle-related than cpu hotplug-related. I'm not saying that there is
> no issue with hotplug, just that the investigation so far seems to
> target mostly idle systems, AFAIK without stressing hotplug.
> 
>> 
>> We should not be migrating towards, or waking on, CPUs no longer present
>> in cpu_active_map, and there is a rcu/sched_sync() after clearing that
>> bit. Furthermore, migration_call() does a sched_ttwu_pending() (waking
>> any remaining stragglers) before we migrate all runnable tasks off the
>> dying CPU.
>> 
>> 
>> 
>> The other interesting case would be resched_cpu(), which uses
>> set_nr_and_not_polling() to kick a remote cpu to call schedule(). It
>> atomically sets TIF_NEED_RESCHED and returns if TIF_POLLING_NRFLAG was
>> not set. If indeed not, it will send an IPI.
>> 
>> This assumes the idle 'exit' path will do the same as the IPI does; and
>> if you look at cpu_idle_loop() it does indeed do both
>> preempt_fold_need_resched() and sched_ttwu_pending().
>> 
>> Note that one cannot rely on irq_enter()/irq_exit() being called for the
>> scheduler IPI.
> 
> Looking at commit e3baac47f0e82c4be632f4f97215bb93bf16b342 :
> 
> set_nr_if_polling() returns true if the ti->flags read has the
> _TIF_NEED_RESCHED bit set, which will skip the IPI.
> 
> But it seems weird. The side that calls set_nr_if_polling()
> does the following:
> 1) llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)
> 2) set_nr_if_polling(rq->idle)
> 3) (don't do smp_send_reschedule(cpu) since set_nr_if_polling() returned
>   true)
> 
> The idle loop does:
> 1) __current_set_polling()
> 2) __current_clr_polling()
> 3) smp_mb__after_atomic()
> 4) sched_ttwu_pending()
> 5) schedule_preempt_disabled()
>   -> This will clear the TIF_NEED_RESCHED flag
> 
> While the idle loop is in sched_ttwu_pending(), after
> it has done the llist_del_all() (thus has grabbed all the
> list entries), TIF_NEED_RESCHED is still set. If both list_all and
> set_nr_if_polling() are called right after the llist_del_all(), we
> will end up in a situation where we have an entry in the list, but
> there won't be any reschedule sent on the idle CPU until something
> else awakens it. On a _very_ idle CPU, this could take some time.
> 
> set_nr_and_not_polling() doesn't seem to have the same issue, because
> it does not return true if TIF_NEED_RESCHED is observed as being
> already set: it really just depends on the state of the TIF_POLLING_NRFLAG
> bit.
> 
> Am I missing something important ?

Well, it seems that the test for _TIF_POLLING_NRFLAG in set_nr_if_polling()
just before the test for _TIF_NEED_RESCHED should take care of it: while in
sched_ttwu_pending within the idle loop, the TIF_POLLING_NRFLAG should be
cleared, thus causing set_nr_if_polling to return false.
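
For reference, the ordering in question looks roughly like this; this is a
condensed paraphrase of set_nr_if_polling() from kernel/sched/core.c of that
era, not the exact code, so details may differ between versions:

    static bool set_nr_if_polling(struct task_struct *p)
    {
            struct thread_info *ti = task_thread_info(p);
            typeof(ti->flags) old, val = READ_ONCE(ti->flags);

            for (;;) {
                    if (!(val & _TIF_POLLING_NRFLAG))
                            return false;   /* not polling: caller sends the IPI */
                    if (val & _TIF_NEED_RESCHED)
                            return true;    /* resched already pending: skip the IPI */
                    old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
                    if (old == val)
                            break;          /* NEED_RESCHED set while POLLING was observed */
                    val = old;
            }
            return true;
    }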

I'm slightly concerned about the lack of smp_mb__after_atomic()
between the TIF_NEED_RESCHED flag being cleared within schedule_preempt_disabled
and the TIF_POLLING_NRFLAG being set in the following loop. Indeed, clear_bit()
does not have a compiler barrier, nor processor-level memory barriers
(of course, the processor memory barrier should not really matter on
x86-64 due to lock prefix). Moreover, TIF_NEED_RESCHED is bit 3 on x86-64,
whereas TIF_POLLING_NRFLAG is bit 21. Those are in two different bytes of
the thread flags, and thus set/cleared at different addresses by clear_bit()
acting on an immediate "nr" argument.
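
For reference, the non-immediate leg of x86 clear_bit() is essentially the
inline asm below (a sketch paraphrased from arch/x86/include/asm/bitops.h);
the diff earlier in this thread adds the "memory" clobber that turns it into
a compiler barrier:

    static __always_inline void clear_bit(long nr, volatile unsigned long *addr)
    {
            /* constant-nr "andb" case elided for brevity */
            asm volatile(LOCK_PREFIX "btr %1,%0"
                         : BITOP_ADDR(addr)
                         : "Ir" (nr));      /* no "memory" clobber, no compiler barrier */
    }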

If we have any state where TIF_POLLING_NRFLAG is set before TIF_NEED_RESCHED
is cleared within the idle thread, we could end up missing a needed resched IPI.

Another question: why are set_nr_if_polling and set_nr_and_not_polling two
different implementations ? Could they be combined ?

Thanks,

Mathieu


> 
> Thanks,
> 

Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-27 Thread Mathieu Desnoyers
- On Mar 27, 2016, at 4:45 PM, Peter Zijlstra pet...@infradead.org wrote:

> On Sun, Mar 27, 2016 at 08:40:18AM -0700, Paul E. McKenney wrote:
>> Oh, and the patch I am running with is below.  I am running x86, and so
>> some other architectures would of course need the corresponding patch
>> on that architecture.
> 
>> -#define TIF_POLLING_NRFLAG  21  /* idle is polling for TIF_NEED_RESCHED 
>> */
>> +/* #define TIF_POLLING_NRFLAG   21   idle is polling for 
>> TIF_NEED_RESCHED */
> 
> x86 is the only arch that really uses this heavily IIRC.
> 
> Most of the other archs need interrupts to wake up remote cores.
> 
> So what we try to do is avoid sending IPIs when the CPU is idle, for the
> remote wakeup case we use set_nr_if_polling() which sets
> TIF_NEED_RESCHED if TIF_POLLING_NRFLAG was set. If it wasn't, we'll send
> the IPI. Otherwise we rely on the idle loop to do sched_ttwu_pending()
> when it breaks out of loop due to TIF_NEED_RESCHED.
> 
> But, you need hotplug for this to happen, right?

My understanding is that this seems to be detection of failures to be
awakened for a long time on idle CPUs. It therefore seems to be more
idle-related than cpu hotplug-related. I'm not saying that there is
no issue with hotplug, just that the investigation so far seems to
target mostly idle systems, AFAIK without stressing hotplug.

> 
> We should not be migrating towards, or waking on, CPUs no longer present
> in cpu_active_map, and there is a rcu/sched_sync() after clearing that
> bit. Furthermore, migration_call() does a sched_ttwu_pending() (waking
> any remaining stragglers) before we migrate all runnable tasks off the
> dying CPU.
> 
> 
> 
> The other interesting case would be resched_cpu(), which uses
> set_nr_and_not_polling() to kick a remote cpu to call schedule(). It
> atomically sets TIF_NEED_RESCHED and returns if TIF_POLLING_NRFLAG was
> not set. If indeed not, it will send an IPI.
> 
> This assumes the idle 'exit' path will do the same as the IPI does; and
> if you look at cpu_idle_loop() it does indeed do both
> preempt_fold_need_resched() and sched_ttwu_pending().
> 
> Note that one cannot rely on irq_enter()/irq_exit() being called for the
> scheduler IPI.

Looking at commit e3baac47f0e82c4be632f4f97215bb93bf16b342 :

set_nr_if_polling() returns true if the ti->flags read has the
_TIF_NEED_RESCHED bit set, which will skip the IPI.

But it seems weird. The side that calls set_nr_if_polling()
does the following:
1) llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)
2) set_nr_if_polling(rq->idle)
3) (don't do smp_send_reschedule(cpu) since set_nr_if_polling() returned
   true)

The idle loop does:
1) __current_set_polling()
2) __current_clr_polling()
3) smp_mb__after_atomic()
4) sched_ttwu_pending()
5) schedule_preempt_disabled()
   -> This will clear the TIF_NEED_RESCHED flag

While the idle loop is in sched_ttwu_pending(), after
it has done the llist_del_all() (thus has grabbed all the
list entries), TIF_NEED_RESCHED is still set. If both llist_add() and
set_nr_if_polling() are called right after the llist_del_all(), we
will end up in a situation where we have an entry in the list, but
there won't be any reschedule sent on the idle CPU until something
else awakens it. On a _very_ idle CPU, this could take some time.
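
For context, the caller-side sequence above corresponds roughly to
ttwu_queue_remote(); this is a paraphrase of kernel/sched/core.c from that
era, so details may differ between versions:

    static void ttwu_queue_remote(struct task_struct *p, int cpu)
    {
            struct rq *rq = cpu_rq(cpu);

            if (llist_add(&p->wake_entry, &rq->wake_list)) {
                    if (!set_nr_if_polling(rq->idle))
                            smp_send_reschedule(cpu);  /* IPI only if idle was not polling */
                    else
                            trace_sched_wake_idle_without_ipi(cpu);
            }
    }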

set_nr_and_not_polling() doesn't seem to have the same issue, because
it does not return true if TIF_NEED_RESCHED is observed as being
already set: it really just depends on the state of the TIF_POLLING_NRFLAG
bit.

Am I missing something important ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-27 Thread Paul E. McKenney
On Sun, Mar 27, 2016 at 10:54:39PM +0200, Peter Zijlstra wrote:
> On Mon, Mar 21, 2016 at 09:22:30AM -0700, Jacob Pan wrote:
> > > > We're seeing a similar stall (~60 seconds) on an x86 development
> > > > system here.  Any luck tracking down the cause of this?  If not, any
> > > > suggestions for traces that might be helpful?
> 
> > +Reinette, she has the system that can reproduce the issue. I
> > believe she is having some other problems with it at the moment. But
> > the .config should be available. Version is v4.5.
> 
> Does that system have MONITOR/MWAIT errata?

On the off-chance that this question was also directed at me, here is
what I am running on.  I am running in a qemu/KVM virtual machine, in
case that matters.

Thanx, Paul

processor   : 63
vendor_id   : GenuineIntel
cpu family  : 6
model   : 47
model name  : Intel(R) Xeon(R) CPU E7- 4820  @ 2.00GHz
stepping: 2
microcode   : 0x37
cpu MHz : 1064.000
cache size  : 18432 KB
physical id : 3
siblings: 16
core id : 25
cpu cores   : 8
apicid  : 243
initial apicid  : 243
fpu : yes
fpu_exception   : yes
cpuid level : 11
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb 
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology 
nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 
ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt aes lahf_lm ida arat 
epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips: 3990.01
clflush size: 64
cache_alignment : 64
address sizes   : 44 bits physical, 48 bits virtual
power management:



Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-27 Thread Paul E. McKenney
On Sun, Mar 27, 2016 at 10:53:18PM +0200, Peter Zijlstra wrote:
> On Sun, Mar 27, 2016 at 08:40:18AM -0700, Paul E. McKenney wrote:
> > Oh, and the patch I am running with is below.  I am running x86, and so
> > some other architectures would of course need the corresponding patch
> > on that architecture.
> 
> > -#define TIF_POLLING_NRFLAG 21  /* idle is polling for TIF_NEED_RESCHED 
> > */
> 
> Also note that ARM (v7) which Ross is running doesn't have this to begin
> with.

He might well be seeing some other bug, then.  Reinette might instead be
seeing time-synchronization issues; perhaps that is also Ross's problem.

Or maybe there is more than one bug.  ;-)

Thanx, Paul



Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-27 Thread Paul E. McKenney
On Sun, Mar 27, 2016 at 10:45:59PM +0200, Peter Zijlstra wrote:
> On Sun, Mar 27, 2016 at 08:40:18AM -0700, Paul E. McKenney wrote:
> > Oh, and the patch I am running with is below.  I am running x86, and so
> > some other architectures would of course need the corresponding patch
> > on that architecture.
> 
> > -#define TIF_POLLING_NRFLAG 21  /* idle is polling for TIF_NEED_RESCHED 
> > */
> > +/* #define TIF_POLLING_NRFLAG  21   idle is polling for 
> > TIF_NEED_RESCHED */
> 
> x86 is the only arch that really uses this heavily IIRC.
> 
> Most of the other archs need interrupts to wake up remote cores.
> 
> So what we try to do is avoid sending IPIs when the CPU is idle, for the
> remote wakeup case we use set_nr_if_polling() which sets
> TIF_NEED_RESCHED if TIF_POLLING_NRFLAG was set. If it wasn't, we'll send
> the IPI. Otherwise we rely on the idle loop to do sched_ttwu_pending()
> when it breaks out of loop due to TIF_NEED_RESCHED.
> 
> But, you need hotplug for this to happen, right?

I do, but Ross Green is seeing something that looks similar, and without
CPU hotplug.

> We should not be migrating towards, or waking on, CPUs no longer present
> in cpu_active_map, and there is a rcu/sched_sync() after clearing that
> bit. Furthermore, migration_call() does a sched_ttwu_pending() (waking
> any remaining stragglers) before we migrate all runnable tasks off the
> dying CPU.

OK, so I should instrument migration_call() if I get the repro rate up?

> The other interesting case would be resched_cpu(), which uses
> set_nr_and_not_polling() to kick a remote cpu to call schedule(). It
> atomically sets TIF_NEED_RESCHED and returns if TIF_POLLING_NRFLAG was
> not set. If indeed not, it will send an IPI.
> 
> This assumes the idle 'exit' path will do the same as the IPI does; and
> if you look at cpu_idle_loop() it does indeed do both
> preempt_fold_need_resched() and sched_ttwu_pending().
> 
> Note that one cannot rely on irq_enter()/irq_exit() being called for the
> scheduler IPI.

OK, thank you for the info!  Any specific debug actions?

Thanx, Paul



Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-27 Thread Peter Zijlstra
On Mon, Mar 21, 2016 at 09:22:30AM -0700, Jacob Pan wrote:
> > > We're seeing a similar stall (~60 seconds) on an x86 development
> > > system here.  Any luck tracking down the cause of this?  If not, any
> > > suggestions for traces that might be helpful?

> +Reinette, she has the system that can reproduce the issue. I
> believe she is having some other problems with it at the moment. But
> the .config should be available. Version is v4.5.

Does that system have MONITOR/MWAIT errata?


Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-27 Thread Peter Zijlstra
On Sun, Mar 27, 2016 at 08:40:18AM -0700, Paul E. McKenney wrote:
> Oh, and the patch I am running with is below.  I am running x86, and so
> some other architectures would of course need the corresponding patch
> on that architecture.

> -#define TIF_POLLING_NRFLAG   21  /* idle is polling for TIF_NEED_RESCHED 
> */

Also note that ARM (v7) which Ross is running doesn't have this to begin
with.


Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-27 Thread Peter Zijlstra
On Sun, Mar 27, 2016 at 08:40:18AM -0700, Paul E. McKenney wrote:
> Oh, and the patch I am running with is below.  I am running x86, and so
> some other architectures would of course need the corresponding patch
> on that architecture.

> -#define TIF_POLLING_NRFLAG   21  /* idle is polling for TIF_NEED_RESCHED 
> */
> +/* #define TIF_POLLING_NRFLAG21   idle is polling for 
> TIF_NEED_RESCHED */

x86 is the only arch that really uses this heavily IIRC.

Most of the other archs need interrupts to wake up remote cores.

So what we try to do is avoid sending IPIs when the CPU is idle, for the
remote wakeup case we use set_nr_if_polling() which sets
TIF_NEED_RESCHED if TIF_POLLING_NRFLAG was set. If it wasn't, we'll send
the IPI. Otherwise we rely on the idle loop to do sched_ttwu_pending()
when it breaks out of loop due to TIF_NEED_RESCHED.

But, you need hotplug for this to happen, right?

We should not be migrating towards, or waking on, CPUs no longer present
in cpu_active_map, and there is a rcu/sched_sync() after clearing that
bit. Furthermore, migration_call() does a sched_ttwu_pending() (waking
any remaining stragglers) before we migrate all runnable tasks off the
dying CPU.



The other interesting case would be resched_cpu(), which uses
set_nr_and_not_polling() to kick a remote cpu to call schedule(). It
atomically sets TIF_NEED_RESCHED and returns if TIF_POLLING_NRFLAG was
not set. If indeed not, it will send an IPI.
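
For reference, that helper is roughly the following (paraphrased from
kernel/sched/core.c; details vary by version):

    static bool set_nr_and_not_polling(struct task_struct *p)
    {
            struct thread_info *ti = task_thread_info(p);

            /* true (send the IPI) iff TIF_POLLING_NRFLAG was not set */
            return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
    }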

This assumes the idle 'exit' path will do the same as the IPI does; and
if you look at cpu_idle_loop() it does indeed do both
preempt_fold_need_resched() and sched_ttwu_pending().
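
Condensed, the tail of cpu_idle_loop() does roughly the following (a sketch
paraphrased from kernel/sched/idle.c; tick/nohz handling and the
preempt_fold_need_resched() step mentioned above are omitted here):

    while (1) {
            __current_set_polling();

            while (!need_resched())
                    cpuidle_idle_call();        /* or cpu_idle_poll() */

            __current_clr_polling();
            smp_mb__after_atomic();     /* order the polling-flag clear against what follows */

            sched_ttwu_pending();       /* flush wakeups queued while idle */
            schedule_preempt_disabled();        /* also clears TIF_NEED_RESCHED */
    }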

Note that one cannot rely on irq_enter()/irq_exit() being called for the
scheduler IPI.



Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-27 Thread Paul E. McKenney
On Sun, Mar 27, 2016 at 08:40:18AM -0700, Paul E. McKenney wrote:
> On Sun, Mar 27, 2016 at 01:48:55PM +, Mathieu Desnoyers wrote:
> > - On Mar 26, 2016, at 9:34 PM, Paul E. McKenney 
> > paul...@linux.vnet.ibm.com wrote:
> > > On Sat, Mar 26, 2016 at 10:22:57PM +, Mathieu Desnoyers wrote:
> > >> - On Mar 26, 2016, at 2:49 PM, Paul E. McKenney 
> > >> paul...@linux.vnet.ibm.com
> > >> wrote:
> > >> > On Sat, Mar 26, 2016 at 08:28:16AM -0700, Paul E. McKenney wrote:
> > >> >> On Sat, Mar 26, 2016 at 12:29:31PM +, Mathieu Desnoyers wrote:

[ . . . ]

> > >> >> > Perhaps we could try with those commits reverted ?
> > >> >> > 
> > >> >> > commit e3baac47f0e82c4be632f4f97215bb93bf16b342
> > >> >> > Author: Peter Zijlstra 
> > >> >> > Date:   Wed Jun 4 10:31:18 2014 -0700
> > >> >> > 
> > >> >> > sched/idle: Optimize try-to-wake-up IPI
> > >> >> > 
> > >> >> > commit fd99f91aa007ba255aac44fe6cf21c1db398243a
> > >> >> > Author: Peter Zijlstra 
> > >> >> > Date:   Wed Apr 9 15:35:08 2014 +0200
> > >> >> > 
> > >> >> > sched/idle: Avoid spurious wakeup IPIs
> > >> >> > 
> > >> >> > They appeared in 3.16.
> > >> >> 
> > >> >> At this point, I am up for trying pretty much anything.  ;-)
> > >> >> 
> > >> >> Will give it a go.
> > >> > 
> > >> > And those certainly don't revert cleanly!  Would patching the kernel
> > >> > to remove the definition of TIF_POLLING_NRFLAG be useful?  Or, more
> > >> > to the point, is there some other course of action that would be more
> > >> > useful?  At this point, the test times are measured in weeks...
> > >> 
> > >> Indeed, patching the kernel to remove the TIF_POLLING_NRFLAG
> > >> definition would have an effect similar to reverting those two
> > >> commits.
> > >> 
> > >> Since testing takes a while, we could take a more aggressive
> > >> approach towards reproducing a possible race condition: we
> > >> could re-implement the _TIF_POLLING_NRFLAG vs _TIF_NEED_RESCHED
> > >> dance, along with the ttwu pending lock-list queue, within
> > >> a dummy test module, with custom data structures, and
> > >> stress-test the invariants. We could also create a Promela
> > >> model of these ipi-skip optimisations trying to validate
> > >> progress: whenever a wakeup is requested, there should
> > >> always be a scheduling performed, even if no further wakeup
> > >> is encountered.
> > >> 
> > >> Each of the two approaches proposed above might be a significant
> > >> endeavor, and would only validate my specific hunch. So it might
> > >> be a good idea to just let a test run for a few weeks with
> > >> TIF_POLLING_NRFLAG disabled meanwhile.
> > > 
> > > This makes a lot of sense.  I did some short runs, and nothing broke
> > > too badly.  However, I left some diagnostic stuff in that obscured
> > > the outcome.  I disabled the diagnostic stuff and am running overnight.
> > > I might need to go further and revert some of my diagnostic patches,
> > > but let's see where it is in the morning.
> > 
> > Here is another idea that might help us reproduce this issue faster.
> > If you can afford it, you might want to just throw more similar hardware
> > at the problem. Assuming the problem shows up randomly, but its odds
> > of showing up make it happen only once per week, if we have 100 machines
> > idling in the same way in parallel, we should be able to reproduce it
> > within about 1-2 hours.
> > 
> > Of course, if the problem really needs each machine to "degrade" for
> > a week (e.g. memory fragmentation), that would not help. It's only for
> > races that appear to be showing up randomly.
> 
> Certain rcutorture tests sometimes hit it within an hour (TREE03).
> Last night's TREE03 ran six hours without incident, which is unusual
> given that I didn't enable any tracepoints, but does not give any significant
> level of statistical confidence.  The set will finish in a few hours,
> at which point I will start parallel batches of TREE03 to see what
> comes up.
> 
> Feel free to take a look at kernel/rcu/waketorture.c for my (feeble
> thus far) attempt to speed things up.  I am thinking that I need to
> push sleeping tasks onto idle CPUs to make it happen more often.
> My current approach to this is to run with CPU utilizations of about
> 40% and using hrtimer with a prime number of microseconds to avoid
> synchronization.  That should in theory get me a 40% chance of hitting
> an idle CPU with a wakeup, and a reasonable chance of racing with a
> CPU-hotplug operation.  But maybe the wakeup needs to be remote or
> some such, in which case waketorture also needs to move stuff around.
> 
> Oh, and the patch I am running with is below.  I am running x86, and so
> some other architectures would of course need the corresponding patch
> on that architecture.

And it passed a full set of six-hour runs.  Unusual of late, but not
unheard of.  Next step is to focus on TREE03 overnight.

Thanx, Paul



Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-27 Thread Paul E. McKenney
On Sun, Mar 27, 2016 at 01:48:55PM +, Mathieu Desnoyers wrote:
> - On Mar 26, 2016, at 9:34 PM, Paul E. McKenney 
> paul...@linux.vnet.ibm.com wrote:
> 
> > On Sat, Mar 26, 2016 at 10:22:57PM +, Mathieu Desnoyers wrote:
> >> - On Mar 26, 2016, at 2:49 PM, Paul E. McKenney 
> >> paul...@linux.vnet.ibm.com
> >> wrote:
> >> 
> >> > On Sat, Mar 26, 2016 at 08:28:16AM -0700, Paul E. McKenney wrote:
> >> >> On Sat, Mar 26, 2016 at 12:29:31PM +, Mathieu Desnoyers wrote:
> >> >> > - On Mar 25, 2016, at 5:46 PM, Paul E. McKenney 
> >> >> > paul...@linux.vnet.ibm.com
> >> >> > wrote:
> >> >> > 
> >> >> > > On Fri, Mar 25, 2016 at 09:24:14PM +, Chatre, Reinette wrote:
> >> >> > >> Hi  Paul,
> >> >> > >> 
> >> >> > >> On 2016-03-23, Paul E. McKenney wrote:
> >> >> > >> > Please boot with the following parameters:
> >> >> > >> > 
> >> >> > >> >  rcu_tree.rcu_kick_kthreads ftrace
> >> >> > >> > trace_event=sched_waking,sched_wakeup,sched_wake_idle_without_ipi
> >> >> > >> 
> >> >> > >> With these parameters I expected more details to show up in the 
> >> >> > >> kernel logs but
> >> >> > >> cannot find any. Even so, today I left the machine running again 
> >> >> > >> and when this
> >> >> > >> happened I think I was able to capture the trace data for the 
> >> >> > >> event. Please
> >> >> > >> find attached the trace information for the kernel message below. 
> >> >> > >> Since the
> >> >> > >> complete trace file is very big I trimmed it to show the time 
> >> >> > >> around this event
> >> >> > >> - hopefully this will contain the information you need. I would 
> >> >> > >> also like to
> >> >> > >> provide some additional information. The system on which I see 
> >> >> > >> these events had
> >> >> > >> a time that was _very_ wrong. I noticed that this issue occurs when
> >> >> > >> system-timesynd was one of the tasks calling the functions of 
> >> >> > >> interest to your
> >> >> > >> tracing and am wondering if a very out of sync time in process of 
> >> >> > >> being
> >> >> > >> corrected could be the cause of this issue? As an experiment I 
> >> >> > >> ensured the
> >> >> > >> system time was accurate before leaving the system idle overnight 
> >> >> > >> and I did not
> >> >> > >> see the issue the next morning.
> >> >> > > 
> >> >> > > Ah!  Yes, a sudden jump in time or a disagreement about the time 
> >> >> > > among
> >> >> > > different components of the system can definitely cause these 
> >> >> > > symptoms.
> >> >> > > We have sometimes seen these problems occur when a pair of CPUs have
> >> >> > > wildly different ideas about what time it is, for example.  Please 
> >> >> > > let
> >> >> > > me know how it goes.
> >> >> > > 
> >> >> > > Also, in your trace, there are no sched_waking events for the 
> >> >> > > rcu_preempt
> >> >> > > process that are not immediately followed by sched_wakeup, so your 
> >> >> > > trace
> >> >> > > isn't showing the problem that I am seeing.
> >> >> > 
> >> >> > This is interesting.
> >> >> > 
> >> >> > Perhaps we could try with those commits reverted ?
> >> >> > 
> >> >> > commit e3baac47f0e82c4be632f4f97215bb93bf16b342
> >> >> > Author: Peter Zijlstra 
> >> >> > Date:   Wed Jun 4 10:31:18 2014 -0700
> >> >> > 
> >> >> > sched/idle: Optimize try-to-wake-up IPI
> >> >> > 
> >> >> > commit fd99f91aa007ba255aac44fe6cf21c1db398243a
> >> >> > Author: Peter Zijlstra 
> >> >> > Date:   Wed Apr 9 15:35:08 2014 +0200
> >> >> > 
> >> >> > sched/idle: Avoid spurious wakeup IPIs
> >> >> > 
> >> >> > They appeared in 3.16.
> >> >> 
> >> >> At this point, I am up for trying pretty much anything.  ;-)
> >> >> 
> >> >> Will give it a go.
> >> > 
> >> > And those certainly don't revert cleanly!  Would patching the kernel
> >> > to remove the definition of TIF_POLLING_NRFLAG be useful?  Or, more
> >> > to the point, is there some other course of action that would be more
> >> > useful?  At this point, the test times are measured in weeks...
> >> 
> >> Indeed, patching the kernel to remove the TIF_POLLING_NRFLAG
> >> definition would have an effect similar to reverting those two
> >> commits.
> >> 
> >> Since testing takes a while, we could take a more aggressive
> >> approach towards reproducing a possible race condition: we
> >> could re-implement the _TIF_POLLING_NRFLAG vs _TIF_NEED_RESCHED
> >> dance, along with the ttwu pending lock-list queue, within
> >> a dummy test module, with custom data structures, and
> >> stress-test the invariants. We could also create a Promela
> >> model of these ipi-skip optimisations trying to validate
> >> progress: whenever a wakeup is requested, there should
> >> always be a scheduling performed, even if no further wakeup
> >> is encountered.
> >> 
> >> Each of the two approaches proposed above might be a significant
> >> endeavor, and would only validate my specific hunch. So it might
> >> be a good idea to just let a test run for a few weeks with
> >> TIF_POLLING_NRFLAG disabled meanwhile.
> > 
> > This makes 

Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-26 Thread Paul E. McKenney
On Sat, Mar 26, 2016 at 10:22:57PM +, Mathieu Desnoyers wrote:
> - On Mar 26, 2016, at 2:49 PM, Paul E. McKenney 
> paul...@linux.vnet.ibm.com wrote:
> 
> > On Sat, Mar 26, 2016 at 08:28:16AM -0700, Paul E. McKenney wrote:
> >> On Sat, Mar 26, 2016 at 12:29:31PM +, Mathieu Desnoyers wrote:
> >> > - On Mar 25, 2016, at 5:46 PM, Paul E. McKenney 
> >> > paul...@linux.vnet.ibm.com
> >> > wrote:
> >> > 
> >> > > On Fri, Mar 25, 2016 at 09:24:14PM +, Chatre, Reinette wrote:
> >> > >> Hi  Paul,
> >> > >> 
> >> > >> On 2016-03-23, Paul E. McKenney wrote:
> >> > >> > Please boot with the following parameters:
> >> > >> > 
> >> > >> > rcu_tree.rcu_kick_kthreads ftrace
> >> > >> > trace_event=sched_waking,sched_wakeup,sched_wake_idle_without_ipi
> >> > >> 
> >> > >> With these parameters I expected more details to show up in the 
> >> > >> kernel logs but
> >> > >> cannot find any. Even so, today I left the machine running again and 
> >> > >> when this
> >> > >> happened I think I was able to capture the trace data for the event. 
> >> > >> Please
> >> > >> find attached the trace information for the kernel message below. 
> >> > >> Since the
> >> > >> complete trace file is very big I trimmed it to show the time around 
> >> > >> this event
> >> > >> - hopefully this will contain the information you need. I would also 
> >> > >> like to
> >> > >> provide some additional information. The system on which I see these 
> >> > >> events had
> >> > >> a time that was _very_ wrong. I noticed that this issue occurs when
> >> > >> system-timesynd was one of the tasks calling the functions of 
> >> > >> interest to your
> >> > >> tracing and am wondering if a very out of sync time in process of 
> >> > >> being
> >> > >> corrected could be the cause of this issue? As an experiment I 
> >> > >> ensured the
> >> > >> system time was accurate before leaving the system idle overnight and 
> >> > >> I did not
> >> > >> see the issue the next morning.
> >> > > 
> >> > > Ah!  Yes, a sudden jump in time or a disagreement about the time among
> >> > > different components of the system can definitely cause these symptoms.
> >> > > We have sometimes seen these problems occur when a pair of CPUs have
> >> > > wildly different ideas about what time it is, for example.  Please let
> >> > > me know how it goes.
> >> > > 
> >> > > Also, in your trace, there are no sched_waking events for the 
> >> > > rcu_preempt
> >> > > process that are not immediately followed by sched_wakeup, so your 
> >> > > trace
> >> > > isn't showing the problem that I am seeing.
> >> > 
> >> > This is interesting.
> >> > 
> >> > Perhaps we could try with those commits reverted ?
> >> > 
> >> > commit e3baac47f0e82c4be632f4f97215bb93bf16b342
> >> > Author: Peter Zijlstra 
> >> > Date:   Wed Jun 4 10:31:18 2014 -0700
> >> > 
> >> > sched/idle: Optimize try-to-wake-up IPI
> >> > 
> >> > commit fd99f91aa007ba255aac44fe6cf21c1db398243a
> >> > Author: Peter Zijlstra 
> >> > Date:   Wed Apr 9 15:35:08 2014 +0200
> >> > 
> >> > sched/idle: Avoid spurious wakeup IPIs
> >> > 
> >> > They appeared in 3.16.
> >> 
> >> At this point, I am up for trying pretty much anything.  ;-)
> >> 
> >> Will give it a go.
> > 
> > And those certainly don't revert cleanly!  Would patching the kernel
> > to remove the definition of TIF_POLLING_NRFLAG be useful?  Or, more
> > to the point, is there some other course of action that would be more
> > useful?  At this point, the test times are measured in weeks...
> 
> Indeed, patching the kernel to remove the TIF_POLLING_NRFLAG
> definition would have an effect similar to reverting those two
> commits.
> 
> Since testing takes a while, we could take a more aggressive
> approach towards reproducing a possible race condition: we
> could re-implement the _TIF_POLLING_NRFLAG vs _TIF_NEED_RESCHED
> dance, along with the ttwu pending lock-list queue, within
> a dummy test module, with custom data structures, and
> stress-test the invariants. We could also create a Promela
> model of these ipi-skip optimisations trying to validate
> progress: whenever a wakeup is requested, there should
> always be a scheduling performed, even if no further wakeup
> is encountered.
> 
> Each of the two approaches proposed above might be a significant
> endeavor, and would only validate my specific hunch. So it might
> be a good idea to just let a test run for a few weeks with
> TIF_POLLING_NRFLAG disabled meanwhile.

This makes a lot of sense.  I did some short runs, and nothing broke
too badly.  However, I left some diagnostic stuff in that obscured
the outcome.  I disabled the diagnostic stuff and am running overnight.
I might need to go further and revert some of my diagnostic patches,
but let's see where it is in the morning.

Thanx, Paul



Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-26 Thread Mathieu Desnoyers
- On Mar 26, 2016, at 2:49 PM, Paul E. McKenney paul...@linux.vnet.ibm.com 
wrote:

> On Sat, Mar 26, 2016 at 08:28:16AM -0700, Paul E. McKenney wrote:
>> On Sat, Mar 26, 2016 at 12:29:31PM +, Mathieu Desnoyers wrote:
>> > - On Mar 25, 2016, at 5:46 PM, Paul E. McKenney 
>> > paul...@linux.vnet.ibm.com
>> > wrote:
>> > 
>> > > On Fri, Mar 25, 2016 at 09:24:14PM +, Chatre, Reinette wrote:
>> > >> Hi  Paul,
>> > >> 
>> > >> On 2016-03-23, Paul E. McKenney wrote:
>> > >> > Please boot with the following parameters:
>> > >> > 
>> > >> >   rcu_tree.rcu_kick_kthreads ftrace
>> > >> > trace_event=sched_waking,sched_wakeup,sched_wake_idle_without_ipi
>> > >> 
>> > >> With these parameters I expected more details to show up in the kernel 
>> > >> logs but
>> > >> cannot find any. Even so, today I left the machine running again and 
>> > >> when this
>> > >> happened I think I was able to capture the trace data for the event. 
>> > >> Please
>> > >> find attached the trace information for the kernel message below. Since 
>> > >> the
>> > >> complete trace file is very big I trimmed it to show the time around 
>> > >> this event
>> > >> - hopefully this will contain the information you need. I would also 
>> > >> like to
>> > >> provide some additional information. The system on which I see these 
>> > >> events had
>> > >> a time that was _very_ wrong. I noticed that this issue occurs when
>> > >> system-timesynd was one of the tasks calling the functions of interest 
>> > >> to your
>> > >> tracing and am wondering if a very out of sync time in process of being
>> > >> corrected could be the cause of this issue? As an experiment I ensured 
>> > >> the
>> > >> system time was accurate before leaving the system idle overnight and I 
>> > >> did not
>> > >> see the issue the next morning.
>> > > 
>> > > Ah!  Yes, a sudden jump in time or a disagreement about the time among
>> > > different components of the system can definitely cause these symptoms.
>> > > We have sometimes seen these problems occur when a pair of CPUs have
>> > > wildly different ideas about what time it is, for example.  Please let
>> > > me know how it goes.
>> > > 
>> > > Also, in your trace, there are no sched_waking events for the rcu_preempt
>> > > process that are not immediately followed by sched_wakeup, so your trace
>> > > isn't showing the problem that I am seeing.
>> > 
>> > This is interesting.
>> > 
>> > Perhaps we could try with those commits reverted ?
>> > 
>> > commit e3baac47f0e82c4be632f4f97215bb93bf16b342
>> > Author: Peter Zijlstra 
>> > Date:   Wed Jun 4 10:31:18 2014 -0700
>> > 
>> > sched/idle: Optimize try-to-wake-up IPI
>> > 
>> > commit fd99f91aa007ba255aac44fe6cf21c1db398243a
>> > Author: Peter Zijlstra 
>> > Date:   Wed Apr 9 15:35:08 2014 +0200
>> > 
>> > sched/idle: Avoid spurious wakeup IPIs
>> > 
>> > They appeared in 3.16.
>> 
>> At this point, I am up for trying pretty much anything.  ;-)
>> 
>> Will give it a go.
> 
> And those certainly don't revert cleanly!  Would patching the kernel
> to remove the definition of TIF_POLLING_NRFLAG be useful?  Or, more
> to the point, is there some other course of action that would be more
> useful?  At this point, the test times are measured in weeks...

Indeed, patching the kernel to remove the TIF_POLLING_NRFLAG
definition would have an effect similar to reverting those two
commits.

Since testing takes a while, we could take a more aggressive
approach towards reproducing a possible race condition: we
could re-implement the _TIF_POLLING_NRFLAG vs _TIF_NEED_RESCHED
dance, along with the ttwu pending lock-list queue, within
a dummy test module, with custom data structures, and
stress-test the invariants. We could also create a Promela
model of these ipi-skip optimisations trying to validate
progress: whenever a wakeup is requested, there should
always be a scheduling performed, even if no further wakeup
is encountered.

Each of the two approaches proposed above might be a significant
endeavor, and would only validate my specific hunch. So it might
be a good idea to just let a test run for a few weeks with
TIF_POLLING_NRFLAG disabled meanwhile.

Thoughts ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
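
A user-space approximation of the dummy stress test proposed above might look like the sketch below. It is purely illustrative: pthreads stand in for CPUs, a condition variable stands in for the IPI, and every name is invented. The property being exercised is the one stated in the message, namely that every NEED_RESCHED request is eventually followed by a "schedule", either via the polling path or via the IPI path.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define POLLING      0x1u
#define NEED_RESCHED 0x2u

static atomic_uint  flags;
static atomic_bool  ipi_pending, stop;
static atomic_ulong wake_requests, schedules;
static pthread_mutex_t ipi_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ipi_cond = PTHREAD_COND_INITIALIZER;

/* The "IPI": kick the idle thread out of its deep-idle wait. */
static void send_ipi(void)
{
    pthread_mutex_lock(&ipi_lock);
    atomic_store(&ipi_pending, true);
    pthread_cond_signal(&ipi_cond);
    pthread_mutex_unlock(&ipi_lock);
}

static void *waker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        atomic_fetch_add(&wake_requests, 1);
        unsigned int old = atomic_fetch_or(&flags, NEED_RESCHED);
        if (!(old & POLLING))
            send_ipi();                 /* target may be in deep idle */
    }
    atomic_store(&stop, true);
    send_ipi();                         /* let the idle thread exit */
    return NULL;
}

static void *idle(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop)) {
        /* shallow idle: advertise POLLING and poll the flag word briefly */
        atomic_fetch_or(&flags, POLLING);
        for (int spin = 0; spin < 64; spin++)
            if (atomic_load(&flags) & NEED_RESCHED)
                break;
        /* stop polling; the ordering of this RMW against the waker's
         * fetch_or is what keeps wakeups from being lost */
        unsigned int old = atomic_fetch_and(&flags, ~(POLLING | NEED_RESCHED));
        if (old & NEED_RESCHED) {
            atomic_fetch_add(&schedules, 1);        /* "schedule()" */
            continue;
        }
        /* deep idle: block until an "IPI" arrives */
        pthread_mutex_lock(&ipi_lock);
        while (!atomic_load(&ipi_pending) && !atomic_load(&stop))
            pthread_cond_wait(&ipi_cond, &ipi_lock);
        atomic_store(&ipi_pending, false);
        pthread_mutex_unlock(&ipi_lock);
        atomic_fetch_add(&schedules, 1);            /* woken: "schedule()" */
        atomic_fetch_and(&flags, ~NEED_RESCHED);
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, idle, NULL);
    pthread_create(&b, NULL, waker, NULL);
    pthread_join(b, NULL);
    pthread_join(a, NULL);
    printf("wake requests %lu, schedules %lu\n",
           atomic_load(&wake_requests), atomic_load(&schedules));
    return 0;
}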


Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-26 Thread Paul E. McKenney
On Sat, Mar 26, 2016 at 08:28:16AM -0700, Paul E. McKenney wrote:
> On Sat, Mar 26, 2016 at 12:29:31PM +, Mathieu Desnoyers wrote:
> > - On Mar 25, 2016, at 5:46 PM, Paul E. McKenney 
> > paul...@linux.vnet.ibm.com wrote:
> > 
> > > On Fri, Mar 25, 2016 at 09:24:14PM +, Chatre, Reinette wrote:
> > >> Hi  Paul,
> > >> 
> > >> On 2016-03-23, Paul E. McKenney wrote:
> > >> > Please boot with the following parameters:
> > >> > 
> > >> >rcu_tree.rcu_kick_kthreads ftrace
> > >> > trace_event=sched_waking,sched_wakeup,sched_wake_idle_without_ipi
> > >> 
> > >> With these parameters I expected more details to show up in the kernel 
> > >> logs but
> > >> cannot find any. Even so, today I left the machine running again and 
> > >> when this
> > >> happened I think I was able to capture the trace data for the event. 
> > >> Please
> > >> find attached the trace information for the kernel message below. Since 
> > >> the
> > >> complete trace file is very big I trimmed it to show the time around 
> > >> this event
> > >> - hopefully this will contain the information you need. I would also 
> > >> like to
> > >> provide some additional information. The system on which I see these 
> > >> events had
> > >> a time that was _very_ wrong. I noticed that this issue occurs when
> > >> system-timesynd was one of the tasks calling the functions of interest 
> > >> to your
> > >> tracing and am wondering if a very out of sync time in process of being
> > >> corrected could be the cause of this issue? As an experiment I ensured 
> > >> the
> > >> system time was accurate before leaving the system idle overnight and I 
> > >> did not
> > >> see the issue the next morning.
> > > 
> > > Ah!  Yes, a sudden jump in time or a disagreement about the time among
> > > different components of the system can definitely cause these symptoms.
> > > We have sometimes seen these problems occur when a pair of CPUs have
> > > wildly different ideas about what time it is, for example.  Please let
> > > me know how it goes.
> > > 
> > > Also, in your trace, there are no sched_waking events for the rcu_preempt
> > > process that are not immediately followed by sched_wakeup, so your trace
> > > isn't showing the problem that I am seeing.
> > 
> > This is interesting.
> > 
> > Perhaps we could try with those commits reverted ?
> > 
> > commit e3baac47f0e82c4be632f4f97215bb93bf16b342
> > Author: Peter Zijlstra 
> > Date:   Wed Jun 4 10:31:18 2014 -0700
> > 
> > sched/idle: Optimize try-to-wake-up IPI
> > 
> > commit fd99f91aa007ba255aac44fe6cf21c1db398243a
> > Author: Peter Zijlstra 
> > Date:   Wed Apr 9 15:35:08 2014 +0200
> > 
> > sched/idle: Avoid spurious wakeup IPIs
> > 
> > They appeared in 3.16.
> 
> At this point, I am up for trying pretty much anything.  ;-)
> 
> Will give it a go.

And those certainly don't revert cleanly!  Would patching the kernel
to remove the definition of TIF_POLLING_NRFLAG be useful?  Or, more
to the point, is there some other course of action that would be more
useful?  At this point, the test times are measured in weeks...

Thanx, Paul
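
To picture what removing the definition would do: the scheduler core builds its "can the IPI be skipped?" helper under an #ifdef on that flag, so with the definition gone only an always-send-the-IPI fallback is compiled in. The toy below is a compilable paraphrase with stand-in types and values, not the actual kernel source.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define _TIF_NEED_RESCHED   0x2u
/* #define TIF_POLLING_NRFLAG 1 */      /* the definition being removed */
#ifdef TIF_POLLING_NRFLAG
#define _TIF_POLLING_NRFLAG (1u << TIF_POLLING_NRFLAG)
#endif

struct task { atomic_uint ti_flags; };  /* stand-in for thread_info flags */

#ifdef TIF_POLLING_NRFLAG
/* Optimised form: set NEED_RESCHED and report whether the target was
 * polling; if it was, the caller skips the resched IPI. */
static bool set_nr_and_not_polling(struct task *p)
{
    return !(atomic_fetch_or(&p->ti_flags, _TIF_NEED_RESCHED)
             & _TIF_POLLING_NRFLAG);
}
#else
/* Fallback with the flag undefined: always report "not polling", so the
 * caller always sends the IPI. */
static bool set_nr_and_not_polling(struct task *p)
{
    atomic_fetch_or(&p->ti_flags, _TIF_NEED_RESCHED);
    return true;
}
#endif

int main(void)
{
    static struct task t;
    printf("send IPI: %s\n", set_nr_and_not_polling(&t) ? "yes" : "no");
    return 0;
}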



Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-26 Thread Paul E. McKenney
On Sat, Mar 26, 2016 at 12:29:31PM +, Mathieu Desnoyers wrote:
> - On Mar 25, 2016, at 5:46 PM, Paul E. McKenney 
> paul...@linux.vnet.ibm.com wrote:
> 
> > On Fri, Mar 25, 2016 at 09:24:14PM +, Chatre, Reinette wrote:
> >> Hi  Paul,
> >> 
> >> On 2016-03-23, Paul E. McKenney wrote:
> >> > Please boot with the following parameters:
> >> > 
> >> >  rcu_tree.rcu_kick_kthreads ftrace
> >> > trace_event=sched_waking,sched_wakeup,sched_wake_idle_without_ipi
> >> 
> >> With these parameters I expected more details to show up in the kernel 
> >> logs but
> >> cannot find any. Even so, today I left the machine running again and when 
> >> this
> >> happened I think I was able to capture the trace data for the event. Please
> >> find attached the trace information for the kernel message below. Since the
> >> complete trace file is very big I trimmed it to show the time around this 
> >> event
> >> - hopefully this will contain the information you need. I would also like 
> >> to
> >> provide some additional information. The system on which I see these 
> >> events had
> >> a time that was _very_ wrong. I noticed that this issue occurs when
> >> system-timesynd was one of the tasks calling the functions of interest to 
> >> your
> >> tracing and am wondering if a very out of sync time in process of being
> >> corrected could be the cause of this issue? As an experiment I ensured the
> >> system time was accurate before leaving the system idle overnight and I 
> >> did not
> >> see the issue the next morning.
> > 
> > Ah!  Yes, a sudden jump in time or a disagreement about the time among
> > different components of the system can definitely cause these symptoms.
> > We have sometimes seen these problems occur when a pair of CPUs have
> > wildly different ideas about what time it is, for example.  Please let
> > me know how it goes.
> > 
> > Also, in your trace, there are no sched_waking events for the rcu_preempt
> > process that are not immediately followed by sched_wakeup, so your trace
> > isn't showing the problem that I am seeing.
> 
> This is interesting.
> 
> Perhaps we could try with those commits reverted ?
> 
> commit e3baac47f0e82c4be632f4f97215bb93bf16b342
> Author: Peter Zijlstra 
> Date:   Wed Jun 4 10:31:18 2014 -0700
> 
> sched/idle: Optimize try-to-wake-up IPI
> 
> commit fd99f91aa007ba255aac44fe6cf21c1db398243a
> Author: Peter Zijlstra 
> Date:   Wed Apr 9 15:35:08 2014 +0200
> 
> sched/idle: Avoid spurious wakeup IPIs
> 
> They appeared in 3.16.

At this point, I am up for trying pretty much anything.  ;-)

Will give it a go.

Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> > 
> > Still beating up on my stress test, which is not yet proving to be all
> > that stressful.  :-/
> > 
> > Thanx, Paul
> > 
> >> [  957.396537] INFO: rcu_preempt detected stalls on CPUs/tasks:
> >> [  957.399933]  1-...: (0 ticks this GP) idle=4d6/0/0 softirq=6311/6311 
> >> fqs=0
> >> [  957.403661]  (detected by 0, t=60002 jiffies, g=3583, c=3582, q=47)
> >> [  957.407227] Task dump for CPU 1:
> >> [  957.409964] swapper/1   R  running task0 0  1 
> >> 0x0020
> >> [  957.413770]  039daa9a7eb9 8801785cfed0 818af34c
> >> 8801
> >> [  957.417696]  00060003 8801785d 880072f9ea00
> >> 822dcf80
> >> [  957.421631]  8801785cc000 8801785cc000 8801785cfee0
> >> 818af597
> >> [  957.425562] Call Trace:
> >> [  957.428124]  [] ? cpuidle_enter_state+0xfc/0x310
> >> [  957.431713]  [] ? cpuidle_enter+0x17/0x20
> >> [  957.435122]  [] ? call_cpuidle+0x2a/0x40
> >> [  957.438467]  [] ? cpu_startup_entry+0x28d/0x360
> >> [  957.441949]  [] ? start_secondary+0x114/0x140
> >> [  957.445378] rcu_preempt kthread starved for 60002 jiffies! g3583 c3582 
> >> f0x0
> >> RCU_GP_WAIT_FQS(3) ->state=0x1
> >> [  957.449834] rcu_preempt S 8801785b7d68 0 7  2 
> >> 0x
> >> [  957.453579]  8801785b7d68 88017dc8cc80 88016fe6bb80
> >> 8801785abb80
> >> [  957.457428]  8801785b8000 8801785b7da0 88017dc8cc80
> >> 88017dc8cc80
> >> [  957.461249]  0003 8801785b7d80 81ab03df
> >> 000100373021
> >> [  957.465055] Call Trace:
> >> [  957.467493]  [] schedule+0x3f/0xa0
> >> [  957.470613]  [] schedule_timeout+0x127/0x270
> >> [  957.473976]  [] ? detach_if_pending+0x120/0x120
> >> [  957.477387]  [] rcu_gp_kthread+0x6d3/0xa40
> >> [  957.480659]  [] ? wake_atomic_t_function+0x70/0x70
> >> [  957.484123]  [] ? force_qs_rnp+0x1b0/0x1b0
> >> [  957.487392]  [] kthread+0xe6/0x100
> >> [  957.490470]  [] ? kthread_worker_fn+0x190/0x190
> >> [  957.493859]  [] ret_from_fork+0x3f/0x70
> >> [  957.497044]  [] ? kthread_worker_fn+0x190/0x190
> >> 
> > > Reinette
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> 

Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-26 Thread Mathieu Desnoyers
- On Mar 25, 2016, at 5:46 PM, Paul E. McKenney paul...@linux.vnet.ibm.com 
wrote:

> On Fri, Mar 25, 2016 at 09:24:14PM +, Chatre, Reinette wrote:
>> Hi  Paul,
>> 
>> On 2016-03-23, Paul E. McKenney wrote:
>> > Please boot with the following parameters:
>> > 
>> >rcu_tree.rcu_kick_kthreads ftrace
>> > trace_event=sched_waking,sched_wakeup,sched_wake_idle_without_ipi
>> 
>> With these parameters I expected more details to show up in the kernel logs 
>> but
>> cannot find any. Even so, today I left the machine running again and when 
>> this
>> happened I think I was able to capture the trace data for the event. Please
>> find attached the trace information for the kernel message below. Since the
>> complete trace file is very big I trimmed it to show the time around this 
>> event
>> - hopefully this will contain the information you need. I would also like to
>> provide some additional information. The system on which I see these events 
>> had
>> a time that was _very_ wrong. I noticed that this issue occurs when
>> system-timesynd was one of the tasks calling the functions of interest to 
>> your
>> tracing and am wondering if a very out of sync time in process of being
>> corrected could be the cause of this issue? As an experiment I ensured the
>> system time was accurate before leaving the system idle overnight and I did 
>> not
>> see the issue the next morning.
> 
> Ah!  Yes, a sudden jump in time or a disagreement about the time among
> different components of the system can definitely cause these symptoms.
> We have sometimes seen these problems occur when a pair of CPUs have
> wildly different ideas about what time it is, for example.  Please let
> me know how it goes.
> 
> Also, in your trace, there are no sched_waking events for the rcu_preempt
> process that are not immediately followed by sched_wakeup, so your trace
> isn't showing the problem that I am seeing.

This is interesting.

Perhaps we could try with those commits reverted ?

commit e3baac47f0e82c4be632f4f97215bb93bf16b342
Author: Peter Zijlstra 
Date:   Wed Jun 4 10:31:18 2014 -0700

sched/idle: Optimize try-to-wake-up IPI

commit fd99f91aa007ba255aac44fe6cf21c1db398243a
Author: Peter Zijlstra 
Date:   Wed Apr 9 15:35:08 2014 +0200

sched/idle: Avoid spurious wakeup IPIs

They appeared in 3.16.

Thanks,

Mathieu

> 
> Still beating up on my stress test, which is not yet proving to be all
> that stressful.  :-/
> 
>   Thanx, Paul
> 
>> [  957.396537] INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [  957.399933]  1-...: (0 ticks this GP) idle=4d6/0/0 softirq=6311/6311 fqs=0
>> [  957.403661]  (detected by 0, t=60002 jiffies, g=3583, c=3582, q=47)
>> [  957.407227] Task dump for CPU 1:
>> [  957.409964] swapper/1   R  running task0 0  1 
>> 0x0020
>> [  957.413770]  039daa9a7eb9 8801785cfed0 818af34c
>> 8801
>> [  957.417696]  00060003 8801785d 880072f9ea00
>> 822dcf80
>> [  957.421631]  8801785cc000 8801785cc000 8801785cfee0
>> 818af597
>> [  957.425562] Call Trace:
>> [  957.428124]  [] ? cpuidle_enter_state+0xfc/0x310
>> [  957.431713]  [] ? cpuidle_enter+0x17/0x20
>> [  957.435122]  [] ? call_cpuidle+0x2a/0x40
>> [  957.438467]  [] ? cpu_startup_entry+0x28d/0x360
>> [  957.441949]  [] ? start_secondary+0x114/0x140
>> [  957.445378] rcu_preempt kthread starved for 60002 jiffies! g3583 c3582 
>> f0x0
>> RCU_GP_WAIT_FQS(3) ->state=0x1
>> [  957.449834] rcu_preempt S 8801785b7d68 0 7  2 
>> 0x
>> [  957.453579]  8801785b7d68 88017dc8cc80 88016fe6bb80
>> 8801785abb80
>> [  957.457428]  8801785b8000 8801785b7da0 88017dc8cc80
>> 88017dc8cc80
>> [  957.461249]  0003 8801785b7d80 81ab03df
>> 000100373021
>> [  957.465055] Call Trace:
>> [  957.467493]  [] schedule+0x3f/0xa0
>> [  957.470613]  [] schedule_timeout+0x127/0x270
>> [  957.473976]  [] ? detach_if_pending+0x120/0x120
>> [  957.477387]  [] rcu_gp_kthread+0x6d3/0xa40
>> [  957.480659]  [] ? wake_atomic_t_function+0x70/0x70
>> [  957.484123]  [] ? force_qs_rnp+0x1b0/0x1b0
>> [  957.487392]  [] kthread+0xe6/0x100
>> [  957.490470]  [] ? kthread_worker_fn+0x190/0x190
>> [  957.493859]  [] ret_from_fork+0x3f/0x70
>> [  957.497044]  [] ? kthread_worker_fn+0x190/0x190
>> 
> > Reinette

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-25 Thread Paul E. McKenney
On Fri, Mar 25, 2016 at 09:24:14PM +, Chatre, Reinette wrote:
> Hi  Paul,
> 
> On 2016-03-23, Paul E. McKenney wrote:
> > Please boot with the following parameters:
> > 
> > rcu_tree.rcu_kick_kthreads ftrace
> > trace_event=sched_waking,sched_wakeup,sched_wake_idle_without_ipi
> 
> With these parameters I expected more details to show up in the kernel logs 
> but cannot find any. Even so, today I left the machine running again and when 
> this happened I think I was able to capture the trace data for the event. 
> Please find attached the trace information for the kernel message below. 
> Since the complete trace file is very big I trimmed it to show the time 
> around this event - hopefully this will contain the information you need. I 
> would also like to provide some additional information. The system on which I 
> see these events had a time that was _very_ wrong. I noticed that this issue 
> occurs when system-timesynd was one of the tasks calling the functions of 
> interest to your tracing and am wondering if a very out of sync time in 
> process of being corrected could be the cause of this issue? As an experiment 
> I ensured the system time was accurate before leaving the system idle 
> overnight and I did not see the issue the next morning. 

Ah!  Yes, a sudden jump in time or a disagreement about the time among
different components of the system can definitely cause these symptoms.
We have sometimes seen these problems occur when a pair of CPUs have
wildly different ideas about what time it is, for example.  Please let
me know how it goes.

Also, in your trace, there are no sched_waking events for the rcu_preempt
process that are not immediately followed by sched_wakeup, so your trace
isn't showing the problem that I am seeing.
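
For anyone who wants to check a trace for that pattern mechanically, a rough filter along the following lines flags sched_waking events for rcu_preempt that are not soon followed by a matching sched_wakeup. It assumes the stock ftrace text output (lines containing "sched_waking: comm=rcu_preempt" and "sched_wakeup: comm=rcu_preempt") and is only a sketch.

#include <stdio.h>
#include <string.h>

/* Reads an ftrace text dump on stdin and reports sched_waking events for
 * rcu_preempt that are not followed by a sched_wakeup for rcu_preempt
 * within the next WINDOW lines.  Crude, but enough to spot the pattern. */
#define WINDOW 50

int main(void)
{
    char line[4096];
    long lineno = 0, pending_line = 0;   /* line of an unmatched waking */

    while (fgets(line, sizeof(line), stdin)) {
        lineno++;
        if (pending_line && lineno - pending_line > WINDOW) {
            printf("waking at line %ld not followed by a wakeup within %d lines\n",
                   pending_line, WINDOW);
            pending_line = 0;
        }
        if (strstr(line, "sched_waking: comm=rcu_preempt"))
            pending_line = lineno;
        else if (pending_line && strstr(line, "sched_wakeup: comm=rcu_preempt"))
            pending_line = 0;
    }
    if (pending_line)
        printf("waking at line %ld never followed by a wakeup\n", pending_line);
    return 0;
}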

Still beating up on my stress test, which is not yet proving to be all
that stressful.  :-/

Thanx, Paul

> [  957.396537] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  957.399933]  1-...: (0 ticks this GP) idle=4d6/0/0 softirq=6311/6311 fqs=0
> [  957.403661]  (detected by 0, t=60002 jiffies, g=3583, c=3582, q=47)
> [  957.407227] Task dump for CPU 1:
> [  957.409964] swapper/1   R  running task0 0  1 
> 0x0020
> [  957.413770]  039daa9a7eb9 8801785cfed0 818af34c 
> 8801
> [  957.417696]  00060003 8801785d 880072f9ea00 
> 822dcf80
> [  957.421631]  8801785cc000 8801785cc000 8801785cfee0 
> 818af597
> [  957.425562] Call Trace:
> [  957.428124]  [] ? cpuidle_enter_state+0xfc/0x310
> [  957.431713]  [] ? cpuidle_enter+0x17/0x20
> [  957.435122]  [] ? call_cpuidle+0x2a/0x40
> [  957.438467]  [] ? cpu_startup_entry+0x28d/0x360
> [  957.441949]  [] ? start_secondary+0x114/0x140
> [  957.445378] rcu_preempt kthread starved for 60002 jiffies! g3583 c3582 
> f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
> [  957.449834] rcu_preempt S 8801785b7d68 0 7  2 
> 0x
> [  957.453579]  8801785b7d68 88017dc8cc80 88016fe6bb80 
> 8801785abb80
> [  957.457428]  8801785b8000 8801785b7da0 88017dc8cc80 
> 88017dc8cc80
> [  957.461249]  0003 8801785b7d80 81ab03df 
> 000100373021
> [  957.465055] Call Trace:
> [  957.467493]  [] schedule+0x3f/0xa0
> [  957.470613]  [] schedule_timeout+0x127/0x270
> [  957.473976]  [] ? detach_if_pending+0x120/0x120
> [  957.477387]  [] rcu_gp_kthread+0x6d3/0xa40
> [  957.480659]  [] ? wake_atomic_t_function+0x70/0x70
> [  957.484123]  [] ? force_qs_rnp+0x1b0/0x1b0
> [  957.487392]  [] kthread+0xe6/0x100
> [  957.490470]  [] ? kthread_worker_fn+0x190/0x190
> [  957.493859]  [] ret_from_fork+0x3f/0x70
> [  957.497044]  [] ? kthread_worker_fn+0x190/0x190
> 
> Reinette




RE: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-25 Thread Chatre, Reinette
Hi  Paul,

On 2016-03-23, Paul E. McKenney wrote:
> Please boot with the following parameters:
> 
>   rcu_tree.rcu_kick_kthreads ftrace
> trace_event=sched_waking,sched_wakeup,sched_wake_idle_without_ipi

With these parameters I expected more details to show up in the kernel logs but 
cannot find any. Even so, today I left the machine running again and when this 
happened I think I was able to capture the trace data for the event. Please 
find attached the trace information for the kernel message below. Since the 
complete trace file is very big I trimmed it to show the time around this event 
- hopefully this will contain the information you need. I would also like to 
provide some additional information. The system on which I see these events had 
a time that was _very_ wrong. I noticed that this issue occurs when 
system-timesynd was one of the tasks calling the functions of interest to your 
tracing and am wondering if a very out of sync time in process of being 
corrected could be the cause of this issue? As an experiment I ensured the 
system time was accurate before leaving the system idle overnight and I did not 
see the issue the next morning. 
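
One way to test the time-correction hypothesis while the box sits idle is a tiny probe that watches the offset between CLOCK_REALTIME and CLOCK_MONOTONIC and logs any large step, so a later stall can be correlated with a time adjustment. A minimal sketch follows, assuming ordinary POSIX clock_gettime; the one-second threshold and polling interval are arbitrary.

#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double ts(const struct timespec *t)
{
    return (double)t->tv_sec + (double)t->tv_nsec / 1e9;
}

int main(void)
{
    struct timespec mono, real;
    clock_gettime(CLOCK_MONOTONIC, &mono);
    clock_gettime(CLOCK_REALTIME, &real);
    double offset = ts(&real) - ts(&mono);

    for (;;) {
        sleep(1);
        clock_gettime(CLOCK_MONOTONIC, &mono);
        clock_gettime(CLOCK_REALTIME, &real);
        double now = ts(&real) - ts(&mono);
        if (now - offset > 1.0 || offset - now > 1.0)
            printf("realtime stepped by %+.3f s at uptime %.3f s\n",
                   now - offset, ts(&mono));
        offset = now;
    }
    return 0;
}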

[  957.396537] INFO: rcu_preempt detected stalls on CPUs/tasks:
[  957.399933]  1-...: (0 ticks this GP) idle=4d6/0/0 softirq=6311/6311 fqs=0
[  957.403661]  (detected by 0, t=60002 jiffies, g=3583, c=3582, q=47)
[  957.407227] Task dump for CPU 1:
[  957.409964] swapper/1   R  running task0 0  1 0x0020
[  957.413770]  039daa9a7eb9 8801785cfed0 818af34c 
8801
[  957.417696]  00060003 8801785d 880072f9ea00 
822dcf80
[  957.421631]  8801785cc000 8801785cc000 8801785cfee0 
818af597
[  957.425562] Call Trace:
[  957.428124]  [] ? cpuidle_enter_state+0xfc/0x310
[  957.431713]  [] ? cpuidle_enter+0x17/0x20
[  957.435122]  [] ? call_cpuidle+0x2a/0x40
[  957.438467]  [] ? cpu_startup_entry+0x28d/0x360
[  957.441949]  [] ? start_secondary+0x114/0x140
[  957.445378] rcu_preempt kthread starved for 60002 jiffies! g3583 c3582 f0x0 
RCU_GP_WAIT_FQS(3) ->state=0x1
[  957.449834] rcu_preempt S 8801785b7d68 0 7  2 0x
[  957.453579]  8801785b7d68 88017dc8cc80 88016fe6bb80 
8801785abb80
[  957.457428]  8801785b8000 8801785b7da0 88017dc8cc80 
88017dc8cc80
[  957.461249]  0003 8801785b7d80 81ab03df 
000100373021
[  957.465055] Call Trace:
[  957.467493]  [] schedule+0x3f/0xa0
[  957.470613]  [] schedule_timeout+0x127/0x270
[  957.473976]  [] ? detach_if_pending+0x120/0x120
[  957.477387]  [] rcu_gp_kthread+0x6d3/0xa40
[  957.480659]  [] ? wake_atomic_t_function+0x70/0x70
[  957.484123]  [] ? force_qs_rnp+0x1b0/0x1b0
[  957.487392]  [] kthread+0xe6/0x100
[  957.490470]  [] ? kthread_worker_fn+0x190/0x190
[  957.493859]  [] ret_from_fork+0x3f/0x70
[  957.497044]  [] ? kthread_worker_fn+0x190/0x190

Reinette


trace.trim.gz
Description: trace.trim.gz


Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-23 Thread Paul E. McKenney
On Wed, Mar 23, 2016 at 06:25:50PM +, Chatre, Reinette wrote:
> Hi Paul,
> 
> On 2016-03-23, Paul E. McKenney wrote:
> > Please boot with the following parameters:
> > 
> > rcu_tree.rcu_kick_kthreads ftrace
> > trace_event=sched_waking,sched_wakeup,sched_wake_idle_without_ipi
> > 
> > Or was this run with tracing?  If so, less than three hours isn't too bad.
> 
> This was with tracing enabled, only missing the crucial
> rcu_tree.rcu_kick_kthreads

Good, then the condition did trigger with tracing enabled!  ;-)

Thanx, Paul



RE: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-23 Thread Chatre, Reinette
Hi Paul,

On 2016-03-23, Paul E. McKenney wrote:
> Please boot with the following parameters:
> 
>   rcu_tree.rcu_kick_kthreads ftrace
> trace_event=sched_waking,sched_wakeup,sched_wake_idle_without_ipi
> 
> Or was this run with tracing?  If so, less than three hours isn't too bad.

This was with tracing enabled, only missing the crucial 
rcu_tree.rcu_kick_kthreads

Reinette


Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-23 Thread Paul E. McKenney
On Wed, Mar 23, 2016 at 05:15:11PM +, Chatre, Reinette wrote:
> Hi Paul,
> 
> On 2016-03-22, Paul E. McKenney wrote:
> > On Tue, Mar 22, 2016 at 09:04:47PM +, Chatre, Reinette wrote:
> >> On 2016-03-22, Paul E. McKenney wrote:
> >>> You set CONFIG_RCU_CPU_STALL_TIMEOUT=60, which matches the 60004
> >>> jiffies above.  Is that value due to a distro setting or something? 
> >>> Mainline uses CONFIG_RCU_CPU_STALL_TIMEOUT=21.
> >> 
> >> Indeed ... this value originated from a Fedora configuration.
> > 
> > OK.  Setting it shorter might (or might not) make it reproduce more
> > quickly.  This can be set at boot time via rcupdate.rcu_cpu_stall_timeout.
> > Or at compile time via CONFIG_RCU_CPU_STALL_TIMEOUT.
> 
> I kept the original configuration and seem to be able to reproduce with that.
> 
> 
> >>> If dumping manually shortly after the stall is at all non-trivial
> >>> (for example, if your reproduction time is many minutes or hours),
> >>> I can supply some patches that automate this.  Or you can pick
> >>> them up from -rcu:
> >> 
> >> ... could you please point me to the patches you refer to? Or would you 
> >> like
> > me to try with the entire kernel from rcu/dev?
> > 
> > 2dc92e2a86b9 (rcu: Awaken grace-period kthread if too long since FQS)
> > c3fd2095d015 (rcu: Dump ftrace buffer when kicking grace-period kthread)
> > 
> > There might be other dependencies, but these are the two that you need.
> 
> I did not look closely at the patches when I applied them and because
> of that missed that they need a kernel parameter to be activated. After
> leaving the system idle overnight with these patches the stalls occurred
> but without the parameter I did not capture the data you need. I will
> try again tonight. Below are the traces from last night just in case
> they have value to you.

I know that feeling!

Please boot with the following parameters:

rcu_tree.rcu_kick_kthreads ftrace 
trace_event=sched_waking,sched_wakeup,sched_wake_idle_without_ipi

Or was this run with tracing?  If so, less than three hours isn't too bad.

> [10154.635318] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [10154.639218]  1-...: (0 ticks this GP) idle=c4e/0/0 softirq=99936/99936 
> fqs=0
> [10154.643497]  (detected by 0, t=60005 jiffies, g=24190, c=24189, q=79)
> [10154.647596] Task dump for CPU 1:
> [10154.650818] swapper/1   R  running task0 0  1 
> 0x0020
> [10154.655052]  2656bf74de5e 8801785cfed0 818af34c 
> 8801
> [10154.659349]  00060003 8801785d 880072f0bc00 
> 822dcf80
> [10154.663636]  8801785cc000 8801785cc000 8801785cfee0 
> 818af597
> [10154.667916] Call Trace:
> [10154.670845]  [] ? cpuidle_enter_state+0xfc/0x310
> [10154.674802]  [] ? cpuidle_enter+0x17/0x20
> [10154.678564]  [] ? call_cpuidle+0x2a/0x40
> [10154.682295]  [] ? cpu_startup_entry+0x28d/0x360
> [10154.686187]  [] ? start_secondary+0x114/0x140
> [10154.690040] rcu_preempt kthread starved for 60005 jiffies! g24190 c24189 
> f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1

Still the same type of failure, which is reassuring.

Thanx, Paul

> [10154.694944] rcu_preempt S 8801785b7d68 0 7  2 
> 0x
> [10154.699062]  8801785b7d68 88017dc8cc80 8801785c3b80 
> 8801785abb80
> [10154.703275]  8801785b8000 8801785b7da0 88017dc8cc80 
> 88017dc8cc80
> [10154.707481]  0003 8801785b7d80 81ab03df 
> 0001027e21aa
> [10154.711692] Call Trace:
> [10154.714548]  [] schedule+0x3f/0xa0
> [10154.718075]  [] schedule_timeout+0x127/0x270
> [10154.721832]  [] ? detach_if_pending+0x120/0x120
> [10154.725659]  [] rcu_gp_kthread+0x6d3/0xa40
> [10154.729379]  [] ? wake_atomic_t_function+0x70/0x70
> [10154.733235]  [] ? force_qs_rnp+0x1b0/0x1b0
> [10154.736854]  [] kthread+0xe6/0x100
> [10154.740267]  [] ? kthread_worker_fn+0x190/0x190
> [10154.743980]  [] ret_from_fork+0x3f/0x70
> [10154.747511]  [] ? kthread_worker_fn+0x190/0x190
> [11348.912706] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [11348.916346]  2-...: (0 ticks this GP) idle=586/0/0 softirq=133504/133504 
> fqs=0
> [11348.920407]  (detected by 3, t=60002 jiffies, g=26799, c=26798, q=72)
> [11348.924244] Task dump for CPU 2:
> [11348.927205] swapper/2   R  running task0 0  1 
> 0x0020
> [11348.931178]  2adc83427a76 8801785d3ed0 818af34c 
> 8801
> [11348.935217]  00060003 8801785d4000 880177d01e00 
> 822dcf80
> [11348.939237]  8801785d 8801785d 8801785d3ee0 
> 818af597
> [11348.943252] Call Trace:
> [11348.945921]  [] ? cpuidle_enter_state+0xfc/0x310
> [11348.949615]  [] ? cpuidle_enter+0x17/0x20
> [11348.953115]  [] ? call_cpuidle+0x2a/0x40
> [11348.956584]  [] ? cpu_startup_entry+0x28d/0x360
> [11348.960215]  [] ? start_secondary+0x114/0x140
> [11348.963808] 
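
Before waiting for the next overnight reproduction it can be worth confirming 
that the requested tracing actually took effect; a small sketch, assuming the 
usual debugfs mount point:

	# confirm the requested trace events are enabled and the parameters reached the boot line
	cat /sys/kernel/debug/tracing/set_event
	cat /proc/cmdline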

RE: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-23 Thread Chatre, Reinette
Hi Paul,

On 2016-03-22, Paul E. McKenney wrote:
> On Tue, Mar 22, 2016 at 09:04:47PM +, Chatre, Reinette wrote:
>> On 2016-03-22, Paul E. McKenney wrote:
>>> You set CONFIG_RCU_CPU_STALL_TIMEOUT=60, which matches the 60004
>>> jiffies above.  Is that value due to a distro setting or something? 
>>> Mainline uses CONFIG_RCU_CPU_STALL_TIMEOUT=21.
>> 
>> Indeed ... this value originated from a Fedora configuration.
> 
> OK.  Setting it shorter might (or might not) make it reproduce more
> quickly.  This can be set at boot time via rcupdate.rcu_cpu_stall_timeout.
> Or at compile time via CONFIG_RCU_CPU_STALL_TIMEOUT.

I kept the original configuration and seem to be able to reproduce with that.


>>> If dumping manually shortly after the stall is at all non-trivial
>>> (for example, if your reproduction time is many minutes or hours),
>>> I can supply some patches that automate this.  Or you can pick
>>> them up from -rcu:
>> 
>> ... could you please point me to the patches you refer to? Or would you like
> me to try with the entire kernel from rcu/dev?
> 
> 2dc92e2a86b9 (rcu: Awaken grace-period kthread if too long since FQS)
> c3fd2095d015 (rcu: Dump ftrace buffer when kicking grace-period kthread)
> 
> There might be other dependencies, but these are the two that you need.

I did not look closely at the patches when I applied them and because of that 
missed that they need a kernel parameter to be activated. After leaving the 
system idle overnight with these patches the stalls occurred but without the 
parameter I did not capture the data you need. I will try again tonight. Below 
are the traces from last night just in case they have value to you.

[10154.635318] INFO: rcu_preempt detected stalls on CPUs/tasks:
[10154.639218]  1-...: (0 ticks this GP) idle=c4e/0/0 softirq=99936/99936 fqs=0
[10154.643497]  (detected by 0, t=60005 jiffies, g=24190, c=24189, q=79)
[10154.647596] Task dump for CPU 1:
[10154.650818] swapper/1   R  running task0 0  1 0x0020
[10154.655052]  2656bf74de5e 8801785cfed0 818af34c 
8801
[10154.659349]  00060003 8801785d 880072f0bc00 
822dcf80
[10154.663636]  8801785cc000 8801785cc000 8801785cfee0 
818af597
[10154.667916] Call Trace:
[10154.670845]  [] ? cpuidle_enter_state+0xfc/0x310
[10154.674802]  [] ? cpuidle_enter+0x17/0x20
[10154.678564]  [] ? call_cpuidle+0x2a/0x40
[10154.682295]  [] ? cpu_startup_entry+0x28d/0x360
[10154.686187]  [] ? start_secondary+0x114/0x140
[10154.690040] rcu_preempt kthread starved for 60005 jiffies! g24190 c24189 
f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[10154.694944] rcu_preempt S 8801785b7d68 0 7  2 0x
[10154.699062]  8801785b7d68 88017dc8cc80 8801785c3b80 
8801785abb80
[10154.703275]  8801785b8000 8801785b7da0 88017dc8cc80 
88017dc8cc80
[10154.707481]  0003 8801785b7d80 81ab03df 
0001027e21aa
[10154.711692] Call Trace:
[10154.714548]  [] schedule+0x3f/0xa0
[10154.718075]  [] schedule_timeout+0x127/0x270
[10154.721832]  [] ? detach_if_pending+0x120/0x120
[10154.725659]  [] rcu_gp_kthread+0x6d3/0xa40
[10154.729379]  [] ? wake_atomic_t_function+0x70/0x70
[10154.733235]  [] ? force_qs_rnp+0x1b0/0x1b0
[10154.736854]  [] kthread+0xe6/0x100
[10154.740267]  [] ? kthread_worker_fn+0x190/0x190
[10154.743980]  [] ret_from_fork+0x3f/0x70
[10154.747511]  [] ? kthread_worker_fn+0x190/0x190
[11348.912706] INFO: rcu_preempt detected stalls on CPUs/tasks:
[11348.916346]  2-...: (0 ticks this GP) idle=586/0/0 softirq=133504/133504 
fqs=0
[11348.920407]  (detected by 3, t=60002 jiffies, g=26799, c=26798, q=72)
[11348.924244] Task dump for CPU 2:
[11348.927205] swapper/2   R  running task0 0  1 0x0020
[11348.931178]  2adc83427a76 8801785d3ed0 818af34c 
8801
[11348.935217]  00060003 8801785d4000 880177d01e00 
822dcf80
[11348.939237]  8801785d 8801785d 8801785d3ee0 
818af597
[11348.943252] Call Trace:
[11348.945921]  [] ? cpuidle_enter_state+0xfc/0x310
[11348.949615]  [] ? cpuidle_enter+0x17/0x20
[11348.953115]  [] ? call_cpuidle+0x2a/0x40
[11348.956584]  [] ? cpu_startup_entry+0x28d/0x360
[11348.960215]  [] ? start_secondary+0x114/0x140
[11348.963808] rcu_preempt kthread starved for 60002 jiffies! g26799 c26798 
f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[11348.968452] rcu_preempt S 8801785b7d68 0 7  2 0x
[11348.972309]  8801785b7d68 88017dd0cc80 8801785c5940 
8801785abb80
[11348.976266]  8801785b8000 8801785b7da0 88017dd0cc80 
88017dd0cc80
[11348.980207]  0003 8801785b7d80 81ab03df 
000102c9d45e
[11348.984142] Call Trace:
[11348.986714]  [] schedule+0x3f/0xa0
[11348.989974]  [] schedule_timeout+0x127/0x270
[11348.993453]  [] ? detach_if_pending+0x120/0x120
[11348.997000]  [] rcu_gp_kthread+0x6d3/0xa40

Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-22 Thread Paul E. McKenney
On Tue, Mar 22, 2016 at 09:04:47PM +, Chatre, Reinette wrote:
> Hi Paul,
> 
> On 2016-03-22, Paul E. McKenney wrote:
> > On Tue, Mar 22, 2016 at 04:35:32PM +, Chatre, Reinette wrote:
> >> On 2016-03-21, Paul E. McKenney wrote:
> >>> On Mon, Mar 21, 2016 at 09:22:30AM -0700, Jacob Pan wrote:
>  On Fri, 18 Mar 2016 16:56:41 -0700
>  "Paul E. McKenney"  wrote:
> > On Fri, Mar 18, 2016 at 02:00:11PM -0700, Josh Triplett wrote:
> >> On Thu, Feb 25, 2016 at 04:56:38PM -0800, Paul E. McKenney wrote:
> >>> 
> >>> [ . . . ]
> >>> 
> >> We're seeing a similar stall (~60 seconds) on an x86 development
> >> system here.  Any luck tracking down the cause of this?  If not, any
> >> suggestions for traces that might be helpful?
> > 
> > The dmesg containing the stall, the kernel version, and the .config
> > would be helpful!  Working on a torture test specific to this bug...
> > 
> > And thank you for the .config.  Your kernel version looks to be 4.5.0.
> > 
>  +Reinette, she has the system that can reproduce the issue. I
>  believe she is having some other problems with it at the moment. But
>  the .config should be available. Version is v4.5.
> >>> 
> >>> A couple of additional questions:
> >>> 
> >>> 1.Is the test running on bare metal or virtualized?  If the
> >>>   latter, what is the host?
> >> 
> >> Bare metal.
> > 
> > OK, you are ahead of me.  Mine is virtualized.
> > 
> >>> 2.Does the workload involve CPU hotplug?
> >> 
> >> No.
> > 
> > Again, you are ahead of me.  Mine makes extremely heavy use of CPU hotplug.
> > 
> >>> 3.Are you seeing things like this in dmesg?
> >>> 
> >>>   "rcu_preempt kthread starved for 21033 jiffies"
> >>>   "rcu_sched kthread starved for 32103 jiffies"
> >>>   "rcu_bh kthread starved for 84031 jiffies"
> >>> 
> >>>   If not, you are probably facing some other bug, and should
> >>>   proceed debugging as described in Documentation/RCU/stallwarn.txt.
> >> 
> >> Below is a sample of what I see as captured with v4.5. The kernel
> >> configuration is attached.
> >> 
> >> [  135.456197] INFO: rcu_preempt detected stalls on CPUs/tasks: [ 
> >> 135.457729]  3-...: (0 ticks this GP) idle=722/0/0 softirq=5532/5532
> >> fqs=0 [  135.459604]  (detected by 2, t=60004 jiffies, g=2105, c=2104,
> >> q=165) [  135.461318] Task dump for CPU 3: [  135.461321] swapper/3
> >>   R  running task0 0  1 0x0020 [  135.461325] 
> >> 0078560040e5 88017846fed0 818af2cc 8801 [ 
> >> 135.461330]  00060003 88017847 880072f32200
> >> 822dcec0 [  135.461334]  88017846c000 88017846c000
> >> 88017846fee0 818af517 [  135.461338] Call Trace: [ 
> >> 135.461345]  [] ? cpuidle_enter_state+0xfc/0x310 [ 
> >> 135.461349]  [] ? cpuidle_enter+0x17/0x20 [ 
> >> 135.461353]  [] ? call_cpuidle+0x2a/0x40 [ 
> >> 135.461355]  [] ? cpu_startup_entry+0x28d/0x360 [ 
> >> 135.461360]  [] ? start_secondary+0x114/0x140 [ 
> >> 135.461365] rcu_preempt kthread starved for 60004 jiffies! g2105 c2104
> > f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
> > 
> > And yes, it looks like you are seeing the same bug that I am tracing.
> > 
> > The kthread is blocked on a schedule_timeout_interruptible().  Given
> > default configuration, this would have a three-jiffy timeout.
> > 
> > You set CONFIG_RCU_CPU_STALL_TIMEOUT=60, which matches the 60004 jiffies
> > above.  Is that value due to a distro setting or something?  Mainline
> > uses CONFIG_RCU_CPU_STALL_TIMEOUT=21.
> 
> Indeed ... this value originated from a Fedora configuration. 

OK.  Setting it shorter might (or might not) make it reproduce more
quickly.  This can be set at boot time via rcupdate.rcu_cpu_stall_timeout.
Or at compile time via CONFIG_RCU_CPU_STALL_TIMEOUT.

> >> [  135.463965] rcu_preempt S 88017844fd68 0 7  2
> >> 0x [  135.463969]  88017844fd68 88017dd8cc80
> >> 880177ff 880178443b80 [  135.463973]  88017845
> >> 88017844fda0 88017dd8cc80 88017dd8cc80 [  135.463977] 
> >> 0003 88017844fd80 81ab031f 000100031504 [ 
> >> 135.463981] Call Trace: [  135.463986]  []
> >> schedule+0x3f/0xa0 [  135.463989]  []
> >> schedule_timeout+0x127/0x270 [  135.463993]  [] ?
> >> detach_if_pending+0x120/0x120 [  135.463997]  []
> >> rcu_gp_kthread+0x6bd/0xa30 [  135.464000]  [] ?
> >> wake_atomic_t_function+0x70/0x70 [  135.464003]  [] ?
> >> force_qs_rnp+0x1b0/0x1b0 [  135.464006]  []
> >> kthread+0xe6/0x100 [  135.464009]  [] ?
> >> kthread_worker_fn+0x190/0x190 [  135.464012]  []
> >> ret_from_fork+0x3f/0x70 [  135.464015]  [] ?
> >> kthread_worker_fn+0x190/0x190
> > 
> > How long does it take to reproduce this?  If it reproduces in minutes
> > or hours, could you please boot with the following on the kernel command
> > line and dump the trace buffer shortly after the stall?
> > 
> > ftrace 
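
For reference, the two ways of shortening the stall timeout discussed above 
look roughly like this (21 is the mainline default mentioned in the thread; 
the exact value is otherwise arbitrary):

	# boot time: append to the kernel command line
	rcupdate.rcu_cpu_stall_timeout=21

	# build time: in the kernel .config
	CONFIG_RCU_CPU_STALL_TIMEOUT=21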

RE: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-22 Thread Chatre, Reinette
Hi Paul,

On 2016-03-22, Paul E. McKenney wrote:
> On Tue, Mar 22, 2016 at 04:35:32PM +, Chatre, Reinette wrote:
>> On 2016-03-21, Paul E. McKenney wrote:
>>> On Mon, Mar 21, 2016 at 09:22:30AM -0700, Jacob Pan wrote:
 On Fri, 18 Mar 2016 16:56:41 -0700
 "Paul E. McKenney"  wrote:
> On Fri, Mar 18, 2016 at 02:00:11PM -0700, Josh Triplett wrote:
>> On Thu, Feb 25, 2016 at 04:56:38PM -0800, Paul E. McKenney wrote:
>>> 
>>> [ . . . ]
>>> 
>> We're seeing a similar stall (~60 seconds) on an x86 development
>> system here.  Any luck tracking down the cause of this?  If not, any
>> suggestions for traces that might be helpful?
> 
> The dmesg containing the stall, the kernel version, and the .config
> would be helpful!  Working on a torture test specific to this bug...
> 
> And thank you for the .config.  Your kernel version looks to be 4.5.0.
> 
 +Reinette, she has the system that can reproduce the issue. I
 believe she is having some other problems with it at the moment. But
 the .config should be available. Version is v4.5.
>>> 
>>> A couple of additional questions:
>>> 
>>> 1.  Is the test running on bare metal or virtualized?  If the
>>> latter, what is the host?
>> 
>> Bare metal.
> 
> OK, you are ahead of me.  Mine is virtualized.
> 
>>> 2.  Does the workload involve CPU hotplug?
>> 
>> No.
> 
> Again, you are ahead of me.  Mine makes extremely heavy use of CPU hotplug.
> 
>>> 3.  Are you seeing things like this in dmesg?
>>> 
>>> "rcu_preempt kthread starved for 21033 jiffies"
>>> "rcu_sched kthread starved for 32103 jiffies"
>>> "rcu_bh kthread starved for 84031 jiffies"
>>> 
>>> If not, you are probably facing some other bug, and should
>>> proceed debugging as described in Documentation/RCU/stallwarn.txt.
>> 
>> Below is a sample of what I see as captured with v4.5. The kernel
>> configuration is attached.
>> 
>> [  135.456197] INFO: rcu_preempt detected stalls on CPUs/tasks: [ 
>> 135.457729]  3-...: (0 ticks this GP) idle=722/0/0 softirq=5532/5532
>> fqs=0 [  135.459604]  (detected by 2, t=60004 jiffies, g=2105, c=2104,
>> q=165) [  135.461318] Task dump for CPU 3: [  135.461321] swapper/3
>>   R  running task0 0  1 0x0020 [  135.461325] 
>> 0078560040e5 88017846fed0 818af2cc 8801 [ 
>> 135.461330]  00060003 88017847 880072f32200
>> 822dcec0 [  135.461334]  88017846c000 88017846c000
>> 88017846fee0 818af517 [  135.461338] Call Trace: [ 
>> 135.461345]  [] ? cpuidle_enter_state+0xfc/0x310 [ 
>> 135.461349]  [] ? cpuidle_enter+0x17/0x20 [ 
>> 135.461353]  [] ? call_cpuidle+0x2a/0x40 [ 
>> 135.461355]  [] ? cpu_startup_entry+0x28d/0x360 [ 
>> 135.461360]  [] ? start_secondary+0x114/0x140 [ 
>> 135.461365] rcu_preempt kthread starved for 60004 jiffies! g2105 c2104
> f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
> 
> And yes, it looks like you are seeing the same bug that I am tracing.
> 
> The kthread is blocked on a schedule_timeout_interruptible().  Given
> default configuration, this would have a three-jiffy timeout.
> 
> You set CONFIG_RCU_CPU_STALL_TIMEOUT=60, which matches the 60004 jiffies
> above.  Is that value due to a distro setting or something?  Mainline
> uses CONFIG_RCU_CPU_STALL_TIMEOUT=21.

Indeed ... this value originated from a Fedora configuration. 

>> [  135.463965] rcu_preempt S 88017844fd68 0 7  2
>> 0x [  135.463969]  88017844fd68 88017dd8cc80
>> 880177ff 880178443b80 [  135.463973]  88017845
>> 88017844fda0 88017dd8cc80 88017dd8cc80 [  135.463977] 
>> 0003 88017844fd80 81ab031f 000100031504 [ 
>> 135.463981] Call Trace: [  135.463986]  []
>> schedule+0x3f/0xa0 [  135.463989]  []
>> schedule_timeout+0x127/0x270 [  135.463993]  [] ?
>> detach_if_pending+0x120/0x120 [  135.463997]  []
>> rcu_gp_kthread+0x6bd/0xa30 [  135.464000]  [] ?
>> wake_atomic_t_function+0x70/0x70 [  135.464003]  [] ?
>> force_qs_rnp+0x1b0/0x1b0 [  135.464006]  []
>> kthread+0xe6/0x100 [  135.464009]  [] ?
>> kthread_worker_fn+0x190/0x190 [  135.464012]  []
>> ret_from_fork+0x3f/0x70 [  135.464015]  [] ?
>> kthread_worker_fn+0x190/0x190
> 
> How long does it take to reproduce this?  If it reproduces in minutes
> or hours, could you please boot with the following on the kernel command
> line and dump the trace buffer shortly after the stall?
> 
> ftrace trace_event=sched_waking,sched_wakeup,sched_wake_idle_without_ipi

The trace I provided above appeared after a few minutes and not again. On 
previous occasions I had to wait a few hours. I tried running with the above 
added to the kernel command line but I have not seen the trace yet. I will 
leave the system overnight but then may risk not capturing the data you need so 
...

> If dumping manually shortly after the stall is at all non-trivial
> (for example, if your 

Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-22 Thread Paul E. McKenney
On Tue, Mar 22, 2016 at 04:35:32PM +, Chatre, Reinette wrote:
> Hi Paul,

Hello, Reinette!

> On 2016-03-21, Paul E. McKenney wrote:
> > On Mon, Mar 21, 2016 at 09:22:30AM -0700, Jacob Pan wrote:
> >> On Fri, 18 Mar 2016 16:56:41 -0700
> >> "Paul E. McKenney"  wrote:
> >>> On Fri, Mar 18, 2016 at 02:00:11PM -0700, Josh Triplett wrote:
>  On Thu, Feb 25, 2016 at 04:56:38PM -0800, Paul E. McKenney wrote:
> > 
> > [ . . . ]
> > 
>  We're seeing a similar stall (~60 seconds) on an x86 development
>  system here.  Any luck tracking down the cause of this?  If not, any
>  suggestions for traces that might be helpful?
> >>> 
> >>> The dmesg containing the stall, the kernel version, and the .config
> >>> would be helpful!  Working on a torture test specific to this bug...

And thank you for the .config.  Your kernel version looks to be 4.5.0.

> >> +Reinette, she has the system that can reproduce the issue. I
> >> believe she is having some other problems with it at the moment. But
> >> the .config should be available. Version is v4.5.
> > 
> > A couple of additional questions:
> > 
> > 1.  Is the test running on bare metal or virtualized?  If the
> > latter, what is the host?
> 
> Bare metal.

OK, you are ahead of me.  Mine is virtualized.

> > 2.  Does the workload involve CPU hotplug?
> 
> No.

Again, you are ahead of me.  Mine makes extremely heavy use of CPU hotplug.

> > 3.  Are you seeing things like this in dmesg?
> > 
> > "rcu_preempt kthread starved for 21033 jiffies"
> > "rcu_sched kthread starved for 32103 jiffies"
> > "rcu_bh kthread starved for 84031 jiffies"
> > 
> > If not, you are probably facing some other bug, and should
> > proceed debugging as described in Documentation/RCU/stallwarn.txt.
> 
> Below is a sample of what I see as captured with v4.5. The kernel 
> configuration is attached.
> 
> [  135.456197] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  135.457729]  3-...: (0 ticks this GP) idle=722/0/0 softirq=5532/5532 fqs=0 
> [  135.459604]  (detected by 2, t=60004 jiffies, g=2105, c=2104, q=165)
> [  135.461318] Task dump for CPU 3:
> [  135.461321] swapper/3   R  running task0 0  1 
> 0x0020
> [  135.461325]  0078560040e5 88017846fed0 818af2cc 
> 8801
> [  135.461330]  00060003 88017847 880072f32200 
> 822dcec0
> [  135.461334]  88017846c000 88017846c000 88017846fee0 
> 818af517
> [  135.461338] Call Trace:
> [  135.461345]  [] ? cpuidle_enter_state+0xfc/0x310
> [  135.461349]  [] ? cpuidle_enter+0x17/0x20
> [  135.461353]  [] ? call_cpuidle+0x2a/0x40
> [  135.461355]  [] ? cpu_startup_entry+0x28d/0x360
> [  135.461360]  [] ? start_secondary+0x114/0x140
> [  135.461365] rcu_preempt kthread starved for 60004 jiffies! g2105 c2104 
> f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1

And yes, it looks like you are seeing the same bug that I am tracing.

The kthread is blocked on a schedule_timeout_interruptible().  Given
default configuration, this would have a three-jiffy timeout.

You set CONFIG_RCU_CPU_STALL_TIMEOUT=60, which matches the 60004 jiffies
above.  Is that value due to a distro setting or something?  Mainline
uses CONFIG_RCU_CPU_STALL_TIMEOUT=21.

> [  135.463965] rcu_preempt S 88017844fd68 0 7  2 
> 0x
> [  135.463969]  88017844fd68 88017dd8cc80 880177ff 
> 880178443b80
> [  135.463973]  88017845 88017844fda0 88017dd8cc80 
> 88017dd8cc80
> [  135.463977]  0003 88017844fd80 81ab031f 
> 000100031504
> [  135.463981] Call Trace:
> [  135.463986]  [] schedule+0x3f/0xa0
> [  135.463989]  [] schedule_timeout+0x127/0x270
> [  135.463993]  [] ? detach_if_pending+0x120/0x120
> [  135.463997]  [] rcu_gp_kthread+0x6bd/0xa30
> [  135.464000]  [] ? wake_atomic_t_function+0x70/0x70
> [  135.464003]  [] ? force_qs_rnp+0x1b0/0x1b0
> [  135.464006]  [] kthread+0xe6/0x100
> [  135.464009]  [] ? kthread_worker_fn+0x190/0x190
> [  135.464012]  [] ret_from_fork+0x3f/0x70
> [  135.464015]  [] ? kthread_worker_fn+0x190/0x190

How long does it take to reproduce this?  If it reproduces in minutes
or hours, could you please boot with the following on the kernel command
line and dump the trace buffer shortly after the stall?

ftrace trace_event=sched_waking,sched_wakeup,sched_wake_idle_without_ipi

If dumping manually shortly after the stall is at all non-trivial
(for example, if your reproduction time is many minutes or hours),
I can supply some patches that automate this.  Or you can pick
them up from -rcu:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git

Branch rcu/dev has these patches (and much else besides).

Thanx, Paul

PS:  In case you are curious, when I enable those tracepoints, it
 shows me that the timer is firing every three jiffies, as it
 should, 
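
One workable way to pick up the automated-dump patches from the -rcu tree 
referenced above is sketched here; the remote name is arbitrary, the two 
commit IDs are the ones quoted earlier in the thread, and the cherry-picks 
may need minor adjustment on a v4.5 tree:

	git remote add paulmck \
	    git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
	git fetch paulmck
	# either test the whole development branch ...
	git checkout paulmck/rcu/dev
	# ... or cherry-pick just the two debug commits onto the tree under test
	git cherry-pick 2dc92e2a86b9 c3fd2095d015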

Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-21 Thread Paul E. McKenney
On Mon, Mar 21, 2016 at 09:22:30AM -0700, Jacob Pan wrote:
> On Fri, 18 Mar 2016 16:56:41 -0700
> "Paul E. McKenney"  wrote:
> > On Fri, Mar 18, 2016 at 02:00:11PM -0700, Josh Triplett wrote:
> > > On Thu, Feb 25, 2016 at 04:56:38PM -0800, Paul E. McKenney wrote:

[ . . . ]

> > > We're seeing a similar stall (~60 seconds) on an x86 development
> > > system here.  Any luck tracking down the cause of this?  If not, any
> > > suggestions for traces that might be helpful?
> > 
> > The dmesg containing the stall, the kernel version, and the .config
> > would be helpful!  Working on a torture test specific to this bug...
> > 
> > Thanx, Paul
> > 
> +Reinette, she has the system that can reproduce the issue. I
> believe she is having some other problems with it at the moment. But
> the .config should be available. Version is v4.5.

A couple of additional questions:

1.  Is the test running on bare metal or virtualized?  If the
latter, what is the host?

2.  Does the workload involve CPU hotplug?

3.  Are you seeing things like this in dmesg?

"rcu_preempt kthread starved for 21033 jiffies"
"rcu_sched kthread starved for 32103 jiffies"
"rcu_bh kthread starved for 84031 jiffies"

If not, you are probably facing some other bug, and should
proceed debugging as described in Documentation/RCU/stallwarn.txt.

Thanx, Paul
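
A quick way to answer question 3 on an affected machine is to grep the log 
for the starvation messages quoted above, for example:

	dmesg | grep -E 'rcu_(preempt|sched|bh) kthread starved for'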



Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-19 Thread Josh Triplett
On Thu, Feb 25, 2016 at 04:56:38PM -0800, Paul E. McKenney wrote:
> On Thu, Feb 25, 2016 at 04:13:11PM +1100, Ross Green wrote:
> > On Wed, Feb 24, 2016 at 8:28 AM, Ross Green  wrote:
> > > On Wed, Feb 24, 2016 at 7:55 AM, Paul E. McKenney
> > >  wrote:
> 
> [ . . . ]
> 
> > >> Still working on getting decent traces...
> 
> And I might have succeeded, see below.
> 
> > >>
> > >> Thanx, Paul
> > >>
> > >
> > > G'day all,
> > >
> > > Here is another dmesg output for 4.5-rc5 showing another rcu_preempt 
> > > stall.
> > > This one appeared after only a day of running. CONFIG_DEBUG_TIMING is
> > > turned on, but can't see any output that shows from this.
> > >
> > > Again testing as before,
> > >
> > > Boot, run a series of small benchmarks, then just let the system be
> > > and idle away.
> > >
> > > I notice in the stack trace there is mention of hrtimer_run_queues and
> > > hrtimer_interrupt.
> > >
> > > Anyway, leave this for a few more eyes to look at.
> > >
> > > Open to any other suggestions of things to test.
> > >
> > > Regards,
> > >
> > > Ross Green
> > 
> > 
> > G'day Paul,
> > 
> > I left the pandaboard running and captured another stall.
> > 
> > the attachment is the dmesg output.
> > 
> > Again there is no apparent output from any CONFIG_DEBUG_TIMING so I
> > assume there is nothing happening there.
> 
> I agree, looks like this is not due to time skew.
> 
> > I just saw the updates for 4.6 RCU code.
> > Is the patch in [PATCH tip/core/rcu 04/13] valid here?
> 
> I doubt that it will help, but you never know.
> 
> > do you want me try the new patch set with this configuration?
> 
> Even better would be to try Daniel Wagner's swait patchset.  I have
> attached them in UNIX mbox format, or you can get them from the
> -tip tree.
> 
> And I -finally- got some tracing that -might- be useful.  The dmesg, all
> 67MB of it, is here:
> 
>   http://www.rdrop.com/~paulmck/submission/console.2016.02.23a.log
> 
> This failure mode is less likely to happen, and looks a bit different
> than the ones that I was seeing before enabling tracing.  Then, an
> additional wakeup would actually wake the task up.  In contrast, with
> tracing enabled, the RCU grace-period kthread goes into "teenager mode",
> refusing to wake up despite repeated attempts.  However, this might
> be a side-effect of the ftrace dump.
> 
> On line 525,132, we see that the rcu_preempt grace-period kthread has
> been starved for 1,188,154 jiffies, or about 20 minutes.  This seems
> unlikely...  The kthread is waiting for no more than a three-jiffy
> timeout ("RCU_GP_WAIT_FQS(3)") and is in TASK_INTERRUPTIBLE state
> ("0x1").

We're seeing a similar stall (~60 seconds) on an x86 development system
here.  Any luck tracking down the cause of this?  If not, any
suggestions for traces that might be helpful?

- Josh Triplett
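
For context, the "RCU_GP_WAIT_FQS(3)" state and the three-jiffy timeout
mentioned above come from the grace-period kthread's force-quiescent-state
wait.  A heavily condensed sketch of that loop, only approximating the
4.5-era kernel/rcu/tree.c (real identifiers, most details elided), is:

        /*
         * Condensed sketch, not the verbatim source: the kthread sleeps
         * TASK_INTERRUPTIBLE for at most a few jiffies (three, per the
         * report quoted above), so "starved for 1,188,154 jiffies" means
         * neither the timeout nor any wakeup got it back onto a runqueue.
         */
        for (;;) {
                rsp->gp_state = RCU_GP_WAIT_FQS;
                ret = wait_event_interruptible_timeout(rsp->gp_wq,
                                rcu_gp_fqs_check_wake(rsp, &gf), j);
                rsp->gp_state = RCU_GP_DOING_FQS;
                /* ... force quiescent states, check for grace-period end ... */
                j = jiffies_till_next_fqs;
        }

(At HZ=1000, 1,188,154 jiffies is roughly 1,188 seconds, just under 20
minutes, consistent with the figure quoted above.)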


Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-18 Thread Paul E. McKenney
On Fri, Mar 18, 2016 at 02:00:11PM -0700, Josh Triplett wrote:
> On Thu, Feb 25, 2016 at 04:56:38PM -0800, Paul E. McKenney wrote:
> > On Thu, Feb 25, 2016 at 04:13:11PM +1100, Ross Green wrote:
> > > On Wed, Feb 24, 2016 at 8:28 AM, Ross Green  wrote:
> > > > On Wed, Feb 24, 2016 at 7:55 AM, Paul E. McKenney
> > > >  wrote:
> > 
> > [ . . . ]
> > 
> > > >> Still working on getting decent traces...
> > 
> > And I might have succeeded, see below.
> > 
> > > >>
> > > >> Thanx, Paul
> > > >>
> > > >
> > > > G'day all,
> > > >
> > > > Here is another dmesg output for 4.5-rc5 showing another rcu_preempt 
> > > > stall.
> > > > This one appeared after only a day of running. CONFIG_DEBUG_TIMING is
> > > > turned on, but can't see any output that shows from this.
> > > >
> > > > Again testing as before,
> > > >
> > > > Boot, run a series of small benchmarks, then just let the system be
> > > > and idle away.
> > > >
> > > > I notice in the stack trace there is mention of hrtimer_run_queues and
> > > > hrtimer_interrupt.
> > > >
> > > > Anyway, leave this for a few more eyes to look at.
> > > >
> > > > Open to any other suggestions of things to test.
> > > >
> > > > Regards,
> > > >
> > > > Ross Green
> > > 
> > > 
> > > G'day Paul,
> > > 
> > > I left the pandaboard running and captured another stall.
> > > 
> > > the attachment is the dmesg output.
> > > 
> > > Again there is no apparent output from any CONFIG_DEBUG_TIMING so I
> > > assume there is nothing happening there.
> > 
> > I agree, looks like this is not due to time skew.
> > 
> > > I just saw the updates for 4.6 RCU code.
> > > Is the patch in [PATCH tip/core/rcu 04/13] valid here?
> > 
> > I doubt that it will help, but you never know.
> > 
> > > do you want me try the new patch set with this configuration?
> > 
> > Even better would be to try Daniel Wagner's swait patchset.  I have
> > attached them in UNIX mbox format, or you can get them from the
> > -tip tree.
> > 
> > And I -finally- got some tracing that -might- be useful.  The dmesg, all
> > 67MB of it, is here:
> > 
> > http://www.rdrop.com/~paulmck/submission/console.2016.02.23a.log
> > 
> > This failure mode is less likely to happen, and looks a bit different
> > than the ones that I was seeing before enabling tracing.  Then, an
> > additional wakeup would actually wake the task up.  In contrast, with
> > tracing enabled, the RCU grace-period kthread goes into "teenager mode",
> > refusing to wake up despite repeated attempts.  However, this might
> > be a side-effect of the ftrace dump.
> > 
> > On line 525,132, we see that the rcu_preempt grace-period kthread has
> > been starved for 1,188,154 jiffies, or about 20 minutes.  This seems
> > unlikely...  The kthread is waiting for no more than a three-jiffy
> > timeout ("RCU_GP_WAIT_FQS(3)") and is in TASK_INTERRUPTIBLE state
> > ("0x1").
> 
> We're seeing a similar stall (~60 seconds) on an x86 development system
> here.  Any luck tracking down the cause of this?  If not, any
> suggestions for traces that might be helpful?

The dmesg containing the stall, the kernel version, and the .config would
be helpful!  Working on a torture test specific to this bug...

Thanx, Paul



Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

2016-03-04 Thread Paul E. McKenney
On Fri, Mar 04, 2016 at 04:30:12PM +1100, Ross Green wrote:
> On Fri, Feb 26, 2016 at 12:35 PM, Paul E. McKenney 
>  wrote:

[ . . . ]

> >> OK, so what wakeup path omits the sched_wakeup event?
> >>
> >> The sched_waking event looks to occur once in try_to_wake_up() and
> >> once in try_to_wake_up_local().  Starting with try_to_wake_up():
> >>
> >> o If the task is ->on_rq, ttwu_remote() is invoked:
> >>
> >>   o   This acquires the runqueue lock, then if
> >>   task_on_rq_queued() invokes ttwu_do_wakeup().  This
> >>   unconditionally does sched_wakeup, so we didn't go that
> >>   way.  (And this path skips the bulk of try_to_wake_up()
> >>   on return.)
> >>
> >>   o   Otherwise, we release the runqueue lock and return zero.
> >>
> >> o There is some ordering checking, runqueue selection, and then
> >>   p->state is set to TASK_WAKING.  And we apparently are not getting
> >>   here, either.  But I don't see any other way out.
> >>
> >>   Ignoring this for the moment...
> >>
> >>   We eventually reach the call to ttwu_queue().
> >>
> >>   o   Here the TTWU_QUEUE path seems to avoid doing a
> >>   sched_wakeup event -- and we are trying to wake
> >>   CPU 0 from CPU 4, so they don't share cache (x86).
> >>
> >>   o   This invokes ttwu_queue_remote(), which sends an IPI
> >>   unless polling is in effect.  I would need to enable
> >>   trace_sched_wake_idle_without_ipi() to see whether or
> >>   not the IPI was actually sent.
> >>
> >>   If the target CPU was offline, we should have seen the
> >>   cpu_is_offline() WARN_ON().  I suppose that the CPU might
> >>   go offline between the check and the ->send_IPI_mask(),
> >>   but only once.  And we are trying to wakeup on CPU 0
> >>   quite a few times.
> >>
> >>   Any thoughts on what to look for?
> >>
> >> Next, try_to_wake_up_local():
> >>
> >> o After doing several checks, it does the sched_waking event.
> >>
> >> o If the task is already queued, it calls ttwu_activate().
> >>
> >> o It then invokes ttwu_do_wakeup(), which unconditionally
> >>   does the sched_wakeup() event.
> >>
> >>   So this path looks unlikely, even ignoring the fact that
> >>   the waking CPU in the traces above is always different than
> >>   the CPU to be awakened on.
> >>
> >> Any thoughts?
> >>
> >>   Thanx, Paul
> G'day,
> 
> 
> Here is a series of rcu_preempt stall events (5) from the linux-4.5-rc6 release.
> 
> Again the same testing procedure: boot, run a series of brief benchmarks and
> then leave idle.
> The first stall event appeared quite quickly - within hours, the rest
> at what appears to be random intervals after that.
> 
> 
> I thought I might give Daniel's patch set a try and see how that goes!

Looks like the same issue from dmesg.

For my part, I added more tracing, which seems to have further decreased
the probability of occurrence.  The sched_wake_idle_without_ipi event
did not appear.

My next step is to try writing a torture test focused specifically on
this issue.  We need a faster reproducer to make decent progress.

Thanx, Paul
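
To make the path discussed above easier to follow, here is a condensed
sketch of the 4.5-era ttwu_queue() (an approximation of the CONFIG_SMP
case, not the verbatim source).  The sched_waking tracepoint fires early
in try_to_wake_up(), but sched_wakeup is only emitted from
ttwu_do_wakeup(); on the TTWU_QUEUE branch below, that happens on the
target CPU, after it processes the queued-wakeup IPI:

        static void ttwu_queue(struct task_struct *p, int cpu)
        {
                struct rq *rq = cpu_rq(cpu);

                if (sched_feat(TTWU_QUEUE) &&
                    !cpus_share_cache(smp_processor_id(), cpu)) {
                        sched_clock_cpu(cpu);      /* sync clocks across CPUs */
                        ttwu_queue_remote(p, cpu); /* IPI the target; sched_wakeup
                                                      fires there, from the IPI path */
                        return;
                }

                raw_spin_lock(&rq->lock);
                ttwu_do_activate(rq, p, 0);        /* emits sched_wakeup locally */
                raw_spin_unlock(&rq->lock);
        }

So a trace that shows sched_waking but never sched_wakeup for a cross-cache
wakeup is consistent with the queued IPI either not being sent or not being
processed by scheduler_ipi()/sched_ttwu_pending() on the target CPU, which
is one way the grace-period kthread could sit in TASK_INTERRUPTIBLE
indefinitely.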


