Re: rcu_prempt stalls / lockup

2014-04-04 Thread Paul E. McKenney
On Thu, Apr 03, 2014 at 06:37:00PM -0400, Dave Jones wrote:
> On Thu, Apr 03, 2014 at 05:44:30PM -0400, Dave Jones wrote:
>  > On Thu, Apr 03, 2014 at 01:46:56PM -0700, Paul E. McKenney wrote:
>  >  > git merge anti-fixes
>  >  > 
>  >  > Then bisect through these reverts.
>  >  > 
>  >  > I am assuming, perhaps naively, that changes under Documentation and
>  >  > to torture testing should not be affecting you.
>  >  > 
>  >  > Does this make sense?
>  > 
>  > I tried reverting every one of those and could still hit it, which
>  > seems to indicate the bug is elsewhere, and rcu is just a victim of it.
>  > I'm going to start trying to revert the lockdep changes, given this seems
>  > to happen after that has warned..
> 
> And that was also without any success.
> Tomorrow I'll try a bisect on just arch/x86.

I cannot think of any better debugging strategy at the moment.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-04 Thread Paul E. McKenney
On Thu, Apr 03, 2014 at 06:37:00PM -0400, Dave Jones wrote:
 On Thu, Apr 03, 2014 at 05:44:30PM -0400, Dave Jones wrote:
   On Thu, Apr 03, 2014 at 01:46:56PM -0700, Paul E. McKenney wrote:
 git merge anti-fixes
 
 Then bisect through these reverts.
 
 I am assuming, perhaps naively, that changes under Documentation and
 to torture testing should not be affecting you.
 
 Does this make sense?
   
   I tried reverting every one of those and could still hit it, which
   seems to indicate the bug is elsewhere, and rcu is just a victim of it.
   I'm going to start trying to revert the lockdep changes, given this seems
   to happen after that has warned..
 
 And that was also without any success.
 Tomorrow I'll try a bisect on just arch/x86.

I cannot think of any better debugging strategy at the moment.

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-03 Thread Dave Jones
On Thu, Apr 03, 2014 at 05:44:30PM -0400, Dave Jones wrote:
 > On Thu, Apr 03, 2014 at 01:46:56PM -0700, Paul E. McKenney wrote:
 >  >   git merge anti-fixes
 >  > 
 >  > Then bisect through these reverts.
 >  > 
 >  > I am assuming, perhaps naively, that changes under Documentation and
 >  > to torture testing should not be affecting you.
 >  > 
 >  > Does this make sense?
 > 
 > I tried reverting every one of those and could still hit it, which
 > seems to indicate the bug is elsewhere, and rcu is just a victim of it.
 > I'm going to start trying to revert the lockdep changes, given this seems
 > to happen after that has warned..

And that was also without any success.
Tomorrow I'll try a bisect on just arch/x86.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-03 Thread Dave Jones
On Thu, Apr 03, 2014 at 01:46:56PM -0700, Paul E. McKenney wrote:
 > So, how about reverting each commit in the RCU series, and then bisecting
 > through the reverts?  Something like the following:
 > 
 >  wherever=linus/master # or substitute whatever point you wish.
 >  # Create a revert branch for the rcu.2014.02.26a branch
 >  git checkout -b anti-fixes $wherever
 >  git revert 5cb5c6e18f822b19bd41a2c0f9930c82b3ec0bc9
 >  git revert 7a754743185a4b05818e10058fa2fbe4e6969085
 >  git revert 8857563b819b140aa8c9be920cfe44d5d3f808b7
 >  git revert add1f0995454374d90c9d6b2c420d2fba3d0a4e3
 >  git revert ae1670339c95c3ff96ab10582506cf827c5fecc8
 >  git revert 52e2bb958ac4f9b3c4bdd78606d279852fd72922
 >  git revert 88c1863066ccfa456797e12c5d8b4631aa1ad0d0
 >  git revert 0adab9b9aa18d7e90337d43567f1eec3d5401b81
 >  git revert 41f4abd92a34f9c5110bbb870c04f8854604e28d
 >  git revert cb1e78cfa267453bb19e7edafd214c03834b664c
 >  git revert 87de1cfdc55b16b794e245b07322340725149d62
 >  git revert 3660c2813fb6d0ba48ee44bcbf9feddf7218c11d
 >  # Create a revert branch for the rt.2014.02.17b branch
 >  git checkout -b anti-rt $wherever
 >  git revert f1f399d1281ea339a08469f7e58193624992f620
 >  git revert ffa83fb565fbc397cbafb4b71fd1cce276d4c3b6
 >  git revert 2f33b512a5460578f6cf11d7b7867bed53157c7c
 >  git merge anti-fixes
 > 
 > Then bisect through these reverts.
 > 
 > I am assuming, perhaps naively, that changes under Documentation and
 > to torture testing should not be affecting you.
 > 
 > Does this make sense?

I tried reverting every one of those and could still hit it, which
seems to indicate the bug is elsewhere, and rcu is just a victim of it.
I'm going to start trying to revert the lockdep changes, given this seems
to happen after that has warned..

Dave 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-03 Thread Paul E. McKenney
On Thu, Apr 03, 2014 at 04:01:43PM -0400, Dave Jones wrote:
> On Wed, Apr 02, 2014 at 06:48:40PM -0400, Dave Jones wrote:
> 
>  >  > > >  > Waiting uninterruptibly.  Presumably blocked on mutex_lock().  
> But
>  >  > > >  > you have CONFIG_PROVE_LOCKING(), so any deadlocks should have 
> been
>  >  > > >  > reported.
>  >  > > > 
>  >  > > > Lockdep had reported something a little earlier (timestamped at 
> 1108.xx)
>  >  > > > but that's a known false-positive in xfs.
>  >  > > 
>  >  > > Yep, I would be very surprised if that was related to the 
> grace-period hang.
>  >  > 
>  >  > Ah, but it could be suppressing later lockdep splats.  So if this can be
>  >  > reproduced without xfs, we might get additional information from 
> lockdep.
>  > 
>  > Hrmph.
>  > 
>  > $ git bisect bad
>  > The merge base 5cb480f6b488128140c940abff3c36f524a334a8 is bad.
>  > This means the bug has been fixed between 
> 5cb480f6b488128140c940abff3c36f524a334a8 and 
> [455c6fdbd219161bd09b1165f11699d6d73de11c 
> 62c206bd514600d4d73751ade00dca8e488390a3 
> e086481baf9d0436bdd6e9b739bfa4a83fb89ef5].
>  > 
>  > Not sure where to go from here..
>  > 
>  > The 'good' news is I can reproduce it pretty reliably now.
>  > I start my fuzz tester, and immediately do a git diff in my working tree,
>  > and then boom..
> 
> Even better, now I realise I don't even need my fuzzer in the mix. Just doing
> a fair amount of disk io (like a git diff on a dirty tree) will trigger it.
> 
> I've tried adding a show_state() call when the stall happens, but another 
> stall
> seems to occur before it gets a chance to even dump everything over the 
> usb-serial console.
> And of course nothing ever makes it to disk, even though I can sysrq-sync, on 
> the
> next reboot systemd has stuffed a bunch of ^@ in the log where the interesting
> stuff should be.
> 
> Any other ideas ?

This bug does seem to be doing an effective job of defending itself,
doesn't it?  I guess producing additional debug output just isn't going
to cut it in this case.

So, how about reverting each commit in the RCU series, and then bisecting
through the reverts?  Something like the following:

wherever=linus/master # or substitute whatever point you wish.
# Create a revert branch for the rcu.2014.02.26a branch
git checkout -b anti-fixes $wherever
git revert 5cb5c6e18f822b19bd41a2c0f9930c82b3ec0bc9
git revert 7a754743185a4b05818e10058fa2fbe4e6969085
git revert 8857563b819b140aa8c9be920cfe44d5d3f808b7
git revert add1f0995454374d90c9d6b2c420d2fba3d0a4e3
git revert ae1670339c95c3ff96ab10582506cf827c5fecc8
git revert 52e2bb958ac4f9b3c4bdd78606d279852fd72922
git revert 88c1863066ccfa456797e12c5d8b4631aa1ad0d0
git revert 0adab9b9aa18d7e90337d43567f1eec3d5401b81
git revert 41f4abd92a34f9c5110bbb870c04f8854604e28d
git revert cb1e78cfa267453bb19e7edafd214c03834b664c
git revert 87de1cfdc55b16b794e245b07322340725149d62
git revert 3660c2813fb6d0ba48ee44bcbf9feddf7218c11d
# Create a revert branch for the rt.2014.02.17b branch
git checkout -b anti-rt $wherever
git revert f1f399d1281ea339a08469f7e58193624992f620
git revert ffa83fb565fbc397cbafb4b71fd1cce276d4c3b6
git revert 2f33b512a5460578f6cf11d7b7867bed53157c7c
git merge anti-fixes

Then bisect through these reverts.

I am assuming, perhaps naively, that changes under Documentation and
to torture testing should not be affecting you.

Does this make sense?

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-03 Thread Dave Jones
On Wed, Apr 02, 2014 at 06:48:40PM -0400, Dave Jones wrote:
 
 >  > > >  > Waiting uninterruptibly.  Presumably blocked on mutex_lock().  But
 >  > > >  > you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
 >  > > >  > reported.
 >  > > > 
 >  > > > Lockdep had reported something a little earlier (timestamped at 
 > 1108.xx)
 >  > > > but that's a known false-positive in xfs.
 >  > > 
 >  > > Yep, I would be very surprised if that was related to the grace-period 
 > hang.
 >  > 
 >  > Ah, but it could be suppressing later lockdep splats.  So if this can be
 >  > reproduced without xfs, we might get additional information from lockdep.
 > 
 > Hrmph.
 > 
 > $ git bisect bad
 > The merge base 5cb480f6b488128140c940abff3c36f524a334a8 is bad.
 > This means the bug has been fixed between 
 > 5cb480f6b488128140c940abff3c36f524a334a8 and 
 > [455c6fdbd219161bd09b1165f11699d6d73de11c 
 > 62c206bd514600d4d73751ade00dca8e488390a3 
 > e086481baf9d0436bdd6e9b739bfa4a83fb89ef5].
 > 
 > Not sure where to go from here..
 > 
 > The 'good' news is I can reproduce it pretty reliably now.
 > I start my fuzz tester, and immediately do a git diff in my working tree,
 > and then boom..

Even better, now I realise I don't even need my fuzzer in the mix. Just doing
a fair amount of disk io (like a git diff on a dirty tree) will trigger it.

I've tried adding a show_state() call when the stall happens, but another stall
seems to occur before it gets a chance to even dump everything over the 
usb-serial console.
And of course nothing ever makes it to disk, even though I can sysrq-sync, on 
the
next reboot systemd has stuffed a bunch of ^@ in the log where the interesting
stuff should be.

Any other ideas ?

Dave


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-03 Thread Dave Jones
On Wed, Apr 02, 2014 at 06:48:40PM -0400, Dave Jones wrote:
 
Waiting uninterruptibly.  Presumably blocked on mutex_lock().  But
you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
reported.
  
  Lockdep had reported something a little earlier (timestamped at 
  1108.xx)
  but that's a known false-positive in xfs.
 
 Yep, I would be very surprised if that was related to the grace-period 
  hang.

Ah, but it could be suppressing later lockdep splats.  So if this can be
reproduced without xfs, we might get additional information from lockdep.
  
  Hrmph.
  
  $ git bisect bad
  The merge base 5cb480f6b488128140c940abff3c36f524a334a8 is bad.
  This means the bug has been fixed between 
  5cb480f6b488128140c940abff3c36f524a334a8 and 
  [455c6fdbd219161bd09b1165f11699d6d73de11c 
  62c206bd514600d4d73751ade00dca8e488390a3 
  e086481baf9d0436bdd6e9b739bfa4a83fb89ef5].
  
  Not sure where to go from here..
  
  The 'good' news is I can reproduce it pretty reliably now.
  I start my fuzz tester, and immediately do a git diff in my working tree,
  and then boom..

Even better, now I realise I don't even need my fuzzer in the mix. Just doing
a fair amount of disk io (like a git diff on a dirty tree) will trigger it.

I've tried adding a show_state() call when the stall happens, but another stall
seems to occur before it gets a chance to even dump everything over the 
usb-serial console.
And of course nothing ever makes it to disk, even though I can sysrq-sync, on 
the
next reboot systemd has stuffed a bunch of ^@ in the log where the interesting
stuff should be.

Any other ideas ?

Dave


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-03 Thread Paul E. McKenney
On Thu, Apr 03, 2014 at 04:01:43PM -0400, Dave Jones wrote:
 On Wed, Apr 02, 2014 at 06:48:40PM -0400, Dave Jones wrote:
 
 Waiting uninterruptibly.  Presumably blocked on mutex_lock().  
 But
 you have CONFIG_PROVE_LOCKING(), so any deadlocks should have 
 been
 reported.
   
   Lockdep had reported something a little earlier (timestamped at 
 1108.xx)
   but that's a known false-positive in xfs.
  
  Yep, I would be very surprised if that was related to the 
 grace-period hang.
 
 Ah, but it could be suppressing later lockdep splats.  So if this can be
 reproduced without xfs, we might get additional information from 
 lockdep.
   
   Hrmph.
   
   $ git bisect bad
   The merge base 5cb480f6b488128140c940abff3c36f524a334a8 is bad.
   This means the bug has been fixed between 
 5cb480f6b488128140c940abff3c36f524a334a8 and 
 [455c6fdbd219161bd09b1165f11699d6d73de11c 
 62c206bd514600d4d73751ade00dca8e488390a3 
 e086481baf9d0436bdd6e9b739bfa4a83fb89ef5].
   
   Not sure where to go from here..
   
   The 'good' news is I can reproduce it pretty reliably now.
   I start my fuzz tester, and immediately do a git diff in my working tree,
   and then boom..
 
 Even better, now I realise I don't even need my fuzzer in the mix. Just doing
 a fair amount of disk io (like a git diff on a dirty tree) will trigger it.
 
 I've tried adding a show_state() call when the stall happens, but another 
 stall
 seems to occur before it gets a chance to even dump everything over the 
 usb-serial console.
 And of course nothing ever makes it to disk, even though I can sysrq-sync, on 
 the
 next reboot systemd has stuffed a bunch of ^@ in the log where the interesting
 stuff should be.
 
 Any other ideas ?

This bug does seem to be doing an effective job of defending itself,
doesn't it?  I guess producing additional debug output just isn't going
to cut it in this case.

So, how about reverting each commit in the RCU series, and then bisecting
through the reverts?  Something like the following:

wherever=linus/master # or substitute whatever point you wish.
# Create a revert branch for the rcu.2014.02.26a branch
git checkout -b anti-fixes $wherever
git revert 5cb5c6e18f822b19bd41a2c0f9930c82b3ec0bc9
git revert 7a754743185a4b05818e10058fa2fbe4e6969085
git revert 8857563b819b140aa8c9be920cfe44d5d3f808b7
git revert add1f0995454374d90c9d6b2c420d2fba3d0a4e3
git revert ae1670339c95c3ff96ab10582506cf827c5fecc8
git revert 52e2bb958ac4f9b3c4bdd78606d279852fd72922
git revert 88c1863066ccfa456797e12c5d8b4631aa1ad0d0
git revert 0adab9b9aa18d7e90337d43567f1eec3d5401b81
git revert 41f4abd92a34f9c5110bbb870c04f8854604e28d
git revert cb1e78cfa267453bb19e7edafd214c03834b664c
git revert 87de1cfdc55b16b794e245b07322340725149d62
git revert 3660c2813fb6d0ba48ee44bcbf9feddf7218c11d
# Create a revert branch for the rt.2014.02.17b branch
git checkout -b anti-rt $wherever
git revert f1f399d1281ea339a08469f7e58193624992f620
git revert ffa83fb565fbc397cbafb4b71fd1cce276d4c3b6
git revert 2f33b512a5460578f6cf11d7b7867bed53157c7c
git merge anti-fixes

Then bisect through these reverts.

I am assuming, perhaps naively, that changes under Documentation and
to torture testing should not be affecting you.

Does this make sense?

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-03 Thread Dave Jones
On Thu, Apr 03, 2014 at 01:46:56PM -0700, Paul E. McKenney wrote:
  So, how about reverting each commit in the RCU series, and then bisecting
  through the reverts?  Something like the following:
  
   wherever=linus/master # or substitute whatever point you wish.
   # Create a revert branch for the rcu.2014.02.26a branch
   git checkout -b anti-fixes $wherever
   git revert 5cb5c6e18f822b19bd41a2c0f9930c82b3ec0bc9
   git revert 7a754743185a4b05818e10058fa2fbe4e6969085
   git revert 8857563b819b140aa8c9be920cfe44d5d3f808b7
   git revert add1f0995454374d90c9d6b2c420d2fba3d0a4e3
   git revert ae1670339c95c3ff96ab10582506cf827c5fecc8
   git revert 52e2bb958ac4f9b3c4bdd78606d279852fd72922
   git revert 88c1863066ccfa456797e12c5d8b4631aa1ad0d0
   git revert 0adab9b9aa18d7e90337d43567f1eec3d5401b81
   git revert 41f4abd92a34f9c5110bbb870c04f8854604e28d
   git revert cb1e78cfa267453bb19e7edafd214c03834b664c
   git revert 87de1cfdc55b16b794e245b07322340725149d62
   git revert 3660c2813fb6d0ba48ee44bcbf9feddf7218c11d
   # Create a revert branch for the rt.2014.02.17b branch
   git checkout -b anti-rt $wherever
   git revert f1f399d1281ea339a08469f7e58193624992f620
   git revert ffa83fb565fbc397cbafb4b71fd1cce276d4c3b6
   git revert 2f33b512a5460578f6cf11d7b7867bed53157c7c
   git merge anti-fixes
  
  Then bisect through these reverts.
  
  I am assuming, perhaps naively, that changes under Documentation and
  to torture testing should not be affecting you.
  
  Does this make sense?

I tried reverting every one of those and could still hit it, which
seems to indicate the bug is elsewhere, and rcu is just a victim of it.
I'm going to start trying to revert the lockdep changes, given this seems
to happen after that has warned..

Dave 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-03 Thread Dave Jones
On Thu, Apr 03, 2014 at 05:44:30PM -0400, Dave Jones wrote:
  On Thu, Apr 03, 2014 at 01:46:56PM -0700, Paul E. McKenney wrote:
  git merge anti-fixes

Then bisect through these reverts.

I am assuming, perhaps naively, that changes under Documentation and
to torture testing should not be affecting you.

Does this make sense?
  
  I tried reverting every one of those and could still hit it, which
  seems to indicate the bug is elsewhere, and rcu is just a victim of it.
  I'm going to start trying to revert the lockdep changes, given this seems
  to happen after that has warned..

And that was also without any success.
Tomorrow I'll try a bisect on just arch/x86.

Dave

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-02 Thread Dave Jones
On Wed, Apr 02, 2014 at 09:20:43AM -0700, Paul E. McKenney wrote:
 > On Tue, Apr 01, 2014 at 11:32:45AM -0700, Paul E. McKenney wrote:
 > > On Tue, Apr 01, 2014 at 02:04:14PM -0400, Dave Jones wrote:
 > > > On Tue, Apr 01, 2014 at 10:55:45AM -0700, Paul E. McKenney wrote:
 > 
 > [ . . . ]
 > 
 > > >  > Waiting uninterruptibly.  Presumably blocked on mutex_lock().  But
 > > >  > you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
 > > >  > reported.
 > > > 
 > > > Lockdep had reported something a little earlier (timestamped at 
 > > > 1108.xx)
 > > > but that's a known false-positive in xfs.
 > > 
 > > Yep, I would be very surprised if that was related to the grace-period 
 > > hang.
 > 
 > Ah, but it could be suppressing later lockdep splats.  So if this can be
 > reproduced without xfs, we might get additional information from lockdep.

Hrmph.

$ git bisect bad
The merge base 5cb480f6b488128140c940abff3c36f524a334a8 is bad.
This means the bug has been fixed between 
5cb480f6b488128140c940abff3c36f524a334a8 and 
[455c6fdbd219161bd09b1165f11699d6d73de11c 
62c206bd514600d4d73751ade00dca8e488390a3 
e086481baf9d0436bdd6e9b739bfa4a83fb89ef5].

Not sure where to go from here..

The 'good' news is I can reproduce it pretty reliably now.
I start my fuzz tester, and immediately do a git diff in my working tree,
and then boom..

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-02 Thread Dave Jones
On Wed, Apr 02, 2014 at 09:20:43AM -0700, Paul E. McKenney wrote:
 > On Tue, Apr 01, 2014 at 11:32:45AM -0700, Paul E. McKenney wrote:
 > > On Tue, Apr 01, 2014 at 02:04:14PM -0400, Dave Jones wrote:
 > > > On Tue, Apr 01, 2014 at 10:55:45AM -0700, Paul E. McKenney wrote:
 > 
 > [ . . . ]
 > 
 > > >  > Waiting uninterruptibly.  Presumably blocked on mutex_lock().  But
 > > >  > you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
 > > >  > reported.
 > > > 
 > > > Lockdep had reported something a little earlier (timestamped at 
 > > > 1108.xx)
 > > > but that's a known false-positive in xfs.
 > > 
 > > Yep, I would be very surprised if that was related to the grace-period 
 > > hang.
 > 
 > Ah, but it could be suppressing later lockdep splats.  So if this can be
 > reproduced without xfs, we might get additional information from lockdep.

Hmm, I lack another test machine, and don't really feel like reinstalling it 
right now.
I'm halfway through a bisect, hopefully that's fruitful.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-02 Thread Paul E. McKenney
On Tue, Apr 01, 2014 at 11:32:45AM -0700, Paul E. McKenney wrote:
> On Tue, Apr 01, 2014 at 02:04:14PM -0400, Dave Jones wrote:
> > On Tue, Apr 01, 2014 at 10:55:45AM -0700, Paul E. McKenney wrote:

[ . . . ]

> >  > Waiting uninterruptibly.  Presumably blocked on mutex_lock().  But
> >  > you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
> >  > reported.
> > 
> > Lockdep had reported something a little earlier (timestamped at 1108.xx)
> > but that's a known false-positive in xfs.
> 
> Yep, I would be very surprised if that was related to the grace-period hang.

Ah, but it could be suppressing later lockdep splats.  So if this can be
reproduced without xfs, we might get additional information from lockdep.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-02 Thread Paul E. McKenney
On Tue, Apr 01, 2014 at 11:32:45AM -0700, Paul E. McKenney wrote:
 On Tue, Apr 01, 2014 at 02:04:14PM -0400, Dave Jones wrote:
  On Tue, Apr 01, 2014 at 10:55:45AM -0700, Paul E. McKenney wrote:

[ . . . ]

Waiting uninterruptibly.  Presumably blocked on mutex_lock().  But
you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
reported.
  
  Lockdep had reported something a little earlier (timestamped at 1108.xx)
  but that's a known false-positive in xfs.
 
 Yep, I would be very surprised if that was related to the grace-period hang.

Ah, but it could be suppressing later lockdep splats.  So if this can be
reproduced without xfs, we might get additional information from lockdep.

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-02 Thread Dave Jones
On Wed, Apr 02, 2014 at 09:20:43AM -0700, Paul E. McKenney wrote:
  On Tue, Apr 01, 2014 at 11:32:45AM -0700, Paul E. McKenney wrote:
   On Tue, Apr 01, 2014 at 02:04:14PM -0400, Dave Jones wrote:
On Tue, Apr 01, 2014 at 10:55:45AM -0700, Paul E. McKenney wrote:
  
  [ . . . ]
  
  Waiting uninterruptibly.  Presumably blocked on mutex_lock().  But
  you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
  reported.

Lockdep had reported something a little earlier (timestamped at 
1108.xx)
but that's a known false-positive in xfs.
   
   Yep, I would be very surprised if that was related to the grace-period 
   hang.
  
  Ah, but it could be suppressing later lockdep splats.  So if this can be
  reproduced without xfs, we might get additional information from lockdep.

Hmm, I lack another test machine, and don't really feel like reinstalling it 
right now.
I'm halfway through a bisect, hopefully that's fruitful.

Dave

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-02 Thread Dave Jones
On Wed, Apr 02, 2014 at 09:20:43AM -0700, Paul E. McKenney wrote:
  On Tue, Apr 01, 2014 at 11:32:45AM -0700, Paul E. McKenney wrote:
   On Tue, Apr 01, 2014 at 02:04:14PM -0400, Dave Jones wrote:
On Tue, Apr 01, 2014 at 10:55:45AM -0700, Paul E. McKenney wrote:
  
  [ . . . ]
  
  Waiting uninterruptibly.  Presumably blocked on mutex_lock().  But
  you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
  reported.

Lockdep had reported something a little earlier (timestamped at 
1108.xx)
but that's a known false-positive in xfs.
   
   Yep, I would be very surprised if that was related to the grace-period 
   hang.
  
  Ah, but it could be suppressing later lockdep splats.  So if this can be
  reproduced without xfs, we might get additional information from lockdep.

Hrmph.

$ git bisect bad
The merge base 5cb480f6b488128140c940abff3c36f524a334a8 is bad.
This means the bug has been fixed between 
5cb480f6b488128140c940abff3c36f524a334a8 and 
[455c6fdbd219161bd09b1165f11699d6d73de11c 
62c206bd514600d4d73751ade00dca8e488390a3 
e086481baf9d0436bdd6e9b739bfa4a83fb89ef5].

Not sure where to go from here..

The 'good' news is I can reproduce it pretty reliably now.
I start my fuzz tester, and immediately do a git diff in my working tree,
and then boom..

Dave

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-01 Thread Paul E. McKenney
On Tue, Apr 01, 2014 at 07:31:30PM -0400, Dave Jones wrote:
> On Tue, Apr 01, 2014 at 04:18:22PM -0700, Paul E. McKenney wrote:
> 
>  > > # entries-in-buffer/entries-written: 7/7   #P:4
>  > > #
>  > > #  _-=> irqs-off
>  > > # / _=> need-resched
>  > > #| / _---=> hardirq/softirq
>  > > #|| / _--=> preempt-depth
>  > > #||| / delay
>  > > #   TASK-PID   CPU#  TIMESTAMP  FUNCTION
>  > > #  | |   |      | |
>  > >   -0 [000] dNs3   851.748475: rcu_grace_period: 
> rcu_preempt 21342 cpuend
>  > >   -0 [000] dNs3   851.748477: rcu_grace_period: 
> rcu_preempt 21343 cpustart
>  > >   -0 [000] dNs3   851.748477: rcu_grace_period: 
> rcu_sched 20080 cpuend
>  > >   -0 [000] dN.2   851.748480: rcu_grace_period: 
> rcu_preempt 21343 cpuqs
>  > >  rcu_preempt-9 [000] ...1   851.748485: rcu_grace_period: 
> rcu_preempt 21343 fqsstart
>  > >  rcu_preempt-9 [000] ...1   851.748487: rcu_grace_period: 
> rcu_preempt 21343 fqsend
>  > >  rcu_preempt-9 [000] ...1   851.748487: rcu_grace_period: 
> rcu_preempt 21343 fqswait
>  > > # tracer: nop
>  > > #
>  > > # entries-in-buffer/entries-written: 13/13   #P:4
>  > > #
>  > > #  _-=> irqs-off
>  > > # / _=> need-resched
>  > > #| / _---=> hardirq/softirq
>  > > #|| / _--=> preempt-depth
>  > > #||| / delay
>  > > #   TASK-PID   CPU#  TIMESTAMP  FUNCTION
>  > > #  | |   |      | |
>  > >  rcu_preempt-9 [003] d..2   851.878457: rcu_grace_period: 
> rcu_preempt 21349 end
>  > >  rcu_preempt-9 [003] d..2   851.878459: rcu_grace_period: 
> rcu_preempt 21349 newreq
>  > >  rcu_preempt-9 [003] ...1   851.878459: rcu_grace_period: 
> rcu_preempt 21349 reqwait
>  > >  rcu_preempt-9 [003] d..2   851.878460: rcu_grace_period: 
> rcu_preempt 21350 start
>  > >  rcu_preempt-9 [003] d..2   851.878461: rcu_grace_period: 
> rcu_preempt 21350 cpustart
>  > >  rcu_preempt-9 [003] d..2   851.878462: rcu_grace_period_init: 
> rcu_preempt 21350 0 0 7 f
>  > >  rcu_preempt-9 [003] ...1   851.878463: rcu_grace_period: 
> rcu_preempt 21350 fqswait
>  > >  rcu_preempt-9 [003] d..2   851.878464: rcu_grace_period: 
> rcu_preempt 21350 cpuqs
>  > >   -0 [000] dNs3   851.878482: rcu_grace_period: 
> rcu_preempt 21349 cpuend
>  > >   -0 [000] dNs3   851.878483: rcu_grace_period: 
> rcu_preempt 21350 cpustart
>  > >   -0 [000] dNs3   851.878484: rcu_grace_period: 
> rcu_sched 20086 cpuend
>  > >   -0 [000] .N.2   851.878486: rcu_grace_period: 
> rcu_sched 20086 cpuqs
>  > >   -0 [000] dN.2   851.878487: rcu_grace_period: 
> rcu_preempt 21350 cpuqs

OK, so the trace ended on grace period #21350...

>  > > 
>  > > 
>  > > followed by a half dozen 'empty' traces before it totally locked up.
>  > 
>  > Cool, thank you!
>  > 
>  > Could you please also send the RCU CPU stall warning messages?
> 
> >From that run they were..
> 
> [  917.790265] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  917.791189](detected by 0, t=6502 jiffies, g=21390, c=21389, q=0)

But the hang didn't happen until grace period #21390.  Sigh!  That means
that the trace doesn't tell me about the current grace period.

Thank you for getting this info, but it is beginning to look like
bisection is necessary here.  :-(

Thanx, Paul

> [  917.792058] INFO: Stall ended before state dump start, gp_kthread state: 
> 0x2
> [  917.792799] INFO: rcu_sched detected stalls on CPUs/tasks:
> [  917.793534](detected by 0, t=6502 jiffies, g=20128, c=20127, q=0)
> [  917.794265] INFO: Stall ended before state dump start, gp_kthread state: 
> 0x2
> [ 1112.727480] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1112.728425](detected by 0, t=26007 jiffies, g=21390, c=21389, q=0)
> [ 1112.729300] INFO: Stall ended before state dump start, gp_kthread state: 
> 0x2
> [ 1112.730035] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 1112.730744](detected by 0, t=26007 jiffies, g=20128, c=20127, q=0)
> [ 1112.731459] INFO: Stall ended before state dump start, gp_kthread state: 
> 0x2
> [ 1307.664710] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1307.665623](detected by 0, t=45512 jiffies, g=21390, c=21389, q=0)
> [ 1307.666386] INFO: Stall ended before state dump start, gp_kthread state: 
> 0x2
> [ 1307.667113] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 1307.667839](detected by 0, t=45512 jiffies, g=20128, c=20127, q=0)
> [ 1307.668572] INFO: 

Re: rcu_prempt stalls / lockup

2014-04-01 Thread Dave Jones
On Tue, Apr 01, 2014 at 04:57:24PM -0700, Paul E. McKenney wrote:
 
 > > [  917.790265] INFO: rcu_preempt detected stalls on CPUs/tasks:
 > > [  917.791189] (detected by 0, t=6502 jiffies, g=21390, c=21389, q=0)
 > 
 > But the hang didn't happen until grace period #21390.  Sigh!  That means
 > that the trace doesn't tell me about the current grace period.
 > 
 > Thank you for getting this info, but it is beginning to look like
 > bisection is necessary here.  :-(

ok, I was hoping that wouldn't be the case, but let's see what it turns up.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-01 Thread Dave Jones
On Tue, Apr 01, 2014 at 04:18:22PM -0700, Paul E. McKenney wrote:
 
 > > # entries-in-buffer/entries-written: 7/7   #P:4
 > > #
 > > #  _-=> irqs-off
 > > # / _=> need-resched
 > > #| / _---=> hardirq/softirq
 > > #|| / _--=> preempt-depth
 > > #||| / delay
 > > #   TASK-PID   CPU#  TIMESTAMP  FUNCTION
 > > #  | |   |      | |
 > >   -0 [000] dNs3   851.748475: rcu_grace_period: 
 > > rcu_preempt 21342 cpuend
 > >   -0 [000] dNs3   851.748477: rcu_grace_period: 
 > > rcu_preempt 21343 cpustart
 > >   -0 [000] dNs3   851.748477: rcu_grace_period: 
 > > rcu_sched 20080 cpuend
 > >   -0 [000] dN.2   851.748480: rcu_grace_period: 
 > > rcu_preempt 21343 cpuqs
 > >  rcu_preempt-9 [000] ...1   851.748485: rcu_grace_period: 
 > > rcu_preempt 21343 fqsstart
 > >  rcu_preempt-9 [000] ...1   851.748487: rcu_grace_period: 
 > > rcu_preempt 21343 fqsend
 > >  rcu_preempt-9 [000] ...1   851.748487: rcu_grace_period: 
 > > rcu_preempt 21343 fqswait
 > > # tracer: nop
 > > #
 > > # entries-in-buffer/entries-written: 13/13   #P:4
 > > #
 > > #  _-=> irqs-off
 > > # / _=> need-resched
 > > #| / _---=> hardirq/softirq
 > > #|| / _--=> preempt-depth
 > > #||| / delay
 > > #   TASK-PID   CPU#  TIMESTAMP  FUNCTION
 > > #  | |   |      | |
 > >  rcu_preempt-9 [003] d..2   851.878457: rcu_grace_period: 
 > > rcu_preempt 21349 end
 > >  rcu_preempt-9 [003] d..2   851.878459: rcu_grace_period: 
 > > rcu_preempt 21349 newreq
 > >  rcu_preempt-9 [003] ...1   851.878459: rcu_grace_period: 
 > > rcu_preempt 21349 reqwait
 > >  rcu_preempt-9 [003] d..2   851.878460: rcu_grace_period: 
 > > rcu_preempt 21350 start
 > >  rcu_preempt-9 [003] d..2   851.878461: rcu_grace_period: 
 > > rcu_preempt 21350 cpustart
 > >  rcu_preempt-9 [003] d..2   851.878462: rcu_grace_period_init: 
 > > rcu_preempt 21350 0 0 7 f
 > >  rcu_preempt-9 [003] ...1   851.878463: rcu_grace_period: 
 > > rcu_preempt 21350 fqswait
 > >  rcu_preempt-9 [003] d..2   851.878464: rcu_grace_period: 
 > > rcu_preempt 21350 cpuqs
 > >   -0 [000] dNs3   851.878482: rcu_grace_period: 
 > > rcu_preempt 21349 cpuend
 > >   -0 [000] dNs3   851.878483: rcu_grace_period: 
 > > rcu_preempt 21350 cpustart
 > >   -0 [000] dNs3   851.878484: rcu_grace_period: 
 > > rcu_sched 20086 cpuend
 > >   -0 [000] .N.2   851.878486: rcu_grace_period: 
 > > rcu_sched 20086 cpuqs
 > >   -0 [000] dN.2   851.878487: rcu_grace_period: 
 > > rcu_preempt 21350 cpuqs
 > > 
 > > 
 > > followed by a half dozen 'empty' traces before it totally locked up.
 > 
 > Cool, thank you!
 > 
 > Could you please also send the RCU CPU stall warning messages?

>From that run they were..

[  917.790265] INFO: rcu_preempt detected stalls on CPUs/tasks:
[  917.791189]  (detected by 0, t=6502 jiffies, g=21390, c=21389, q=0)
[  917.792058] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[  917.792799] INFO: rcu_sched detected stalls on CPUs/tasks:
[  917.793534]  (detected by 0, t=6502 jiffies, g=20128, c=20127, q=0)
[  917.794265] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1112.727480] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1112.728425]  (detected by 0, t=26007 jiffies, g=21390, c=21389, q=0)
[ 1112.729300] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1112.730035] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1112.730744]  (detected by 0, t=26007 jiffies, g=20128, c=20127, q=0)
[ 1112.731459] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1307.664710] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1307.665623]  (detected by 0, t=45512 jiffies, g=21390, c=21389, q=0)
[ 1307.666386] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1307.667113] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1307.667839]  (detected by 0, t=45512 jiffies, g=20128, c=20127, q=0)
[ 1307.668572] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1502.601961] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1502.602905]  (detected by 0, t=65017 jiffies, g=21390, c=21389, q=0)
[ 1502.603765] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1502.604550] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1502.605305]  (detected by 0, t=65017 jiffies, g=20128, c=20127, q=0)
[ 1502.606061] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1697.539217] INFO: rcu_preempt detected stalls on CPUs/tasks:

Re: rcu_prempt stalls / lockup

2014-04-01 Thread Paul E. McKenney
On Tue, Apr 01, 2014 at 06:16:16PM -0400, Dave Jones wrote:
> On Tue, Apr 01, 2014 at 11:32:45AM -0700, Paul E. McKenney wrote:
> 
>  > >  > Given that you have CONFIG_RCU_TRACE=y, could you please enable the
>  > >  > following trace events and dump the trace before things hang?
>  > >  > 
>  > >  >   trace_event=rcu:rcu_grace_period,rcu:rcu_grace_period_init
>  > >  > 
>  > >  > If it is not feasible to dump the trace before things hang, let me
>  > >  > know, and I will work out some other diagnostic regime.
>  > > 
>  > > I'll give that a shot when I get back in a few hours.
> 
> ok, this is tricky, and I'm not sure how helpful the below is.
> because I don't know when the hang is going to happen, in parallel to my 
> usual workload
> I did this..
> while [ 1 ];
> do
>   cat trace
>   echo > trace
> done
> 
> and got this...
> 
> 
> 
> # entries-in-buffer/entries-written: 7/7   #P:4
> #
> #  _-=> irqs-off
> # / _=> need-resched
> #| / _---=> hardirq/softirq
> #|| / _--=> preempt-depth
> #||| / delay
> #   TASK-PID   CPU#  TIMESTAMP  FUNCTION
> #  | |   |      | |
>   -0 [000] dNs3   851.748475: rcu_grace_period: rcu_preempt 
> 21342 cpuend
>   -0 [000] dNs3   851.748477: rcu_grace_period: rcu_preempt 
> 21343 cpustart
>   -0 [000] dNs3   851.748477: rcu_grace_period: rcu_sched 
> 20080 cpuend
>   -0 [000] dN.2   851.748480: rcu_grace_period: rcu_preempt 
> 21343 cpuqs
>  rcu_preempt-9 [000] ...1   851.748485: rcu_grace_period: rcu_preempt 
> 21343 fqsstart
>  rcu_preempt-9 [000] ...1   851.748487: rcu_grace_period: rcu_preempt 
> 21343 fqsend
>  rcu_preempt-9 [000] ...1   851.748487: rcu_grace_period: rcu_preempt 
> 21343 fqswait
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 13/13   #P:4
> #
> #  _-=> irqs-off
> # / _=> need-resched
> #| / _---=> hardirq/softirq
> #|| / _--=> preempt-depth
> #||| / delay
> #   TASK-PID   CPU#  TIMESTAMP  FUNCTION
> #  | |   |      | |
>  rcu_preempt-9 [003] d..2   851.878457: rcu_grace_period: rcu_preempt 
> 21349 end
>  rcu_preempt-9 [003] d..2   851.878459: rcu_grace_period: rcu_preempt 
> 21349 newreq
>  rcu_preempt-9 [003] ...1   851.878459: rcu_grace_period: rcu_preempt 
> 21349 reqwait
>  rcu_preempt-9 [003] d..2   851.878460: rcu_grace_period: rcu_preempt 
> 21350 start
>  rcu_preempt-9 [003] d..2   851.878461: rcu_grace_period: rcu_preempt 
> 21350 cpustart
>  rcu_preempt-9 [003] d..2   851.878462: rcu_grace_period_init: 
> rcu_preempt 21350 0 0 7 f
>  rcu_preempt-9 [003] ...1   851.878463: rcu_grace_period: rcu_preempt 
> 21350 fqswait
>  rcu_preempt-9 [003] d..2   851.878464: rcu_grace_period: rcu_preempt 
> 21350 cpuqs
>   -0 [000] dNs3   851.878482: rcu_grace_period: rcu_preempt 
> 21349 cpuend
>   -0 [000] dNs3   851.878483: rcu_grace_period: rcu_preempt 
> 21350 cpustart
>   -0 [000] dNs3   851.878484: rcu_grace_period: rcu_sched 
> 20086 cpuend
>   -0 [000] .N.2   851.878486: rcu_grace_period: rcu_sched 
> 20086 cpuqs
>   -0 [000] dN.2   851.878487: rcu_grace_period: rcu_preempt 
> 21350 cpuqs
> 
> 
> followed by a half dozen 'empty' traces before it totally locked up.

Cool, thank you!

Could you please also send the RCU CPU stall warning messages?

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-01 Thread Dave Jones
On Tue, Apr 01, 2014 at 11:32:45AM -0700, Paul E. McKenney wrote:
 
 > >  > Given that you have CONFIG_RCU_TRACE=y, could you please enable the
 > >  > following trace events and dump the trace before things hang?
 > >  > 
 > >  > trace_event=rcu:rcu_grace_period,rcu:rcu_grace_period_init
 > >  > 
 > >  > If it is not feasible to dump the trace before things hang, let me
 > >  > know, and I will work out some other diagnostic regime.
 > > 
 > > I'll give that a shot when I get back in a few hours.

ok, this is tricky, and I'm not sure how helpful the below is.
because I don't know when the hang is going to happen, in parallel to my usual 
workload
I did this..
while [ 1 ];
do
  cat trace
  echo > trace
done

and got this...



# entries-in-buffer/entries-written: 7/7   #P:4
#
#  _-=> irqs-off
# / _=> need-resched
#| / _---=> hardirq/softirq
#|| / _--=> preempt-depth
#||| / delay
#   TASK-PID   CPU#  TIMESTAMP  FUNCTION
#  | |   |      | |
  -0 [000] dNs3   851.748475: rcu_grace_period: rcu_preempt 
21342 cpuend
  -0 [000] dNs3   851.748477: rcu_grace_period: rcu_preempt 
21343 cpustart
  -0 [000] dNs3   851.748477: rcu_grace_period: rcu_sched 
20080 cpuend
  -0 [000] dN.2   851.748480: rcu_grace_period: rcu_preempt 
21343 cpuqs
 rcu_preempt-9 [000] ...1   851.748485: rcu_grace_period: rcu_preempt 
21343 fqsstart
 rcu_preempt-9 [000] ...1   851.748487: rcu_grace_period: rcu_preempt 
21343 fqsend
 rcu_preempt-9 [000] ...1   851.748487: rcu_grace_period: rcu_preempt 
21343 fqswait
# tracer: nop
#
# entries-in-buffer/entries-written: 13/13   #P:4
#
#  _-=> irqs-off
# / _=> need-resched
#| / _---=> hardirq/softirq
#|| / _--=> preempt-depth
#||| / delay
#   TASK-PID   CPU#  TIMESTAMP  FUNCTION
#  | |   |      | |
 rcu_preempt-9 [003] d..2   851.878457: rcu_grace_period: rcu_preempt 
21349 end
 rcu_preempt-9 [003] d..2   851.878459: rcu_grace_period: rcu_preempt 
21349 newreq
 rcu_preempt-9 [003] ...1   851.878459: rcu_grace_period: rcu_preempt 
21349 reqwait
 rcu_preempt-9 [003] d..2   851.878460: rcu_grace_period: rcu_preempt 
21350 start
 rcu_preempt-9 [003] d..2   851.878461: rcu_grace_period: rcu_preempt 
21350 cpustart
 rcu_preempt-9 [003] d..2   851.878462: rcu_grace_period_init: 
rcu_preempt 21350 0 0 7 f
 rcu_preempt-9 [003] ...1   851.878463: rcu_grace_period: rcu_preempt 
21350 fqswait
 rcu_preempt-9 [003] d..2   851.878464: rcu_grace_period: rcu_preempt 
21350 cpuqs
  -0 [000] dNs3   851.878482: rcu_grace_period: rcu_preempt 
21349 cpuend
  -0 [000] dNs3   851.878483: rcu_grace_period: rcu_preempt 
21350 cpustart
  -0 [000] dNs3   851.878484: rcu_grace_period: rcu_sched 
20086 cpuend
  -0 [000] .N.2   851.878486: rcu_grace_period: rcu_sched 
20086 cpuqs
  -0 [000] dN.2   851.878487: rcu_grace_period: rcu_preempt 
21350 cpuqs


followed by a half dozen 'empty' traces before it totally locked up.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-01 Thread Paul E. McKenney
On Tue, Apr 01, 2014 at 02:04:14PM -0400, Dave Jones wrote:
> On Tue, Apr 01, 2014 at 10:55:45AM -0700, Paul E. McKenney wrote:
>  > >  > > so kernel space still works like before, but userspace is locked up.
>  > >  > 
>  > >  > Interesting.  I suspect that if you reverted the rest of this merge
>  > >  > window's RCU patches, you would get the same result.
> 
> Something that occurred to me is that this might be something in the x86 merge
> that's just changing timings enough to expose this problem.
> At some point this evening, I'll try bisecting it if we don't get any closer.

OK.  ;-)

>  > > [ 1953.672735] INFO: Stall ended before state dump start, gp_kthread 
> state: 0x2
>  > > [ 2148.608132] INFO: rcu_preempt detected stalls on CPUs/tasks:
>  > > [ 2148.609140]   (detected by 0, t=104027 jiffies, g=47728, c=47727, q=0)
>  > > etc etc.
>  > 
>  > Waiting uninterruptibly.  Presumably blocked on mutex_lock().  But
>  > you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
>  > reported.
> 
> Lockdep had reported something a little earlier (timestamped at 1108.xx)
> but that's a known false-positive in xfs.

Yep, I would be very surprised if that was related to the grace-period hang.

>  > Given that you have CONFIG_RCU_TRACE=y, could you please enable the
>  > following trace events and dump the trace before things hang?
>  > 
>  >trace_event=rcu:rcu_grace_period,rcu:rcu_grace_period_init
>  > 
>  > If it is not feasible to dump the trace before things hang, let me
>  > know, and I will work out some other diagnostic regime.
> 
> I'll give that a shot when I get back in a few hours.

Cool!

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-01 Thread Dave Jones
On Tue, Apr 01, 2014 at 10:55:45AM -0700, Paul E. McKenney wrote:
 > >  > > so kernel space still works like before, but userspace is locked up.
 > >  > 
 > >  > Interesting.  I suspect that if you reverted the rest of this merge
 > >  > window's RCU patches, you would get the same result.
 
Something that occurred to me is that this might be something in the x86 merge
that's just changing timings enough to expose this problem.
At some point this evening, I'll try bisecting it if we don't get any closer.

 > > [ 1953.672735] INFO: Stall ended before state dump start, gp_kthread 
 > > state: 0x2
 > > [ 2148.608132] INFO: rcu_preempt detected stalls on CPUs/tasks:
 > > [ 2148.609140] (detected by 0, t=104027 jiffies, g=47728, c=47727, q=0)
 > > etc etc.
 > 
 > Waiting uninterruptibly.  Presumably blocked on mutex_lock().  But
 > you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
 > reported.

Lockdep had reported something a little earlier (timestamped at 1108.xx)
but that's a known false-positive in xfs.

 > Given that you have CONFIG_RCU_TRACE=y, could you please enable the
 > following trace events and dump the trace before things hang?
 > 
 >  trace_event=rcu:rcu_grace_period,rcu:rcu_grace_period_init
 > 
 > If it is not feasible to dump the trace before things hang, let me
 > know, and I will work out some other diagnostic regime.

I'll give that a shot when I get back in a few hours.

Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-01 Thread Paul E. McKenney
On Tue, Apr 01, 2014 at 01:22:44PM -0400, Dave Jones wrote:
> On Tue, Apr 01, 2014 at 08:30:32AM -0700, Paul E. McKenney wrote:
>  > > [14403.765300] INFO: rcu_preempt detected stalls on CPUs/tasks:
>  > > [14403.766405]  (detected by 0, t=260067 jiffies, g=449892, c=449891, 
> q=0)
>  > > and on and on..
>  > > 
>  > > so kernel space still works like before, but userspace is locked up.
>  > 
>  > Interesting.  I suspect that if you reverted the rest of this merge
>  > window's RCU patches, you would get the same result.
>  > 
>  > >  > If that doesn't work, I will need to put together some diagnostic 
> patches.
>  > >  > Starting with the one below.
>  > > 
>  > > bugger, I forgot to apply this last night. I'll do it on the next run.
>  > 
>  > But let's see what this shows first.
> 
> [ 1173.923463] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1173.924457](detected by 0, t=6502 jiffies, g=47728, c=47727, q=0)
> [ 1173.925199] INFO: Stall ended before state dump start, gp_kthread state: 
> 0x2
> [ 1368.859744] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1368.860754](detected by 0, t=26007 jiffies, g=47728, c=47727, q=0)
> [ 1368.861559] INFO: Stall ended before state dump start, gp_kthread state: 
> 0x2
> [ 1563.796416] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1563.797426](detected by 0, t=45512 jiffies, g=47728, c=47727, q=0)
> [ 1563.798293] INFO: Stall ended before state dump start, gp_kthread state: 
> 0x2
> [ 1758.733747] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1758.734725](detected by 0, t=65017 jiffies, g=47728, c=47727, q=0)
> [ 1758.735526] INFO: Stall ended before state dump start, gp_kthread state: 
> 0x2
> [ 1953.670944] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1953.671932](detected by 0, t=84522 jiffies, g=47728, c=47727, q=0)
> [ 1953.672735] INFO: Stall ended before state dump start, gp_kthread state: 
> 0x2
> [ 2148.608132] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 2148.609140](detected by 0, t=104027 jiffies, g=47728, c=47727, q=0)
> etc etc.

Waiting uninterruptibly.  Presumably blocked on mutex_lock().  But
you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
reported.

Given that you have CONFIG_RCU_TRACE=y, could you please enable the
following trace events and dump the trace before things hang?

trace_event=rcu:rcu_grace_period,rcu:rcu_grace_period_init

If it is not feasible to dump the trace before things hang, let me
know, and I will work out some other diagnostic regime.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-01 Thread Dave Jones
On Tue, Apr 01, 2014 at 08:30:32AM -0700, Paul E. McKenney wrote:
 > > [14403.765300] INFO: rcu_preempt detected stalls on CPUs/tasks:
 > > [14403.766405]  (detected by 0, t=260067 jiffies, g=449892, c=449891, q=0)
 > > and on and on..
 > > 
 > > so kernel space still works like before, but userspace is locked up.
 > 
 > Interesting.  I suspect that if you reverted the rest of this merge
 > window's RCU patches, you would get the same result.
 > 
 > >  > If that doesn't work, I will need to put together some diagnostic 
 > > patches.
 > >  > Starting with the one below.
 > > 
 > > bugger, I forgot to apply this last night. I'll do it on the next run.
 > 
 > But let's see what this shows first.

[ 1173.923463] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1173.924457]  (detected by 0, t=6502 jiffies, g=47728, c=47727, q=0)
[ 1173.925199] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1368.859744] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1368.860754]  (detected by 0, t=26007 jiffies, g=47728, c=47727, q=0)
[ 1368.861559] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1563.796416] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1563.797426]  (detected by 0, t=45512 jiffies, g=47728, c=47727, q=0)
[ 1563.798293] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1758.733747] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1758.734725]  (detected by 0, t=65017 jiffies, g=47728, c=47727, q=0)
[ 1758.735526] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1953.670944] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1953.671932]  (detected by 0, t=84522 jiffies, g=47728, c=47727, q=0)
[ 1953.672735] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 2148.608132] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2148.609140]  (detected by 0, t=104027 jiffies, g=47728, c=47727, q=0)
etc etc.

Dave
 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-01 Thread Paul E. McKenney
On Tue, Apr 01, 2014 at 11:08:49AM -0400, Dave Jones wrote:
> On Mon, Mar 31, 2014 at 05:48:01PM -0700, Paul E. McKenney wrote:
> 
>  > Alternatively, I suppose a wakeup could be getting lost.  The main change
>  > related to that this merge window was ffa83fb565fb, which eliminated
>  > idle wakeups from RCU in the CONFIG_RCU_NOCB_CPU_ALL=y case.
>  > 
>  > So, could you please try reverting ffa83fb565fb?
> 
> Did a run overnight with that reverted, no luck..
> 
> [11869.580161] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [11869.581245]  (detected by 0, t=6502 jiffies, g=449892, c=449891, q=0)
> [11869.582047] INFO: Stall ended before state dump start
> [12064.517461] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [12064.518539]  (detected by 0, t=26007 jiffies, g=449892, c=449891, q=0)
> [12064.519347] INFO: Stall ended before state dump start
> [12259.454767] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [12259.455818]  (detected by 0, t=45512 jiffies, g=449892, c=449891, q=0)
> [12259.456602] INFO: Stall ended before state dump start
> [12454.392112] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [12454.393131]  (detected by 0, t=65017 jiffies, g=449892, c=449891, q=0)
> [12454.393942] INFO: Stall ended before state dump start
> [12649.329446] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [12649.330446]  (detected by 0, t=84522 jiffies, g=449892, c=449891, q=0)
> [12649.331267] INFO: Stall ended before state dump start
> [12844.266776] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [12844.267695]  (detected by 0, t=104027 jiffies, g=449892, c=449891, q=0)
> [12844.268571] INFO: Stall ended before state dump start
> [13039.204102] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [13039.205119]  (detected by 0, t=123532 jiffies, g=449892, c=449891, q=0)
> [13039.205871] INFO: Stall ended before state dump start
> [13234.141424] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [13234.141424] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [13234.142448]  (detected by 0, t=143037 jiffies, g=449892, c=449891, q=0)
> [13234.143339] INFO: Stall ended before state dump start
> [13429.078730] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [13429.079771]  (detected by 0, t=162542 jiffies, g=449892, c=449891, q=0)
> [13429.080591] INFO: Stall ended before state dump start
> [13624.016058] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [13624.017116]  (detected by 0, t=182047 jiffies, g=449892, c=449891, q=0)
> [13624.018033] INFO: Stall ended before state dump start
> [13818.953370] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [13818.954437]  (detected by 0, t=201552 jiffies, g=449892, c=449891, q=0)
> [13818.955334] INFO: Stall ended before state dump start
> [14013.890682] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [14013.891761]  (detected by 0, t=221057 jiffies, g=449892, c=449891, q=0)
> [14013.892691] INFO: Stall ended before state dump start
> [14208.827993] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [14208.829088]  (detected by 0, t=240562 jiffies, g=449892, c=449891, q=0)
> [14208.829977] INFO: Stall ended before state dump start
> [14403.765300] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [14403.766405]  (detected by 0, t=260067 jiffies, g=449892, c=449891, q=0)
> and on and on..
> 
> so kernel space still works like before, but userspace is locked up.

Interesting.  I suspect that if you reverted the rest of this merge
window's RCU patches, you would get the same result.

>  > If that doesn't work, I will need to put together some diagnostic patches.
>  > Starting with the one below.
> 
> bugger, I forgot to apply this last night. I'll do it on the next run.

But let's see what this shows first.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-01 Thread Dave Jones
On Mon, Mar 31, 2014 at 05:48:01PM -0700, Paul E. McKenney wrote:
 
 > Alternatively, I suppose a wakeup could be getting lost.  The main change
 > related to that this merge window was ffa83fb565fb, which eliminated
 > idle wakeups from RCU in the CONFIG_RCU_NOCB_CPU_ALL=y case.
 > 
 > So, could you please try reverting ffa83fb565fb?

Did a run overnight with that reverted, no luck..

[11869.580161] INFO: rcu_preempt detected stalls on CPUs/tasks:
[11869.581245]  (detected by 0, t=6502 jiffies, g=449892, c=449891, q=0)
[11869.582047] INFO: Stall ended before state dump start
[12064.517461] INFO: rcu_preempt detected stalls on CPUs/tasks:
[12064.518539]  (detected by 0, t=26007 jiffies, g=449892, c=449891, q=0)
[12064.519347] INFO: Stall ended before state dump start
[12259.454767] INFO: rcu_preempt detected stalls on CPUs/tasks:
[12259.455818]  (detected by 0, t=45512 jiffies, g=449892, c=449891, q=0)
[12259.456602] INFO: Stall ended before state dump start
[12454.392112] INFO: rcu_preempt detected stalls on CPUs/tasks:
[12454.393131]  (detected by 0, t=65017 jiffies, g=449892, c=449891, q=0)
[12454.393942] INFO: Stall ended before state dump start
[12649.329446] INFO: rcu_preempt detected stalls on CPUs/tasks:
[12649.330446]  (detected by 0, t=84522 jiffies, g=449892, c=449891, q=0)
[12649.331267] INFO: Stall ended before state dump start
[12844.266776] INFO: rcu_preempt detected stalls on CPUs/tasks:
[12844.267695]  (detected by 0, t=104027 jiffies, g=449892, c=449891, q=0)
[12844.268571] INFO: Stall ended before state dump start
[13039.204102] INFO: rcu_preempt detected stalls on CPUs/tasks:
[13039.205119]  (detected by 0, t=123532 jiffies, g=449892, c=449891, q=0)
[13039.205871] INFO: Stall ended before state dump start
[13234.141424] INFO: rcu_preempt detected stalls on CPUs/tasks:
[13234.141424] INFO: rcu_preempt detected stalls on CPUs/tasks:
[13234.142448]  (detected by 0, t=143037 jiffies, g=449892, c=449891, q=0)
[13234.143339] INFO: Stall ended before state dump start
[13429.078730] INFO: rcu_preempt detected stalls on CPUs/tasks:
[13429.079771]  (detected by 0, t=162542 jiffies, g=449892, c=449891, q=0)
[13429.080591] INFO: Stall ended before state dump start
[13624.016058] INFO: rcu_preempt detected stalls on CPUs/tasks:
[13624.017116]  (detected by 0, t=182047 jiffies, g=449892, c=449891, q=0)
[13624.018033] INFO: Stall ended before state dump start
[13818.953370] INFO: rcu_preempt detected stalls on CPUs/tasks:
[13818.954437]  (detected by 0, t=201552 jiffies, g=449892, c=449891, q=0)
[13818.955334] INFO: Stall ended before state dump start
[14013.890682] INFO: rcu_preempt detected stalls on CPUs/tasks:
[14013.891761]  (detected by 0, t=221057 jiffies, g=449892, c=449891, q=0)
[14013.892691] INFO: Stall ended before state dump start
[14208.827993] INFO: rcu_preempt detected stalls on CPUs/tasks:
[14208.829088]  (detected by 0, t=240562 jiffies, g=449892, c=449891, q=0)
[14208.829977] INFO: Stall ended before state dump start
[14403.765300] INFO: rcu_preempt detected stalls on CPUs/tasks:
[14403.766405]  (detected by 0, t=260067 jiffies, g=449892, c=449891, q=0)
and on and on..

so kernel space still works like before, but userspace is locked up.

 > If that doesn't work, I will need to put together some diagnostic patches.
 > Starting with the one below.

bugger, I forgot to apply this last night. I'll do it on the next run.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-01 Thread Dave Jones
On Mon, Mar 31, 2014 at 05:48:01PM -0700, Paul E. McKenney wrote:
 
  Alternatively, I suppose a wakeup could be getting lost.  The main change
  related to that this merge window was ffa83fb565fb, which eliminated
  idle wakeups from RCU in the CONFIG_RCU_NOCB_CPU_ALL=y case.
  
  So, could you please try reverting ffa83fb565fb?

Did a run overnight with that reverted, no luck..

[11869.580161] INFO: rcu_preempt detected stalls on CPUs/tasks:
[11869.581245]  (detected by 0, t=6502 jiffies, g=449892, c=449891, q=0)
[11869.582047] INFO: Stall ended before state dump start
[12064.517461] INFO: rcu_preempt detected stalls on CPUs/tasks:
[12064.518539]  (detected by 0, t=26007 jiffies, g=449892, c=449891, q=0)
[12064.519347] INFO: Stall ended before state dump start
[12259.454767] INFO: rcu_preempt detected stalls on CPUs/tasks:
[12259.455818]  (detected by 0, t=45512 jiffies, g=449892, c=449891, q=0)
[12259.456602] INFO: Stall ended before state dump start
[12454.392112] INFO: rcu_preempt detected stalls on CPUs/tasks:
[12454.393131]  (detected by 0, t=65017 jiffies, g=449892, c=449891, q=0)
[12454.393942] INFO: Stall ended before state dump start
[12649.329446] INFO: rcu_preempt detected stalls on CPUs/tasks:
[12649.330446]  (detected by 0, t=84522 jiffies, g=449892, c=449891, q=0)
[12649.331267] INFO: Stall ended before state dump start
[12844.266776] INFO: rcu_preempt detected stalls on CPUs/tasks:
[12844.267695]  (detected by 0, t=104027 jiffies, g=449892, c=449891, q=0)
[12844.268571] INFO: Stall ended before state dump start
[13039.204102] INFO: rcu_preempt detected stalls on CPUs/tasks:
[13039.205119]  (detected by 0, t=123532 jiffies, g=449892, c=449891, q=0)
[13039.205871] INFO: Stall ended before state dump start
[13234.141424] INFO: rcu_preempt detected stalls on CPUs/tasks:
[13234.141424] INFO: rcu_preempt detected stalls on CPUs/tasks:
[13234.142448]  (detected by 0, t=143037 jiffies, g=449892, c=449891, q=0)
[13234.143339] INFO: Stall ended before state dump start
[13429.078730] INFO: rcu_preempt detected stalls on CPUs/tasks:
[13429.079771]  (detected by 0, t=162542 jiffies, g=449892, c=449891, q=0)
[13429.080591] INFO: Stall ended before state dump start
[13624.016058] INFO: rcu_preempt detected stalls on CPUs/tasks:
[13624.017116]  (detected by 0, t=182047 jiffies, g=449892, c=449891, q=0)
[13624.018033] INFO: Stall ended before state dump start
[13818.953370] INFO: rcu_preempt detected stalls on CPUs/tasks:
[13818.954437]  (detected by 0, t=201552 jiffies, g=449892, c=449891, q=0)
[13818.955334] INFO: Stall ended before state dump start
[14013.890682] INFO: rcu_preempt detected stalls on CPUs/tasks:
[14013.891761]  (detected by 0, t=221057 jiffies, g=449892, c=449891, q=0)
[14013.892691] INFO: Stall ended before state dump start
[14208.827993] INFO: rcu_preempt detected stalls on CPUs/tasks:
[14208.829088]  (detected by 0, t=240562 jiffies, g=449892, c=449891, q=0)
[14208.829977] INFO: Stall ended before state dump start
[14403.765300] INFO: rcu_preempt detected stalls on CPUs/tasks:
[14403.766405]  (detected by 0, t=260067 jiffies, g=449892, c=449891, q=0)
and on and on..

so kernel space still works like before, but userspace is locked up.

  If that doesn't work, I will need to put together some diagnostic patches.
  Starting with the one below.

bugger, I forgot to apply this last night. I'll do it on the next run.

Dave

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-01 Thread Paul E. McKenney
On Tue, Apr 01, 2014 at 11:08:49AM -0400, Dave Jones wrote:
 On Mon, Mar 31, 2014 at 05:48:01PM -0700, Paul E. McKenney wrote:
 
   Alternatively, I suppose a wakeup could be getting lost.  The main change
   related to that this merge window was ffa83fb565fb, which eliminated
   idle wakeups from RCU in the CONFIG_RCU_NOCB_CPU_ALL=y case.
   
   So, could you please try reverting ffa83fb565fb?
 
 Did a run overnight with that reverted, no luck..
 
 [11869.580161] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [11869.581245]  (detected by 0, t=6502 jiffies, g=449892, c=449891, q=0)
 [11869.582047] INFO: Stall ended before state dump start
 [12064.517461] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [12064.518539]  (detected by 0, t=26007 jiffies, g=449892, c=449891, q=0)
 [12064.519347] INFO: Stall ended before state dump start
 [12259.454767] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [12259.455818]  (detected by 0, t=45512 jiffies, g=449892, c=449891, q=0)
 [12259.456602] INFO: Stall ended before state dump start
 [12454.392112] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [12454.393131]  (detected by 0, t=65017 jiffies, g=449892, c=449891, q=0)
 [12454.393942] INFO: Stall ended before state dump start
 [12649.329446] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [12649.330446]  (detected by 0, t=84522 jiffies, g=449892, c=449891, q=0)
 [12649.331267] INFO: Stall ended before state dump start
 [12844.266776] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [12844.267695]  (detected by 0, t=104027 jiffies, g=449892, c=449891, q=0)
 [12844.268571] INFO: Stall ended before state dump start
 [13039.204102] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [13039.205119]  (detected by 0, t=123532 jiffies, g=449892, c=449891, q=0)
 [13039.205871] INFO: Stall ended before state dump start
 [13234.141424] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [13234.141424] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [13234.142448]  (detected by 0, t=143037 jiffies, g=449892, c=449891, q=0)
 [13234.143339] INFO: Stall ended before state dump start
 [13429.078730] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [13429.079771]  (detected by 0, t=162542 jiffies, g=449892, c=449891, q=0)
 [13429.080591] INFO: Stall ended before state dump start
 [13624.016058] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [13624.017116]  (detected by 0, t=182047 jiffies, g=449892, c=449891, q=0)
 [13624.018033] INFO: Stall ended before state dump start
 [13818.953370] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [13818.954437]  (detected by 0, t=201552 jiffies, g=449892, c=449891, q=0)
 [13818.955334] INFO: Stall ended before state dump start
 [14013.890682] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [14013.891761]  (detected by 0, t=221057 jiffies, g=449892, c=449891, q=0)
 [14013.892691] INFO: Stall ended before state dump start
 [14208.827993] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [14208.829088]  (detected by 0, t=240562 jiffies, g=449892, c=449891, q=0)
 [14208.829977] INFO: Stall ended before state dump start
 [14403.765300] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [14403.766405]  (detected by 0, t=260067 jiffies, g=449892, c=449891, q=0)
 and on and on..
 
 so kernel space still works like before, but userspace is locked up.

Interesting.  I suspect that if you reverted the rest of this merge
window's RCU patches, you would get the same result.

   If that doesn't work, I will need to put together some diagnostic patches.
   Starting with the one below.
 
 bugger, I forgot to apply this last night. I'll do it on the next run.

But let's see what this shows first.

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-01 Thread Dave Jones
On Tue, Apr 01, 2014 at 08:30:32AM -0700, Paul E. McKenney wrote:
   [14403.765300] INFO: rcu_preempt detected stalls on CPUs/tasks:
   [14403.766405]  (detected by 0, t=260067 jiffies, g=449892, c=449891, q=0)
   and on and on..
   
   so kernel space still works like before, but userspace is locked up.
  
  Interesting.  I suspect that if you reverted the rest of this merge
  window's RCU patches, you would get the same result.
  
 If that doesn't work, I will need to put together some diagnostic 
   patches.
 Starting with the one below.
   
   bugger, I forgot to apply this last night. I'll do it on the next run.
  
  But let's see what this shows first.

[ 1173.923463] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1173.924457]  (detected by 0, t=6502 jiffies, g=47728, c=47727, q=0)
[ 1173.925199] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1368.859744] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1368.860754]  (detected by 0, t=26007 jiffies, g=47728, c=47727, q=0)
[ 1368.861559] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1563.796416] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1563.797426]  (detected by 0, t=45512 jiffies, g=47728, c=47727, q=0)
[ 1563.798293] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1758.733747] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1758.734725]  (detected by 0, t=65017 jiffies, g=47728, c=47727, q=0)
[ 1758.735526] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1953.670944] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1953.671932]  (detected by 0, t=84522 jiffies, g=47728, c=47727, q=0)
[ 1953.672735] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 2148.608132] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2148.609140]  (detected by 0, t=104027 jiffies, g=47728, c=47727, q=0)
etc etc.

Dave
 
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-01 Thread Paul E. McKenney
On Tue, Apr 01, 2014 at 01:22:44PM -0400, Dave Jones wrote:
 On Tue, Apr 01, 2014 at 08:30:32AM -0700, Paul E. McKenney wrote:
[14403.765300] INFO: rcu_preempt detected stalls on CPUs/tasks:
[14403.766405]  (detected by 0, t=260067 jiffies, g=449892, c=449891, 
 q=0)
and on and on..

so kernel space still works like before, but userspace is locked up.
   
   Interesting.  I suspect that if you reverted the rest of this merge
   window's RCU patches, you would get the same result.
   
  If that doesn't work, I will need to put together some diagnostic 
 patches.
  Starting with the one below.

bugger, I forgot to apply this last night. I'll do it on the next run.
   
   But let's see what this shows first.
 
 [ 1173.923463] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 1173.924457](detected by 0, t=6502 jiffies, g=47728, c=47727, q=0)
 [ 1173.925199] INFO: Stall ended before state dump start, gp_kthread state: 
 0x2
 [ 1368.859744] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 1368.860754](detected by 0, t=26007 jiffies, g=47728, c=47727, q=0)
 [ 1368.861559] INFO: Stall ended before state dump start, gp_kthread state: 
 0x2
 [ 1563.796416] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 1563.797426](detected by 0, t=45512 jiffies, g=47728, c=47727, q=0)
 [ 1563.798293] INFO: Stall ended before state dump start, gp_kthread state: 
 0x2
 [ 1758.733747] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 1758.734725](detected by 0, t=65017 jiffies, g=47728, c=47727, q=0)
 [ 1758.735526] INFO: Stall ended before state dump start, gp_kthread state: 
 0x2
 [ 1953.670944] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 1953.671932](detected by 0, t=84522 jiffies, g=47728, c=47727, q=0)
 [ 1953.672735] INFO: Stall ended before state dump start, gp_kthread state: 
 0x2
 [ 2148.608132] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 2148.609140](detected by 0, t=104027 jiffies, g=47728, c=47727, q=0)
 etc etc.

Waiting uninterruptibly.  Presumably blocked on mutex_lock().  But
you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
reported.

Given that you have CONFIG_RCU_TRACE=y, could you please enable the
following trace events and dump the trace before things hang?

trace_event=rcu:rcu_grace_period,rcu:rcu_grace_period_init

If it is not feasible to dump the trace before things hang, let me
know, and I will work out some other diagnostic regime.

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-01 Thread Dave Jones
On Tue, Apr 01, 2014 at 10:55:45AM -0700, Paul E. McKenney wrote:
  so kernel space still works like before, but userspace is locked up.
 
 Interesting.  I suspect that if you reverted the rest of this merge
 window's RCU patches, you would get the same result.
 
Something that occurred to me is that this might be something in the x86 merge
that's just changing timings enough to expose this problem.
At some point this evening, I'll try bisecting it if we don't get any closer.

   [ 1953.672735] INFO: Stall ended before state dump start, gp_kthread 
   state: 0x2
   [ 2148.608132] INFO: rcu_preempt detected stalls on CPUs/tasks:
   [ 2148.609140] (detected by 0, t=104027 jiffies, g=47728, c=47727, q=0)
   etc etc.
  
  Waiting uninterruptibly.  Presumably blocked on mutex_lock().  But
  you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
  reported.

Lockdep had reported something a little earlier (timestamped at 1108.xx)
but that's a known false-positive in xfs.

  Given that you have CONFIG_RCU_TRACE=y, could you please enable the
  following trace events and dump the trace before things hang?
  
   trace_event=rcu:rcu_grace_period,rcu:rcu_grace_period_init
  
  If it is not feasible to dump the trace before things hang, let me
  know, and I will work out some other diagnostic regime.

I'll give that a shot when I get back in a few hours.

Dave
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-01 Thread Paul E. McKenney
On Tue, Apr 01, 2014 at 02:04:14PM -0400, Dave Jones wrote:
 On Tue, Apr 01, 2014 at 10:55:45AM -0700, Paul E. McKenney wrote:
   so kernel space still works like before, but userspace is locked up.
  
  Interesting.  I suspect that if you reverted the rest of this merge
  window's RCU patches, you would get the same result.
 
 Something that occurred to me is that this might be something in the x86 merge
 that's just changing timings enough to expose this problem.
 At some point this evening, I'll try bisecting it if we don't get any closer.

OK.  ;-)

[ 1953.672735] INFO: Stall ended before state dump start, gp_kthread 
 state: 0x2
[ 2148.608132] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2148.609140]   (detected by 0, t=104027 jiffies, g=47728, c=47727, q=0)
etc etc.
   
   Waiting uninterruptibly.  Presumably blocked on mutex_lock().  But
   you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
   reported.
 
 Lockdep had reported something a little earlier (timestamped at 1108.xx)
 but that's a known false-positive in xfs.

Yep, I would be very surprised if that was related to the grace-period hang.

   Given that you have CONFIG_RCU_TRACE=y, could you please enable the
   following trace events and dump the trace before things hang?
   
  trace_event=rcu:rcu_grace_period,rcu:rcu_grace_period_init
   
   If it is not feasible to dump the trace before things hang, let me
   know, and I will work out some other diagnostic regime.
 
 I'll give that a shot when I get back in a few hours.

Cool!

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-01 Thread Dave Jones
On Tue, Apr 01, 2014 at 11:32:45AM -0700, Paul E. McKenney wrote:
 
 Given that you have CONFIG_RCU_TRACE=y, could you please enable the
 following trace events and dump the trace before things hang?
 
 trace_event=rcu:rcu_grace_period,rcu:rcu_grace_period_init
 
 If it is not feasible to dump the trace before things hang, let me
 know, and I will work out some other diagnostic regime.
   
   I'll give that a shot when I get back in a few hours.

ok, this is tricky, and I'm not sure how helpful the below is.
because I don't know when the hang is going to happen, in parallel to my usual 
workload
I did this..
while [ 1 ];
do
  cat trace
  echo  trace
done

and got this...



# entries-in-buffer/entries-written: 7/7   #P:4
#
#  _-= irqs-off
# / _= need-resched
#| / _---= hardirq/softirq
#|| / _--= preempt-depth
#||| / delay
#   TASK-PID   CPU#  TIMESTAMP  FUNCTION
#  | |   |      | |
  idle-0 [000] dNs3   851.748475: rcu_grace_period: rcu_preempt 
21342 cpuend
  idle-0 [000] dNs3   851.748477: rcu_grace_period: rcu_preempt 
21343 cpustart
  idle-0 [000] dNs3   851.748477: rcu_grace_period: rcu_sched 
20080 cpuend
  idle-0 [000] dN.2   851.748480: rcu_grace_period: rcu_preempt 
21343 cpuqs
 rcu_preempt-9 [000] ...1   851.748485: rcu_grace_period: rcu_preempt 
21343 fqsstart
 rcu_preempt-9 [000] ...1   851.748487: rcu_grace_period: rcu_preempt 
21343 fqsend
 rcu_preempt-9 [000] ...1   851.748487: rcu_grace_period: rcu_preempt 
21343 fqswait
# tracer: nop
#
# entries-in-buffer/entries-written: 13/13   #P:4
#
#  _-= irqs-off
# / _= need-resched
#| / _---= hardirq/softirq
#|| / _--= preempt-depth
#||| / delay
#   TASK-PID   CPU#  TIMESTAMP  FUNCTION
#  | |   |      | |
 rcu_preempt-9 [003] d..2   851.878457: rcu_grace_period: rcu_preempt 
21349 end
 rcu_preempt-9 [003] d..2   851.878459: rcu_grace_period: rcu_preempt 
21349 newreq
 rcu_preempt-9 [003] ...1   851.878459: rcu_grace_period: rcu_preempt 
21349 reqwait
 rcu_preempt-9 [003] d..2   851.878460: rcu_grace_period: rcu_preempt 
21350 start
 rcu_preempt-9 [003] d..2   851.878461: rcu_grace_period: rcu_preempt 
21350 cpustart
 rcu_preempt-9 [003] d..2   851.878462: rcu_grace_period_init: 
rcu_preempt 21350 0 0 7 f
 rcu_preempt-9 [003] ...1   851.878463: rcu_grace_period: rcu_preempt 
21350 fqswait
 rcu_preempt-9 [003] d..2   851.878464: rcu_grace_period: rcu_preempt 
21350 cpuqs
  idle-0 [000] dNs3   851.878482: rcu_grace_period: rcu_preempt 
21349 cpuend
  idle-0 [000] dNs3   851.878483: rcu_grace_period: rcu_preempt 
21350 cpustart
  idle-0 [000] dNs3   851.878484: rcu_grace_period: rcu_sched 
20086 cpuend
  idle-0 [000] .N.2   851.878486: rcu_grace_period: rcu_sched 
20086 cpuqs
  idle-0 [000] dN.2   851.878487: rcu_grace_period: rcu_preempt 
21350 cpuqs


followed by a half dozen 'empty' traces before it totally locked up.

Dave

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-01 Thread Paul E. McKenney
On Tue, Apr 01, 2014 at 06:16:16PM -0400, Dave Jones wrote:
 On Tue, Apr 01, 2014 at 11:32:45AM -0700, Paul E. McKenney wrote:
 
  Given that you have CONFIG_RCU_TRACE=y, could you please enable the
  following trace events and dump the trace before things hang?
  
trace_event=rcu:rcu_grace_period,rcu:rcu_grace_period_init
  
  If it is not feasible to dump the trace before things hang, let me
  know, and I will work out some other diagnostic regime.

I'll give that a shot when I get back in a few hours.
 
 ok, this is tricky, and I'm not sure how helpful the below is.
 because I don't know when the hang is going to happen, in parallel to my 
 usual workload
 I did this..
 while [ 1 ];
 do
   cat trace
   echo  trace
 done
 
 and got this...
 
 
 
 # entries-in-buffer/entries-written: 7/7   #P:4
 #
 #  _-= irqs-off
 # / _= need-resched
 #| / _---= hardirq/softirq
 #|| / _--= preempt-depth
 #||| / delay
 #   TASK-PID   CPU#  TIMESTAMP  FUNCTION
 #  | |   |      | |
   idle-0 [000] dNs3   851.748475: rcu_grace_period: rcu_preempt 
 21342 cpuend
   idle-0 [000] dNs3   851.748477: rcu_grace_period: rcu_preempt 
 21343 cpustart
   idle-0 [000] dNs3   851.748477: rcu_grace_period: rcu_sched 
 20080 cpuend
   idle-0 [000] dN.2   851.748480: rcu_grace_period: rcu_preempt 
 21343 cpuqs
  rcu_preempt-9 [000] ...1   851.748485: rcu_grace_period: rcu_preempt 
 21343 fqsstart
  rcu_preempt-9 [000] ...1   851.748487: rcu_grace_period: rcu_preempt 
 21343 fqsend
  rcu_preempt-9 [000] ...1   851.748487: rcu_grace_period: rcu_preempt 
 21343 fqswait
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 13/13   #P:4
 #
 #  _-= irqs-off
 # / _= need-resched
 #| / _---= hardirq/softirq
 #|| / _--= preempt-depth
 #||| / delay
 #   TASK-PID   CPU#  TIMESTAMP  FUNCTION
 #  | |   |      | |
  rcu_preempt-9 [003] d..2   851.878457: rcu_grace_period: rcu_preempt 
 21349 end
  rcu_preempt-9 [003] d..2   851.878459: rcu_grace_period: rcu_preempt 
 21349 newreq
  rcu_preempt-9 [003] ...1   851.878459: rcu_grace_period: rcu_preempt 
 21349 reqwait
  rcu_preempt-9 [003] d..2   851.878460: rcu_grace_period: rcu_preempt 
 21350 start
  rcu_preempt-9 [003] d..2   851.878461: rcu_grace_period: rcu_preempt 
 21350 cpustart
  rcu_preempt-9 [003] d..2   851.878462: rcu_grace_period_init: 
 rcu_preempt 21350 0 0 7 f
  rcu_preempt-9 [003] ...1   851.878463: rcu_grace_period: rcu_preempt 
 21350 fqswait
  rcu_preempt-9 [003] d..2   851.878464: rcu_grace_period: rcu_preempt 
 21350 cpuqs
   idle-0 [000] dNs3   851.878482: rcu_grace_period: rcu_preempt 
 21349 cpuend
   idle-0 [000] dNs3   851.878483: rcu_grace_period: rcu_preempt 
 21350 cpustart
   idle-0 [000] dNs3   851.878484: rcu_grace_period: rcu_sched 
 20086 cpuend
   idle-0 [000] .N.2   851.878486: rcu_grace_period: rcu_sched 
 20086 cpuqs
   idle-0 [000] dN.2   851.878487: rcu_grace_period: rcu_preempt 
 21350 cpuqs
 
 
 followed by a half dozen 'empty' traces before it totally locked up.

Cool, thank you!

Could you please also send the RCU CPU stall warning messages?

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-01 Thread Dave Jones
On Tue, Apr 01, 2014 at 04:18:22PM -0700, Paul E. McKenney wrote:
 
   # entries-in-buffer/entries-written: 7/7   #P:4
   #
   #  _-= irqs-off
   # / _= need-resched
   #| / _---= hardirq/softirq
   #|| / _--= preempt-depth
   #||| / delay
   #   TASK-PID   CPU#  TIMESTAMP  FUNCTION
   #  | |   |      | |
 idle-0 [000] dNs3   851.748475: rcu_grace_period: 
   rcu_preempt 21342 cpuend
 idle-0 [000] dNs3   851.748477: rcu_grace_period: 
   rcu_preempt 21343 cpustart
 idle-0 [000] dNs3   851.748477: rcu_grace_period: 
   rcu_sched 20080 cpuend
 idle-0 [000] dN.2   851.748480: rcu_grace_period: 
   rcu_preempt 21343 cpuqs
rcu_preempt-9 [000] ...1   851.748485: rcu_grace_period: 
   rcu_preempt 21343 fqsstart
rcu_preempt-9 [000] ...1   851.748487: rcu_grace_period: 
   rcu_preempt 21343 fqsend
rcu_preempt-9 [000] ...1   851.748487: rcu_grace_period: 
   rcu_preempt 21343 fqswait
   # tracer: nop
   #
   # entries-in-buffer/entries-written: 13/13   #P:4
   #
   #  _-= irqs-off
   # / _= need-resched
   #| / _---= hardirq/softirq
   #|| / _--= preempt-depth
   #||| / delay
   #   TASK-PID   CPU#  TIMESTAMP  FUNCTION
   #  | |   |      | |
rcu_preempt-9 [003] d..2   851.878457: rcu_grace_period: 
   rcu_preempt 21349 end
rcu_preempt-9 [003] d..2   851.878459: rcu_grace_period: 
   rcu_preempt 21349 newreq
rcu_preempt-9 [003] ...1   851.878459: rcu_grace_period: 
   rcu_preempt 21349 reqwait
rcu_preempt-9 [003] d..2   851.878460: rcu_grace_period: 
   rcu_preempt 21350 start
rcu_preempt-9 [003] d..2   851.878461: rcu_grace_period: 
   rcu_preempt 21350 cpustart
rcu_preempt-9 [003] d..2   851.878462: rcu_grace_period_init: 
   rcu_preempt 21350 0 0 7 f
rcu_preempt-9 [003] ...1   851.878463: rcu_grace_period: 
   rcu_preempt 21350 fqswait
rcu_preempt-9 [003] d..2   851.878464: rcu_grace_period: 
   rcu_preempt 21350 cpuqs
 idle-0 [000] dNs3   851.878482: rcu_grace_period: 
   rcu_preempt 21349 cpuend
 idle-0 [000] dNs3   851.878483: rcu_grace_period: 
   rcu_preempt 21350 cpustart
 idle-0 [000] dNs3   851.878484: rcu_grace_period: 
   rcu_sched 20086 cpuend
 idle-0 [000] .N.2   851.878486: rcu_grace_period: 
   rcu_sched 20086 cpuqs
 idle-0 [000] dN.2   851.878487: rcu_grace_period: 
   rcu_preempt 21350 cpuqs
   
   
   followed by a half dozen 'empty' traces before it totally locked up.
  
  Cool, thank you!
  
  Could you please also send the RCU CPU stall warning messages?

From that run they were..

[  917.790265] INFO: rcu_preempt detected stalls on CPUs/tasks:
[  917.791189]  (detected by 0, t=6502 jiffies, g=21390, c=21389, q=0)
[  917.792058] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[  917.792799] INFO: rcu_sched detected stalls on CPUs/tasks:
[  917.793534]  (detected by 0, t=6502 jiffies, g=20128, c=20127, q=0)
[  917.794265] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1112.727480] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1112.728425]  (detected by 0, t=26007 jiffies, g=21390, c=21389, q=0)
[ 1112.729300] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1112.730035] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1112.730744]  (detected by 0, t=26007 jiffies, g=20128, c=20127, q=0)
[ 1112.731459] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1307.664710] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1307.665623]  (detected by 0, t=45512 jiffies, g=21390, c=21389, q=0)
[ 1307.666386] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1307.667113] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1307.667839]  (detected by 0, t=45512 jiffies, g=20128, c=20127, q=0)
[ 1307.668572] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1502.601961] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1502.602905]  (detected by 0, t=65017 jiffies, g=21390, c=21389, q=0)
[ 1502.603765] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1502.604550] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1502.605305]  (detected by 0, t=65017 jiffies, g=20128, c=20127, q=0)
[ 1502.606061] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1697.539217] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1697.540189]  (detected by 0, t=84522 jiffies, g=21390, c=21389, q=0)
[ 1697.541052] INFO: Stall 

Re: rcu_prempt stalls / lockup

2014-04-01 Thread Dave Jones
On Tue, Apr 01, 2014 at 04:57:24PM -0700, Paul E. McKenney wrote:
 
   [  917.790265] INFO: rcu_preempt detected stalls on CPUs/tasks:
   [  917.791189] (detected by 0, t=6502 jiffies, g=21390, c=21389, q=0)
  
  But the hang didn't happen until grace period #21390.  Sigh!  That means
  that the trace doesn't tell me about the current grace period.
  
  Thank you for getting this info, but it is beginning to look like
  bisection is necessary here.  :-(

ok, I was hoping that wouldn't be the case, but let's see what it turns up.

Dave

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-04-01 Thread Paul E. McKenney
On Tue, Apr 01, 2014 at 07:31:30PM -0400, Dave Jones wrote:
 On Tue, Apr 01, 2014 at 04:18:22PM -0700, Paul E. McKenney wrote:
 
# entries-in-buffer/entries-written: 7/7   #P:4
#
#  _-= irqs-off
# / _= need-resched
#| / _---= hardirq/softirq
#|| / _--= preempt-depth
#||| / delay
#   TASK-PID   CPU#  TIMESTAMP  FUNCTION
#  | |   |      | |
  idle-0 [000] dNs3   851.748475: rcu_grace_period: 
 rcu_preempt 21342 cpuend
  idle-0 [000] dNs3   851.748477: rcu_grace_period: 
 rcu_preempt 21343 cpustart
  idle-0 [000] dNs3   851.748477: rcu_grace_period: 
 rcu_sched 20080 cpuend
  idle-0 [000] dN.2   851.748480: rcu_grace_period: 
 rcu_preempt 21343 cpuqs
 rcu_preempt-9 [000] ...1   851.748485: rcu_grace_period: 
 rcu_preempt 21343 fqsstart
 rcu_preempt-9 [000] ...1   851.748487: rcu_grace_period: 
 rcu_preempt 21343 fqsend
 rcu_preempt-9 [000] ...1   851.748487: rcu_grace_period: 
 rcu_preempt 21343 fqswait
# tracer: nop
#
# entries-in-buffer/entries-written: 13/13   #P:4
#
#  _-= irqs-off
# / _= need-resched
#| / _---= hardirq/softirq
#|| / _--= preempt-depth
#||| / delay
#   TASK-PID   CPU#  TIMESTAMP  FUNCTION
#  | |   |      | |
 rcu_preempt-9 [003] d..2   851.878457: rcu_grace_period: 
 rcu_preempt 21349 end
 rcu_preempt-9 [003] d..2   851.878459: rcu_grace_period: 
 rcu_preempt 21349 newreq
 rcu_preempt-9 [003] ...1   851.878459: rcu_grace_period: 
 rcu_preempt 21349 reqwait
 rcu_preempt-9 [003] d..2   851.878460: rcu_grace_period: 
 rcu_preempt 21350 start
 rcu_preempt-9 [003] d..2   851.878461: rcu_grace_period: 
 rcu_preempt 21350 cpustart
 rcu_preempt-9 [003] d..2   851.878462: rcu_grace_period_init: 
 rcu_preempt 21350 0 0 7 f
 rcu_preempt-9 [003] ...1   851.878463: rcu_grace_period: 
 rcu_preempt 21350 fqswait
 rcu_preempt-9 [003] d..2   851.878464: rcu_grace_period: 
 rcu_preempt 21350 cpuqs
  idle-0 [000] dNs3   851.878482: rcu_grace_period: 
 rcu_preempt 21349 cpuend
  idle-0 [000] dNs3   851.878483: rcu_grace_period: 
 rcu_preempt 21350 cpustart
  idle-0 [000] dNs3   851.878484: rcu_grace_period: 
 rcu_sched 20086 cpuend
  idle-0 [000] .N.2   851.878486: rcu_grace_period: 
 rcu_sched 20086 cpuqs
  idle-0 [000] dN.2   851.878487: rcu_grace_period: 
 rcu_preempt 21350 cpuqs

OK, so the trace ended on grace period #21350...



followed by a half dozen 'empty' traces before it totally locked up.
   
   Cool, thank you!
   
   Could you please also send the RCU CPU stall warning messages?
 
 From that run they were..
 
 [  917.790265] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [  917.791189](detected by 0, t=6502 jiffies, g=21390, c=21389, q=0)

But the hang didn't happen until grace period #21390.  Sigh!  That means
that the trace doesn't tell me about the current grace period.

Thank you for getting this info, but it is beginning to look like
bisection is necessary here.  :-(

Thanx, Paul

 [  917.792058] INFO: Stall ended before state dump start, gp_kthread state: 
 0x2
 [  917.792799] INFO: rcu_sched detected stalls on CPUs/tasks:
 [  917.793534](detected by 0, t=6502 jiffies, g=20128, c=20127, q=0)
 [  917.794265] INFO: Stall ended before state dump start, gp_kthread state: 
 0x2
 [ 1112.727480] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 1112.728425](detected by 0, t=26007 jiffies, g=21390, c=21389, q=0)
 [ 1112.729300] INFO: Stall ended before state dump start, gp_kthread state: 
 0x2
 [ 1112.730035] INFO: rcu_sched detected stalls on CPUs/tasks:
 [ 1112.730744](detected by 0, t=26007 jiffies, g=20128, c=20127, q=0)
 [ 1112.731459] INFO: Stall ended before state dump start, gp_kthread state: 
 0x2
 [ 1307.664710] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 1307.665623](detected by 0, t=45512 jiffies, g=21390, c=21389, q=0)
 [ 1307.666386] INFO: Stall ended before state dump start, gp_kthread state: 
 0x2
 [ 1307.667113] INFO: rcu_sched detected stalls on CPUs/tasks:
 [ 1307.667839](detected by 0, t=45512 jiffies, g=20128, c=20127, q=0)
 [ 1307.668572] INFO: Stall ended before state dump start, gp_kthread state: 
 0x2
 [ 1502.601961] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 1502.602905](detected by 

Re: rcu_prempt stalls / lockup

2014-03-31 Thread Paul E. McKenney
On Mon, Mar 31, 2014 at 07:35:52PM -0400, Dave Jones wrote:
> On Mon, Mar 31, 2014 at 04:22:21PM -0700, Paul E. McKenney wrote:
>  > On Mon, Mar 31, 2014 at 07:02:41PM -0400, Dave Jones wrote:
>  > > You can tell the merge window is open, because I'm back to breaking RCU.
>  > > 
>  > > ... 
>  > > [ 3558.120739] INFO: Stall ended before state dump start
>  > > 
>  > > at that point, userspace stopped responding. cursor on console was 
> blinking,
>  > > but I couldn't even switch tty's, or sysrq dump.

Hmmm...  I am having a very hard time imagining any of this merge
window's RCU changes preventing a sysrq dump.  On the other hand,
having a single grace period persist without anything blocking it
is pretty strange as well.

I would hope that the sysrq path does not allocate memory, but who knows?
After all, one possible reason for the eventual hang is memory exhaustion.
So one thing to try is to do sysrq earlier in the process.  (Yeah,
I know, tough to do if you have lots of scripted systems.)

>  > > rc8 was fine, so this is todays rcu changes.
>  > 
>  > New one on me!  Any chance of a .config file?
> 
> http://paste.fedoraproject.org/90449/30888213/raw/

Given that you have CONFIG_RCU_NOCB_CPU_ALL=y, all the grace-period
activity is being driven by the grace-period kthreads ("rcu_preempt"
in this case).  This leads me to wonder if your workload if preventing
RCU's grace-period kthreads from running.  These kthreads are SCHED_OTHER,
so could potentially be preempted for a long time.  But I would expect
a softlockup message in that case.

Alternatively, I suppose a wakeup could be getting lost.  The main change
related to that this merge window was ffa83fb565fb, which eliminated
idle wakeups from RCU in the CONFIG_RCU_NOCB_CPU_ALL=y case.

So, could you please try reverting ffa83fb565fb?

If that doesn't work, I will need to put together some diagnostic patches.
Starting with the one below.

Thanx, Paul



diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0c47e300210a..c5a163378710 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -936,7 +936,7 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
   smp_processor_id(), (long)(jiffies - rsp->gp_start),
   rsp->gpnum, rsp->completed, totqlen);
if (ndetected == 0)
-   pr_err("INFO: Stall ended before state dump start\n");
+   pr_err("INFO: Stall ended before state dump start, gp_kthread 
state: %#lx\n", rsp->gp_kthread->state);
else if (!trigger_all_cpu_backtrace())
rcu_dump_cpu_stacks(rsp);
 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-03-31 Thread Dave Jones
On Mon, Mar 31, 2014 at 04:22:21PM -0700, Paul E. McKenney wrote:
 > On Mon, Mar 31, 2014 at 07:02:41PM -0400, Dave Jones wrote:
 > > You can tell the merge window is open, because I'm back to breaking RCU.
 > > 
 > > ... 
 > > [ 3558.120739] INFO: Stall ended before state dump start
 > > 
 > > at that point, userspace stopped responding. cursor on console was 
 > > blinking,
 > > but I couldn't even switch tty's, or sysrq dump.
 > > 
 > > rc8 was fine, so this is todays rcu changes.
 > 
 > New one on me!  Any chance of a .config file?

http://paste.fedoraproject.org/90449/30888213/raw/

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-03-31 Thread Paul E. McKenney
On Mon, Mar 31, 2014 at 07:02:41PM -0400, Dave Jones wrote:
> You can tell the merge window is open, because I'm back to breaking RCU.
> 
> [ 1023.934532] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1023.935501]  (detected by 0, t=6502 jiffies, g=28666, c=28665, q=0)
> [ 1023.936299] INFO: Stall ended before state dump start
> [ 1218.871404] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1218.872399]  (detected by 0, t=26007 jiffies, g=28666, c=28665, q=0)
> [ 1218.873256] INFO: Stall ended before state dump start
> [ 1413.808406] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1413.809387]  (detected by 0, t=45512 jiffies, g=28666, c=28665, q=0)
> [ 1413.810113] INFO: Stall ended before state dump start
> [ 1608.745637] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1608.746614]  (detected by 0, t=65017 jiffies, g=28666, c=28665, q=0)
> [ 1608.747327] INFO: Stall ended before state dump start
> [ 1803.682892] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1803.683876]  (detected by 1, t=84522 jiffies, g=28666, c=28665, q=0)
> [ 1803.684741] INFO: Stall ended before state dump start
> [ 1998.620169] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1998.621123]  (detected by 0, t=104027 jiffies, g=28666, c=28665, q=0)
> [ 1998.621916] INFO: Stall ended before state dump start
> [ 2193.557459] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 2193.558425]  (detected by 0, t=123532 jiffies, g=28666, c=28665, q=0)
> [ 2193.559248] INFO: Stall ended before state dump start
> [ 2388.494719] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 2388.495688]  (detected by 0, t=143037 jiffies, g=28666, c=28665, q=0)
> [ 2388.496417] INFO: Stall ended before state dump start
> [ 2583.431983] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 2583.432967]  (detected by 0, t=162542 jiffies, g=28666, c=28665, q=0)
> [ 2583.433852] INFO: Stall ended before state dump start
> [ 2778.369242] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 2778.370235]  (detected by 0, t=182047 jiffies, g=28666, c=28665, q=0)
> [ 2778.371037] INFO: Stall ended before state dump start
> [ 2973.306537] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 2973.307540]  (detected by 0, t=201552 jiffies, g=28666, c=28665, q=0)
> [ 2973.308347] INFO: Stall ended before state dump start
> [ 3168.244036] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 3168.245056]  (detected by 0, t=221057 jiffies, g=28666, c=28665, q=0)
> [ 3168.245805] INFO: Stall ended before state dump start
> [ 3363.181458] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 3363.182496]  (detected by 0, t=240562 jiffies, g=28666, c=28665, q=0)
> [ 3363.183289] INFO: Stall ended before state dump start
> [ 3558.118835] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 3558.119879]  (detected by 0, t=260067 jiffies, g=28666, c=28665, q=0)
> [ 3558.120739] INFO: Stall ended before state dump start
> 
> at that point, userspace stopped responding. cursor on console was blinking,
> but I couldn't even switch tty's, or sysrq dump.
> 
> rc8 was fine, so this is todays rcu changes.

New one on me!  Any chance of a .config file?

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-03-31 Thread Paul E. McKenney
On Mon, Mar 31, 2014 at 07:02:41PM -0400, Dave Jones wrote:
 You can tell the merge window is open, because I'm back to breaking RCU.
 
 [ 1023.934532] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 1023.935501]  (detected by 0, t=6502 jiffies, g=28666, c=28665, q=0)
 [ 1023.936299] INFO: Stall ended before state dump start
 [ 1218.871404] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 1218.872399]  (detected by 0, t=26007 jiffies, g=28666, c=28665, q=0)
 [ 1218.873256] INFO: Stall ended before state dump start
 [ 1413.808406] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 1413.809387]  (detected by 0, t=45512 jiffies, g=28666, c=28665, q=0)
 [ 1413.810113] INFO: Stall ended before state dump start
 [ 1608.745637] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 1608.746614]  (detected by 0, t=65017 jiffies, g=28666, c=28665, q=0)
 [ 1608.747327] INFO: Stall ended before state dump start
 [ 1803.682892] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 1803.683876]  (detected by 1, t=84522 jiffies, g=28666, c=28665, q=0)
 [ 1803.684741] INFO: Stall ended before state dump start
 [ 1998.620169] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 1998.621123]  (detected by 0, t=104027 jiffies, g=28666, c=28665, q=0)
 [ 1998.621916] INFO: Stall ended before state dump start
 [ 2193.557459] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 2193.558425]  (detected by 0, t=123532 jiffies, g=28666, c=28665, q=0)
 [ 2193.559248] INFO: Stall ended before state dump start
 [ 2388.494719] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 2388.495688]  (detected by 0, t=143037 jiffies, g=28666, c=28665, q=0)
 [ 2388.496417] INFO: Stall ended before state dump start
 [ 2583.431983] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 2583.432967]  (detected by 0, t=162542 jiffies, g=28666, c=28665, q=0)
 [ 2583.433852] INFO: Stall ended before state dump start
 [ 2778.369242] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 2778.370235]  (detected by 0, t=182047 jiffies, g=28666, c=28665, q=0)
 [ 2778.371037] INFO: Stall ended before state dump start
 [ 2973.306537] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 2973.307540]  (detected by 0, t=201552 jiffies, g=28666, c=28665, q=0)
 [ 2973.308347] INFO: Stall ended before state dump start
 [ 3168.244036] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 3168.245056]  (detected by 0, t=221057 jiffies, g=28666, c=28665, q=0)
 [ 3168.245805] INFO: Stall ended before state dump start
 [ 3363.181458] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 3363.182496]  (detected by 0, t=240562 jiffies, g=28666, c=28665, q=0)
 [ 3363.183289] INFO: Stall ended before state dump start
 [ 3558.118835] INFO: rcu_preempt detected stalls on CPUs/tasks:
 [ 3558.119879]  (detected by 0, t=260067 jiffies, g=28666, c=28665, q=0)
 [ 3558.120739] INFO: Stall ended before state dump start
 
 at that point, userspace stopped responding. cursor on console was blinking,
 but I couldn't even switch tty's, or sysrq dump.
 
 rc8 was fine, so this is todays rcu changes.

New one on me!  Any chance of a .config file?

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-03-31 Thread Dave Jones
On Mon, Mar 31, 2014 at 04:22:21PM -0700, Paul E. McKenney wrote:
  On Mon, Mar 31, 2014 at 07:02:41PM -0400, Dave Jones wrote:
   You can tell the merge window is open, because I'm back to breaking RCU.
   
   ... 
   [ 3558.120739] INFO: Stall ended before state dump start
   
   at that point, userspace stopped responding. cursor on console was 
   blinking,
   but I couldn't even switch tty's, or sysrq dump.
   
   rc8 was fine, so this is todays rcu changes.
  
  New one on me!  Any chance of a .config file?

http://paste.fedoraproject.org/90449/30888213/raw/

Dave

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rcu_prempt stalls / lockup

2014-03-31 Thread Paul E. McKenney
On Mon, Mar 31, 2014 at 07:35:52PM -0400, Dave Jones wrote:
 On Mon, Mar 31, 2014 at 04:22:21PM -0700, Paul E. McKenney wrote:
   On Mon, Mar 31, 2014 at 07:02:41PM -0400, Dave Jones wrote:
You can tell the merge window is open, because I'm back to breaking RCU.

... 
[ 3558.120739] INFO: Stall ended before state dump start

at that point, userspace stopped responding. cursor on console was 
 blinking,
but I couldn't even switch tty's, or sysrq dump.

Hmmm...  I am having a very hard time imagining any of this merge
window's RCU changes preventing a sysrq dump.  On the other hand,
having a single grace period persist without anything blocking it
is pretty strange as well.

I would hope that the sysrq path does not allocate memory, but who knows?
After all, one possible reason for the eventual hang is memory exhaustion.
So one thing to try is to do sysrq earlier in the process.  (Yeah,
I know, tough to do if you have lots of scripted systems.)

rc8 was fine, so this is todays rcu changes.
   
   New one on me!  Any chance of a .config file?
 
 http://paste.fedoraproject.org/90449/30888213/raw/

Given that you have CONFIG_RCU_NOCB_CPU_ALL=y, all the grace-period
activity is being driven by the grace-period kthreads (rcu_preempt
in this case).  This leads me to wonder if your workload if preventing
RCU's grace-period kthreads from running.  These kthreads are SCHED_OTHER,
so could potentially be preempted for a long time.  But I would expect
a softlockup message in that case.

Alternatively, I suppose a wakeup could be getting lost.  The main change
related to that this merge window was ffa83fb565fb, which eliminated
idle wakeups from RCU in the CONFIG_RCU_NOCB_CPU_ALL=y case.

So, could you please try reverting ffa83fb565fb?

If that doesn't work, I will need to put together some diagnostic patches.
Starting with the one below.

Thanx, Paul



diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0c47e300210a..c5a163378710 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -936,7 +936,7 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
   smp_processor_id(), (long)(jiffies - rsp-gp_start),
   rsp-gpnum, rsp-completed, totqlen);
if (ndetected == 0)
-   pr_err(INFO: Stall ended before state dump start\n);
+   pr_err(INFO: Stall ended before state dump start, gp_kthread 
state: %#lx\n, rsp-gp_kthread-state);
else if (!trigger_all_cpu_backtrace())
rcu_dump_cpu_stacks(rsp);
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/