subject:"Re\: rcu self\-detected stall messages on OMAP3, 4 boards"

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-10-01 Thread Paul E. McKenney

On Mon, Oct 01, 2012 at 10:55:11AM +0200, Linus Walleij wrote:
> On Sat, Sep 22, 2012 at 11:59 PM, Paul E. McKenney
>  wrote:
> 
> > rcu: Fix day-one dyntick-idle stall-warning bug
> 
> As mentioned in another thread this solves the same problem for ux500.
> Reported/Tested-by: Linus Walleij 
> 
> But now it appears that this commit didn't make it into v3.6 so
> it definately needs to be tagged with Cc: sta...@kernel.org
> before it gets merged since the stall warnings are kinda scary.

Ingo submitting this to Linus Torvalds earlier today, so we should be
able to send to stable shortly.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-10-01 Thread Linus Walleij

On Sat, Sep 22, 2012 at 11:59 PM, Paul E. McKenney
 wrote:

> rcu: Fix day-one dyntick-idle stall-warning bug

As mentioned in another thread this solves the same problem for ux500.
Reported/Tested-by: Linus Walleij 

But now it appears that this commit didn't make it into v3.6 so
it definately needs to be tagged with Cc: sta...@kernel.org
before it gets merged since the stall warnings are kinda scary.

Yours,
Linus Walleij
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-10-01 Thread Linus Walleij

On Sat, Sep 22, 2012 at 11:59 PM, Paul E. McKenney
paul...@linux.vnet.ibm.com wrote:

 rcu: Fix day-one dyntick-idle stall-warning bug

As mentioned in another thread this solves the same problem for ux500.
Reported/Tested-by: Linus Walleij linus.wall...@linaro.org

But now it appears that this commit didn't make it into v3.6 so
it definately needs to be tagged with Cc: sta...@kernel.org
before it gets merged since the stall warnings are kinda scary.

Yours,
Linus Walleij
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-10-01 Thread Paul E. McKenney

On Mon, Oct 01, 2012 at 10:55:11AM +0200, Linus Walleij wrote:
 On Sat, Sep 22, 2012 at 11:59 PM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  rcu: Fix day-one dyntick-idle stall-warning bug
 
 As mentioned in another thread this solves the same problem for ux500.
 Reported/Tested-by: Linus Walleij linus.wall...@linaro.org
 
 But now it appears that this commit didn't make it into v3.6 so
 it definately needs to be tagged with Cc: sta...@kernel.org
 before it gets merged since the stall warnings are kinda scary.

Ingo submitting this to Linus Torvalds earlier today, so we should be
able to send to stable shortly.

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-24 Thread Paul E. McKenney

On Mon, Sep 24, 2012 at 09:54:00PM +, Paul Walmsley wrote:
> On Sat, 22 Sep 2012, Paul E. McKenney wrote:
> 
> > On Sat, Sep 22, 2012 at 10:20:19PM +, Paul Walmsley wrote:
> > > On Sat, 22 Sep 2012, Paul E. McKenney wrote:
> > >
> > > > This thing has been in the kernel since about 2004, not sure why you 
> > > > didn't hit it earlier.
> > > 
> > > One other data point in that regard - noticed the warnings don't appear 
> > > when the board is booted with:
> > > 
> > > commit 4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367
> > > Author: Paul E. McKenney 
> > > Date:   Tue Jun 5 15:53:53 2012 -0700
> > > 
> > > rcu: Fix qlen_lazy breakage
> > 
> > You lost me on this one.  This is already in mainline, so if you were
> > using (say) 3.6-rc6, you would already have this commit applied.
> 
> If I check out a kernel at this commit 
> 4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367 + the 
> zero-callback-in-tickless-idle diagnostic patch, build and boot it, then 
> the stall warnings don't appear (in 25 minutes of testing).
> 
> Messages from the diagnostic patch indicate that the kernel is entering 
> idle during a grace period with no RCU callbacks, though.
> 
> This is not a big deal and does not need any further attention.  Just 
> wanted to place a time boundary on the point when these messages started 
> appearing.  (It is unlikely to be an optimal bound: i.e., there are 
> probably later commits where the warnings also don't appear.)

Ah, got it, thank you!

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-24 Thread Paul Walmsley

On Sat, 22 Sep 2012, Paul E. McKenney wrote:

> On Sat, Sep 22, 2012 at 10:20:19PM +, Paul Walmsley wrote:
> > On Sat, 22 Sep 2012, Paul E. McKenney wrote:
> >
> > > This thing has been in the kernel since about 2004, not sure why you 
> > > didn't hit it earlier.
> > 
> > One other data point in that regard - noticed the warnings don't appear 
> > when the board is booted with:
> > 
> > commit 4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367
> > Author: Paul E. McKenney 
> > Date:   Tue Jun 5 15:53:53 2012 -0700
> > 
> > rcu: Fix qlen_lazy breakage
> 
> You lost me on this one.  This is already in mainline, so if you were
> using (say) 3.6-rc6, you would already have this commit applied.

If I check out a kernel at this commit 
4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367 + the 
zero-callback-in-tickless-idle diagnostic patch, build and boot it, then 
the stall warnings don't appear (in 25 minutes of testing).

Messages from the diagnostic patch indicate that the kernel is entering 
idle during a grace period with no RCU callbacks, though.

This is not a big deal and does not need any further attention.  Just 
wanted to place a time boundary on the point when these messages started 
appearing.  (It is unlikely to be an optimal bound: i.e., there are 
probably later commits where the warnings also don't appear.)

- Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-24 Thread Paul E. McKenney

On Mon, Sep 24, 2012 at 03:11:34PM +0530, Shilimkar, Santosh wrote:
> On Sun, Sep 23, 2012 at 3:29 AM, Paul E. McKenney
>  wrote:
> > On Sat, Sep 22, 2012 at 01:10:43PM -0700, Paul E. McKenney wrote:
> >> On Sat, Sep 22, 2012 at 06:42:08PM +, Paul Walmsley wrote:
> >> > On Fri, 21 Sep 2012, Paul E. McKenney wrote:
> 
> [...]
> 
> >
> > And here is a patch.  I am still having trouble reproducing the problem,
> > but figured that I should avoid serializing things.
> >
> > Thanx, Paul
> >
> > 
> >
> >  b/kernel/rcutree.c |4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > rcu: Fix day-one dyntick-idle stall-warning bug
> >
> > Each grace period is supposed to have at least one callback waiting
> > for that grace period to complete.  However, if CONFIG_NO_HZ=n, an
> > extra callback-free grace period is no big problem -- it will chew up
> > a tiny bit of CPU time, but it will complete normally.  In contrast,
> > CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to
> > sleep indefinitely, in turn indefinitely delaying completion of the
> > callback-free grace period.  Given that nothing is waiting on this grace
> > period, this is also not a problem.
> >
> > Unless RCU CPU stall warnings are also enabled, as they are in recent
> > kernels.  In this case, if a CPU wakes up after at least one minute
> > of inactivity, an RCU CPU stall warning will result.  The reason that
> > no one noticed until quite recently is that most systems have enough
> > OS noise that they will never remain absolutely idle for a full minute.
> > But there are some embedded systems with cut-down userspace configurations
> > that get into this mode quite easily.
> >
> > All this begs the question of exactly how a callback-free grace period
> > gets started in the first place.  This can happen due to the fact that
> > CPUs do not necessarily agree on which grace period is in progress.
> > If a CPU still believes that the grace period that just completed is
> > still ongoing, it will believe that it has callbacks that need to wait
> > for another grace period, never mind the fact that the grace period
> > that they were waiting for just completed.  This CPU can therefore
> > erroneously decide to start a new grace period.
> >
> > Once this CPU notices that the earlier grace period completed, it will
> > invoke its callbacks.  It then won't have any callbacks left.  If no
> > other CPU has any callbacks, we now have a callback-free grace period.
> >
> > This commit therefore makes CPUs check more carefully before starting a
> > new grace period.  This new check relies on an array of tail pointers
> > into each CPU's list of callbacks.  If the CPU is up to date on which
> > grace periods have completed, it checks to see if any callbacks follow
> > the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks
> > follow the RCU_WAIT_TAIL segment.  The reason that this works is that
> > the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment
> > as soon as the CPU figures out that the old grace period has ended.
> >
> > This change is to cpu_needs_another_gp(), which is called in a number
> > of places.  The only one that really matters is in rcu_start_gp(), where
> > the root rcu_node structure's ->lock is held, which prevents any
> > other CPU from starting or completing a grace period, so that the
> > comparison that determines whether the CPU is missing the completion
> > of a grace period is stable.
> >
> > Signed-off-by: Paul E. McKenney 
> > Signed-off-by: Paul E. McKenney 
> >
> As already confirmed by Paul W and others, I too no longer see the rcu dumps
> any more with above patch. Thanks a lot for the fix.

Glad it finally works!

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-24 Thread Shilimkar, Santosh

On Sun, Sep 23, 2012 at 3:29 AM, Paul E. McKenney
 wrote:
> On Sat, Sep 22, 2012 at 01:10:43PM -0700, Paul E. McKenney wrote:
>> On Sat, Sep 22, 2012 at 06:42:08PM +, Paul Walmsley wrote:
>> > On Fri, 21 Sep 2012, Paul E. McKenney wrote:

[...]

>
> And here is a patch.  I am still having trouble reproducing the problem,
> but figured that I should avoid serializing things.
>
> Thanx, Paul
>
> 
>
>  b/kernel/rcutree.c |4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> rcu: Fix day-one dyntick-idle stall-warning bug
>
> Each grace period is supposed to have at least one callback waiting
> for that grace period to complete.  However, if CONFIG_NO_HZ=n, an
> extra callback-free grace period is no big problem -- it will chew up
> a tiny bit of CPU time, but it will complete normally.  In contrast,
> CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to
> sleep indefinitely, in turn indefinitely delaying completion of the
> callback-free grace period.  Given that nothing is waiting on this grace
> period, this is also not a problem.
>
> Unless RCU CPU stall warnings are also enabled, as they are in recent
> kernels.  In this case, if a CPU wakes up after at least one minute
> of inactivity, an RCU CPU stall warning will result.  The reason that
> no one noticed until quite recently is that most systems have enough
> OS noise that they will never remain absolutely idle for a full minute.
> But there are some embedded systems with cut-down userspace configurations
> that get into this mode quite easily.
>
> All this begs the question of exactly how a callback-free grace period
> gets started in the first place.  This can happen due to the fact that
> CPUs do not necessarily agree on which grace period is in progress.
> If a CPU still believes that the grace period that just completed is
> still ongoing, it will believe that it has callbacks that need to wait
> for another grace period, never mind the fact that the grace period
> that they were waiting for just completed.  This CPU can therefore
> erroneously decide to start a new grace period.
>
> Once this CPU notices that the earlier grace period completed, it will
> invoke its callbacks.  It then won't have any callbacks left.  If no
> other CPU has any callbacks, we now have a callback-free grace period.
>
> This commit therefore makes CPUs check more carefully before starting a
> new grace period.  This new check relies on an array of tail pointers
> into each CPU's list of callbacks.  If the CPU is up to date on which
> grace periods have completed, it checks to see if any callbacks follow
> the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks
> follow the RCU_WAIT_TAIL segment.  The reason that this works is that
> the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment
> as soon as the CPU figures out that the old grace period has ended.
>
> This change is to cpu_needs_another_gp(), which is called in a number
> of places.  The only one that really matters is in rcu_start_gp(), where
> the root rcu_node structure's ->lock is held, which prevents any
> other CPU from starting or completing a grace period, so that the
> comparison that determines whether the CPU is missing the completion
> of a grace period is stable.
>
> Signed-off-by: Paul E. McKenney 
> Signed-off-by: Paul E. McKenney 
>
As already confirmed by Paul W and others, I too no longer see the rcu dumps
any more with above patch. Thanks a lot for the fix.

Regards
Santosh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-24 Thread Shilimkar, Santosh

On Sun, Sep 23, 2012 at 3:29 AM, Paul E. McKenney
paul...@linux.vnet.ibm.com wrote:
 On Sat, Sep 22, 2012 at 01:10:43PM -0700, Paul E. McKenney wrote:
 On Sat, Sep 22, 2012 at 06:42:08PM +, Paul Walmsley wrote:
  On Fri, 21 Sep 2012, Paul E. McKenney wrote:

[...]


 And here is a patch.  I am still having trouble reproducing the problem,
 but figured that I should avoid serializing things.

 Thanx, Paul

 

  b/kernel/rcutree.c |4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

 rcu: Fix day-one dyntick-idle stall-warning bug

 Each grace period is supposed to have at least one callback waiting
 for that grace period to complete.  However, if CONFIG_NO_HZ=n, an
 extra callback-free grace period is no big problem -- it will chew up
 a tiny bit of CPU time, but it will complete normally.  In contrast,
 CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to
 sleep indefinitely, in turn indefinitely delaying completion of the
 callback-free grace period.  Given that nothing is waiting on this grace
 period, this is also not a problem.

 Unless RCU CPU stall warnings are also enabled, as they are in recent
 kernels.  In this case, if a CPU wakes up after at least one minute
 of inactivity, an RCU CPU stall warning will result.  The reason that
 no one noticed until quite recently is that most systems have enough
 OS noise that they will never remain absolutely idle for a full minute.
 But there are some embedded systems with cut-down userspace configurations
 that get into this mode quite easily.

 All this begs the question of exactly how a callback-free grace period
 gets started in the first place.  This can happen due to the fact that
 CPUs do not necessarily agree on which grace period is in progress.
 If a CPU still believes that the grace period that just completed is
 still ongoing, it will believe that it has callbacks that need to wait
 for another grace period, never mind the fact that the grace period
 that they were waiting for just completed.  This CPU can therefore
 erroneously decide to start a new grace period.

 Once this CPU notices that the earlier grace period completed, it will
 invoke its callbacks.  It then won't have any callbacks left.  If no
 other CPU has any callbacks, we now have a callback-free grace period.

 This commit therefore makes CPUs check more carefully before starting a
 new grace period.  This new check relies on an array of tail pointers
 into each CPU's list of callbacks.  If the CPU is up to date on which
 grace periods have completed, it checks to see if any callbacks follow
 the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks
 follow the RCU_WAIT_TAIL segment.  The reason that this works is that
 the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment
 as soon as the CPU figures out that the old grace period has ended.

 This change is to cpu_needs_another_gp(), which is called in a number
 of places.  The only one that really matters is in rcu_start_gp(), where
 the root rcu_node structure's -lock is held, which prevents any
 other CPU from starting or completing a grace period, so that the
 comparison that determines whether the CPU is missing the completion
 of a grace period is stable.

 Signed-off-by: Paul E. McKenney paul.mcken...@linaro.org
 Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com

As already confirmed by Paul W and others, I too no longer see the rcu dumps
any more with above patch. Thanks a lot for the fix.

Regards
Santosh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-24 Thread Paul E. McKenney

On Mon, Sep 24, 2012 at 03:11:34PM +0530, Shilimkar, Santosh wrote:
 On Sun, Sep 23, 2012 at 3:29 AM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
  On Sat, Sep 22, 2012 at 01:10:43PM -0700, Paul E. McKenney wrote:
  On Sat, Sep 22, 2012 at 06:42:08PM +, Paul Walmsley wrote:
   On Fri, 21 Sep 2012, Paul E. McKenney wrote:
 
 [...]
 
 
  And here is a patch.  I am still having trouble reproducing the problem,
  but figured that I should avoid serializing things.
 
  Thanx, Paul
 
  
 
   b/kernel/rcutree.c |4 +++-
   1 file changed, 3 insertions(+), 1 deletion(-)
 
  rcu: Fix day-one dyntick-idle stall-warning bug
 
  Each grace period is supposed to have at least one callback waiting
  for that grace period to complete.  However, if CONFIG_NO_HZ=n, an
  extra callback-free grace period is no big problem -- it will chew up
  a tiny bit of CPU time, but it will complete normally.  In contrast,
  CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to
  sleep indefinitely, in turn indefinitely delaying completion of the
  callback-free grace period.  Given that nothing is waiting on this grace
  period, this is also not a problem.
 
  Unless RCU CPU stall warnings are also enabled, as they are in recent
  kernels.  In this case, if a CPU wakes up after at least one minute
  of inactivity, an RCU CPU stall warning will result.  The reason that
  no one noticed until quite recently is that most systems have enough
  OS noise that they will never remain absolutely idle for a full minute.
  But there are some embedded systems with cut-down userspace configurations
  that get into this mode quite easily.
 
  All this begs the question of exactly how a callback-free grace period
  gets started in the first place.  This can happen due to the fact that
  CPUs do not necessarily agree on which grace period is in progress.
  If a CPU still believes that the grace period that just completed is
  still ongoing, it will believe that it has callbacks that need to wait
  for another grace period, never mind the fact that the grace period
  that they were waiting for just completed.  This CPU can therefore
  erroneously decide to start a new grace period.
 
  Once this CPU notices that the earlier grace period completed, it will
  invoke its callbacks.  It then won't have any callbacks left.  If no
  other CPU has any callbacks, we now have a callback-free grace period.
 
  This commit therefore makes CPUs check more carefully before starting a
  new grace period.  This new check relies on an array of tail pointers
  into each CPU's list of callbacks.  If the CPU is up to date on which
  grace periods have completed, it checks to see if any callbacks follow
  the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks
  follow the RCU_WAIT_TAIL segment.  The reason that this works is that
  the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment
  as soon as the CPU figures out that the old grace period has ended.
 
  This change is to cpu_needs_another_gp(), which is called in a number
  of places.  The only one that really matters is in rcu_start_gp(), where
  the root rcu_node structure's -lock is held, which prevents any
  other CPU from starting or completing a grace period, so that the
  comparison that determines whether the CPU is missing the completion
  of a grace period is stable.
 
  Signed-off-by: Paul E. McKenney paul.mcken...@linaro.org
  Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com
 
 As already confirmed by Paul W and others, I too no longer see the rcu dumps
 any more with above patch. Thanks a lot for the fix.

Glad it finally works!

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-24 Thread Paul Walmsley

On Sat, 22 Sep 2012, Paul E. McKenney wrote:

 On Sat, Sep 22, 2012 at 10:20:19PM +, Paul Walmsley wrote:
  On Sat, 22 Sep 2012, Paul E. McKenney wrote:
 
   This thing has been in the kernel since about 2004, not sure why you 
   didn't hit it earlier.
  
  One other data point in that regard - noticed the warnings don't appear 
  when the board is booted with:
  
  commit 4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367
  Author: Paul E. McKenney paul.mcken...@linaro.org
  Date:   Tue Jun 5 15:53:53 2012 -0700
  
  rcu: Fix qlen_lazy breakage
 
 You lost me on this one.  This is already in mainline, so if you were
 using (say) 3.6-rc6, you would already have this commit applied.

If I check out a kernel at this commit 
4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367 + the 
zero-callback-in-tickless-idle diagnostic patch, build and boot it, then 
the stall warnings don't appear (in 25 minutes of testing).

Messages from the diagnostic patch indicate that the kernel is entering 
idle during a grace period with no RCU callbacks, though.

This is not a big deal and does not need any further attention.  Just 
wanted to place a time boundary on the point when these messages started 
appearing.  (It is unlikely to be an optimal bound: i.e., there are 
probably later commits where the warnings also don't appear.)


- Paul
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-24 Thread Paul E. McKenney

On Mon, Sep 24, 2012 at 09:54:00PM +, Paul Walmsley wrote:
 On Sat, 22 Sep 2012, Paul E. McKenney wrote:
 
  On Sat, Sep 22, 2012 at 10:20:19PM +, Paul Walmsley wrote:
   On Sat, 22 Sep 2012, Paul E. McKenney wrote:
  
This thing has been in the kernel since about 2004, not sure why you 
didn't hit it earlier.
   
   One other data point in that regard - noticed the warnings don't appear 
   when the board is booted with:
   
   commit 4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367
   Author: Paul E. McKenney paul.mcken...@linaro.org
   Date:   Tue Jun 5 15:53:53 2012 -0700
   
   rcu: Fix qlen_lazy breakage
  
  You lost me on this one.  This is already in mainline, so if you were
  using (say) 3.6-rc6, you would already have this commit applied.
 
 If I check out a kernel at this commit 
 4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367 + the 
 zero-callback-in-tickless-idle diagnostic patch, build and boot it, then 
 the stall warnings don't appear (in 25 minutes of testing).
 
 Messages from the diagnostic patch indicate that the kernel is entering 
 idle during a grace period with no RCU callbacks, though.
 
 This is not a big deal and does not need any further attention.  Just 
 wanted to place a time boundary on the point when these messages started 
 appearing.  (It is unlikely to be an optimal bound: i.e., there are 
 probably later commits where the warnings also don't appear.)

Ah, got it, thank you!

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-23 Thread Paul E. McKenney

On Sun, Sep 23, 2012 at 07:55:50AM +, Paul Walmsley wrote:
> On Sat, 22 Sep 2012, Paul E. McKenney wrote:
> 
> > On Sat, Sep 22, 2012 at 10:25:59PM +, Paul Walmsley wrote:
> > 
> > > The recent tests here have been on Pandaboard, which is dual-CPU, but my 
> > > recollection is that I also observed the warnings on a single-core 
> > > Beagleboard.  Will re-test.
> > 
> > Anxiously awaiting the results.
> 
> The same problem exists on BeagleBoard XM (OMAP3730, single-core 
> Cortex-A8):
> 
>
> http://www.pwsan.com/omap/transcripts/20120922-beaglexm-rcu-stall-debug-pre-fix.txt
> 
> and the same patch fixes it:
> 
>
> http://www.pwsan.com/omap/transcripts/20120922-beaglexm-rcu-stall-debug-post-fix.txt
> 
> Please feel free to update my Tested-by:, if you wish.
> 
> Tested-by: Paul Walmsley   # OMAP3730, OMAP4430

Very good, thank you very much!!!

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-23 Thread Paul Walmsley

On Sat, 22 Sep 2012, Paul E. McKenney wrote:

> On Sat, Sep 22, 2012 at 10:25:59PM +, Paul Walmsley wrote:
> 
> > The recent tests here have been on Pandaboard, which is dual-CPU, but my 
> > recollection is that I also observed the warnings on a single-core 
> > Beagleboard.  Will re-test.
> 
> Anxiously awaiting the results.

The same problem exists on BeagleBoard XM (OMAP3730, single-core 
Cortex-A8):

   
http://www.pwsan.com/omap/transcripts/20120922-beaglexm-rcu-stall-debug-pre-fix.txt

and the same patch fixes it:

   
http://www.pwsan.com/omap/transcripts/20120922-beaglexm-rcu-stall-debug-post-fix.txt

Please feel free to update my Tested-by:, if you wish.

Tested-by: Paul Walmsley   # OMAP3730, OMAP4430


- Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-23 Thread Paul Walmsley

On Sat, 22 Sep 2012, Paul E. McKenney wrote:

 On Sat, Sep 22, 2012 at 10:25:59PM +, Paul Walmsley wrote:
 
  The recent tests here have been on Pandaboard, which is dual-CPU, but my 
  recollection is that I also observed the warnings on a single-core 
  Beagleboard.  Will re-test.
 
 Anxiously awaiting the results.

The same problem exists on BeagleBoard XM (OMAP3730, single-core 
Cortex-A8):

   
http://www.pwsan.com/omap/transcripts/20120922-beaglexm-rcu-stall-debug-pre-fix.txt

and the same patch fixes it:

   
http://www.pwsan.com/omap/transcripts/20120922-beaglexm-rcu-stall-debug-post-fix.txt

Please feel free to update my Tested-by:, if you wish.

Tested-by: Paul Walmsley p...@pwsan.com  # OMAP3730, OMAP4430


- Paul
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-23 Thread Paul E. McKenney

On Sun, Sep 23, 2012 at 07:55:50AM +, Paul Walmsley wrote:
 On Sat, 22 Sep 2012, Paul E. McKenney wrote:
 
  On Sat, Sep 22, 2012 at 10:25:59PM +, Paul Walmsley wrote:
  
   The recent tests here have been on Pandaboard, which is dual-CPU, but my 
   recollection is that I also observed the warnings on a single-core 
   Beagleboard.  Will re-test.
  
  Anxiously awaiting the results.
 
 The same problem exists on BeagleBoard XM (OMAP3730, single-core 
 Cortex-A8):
 

 http://www.pwsan.com/omap/transcripts/20120922-beaglexm-rcu-stall-debug-pre-fix.txt
 
 and the same patch fixes it:
 

 http://www.pwsan.com/omap/transcripts/20120922-beaglexm-rcu-stall-debug-post-fix.txt
 
 Please feel free to update my Tested-by:, if you wish.
 
 Tested-by: Paul Walmsley p...@pwsan.com  # OMAP3730, OMAP4430

Very good, thank you very much!!!

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul Walmsley

On Sat, 22 Sep 2012, Paul E. McKenney wrote:

> Very cool, thank you for your testing efforts!!!

You're welcome.

> May I apply your Tested-by to this patch?

Please do:

Tested-by: Paul Walmsley   # OMAP4430

Am testing on OMAP3730 (single-core) now.


- Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney

On Sun, Sep 23, 2012 at 01:42:10AM +, Paul Walmsley wrote:
> Hi Paul
> 
> On Sat, 22 Sep 2012, Paul Walmsley wrote:
> 
> > On Sat, 22 Sep 2012, Paul E. McKenney wrote:
> > 
> > > And here is a patch.  I am still having trouble reproducing the problem,
> > > but figured that I should avoid serializing things.
> > 
> > Thanks, testing this now on v3.6-rc6.
> 
> Looks like you solved it!
> 
> Tested v3.6-rc6 + your stall diagnostic patch:
> 
> http://marc.info/?l=linux-arm-kernel=134827237215882=2
> 
> on OMAP4430ES2 Pandaboard using omap2plus_defconfig and 
> CONFIG_RCU_CPU_STALL_INFO=y; got the stall warnings.
> 
> Then added "rcu: Fix day-one dyntick-idle stall-warning bug" from:
> 
> http://marc.info/?l=linux-arm-kernel=134835120600590=2
> 
> Booted that, and the stall warnings did not appear within 30 minutes.

Very cool, thank you for your testing efforts!!!

May I apply your Tested-by to this patch?

And good show on the debugging patch -- it is quite good to have such
solid evidence that the bug that the fix was intended for was actually
occurring.

Thanx, Paul

> To confirm that the problem being solved matched your hypothesis, the
> debugging patch below[1] was added to the RCU idle entry/exit code.
> 
> Without the bugfix patch, a boot log transcript was obtained
> indicating that the idle loop was entered with tick_nohz_enabled=1
> during a grace period with no callbacks present:
> 
> http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-pre-fix.txt
> 
> The debugging events started to appear at 1.867370 seconds into the
> boot.  ENTER was pressed about 464 seconds in; this triggered the
> rcu_sched stall traceback.
> 
> With the bugfix patch, a boot log transcript was obtained that
> indicated that the condition under test never occurred after waiting
> about 20 minutes:
> 
> 
> http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-post-fix.txt
> 
> Thanks for being so willing to root-cause the issue, Paul; it's 
> appreciated, and it's been quite instructive as well.  Will address some 
> remaining loose ends in follow-up E-mails.
> 
> 
> - Paul
> 
> 
> [1] Debugging patch to printk() if the previous idle loop entry occurred 
> with tick_nohz_enabled=1 during a grace period with no RCU callbacks 
> present:
> 
> 
> ---
>  kernel/rcutree.c |   17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index f1eb7ad..f42941b 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -60,6 +60,9 @@
> 
>  /* Data structures. */
> 
> +extern int tick_nohz_enabled;
> +static int no_cbs_idle_entry_count;
> +
>  static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
> 
>  #define RCU_STATE_INITIALIZER(sname, cr) { \
> @@ -400,8 +403,12 @@ void rcu_idle_enter(void)
>   unsigned long flags;
>   long long oldval;
>   struct rcu_dynticks *rdtp;
> + int cpu;
> + long totqlen = 0;
> + struct rcu_data *rdp;
> 
>   local_irq_save(flags);
> + rdp = &__get_cpu_var(rcu_sched_data);
>   rdtp = &__get_cpu_var(rcu_dynticks);
>   oldval = rdtp->dynticks_nesting;
>   WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0);
> @@ -410,6 +417,12 @@ void rcu_idle_enter(void)
>   else
>   rdtp->dynticks_nesting -= DYNTICK_TASK_NEST_VALUE;
>   rcu_idle_enter_common(rdtp, oldval);
> + if (tick_nohz_enabled && rcu_gp_in_progress(rdp->rsp)) {
> + for_each_possible_cpu(cpu)
> + totqlen += per_cpu_ptr(rdp->rsp->rda, cpu)->qlen;
> + if (totqlen == 0)
> + no_cbs_idle_entry_count = 1;
> + }
>   local_irq_restore(flags);
>  }
>  EXPORT_SYMBOL_GPL(rcu_idle_enter);
> @@ -503,6 +516,10 @@ void rcu_idle_exit(void)
>   rdtp->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
>   rcu_idle_exit_common(rdtp, oldval);
>   local_irq_restore(flags);
> + if (no_cbs_idle_entry_count) {
> + no_cbs_idle_entry_count = 0;
> + pr_err("* Tickless idle was entered with zero RCU callbacks\n");
> + }
>  }
>  EXPORT_SYMBOL_GPL(rcu_idle_exit);
> 
> -- 
> 1.7.10.4
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul Walmsley

Hi Paul

On Sat, 22 Sep 2012, Paul Walmsley wrote:

> On Sat, 22 Sep 2012, Paul E. McKenney wrote:
> 
> > And here is a patch.  I am still having trouble reproducing the problem,
> > but figured that I should avoid serializing things.
> 
> Thanks, testing this now on v3.6-rc6.

Looks like you solved it!

Tested v3.6-rc6 + your stall diagnostic patch:

http://marc.info/?l=linux-arm-kernel=134827237215882=2

on OMAP4430ES2 Pandaboard using omap2plus_defconfig and 
CONFIG_RCU_CPU_STALL_INFO=y; got the stall warnings.

Then added "rcu: Fix day-one dyntick-idle stall-warning bug" from:

http://marc.info/?l=linux-arm-kernel=134835120600590=2

Booted that, and the stall warnings did not appear within 30 minutes.

To confirm that the problem being solved matched your hypothesis, the
debugging patch below[1] was added to the RCU idle entry/exit code.

Without the bugfix patch, a boot log transcript was obtained
indicating that the idle loop was entered with tick_nohz_enabled=1
during a grace period with no callbacks present:

http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-pre-fix.txt

The debugging events started to appear at 1.867370 seconds into the
boot.  ENTER was pressed about 464 seconds in; this triggered the
rcu_sched stall traceback.

With the bugfix patch, a boot log transcript was obtained that
indicated that the condition under test never occurred after waiting
about 20 minutes:

http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-post-fix.txt

Thanks for being so willing to root-cause the issue, Paul; it's 
appreciated, and it's been quite instructive as well.  Will address some 
remaining loose ends in follow-up E-mails.


- Paul


[1] Debugging patch to printk() if the previous idle loop entry occurred 
with tick_nohz_enabled=1 during a grace period with no RCU callbacks 
present:


---
 kernel/rcutree.c |   17 +
 1 file changed, 17 insertions(+)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index f1eb7ad..f42941b 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -60,6 +60,9 @@
 
 /* Data structures. */
 
+extern int tick_nohz_enabled;
+static int no_cbs_idle_entry_count;
+
 static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
 
 #define RCU_STATE_INITIALIZER(sname, cr) { \
@@ -400,8 +403,12 @@ void rcu_idle_enter(void)
unsigned long flags;
long long oldval;
struct rcu_dynticks *rdtp;
+   int cpu;
+   long totqlen = 0;
+   struct rcu_data *rdp;
 
local_irq_save(flags);
+   rdp = &__get_cpu_var(rcu_sched_data);
rdtp = &__get_cpu_var(rcu_dynticks);
oldval = rdtp->dynticks_nesting;
WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0);
@@ -410,6 +417,12 @@ void rcu_idle_enter(void)
else
rdtp->dynticks_nesting -= DYNTICK_TASK_NEST_VALUE;
rcu_idle_enter_common(rdtp, oldval);
+   if (tick_nohz_enabled && rcu_gp_in_progress(rdp->rsp)) {
+   for_each_possible_cpu(cpu)
+   totqlen += per_cpu_ptr(rdp->rsp->rda, cpu)->qlen;
+   if (totqlen == 0)
+   no_cbs_idle_entry_count = 1;
+   }
local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
@@ -503,6 +516,10 @@ void rcu_idle_exit(void)
rdtp->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
rcu_idle_exit_common(rdtp, oldval);
local_irq_restore(flags);
+   if (no_cbs_idle_entry_count) {
+   no_cbs_idle_entry_count = 0;
+   pr_err("* Tickless idle was entered with zero RCU callbacks\n");
+   }
 }
 EXPORT_SYMBOL_GPL(rcu_idle_exit);
 
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney

On Sat, Sep 22, 2012 at 10:20:19PM +, Paul Walmsley wrote:
> Hi Paul
> 
> On Sat, 22 Sep 2012, Paul E. McKenney wrote:
> 
> > Strangely enough, I believe that I have inadvertently fixed this in
> > my -rcu tree:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/next
> > 
> > Nevertheless, if you get a chance to try it, I would be interested to
> > hear if my guess is correct.
> 
> Yes, good news: the stall warnings go away with that branch.

Very good!

> > The trick is that a kthread drives the grace period in -rcu, regardless 
> > of whether or not there are callbacks.
> 
> This is "rcu: Move quiescent-state forcing into kthread" ?

Yep, plus the preceding commits moving grace-period initialization and
cleanup into that same kthread.  This was motivated by a bug report
last February complaining about 200-microsecond latency spikes from
RCU grace-period initialization.  On systems with 4096 CPUs.

Real-time response.  It is far bigger than I thought.  ;-)

> Added some debugging into rcu_gp_kthread() after that commit and can 
> confirm that the quiescent-state forcing loop does start a few times when 
> there are zero callbacks pending (modulo any races in my measurement 
> code).

Cool, thank you!  Assuming it works, that indicates that there is long-term
value to the fix for this problem.  On larger systems, extra grace periods
are not what you want, as their expense increases with the number of CPUs.

> > However, the backport would not be something that -stable would be happy
> > with, so I will be putting together a fix for mainline.  This thing
> > has been in the kernel since about 2004, not sure why you didn't hit
> > it earlier.
> 
> One other data point in that regard - noticed the warnings don't appear 
> when the board is booted with:
> 
> commit 4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367
> Author: Paul E. McKenney 
> Date:   Tue Jun 5 15:53:53 2012 -0700
> 
> rcu: Fix qlen_lazy breakage

You lost me on this one.  This is already in mainline, so if you were
using (say) 3.6-rc6, you would already have this commit applied.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney

On Sat, Sep 22, 2012 at 10:25:59PM +, Paul Walmsley wrote:
> On Sat, 22 Sep 2012, Paul E. McKenney wrote:
> 
> > And here is a patch.  I am still having trouble reproducing the problem,
> > but figured that I should avoid serializing things.
> 
> Thanks, testing this now on v3.6-rc6.

Very cool, thank you!

>One question though about the patch 
> description:
> 
> > All this begs the question of exactly how a callback-free grace period
> > gets started in the first place.  This can happen due to the fact that
> > CPUs do not necessarily agree on which grace period is in progress.
> > If a CPU still believes that the grace period that just completed is
> > still ongoing, it will believe that it has callbacks that need to wait
> > for another grace period, never mind the fact that the grace period
> > that they were waiting for just completed.  This CPU can therefore
> > erroneously decide to start a new grace period.
> 
> Doesn't this imply that this bug would only affect multi-CPU systems?  

Surprisingly not, at least when running TREE_RCU or TREE_PREEMPT_RCU.
In order to keep lock contention down to a dull roar on larger systems,
TREE_RCU keeps three sets of books: (1) the global state in the rcu_state
structure, (2) the combining-tree per-node state in the rcu_node
structure, and the per-CPU state in the rcu_data structure.  A CPU is
not officially aware of the end of a grace period until it is reflected
in its rcu_data structure.  This has the perhaps-surprising consequence
that the CPU that detected the end of the old grace period might start
a new one before becoming officially aware that the old one ended.

Why not have the CPU inform itself immediately upon noticing that the
old grace period ended?  Deadlock.  The rcu_node locks must be acquired
from leaf towards root, and the CPU is holding the root rcu_node lock
when it notices that the grace period has ended.

I have made this a bit less problematic in the bigrt branch, working
towards a goal of getting RCU into a state where automatic formal
validation might one day be possible.  And yes, I am starting to get some
formal-validation people interested in this lofty goal, see for example:
http://sites.google.com/site/popl13grace/paper.pdf.

> The recent tests here have been on Pandaboard, which is dual-CPU, but my 
> recollection is that I also observed the warnings on a single-core 
> Beagleboard.  Will re-test.

Anxiously awaiting the results.  This has been a strange one, even by
RCU's standards.

Plus I need to add a few Reported-by lines.  Next version...

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul Walmsley

On Sat, 22 Sep 2012, Paul E. McKenney wrote:

> And here is a patch.  I am still having trouble reproducing the problem,
> but figured that I should avoid serializing things.

Thanks, testing this now on v3.6-rc6.  One question though about the patch 
description:

> All this begs the question of exactly how a callback-free grace period
> gets started in the first place.  This can happen due to the fact that
> CPUs do not necessarily agree on which grace period is in progress.
> If a CPU still believes that the grace period that just completed is
> still ongoing, it will believe that it has callbacks that need to wait
> for another grace period, never mind the fact that the grace period
> that they were waiting for just completed.  This CPU can therefore
> erroneously decide to start a new grace period.

Doesn't this imply that this bug would only affect multi-CPU systems?  

The recent tests here have been on Pandaboard, which is dual-CPU, but my 
recollection is that I also observed the warnings on a single-core 
Beagleboard.  Will re-test.


- Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul Walmsley

Hi Paul

On Sat, 22 Sep 2012, Paul E. McKenney wrote:

> Strangely enough, I believe that I have inadvertently fixed this in
> my -rcu tree:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/next
> 
> Nevertheless, if you get a chance to try it, I would be interested to
> hear if my guess is correct.

Yes, good news: the stall warnings go away with that branch.

> The trick is that a kthread drives the grace period in -rcu, regardless 
> of whether or not there are callbacks.

This is "rcu: Move quiescent-state forcing into kthread" ?

Added some debugging into rcu_gp_kthread() after that commit and can 
confirm that the quiescent-state forcing loop does start a few times when 
there are zero callbacks pending (modulo any races in my measurement 
code).

> However, the backport would not be something that -stable would be happy
> with, so I will be putting together a fix for mainline.  This thing
> has been in the kernel since about 2004, not sure why you didn't hit
> it earlier.

One other data point in that regard - noticed the warnings don't appear 
when the board is booted with:

commit 4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367
Author: Paul E. McKenney 
Date:   Tue Jun 5 15:53:53 2012 -0700

rcu: Fix qlen_lazy breakage

...


- Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney

On Sat, Sep 22, 2012 at 01:10:43PM -0700, Paul E. McKenney wrote:
> On Sat, Sep 22, 2012 at 06:42:08PM +, Paul Walmsley wrote:
> > On Fri, 21 Sep 2012, Paul E. McKenney wrote:
> > 
> > > Could you please point me to a recipe for creating a minimal userspace?
> > > Just in case it is the userspac erather than the architecture/hardware
> > > that makes the difference.
> > 
> > Tony's suggestion is pretty good.  Note that there may also be differences 
> > in kernel timers -- either between x86 and ARM architectures, or loaded 
> > device drivers -- that may confound the problem.
> 
> For example, there must be at least one RCU callback outstanding after
> the boot sequence quiets down.  Of course, the last time I tried Tony's
> approach, I was doing it on top of my -rcu stack, so am retrying on
> v3.6-rc6.
> 
> > > Just to make sure I understand the combinations:
> > > 
> > > o All stalls have happened when running a minimal userspace.
> > > o CONFIG_NO_HZ=n suppresses the stalls.
> > > o CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
> > >   no observable effect on the stalls.
> > > 
> > > Did I get that right, or am I missing a combination?
> > 
> > That's correct.
> > 
> > > Indeed, rcu_idle_gp_timer_func() is a bit strange in that it is 
> > > cancelled upon exit from idle, and therefore should (almost) never 
> > > actually execute. Its sole purpose is to wake up the CPU.  ;-)
> > 
> > Right.  Just curious, what would wake up the kernel from idle to handle a 
> > grace period expiration when CONFIG_RCU_FAST_NO_HZ=n?  On a very idle 
> > system, the time between timer ticks could potentially be several tens of 
> > seconds.
> 
> If CONFIG_RCU_FAST_NO_HZ=n, then CPUs with RCU callbacks are not permitted
> to shut off the scheduling-clock tick, so any CPU with RCU callbacks will
> be awakened every jiffy.  The problem is that there appears to be a way
> to get an RCU grace period started without any CPU having any callbacks,
> which, as you surmise, would result in all the CPUs going to sleep and
> the grace period never ending.  So if a CPU is awakened for any reason
> after this everlasting grace period has extended for more than a minute,
> the first thing that CPU will do is print an RCU CPU stall warning.
> 
> I believe that I see how to prevent callback-free grace periods from
> ever starting.  (Famous last words...)

And here is a patch.  I am still having trouble reproducing the problem,
but figured that I should avoid serializing things.

Thanx, Paul

 b/kernel/rcutree.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

rcu: Fix day-one dyntick-idle stall-warning bug

Each grace period is supposed to have at least one callback waiting
for that grace period to complete.  However, if CONFIG_NO_HZ=n, an
extra callback-free grace period is no big problem -- it will chew up
a tiny bit of CPU time, but it will complete normally.  In contrast,
CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to
sleep indefinitely, in turn indefinitely delaying completion of the
callback-free grace period.  Given that nothing is waiting on this grace
period, this is also not a problem.

Unless RCU CPU stall warnings are also enabled, as they are in recent
kernels.  In this case, if a CPU wakes up after at least one minute
of inactivity, an RCU CPU stall warning will result.  The reason that
no one noticed until quite recently is that most systems have enough
OS noise that they will never remain absolutely idle for a full minute.
But there are some embedded systems with cut-down userspace configurations
that get into this mode quite easily.

All this begs the question of exactly how a callback-free grace period
gets started in the first place.  This can happen due to the fact that
CPUs do not necessarily agree on which grace period is in progress.
If a CPU still believes that the grace period that just completed is
still ongoing, it will believe that it has callbacks that need to wait
for another grace period, never mind the fact that the grace period
that they were waiting for just completed.  This CPU can therefore
erroneously decide to start a new grace period.

Once this CPU notices that the earlier grace period completed, it will
invoke its callbacks.  It then won't have any callbacks left.  If no
other CPU has any callbacks, we now have a callback-free grace period.

This commit therefore makes CPUs check more carefully before starting a
new grace period.  This new check relies on an array of tail pointers
into each CPU's list of callbacks.  If the CPU is up to date on which
grace periods have completed, it checks to see if any callbacks follow
the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks
follow the RCU_WAIT_TAIL segment.  The reason that this works is that
the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney

On Sat, Sep 22, 2012 at 06:16:15PM +, Paul Walmsley wrote:
> Hi Paul
> 
> On Fri, 21 Sep 2012, Paul E. McKenney wrote:
> 
> > I am wondering if your system somehow figured out how to start a grace
> > period that had no RCU callbacks waiting for it.  If that happened,
> > then a CONFIG_NO_HZ=y system could in theory get into a state where all
> > CPUs are in dyntick-idle mode, so that none of them is doing anything
> > to force the grace period to complete.
> >
> > That should be easy to diagnose, anyway.  Please see below, which
> > includes the earlier diagnostic patch.
> 
> Here you go.
> 
> - Paul
> 
> [  248.902618] INFO: rcu_sched self-detected stall on CPU
> [  248.905456]  0: (1 ticks this GP) idle=933/1/0 
> [  248.907897]   (t=26570 jiffies g=11 c=10 q=0)

Bingo!!!  (q=0, in case you were wondering.  And thank you for testing this!)

Strangely enough, I believe that I have inadvertently fixed this in
my -rcu tree:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/next

Nevertheless, if you get a chance to try it, I would be interested to
hear if my guess is correct.  The trick is that a kthread drives the
grace period in -rcu, regardless of whether or not there are callbacks.

However, the backport would not be something that -stable would be happy
with, so I will be putting together a fix for mainline.  This thing
has been in the kernel since about 2004, not sure why you didn't hit
it earlier.

Thanx, Paul

> [  248.910339] [] (unwind_backtrace+0x0/0xf0) from [] 
> (rcu_check_callbacks+0x220/0x714)
> [  248.915527] [] (rcu_check_callbacks+0x220/0x714) from 
> [] (update_process_times+0x38/0x68)
> [  248.920928] [] (update_process_times+0x38/0x68) from 
> [] (tick_sched_timer+0x80/0xec)
> [  248.926116] [] (tick_sched_timer+0x80/0xec) from [] 
> (__run_hrtimer+0x7c/0x1e0)
> [  248.930999] [] (__run_hrtimer+0x7c/0x1e0) from [] 
> (hrtimer_interrupt+0x11c/0x2d0)
> [  248.936035] [] (hrtimer_interrupt+0x11c/0x2d0) from [] 
> (twd_handler+0x30/0x44)
> [  248.940948] [] (twd_handler+0x30/0x44) from [] 
> (handle_percpu_devid_irq+0x90/0x13c)
> [  248.946075] [] (handle_percpu_devid_irq+0x90/0x13c) from 
> [] (generic_handle_irq+0x30/0x48)
> [  248.951538] [] (generic_handle_irq+0x30/0x48) from [] 
> (handle_IRQ+0x4c/0xac)
> [  248.956329] [] (handle_IRQ+0x4c/0xac) from [] 
> (gic_handle_irq+0x28/0x5c)
> [  248.960937] [] (gic_handle_irq+0x28/0x5c) from [] 
> (__irq_svc+0x44/0x5c)
> [  248.965484] Exception stack(0xc0729f58 to 0xc0729fa0)
> [  248.968231] 9f40:   
> 0003b832 0001
> [  248.972686] 9f60:  c074a8e8 c0728000 c07c42c8 c05065a0 c074bdc8 
>  411fc092
> [  248.977142] 9f80: c074bfe8  0001 c0729fa0 0003b833 c0015130 
> 2113 
> [  248.981597] [] (__irq_svc+0x44/0x5c) from [] 
> (default_idle+0x20/0x44)
> [  248.986083] [] (default_idle+0x20/0x44) from [] 
> (cpu_idle+0x9c/0x114)
> [  248.990539] [] (cpu_idle+0x9c/0x114) from [] 
> (start_kernel+0x2b4/0x304)
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney

On Sat, Sep 22, 2012 at 06:42:08PM +, Paul Walmsley wrote:
> On Fri, 21 Sep 2012, Paul E. McKenney wrote:
> 
> > Could you please point me to a recipe for creating a minimal userspace?
> > Just in case it is the userspac erather than the architecture/hardware
> > that makes the difference.
> 
> Tony's suggestion is pretty good.  Note that there may also be differences 
> in kernel timers -- either between x86 and ARM architectures, or loaded 
> device drivers -- that may confound the problem.

For example, there must be at least one RCU callback outstanding after
the boot sequence quiets down.  Of course, the last time I tried Tony's
approach, I was doing it on top of my -rcu stack, so am retrying on
v3.6-rc6.

> > Just to make sure I understand the combinations:
> > 
> > o   All stalls have happened when running a minimal userspace.
> > o   CONFIG_NO_HZ=n suppresses the stalls.
> > o   CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
> > no observable effect on the stalls.
> > 
> > Did I get that right, or am I missing a combination?
> 
> That's correct.
> 
> > Indeed, rcu_idle_gp_timer_func() is a bit strange in that it is 
> > cancelled upon exit from idle, and therefore should (almost) never 
> > actually execute. Its sole purpose is to wake up the CPU.  ;-)
> 
> Right.  Just curious, what would wake up the kernel from idle to handle a 
> grace period expiration when CONFIG_RCU_FAST_NO_HZ=n?  On a very idle 
> system, the time between timer ticks could potentially be several tens of 
> seconds.

If CONFIG_RCU_FAST_NO_HZ=n, then CPUs with RCU callbacks are not permitted
to shut off the scheduling-clock tick, so any CPU with RCU callbacks will
be awakened every jiffy.  The problem is that there appears to be a way
to get an RCU grace period started without any CPU having any callbacks,
which, as you surmise, would result in all the CPUs going to sleep and
the grace period never ending.  So if a CPU is awakened for any reason
after this everlasting grace period has extended for more than a minute,
the first thing that CPU will do is print an RCU CPU stall warning.

I believe that I see how to prevent callback-free grace periods from
ever starting.  (Famous last words...)

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul Walmsley

On Fri, 21 Sep 2012, Paul E. McKenney wrote:

> Could you please point me to a recipe for creating a minimal userspace?
> Just in case it is the userspac erather than the architecture/hardware
> that makes the difference.

Tony's suggestion is pretty good.  Note that there may also be differences 
in kernel timers -- either between x86 and ARM architectures, or loaded 
device drivers -- that may confound the problem.

> Just to make sure I understand the combinations:
> 
> o All stalls have happened when running a minimal userspace.
> o CONFIG_NO_HZ=n suppresses the stalls.
> o CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
>   no observable effect on the stalls.
> 
> Did I get that right, or am I missing a combination?

That's correct.

> Indeed, rcu_idle_gp_timer_func() is a bit strange in that it is 
> cancelled upon exit from idle, and therefore should (almost) never 
> actually execute. Its sole purpose is to wake up the CPU.  ;-)

Right.  Just curious, what would wake up the kernel from idle to handle a 
grace period expiration when CONFIG_RCU_FAST_NO_HZ=n?  On a very idle 
system, the time between timer ticks could potentially be several tens of 
seconds.

- Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul Walmsley

Hi Paul

On Fri, 21 Sep 2012, Paul E. McKenney wrote:

> I am wondering if your system somehow figured out how to start a grace
> period that had no RCU callbacks waiting for it.  If that happened,
> then a CONFIG_NO_HZ=y system could in theory get into a state where all
> CPUs are in dyntick-idle mode, so that none of them is doing anything
> to force the grace period to complete.
>
> That should be easy to diagnose, anyway.  Please see below, which
> includes the earlier diagnostic patch.

Here you go.
 
- Paul

[  248.902618] INFO: rcu_sched self-detected stall on CPU
[  248.905456]  0: (1 ticks this GP) idle=933/1/0 
[  248.907897]   (t=26570 jiffies g=11 c=10 q=0)
[  248.910339] [] (unwind_backtrace+0x0/0xf0) from [] 
(rcu_check_callbacks+0x220/0x714)
[  248.915527] [] (rcu_check_callbacks+0x220/0x714) from [] 
(update_process_times+0x38/0x68)
[  248.920928] [] (update_process_times+0x38/0x68) from [] 
(tick_sched_timer+0x80/0xec)
[  248.926116] [] (tick_sched_timer+0x80/0xec) from [] 
(__run_hrtimer+0x7c/0x1e0)
[  248.930999] [] (__run_hrtimer+0x7c/0x1e0) from [] 
(hrtimer_interrupt+0x11c/0x2d0)
[  248.936035] [] (hrtimer_interrupt+0x11c/0x2d0) from [] 
(twd_handler+0x30/0x44)
[  248.940948] [] (twd_handler+0x30/0x44) from [] 
(handle_percpu_devid_irq+0x90/0x13c)
[  248.946075] [] (handle_percpu_devid_irq+0x90/0x13c) from 
[] (generic_handle_irq+0x30/0x48)
[  248.951538] [] (generic_handle_irq+0x30/0x48) from [] 
(handle_IRQ+0x4c/0xac)
[  248.956329] [] (handle_IRQ+0x4c/0xac) from [] 
(gic_handle_irq+0x28/0x5c)
[  248.960937] [] (gic_handle_irq+0x28/0x5c) from [] 
(__irq_svc+0x44/0x5c)
[  248.965484] Exception stack(0xc0729f58 to 0xc0729fa0)
[  248.968231] 9f40:   
0003b832 0001
[  248.972686] 9f60:  c074a8e8 c0728000 c07c42c8 c05065a0 c074bdc8 
 411fc092
[  248.977142] 9f80: c074bfe8  0001 c0729fa0 0003b833 c0015130 
2113 
[  248.981597] [] (__irq_svc+0x44/0x5c) from [] 
(default_idle+0x20/0x44)
[  248.986083] [] (default_idle+0x20/0x44) from [] 
(cpu_idle+0x9c/0x114)
[  248.990539] [] (cpu_idle+0x9c/0x114) from [] 
(start_kernel+0x2b4/0x304)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney

On Sat, Sep 22, 2012 at 05:45:12PM +0200, Frederic Weisbecker wrote:
> 2012/9/22 Paul E. McKenney :
> > On Fri, Sep 21, 2012 at 01:31:49PM -0700, Tony Lindgren wrote:
> >> * Paul E. McKenney  [120921 12:58]:
> >> >
> >> > Just to make sure I understand the combinations:
> >> >
> >> > o   All stalls have happened when running a minimal userspace.
> >> > o   CONFIG_NO_HZ=n suppresses the stalls.
> >> > o   CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
> >> > no observable effect on the stalls.
> >>
> >> The reason why you may need minimal userspace is to cut down
> >> the number of timers waking up the system with NO_HZ.
> >> Booting with init=/bin/sh might also do the trick for that.
> >
> > Good point!  This does make for a very quiet system, but does not
> > reproduce the problem under kvm, even after waiting for four minutes.
> > I will leave it for more time, but it looks like I really might need to
> > ask Linaro for remote access to a Panda.
> 
> I have one. I'm currently installing Ubuntu on it and I'll try to
> manage to build
> a kernel and reproduce the issue.
> 
> I'll give more news soon.

Thank you!

My bet is that you have to have a userspace that is so small that it
registers only a few (but at least one!) RCU callback at boot time,
then never registers any callbacks ever again.  I have coded up a
crude test case, using Tony Lindgren's suggestion of "init=/bin/sh",
but I appear to have inadvertently fixed this bug in current -rcu
(git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git,
branch rcu/next).

But I have been wrong a few times already on this particular bug...

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Frederic Weisbecker

2012/9/22 Paul E. McKenney :
> On Fri, Sep 21, 2012 at 01:31:49PM -0700, Tony Lindgren wrote:
>> * Paul E. McKenney  [120921 12:58]:
>> >
>> > Just to make sure I understand the combinations:
>> >
>> > o   All stalls have happened when running a minimal userspace.
>> > o   CONFIG_NO_HZ=n suppresses the stalls.
>> > o   CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
>> > no observable effect on the stalls.
>>
>> The reason why you may need minimal userspace is to cut down
>> the number of timers waking up the system with NO_HZ.
>> Booting with init=/bin/sh might also do the trick for that.
>
> Good point!  This does make for a very quiet system, but does not
> reproduce the problem under kvm, even after waiting for four minutes.
> I will leave it for more time, but it looks like I really might need to
> ask Linaro for remote access to a Panda.

I have one. I'm currently installing Ubuntu on it and I'll try to
manage to build
a kernel and reproduce the issue.

I'll give more news soon.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Frederic Weisbecker

2012/9/22 Paul E. McKenney paul...@linux.vnet.ibm.com:
 On Fri, Sep 21, 2012 at 01:31:49PM -0700, Tony Lindgren wrote:
 * Paul E. McKenney paul...@linux.vnet.ibm.com [120921 12:58]:
 
  Just to make sure I understand the combinations:
 
  o   All stalls have happened when running a minimal userspace.
  o   CONFIG_NO_HZ=n suppresses the stalls.
  o   CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
  no observable effect on the stalls.

 The reason why you may need minimal userspace is to cut down
 the number of timers waking up the system with NO_HZ.
 Booting with init=/bin/sh might also do the trick for that.

 Good point!  This does make for a very quiet system, but does not
 reproduce the problem under kvm, even after waiting for four minutes.
 I will leave it for more time, but it looks like I really might need to
 ask Linaro for remote access to a Panda.

I have one. I'm currently installing Ubuntu on it and I'll try to
manage to build
a kernel and reproduce the issue.

I'll give more news soon.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney

On Sat, Sep 22, 2012 at 05:45:12PM +0200, Frederic Weisbecker wrote:
 2012/9/22 Paul E. McKenney paul...@linux.vnet.ibm.com:
  On Fri, Sep 21, 2012 at 01:31:49PM -0700, Tony Lindgren wrote:
  * Paul E. McKenney paul...@linux.vnet.ibm.com [120921 12:58]:
  
   Just to make sure I understand the combinations:
  
   o   All stalls have happened when running a minimal userspace.
   o   CONFIG_NO_HZ=n suppresses the stalls.
   o   CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
   no observable effect on the stalls.
 
  The reason why you may need minimal userspace is to cut down
  the number of timers waking up the system with NO_HZ.
  Booting with init=/bin/sh might also do the trick for that.
 
  Good point!  This does make for a very quiet system, but does not
  reproduce the problem under kvm, even after waiting for four minutes.
  I will leave it for more time, but it looks like I really might need to
  ask Linaro for remote access to a Panda.
 
 I have one. I'm currently installing Ubuntu on it and I'll try to
 manage to build
 a kernel and reproduce the issue.
 
 I'll give more news soon.

Thank you!

My bet is that you have to have a userspace that is so small that it
registers only a few (but at least one!) RCU callback at boot time,
then never registers any callbacks ever again.  I have coded up a
crude test case, using Tony Lindgren's suggestion of init=/bin/sh,
but I appear to have inadvertently fixed this bug in current -rcu
(git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git,
branch rcu/next).

But I have been wrong a few times already on this particular bug...

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul Walmsley

Hi Paul

On Fri, 21 Sep 2012, Paul E. McKenney wrote:

 I am wondering if your system somehow figured out how to start a grace
 period that had no RCU callbacks waiting for it.  If that happened,
 then a CONFIG_NO_HZ=y system could in theory get into a state where all
 CPUs are in dyntick-idle mode, so that none of them is doing anything
 to force the grace period to complete.

 That should be easy to diagnose, anyway.  Please see below, which
 includes the earlier diagnostic patch.

Here you go.
 
- Paul

[  248.902618] INFO: rcu_sched self-detected stall on CPU
[  248.905456]  0: (1 ticks this GP) idle=933/1/0 
[  248.907897]   (t=26570 jiffies g=11 c=10 q=0)
[  248.910339] [c001bc90] (unwind_backtrace+0x0/0xf0) from [c00ad800] 
(rcu_check_callbacks+0x220/0x714)
[  248.915527] [c00ad800] (rcu_check_callbacks+0x220/0x714) from [c00532a0] 
(update_process_times+0x38/0x68)
[  248.920928] [c00532a0] (update_process_times+0x38/0x68) from [c008c9e8] 
(tick_sched_timer+0x80/0xec)
[  248.926116] [c008c9e8] (tick_sched_timer+0x80/0xec) from [c0068ed4] 
(__run_hrtimer+0x7c/0x1e0)
[  248.930999] [c0068ed4] (__run_hrtimer+0x7c/0x1e0) from [c0069cb8] 
(hrtimer_interrupt+0x11c/0x2d0)
[  248.936035] [c0069cb8] (hrtimer_interrupt+0x11c/0x2d0) from [c001a3cc] 
(twd_handler+0x30/0x44)
[  248.940948] [c001a3cc] (twd_handler+0x30/0x44) from [c00a7bd0] 
(handle_percpu_devid_irq+0x90/0x13c)
[  248.946075] [c00a7bd0] (handle_percpu_devid_irq+0x90/0x13c) from 
[c00a4344] (generic_handle_irq+0x30/0x48)
[  248.951538] [c00a4344] (generic_handle_irq+0x30/0x48) from [c0014e38] 
(handle_IRQ+0x4c/0xac)
[  248.956329] [c0014e38] (handle_IRQ+0x4c/0xac) from [c00084cc] 
(gic_handle_irq+0x28/0x5c)
[  248.960937] [c00084cc] (gic_handle_irq+0x28/0x5c) from [c04fb1a4] 
(__irq_svc+0x44/0x5c)
[  248.965484] Exception stack(0xc0729f58 to 0xc0729fa0)
[  248.968231] 9f40:   
0003b832 0001
[  248.972686] 9f60:  c074a8e8 c0728000 c07c42c8 c05065a0 c074bdc8 
 411fc092
[  248.977142] 9f80: c074bfe8  0001 c0729fa0 0003b833 c0015130 
2113 
[  248.981597] [c04fb1a4] (__irq_svc+0x44/0x5c) from [c0015130] 
(default_idle+0x20/0x44)
[  248.986083] [c0015130] (default_idle+0x20/0x44) from [c001535c] 
(cpu_idle+0x9c/0x114)
[  248.990539] [c001535c] (cpu_idle+0x9c/0x114) from [c06d77b0] 
(start_kernel+0x2b4/0x304)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul Walmsley

On Fri, 21 Sep 2012, Paul E. McKenney wrote:

 Could you please point me to a recipe for creating a minimal userspace?
 Just in case it is the userspac erather than the architecture/hardware
 that makes the difference.

Tony's suggestion is pretty good.  Note that there may also be differences 
in kernel timers -- either between x86 and ARM architectures, or loaded 
device drivers -- that may confound the problem.

 Just to make sure I understand the combinations:
 
 o All stalls have happened when running a minimal userspace.
 o CONFIG_NO_HZ=n suppresses the stalls.
 o CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
   no observable effect on the stalls.
 
 Did I get that right, or am I missing a combination?

That's correct.

 Indeed, rcu_idle_gp_timer_func() is a bit strange in that it is 
 cancelled upon exit from idle, and therefore should (almost) never 
 actually execute. Its sole purpose is to wake up the CPU.  ;-)

Right.  Just curious, what would wake up the kernel from idle to handle a 
grace period expiration when CONFIG_RCU_FAST_NO_HZ=n?  On a very idle 
system, the time between timer ticks could potentially be several tens of 
seconds.


- Paul
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney

On Sat, Sep 22, 2012 at 06:42:08PM +, Paul Walmsley wrote:
 On Fri, 21 Sep 2012, Paul E. McKenney wrote:
 
  Could you please point me to a recipe for creating a minimal userspace?
  Just in case it is the userspac erather than the architecture/hardware
  that makes the difference.
 
 Tony's suggestion is pretty good.  Note that there may also be differences 
 in kernel timers -- either between x86 and ARM architectures, or loaded 
 device drivers -- that may confound the problem.

For example, there must be at least one RCU callback outstanding after
the boot sequence quiets down.  Of course, the last time I tried Tony's
approach, I was doing it on top of my -rcu stack, so am retrying on
v3.6-rc6.

  Just to make sure I understand the combinations:
  
  o   All stalls have happened when running a minimal userspace.
  o   CONFIG_NO_HZ=n suppresses the stalls.
  o   CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
  no observable effect on the stalls.
  
  Did I get that right, or am I missing a combination?
 
 That's correct.
 
  Indeed, rcu_idle_gp_timer_func() is a bit strange in that it is 
  cancelled upon exit from idle, and therefore should (almost) never 
  actually execute. Its sole purpose is to wake up the CPU.  ;-)
 
 Right.  Just curious, what would wake up the kernel from idle to handle a 
 grace period expiration when CONFIG_RCU_FAST_NO_HZ=n?  On a very idle 
 system, the time between timer ticks could potentially be several tens of 
 seconds.

If CONFIG_RCU_FAST_NO_HZ=n, then CPUs with RCU callbacks are not permitted
to shut off the scheduling-clock tick, so any CPU with RCU callbacks will
be awakened every jiffy.  The problem is that there appears to be a way
to get an RCU grace period started without any CPU having any callbacks,
which, as you surmise, would result in all the CPUs going to sleep and
the grace period never ending.  So if a CPU is awakened for any reason
after this everlasting grace period has extended for more than a minute,
the first thing that CPU will do is print an RCU CPU stall warning.

I believe that I see how to prevent callback-free grace periods from
ever starting.  (Famous last words...)

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney

On Sat, Sep 22, 2012 at 06:16:15PM +, Paul Walmsley wrote:
 Hi Paul
 
 On Fri, 21 Sep 2012, Paul E. McKenney wrote:
 
  I am wondering if your system somehow figured out how to start a grace
  period that had no RCU callbacks waiting for it.  If that happened,
  then a CONFIG_NO_HZ=y system could in theory get into a state where all
  CPUs are in dyntick-idle mode, so that none of them is doing anything
  to force the grace period to complete.
 
  That should be easy to diagnose, anyway.  Please see below, which
  includes the earlier diagnostic patch.
 
 Here you go.
 
 - Paul
 
 [  248.902618] INFO: rcu_sched self-detected stall on CPU
 [  248.905456]  0: (1 ticks this GP) idle=933/1/0 
 [  248.907897]   (t=26570 jiffies g=11 c=10 q=0)

Bingo!!!  (q=0, in case you were wondering.  And thank you for testing this!)

Strangely enough, I believe that I have inadvertently fixed this in
my -rcu tree:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/next

Nevertheless, if you get a chance to try it, I would be interested to
hear if my guess is correct.  The trick is that a kthread drives the
grace period in -rcu, regardless of whether or not there are callbacks.

However, the backport would not be something that -stable would be happy
with, so I will be putting together a fix for mainline.  This thing
has been in the kernel since about 2004, not sure why you didn't hit
it earlier.

Thanx, Paul

 [  248.910339] [c001bc90] (unwind_backtrace+0x0/0xf0) from [c00ad800] 
 (rcu_check_callbacks+0x220/0x714)
 [  248.915527] [c00ad800] (rcu_check_callbacks+0x220/0x714) from 
 [c00532a0] (update_process_times+0x38/0x68)
 [  248.920928] [c00532a0] (update_process_times+0x38/0x68) from 
 [c008c9e8] (tick_sched_timer+0x80/0xec)
 [  248.926116] [c008c9e8] (tick_sched_timer+0x80/0xec) from [c0068ed4] 
 (__run_hrtimer+0x7c/0x1e0)
 [  248.930999] [c0068ed4] (__run_hrtimer+0x7c/0x1e0) from [c0069cb8] 
 (hrtimer_interrupt+0x11c/0x2d0)
 [  248.936035] [c0069cb8] (hrtimer_interrupt+0x11c/0x2d0) from [c001a3cc] 
 (twd_handler+0x30/0x44)
 [  248.940948] [c001a3cc] (twd_handler+0x30/0x44) from [c00a7bd0] 
 (handle_percpu_devid_irq+0x90/0x13c)
 [  248.946075] [c00a7bd0] (handle_percpu_devid_irq+0x90/0x13c) from 
 [c00a4344] (generic_handle_irq+0x30/0x48)
 [  248.951538] [c00a4344] (generic_handle_irq+0x30/0x48) from [c0014e38] 
 (handle_IRQ+0x4c/0xac)
 [  248.956329] [c0014e38] (handle_IRQ+0x4c/0xac) from [c00084cc] 
 (gic_handle_irq+0x28/0x5c)
 [  248.960937] [c00084cc] (gic_handle_irq+0x28/0x5c) from [c04fb1a4] 
 (__irq_svc+0x44/0x5c)
 [  248.965484] Exception stack(0xc0729f58 to 0xc0729fa0)
 [  248.968231] 9f40:   
 0003b832 0001
 [  248.972686] 9f60:  c074a8e8 c0728000 c07c42c8 c05065a0 c074bdc8 
  411fc092
 [  248.977142] 9f80: c074bfe8  0001 c0729fa0 0003b833 c0015130 
 2113 
 [  248.981597] [c04fb1a4] (__irq_svc+0x44/0x5c) from [c0015130] 
 (default_idle+0x20/0x44)
 [  248.986083] [c0015130] (default_idle+0x20/0x44) from [c001535c] 
 (cpu_idle+0x9c/0x114)
 [  248.990539] [c001535c] (cpu_idle+0x9c/0x114) from [c06d77b0] 
 (start_kernel+0x2b4/0x304)
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney

On Sat, Sep 22, 2012 at 01:10:43PM -0700, Paul E. McKenney wrote:
 On Sat, Sep 22, 2012 at 06:42:08PM +, Paul Walmsley wrote:
  On Fri, 21 Sep 2012, Paul E. McKenney wrote:
  
   Could you please point me to a recipe for creating a minimal userspace?
   Just in case it is the userspac erather than the architecture/hardware
   that makes the difference.
  
  Tony's suggestion is pretty good.  Note that there may also be differences 
  in kernel timers -- either between x86 and ARM architectures, or loaded 
  device drivers -- that may confound the problem.
 
 For example, there must be at least one RCU callback outstanding after
 the boot sequence quiets down.  Of course, the last time I tried Tony's
 approach, I was doing it on top of my -rcu stack, so am retrying on
 v3.6-rc6.
 
   Just to make sure I understand the combinations:
   
   o All stalls have happened when running a minimal userspace.
   o CONFIG_NO_HZ=n suppresses the stalls.
   o CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
 no observable effect on the stalls.
   
   Did I get that right, or am I missing a combination?
  
  That's correct.
  
   Indeed, rcu_idle_gp_timer_func() is a bit strange in that it is 
   cancelled upon exit from idle, and therefore should (almost) never 
   actually execute. Its sole purpose is to wake up the CPU.  ;-)
  
  Right.  Just curious, what would wake up the kernel from idle to handle a 
  grace period expiration when CONFIG_RCU_FAST_NO_HZ=n?  On a very idle 
  system, the time between timer ticks could potentially be several tens of 
  seconds.
 
 If CONFIG_RCU_FAST_NO_HZ=n, then CPUs with RCU callbacks are not permitted
 to shut off the scheduling-clock tick, so any CPU with RCU callbacks will
 be awakened every jiffy.  The problem is that there appears to be a way
 to get an RCU grace period started without any CPU having any callbacks,
 which, as you surmise, would result in all the CPUs going to sleep and
 the grace period never ending.  So if a CPU is awakened for any reason
 after this everlasting grace period has extended for more than a minute,
 the first thing that CPU will do is print an RCU CPU stall warning.
 
 I believe that I see how to prevent callback-free grace periods from
 ever starting.  (Famous last words...)

And here is a patch.  I am still having trouble reproducing the problem,
but figured that I should avoid serializing things.

Thanx, Paul



 b/kernel/rcutree.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

rcu: Fix day-one dyntick-idle stall-warning bug

Each grace period is supposed to have at least one callback waiting
for that grace period to complete.  However, if CONFIG_NO_HZ=n, an
extra callback-free grace period is no big problem -- it will chew up
a tiny bit of CPU time, but it will complete normally.  In contrast,
CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to
sleep indefinitely, in turn indefinitely delaying completion of the
callback-free grace period.  Given that nothing is waiting on this grace
period, this is also not a problem.

Unless RCU CPU stall warnings are also enabled, as they are in recent
kernels.  In this case, if a CPU wakes up after at least one minute
of inactivity, an RCU CPU stall warning will result.  The reason that
no one noticed until quite recently is that most systems have enough
OS noise that they will never remain absolutely idle for a full minute.
But there are some embedded systems with cut-down userspace configurations
that get into this mode quite easily.

All this begs the question of exactly how a callback-free grace period
gets started in the first place.  This can happen due to the fact that
CPUs do not necessarily agree on which grace period is in progress.
If a CPU still believes that the grace period that just completed is
still ongoing, it will believe that it has callbacks that need to wait
for another grace period, never mind the fact that the grace period
that they were waiting for just completed.  This CPU can therefore
erroneously decide to start a new grace period.

Once this CPU notices that the earlier grace period completed, it will
invoke its callbacks.  It then won't have any callbacks left.  If no
other CPU has any callbacks, we now have a callback-free grace period.

This commit therefore makes CPUs check more carefully before starting a
new grace period.  This new check relies on an array of tail pointers
into each CPU's list of callbacks.  If the CPU is up to date on which
grace periods have completed, it checks to see if any callbacks follow
the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks
follow the RCU_WAIT_TAIL segment.  The reason that this works is that
the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment
as soon as the CPU figures out that the old grace period has ended.

This change

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul Walmsley

Hi Paul

On Sat, 22 Sep 2012, Paul E. McKenney wrote:

 Strangely enough, I believe that I have inadvertently fixed this in
 my -rcu tree:
 
 git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/next
 
 Nevertheless, if you get a chance to try it, I would be interested to
 hear if my guess is correct.

Yes, good news: the stall warnings go away with that branch.

 The trick is that a kthread drives the grace period in -rcu, regardless 
 of whether or not there are callbacks.

This is rcu: Move quiescent-state forcing into kthread ?

Added some debugging into rcu_gp_kthread() after that commit and can 
confirm that the quiescent-state forcing loop does start a few times when 
there are zero callbacks pending (modulo any races in my measurement 
code).

 However, the backport would not be something that -stable would be happy
 with, so I will be putting together a fix for mainline.  This thing
 has been in the kernel since about 2004, not sure why you didn't hit
 it earlier.

One other data point in that regard - noticed the warnings don't appear 
when the board is booted with:

commit 4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367
Author: Paul E. McKenney paul.mcken...@linaro.org
Date:   Tue Jun 5 15:53:53 2012 -0700

rcu: Fix qlen_lazy breakage

...


- Paul
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul Walmsley

On Sat, 22 Sep 2012, Paul E. McKenney wrote:

 And here is a patch.  I am still having trouble reproducing the problem,
 but figured that I should avoid serializing things.

Thanks, testing this now on v3.6-rc6.  One question though about the patch 
description:

 All this begs the question of exactly how a callback-free grace period
 gets started in the first place.  This can happen due to the fact that
 CPUs do not necessarily agree on which grace period is in progress.
 If a CPU still believes that the grace period that just completed is
 still ongoing, it will believe that it has callbacks that need to wait
 for another grace period, never mind the fact that the grace period
 that they were waiting for just completed.  This CPU can therefore
 erroneously decide to start a new grace period.

Doesn't this imply that this bug would only affect multi-CPU systems?  

The recent tests here have been on Pandaboard, which is dual-CPU, but my 
recollection is that I also observed the warnings on a single-core 
Beagleboard.  Will re-test.


- Paul
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney

On Sat, Sep 22, 2012 at 10:25:59PM +, Paul Walmsley wrote:
 On Sat, 22 Sep 2012, Paul E. McKenney wrote:
 
  And here is a patch.  I am still having trouble reproducing the problem,
  but figured that I should avoid serializing things.
 
 Thanks, testing this now on v3.6-rc6.

Very cool, thank you!

One question though about the patch 
 description:
 
  All this begs the question of exactly how a callback-free grace period
  gets started in the first place.  This can happen due to the fact that
  CPUs do not necessarily agree on which grace period is in progress.
  If a CPU still believes that the grace period that just completed is
  still ongoing, it will believe that it has callbacks that need to wait
  for another grace period, never mind the fact that the grace period
  that they were waiting for just completed.  This CPU can therefore
  erroneously decide to start a new grace period.
 
 Doesn't this imply that this bug would only affect multi-CPU systems?  

Surprisingly not, at least when running TREE_RCU or TREE_PREEMPT_RCU.
In order to keep lock contention down to a dull roar on larger systems,
TREE_RCU keeps three sets of books: (1) the global state in the rcu_state
structure, (2) the combining-tree per-node state in the rcu_node
structure, and the per-CPU state in the rcu_data structure.  A CPU is
not officially aware of the end of a grace period until it is reflected
in its rcu_data structure.  This has the perhaps-surprising consequence
that the CPU that detected the end of the old grace period might start
a new one before becoming officially aware that the old one ended.

Why not have the CPU inform itself immediately upon noticing that the
old grace period ended?  Deadlock.  The rcu_node locks must be acquired
from leaf towards root, and the CPU is holding the root rcu_node lock
when it notices that the grace period has ended.

I have made this a bit less problematic in the bigrt branch, working
towards a goal of getting RCU into a state where automatic formal
validation might one day be possible.  And yes, I am starting to get some
formal-validation people interested in this lofty goal, see for example:
http://sites.google.com/site/popl13grace/paper.pdf.

 The recent tests here have been on Pandaboard, which is dual-CPU, but my 
 recollection is that I also observed the warnings on a single-core 
 Beagleboard.  Will re-test.

Anxiously awaiting the results.  This has been a strange one, even by
RCU's standards.

Plus I need to add a few Reported-by lines.  Next version...

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney

On Sat, Sep 22, 2012 at 10:20:19PM +, Paul Walmsley wrote:
 Hi Paul
 
 On Sat, 22 Sep 2012, Paul E. McKenney wrote:
 
  Strangely enough, I believe that I have inadvertently fixed this in
  my -rcu tree:
  
  git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/next
  
  Nevertheless, if you get a chance to try it, I would be interested to
  hear if my guess is correct.
 
 Yes, good news: the stall warnings go away with that branch.

Very good!

  The trick is that a kthread drives the grace period in -rcu, regardless 
  of whether or not there are callbacks.
 
 This is rcu: Move quiescent-state forcing into kthread ?

Yep, plus the preceding commits moving grace-period initialization and
cleanup into that same kthread.  This was motivated by a bug report
last February complaining about 200-microsecond latency spikes from
RCU grace-period initialization.  On systems with 4096 CPUs.

Real-time response.  It is far bigger than I thought.  ;-)

 Added some debugging into rcu_gp_kthread() after that commit and can 
 confirm that the quiescent-state forcing loop does start a few times when 
 there are zero callbacks pending (modulo any races in my measurement 
 code).

Cool, thank you!  Assuming it works, that indicates that there is long-term
value to the fix for this problem.  On larger systems, extra grace periods
are not what you want, as their expense increases with the number of CPUs.

  However, the backport would not be something that -stable would be happy
  with, so I will be putting together a fix for mainline.  This thing
  has been in the kernel since about 2004, not sure why you didn't hit
  it earlier.
 
 One other data point in that regard - noticed the warnings don't appear 
 when the board is booted with:
 
 commit 4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367
 Author: Paul E. McKenney paul.mcken...@linaro.org
 Date:   Tue Jun 5 15:53:53 2012 -0700
 
 rcu: Fix qlen_lazy breakage

You lost me on this one.  This is already in mainline, so if you were
using (say) 3.6-rc6, you would already have this commit applied.

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul Walmsley

Hi Paul

On Sat, 22 Sep 2012, Paul Walmsley wrote:

 On Sat, 22 Sep 2012, Paul E. McKenney wrote:
 
  And here is a patch.  I am still having trouble reproducing the problem,
  but figured that I should avoid serializing things.
 
 Thanks, testing this now on v3.6-rc6.

Looks like you solved it!

Tested v3.6-rc6 + your stall diagnostic patch:

http://marc.info/?l=linux-arm-kernelm=134827237215882w=2

on OMAP4430ES2 Pandaboard using omap2plus_defconfig and 
CONFIG_RCU_CPU_STALL_INFO=y; got the stall warnings.

Then added rcu: Fix day-one dyntick-idle stall-warning bug from:

http://marc.info/?l=linux-arm-kernelm=134835120600590w=2

Booted that, and the stall warnings did not appear within 30 minutes.

To confirm that the problem being solved matched your hypothesis, the
debugging patch below[1] was added to the RCU idle entry/exit code.

Without the bugfix patch, a boot log transcript was obtained
indicating that the idle loop was entered with tick_nohz_enabled=1
during a grace period with no callbacks present:

http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-pre-fix.txt

The debugging events started to appear at 1.867370 seconds into the
boot.  ENTER was pressed about 464 seconds in; this triggered the
rcu_sched stall traceback.

With the bugfix patch, a boot log transcript was obtained that
indicated that the condition under test never occurred after waiting
about 20 minutes:

http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-post-fix.txt

Thanks for being so willing to root-cause the issue, Paul; it's 
appreciated, and it's been quite instructive as well.  Will address some 
remaining loose ends in follow-up E-mails.


- Paul


[1] Debugging patch to printk() if the previous idle loop entry occurred 
with tick_nohz_enabled=1 during a grace period with no RCU callbacks 
present:


---
 kernel/rcutree.c |   17 +
 1 file changed, 17 insertions(+)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index f1eb7ad..f42941b 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -60,6 +60,9 @@
 
 /* Data structures. */
 
+extern int tick_nohz_enabled;
+static int no_cbs_idle_entry_count;
+
 static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
 
 #define RCU_STATE_INITIALIZER(sname, cr) { \
@@ -400,8 +403,12 @@ void rcu_idle_enter(void)
unsigned long flags;
long long oldval;
struct rcu_dynticks *rdtp;
+   int cpu;
+   long totqlen = 0;
+   struct rcu_data *rdp;
 
local_irq_save(flags);
+   rdp = __get_cpu_var(rcu_sched_data);
rdtp = __get_cpu_var(rcu_dynticks);
oldval = rdtp-dynticks_nesting;
WARN_ON_ONCE((oldval  DYNTICK_TASK_NEST_MASK) == 0);
@@ -410,6 +417,12 @@ void rcu_idle_enter(void)
else
rdtp-dynticks_nesting -= DYNTICK_TASK_NEST_VALUE;
rcu_idle_enter_common(rdtp, oldval);
+   if (tick_nohz_enabled  rcu_gp_in_progress(rdp-rsp)) {
+   for_each_possible_cpu(cpu)
+   totqlen += per_cpu_ptr(rdp-rsp-rda, cpu)-qlen;
+   if (totqlen == 0)
+   no_cbs_idle_entry_count = 1;
+   }
local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
@@ -503,6 +516,10 @@ void rcu_idle_exit(void)
rdtp-dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
rcu_idle_exit_common(rdtp, oldval);
local_irq_restore(flags);
+   if (no_cbs_idle_entry_count) {
+   no_cbs_idle_entry_count = 0;
+   pr_err(* Tickless idle was entered with zero RCU callbacks\n);
+   }
 }
 EXPORT_SYMBOL_GPL(rcu_idle_exit);
 
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney

On Sun, Sep 23, 2012 at 01:42:10AM +, Paul Walmsley wrote:
 Hi Paul
 
 On Sat, 22 Sep 2012, Paul Walmsley wrote:
 
  On Sat, 22 Sep 2012, Paul E. McKenney wrote:
  
   And here is a patch.  I am still having trouble reproducing the problem,
   but figured that I should avoid serializing things.
  
  Thanks, testing this now on v3.6-rc6.
 
 Looks like you solved it!
 
 Tested v3.6-rc6 + your stall diagnostic patch:
 
 http://marc.info/?l=linux-arm-kernelm=134827237215882w=2
 
 on OMAP4430ES2 Pandaboard using omap2plus_defconfig and 
 CONFIG_RCU_CPU_STALL_INFO=y; got the stall warnings.
 
 Then added rcu: Fix day-one dyntick-idle stall-warning bug from:
 
 http://marc.info/?l=linux-arm-kernelm=134835120600590w=2
 
 Booted that, and the stall warnings did not appear within 30 minutes.

Very cool, thank you for your testing efforts!!!

May I apply your Tested-by to this patch?

And good show on the debugging patch -- it is quite good to have such
solid evidence that the bug that the fix was intended for was actually
occurring.

Thanx, Paul

 To confirm that the problem being solved matched your hypothesis, the
 debugging patch below[1] was added to the RCU idle entry/exit code.
 
 Without the bugfix patch, a boot log transcript was obtained
 indicating that the idle loop was entered with tick_nohz_enabled=1
 during a grace period with no callbacks present:
 
 http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-pre-fix.txt
 
 The debugging events started to appear at 1.867370 seconds into the
 boot.  ENTER was pressed about 464 seconds in; this triggered the
 rcu_sched stall traceback.
 
 With the bugfix patch, a boot log transcript was obtained that
 indicated that the condition under test never occurred after waiting
 about 20 minutes:
 
 
 http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-post-fix.txt
 
 Thanks for being so willing to root-cause the issue, Paul; it's 
 appreciated, and it's been quite instructive as well.  Will address some 
 remaining loose ends in follow-up E-mails.
 
 
 - Paul
 
 
 [1] Debugging patch to printk() if the previous idle loop entry occurred 
 with tick_nohz_enabled=1 during a grace period with no RCU callbacks 
 present:
 
 
 ---
  kernel/rcutree.c |   17 +
  1 file changed, 17 insertions(+)
 
 diff --git a/kernel/rcutree.c b/kernel/rcutree.c
 index f1eb7ad..f42941b 100644
 --- a/kernel/rcutree.c
 +++ b/kernel/rcutree.c
 @@ -60,6 +60,9 @@
 
  /* Data structures. */
 
 +extern int tick_nohz_enabled;
 +static int no_cbs_idle_entry_count;
 +
  static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
 
  #define RCU_STATE_INITIALIZER(sname, cr) { \
 @@ -400,8 +403,12 @@ void rcu_idle_enter(void)
   unsigned long flags;
   long long oldval;
   struct rcu_dynticks *rdtp;
 + int cpu;
 + long totqlen = 0;
 + struct rcu_data *rdp;
 
   local_irq_save(flags);
 + rdp = __get_cpu_var(rcu_sched_data);
   rdtp = __get_cpu_var(rcu_dynticks);
   oldval = rdtp-dynticks_nesting;
   WARN_ON_ONCE((oldval  DYNTICK_TASK_NEST_MASK) == 0);
 @@ -410,6 +417,12 @@ void rcu_idle_enter(void)
   else
   rdtp-dynticks_nesting -= DYNTICK_TASK_NEST_VALUE;
   rcu_idle_enter_common(rdtp, oldval);
 + if (tick_nohz_enabled  rcu_gp_in_progress(rdp-rsp)) {
 + for_each_possible_cpu(cpu)
 + totqlen += per_cpu_ptr(rdp-rsp-rda, cpu)-qlen;
 + if (totqlen == 0)
 + no_cbs_idle_entry_count = 1;
 + }
   local_irq_restore(flags);
  }
  EXPORT_SYMBOL_GPL(rcu_idle_enter);
 @@ -503,6 +516,10 @@ void rcu_idle_exit(void)
   rdtp-dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
   rcu_idle_exit_common(rdtp, oldval);
   local_irq_restore(flags);
 + if (no_cbs_idle_entry_count) {
 + no_cbs_idle_entry_count = 0;
 + pr_err(* Tickless idle was entered with zero RCU callbacks\n);
 + }
  }
  EXPORT_SYMBOL_GPL(rcu_idle_exit);
 
 -- 
 1.7.10.4
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul Walmsley

On Sat, 22 Sep 2012, Paul E. McKenney wrote:

 Very cool, thank you for your testing efforts!!!

You're welcome.

 May I apply your Tested-by to this patch?

Please do:

Tested-by: Paul Walmsley p...@pwsan.com  # OMAP4430

Am testing on OMAP3730 (single-core) now.


- Paul
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul E. McKenney

On Fri, Sep 21, 2012 at 10:41:14PM +, Paul Walmsley wrote:
> On Fri, 21 Sep 2012, Paul E. McKenney wrote:
> 
> > On Fri, Sep 21, 2012 at 05:47:31PM +, Paul Walmsley wrote:
> >
> > > I built an OMAP kernel from Linus' commit 
> > > 4651afbbae968772efd6dc4ba461cba9b49bb9d8 ("Merge branch 'for-3.6-fixes' 
> > > of 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq").  The config used 
> > > was 'omap2plus_defconfig', and enabled CONFIG_CPU_IDLE by hand.  Booted 
> > > it 
> > > on a Pandaboard (OMAP4430ES2) into a very minimal Debian rootfs.
> > 
> > Did you have the patch at https://lkml.org/lkml/2012/8/30/290 applied?
> 
> No, it's just as described above.
> 
> > If not, could you please try it?  (This patch cleared up a similar
> > problem for Becky, also on OMAP.)
> 
> Did not seem to help, either with or without CONFIG_CPU_IDLE.

I was hoping!  ;-)

And my init=/bin/sh kernel ran idle for more than an hour without
any RCU CPU stall warnings...

I am wondering if your system somehow figured out how to start a grace
period that had no RCU callbacks waiting for it.  If that happened,
then a CONFIG_NO_HZ=y system could in theory get into a state where all
CPUs are in dyntick-idle mode, so that none of them is doing anything
to force the grace period to complete.

That should be easy to diagnose, anyway.  Please see below, which
includes the earlier diagnostic patch.

Thanx, Paul



diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 307caf1..696f189 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -879,6 +879,7 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
unsigned long flags;
int ndetected = 0;
struct rcu_node *rnp = rcu_get_root(rsp);
+   long totqlen = 0;
 
/* Only let one CPU complain about others per time interval. */
 
@@ -923,8 +924,11 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
raw_spin_unlock_irqrestore(>lock, flags);
 
print_cpu_stall_info_end();
-   printk(KERN_CONT "(detected by %d, t=%ld jiffies)\n",
-  smp_processor_id(), (long)(jiffies - rsp->gp_start));
+   for_each_possible_cpu(cpu)
+   totqlen += per_cpu_ptr(rsp->rda, cpu)->qlen;
+   pr_cont("(detected by %d, t=%ld jiffies, g=%lu, c=%lu, q=%lu)\n",
+  smp_processor_id(), (long)(jiffies - rsp->gp_start),
+  rsp->gpnum, rsp->completed, totqlen);
if (ndetected == 0)
printk(KERN_ERR "INFO: Stall ended before state dump start\n");
else if (!trigger_all_cpu_backtrace())
@@ -939,8 +943,10 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
 
 static void print_cpu_stall(struct rcu_state *rsp)
 {
+   int cpu;
unsigned long flags;
struct rcu_node *rnp = rcu_get_root(rsp);
+   long totqlen = 0;
 
/*
 * OK, time to rat on ourselves...
@@ -951,7 +957,10 @@ static void print_cpu_stall(struct rcu_state *rsp)
print_cpu_stall_info_begin();
print_cpu_stall_info(rsp, smp_processor_id());
print_cpu_stall_info_end();
-   printk(KERN_CONT " (t=%lu jiffies)\n", jiffies - rsp->gp_start);
+   for_each_possible_cpu(cpu)
+   totqlen += per_cpu_ptr(rsp->rda, cpu)->qlen;
+   pr_cont(" (t=%lu jiffies g=%lu c=%lu q=%lu)\n",
+   jiffies - rsp->gp_start, rsp->gpnum, rsp->completed, totqlen);
if (!trigger_all_cpu_backtrace())
dump_stack();
 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul Walmsley

On Fri, 21 Sep 2012, Paul E. McKenney wrote:

> On Fri, Sep 21, 2012 at 05:47:31PM +, Paul Walmsley wrote:
>
> > I built an OMAP kernel from Linus' commit 
> > 4651afbbae968772efd6dc4ba461cba9b49bb9d8 ("Merge branch 'for-3.6-fixes' of 
> > git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq").  The config used 
> > was 'omap2plus_defconfig', and enabled CONFIG_CPU_IDLE by hand.  Booted it 
> > on a Pandaboard (OMAP4430ES2) into a very minimal Debian rootfs.
> 
> Did you have the patch at https://lkml.org/lkml/2012/8/30/290 applied?

No, it's just as described above.

> If not, could you please try it?  (This patch cleared up a similar
> problem for Becky, also on OMAP.)

Did not seem to help, either with or without CONFIG_CPU_IDLE.


- Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul E. McKenney

On Fri, Sep 21, 2012 at 12:57:17PM -0700, Paul E. McKenney wrote:
> On Fri, Sep 21, 2012 at 07:11:14PM +, Paul Walmsley wrote:
> > On Fri, 21 Sep 2012, Paul E. McKenney wrote:

[ . . . ]

> > > I may take your advice of remote access to a Panda board, though that
> > > is likely to take a bit of time due to timezones.  Regardless of the
> > > underlying issue here, I clearly need to make the stall-warning messages
> > > do a better job of printing out needed information.
> > 
> > If you've got a patch in mind for that, I'll boot it here.
> 
> Hammering it out, will send it along when it is a bit less destructive.  ;-)

And I eventually decided to keep things simple, please see below.
This applies on top of https://lkml.org/lkml/2012/8/30/290.

Thanx, Paul



rcu: Add grace-period information to RCU CPU stall warnings

This commit causes the last grace period started and completed to be
printed on RCU CPU stall warning messages in order to aid diagnosis.

Signed-off-by: Paul E. McKenney 
Signed-off-by: Paul E. McKenney 

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index cdaa7aa..fdb6854 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -925,8 +925,9 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
raw_spin_unlock_irqrestore(>lock, flags);
 
print_cpu_stall_info_end();
-   printk(KERN_CONT "(detected by %d, t=%ld jiffies)\n",
-  smp_processor_id(), (long)(jiffies - rsp->gp_start));
+   pr_cont("(detected by %d, t=%ld jiffies, g=%lu, c=%lu)\n",
+  smp_processor_id(), (long)(jiffies - rsp->gp_start),
+  rsp->gpnum, rsp->completed);
if (ndetected == 0)
printk(KERN_ERR "INFO: Stall ended before state dump start\n");
else if (!trigger_all_cpu_backtrace())
@@ -953,7 +954,8 @@ static void print_cpu_stall(struct rcu_state *rsp)
print_cpu_stall_info_begin();
print_cpu_stall_info(rsp, smp_processor_id());
print_cpu_stall_info_end();
-   printk(KERN_CONT " (t=%lu jiffies)\n", jiffies - rsp->gp_start);
+   pr_cont(" (t=%lu jiffies g=%lu c=%lu)\n",
+   jiffies - rsp->gp_start, rsp->gpnum, rsp->completed);
if (!trigger_all_cpu_backtrace())
dump_stack();
 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul E. McKenney

On Fri, Sep 21, 2012 at 01:31:49PM -0700, Tony Lindgren wrote:
> * Paul E. McKenney  [120921 12:58]:
> > 
> > Just to make sure I understand the combinations:
> > 
> > o   All stalls have happened when running a minimal userspace.
> > o   CONFIG_NO_HZ=n suppresses the stalls.
> > o   CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
> > no observable effect on the stalls.
> 
> The reason why you may need minimal userspace is to cut down
> the number of timers waking up the system with NO_HZ.
> Booting with init=/bin/sh might also do the trick for that.

Good point!  This does make for a very quiet system, but does not
reproduce the problem under kvm, even after waiting for four minutes.
I will leave it for more time, but it looks like I really might need to
ask Linaro for remote access to a Panda.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul E. McKenney

On Fri, Sep 21, 2012 at 05:47:31PM +, Paul Walmsley wrote:
> Hi Paul
> 
> On Thu, 20 Sep 2012, Paul Walmsley wrote:
> 
> > On Thu, 20 Sep 2012, Paul E. McKenney wrote:
> >
> > > Paul Walmsley, please let me know if the config below doesn't clear things
> > > up for you or if there is some reason why this config is infeasible.
> > 
> > Will certainly test it here once things settle down.
> 
> I built an OMAP kernel from Linus' commit 
> 4651afbbae968772efd6dc4ba461cba9b49bb9d8 ("Merge branch 'for-3.6-fixes' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq").  The config used 
> was 'omap2plus_defconfig', and enabled CONFIG_CPU_IDLE by hand.  Booted it 
> on a Pandaboard (OMAP4430ES2) into a very minimal Debian rootfs.

Did you have the patch at https://lkml.org/lkml/2012/8/30/290 applied?
If not, could you please try it?  (This patch cleared up a similar
problem for Becky, also on OMAP.)

Thanx, Paul

> Long story short, the rcu_sched stall messages continue to appear.  
> Transcript below.
> 
> The above config should be easy to reproduce from mainline.  Paul, not 
> sure if you're still working with Linaro, but you should be able to get 
> access to a Pandaboard through them to try it yourself, should you wish.
> 
> 
> - Paul
> 
> ## Booting kernel from Legacy Image at 8200 ...   
> 
>Image Name:   Linux-3.6.0-rc6-00025-g4651afb   
> 
>Image Type:   ARM Linux Kernel Image (uncompressed)
> 
>Data Size:4061576 Bytes = 3.9 MiB  
> 
>Load Address: 80008000 
> 
>Entry Point:  80008000 
> 
>Verifying Checksum ... OK  
> 
>Loading Kernel Image ... OK
> 
> OK
> 
>   
> 
> Starting kernel ...   
> 
>   
> 
> Uncompressing Linux... done, booting the kernel.  
> 
> [0.00] Booting Linux on physical CPU 0
> 
> [0.00] Linux version 3.6.0-rc6-00025-g4651afb (paul@dusk) (gcc 
> version 4.5.1 (Sourcery G++ Lite 2010.09-50) ) #342 SMP Fri Sep 21 11:28:00 
> MDT 2012  
>
> [0.00] CPU: ARMv7 Processor [411fc092] revision 2 (ARMv7), 
> cr=10c53c7d
> [0.00] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing 
> instruction cache   
> [0.00] Machine: OMAP4 Panda board 
> 
> [0.00] debug: ignoring loglevel setting.  
> 
> [0.00] Reserving 16777216 bytes SDRAM for VRAM
> 
> [0.00] Memory policy: ECC disabled, Data cache writealloc
> [0.00] On node 0 totalpages: 126720   
> 
> [0.00] free_area_init_node: node 0, pgdat c07c6140, node_mem_map 
> c0d23000
> [0.00]   Normal zone: 1024 pages used for memmap
> [0.00]   Normal zone: 0 pages reserved
> [0.00]   Normal zone: 125696 pages, LIFO batch:31
> [0.00] OMAP4430 ES2.0
> [0.00] PERCPU: Embedded 9 pages/cpu @c112b000 s14016 r8192 d14656 
> u36864
> [0.00] pcpu-alloc: s14016 r8192 d14656 u36864 alloc=9*4096
> [0.00] pcpu-alloc: [0] 0 [0] 1 
> [0.00] Built 1 zonelists in Zone order, mobility grouping on.  Total 
> pages: 125696
> [0.00] Kernel command line: console=ttyO2,230400n8 vram=16M 
> root=/dev/mmcblk0p2 rw rootfstype=ext3 rootwait earlyprintk debug 
> ignore_loglevel
> [0.00] PID hash table entries: 2048 (order: 1, 8192 bytes)
> [0.00] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
> [0.00] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
> [0.00] Memory: 495MB = 495MB total
> [0.00] Memory: 488828k/488828k available, 35460k reserved, 0K highmem
> [0.00] Virtual kernel memory layout:
> [0.00] vector  : 0x - 0x1000   (   4 kB)
> [

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Tony Lindgren

* Paul E. McKenney  [120921 12:58]:
> 
> Just to make sure I understand the combinations:
> 
> o All stalls have happened when running a minimal userspace.
> o CONFIG_NO_HZ=n suppresses the stalls.
> o CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
>   no observable effect on the stalls.

The reason why you may need minimal userspace is to cut down
the number of timers waking up the system with NO_HZ.
Booting with init=/bin/sh might also do the trick for that.

Regards,

Tony
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul E. McKenney

On Fri, Sep 21, 2012 at 07:11:14PM +, Paul Walmsley wrote:
> On Fri, 21 Sep 2012, Paul E. McKenney wrote:
> 
> > On Fri, Sep 21, 2012 at 06:08:59PM +, Paul Walmsley wrote:
> > 
> > > As far as I know, our only idle entry point is in 
> > > arch/arm/common/process.c:cpu_idle().
> > 
> > In mainline, this is arch/arm/kernel/process.c, correct?
> 
> Indeed; sorry about that, mistyped.

No problem!

> > > Looking at the x86 idle entry, they call rcu_idle_{enter,exit}() inside 
> > > {stop,start}_critical_timings().  Making that change here didn't help.
> > 
> > The reason x86 does this is that they have idle notifiers deeper in the
> > idle loop that use RCU read-side critical sections.  So this was an
> > expected result.
> 
> OK
> 
> > > Also tried commenting out the code from the stop_critical_timings() call 
> > > to the WARN_ON(irqs_disabled()), and adding a local_irq_enable().  That 
> > > also didn't help, which suggests that the problem is not caused by the 
> > > OMAP-specific PM idle code.
> > 
> > I must admit that you make a convincing case here.  Though it does leave
> > me wondering what is different about Panda (and MX28, IIRC).
> 
> Given the dependency on CONFIG_NO_HZ, the stalls are probably dependent on 
> the userspace in use.  The userspaces here are quite minimal and so allow 
> the system to stay idle for relatively long periods of time.

Could you please point me to a recipe for creating a minimal userspace?
Just in case it is the userspac erather than the architecture/hardware
that makes the difference.

> > I may take your advice of remote access to a Panda board, though that
> > is likely to take a bit of time due to timezones.  Regardless of the
> > underlying issue here, I clearly need to make the stall-warning messages
> > do a better job of printing out needed information.
> 
> If you've got a patch in mind for that, I'll boot it here.

Hammering it out, will send it along when it is a bit less destructive.  ;-)

> One other observation.  omap2plus_defconfig sets CONFIG_NO_HZ=y but 
> doesn't set CONFIG_RCU_FAST_NO_HZ.  The stall warning messages still 
> appear when CONFIG_RCU_FAST_NO_HZ=y.  One of them is attached below (with 
> CONFIG_RCU_CPU_STALL_INFO set as well, obviously).

Just to make sure I understand the combinations:

o   All stalls have happened when running a minimal userspace.
o   CONFIG_NO_HZ=n suppresses the stalls.
o   CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
no observable effect on the stalls.

Did I get that right, or am I missing a combination?

> As an aside, in the CONFIG_RCU_FAST_NO_HZ=y build, I dropped a printk() 
> into rcu_idle_gp_timer_func() and it doesn't look like it ever executed.

Indeed, rcu_idle_gp_timer_func() is a bit strange in that it is cancelled
upon exit from idle, and therefore should (almost) never actually execute.
Its sole purpose is to wake up the CPU.  ;-)

Thanx, Paul

> - Paul
> 
> [  305.832000] INFO: rcu_sched self-detected stall on CPU
> [  305.834838]  1: (2 GPs behind) idle=5b1/1/0 drain=0 . timer=4294967295
> [  305.838378]   (t=17463 jiffies)
> [  305.840118] [] (unwind_backtrace+0x0/0xf0) from [] 
> (rcu_pending+0xd0/0x540)
> [  305.844848] [] (rcu_pending+0xd0/0x540) from [] 
> (rcu_check_callbacks+0x110/0x198)
> [  305.849884] [] (rcu_check_callbacks+0x110/0x198) from 
> [] (update_process_times+0x38/0x68)
> [  305.855285] [] (update_process_times+0x38/0x68) from 
> [] (tick_sched_timer+0x80/0xec)
> [  305.860473] [] (tick_sched_timer+0x80/0xec) from [] 
> (__run_hrtimer+0x7c/0x1e0)
> [  305.865356] [] (__run_hrtimer+0x7c/0x1e0) from [] 
> (hrtimer_interrupt+0x11c/0x2d0)
> [  305.870361] [] (hrtimer_interrupt+0x11c/0x2d0) from [] 
> (twd_handler+0x30/0x44)
> [  305.875274] [] (twd_handler+0x30/0x44) from [] 
> (handle_percpu_devid_irq+0x90/0x13c)
> [  305.880371] [] (handle_percpu_devid_irq+0x90/0x13c) from 
> [] (generic_handle_irq+0x30/0x48)
> [  305.885833] [] (generic_handle_irq+0x30/0x48) from [] 
> (handle_IRQ+0x4c/0xac)
> [  305.890624] [] (handle_IRQ+0x4c/0xac) from [] 
> (gic_handle_irq+0x28/0x5c)
> [  305.895233] [] (gic_handle_irq+0x28/0x5c) from [] 
> (__irq_svc+0x44/0x5c)
> [  305.899780] Exception stack(0xde86ff88 to 0xde86ffd0)
> [  305.902526] ff80:   0004c062 0001  de8660c0 
> de86e000 c07c42c8
> [  305.906982] ffa0: c05075a0 c074bdd0  411fc092 c074bff0  
> 0001 de86ffd0
> [  305.911437] ffc0: 0004c063 c00152b0 2113 
> [  305.914184] [] (__irq_svc+0x44/0x5c) from [] 
> (default_idle+0x20/0x44)
> [  305.918640] [] (default_idle+0x20/0x44) from [] 
> (cpu_idle+0x9c/0x114)
> [  305.923126] [] (cpu_idle+0x9c/0x114) from [<804f4a34>] 
> (0x804f4a34)
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul Walmsley

On Fri, 21 Sep 2012, Paul E. McKenney wrote:

> On Fri, Sep 21, 2012 at 06:08:59PM +, Paul Walmsley wrote:
> 
> > As far as I know, our only idle entry point is in 
> > arch/arm/common/process.c:cpu_idle().
> 
> In mainline, this is arch/arm/kernel/process.c, correct?

Indeed; sorry about that, mistyped.

> > Looking at the x86 idle entry, they call rcu_idle_{enter,exit}() inside 
> > {stop,start}_critical_timings().  Making that change here didn't help.
> 
> The reason x86 does this is that they have idle notifiers deeper in the
> idle loop that use RCU read-side critical sections.  So this was an
> expected result.

OK

> > Also tried commenting out the code from the stop_critical_timings() call 
> > to the WARN_ON(irqs_disabled()), and adding a local_irq_enable().  That 
> > also didn't help, which suggests that the problem is not caused by the 
> > OMAP-specific PM idle code.
> 
> I must admit that you make a convincing case here.  Though it does leave
> me wondering what is different about Panda (and MX28, IIRC).

Given the dependency on CONFIG_NO_HZ, the stalls are probably dependent on 
the userspace in use.  The userspaces here are quite minimal and so allow 
the system to stay idle for relatively long periods of time.

> I may take your advice of remote access to a Panda board, though that
> is likely to take a bit of time due to timezones.  Regardless of the
> underlying issue here, I clearly need to make the stall-warning messages
> do a better job of printing out needed information.

If you've got a patch in mind for that, I'll boot it here.

One other observation.  omap2plus_defconfig sets CONFIG_NO_HZ=y but 
doesn't set CONFIG_RCU_FAST_NO_HZ.  The stall warning messages still 
appear when CONFIG_RCU_FAST_NO_HZ=y.  One of them is attached below (with 
CONFIG_RCU_CPU_STALL_INFO set as well, obviously).

As an aside, in the CONFIG_RCU_FAST_NO_HZ=y build, I dropped a printk() 
into rcu_idle_gp_timer_func() and it doesn't look like it ever executed.


- Paul

[  305.832000] INFO: rcu_sched self-detected stall on CPU
[  305.834838]  1: (2 GPs behind) idle=5b1/1/0 drain=0 . timer=4294967295
[  305.838378]   (t=17463 jiffies)
[  305.840118] [] (unwind_backtrace+0x0/0xf0) from [] 
(rcu_pending+0xd0/0x540)
[  305.844848] [] (rcu_pending+0xd0/0x540) from [] 
(rcu_check_callbacks+0x110/0x198)
[  305.849884] [] (rcu_check_callbacks+0x110/0x198) from [] 
(update_process_times+0x38/0x68)
[  305.855285] [] (update_process_times+0x38/0x68) from [] 
(tick_sched_timer+0x80/0xec)
[  305.860473] [] (tick_sched_timer+0x80/0xec) from [] 
(__run_hrtimer+0x7c/0x1e0)
[  305.865356] [] (__run_hrtimer+0x7c/0x1e0) from [] 
(hrtimer_interrupt+0x11c/0x2d0)
[  305.870361] [] (hrtimer_interrupt+0x11c/0x2d0) from [] 
(twd_handler+0x30/0x44)
[  305.875274] [] (twd_handler+0x30/0x44) from [] 
(handle_percpu_devid_irq+0x90/0x13c)
[  305.880371] [] (handle_percpu_devid_irq+0x90/0x13c) from 
[] (generic_handle_irq+0x30/0x48)
[  305.885833] [] (generic_handle_irq+0x30/0x48) from [] 
(handle_IRQ+0x4c/0xac)
[  305.890624] [] (handle_IRQ+0x4c/0xac) from [] 
(gic_handle_irq+0x28/0x5c)
[  305.895233] [] (gic_handle_irq+0x28/0x5c) from [] 
(__irq_svc+0x44/0x5c)
[  305.899780] Exception stack(0xde86ff88 to 0xde86ffd0)
[  305.902526] ff80:   0004c062 0001  de8660c0 
de86e000 c07c42c8
[  305.906982] ffa0: c05075a0 c074bdd0  411fc092 c074bff0  
0001 de86ffd0
[  305.911437] ffc0: 0004c063 c00152b0 2113 
[  305.914184] [] (__irq_svc+0x44/0x5c) from [] 
(default_idle+0x20/0x44)
[  305.918640] [] (default_idle+0x20/0x44) from [] 
(cpu_idle+0x9c/0x114)
[  305.923126] [] (cpu_idle+0x9c/0x114) from [<804f4a34>] (0x804f4a34)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul Walmsley

Hi

Just did a test with CONFIG_NO_HZ=n, and no rcu_sched stall messages 
appeared for 60 minutes.

Following is the diff from omap2plus_defconfig.

- Paul

--- .config 2012-09-21 12:51:19.114938954 -0600
+++ .testconfig 2012-09-21 12:51:15.686926318 -0600
@@ -69,7 +69,7 @@
 # Timers subsystem
 #
 CONFIG_TICK_ONESHOT=y
-CONFIG_NO_HZ=y
+# CONFIG_NO_HZ is not set
 CONFIG_HIGH_RES_TIMERS=y
 
 #
@@ -80,7 +80,6 @@
 CONFIG_RCU_FANOUT=32
 CONFIG_RCU_FANOUT_LEAF=16
 # CONFIG_RCU_FANOUT_EXACT is not set
-# CONFIG_RCU_FAST_NO_HZ is not set
 # CONFIG_TREE_RCU_TRACE is not set
 CONFIG_IKCONFIG=y
 CONFIG_IKCONFIG_PROC=y
@@ -2535,7 +2534,7 @@
 # CONFIG_BOOT_PRINTK_DELAY is not set
 # CONFIG_RCU_TORTURE_TEST is not set
 CONFIG_RCU_CPU_STALL_TIMEOUT=60
-# CONFIG_RCU_CPU_STALL_INFO is not set
+CONFIG_RCU_CPU_STALL_INFO=y
 # CONFIG_RCU_TRACE is not set
 # CONFIG_KPROBES_SANITY_TEST is not set
 # CONFIG_BACKTRACE_SELF_TEST is not set

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul E. McKenney

On Fri, Sep 21, 2012 at 06:08:59PM +, Paul Walmsley wrote:
> cc Frederic Weisbecker - context is here: 
> 
>http://marc.info/?l=linux-kernel=134749030206016=2
> 
> On Thu, 20 Sep 2012, Paul E. McKenney wrote:
> 
> > Fair point.  I am wondering whether there is some path into the idle
> > loop that somehow avoids telling RCU that the CPU has in face entered
> > idle.  There needs to be an rcu_idle_enter() call on the way to idle,
> > otherwise RCU CPU stall warnings are expected behavior.
> 
> As far as I know, our only idle entry point is in 
> arch/arm/common/process.c:cpu_idle().

In mainline, this is arch/arm/kernel/process.c, correct?

> Looking at the x86 idle entry, they call rcu_idle_{enter,exit}() inside 
> {stop,start}_critical_timings().  Making that change here didn't help.

The reason x86 does this is that they have idle notifiers deeper in the
idle loop that use RCU read-side critical sections.  So this was an
expected result.

> Also tried commenting out the code from the stop_critical_timings() call 
> to the WARN_ON(irqs_disabled()), and adding a local_irq_enable().  That 
> also didn't help, which suggests that the problem is not caused by the 
> OMAP-specific PM idle code.

I must admit that you make a convincing case here.  Though it does leave
me wondering what is different about Panda (and MX28, IIRC).

I may take your advice of remote access to a Panda board, though that
is likely to take a bit of time due to timezones.  Regardless of the
underlying issue here, I clearly need to make the stall-warning messages
do a better job of printing out needed information.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul Walmsley

cc Frederic Weisbecker - context is here: 

   http://marc.info/?l=linux-kernel=134749030206016=2

On Thu, 20 Sep 2012, Paul E. McKenney wrote:

> Fair point.  I am wondering whether there is some path into the idle
> loop that somehow avoids telling RCU that the CPU has in face entered
> idle.  There needs to be an rcu_idle_enter() call on the way to idle,
> otherwise RCU CPU stall warnings are expected behavior.

As far as I know, our only idle entry point is in 
arch/arm/common/process.c:cpu_idle().

Looking at the x86 idle entry, they call rcu_idle_{enter,exit}() inside 
{stop,start}_critical_timings().  Making that change here didn't help.

Also tried commenting out the code from the stop_critical_timings() call 
to the WARN_ON(irqs_disabled()), and adding a local_irq_enable().  That 
also didn't help, which suggests that the problem is not caused by the 
OMAP-specific PM idle code.

- Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul Walmsley

On Fri, 21 Sep 2012, Paul Walmsley wrote:

> The config used was 'omap2plus_defconfig', and enabled CONFIG_CPU_IDLE 
> by hand. 

One other thing I forgot to mention - CONFIG_RCU_CPU_STALL_INFO was 
enabled by hand also.  Below is the diff between omap2plus_defconfig and 
the config that was used here.


- Paul

--- .config 2012-09-21 11:50:19.565972113 -0600
+++ .testconfig 2012-09-21 11:50:12.233946950 -0600
@@ -536,7 +536,9 @@
 # CPU Frequency scaling
 #
 # CONFIG_CPU_FREQ is not set
-# CONFIG_CPU_IDLE is not set
+CONFIG_CPU_IDLE=y
+CONFIG_CPU_IDLE_GOV_LADDER=y
+CONFIG_CPU_IDLE_GOV_MENU=y
 CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED=y
 
 #
@@ -2535,7 +2537,7 @@
 # CONFIG_BOOT_PRINTK_DELAY is not set
 # CONFIG_RCU_TORTURE_TEST is not set
 CONFIG_RCU_CPU_STALL_TIMEOUT=60
-# CONFIG_RCU_CPU_STALL_INFO is not set
+CONFIG_RCU_CPU_STALL_INFO=y
 # CONFIG_RCU_TRACE is not set
 # CONFIG_KPROBES_SANITY_TEST is not set
 # CONFIG_BACKTRACE_SELF_TEST is not set

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul Walmsley

Hi Paul

On Thu, 20 Sep 2012, Paul Walmsley wrote:

> On Thu, 20 Sep 2012, Paul E. McKenney wrote:
>
> > Paul Walmsley, please let me know if the config below doesn't clear things
> > up for you or if there is some reason why this config is infeasible.
> 
> Will certainly test it here once things settle down.

I built an OMAP kernel from Linus' commit 
4651afbbae968772efd6dc4ba461cba9b49bb9d8 ("Merge branch 'for-3.6-fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq").  The config used 
was 'omap2plus_defconfig', and enabled CONFIG_CPU_IDLE by hand.  Booted it 
on a Pandaboard (OMAP4430ES2) into a very minimal Debian rootfs.

Long story short, the rcu_sched stall messages continue to appear.  
Transcript below.

The above config should be easy to reproduce from mainline.  Paul, not 
sure if you're still working with Linaro, but you should be able to get 
access to a Pandaboard through them to try it yourself, should you wish.


- Paul

## Booting kernel from Legacy Image at 8200 ... 
  
   Image Name:   Linux-3.6.0-rc6-00025-g4651afb 
  
   Image Type:   ARM Linux Kernel Image (uncompressed)  
  
   Data Size:4061576 Bytes = 3.9 MiB
  
   Load Address: 80008000   
  
   Entry Point:  80008000   
  
   Verifying Checksum ... OK
  
   Loading Kernel Image ... OK  
  
OK  
  

  
Starting kernel ... 
  

  
Uncompressing Linux... done, booting the kernel.
  
[0.00] Booting Linux on physical CPU 0  
  
[0.00] Linux version 3.6.0-rc6-00025-g4651afb (paul@dusk) (gcc version 
4.5.1 (Sourcery G++ Lite 2010.09-50) ) #342 SMP Fri Sep 21 11:28:00 MDT 2012
 
[0.00] CPU: ARMv7 Processor [411fc092] revision 2 (ARMv7), cr=10c53c7d  
  
[0.00] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing 
instruction cache   
[0.00] Machine: OMAP4 Panda board   
  
[0.00] debug: ignoring loglevel setting.
  
[0.00] Reserving 16777216 bytes SDRAM for VRAM  
  
[0.00] Memory policy: ECC disabled, Data cache writealloc
[0.00] On node 0 totalpages: 126720 
  
[0.00] free_area_init_node: node 0, pgdat c07c6140, node_mem_map 
c0d23000
[0.00]   Normal zone: 1024 pages used for memmap
[0.00]   Normal zone: 0 pages reserved
[0.00]   Normal zone: 125696 pages, LIFO batch:31
[0.00] OMAP4430 ES2.0
[0.00] PERCPU: Embedded 9 pages/cpu @c112b000 s14016 r8192 d14656 u36864
[0.00] pcpu-alloc: s14016 r8192 d14656 u36864 alloc=9*4096
[0.00] pcpu-alloc: [0] 0 [0] 1 
[0.00] Built 1 zonelists in Zone order, mobility grouping on.  Total 
pages: 125696
[0.00] Kernel command line: console=ttyO2,230400n8 vram=16M 
root=/dev/mmcblk0p2 rw rootfstype=ext3 rootwait earlyprintk debug 
ignore_loglevel
[0.00] PID hash table entries: 2048 (order: 1, 8192 bytes)
[0.00] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
[0.00] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
[0.00] Memory: 495MB = 495MB total
[0.00] Memory: 488828k/488828k available, 35460k reserved, 0K highmem
[0.00] Virtual kernel memory layout:
[0.00] vector  : 0x - 0x1000   (   4 kB)
[0.00] fixmap  : 0xfff0 - 0xfffe   ( 896 kB)
[0.00] vmalloc : 0xe080 - 0xff00   ( 488 MB)
[0.00] lowmem  : 0xc000 - 0xe000   ( 512 MB)
[0.00] pkmap   : 0xbfe0 - 0xc000   (   2 MB)
[0.00] modules : 0xbf00 - 0xbfe0   (  14 MB)
[0.00]   .text : 0xc0008000 - 0xc06dad24   (6988 kB)
[0.00]   .init : 0xc06db000 - 0xc072a6c0   ( 318 kB)
[0.00]   .data : 0xc072c000 -

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul Walmsley

Hi Paul

On Thu, 20 Sep 2012, Paul Walmsley wrote:

 On Thu, 20 Sep 2012, Paul E. McKenney wrote:

  Paul Walmsley, please let me know if the config below doesn't clear things
  up for you or if there is some reason why this config is infeasible.
 
 Will certainly test it here once things settle down.

I built an OMAP kernel from Linus' commit 
4651afbbae968772efd6dc4ba461cba9b49bb9d8 (Merge branch 'for-3.6-fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq).  The config used 
was 'omap2plus_defconfig', and enabled CONFIG_CPU_IDLE by hand.  Booted it 
on a Pandaboard (OMAP4430ES2) into a very minimal Debian rootfs.

Long story short, the rcu_sched stall messages continue to appear.  
Transcript below.

The above config should be easy to reproduce from mainline.  Paul, not 
sure if you're still working with Linaro, but you should be able to get 
access to a Pandaboard through them to try it yourself, should you wish.


- Paul

## Booting kernel from Legacy Image at 8200 ... 
  
   Image Name:   Linux-3.6.0-rc6-00025-g4651afb 
  
   Image Type:   ARM Linux Kernel Image (uncompressed)  
  
   Data Size:4061576 Bytes = 3.9 MiB
  
   Load Address: 80008000   
  
   Entry Point:  80008000   
  
   Verifying Checksum ... OK
  
   Loading Kernel Image ... OK  
  
OK  
  

  
Starting kernel ... 
  

  
Uncompressing Linux... done, booting the kernel.
  
[0.00] Booting Linux on physical CPU 0  
  
[0.00] Linux version 3.6.0-rc6-00025-g4651afb (paul@dusk) (gcc version 
4.5.1 (Sourcery G++ Lite 2010.09-50) ) #342 SMP Fri Sep 21 11:28:00 MDT 2012
 
[0.00] CPU: ARMv7 Processor [411fc092] revision 2 (ARMv7), cr=10c53c7d  
  
[0.00] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing 
instruction cache   
[0.00] Machine: OMAP4 Panda board   
  
[0.00] debug: ignoring loglevel setting.
  
[0.00] Reserving 16777216 bytes SDRAM for VRAM  
  
[0.00] Memory policy: ECC disabled, Data cache writealloc
[0.00] On node 0 totalpages: 126720 
  
[0.00] free_area_init_node: node 0, pgdat c07c6140, node_mem_map 
c0d23000
[0.00]   Normal zone: 1024 pages used for memmap
[0.00]   Normal zone: 0 pages reserved
[0.00]   Normal zone: 125696 pages, LIFO batch:31
[0.00] OMAP4430 ES2.0
[0.00] PERCPU: Embedded 9 pages/cpu @c112b000 s14016 r8192 d14656 u36864
[0.00] pcpu-alloc: s14016 r8192 d14656 u36864 alloc=9*4096
[0.00] pcpu-alloc: [0] 0 [0] 1 
[0.00] Built 1 zonelists in Zone order, mobility grouping on.  Total 
pages: 125696
[0.00] Kernel command line: console=ttyO2,230400n8 vram=16M 
root=/dev/mmcblk0p2 rw rootfstype=ext3 rootwait earlyprintk debug 
ignore_loglevel
[0.00] PID hash table entries: 2048 (order: 1, 8192 bytes)
[0.00] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
[0.00] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
[0.00] Memory: 495MB = 495MB total
[0.00] Memory: 488828k/488828k available, 35460k reserved, 0K highmem
[0.00] Virtual kernel memory layout:
[0.00] vector  : 0x - 0x1000   (   4 kB)
[0.00] fixmap  : 0xfff0 - 0xfffe   ( 896 kB)
[0.00] vmalloc : 0xe080 - 0xff00   ( 488 MB)
[0.00] lowmem  : 0xc000 - 0xe000   ( 512 MB)
[0.00] pkmap   : 0xbfe0 - 0xc000   (   2 MB)
[0.00] modules : 0xbf00 - 0xbfe0   (  14 MB)
[0.00]   .text : 0xc0008000 - 0xc06dad24   (6988 kB)
[0.00]   .init : 0xc06db000 - 0xc072a6c0   ( 318 kB)
[0.00]   .data : 0xc072c000 - 0xc07c8940

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul Walmsley

On Fri, 21 Sep 2012, Paul Walmsley wrote:

 The config used was 'omap2plus_defconfig', and enabled CONFIG_CPU_IDLE 
 by hand. 

One other thing I forgot to mention - CONFIG_RCU_CPU_STALL_INFO was 
enabled by hand also.  Below is the diff between omap2plus_defconfig and 
the config that was used here.


- Paul

--- .config 2012-09-21 11:50:19.565972113 -0600
+++ .testconfig 2012-09-21 11:50:12.233946950 -0600
@@ -536,7 +536,9 @@
 # CPU Frequency scaling
 #
 # CONFIG_CPU_FREQ is not set
-# CONFIG_CPU_IDLE is not set
+CONFIG_CPU_IDLE=y
+CONFIG_CPU_IDLE_GOV_LADDER=y
+CONFIG_CPU_IDLE_GOV_MENU=y
 CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED=y
 
 #
@@ -2535,7 +2537,7 @@
 # CONFIG_BOOT_PRINTK_DELAY is not set
 # CONFIG_RCU_TORTURE_TEST is not set
 CONFIG_RCU_CPU_STALL_TIMEOUT=60
-# CONFIG_RCU_CPU_STALL_INFO is not set
+CONFIG_RCU_CPU_STALL_INFO=y
 # CONFIG_RCU_TRACE is not set
 # CONFIG_KPROBES_SANITY_TEST is not set
 # CONFIG_BACKTRACE_SELF_TEST is not set

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul Walmsley

cc Frederic Weisbecker - context is here: 

   http://marc.info/?l=linux-kernelm=134749030206016w=2

On Thu, 20 Sep 2012, Paul E. McKenney wrote:

 Fair point.  I am wondering whether there is some path into the idle
 loop that somehow avoids telling RCU that the CPU has in face entered
 idle.  There needs to be an rcu_idle_enter() call on the way to idle,
 otherwise RCU CPU stall warnings are expected behavior.

As far as I know, our only idle entry point is in 
arch/arm/common/process.c:cpu_idle().

Looking at the x86 idle entry, they call rcu_idle_{enter,exit}() inside 
{stop,start}_critical_timings().  Making that change here didn't help.

Also tried commenting out the code from the stop_critical_timings() call 
to the WARN_ON(irqs_disabled()), and adding a local_irq_enable().  That 
also didn't help, which suggests that the problem is not caused by the 
OMAP-specific PM idle code.


- Paul
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul E. McKenney

On Fri, Sep 21, 2012 at 06:08:59PM +, Paul Walmsley wrote:
 cc Frederic Weisbecker - context is here: 
 
http://marc.info/?l=linux-kernelm=134749030206016w=2
 
 On Thu, 20 Sep 2012, Paul E. McKenney wrote:
 
  Fair point.  I am wondering whether there is some path into the idle
  loop that somehow avoids telling RCU that the CPU has in face entered
  idle.  There needs to be an rcu_idle_enter() call on the way to idle,
  otherwise RCU CPU stall warnings are expected behavior.
 
 As far as I know, our only idle entry point is in 
 arch/arm/common/process.c:cpu_idle().

In mainline, this is arch/arm/kernel/process.c, correct?

 Looking at the x86 idle entry, they call rcu_idle_{enter,exit}() inside 
 {stop,start}_critical_timings().  Making that change here didn't help.

The reason x86 does this is that they have idle notifiers deeper in the
idle loop that use RCU read-side critical sections.  So this was an
expected result.

 Also tried commenting out the code from the stop_critical_timings() call 
 to the WARN_ON(irqs_disabled()), and adding a local_irq_enable().  That 
 also didn't help, which suggests that the problem is not caused by the 
 OMAP-specific PM idle code.

I must admit that you make a convincing case here.  Though it does leave
me wondering what is different about Panda (and MX28, IIRC).

I may take your advice of remote access to a Panda board, though that
is likely to take a bit of time due to timezones.  Regardless of the
underlying issue here, I clearly need to make the stall-warning messages
do a better job of printing out needed information.

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul Walmsley

Hi

Just did a test with CONFIG_NO_HZ=n, and no rcu_sched stall messages 
appeared for 60 minutes.

Following is the diff from omap2plus_defconfig.

- Paul

--- .config 2012-09-21 12:51:19.114938954 -0600
+++ .testconfig 2012-09-21 12:51:15.686926318 -0600
@@ -69,7 +69,7 @@
 # Timers subsystem
 #
 CONFIG_TICK_ONESHOT=y
-CONFIG_NO_HZ=y
+# CONFIG_NO_HZ is not set
 CONFIG_HIGH_RES_TIMERS=y
 
 #
@@ -80,7 +80,6 @@
 CONFIG_RCU_FANOUT=32
 CONFIG_RCU_FANOUT_LEAF=16
 # CONFIG_RCU_FANOUT_EXACT is not set
-# CONFIG_RCU_FAST_NO_HZ is not set
 # CONFIG_TREE_RCU_TRACE is not set
 CONFIG_IKCONFIG=y
 CONFIG_IKCONFIG_PROC=y
@@ -2535,7 +2534,7 @@
 # CONFIG_BOOT_PRINTK_DELAY is not set
 # CONFIG_RCU_TORTURE_TEST is not set
 CONFIG_RCU_CPU_STALL_TIMEOUT=60
-# CONFIG_RCU_CPU_STALL_INFO is not set
+CONFIG_RCU_CPU_STALL_INFO=y
 # CONFIG_RCU_TRACE is not set
 # CONFIG_KPROBES_SANITY_TEST is not set
 # CONFIG_BACKTRACE_SELF_TEST is not set

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul Walmsley

On Fri, 21 Sep 2012, Paul E. McKenney wrote:

 On Fri, Sep 21, 2012 at 06:08:59PM +, Paul Walmsley wrote:
 
  As far as I know, our only idle entry point is in 
  arch/arm/common/process.c:cpu_idle().
 
 In mainline, this is arch/arm/kernel/process.c, correct?

Indeed; sorry about that, mistyped.

  Looking at the x86 idle entry, they call rcu_idle_{enter,exit}() inside 
  {stop,start}_critical_timings().  Making that change here didn't help.
 
 The reason x86 does this is that they have idle notifiers deeper in the
 idle loop that use RCU read-side critical sections.  So this was an
 expected result.

OK

  Also tried commenting out the code from the stop_critical_timings() call 
  to the WARN_ON(irqs_disabled()), and adding a local_irq_enable().  That 
  also didn't help, which suggests that the problem is not caused by the 
  OMAP-specific PM idle code.
 
 I must admit that you make a convincing case here.  Though it does leave
 me wondering what is different about Panda (and MX28, IIRC).

Given the dependency on CONFIG_NO_HZ, the stalls are probably dependent on 
the userspace in use.  The userspaces here are quite minimal and so allow 
the system to stay idle for relatively long periods of time.

 I may take your advice of remote access to a Panda board, though that
 is likely to take a bit of time due to timezones.  Regardless of the
 underlying issue here, I clearly need to make the stall-warning messages
 do a better job of printing out needed information.

If you've got a patch in mind for that, I'll boot it here.

One other observation.  omap2plus_defconfig sets CONFIG_NO_HZ=y but 
doesn't set CONFIG_RCU_FAST_NO_HZ.  The stall warning messages still 
appear when CONFIG_RCU_FAST_NO_HZ=y.  One of them is attached below (with 
CONFIG_RCU_CPU_STALL_INFO set as well, obviously).

As an aside, in the CONFIG_RCU_FAST_NO_HZ=y build, I dropped a printk() 
into rcu_idle_gp_timer_func() and it doesn't look like it ever executed.


- Paul

[  305.832000] INFO: rcu_sched self-detected stall on CPU
[  305.834838]  1: (2 GPs behind) idle=5b1/1/0 drain=0 . timer=4294967295
[  305.838378]   (t=17463 jiffies)
[  305.840118] [c001be10] (unwind_backtrace+0x0/0xf0) from [c00ad65c] 
(rcu_pending+0xd0/0x540)
[  305.844848] [c00ad65c] (rcu_pending+0xd0/0x540) from [c00ae5cc] 
(rcu_check_callbacks+0x110/0x198)
[  305.849884] [c00ae5cc] (rcu_check_callbacks+0x110/0x198) from [c0053800] 
(update_process_times+0x38/0x68)
[  305.855285] [c0053800] (update_process_times+0x38/0x68) from [c008cf40] 
(tick_sched_timer+0x80/0xec)
[  305.860473] [c008cf40] (tick_sched_timer+0x80/0xec) from [c006942c] 
(__run_hrtimer+0x7c/0x1e0)
[  305.865356] [c006942c] (__run_hrtimer+0x7c/0x1e0) from [c006a210] 
(hrtimer_interrupt+0x11c/0x2d0)
[  305.870361] [c006a210] (hrtimer_interrupt+0x11c/0x2d0) from [c001a54c] 
(twd_handler+0x30/0x44)
[  305.875274] [c001a54c] (twd_handler+0x30/0x44) from [c00a8128] 
(handle_percpu_devid_irq+0x90/0x13c)
[  305.880371] [c00a8128] (handle_percpu_devid_irq+0x90/0x13c) from 
[c00a489c] (generic_handle_irq+0x30/0x48)
[  305.885833] [c00a489c] (generic_handle_irq+0x30/0x48) from [c0014fb8] 
(handle_IRQ+0x4c/0xac)
[  305.890624] [c0014fb8] (handle_IRQ+0x4c/0xac) from [c000864c] 
(gic_handle_irq+0x28/0x5c)
[  305.895233] [c000864c] (gic_handle_irq+0x28/0x5c) from [c04fbc64] 
(__irq_svc+0x44/0x5c)
[  305.899780] Exception stack(0xde86ff88 to 0xde86ffd0)
[  305.902526] ff80:   0004c062 0001  de8660c0 
de86e000 c07c42c8
[  305.906982] ffa0: c05075a0 c074bdd0  411fc092 c074bff0  
0001 de86ffd0
[  305.911437] ffc0: 0004c063 c00152b0 2113 
[  305.914184] [c04fbc64] (__irq_svc+0x44/0x5c) from [c00152b0] 
(default_idle+0x20/0x44)
[  305.918640] [c00152b0] (default_idle+0x20/0x44) from [c00154dc] 
(cpu_idle+0x9c/0x114)
[  305.923126] [c00154dc] (cpu_idle+0x9c/0x114) from [804f4a34] (0x804f4a34)

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul E. McKenney

On Fri, Sep 21, 2012 at 07:11:14PM +, Paul Walmsley wrote:
 On Fri, 21 Sep 2012, Paul E. McKenney wrote:
 
  On Fri, Sep 21, 2012 at 06:08:59PM +, Paul Walmsley wrote:
  
   As far as I know, our only idle entry point is in 
   arch/arm/common/process.c:cpu_idle().
  
  In mainline, this is arch/arm/kernel/process.c, correct?
 
 Indeed; sorry about that, mistyped.

No problem!

   Looking at the x86 idle entry, they call rcu_idle_{enter,exit}() inside 
   {stop,start}_critical_timings().  Making that change here didn't help.
  
  The reason x86 does this is that they have idle notifiers deeper in the
  idle loop that use RCU read-side critical sections.  So this was an
  expected result.
 
 OK
 
   Also tried commenting out the code from the stop_critical_timings() call 
   to the WARN_ON(irqs_disabled()), and adding a local_irq_enable().  That 
   also didn't help, which suggests that the problem is not caused by the 
   OMAP-specific PM idle code.
  
  I must admit that you make a convincing case here.  Though it does leave
  me wondering what is different about Panda (and MX28, IIRC).
 
 Given the dependency on CONFIG_NO_HZ, the stalls are probably dependent on 
 the userspace in use.  The userspaces here are quite minimal and so allow 
 the system to stay idle for relatively long periods of time.

Could you please point me to a recipe for creating a minimal userspace?
Just in case it is the userspac erather than the architecture/hardware
that makes the difference.

  I may take your advice of remote access to a Panda board, though that
  is likely to take a bit of time due to timezones.  Regardless of the
  underlying issue here, I clearly need to make the stall-warning messages
  do a better job of printing out needed information.
 
 If you've got a patch in mind for that, I'll boot it here.

Hammering it out, will send it along when it is a bit less destructive.  ;-)

 One other observation.  omap2plus_defconfig sets CONFIG_NO_HZ=y but 
 doesn't set CONFIG_RCU_FAST_NO_HZ.  The stall warning messages still 
 appear when CONFIG_RCU_FAST_NO_HZ=y.  One of them is attached below (with 
 CONFIG_RCU_CPU_STALL_INFO set as well, obviously).

Just to make sure I understand the combinations:

o   All stalls have happened when running a minimal userspace.
o   CONFIG_NO_HZ=n suppresses the stalls.
o   CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
no observable effect on the stalls.

Did I get that right, or am I missing a combination?

 As an aside, in the CONFIG_RCU_FAST_NO_HZ=y build, I dropped a printk() 
 into rcu_idle_gp_timer_func() and it doesn't look like it ever executed.

Indeed, rcu_idle_gp_timer_func() is a bit strange in that it is cancelled
upon exit from idle, and therefore should (almost) never actually execute.
Its sole purpose is to wake up the CPU.  ;-)

Thanx, Paul

 - Paul
 
 [  305.832000] INFO: rcu_sched self-detected stall on CPU
 [  305.834838]  1: (2 GPs behind) idle=5b1/1/0 drain=0 . timer=4294967295
 [  305.838378]   (t=17463 jiffies)
 [  305.840118] [c001be10] (unwind_backtrace+0x0/0xf0) from [c00ad65c] 
 (rcu_pending+0xd0/0x540)
 [  305.844848] [c00ad65c] (rcu_pending+0xd0/0x540) from [c00ae5cc] 
 (rcu_check_callbacks+0x110/0x198)
 [  305.849884] [c00ae5cc] (rcu_check_callbacks+0x110/0x198) from 
 [c0053800] (update_process_times+0x38/0x68)
 [  305.855285] [c0053800] (update_process_times+0x38/0x68) from 
 [c008cf40] (tick_sched_timer+0x80/0xec)
 [  305.860473] [c008cf40] (tick_sched_timer+0x80/0xec) from [c006942c] 
 (__run_hrtimer+0x7c/0x1e0)
 [  305.865356] [c006942c] (__run_hrtimer+0x7c/0x1e0) from [c006a210] 
 (hrtimer_interrupt+0x11c/0x2d0)
 [  305.870361] [c006a210] (hrtimer_interrupt+0x11c/0x2d0) from [c001a54c] 
 (twd_handler+0x30/0x44)
 [  305.875274] [c001a54c] (twd_handler+0x30/0x44) from [c00a8128] 
 (handle_percpu_devid_irq+0x90/0x13c)
 [  305.880371] [c00a8128] (handle_percpu_devid_irq+0x90/0x13c) from 
 [c00a489c] (generic_handle_irq+0x30/0x48)
 [  305.885833] [c00a489c] (generic_handle_irq+0x30/0x48) from [c0014fb8] 
 (handle_IRQ+0x4c/0xac)
 [  305.890624] [c0014fb8] (handle_IRQ+0x4c/0xac) from [c000864c] 
 (gic_handle_irq+0x28/0x5c)
 [  305.895233] [c000864c] (gic_handle_irq+0x28/0x5c) from [c04fbc64] 
 (__irq_svc+0x44/0x5c)
 [  305.899780] Exception stack(0xde86ff88 to 0xde86ffd0)
 [  305.902526] ff80:   0004c062 0001  de8660c0 
 de86e000 c07c42c8
 [  305.906982] ffa0: c05075a0 c074bdd0  411fc092 c074bff0  
 0001 de86ffd0
 [  305.911437] ffc0: 0004c063 c00152b0 2113 
 [  305.914184] [c04fbc64] (__irq_svc+0x44/0x5c) from [c00152b0] 
 (default_idle+0x20/0x44)
 [  305.918640] [c00152b0] (default_idle+0x20/0x44) from [c00154dc] 
 (cpu_idle+0x9c/0x114)
 [  305.923126] [c00154dc] (cpu_idle+0x9c/0x114) from [804f4a34] 
 (0x804f4a34)
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Tony Lindgren

* Paul E. McKenney paul...@linux.vnet.ibm.com [120921 12:58]:
 
 Just to make sure I understand the combinations:
 
 o All stalls have happened when running a minimal userspace.
 o CONFIG_NO_HZ=n suppresses the stalls.
 o CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
   no observable effect on the stalls.

The reason why you may need minimal userspace is to cut down
the number of timers waking up the system with NO_HZ.
Booting with init=/bin/sh might also do the trick for that.

Regards,

Tony
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul E. McKenney

On Fri, Sep 21, 2012 at 05:47:31PM +, Paul Walmsley wrote:
 Hi Paul
 
 On Thu, 20 Sep 2012, Paul Walmsley wrote:
 
  On Thu, 20 Sep 2012, Paul E. McKenney wrote:
 
   Paul Walmsley, please let me know if the config below doesn't clear things
   up for you or if there is some reason why this config is infeasible.
  
  Will certainly test it here once things settle down.
 
 I built an OMAP kernel from Linus' commit 
 4651afbbae968772efd6dc4ba461cba9b49bb9d8 (Merge branch 'for-3.6-fixes' of 
 git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq).  The config used 
 was 'omap2plus_defconfig', and enabled CONFIG_CPU_IDLE by hand.  Booted it 
 on a Pandaboard (OMAP4430ES2) into a very minimal Debian rootfs.

Did you have the patch at https://lkml.org/lkml/2012/8/30/290 applied?
If not, could you please try it?  (This patch cleared up a similar
problem for Becky, also on OMAP.)

Thanx, Paul

 Long story short, the rcu_sched stall messages continue to appear.  
 Transcript below.
 
 The above config should be easy to reproduce from mainline.  Paul, not 
 sure if you're still working with Linaro, but you should be able to get 
 access to a Pandaboard through them to try it yourself, should you wish.
 
 
 - Paul
 
 ## Booting kernel from Legacy Image at 8200 ...   
 
Image Name:   Linux-3.6.0-rc6-00025-g4651afb   
 
Image Type:   ARM Linux Kernel Image (uncompressed)
 
Data Size:4061576 Bytes = 3.9 MiB  
 
Load Address: 80008000 
 
Entry Point:  80008000 
 
Verifying Checksum ... OK  
 
Loading Kernel Image ... OK
 
 OK
 
   
 
 Starting kernel ...   
 
   
 
 Uncompressing Linux... done, booting the kernel.  
 
 [0.00] Booting Linux on physical CPU 0
 
 [0.00] Linux version 3.6.0-rc6-00025-g4651afb (paul@dusk) (gcc 
 version 4.5.1 (Sourcery G++ Lite 2010.09-50) ) #342 SMP Fri Sep 21 11:28:00 
 MDT 2012  

 [0.00] CPU: ARMv7 Processor [411fc092] revision 2 (ARMv7), 
 cr=10c53c7d
 [0.00] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing 
 instruction cache   
 [0.00] Machine: OMAP4 Panda board 
 
 [0.00] debug: ignoring loglevel setting.  
 
 [0.00] Reserving 16777216 bytes SDRAM for VRAM
 
 [0.00] Memory policy: ECC disabled, Data cache writealloc
 [0.00] On node 0 totalpages: 126720   
 
 [0.00] free_area_init_node: node 0, pgdat c07c6140, node_mem_map 
 c0d23000
 [0.00]   Normal zone: 1024 pages used for memmap
 [0.00]   Normal zone: 0 pages reserved
 [0.00]   Normal zone: 125696 pages, LIFO batch:31
 [0.00] OMAP4430 ES2.0
 [0.00] PERCPU: Embedded 9 pages/cpu @c112b000 s14016 r8192 d14656 
 u36864
 [0.00] pcpu-alloc: s14016 r8192 d14656 u36864 alloc=9*4096
 [0.00] pcpu-alloc: [0] 0 [0] 1 
 [0.00] Built 1 zonelists in Zone order, mobility grouping on.  Total 
 pages: 125696
 [0.00] Kernel command line: console=ttyO2,230400n8 vram=16M 
 root=/dev/mmcblk0p2 rw rootfstype=ext3 rootwait earlyprintk debug 
 ignore_loglevel
 [0.00] PID hash table entries: 2048 (order: 1, 8192 bytes)
 [0.00] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
 [0.00] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
 [0.00] Memory: 495MB = 495MB total
 [0.00] Memory: 488828k/488828k available, 35460k reserved, 0K highmem
 [0.00] Virtual kernel memory layout:
 [0.00] vector  : 0x - 0x1000   (   4 kB)
 [0.00] fixmap  : 0xfff0 - 0xfffe   ( 896 kB)
 [0.00] vmalloc : 0xe080 -

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul E. McKenney

On Fri, Sep 21, 2012 at 01:31:49PM -0700, Tony Lindgren wrote:
 * Paul E. McKenney paul...@linux.vnet.ibm.com [120921 12:58]:
  
  Just to make sure I understand the combinations:
  
  o   All stalls have happened when running a minimal userspace.
  o   CONFIG_NO_HZ=n suppresses the stalls.
  o   CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
  no observable effect on the stalls.
 
 The reason why you may need minimal userspace is to cut down
 the number of timers waking up the system with NO_HZ.
 Booting with init=/bin/sh might also do the trick for that.

Good point!  This does make for a very quiet system, but does not
reproduce the problem under kvm, even after waiting for four minutes.
I will leave it for more time, but it looks like I really might need to
ask Linaro for remote access to a Panda.

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul E. McKenney

On Fri, Sep 21, 2012 at 12:57:17PM -0700, Paul E. McKenney wrote:
 On Fri, Sep 21, 2012 at 07:11:14PM +, Paul Walmsley wrote:
  On Fri, 21 Sep 2012, Paul E. McKenney wrote:

[ . . . ]

   I may take your advice of remote access to a Panda board, though that
   is likely to take a bit of time due to timezones.  Regardless of the
   underlying issue here, I clearly need to make the stall-warning messages
   do a better job of printing out needed information.
  
  If you've got a patch in mind for that, I'll boot it here.
 
 Hammering it out, will send it along when it is a bit less destructive.  ;-)

And I eventually decided to keep things simple, please see below.
This applies on top of https://lkml.org/lkml/2012/8/30/290.

Thanx, Paul



rcu: Add grace-period information to RCU CPU stall warnings

This commit causes the last grace period started and completed to be
printed on RCU CPU stall warning messages in order to aid diagnosis.

Signed-off-by: Paul E. McKenney paul.mcken...@linaro.org
Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index cdaa7aa..fdb6854 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -925,8 +925,9 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
raw_spin_unlock_irqrestore(rnp-lock, flags);
 
print_cpu_stall_info_end();
-   printk(KERN_CONT (detected by %d, t=%ld jiffies)\n,
-  smp_processor_id(), (long)(jiffies - rsp-gp_start));
+   pr_cont((detected by %d, t=%ld jiffies, g=%lu, c=%lu)\n,
+  smp_processor_id(), (long)(jiffies - rsp-gp_start),
+  rsp-gpnum, rsp-completed);
if (ndetected == 0)
printk(KERN_ERR INFO: Stall ended before state dump start\n);
else if (!trigger_all_cpu_backtrace())
@@ -953,7 +954,8 @@ static void print_cpu_stall(struct rcu_state *rsp)
print_cpu_stall_info_begin();
print_cpu_stall_info(rsp, smp_processor_id());
print_cpu_stall_info_end();
-   printk(KERN_CONT  (t=%lu jiffies)\n, jiffies - rsp-gp_start);
+   pr_cont( (t=%lu jiffies g=%lu c=%lu)\n,
+   jiffies - rsp-gp_start, rsp-gpnum, rsp-completed);
if (!trigger_all_cpu_backtrace())
dump_stack();
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul Walmsley

On Fri, 21 Sep 2012, Paul E. McKenney wrote:

 On Fri, Sep 21, 2012 at 05:47:31PM +, Paul Walmsley wrote:

  I built an OMAP kernel from Linus' commit 
  4651afbbae968772efd6dc4ba461cba9b49bb9d8 (Merge branch 'for-3.6-fixes' of 
  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq).  The config used 
  was 'omap2plus_defconfig', and enabled CONFIG_CPU_IDLE by hand.  Booted it 
  on a Pandaboard (OMAP4430ES2) into a very minimal Debian rootfs.
 
 Did you have the patch at https://lkml.org/lkml/2012/8/30/290 applied?

No, it's just as described above.

 If not, could you please try it?  (This patch cleared up a similar
 problem for Becky, also on OMAP.)

Did not seem to help, either with or without CONFIG_CPU_IDLE.


- Paul
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-21 Thread Paul E. McKenney

On Fri, Sep 21, 2012 at 10:41:14PM +, Paul Walmsley wrote:
 On Fri, 21 Sep 2012, Paul E. McKenney wrote:
 
  On Fri, Sep 21, 2012 at 05:47:31PM +, Paul Walmsley wrote:
 
   I built an OMAP kernel from Linus' commit 
   4651afbbae968772efd6dc4ba461cba9b49bb9d8 (Merge branch 'for-3.6-fixes' 
   of 
   git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq).  The config used 
   was 'omap2plus_defconfig', and enabled CONFIG_CPU_IDLE by hand.  Booted 
   it 
   on a Pandaboard (OMAP4430ES2) into a very minimal Debian rootfs.
  
  Did you have the patch at https://lkml.org/lkml/2012/8/30/290 applied?
 
 No, it's just as described above.
 
  If not, could you please try it?  (This patch cleared up a similar
  problem for Becky, also on OMAP.)
 
 Did not seem to help, either with or without CONFIG_CPU_IDLE.

I was hoping!  ;-)

And my init=/bin/sh kernel ran idle for more than an hour without
any RCU CPU stall warnings...

I am wondering if your system somehow figured out how to start a grace
period that had no RCU callbacks waiting for it.  If that happened,
then a CONFIG_NO_HZ=y system could in theory get into a state where all
CPUs are in dyntick-idle mode, so that none of them is doing anything
to force the grace period to complete.

That should be easy to diagnose, anyway.  Please see below, which
includes the earlier diagnostic patch.

Thanx, Paul



diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 307caf1..696f189 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -879,6 +879,7 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
unsigned long flags;
int ndetected = 0;
struct rcu_node *rnp = rcu_get_root(rsp);
+   long totqlen = 0;
 
/* Only let one CPU complain about others per time interval. */
 
@@ -923,8 +924,11 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
raw_spin_unlock_irqrestore(rnp-lock, flags);
 
print_cpu_stall_info_end();
-   printk(KERN_CONT (detected by %d, t=%ld jiffies)\n,
-  smp_processor_id(), (long)(jiffies - rsp-gp_start));
+   for_each_possible_cpu(cpu)
+   totqlen += per_cpu_ptr(rsp-rda, cpu)-qlen;
+   pr_cont((detected by %d, t=%ld jiffies, g=%lu, c=%lu, q=%lu)\n,
+  smp_processor_id(), (long)(jiffies - rsp-gp_start),
+  rsp-gpnum, rsp-completed, totqlen);
if (ndetected == 0)
printk(KERN_ERR INFO: Stall ended before state dump start\n);
else if (!trigger_all_cpu_backtrace())
@@ -939,8 +943,10 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
 
 static void print_cpu_stall(struct rcu_state *rsp)
 {
+   int cpu;
unsigned long flags;
struct rcu_node *rnp = rcu_get_root(rsp);
+   long totqlen = 0;
 
/*
 * OK, time to rat on ourselves...
@@ -951,7 +957,10 @@ static void print_cpu_stall(struct rcu_state *rsp)
print_cpu_stall_info_begin();
print_cpu_stall_info(rsp, smp_processor_id());
print_cpu_stall_info_end();
-   printk(KERN_CONT  (t=%lu jiffies)\n, jiffies - rsp-gp_start);
+   for_each_possible_cpu(cpu)
+   totqlen += per_cpu_ptr(rsp-rda, cpu)-qlen;
+   pr_cont( (t=%lu jiffies g=%lu c=%lu q=%lu)\n,
+   jiffies - rsp-gp_start, rsp-gpnum, rsp-completed, totqlen);
if (!trigger_all_cpu_backtrace())
dump_stack();
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-20 Thread Paul E. McKenney

On Thu, Sep 20, 2012 at 10:47:25PM +, Paul Walmsley wrote:
> On Thu, 20 Sep 2012, Paul E. McKenney wrote:
> 
> > On Thu, Sep 20, 2012 at 09:49:13PM +, Bruce, Becky wrote:
> >
> > > OK, people, you can stop heckling me about "sent from my iPhone" - I'm in 
> > > the wilds of rural Louisiana with really bad internet service and was 
> > > trying to work on my phone (but, alas, did not notice the CC list 
> > > included the entire universe).   Shame on me.
> > > 
> > > With the above set, I don't seem to see any stalls with the RCU timeout 
> > > set to 60s (the default).  I left the board running for 25 minutes; I 
> > > will fire it up again later and let it run for a bit longer, but usually 
> > > I end up seeing the problem pretty quickly so I don't expect that to 
> > > result in anything.  I also didn't see any stalls on Paul's RCU tree as 
> > > of a week ago at 60s, so as far as I can tell the CPU_IDLE stuff didn't 
> > > have any impact (it wasn't on when I tested Paul's tree).
> > > 
> > > If I drop the timeout to 5s as Paul M. suggested for debug a while back, 
> > > I do see stalls (both with CPU_IDLE stuff and without). 
> > > 
> > > I'm using the default omap2plus config, with RCU stall info enabled and 
> > > the cpu idle stuff turned on (console dump below).  This is a Panda ES 
> > > 1.1 (OMAP4460)
> > 
> > Thank you for the testing, Becky!
> > 
> > Paul Walmsley, please let me know if the config below doesn't clear things
> > up for you or if there is some reason why this config is infeasible.
> 
> Will certainly test it here once things settle down.  But I wouldn't 
> consider the use of CONFIG_CPU_IDLE to be a 'fix' until the problem can be 
> understood.  We use !CONFIG_CPU_IDLE kernels for testing here.

Fair point.  I am wondering whether there is some path into the idle
loop that somehow avoids telling RCU that the CPU has in face entered
idle.  There needs to be an rcu_idle_enter() call on the way to idle,
otherwise RCU CPU stall warnings are expected behavior.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-20 Thread Paul Walmsley

On Thu, 20 Sep 2012, Paul E. McKenney wrote:

> On Thu, Sep 20, 2012 at 09:49:13PM +, Bruce, Becky wrote:
>
> > OK, people, you can stop heckling me about "sent from my iPhone" - I'm in 
> > the wilds of rural Louisiana with really bad internet service and was 
> > trying to work on my phone (but, alas, did not notice the CC list included 
> > the entire universe).   Shame on me.
> > 
> > With the above set, I don't seem to see any stalls with the RCU timeout set 
> > to 60s (the default).  I left the board running for 25 minutes; I will fire 
> > it up again later and let it run for a bit longer, but usually I end up 
> > seeing the problem pretty quickly so I don't expect that to result in 
> > anything.  I also didn't see any stalls on Paul's RCU tree as of a week ago 
> > at 60s, so as far as I can tell the CPU_IDLE stuff didn't have any impact 
> > (it wasn't on when I tested Paul's tree).
> > 
> > If I drop the timeout to 5s as Paul M. suggested for debug a while back, I 
> > do see stalls (both with CPU_IDLE stuff and without). 
> > 
> > I'm using the default omap2plus config, with RCU stall info enabled and the 
> > cpu idle stuff turned on (console dump below).  This is a Panda ES 1.1 
> > (OMAP4460)
> 
> Thank you for the testing, Becky!
> 
> Paul Walmsley, please let me know if the config below doesn't clear things
> up for you or if there is some reason why this config is infeasible.

Will certainly test it here once things settle down.  But I wouldn't 
consider the use of CONFIG_CPU_IDLE to be a 'fix' until the problem can be 
understood.  We use !CONFIG_CPU_IDLE kernels for testing here.


- Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-20 Thread Paul E. McKenney

On Thu, Sep 20, 2012 at 09:49:13PM +, Bruce, Becky wrote:
> 
> On Sep 20, 2012, at 2:56 AM, Paul Walmsley wrote:
> 
> > Hi,
> > 
> > On Wed, 19 Sep 2012, Paul E. McKenney wrote:
> > 
> >> On Thu, Sep 13, 2012 at 06:52:10PM +, Paul Walmsley wrote:
> >> 
> >>> On Wed, 12 Sep 2012, Paul E. McKenney wrote:
> >> 
>  Subodh Nijsure (also CCed) reported something that might be similar on
>  ARM, and also reported that setting the following got rid of the stalls:
>  
>   CONFIG_CPU_IDLE=y
>   CONFIG_CPU_IDLE_GOV_LADDER=y
>   CONFIG_CPU_IDLE_GOV_MENU=y
>  
>  At which point he was happy, which was good, but which also left the
>  underlying problem unsolved.  Do these affect your system?  If so,
>  do they cause a different ARM idle loop to be executed?
> >>> 
> >>> Will give this a try.  What board was Subodh using?
> >> 
> >> Any news on trying the above settings?
> > 
> > Sorry, haven't had the chance to try it yet due to the impending merge 
> > window opening.  Once things settle down I'll give it a try -- or maybe 
> > someone else can test it in the meantime.
> > 
> 
> OK, people, you can stop heckling me about "sent from my iPhone" - I'm in the 
> wilds of rural Louisiana with really bad internet service and was trying to 
> work on my phone (but, alas, did not notice the CC list included the entire 
> universe).   Shame on me.
> 
> With the above set, I don't seem to see any stalls with the RCU timeout set 
> to 60s (the default).  I left the board running for 25 minutes; I will fire 
> it up again later and let it run for a bit longer, but usually I end up 
> seeing the problem pretty quickly so I don't expect that to result in 
> anything.  I also didn't see any stalls on Paul's RCU tree as of a week ago 
> at 60s, so as far as I can tell the CPU_IDLE stuff didn't have any impact (it 
> wasn't on when I tested Paul's tree).
> 
> If I drop the timeout to 5s as Paul M. suggested for debug a while back, I do 
> see stalls (both with CPU_IDLE stuff and without). 
> 
> I'm using the default omap2plus config, with RCU stall info enabled and the 
> cpu idle stuff turned on (console dump below).  This is a Panda ES 1.1 
> (OMAP4460)

Thank you for the testing, Becky!

Paul Walmsley, please let me know if the config below doesn't clear things
up for you or if there is some reason why this config is infeasible.

Thanx, Paul

> root@omap4430-panda:~# zcat /proc/config/gz. .. .. ..gz | grep RCU
> # RCU Subsystem
> CONFIG_TREE_RCU=y
> # CONFIG_PREEMPT_RCU is not set
> CONFIG_RCU_FANOUT=32
> CONFIG_RCU_FANOUT_LEAF=16
> # CONFIG_RCU_FANOUT_EXACT is not set
> # CONFIG_RCU_FAST_NO_HZ is not set
> # CONFIG_TREE_RCU_TRACE is not set
> # CONFIG_PROVE_RCU is not set
> # CONFIG_SPARSE_RCU_POINTER is not set
> # CONFIG_RCU_TORTURE_TEST is not set
> CONFIG_RCU_CPU_STALL_TIMEOUT=5
> CONFIG_RCU_CPU_STALL_INFO=y
> # CONFIG_RCU_TRACE is not set
> root@omap4430-panda:~# zcat /proc/config.gz | grep IDLE
> CONFIG_GENERIC_SMP_IDLE_THREAD=y
> CONFIG_CPU_IDLE=y
> CONFIG_CPU_IDLE_GOV_LADDER=y
> CONFIG_CPU_IDLE_GOV_MENU=y
> CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED=y
> 
> Paul, let me know if you want me to try anything else.  My internet 
> connection is spotty today but (obviously :) I will see emails on my phone 
> and will test when I can.
> 
> Cheers,
> B
> 
> Console output:
> 
> root@omap4430-panda:~# [  377.495361] INFO: rcu_sched self-detected stall on 
> CPU
> [  377.500762] .1: (1 ticks this GP) idle=dcd/1/0 
> [  377.505523] . (t=761 jiffies)
> [  377.508666] [] (unwind_backtrace+0x0/0xf8) from [] 
> (rcu_check_callbacks+0x204/0x790)
> [  377.518615] [] (rcu_check_callbacks+0x204/0x790) from 
> [] (update_process_times+0x38/0x68)
> [  377.529022] [] (update_process_times+0x38/0x68) from 
> [] (tick_sched_timer+0x80/0xec)
> [  377.538970] [] (tick_sched_timer+0x80/0xec) from [] 
> (__run_hrtimer+0x7c/0x218)
> [  377.548339] [] (__run_hrtimer+0x7c/0x218) from [] 
> (hrtimer_interrupt+0x130/0x2d8)
> [  377.558013] [] (hrtimer_interrupt+0x130/0x2d8) from [] 
> (twd_handler+0x30/0x44)
> [  377.567413] [] (twd_handler+0x30/0x44) from [] 
> (handle_percpu_devid_irq+0x90/0x158)
> [  377.577270] [] (handle_percpu_devid_irq+0x90/0x158) from 
> [] (generic_handle_irq+0x30/0x44)
> [  377.587768] [] (generic_handle_irq+0x30/0x44) from [] 
> (handle_IRQ+0x4c/0xac)
> [  377.596984] [] (handle_IRQ+0x4c/0xac) from [] 
> (gic_handle_irq+0x24/0x58)
> [  377.605834] [] (gic_handle_irq+0x24/0x58) from [] 
> (__irq_svc+0x44/0x58)
> [  377.614593] Exception stack(0xee06ff08 to 0xee06ff50)
> [  377.619873] ff00:   0001 0001  3b9aca00 
> c608bc44 0057
> [  377.628448] ff20: c146a4f0 0002 c54e3b8f 0056 c048fb3c c0c47654 
> 0001 ee06ff50
> [  377.637023] ff40: c0084774 c0390fac 2113 
> [  377.642333] [] (__irq_svc+0x44/0x58) from [] 
> (cpuidle_wrap_enter+0x4c/0xa4)
> [  377.651458] []

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-20 Thread Bruce, Becky

On Sep 20, 2012, at 2:56 AM, Paul Walmsley wrote:

> Hi,
> 
> On Wed, 19 Sep 2012, Paul E. McKenney wrote:
> 
>> On Thu, Sep 13, 2012 at 06:52:10PM +, Paul Walmsley wrote:
>> 
>>> On Wed, 12 Sep 2012, Paul E. McKenney wrote:
>> 
 Subodh Nijsure (also CCed) reported something that might be similar on
 ARM, and also reported that setting the following got rid of the stalls:

CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y

 At which point he was happy, which was good, but which also left the
 underlying problem unsolved.  Do these affect your system?  If so,
 do they cause a different ARM idle loop to be executed?
>>> 
>>> Will give this a try.  What board was Subodh using?
>> 
>> Any news on trying the above settings?
> 
> Sorry, haven't had the chance to try it yet due to the impending merge 
> window opening.  Once things settle down I'll give it a try -- or maybe 
> someone else can test it in the meantime.
> 

OK, people, you can stop heckling me about "sent from my iPhone" - I'm in the 
wilds of rural Louisiana with really bad internet service and was trying to 
work on my phone (but, alas, did not notice the CC list included the entire 
universe).   Shame on me.

With the above set, I don't seem to see any stalls with the RCU timeout set to 
60s (the default).  I left the board running for 25 minutes; I will fire it up 
again later and let it run for a bit longer, but usually I end up seeing the 
problem pretty quickly so I don't expect that to result in anything.  I also 
didn't see any stalls on Paul's RCU tree as of a week ago at 60s, so as far as 
I can tell the CPU_IDLE stuff didn't have any impact (it wasn't on when I 
tested Paul's tree).

If I drop the timeout to 5s as Paul M. suggested for debug a while back, I do 
see stalls (both with CPU_IDLE stuff and without). 

I'm using the default omap2plus config, with RCU stall info enabled and the cpu 
idle stuff turned on (console dump below).  This is a Panda ES 1.1 (OMAP4460)

root@omap4430-panda:~# zcat /proc/config/gz. .. .. ..gz | grep RCU
# RCU Subsystem
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
CONFIG_RCU_FANOUT=32
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PROVE_RCU is not set
# CONFIG_SPARSE_RCU_POINTER is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=5
CONFIG_RCU_CPU_STALL_INFO=y
# CONFIG_RCU_TRACE is not set
root@omap4430-panda:~# zcat /proc/config.gz | grep IDLE
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y
CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED=y

Paul, let me know if you want me to try anything else.  My internet connection 
is spotty today but (obviously :) I will see emails on my phone and will test 
when I can.

Cheers,
B

Console output:

root@omap4430-panda:~# [  377.495361] INFO: rcu_sched self-detected stall on CPU
[  377.500762] .1: (1 ticks this GP) idle=dcd/1/0 
[  377.505523] . (t=761 jiffies)
[  377.508666] [] (unwind_backtrace+0x0/0xf8) from [] 
(rcu_check_callbacks+0x204/0x790)
[  377.518615] [] (rcu_check_callbacks+0x204/0x790) from [] 
(update_process_times+0x38/0x68)
[  377.529022] [] (update_process_times+0x38/0x68) from [] 
(tick_sched_timer+0x80/0xec)
[  377.538970] [] (tick_sched_timer+0x80/0xec) from [] 
(__run_hrtimer+0x7c/0x218)
[  377.548339] [] (__run_hrtimer+0x7c/0x218) from [] 
(hrtimer_interrupt+0x130/0x2d8)
[  377.558013] [] (hrtimer_interrupt+0x130/0x2d8) from [] 
(twd_handler+0x30/0x44)
[  377.567413] [] (twd_handler+0x30/0x44) from [] 
(handle_percpu_devid_irq+0x90/0x158)
[  377.577270] [] (handle_percpu_devid_irq+0x90/0x158) from 
[] (generic_handle_irq+0x30/0x44)
[  377.587768] [] (generic_handle_irq+0x30/0x44) from [] 
(handle_IRQ+0x4c/0xac)
[  377.596984] [] (handle_IRQ+0x4c/0xac) from [] 
(gic_handle_irq+0x24/0x58)
[  377.605834] [] (gic_handle_irq+0x24/0x58) from [] 
(__irq_svc+0x44/0x58)
[  377.614593] Exception stack(0xee06ff08 to 0xee06ff50)
[  377.619873] ff00:   0001 0001  3b9aca00 
c608bc44 0057
[  377.628448] ff20: c146a4f0 0002 c54e3b8f 0056 c048fb3c c0c47654 
0001 ee06ff50
[  377.637023] ff40: c0084774 c0390fac 2113 
[  377.642333] [] (__irq_svc+0x44/0x58) from [] 
(cpuidle_wrap_enter+0x4c/0xa4)
[  377.651458] [] (cpuidle_wrap_enter+0x4c/0xa4) from [] 
(cpuidle_enter_state+0x14/0x68)
[  377.661499] [] (cpuidle_enter_state+0x14/0x68) from [] 
(cpuidle_enter_state_coupled+0x210/0x2a0)
[  377.672515] [] (cpuidle_enter_state_coupled+0x210/0x2a0) from 
[] (cpuidle_idle_call+0x174/0x308)
[  377.683563] [] (cpuidle_idle_call+0x174/0x308) from [] 
(cpu_idle+0x54/0x12c)
[  377.692779] [] (cpu_idle+0x54/0x12c) from [<8047c6f4>] (0x8047c6f4)
root@omap4430-panda:~# [  821.495361] INFO: rcu_sched self-detected stall on CPU
[  821.500762] .1: (1 ticks this GP)

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-20 Thread Bruce, Becky

I'll give it a try on my panda later today.

B

Sent from my iPhone

On Sep 20, 2012, at 2:56 AM, "Paul Walmsley"  wrote:

> Hi,
> 
> On Wed, 19 Sep 2012, Paul E. McKenney wrote:
> 
>> On Thu, Sep 13, 2012 at 06:52:10PM +, Paul Walmsley wrote:
>> 
>>> On Wed, 12 Sep 2012, Paul E. McKenney wrote:
>> 
 Subodh Nijsure (also CCed) reported something that might be similar on
 ARM, and also reported that setting the following got rid of the stalls:
 
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y
 
 At which point he was happy, which was good, but which also left the
 underlying problem unsolved.  Do these affect your system?  If so,
 do they cause a different ARM idle loop to be executed?
>>> 
>>> Will give this a try.  What board was Subodh using?
>> 
>> Any news on trying the above settings?
> 
> Sorry, haven't had the chance to try it yet due to the impending merge 
> window opening.  Once things settle down I'll give it a try -- or maybe 
> someone else can test it in the meantime.
> 
> 
> - Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-20 Thread Paul Walmsley

Hi,

On Wed, 19 Sep 2012, Paul E. McKenney wrote:

> On Thu, Sep 13, 2012 at 06:52:10PM +, Paul Walmsley wrote:
> 
> > On Wed, 12 Sep 2012, Paul E. McKenney wrote:
>  
> > > Subodh Nijsure (also CCed) reported something that might be similar on
> > > ARM, and also reported that setting the following got rid of the stalls:
> > > 
> > >   CONFIG_CPU_IDLE=y
> > >   CONFIG_CPU_IDLE_GOV_LADDER=y
> > >   CONFIG_CPU_IDLE_GOV_MENU=y
> > > 
> > > At which point he was happy, which was good, but which also left the
> > > underlying problem unsolved.  Do these affect your system?  If so,
> > > do they cause a different ARM idle loop to be executed?
> > 
> > Will give this a try.  What board was Subodh using?
> 
> Any news on trying the above settings?

Sorry, haven't had the chance to try it yet due to the impending merge 
window opening.  Once things settle down I'll give it a try -- or maybe 
someone else can test it in the meantime.


- Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-20 Thread Paul Walmsley

Hi,

On Wed, 19 Sep 2012, Paul E. McKenney wrote:

 On Thu, Sep 13, 2012 at 06:52:10PM +, Paul Walmsley wrote:
 
  On Wed, 12 Sep 2012, Paul E. McKenney wrote:
  
   Subodh Nijsure (also CCed) reported something that might be similar on
   ARM, and also reported that setting the following got rid of the stalls:
   
 CONFIG_CPU_IDLE=y
 CONFIG_CPU_IDLE_GOV_LADDER=y
 CONFIG_CPU_IDLE_GOV_MENU=y
   
   At which point he was happy, which was good, but which also left the
   underlying problem unsolved.  Do these affect your system?  If so,
   do they cause a different ARM idle loop to be executed?
  
  Will give this a try.  What board was Subodh using?
 
 Any news on trying the above settings?

Sorry, haven't had the chance to try it yet due to the impending merge 
window opening.  Once things settle down I'll give it a try -- or maybe 
someone else can test it in the meantime.


- Paul
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-20 Thread Bruce, Becky

I'll give it a try on my panda later today.

B

Sent from my iPhone

On Sep 20, 2012, at 2:56 AM, Paul Walmsley p...@pwsan.com wrote:

 Hi,
 
 On Wed, 19 Sep 2012, Paul E. McKenney wrote:
 
 On Thu, Sep 13, 2012 at 06:52:10PM +, Paul Walmsley wrote:
 
 On Wed, 12 Sep 2012, Paul E. McKenney wrote:
 
 Subodh Nijsure (also CCed) reported something that might be similar on
 ARM, and also reported that setting the following got rid of the stalls:
 
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y
 
 At which point he was happy, which was good, but which also left the
 underlying problem unsolved.  Do these affect your system?  If so,
 do they cause a different ARM idle loop to be executed?
 
 Will give this a try.  What board was Subodh using?
 
 Any news on trying the above settings?
 
 Sorry, haven't had the chance to try it yet due to the impending merge 
 window opening.  Once things settle down I'll give it a try -- or maybe 
 someone else can test it in the meantime.
 
 
 - Paul
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-20 Thread Bruce, Becky


On Sep 20, 2012, at 2:56 AM, Paul Walmsley wrote:

 Hi,
 
 On Wed, 19 Sep 2012, Paul E. McKenney wrote:
 
 On Thu, Sep 13, 2012 at 06:52:10PM +, Paul Walmsley wrote:
 
 On Wed, 12 Sep 2012, Paul E. McKenney wrote:
 
 Subodh Nijsure (also CCed) reported something that might be similar on
 ARM, and also reported that setting the following got rid of the stalls:
 
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y
 
 At which point he was happy, which was good, but which also left the
 underlying problem unsolved.  Do these affect your system?  If so,
 do they cause a different ARM idle loop to be executed?
 
 Will give this a try.  What board was Subodh using?
 
 Any news on trying the above settings?
 
 Sorry, haven't had the chance to try it yet due to the impending merge 
 window opening.  Once things settle down I'll give it a try -- or maybe 
 someone else can test it in the meantime.
 

OK, people, you can stop heckling me about sent from my iPhone - I'm in the 
wilds of rural Louisiana with really bad internet service and was trying to 
work on my phone (but, alas, did not notice the CC list included the entire 
universe).   Shame on me.

With the above set, I don't seem to see any stalls with the RCU timeout set to 
60s (the default).  I left the board running for 25 minutes; I will fire it up 
again later and let it run for a bit longer, but usually I end up seeing the 
problem pretty quickly so I don't expect that to result in anything.  I also 
didn't see any stalls on Paul's RCU tree as of a week ago at 60s, so as far as 
I can tell the CPU_IDLE stuff didn't have any impact (it wasn't on when I 
tested Paul's tree).

If I drop the timeout to 5s as Paul M. suggested for debug a while back, I do 
see stalls (both with CPU_IDLE stuff and without). 

I'm using the default omap2plus config, with RCU stall info enabled and the cpu 
idle stuff turned on (console dump below).  This is a Panda ES 1.1 (OMAP4460)

root@omap4430-panda:~# zcat /proc/config/gz. .. .. ..gz | grep RCU
# RCU Subsystem
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
CONFIG_RCU_FANOUT=32
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PROVE_RCU is not set
# CONFIG_SPARSE_RCU_POINTER is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=5
CONFIG_RCU_CPU_STALL_INFO=y
# CONFIG_RCU_TRACE is not set
root@omap4430-panda:~# zcat /proc/config.gz | grep IDLE
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y
CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED=y

Paul, let me know if you want me to try anything else.  My internet connection 
is spotty today but (obviously :) I will see emails on my phone and will test 
when I can.

Cheers,
B

Console output:

root@omap4430-panda:~# [  377.495361] INFO: rcu_sched self-detected stall on CPU
[  377.500762] .1: (1 ticks this GP) idle=dcd/1/0 
[  377.505523] . (t=761 jiffies)
[  377.508666] [c0019da0] (unwind_backtrace+0x0/0xf8) from [c009b138] 
(rcu_check_callbacks+0x204/0x790)
[  377.518615] [c009b138] (rcu_check_callbacks+0x204/0x790) from [c0045890] 
(update_process_times+0x38/0x68)
[  377.529022] [c0045890] (update_process_times+0x38/0x68) from [c007d47c] 
(tick_sched_timer+0x80/0xec)
[  377.538970] [c007d47c] (tick_sched_timer+0x80/0xec) from [c005a2fc] 
(__run_hrtimer+0x7c/0x218)
[  377.548339] [c005a2fc] (__run_hrtimer+0x7c/0x218) from [c005b040] 
(hrtimer_interrupt+0x130/0x2d8)
[  377.558013] [c005b040] (hrtimer_interrupt+0x130/0x2d8) from [c0018998] 
(twd_handler+0x30/0x44)
[  377.567413] [c0018998] (twd_handler+0x30/0x44) from [c00960cc] 
(handle_percpu_devid_irq+0x90/0x158)
[  377.577270] [c00960cc] (handle_percpu_devid_irq+0x90/0x158) from 
[c00929ac] (generic_handle_irq+0x30/0x44)
[  377.587768] [c00929ac] (generic_handle_irq+0x30/0x44) from [c0013bd8] 
(handle_IRQ+0x4c/0xac)
[  377.596984] [c0013bd8] (handle_IRQ+0x4c/0xac) from [c0008470] 
(gic_handle_irq+0x24/0x58)
[  377.605834] [c0008470] (gic_handle_irq+0x24/0x58) from [c0487604] 
(__irq_svc+0x44/0x58)
[  377.614593] Exception stack(0xee06ff08 to 0xee06ff50)
[  377.619873] ff00:   0001 0001  3b9aca00 
c608bc44 0057
[  377.628448] ff20: c146a4f0 0002 c54e3b8f 0056 c048fb3c c0c47654 
0001 ee06ff50
[  377.637023] ff40: c0084774 c0390fac 2113 
[  377.642333] [c0487604] (__irq_svc+0x44/0x58) from [c0390fac] 
(cpuidle_wrap_enter+0x4c/0xa4)
[  377.651458] [c0390fac] (cpuidle_wrap_enter+0x4c/0xa4) from [c0390a80] 
(cpuidle_enter_state+0x14/0x68)
[  377.661499] [c0390a80] (cpuidle_enter_state+0x14/0x68) from [c0392890] 
(cpuidle_enter_state_coupled+0x210/0x2a0)
[  377.672515] [c0392890] (cpuidle_enter_state_coupled+0x210/0x2a0) from 
[c0390c48] (cpuidle_idle_call+0x174/0x308)
[  377.683563] [c0390c48] (cpuidle_idle_call+0x174/0x308) from [c0014098] 
(cpu_idle+0x54/0x12c)
[

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-20 Thread Paul E. McKenney

On Thu, Sep 20, 2012 at 09:49:13PM +, Bruce, Becky wrote:
 
 On Sep 20, 2012, at 2:56 AM, Paul Walmsley wrote:
 
  Hi,
  
  On Wed, 19 Sep 2012, Paul E. McKenney wrote:
  
  On Thu, Sep 13, 2012 at 06:52:10PM +, Paul Walmsley wrote:
  
  On Wed, 12 Sep 2012, Paul E. McKenney wrote:
  
  Subodh Nijsure (also CCed) reported something that might be similar on
  ARM, and also reported that setting the following got rid of the stalls:
  
   CONFIG_CPU_IDLE=y
   CONFIG_CPU_IDLE_GOV_LADDER=y
   CONFIG_CPU_IDLE_GOV_MENU=y
  
  At which point he was happy, which was good, but which also left the
  underlying problem unsolved.  Do these affect your system?  If so,
  do they cause a different ARM idle loop to be executed?
  
  Will give this a try.  What board was Subodh using?
  
  Any news on trying the above settings?
  
  Sorry, haven't had the chance to try it yet due to the impending merge 
  window opening.  Once things settle down I'll give it a try -- or maybe 
  someone else can test it in the meantime.
  
 
 OK, people, you can stop heckling me about sent from my iPhone - I'm in the 
 wilds of rural Louisiana with really bad internet service and was trying to 
 work on my phone (but, alas, did not notice the CC list included the entire 
 universe).   Shame on me.
 
 With the above set, I don't seem to see any stalls with the RCU timeout set 
 to 60s (the default).  I left the board running for 25 minutes; I will fire 
 it up again later and let it run for a bit longer, but usually I end up 
 seeing the problem pretty quickly so I don't expect that to result in 
 anything.  I also didn't see any stalls on Paul's RCU tree as of a week ago 
 at 60s, so as far as I can tell the CPU_IDLE stuff didn't have any impact (it 
 wasn't on when I tested Paul's tree).
 
 If I drop the timeout to 5s as Paul M. suggested for debug a while back, I do 
 see stalls (both with CPU_IDLE stuff and without). 
 
 I'm using the default omap2plus config, with RCU stall info enabled and the 
 cpu idle stuff turned on (console dump below).  This is a Panda ES 1.1 
 (OMAP4460)

Thank you for the testing, Becky!

Paul Walmsley, please let me know if the config below doesn't clear things
up for you or if there is some reason why this config is infeasible.

Thanx, Paul

 root@omap4430-panda:~# zcat /proc/config/gz. .. .. ..gz | grep RCU
 # RCU Subsystem
 CONFIG_TREE_RCU=y
 # CONFIG_PREEMPT_RCU is not set
 CONFIG_RCU_FANOUT=32
 CONFIG_RCU_FANOUT_LEAF=16
 # CONFIG_RCU_FANOUT_EXACT is not set
 # CONFIG_RCU_FAST_NO_HZ is not set
 # CONFIG_TREE_RCU_TRACE is not set
 # CONFIG_PROVE_RCU is not set
 # CONFIG_SPARSE_RCU_POINTER is not set
 # CONFIG_RCU_TORTURE_TEST is not set
 CONFIG_RCU_CPU_STALL_TIMEOUT=5
 CONFIG_RCU_CPU_STALL_INFO=y
 # CONFIG_RCU_TRACE is not set
 root@omap4430-panda:~# zcat /proc/config.gz | grep IDLE
 CONFIG_GENERIC_SMP_IDLE_THREAD=y
 CONFIG_CPU_IDLE=y
 CONFIG_CPU_IDLE_GOV_LADDER=y
 CONFIG_CPU_IDLE_GOV_MENU=y
 CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED=y
 
 Paul, let me know if you want me to try anything else.  My internet 
 connection is spotty today but (obviously :) I will see emails on my phone 
 and will test when I can.
 
 Cheers,
 B
 
 Console output:
 
 root@omap4430-panda:~# [  377.495361] INFO: rcu_sched self-detected stall on 
 CPU
 [  377.500762] .1: (1 ticks this GP) idle=dcd/1/0 
 [  377.505523] . (t=761 jiffies)
 [  377.508666] [c0019da0] (unwind_backtrace+0x0/0xf8) from [c009b138] 
 (rcu_check_callbacks+0x204/0x790)
 [  377.518615] [c009b138] (rcu_check_callbacks+0x204/0x790) from 
 [c0045890] (update_process_times+0x38/0x68)
 [  377.529022] [c0045890] (update_process_times+0x38/0x68) from 
 [c007d47c] (tick_sched_timer+0x80/0xec)
 [  377.538970] [c007d47c] (tick_sched_timer+0x80/0xec) from [c005a2fc] 
 (__run_hrtimer+0x7c/0x218)
 [  377.548339] [c005a2fc] (__run_hrtimer+0x7c/0x218) from [c005b040] 
 (hrtimer_interrupt+0x130/0x2d8)
 [  377.558013] [c005b040] (hrtimer_interrupt+0x130/0x2d8) from [c0018998] 
 (twd_handler+0x30/0x44)
 [  377.567413] [c0018998] (twd_handler+0x30/0x44) from [c00960cc] 
 (handle_percpu_devid_irq+0x90/0x158)
 [  377.577270] [c00960cc] (handle_percpu_devid_irq+0x90/0x158) from 
 [c00929ac] (generic_handle_irq+0x30/0x44)
 [  377.587768] [c00929ac] (generic_handle_irq+0x30/0x44) from [c0013bd8] 
 (handle_IRQ+0x4c/0xac)
 [  377.596984] [c0013bd8] (handle_IRQ+0x4c/0xac) from [c0008470] 
 (gic_handle_irq+0x24/0x58)
 [  377.605834] [c0008470] (gic_handle_irq+0x24/0x58) from [c0487604] 
 (__irq_svc+0x44/0x58)
 [  377.614593] Exception stack(0xee06ff08 to 0xee06ff50)
 [  377.619873] ff00:   0001 0001  3b9aca00 
 c608bc44 0057
 [  377.628448] ff20: c146a4f0 0002 c54e3b8f 0056 c048fb3c c0c47654 
 0001 ee06ff50
 [  377.637023] ff40: c0084774 c0390fac 2113 
 [  377.642333] [c0487604] (__irq_svc+0x44/0x58) from [c0390fac] 
 (cpuidle_wrap_enter+0x4c/0xa4)
 [  377.651458] [c0390fac]

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-20 Thread Paul Walmsley

On Thu, 20 Sep 2012, Paul E. McKenney wrote:

 On Thu, Sep 20, 2012 at 09:49:13PM +, Bruce, Becky wrote:

  OK, people, you can stop heckling me about sent from my iPhone - I'm in 
  the wilds of rural Louisiana with really bad internet service and was 
  trying to work on my phone (but, alas, did not notice the CC list included 
  the entire universe).   Shame on me.
  
  With the above set, I don't seem to see any stalls with the RCU timeout set 
  to 60s (the default).  I left the board running for 25 minutes; I will fire 
  it up again later and let it run for a bit longer, but usually I end up 
  seeing the problem pretty quickly so I don't expect that to result in 
  anything.  I also didn't see any stalls on Paul's RCU tree as of a week ago 
  at 60s, so as far as I can tell the CPU_IDLE stuff didn't have any impact 
  (it wasn't on when I tested Paul's tree).
  
  If I drop the timeout to 5s as Paul M. suggested for debug a while back, I 
  do see stalls (both with CPU_IDLE stuff and without). 
  
  I'm using the default omap2plus config, with RCU stall info enabled and the 
  cpu idle stuff turned on (console dump below).  This is a Panda ES 1.1 
  (OMAP4460)
 
 Thank you for the testing, Becky!
 
 Paul Walmsley, please let me know if the config below doesn't clear things
 up for you or if there is some reason why this config is infeasible.

Will certainly test it here once things settle down.  But I wouldn't 
consider the use of CONFIG_CPU_IDLE to be a 'fix' until the problem can be 
understood.  We use !CONFIG_CPU_IDLE kernels for testing here.


- Paul
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-20 Thread Paul E. McKenney

On Thu, Sep 20, 2012 at 10:47:25PM +, Paul Walmsley wrote:
 On Thu, 20 Sep 2012, Paul E. McKenney wrote:
 
  On Thu, Sep 20, 2012 at 09:49:13PM +, Bruce, Becky wrote:
 
   OK, people, you can stop heckling me about sent from my iPhone - I'm in 
   the wilds of rural Louisiana with really bad internet service and was 
   trying to work on my phone (but, alas, did not notice the CC list 
   included the entire universe).   Shame on me.
   
   With the above set, I don't seem to see any stalls with the RCU timeout 
   set to 60s (the default).  I left the board running for 25 minutes; I 
   will fire it up again later and let it run for a bit longer, but usually 
   I end up seeing the problem pretty quickly so I don't expect that to 
   result in anything.  I also didn't see any stalls on Paul's RCU tree as 
   of a week ago at 60s, so as far as I can tell the CPU_IDLE stuff didn't 
   have any impact (it wasn't on when I tested Paul's tree).
   
   If I drop the timeout to 5s as Paul M. suggested for debug a while back, 
   I do see stalls (both with CPU_IDLE stuff and without). 
   
   I'm using the default omap2plus config, with RCU stall info enabled and 
   the cpu idle stuff turned on (console dump below).  This is a Panda ES 
   1.1 (OMAP4460)
  
  Thank you for the testing, Becky!
  
  Paul Walmsley, please let me know if the config below doesn't clear things
  up for you or if there is some reason why this config is infeasible.
 
 Will certainly test it here once things settle down.  But I wouldn't 
 consider the use of CONFIG_CPU_IDLE to be a 'fix' until the problem can be 
 understood.  We use !CONFIG_CPU_IDLE kernels for testing here.

Fair point.  I am wondering whether there is some path into the idle
loop that somehow avoids telling RCU that the CPU has in face entered
idle.  There needs to be an rcu_idle_enter() call on the way to idle,
otherwise RCU CPU stall warnings are expected behavior.

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-19 Thread Paul E. McKenney

On Thu, Sep 13, 2012 at 06:52:10PM +, Paul Walmsley wrote:
> Hi Paul,
> 
> thanks for the reply,
> 
> On Wed, 12 Sep 2012, Paul E. McKenney wrote:
> 
> > Interesting.  I am assuming that the interrupt in the stack below came
> > from idle, if not, please let me know what.
> 
> According to the exception stack section in the original traceback, it
> appears that the serial interrupt took the SoC out of idle.
> 
> > Could you please reproduce with CONFIG_RCU_CPU_STALL_INFO=y?  That would
> > give me a bit more information about why RCU thought that there was
> > a stall.  (CCing Becky Bruce, who saw something similar recently.)
> 
> At the bottom of this mail is a series of tracebacks with
> CONFIG_RCU_CPU_STALL_INFO=y.  Unlike the traceback that was sent in
> the last message, these were not triggered by serial activity.  These
> appeared every 300 seconds.
> 
> > Subodh Nijsure (also CCed) reported something that might be similar on
> > ARM, and also reported that setting the following got rid of the stalls:
> > 
> > CONFIG_CPU_IDLE=y
> > CONFIG_CPU_IDLE_GOV_LADDER=y
> > CONFIG_CPU_IDLE_GOV_MENU=y
> > 
> > At which point he was happy, which was good, but which also left the
> > underlying problem unsolved.  Do these affect your system?  If so,
> > do they cause a different ARM idle loop to be executed?
> 
> Will give this a try.  What board was Subodh using?

Hello, Paul,

Any news on trying the above settings?

Thanx, Paul

> - Paul
> 
> 
> Debian GNU/Linux wheezy/sid armel ttyO2
> 
> armel login: [  305.942108] INFO: rcu_sched self-detected stall on CPU
> [  305.944946]  1: (7 GPs behind) idle=57b/1/0 
> [  305.947265]   (t=22811 jiffies)
> [  305.949066] [] (unwind_backtrace+0x0/0xf0) from [] 
> (rcu_check_callbacks+0x1b0/0x678)
> [  305.954223] [] (rcu_check_callbacks+0x1b0/0x678) from 
> [] (update_process_times+0x38/0x68)
> [  305.959625] [] (update_process_times+0x38/0x68) from 
> [] (tick_sched_timer+0x80/0xec)
> [  305.964813] [] (tick_sched_timer+0x80/0xec) from [] 
> (__run_hrtimer+0x7c/0x1e0)
> [  305.969696] [] (__run_hrtimer+0x7c/0x1e0) from [] 
> (hrtimer_interrupt+0x11c/0x2d0)
> [  305.974731] [] (hrtimer_interrupt+0x11c/0x2d0) from [] 
> (twd_handler+0x30/0x44)
> [  305.979644] [] (twd_handler+0x30/0x44) from [] 
> (handle_percpu_devid_irq+0x90/0x13c)
> [  305.984741] [] (handle_percpu_devid_irq+0x90/0x13c) from 
> [] (generic_handle_irq+0x30/0x48)
> [  305.990234] [] (generic_handle_irq+0x30/0x48) from [] 
> (handle_IRQ+0x4c/0xac)
> [  305.995025] [] (handle_IRQ+0x4c/0xac) from [] 
> (gic_handle_irq+0x28/0x5c)
> [  305.999633] [] (gic_handle_irq+0x28/0x5c) from [] 
> (__irq_svc+0x44/0x5c)
> [  306.004180] Exception stack(0xde86ff88 to 0xde86ffd0)
> [  306.006927] ff80:   0003c6d0 0001  de8660c0 
> de86e000 c07c23c8
> [  306.011383] ffa0: c0504590 c0749e20  411fc092 c074a040  
> 0001 de86ffd0
> [  306.015838] ffc0: 0003c6d1 c0014f50 2113 
> [  306.018585] [] (__irq_svc+0x44/0x5c) from [] 
> (default_idle+0x20/0x44)
> [  306.023040] [] (default_idle+0x20/0x44) from [] 
> (cpu_idle+0x9c/0x114)
> [  306.027526] [] (cpu_idle+0x9c/0x114) from [<804f1af4>] 
> (0x804f1af4)
> [  602.004486] INFO: rcu_sched detected stalls on CPUs/tasks:
> [  602.007476]  (detected by 0, t=60707 jiffies)
> [  602.009857] INFO: Stall ended before state dump start
> [  906.027893] INFO: rcu_sched self-detected stall on CPU
> [  906.030700]  1: (6 GPs behind) idle=647/1/0 
> [  906.033020]   (t=38379 jiffies)
> [  906.034790] [] (unwind_backtrace+0x0/0xf0) from [] 
> (rcu_check_callbacks+0x1b0/0x678)
> [  906.039947] [] (rcu_check_callbacks+0x1b0/0x678) from 
> [] (update_process_times+0x38/0x68)
> [  906.045349] [] (update_process_times+0x38/0x68) from 
> [] (tick_sched_timer+0x80/0xec)
> [  906.050537] [] (tick_sched_timer+0x80/0xec) from [] 
> (__run_hrtimer+0x7c/0x1e0)
> [  906.055419] [] (__run_hrtimer+0x7c/0x1e0) from [] 
> (hrtimer_interrupt+0x11c/0x2d0)
> [  906.060424] [] (hrtimer_interrupt+0x11c/0x2d0) from [] 
> (twd_handler+0x30/0x44)
> [  906.065307] [] (twd_handler+0x30/0x44) from [] 
> (handle_percpu_devid_irq+0x90/0x13c)
> [  906.070434] [] (handle_percpu_devid_irq+0x90/0x13c) from 
> [] (generic_handle_irq+0x30/0x48)
> [  906.075897] [] (generic_handle_irq+0x30/0x48) from [] 
> (handle_IRQ+0x4c/0xac)
> [  906.080688] [] (handle_IRQ+0x4c/0xac) from [] 
> (gic_handle_irq+0x28/0x5c)
> [  906.085296] [] (gic_handle_irq+0x28/0x5c) from [] 
> (__irq_svc+0x44/0x5c)
> [  906.089843] Exception stack(0xde86ff88 to 0xde86ffd0)
> [  906.092590] ff80:   0003cb06 0001  de8660c0 
> de86e000 c07c23c8
> [  906.097045] ffa0: c0504590 c0749e20  411fc092 c074a040  
> 0001 de86ffd0
> [  906.101501] ffc0: 0003cb07 c0014f50 2113 
> [  906.104278] [] (__irq_svc+0x44/0x5c) from [] 
> (default_idle+0x20/0x44)
> [  906.108734] []

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-19 Thread Paul E. McKenney

On Thu, Sep 13, 2012 at 06:52:10PM +, Paul Walmsley wrote:
 Hi Paul,
 
 thanks for the reply,
 
 On Wed, 12 Sep 2012, Paul E. McKenney wrote:
 
  Interesting.  I am assuming that the interrupt in the stack below came
  from idle, if not, please let me know what.
 
 According to the exception stack section in the original traceback, it
 appears that the serial interrupt took the SoC out of idle.
 
  Could you please reproduce with CONFIG_RCU_CPU_STALL_INFO=y?  That would
  give me a bit more information about why RCU thought that there was
  a stall.  (CCing Becky Bruce, who saw something similar recently.)
 
 At the bottom of this mail is a series of tracebacks with
 CONFIG_RCU_CPU_STALL_INFO=y.  Unlike the traceback that was sent in
 the last message, these were not triggered by serial activity.  These
 appeared every 300 seconds.
 
  Subodh Nijsure (also CCed) reported something that might be similar on
  ARM, and also reported that setting the following got rid of the stalls:
  
  CONFIG_CPU_IDLE=y
  CONFIG_CPU_IDLE_GOV_LADDER=y
  CONFIG_CPU_IDLE_GOV_MENU=y
  
  At which point he was happy, which was good, but which also left the
  underlying problem unsolved.  Do these affect your system?  If so,
  do they cause a different ARM idle loop to be executed?
 
 Will give this a try.  What board was Subodh using?

Hello, Paul,

Any news on trying the above settings?

Thanx, Paul

 - Paul
 
 
 Debian GNU/Linux wheezy/sid armel ttyO2
 
 armel login: [  305.942108] INFO: rcu_sched self-detected stall on CPU
 [  305.944946]  1: (7 GPs behind) idle=57b/1/0 
 [  305.947265]   (t=22811 jiffies)
 [  305.949066] [c001b7cc] (unwind_backtrace+0x0/0xf0) from [c00acc28] 
 (rcu_check_callbacks+0x1b0/0x678)
 [  305.954223] [c00acc28] (rcu_check_callbacks+0x1b0/0x678) from 
 [c00529e0] (update_process_times+0x38/0x68)
 [  305.959625] [c00529e0] (update_process_times+0x38/0x68) from 
 [c008bf14] (tick_sched_timer+0x80/0xec)
 [  305.964813] [c008bf14] (tick_sched_timer+0x80/0xec) from [c006840c] 
 (__run_hrtimer+0x7c/0x1e0)
 [  305.969696] [c006840c] (__run_hrtimer+0x7c/0x1e0) from [c00691f0] 
 (hrtimer_interrupt+0x11c/0x2d0)
 [  305.974731] [c00691f0] (hrtimer_interrupt+0x11c/0x2d0) from [c001a04c] 
 (twd_handler+0x30/0x44)
 [  305.979644] [c001a04c] (twd_handler+0x30/0x44) from [c00a7068] 
 (handle_percpu_devid_irq+0x90/0x13c)
 [  305.984741] [c00a7068] (handle_percpu_devid_irq+0x90/0x13c) from 
 [c00a37dc] (generic_handle_irq+0x30/0x48)
 [  305.990234] [c00a37dc] (generic_handle_irq+0x30/0x48) from [c0014c58] 
 (handle_IRQ+0x4c/0xac)
 [  305.995025] [c0014c58] (handle_IRQ+0x4c/0xac) from [c0008478] 
 (gic_handle_irq+0x28/0x5c)
 [  305.999633] [c0008478] (gic_handle_irq+0x28/0x5c) from [c04f8ca4] 
 (__irq_svc+0x44/0x5c)
 [  306.004180] Exception stack(0xde86ff88 to 0xde86ffd0)
 [  306.006927] ff80:   0003c6d0 0001  de8660c0 
 de86e000 c07c23c8
 [  306.011383] ffa0: c0504590 c0749e20  411fc092 c074a040  
 0001 de86ffd0
 [  306.015838] ffc0: 0003c6d1 c0014f50 2113 
 [  306.018585] [c04f8ca4] (__irq_svc+0x44/0x5c) from [c0014f50] 
 (default_idle+0x20/0x44)
 [  306.023040] [c0014f50] (default_idle+0x20/0x44) from [c001517c] 
 (cpu_idle+0x9c/0x114)
 [  306.027526] [c001517c] (cpu_idle+0x9c/0x114) from [804f1af4] 
 (0x804f1af4)
 [  602.004486] INFO: rcu_sched detected stalls on CPUs/tasks:
 [  602.007476]  (detected by 0, t=60707 jiffies)
 [  602.009857] INFO: Stall ended before state dump start
 [  906.027893] INFO: rcu_sched self-detected stall on CPU
 [  906.030700]  1: (6 GPs behind) idle=647/1/0 
 [  906.033020]   (t=38379 jiffies)
 [  906.034790] [c001b7cc] (unwind_backtrace+0x0/0xf0) from [c00acc28] 
 (rcu_check_callbacks+0x1b0/0x678)
 [  906.039947] [c00acc28] (rcu_check_callbacks+0x1b0/0x678) from 
 [c00529e0] (update_process_times+0x38/0x68)
 [  906.045349] [c00529e0] (update_process_times+0x38/0x68) from 
 [c008bf14] (tick_sched_timer+0x80/0xec)
 [  906.050537] [c008bf14] (tick_sched_timer+0x80/0xec) from [c006840c] 
 (__run_hrtimer+0x7c/0x1e0)
 [  906.055419] [c006840c] (__run_hrtimer+0x7c/0x1e0) from [c00691f0] 
 (hrtimer_interrupt+0x11c/0x2d0)
 [  906.060424] [c00691f0] (hrtimer_interrupt+0x11c/0x2d0) from [c001a04c] 
 (twd_handler+0x30/0x44)
 [  906.065307] [c001a04c] (twd_handler+0x30/0x44) from [c00a7068] 
 (handle_percpu_devid_irq+0x90/0x13c)
 [  906.070434] [c00a7068] (handle_percpu_devid_irq+0x90/0x13c) from 
 [c00a37dc] (generic_handle_irq+0x30/0x48)
 [  906.075897] [c00a37dc] (generic_handle_irq+0x30/0x48) from [c0014c58] 
 (handle_IRQ+0x4c/0xac)
 [  906.080688] [c0014c58] (handle_IRQ+0x4c/0xac) from [c0008478] 
 (gic_handle_irq+0x28/0x5c)
 [  906.085296] [c0008478] (gic_handle_irq+0x28/0x5c) from [c04f8ca4] 
 (__irq_svc+0x44/0x5c)
 [  906.089843] Exception stack(0xde86ff88 to 0xde86ffd0)
 [  906.092590] ff80:   0003cb06 0001  de8660c0 
 de86e000

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-13 Thread Paul Walmsley

Hi Paul,

thanks for the reply,

On Wed, 12 Sep 2012, Paul E. McKenney wrote:

> Interesting.  I am assuming that the interrupt in the stack below came
> from idle, if not, please let me know what.

According to the exception stack section in the original traceback, it
appears that the serial interrupt took the SoC out of idle.

> Could you please reproduce with CONFIG_RCU_CPU_STALL_INFO=y?  That would
> give me a bit more information about why RCU thought that there was
> a stall.  (CCing Becky Bruce, who saw something similar recently.)

At the bottom of this mail is a series of tracebacks with
CONFIG_RCU_CPU_STALL_INFO=y.  Unlike the traceback that was sent in
the last message, these were not triggered by serial activity.  These
appeared every 300 seconds.

> Subodh Nijsure (also CCed) reported something that might be similar on
> ARM, and also reported that setting the following got rid of the stalls:
> 
>   CONFIG_CPU_IDLE=y
>   CONFIG_CPU_IDLE_GOV_LADDER=y
>   CONFIG_CPU_IDLE_GOV_MENU=y
> 
> At which point he was happy, which was good, but which also left the
> underlying problem unsolved.  Do these affect your system?  If so,
> do they cause a different ARM idle loop to be executed?

Will give this a try.  What board was Subodh using?


- Paul


Debian GNU/Linux wheezy/sid armel ttyO2

armel login: [  305.942108] INFO: rcu_sched self-detected stall on CPU
[  305.944946]  1: (7 GPs behind) idle=57b/1/0 
[  305.947265]   (t=22811 jiffies)
[  305.949066] [] (unwind_backtrace+0x0/0xf0) from [] 
(rcu_check_callbacks+0x1b0/0x678)
[  305.954223] [] (rcu_check_callbacks+0x1b0/0x678) from [] 
(update_process_times+0x38/0x68)
[  305.959625] [] (update_process_times+0x38/0x68) from [] 
(tick_sched_timer+0x80/0xec)
[  305.964813] [] (tick_sched_timer+0x80/0xec) from [] 
(__run_hrtimer+0x7c/0x1e0)
[  305.969696] [] (__run_hrtimer+0x7c/0x1e0) from [] 
(hrtimer_interrupt+0x11c/0x2d0)
[  305.974731] [] (hrtimer_interrupt+0x11c/0x2d0) from [] 
(twd_handler+0x30/0x44)
[  305.979644] [] (twd_handler+0x30/0x44) from [] 
(handle_percpu_devid_irq+0x90/0x13c)
[  305.984741] [] (handle_percpu_devid_irq+0x90/0x13c) from 
[] (generic_handle_irq+0x30/0x48)
[  305.990234] [] (generic_handle_irq+0x30/0x48) from [] 
(handle_IRQ+0x4c/0xac)
[  305.995025] [] (handle_IRQ+0x4c/0xac) from [] 
(gic_handle_irq+0x28/0x5c)
[  305.999633] [] (gic_handle_irq+0x28/0x5c) from [] 
(__irq_svc+0x44/0x5c)
[  306.004180] Exception stack(0xde86ff88 to 0xde86ffd0)
[  306.006927] ff80:   0003c6d0 0001  de8660c0 
de86e000 c07c23c8
[  306.011383] ffa0: c0504590 c0749e20  411fc092 c074a040  
0001 de86ffd0
[  306.015838] ffc0: 0003c6d1 c0014f50 2113 
[  306.018585] [] (__irq_svc+0x44/0x5c) from [] 
(default_idle+0x20/0x44)
[  306.023040] [] (default_idle+0x20/0x44) from [] 
(cpu_idle+0x9c/0x114)
[  306.027526] [] (cpu_idle+0x9c/0x114) from [<804f1af4>] (0x804f1af4)
[  602.004486] INFO: rcu_sched detected stalls on CPUs/tasks:
[  602.007476]  (detected by 0, t=60707 jiffies)
[  602.009857] INFO: Stall ended before state dump start
[  906.027893] INFO: rcu_sched self-detected stall on CPU
[  906.030700]  1: (6 GPs behind) idle=647/1/0 
[  906.033020]   (t=38379 jiffies)
[  906.034790] [] (unwind_backtrace+0x0/0xf0) from [] 
(rcu_check_callbacks+0x1b0/0x678)
[  906.039947] [] (rcu_check_callbacks+0x1b0/0x678) from [] 
(update_process_times+0x38/0x68)
[  906.045349] [] (update_process_times+0x38/0x68) from [] 
(tick_sched_timer+0x80/0xec)
[  906.050537] [] (tick_sched_timer+0x80/0xec) from [] 
(__run_hrtimer+0x7c/0x1e0)
[  906.055419] [] (__run_hrtimer+0x7c/0x1e0) from [] 
(hrtimer_interrupt+0x11c/0x2d0)
[  906.060424] [] (hrtimer_interrupt+0x11c/0x2d0) from [] 
(twd_handler+0x30/0x44)
[  906.065307] [] (twd_handler+0x30/0x44) from [] 
(handle_percpu_devid_irq+0x90/0x13c)
[  906.070434] [] (handle_percpu_devid_irq+0x90/0x13c) from 
[] (generic_handle_irq+0x30/0x48)
[  906.075897] [] (generic_handle_irq+0x30/0x48) from [] 
(handle_IRQ+0x4c/0xac)
[  906.080688] [] (handle_IRQ+0x4c/0xac) from [] 
(gic_handle_irq+0x28/0x5c)
[  906.085296] [] (gic_handle_irq+0x28/0x5c) from [] 
(__irq_svc+0x44/0x5c)
[  906.089843] Exception stack(0xde86ff88 to 0xde86ffd0)
[  906.092590] ff80:   0003cb06 0001  de8660c0 
de86e000 c07c23c8
[  906.097045] ffa0: c0504590 c0749e20  411fc092 c074a040  
0001 de86ffd0
[  906.101501] ffc0: 0003cb07 c0014f50 2113 
[  906.104278] [] (__irq_svc+0x44/0x5c) from [] 
(default_idle+0x20/0x44)
[  906.108734] [] (default_idle+0x20/0x44) from [] 
(cpu_idle+0x9c/0x114)
[  906.113189] [] (cpu_idle+0x9c/0x114) from [<804f1af4>] (0x804f1af4)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-13 Thread Paul Walmsley

Hi Paul,

thanks for the reply,

On Wed, 12 Sep 2012, Paul E. McKenney wrote:

 Interesting.  I am assuming that the interrupt in the stack below came
 from idle, if not, please let me know what.

According to the exception stack section in the original traceback, it
appears that the serial interrupt took the SoC out of idle.

 Could you please reproduce with CONFIG_RCU_CPU_STALL_INFO=y?  That would
 give me a bit more information about why RCU thought that there was
 a stall.  (CCing Becky Bruce, who saw something similar recently.)

At the bottom of this mail is a series of tracebacks with
CONFIG_RCU_CPU_STALL_INFO=y.  Unlike the traceback that was sent in
the last message, these were not triggered by serial activity.  These
appeared every 300 seconds.

 Subodh Nijsure (also CCed) reported something that might be similar on
 ARM, and also reported that setting the following got rid of the stalls:
 
   CONFIG_CPU_IDLE=y
   CONFIG_CPU_IDLE_GOV_LADDER=y
   CONFIG_CPU_IDLE_GOV_MENU=y
 
 At which point he was happy, which was good, but which also left the
 underlying problem unsolved.  Do these affect your system?  If so,
 do they cause a different ARM idle loop to be executed?

Will give this a try.  What board was Subodh using?


- Paul


Debian GNU/Linux wheezy/sid armel ttyO2

armel login: [  305.942108] INFO: rcu_sched self-detected stall on CPU
[  305.944946]  1: (7 GPs behind) idle=57b/1/0 
[  305.947265]   (t=22811 jiffies)
[  305.949066] [c001b7cc] (unwind_backtrace+0x0/0xf0) from [c00acc28] 
(rcu_check_callbacks+0x1b0/0x678)
[  305.954223] [c00acc28] (rcu_check_callbacks+0x1b0/0x678) from [c00529e0] 
(update_process_times+0x38/0x68)
[  305.959625] [c00529e0] (update_process_times+0x38/0x68) from [c008bf14] 
(tick_sched_timer+0x80/0xec)
[  305.964813] [c008bf14] (tick_sched_timer+0x80/0xec) from [c006840c] 
(__run_hrtimer+0x7c/0x1e0)
[  305.969696] [c006840c] (__run_hrtimer+0x7c/0x1e0) from [c00691f0] 
(hrtimer_interrupt+0x11c/0x2d0)
[  305.974731] [c00691f0] (hrtimer_interrupt+0x11c/0x2d0) from [c001a04c] 
(twd_handler+0x30/0x44)
[  305.979644] [c001a04c] (twd_handler+0x30/0x44) from [c00a7068] 
(handle_percpu_devid_irq+0x90/0x13c)
[  305.984741] [c00a7068] (handle_percpu_devid_irq+0x90/0x13c) from 
[c00a37dc] (generic_handle_irq+0x30/0x48)
[  305.990234] [c00a37dc] (generic_handle_irq+0x30/0x48) from [c0014c58] 
(handle_IRQ+0x4c/0xac)
[  305.995025] [c0014c58] (handle_IRQ+0x4c/0xac) from [c0008478] 
(gic_handle_irq+0x28/0x5c)
[  305.999633] [c0008478] (gic_handle_irq+0x28/0x5c) from [c04f8ca4] 
(__irq_svc+0x44/0x5c)
[  306.004180] Exception stack(0xde86ff88 to 0xde86ffd0)
[  306.006927] ff80:   0003c6d0 0001  de8660c0 
de86e000 c07c23c8
[  306.011383] ffa0: c0504590 c0749e20  411fc092 c074a040  
0001 de86ffd0
[  306.015838] ffc0: 0003c6d1 c0014f50 2113 
[  306.018585] [c04f8ca4] (__irq_svc+0x44/0x5c) from [c0014f50] 
(default_idle+0x20/0x44)
[  306.023040] [c0014f50] (default_idle+0x20/0x44) from [c001517c] 
(cpu_idle+0x9c/0x114)
[  306.027526] [c001517c] (cpu_idle+0x9c/0x114) from [804f1af4] (0x804f1af4)
[  602.004486] INFO: rcu_sched detected stalls on CPUs/tasks:
[  602.007476]  (detected by 0, t=60707 jiffies)
[  602.009857] INFO: Stall ended before state dump start
[  906.027893] INFO: rcu_sched self-detected stall on CPU
[  906.030700]  1: (6 GPs behind) idle=647/1/0 
[  906.033020]   (t=38379 jiffies)
[  906.034790] [c001b7cc] (unwind_backtrace+0x0/0xf0) from [c00acc28] 
(rcu_check_callbacks+0x1b0/0x678)
[  906.039947] [c00acc28] (rcu_check_callbacks+0x1b0/0x678) from [c00529e0] 
(update_process_times+0x38/0x68)
[  906.045349] [c00529e0] (update_process_times+0x38/0x68) from [c008bf14] 
(tick_sched_timer+0x80/0xec)
[  906.050537] [c008bf14] (tick_sched_timer+0x80/0xec) from [c006840c] 
(__run_hrtimer+0x7c/0x1e0)
[  906.055419] [c006840c] (__run_hrtimer+0x7c/0x1e0) from [c00691f0] 
(hrtimer_interrupt+0x11c/0x2d0)
[  906.060424] [c00691f0] (hrtimer_interrupt+0x11c/0x2d0) from [c001a04c] 
(twd_handler+0x30/0x44)
[  906.065307] [c001a04c] (twd_handler+0x30/0x44) from [c00a7068] 
(handle_percpu_devid_irq+0x90/0x13c)
[  906.070434] [c00a7068] (handle_percpu_devid_irq+0x90/0x13c) from 
[c00a37dc] (generic_handle_irq+0x30/0x48)
[  906.075897] [c00a37dc] (generic_handle_irq+0x30/0x48) from [c0014c58] 
(handle_IRQ+0x4c/0xac)
[  906.080688] [c0014c58] (handle_IRQ+0x4c/0xac) from [c0008478] 
(gic_handle_irq+0x28/0x5c)
[  906.085296] [c0008478] (gic_handle_irq+0x28/0x5c) from [c04f8ca4] 
(__irq_svc+0x44/0x5c)
[  906.089843] Exception stack(0xde86ff88 to 0xde86ffd0)
[  906.092590] ff80:   0003cb06 0001  de8660c0 
de86e000 c07c23c8
[  906.097045] ffa0: c0504590 c0749e20  411fc092 c074a040  
0001 de86ffd0
[  906.101501] ffc0: 0003cb07 c0014f50 2113 
[  906.104278] [c04f8ca4] (__irq_svc+0x44/0x5c) from [c0014f50] 
(default_idle+0x20/0x44)
[  906.108734] [c0014f50]

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-12 Thread Paul E. McKenney

On Wed, Sep 12, 2012 at 10:51:30PM +, Paul Walmsley wrote:
> 
> Hi Paul
> 
> Recently several of us have been seeing "INFO: rcu_sched self-detected
> stall on CPU { 0} (t=20611 jiffies)" stack tracebacks on various OMAP3
> and 4 board.
> 
> I only noticed it during v3.6-rc3, but I suspect it's been happening 
> for users at least since May:
> 
> http://www.mail-archive.com/linux-omap@vger.kernel.org/msg68942.html
> 
> 
> The only quasi-reproducible test case that I've found so far 
> is to boot the board with serial console enabled to a login prompt, then 
> wait for a few minutes, then send a keypress to the board via serial.
> The tracebacks I get look like this:

Interesting.  I am assuming that the interrupt in the stack below came
from idle, if not, please let me know what.

Could you please reproduce with CONFIG_RCU_CPU_STALL_INFO=y?  That would
give me a bit more information about why RCU thought that there was
a stall.  (CCing Becky Bruce, who saw something similar recently.)

Subodh Nijsure (also CCed) reported something that might be similar on
ARM, and also reported that setting the following got rid of the stalls:

CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y

At which point he was happy, which was good, but which also left the
underlying problem unsolved.  Do these affect your system?  If so,
do they cause a different ARM idle loop to be executed?

Thanx, Paul

> [  467.480712] INFO: rcu_sched self-detected stall on CPU { 0}  (t=20611 
> jiffies)
> [  467.484741] [] (unwind_backtrace+0x0/0xf0) from [] 
> (rcu_check_callbacks+0x180/0x630)
> [  467.489929] [] (rcu_check_callbacks+0x180/0x630) from 
> [] (update_process_times+0x38/0x68)
> [  467.495361] [] (update_process_times+0x38/0x68) from 
> [] (tick_sched_timer+0x80/0xec)
> [  467.500518] [] (tick_sched_timer+0x80/0xec) from [] 
> (__run_hrtimer+0x7c/0x1e0)
> [  467.505401] [] (__run_hrtimer+0x7c/0x1e0) from [] 
> (hrtimer_interrupt+0x11c/0x2d0)
> [  467.510437] [] (hrtimer_interrupt+0x11c/0x2d0) from [] 
> (twd_handler+0x30/0x44)
> [  467.515350] [] (twd_handler+0x30/0x44) from [] 
> (handle_percpu_devid_irq+0x90/0x13c)
> [  467.520477] [] (handle_percpu_devid_irq+0x90/0x13c) from 
> [] (generic_handle_irq+0x30/0x48)
> [  467.525939] [] (generic_handle_irq+0x30/0x48) from [] 
> (handle_IRQ+0x4c/0xac)
> [  467.530731] [] (handle_IRQ+0x4c/0xac) from [] 
> (gic_handle_irq+0x28/0x5c)
> [  467.535339] [] (gic_handle_irq+0x28/0x5c) from [] 
> (__irq_svc+0x44/0x5c)
> [  467.539886] Exception stack(0xc0729f58 to 0xc0729fa0)
> [  467.542663] 9f40:   
> 00047f2a 0001
> [  467.547119] 9f60:  c074a940 c0728000 c07c4b08 c05045a0 c074be20 
>  411fc092
> [  467.551574] 9f80: c074c040  0001 c0729fa0 00047f2b c0014f50 
> 2113 
> [  467.556030] [] (__irq_svc+0x44/0x5c) from [] 
> (default_idle+0x20/0x44)
> [  467.560485] [] (default_idle+0x20/0x44) from [] 
> (cpu_idle+0x9c/0x114)
> [  467.564971] [] (cpu_idle+0x9c/0x114) from [] 
> (start_kernel+0x2b4/0x304)
> 
> Looks like this message was added as of commit
> a858af2875fb291d0f4b0a4419fefbf03c2379c0 ("rcu: Print scheduling-clock
> information on RCU CPU stall-warning messages").
> 
> Do you have any suggestions for how we can determine what is causing
> this?
> 
> Here's an example of a kernel config that we use:
> 
> http://www.pwsan.com/omap/testlogs/am33xx_hwmod_clock_devel_3.7/20120912092510/build/omap2plus_defconfig/Kconfig
> 
> A few observations that may or may not be relevant: we use NO_HZ, and 
> we also have a clockevents timer that is relatively slow to program.
> 
> 
> regards,
> 
> - Paul
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-12 Thread Paul E. McKenney

On Wed, Sep 12, 2012 at 10:51:30PM +, Paul Walmsley wrote:
 
 Hi Paul
 
 Recently several of us have been seeing INFO: rcu_sched self-detected
 stall on CPU { 0} (t=20611 jiffies) stack tracebacks on various OMAP3
 and 4 board.
 
 I only noticed it during v3.6-rc3, but I suspect it's been happening 
 for users at least since May:
 
 http://www.mail-archive.com/linux-omap@vger.kernel.org/msg68942.html
 
 
 The only quasi-reproducible test case that I've found so far 
 is to boot the board with serial console enabled to a login prompt, then 
 wait for a few minutes, then send a keypress to the board via serial.
 The tracebacks I get look like this:

Interesting.  I am assuming that the interrupt in the stack below came
from idle, if not, please let me know what.

Could you please reproduce with CONFIG_RCU_CPU_STALL_INFO=y?  That would
give me a bit more information about why RCU thought that there was
a stall.  (CCing Becky Bruce, who saw something similar recently.)

Subodh Nijsure (also CCed) reported something that might be similar on
ARM, and also reported that setting the following got rid of the stalls:

CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y

At which point he was happy, which was good, but which also left the
underlying problem unsolved.  Do these affect your system?  If so,
do they cause a different ARM idle loop to be executed?

Thanx, Paul

 [  467.480712] INFO: rcu_sched self-detected stall on CPU { 0}  (t=20611 
 jiffies)
 [  467.484741] [c001b7cc] (unwind_backtrace+0x0/0xf0) from [c00acc94] 
 (rcu_check_callbacks+0x180/0x630)
 [  467.489929] [c00acc94] (rcu_check_callbacks+0x180/0x630) from 
 [c0052b18] (update_process_times+0x38/0x68)
 [  467.495361] [c0052b18] (update_process_times+0x38/0x68) from 
 [c008c04c] (tick_sched_timer+0x80/0xec)
 [  467.500518] [c008c04c] (tick_sched_timer+0x80/0xec) from [c0068544] 
 (__run_hrtimer+0x7c/0x1e0)
 [  467.505401] [c0068544] (__run_hrtimer+0x7c/0x1e0) from [c0069328] 
 (hrtimer_interrupt+0x11c/0x2d0)
 [  467.510437] [c0069328] (hrtimer_interrupt+0x11c/0x2d0) from [c001a04c] 
 (twd_handler+0x30/0x44)
 [  467.515350] [c001a04c] (twd_handler+0x30/0x44) from [c00a71a0] 
 (handle_percpu_devid_irq+0x90/0x13c)
 [  467.520477] [c00a71a0] (handle_percpu_devid_irq+0x90/0x13c) from 
 [c00a3914] (generic_handle_irq+0x30/0x48)
 [  467.525939] [c00a3914] (generic_handle_irq+0x30/0x48) from [c0014c58] 
 (handle_IRQ+0x4c/0xac)
 [  467.530731] [c0014c58] (handle_IRQ+0x4c/0xac) from [c0008478] 
 (gic_handle_irq+0x28/0x5c)
 [  467.535339] [c0008478] (gic_handle_irq+0x28/0x5c) from [c04f8ce4] 
 (__irq_svc+0x44/0x5c)
 [  467.539886] Exception stack(0xc0729f58 to 0xc0729fa0)
 [  467.542663] 9f40:   
 00047f2a 0001
 [  467.547119] 9f60:  c074a940 c0728000 c07c4b08 c05045a0 c074be20 
  411fc092
 [  467.551574] 9f80: c074c040  0001 c0729fa0 00047f2b c0014f50 
 2113 
 [  467.556030] [c04f8ce4] (__irq_svc+0x44/0x5c) from [c0014f50] 
 (default_idle+0x20/0x44)
 [  467.560485] [c0014f50] (default_idle+0x20/0x44) from [c001517c] 
 (cpu_idle+0x9c/0x114)
 [  467.564971] [c001517c] (cpu_idle+0x9c/0x114) from [c06d77b0] 
 (start_kernel+0x2b4/0x304)
 
 Looks like this message was added as of commit
 a858af2875fb291d0f4b0a4419fefbf03c2379c0 (rcu: Print scheduling-clock
 information on RCU CPU stall-warning messages).
 
 Do you have any suggestions for how we can determine what is causing
 this?
 
 Here's an example of a kernel config that we use:
 
 http://www.pwsan.com/omap/testlogs/am33xx_hwmod_clock_devel_3.7/20120912092510/build/omap2plus_defconfig/Kconfig
 
 A few observations that may or may not be relevant: we use NO_HZ, and 
 we also have a clockevents timer that is relatively slow to program.
 
 
 regards,
 
 - Paul
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

88 matches

Mail list logo