On Fri, 18 Mar 2016 16:56:41 -0700 "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> On Fri, Mar 18, 2016 at 02:00:11PM -0700, Josh Triplett wrote: > > On Thu, Feb 25, 2016 at 04:56:38PM -0800, Paul E. McKenney wrote: > > > On Thu, Feb 25, 2016 at 04:13:11PM +1100, Ross Green wrote: > > > > On Wed, Feb 24, 2016 at 8:28 AM, Ross Green > > > > <rgker...@gmail.com> wrote: > > > > > On Wed, Feb 24, 2016 at 7:55 AM, Paul E. McKenney > > > > > <paul...@linux.vnet.ibm.com> wrote: > > > > > > [ . . . ] > > > > > > > >> Still working on getting decent traces... > > > > > > And I might have succeeded, see below. > > > > > > > >> > > > > >> Thanx, > > > > >> Paul > > > > >> > > > > > > > > > > G'day all, > > > > > > > > > > Here is another dmesg output for 4.5-rc5 showing another > > > > > rcu_preempt stall. This one appeared after only a day of > > > > > running. CONFIG_DEBUG_TIMING is turned on, but can't see any > > > > > output that shows from this. > > > > > > > > > > Again testing as before, > > > > > > > > > > Boot, run a series of small benchmarks, then just let the > > > > > system be and idle away. > > > > > > > > > > I notice in the stack trace there is mention of > > > > > hrtimer_run_queues and hrtimer_interrupt. > > > > > > > > > > Anyway, leave this for a few more eyes to look at. > > > > > > > > > > Open to any other suggestions of things to test. > > > > > > > > > > Regards, > > > > > > > > > > Ross Green > > > > > > > > > > > > G'day Paul, > > > > > > > > I left the pandaboard running and captured another stall. > > > > > > > > the attachment is the dmesg output. > > > > > > > > Again there is no apparent output from any CONFIG_DEBUG_TIMING > > > > so I assume there is nothing happening there. > > > > > > I agree, looks like this is not due to time skew. > > > > > > > I just saw the updates for 4.6 RCU code. > > > > Is the patch in [PATCH tip/core/rcu 04/13] valid here? > > > > > > I doubt that it will help, but you never know. > > > > > > > do you want me try the new patch set with this configuration? > > > > > > Even better would be to try Daniel Wagner's swait patchset. I > > > have attached them in UNIX mbox format, or you can get them from > > > the -tip tree. > > > > > > And I -finally- got some tracing that -might- be useful. The > > > dmesg, all 67MB of it, is here: > > > > > > http://www.rdrop.com/~paulmck/submission/console.2016.02.23a.log > > > > > > This failure mode is less likely to happen, and looks a bit > > > different than the ones that I was seeing before enabling > > > tracing. Then, an additional wakeup would actually wake the task > > > up. In contrast, with tracing enabled, the RCU grace-period > > > kthread goes into "teenager mode", refusing to wake up despite > > > repeated attempts. However, this might be a side-effect of the > > > ftrace dump. > > > > > > On line 525,132, we see that the rcu_preempt grace-period kthread > > > has been starved for 1,188,154 jiffies, or about 20 minutes. > > > This seems unlikely... The kthread is waiting for no more than a > > > three-jiffy timeout ("RCU_GP_WAIT_FQS(3)") and is in > > > TASK_INTERRUPTIBLE state ("0x1"). > > > > We're seeing a similar stall (~60 seconds) on an x86 development > > system here. Any luck tracking down the cause of this? If not, any > > suggestions for traces that might be helpful? > > The dmesg containing the stall, the kernel version, and the .config > would be helpful! Working on a torture test specific to this bug... > > Thanx, Paul > +Reinette, she has the system that can reproduce the issue. I believe she is having some other problems with it at the moment. But the .config should be available. Version is v4.5.