Re: rcu_preempt self-detected stall on CPU from 4.4-rc4, since 3.17

2016-01-06 Thread Ross Green
Just for completeness, I have attached another rcu_preempt stall
warning from a linux-3.17.1 kernel.

So it looks like these stall warnings go back that far.

It could be that the kernel instrumentation improved enough to detect
this situation around the 3.17 kernel. I can't find any records of
these stall warnings from kernels earlier than that. I guess someone
must have suspected there was a potential for these stalls and hence
added the detection facility.

Regards,

Ross

On Tue, Jan 5, 2016 at 1:21 AM, Ross Green  wrote:
> Well, with the release of 4.4-rc8 I have built and started testing
> the kernel.
>
> With some luck I managed to get a rcu_preempt stall within a few hours
> of testing.
>
> Upon booting, I ran a small series of benchmarks to make sure
> everything was running as expected, did some limited regression
> testing, and then just left the system to idle away, with periodic
> monitoring from the network.
>
>
> Please find attached two stack traces from linux-4.4-rc8 and also
> linux-4.4-rc7.
>
> The interesting thing with the rc7 trace is that there are multiple
> stalls that have occurred over a six-day period.
>
> I realise, Paul, that you have a number of changes pending for the
> RCU code. It would be good to establish what is happening with these
> stalls before those changes land, given that the timings and dynamics
> might change what is happening in the current environment.
>
> As reported earlier, I have never been able to induce these stalls
> with heavy loading of the system. The only method I can be sure of is
> to leave the system quiet and let time go by until a stall occurs.
>
> Regards,
>
> Ross
>
> On Mon, Jan 4, 2016 at 9:00 AM, Ross Green  wrote:
>> Thanks, Paul, for your analysis and investigation.
>>
>> I understand your patches are designed not to "fix" the problem, but
>> rather to move the problem along.
>>
>> I will include a few more stack traces from various kernels. I can go
>> back to around 3.17, with similar trace results.
>>
>> My observation is that the problem can occur at various times and
>> without any "bad" effect, other than that more stalls may happen
>> afterwards.
>>
>> At first I wondered if these might actually be false positives, as the
>> kernel seemed to carry on and run quite happily.  It is rare that I
>> find the kernel just locks up after observing such a stall, or emits a
>> complete kernel splat!
>>
>> Unfortunately, in my testing I have never been able to find a heavy
>> load that would immediately trigger the problem. Indeed, most heavy
>> CPU utilisation seemed to just sail on quite nicely.
>>
>> The time to failure seems, so far, to be non-deterministic, with
>> quiescent systems taking anywhere from a few hours through to some six
>> days before showing the problem.
>>
>> More recent kernels seem to have richer stall and backtrace
>> information, so I was hoping that might shed some light on how these
>> stalls might be occurring.
>>
>> I usually only run a kernel until I get a new -rc release to test
>> with, so a run of one week is a typical cycle.
>>
>> I just wish I could find a sure-fire method to trigger the problem!
>>
>> I have included a few more traces from various kernels, all showing
>> the problem.
>>
>> Regards,
>>
>> Ross
>>
>> On Mon, Jan 4, 2016 at 5:15 AM, Paul E. McKenney
>>  wrote:
>>> On Sun, Jan 03, 2016 at 07:27:17PM +1100, Ross Green wrote:
 I would not describe the load on this test machine as high or real-time.

 Apart from a number of standard daemons, not much more is running at all!

 I normally build a release kernel as soon as possible and set it running.
 Typically I run a series of benchmarks to confirm most things appear
 to be working and then just leave it running. During a normal day I
 will check on the machine four or five times just to see how it's going!
 Typically I will log on remotely via a wifi network connection.

 Just for your information, I will include a few other stack traces from
 previous kernels that may show some trend!


 Please see the attachments.
>>>
>>> Thank you for the additional details.  This does look similar to some
>>> problems I am seeing, though only in heavy rcutorture workloads with
>>> CPU hotplugging.
>>>
>>> I have some crude workarounds in progress; see, for example,
>>> 2da26818e515 (rcu: Awaken grace-period kthread when stalled) at
>>> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git.
>>> This workaround kicks the RCU grace-period kthread on every stall warning.
>>> In my testing, this workaround results in slow but real forward progress.
>>>
>>> I have a better workaround in progress; however, please note:
>>>
>>> 1.  I have no intention of sending these workarounds upstream.
>>>
>>> 2.  The workarounds will splat when they take effect.  In other words,
>>> the idea is not to paper over the problem, but instead to allow
>>> me to separate testing concerns.
>>>
>>> 3.  A fix is needed for the underlying bug, wherever it might be.

Re: rcu_preempt self-detected stall on CPU from 4.4-rc4, since 3.17

2016-01-04 Thread Ross Green
Well, with the release of 4.4-rc8 I have built and started testing
the kernel.

With some luck I managed to get a rcu_preempt stall within a few hours
of testing.

Upon booting, I ran a small series of benchmarks to make sure
everything was running as expected, did some limited regression
testing, and then just left the system to idle away, with periodic
monitoring from the network.


Please find attached two stack traces from linux-4.4-rc8 and also linux-4.4-rc7.

The interesting thing with the rc7 trace is that there are multiple
stalls that have occurred over a six-day period.

I realise, Paul, that you have a number of changes pending for the
RCU code. It would be good to establish what is happening with these
stalls before those changes land, given that the timings and dynamics
might change what is happening in the current environment.

As reported earlier, I have never been able to induce these stalls
with heavy loading of the system. The only method I can be sure of is
to leave the system quiet and let time go by until a stall occurs.

Regards,

Ross

On Mon, Jan 4, 2016 at 9:00 AM, Ross Green  wrote:
> Thanks, Paul, for your analysis and investigation.
>
> I understand your patches are designed not to "fix" the problem, but
> rather to move the problem along.
>
> I will include a few more stack traces from various kernels. I can go
> back to around 3.17, with similar trace results.
>
> My observation is that the problem can occur at various times and
> without any "bad" effect, other than that more stalls may happen
> afterwards.
>
> At first I wondered if these might actually be false positives, as the
> kernel seemed to carry on and run quite happily.  It is rare that I
> find the kernel just locks up after observing such a stall, or emits a
> complete kernel splat!
>
> Unfortunately, in my testing I have never been able to find a heavy
> load that would immediately trigger the problem. Indeed, most heavy
> CPU utilisation seemed to just sail on quite nicely.
>
> The time to failure seems, so far, to be non-deterministic, with
> quiescent systems taking anywhere from a few hours through to some six
> days before showing the problem.
>
> More recent kernels seem to have richer stall and backtrace
> information, so I was hoping that might shed some light on how these
> stalls might be occurring.
>
> I usually only run a kernel until I get a new -rc release to test
> with, so a run of one week is a typical cycle.
>
> I just wish I could find a sure-fire method to trigger the problem!
>
> I have included a few more traces from various kernels, all showing the problem.
>
> Regards,
>
> Ross
>
> On Mon, Jan 4, 2016 at 5:15 AM, Paul E. McKenney
>  wrote:
>> On Sun, Jan 03, 2016 at 07:27:17PM +1100, Ross Green wrote:
>>> I would not describe the load on this test machine as high or real-time.
>>>
>>> Apart from a number of standard daemons, not much more is running at all!
>>>
>>> I normally build a release kernel as soon as possible and set it running.
>>> Typically I run a series of benchmarks to confirm most things appear
>>> to be working and then just leave it running. During a normal day I
>>> will check on the machine four or five times just to see how it's going!
>>> Typically I will log on remotely via a wifi network connection.
>>>
>>> Just for your information, I will include a few other stack traces from
>>> previous kernels that may show some trend!
>>>
>>>
>>> Please see the attachments.
>>
>> Thank you for the additional details.  This does look similar to some
>> problems I am seeing, though only in heavy rcutorture workloads with
>> CPU hotplugging.
>>
>> I have some crude workarounds in progress; see, for example,
>> 2da26818e515 (rcu: Awaken grace-period kthread when stalled) at
>> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git.
>> This workaround kicks the RCU grace-period kthread on every stall warning.
>> In my testing, this workaround results in slow but real forward progress.
>>
>> I have a better workaround in progress; however, please note:
>>
>> 1.  I have no intention of sending these workarounds upstream.
>>
>> 2.  The workarounds will splat when they take effect.  In other words,
>> the idea is not to paper over the problem, but instead to allow
>> me to separate testing concerns.
>>
>> 3.  A fix is needed for the underlying bug, wherever it might be.
>>
>> Thanx, Paul
>>
>>> Regards,
>>>
>>> Ross
>>>
>>>
>>>
>>> On Sun, Jan 3, 2016 at 5:17 PM, Paul E. McKenney
>>>  wrote:
>>> > On Sun, Jan 03, 2016 at 04:29:11PM +1100, Ross Green wrote:
>>> >> Still seeing these rcu_preempt stalls on kernels through to 4.4-rc7
>>> >>
>>> >> Still have not found a sure-fire method to evoke this stall, but have
>>> >> found that it will normally occur within a week of running a kernel -
>>> >> usually when it is quiet with light load.
>>> >>
>>> >> Have seen similar self-detected stalls all the way back to 3.17.
>>> >

Re: rcu_preempt self-detected stall on CPU from 4.4-rc4, since 3.17

2016-01-03 Thread Paul E. McKenney
On Sun, Jan 03, 2016 at 07:27:17PM +1100, Ross Green wrote:
> I would not describe the load on this test machine as high or real-time.
>
> Apart from a number of standard daemons, not much more is running at all!
> 
> I normally build a release kernel as soon as possible and set it running.
> Typically I run a series of benchmarks to confirm most things appear
> to be working and then just leave it running. During a normal day I
> will check on the machine four or five times just to see how it's going!
> Typically I will log on remotely via a wifi network connection.
> 
> Just for your information, I will include a few other stack traces from
> previous kernels that may show some trend!
> 
> 
> Please see the attachments.

Thank you for the additional details.  This does look similar to some
problems I am seeing, though only in heavy rcutorture workloads with
CPU hotplugging.

I have some crude workarounds in progress; see, for example,
2da26818e515 (rcu: Awaken grace-period kthread when stalled) at
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git.
This workaround kicks the RCU grace-period kthread on every stall warning.
In my testing, this workaround results in slow but real forward progress.
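
For the curious, a minimal sketch of what such a kick can look like
(illustrative only, not the actual commit; the function name here is
made up, and the fields assume the 4.4-era rcu_state layout with
->gp_kthread and ->gp_wq):

    /* Illustrative sketch, not commit 2da26818e515: re-deliver a
     * possibly lost wakeup to the grace-period kthread from the
     * stall-warning path.  Field names assume the 4.4-era layout. */
    static void rcu_stall_kick_gp_kthread(struct rcu_state *rsp)
    {
            if (!READ_ONCE(rsp->gp_kthread))
                    return;                 /* GP kthread not yet spawned. */
            wake_up(&rsp->gp_wq);           /* Re-deliver the wakeup. */
    }

If the kthread's normal timer wakeup was lost, a forced wakeup like
this lets the grace period resume, which matches the slow-but-real
forward progress seen in testing.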

I have a better workaround in progress; however, please note:

1.  I have no intention of sending these workarounds upstream.

2.  The workarounds will splat when they take effect.  In other words,
the idea is not to paper over the problem, but instead to allow
me to separate testing concerns.

3.  A fix is needed for the underlying bug, wherever it might be.

Thanx, Paul

> Regards,
> 
> Ross
> 
> 
> 
> On Sun, Jan 3, 2016 at 5:17 PM, Paul E. McKenney
>  wrote:
> > On Sun, Jan 03, 2016 at 04:29:11PM +1100, Ross Green wrote:
> >> Still seeing these rcu_preempt stalls on kernels through to 4.4-rc7
> >>
> >> Still have not found a sure-fire method to evoke this stall, but have
> >> found that it will normally occur within a week of running a kernel -
> >> usually when it is quiet with light load.
> >>
> >> Have seen similar self-detected stalls all the way back to 3.17.
> >> Most recent kernels have included 4.4-rc5 4.4-rc6 and 4.4-rc7
> >>
> >> Regards,
> >>
> >> Ross
> >>
> >> On Fri, Dec 11, 2015 at 10:17 PM, Ross Green  wrote:
> >> > I have been getting these stalls in kernels going back to 3.17.
> >> >
> >> > This stall occurs usually under light load but often requires several
> >> > days to show itself. I have not found any simple way to trigger the
> >> > stall. Indeed heavy workloads seem not to show the fault.
> >> >
> >> > Anyone have any thoughts here?
> >> >
> >> > I thought the recent patch by peterz to kernel/sched/wait.c might
> >> > help the situation, but alas, after a few days of running 4.4-rc4 the
> >> > following turned up.
> >> >
> >> > [179922.003570] INFO: rcu_preempt self-detected stall on CPU
> >> > [179922.008178] INFO: rcu_preempt detected stalls on CPUs/tasks:
> >> > [179922.008178] 0-...: (1 ticks this GP) idle=a91/1/0
> >
> > CPU 0 is non-idle from an RCU perspective.
> >
> >> > softirq=1296733/1296733 fqs=0
> >> > [179922.008178]
> >> > [179922.008209] (detected by 1, t=8775 jiffies, g=576439, c=576438, 
> >> > q=102)
> >> > [179922.008209] Task dump for CPU 0:
> >> > [179922.008209] swapper/0   R [179922.008209]  running 
> >> > [179922.008209] 0 0  0 0x
> >> > [179922.008209] Backtrace:
> >> >
> >> > [179922.008239] Backtrace aborted due to bad frame pointer 
> >
> > Can't have everything, I guess...
> >
> >> > [179922.008239] rcu_preempt kthread starved for 8775 jiffies! g576439 
> >> > c576438 f0x0 s3 ->state=0x1
> >
> > Something is keeping the rcu_preempt grace-period kthread from
> > running.  This far into the grace period, it should have a
> > timer event waking it every few jiffies.  It is currently
> > in TASK_INTERRUPTIBLE state.
> >
> >> > [179922.060302] 0-...: (1 ticks this GP) idle=a91/1/0 
> >> > softirq=1296733/1296733 fqs=0
> >> > [179922.068023]  (t=8775 jiffies g=576439 c=576438 q=102)
> >> > [179922.073913] rcu_preempt kthread starved for 8775 jiffies! g576439 
> >> > c576438 f0x2 s3 ->state=0x100
> >
> > Same story, same grace period, pretty much same time.  Now there is an FQS
> > request (f0x2) and the state is now TASK_WAKING (->state=0x100 == 256).
> >
> >> > [179922.083587] Task dump for CPU 0:
> >> > [179922.087097] swapper/0   R running  0 0  0 0x
> >> > [179922.093292] Backtrace:
> >> > [179922.096313] [] (dump_backtrace) from [] 
> >> > (show_stack+0x18/0x1c)
> >> > [179922.104675]  r7:c0908514 r6:80080193 r5: r4:c090aca8
> >> > [179922.110809] [] (show_stack) from [] 
> >> > (sched_show_task+0xbc/0x110)
> >> > [179922.119049] [] (sched_show_task) from [] 
> >> > (dump_cpu_task+0x40/0x44)
> >> > [179922.127624]  r5:c0917680 r4:
> >> > [179922.131042] [] (dump_cpu_task) from [] 
>

Re: rcu_preempt self-detected stall on CPU from 4.4-rc4, since 3.17

2016-01-03 Thread Ross Green
I would not describe the load on this test machine as high or real-time.

Apart from a number of standard daemons, not much more is running at all!

I normally build a release kernel as soon as possible and set it running.
Typically I run a series of benchmarks to confirm most things appear
to be working and then just leave it running. During a normal day I
will check on the machine four or five times just to see how it's going!
Typically I will log on remotely via a wifi network connection.

Just for your information, I will include a few other stack traces from
previous kernels that may show some trend!


Please see the attachments.

Regards,

Ross



On Sun, Jan 3, 2016 at 5:17 PM, Paul E. McKenney
 wrote:
> On Sun, Jan 03, 2016 at 04:29:11PM +1100, Ross Green wrote:
>> Still seeing these rcu_preempt stalls on kernels through to 4.4-rc7
>>
>> Still have not found a sure-fire method to evoke this stall, but have
>> found that it will normally occur within a week of running a kernel -
>> usually when it is quiet with light load.
>>
>> Have seen similar self-detected stalls all the way back to 3.17.
>> Most recent kernels have included 4.4-rc5 4.4-rc6 and 4.4-rc7
>>
>> Regards,
>>
>> Ross
>>
>> On Fri, Dec 11, 2015 at 10:17 PM, Ross Green  wrote:
>> > I have been getting these stalls in kernels going back to 3.17.
>> >
>> > This stall occurs usually under light load but often requires several
>> > days to show itself. I have not found any simple way to trigger the
>> > stall. Indeed heavy workloads seem not to show the fault.
>> >
>> > Anyone have any thoughts here?
>> >
>> > I thought the recent patch by peterz to kernel/sched/wait.c might
>> > help the situation, but alas, after a few days of running 4.4-rc4 the
>> > following turned up.
>> >
>> > [179922.003570] INFO: rcu_preempt self-detected stall on CPU
>> > [179922.008178] INFO: rcu_preempt detected stalls on CPUs/tasks:
>> > [179922.008178] 0-...: (1 ticks this GP) idle=a91/1/0
>
> CPU 0 is non-idle from an RCU perspective.
>
>> > softirq=1296733/1296733 fqs=0
>> > [179922.008178]
>> > [179922.008209] (detected by 1, t=8775 jiffies, g=576439, c=576438, q=102)
>> > [179922.008209] Task dump for CPU 0:
>> > [179922.008209] swapper/0   R [179922.008209]  running [179922.008209] 
>> > 0 0  0 0x
>> > [179922.008209] Backtrace:
>> >
>> > [179922.008239] Backtrace aborted due to bad frame pointer 
>
> Can't have everything, I guess...
>
>> > [179922.008239] rcu_preempt kthread starved for 8775 jiffies! g576439 
>> > c576438 f0x0 s3 ->state=0x1
>
> Something is keeping the rcu_preempt grace-period kthread from
> running.  This far into the grace period, it should have a
> timer event waking it every few jiffies.  It is currently
> in TASK_INTERRUPTIBLE state.
>
>> > [179922.060302] 0-...: (1 ticks this GP) idle=a91/1/0 
>> > softirq=1296733/1296733 fqs=0
>> > [179922.068023]  (t=8775 jiffies g=576439 c=576438 q=102)
>> > [179922.073913] rcu_preempt kthread starved for 8775 jiffies! g576439 
>> > c576438 f0x2 s3 ->state=0x100
>
> Same story, same grace period, pretty much same time.  Now there is an FQS
> request (f0x2) and the state is now TASK_WAKING (->state=0x100 == 256).
>
>> > [179922.083587] Task dump for CPU 0:
>> > [179922.087097] swapper/0   R running  0 0  0 0x
>> > [179922.093292] Backtrace:
>> > [179922.096313] [] (dump_backtrace) from [] 
>> > (show_stack+0x18/0x1c)
>> > [179922.104675]  r7:c0908514 r6:80080193 r5: r4:c090aca8
>> > [179922.110809] [] (show_stack) from [] 
>> > (sched_show_task+0xbc/0x110)
>> > [179922.119049] [] (sched_show_task) from [] 
>> > (dump_cpu_task+0x40/0x44)
>> > [179922.127624]  r5:c0917680 r4:
>> > [179922.131042] [] (dump_cpu_task) from [] 
>> > (rcu_dump_cpu_stacks+0x9c/0xdc)
>> > [179922.140350]  r5:c0917680 r4:0001
>> > [179922.143157] [] (rcu_dump_cpu_stacks) from [] 
>> > (rcu_check_callbacks+0x504/0x8e4)
>> > [179922.153808]  r9:c0908514 r8:c0917680 r7:0066 r6:2eeab000
>> > r5:c0904300 r4:ef7af300
>> > [179922.161499] [] (rcu_check_callbacks) from [] 
>> > (update_process_times+0x40/0x6c)
>> > [179922.170898]  r10:c009a584 r9:0001 r8:ef7abc4c r7:a3a3
>> > r6:4ec3391c r5:
>> > [179922.179901]  r4:c090aca8
>> > [179922.182708] [] (update_process_times) from []
>> > (tick_sched_handle+0x50/0x54)
>> > [179922.192108]  r5:c0907f10 r4:ef7abe40
>> > [179922.195983] [] (tick_sched_handle) from []
>> > (tick_sched_timer+0x50/0x94)
>> > [179922.204895] [] (tick_sched_timer) from []
>> > (__hrtimer_run_queues+0x110/0x1a0)
>> > [179922.214324]  r7: r6:ef7abc40 r5:ef7abe40 r4:ef7abc00
>> > [179922.220428] [] (__hrtimer_run_queues) from []
>> > (hrtimer_interrupt+0xac/0x1f8)
>> > [179922.227111]  r10:ef7abc78 r9:ef7abc98 r8:ef7abc14 r7:ef7abcb8
>> > r6: r5:0003
>> > [179922.238220]  r4:ef7abc00
>> > [179922.238220] [] (hrtimer_interrupt) from []
>> > (twd_handler+0x38/0x48)
>> > [179922.238220]  r10:c090

Re: rcu_preempt self-detected stall on CPU from 4.4-rc4, since 3.17

2016-01-02 Thread Paul E. McKenney
On Sun, Jan 03, 2016 at 04:29:11PM +1100, Ross Green wrote:
> Still seeing these rcu_preempt stalls on kernels through to 4.4-rc7
> 
> Still have not found a sure-fire method to evoke this stall, but have
> found that it will normally occur within a week of running a kernel -
> usually when it is quiet with light load.
> 
> Have seen similar self-detected stalls all the way back to 3.17.
> Most recent kernels have included 4.4-rc5 4.4-rc6 and 4.4-rc7
> 
> Regards,
> 
> Ross
> 
> On Fri, Dec 11, 2015 at 10:17 PM, Ross Green  wrote:
> > I have been getting these stalls in kernels going back to 3.17.
> >
> > This stall occurs usually under light load but often requires several
> > days to show itself. I have not found any simple way to trigger the
> > stall. Indeed heavy workloads seem not to show the fault.
> >
> > Anyone have any thoughts here?
> >
> > I thought the recent patch by peterz to kernel/sched/wait.c might
> > help the situation, but alas, after a few days of running 4.4-rc4 the
> > following turned up.
> >
> > [179922.003570] INFO: rcu_preempt self-detected stall on CPU
> > [179922.008178] INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [179922.008178] 0-...: (1 ticks this GP) idle=a91/1/0

CPU 0 is non-idle from an RCU perspective.
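
(Decoding "idle=a91/1/0", assuming the 4.4-era print_cpu_stall_info()
format: the three fields are the low bits of the per-CPU ->dynticks
counter, ->dynticks_nesting, and ->dynticks_nmi_nesting.  The counter
value 0xa91 is odd, and an odd ->dynticks value means the CPU is not
in dyntick-idle mode, hence non-idle from RCU's perspective.)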

> > softirq=1296733/1296733 fqs=0
> > [179922.008178]
> > [179922.008209] (detected by 1, t=8775 jiffies, g=576439, c=576438, q=102)
> > [179922.008209] Task dump for CPU 0:
> > [179922.008209] swapper/0   R [179922.008209]  running [179922.008209]  
> >0 0  0 0x
> > [179922.008209] Backtrace:
> >
> > [179922.008239] Backtrace aborted due to bad frame pointer 

Can't have everything, I guess...

> > [179922.008239] rcu_preempt kthread starved for 8775 jiffies! g576439 
> > c576438 f0x0 s3 ->state=0x1

Something is keeping the rcu_preempt grace-period kthread from
running.  This far into the grace period, it should have a
timer event waking it every few jiffies.  It is currently
in TASK_INTERRUPTIBLE state.
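
(The periodic wakeup referred to here comes from the grace-period
kthread's force-quiescent-state wait; a simplified paraphrase of the
4.4-era loop in rcu_gp_kthread(), not line-accurate:

    /* Simplified paraphrase of the 4.4-era FQS wait in
     * rcu_gp_kthread(): even with no explicit wakeup, the timeout
     * alone should wake the kthread every few jiffies while a
     * grace period is in progress. */
    j = jiffies_till_first_fqs;
    for (;;) {
            ret = wait_event_interruptible_timeout(rsp->gp_wq,
                    READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS, j);
            /* ...force quiescent states on holdout CPUs, then
             * recompute j from jiffies_till_next_fqs and repeat... */
    }

So starving for 8775 jiffies means those timer wakeups are either not
arriving or not resulting in the kthread actually running.)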

> > [179922.060302] 0-...: (1 ticks this GP) idle=a91/1/0 
> > softirq=1296733/1296733 fqs=0
> > [179922.068023]  (t=8775 jiffies g=576439 c=576438 q=102)
> > [179922.073913] rcu_preempt kthread starved for 8775 jiffies! g576439 
> > c576438 f0x2 s3 ->state=0x100

Same story, same grace period, pretty much same time.  Now there is an FQS
request (f0x2) and the state is now TASK_WAKING (->state=0x100 == 256).
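
(For readers decoding these fields, the relevant constants, assuming
unmodified 4.4-era headers:

    /* Task ->state values (include/linux/sched.h, 4.4 era): */
    #define TASK_RUNNING            0
    #define TASK_INTERRUPTIBLE      1       /* ->state=0x1 above */
    #define TASK_UNINTERRUPTIBLE    2
    #define TASK_WAKING             256     /* ->state=0x100 above */

    /* Grace-period kthread flags (kernel/rcu/tree.h, 4.4 era): */
    #define RCU_GP_FLAG_INIT        0x1     /* Need grace-period initialization. */
    #define RCU_GP_FLAG_FQS         0x2     /* Need force-quiescent-state: f0x2. */

TASK_WAKING is the transient state in which a wakeup has been issued
but the task has not yet been queued on a run queue.)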

> > [179922.083587] Task dump for CPU 0:
> > [179922.087097] swapper/0   R running  0 0  0 0x
> > [179922.093292] Backtrace:
> > [179922.096313] [] (dump_backtrace) from [] 
> > (show_stack+0x18/0x1c)
> > [179922.104675]  r7:c0908514 r6:80080193 r5: r4:c090aca8
> > [179922.110809] [] (show_stack) from [] 
> > (sched_show_task+0xbc/0x110)
> > [179922.119049] [] (sched_show_task) from [] 
> > (dump_cpu_task+0x40/0x44)
> > [179922.127624]  r5:c0917680 r4:
> > [179922.131042] [] (dump_cpu_task) from [] 
> > (rcu_dump_cpu_stacks+0x9c/0xdc)
> > [179922.140350]  r5:c0917680 r4:0001
> > [179922.143157] [] (rcu_dump_cpu_stacks) from [] 
> > (rcu_check_callbacks+0x504/0x8e4)
> > [179922.153808]  r9:c0908514 r8:c0917680 r7:0066 r6:2eeab000
> > r5:c0904300 r4:ef7af300
> > [179922.161499] [] (rcu_check_callbacks) from [] 
> > (update_process_times+0x40/0x6c)
> > [179922.170898]  r10:c009a584 r9:0001 r8:ef7abc4c r7:a3a3
> > r6:4ec3391c r5:
> > [179922.179901]  r4:c090aca8
> > [179922.182708] [] (update_process_times) from []
> > (tick_sched_handle+0x50/0x54)
> > [179922.192108]  r5:c0907f10 r4:ef7abe40
> > [179922.195983] [] (tick_sched_handle) from []
> > (tick_sched_timer+0x50/0x94)
> > [179922.204895] [] (tick_sched_timer) from []
> > (__hrtimer_run_queues+0x110/0x1a0)
> > [179922.214324]  r7: r6:ef7abc40 r5:ef7abe40 r4:ef7abc00
> > [179922.220428] [] (__hrtimer_run_queues) from []
> > (hrtimer_interrupt+0xac/0x1f8)
> > [179922.227111]  r10:ef7abc78 r9:ef7abc98 r8:ef7abc14 r7:ef7abcb8
> > r6: r5:0003
> > [179922.238220]  r4:ef7abc00
> > [179922.238220] [] (hrtimer_interrupt) from []
> > (twd_handler+0x38/0x48)
> > [179922.238220]  r10:c09084e8 r9:fa241100 r8:0011 r7:ef028780
> > r6:c092574c r5:ef005cc0

All interrupt stack up to this point.

It is quite possible that the stuff below here is at fault as well.
That said, the grace-period kthread should actually get to execute at
some point.  Do you have a heavy real-time load that might be starving
things?

Thanx, Paul

> > [179922.257110]  r4:0001
> > [179922.257110] [] (twd_handler) from [] 
> > (handle_percpu_devid_irq+0x74/0x8c)
> > [179922.269683]  r5:ef005cc0 r4:ef7b1740
> > [179922.269683] [] (handle_percpu_devid_irq) from [] 
> > (generic_handle_irq+0x2c/0x3c)
> > [179922.283233]  r9:fa241100 r8:ef008000 r7:0001 r6:
> > r5: r4:c09013e8
> > [179922.290985] [] (generic_handle_irq) from [] 
> > (__handle_

Re: rcu_preempt self-detected stall on CPU from 4.4-rc4, since 3.17

2016-01-02 Thread Ross Green
Still seeing these rcu_preempt stalls on kernels through to 4.4-rc7

Still have not found a sure-fire method to evoke this stall, but have
found that it will normally occur within a week of running a kernel -
usually when it is quiet with light load.

Have seen similar self-detected stalls all the way back to 3.17.
Most recent kernels have included 4.4-rc5 4.4-rc6 and 4.4-rc7

Regards,

Ross

On Fri, Dec 11, 2015 at 10:17 PM, Ross Green  wrote:
> I have been getting these stalls in kernels going back to 3.17.
>
> This stall occurs usually under light load but often requires several
> days to show itself. I have not found any simple way to trigger the
> stall. Indeed heavy workloads seem not to show the fault.
>
> Anyone have any thoughts here?
>
> I thought the recent patch by peterz to kernel/sched/wait.c might
> help the situation, but alas, after a few days of running 4.4-rc4 the
> following turned up.
>
> [179922.003570] INFO: rcu_preempt self-detected stall on CPU
> [179922.008178] INFO: rcu_preempt detected stalls on CPUs/tasks:
>
> [179922.008178] 0-...: (1 ticks this GP) idle=a91/1/0
> softirq=1296733/1296733 fqs=0
> [179922.008178]
> [179922.008209] (detected by 1, t=8775 jiffies, g=576439, c=576438, q=102)
> [179922.008209] Task dump for CPU 0:
> [179922.008209] swapper/0   R
> [179922.008209]  running
> [179922.008209] 0 0  0 0x
> [179922.008209] Backtrace:
>
> [179922.008239] Backtrace aborted due to bad frame pointer 
> [179922.008239] rcu_preempt kthread starved for 8775 jiffies! g576439
> c576438 f0x0 s3 ->state=0x1
>
> [179922.060302] 0-...: (1 ticks this GP) idle=a91/1/0
> softirq=1296733/1296733 fqs=0
> [179922.068023]  (t=8775 jiffies g=576439 c=576438 q=102)
> [179922.073913] rcu_preempt kthread starved for 8775 jiffies! g576439
> c576438 f0x2 s3 ->state=0x100
> [179922.083587] Task dump for CPU 0:
> [179922.087097] swapper/0   R running  0 0  0 0x
> [179922.093292] Backtrace:
> [179922.096313] [] (dump_backtrace) from []
> (show_stack+0x18/0x1c)
> [179922.104675]  r7:c0908514 r6:80080193 r5: r4:c090aca8
> [179922.110809] [] (show_stack) from []
> (sched_show_task+0xbc/0x110)
> [179922.119049] [] (sched_show_task) from []
> (dump_cpu_task+0x40/0x44)
> [179922.127624]  r5:c0917680 r4:
> [179922.131042] [] (dump_cpu_task) from []
> (rcu_dump_cpu_stacks+0x9c/0xdc)
> [179922.140350]  r5:c0917680 r4:0001
> [179922.143157] [] (rcu_dump_cpu_stacks) from []
> (rcu_check_callbacks+0x504/0x8e4)
> [179922.153808]  r9:c0908514 r8:c0917680 r7:0066 r6:2eeab000
> r5:c0904300 r4:ef7af300
> [179922.161499] [] (rcu_check_callbacks) from []
> (update_process_times+0x40/0x6c)
> [179922.170898]  r10:c009a584 r9:0001 r8:ef7abc4c r7:a3a3
> r6:4ec3391c r5:
> [179922.179901]  r4:c090aca8
> [179922.182708] [] (update_process_times) from []
> (tick_sched_handle+0x50/0x54)
> [179922.192108]  r5:c0907f10 r4:ef7abe40
> [179922.195983] [] (tick_sched_handle) from []
> (tick_sched_timer+0x50/0x94)
> [179922.204895] [] (tick_sched_timer) from []
> (__hrtimer_run_queues+0x110/0x1a0)
> [179922.214324]  r7: r6:ef7abc40 r5:ef7abe40 r4:ef7abc00
> [179922.220428] [] (__hrtimer_run_queues) from []
> (hrtimer_interrupt+0xac/0x1f8)
> [179922.227111]  r10:ef7abc78 r9:ef7abc98 r8:ef7abc14 r7:ef7abcb8
> r6: r5:0003
> [179922.238220]  r4:ef7abc00
> [179922.238220] [] (hrtimer_interrupt) from []
> (twd_handler+0x38/0x48)
> [179922.238220]  r10:c09084e8 r9:fa241100 r8:0011 r7:ef028780
> r6:c092574c r5:ef005cc0
> [179922.257110]  r4:0001
> [179922.257110] [] (twd_handler) from []
> (handle_percpu_devid_irq+0x74/0x8c)
> [179922.269683]  r5:ef005cc0 r4:ef7b1740
> [179922.269683] [] (handle_percpu_devid_irq) from
> [] (generic_handle_irq+0x2c/0x3c)
> [179922.283233]  r9:fa241100 r8:ef008000 r7:0001 r6:
> r5: r4:c09013e8
> [179922.290985] [] (generic_handle_irq) from []
> (__handle_domain_irq+0x64/0xbc)
> [179922.300842] [] (__handle_domain_irq) from []
> (gic_handle_irq+0x50/0x90)
> [179922.303222]  r9:fa241100 r8:fa240100 r7:c0907f10 r6:fa24010c
> r5:c09087a8 r4:c0925748
> [179922.315216] [] (gic_handle_irq) from []
> (__irq_svc+0x54/0x90)
> [179922.319000] Exception stack(0xc0907f10 to 0xc0907f58)
> [179922.331542] 7f00: 
> ef7ab390 fe60 
> [179922.331542] 7f20: c0906000 c090849c c0900364 c06a8124 c0907f80
> c0944563 c09084e8 c0907f6c
> [179922.349029] 7f40: c0907f4c c0907f60 c00287ac c0010ba8 60080113 
> [179922.349029]  r9:c0944563 r8:c0907f80 r7:c0907f44 r6:
> r5:60080113 r4:c0010ba8
> [179922.357116] [] (arch_cpu_idle) from []
> (default_idle_call+0x28/0x34)
> [179922.368926] [] (default_idle_call) from []
> (cpu_startup_entry+0x114/0x18c)
> [179922.368926] [] (cpu_startup_entry) from []
> (rest_init+0x90/0x94)
> [179922.385284]  r7: r4:0002
> [179922.393463] [] (rest_init) from []
> (start

Re: rcu_preempt self-detected stall on CPU from 4.4-rc4, since 3.17

2015-12-11 Thread Peter Zijlstra
On Fri, Dec 11, 2015 at 10:17:33PM +1100, Ross Green wrote:
> I have been getting these stalls in kernels going back to 3.17.
> 
> This stall occurs usually under light load but often requires several
> days to show itself. I have not found any simple way to trigger the
> stall. Indeed heavy workloads seem not to show the fault.
> 
> Anyone have any thoughts here?

see lkml.kernel.org/r/564f3dca.1080...@arm.com