Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-05-02 Thread Geert Uytterhoeven
Hi Vincent,

On Thu, Apr 26, 2018 at 12:31 PM, Vincent Guittot
 wrote:
> On Thursday 26 Apr 2018 at 00:56:03 (+0200), Niklas Söderlund wrote:
>> Here are the result, sorry for the delay.
>>
>> On 2018-04-23 11:54:20 +0200, Vincent Guittot wrote:
>>
>> [snip]
>>
>> >
>> > Thanks for the report. Can you re-run with the following trace-cmd
>> > sequence? My previous sequence disables ftrace events.
>> >
>> > trace-cmd reset > /dev/null
>> > trace-cmd start -b 4 -p function -l dump_backtrace:traceoff -e sched 
>> > -e cpu_idle -e cpu_frequency -e timer -e ipi -e irq -e printk
>> > trace-cmd start -b 4 -p function -l dump_backtrace -e sched -e 
>> > cpu_idle -e cpu_frequency -e timer -e ipi -e irq -e printk
>> >
>> > I have updated the patch and added traces to check that the scheduler returns
>> > from the idle_balance() function and doesn't stay stuck.
>>
>> Once more I applied the change below on top of c18bb396d3d261eb ("Merge
>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").
>>
>> This time the result of 'trace-cmd report' is so large I do not include
>> it here, but I attach the trace.dat file. Not sure why, but the timing of
>> sending the NMI for the backtrace print is different (though the content is
>> the same AFAIK), so on the odd chance it can help figure this out:
>>
>
> Thanks for the trace, I have been able to catch a problem with it.
> Could you test the patch below to confirm that the problem is solved?
> The patch applies on top of
> c18bb396d3d261eb ("Merge 
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
>
> From: Vincent Guittot 
> Date: Thu, 26 Apr 2018 12:19:32 +0200
> Subject: [PATCH] sched/fair: fix the update of blocked load when newly idle
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> With commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle"),
> we release the rq->lock when updating the blocked load of idle CPUs. This opens
> a time window during which another CPU can add a task to this CPU's cfs_rq.
> The check for a newly added task in idle_balance() is not in the common path.
> Move the out label to include this check.
>
> Fixes: 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
> Reported-by: Heiner Kallweit 
> Reported-by: Niklas Söderlund 
> Signed-off-by: Vincent Guittot 

Thanks!
Tested-by: Geert Uytterhoeven 

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-26 Thread Niklas Söderlund
Hi Vincent,

On 2018-04-26 17:27:24 +0200, Vincent Guittot wrote:
> Hi Niklas,
> 
> >> Thanks for the trace, I have been able to catch a problem with it.
> >> Could you test the patch below to confirm that the problem is solved?
> >> The patch applies on top of
> >> c18bb396d3d261eb ("Merge 
> >> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
> >
> > I can confirm that with the patch below I can no longer reproduce the
> > problem. Thanks!
> 
> Thanks for testing
> Do you mind if I add
> Tested-by: Niklas Söderlund 

Please do.

> 
> Peter, Ingo,
> Do you want me to re-send the patch with all tags, or will you take
> this version?
> 
> Regards,
> Vincent
> 
> >
> >>
> >> From: Vincent Guittot 
> >> Date: Thu, 26 Apr 2018 12:19:32 +0200
> >> Subject: [PATCH] sched/fair: fix the update of blocked load when newly idle
> >> MIME-Version: 1.0
> >> Content-Type: text/plain; charset=UTF-8
> >> Content-Transfer-Encoding: 8bit
> >>
> >> With commit 31e77c93e432 ("sched/fair: Update blocked load when newly 
> >> idle"),
> >> we release the rq->lock when updating the blocked load of idle CPUs. This opens
> >> a time window during which another CPU can add a task to this CPU's cfs_rq.
> >> The check for a newly added task in idle_balance() is not in the common path.
> >> Move the out label to include this check.
> >>
> >> Fixes: 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
> >> Reported-by: Heiner Kallweit 
> >> Reported-by: Niklas Söderlund 
> >> Signed-off-by: Vincent Guittot 
> >> ---
> >>  kernel/sched/fair.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 0951d1c..15a9f5e 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -9847,6 +9847,7 @@ static int idle_balance(struct rq *this_rq, struct 
> >> rq_flags *rf)
> >>   if (curr_cost > this_rq->max_idle_balance_cost)
> >>   this_rq->max_idle_balance_cost = curr_cost;
> >>
> >> +out:
> >>   /*
> >>* While browsing the domains, we released the rq lock, a task could
> >>* have been enqueued in the meantime. Since we're not going idle,
> >> @@ -9855,7 +9856,6 @@ static int idle_balance(struct rq *this_rq, struct 
> >> rq_flags *rf)
> >>   if (this_rq->cfs.h_nr_running && !pulled_task)
> >>   pulled_task = 1;
> >>
> >> -out:
> >>   /* Move the next balance forward */
> >>   if (time_after(this_rq->next_balance, next_balance))
> >>   this_rq->next_balance = next_balance;
> >> --
> >> 2.7.4
> >>
> >>
> >>
> >> [snip]
> >>
> >
> > --
> > Regards,
> > Niklas Söderlund

-- 
Regards,
Niklas Söderlund


Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-26 Thread Vincent Guittot
Hi Niklas,

>> Thanks for the trace, I have been able to catch a problem with it.
>> Could you test the patch below to confirm that the problem is solved?
>> The patch applies on top of
>> c18bb396d3d261eb ("Merge 
>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
>
> I can confirm that with the patch below I can no longer reproduce the
> problem. Thanks!

Thanks for testing
Do you mind if I add
Tested-by: Niklas Söderlund 

Peter, Ingo,
Do you want me to re-send the patch with all tags, or will you take
this version?

Regards,
Vincent

>
>>
>> From: Vincent Guittot 
>> Date: Thu, 26 Apr 2018 12:19:32 +0200
>> Subject: [PATCH] sched/fair: fix the update of blocked load when newly idle
>> MIME-Version: 1.0
>> Content-Type: text/plain; charset=UTF-8
>> Content-Transfer-Encoding: 8bit
>>
>> With commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle"),
>> we release the rq->lock when updating the blocked load of idle CPUs. This opens
>> a time window during which another CPU can add a task to this CPU's cfs_rq.
>> The check for a newly added task in idle_balance() is not in the common path.
>> Move the out label to include this check.
>>
>> Fixes: 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
>> Reported-by: Heiner Kallweit 
>> Reported-by: Niklas Söderlund 
>> Signed-off-by: Vincent Guittot 
>> ---
>>  kernel/sched/fair.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0951d1c..15a9f5e 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9847,6 +9847,7 @@ static int idle_balance(struct rq *this_rq, struct 
>> rq_flags *rf)
>>   if (curr_cost > this_rq->max_idle_balance_cost)
>>   this_rq->max_idle_balance_cost = curr_cost;
>>
>> +out:
>>   /*
>>* While browsing the domains, we released the rq lock, a task could
>>* have been enqueued in the meantime. Since we're not going idle,
>> @@ -9855,7 +9856,6 @@ static int idle_balance(struct rq *this_rq, struct 
>> rq_flags *rf)
>>   if (this_rq->cfs.h_nr_running && !pulled_task)
>>   pulled_task = 1;
>>
>> -out:
>>   /* Move the next balance forward */
>>   if (time_after(this_rq->next_balance, next_balance))
>>   this_rq->next_balance = next_balance;
>> --
>> 2.7.4
>>
>>
>>
>> [snip]
>>
>
> --
> Regards,
> Niklas Söderlund


Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-26 Thread Niklas Söderlund
Hi Vincent,

Thanks for all your help.

On 2018-04-26 12:31:33 +0200, Vincent Guittot wrote:
> Hi Niklas,
> 
> On Thursday 26 Apr 2018 at 00:56:03 (+0200), Niklas Söderlund wrote:
> > Hi Vincent,
> > 
> > Here are the result, sorry for the delay.
> > 
> > On 2018-04-23 11:54:20 +0200, Vincent Guittot wrote:
> > 
> > [snip]
> > 
> > > 
> > > Thanks for the report. Can you re-run with the following trace-cmd
> > > sequence? My previous sequence disables ftrace events.
> > > 
> > > trace-cmd reset > /dev/null
> > > trace-cmd start -b 4 -p function -l dump_backtrace:traceoff -e sched 
> > > -e cpu_idle -e cpu_frequency -e timer -e ipi -e irq -e printk
> > > trace-cmd start -b 4 -p function -l dump_backtrace -e sched -e 
> > > cpu_idle -e cpu_frequency -e timer -e ipi -e irq -e printk
> > > 
> > > I have updated the patch and added traces to check that the scheduler returns
> > > from the idle_balance() function and doesn't stay stuck.
> > 
> > Once more I applied the change below on top of c18bb396d3d261eb ("Merge 
> > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").
> > 
> > This time the result of 'trace-cmd report' is so large I do not include
> > it here, but I attach the trace.dat file. Not sure why, but the timing of
> > sending the NMI for the backtrace print is different (though the content is
> > the same AFAIK), so on the odd chance it can help figure this out:
> > 
> 
> Thanks for the trace, I have been able to catch a problem with it.
> Could you test the patch below to confirm that the problem is solved?
> The patch applies on top of
> c18bb396d3d261eb ("Merge 
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")

I can confirm that with the patch below I can no longer reproduce the
problem. Thanks!

> 
> From: Vincent Guittot 
> Date: Thu, 26 Apr 2018 12:19:32 +0200
> Subject: [PATCH] sched/fair: fix the update of blocked load when newly idle
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> With commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle"),
> we release the rq->lock when updating the blocked load of idle CPUs. This opens
> a time window during which another CPU can add a task to this CPU's cfs_rq.
> The check for a newly added task in idle_balance() is not in the common path.
> Move the out label to include this check.
> 
> Fixes: 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
> Reported-by: Heiner Kallweit 
> Reported-by: Niklas Söderlund 
> Signed-off-by: Vincent Guittot 
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0951d1c..15a9f5e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9847,6 +9847,7 @@ static int idle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>   if (curr_cost > this_rq->max_idle_balance_cost)
>   this_rq->max_idle_balance_cost = curr_cost;
>  
> +out:
>   /*
>* While browsing the domains, we released the rq lock, a task could
>* have been enqueued in the meantime. Since we're not going idle,
> @@ -9855,7 +9856,6 @@ static int idle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>   if (this_rq->cfs.h_nr_running && !pulled_task)
>   pulled_task = 1;
>  
> -out:
>   /* Move the next balance forward */
>   if (time_after(this_rq->next_balance, next_balance))
>   this_rq->next_balance = next_balance;
> -- 
> 2.7.4
> 
> 
> 
> [snip]
> 

-- 
Regards,
Niklas Söderlund


Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-26 Thread Peter Zijlstra
On Thu, Apr 26, 2018 at 12:31:33PM +0200, Vincent Guittot wrote:
> From: Vincent Guittot 
> Date: Thu, 26 Apr 2018 12:19:32 +0200
> Subject: [PATCH] sched/fair: fix the update of blocked load when newly idle
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> With commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle"),
> we release the rq->lock when updating the blocked load of idle CPUs. This opens
> a time window during which another CPU can add a task to this CPU's cfs_rq.
> The check for a newly added task in idle_balance() is not in the common path.
> Move the out label to include this check.

Ah quite so indeed. This could result in us running idle even though
there is a runnable task around -- which is bad.

> Fixes: 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
> Reported-by: Heiner Kallweit 
> Reported-by: Niklas Söderlund 
> Signed-off-by: Vincent Guittot 
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0951d1c..15a9f5e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9847,6 +9847,7 @@ static int idle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>   if (curr_cost > this_rq->max_idle_balance_cost)
>   this_rq->max_idle_balance_cost = curr_cost;
>  
> +out:
>   /*
>* While browsing the domains, we released the rq lock, a task could
>* have been enqueued in the meantime. Since we're not going idle,
> @@ -9855,7 +9856,6 @@ static int idle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>   if (this_rq->cfs.h_nr_running && !pulled_task)
>   pulled_task = 1;
>  
> -out:
>   /* Move the next balance forward */
>   if (time_after(this_rq->next_balance, next_balance))
>   this_rq->next_balance = next_balance;
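
To make the race easier to see, here is a small stand-alone C model of it (an illustration only, not the kernel code: every name in it is invented for the example). One thread plays the CPU that drops its runqueue lock to do some side work before going idle, the other plays a CPU that enqueues a task during that window. Whether the "did work arrive while the lock was dropped?" re-check runs on a given path is exactly what moving the out: label changes.

/*
 * Toy model of the idle_balance() race; build with: gcc -pthread race.c
 * Only the pattern (drop lock -> window -> re-check before idling)
 * mirrors the kernel fix; all names are made up for this example.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;
static bool task_enqueued;		/* models this_rq->cfs.h_nr_running > 0 */

static void *other_cpu(void *arg)
{
	(void)arg;
	usleep(1000);			/* wait until the lock has been dropped */
	pthread_mutex_lock(&rq_lock);
	task_enqueued = true;		/* enqueue a task on the CPU going idle */
	pthread_mutex_unlock(&rq_lock);
	return NULL;
}

/*
 * recheck == true models the fixed label placement (the re-check runs on
 * every path); recheck == false models the pre-fix early exit that
 * skipped it.
 */
static int idle_balance_model(bool recheck)
{
	int pulled_task = 0;

	pthread_mutex_lock(&rq_lock);
	pthread_mutex_unlock(&rq_lock);	/* drop the lock for the side work */
	usleep(10000);			/* window: e.g. blocked-load update */
	pthread_mutex_lock(&rq_lock);
	if (recheck && task_enqueued)
		pulled_task = 1;	/* don't go idle, work has arrived */
	pthread_mutex_unlock(&rq_lock);
	return pulled_task;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, other_cpu, NULL);
	printf("pre-fix path: pulled_task=%d (would go idle despite queued work)\n",
	       idle_balance_model(false));
	printf("fixed path:   pulled_task=%d\n", idle_balance_model(true));
	pthread_join(t, NULL);
	return 0;
}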




Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-26 Thread Vincent Guittot
Hi Niklas,

On Thursday 26 Apr 2018 at 00:56:03 (+0200), Niklas Söderlund wrote:
> Hi Vincent,
> 
> Here are the result, sorry for the delay.
> 
> On 2018-04-23 11:54:20 +0200, Vincent Guittot wrote:
> 
> [snip]
> 
> > 
> > Thanks for the report. Can you re-run with the following trace-cmd sequence?
> > My previous sequence disables ftrace events.
> > 
> > trace-cmd reset > /dev/null
> > trace-cmd start -b 4 -p function -l dump_backtrace:traceoff -e sched -e 
> > cpu_idle -e cpu_frequency -e timer -e ipi -e irq -e printk
> > trace-cmd start -b 4 -p function -l dump_backtrace -e sched -e cpu_idle 
> > -e cpu_frequency -e timer -e ipi -e irq -e printk
> > 
> > I have updated the patch and added traces to check that the scheduler returns
> > from the idle_balance() function and doesn't stay stuck.
> 
> Once more I applied the change below on top of c18bb396d3d261eb ("Merge 
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").
> 
> This time the result of 'trace-cmd report' is so large I do not include
> it here, but I attach the trace.dat file. Not sure why, but the timing of
> sending the NMI for the backtrace print is different (though the content is
> the same AFAIK), so on the odd chance it can help figure this out:
> 

Thanks for the trace, I have been able to catch a problem with it.
Could you test the patch below to confirm that the problem is solved?
The patch applies on top of
c18bb396d3d261eb ("Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")

From: Vincent Guittot 
Date: Thu, 26 Apr 2018 12:19:32 +0200
Subject: [PATCH] sched/fair: fix the update of blocked load when newly idle
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

With commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle"),
we release the rq->lock when updating the blocked load of idle CPUs. This opens
a time window during which another CPU can add a task to this CPU's cfs_rq.
The check for a newly added task in idle_balance() is not in the common path.
Move the out label to include this check.

Fixes: 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
Reported-by: Heiner Kallweit 
Reported-by: Niklas Söderlund 
Signed-off-by: Vincent Guittot 
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0951d1c..15a9f5e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9847,6 +9847,7 @@ static int idle_balance(struct rq *this_rq, struct 
rq_flags *rf)
if (curr_cost > this_rq->max_idle_balance_cost)
this_rq->max_idle_balance_cost = curr_cost;
 
+out:
/*
 * While browsing the domains, we released the rq lock, a task could
 * have been enqueued in the meantime. Since we're not going idle,
@@ -9855,7 +9856,6 @@ static int idle_balance(struct rq *this_rq, struct 
rq_flags *rf)
if (this_rq->cfs.h_nr_running && !pulled_task)
pulled_task = 1;
 
-out:
/* Move the next balance forward */
if (time_after(this_rq->next_balance, next_balance))
this_rq->next_balance = next_balance;
-- 
2.7.4



[snip]



Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-23 Thread Vincent Guittot
Hi Niklas,

On Monday 23 Apr 2018 at 00:18:27 (+0200), Niklas Söderlund wrote:
> Hi Vincent,
> 
> On 2018-04-20 18:00:13 +0200, Vincent Guittot wrote:
> 
> [snip]
> 
> > > 
> > > Your results are similar to Heiner's. The problem is still there
> > > even if we only kick ilb, which mainly sends an IPI reschedule to the
> > > other CPU if it is idle.
> > > 
> > 
> > Could it be possible to record some traces of the problem to get a better view
> > of what happens?
> > 
> > I have a small patch that adds some traces in the functions that seem to
> > create the problem.
> 
> Sure, I applied the patch below on top of c18bb396d3d261eb ("Merge 
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").
> 
> > 
> > 
> > 
> > Also, it would be good to stop tracing when the RCU stall happens.
> > 
> > In case you are not familiar with the trace tool, I have put below how to
> > configure ftrace to trace scheduler, irq, timer ... events and stop tracing
> > when the dump_backtrace function is called:
> > 
> > trace-cmd reset > /dev/null
> > trace-cmd start -b 4 -p function -l dump_backtrace:traceoff -e sched -e 
> > cpu_idle -e cpu_frequency -e timer -e ipi -e irq -e printk
> > trace-cmd start -b 4 -p function -l dump_backtrace
> > 
> > The trace-cmd start has to be called twice to make sure that we will not
> > trace all functions.
> > 
> > Once the dump happens and the trace is stopped, you can extract the traces
> > with 
> > 
> > trace-cmd extract -o 
> 
> Thanks for the help with trace-cmd, I have attached the full
> trace-cmd.dat extracted from running this. But a quick run of trace-cmd
> report.

Thanks for the report. Can you re-run with the following trace-cmd sequence?
My previous sequence disables ftrace events.

trace-cmd reset > /dev/null
trace-cmd start -b 4 -p function -l dump_backtrace:traceoff -e sched -e 
cpu_idle -e cpu_frequency -e timer -e ipi -e irq -e printk
trace-cmd start -b 4 -p function -l dump_backtrace -e sched -e cpu_idle -e 
cpu_frequency -e timer -e ipi -e irq -e printk

I have updated the patch and added traces to check that the scheduler returns from
the idle_balance() function and doesn't stay stuck.

---
 kernel/sched/fair.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0951d1c..4285511 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9606,6 +9606,8 @@ static bool _nohz_idle_balance(struct rq *this_rq, 
unsigned int flags,
 */
WRITE_ONCE(nohz.has_blocked, 0);
 
+   trace_printk("_nohz_idle_balance cpu %d idle %d flag %x", this_cpu, 
idle, flags);
+
/*
 * Ensures that if we miss the CPU, we must see the has_blocked
 * store from nohz_balance_enter_idle().
@@ -9628,6 +9630,8 @@ static bool _nohz_idle_balance(struct rq *this_rq, 
unsigned int flags,
 
rq = cpu_rq(balance_cpu);
 
+   trace_printk("_nohz_idle_balance update cpu %d ", balance_cpu);
+
has_blocked_load |= update_nohz_stats(rq, true);
 
/*
@@ -9680,6 +9684,8 @@ static bool _nohz_idle_balance(struct rq *this_rq, 
unsigned int flags,
if (likely(update_next_balance))
nohz.next_balance = next_balance;
 
+   trace_printk("_nohz_idle_balance return %d", ret);
+
return ret;
 }
 
@@ -9732,6 +9738,8 @@ static void nohz_newidle_balance(struct rq *this_rq)
time_before(jiffies, READ_ONCE(nohz.next_blocked)))
return;
 
+   trace_printk("nohz_newidle_balance start update");
+
> raw_spin_unlock(&this_rq->lock);
/*
 * This CPU is going to be idle and blocked load of idle CPUs
@@ -9742,6 +9750,9 @@ static void nohz_newidle_balance(struct rq *this_rq)
if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
kick_ilb(NOHZ_STATS_KICK);
> raw_spin_lock(&this_rq->lock);
+
+   trace_printk("nohz_newidle_balance lock back");
+
 }
 
 #else /* !CONFIG_NO_HZ_COMMON */
@@ -9869,6 +9880,8 @@ static int idle_balance(struct rq *this_rq, struct 
rq_flags *rf)
 
rq_repin_lock(this_rq, rf);
 
+   trace_printk("idle_balance %d", pulled_task);
+
return pulled_task;
 }
 
-- 
2.7.4

>
> 

[snip]

>rcu_sched-9 [000]   147.342741: bputs:
> pick_next_task_fair: nohz_newidle_balance start update
>rcu_sched-9 [000]   147.342749: bprint:   
> _nohz_idle_balance: _nohz_idle_balance cpu 0 idle 2 flag 2
>rcu_sched-9 [000]   147.342754: bprint:   
> _nohz_idle_balance: _nohz_idle_balance return 1
>rcu_sched-9 [000]   147.382739: bputs:
> pick_next_task_fair: nohz_newidle_balance start update
>rcu_sched-9 [000]   147.382741: bprint:   
> _nohz_idle_balance: _nohz_idle_balance cpu 0 idle 2 flag 2
>rcu_sched-9 [000]   147.382745: bprint:   
> _nohz_idle_balance: _nohz_idle_balance 

Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-20 Thread Joel Fernandes
On Fri, Apr 20, 2018 at 9:00 AM, Vincent Guittot
 wrote:
[..]
>
> On Saturday 14 Apr 2018 at 13:24:20 (+0200), Vincent Guittot wrote:
>> Hi Niklas,
>>
>> On 13 April 2018 at 00:39, Niklas Söderlund
>>  wrote:
>> > Hi Vincent,
>> >
>> > Thanks for helping trying to figure this out.
>> >
>> > On 2018-04-12 15:30:31 +0200, Vincent Guittot wrote:
>> >
>> > [snip]
>> >
>> >>
>> >> I'd like to narrow the problem a bit more with the 2 patches above. Can
>> >> you try
>> >> them separately on top of c18bb396d3d261eb ("Merge 
>> >> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net"))
>> >> and check if one of them fixes the problem?
>> >
>> > I tried your suggested changes based on top of c18bb396d3d261eb.
>> >
>> >>
>> >> (They should apply on linux-next as well)
>> >>
>> >> The first patch always kicks ilb instead of doing ilb on the local cpu before
>> >> entering idle.
>> >>
>> >> ---
>> >>  kernel/sched/fair.c | 3 +--
>> >>  1 file changed, 1 insertion(+), 2 deletions(-)
>> >>
>> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> >> index 0951d1c..b21925b 100644
>> >> --- a/kernel/sched/fair.c
>> >> +++ b/kernel/sched/fair.c
>> >> @@ -9739,8 +9739,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
>> >>* candidate for ilb instead of waking up another idle CPU.
>> >>* Kick an normal ilb if we failed to do the update.
>> >>*/
>> >> - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
>> >> - kick_ilb(NOHZ_STATS_KICK);
>> >> + kick_ilb(NOHZ_STATS_KICK);
>> >>   raw_spin_lock(&this_rq->lock);
>> >>  }
>> >
>> > This change doesn't seem to affect the issue. I can still get the single
>> > ssh session and the system to lock up by hitting the return key. And
>> > opening a second ssh session immediately unblocks both the first ssh
>> > session and the serial console. And I can still trigger the console
>> > warning by just letting the system be once it locks up. I did have to,
>> > just as before, reset the system a few times to trigger the issue.
>>
>> Your results are similar to Heiner's. The problem is still there
>> even if we only kick ilb, which mainly sends an IPI reschedule to the
>> other CPU if it is idle.
>>
>
> Could it be possible to record some traces of the problem to get a better view
> of what happens?
>
> I have a small patch that adds some traces in the functions that seem to
> create the problem.
>
> ---
>  kernel/sched/fair.c | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0951d1c..a951464 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9606,6 +9606,8 @@ static bool _nohz_idle_balance(struct rq *this_rq, 
> unsigned int flags,
>  */
> WRITE_ONCE(nohz.has_blocked, 0);
>
> +   trace_printk("_nohz_idle_balance cpu %d idle %d flag %x", this_cpu, 
> idle, flags);
> +
> /*
>  * Ensures that if we miss the CPU, we must see the has_blocked
>  * store from nohz_balance_enter_idle().
> @@ -9680,6 +9682,8 @@ static bool _nohz_idle_balance(struct rq *this_rq, 
> unsigned int flags,
> if (likely(update_next_balance))
> nohz.next_balance = next_balance;
>
> +   trace_printk("_nohz_idle_balance return %d", ret);
> +
> return ret;
>  }
>
> @@ -9732,6 +9736,8 @@ static void nohz_newidle_balance(struct rq *this_rq)
> time_before(jiffies, READ_ONCE(nohz.next_blocked)))
> return;
>
> +   trace_printk("nohz_newidle_balance start update");
> +
> raw_spin_unlock(&this_rq->lock);
> /*
>  * This CPU is going to be idle and blocked load of idle CPUs
> --
> 2.7.4
>
>
> Also, it would be good to stop tracing when the RCU stall happens.
>
> In case you are not familiar with the trace tool, I have put below how to
> configure ftrace to trace scheduler, irq, timer ... events and stop tracing
> when the dump_backtrace function is called:
>
> trace-cmd reset > /dev/null
> trace-cmd start -b 4 -p function -l dump_backtrace:traceoff -e sched -e 
> cpu_idle -e cpu_frequency -e timer -e ipi -e irq -e printk
> trace-cmd start -b 4 -p function -l dump_backtrace
>
> The trace-cmd start has to be called twice to make sure that we will not
> trace all functions.
>
> Once the dump happens and the trace is stopped, you can extract the traces with
>
> trace-cmd extract -o 

Just adding to that (and... feel free to use or ignore these tips):
To prevent the trace_printk output from getting lost in the trace flood, you
can also call tracing_stop() from wherever the problem is detected and
look at the last set of messages, which is easier. Also, you can dump
the ftrace buffer to the kernel logs before the lockup using
ftrace_dump(); when you do that you probably want to limit the
ftrace buffer size.

thanks,

- Joel
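
As a minimal sketch of the tip above (not from the thread): assuming the calls are dropped into kernel code at whatever point the stall condition is detected, the wrapper below freezes the trace and writes it to the kernel log. The wrapper name and the call site are assumptions for illustration; tracing_stop() and ftrace_dump() are the existing tracing APIs Joel refers to.

#include <linux/kernel.h>

/*
 * Freeze the ftrace ring buffer and dump it to the kernel log.  Calling
 * this from the (debug-only) spot where the problem is first noticed
 * keeps the interesting trace_printk() output from being overwritten.
 */
static void dump_trace_at_detection_point(void)
{
	tracing_stop();			/* stop recording new trace events */
	ftrace_dump(DUMP_ALL);		/* print all per-CPU buffers to the log */
}

Keeping the trace buffer small (for example via the -b option already passed to trace-cmd above, or the trace_buf_size= boot parameter) keeps the resulting dump manageable.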


Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-20 Thread Vincent Guittot
Hi Heiner and Niklas,

On Saturday 14 Apr 2018 at 13:24:20 (+0200), Vincent Guittot wrote:
> Hi Niklas,
> 
> On 13 April 2018 at 00:39, Niklas Söderlund
>  wrote:
> > Hi Vincent,
> >
> > Thanks for helping trying to figure this out.
> >
> > On 2018-04-12 15:30:31 +0200, Vincent Guittot wrote:
> >
> > [snip]
> >
> >>
> >> I'd like to narrow the problem a bit more with the 2 patches above. Can
> >> you try
> >> them separately on top of c18bb396d3d261eb ("Merge 
> >> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net"))
> >> and check if one of them fixes the problem?
> >
> > I tried your suggested changes based on top of c18bb396d3d261eb.
> >
> >>
> >> (They should apply on linux-next as well)
> >>
> >> The first patch always kicks ilb instead of doing ilb on the local cpu before
> >> entering idle.
> >>
> >> ---
> >>  kernel/sched/fair.c | 3 +--
> >>  1 file changed, 1 insertion(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 0951d1c..b21925b 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -9739,8 +9739,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
> >>* candidate for ilb instead of waking up another idle CPU.
> >>* Kick an normal ilb if we failed to do the update.
> >>*/
> >> - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> >> - kick_ilb(NOHZ_STATS_KICK);
> >> + kick_ilb(NOHZ_STATS_KICK);
> >>   raw_spin_lock(&this_rq->lock);
> >>  }
> >
> > This change doesn't seem to affect the issue. I can still get the single
> > ssh session and the system to lock up by hitting the return key. And
> > opening a second ssh session immediately unblocks both the first ssh
> > session and the serial console. And I can still trigger the console
> > warning by just letting the system be once it locks up. I did have to,
> > just as before, reset the system a few times to trigger the issue.
> 
> Your results are similar to Heiner's. The problem is still there
> even if we only kick ilb, which mainly sends an IPI reschedule to the
> other CPU if it is idle.
> 

Could it be possible to record some traces of the problem to get a better view
of what happens?

I have a small patch that adds some traces in the functions that seem to create
the problem.

---
 kernel/sched/fair.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0951d1c..a951464 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9606,6 +9606,8 @@ static bool _nohz_idle_balance(struct rq *this_rq, 
unsigned int flags,
 */
WRITE_ONCE(nohz.has_blocked, 0);
 
+   trace_printk("_nohz_idle_balance cpu %d idle %d flag %x", this_cpu, 
idle, flags);
+
/*
 * Ensures that if we miss the CPU, we must see the has_blocked
 * store from nohz_balance_enter_idle().
@@ -9680,6 +9682,8 @@ static bool _nohz_idle_balance(struct rq *this_rq, 
unsigned int flags,
if (likely(update_next_balance))
nohz.next_balance = next_balance;
 
+   trace_printk("_nohz_idle_balance return %d", ret);
+
return ret;
 }
 
@@ -9732,6 +9736,8 @@ static void nohz_newidle_balance(struct rq *this_rq)
time_before(jiffies, READ_ONCE(nohz.next_blocked)))
return;
 
+   trace_printk("nohz_newidle_balance start update");
+
raw_spin_unlock(&this_rq->lock);
/*
 * This CPU is going to be idle and blocked load of idle CPUs
-- 
2.7.4


Also, it would be good to stop tracing when the RCU stall happens.

In case you are not familiar with the trace tool, I have put below how to
configure ftrace to trace scheduler, irq, timer ... events and stop tracing
when the dump_backtrace function is called:

trace-cmd reset > /dev/null
trace-cmd start -b 4 -p function -l dump_backtrace:traceoff -e sched -e 
cpu_idle -e cpu_frequency -e timer -e ipi -e irq -e printk
trace-cmd start -b 4 -p function -l dump_backtrace

The trace-cmd start has to be called twice to make sure that we will not trace
all functions.

Once the dump happens and the trace is stopped, you can extract the traces with 

trace-cmd extract -o 

Thanks
Vincent



Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-14 Thread Vincent Guittot
On 13 April 2018 at 22:38, Niklas Söderlund
 wrote:
> Hi Vincent,
>
> On 2018-04-12 13:15:19 +0200, Niklas Söderlund wrote:
>> Hi Vincent,
>>
>> Thanks for your feedback.
>>
>> On 2018-04-12 12:33:27 +0200, Vincent Guittot wrote:
>> > Hi Niklas,
>> >
>> > On 12 April 2018 at 11:18, Niklas Söderlund
>> >  wrote:
>> > > Hi Vincent,
>> > >
>> > > I have observed issues running on linus/master from a few days back [1].
>> > > I'm running on a Renesas Koelsch board (arm32) and I can trigger a issue
>> > > by X forwarding the v4l2 test application qv4l2 over ssh and moving the
>> > > courser around in the GUI (best test case description award...). I'm
>> > > sorry about the really bad way I trigger this but I can't do it in any
>> > > other way, I'm happy to try other methods if you got some ideas. The
>> > > symptom of the issue is a complete hang of the system for more then 30
>> > > seconds and then this information is printed in the console:
>> >
>> > Heiner (added to cc) also reported a similar problem with his platform: a
>> > dual-core Celeron.
>> >
>> > Do you confirm that your platform is a dual Cortex-A15? At least that is
>> > what I have seen on the web.
>> > This would confirm that a dual-core system is a key point.
>>
>> I can confirm that my platform is a dual core.
>
> I tested another dual-core system today, a Renesas M3-W ARM64 system, and I
> can observe the same lockups on that system, if it helps you understand
> the problem. It seems to be much harder to trigger the issue on this
> system for some reason. Hitting return in an ssh session doesn't seem to
> produce the lockup, while starting a GUI using X forwarding over ssh makes it
> possible.

Thanks for the test. That confirms it only happens on dual-core systems.

>
> [  392.306441] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  392.312201]  (detected by 0, t=19366 jiffies, g=7177, c=7176, q=35)
> [  392.318555] All QSes seen, last rcu_preempt kthread activity 19368
> (4294990375-4294971007), jiffies_till_next_fqs=1, root ->qsmask 0x0
> [  392.330758] swapper/0   R  running task0 0  0
> 0x0022
> [  392.337883] Call trace:
> [  392.340365]  dump_backtrace+0x0/0x1c8
> [  392.344065]  show_stack+0x14/0x20
> [  392.347416]  sched_show_task+0x224/0x2e8
> [  392.351377]  rcu_check_callbacks+0x8ac/0x8b0
> [  392.355686]  update_process_times+0x2c/0x58
> [  392.359908]  tick_sched_handle.isra.5+0x30/0x50
> [  392.364479]  tick_sched_timer+0x40/0x90
> [  392.368351]  __hrtimer_run_queues+0xfc/0x208
> [  392.372659]  hrtimer_interrupt+0xd4/0x258
> [  392.376710]  arch_timer_handler_virt+0x28/0x48
> [  392.381194]  handle_percpu_devid_irq+0x80/0x138
> [  392.385767]  generic_handle_irq+0x28/0x40
> [  392.389813]  __handle_domain_irq+0x5c/0xb8
> [  392.393946]  gic_handle_irq+0x58/0xa8
> [  392.397640]  el1_irq+0xb4/0x130
> [  392.400810]  arch_cpu_idle+0x14/0x20
> [  392.404422]  default_idle_call+0x1c/0x38
> [  392.408381]  do_idle+0x17c/0x1f8
> [  392.411640]  cpu_startup_entry+0x20/0x28
> [  392.415598]  rest_init+0x24c/0x260
> [  392.419037]  start_kernel+0x3e8/0x414
>
> I was running the same tests on another ARM64 platform earlier using the
> same build, which has more than two cores, and there I could not observe
> this issue.
>
> --
> Regards,
> Niklas Söderlund


Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-14 Thread Vincent Guittot
Hi Niklas,

On 13 April 2018 at 00:39, Niklas Söderlund
 wrote:
> Hi Vincent,
>
> Thanks for helping trying to figure this out.
>
> On 2018-04-12 15:30:31 +0200, Vincent Guittot wrote:
>
> [snip]
>
>>
>> I'd like to narrow the problem a bit more with the 2 patches above. Can
>> you try
>> them separately on top of c18bb396d3d261eb ("Merge 
>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net"))
>> and check if one of them fixes the problem?
>
> I tried your suggested changes based on top of c18bb396d3d261eb.
>
>>
>> (They should apply on linux-next as well)
>>
>> The first patch always kicks ilb instead of doing ilb on the local cpu before
>> entering idle.
>>
>> ---
>>  kernel/sched/fair.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0951d1c..b21925b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9739,8 +9739,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
>>* candidate for ilb instead of waking up another idle CPU.
>>* Kick an normal ilb if we failed to do the update.
>>*/
>> - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
>> - kick_ilb(NOHZ_STATS_KICK);
>> + kick_ilb(NOHZ_STATS_KICK);
>>   raw_spin_lock(&this_rq->lock);
>>  }
>
> This change doesn't seem to affect the issue. I can still get the single
> ssh session and the system to lock up by hitting the return key. And
> opening a second ssh session immediately unblocks both the first ssh
> session and the serial console. And I can still trigger the console
> warning by just letting the system be once it locks up. I did have to,
> just as before, reset the system a few times to trigger the issue.

Your results are similar to Heiner's. The problem is still there
even if we only kick ilb, which mainly sends an IPI reschedule to the
other CPU if it is idle.

>
> [  245.351693] INFO: rcu_sched detected stalls on CPUs/tasks:
> [  245.357199]  0-...!: (1 GPs behind) idle=93c/0/0 softirq=2224/2225 fqs=0
> [  245.363988]  (detected by 1, t=3025 jiffies, g=337, c=336, q=10)
> [  245.370003] Sending NMI from CPU 1 to CPUs 0:
> [  245.374368] NMI backtrace for cpu 0
> [  245.374377] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
> 4.16.0-10930-ged741fb4567c816f #42
> [  245.374379] Hardware name: Generic R8A7791 (Flattened Device Tree)
> [  245.374393] PC is at arch_cpu_idle+0x24/0x40
> [  245.374397] LR is at arch_cpu_idle+0x34/0x40
> [  245.374400] pc : []lr : []psr: 60050013
> [  245.374403] sp : c0b01f40  ip : c0b01f50  fp : c0b01f4c
> [  245.374405] r10: c0a56a38  r9 : e7fffbc0  r8 : c0b04c00
> [  245.374407] r7 : c0b04c78  r6 : c0b04c2c  r5 : e000  r4 : 0001
> [  245.374410] r3 : c0119100  r2 : e77813a8  r1 : 0002d93c  r0 : 
> [  245.374414] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment 
> none
> [  245.374417] Control: 10c5387d  Table: 6662006a  DAC: 0051
> [  245.374421] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
> 4.16.0-10930-ged741fb4567c816f #42
> [  245.374423] Hardware name: Generic R8A7791 (Flattened Device Tree)
> [  245.374425] Backtrace:
> [  245.374435] [] (dump_backtrace) from [] 
> (show_stack+0x18/0x1c)
> [  245.374440]  r7:c0b47278 r6:60050193 r5: r4:c0b73d80
> [  245.374450] [] (show_stack) from [] 
> (dump_stack+0x84/0xa4)
> [  245.374456] [] (dump_stack) from [] 
> (show_regs+0x14/0x18)
> [  245.374460]  r7:c0b47278 r6:c0b01ef0 r5: r4:c0bc62c8
> [  245.374468] [] (show_regs) from [] 
> (nmi_cpu_backtrace+0xfc/0x118)
> [  245.374475] [] (nmi_cpu_backtrace) from [] 
> (handle_IPI+0x22c/0x294)
> [  245.374479]  r7:c0b47278 r6:c0b01ef0 r5:0007 r4:c0a775fc
> [  245.374488] [] (handle_IPI) from [] 
> (gic_handle_irq+0x8c/0x98)
> [  245.374492]  r10:c0a56a38 r9:c0b0 r8:f0803000 r7:c0b47278 r6:c0b01ef0 
> r5:c0b05244
> [  245.374495]  r4:f0802000 r3:0407
> [  245.374501] [] (gic_handle_irq) from [] 
> (__irq_svc+0x6c/0x90)
> [  245.374504] Exception stack(0xc0b01ef0 to 0xc0b01f38)
> [  245.374507] 1ee0:  0002d93c 
> e77813a8 c0119100
> [  245.374512] 1f00: 0001 e000 c0b04c2c c0b04c78 c0b04c00 e7fffbc0 
> c0a56a38 c0b01f4c
> [  245.374516] 1f20: c0b01f50 c0b01f40 c0108564 c0108554 60050013 
> [  245.374521]  r9:c0b0 r8:c0b04c00 r7:c0b01f24 r6: r5:60050013 
> r4:c0108554
> [  245.374528] [] (arch_cpu_idle) from [] 
> (default_idle_call+0x30/0x34)
> [  245.374535] [] (default_idle_call) from [] 
> (do_idle+0xd8/0x128)
> [  245.374540] [] (do_idle) from [] 
> (cpu_startup_entry+0x20/0x24)
> [  245.374543]  r7:c0b04c08 r6: r5:c0b80380 r4:00c2
> [  245.374549] [] (cpu_startup_entry) from [] 
> (rest_init+0x9c/0xbc)
> [  245.374555] [] (rest_init) from [] 
> (start_kernel+0x368/0x3ec)
> [  245.374558]  r5:c0b80380 r4:c0b803c0
> [  245.374563] [] (start_kernel) from [<>] (  (null))
> [  245.375369] 

Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-14 Thread Vincent Guittot
Heiner,

On 12 April 2018 at 21:43, Heiner Kallweit  wrote:


 I'm going to prepare a debug patch to spy what's happening when entering 
 idle
>>
>> I'd like to narrow the problem a bit more with the 2 patches above. Can
>> you try
>> them separately on top of c18bb396d3d261eb ("Merge 
>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net"))
>> and check if one of them fixes the problem?
>>
>> (They should apply on linux-next as well)
>>
>> The first patch always kicks ilb instead of doing ilb on the local cpu before
>> entering idle.
>>
>> ---
>>  kernel/sched/fair.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0951d1c..b21925b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9739,8 +9739,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
>>* candidate for ilb instead of waking up another idle CPU.
>>* Kick an normal ilb if we failed to do the update.
>>*/
>> - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
>> - kick_ilb(NOHZ_STATS_KICK);
>> + kick_ilb(NOHZ_STATS_KICK);
>>   raw_spin_lock(&this_rq->lock);
>>  }
>>
>>
> I tested both patches; with both of them the issue still occurs. However,
> on top of linux-next from yesterday I have the impression that it happens
> less frequently with the second patch.
> On top of the commit mentioned by you I don't see a change in system behavior
> with either patch.

Thanks for the tests.
I was expecting to see more differences between the 2 patches, and
especially no problem with the 1st patch, which only sends an IPI
reschedule to the other CPU if it is idle.
It seems not to really be related to what is done but to the fact that
it is done at that place in the code.

Thanks
>
> Regards, Heiner


Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-13 Thread Niklas Söderlund
Hi Vincent,

On 2018-04-12 13:15:19 +0200, Niklas Söderlund wrote:
> Hi Vincent,
> 
> Thanks for your feedback.
> 
> On 2018-04-12 12:33:27 +0200, Vincent Guittot wrote:
> > Hi Niklas,
> > 
> > On 12 April 2018 at 11:18, Niklas Söderlund
> >  wrote:
> > > Hi Vincent,
> > >
> > > I have observed issues running on linus/master from a few days back [1].
> > > I'm running on a Renesas Koelsch board (arm32) and I can trigger an issue
> > > by X forwarding the v4l2 test application qv4l2 over ssh and moving the
> > > cursor around in the GUI (best test case description award...). I'm
> > > sorry about the really bad way I trigger this but I can't do it in any
> > > other way; I'm happy to try other methods if you have some ideas. The
> > > symptom of the issue is a complete hang of the system for more than 30
> > > seconds, and then this information is printed in the console:
> > 
> > Heiner (added to cc) also reported a similar problem with his platform: a
> > dual-core Celeron.
> > 
> > Do you confirm that your platform is a dual Cortex-A15? At least that is
> > what I have seen on the web.
> > This would confirm that a dual-core system is a key point.
> 
> I can confirm that my platform is a dual core.

I tested another dual-core system today, a Renesas M3-W ARM64 system, and I
can observe the same lockups on that system, if it helps you understand
the problem. It seems to be much harder to trigger the issue on this
system for some reason. Hitting return in an ssh session doesn't seem to
produce the lockup, while starting a GUI using X forwarding over ssh makes it
possible.

[  392.306441] INFO: rcu_preempt detected stalls on CPUs/tasks:
[  392.312201]  (detected by 0, t=19366 jiffies, g=7177, c=7176, q=35)
[  392.318555] All QSes seen, last rcu_preempt kthread activity 19368 
(4294990375-4294971007), jiffies_till_next_fqs=1, root ->qsmask 0x0
[  392.330758] swapper/0   R  running task0 0  0 
0x0022
[  392.337883] Call trace:
[  392.340365]  dump_backtrace+0x0/0x1c8
[  392.344065]  show_stack+0x14/0x20
[  392.347416]  sched_show_task+0x224/0x2e8
[  392.351377]  rcu_check_callbacks+0x8ac/0x8b0
[  392.355686]  update_process_times+0x2c/0x58
[  392.359908]  tick_sched_handle.isra.5+0x30/0x50
[  392.364479]  tick_sched_timer+0x40/0x90
[  392.368351]  __hrtimer_run_queues+0xfc/0x208
[  392.372659]  hrtimer_interrupt+0xd4/0x258
[  392.376710]  arch_timer_handler_virt+0x28/0x48
[  392.381194]  handle_percpu_devid_irq+0x80/0x138
[  392.385767]  generic_handle_irq+0x28/0x40
[  392.389813]  __handle_domain_irq+0x5c/0xb8
[  392.393946]  gic_handle_irq+0x58/0xa8
[  392.397640]  el1_irq+0xb4/0x130
[  392.400810]  arch_cpu_idle+0x14/0x20
[  392.404422]  default_idle_call+0x1c/0x38
[  392.408381]  do_idle+0x17c/0x1f8
[  392.411640]  cpu_startup_entry+0x20/0x28
[  392.415598]  rest_init+0x24c/0x260
[  392.419037]  start_kernel+0x3e8/0x414

I was running the same tests on another ARM64 platform earlier using the
same build, which has more than two cores, and there I could not observe
this issue.

-- 
Regards,
Niklas Söderlund


Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-12 Thread Niklas Söderlund
Hi Vincent,

Thanks for helping trying to figure this out.

On 2018-04-12 15:30:31 +0200, Vincent Guittot wrote:

[snip]

> 
> I'd like to narrow the problem a bit more with the 2 patches above. Can you
> try
> them separately on top of c18bb396d3d261eb ("Merge 
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net"))
> and check if one of them fixes the problem?

I tried your suggested changes based on top of c18bb396d3d261eb.

> 
> (They should apply on linux-next as well)
> 
> The first patch always kicks ilb instead of doing ilb on the local cpu before
> entering idle.
> 
> ---
>  kernel/sched/fair.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0951d1c..b21925b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9739,8 +9739,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
>* candidate for ilb instead of waking up another idle CPU.
>* Kick an normal ilb if we failed to do the update.
>*/
> - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> - kick_ilb(NOHZ_STATS_KICK);
> + kick_ilb(NOHZ_STATS_KICK);
>   raw_spin_lock(&this_rq->lock);
>  }

This change doesn't seem to affect the issue. I can still get the single
ssh session and the system to lock up by hitting the return key. And
opening a second ssh session immediately unblocks both the first ssh
session and the serial console. And I can still trigger the console
warning by just letting the system be once it locks up. I did have to,
just as before, reset the system a few times to trigger the issue.

[  245.351693] INFO: rcu_sched detected stalls on CPUs/tasks:
[  245.357199]  0-...!: (1 GPs behind) idle=93c/0/0 softirq=2224/2225 fqs=0 
[  245.363988]  (detected by 1, t=3025 jiffies, g=337, c=336, q=10)
[  245.370003] Sending NMI from CPU 1 to CPUs 0:
[  245.374368] NMI backtrace for cpu 0
[  245.374377] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
4.16.0-10930-ged741fb4567c816f #42
[  245.374379] Hardware name: Generic R8A7791 (Flattened Device Tree)
[  245.374393] PC is at arch_cpu_idle+0x24/0x40
[  245.374397] LR is at arch_cpu_idle+0x34/0x40
[  245.374400] pc : []lr : []psr: 60050013
[  245.374403] sp : c0b01f40  ip : c0b01f50  fp : c0b01f4c
[  245.374405] r10: c0a56a38  r9 : e7fffbc0  r8 : c0b04c00
[  245.374407] r7 : c0b04c78  r6 : c0b04c2c  r5 : e000  r4 : 0001
[  245.374410] r3 : c0119100  r2 : e77813a8  r1 : 0002d93c  r0 : 
[  245.374414] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[  245.374417] Control: 10c5387d  Table: 6662006a  DAC: 0051
[  245.374421] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
4.16.0-10930-ged741fb4567c816f #42
[  245.374423] Hardware name: Generic R8A7791 (Flattened Device Tree)
[  245.374425] Backtrace: 
[  245.374435] [] (dump_backtrace) from [] 
(show_stack+0x18/0x1c)
[  245.374440]  r7:c0b47278 r6:60050193 r5: r4:c0b73d80
[  245.374450] [] (show_stack) from [] 
(dump_stack+0x84/0xa4)
[  245.374456] [] (dump_stack) from [] (show_regs+0x14/0x18)
[  245.374460]  r7:c0b47278 r6:c0b01ef0 r5: r4:c0bc62c8
[  245.374468] [] (show_regs) from [] 
(nmi_cpu_backtrace+0xfc/0x118)
[  245.374475] [] (nmi_cpu_backtrace) from [] 
(handle_IPI+0x22c/0x294)
[  245.374479]  r7:c0b47278 r6:c0b01ef0 r5:0007 r4:c0a775fc
[  245.374488] [] (handle_IPI) from [] 
(gic_handle_irq+0x8c/0x98)
[  245.374492]  r10:c0a56a38 r9:c0b0 r8:f0803000 r7:c0b47278 r6:c0b01ef0 
r5:c0b05244
[  245.374495]  r4:f0802000 r3:0407
[  245.374501] [] (gic_handle_irq) from [] 
(__irq_svc+0x6c/0x90)
[  245.374504] Exception stack(0xc0b01ef0 to 0xc0b01f38)
[  245.374507] 1ee0:  0002d93c 
e77813a8 c0119100
[  245.374512] 1f00: 0001 e000 c0b04c2c c0b04c78 c0b04c00 e7fffbc0 
c0a56a38 c0b01f4c
[  245.374516] 1f20: c0b01f50 c0b01f40 c0108564 c0108554 60050013 
[  245.374521]  r9:c0b0 r8:c0b04c00 r7:c0b01f24 r6: r5:60050013 
r4:c0108554
[  245.374528] [] (arch_cpu_idle) from [] 
(default_idle_call+0x30/0x34)
[  245.374535] [] (default_idle_call) from [] 
(do_idle+0xd8/0x128)
[  245.374540] [] (do_idle) from [] 
(cpu_startup_entry+0x20/0x24)
[  245.374543]  r7:c0b04c08 r6: r5:c0b80380 r4:00c2
[  245.374549] [] (cpu_startup_entry) from [] 
(rest_init+0x9c/0xbc)
[  245.374555] [] (rest_init) from [] 
(start_kernel+0x368/0x3ec)
[  245.374558]  r5:c0b80380 r4:c0b803c0
[  245.374563] [] (start_kernel) from [<>] (  (null))
[  245.375369] rcu_sched kthread starved for 3028 jiffies! g337 c336 f0x0 
RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=1
[  245.641242] RCU grace-period kthread stack dump:
[  245.645859] rcu_sched   I0 9  2 0x
[  245.651350] Backtrace: 
[  245.653804] [] (__schedule) from [] (schedule+0x94/0xb8)
[  245.660857]  r10:e77903c0 r9:e77903c0 r8: r7:e709bed4 r6:ded7 
r5:
[  245.668689]  r4:e000
[  

Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-12 Thread Niklas Söderlund
Hi Vincent,

On 2018-04-12 12:33:27 +0200, Vincent Guittot wrote:
[snip[

> 
> Can you send me your config?
> 
> I'm going to prepare a debug patch to spy on what's happening when entering idle.

I'm sorry I missed this request when first reading your reply; I'm using
arch/arm/configs/shmobile_defconfig for my tests.

-- 
Regards,
Niklas Söderlund


Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-12 Thread Heiner Kallweit
On 12.04.2018 at 15:30, Vincent Guittot wrote:
> Heiner, Niklas,
> 
>> On Thursday 12 Apr 2018 at 13:15:19 (+0200), Niklas Söderlund wrote:
>> Hi Vincent,
>>
>> Thanks for your feedback.
>>
>> On 2018-04-12 12:33:27 +0200, Vincent Guittot wrote:
>>> Hi Niklas,
>>>
>>> On 12 April 2018 at 11:18, Niklas Söderlund
>>>  wrote:
 Hi Vincent,

 I have observed issues running on linus/master from a few days back [1].
 I'm running on a Renesas Koelsch board (arm32) and I can trigger an issue
 by X forwarding the v4l2 test application qv4l2 over ssh and moving the
 cursor around in the GUI (best test case description award...). I'm
 sorry about the really bad way I trigger this but I can't do it in any
 other way; I'm happy to try other methods if you have some ideas. The
 symptom of the issue is a complete hang of the system for more than 30
 seconds, and then this information is printed in the console:
>>>
>>> Heiner (added to cc) also reported a similar problem with his platform: a
>>> dual-core Celeron.
>>>
>>> Do you confirm that your platform is a dual Cortex-A15? At least that is
>>> what I have seen on the web.
>>> This would confirm that a dual-core system is a key point.
>>
>> I can confirm that my platform is a dual core.
>>
>>>
>>> The ssh connection is also common with Heiner's setup
>>
>> Interesting, I found Heiner's mail and I can confirm that I too
>> experience ssh session lockups. I ssh into the system, and by repeatedly
>> hitting the return key I can lock up the board; while it is locked up, starting
>> another ssh session unblocks the first. If I don't start another ssh
>> session but keep hitting the return key sporadically in the first one, I can
>> get the trace I reported in my first mail to be printed on the serial
>> console.
>>
>> When locked up, the symptoms are that both the single ssh session and the
>> serial console are dead. But attempting another ssh connection
>> immediately unblocks both the ssh session and the serial console. And if I
>> allow enough time before starting the second ssh connection, I can trigger
>> a trace to be printed on the serial console; it's similar to but different
>> from the first one I reported.
>>
>> [  207.548610]   1-...!: (0 ticks this GP) idle=79a/1/1073741824 
>> softirq=2146/2146 fqs=0 
>> [  207.556442]   (detected by 0, t=12645 jiffies, g=333, c=332, q=20)
>> [  207.562546] rcu_sched kthread starved for 12645 jiffies! g333 c332 f0x2 
>> RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=0
>> [  207.572548] RCU grace-period kthread stack dump:
>>
>> [  207.577166] rcu_sched   R  running task0 9  2 
>> 0x
>> [  207.584389] Backtrace: 
>> [  207.586849] [] (__schedule) from [] 
>> (schedule+0x94/0xb8)
>> [  207.593901]  r10:e77813c0 r9:e77813c0 r8: r7:e709bed4 r6:aa80 
>> r5:
>> [  207.601732]  r4:e000
>> [  207.604269] [] (schedule) from [] 
>> (schedule_timeout+0x380/0x3dc)
>> [  207.612013]  r5: r4:
>> [  207.615596] [] (schedule_timeout) from [] 
>> (rcu_gp_kthread+0x668/0xe2c)
>> [  207.623863]  r10:c0b79018 r9:014d r8:014c r7:0001 r6: 
>> r5:c0b10ad0
>> [  207.631693]  r4:c0b10980
>> [  207.634230] [] (rcu_gp_kthread) from [] 
>> (kthread+0x148/0x160)
>> [  207.641712]  r7:c0b10980
>> [  207.644249] [] (kthread) from [] 
>> (ret_from_fork+0x14/0x2c)
>> [  207.651472] Exception stack(0xe709bfb0 to 0xe709bff8)
>> [  207.656527] bfa0:   
>>  
>> [  207.664709] bfc0:       
>>  
>> [  207.672890] bfe0:     0013 
>> [  207.679508]  r10: r9: r8: r7: r6: 
>> r5:c013dc90
>> [  207.687340]  r4:e7026f4
>>
>> Continuing the anecdotal testing, I can't seem to be able to trigger the
>> lockup if I have ever had two ssh sessions open to the system. And
>> about half the time I can't trigger it at all, but after a reset of the
>> system it triggers by just opening an ssh session and hitting the return
>> key 2-5 times. But please take this part
>> with a grain of salt as it's done by the monkey testing method :-)
>>
>> All tests above have been run based on c18bb396d3d261eb ("Merge 
>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").
>>
>>>

> 
> [snip]
> 

 I'm a bit lost on how to progress with this issue and would appreciate
 any help you can provide to help me figure this out.
>>>
>>> Can you send me your config?
>>>
>>> I'm going to prepare a debug patch to spy on what's happening when entering
>>> idle.
> 
> I'd like to narrow the problem a bit more with the 2 patches above. Can you
> try
> them separately on top of c18bb396d3d261eb ("Merge 
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net"))
> and check if one of them fixes the problem?
> 
> (They should apply 

Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-12 Thread Niklas Söderlund
Hi Vincent,

Thanks for your feedback.

On 2018-04-12 12:33:27 +0200, Vincent Guittot wrote:
> Hi Niklas,
> 
> On 12 April 2018 at 11:18, Niklas Söderlund
>  wrote:
> > Hi Vincent,
> >
> > I have observed issues running on linus/master from a few days back [1].
> > I'm running on a Renesas Koelsch board (arm32) and I can trigger an issue
> > by X forwarding the v4l2 test application qv4l2 over ssh and moving the
> > cursor around in the GUI (best test case description award...). I'm
> > sorry about the really bad way I trigger this but I can't do it in any
> > other way; I'm happy to try other methods if you have some ideas. The
> > symptom of the issue is a complete hang of the system for more than 30
> > seconds, and then this information is printed in the console:
> 
> Heiner (added to cc) also reported a similar problem with his platform: a
> dual-core Celeron.
> 
> Do you confirm that your platform is a dual Cortex-A15? At least that is
> what I have seen on the web.
> This would confirm that a dual-core system is a key point.

I can confirm that my platform is a dual core.

> 
> The ssh connection is also common with Heiner's setup

Interesting, I found Heiner's mail and I can confirm that I too
experience ssh session lockups. I ssh into the system, and by repeatedly
hitting the return key I can lock up the board; while it is locked up, starting
another ssh session unblocks the first. If I don't start another ssh
session but keep hitting the return key sporadically in the first one, I can
get the trace I reported in my first mail to be printed on the serial
console.

When locked up, the symptoms are that both the single ssh session and the
serial console are dead. But attempting another ssh connection
immediately unblocks both the ssh session and the serial console. And if I
allow enough time before starting the second ssh connection, I can trigger
a trace to be printed on the serial console; it's similar to but different
from the first one I reported.

[  207.548610]  1-...!: (0 ticks this GP) idle=79a/1/1073741824 
softirq=2146/2146 fqs=0 
[  207.556442]  (detected by 0, t=12645 jiffies, g=333, c=332, q=20)
[  207.562546] rcu_sched kthread starved for 12645 jiffies! g333 c332 f0x2 
RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=0
[  207.572548] RCU grace-period kthread stack dump:

[  207.577166] rcu_sched   R  running task0 9  2 0x
[  207.584389] Backtrace: 
[  207.586849] [] (__schedule) from [] (schedule+0x94/0xb8)
[  207.593901]  r10:e77813c0 r9:e77813c0 r8: r7:e709bed4 r6:aa80 
r5:
[  207.601732]  r4:e000
[  207.604269] [] (schedule) from [] 
(schedule_timeout+0x380/0x3dc)
[  207.612013]  r5: r4:
[  207.615596] [] (schedule_timeout) from [] 
(rcu_gp_kthread+0x668/0xe2c)
[  207.623863]  r10:c0b79018 r9:014d r8:014c r7:0001 r6: 
r5:c0b10ad0
[  207.631693]  r4:c0b10980
[  207.634230] [] (rcu_gp_kthread) from [] 
(kthread+0x148/0x160)
[  207.641712]  r7:c0b10980
[  207.644249] [] (kthread) from [] 
(ret_from_fork+0x14/0x2c)
[  207.651472] Exception stack(0xe709bfb0 to 0xe709bff8)
[  207.656527] bfa0:   
 
[  207.664709] bfc0:       
 
[  207.672890] bfe0:     0013 
[  207.679508]  r10: r9: r8: r7: r6: 
r5:c013dc90
[  207.687340]  r4:e7026f4

Continuing the anecdotal testing, I can't seem to be able to trigger the
lockup if I have ever had two ssh sessions open to the system. And
about half the time I can't trigger it at all, but after a reset of the
system it triggers by just opening an ssh session and hitting the return
key 2-5 times. But please take this part
with a grain of salt as it's done by the monkey testing method :-)

All tests above have been run based on c18bb396d3d261eb ("Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").

> 
> >
> > [  142.849390] INFO: rcu_sched detected stalls on CPUs/tasks:
> > [  142.854972]  1-...!: (1 GPs behind) idle=7a4/0/0 softirq=3214/3217 fqs=0
> > [  142.861976]  (detected by 0, t=8232 jiffies, g=930, c=929, q=11)
> > [  142.868042] Sending NMI from CPU 0 to CPUs 1:
> > [  142.872443] NMI backtrace for cpu 1
> > [  142.872452] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 
> > 4.16.0-05506-g28aba11c1393691a #14
> > [  142.872455] Hardware name: Generic R8A7791 (Flattened Device Tree)
> > [  142.872473] PC is at arch_cpu_idle+0x28/0x44
> > [  142.872484] LR is at trace_hardirqs_on_caller+0x1a4/0x1d4
> > [  142.872488] pc : []lr : []psr: 20070013
> > [  142.872491] sp : eb0b9f90  ip : eb0b9f60  fp : eb0b9f9c
> > [  142.872495] r10:   r9 : 413fc0f2  r8 : 4000406a
> > [  142.872498] r7 : c0c08478  r6 : c0c0842c  r5 : e000  r4 : 0002
> > [  142.872502] r3 : eb0b6ec0  r2 :   r1 : 0004  r0 : 0001
> > [  

Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

2018-04-12 Thread Vincent Guittot
Hi Niklas,

On 12 April 2018 at 11:18, Niklas Söderlund
 wrote:
> Hi Vincent,
>
> I have observed issues running on linus/master from a few days back [1].
> I'm running on a Renesas Koelsch board (arm32) and I can trigger an issue
> by X forwarding the v4l2 test application qv4l2 over ssh and moving the
> cursor around in the GUI (best test case description award...). I'm
> sorry about the really bad way I trigger this but I can't do it in any
> other way; I'm happy to try other methods if you have some ideas. The
> symptom of the issue is a complete hang of the system for more than 30
> seconds, and then this information is printed in the console:

Heiner (added to cc) also reported a similar problem with his platform: a
dual-core Celeron.

Do you confirm that your platform is a dual Cortex-A15? At least that is
what I have seen on the web.
This would confirm that a dual-core system is a key point.

The ssh connection is also common with Heiner's setup

>
> [  142.849390] INFO: rcu_sched detected stalls on CPUs/tasks:
> [  142.854972]  1-...!: (1 GPs behind) idle=7a4/0/0 softirq=3214/3217 fqs=0
> [  142.861976]  (detected by 0, t=8232 jiffies, g=930, c=929, q=11)
> [  142.868042] Sending NMI from CPU 0 to CPUs 1:
> [  142.872443] NMI backtrace for cpu 1
> [  142.872452] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 
> 4.16.0-05506-g28aba11c1393691a #14
> [  142.872455] Hardware name: Generic R8A7791 (Flattened Device Tree)
> [  142.872473] PC is at arch_cpu_idle+0x28/0x44
> [  142.872484] LR is at trace_hardirqs_on_caller+0x1a4/0x1d4
> [  142.872488] pc : []lr : []psr: 20070013
> [  142.872491] sp : eb0b9f90  ip : eb0b9f60  fp : eb0b9f9c
> [  142.872495] r10:   r9 : 413fc0f2  r8 : 4000406a
> [  142.872498] r7 : c0c08478  r6 : c0c0842c  r5 : e000  r4 : 0002
> [  142.872502] r3 : eb0b6ec0  r2 :   r1 : 0004  r0 : 0001
> [  142.872507] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment 
> none
> [  142.872511] Control: 10c5387d  Table: 6a61406a  DAC: 0051
> [  142.872516] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 
> 4.16.0-05506-g28aba11c1393691a #14
> [  142.872519] Hardware name: Generic R8A7791 (Flattened Device Tree)
> [  142.872522] Backtrace:
> [  142.872534] [] (dump_backtrace) from [] 
> (show_stack+0x18/0x1c)
> [  142.872540]  r7:c0c81388 r6: r5:60070193 r4:c0c81388
> [  142.872550] [] (show_stack) from [] 
> (dump_stack+0xa4/0xd8)
> [  142.872557] [] (dump_stack) from [] 
> (show_regs+0x14/0x18)
> [  142.872563]  r9:0001 r8: r7:c0c4f678 r6:eb0b9f40 r5:0001 
> r4:c13e1130
> [  142.872571] [] (show_regs) from [] 
> (nmi_cpu_backtrace+0xfc/0x118)
> [  142.872578] [] (nmi_cpu_backtrace) from [] 
> (handle_IPI+0x2a8/0x320)
> [  142.872583]  r7:c0c4f678 r6:eb0b9f40 r5:0007 r4:c0b75b68
> [  142.872594] [] (handle_IPI) from [] 
> (gic_handle_irq+0x8c/0x98)
> [  142.872599]  r10: r9:eb0b8000 r8:f0803000 r7:c0c4f678 r6:eb0b9f40 
> r5:c0c08a90
> [  142.872602]  r4:f0802000
> [  142.872608] [] (gic_handle_irq) from [] 
> (__irq_svc+0x70/0x98)
> [  142.872612] Exception stack(0xeb0b9f40 to 0xeb0b9f88)
> [  142.872618] 9f40: 0001 0004  eb0b6ec0 0002 e000 
> c0c0842c c0c08478
> [  142.872624] 9f60: 4000406a 413fc0f2  eb0b9f9c eb0b9f60 eb0b9f90 
> c01747a8 c01088a4
> [  142.872627] 9f80: 20070013 
> [  142.872632]  r9:eb0b8000 r8:4000406a r7:eb0b9f74 r6: r5:20070013 
> r4:c01088a4
> [  142.872642] [] (arch_cpu_idle) from [] 
> (default_idle_call+0x34/0x38)
> [  142.872650] [] (default_idle_call) from [] 
> (do_idle+0xe0/0x134)
> [  142.872656] [] (do_idle) from [] 
> (cpu_startup_entry+0x20/0x24)
> [  142.872660]  r7:c0c8e9d0 r6:10c0387d r5:0051 r4:0085
> [  142.872667] [] (cpu_startup_entry) from [] 
> (secondary_start_kernel+0x114/0x134)
> [  142.872673] [] (secondary_start_kernel) from [<401026ec>] 
> (0x401026ec)
> [  142.872676]  r5:0051 r4:6b0a406a
> [  142.873456] rcu_sched kthread starved for 8235 jiffies! g930 c929 f0x0 
> RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0
> [  143.135040] RCU grace-period kthread stack dump:
> [  143.139695] rcu_sched   I0 9  2 0x
> [  143.145234] Backtrace:
> [  143.147719] [] (__schedule) from [] 
> (schedule+0x94/0xb8)
> [  143.154823]  r10:c0b714c0 r9:c0c85f8a r8: r7:eb0abec4 r6:a274 
> r5:
> [  143.162712]  r4:e000
> [  143.165273] [] (schedule) from [] 
> (schedule_timeout+0x440/0x4b0)
> [  143.173076]  r5: r4:eb79c4c0
> [  143.176692] [] (schedule_timeout) from [] 
> (rcu_gp_kthread+0x958/0x150c)
> [  143.185108]  r10:c0c87274 r9: r8:c0c165b8 r7:0001 r6: 
> r5:c0c16590
> [  143.192997]  r4:c0c16300
> [  143.195560] [] (rcu_gp_kthread) from [] 
> (kthread+0x148/0x160)
> [  143.203099]  r7:c0c16300
> [  143.205660] [] (kthread) from [] 
> (ret_from_fork+0x14/0x20)
> [  143.212938] Exception stack(0xeb0abfb0 to 0xeb0abff8)
> [  143.218030] bfa0: