Re: RCU stall and the system boot hang with nfsroot

2016-01-05 Thread Paul E. McKenney
On Tue, Jan 05, 2016 at 03:57:54PM +0800, Aaron Ma wrote:
> On Tue, Jan 5, 2016 at 5:18 AM, Paul E. McKenney
>  wrote:
> > On Mon, Jan 04, 2016 at 06:01:37PM +0800, Aaron Ma wrote:
> >> On Fri, Jan 1, 2016 at 3:49 AM, Paul E. McKenney
> >>  wrote:
> >> > On Wed, Dec 30, 2015 at 09:41:45AM -0800, Paul E. McKenney wrote:
> >> >> On Wed, Dec 30, 2015 at 03:03:33PM +0800, Aaron Ma wrote:
> >> >> > On Wed, Dec 30, 2015 at 7:42 AM, Paul E. McKenney
> >> >> >  wrote:
> >> >
> >> > [ . . . ]
> >> >
> >> >> > cfg80211: Calling CRDA to update world regulatory domain
> >> >> > cfg80211: Calling CRDA to update world regulatory domain
> >> >> > cfg80211: Exceeded CRDA call max attempts. Not calling CRDA
> >> >> > INFO: rcu_preempt detected stalls on CPUs/tasks:
> >> >> > 71: (0 ticks this GP) idle=1ac/0/0 softirq=0/0 fqs=0
> >> >> > (detected by 62, t=26002 jiffies, g=3735, c=3734, q=366014)
> >> >> > Task dump for CPU 71:
> >> >> > swapper/71  R  running task0 0  1 0x0020
> >> >> >  81492587 8804633cbe58 814f21d7 0004
> >> >> >  0004 e8fffb405310 820dc5c0 8804633cbea8
> >> >> >  8181db85   0046
> >> >> > Call Trace:
> >> >> >  [] ? debug_smp_processor_id+0x17/0x20
> >> >> >  [] ? intel_idle+0x137/0x140
> >> >> >  [] ? cpuidle_enter_state+0x65/0x3e0
> >> >> >  [] ? cpuidle_enter+0x17/0x20
> >> >> >  [] ? cpu_startup_entry+0x33d/0x630
> >> >> >  [] ? start_secondary+0x12e/0x140
> >> >> > rcu_preempt kthread starved for 26002 jiffies!
> >> >> > rcu_check_gp_kthread_starvation --->show task:
> >> >> > rcu_preempt S 880456413c68 0 8  2 0x
> >> >> >  880456413c68 8804564025d0 d7a0 880456b18000
> >> >> >  8804564025d0 880456413c38 81492587 880456413c58
> >> >> >  880456414000 8804564025d0 880456413cb8 880869dce500
> >> >> > Call Trace:
> >> >> >  [] ? debug_smp_processor_id+0x17/0x20
> >> >> >  [] schedule+0x3f/0xd0
> >> >> >  [] schedule_timeout+0x189/0x3f0
> >> >> >  [] ? swait_prepare+0x24/0x90
> >> >> >  [] ? timer_cpu_notify+0x190/0x190
> >> >> >  [] ? swait_prepare+0x5b/0x90
> >> >> >  [] rcu_gp_kthread+0x8a8/0x2190
> >> >> >  [] ? trace_hardirqs_on+0xd/0x10
> >> >> >  [] ? __schedule+0x4af/0x1180
> >> >> >  [] ? call_rcu_sched+0x20/0x20
> >> >> >  [] kthread+0xe4/0x100
> >> >> >  [] ? trace_hardirqs_on+0xd/0x10
> >> >> >  [] ? kthread_create_on_node+0x240/0x240
> >> >> >  [] ret_from_fork+0x42/0x70
> >> >> >  [] ? kthread_create_on_node+0x240/0x240
> >> >> > rcu_check_gp_kthread_starvation --->end
> >> >> >
> >> >> > It seems wait in rcu_gp_kthread. it should be no task blocked right?
> >> >> > If so, why the swait_event_interruptible_timeout is not awaken? the
> >> >> > timeout is CONFIG_HZ=1000.
> >> >>
> >> >> Given that this happens at boot, perhaps ftrace is a good next step.
> >> >> The thought would be to enable ftrace via the kernel boot parameters
> >> >> for the timers.
> >> >>
> >> >> And how often does this problem occur?
> >> >
> >> > And does the following diagnostic patch help?  Its expected behavior
> >> > would be to turn a hard hang into something that recovered in a few
> >> > minutes, while giving a few stall-warning splats.
> >> >
> >> > Thanx, Paul
> >> >
> >> > 
> >> >
> >> > commit 7798a5efb2acabfa3ca788dd9b5b118eb1bff443
> >> > Author: Paul E. McKenney 
> >> > Date:   Thu Dec 31 08:48:36 2015 -0800
> >> >
> >> > rcu: Awaken grace-period kthread when stalled
> >> >
> >> > Recent kernels can fail to awaken the grace-period kthread for
> >> > quiescent-state forcing.  This commit is a crude hack that does
> >> > a wakeup any time a stall is detected.
> >> >
> >> > Signed-off-by: Paul E. McKenney 
> >> >
> >> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> >> > index 4b3de6718f7c..51da7ef3561f 100644
> >> > --- a/kernel/rcu/tree.c
> >> > +++ b/kernel/rcu/tree.c
> >> > @@ -1225,8 +1225,10 @@ static void 
> >> > rcu_check_gp_kthread_starvation(struct rcu_state *rsp)
> >> >rsp->gp_flags,
> >> >gp_state_getname(rsp->gp_state), rsp->gp_state,
> >> >rsp->gp_kthread ? rsp->gp_kthread->state : ~0);
> >> > -   if (rsp->gp_kthread)
> >> > +   if (rsp->gp_kthread) {
> >> > sched_show_task(rsp->gp_kthread);
> >> > +   wake_up_process(rsp->gp_kthread);
> >> > +   }
> >> > }
> >> >  }
> >>
> >> Sorry for late response because of holiday.
> >>
> >> During my knowledge, now this issue is splitted to 2 issues.
> >>
> >> First, there is a mistake in the kernel I used.
> >> I used kernel is changed by someone else, and it ported a patch:
> >> Author: Thomas Gleixner 
> >> Date:  

Re: RCU stall and the system boot hang with nfsroot

2016-01-04 Thread Aaron Ma
On Tue, Jan 5, 2016 at 5:18 AM, Paul E. McKenney
 wrote:
> On Mon, Jan 04, 2016 at 06:01:37PM +0800, Aaron Ma wrote:
>> On Fri, Jan 1, 2016 at 3:49 AM, Paul E. McKenney
>>  wrote:
>> > On Wed, Dec 30, 2015 at 09:41:45AM -0800, Paul E. McKenney wrote:
>> >> On Wed, Dec 30, 2015 at 03:03:33PM +0800, Aaron Ma wrote:
>> >> > On Wed, Dec 30, 2015 at 7:42 AM, Paul E. McKenney
>> >> >  wrote:
>> >
>> > [ . . . ]
>> >
>> >> > cfg80211: Calling CRDA to update world regulatory domain
>> >> > cfg80211: Calling CRDA to update world regulatory domain
>> >> > cfg80211: Exceeded CRDA call max attempts. Not calling CRDA
>> >> > INFO: rcu_preempt detected stalls on CPUs/tasks:
>> >> > 71: (0 ticks this GP) idle=1ac/0/0 softirq=0/0 fqs=0
>> >> > (detected by 62, t=26002 jiffies, g=3735, c=3734, q=366014)
>> >> > Task dump for CPU 71:
>> >> > swapper/71  R  running task0 0  1 0x0020
>> >> >  81492587 8804633cbe58 814f21d7 0004
>> >> >  0004 e8fffb405310 820dc5c0 8804633cbea8
>> >> >  8181db85   0046
>> >> > Call Trace:
>> >> >  [] ? debug_smp_processor_id+0x17/0x20
>> >> >  [] ? intel_idle+0x137/0x140
>> >> >  [] ? cpuidle_enter_state+0x65/0x3e0
>> >> >  [] ? cpuidle_enter+0x17/0x20
>> >> >  [] ? cpu_startup_entry+0x33d/0x630
>> >> >  [] ? start_secondary+0x12e/0x140
>> >> > rcu_preempt kthread starved for 26002 jiffies!
>> >> > rcu_check_gp_kthread_starvation --->show task:
>> >> > rcu_preempt S 880456413c68 0 8  2 0x
>> >> >  880456413c68 8804564025d0 d7a0 880456b18000
>> >> >  8804564025d0 880456413c38 81492587 880456413c58
>> >> >  880456414000 8804564025d0 880456413cb8 880869dce500
>> >> > Call Trace:
>> >> >  [] ? debug_smp_processor_id+0x17/0x20
>> >> >  [] schedule+0x3f/0xd0
>> >> >  [] schedule_timeout+0x189/0x3f0
>> >> >  [] ? swait_prepare+0x24/0x90
>> >> >  [] ? timer_cpu_notify+0x190/0x190
>> >> >  [] ? swait_prepare+0x5b/0x90
>> >> >  [] rcu_gp_kthread+0x8a8/0x2190
>> >> >  [] ? trace_hardirqs_on+0xd/0x10
>> >> >  [] ? __schedule+0x4af/0x1180
>> >> >  [] ? call_rcu_sched+0x20/0x20
>> >> >  [] kthread+0xe4/0x100
>> >> >  [] ? trace_hardirqs_on+0xd/0x10
>> >> >  [] ? kthread_create_on_node+0x240/0x240
>> >> >  [] ret_from_fork+0x42/0x70
>> >> >  [] ? kthread_create_on_node+0x240/0x240
>> >> > rcu_check_gp_kthread_starvation --->end
>> >> >
>> >> > It seems wait in rcu_gp_kthread. it should be no task blocked right?
>> >> > If so, why the swait_event_interruptible_timeout is not awaken? the
>> >> > timeout is CONFIG_HZ=1000.
>> >>
>> >> Given that this happens at boot, perhaps ftrace is a good next step.
>> >> The thought would be to enable ftrace via the kernel boot parameters
>> >> for the timers.
>> >>
>> >> And how often does this problem occur?
>> >
>> > And does the following diagnostic patch help?  Its expected behavior
>> > would be to turn a hard hang into something that recovered in a few
>> > minutes, while giving a few stall-warning splats.
>> >
>> > Thanx, Paul
>> >
>> > 
>> >
>> > commit 7798a5efb2acabfa3ca788dd9b5b118eb1bff443
>> > Author: Paul E. McKenney 
>> > Date:   Thu Dec 31 08:48:36 2015 -0800
>> >
>> > rcu: Awaken grace-period kthread when stalled
>> >
>> > Recent kernels can fail to awaken the grace-period kthread for
>> > quiescent-state forcing.  This commit is a crude hack that does
>> > a wakeup any time a stall is detected.
>> >
>> > Signed-off-by: Paul E. McKenney 
>> >
>> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> > index 4b3de6718f7c..51da7ef3561f 100644
>> > --- a/kernel/rcu/tree.c
>> > +++ b/kernel/rcu/tree.c
>> > @@ -1225,8 +1225,10 @@ static void rcu_check_gp_kthread_starvation(struct 
>> > rcu_state *rsp)
>> >rsp->gp_flags,
>> >gp_state_getname(rsp->gp_state), rsp->gp_state,
>> >rsp->gp_kthread ? rsp->gp_kthread->state : ~0);
>> > -   if (rsp->gp_kthread)
>> > +   if (rsp->gp_kthread) {
>> > sched_show_task(rsp->gp_kthread);
>> > +   wake_up_process(rsp->gp_kthread);
>> > +   }
>> > }
>> >  }
>>
>> Sorry for late response because of holiday.
>>
>> During my knowledge, now this issue is splitted to 2 issues.
>>
>> First, there is a mistake in the kernel I used.
>> I used kernel is changed by someone else, and it ported a patch:
>> Author: Thomas Gleixner 
>> Date:   Tue Apr 14 21:08:54 2015 +
>>
>> tick: sched: Force tick interrupt and get rid of softirq magic
>>
>> commit 0ff53d09642204c648424def0caa9117e7a3caaf upstream
>>
>>  static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
>>   

Re: RCU stall and the system boot hang with nfsroot

2016-01-04 Thread Paul E. McKenney
On Mon, Jan 04, 2016 at 06:01:37PM +0800, Aaron Ma wrote:
> On Fri, Jan 1, 2016 at 3:49 AM, Paul E. McKenney
>  wrote:
> > On Wed, Dec 30, 2015 at 09:41:45AM -0800, Paul E. McKenney wrote:
> >> On Wed, Dec 30, 2015 at 03:03:33PM +0800, Aaron Ma wrote:
> >> > On Wed, Dec 30, 2015 at 7:42 AM, Paul E. McKenney
> >> >  wrote:
> >
> > [ . . . ]
> >
> >> > cfg80211: Calling CRDA to update world regulatory domain
> >> > cfg80211: Calling CRDA to update world regulatory domain
> >> > cfg80211: Exceeded CRDA call max attempts. Not calling CRDA
> >> > INFO: rcu_preempt detected stalls on CPUs/tasks:
> >> > 71: (0 ticks this GP) idle=1ac/0/0 softirq=0/0 fqs=0
> >> > (detected by 62, t=26002 jiffies, g=3735, c=3734, q=366014)
> >> > Task dump for CPU 71:
> >> > swapper/71  R  running task0 0  1 0x0020
> >> >  81492587 8804633cbe58 814f21d7 0004
> >> >  0004 e8fffb405310 820dc5c0 8804633cbea8
> >> >  8181db85   0046
> >> > Call Trace:
> >> >  [] ? debug_smp_processor_id+0x17/0x20
> >> >  [] ? intel_idle+0x137/0x140
> >> >  [] ? cpuidle_enter_state+0x65/0x3e0
> >> >  [] ? cpuidle_enter+0x17/0x20
> >> >  [] ? cpu_startup_entry+0x33d/0x630
> >> >  [] ? start_secondary+0x12e/0x140
> >> > rcu_preempt kthread starved for 26002 jiffies!
> >> > rcu_check_gp_kthread_starvation --->show task:
> >> > rcu_preempt S 880456413c68 0 8  2 0x
> >> >  880456413c68 8804564025d0 d7a0 880456b18000
> >> >  8804564025d0 880456413c38 81492587 880456413c58
> >> >  880456414000 8804564025d0 880456413cb8 880869dce500
> >> > Call Trace:
> >> >  [] ? debug_smp_processor_id+0x17/0x20
> >> >  [] schedule+0x3f/0xd0
> >> >  [] schedule_timeout+0x189/0x3f0
> >> >  [] ? swait_prepare+0x24/0x90
> >> >  [] ? timer_cpu_notify+0x190/0x190
> >> >  [] ? swait_prepare+0x5b/0x90
> >> >  [] rcu_gp_kthread+0x8a8/0x2190
> >> >  [] ? trace_hardirqs_on+0xd/0x10
> >> >  [] ? __schedule+0x4af/0x1180
> >> >  [] ? call_rcu_sched+0x20/0x20
> >> >  [] kthread+0xe4/0x100
> >> >  [] ? trace_hardirqs_on+0xd/0x10
> >> >  [] ? kthread_create_on_node+0x240/0x240
> >> >  [] ret_from_fork+0x42/0x70
> >> >  [] ? kthread_create_on_node+0x240/0x240
> >> > rcu_check_gp_kthread_starvation --->end
> >> >
> >> > It seems wait in rcu_gp_kthread. it should be no task blocked right?
> >> > If so, why the swait_event_interruptible_timeout is not awaken? the
> >> > timeout is CONFIG_HZ=1000.
> >>
> >> Given that this happens at boot, perhaps ftrace is a good next step.
> >> The thought would be to enable ftrace via the kernel boot parameters
> >> for the timers.
> >>
> >> And how often does this problem occur?
> >
> > And does the following diagnostic patch help?  Its expected behavior
> > would be to turn a hard hang into something that recovered in a few
> > minutes, while giving a few stall-warning splats.
> >
> > Thanx, Paul
> >
> > 
> >
> > commit 7798a5efb2acabfa3ca788dd9b5b118eb1bff443
> > Author: Paul E. McKenney 
> > Date:   Thu Dec 31 08:48:36 2015 -0800
> >
> > rcu: Awaken grace-period kthread when stalled
> >
> > Recent kernels can fail to awaken the grace-period kthread for
> > quiescent-state forcing.  This commit is a crude hack that does
> > a wakeup any time a stall is detected.
> >
> > Signed-off-by: Paul E. McKenney 
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 4b3de6718f7c..51da7ef3561f 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -1225,8 +1225,10 @@ static void rcu_check_gp_kthread_starvation(struct 
> > rcu_state *rsp)
> >rsp->gp_flags,
> >gp_state_getname(rsp->gp_state), rsp->gp_state,
> >rsp->gp_kthread ? rsp->gp_kthread->state : ~0);
> > -   if (rsp->gp_kthread)
> > +   if (rsp->gp_kthread) {
> > sched_show_task(rsp->gp_kthread);
> > +   wake_up_process(rsp->gp_kthread);
> > +   }
> > }
> >  }
> 
> Sorry for late response because of holiday.
> 
> During my knowledge, now this issue is splitted to 2 issues.
> 
> First, there is a mistake in the kernel I used.
> I used kernel is changed by someone else, and it ported a patch:
> Author: Thomas Gleixner 
> Date:   Tue Apr 14 21:08:54 2015 +
> 
> tick: sched: Force tick interrupt and get rid of softirq magic
> 
> commit 0ff53d09642204c648424def0caa9117e7a3caaf upstream
> 
>  static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
>  ktime_t now, int cpu)
>  {
> @@ -691,22 +705,18 @@ static ktime_t tick_nohz_stop_sched_tick(struct
> tick_sched *ts,
> if (ts->nohz

Re: RCU stall and the system boot hang with nfsroot

2016-01-04 Thread Aaron Ma
On Fri, Jan 1, 2016 at 3:49 AM, Paul E. McKenney
 wrote:
> On Wed, Dec 30, 2015 at 09:41:45AM -0800, Paul E. McKenney wrote:
>> On Wed, Dec 30, 2015 at 03:03:33PM +0800, Aaron Ma wrote:
>> > On Wed, Dec 30, 2015 at 7:42 AM, Paul E. McKenney
>> >  wrote:
>
> [ . . . ]
>
>> > cfg80211: Calling CRDA to update world regulatory domain
>> > cfg80211: Calling CRDA to update world regulatory domain
>> > cfg80211: Exceeded CRDA call max attempts. Not calling CRDA
>> > INFO: rcu_preempt detected stalls on CPUs/tasks:
>> > 71: (0 ticks this GP) idle=1ac/0/0 softirq=0/0 fqs=0
>> > (detected by 62, t=26002 jiffies, g=3735, c=3734, q=366014)
>> > Task dump for CPU 71:
>> > swapper/71  R  running task0 0  1 0x0020
>> >  81492587 8804633cbe58 814f21d7 0004
>> >  0004 e8fffb405310 820dc5c0 8804633cbea8
>> >  8181db85   0046
>> > Call Trace:
>> >  [] ? debug_smp_processor_id+0x17/0x20
>> >  [] ? intel_idle+0x137/0x140
>> >  [] ? cpuidle_enter_state+0x65/0x3e0
>> >  [] ? cpuidle_enter+0x17/0x20
>> >  [] ? cpu_startup_entry+0x33d/0x630
>> >  [] ? start_secondary+0x12e/0x140
>> > rcu_preempt kthread starved for 26002 jiffies!
>> > rcu_check_gp_kthread_starvation --->show task:
>> > rcu_preempt S 880456413c68 0 8  2 0x
>> >  880456413c68 8804564025d0 d7a0 880456b18000
>> >  8804564025d0 880456413c38 81492587 880456413c58
>> >  880456414000 8804564025d0 880456413cb8 880869dce500
>> > Call Trace:
>> >  [] ? debug_smp_processor_id+0x17/0x20
>> >  [] schedule+0x3f/0xd0
>> >  [] schedule_timeout+0x189/0x3f0
>> >  [] ? swait_prepare+0x24/0x90
>> >  [] ? timer_cpu_notify+0x190/0x190
>> >  [] ? swait_prepare+0x5b/0x90
>> >  [] rcu_gp_kthread+0x8a8/0x2190
>> >  [] ? trace_hardirqs_on+0xd/0x10
>> >  [] ? __schedule+0x4af/0x1180
>> >  [] ? call_rcu_sched+0x20/0x20
>> >  [] kthread+0xe4/0x100
>> >  [] ? trace_hardirqs_on+0xd/0x10
>> >  [] ? kthread_create_on_node+0x240/0x240
>> >  [] ret_from_fork+0x42/0x70
>> >  [] ? kthread_create_on_node+0x240/0x240
>> > rcu_check_gp_kthread_starvation --->end
>> >
>> > It seems wait in rcu_gp_kthread. it should be no task blocked right?
>> > If so, why the swait_event_interruptible_timeout is not awaken? the
>> > timeout is CONFIG_HZ=1000.
>>
>> Given that this happens at boot, perhaps ftrace is a good next step.
>> The thought would be to enable ftrace via the kernel boot parameters
>> for the timers.
>>
>> And how often does this problem occur?
>
> And does the following diagnostic patch help?  Its expected behavior
> would be to turn a hard hang into something that recovered in a few
> minutes, while giving a few stall-warning splats.
>
> Thanx, Paul
>
> 
>
> commit 7798a5efb2acabfa3ca788dd9b5b118eb1bff443
> Author: Paul E. McKenney 
> Date:   Thu Dec 31 08:48:36 2015 -0800
>
> rcu: Awaken grace-period kthread when stalled
>
> Recent kernels can fail to awaken the grace-period kthread for
> quiescent-state forcing.  This commit is a crude hack that does
> a wakeup any time a stall is detected.
>
> Signed-off-by: Paul E. McKenney 
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 4b3de6718f7c..51da7ef3561f 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1225,8 +1225,10 @@ static void rcu_check_gp_kthread_starvation(struct 
> rcu_state *rsp)
>rsp->gp_flags,
>gp_state_getname(rsp->gp_state), rsp->gp_state,
>rsp->gp_kthread ? rsp->gp_kthread->state : ~0);
> -   if (rsp->gp_kthread)
> +   if (rsp->gp_kthread) {
> sched_show_task(rsp->gp_kthread);
> +   wake_up_process(rsp->gp_kthread);
> +   }
> }
>  }

Sorry for late response because of holiday.

During my knowledge, now this issue is splitted to 2 issues.

First, there is a mistake in the kernel I used.
I used kernel is changed by someone else, and it ported a patch:
Author: Thomas Gleixner 
Date:   Tue Apr 14 21:08:54 2015 +

tick: sched: Force tick interrupt and get rid of softirq magic

commit 0ff53d09642204c648424def0caa9117e7a3caaf upstream

 static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
 ktime_t now, int cpu)
 {
@@ -691,22 +705,18 @@ static ktime_t tick_nohz_stop_sched_tick(struct
tick_sched *ts,
if (ts->nohz_mode == NOHZ_MODE_HIGHRES)
hrtimer_cancel(&ts->sched_timer);
goto out;
-   }
+}

-   if (ts->nohz_mode == NOHZ_MODE_HIGHRES) {
-   hrtimer_start(&ts->sched_timer, expires,
-  

Re: RCU stall and the system boot hang with nfsroot

2015-12-31 Thread Paul E. McKenney
On Wed, Dec 30, 2015 at 09:41:45AM -0800, Paul E. McKenney wrote:
> On Wed, Dec 30, 2015 at 03:03:33PM +0800, Aaron Ma wrote:
> > On Wed, Dec 30, 2015 at 7:42 AM, Paul E. McKenney
> >  wrote:

[ . . . ]

> > cfg80211: Calling CRDA to update world regulatory domain
> > cfg80211: Calling CRDA to update world regulatory domain
> > cfg80211: Exceeded CRDA call max attempts. Not calling CRDA
> > INFO: rcu_preempt detected stalls on CPUs/tasks:
> > 71: (0 ticks this GP) idle=1ac/0/0 softirq=0/0 fqs=0
> > (detected by 62, t=26002 jiffies, g=3735, c=3734, q=366014)
> > Task dump for CPU 71:
> > swapper/71  R  running task0 0  1 0x0020
> >  81492587 8804633cbe58 814f21d7 0004
> >  0004 e8fffb405310 820dc5c0 8804633cbea8
> >  8181db85   0046
> > Call Trace:
> >  [] ? debug_smp_processor_id+0x17/0x20
> >  [] ? intel_idle+0x137/0x140
> >  [] ? cpuidle_enter_state+0x65/0x3e0
> >  [] ? cpuidle_enter+0x17/0x20
> >  [] ? cpu_startup_entry+0x33d/0x630
> >  [] ? start_secondary+0x12e/0x140
> > rcu_preempt kthread starved for 26002 jiffies!
> > rcu_check_gp_kthread_starvation --->show task:
> > rcu_preempt S 880456413c68 0 8  2 0x
> >  880456413c68 8804564025d0 d7a0 880456b18000
> >  8804564025d0 880456413c38 81492587 880456413c58
> >  880456414000 8804564025d0 880456413cb8 880869dce500
> > Call Trace:
> >  [] ? debug_smp_processor_id+0x17/0x20
> >  [] schedule+0x3f/0xd0
> >  [] schedule_timeout+0x189/0x3f0
> >  [] ? swait_prepare+0x24/0x90
> >  [] ? timer_cpu_notify+0x190/0x190
> >  [] ? swait_prepare+0x5b/0x90
> >  [] rcu_gp_kthread+0x8a8/0x2190
> >  [] ? trace_hardirqs_on+0xd/0x10
> >  [] ? __schedule+0x4af/0x1180
> >  [] ? call_rcu_sched+0x20/0x20
> >  [] kthread+0xe4/0x100
> >  [] ? trace_hardirqs_on+0xd/0x10
> >  [] ? kthread_create_on_node+0x240/0x240
> >  [] ret_from_fork+0x42/0x70
> >  [] ? kthread_create_on_node+0x240/0x240
> > rcu_check_gp_kthread_starvation --->end
> > 
> > It seems wait in rcu_gp_kthread. it should be no task blocked right?
> > If so, why the swait_event_interruptible_timeout is not awaken? the
> > timeout is CONFIG_HZ=1000.
> 
> Given that this happens at boot, perhaps ftrace is a good next step.
> The thought would be to enable ftrace via the kernel boot parameters
> for the timers.
> 
> And how often does this problem occur?

And does the following diagnostic patch help?  Its expected behavior
would be to turn a hard hang into something that recovered in a few
minutes, while giving a few stall-warning splats.

Thanx, Paul



commit 7798a5efb2acabfa3ca788dd9b5b118eb1bff443
Author: Paul E. McKenney 
Date:   Thu Dec 31 08:48:36 2015 -0800

rcu: Awaken grace-period kthread when stalled

Recent kernels can fail to awaken the grace-period kthread for
quiescent-state forcing.  This commit is a crude hack that does
a wakeup any time a stall is detected.

Signed-off-by: Paul E. McKenney 

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 4b3de6718f7c..51da7ef3561f 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1225,8 +1225,10 @@ static void rcu_check_gp_kthread_starvation(struct 
rcu_state *rsp)
   rsp->gp_flags,
   gp_state_getname(rsp->gp_state), rsp->gp_state,
   rsp->gp_kthread ? rsp->gp_kthread->state : ~0);
-   if (rsp->gp_kthread)
+   if (rsp->gp_kthread) {
sched_show_task(rsp->gp_kthread);
+   wake_up_process(rsp->gp_kthread);
+   }
}
 }
 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RCU stall and the system boot hang with nfsroot

2015-12-30 Thread Paul E. McKenney
On Wed, Dec 30, 2015 at 03:03:33PM +0800, Aaron Ma wrote:
> On Wed, Dec 30, 2015 at 7:42 AM, Paul E. McKenney
>  wrote:
> > On Tue, Dec 29, 2015 at 05:34:38PM +0800, Aaron Ma wrote:
> >> Add paul...@linux.vnet.ibm.com
> >>
> >> On Tue, Dec 29, 2015 at 5:32 PM, Aaron Ma  wrote:
> >> > Hi, Paul:
> >> > I found the linux-stable-4.1.15 with rt15 patches boot hang sometimes.
> >> > Hardware is Grantley-EP and WildcatPass.
> >
> > I must confess that I am unfamiliar with this hardware, for whatever
> > that might be worth.
> >
> >> > No response by sysrq.
> >> >
> >> > Did you found any issue about this? Or how can I address this issue?
> >
> > I see something similar in post-4.1 mainline, but only when CPU hotplug
> > is enabled and only under extreme stress.  Which is probably not the
> > case during your boot-up.  But please see below.
> >
> >> > Attached kernel config.
> >> >
> >> > Thanks,
> >> > Pengyu
> >> >
> >> > xhci_hcd :00:14.0: cache line size of 32 is not supported
> >> > hub 1-0:1.0: USB hub found
> >> > hub 1-0:1.0: 15 ports detected
> >> > xhci_hcd :00:14.0: xHCI Host Controller
> >> > xhci_hcd :00:14.0: new USB bus registered, assigned bus number 2
> >> > hub 2-0:1.0: USB hub found
> >> > hub 2-0:1.0: 6 ports detected
> >> > initcall xhci_pci_init+0x0/0x44 returned 0 after 306571 usecs
> >> > calling  ehci_hcd_init+0x0/0x5d @ 1
> >> > ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
> >> > initcall ehci_hcd_init+0x0/0x5d returned 0 after 5695 usecs
> >> > calling  ehci_pci_init+0x0/0x69 @ 1
> >> > ehci-pci: EHCI PCI platform driver
> >> > ehci-pci :00:1a.0: enabling bus mastering
> >> > ehci-pci :00:1a.0: EHCI Host Controller
> >> > ehci-pci :00:1a.0: new USB bus registered, assigned bus number 3
> >> > ehci-pci :00:1a.0: debug port 2
> >> > ehci-pci :00:1a.0: cache line size of 32 is not supported
> >> > ehci-pci :00:1a.0: irq 18, io mem 0x91d02000
> >> > cfg80211: Calling CRDA to update world regulatory domain
> >> > cfg80211: Calling CRDA to update world regulatory domain
> >> > cfg80211: Calling CRDA to update world regulatory domain
> >> > cfg80211: Calling CRDA to update world regulatory domain
> >> > cfg80211: Calling CRDA to update world regulatory domain
> >> > cfg80211: Calling CRDA to update world regulatory domain
> >> > cfg80211: Exceeded CRDA call max attempts. Not calling CRDA
> >> > INFO: rcu_preempt detected stalls on CPUs/tasks:
> >> > 18: (0 ticks this GP) idle=284/0/0 softirq=0/0 fqs=0
> >
> > The "stalling" CPU is idle, which is a quiescent state and therefore
> > should not stall grace periods.  But please see below...
> >
> >> > (detected by 12, t=26002 jiffies, g=5351, c=5350, q=451332)
> >> > Task dump for CPU 18:
> >> > swapper/18  R  running task0 0  1 0x0020
> >> >  814946a7 88045fbc7e58 814f42f7 0004
> >> >  0004 8804663e0f60 820dc5c0 88045fbc7ea8
> >> >  8181fca5   0046
> >> > Call Trace:
> >> >  [] ? debug_smp_processor_id+0x17/0x20
> >> >  [] ? intel_idle+0x137/0x140
> >> >  [] ? cpuidle_enter_state+0x65/0x3e0
> >> >  [] ? cpuidle_enter+0x17/0x20
> >> >  [] ? cpu_startup_entry+0x33d/0x630
> >> >  [] ? start_secondary+0x12e/0x140
> >> > rcu_preempt kthread starved for 26002 jiffies!
> >
> > The reason that the idle CPU is not being recognized as a legitimate
> > quiescent state is that the rcu_preempt grace-period kthread is not
> > being allowed to run.  In fact, it has not been permitted to run for more
> > than 26 seconds.  Despite the fact that in this situation, it would have
> > invoked wait_event_interruptible_timeout() with a three-jiffy timeout.
> >
> > One thing to do is to modify the rcu_check_gp_kthread_starvation()
> > function to print the value of rsp->gp_kthread, then to also print
> > fields from the resulting pointer to task_struct to see what the thread
> > is up to.  (It is tempting to suspect that this kthread might never have
> > been spawned, but in that case the grace period would not have started.)
> >
> > For example, add the following in the "if (j - gpa > 2 * HZ)"
> > body:
> >
> > if (rsp->gp_kthread)
> > sched_show_task(rsp->gp_kthread);
> >
> > Don't forget to add the "{" "}" to accommodate the additional statement
> > within the "if" statement.
> >
> > If the additional output shows that the rcu_preempt kthread is runnable,
> > the next question is "why is it not running?".  If the output instead
> > shows that the task is blocked, the next question is "why didn't the
> > wait_event_interruptible_timeout() awaken it?
> >
> > Thanx, Paul
> 
> Add sched_show_task to show the current task on stalled CPU:
> 
> calling  efi_load_efivars+0x0/0x40 @ 1
> initcall efi_load_efivars+0x0/0x40 returned 0 after 0 usecs
> calling  esrt_sysfs_init+0x0/0x2d6 @ 1
> initcall esrt_sysfs_

Re: RCU stall and the system boot hang with nfsroot

2015-12-29 Thread Aaron Ma
On Wed, Dec 30, 2015 at 7:42 AM, Paul E. McKenney
 wrote:
> On Tue, Dec 29, 2015 at 05:34:38PM +0800, Aaron Ma wrote:
>> Add paul...@linux.vnet.ibm.com
>>
>> On Tue, Dec 29, 2015 at 5:32 PM, Aaron Ma  wrote:
>> > Hi, Paul:
>> > I found the linux-stable-4.1.15 with rt15 patches boot hang sometimes.
>> > Hardware is Grantley-EP and WildcatPass.
>
> I must confess that I am unfamiliar with this hardware, for whatever
> that might be worth.
>
>> > No response by sysrq.
>> >
>> > Did you found any issue about this? Or how can I address this issue?
>
> I see something similar in post-4.1 mainline, but only when CPU hotplug
> is enabled and only under extreme stress.  Which is probably not the
> case during your boot-up.  But please see below.
>
>> > Attached kernel config.
>> >
>> > Thanks,
>> > Pengyu
>> >
>> > xhci_hcd :00:14.0: cache line size of 32 is not supported
>> > hub 1-0:1.0: USB hub found
>> > hub 1-0:1.0: 15 ports detected
>> > xhci_hcd :00:14.0: xHCI Host Controller
>> > xhci_hcd :00:14.0: new USB bus registered, assigned bus number 2
>> > hub 2-0:1.0: USB hub found
>> > hub 2-0:1.0: 6 ports detected
>> > initcall xhci_pci_init+0x0/0x44 returned 0 after 306571 usecs
>> > calling  ehci_hcd_init+0x0/0x5d @ 1
>> > ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
>> > initcall ehci_hcd_init+0x0/0x5d returned 0 after 5695 usecs
>> > calling  ehci_pci_init+0x0/0x69 @ 1
>> > ehci-pci: EHCI PCI platform driver
>> > ehci-pci :00:1a.0: enabling bus mastering
>> > ehci-pci :00:1a.0: EHCI Host Controller
>> > ehci-pci :00:1a.0: new USB bus registered, assigned bus number 3
>> > ehci-pci :00:1a.0: debug port 2
>> > ehci-pci :00:1a.0: cache line size of 32 is not supported
>> > ehci-pci :00:1a.0: irq 18, io mem 0x91d02000
>> > cfg80211: Calling CRDA to update world regulatory domain
>> > cfg80211: Calling CRDA to update world regulatory domain
>> > cfg80211: Calling CRDA to update world regulatory domain
>> > cfg80211: Calling CRDA to update world regulatory domain
>> > cfg80211: Calling CRDA to update world regulatory domain
>> > cfg80211: Calling CRDA to update world regulatory domain
>> > cfg80211: Exceeded CRDA call max attempts. Not calling CRDA
>> > INFO: rcu_preempt detected stalls on CPUs/tasks:
>> > 18: (0 ticks this GP) idle=284/0/0 softirq=0/0 fqs=0
>
> The "stalling" CPU is idle, which is a quiescent state and therefore
> should not stall grace periods.  But please see below...
>
>> > (detected by 12, t=26002 jiffies, g=5351, c=5350, q=451332)
>> > Task dump for CPU 18:
>> > swapper/18  R  running task0 0  1 0x0020
>> >  814946a7 88045fbc7e58 814f42f7 0004
>> >  0004 8804663e0f60 820dc5c0 88045fbc7ea8
>> >  8181fca5   0046
>> > Call Trace:
>> >  [] ? debug_smp_processor_id+0x17/0x20
>> >  [] ? intel_idle+0x137/0x140
>> >  [] ? cpuidle_enter_state+0x65/0x3e0
>> >  [] ? cpuidle_enter+0x17/0x20
>> >  [] ? cpu_startup_entry+0x33d/0x630
>> >  [] ? start_secondary+0x12e/0x140
>> > rcu_preempt kthread starved for 26002 jiffies!
>
> The reason that the idle CPU is not being recognized as a legitimate
> quiescent state is that the rcu_preempt grace-period kthread is not
> being allowed to run.  In fact, it has not been permitted to run for more
> than 26 seconds.  Despite the fact that in this situation, it would have
> invoked wait_event_interruptible_timeout() with a three-jiffy timeout.
>
> One thing to do is to modify the rcu_check_gp_kthread_starvation()
> function to print the value of rsp->gp_kthread, then to also print
> fields from the resulting pointer to task_struct to see what the thread
> is up to.  (It is tempting to suspect that this kthread might never have
> been spawned, but in that case the grace period would not have started.)
>
> For example, add the following in the "if (j - gpa > 2 * HZ)"
> body:
>
> if (rsp->gp_kthread)
> sched_show_task(rsp->gp_kthread);
>
> Don't forget to add the "{" "}" to accommodate the additional statement
> within the "if" statement.
>
> If the additional output shows that the rcu_preempt kthread is runnable,
> the next question is "why is it not running?".  If the output instead
> shows that the task is blocked, the next question is "why didn't the
> wait_event_interruptible_timeout() awaken it?
>
> Thanx, Paul

Add sched_show_task to show the current task on stalled CPU:

calling  efi_load_efivars+0x0/0x40 @ 1
initcall efi_load_efivars+0x0/0x40 returned 0 after 0 usecs
calling  esrt_sysfs_init+0x0/0x2d6 @ 1
initcall esrt_sysfs_init+0x0/0x2d6 returned -38 after 1 usecs
calling  hid_init+0x0/0x4c @ 1
initcall hid_init+0x0/0x4c returned 0 after 373 usecs
calling  hid_generic_init+0x0/0x1b @ 1
initcall hid_generic_init+0x0/0x1b returned 0 after 108 usecs
calling  sensor_hub_driver_init+0x0/0x1b @ 1

Re: RCU stall and the system boot hang with nfsroot

2015-12-29 Thread Paul E. McKenney
On Tue, Dec 29, 2015 at 05:34:38PM +0800, Aaron Ma wrote:
> Add paul...@linux.vnet.ibm.com
> 
> On Tue, Dec 29, 2015 at 5:32 PM, Aaron Ma  wrote:
> > Hi, Paul:
> > I found the linux-stable-4.1.15 with rt15 patches boot hang sometimes.
> > Hardware is Grantley-EP and WildcatPass.

I must confess that I am unfamiliar with this hardware, for whatever
that might be worth.

> > No response by sysrq.
> >
> > Did you found any issue about this? Or how can I address this issue?

I see something similar in post-4.1 mainline, but only when CPU hotplug
is enabled and only under extreme stress.  Which is probably not the
case during your boot-up.  But please see below.

> > Attached kernel config.
> >
> > Thanks,
> > Pengyu
> >
> > xhci_hcd :00:14.0: cache line size of 32 is not supported
> > hub 1-0:1.0: USB hub found
> > hub 1-0:1.0: 15 ports detected
> > xhci_hcd :00:14.0: xHCI Host Controller
> > xhci_hcd :00:14.0: new USB bus registered, assigned bus number 2
> > hub 2-0:1.0: USB hub found
> > hub 2-0:1.0: 6 ports detected
> > initcall xhci_pci_init+0x0/0x44 returned 0 after 306571 usecs
> > calling  ehci_hcd_init+0x0/0x5d @ 1
> > ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
> > initcall ehci_hcd_init+0x0/0x5d returned 0 after 5695 usecs
> > calling  ehci_pci_init+0x0/0x69 @ 1
> > ehci-pci: EHCI PCI platform driver
> > ehci-pci :00:1a.0: enabling bus mastering
> > ehci-pci :00:1a.0: EHCI Host Controller
> > ehci-pci :00:1a.0: new USB bus registered, assigned bus number 3
> > ehci-pci :00:1a.0: debug port 2
> > ehci-pci :00:1a.0: cache line size of 32 is not supported
> > ehci-pci :00:1a.0: irq 18, io mem 0x91d02000
> > cfg80211: Calling CRDA to update world regulatory domain
> > cfg80211: Calling CRDA to update world regulatory domain
> > cfg80211: Calling CRDA to update world regulatory domain
> > cfg80211: Calling CRDA to update world regulatory domain
> > cfg80211: Calling CRDA to update world regulatory domain
> > cfg80211: Calling CRDA to update world regulatory domain
> > cfg80211: Exceeded CRDA call max attempts. Not calling CRDA
> > INFO: rcu_preempt detected stalls on CPUs/tasks:
> > 18: (0 ticks this GP) idle=284/0/0 softirq=0/0 fqs=0

The "stalling" CPU is idle, which is a quiescent state and therefore
should not stall grace periods.  But please see below...

> > (detected by 12, t=26002 jiffies, g=5351, c=5350, q=451332)
> > Task dump for CPU 18:
> > swapper/18  R  running task0 0  1 0x0020
> >  814946a7 88045fbc7e58 814f42f7 0004
> >  0004 8804663e0f60 820dc5c0 88045fbc7ea8
> >  8181fca5   0046
> > Call Trace:
> >  [] ? debug_smp_processor_id+0x17/0x20
> >  [] ? intel_idle+0x137/0x140
> >  [] ? cpuidle_enter_state+0x65/0x3e0
> >  [] ? cpuidle_enter+0x17/0x20
> >  [] ? cpu_startup_entry+0x33d/0x630
> >  [] ? start_secondary+0x12e/0x140
> > rcu_preempt kthread starved for 26002 jiffies!

The reason that the idle CPU is not being recognized as a legitimate
quiescent state is that the rcu_preempt grace-period kthread is not
being allowed to run.  In fact, it has not been permitted to run for more
than 26 seconds.  Despite the fact that in this situation, it would have
invoked wait_event_interruptible_timeout() with a three-jiffy timeout.

One thing to do is to modify the rcu_check_gp_kthread_starvation()
function to print the value of rsp->gp_kthread, then to also print
fields from the resulting pointer to task_struct to see what the thread
is up to.  (It is tempting to suspect that this kthread might never have
been spawned, but in that case the grace period would not have started.)

For example, add the following in the "if (j - gpa > 2 * HZ)"
body:

if (rsp->gp_kthread)
sched_show_task(rsp->gp_kthread);

Don't forget to add the "{" "}" to accommodate the additional statement
within the "if" statement.

If the additional output shows that the rcu_preempt kthread is runnable,
the next question is "why is it not running?".  If the output instead
shows that the task is blocked, the next question is "why didn't the
wait_event_interruptible_timeout() awaken it?

Thanx, Paul

> > INFO: rcu_preempt detected stalls on CPUs/tasks:
> > 18: (0 ticks this GP) idle=326/0/0 softirq=0/0 fqs=1
> > (detected by 67, t=104007 jiffies, g=5351, c=5350, q=451332)
> > Task dump for CPU 18:
> > swapper/18  R  running task0 0  1 0x0020
> >  814946a7 88045fbc7e58 814f42f7 0004
> >  0004 8804663e0f60 820dc5c0 88045fbc7ea8
> >  8181fca5   0046
> > Call Trace:
> >  [] ? debug_smp_processor_id+0x17/0x20
> >  [] ? intel_idle+0x137/0x140
> >  [] ? cpuidle_enter_state+0x65/0x3e0
> >  [] ? cpuidle_enter+0x17/0x20
> >  [] ? cp

Re: RCU stall and the system boot hang with nfsroot

2015-12-29 Thread Aaron Ma
Add paul...@linux.vnet.ibm.com

On Tue, Dec 29, 2015 at 5:32 PM, Aaron Ma  wrote:
> Hi, Paul:
> I found the linux-stable-4.1.15 with rt15 patches boot hang sometimes.
> Hardware is Grantley-EP and WildcatPass.
> No response by sysrq.
>
> Did you found any issue about this? Or how can I address this issue?
>
> Attached kernel config.
>
> Thanks,
> Pengyu
>
> xhci_hcd :00:14.0: cache line size of 32 is not supported
> hub 1-0:1.0: USB hub found
> hub 1-0:1.0: 15 ports detected
> xhci_hcd :00:14.0: xHCI Host Controller
> xhci_hcd :00:14.0: new USB bus registered, assigned bus number 2
> hub 2-0:1.0: USB hub found
> hub 2-0:1.0: 6 ports detected
> initcall xhci_pci_init+0x0/0x44 returned 0 after 306571 usecs
> calling  ehci_hcd_init+0x0/0x5d @ 1
> ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
> initcall ehci_hcd_init+0x0/0x5d returned 0 after 5695 usecs
> calling  ehci_pci_init+0x0/0x69 @ 1
> ehci-pci: EHCI PCI platform driver
> ehci-pci :00:1a.0: enabling bus mastering
> ehci-pci :00:1a.0: EHCI Host Controller
> ehci-pci :00:1a.0: new USB bus registered, assigned bus number 3
> ehci-pci :00:1a.0: debug port 2
> ehci-pci :00:1a.0: cache line size of 32 is not supported
> ehci-pci :00:1a.0: irq 18, io mem 0x91d02000
> cfg80211: Calling CRDA to update world regulatory domain
> cfg80211: Calling CRDA to update world regulatory domain
> cfg80211: Calling CRDA to update world regulatory domain
> cfg80211: Calling CRDA to update world regulatory domain
> cfg80211: Calling CRDA to update world regulatory domain
> cfg80211: Calling CRDA to update world regulatory domain
> cfg80211: Exceeded CRDA call max attempts. Not calling CRDA
> INFO: rcu_preempt detected stalls on CPUs/tasks:
> 18: (0 ticks this GP) idle=284/0/0 softirq=0/0 fqs=0
> (detected by 12, t=26002 jiffies, g=5351, c=5350, q=451332)
> Task dump for CPU 18:
> swapper/18  R  running task0 0  1 0x0020
>  814946a7 88045fbc7e58 814f42f7 0004
>  0004 8804663e0f60 820dc5c0 88045fbc7ea8
>  8181fca5   0046
> Call Trace:
>  [] ? debug_smp_processor_id+0x17/0x20
>  [] ? intel_idle+0x137/0x140
>  [] ? cpuidle_enter_state+0x65/0x3e0
>  [] ? cpuidle_enter+0x17/0x20
>  [] ? cpu_startup_entry+0x33d/0x630
>  [] ? start_secondary+0x12e/0x140
> rcu_preempt kthread starved for 26002 jiffies!
> INFO: rcu_preempt detected stalls on CPUs/tasks:
> 18: (0 ticks this GP) idle=326/0/0 softirq=0/0 fqs=1
> (detected by 67, t=104007 jiffies, g=5351, c=5350, q=451332)
> Task dump for CPU 18:
> swapper/18  R  running task0 0  1 0x0020
>  814946a7 88045fbc7e58 814f42f7 0004
>  0004 8804663e0f60 820dc5c0 88045fbc7ea8
>  8181fca5   0046
> Call Trace:
>  [] ? debug_smp_processor_id+0x17/0x20
>  [] ? intel_idle+0x137/0x140
>  [] ? cpuidle_enter_state+0x65/0x3e0
>  [] ? cpuidle_enter+0x17/0x20
>  [] ? cpu_startup_entry+0x33d/0x630
>  [] ? start_secondary+0x12e/0x140
> rcu_preempt kthread starved for 78005 jiffies!
> INFO: rcu_preempt detected stalls on CPUs/tasks:
> 18: (0 ticks this GP) idle=35c/0/0 softirq=0/0 fqs=0
> (detected by 62, t=26002 jiffies, g=5352, c=5351, q=451332)
> Task dump for CPU 18:
> swapper/18  R  running task0 0  1 0x0020
>  814946a7 88045fbc7e58 814f42f7 0004
>  0004 8804663e0f60 820dc5c0 88045fbc7ea8
>  8181fca5   0046
> Call Trace:
>  [] ? debug_smp_processor_id+0x17/0x20
>  [] ? intel_idle+0x137/0x140
>  [] ? cpuidle_enter_state+0x65/0x3e0
>  [] ? cpuidle_enter+0x17/0x20
>  [] ? cpu_startup_entry+0x33d/0x630
>  [] ? start_secondary+0x12e/0x140
> rcu_preempt kthread starved for 26002 jiffies!
> perf interrupt took too long (2575 > 2500), lowering
> kernel.perf_event_max_sample_rate to 5
> INFO: task swapper/0:1 blocked for more than 120 seconds.
>   Not tainted 4.1.15-rt13-WR8.0.0.0_preempt-rt+ #25
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> swapper/0   D 880453547a18 0 1  0 0x
>  880453547a18 880851aa8000 d7a0 88045f604ba0
>  880851aa8000 8804535479e8 814946a7 880453547a08
>  880453548000 880851aa8000 880453547a68 8804663ce500
> Call Trace:
>  [] ? debug_smp_processor_id+0x17/0x20
>  [] schedule+0x3f/0xd0
>  [] schedule_timeout+0x189/0x3f0
>  [] ? timer_cpu_notify+0x190/0x190
>  [] msleep+0x42/0x50
>  [] ehci_run+0xf6/0x1d0
>  [] usb_add_hcd+0x2dd/0x810
>  [] usb_hcd_pci_probe+0x33e/0x490
>  [] ehci_pci_probe+0x36/0x40
>  [] local_pci_probe+0x42/0xa0
>  [] ? pci_match_device+0xf2/0x120
>  [] pci_device_probe+0xe9/0x150
>  [] driver_probe_device+0x181/0x310
>  [] __dr

Re: RCU stall and the system boot hang

2015-12-01 Thread Paul E. McKenney
On Mon, Nov 30, 2015 at 09:19:18AM -0800, Paul E. McKenney wrote:
> On Mon, Nov 30, 2015 at 02:54:13PM +0800, fupan li wrote:

[ . . . ]

> > No,  just a normal boot, and these stalls were happened before
> > systemd services running.
>
> Interesting.  My testing show v4.1 being OK, with the first issues showing
> up somewhere between v4.1 and v4.2.  Or at least v4.1 is reliable enough
> that is passes 42 hours of focused rcutorture testing, where v4.2 tends
> to fail in under two hours.  And it seems to happen only on multisocket
> systems -- I seem to be able to hammer as hard as I want on my four-core
> (eight hardware thread) laptop without an issue.

And I take it back.  After beating on it for the better part of a week,
I did get one failure on my single-socket laptop.

So maybe I need to make my rcutorture scripts force tests to cross socket
boundaries where possible...

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RCU stall and the system boot hang

2015-11-30 Thread Paul E. McKenney
On Mon, Nov 30, 2015 at 02:54:13PM +0800, fupan li wrote:
> 2015-11-29 14:05 GMT+08:00 Paul E. McKenney :
> 
> > On Sun, Nov 29, 2015 at 12:46:10PM +0800, fupan li wrote:
> > > 2015-11-28 22:53 GMT+08:00 Paul E. McKenney  > >:
> > >
> > > > On Sat, Nov 28, 2015 at 01:05:52PM +0800, fupan li wrote:
> > > > > 2015-11-28 0:28 GMT+08:00 Paul E. McKenney <
> > paul...@linux.vnet.ibm.com>:
> > > > >
> > > > > > On Fri, Nov 27, 2015 at 08:23:24PM +0800, fupan li wrote:
> > > > > > > Hi, Paul
> > > > > > >
> > > > > > > On my Wildcat_Pass (Haswell) board, the system boot will hang as
> > > > below.
> > > > > > > The kernel is preempt-rt kernel.
> > > > > > > But it was not reproduced every time, about 1 time per 5-10
> > boots.
> > > > > >
> > > > > > CCing LMKL and linux-rt-users in case anyone else is seeing this.
> > > > > >
> > > > > > OK, let's take a look at the stall warning...
> > > > > >
> > > > > > [ . . . ]
> > > > > >
> > > > > > > Btrfs loaded
> > > > > > > console [netcon0] enabled
> > > > > > > netconsole: network logging started
> > > > > > > rtc_cmos 00:00: setting system clock to 2015-11-11 18:18:03 UTC
> > > > > > (1447265883)
> > > > > > > usb 1-9: new full-speed USB device number 3 using xhci_hcd
> > > > > > > IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> > > > > > > IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
> > > > > > > usb 1-9: ep 0x81 - rounding interval to 64 microframes, ep desc
> > says
> > > > 80
> > > > > > > microframes
> > > > > > > input: American Megatrends Inc. Virtual Keyboard and Mouse as
> > > > > > >
> > > > > >
> > > >
> > /devices/pci:00/:00:14.0/usb1/1-9/1-9:1.0/0003:046B:FF10.0002/input/input3
> > > > > > > hid-generic 0003:046B:FF10.0002: input: USB HID v1.10 Keyboard
> > > > [American
> > > > > > > Megatrends Inc. Virtual Keyboard and Mouse] on
> > > > usb-:00:14.0-9/input0
> > > > > > > input: American Megatrends Inc. Virtual Keyboard and Mouse as
> > > > > > >
> > > > > >
> > > >
> > /devices/pci:00/:00:14.0/usb1/1-9/1-9:1.1/0003:046B:FF10.0003/input/input4
> > > > > > > hid-generic 0003:046B:FF10.0003: input: USB HID v1.10 Mouse
> > [American
> > > > > > > Megatrends Inc. Virtual Keyboard and Mouse] on
> > > > usb-:00:14.0-9/input1
> > > > > > > ixgbe :02:00.0: registered PHC device on eth2
> > > > > > > IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
> > > > > > > ixgbe :02:00.1: registered PHC device on eth3
> > > > > > > IPv6: ADDRCONF(NETDEV_UP): eth3: link is not ready
> > > > > > > igb :04:00.0 eth0: igb: eth0 NIC Link is Up 100 Mbps Full
> > Duplex,
> > > > > > Flow
> > > > > > > Control: RX/TX
> > > > > > > IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> > > > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > > > Sending DHCP requests ., OK
> > > > > > > random: nonblocking pool is initialized
> > > > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > > > INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > > > 12: (0 ticks this GP) idle=f48/0/0 softirq=0/0 fqs=0
> > > > > > > 17: (59 GPs behind) idle=55a/0/0 softirq=0/0 fqs=0
> > > > > >
> > > > > > So CPUs 12 and 17 are stalling the grace period.
> > > > > >
> > > > > > > (detected by 5, t=21002 jiffies, g=-237, c=-238, q=136)
> > > > > > > Task dump for CPU 12:
> > > > > > > swapper/12  R  running task0 0  1 0x0020
> > > > > > >  8140f1f7 880859037eb8 810b7ff5
> > 880859037ea8
> > > > > > >  810759ef 82140220 000c
> > 82140220
> > > > > > >  880859034000 880859037ea8 8140f1f7
> > 880859037ec8
> > > > > > > Call Trace:
> > > > > > >  [] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [] ? rcu_eqs_enter_common+0x85/0x240
> > > > > > >  [] ? __atomic_notifier_call_chain+0x4f/0x70
> > > > > > >  [] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [] ? mwait_idle+0x42/0x1d0
> > > > > > >  [] ? arch_cpu_idle+0xf/0x20
> > > > > > >  [] ? cpu_startup_entry+0xae/0x300
> > > > > > >  [] ? start_secondary+0x12c/0x140
> > > > > >
> > > > > > But CPU 12 seems to be idle (as is in fact indicated by the
> > > > "idle=55a/0/0"
> > > > > > above), so the grace-period kthread should have reported a
> > quiescent
> > > > > > state on its behalf.
> > > > > >
> > > > > > > Task dump for CPU 17:
> > > > > > > swapper/17  R  running task0 0  1 0x0020
> > > > > > >  8140f1f7 880859063eb8 810b7ff5
> > 880859063ea8
> > > > > > >  810759ef 82140220 0011
> > 82140220
> > > > > > >  ff

Re: RCU stall and the system boot hang

2015-11-28 Thread Paul E. McKenney
On Sun, Nov 29, 2015 at 12:46:10PM +0800, fupan li wrote:
> 2015-11-28 22:53 GMT+08:00 Paul E. McKenney :
> 
> > On Sat, Nov 28, 2015 at 01:05:52PM +0800, fupan li wrote:
> > > 2015-11-28 0:28 GMT+08:00 Paul E. McKenney :
> > >
> > > > On Fri, Nov 27, 2015 at 08:23:24PM +0800, fupan li wrote:
> > > > > Hi, Paul
> > > > >
> > > > > On my Wildcat_Pass (Haswell) board, the system boot will hang as
> > below.
> > > > > The kernel is preempt-rt kernel.
> > > > > But it was not reproduced every time, about 1 time per 5-10 boots.
> > > >
> > > > CCing LMKL and linux-rt-users in case anyone else is seeing this.
> > > >
> > > > OK, let's take a look at the stall warning...
> > > >
> > > > [ . . . ]
> > > >
> > > > > Btrfs loaded
> > > > > console [netcon0] enabled
> > > > > netconsole: network logging started
> > > > > rtc_cmos 00:00: setting system clock to 2015-11-11 18:18:03 UTC
> > > > (1447265883)
> > > > > usb 1-9: new full-speed USB device number 3 using xhci_hcd
> > > > > IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> > > > > IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
> > > > > usb 1-9: ep 0x81 - rounding interval to 64 microframes, ep desc says
> > 80
> > > > > microframes
> > > > > input: American Megatrends Inc. Virtual Keyboard and Mouse as
> > > > >
> > > >
> > /devices/pci:00/:00:14.0/usb1/1-9/1-9:1.0/0003:046B:FF10.0002/input/input3
> > > > > hid-generic 0003:046B:FF10.0002: input: USB HID v1.10 Keyboard
> > [American
> > > > > Megatrends Inc. Virtual Keyboard and Mouse] on
> > usb-:00:14.0-9/input0
> > > > > input: American Megatrends Inc. Virtual Keyboard and Mouse as
> > > > >
> > > >
> > /devices/pci:00/:00:14.0/usb1/1-9/1-9:1.1/0003:046B:FF10.0003/input/input4
> > > > > hid-generic 0003:046B:FF10.0003: input: USB HID v1.10 Mouse [American
> > > > > Megatrends Inc. Virtual Keyboard and Mouse] on
> > usb-:00:14.0-9/input1
> > > > > ixgbe :02:00.0: registered PHC device on eth2
> > > > > IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
> > > > > ixgbe :02:00.1: registered PHC device on eth3
> > > > > IPv6: ADDRCONF(NETDEV_UP): eth3: link is not ready
> > > > > igb :04:00.0 eth0: igb: eth0 NIC Link is Up 100 Mbps Full Duplex,
> > > > Flow
> > > > > Control: RX/TX
> > > > > IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > Sending DHCP requests ., OK
> > > > > random: nonblocking pool is initialized
> > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > 12: (0 ticks this GP) idle=f48/0/0 softirq=0/0 fqs=0
> > > > > 17: (59 GPs behind) idle=55a/0/0 softirq=0/0 fqs=0
> > > >
> > > > So CPUs 12 and 17 are stalling the grace period.
> > > >
> > > > > (detected by 5, t=21002 jiffies, g=-237, c=-238, q=136)
> > > > > Task dump for CPU 12:
> > > > > swapper/12  R  running task0 0  1 0x0020
> > > > >  8140f1f7 880859037eb8 810b7ff5 880859037ea8
> > > > >  810759ef 82140220 000c 82140220
> > > > >  880859034000 880859037ea8 8140f1f7 880859037ec8
> > > > > Call Trace:
> > > > >  [] ? debug_smp_processor_id+0x17/0x20
> > > > >  [] ? rcu_eqs_enter_common+0x85/0x240
> > > > >  [] ? __atomic_notifier_call_chain+0x4f/0x70
> > > > >  [] ? debug_smp_processor_id+0x17/0x20
> > > > >  [] ? mwait_idle+0x42/0x1d0
> > > > >  [] ? arch_cpu_idle+0xf/0x20
> > > > >  [] ? cpu_startup_entry+0xae/0x300
> > > > >  [] ? start_secondary+0x12c/0x140
> > > >
> > > > But CPU 12 seems to be idle (as is in fact indicated by the
> > "idle=55a/0/0"
> > > > above), so the grace-period kthread should have reported a quiescent
> > > > state on its behalf.
> > > >
> > > > > Task dump for CPU 17:
> > > > > swapper/17  R  running task0 0  1 0x0020
> > > > >  8140f1f7 880859063eb8 810b7ff5 880859063ea8
> > > > >  810759ef 82140220 0011 82140220
> > > > >  88085906 880859063ea8 8140f1f7 880859063ec8
> > > > > Call Trace:
> > > > >  [] ? debug_smp_processor_id+0x17/0x20
> > > > >  [] ? rcu_eqs_enter_common+0x85/0x240
> > > > >  [] ? __atomic_notifier_call_chain+0x4f/0x70
> > > > >  [] ? debug_smp_processor_id+0x17/0x20
> > > > >  [] ? mwait_idle+0x42/0x1d0
> > > > >  [] ? arch_cpu_idle+0xf/0x20
> > > > >  [] ? cpu_startup_entry+0xae/0x300
> > > > >  [] ? start_secondary+0x12c/0x140
> > > >
> > > > And the same for CPU 17.
> > > >
> > > > > rcu_preempt kthread starved for 21002 jiffies!
> > > >
> > 

Re: RCU stall and the system boot hang

2015-11-28 Thread Paul E. McKenney
On Sat, Nov 28, 2015 at 01:05:52PM +0800, fupan li wrote:
> 2015-11-28 0:28 GMT+08:00 Paul E. McKenney :
> 
> > On Fri, Nov 27, 2015 at 08:23:24PM +0800, fupan li wrote:
> > > Hi, Paul
> > >
> > > On my Wildcat_Pass (Haswell) board, the system boot will hang as below.
> > > The kernel is preempt-rt kernel.
> > > But it was not reproduced every time, about 1 time per 5-10 boots.
> >
> > CCing LMKL and linux-rt-users in case anyone else is seeing this.
> >
> > OK, let's take a look at the stall warning...
> >
> > [ . . . ]
> >
> > > Btrfs loaded
> > > console [netcon0] enabled
> > > netconsole: network logging started
> > > rtc_cmos 00:00: setting system clock to 2015-11-11 18:18:03 UTC
> > (1447265883)
> > > usb 1-9: new full-speed USB device number 3 using xhci_hcd
> > > IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> > > IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
> > > usb 1-9: ep 0x81 - rounding interval to 64 microframes, ep desc says 80
> > > microframes
> > > input: American Megatrends Inc. Virtual Keyboard and Mouse as
> > >
> > /devices/pci:00/:00:14.0/usb1/1-9/1-9:1.0/0003:046B:FF10.0002/input/input3
> > > hid-generic 0003:046B:FF10.0002: input: USB HID v1.10 Keyboard [American
> > > Megatrends Inc. Virtual Keyboard and Mouse] on usb-:00:14.0-9/input0
> > > input: American Megatrends Inc. Virtual Keyboard and Mouse as
> > >
> > /devices/pci:00/:00:14.0/usb1/1-9/1-9:1.1/0003:046B:FF10.0003/input/input4
> > > hid-generic 0003:046B:FF10.0003: input: USB HID v1.10 Mouse [American
> > > Megatrends Inc. Virtual Keyboard and Mouse] on usb-:00:14.0-9/input1
> > > ixgbe :02:00.0: registered PHC device on eth2
> > > IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
> > > ixgbe :02:00.1: registered PHC device on eth3
> > > IPv6: ADDRCONF(NETDEV_UP): eth3: link is not ready
> > > igb :04:00.0 eth0: igb: eth0 NIC Link is Up 100 Mbps Full Duplex,
> > Flow
> > > Control: RX/TX
> > > IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> > > cfg80211: Calling CRDA to update world regulatory domain
> > > Sending DHCP requests ., OK
> > > random: nonblocking pool is initialized
> > > cfg80211: Calling CRDA to update world regulatory domain
> > > cfg80211: Calling CRDA to update world regulatory domain
> > > cfg80211: Calling CRDA to update world regulatory domain
> > > cfg80211: Calling CRDA to update world regulatory domain
> > > cfg80211: Calling CRDA to update world regulatory domain
> > > cfg80211: Calling CRDA to update world regulatory domain
> > > INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > 12: (0 ticks this GP) idle=f48/0/0 softirq=0/0 fqs=0
> > > 17: (59 GPs behind) idle=55a/0/0 softirq=0/0 fqs=0
> >
> > So CPUs 12 and 17 are stalling the grace period.
> >
> > > (detected by 5, t=21002 jiffies, g=-237, c=-238, q=136)
> > > Task dump for CPU 12:
> > > swapper/12  R  running task0 0  1 0x0020
> > >  8140f1f7 880859037eb8 810b7ff5 880859037ea8
> > >  810759ef 82140220 000c 82140220
> > >  880859034000 880859037ea8 8140f1f7 880859037ec8
> > > Call Trace:
> > >  [] ? debug_smp_processor_id+0x17/0x20
> > >  [] ? rcu_eqs_enter_common+0x85/0x240
> > >  [] ? __atomic_notifier_call_chain+0x4f/0x70
> > >  [] ? debug_smp_processor_id+0x17/0x20
> > >  [] ? mwait_idle+0x42/0x1d0
> > >  [] ? arch_cpu_idle+0xf/0x20
> > >  [] ? cpu_startup_entry+0xae/0x300
> > >  [] ? start_secondary+0x12c/0x140
> >
> > But CPU 12 seems to be idle (as is in fact indicated by the "idle=55a/0/0"
> > above), so the grace-period kthread should have reported a quiescent
> > state on its behalf.
> >
> > > Task dump for CPU 17:
> > > swapper/17  R  running task0 0  1 0x0020
> > >  8140f1f7 880859063eb8 810b7ff5 880859063ea8
> > >  810759ef 82140220 0011 82140220
> > >  88085906 880859063ea8 8140f1f7 880859063ec8
> > > Call Trace:
> > >  [] ? debug_smp_processor_id+0x17/0x20
> > >  [] ? rcu_eqs_enter_common+0x85/0x240
> > >  [] ? __atomic_notifier_call_chain+0x4f/0x70
> > >  [] ? debug_smp_processor_id+0x17/0x20
> > >  [] ? mwait_idle+0x42/0x1d0
> > >  [] ? arch_cpu_idle+0xf/0x20
> > >  [] ? cpu_startup_entry+0xae/0x300
> > >  [] ? start_secondary+0x12c/0x140
> >
> > And the same for CPU 17.
> >
> > > rcu_preempt kthread starved for 21002 jiffies!
> >
> > But the grace-period kthread hasn't had a chance to run for more
> > than 21 seconds, which explains why the grace period is not ending.
> >
> > Are you starting up a heavy real-time workload at boot time?
> 
> No,  just a normal boot, and these stalls were happened before
> systemd services running.

Interesting.  My testing show v4.1 being OK, with the first issues showing
up somewhere between v4.1 and v4.2.  Or at least v4.1 is reliable enough
that is passes 42 hours of focused rcutorture testing, where v4.2 tends
to fail i

Re: RCU stall and the system boot hang

2015-11-27 Thread Paul E. McKenney
On Fri, Nov 27, 2015 at 08:23:24PM +0800, fupan li wrote:
> Hi, Paul
> 
> On my Wildcat_Pass (Haswell) board, the system boot will hang as below.
> The kernel is preempt-rt kernel.
> But it was not reproduced every time, about 1 time per 5-10 boots.

CCing LMKL and linux-rt-users in case anyone else is seeing this.

OK, let's take a look at the stall warning...

[ . . . ]

> Btrfs loaded
> console [netcon0] enabled
> netconsole: network logging started
> rtc_cmos 00:00: setting system clock to 2015-11-11 18:18:03 UTC (1447265883)
> usb 1-9: new full-speed USB device number 3 using xhci_hcd
> IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
> usb 1-9: ep 0x81 - rounding interval to 64 microframes, ep desc says 80
> microframes
> input: American Megatrends Inc. Virtual Keyboard and Mouse as
> /devices/pci:00/:00:14.0/usb1/1-9/1-9:1.0/0003:046B:FF10.0002/input/input3
> hid-generic 0003:046B:FF10.0002: input: USB HID v1.10 Keyboard [American
> Megatrends Inc. Virtual Keyboard and Mouse] on usb-:00:14.0-9/input0
> input: American Megatrends Inc. Virtual Keyboard and Mouse as
> /devices/pci:00/:00:14.0/usb1/1-9/1-9:1.1/0003:046B:FF10.0003/input/input4
> hid-generic 0003:046B:FF10.0003: input: USB HID v1.10 Mouse [American
> Megatrends Inc. Virtual Keyboard and Mouse] on usb-:00:14.0-9/input1
> ixgbe :02:00.0: registered PHC device on eth2
> IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
> ixgbe :02:00.1: registered PHC device on eth3
> IPv6: ADDRCONF(NETDEV_UP): eth3: link is not ready
> igb :04:00.0 eth0: igb: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow
> Control: RX/TX
> IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> cfg80211: Calling CRDA to update world regulatory domain
> Sending DHCP requests ., OK
> random: nonblocking pool is initialized
> cfg80211: Calling CRDA to update world regulatory domain
> cfg80211: Calling CRDA to update world regulatory domain
> cfg80211: Calling CRDA to update world regulatory domain
> cfg80211: Calling CRDA to update world regulatory domain
> cfg80211: Calling CRDA to update world regulatory domain
> cfg80211: Calling CRDA to update world regulatory domain
> INFO: rcu_preempt detected stalls on CPUs/tasks:
> 12: (0 ticks this GP) idle=f48/0/0 softirq=0/0 fqs=0
> 17: (59 GPs behind) idle=55a/0/0 softirq=0/0 fqs=0

So CPUs 12 and 17 are stalling the grace period.

> (detected by 5, t=21002 jiffies, g=-237, c=-238, q=136)
> Task dump for CPU 12:
> swapper/12  R  running task0 0  1 0x0020
>  8140f1f7 880859037eb8 810b7ff5 880859037ea8
>  810759ef 82140220 000c 82140220
>  880859034000 880859037ea8 8140f1f7 880859037ec8
> Call Trace:
>  [] ? debug_smp_processor_id+0x17/0x20
>  [] ? rcu_eqs_enter_common+0x85/0x240
>  [] ? __atomic_notifier_call_chain+0x4f/0x70
>  [] ? debug_smp_processor_id+0x17/0x20
>  [] ? mwait_idle+0x42/0x1d0
>  [] ? arch_cpu_idle+0xf/0x20
>  [] ? cpu_startup_entry+0xae/0x300
>  [] ? start_secondary+0x12c/0x140

But CPU 12 seems to be idle (as is in fact indicated by the "idle=55a/0/0"
above), so the grace-period kthread should have reported a quiescent
state on its behalf.

> Task dump for CPU 17:
> swapper/17  R  running task0 0  1 0x0020
>  8140f1f7 880859063eb8 810b7ff5 880859063ea8
>  810759ef 82140220 0011 82140220
>  88085906 880859063ea8 8140f1f7 880859063ec8
> Call Trace:
>  [] ? debug_smp_processor_id+0x17/0x20
>  [] ? rcu_eqs_enter_common+0x85/0x240
>  [] ? __atomic_notifier_call_chain+0x4f/0x70
>  [] ? debug_smp_processor_id+0x17/0x20
>  [] ? mwait_idle+0x42/0x1d0
>  [] ? arch_cpu_idle+0xf/0x20
>  [] ? cpu_startup_entry+0xae/0x300
>  [] ? start_secondary+0x12c/0x140

And the same for CPU 17.

> rcu_preempt kthread starved for 21002 jiffies!

But the grace-period kthread hasn't had a chance to run for more
than 21 seconds, which explains why the grace period is not ending.

Are you starting up a heavy real-time workload at boot time?

> cfg80211: Calling CRDA to update world regulatory domain
> cfg80211: Calling CRDA to update world regulatory domain
> cfg80211: Exceeded CRDA call max attempts. Not calling CRDA
> INFO: rcu_preempt detected stalls on CPUs/tasks:
> 12: (0 ticks this GP) idle=04c/0/0 softirq=0/0 fqs=1

At this point, CPU 17's quiescent state has been recorded.

> (detected by 24, t=84007 jiffies, g=-237, c=-238, q=147)
> Task dump for CPU 12:
> swapper/12  R  running task0 0  1 0x0020
>  8140f1f7 880859037eb8 810b7ff5 880859037ea8
>  810759ef 82140220 000c 82140220
>  880859034000 880859037ea8 8140f1f7 880859037ec8
> Call Trace:
>  [] ? debug_smp_processor_id+0x17/0x20
>  [] ? rcu_eqs_enter_common+0x85/0x240
>  [] ? __a