Re: rcu: frequent rcu lockups

2015-03-20 Thread Paul E. McKenney
On Fri, Mar 20, 2015 at 10:46:33AM -0400, Sasha Levin wrote:
> On 03/20/2015 10:29 AM, Paul E. McKenney wrote:
> > On Fri, Mar 20, 2015 at 10:15:34AM -0400, Sasha Levin wrote:
> >> On 03/15/2015 08:02 PM, Paul E. McKenney wrote:
> >>> On Sun, Mar 15, 2015 at 07:32:32PM -0400, Sasha Levin wrote:
> > On 03/15/2015 04:43 PM, Paul E. McKenney wrote:
> >>> But I did find a bug that would result in the other warnings, and could
> >>> also result in too-short grace periods, which could in turn result in
> >>> arbitrary misbehavior.  The patch below, which is also on
> >>> its way into -next, should fix this.  Please let me know how it does
> >>> for you.
> >
> > I've stopped seeing the warnings I've previously reported, but started
> > seeing a new one:
> >
> > [  788.564596] WARNING: CPU: 12 PID: 9711 at kernel/rcu/tree.c:2201 
> > rcu_report_qs_rnp+0x42e/0x5a0()
> > [  788.568123] Modules linked in:
> > [  788.568123] CPU: 12 PID: 9711 Comm: trinity-main Not tainted 
> > 4.0.0-rc3-next-20150313-sasha-00041-g83a3dc8-dirty #2078
> > [  788.568123]  8803a1ba 400df16a 880442807cc8 
> > b1ab01ca
> > [  788.568123]    880442807d18 
> > a71e261a
> > [  788.568123]  dc00 a733d2ee 880442807d28 
> > b4724000
> > [  788.568123] Call Trace:
> > [  788.568123]  dump_stack (lib/dump_stack.c:52)
> > [  788.568123] warn_slowpath_common (kernel/panic.c:447)
> > [  788.568123] ? rcu_report_qs_rnp (kernel/rcu/tree.c:2201 
> > (discriminator 3))
> > [  788.568123] warn_slowpath_null (kernel/panic.c:481)
> > [  788.568123] rcu_report_qs_rnp (kernel/rcu/tree.c:2201 (discriminator 
> > 3))
> > [  788.568123] rcu_process_callbacks (kernel/rcu/tree.c:2302 
> > kernel/rcu/tree.c:2338 kernel/rcu/tree.c:2824 kernel/rcu/tree.c:2857)
> > [  788.568123] __do_softirq (kernel/softirq.c:273 
> > include/linux/jump_label.h:114 include/trace/events/irq.h:126 
> > kernel/softirq.c:274)
> > [  788.568123] irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
> > [  788.568123] smp_apic_timer_interrupt 
> > (arch/x86/kernel/apic/apic.c:918)
> > [  788.568123] apic_timer_interrupt (arch/x86/kernel/entry_64.S:920)
> > [  788.568123]  ? mark_held_locks (kernel/locking/lockdep.c:2525)
> > [  788.568123] finish_task_switch (kernel/sched/core.c:2231)
> > [  788.568123] __schedule (kernel/sched/core.c:2337 
> > kernel/sched/core.c:2795)
> > [  788.568123] schedule (./arch/x86/include/asm/bitops.h:311 
> > (discriminator 1) kernel/sched/core.c:2824 (discriminator 1))
> > [  788.568123] schedule_preempt_disabled (kernel/sched/core.c:2856)
> > [  788.568123] mutex_lock_nested (kernel/locking/mutex.c:585 
> > kernel/locking/mutex.c:623)
> > [  788.568123] kernfs_iop_permission (fs/kernfs/inode.c:366)
> > [  788.568123] __inode_permission (fs/namei.c:374 fs/namei.c:408)
> > [  788.568123] inode_permission (fs/namei.c:460)
> > [  788.568123] link_path_walk (fs/namei.c:1520 fs/namei.c:1782)
> > [  788.568123] path_init (fs/namei.c:1947)
> > [  788.568123] path_lookupat (fs/namei.c:1989)
> > [  788.568123] filename_lookup (fs/namei.c:2025)
> > [  788.568123] user_path_at_empty (fs/namei.c:2189)
> > [  788.568123] user_path_at (fs/namei.c:2200)
> > [  788.568123] vfs_fstatat (fs/stat.c:106)
> > [  788.568123] SYSC_newfstatat (fs/stat.c:298)
> > [  788.568123] SyS_newfstatat (fs/stat.c:291)
> > [  788.568123] tracesys_phase2 (arch/x86/kernel/entry_64.S:347)
> >>> OK, I guess it would help to update the WARN_ON()s while I am at it.  :-/
> >>>
> >>> Here is an updated patch that replaces the one resulting in the above
> >>> splat.
> >>
> >> That did the trick for me, but I don't think it made it to -next?
> > 
> > Good to hear!  May I add your Tested-by?
> 
> Sure.
> 
> > And thank you for the reminder on -next, pushing it out now.
> 
> Thanks!

Both done, thank you!

Thanx, Paul



Re: rcu: frequent rcu lockups

2015-03-20 Thread Sasha Levin
On 03/20/2015 10:29 AM, Paul E. McKenney wrote:
> On Fri, Mar 20, 2015 at 10:15:34AM -0400, Sasha Levin wrote:
>> On 03/15/2015 08:02 PM, Paul E. McKenney wrote:
>>> On Sun, Mar 15, 2015 at 07:32:32PM -0400, Sasha Levin wrote:
> On 03/15/2015 04:43 PM, Paul E. McKenney wrote:
>>> But I did find a bug that would result in the other warnings, and could
>>> also result in too-short grace periods, which could in turn result in
>>> arbitrary misbehavior.  The patch below, which is also on
>>> its way into -next, should fix this.  Please let me know how it does
>>> for you.
>
> I've stopped seeing the warnings I've previously reported, but started
> seeing a new one:
>
> [  788.564596] WARNING: CPU: 12 PID: 9711 at kernel/rcu/tree.c:2201 
> rcu_report_qs_rnp+0x42e/0x5a0()
> [  788.568123] Modules linked in:
> [  788.568123] CPU: 12 PID: 9711 Comm: trinity-main Not tainted 
> 4.0.0-rc3-next-20150313-sasha-00041-g83a3dc8-dirty #2078
> [  788.568123]  8803a1ba 400df16a 880442807cc8 
> b1ab01ca
> [  788.568123]    880442807d18 
> a71e261a
> [  788.568123]  dc00 a733d2ee 880442807d28 
> b4724000
> [  788.568123] Call Trace:
> [  788.568123]  dump_stack (lib/dump_stack.c:52)
> [  788.568123] warn_slowpath_common (kernel/panic.c:447)
> [  788.568123] ? rcu_report_qs_rnp (kernel/rcu/tree.c:2201 (discriminator 
> 3))
> [  788.568123] warn_slowpath_null (kernel/panic.c:481)
> [  788.568123] rcu_report_qs_rnp (kernel/rcu/tree.c:2201 (discriminator 
> 3))
> [  788.568123] rcu_process_callbacks (kernel/rcu/tree.c:2302 
> kernel/rcu/tree.c:2338 kernel/rcu/tree.c:2824 kernel/rcu/tree.c:2857)
> [  788.568123] __do_softirq (kernel/softirq.c:273 
> include/linux/jump_label.h:114 include/trace/events/irq.h:126 
> kernel/softirq.c:274)
> [  788.568123] irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
> [  788.568123] smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:918)
> [  788.568123] apic_timer_interrupt (arch/x86/kernel/entry_64.S:920)
> [  788.568123]  ? mark_held_locks (kernel/locking/lockdep.c:2525)
> [  788.568123] finish_task_switch (kernel/sched/core.c:2231)
> [  788.568123] __schedule (kernel/sched/core.c:2337 
> kernel/sched/core.c:2795)
> [  788.568123] schedule (./arch/x86/include/asm/bitops.h:311 
> (discriminator 1) kernel/sched/core.c:2824 (discriminator 1))
> [  788.568123] schedule_preempt_disabled (kernel/sched/core.c:2856)
> [  788.568123] mutex_lock_nested (kernel/locking/mutex.c:585 
> kernel/locking/mutex.c:623)
> [  788.568123] kernfs_iop_permission (fs/kernfs/inode.c:366)
> [  788.568123] __inode_permission (fs/namei.c:374 fs/namei.c:408)
> [  788.568123] inode_permission (fs/namei.c:460)
> [  788.568123] link_path_walk (fs/namei.c:1520 fs/namei.c:1782)
> [  788.568123] path_init (fs/namei.c:1947)
> [  788.568123] path_lookupat (fs/namei.c:1989)
> [  788.568123] filename_lookup (fs/namei.c:2025)
> [  788.568123] user_path_at_empty (fs/namei.c:2189)
> [  788.568123] user_path_at (fs/namei.c:2200)
> [  788.568123] vfs_fstatat (fs/stat.c:106)
> [  788.568123] SYSC_newfstatat (fs/stat.c:298)
> [  788.568123] SyS_newfstatat (fs/stat.c:291)
> [  788.568123] tracesys_phase2 (arch/x86/kernel/entry_64.S:347)
>>> OK, I guess it would help to update the WARN_ON()s while I am at it.  :-/
>>>
>>> Here is an updated patch that replaces the one resulting in the above
>>> splat.
>>
>> That did the trick for me, but I don't think it made it to -next?
> 
> Good to hear!  May I add your Tested-by?

Sure.

> And thank you for the reminder on -next, pushing it out now.

Thanks!


Thanks,
Sasha


Re: rcu: frequent rcu lockups

2015-03-20 Thread Paul E. McKenney
On Fri, Mar 20, 2015 at 10:15:34AM -0400, Sasha Levin wrote:
> On 03/15/2015 08:02 PM, Paul E. McKenney wrote:
> > On Sun, Mar 15, 2015 at 07:32:32PM -0400, Sasha Levin wrote:
> >> > On 03/15/2015 04:43 PM, Paul E. McKenney wrote:
> >>> > > But I did find a bug that would result in the other warnings, and could
> >>> > > also result in too-short grace periods, which could in turn result in
> >>> > > arbitrary misbehavior.  The patch below, which is also on
> >>> > > its way into -next, should fix this.  Please let me know how it does
> >>> > > for you.
> >> > 
> >> > I've stopped seeing the warnings I've previously reported, but started
> >> > seeing a new one:
> >> > 
> >> > [  788.564596] WARNING: CPU: 12 PID: 9711 at kernel/rcu/tree.c:2201 
> >> > rcu_report_qs_rnp+0x42e/0x5a0()
> >> > [  788.568123] Modules linked in:
> >> > [  788.568123] CPU: 12 PID: 9711 Comm: trinity-main Not tainted 
> >> > 4.0.0-rc3-next-20150313-sasha-00041-g83a3dc8-dirty #2078
> >> > [  788.568123]  8803a1ba 400df16a 880442807cc8 
> >> > b1ab01ca
> >> > [  788.568123]    880442807d18 
> >> > a71e261a
> >> > [  788.568123]  dc00 a733d2ee 880442807d28 
> >> > b4724000
> >> > [  788.568123] Call Trace:
> >> > [  788.568123]  dump_stack (lib/dump_stack.c:52)
> >> > [  788.568123] warn_slowpath_common (kernel/panic.c:447)
> >> > [  788.568123] ? rcu_report_qs_rnp (kernel/rcu/tree.c:2201 
> >> > (discriminator 3))
> >> > [  788.568123] warn_slowpath_null (kernel/panic.c:481)
> >> > [  788.568123] rcu_report_qs_rnp (kernel/rcu/tree.c:2201 (discriminator 
> >> > 3))
> >> > [  788.568123] rcu_process_callbacks (kernel/rcu/tree.c:2302 
> >> > kernel/rcu/tree.c:2338 kernel/rcu/tree.c:2824 kernel/rcu/tree.c:2857)
> >> > [  788.568123] __do_softirq (kernel/softirq.c:273 
> >> > include/linux/jump_label.h:114 include/trace/events/irq.h:126 
> >> > kernel/softirq.c:274)
> >> > [  788.568123] irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
> >> > [  788.568123] smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:918)
> >> > [  788.568123] apic_timer_interrupt (arch/x86/kernel/entry_64.S:920)
> >> > [  788.568123]  ? mark_held_locks (kernel/locking/lockdep.c:2525)
> >> > [  788.568123] finish_task_switch (kernel/sched/core.c:2231)
> >> > [  788.568123] __schedule (kernel/sched/core.c:2337 
> >> > kernel/sched/core.c:2795)
> >> > [  788.568123] schedule (./arch/x86/include/asm/bitops.h:311 
> >> > (discriminator 1) kernel/sched/core.c:2824 (discriminator 1))
> >> > [  788.568123] schedule_preempt_disabled (kernel/sched/core.c:2856)
> >> > [  788.568123] mutex_lock_nested (kernel/locking/mutex.c:585 
> >> > kernel/locking/mutex.c:623)
> >> > [  788.568123] kernfs_iop_permission (fs/kernfs/inode.c:366)
> >> > [  788.568123] __inode_permission (fs/namei.c:374 fs/namei.c:408)
> >> > [  788.568123] inode_permission (fs/namei.c:460)
> >> > [  788.568123] link_path_walk (fs/namei.c:1520 fs/namei.c:1782)
> >> > [  788.568123] path_init (fs/namei.c:1947)
> >> > [  788.568123] path_lookupat (fs/namei.c:1989)
> >> > [  788.568123] filename_lookup (fs/namei.c:2025)
> >> > [  788.568123] user_path_at_empty (fs/namei.c:2189)
> >> > [  788.568123] user_path_at (fs/namei.c:2200)
> >> > [  788.568123] vfs_fstatat (fs/stat.c:106)
> >> > [  788.568123] SYSC_newfstatat (fs/stat.c:298)
> >> > [  788.568123] SyS_newfstatat (fs/stat.c:291)
> >> > [  788.568123] tracesys_phase2 (arch/x86/kernel/entry_64.S:347)
> > OK, I guess it would help to update the WARN_ON()s while I am at it.  :-/
> > 
> > Here is an updated patch that replaces the one resulting in the above
> > splat.
> 
> That did the trick for me, but I don't think it made it to -next?

Good to hear!  May I add your Tested-by?

And thank you for the reminder on -next, pushing it out now.

Thanx, Paul



Re: rcu: frequent rcu lockups

2015-03-20 Thread Sasha Levin
On 03/15/2015 08:02 PM, Paul E. McKenney wrote:
> On Sun, Mar 15, 2015 at 07:32:32PM -0400, Sasha Levin wrote:
>> > On 03/15/2015 04:43 PM, Paul E. McKenney wrote:
>>> > > But I did find a bug that would result in the other warnings, and could
>>> > > also result in too-short grace periods, which could in turn result in
>>> > > arbitrary misbehavior.  The patch below, which is also on
>>> > > its way into -next, should fix this.  Please let me know how it does
>>> > > for you.
>> > 
>> > I've stopped seeing the warnings I've previously reported, but started
>> > seeing a new one:
>> > 
>> > [  788.564596] WARNING: CPU: 12 PID: 9711 at kernel/rcu/tree.c:2201 
>> > rcu_report_qs_rnp+0x42e/0x5a0()
>> > [  788.568123] Modules linked in:
>> > [  788.568123] CPU: 12 PID: 9711 Comm: trinity-main Not tainted 
>> > 4.0.0-rc3-next-20150313-sasha-00041-g83a3dc8-dirty #2078
>> > [  788.568123]  8803a1ba 400df16a 880442807cc8 
>> > b1ab01ca
>> > [  788.568123]    880442807d18 
>> > a71e261a
>> > [  788.568123]  dc00 a733d2ee 880442807d28 
>> > b4724000
>> > [  788.568123] Call Trace:
>> > [  788.568123]  dump_stack (lib/dump_stack.c:52)
>> > [  788.568123] warn_slowpath_common (kernel/panic.c:447)
>> > [  788.568123] ? rcu_report_qs_rnp (kernel/rcu/tree.c:2201 (discriminator 
>> > 3))
>> > [  788.568123] warn_slowpath_null (kernel/panic.c:481)
>> > [  788.568123] rcu_report_qs_rnp (kernel/rcu/tree.c:2201 (discriminator 3))
>> > [  788.568123] rcu_process_callbacks (kernel/rcu/tree.c:2302 
>> > kernel/rcu/tree.c:2338 kernel/rcu/tree.c:2824 kernel/rcu/tree.c:2857)
>> > [  788.568123] __do_softirq (kernel/softirq.c:273 
>> > include/linux/jump_label.h:114 include/trace/events/irq.h:126 
>> > kernel/softirq.c:274)
>> > [  788.568123] irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
>> > [  788.568123] smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:918)
>> > [  788.568123] apic_timer_interrupt (arch/x86/kernel/entry_64.S:920)
>> > [  788.568123]  ? mark_held_locks (kernel/locking/lockdep.c:2525)
>> > [  788.568123] finish_task_switch (kernel/sched/core.c:2231)
>> > [  788.568123] __schedule (kernel/sched/core.c:2337 
>> > kernel/sched/core.c:2795)
>> > [  788.568123] schedule (./arch/x86/include/asm/bitops.h:311 
>> > (discriminator 1) kernel/sched/core.c:2824 (discriminator 1))
>> > [  788.568123] schedule_preempt_disabled (kernel/sched/core.c:2856)
>> > [  788.568123] mutex_lock_nested (kernel/locking/mutex.c:585 
>> > kernel/locking/mutex.c:623)
>> > [  788.568123] kernfs_iop_permission (fs/kernfs/inode.c:366)
>> > [  788.568123] __inode_permission (fs/namei.c:374 fs/namei.c:408)
>> > [  788.568123] inode_permission (fs/namei.c:460)
>> > [  788.568123] link_path_walk (fs/namei.c:1520 fs/namei.c:1782)
>> > [  788.568123] path_init (fs/namei.c:1947)
>> > [  788.568123] path_lookupat (fs/namei.c:1989)
>> > [  788.568123] filename_lookup (fs/namei.c:2025)
>> > [  788.568123] user_path_at_empty (fs/namei.c:2189)
>> > [  788.568123] user_path_at (fs/namei.c:2200)
>> > [  788.568123] vfs_fstatat (fs/stat.c:106)
>> > [  788.568123] SYSC_newfstatat (fs/stat.c:298)
>> > [  788.568123] SyS_newfstatat (fs/stat.c:291)
>> > [  788.568123] tracesys_phase2 (arch/x86/kernel/entry_64.S:347)
> OK, I guess it would help to update the WARN_ON()s while I am at it.  :-/
> 
> Here is an updated patch that replaces the one resulting in the above
> splat.

That did the trick for me, but I don't think it made it to -next?


Thanks,
Sasha


Re: rcu: frequent rcu lockups

2015-03-15 Thread Paul E. McKenney
On Sun, Mar 15, 2015 at 07:32:32PM -0400, Sasha Levin wrote:
> On 03/15/2015 04:43 PM, Paul E. McKenney wrote:
> > But I did find a bug that would result in the other warnings, and could
> > also result in too-short grace periods, which could in turn result in
> > arbitrary misbehavior.  The patch below, which is also on
> > its way into -next, should fix this.  Please let me know how it does
> > for you.
> 
> I've stopped seeing the warnings I've previously reported, but started
> seeing a new one:
> 
> [  788.564596] WARNING: CPU: 12 PID: 9711 at kernel/rcu/tree.c:2201 
> rcu_report_qs_rnp+0x42e/0x5a0()
> [  788.568123] Modules linked in:
> [  788.568123] CPU: 12 PID: 9711 Comm: trinity-main Not tainted 
> 4.0.0-rc3-next-20150313-sasha-00041-g83a3dc8-dirty #2078
> [  788.568123]  8803a1ba 400df16a 880442807cc8 
> b1ab01ca
> [  788.568123]    880442807d18 
> a71e261a
> [  788.568123]  dc00 a733d2ee 880442807d28 
> b4724000
> [  788.568123] Call Trace:
> [  788.568123]  dump_stack (lib/dump_stack.c:52)
> [  788.568123] warn_slowpath_common (kernel/panic.c:447)
> [  788.568123] ? rcu_report_qs_rnp (kernel/rcu/tree.c:2201 (discriminator 3))
> [  788.568123] warn_slowpath_null (kernel/panic.c:481)
> [  788.568123] rcu_report_qs_rnp (kernel/rcu/tree.c:2201 (discriminator 3))
> [  788.568123] rcu_process_callbacks (kernel/rcu/tree.c:2302 
> kernel/rcu/tree.c:2338 kernel/rcu/tree.c:2824 kernel/rcu/tree.c:2857)
> [  788.568123] __do_softirq (kernel/softirq.c:273 
> include/linux/jump_label.h:114 include/trace/events/irq.h:126 
> kernel/softirq.c:274)
> [  788.568123] irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
> [  788.568123] smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:918)
> [  788.568123] apic_timer_interrupt (arch/x86/kernel/entry_64.S:920)
> [  788.568123]  ? mark_held_locks (kernel/locking/lockdep.c:2525)
> [  788.568123] finish_task_switch (kernel/sched/core.c:2231)
> [  788.568123] __schedule (kernel/sched/core.c:2337 kernel/sched/core.c:2795)
> [  788.568123] schedule (./arch/x86/include/asm/bitops.h:311 (discriminator 
> 1) kernel/sched/core.c:2824 (discriminator 1))
> [  788.568123] schedule_preempt_disabled (kernel/sched/core.c:2856)
> [  788.568123] mutex_lock_nested (kernel/locking/mutex.c:585 
> kernel/locking/mutex.c:623)
> [  788.568123] kernfs_iop_permission (fs/kernfs/inode.c:366)
> [  788.568123] __inode_permission (fs/namei.c:374 fs/namei.c:408)
> [  788.568123] inode_permission (fs/namei.c:460)
> [  788.568123] link_path_walk (fs/namei.c:1520 fs/namei.c:1782)
> [  788.568123] path_init (fs/namei.c:1947)
> [  788.568123] path_lookupat (fs/namei.c:1989)
> [  788.568123] filename_lookup (fs/namei.c:2025)
> [  788.568123] user_path_at_empty (fs/namei.c:2189)
> [  788.568123] user_path_at (fs/namei.c:2200)
> [  788.568123] vfs_fstatat (fs/stat.c:106)
> [  788.568123] SYSC_newfstatat (fs/stat.c:298)
> [  788.568123] SyS_newfstatat (fs/stat.c:291)
> [  788.568123] tracesys_phase2 (arch/x86/kernel/entry_64.S:347)

OK, I guess it would help to update the WARN_ON()s while I am at it.  :-/

Here is an updated patch that replaces the one resulting in the above
splat.

Thanx, Paul



rcu: Associate quiescent-state reports with grace period

As noted in earlier commit logs, CPU hotplug operations running
concurrently with grace-period initialization can result in a given
leaf rcu_node structure having all CPUs offline and no blocked readers,
but with this rcu_node structure nevertheless blocking the current
grace period.  Therefore, the quiescent-state forcing code now checks
for this situation and repairs it.

Unfortunately, this checking can result in false positives, for example,
when the last task has just removed itself from this leaf rcu_node
structure, but has not yet started clearing the ->qsmask bits further
up the structure.  This means that the grace-period kthread (which
forces quiescent states) and some other task might be attempting to
concurrently clear these ->qsmask bits.  This is usually not a problem:
One of these tasks will be the first to acquire the upper-level rcu_node
structure's lock and will therefore clear the bit, and the other task,
seeing the bit already cleared, will stop trying to clear bits.

Sadly, this means that the following unusual sequence of events -can-
result in a problem:

1.  The grace-period kthread wins, and clears the ->qsmask bits.

2.  This is the last thing blocking the current grace period, so
that the grace-period kthread clears ->qsmask bits all the way
to the root and finds that the root ->qsmask field is now zero.

3.  Another grace period is required, so that the grace period kthread
initializes it, including setting all the needed qsmask bits.

Re: rcu: frequent rcu lockups

2015-03-15 Thread Sasha Levin
On 03/15/2015 04:43 PM, Paul E. McKenney wrote:
> But I did find a bug that would result in the other warnings, and could
> also result in too-short grace periods, which could in turn result in
> arbitrary misbehavior.  The patch below, which is also on
> its way into -next, should fix this.  Please let me know how it does
> for you.

I've stopped seeing the warnings I've previously reported, but started
seeing a new one:

[  788.564596] WARNING: CPU: 12 PID: 9711 at kernel/rcu/tree.c:2201 
rcu_report_qs_rnp+0x42e/0x5a0()
[  788.568123] Modules linked in:
[  788.568123] CPU: 12 PID: 9711 Comm: trinity-main Not tainted 
4.0.0-rc3-next-20150313-sasha-00041-g83a3dc8-dirty #2078
[  788.568123]  8803a1ba 400df16a 880442807cc8 
b1ab01ca
[  788.568123]    880442807d18 
a71e261a
[  788.568123]  dc00 a733d2ee 880442807d28 
b4724000
[  788.568123] Call Trace:
[  788.568123]  dump_stack (lib/dump_stack.c:52)
[  788.568123] warn_slowpath_common (kernel/panic.c:447)
[  788.568123] ? rcu_report_qs_rnp (kernel/rcu/tree.c:2201 (discriminator 3))
[  788.568123] warn_slowpath_null (kernel/panic.c:481)
[  788.568123] rcu_report_qs_rnp (kernel/rcu/tree.c:2201 (discriminator 3))
[  788.568123] rcu_process_callbacks (kernel/rcu/tree.c:2302 
kernel/rcu/tree.c:2338 kernel/rcu/tree.c:2824 kernel/rcu/tree.c:2857)
[  788.568123] __do_softirq (kernel/softirq.c:273 
include/linux/jump_label.h:114 include/trace/events/irq.h:126 
kernel/softirq.c:274)
[  788.568123] irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
[  788.568123] smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:918)
[  788.568123] apic_timer_interrupt (arch/x86/kernel/entry_64.S:920)
[  788.568123]  ? mark_held_locks (kernel/locking/lockdep.c:2525)
[  788.568123] finish_task_switch (kernel/sched/core.c:2231)
[  788.568123] __schedule (kernel/sched/core.c:2337 kernel/sched/core.c:2795)
[  788.568123] schedule (./arch/x86/include/asm/bitops.h:311 (discriminator 1) 
kernel/sched/core.c:2824 (discriminator 1))
[  788.568123] schedule_preempt_disabled (kernel/sched/core.c:2856)
[  788.568123] mutex_lock_nested (kernel/locking/mutex.c:585 
kernel/locking/mutex.c:623)
[  788.568123] kernfs_iop_permission (fs/kernfs/inode.c:366)
[  788.568123] __inode_permission (fs/namei.c:374 fs/namei.c:408)
[  788.568123] inode_permission (fs/namei.c:460)
[  788.568123] link_path_walk (fs/namei.c:1520 fs/namei.c:1782)
[  788.568123] path_init (fs/namei.c:1947)
[  788.568123] path_lookupat (fs/namei.c:1989)
[  788.568123] filename_lookup (fs/namei.c:2025)
[  788.568123] user_path_at_empty (fs/namei.c:2189)
[  788.568123] user_path_at (fs/namei.c:2200)
[  788.568123] vfs_fstatat (fs/stat.c:106)
[  788.568123] SYSC_newfstatat (fs/stat.c:298)
[  788.568123] SyS_newfstatat (fs/stat.c:291)
[  788.568123] tracesys_phase2 (arch/x86/kernel/entry_64.S:347)


Thanks,
Sasha


Re: rcu: frequent rcu lockups

2015-03-15 Thread Paul E. McKenney
On Fri, Mar 13, 2015 at 10:39:43AM -0700, Paul E. McKenney wrote:
> On Thu, Mar 12, 2015 at 07:07:49AM -0700, Paul E. McKenney wrote:
> > On Thu, Mar 12, 2015 at 08:32:05AM -0400, Sasha Levin wrote:
> > > On 03/12/2015 08:28 AM, Sasha Levin wrote:
> > > > On 03/11/2015 07:16 PM, Paul E. McKenney wrote:
> > > > > On Wed, Mar 11, 2015 at 07:06:40PM -0400, Sasha Levin wrote:
> > > > > > On 03/11/2015 07:01 PM, Paul E. McKenney wrote:
> > > > > > > > With the commit I didn't hit it yet, but I do see 4 different WARNings:
> > > > > > > I wish that I could say that I am surprised, but the sad fact is that
> > > > > > > I am still shaking the bugs out.
> > > > > > 
> > > > > > I have one more to add:
> > > > > > 
> > > > > > [   93.330539] WARNING: CPU: 1 PID: 8 at kernel/rcu/tree_plugin.h:476 rcu_gp_kthread+0x1eaa/0x4dd0()
> > > > > A bit different, but still in the class of a combining-tree bitmask handling bug.
> > > > I left it overnight, and am still seeing hangs. Although (and don't hold me to that) it seems to be significantly fewer of them.
> > > 
> > > In one of the cases, KASan ended up complaining about something odd going 
> > > on in
> > > rcu_check_callbacks():
> > 
> > Hmmm...  Looks like I have a synchronization bug between RCU CPU stall
> > warnings and task exit or some such.  Promising clue, thank you!
> 
> But I am not seeing it.  There should be no synchronization issues
> with sched_show_task(current), as the current task cannot exit while
> it is calling sched_show_task(current).  There is a remote-task
> sched_show_task(t), but it is invoked only on tasks that are on RCU's
> list of tasks that have blocked within an RCU read-side critical section
> during the current grace period.  Such a task might exit while in this
> critical section, but in that case exit_rcu() will take care of this, and
> it will acquire the same lock held across the call to sched_show_task(t).
> 
> I suppose this could happen if someone re-enabled preemption late
> in the exit() code path and then entered an RCU read-side critical
> section, but that would be a bad breakage of the exit() path.
> 
> Am I missing something here?

But I did find a bug that would result in the other warnings, and could
also result in too-short grace periods, which could in turn result in
arbitrary misbehavior.  The patch below, which is also on
its way into -next, should fix this.  Please let me know how it does
for you.

Thanx, Paul



rcu: Associate quiescent-state reports with grace period

As noted in earlier commit logs, CPU hotplug operations running
concurrently with grace-period initialization can result in a given
leaf rcu_node structure having all CPUs offline and no blocked readers,
but with this rcu_node structure nevertheless blocking the current
grace period.  Therefore, the quiescent-state forcing code now checks
for this situation and repairs it.

Unfortunately, this checking can result in false positives, for example,
when the last task has just removed itself from this leaf rcu_node
structure, but has not yet started clearing the ->qsmask bits further
up the structure.  This means that the grace-period kthread (which
forces quiescent states) and some other task might be attempting to
concurrently clear these ->qsmask bits.  This is usually not a problem:
One of these tasks will be the first to acquire the upper-level rcu_node
structure's lock and will therefore clear the bit, and the other task,
seeing the bit already cleared, will stop trying to clear bits.

Sadly, this means that the following unusual sequence of events -can-
result in a problem:

1.  The grace-period kthread wins, and clears the ->qsmask bits.

2.  This is the last thing blocking the current grace period, so
that the grace-period kthread clears ->qsmask bits all the way
to the root and finds that the root ->qsmask field is now zero.

3.  Another grace period is required, so that the grace period kthread
initializes it, including setting all the needed qsmask bits.

4.  The leaf rcu_node structure (the one that started this whole
mess) is blocking this new grace period, either because it
has at least one online CPU or because there is at least one
task that had blocked within an RCU read-side critical section
while running on one of this leaf rcu_node structure's CPUs.
(And yes, that CPU might well have gone offline before the
grace period in step (3) above started, which can mean that
there is a task on the leaf rcu_node structure's ->blkd_tasks
list, but ->qsmask equal to zero.)
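
To make the race and the fix concrete, here is a minimal standalone C
sketch (hypothetical names, not the actual kernel patch): each
quiescent-state report carries a snapshot of the grace-period number it
was collected under, and a report left over from an earlier grace
period is discarded rather than applied to the new one.

struct rnp_sketch {
        unsigned long gpnum;    /* number of this node's current grace period */
        unsigned long qsmask;   /* CPUs/groups still needing a quiescent state */
};

/*
 * Apply a quiescent-state report only if it was collected during the
 * grace period that is still in progress; @gps is the grace-period
 * number snapshotted when the report was generated.  Locking is
 * omitted from this sketch.
 */
static void report_qs_sketch(struct rnp_sketch *rnp,
                             unsigned long mask, unsigned long gps)
{
        if (rnp->gpnum != gps)
                return;                 /* stale report from an earlier GP */
        rnp->qsmask &= ~mask;           /* report belongs to the current GP */
}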

Re: rcu: frequent rcu lockups

2015-03-13 Thread Paul E. McKenney
On Thu, Mar 12, 2015 at 07:07:49AM -0700, Paul E. McKenney wrote:
> On Thu, Mar 12, 2015 at 08:32:05AM -0400, Sasha Levin wrote:
> > On 03/12/2015 08:28 AM, Sasha Levin wrote:
> > > On 03/11/2015 07:16 PM, Paul E. McKenney wrote:
> > > > On Wed, Mar 11, 2015 at 07:06:40PM -0400, Sasha Levin wrote:
> > > > > On 03/11/2015 07:01 PM, Paul E. McKenney wrote:
> > > > > > > With the commit I didn't hit it yet, but I do see 4 different WARNings:
> > > > > > I wish that I could say that I am surprised, but the sad fact is that
> > > > > > I am still shaking the bugs out.
> > > > > 
> > > > > I have one more to add:
> > > > > 
> > > > > [   93.330539] WARNING: CPU: 1 PID: 8 at kernel/rcu/tree_plugin.h:476 rcu_gp_kthread+0x1eaa/0x4dd0()
> > > > A bit different, but still in the class of a combining-tree bitmask handling bug.
> > > I left it overnight, and am still seeing hangs. Although (and don't hold me to that) it seems to be significantly fewer of them.
> > 
> > In one of the cases, KASan ended up complaining about something odd going 
> > on in
> > rcu_check_callbacks():
> 
> Hmmm...  Looks like I have a synchronization bug between RCU CPU stall
> warnings and task exit or some such.  Promising clue, thank you!

But I am not seeing it.  There should be no synchronization issues
with sched_show_task(current), as the current task cannot exit while
it is calling sched_show_task(current).  There is a remote-task
sched_show_task(t), but it is invoked only on tasks that are on RCU's
list of tasks that have blocked within an RCU read-side critical section
during the current grace period.  Such a task might exit while in this
critical section, but in that case exit_rcu() will take care of this, and
it will acquire the same lock held across the call to sched_show_task(t).
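
For reference, the lock discipline described above looks roughly like
the following kernel-style sketch (simplified and hypothetical, not the
actual tree_plugin.h code): the dump runs with the leaf rcu_node
structure's ->lock held, and an exiting task must take that same lock
to remove itself from the ->blkd_tasks list, so a task being shown
cannot vanish out from under sched_show_task().

static void show_blocked_tasks_sketch(struct rcu_node *rnp)
{
        struct task_struct *t;
        unsigned long flags;

        raw_spin_lock_irqsave(&rnp->lock, flags);
        list_for_each_entry(t, &rnp->blkd_tasks, rcu_node_entry)
                sched_show_task(t);     /* t pinned: exit_rcu() needs rnp->lock */
        raw_spin_unlock_irqrestore(&rnp->lock, flags);
}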

I suppose this could happen if someone re-enabled preemption late
in the exit() code path and then entered an RCU read-side critical
section, but that would be a bad breakage of the exit() path.

Am I missing something here?

Thanx, Paul

> > [ 7170.779042] BUG: KASan: out of bounds on stack in 
> > sched_show_task+0x324/0x3d0 at addr 88075315fdd8
> > [ 7170.779042] Read of size 8 by task trinity-c0/15862
> > [ 7170.779042] page:ea001d4c57c0 count:0 mapcount:0 mapping:  
> > (null) index:0x0
> > [ 7170.779042] flags: 0x5af8000()
> > [ 7170.779042] page dumped because: kasan: bad access detected
> > [ 7170.779042] CPU: 0 PID: 15862 Comm: trinity-c0 Tainted: GW   
> > 4.0.0-rc3-next-20150311-sasha-00035-g0a45b48-dirty #2052
> > [ 7170.779042]  88075315fdd8 ac401e1d 880050607cd8 
> > 9da97e89
> > [ 7170.779042]  1d4003a98aff 880050607d68 880050607d58 
> > 9360eda4
> > [ 7170.779042]  880050607d98 9da8c9df 0082 
> > a0260429
> > [ 7170.779042] Call Trace:
> > [ 7170.779042]  dump_stack (lib/dump_stack.c:52)
> > [ 7170.779042] kasan_report_error (mm/kasan/report.c:132 
> > mm/kasan/report.c:193)
> > [ 7170.779042] ? printk (kernel/printk/printk.c:1854)
> > [ 7170.779042] ? bitmap_weight (include/linux/bitmap.h:303)
> > [ 7170.779042] __asan_report_load8_noabort (mm/kasan/report.c:230 
> > mm/kasan/report.c:251)
> > [ 7170.779042] ? sched_show_task (kernel/sched/core.c:4526)
> > [ 7170.779042] sched_show_task (kernel/sched/core.c:4526)
> > [ 7170.779042] rcu_check_callbacks (kernel/rcu/tree.c:1225 
> > kernel/rcu/tree.c:1331 kernel/rcu/tree.c:3400 kernel/rcu/tree.c:3464 
> > kernel/rcu/tree.c:2682)
> > [ 7170.779042] update_process_times (./arch/x86/include/asm/preempt.h:22 
> > kernel/time/timer.c:1386)
> > [ 7170.779042] tick_periodic (kernel/time/tick-common.c:92)
> > [ 7170.779042] ? tick_handle_periodic (kernel/time/tick-common.c:105)
> > [ 7170.779042] tick_handle_periodic (kernel/time/tick-common.c:105)
> > [ 7170.779042] local_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:891)
> > [ 7170.779042] ? irq_enter (kernel/softirq.c:338)
> > [ 7170.779042] smp_apic_timer_interrupt (./arch/x86/include/asm/apic.h:650 
> > arch/x86/kernel/apic/apic.c:915)
> > [ 7170.779042] apic_timer_interrupt (arch/x86/kernel/entry_64.S:920)
> > [ 7170.779042]  ? do_raw_read_lock (include/asm-generic/qrwlock.h:104 
> > kernel/locking/spinlock_debug.c:204)
> > [ 7170.779042] _raw_read_lock (kernel/locking/spinlock.c:224)
> > [ 7170.779042] ? do_wait (kernel/exit.c:1498 (discriminator 1))
> > [ 7170.779042] do_wait (kernel/exit.c:1498 (discriminator 1))
> > [ 7170.779042] ? wait_consider_task (kernel/exit.c:1465)
> > [ 7170.779042] ? find_get_pid (kernel/pid.c:490)
> > [ 7170.779042] SyS_wait4 (kernel/exit.c:1618 kernel/exit.c:1586)
> > [ 7170.779042] ? SyS_waitid (kernel/exit.c:1586)
> > [ 7170.779042] ? kill_orphaned_pgrp (kernel/exit.c:1444)
> > [ 7170.779042] ? syscall_trace

Re: rcu: frequent rcu lockups

2015-03-12 Thread Paul E. McKenney
On Thu, Mar 12, 2015 at 08:32:05AM -0400, Sasha Levin wrote:
> On 03/12/2015 08:28 AM, Sasha Levin wrote:
> > On 03/11/2015 07:16 PM, Paul E. McKenney wrote:
> > > On Wed, Mar 11, 2015 at 07:06:40PM -0400, Sasha Levin wrote:
> > > > On 03/11/2015 07:01 PM, Paul E. McKenney wrote:
> > > > > > With the commit I didn't hit it yet, but I do see 4 different WARNings:
> > > > > I wish that I could say that I am surprised, but the sad fact is that
> > > > > I am still shaking the bugs out.
> > > > 
> > > > I have one more to add:
> > > > 
> > > > [   93.330539] WARNING: CPU: 1 PID: 8 at kernel/rcu/tree_plugin.h:476 rcu_gp_kthread+0x1eaa/0x4dd0()
> > > A bit different, but still in the class of a combining-tree bitmask handling bug.
> > I left it overnight, and am still seeing hangs. Although (and don't hold me to that) it seems to be significantly fewer of them.
> 
> In one of the cases, KASan ended up complaining about something odd going on 
> in
> rcu_check_callbacks():

Hmmm...  Looks like I have a synchronization bug between RCU CPU stall
warnings and task exit or some such.  Promising clue, thank you!

Thanx, Paul

> [ 7170.779042] BUG: KASan: out of bounds on stack in 
> sched_show_task+0x324/0x3d0 at addr 88075315fdd8
> [ 7170.779042] Read of size 8 by task trinity-c0/15862
> [ 7170.779042] page:ea001d4c57c0 count:0 mapcount:0 mapping:  
> (null) index:0x0
> [ 7170.779042] flags: 0x5af8000()
> [ 7170.779042] page dumped because: kasan: bad access detected
> [ 7170.779042] CPU: 0 PID: 15862 Comm: trinity-c0 Tainted: GW   
> 4.0.0-rc3-next-20150311-sasha-00035-g0a45b48-dirty #2052
> [ 7170.779042]  88075315fdd8 ac401e1d 880050607cd8 
> 9da97e89
> [ 7170.779042]  1d4003a98aff 880050607d68 880050607d58 
> 9360eda4
> [ 7170.779042]  880050607d98 9da8c9df 0082 
> a0260429
> [ 7170.779042] Call Trace:
> [ 7170.779042]  dump_stack (lib/dump_stack.c:52)
> [ 7170.779042] kasan_report_error (mm/kasan/report.c:132 
> mm/kasan/report.c:193)
> [ 7170.779042] ? printk (kernel/printk/printk.c:1854)
> [ 7170.779042] ? bitmap_weight (include/linux/bitmap.h:303)
> [ 7170.779042] __asan_report_load8_noabort (mm/kasan/report.c:230 
> mm/kasan/report.c:251)
> [ 7170.779042] ? sched_show_task (kernel/sched/core.c:4526)
> [ 7170.779042] sched_show_task (kernel/sched/core.c:4526)
> [ 7170.779042] rcu_check_callbacks (kernel/rcu/tree.c:1225 
> kernel/rcu/tree.c:1331 kernel/rcu/tree.c:3400 kernel/rcu/tree.c:3464 
> kernel/rcu/tree.c:2682)
> [ 7170.779042] update_process_times (./arch/x86/include/asm/preempt.h:22 
> kernel/time/timer.c:1386)
> [ 7170.779042] tick_periodic (kernel/time/tick-common.c:92)
> [ 7170.779042] ? tick_handle_periodic (kernel/time/tick-common.c:105)
> [ 7170.779042] tick_handle_periodic (kernel/time/tick-common.c:105)
> [ 7170.779042] local_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:891)
> [ 7170.779042] ? irq_enter (kernel/softirq.c:338)
> [ 7170.779042] smp_apic_timer_interrupt (./arch/x86/include/asm/apic.h:650 
> arch/x86/kernel/apic/apic.c:915)
> [ 7170.779042] apic_timer_interrupt (arch/x86/kernel/entry_64.S:920)
> [ 7170.779042]  ? do_raw_read_lock (include/asm-generic/qrwlock.h:104 
> kernel/locking/spinlock_debug.c:204)
> [ 7170.779042] _raw_read_lock (kernel/locking/spinlock.c:224)
> [ 7170.779042] ? do_wait (kernel/exit.c:1498 (discriminator 1))
> [ 7170.779042] do_wait (kernel/exit.c:1498 (discriminator 1))
> [ 7170.779042] ? wait_consider_task (kernel/exit.c:1465)
> [ 7170.779042] ? find_get_pid (kernel/pid.c:490)
> [ 7170.779042] SyS_wait4 (kernel/exit.c:1618 kernel/exit.c:1586)
> [ 7170.779042] ? SyS_waitid (kernel/exit.c:1586)
> [ 7170.779042] ? kill_orphaned_pgrp (kernel/exit.c:1444)
> [ 7170.779042] ? syscall_trace_enter_phase2 (arch/x86/kernel/ptrace.c:1592)
> [ 7170.779042] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
> [ 7170.779042] tracesys_phase2 (arch/x86/kernel/entry_64.S:347)
> [ 7170.779042] Memory state around the buggy address:
> [ 7170.779042]  88075315fc80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00
> [ 7170.779042]  88075315fd00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00
> [ 7170.779042] >88075315fd80: 00 f1 f1 f1 f1 00 f4 f4 f4 f2 f2 f2 f2 00 
> f4 f4
> [ 7170.779042] ^
> [ 7170.779042]  88075315fe00: f4 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 f1
> [ 7170.779042]  88075315fe80: f1 f1 f1 00 00 00 00 00 00 00 00 00 00 00 
> f4 00
> 
> 
> Thanks,
> Sasha
> 


Re: rcu: frequent rcu lockups

2015-03-12 Thread Sasha Levin
On 03/12/2015 08:28 AM, Sasha Levin wrote:
> On 03/11/2015 07:16 PM, Paul E. McKenney wrote:
> > On Wed, Mar 11, 2015 at 07:06:40PM -0400, Sasha Levin wrote:
> > > On 03/11/2015 07:01 PM, Paul E. McKenney wrote:
> > > > > With the commit I didn't hit it yet, but I do see 4 different WARNings:
> > > > I wish that I could say that I am surprised, but the sad fact is that
> > > > I am still shaking the bugs out.
> > > 
> > > I have one more to add:
> > > 
> > > [   93.330539] WARNING: CPU: 1 PID: 8 at kernel/rcu/tree_plugin.h:476 rcu_gp_kthread+0x1eaa/0x4dd0()
> > A bit different, but still in the class of a combining-tree bitmask handling bug.
> I left it overnight, and am still seeing hangs. Although (and don't hold me to that) it seems to be significantly fewer of them.

In one of the cases, KASan ended up complaining about something odd going on in
rcu_check_callbacks():

[ 7170.779042] BUG: KASan: out of bounds on stack in 
sched_show_task+0x324/0x3d0 at addr 88075315fdd8
[ 7170.779042] Read of size 8 by task trinity-c0/15862
[ 7170.779042] page:ea001d4c57c0 count:0 mapcount:0 mapping:  
(null) index:0x0
[ 7170.779042] flags: 0x5af8000()
[ 7170.779042] page dumped because: kasan: bad access detected
[ 7170.779042] CPU: 0 PID: 15862 Comm: trinity-c0 Tainted: GW   
4.0.0-rc3-next-20150311-sasha-00035-g0a45b48-dirty #2052
[ 7170.779042]  88075315fdd8 ac401e1d 880050607cd8 
9da97e89
[ 7170.779042]  1d4003a98aff 880050607d68 880050607d58 
9360eda4
[ 7170.779042]  880050607d98 9da8c9df 0082 
a0260429
[ 7170.779042] Call Trace:
[ 7170.779042]  dump_stack (lib/dump_stack.c:52)
[ 7170.779042] kasan_report_error (mm/kasan/report.c:132 mm/kasan/report.c:193)
[ 7170.779042] ? printk (kernel/printk/printk.c:1854)
[ 7170.779042] ? bitmap_weight (include/linux/bitmap.h:303)
[ 7170.779042] __asan_report_load8_noabort (mm/kasan/report.c:230 
mm/kasan/report.c:251)
[ 7170.779042] ? sched_show_task (kernel/sched/core.c:4526)
[ 7170.779042] sched_show_task (kernel/sched/core.c:4526)
[ 7170.779042] rcu_check_callbacks (kernel/rcu/tree.c:1225 
kernel/rcu/tree.c:1331 kernel/rcu/tree.c:3400 kernel/rcu/tree.c:3464 
kernel/rcu/tree.c:2682)
[ 7170.779042] update_process_times (./arch/x86/include/asm/preempt.h:22 
kernel/time/timer.c:1386)
[ 7170.779042] tick_periodic (kernel/time/tick-common.c:92)
[ 7170.779042] ? tick_handle_periodic (kernel/time/tick-common.c:105)
[ 7170.779042] tick_handle_periodic (kernel/time/tick-common.c:105)
[ 7170.779042] local_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:891)
[ 7170.779042] ? irq_enter (kernel/softirq.c:338)
[ 7170.779042] smp_apic_timer_interrupt (./arch/x86/include/asm/apic.h:650 
arch/x86/kernel/apic/apic.c:915)
[ 7170.779042] apic_timer_interrupt (arch/x86/kernel/entry_64.S:920)
[ 7170.779042]  ? do_raw_read_lock (include/asm-generic/qrwlock.h:104 
kernel/locking/spinlock_debug.c:204)
[ 7170.779042] _raw_read_lock (kernel/locking/spinlock.c:224)
[ 7170.779042] ? do_wait (kernel/exit.c:1498 (discriminator 1))
[ 7170.779042] do_wait (kernel/exit.c:1498 (discriminator 1))
[ 7170.779042] ? wait_consider_task (kernel/exit.c:1465)
[ 7170.779042] ? find_get_pid (kernel/pid.c:490)
[ 7170.779042] SyS_wait4 (kernel/exit.c:1618 kernel/exit.c:1586)
[ 7170.779042] ? SyS_waitid (kernel/exit.c:1586)
[ 7170.779042] ? kill_orphaned_pgrp (kernel/exit.c:1444)
[ 7170.779042] ? syscall_trace_enter_phase2 (arch/x86/kernel/ptrace.c:1592)
[ 7170.779042] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
[ 7170.779042] tracesys_phase2 (arch/x86/kernel/entry_64.S:347)
[ 7170.779042] Memory state around the buggy address:
[ 7170.779042]  88075315fc80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00
[ 7170.779042]  88075315fd00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00
[ 7170.779042] >88075315fd80: 00 f1 f1 f1 f1 00 f4 f4 f4 f2 f2 f2 f2 00 f4 
f4
[ 7170.779042] ^
[ 7170.779042]  88075315fe00: f4 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
f1
[ 7170.779042]  88075315fe80: f1 f1 f1 00 00 00 00 00 00 00 00 00 00 00 f4 
00


Thanks,
Sasha


Re: rcu: frequent rcu lockups

2015-03-12 Thread Sasha Levin
On 03/11/2015 07:16 PM, Paul E. McKenney wrote:
> On Wed, Mar 11, 2015 at 07:06:40PM -0400, Sasha Levin wrote:
> > On 03/11/2015 07:01 PM, Paul E. McKenney wrote:
> > > > With the commit I didn't hit it yet, but I do see 4 different WARNings:
> > > I wish that I could say that I am surprised, but the sad fact is that
> > > I am still shaking the bugs out.
> > 
> > I have one more to add:
> > 
> > [   93.330539] WARNING: CPU: 1 PID: 8 at kernel/rcu/tree_plugin.h:476 rcu_gp_kthread+0x1eaa/0x4dd0()
> A bit different, but still in the class of a combining-tree bitmask handling bug.

I left it overnight, and am still seeing hangs. Although (and don't hold me to that) it seems to be significantly fewer of them.

[ 4423.001809] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 4423.001809]  Tasks blocked on level-1 rcu_node (CPUs 16-31):
[ 4423.001809]  (detected by 0, t=30502 jiffies, g=60989, c=60988, q=18648)
[ 4423.001809] All QSes seen, last rcu_preempt kthread activity 1 
(4295375352-4295375351), jiffies_till_next_fqs=1, root ->qsmask 0x2
[ 4423.001809] trinity-c0  R  running task27480 15862   9833 0x1008
[ 4423.001809]  2669 ac401e1d 880050607de8 
9327679b
[ 4423.001809]  880050607db8 a0b36000 0001 
0001000639f7
[ 4423.001809]  a0b351c8 dc00 880050622000 
a0721000
[ 4423.001809] Call Trace:
[ 4423.001809]  sched_show_task (kernel/sched/core.c:4542)
[ 4423.001809] rcu_check_callbacks (kernel/rcu/tree.c:1225 
kernel/rcu/tree.c:1331 kernel/rcu/tree.c:3400 kernel/rcu/tree.c:3464 
kernel/rcu/tree.c:2682)
[ 4423.001809] ? acct_account_cputime (kernel/tsacct.c:168)
[ 4423.001809] update_process_times (./arch/x86/include/asm/preempt.h:22 
kernel/time/timer.c:1386)
[ 4423.001809] tick_periodic (kernel/time/tick-common.c:92)
[ 4423.001809] ? tick_handle_periodic (kernel/time/tick-common.c:105)
[ 4423.001809] tick_handle_periodic (kernel/time/tick-common.c:105)
[ 4423.001809] local_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:891)
[ 4423.001809] ? irq_enter (kernel/softirq.c:338)
[ 4423.001809] smp_apic_timer_interrupt (./arch/x86/include/asm/apic.h:650 
arch/x86/kernel/apic/apic.c:915)
[ 4423.001809] apic_timer_interrupt (arch/x86/kernel/entry_64.S:920)
[ 4423.001809]  ? remove_wait_queue (include/linux/wait.h:145 
kernel/sched/wait.c:50)
[ 4423.001809] ? _raw_spin_unlock_irqrestore 
(./arch/x86/include/asm/paravirt.h:809 include/linux/spinlock_api_smp.h:162 
kernel/locking/spinlock.c:191)
[ 4423.001809] remove_wait_queue (kernel/sched/wait.c:52)
[ 4423.001809] do_wait (kernel/exit.c:1465 (discriminator 1))
[ 4423.001809] ? wait_consider_task (kernel/exit.c:1465)
[ 4423.001809] ? find_get_pid (kernel/pid.c:490)
[ 4423.001809] SyS_wait4 (kernel/exit.c:1618 kernel/exit.c:1586)
[ 4423.001809] ? SyS_waitid (kernel/exit.c:1586)
[ 4423.001809] ? kill_orphaned_pgrp (kernel/exit.c:1444)
[ 4423.001809] ? syscall_trace_enter_phase2 (arch/x86/kernel/ptrace.c:1592)
[ 4423.001809] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
[ 4423.001809] tracesys_phase2 (arch/x86/kernel/entry_64.S:347)


Thanks,
Sasha


Re: rcu: frequent rcu lockups

2015-03-11 Thread Paul E. McKenney
On Wed, Mar 11, 2015 at 07:06:40PM -0400, Sasha Levin wrote:
> On 03/11/2015 07:01 PM, Paul E. McKenney wrote:
> >> With the commit I didn't hit it yet, but I do see 4 different WARNings:
> > I wish that I could say that I am surprised, but the sad fact is that
> > I am still shaking the bugs out.  
> 
> I have one more to add:
> 
> [   93.330539] WARNING: CPU: 1 PID: 8 at kernel/rcu/tree_plugin.h:476 
> rcu_gp_kthread+0x1eaa/0x4dd0()

A bit different, but still in the class of a combining-tree bitmask
handling bug.
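
For those following along, the "combining tree" in question has roughly
the following shape (a minimal standalone C sketch with made-up names,
not the kernel's rcu_report_qs_rnp()): each node clears the reporting
child's bit in its ->qsmask and propagates the report toward the root
only once all of its children have reported.

#include <stddef.h>

struct rnode {
        struct rnode *parent;   /* NULL at the root */
        unsigned long grpmask;  /* this node's bit in parent->qsmask */
        unsigned long qsmask;   /* children still owing a quiescent state */
};

/* Clear @mask at @rnp; walk upward only when a node's ->qsmask
 * reaches zero, i.e., all of its children have reported. */
static void report_qs_up(struct rnode *rnp, unsigned long mask)
{
        for (;;) {
                rnp->qsmask &= ~mask;
                if (rnp->qsmask != 0 || rnp->parent == NULL)
                        break;          /* siblings still pending, or at root */
                mask = rnp->grpmask;    /* propagate this node's bit upward */
                rnp = rnp->parent;
        }
}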

Thanx, Paul

> [   93.333515] Modules linked in:
> [   93.334678] CPU: 1 PID: 8 Comm: rcu_preempt Tainted: GW   
> 4.0.0-rc3-next-20150311-sasha-00035-g0a45b48-dirty #2052
> [   93.338402]  8800496f8000 f3e72ee7 880049ac7b28 
> 9ea97e89
> [   93.339495]    880049ac7b78 
> 941e242a
> [   93.339495]  a1721038 9433bf2a a1721040 
> 8800a2822000
> [   93.339495] Call Trace:
> [   93.339495]  [] dump_stack+0x4f/0x7b
> [   93.339495]  [] warn_slowpath_common+0xca/0x130
> [   93.339495]  [] ? rcu_gp_kthread+0x1eaa/0x4dd0
> [   93.339495]  [] warn_slowpath_null+0x1a/0x20
> [   93.339495]  [] rcu_gp_kthread+0x1eaa/0x4dd0
> [   93.339495]  [] ? plist_check_head+0x3c/0x50
> [   93.339495]  [] ? cond_synchronize_rcu+0x20/0x20
> [   93.339495]  [] ? trace_hardirqs_on_caller+0x475/0x610
> [   93.339495]  [] ? _raw_spin_unlock_irq+0x69/0xa0
> [   93.339495]  [] ? __schedule+0x993/0x1b90
> [   93.339495]  [] ? ___preempt_schedule+0x12/0x14
> [   93.339495]  [] ? abort_exclusive_wait+0x1d0/0x1d0
> [   93.339495]  [] ? cond_synchronize_rcu+0x20/0x20
> [   93.339495]  [] kthread+0x216/0x2e0
> [   93.339495]  [] ? flush_kthread_work+0x300/0x300
> [   93.339495]  [] ? _raw_spin_unlock_irq+0x30/0xa0
> [   93.339495]  [] ? flush_kthread_work+0x300/0x300
> [   93.339495]  [] ret_from_fork+0x53/0x80
> [   93.339495]  [] ? flush_kthread_work+0x300/0x300
> 
> 
> Thanks,
> Sasha
> 



Re: rcu: frequent rcu lockups

2015-03-11 Thread Sasha Levin
On 03/11/2015 07:01 PM, Paul E. McKenney wrote:
>> With the commit I didn't hit it yet, but I do see 4 different WARNings:
> I wish that I could say that I am surprised, but the sad fact is that
> I am still shaking the bugs out.  

I have one more to add:

[   93.330539] WARNING: CPU: 1 PID: 8 at kernel/rcu/tree_plugin.h:476 
rcu_gp_kthread+0x1eaa/0x4dd0()
[   93.333515] Modules linked in:
[   93.334678] CPU: 1 PID: 8 Comm: rcu_preempt Tainted: GW   
4.0.0-rc3-next-20150311-sasha-00035-g0a45b48-dirty #2052
[   93.338402]  8800496f8000 f3e72ee7 880049ac7b28 
9ea97e89
[   93.339495]    880049ac7b78 
941e242a
[   93.339495]  a1721038 9433bf2a a1721040 
8800a2822000
[   93.339495] Call Trace:
[   93.339495]  [] dump_stack+0x4f/0x7b
[   93.339495]  [] warn_slowpath_common+0xca/0x130
[   93.339495]  [] ? rcu_gp_kthread+0x1eaa/0x4dd0
[   93.339495]  [] warn_slowpath_null+0x1a/0x20
[   93.339495]  [] rcu_gp_kthread+0x1eaa/0x4dd0
[   93.339495]  [] ? plist_check_head+0x3c/0x50
[   93.339495]  [] ? cond_synchronize_rcu+0x20/0x20
[   93.339495]  [] ? trace_hardirqs_on_caller+0x475/0x610
[   93.339495]  [] ? _raw_spin_unlock_irq+0x69/0xa0
[   93.339495]  [] ? __schedule+0x993/0x1b90
[   93.339495]  [] ? ___preempt_schedule+0x12/0x14
[   93.339495]  [] ? abort_exclusive_wait+0x1d0/0x1d0
[   93.339495]  [] ? cond_synchronize_rcu+0x20/0x20
[   93.339495]  [] kthread+0x216/0x2e0
[   93.339495]  [] ? flush_kthread_work+0x300/0x300
[   93.339495]  [] ? _raw_spin_unlock_irq+0x30/0xa0
[   93.339495]  [] ? flush_kthread_work+0x300/0x300
[   93.339495]  [] ret_from_fork+0x53/0x80
[   93.339495]  [] ? flush_kthread_work+0x300/0x300


Thanks,
Sasha


Re: rcu: frequent rcu lockups

2015-03-11 Thread Paul E. McKenney
On Wed, Mar 11, 2015 at 06:18:57PM -0400, Sasha Levin wrote:
> On 03/11/2015 04:41 PM, Paul E. McKenney wrote:
> > Aha, I forgot to update rcu/next.  I have now updated it, so it should
> > make it there today or tomorrow.  In the meantime, does the following
> > commit help?
> > 
> > Also, how quickly does your test setup reproduce this?
> 
> I usually hit it within an hour.

I clearly need to work more on rcutorture -- took tens of hours to hit
it there.

> With the commit I didn't hit it yet, but I do see 4 different WARNings:

I wish that I could say that I am surprised, but the sad fact is that
I am still shaking the bugs out.  

> [ 1002.577249] WARNING: CPU: 19 PID: 22311 at kernel/rcu/tree.c:2190 
> rcu_report_qs_rnp+0x41e/0x590()

I haven't seen this particular one, but it is another likely symptom.

> [ 1002.582438] Modules linked in:
> [ 1002.582438] CPU: 19 PID: 22311 Comm: trinity-c58 Not tainted 
> 4.0.0-rc3-next-20150311-sasha-00035-g0a45b48 #2051
> [ 1002.582438]  88047f20 727b89c5 880672807cc8 
> 9ea97ec9
> [ 1002.582438]    880672807d18 
> 941e242a
> [ 1002.582438]  0282 943339be dc00 
> a1721000
> [ 1002.582438] Call Trace:
> [ 1002.582438]  dump_stack (lib/dump_stack.c:52)
> [ 1002.582438] warn_slowpath_common (kernel/panic.c:447)
> [ 1002.582438] ? rcu_report_qs_rnp (kernel/rcu/tree.c:2190 (discriminator 3))
> [ 1002.582438] warn_slowpath_null (kernel/panic.c:481)
> [ 1002.582438] rcu_report_qs_rnp (kernel/rcu/tree.c:2190 (discriminator 3))
> [ 1002.582438] rcu_process_callbacks (kernel/rcu/tree.c:2288 
> kernel/rcu/tree.c:2324 kernel/rcu/tree.c:2816 kernel/rcu/tree.c:2849)
> [ 1002.582438] ? __lock_is_held (kernel/locking/lockdep.c:3518)
> [ 1002.582438] __do_softirq (kernel/softirq.c:273 
> include/linux/jump_label.h:114 include/trace/events/irq.h:126 
> kernel/softirq.c:274)
> [ 1002.582438] irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
> [ 1002.582438] smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:918)
> [ 1002.582438] apic_timer_interrupt (arch/x86/kernel/entry_64.S:920)
> [ 1002.582438]  ? __lock_is_held (kernel/locking/lockdep.c:3518)
> [ 1002.582438] ? lock_acquire (./arch/x86/include/asm/paravirt.h:809 
> kernel/locking/lockdep.c:3605)
> [ 1002.582438] ? rmap_walk (mm/rmap.c:1569 mm/rmap.c:1594)
> [ 1002.582438] down_read (./arch/x86/include/asm/rwsem.h:65 
> kernel/locking/rwsem.c:24)
> [ 1002.582438] ? rmap_walk (mm/rmap.c:1569 mm/rmap.c:1594)
> [ 1002.582438] ? up_read (./arch/x86/include/asm/rwsem.h:156 
> kernel/locking/rwsem.c:81)
> [ 1002.582438] ? page_remove_rmap (mm/rmap.c:1209)
> [ 1002.582438] rmap_walk (mm/rmap.c:1569 mm/rmap.c:1594)
> [ 1002.582438] try_to_munlock (mm/rmap.c:1446)
> [ 1002.582438] ? try_to_unmap (mm/rmap.c:1446)
> [ 1002.582438] ? page_remove_rmap (mm/rmap.c:1209)
> [ 1002.582438] ? invalid_migration_vma (mm/rmap.c:1382)
> [ 1002.582438] ? anon_vma_prepare (mm/rmap.c:497)
> [ 1002.582438] __munlock_isolated_page (mm/mlock.c:132)
> [ 1002.582438] __munlock_pagevec (mm/mlock.c:332)
> [ 1002.582438] ? __munlock_isolate_lru_page (mm/mlock.c:277)
> [ 1002.582438] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:95 
> include/linux/spinlock_api_smp.h:154 kernel/locking/spinlock.c:183)
> [ 1002.582438] ? __get_locked_pte (mm/memory.c:1432)
> [ 1002.582438] munlock_vma_pages_range (mm/mlock.c:477)
> [ 1002.582438] ? munlock_vma_page (mm/mlock.c:424)
> [ 1002.582438] ? __lock_acquire (kernel/locking/lockdep.c:2019 
> kernel/locking/lockdep.c:3184)
> [ 1002.582438] ? kfree (mm/slub.c:1300 mm/slub.c:2707 mm/slub.c:3416)
> [ 1002.582438] ? __lock_acquire (kernel/locking/lockdep.c:2019 
> kernel/locking/lockdep.c:3184)
> [ 1002.582438] ? vma_merge (mm/mmap.c:1085)
> [ 1002.582438] mlock_fixup (mm/mlock.c:549)
> [ 1002.582438] do_mlockall (mm/mlock.c:677)
> [ 1002.582438] ? do_mlock (mm/mlock.c:658)
> [ 1002.582438] sys_munlockall (mm/mlock.c:720)
> [ 1002.582438] ia32_do_call (arch/x86/ia32/ia32entry.S:486)
> 
> [ 1034.418016] WARNING: CPU: 0 PID: 8 at kernel/rcu/tree.c:1936 
> rcu_gp_kthread+0x2a7c/0x4dd0()

I have seen this one, tracking it down.

> [ 1034.419933] Modules linked in:
> [ 1034.420802] CPU: 0 PID: 8 Comm: rcu_preempt Tainted: GW   
> 4.0.0-rc3-next-20150311-sasha-00035-g0a45b48 #2051
> [ 1034.423329]  8800496f8000 5fe03e34 880049ac7b28 
> 9ea97ec9
> [ 1034.426460]    880049ac7b78 
> 941e242a
> [ 1034.426460]  00022000 9433cafc a1725040 
> 0001000110de
> [ 1034.426460] Call Trace:
> [ 1034.426460] dump_stack (lib/dump_stack.c:52)
> [ 1034.426460] warn_slowpath_common (kernel/panic.c:447)
> [ 1034.426460] ? rcu_gp_kthread (kernel/rcu/tree.c:1936 
> kernel/rcu/tree.c:2061)
> [ 1034.426460] warn_slowpath_null (kernel/panic.c:481)
> [ 1034.426460] rcu_gp_kthread (kernel/rcu/tree.c:1936 kernel/rcu/tree.c:2061)
> [ 

Re: rcu: frequent rcu lockups

2015-03-11 Thread Sasha Levin
On 03/11/2015 04:41 PM, Paul E. McKenney wrote:
> Aha, I forgot to update rcu/next.  I have now updated it, so it should
> make it there today or tomorrow.  In the meantime, does the following
> commit help?
> 
> Also, how quickly does your test setup reproduce this?

I usually hit it within an hour.

With the commit I didn't hit it yet, but I do see 4 different WARNings:

[ 1002.577249] WARNING: CPU: 19 PID: 22311 at kernel/rcu/tree.c:2190 
rcu_report_qs_rnp+0x41e/0x590()
[ 1002.582438] Modules linked in:
[ 1002.582438] CPU: 19 PID: 22311 Comm: trinity-c58 Not tainted 
4.0.0-rc3-next-20150311-sasha-00035-g0a45b48 #2051
[ 1002.582438]  88047f20 727b89c5 880672807cc8 
9ea97ec9
[ 1002.582438]    880672807d18 
941e242a
[ 1002.582438]  0282 943339be dc00 
a1721000
[ 1002.582438] Call Trace:
[ 1002.582438]  dump_stack (lib/dump_stack.c:52)
[ 1002.582438] warn_slowpath_common (kernel/panic.c:447)
[ 1002.582438] ? rcu_report_qs_rnp (kernel/rcu/tree.c:2190 (discriminator 3))
[ 1002.582438] warn_slowpath_null (kernel/panic.c:481)
[ 1002.582438] rcu_report_qs_rnp (kernel/rcu/tree.c:2190 (discriminator 3))
[ 1002.582438] rcu_process_callbacks (kernel/rcu/tree.c:2288 
kernel/rcu/tree.c:2324 kernel/rcu/tree.c:2816 kernel/rcu/tree.c:2849)
[ 1002.582438] ? __lock_is_held (kernel/locking/lockdep.c:3518)
[ 1002.582438] __do_softirq (kernel/softirq.c:273 
include/linux/jump_label.h:114 include/trace/events/irq.h:126 
kernel/softirq.c:274)
[ 1002.582438] irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
[ 1002.582438] smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:918)
[ 1002.582438] apic_timer_interrupt (arch/x86/kernel/entry_64.S:920)
[ 1002.582438]  ? __lock_is_held (kernel/locking/lockdep.c:3518)
[ 1002.582438] ? lock_acquire (./arch/x86/include/asm/paravirt.h:809 
kernel/locking/lockdep.c:3605)
[ 1002.582438] ? rmap_walk (mm/rmap.c:1569 mm/rmap.c:1594)
[ 1002.582438] down_read (./arch/x86/include/asm/rwsem.h:65 
kernel/locking/rwsem.c:24)
[ 1002.582438] ? rmap_walk (mm/rmap.c:1569 mm/rmap.c:1594)
[ 1002.582438] ? up_read (./arch/x86/include/asm/rwsem.h:156 
kernel/locking/rwsem.c:81)
[ 1002.582438] ? page_remove_rmap (mm/rmap.c:1209)
[ 1002.582438] rmap_walk (mm/rmap.c:1569 mm/rmap.c:1594)
[ 1002.582438] try_to_munlock (mm/rmap.c:1446)
[ 1002.582438] ? try_to_unmap (mm/rmap.c:1446)
[ 1002.582438] ? page_remove_rmap (mm/rmap.c:1209)
[ 1002.582438] ? invalid_migration_vma (mm/rmap.c:1382)
[ 1002.582438] ? anon_vma_prepare (mm/rmap.c:497)
[ 1002.582438] __munlock_isolated_page (mm/mlock.c:132)
[ 1002.582438] __munlock_pagevec (mm/mlock.c:332)
[ 1002.582438] ? __munlock_isolate_lru_page (mm/mlock.c:277)
[ 1002.582438] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:95 
include/linux/spinlock_api_smp.h:154 kernel/locking/spinlock.c:183)
[ 1002.582438] ? __get_locked_pte (mm/memory.c:1432)
[ 1002.582438] munlock_vma_pages_range (mm/mlock.c:477)
[ 1002.582438] ? munlock_vma_page (mm/mlock.c:424)
[ 1002.582438] ? __lock_acquire (kernel/locking/lockdep.c:2019 
kernel/locking/lockdep.c:3184)
[ 1002.582438] ? kfree (mm/slub.c:1300 mm/slub.c:2707 mm/slub.c:3416)
[ 1002.582438] ? __lock_acquire (kernel/locking/lockdep.c:2019 
kernel/locking/lockdep.c:3184)
[ 1002.582438] ? vma_merge (mm/mmap.c:1085)
[ 1002.582438] mlock_fixup (mm/mlock.c:549)
[ 1002.582438] do_mlockall (mm/mlock.c:677)
[ 1002.582438] ? do_mlock (mm/mlock.c:658)
[ 1002.582438] sys_munlockall (mm/mlock.c:720)
[ 1002.582438] ia32_do_call (arch/x86/ia32/ia32entry.S:486)

[ 1034.418016] WARNING: CPU: 0 PID: 8 at kernel/rcu/tree.c:1936 
rcu_gp_kthread+0x2a7c/0x4dd0()
[ 1034.419933] Modules linked in:
[ 1034.420802] CPU: 0 PID: 8 Comm: rcu_preempt Tainted: G W 
4.0.0-rc3-next-20150311-sasha-00035-g0a45b48 #2051
[ 1034.423329]  8800496f8000 5fe03e34 880049ac7b28 
9ea97ec9
[ 1034.426460]    880049ac7b78 
941e242a
[ 1034.426460]  00022000 9433cafc a1725040 
0001000110de
[ 1034.426460] Call Trace:
[ 1034.426460] dump_stack (lib/dump_stack.c:52)
[ 1034.426460] warn_slowpath_common (kernel/panic.c:447)
[ 1034.426460] ? rcu_gp_kthread (kernel/rcu/tree.c:1936 kernel/rcu/tree.c:2061)
[ 1034.426460] warn_slowpath_null (kernel/panic.c:481)
[ 1034.426460] rcu_gp_kthread (kernel/rcu/tree.c:1936 kernel/rcu/tree.c:2061)
[ 1034.426460] ? plist_check_head (lib/plist.c:62)
[ 1034.426460] ? cond_synchronize_rcu (kernel/rcu/tree.c:1972)
[ 1034.426460] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 
kernel/locking/lockdep.c:2601)
[ 1034.426460] ? _raw_spin_unlock_irq (./arch/x86/include/asm/preempt.h:95 
include/linux/spinlock_api_smp.h:171 kernel/locking/spinlock.c:199)
[ 1034.426460] ? __schedule (kernel/sched/core.c:2803)
[ 1034.426460] ? ___preempt_schedule (arch/x86/lib/thunk_64.S:51)
[ 1034.426460] ? abort_exclusive_wait (kernel/sched/wait.c:292

Re: rcu: frequent rcu lockups

2015-03-11 Thread Paul E. McKenney
On Wed, Mar 11, 2015 at 04:18:49PM -0400, Sasha Levin wrote:
> On 03/11/2015 04:17 PM, Paul E. McKenney wrote:
> > On Wed, Mar 11, 2015 at 03:57:32PM -0400, Sasha Levin wrote:
> >> Hi all,
> >>
> >> I've started seeing the following hang pretty often during my fuzzing. The
> >> system proceeds to lock up after that.
> >>
> >> [ 3209.655703] INFO: rcu_preempt detected stalls on CPUs/tasks:
> >> [ 3209.655703] Tasks blocked on level-1 rcu_node (CPUs 16-31):
> >> [ 3209.655703] (detected by 0, t=30502 jiffies, g=48799, c=48798, 
> >> q=1730)
> >> [ 3209.655703] All QSes seen, last rcu_preempt kthread activity 1 
> >> (4295246069-4295246068), jiffies_till_next_fqs=1, root ->qsmask 0x2
> >> [ 3209.655703] trinity-c24 R  running task 26944  9338   9110 
> >> 0x1008
> >> [ 3209.655703]  2396 e68fa48e 880050607dc8 
> >> a427679b
> >> [ 3209.655703]  880050607d98 b1b36000 0001 
> >> 0001000440f4
> >> [ 3209.655703]  b1b351c8 dc00 880050622000 
> >> b1721000
> >> [ 3209.655703] Call Trace:
> >> [ 3209.655703]  sched_show_task (kernel/sched/core.c:4542)
> >> [ 3209.655703] rcu_check_callbacks (kernel/rcu/tree.c:1225 
> >> kernel/rcu/tree.c:1331 kernel/rcu/tree.c:3389 kernel/rcu/tree.c:3453 
> >> kernel/rcu/tree.c:2683)
> >> [ 3209.655703] ? acct_account_cputime (kernel/tsacct.c:168)
> >> [ 3209.655703] update_process_times (./arch/x86/include/asm/preempt.h:22 
> >> kernel/time/timer.c:1386)
> >> [ 3209.655703] tick_periodic (kernel/time/tick-common.c:92)
> >> [ 3209.655703] ? tick_handle_periodic (kernel/time/tick-common.c:105)
> >> [ 3209.655703] tick_handle_periodic (kernel/time/tick-common.c:105)
> >> [ 3209.655703] local_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:891)
> >> [ 3209.655703] smp_trace_apic_timer_interrupt 
> >> (arch/x86/kernel/apic/apic.c:934 include/linux/jump_label.h:114 
> >> ./arch/x86/include/asm/trace/irq_vectors.h:45 
> >> arch/x86/kernel/apic/apic.c:935)
> >> [ 3209.655703] trace_apic_timer_interrupt (arch/x86/kernel/entry_64.S:920)
> >> [ 3209.655703]  ? add_wait_queue (include/linux/wait.h:116 
> >> kernel/sched/wait.c:29)
> >> [ 3209.655703] ? _raw_spin_unlock_irqrestore 
> >> (./arch/x86/include/asm/paravirt.h:809 
> >> include/linux/spinlock_api_smp.h:162 kernel/locking/spinlock.c:191)
> >> [ 3209.655703] add_wait_queue (kernel/sched/wait.c:31)
> >> [ 3209.655703] do_wait (kernel/exit.c:1473)
> >> [ 3209.655703] ? trace_rcu_dyntick (include/trace/events/rcu.h:363 
> >> (discriminator 19))
> >> [ 3209.655703] ? wait_consider_task (kernel/exit.c:1465)
> >> [ 3209.655703] ? find_get_pid (kernel/pid.c:490)
> >> [ 3209.655703] SyS_wait4 (kernel/exit.c:1618 kernel/exit.c:1586)
> >> [ 3209.655703] ? perf_syscall_exit (kernel/trace/trace_syscalls.c:549)
> >> [ 3209.655703] ? SyS_waitid (kernel/exit.c:1586)
> >> [ 3209.655703] ? kill_orphaned_pgrp (kernel/exit.c:1444)
> >> [ 3209.655703] ? syscall_trace_enter_phase2 (arch/x86/kernel/ptrace.c:1592)
> >> [ 3209.655703] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
> >> [ 3209.655703] tracesys_phase2 (arch/x86/kernel/entry_64.S:347)
> > 
> > OK, that is not good.
> > 
> > What version are you running and what is your .config?
> 
> latest -next. .config attached.

Aha, I forgot to update rcu/next.  I have now updated it, so it should
make it there today or tomorrow.  In the meantime, does the following
commit help?

Also, how quickly does your test setup reproduce this?

Thanx, Paul



rcu: Yet another fix for preemption and CPU hotplug

As noted earlier, the following sequence of events can occur when
running PREEMPT_RCU and HOTPLUG_CPU on a system with a multi-level
rcu_node combining tree:

1.  A group of tasks block on CPUs corresponding to a given leaf
    rcu_node structure while within RCU read-side critical sections.
2.  All CPUs corresponding to that rcu_node structure go offline.
3.  The next grace period starts, but because there are still tasks
    blocked, the upper-level bits corresponding to this leaf rcu_node
    structure remain set.
4.  All the tasks exit their RCU read-side critical sections and
    remove themselves from the leaf rcu_node structure's list,
    leaving it empty.
5.  But because there now is code to check for this condition at
    force-quiescent-state time, the upper bits are cleared and the
    grace period completes.
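
To make the combining-tree bookkeeping above concrete, here is a minimal
toy model in plain C.  The names (toy_rnp, toy_report_qs, toy_fqs_scan)
are invented for illustration; this is not the actual kernel/rcu/tree.c
code, only a sketch of how a stale bit in the root's ->qsmask stalls a
grace period until a forced scan notices the now-empty leaf:

	#include <stdbool.h>
	#include <stdio.h>

	#define NUM_LEAVES 2

	struct toy_rnp {
		unsigned long qsmask;   /* children still owing a quiescent state */
		bool blocked_tasks;     /* readers preempted while blocked here   */
	};

	static struct toy_rnp root;
	static struct toy_rnp leaf[NUM_LEAVES];

	/* A leaf reports a quiescent state up the tree: its bit in the
	 * root's qsmask may be cleared only if no readers remain queued. */
	static void toy_report_qs(int i)
	{
		if (!leaf[i].blocked_tasks)
			root.qsmask &= ~(1UL << i);
	}

	/* The step-5 fix: the force-quiescent-state scan also clears bits
	 * for leaves that have since emptied, instead of waiting forever. */
	static void toy_fqs_scan(void)
	{
		int i;

		for (i = 0; i < NUM_LEAVES; i++)
			if (!leaf[i].blocked_tasks)
				root.qsmask &= ~(1UL << i);
	}

	int main(void)
	{
		root.qsmask = 0x3;             /* new GP: both leaves owe a QS   */
		leaf[0].blocked_tasks = true;  /* step 1: readers blocked        */
		toy_report_qs(0);              /* steps 2-3: CPUs go offline but */
		toy_report_qs(1);              /* the root's bit 0 stays set     */
		printf("root qsmask = %#lx (grace period stalled)\n", root.qsmask);

		leaf[0].blocked_tasks = false; /* step 4: last task dequeues     */
		toy_fqs_scan();                /* step 5: scan clears the bit    */
		printf("root qsmask = %#lx (grace period can end)\n", root.qsmask);
		return 0;
	}

Compiled and run, this prints root qsmask = 0x1 while leaf 0 still pins
the grace period, then 0x0 once the scan clears the stale bit.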

However, there is another complication that can occur following step 4 above:

4a. The grace period starts, and the leaf rcu_node structure's
    gp_tasks pointer is set to NULL because there are no tasks
    blocked on this structure.
4b. One of the CPUs corresponding to the leaf rcu_node structure
    comes back online.
4c. An endless stream of tasks are preempted with

Re: rcu: frequent rcu lockups

2015-03-11 Thread Sasha Levin
On 03/11/2015 04:17 PM, Paul E. McKenney wrote:
> On Wed, Mar 11, 2015 at 03:57:32PM -0400, Sasha Levin wrote:
>> Hi all,
>>
>> I've started seeing the following hang pretty often during my fuzzing. The
>> system proceeds to lock up after that.
>>
>> [ 3209.655703] INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [ 3209.655703]   Tasks blocked on level-1 rcu_node (CPUs 16-31):
>> [ 3209.655703]   (detected by 0, t=30502 jiffies, g=48799, c=48798, 
>> q=1730)
>> [ 3209.655703] All QSes seen, last rcu_preempt kthread activity 1 
>> (4295246069-4295246068), jiffies_till_next_fqs=1, root ->qsmask 0x2
>> [ 3209.655703] trinity-c24 R  running task 26944  9338   9110 
>> 0x1008
>> [ 3209.655703]  2396 e68fa48e 880050607dc8 
>> a427679b
>> [ 3209.655703]  880050607d98 b1b36000 0001 
>> 0001000440f4
>> [ 3209.655703]  b1b351c8 dc00 880050622000 
>> b1721000
>> [ 3209.655703] Call Trace:
>> [ 3209.655703]  sched_show_task (kernel/sched/core.c:4542)
>> [ 3209.655703] rcu_check_callbacks (kernel/rcu/tree.c:1225 
>> kernel/rcu/tree.c:1331 kernel/rcu/tree.c:3389 kernel/rcu/tree.c:3453 
>> kernel/rcu/tree.c:2683)
>> [ 3209.655703] ? acct_account_cputime (kernel/tsacct.c:168)
>> [ 3209.655703] update_process_times (./arch/x86/include/asm/preempt.h:22 
>> kernel/time/timer.c:1386)
>> [ 3209.655703] tick_periodic (kernel/time/tick-common.c:92)
>> [ 3209.655703] ? tick_handle_periodic (kernel/time/tick-common.c:105)
>> [ 3209.655703] tick_handle_periodic (kernel/time/tick-common.c:105)
>> [ 3209.655703] local_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:891)
>> [ 3209.655703] smp_trace_apic_timer_interrupt 
>> (arch/x86/kernel/apic/apic.c:934 include/linux/jump_label.h:114 
>> ./arch/x86/include/asm/trace/irq_vectors.h:45 
>> arch/x86/kernel/apic/apic.c:935)
>> [ 3209.655703] trace_apic_timer_interrupt (arch/x86/kernel/entry_64.S:920)
>> [ 3209.655703]  ? add_wait_queue (include/linux/wait.h:116 
>> kernel/sched/wait.c:29)
>> [ 3209.655703] ? _raw_spin_unlock_irqrestore 
>> (./arch/x86/include/asm/paravirt.h:809 include/linux/spinlock_api_smp.h:162 
>> kernel/locking/spinlock.c:191)
>> [ 3209.655703] add_wait_queue (kernel/sched/wait.c:31)
>> [ 3209.655703] do_wait (kernel/exit.c:1473)
>> [ 3209.655703] ? trace_rcu_dyntick (include/trace/events/rcu.h:363 
>> (discriminator 19))
>> [ 3209.655703] ? wait_consider_task (kernel/exit.c:1465)
>> [ 3209.655703] ? find_get_pid (kernel/pid.c:490)
>> [ 3209.655703] SyS_wait4 (kernel/exit.c:1618 kernel/exit.c:1586)
>> [ 3209.655703] ? perf_syscall_exit (kernel/trace/trace_syscalls.c:549)
>> [ 3209.655703] ? SyS_waitid (kernel/exit.c:1586)
>> [ 3209.655703] ? kill_orphaned_pgrp (kernel/exit.c:1444)
>> [ 3209.655703] ? syscall_trace_enter_phase2 (arch/x86/kernel/ptrace.c:1592)
>> [ 3209.655703] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
>> [ 3209.655703] tracesys_phase2 (arch/x86/kernel/entry_64.S:347)
> 
> OK, that is not good.
> 
> What version are you running and what is your .config?

latest -next. .config attached.


Thanks,
Sasha



[Attachment: config-sasha.gz, application/gzip]


Re: rcu: frequent rcu lockups

2015-03-11 Thread Paul E. McKenney
On Wed, Mar 11, 2015 at 03:57:32PM -0400, Sasha Levin wrote:
> Hi all,
> 
> I've started seeing the following hang pretty often during my fuzzing. The
> system proceeds to lock up after that.
> 
> [ 3209.655703] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 3209.655703] Tasks blocked on level-1 rcu_node (CPUs 16-31):
> [ 3209.655703] (detected by 0, t=30502 jiffies, g=48799, c=48798, 
> q=1730)
> [ 3209.655703] All QSes seen, last rcu_preempt kthread activity 1 
> (4295246069-4295246068), jiffies_till_next_fqs=1, root ->qsmask 0x2
> [ 3209.655703] trinity-c24 R  running task 26944  9338   9110 
> 0x1008
> [ 3209.655703]  2396 e68fa48e 880050607dc8 
> a427679b
> [ 3209.655703]  880050607d98 b1b36000 0001 
> 0001000440f4
> [ 3209.655703]  b1b351c8 dc00 880050622000 
> b1721000
> [ 3209.655703] Call Trace:
> [ 3209.655703]  sched_show_task (kernel/sched/core.c:4542)
> [ 3209.655703] rcu_check_callbacks (kernel/rcu/tree.c:1225 
> kernel/rcu/tree.c:1331 kernel/rcu/tree.c:3389 kernel/rcu/tree.c:3453 
> kernel/rcu/tree.c:2683)
> [ 3209.655703] ? acct_account_cputime (kernel/tsacct.c:168)
> [ 3209.655703] update_process_times (./arch/x86/include/asm/preempt.h:22 
> kernel/time/timer.c:1386)
> [ 3209.655703] tick_periodic (kernel/time/tick-common.c:92)
> [ 3209.655703] ? tick_handle_periodic (kernel/time/tick-common.c:105)
> [ 3209.655703] tick_handle_periodic (kernel/time/tick-common.c:105)
> [ 3209.655703] local_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:891)
> [ 3209.655703] smp_trace_apic_timer_interrupt 
> (arch/x86/kernel/apic/apic.c:934 include/linux/jump_label.h:114 
> ./arch/x86/include/asm/trace/irq_vectors.h:45 arch/x86/kernel/apic/apic.c:935)
> [ 3209.655703] trace_apic_timer_interrupt (arch/x86/kernel/entry_64.S:920)
> [ 3209.655703]  ? add_wait_queue (include/linux/wait.h:116 
> kernel/sched/wait.c:29)
> [ 3209.655703] ? _raw_spin_unlock_irqrestore 
> (./arch/x86/include/asm/paravirt.h:809 include/linux/spinlock_api_smp.h:162 
> kernel/locking/spinlock.c:191)
> [ 3209.655703] add_wait_queue (kernel/sched/wait.c:31)
> [ 3209.655703] do_wait (kernel/exit.c:1473)
> [ 3209.655703] ? trace_rcu_dyntick (include/trace/events/rcu.h:363 
> (discriminator 19))
> [ 3209.655703] ? wait_consider_task (kernel/exit.c:1465)
> [ 3209.655703] ? find_get_pid (kernel/pid.c:490)
> [ 3209.655703] SyS_wait4 (kernel/exit.c:1618 kernel/exit.c:1586)
> [ 3209.655703] ? perf_syscall_exit (kernel/trace/trace_syscalls.c:549)
> [ 3209.655703] ? SyS_waitid (kernel/exit.c:1586)
> [ 3209.655703] ? kill_orphaned_pgrp (kernel/exit.c:1444)
> [ 3209.655703] ? syscall_trace_enter_phase2 (arch/x86/kernel/ptrace.c:1592)
> [ 3209.655703] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
> [ 3209.655703] tracesys_phase2 (arch/x86/kernel/entry_64.S:347)

OK, that is not good.

What version are you running and what is your .config?

Thanx, Paul



rcu: frequent rcu lockups

2015-03-11 Thread Sasha Levin
Hi all,

I've started seeing the following hang pretty often during my fuzzing. The
system proceeds to lock up after that.

[ 3209.655703] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 3209.655703]  Tasks blocked on level-1 rcu_node (CPUs 16-31):
[ 3209.655703]  (detected by 0, t=30502 jiffies, g=48799, c=48798, q=1730)
[ 3209.655703] All QSes seen, last rcu_preempt kthread activity 1 
(4295246069-4295246068), jiffies_till_next_fqs=1, root ->qsmask 0x2
[ 3209.655703] trinity-c24 R  running task 26944  9338   9110 0x1008
[ 3209.655703]  2396 e68fa48e 880050607dc8 
a427679b
[ 3209.655703]  880050607d98 b1b36000 0001 
0001000440f4
[ 3209.655703]  b1b351c8 dc00 880050622000 
b1721000
[ 3209.655703] Call Trace:
[ 3209.655703]  sched_show_task (kernel/sched/core.c:4542)
[ 3209.655703] rcu_check_callbacks (kernel/rcu/tree.c:1225 
kernel/rcu/tree.c:1331 kernel/rcu/tree.c:3389 kernel/rcu/tree.c:3453 
kernel/rcu/tree.c:2683)
[ 3209.655703] ? acct_account_cputime (kernel/tsacct.c:168)
[ 3209.655703] update_process_times (./arch/x86/include/asm/preempt.h:22 
kernel/time/timer.c:1386)
[ 3209.655703] tick_periodic (kernel/time/tick-common.c:92)
[ 3209.655703] ? tick_handle_periodic (kernel/time/tick-common.c:105)
[ 3209.655703] tick_handle_periodic (kernel/time/tick-common.c:105)
[ 3209.655703] local_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:891)
[ 3209.655703] smp_trace_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:934 
include/linux/jump_label.h:114 ./arch/x86/include/asm/trace/irq_vectors.h:45 
arch/x86/kernel/apic/apic.c:935)
[ 3209.655703] trace_apic_timer_interrupt (arch/x86/kernel/entry_64.S:920)
[ 3209.655703]  ? add_wait_queue (include/linux/wait.h:116 
kernel/sched/wait.c:29)
[ 3209.655703] ? _raw_spin_unlock_irqrestore 
(./arch/x86/include/asm/paravirt.h:809 include/linux/spinlock_api_smp.h:162 
kernel/locking/spinlock.c:191)
[ 3209.655703] add_wait_queue (kernel/sched/wait.c:31)
[ 3209.655703] do_wait (kernel/exit.c:1473)
[ 3209.655703] ? trace_rcu_dyntick (include/trace/events/rcu.h:363 
(discriminator 19))
[ 3209.655703] ? wait_consider_task (kernel/exit.c:1465)
[ 3209.655703] ? find_get_pid (kernel/pid.c:490)
[ 3209.655703] SyS_wait4 (kernel/exit.c:1618 kernel/exit.c:1586)
[ 3209.655703] ? perf_syscall_exit (kernel/trace/trace_syscalls.c:549)
[ 3209.655703] ? SyS_waitid (kernel/exit.c:1586)
[ 3209.655703] ? kill_orphaned_pgrp (kernel/exit.c:1444)
[ 3209.655703] ? syscall_trace_enter_phase2 (arch/x86/kernel/ptrace.c:1592)
[ 3209.655703] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
[ 3209.655703] tracesys_phase2 (arch/x86/kernel/entry_64.S:347)


Thanks,
Sasha