Re: ftrace hangs waiting for rcu (was: Re: [PATCH] ftrace: Have architectures opt-in for mcount build time sorting)

2022-01-28 Thread Mark Rutland
Hi Sven,

On Thu, Jan 27, 2022 at 07:42:35PM +0100, Sven Schnelle wrote:
> Mark Rutland  writes:
> 
> > * I intermittently see a hang when running the tests. I previously hit that
> >   when originally trying to bisect this issue (and IIRC that bisected down 
> > to
> >   some RCU changes, but I need to re-run that). When the tests hang I
> >   magic-srsrq + L tells me:
> >
> >   [  271.938438] sysrq: Show Blocked State
> >   [  271.939245] task:ftracetest  state:D stack:0 pid: 5687 ppid:  
> > 5627 flags:0x0200
> >   [  271.940961] Call trace:
> >   [  271.941472]  __switch_to+0x104/0x160
> >   [  271.942213]  __schedule+0x2b0/0x6e0
> >   [  271.942933]  schedule+0x5c/0xf0
> >   [  271.943586]  schedule_timeout+0x184/0x1c4
> >   [  271.944410]  wait_for_completion+0x8c/0x12c
> >   [  271.945274]  __wait_rcu_gp+0x184/0x190
> >   [  271.946047]  synchronize_rcu_tasks_rude+0x48/0x70
> >   [  271.947007]  update_ftrace_function+0xa4/0xec
> >   [  271.947897]  __unregister_ftrace_function+0xa4/0xf0
> >   [  271.948898]  unregister_ftrace_function+0x34/0x70
> >   [  271.949857]  wakeup_tracer_reset+0x4c/0x100
> >   [  271.950713]  tracing_set_tracer+0xd0/0x2b0
> >   [  271.951552]  tracing_set_trace_write+0xe8/0x150
> >   [  271.952477]  vfs_write+0xfc/0x284
> >   [  271.953171]  ksys_write+0x7c/0x110
> >   [  271.953874]  __arm64_sys_write+0x2c/0x40
> >   [  271.954678]  invoke_syscall+0x5c/0x130
> >   [  271.955442]  el0_svc_common.constprop.0+0x108/0x130
> >   [  271.956435]  do_el0_svc+0x74/0x90
> >   [  271.957124]  el0_svc+0x2c/0x90
> >   [  271.957757]  el0t_64_sync_handler+0xa8/0x12c
> >   [  271.958629]  el0t_64_sync+0x1a0/0x1a4

On arm64 I bisected this down to:

  7a30871b6a27de1a ("rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic 
queue selection")

Which was going wrong because ilog2() rounds down, and so the shift was wrong
for any nr_cpus that was not a power-of-two. Paul had already fixed that in
rcu-next, and just sent a pull request to Linus:

  
https://lore.kernel.org/lkml/20220128143251.GA2398275@paulmck-ThinkPad-P17-Gen-1/

With that applied, I no longer see these hangs.

Does your s390 test machine have a non-power-of-two nr_cpus, and does that fix
the issue for you?

On arm64 the startup tests didn't seem to trigger the hang, but I was able to
trigger the hang fairly reliably with the ftrace selftests, e.g.

  $ for N in $(seq 1 10); do ./ftracetest test.d/00basic/basic2.tc; done

... which prior to the fix, would hang between runs 2 to 5.

Thanks,
Mark.

> that's interesting. On s390 i'm seeing the same problem in CI, but with
> the startup ftrace tests. So that's likely not arm64 spacific.
> 
> On s390, the last messages from ftrace are [5.663568] clocksource: 
> jiffies: mask: 0x max_cycles: 0x, max_idle_ns: 
> 1911260446275 ns
> [5.667099] futex hash table entries: 65536 (order: 12, 16777216 bytes, 
> vmalloc)
> [5.739549] Running postponed tracer tests:
> [5.740662] Testing tracer function: PASSED
> [6.194635] Testing dynamic ftrace: PASSED
> [6.471213] Testing dynamic ftrace ops #1: 
> [6.558445] (1 0 1 0 0) 
> [6.558458] (1 1 2 0 0) 
> [6.699135] (2 1 3 0 764347) 
> [6.699252] (2 2 4 0 766466) 
> [6.759857] (3 2 4 0 1159604)
> [..] hangs here
> 
> The backtrace looks like this, which is very similar to the one above:
> 
> crash> bt 1
> PID: 1  TASK: 80e68100  CPU: 133  COMMAND: "swapper/0"
>  #0 [380004df808] __schedule at cda39f0e
>  #1 [380004df880] schedule at cda3a488
>  #2 [380004df8b0] schedule_timeout at cda41ef6
>  #3 [380004df978] wait_for_completion at cda3bd0a
>  #4 [380004df9d8] __wait_rcu_gp at cc92
>  #5 [380004dfa30] synchronize_rcu_tasks_generic at ccdde0aa
>  #6 [380004dfad8] ftrace_shutdown at cce7b050
>  #7 [380004dfb18] unregister_ftrace_function at cce7b192
>  #8 [380004dfb50] trace_selftest_ops at cda1e0fa
>  #9 [380004dfba0] run_tracer_selftest at cda1e4f2
> #10 [380004dfc00] trace_selftest_startup_function at ce74355c
> #11 [380004dfc58] run_tracer_selftest at cda1e2fc
> #12 [380004dfc98] init_trace_selftests at ce742d30
> #13 [380004dfcd0] do_one_initcall at cccdca16
> #14 [380004dfd68] do_initcalls at ce72e776
> #15 [380004dfde0] kernel_init_freeable at ce72ea60
> #16 [380004dfe50] kernel_init at cda333fe
> #17 [380004dfe68] __ret_from_fork at cccdf920
> #18 [380004dfe98] ret_from_fork at cda444ca
> 
> I didn't had success reproducing it so far, but it is good to know that
> this also happens when running the ftrace testsuite.
> 
> I have several crashdumps, so i could try to pull out some information
> if someone tells me what to look for.
> 
> Thanks,
> Sven


ftrace hangs waiting for rcu (was: Re: [PATCH] ftrace: Have architectures opt-in for mcount build time sorting)

2022-01-27 Thread Sven Schnelle
Hi Mark,

Mark Rutland  writes:

> * I intermittently see a hang when running the tests. I previously hit that
>   when originally trying to bisect this issue (and IIRC that bisected down to
>   some RCU changes, but I need to re-run that). When the tests hang I
>   magic-srsrq + L tells me:
>
>   [  271.938438] sysrq: Show Blocked State
>   [  271.939245] task:ftracetest  state:D stack:0 pid: 5687 ppid:  
> 5627 flags:0x0200
>   [  271.940961] Call trace:
>   [  271.941472]  __switch_to+0x104/0x160
>   [  271.942213]  __schedule+0x2b0/0x6e0
>   [  271.942933]  schedule+0x5c/0xf0
>   [  271.943586]  schedule_timeout+0x184/0x1c4
>   [  271.944410]  wait_for_completion+0x8c/0x12c
>   [  271.945274]  __wait_rcu_gp+0x184/0x190
>   [  271.946047]  synchronize_rcu_tasks_rude+0x48/0x70
>   [  271.947007]  update_ftrace_function+0xa4/0xec
>   [  271.947897]  __unregister_ftrace_function+0xa4/0xf0
>   [  271.948898]  unregister_ftrace_function+0x34/0x70
>   [  271.949857]  wakeup_tracer_reset+0x4c/0x100
>   [  271.950713]  tracing_set_tracer+0xd0/0x2b0
>   [  271.951552]  tracing_set_trace_write+0xe8/0x150
>   [  271.952477]  vfs_write+0xfc/0x284
>   [  271.953171]  ksys_write+0x7c/0x110
>   [  271.953874]  __arm64_sys_write+0x2c/0x40
>   [  271.954678]  invoke_syscall+0x5c/0x130
>   [  271.955442]  el0_svc_common.constprop.0+0x108/0x130
>   [  271.956435]  do_el0_svc+0x74/0x90
>   [  271.957124]  el0_svc+0x2c/0x90
>   [  271.957757]  el0t_64_sync_handler+0xa8/0x12c
>   [  271.958629]  el0t_64_sync+0x1a0/0x1a4

that's interesting. On s390 i'm seeing the same problem in CI, but with
the startup ftrace tests. So that's likely not arm64 spacific.

On s390, the last messages from ftrace are [5.663568] clocksource: jiffies: 
mask: 0x max_cycles: 0x, max_idle_ns: 1911260446275 ns
[5.667099] futex hash table entries: 65536 (order: 12, 16777216 bytes, 
vmalloc)
[5.739549] Running postponed tracer tests:
[5.740662] Testing tracer function: PASSED
[6.194635] Testing dynamic ftrace: PASSED
[6.471213] Testing dynamic ftrace ops #1: 
[6.558445] (1 0 1 0 0) 
[6.558458] (1 1 2 0 0) 
[6.699135] (2 1 3 0 764347) 
[6.699252] (2 2 4 0 766466) 
[6.759857] (3 2 4 0 1159604)
[..] hangs here

The backtrace looks like this, which is very similar to the one above:

crash> bt 1
PID: 1  TASK: 80e68100  CPU: 133  COMMAND: "swapper/0"
 #0 [380004df808] __schedule at cda39f0e
 #1 [380004df880] schedule at cda3a488
 #2 [380004df8b0] schedule_timeout at cda41ef6
 #3 [380004df978] wait_for_completion at cda3bd0a
 #4 [380004df9d8] __wait_rcu_gp at cc92
 #5 [380004dfa30] synchronize_rcu_tasks_generic at ccdde0aa
 #6 [380004dfad8] ftrace_shutdown at cce7b050
 #7 [380004dfb18] unregister_ftrace_function at cce7b192
 #8 [380004dfb50] trace_selftest_ops at cda1e0fa
 #9 [380004dfba0] run_tracer_selftest at cda1e4f2
#10 [380004dfc00] trace_selftest_startup_function at ce74355c
#11 [380004dfc58] run_tracer_selftest at cda1e2fc
#12 [380004dfc98] init_trace_selftests at ce742d30
#13 [380004dfcd0] do_one_initcall at cccdca16
#14 [380004dfd68] do_initcalls at ce72e776
#15 [380004dfde0] kernel_init_freeable at ce72ea60
#16 [380004dfe50] kernel_init at cda333fe
#17 [380004dfe68] __ret_from_fork at cccdf920
#18 [380004dfe98] ret_from_fork at cda444ca

I didn't had success reproducing it so far, but it is good to know that
this also happens when running the ftrace testsuite.

I have several crashdumps, so i could try to pull out some information
if someone tells me what to look for.

Thanks,
Sven