Consolidating replies into one thread.

Hi Kunwu,
> One thing that is still unclear is dispatch behavior:
> `process_srcu` stays pending for a long time, while the same pwq dump shows
> idle workers.
>
> So the key question is: what prevents pending work from being dispatched on
> that pwq?
> Is it due to:
> 1) pwq stalled/hung state,
> 2) worker availability/affinity constraints,
> 3) or another dispatch-side condition?
>
> Also, for scope:
> - your crash instances consistently show the shutdown path
> (irqfd_resampler_shutdown + synchronize_srcu),
> - while assign-path evidence, per current thread data, appears to come
> from a separate stress case.
> A time-aligned dump with pwq state, pending/in-flight lists, and worker states
> should help clarify this.

I have a dmesg log showing this issue. This is from an automated stress
reboot test. The log is very similar to what Sonam shared.

<0>[ 434.338427] BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0 stuck for 293s!
<6>[ 434.339037] Showing busy workqueues and worker pools:
<6>[ 434.339387] workqueue events: flags=0x100
<6>[ 434.339667] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=2 refcnt=3
<6>[ 434.339691] pending: 2*xhci_dbc_handle_events
<6>[ 434.340512] workqueue events: flags=0x100
<6>[ 434.340789] pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[ 434.340793] pending: vmstat_shepherd
<6>[ 434.341507] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=45 refcnt=46
<6>[ 434.341511] pending: delayed_vfree_work, kernfs_notify_workfn, 5*destroy_super_work, 3*bpf_prog_free_deferred, 5*destroy_super_work, binder_deferred_func, bpf_prog_free_deferred, 25*destroy_super_work, drain_local_memcg_stock, update_stats_workfn, psi_avgs_work
<6>[ 434.343578] pwq 30: cpus=7 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[ 434.343582] in-flight: 325:do_emergency_remount
<6>[ 434.344376] workqueue events_unbound: flags=0x2
<6>[ 434.344688] pwq 34: cpus=0-7 node=0 flags=0x4 nice=0 active=2 refcnt=3
<6>[ 434.344693] in-flight: 339:fsnotify_connector_destroy_workfn fsnotify_connector_destroy_workfn
<6>[ 434.345755] pwq 34: cpus=0-7 node=0 flags=0x4 nice=0 active=2 refcnt=8
<6>[ 434.345759] in-flight: 153:fsnotify_mark_destroy_workfn BAR(3098) BAR(2564) BAR(2299) fsnotify_mark_destroy_workfn BAR(416) BAR(1116)
<6>[ 434.347151] workqueue events_freezable: flags=0x104
<6>[ 434.347590] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[ 434.347595] pending: pci_pme_list_scan
<6>[ 434.348681] workqueue events_power_efficient: flags=0x180
<6>[ 434.349221] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[ 434.349226] pending: check_lifetime
<6>[ 434.350397] workqueue rcu_gp: flags=0x108
<6>[ 434.350853] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=3 refcnt=4
<6>[ 434.350857] pending: 3*process_srcu
<6>[ 434.351918] workqueue slub_flushwq: flags=0x8
<6>[ 434.352409] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=3
<6>[ 434.352413] pending: flush_cpu_slab BAR(1)
<6>[ 434.353529] workqueue mm_percpu_wq: flags=0x8
<6>[ 434.354087] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[ 434.354092] pending: vmstat_update
<6>[ 434.355205] workqueue quota_events_unbound: flags=0xa
<6>[ 434.355725] pwq 34: cpus=0-7 node=0 flags=0x4 nice=0 active=1 refcnt=3
<6>[ 434.355730] in-flight: 354:quota_release_workfn BAR(325)
<6>[ 434.356980] workqueue kvm-irqfd-cleanup: flags=0x0
<6>[ 434.357582] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=3 refcnt=4
<6>[ 434.357586] in-flight: 51:irqfd_shutdown ,3453:irqfd_shutdown ,3449:irqfd_shutdown
<6>[ 434.359101] pool 22: cpus=5 node=0 flags=0x0 nice=0 hung=293s workers=11 idle: 282 154 3452 3451 3448 3450 3455 3454
<6>[ 434.359989] pool 30: cpus=7 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3460 332
<6>[ 434.360539] pool 34: cpus=0-7 node=0 flags=0x4 nice=0 hung=0s workers=5 idle: 256 66

The relevant pool is pool 22 (cpus=5). All three irqfd_shutdown workers are
in-flight but in D state, and rcu_gp's process_srcu items are stuck pending.

Worker 51 (kworker/5:0), blocked acquiring resampler_lock:

<6>[ 440.576612] task:kworker/5:0 state:D stack:0 pid:51 tgid:51 ppid:2 task_flags:0x4208060 flags:0x00080000
<6>[ 440.577379] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
<6>[ 440.578085] <TASK>
<6>[ 440.578337] preempt_schedule_irq+0x4a/0x90
<6>[ 440.583712] __mutex_lock+0x413/0xe40
<6>[ 440.583969] irqfd_resampler_shutdown+0x23/0x150
<6>[ 440.584288] irqfd_shutdown+0x66/0xc0
<6>[ 440.584546] process_scheduled_works+0x219/0x450
<6>[ 440.584864] worker_thread+0x2a7/0x3b0
<6>[ 440.585421] kthread+0x230/0x270

Worker 3449 (kworker/5:4), likewise blocked acquiring resampler_lock:

<6>[ 440.671294] task:kworker/5:4 state:D stack:0 pid:3449 tgid:3449 ppid:2 task_flags:0x4208060 flags:0x00080000
<6>[ 440.672088] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
<6>[ 440.672662] <TASK>
<6>[ 440.673069] schedule+0x5e/0xe0
<6>[ 440.673708] __mutex_lock+0x413/0xe40
<6>[ 440.674059] irqfd_resampler_shutdown+0x23/0x150
<6>[ 440.674381] irqfd_shutdown+0x66/0xc0
<6>[ 440.674638] process_scheduled_works+0x219/0x450
<6>[ 440.674956] worker_thread+0x2a7/0x3b0
<6>[ 440.675308] kthread+0x230/0x270

Worker 3453 (kworker/5:8), holds resampler_lock and is blocked waiting for the
SRCU grace period:

<6>[ 440.677368] task:kworker/5:8 state:D stack:0 pid:3453 tgid:3453 ppid:2 task_flags:0x4208060 flags:0x00080000
<6>[ 440.678185] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
<6>[ 440.678720] <TASK>
<6>[ 440.679127] schedule+0x5e/0xe0
<6>[ 440.679354] schedule_timeout+0x2e/0x130
<6>[ 440.680084] wait_for_common+0xf7/0x1f0
<6>[ 440.680355] synchronize_srcu_expedited+0x109/0x140
<6>[ 440.681164] irqfd_resampler_shutdown+0xf0/0x150
<6>[ 440.681481] irqfd_shutdown+0x66/0xc0
<6>[ 440.681738] process_scheduled_works+0x219/0x450
<6>[ 440.682055] worker_thread+0x2a7/0x3b0
<6>[ 440.682403] kthread+0x230/0x270

The sequence is: worker 3453 acquires resampler_lock and calls
synchronize_srcu_expedited() while holding it. That queues process_srcu on
rcu_gp, and the worker then blocks waiting for the grace period to complete.
Workers 51 and 3449 are blocked trying to acquire the same resampler_lock.

Regarding your dispatch question: all three workers are in D state, so they
have all called schedule(), and wq_worker_sleeping() should have decremented
pool->nr_running to zero. With nr_running == 0 and process_srcu on the
worklist, need_more_worker() should be true, and an idle worker should be
woken via kick_pool() when process_srcu is enqueued (see the sketch of that
logic below, after the dmesg link). Why none of the 8 idle workers ends up
running process_srcu is not entirely clear to me.

Moving the synchronize_srcu_expedited() does resolve this issue, but I am not
exactly sure why the deadlock between the irqfd_shutdown workers causes the
workqueue to stall.

The full dmesg is at:
https://gist.github.com/vineethrp/883db560a4503612448db9b10e02a9b5
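For reference, this is roughly the dispatch logic I have been looking at,
paraphrased from kernel/workqueue.c of a recent tree (simplified and written
from memory, so the exact code in this kernel may differ):

	/* Simplified paraphrase of kernel/workqueue.c, not the literal source. */

	/* A pool wants another worker running whenever its worklist is
	 * non-empty and nothing is currently running (nr_running == 0). */
	static bool need_more_worker(struct worker_pool *pool)
	{
		return !list_empty(&pool->worklist) && !pool->nr_running;
	}

	/* Scheduler hook: a worker kthread is about to block (here, 51/3449
	 * in __mutex_lock and 3453 in wait_for_common). nr_running is
	 * dropped, and an idle worker of the same pool should be woken if
	 * work is still pending. */
	void wq_worker_sleeping(struct task_struct *task)
	{
		struct worker *worker = kthread_data(task);
		struct worker_pool *pool = worker->pool;

		if (worker->flags & WORKER_NOT_RUNNING)
			return;

		pool->nr_running--;
		if (need_more_worker(pool))
			kick_pool(pool); /* wake_up_process() on an idle worker */
	}

	/* __queue_work() likewise calls kick_pool() after adding the work
	 * item (process_srcu here) to pool->worklist. */

So, as I read it, either the enqueue of process_srcu or the last
wq_worker_sleeping() call should have woken one of the 8 idle workers on
pool 22, yet the pool stays hung for 293s.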
Hi Paul,

> Just to be clear, I am guessing that you have the workqueues counterpart
> to a fork bomb. However, if you are using a small finite number of
> workqueue handlers, then we need to make adjustments in SRCU, workqueues,
> or maybe SRCU's use of workqueues.

In this log, I am not seeing a workqueue being stressed out. There are 8 idle
workers, but for some reason no worker is assigned to run process_srcu. I am
not sure if this is a workqueue-related race condition, or if it is working
as intended by not kicking new workers while the in-flight workers are in D
state.

> SRCU and RCU use their own workqueue, which no one else should be
> using (and that prohibition most definitely includes the irqfd workers).

kvm-irqfd-cleanup and rcu_gp, while being separate workqueues, share the same
per-CPU worker pool (pool 22). Both are CPU-bound: rcu_gp's flags=0x108 do
not include WQ_UNBOUND (0x2), so its pwq for CPU 5 resolves to the same
per-CPU pool (pool 22, flags=0x0) that kvm-irqfd-cleanup (flags=0x0) uses. I
think CPU-bound workqueues share the per-CPU pool for a given CPU regardless
of being separate workqueues, so these two end up competing for the same
underlying pool's workers. Making kvm-irqfd-cleanup unbound (WQ_UNBOUND)
would place it on a separate pool from rcu_gp, which I guess would prevent
this interference and fix the stall. A rough sketch of what I mean is below.
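Something like this (untested, and I am writing the current kvm_irqfd_init()
from memory, so take it as a sketch of the idea rather than a proper patch):

	/* virt/kvm/eventfd.c, kvm_irqfd_init(), approximately as it is today */
	int kvm_irqfd_init(void)
	{
		/* Per-CPU by default, so irqfd_shutdown queued from CPU 5
		 * lands on the same pool 22 that rcu_gp's process_srcu needs. */
		irqfd_cleanup_wq = alloc_workqueue("kvm-irqfd-cleanup", 0, 0);

		/* Proposed: make it unbound so the cleanup work runs on an
		 * unbound pool instead, e.g.:
		 *
		 *	irqfd_cleanup_wq = alloc_workqueue("kvm-irqfd-cleanup",
		 *					   WQ_UNBOUND, 0);
		 */
		if (!irqfd_cleanup_wq)
			return -ENOMEM;

		return 0;
	}

That said, moving the synchronize_srcu_expedited() call out from under
resampler_lock also avoids parking several pool-22 workers at once, so either
change should make this much harder to hit.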
Thanks,
Vineeth
