On Thu, Apr 16, 2026 at 9:18 PM Vineeth Pillai <[email protected]> wrote:
>
> Consolidating replies into one thread.
>
> Hi Kunwu,
>
> > One thing that is still unclear is dispatch behavior:
> > `process_srcu` stays pending for a long time, while the same pwq dump shows
> > idle workers.
> >
> > So the key question is: what prevents pending work from being dispatched on
> > that pwq?
> > Is it due to:
> > 1) pwq stalled/hung state,
> > 2) worker availability/affinity constraints,
> > 3) or another dispatch-side condition?
> >
> > Also, for scope:
> > - your crash instances consistently show the shutdown path
> > (irqfd_resampler_shutdown + synchronize_srcu),
> > - while assign-path evidence, per current thread data, appears to come
> > from a separate stress case.
>
> > A time-aligned dump with pwq state, pending/in-flight lists, and
> > worker states should help clarify this.
>
> I have a dmesg log showing this issue. This is from an automated stress
> reboot test. The log is very similar to what Sonam shared.
>
> <0>[ 434.338427] BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0
> stuck for 293s!
> <6>[ 434.339037] Showing busy workqueues and worker pools:
> <6>[ 434.339387] workqueue events: flags=0x100
> ...
> <6>[ 434.350853] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=3 refcnt=4
> <6>[ 434.350857] pending: 3*process_srcu
> ...
> <6>[ 434.356980] workqueue kvm-irqfd-cleanup: flags=0x0
> <6>[ 434.357582] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=3 refcnt=4
> <6>[ 434.357586] in-flight: 51:irqfd_shutdown ,3453:irqfd_shutdown
> ,3449:irqfd_shutdown
>
> The relevant pwq is pwq 22. All three irqfd_shutdown workers are in-flight
> but in D state. rcu_gp's process_srcu items are stuck pending.
>
> Worker 51 (kworker/5:0) — blocked acquiring resampler_lock:
> <6>[ 440.576612] task:kworker/5:0 state:D stack:0 pid:51 tgid:51
> ppid:2 task_flags:0x4208060 flags:0x00080000
> <6>[ 440.577379] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> <6>[ 440.578085] <TASK>
> <6>[ 440.578337] preempt_schedule_irq+0x4a/0x90
> <6>[ 440.583712] __mutex_lock+0x413/0xe40
> <6>[ 440.583969] irqfd_resampler_shutdown+0x23/0x150
> <6>[ 440.584288] irqfd_shutdown+0x66/0xc0
> <6>[ 440.584546] process_scheduled_works+0x219/0x450
> <6>[ 440.584864] worker_thread+0x2a7/0x3b0
> <6>[ 440.585421] kthread+0x230/0x270
>
> Worker 3449 (kworker/5:4) — same, blocked acquiring resampler_lock:
> <6>[ 440.671294] task:kworker/5:4 state:D stack:0 pid:3449
> tgid:3449 ppid:2 task_flags:0x4208060 flags:0x00080000
> <6>[ 440.672088] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> <6>[ 440.672662] <TASK>
> <6>[ 440.673069] schedule+0x5e/0xe0
> <6>[ 440.673708] __mutex_lock+0x413/0xe40
> <6>[ 440.674059] irqfd_resampler_shutdown+0x23/0x150
> <6>[ 440.674381] irqfd_shutdown+0x66/0xc0
> <6>[ 440.674638] process_scheduled_works+0x219/0x450
> <6>[ 440.674956] worker_thread+0x2a7/0x3b0
> <6>[ 440.675308] kthread+0x230/0x270
>
> Worker 3453 (kworker/5:8) — holds resampler_lock, blocked waiting for SRCU GP:
> <6>[ 440.677368] task:kworker/5:8 state:D stack:0 pid:3453
> tgid:3453 ppid:2 task_flags:0x4208060 flags:0x00080000
> <6>[ 440.678185] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> <6>[ 440.678720] <TASK>
> <6>[ 440.679127] schedule+0x5e/0xe0
> <6>[ 440.679354] schedule_timeout+0x2e/0x130
> <6>[ 440.680084] wait_for_common+0xf7/0x1f0
> <6>[ 440.680355] synchronize_srcu_expedited+0x109/0x140
> <6>[ 440.681164] irqfd_resampler_shutdown+0xf0/0x150
> <6>[ 440.681481] irqfd_shutdown+0x66/0xc0
> <6>[ 440.681738] process_scheduled_works+0x219/0x450
> <6>[ 440.682055] worker_thread+0x2a7/0x3b0
> <6>[ 440.682403] kthread+0x230/0x270
>
> The sequence is: worker 3453 acquires resampler_lock and calls
> synchronize_srcu_expedited() while holding the lock. This queues
> process_srcu on the rcu_gp workqueue, and the worker then blocks
> waiting for the GP to complete. Workers 51 and 3449 are blocked trying
> to acquire the same resampler_lock.
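>
> For context, the locking pattern in question looks roughly like this
> (an abridged sketch of irqfd_resampler_shutdown() from
> virt/kvm/eventfd.c; exact details vary by kernel tree):
>
> static void irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
> {
>         struct kvm_kernel_irqfd_resampler *resampler = irqfd->resampler;
>         struct kvm *kvm = resampler->kvm;
>
>         mutex_lock(&kvm->irqfds.resampler_lock);
>
>         list_del_rcu(&irqfd->resampler_link);
>         if (list_empty(&resampler->list)) {
>                 list_del_rcu(&resampler->link);
>                 kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
>                 /* Blocks for a full SRCU GP while resampler_lock is
>                  * held; this is where worker 3453 is stuck above. */
>                 synchronize_srcu_expedited(&kvm->irq_srcu);
>                 kfree(resampler);
>         }
>
>         mutex_unlock(&kvm->irqfds.resampler_lock);
> }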
>
> Regarding your dispatch question: all three workers are in D state, so
> they have all called schedule(), and wq_worker_sleeping() should have
> decremented pool->nr_running to zero. With nr_running == 0 and
> process_srcu in the worklist, need_more_worker() should be true and an
> idle worker should be woken via kick_pool() when process_srcu is
> enqueued. Why none of the 8 idle workers ends up dispatching
> process_srcu is not entirely clear to me.
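>
> For reference, the dispatch gate in kernel/workqueue.c is essentially
> (abridged, recent kernels):
>
> static bool need_more_worker(struct worker_pool *pool)
> {
>         /* Pending work is dispatched only while no worker bound to
>          * this pool is accounted as running. */
>         return !list_empty(&pool->worklist) && !pool->nr_running;
> }
>
> If nr_running never drops to zero, this stays false and the pending
> work is never dispatched.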
>
> Moving the synchronize_srcu_expedited() call outside the lock does
> solve this issue, but it is not exactly clear why the deadlock between
> irqfd_shutdown workers is causing the workqueue to stall.
>
I think I know what is happening now. After adding some more debug
prints, I see that worker->sleeping is 0 for one of the workers
waiting for the mutex (pid 51 in the example above), and
pool->nr_running is 1. This prevents the pool from dispatching idle
workers.
This time I got a more descriptive stack trace as well:
<6>[18433.604285][T10987] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
<6>[18433.611204][T10987] Call Trace:
<6>[18433.615001][T10987] <TASK>
<6>[18433.618414][T10987] __schedule+0x8cf/0xdb0
<6>[18433.623372][T10987] preempt_schedule_irq+0x4a/0x90
<6>[18433.629112][T10987] asm_sysvec_reschedule_ipi+0x1a/0x20
<6>[18433.635340][T10987] RIP: 0010:kthread_data+0x15/0x30
<6>[18433.715343][T10987] wq_worker_sleeping+0xc/0x90
<6>[18433.720806][T10987] schedule+0x30/0xe0
<6>[18433.725379][T10987] schedule_preempt_disabled+0x10/0x20
<6>[18433.731604][T10987] __mutex_lock+0x413/0xe40
<6>[18433.736763][T10987] irqfd_resampler_shutdown+0x23/0x150
<6>[18433.742989][T10987] irqfd_shutdown+0x66/0xc0
<6>[18433.748145][T10987] process_scheduled_works+0x219/0x450
<6>[18433.754370][T10987] worker_thread+0x30b/0x450
<6>[18433.765460][T10987] kthread+0x227/0x2a0
<6>[18433.775383][T10987] ret_from_fork+0xfe/0x1b0
If I am reading the stack correctly, an IPI was serviced while inside
wq_worker_sleeping() (which is responsible for setting
worker->sleeping to 1 and decrementing pool->nr_running). The worker
was interrupted before it could update either field. After the IPI was
serviced, preempt_schedule_irq() was called and then __schedule(),
which scheduled the task out before it could decrement nr_running. The
task is never woken up because the mutex holder is waiting for the GP
to complete, but process_srcu cannot proceed because the workqueue
pool does not kick idle workers while nr_running is 1. Effectively a
deadlock.
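To make the window concrete, this is roughly what wq_worker_sleeping()
does (an abridged sketch of kernel/workqueue.c on recent kernels; the
comment marks where the RIP above lands):

void wq_worker_sleeping(struct task_struct *task)
{
        /* RIP above: interrupted inside kthread_data(), before any
         * of the updates below could happen. */
        struct worker *worker = kthread_data(task);
        struct worker_pool *pool;

        if (worker->flags & WORKER_NOT_RUNNING)
                return;

        pool = worker->pool;

        /* Return if preempted before wq_worker_running() was reached */
        if (worker->sleeping)
                return;

        worker->sleeping = 1;
        raw_spin_lock_irq(&pool->lock);

        pool->nr_running--;
        if (need_more_worker(pool))
                kick_pool(pool);        /* wake an idle worker */
        raw_spin_unlock_irq(&pool->lock);
}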
So, basically, what happens is (based on the above example):
- the SRCU GP worker and the irqfd workers (3453, 51) are on the same
per-CPU pool.
- worker 3453 acquires resampler_lock and calls
synchronize_srcu_expedited() while holding the lock.
- worker 51 waits on the lock, but is unable to update the critical
workqueue counters (nr_running and worker->sleeping) before it is
scheduled out.
- the workqueue pool is stalled, thereby preventing SRCU GP progress.
This also explains why the issue is not seen when
synchronize_srcu_expedited() is called outside the lock.
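One possible shape of that change is sketched below (illustrative
only, not the actual patch; a real fix would need to get the
kfree()/notifier ordering right):

static void irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
{
        struct kvm_kernel_irqfd_resampler *resampler = irqfd->resampler;
        struct kvm *kvm = resampler->kvm;
        struct kvm_kernel_irqfd_resampler *to_free = NULL;

        mutex_lock(&kvm->irqfds.resampler_lock);
        list_del_rcu(&irqfd->resampler_link);
        if (list_empty(&resampler->list)) {
                list_del_rcu(&resampler->link);
                kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
                to_free = resampler;
        }
        mutex_unlock(&kvm->irqfds.resampler_lock);

        /* The GP wait now happens without resampler_lock held. */
        synchronize_srcu_expedited(&kvm->irq_srcu);
        kfree(to_free);         /* kfree(NULL) is a no-op */
}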
Going directly to __schedule() after servicing the IPI is the main
problem, as wq_worker_sleeping() could not complete. Without the IPI
in the picture, the schedule-out path would be:

__mutex_lock()
  schedule()
    sched_submit_work()
      wq_worker_sleeping()
    __schedule_loop()
      __schedule()

With the IPI in the picture, it becomes:

__mutex_lock()
  schedule()
    sched_submit_work()
      wq_worker_sleeping()  <-- halfway through
        IPI
          preempt_schedule_irq()
            __schedule()
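For comparison, abridged from kernel/sched/core.c (recent trees):
preempt_schedule_irq() calls __schedule() directly and never goes
through sched_submit_work()/wq_worker_sleeping():

asmlinkage __visible void __sched schedule(void)
{
        struct task_struct *tsk = current;

        if (!task_is_running(tsk))
                sched_submit_work(tsk);  /* -> wq_worker_sleeping() */
        __schedule_loop(SM_NONE);
        sched_update_worker(tsk);        /* -> wq_worker_running() */
}

asmlinkage __visible void __sched preempt_schedule_irq(void)
{
        /* ... */
        do {
                preempt_disable();
                local_irq_enable();
                __schedule(SM_PREEMPT);  /* bypasses sched_submit_work() */
                local_irq_disable();
                sched_preempt_enable_no_resched();
        } while (need_resched());
        /* ... */
}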
Moving `sched_submit_work()` into `__schedule()` might solve this
issue, but I'm not sure if it would cause other issues. Adding Tejun
for an expert opinion on the workqueue side :-)
Thanks,
Vineeth