On Thu, Apr 16, 2026 at 9:18 PM Vineeth Pillai <[email protected]> wrote:
>
> Consolidating replies into one thread.
>
> Hi Kunwu,
>
> > One thing that is still unclear is dispatch behavior:
> > `process_srcu` stays pending for a long time, while the same pwq dump shows 
> > idle workers.
> >
> > So the key question is: what prevents pending work from being dispatched on 
> > that pwq?
> > Is it due to:
> >   1) pwq stalled/hung state,
> >   2) worker availability/affinity constraints,
> >   3) or another dispatch-side condition?
> >
> > Also, for scope:
> > - your crash instances consistently show the shutdown path
> >   (irqfd_resampler_shutdown + synchronize_srcu),
> > - while assign-path evidence, per current thread data, appears to come
> >   from a separate stress case.
>
> > A time-aligned dump with pwq state, pending/in-flight lists, and worker 
> > states
> > should help clarify this.
>
> I have a dmesg log showing this issue. This is from an automated stress
> reboot test. The log is very similar to what Sonam shared.
>
> <0>[  434.338427] BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0 
> stuck for 293s!
> <6>[  434.339037] Showing busy workqueues and worker pools:
> <6>[  434.339387] workqueue events: flags=0x100
>  ...
> <6>[  434.350853]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=3 refcnt=4
> <6>[  434.350857]     pending: 3*process_srcu
> ...
> <6>[  434.356980] workqueue kvm-irqfd-cleanup: flags=0x0
> <6>[  434.357582]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=3 refcnt=4
> <6>[  434.357586]     in-flight: 51:irqfd_shutdown ,3453:irqfd_shutdown 
> ,3449:irqfd_shutdown
>
> The relevant pwq is pwq 22. All three irqfd_shutdown workers are in-flight
> but in D state. rcu_gp's process_srcu items are stuck pending.
>
> Worker 51 (kworker/5:0) — blocked acquiring resampler_lock:
> <6>[  440.576612] task:kworker/5:0     state:D stack:0     pid:51    tgid:51  
>   ppid:2      task_flags:0x4208060 flags:0x00080000
> <6>[  440.577379] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> <6>[  440.578085]  <TASK>
> <6>[  440.578337]  preempt_schedule_irq+0x4a/0x90
> <6>[  440.583712]  __mutex_lock+0x413/0xe40
> <6>[  440.583969]  irqfd_resampler_shutdown+0x23/0x150
> <6>[  440.584288]  irqfd_shutdown+0x66/0xc0
> <6>[  440.584546]  process_scheduled_works+0x219/0x450
> <6>[  440.584864]  worker_thread+0x2a7/0x3b0
> <6>[  440.585421]  kthread+0x230/0x270
>
> Worker 3449 (kworker/5:4) — same, blocked acquiring resampler_lock:
> <6>[  440.671294] task:kworker/5:4     state:D stack:0     pid:3449  
> tgid:3449  ppid:2      task_flags:0x4208060 flags:0x00080000
> <6>[  440.672088] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> <6>[  440.672662]  <TASK>
> <6>[  440.673069]  schedule+0x5e/0xe0
> <6>[  440.673708]  __mutex_lock+0x413/0xe40
> <6>[  440.674059]  irqfd_resampler_shutdown+0x23/0x150
> <6>[  440.674381]  irqfd_shutdown+0x66/0xc0
> <6>[  440.674638]  process_scheduled_works+0x219/0x450
> <6>[  440.674956]  worker_thread+0x2a7/0x3b0
> <6>[  440.675308]  kthread+0x230/0x270
>
> Worker 3453 (kworker/5:8) — holds resampler_lock, blocked waiting for SRCU GP:
> <6>[  440.677368] task:kworker/5:8     state:D stack:0     pid:3453  
> tgid:3453  ppid:2      task_flags:0x4208060 flags:0x00080000
> <6>[  440.678185] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> <6>[  440.678720]  <TASK>
> <6>[  440.679127]  schedule+0x5e/0xe0
> <6>[  440.679354]  schedule_timeout+0x2e/0x130
> <6>[  440.680084]  wait_for_common+0xf7/0x1f0
> <6>[  440.680355]  synchronize_srcu_expedited+0x109/0x140
> <6>[  440.681164]  irqfd_resampler_shutdown+0xf0/0x150
> <6>[  440.681481]  irqfd_shutdown+0x66/0xc0
> <6>[  440.681738]  process_scheduled_works+0x219/0x450
> <6>[  440.682055]  worker_thread+0x2a7/0x3b0
> <6>[  440.682403]  kthread+0x230/0x270
>
> The sequence is: worker 3453 acquires resampler_lock, and calls
> synchronize_srcu_expedited() while holding the lock. This queues
> process_srcu on rcu_gp, then blocks waiting for the GP to complete.
> Workers 51 and 3449 are blocked trying to acquire the same resampler_lock.
>
> Regarding your dispatch question: all three workers are in D state, so
> they have all called schedule() and wq_worker_sleeping() should have
> decremented pool->nr_running to zero. With nr_running == 0 and
> process_srcu in the worklist, needs_more_worker() should be true and an
> idle worker should be woken via kick_pool() when process_srcu is enqueued.
> Why none of the 8 idle workers end up dispatching process_srcu is not
> entirely clear to me.
>
> Moving the synchronize_srcu_expedited() call outside the lock does solve
> this issue, but it is not entirely clear why the deadlock between
> irqfd_shutdown workers causes the workqueue to stall.
>

I think I know what is happening now. After adding some more debug
prints, I see that worker->sleeping is 0 for one of the workers
waiting for the mutex (pid 51 in the example above), and
pool->nr_running is 1. This prevents the pool from waking idle
workers to run the pending work.

This time I got a more descriptive stack trace as well:

<6>[18433.604285][T10987] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
<6>[18433.611204][T10987] Call Trace:
<6>[18433.615001][T10987]  <TASK>
<6>[18433.618414][T10987]  __schedule+0x8cf/0xdb0
<6>[18433.623372][T10987]  preempt_schedule_irq+0x4a/0x90
<6>[18433.629112][T10987]  asm_sysvec_reschedule_ipi+0x1a/0x20
<6>[18433.635340][T10987] RIP: 0010:kthread_data+0x15/0x30
<6>[18433.715343][T10987]  wq_worker_sleeping+0xc/0x90
<6>[18433.720806][T10987]  schedule+0x30/0xe0
<6>[18433.725379][T10987]  schedule_preempt_disabled+0x10/0x20
<6>[18433.731604][T10987]  __mutex_lock+0x413/0xe40
<6>[18433.736763][T10987]  irqfd_resampler_shutdown+0x23/0x150
<6>[18433.742989][T10987]  irqfd_shutdown+0x66/0xc0
<6>[18433.748145][T10987]  process_scheduled_works+0x219/0x450
<6>[18433.754370][T10987]  worker_thread+0x30b/0x450
<6>[18433.765460][T10987]  kthread+0x227/0x2a0
<6>[18433.775383][T10987]  ret_from_fork+0xfe/0x1b0

If I am reading the stack correctly, a reschedule IPI was serviced
while in wq_worker_sleeping() (which is responsible for setting
worker->sleeping and decrementing pool->nr_running). The task was
interrupted before it could update either field. After the IPI was
serviced, preempt_schedule_irq() called __schedule(), which switched
the task out before it could decrement nr_running. The task is never
woken up because the mutex holder is waiting for the SRCU grace period
to complete, but process_srcu cannot proceed because the pool does not
kick idle workers while nr_running is 1. Effectively a deadlock.

So, basically, what happens is (based on the above example):
- The srcu gp worker and the irqfd workers (3453, 51) are on the same
per-cpu pool.
- Worker 3453 acquires resampler_lock and calls
synchronize_srcu_expedited() while holding the lock.
- Worker 51 waits on the lock, but is unable to update the critical
workqueue counters (nr_running and sleeping) before it schedules out.
- The workqueue pool is stalled, thereby preventing SRCU GP progress.

This also explains why the issue is not seen when
synchronize_srcu_expedited() is called outside the lock.

Going directly to __schedule() after servicing the IPI is the main
problem, as wq_worker_sleeping() could not complete. Without the IPI
in the picture, the schedule-out path would be:
__mutex_lock
 schedule()
    sched_submit_work()
        wq_worker_sleeping()
    __schedule_loop()
        __schedule()

With the IPI in the picture, it would be:
__mutex_lock
 schedule()
    sched_submit_work()
        wq_worker_sleeping() <-- halfway through
              IPI
           preempt_schedule_irq()
              __schedule()

Moving sched_submit_work() into __schedule() might solve this issue,
but I'm not sure whether it would cause other issues. Adding Tejun for
an expert opinion on the workqueue side :-)

Thanks,
Vineeth
