April 1, 2026 at 10:24 PM, "Sonam Sanju" <[email protected]> wrote:
> From: Sonam Sanju <[email protected]>
>
> On Wed, Apr 01, 2026 at 05:34:58PM +0800, Kunwu Chan wrote:
> >
> > Building on the discussion so far, it would be helpful from the SRCU
> > side to gather a bit more evidence to classify the issue.
> >
> > Calling synchronize_srcu_expedited() while holding a mutex is generally
> > valid, so the observed behavior may be workload-dependent.
> >
> > The reported deadlock seems to rely on the assumption that SRCU grace
> > period progress is indirectly blocked by irqfd workqueue saturation.
> > It would be good to confirm whether that assumption actually holds.
>
> I went back through our logs from two independent crash instances and
> can now provide data for each of your questions.
>
> > 1) Are SRCU GP kthreads/workers still making forward progress when
> > the system is stuck?
>
> No. In both crash instances, process_srcu work items remain permanently
> "pending" (never "in-flight") throughout the entire hang.
>
> Instance 1 — kernel 6.18.8, pool 14 (cpus=3):
>
> [   62.712760] workqueue rcu_gp: flags=0x108
> [   62.717801]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
> [   62.717801]     pending: 2*process_srcu
>
> [  187.735092] workqueue rcu_gp: flags=0x108            (125 seconds later)
> [  187.735093]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
> [  187.735093]     pending: 2*process_srcu              (still pending)
>
> 9 consecutive dumps from t=62s to t=312s — process_srcu never runs.
>
> Instance 2 — kernel 6.18.2, pool 22 (cpus=5):
>
> [   93.280711] workqueue rcu_gp: flags=0x108
> [   93.280713]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
> [   93.280716]     pending: process_srcu
>
> [  309.040801] workqueue rcu_gp: flags=0x108            (216 seconds later)
> [  309.040806]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
> [  309.040806]     pending: process_srcu                (still pending)
>
> 8 consecutive dumps from t=93s to t=341s — process_srcu never runs.
>
> In both cases, rcu_gp's process_srcu is bound to the SAME per-CPU pool
> where the kvm-irqfd-cleanup workers are blocked. Both pools have idle
> workers but are marked as hung/stalled:
>
> Instance 1: pool 14: cpus=3 hung=174s workers=11
>             idle: 4046 4038 4045 4039 4043 156 77        (7 idle)
> Instance 2: pool 22: cpus=5 hung=297s workers=12
>             idle: 4242 51 4248 4247 4245 435 4244 4239   (8 idle)
>
> > 2) How many irqfd workers are active in the reported scenario, and
> > can they saturate CPU or worker pools?
>
> 4 kvm-irqfd-cleanup workers in both instances, consistently across all
> dumps:
>
> Instance 1 (pool 14 / cpus=3):
>
> [   62.831877] workqueue kvm-irqfd-cleanup: flags=0x0
> [   62.837838]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=4 refcnt=5
> [   62.837838]     in-flight: 157:irqfd_shutdown, 4044:irqfd_shutdown,
>                               102:irqfd_shutdown, 39:irqfd_shutdown
>
> Instance 2 (pool 22 / cpus=5):
>
> [   93.280894] workqueue kvm-irqfd-cleanup: flags=0x0
> [   93.280896]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=4 refcnt=5
> [   93.280900]     in-flight: 151:irqfd_shutdown, 4246:irqfd_shutdown,
>                               4241:irqfd_shutdown, 4243:irqfd_shutdown
>
> These are from crosvm instances with multiple virtio devices
> (virtio-blk, virtio-net, virtio-input, etc.), each registering an irqfd
> with a resampler. During VM shutdown, all irqfds are detached
> concurrently, queueing that many irqfd_shutdown work items.
>
> The 4 workers are not saturating the CPU — they're all in D state. But
> they ARE all bound to the same per-CPU pool as rcu_gp's process_srcu
> work.
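One note on mechanics here, mostly for readers following along: the
userspace pattern that produces this burst of work items is one
KVM_IRQFD assign per virtio queue at setup and one deassign per queue
at teardown. A minimal illustrative sketch (error handling omitted;
crosvm's actual code of course differs):

#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Register an irqfd with a resampler, as each virtio device does. */
static int assign_resampling_irqfd(int vm_fd, unsigned int gsi)
{
	struct kvm_irqfd req;

	memset(&req, 0, sizeof(req));
	req.fd = eventfd(0, 0);              /* interrupt trigger */
	req.resamplefd = eventfd(0, 0);      /* EOI notification */
	req.gsi = gsi;
	req.flags = KVM_IRQFD_FLAG_RESAMPLE;
	return ioctl(vm_fd, KVM_IRQFD, &req);
}

/* Detach it again; in-kernel this queues an irqfd_shutdown item. */
static int deassign_irqfd(int vm_fd, int fd, unsigned int gsi)
{
	struct kvm_irqfd req;

	memset(&req, 0, sizeof(req));
	req.fd = fd;
	req.gsi = gsi;
	req.flags = KVM_IRQFD_FLAG_DEASSIGN;
	return ioctl(vm_fd, KVM_IRQFD, &req);
}

Calling deassign_irqfd() back-to-back for every device at shutdown is
what lines up the four concurrent irqfd_shutdown items in your dumps.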
> > 3) Do we have a concrete wait-for cycle showing that tasks blocked
> > on resampler_lock are in turn preventing SRCU GP completion?
>
> Yes, in both instances the hung task dump identifies the mutex holder
> stuck in synchronize_srcu, with the other workers waiting on the mutex.
>
> Instance 1 (t=314s):
>
> Worker pid 4044 — MUTEX HOLDER, stuck in synchronize_srcu:
>
> [  315.963979] task:kworker/3:8  state:D  pid:4044
> [  315.977125] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> [  316.012504]  __synchronize_srcu+0x100/0x130
> [  316.023157]  irqfd_resampler_shutdown+0xf0/0x150   <-- offset 0xf0 (synchronize_srcu)
>
> Workers pid 39, 102, 157 — MUTEX WAITERS:
>
> [  314.793025] task:kworker/3:4  state:D  pid:157
> [  314.837472]  __mutex_lock+0x409/0xd90
> [  314.843100]  irqfd_resampler_shutdown+0x23/0x150   <-- offset 0x23 (mutex_lock)
>
> Instance 2 (t=343s):
>
> Worker pid 4241 — MUTEX HOLDER, stuck in synchronize_srcu:
>
> [  343.193294] task:kworker/5:4  state:D  pid:4241
> [  343.193299] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> [  343.193328]  __synchronize_srcu+0x100/0x130
> [  343.193335]  irqfd_resampler_shutdown+0xf0/0x150   <-- offset 0xf0 (synchronize_srcu)
>
> Workers pid 151, 4243, 4246 — MUTEX WAITERS:
>
> [  343.193369] task:kworker/5:6  state:D  pid:4243
> [  343.193397]  __mutex_lock+0x37d/0xbb0
> [  343.193397]  irqfd_resampler_shutdown+0x23/0x150   <-- offset 0x23 (mutex_lock)
>
> Both instances show the identical wait-for cycle:
>
> 1. One worker holds resampler_lock and blocks in __synchronize_srcu
>    (waiting for an SRCU grace period).
> 2. The SRCU GP needs process_srcu to run, but it stays "pending"
>    on the same pool.
> 3. The other irqfd workers block in __mutex_lock on the same pool.
> 4. The pool is marked "hung" and no pending work makes progress
>    for 250-300 seconds, until the kernel panics.
>
> > 4) Is the behavior reproducible in both irqfd_resampler_shutdown()
> > and kvm_irqfd_assign() paths?
>
> In our 4 crash instances the stuck mutex holder is always in
> irqfd_resampler_shutdown() at offset 0xf0 (synchronize_srcu). This
> is consistent — these are all VM shutdown scenarios where only
> irqfd_shutdown workqueue items run.
>
> The kvm_irqfd_assign() path was identified by Vineeth Pillai (Google)
> during a VM create/destroy stress test where assign and shutdown race.
> His traces showed kvm_irqfd (the assign path) stuck in
> synchronize_srcu_expedited with irqfd_resampler_shutdown blocked on
> the mutex, and workqueue pwq 46 at active=1024 refcnt=2062.
>
> > If SRCU GP remains independent, it would help distinguish whether
> > this is a strict deadlock or a form of workqueue starvation / lock
> > contention.
>
> Based on the data from both instances, SRCU GP is NOT remaining
> independent. process_srcu stays permanently pending on the affected
> per-CPU pool for 250-300 seconds. And it is not just process_srcu —
> ALL pending work on the pool is stuck, including items from the
> events, cgroup, mm, slub, and other workqueues.
>
> > A timestamp-correlated dump (blocked stacks + workqueue state +
> > SRCU GP activity) would likely be sufficient to classify this.
>
> I hope the correlated dumps above from both instances are helpful.
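For readers without the source handy, the shutdown path in these stacks
looks roughly like this (paraphrased from virt/kvm/eventfd.c and
simplified; the exact structure, and whether the plain or expedited
synchronize variant is used, varies across kernel versions):

static void irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
{
	struct kvm_kernel_irqfd_resampler *resampler = irqfd->resampler;
	struct kvm *kvm = resampler->kvm;

	/* The three waiters above sit here (offset 0x23). */
	mutex_lock(&kvm->irqfds.resampler_lock);

	list_del_rcu(&irqfd->resampler_link);

	/*
	 * The holder blocks here (offset 0xf0), waiting for an
	 * irq_srcu grace period while still holding resampler_lock.
	 * The grace period is driven by process_srcu, which in these
	 * dumps stays pending on the very pool this worker occupies.
	 */
	synchronize_srcu(&kvm->irq_srcu);

	if (list_empty(&resampler->list)) {
		list_del(&resampler->link);
		kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
		kfree(resampler);
	}

	mutex_unlock(&kvm->irqfds.resampler_lock);
}

Nothing here is wrong in isolation; as noted earlier in the thread,
sleeping in synchronize_srcu() under a mutex is generally valid. The
problem is purely that the execution context (a per-CPU
kvm-irqfd-cleanup worker) shares its pool with the work item that
would let synchronize_srcu() complete.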
> To summarize the timeline (consistent across both):
>
> t=0:     VM shutdown begins, crosvm detaches irqfds
> t=~14:   4 irqfd_shutdown work items queued on a WQ_PERCPU pool
>          One worker acquires resampler_lock, enters synchronize_srcu
>          Other 3 workers block on __mutex_lock
> t=~43:   First "BUG: workqueue lockup" — pool detected stuck
>          rcu_gp: process_srcu shown as "pending" on the same pool
> t=~93 to t=~312:
>          Repeated dumps every ~30s
>          process_srcu remains permanently "pending"
>          Pool has idle workers but no pending work executes
> t=~314:  Hung task dump confirms the mutex holder in __synchronize_srcu
> t=~316:  init triggers a sysrq crash → kernel panic

Thanks, this is useful and much clearer.

One thing that is still unclear is dispatch behavior: process_srcu
stays pending for a long time, while the same pwq dump shows idle
workers. So the key question is what prevents pending work from being
dispatched on that pwq. Is it due to:

1) the pwq's stalled/hung state,
2) worker availability/affinity constraints,
3) or another dispatch-side condition?

Also, for scope:

- your crash instances consistently show the shutdown path
  (irqfd_resampler_shutdown + synchronize_srcu),
- while the assign-path evidence, per the current thread data, appears
  to come from a separate stress case.

A time-aligned dump with pwq state, pending/in-flight lists, and worker
states should help clarify this.

> > Happy to help look at traces if available.
>
> I can share the full console-ramoops-0 and dmesg-ramoops-0 from both
> instances. Shall I post them or send them off-list?

If possible, please post sanitized ramoops/dmesg logs on-list so
others can validate.

Thanx,
Kunwu

> Thanks,
> Sonam
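P.S. For anyone following the SRCU side of this dependency: grace-period
progress for an srcu_struct is itself a work item, queued without an
explicit CPU and therefore on the local CPU's pool. Paraphrased from
kernel/rcu/srcutree.c (simplified; the structure varies by kernel
version):

static void srcu_reschedule(struct srcu_struct *ssp, unsigned long delay)
{
	/*
	 * No queue_delayed_work_on(): on a per-CPU workqueue this
	 * lands on the local CPU's pool, which in the dumps above is
	 * the same pool the four irqfd_shutdown workers occupy.
	 */
	queue_delayed_work(rcu_gp_wq, &ssp->work, delay);
}

/* The handler behind the "process_srcu" entries shown as pending. */
static void process_srcu(struct work_struct *work)
{
	struct srcu_struct *ssp;

	ssp = container_of(work, struct srcu_struct, work.work);

	/*
	 * Advances the grace period; __synchronize_srcu() callers
	 * cannot return until this has run.
	 */
	srcu_advance_state(ssp);
	srcu_reschedule(ssp, srcu_get_delay(ssp));
}

That queueing behavior is what allows process_srcu to land on the same
per-CPU pool as the irqfd_shutdown workers in the first place.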

