April 1, 2026 at 10:24 PM, "Sonam Sanju" <[email protected]> wrote:
> From: Sonam Sanju <[email protected]>
>
> On Wed, Apr 01, 2026 at 05:34:58PM +0800, Kunwu Chan wrote:
> >
> > Building on the discussion so far, it would be helpful from the SRCU
> > side to gather a bit more evidence to classify the issue.
> >
> > Calling synchronize_srcu_expedited() while holding a mutex is generally
> > valid, so the observed behavior may be workload-dependent.
> >
> > The reported deadlock seems to rely on the assumption that SRCU grace
> > period progress is indirectly blocked by irqfd workqueue saturation.
> > It would be good to confirm whether that assumption actually holds.
>
> I went back through our logs from two independent crash instances and
> can now provide data for each of your questions.
>
> > 1) Are SRCU GP kthreads/workers still making forward progress when
> > the system is stuck?
>
> No. In both crash instances, process_srcu work items remain permanently
> "pending" (never "in-flight") throughout the entire hang.
>
> Instance 1 — kernel 6.18.8, pool 14 (cpus=3):
>
> [   62.712760] workqueue rcu_gp: flags=0x108
> [   62.717801]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
> [   62.717801]     pending: 2*process_srcu
>
> [  187.735092] workqueue rcu_gp: flags=0x108            (125 seconds later)
> [  187.735093]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
> [  187.735093]     pending: 2*process_srcu              (still pending)
>
> 9 consecutive dumps from t=62s to t=312s — process_srcu never runs.
>
> Instance 2 — kernel 6.18.2, pool 22 (cpus=5):
>
> [   93.280711] workqueue rcu_gp: flags=0x108
> [   93.280713]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
> [   93.280716]     pending: process_srcu
>
> [  309.040801] workqueue rcu_gp: flags=0x108            (216 seconds later)
> [  309.040806]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
> [  309.040806]     pending: process_srcu                (still pending)
>
> 8 consecutive dumps from t=93s to t=341s — process_srcu never runs.
>
> In both cases, rcu_gp's process_srcu is bound to the SAME per-CPU pool
> where the kvm-irqfd-cleanup workers are blocked. Both pools have idle
> workers but are marked as hung/stalled:
>
> Instance 1: pool 14: cpus=3 hung=174s workers=11
>             idle: 4046 4038 4045 4039 4043 156 77        (7 idle)
> Instance 2: pool 22: cpus=5 hung=297s workers=12
>             idle: 4242 51 4248 4247 4245 435 4244 4239   (8 idle)
>
> > 2) How many irqfd workers are active in the reported scenario, and
> > can they saturate CPU or worker pools?
>
> 4 kvm-irqfd-cleanup workers in both instances, consistently across all
> dumps:
>
> Instance 1 (pool 14 / cpus=3):
>
> [   62.831877] workqueue kvm-irqfd-cleanup: flags=0x0
> [   62.837838]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=4 refcnt=5
> [   62.837838]     in-flight: 157:irqfd_shutdown, 4044:irqfd_shutdown,
>                               102:irqfd_shutdown, 39:irqfd_shutdown
>
> Instance 2 (pool 22 / cpus=5):
>
> [   93.280894] workqueue kvm-irqfd-cleanup: flags=0x0
> [   93.280896]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=4 refcnt=5
> [   93.280900]     in-flight: 151:irqfd_shutdown, 4246:irqfd_shutdown,
>                               4241:irqfd_shutdown, 4243:irqfd_shutdown
>
> These are from crosvm instances with multiple virtio devices
> (virtio-blk, virtio-net, virtio-input, etc.), each registering an irqfd
> with a resampler. During VM shutdown, all irqfds are detached
> concurrently, queueing that many irqfd_shutdown work items.
>
> The 4 workers are not saturating the CPU — they're all in D state. But
> they ARE all bound to the same per-CPU pool as rcu_gp's process_srcu
> work.
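One note on mechanics here, mostly for readers following along: the
userspace pattern that produces this burst of work items is one
KVM_IRQFD assign per virtio queue at setup and one deassign per queue
at teardown. A minimal illustrative sketch (error handling omitted;
crosvm's actual code of course differs):

#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Register an irqfd with a resampler, as each virtio device does. */
static int assign_resampling_irqfd(int vm_fd, unsigned int gsi)
{
	struct kvm_irqfd req;

	memset(&req, 0, sizeof(req));
	req.fd = eventfd(0, 0);              /* interrupt trigger */
	req.resamplefd = eventfd(0, 0);      /* EOI notification */
	req.gsi = gsi;
	req.flags = KVM_IRQFD_FLAG_RESAMPLE;
	return ioctl(vm_fd, KVM_IRQFD, &req);
}

/* Detach it again; in-kernel this queues an irqfd_shutdown item. */
static int deassign_irqfd(int vm_fd, int fd, unsigned int gsi)
{
	struct kvm_irqfd req;

	memset(&req, 0, sizeof(req));
	req.fd = fd;
	req.gsi = gsi;
	req.flags = KVM_IRQFD_FLAG_DEASSIGN;
	return ioctl(vm_fd, KVM_IRQFD, &req);
}

Calling deassign_irqfd() back-to-back for every device at shutdown is
what lines up the four concurrent irqfd_shutdown items in your dumps.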
> > 3) Do we have a concrete wait-for cycle showing that tasks blocked
> > on resampler_lock are in turn preventing SRCU GP completion?
>
> Yes, in both instances the hung task dump identifies the mutex holder
> stuck in synchronize_srcu, with the other workers waiting on the mutex.
>
> Instance 1 (t=314s):
>
> Worker pid 4044 — MUTEX HOLDER, stuck in synchronize_srcu:
>
> [  315.963979] task:kworker/3:8  state:D  pid:4044
> [  315.977125] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> [  316.012504]  __synchronize_srcu+0x100/0x130
> [  316.023157]  irqfd_resampler_shutdown+0xf0/0x150   <-- offset 0xf0 (synchronize_srcu)
>
> Workers pid 39, 102, 157 — MUTEX WAITERS:
>
> [  314.793025] task:kworker/3:4  state:D  pid:157
> [  314.837472]  __mutex_lock+0x409/0xd90
> [  314.843100]  irqfd_resampler_shutdown+0x23/0x150   <-- offset 0x23 (mutex_lock)
>
> Instance 2 (t=343s):
>
> Worker pid 4241 — MUTEX HOLDER, stuck in synchronize_srcu:
>
> [  343.193294] task:kworker/5:4  state:D  pid:4241
> [  343.193299] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> [  343.193328]  __synchronize_srcu+0x100/0x130
> [  343.193335]  irqfd_resampler_shutdown+0xf0/0x150   <-- offset 0xf0 (synchronize_srcu)
>
> Workers pid 151, 4243, 4246 — MUTEX WAITERS:
>
> [  343.193369] task:kworker/5:6  state:D  pid:4243
> [  343.193397]  __mutex_lock+0x37d/0xbb0
> [  343.193397]  irqfd_resampler_shutdown+0x23/0x150   <-- offset 0x23 (mutex_lock)
>
> Both instances show the identical wait-for cycle:
>
> 1. One worker holds resampler_lock and blocks in __synchronize_srcu
>    (waiting for an SRCU grace period).
> 2. The SRCU GP needs process_srcu to run, but it stays "pending"
>    on the same pool.
> 3. The other irqfd workers block in __mutex_lock on the same pool.
> 4. The pool is marked "hung" and no pending work makes progress
>    for 250-300 seconds, until the kernel panics.
>
> > 4) Is the behavior reproducible in both irqfd_resampler_shutdown()
> > and kvm_irqfd_assign() paths?
>
> In our 4 crash instances the stuck mutex holder is always in
> irqfd_resampler_shutdown() at offset 0xf0 (synchronize_srcu). This
> is consistent — these are all VM shutdown scenarios where only
> irqfd_shutdown workqueue items run.
>
> The kvm_irqfd_assign() path was identified by Vineeth Pillai (Google)
> during a VM create/destroy stress test where assign and shutdown race.
> His traces showed kvm_irqfd (the assign path) stuck in
> synchronize_srcu_expedited with irqfd_resampler_shutdown blocked on
> the mutex, and workqueue pwq 46 at active=1024 refcnt=2062.
>
> > If SRCU GP remains independent, it would help distinguish whether
> > this is a strict deadlock or a form of workqueue starvation / lock
> > contention.
>
> Based on the data from both instances, SRCU GP is NOT remaining
> independent. process_srcu stays permanently pending on the affected
> per-CPU pool for 250-300 seconds. And it is not just process_srcu —
> ALL pending work on the pool is stuck, including items from the
> events, cgroup, mm, slub, and other workqueues.
>
> > A timestamp-correlated dump (blocked stacks + workqueue state +
> > SRCU GP activity) would likely be sufficient to classify this.
>
> I hope the correlated dumps above from both instances are helpful.
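For readers without the source handy, the shutdown path in these stacks
looks roughly like this (paraphrased from virt/kvm/eventfd.c and
simplified; the exact structure, and whether the plain or expedited
synchronize variant is used, varies across kernel versions):

static void irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
{
	struct kvm_kernel_irqfd_resampler *resampler = irqfd->resampler;
	struct kvm *kvm = resampler->kvm;

	/* The three waiters above sit here (offset 0x23). */
	mutex_lock(&kvm->irqfds.resampler_lock);

	list_del_rcu(&irqfd->resampler_link);

	/*
	 * The holder blocks here (offset 0xf0), waiting for an
	 * irq_srcu grace period while still holding resampler_lock.
	 * The grace period is driven by process_srcu, which in these
	 * dumps stays pending on the very pool this worker occupies.
	 */
	synchronize_srcu(&kvm->irq_srcu);

	if (list_empty(&resampler->list)) {
		list_del(&resampler->link);
		kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
		kfree(resampler);
	}

	mutex_unlock(&kvm->irqfds.resampler_lock);
}

Nothing here is wrong in isolation; as noted earlier in the thread,
sleeping in synchronize_srcu() under a mutex is generally valid. The
problem is purely that the execution context (a per-CPU
kvm-irqfd-cleanup worker) shares its pool with the work item that
would let synchronize_srcu() complete.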
> To summarize the timeline (consistent across both):
>
> t=0:     VM shutdown begins, crosvm detaches irqfds
> t=~14:   4 irqfd_shutdown work items queued on a WQ_PERCPU pool
>          One worker acquires resampler_lock, enters synchronize_srcu
>          Other 3 workers block on __mutex_lock
> t=~43:   First "BUG: workqueue lockup" — pool detected stuck
>          rcu_gp: process_srcu shown as "pending" on the same pool
> t=~93 to t=~312:
>          Repeated dumps every ~30s
>          process_srcu remains permanently "pending"
>          Pool has idle workers but no pending work executes
> t=~314:  Hung task dump confirms the mutex holder in __synchronize_srcu
> t=~316:  init triggers a sysrq crash → kernel panic

Thanks, this is useful and much clearer.

One thing that is still unclear is dispatch behavior: process_srcu
stays pending for a long time, while the same pwq dump shows idle
workers. So the key question is what prevents pending work from being
dispatched on that pwq. Is it due to:

1) the pwq's stalled/hung state,
2) worker availability/affinity constraints,
3) or another dispatch-side condition?

Also, for scope:

- your crash instances consistently show the shutdown path
  (irqfd_resampler_shutdown + synchronize_srcu),
- while the assign-path evidence, per the current thread data, appears
  to come from a separate stress case.

A time-aligned dump with pwq state, pending/in-flight lists, and worker
states should help clarify this.

> > Happy to help look at traces if available.
>
> I can share the full console-ramoops-0 and dmesg-ramoops-0 from both
> instances. Shall I post them or send them off-list?

If possible, please post sanitized ramoops/dmesg logs on-list so
others can validate.

Thanx,
Kunwu

> Thanks,
> Sonam
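P.S. For anyone following the SRCU side of this dependency: grace-period
progress for an srcu_struct is itself a work item, queued without an
explicit CPU and therefore on the local CPU's pool. Paraphrased from
kernel/rcu/srcutree.c (simplified; the structure varies by kernel
version):

static void srcu_reschedule(struct srcu_struct *ssp, unsigned long delay)
{
	/*
	 * No queue_delayed_work_on(): on a per-CPU workqueue this
	 * lands on the local CPU's pool, which in the dumps above is
	 * the same pool the four irqfd_shutdown workers occupy.
	 */
	queue_delayed_work(rcu_gp_wq, &ssp->work, delay);
}

/* The handler behind the "process_srcu" entries shown as pending. */
static void process_srcu(struct work_struct *work)
{
	struct srcu_struct *ssp;

	ssp = container_of(work, struct srcu_struct, work.work);

	/*
	 * Advances the grace period; __synchronize_srcu() callers
	 * cannot return until this has run.
	 */
	srcu_advance_state(ssp);
	srcu_reschedule(ssp, srcu_get_delay(ssp));
}

That queueing behavior is what allows process_srcu to land on the same
per-CPU pool as the irqfd_shutdown workers in the first place.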

