Hi Vineeth, Kunwu, Tejun,
Collected new crash logs with additional debug instrumentation in
wq_worker_sleeping(), kick_pool(), and show_one_worker_pool() to capture
pool state during the hang. The results conclusively confirm Vineeth's
preemption race theory.
From the new logs:
1. Pool dump with nr_running/nr_idle (added instrumentation):
pool 10: cpus=2 flags=0x0 hung=201s workers=11 nr_running=1 nr_idle=5
11 workers, 5 idle, 6 in D-state (all irqfd_shutdown) -- yet
nr_running=1. No worker is actually running on CPU 2.
2. NMI backtrace confirms CPU 2 is completely idle:
NMI backtrace for cpu 2 skipped: idling at intel_idle+0x57/0xa0
So nr_running=1 is a phantom count -- no worker is running, but
the pool thinks one is.
3. The first stuck worker (kworker/2:0, PID 33) shows the preemption
in wq_worker_sleeping:
kworker/2:0 state:D Workqueue: kvm-irqfd-cleanup irqfd_shutdown
__schedule+0x87a/0xd60
preempt_schedule_irq+0x4a/0x90
asm_fred_entrypoint_kernel+0x41/0x70
___ratelimit+0x1a1/0x1f0 <-- inside pr_info_ratelimited
wq_worker_sleeping+0x53/0x190 <-- preempted HERE
schedule+0x30/0xe0
schedule_preempt_disabled+0x10/0x20
__mutex_lock+0x413/0xe40
irqfd_resampler_shutdown+0x53/0x200
irqfd_shutdown+0xfa/0x190
This confirms the exact race: a reschedule IPI interrupted
wq_worker_sleeping() after worker->sleeping was set to 1 but
before pool->nr_running was decremented. The worker was scheduled
out; on the subsequent wakeup, wq_worker_running() saw sleeping==1
and incremented nr_running (1->2). When wq_worker_sleeping()
resumed, its decrement brought the count back to 1 instead of 0.
4. The second pool dump 31 seconds later shows the stall is permanent:
pool 10: cpus=2 flags=0x0 hung=232s workers=11 nr_running=1 nr_idle=5
Same phantom nr_running=1, hung time growing.
5. The deadlock chain:
- PID 33: holds resampler_lock mutex, stuck in wq_worker_sleeping
- PID 520: past mutex, stuck in synchronize_srcu_expedited
- PIDs 120, 4792, 4793, 4796: waiting on resampler_lock mutex
- crosvm_vcpu2: waiting in kvm_vm_release -> __flush_workqueue
- init (PID 1): stuck in pci_device_shutdown -> __flush_work
- Multiple userspace processes stuck in fsnotify_destroy_group
- Reboot thread timed out, system triggered sysrq crash
6. kick_pool_skip debug print fired for other pools but NOT for
pool 10 -- because need_more_worker() was never true (nr_running
was never 0), so kick_pool() was never even called for this pool.
As for a fix, one option is to handle this at the workqueue level,
in wq_worker_sleeping() itself:
 void wq_worker_sleeping(struct task_struct *task)
 {
 	...
 	if (READ_ONCE(worker->sleeping))
 		return;

+	preempt_disable();
 	WRITE_ONCE(worker->sleeping, 1);
 	raw_spin_lock_irq(&pool->lock);

 	if (worker->flags & WORKER_NOT_RUNNING) {
 		raw_spin_unlock_irq(&pool->lock);
+		preempt_enable();
 		return;
 	}

 	pool->nr_running--;
 	if (kick_pool(pool))
 		worker->current_pwq->stats[PWQ_STAT_CM_WAKEUP]++;
 	raw_spin_unlock_irq(&pool->lock);
+	preempt_enable();
 }
The idea is to keep preemption disabled from the point sleeping is
set to 1 until we hold the pool lock (which disables IRQs). This
prevents the reschedule IPI from triggering preempt_schedule_irq()
inside that window. Note that
wq_worker_running() already uses preempt_disable/enable around its
nr_running++ for a similar race against unbind_workers().
Does this approach look correct to you?
Thanks,
Sonam