Hi Vineeth, Kunwu, Tejun,

I collected new crash logs with additional debug instrumentation in
wq_worker_sleeping(), kick_pool(), and show_one_worker_pool() to
capture pool state during the hang. The results conclusively confirm
Vineeth's preemption race theory.

From the new logs:

1. Pool dump with nr_running/nr_idle (added instrumentation):

   pool 10: cpus=2 flags=0x0 hung=201s workers=11 nr_running=1 nr_idle=5

   11 workers, 5 idle, 6 in D-state (all irqfd_shutdown) -- yet
   nr_running=1. No worker is actually running on CPU 2.

2. NMI backtrace confirms CPU 2 is completely idle:

   NMI backtrace for cpu 2 skipped: idling at intel_idle+0x57/0xa0

   So nr_running=1 is a phantom count -- no worker is running, but
   the pool thinks one is.

3. The stack of the first stuck worker (kworker/2:0, PID 33) captures
   the preemption inside wq_worker_sleeping():

   kworker/2:0  state:D  Workqueue: kvm-irqfd-cleanup irqfd_shutdown
     __schedule+0x87a/0xd60
     preempt_schedule_irq+0x4a/0x90
     asm_fred_entrypoint_kernel+0x41/0x70
     ___ratelimit+0x1a1/0x1f0            <-- inside pr_info_ratelimited
     wq_worker_sleeping+0x53/0x190       <-- preempted HERE
     schedule+0x30/0xe0
     schedule_preempt_disabled+0x10/0x20
     __mutex_lock+0x413/0xe40
     irqfd_resampler_shutdown+0x53/0x200
     irqfd_shutdown+0xfa/0x190

   This confirms the exact race: a reschedule IPI interrupted
   wq_worker_sleeping() after worker->sleeping was set to 1 but
   before pool->nr_running was decremented. The preemption triggered
   wq_worker_running(), which incremented nr_running (1->2); on
   resume the decrement brought it back to 1 instead of 0. (A sketch
   of this window follows the list below.)

4. The second pool dump 31 seconds later shows the stall is permanent:

   pool 10: cpus=2 flags=0x0 hung=232s workers=11 nr_running=1 nr_idle=5

   Same phantom nr_running=1, hung time growing.

5. The deadlock chain:
   - PID 33: holds resampler_lock mutex, stuck in wq_worker_sleeping
   - PID 520: past mutex, stuck in synchronize_srcu_expedited
   - PIDs 120, 4792, 4793, 4796: waiting on resampler_lock mutex
   - crosvm_vcpu2: waiting in kvm_vm_release -> __flush_workqueue
   - init (PID 1): stuck in pci_device_shutdown -> __flush_work
   - Multiple userspace processes stuck in fsnotify_destroy_group
   - Reboot thread timed out, system triggered sysrq crash

6. The kick_pool_skip debug print fired for other pools but NOT for
   pool 10 -- because need_more_worker() was never true (nr_running
   was never 0), so kick_pool() was never even called for this pool;
   see the sketch just below.
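
To see why that is fatal, need_more_worker() keys entirely off
nr_running (paraphrasing kernel/workqueue.c; the exact form varies by
kernel version):

  static bool need_more_worker(struct worker_pool *pool)
  {
      /* work is queued but nobody is (supposedly) running */
      return !list_empty(&pool->worklist) && !pool->nr_running;
  }

With the phantom nr_running=1 this can never become true for pool 10,
so none of the 5 idle workers is ever woken even though CPU 2 sits
idle.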
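
And to make the race in (3) concrete, here is the vulnerable window in
the current wq_worker_sleeping() (a simplified paraphrase of
kernel/workqueue.c with the rescuer/NOT_RUNNING checks elided and our
debug print shown; not the exact upstream source):

  void wq_worker_sleeping(struct task_struct *task)
  {
      struct worker *worker = kthread_data(task);
      struct worker_pool *pool = worker->pool;

      /* Return if preempted before wq_worker_running() was reached. */
      if (READ_ONCE(worker->sleeping))
          return;

      WRITE_ONCE(worker->sleeping, 1);      /* window opens */

      /*
       * Our debug pr_info_ratelimited() sat here. The reschedule IPI
       * landed in ___ratelimit(), preempt_schedule_irq() ran, and
       * wq_worker_running() saw sleeping==1 and did nr_running++
       * (1 -> 2) before clearing sleeping.
       */

      raw_spin_lock_irq(&pool->lock);       /* window closes: IRQs off */

      pool->nr_running--;                   /* on resume: 2 -> 1, not 0 */
      ...
      raw_spin_unlock_irq(&pool->lock);
  }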

Regarding a fix, one option is a workqueue-level change in
wq_worker_sleeping() itself:

  void wq_worker_sleeping(struct task_struct *task)
  {
      ...
      if (READ_ONCE(worker->sleeping))
          return;

  +   preempt_disable();
      WRITE_ONCE(worker->sleeping, 1);
      raw_spin_lock_irq(&pool->lock);

      if (worker->flags & WORKER_NOT_RUNNING) {
          raw_spin_unlock_irq(&pool->lock);
  +       preempt_enable();
          return;
      }

      pool->nr_running--;
      if (kick_pool(pool))
          worker->current_pwq->stats[PWQ_STAT_CM_WAKEUP]++;

      raw_spin_unlock_irq(&pool->lock);
  +   preempt_enable();
  }

The idea is to keep preemption disabled from the point where
worker->sleeping is set to 1 until pool->lock is held (which disables
IRQs), so a reschedule IPI arriving in that window can no longer
trigger preempt_schedule_irq(). Note that wq_worker_running() already
uses preempt_disable()/preempt_enable() around its nr_running++ to
close a similar race against unbind_workers().
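
For comparison, the existing guard in wq_worker_running() (a trimmed
paraphrase of kernel/workqueue.c; the upstream comment and the
CPU-intensive bookkeeping are elided):

  void wq_worker_running(struct task_struct *task)
  {
      struct worker *worker = kthread_data(task);

      if (!READ_ONCE(worker->sleeping))
          return;

      /*
       * Keep the NOT_RUNNING check and the increment atomic vs.
       * preemption so unbind_workers() cannot slip in between.
       */
      preempt_disable();
      if (!(worker->flags & WORKER_NOT_RUNNING))
          worker->pool->nr_running++;
      preempt_enable();

      WRITE_ONCE(worker->sleeping, 0);
  }

The proposed change just makes wq_worker_sleeping() symmetric with
that.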

Does this approach look correct to you?


Thanks,
Sonam
