Public bug reported:

[SRU Justification]

[Impact]

Systems on Jammy running high-throughput DMA workloads experience soft lockups
and RCU stalls in fq_flush_timeout, which result in system hangs.

The IOVA allocator in the 5.15 kernel uses a per-CPU magazine cache (rcache) to
avoid expensive rbtree operations. Each CPU has two magazines of 128 PFNs; when
both are full, the primary "loaded" magazine is pushed to a global depot (a
fixed-size array of 32 magazines per size-bin). When the depot is also full, the
overflow magazine is freed via iova_magazine_free_pfns(), which acquires
iova_rbtree_lock and performs up to 128 rbtree lookups and removals while
holding it.
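
The insert path described above can be approximated with a small userspace
model. This is only a sketch: it tracks magazine fill levels rather than real
PFNs, ignores the case where allocating a replacement magazine fails, and the
names rcache_model, rcache_insert_model and rcache_insert_many are hypothetical
and do not exist in the kernel.

```c
#include <assert.h>

#define MAG_SIZE        128  /* PFNs per magazine, as described above */
#define MAX_GLOBAL_MAGS 32   /* depot slots per size bin, as described above */

/* Hypothetical count-only model: a magazine is represented by its fill level. */
struct rcache_model {
    unsigned loaded;      /* PFNs in the "loaded" magazine */
    unsigned prev;        /* PFNs in the second ("prev") magazine */
    unsigned depot_size;  /* full magazines parked in the depot */
};

/*
 * Insert one PFN. Returns the number of rbtree removals this insert would
 * perform under iova_rbtree_lock: 0 on the fast path, MAG_SIZE on overflow.
 */
static unsigned rcache_insert_model(struct rcache_model *rc)
{
    unsigned rbtree_ops = 0;

    if (rc->loaded == MAG_SIZE && rc->prev < MAG_SIZE) {
        /* Swap the magazines so that "loaded" has room again. */
        unsigned tmp = rc->loaded;
        rc->loaded = rc->prev;
        rc->prev = tmp;
    } else if (rc->loaded == MAG_SIZE) {
        /* Both magazines are full: retire the loaded one. */
        if (rc->depot_size < MAX_GLOBAL_MAGS)
            rc->depot_size++;       /* cheap: park it in the depot */
        else
            rbtree_ops = MAG_SIZE;  /* expensive: free every PFN inline */
        rc->loaded = 0;             /* a fresh, empty magazine takes over */
    }

    assert(rc->loaded < MAG_SIZE);
    rc->loaded++;
    return rbtree_ops;
}

/* Insert n PFNs and return the total number of modelled rbtree removals. */
static unsigned long rcache_insert_many(struct rcache_model *rc, unsigned long n)
{
    unsigned long total = 0;

    while (n--)
        total += rcache_insert_model(rc);
    return total;
}
```

Starting from an empty model, the first 256 + 32 * 128 = 4352 inserts stay on
the fast path; the next insert is the first to pay for 128 rbtree removals.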

The problem manifests through the flush-queue timer. Every 10ms,
fq_flush_timeout fires in softirq context and drains all CPUs' flush queues in a
single non-preemptible loop. Because __iova_rcache_insert uses raw_cpu_ptr(),
all recycled IOVAs are funnelled into the timer CPU's magazines. Once those
magazines and the shared depot are full, every subsequent overflow triggers
the expensive iova_magazine_free_pfns, resulting in up to 128 rbtree operations
under iova_rbtree_lock, all within the same softirq:

  fq_flush_timeout (timer softirq on CPU X)
    iova_domain_flush
    for_each_possible_cpu(cpu):
      fq_ring_free (up to IOVA_FQ_SIZE=256 entries)
        free_iova_fast
          __iova_rcache_insert (into CPU X's rcache via raw_cpu_ptr)
            if depot_size >= 32:
              iova_magazine_free_pfns (128 rbtree ops under iova_rbtree_lock)
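
To put a rough number on the call chain above, the following userspace sketch
counts the work done by one timer run on a saturated system. It assumes that
the timer CPU's magazines and the depot are already full, that every freed
entry belongs to the same size bin, and that each flush-queue entry results in
one rcache insert. The names drain_stats and model_timer_drain are
hypothetical.

```c
#include <assert.h>

#define MAG_SIZE     128  /* PFNs per magazine */
#define IOVA_FQ_SIZE 256  /* flush-queue entries per CPU, per the trace above */

struct drain_stats {
    unsigned long rbtree_ops;  /* removals done under iova_rbtree_lock */
    unsigned long lock_holds;  /* separate overflow frees, each taking the lock */
};

/*
 * Hypothetical model of one timer run: all frees from every CPU's flush
 * queue land in the timer CPU's rcache, whose magazines and depot are full.
 */
static struct drain_stats model_timer_drain(unsigned ncpus,
                                            unsigned entries_per_cpu)
{
    struct drain_stats st = {0, 0};
    unsigned loaded = MAG_SIZE;  /* fill level of the loaded magazine */

    assert(entries_per_cpu <= IOVA_FQ_SIZE);

    for (unsigned cpu = 0; cpu < ncpus; cpu++) {
        for (unsigned i = 0; i < entries_per_cpu; i++) {
            if (loaded == MAG_SIZE) {
                /* prev and depot are full too: free the magazine inline. */
                st.rbtree_ops += MAG_SIZE;
                st.lock_holds++;
                loaded = 0;
            }
            loaded++;
        }
    }
    return st;
}
```

Under these assumptions, a 64-CPU system with full flush queues performs
64 * 256 = 16384 rbtree removals in a single softirq, in 128 bursts of 128
removals each.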

The RCU stall trace from an affected system on 5.15.0-117 confirms this exact
path with reliable stack frames:

  native_queued_spin_lock_slowpath+0x2c/0x40
  _raw_spin_lock_irqsave+0x3d/0x50
  iova_magazine_free_pfns.part.0+0x20/0xd0
  free_iova_fast+0x219/0x290
  fq_ring_free+0xa8/0x170
  fq_flush_timeout+0x74/0xc0
  call_timer_fn
  run_timer_softirq
  __do_softirq

[Fix]

Backport upstream commits, adapted for the 5.15 codebase:
1. 911aa1245da8 ("iommu/iova: Make the rcache depot scale better")
2. 233045378dbb ("iommu/iova: Manage the depot list size")

Cherry-pick upstream commit:
3. 7591c127f3b1 ("kmemleak: iommu/iova: fix transient kmemleak false positive")

Patch 1 replaces the fixed-size depot array with an unbounded singly-linked
list. Full magazines are always pushed to the depot, regardless of how many it
already holds. As a result, the overflow path and its inline call to
iova_magazine_free_pfns are eliminated from __iova_rcache_insert.
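
The idea behind patch 1 can be sketched in userspace as a plain LIFO list.
This is a simplified illustration, not the kernel code: the struct layout, the
id field, and the names mag, depot, depot_push and depot_pop are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified magazine: only a list link and an id. */
struct mag {
    struct mag *next;
    int id;
};

struct depot {
    struct mag *head;    /* singly-linked list of full magazines */
    unsigned long size;  /* number of magazines on the list */
};

/* Push has no capacity check, so it never needs to free PFNs inline. */
static void depot_push(struct depot *d, struct mag *m)
{
    assert(m != NULL);
    m->next = d->head;
    d->head = m;
    d->size++;
}

/* Pop the most recently pushed magazine, or NULL if the depot is empty. */
static struct mag *depot_pop(struct depot *d)
{
    struct mag *m = d->head;

    if (!m)
        return NULL;
    d->head = m->next;
    m->next = NULL;
    d->size--;
    return m;
}
```

Because the push is a constant-time pointer update, the insert path no longer
has a branch that takes iova_rbtree_lock.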

Patch 2 prevents unbounded memory growth of the now-unlimited depot by adding a
delayed_work item (run on a background workqueue) that trims the depot when it
holds more than num_online_cpus() magazines. This reclaim runs in process
context, which is preemptible and may sleep, so it cannot cause soft lockups.
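
The trimming behaviour can be modelled as follows. This sketch assumes that
each worker run releases at most one magazine and then re-arms itself, which
is not stated above; the names depot_model, depot_trim_step and depot_trim_all
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct depot_model {
    unsigned long depot_size;  /* magazines currently in the depot */
    unsigned long freed_mags;  /* magazines released by the worker so far */
};

/*
 * One worker invocation: release at most one magazine if the depot is above
 * the threshold. Returns true when the worker would re-arm itself.
 */
static bool depot_trim_step(struct depot_model *d, unsigned long online_cpus)
{
    assert(online_cpus > 0);

    if (d->depot_size <= online_cpus)
        return false;  /* at or below the threshold: go idle */

    d->depot_size--;   /* take one magazine off the depot */
    d->freed_mags++;   /* its PFNs are freed in process context */
    return true;       /* assumed: the delayed work is scheduled again */
}

/* Run the worker until it goes idle; returns how many runs freed a magazine. */
static unsigned long depot_trim_all(struct depot_model *d,
                                    unsigned long online_cpus)
{
    unsigned long runs = 0;

    while (depot_trim_step(d, online_cpus))
        runs++;
    return runs;
}
```

With 8 online CPUs and 40 magazines in the depot, the model frees 32 magazines
over 32 separate runs and then stops, leaving 8 cached.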

Patch 3 fixes a kmemleak false positive introduced by patch 1.

Adaptations made for 5.15 backport:

- Patches 1 and 2 modify both drivers/iommu/iova.c and include/linux/iova.h
  because in 5.15, struct iova_rcache is defined in the header (upstream moved
  it into iova.c in a prior refactoring series not present in 5.15).
- The rcache init function in 5.15 is init_iova_rcaches() (static void, called
  unconditionally from init_iova_domain) rather than upstream's
  iova_domain_init_rcaches() (exported, returns int with error cleanup). The
  backport preserves the 5.15 function signature and error handling pattern.
- 5.15 uses top-of-function variable declarations rather than upstream's C99
  in-loop declarations.
- The core logic (depot linked-list, overflow elimination, background worker) is
  identical between upstream and the backport.

[Test Plan]

TODO

[Where problems could occur]

Regression risk is low as changes in patches 1 and 2 are confined to the IOVA
rcache depot internals (drivers/iommu/iova.c and include/linux/iova.h). No
changes have been made to IOVA allocation or free semantics from the caller's
perspective. Patch 3 is purely diagnostic and has no runtime effect. Moreover,
the fix is already available on Noble and Resolute, where it has been thoroughly
tested.

[Other Info]

Similar issues have been reported in [0], [1], and [2]. The fix has already been
integrated into Noble and subsequent releases. Backporting this fix ensures
stability for users of the 5.15 kernel.

[0] - https://lkml.rescloud.iu.edu/2304.1/01286.html
[1] - https://mailweb.openeuler.org/archives/list/[email protected]/message/FAOBDKYWJ5SNADM625H2A4YCOPRAIRGB/
[2] - https://access.redhat.com/solutions/7031930

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: Fix Released

** Affects: linux (Ubuntu Jammy)
     Importance: Undecided
     Assignee: Munir Siddiqui (munirsid)
         Status: In Progress

** Affects: linux (Ubuntu Noble)
     Importance: Undecided
         Status: Fix Released

** Affects: linux (Ubuntu Resolute)
     Importance: Undecided
         Status: Fix Released

** Also affects: linux (Ubuntu Jammy)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Resolute)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Noble)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu Jammy)
       Status: New => In Progress

** Changed in: linux (Ubuntu Noble)
       Status: New => Fix Released

** Changed in: linux (Ubuntu Resolute)
       Status: New => Fix Released

** Changed in: linux (Ubuntu)
       Status: New => Fix Released

** Changed in: linux (Ubuntu Jammy)
     Assignee: (unassigned) => Munir Siddiqui (munirsid)

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2158106

Title:
  [Jammy] soft lockups and rcu stalls in fq_flush_timeout causing system
  hangs


