Public bug reported:

BugLink: https://bugs.launchpad.net/bugs/2064999

[Impact]

On systems with IOMMU enabled, every streaming DMA mapping involves an
IOVA to be allocated and freed. For small mappings, IOVA sizes are
normally cached, so IOVA allocations complete in a reasonable time. For
larger mappings, things can be significantly slower to the point where
softlockups occur due to lock contention on iova_rbtree_lock.

commit 9257b4a206fc ("iommu/iova: introduce per-cpu caching to iova allocation")
introduced a scalable per-CPU IOVA caching mechanism, which improves
performance for mappings of up to 128 KB.
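
For context, here is a minimal sketch of where the 128 KB figure comes
from, assuming a 4 KB page size and the upstream IOVA rcache covering
allocation orders 0 through 5 (i.e. an IOVA_RANGE_CACHE_MAX_SIZE of 6 in
drivers/iommu/iova.c; treat these constants as assumptions rather than a
quote of the Ubuntu kernel source):

#include <stdio.h>

/*
 * Illustration only: IOVA allocations of up to
 * 2^(IOVA_RANGE_CACHE_MAX_SIZE - 1) pages are served from the caches;
 * anything larger falls back to the rbtree protected by
 * iova_rbtree_lock.
 */
int main(void)
{
        unsigned long page_kb = 4;              /* typical PAGE_SIZE / 1024 */
        unsigned long max_cached_order = 6 - 1; /* IOVA_RANGE_CACHE_MAX_SIZE - 1 */
        unsigned long max_cached_kb = page_kb << max_cached_order;

        printf("largest cached IOVA size: %lu KB\n", max_cached_kb); /* prints 128 */
        return 0;
}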

On systems that do larger streaming DMA mappings, e.g. an NVMe device
with:

/sys/block/nvme0n1/queue/max_hw_sectors_kb
2048

a 2048 KB mapping takes significantly longer, causing lock contention on
iova_rbtree_lock, as other devices, such as Ethernet NICs, are also
trying to acquire the lock.

We hit the following soft lockup:

watchdog: BUG: soft lockup - CPU#60 stuck for 24s!
CPU: 60 PID: 608304 Comm: segment-merger- Tainted: P        W   EL    
5.15.0-76-generic #83~20.04.1-Ubuntu
RIP: 0010:_raw_spin_unlock_irqrestore+0x25/0x30
Call Trace:
 <IRQ>
 fq_flush_timeout+0x82/0xc0
 ? fq_ring_free+0x170/0x170
 call_timer_fn+0x2e/0x120
 run_timer_softirq+0x433/0x4c0
 ? lapic_next_event+0x21/0x30
 ? clockevents_program_event+0xab/0x130
 __do_softirq+0xdd/0x2ee
 irq_exit_rcu+0x7d/0xa0
 sysvec_apic_timer_interrupt+0x80/0x90
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x1b/0x20
RIP: 0010:_raw_spin_unlock_irqrestore+0x25/0x30
...
 alloc_iova+0x1d8/0x1f0
 alloc_iova_fast+0x5c/0x3a0
 iommu_dma_alloc_iova.isra.0+0x128/0x170
 ? __kmalloc+0x1ab/0x4b0
 iommu_dma_map_sg+0x1a4/0x4c0
 __dma_map_sg_attrs+0x72/0x80
 dma_map_sg_attrs+0xe/0x20
 nvme_map_data+0xde/0x800 [nvme]
 ? recalibrate_cpu_khz+0x10/0x10
 ? ktime_get+0x46/0xc0
 nvme_queue_rq+0xaf/0x1f0 [nvme]
 ? __update_load_avg_se+0x2a2/0x2c0
 __blk_mq_try_issue_directly+0x15b/0x200
 blk_mq_request_issue_directly+0x51/0xa0
 blk_mq_try_issue_list_directly+0x7f/0xf0
 blk_mq_sched_insert_requests+0xa4/0xf0
 blk_mq_flush_plug_list+0x103/0x1c0
 blk_flush_plug_list+0xe3/0x110
 blk_mq_submit_bio+0x29d/0x600
 __submit_bio+0x1e5/0x220
 ? ext4_inode_block_valid+0x9f/0xc0
 submit_bio_noacct+0xac/0x2c0
 ? xa_load+0x61/0xa0
 submit_bio+0x50/0x140
 ext4_mpage_readpages+0x6a2/0xe20
 ? __mod_lruvec_page_state+0x6b/0xb0
 ext4_readahead+0x37/0x40
 read_pages+0x95/0x280
 page_cache_ra_unbounded+0x161/0x220
 do_page_cache_ra+0x3d/0x50
 ondemand_readahead+0x137/0x330
 page_cache_async_ra+0xa6/0xd0
 filemap_get_pages+0x224/0x660
 ? filemap_get_pages+0x9e/0x660
 filemap_read+0xbe/0x410
 generic_file_read_iter+0xe5/0x150
 ext4_file_read_iter+0x5b/0x190
 new_sync_read+0x110/0x1a0
 vfs_read+0x102/0x1a0
 ksys_pread64+0x71/0xa0
 __x64_sys_pread64+0x1e/0x30
 unload_network_ops_symbols+0xc4de/0xf750 [falcon_lsm_pinned_15907]
 do_syscall_64+0x5c/0xc0
 ? do_syscall_64+0x69/0xc0
 ? do_syscall_64+0x69/0xc0
 entry_SYSCALL_64_after_hwframe+0x61/0xcb

A workaround is to disable the IOMMU with "iommu=off amd_iommu=off" on
the kernel command line.

[Fix]

The fix is to clamp max_hw_sectors to the largest IOVA size that still
fits in the IOVA cache, so that allocating and freeing IOVAs during
streaming DMA mapping remains fast.

The fix requires two dependency commits, which introduce a function to
find the optimal value, dma_opt_mapping_size().

commit a229cc14f3395311b899e5e582b71efa8dd01df0
Author: John Garry <john.ga...@huawei.com>
Date:   Thu Jul 14 19:15:24 2022 +0800
Subject: dma-mapping: add dma_opt_mapping_size()
Link: 
https://github.com/torvalds/linux/commit/a229cc14f3395311b899e5e582b71efa8dd01df0

commit 6d9870b7e5def2450e21316515b9efc0529204dd
Author: John Garry <john.ga...@huawei.com>
Date:   Thu Jul 14 19:15:25 2022 +0800
Subject: dma-iommu: add iommu_dma_opt_mapping_size()
Link: 
https://github.com/torvalds/linux/commit/6d9870b7e5def2450e21316515b9efc0529204dd

The dependencies are present in 6.0-rc1 and later.
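
Paraphrased from the two commits above (a sketch, not the exact
backported code), the new helper prefers the IOMMU's notion of an
"optimal" mapping size while never exceeding dma_max_mapping_size():

/* kernel/dma/mapping.c (sketch) */
size_t dma_opt_mapping_size(struct device *dev)
{
        const struct dma_map_ops *ops = get_dma_ops(dev);
        size_t size = SIZE_MAX;

        if (ops && ops->opt_mapping_size)
                size = ops->opt_mapping_size();

        return min(dma_max_mapping_size(dev), size);
}

/* drivers/iommu/dma-iommu.c (sketch): report the largest IOVA size
 * that the rcache can still satisfy (128 KB with 4 KB pages). */
static size_t iommu_dma_opt_mapping_size(void)
{
        return iova_rcache_range();
}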

The fix itself simply changes the max_hw_sectors calculation from
dma_max_mapping_size() to dma_opt_mapping_size(). The fix needs a
backport, as setting dev->ctrl.max_hw_sectors moved from
nvme_reset_work() in 5.15 to nvme_pci_alloc_dev() in later releases.

commit 3710e2b056cb92ad816e4d79fa54a6a5b6ad8cbd
Author: Adrian Huang <ahuan...@lenovo.com>
Date:   Fri Apr 21 16:08:00 2023 +0800
Subject: nvme-pci: clamp max_hw_sectors based on DMA optimized limitation
Link: 
https://github.com/torvalds/linux/commit/3710e2b056cb92ad816e4d79fa54a6a5b6ad8cbd

The fix is present in 6.4-rc3 and later.
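
The change in nvme-pci then boils down to swapping one call when
dev->ctrl.max_hw_sectors is computed, roughly as follows (a sketch of
the upstream diff, which lands in nvme_pci_alloc_dev() upstream and in
nvme_reset_work() for the 5.15 backport):

/* Before: clamp to the largest size the DMA API can map at all. */
dev->ctrl.max_hw_sectors = min_t(u32, NVME_MAX_KB_SZ << 1,
                                 dma_max_mapping_size(&pdev->dev) >> 9);

/* After: clamp to the largest size whose IOVA still fits the cache,
 * e.g. 128 KB (256 sectors) when an IOMMU is enabled. */
dev->ctrl.max_hw_sectors = min_t(u32, NVME_MAX_KB_SZ << 1,
                                 dma_opt_mapping_size(&pdev->dev) >> 9);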

[Testcase]

The system needs to be extremely busy. So busy, in fact, that we cannot
reproduce the issue in lab environments; it only occurs in production.

The systems that hit this issue have 64 cores, ~90%+ sustained CPU
usage, ~90%+ sustained memory usage, high disk I/O, and nearly saturated
network throughput with 100Gb NICs.

The NVMe disk MUST have /sys/block/nvme0n1/queue/max_hw_sectors_kb
greater than 128 KB; in this case, it is 2048 KB.
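
A quick way to check this precondition is shown below (hypothetical
helper; the nvme0n1 device name is just an example, adjust it for your
system):

#include <stdio.h>

int main(void)
{
        /* Read the queue limit exported by the block layer. */
        FILE *f = fopen("/sys/block/nvme0n1/queue/max_hw_sectors_kb", "r");
        unsigned long kb = 0;

        if (!f || fscanf(f, "%lu", &kb) != 1) {
                perror("max_hw_sectors_kb");
                return 1;
        }
        fclose(f);

        printf("max_hw_sectors_kb = %lu KB (%s the 128 KB IOVA cache limit)\n",
               kb, kb > 128 ? "exceeds" : "within");
        return 0;
}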

Leave the system at sustained load until IOVA allocations slow to a halt
and soft or hard lockups occur while waiting for iova_rbtree_lock.

A test kernel is available in the following ppa:

https://launchpad.net/~mruffell/+archive/ubuntu/sf374805-test

If you install the test kernel and leave the system running under the
same load, the soft lockups no longer occur.

[Where problems could occur]

We are changing the value of max_hw_sectors_kb for NVMe devices on
systems with the IOMMU enabled. For systems without an IOMMU, or with
the IOMMU disabled, the value remains the same as it is now.

The new value is the minimum of the maximum size supported by the
hardware and the largest size that fits into the IOVA cache. For some
workloads this might have a small performance impact, since larger I/O
requests now need to be split into multiple smaller DMA mappings, but
there should be a net gain overall, because the IOVA allocations now fit
into the cache and complete much faster than a single large allocation.

If a regression were to occur, users could disable the IOMMU as a
workaround.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: Fix Released

** Affects: linux (Ubuntu Jammy)
     Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
         Status: In Progress


** Tags: jammy sts


Title:
  Prevent soft lockups during IOMMU streaming DMA mapping by limiting
  nvme max_hw_sectors_kb to cache optimised size
