** Description changed:

+ SRU Justification:
+ 
+ [Impact]
+ 
+  * With the introduction of c76c067e488c "s390/pci: Use dma-iommu layer"
+    (upstream since kernel v6.7-rc1) there was a move (on s390x only) to a
+    different dma-iommu implementation.
+ 
+  * And with 92bce97f0c34 "s390/pci: Fix reset of IOMMU software counters"
+    (again upstream since v6.7-rc1), the IOMMU_DEFAULT_DMA_LAZY kernel config
+    option should now be set to 'yes' by default for s390x.
+ 
+  * Since CONFIG_IOMMU_DEFAULT_DMA_STRICT and IOMMU_DEFAULT_DMA_LAZY are
+    related to each other, CONFIG_IOMMU_DEFAULT_DMA_STRICT needs to be set
+    to "no" by default, which was done upstream by b2b97a62f055
+    "Revert "s390: update defconfigs"".
+ 
+  * These changes are all upstream, but were not picked up by the Ubuntu
+    kernel config.
+ 
+  * And not having these config options set properly causes significant
+    PCI-related network throughput degradation (up to -72%). A quick way to
+    check which mode a running kernel uses is sketched at the end of this
+    section.
+ 
+  * This shows for almost all workloads and numbers of connections,
+    deteriorating as the number of connections increases.
+ 
+  * Especially drastic is the drop for a high number of parallel connections
+    (50 and 250) and for small and medium-size transactional workloads.
+    However, the degradation is also clearly visible for streaming-type
+    workloads (up to 48% degradation).
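+
+  * A quick way to check which flushing mode a kernel actually runs with
+    (a sketch, assuming the standard iommu_groups sysfs ABI and the usual
+    Ubuntu config location under /boot; group 0 is just an example):
+    $ grep -E 'CONFIG_IOMMU_DEFAULT_DMA_(STRICT|LAZY)' /boot/config-$(uname -r)
+    $ cat /sys/kernel/iommu_groups/0/type
+    Here "DMA" indicates strict invalidation and "DMA-FQ" indicates the lazy
+    (flush-queue) mode.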
+ 
+ [Fix]
+ 
+  * The (upstream) fix is to set
+    IOMMU_DEFAULT_DMA_STRICT=no
+    and
+    IOMMU_DEFAULT_DMA_LAZY=y
+    (which is needed for the changed DMA IOMMU implementation since v6.7);
+    the resulting config fragment is sketched below.
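+
+  * For reference, a sketch of the resulting fragment in the s390x kernel
+    config (plain Kconfig notation; the exact syntax used in the Ubuntu
+    kernel config annotations may differ):
+    # CONFIG_IOMMU_DEFAULT_DMA_STRICT is not set
+    CONFIG_IOMMU_DEFAULT_DMA_LAZY=y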
+ 
+ [Test Case]
+ 
+  * Set up two Ubuntu Server 24.04 LPARs (with kernel 6.8),
+    one acting as server and one as client,
+    that have (PCIe-attached) RoCE Express devices
+    and that are connected to each other.
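+    A quick sanity check that the RoCE function shows up as a PCI device and
+    network interface (a sketch; device and interface names differ per setup):
+    $ lspci
+    $ ip -br link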
+ 
+  * Sample workload rr1c-200x1000-250 with rr1c-200x1000-250.xml:
+    <?xml version="1.0"?>
+    <profile name="TCP_RR">
+            <group nprocs="250">
+                    <transaction iterations="1">
+                            <flowop type="connect" options="remotehost=<remote IP> protocol=tcp tcp_nodelay" />
+                    </transaction>
+                    <transaction duration="300">
+                            <flowop type="write" options="size=200"/>
+                            <flowop type="read" options="size=1000"/>
+                    </transaction>
+                    <transaction iterations="1">
+                            <flowop type="disconnect" />
+                    </transaction>
+            </group>
+    </profile>
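+    Replace the <remote IP> placeholder with the IP address of the server
+    LPAR before running the client.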
+ 
+  * Install uperf on both systems, client and server.
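+    (As a sketch, assuming the uperf package is available in the Ubuntu
+    archive for this release; otherwise it can be built from the upstream
+    sources at https://uperf.org.)
+    $ sudo apt install uperf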
+ 
+  * Start uperf at server: uperf -s
+ 
+  * Start uperf at client: uperf -vai 5 -m uperf-profile.xml
+ 
+  * Switch from strict to lazy mode 
+    either using the new kernel (or the test build below)
+    or using kernel cmd-line parameter iommu.strict=0.
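+    One common way to add the parameter on Ubuntu (a sketch; adjust it to
+    whatever boot loader setup is actually used on the LPAR): add
+    iommu.strict=0 to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then
+    $ sudo update-grub
+    $ sudo reboot
+    $ cat /proc/cmdline    # should now contain iommu.strict=0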
+ 
+  * Restart uperf on server and client, like before.
+ 
+  * Verification will be performed by IBM.
+ 
+ [Regression Potential]
+ 
+  * There is a certain regression potential, since the behavior with the
+    two modified kernel config options will change significantly.
+ 
+  * This may solve the (network) throughput issue with PCI devices, but it
+    may also come with side-effects on other PCIe-based devices
+    (the old compression adapters or the new NVMe carrier cards).
+ 
+ [Other]
+ 
+  * CCW devices are not affected.
+ 
+  * This is s390x-specific, hence it will not affect any other
+    architecture.
+ 
+ __________
+ 
  Symptom:
  Comparing Ubuntu 24.04 (kernel version 6.8.0-31-generic) against Ubuntu
  22.04, all of our PCI-related network measurements on LPAR show massive
  throughput degradations (up to -72%). This shows for almost all workloads
  and numbers of connections, deteriorating as the number of connections
  increases. Especially drastic is the drop for a high number of parallel
  connections (50 and 250) and for small and medium-size transactional
  workloads. However, the degradation is also clearly visible for
  streaming-type workloads (up to 48% degradation).
  
  Problem:
  With the kernel config setting CONFIG_IOMMU_DEFAULT_DMA_STRICT=y, the IOMMU
  DMA mode changed from lazy to strict, causing these massive degradations.
  The behavior can also be changed with a kernel command-line parameter
  (iommu.strict) for easy verification.
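  To double-check which mode is active, the kernel command line and the boot
  log can be inspected (a sketch; the exact log wording may vary between
  kernel versions):
  $ cat /proc/cmdline
  $ sudo dmesg | grep -i 'tlb invalidation policy'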
  
  The issue is known and was quickly fixed upstream in December 2023, after
  being present for a little less than two weeks.
  Upstream fix: 
https://github.com/torvalds/linux/commit/b2b97a62f055dd638f7f02087331a8380d8f139a
  
  Repro:
  rr1c-200x1000-250 with rr1c-200x1000-250.xml:
  
  <?xml version="1.0"?>
  <profile name="TCP_RR">
          <group nprocs="250">
                  <transaction iterations="1">
                          <flowop type="connect" options="remotehost=<remote IP> protocol=tcp tcp_nodelay" />
                  </transaction>
                  <transaction duration="300">
                          <flowop type="write" options="size=200"/>
                          <flowop type="read" options="size=1000"/>
                  </transaction>
                  <transaction iterations="1">
                          <flowop type="disconnect" />
                  </transaction>
          </group>
  </profile>
  
  0) Install uperf on both systems, client and server.
  1) Start uperf at server: uperf -s
  2) Start uperf at client: uperf -vai 5 -m uperf-profile.xml
  
  3) Switch from strict to lazy mode using the kernel command-line parameter
  iommu.strict=0.
  4) Repeat steps 1) and 2).
  
  Example:
  For the following example, we chose the workload named above 
(rr1c-200x1000-250):
  
  iommu.strict=1 (strict): 233464.914 TPS
  iommu.strict=0 (lazy): 835123.193 TPS

** Changed in: ubuntu-z-systems
       Status: New => In Progress

** Changed in: linux (Ubuntu)
       Status: New => In Progress

https://bugs.launchpad.net/bugs/2071471

Title:
  [UBUNTU 24.04] IOMMU DMA mode changed in kernel config causes massive
  throughput degradation for PCI-related network workloads
