** Description changed:

- This patch series adds comprehensive CXL (Compute Express Link) support
- to the nvidia-6.17 kernel, including:
+ This patch series adds comprehensive CXL (Compute Express Link) support to the
+ nvidia-6.17 kernel, including:
  
- 1. CXL Type-2 device support - Enables accelerator devices (like GPUs
- and SmartNICs) to use CXL for coherent memory access
- 
+ 1. CXL Type-2 device support - Enables accelerator devices (like GPUs and
+    SmartNICs) to use CXL for coherent memory access via firmware-provisioned
+    regions
  2. CXL RAS (Reliability, Availability, Serviceability) error handling -
- Implements PCIe Port Protocol error handling and logging for CXL devices
- 
- 3. Prerequisite CXL driver updates - Cherry-picked commits from Linux
- v6.18 that are required dependencies
- 
+    Implements PCIe Port Protocol error handling and logging for CXL Root Ports,
+    Downstream Switch Ports, and Upstream Switch Ports
+ 3. CXL DVSEC and HDM state save/restore - Preserves CXL DVSEC control/range
+    registers and HDM decoder programming across PCI resets and link transitions,
+    enabling device re-initialization after reset for firmware-provisioned
+    configurations
+ 4. CXL Reset support - Implements the CXL Reset method (CXL Spec v3.2,
+    Sections 8.1.3, 9.6, 9.7) via a sysfs interface for Type-2 devices,
+    including memory offlining, cache flushing, multi-function sibling
+    coordination, and DVSEC reset sequencing
+ 5. Multi-level interleaving fix - Supports firmware-configured CXL
+    interleaving where lower levels use smaller granularities than parent ports
+    (reverse HPA bit ordering)
+ 6. Prerequisite CXL and PCI driver updates - Cherry-picked commits from
+    upstream torvalds/master covering the range from v6.17.9 to the merge
+    point of Terry Bowman's v14 series into v7.0
+ 7. CXL DAX support - Enables direct access (DAX) to CXL RAM regions and
+    mapping CXL DAX devices as System-RAM
  
  Key Features Added:
  
-     CXL Type-2 accelerator device registration and memory management
-     CXL region creation by Type-2 drivers
-     DPA (Device Physical Address) allocation interface for accelerators
-     HPA (Host Physical Address) free space enumeration
-     CXL protocol error detection, forwarding, and recovery
-     RAS register mapping for CXL Endpoints and Switch Ports
+   - CXL Type-2 accelerator device registration and memory management
+   - CXL region creation by Type-2 drivers
+   - DPA (Device Physical Address) allocation interface for accelerators
+   - HPA (Host Physical Address) free space enumeration
+   - Multi-level CXL address translation (SPA↔HPA↔DPA)
+   - CXL protocol error detection, forwarding, and recovery
+   - CXL RAS error handling for Endpoints, RCH, and Switch Ports
+     (replacing the old PCIEAER_CXL symbol with the new CXL_RAS def_bool)
+   - CXL extended linear cache region support
+   - CXL DVSEC and HDM decoder state save/restore across PCI resets
+   - CXL Reset sysfs interface (/sys/bus/pci/devices/.../cxl_reset) for
+     Type-2 devices with Reset Capable bit set
+   - Multi-function sibling coordination during CXL reset via Non-CXL
+     Function Map DVSEC
+   - CPU cache flush using cpu_cache_invalidate_memregion() during reset
+   - Multi-level interleaving with smaller granularities for lower decoder
+     levels (firmware-provisioned configurations)
+   - CXL DAX device access (DEV_DAX_CXL) and System-RAM mapping
+     (DEV_DAX_KMEM)
+   - CXL protocol error injection via APEI EINJ (ACPI_APEI_EINJ_CXL)
  
  Justification
  
- CXL Type-2 device support is critical for next-generation NVIDIA
- accelerators and data center workloads:
+ CXL Type-2 device support is critical for next-generation NVIDIA accelerators
+ and data center workloads:
+ 
+   - Enables coherent memory sharing between CPUs and accelerators
+   - Supports firmware-provisioned CXL regions for accelerator memory
+   - Provides proper error handling and reporting for CXL fabric errors
+   - Enables device reset and state recovery for CXL Type-2 devices
+   - Preserves firmware-programmed DVSEC and HDM decoder state across resets
+   - Required for upcoming NVIDIA hardware with CXL capabilities
+ 
+ Source
+ Patch Breakdown (139 patches + 1 revert + 3 config annotation updates, 143 total):
+ #  Category                         Count  Source
+ --------------------------------------------------------------------------------
+ 1  Revert old CXL reset (f198764)   1      OOT (cleanup)
+ --------------------------------------------------------------------------------
+ 2  Upstream CXL/PCI prerequisite    103    Upstream torvalds/master (v6.17.9
+    cherry-picks                            → merge of Terry Bowman v14 into v7.0)
+ --------------------------------------------------------------------------------
+ 3  Smita Koralahalli's CXL EINJ     1      LKML (v6, not yet merged)
+    series v6 patch 3/9
+ --------------------------------------------------------------------------------
+ 4  Alejandro Lucero's CXL Type-2    22     LKML (v23, not yet merged)
+    series v23
+ --------------------------------------------------------------------------------
+ 5  Robert Richter's multi-level     1      LKML (v1, not yet merged)
+    interleaving fix
+ --------------------------------------------------------------------------------
+ 6  Srirangan Madhavan's CXL state   5      LKML (v1, not yet merged)
+    save/restore series
+ --------------------------------------------------------------------------------
+ 7  Srirangan Madhavan's CXL reset   7      LKML (v5, not yet merged)
+    series
+ --------------------------------------------------------------------------------
+ 8  Config annotations update        3      OOT (build config)
+ --------------------------------------------------------------------------------
+    TOTAL                            143
  
  
-     Enables coherent memory sharing between CPUs and accelerators
-     Supports firmware-provisioned CXL regions for accelerator memory
-     Provides proper error handling and reporting for CXL fabric errors
-     Required for upcoming NVIDIA hardware with CXL capabilities
+ Notes on the upstream cherry-picks (item 2):
+ 
+ The 103 upstream commits span 1bfd0faa78d0 (v6.17.9) to
+ 0da3050bdded (Merge of for-7.0/cxl-aer-prep into cxl-for-next).
+ This range includes 17 out of 34 patches from Terry Bowman's v14 series
+ that were reworked by the CXL maintainer and merged into v7.0 via the
+ for-7.0/cxl-aer-prep branch. The remaining 17 patches from Terry's v14
+ were refactored into v15 (9 patches, not yet merged) and are not included
+ in this port.
  
  
- Patch Breakdown (80 commits total):
+ Notes on the save/restore and reset series (items 6–7):
  
- Category                           Count  Source
- Revert old CXL reset               1      OOT (cleanup)
- v6.18 CXL driver prerequisites     28     Upstream (cherry-picked from torvalds/linux v6.18)
- Terry Bowman's CXL RAS series      25     Upstream (RESEND v13)
- Alejandro Lucero's Type-2 series   25     Upstream (v22)
- CXL Config update                  1      OOT (build config)
+ Srirangan's patches were authored against upstream v7.0-rc1 (which does not
+ include Alejandro's v23 Type-2 series). For this port, the header
+ reorganization in patch 2/5 of the save/restore series was adapted to align
+ with Alejandro's v23 approach: HDM decoder and register map definitions were
+ moved to include/cxl/cxl.h (not include/cxl/pci.h as in the original
+ patch) to follow the convention established by Alejandro's series. Upstream
+ reviewers have indicated that Srirangan's series should be rebased on top of
+ Alejandro's once it merges.
  
  
  Lore Links:
- Terry Bowman's CXL RAS series (RESEND v13):
- https://lore.kernel.org/linux-cxl/[email protected]/
  
- Alejandro Lucero's CXL Type-2 series (v22):
- https://lore.kernel.org/linux-cxl/[email protected]/
+ - Terry Bowman's CXL RAS series (v14, partially merged into v7.0):
+   https://lore.kernel.org/all/[email protected]/
+ 
+ - Smita Koralahalli's CXL EINJ series (v6, patch 3/9 only):
+   https://lore.kernel.org/linux-cxl/[email protected]/
+ 
+ - Alejandro Lucero's CXL Type-2 series (v23):
+   https://lore.kernel.org/linux-cxl/[email protected]/
+ 
+ - Robert Richter's multi-level interleaving fix (v1):
+   https://lore.kernel.org/all/[email protected]/
+ 
+ - Srirangan Madhavan's CXL state save/restore series:
+   https://lore.kernel.org/linux-cxl/[email protected]/
+ 
+ - Srirangan Madhavan's CXL reset series (v5):
+   https://lore.kernel.org/linux-cxl/[email protected]/
+ 
+ 
+ Testing
+ 
+ Build Validation:
+ 
+ - Built successfully for ARM64 4K page size kernel
+ - Built successfully for ARM64 64K page size kernel
+ - Built successfully for x86
+ 
+ Runtime Testing:
+ 
+ - Boot test on ARM64 system
+ - CXL device enumeration test
+ - CXL interleaving testing
+ - CXL reset test
+ - DVSEC save/restore verified (CXLCtl, Range register preserved)
+ 
  
  Notes
  
- CONFIG_CXL_BUS and CONFIG_CXL_PCI changed from tristate to bool by the
- Type-2 patches (intentional design change for built-in CXL support)
+ - CONFIG_PCIEAER_CXL has been removed from Kconfig by upstream commit
+ d18f1b7beadf (PCI/AER: Replace PCIEAER_CXL symbol with CXL_RAS).
+ The debian.master annotation for PCIEAER_CXL=y is overridden to -
+ in debian.nvidia-6.17/config/annotations.
  
- Kernel config annotations updated in
- debian.nvidia-6.17/config/annotations to reflect these changes
+ - CONFIG_CXL_BUS, CONFIG_CXL_PCI, CONFIG_CXL_MEM, CONFIG_CXL_PORT
+ remain tristate (not bool) — the v14 series kept them as tristate,
+ unlike earlier draft versions.
+ 
+ - CONFIG_DEV_DAX, CONFIG_DEV_DAX_CXL, and CONFIG_DEV_DAX_KMEM are
+ overridden from m (debian.master default) to y to support built-in
+ CXL RAM region DAX access and System-RAM mapping.
+ 
+ - CONFIG_PCI_CXL is a new hidden bool introduced by the save/restore
+ series; auto-enabled when CXL_BUS=y. Gates compilation of
+ drivers/pci/cxl.o for DVSEC and HDM state save/restore.
+ 
+ - CONFIG_GENERIC_CPU_CACHE_MAINTENANCE and
+ CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION are new configs
+ introduced by the upstream cherry-picks; arm64 auto-selects both.
+ cpu_cache_invalidate_memregion() is also used by the CXL reset
+ series for cache flushing during reset.
+ 
+ - Kernel config annotations updated in debian.nvidia-6.17/config/annotations
+ to reflect all of the above changes.
+ 
+ - Srirangan's save/restore series header reorganization was adapted to
+ align with Alejandro's v23 approach (include/cxl/cxl.h instead of
+ include/cxl/pci.h). See commit message on patch 2/5 for details.

** Description changed:

  This patch series adds comprehensive CXL (Compute Express Link) support to the
  nvidia-6.17 kernel, including:
  
  1. CXL Type-2 device support - Enables accelerator devices (like GPUs and
-    SmartNICs) to use CXL for coherent memory access via firmware-provisioned
-    regions
+    SmartNICs) to use CXL for coherent memory access via firmware-provisioned
+    regions
  2. CXL RAS (Reliability, Availability, Serviceability) error handling -
-    Implements PCIe Port Protocol error handling and logging for CXL Root Ports,
-    Downstream Switch Ports, and Upstream Switch Ports
+    Implements PCIe Port Protocol error handling and logging for CXL Root Ports,
+    Downstream Switch Ports, and Upstream Switch Ports
  3. CXL DVSEC and HDM state save/restore - Preserves CXL DVSEC control/range
-    registers and HDM decoder programming across PCI resets and link transitions,
-    enabling device re-initialization after reset for firmware-provisioned
-    configurations
+    registers and HDM decoder programming across PCI resets and link transitions,
+    enabling device re-initialization after reset for firmware-provisioned
+    configurations
  4. CXL Reset support - Implements the CXL Reset method (CXL Spec v3.2,
-    Sections 8.1.3, 9.6, 9.7) via a sysfs interface for Type-2 devices,
-    including memory offlining, cache flushing, multi-function sibling
-    coordination, and DVSEC reset sequencing
+    Sections 8.1.3, 9.6, 9.7) via a sysfs interface for Type-2 devices,
+    including memory offlining, cache flushing, multi-function sibling
+    coordination, and DVSEC reset sequencing
  5. Multi-level interleaving fix - Supports firmware-configured CXL
-    interleaving where lower levels use smaller granularities than parent ports
-    (reverse HPA bit ordering)
+    interleaving where lower levels use smaller granularities than parent ports
+    (reverse HPA bit ordering)
  6. Prerequisite CXL and PCI driver updates - Cherry-picked commits from
-    upstream torvalds/master covering the range from v6.17.9 to the merge
-    point of Terry Bowman's v14 series into v7.0
+    upstream torvalds/master covering the range from v6.17.9 to the merge
+    point of Terry Bowman's v14 series into v7.0
  7. CXL DAX support - Enables direct access (DAX) to CXL RAM regions and
-     mapping CXL DAX devices as System-RAM
+     mapping CXL DAX devices as System-RAM
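
The multi-level interleaving case in item 5 can be illustrated with a small model. This is a minimal sketch, not the kernel's decoder logic: it assumes power-of-two interleave ways and a plain divide-and-modulo target selection, and the topology (a 2-way parent at 512 bytes feeding a 2-way child at 256 bytes) and the names `select_target`, `route`, and `LEVELS` are hypothetical, chosen only to show a lower level using a smaller granularity than its parent.

```python
def select_target(hpa: int, granularity: int, ways: int) -> int:
    """Return the interleave target index chosen for a host physical address.

    Simplified model: only meaningful for power-of-two interleave ways.
    """
    return (hpa // granularity) % ways


def route(hpa: int, levels: list[tuple[int, int]]) -> tuple[int, ...]:
    """Walk the decoder levels (parent first) and return the target picked at each."""
    return tuple(select_target(hpa, gran, ways) for gran, ways in levels)


# Hypothetical topology: the parent interleaves 2 ways at 512 bytes,
# the child interleaves 2 ways at 256 bytes (smaller than its parent).
LEVELS = [(512, 2), (256, 2)]

for hpa in range(0, 1024, 256):
    print(hex(hpa), route(hpa, LEVELS))
```

In this model the child level consumes a lower HPA bit than the parent, so four consecutive 256-byte chunks land on four distinct (parent, child) paths.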
  
  Key Features Added:
  
-   - CXL Type-2 accelerator device registration and memory management
-   - CXL region creation by Type-2 drivers
-   - DPA (Device Physical Address) allocation interface for accelerators
-   - HPA (Host Physical Address) free space enumeration
-   - Multi-level CXL address translation (SPA↔HPA↔DPA)
-   - CXL protocol error detection, forwarding, and recovery
-   - CXL RAS error handling for Endpoints, RCH, and Switch Ports
-     (replacing the old PCIEAER_CXL symbol with the new CXL_RAS def_bool)
-   - CXL extended linear cache region support
-   - CXL DVSEC and HDM decoder state save/restore across PCI resets
-   - CXL Reset sysfs interface (/sys/bus/pci/devices/.../cxl_reset) for
-     Type-2 devices with Reset Capable bit set
-   - Multi-function sibling coordination during CXL reset via Non-CXL
-     Function Map DVSEC
-   - CPU cache flush using cpu_cache_invalidate_memregion() during reset
-   - Multi-level interleaving with smaller granularities for lower decoder
-     levels (firmware-provisioned configurations)
-   - CXL DAX device access (DEV_DAX_CXL) and System-RAM mapping
-     (DEV_DAX_KMEM)
-   - CXL protocol error injection via APEI EINJ (ACPI_APEI_EINJ_CXL)
+   - CXL Type-2 accelerator device registration and memory management
+   - CXL region creation by Type-2 drivers
+   - DPA (Device Physical Address) allocation interface for accelerators
+   - HPA (Host Physical Address) free space enumeration
+   - Multi-level CXL address translation (SPA↔HPA↔DPA)
+   - CXL protocol error detection, forwarding, and recovery
+   - CXL RAS error handling for Endpoints, RCH, and Switch Ports
+     (replacing the old PCIEAER_CXL symbol with the new CXL_RAS def_bool)
+   - CXL extended linear cache region support
+   - CXL DVSEC and HDM decoder state save/restore across PCI resets
+   - CXL Reset sysfs interface (/sys/bus/pci/devices/.../cxl_reset) for
+     Type-2 devices with Reset Capable bit set
+   - Multi-function sibling coordination during CXL reset via Non-CXL
+     Function Map DVSEC
+   - CPU cache flush using cpu_cache_invalidate_memregion() during reset
+   - Multi-level interleaving with smaller granularities for lower decoder
+     levels (firmware-provisioned configurations)
+   - CXL DAX device access (DEV_DAX_CXL) and System-RAM mapping
+     (DEV_DAX_KMEM)
+   - CXL protocol error injection via APEI EINJ (ACPI_APEI_EINJ_CXL)
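
As a rough illustration of how the cxl_reset sysfs attribute listed above might be driven from user space, here is a minimal sketch. It makes several assumptions that the description does not state: that the attribute sits directly under the device's PCI sysfs directory, that the device is addressed by its PCI address (the `bdf` parameter is hypothetical), and that writing "1" triggers the reset, by analogy with the existing PCI `reset` attribute. The helper names are illustrative only.

```python
from pathlib import Path


def cxl_reset_attr(bdf: str, sysfs_root: str = "/sys") -> Path:
    """Build the path of the cxl_reset attribute for a PCI device address."""
    return Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "cxl_reset"


def trigger_cxl_reset(bdf: str, sysfs_root: str = "/sys", value: str = "1") -> bool:
    """Write to cxl_reset if the attribute exists; return False when it is absent.

    ASSUMPTION: the accepted value ("1") is a guess modelled on the PCI
    'reset' attribute; check the series' ABI documentation before relying on it.
    """
    attr = cxl_reset_attr(bdf, sysfs_root)
    if not attr.exists():
        # Attribute is not exposed, e.g. the device lacks the Reset Capable bit.
        return False
    attr.write_text(value)
    return True


# Only show where the attribute would live; nothing is written here.
print(cxl_reset_attr("0000:01:00.0").as_posix())
```

Returning False for a missing attribute lets a test script distinguish "device is not reset capable" from a failed write, which would raise an OSError instead.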
  
  Justification
  
  CXL Type-2 device support is critical for next-generation NVIDIA accelerators
  and data center workloads:
  
-   - Enables coherent memory sharing between CPUs and accelerators
-   - Supports firmware-provisioned CXL regions for accelerator memory
-   - Provides proper error handling and reporting for CXL fabric errors
-   - Enables device reset and state recovery for CXL Type-2 devices
-   - Preserves firmware-programmed DVSEC and HDM decoder state across resets
-   - Required for upcoming NVIDIA hardware with CXL capabilities
+   - Enables coherent memory sharing between CPUs and accelerators
+   - Supports firmware-provisioned CXL regions for accelerator memory
+   - Provides proper error handling and reporting for CXL fabric errors
+   - Enables device reset and state recovery for CXL Type-2 devices
+   - Preserves firmware-programmed DVSEC and HDM decoder state across resets
+   - Required for upcoming NVIDIA hardware with CXL capabilities
  
  Source
  Patch Breakdown (139 patches + 1 revert + 3 config annotation updates, 143 total):
  #  Category                         Count  Source
  --------------------------------------------------------------------------------
  1  Revert old CXL reset (f198764)   1      OOT (cleanup)
  --------------------------------------------------------------------------------
  2  Upstream CXL/PCI prerequisite    103    Upstream torvalds/master (v6.17.9
-    cherry-picks                            → merge of Terry Bowman v14 into v7.0)
+    cherry-picks                            → merge of Terry Bowman v14 into v7.0)
  --------------------------------------------------------------------------------
  3  Smita Koralahalli's CXL EINJ     1      LKML (v6, not yet merged)
-    series v6 patch 3/9
+    series v6 patch 3/9
  --------------------------------------------------------------------------------
  4  Alejandro Lucero's CXL Type-2    22     LKML (v23, not yet merged)
-    series v23
+    series v23
  --------------------------------------------------------------------------------
  5  Robert Richter's multi-level     1      LKML (v1, not yet merged)
-    interleaving fix
+    interleaving fix
  --------------------------------------------------------------------------------
  6  Srirangan Madhavan's CXL state   5      LKML (v1, not yet merged)
-    save/restore series
+    save/restore series
  --------------------------------------------------------------------------------
  7  Srirangan Madhavan's CXL reset   7      LKML (v5, not yet merged)
-    series
+    series
  --------------------------------------------------------------------------------
  8  Config annotations update        3      OOT (build config)
  --------------------------------------------------------------------------------
-    TOTAL                            143
- 
+    TOTAL                            143
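
The counts in the breakdown can be cross-checked with a few lines of arithmetic. The sketch below copies the per-category counts from the table; the dictionary keys are shorthand labels introduced here, not names used by the series.

```python
# Per-category commit counts, copied from the patch breakdown table.
BREAKDOWN = {
    "revert_old_cxl_reset": 1,
    "upstream_cherry_picks": 103,
    "cxl_einj_patch": 1,
    "type2_series_v23": 22,
    "multi_level_interleaving_fix": 1,
    "state_save_restore_series": 5,
    "cxl_reset_series_v5": 7,
    "config_annotations": 3,
}

total = sum(BREAKDOWN.values())

# "139 patches" in the heading excludes the revert and the config updates.
patches = total - BREAKDOWN["revert_old_cxl_reset"] - BREAKDOWN["config_annotations"]

print(f"total commits: {total}")       # 143
print(f"functional patches: {patches}")  # 139
```

This shows how the heading's "139 patches + 1 revert" and the table's TOTAL of 143 fit together: the difference is the 3 config annotation commits.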
  
  Notes on the upstream cherry-picks (item 2):
  
  The 103 upstream commits span 1bfd0faa78d0 (v6.17.9) to
  0da3050bdded (Merge of for-7.0/cxl-aer-prep into cxl-for-next).
  This range includes 17 out of 34 patches from Terry Bowman's v14 series
  that were reworked by the CXL maintainer and merged into v7.0 via the
  for-7.0/cxl-aer-prep branch. The remaining 17 patches from Terry's v14
  were refactored into v15 (9 patches, not yet merged) and are not included
  in this port.
- 
  
  Notes on the save/restore and reset series (items 6–7):
  
  Srirangan's patches were authored against upstream v7.0-rc1 (which does not
  include Alejandro's v23 Type-2 series). For this port, the header
  reorganization in patch 2/5 of the save/restore series was adapted to align
  with Alejandro's v23 approach: HDM decoder and register map definitions were
  moved to include/cxl/cxl.h (not include/cxl/pci.h as in the original
  patch) to follow the convention established by Alejandro's series. Upstream
  reviewers have indicated that Srirangan's series should be rebased on top of
  Alejandro's once it merges.
  
- 
  Lore Links:
  
  - Terry Bowman's CXL RAS series (v14, partially merged into v7.0):
-   https://lore.kernel.org/all/[email protected]/
+   https://lore.kernel.org/all/[email protected]/
  
  - Smita Koralahalli's CXL EINJ series (v6, patch 3/9 only):
-   https://lore.kernel.org/linux-cxl/[email protected]/
+   https://lore.kernel.org/linux-cxl/[email protected]/
  
  - Alejandro Lucero's CXL Type-2 series (v23):
-   https://lore.kernel.org/linux-cxl/[email protected]/
+   https://lore.kernel.org/linux-cxl/[email protected]/
  
  - Robert Richter's multi-level interleaving fix (v1):
-   https://lore.kernel.org/all/[email protected]/
+   https://lore.kernel.org/all/[email protected]/
  
  - Srirangan Madhavan's CXL state save/restore series:
-   https://lore.kernel.org/linux-cxl/[email protected]/
+   https://lore.kernel.org/linux-cxl/[email protected]/
  
  - Srirangan Madhavan's CXL reset series (v5):
-   https://lore.kernel.org/linux-cxl/[email protected]/
- 
+   https://lore.kernel.org/linux-cxl/[email protected]/
  
  Testing
  
  Build Validation:
  
  - Built successfully for ARM64 4K page size kernel
  - Built successfully for ARM64 64K page size kernel
  - Built successfully for x86
  
  Runtime Testing:
  
  - Boot test on ARM64 system
  - CXL device enumeration test
  - CXL interleaving testing
  - CXL reset test
  - DVSEC save/restore verified (CXLCtl, Range register preserved)
- 
  
  Notes
  
  - CONFIG_PCIEAER_CXL has been removed from Kconfig by upstream commit
  d18f1b7beadf (PCI/AER: Replace PCIEAER_CXL symbol with CXL_RAS).
  The debian.master annotation for PCIEAER_CXL=y is overridden to -
  in debian.nvidia-6.17/config/annotations.
  
  - CONFIG_CXL_BUS, CONFIG_CXL_PCI, CONFIG_CXL_MEM, CONFIG_CXL_PORT
  remain tristate (not bool) — the v14 series kept them as tristate,
  unlike earlier draft versions.
  
  - CONFIG_DEV_DAX, CONFIG_DEV_DAX_CXL, and CONFIG_DEV_DAX_KMEM are
  overridden from m (debian.master default) to y to support built-in
  CXL RAM region DAX access and System-RAM mapping.
  
  - CONFIG_PCI_CXL is a new hidden bool introduced by the save/restore
  series; auto-enabled when CXL_BUS=y. Gates compilation of
  drivers/pci/cxl.o for DVSEC and HDM state save/restore.
  
  - CONFIG_GENERIC_CPU_CACHE_MAINTENANCE and
  CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION are new configs
  introduced by the upstream cherry-picks; arm64 auto-selects both.
  cpu_cache_invalidate_memregion() is also used by the CXL reset
  series for cache flushing during reset.
  
  - Kernel config annotations updated in debian.nvidia-6.17/config/annotations
  to reflect all of the above changes.
  
  - Srirangan's save/restore series header reorganization was adapted to
  align with Alejandro's v23 approach (include/cxl/cxl.h instead of
  include/cxl/pci.h). See commit message on patch 2/5 for details.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2143032

Title:
  Add CXL Type-2 device support, RAS error handling, reset, state
  save/restore, and interleaving support

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-6.17/+bug/2143032/+subscriptions


-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
