** Description changed:

  This patch series adds comprehensive CXL (Compute Express Link) support to the
  nvidia-6.17 kernel, including:
  
  1. CXL Type-2 device support - Enables accelerator devices (like GPUs and
     SmartNICs) to use CXL for coherent memory access via firmware-provisioned
     regions
  2. CXL RAS (Reliability, Availability, Serviceability) error handling -
     Implements PCIe Port Protocol error handling and logging for CXL Root
     Ports, Downstream Switch Ports, and Upstream Switch Ports
  3. CXL DVSEC and HDM state save/restore - Preserves CXL DVSEC control/range
     registers and HDM decoder programming across PCI resets and link
     transitions, enabling device re-initialization after reset for
     firmware-provisioned configurations
  4. CXL Reset support - Implements the CXL Reset method (CXL Spec v3.2,
     Sections 8.1.3, 9.6, 9.7) via a sysfs interface for Type-2 devices,
     including memory offlining, cache flushing, multi-function sibling
     coordination, and DVSEC reset sequencing
  5. Multi-level interleaving fix - Supports firmware-configured CXL
     interleaving where lower levels use smaller granularities than parent ports
     (reverse HPA bit ordering)
  6. Prerequisite CXL and PCI driver updates - Cherry-picked commits from
     upstream torvalds/master covering the range from v6.17.9 to the merge
     point of Terry Bowman's v14 series into v7.0
  7. CXL DAX support - Enables direct memory access to CXL RAM regions and
     mapping CXL DAX devices as System-RAM
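
  The interleaving case in item 5 can be illustrated with plain modulo
  arithmetic. This is a minimal sketch, assuming simple modulo (non-XOR)
  interleaving with power-of-two ways and granularities; the function names
  and the 2-way/512-byte and 2-way/256-byte values are hypothetical and are
  not taken from the patch or from any kernel API.

```python
def interleave_position(hpa: int, ways: int, granularity: int) -> int:
    """Target index a decoder selects for `hpa` under modulo interleaving."""
    return (hpa // granularity) % ways


def route(hpa: int, levels: list[tuple[int, int]]) -> list[int]:
    """Per-level target positions for `hpa`, parent (top) level first.

    `levels` is a list of (ways, granularity) pairs, one per decoder level.
    """
    return [interleave_position(hpa, ways, gran) for ways, gran in levels]


# Hypothetical firmware configuration: the parent port interleaves 2 ways at
# 512 bytes (selector is HPA bit 9) and the lower level interleaves 2 ways at
# 256 bytes (selector is HPA bit 8), i.e. the lower level uses the smaller
# granularity and the lower-order HPA bit.
REVERSED = [(2, 512), (2, 256)]

for hpa in range(0, 1024, 256):
    print(hex(hpa), route(hpa, REVERSED))
```

  With these assumed values, each consecutive 256-byte chunk of a 1 KiB
  window selects a different (parent, child) pair, so all four endpoints are
  still used even though the selector bits are ordered in reverse.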
  
  Key Features Added:
  
    - CXL Type-2 accelerator device registration and memory management
    - CXL region creation by Type-2 drivers
    - DPA (Device Physical Address) allocation interface for accelerators
    - HPA (Host Physical Address) free space enumeration
    - Multi-level CXL address translation (SPA↔HPA↔DPA)
    - CXL protocol error detection, forwarding, and recovery
    - CXL RAS error handling for Endpoints, RCH, and Switch Ports
      (replacing the old PCIEAER_CXL symbol with the new CXL_RAS def_bool)
    - CXL extended linear cache region support
    - CXL DVSEC and HDM decoder state save/restore across PCI resets
    - CXL Reset sysfs interface (/sys/bus/pci/devices/.../cxl_reset) for
      Type-2 devices with Reset Capable bit set
    - Multi-function sibling coordination during CXL reset via Non-CXL
      Function Map DVSEC
    - CPU cache flush using cpu_cache_invalidate_memregion() during reset
    - Multi-level interleaving with smaller granularities for lower decoder
      levels (firmware-provisioned configurations)
    - CXL DAX device access (DEV_DAX_CXL) and System-RAM mapping
      (DEV_DAX_KMEM)
    - CXL protocol error injection via APEI EINJ (ACPI_APEI_EINJ_CXL)
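
  As a rough illustration of how the cxl_reset sysfs attribute listed above
  might be driven from user space, here is a minimal sketch. The helper
  names, the example device address, and the value written ("1") are
  assumptions made for illustration; the description above only states that
  the attribute exists for Type-2 devices with the Reset Capable bit set.

```python
from pathlib import Path


def cxl_reset_path(bdf: str, sysfs_root: str = "/sys") -> Path:
    """Build the path of the cxl_reset attribute for a PCI device address."""
    return Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "cxl_reset"


def request_cxl_reset(bdf: str, sysfs_root: str = "/sys", value: str = "1") -> Path:
    """Write `value` to the device's cxl_reset attribute.

    Raises FileNotFoundError when the attribute is absent, which this sketch
    treats as "device does not expose CXL reset".
    The written value is an assumption, not a documented ABI.
    """
    path = cxl_reset_path(bdf, sysfs_root)
    if not path.exists():
        raise FileNotFoundError(f"{path} not present; CXL reset not exposed")
    path.write_text(value)
    return path
```

  A call such as request_cxl_reset("0000:01:00.0") (a hypothetical device
  address) would need root privileges on a real system; sysfs_root is a
  parameter so the helper can be exercised against a scratch directory.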
  
  Justification
  
  CXL Type-2 device support is critical for next-generation NVIDIA accelerators
  and data center workloads:
  
    - Enables coherent memory sharing between CPUs and accelerators
    - Supports firmware-provisioned CXL regions for accelerator memory
    - Provides proper error handling and reporting for CXL fabric errors
    - Enables device reset and state recovery for CXL Type-2 devices
    - Preserves firmware-programmed DVSEC and HDM decoder state across resets
    - Required for upcoming NVIDIA hardware with CXL capabilities
  
  Source
  Patch Breakdown (139 patches + 1 revert):
  #  Category                        Count  Source
  --------------------------------------------------------------------------------
  1  Revert old CXL reset (f198764)  1      OOT (cleanup)
  --------------------------------------------------------------------------------
  2  Upstream CXL/PCI prerequisite   103    Upstream torvalds/master (v6.17.9
     cherry-picks                           → merge of Terry Bowman v14 into v7.0)
  --------------------------------------------------------------------------------
  3  Smita Koralahalli's CXL EINJ    1      LKML (v6, not yet merged)
     series v6 patch 3/9
  --------------------------------------------------------------------------------
  4  Alejandro Lucero's CXL Type-2   22     LKML (v23, not yet merged)
     series v23
  --------------------------------------------------------------------------------
  5  Robert Richter's multi-level    1      LKML (v1, not yet merged)
     interleaving fix
  --------------------------------------------------------------------------------
  6  Srirangan Madhavan's CXL state  5      LKML (v1, not yet merged)
     save/restore series
  --------------------------------------------------------------------------------
  7  Srirangan Madhavan's CXL reset  7      LKML (v5, not yet merged)
     series
  --------------------------------------------------------------------------------
+ 8  Upstream fixes for ported       14     13 upstream merged fixes + 1
+    commits                                prerequisite
+ --------------------------------------------------------------------------------
  8  Config annotations update       3      OOT (build config)
  --------------------------------------------------------------------------------
-    TOTAL                           143
+    TOTAL                           157
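
  The totals in the table can be re-derived from the per-row counts. The
  short sketch below only restates the numbers from the table; the dictionary
  keys are shorthand labels chosen here, not names used by the series.

```python
# Per-row patch counts copied from the breakdown table.
base_rows = {
    "revert_old_cxl_reset": 1,
    "upstream_prerequisites": 103,
    "einj_v6_patch_3": 1,
    "type2_v23": 22,
    "multi_level_interleaving": 1,
    "state_save_restore": 5,
    "cxl_reset_v5": 7,
    "config_annotations": 3,
}
upstream_fixes = 14  # the row added by this description change

old_total = sum(base_rows.values())
new_total = old_total + upstream_fixes

print(old_total, new_total)  # 143 157
```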
  
  Notes on the upstream cherry-picks (item 2):
  
  The 103 upstream commits span 1bfd0faa78d0 (v6.17.9) to
  0da3050bdded (Merge of for-7.0/cxl-aer-prep into cxl-for-next).
  This range includes 17 out of 34 patches from Terry Bowman's v14 series
  that were reworked by the CXL maintainer and merged into v7.0 via the
  for-7.0/cxl-aer-prep branch. The remaining 17 patches from Terry's v14
  were refactored into v15 (9 patches, not yet merged) and are not included
  in this port.
  
  Notes on the save/restore and reset series (items 6–7):
  
  Srirangan's patches were authored against upstream v7.0-rc1 (which does not
  include Alejandro's v23 Type-2 series). For this port, the header
  reorganization in patch 2/5 of the save/restore series was adapted to align
  with Alejandro's v23 approach: HDM decoder and register map definitions were
  moved to include/cxl/cxl.h (not include/cxl/pci.h as in the original
  patch) to follow the convention established by Alejandro's series. Upstream
  reviewers have indicated that Srirangan's series should be rebased on top of
  Alejandro's once it merges.
  
+ Notes on the upstream fixes (item 8):
+ 
+ 14 upstream commits were cherry-picked from torvalds/master to fix bugs in
+ the ported commits from items 2 and 6-7. These include 13 fixes (identified
+ via "Fixes:" tags in upstream) plus 1 prerequisite helper function
+ (port_to_host()) required by one of the fixes:
+ 
+ 
+ 822655e6751d (upstream SHA) cxl/port: Introduce port_to_host() helper
+ Prerequisite for 0066688dbcdc cxl/port: Hold port host lock during dport adding
+ 
+ 88c72bab77aa (upstream SHA) cxl/region: fix format string for resource_size_t
+ Fixes d6602e25819d cxl/region: Add support to indicate region has extended linear cache
+ 
+ 8fdc61faa730 (upstream SHA) soc: renesas: Fix missing dependency on CACHEMAINT_FOR_DMA
+ Fixes 4d1608d0ab33 cache: Make top level Kconfig menu a boolean dependent on RISCV
+ 
+ 521cadb4b69e (upstream SHA) riscv: ERRATA_STARFIVE_JH7100: Fix missing dependency
+ Fixes 4d1608d0ab33 cache: Make top level Kconfig menu a boolean dependent on RISCV
+ 
+ 8441c7d3bd6c (upstream SHA) cxl: Check for invalid addresses from translation funcs
+ Fixes b78b9e7b7979 cxl/region: Refactor address translation funcs for testing
+ Fixes c3dd67681c70 cxl/region: Add inject and clear poison by region offset
+ 
+ 3e8aaacdad4f (upstream SHA) cxl/port: Fix target list for multiple decoders sharing dport
+ Fixes 4f06d81e7c6a cxl: Defer dport allocation for switch ports
+ 
+ 49d106347913 (upstream SHA) cxl/acpi: Restore HBIW check before dereferencing platform_data
+ Fixes 4fe516d2ad1a cxl/acpi: Make the XOR calculations available for testing
+ 
+ 77b310bb7b5f (upstream SHA) cxl/region: Fix leakage in __construct_region()
+ Fixes d6602e25819d cxl/region: Add support to indicate region has extended linear cache
+ 
+ 0066688dbcdc (upstream SHA) cxl/port: Hold port host lock during dport adding
+ Fixes 4f06d81e7c6a cxl: Defer dport allocation for switch ports
+ 
+ 318c58852e68 (upstream SHA) cxl/memdev: fix deadlock in cxl_memdev_autoremove()
+ Fixes 29317f8dc6ed cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation
+ 
+ 0a70b7cd397e (upstream SHA) cxl: Test CXL_DECODER_F_LOCK as a bitmask
+ Fixes 2230c4bdc412 cxl: Add handling of locked CXL decoder
+ 
+ 9a6a2091324a (upstream SHA) cxl/mbox: Use proper endpoint validity check upon sanitize
+ Fixes 29317f8dc6ed cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation
+ 
+ 3bfc213d4675 (upstream SHA) soc: microchip: mpfs-mss-top-sysreg: Fix resource leak
+ Fixes 4aac11c9a6e7 soc: microchip: add mfd drivers for two syscon regions on PolarFire SoC
+ 
+ 27459f86a437 (upstream SHA) soc: microchip: mpfs-control-scb: Fix resource leak
+ Fixes 4aac11c9a6e7 soc: microchip: add mfd drivers for two syscon regions on PolarFire SoC
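
  The selection rule described for item 8 (upstream commits whose "Fixes:"
  tag names a ported commit) can be sketched as follows. This is an
  illustrative helper, not the tooling actually used for the port; the
  function names are hypothetical, and the sample message is a shortened
  stand-in built from the SHAs and subjects listed above.

```python
import re

# Kernel convention: a trailer of the form
#   Fixes: <abbreviated sha> ("subject of the fixed commit")
FIXES_RE = re.compile(r"^Fixes:\s*([0-9a-f]{8,40})\b", re.IGNORECASE | re.MULTILINE)


def fixed_shas(message: str) -> list[str]:
    """Return the SHAs named by 'Fixes:' trailers in a commit message."""
    return [sha.lower() for sha in FIXES_RE.findall(message)]


def fixes_ported_commit(message: str, ported: set[str]) -> bool:
    """True if any 'Fixes:' trailer refers to a commit in `ported`.

    SHAs are compared by prefix because abbreviations can differ in length.
    """
    return any(
        sha.startswith(p) or p.startswith(sha)
        for sha in fixed_shas(message)
        for p in ported
    )


# Shortened, illustrative commit message (body text is made up).
SAMPLE = """cxl/port: Hold port host lock during dport adding

Illustrative body text.

Fixes: 4f06d81e7c6a ("cxl: Defer dport allocation for switch ports")
"""

print(fixed_shas(SAMPLE))
print(fixes_ported_commit(SAMPLE, {"4f06d81e7c6a"}))
```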
+ 
+ 
  Lore Links:
  
  - Terry Bowman's CXL RAS series (v14, partially merged into v7.0):
    https://lore.kernel.org/all/[email protected]/
  
  - Smita Koralahalli's CXL EINJ series (v6, patch 3/9 only):
    https://lore.kernel.org/linux-cxl/[email protected]/
  
  - Alejandro Lucero's CXL Type-2 series (v23):
    https://lore.kernel.org/linux-cxl/[email protected]/
  
  - Robert Richter's multi-level interleaving fix (v1):
    https://lore.kernel.org/all/[email protected]/
  
  - Srirangan Madhavan's CXL state save/restore series:
    https://lore.kernel.org/linux-cxl/[email protected]/
  
  - Srirangan Madhavan's CXL reset series (v5):
    https://lore.kernel.org/linux-cxl/[email protected]/
  
  Testing
  
  Build Validation:
  
  - Built successfully for ARM64 4K page size kernel
  - Built successfully for ARM64 64K page size kernel
  - Built successfully for x86
  
  Runtime Testing:
  
  - Boot test on ARM64 system
  - CXL device enumeration test
  - CXL interleaving testing
  - CXL reset test
  - DVSEC save/restore verified (CXLCtl and Range registers preserved)
  
  Notes
  
  - CONFIG_PCIEAER_CXL has been removed from Kconfig by upstream commit
  d18f1b7beadf (PCI/AER: Replace PCIEAER_CXL symbol with CXL_RAS).
  The debian.master annotation for PCIEAER_CXL=y is overridden to -
  in debian.nvidia-6.17/config/annotations.
  
  - CONFIG_CXL_BUS, CONFIG_CXL_PCI, CONFIG_CXL_MEM, CONFIG_CXL_PORT
  remain tristate (not bool); the v14 series kept them as tristate,
  unlike earlier draft versions.
  
  - CONFIG_DEV_DAX, CONFIG_DEV_DAX_CXL, and CONFIG_DEV_DAX_KMEM are
  overridden from m (debian.master default) to y to support built-in
  CXL RAM region DAX access and System-RAM mapping.
  
  - CONFIG_PCI_CXL is a new hidden bool introduced by the save/restore
  series; auto-enabled when CXL_BUS=y. Gates compilation of
  drivers/pci/cxl.o for DVSEC and HDM state save/restore.
  
  - CONFIG_GENERIC_CPU_CACHE_MAINTENANCE and
  CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION are new configs
  introduced by the upstream cherry-picks; arm64 auto-selects both.
  cpu_cache_invalidate_memregion() is also used by the CXL reset
  series for cache flushing during reset.
  
  - Kernel config annotations updated in debian.nvidia-6.17/config/annotations
  to reflect all of the above changes.
  
  - Srirangan's save/restore series header reorganization was adapted to
  align with Alejandro's v23 approach (include/cxl/cxl.h instead of
  include/cxl/pci.h). See commit message on patch 2/5 for details.
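
  The config expectations in these notes can be checked against a built
  kernel's .config with a small script. The sketch below is one possible way
  to do that; the function names and the sample config text are
  hypothetical, and only the option names and values stated in the notes
  above are checked.

```python
import re


def parse_config(text: str) -> dict[str, str]:
    """Parse 'CONFIG_FOO=value' lines; '# CONFIG_FOO is not set' is skipped."""
    opts = {}
    for line in text.splitlines():
        m = re.match(r"^(CONFIG_\w+)=(.*)$", line.strip())
        if m:
            opts[m.group(1)] = m.group(2)
    return opts


def check_cxl_config(opts: dict[str, str]) -> list[str]:
    """Return a list of deviations from the expectations in the notes."""
    problems = []
    # PCIEAER_CXL was replaced by CXL_RAS and should no longer appear.
    if "CONFIG_PCIEAER_CXL" in opts:
        problems.append("CONFIG_PCIEAER_CXL should not be set")
    # The DAX options are overridden from m to y.
    for name in ("CONFIG_DEV_DAX", "CONFIG_DEV_DAX_CXL", "CONFIG_DEV_DAX_KMEM"):
        if opts.get(name) != "y":
            problems.append(f"{name} expected y, got {opts.get(name)!r}")
    # PCI_CXL is a hidden bool that is auto-enabled when CXL_BUS=y.
    if opts.get("CONFIG_CXL_BUS") == "y" and opts.get("CONFIG_PCI_CXL") != "y":
        problems.append("CONFIG_PCI_CXL expected y when CONFIG_CXL_BUS=y")
    return problems


# Hypothetical .config excerpt used only to exercise the checker.
SAMPLE_CONFIG = """
CONFIG_CXL_BUS=y
CONFIG_PCI_CXL=y
CONFIG_DEV_DAX=y
CONFIG_DEV_DAX_CXL=y
CONFIG_DEV_DAX_KMEM=y
# CONFIG_PCIEAER_CXL is not set
"""

print(check_cxl_config(parse_config(SAMPLE_CONFIG)))
```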

https://bugs.launchpad.net/bugs/2143032

Title:
  Add CXL Type-2 device support, RAS error handling, reset, state
  save/restore, and interleaving support
