[Bug 2143032] Re: Add CXL Type-2 device support, RAS error handling, reset, state save/restore, and interleaving support
This bug is awaiting verification that the linux-azure-nvidia-6.17/6.17.0-1010.11 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-noble-linux-azure-nvidia-6.17' to 'verification-done-noble-linux-azure-nvidia-6.17'. If the problem still exists, change the tag 'verification-needed-noble-linux-azure-nvidia-6.17' to 'verification-failed-noble-linux-azure-nvidia-6.17'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-noble-linux-azure-nvidia-6.17-v2 verification-needed-noble-linux-azure-nvidia-6.17

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2143032

Title:
  Add CXL Type-2 device support, RAS error handling, reset, state save/restore, and interleaving support

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-6.17/+bug/2143032/+subscriptions

--
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
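The tag change requested in these verification messages follows a fixed pattern: the 'verification-needed-' prefix becomes 'verification-done-' or 'verification-failed-', and the rest of the tag is unchanged. A minimal sketch of that renaming, assuming a hypothetical helper name (verification_tag) that is not part of any Launchpad tooling:

```python
def verification_tag(needed_tag: str, solved: bool) -> str:
    """Return the tag that should replace a 'verification-needed-*' tag.

    Hypothetical helper for illustration only; it performs the string
    rewrite described in the notification and does not talk to Launchpad.
    """
    prefix = "verification-needed-"
    if not needed_tag.startswith(prefix):
        raise ValueError(f"not a verification-needed tag: {needed_tag!r}")
    outcome = "done" if solved else "failed"
    # Keep the series/kernel suffix exactly as it appears in the original tag.
    return f"verification-{outcome}-{needed_tag[len(prefix):]}"


if __name__ == "__main__":
    tag = "verification-needed-noble-linux-azure-nvidia-6.17"
    print(verification_tag(tag, solved=True))
    print(verification_tag(tag, solved=False))
```

The same rewrite applies to the other kernels named in this thread, since only the suffix after 'verification-needed-' differs.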
[Bug 2143032] Re: Add CXL Type-2 device support, RAS error handling, reset, state save/restore, and interleaving support
This bug is awaiting verification that the linux-nvidia-bos-7.0/7.0.0-2008.8~24.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-noble-linux-nvidia-bos-7.0' to 'verification-done-noble-linux-nvidia-bos-7.0'. If the problem still exists, change the tag 'verification-needed-noble-linux-nvidia-bos-7.0' to 'verification-failed-noble-linux-nvidia-bos-7.0'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-noble-linux-nvidia-bos-7.0-v2 verification-needed-noble-linux-nvidia-bos-7.0
[Bug 2143032] Re: Add CXL Type-2 device support, RAS error handling, reset, state save/restore, and interleaving support
This bug is awaiting verification that the linux-nvidia-bos/7.0.0-2007.7 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-resolute-linux-nvidia-bos' to 'verification-done-resolute-linux-nvidia-bos'. If the problem still exists, change the tag 'verification-needed-resolute-linux-nvidia-bos' to 'verification-failed-resolute-linux-nvidia-bos'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-resolute-linux-nvidia-bos-v2 verification-needed-resolute-linux-nvidia-bos
[Bug 2143032] Re: Add CXL Type-2 device support, RAS error handling, reset, state save/restore, and interleaving support
** Tags added: kernel-daily-bug
[Bug 2143032] Re: Add CXL Type-2 device support, RAS error handling, reset, state save/restore, and interleaving support
** Also affects: linux-nvidia-bos (Ubuntu)
   Importance: Undecided
       Status: New

** Changed in: linux-nvidia-bos (Ubuntu)
       Status: New => Invalid

** Also affects: linux-nvidia-6.17 (Ubuntu Resolute)
   Importance: Undecided
       Status: New

** Also affects: linux-nvidia-bos (Ubuntu Resolute)
   Importance: Undecided
       Status: New

** Changed in: linux-nvidia-bos (Ubuntu Noble)
       Status: New => Invalid

** Changed in: linux-nvidia-6.17 (Ubuntu Resolute)
       Status: New => Invalid

** Changed in: linux-nvidia-bos (Ubuntu Resolute)
       Status: New => Fix Committed
[Bug 2143032] Re: Add CXL Type-2 device support, RAS error handling, reset, state save/restore, and interleaving support
This bug is awaiting verification that the linux-nvidia-6.17/6.17.0-1017.17 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-noble-linux-nvidia-6.17' to 'verification-done-noble-linux-nvidia-6.17'. If the problem still exists, change the tag 'verification-needed-noble-linux-nvidia-6.17' to 'verification-failed-noble-linux-nvidia-6.17'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-noble-linux-nvidia-6.17-v2 verification-needed-noble-linux-nvidia-6.17
[Bug 2143032] Re: Add CXL Type-2 device support, RAS error handling, reset, state save/restore, and interleaving support
** Also affects: linux-nvidia-6.17 (Ubuntu Noble)
   Importance: Undecided
       Status: New

** Changed in: linux-nvidia-6.17 (Ubuntu)
       Status: New => Invalid

** Changed in: linux-nvidia-6.17 (Ubuntu Noble)
       Status: New => Fix Committed
[Bug 2143032] Re: Add CXL Type-2 device support, RAS error handling, reset, state save/restore, and interleaving support
** Description changed:

  This patch series adds comprehensive CXL (Compute Express Link) support to the
  nvidia-6.17 kernel, including:

  1. CXL Type-2 device support - Enables accelerator devices (like GPUs and
     SmartNICs) to use CXL for coherent memory access via firmware-provisioned
     regions
  2. CXL RAS (Reliability, Availability, Serviceability) error handling -
     Implements PCIe Port Protocol error handling and logging for CXL Root Ports,
     Downstream Switch Ports, and Upstream Switch Ports
  3. CXL DVSEC and HDM state save/restore - Preserves CXL DVSEC control/range
     registers and HDM decoder programming across PCI resets and link transitions,
     enabling device re-initialization after reset for firmware-provisioned
     configurations
  4. CXL Reset support - Implements the CXL Reset method (CXL Spec v3.2,
     Sections 8.1.3, 9.6, 9.7) via a sysfs interface for Type-2 devices,
     including memory offlining, cache flushing, multi-function sibling
     coordination, and DVSEC reset sequencing
  5. Multi-level interleaving fix - Supports firmware-configured CXL
     interleaving where lower levels use smaller granularities than parent ports
     (reverse HPA bit ordering)
  6. Prerequisite CXL and PCI driver updates - Cherry-picked commits from
     upstream torvalds/master covering the range from v6.17.9 to the merge
     point of Terry Bowman's v14 series into v7.0
  7. CXL DAX support - Enables direct memory access to CXL RAM regions and
     mapping CXL DAX devices as System-RAM

  Key Features Added:

  - CXL Type-2 accelerator device registration and memory management
  - CXL region creation by Type-2 drivers
  - DPA (Device Physical Address) allocation interface for accelerators
  - HPA (Host Physical Address) free space enumeration
  - Multi-level CXL address translation (SPA↔HPA↔DPA)
  - CXL protocol error detection, forwarding, and recovery
  - CXL RAS error handling for Endpoints, RCH, and Switch Ports
    (replacing the old PCIEAER_CXL symbol with the new CXL_RAS def_bool)
  - CXL extended linear cache region support
  - CXL DVSEC and HDM decoder state save/restore across PCI resets
  - CXL Reset sysfs interface (/sys/bus/pci/devices/.../cxl_reset) for
    Type-2 devices with Reset Capable bit set
  - Multi-function sibling coordination during CXL reset via Non-CXL
    Function Map DVSEC
  - CPU cache flush using cpu_cache_invalidate_memregion() during reset
  - Multi-level interleaving with smaller granularities for lower decoder
    levels (firmware-provisioned configurations)
  - CXL DAX device access (DEV_DAX_CXL) and System-RAM mapping
    (DEV_DAX_KMEM)
  - CXL protocol error injection via APEI EINJ (ACPI_APEI_EINJ_CXL)

  Justification

  CXL Type-2 device support is critical for next-generation NVIDIA accelerators
  and data center workloads:

  - Enables coherent memory sharing between CPUs and accelerators
  - Supports firmware-provisioned CXL regions for accelerator memory
  - Provides proper error handling and reporting for CXL fabric errors
  - Enables device reset and state recovery for CXL Type-2 devices
  - Preserves firmware-programmed DVSEC and HDM decoder state across resets
  - Required for upcoming NVIDIA hardware with CXL capabilities

  Source

  Patch Breakdown (139 patches + 1 revert):

  #  Category                              Count  Source
  1  Revert old CXL reset (f198764)        1      OOT (cleanup)
  2  Upstream CXL/PCI prerequisite         103    Upstream torvalds/master (v6.17.9
                                                  cherry-picks → merge of Terry Bowman v14 into v7.0)
  3  Smita Koralahalli's CXL EINJ          1      LKML (v6, not yet merged)
     series v6 patch 3/9
  4  Alejandro Lucero's CXL Type-2         22     LKML (v23, not yet merged)
     series v23
  5  Robert Richter's multi-level          1      LKML (v1, not yet merged)
     interleaving fix
  6  Srirangan Madhavan's CXL state        5      LKML (v1, not yet merged)
     save/restore series
  7  Srirangan Madhavan's CXL reset        7      LKML (v5, not yet merged)
     series
+ 8  Upstream fixes for ported             14     13 Upstream merged fixes + 1
+    commits                                      prerequisite
+ 8
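The multi-level interleaving item in the description above (lower levels using smaller granularities than parent ports) can be illustrated with plain modulo interleave arithmetic. The following is a simplified sketch, not the kernel implementation: the function names, the two-level topology, and the 512/256-byte granularities are all hypothetical, and it only models power-of-two modulo interleaving (no XOR arithmetic, no 3/6/12-way sets).

```python
from typing import List, Tuple


def interleave_index(hpa_offset: int, ways: int, granularity: int) -> int:
    """Target index chosen by one decoder level under modulo interleaving.

    hpa_offset is the offset of the address from the start of the region.
    """
    return (hpa_offset // granularity) % ways


def route(hpa_offset: int, levels: List[Tuple[int, int]]) -> Tuple[int, ...]:
    """Return the target index picked at each level, root level first.

    levels is a list of (ways, granularity) pairs; hypothetical values only.
    """
    return tuple(interleave_index(hpa_offset, w, g) for w, g in levels)


if __name__ == "__main__":
    # Assumed example: the root level interleaves 2 ways at 512 bytes, while
    # the level below interleaves 2 ways at the smaller 256-byte granularity,
    # so the lower level consumes the lower HPA bit.
    levels = [(2, 512), (2, 256)]
    for offset in (0, 256, 512, 768):
        print(offset, route(offset, levels))
```

With these assumed values, consecutive 256-byte chunks land on four distinct (root, lower) target pairs, which is the property a firmware-provisioned configuration of this shape would rely on.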
[Bug 2143032] Re: Add CXL Type-2 device support, RAS error handling, reset, state save/restore, and interleaving support
** Description changed:

- This patch series adds comprehensive CXL (Compute Express Link) support
- to the nvidia-6.17 kernel, including:
+ This patch series adds comprehensive CXL (Compute Express Link) support to the
+ nvidia-6.17 kernel, including:

- 1. CXL Type-2 device support - Enables accelerator devices (like GPUs
- and SmartNICs) to use CXL for coherent memory access
-
+ 1. CXL Type-2 device support - Enables accelerator devices (like GPUs and
+    SmartNICs) to use CXL for coherent memory access via firmware-provisioned
+    regions
  2. CXL RAS (Reliability, Availability, Serviceability) error handling -
- Implements PCIe Port Protocol error handling and logging for CXL devices
-
- 3. Prerequisite CXL driver updates - Cherry-picked commits from Linux
- v6.18 that are required dependencies
-
+    Implements PCIe Port Protocol error handling and logging for CXL Root Ports,
+    Downstream Switch Ports, and Upstream Switch Ports
+ 3. CXL DVSEC and HDM state save/restore - Preserves CXL DVSEC control/range
+    registers and HDM decoder programming across PCI resets and link transitions,
+    enabling device re-initialization after reset for firmware-provisioned
+    configurations
+ 4. CXL Reset support - Implements the CXL Reset method (CXL Spec v3.2,
+    Sections 8.1.3, 9.6, 9.7) via a sysfs interface for Type-2 devices,
+    including memory offlining, cache flushing, multi-function sibling
+    coordination, and DVSEC reset sequencing
+ 5. Multi-level interleaving fix - Supports firmware-configured CXL
+    interleaving where lower levels use smaller granularities than parent ports
+    (reverse HPA bit ordering)
+ 6. Prerequisite CXL and PCI driver updates - Cherry-picked commits from
+    upstream torvalds/master covering the range from v6.17.9 to the merge
+    point of Terry Bowman's v14 series into v7.0
+ 7. CXL DAX support - Enables direct memory access to CXL RAM regions and
+    mapping CXL DAX devices as System-RAM

  Key Features Added:

- CXL Type-2 accelerator device registration and memory management
- CXL region creation by Type-2 drivers
- DPA (Device Physical Address) allocation interface for accelerators
- HPA (Host Physical Address) free space enumeration
- CXL protocol error detection, forwarding, and recovery
- RAS register mapping for CXL Endpoints and Switch Ports
+ - CXL Type-2 accelerator device registration and memory management
+ - CXL region creation by Type-2 drivers
+ - DPA (Device Physical Address) allocation interface for accelerators
+ - HPA (Host Physical Address) free space enumeration
+ - Multi-level CXL address translation (SPA↔HPA↔DPA)
+ - CXL protocol error detection, forwarding, and recovery
+ - CXL RAS error handling for Endpoints, RCH, and Switch Ports
+   (replacing the old PCIEAER_CXL symbol with the new CXL_RAS def_bool)
+ - CXL extended linear cache region support
+ - CXL DVSEC and HDM decoder state save/restore across PCI resets
+ - CXL Reset sysfs interface (/sys/bus/pci/devices/.../cxl_reset) for
+   Type-2 devices with Reset Capable bit set
+ - Multi-function sibling coordination during CXL reset via Non-CXL
+   Function Map DVSEC
+ - CPU cache flush using cpu_cache_invalidate_memregion() during reset
+ - Multi-level interleaving with smaller granularities for lower decoder
+   levels (firmware-provisioned configurations)
+ - CXL DAX device access (DEV_DAX_CXL) and System-RAM mapping
+   (DEV_DAX_KMEM)
+ - CXL protocol error injection via APEI EINJ (ACPI_APEI_EINJ_CXL)

  Justification

- CXL Type-2 device support is critical for next-generation NVIDIA
- accelerators and data center workloads:
+ CXL Type-2 device support is critical for next-generation NVIDIA accelerators
+ and data center workloads:
+
+ - Enables coherent memory sharing between CPUs and accelerators
+ - Supports firmware-provisioned CXL regions for accelerator memory
+ - Provides proper error handling and reporting for CXL fabric errors
+ - Enables device reset and state recovery for CXL Type-2 devices
+ - Preserves firmware-programmed DVSEC and HDM decoder state across resets
+ - Required for upcoming NVIDIA hardware with CXL capabilities
+
+ Source
+ Patch Breakdown (139 patches + 1 revert):
+ #  Category                              Count  Source
+
+ 1  Revert old CXL reset (f198764)        1      OOT (cleanup)
+
+ 2  Upstream CXL/PCI prerequisite         103    Upstream torvalds/master (v6.17.9
+                                                 cherry-picks → merge of Terry Bowman v14 into v7.0)
+
+ 3  Smita Koralahalli's CXL EINJ          1      LKML (v6, not yet merged)
+    series v6 patch 3/9
+ ---
[Bug 2143032] Re: Add CXL Type-2 device support, RAS error handling, reset, state save/restore, and interleaving support
** Summary changed:

- Add CXL Type-2 device support and CXL RAS error handling
+ Add CXL Type-2 device support, RAS error handling, reset, state save/restore, and interleaving support
