cxl: Add a performant (and correct) path for the non interleaved cases

Jonathan Cameron via qemu development Fri, 06 Mar 2026 07:20:36 -0800

On Fri, 6 Mar 2026 12:11:48 +0000
Alireza Sanaee <[email protected]> wrote:


> Hey everyone,

Michael,

Assuming all other feedback that might come in is good, I consider this series
ready for upstream. So if you happen to do another pull request before
the soft freeze, feel free to pick it up directly.

If not, I'll carry it on my staging tree and post a rebased version next cycle.

Otherwise for CXL stuff there are a few fixes being revised. I don't currently
see anything new other than this as being ready for upstream.

Thanks for picking up so much earlier this cycle. That has made things
a lot more manageable!

Thanks,

Jonathan
 
> 
> This is v7 of the performant CXL type 3 regions set:
> 
> 
> v6 -> v7:
>           - Rebased on top of the latest master. Base-commit stated at the
>           end of the cover letter.
>           - Thanks to Gregory and Zhijian for testing and feedback. Addressed
>           their comments.
> v5 -> v6:
>           - Use object_unparent() in the third commit when deleting alias
>           regions.
>           - Thanks to Gregory for the suggestion and testing.
> v4 -> v5:
>           - Fixed some minor patch style issues like missing trailing white
>           space and such.
> v3 -> v4: 
>           - Tear down path changed, given that it is done differently than
>           setup.
>           - Dropped Gregory's tested-by tag due to tear down changes.
> v2 -> v3:
>           - Addressed comments from Zhijian Li. Thanks for the feedback.
> v1 -> v2: 
>           - Mainly rebase.
> 
> ==========================================================
> 
> The CXL address to device decoding logic is complex because of the need
> to correctly decode fine grained interleave. The current implementation
> prevents use with KVM where executed instructions may reside in that
> memory and gives very slow performance even in TCG.
> 
> In many real cases, non-interleaved memory configurations are useful, and
> for those we can use a more conventional memory region alias, allowing
> performance similar to other memory in the system.
> 
> Whether this fast path is applicable can be established once the full
> set of HDM decoders has been committed (in whatever order the guest
> decides to commit them). As such, a check is performed on each commit /
> uncommit of an HDM decoder to establish whether the alias should be added
> or removed.
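
The re-evaluation described above can be sketched roughly as follows. This is a simplified model and not the code in the series: `HdmDecoder`, `fast_path_applicable` and `alias_action` are hypothetical names, and the model assumes the fast path needs every decoder on the route to the device to be committed and 1-way.

```python
from dataclasses import dataclass


@dataclass
class HdmDecoder:
    """Hypothetical, simplified stand-in for one HDM decoder on the path."""
    committed: bool
    interleave_ways: int  # 1 means the decoder does not interleave


def fast_path_applicable(path):
    """Return True only if every decoder is committed and 1-way."""
    return bool(path) and all(
        d.committed and d.interleave_ways == 1 for d in path
    )


def alias_action(path, alias_present):
    """Decide what to do with the alias after a commit or uncommit event."""
    wanted = fast_path_applicable(path)
    if wanted and not alias_present:
        return "add"
    if alias_present and not wanted:
        return "remove"
    return "keep"
```

Because the decision is recomputed from the whole decoder set on every event, the order in which the guest commits decoders does not matter in this model.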
> 
> 
> Performance numbers:
> 
> For a read/write test with 4K block size, 256M region size, and 1 thread
> with 100 iterations on TCG (KVM should behave similarly):
> 
>   - Non-interleaved region (fast path): 25-30 seconds.
>   - Interleaved region (no fast path):  Does not finish within 10
>     minutes.
> 
> Tested Topologies and Region Layouts
> ====================================
> 
> This series was validated across multiple CXL topology configurations,
> covering single-device, multi-device, multi-host-bridge, and switched
> fabrics. Region creation was exercised using the `cxl` userspace tool
> with both non-interleaved and interleaved setups.
> 
> Decoder and memdev identifiers were discovered using:
> 
>   cxl list
>   cxl list -D
> 
> Decoder IDs (e.g. decoder0.0) and memdev names (mem0, mem1) are
> environment-specific. Commands below use placeholders such as
> <decoder_span_both> which should be replaced with IDs from `cxl list -D`.
> 
> ---------------------------------------------------------------------
> 
> Region Layout Notation
> ----------------------
> 
> CFMW (CXL Fixed Memory Window) is shown as a linear address space
> containing regions:
> 
>   CFMW: [ R0 | R1 | R2 ]
> 
> R0, R1, R2 are regions created by `cxl create-region`.
> 
> Non-interleaved region:
> 
>   R0 (ways=1) -> entirely on one device (mem0 or mem1)
>   Fast path: APPLICABLE
> 
> 2-way interleaved region (g=256):
> 
>   R1 (ways=2, g=256) striped across devices:
> 
>     |mem0|mem1|mem0|mem1|mem0|mem1| ...
>      256  256  256  256  256  256  bytes
> 
>   Fast path: NOT APPLICABLE
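
To make the striping in the notation above concrete, here is a minimal sketch of how a region-relative offset maps to a device under simple modulo interleave. The function names are illustrative only, and real HDM decoding works on host physical address bits rather than this simplified arithmetic.

```python
def interleave_target(offset, ways, granularity):
    """Index of the device that holds the byte at `offset` in the region."""
    return (offset // granularity) % ways


def device_offset(offset, ways, granularity):
    """Offset of that byte within the selected device's share of the region."""
    chunk = offset // granularity
    return (chunk // ways) * granularity + offset % granularity


devices = ["mem0", "mem1"]
# Walk the first six 256-byte chunks of a 2-way region (ways=2, g=256).
pattern = [devices[interleave_target(off, 2, 256)] for off in range(0, 1536, 256)]
print(pattern)
```

With ways=1 the target index is always 0 and the device offset equals the region offset, which is why a plain memory region alias is enough for the non-interleaved case.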
> 
> ---------------------------------------------------------------------
> 
> 1) One device, one host bridge, one fixed window
> ------------------------------------------------
> 
> QEMU:
> 
>   -M q35,cxl=on,cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
>   -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
>   -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
>   -object memory-backend-ram,id=mem0,size=512M,share=on
>   -device cxl-type3,id=dev0,bus=rp0,memdev=mem0
> 
> Topology:
> 
>   Host
>     |
>     +-- CXL Host Bridge (cxl.0)
>          |
>          +-- Root Port (rp0)
>               |
>               +-- Type-3 (dev0, mem0)
> 
> Regions created:
> 
>   cxl create-region ... -w 1 ... mem0   (Fast path: YES)
>   cxl create-region ... -w 1 ... mem0   (Fast path: YES)
> 
> Layout:
> 
>   CFMW: [ R0 | R1 ]
> 
>   R0 -> mem0  (Fast path: YES)
>   R1 -> mem0  (Fast path: YES)
> 
> ---------------------------------------------------------------------
> 
> 2) One host bridge, two Type-3 devices (via two root ports)
> ------------------------------------------------------------
> 
> QEMU:
> 
>   -M q35,cxl=on,cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
>   -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
>   -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
>   -device cxl-rp,id=rp1,bus=cxl.0,port=1,chassis=0,slot=3
>   -object memory-backend-ram,id=mem0,size=512M,share=on
>   -object memory-backend-ram,id=mem1,size=512M,share=on
>   -device cxl-type3,id=dev0,bus=rp0,memdev=mem0
>   -device cxl-type3,id=dev1,bus=rp1,memdev=mem1
> 
> Topology:
> 
>   Host
>     |
>     +-- CXL Host Bridge (cxl.0)
>          |
>          +-- Root Port (rp0) -- Type-3 (dev0, mem0)
>          |
>          +-- Root Port (rp1) -- Type-3 (dev1, mem1)
> 
> Region patterns exercised:
> 
> 2.1 All non-interleaved:
>   R0 -> mem0  (Fast path: YES)
>   R1 -> mem0  (Fast path: YES)
>   R2 -> mem1  (Fast path: YES)
>   R3 -> mem1  (Fast path: YES)
> 
> 2.2 Interleaved + local:
>   R0 -> mem0/mem1 interleaved  (Fast path: NO)
>   R1 -> mem0                   (Fast path: YES)
> 
> 2.3 Local + interleaved + local:
>   R0 -> mem0                   (Fast path: YES)
>   R1 -> mem0/mem1 interleaved  (Fast path: NO)
>   R2 -> mem1                   (Fast path: YES)
> 
> ---------------------------------------------------------------------
> 
> 3) Two host bridges, one device per host bridge
> ------------------------------------------------
> 
> QEMU:
> 
>   -M q35,cxl=on,
>      cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,
>      cxl-fmw.1.targets.0=cxl.1,cxl-fmw.1.size=4G,
>      cxl-fmw.2.targets.0=cxl.0,cxl-fmw.2.size=4G
>   -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
>   -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
>   -object memory-backend-ram,id=mem0,size=512M,share=on
>   -device cxl-type3,id=dev0,bus=rp0,memdev=mem0
>   -device pxb-cxl,id=cxl.1,bus=pcie.0,bus_nr=13
>   -device cxl-rp,id=rp1,bus=cxl.1,port=0,chassis=1,slot=2
>   -object memory-backend-ram,id=mem1,size=512M,share=on
>   -device cxl-type3,id=dev1,bus=rp1,memdev=mem1
> 
> Region patterns are identical to section 2, and fast-path applicability is
> identical per region mapping (non-interleaved: YES, interleaved: NO).
> 
> ---------------------------------------------------------------------
> 
> 4) Switch topology
> ------------------
> 
> QEMU:
> 
>   -M q35,cxl=on,cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
>   -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
>   -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
>   -device cxl-rp,id=rp1,bus=cxl.0,port=1,chassis=0,slot=3
>   -device cxl-upstream,id=us0,bus=rp0
>   -device cxl-downstream,id=ds0,bus=us0,port=0,chassis=0,slot=4
>   -object memory-backend-ram,id=mem0,size=512M,share=on
>   -device cxl-type3,id=dev0,bus=ds0,memdev=mem0
> 
> Topology (detailed):
> 
>   Host
>     |
>     +-- CXL Host Bridge (cxl.0)
>          |
>          +-- Root Port (rp0)
>          |     |
>          |     +-- CXL Switch (upstream us0)
>          |           |
>          |           +-- Downstream Port (ds0) -- Type-3 (mem0)
>          |           |
>          |           +-- Downstream Port (ds1) -- Type-3 (mem1) [optional]
>          +-- Root Port (rp1)
>                |
>                +-- More devices/switches.
> 
> Fast-path interpretation in this topology:
> 
>   If only mem0 exists:
>     All regions -> Fast path: YES
> 
>   If mem0 and mem1 exist:
>     Non-interleaved regions -> Fast path: YES
>     Interleaved regions     -> Fast path: NO
> 
> ---------------------------------------------------------------------
> 
> Summary
> -------
> 
> Across all topologies, region creation, enablement, and HDM decoder
> commit/uncommit flows were exercised. The fast path is enabled only when
> all decoders describe a non-interleaved mapping and is removed when any
> interleave configuration is introduced.
> 
> Alireza Sanaee (3):
>   hw/cxl: Use HPA in cxl_cfmws_find_device() rather than offset in
>     window.
>   hw/cxl: Allow cxl_cfmws_find_device() to filter on whether interleaved
>     paths are accepted
>   hw/cxl: Add a performant (and correct) path for the non interleaved
>     cases
> 
>  hw/cxl/cxl-component-utils.c |   6 +
>  hw/cxl/cxl-host.c            | 231 +++++++++++++++++++++++++++++++++--
>  hw/mem/cxl_type3.c           |   4 +
>  include/hw/cxl/cxl.h         |   1 +
>  include/hw/cxl/cxl_device.h  |   4 +
>  5 files changed, 237 insertions(+), 9 deletions(-)
> 
> 
> base-commit: 483cb5b74cd247b1520e0994b4fae4d8fe44cb00

Re: [PATCH v7 0/3] hw/cxl: Add a performant (and correct) path for the non interleaved cases
