On Fri, 6 Mar 2026 12:11:48 +0000
Alireza Sanaee <[email protected]> wrote:

Hi,

> Hey everyone,
> 
> This is v7 of the performant CXL type 3 regions set:
> 
> 
> v7 -> v8:
oops, typo. v8 is wrong.
v6 -> v7:
>           - Rebased on top of the latest master. Base-commit stated at the
>             end of the cover letter.
>           - Thanks to Gregory and Zhijian for testing and feedback. Addressed
>             their comments.
> v5 -> v6: 
>           - Use object_unparent() in the third commit when deleting alias
>             regions.
>           - Thanks to Gregory for the suggestion and testing.
> v4 -> v5: 
>           - Fixed some minor patch style issues, such as missing trailing
>             whitespace.
> v3 -> v4: 
>           - Tear down path changed, given that it is done differently than
>           setup.
>           - Dropped Gregory's tested-by tag due to tear down changes.
> v2 -> v3: 
>           - Addressed comments from Zhijian Li. Thanks for the feedback.
> v1 -> v2: 
>           - Mainly rebase.
> 
> ==========================================================
> 
> The CXL address-to-device decoding logic is complex because of the need
> to correctly decode fine-grained interleave. The current implementation
> prevents use with KVM where executed instructions may reside in that
> memory and gives very slow performance even in TCG.
> 
> In many real cases non-interleaved memory configurations are useful and
> for those we can use a more conventional memory region alias allowing
> similar performance to other memory in the system.
> 
> Whether this fast path is applicable can be established once the full
> set of HDM decoders has been committed (in whatever order the guest
> decides to commit them). As such, a check is performed on each commit /
> uncommit of an HDM decoder to establish whether the alias should be added
> or removed.
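
The check described above can be sketched roughly as follows. This is a
minimal illustration in Python, not the code from the series: the Decoder
class, fast_path_applicable() and alias_action() are hypothetical names,
and it assumes a region's path is modelled as a plain list of decoders,
each with an interleave-ways value.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Decoder:
    """Hypothetical model of one HDM decoder on the path to a device."""
    committed: bool
    interleave_ways: int


def fast_path_applicable(path: List[Decoder]) -> bool:
    """True when every decoder on the path is committed and non-interleaved."""
    return bool(path) and all(
        d.committed and d.interleave_ways == 1 for d in path
    )


def alias_action(path: List[Decoder], alias_present: bool) -> str:
    """Decide what to do with the alias after a commit or uncommit event."""
    wanted = fast_path_applicable(path)
    if wanted and not alias_present:
        return "add_alias"
    if not wanted and alias_present:
        return "remove_alias"
    return "no_change"
```

Because the decision is recomputed from the full decoder set on every
event, the order in which the guest commits decoders does not matter in
this sketch.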
> 
> 
> Performance numbers:
> 
> For a read/write test with 4K block size, 256M region size, and 1 thread
> with 100 iterations on TCG (similar results are expected on KVM):
> 
>   - Non-interleaved region (fast path): 25-30 seconds.
>   - Interleaved region (no fast path):  Did not finish within 10
>     minutes.
> 
> Tested Topologies and Region Layouts
> ====================================
> 
> This series was validated across multiple CXL topology configurations,
> covering single-device, multi-device, multi-host-bridge, and switched
> fabrics. Region creation was exercised using the `cxl` userspace tool
> with both non-interleaved and interleaved setups.
> 
> Decoder and memdev identifiers were discovered using:
> 
>   cxl list
>   cxl list -D
> 
> Decoder IDs (e.g. decoder0.0) and memdev names (mem0, mem1) are
> environment-specific. Commands below use placeholders such as
> <decoder_span_both> which should be replaced with IDs from `cxl list -D`.
> 
> ---------------------------------------------------------------------
> 
> Region Layout Notation
> ----------------------
> 
> CFMW (CXL Fixed Memory Window) is shown as a linear address space
> containing regions:
> 
>   CFMW: [ R0 | R1 | R2 ]
> 
> R0, R1, R2 are regions created by `cxl create-region`.
> 
> Non-interleaved region:
> 
>   R0 (ways=1) -> entirely on one device (mem0 or mem1)
>   Fast path: APPLICABLE
> 
> 2-way interleaved region (g=256):
> 
>   R1 (ways=2, g=256) striped across devices:
> 
>     |mem0|mem1|mem0|mem1|mem0|mem1| ...
>      256  256  256  256  256  256  bytes
> 
>   Fast path: NOT APPLICABLE
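
The striping in the diagram above can be illustrated with plain modulo
arithmetic. This is a sketch only: decode_interleave() is a hypothetical
helper, it assumes simple modulo interleave (no XOR arithmetic), and it
does not reflect how the decode is implemented in hw/cxl.

```python
def decode_interleave(offset: int, ways: int, granularity: int):
    """Map an offset within a region to (target index, offset on that device).

    Assumes plain modulo interleave: consecutive chunks of `granularity`
    bytes rotate across `ways` targets.
    """
    chunk = offset // granularity      # which granularity-sized chunk
    target = chunk % ways              # device that owns this chunk
    device_offset = (chunk // ways) * granularity + offset % granularity
    return target, device_offset
```

With ways=2 and g=256, offsets 0..255 land on target 0, 256..511 on
target 1, 512..767 on target 0 again, and so on. With ways=1 the mapping
is the identity, which is why a simple memory region alias is sufficient
for the non-interleaved case.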
> 
> ---------------------------------------------------------------------
> 
> 1) One device, one host bridge, one fixed window
> ------------------------------------------------
> 
> QEMU:
> 
>   -M q35,cxl=on,cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
>   -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
>   -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
>   -object memory-backend-ram,id=mem0,size=512M,share=on
>   -device cxl-type3,id=dev0,bus=rp0,memdev=mem0
> 
> Topology:
> 
>   Host
>     |
>     +-- CXL Host Bridge (cxl.0)
>          |
>          +-- Root Port (rp0)
>               |
>               +-- Type-3 (dev0, mem0)
> 
> Regions created:
> 
>   cxl create-region ... -w 1 ... mem0   (Fast path: YES)
>   cxl create-region ... -w 1 ... mem0   (Fast path: YES)
> 
> Layout:
> 
>   CFMW: [ R0 | R1 ]
> 
>   R0 -> mem0  (Fast path: YES)
>   R1 -> mem0  (Fast path: YES)
> 
> ---------------------------------------------------------------------
> 
> 2) One host bridge, two Type-3 devices (via two root ports)
> ------------------------------------------------------------
> 
> QEMU:
> 
>   -M q35,cxl=on,cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
>   -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
>   -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
>   -device cxl-rp,id=rp1,bus=cxl.0,port=1,chassis=0,slot=3
>   -object memory-backend-ram,id=mem0,size=512M,share=on
>   -object memory-backend-ram,id=mem1,size=512M,share=on
>   -device cxl-type3,id=dev0,bus=rp0,memdev=mem0
>   -device cxl-type3,id=dev1,bus=rp1,memdev=mem1
> 
> Topology:
> 
>   Host
>     |
>     +-- CXL Host Bridge (cxl.0)
>          |
>          +-- Root Port (rp0) -- Type-3 (dev0, mem0)
>          |
>          +-- Root Port (rp1) -- Type-3 (dev1, mem1)
> 
> Region patterns exercised:
> 
> 2.1 All non-interleaved:
>   R0 -> mem0  (Fast path: YES)
>   R1 -> mem0  (Fast path: YES)
>   R2 -> mem1  (Fast path: YES)
>   R3 -> mem1  (Fast path: YES)
> 
> 2.2 Interleaved + local:
>   R0 -> mem0/mem1 interleaved  (Fast path: NO)
>   R1 -> mem0                   (Fast path: YES)
> 
> 2.3 Local + interleaved + local:
>   R0 -> mem0                   (Fast path: YES)
>   R1 -> mem0/mem1 interleaved  (Fast path: NO)
>   R2 -> mem1                   (Fast path: YES)
> 
> ---------------------------------------------------------------------
> 
> 3) Two host bridges, one device per host bridge
> ------------------------------------------------
> 
> QEMU:
> 
>   -M q35,cxl=on,
>      cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,
>      cxl-fmw.1.targets.0=cxl.1,cxl-fmw.1.size=4G,
>      cxl-fmw.2.targets.0=cxl.0,cxl-fmw.2.size=4G
>   -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
>   -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
>   -object memory-backend-ram,id=mem0,size=512M,share=on
>   -device cxl-type3,id=dev0,bus=rp0,memdev=mem0
>   -device pxb-cxl,id=cxl.1,bus=pcie.0,bus_nr=13
>   -device cxl-rp,id=rp1,bus=cxl.1,port=0,chassis=1,slot=2
>   -object memory-backend-ram,id=mem1,size=512M,share=on
>   -device cxl-type3,id=dev1,bus=rp1,memdev=mem1
> 
> Region patterns are identical to section 2, and fast-path applicability
> is identical per region mapping (non-interleaved: YES, interleaved: NO).
> 
> ---------------------------------------------------------------------
> 
> 4) Switch topology
> ------------------
> 
> QEMU:
> 
>   -M q35,cxl=on,cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
>   -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
>   -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
>   -device cxl-rp,id=rp1,bus=cxl.0,port=1,chassis=0,slot=3
>   -device cxl-upstream,id=us0,bus=rp0
>   -device cxl-downstream,id=ds0,bus=us0,port=0,chassis=0,slot=4
>   -object memory-backend-ram,id=mem0,size=512M,share=on
>   -device cxl-type3,id=dev0,bus=ds0,memdev=mem0
> 
> Topology (detailed):
> 
>   Host
>     |
>     +-- CXL Host Bridge (cxl.0)
>          |
>          +-- Root Port (rp0)
>          |     |
>          |     +-- CXL Switch (upstream us0)
>          |           |
>          |           +-- Downstream Port (ds0) -- Type-3 (mem0)
>          |           |
>          |           +-- Downstream Port (ds1) -- Type-3 (mem1) [optional]
>          +-- Root Port (rp1)
>                |
>                +-- More devices/switches.
> 
> Fast-path interpretation in this topology:
> 
>   If only mem0 exists:
>     All regions -> Fast path: YES
> 
>   If mem0 and mem1 exist:
>     Non-interleaved regions -> Fast path: YES
>     Interleaved regions     -> Fast path: NO
> 
> ---------------------------------------------------------------------
> 
> Summary
> -------
> 
> Across all topologies, region creation, enablement, and HDM decoder
> commit/uncommit flows were exercised. The fast path is enabled only when
> all decoders describe a non-interleaved mapping and is removed when any
> interleave configuration is introduced.
> 
> Alireza Sanaee (3):
>   hw/cxl: Use HPA in cxl_cfmws_find_device() rather than offset in
>     window.
>   hw/cxl: Allow cxl_cfmws_find_device() to filter on whether interleaved
>     paths are accepted
>   hw/cxl: Add a performant (and correct) path for the non interleaved
>     cases
> 
>  hw/cxl/cxl-component-utils.c |   6 +
>  hw/cxl/cxl-host.c            | 231 +++++++++++++++++++++++++++++++++--
>  hw/mem/cxl_type3.c           |   4 +
>  include/hw/cxl/cxl.h         |   1 +
>  include/hw/cxl/cxl_device.h  |   4 +
>  5 files changed, 237 insertions(+), 9 deletions(-)
> 
> 
> base-commit: 483cb5b74cd247b1520e0994b4fae4d8fe44cb00

