On Fri, 6 Mar 2026 12:11:48 +0000 Alireza Sanaee <[email protected]> wrote:
Hi,

> Hey everyone,
>
> This is v7 of performant CXL type 3 regions set:
>
>
> v7 -> v8:

oops, typo. v8 is wrong. v6 -> v7:

>   - Rebased on top of the latest master. Base-commit stated at the
>     end of cover-letter.
>   - Thanks to Gregory and Zhijian for testing and feedback. Addressed
>     their comments.
> v5 -> v6:
>   - Use object_unparent() in the third commit when deleting alias
>     regions.
>   - Thanks to Gregory for the suggestion and testing.
> v4 -> v5:
>   - Fixed some minor patch style like missing trailing white space
>     and such.
> v3 -> v4:
>   - Tear down path changed, given that it is done differently than
>     setup.
>   - Dropped Gregory's tested-by tag due to tear down changes.
> v2 -> v3:
>   - Addressing Zhijian Li. Thanks for the feedback.
> v1 -> v2:
>   - Mainly rebase.
>
> ==========================================================
>
> The CXL address to device decoding logic is complex because of the need
> to correctly decode fine grained interleave. The current implementation
> prevents use with KVM where executed instructions may reside in that
> memory and gives very slow performance even in TCG.
>
> In many real cases non interleaved memory configurations are useful and
> for those we can use a more conventional memory region alias allowing
> similar performance to other memory in the system.
>
> Whether this fast path is applicable can be established once the full
> set of HDM decoders has been committed (in whatever order the guest
> decides to commit them). As such a check is performed on each commit /
> uncommit of HDM decoder to establish if the alias should be added or
> removed.
>
>
> Performance numbers:
>
> For a read/write test with 4K block size, 256M region size, and 1 thread
> with 100 iteration on TCG (it should do similar on KVM):
>
>   - Non-interleaved region (fast path): 25-30 seconds.
>   - Interleaved region (no fast path): Never finishes within 10
>     minutes.
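
For readers following along, the commit / uncommit re-check described
above can be pictured with a small toy model. This is only a sketch
under stated assumptions, not the code in the series: the class name
FastPathTracker, the decoder "level" labels and the alias_mapped flag
are all hypothetical, and real HDM decoder state lives in device
registers rather than a Python dict.

```python
class FastPathTracker:
    """Toy model: re-evaluate fast-path applicability on every
    commit/uncommit. All names here are illustrative assumptions."""

    def __init__(self, required_levels):
        # Decoder levels that must all be committed before the route
        # from host to device is fully described.
        self.required = set(required_levels)
        self.committed = {}        # level -> interleave ways
        self.alias_mapped = False  # stands in for "alias region present"

    def _reevaluate(self):
        # The full set must be committed, in whatever order it arrived.
        complete = self.required.issubset(self.committed)
        # Every committed decoder must be non-interleaved (ways == 1).
        direct = all(ways == 1 for ways in self.committed.values())
        self.alias_mapped = complete and direct

    def commit(self, level, ways):
        self.committed[level] = ways
        self._reevaluate()

    def uncommit(self, level):
        self.committed.pop(level, None)
        self._reevaluate()


tracker = FastPathTracker(["host-bridge", "device"])
tracker.commit("device", 1)       # incomplete: no alias yet
tracker.commit("host-bridge", 1)  # complete and non-interleaved
print(tracker.alias_mapped)
```

The point of the sketch is only that the decision is recomputed from
the whole committed set each time, so commit order does not matter.
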
>
> Tested Topologies and Region Layouts
> ====================================
>
> This series was validated across multiple CXL topology configurations,
> covering single-device, multi-device, multi-host-bridge, and switched
> fabrics. Region creation was exercised using the `cxl` userspace tool
> with both non-interleaved and interleaved setups.
>
> Decoder and memdev identifiers were discovered using:
>
>   cxl list
>   cxl list -D
>
> Decoder IDs (e.g. decoder0.0) and memdev names (mem0, mem1) are
> environment-specific. Commands below use placeholders such as
> <decoder_span_both> which should be replaced with IDs from `cxl list -D`.
>
> ---------------------------------------------------------------------
>
> Region Layout Notation
> ----------------------
>
> CFMW (CXL Fixed Memory Window) is shown as a linear address space
> containing regions:
>
>   CFMW: [ R0 | R1 | R2 ]
>
> R0, R1, R2 are regions created by `cxl create-region`.
>
> Non-interleaved region:
>
>   R0 (ways=1) -> entirely on one device (mem0 or mem1)
>   Fast path: APPLICABLE
>
> 2-way interleaved region (g=256):
>
>   R1 (ways=2, g=256) striped across devices:
>
>     |mem0|mem1|mem0|mem1|mem0|mem1| ...
>      256  256  256  256  256  256  bytes
>
>   Fast path: NOT APPLICABLE
>
> ---------------------------------------------------------------------
>
> 1) One device, one host bridge, one fixed window
> ------------------------------------------------
>
> QEMU:
>
>   -M q35,cxl=on,cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
>   -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
>   -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
>   -object memory-backend-ram,id=mem0,size=512M,share=on
>   -device cxl-type3,id=dev0,bus=rp0,memdev=mem0
>
> Topology:
>
>   Host
>     |
>     +-- CXL Host Bridge (cxl.0)
>           |
>           +-- Root Port (rp0)
>                 |
>                 +-- Type-3 (dev0, mem0)
>
> Regions created:
>
>   cxl create-region ... -w 1 ... mem0   (Fast path: YES)
>   cxl create-region ... -w 1 ... mem0   (Fast path: YES)
>
> Layout:
>
>   CFMW: [ R0 | R1 ]
>
>   R0 -> mem0 (Fast path: YES)
>   R1 -> mem0 (Fast path: YES)
>
> ---------------------------------------------------------------------
>
> 2) One host bridge, two Type-3 devices (via two root ports)
> ------------------------------------------------------------
>
> QEMU:
>
>   -M q35,cxl=on,cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
>   -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
>   -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
>   -device cxl-rp,id=rp1,bus=cxl.0,port=1,chassis=0,slot=3
>   -object memory-backend-ram,id=mem0,size=512M,share=on
>   -object memory-backend-ram,id=mem1,size=512M,share=on
>   -device cxl-type3,id=dev0,bus=rp0,memdev=mem0
>   -device cxl-type3,id=dev1,bus=rp1,memdev=mem1
>
> Topology:
>
>   Host
>     |
>     +-- CXL Host Bridge (cxl.0)
>           |
>           +-- Root Port (rp0) -- Type-3 (dev0, mem0)
>           |
>           +-- Root Port (rp1) -- Type-3 (dev1, mem1)
>
> Region patterns exercised:
>
>   2.1 All non-interleaved:
>       R0 -> mem0 (Fast path: YES)
>       R1 -> mem0 (Fast path: YES)
>       R2 -> mem1 (Fast path: YES)
>       R3 -> mem1 (Fast path: YES)
>
>   2.2 Interleaved + local:
>       R0 -> mem0/mem1 interleaved (Fast path: NO)
>       R1 -> mem0 (Fast path: YES)
>
>   2.3 Local + interleaved + local:
>       R0 -> mem0 (Fast path: YES)
>       R1 -> mem0/mem1 interleaved (Fast path: NO)
>       R2 -> mem1 (Fast path: YES)
>
> ---------------------------------------------------------------------
>
> 3) Two host bridges, one device per host bridge
> ------------------------------------------------
>
> QEMU:
>
>   -M q35,cxl=on,
>      cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,
>      cxl-fmw.1.targets.0=cxl.1,cxl-fmw.1.size=4G,
>      cxl-fmw.2.targets.0=cxl.0,cxl-fmw.2.size=4G
>   -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
>   -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
>   -object memory-backend-ram,id=mem0,size=512M,share=on
>   -device cxl-type3,id=dev0,bus=rp0,memdev=mem0
>   -device pxb-cxl,id=cxl.1,bus=pcie.0,bus_nr=13
>   -device cxl-rp,id=rp1,bus=cxl.1,port=0,chassis=1,slot=2
>   -object memory-backend-ram,id=mem1,size=512M,share=on
>   -device cxl-type3,id=dev1,bus=rp1,memdev=mem1
>
> Region patterns identical to section 2, and fast-path applicability is
> identical per region mapping (non-interleaved: YES, interleaved: NO).
>
> ---------------------------------------------------------------------
>
> 4) Switch topology
> ------------------
>
> QEMU:
>
>   -M q35,cxl=on,cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
>   -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
>   -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
>   -device cxl-rp,id=rp1,bus=cxl.0,port=0,chassis=0,slot=3
>   -device cxl-upstream,id=us0,bus=rp0
>   -device cxl-downstream,id=ds0,bus=us0,port=0,chassis=0,slot=4
>   -object memory-backend-ram,id=mem0,size=512M,share=on
>   -device cxl-type3,id=dev0,bus=ds0,memdev=mem0
>
> Topology (detailed):
>
>   Host
>     |
>     +-- CXL Host Bridge (cxl.0)
>           |
>           +-- Root Port (rp0)
>           |     |
>           |     +-- CXL Switch (upstream us0)
>           |           |
>           |           +-- Downstream Port (ds0) -- Type-3 (mem0)
>           |           |
>           |           +-- Downstream Port (ds1) -- Type-3 (mem1) [optional]
>           +-- Root Port (rp1)
>                 |
>                 +-- More devices/switches.
>
> Fast-path interpretation in this topology:
>
>   If only mem0 exists:
>     All regions -> Fast path: YES
>
>   If mem0 and mem1 exist:
>     Non-interleaved regions -> Fast path: YES
>     Interleaved regions     -> Fast path: NO
>
> ---------------------------------------------------------------------
>
> Summary
> -------
>
> Across all topologies, region creation, enablement, and HDM decoder
> commit/uncommit flows were exercised. The fast path is enabled only when
> all decoders describe a non-interleaved mapping and is removed when any
> interleave configuration is introduced.
>
> Alireza Sanaee (3):
>   hw/cxl: Use HPA in cxl_cfmws_find_device() rather than offset in
>     window.
>   hw/cxl: Allow cxl_cfmws_find_device() to filter on whether interleaved
>     paths are accepted
>   hw/cxl: Add a performant (and correct) path for the non interleaved
>     cases
>
>  hw/cxl/cxl-component-utils.c |   6 +
>  hw/cxl/cxl-host.c            | 231 +++++++++++++++++++++++++++++++++--
>  hw/mem/cxl_type3.c           |   4 +
>  include/hw/cxl/cxl.h         |   1 +
>  include/hw/cxl/cxl_device.h  |   4 +
>  5 files changed, 237 insertions(+), 9 deletions(-)
>
>
> base-commit: 483cb5b74cd247b1520e0994b4fae4d8fe44cb00
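
As an aside on the Region Layout Notation, the striping drawn for
R1 (ways=2, g=256) can be worked through with plain modulo arithmetic.
The sketch below is a simplification for illustration only: the
function name interleave_target is made up, offsets are taken relative
to the start of the region, and it does not reproduce the HDM decoder
register decode that the QEMU code actually performs.

```python
def interleave_target(offset, ways, granularity):
    """Map a region-relative byte offset to (target index, device offset)
    under simple modulo interleave. Illustrative only."""
    chunk = offset // granularity   # which granule the byte falls in
    target = chunk % ways           # granules rotate across the targets
    # Each device sees every ways-th granule, packed back to back.
    dev_offset = (chunk // ways) * granularity + offset % granularity
    return target, dev_offset


# 2-way, 256-byte granularity: consecutive granules alternate devices.
for off in (0, 256, 512, 768):
    print(off, interleave_target(off, ways=2, granularity=256))

# ways=1 degenerates to an identity mapping on a single device, which is
# why a plain memory region alias is sufficient for that case.
print(interleave_target(300, ways=1, granularity=256))
```

With ways=2 neighbouring 256 byte granules land on different devices,
so no single linear alias can describe the region.
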
