On Mon, Jan 08, 2024 at 05:05:38PM -0800, Hao Xiang wrote:
> On Mon, Jan 8, 2024 at 2:47 PM Hao Xiang <hao.xi...@bytedance.com> wrote:
> >
> > On Mon, Jan 8, 2024 at 9:15 AM Gregory Price <gregory.pr...@memverge.com>
> > wrote:
> > >
> > > On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> > > > On Wed, Jan 3, 2024 at 1:56 PM Gregory Price
> > > > <gregory.pr...@memverge.com> wrote:
> > > > >
> > > > > For a variety of performance reasons, this will not work the way you
> > > > > want it to. You are essentially telling QEMU to map the vmem0 into a
> > > > > virtual cxl device, and now any memory accesses to that memory region
> > > > > will end up going through the cxl-type3 device logic - which is an IO
> > > > > path from the perspective of QEMU.
> > > >
> > > > I didn't understand exactly how the virtual cxl-type3 device works. I
> > > > thought it would go with the same "guest virtual address -> guest
> > > > physical address -> host physical address" translation totally done by
> > > > CPU. But if it is going through an emulation path handled by virtual
> > > > cxl-type3, I agree the performance would be bad. Do you know why
> > > > accessing memory on a virtual cxl-type3 device can't go with the
> > > > nested page table translation?
> > > >
> > >
> > > Because a byte-access on CXL memory can have checks on it that must be
> > > emulated by the virtual device, and because there are caching
> > > implications that have to be emulated as well.
> >
> > Interesting. Now that I see the cxl_type3_read/cxl_type3_write. If the
> > CXL memory data path goes through them, the performance would be
> > pretty problematic. We have actually run Intel's Memory Latency
> > Checker benchmark from inside a guest VM with both system-DRAM and
> > virtual CXL-type3 configured. The idle latency on the virtual CXL
> > memory is 2X of system DRAM, which is on-par with the benchmark
> > running from a physical host. I need to debug this more to understand
> > why the latency is actually much better than I would expect now.
>
> So we double checked on benchmark testing. What we see is that running
> Intel Memory Latency Checker from a guest VM with virtual CXL memory
> VS from a physical host with CXL1.1 memory expander has the same
> latency.
>
> From guest VM: local socket system-DRAM latency is 117.0ns, local
> socket CXL-DRAM latency is 269.4ns
> From physical host: local socket system-DRAM latency is 113.6ns,
> local socket CXL-DRAM latency is 267.5ns
>
> I also set debugger breakpoints on cxl_type3_read/cxl_type3_write
> while running the benchmark testing but those two functions are not
> ever hit. We used the virtual CXL configuration while launching QEMU
> but the CXL memory is present as a separate NUMA node and we are not
> creating devdax devices. Does that make any difference?
>
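For what it's worth, my working assumption about the two ways that memory
usually gets exposed to the guest is sketched below. This is not your command
line; the backend/device names and sizes are placeholders. The first form
routes the memory through the emulated type3 device (roughly the
volatile-memory example in QEMU's docs/system/devices/cxl.rst), and the guest
then has to create a region and online it with the cxl/daxctl tooling before
it shows up as a NUMA node. The second form hands the memory to the guest as
a plain CPU-less NUMA node at boot and never involves the type3 device at all:

  # Form 1: memory behind an emulated CXL type3 device (placeholder names/sizes)
  qemu-system-x86_64 -M q35,cxl=on -m 4G,maxmem=8G,slots=8 -smp 4 \
    -object memory-backend-ram,id=vmem0,share=on,size=256M \
    -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
    -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
    -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
    -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G

  # Form 2: memory exposed directly as a second, CPU-less NUMA node;
  # the guest just sees ordinary RAM and the type3 device is not involved
  qemu-system-x86_64 -M q35 -m 8G -smp 4 \
    -object memory-backend-ram,id=mem0,size=4G \
    -object memory-backend-ram,id=mem1,size=4G \
    -numa node,nodeid=0,cpus=0-3,memdev=mem0 \
    -numa node,nodeid=1,memdev=mem1

If your setup looks more like the second form, the guest would map that memory
as regular RAM through the nested page tables, which would be consistent with
your cxl_type3_read/cxl_type3_write breakpoints never firing.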
Could you possibly share your full QEMU configuration and what OS/kernel you
are running inside the guest? The one thing that surprises me is that the
NUMA node shows up without the CXL driver having to create it; it's possible
I missed a QEMU update that allows this.

~Gregory