On Mon, Jan 08, 2024 at 05:05:38PM -0800, Hao Xiang wrote:
> On Mon, Jan 8, 2024 at 2:47 PM Hao Xiang <hao.xi...@bytedance.com> wrote:
> >
> > On Mon, Jan 8, 2024 at 9:15 AM Gregory Price <gregory.pr...@memverge.com> 
> > wrote:
> > >
> > > On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> > > > On Wed, Jan 3, 2024 at 1:56 PM Gregory Price 
> > > > <gregory.pr...@memverge.com> wrote:
> > > > >
> > > > > For a variety of performance reasons, this will not work the way you
> > > > > want it to.  You are essentially telling QEMU to map vmem0 into a
> > > > > virtual cxl device, and now any memory accesses to that memory region
> > > > > will end up going through the cxl-type3 device logic - which is an IO
> > > > > path from the perspective of QEMU.
> > > >
> > > > I didn't understand exactly how the virtual cxl-type3 device works. I
> > > > thought it would use the same "guest virtual address -> guest
> > > > physical address -> host physical address" translation done entirely
> > > > by the CPU. But if it goes through an emulation path handled by the
> > > > virtual cxl-type3 device, I agree the performance would be bad. Do
> > > > you know why accesses to memory on a virtual cxl-type3 device can't
> > > > use nested page table translation?
> > > >
> > >
> > > Because a byte-access on CXL memory can have checks on it that must be
> > > emulated by the virtual device, and because there are caching
> > > implications that have to be emulated as well.
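
To expand on what I meant here: whether a guest access traps into QEMU at
all depends on how the backing MemoryRegion is set up.  Roughly speaking
(this is a sketch against the generic memory API with made-up helper
names, not the actual hw/mem/cxl_type3.c code):

#include "qemu/osdep.h"
#include "qemu/units.h"
#include "qapi/error.h"
#include "exec/memory.h"

/* I/O-backed region: every guest access exits to QEMU and lands in these
 * callbacks, so the device model can emulate checks, caching, etc. */
static uint64_t emu_read(void *opaque, hwaddr addr, unsigned size)
{
    return 0;   /* device model would service the read here */
}

static void emu_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
{
    /* device model would service the write here */
}

static const MemoryRegionOps emu_ops = {
    .read = emu_read,
    .write = emu_write,
    .endianness = DEVICE_LITTLE_ENDIAN,
};

static void attach_regions(Object *owner, MemoryRegion *io_mr,
                           MemoryRegion *ram_mr)
{
    /* Slow path: accesses go through the callbacks above. */
    memory_region_init_io(io_mr, owner, &emu_ops, NULL, "emulated-window",
                          4 * KiB);

    /* Fast path: plain RAM mapped into the guest and reached through
     * EPT/NPT with no VM exit -- DRAM-like latency. */
    memory_region_init_ram(ram_mr, owner, "ram-window", 16 * MiB,
                           &error_fatal);
}

If the region backing the virtual CXL memory ends up on the RAM-backed
path rather than the I/O path, that would explain near-native latency and
no hits on cxl_type3_read/cxl_type3_write, which is what you describe
below.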
> >
> > Interesting. Now I see cxl_type3_read/cxl_type3_write. If the CXL
> > memory data path goes through them, the performance would be pretty
> > problematic. We have actually run Intel's Memory Latency Checker
> > benchmark from inside a guest VM with both system DRAM and a virtual
> > CXL type-3 device configured. The idle latency on the virtual CXL
> > memory is 2x that of system DRAM, which is on par with the benchmark
> > running on a physical host. I need to debug this more to understand
> > why the latency is much better than I would now expect.
> 
> So we double-checked the benchmark results. What we see is that running
> Intel Memory Latency Checker from a guest VM with virtual CXL memory
> vs. from a physical host with a CXL 1.1 memory expander gives the same
> latency.
> 
> From the guest VM: local-socket system-DRAM latency is 117.0 ns,
> local-socket CXL-DRAM latency is 269.4 ns.
> From the physical host: local-socket system-DRAM latency is 113.6 ns,
> local-socket CXL-DRAM latency is 267.5 ns.
> 
> I also set debugger breakpoints on cxl_type3_read/cxl_type3_write
> while running the benchmark, but those two functions are never hit.
> We launch QEMU with the virtual CXL configuration, but the CXL memory
> is presented as a separate NUMA node and we are not creating devdax
> devices. Does that make any difference?
> 

Could you possibly share your full QEMU configuration and what OS/kernel
you are running inside the guest?
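
For reference, the sort of invocation I have in mind looks something like
the following (illustrative only -- lifted from QEMU's CXL emulation docs,
with made-up sizes and IDs, not your actual command line):

    qemu-system-x86_64 -machine q35,cxl=on -m 4G,maxmem=8G,slots=8 \
        -object memory-backend-ram,id=vmem0,share=on,size=256M \
        -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
        -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
        -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
        -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G

Seeing where your command line differs from that template would tell us a
lot.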

The only thing I'm surprised by is that the NUMA node appears without
requiring the CXL driver to create it.  It's possible I missed a QEMU
update that allows this.
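
If it's easy to grab, the output of a few commands from inside the guest
would also help show whether the region is being treated as plain RAM or
as a CXL-managed range (just suggestions, use whatever tooling the guest
image has):

    numactl --hardware        # which NUMA nodes exist and their sizes
    dmesg | grep -i cxl       # did the guest CXL driver touch anything?
    ls /sys/bus/cxl/devices/  # populated only if the cxl core bound devices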

~Gregory
