On Tue, Jan 2, 2024 at 5:04 AM David Hildenbrand <da...@redhat.com> wrote:
>
> On 01.01.24 08:53, Ho-Ren (Jack) Chuang wrote:
> > Introduce a new configuration option 'host-mem-type=' in the
> > '-object memory-backend-ram', allowing users to specify
> > from which type of memory to allocate.
> >
> > Users can specify 'cxlram' as an argument, and QEMU will then
> > automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> > For example:
> >       -object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \
> >       -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> >       -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> >       -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
> >       -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \
> >
>
> You can achieve the exact same thing already simply by using memory
> policies and detecting the node(s) before calling QEMU, no?

Yes, I agree this can be done with a memory policy that binds to the
CXL memory NUMA nodes on the host.
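For reference, a sketch of that approach using the existing backend
properties (the node number is an assumption; it would have to be
detected on the host before launching QEMU):

```shell
# Detect the CXL memory NUMA node on the host first, e.g. from
# /sys/bus/cxl or the output of "numactl -H". Assume it is node 1 here.
CXL_NODE=1

qemu-system-x86_64 ... \
    -object memory-backend-ram,id=vmem0,size=19G,host-nodes=$CXL_NODE,policy=bind \
    -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
    -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
    -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0
```

The 'host-mem-type=cxlram' option would essentially fold the node
detection step into QEMU itself.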

>
> There has to be a good reason to add such a shortcut into QEMU, and it
> should be spelled out here.

So our end goal here is to enable CXL memory in the guest VM and have
the guest kernel assign the CXL memory to the correct memory tier
(slow tier) in the Linux kernel's tiered memory system.

Here is what we observed:
* The kernel tiered memory system relies on the memory attributes
(read/write latency and bandwidth, read from ACPI) to distinguish the
fast tier from the slow tier.
* The kernel tiered memory system has two paths for placing memory
into a tier: 1) at mm subsystem init, in memory_tier_init(); 2) at
kmem driver device probe, in dev_dax_kmem_probe(). Since the ACPI
subsystem is initialized after mm, reading the memory attributes from
ACPI can only be done in 2). CXL memory therefore has to be presented
as a devdax device, which can then be probed by the kmem driver in
the guest and assigned to the slow tier.
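In the guest, that second path can be exercised with daxctl (a
sketch; the device name dax0.0 is an assumption):

```shell
# Bind the CXL devdax device to the kmem driver, hot-adding it as
# system RAM on its own NUMA node. This triggers dev_dax_kmem_probe().
daxctl reconfigure-device --mode=system-ram dax0.0

# With valid ACPI memory attributes, the new node should land in the
# slow tier. Inspect which nodes ended up in which tier:
cat /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
```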

We do see that QEMU has the "-numa hmat-lb" option to set the memory
attributes per VM NUMA node. The problem is that "-numa hmat-lb" only
covers NUMA nodes that exist during guest kernel initialization. A
CXL devdax device can only be created after kernel initialization,
and new NUMA nodes are created for the CXL devdax devices at that
point. The guest kernel therefore does not read the memory attributes
from "-numa hmat-lb" for the CXL devdax devices.
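For completeness, this is the kind of "-numa hmat-lb" configuration
we experimented with (node layout and latency/bandwidth values are
illustrative, not measurements):

```shell
qemu-system-x86_64 ... \
    -machine hmat=on \
    -numa node,nodeid=0,memdev=m0,cpus=0-3 \
    -numa node,nodeid=1,memdev=m1 \
    -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=300 \
    -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=100M
```

This populates the guest's ACPI HMAT for node 1, but only because
node 1 exists at boot; it does not help for nodes created later for
CXL devdax devices.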

So we thought that if we create an explicit CXL memory backend and
associate it with the virtual CXL type-3 frontend, we can pass the
CXL memory attributes from the host into the guest VM and avoid using
a memory policy and "-numa hmat-lb", thus simplifying the configuration.
We are still figuring out exactly how to pass the memory attributes
from the CXL backend into the VM. There is probably a solution to
"-numa hmat-lb" for the CXL devdax devices as well and we are also
looking into it.

>
> --
> Cheers,
>
> David / dhildenb
>