On Fri, 15 Sep 2023 16:19:29 +0200 Cédric Le Goater <c...@redhat.com> wrote:
> Hello Ankit,
> 
> On 9/15/23 04:45, ank...@nvidia.com wrote:
> > From: Ankit Agrawal <ank...@nvidia.com>
> >
> > For devices which allow the CPU to coherently cache their memory, it
> > is sensible to expose such memory as NUMA nodes separate from the
> > sysmem node. QEMU currently does not provide a mechanism to create
> > NUMA nodes associated with a vfio-pci device.
> >
> > Implement a mechanism to create and associate a set of unique NUMA
> > nodes with a vfio-pci device.
> >
> > The NUMA nodes are created by inserting a series of unique proximity
> > domains (PXM) in the VM SRAT ACPI table. The ACPI tables are read
> > once by the kernel at boot time to determine the NUMA configuration,
> > which cannot be changed afterwards; hence this feature is
> > incompatible with device hotplug. The node range associated with the
> > device is communicated through ACPI DSD and can be fetched by the VM
> > kernel or kernel modules. QEMU's VM SRAT and DSD builder code is
> > modified accordingly.
> >
> > New command line parameters are introduced to give the admin control
> > over the NUMA node assignment.
> 
> This approach seems to bypass the NUMA framework in place in QEMU and
> will be a challenge for the upper layers. QEMU is generally used from
> libvirt when dealing with KVM guests.
> 
> Typically, a command line for a virt machine with NUMA nodes would
> look like:
> 
>    -object memory-backend-ram,id=ram-node0,size=1G \
>    -numa node,nodeid=0,cpus=0-3,memdev=ram-node0 \
>    -object memory-backend-ram,id=ram-node1,size=1G \
>    -numa node,nodeid=1,memdev=ram-node1
> 
> which defines two nodes, one with memory and all CPUs and a second
> with only memory:
> 
>    # numactl -H
>    available: 2 nodes (0-1)
>    node 0 cpus: 0 1 2 3
>    node 0 size: 1003 MB
>    node 0 free: 734 MB
>    node 1 cpus:
>    node 1 size: 975 MB
>    node 1 free: 968 MB
>    node distances:
>    node   0   1
>      0:  10  20
>      1:  20  10
> 
> Could it be a new type of host memory backend? Have you considered
> this approach?

Good idea.  Fundamentally the device should not be creating NUMA
nodes; the VM should be configured with NUMA nodes and the device
memory associated with those nodes.

I think we're also dealing with a lot of very, very device-specific
behavior, so I question whether we shouldn't create a separate device
for this beyond vfio-pci or vfio-pci-nohotplug.  In particular, a PCI
device typically has an association to only a single proximity domain,
so what sense does it make to describe the coherent memory as a PCI
BAR only to then create a confusing mapping where the device has a
proximity domain separate from the resources associated with the
device?

It's seeming like this device should create memory objects that can be
associated as memory backing for command line specified NUMA nodes.

Thanks,
Alex
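
A minimal sketch of the direction discussed above, combining Cédric's
host-memory-backend idea with Alex's device-provided memory objects.
The memory-backend-vfio-device object type and its device= property
are hypothetical, invented here purely for illustration; only the
-device vfio-pci, -object, and -numa node options shown are standard
QEMU:

   # hypothetical: a memory-backend-vfio-device object wraps the
   # device's coherent memory so that a command-line-specified NUMA
   # node can reference it as its memdev
   -device vfio-pci,host=0000:01:00.0,id=dev0 \
   -object memory-backend-vfio-device,id=devmem0,device=dev0,size=16G \
   -numa node,nodeid=1,memdev=devmem0

In such a scheme the NUMA node itself stays under the control of the
existing -numa machinery, and the device only supplies the backing
memory, which would keep libvirt's existing NUMA modelling intact.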