On Tue, 24 Feb 2026 16:41:16 +0000
Jonathan Cameron via qemu development <[email protected]> wrote:

> On Tue, 24 Feb 2026 16:22:56 +0000
> Ankit Agrawal <[email protected]> wrote:
> 
> > >> Now the kernel parses them in the sequence of their occurrence. A
> > >> jumbled-up sequence thus results in a jumbled-up assignment.
> > >
> > > But what is the actual failure mode here? So the numa IDs are all in a
> > > weird order, what goes wrong from that?    
> > 
> > This interferes with the ability to replicate the host's numa distance
> > topology in the VM through the qemu command line.
> > 
> > E.g. consider a NUMA system with 2 sockets each with a GPU.
> > 0,1 are the node ids for the sysmem on socket 0,1 respectively and
> > 2,3 are the node ids for the GPU memory on socket 0,1 respectively
> > dist(0,2) = X
> > dist(0,3) = Y
> > 
> > If we try to replicate this for the VM by passing qemu arguments with
> > 4 numa nodes and assigning numa distances similar to the host's, and,
> > for the sake of example, qemu mixes things up by putting the GI for
> > node 3 ahead of the one for node 2, the SLIT still sets up the
> > distances based on the original order in the qemu command line:
> > https://github.com/qemu/qemu/blob/stable-10.2/hw/acpi/aml-build.c#L2040
> > 
> > This would lead to a different numa config in terms of distances within
> > the VM than the one intended through the qemu command line.
> 
> This is the case where I'd like to see an example of the tables before
> and after your patch.  If the SLIT is not correctly created with respect
> to PXMs (rather than the order of the commands) then we indeed have a
> QEMU bug that needs fixing.  However, I'm confused, as the SLIT should
> also not be ordered by the command line if, say, the command line was:
> 
>        -object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=3 \
>        -object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=4 \
>        -object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=6 \
>        -object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \
>        -object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=2 \
>        -object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \
>        -object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \
>        -object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9 \
> 
> and numa stuff was something like
>        -numa dist,src=3,dst=0,val=100
>        -numa dist,src=4,dst=0,val=200
>        -numa dist,src=5,dst=0,val=300
>        -numa dist,src=6,dst=0,val=100
>        -numa dist,src=7,dst=0,val=200
>        -numa dist,src=8,dst=0,val=300
>        -numa dist,src=9,dst=0,val=100
> 
> Then it should match the src numbers here to the node values in the GIs,
> whatever the order.

I had a mess around and it seems the SLIT is stable to the ordering of the
nodes (based on a very minimal test, so I may well be missing something!).
However, because /sys/bus/node/devices/nodeX/distance is reordered by the
PXM-to-kernel-numa-node mapping (which, as you've observed, is first come
first served when parsing GIs in new nodes), you will see an apparent
reordering that reflects the kernel numa node order.
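That first-come-first-served assignment can be sketched like this (a
minimal simulation, not kernel code; it ignores the memory nodes and just
feeds in the PXM values in the order the GIs appear on the example command
line above):

```shell
#!/usr/bin/env bash
# Sketch (not kernel code): hand out Linux numa node ids to PXMs in the
# order they are first encountered, mimicking first-come-first-served
# SRAT parsing.
assign_nodes() {
    local next=0 pxm
    declare -A map    # declare inside a function is local in bash
    for pxm in "$@"; do
        if [ -z "${map[$pxm]+x}" ]; then
            map[$pxm]=$next
            next=$((next + 1))
        fi
        echo "PXM $pxm -> node ${map[$pxm]}"
    done
}

# PXMs in the order the GIs appear on the example command line above.
assign_nodes 3 4 6 5 2 7 8 9
```

So PXM 3 lands on node 0, PXM 2 on node 4, and so on, which is why the
nodeX/distance rows appear shuffled relative to the command-line order.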

How do you associate the resulting numa node with a particular resource on
your GPU?  That mapping should also be by PXM, so I would expect it to
refer to the appropriate entry after the PXM-to-node translation in the
kernel, whatever order things under /sys/bus/node/devices/nodeX end up in.

For extra fun I put my CPUs and memory on different nodes, and they always
end up mapped to the first node in Linux (assuming they are all on one
node), with appropriate reordering of the nodeX/distance entries.
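For anyone wanting to reproduce this, the reordering is visible directly
in sysfs (standard Linux node paths; how many nodes show up depends on the
VM config, and the loop is a harmless no-op if there are none):

```shell
# Print each numa node's distance row from sysfs.
for d in /sys/bus/node/devices/node*/distance; do
    [ -r "$d" ] || continue   # skip silently if no numa nodes are exposed
    printf '%s: %s\n' "$(basename "$(dirname "$d")")" "$(cat "$d")"
done
```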

Jonathan


> 
> Thanks,
> 
> Jonathan
> 
> 
> > 
> > Thanks
> > Ankit Agrawal  
> 
> 

