On Tue, 24 Feb 2026 09:01:30 -0500 "Michael S. Tsirkin" <[email protected]> wrote:
> On Tue, Feb 24, 2026 at 09:51:06AM -0400, Jason Gunthorpe wrote:
> > On Mon, Feb 23, 2026 at 11:13:02AM +0000, Jonathan Cameron wrote:
> > >
> > > > > Ankit, can you give an example complete with table dumps please.
> > >
> > > I'm a little unsure on where things are getting scrambled.
> > > Everything should be keyed off PXM. Sounds like we have a bug
> > > somewhere but ordering shouldn't be relevant.
> >
> > I understood the issue is Linux assigns the uAPI visible NUMA node
> > numbers based on the ordering. The proximity/etc internal to the
> > kernel (I thought) was OK?
> >
> > Then the problem is that uAPI has developed meaning based on what the
> > bare metal HW does and now there are SW stacks that are expecting
> > these platforms to have certain NUMA IDs in the Linux uAPI. Sure you
> > can argue this is bad/etc/etc but the point of QEMU is to allow
> > creating VMs that closely match real HW and in this instance real HW
> > produces an ACPI table with a certain ordering and the SW is sensitive
> > to this ordering.
> >
> > Even if there is some Linux bug mis-parsing the ACPI, then that still
> > should be addressed from a qemu perspective by providing the ACPI
> > construction that doesn't trigger any bug so existing VM images will
> > work under qemu.
> >
> > Thus qemu needs a way to reflect the ordering on the command line to
> > properly emulate this system and accommodate the existing VM software...
> >
> > Jason
>
> Not arguing against this, but if there's a linux bug it is important
> to fix it as a 1st step, qemu workarounds for broken guests
> notwithstanding. Then we can check how long the uapi has been around,
> how practical a bugfix backport in linux is, and decide on whether
> a host side workaround is worth it.

IIRC NUMA IDs in linux aren't even consistent across architectures. I
think there are cases where the x86 code handles certain SRAT entries
earlier than the arm64 code does (it was either CPU-less or memoryless
nodes, if my memory is right).
I haven't poked this stuff for a while though, so maybe those
differences got ironed out. Anyhow, relying on those numbers being
stable is optimistic at best.

Longer term I'd like to see the ACPI spec comprehend this case where
lots of GIs are the same device, and hence add some way of
distinguishing between them that isn't the PXM.

I'd not be against having a kernel patch that sorted at least GI-only
nodes by PXM rather than by order in the ACPI table (probably GP-only
ones as well, though they aren't as visible anyway). It may be
controversial! With that in place it might also make sense to make
qemu stop generating them in a random order (so, what this patch is
doing).

Note I did a very similar thing for CXL fixed memory windows, but that
was to maintain consistency when I made the fixed memory windows into
devices; before that we had them in a list built from the command
line. Without it we got breakage in the bios tests, and the physical
addresses shuffled in a fashion that depended on a hash that in theory
might change at any time. In that case I didn't have an explicit list,
but instead stashed an index parameter in the object and built
temporary lists for sorting purposes.

https://lore.kernel.org/all/[email protected]/

Jonathan
