Hi,

This RFC series implements FORM2 NUMA associativity support in the
pSeries machine. This new associativity format is going to be added to
the LOPAR spec in the near future. For now, a preview of the
specification can be found in Aneesh's kernel side patches that
implement this support, especially the documentation patch [2].

For QEMU, the most drastic change FORM2 brings is that, at long last,
we're free from the shackles of an overcomplicated and bloated way of
calculating NUMA distances. This new affinity format promotes the
separation of performance metrics such as distance, latency, bandwidth
and so on from the ibm,associativity arrays of the devices. This also
allows for asymmetric NUMA configurations.

FORM2 is set by ibm,architecture-vec-5 bit 2 byte 5. This means that
the guest is able to choose between FORM1 and FORM2 during CAS, and we
need to adapt the NUMA internals accordingly based on this choice.

Patches 1 to 5 implement the base FORM2 support in the pSeries machine.

Patches 6-8 deal with NVDIMM changes. FORM2 allows NVDIMMs to declare
an extra NUMA node called 'device-node' to support their use as
persistent memory. 'device-node' is locality based and can be different
from the NUMA node that the NVDIMM belongs to when used as regular
memory.

With this series and Aneesh's guest kernel from [1], this is the
'numactl -H' output of this guest:

-----

sudo ppc64-softmmu/qemu-system-ppc64 \
-machine pseries,accel=kvm,usb=off,dump-guest-core=off \
-m size=14G,slots=256,maxmem=256G -smp 8,maxcpus=8,cores=2,threads=2,sockets=2 \
(...)
-object memory-backend-ram,id=mem0,size=4G -numa node,memdev=mem0,cpus=0-1,nodeid=0 \
-object memory-backend-ram,id=mem1,size=4G -numa node,memdev=mem1,cpus=2-3,nodeid=1 \
-object memory-backend-ram,id=mem2,size=4G -numa node,memdev=mem2,cpus=4-5,nodeid=2 \
-object memory-backend-ram,id=mem3,size=2G -numa node,memdev=mem3,cpus=6-7,nodeid=3 \
-numa dist,src=0,dst=1,val=22 -numa dist,src=0,dst=2,val=22 -numa dist,src=0,dst=3,val=22 \
-numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=44 -numa dist,src=1,dst=3,val=44 \
-numa dist,src=2,dst=0,val=66 -numa dist,src=2,dst=1,val=66 -numa dist,src=2,dst=3,val=66 \
-numa dist,src=3,dst=0,val=88 -numa dist,src=3,dst=1,val=88 -numa dist,src=3,dst=2,val=88

# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1
node 0 size: 3987 MB
node 0 free: 3394 MB
node 1 cpus: 2 3
node 1 size: 4090 MB
node 1 free: 4073 MB
node 2 cpus: 4 5
node 2 size: 4090 MB
node 2 free: 4072 MB
node 3 cpus: 6 7
node 3 size: 2027 MB
node 3 free: 2012 MB
node distances:
node   0   1   2   3
  0:  10  22  22  22
  1:  44  10  44  44
  2:  66  66  10  66
  3:  88  88  88  10

The exact user NUMA distances were reflected in the kernel, without any
approximation like we have to do for FORM1.

[1] https://lore.kernel.org/linuxppc-dev/20210614164003.196094-1-aneesh.ku...@linux.ibm.com/
[2] https://lore.kernel.org/linuxppc-dev/20210614164003.196094-8-aneesh.ku...@linux.ibm.com/

Daniel Henrique Barboza (8):
  spapr: move NUMA data init to do_client_architecture_support()
  spapr_numa.c: split FORM1 code into helpers
  spapr_numa.c: wait for CAS before writing rtas DT
  spapr_numa.c: base FORM2 NUMA affinity support
  spapr: simplify spapr_numa_associativity_init params
  nvdimm: add PPC64 'device-node' property
  spapr_numa, spapar_nvdimm: write secondary NUMA domain for nvdimms
  spapr: move memory/cpu less check to spapr_numa_FORM1_affinity_init()

 hw/mem/nvdimm.c             |  28 ++++
 hw/ppc/spapr.c              |  53 +++-----
 hw/ppc/spapr_hcall.c        |   4 +
 hw/ppc/spapr_numa.c         | 250 +++++++++++++++++++++++++++++++++---
 hw/ppc/spapr_nvdimm.c       |   3 +-
 include/hw/mem/nvdimm.h     |  12 ++
 include/hw/ppc/spapr_numa.h |   6 +-
 include/hw/ppc/spapr_ovec.h |   1 +
 8 files changed, 299 insertions(+), 58 deletions(-)

-- 
2.31.1