On 3/12/21 6:53 AM, Cédric Le Goater wrote:
On 3/12/21 2:55 AM, David Gibson wrote:
On Tue, 9 Mar 2021 18:26:35 +0100
Cédric Le Goater <c...@kaod.org> wrote:

On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote:


On 3/9/21 12:33 PM, Cédric Le Goater wrote:
On 3/8/21 6:13 PM, Greg Kurz wrote:
On Wed, 3 Mar 2021 18:48:50 +0100
Cédric Le Goater <c...@kaod.org> wrote:
The 'chip_id' field of the XIVE CPU structure is used to choose a
target for a source located on the same chip when possible. This field
is assigned on the PowerNV platform using the "ibm,chip-id" property
on pSeries under KVM when NUMA nodes are defined but it is undefined

This sentence seems to have a syntax problem... like it is missing an
'and' before 'on pSeries'.

ah yes, or simply a comma.
under PowerVM. The XIVE source structure has a similar field
'src_chip' which is only assigned on the PowerNV platform.

cpu_to_node() returns a compatible value on all platforms, 0 being the
default node. It will also give us the opportunity to set the affinity
of a source on pSeries when we can localize them.

IIUC this relies on the fact that the NUMA node id is == to chip id
on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable
with this change.

Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall
H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
in Cc:)
  [...]

On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
the node id. This value is built from the chip id in OPAL, so the
value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
property are unlikely to be different.

cpu_to_node(cpu) is used in many places to allocate the structures
locally to the owning node. XIVE is not an exception (see below in the
same patch), it is better to be consistent and get the same information
(node id) using the same routine.


In Linux, "ibm,chip-id" is only used in low level PowerNV drivers :
LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
unifies the controllers of the system to only expose one the OS. This
is problematic and should be changed but it's another topic.

On the other hand, you have the pSeries case under PowerVM that
doesn't xc->chip_id, which isn't passed to any hcall AFAICT.

yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning
under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid
chip id.

QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
always correct btw)


If you have a way to reliably reproduce this, let me know and I'll fix it
up in QEMU.

with :

    -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G 
-numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object 
memory-backend-ram,id=ram-node1,size=2G -numa 
node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1

# dmesg | grep numa
[    0.013106] numa: Node 0 CPUs: 0-1
[    0.013136] numa: Node 1 CPUs: 2-3

# dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
                ibm,chip-id = <0x01>;
                ibm,chip-id = <0x02>;
                ibm,chip-id = <0x00>;
                ibm,chip-id = <0x03>;

with :

   -smp 4,cores=4,maxcpus=8,threads=1 -object 
memory-backend-ram,id=ram-node0,size=2G -numa 
node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object 
memory-backend-ram,id=ram-node1,size=2G -numa 
node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1

# dmesg | grep numa
[    0.013106] numa: Node 0 CPUs: 0-1
[    0.013136] numa: Node 1 CPUs: 2-3

# dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
                ibm,chip-id = <0x00>;
                ibm,chip-id = <0x00>;
                ibm,chip-id = <0x00>;
                ibm,chip-id = <0x00>;

I think we should simply remove "ibm,chip-id" since it's not used and
not in the PAPR spec.

As I mentioned to Daniel on our call this morning, oddly it *does*
appear to be used in the RHEL kernel, even though that's 4.18 based.
This patch seems to have caused a minor regression; not in the
identification of NUMA nodes, but in the number of sockets shown be
lscpu, etc.  See https://bugzilla.redhat.com/show_bug.cgi?id=1934421
for more information.

Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we
remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 :

    [root@localhost ~]# lscpu
    Architecture:        ppc64le
    Byte Order:          Little Endian
    CPU(s):              128
    On-line CPU(s) list: 0-127
    Thread(s) per core:  4
    Core(s) per socket:  16
    Socket(s):           2
    NUMA node(s):        2
    Model:               2.2 (pvr 004e 1202)
    Model name:          POWER9 (architected), altivec supported
    Hypervisor vendor:   KVM
    Virtualization type: para
    L1d cache:           32K
    L1i cache:           32K
    NUMA node0 CPU(s):   0-63
    NUMA node1 CPU(s):   64-127

    [root@localhost ~]# grep . 
/sys/devices/system/cpu/*/topology/physical_package_id
    /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1
    /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1
    /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1
    /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1
    /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1
    ....

"ibm,chip-id" is still being used on some occasion on pSeries machines.
This is wrong :/ The problem is :

   #define topology_physical_package_id(cpu)      (cpu_to_chip_id(cpu))

We should be using cpu_to_node().


IIUC the "real fix" then is this change you mentioned above, together with
this xive patch as well, to stop using ibm,chip-id for good in the pserie
 kernel. With these changes QEMU can remove 'ibm,chip-id' from the pseries
machine without impact. Is this correct?

If that's the case, then I believe it's ok to go forward with the QEMU side
change (just for 6.0.0 and newer machines). Or should I wait for the kernel
changes to be merged upstream first?


Thanks,


DHB



C.


Since the value was used by some PAPR kernels - even if they shouldn't
have - I think we should only remove this for newer machine types.  We
also need to check what we're not supplying that the guest kernel is
showing a different number of sockets than specified on the qemu
command line.


Thanks,

C.

  [...]
  [...]
  [...]
  [...]
  [...]
  [...]
  [...]
  [...]
  [...]





Reply via email to