Re: [RFC PATCH] Disable Book-E KVM support?

2022-12-02 Thread Daniel Henrique Barboza




On 11/30/22 17:45, Crystal Wood wrote:

On Mon, 2022-11-28 at 14:36 +1000, Nicholas Piggin wrote:

BookE KVM is in a deep maintenance state, and I'm not sure how much testing
it gets. I don't have a test setup, and it does not look like QEMU has
any HV architecture enabled. It hasn't been too painful, but there are
some cases where not being able to test causes a bit of a problem, e.g.,

https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-November/251452.html

Time to begin the removal process, or are there still people using it? I'm
happy to keep making occasional patches to try to keep it going if
there are people testing upstream. Getting HV support into QEMU would
help with long-term support; I'm not sure how big of a job that would be.


Not sure what you mean about QEMU not having e500 HV support?  I don't know if
it's bitrotted, but it's there.


AFAIK all QEMU ppc boards, aside from pSeries and the Mac ones, are always used
in emulated mode, in a use case similar to what Bernhard described in his reply
(run on x86 due to lack of ppc hardware).

I have no experience with e500 KVM support in QEMU since I never attempted to
use it. But yes, it is present, though poorly tested - if tested at all. And the
reason there's no push on our side to remove it from QEMU is that its code is so
entwined with pSeries KVM that it would take too much effort.

Do not take the presence of e500 KVM support in QEMU as a blocker to disabling
it in the kernel. As far as current QEMU usage goes, e500 KVM can be removed
without too much drama from our side.


Cedric, do you have any opinions about it?



Daniel




I don't know whether anyone is still using this, but if they are, it's
probably e500mc and not e500v2 (which involved a bunch of hacks to get almost-
sorta-usable performance out of hardware not designed for virtualization).  I
do see that there have been a few recent patches on QEMU e500 (beyond the
treewide cleanup type stuff), though I don't know if they're using KVM.  CCing
them and the QEMU list.

I have an e6500 I could occasionally test on, if it turns out people do still
care about this.  Don't count me as the use case, though. :-)

FWIW, as far as the RECONCILE_IRQ_STATE issue goes, that used to be done in
kvmppc_handle_exit(), but was moved in commit 9bd880a2c882 to be "cleaner and
faster". :-P

-Crystal




Re: [PATCH v2 0/4] pseries-wdt: initial support for H_WATCHDOG-based watchdog timers

2022-06-17 Thread Daniel Henrique Barboza




On 6/2/22 14:53, Scott Cheloha wrote:

PAPR v2.12 defines a new hypercall, H_WATCHDOG.  This patch series
adds support for this hypercall to powerpc/pseries kernels and
introduces a new watchdog driver, "pseries-wdt", for the virtual
timers exposed by the hypercall.

This series is preceded by the following:

RFC v1: 
https://lore.kernel.org/linux-watchdog/20220413165104.179144-1-chel...@linux.ibm.com/
RFC v2: 
https://lore.kernel.org/linux-watchdog/20220509174357.5448-1-chel...@linux.ibm.com/
PATCH v1: 
https://lore.kernel.org/linux-watchdog/20220520183552.33426-1-chel...@linux.ibm.com/


Tested successfully with mainline QEMU plus Alexey's new h_watchdog patches in
https://patchwork.ozlabs.org/project/qemu-ppc/list/?series=305226.


All patches of this series:

Tested-by: Daniel Henrique Barboza 


Thanks,


Daniel



Changes of note from PATCH v1:

- Trim down the large comment documenting the H_WATCHDOG hypercall.
   The comment is likely to rot, so remove anything we aren't using
   and anything overly obvious.

- Remove any preprocessor definitions not actually used in the module
   right now.  If we want to use other features offered by the hypercall
   we can add them in later.  They're just clutter until then.

- Simplify the "action" module parameter.  The value is now an index
   into an array of possible timeoutAction values.  This design removes
   the need for the custom get/set methods used in PATCH v1.

   Now we merely need to check that the "action" value is a valid
   index during pseries_wdt_probe().  Easy.

- Make the timeoutAction a member of pseries_wdt, "action".  This
   eliminates the use of a global variable during pseries_wdt_start().

- Use watchdog_init_timeout() idiomatically.  Check its return value
   and error out of pseries_wdt_probe() if it fails.




Re: [PATCH v2 0/4] pseries-wdt: initial support for H_WATCHDOG-based watchdog timers

2022-06-16 Thread Daniel Henrique Barboza

Hi,

Update: checking 'dmesg' more carefully, I found that the module probe is
failing with the following message:

[  186.298424][  T811] pseries-wdt: probe of pseries-wdt.0 failed with error -5

This failure is consistent: if I remove the module and modprobe it again, the
same error happens.

The message is being thrown by pseries_wdt_probe() (patch 4/4). Back in QEMU, in
Alexey's H_WATCHDOG implementation [1], I see that h_watchdog is returning
H_PARAMETER because the retrieved 'watchdogNumber' is zero.

Also, the pseries_wdt module still appears in 'lsmod' despite this probe error.
Not sure if this is a bug:

[root@fedora ~]# rmmod pseries_wdt
[root@fedora ~]# modprobe pseries_wdt
[ 1792.846769][  T865] pseries-wdt: probe of pseries-wdt.0 failed with error -5
[root@fedora ~]# lsmod | grep pseries
pseries_wdt   262144  0
[root@fedora ~]#


For reference this is all the output of 'lsmod' in the guest:

[root@fedora ~]# lsmod
Module  Size  Used by
pseries_wdt   262144  0
nfnetlink 262144  1
evdev 327680  1
input_leds262144  0
led_class 262144  1 input_leds
fuse  458752  1
xfs  1835008  2
libcrc32c 262144  1 xfs
virtio_scsi   327680  2
virtio_pci262144  0
virtio327680  2 virtio_scsi,virtio_pci
vmx_crypto262144  0
gf128mul  262144  1 vmx_crypto
crc32c_vpmsum 327680  1
virtio_ring   327680  3 virtio_scsi,virtio_pci,virtio
virtio_pci_legacy_dev   262144  1 virtio_pci
virtio_pci_modern_dev   262144  1 virtio_pci
autofs4   327680  2


Given that the error is being thrown by Alexey's QEMU code, I'll bring it up in
the QEMU mailing list in [1] and follow up from there.


[1] 
https://patchwork.ozlabs.org/project/qemu-ppc/patch/20220608030153.1862335-1-...@ozlabs.ru/


Thanks,


Daniel


On 6/15/22 22:43, Daniel Henrique Barboza wrote:

Hi,

I tried this series out with mainline QEMU built with Alexey's patch [1]
and I wasn't able to get it to work. I'm using a simple QEMU command line
booting a fedora36 guest in a Power9 boston host:

sudo ./qemu-system-ppc64 \
-M pseries,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off,ic-mode=dual \
-m 4G -accel kvm -cpu POWER9 -smp 1,maxcpus=1,threads=1,cores=1,sockets=1 \
-device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x2 \
-drive file=/home/danielhb/fedora36.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none \
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=2 \
-device qemu-xhci,id=usb,bus=pci.0,addr=0x4 -nographic -display none


Guest is running v5.19-rc2 with this series applied. Kernel config consists of
'pseries_le_defconfig' plus the following 'watchdog' related changes:

[root@fedora ~]# cat linux/.config | grep PSERIES_WDT
CONFIG_PSERIES_WDT=y

[root@fedora ~]# cat linux/.config | grep -i watchdog
CONFIG_PPC_WATCHDOG=y
CONFIG_HAVE_NMI_WATCHDOG=y
CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_CORE=y
CONFIG_WATCHDOG_NOWAYOUT=y
CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED=y
CONFIG_WATCHDOG_OPEN_TIMEOUT=0
# CONFIG_WATCHDOG_SYSFS is not set
# CONFIG_WATCHDOG_HRTIMER_PRETIMEOUT is not set
# Watchdog Pretimeout Governors
# CONFIG_WATCHDOG_PRETIMEOUT_GOV is not set
# Watchdog Device Drivers
# CONFIG_SOFT_WATCHDOG is not set
# CONFIG_XILINX_WATCHDOG is not set
# CONFIG_ZIIRAVE_WATCHDOG is not set
# CONFIG_CADENCE_WATCHDOG is not set
# CONFIG_DW_WATCHDOG is not set
# CONFIG_MAX63XX_WATCHDOG is not set
CONFIG_WATCHDOG_RTAS=y
# PCI-based Watchdog Cards
# CONFIG_PCIPCWATCHDOG is not set
# USB-based Watchdog Cards
# CONFIG_USBPCWATCHDOG is not set
# CONFIG_WQ_WATCHDOG is not set
[root@fedora ~]#



Kernel command line:

[root@fedora ~]# cat /proc/cmdline
BOOT_IMAGE=(ieee1275/disk,msdos2)/vmlinuz-5.19.0-rc2-00054-g12ede8ffb103 \
root=/dev/mapper/fedora_fedora-root ro rd.lvm.lv=fedora_fedora/root \
pseries-wdt.timeout=60 pseries-wdt.nowayout=1 pseries-wdt.action=2


With all that, executing

echo V > /dev/watchdog0

Does nothing: dmesg is clean and the guest doesn't reboot after the 60-second
timeout. I also tried with PSERIES_WDT compiled as a module instead of
built-in. Same results.


What am I missing?


[1] 
https://patchwork.ozlabs.org/project/qemu-ppc/patch/20220608030153.1862335-1-...@ozlabs.ru/



Thanks,


Daniel




On 6/2/22 14:53, Scott Cheloha wrote:

PAPR v2.12 defines a new hypercall, H_WATCHDOG.  This patch series
adds support for this hypercall to powerpc/pseries kernels and
introduces a new watchdog driver, "pseries-wdt", for the virtual
timers exposed by the hypercall.

This series is preceded by the following:

RFC v1: 
https://lore.kernel.org/linux-watchdog/20220413165104.179144-1-chel...@linux.ibm.com/
RFC v2: 
https://lore.kernel.org/linux-watchdog/20220509174357.5448-1-chel...@linux.ibm.com/
PATCH v1: 
https://lore.kernel.org/linux-watchdog/20220520183552

Re: [PATCH v2 0/4] pseries-wdt: initial support for H_WATCHDOG-based watchdog timers

2022-06-16 Thread Daniel Henrique Barboza




On 6/16/22 13:44, Tyrel Datwyler wrote:

On 6/15/22 18:43, Daniel Henrique Barboza wrote:

Hi,

I tried this series out with mainline QEMU built with Alexey's patch [1]
and I wasn't able to get it to work. I'm using a simple QEMU command line
booting a fedora36 guest in a Power9 boston host:


I would assume the H_WATCHDOG hypercall is not implemented by the qemu pseries
machine type. It is a very new hypercall.


Alexey implemented QEMU support for this hypercall here:

"[qemu] ppc/spapr: Implement H_WATCHDOG"
https://patchwork.ozlabs.org/project/qemu-ppc/patch/20220608030153.1862335-1-...@ozlabs.ru/

It is under review in the QEMU mailing list. I tried it out with this series
and wasn't able to get it to work.


Thanks,

Daniel




-Tyrel



sudo ./qemu-system-ppc64 \
-M pseries,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off,ic-mode=dual \
-m 4G -accel kvm -cpu POWER9 -smp 1,maxcpus=1,threads=1,cores=1,sockets=1 \
-device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x2 \
-drive file=/home/danielhb/fedora36.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none \
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=2 \
-device qemu-xhci,id=usb,bus=pci.0,addr=0x4 -nographic -display none


Guest is running v5.19-rc2 with this series applied. Kernel config consists of
'pseries_le_defconfig' plus the following 'watchdog' related changes:

[root@fedora ~]# cat linux/.config | grep PSERIES_WDT
CONFIG_PSERIES_WDT=y

[root@fedora ~]# cat linux/.config | grep -i watchdog
CONFIG_PPC_WATCHDOG=y
CONFIG_HAVE_NMI_WATCHDOG=y
CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_CORE=y
CONFIG_WATCHDOG_NOWAYOUT=y
CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED=y
CONFIG_WATCHDOG_OPEN_TIMEOUT=0
# CONFIG_WATCHDOG_SYSFS is not set
# CONFIG_WATCHDOG_HRTIMER_PRETIMEOUT is not set
# Watchdog Pretimeout Governors
# CONFIG_WATCHDOG_PRETIMEOUT_GOV is not set
# Watchdog Device Drivers
# CONFIG_SOFT_WATCHDOG is not set
# CONFIG_XILINX_WATCHDOG is not set
# CONFIG_ZIIRAVE_WATCHDOG is not set
# CONFIG_CADENCE_WATCHDOG is not set
# CONFIG_DW_WATCHDOG is not set
# CONFIG_MAX63XX_WATCHDOG is not set
CONFIG_WATCHDOG_RTAS=y
# PCI-based Watchdog Cards
# CONFIG_PCIPCWATCHDOG is not set
# USB-based Watchdog Cards
# CONFIG_USBPCWATCHDOG is not set
# CONFIG_WQ_WATCHDOG is not set
[root@fedora ~]#



Kernel command line:

[root@fedora ~]# cat /proc/cmdline
BOOT_IMAGE=(ieee1275/disk,msdos2)/vmlinuz-5.19.0-rc2-00054-g12ede8ffb103 \
root=/dev/mapper/fedora_fedora-root ro rd.lvm.lv=fedora_fedora/root \
pseries-wdt.timeout=60 pseries-wdt.nowayout=1 pseries-wdt.action=2


With all that, executing

echo V > /dev/watchdog0

Does nothing: dmesg is clean and the guest doesn't reboot after the 60-second
timeout. I also tried with PSERIES_WDT compiled as a module instead of
built-in. Same results.


What am I missing?


[1]
https://patchwork.ozlabs.org/project/qemu-ppc/patch/20220608030153.1862335-1-...@ozlabs.ru/




Thanks,


Daniel




On 6/2/22 14:53, Scott Cheloha wrote:

PAPR v2.12 defines a new hypercall, H_WATCHDOG.  This patch series
adds support for this hypercall to powerpc/pseries kernels and
introduces a new watchdog driver, "pseries-wdt", for the virtual
timers exposed by the hypercall.

This series is preceded by the following:

RFC v1:
https://lore.kernel.org/linux-watchdog/20220413165104.179144-1-chel...@linux.ibm.com/

RFC v2:
https://lore.kernel.org/linux-watchdog/20220509174357.5448-1-chel...@linux.ibm.com/

PATCH v1:
https://lore.kernel.org/linux-watchdog/20220520183552.33426-1-chel...@linux.ibm.com/


Changes of note from PATCH v1:

- Trim down the large comment documenting the H_WATCHDOG hypercall.
    The comment is likely to rot, so remove anything we aren't using
    and anything overly obvious.

- Remove any preprocessor definitions not actually used in the module
    right now.  If we want to use other features offered by the hypercall
    we can add them in later.  They're just clutter until then.

- Simplify the "action" module parameter.  The value is now an index
    into an array of possible timeoutAction values.  This design removes
    the need for the custom get/set methods used in PATCH v1.

    Now we merely need to check that the "action" value is a valid
    index during pseries_wdt_probe().  Easy.

- Make the timeoutAction a member of pseries_wdt, "action".  This
    eliminates the use of a global variable during pseries_wdt_start().

- Use watchdog_init_timeout() idiomatically.  Check its return value
    and error out of pseries_wdt_probe() if it fails.






Re: [PATCH v2 0/4] pseries-wdt: initial support for H_WATCHDOG-based watchdog timers

2022-06-15 Thread Daniel Henrique Barboza

Hi,

I tried this series out with mainline QEMU built with Alexey's patch [1]
and I wasn't able to get it to work. I'm using a simple QEMU command line
booting a fedora36 guest in a Power9 boston host:

sudo ./qemu-system-ppc64 \
-M pseries,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off,ic-mode=dual \
-m 4G -accel kvm -cpu POWER9 -smp 1,maxcpus=1,threads=1,cores=1,sockets=1 \
-device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x2 \
-drive file=/home/danielhb/fedora36.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none \
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=2 \
-device qemu-xhci,id=usb,bus=pci.0,addr=0x4 -nographic -display none


Guest is running v5.19-rc2 with this series applied. Kernel config consists of
'pseries_le_defconfig' plus the following 'watchdog' related changes:

[root@fedora ~]# cat linux/.config | grep PSERIES_WDT
CONFIG_PSERIES_WDT=y

[root@fedora ~]# cat linux/.config | grep -i watchdog
CONFIG_PPC_WATCHDOG=y
CONFIG_HAVE_NMI_WATCHDOG=y
CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_CORE=y
CONFIG_WATCHDOG_NOWAYOUT=y
CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED=y
CONFIG_WATCHDOG_OPEN_TIMEOUT=0
# CONFIG_WATCHDOG_SYSFS is not set
# CONFIG_WATCHDOG_HRTIMER_PRETIMEOUT is not set
# Watchdog Pretimeout Governors
# CONFIG_WATCHDOG_PRETIMEOUT_GOV is not set
# Watchdog Device Drivers
# CONFIG_SOFT_WATCHDOG is not set
# CONFIG_XILINX_WATCHDOG is not set
# CONFIG_ZIIRAVE_WATCHDOG is not set
# CONFIG_CADENCE_WATCHDOG is not set
# CONFIG_DW_WATCHDOG is not set
# CONFIG_MAX63XX_WATCHDOG is not set
CONFIG_WATCHDOG_RTAS=y
# PCI-based Watchdog Cards
# CONFIG_PCIPCWATCHDOG is not set
# USB-based Watchdog Cards
# CONFIG_USBPCWATCHDOG is not set
# CONFIG_WQ_WATCHDOG is not set
[root@fedora ~]#



Kernel command line:

[root@fedora ~]# cat /proc/cmdline
BOOT_IMAGE=(ieee1275/disk,msdos2)/vmlinuz-5.19.0-rc2-00054-g12ede8ffb103 \
root=/dev/mapper/fedora_fedora-root ro rd.lvm.lv=fedora_fedora/root \
pseries-wdt.timeout=60 pseries-wdt.nowayout=1 pseries-wdt.action=2


With all that, executing

echo V > /dev/watchdog0

Does nothing: dmesg is clean and the guest doesn't reboot after the 60-second
timeout. I also tried with PSERIES_WDT compiled as a module instead of
built-in. Same results.


What am I missing?


[1] 
https://patchwork.ozlabs.org/project/qemu-ppc/patch/20220608030153.1862335-1-...@ozlabs.ru/



Thanks,


Daniel




On 6/2/22 14:53, Scott Cheloha wrote:

PAPR v2.12 defines a new hypercall, H_WATCHDOG.  This patch series
adds support for this hypercall to powerpc/pseries kernels and
introduces a new watchdog driver, "pseries-wdt", for the virtual
timers exposed by the hypercall.

This series is preceded by the following:

RFC v1: 
https://lore.kernel.org/linux-watchdog/20220413165104.179144-1-chel...@linux.ibm.com/
RFC v2: 
https://lore.kernel.org/linux-watchdog/20220509174357.5448-1-chel...@linux.ibm.com/
PATCH v1: 
https://lore.kernel.org/linux-watchdog/20220520183552.33426-1-chel...@linux.ibm.com/

Changes of note from PATCH v1:

- Trim down the large comment documenting the H_WATCHDOG hypercall.
   The comment is likely to rot, so remove anything we aren't using
   and anything overly obvious.

- Remove any preprocessor definitions not actually used in the module
   right now.  If we want to use other features offered by the hypercall
   we can add them in later.  They're just clutter until then.

- Simplify the "action" module parameter.  The value is now an index
   into an array of possible timeoutAction values.  This design removes
   the need for the custom get/set methods used in PATCH v1.

   Now we merely need to check that the "action" value is a valid
   index during pseries_wdt_probe().  Easy.

- Make the timeoutAction a member of pseries_wdt, "action".  This
   eliminates the use of a global variable during pseries_wdt_start().

- Use watchdog_init_timeout() idiomatically.  Check its return value
   and error out of pseries_wdt_probe() if it fails.




[PATCH] powerpc/mm/numa: skip NUMA_NO_NODE onlining in parse_numa_properties()

2022-02-24 Thread Daniel Henrique Barboza
Executing node_set_online() when nid = NUMA_NO_NODE results in undefined
behavior: node_set_online() calls node_set_state(), which goes through
__node_set() into set_bit(), and since NUMA_NO_NODE is -1 we end up
doing a negative shift operation inside
arch/powerpc/include/asm/bitops.h. This potential UB was detected
running a kernel with CONFIG_UBSAN.

The behavior was introduced by commit 10f78fd0dabb ("powerpc/numa: Fix a
regression on memoryless node 0"), where the check for nid > 0 was
removed to fix a problem that was happening with nid = 0, but the result
is that now we're trying to online NUMA_NO_NODE nids as well.

Checking for nid >= 0 will allow node 0 to be onlined while avoiding
this UB with NUMA_NO_NODE.

Reported-by: Ping Fang 
Cc: Diego Domingos 
Cc: Aneesh Kumar K.V 
Cc: Srikar Dronamraju 
Fixes: 10f78fd0dabb ("powerpc/numa: Fix a regression on memoryless node 0")
Signed-off-by: Daniel Henrique Barboza 
---
 arch/powerpc/mm/numa.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 9d5f710d2c20..b9b7fefbb64b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -956,7 +956,9 @@ static int __init parse_numa_properties(void)
of_node_put(cpu);
}
 
-   node_set_online(nid);
+   /* node_set_online() is UB if 'nid' is negative */
+   if (likely(nid >= 0))
+   node_set_online(nid);
}
 
get_n_mem_cells(&n_mem_addr_cells, &n_mem_size_cells);
-- 
2.35.1



Re: [PATCH v2 4/4] powerpc/pseries/cpuhp: remove obsolete comment from pseries_cpu_die

2021-09-28 Thread Daniel Henrique Barboza




On 9/27/21 17:19, Nathan Lynch wrote:

This comment likely refers to the obsolete DLPAR workflow where some
resource state transitions were driven more directly from user space
utilities, but it also seems to contradict itself: "Change isolate state to
Isolate [...]" is at odds with the preceding sentences, and it does not
relate at all to the code that follows.



This comment was added by commit 413f7c405a34, a 2006 commit where Mike
Ellerman moved code from platform/pseries/smp.c into hotplug-cpu.c.

I checked the original code back in smp.c, and the comment was added there
by commit 1da177e4c3f41, which is Linus's initial git commit, where he mentions
that he didn't bother with full history (although it is available somewhere,
allegedly).

This is enough to say that we can't easily see the history behind this comment.
I also believe that we're better off without it, since it doesn't make sense
with the current codebase.


Reviewed-by: Daniel Henrique Barboza 



Remove it to prevent confusion.

Signed-off-by: Nathan Lynch 
---
  arch/powerpc/platforms/pseries/hotplug-cpu.c | 5 -
  1 file changed, 5 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index b50f3e9aa259..5ab44600c8d3 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -137,11 +137,6 @@ static void pseries_cpu_die(unsigned int cpu)
cpu, pcpu);
}
  
-	/* Isolation and deallocation are definitely done by
-	 * drslot_chrp_cpu.  If they were not they would be
-	 * done here.  Change isolate state to Isolate and
-	 * change allocation-state to Unusable.
-	 */
paca_ptrs[cpu]->cpu_start = 0;
  }
  



Re: [PATCH 3/3] powerpc/pseries/cpuhp: delete add/remove_by_count code

2021-09-20 Thread Daniel Henrique Barboza




On 9/20/21 10:55, Nathan Lynch wrote:

The core DLPAR code supports two actions (add and remove) and three
subtypes of action:

* By DRC index: the action is attempted on a single specified resource.
   This is the usual case for processors.
* By indexed count: the action is attempted on a range of resources
   beginning at the specified index. This is implemented only by the memory
   DLPAR code.
* By count: the lower layer (CPU or memory) is responsible for locating the
   specified number of resources to which the action can be applied.

I cannot find any evidence of the "by count" subtype being used by drmgr or
qemu for processors. And when I try to exercise this code, the add case
does not work:



Just to clarify: did you check both the CPU and memory cases, and find that the
'by count' subtype isn't used with CPUs, but that drmgr has some cases in which
'by count' is used with LMBs?

I'm asking because I worked with part of the LMB removal code a few months ago,
and got stuck in a situation in which the 'by count' and 'by indexed count'
paths are similar enough to feel repetitive, but distinct enough to not be
easily reduced into a single function. If drmgr isn't using the 'by count'
subtype for LMBs, that would be a good chance for more code reduction.




   $ ppc64_cpu --smt ; nproc
   SMT=8
   24
   $ printf "cpu remove count 2" > /sys/kernel/dlpar
   $ nproc
   8
   $ printf "cpu add count 2" > /sys/kernel/dlpar
   -bash: printf: write error: Invalid argument
   $ dmesg | tail -2
   pseries-hotplug-cpu: Failed to find enough CPUs (1 of 2) to add
   dlpar: Could not handle DLPAR request "cpu add count 2"
   $ nproc
   8
   $ drmgr -c cpu -a -q 2 # this uses the by-index method
   Validating CPU DLPAR capability...yes.
   CPU 1
   CPU 17
   $ nproc
   24

This is because find_drc_info_cpus_to_add() does not increment drc_index
appropriately during its search.

This is not hard to fix. But the _by_count() functions also have the
property that they attempt to roll back all prior operations if the entire
request cannot be satisfied, even though the rollback itself can encounter
errors. It's not possible to provide transaction-like behavior at this
level, and it's undesirable to have code that can only pretend to do that.
Any users of these functions cannot know what the state of the system is in
the error case. And the error paths are, to my knowledge, impossible to
test without adding custom error injection code.

Summary:

* This code has not worked reliably since its introduction.
* There is no evidence that it is used.
* It contains questionable rollback behaviors in error paths which are
   difficult to test.

So let's remove it.

Signed-off-by: Nathan Lynch 
Fixes: ac71380071d1 ("powerpc/pseries: Add CPU dlpar remove functionality")
Fixes: 90edf184b9b7 ("powerpc/pseries: Add CPU dlpar add functionality")
Fixes: b015f6bc9547 ("powerpc/pseries: Add cpu DLPAR support for drc-info 
property")
---


Tested with a QEMU pseries guest, no issues found.


Reviewed-by: Daniel Henrique Barboza 
Tested-by: Daniel Henrique Barboza 




  arch/powerpc/platforms/pseries/hotplug-cpu.c | 218 +--
  1 file changed, 2 insertions(+), 216 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 87a0fbe9cf12..768997261ce8 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -741,216 +741,6 @@ static int dlpar_cpu_remove_by_index(u32 drc_index)
return rc;
  }
  
-static int find_dlpar_cpus_to_remove(u32 *cpu_drcs, int cpus_to_remove)

-{
-   struct device_node *dn;
-   int cpus_found = 0;
-   int rc;
-
-   /* We want to find cpus_to_remove + 1 CPUs to ensure we do not
-* remove the last CPU.
-*/
-   for_each_node_by_type(dn, "cpu") {
-   cpus_found++;
-
-   if (cpus_found > cpus_to_remove) {
-   of_node_put(dn);
-   break;
-   }
-
-   /* Note that cpus_found is always 1 ahead of the index
-* into the cpu_drcs array, so we use cpus_found - 1
-*/
-   rc = of_property_read_u32(dn, "ibm,my-drc-index",
-				  &cpu_drcs[cpus_found - 1]);
-   if (rc) {
-   pr_warn("Error occurred getting drc-index for %pOFn\n",
-   dn);
-   of_node_put(dn);
-   return -1;
-   }
-   }
-
-   if (cpus_found < cpus_to_remove) {
-   pr_warn("Failed to find enough CPUs (%d of %d) to remove\n",
-   cpus_found, cpus_to_remove);
-   } else if (cpus_found == cpus_to_remove) {
-   pr_warn("Cannot remove all

Re: [PATCH 2/3] powerpc/cpuhp: BUG -> WARN conversion in offline path

2021-09-20 Thread Daniel Henrique Barboza




On 9/20/21 10:55, Nathan Lynch wrote:

If, due to bugs elsewhere, we get into unregister_cpu_online() with a CPU
that isn't marked hotpluggable, we can emit a warning and return an
appropriate error instead of crashing.

Signed-off-by: Nathan Lynch 
---


As mentioned by Christophe, this will not solve the crash for kernels with
panic_on_warn, but at least it will panic with a clearer message on those,
and will not panic for everyone else.


Reviewed-by: Daniel Henrique Barboza 




  arch/powerpc/kernel/sysfs.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index defecb3b1b15..08d8072d6e7a 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -928,7 +928,8 @@ static int unregister_cpu_online(unsigned int cpu)
struct device_attribute *attrs, *pmc_attrs;
int i, nattrs;
  
-	BUG_ON(!c->hotpluggable);

+   if (WARN_RATELIMIT(!c->hotpluggable, "cpu %d can't be offlined\n", cpu))
+   return -EBUSY;
  
  #ifdef CONFIG_PPC64

if (cpu_has_feature(CPU_FTR_SMT))



Re: [PATCH 1/3] powerpc/pseries/cpuhp: cache node corrections

2021-09-20 Thread Daniel Henrique Barboza




On 9/20/21 10:55, Nathan Lynch wrote:

On pseries, cache nodes in the device tree can be added and removed by the
CPU DLPAR code as well as the partition migration (mobility) code. PowerVM
partitions in dedicated processor mode typically have L2 and L3 cache
nodes.

The CPU DLPAR code has the following shortcomings:

* Cache nodes returned as siblings of a new CPU node by
   ibm,configure-connector are silently discarded; only the CPU node is
   added to the device tree.

* Cache nodes which become unreferenced in the processor removal path are
   not removed from the device tree. This can lead to duplicate nodes when
   the post-migration device tree update code replaces cache nodes.

This is long-standing behavior. Presumably it has gone mostly unnoticed
because the two bugs have the property of obscuring each other in common
simple scenarios (e.g. remove a CPU and add it back). Likely you'd notice
only if you cared to inspect the device tree or the sysfs cacheinfo
information.

Booted with two processors:

   $ pwd
   /sys/firmware/devicetree/base/cpus
   $ ls -1d */
   l2-cache@2010/
   l2-cache@2011/
   l3-cache@3110/
   l3-cache@3111/
   PowerPC,POWER9@0/
   PowerPC,POWER9@8/
   $ lsprop */l2-cache
   l2-cache@2010/l2-cache
  3110 (12560)
   l2-cache@2011/l2-cache
  3111 (12561)
   PowerPC,POWER9@0/l2-cache
  2010 (8208)
   PowerPC,POWER9@8/l2-cache
  2011 (8209)
   $ ls /sys/devices/system/cpu/cpu0/cache/
   index0  index1  index2  index3

After DLPAR-adding PowerPC,POWER9@10, we see that its associated cache
nodes are absent, its threads' L2+L3 cacheinfo is unpopulated, and it is
missing a cache level in its sched domain hierarchy:

   $ ls -1d */
   l2-cache@2010/
   l2-cache@2011/
   l3-cache@3110/
   l3-cache@3111/
   PowerPC,POWER9@0/
   PowerPC,POWER9@10/
   PowerPC,POWER9@8/
   $ lsprop PowerPC\,POWER9@10/l2-cache
   PowerPC,POWER9@10/l2-cache
  2012 (8210)
   $ ls /sys/devices/system/cpu/cpu16/cache/
   index0  index1
   $ grep . /sys/kernel/debug/sched/domains/cpu{0,8,16}/domain*/name
   /sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
   /sys/kernel/debug/sched/domains/cpu0/domain1/name:CACHE
   /sys/kernel/debug/sched/domains/cpu0/domain2/name:DIE
   /sys/kernel/debug/sched/domains/cpu8/domain0/name:SMT
   /sys/kernel/debug/sched/domains/cpu8/domain1/name:CACHE
   /sys/kernel/debug/sched/domains/cpu8/domain2/name:DIE
   /sys/kernel/debug/sched/domains/cpu16/domain0/name:SMT
   /sys/kernel/debug/sched/domains/cpu16/domain1/name:DIE

When removing PowerPC,POWER9@8, we see that its cache nodes are left
behind:

   $ ls -1d */
   l2-cache@2010/
   l2-cache@2011/
   l3-cache@3110/
   l3-cache@3111/
   PowerPC,POWER9@0/

When DLPAR is combined with VM migration, we can get duplicate nodes. E.g.
removing one processor, then migrating, adding a processor, and then
migrating again can result in warnings from the OF core during
post-migration device tree updates:

   Duplicate name in cpus, renamed to "l2-cache@2011#1"
   Duplicate name in cpus, renamed to "l3-cache@3111#1"

and nodes with duplicated phandles in the tree, making lookup behavior
unpredictable:

   $ lsprop l[23]-cache@*/ibm,phandle
   l2-cache@2010/ibm,phandle
2010 (8208)
   l2-cache@2011#1/ibm,phandle
2011 (8209)
   l2-cache@2011/ibm,phandle
2011 (8209)
   l3-cache@3110/ibm,phandle
3110 (12560)
   l3-cache@3111#1/ibm,phandle
3111 (12561)
   l3-cache@3111/ibm,phandle
3111 (12561)

Address these issues by:

* Correctly processing siblings of the node returned from
   dlpar_configure_connector().
* Removing cache nodes in the CPU remove path when it can be determined
   that they are not associated with other CPUs or caches.

Use the of_changeset API in both cases, which allows us to keep the error
handling in this code from becoming more complex while ensuring that the
device tree cannot become inconsistent.

Signed-off-by: Nathan Lynch 
Fixes: ac71380 ("powerpc/pseries: Add CPU dlpar remove functionality")
Fixes: 90edf18 ("powerpc/pseries: Add CPU dlpar add functionality")
---
  arch/powerpc/platforms/pseries/hotplug-cpu.c | 72 +++-
  1 file changed, 70 insertions(+), 2 deletions(-)


Tested with a QEMU pseries guest, multiple CPU add/removals of the same CPU,
and no issues found with these new pseries_cpuhp* functions.

Code LGTM as well.


Reviewed-by: Daniel Henrique Barboza 
Tested-by: Daniel Henrique Barboza 






diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index d646c22e94ab..87a0fbe9cf12 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -521,6 +521,27 @@ static bool valid_cpu_drc_index(st

Re: [PATCH v7 0/6] Add support for FORM2 associativity

2021-08-09 Thread Daniel Henrique Barboza




On 8/9/21 2:24 AM, Aneesh Kumar K.V wrote:

Form2 associativity adds a much more flexible NUMA topology layout
than what is provided by Form1. More details can be found in patch 7.

$ numactl -H
...
node distances:
node   0   1   2   3
   0:  10  11  222  33
   1:  44  10  55  66
   2:  77  88  10  99
   3:  101  121  132  10
$

After DAX kmem memory add
# numactl -H
available: 5 nodes (0-4)
...
node distances:
node   0   1   2   3   4
   0:  10  11  222  33  240
   1:  44  10  55  66  255
   2:  77  88  10  99  255
   3:  101  121  132  10  230
   4:  255  255  255  230  10


PAPR SCM now uses the NUMA distance details to find the numa_node and target_node
for the device.

kvaneesh@ubuntu-guest:~$ ndctl  list -N -v
[
   {
 "dev":"namespace0.0",
 "mode":"devdax",
 "map":"dev",
 "size":1071644672,
 "uuid":"d333d867-3f57-44c8-b386-d4d3abdc2bf2",
 "raw_uuid":"915361ad-fe6a-42dd-848f-d6dc9f5af362",
 "daxregion":{
   "id":0,
   "size":1071644672,
   "devices":[
 {
   "chardev":"dax0.0",
   "size":1071644672,
   "target_node":4,
   "mode":"devdax"
 }
   ]
 },
 "align":2097152,
 "numa_node":3
   }
]
kvaneesh@ubuntu-guest:~$


The above output is with a Qemu command line

-numa node,nodeid=4 \
-numa dist,src=0,dst=1,val=11 -numa dist,src=0,dst=2,val=222 -numa 
dist,src=0,dst=3,val=33 -numa dist,src=0,dst=4,val=240 \
-numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=55 -numa 
dist,src=1,dst=3,val=66 -numa dist,src=1,dst=4,val=255 \
-numa dist,src=2,dst=0,val=77 -numa dist,src=2,dst=1,val=88 -numa 
dist,src=2,dst=3,val=99 -numa dist,src=2,dst=4,val=255 \
-numa dist,src=3,dst=0,val=101 -numa dist,src=3,dst=1,val=121 -numa 
dist,src=3,dst=2,val=132 -numa dist,src=3,dst=4,val=230 \
-numa dist,src=4,dst=0,val=255 -numa dist,src=4,dst=1,val=255 -numa 
dist,src=4,dst=2,val=255 -numa dist,src=4,dst=3,val=230 \
-object 
memory-backend-file,id=memnvdimm1,prealloc=yes,mem-path=$PMEM_DISK,share=yes,size=${PMEM_SIZE}
  \
-device 
nvdimm,label-size=128K,memdev=memnvdimm1,id=nvdimm1,slot=4,uuid=72511b67-0b3b-42fd-8d1d-5be3cae8bcaa,node=4

Qemu changes can be found at 
https://lore.kernel.org/qemu-devel/20210616011944.2996399-1-danielhb...@gmail.com/



Series tested with the aforementioned QEMU build. All patches:


Tested-by: Daniel Henrique Barboza 



Changes from v6:
* Address review feedback

Changes from v5:
* Fix build error reported by kernel test robot
* Address review feedback

Changes from v4:
* Drop DLPAR related device tree property for now because both Qemu and PowerVM
   will provide the distance details of all possible NUMA nodes during boot.
* Rework numa distance code based on review feedback.

Changes from v3:
* Drop PAPR SCM specific changes and depend completely on NUMA distance 
information.

Changes from v2:
* Add nvdimm list to Cc:
* update PATCH 8 commit message.

Changes from v1:
* Update FORM2 documentation.
* rename max_domain_index to max_associativity_domain_index


Aneesh Kumar K.V (6):
   powerpc/pseries: rename min_common_depth to primary_domain_index
   powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
   powerpc/pseries: Consolidate different NUMA distance update code paths
   powerpc/pseries: Add a helper for form1 cpu distance
   powerpc/pseries: Add support for FORM2 associativity
   powerpc/pseries: Consolidate form1 distance initialization into a
 helper

  Documentation/powerpc/associativity.rst   | 103 +
  arch/powerpc/include/asm/firmware.h   |   7 +-
  arch/powerpc/include/asm/prom.h   |   3 +-
  arch/powerpc/include/asm/topology.h   |   6 +-
  arch/powerpc/kernel/prom_init.c   |   3 +-
  arch/powerpc/mm/numa.c| 433 ++
  arch/powerpc/platforms/pseries/firmware.c |   3 +-
  arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
  .../platforms/pseries/hotplug-memory.c|   2 +
  arch/powerpc/platforms/pseries/lpar.c |   4 +-
  10 files changed, 455 insertions(+), 111 deletions(-)
  create mode 100644 Documentation/powerpc/associativity.rst



Re: [PATCH v5 0/6] Add support for FORM2 associativity

2021-07-13 Thread Daniel Henrique Barboza

Aneesh,

This series compiles with a configuration made with "pseries_le_defconfig"
but fails with a config based on an existing RHEL8 config.

The reason, which is hinted at in the robot replies to patch 4, is that you
defined "__vphn_get_associativity" inside an #ifdef CONFIG_PPC_SPLPAR guard but
didn't define how the function would behave without the config, and you ended
up using the function elsewhere.

This fixes the compilation but I'm not sure if this is what you intended
for this function:


diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c68846fc9550..6e8551d16b7a 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -680,6 +680,11 @@ static int vphn_get_nid(long lcpu)
 
 }

 #else
+static int __vphn_get_associativity(long lcpu, __be32 *associativity)
+{
+   return -1;
+}
+
 static int vphn_get_nid(long unused)
 {
return NUMA_NO_NODE;


I'll post a new version of the QEMU FORM2 changes using these patches as is 
(with
the above fixup), but I guess you'll want to post a v6.



Thanks,



Daniel




On 6/28/21 12:11 PM, Aneesh Kumar K.V wrote:

Form2 associativity adds a much more flexible NUMA topology layout
than what is provided by Form1. More details can be found in patch 7.

$ numactl -H
...
node distances:
node   0   1   2   3
   0:  10  11  222  33
   1:  44  10  55  66
   2:  77  88  10  99
   3:  101  121  132  10
$

After DAX kmem memory add
# numactl -H
available: 5 nodes (0-4)
...
node distances:
node   0   1   2   3   4
   0:  10  11  222  33  240
   1:  44  10  55  66  255
   2:  77  88  10  99  255
   3:  101  121  132  10  230
   4:  255  255  255  230  10


PAPR SCM now uses the NUMA distance details to find the numa_node and target_node
for the device.

kvaneesh@ubuntu-guest:~$ ndctl  list -N -v
[
   {
 "dev":"namespace0.0",
 "mode":"devdax",
 "map":"dev",
 "size":1071644672,
 "uuid":"d333d867-3f57-44c8-b386-d4d3abdc2bf2",
 "raw_uuid":"915361ad-fe6a-42dd-848f-d6dc9f5af362",
 "daxregion":{
   "id":0,
   "size":1071644672,
   "devices":[
 {
   "chardev":"dax0.0",
   "size":1071644672,
   "target_node":4,
   "mode":"devdax"
 }
   ]
 },
 "align":2097152,
 "numa_node":3
   }
]
kvaneesh@ubuntu-guest:~$


The above output is with a Qemu command line

-numa node,nodeid=4 \
-numa dist,src=0,dst=1,val=11 -numa dist,src=0,dst=2,val=222 -numa 
dist,src=0,dst=3,val=33 -numa dist,src=0,dst=4,val=240 \
-numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=55 -numa 
dist,src=1,dst=3,val=66 -numa dist,src=1,dst=4,val=255 \
-numa dist,src=2,dst=0,val=77 -numa dist,src=2,dst=1,val=88 -numa 
dist,src=2,dst=3,val=99 -numa dist,src=2,dst=4,val=255 \
-numa dist,src=3,dst=0,val=101 -numa dist,src=3,dst=1,val=121 -numa 
dist,src=3,dst=2,val=132 -numa dist,src=3,dst=4,val=230 \
-numa dist,src=4,dst=0,val=255 -numa dist,src=4,dst=1,val=255 -numa 
dist,src=4,dst=2,val=255 -numa dist,src=4,dst=3,val=230 \
-object 
memory-backend-file,id=memnvdimm1,prealloc=yes,mem-path=$PMEM_DISK,share=yes,size=${PMEM_SIZE}
  \
-device 
nvdimm,label-size=128K,memdev=memnvdimm1,id=nvdimm1,slot=4,uuid=72511b67-0b3b-42fd-8d1d-5be3cae8bcaa,node=4

Qemu changes can be found at 
https://lore.kernel.org/qemu-devel/20210616011944.2996399-1-danielhb...@gmail.com/

Changes from v4:
* Drop DLPAR related device tree property for now because both Qemu and PowerVM
   will provide the distance details of all possible NUMA nodes during boot.
* Rework numa distance code based on review feedback.

Changes from v3:
* Drop PAPR SCM specific changes and depend completely on NUMA distance 
information.

Changes from v2:
* Add nvdimm list to Cc:
* update PATCH 8 commit message.

Changes from v1:
* Update FORM2 documentation.
* rename max_domain_index to max_associativity_domain_index


Aneesh Kumar K.V (6):
   powerpc/pseries: rename min_common_depth to primary_domain_index
   powerpc/pseries: rename distance_ref_points_depth to
 max_associativity_domain_index
   powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
   powerpc/pseries: Consolidate different NUMA distance update code paths
   powerpc/pseries: Add a helper for form1 cpu distance
   powerpc/pseries: Add support for FORM2 associativity

  Documentation/powerpc/associativity.rst   | 103 +
  arch/powerpc/include/asm/firmware.h   |   7 +-
  arch/powerpc/include/asm/prom.h   |   3 +-
  arch/powerpc/include/asm/topology.h   |   4 +-
  arch/powerpc/kernel/prom_init.c   |   3 +-
  arch/powerpc/mm/numa.c| 415 +-
  arch/powerpc/platforms/pseries/firmware.c |   3 +-
  arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
  .../platforms/pseries/hotplug-memory.c|   2 +
  arch/powerpc/platforms/pseries/lpar.c |   4 +-
  arch/powerpc/platforms/pseries/pseries.h  |   1 +
  11 files changed, 432 insertions(+), 115 deletions(-)
  create 

Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity

2021-06-22 Thread Daniel Henrique Barboza




On 6/22/21 9:07 AM, Aneesh Kumar K.V wrote:

Daniel Henrique Barboza  writes:


On 6/17/21 1:51 PM, Aneesh Kumar K.V wrote:

PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza 
Signed-off-by: Aneesh Kumar K.V 
---
   Documentation/powerpc/associativity.rst   | 135 
   arch/powerpc/include/asm/firmware.h   |   3 +-
   arch/powerpc/include/asm/prom.h   |   1 +
   arch/powerpc/kernel/prom_init.c   |   3 +-
   arch/powerpc/mm/numa.c| 149 +-
   arch/powerpc/platforms/pseries/firmware.c |   1 +
   6 files changed, 286 insertions(+), 6 deletions(-)
   create mode 100644 Documentation/powerpc/associativity.rst

diff --git a/Documentation/powerpc/associativity.rst 
b/Documentation/powerpc/associativity.rst
new file mode 100644
index ..93be604ac54d
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,135 @@
+
+NUMA resource associativity
+===========================
+
+Associativity represents the groupings of the various platform resources into
+domains of substantially similar mean performance relative to resources outside
+of that domain. Resource subsets of a given domain that exhibit better
+performance relative to each other than relative to other resource subsets
+are represented as being members of a sub-grouping domain. This performance
+characteristic is presented in terms of NUMA node distance within the Linux 
kernel.
+From the platform view, these groups are also referred to as domains.
+
+PAPR interface currently supports different ways of communicating these 
resource
+grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
+associativity grouping. Form 0 is the older format and is now considered 
deprecated.
+
+The hypervisor indicates the type/form of associativity used via the
+"ibm,architecture-vec-5" property.
+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of 
Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 
associativity
+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+------
+Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
+
+Form 1
+------
+With Form 1 a combination of ibm,associativity-reference-points and 
ibm,associativity
+device tree properties are used to determine the NUMA distance between 
resource groups/domains.
+
+The “ibm,associativity” property contains one or more lists of numbers 
(domainID)
+representing the resource’s platform grouping domains.
+
+The “ibm,associativity-reference-points” property contains one or more lists
+of numbers (domainID indexes) that represent the 1-based ordinal in the
+associativity lists. The list of domainID indexes represents an increasing
+hierarchy of resource grouping.
+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID index.. }
+
+The Linux kernel uses the domainID at the primary domainID index as the NUMA
+node id. The kernel computes the NUMA distance between two domains by
+recursively comparing if they belong to the same higher-level domains. For a
+mismatch at each higher level of the resource group, the kernel doubles the
+NUMA distance between the domains being compared.
+
+Form 2
+------
+Form 2 associativity format adds separate device tree properties representing 
NUMA node distance
+thereby making the node distance computation flexible. Form 2 also allows 
flexible primary
+domain numbering. With numa distance computation now detached from the index 
value of
+"ibm,associativity" property, Form 2 allows a large number of primary domain 
ids at the
+same domainID index representing resource groups of different 
performance/latency characteristics.
+
+Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in 
the
+"ibm,architecture-vec-5" property.
+
+The "ibm,numa-lookup-index-table" property contains a list of numbers
+representing the domainIDs present in the system. The offset of a domainID in
+this property is considered its domainID index.
+
+prop-encoded-array: the number N of domainIDs encoded as with encode-int,
+followed by N domainIDs encoded as with encode-int.
+
+For ex:
+ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index for 
domainID 8 is 1.
+
+The "ibm,numa-distance-table" property contains a list of numbers representing
+the NUMA distance between the resource groups/domains present in the system.
+
+prop-encoded-array: The number N of the distance values encoded as with 
encode-int, followed by
+N distance values encoded as with encode-bytes

[PATCH 3/3] powerpc/pseries: fail quicker in dlpar_memory_add_by_ic()

2021-06-22 Thread Daniel Henrique Barboza
The validation done at the start of dlpar_memory_add_by_ic() is an all or
nothing scenario - if any LMB in the range is marked as RESERVED we can
fail right away.

We can then remove the 'lmbs_available' variable and its check against
'lmbs_to_add' since the whole LMB range was already validated in the
previous step.

Signed-off-by: Daniel Henrique Barboza 
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c0a03e1537cb..377d852f5a9a 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -796,7 +796,6 @@ static int dlpar_memory_add_by_index(u32 drc_index)
 static int dlpar_memory_add_by_ic(u32 lmbs_to_add, u32 drc_index)
 {
struct drmem_lmb *lmb, *start_lmb, *end_lmb;
-   int lmbs_available = 0;
int rc;
 
pr_info("Attempting to hot-add %u LMB(s) at index %x\n",
@@ -811,15 +810,14 @@ static int dlpar_memory_add_by_ic(u32 lmbs_to_add, u32 
drc_index)
 
/* Validate that the LMBs in this range are not reserved */
for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
-   if (lmb->flags & DRCONF_MEM_RESERVED)
-   break;
-
-   lmbs_available++;
+   /* Fail immediately if the whole range can't be hot-added */
+   if (lmb->flags & DRCONF_MEM_RESERVED) {
+   pr_err("Memory at %llx (drc index %x) is reserved\n",
+   lmb->base_addr, lmb->drc_index);
+   return -EINVAL;
+   }
}
 
-   if (lmbs_available < lmbs_to_add)
-   return -EINVAL;
-
for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
if (lmb->flags & DRCONF_MEM_ASSIGNED)
continue;
-- 
2.31.1



[PATCH 2/3] powerpc/pseries: break early in dlpar_memory_add_by_count() loops

2021-06-22 Thread Daniel Henrique Barboza
After a successful dlpar_add_lmb() call the LMB is marked as reserved.
Later on, depending on whether we added enough LMBs or not, we rely on
the marked LMBs to see which ones might need to be removed, and we
remove the reservation of all of them.

These are done in for_each_drmem_lmb() loops without any break
condition. This means that we're going to check all LMBs of the partition
even after going through all the reserved ones.

This patch adds break conditions in both loops to avoid this. The
'lmbs_added' variable was renamed to 'lmbs_reserved', and it's now
being decremented each time an LMB reservation is removed, indicating
if there are still marked LMBs to be processed.

Signed-off-by: Daniel Henrique Barboza 
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 28a7fd90232f..c0a03e1537cb 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -673,7 +673,7 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
 {
struct drmem_lmb *lmb;
int lmbs_available = 0;
-   int lmbs_added = 0;
+   int lmbs_reserved = 0;
int rc;
 
pr_info("Attempting to hot-add %d LMB(s)\n", lmbs_to_add);
@@ -714,13 +714,12 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
 * requested LMBs cannot be added.
 */
drmem_mark_lmb_reserved(lmb);
-
-   lmbs_added++;
-   if (lmbs_added == lmbs_to_add)
+   lmbs_reserved++;
+   if (lmbs_reserved == lmbs_to_add)
break;
}
 
-   if (lmbs_added != lmbs_to_add) {
+   if (lmbs_reserved != lmbs_to_add) {
pr_err("Memory hot-add failed, removing any added LMBs\n");
 
for_each_drmem_lmb(lmb) {
@@ -735,6 +734,10 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
dlpar_release_drc(lmb->drc_index);
 
drmem_remove_lmb_reservation(lmb);
+   lmbs_reserved--;
+
+   if (lmbs_reserved == 0)
+   break;
}
rc = -EINVAL;
} else {
@@ -745,6 +748,10 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
pr_debug("Memory at %llx (drc index %x) was 
hot-added\n",
 lmb->base_addr, lmb->drc_index);
drmem_remove_lmb_reservation(lmb);
+   lmbs_reserved--;
+
+   if (lmbs_reserved == 0)
+   break;
}
rc = 0;
}
-- 
2.31.1



[PATCH 0/3] powerpc/pseries: cleanups for dlpar_memory_add* functions

2021-06-22 Thread Daniel Henrique Barboza
Hi,

These are a couple of cleanups for the dlpar_memory_add* functions
that are similar to those I did a month or so ago in
dlpar_memory_remove_by_count and dlpar_memory_remove_by_ic. 



Daniel Henrique Barboza (3):
  powerpc/pseries: skip reserved LMBs in dlpar_memory_add_by_count()
  powerpc/pseries: break early in dlpar_memory_add_by_count() loops
  powerpc/pseries: fail quicker in dlpar_memory_add_by_ic()

 .../platforms/pseries/hotplug-memory.c| 34 ---
 1 file changed, 21 insertions(+), 13 deletions(-)

-- 
2.31.1



[PATCH 1/3] powerpc/pseries: skip reserved LMBs in dlpar_memory_add_by_count()

2021-06-22 Thread Daniel Henrique Barboza
The function is counting reserved LMBs as available to be added, but
they aren't. This will cause the function to miscalculate the available
LMBs and can trigger errors later on when executing dlpar_add_lmb().

Signed-off-by: Daniel Henrique Barboza 
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 36f66556a7c6..28a7fd90232f 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -683,6 +683,9 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
 
/* Validate that there are enough LMBs to satisfy the request */
for_each_drmem_lmb(lmb) {
+   if (lmb->flags & DRCONF_MEM_RESERVED)
+   continue;
+
if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
lmbs_available++;
 
-- 
2.31.1



Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity

2021-06-21 Thread Daniel Henrique Barboza




On 6/17/21 1:51 PM, Aneesh Kumar K.V wrote:

PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza 
Signed-off-by: Aneesh Kumar K.V 
---
  Documentation/powerpc/associativity.rst   | 135 
  arch/powerpc/include/asm/firmware.h   |   3 +-
  arch/powerpc/include/asm/prom.h   |   1 +
  arch/powerpc/kernel/prom_init.c   |   3 +-
  arch/powerpc/mm/numa.c| 149 +-
  arch/powerpc/platforms/pseries/firmware.c |   1 +
  6 files changed, 286 insertions(+), 6 deletions(-)
  create mode 100644 Documentation/powerpc/associativity.rst

diff --git a/Documentation/powerpc/associativity.rst 
b/Documentation/powerpc/associativity.rst
new file mode 100644
index ..93be604ac54d
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,135 @@
+
+NUMA resource associativity
+===========================
+
+Associativity represents the groupings of the various platform resources into
+domains of substantially similar mean performance relative to resources outside
+of that domain. Resource subsets of a given domain that exhibit better
+performance relative to each other than relative to other resource subsets
+are represented as being members of a sub-grouping domain. This performance
+characteristic is presented in terms of NUMA node distance within the Linux 
kernel.
+From the platform view, these groups are also referred to as domains.
+
+PAPR interface currently supports different ways of communicating these 
resource
+grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
+associativity grouping. Form 0 is the older format and is now considered 
deprecated.
+
+The hypervisor indicates the type/form of associativity used via the
+"ibm,architecture-vec-5" property.
+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of 
Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 
associativity
+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+------
+Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
+
+Form 1
+------
+With Form 1 a combination of ibm,associativity-reference-points and 
ibm,associativity
+device tree properties are used to determine the NUMA distance between 
resource groups/domains.
+
+The “ibm,associativity” property contains one or more lists of numbers 
(domainID)
+representing the resource’s platform grouping domains.
+
+The “ibm,associativity-reference-points” property contains one or more lists
+of numbers (domainID indexes) that represent the 1-based ordinal in the
+associativity lists. The list of domainID indexes represents an increasing
+hierarchy of resource grouping.
+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID index.. }
+
+The Linux kernel uses the domainID at the primary domainID index as the NUMA
+node id. The kernel computes the NUMA distance between two domains by
+recursively comparing if they belong to the same higher-level domains. For a
+mismatch at each higher level of the resource group, the kernel doubles the
+NUMA distance between the domains being compared.
+
+Form 2
+------
+Form 2 associativity format adds separate device tree properties representing 
NUMA node distance
+thereby making the node distance computation flexible. Form 2 also allows 
flexible primary
+domain numbering. With numa distance computation now detached from the index 
value of
+"ibm,associativity" property, Form 2 allows a large number of primary domain 
ids at the
+same domainID index representing resource groups of different 
performance/latency characteristics.
+
+Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in 
the
+"ibm,architecture-vec-5" property.
+
+The "ibm,numa-lookup-index-table" property contains a list of numbers
+representing the domainIDs present in the system. The offset of a domainID in
+this property is considered its domainID index.
+
+prop-encoded-array: the number N of domainIDs encoded as with encode-int,
+followed by N domainIDs encoded as with encode-int.
+
+For ex:
+ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index for 
domainID 8 is 1.
+
+The "ibm,numa-distance-table" property contains a list of numbers representing
+the NUMA distance between the resource groups/domains present in the system.
+
+prop-encoded-array: The number N of the distance values encoded as with 
encode-int, followed by
+N distance values encoded as with encode-bytes. The max distance value we 
could encode is 255.
+
+For ex:
+ibm,numa-lookup-index-tab

Re: [RFC PATCH 8/8] powerpc/papr_scm: Use FORM2 associativity details

2021-06-17 Thread Daniel Henrique Barboza




On 6/17/21 8:11 AM, Aneesh Kumar K.V wrote:

Daniel Henrique Barboza  writes:


On 6/17/21 4:46 AM, David Gibson wrote:

On Tue, Jun 15, 2021 at 12:35:17PM +0530, Aneesh Kumar K.V wrote:

David Gibson  writes:


On Tue, Jun 15, 2021 at 11:27:50AM +0530, Aneesh Kumar K.V wrote:

David Gibson  writes:


On Mon, Jun 14, 2021 at 10:10:03PM +0530, Aneesh Kumar K.V wrote:

FORM2 introduces a concept of secondary domain which is identical to the
concept of FORM1 primary domain. Use the secondary domain as the numa node
when using persistent memory device. For DAX kmem use the logical domain
id introduced in FORM2. This new numa node

Signed-off-by: Aneesh Kumar K.V 
---
   arch/powerpc/mm/numa.c| 28 +++
   arch/powerpc/platforms/pseries/papr_scm.c | 26 +
   arch/powerpc/platforms/pseries/pseries.h  |  1 +
   3 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 86cd2af014f7..b9ac6d02e944 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -265,6 +265,34 @@ static int associativity_to_nid(const __be32 
*associativity)
return nid;
   }
   
+int get_primary_and_secondary_domain(struct device_node *node, int *primary, int *secondary)

+{
+   int secondary_index;
+   const __be32 *associativity;
+
+   if (!numa_enabled) {
+   *primary = NUMA_NO_NODE;
+   *secondary = NUMA_NO_NODE;
+   return 0;
+   }
+
+   associativity = of_get_associativity(node);
+   if (!associativity)
+   return -ENODEV;
+
+   if (of_read_number(associativity, 1) >= primary_domain_index) {
+   *primary = of_read_number(&associativity[primary_domain_index], 1);
+   secondary_index = of_read_number(&distance_ref_points[1], 1);


Secondary ID is always the second reference point, but primary depends
on the length of resources?  That seems very weird.


primary_domain_index is distance_ref_point[0]. With Form2 we would find
both primary and secondary domain ID same for all resources other than
persistent memory device. The usage w.r.t. persistent memory is
explained in patch 7.


Right, I misunderstood



With Form2 the primary domainID and secondary domainID are used to identify the 
NUMA nodes
the kernel should use when using persistent memory devices.


This seems kind of bogus.  With Form1, the primary/secondary ID are a
sort of hierarchy of distance (things with same primary ID are very
close, things with same secondary are kinda-close, etc.).  With Form2,
it's referring to their effective node for different purposes.

Using the same terms for different meanings seems unnecessarily
confusing.


They are essentially domainIDs. The interpretation of them are different
between Form1 and Form2. Hence I kept referring to them as primary and
secondary domainID. Any suggestion on what to name them with Form2?


My point is that reusing associativity-reference-points for something
with completely unrelated semantics seems like a very poor choice.



I agree that this reuse can be confusing. I could argue that there is
precedent for that in PAPR - FORM0 puts a different spin on the same
property as well - but there is no need to keep following existing PAPR
practices in new spec (and some might argue it's best not to).

As far as QEMU goes, renaming this property to "numa-associativity-mode"
(just an example) is a quick change to do since we separated FORM1 and FORM2
code over there.

Doing such a rename can also help with the issue of having to describe new
FORM2 semantics using "least significant boundary" or "primary domain" or
any FORM0|FORM1 related terminology.



It is not just changing the name; we will then have to explain the
meaning of ibm,associativity-reference-points with FORM2, right?


Hmm, why? My idea over there was to add a new property that indicates that a
resource might have a different NUMA affinity based on the mode of operation
(like PMEM), and get rid of ibm,associativity-reference-points altogether.

The NUMA distances already express the topology. Closer distances indicate
closer proximity, larger distances indicate otherwise. Having
"associativity-reference-points" reflect an associativity domain
relationship, when you already have all the distances from each node, is
somewhat redundant.

The concept of 'associativity domain' was necessary in FORM1 because we had no
other way of telling distance between NUMA nodes. We needed to rely on these
overly complex and convoluted subdomain abstractions to say that "nodeA belongs
to the same third-level domain as node B, and in the second-level domain with
node C". The kernel would read that and calculate that each level is doubling
the distance from the level before and local_distance is 10, so:

distAA = 10, distAB = 20, distAC = 40

With FORM2, if this information is already explicit in ibm,numa-distanc

Re: [RFC PATCH 8/8] powerpc/papr_scm: Use FORM2 associativity details

2021-06-17 Thread Daniel Henrique Barboza




On 6/17/21 4:46 AM, David Gibson wrote:

On Tue, Jun 15, 2021 at 12:35:17PM +0530, Aneesh Kumar K.V wrote:

David Gibson  writes:


On Tue, Jun 15, 2021 at 11:27:50AM +0530, Aneesh Kumar K.V wrote:

David Gibson  writes:


On Mon, Jun 14, 2021 at 10:10:03PM +0530, Aneesh Kumar K.V wrote:

FORM2 introduces a concept of secondary domain which is identical to the
concept of FORM1 primary domain. Use the secondary domain as the numa node
when using persistent memory device. For DAX kmem use the logical domain
id introduced in FORM2. This new numa node

Signed-off-by: Aneesh Kumar K.V 
---
  arch/powerpc/mm/numa.c| 28 +++
  arch/powerpc/platforms/pseries/papr_scm.c | 26 +
  arch/powerpc/platforms/pseries/pseries.h  |  1 +
  3 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 86cd2af014f7..b9ac6d02e944 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -265,6 +265,34 @@ static int associativity_to_nid(const __be32 
*associativity)
return nid;
  }
  
+int get_primary_and_secondary_domain(struct device_node *node, int *primary, int *secondary)

+{
+   int secondary_index;
+   const __be32 *associativity;
+
+   if (!numa_enabled) {
+   *primary = NUMA_NO_NODE;
+   *secondary = NUMA_NO_NODE;
+   return 0;
+   }
+
+   associativity = of_get_associativity(node);
+   if (!associativity)
+   return -ENODEV;
+
+   if (of_read_number(associativity, 1) >= primary_domain_index) {
+   *primary = of_read_number(&associativity[primary_domain_index], 1);
+   secondary_index = of_read_number(&distance_ref_points[1], 1);


Secondary ID is always the second reference point, but primary depends
on the length of resources?  That seems very weird.


primary_domain_index is distance_ref_point[0]. With Form2 we would find
both primary and secondary domain ID same for all resources other than
persistent memory device. The usage w.r.t. persistent memory is
explained in patch 7.


Right, I misunderstood



With Form2 the primary domainID and secondary domainID are used to identify the 
NUMA nodes
the kernel should use when using persistent memory devices.


This seems kind of bogus.  With Form1, the primary/secondary ID are a
sort of hierarchy of distance (things with same primary ID are very
close, things with same secondary are kinda-close, etc.).  With Form2,
it's referring to their effective node for different purposes.

Using the same terms for different meanings seems unnecessarily
confusing.


They are essentially domainIDs. The interpretation of them are different
between Form1 and Form2. Hence I kept referring to them as primary and
secondary domainID. Any suggestion on what to name them with Form2?


My point is that reusing associativity-reference-points for something
with completely unrelated semantics seems like a very poor choice.



I agree that this reuse can be confusing. I could argue that there is
precedent for that in PAPR - FORM0 puts a different spin on the same
property as well - but there is no need to keep following existing PAPR
practices in new spec (and some might argue it's best not to).

As far as QEMU goes, renaming this property to "numa-associativity-mode"
(just an example) is a quick change to do since we separated FORM1 and FORM2
code over there.

Doing such a rename can also help with the issue of having to describe new
FORM2 semantics using "least significant boundary" or "primary domain" or
any FORM0|FORM1 related terminology.


Thanks,


Daniel






Persistent memory devices can also be used as regular memory using the DAX
KMEM driver, and the primary domainID indicates the numa node number the OS
should use when using these devices as regular memory. The secondary domainID
is the numa node number that should be used when using this device as
persistent memory.


It's weird to me that you'd want to consider them in different nodes
for those different purposes.



---------------------------------
|NUMA node0                     |
|ProcA -----> MEMA              |
|  |                            |
|  |                            |
|  ---------------> PMEMB       |
|                               |
---------------------------------

---------------------------------
|NUMA node1                     |
|                               |
|ProcB -------> MEMC            |
|  |                            |
|  ---------------> PMEMD       |
|                               |
|                               |
---------------------------------


For a topology like the above, an application running on ProcA wants to find
out the persistent memory mount local to its NUMA node. Hence when using it as
a pmem fsdax mount or devdax

Re: [RFC PATCH 0/8] Add support for FORM2 associativity

2021-06-14 Thread Daniel Henrique Barboza




On 6/14/21 1:39 PM, Aneesh Kumar K.V wrote:

Form2 associativity adds a much more flexible NUMA topology layout
than what is provided by Form1. This also allows PAPR SCM device
to use better associativity when using the device as DAX KMEM
device. More details can be found in patch x

$ ndctl list -N -v
[
   {
 "dev":"namespace0.0",
 "mode":"devdax",
 "map":"dev",
 "size":1071644672,
 "uuid":"37dea198-ddb5-4e42-915a-99a915e24188",
 "raw_uuid":"148deeaa-4a2f-41d1-8d74-fd9a942d26ba",
 "daxregion":{
   "id":0,
   "size":1071644672,
   "devices":[
 {
   "chardev":"dax0.0",
   "size":1071644672,
   "target_node":4,
   "mode":"devdax"
 }
   ]
 },
 "align":2097152,
 "numa_node":1
   }
]

$ numactl -H
...
node distances:
node   0   1   2   3
   0:  10  11  22  33
   1:  44  10  55  66
   2:  77  88  10  99
   3:  101  121  132  10
$

After DAX KMEM
# numactl -H
available: 5 nodes (0-4)
...
node distances:
node   0   1   2   3   4
   0:  10  11  22  33  255
   1:  44  10  55  66  255
   2:  77  88  10  99  255
   3:  101  121  132  10  255
   4:  255  255  255  255  10
#

The above output is with a QEMU command line



For reference, this QEMU:


https://github.com/danielhb/qemu/tree/form2_affinity_v1

https://lists.gnu.org/archive/html/qemu-devel/2021-06/msg03617.html


but ...



-numa node,nodeid=4 \
-numa dist,src=0,dst=1,val=11 -numa dist,src=0,dst=2,val=22 -numa dist,src=0,dst=3,val=33 -numa dist,src=0,dst=4,val=255 \
-numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=55 -numa dist,src=1,dst=3,val=66 -numa dist,src=1,dst=4,val=255 \
-numa dist,src=2,dst=0,val=77 -numa dist,src=2,dst=1,val=88 -numa dist,src=2,dst=3,val=99 -numa dist,src=2,dst=4,val=255 \
-numa dist,src=3,dst=0,val=101 -numa dist,src=3,dst=1,val=121 -numa dist,src=3,dst=2,val=132 -numa dist,src=3,dst=4,val=255 \
-numa dist,src=4,dst=0,val=255 -numa dist,src=4,dst=1,val=255 -numa dist,src=4,dst=2,val=255 -numa dist,src=4,dst=3,val=255 \
-object memory-backend-file,id=memnvdimm1,prealloc=yes,mem-path=$PMEM_DISK,share=yes,size=${PMEM_SIZE} \
-device nvdimm,label-size=128K,memdev=memnvdimm1,id=nvdimm1,slot=4,uuid=72511b67-0b3b-42fd-8d1d-5be3cae8bcaa,node=4,persistent-nodeid=1



with 'device-node=1' instead of 'persistent-nodeid=1' in the nvdimm parameter
up here.






Aneesh Kumar K.V (8):
   powerpc/pseries: rename min_common_depth to primary_domain_index
   powerpc/pseries: rename distance_ref_points_depth to max_domain_index
   powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
   powerpc/pseries: Consolidate DLPAR NUMA distance update
   powerpc/pseries: Consolidate NUMA distance update during boot
   powerpc/pseries: Add a helper for form1 cpu distance
   powerpc/pseries: Add support for FORM2 associativity
   powerpc/papr_scm: Use FORM2 associativity details



Series:


Tested-by: Daniel Henrique Barboza 





  Documentation/powerpc/associativity.rst   | 139 ++
  arch/powerpc/include/asm/firmware.h   |   7 +-
  arch/powerpc/include/asm/prom.h   |   3 +-
  arch/powerpc/kernel/prom_init.c   |   3 +-
  arch/powerpc/mm/numa.c| 436 ++
  arch/powerpc/platforms/pseries/firmware.c |   3 +-
  arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
  .../platforms/pseries/hotplug-memory.c|   2 +
  arch/powerpc/platforms/pseries/papr_scm.c |  26 +-
  arch/powerpc/platforms/pseries/pseries.h  |   2 +
  10 files changed, 522 insertions(+), 101 deletions(-)
  create mode 100644 Documentation/powerpc/associativity.rst



Re: [PATCH -next] powerpc/pseries/memhotplug: Remove unused inline function dlpar_memory_remove()

2021-05-14 Thread Daniel Henrique Barboza




On 5/14/21 4:10 AM, YueHaibing wrote:

dlpar_memory_remove() is never used, so can be removed.

Signed-off-by: YueHaibing 
---


Reviewed-by: Daniel Henrique Barboza 


  arch/powerpc/platforms/pseries/hotplug-memory.c | 4 ----
  1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 8377f1f7c78e..3d93f1c48e23 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -585,10 +585,6 @@ static inline int pseries_remove_mem_node(struct 
device_node *np)
  {
return 0;
  }
-static inline int dlpar_memory_remove(struct pseries_hp_errorlog *hp_elog)
-{
-   return -EOPNOTSUPP;
-}
  static int dlpar_remove_lmb(struct drmem_lmb *lmb)
  {
return -EOPNOTSUPP;



Re: [PATCH 2/3] hotplug-memory.c: enhance dlpar_memory_remove* LMB checks

2021-05-12 Thread Daniel Henrique Barboza



On 5/3/21 10:02 PM, David Gibson wrote:

On Fri, Apr 30, 2021 at 09:09:16AM -0300, Daniel Henrique Barboza wrote:

dlpar_memory_remove_by_ic() validates the amount of LMBs to be removed
by checking !DRCONF_MEM_RESERVED, and in the following loop before
dlpar_remove_lmb() a check for DRCONF_MEM_ASSIGNED is made before
removing it. This means that a LMB that is both !DRCONF_MEM_RESERVED and
!DRCONF_MEM_ASSIGNED will be counted as valid, but then not being
removed.  The function will end up not removing all 'lmbs_to_remove'
LMBs while also not reporting any errors.

Comparing it to dlpar_memory_remove_by_count(), the validation is done
via lmb_is_removable(), which checks for DRCONF_MEM_ASSIGNED and fadump
constraints. No additional check is made afterwards, and
DRCONF_MEM_RESERVED is never checked before dlpar_remove_lmb(). The
function doesn't have the same 'check A for validation, then B for
removal' issue as remove_by_ic(), but it's not checking if the LMB is
reserved.

There is no reason for these functions to validate the same operation in
two different manners.


Actually, I think there is: remove_by_ic() is handling a request to
remove a specific range of LMBs.  If any are reserved, they can't be
removed and so this needs to fail.  But if they are !ASSIGNED, that
essentially means they're *already* removed (or never added), so
"removing" them is, correctly, a no-op.

remove_by_count(), in contrast, is being asked to remove a fixed
number of LMBs from wherever they can be found, and for that it needs
to find LMBs that haven't already been removed.

Basically remove_by_ic() is an absolute request: "make this set of
LMBs be not-plugged", whereas remove_by_count() is a relative request
"make N less LMBs be plugged".


So I think remove_by_ic()s existing handling is correct.  I'm less
sure if remove_by_count() ignoring RESERVED is correct - I couldn't
quickly find under what circumstances RESERVED gets set.


RESERVED is never set by the kernel. It is written in the DT by the
firmware/hypervisor and the kernel just checks its value. QEMU sets it in
spapr_dt_dynamic_memory() with the following comment:


/*
 * LMB information for RMA, boot time RAM and gap b/n RAM and
 * device memory region -- all these are marked as reserved
 * and as having no valid DRC.
 */
dynamic_memory[0] = cpu_to_be32(addr >> 32);
dynamic_memory[1] = cpu_to_be32(addr & 0x);
dynamic_memory[2] = cpu_to_be32(0);
dynamic_memory[3] = cpu_to_be32(0); /* reserved */
dynamic_memory[4] = cpu_to_be32(-1);
dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_RESERVED |
SPAPR_LMB_FLAGS_DRC_INVALID);


The flag is formally described in LOPAR section 4.2.8, "Reserved Memory":

"Memory nodes marked with the special value of the “status” property of
“reserved” is not to be used or altered by the base OS."


This makes me confident that we should check DRCONF_MEM_RESERVED in
remove_by_count() as well, since phyp needs to adhere to these semantics and
shouldn't be able to remove a LMB marked as RESERVED.



Thanks,



Daniel








This patch addresses that by changing
lmb_is_removable() to also check for DRCONF_MEM_RESERVED to tell if a
lmb is removable, making dlpar_memory_remove_by_count() take the
reservation state into account when counting the LMBs.
lmb_is_removable() is then used in the validation step of
dlpar_memory_remove_by_ic(), which is already checking for both states
but in different stages, to avoid counting a LMB that is not assigned as
eligible for removal. We can then skip the check before
dlpar_remove_lmb() since we're validating all LMBs beforehand.

Signed-off-by: Daniel Henrique Barboza 
---
  arch/powerpc/platforms/pseries/hotplug-memory.c | 8 +++-
  1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index bb98574a84a2..4e6d162c3f1a 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -348,7 +348,8 @@ static int pseries_remove_mem_node(struct device_node *np)
  
  static bool lmb_is_removable(struct drmem_lmb *lmb)

  {
-   if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
+   if ((lmb->flags & DRCONF_MEM_RESERVED) ||
+   !(lmb->flags & DRCONF_MEM_ASSIGNED))
return false;
  
  #ifdef CONFIG_FA_DUMP

@@ -523,7 +524,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, 
u32 drc_index)
  
  	/* Validate that there are enough LMBs to satisfy the request */

for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
-   if (lmb->flags & DRCONF_MEM_RESERVED)
+   if (!lmb_is_removable(lmb))
break;

[PATCH v2 4/4] powerpc/pseries: minor enhancements in dlpar_memory_remove_by_ic()

2021-05-12 Thread Daniel Henrique Barboza
We don't need the 'lmbs_available' variable to count the valid LMBs and
to check if we have less than 'lmbs_to_remove'. We must ensure that the
entire LMB range must be removed, so we can error out immediately if any
LMB in the range is marked as reserved.

Add a couple of comments explaining the reasoning behind the differences
we have in this function in contrast to what it is done in its sister
function, dlpar_memory_remove_by_count().

Signed-off-by: Daniel Henrique Barboza 
---
 .../platforms/pseries/hotplug-memory.c| 28 +--
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 3c7ce5361ce3..ee88c1540fba 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -517,7 +517,6 @@ static int dlpar_memory_remove_by_index(u32 drc_index)
 static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index)
 {
struct drmem_lmb *lmb, *start_lmb, *end_lmb;
-   int lmbs_available = 0;
int rc;
 
pr_info("Attempting to hot-remove %u LMB(s) at %x\n",
@@ -530,18 +529,29 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, 
u32 drc_index)
if (rc)
return -EINVAL;
 
-   /* Validate that there are enough LMBs to satisfy the request */
+   /*
+* Validate that all LMBs in range are not reserved. Note that it
+* is ok if they are !ASSIGNED since our goal here is to remove the
+* LMB range, regardless of whether some LMBs were already removed
+* by any other reason.
+*
+* This is a contrast to what is done in remove_by_count() where we
+* check for both RESERVED and !ASSIGNED (via lmb_is_removable()),
+* because we want to remove a fixed amount of LMBs in that function.
+*/
for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
-   if (lmb->flags & DRCONF_MEM_RESERVED)
-   break;
-
-   lmbs_available++;
+   if (lmb->flags & DRCONF_MEM_RESERVED) {
+   pr_err("Memory at %llx (drc index %x) is reserved\n",
+   lmb->base_addr, lmb->drc_index);
+   return -EINVAL;
+   }
}
 
-   if (lmbs_available < lmbs_to_remove)
-   return -EINVAL;
-
for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
+   /*
+* dlpar_remove_lmb() will error out if the LMB is already
+* !ASSIGNED, but this case is a no-op for us.
+*/
if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
continue;
 
-- 
2.31.1



[PATCH v2 3/4] powerpc/pseries: break early in dlpar_memory_remove_by_count() loops

2021-05-12 Thread Daniel Henrique Barboza
After marking the LMBs as reserved depending on dlpar_remove_lmb() rc,
we evaluate whether we need to add the LMBs back or if we can release
the LMB DRCs. In both cases, a for_each_drmem_lmb() loop without a break
condition is used. This means that we're going to cycle through all LMBs
of the partition even after we're done with what we were going to do.

This patch adds break conditions in both loops to avoid this. The
'lmbs_removed' variable was renamed to 'lmbs_reserved', and it's now
being decremented each time a lmb reservation is removed, indicating
that the operation we're doing (adding back LMBs or releasing DRCs) is
completed.

Signed-off-by: Daniel Henrique Barboza 
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c21d9278c1ce..3c7ce5361ce3 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -402,7 +402,7 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)
 {
struct drmem_lmb *lmb;
-   int lmbs_removed = 0;
+   int lmbs_reserved = 0;
int lmbs_available = 0;
int rc;
 
@@ -436,12 +436,12 @@ static int dlpar_memory_remove_by_count(u32 
lmbs_to_remove)
 */
drmem_mark_lmb_reserved(lmb);
 
-   lmbs_removed++;
-   if (lmbs_removed == lmbs_to_remove)
+   lmbs_reserved++;
+   if (lmbs_reserved == lmbs_to_remove)
break;
}
 
-   if (lmbs_removed != lmbs_to_remove) {
+   if (lmbs_reserved != lmbs_to_remove) {
pr_err("Memory hot-remove failed, adding LMB's back\n");
 
for_each_drmem_lmb(lmb) {
@@ -454,6 +454,10 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)
   lmb->drc_index);
 
drmem_remove_lmb_reservation(lmb);
+
+   lmbs_reserved--;
+   if (lmbs_reserved == 0)
+   break;
}
 
rc = -EINVAL;
@@ -467,6 +471,10 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)
lmb->base_addr);
 
drmem_remove_lmb_reservation(lmb);
+
+   lmbs_reserved--;
+   if (lmbs_reserved == 0)
+   break;
}
rc = 0;
}
-- 
2.31.1



[PATCH v2 0/4] Unisolate LMBs DRC on removal error + cleanups

2021-05-12 Thread Daniel Henrique Barboza
changes from v1:
- patch 1: added David's r-b
- patch 2:
* removed the DRCONF_MEM_RESERVED assumption for
dlpar_memory_remove_by_ic()
* reworded the commit msg
- patch 3: dropped. the differences between dlpar_memory_remove_by_ic()
  and dlpar_memory_remove_by_count() makes a helper function too complex
  to handle both cases
- (new) patch 3 and 4: minor enhancements

v1 link: 
https://lore.kernel.org/linuxppc-dev/20210430120917.217951-1-danielhb...@gmail.com/

Daniel Henrique Barboza (4):
  powerpc/pseries: Set UNISOLATE on dlpar_memory_remove_by_ic() error
  powerpc/pseries: check DRCONF_MEM_RESERVED in lmb_is_removable()
  powerpc/pseries: break early in dlpar_memory_remove_by_count() loops
  powerpc/pseries: minor enhancements in dlpar_memory_remove_by_ic()

 .../platforms/pseries/hotplug-memory.c| 54 ++-
 1 file changed, 40 insertions(+), 14 deletions(-)

-- 
2.31.1



[PATCH v2 1/4] powerpc/pseries: Set UNISOLATE on dlpar_memory_remove_by_ic() error

2021-05-12 Thread Daniel Henrique Barboza
As previously done in dlpar_cpu_remove() for CPUs, this patch changes
dlpar_memory_remove_by_ic() to unisolate the LMB DRC when the LMB is
failed to be removed. The hypervisor, seeing a LMB DRC that was supposed
to be removed being unisolated instead, can do error recovery on its
side.

This change is done in dlpar_memory_remove_by_ic() only because, as of
today, only QEMU is using this code path for error recovery (via the
PSERIES_HP_ELOG_ID_DRC_IC event). phyp treats it as a no-op.

Reviewed-by: David Gibson 
Signed-off-by: Daniel Henrique Barboza 
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 8377f1f7c78e..bb98574a84a2 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -551,6 +551,13 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, 
u32 drc_index)
if (!drmem_lmb_reserved(lmb))
continue;
 
+   /*
+    * Setting the isolation state of an UNISOLATED/CONFIGURED
+    * device to UNISOLATE is a no-op, but the hypervisor can
+    * use it as a hint that the LMB removal failed.
+    */
+   dlpar_unisolate_drc(lmb->drc_index);
+
rc = dlpar_add_lmb(lmb);
if (rc)
pr_err("Failed to add LMB, drc index %x\n",
-- 
2.31.1



[PATCH v2 2/4] powerpc/pseries: check DRCONF_MEM_RESERVED in lmb_is_removable()

2021-05-12 Thread Daniel Henrique Barboza
DRCONF_MEM_RESERVED is a flag that represents the "Reserved Memory"
status in LOPAR v2.10, section 4.2.8. If a LMB is marked as reserved,
quoting LOPAR, "is not to be used or altered by the base OS". This flag
is read only in the kernel, being set by the firmware/hypervisor in the
DT. As an example, QEMU will set this flag in hw/ppc/spapr.c,
spapr_dt_dynamic_memory().

lmb_is_removable() does not check for DRCONF_MEM_RESERVED. This function
is used in dlpar_remove_lmb() as a guard before the removal logic. Since
it is failing to check for !RESERVED, dlpar_remove_lmb() will fail in a
later stage instead of failing in the validation when receiving a
reserved LMB as input.

lmb_is_removable() is also used in dlpar_memory_remove_by_count() to
evaluate if we have enough LMBs to complete the request. The missing
!RESERVED check in this case is causing dlpar_memory_remove_by_count()
to miscalculate the number of eligible LMBs for the removal, and can
make it error out later on instead of failing in the validation with the
'not enough LMBs to satisfy request' message.

Making a DRCONF_MEM_RESERVED check in lmb_is_removable() fixes all these
issues.

Signed-off-by: Daniel Henrique Barboza 
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index bb98574a84a2..c21d9278c1ce 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -348,7 +348,8 @@ static int pseries_remove_mem_node(struct device_node *np)
 
 static bool lmb_is_removable(struct drmem_lmb *lmb)
 {
-   if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
+   if ((lmb->flags & DRCONF_MEM_RESERVED) ||
+   !(lmb->flags & DRCONF_MEM_ASSIGNED))
return false;
 
 #ifdef CONFIG_FA_DUMP
-- 
2.31.1



Re: [PATCH 2/3] hotplug-memory.c: enhance dlpar_memory_remove* LMB checks

2021-05-07 Thread Daniel Henrique Barboza




On 5/3/21 10:02 PM, David Gibson wrote:

On Fri, Apr 30, 2021 at 09:09:16AM -0300, Daniel Henrique Barboza wrote:

dlpar_memory_remove_by_ic() validates the amount of LMBs to be removed
by checking !DRCONF_MEM_RESERVED, and in the following loop before
dlpar_remove_lmb() a check for DRCONF_MEM_ASSIGNED is made before
removing it. This means that a LMB that is both !DRCONF_MEM_RESERVED and
!DRCONF_MEM_ASSIGNED will be counted as valid, but then not being
removed.  The function will end up not removing all 'lmbs_to_remove'
LMBs while also not reporting any errors.

Comparing it to dlpar_memory_remove_by_count(), the validation is done
via lmb_is_removable(), which checks for DRCONF_MEM_ASSIGNED and fadump
constraints. No additional check is made afterwards, and
DRCONF_MEM_RESERVED is never checked before dlpar_remove_lmb(). The
function doesn't have the same 'check A for validation, then B for
removal' issue as remove_by_ic(), but it's not checking if the LMB is
reserved.

There is no reason for these functions to validate the same operation in
two different manners.


Actually, I think there is: remove_by_ic() is handling a request to
remove a specific range of LMBs.  If any are reserved, they can't be
removed and so this needs to fail.  But if they are !ASSIGNED, that
essentially means they're *already* removed (or never added), so
"removing" them is, correctly, a no-op.


I guess that makes sense. Although I am not aware of any situation, at least
thinking about how QEMU adds/removes LMBs, where some LMBs would be removed
'ad-hoc' in the middle of a LMB range that maps to a QEMU DIMM, I can't say
that this would never happen either. It is sensible to make remove_by_ic()
resilient to this situation.

I'll re-send this patch just with the remove_by_count() change.


Thanks,


Daniel



remove_by_count(), in contrast, is being asked to remove a fixed
number of LMBs from wherever they can be found, and for that it needs
to find LMBs that haven't already been removed.

Basically remove_by_ic() is an absolute request: "make this set of
LMBs be not-plugged", whereas remove_by_count() is a relative request
"make N less LMBs be plugged".


So I think remove_by_ic()s existing handling is correct.  I'm less
sure if remove_by_count() ignoring RESERVED is correct - I couldn't
quickly find under what circumstances RESERVED gets set.



This patch addresses that by changing
lmb_is_removable() to also check for DRCONF_MEM_RESERVED to tell if a
lmb is removable, making dlpar_memory_remove_by_count() take the
reservation state into account when counting the LMBs.
lmb_is_removable() is then used in the validation step of
dlpar_memory_remove_by_ic(), which is already checking for both states
but in different stages, to avoid counting a LMB that is not assigned as
eligible for removal. We can then skip the check before
dlpar_remove_lmb() since we're validating all LMBs beforehand.

Signed-off-by: Daniel Henrique Barboza 
---
  arch/powerpc/platforms/pseries/hotplug-memory.c | 8 +++-
  1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index bb98574a84a2..4e6d162c3f1a 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -348,7 +348,8 @@ static int pseries_remove_mem_node(struct device_node *np)
  
  static bool lmb_is_removable(struct drmem_lmb *lmb)

  {
-   if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
+   if ((lmb->flags & DRCONF_MEM_RESERVED) ||
+   !(lmb->flags & DRCONF_MEM_ASSIGNED))
return false;
  
  #ifdef CONFIG_FA_DUMP

@@ -523,7 +524,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, 
u32 drc_index)
  
  	/* Validate that there are enough LMBs to satisfy the request */

for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
-   if (lmb->flags & DRCONF_MEM_RESERVED)
+   if (!lmb_is_removable(lmb))
break;
  
  		lmbs_available++;

@@ -533,9 +534,6 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, 
u32 drc_index)
return -EINVAL;
  
  	for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {

-   if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
-   continue;
-
rc = dlpar_remove_lmb(lmb);
if (rc)
break;




[PATCH 3/3] pseries/hotplug-memory.c: adding dlpar_memory_remove_lmbs_common()

2021-04-30 Thread Daniel Henrique Barboza
One difference between dlpar_memory_remove_by_count() and
dlpar_memory_remove_by_ic() is that the latter, added in commit
753843471cbb, removes the LMBs in a contiguous block. This was done
because QEMU works with DIMMs, which is nothing more than a set of LMBs,
that must be added or removed together. Failing to remove one LMB must
fail the removal of the entire set of LMBs. Another difference is that
dlpar_memory_remove_by_ic() is going to set the LMB DRC to unisolate in
case of a removal error, which is a no-op for hypervisors that don't
care about this error handling knob, and could be called by
remove_by_count() without issues.

Aside from that, the logic is the same for both functions, and yet we
keep them separated and having to duplicate LMB removal logic in both.

This patch introduces a helper called dlpar_memory_remove_lmbs_common()
to be used by both functions. The helper handles the case of block
removal of remove_by_ic() by failing earlier in the validation and
removal steps if the helper was called by remove_by_ic() (i.e. it was
called with a drc_index) while preserving the more relaxed behavior of
remove_by_count() if drc_index is 0.

Signed-off-by: Daniel Henrique Barboza 
---
 .../platforms/pseries/hotplug-memory.c| 163 +++---
 1 file changed, 67 insertions(+), 96 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 4e6d162c3f1a..a031993725ca 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -399,25 +399,43 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
return 0;
 }
 
-static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)
+static int dlpar_memory_remove_lmbs_common(u32 lmbs_to_remove, u32 drc_index)
 {
-   struct drmem_lmb *lmb;
-   int lmbs_removed = 0;
+   struct drmem_lmb *lmb, *start_lmb, *end_lmb;
int lmbs_available = 0;
+   int lmbs_removed = 0;
int rc;
-
-   pr_info("Attempting to hot-remove %d LMB(s)\n", lmbs_to_remove);
+   /*
+* dlpar_memory_remove_by_ic() wants to remove all
+* 'lmbs_to_remove' LMBs, starting from drc_index, in a
+* contiguous block.
+*/
+   bool block_removal;
 
if (lmbs_to_remove == 0)
return -EINVAL;
 
+   block_removal = drc_index != 0;
+
+   if (block_removal) {
+   rc = get_lmb_range(drc_index, lmbs_to_remove, &start_lmb, &end_lmb);
+   if (rc)
+   return -EINVAL;
+   } else {
+   start_lmb = &drmem_info->lmbs[0];
+   end_lmb = &drmem_info->lmbs[drmem_info->n_lmbs];
+   }
+
/* Validate that there are enough LMBs to satisfy the request */
-   for_each_drmem_lmb(lmb) {
-   if (lmb_is_removable(lmb))
+   for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
+   if (lmb_is_removable(lmb)) {
lmbs_available++;
 
-   if (lmbs_available == lmbs_to_remove)
+   if (lmbs_available == lmbs_to_remove)
+   break;
+   } else if (block_removal) {
break;
+   }
}
 
if (lmbs_available < lmbs_to_remove) {
@@ -426,28 +444,40 @@ static int dlpar_memory_remove_by_count(u32 
lmbs_to_remove)
return -EINVAL;
}
 
-   for_each_drmem_lmb(lmb) {
+   for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
rc = dlpar_remove_lmb(lmb);
-   if (rc)
-   continue;
 
-   /* Mark this lmb so we can add it later if all of the
-* requested LMBs cannot be removed.
-*/
-   drmem_mark_lmb_reserved(lmb);
+   if (!rc) {
+   /* Mark this lmb so we can add it later if all of the
+* requested LMBs cannot be removed.
+*/
+   drmem_mark_lmb_reserved(lmb);
 
-   lmbs_removed++;
-   if (lmbs_removed == lmbs_to_remove)
+   lmbs_removed++;
+   if (lmbs_removed == lmbs_to_remove)
+   break;
+   } else if (block_removal) {
break;
+   }
}
 
if (lmbs_removed != lmbs_to_remove) {
-   pr_err("Memory hot-remove failed, adding LMB's back\n");
+   if (block_removal)
+   pr_err("Memory indexed-count-remove failed, adding any removed LMBs\n");
+   else
+   pr_err("Memory hot-remove failed, adding LMB's back\n");
 
-   for_each_drmem_lmb(lmb) {
+   for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
if (!drmem_lmb_res

[PATCH 1/3] powerpc/pseries: Set UNISOLATE on dlpar_memory_remove_by_ic() error

2021-04-30 Thread Daniel Henrique Barboza
As previously done in dlpar_cpu_remove() for CPUs, this patch changes
dlpar_memory_remove_by_ic() to unisolate the LMB DRC when the LMB is
failed to be removed. The hypervisor, seeing a LMB DRC that was supposed
to be removed being unisolated instead, can do error recovery on its
side.

This change is done in dlpar_memory_remove_by_ic() only because, as of
today, only QEMU is using this code path for error recovery (via the
PSERIES_HP_ELOG_ID_DRC_IC event). phyp treats it as a no-op.

Signed-off-by: Daniel Henrique Barboza 
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 8377f1f7c78e..bb98574a84a2 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -551,6 +551,13 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, 
u32 drc_index)
if (!drmem_lmb_reserved(lmb))
continue;
 
+   /*
+    * Setting the isolation state of an UNISOLATED/CONFIGURED
+    * device to UNISOLATE is a no-op, but the hypervisor can
+    * use it as a hint that the LMB removal failed.
+    */
+   dlpar_unisolate_drc(lmb->drc_index);
+
rc = dlpar_add_lmb(lmb);
if (rc)
pr_err("Failed to add LMB, drc index %x\n",
-- 
2.30.2



[PATCH 2/3] hotplug-memory.c: enhance dlpar_memory_remove* LMB checks

2021-04-30 Thread Daniel Henrique Barboza
dlpar_memory_remove_by_ic() validates the amount of LMBs to be removed
by checking !DRCONF_MEM_RESERVED, and in the following loop before
dlpar_remove_lmb() a check for DRCONF_MEM_ASSIGNED is made before
removing it. This means that a LMB that is both !DRCONF_MEM_RESERVED and
!DRCONF_MEM_ASSIGNED will be counted as valid, but then not being
removed.  The function will end up not removing all 'lmbs_to_remove'
LMBs while also not reporting any errors.

Comparing it to dlpar_memory_remove_by_count(), the validation is done
via lmb_is_removable(), which checks for DRCONF_MEM_ASSIGNED and fadump
constraints. No additional check is made afterwards, and
DRCONF_MEM_RESERVED is never checked before dlpar_remove_lmb(). The
function doesn't have the same 'check A for validation, then B for
removal' issue as remove_by_ic(), but it's not checking if the LMB is
reserved.

There is no reason for these functions to validate the same operation in
two different manners. This patch addresses that by changing
lmb_is_removable() to also check for DRCONF_MEM_RESERVED to tell if a
lmb is removable, making dlpar_memory_remove_by_count() take the
reservation state into account when counting the LMBs.
lmb_is_removable() is then used in the validation step of
dlpar_memory_remove_by_ic(), which is already checking for both states
but in different stages, to avoid counting a LMB that is not assigned as
eligible for removal. We can then skip the check before
dlpar_remove_lmb() since we're validating all LMBs beforehand.

Signed-off-by: Daniel Henrique Barboza 
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index bb98574a84a2..4e6d162c3f1a 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -348,7 +348,8 @@ static int pseries_remove_mem_node(struct device_node *np)
 
 static bool lmb_is_removable(struct drmem_lmb *lmb)
 {
-   if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
+   if ((lmb->flags & DRCONF_MEM_RESERVED) ||
+   !(lmb->flags & DRCONF_MEM_ASSIGNED))
return false;
 
 #ifdef CONFIG_FA_DUMP
@@ -523,7 +524,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, 
u32 drc_index)
 
/* Validate that there are enough LMBs to satisfy the request */
for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
-   if (lmb->flags & DRCONF_MEM_RESERVED)
+   if (!lmb_is_removable(lmb))
break;
 
lmbs_available++;
@@ -533,9 +534,6 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, 
u32 drc_index)
return -EINVAL;
 
for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
-   if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
-   continue;
-
rc = dlpar_remove_lmb(lmb);
if (rc)
break;
-- 
2.30.2



[PATCH 0/3] Unisolate LMBs DRC on removal error + cleanups

2021-04-30 Thread Daniel Henrique Barboza
Hi,

This is a follow-up of the work done in dlpar_cpu_remove() to
report CPU removal errors by unisolating the DRC. This time I'm
doing it for LMBs. Patch 01 handles this.

Patches 2 and 3 are cleanups I consider worth posting.


Daniel Henrique Barboza (3):
  powerpc/pseries: Set UNISOLATE on dlpar_memory_remove_by_ic() error
  hotplug-memory.c: enhance dlpar_memory_remove* LMB checks
  pseries/hotplug-memory.c: adding dlpar_memory_remove_lmbs_common()

 .../platforms/pseries/hotplug-memory.c| 162 --
 1 file changed, 69 insertions(+), 93 deletions(-)

-- 
2.30.2



Re: [PATCH 2/2] hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure

2021-04-19 Thread Daniel Henrique Barboza




On 4/19/21 9:48 AM, Michael Ellerman wrote:

Daniel Henrique Barboza  writes:

The RTAS set-indicator call, when attempting to UNISOLATE a DRC that is
already UNISOLATED or CONFIGURED, returns RTAS_OK and does nothing else
for both QEMU and phyp. This gives us an opportunity to use this
behavior to signal the hypervisor layer when an error during device
removal happens, allowing it to do proper error handling, while not
breaking QEMU/phyp implementations that don't have this support.

This patch introduces this idea by unisolating all CPU DRCs that failed
to be removed by dlpar_cpu_remove_by_index(), when handling the
PSERIES_HP_ELOG_ID_DRC_INDEX event. This is being done for this event
only because it's the only CPU removal event QEMU uses, and there's no
need at this moment to add this mechanism for phyp only code.


Have you also confirmed that phyp is not bothered by it? ie. everything
seems to continue working when you trigger this path on phyp.


Yes. Daniel Bueso (dbue...@us.ibm.com) from the partition firmware team
helped me with that. We confirmed that phyp returns RTAS_OK under these
conditions (Unisolating an unisolated/configured DRC).


Thanks,


DHB



cheers


diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..ed66895c2f51 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -802,8 +802,15 @@ int dlpar_cpu(struct pseries_hp_errorlog *hp_elog)
case PSERIES_HP_ELOG_ACTION_REMOVE:
if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_COUNT)
rc = dlpar_cpu_remove_by_count(count);
-   else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX)
+   else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX) {
rc = dlpar_cpu_remove_by_index(drc_index);
+   /* Setting the isolation state of an UNISOLATED/CONFIGURED
+* device to UNISOLATE is a no-op, but the hypervisor can
+* use it as a hint that the cpu removal failed.
+*/
+   if (rc)
+   dlpar_unisolate_drc(drc_index);
+   }
else
rc = -EINVAL;
break;
--
2.30.2


[PATCH 2/2] hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure

2021-04-16 Thread Daniel Henrique Barboza
The RTAS set-indicator call, when attempting to UNISOLATE a DRC that is
already UNISOLATED or CONFIGURED, returns RTAS_OK and does nothing else
for both QEMU and phyp. This gives us an opportunity to use this
behavior to signal the hypervisor layer when an error during device
removal happens, allowing it to do proper error handling, while not
breaking QEMU/phyp implementations that don't have this support.

This patch introduces this idea by unisolating all CPU DRCs that failed
to be removed by dlpar_cpu_remove_by_index(), when handling the
PSERIES_HP_ELOG_ID_DRC_INDEX event. This is being done for this event
only because it's the only CPU removal event QEMU uses, and there's no
need at this moment to add this mechanism for phyp only code.

Signed-off-by: Daniel Henrique Barboza 
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..ed66895c2f51 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -802,8 +802,15 @@ int dlpar_cpu(struct pseries_hp_errorlog *hp_elog)
case PSERIES_HP_ELOG_ACTION_REMOVE:
if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_COUNT)
rc = dlpar_cpu_remove_by_count(count);
-   else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX)
+   else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX) {
rc = dlpar_cpu_remove_by_index(drc_index);
+   /* Setting the isolation state of an UNISOLATED/CONFIGURED
+* device to UNISOLATE is a no-op, but the hypervisor can
+* use it as a hint that the cpu removal failed.
+*/
+   if (rc)
+   dlpar_unisolate_drc(drc_index);
+   }
else
rc = -EINVAL;
break;
-- 
2.30.2



[PATCH 1/2] dlpar.c: introduce dlpar_unisolate_drc()

2021-04-16 Thread Daniel Henrique Barboza
Next patch will execute a set-indicator call in hotplug-cpu.c.

Create a dlpar_unisolate_drc() helper to avoid spreading more
rtas_set_indicator() calls outside of dlpar.c.

Signed-off-by: Daniel Henrique Barboza 
---
 arch/powerpc/platforms/pseries/dlpar.c   | 14 ++
 arch/powerpc/platforms/pseries/pseries.h |  1 +
 2 files changed, 15 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/dlpar.c 
b/arch/powerpc/platforms/pseries/dlpar.c
index 233503fcf8f0..3ac70790ec7a 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -329,6 +329,20 @@ int dlpar_release_drc(u32 drc_index)
return 0;
 }
 
+int dlpar_unisolate_drc(u32 drc_index)
+{
+   int dr_status, rc;
+
+   rc = rtas_call(rtas_token("get-sensor-state"), 2, 2, &dr_status,
+   DR_ENTITY_SENSE, drc_index);
+   if (rc || dr_status != DR_ENTITY_PRESENT)
+   return -1;
+
+   rtas_set_indicator(ISOLATION_STATE, drc_index, UNISOLATE);
+
+   return 0;
+}
+
 int handle_dlpar_errorlog(struct pseries_hp_errorlog *hp_elog)
 {
int rc;
diff --git a/arch/powerpc/platforms/pseries/pseries.h 
b/arch/powerpc/platforms/pseries/pseries.h
index 4fe48c04c6c2..4ea12037c920 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -55,6 +55,7 @@ extern int dlpar_attach_node(struct device_node *, struct 
device_node *);
 extern int dlpar_detach_node(struct device_node *);
 extern int dlpar_acquire_drc(u32 drc_index);
 extern int dlpar_release_drc(u32 drc_index);
+extern int dlpar_unisolate_drc(u32 drc_index);
 
 void queue_hotplug_event(struct pseries_hp_errorlog *hp_errlog);
 int handle_dlpar_errorlog(struct pseries_hp_errorlog *hp_errlog);
-- 
2.30.2



[PATCH 0/2] pseries: UNISOLATE DRCs to signal device removal error

2021-04-16 Thread Daniel Henrique Barboza
At this moment, PAPR [1] does not have a way to report errors during a device
removal operation. This puts a strain on the hypervisor, which needs extra
mechanisms to try to fall back and recover from an error that might have
happened during the removal. The QEMU community has dealt with it over the
years by either trying to preempt the error before sending the HP event or, in
case of a guest side failure, rebooting the guest to complete the removal process.

This started to change with QEMU commit fe1831eff8a4 ("spapr_drc.c: use DRC
reconfiguration to cleanup DIMM unplug state"), where a way to fallback from a
memory removal error was introduced. In this case, when QEMU detects that the
kernel is reconfiguring LMBs DRCs that were marked as pending removal, the
entire process is reverted from the QEMU side as well. Around the same time,
other discussions on the QEMU mailing list considered an alternative for other
devices as well.

In [2] the idea of using RTAS set-indicator for this role was first introduced.
The RTAS set-indicator call, when attempting to UNISOLATE a DRC that is already
UNISOLATED or CONFIGURED, returns RTAS_OK and does nothing else for both QEMU
and phyp. This gives us an opportunity to use this behavior to signal the
hypervisor layer when a device removal error happens, allowing it to do proper
error handling, knowing for sure that the removal failed in the kernel. Using
set-indicator to report HP errors isn't strange to PAPR, as per R1-13.5.3.4-4.
of table 13.7 of [1]:

"For all DR options: If this is a DR operation that involves the user insert-
ing a DR entity, then if the firmware can determine that the inserted entity
would cause a system disturbance, then the set-indicator RTAS call must not
unisolate the entity and must return an error status which is unique to the
particular error."

PAPR does not make any restrictions or considerations about setting an already
Unisolated/Configured DRC to 'unisolate', meaning we have a chance to use it
for this purpose - signal an OS side error when attempting to remove a DR
entity.  To validate the design, this is being implemented only for CPUs.

QEMU will use this mechanism to rollback the device removal (hotunplug) state,
allowing for a better error handling mechanism. An implementation of how QEMU
can do it is in [3]. When using a kernel with this series applied, together
with this QEMU build, this is what happens in a common CPU removal/hotunplug
error scenario (trying to remove the last online CPU):

( QEMU command line: qemu-system-ppc64 -machine pseries,accel=kvm,usb=off
-smp 1,maxcpus=2,threads=1,cores=2,sockets=1 ... )

[root@localhost ~]# QEMU 5.2.92 monitor - type 'help' for more information
(qemu) device_add host-spapr-cpu-core,core-id=1,id=core1
(qemu) 

[root@localhost ~]# echo 0 > /sys/devices/system/cpu/cpu0/online
[   77.548442][   T13] IRQ 19: no longer affine to CPU0
[   77.548452][   T13] IRQ 20: no longer affine to CPU0
[   77.548458][   T13] IRQ 256: no longer affine to CPU0
[   77.548465][   T13] IRQ 258: no longer affine to CPU0
[   77.548472][   T13] IRQ 259: no longer affine to CPU0
[   77.548479][   T13] IRQ 260: no longer affine to CPU0
[   77.548485][   T13] IRQ 261: no longer affine to CPU0
[   77.548590][T0] cpu 0 (hwid 0) Ready to die...
[root@localhost ~]# (qemu) 
(qemu) device_del core1
(qemu) [   83.214073][  T100] pseries-hotplug-cpu: Failed to offline CPU PowerPC,POWER9, rc: -16
qemu-system-ppc64: Device hotunplug rejected by the guest for device core1

(qemu) 

As soon as the CPU removal fails in dlpar_cpu(), QEMU becomes aware of
it and is able to do error recovery.

If this solution is well received, I'll push for an architecture change
request internally at IBM to make this mechanism PAPR official.


[1] 
https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
[2] https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg06395.html
[3] https://github.com/danielhb/qemu/tree/unisolate_drc_callback_v1

Daniel Henrique Barboza (2):
  dlpar.c: introduce dlpar_unisolate_drc()
  hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure

 arch/powerpc/platforms/pseries/dlpar.c   | 14 ++
 arch/powerpc/platforms/pseries/hotplug-cpu.c |  9 -
 arch/powerpc/platforms/pseries/pseries.h |  1 +
 3 files changed, 23 insertions(+), 1 deletion(-)

-- 
2.30.2



Re: [PATCH 0/3] Reintroduce cpu_core_mask

2021-04-15 Thread Daniel Henrique Barboza

Hi,


Using a QEMU pseries guest with the following SMP topology, with a
single NUMA node:


(...) -smp 32,threads=4,cores=4,sockets=2, (...)

This is the output of lscpu with a guest running v5.12-rc5:

[root@localhost ~]# lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  32
On-line CPU(s) list: 0-31
Thread(s) per core:  4
Core(s) per socket:  8
Socket(s):   1
NUMA node(s):1
Model:   2.2 (pvr 004e 1202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
NUMA node0 CPU(s):   0-31
[root@localhost ~]#


The changes with cpu_core_mask made topology sockets match NUMA nodes.
In this case, given that we have a single NUMA node, the SMP topology was
adjusted to have 8 cores instead of 4 so we can have a single socket as well.

Although sockets equal to NUMA nodes is true for Power hardware, QEMU doesn't
have this constraint and users expect sockets and NUMA nodes to be kind of
independent, regardless of how impractical that would be with real hardware.


The same guest running a kernel with this series applied:


[root@localhost ~]# lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  32
On-line CPU(s) list: 0-31
Thread(s) per core:  4
Core(s) per socket:  4
Socket(s):   2
NUMA node(s):1
Model:   2.2 (pvr 004e 1202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
NUMA node0 CPU(s):   0-31


The sockets and NUMA nodes are being represented separately, as intended via
the QEMU command line.


Thanks for looking into this, Srikar. For all patches:


Tested-by: Daniel Henrique Barboza 



On 4/15/21 9:09 AM, Srikar Dronamraju wrote:

Daniel had reported that
  QEMU is now unable to see requested topologies in a multi socket single
  NUMA node configuration.
  -smp 8,maxcpus=8,cores=2,threads=2,sockets=2

This patchset reintroduces cpu_core_mask so that users can see requested
topologies while still maintaining the boot time of very large system
configurations.

It includes caching the chip_id as suggested by Michael Ellerman.

4 Threads/Core; 4 cores/Socket; 4 Sockets/Node, 2 Nodes in System
   -numa node,nodeid=0,memdev=m0 \
   -numa node,nodeid=1,memdev=m1 \
   -smp 128,sockets=8,threads=4,maxcpus=128  \

5.12.0-rc5 (or any kernel with commit 4ca234a9cbd7)
---
srikar@cloudy:~$ lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):   2 <<<<<-
NUMA node(s):2
Model:   2.3 (pvr 004e 1203)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:   1 MiB
L1i cache:   1 MiB
NUMA node0 CPU(s):   0-15,32-47,64-79,96-111
NUMA node1 CPU(s):   16-31,48-63,80-95,112-127
--
srikar@cloudy:~$ dmesg |grep smp
[0.010658] smp: Bringing up secondary CPUs ...
[0.424681] smp: Brought up 2 nodes, 128 CPUs
--

5.12.0-rc5 + 3 patches
--
srikar@cloudy:~$ lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  4
Socket(s):   8<<<<-
NUMA node(s):2
Model:   2.3 (pvr 004e 1203)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:   1 MiB
L1i cache:   1 MiB
NUMA node0 CPU(s):   0-15,32-47,64-79,96-111
NUMA node1 CPU(s):   16-31,48-63,80-95,112-127
--
srikar@cloudy:~$ dmesg |grep smp
[0.010372] smp: Bringing up secondary CPUs ...
[0.417892] smp: Brought up 2 nodes, 128 CPUs

5.12.0-rc5
--
srikar@cloudy:~$  lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  1024
On-line CPU(s) list: 0-1023
Thread(s) per core:  8
Core(s) per socket:  128
Socket(s):   1
NUMA node(s):1
Model:   2.3 (pvr 004e 1203)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type:

[PATCH v3 1/1] hotplug-cpu.c: show 'last online CPU' error in dlpar_cpu_offline()

2021-03-26 Thread Daniel Henrique Barboza
One of the reasons that dlpar_cpu_offline can fail is when attempting to
offline the last online CPU of the kernel. This can be observed in a
pseries QEMU guest that has hotplugged CPUs. If the user offlines all
other CPUs of the guest, and a hotplugged CPU is now the last online
CPU, trying to reclaim it will fail. See [1] for an example.

The current error message in this situation returns rc with -EBUSY and a
generic explanation, e.g.:

pseries-hotplug-cpu: Failed to offline CPU PowerPC,POWER9, rc: -16

EBUSY can be caused by other conditions, such as cpu_hotplug_disable
being true. Throwing a more specific error message for this case,
instead of just "Failed to offline CPU", makes it clearer that the error
is in fact a known error situation instead of some other generic/unknown
cause.

This patch adds a 'last online' check in dlpar_cpu_offline() to catch
the 'last online CPU' offline error, returning a more informative error
message:

pseries-hotplug-cpu: Unable to remove last online CPU PowerPC,POWER9

[1] https://bugzilla.redhat.com/1911414

Signed-off-by: Daniel Henrique Barboza 
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..4b9df4d645b4 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -271,6 +271,19 @@ static int dlpar_offline_cpu(struct device_node *dn)
if (!cpu_online(cpu))
break;
 
+   /* device_offline() will return -EBUSY (via cpu_down())
+* if there is only one CPU left. Check it here to fail
+* earlier and with a more informative error message,
+* while also retaining the cpu_add_remove_lock to be sure
+* that no CPUs are being online/offlined during this
+* check.
+*/
+   if (num_online_cpus() == 1) {
+   pr_warn("Unable to remove last online CPU %pOFn\n", dn);
+   rc = -EBUSY;
+   goto out_unlock;
+   }
+
cpu_maps_update_done();
rc = device_offline(get_cpu_device(cpu));
if (rc)
@@ -283,6 +296,7 @@ static int dlpar_offline_cpu(struct device_node *dn)
thread);
}
}
+out_unlock:
cpu_maps_update_done();
 
 out:
-- 
2.30.2



[PATCH v3 0/1] show 'last online CPU' error in dlpar_cpu_offline()

2021-03-26 Thread Daniel Henrique Barboza
changes in v3 after Daniel Axtens review:
- fixed typo in commit mgs
- fixed block comment format to make checkpatch happy
v2 link:
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210323205056.52768-2-danielhb...@gmail.com/

changes in v2 after Michael Ellerman review:
- moved the verification code from dlpar_cpu_remove() to
  dlpar_cpu_offline(), while holding cpu_add_remove_lock
- reworded the commit message and code comment
v1 link: 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210305173845.451158-1-danielhb...@gmail.com/


Daniel Henrique Barboza (1):
  hotplug-cpu.c: show 'last online CPU' error in dlpar_cpu_offline()

 arch/powerpc/platforms/pseries/hotplug-cpu.c | 14 ++
 1 file changed, 14 insertions(+)

-- 
2.30.2



Re: [PATCH v2 1/1] hotplug-cpu.c: show 'last online CPU' error in dlpar_cpu_offline()

2021-03-26 Thread Daniel Henrique Barboza

Hey Daniel,

On 3/26/21 2:24 AM, Daniel Axtens wrote:

Hi Daniel,

Two small nitpicks:


This patch adds a 'last online' check in dlpar_cpu_offline() to catch
the 'last online CPU' offline error, eturning a more informative error

^--- s/eturning/returning/;



+   /* device_offline() will return -EBUSY (via cpu_down())
+* if there is only one CPU left. Check it here to fail
+* earlier and with a more informative error message,
+* while also retaining the cpu_add_remove_lock to be sure
+* that no CPUs are being online/offlined during this
+* check. */


Checkpatch has a small issue with this comment:

WARNING: Block comments use a trailing */ on a separate line
#50: FILE: arch/powerpc/platforms/pseries/hotplug-cpu.c:279:
+* check. */

Apart from that, this patch seems sane to me, but I haven't been able to
test it.



Thanks for the review, and for letting me know of the existence of
'scripts/checkpatch.pl' to verify the patches before posting. I'll
send a v3.


Thanks,


DHB




Kind regards,
Daniel



[PATCH v2 1/1] hotplug-cpu.c: show 'last online CPU' error in dlpar_cpu_offline()

2021-03-23 Thread Daniel Henrique Barboza
One of the reasons that dlpar_cpu_offline can fail is when attempting to
offline the last online CPU of the kernel. This can be observed in a
pseries QEMU guest that has hotplugged CPUs. If the user offlines all
other CPUs of the guest, and a hotplugged CPU is now the last online
CPU, trying to reclaim it will fail. See [1] for an example.

The current error message in this situation returns rc with -EBUSY and a
generic explanation, e.g.:

pseries-hotplug-cpu: Failed to offline CPU PowerPC,POWER9, rc: -16

EBUSY can be caused by other conditions, such as cpu_hotplug_disable
being true. Throwing a more specific error message for this case,
instead of just "Failed to offline CPU", makes it clearer that the error
is in fact a known error situation instead of other generic/unknown
cause.

This patch adds a 'last online' check in dlpar_cpu_offline() to catch
the 'last online CPU' offline error, eturning a more informative error
message:

pseries-hotplug-cpu: Unable to remove last online CPU PowerPC,POWER9

[1] https://bugzilla.redhat.com/1911414

Signed-off-by: Daniel Henrique Barboza 
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..3ac7e904385c 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -271,6 +271,18 @@ static int dlpar_offline_cpu(struct device_node *dn)
if (!cpu_online(cpu))
break;
 
+   /* device_offline() will return -EBUSY (via cpu_down())
+* if there is only one CPU left. Check it here to fail
+* earlier and with a more informative error message,
+* while also retaining the cpu_add_remove_lock to be sure
+* that no CPUs are being online/offlined during this
+* check. */
+   if (num_online_cpus() == 1) {
+   pr_warn("Unable to remove last online CPU %pOFn\n", dn);
+   rc = -EBUSY;
+   goto out_unlock;
+   }
+
cpu_maps_update_done();
rc = device_offline(get_cpu_device(cpu));
if (rc)
@@ -283,6 +295,7 @@ static int dlpar_offline_cpu(struct device_node *dn)
thread);
}
}
+out_unlock:
cpu_maps_update_done();
 
 out:
-- 
2.30.2



[PATCH v2 0/1] show 'last online CPU' error in dlpar_cpu_offline()

2021-03-23 Thread Daniel Henrique Barboza
changes in v2 after Michael Ellerman review:
- moved the verification code from dlpar_cpu_remove() to
  dlpar_cpu_offline(), while holding cpu_add_remove_lock
- reworded the commit message and code comment
v1 link: 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210305173845.451158-1-danielhb...@gmail.com/

Daniel Henrique Barboza (1):
  hotplug-cpu.c: show 'last online CPU' error in dlpar_cpu_offline()

 arch/powerpc/platforms/pseries/hotplug-cpu.c | 13 +
 1 file changed, 13 insertions(+)

-- 
2.30.2



Re: [PATCH 1/1] hotplug-cpu.c: show 'last online CPU' error in dlpar_cpu_remove()

2021-03-22 Thread Daniel Henrique Barboza




On 3/19/21 8:26 AM, Michael Ellerman wrote:

Daniel Henrique Barboza  writes:

Ping

On 3/5/21 2:38 PM, Daniel Henrique Barboza wrote:

Of all the reasons that dlpar_cpu_remove() can fail, the 'last online
CPU' is one that can be caused directly by the user offlining CPUs
in a partition/virtual machine that has hotplugged CPUs. Trying to
reclaim a hotplugged CPU can fail if the CPU is now the last online in
the system. This is easily reproduced using QEMU [1].


Sorry, I saw this earlier and never got around to replying.


No problem. Thanks for the review!



I'm wondering if we need to catch it earlier, ie. in
dlpar_offline_cpu().

By the time we return to dlpar_cpu_remove() we've dropped the
cpu_add_remove_lock (cpu_maps_update_done), so num_online_cpus() could
change out from under us, meaning the num_online_cpus() == 1 check might
trigger incorrectly (or vice versa).

Something like the patch below (completely untested :D)


Makes sense. I'll try it out to see if it works.



And writing that patch makes me wonder, is == 1 the right check?

In most cases we'll remove all but one thread of the core, but we'll
fail on the last thread. Leaving that core effectively stuck in SMT1. Is
that useful behaviour? Should we instead check at the start that we can
remove all threads of the core without going to zero online CPUs?


I think it's ok to allow SMT1 cores, speaking from QEMU perspective.
QEMU does not have a "core hotunplug" operation where the whole core is
hotunplugged at once. The CPU hotplug/unplug operations are done as single
CPU thread add/removal. If the user wants to run 4 cores, all of them with
SMT1, QEMU will allow it.

Libvirt does not operate with the core granularity either - you can specify
the amount of vcpus the guest should run with, and Libvirt will send
hotplug/unplug requests to QEMU to match the desired value. It doesn't
bother with how many threads of a core were offlined or not.


Thanks,


DHB





cheers


diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..498c22331ac8 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -271,6 +271,12 @@ static int dlpar_offline_cpu(struct device_node *dn)
if (!cpu_online(cpu))
break;
  
+			if (num_online_cpus() == 1) {

+   pr_warn("Unable to remove last online CPU %pOFn\n", dn);
+   rc = -EBUSY;
+   goto out_unlock;
+   }
+
cpu_maps_update_done();
rc = device_offline(get_cpu_device(cpu));
if (rc)
@@ -283,6 +289,7 @@ static int dlpar_offline_cpu(struct device_node *dn)
thread);
}
}
+out_unlock:
cpu_maps_update_done();
  
  out:




Re: Advice needed on SMP regression after cpu_core_mask change

2021-03-18 Thread Daniel Henrique Barboza




On 3/18/21 10:42 AM, Srikar Dronamraju wrote:

* Daniel Henrique Barboza  [2021-03-17 10:00:34]:


Hello,

Patch 4bce545903fa ("powerpc/topology: Update topology_core_cpumask") introduced
a regression in both upstream and RHEL downstream kernels [1]. The assumption
made in the commit:

"Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be
equal on Power"

Doesn't seem to be true. After this commit, QEMU is now unable to set single
NUMA node SMP topologies such as:

-smp 8,maxcpus=8,cores=2,threads=2,sockets=2


What does it mean for a NUMA to have more than one sockets?
If they are all part of the same node, there are at local distance to each
other. cache is per core. So what resources are shared by the Sockets that
are part of the same NUMA. And how does Userspace/ application make use of
the same.


Honestly, I sympathize with the idea that multiple sockets in the same NUMA
node being "weird". QEMU is accepting this kind of topology since forever
because we didn't pay attention to these other details.

I don't see any problems adding more constraints that makes sense in the
virtual layer, as long as the constraints make sense and are documented.
Putting multiple sockets in a single NUMA node seems like a fair restriction.




Please don't mistake this as an attempt to downplay your report but an honest
attempt to better understand the situation.


It's cool. Ask away.



For example, if the socket denotes the hemisphere logic in P10, then can we
see if the coregroup feature can be used. "Coregroup" is supposed to mean a
set of cores within a NUMA node that have some characteristics, and there can be
multiple coregroups within a NUMA node. We added that mostly to mimic hemispheres in
P10. However, the number of coregroups in a NUMA node is not exported to userspace
at this time.


I see. I thought that the presence of the hemispheres inside the chip would
justify more than one NUMA node inside the chip, meaning that a chip/socket
would have more than one NUMA node inside of it.

If that's not the case then I guess socket == NUMA node is still valid in
P10 as well. The last 'lscpu' example I gave here, claiming that this would
be a Power10 scenario, doesn't represent P10 after all.



However if each Socket is associated with a memory and node distance, then
should they be NUMA?

Can you provide me with the unique ibm,chip-ids in your 2 NUMA, 4 node case?
Does this cause an performance issues with the guest/application?


I can fetch some values, but we're trying to move away from it since it's not
in the pseries spec (PAPR). Perhaps with these restrictions we can live without
ibm,chip-id in QEMU.



Till your report, I was under the impression that NUMAs == Sockets.


After reading and discussing about it, I think the sensible thing to do is to
put this same constraint in QEMU.

In theory it would be nice to let the virtual machine have whatever topology it
wants, multiple sockets in the same NUMA domain and so on, but in the end we're
emulating Power hardware. If Power hardware - and the powerpc kernel - operates
under these assumptions, then I don't see much point into allowing users to
set unrealistic virtual CPU topologies that will be misrepresented in the
kernel.


I'll try this restriction in QEMU and see how upstream kernel behaves, with
and without ibm,chip-id being advertised in the DT.


Thanks,


DHB





lscpu will give the following output in this case:

# lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):   1
NUMA node(s):1
Model:   2.2 (pvr 004e 1202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
NUMA node0 CPU(s):   0-7


This is happening because the macro cpu_cpu_mask(cpu) expands to
cpumask_of_node(cpu_to_node(cpu)), which in turn expands to
node_to_cpumask_map[node]. node_to_cpumask_map is a NUMA array that maps CPUs
to NUMA nodes (Aneesh is on CC to correct me if I'm wrong). We're now
associating sockets to NUMA nodes directly.

If I add a second NUMA node then I can get the intended smp topology:

-smp 8,maxcpus=8,cores=2,threads=2,sockets=2
-numa node,memdev=mem0,cpus=0-3,nodeid=0 \
-numa node,memdev=mem1,cpus=4-7,nodeid=1 \

# lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):   2
NUMA node(s):2
Model:   2.2 (pvr 004e 1202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
NUMA node0 CPU(s):   0-3
NUMA node1 CPU(s):   4-7


However, if I try a single socket with multiple NUMA nodes topology, which is 
the

Re: [PATCH 1/1] hotplug-cpu.c: show 'last online CPU' error in dlpar_cpu_remove()

2021-03-18 Thread Daniel Henrique Barboza

Ping

On 3/5/21 2:38 PM, Daniel Henrique Barboza wrote:

Of all the reasons that dlpar_cpu_remove() can fail, the 'last online
CPU' is one that can be caused directly by the user offlining CPUs
in a partition/virtual machine that has hotplugged CPUs. Trying to
reclaim a hotplugged CPU can fail if the CPU is now the last online in
the system. This is easily reproduced using QEMU [1].

Throwing a more specific error message for this case, instead of just
"Failed to offline CPU", makes it clearer that the error is in fact a
known error situation instead of other generic/unknown cause.

[1] https://bugzilla.redhat.com/1911414

Signed-off-by: Daniel Henrique Barboza 
---
  arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 +++-
  1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..134f393f09e1 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -514,7 +514,17 @@ static ssize_t dlpar_cpu_remove(struct device_node *dn, u32 drc_index)
  
  	rc = dlpar_offline_cpu(dn);

if (rc) {
-   pr_warn("Failed to offline CPU %pOFn, rc: %d\n", dn, rc);
+   /* dlpar_offline_cpu will return -EBUSY from cpu_down() (via
+* device_offline()) in 2 cases: cpu_hotplug_disable is true or
+* there is only one CPU left. Warn the user about the second
+* since this can happen with user offlining CPUs and then
+* attempting hotunplugs.
+*/
+   if (rc == -EBUSY && num_online_cpus() == 1)
+   pr_warn("Unable to remove last online CPU %pOFn\n", dn);
+   else
+   pr_warn("Failed to offline CPU %pOFn, rc: %d\n", dn, rc);
+
return -EINVAL;
}
  



Re: [PATCH] powerpc/numa: Fix topology_physical_package_id() on pSeries

2021-03-18 Thread Daniel Henrique Barboza




On 3/18/21 4:28 AM, Cédric Le Goater wrote:

Also we've been using it for several years and I don't think we should
risk breaking anything by changing the value now.


I guess we can leave it that way. Please read the commit log of
the second patch (not tagged as a v2 ...).

But we should remove ibm,chip-id from QEMU since the property does
not exist on PAPR and that the calculation is anyhow very broken.



I am a strong advocate of getting rid of ibm,chip-id in QEMU. That said,
we need to make sure that the current problem with CPU topologies, that
I reported in that other thread, can be fixed without it.


Thanks,


DHB






Thanks,

C.



Re: Advice needed on SMP regression after cpu_core_mask change

2021-03-17 Thread Daniel Henrique Barboza




On 3/17/21 12:30 PM, Cédric Le Goater wrote:

On 3/17/21 2:00 PM, Daniel Henrique Barboza wrote:

Hello,

Patch 4bce545903fa ("powerpc/topology: Update topology_core_cpumask") introduced
a regression in both upstream and RHEL downstream kernels [1]. The assumption
made in the commit:

"Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be
equal on Power"

Doesn't seem to be true. After this commit, QEMU is now unable to set single
NUMA node SMP topologies such as:

-smp 8,maxcpus=8,cores=2,threads=2,sockets=2

lscpu will give the following output in this case:

# lscpu
Architecture:    ppc64le
Byte Order:  Little Endian
CPU(s):  8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):   1
NUMA node(s):    1
Model:   2.2 (pvr 004e 1202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
NUMA node0 CPU(s):   0-7


This is happening because the macro cpu_cpu_mask(cpu) expands to
cpumask_of_node(cpu_to_node(cpu)), which in turn expands to
node_to_cpumask_map[node]. node_to_cpumask_map is a per-node array that maps
each NUMA node to the cpumask of its CPUs (Aneesh is on CC to correct me if
I'm wrong). We're now associating sockets to NUMA nodes directly.

If I add a second NUMA node then I can get the intended smp topology:

-smp 8,maxcpus=8,cores=2,threads=2,sockets=2
-numa node,memdev=mem0,cpus=0-3,nodeid=0 \
-numa node,memdev=mem1,cpus=4-7,nodeid=1 \

# lscpu
Architecture:    ppc64le
Byte Order:  Little Endian
CPU(s):  8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):   2
NUMA node(s):    2
Model:   2.2 (pvr 004e 1202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
NUMA node0 CPU(s):   0-3
NUMA node1 CPU(s):   4-7


However, if I try a single socket with multiple NUMA nodes topology, which is
the case of Power10, e.g.:


-smp 8,maxcpus=8,cores=4,threads=2,sockets=1
-numa node,memdev=mem0,cpus=0-3,nodeid=0 \
-numa node,memdev=mem1,cpus=4-7,nodeid=1 \


This is the result:

# lscpu
Architecture:    ppc64le
Byte Order:  Little Endian
CPU(s):  8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):   2
NUMA node(s):    2
Model:   2.2 (pvr 004e 1202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
NUMA node0 CPU(s):   0-3
NUMA node1 CPU(s):   4-7


This confirms my suspicions that, at this moment, we're making sockets == NUMA 
nodes.


Yes. I don't think we can do better on PAPR and the above examples
seem to confirm that the "sockets" definition is simply ignored.
  

Cedric, the reason I'm CCing you is because this is related to ibm,chip-id. The
commit after the one that caused the regression, 4ca234a9cbd7c3a65 ("powerpc/smp:
Stop updating cpu_core_mask"), is erasing the code that calculated
cpu_core_mask. cpu_core_mask, despite its shortcomings that caused its removal,
was giving a precise SMP topology. And it was using
physical_package_id/'ibm,chip-id' for that.


ibm,chip-id is a no-no on pSeries. I guess this is inherent to PAPR which
is hiding a lot of the underlying HW and topology. May be we are trying
to reconcile two orthogonal views of machine virtualization ...


Checking in QEMU I can say that the ibm,chip-id calculation is the only place
in the code that cares about cores per socket information. The kernel is now
ignoring that, starting with 4bce545903fa, and QEMU is now unable to provide
this info to the guest.

If we're not going to use ibm,chip-id any longer, which seems sensible given
that PAPR does not declare it, we need another way of letting the guest know
how many cores per socket we want.

The RTAS call "ibm,get-system-parameter" with token "Processor Module
Information" returns that kind of information:

   2 byte binary number (N) of module types followed by N module specifiers of the form:
   2 byte binary number (M) of sockets of this module type
   2 byte binary number (L) of chips per this module type
   2 byte binary number (K) of cores per chip in this module type.

See the values in these sysfs files:

   cat /sys/devices/hv_24x7/interface/{sockets,chipspersocket,coresperchip}

But I am afraid these are host level information and not guest/LPAR.



I believe there might be some sort of reasoning behind not having this on
PAPR, but I'll say in advance that the virtual machine should act as the
real hardware, as close as possible. This is the kind of hcall that could
be used in this situation.




I didn't find any LPAR level properties or hca

Advice needed on SMP regression after cpu_core_mask change

2021-03-17 Thread Daniel Henrique Barboza

Hello,

Patch 4bce545903fa ("powerpc/topology: Update topology_core_cpumask") introduced
a regression in both upstream and RHEL downstream kernels [1]. The assumption
made in the commit:

"Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be
equal on Power"

Doesn't seem to be true. After this commit, QEMU is now unable to set single
NUMA node SMP topologies such as:

-smp 8,maxcpus=8,cores=2,threads=2,sockets=2

lscpu will give the following output in this case:

# lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):   1
NUMA node(s):1
Model:   2.2 (pvr 004e 1202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
NUMA node0 CPU(s):   0-7


This is happening because the macro cpu_cpu_mask(cpu) expands to
cpumask_of_node(cpu_to_node(cpu)), which in turn expands to
node_to_cpumask_map[node]. node_to_cpumask_map is a per-node array that maps
each NUMA node to the cpumask of its CPUs (Aneesh is on CC to correct me if
I'm wrong). We're now associating sockets to NUMA nodes directly.

If I add a second NUMA node then I can get the intended smp topology:

-smp 8,maxcpus=8,cores=2,threads=2,sockets=2
-numa node,memdev=mem0,cpus=0-3,nodeid=0 \
-numa node,memdev=mem1,cpus=4-7,nodeid=1 \

# lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):   2
NUMA node(s):2
Model:   2.2 (pvr 004e 1202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
NUMA node0 CPU(s):   0-3
NUMA node1 CPU(s):   4-7


However, if I try a single socket with multiple NUMA nodes topology, which is
the case of Power10, e.g.:


-smp 8,maxcpus=8,cores=4,threads=2,sockets=1
-numa node,memdev=mem0,cpus=0-3,nodeid=0 \
-numa node,memdev=mem1,cpus=4-7,nodeid=1 \


This is the result:

# lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):   2
NUMA node(s):2
Model:   2.2 (pvr 004e 1202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
NUMA node0 CPU(s):   0-3
NUMA node1 CPU(s):   4-7


This confirms my suspicions that, at this moment, we're making sockets == NUMA 
nodes.


Cedric, the reason I'm CCing you is because this is related to ibm,chip-id. The
commit after the one that caused the regression, 4ca234a9cbd7c3a65 ("powerpc/smp:
Stop updating cpu_core_mask"), is erasing the code that calculated
cpu_core_mask. cpu_core_mask, despite its shortcomings that caused its removal,
was giving a precise SMP topology. And it was using
physical_package_id/'ibm,chip-id' for that.

Checking in QEMU I can say that the ibm,chip-id calculation is the only place
in the code that cares about cores per socket information. The kernel is now
ignoring that, starting with 4bce545903fa, and QEMU is now unable to provide
this info to the guest.

If we're not going to use ibm,chip-id any longer, which seems sensible given
that PAPR does not declare it, we need another way of letting the guest know
how many cores per socket we want.



[1] https://bugzilla.redhat.com/1934421



Thanks,


DHB


Re: [PATCH] powerpc/numa: Fix topology_physical_package_id() on pSeries

2021-03-15 Thread Daniel Henrique Barboza




On 3/15/21 1:16 PM, Cédric Le Goater wrote:

On 3/15/21 4:12 PM, Daniel Henrique Barboza wrote:



On 3/12/21 11:31 AM, Cédric Le Goater wrote:

Initial commit 15863ff3b8da ("powerpc: Make chip-id information
available to userspace") introduced a cpu_to_chip_id() routine for the
PowerNV platform using the "ibm,chip-id" property to query the chip id
of a CPU. But PAPR does not specify such a property and the node id
query is broken.

Use cpu_to_node() instead, which is guaranteed to have a correct value on
all platforms, PowerNV and pSeries.


It is worth mentioning that this patch will change what
topology_physical_package_id() reports in a QEMU guest. Right now, ibm,chip-id
in QEMU is matching the socket-id. After this patch,
topology_physical_package_id() will match the NUMA id of the CPU.


yes. I should have added some more background.

LPARs are impacted by the use of ibm,chip-id because the property
does not exist under PowerVM and the topology-id in sysfs is always
-1 even if NUMA nodes are defined.

Under QEMU/KVM, ibm,chip-id is badly calculated when using uncommon
SMT configuration. This leads to a bogus topology-id value being
exported in sysfs.

The use of cpu_to_node() guarantees to have a correct NUMA node id
under both environments QEMU/KVM and PowerVM.

On the PowerNV platform, the numa node id returned by cpu_to_node()
is computed from the "ibm,associativity" property of the CPU. Its
value is built from the OPAL chip id and is equivalent to ibm,chip-id.

May be I should rephrase the commit log in a v2 ?


It's a fine idea, given that apparently we don't have documentation explaining
these details (well, at least I didn't find any). We can reference the commit
message later on as explanation :)




Thanks,

DHB



C.

  

Reviewed-by: Daniel Henrique Barboza 
Tested-by: Daniel Henrique Barboza 



Cc: Nathan Lynch 
Cc: Srikar Dronamraju 
Cc: Vasant Hegde 
Signed-off-by: Cédric Le Goater 
---
   arch/powerpc/include/asm/topology.h | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 3beeb030cd78..887c42a4e43d 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -123,7 +123,7 @@ static inline int cpu_to_coregroup_id(int cpu)
 #ifdef CONFIG_PPC64
 #include 
 
-#define topology_physical_package_id(cpu)	(cpu_to_chip_id(cpu))
+#define topology_physical_package_id(cpu)	(cpu_to_node(cpu))
 
 #define topology_sibling_cpumask(cpu)	(per_cpu(cpu_sibling_map, cpu))
 #define topology_core_cpumask(cpu)	(cpu_cpu_mask(cpu))





Re: [PATCH] powerpc/numa: Fix topology_physical_package_id() on pSeries

2021-03-15 Thread Daniel Henrique Barboza




On 3/12/21 11:31 AM, Cédric Le Goater wrote:

Initial commit 15863ff3b8da ("powerpc: Make chip-id information
available to userspace") introduced a cpu_to_chip_id() routine for the
PowerNV platform using the "ibm,chip-id" property to query the chip id
of a CPU. But PAPR does not specify such a property and the node id
query is broken.

Use cpu_to_node() instead, which is guaranteed to have a correct value on
all platforms, PowerNV and pSeries.


It is worth mentioning that this patch will change what
topology_physical_package_id() reports in a QEMU guest. Right now, ibm,chip-id
in QEMU is matching the socket-id. After this patch,
topology_physical_package_id() will match the NUMA id of the CPU.



Reviewed-by: Daniel Henrique Barboza 
Tested-by: Daniel Henrique Barboza 



Cc: Nathan Lynch 
Cc: Srikar Dronamraju 
Cc: Vasant Hegde 
Signed-off-by: Cédric Le Goater 
---
  arch/powerpc/include/asm/topology.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 3beeb030cd78..887c42a4e43d 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -123,7 +123,7 @@ static inline int cpu_to_coregroup_id(int cpu)
  #ifdef CONFIG_PPC64
  #include 
  
-#define topology_physical_package_id(cpu)	(cpu_to_chip_id(cpu))

+#define topology_physical_package_id(cpu)  (cpu_to_node(cpu))
  
  #define topology_sibling_cpumask(cpu)	(per_cpu(cpu_sibling_map, cpu))

  #define topology_core_cpumask(cpu)(cpu_cpu_mask(cpu))



Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property

2021-03-12 Thread Daniel Henrique Barboza




On 3/12/21 6:53 AM, Cédric Le Goater wrote:

On 3/12/21 2:55 AM, David Gibson wrote:

On Tue, 9 Mar 2021 18:26:35 +0100
Cédric Le Goater  wrote:


On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote:



On 3/9/21 12:33 PM, Cédric Le Goater wrote:

On 3/8/21 6:13 PM, Greg Kurz wrote:

On Wed, 3 Mar 2021 18:48:50 +0100
Cédric Le Goater  wrote:
  

The 'chip_id' field of the XIVE CPU structure is used to choose a
target for a source located on the same chip when possible. This field
is assigned on the PowerNV platform using the "ibm,chip-id" property
on pSeries under KVM when NUMA nodes are defined but it is undefined


This sentence seems to have a syntax problem... like it is missing an
'and' before 'on pSeries'.


ah yes, or simply a comma.
  

under PowerVM. The XIVE source structure has a similar field
'src_chip' which is only assigned on the PowerNV platform.

cpu_to_node() returns a compatible value on all platforms, 0 being the
default node. It will also give us the opportunity to set the affinity
of a source on pSeries when we can localize them.
  


IIUC this relies on the fact that the NUMA node id is == to the chip id
on PowerNV, i.e. xc->chip_id, which is passed to OPAL, remains stable
with this change.


Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall
H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
in Cc:)

  [...]


On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
the node id. This value is built from the chip id in OPAL, so the
value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
property are unlikely to be different.

cpu_to_node(cpu) is used in many places to allocate the structures
locally to the owning node. XIVE is not an exception (see below in the
same patch), it is better to be consistent and get the same information
(node id) using the same routine.


In Linux, "ibm,chip-id" is only used in low level PowerNV drivers:
LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
unifies the controllers of the system to expose only one to the OS. This
is problematic and should be changed but it's another topic.

  

On the other hand, you have the pSeries case under PowerVM that
doesn't use xc->chip_id, which isn't passed to any hcall AFAICT.


yes, "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning
under PAPR. xc->chip_id on pseries (PowerVM) will contain an invalid
chip id.

QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
always correct btw)



If you have a way to reliably reproduce this, let me know and I'll fix it
up in QEMU.


with :

-smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G 
-numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object 
memory-backend-ram,id=ram-node1,size=2G -numa 
node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1

# dmesg | grep numa
[0.013106] numa: Node 0 CPUs: 0-1
[0.013136] numa: Node 1 CPUs: 2-3

# dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
ibm,chip-id = <0x01>;
ibm,chip-id = <0x02>;
ibm,chip-id = <0x00>;
ibm,chip-id = <0x03>;

with :

   -smp 4,cores=4,maxcpus=8,threads=1 -object 
memory-backend-ram,id=ram-node0,size=2G -numa 
node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object 
memory-backend-ram,id=ram-node1,size=2G -numa 
node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1

# dmesg | grep numa
[0.013106] numa: Node 0 CPUs: 0-1
[0.013136] numa: Node 1 CPUs: 2-3

# dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
ibm,chip-id = <0x00>;
ibm,chip-id = <0x00>;
ibm,chip-id = <0x00>;
ibm,chip-id = <0x00>;

I think we should simply remove "ibm,chip-id" since it's not used and
not in the PAPR spec.


As I mentioned to Daniel on our call this morning, oddly it *does*
appear to be used in the RHEL kernel, even though that's 4.18 based.
This patch seems to have caused a minor regression; not in the
identification of NUMA nodes, but in the number of sockets shown by
lscpu, etc.  See https://bugzilla.redhat.com/show_bug.cgi?id=1934421
for more information.


Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we
remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 :

[root@localhost ~]# lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):   2
NUMA node(s):2
Model:   2.2 (pvr 004e 1202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
NUMA node0 CPU(s):   0-63
NUMA n

Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property

2021-03-09 Thread Daniel Henrique Barboza




On 3/9/21 12:33 PM, Cédric Le Goater wrote:

On 3/8/21 6:13 PM, Greg Kurz wrote:

On Wed, 3 Mar 2021 18:48:50 +0100
Cédric Le Goater  wrote:


The 'chip_id' field of the XIVE CPU structure is used to choose a
target for a source located on the same chip when possible. This field
is assigned on the PowerNV platform using the "ibm,chip-id" property
on pSeries under KVM when NUMA nodes are defined but it is undefined


This sentence seems to have a syntax problem... like it is missing an
'and' before 'on pSeries'.


ah yes, or simply a comma.


under PowerVM. The XIVE source structure has a similar field
'src_chip' which is only assigned on the PowerNV platform.

cpu_to_node() returns a compatible value on all platforms, 0 being the
default node. It will also give us the opportunity to set the affinity
of a source on pSeries when we can localize them.



IIUC this relies on the fact that the NUMA node id is == to the chip id
on PowerNV, i.e. xc->chip_id, which is passed to OPAL, remains stable
with this change.


Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall
H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
in Cc:)


That's correct. H_HOME_NODE_ASSOCIATIVITY returns not only the node_id, but
a list with the ibm,associativity domains of the CPU that "proc-no" (processor
identifier) is mapped to inside QEMU.

node_id in this case, considering that we're working with a reference-points array
of size 4, is the 4th element of the returned list. The last element is
"procno" itself.




On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
the node id. This value is built from the chip id in OPAL, so the
value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
property are unlikely to be different.

cpu_to_node(cpu) is used in many places to allocate the structures
locally to the owning node. XIVE is not an exception (see below in the
same patch), it is better to be consistent and get the same information
(node id) using the same routine.


In Linux, "ibm,chip-id" is only used in low level PowerNV drivers:
LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
unifies the controllers of the system to expose only one to the OS. This
is problematic and should be changed but it's another topic.



On the other hand, you have the pSeries case under PowerVM that
doesn't use xc->chip_id, which isn't passed to any hcall AFAICT.


yes, "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning
under PAPR. xc->chip_id on pseries (PowerVM) will contain an invalid
chip id.

QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
always correct btw)



If you have a way to reliably reproduce this, let me know and I'll fix it
up in QEMU.



Thanks,


DHB





It looks like the chip id is only used for localization purpose in
this case, right ?


Yes and PAPR sources are not localized. So it's not used. MSI sources
could be if we rewrote the MSI driver.


In this case, what about doing this change for pSeries only,
somewhere in spapr.c ?


The IPI code is common to all platforms and all have the same issue.
I rather not.

Thanks,

C.
  

Signed-off-by: Cédric Le Goater 
---
  arch/powerpc/sysdev/xive/common.c | 7 +--
  1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 595310e056f4..b8e456da28aa 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu)
  
  	xc = per_cpu(xive_cpu, cpu);

if (!xc) {
-   struct device_node *np;
-
xc = kzalloc_node(sizeof(struct xive_cpu),
  GFP_KERNEL, cpu_to_node(cpu));
if (!xc)
return -ENOMEM;
-   np = of_get_cpu_node(cpu, NULL);
-   if (np)
-   xc->chip_id = of_get_ibm_chip_id(np);
-   of_node_put(np);
+   xc->chip_id = cpu_to_node(cpu);
xc->hw_ipi = XIVE_BAD_IRQ;
  
  		per_cpu(xive_cpu, cpu) = xc;






[PATCH 1/1] hotplug-cpu.c: show 'last online CPU' error in dlpar_cpu_remove()

2021-03-05 Thread Daniel Henrique Barboza
Of all the reasons that dlpar_cpu_remove() can fail, the 'last online
CPU' is one that can be caused directly by the user offlining CPUs
in a partition/virtual machine that has hotplugged CPUs. Trying to
reclaim a hotplugged CPU can fail if the CPU is now the last online in
the system. This is easily reproduced using QEMU [1].

Throwing a more specific error message for this case, instead of just
"Failed to offline CPU", makes it clearer that the error is in fact a
known error situation instead of other generic/unknown cause.

[1] https://bugzilla.redhat.com/1911414

Signed-off-by: Daniel Henrique Barboza 
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..134f393f09e1 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -514,7 +514,17 @@ static ssize_t dlpar_cpu_remove(struct device_node *dn, u32 drc_index)
 
rc = dlpar_offline_cpu(dn);
if (rc) {
-   pr_warn("Failed to offline CPU %pOFn, rc: %d\n", dn, rc);
+   /* dlpar_offline_cpu will return -EBUSY from cpu_down() (via
+* device_offline()) in 2 cases: cpu_hotplug_disable is true or
+* there is only one CPU left. Warn the user about the second
+* since this can happen with user offlining CPUs and then
+* attempting hotunplugs.
+*/
+   if (rc == -EBUSY && num_online_cpus() == 1)
+   pr_warn("Unable to remove last online CPU %pOFn\n", dn);
+   else
+   pr_warn("Failed to offline CPU %pOFn, rc: %d\n", dn, rc);
+
return -EINVAL;
}
 
-- 
2.29.2



Re: [PATCH 0/1] powerpc/numa: do not skip node 0 in lookup table

2020-09-04 Thread Daniel Henrique Barboza

I discussed this a bit with Aneesh Kumar in IBM internal Slack, a few weeks
ago, and he informed me that this patch does not make sense with the
design used by the kernel. The kernel assumes that, for node 0, all
associativity domains must also be zeroed. This is why node 0 is skipped
when creating the distance table.

This of course has consequences for QEMU, so based on that, I've adapted
the QEMU implementation to not touch node 0.



Daniel

On 8/14/20 5:34 PM, Daniel Henrique Barboza wrote:

Hi,

This is a simple fix that I made while testing NUMA changes
I'm making in QEMU [1]. Setting any non-zero value to the
associativity of NUMA node 0 has no impact in the output
of 'numactl' because the distance_lookup_table is never
initialized for node 0.

Searching through the LOPAPR spec and git history I found no
technical reason to skip node 0, which makes me believe this is
a bug that flew under the radar up until now because no one
attempted to set node 0 associativity like I'm doing now.

For anyone wishing to give it a spin, using the QEMU build
in [1] and experimenting with NUMA distances, such as:

sudo ./qemu-system-ppc64 -machine 
pseries-5.2,accel=kvm,usb=off,dump-guest-core=off -m 65536 -overcommit 
mem-lock=off -smp 4,sockets=4,cores=1,threads=1 -rtc base=utc -display none 
-vga none -nographic -boot menu=on -device 
spapr-pci-host-bridge,index=1,id=pci.1 -device 
spapr-pci-host-bridge,index=2,id=pci.2 -device 
spapr-pci-host-bridge,index=3,id=pci.3 -device 
spapr-pci-host-bridge,index=4,id=pci.4 -device 
qemu-xhci,id=usb,bus=pci.0,addr=0x2 -drive 
file=/home/danielhb/f32.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 
-device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -device usb-kbd,id=input0,bus=usb.0,port=1 -device 
usb-mouse,id=input1,bus=usb.0,port=2 -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on \
-numa node,nodeid=0,cpus=0 -numa node,nodeid=1,cpus=1 \
-numa node,nodeid=2,cpus=2 -numa node,nodeid=3,cpus=3 \
-numa dist,src=0,dst=1,val=80 -numa dist,src=0,dst=2,val=80 \
-numa dist,src=0,dst=3,val=80 -numa dist,src=1,dst=2,val=80 \
-numa dist,src=1,dst=3,val=80 -numa dist,src=2,dst=3,val=80

The current kernel code will ignore the associativity of
node 0, and numactl will output this:

node distances:
node   0   1   2   3
   0:  10  160  160  160
   1:  160  10  80  80
   2:  160  80  10  80
   3:  160  80  80  10

With this patch:

node distances:
node   0   1   2   3
   0:  10  160  160  160
   1:  160  10  80  40
   2:  160  80  10  20
   3:  160  40  20  10


If anyone wonders, this patch has no conflict with the proposed
NUMA changes in [2] because Aneesh isn't changing this line.


[1] https://github.com/danielhb/qemu/tree/spapr_numa_v1
[2] 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/2020073916.243569-1-aneesh.ku...@linux.ibm.com/


Daniel Henrique Barboza (1):
   powerpc/numa: do not skip node 0 when init lookup table

  arch/powerpc/mm/numa.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)



[PATCH 1/1] powerpc/numa: do not skip node 0 when init lookup table

2020-08-14 Thread Daniel Henrique Barboza
associativity_to_nid() is skipping node 0 when initializing
the distance lookup table. This has no practical effect when
the associativity of node 0 is always zero, which seems to
have been the case for a long time. As such, this line got introduced in
commit 41eab6f88f24 from 2010 and was never revisited.

However, QEMU is making an effort to allow user input to configure
NUMA topologies, and this behavior got exposed when testing
that work. With the existing code, this is what happens with a
4 node NUMA guest with distance = 80 to each other:

$ numactl -H
(...)
node distances:
node   0   1   2   3
  0:  10  160  160  160
  1:  160  10  80  80
  2:  160  80  10  80
  3:  160  80  80  10

With this patch, this is the result:

$ numactl -H
(...)
node distances:
node   0   1   2   3
  0:  10  80  80  80
  1:  80  10  80  80
  2:  80  80  10  80
  3:  80  80  80  10

Signed-off-by: Daniel Henrique Barboza 
---
 arch/powerpc/mm/numa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 1f61fa2148b5..c11aabad1090 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -239,7 +239,7 @@ static int associativity_to_nid(const __be32 *associativity)
	if (nid == 0xffff || nid >= nr_node_ids)
nid = NUMA_NO_NODE;
 
-   if (nid > 0 &&
+   if (nid >= 0 &&
of_read_number(associativity, 1) >= distance_ref_points_depth) {
/*
 * Skip the length field and send start of associativity array
-- 
2.26.2



[PATCH 0/1] powerpc/numa: do not skip node 0 in lookup table

2020-08-14 Thread Daniel Henrique Barboza
Hi,

This is a simple fix that I made while testing NUMA changes
I'm making in QEMU [1]. Setting any non-zero value to the
associativity of NUMA node 0 has no impact in the output
of 'numactl' because the distance_lookup_table is never
initialized for node 0.

Searching through the LOPAPR spec and git history I found no
technical reason to skip node 0, which makes me believe this is
a bug that flew under the radar up until now because no one
attempted to set node 0 associativity like I'm doing now.

For anyone wishing to give it a spin, using the QEMU build
in [1] and experimenting with NUMA distances, such as:

sudo ./qemu-system-ppc64 -machine 
pseries-5.2,accel=kvm,usb=off,dump-guest-core=off -m 65536 -overcommit 
mem-lock=off -smp 4,sockets=4,cores=1,threads=1 -rtc base=utc -display none 
-vga none -nographic -boot menu=on -device 
spapr-pci-host-bridge,index=1,id=pci.1 -device 
spapr-pci-host-bridge,index=2,id=pci.2 -device 
spapr-pci-host-bridge,index=3,id=pci.3 -device 
spapr-pci-host-bridge,index=4,id=pci.4 -device 
qemu-xhci,id=usb,bus=pci.0,addr=0x2 -drive 
file=/home/danielhb/f32.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 
-device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -device usb-kbd,id=input0,bus=usb.0,port=1 -device 
usb-mouse,id=input1,bus=usb.0,port=2 -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on \
-numa node,nodeid=0,cpus=0 -numa node,nodeid=1,cpus=1 \
-numa node,nodeid=2,cpus=2 -numa node,nodeid=3,cpus=3 \
-numa dist,src=0,dst=1,val=80 -numa dist,src=0,dst=2,val=80 \
-numa dist,src=0,dst=3,val=80 -numa dist,src=1,dst=2,val=80 \
-numa dist,src=1,dst=3,val=80 -numa dist,src=2,dst=3,val=80

The current kernel code will ignore the associativity of
node 0, and numactl will output this:

node distances:
node   0   1   2   3 
  0:  10  160  160  160 
  1:  160  10  80  80 
  2:  160  80  10  80 
  3:  160  80  80  10 

With this patch:

node distances:
node   0   1   2   3 
  0:  10  160  160  160 
  1:  160  10  80  40 
  2:  160  80  10  20 
  3:  160  40  20  10 


If anyone wonders, this patch has no conflict with the proposed
NUMA changes in [2] because Aneesh isn't changing this line.


[1] https://github.com/danielhb/qemu/tree/spapr_numa_v1
[2] 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/2020073916.243569-1-aneesh.ku...@linux.ibm.com/


Daniel Henrique Barboza (1):
  powerpc/numa: do not skip node 0 when init lookup table

 arch/powerpc/mm/numa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.26.2



Question about NUMA distance calculation in powerpc/mm/numa.c

2020-07-16 Thread Daniel Henrique Barboza

Hello,


I didn't find an explanation of the 'double the distance' logic in
'git log' or anywhere in the kernel docs:


(arch/powerpc/mm/numa.c, __node_distance()):

for (i = 0; i < distance_ref_points_depth; i++) {
	if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
		break;

	/* Double the distance for each NUMA level */
	distance *= 2;
}

For reference, the commit that added it:


commit 41eab6f88f24124df89e38067b3766b7bef06ddb
Author: Anton Blanchard 
Date:   Sun May 16 20:22:31 2010 +

powerpc/numa: Use form 1 affinity to setup node distance
 


Is there a technical reason for the distance being doubled at
each NUMA level?

The reason I'm asking is the QEMU/libvirt capability to define NUMA
node distances in the VMs. For x86, a user is able to set arbitrary distance
values in the NUMA topology due to how ACPI SLIT works.

The user, of course, wants the pseries guest to behave the same way. The best
we can do for now is document why this will not happen. I'll document the
limitations imposed by the design itself (how ibm,associativity-reference-points
is capped to MAX_DISTANCE_REF_POINTS and so on). I also would like to document
that the pseries kernel will double the distance for each NUMA level, and for
that it would be nice to provide an actual reason for that to happen, if
there is any.
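One consequence worth spelling out in that documentation: with this scheme the guest can only ever report distances of 10, 20, 40, 80, ... (the local distance times a power of two), no matter what values the user asked for. Here is a sketch of the loop's arithmetic with made-up associativity values (a Python model for illustration, not the kernel code):

```python
LOCAL_DISTANCE = 10  # the same base value the kernel starts from

def node_distance(lookup_table, a, b, depth):
    """Mimics __node_distance(): start at LOCAL_DISTANCE and double
    for every associativity level at which the two nodes differ."""
    distance = LOCAL_DISTANCE
    for i in range(depth):
        if lookup_table[a][i] == lookup_table[b][i]:
            break
        distance *= 2  # double the distance for each NUMA level
    return distance

# Made-up associativity entries for three nodes, two reference points.
table = {"A": [1, 5], "B": [2, 5], "C": [3, 6]}
print(node_distance(table, "A", "A", 2))  # 10: all levels match
print(node_distance(table, "A", "B", 2))  # 20: one level differs
print(node_distance(table, "A", "C", 2))  # 40: both levels differ
```

So arbitrary SLIT-style values such as 15 or 25 simply cannot be expressed by a pseries guest, which appears to be exactly the limitation that needs documenting.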


Thanks,


Daniel



Re: [RFC PATCH kernel] powerpc/xive: Drop deregistered irqs

2019-07-14 Thread Daniel Henrique Barboza

This patch fixed an issue I was experiencing with virsh start/destroy
of guests with mlx5 and GPU passthrough on a Power 9 server. I
believe it's a situation similar to the one Alexey described in the
post-commit msg.


Tested-by: Daniel Henrique Barboza 


On 7/12/19 5:20 AM, Alexey Kardashevskiy wrote:

There is a race between releasing an irq on one cpu and fetching it
from XIVE on another cpu as there does not seem to be any locking between
these, probably because xive_irq_chip::irq_shutdown() is supposed to
remove the irq from all queues in the system which it does not do.

As a result, when such a released irq appears in a queue, we take it
from the queue but we do not change the current priority on that cpu, and
since there is no handler for the irq, EOI is never called and the cpu's
current priority remains elevated (7 vs. 0xff==unmasked). If another irq
is assigned to the same cpu, that device stops working until the irq
is moved to another cpu or the device is reset.

This checks whether the irq is still registered; if not, it assumes no valid
irq was fetched from the loop, and if there is none left, it continues to
the irq==0 case (not visible in this patch) and sets the priority to 0xff,
which is basically unmasking. This calls irq_to_desc() on a hot path now,
which is a radix tree lookup; hopefully this won't be noticeable as
that tree is quite small.

Signed-off-by: Alexey Kardashevskiy 
---

Found it on P9 system with:
- a host with 8 cpus online
- a boot disk on ahci with its msix on cpu#0
- a guest with 2xGPUs + 6xNVLink + 4 cpus
- GPU#0 from the guest is bound to the same cpu#0.

Killing a guest killed ahci and therefore the host because of the race.
Note that VFIO masks interrupts first and only then resets the device.

Alternatives:

1. Fix xive_irq_chip::irq_shutdown() to walk through all cpu queues and
drop deregistered irqs.

2. Exploit chip->irq_get_irqchip_state function from
62e0468650c30f0298 "genirq: Add optional hardware synchronization for shutdown".

Both require deep XIVE knowledge which I do not have.
---
  arch/powerpc/sysdev/xive/common.c | 8 ++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 082c7e1c20f0..65742e280337 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -148,8 +148,12 @@ static u32 xive_scan_interrupts(struct xive_cpu *xc, bool just_peek)
 		irq = xive_read_eq(&xc->queue[prio], just_peek);
 
 		/* Found something ? That's it */
-		if (irq)
-			break;
+		if (irq) {
+			/* Another CPU may have shut this irq down, check it */
+			if (irq_to_desc(irq))
+				break;
+			irq = 0;
+		}
 
 		/* Clear pending bits */
 		xc->pending_prio &= ~(1 << prio);
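For readers less familiar with XIVE, the behaviour of the guarded scan loop in the hunk above can be sketched with a toy model (Python, not the kernel code; queue contents and priorities are made up, and membership in `registered` stands in for irq_to_desc() returning non-NULL):

```python
from collections import deque

def scan_interrupts(queues, registered):
    """Toy model of the patched xive_scan_interrupts(): pop irqs per
    priority, but drop any irq whose descriptor is gone, i.e. an irq
    another CPU already shut down."""
    for prio in sorted(queues):
        queue = queues[prio]
        while queue:
            irq = queue.popleft()
            if irq in registered:   # irq_to_desc(irq) != NULL
                return irq, prio
            # deregistered irq: ignore it and keep scanning
    return 0, 0xff                  # nothing valid pending -> unmask

queues = {7: deque([42, 17])}       # irq 42 was freed by another CPU
print(scan_interrupts(queues, registered={17}))   # (17, 7)
print(scan_interrupts(queues, registered=set()))  # (0, 255)
```

Without the guard, the stale irq 42 would be returned with no handler to EOI it, which is how the cpu's priority stays stuck at 7 in the failure described above.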




Re: [PATCH kernel v2] powerpc/mm: Flush radix process translations when setting MMU type

2018-02-07 Thread Daniel Henrique Barboza



On 02/07/2018 12:33 PM, Laurent Vivier wrote:

On 01/02/2018 06:09, Alexey Kardashevskiy wrote:

Radix guests normally invalidate process-scoped translations when
a new pid is allocated, but migrated guests do not invalidate these, so
migrated guests sometimes crash; this is especially easy to reproduce when
migration happens within the first 10 seconds after guest boot starts on
the same machine.

This adds the "Invalidate process-scoped translations" flush to fix
radix guest migration.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
Changes:
v2:
* removed PPC_TLBIE_5() from the !(old_HR) case as it is pointless
on hash

---


I'm not so sure that "process-scoped translations" only require flushing
at pid allocation and migration.

---
  arch/powerpc/mm/pgtable_64.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index c9a623c..d75dd52 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -471,6 +471,8 @@ void mmu_partition_table_set_entry(unsigned int lpid, unsigned long dw0,
 	if (old & PATB_HR) {
 		asm volatile(PPC_TLBIE_5(%0,%1,2,0,1) : :
 			     "r" (TLBIEL_INVAL_SET_LPID), "r" (lpid));
+		asm volatile(PPC_TLBIE_5(%0,%1,2,1,1) : :
+			     "r" (TLBIEL_INVAL_SET_LPID), "r" (lpid));
 		trace_tlbie(lpid, 0, TLBIEL_INVAL_SET_LPID, lpid, 2, 0, 1);
 	} else {
 		asm volatile(PPC_TLBIE_5(%0,%1,2,0,0) : :


This patch fixes for me a VM migration crash on POWER9.


Same here.

Tested-by: Daniel Henrique Barboza <danie...@linux.vnet.ibm.com>



Tested-by: Laurent Vivier <lviv...@redhat.com>

Thanks,
Laurent





Re: Possible LMB hot unplug bug in 4.13+ kernels

2017-10-06 Thread Daniel Henrique Barboza



Unless you (Daniel) think there's some reason lmb_is_removable() is
incorrectly returning false. But most likely it's correct and there's
just an unmovable allocation in that range.


I am not educated enough to say that the current behavior is wrong. What I
can say is that in 4.11 and older kernels that support LMB hot plug/unplug I
didn't see this kernel "refusal" to remove an LMB that was just hotplugged.

Assuming that the kernel is behaving as intended, a QEMU guest started with
4GB of RAM that receives an extra 1GB of RAM will not unplug this same 1GB.
It seems off from the user perspective that recently added memory is being
considered not removable, thus QEMU will need to keep this limitation in mind
when dealing with future LMB bugs in 4.13+ kernels.


Thanks,


Daniel




Possible LMB hot unplug bug in 4.13+ kernels

2017-10-04 Thread Daniel Henrique Barboza

Hi,

I've stumbled into an LMB hot unplug problem when running a guest with a
4.13+ kernel using QEMU 2.10. When trying to hot unplug a recently
hotplugged LMB, this is what I got using an upstream kernel:


---
QEMU cmd line: sudo ./qemu-system-ppc64 -name migrate_qemu -boot 
strict=on -device nec-usb-xhci,id=usb,bus=pci.0,addr=0xf -device 
spapr-vscsi,id=scsi0,reg=0x2000 -smp 
32,maxcpus=32,sockets=32,cores=1,threads=1 --machine 
pseries,accel=kvm,kvm-type=HV,usb=off,dump-guest-core=off -m 
4G,slots=32,maxmem=32G -drive 
file=/home/danielhb/vm_imgs/f26.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none 
-device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x2,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 
-nographic



Last login: Wed Oct  4 12:28:25 on hvc0
[danielhb@localhost ~]$ grep Mem /proc/meminfo
MemTotal:    4161728 kB
MemFree: 3204352 kB
MemAvailable:    3558336 kB
[danielhb@localhost ~]$ QEMU 2.10.50 monitor - type 'help' for more 
information

(qemu)
(qemu) object_add memory-backend-ram,id=ram0,size=1G
(qemu) device_add pc-dimm,id=dimm0,memdev=ram0
(qemu)

[danielhb@localhost ~]$ grep Mem /proc/meminfo
MemTotal:    5210304 kB
MemFree: 4254656 kB
MemAvailable:    4598144 kB
[danielhb@localhost ~]$ (qemu)
(qemu) device_del dimm0
(qemu) [  136.058727] pseries-hotplug-mem: Memory indexed-count-remove 
failed, adding any removed LMBs


(qemu)
[danielhb@localhost ~]$ grep Mem /proc/meminfo
MemTotal:    5210304 kB
MemFree: 4253696 kB
MemAvailable:    4597184 kB
[danielhb@localhost ~]$

-

The LMBs are about to be unplugged, then something happens and they
end up being hotplugged back.


This isn't reproducible with 4.11 guests. I can reliably reproduce it in 
4.13+. Haven't tried 4.12.


Changing QEMU snapshots or even the hypervisor kernel/OS didn't affect 
the result.



In an attempt to better understand the issue I made the following changes
in the upstream kernel:


diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 1d48ab424bd9..37550833cdb0 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -433,8 +433,10 @@ static bool lmb_is_removable(struct of_drconf_cell *lmb)
 	unsigned long pfn, block_sz;
 	u64 phys_addr;
 
-	if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
+	if (!(lmb->flags & DRCONF_MEM_ASSIGNED)) {
+		pr_err("lmb is not assigned \n");
 		return false;
+	}
 
 	block_sz = memory_block_size_bytes();
 	scns_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
@@ -442,8 +444,10 @@ static bool lmb_is_removable(struct of_drconf_cell *lmb)
 
 #ifdef CONFIG_FA_DUMP
 	/* Don't hot-remove memory that falls in fadump boot memory area */
-	if (is_fadump_boot_memory_area(phys_addr, block_sz))
+	if (is_fadump_boot_memory_area(phys_addr, block_sz)) {
+		pr_err("lmb belongs to fadump boot memory area\n");
 		return false;
+	}
 #endif
 
 	for (i = 0; i < scns_per_block; i++) {
@@ -454,7 +458,9 @@ static bool lmb_is_removable(struct of_drconf_cell *lmb)
 		rc &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
 		phys_addr += MIN_MEMORY_BLOCK_SIZE;
 	}
-
+	if (!rc) {
+		pr_err("is_mem_section_removable returned false \n");
+	}
 	return rc ? true : false;
 }
 
@@ -465,12 +471,16 @@ static int dlpar_remove_lmb(struct of_drconf_cell *lmb)
 	unsigned long block_sz;
 	int nid, rc;
 
-	if (!lmb_is_removable(lmb))
+	if (!lmb_is_removable(lmb)) {
+		pr_err("dlpar_remove_lmb: lmb is not removable! \n");
 		return -EINVAL;
+	}
 
 	rc = dlpar_offline_lmb(lmb);
-	if (rc)
+	if (rc) {
+		pr_err("dlpar_remove_lmb: offline_lmb returned not zero \n");
 		return rc;
+	}
 
 	block_sz = pseries_memory_block_size();
 	nid = memory_add_physaddr_to_nid(lmb->base_addr);

And this is the output:

-

[danielhb@localhost ~]$ QEMU 2.10.50 monitor - type 'help' for more 
information

(qemu)
(qemu) object_add memory-backend-ram,id=ram0,size=1G
(qemu) device_add pc-dimm,id=dimm0,memdev=ram0
(qemu)

[danielhb@localhost ~]$ grep Mem /proc/meminfo
MemTotal:    5210304 kB
MemFree: 4254656 kB
MemAvailable:    4598144 kB
[danielhb@localhost ~]$ (qemu)
(qemu) device_del dimm0
(qemu) [  136.058473] pseries-hotplug-mem: is_mem_section_removable 
returned false

[  136.058607] pseries-hotplug-mem: dlpar_remove_lmb: lmb is not removable!
[  136.058727] pseries-hotplug-mem: Memory indexed-count-remove failed, 
adding any removed LMBs


(qemu)

[danielhb@localhost ~]$ grep Mem /proc/meminfo
MemTotal:    5210304 kB
MemFree: 4253696 kB
MemAvailable:    4597184 kB
[danielhb@localhost ~]$

---

It appears that the hot 

Re: Question: handling early hotplug interrupts

2017-08-29 Thread Daniel Henrique Barboza

Hi Ben,

On 08/29/2017 06:55 PM, Benjamin Herrenschmidt wrote:

On Tue, 2017-08-29 at 17:43 -0300, Daniel Henrique Barboza wrote:

Hi,

This is a scenario I've been facing when working on early device
hotplug in QEMU. When a device is added, an IRQ pulse is fired to warn
the guest of the event, then the kernel fetches it by calling
'check_exception' and handles it. If the hotplug is done too early
(before SLOF, for example), the pulse is ignored and the hotplug event
is left unchecked in the events queue.

One solution would be to pulse the hotplug queue interrupt after CAS,
when we are sure that the hotplug queue is negotiated. However, this
panics the kernel with sig 11 kernel access of bad area, which suggests
that the kernel wasn't quite ready to handle it.

That's not right. This is a bug that needs fixing. The interrupt should
be masked anyway but still.

Tell us more about the crash (backtrace etc...)  this definitely needs
fixing.


This is the backtrace using a 4.13.0-rc3 guest:

-
[0.008913] Unable to handle kernel paging request for data at 
address 0x0100

[0.008989] Faulting instruction address: 0xc012c318
[0.009046] Oops: Kernel access of bad area, sig: 11 [#1]
[0.009092] SMP NR_CPUS=1024
[0.009092] NUMA
[0.009128] pSeries
[0.009173] Modules linked in:
[0.009210] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.13.0-rc3+ #1
[0.009268] task: c000feb02580 task.stack: c000fe108000
[0.009325] NIP: c012c318 LR: c012c9c4 CTR: 


[0.009394] REGS: c000fffef910 TRAP: 0380   Not tainted (4.13.0-rc3+)
[0.009450] MSR: 82009033 <SF,VEC,EE,ME,IR,DR,RI,LE>
[0.009454]   CR: 28000822  XER: 2000
[0.009554] CFAR: c012c9c0 SOFTE: 0
[0.009554] GPR00: c012c9c4 c000fffefb90 c141f100 
0400
[0.009554] GPR04:  c000fe1851c0  
fee6
[0.009554] GPR08: 000fffe1  0001 
02001001
[0.009554] GPR12: 0040 cfd8 c000db58 

[0.009554] GPR16:    

[0.009554] GPR20:    
0001
[0.009554] GPR24: 0002 0013 c000fe14bc00 
0400
[0.009554] GPR28: 0400  c000fe1851c0 
0001

[0.010121] NIP [c012c318] __queue_work+0x48/0x640
[0.010168] LR [c012c9c4] queue_work_on+0xb4/0xf0
[0.010213] Call Trace:
[0.010239] [c000fffefb90] [c000db58] 
kernel_init+0x8/0x160 (unreliable)

[0.010308] [c000fffefc70] [c012c9c4] queue_work_on+0xb4/0xf0
[0.010368] [c000fffefcb0] [c00c4608] 
queue_hotplug_event+0xd8/0x150
[0.010435] [c000fffefd00] [c00c30d0] 
ras_hotplug_interrupt+0x140/0x190
[0.010505] [c000fffefd90] [c018c8b0] 
__handle_irq_event_percpu+0x90/0x310
[0.010573] [c000fffefe50] [c018cb6c] 
handle_irq_event_percpu+0x3c/0x90
[0.010642] [c000fffefe90] [c018cc24] 
handle_irq_event+0x64/0xc0
[0.010710] [c000fffefec0] [c01928b0] 
handle_fasteoi_irq+0xc0/0x230
[0.010779] [c000fffefef0] [c018ae14] 
generic_handle_irq+0x54/0x80

[0.010847] [c000fffeff20] [c00189f0] __do_irq+0x90/0x210
[0.010904] [c000fffeff90] [c002e730] call_do_irq+0x14/0x24
[0.010961] [c000fe10b640] [c0018c10] do_IRQ+0xa0/0x130
[0.011021] [c000fe10b6a0] [c0008c58] 
hardware_interrupt_common+0x158/0x160

[0.011090] --- interrupt: 501 at __replay_interrupt+0x38/0x3c
[0.011090] LR = arch_local_irq_restore+0x74/0x90
[0.011179] [c000fe10b990] [c000fe10b9e0] 0xc000fe10b9e0 
(unreliable)
[0.011249] [c000fe10b9b0] [c0b967fc] 
_raw_spin_unlock_irqrestore+0x4c/0xb0

[0.011316] [c000fe10b9e0] [c018ff50] __setup_irq+0x630/0x9e0
[0.011374] [c000fe10ba90] [c019054c] 
request_threaded_irq+0x13c/0x250
[0.011441] [c000fe10baf0] [c00c2cd0] 
request_event_sources_irqs+0x100/0x180
[0.011511] [c000fe10bc10] [c0eceda8] 
__machine_initcall_pseries_init_ras_IRQ+0xc4/0x12c
[0.011591] [c000fe10bc40] [c000d8c8] 
do_one_initcall+0x68/0x1e0
[0.011659] [c000fe10bd00] [c0eb4484] 
kernel_init_freeable+0x284/0x370

[0.011725] [c000fe10bdc0] [c000db7c] kernel_init+0x2c/0x160
[0.011782] [c000fe10be30] [c000bc9c] 
ret_from_kernel_thread+0x5c/0xc0

[0.011848] Instruction dump:
[0.011885] fbc1fff0 f8010010 f821ff21 7c7c1b78 7c9d2378 7cbe2b78 
787b0020 6000
[0.011955] 6000 892d028a 2fa9 409e04bc <813d0100> 75290001 
408204c0 3d2061c8

[0.012026] --

Question: handling early hotplug interrupts

2017-08-29 Thread Daniel Henrique Barboza

Hi,

This is a scenario I've been facing when working on early device
hotplug in QEMU. When a device is added, an IRQ pulse is fired to warn
the guest of the event, then the kernel fetches it by calling
'check_exception' and handles it. If the hotplug is done too early
(before SLOF, for example), the pulse is ignored and the hotplug event
is left unchecked in the events queue.


One solution would be to pulse the hotplug queue interrupt after CAS,
when we are sure that the hotplug queue is negotiated. However, this
panics the kernel with sig 11 (kernel access of bad area), which suggests
that the kernel wasn't quite ready to handle it.


In my experiments using upstream 4.13 I saw that there is a 'safe time'
to pulse the queue, sometime after CAS and before mounting the root fs,
but I wasn't able to pinpoint it. From QEMU's perspective, the last hcall
done (an h_set_mode) is still too early to pulse it and the kernel
panics. Looking at the kernel source I saw that IRQ handling is
initiated quite early in the init process.


So my question (ok, actually 2 questions):

- Is my analysis correct? Is there an unsafe time to fire an IRQ pulse
before CAS that can break the kernel, or am I overlooking/doing something
wrong?
- Is there a reliable way to know when the kernel can safely handle the
hotplug interrupt?
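For illustration, the stranded-event scenario from the first paragraph can be sketched as a toy model (hypothetical Python, not QEMU or kernel code): the pulse is edge-triggered, so an event queued while the guest has no handler is silently lost until something re-pulses the queue:

```python
from collections import deque

class HotplugEventSource:
    """Toy model of the early-hotplug race: the pulse is fire-and-forget,
    so an event queued before the guest has a handler stays stranded."""
    def __init__(self):
        self.queue = deque()
        self.handler = None        # set once the guest can take the irq

    def hotplug(self, event):
        self.queue.append(event)
        self.pulse()               # fire-and-forget edge interrupt

    def pulse(self):
        if self.handler is None:
            return                 # pulse lost; event stays in the queue
        while self.queue:          # guest drains via check_exception
            self.handler(self.queue.popleft())

seen = []
src = HotplugEventSource()
src.hotplug("pc-dimm")             # too early (before SLOF/CAS)
print(len(src.queue), seen)        # 1 []   <- event stranded
src.handler = seen.append          # guest negotiated the event queue
src.pulse()                        # a re-pulse after CAS drains it
print(len(src.queue), seen)        # 0 ['pc-dimm']
```

The open question is exactly when, on the kernel side, it becomes safe to perform that re-pulse.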



Thanks,


Daniel



Re: [RFC PATCH] powerpc: Disabling MEMORY_HOTPLUG_DEFAULT_ONLINE option for PPC64 arch

2017-08-02 Thread Daniel Henrique Barboza



On 08/01/2017 11:39 AM, Daniel Henrique Barboza wrote:



On 08/01/2017 11:05 AM, Nathan Fontenot wrote:



At this point I don't think we need this patch to disable auto online
for ppc64. I would be curious if this is still broken with the latest
mainline code though.


If the auto_online feature is already working in the upstream 4.13 kernel
then I don't see a reason to apply this patch either. We can leave it as an
FYI/reminder of a problem that was happening in 4.11 and got solved later on.


I've compiled an upstream kernel with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
and the problem with LMB hotplug is still reproducible.

This means that it would be good to have some sort of mechanism (such as
the one this patch proposes) to remove this option from the pseries kernel
until it is fixed.


Here is the stack trace:


Last login: Tue Aug  1 19:26:21 on hvc0
[danielhb@localhost ~]$ uname -a
Linux localhost.localdomain 4.13.0-rc3+ #1 SMP Tue Aug 1 21:29:14 CDT 
2017 ppc64le ppc64le ppc64le GNU/Linux
[danielhb@localhost ~]$ QEMU 2.9.90 monitor - type 'help' for more 
information

(qemu)
(qemu) object_add memory-backend-ram,id=ram1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=ram1
(qemu) [  617.757086] kernel BUG at mm/memory_hotplug.c:1936!
[  617.757245] Oops: Exception in kernel mode, sig: 5 [#1]
[  617.757332] SMP NR_CPUS=1024
[  617.757333] NUMA
[  617.757396] pSeries
[  617.757479] Modules linked in: ip6t_rpfilter ip6t_REJECT 
nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute 
bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 
ip6table_mangle ip6table_raw ip6table_security iptable_nat 
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack 
iptable_mangle iptable_raw iptable_security ebtable_filter ebtables 
ip6table_filter ip6_tables sunrpc ghash_generic vmx_crypto xfs libcrc32c 
bochs_drm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops 
ttm drm virtio_blk ibmvscsi virtio_pci ibmveth scsi_transport_srp 
i2c_core virtio_ring virtio crc32c_vpmsum

[  617.758577] CPU: 6 PID: 5 Comm: kworker/u16:0 Not tainted 4.13.0-rc3+ #1
[  617.758741] Workqueue: pseries hotplug workque pseries_hp_work_fn
[  617.758852] task: c000feb9e780 task.stack: c000fe118000
[  617.759047] NIP: c037322c LR: c03731b8 CTR: 


[  617.759171] REGS: c000fe11b790 TRAP: 0700   Not tainted (4.13.0-rc3+)
[  617.759275] MSR: 8282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
[  617.759284]   CR: 22002424  XER: 2000
[  617.759450] CFAR: c03731bc SOFTE: 1
[  617.759450] GPR00: c03731b8 c000fe11ba10 c141f100 
0001
[  617.759450] GPR04:  01a0 c15dac30 
8b80
[  617.759450] GPR08: 0007 0002 c000e9873480 
303078302d303030
[  617.759450] GPR12: 2200 cfd83c00 c0137f88 
c000fe1f0e40
[  617.759450] GPR16:  0001 0004 
0001
[  617.759450] GPR20: c000fc28 0010 c000ebb1c184 
0010
[  617.759450] GPR24: c000fea0 c000f0aec4a0 c000fc28 

[  617.759450] GPR28:  c1347f08 1000 
0001

[  617.760679] NIP [c037322c] remove_memory+0xfc/0x110
[  617.760766] LR [c03731b8] remove_memory+0x88/0x110
[  617.760849] Call Trace:
[  617.760895] [c000fe11ba10] [c03731b8] 
remove_memory+0x88/0x110 (unreliable)
[  617.761032] [c000fe11ba50] [c00cb0e0] 
dlpar_add_lmb+0x280/0x480
[  617.761214] [c000fe11bb30] [c00cc83c] 
dlpar_memory+0xa5c/0xe50
[  617.761319] [c000fe11bbf0] [c00c3b88] 
handle_dlpar_errorlog+0xf8/0x160
[  617.761443] [c000fe11bc60] [c00c3c84] 
pseries_hp_work_fn+0x94/0xa0
[  617.761568] [c000fe11bc90] [c012e3e8] 
process_one_work+0x248/0x540
[  617.761697] [c000fe11bd30] [c012e768] 
worker_thread+0x88/0x5c0

[  617.761857] [c000fe11bdc0] [c013812c] kthread+0x1ac/0x1c0
[  617.761962] [c000fe11be30] [c000bc9c] 
ret_from_kernel_thread+0x5c/0xc0

[  617.762085] Instruction dump:
[  617.762149] 4be09a9d 6000 4bd95ac5 6000 38210040 e8010010 
eb81ffe0 eba1ffe8
[  617.762274] ebc1fff0 7c0803a6 ebe1fff8 4e800020 <0fe0>  
 

[  617.762403] ---[ end trace feaa62099c5987b9 ]---


Thanks,


Daniel




Thanks,


Daniel



-Nathan

However, this change
alone isn't enough to prevent situations such as [1], where
distros can enable the option unaware of the consequences of
doing it (e.g. breaking LMB hotplug altogether).

Instead of relying on all distros knowing that pseries can't handle
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y after 943db62c316c, this patch
changes mm/Kconfig to make the MEMORY_HOTPLUG_DEFAULT_ONLINE config
unavailable for the PPC64 arch.

[1] https://bugzill

Re: [RFC PATCH] powerpc: Disabling MEMORY_HOTPLUG_DEFAULT_ONLINE option for PPC64 arch

2017-08-01 Thread Daniel Henrique Barboza



On 08/01/2017 11:05 AM, Nathan Fontenot wrote:

On 08/01/2017 04:59 AM, Michael Ellerman wrote:

Daniel Henrique Barboza <danie...@linux.vnet.ibm.com> writes:


Commit 943db62c316c ("powerpc/pseries: Revert 'Auto-online
hotplugged memory'") reverted the auto-online feature for pseries due
to problems with LMB removals not updating the device struct properly.
Among other things, this commit made the following change in
arch/powerpc/configs/pseries_defconfig:

@@ -58,7 +58,6 @@ CONFIG_KEXEC_FILE=y
  CONFIG_IRQ_ALL_CPUS=y
  CONFIG_MEMORY_HOTPLUG=y
  CONFIG_MEMORY_HOTREMOVE=y
-CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
  CONFIG_KSM=y

The intent was to disable the option in the defconfig of pseries, since
after that the code doesn't have this support anymore.

It's always polite to Cc the author of a commit you're referring to, so
I added Nathan.

Noted. Thanks for adding Nathan in the CC.



The intention when we merged that fix was that the auto-online code
would be "fixed" to mark the device online. I say "fixed" because it
wasn't entirely clear if that was the correct behaviour, though it
definitely seemed like it should be.

I've lost track of where/if the discussion got to on whether the
auto-online code should do that or not. Did anything get resolved?

I think, though I should go back and test to be sure, that everything
works in the latest mainline code. The issue causing this to be a problem
was in the original implementation of auto_online support. If you wanted
to auto online memory, the code was calling memory_block_change_state().
This worked but did not update the device struct for each of the memory
block that was online'ed such that dev->offline == true even after the
memory was online.

I sent a patch earlier this year (commit dc18d706a436) that corrected
this to call device_online() instead of memory_block_change_state().
With this fix (appears to have gone into the 4.11 kernel) it should be
possible to use auto_online on power systems.


Commit dc18d706a436 was present in the 4.11 kernels that experience this
issue (Fedora 26 and Ubuntu 17.10 in my tests). So I am not entirely sure
that we can use auto_online on power systems, at least in those kernels.
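To restate the distinction Nathan describes, a toy model (hypothetical Python, not the kernel code): the old path changed the memory block's state without touching the device's offline flag, which is what later removal checks:

```python
class MemoryBlock:
    """Toy model of the auto-online bug: the old path changed the block
    state but never updated the device's offline flag (dev->offline)."""
    def __init__(self):
        self.state = "offline"
        self.dev_offline = True         # stand-in for dev->offline

    def change_state_only(self, new_state):
        # models the original memory_block_change_state() call
        self.state = new_state          # device flag NOT updated

    def device_online(self):
        # models the fix in commit dc18d706a436
        self.state = "online"
        self.dev_offline = False        # device struct kept in sync

buggy = MemoryBlock()
buggy.change_state_only("online")
print(buggy.state, buggy.dev_offline)   # online True  <- stale flag
fixed = MemoryBlock()
fixed.device_online()
print(fixed.state, fixed.dev_offline)   # online False
```

With the stale flag, a later hot-remove believes the block is already offline and refuses to proceed, which matches the failure seen on 4.11 distro kernels.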




At this point I don't think we need this patch to disable auto online
for ppc64. I would be curious if this is still broken with the latest
mainline code though.


If the auto_online feature is already working in the upstream 4.13 kernel
then I don't see a reason to apply this patch either. We can leave it as an
FYI/reminder of a problem that was happening in 4.11 and got solved later on.


Thanks,


Daniel



-Nathan
   

However, this change
alone isn't enough to prevent situations such as [1], where
distros can enable the option unaware of the consequences of
doing it (e.g. breaking LMB hotplug altogether).

Instead of relying on all distros knowing that pseries can't handle
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y after 943db62c316c, this patch
changes mm/Kconfig to make the MEMORY_HOTPLUG_DEFAULT_ONLINE config
unavailable for the PPC64 arch.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1476380

Fixes: 943db62c316c ("powerpc/pseries: Revert 'Auto-online hotplugged memory'")
Signed-off-by: Daniel Henrique Barboza <danie...@linux.vnet.ibm.com>
---
  mm/Kconfig | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

I don't own that file, so we at least need an Ack from the mm folks.

cheers


diff --git a/mm/Kconfig b/mm/Kconfig
index 48b1af4..a342c77 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -169,7 +169,7 @@ config MEMORY_HOTPLUG_SPARSE
 config MEMORY_HOTPLUG_DEFAULT_ONLINE
 	bool "Online the newly added memory blocks by default"
 	default n
-	depends on MEMORY_HOTPLUG
+	depends on MEMORY_HOTPLUG && !PPC64
 	help
 	  This option sets the default policy setting for memory hotplug
 	  onlining policy (/sys/devices/system/memory/auto_online_blocks) which
--
2.9.4




[RFC PATCH] powerpc: Disabling MEMORY_HOTPLUG_DEFAULT_ONLINE option for PPC64 arch

2017-07-31 Thread Daniel Henrique Barboza
Commit 943db62c316c ("powerpc/pseries: Revert 'Auto-online
hotplugged memory'") reverted the auto-online feature for pseries due
to problems with LMB removals not updating the device struct properly.
Among other things, this commit made the following change in
arch/powerpc/configs/pseries_defconfig:

@@ -58,7 +58,6 @@ CONFIG_KEXEC_FILE=y
 CONFIG_IRQ_ALL_CPUS=y
 CONFIG_MEMORY_HOTPLUG=y
 CONFIG_MEMORY_HOTREMOVE=y
-CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
 CONFIG_KSM=y

The intent was to disable the option in the defconfig of pseries, since
after that the code doesn't have this support anymore. However, this change
alone isn't enough to prevent situations such as [1], where
distros can enable the option unaware of the consequences of
doing it (e.g. breaking LMB hotplug altogether).

Instead of relying on all distros knowing that pseries can't handle
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y after 943db62c316c, this patch
changes mm/Kconfig to make the MEMORY_HOTPLUG_DEFAULT_ONLINE config
unavailable for the PPC64 arch.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1476380

Fixes: 943db62c316c ("powerpc/pseries: Revert 'Auto-online hotplugged memory'")
Signed-off-by: Daniel Henrique Barboza <danie...@linux.vnet.ibm.com>
---
 mm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 48b1af4..a342c77 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -169,7 +169,7 @@ config MEMORY_HOTPLUG_SPARSE
 config MEMORY_HOTPLUG_DEFAULT_ONLINE
 	bool "Online the newly added memory blocks by default"
 	default n
-	depends on MEMORY_HOTPLUG
+	depends on MEMORY_HOTPLUG && !PPC64
 	help
 	  This option sets the default policy setting for memory hotplug
 	  onlining policy (/sys/devices/system/memory/auto_online_blocks) which
-- 
2.9.4