Re: [RFC PATCH] Disable Book-E KVM support?
On 11/30/22 17:45, Crystal Wood wrote:
> On Mon, 2022-11-28 at 14:36 +1000, Nicholas Piggin wrote:
>> BookE KVM is in a deep maintenance state, and I'm not sure how much testing
>> it gets. I don't have a test setup, and it does not look like QEMU has any
>> HV architecture enabled. It hasn't been too painful, but there are some
>> cases where not being able to test causes a bit of a problem, e.g.,
>> https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-November/251452.html
>>
>> Time to begin the removal process, or are there still people using it? I'm
>> happy to keep making occasional patches to try to keep it going if there
>> are people testing upstream. Getting HV support into QEMU would help with
>> long-term support; not sure how big of a job that would be.
>
> Not sure what you mean about QEMU not having e500 HV support? I don't know
> if it's bitrotted, but it's there.

AFAIK all QEMU ppc boards, aside from pSeries and the Mac ones, are always
used in emulated mode, in a use case similar to what Bernhard described in his
reply (run on x86 due to lack of ppc hardware). I am not aware of e500 KVM
support in QEMU since I never attempted to use it. But yes, it is present,
though poorly tested - if tested at all. And the reason there's no push on our
side to remove it from QEMU is that its code is so entwined with pSeries KVM
that it would take too much effort.

Do not take the presence of e500 KVM support in QEMU as a blocker to disabling
it in the kernel. As far as current QEMU usage goes, e500 KVM can be removed
without too much drama on our side.

Cedric, do you have any opinions about it?

Daniel

> I don't know whether anyone is still using this, but if they are, it's
> probably e500mc and not e500v2 (which involved a bunch of hacks to get
> almost-sorta-usable performance out of hardware not designed for
> virtualization). I do see that there have been a few recent patches on QEMU
> e500 (beyond the treewide cleanup type stuff), though I don't know if
> they're using KVM. CCing them and the QEMU list.
> I have an e6500 I could occasionally test on, if it turns out people do
> still care about this. Don't count me as the use case, though. :-)
>
> FWIW, as far as the RECONCILE_IRQ_STATE issue goes, that used to be done in
> kvmppc_handle_exit(), but was moved in commit 9bd880a2c882 to be "cleaner
> and faster". :-P
>
> -Crystal
Re: [PATCH v2 0/4] pseries-wdt: initial support for H_WATCHDOG-based watchdog timers
On 6/2/22 14:53, Scott Cheloha wrote:
> PAPR v2.12 defines a new hypercall, H_WATCHDOG. This patch series adds
> support for this hypercall to powerpc/pseries kernels and introduces a new
> watchdog driver, "pseries-wdt", for the virtual timers exposed by the
> hypercall.
>
> This series is preceded by the following:
>
> RFC v1: https://lore.kernel.org/linux-watchdog/20220413165104.179144-1-chel...@linux.ibm.com/
> RFC v2: https://lore.kernel.org/linux-watchdog/20220509174357.5448-1-chel...@linux.ibm.com/
> PATCH v1: https://lore.kernel.org/linux-watchdog/20220520183552.33426-1-chel...@linux.ibm.com/

Tested successfully with mainline QEMU plus Alexey's new h_watchdog patches in
https://patchwork.ozlabs.org/project/qemu-ppc/list/?series=305226.

All patches of this series:

Tested-by: Daniel Henrique Barboza

Thanks,

Daniel

> Changes of note from PATCH v1:
>
> - Trim down the large comment documenting the H_WATCHDOG hypercall. The
>   comment is likely to rot, so remove anything we aren't using and anything
>   overly obvious.
>
> - Remove any preprocessor definitions not actually used in the module right
>   now. If we want to use other features offered by the hypercall we can add
>   them in later. They're just clutter until then.
>
> - Simplify the "action" module parameter. The value is now an index into an
>   array of possible timeoutAction values. This design removes the need for
>   the custom get/set methods used in PATCH v1. Now we merely need to check
>   that the "action" value is a valid index during pseries_wdt_probe(). Easy.
>
> - Make the timeoutAction a member of pseries_wdt, "action". This eliminates
>   the use of a global variable during pseries_wdt_start().
>
> - Use watchdog_init_timeout() idiomatically. Check its return value and
>   error out of pseries_wdt_probe() if it fails.
Re: [PATCH v2 0/4] pseries-wdt: initial support for H_WATCHDOG-based watchdog timers
Hi,

Update: checking 'dmesg' more carefully, I found that the module probe is
failing with the following message:

[  186.298424][  T811] pseries-wdt: probe of pseries-wdt.0 failed with error -5

This failure is consistent. If I remove the module and modprobe it again, the
same error happens. The message is thrown by pseries_wdt_probe() (patch 4/4).

Back in QEMU, in Alexey's H_WATCHDOG implementation [1], I see that h_watchdog
is returning H_PARAMETER because the retrieved 'watchdogNumber' is zero.

Also, the pseries_wdt module still appears in 'lsmod' despite this probe
error. Not sure if this is a bug:

[root@fedora ~]# rmmod pseries_wdt
[root@fedora ~]# modprobe pseries_wdt
[ 1792.846769][  T865] pseries-wdt: probe of pseries-wdt.0 failed with error -5
[root@fedora ~]# lsmod | grep pseries
pseries_wdt           262144  0
[root@fedora ~]#

For reference, this is the full output of 'lsmod' in the guest:

[root@fedora ~]# lsmod
Module                  Size  Used by
pseries_wdt           262144  0
nfnetlink             262144  1
evdev                 327680  1
input_leds            262144  0
led_class             262144  1 input_leds
fuse                  458752  1
xfs                  1835008  2
libcrc32c             262144  1 xfs
virtio_scsi           327680  2
virtio_pci            262144  0
virtio                327680  2 virtio_scsi,virtio_pci
vmx_crypto            262144  0
gf128mul              262144  1 vmx_crypto
crc32c_vpmsum         327680  1
virtio_ring           327680  3 virtio_scsi,virtio_pci,virtio
virtio_pci_legacy_dev 262144  1 virtio_pci
virtio_pci_modern_dev 262144  1 virtio_pci
autofs4               327680  2

Given that the error is being thrown by Alexey's QEMU code, I'll bring it up
in the QEMU mailing list in [1] and follow up from there.

[1] https://patchwork.ozlabs.org/project/qemu-ppc/patch/20220608030153.1862335-1-...@ozlabs.ru/

Thanks,

Daniel

On 6/15/22 22:43, Daniel Henrique Barboza wrote:
> Hi,
>
> I tried this series out with mainline QEMU built with Alexey's patch [1] and
> I wasn't able to get it to work.
> I'm using a simple QEMU command line booting a fedora36 guest on a Power9
> boston host:
>
> sudo ./qemu-system-ppc64 \
>     -M pseries,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off,ic-mode=dual \
>     -m 4G -accel kvm -cpu POWER9 -smp 1,maxcpus=1,threads=1,cores=1,sockets=1 \
>     -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x2 \
>     -drive file=/home/danielhb/fedora36.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none \
>     -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=2 \
>     -device qemu-xhci,id=usb,bus=pci.0,addr=0x4 -nographic -display none
>
> Guest is running v5.19-rc2 with this series applied. Kernel config consists
> of 'pseries_le_defconfig' plus the following 'watchdog' related changes:
>
> [root@fedora ~]# cat linux/.config | grep PSERIES_WDT
> CONFIG_PSERIES_WDT=y
> [root@fedora ~]# cat linux/.config | grep -i watchdog
> CONFIG_PPC_WATCHDOG=y
> CONFIG_HAVE_NMI_WATCHDOG=y
> CONFIG_WATCHDOG=y
> CONFIG_WATCHDOG_CORE=y
> CONFIG_WATCHDOG_NOWAYOUT=y
> CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED=y
> CONFIG_WATCHDOG_OPEN_TIMEOUT=0
> # CONFIG_WATCHDOG_SYSFS is not set
> # CONFIG_WATCHDOG_HRTIMER_PRETIMEOUT is not set
>
> # Watchdog Pretimeout Governors
> # CONFIG_WATCHDOG_PRETIMEOUT_GOV is not set
>
> # Watchdog Device Drivers
> # CONFIG_SOFT_WATCHDOG is not set
> # CONFIG_XILINX_WATCHDOG is not set
> # CONFIG_ZIIRAVE_WATCHDOG is not set
> # CONFIG_CADENCE_WATCHDOG is not set
> # CONFIG_DW_WATCHDOG is not set
> # CONFIG_MAX63XX_WATCHDOG is not set
> CONFIG_WATCHDOG_RTAS=y
>
> # PCI-based Watchdog Cards
> # CONFIG_PCIPCWATCHDOG is not set
>
> # USB-based Watchdog Cards
> # CONFIG_USBPCWATCHDOG is not set
> # CONFIG_WQ_WATCHDOG is not set
> [root@fedora ~]#
>
> Kernel command line:
>
> [root@fedora ~]# cat /proc/cmdline
> BOOT_IMAGE=(ieee1275/disk,msdos2)/vmlinuz-5.19.0-rc2-00054-g12ede8ffb103 \
>     root=/dev/mapper/fedora_fedora-root ro rd.lvm.lv=fedora_fedora/root \
>     pseries-wdt.timeout=60 pseries-wdt.nowayout=1 pseries-wdt.action=2
>
> With all that, executing
>
> echo V > /dev/watchdog0
>
> does nothing.
> dmesg is clean and the guest doesn't reboot after the 60 sec timeout.
>
> I also tried with PSERIES_WDT being compiled as a module instead of
> built-in. Same results.
>
> What am I missing?
>
> [1] https://patchwork.ozlabs.org/project/qemu-ppc/patch/20220608030153.1862335-1-...@ozlabs.ru/
>
> Thanks,
>
> Daniel
>
> On 6/2/22 14:53, Scott Cheloha wrote:
>> PAPR v2.12 defines a new hypercall, H_WATCHDOG. This patch series adds
>> support for this hypercall to powerpc/pseries kernels and introduces a new
>> watchdog driver, "pseries-wdt", for the virtual timers exposed by the
>> hypercall.
>>
>> This series is preceded by the following:
>>
>> RFC v1: https://lore.kernel.org/linux-watchdog/20220413165104.179144-1-chel...@linux.ibm.com/
>> RFC v2: https://lore.kernel.org/linux-watchdog/20220509174357
Re: [PATCH v2 0/4] pseries-wdt: initial support for H_WATCHDOG-based watchdog timers
On 6/16/22 13:44, Tyrel Datwyler wrote:
> On 6/15/22 18:43, Daniel Henrique Barboza wrote:
>> Hi,
>>
>> I tried this series out with mainline QEMU built with Alexey's patch [1]
>> and I wasn't able to get it to work.
>>
>> I'm using a simple QEMU command line booting a fedora36 guest on a Power9
>> boston host:
>
> I would assume the H_WATCHDOG hypercall is not implemented by the qemu
> pseries machine type. It is a very new hypercall.

Alexey implemented QEMU support for this hypercall here:

"[qemu] ppc/spapr: Implement H_WATCHDOG"
https://patchwork.ozlabs.org/project/qemu-ppc/patch/20220608030153.1862335-1-...@ozlabs.ru/

It is under review in the QEMU mailing list. I tried it out with this series
and wasn't able to get it to work.

Thanks,

Daniel

> -Tyrel
>
>> sudo ./qemu-system-ppc64 \
>>     -M pseries,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off,ic-mode=dual \
>>     -m 4G -accel kvm -cpu POWER9 -smp 1,maxcpus=1,threads=1,cores=1,sockets=1 \
>>     -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x2 \
>>     -drive file=/home/danielhb/fedora36.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none \
>>     -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=2 \
>>     -device qemu-xhci,id=usb,bus=pci.0,addr=0x4 -nographic -display none
>>
>> Guest is running v5.19-rc2 with this series applied.
>> Kernel config consists of 'pseries_le_defconfig' plus the following
>> 'watchdog' related changes:
>>
>> [root@fedora ~]# cat linux/.config | grep PSERIES_WDT
>> CONFIG_PSERIES_WDT=y
>> [root@fedora ~]# cat linux/.config | grep -i watchdog
>> CONFIG_PPC_WATCHDOG=y
>> CONFIG_HAVE_NMI_WATCHDOG=y
>> CONFIG_WATCHDOG=y
>> CONFIG_WATCHDOG_CORE=y
>> CONFIG_WATCHDOG_NOWAYOUT=y
>> CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED=y
>> CONFIG_WATCHDOG_OPEN_TIMEOUT=0
>> # CONFIG_WATCHDOG_SYSFS is not set
>> # CONFIG_WATCHDOG_HRTIMER_PRETIMEOUT is not set
>>
>> # Watchdog Pretimeout Governors
>> # CONFIG_WATCHDOG_PRETIMEOUT_GOV is not set
>>
>> # Watchdog Device Drivers
>> # CONFIG_SOFT_WATCHDOG is not set
>> # CONFIG_XILINX_WATCHDOG is not set
>> # CONFIG_ZIIRAVE_WATCHDOG is not set
>> # CONFIG_CADENCE_WATCHDOG is not set
>> # CONFIG_DW_WATCHDOG is not set
>> # CONFIG_MAX63XX_WATCHDOG is not set
>> CONFIG_WATCHDOG_RTAS=y
>>
>> # PCI-based Watchdog Cards
>> # CONFIG_PCIPCWATCHDOG is not set
>>
>> # USB-based Watchdog Cards
>> # CONFIG_USBPCWATCHDOG is not set
>> # CONFIG_WQ_WATCHDOG is not set
>> [root@fedora ~]#
>>
>> Kernel command line:
>>
>> [root@fedora ~]# cat /proc/cmdline
>> BOOT_IMAGE=(ieee1275/disk,msdos2)/vmlinuz-5.19.0-rc2-00054-g12ede8ffb103 \
>>     root=/dev/mapper/fedora_fedora-root ro rd.lvm.lv=fedora_fedora/root \
>>     pseries-wdt.timeout=60 pseries-wdt.nowayout=1 pseries-wdt.action=2
>>
>> With all that, executing
>>
>> echo V > /dev/watchdog0
>>
>> does nothing. dmesg is clean and the guest doesn't reboot after the 60 sec
>> timeout.
>>
>> I also tried with PSERIES_WDT being compiled as a module instead of
>> built-in. Same results.
>>
>> What am I missing?
>>
>> [1] https://patchwork.ozlabs.org/project/qemu-ppc/patch/20220608030153.1862335-1-...@ozlabs.ru/
>>
>> Thanks,
>>
>> Daniel
>>
>> On 6/2/22 14:53, Scott Cheloha wrote:
>>> PAPR v2.12 defines a new hypercall, H_WATCHDOG. This patch series adds
>>> support for this hypercall to powerpc/pseries kernels and introduces a
>>> new watchdog driver, "pseries-wdt", for the virtual timers exposed by
>>> the hypercall.
>>>
>>> This series is preceded by the following:
>>>
>>> RFC v1: https://lore.kernel.org/linux-watchdog/20220413165104.179144-1-chel...@linux.ibm.com/
>>> RFC v2: https://lore.kernel.org/linux-watchdog/20220509174357.5448-1-chel...@linux.ibm.com/
>>> PATCH v1: https://lore.kernel.org/linux-watchdog/20220520183552.33426-1-chel...@linux.ibm.com/
>>>
>>> Changes of note from PATCH v1:
>>>
>>> - Trim down the large comment documenting the H_WATCHDOG hypercall. The
>>>   comment is likely to rot, so remove anything we aren't using and
>>>   anything overly obvious.
>>>
>>> - Remove any preprocessor definitions not actually used in the module
>>>   right now. If we want to use other features offered by the hypercall
>>>   we can add them in later. They're just clutter until then.
>>>
>>> - Simplify the "action" module parameter. The value is now an index into
>>>   an array of possible timeoutAction values. This design removes the
>>>   need for the custom get/set methods used in PATCH v1. Now we merely
>>>   need to check that the "action" value is a valid index during
>>>   pseries_wdt_probe(). Easy.
>>>
>>> - Make the timeoutAction a member of pseries_wdt, "action". This
>>>   eliminates the use of a global variable during pseries_wdt_start().
>>>
>>> - Use watchdog_init_timeout() idiomatically. Check its return value and
>>>   error out of pseries_wdt_probe() if it fails.
Re: [PATCH v2 0/4] pseries-wdt: initial support for H_WATCHDOG-based watchdog timers
Hi,

I tried this series out with mainline QEMU built with Alexey's patch [1] and
I wasn't able to get it to work.

I'm using a simple QEMU command line booting a fedora36 guest on a Power9
boston host:

sudo ./qemu-system-ppc64 \
    -M pseries,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off,ic-mode=dual \
    -m 4G -accel kvm -cpu POWER9 -smp 1,maxcpus=1,threads=1,cores=1,sockets=1 \
    -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x2 \
    -drive file=/home/danielhb/fedora36.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none \
    -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=2 \
    -device qemu-xhci,id=usb,bus=pci.0,addr=0x4 -nographic -display none

Guest is running v5.19-rc2 with this series applied. Kernel config consists
of 'pseries_le_defconfig' plus the following 'watchdog' related changes:

[root@fedora ~]# cat linux/.config | grep PSERIES_WDT
CONFIG_PSERIES_WDT=y
[root@fedora ~]# cat linux/.config | grep -i watchdog
CONFIG_PPC_WATCHDOG=y
CONFIG_HAVE_NMI_WATCHDOG=y
CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_CORE=y
CONFIG_WATCHDOG_NOWAYOUT=y
CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED=y
CONFIG_WATCHDOG_OPEN_TIMEOUT=0
# CONFIG_WATCHDOG_SYSFS is not set
# CONFIG_WATCHDOG_HRTIMER_PRETIMEOUT is not set

# Watchdog Pretimeout Governors
# CONFIG_WATCHDOG_PRETIMEOUT_GOV is not set

# Watchdog Device Drivers
# CONFIG_SOFT_WATCHDOG is not set
# CONFIG_XILINX_WATCHDOG is not set
# CONFIG_ZIIRAVE_WATCHDOG is not set
# CONFIG_CADENCE_WATCHDOG is not set
# CONFIG_DW_WATCHDOG is not set
# CONFIG_MAX63XX_WATCHDOG is not set
CONFIG_WATCHDOG_RTAS=y

# PCI-based Watchdog Cards
# CONFIG_PCIPCWATCHDOG is not set

# USB-based Watchdog Cards
# CONFIG_USBPCWATCHDOG is not set
# CONFIG_WQ_WATCHDOG is not set
[root@fedora ~]#

Kernel command line:

[root@fedora ~]# cat /proc/cmdline
BOOT_IMAGE=(ieee1275/disk,msdos2)/vmlinuz-5.19.0-rc2-00054-g12ede8ffb103 \
    root=/dev/mapper/fedora_fedora-root ro rd.lvm.lv=fedora_fedora/root \
    pseries-wdt.timeout=60 pseries-wdt.nowayout=1 pseries-wdt.action=2

With all that, executing

echo V > /dev/watchdog0

does nothing. dmesg is clean and the guest doesn't reboot after the 60 sec
timeout.

I also tried with PSERIES_WDT being compiled as a module instead of built-in.
Same results.

What am I missing?

[1] https://patchwork.ozlabs.org/project/qemu-ppc/patch/20220608030153.1862335-1-...@ozlabs.ru/

Thanks,

Daniel

On 6/2/22 14:53, Scott Cheloha wrote:
> PAPR v2.12 defines a new hypercall, H_WATCHDOG. This patch series adds
> support for this hypercall to powerpc/pseries kernels and introduces a new
> watchdog driver, "pseries-wdt", for the virtual timers exposed by the
> hypercall.
>
> This series is preceded by the following:
>
> RFC v1: https://lore.kernel.org/linux-watchdog/20220413165104.179144-1-chel...@linux.ibm.com/
> RFC v2: https://lore.kernel.org/linux-watchdog/20220509174357.5448-1-chel...@linux.ibm.com/
> PATCH v1: https://lore.kernel.org/linux-watchdog/20220520183552.33426-1-chel...@linux.ibm.com/
>
> Changes of note from PATCH v1:
>
> - Trim down the large comment documenting the H_WATCHDOG hypercall. The
>   comment is likely to rot, so remove anything we aren't using and anything
>   overly obvious.
>
> - Remove any preprocessor definitions not actually used in the module right
>   now. If we want to use other features offered by the hypercall we can add
>   them in later. They're just clutter until then.
>
> - Simplify the "action" module parameter. The value is now an index into an
>   array of possible timeoutAction values. This design removes the need for
>   the custom get/set methods used in PATCH v1. Now we merely need to check
>   that the "action" value is a valid index during pseries_wdt_probe(). Easy.
>
> - Make the timeoutAction a member of pseries_wdt, "action". This eliminates
>   the use of a global variable during pseries_wdt_start().
>
> - Use watchdog_init_timeout() idiomatically. Check its return value and
>   error out of pseries_wdt_probe() if it fails.
[PATCH] powerpc/mm/numa: skip NUMA_NO_NODE onlining in parse_numa_properties()
Executing node_set_online() when nid = NUMA_NO_NODE results in undefined
behavior. node_set_online() will call node_set_state(), into __node_set(),
into set_bit(), and since NUMA_NO_NODE is -1 we end up doing a negative shift
operation inside arch/powerpc/include/asm/bitops.h. This potential UB was
detected running a kernel with CONFIG_UBSAN.

The behavior was introduced by commit 10f78fd0dabb ("powerpc/numa: Fix a
regression on memoryless node 0"), where the check for nid > 0 was removed to
fix a problem that was happening with nid = 0, but the result is that now
we're trying to online NUMA_NO_NODE nids as well.

Checking for nid >= 0 will allow node 0 to be onlined while avoiding this UB
with NUMA_NO_NODE.

Reported-by: Ping Fang
Cc: Diego Domingos
Cc: Aneesh Kumar K.V
Cc: Srikar Dronamraju
Fixes: 10f78fd0dabb ("powerpc/numa: Fix a regression on memoryless node 0")
Signed-off-by: Daniel Henrique Barboza
---
 arch/powerpc/mm/numa.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 9d5f710d2c20..b9b7fefbb64b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -956,7 +956,9 @@ static int __init parse_numa_properties(void)
 			of_node_put(cpu);
 		}
 
-		node_set_online(nid);
+		/* node_set_online() is an UB if 'nid' is negative */
+		if (likely(nid >= 0))
+			node_set_online(nid);
 	}
 
 	get_n_mem_cells(&n_mem_addr_cells, &n_mem_size_cells);
-- 
2.35.1
Re: [PATCH v2 4/4] powerpc/pseries/cpuhp: remove obsolete comment from pseries_cpu_die
On 9/27/21 17:19, Nathan Lynch wrote:
> This comment likely refers to the obsolete DLPAR workflow where some
> resource state transitions were driven more directly from user space
> utilities, but it also seems to contradict itself: "Change isolate state to
> Isolate [...]" is at odds with the preceding sentences, and it does not
> relate at all to the code that follows.

This comment was added by commit 413f7c405a34, a 2006 commit where Mike
Ellerman moved code from platform/pseries/smp.c into hotplug-cpu.c. I checked
the original code back in smp.c, and the comment was added there by commit
1da177e4c3f41, which is Linus's initial git commit, where he mentions that he
didn't bother with full history (although it is available somewhere,
allegedly).

This is enough to say that we can't easily see the history behind this
comment. I also believe that we're better off without it, since it doesn't
make sense with the current codebase.

Reviewed-by: Daniel Henrique Barboza

> Remove it to prevent confusion.
>
> Signed-off-by: Nathan Lynch
> ---
>  arch/powerpc/platforms/pseries/hotplug-cpu.c | 5 -----
>  1 file changed, 5 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index b50f3e9aa259..5ab44600c8d3 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -137,11 +137,6 @@ static void pseries_cpu_die(unsigned int cpu)
>  			cpu, pcpu);
>  	}
>
> -	/* Isolation and deallocation are definitely done by
> -	 * drslot_chrp_cpu. If they were not they would be
> -	 * done here. Change isolate state to Isolate and
> -	 * change allocation-state to Unusable.
> -	 */
>  	paca_ptrs[cpu]->cpu_start = 0;
>  }
Re: [PATCH 3/3] powerpc/pseries/cpuhp: delete add/remove_by_count code
On 9/20/21 10:55, Nathan Lynch wrote:
> The core DLPAR code supports two actions (add and remove) and three
> subtypes of action:
>
> * By DRC index: the action is attempted on a single specified resource.
>   This is the usual case for processors.
>
> * By indexed count: the action is attempted on a range of resources
>   beginning at the specified index. This is implemented only by the memory
>   DLPAR code.
>
> * By count: the lower layer (CPU or memory) is responsible for locating the
>   specified number of resources to which the action can be applied.
>
> I cannot find any evidence of the "by count" subtype being used by drmgr or
> qemu for processors. And when I try to exercise this code, the add case
> does not work:

Just to clarify: did you check both CPU and memory cases and find that the
'by count' subtype isn't used with CPUs, but drmgr has some cases in which
'by count' is used with LMBs?

I'm asking because I worked with a part of the LMB removal code a few months
ago and got stuck in a situation in which the 'by count' and 'by indexed
count' paths are similar enough to feel repetitive, but distinct enough to
not be easily reduced into a single function. If drmgr weren't using the 'by
count' subtype for LMBs, that would be a good chance for more code reduction.

> $ ppc64_cpu --smt ; nproc
> SMT=8
> 24
> $ printf "cpu remove count 2" > /sys/kernel/dlpar
> $ nproc
> 8
> $ printf "cpu add count 2" > /sys/kernel/dlpar
> -bash: printf: write error: Invalid argument
> $ dmesg | tail -2
> pseries-hotplug-cpu: Failed to find enough CPUs (1 of 2) to add
> dlpar: Could not handle DLPAR request "cpu add count 2"
> $ nproc
> 8
> $ drmgr -c cpu -a -q 2   # this uses the by-index method
> Validating CPU DLPAR capability...yes.
> CPU 1
> CPU 17
> $ nproc
> 24
>
> This is because find_drc_info_cpus_to_add() does not increment drc_index
> appropriately during its search. This is not hard to fix.
> But the _by_count() functions also have the property that they attempt to
> roll back all prior operations if the entire request cannot be satisfied,
> even though the rollback itself can encounter errors. It's not possible to
> provide transaction-like behavior at this level, and it's undesirable to
> have code that can only pretend to do that. Any users of these functions
> cannot know what the state of the system is in the error case. And the
> error paths are, to my knowledge, impossible to test without adding custom
> error injection code.
>
> Summary:
>
> * This code has not worked reliably since its introduction.
> * There is no evidence that it is used.
> * It contains questionable rollback behaviors in error paths which are
>   difficult to test.
>
> So let's remove it.
>
> Signed-off-by: Nathan Lynch
> Fixes: ac71380071d1 ("powerpc/pseries: Add CPU dlpar remove functionality")
> Fixes: 90edf184b9b7 ("powerpc/pseries: Add CPU dlpar add functionality")
> Fixes: b015f6bc9547 ("powerpc/pseries: Add cpu DLPAR support for drc-info property")
> ---

Tested with a QEMU pseries guest, no issues found.

Reviewed-by: Daniel Henrique Barboza
Tested-by: Daniel Henrique Barboza

>  arch/powerpc/platforms/pseries/hotplug-cpu.c | 218 +--
>  1 file changed, 2 insertions(+), 216 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 87a0fbe9cf12..768997261ce8 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -741,216 +741,6 @@ static int dlpar_cpu_remove_by_index(u32 drc_index)
>  	return rc;
>  }
>
> -static int find_dlpar_cpus_to_remove(u32 *cpu_drcs, int cpus_to_remove)
> -{
> -	struct device_node *dn;
> -	int cpus_found = 0;
> -	int rc;
> -
> -	/* We want to find cpus_to_remove + 1 CPUs to ensure we do not
> -	 * remove the last CPU.
> -	 */
> -	for_each_node_by_type(dn, "cpu") {
> -		cpus_found++;
> -
> -		if (cpus_found > cpus_to_remove) {
> -			of_node_put(dn);
> -			break;
> -		}
> -
> -		/* Note that cpus_found is always 1 ahead of the index
> -		 * into the cpu_drcs array, so we use cpus_found - 1
> -		 */
> -		rc = of_property_read_u32(dn, "ibm,my-drc-index",
> -					  &cpu_drcs[cpus_found - 1]);
> -		if (rc) {
> -			pr_warn("Error occurred getting drc-index for %pOFn\n",
> -				dn);
> -			of_node_put(dn);
> -			return -1;
> -		}
> -	}
> -
> -	if (cpus_found < cpus_to_remove) {
> -		pr_warn("Failed to find enough CPUs (%d of %d) to remove\n",
> -			cpus_found, cpus_to_remove);
> -	} else
Re: [PATCH 2/3] powerpc/cpuhp: BUG -> WARN conversion in offline path
On 9/20/21 10:55, Nathan Lynch wrote:
> If, due to bugs elsewhere, we get into unregister_cpu_online() with a CPU
> that isn't marked hotpluggable, we can emit a warning and return an
> appropriate error instead of crashing.
>
> Signed-off-by: Nathan Lynch
> ---

As mentioned by Christophe, this will not solve the crash for kernels with
panic_on_warn, but at least those will panic with a clearer message, and
everyone else will not panic at all.

Reviewed-by: Daniel Henrique Barboza

>  arch/powerpc/kernel/sysfs.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> index defecb3b1b15..08d8072d6e7a 100644
> --- a/arch/powerpc/kernel/sysfs.c
> +++ b/arch/powerpc/kernel/sysfs.c
> @@ -928,7 +928,8 @@ static int unregister_cpu_online(unsigned int cpu)
>  	struct device_attribute *attrs, *pmc_attrs;
>  	int i, nattrs;
>
> -	BUG_ON(!c->hotpluggable);
> +	if (WARN_RATELIMIT(!c->hotpluggable, "cpu %d can't be offlined\n", cpu))
> +		return -EBUSY;
>
>  #ifdef CONFIG_PPC64
>  	if (cpu_has_feature(CPU_FTR_SMT))
Re: [PATCH 1/3] powerpc/pseries/cpuhp: cache node corrections
On 9/20/21 10:55, Nathan Lynch wrote:
> On pseries, cache nodes in the device tree can be added and removed by the
> CPU DLPAR code as well as the partition migration (mobility) code. PowerVM
> partitions in dedicated processor mode typically have L2 and L3 cache
> nodes.
>
> The CPU DLPAR code has the following shortcomings:
>
> * Cache nodes returned as siblings of a new CPU node by
>   ibm,configure-connector are silently discarded; only the CPU node is
>   added to the device tree.
>
> * Cache nodes which become unreferenced in the processor removal path are
>   not removed from the device tree. This can lead to duplicate nodes when
>   the post-migration device tree update code replaces cache nodes.
>
> This is long-standing behavior. Presumably it has gone mostly unnoticed
> because the two bugs have the property of obscuring each other in common
> simple scenarios (e.g. remove a CPU and add it back). Likely you'd notice
> only if you cared to inspect the device tree or the sysfs cacheinfo
> information.
>
> Booted with two processors:
>
> $ pwd
> /sys/firmware/devicetree/base/cpus
> $ ls -1d */
> l2-cache@2010/
> l2-cache@2011/
> l3-cache@3110/
> l3-cache@3111/
> PowerPC,POWER9@0/
> PowerPC,POWER9@8/
> $ lsprop */l2-cache
> l2-cache@2010/l2-cache 3110 (12560)
> l2-cache@2011/l2-cache 3111 (12561)
> PowerPC,POWER9@0/l2-cache 2010 (8208)
> PowerPC,POWER9@8/l2-cache 2011 (8209)
> $ ls /sys/devices/system/cpu/cpu0/cache/
> index0  index1  index2  index3
>
> After DLPAR-adding PowerPC,POWER9@10, we see that its associated cache
> nodes are absent, its threads' L2+L3 cacheinfo is unpopulated, and it is
> missing a cache level in its sched domain hierarchy:
>
> $ ls -1d */
> l2-cache@2010/
> l2-cache@2011/
> l3-cache@3110/
> l3-cache@3111/
> PowerPC,POWER9@0/
> PowerPC,POWER9@10/
> PowerPC,POWER9@8/
> $ lsprop PowerPC\,POWER9@10/l2-cache
> PowerPC,POWER9@10/l2-cache 2012 (8210)
> $ ls /sys/devices/system/cpu/cpu16/cache/
> index0  index1
> $ grep . /sys/kernel/debug/sched/domains/cpu{0,8,16}/domain*/name
> /sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
> /sys/kernel/debug/sched/domains/cpu0/domain1/name:CACHE
> /sys/kernel/debug/sched/domains/cpu0/domain2/name:DIE
> /sys/kernel/debug/sched/domains/cpu8/domain0/name:SMT
> /sys/kernel/debug/sched/domains/cpu8/domain1/name:CACHE
> /sys/kernel/debug/sched/domains/cpu8/domain2/name:DIE
> /sys/kernel/debug/sched/domains/cpu16/domain0/name:SMT
> /sys/kernel/debug/sched/domains/cpu16/domain1/name:DIE
>
> When removing PowerPC,POWER9@8, we see that its cache nodes are left
> behind:
>
> $ ls -1d */
> l2-cache@2010/
> l2-cache@2011/
> l3-cache@3110/
> l3-cache@3111/
> PowerPC,POWER9@0/
>
> When DLPAR is combined with VM migration, we can get duplicate nodes. E.g.
> removing one processor, then migrating, adding a processor, and then
> migrating again can result in warnings from the OF core during
> post-migration device tree updates:
>
> Duplicate name in cpus, renamed to "l2-cache@2011#1"
> Duplicate name in cpus, renamed to "l3-cache@3111#1"
>
> and nodes with duplicated phandles in the tree, making lookup behavior
> unpredictable:
>
> $ lsprop l[23]-cache@*/ibm,phandle
> l2-cache@2010/ibm,phandle 2010 (8208)
> l2-cache@2011#1/ibm,phandle 2011 (8209)
> l2-cache@2011/ibm,phandle 2011 (8209)
> l3-cache@3110/ibm,phandle 3110 (12560)
> l3-cache@3111#1/ibm,phandle 3111 (12561)
> l3-cache@3111/ibm,phandle 3111 (12561)
>
> Address these issues by:
>
> * Correctly processing siblings of the node returned from
>   dlpar_configure_connector().
>
> * Removing cache nodes in the CPU remove path when it can be determined
>   that they are not associated with other CPUs or caches.
>
> Use the of_changeset API in both cases, which allows us to keep the error
> handling in this code from becoming more complex while ensuring that the
> device tree cannot become inconsistent.
>
> Signed-off-by: Nathan Lynch
> Fixes: ac71380 ("powerpc/pseries: Add CPU dlpar remove functionality")
> Fixes: 90edf18 ("powerpc/pseries: Add CPU dlpar add functionality")
> ---
>  arch/powerpc/platforms/pseries/hotplug-cpu.c | 72 +++-
>  1 file changed, 70 insertions(+), 2 deletions(-)

Tested with a QEMU pseries guest, multiple CPU add/removals of the same CPU,
and no issues found with these new pseries_cpuhp* functions. Code LGTM as
well.

Reviewed-by: Daniel Henrique Barboza
Tested-by: Daniel Henrique Barboza

> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index d646c22e94ab..87a0fbe9cf12 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -521,6 +521,27 @@ static bool valid_cpu_dr
Re: [PATCH v7 0/6] Add support for FORM2 associativity
On 8/9/21 2:24 AM, Aneesh Kumar K.V wrote: Form2 associativity adds a much more flexible NUMA topology layout than what is provided by Form1. More details can be found in patch 7. $ numactl -H ... node distances: node 0 1 2 3 0: 10 11 222 33 1: 44 10 55 66 2: 77 88 10 99 3: 101 121 132 10 $ After DAX kmem memory add # numactl -H available: 5 nodes (0-4) ... node distances: node 0 1 2 3 4 0: 10 11 222 33 240 1: 44 10 55 66 255 2: 77 88 10 99 255 3: 101 121 132 10 230 4: 255 255 255 230 10 PAPR SCM now use the numa distance details to find the numa_node and target_node for the device. kvaneesh@ubuntu-guest:~$ ndctl list -N -v [ { "dev":"namespace0.0", "mode":"devdax", "map":"dev", "size":1071644672, "uuid":"d333d867-3f57-44c8-b386-d4d3abdc2bf2", "raw_uuid":"915361ad-fe6a-42dd-848f-d6dc9f5af362", "daxregion":{ "id":0, "size":1071644672, "devices":[ { "chardev":"dax0.0", "size":1071644672, "target_node":4, "mode":"devdax" } ] }, "align":2097152, "numa_node":3 } ] kvaneesh@ubuntu-guest:~$ The above output is with a Qemu command line -numa node,nodeid=4 \ -numa dist,src=0,dst=1,val=11 -numa dist,src=0,dst=2,val=222 -numa dist,src=0,dst=3,val=33 -numa dist,src=0,dst=4,val=240 \ -numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=55 -numa dist,src=1,dst=3,val=66 -numa dist,src=1,dst=4,val=255 \ -numa dist,src=2,dst=0,val=77 -numa dist,src=2,dst=1,val=88 -numa dist,src=2,dst=3,val=99 -numa dist,src=2,dst=4,val=255 \ -numa dist,src=3,dst=0,val=101 -numa dist,src=3,dst=1,val=121 -numa dist,src=3,dst=2,val=132 -numa dist,src=3,dst=4,val=230 \ -numa dist,src=4,dst=0,val=255 -numa dist,src=4,dst=1,val=255 -numa dist,src=4,dst=2,val=255 -numa dist,src=4,dst=3,val=230 \ -object memory-backend-file,id=memnvdimm1,prealloc=yes,mem-path=$PMEM_DISK,share=yes,size=${PMEM_SIZE} \ -device nvdimm,label-size=128K,memdev=memnvdimm1,id=nvdimm1,slot=4,uuid=72511b67-0b3b-42fd-8d1d-5be3cae8bcaa,node=4 Qemu changes can be found at 
https://lore.kernel.org/qemu-devel/20210616011944.2996399-1-danielhb...@gmail.com/

Series tested with the aforementioned QEMU build. All patches:

Tested-by: Daniel Henrique Barboza

Changes from v6:
* Address review feedback

Changes from v5:
* Fix build error reported by kernel test robot
* Address review feedback

Changes from v4:
* Drop the DLPAR-related device tree property for now, because both QEMU and PowerVM will provide the distance details of all possible NUMA nodes during boot.
* Rework numa distance code based on review feedback.

Changes from v3:
* Drop PAPR SCM specific changes and depend completely on NUMA distance information.

Changes from v2:
* Add nvdimm list to Cc:
* Update PATCH 8 commit message.

Changes from v1:
* Update FORM2 documentation.
* Rename max_domain_index to max_associativity_domain_index

Aneesh Kumar K.V (6):
  powerpc/pseries: rename min_common_depth to primary_domain_index
  powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
  powerpc/pseries: Consolidate different NUMA distance update code paths
  powerpc/pseries: Add a helper for form1 cpu distance
  powerpc/pseries: Add support for FORM2 associativity
  powerpc/pseries: Consolidate form1 distance initialization into a helper

 Documentation/powerpc/associativity.rst      | 103 +
 arch/powerpc/include/asm/firmware.h          |   7 +-
 arch/powerpc/include/asm/prom.h              |   3 +-
 arch/powerpc/include/asm/topology.h          |   6 +-
 arch/powerpc/kernel/prom_init.c              |   3 +-
 arch/powerpc/mm/numa.c                       | 433 ++
 arch/powerpc/platforms/pseries/firmware.c    |   3 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c |   2 +
 .../platforms/pseries/hotplug-memory.c       |   2 +
 arch/powerpc/platforms/pseries/lpar.c        |   4 +-
 10 files changed, 455 insertions(+), 111 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst
Re: [PATCH v5 0/6] Add support for FORM2 associativity
Aneesh,

This series compiles with a configuration made with "pseries_le_defconfig" but fails with a config based on an existing RHEL8 config. The reason, which is hinted at in the robot replies in patch 4, is that you defined "__vphn_get_associativity" inside a #ifdef CONFIG_PPC_SPLPAR guard but didn't define how the function would behave without the config, and you ended up using the function elsewhere.

This fixes the compilation, but I'm not sure if this is what you intended for this function:

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c68846fc9550..6e8551d16b7a 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -680,6 +680,11 @@ static int vphn_get_nid(long lcpu)
 }
 
 #else
+static int __vphn_get_associativity(long lcpu, __be32 *associativity)
+{
+	return -1;
+}
+
 static int vphn_get_nid(long unused)
 {
 	return NUMA_NO_NODE;

I'll post a new version of the QEMU FORM2 changes using these patches as is (with the above fixup), but I guess you'll want to post a v6.

Thanks,

Daniel

On 6/28/21 12:11 PM, Aneesh Kumar K.V wrote:

Form2 associativity adds a much more flexible NUMA topology layout than what is provided by Form1. More details can be found in patch 7.

$ numactl -H
...
node distances:
node   0    1    2    3
  0:  10   11  222   33
  1:  44   10   55   66
  2:  77   88   10   99
  3: 101  121  132   10
$

After DAX kmem memory add

# numactl -H
available: 5 nodes (0-4)
...
node distances:
node   0    1    2    3    4
  0:  10   11  222   33  240
  1:  44   10   55   66  255
  2:  77   88   10   99  255
  3: 101  121  132   10  230
  4: 255  255  255  230   10

PAPR SCM now uses the numa distance details to find the numa_node and target_node for the device.

kvaneesh@ubuntu-guest:~$ ndctl list -N -v
[
  {
    "dev":"namespace0.0",
    "mode":"devdax",
    "map":"dev",
    "size":1071644672,
    "uuid":"d333d867-3f57-44c8-b386-d4d3abdc2bf2",
    "raw_uuid":"915361ad-fe6a-42dd-848f-d6dc9f5af362",
    "daxregion":{
      "id":0,
      "size":1071644672,
      "devices":[
        {
          "chardev":"dax0.0",
          "size":1071644672,
          "target_node":4,
          "mode":"devdax"
        }
      ]
    },
    "align":2097152,
    "numa_node":3
  }
]
kvaneesh@ubuntu-guest:~$

The above output is with a Qemu command line

-numa node,nodeid=4 \
-numa dist,src=0,dst=1,val=11 -numa dist,src=0,dst=2,val=222 -numa dist,src=0,dst=3,val=33 -numa dist,src=0,dst=4,val=240 \
-numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=55 -numa dist,src=1,dst=3,val=66 -numa dist,src=1,dst=4,val=255 \
-numa dist,src=2,dst=0,val=77 -numa dist,src=2,dst=1,val=88 -numa dist,src=2,dst=3,val=99 -numa dist,src=2,dst=4,val=255 \
-numa dist,src=3,dst=0,val=101 -numa dist,src=3,dst=1,val=121 -numa dist,src=3,dst=2,val=132 -numa dist,src=3,dst=4,val=230 \
-numa dist,src=4,dst=0,val=255 -numa dist,src=4,dst=1,val=255 -numa dist,src=4,dst=2,val=255 -numa dist,src=4,dst=3,val=230 \
-object memory-backend-file,id=memnvdimm1,prealloc=yes,mem-path=$PMEM_DISK,share=yes,size=${PMEM_SIZE} \
-device nvdimm,label-size=128K,memdev=memnvdimm1,id=nvdimm1,slot=4,uuid=72511b67-0b3b-42fd-8d1d-5be3cae8bcaa,node=4

Qemu changes can be found at
https://lore.kernel.org/qemu-devel/20210616011944.2996399-1-danielhb...@gmail.com/

Changes from v4:
* Drop the DLPAR-related device tree property for now, because both QEMU and PowerVM will provide the distance details of all possible NUMA nodes during boot.
* Rework numa distance code based on review feedback.

Changes from v3:
* Drop PAPR SCM specific changes and depend completely on NUMA distance information.

Changes from v2:
* Add nvdimm list to Cc:
* Update PATCH 8 commit message.

Changes from v1:
* Update FORM2 documentation.
* Rename max_domain_index to max_associativity_domain_index

Aneesh Kumar K.V (6):
  powerpc/pseries: rename min_common_depth to primary_domain_index
  powerpc/pseries: rename distance_ref_points_depth to max_associativity_domain_index
  powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
  powerpc/pseries: Consolidate different NUMA distance update code paths
  powerpc/pseries: Add a helper for form1 cpu distance
  powerpc/pseries: Add support for FORM2 associativity

 Documentation/powerpc/associativity.rst      | 103 +
 arch/powerpc/include/asm/firmware.h          |   7 +-
 arch/powerpc/include/asm/prom.h              |   3 +-
 arch/powerpc/include/asm/topology.h          |   4 +-
 arch/powerpc/kernel/prom_init.c              |   3 +-
 arch/powerpc/mm/numa.c                       | 415 +-
 arch/powerpc/platforms/pseries/firmware.c    |   3 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c |   2 +
 .../platforms/pseries/hotplug-memory.c       |   2 +
 arch/powerpc/platforms/pseries/lpar.c        |   4 +-
 arch/powerpc/platforms/pseries/pseries.h     |   1 +
 11 files changed, 432 insertions(+), 115 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst
Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity
On 6/22/21 9:07 AM, Aneesh Kumar K.V wrote:

Daniel Henrique Barboza writes:

On 6/17/21 1:51 PM, Aneesh Kumar K.V wrote:

The PAPR interface currently supports two different ways of communicating resource grouping details to the OS. These are referred to as Form 0 and Form 1 associativity grouping. Form 0 is the older format and is now considered deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza
Signed-off-by: Aneesh Kumar K.V
---
 Documentation/powerpc/associativity.rst   | 135
 arch/powerpc/include/asm/firmware.h       |   3 +-
 arch/powerpc/include/asm/prom.h           |   1 +
 arch/powerpc/kernel/prom_init.c           |   3 +-
 arch/powerpc/mm/numa.c                    | 149 +-
 arch/powerpc/platforms/pseries/firmware.c |   1 +
 6 files changed, 286 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
new file mode 100644
index ..93be604ac54d
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,135 @@
+===========================
+NUMA resource associativity
+===========================
+
+Associativity represents the groupings of the various platform resources into
+domains of substantially similar mean performance relative to resources outside
+of that domain. Resource subsets of a given domain that exhibit better
+performance relative to each other than relative to other resource subsets
+are represented as being members of a sub-grouping domain. This performance
+characteristic is presented in terms of NUMA node distance within the Linux kernel.
+From the platform view, these groups are also referred to as domains.
+
+The PAPR interface currently supports different ways of communicating these resource
+grouping details to the OS. These are referred to as Form 0, Form 1 and Form 2
+associativity grouping. Form 0 is the older format and is now considered deprecated.
+
+The hypervisor indicates the type/form of associativity used via the "ibm,architecture-vec-5" property.
+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+------
+Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
+
+Form 1
+------
+With Form 1 a combination of the ibm,associativity-reference-points and ibm,associativity
+device tree properties is used to determine the NUMA distance between resource groups/domains.
+
+The "ibm,associativity" property contains one or more lists of numbers (domainIDs)
+representing the resource's platform grouping domains.
+
+The "ibm,associativity-reference-points" property contains one or more lists of numbers
+(domainID indexes) that represent the 1-based ordinal in the associativity lists.
+The list of domainID indexes represents an increasing hierarchy of resource grouping.
+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID index.. }
+
+The Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
+The Linux kernel computes the NUMA distance between two domains by recursively comparing
+whether they belong to the same higher-level domains. For a mismatch at every higher
+level of the resource group, the kernel doubles the NUMA distance between the
+comparing domains.
+
+Form 2
+------
+The Form 2 associativity format adds separate device tree properties representing NUMA node distance,
+thereby making the node distance computation flexible. Form 2 also allows flexible primary
+domain numbering. With NUMA distance computation now detached from the index value of the
+"ibm,associativity" property, Form 2 allows a large number of primary domain ids at the
+same domainID index representing resource groups of different performance/latency characteristics.
+
+The hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
+"ibm,architecture-vec-5" property.
+
+The "ibm,numa-lookup-index-table" property contains one or more lists of numbers representing
+the domainIDs present in the system. The offset of a domainID in this property is considered
+its domainID index.
+
+prop-encoded-array: The number N of domainIDs encoded as with encode-int, followed by
+N domainIDs encoded as with encode-int
+
+For ex:
+ibm,numa-lookup-index-table = {4, 0, 8, 250, 252}; the domainID index for domainID 8 is 1.
+
+The "ibm,numa-distance-table" property contains one or more lists of numbers representing the NUMA
+distances between the resource groups/domains present in the system.
+
+prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
+N distance values encoded as with encode-bytes
[PATCH 3/3] powerpc/pseries: fail quicker in dlpar_memory_add_by_ic()
The validation done at the start of dlpar_memory_add_by_ic() is an all-or-nothing scenario: if any LMB in the range is marked as RESERVED we can fail right away.

We can then remove the 'lmbs_available' var and its check against 'lmbs_to_add', since the whole LMB range was already validated in the previous step.

Signed-off-by: Daniel Henrique Barboza
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c0a03e1537cb..377d852f5a9a 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -796,7 +796,6 @@ static int dlpar_memory_add_by_index(u32 drc_index)
 static int dlpar_memory_add_by_ic(u32 lmbs_to_add, u32 drc_index)
 {
 	struct drmem_lmb *lmb, *start_lmb, *end_lmb;
-	int lmbs_available = 0;
 	int rc;
 
 	pr_info("Attempting to hot-add %u LMB(s) at index %x\n",
@@ -811,15 +810,14 @@ static int dlpar_memory_add_by_ic(u32 lmbs_to_add, u32 drc_index)
 
 	/* Validate that the LMBs in this range are not reserved */
 	for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
-		if (lmb->flags & DRCONF_MEM_RESERVED)
-			break;
-
-		lmbs_available++;
+		/* Fail immediately if the whole range can't be hot-added */
+		if (lmb->flags & DRCONF_MEM_RESERVED) {
+			pr_err("Memory at %llx (drc index %x) is reserved\n",
+			       lmb->base_addr, lmb->drc_index);
+			return -EINVAL;
+		}
 	}
 
-	if (lmbs_available < lmbs_to_add)
-		return -EINVAL;
-
 	for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
 		if (lmb->flags & DRCONF_MEM_ASSIGNED)
 			continue;
-- 
2.31.1
[PATCH 2/3] powerpc/pseries: break early in dlpar_memory_add_by_count() loops
After a successful dlpar_add_lmb() call the LMB is marked as reserved. Later on, depending on whether we added enough LMBs or not, we rely on the marked LMBs to see which ones might need to be removed, and we remove the reservation of all of them.

These are done in for_each_drmem_lmb() loops without any break condition. This means that we're going to check all LMBs of the partition even after going through all the reserved ones. This patch adds break conditions in both loops to avoid this. The 'lmbs_added' variable was renamed to 'lmbs_reserved', and it's now being decremented each time an LMB reservation is removed, indicating whether there are still marked LMBs to be processed.

Signed-off-by: Daniel Henrique Barboza
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 28a7fd90232f..c0a03e1537cb 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -673,7 +673,7 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
 {
 	struct drmem_lmb *lmb;
 	int lmbs_available = 0;
-	int lmbs_added = 0;
+	int lmbs_reserved = 0;
 	int rc;
 
 	pr_info("Attempting to hot-add %d LMB(s)\n", lmbs_to_add);
@@ -714,13 +714,12 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
 	 * requested LMBs cannot be added.
 	 */
 		drmem_mark_lmb_reserved(lmb);
-
-		lmbs_added++;
-		if (lmbs_added == lmbs_to_add)
+		lmbs_reserved++;
+		if (lmbs_reserved == lmbs_to_add)
 			break;
 	}
 
-	if (lmbs_added != lmbs_to_add) {
+	if (lmbs_reserved != lmbs_to_add) {
 		pr_err("Memory hot-add failed, removing any added LMBs\n");
 
 		for_each_drmem_lmb(lmb) {
@@ -735,6 +734,10 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
 			dlpar_release_drc(lmb->drc_index);
 
 			drmem_remove_lmb_reservation(lmb);
+			lmbs_reserved--;
+
+			if (lmbs_reserved == 0)
+				break;
 		}
 
 		rc = -EINVAL;
 	} else {
@@ -745,6 +748,10 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
 			pr_debug("Memory at %llx (drc index %x) was hot-added\n",
 				 lmb->base_addr, lmb->drc_index);
 			drmem_remove_lmb_reservation(lmb);
+			lmbs_reserved--;
+
+			if (lmbs_reserved == 0)
+				break;
 		}
 
 		rc = 0;
	}
-- 
2.31.1
[PATCH 0/3] powerpc/pseries: cleanups for dlpar_memory_add* functions
Hi,

These are a couple of cleanups for the dlpar_memory_add* functions that are similar to those I did a month or so ago in dlpar_memory_remove_by_count and dlpar_memory_remove_by_ic.

Daniel Henrique Barboza (3):
  powerpc/pseries: skip reserved LMBs in dlpar_memory_add_by_count()
  powerpc/pseries: break early in dlpar_memory_add_by_count() loops
  powerpc/pseries: fail quicker in dlpar_memory_add_by_ic()

 .../platforms/pseries/hotplug-memory.c | 34 ---
 1 file changed, 21 insertions(+), 13 deletions(-)

-- 
2.31.1
[PATCH 1/3] powerpc/pseries: skip reserved LMBs in dlpar_memory_add_by_count()
The function is counting reserved LMBs as available to be added, but they aren't. This will cause the function to miscalculate the available LMBs and can trigger errors later on when executing dlpar_add_lmb().

Signed-off-by: Daniel Henrique Barboza
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 36f66556a7c6..28a7fd90232f 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -683,6 +683,9 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
 
 	/* Validate that there are enough LMBs to satisfy the request */
 	for_each_drmem_lmb(lmb) {
+		if (lmb->flags & DRCONF_MEM_RESERVED)
+			continue;
+
 		if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
 			lmbs_available++;
-- 
2.31.1
Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity
On 6/17/21 1:51 PM, Aneesh Kumar K.V wrote:

The PAPR interface currently supports two different ways of communicating resource grouping details to the OS. These are referred to as Form 0 and Form 1 associativity grouping. Form 0 is the older format and is now considered deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza
Signed-off-by: Aneesh Kumar K.V
---
 Documentation/powerpc/associativity.rst   | 135
 arch/powerpc/include/asm/firmware.h       |   3 +-
 arch/powerpc/include/asm/prom.h           |   1 +
 arch/powerpc/kernel/prom_init.c           |   3 +-
 arch/powerpc/mm/numa.c                    | 149 +-
 arch/powerpc/platforms/pseries/firmware.c |   1 +
 6 files changed, 286 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
new file mode 100644
index ..93be604ac54d
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,135 @@
+===========================
+NUMA resource associativity
+===========================
+
+Associativity represents the groupings of the various platform resources into
+domains of substantially similar mean performance relative to resources outside
+of that domain. Resource subsets of a given domain that exhibit better
+performance relative to each other than relative to other resource subsets
+are represented as being members of a sub-grouping domain. This performance
+characteristic is presented in terms of NUMA node distance within the Linux kernel.
+From the platform view, these groups are also referred to as domains.
+
+The PAPR interface currently supports different ways of communicating these resource
+grouping details to the OS. These are referred to as Form 0, Form 1 and Form 2
+associativity grouping. Form 0 is the older format and is now considered deprecated.
+
+The hypervisor indicates the type/form of associativity used via the "ibm,architecture-vec-5" property.
+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+------
+Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
+
+Form 1
+------
+With Form 1 a combination of the ibm,associativity-reference-points and ibm,associativity
+device tree properties is used to determine the NUMA distance between resource groups/domains.
+
+The "ibm,associativity" property contains one or more lists of numbers (domainIDs)
+representing the resource's platform grouping domains.
+
+The "ibm,associativity-reference-points" property contains one or more lists of numbers
+(domainID indexes) that represent the 1-based ordinal in the associativity lists.
+The list of domainID indexes represents an increasing hierarchy of resource grouping.
+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID index.. }
+
+The Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
+The Linux kernel computes the NUMA distance between two domains by recursively comparing
+whether they belong to the same higher-level domains. For a mismatch at every higher
+level of the resource group, the kernel doubles the NUMA distance between the
+comparing domains.
+
+Form 2
+------
+The Form 2 associativity format adds separate device tree properties representing NUMA node distance,
+thereby making the node distance computation flexible. Form 2 also allows flexible primary
+domain numbering. With NUMA distance computation now detached from the index value of the
+"ibm,associativity" property, Form 2 allows a large number of primary domain ids at the
+same domainID index representing resource groups of different performance/latency characteristics.
+
+The hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
+"ibm,architecture-vec-5" property.
+
+The "ibm,numa-lookup-index-table" property contains one or more lists of numbers representing
+the domainIDs present in the system. The offset of a domainID in this property is considered
+its domainID index.
+
+prop-encoded-array: The number N of domainIDs encoded as with encode-int, followed by
+N domainIDs encoded as with encode-int
+
+For ex:
+ibm,numa-lookup-index-table = {4, 0, 8, 250, 252}; the domainID index for domainID 8 is 1.
+
+The "ibm,numa-distance-table" property contains one or more lists of numbers representing the NUMA
+distances between the resource groups/domains present in the system.
+
+prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
+N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
+
+For ex:
+ibm,numa-lookup-index-tab
Re: [RFC PATCH 8/8] powerpc/papr_scm: Use FORM2 associativity details
On 6/17/21 8:11 AM, Aneesh Kumar K.V wrote:

Daniel Henrique Barboza writes:

On 6/17/21 4:46 AM, David Gibson wrote:

On Tue, Jun 15, 2021 at 12:35:17PM +0530, Aneesh Kumar K.V wrote:

David Gibson writes:

On Tue, Jun 15, 2021 at 11:27:50AM +0530, Aneesh Kumar K.V wrote:

David Gibson writes:

On Mon, Jun 14, 2021 at 10:10:03PM +0530, Aneesh Kumar K.V wrote:

FORM2 introduces a concept of secondary domain which is identical to the concept of FORM1 primary domain. Use the secondary domain as the numa node when using a persistent memory device. For DAX kmem use the logical domain id introduced in FORM2. This new numa node

Signed-off-by: Aneesh Kumar K.V
---
 arch/powerpc/mm/numa.c                    | 28 +++
 arch/powerpc/platforms/pseries/papr_scm.c | 26 +
 arch/powerpc/platforms/pseries/pseries.h  |  1 +
 3 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 86cd2af014f7..b9ac6d02e944 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -265,6 +265,34 @@ static int associativity_to_nid(const __be32 *associativity)
 	return nid;
 }
 
+int get_primary_and_secondary_domain(struct device_node *node, int *primary, int *secondary)
+{
+	int secondary_index;
+	const __be32 *associativity;
+
+	if (!numa_enabled) {
+		*primary = NUMA_NO_NODE;
+		*secondary = NUMA_NO_NODE;
+		return 0;
+	}
+
+	associativity = of_get_associativity(node);
+	if (!associativity)
+		return -ENODEV;
+
+	if (of_read_number(associativity, 1) >= primary_domain_index) {
+		*primary = of_read_number(&associativity[primary_domain_index], 1);
+		secondary_index = of_read_number(&distance_ref_points[1], 1);

Secondary ID is always the second reference point, but primary depends on the length of resources? That seems very weird.

primary_domain_index is distance_ref_point[0]. With Form2 we would find both the primary and secondary domain ID the same for all resources other than persistent memory devices. The usage w.r.t. persistent memory is explained in patch 7.
Right, I misunderstood.

With Form2 the primary domainID and secondary domainID are used to identify the NUMA nodes the kernel should use when using persistent memory devices.

This seems kind of bogus. With Form1, the primary/secondary ID are a sort of hierarchy of distance (things with the same primary ID are very close, things with the same secondary are kinda-close, etc.). With Form2, it's referring to their effective node for different purposes. Using the same terms for different meanings seems unnecessarily confusing.

They are essentially domainIDs. The interpretation of them is different between Form1 and Form2. Hence I kept referring to them as primary and secondary domainID. Any suggestion on what to name them with Form2?

My point is that reusing associativity-reference-points for something with completely unrelated semantics seems like a very poor choice.

I agree that this reuse can be confusing. I could argue that there is precedent for that in PAPR - FORM0 puts a different spin on the same property as well - but there is no need to keep following existing PAPR practices in a new spec (and some might argue it's best not to).

As far as QEMU goes, renaming this property to "numa-associativity-mode" (just an example) is a quick change to do since we separated FORM1 and FORM2 code over there. Doing such a rename can also help with the issue of having to describe new FORM2 semantics using "least significant boundary" or "primary domain" or any FORM0|FORM1 related terminology.

It is not just changing the name, we will then have to explain the meaning of ibm,associativity-reference-points with FORM2, right?

Hmm, why? My idea over there was to add a new property that indicates that a resource might have a different NUMA affinity based on the mode of operation (like PMEM), and get rid of ibm,associativity-reference-points altogether. The NUMA distances already express the topology. Closer distances indicate closer proximity, larger distances indicate otherwise.
Having "associativity-reference-points" reflect an associativity domain relationship, when you already have all the distances from each node, is somewhat redundant.

The concept of 'associativity domain' was necessary in FORM1 because we had no other way of telling the distance between NUMA nodes. We needed to rely on these overly complex and convoluted subdomain abstractions to say that "nodeA belongs to the same third-level domain as node B, and is in the second-level domain with node C". The kernel would read that and calculate that each level doubles the distance from the level before, and local_distance is 10, so:

distAA = 10
distAB = 20
distAC = 40

With FORM2, if this i
Re: [RFC PATCH 8/8] powerpc/papr_scm: Use FORM2 associativity details
On 6/17/21 4:46 AM, David Gibson wrote:

On Tue, Jun 15, 2021 at 12:35:17PM +0530, Aneesh Kumar K.V wrote:

David Gibson writes:

On Tue, Jun 15, 2021 at 11:27:50AM +0530, Aneesh Kumar K.V wrote:

David Gibson writes:

On Mon, Jun 14, 2021 at 10:10:03PM +0530, Aneesh Kumar K.V wrote:

FORM2 introduces a concept of secondary domain which is identical to the concept of FORM1 primary domain. Use the secondary domain as the numa node when using a persistent memory device. For DAX kmem use the logical domain id introduced in FORM2. This new numa node

Signed-off-by: Aneesh Kumar K.V
---
 arch/powerpc/mm/numa.c                    | 28 +++
 arch/powerpc/platforms/pseries/papr_scm.c | 26 +
 arch/powerpc/platforms/pseries/pseries.h  |  1 +
 3 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 86cd2af014f7..b9ac6d02e944 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -265,6 +265,34 @@ static int associativity_to_nid(const __be32 *associativity)
 	return nid;
 }
 
+int get_primary_and_secondary_domain(struct device_node *node, int *primary, int *secondary)
+{
+	int secondary_index;
+	const __be32 *associativity;
+
+	if (!numa_enabled) {
+		*primary = NUMA_NO_NODE;
+		*secondary = NUMA_NO_NODE;
+		return 0;
+	}
+
+	associativity = of_get_associativity(node);
+	if (!associativity)
+		return -ENODEV;
+
+	if (of_read_number(associativity, 1) >= primary_domain_index) {
+		*primary = of_read_number(&associativity[primary_domain_index], 1);
+		secondary_index = of_read_number(&distance_ref_points[1], 1);

Secondary ID is always the second reference point, but primary depends on the length of resources? That seems very weird.

primary_domain_index is distance_ref_point[0]. With Form2 we would find both the primary and secondary domain ID the same for all resources other than persistent memory devices. The usage w.r.t. persistent memory is explained in patch 7.
Right, I misunderstood.

With Form2 the primary domainID and secondary domainID are used to identify the NUMA nodes the kernel should use when using persistent memory devices.

This seems kind of bogus. With Form1, the primary/secondary ID are a sort of hierarchy of distance (things with the same primary ID are very close, things with the same secondary are kinda-close, etc.). With Form2, it's referring to their effective node for different purposes. Using the same terms for different meanings seems unnecessarily confusing.

They are essentially domainIDs. The interpretation of them is different between Form1 and Form2. Hence I kept referring to them as primary and secondary domainID. Any suggestion on what to name them with Form2?

My point is that reusing associativity-reference-points for something with completely unrelated semantics seems like a very poor choice.

I agree that this reuse can be confusing. I could argue that there is precedent for that in PAPR - FORM0 puts a different spin on the same property as well - but there is no need to keep following existing PAPR practices in a new spec (and some might argue it's best not to).

As far as QEMU goes, renaming this property to "numa-associativity-mode" (just an example) is a quick change to do since we separated FORM1 and FORM2 code over there. Doing such a rename can also help with the issue of having to describe new FORM2 semantics using "least significant boundary" or "primary domain" or any FORM0|FORM1 related terminology.

Thanks,

Daniel

Persistent memory devices can also be used as regular memory using the DAX KMEM driver, and the primary domainID indicates the numa node number the OS should use when using these devices as regular memory. The secondary domainID is the numa node number that should be used when using this device as persistent memory.

It's weird to me that you'd want to consider them in different nodes for those different purposes.
 ---------------------
| NUMA node0          |
|                     |
| ProcA ---> MEMA     |
|         |           |
|         ---> PMEMB  |
|                     |
 ---------------------

 ---------------------
| NUMA node1          |
|                     |
| ProcB ---> MEMC     |
|         |           |
|         ---> PMEMD  |
|                     |
 ---------------------

For a topology like the above, an application running on ProcA wants to find out the persistent memory mount local to its NUMA node. Hence when using it as pmem fsdax
Re: [RFC PATCH 0/8] Add support for FORM2 associativity
On 6/14/21 1:39 PM, Aneesh Kumar K.V wrote:

Form2 associativity adds a much more flexible NUMA topology layout than what is provided by Form1. This also allows the PAPR SCM device to use better associativity when using the device as a DAX KMEM device. More details can be found in patch x

$ ndctl list -N -v
[
  {
    "dev":"namespace0.0",
    "mode":"devdax",
    "map":"dev",
    "size":1071644672,
    "uuid":"37dea198-ddb5-4e42-915a-99a915e24188",
    "raw_uuid":"148deeaa-4a2f-41d1-8d74-fd9a942d26ba",
    "daxregion":{
      "id":0,
      "size":1071644672,
      "devices":[
        {
          "chardev":"dax0.0",
          "size":1071644672,
          "target_node":4,
          "mode":"devdax"
        }
      ]
    },
    "align":2097152,
    "numa_node":1
  }
]

$ numactl -H
...
node distances:
node   0    1    2    3
  0:  10   11  222   33
  1:  44   10   55   66
  2:  77   88   10   99
  3: 101  121  132   10
$

After DAX KMEM

# numactl -H
available: 5 nodes (0-4)
...
node distances:
node   0    1    2    3    4
  0:  10   11   22   33  255
  1:  44   10   55   66  255
  2:  77   88   10   99  255
  3: 101  121  132   10  255
  4: 255  255  255  255   10
#

The above output is with a Qemu command line

For reference, this QEMU:

https://github.com/danielhb/qemu/tree/form2_affinity_v1
https://lists.gnu.org/archive/html/qemu-devel/2021-06/msg03617.html

but ...
-numa node,nodeid=4 \
-numa dist,src=0,dst=1,val=11 -numa dist,src=0,dst=2,val=22 -numa dist,src=0,dst=3,val=33 -numa dist,src=0,dst=4,val=255 \
-numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=55 -numa dist,src=1,dst=3,val=66 -numa dist,src=1,dst=4,val=255 \
-numa dist,src=2,dst=0,val=77 -numa dist,src=2,dst=1,val=88 -numa dist,src=2,dst=3,val=99 -numa dist,src=2,dst=4,val=255 \
-numa dist,src=3,dst=0,val=101 -numa dist,src=3,dst=1,val=121 -numa dist,src=3,dst=2,val=132 -numa dist,src=3,dst=4,val=255 \
-numa dist,src=4,dst=0,val=255 -numa dist,src=4,dst=1,val=255 -numa dist,src=4,dst=2,val=255 -numa dist,src=4,dst=3,val=255 \
-object memory-backend-file,id=memnvdimm1,prealloc=yes,mem-path=$PMEM_DISK,share=yes,size=${PMEM_SIZE} \
-device nvdimm,label-size=128K,memdev=memnvdimm1,id=nvdimm1,slot=4,uuid=72511b67-0b3b-42fd-8d1d-5be3cae8bcaa,node=4,persistent-nodeid=1

with 'device-node=1' instead of 'persistent-nodeid=1' in the nvdimm parameter up here.

Aneesh Kumar K.V (8):
  powerpc/pseries: rename min_common_depth to primary_domain_index
  powerpc/pseries: rename distance_ref_points_depth to max_domain_index
  powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
  powerpc/pseries: Consolidate DLPAR NUMA distance update
  powerpc/pseries: Consolidate NUMA distance update during boot
  powerpc/pseries: Add a helper for form1 cpu distance
  powerpc/pseries: Add support for FORM2 associativity
  powerpc/papr_scm: Use FORM2 associativity details

Series:

Tested-by: Daniel Henrique Barboza

 Documentation/powerpc/associativity.rst      | 139 ++
 arch/powerpc/include/asm/firmware.h          |   7 -
 arch/powerpc/include/asm/prom.h              |   3 +-
 arch/powerpc/kernel/prom_init.c              |   3 +-
 arch/powerpc/mm/numa.c                       | 436 ++
 arch/powerpc/platforms/pseries/firmware.c    |   3 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c |   2 +
 .../platforms/pseries/hotplug-memory.c       |   2 +
 arch/powerpc/platforms/pseries/papr_scm.c    |  26 +-
 arch/powerpc/platforms/pseries/pseries.h     |   2 +
 10 files changed, 522 insertions(+), 101 deletions(-)
create mode 100644 Documentation/powerpc/associativity.rst
Re: [PATCH -next] powerpc/pseries/memhotplug: Remove unused inline function dlpar_memory_remove()
On 5/14/21 4:10 AM, YueHaibing wrote: dlpar_memory_remove() is never used, so can be removed. Signed-off-by: YueHaibing --- Reviewed-by: Daniel Henrique Barboza arch/powerpc/platforms/pseries/hotplug-memory.c | 4 1 file changed, 4 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 8377f1f7c78e..3d93f1c48e23 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -585,10 +585,6 @@ static inline int pseries_remove_mem_node(struct device_node *np) { return 0; } -static inline int dlpar_memory_remove(struct pseries_hp_errorlog *hp_elog) -{ - return -EOPNOTSUPP; -} static int dlpar_remove_lmb(struct drmem_lmb *lmb) { return -EOPNOTSUPP;
Re: [PATCH 2/3] hotplug-memory.c: enhance dlpar_memory_remove* LMB checks
On 5/3/21 10:02 PM, David Gibson wrote: On Fri, Apr 30, 2021 at 09:09:16AM -0300, Daniel Henrique Barboza wrote: dlpar_memory_remove_by_ic() validates the amount of LMBs to be removed by checking !DRCONF_MEM_RESERVED, and in the following loop before dlpar_remove_lmb() a check for DRCONF_MEM_ASSIGNED is made before removing it. This means that a LMB that is both !DRCONF_MEM_RESERVED and !DRCONF_MEM_ASSIGNED will be counted as valid, but then not being removed. The function will end up not removing all 'lmbs_to_remove' LMBs while also not reporting any errors. Comparing it to dlpar_memory_remove_by_count(), the validation is done via lmb_is_removable(), which checks for DRCONF_MEM_ASSIGNED and fadump constraints. No additional check is made afterwards, and DRCONF_MEM_RESERVED is never checked before dlpar_remove_lmb(). The function doesn't have the same 'check A for validation, then B for removal' issue as remove_by_ic(), but it's not checking if the LMB is reserved. There is no reason for these functions to validate the same operation in two different manners. Actually, I think there is: remove_by_ic() is handling a request to remove a specific range of LMBs. If any are reserved, they can't be removed and so this needs to fail. But if they are !ASSIGNED, that essentially means they're *already* removed (or never added), so "removing" them is, correctly, a no-op. remove_by_count(), in contrast, is being asked to remove a fixed number of LMBs from wherever they can be found, and for that it needs to find LMBs that haven't already been removed. Basically remove_by_ic() is an absolute request: "make this set of LMBs be not-plugged", whereas remove_by_count() is a relative request "make N less LMBs be plugged". So I think remove_by_ic()s existing handling is correct. I'm less sure if remove_by_count() ignoring RESERVED is correct - I couldn't quickly find under what circumstances RESERVED gets set. RESERVED is never set by the kernel. 
It is written in the DT by the firmware/hypervisor and the kernel just checks its value. QEMU sets it in spapr_dt_dynamic_memory() with the following comment:

        /*
         * LMB information for RMA, boot time RAM and gap b/n RAM and
         * device memory region -- all these are marked as reserved
         * and as having no valid DRC.
         */
        dynamic_memory[0] = cpu_to_be32(addr >> 32);
        dynamic_memory[1] = cpu_to_be32(addr & 0x);
        dynamic_memory[2] = cpu_to_be32(0);
        dynamic_memory[3] = cpu_to_be32(0); /* reserved */
        dynamic_memory[4] = cpu_to_be32(-1);
        dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_RESERVED |
                                        SPAPR_LMB_FLAGS_DRC_INVALID);

The flag is formally described in LOPAR section 4.2.8, "Reserved Memory": "Memory nodes marked with the special value of the “status” property of “reserved” is not to be used or altered by the base OS." This makes me confident that we should check DRCONF_MEM_RESERVED in remove_by_count() as well, since phyp needs to adhere to these semantics and shouldn't be able to remove a LMB marked as RESERVED.

Thanks, Daniel

This patch addresses that by changing lmb_is_removable() to also check for DRCONF_MEM_RESERVED to tell if a lmb is removable, making dlpar_memory_remove_by_count() take the reservation state into account when counting the LMBs. lmb_is_removable() is then used in the validation step of dlpar_memory_remove_by_ic(), which is already checking for both states but in different stages, to avoid counting a LMB that is not assigned as eligible for removal. We can then skip the check before dlpar_remove_lmb() since we're validating all LMBs beforehand. 
Signed-off-by: Daniel Henrique Barboza --- arch/powerpc/platforms/pseries/hotplug-memory.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index bb98574a84a2..4e6d162c3f1a 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -348,7 +348,8 @@ static int pseries_remove_mem_node(struct device_node *np) static bool lmb_is_removable(struct drmem_lmb *lmb) { - if (!(lmb->flags & DRCONF_MEM_ASSIGNED)) + if ((lmb->flags & DRCONF_MEM_RESERVED) || + !(lmb->flags & DRCONF_MEM_ASSIGNED)) return false; #ifdef CONFIG_FA_DUMP @@ -523,7 +524,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index) /* Validate that there are enough LMBs to satisfy the request */ for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) { - if (lmb->flags & DRCONF_MEM_RESERVED) +
[PATCH v2 4/4] powerpc/pseries: minor enhancements in dlpar_memory_remove_by_ic()
We don't need the 'lmbs_available' variable to count the valid LMBs and to check if we have fewer than 'lmbs_to_remove'. We must ensure that the entire LMB range is removed, so we can error out immediately if any LMB in the range is marked as reserved. Add a couple of comments explaining the reasoning behind the differences we have in this function in contrast to what is done in its sister function, dlpar_memory_remove_by_count(). Signed-off-by: Daniel Henrique Barboza --- .../platforms/pseries/hotplug-memory.c| 28 +-- 1 file changed, 19 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 3c7ce5361ce3..ee88c1540fba 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -517,7 +517,6 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index) { struct drmem_lmb *lmb, *start_lmb, *end_lmb; - int lmbs_available = 0; int rc; pr_info("Attempting to hot-remove %u LMB(s) at %x\n", @@ -530,18 +529,29 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index) if (rc) return -EINVAL; - /* Validate that there are enough LMBs to satisfy the request */ + /* +* Validate that all LMBs in range are not reserved. Note that it +* is ok if they are !ASSIGNED since our goal here is to remove the +* LMB range, regardless of whether some LMBs were already removed +* by any other reason. +* +* This is a contrast to what is done in remove_by_count() where we +* check for both RESERVED and !ASSIGNED (via lmb_is_removable()), +* because we want to remove a fixed amount of LMBs in that function. 
+*/ for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) { - if (lmb->flags & DRCONF_MEM_RESERVED) - break; - - lmbs_available++; + if (lmb->flags & DRCONF_MEM_RESERVED) { + pr_err("Memory at %llx (drc index %x) is reserved\n", + lmb->base_addr, lmb->drc_index); + return -EINVAL; + } } - if (lmbs_available < lmbs_to_remove) - return -EINVAL; - for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) { + /* +* dlpar_remove_lmb() will error out if the LMB is already +* !ASSIGNED, but this case is a no-op for us. +*/ if (!(lmb->flags & DRCONF_MEM_ASSIGNED)) continue; -- 2.31.1
[PATCH v2 3/4] powerpc/pseries: break early in dlpar_memory_remove_by_count() loops
After marking the LMBs as reserved depending on dlpar_remove_lmb() rc, we evaluate whether we need to add the LMBs back or if we can release the LMB DRCs. In both cases, a for_each_drmem_lmb() loop without a break condition is used. This means that we're going to cycle through all LMBs of the partition even after we're done with what we were going to do. This patch adds break conditions in both loops to avoid this. The 'lmbs_removed' variable was renamed to 'lmbs_reserved', and it's now being decremented each time a lmb reservation is removed, indicating that the operation we're doing (adding back LMBs or releasing DRCs) is completed. Signed-off-by: Daniel Henrique Barboza --- arch/powerpc/platforms/pseries/hotplug-memory.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index c21d9278c1ce..3c7ce5361ce3 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -402,7 +402,7 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb) static int dlpar_memory_remove_by_count(u32 lmbs_to_remove) { struct drmem_lmb *lmb; - int lmbs_removed = 0; + int lmbs_reserved = 0; int lmbs_available = 0; int rc; @@ -436,12 +436,12 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove) */ drmem_mark_lmb_reserved(lmb); - lmbs_removed++; - if (lmbs_removed == lmbs_to_remove) + lmbs_reserved++; + if (lmbs_reserved == lmbs_to_remove) break; } - if (lmbs_removed != lmbs_to_remove) { + if (lmbs_reserved != lmbs_to_remove) { pr_err("Memory hot-remove failed, adding LMB's back\n"); for_each_drmem_lmb(lmb) { @@ -454,6 +454,10 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove) lmb->drc_index); drmem_remove_lmb_reservation(lmb); + + lmbs_reserved--; + if (lmbs_reserved == 0) + break; } rc = -EINVAL; @@ -467,6 +471,10 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove) lmb->base_addr); 
drmem_remove_lmb_reservation(lmb); + + lmbs_reserved--; + if (lmbs_reserved == 0) + break; } rc = 0; } -- 2.31.1
[PATCH v2 0/4] Unisolate LMBs DRC on removal error + cleanups
changes from v1: - patch 1: added David's r-b - patch 2: * removed the DRCONF_MEM_RESERVED assumption for dlpar_memory_remove_by_ic() * reworded the commit msg - patch 3: dropped. the differences between dlpar_memory_remove_by_ic() and dlpar_memory_remove_by_count() makes a helper function too complex to handle both cases - (new) patch 3 and 4: minor enhancements v1 link: https://lore.kernel.org/linuxppc-dev/20210430120917.217951-1-danielhb...@gmail.com/ Daniel Henrique Barboza (4): powerpc/pseries: Set UNISOLATE on dlpar_memory_remove_by_ic() error powerpc/pseries: check DRCONF_MEM_RESERVED in lmb_is_removable() powerpc/pseries: break early in dlpar_memory_remove_by_count() loops powerpc/pseries: minor enhancements in dlpar_memory_remove_by_ic() .../platforms/pseries/hotplug-memory.c| 54 ++- 1 file changed, 40 insertions(+), 14 deletions(-) -- 2.31.1
[PATCH v2 1/4] powerpc/pseries: Set UNISOLATE on dlpar_memory_remove_by_ic() error
As previously done in dlpar_cpu_remove() for CPUs, this patch changes dlpar_memory_remove_by_ic() to unisolate the LMB DRC when the LMB fails to be removed. The hypervisor, seeing a LMB DRC that was supposed to be removed being unisolated instead, can do error recovery on its side. This change is done in dlpar_memory_remove_by_ic() only because, as of today, only QEMU is using this code path for error recovery (via the PSERIES_HP_ELOG_ID_DRC_IC event). phyp treats it as a no-op. Reviewed-by: David Gibson Signed-off-by: Daniel Henrique Barboza --- arch/powerpc/platforms/pseries/hotplug-memory.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 8377f1f7c78e..bb98574a84a2 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -551,6 +551,13 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index) if (!drmem_lmb_reserved(lmb)) continue; + /* +* Setting the isolation state of an UNISOLATED/CONFIGURED +* device to UNISOLATE is a no-op, but the hypervisor can +* use it as a hint that the LMB removal failed. +*/ + dlpar_unisolate_drc(lmb->drc_index); + rc = dlpar_add_lmb(lmb); if (rc) pr_err("Failed to add LMB, drc index %x\n", -- 2.31.1
[PATCH v2 2/4] powerpc/pseries: check DRCONF_MEM_RESERVED in lmb_is_removable()
DRCONF_MEM_RESERVED is a flag that represents the "Reserved Memory" status in LOPAR v2.10, section 4.2.8. If a LMB is marked as reserved then, quoting LOPAR, it "is not to be used or altered by the base OS". This flag is only read in the kernel, being set by the firmware/hypervisor in the DT. As an example, QEMU will set this flag in hw/ppc/spapr.c, spapr_dt_dynamic_memory(). lmb_is_removable() does not check for DRCONF_MEM_RESERVED. This function is used in dlpar_remove_lmb() as a guard before the removal logic. Since it is failing to check for !RESERVED, dlpar_remove_lmb() will fail in a later stage instead of failing in the validation when receiving a reserved LMB as input. lmb_is_removable() is also used in dlpar_memory_remove_by_count() to evaluate if we have enough LMBs to complete the request. The missing !RESERVED check in this case is causing dlpar_memory_remove_by_count() to miscalculate the number of eligible LMBs for the removal, and can make it error out later on instead of failing in the validation with the 'not enough LMBs to satisfy request' message. Adding a DRCONF_MEM_RESERVED check in lmb_is_removable() fixes all these issues. Signed-off-by: Daniel Henrique Barboza --- arch/powerpc/platforms/pseries/hotplug-memory.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index bb98574a84a2..c21d9278c1ce 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -348,7 +348,8 @@ static int pseries_remove_mem_node(struct device_node *np) static bool lmb_is_removable(struct drmem_lmb *lmb) { - if (!(lmb->flags & DRCONF_MEM_ASSIGNED)) + if ((lmb->flags & DRCONF_MEM_RESERVED) || + !(lmb->flags & DRCONF_MEM_ASSIGNED)) return false; #ifdef CONFIG_FA_DUMP -- 2.31.1
Re: [PATCH 2/3] hotplug-memory.c: enhance dlpar_memory_remove* LMB checks
On 5/3/21 10:02 PM, David Gibson wrote: On Fri, Apr 30, 2021 at 09:09:16AM -0300, Daniel Henrique Barboza wrote: dlpar_memory_remove_by_ic() validates the amount of LMBs to be removed by checking !DRCONF_MEM_RESERVED, and in the following loop before dlpar_remove_lmb() a check for DRCONF_MEM_ASSIGNED is made before removing it. This means that a LMB that is both !DRCONF_MEM_RESERVED and !DRCONF_MEM_ASSIGNED will be counted as valid, but then not being removed. The function will end up not removing all 'lmbs_to_remove' LMBs while also not reporting any errors. Comparing it to dlpar_memory_remove_by_count(), the validation is done via lmb_is_removable(), which checks for DRCONF_MEM_ASSIGNED and fadump constraints. No additional check is made afterwards, and DRCONF_MEM_RESERVED is never checked before dlpar_remove_lmb(). The function doesn't have the same 'check A for validation, then B for removal' issue as remove_by_ic(), but it's not checking if the LMB is reserved. There is no reason for these functions to validate the same operation in two different manners. Actually, I think there is: remove_by_ic() is handling a request to remove a specific range of LMBs. If any are reserved, they can't be removed and so this needs to fail. But if they are !ASSIGNED, that essentially means they're *already* removed (or never added), so "removing" them is, correctly, a no-op. I guess that makes sense. Although I am not aware of any situation, at least thinking about how QEMU adds/removes LMBs, where some LMBs would be removed 'ad-hoc' in the middle of a LMB range that maps to a QEMU DIMM, I can't say that this would never happen either. It is sensible to make remove_by_ic() resilient to this situation. I'll re-send this patch just with the remove_by_count() change. Thanks, Daniel remove_by_count(), in contrast, is being asked to remove a fixed number of LMBs from wherever they can be found, and for that it needs to find LMBs that haven't already been removed. 
Basically remove_by_ic() is an absolute request: "make this set of LMBs be not-plugged", whereas remove_by_count() is a relative request "make N less LMBs be plugged". So I think remove_by_ic()s existing handling is correct. I'm less sure if remove_by_count() ignoring RESERVED is correct - I couldn't quickly find under what circumstances RESERVED gets set. This patch addresses that by changing lmb_is_removable() to also check for DRCONF_MEM_RESERVED to tell if a lmb is removable, making dlpar_memory_remove_by_count() take the reservation state into account when counting the LMBs. lmb_is_removable() is then used in the validation step of dlpar_memory_remove_by_ic(), which is already checking for both states but in different stages, to avoid counting a LMB that is not assigned as eligible for removal. We can then skip the check before dlpar_remove_lmb() since we're validating all LMBs beforehand. Signed-off-by: Daniel Henrique Barboza --- arch/powerpc/platforms/pseries/hotplug-memory.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index bb98574a84a2..4e6d162c3f1a 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -348,7 +348,8 @@ static int pseries_remove_mem_node(struct device_node *np) static bool lmb_is_removable(struct drmem_lmb *lmb) { - if (!(lmb->flags & DRCONF_MEM_ASSIGNED)) + if ((lmb->flags & DRCONF_MEM_RESERVED) || + !(lmb->flags & DRCONF_MEM_ASSIGNED)) return false; #ifdef CONFIG_FA_DUMP @@ -523,7 +524,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index) /* Validate that there are enough LMBs to satisfy the request */ for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) { - if (lmb->flags & DRCONF_MEM_RESERVED) + if (!lmb_is_removable(lmb)) break; lmbs_available++; @@ -533,9 +534,6 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 
drc_index) return -EINVAL; for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) { - if (!(lmb->flags & DRCONF_MEM_ASSIGNED)) - continue; - rc = dlpar_remove_lmb(lmb); if (rc) break;
[PATCH 3/3] pseries/hotplug-memory.c: adding dlpar_memory_remove_lmbs_common()
One difference between dlpar_memory_remove_by_count() and dlpar_memory_remove_by_ic() is that the latter, added in commit 753843471cbb, removes the LMBs in a contiguous block. This was done because QEMU works with DIMMs, which are nothing more than sets of LMBs that must be added or removed together. Failing to remove one LMB must fail the removal of the entire set of LMBs. Another difference is that dlpar_memory_remove_by_ic() is going to set the LMB DRC to unisolate in case of a removal error, which is a no-op for the hypervisors that don't care about this error handling knob, and could be called by remove_by_count() without issues. Aside from that, the logic is the same for both functions, and yet we keep them separate, duplicating LMB removal logic in both. This patch introduces a helper called dlpar_memory_remove_lmbs_common() to be used by both functions. The helper handles the block removal case of remove_by_ic() by failing earlier in the validation and removal steps if the helper was called by remove_by_ic() (i.e. it was called with a drc_index), while preserving the more relaxed behavior of remove_by_count() if drc_index is 0. 
Signed-off-by: Daniel Henrique Barboza --- .../platforms/pseries/hotplug-memory.c| 163 +++--- 1 file changed, 67 insertions(+), 96 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 4e6d162c3f1a..a031993725ca 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -399,25 +399,43 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb) return 0; } -static int dlpar_memory_remove_by_count(u32 lmbs_to_remove) +static int dlpar_memory_remove_lmbs_common(u32 lmbs_to_remove, u32 drc_index) { - struct drmem_lmb *lmb; - int lmbs_removed = 0; + struct drmem_lmb *lmb, *start_lmb, *end_lmb; int lmbs_available = 0; + int lmbs_removed = 0; int rc; - - pr_info("Attempting to hot-remove %d LMB(s)\n", lmbs_to_remove); + /* +* dlpar_memory_remove_by_ic() wants to remove all +* 'lmbs_to_remove' LMBs, starting from drc_index, in a +* contiguous block. +*/ + bool block_removal; if (lmbs_to_remove == 0) return -EINVAL; + block_removal = drc_index != 0; + + if (block_removal) { + rc = get_lmb_range(drc_index, lmbs_to_remove, &start_lmb, &end_lmb); + if (rc) + return -EINVAL; + } else { + start_lmb = &drmem_info->lmbs[0]; + end_lmb = &drmem_info->lmbs[drmem_info->n_lmbs]; + } + /* Validate that there are enough LMBs to satisfy the request */ - for_each_drmem_lmb(lmb) { - if (lmb_is_removable(lmb)) + for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) { + if (lmb_is_removable(lmb)) { lmbs_available++; - if (lmbs_available == lmbs_to_remove) + if (lmbs_available == lmbs_to_remove) + break; + } else if (block_removal) { break; + } } if (lmbs_available < lmbs_to_remove) { @@ -426,28 +444,40 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove) return -EINVAL; } - for_each_drmem_lmb(lmb) { + for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) { rc = dlpar_remove_lmb(lmb); - if (rc) - continue; - /* Mark this lmb so we can add it later if all of 
the -* requested LMBs cannot be removed. -*/ - drmem_mark_lmb_reserved(lmb); + if (!rc) { + /* Mark this lmb so we can add it later if all of the +* requested LMBs cannot be removed. +*/ + drmem_mark_lmb_reserved(lmb); - lmbs_removed++; - if (lmbs_removed == lmbs_to_remove) + lmbs_removed++; + if (lmbs_removed == lmbs_to_remove) + break; + } else if (block_removal) { break; + } } if (lmbs_removed != lmbs_to_remove) { - pr_err("Memory hot-remove failed, adding LMB's back\n"); + if (block_removal) + pr_err("Memory indexed-count-remove failed, adding any removed LMBs\n"); + else + pr_err("Memory hot-remove failed, adding LMB's back\n"); - for_each_drmem_lmb(lmb) { + for_each_drme
[PATCH 1/3] powerpc/pseries: Set UNISOLATE on dlpar_memory_remove_by_ic() error
As previously done in dlpar_cpu_remove() for CPUs, this patch changes dlpar_memory_remove_by_ic() to unisolate the LMB DRC when the LMB fails to be removed. The hypervisor, seeing a LMB DRC that was supposed to be removed being unisolated instead, can do error recovery on its side. This change is done in dlpar_memory_remove_by_ic() only because, as of today, only QEMU is using this code path for error recovery (via the PSERIES_HP_ELOG_ID_DRC_IC event). phyp treats it as a no-op. Signed-off-by: Daniel Henrique Barboza --- arch/powerpc/platforms/pseries/hotplug-memory.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 8377f1f7c78e..bb98574a84a2 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -551,6 +551,13 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index) if (!drmem_lmb_reserved(lmb)) continue; + /* +* Setting the isolation state of an UNISOLATED/CONFIGURED +* device to UNISOLATE is a no-op, but the hypervisor can +* use it as a hint that the LMB removal failed. +*/ + dlpar_unisolate_drc(lmb->drc_index); + rc = dlpar_add_lmb(lmb); if (rc) pr_err("Failed to add LMB, drc index %x\n", -- 2.30.2
[PATCH 2/3] hotplug-memory.c: enhance dlpar_memory_remove* LMB checks
dlpar_memory_remove_by_ic() validates the amount of LMBs to be removed by checking !DRCONF_MEM_RESERVED, and in the following loop before dlpar_remove_lmb() a check for DRCONF_MEM_ASSIGNED is made before removing it. This means that a LMB that is both !DRCONF_MEM_RESERVED and !DRCONF_MEM_ASSIGNED will be counted as valid, but then not being removed. The function will end up not removing all 'lmbs_to_remove' LMBs while also not reporting any errors. Comparing it to dlpar_memory_remove_by_count(), the validation is done via lmb_is_removable(), which checks for DRCONF_MEM_ASSIGNED and fadump constraints. No additional check is made afterwards, and DRCONF_MEM_RESERVED is never checked before dlpar_remove_lmb(). The function doesn't have the same 'check A for validation, then B for removal' issue as remove_by_ic(), but it's not checking if the LMB is reserved. There is no reason for these functions to validate the same operation in two different manners. This patch addresses that by changing lmb_is_removable() to also check for DRCONF_MEM_RESERVED to tell if a lmb is removable, making dlpar_memory_remove_by_count() take the reservation state into account when counting the LMBs. lmb_is_removable() is then used in the validation step of dlpar_memory_remove_by_ic(), which is already checking for both states but in different stages, to avoid counting a LMB that is not assigned as eligible for removal. We can then skip the check before dlpar_remove_lmb() since we're validating all LMBs beforehand. 
Signed-off-by: Daniel Henrique Barboza --- arch/powerpc/platforms/pseries/hotplug-memory.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index bb98574a84a2..4e6d162c3f1a 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -348,7 +348,8 @@ static int pseries_remove_mem_node(struct device_node *np) static bool lmb_is_removable(struct drmem_lmb *lmb) { - if (!(lmb->flags & DRCONF_MEM_ASSIGNED)) + if ((lmb->flags & DRCONF_MEM_RESERVED) || + !(lmb->flags & DRCONF_MEM_ASSIGNED)) return false; #ifdef CONFIG_FA_DUMP @@ -523,7 +524,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index) /* Validate that there are enough LMBs to satisfy the request */ for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) { - if (lmb->flags & DRCONF_MEM_RESERVED) + if (!lmb_is_removable(lmb)) break; lmbs_available++; @@ -533,9 +534,6 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index) return -EINVAL; for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) { - if (!(lmb->flags & DRCONF_MEM_ASSIGNED)) - continue; - rc = dlpar_remove_lmb(lmb); if (rc) break; -- 2.30.2
[PATCH 0/3] Unisolate LMBs DRC on removal error + cleanups
Hi, This is a follow-up of the work done in dlpar_cpu_remove() to report CPU removal error by unisolating the DRC. This time I'm doing it for LMBs. Patch 01 handles this. Patches 2 and 3 are cleanups I consider worth posting. Daniel Henrique Barboza (3): powerpc/pseries: Set UNISOLATE on dlpar_memory_remove_by_ic() error hotplug-memory.c: enhance dlpar_memory_remove* LMB checks pseries/hotplug-memory.c: adding dlpar_memory_remove_lmbs_common() .../platforms/pseries/hotplug-memory.c| 162 -- 1 file changed, 69 insertions(+), 93 deletions(-) -- 2.30.2
Re: [PATCH 2/2] hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure
On 4/19/21 9:48 AM, Michael Ellerman wrote: Daniel Henrique Barboza writes: The RTAS set-indicator call, when attempting to UNISOLATE a DRC that is already UNISOLATED or CONFIGURED, returns RTAS_OK and does nothing else for both QEMU and phyp. This gives us an opportunity to use this behavior to signal the hypervisor layer when an error during device removal happens, allowing it to do proper error handling, while not breaking QEMU/phyp implementations that don't have this support. This patch introduces this idea by unisolating all CPU DRCs that failed to be removed by dlpar_cpu_remove_by_index(), when handling the PSERIES_HP_ELOG_ID_DRC_INDEX event. This is being done for this event only because it's the only CPU removal event QEMU uses, and there's no need at this moment to add this mechanism for phyp-only code. Have you also confirmed that phyp is not bothered by it? ie. everything seems to continue working when you trigger this path on phyp. Yes. Daniel Bueso (dbue...@us.ibm.com) from the partition firmware team helped me with that. We confirmed that phyp returns RTAS_OK under these conditions (Unisolating an unisolated/configured DRC). Thanks, DHB cheers diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index 12cbffd3c2e3..ed66895c2f51 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -802,8 +802,15 @@ int dlpar_cpu(struct pseries_hp_errorlog *hp_elog) case PSERIES_HP_ELOG_ACTION_REMOVE: if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_COUNT) rc = dlpar_cpu_remove_by_count(count); - else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX) + else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX) { rc = dlpar_cpu_remove_by_index(drc_index); + /* Setting the isolation state of an UNISOLATED/CONFIGURED +* device to UNISOLATE is a no-op, but the hypervisor can +* use it as a hint that the cpu removal failed. 
+*/ + if (rc) + dlpar_unisolate_drc(drc_index); + } else rc = -EINVAL; break; -- 2.30.2
[PATCH 2/2] hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure
The RTAS set-indicator call, when attempting to UNISOLATE a DRC that is already UNISOLATED or CONFIGURED, returns RTAS_OK and does nothing else for both QEMU and phyp. This gives us an opportunity to use this behavior to signal the hypervisor layer when an error during device removal happens, allowing it to do proper error handling, while not breaking QEMU/phyp implementations that don't have this support. This patch introduces this idea by unisolating all CPU DRCs that failed to be removed by dlpar_cpu_remove_by_index(), when handling the PSERIES_HP_ELOG_ID_DRC_INDEX event. This is being done for this event only because it's the only CPU removal event QEMU uses, and there's no need at this moment to add this mechanism for phyp-only code. Signed-off-by: Daniel Henrique Barboza --- arch/powerpc/platforms/pseries/hotplug-cpu.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index 12cbffd3c2e3..ed66895c2f51 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -802,8 +802,15 @@ int dlpar_cpu(struct pseries_hp_errorlog *hp_elog) case PSERIES_HP_ELOG_ACTION_REMOVE: if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_COUNT) rc = dlpar_cpu_remove_by_count(count); - else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX) + else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX) { rc = dlpar_cpu_remove_by_index(drc_index); + /* Setting the isolation state of an UNISOLATED/CONFIGURED +* device to UNISOLATE is a no-op, but the hypervisor can +* use it as a hint that the cpu removal failed. +*/ + if (rc) + dlpar_unisolate_drc(drc_index); + } else rc = -EINVAL; break; -- 2.30.2
[PATCH 1/2] dlpar.c: introduce dlpar_unisolate_drc()
Next patch will execute a set-indicator call in hotplug-cpu.c. Create a dlpar_unisolate_drc() helper to avoid spreading more rtas_set_indicator() calls outside of dlpar.c. Signed-off-by: Daniel Henrique Barboza --- arch/powerpc/platforms/pseries/dlpar.c | 14 ++ arch/powerpc/platforms/pseries/pseries.h | 1 + 2 files changed, 15 insertions(+) diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c index 233503fcf8f0..3ac70790ec7a 100644 --- a/arch/powerpc/platforms/pseries/dlpar.c +++ b/arch/powerpc/platforms/pseries/dlpar.c @@ -329,6 +329,20 @@ int dlpar_release_drc(u32 drc_index) return 0; } +int dlpar_unisolate_drc(u32 drc_index) +{ + int dr_status, rc; + + rc = rtas_call(rtas_token("get-sensor-state"), 2, 2, &dr_status, + DR_ENTITY_SENSE, drc_index); + if (rc || dr_status != DR_ENTITY_PRESENT) + return -1; + + rtas_set_indicator(ISOLATION_STATE, drc_index, UNISOLATE); + + return 0; +} + int handle_dlpar_errorlog(struct pseries_hp_errorlog *hp_elog) { int rc; diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h index 4fe48c04c6c2..4ea12037c920 100644 --- a/arch/powerpc/platforms/pseries/pseries.h +++ b/arch/powerpc/platforms/pseries/pseries.h @@ -55,6 +55,7 @@ extern int dlpar_attach_node(struct device_node *, struct device_node *); extern int dlpar_detach_node(struct device_node *); extern int dlpar_acquire_drc(u32 drc_index); extern int dlpar_release_drc(u32 drc_index); +extern int dlpar_unisolate_drc(u32 drc_index); void queue_hotplug_event(struct pseries_hp_errorlog *hp_errlog); int handle_dlpar_errorlog(struct pseries_hp_errorlog *hp_errlog); -- 2.30.2
[PATCH 0/2] pseries: UNISOLATE DRCs to signal device removal error
At this moment, PAPR [1] does not have a way to report errors during a device removal operation. This puts a strain on the hypervisor, which needs extra mechanisms to try to fall back and recover from an error that might have happened during the removal. The QEMU community has dealt with it over the years by either trying to preempt the error before sending the HP event or, in case of a guest-side failure, rebooting the guest to complete the removal process.

This started to change with QEMU commit fe1831eff8a4 ("spapr_drc.c: use DRC reconfiguration to cleanup DIMM unplug state"), where a way to fall back from a memory removal error was introduced. In this case, when QEMU detects that the kernel is reconfiguring LMB DRCs that were marked as pending removal, the entire process is reverted from the QEMU side as well.

Around the same time, other discussions on the QEMU mailing list considered an alternative for other devices as well. In [2] the idea of using RTAS set-indicator for this role was first introduced.

The RTAS set-indicator call, when attempting to UNISOLATE a DRC that is already UNISOLATED or CONFIGURED, returns RTAS_OK and does nothing else, in both QEMU and phyp. This gives us an opportunity to use this behavior to signal the hypervisor layer when a device removal fails, allowing it to do proper error handling knowing for sure that the removal failed in the kernel. Using set-indicator to report HP errors isn't strange to PAPR, as per R1-13.5.3.4-4. of table 13.7 of [1]:

"For all DR options: If this is a DR operation that involves the user inserting a DR entity, then if the firmware can determine that the inserted entity would cause a system disturbance, then the set-indicator RTAS call must not unisolate the entity and must return an error status which is unique to the particular error."
PAPR does not make any restrictions or considerations about setting an already Unisolated/Configured DRC to 'unisolate', meaning we have a chance to use it for this purpose: signaling an OS-side error when attempting to remove a DR entity.

To validate the design, this is being implemented only for CPUs. QEMU will use this mechanism to roll back the device removal (hotunplug) state, allowing for a better error handling mechanism. An implementation of how QEMU can do it is in [3]. When using a kernel with this series applied, together with this QEMU build, this is what happens in a common CPU removal/hotunplug error scenario (trying to remove the last online CPU):

( QEMU command line: qemu-system-ppc64 -machine pseries,accel=kvm,usb=off
  -smp 1,maxcpus=2,threads=1,cores=2,sockets=1 ... )

[root@localhost ~]# QEMU 5.2.92 monitor - type 'help' for more information
(qemu) device_add host-spapr-cpu-core,core-id=1,id=core1
(qemu)
[root@localhost ~]# echo 0 > /sys/devices/system/cpu/cpu0/online
[   77.548442][  T13] IRQ 19: no longer affine to CPU0
[   77.548452][  T13] IRQ 20: no longer affine to CPU0
[   77.548458][  T13] IRQ 256: no longer affine to CPU0
[   77.548465][  T13] IRQ 258: no longer affine to CPU0
[   77.548472][  T13] IRQ 259: no longer affine to CPU0
[   77.548479][  T13] IRQ 260: no longer affine to CPU0
[   77.548485][  T13] IRQ 261: no longer affine to CPU0
[   77.548590][   T0] cpu 0 (hwid 0) Ready to die...
[root@localhost ~]#
(qemu) device_del core1
(qemu) [   83.214073][ T100] pseries-hotplug-cpu: Failed to offline CPU PowerPC,POWER9, rc: -16
qemu-system-ppc64: Device hotunplug rejected by the guest for device core1
(qemu)

As soon as the CPU removal fails in dlpar_cpu(), QEMU becomes aware of it and is able to do error recovery. If this solution is well received, I'll push for an architecture change request internally at IBM to make this mechanism PAPR official.
[1] https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
[2] https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg06395.html
[3] https://github.com/danielhb/qemu/tree/unisolate_drc_callback_v1

Daniel Henrique Barboza (2):
  dlpar.c: introduce dlpar_unisolate_drc()
  hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure

 arch/powerpc/platforms/pseries/dlpar.c       | 14 ++++++++++++++
 arch/powerpc/platforms/pseries/hotplug-cpu.c |  9 ++++++++-
 arch/powerpc/platforms/pseries/pseries.h     |  1 +
 3 files changed, 23 insertions(+), 1 deletion(-)

-- 
2.30.2
Re: [PATCH 0/3] Reintroduce cpu_core_mask
Hi,

Using a QEMU pseries guest with the following SMP topology, with a single NUMA node:

(...) -smp 32,threads=4,cores=4,sockets=2, (...)

This is the output of lscpu with a guest running v5.12-rc5:

[root@localhost ~]# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  4
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-31
[root@localhost ~]#

The changes with cpu_core_mask made the topology sockets match NUMA nodes. In this case, given that we have a single NUMA node, the SMP topology got adjusted to have 8 cores instead of 4 so we can have a single socket as well. Although sockets equal to NUMA nodes is true for Power hardware, QEMU doesn't have this constraint, and users expect sockets and NUMA nodes to be kind of independent, regardless of how impractical that would be with real hardware.

The same guest running a kernel with this series applied:

[root@localhost ~]# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  4
Core(s) per socket:  4
Socket(s):           2
NUMA node(s):        1
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-31

The sockets and NUMA nodes are being represented separately, as intended via the QEMU command line.

Thanks for looking this up, Srikar. For all patches:

Tested-by: Daniel Henrique Barboza

On 4/15/21 9:09 AM, Srikar Dronamraju wrote:
Daniel had reported that QEMU is now unable to see requested topologies in a multi-socket, single NUMA node configuration.
-smp 8,maxcpus=8,cores=2,threads=2,sockets=2

This patchset reintroduces cpu_core_mask so that users can see requested topologies while still maintaining the boot time of very large system configurations. It includes caching the chip_id as suggested by Michael Ellerman.

4 Threads/Core; 4 Cores/Socket; 4 Sockets/Node, 2 Nodes in System

-numa node,nodeid=0,memdev=m0 \
-numa node,nodeid=1,memdev=m1 \
-smp 128,sockets=8,threads=4,maxcpus=128 \

5.12.0-rc5 (or any kernel with commit 4ca234a9cbd7)
--
srikar@cloudy:~$ lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2         <<<<<-
NUMA node(s):        2
Model:               2.3 (pvr 004e 1203)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           1 MiB
L1i cache:           1 MiB
NUMA node0 CPU(s):   0-15,32-47,64-79,96-111
NUMA node1 CPU(s):   16-31,48-63,80-95,112-127
--
srikar@cloudy:~$ dmesg | grep smp
[    0.010658] smp: Bringing up secondary CPUs ...
[    0.424681] smp: Brought up 2 nodes, 128 CPUs
--

5.12.0-rc5 + 3 patches
--
srikar@cloudy:~$ lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  4
Socket(s):           8         <<<<-
NUMA node(s):        2
Model:               2.3 (pvr 004e 1203)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           1 MiB
L1i cache:           1 MiB
NUMA node0 CPU(s):   0-15,32-47,64-79,96-111
NUMA node1 CPU(s):   16-31,48-63,80-95,112-127
--
srikar@cloudy:~$ dmesg | grep smp
[    0.010372] smp: Bringing up secondary CPUs ...
[    0.417892] smp: Brought up 2 nodes, 128 CPUs

5.12.0-rc5
--
srikar@cloudy:~$ lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              1024
On-line CPU(s) list: 0-1023
Thread(s) per core:  8
Core(s) per socket:  128
Socket(s):           1
NUMA node(s):        1
Model:               2.3 (pvr 004e 1203)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization
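The mismatch in the lscpu outputs above reduces to simple arithmetic. The sketch below (helper names are invented for illustration; nothing here is kernel or QEMU API) contrasts the topology requested on the command line with the one a kernel reports when it forces sockets to equal NUMA nodes:

```c
#include <assert.h>

/* What lscpu should report for a given -smp line vs. what a kernel that
 * equates sockets with NUMA nodes ends up reporting. Pure arithmetic. */
struct topo { int sockets, cores_per_socket, threads; };

/* Topology requested on the QEMU command line. */
static struct topo requested(int cpus, int sockets, int threads)
{
	struct topo t = { sockets, cpus / (sockets * threads), threads };
	return t;
}

/* Topology after sockets are forced to the NUMA node count; the
 * cores-per-socket figure absorbs the difference. */
static struct topo node_collapsed(int cpus, int numa_nodes, int threads)
{
	struct topo t = { numa_nodes, cpus / (numa_nodes * threads), threads };
	return t;
}
```

For Srikar's `-smp 128,sockets=8,threads=4` case with 2 NUMA nodes, `requested()` gives 8 sockets of 4 cores, while `node_collapsed()` gives the 2 sockets of 16 cores that 5.12.0-rc5 reported.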
[PATCH v3 1/1] hotplug-cpu.c: show 'last online CPU' error in dlpar_cpu_offline()
One of the reasons that dlpar_cpu_offline can fail is when attempting to offline the last online CPU of the kernel. This can be observed in a pseries QEMU guest that has hotplugged CPUs. If the user offlines all other CPUs of the guest, and a hotplugged CPU is now the last online CPU, trying to reclaim it will fail. See [1] for an example.

The current error message in this situation returns rc with -EBUSY and a generic explanation, e.g.:

pseries-hotplug-cpu: Failed to offline CPU PowerPC,POWER9, rc: -16

EBUSY can be caused by other conditions, such as cpu_hotplug_disable being true. Throwing a more specific error message for this case, instead of just "Failed to offline CPU", makes it clearer that the error is in fact a known error situation instead of other generic/unknown cause.

This patch adds a 'last online' check in dlpar_cpu_offline() to catch the 'last online CPU' offline error, returning a more informative error message:

pseries-hotplug-cpu: Unable to remove last online CPU PowerPC,POWER9

[1] https://bugzilla.redhat.com/1911414

Signed-off-by: Daniel Henrique Barboza
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..4b9df4d645b4 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -271,6 +271,19 @@ static int dlpar_offline_cpu(struct device_node *dn)
 			if (!cpu_online(cpu))
 				break;
 
+			/* device_offline() will return -EBUSY (via cpu_down())
+			 * if there is only one CPU left. Check it here to fail
+			 * earlier and with a more informative error message,
+			 * while also retaining the cpu_add_remove_lock to be sure
+			 * that no CPUs are being online/offlined during this
+			 * check.
+			 */
+			if (num_online_cpus() == 1) {
+				pr_warn("Unable to remove last online CPU %pOFn\n", dn);
+				rc = -EBUSY;
+				goto out_unlock;
+			}
+
 			cpu_maps_update_done();
 			rc = device_offline(get_cpu_device(cpu));
 			if (rc)
@@ -283,6 +296,7 @@ static int dlpar_offline_cpu(struct device_node *dn)
 				thread);
 		}
 	}
+out_unlock:
 	cpu_maps_update_done();
 out:
-- 
2.30.2
[PATCH v3 0/1] show 'last online CPU' error in dlpar_cpu_offline()
changes in v3 after Daniel Axtens' review:
- fixed typo in commit msg
- fixed block comment format to make checkpatch happy

v2 link: https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210323205056.52768-2-danielhb...@gmail.com/

changes in v2 after Michael Ellerman's review:
- moved the verification code from dlpar_cpu_remove() to dlpar_cpu_offline(), while holding cpu_add_remove_lock
- reworded the commit message and code comment

v1 link: https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210305173845.451158-1-danielhb...@gmail.com/

Daniel Henrique Barboza (1):
  hotplug-cpu.c: show 'last online CPU' error in dlpar_cpu_offline()

 arch/powerpc/platforms/pseries/hotplug-cpu.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

-- 
2.30.2
Re: [PATCH v2 1/1] hotplug-cpu.c: show 'last online CPU' error in dlpar_cpu_offline()
Hey Daniel, On 3/26/21 2:24 AM, Daniel Axtens wrote: Hi Daniel, Two small nitpicks: This patch adds a 'last online' check in dlpar_cpu_offline() to catch the 'last online CPU' offline error, eturning a more informative error ^--- s/eturning/returning/; + /* device_offline() will return -EBUSY (via cpu_down()) +* if there is only one CPU left. Check it here to fail +* earlier and with a more informative error message, +* while also retaining the cpu_add_remove_lock to be sure +* that no CPUs are being online/offlined during this +* check. */ Checkpatch has a small issue with this comment: WARNING: Block comments use a trailing */ on a separate line #50: FILE: arch/powerpc/platforms/pseries/hotplug-cpu.c:279: +* check. */ Apart from that, this patch seems sane to me, but I haven't been able to test it. Thanks for the review, and for letting me know of the existence of 'scripts/checkpatch.pl' to verify the patches before posting. I'll send a v3. Thanks, DHB Kind regards, Daniel
[PATCH v2 1/1] hotplug-cpu.c: show 'last online CPU' error in dlpar_cpu_offline()
One of the reasons that dlpar_cpu_offline can fail is when attempting to offline the last online CPU of the kernel. This can be observed in a pseries QEMU guest that has hotplugged CPUs. If the user offlines all other CPUs of the guest, and a hotplugged CPU is now the last online CPU, trying to reclaim it will fail. See [1] for an example.

The current error message in this situation returns rc with -EBUSY and a generic explanation, e.g.:

pseries-hotplug-cpu: Failed to offline CPU PowerPC,POWER9, rc: -16

EBUSY can be caused by other conditions, such as cpu_hotplug_disable being true. Throwing a more specific error message for this case, instead of just "Failed to offline CPU", makes it clearer that the error is in fact a known error situation instead of other generic/unknown cause.

This patch adds a 'last online' check in dlpar_cpu_offline() to catch the 'last online CPU' offline error, eturning a more informative error message:

pseries-hotplug-cpu: Unable to remove last online CPU PowerPC,POWER9

[1] https://bugzilla.redhat.com/1911414

Signed-off-by: Daniel Henrique Barboza
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..3ac7e904385c 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -271,6 +271,18 @@ static int dlpar_offline_cpu(struct device_node *dn)
 			if (!cpu_online(cpu))
 				break;
 
+			/* device_offline() will return -EBUSY (via cpu_down())
+			 * if there is only one CPU left. Check it here to fail
+			 * earlier and with a more informative error message,
+			 * while also retaining the cpu_add_remove_lock to be sure
+			 * that no CPUs are being online/offlined during this
+			 * check. */
+			if (num_online_cpus() == 1) {
+				pr_warn("Unable to remove last online CPU %pOFn\n", dn);
+				rc = -EBUSY;
+				goto out_unlock;
+			}
+
 			cpu_maps_update_done();
 			rc = device_offline(get_cpu_device(cpu));
 			if (rc)
@@ -283,6 +295,7 @@ static int dlpar_offline_cpu(struct device_node *dn)
 				thread);
 		}
 	}
+out_unlock:
 	cpu_maps_update_done();
 out:
-- 
2.30.2
[PATCH v2 0/1] show 'last online CPU' error in dlpar_cpu_offline()
changes in v2 after Michael Ellerman's review:
- moved the verification code from dlpar_cpu_remove() to dlpar_cpu_offline(), while holding cpu_add_remove_lock
- reworded the commit message and code comment

v1 link: https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210305173845.451158-1-danielhb...@gmail.com/

Daniel Henrique Barboza (1):
  hotplug-cpu.c: show 'last online CPU' error in dlpar_cpu_offline()

 arch/powerpc/platforms/pseries/hotplug-cpu.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

-- 
2.30.2
Re: [PATCH 1/1] hotplug-cpu.c: show 'last online CPU' error in dlpar_cpu_remove()
On 3/19/21 8:26 AM, Michael Ellerman wrote:

Daniel Henrique Barboza writes:

Ping

On 3/5/21 2:38 PM, Daniel Henrique Barboza wrote:

Of all the reasons that dlpar_cpu_remove() can fail, the 'last online CPU' is one that can be caused directly by the user offlining CPUs in a partition/virtual machine that has hotplugged CPUs. Trying to reclaim a hotplugged CPU can fail if the CPU is now the last online in the system. This is easily reproduced using QEMU [1].

Sorry, I saw this earlier and never got around to replying.

No problem. Thanks for the review!

I'm wondering if we need to catch it earlier, ie. in dlpar_offline_cpu(). By the time we return to dlpar_cpu_remove() we've dropped the cpu_add_remove_lock (cpu_maps_update_done), so num_online_cpus() could change out from under us, meaning the num_online_cpus() == 1 check might trigger incorrectly (or vice versa). Something like the patch below (completely untested :D)

Makes sense. I'll try it out to see if it works.

And writing that patch makes me wonder, is == 1 the right check? In most cases we'll remove all but one thread of the core, but we'll fail on the last thread, leaving that core effectively stuck in SMT1. Is that useful behaviour? Should we instead check at the start that we can remove all threads of the core without going to zero online CPUs?

I think it's OK to allow SMT1 cores, speaking from the QEMU perspective. QEMU does not have a "core hotunplug" operation where the whole core is hotunplugged at once. The CPU hotplug/unplug operations are done as single CPU thread add/removal. If the user wants to run 4 cores, all of them with SMT1, QEMU will allow it.

Libvirt does not operate with core granularity either: you can specify the amount of vcpus the guest should run with, and Libvirt will send hotplug/unplug requests to QEMU to match the desired value. It doesn't bother with how many threads of a core were offlined or not.
Thanks,

DHB

cheers

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..498c22331ac8 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -271,6 +271,12 @@ static int dlpar_offline_cpu(struct device_node *dn)
 			if (!cpu_online(cpu))
 				break;
 
+			if (num_online_cpus() == 1) {
+				pr_warn("Unable to remove last online CPU %pOFn\n", dn);
+				rc = EBUSY;
+				goto out_unlock;
+			}
+
 			cpu_maps_update_done();
 			rc = device_offline(get_cpu_device(cpu));
 			if (rc)
@@ -283,6 +289,7 @@ static int dlpar_offline_cpu(struct device_node *dn)
 				thread);
 		}
 	}
+out_unlock:
 	cpu_maps_update_done();
 out:
Re: Advice needed on SMP regression after cpu_core_mask change
On 3/18/21 10:42 AM, Srikar Dronamraju wrote:

* Daniel Henrique Barboza [2021-03-17 10:00:34]:

Hello,

Patch 4bce545903fa ("powerpc/topology: Update topology_core_cpumask") introduced a regression in both upstream and RHEL downstream kernels [1]. The assumption made in the commit:

"Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be equal on Power"

Doesn't seem to be true. After this commit, QEMU is now unable to set single NUMA node SMP topologies such as:

-smp 8,maxcpus=8,cores=2,threads=2,sockets=2

What does it mean for a NUMA node to have more than one socket? If they are all part of the same node, they are at local distance to each other. Cache is per core. So what resources are shared by the sockets that are part of the same NUMA node, and how does userspace/the application make use of them?

Honestly, I sympathize with the idea that multiple sockets in the same NUMA node are "weird". QEMU has been accepting this kind of topology since forever because we didn't pay attention to these other details. I don't see any problem adding more constraints in the virtual layer, as long as the constraints make sense and are documented. Putting multiple sockets in a single NUMA node seems like a fair restriction.

Please don't mistake this as an attempt to downplay your report; it is an honest attempt to better understand the situation.

It's cool. Ask away.

For example, if the socket denotes the hemisphere logic in P10, then can we see if the coregroup feature can be used? "Coregroup" is supposed to mean a set of cores within a NUMA node that share some characteristics, and there can be multiple coregroups within a NUMA node. We added that mostly to mimic hemispheres in P10. However, the number of coregroups in a NUMA node is not exported to userspace at this time.

I see. I thought that the presence of the hemispheres inside the chip would justify more than one NUMA node inside the chip, meaning that a chip/socket would have more than one NUMA node inside of it.
If that's not the case, then I guess socket == NUMA node is still valid in P10 as well. The last 'lscpu' example I gave here, claiming that this would be a Power10 scenario, doesn't represent P10 after all.

However, if each socket is associated with a memory and node distance, then shouldn't they be NUMA nodes?

Can you provide me with the unique ibm,chip-ids in your 2 NUMA, 4 node case? Does this cause any performance issues with the guest/application?

I can fetch some values, but we're trying to move away from it since it's not in the pseries spec (PAPR). Perhaps with these restrictions we can live without ibm,chip-id in QEMU.

Till your report, I was under the impression that NUMA nodes == sockets.

After reading and discussing about it, I think the sensible thing to do is to put this same constraint in QEMU. In theory it would be nice to let the virtual machine have whatever topology it wants, multiple sockets in the same NUMA domain and so on, but in the end we're emulating Power hardware. If Power hardware, and the powerpc kernel, operate under these assumptions, then I don't see much point in allowing users to set unrealistic virtual CPU topologies that will be misrepresented in the kernel.

I'll try this restriction in QEMU and see how the upstream kernel behaves, with and without ibm,chip-id being advertised in the DT.

Thanks, DHB

lscpu will give the following output in this case:

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-7

This is happening because the macro cpu_cpu_mask(cpu) expands to cpumask_of_node(cpu_to_node(cpu)), which in turn expands to node_to_cpumask_map[node]. node_to_cpumask_map is a NUMA array that maps CPUs to NUMA nodes (Aneesh is on CC to correct me if I'm wrong).
We're now associating sockets to NUMA nodes directly. If I add a second NUMA node then I can get the intended SMP topology:

-smp 8,maxcpus=8,cores=2,threads=2,sockets=2
-numa node,memdev=mem0,cpus=0-3,nodeid=0 \
-numa node,memdev=mem1,cpus=4-7,nodeid=1 \

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-3
NUMA node1 CPU(s):   4-7

Ho
Re: [PATCH 1/1] hotplug-cpu.c: show 'last online CPU' error in dlpar_cpu_remove()
Ping

On 3/5/21 2:38 PM, Daniel Henrique Barboza wrote:

Of all the reasons that dlpar_cpu_remove() can fail, the 'last online CPU' is one that can be caused directly by the user offlining CPUs in a partition/virtual machine that has hotplugged CPUs. Trying to reclaim a hotplugged CPU can fail if the CPU is now the last online in the system. This is easily reproduced using QEMU [1].

Throwing a more specific error message for this case, instead of just "Failed to offline CPU", makes it clearer that the error is in fact a known error situation instead of other generic/unknown cause.

[1] https://bugzilla.redhat.com/1911414

Signed-off-by: Daniel Henrique Barboza
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..134f393f09e1 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -514,7 +514,17 @@ static ssize_t dlpar_cpu_remove(struct device_node *dn, u32 drc_index)
 	rc = dlpar_offline_cpu(dn);
 	if (rc) {
-		pr_warn("Failed to offline CPU %pOFn, rc: %d\n", dn, rc);
+		/* dlpar_offline_cpu will return -EBUSY from cpu_down() (via
+		 * device_offline()) in 2 cases: cpu_hotplug_disable is true or
+		 * there is only one CPU left. Warn the user about the second
+		 * since this can happen with user offlining CPUs and then
+		 * attempting hotunplugs.
+		 */
+		if (rc == -EBUSY && num_online_cpus() == 1)
+			pr_warn("Unable to remove last online CPU %pOFn\n", dn);
+		else
+			pr_warn("Failed to offline CPU %pOFn, rc: %d\n", dn, rc);
+
 		return -EINVAL;
 	}
Re: [PATCH] powerpc/numa: Fix topology_physical_package_id() on pSeries
On 3/18/21 4:28 AM, Cédric Le Goater wrote: Also we've been using it for several years and I don't think we should risk breaking anything by changing the value now. I guess we can leave it that way. Please read the commit log of the second patch (not tagged as a v2 ...). But we should remove ibm,chip-id from QEMU since the property does not exist on PAPR and that the calculation is anyhow very broken. I am a strong advocate of getting rid of ibm,chip-id in QEMU. That said, we need to make sure that the current problem with CPU topologies, that I reported in that other thread, can be fixed without it. Thanks, DHB Thanks, C.
Re: Advice needed on SMP regression after cpu_core_mask change
On 3/17/21 12:30 PM, Cédric Le Goater wrote:

On 3/17/21 2:00 PM, Daniel Henrique Barboza wrote:

Hello,

Patch 4bce545903fa ("powerpc/topology: Update topology_core_cpumask") introduced a regression in both upstream and RHEL downstream kernels [1]. The assumption made in the commit:

"Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be equal on Power"

Doesn't seem to be true. After this commit, QEMU is now unable to set single NUMA node SMP topologies such as:

-smp 8,maxcpus=8,cores=2,threads=2,sockets=2

lscpu will give the following output in this case:

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-7

This is happening because the macro cpu_cpu_mask(cpu) expands to cpumask_of_node(cpu_to_node(cpu)), which in turn expands to node_to_cpumask_map[node]. node_to_cpumask_map is a NUMA array that maps CPUs to NUMA nodes (Aneesh is on CC to correct me if I'm wrong). We're now associating sockets to NUMA nodes directly.
If I add a second NUMA node then I can get the intended SMP topology:

-smp 8,maxcpus=8,cores=2,threads=2,sockets=2
-numa node,memdev=mem0,cpus=0-3,nodeid=0 \
-numa node,memdev=mem1,cpus=4-7,nodeid=1 \

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-3
NUMA node1 CPU(s):   4-7

However, if I try a single socket with multiple NUMA nodes topology, which is the case of Power10, e.g.:

-smp 8,maxcpus=8,cores=4,threads=2,sockets=1
-numa node,memdev=mem0,cpus=0-3,nodeid=0 \
-numa node,memdev=mem1,cpus=4-7,nodeid=1 \

This is the result:

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-3
NUMA node1 CPU(s):   4-7

This confirms my suspicions that, at this moment, we're making sockets == NUMA nodes.

Yes. I don't think we can do better on PAPR, and the above examples seem to confirm that the "sockets" definition is simply ignored.

Cedric, the reason I'm CCing you is because this is related to ibm,chip-id. The commit after the one that caused the regression, 4ca234a9cbd7c3a65 ("powerpc/smp: Stop updating cpu_core_mask"), is erasing the code that calculated cpu_core_mask. cpu_core_mask, despite its shortcomings that caused its removal, was giving a precise SMP topology. And it was using physical_package_id/'ibm,chip-id' for that.

ibm,chip-id is a no-no on pSeries. I guess this is inherent to PAPR, which is hiding a lot of the underlying HW and topology.
Maybe we are trying to reconcile two orthogonal views of machine virtualization ...

Checking in QEMU, I can say that the ibm,chip-id calculation is the only place in the code that cares about cores-per-socket information. The kernel is now ignoring that, starting with 4bce545903fa, and QEMU is now unable to provide this info to the guest. If we're not going to use ibm,chip-id any longer, which seems sensible given that PAPR does not declare it, we need another way of letting the guest know how many cores per socket we want.

The RTAS call "ibm,get-system-parameter" with token "Processor Module Information" returns that kind of information:

  2-byte binary number (N) of module types, followed by N module specifiers of the form:
    2-byte binary number (M) of sockets of this module type
    2-byte binary number (L) of chips per this module type
    2-byte binary number (K) of cores per chip in this module type

See the values in these sysfs files:

cat /sys/devices/hv_24x7/interface/{sockets,chipspersocket,coresperchip}

But I am afraid these are host-level information and not guest/LPAR.

I believe there might be some sort of reasoning behind not having this on PAPR, but I'll say in advance that the virtual machine should act as close as possible to the real hardware. This is the kind of hcall that could be used in this situation.
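The "Processor Module Information" layout described above could be parsed along these lines. This is an illustrative sketch only: the big-endian encoding and flat buffer layout are assumptions for the example, not details confirmed by the mail or by PAPR text quoted here.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch of the layout quoted above: a 2-byte count N, then N records of
 * (sockets, chips per module, cores per chip), each a 2-byte number.
 * Big-endian encoding is assumed here for illustration. */
struct module_info { uint16_t sockets, chips, cores; };

static uint16_t be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Returns the number of module types parsed, or -1 on a short buffer. */
static int parse_module_info(const uint8_t *buf, size_t len,
			     struct module_info *out, int max)
{
	if (len < 2)
		return -1;
	int n = be16(buf);
	if (n > max || len < 2 + (size_t)n * 6)
		return -1;
	for (int i = 0; i < n; i++) {
		const uint8_t *rec = buf + 2 + i * 6;
		out[i].sockets = be16(rec);
		out[i].chips   = be16(rec + 2);
		out[i].cores   = be16(rec + 4);
	}
	return n;
}
```

As noted above, even with such a parser the values would reflect the host rather than the LPAR, which is the real limitation being discussed.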
Advice needed on SMP regression after cpu_core_mask change
Hello,

Patch 4bce545903fa ("powerpc/topology: Update topology_core_cpumask") introduced a regression in both upstream and RHEL downstream kernels [1]. The assumption made in the commit:

"Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be equal on Power"

Doesn't seem to be true. After this commit, QEMU is now unable to set single NUMA node SMP topologies such as:

-smp 8,maxcpus=8,cores=2,threads=2,sockets=2

lscpu will give the following output in this case:

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-7

This is happening because the macro cpu_cpu_mask(cpu) expands to cpumask_of_node(cpu_to_node(cpu)), which in turn expands to node_to_cpumask_map[node]. node_to_cpumask_map is a NUMA array that maps CPUs to NUMA nodes (Aneesh is on CC to correct me if I'm wrong). We're now associating sockets to NUMA nodes directly.
If I add a second NUMA node then I can get the intended SMP topology:

-smp 8,maxcpus=8,cores=2,threads=2,sockets=2
-numa node,memdev=mem0,cpus=0-3,nodeid=0 \
-numa node,memdev=mem1,cpus=4-7,nodeid=1 \

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-3
NUMA node1 CPU(s):   4-7

However, if I try a single socket with multiple NUMA nodes topology, which is the case of Power10, e.g.:

-smp 8,maxcpus=8,cores=4,threads=2,sockets=1
-numa node,memdev=mem0,cpus=0-3,nodeid=0 \
-numa node,memdev=mem1,cpus=4-7,nodeid=1 \

This is the result:

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-3
NUMA node1 CPU(s):   4-7

This confirms my suspicions that, at this moment, we're making sockets == NUMA nodes.

Cedric, the reason I'm CCing you is because this is related to ibm,chip-id. The commit after the one that caused the regression, 4ca234a9cbd7c3a65 ("powerpc/smp: Stop updating cpu_core_mask"), is erasing the code that calculated cpu_core_mask. cpu_core_mask, despite its shortcomings that caused its removal, was giving a precise SMP topology. And it was using physical_package_id/'ibm,chip-id' for that.

Checking in QEMU, I can say that the ibm,chip-id calculation is the only place in the code that cares about cores-per-socket information. The kernel is now ignoring that, starting with 4bce545903fa, and QEMU is now unable to provide this info to the guest.
If we're not going to use ibm,chip-id any longer, which seems sensible given that PAPR does not declare it, we need another way of letting the guest know how many cores per socket we want.

[1] https://bugzilla.redhat.com/1934421

Thanks,

DHB
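To make the macro expansion above concrete, here is a toy Python model of why topology_core_cpumask() collapses to the NUMA node mask (this is not kernel code; the maps and the 2-socket layout are invented for illustration, and only the names mirror the kernel's):

```python
# Model of cpu_cpu_mask(cpu) -> cpumask_of_node(cpu_to_node(cpu))
#                            -> node_to_cpumask_map[node]
# for the failing case: 2 intended sockets, but a single NUMA node.

node_to_cpumask_map = {0: set(range(8))}        # one NUMA node holds all CPUs
cpu_to_node_map = {cpu: 0 for cpu in range(8)}  # every CPU sits in node 0

def cpu_to_node(cpu):
    return cpu_to_node_map[cpu]

def cpumask_of_node(node):
    return node_to_cpumask_map[node]

def topology_core_cpumask(cpu):
    # After 4bce545903fa this is cpu_cpu_mask(cpu), i.e. the node mask
    return cpumask_of_node(cpu_to_node(cpu))

# What -smp 8,...,sockets=2 asked for:
intended_sockets = [{0, 1, 2, 3}, {4, 5, 6, 7}]

# Every CPU reports the whole node as its "socket", so lscpu sees 1 socket:
print(topology_core_cpumask(0))  # {0, 1, 2, 3, 4, 5, 6, 7}
```

With a single NUMA node, every CPU's core mask is the full node mask, which is why lscpu collapses the two requested sockets into one.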
Re: [PATCH] powerpc/numa: Fix topology_physical_package_id() on pSeries
On 3/15/21 1:16 PM, Cédric Le Goater wrote:

On 3/15/21 4:12 PM, Daniel Henrique Barboza wrote:

On 3/12/21 11:31 AM, Cédric Le Goater wrote:

Initial commit 15863ff3b8da ("powerpc: Make chip-id information available to userspace") introduced a cpu_to_chip_id() routine for the PowerNV platform using the "ibm,chip-id" property to query the chip id of a CPU. But PAPR does not specify such a property and the node id query is broken. Use cpu_to_node() instead, which is guaranteed to have a correct value on all platforms, PowerNV and pSeries.

It is worth mentioning that this patch will change what topology_physical_package_id() reports in a QEMU guest. Right now, ibm,chip-id in QEMU matches the socket-id. After this patch, topology_physical_package_id() will match the NUMA id of the CPU.

yes. I should have added some more background.

LPARs are impacted by the use of ibm,chip-id because the property does not exist under PowerVM, so the topology-id in sysfs is always -1 even if NUMA nodes are defined. Under QEMU/KVM, ibm,chip-id is badly calculated when using uncommon SMT configurations. This leads to a bogus topology-id value being exported in sysfs. The use of cpu_to_node() guarantees a correct NUMA node id under both environments, QEMU/KVM and PowerVM.

On the PowerNV platform, the NUMA node id returned by cpu_to_node() is computed from the "ibm,associativity" property of the CPU. Its value is built from the OPAL chip id and is equivalent to ibm,chip-id.

Maybe I should rephrase the commit log in a v2?

It's a fine idea, given that apparently we don't have documentation explaining these details (well, at least I didn't find any). We can reference the commit message later on as explanation :)

Thanks,

DHB

C.
Reviewed-by: Daniel Henrique Barboza
Tested-by: Daniel Henrique Barboza
Cc: Nathan Lynch
Cc: Srikar Dronamraju
Cc: Vasant Hegde
Signed-off-by: Cédric Le Goater
---
 arch/powerpc/include/asm/topology.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 3beeb030cd78..887c42a4e43d 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -123,7 +123,7 @@ static inline int cpu_to_coregroup_id(int cpu)
 #ifdef CONFIG_PPC64
 #include
-#define topology_physical_package_id(cpu)	(cpu_to_chip_id(cpu))
+#define topology_physical_package_id(cpu)	(cpu_to_node(cpu))
 #define topology_sibling_cpumask(cpu)	(per_cpu(cpu_sibling_map, cpu))
 #define topology_core_cpumask(cpu)	(cpu_cpu_mask(cpu))
Re: [PATCH] powerpc/numa: Fix topology_physical_package_id() on pSeries
On 3/12/21 11:31 AM, Cédric Le Goater wrote:

Initial commit 15863ff3b8da ("powerpc: Make chip-id information available to userspace") introduced a cpu_to_chip_id() routine for the PowerNV platform using the "ibm,chip-id" property to query the chip id of a CPU. But PAPR does not specify such a property and the node id query is broken. Use cpu_to_node() instead, which is guaranteed to have a correct value on all platforms, PowerNV and pSeries.

It is worth mentioning that this patch will change what topology_physical_package_id() reports in a QEMU guest. Right now, ibm,chip-id in QEMU matches the socket-id. After this patch, topology_physical_package_id() will match the NUMA id of the CPU.

Reviewed-by: Daniel Henrique Barboza
Tested-by: Daniel Henrique Barboza
Cc: Nathan Lynch
Cc: Srikar Dronamraju
Cc: Vasant Hegde
Signed-off-by: Cédric Le Goater
---
 arch/powerpc/include/asm/topology.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 3beeb030cd78..887c42a4e43d 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -123,7 +123,7 @@ static inline int cpu_to_coregroup_id(int cpu)
 #ifdef CONFIG_PPC64
 #include
-#define topology_physical_package_id(cpu)	(cpu_to_chip_id(cpu))
+#define topology_physical_package_id(cpu)	(cpu_to_node(cpu))
 #define topology_sibling_cpumask(cpu)	(per_cpu(cpu_sibling_map, cpu))
 #define topology_core_cpumask(cpu)	(cpu_cpu_mask(cpu))
Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
On 3/12/21 6:53 AM, Cédric Le Goater wrote: On 3/12/21 2:55 AM, David Gibson wrote: On Tue, 9 Mar 2021 18:26:35 +0100 Cédric Le Goater wrote: On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote: On 3/9/21 12:33 PM, Cédric Le Goater wrote: On 3/8/21 6:13 PM, Greg Kurz wrote: On Wed, 3 Mar 2021 18:48:50 +0100 Cédric Le Goater wrote: The 'chip_id' field of the XIVE CPU structure is used to choose a target for a source located on the same chip when possible. This field is assigned on the PowerNV platform using the "ibm,chip-id" property on pSeries under KVM when NUMA nodes are defined but it is undefined This sentence seems to have a syntax problem... like it is missing an 'and' before 'on pSeries'. ah yes, or simply a comma. under PowerVM. The XIVE source structure has a similar field 'src_chip' which is only assigned on the PowerNV platform. cpu_to_node() returns a compatible value on all platforms, 0 being the default node. It will also give us the opportunity to set the affinity of a source on pSeries when we can localize them. IIUC this relies on the fact that the NUMA node id is == to chip id on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable with this change. Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel in Cc:) [...] On PowerNV, Linux uses "ibm,associativity" property of the CPU to find the node id. This value is built from the chip id in OPAL, so the value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" property are unlikely to be different. cpu_to_node(cpu) is used in many places to allocate the structures locally to the owning node. XIVE is not an exception (see below in the same patch), it is better to be consistent and get the same information (node id) using the same routine. In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : LPC, XSCOM, RNG, VAS, NX. 
XIVE should be in that list also, but skiboot unifies the controllers of the system to only expose one to the OS. This is problematic and should be changed, but it's another topic.

On the other hand, you have the pSeries case under PowerVM that doesn't use xc->chip_id, which isn't passed to any hcall AFAICT.

yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning under PAPR. xc->chip_id on pseries (PowerVM) will contain an invalid chip id.

QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not always correct btw)

If you have a way to reliably reproduce this, let me know and I'll fix it up in QEMU.

with :

-smp 4,cores=1,maxcpus=8 \
-object memory-backend-ram,id=ram-node0,size=2G \
-numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
-object memory-backend-ram,id=ram-node1,size=2G \
-numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1

# dmesg | grep numa
[    0.013106] numa: Node 0 CPUs: 0-1
[    0.013136] numa: Node 1 CPUs: 2-3

# dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
ibm,chip-id = <0x01>;
ibm,chip-id = <0x02>;
ibm,chip-id = <0x00>;
ibm,chip-id = <0x03>;

with :

-smp 4,cores=4,maxcpus=8,threads=1 \
-object memory-backend-ram,id=ram-node0,size=2G \
-numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
-object memory-backend-ram,id=ram-node1,size=2G \
-numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1

# dmesg | grep numa
[    0.013106] numa: Node 0 CPUs: 0-1
[    0.013136] numa: Node 1 CPUs: 2-3

# dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
ibm,chip-id = <0x00>;
ibm,chip-id = <0x00>;
ibm,chip-id = <0x00>;
ibm,chip-id = <0x00>;

I think we should simply remove "ibm,chip-id", since it's not used and not in the PAPR spec.

As I mentioned to Daniel on our call this morning, oddly it *does* appear to be used in the RHEL kernel, even though that's 4.18 based. This patch seems to have caused a minor regression; not in the identification of NUMA nodes, but in the number of sockets shown by lscpu, etc.
See https://bugzilla.redhat.com/show_bug.cgi?id=1934421 for more information.

Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 :

[root@localhost ~]# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:
Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
On 3/9/21 12:33 PM, Cédric Le Goater wrote: On 3/8/21 6:13 PM, Greg Kurz wrote: On Wed, 3 Mar 2021 18:48:50 +0100 Cédric Le Goater wrote: The 'chip_id' field of the XIVE CPU structure is used to choose a target for a source located on the same chip when possible. This field is assigned on the PowerNV platform using the "ibm,chip-id" property on pSeries under KVM when NUMA nodes are defined but it is undefined This sentence seems to have a syntax problem... like it is missing an 'and' before 'on pSeries'. ah yes, or simply a comma. under PowerVM. The XIVE source structure has a similar field 'src_chip' which is only assigned on the PowerNV platform. cpu_to_node() returns a compatible value on all platforms, 0 being the default node. It will also give us the opportunity to set the affinity of a source on pSeries when we can localize them. IIUC this relies on the fact that the NUMA node id is == to chip id on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable with this change. Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel in Cc:) That's correct. H_HOME_NODE_ASSOCIATIVITY returns not only the node_id, but a list with the ibm,associativity domains of the CPU that "proc-no" (processor identifier) is mapped to inside QEMU. node_id in this case, considering that we're working with a reference-points of size 4, is the 4th element of the returned list. The last element is "procno" itself. On PowerNV, Linux uses "ibm,associativity" property of the CPU to find the node id. This value is built from the chip id in OPAL, so the value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" property are unlikely to be different. cpu_to_node(cpu) is used in many places to allocate the structures locally to the owning node. XIVE is not an exception (see below in the same patch), it is better to be consistent and get the same information (node id) using the same routine. 
In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also, but skiboot unifies the controllers of the system to only expose one to the OS. This is problematic and should be changed, but it's another topic.

On the other hand, you have the pSeries case under PowerVM that doesn't use xc->chip_id, which isn't passed to any hcall AFAICT.

yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning under PAPR. xc->chip_id on pseries (PowerVM) will contain an invalid chip id.

QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not always correct btw)

If you have a way to reliably reproduce this, let me know and I'll fix it up in QEMU.

Thanks,

DHB

It looks like the chip id is only used for localization purposes in this case, right ?

Yes, and PAPR sources are not localized. So it's not used. MSI sources could be if we rewrote the MSI driver.

In this case, what about doing this change for pSeries only, somewhere in spapr.c ?

The IPI code is common to all platforms and all have the same issue. I'd rather not.

Thanks,

C.

Signed-off-by: Cédric Le Goater
---
 arch/powerpc/sysdev/xive/common.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 595310e056f4..b8e456da28aa 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu)
 
 	xc = per_cpu(xive_cpu, cpu);
 	if (!xc) {
-		struct device_node *np;
-
 		xc = kzalloc_node(sizeof(struct xive_cpu),
 				  GFP_KERNEL, cpu_to_node(cpu));
 		if (!xc)
 			return -ENOMEM;
-		np = of_get_cpu_node(cpu, NULL);
-		if (np)
-			xc->chip_id = of_get_ibm_chip_id(np);
-		of_node_put(np);
+		xc->chip_id = cpu_to_node(cpu);
 		xc->hw_ipi = XIVE_BAD_IRQ;
 
 		per_cpu(xive_cpu, cpu) = xc;
[PATCH 1/1] hotplug-cpu.c: show 'last online CPU' error in dlpar_cpu_remove()
Of all the reasons that dlpar_cpu_remove() can fail, the 'last online CPU' is one that can be caused directly by the user offlining CPUs in a partition/virtual machine that has hotplugged CPUs. Trying to reclaim a hotplugged CPU can fail if the CPU is now the last online in the system. This is easily reproduced using QEMU [1].

Throwing a more specific error message for this case, instead of just "Failed to offline CPU", makes it clearer that the error is in fact a known error situation instead of other generic/unknown cause.

[1] https://bugzilla.redhat.com/1911414

Signed-off-by: Daniel Henrique Barboza
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..134f393f09e1 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -514,7 +514,17 @@ static ssize_t dlpar_cpu_remove(struct device_node *dn, u32 drc_index)
 
 	rc = dlpar_offline_cpu(dn);
 	if (rc) {
-		pr_warn("Failed to offline CPU %pOFn, rc: %d\n", dn, rc);
+		/* dlpar_offline_cpu will return -EBUSY from cpu_down() (via
+		 * device_offline()) in 2 cases: cpu_hotplug_disable is true or
+		 * there is only one CPU left. Warn the user about the second
+		 * since this can happen with user offlining CPUs and then
+		 * attempting hotunplugs.
+		 */
+		if (rc == -EBUSY && num_online_cpus() == 1)
+			pr_warn("Unable to remove last online CPU %pOFn\n", dn);
+		else
+			pr_warn("Failed to offline CPU %pOFn, rc: %d\n", dn, rc);
+
 		return -EINVAL;
 	}
-- 
2.29.2
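The message selection can be sketched as a small Python model of the new branch (this is an illustrative stand-in, not kernel code; the helper name and the "CPU1" placeholder are invented, and only the -EBUSY/num_online_cpus() condition mirrors the patch):

```python
import errno

def dlpar_remove_warning(rc, num_online_cpus, node="CPU1"):
    # Mirrors the branch added to dlpar_cpu_remove(): -EBUSY while only one
    # CPU remains online means the user is removing the last online CPU.
    if rc == -errno.EBUSY and num_online_cpus == 1:
        return f"Unable to remove last online CPU {node}"
    return f"Failed to offline CPU {node}, rc: {rc}"

print(dlpar_remove_warning(-errno.EBUSY, 1))   # the new, specific message
print(dlpar_remove_warning(-errno.EBUSY, 4))   # cpu_hotplug_disable case
print(dlpar_remove_warning(-errno.EPERM, 1))   # any other failure
```

Note that -EBUSY alone is ambiguous, which is why the patch also checks num_online_cpus() before choosing the specific message.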
Re: [PATCH 0/1] powerpc/numa: do not skip node 0 in lookup table
I discussed this a bit with Aneesh Kumar on IBM internal Slack a few weeks ago, and he informed me that this patch does not make sense with the design used by the kernel. The kernel assumes that, for node 0, all associativity domains must also be zeroed. This is why node 0 is skipped when creating the distance table.

This of course has consequences for QEMU, so based on that, I've adapted the QEMU implementation to not touch node 0.

Daniel

On 8/14/20 5:34 PM, Daniel Henrique Barboza wrote:

Hi,

This is a simple fix that I made while testing NUMA changes I'm making in QEMU [1]. Setting any non-zero value in the associativity of NUMA node 0 has no impact on the output of 'numactl' because the distance_lookup_table is never initialized for node 0. Looking through the LOPAPR spec and the git history, I found no technical reason to skip node 0, which makes me believe this is a bug that flew under the radar up until now because no one attempted to set node 0 associativity like I'm doing now.
For anyone wishing to give it a spin, using the QEMU build in [1] and experimenting with NUMA distances, such as:

sudo ./qemu-system-ppc64 -machine pseries-5.2,accel=kvm,usb=off,dump-guest-core=off -m 65536 \
-overcommit mem-lock=off -smp 4,sockets=4,cores=1,threads=1 -rtc base=utc -display none \
-vga none -nographic -boot menu=on -device spapr-pci-host-bridge,index=1,id=pci.1 \
-device spapr-pci-host-bridge,index=2,id=pci.2 -device spapr-pci-host-bridge,index=3,id=pci.3 \
-device spapr-pci-host-bridge,index=4,id=pci.4 -device qemu-xhci,id=usb,bus=pci.0,addr=0x2 \
-drive file=/home/danielhb/f32.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-device usb-kbd,id=input0,bus=usb.0,port=1 -device usb-mouse,id=input1,bus=usb.0,port=2 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on \
-numa node,nodeid=0,cpus=0 -numa node,nodeid=1,cpus=1 \
-numa node,nodeid=2,cpus=2 -numa node,nodeid=3,cpus=3 \
-numa dist,src=0,dst=1,val=80 -numa dist,src=0,dst=2,val=80 \
-numa dist,src=0,dst=3,val=80 -numa dist,src=1,dst=2,val=80 \
-numa dist,src=1,dst=3,val=80 -numa dist,src=2,dst=3,val=80

The current kernel code will ignore the associativity of node 0, and numactl will output this:

node distances:
node   0   1   2   3
  0:  10 160 160 160
  1: 160  10  80  80
  2: 160  80  10  80
  3: 160  80  80  10

With this patch:

node distances:
node   0   1   2   3
  0:  10 160 160 160
  1: 160  10  80  40
  2: 160  80  10  20
  3: 160  40  20  10

If anyone wonders, this patch has no conflict with the proposed NUMA changes in [2] because Aneesh isn't changing this line.

[1] https://github.com/danielhb/qemu/tree/spapr_numa_v1
[2] https://patchwork.ozlabs.org/project/linuxppc-dev/patch/2020073916.243569-1-aneesh.ku...@linux.ibm.com/

Daniel Henrique Barboza (1):
  powerpc/numa: do not skip node 0 when init lookup table

 arch/powerpc/mm/numa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[PATCH 1/1] powerpc/numa: do not skip node 0 when init lookup table
associativity_to_nid() is skipping node 0 when initializing the distance lookup table. This has no practical effect when the associativity of node 0 is always zero, which seems to have been the case for a long time. As such, this line got introduced in commit 41eab6f88f24 from 2010 and never revisited.

However, QEMU is making an effort to allow user input to configure NUMA topologies, and this behavior got exposed when testing that work. With the existing code, this is what happens with a 4 node NUMA guest with distance = 80 between each node:

$ numactl -H
(...)
node distances:
node   0   1   2   3
  0:  10 160 160 160
  1: 160  10  80  80
  2: 160  80  10  80
  3: 160  80  80  10

With this patch, this is the result:

$ numactl -H
(...)
node distances:
node   0   1   2   3
  0:  10  80  80  80
  1:  80  10  80  80
  2:  80  80  10  80
  3:  80  80  80  10

Signed-off-by: Daniel Henrique Barboza
---
 arch/powerpc/mm/numa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 1f61fa2148b5..c11aabad1090 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -239,7 +239,7 @@ static int associativity_to_nid(const __be32 *associativity)
 	if (nid == 0x || nid >= nr_node_ids)
 		nid = NUMA_NO_NODE;
 
-	if (nid > 0 &&
+	if (nid >= 0 &&
 	    of_read_number(associativity, 1) >= distance_ref_points_depth) {
 		/*
 		 * Skip the length field and send start of associativity array
-- 
2.26.2
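To see why skipping node 0 produces that first numactl table, here is a toy Python reconstruction of distance_lookup_table and __node_distance() (the domain values are invented; only the per-level doubling rule and the node-0 skip are taken from the kernel code discussed in this thread):

```python
# Each node stores one associativity domain per reference point,
# most significant first.
LOCAL_DISTANCE = 10
DEPTH = 4  # distance_ref_points_depth

def node_distance(table, a, b):
    distance = LOCAL_DISTANCE
    for i in range(DEPTH):
        if table[a][i] == table[b][i]:
            break
        distance *= 2  # double per differing NUMA level
    return distance

# Nodes filled in normally: distinct nodes share only the last (least
# specific) domain, so any pair is 10 * 2^3 = 80 apart.
filled = {n: [n, n, n, 99] for n in range(4)}

# Before the patch, node 0's entries were never initialized; model that
# with sentinel values that never match any other node's domains:
buggy = dict(filled)
buggy[0] = [-1, -1, -1, -1]

row0_before = [node_distance(buggy, 0, b) for b in range(4)]
row0_after = [node_distance(filled, 0, b) for b in range(4)]
print(row0_before)  # [10, 160, 160, 160] -- node 0 looks maximally remote
print(row0_after)   # [10, 80, 80, 80]    -- matches the patched numactl -H
```

With node 0 unfilled, every comparison against it fails at all four levels, so node 0 looks maximally remote (160) even though its real distance to the other nodes is 80.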
[PATCH 0/1] powerpc/numa: do not skip node 0 in lookup table
Hi,

This is a simple fix that I made while testing NUMA changes I'm making in QEMU [1]. Setting any non-zero value to the associativity of NUMA node 0 has no impact in the output of 'numactl' because the distance_lookup_table is never initialized for node 0. Seeing through the LOPAPR spec and git history I found no technical reason to skip node 0, which makes me believe this is a bug that got under the radar up until now because no one attempted to set node 0 associativity like I'm doing now.

For anyone wishing to give it a spin, using the QEMU build in [1] and experimenting with NUMA distances, such as:

sudo ./qemu-system-ppc64 -machine pseries-5.2,accel=kvm,usb=off,dump-guest-core=off -m 65536 \
-overcommit mem-lock=off -smp 4,sockets=4,cores=1,threads=1 -rtc base=utc -display none \
-vga none -nographic -boot menu=on -device spapr-pci-host-bridge,index=1,id=pci.1 \
-device spapr-pci-host-bridge,index=2,id=pci.2 -device spapr-pci-host-bridge,index=3,id=pci.3 \
-device spapr-pci-host-bridge,index=4,id=pci.4 -device qemu-xhci,id=usb,bus=pci.0,addr=0x2 \
-drive file=/home/danielhb/f32.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-device usb-kbd,id=input0,bus=usb.0,port=1 -device usb-mouse,id=input1,bus=usb.0,port=2 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on \
-numa node,nodeid=0,cpus=0 -numa node,nodeid=1,cpus=1 \
-numa node,nodeid=2,cpus=2 -numa node,nodeid=3,cpus=3 \
-numa dist,src=0,dst=1,val=80 -numa dist,src=0,dst=2,val=80 \
-numa dist,src=0,dst=3,val=80 -numa dist,src=1,dst=2,val=80 \
-numa dist,src=1,dst=3,val=80 -numa dist,src=2,dst=3,val=80

The current kernel code will ignore the associativity of node 0, and numactl will output this:

node distances:
node   0   1   2   3
  0:  10 160 160 160
  1: 160  10  80  80
  2: 160  80  10  80
  3: 160  80  80  10

With this patch:

node distances:
node   0   1   2   3
  0:  10 160 160 160
  1: 160  10  80  40
  2: 160  80  10  20
  3: 160  40  20  10

If anyone wonders, this patch has no conflict with the proposed NUMA changes in [2] because Aneesh isn't changing this line.

[1] https://github.com/danielhb/qemu/tree/spapr_numa_v1
[2] https://patchwork.ozlabs.org/project/linuxppc-dev/patch/2020073916.243569-1-aneesh.ku...@linux.ibm.com/

Daniel Henrique Barboza (1):
  powerpc/numa: do not skip node 0 when init lookup table

 arch/powerpc/mm/numa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.26.2
Question about NUMA distance calculation in powerpc/mm/numa.c
Hello,

I didn't find an explanation for the 'double the distance' logic in 'git log' or anywhere in the kernel docs (arch/powerpc/mm/numa.c, __node_distance()):

	for (i = 0; i < distance_ref_points_depth; i++) {
		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
			break;

		/* Double the distance for each NUMA level */
		distance *= 2;
	}

For reference, the commit that added it:

commit 41eab6f88f24124df89e38067b3766b7bef06ddb
Author: Anton Blanchard
Date:   Sun May 16 20:22:31 2010 +

    powerpc/numa: Use form 1 affinity to setup node distance

Is there a technical reason for the distance being calculated as the double for each NUMA level?

The reason I'm asking is because of the QEMU/Libvirt capability to define NUMA node distances in VMs. For x86, a user is capable of setting any distance values in the NUMA topology due to how ACPI SLIT works. The user, of course, wants the pseries guest to behave the same way. The best we can do for now is document why this will not happen. I'll document the limitations imposed by the design itself (how ibm,associativity-reference-points is capped to MAX_DISTANCE_REF_POINTS and so on). I also would like to document that the pseries kernel doubles the distance for each NUMA level, and for that it would be nice to provide an actual reason for that to happen, if there is any.

Thanks,

Daniel
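A sketch of the quoted __node_distance() loop makes the limitation concrete: it shows which distance values the doubling scheme can express at all (this is a Python model of the quoted C, not kernel code; the associativity arrays are invented examples):

```python
LOCAL_DISTANCE = 10
distance_ref_points_depth = 4

def node_distance(assoc_a, assoc_b):
    distance = LOCAL_DISTANCE
    for level_a, level_b in zip(assoc_a, assoc_b):
        if level_a == level_b:
            break
        distance *= 2  # "Double the distance for each NUMA level"
    return distance

# Identical associativity -> local node (10); differing only at the
# topmost level -> 20; differing at every level -> 160.
print(node_distance([1, 2, 3, 4], [1, 2, 3, 4]))  # 10
print(node_distance([9, 2, 3, 4], [8, 2, 3, 4]))  # 20
print(node_distance([9, 9, 9, 9], [8, 8, 8, 8]))  # 160

# With 4 reference points, only five distances are representable:
print([LOCAL_DISTANCE * 2 ** n for n in range(5)])  # [10, 20, 40, 80, 160]
```

This is why arbitrary SLIT-style values (say, 17 or 120) chosen by a QEMU/Libvirt user cannot survive into a pseries guest: every distance is forced onto the nearest expressible power-of-two step.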
Re: [RFC PATCH kernel] powerpc/xive: Drop deregistered irqs
This patch fixed an issue I was experiencing with virsh start/destroy of guests with mlx5 and GPU passthrough in a Power 9 server. I believe it's a similar situation to the one Alexey described in the post-commit msg.

Tested-by: Daniel Henrique Barboza

On 7/12/19 5:20 AM, Alexey Kardashevskiy wrote:

There is a race between releasing an irq on one cpu and fetching it from XIVE on another cpu, as there does not seem to be any locking between these, probably because xive_irq_chip::irq_shutdown() is supposed to remove the irq from all queues in the system, which it does not do. As a result, when such a released irq appears in a queue, we take it from the queue but we do not change the current priority on that cpu, and since there is no handler for the irq, EOI is never called and the cpu current priority remains elevated (7 vs. 0xff==unmasked). If another irq is assigned to the same cpu, then that device stops working until the irq is moved to another cpu or the device is reset.

This checks if the irq is still registered; if not, it assumes no valid irq was fetched from the loop and, if there is none left, it continues to the irq==0 case (not visible in this patch) and sets priority to 0xff, which is basically unmasking.

This calls irq_to_desc() on a hot path now, which is a radix tree lookup; hopefully this won't be noticeable as that tree is quite small.

Signed-off-by: Alexey Kardashevskiy
---

Found it on a P9 system with:
- a host with 8 cpus online
- a boot disk on ahci with its msix on cpu#0
- a guest with 2xGPUs + 6xNVLink + 4 cpus
- GPU#0 from the guest is bound to the same cpu#0

Killing a guest killed ahci and therefore the host because of the race. Note that VFIO masks interrupts first and only then resets the device.

Alternatives:
1. Fix xive_irq_chip::irq_shutdown() to walk through all cpu queues and drop deregistered irqs.
2. Exploit the chip->irq_get_irqchip_state function from 62e0468650c30f0298 "genirq: Add optional hardware synchronization for shutdown".
Both require deep XIVE knowledge which I do not have.
---
 arch/powerpc/sysdev/xive/common.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 082c7e1c20f0..65742e280337 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -148,8 +148,12 @@ static u32 xive_scan_interrupts(struct xive_cpu *xc, bool just_peek)
 		irq = xive_read_eq(&xc->queue[prio], just_peek);
 
 		/* Found something ? That's it */
-		if (irq)
-			break;
+		if (irq) {
+			/* Another CPU may have shut this irq down, check it */
+			if (irq_to_desc(irq))
+				break;
+			irq = 0;
+		}
 
 		/* Clear pending bits */
 		xc->pending_prio &= ~(1 << prio);
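As a sanity check of that flow, here is a toy Python model of what the added irq_to_desc() check buys the scan loop (the queues, descriptor table, and irq numbers are all invented; only the shape of the logic echoes the patch, and the real kernel scans per-priority event queues rather than Python lists):

```python
registered = {21: "ahci-msix"}  # irq_to_desc() stand-in: live descriptors
queues = {7: [42], 0: []}       # prio -> pending irqs; irq 42 was freed

def irq_to_desc(irq):
    return registered.get(irq)

def scan_interrupts():
    for prio in sorted(queues, reverse=True):
        while queues[prio]:
            irq = queues[prio].pop(0)  # xive_read_eq() stand-in
            if irq_to_desc(irq):       # the check added by the patch
                return irq
            # deregistered irq: drop it and keep scanning
    return 0  # nothing valid fetched: caller unmasks (priority 0xff)

print(scan_interrupts())  # 0 -> stale irq 42 dropped, priority gets unmasked
```

Without the descriptor check, the stale irq 42 would be returned as if valid, and since no handler ever EOIs it, the cpu priority would stay elevated, which is the hang described above.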
Re: [PATCH kernel v2] powerpc/mm: Flush radix process translations when setting MMU type
On 02/07/2018 12:33 PM, Laurent Vivier wrote:

On 01/02/2018 06:09, Alexey Kardashevskiy wrote:

Radix guests normally invalidate process-scoped translations when a new pid is allocated, but migrated guests do not invalidate these, so migrated guests crash sometimes. It is especially easy to reproduce with a migration happening within the first 10 seconds after the guest boot starts on the same machine.

This adds the "Invalidate process-scoped translations" flush to fix radix guest migration.

Signed-off-by: Alexey Kardashevskiy
---
Changes:
v2:
* removed PPC_TLBIE_5() from the !(old&PATH_HR) case as it is pointless on hash
---
Not so sure that "process-scoped translations" only require flushing at pid allocation and migration.
---
 arch/powerpc/mm/pgtable_64.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index c9a623c..d75dd52 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -471,6 +471,8 @@ void mmu_partition_table_set_entry(unsigned int lpid, unsigned long dw0,
 	if (old & PATB_HR) {
 		asm volatile(PPC_TLBIE_5(%0,%1,2,0,1) : :
 			     "r" (TLBIEL_INVAL_SET_LPID), "r" (lpid));
+		asm volatile(PPC_TLBIE_5(%0,%1,2,1,1) : :
+			     "r" (TLBIEL_INVAL_SET_LPID), "r" (lpid));
 		trace_tlbie(lpid, 0, TLBIEL_INVAL_SET_LPID, lpid, 2, 0, 1);
 	} else {
 		asm volatile(PPC_TLBIE_5(%0,%1,2,0,0) : :

This patch fixes a VM migration crash on POWER9 for me.

Same here.

Tested-by: Daniel Henrique Barboza
Tested-by: Laurent Vivier

Thanks,
Laurent
Re: Possible LMB hot unplug bug in 4.13+ kernels
Unless you (Daniel) think there's some reason lmb_is_removable() is incorrectly returning false. But most likely it's correct and there's just an unmovable allocation in that range.

I am not educated enough to say whether the current behavior is wrong. What I can say is that in 4.11 and older kernels that support LMB hot plug/unplug, I didn't see this kernel "refusal" to remove an LMB that was just hotplugged.

Assuming that the kernel is behaving as intended, a QEMU guest started with 4 GB of RAM that receives an extra 1 GB of RAM will not unplug this same 1 GB. It seems off from the user perspective that recently added memory is considered not removable, so QEMU will need to keep this limitation in mind when dealing with future LMB bugs in 4.13+ kernels.

Thanks,

Daniel
Possible LMB hot unplug bug in 4.13+ kernels
Hi,

I've stumbled on an LMB hot unplug problem when running a guest with a 4.13+ kernel using QEMU 2.10. When trying to hot unplug a recently hotplugged LMB, this is what I got, using an upstream kernel:

---
QEMU cmd line:

sudo ./qemu-system-ppc64 -name migrate_qemu -boot strict=on \
-device nec-usb-xhci,id=usb,bus=pci.0,addr=0xf \
-device spapr-vscsi,id=scsi0,reg=0x2000 \
-smp 32,maxcpus=32,sockets=32,cores=1,threads=1 \
--machine pseries,accel=kvm,kvm-type=HV,usb=off,dump-guest-core=off \
-m 4G,slots=32,maxmem=32G \
-drive file=/home/danielhb/vm_imgs/f26.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x2,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-nographic

Last login: Wed Oct 4 12:28:25 on hvc0
[danielhb@localhost ~]$ grep Mem /proc/meminfo
MemTotal:        4161728 kB
MemFree:         3204352 kB
MemAvailable:    3558336 kB

QEMU 2.10.50 monitor - type 'help' for more information
(qemu) object_add memory-backend-ram,id=ram0,size=1G
(qemu) device_add pc-dimm,id=dimm0,memdev=ram0

[danielhb@localhost ~]$ grep Mem /proc/meminfo
MemTotal:        5210304 kB
MemFree:         4254656 kB
MemAvailable:    4598144 kB

(qemu) device_del dimm0
[  136.058727] pseries-hotplug-mem: Memory indexed-count-remove failed, adding any removed LMBs

[danielhb@localhost ~]$ grep Mem /proc/meminfo
MemTotal:        5210304 kB
MemFree:         4253696 kB
MemAvailable:    4597184 kB
---

The LMBs are about to be unplugged, then something happens and they end up being hotplugged back. This isn't reproducible with 4.11 guests. I can reliably reproduce it in 4.13+. Haven't tried 4.12. Changing QEMU snapshots or even the hypervisor kernel/OS didn't affect the result.
In an attempt to better understand the issue I did the following changes in the upstream kernel:

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 1d48ab424bd9..37550833cdb0 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -433,8 +433,10 @@ static bool lmb_is_removable(struct of_drconf_cell *lmb)
 	unsigned long pfn, block_sz;
 	u64 phys_addr;
 
-	if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
+	if (!(lmb->flags & DRCONF_MEM_ASSIGNED)) {
+		pr_err("lmb is not assigned \n");
 		return false;
+	}
 
 	block_sz = memory_block_size_bytes();
 	scns_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
@@ -442,8 +444,10 @@ static bool lmb_is_removable(struct of_drconf_cell *lmb)
 
 #ifdef CONFIG_FA_DUMP
 	/* Don't hot-remove memory that falls in fadump boot memory area */
-	if (is_fadump_boot_memory_area(phys_addr, block_sz))
+	if (is_fadump_boot_memory_area(phys_addr, block_sz)) {
+		pr_err("lmb belongs to fadump boot memory area\n");
 		return false;
+	}
 #endif
 
 	for (i = 0; i < scns_per_block; i++) {
@@ -454,7 +458,9 @@ static bool lmb_is_removable(struct of_drconf_cell *lmb)
 		rc &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
 		phys_addr += MIN_MEMORY_BLOCK_SIZE;
 	}
-
+	if (!rc) {
+		pr_err("is_mem_section_removable returned false \n");
+	}
 	return rc ? true : false;
 }
@@ -465,12 +471,16 @@ static int dlpar_remove_lmb(struct of_drconf_cell *lmb)
 	unsigned long block_sz;
 	int nid, rc;
 
-	if (!lmb_is_removable(lmb))
+	if (!lmb_is_removable(lmb)) {
+		pr_err("dlpar_remove_lmb: lmb is not removable! \n");
 		return -EINVAL;
+	}
 
 	rc = dlpar_offline_lmb(lmb);
-	if (rc)
+	if (rc) {
+		pr_err("dlpar_remove_lmb: offline_lmb returned not zero \n");
 		return rc;
+	}
 
 	block_sz = pseries_memory_block_size();
 	nid = memory_add_physaddr_to_nid(lmb->base_addr);

And this is the output:

-
[danielhb@localhost ~]$
QEMU 2.10.50 monitor - type 'help' for more information
(qemu) object_add memory-backend-ram,id=ram0,size=1G
(qemu) device_add pc-dimm,id=dimm0,memdev=ram0

[danielhb@localhost ~]$ grep Mem /proc/meminfo
MemTotal:        5210304 kB
MemFree:         4254656 kB
MemAvailable:    4598144 kB

(qemu) device_del dimm0
[  136.058473] pseries-hotplug-mem: is_mem_section_removable returned false
[  136.058607] pseries-hotplug-mem: dlpar_remove_lmb: lmb is not removable!
[  136.058727] pseries-hotplug-mem: Memory indexed-count-remove failed, adding any removed LMBs

[danielhb@localhost ~]$ grep Mem /proc/meminfo
MemTotal:        5210304 kB
MemFree:         4253696 kB
MemAvailable:    4597184 kB
---

It appears that the hot unp
Re: Question: handling early hotplug interrupts
Hi Ben,

On 08/29/2017 06:55 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2017-08-29 at 17:43 -0300, Daniel Henrique Barboza wrote:
>> Hi,
>>
>> This is a scenario I've been facing when working on early device
>> hotplug in QEMU. When a device is added, an IRQ pulse is fired to warn
>> the guest of the event, then the kernel fetches it by calling
>> 'check_exception' and handles it. If the hotplug is done too early
>> (before SLOF, for example), the pulse is ignored and the hotplug event
>> is left unchecked in the events queue.
>>
>> One solution would be to pulse the hotplug queue interrupt after CAS,
>> when we are sure that the hotplug queue is negotiated. However, this
>> panics the kernel with sig 11, kernel access of bad area, which
>> suggests that the kernel wasn't quite ready to handle it.
>
> That's not right. This is a bug that needs fixing. The interrupt should
> be masked anyway but still. Tell us more about the crash (backtrace
> etc...) this definitely needs fixing.

This is the backtrace using a 4.13.0-rc3 guest:

[0.008913] Unable to handle kernel paging request for data at address 0x0100
[0.008989] Faulting instruction address: 0xc012c318
[0.009046] Oops: Kernel access of bad area, sig: 11 [#1]
[0.009092] SMP NR_CPUS=1024
[0.009092] NUMA
[0.009128] pSeries
[0.009173] Modules linked in:
[0.009210] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.13.0-rc3+ #1
[0.009268] task: c000feb02580 task.stack: c000fe108000
[0.009325] NIP: c012c318 LR: c012c9c4 CTR:
[0.009394] REGS: c000fffef910 TRAP: 0380 Not tainted (4.13.0-rc3+)
[0.009450] MSR: 82009033
[0.009454] CR: 28000822 XER: 2000
[0.009554] CFAR: c012c9c0 SOFTE: 0
[0.009554] GPR00: c012c9c4 c000fffefb90 c141f100 0400
[0.009554] GPR04: c000fe1851c0 fee6
[0.009554] GPR08: 000fffe1 0001 02001001
[0.009554] GPR12: 0040 cfd8 c000db58
[0.009554] GPR16:
[0.009554] GPR20: 0001
[0.009554] GPR24: 0002 0013 c000fe14bc00 0400
[0.009554] GPR28: 0400 c000fe1851c0 0001
[0.010121] NIP [c012c318] __queue_work+0x48/0x640
[0.010168] LR [c012c9c4] queue_work_on+0xb4/0xf0
[0.010213] Call Trace:
[0.010239] [c000fffefb90] [c000db58] kernel_init+0x8/0x160 (unreliable)
[0.010308] [c000fffefc70] [c012c9c4] queue_work_on+0xb4/0xf0
[0.010368] [c000fffefcb0] [c00c4608] queue_hotplug_event+0xd8/0x150
[0.010435] [c000fffefd00] [c00c30d0] ras_hotplug_interrupt+0x140/0x190
[0.010505] [c000fffefd90] [c018c8b0] __handle_irq_event_percpu+0x90/0x310
[0.010573] [c000fffefe50] [c018cb6c] handle_irq_event_percpu+0x3c/0x90
[0.010642] [c000fffefe90] [c018cc24] handle_irq_event+0x64/0xc0
[0.010710] [c000fffefec0] [c01928b0] handle_fasteoi_irq+0xc0/0x230
[0.010779] [c000fffefef0] [c018ae14] generic_handle_irq+0x54/0x80
[0.010847] [c000fffeff20] [c00189f0] __do_irq+0x90/0x210
[0.010904] [c000fffeff90] [c002e730] call_do_irq+0x14/0x24
[0.010961] [c000fe10b640] [c0018c10] do_IRQ+0xa0/0x130
[0.011021] [c000fe10b6a0] [c0008c58] hardware_interrupt_common+0x158/0x160
[0.011090] --- interrupt: 501 at __replay_interrupt+0x38/0x3c
[0.011090] LR = arch_local_irq_restore+0x74/0x90
[0.011179] [c000fe10b990] [c000fe10b9e0] 0xc000fe10b9e0 (unreliable)
[0.011249] [c000fe10b9b0] [c0b967fc] _raw_spin_unlock_irqrestore+0x4c/0xb0
[0.011316] [c000fe10b9e0] [c018ff50] __setup_irq+0x630/0x9e0
[0.011374] [c000fe10ba90] [c019054c] request_threaded_irq+0x13c/0x250
[0.011441] [c000fe10baf0] [c00c2cd0] request_event_sources_irqs+0x100/0x180
[0.011511] [c000fe10bc10] [c0eceda8] __machine_initcall_pseries_init_ras_IRQ+0xc4/0x12c
[0.011591] [c000fe10bc40] [c000d8c8] do_one_initcall+0x68/0x1e0
[0.011659] [c000fe10bd00] [c0eb4484] kernel_init_freeable+0x284/0x370
[0.011725] [c000fe10bdc0] [c000db7c] kernel_init+0x2c/0x160
[0.011782] [c000fe10be30] [c000bc9c] ret_from_kernel_thread+0x5c/0xc0
[0.011848] Instruction dump:
[0.011885] fbc1fff0 f8010010 f821ff21 7c7c1b78 7c9d2378 7cbe2b78 787b0020 6000
[0.011955] 6000 892d028a 2fa9 409e04bc <813d0100> 75290001 408204c0 3d2061c8
[0.012026] ---[ end trace e0b4
Question: handling early hotplug interrupts
Hi,

This is a scenario I've been facing when working on early device hotplug in QEMU. When a device is added, an IRQ pulse is fired to warn the guest of the event; the kernel then fetches it by calling 'check_exception' and handles it. If the hotplug is done too early (before SLOF, for example), the pulse is ignored and the hotplug event is left unchecked in the events queue.

One solution would be to pulse the hotplug queue interrupt after CAS, when we are sure that the hotplug queue is negotiated. However, this panics the kernel with sig 11, kernel access of bad area, which suggests that the kernel wasn't quite ready to handle it.

In my experiments using upstream 4.13 I saw that there is a 'safe time' to pulse the queue, sometime after CAS and before mounting the root fs, but I wasn't able to pinpoint it. From the QEMU perspective, the last hcall done (an h_set_mode) is still too early to pulse it and the kernel panics. Looking at the kernel source I saw that the IRQ handling is initiated quite early in the init process.

So my question (ok, actually 2 questions):

- Is my analysis correct? Is there an unsafe time to fire an IRQ pulse before CAS that can break the kernel, or am I overlooking/doing something wrong?
- Is there a reliable way to know when the kernel can safely handle the hotplug interrupt?

Thanks,

Daniel
Re: [RFC PATCH] powerpc: Disabling MEMORY_HOTPLUG_DEFAULT_ONLINE option for PPC64 arch
On 08/01/2017 11:39 AM, Daniel Henrique Barboza wrote:
> On 08/01/2017 11:05 AM, Nathan Fontenot wrote:
>> At this point I don't think we need this patch to disable auto online
>> for ppc64. I would be curious if this is still broken with the latest
>> mainline code though.
>
> If the auto_online feature is already working in the upstream 4.13
> kernel then I don't see a reason to apply this patch either. We can
> leave it as a FYI/reminder of a problem that was happening in 4.11 and
> got solved later on.

I've compiled an upstream kernel with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y and the problem with LMB hotplug is still reproducible. This means that it would be good to have any sort of mechanism (such as the one this patch proposes) to remove this option from the pseries kernel until it is fixed.

Here is the stack trace:

Last login: Tue Aug 1 19:26:21 on hvc0
[danielhb@localhost ~]$ uname -a
Linux localhost.localdomain 4.13.0-rc3+ #1 SMP Tue Aug 1 21:29:14 CDT 2017 ppc64le ppc64le ppc64le GNU/Linux
[danielhb@localhost ~]$
QEMU 2.9.90 monitor - type 'help' for more information
(qemu) object_add memory-backend-ram,id=ram1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=ram1
(qemu)
[  617.757086] kernel BUG at mm/memory_hotplug.c:1936!
[  617.757245] Oops: Exception in kernel mode, sig: 5 [#1]
[  617.757332] SMP NR_CPUS=1024
[  617.757333] NUMA
[  617.757396] pSeries
[  617.757479] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc ghash_generic vmx_crypto xfs libcrc32c bochs_drm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm virtio_blk ibmvscsi virtio_pci ibmveth scsi_transport_srp i2c_core virtio_ring virtio crc32c_vpmsum
[  617.758577] CPU: 6 PID: 5 Comm: kworker/u16:0 Not tainted 4.13.0-rc3+ #1
[  617.758741] Workqueue: pseries hotplug workque pseries_hp_work_fn
[  617.758852] task: c000feb9e780 task.stack: c000fe118000
[  617.759047] NIP: c037322c LR: c03731b8 CTR:
[  617.759171] REGS: c000fe11b790 TRAP: 0700 Not tainted (4.13.0-rc3+)
[  617.759275] MSR: 8282b033
[  617.759284] CR: 22002424 XER: 2000
[  617.759450] CFAR: c03731bc SOFTE: 1
[  617.759450] GPR00: c03731b8 c000fe11ba10 c141f100 0001
[  617.759450] GPR04: 01a0 c15dac30 8b80
[  617.759450] GPR08: 0007 0002 c000e9873480 303078302d303030
[  617.759450] GPR12: 2200 cfd83c00 c0137f88 c000fe1f0e40
[  617.759450] GPR16: 0001 0004 0001
[  617.759450] GPR20: c000fc28 0010 c000ebb1c184 0010
[  617.759450] GPR24: c000fea0 c000f0aec4a0 c000fc28
[  617.759450] GPR28: c1347f08 1000 0001
[  617.760679] NIP [c037322c] remove_memory+0xfc/0x110
[  617.760766] LR [c03731b8] remove_memory+0x88/0x110
[  617.760849] Call Trace:
[  617.760895] [c000fe11ba10] [c03731b8] remove_memory+0x88/0x110 (unreliable)
[  617.761032] [c000fe11ba50] [c00cb0e0] dlpar_add_lmb+0x280/0x480
[  617.761214] [c000fe11bb30] [c00cc83c] dlpar_memory+0xa5c/0xe50
[  617.761319] [c000fe11bbf0] [c00c3b88] handle_dlpar_errorlog+0xf8/0x160
[  617.761443] [c000fe11bc60] [c00c3c84] pseries_hp_work_fn+0x94/0xa0
[  617.761568] [c000fe11bc90] [c012e3e8] process_one_work+0x248/0x540
[  617.761697] [c000fe11bd30] [c012e768] worker_thread+0x88/0x5c0
[  617.761857] [c000fe11bdc0] [c013812c] kthread+0x1ac/0x1c0
[  617.761962] [c000fe11be30] [c000bc9c] ret_from_kernel_thread+0x5c/0xc0
[  617.762085] Instruction dump:
[  617.762149] 4be09a9d 6000 4bd95ac5 6000 38210040 e8010010 eb81ffe0 eba1ffe8
[  617.762274] ebc1fff0 7c0803a6 ebe1fff8 4e800020 <0fe0>
[  617.762403] ---[ end trace feaa62099c5987b9 ]---

Thanks,

Daniel

> Thanks,
> Daniel
>
>> -Nathan
>>
>>> However, this change alone isn't enough to prevent situations such as
>>> [1], where distros can enable the option unaware of the consequences
>>> of doing it (e.g. breaking LMB hotplug altogether). Instead of relying
>>> on all distros knowing that pseries can't handle
>>> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y after 943db62c316c, this patch
>>> changes mm/Kconfig to make the MEMORY_HOTPLUG_DEFAULT_ONLINE config
>>> unavailable for the PPC64 arch.
>>>
>>> [1] https://bugzilla.
Re: [RFC PATCH] powerpc: Disabling MEMORY_HOTPLUG_DEFAULT_ONLINE option for PPC64 arch
On 08/01/2017 11:05 AM, Nathan Fontenot wrote:
> On 08/01/2017 04:59 AM, Michael Ellerman wrote:
>> Daniel Henrique Barboza writes:
>>> Commit 943db62c316c ("powerpc/pseries: Revert 'Auto-online hotplugged
>>> memory'") reverted the auto-online feature for pseries due to problems
>>> with LMB removals not updating the device struct properly. Among other
>>> things, this commit made the following change in
>>> arch/powerpc/configs/pseries_defconfig:
>>>
>>> @@ -58,7 +58,6 @@ CONFIG_KEXEC_FILE=y
>>>  CONFIG_IRQ_ALL_CPUS=y
>>>  CONFIG_MEMORY_HOTPLUG=y
>>>  CONFIG_MEMORY_HOTREMOVE=y
>>> -CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
>>>  CONFIG_KSM=y
>>>
>>> The intent was to disable the option in the defconfig of pseries,
>>> since after that the code doesn't have this support anymore.
>>
>> It's always polite to Cc the author of a commit you're referring to,
>> so I added Nathan.

Noted. Thanks for adding Nathan in the CC.

>> The intention when we merged that fix was that the auto-online code
>> would be "fixed" to mark the device online. I say "fixed" because it
>> wasn't entirely clear if that was the correct behaviour, though it
>> definitely seemed like it should be. I've lost track of where/if the
>> discussion got to on whether the auto-online code should do that or
>> not. Did anything get resolved?
>
> I think, though I should go back and test to be sure, that everything
> works in the latest mainline code.
>
> The issue causing this to be a problem was in the original
> implementation of auto_online support. If you wanted to auto online
> memory, the code was calling memory_block_change_state(). This worked
> but did not update the device struct for each of the memory blocks
> that were onlined, such that dev->offline == true even after the
> memory was online.
>
> I sent a patch earlier this year (commit dc18d706a436) that corrected
> this to call device_online() instead of memory_block_change_state().
> With this fix (it appears to have gone into the 4.11 kernel) it should
> be possible to use auto_online on power systems.

Commit dc18d706a436 was present in the 4.11 kernels that experienced this issue (Fedora 26 and Ubuntu 17.10 in my tests). So I am not entirely sure that we can use auto_online on power systems, at least in those kernels.

> At this point I don't think we need this patch to disable auto online
> for ppc64. I would be curious if this is still broken with the latest
> mainline code though.

If the auto_online feature is already working in the upstream 4.13 kernel then I don't see a reason to apply this patch either. We can leave it as a FYI/reminder of a problem that was happening in 4.11 and got solved later on.

Thanks,

Daniel

> -Nathan
>
>>> However, this change alone isn't enough to prevent situations such as
>>> [1], where distros can enable the option unaware of the consequences
>>> of doing it (e.g. breaking LMB hotplug altogether). Instead of relying
>>> on all distros knowing that pseries can't handle
>>> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y after 943db62c316c, this patch
>>> changes mm/Kconfig to make the MEMORY_HOTPLUG_DEFAULT_ONLINE config
>>> unavailable for the PPC64 arch.
>>>
>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1476380
>>>
>>> Fixes: 943db62c316c ("powerpc/pseries: Revert 'Auto-online hotplugged
>>> memory'")
>>> Signed-off-by: Daniel Henrique Barboza
>>> ---
>>>  mm/Kconfig | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> I don't own that file, so we at least need an Ack from the mm folks.
>>
>> cheers
>
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index 48b1af4..a342c77 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -169,7 +169,7 @@ config MEMORY_HOTPLUG_SPARSE
>>>  config MEMORY_HOTPLUG_DEFAULT_ONLINE
>>>  	bool "Online the newly added memory blocks by default"
>>>  	default n
>>> -	depends on MEMORY_HOTPLUG
>>> +	depends on MEMORY_HOTPLUG && !PPC64
>>>  	help
>>>  	  This option sets the default policy setting for memory hotplug
>>>  	  onlining policy (/sys/devices/system/memory/auto_online_blocks)
>>>  	  which
>>> --
>>> 2.9.4
[RFC PATCH] powerpc: Disabling MEMORY_HOTPLUG_DEFAULT_ONLINE option for PPC64 arch
Commit 943db62c316c ("powerpc/pseries: Revert 'Auto-online hotplugged memory'") reverted the auto-online feature for pseries due to problems with LMB removals not updating the device struct properly.

Among other things, this commit made the following change in arch/powerpc/configs/pseries_defconfig:

@@ -58,7 +58,6 @@ CONFIG_KEXEC_FILE=y
 CONFIG_IRQ_ALL_CPUS=y
 CONFIG_MEMORY_HOTPLUG=y
 CONFIG_MEMORY_HOTREMOVE=y
-CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
 CONFIG_KSM=y

The intent was to disable the option in the defconfig of pseries, since after that the code doesn't have this support anymore. However, this change alone isn't enough to prevent situations such as [1], where distros can enable the option unaware of the consequences of doing it (e.g. breaking LMB hotplug altogether).

Instead of relying on all distros knowing that pseries can't handle CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y after 943db62c316c, this patch changes mm/Kconfig to make the MEMORY_HOTPLUG_DEFAULT_ONLINE config unavailable for the PPC64 arch.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1476380

Fixes: 943db62c316c ("powerpc/pseries: Revert 'Auto-online hotplugged memory'")
Signed-off-by: Daniel Henrique Barboza
---
 mm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 48b1af4..a342c77 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -169,7 +169,7 @@ config MEMORY_HOTPLUG_SPARSE
 config MEMORY_HOTPLUG_DEFAULT_ONLINE
 	bool "Online the newly added memory blocks by default"
 	default n
-	depends on MEMORY_HOTPLUG
+	depends on MEMORY_HOTPLUG && !PPC64
 	help
 	  This option sets the default policy setting for memory hotplug
 	  onlining policy (/sys/devices/system/memory/auto_online_blocks)
 	  which
--
2.9.4