Re: [kvm-unit-tests PATCH v8 09/35] powerpc: Fix stack backtrace termination
On 05/04/2024 10.35, Nicholas Piggin wrote: The backtrace handler terminates when it sees a NULL caller address, but the powerpc stack setup does not keep such a NULL caller frame at the start of the stack. This happens to work on pseries because the memory at 0 is mapped and it contains 0 at the location of the return address pointer if it were a stack frame. But this is fragile, and does not work with powernv where address 0 contains firmware instructions. Use the existing dummy frame on stack as the NULL caller, and create a new frame on stack for the entry code. Signed-off-by: Nicholas Piggin --- powerpc/cstart64.S | 15 +++ 1 file changed, 15 insertions(+) Reviewed-by: Thomas Huth
Re: [kvm-unit-tests PATCH v8 06/35] gitlab-ci: Run migration selftest on s390x and powerpc
On 11/04/2024 21.22, Thomas Huth wrote: On 08/04/2024 18.06, Nico Boehr wrote: Quoting Nicholas Piggin (2024-04-05 10:35:07) The migration harness is complicated and easy to break so CI will be helpful. Signed-off-by: Nicholas Piggin --- .gitlab-ci.yml | 32 +++- s390x/unittests.cfg | 8 2 files changed, 31 insertions(+), 9 deletions(-) diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml index ff34b1f50..60b3cdfd2 100644 --- a/.gitlab-ci.yml +++ b/.gitlab-ci.yml [...] @@ -135,7 +147,7 @@ build-riscv64: build-s390x: extends: .outoftree_template script: - - dnf install -y qemu-system-s390x gcc-s390x-linux-gnu + - dnf install -y qemu-system-s390x gcc-s390x-linux-gnu nmap-ncat - mkdir build - cd build - ../configure --arch=s390x --cross-prefix=s390x-linux-gnu- @@ -161,6 +173,8 @@ build-s390x: sclp-1g sclp-3g selftest-setup + selftest-migration-kvm We're running under TCG in the Gitlab CI. I'm a little bit confused why we're running a KVM-only test here. The build-s390x job is TCG, indeed, but we have the "s390x-kvm" job that runs on a KVM-capable s390x host, so it could be added there? I now gave it a try and it seems to work, so I updated this patch and pushed it to the repository now. Thomas
Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()
On Mon, Apr 15, 2024 at 06:36:39PM +0100, Mark Rutland wrote: > On Mon, Apr 15, 2024 at 09:52:41AM +0200, Peter Zijlstra wrote: > > On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote: > > > +/** > > > + * enum execmem_type - types of executable memory ranges > > > + * > > > + * There are several subsystems that allocate executable memory. > > > + * Architectures define different restrictions on placement, > > > + * permissions, alignment and other parameters for memory that can be > > > used > > > + * by these subsystems. > > > + * Types in this enum identify subsystems that allocate executable memory > > > + * and let architectures define parameters for ranges suitable for > > > + * allocations by each subsystem. > > > + * > > > + * @EXECMEM_DEFAULT: default parameters that would be used for types that > > > + * are not explcitly defined. > > > + * @EXECMEM_MODULE_TEXT: parameters for module text sections > > > + * @EXECMEM_KPROBES: parameters for kprobes > > > + * @EXECMEM_FTRACE: parameters for ftrace > > > + * @EXECMEM_BPF: parameters for BPF > > > + * @EXECMEM_TYPE_MAX: > > > + */ > > > +enum execmem_type { > > > + EXECMEM_DEFAULT, > > > + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT, > > > + EXECMEM_KPROBES, > > > + EXECMEM_FTRACE, > > > + EXECMEM_BPF, > > > + EXECMEM_TYPE_MAX, > > > +}; > > > > Can we please get a break-down of how all these types are actually > > different from one another? > > > > I'm thinking some platforms have a tiny immediate space (arm64 comes to > > mind) and has less strict placement constraints for some of them? > > Yeah, and really I'd *much* rather deal with that in arch code, as I have said > several times. > > For arm64 we have two bsaic restrictions: > > 1) Direct branches can go +/-128M >We can expand this range by having direct branches go to PLTs, at a >performance cost. > > 2) PREL32 relocations can go +/-2G >We cannot expand this further. > > * We don't need to allocate memory for ftrace. We do not use trampolines. > > * Kprobes XOL areas don't care about either of those; we don't place any > PC-relative instructions in those. Maybe we want to in future. > > * Modules care about both; we'd *prefer* to place them within +/-128M of all > other kernel/module code, but if there's no space we can use PLTs and expand > that to +/-2G. Since modules can refreence other modules, that ends up > actually being halved, and modules have to fit within some 2G window that > also covers the kernel. > > * I'm not sure about BPF's requirements; it seems happy doing the same as > modules. BPF are happy with vmalloc(). > So if we *must* use a common execmem allocator, what we'd reall want is our > own > types, e.g. > > EXECMEM_ANYWHERE > EXECMEM_NOPLT > EXECMEM_PREL32 > > ... and then we use those in arch code to implement module_alloc() and > friends. I'm looking at execmem_types more as definition of the consumers, maybe I should have named the enum execmem_consumer at the first place. And the arch constrains defined in struct execmem_range describe how memory should be allocated for each consumer. These constraints are defined early at boot and remain static, so initializing them once and letting a common allocator use them makes perfect sense to me. I agree that fallback_{start,end} are not ideal, but we have 3 architectures that have preferred and secondary range for modules. And arm and powerpc use the same logic for kprobes as well, and I don't see why this code should be duplicated. And, for instance, if you decide to place PC-relative instructions if kprobes XOL areas, you'd only need to update execmem_range for kprobes to be more like the range for modules. With central allocator it's easier to deal with the things like VM_FLUSH_RESET_PERMS and caching of ROX memory and I think it will be more maintainable that module_alloc(), alloc_insn_page() and bpf_jit_alloc_exec() spread all over the place. > Mark. -- Sincerely yours, Mike.
Re: [kvm-unit-tests PATCH v8 10/35] powerpc: interrupt stack backtracing
On 05/04/2024 10.35, Nicholas Piggin wrote: Add support for backtracing across interrupt stacks, and add interrupt frame backtrace for unhandled interrupts. This requires a back-chain created from initial interrupt stack frame to the r1 value of the interrupted context. A label is added at the return location of the exception handler call, so the unwinder can recognize the initial interrupt frame. The additional cstart entry-frame is no longer required because the unwinder now looks for frame == 0 as well as address == 0. Signed-off-by: Nicholas Piggin --- Acked-by: Thomas Huth
Re: [RFC PATCH 6/7] execmem: add support for cache of large ROX pages
On Mon, Apr 15, 2024 at 08:00:26PM +0300, Mike Rapoport wrote: > On Mon, Apr 15, 2024 at 12:47:50PM +0200, Peter Zijlstra wrote: > > On Thu, Apr 11, 2024 at 07:05:25PM +0300, Mike Rapoport wrote: > > > > > To populate the cache, a writable large page is allocated from vmalloc > > > with > > > VM_ALLOW_HUGE_VMAP, filled with invalid instructions and then remapped as > > > ROX. > > > > > +static void execmem_invalidate(void *ptr, size_t size, bool writable) > > > +{ > > > + if (execmem_info->invalidate) > > > + execmem_info->invalidate(ptr, size, writable); > > > + else > > > + memset(ptr, 0, size); > > > +} > > > > +static void execmem_invalidate(void *ptr, size_t size, bool writeable) > > +{ > > + /* fill memory with INT3 instructions */ > > + if (writeable) > > + memset(ptr, 0xcc, size); > > + else > > + text_poke_set(ptr, 0xcc, size); > > +} > > > > Thing is, 0xcc (aka INT3_INSN_OPCODE) is not an invalid instruction. > > It raises #BP not #UD. > > Do you mean that _invalidate is a poor name choice or that it's necessary > to use an instruction that raises #UD? Poor naming, mostly. #BP handler will still scream bloody murder if the site is otherwise unclaimed. It just isn't an invalid instruction.
Re: [kvm-unit-tests PATCH v8 06/35] gitlab-ci: Run migration selftest on s390x and powerpc
On 16/04/2024 09.18, Thomas Huth wrote: On 11/04/2024 21.22, Thomas Huth wrote: On 08/04/2024 18.06, Nico Boehr wrote: Quoting Nicholas Piggin (2024-04-05 10:35:07) The migration harness is complicated and easy to break so CI will be helpful. Signed-off-by: Nicholas Piggin --- .gitlab-ci.yml | 32 +++- s390x/unittests.cfg | 8 2 files changed, 31 insertions(+), 9 deletions(-) diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml index ff34b1f50..60b3cdfd2 100644 --- a/.gitlab-ci.yml +++ b/.gitlab-ci.yml [...] @@ -135,7 +147,7 @@ build-riscv64: build-s390x: extends: .outoftree_template script: - - dnf install -y qemu-system-s390x gcc-s390x-linux-gnu + - dnf install -y qemu-system-s390x gcc-s390x-linux-gnu nmap-ncat - mkdir build - cd build - ../configure --arch=s390x --cross-prefix=s390x-linux-gnu- @@ -161,6 +173,8 @@ build-s390x: sclp-1g sclp-3g selftest-setup + selftest-migration-kvm We're running under TCG in the Gitlab CI. I'm a little bit confused why we're running a KVM-only test here. The build-s390x job is TCG, indeed, but we have the "s390x-kvm" job that runs on a KVM-capable s390x host, so it could be added there? I now gave it a try and it seems to work, so I updated this patch and pushed it to the repository now. Hmm, "selftest-migration" now was failing once here: https://gitlab.com/kvm-unit-tests/kvm-unit-tests/-/jobs/6633865591 Let's keep an eye on it, and if it is not stable enough, we might need to disable it in the CI again... Thomas
[PATCH v9 2/3] powerpc/fadump: add hotplug_ready sysfs interface
The elfcorehdr describes the CPUs and memory of the crashed kernel to the kernel that captures the dump, known as the second or fadump kernel. The elfcorehdr needs to be updated if the system's memory changes due to memory hotplug or online/offline events. Currently, memory hotplug events are monitored in userspace by udev rules, and fadump is re-registered, which recreates the elfcorehdr with the latest available memory in the system. However, the previous patch ("powerpc: make fadump resilient with memory add/remove events") moved the creation of elfcorehdr to the second or fadump kernel. This eliminates the need to regenerate the elfcorehdr during memory hotplug or online/offline events. Create a sysfs entry at /sys/kernel/fadump/hotplug_ready to let userspace know that fadump re-registration is not required for memory add/remove events. Signed-off-by: Sourabh Jain Cc: Aditya Gupta Cc: "Aneesh Kumar K.V" Cc: Hari Bathini Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Naveen N Rao --- Documentation/ABI/testing/sysfs-kernel-fadump | 11 +++ arch/powerpc/kernel/fadump.c | 14 ++ 2 files changed, 25 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-kernel-fadump b/Documentation/ABI/testing/sysfs-kernel-fadump index 8f7a64a81783..c586054657d6 100644 --- a/Documentation/ABI/testing/sysfs-kernel-fadump +++ b/Documentation/ABI/testing/sysfs-kernel-fadump @@ -38,3 +38,14 @@ Contact: linuxppc-dev@lists.ozlabs.org Description: read only Provide information about the amount of memory reserved by FADump to save the crash dump in bytes. + +What: /sys/kernel/fadump/hotplug_ready +Date: Apr 2024 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + Kdump udev rule re-registers fadump on memory add/remove events, + primarily to update the elfcorehdr. This sysfs indicates the + kdump udev rule that fadump re-registration is not required on + memory add/remove events because elfcorehdr is now prepared in + the second/fadump kernel. +User: kexec-tools diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c index e816725a11c0..131bf0d2d45d 100644 --- a/arch/powerpc/kernel/fadump.c +++ b/arch/powerpc/kernel/fadump.c @@ -1442,6 +1442,18 @@ static ssize_t enabled_show(struct kobject *kobj, return sprintf(buf, "%d\n", fw_dump.fadump_enabled); } +/* + * /sys/kernel/fadump/hotplug_ready sysfs node returns 1, which inidcates + * to usersapce that fadump re-registration is not required on memory + * hotplug events. + */ +static ssize_t hotplug_ready_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + return sprintf(buf, "%d\n", 1); +} + static ssize_t mem_reserved_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) @@ -1514,11 +1526,13 @@ static struct kobj_attribute release_attr = __ATTR_WO(release_mem); static struct kobj_attribute enable_attr = __ATTR_RO(enabled); static struct kobj_attribute register_attr = __ATTR_RW(registered); static struct kobj_attribute mem_reserved_attr = __ATTR_RO(mem_reserved); +static struct kobj_attribute hotplug_ready_attr = __ATTR_RO(hotplug_ready); static struct attribute *fadump_attrs[] = { &enable_attr.attr, ®ister_attr.attr, &mem_reserved_attr.attr, + &hotplug_ready_attr.attr, NULL, }; -- 2.43.0
[PATCH v9 1/3] powerpc: make fadump resilient with memory add/remove events
Due to changes in memory resources caused by either memory hotplug or online/offline events, the elfcorehdr, which describes the CPUs and memory of the crashed kernel to the kernel that collects the dump (known as second/fadump kernel), becomes outdated. Consequently, attempting dump collection with an outdated elfcorehdr can lead to failed or inaccurate dump collection. Memory hotplug or online/offline events is referred as memory add/remove events in reset of the commit message. The current solution to address the aforementioned issue is as follows: Monitor memory add/remove events in userspace using udev rules, and re-register fadump whenever there are changes in memory resources. This leads to the creation of a new elfcorehdr with updated system memory information. There are several notable issues associated with re-registering fadump for every memory add/remove events. 1. Bulk memory add/remove events with udev-based fadump re-registration can lead to race conditions and, more importantly, it creates a wide window during which fadump is inactive until all memory add/remove events are settled. 2. Re-registering fadump for every memory add/remove event is inefficient. 3. The memory for elfcorehdr is allocated based on the memblock regions available during early boot and remains fixed thereafter. However, if elfcorehdr is later recreated with additional memblock regions, its size will increase, potentially leading to memory corruption. Address the aforementioned challenges by shifting the creation of elfcorehdr from the first kernel (also referred as the crashed kernel), where it was created and frequently recreated for every memory add/remove event, to the fadump kernel. As a result, the elfcorehdr only needs to be created once, thus eliminating the necessity to re-register fadump during memory add/remove events. At present, the first kernel prepares fadump header and stores it in the fadump reserved area. The fadump header includes the start address of the elfcorehdr, crashing CPU details, and other relevant information. In the event of a crash in the first kernel, the second/fadump boots and accesses the fadump header prepared by the first kernel. It then performs the following steps in a platform-specific function [rtas|opal]_fadump_process: 1. Sanity check for fadump header 2. Update CPU notes in elfcorehdr Along with the above, update the setup_fadump()/fadump.c to create elfcorehdr and set its address to the global variable elfcorehdr_addr for the vmcore module to process it in the second/fadump kernel. Section below outlines the information required to create the elfcorehdr and the changes made to make it available to the fadump kernel if it's not already. To create elfcorehdr, the following crashed kernel information is required: CPU notes, vmcoreinfo, and memory ranges. At present, the CPU notes are already prepared in the fadump kernel, so no changes are needed in that regard. The fadump kernel has access to all crashed kernel memory regions, including boot memory regions that are relocated by firmware to fadump reserved areas, so no changes for that either. However, it is necessary to add new members to the fadump header, i.e., the 'fadump_crash_info_header' structure, in order to pass the crashed kernel's vmcoreinfo address and its size to fadump kernel. In addition to the vmcoreinfo address and size, there are a few other attributes also added to the fadump_crash_info_header structure. 1. version: It stores the fadump header version, which is currently set to 1. This provides flexibility to update the fadump crash info header in the future without changing the magic number. For each change in the fadump header, the version will be increased. This will help the updated kernel determine how to handle kernel dumps from older kernels. The magic number remains relevant for checking fadump header corruption. 2. pt_regs_sz/cpu_mask_sz: Store size of pt_regs and cpu_mask structure of first kernel. These attributes are used to prevent dump processing if the sizes of pt_regs or cpu_mask structure differ between the first and fadump kernels. Note: if either first/crashed kernel or second/fadump kernel do not have the changes introduced here then kernel fail to collect the dump and prints relevant error message on the console. Signed-off-by: Sourabh Jain Cc: Aditya Gupta Cc: "Aneesh Kumar K.V" Cc: Hari Bathini Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Naveen N Rao --- arch/powerpc/include/asm/fadump-internal.h | 31 +- arch/powerpc/kernel/fadump.c | 361 +++ arch/powerpc/platforms/powernv/opal-fadump.c | 22 +- arch/powerpc/platforms/pseries/rtas-fadump.c | 34 +- 4 files changed, 242 insertions(+), 206 deletions(-) diff --git a/arch/powerpc/include/asm/fadump-internal.h b/arch/powerpc/include/asm/fadump-internal.h index 27f9e11eda28..5d706a7acc8a 100644 --- a/arch/powerpc/include/a
[PATCH v9 3/3] Documentation/powerpc: update fadump implementation details
The patch titled ("powerpc: make fadump resilient with memory add/remove events") has made significant changes to the implementation of fadump, particularly on elfcorehdr creation and fadump crash info header structure. Therefore, updating the fadump implementation documentation to reflect those changes. Following updates are done to firmware assisted dump documentation: 1. The elfcorehdr is no longer stored after fadump HDR in the reserved dump area. Instead, the second kernel dynamically allocates memory for the elfcorehdr within the address range from 0 to the boot memory size. Therefore, update figures 1 and 2 of Memory Reservation during the first and second kernels to reflect this change. 2. A version field has been added to the fadump header to manage the future changes to fadump crash info header structure without changing the fadump header magic number in the future. Therefore, remove the corresponding TODO from the document. Signed-off-by: Sourabh Jain Cc: Aditya Gupta Cc: Aneesh Kumar K.V Cc: Hari Bathini Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Naveen N Rao --- .../arch/powerpc/firmware-assisted-dump.rst | 91 +-- 1 file changed, 42 insertions(+), 49 deletions(-) diff --git a/Documentation/arch/powerpc/firmware-assisted-dump.rst b/Documentation/arch/powerpc/firmware-assisted-dump.rst index e363fc48529a..7e37aadd1f77 100644 --- a/Documentation/arch/powerpc/firmware-assisted-dump.rst +++ b/Documentation/arch/powerpc/firmware-assisted-dump.rst @@ -134,12 +134,12 @@ that are run. If there is dump data, then the memory is held. If there is no waiting dump data, then only the memory required to -hold CPU state, HPTE region, boot memory dump, FADump header and -elfcore header, is usually reserved at an offset greater than boot -memory size (see Fig. 1). This area is *not* released: this region -will be kept permanently reserved, so that it can act as a receptacle -for a copy of the boot memory content in addition to CPU state and -HPTE region, in the case a crash does occur. +hold CPU state, HPTE region, boot memory dump, and FADump header is +usually reserved at an offset greater than boot memory size (see Fig. 1). +This area is *not* released: this region will be kept permanently +reserved, so that it can act as a receptacle for a copy of the boot +memory content in addition to CPU state and HPTE region, in the case +a crash does occur. Since this reserved memory area is used only after the system crash, there is no point in blocking this significant chunk of memory from @@ -153,22 +153,22 @@ that were present in CMA region:: o Memory Reservation during first kernel - Low memory Top of memory - 0boot memory size |<--- Reserved dump area --->| | - | | |Permanent Reservation | | - V V || V - +---+-/ /---+---++---+-+-++--+ - | | |///|| DUMP | HDR | ELF || | - +---+-/ /---+---++---+-+-++--+ -| ^^ ^ ^ ^ -| || | | | -\ CPU HPTE / | | - -- | | - Boot memory content gets transferred| | - to reserved area by firmware at the | | - time of crash. | | - FADump Header | - (meta area)| + Low memory Top of memory + 0boot memory size |<-- Reserved dump area ->| | + | | | Permanent Reservation | | + V V | | V + +---+-/ /---+---++---+---++-+ + | | |///||DUMP | HDR || | + +---+-/ /---+---++---+---++-+ +| ^^ ^ ^ ^ +| || | | | +\ CPU HPTE / | | + | | + Boot memory content gets transferred | | + to reserved area by firmware at the | | + time of crash. | | + FADump Header | +(meta area) | | | Metadata: This area holds a metadata structure whose @@ -186,13 +186,20 @@ that were present in CMA
[PATCH v9 0/3] powerpc: make fadump resilient with memory add/remove events
Problem: Due to changes in memory resources caused by either memory hotplug or online/offline events, the elfcorehdr, which describes the cpus and memory of the crashed kernel to the kernel that collects the dump (known as second/fadump kernel), becomes outdated. Consequently, attempting dump collection with an outdated elfcorehdr can lead to failed or inaccurate dump collection. Memory hotplug or online/offline events is referred as memory add/remove events in reset of the patch series. Existing solution: == Monitor memory add/remove events in userspace using udev rules, and re-register fadump whenever there are changes in memory resources. This leads to the creation of a new elfcorehdr with updated system memory information. Challenges with existing solution: == 1. Performing bulk memory add/remove with udev-based fadump re-registration can lead to race conditions and, more importantly, it creates a large wide window during which fadump is inactive until all memory add/remove events are settled. 2. Re-registering fadump for every memory add/remove event is inefficient. 3. Memory for elfcorehdr is allocated based on the memblock regions available during first kernel early boot and it remains fixed thereafter. However, if the elfcorehdr is later recreated with additional memblock regions, its size will increase, potentially leading to memory corruption. Proposed solution: == Address the aforementioned challenges by shifting the creation of elfcorehdr from the first kernel (also referred as the crashed kernel), where it was created and frequently recreated for every memory add/remove event, to the fadump kernel. As a result, the elfcorehdr only needs to be created once, thus eliminating the necessity to re-register fadump during memory add/remove events. To know more about elfcorehdr creation in the fadump kernel, refer to the first patch in this series. The second patch includes a new sysfs interface that tells userspace that fadump re-registration isn't needed for memory add/remove events. note that userspace changes do not need to be in sync with kernel changes; they can roll out independently. Since there are significant changes in the fadump implementation, the third patch updates the fadump documentation to reflect the changes made in this patch series. Kernel tree rebased on 6.9.0-rc3 with patch series applied: === https://github.com/sourabhjains/linux/tree/fadump-mem-hotplug-v9 Userspace changes: == To realize this feature, one must update the kdump udev rules to prevent fadump re-registration during memory add/remove events. On rhel apply the following changes to file /usr/lib/udev/rules.d/98-kexec.rules -RUN+="/bin/sh -c '/usr/bin/systemctl is-active kdump.service || exit 0; /usr/bin/systemd-run --quiet --no-block /usr/lib/udev/kdump-udev-throttler'" +# don't re-register fadump if the value of the node +# /sys/kernel/fadump/hotplug_ready is 1. + +RUN+="/bin/sh -c '/usr/bin/systemctl is-active kdump.service || exit 0; ! test -f /sys/kernel/fadump_enabled || cat /sys/kernel/fadump_enabled | grep 0 || ! test -f /sys/kernel/fadump/hotplug_ready || cat /sys/kernel/fadump/hotplug_ready | grep 0 || exit 0; /usr/bin/systemd-run --quiet --no-block /usr/lib/udev/kdump-udev-throttler'" Changelog: == v9: 16 Apr 2024 - Set the physical address of elfcorehdr to elfcorehdr_addr. 1/3 - Set elfcorehdr_addr to ELFCORE_ADDR_ERR before freeing the elfcorehdr. 1/3 - Mark the populate_elf_pt_load function as __init. 1/3 - Rename a function from process_fadump to fadump_process. 1/3 - Make minor changes to the commit message and a couple of comments in 1/3. - Update date of introduction of /sys/kernel/fadump/hotplug_ready sysfs. 2/3 - Rebase it to 6.9-rc3. v8: 16 Feb 2024 https://lore.kernel.org/all/20240217072004.148293-1-sourabhj...@linux.ibm.com/ - Move `elfcorehdr_addr` and `elfcorehdr_size` struct attributes from `struct fadump_crash_info_header` to `struct fw_dump`. - Make minor changes in commit message 1/3. - Rebase it to 6.8-rc4. v7: 11 Jan 2024 https://lore.kernel.org/all/2024040943.297501-1-sourabhj...@linux.ibm.com/ - Rebase it to 6.7 v6: 8 Dec 2023 https://lore.kernel.org/all/20231208115159.82236-1-sourabhj...@linux.ibm.com/ - Add size fields for `pt_regs` and `cpumask` in the fadump header structure - Don't process the dump if the size of `pt_regs` and `cpu_mask` is not same in the crashed and fadump kernel - Include an additional check for endianness mismatch when the magic number doesn't match, to print the relevant error message - Don't process the dump if the fadump header contains an old magic number - Rebased it to 6.7.0-rc4 v5: 29 Oct 2023 https://lore.kernel.org/all/20231029124548.12198-1-sourabhj...@linux.ibm.com/ - Fix a comment on the fi
powerpc: io-defs.h:43:1: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
The Powerpc clang builds failed due to following warnings / errors on the Linux next-20240416 tag. Powerpc: - tqm8xx_defconfig + clang-17 - Failed - allnoconfig + clang-17 - Failed - tinyconfig + clang-17 - Failed Reported-by: Linux Kernel Functional Testing Build log: In file included from arch/powerpc/kernel/ptrace/ptrace.c:19: In file included from include/linux/syscalls.h:93: In file included from include/trace/syscall.h:7: In file included from include/linux/trace_events.h:9: In file included from include/linux/hardirq.h:11: In file included from arch/powerpc/include/asm/hardirq.h:6: In file included from include/linux/irq.h:20: In file included from include/linux/io.h:14: In file included from arch/powerpc/include/asm/io.h:672: arch/powerpc/include/asm/io-defs.h:43:1: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic] 43 | DEF_PCI_AC_NORET(insb, (unsigned long p, void *b, unsigned long c), | ^~~ 44 | (p, b, c), pio, p) | ~~ Links: - https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20240416/testrun/23495171/suite/build/test/clang-17-tqm8xx_defconfig/details/ - https://storage.tuxsuite.com/public/linaro/lkft/builds/2fAvI3mKJ0dTHcazPrLl2zNF9JO/ -- Linaro LKFT https://lkft.linaro.org
Re: [RFC PATCH 3/7] module: prepare to handle ROX allocations for text
> On 11 Apr 2024, at 19:05, Mike Rapoport wrote: > > @@ -2440,7 +2479,24 @@ static int post_relocation(struct module *mod, const > struct load_info *info) > add_kallsyms(mod, info); > > /* Arch-specific module finalizing. */ > - return module_finalize(info->hdr, info->sechdrs, mod); > + ret = module_finalize(info->hdr, info->sechdrs, mod); > + if (ret) > + return ret; > + > + for_each_mod_mem_type(type) { > + struct module_memory *mem = &mod->mem[type]; > + > + if (mem->is_rox) { > + if (!execmem_update_copy(mem->base, mem->rw_copy, > + mem->size)) > + return -ENOMEM; > + > + vfree(mem->rw_copy); > + mem->rw_copy = NULL; > + } > + } > + > + return 0; > } I might be missing something, but it seems a bit racy. IIUC, module_finalize() calls alternatives_smp_module_add(). At this point, since you don’t hold the text_mutex, some might do text_poke(), e.g., by enabling/disabling static-key, and the update would be overwritten. No?
Re: [kvm-unit-tests PATCH v8 11/35] powerpc/sprs: Specify SPRs with data rather than code
On 05/04/2024 10.35, Nicholas Piggin wrote: A significant rework that builds an array of 'struct spr', where each element describes an SPR. This makes various metadata about the SPR like name and access type easier to carry and use. Hypervisor privileged registers are described despite not being used at the moment for completeness, but also the code might one day be reused for a hypervisor-privileged test. Acked-by: Thomas Huth Signed-off-by: Nicholas Piggin --- lib/powerpc/asm/reg.h | 2 + powerpc/sprs.c| 647 +- 2 files changed, 457 insertions(+), 192 deletions(-) I tried to merge this patch, but it seems to break running the tests on Ubuntu Focal with Clang 11 in the Travis-CI: https://app.travis-ci.com/github/huth/kvm-unit-tests/jobs/620577246 Could you please have a look? Thanks, Thomas
Re: [kvm-unit-tests PATCH v8 06/35] gitlab-ci: Run migration selftest on s390x and powerpc
On 16/04/2024 09.55, Thomas Huth wrote: On 16/04/2024 09.18, Thomas Huth wrote: On 11/04/2024 21.22, Thomas Huth wrote: On 08/04/2024 18.06, Nico Boehr wrote: Quoting Nicholas Piggin (2024-04-05 10:35:07) The migration harness is complicated and easy to break so CI will be helpful. Signed-off-by: Nicholas Piggin --- .gitlab-ci.yml | 32 +++- s390x/unittests.cfg | 8 2 files changed, 31 insertions(+), 9 deletions(-) diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml index ff34b1f50..60b3cdfd2 100644 --- a/.gitlab-ci.yml +++ b/.gitlab-ci.yml [...] @@ -135,7 +147,7 @@ build-riscv64: build-s390x: extends: .outoftree_template script: - - dnf install -y qemu-system-s390x gcc-s390x-linux-gnu + - dnf install -y qemu-system-s390x gcc-s390x-linux-gnu nmap-ncat - mkdir build - cd build - ../configure --arch=s390x --cross-prefix=s390x-linux-gnu- @@ -161,6 +173,8 @@ build-s390x: sclp-1g sclp-3g selftest-setup + selftest-migration-kvm We're running under TCG in the Gitlab CI. I'm a little bit confused why we're running a KVM-only test here. The build-s390x job is TCG, indeed, but we have the "s390x-kvm" job that runs on a KVM-capable s390x host, so it could be added there? I now gave it a try and it seems to work, so I updated this patch and pushed it to the repository now. Hmm, "selftest-migration" now was failing once here: https://gitlab.com/kvm-unit-tests/kvm-unit-tests/-/jobs/6633865591 Let's keep an eye on it, and if it is not stable enough, we might need to disable it in the CI again... And it just failed again: https://gitlab.com/kvm-unit-tests/kvm-unit-tests/-/jobs/6635395811 ... not sure whether this is due to the slow CI machine or whether there is still a bug lurking around somewhere, but anyway, I disabled it now for the CI again to avoid that other MRs get affected. Thomas
Re: [PATCH 0/2] ASoC: fsl-asoc-card: add wm8904 codec support
On Mon, 15 Apr 2024 15:17:34 +0800, Shengjiu Wang wrote: > add wm8904 codec support in fsl-asoc-card. > > Shengjiu Wang (2): > ASoC: fsl-asoc-card: add wm8904 codec support > ASoC: dt-bindings: fsl-asoc-card: Add compatbile string for wm8904 > codec > > [...] Applied to https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next Thanks! [1/2] ASoC: fsl-asoc-card: add wm8904 codec support commit: 51f67862ea6ea83f9fa4cda2e59d7bfd4387f63b [2/2] ASoC: dt-bindings: fsl-asoc-card: Add compatbile string for wm8904 codec commit: 62c48dd33b4f2e037554d1322ae4f9f60e9461ef All being well this means that it will be integrated into the linux-next tree (usually sometime in the next 24 hours) and sent to Linus during the next merge window (or sooner if it is a bug fix), however if problems are discovered then the patch may be dropped or reverted. You may get further e-mails resulting from automated or manual testing and review of the tree, please engage with people reporting problems and send followup patches addressing any issues that are reported if needed. If any updates are required or you are submitting further changes they should be sent as incremental updates against current git, existing patches will not be replaced. Please add any relevant lists and maintainers to the CCs when replying to this mail. Thanks, Mark
Re: powerpc: io-defs.h:43:1: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
On Tue, Apr 16, 2024 at 03:02:52PM +0530, Naresh Kamboju wrote: > In file included from arch/powerpc/include/asm/io.h:672: > arch/powerpc/include/asm/io-defs.h:43:1: error: performing pointer > arithmetic on a null pointer has undefined behavior > [-Werror,-Wnull-pointer-arithmetic] It is not UB, but just undefined: the program is meaningless. It is not a null pointer but even a null pointer constant here. It matters in places, including here. It would help if the warnings were more correct :-( Segher
Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx
Le 15/04/2024 à 21:12, Christophe Leroy a écrit : > > > Le 12/04/2024 à 16:30, Peter Xu a écrit : >> On Fri, Apr 12, 2024 at 02:08:03PM +, Christophe Leroy wrote: >>> >>> >>> Le 11/04/2024 à 18:15, Peter Xu a écrit : On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote: > On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote: >> This series reimplements hugepages with hugepd on powerpc 8xx. >> >> Unlike most architectures, powerpc 8xx HW requires a two-level >> pagetable topology for all page sizes. So a leaf PMD-contig approach >> is not feasible as such. >> >> Possible sizes are 4k, 16k, 512k and 8M. >> >> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD >> entries >> must point to a single entry level-2 page table. Until now that was >> done using hugepd. This series changes it to use standard page tables >> where the entry is replicated 1024 times on each of the two >> pagetables >> refered by the two associated PMD entries for that 8M page. >> >> At the moment it has to look into each helper to know if the >> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or >> a lower size. I hope this can me handled by core-mm in the future. >> >> There are probably several ways to implement stuff, so feedback is >> very welcome. > > I thought it looks pretty good! I second it. I saw the discussions in patch 1. Christophe, I suppose you're exploring the big hammer over hugepd, and perhaps went already with the 32bit pmd solution for nohash/32bit challenge you mentioned? I'm trying to position my next step; it seems like at least I should not adding any more hugepd code, then should I go with ARCH_HAS_HUGEPD checks, or you're going to have an RFC soon then I can base on top? >>> >>> Depends on what you expect by "soon". >>> >>> I sure won't be able to send any RFC before end of April. >>> >>> Should be possible to have something during May. >> >> That's good enough, thanks. I'll see what is the best I can do. >> >> Then do you think I can leave p4d/pgd leaves alone? Please check the >> other >> email where I'm not sure whether pgd leaves ever existed for any of >> PowerPC. That's so far what I plan to do, on teaching pgtable walkers >> recognize pud and lower for all leaves. Then if Power can switch from >> hugepd to this it should just work. > > Well, if I understand correctly, something with no PMD will include > and will therefore only have pmd > entries (hence no pgd/p4d/pud entries). Looks odd but that's what it is. > pgd_populate(), p4d_populate(), pud_populate() are all "do { } while > (0)" and only pmd_populate exists. So only pmd_leaf() will exist in that > case. > > And therefore including means you > have p4d entries. Doesn't mean you have p4d_leaf() but that needs to be > checked. > > >> >> Even if pgd exists (then something I overlooked..), I'm wondering whether >> we can push that downwards to be either pud/pmd (and looks like we all >> agree p4d is never used on Power). That may involve some pgtable >> operations moving from pgd level to lower, e.g. my pure imagination would >> look like starting with: > > Yes I think there is no doubt that p4d is never used: > > arch/powerpc/include/asm/book3s/32/pgtable.h:#include > > arch/powerpc/include/asm/book3s/64/pgtable.h:#include > > arch/powerpc/include/asm/nohash/32/pgtable.h:#include > > arch/powerpc/include/asm/nohash/64/pgtable-4k.h:#include > > > But that means that PPC32 have pmd entries and PPC64 have p4d entries ... > >> >> #define PTE_INDEX_SIZE PTE_SHIFT >> #define PMD_INDEX_SIZE 0 >> #define PUD_INDEX_SIZE 0 >> #define PGD_INDEX_SIZE (32 - PGDIR_SHIFT) >> >> To: >> >> #define PTE_INDEX_SIZE PTE_SHIFT >> #define PMD_INDEX_SIZE (32 - PMD_SHIFT) >> #define PUD_INDEX_SIZE 0 >> #define PGD_INDEX_SIZE 0 > > But then you can't anymore have #define PTRS_PER_PMD 1 from > > >> >> And the rest will need care too. I hope moving downward is easier >> (e.g. the walker should always exist for lower levels but not always for >> higher levels), but I actually have little idea on whether there's any >> other implications, so please bare with me on stupid mistakes. >> >> I just hope pgd leaves don't exist already, then I think it'll be >> simpler. >> >> Thanks, >> Digging into asm-generic/pgtable-nopmd.h, I see a definition of pud_leaf() always returning 0, introduced by commit 2c8a81dc0cc5 ("riscv/mm: fix two page table check related issues") So should asm-generic/pgtable-nopud.h contain the same for p4d_leaf() and asm-generic/pgtable-nop4d.h contain the same for pgd_leaf() ? Christophe
Re: powerpc: io-defs.h:43:1: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
On Tue, Apr 16, 2024, at 11:32, Naresh Kamboju wrote: > The Powerpc clang builds failed due to following warnings / errors on the > Linux next-20240416 tag. > > Powerpc: > - tqm8xx_defconfig + clang-17 - Failed > - allnoconfig + clang-17 - Failed > - tinyconfig + clang-17 - Failed > > Reported-by: Linux Kernel Functional Testing I'm not sure why this showed up now, but there is a series from in progress that will avoid this in the future, as the same issue is present on a couple of other architectures. The broken definitions are in the !CONFIG_PCI path of #ifndef CONFIG_PCI #define _IO_BASE0 #define _ISA_MEM_BASE 0 #define PCI_DRAM_OFFSET 0 #elif defined(CONFIG_PPC32) #define _IO_BASEisa_io_base #define _ISA_MEM_BASE isa_mem_base #define PCI_DRAM_OFFSET pci_dram_offset #else #define _IO_BASEpci_io_base #define _ISA_MEM_BASE isa_mem_base #define PCI_DRAM_OFFSET 0 #endif Once the series is merged, the !PCI case can disable CONFIG_HAS_IOPORT and move all references to it into #ifdef sections, something like the (incomplete) patch below. It looks like regardless of this, powerpc can also just set _IO_BASE to ISA_IO_BASE unconditionally, but I could be missing something there. Arnd --- diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h index 08c550ed49be..29e002b9316c 100644 --- a/arch/powerpc/include/asm/io.h +++ b/arch/powerpc/include/asm/io.h @@ -36,11 +36,8 @@ extern struct pci_dev *isa_bridge_pcidev; * bases. Most of this file only uses _IO_BASE though which we * define properly based on the platform */ -#ifndef CONFIG_PCI -#define _IO_BASE 0 -#define _ISA_MEM_BASE 0 -#define PCI_DRAM_OFFSET 0 -#elif defined(CONFIG_PPC32) +#ifdef CONFIG_HAS_IOPORT +#ifdef CONFIG_PPC32 #define _IO_BASE isa_io_base #define _ISA_MEM_BASE isa_mem_base #define PCI_DRAM_OFFSETpci_dram_offset @@ -486,8 +483,7 @@ static inline u64 __raw_rm_readq(volatile void __iomem *paddr) * to port it over */ -#ifdef CONFIG_PPC32 - +#if defined(CONFIG_PPC32) && defined(CONFIG_HAS_IOPORT) #define __do_in_asm(name, op) \ static inline unsigned int name(unsigned int port) \ { \ @@ -534,7 +530,7 @@ __do_out_asm(_rec_outb, "stbx") __do_out_asm(_rec_outw, "sthbrx") __do_out_asm(_rec_outl, "stwbrx") -#endif /* CONFIG_PPC32 */ +#endif /* CONFIG_PPC32 && CONFIG_HAS_IOPORT */ /* The "__do_*" operations below provide the actual "base" implementation * for each of the defined accessors. Some of them use the out_* functions @@ -577,6 +573,7 @@ __do_out_asm(_rec_outl, "stwbrx") #define __do_readq_be(addr)in_be64(PCI_FIX_ADDR(addr)) #endif /* !defined(CONFIG_EEH) */ +#ifdef CONFIG_HAS_IOPORT #ifdef CONFIG_PPC32 #define __do_outb(val, port) _rec_outb(val, port) #define __do_outw(val, port) _rec_outw(val, port) @@ -592,6 +589,7 @@ __do_out_asm(_rec_outl, "stwbrx") #define __do_inw(port) readw((PCI_IO_ADDR)_IO_BASE + port); #define __do_inl(port) readl((PCI_IO_ADDR)_IO_BASE + port); #endif /* !CONFIG_PPC32 */ +#endif #ifdef CONFIG_EEH #define __do_readsb(a, b, n) eeh_readsb(PCI_FIX_ADDR(a), (b), (n)) @@ -606,12 +604,14 @@ __do_out_asm(_rec_outl, "stwbrx") #define __do_writesw(a, b, n) _outsw(PCI_FIX_ADDR(a),(b),(n)) #define __do_writesl(a, b, n) _outsl(PCI_FIX_ADDR(a),(b),(n)) +#ifdef CONFIG_HAS_IOPORT #define __do_insb(p, b, n) readsb((PCI_IO_ADDR)_IO_BASE+(p), (b), (n)) #define __do_insw(p, b, n) readsw((PCI_IO_ADDR)_IO_BASE+(p), (b), (n)) #define __do_insl(p, b, n) readsl((PCI_IO_ADDR)_IO_BASE+(p), (b), (n)) #define __do_outsb(p, b, n)writesb((PCI_IO_ADDR)_IO_BASE+(p),(b),(n)) #define __do_outsw(p, b, n)writesw((PCI_IO_ADDR)_IO_BASE+(p),(b),(n)) #define __do_outsl(p, b, n)writesl((PCI_IO_ADDR)_IO_BASE+(p),(b),(n)) +#endif #define __do_memset_io(addr, c, n) \ _memset_io(PCI_FIX_ADDR(addr), c, n) @@ -689,6 +689,8 @@ static inline void name at \ #define writesb writesb #define writesw writesw #define writesl writesl + +#ifdef CONFIG_HAS_IOPORT #define inb inb #define inw inw #define inl inl @@ -701,6 +703,8 @@ static inline void name at \ #define outsb outsb #define outsw outsw #define outsl outsl +#endif + #ifdef __powerpc64__ #define readq readq #define writeq writeq
Re: [PATCH 1/3] x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n
Sean Christopherson writes: > On Mon, Apr 15, 2024, Geert Uytterhoeven wrote: >> On Sat, Apr 13, 2024 at 11:38 AM Michael Ellerman >> wrote: >> > Michael Ellerman writes: >> > > Stephen Rothwell writes: >> > ... >> > >> On Tue, 9 Apr 2024 10:51:05 -0700 Sean Christopherson >> > >> wrote: >> > ... >> > >>> diff --git a/kernel/cpu.c b/kernel/cpu.c >> > >>> index 8f6affd051f7..07ad53b7f119 100644 >> > >>> --- a/kernel/cpu.c >> > >>> +++ b/kernel/cpu.c >> > >>> @@ -3207,7 +3207,8 @@ enum cpu_mitigations { >> > >>> }; >> > >>> >> > >>> static enum cpu_mitigations cpu_mitigations __ro_after_init = >> > >>> - CPU_MITIGATIONS_AUTO; >> > >>> + IS_ENABLED(CONFIG_SPECULATION_MITIGATIONS) ? CPU_MITIGATIONS_AUTO : >> > >>> +CPU_MITIGATIONS_OFF; >> > >>> >> > >>> static int __init mitigations_parse_cmdline(char *arg) >> > >>> { >> > >> > I think a minimal workaround/fix would be: >> > >> > diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig >> > index 2b8fd6bb7da0..290be2f9e909 100644 >> > --- a/drivers/base/Kconfig >> > +++ b/drivers/base/Kconfig >> > @@ -191,6 +191,10 @@ config GENERIC_CPU_AUTOPROBE >> > config GENERIC_CPU_VULNERABILITIES >> > bool >> > >> > +config SPECULATION_MITIGATIONS >> > + def_bool y >> > + depends on !X86 >> > + >> > config SOC_BUS >> > bool >> > select GLOB >> >> Thanks, that works for me (on arm64), so >> Tested-by: Geert Uytterhoeven > > Oof. I completely missed that "cpu_mitigations" wasn't x86-only. I can't > think > of better solution than an on-by-default generic Kconfig, though can't that it > more simply be: > > diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig > index 2b8fd6bb7da0..5930cb56ee29 100644 > --- a/drivers/base/Kconfig > +++ b/drivers/base/Kconfig > @@ -191,6 +191,9 @@ config GENERIC_CPU_AUTOPROBE > config GENERIC_CPU_VULNERABILITIES > bool > > +config SPECULATION_MITIGATIONS > + def_bool !X86 > + Yeah that works too. cheers
RE: powerpc: io-defs.h:43:1: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
From: Segher Boessenkool > Sent: 16 April 2024 11:38 > > On Tue, Apr 16, 2024 at 03:02:52PM +0530, Naresh Kamboju wrote: > > In file included from arch/powerpc/include/asm/io.h:672: > > arch/powerpc/include/asm/io-defs.h:43:1: error: performing pointer > > arithmetic on a null pointer has undefined behavior > > [-Werror,-Wnull-pointer-arithmetic] > > It is not UB, but just undefined: the program is meaningless. > > It is not a null pointer but even a null pointer constant here. It > matters in places, including here. > > It would help if the warnings were more correct :-( Isn't it only a problem because the NULL pointer isn't required to be the all-zero bit pattern? So when do we get a warning from using memset() on a structure that contains pointers? Since it is equally buggy. Has anyone ever seen a system where NULL wasn't 'all zeros'? I've used a system where the 'native' invalid pointer was 'all ones', but even there the C code used 'all zeros'. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Re: [0/2] powerpc/powernv/vas: Adjustments for two function implementations
Markus Elfring writes: >> A few update suggestions were taken into account >> from static source code analysis. >> >> Markus Elfring (2): > > I would appreciate a bit more information about the reasons > why this patch series was rejected. > > >> One function call less in vas_window_alloc() after error detection > > https://patchwork.ozlabs.org/project/linuxppc-dev/patch/1f1c21cf-c34c-418c-b00c-8e6474f12...@web.de/ It introduced a new goto and label to avoid a kfree(NULL) call, but kfree() explicitly accepts NULL and handles it. So it complicates the source code for no gain. >> Return directly after a failed kasprintf() in map_paste_region() > > https://patchwork.ozlabs.org/project/linuxppc-dev/patch/f46f04bc-613c-4e98-b602-4c5120556...@web.de/ Basically the same reasoning. And it also changes the function from having two return paths (success and error), to three. cheers
Re: [0/2] powerpc/powernv/vas: Adjustments for two function implementations
Le 16/04/2024 à 13:11, Michael Ellerman a écrit : > Markus Elfring writes: >>> A few update suggestions were taken into account >>> from static source code analysis. >>> >>> Markus Elfring (2): >> >> I would appreciate a bit more information about the reasons >> why this patch series was rejected. >> >> >>>One function call less in vas_window_alloc() after error detection >> >> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/1f1c21cf-c34c-418c-b00c-8e6474f12...@web.de/ > > It introduced a new goto and label to avoid a kfree(NULL) call, but > kfree() explicitly accepts NULL and handles it. So it complicates the > source code for no gain. This is explicit in Kernel documentation: /** * kfree - free previously allocated memory * @object: pointer returned by kmalloc() or kmem_cache_alloc() * * If @object is NULL, no operation is performed. */ That's exactly the same behaviour as free() in libc. So Coccinelle should be fixed if it reports an error for that. > >>>Return directly after a failed kasprintf() in map_paste_region() >> >> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/f46f04bc-613c-4e98-b602-4c5120556...@web.de/ > > Basically the same reasoning. And it also changes the function from > having two return paths (success and error), to three. > Looking at that function, I however see a missing region release when ioremap_cache() fails. Christophe
Re: powerpc: io-defs.h:43:1: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
On Tue, Apr 16, 2024, at 13:01, Arnd Bergmann wrote: > On Tue, Apr 16, 2024, at 11:32, Naresh Kamboju wrote: >> The Powerpc clang builds failed due to following warnings / errors on the >> Linux next-20240416 tag. >> >> Powerpc: >> - tqm8xx_defconfig + clang-17 - Failed >> - allnoconfig + clang-17 - Failed >> - tinyconfig + clang-17 - Failed >> >> Reported-by: Linux Kernel Functional Testing > > I'm not sure why this showed up now, but there is a series from > in progress that will avoid this in the future, as the same > issue is present on a couple of other architectures. > I see now, it was introduced by my patch to turn on -Wextra by default. I had tested that patch on all architectures with allmodconfig and defconfig, but I did not test any powerpc configs with PCI disabled. Arnd
Re: [0/2] powerpc/powernv/vas: Adjustments for two function implementations
>>> One function call less in vas_window_alloc() after error detection >> >> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/1f1c21cf-c34c-418c-b00c-8e6474f12...@web.de/ > > It introduced a new goto and label to avoid a kfree(NULL) call, but > kfree() explicitly accepts NULL and handles it. So it complicates the > source code for no gain. I proposed to avoid such a redundant function call for the affected exception handling. >>> Return directly after a failed kasprintf() in map_paste_region() >> >> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/f46f04bc-613c-4e98-b602-4c5120556...@web.de/ > > Basically the same reasoning. And it also changes the function from > having two return paths (success and error), to three. See also a corresponding advice once more from the section “7) Centralized exiting of functions”: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/coding-style.rst?h=v6.9-rc4#n532 Regards, Markus
Re: [0/2] powerpc/powernv/vas: Adjustments for two function implementations
Le 16/04/2024 à 14:14, Markus Elfring a écrit : >> This is explicit in Kernel documentation: >> >> /** >>* kfree - free previously allocated memory >>* @object: pointer returned by kmalloc() or kmem_cache_alloc() >>* >>* If @object is NULL, no operation is performed. >>*/ >> >> That's exactly the same behaviour as free() in libc. >> >> So Coccinelle should be fixed if it reports an error for that. > > Redundant function calls can occasionally be avoided accordingly, > can't they? Sure they can, but is that worth it here ? Christophe
Re: [0/2] powerpc/powernv/vas: Adjustments for two function implementations
> This is explicit in Kernel documentation: > > /** > * kfree - free previously allocated memory > * @object: pointer returned by kmalloc() or kmem_cache_alloc() > * > * If @object is NULL, no operation is performed. > */ > > That's exactly the same behaviour as free() in libc. > > So Coccinelle should be fixed if it reports an error for that. Redundant function calls can occasionally be avoided accordingly, can't they? Regards, Markus
Re: powerpc: io-defs.h:43:1: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
On Tue, Apr 16, 2024 at 01:55:57PM +0200, Arnd Bergmann wrote: > On Tue, Apr 16, 2024, at 13:01, Arnd Bergmann wrote: > > On Tue, Apr 16, 2024, at 11:32, Naresh Kamboju wrote: > >> The Powerpc clang builds failed due to following warnings / errors on the > >> Linux next-20240416 tag. > >> > >> Powerpc: > >> - tqm8xx_defconfig + clang-17 - Failed > >> - allnoconfig + clang-17 - Failed > >> - tinyconfig + clang-17 - Failed > >> > >> Reported-by: Linux Kernel Functional Testing > > > > I'm not sure why this showed up now, but there is a series from > > in progress that will avoid this in the future, as the same > > issue is present on a couple of other architectures. > > > > I see now, it was introduced by my patch to turn on -Wextra > by default. I had tested that patch on all architectures > with allmodconfig and defconfig, but I did not test any > powerpc configs with PCI disabled. I think this warning is clang specific as well... (Maybe clang was included in all architectures but I'm not sure). regards, dan carpenter
Re: powerpc: io-defs.h:43:1: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
On Tue, Apr 16, 2024, at 14:42, Dan Carpenter wrote: > On Tue, Apr 16, 2024 at 01:55:57PM +0200, Arnd Bergmann wrote: >> On Tue, Apr 16, 2024, at 13:01, Arnd Bergmann wrote: >> > On Tue, Apr 16, 2024, at 11:32, Naresh Kamboju wrote: >> >> The Powerpc clang builds failed due to following warnings / errors on the >> >> Linux next-20240416 tag. >> >> >> >> Powerpc: >> >> - tqm8xx_defconfig + clang-17 - Failed >> >> - allnoconfig + clang-17 - Failed >> >> - tinyconfig + clang-17 - Failed >> >> >> >> Reported-by: Linux Kernel Functional Testing >> > >> > I'm not sure why this showed up now, but there is a series from >> > in progress that will avoid this in the future, as the same >> > issue is present on a couple of other architectures. >> > >> >> I see now, it was introduced by my patch to turn on -Wextra >> by default. I had tested that patch on all architectures >> with allmodconfig and defconfig, but I did not test any >> powerpc configs with PCI disabled. > > I think this warning is clang specific as well... (Maybe clang was > included in all architectures but I'm not sure). Yes, I did test with both gcc and clang where supported. Arnd
Re: [0/2] powerpc/powernv/vas: Adjustments for two function implementations
On Tue, 16 Apr 2024, Christophe Leroy wrote: > > > Le 16/04/2024 à 14:14, Markus Elfring a écrit : > >> This is explicit in Kernel documentation: > >> > >> /** > >>* kfree - free previously allocated memory > >>* @object: pointer returned by kmalloc() or kmem_cache_alloc() > >>* > >>* If @object is NULL, no operation is performed. > >>*/ > >> > >> That's exactly the same behaviour as free() in libc. > >> > >> So Coccinelle should be fixed if it reports an error for that. > > > > Redundant function calls can occasionally be avoided accordingly, > > can't they? > > Sure they can, but is that worth it here ? Coccinelle does what the developer of the semantic patch tells it to do. It doesn't spontaneously report errors for anything. julia
Re: [0/2] powerpc/powernv/vas: Adjustments for two function implementations
So Coccinelle should be fixed if it reports an error for that. >>> >>> Redundant function calls can occasionally be avoided accordingly, >>> can't they? >> >> Sure they can, but is that worth it here ? > > Coccinelle does what the developer of the semantic patch tells it to do. > It doesn't spontaneously report errors for anything. Some special source code search and transformation patterns are occasionally applied. The corresponding change acceptance can often be challenging. Regards, Markus
Re: [0/2] powerpc/powernv/vas: Adjustments for map_paste_region()?
> Looking at that function, I however see a missing region release when > ioremap_cache() fails. Under which circumstances will the exception handling be completed also for the implementation of the function “map_paste_region”? https://elixir.bootlin.com/linux/v6.9-rc4/source/arch/powerpc/platforms/powernv/vas-window.c#L67 Regards, Markus
Re: [PATCH] i2c: mpc: Removal of of_node_put with __free for auto cleanup
Hi Chris, On Tue, Apr 16, 2024 at 03:59:13AM +, Chris Packham wrote: > On 16/04/24 08:54, Andi Shyti wrote: > >>/* Enable I2C interrupts for mpc5121 */ > >> - node_ctrl = of_find_compatible_node(NULL, NULL, > >> - "fsl,mpc5121-i2c-ctrl"); > >> + struct device_node *node_ctrl __free(device_node) = > > How have you tested this? > > I'm not sure I know anyone that still has a mpc5121. Maybe someone on > linuxppc-dev? that's why I was asking, this would be the first driver in i2c using Rob's __free(device_node). > I did try to take the patch for a spin on my T2081RDB but I'm having > some userland issues on it for some reason (unrelated to this change). > The kernel boot does discover a few peripherals hanging of the I2C > interface but I'm not in a position to offer up a Tested-by and I've run > out of time to debug why my board is unhappy. Thanks for giving it a try. Andi
Re: [PATCH] i2c: mpc: Removal of of_node_put with __free for auto cleanup
On Tue, Apr 16, 2024 at 04:07:48PM +0200, Andi Shyti wrote: > On Tue, Apr 16, 2024 at 03:59:13AM +, Chris Packham wrote: > > On 16/04/24 08:54, Andi Shyti wrote: > > >> /* Enable I2C interrupts for mpc5121 */ > > >> -node_ctrl = of_find_compatible_node(NULL, NULL, > > >> -"fsl,mpc5121-i2c-ctrl"); > > >> +struct device_node *node_ctrl __free(device_node) = > > > How have you tested this? > > > > I'm not sure I know anyone that still has a mpc5121. Maybe someone on > > linuxppc-dev? > > that's why I was asking, this would be the first driver in i2c > using Rob's __free(device_node). Jonathan's, of course :-) Andi
[PATCH] powerpc: drop port I/O helpers for CONFIG_HAS_IOPORT=n
From: Arnd Bergmann Calling inb()/outb() on powerpc when CONFIG_PCI is disabled causes a NULL pointer dereference, which is bad for a number of reasons. After my patch to turn on -Werror in linux-next, this caused a compiler-time warning with clang: In file included from arch/powerpc/include/asm/io.h:672: arch/powerpc/include/asm/io-defs.h:43:1: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic] 43 | DEF_PCI_AC_NORET(insb, (unsigned long p, void *b, unsigned long c), | ^~~ 44 | (p, b, c), pio, p) | ~~ In this configuration, CONFIG_HAS_IOPORT is already disabled, and all drivers that use inb()/outb() should now depend on that (some patches are still in the process of getting marged). Hide all references to inb()/outb() in the powerpc code and the definitions when HAS_IOPORT is disabled to remove the possible NULL pointer access. The same should happin in asm-generic in the near future, but for now the empty inb() macros are still defined to ensure the generic version does not get pulled in. Signed-off-by: Arnd Bergmann Reported-by: Naresh Kamboju -- Cc: linux-ker...@vger.kernel.org> Cc: linuxppc-dev Cc: Aneesh Kumar K.V Cc: Anders Roxell Cc: Kees Cook Cc: Niklas Schnelle Cc: clang-built-linux Cc: Nick Desaulniers Cc: Nathan Chancellor Cc: Jeff Xu Cc: Naveen N. Rao Cc: Dan Carpenter --- arch/powerpc/include/asm/dma.h | 12 arch/powerpc/include/asm/io-defs.h | 4 arch/powerpc/include/asm/io.h | 19 +++ arch/powerpc/kernel/iomap.c| 4 arch/powerpc/kernel/traps.c| 2 +- 5 files changed, 32 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/include/asm/dma.h b/arch/powerpc/include/asm/dma.h index d97c66d9ae34..004a868f82c9 100644 --- a/arch/powerpc/include/asm/dma.h +++ b/arch/powerpc/include/asm/dma.h @@ -3,6 +3,12 @@ #define _ASM_POWERPC_DMA_H #ifdef __KERNEL__ +/* The maximum address that we can perform a DMA transfer to on this platform */ +/* Doesn't really apply... */ +#define MAX_DMA_ADDRESS(~0UL) + +#ifdef CONFIG_HAS_IOPORT + /* * Defines for using and allocating dma channels. * Written by Hennus Bergman, 1992. @@ -26,10 +32,6 @@ #define MAX_DMA_CHANNELS 8 #endif -/* The maximum address that we can perform a DMA transfer to on this platform */ -/* Doesn't really apply... */ -#define MAX_DMA_ADDRESS(~0UL) - #ifdef HAVE_REALLY_SLOW_DMA_CONTROLLER #define dma_outb outb_p #else @@ -340,5 +342,7 @@ extern int request_dma(unsigned int dmanr, const char *device_id); /* release it again */ extern void free_dma(unsigned int dmanr); +#endif + #endif /* __KERNEL__ */ #endif /* _ASM_POWERPC_DMA_H */ diff --git a/arch/powerpc/include/asm/io-defs.h b/arch/powerpc/include/asm/io-defs.h index faf8617cc574..8d2209af7759 100644 --- a/arch/powerpc/include/asm/io-defs.h +++ b/arch/powerpc/include/asm/io-defs.h @@ -20,12 +20,14 @@ DEF_PCI_AC_NORET(writeq, (u64 val, PCI_IO_ADDR addr), (val, addr), mem, addr) DEF_PCI_AC_NORET(writeq_be, (u64 val, PCI_IO_ADDR addr), (val, addr), mem, addr) #endif /* __powerpc64__ */ +#ifdef CONFIG_HAS_IOPORT DEF_PCI_AC_RET(inb, u8, (unsigned long port), (port), pio, port) DEF_PCI_AC_RET(inw, u16, (unsigned long port), (port), pio, port) DEF_PCI_AC_RET(inl, u32, (unsigned long port), (port), pio, port) DEF_PCI_AC_NORET(outb, (u8 val, unsigned long port), (val, port), pio, port) DEF_PCI_AC_NORET(outw, (u16 val, unsigned long port), (val, port), pio, port) DEF_PCI_AC_NORET(outl, (u32 val, unsigned long port), (val, port), pio, port) +#endif DEF_PCI_AC_NORET(readsb, (const PCI_IO_ADDR a, void *b, unsigned long c), (a, b, c), mem, a) @@ -40,6 +42,7 @@ DEF_PCI_AC_NORET(writesw, (PCI_IO_ADDR a, const void *b, unsigned long c), DEF_PCI_AC_NORET(writesl, (PCI_IO_ADDR a, const void *b, unsigned long c), (a, b, c), mem, a) +#ifdef CONFIG_HAS_IOPORT DEF_PCI_AC_NORET(insb, (unsigned long p, void *b, unsigned long c), (p, b, c), pio, p) DEF_PCI_AC_NORET(insw, (unsigned long p, void *b, unsigned long c), @@ -52,6 +55,7 @@ DEF_PCI_AC_NORET(outsw, (unsigned long p, const void *b, unsigned long c), (p, b, c), pio, p) DEF_PCI_AC_NORET(outsl, (unsigned long p, const void *b, unsigned long c), (p, b, c), pio, p) +#endif DEF_PCI_AC_NORET(memset_io, (PCI_IO_ADDR a, int c, unsigned long n), (a, c, n), mem, a) diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h index 08c550ed49be..86c212fcbc0c 100644 --- a/arch/powerpc/include/asm/io.h +++ b/arch/powerpc/include/asm/io.h @@ -37,7 +37,6 @@ extern struct pci_dev *isa_bridge_pcidev; * define properly based on the platform */ #ifndef CONFIG_PCI -#define _IO_BASE
[PATCH 0/5 RESEND] mm: code and data partitioning improvements
Managing allocations to ensure code and data pages are not interleaved is not possible prior to this patch, as ASLR requires programming a dynamic _text offset while the vmalloc infrastructure maintains static VMALLOC_START and VMALLOC_END constants. In systems where code and data are interleaved at a PTE granularity, kernel improvements targeting the prevention of exploit stages which modify page tables are inefficient and less effective as individual PTE updates occur at high frequency and cannot be coarsely grouped at the PMD level or greater. This patch adds minimal arch-specific callbacks to the initialization of vmalloc and when deciding whether to use a specific virtual memory area to satisfy a vmalloc request to provide the capability to prevent the allocation of specific virtual addresses under specific system states. By default these hooks are unimplemented. To further support the practical use of these callbacks, this patch also adds a virtual address parameter to pmd_populate_kernel, so that this interface matches the equivalent pte-level interface and architectures are not required to perform a reverse page table lookup to determine the vaddr being allocated during pmd creation. To demonstrate the impact and value of these changes, this patch implements support for dynamic PXNTable under aarch64 in 71 lines of code (a single "if" check during memory allocation), by checking the virtual address of a given vmalloc call to determine whether it is code or data. From experience in trying to implement kernel page table immutability and protections in KVM to prevent recent CVEs, e.g. CVE-2024-1086, this is a necessary first step. To better help maintainers and future developers, this patch expands ptdump.c so that non-leaf page table descriptors can be more easily noted in debug output by setting a note_non_leaf bool in the ptdump state. Signed-off-by: Maxwell Bland --- Zero-eth, apologies for the triple mail of these patches. I am in the process of setting up a new SMTP/mail server for Motorola, but until then I've needed to script the raw SMTP in order to send appropriately formatted patch emails. First, thank you to a number of maintainers (Mark Rutland, Greg KH, Christoph Hellwig, Christophe Leroy, David Hildenbrand, Conor Dooley) for their feedback on <20240220203256.31153-1-mbl...@motorola.com> and This patch is a further refinement and overhaul of these prior two attempts. Also, apologies for the roughly two months delay between patch submissions! I had Motorola work to do. In support of testing this patch (but not included in this patch), I set note_non_leaf to true under arch/arm64/mm/ptdump.c and added PMD_TABLE_PXN to pte_bits to print out whether the PXNTable bit was set. The txt files under the following directory can be diff'ed to see the result: github.com/maxwell-bland/linux-patch-data/tree/main/code_data_parting/ptdump I also created a script to fetch and cross-compile the kernel for each of the 21 subarchitectures which required fixes to provide a virtual address to pmd_populate_kernel. I have no idea if it is useful and maybe one already exists, but it worked well for me over some alternatives (xcross, buildroot): github.com/maxwell-bland/x-linux As with the last patchset, I also measured performance using Torvald's test-tlb program on an aarch64 QEMU instance, with results here: github.com/maxwell-bland/linux-patch-data/tree/main/code_data_parting/tlbperf As all changes to other arches are effectively no-ops, performance impacts in those domains are negligible. Maxwell Bland (5): mm: allow arch refinement/skip for vmap alloc arm64: mm: code and data partitioning for aslr mm: add vaddr param to pmd_populate_kernel arm64: dynamic enforcement of PXNTable ptdump: add state parameter for non-leaf callback arch/alpha/include/asm/pgalloc.h | 5 +- arch/arc/include/asm/pgalloc.h | 3 +- arch/arc/mm/highmem.c| 2 +- arch/arm/include/asm/kfence.h| 2 +- arch/arm/include/asm/pgalloc.h | 3 +- arch/arm/mm/kasan_init.c | 2 +- arch/arm/mm/mmu.c| 2 +- arch/arm64/include/asm/module.h | 12 arch/arm64/include/asm/pgalloc.h | 15 - arch/arm64/include/asm/vmalloc.h | 17 - arch/arm64/kernel/Makefile | 2 +- arch/arm64/kernel/module.c | 7 +- arch/arm64/kernel/probes/kprobes.c | 7 +- arch/arm64/kernel/setup.c| 4 ++ arch/arm64/kernel/vmalloc.c | 71 arch/arm64/mm/ptdump.c | 10 +-- arch/arm64/mm/trans_pgd.c| 2 +- arch/arm64/net/bpf_jit_comp.c| 8 ++- arch/csky/include/asm/pgalloc.h | 2 +- arch/hexagon/include/asm/pgalloc.h | 2 +- arch/loongarch/include/asm/pgalloc.h | 3 +- arch/l
[PATCH 5/5 RESEND] ptdump: add state parameter for non-leaf callback
ptdump can now note non-leaf descriptor entries, a useful addition for debugging table descriptor permissions when working on related code Signed-off-by: Maxwell Bland --- arch/arm64/mm/ptdump.c | 6 -- arch/powerpc/mm/ptdump/ptdump.c | 2 ++ arch/riscv/mm/ptdump.c | 6 -- arch/s390/mm/dump_pagetables.c | 6 -- arch/x86/mm/dump_pagetables.c | 3 ++- include/linux/ptdump.h | 1 + mm/ptdump.c | 13 + 7 files changed, 30 insertions(+), 7 deletions(-) diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c index 796231a4fd63..1a6f4a3513e5 100644 --- a/arch/arm64/mm/ptdump.c +++ b/arch/arm64/mm/ptdump.c @@ -299,7 +299,8 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info *info) .range = (struct ptdump_range[]){ {info->base_addr, end}, {0, 0} - } + }, + .note_non_leaf = false } }; @@ -335,7 +336,8 @@ bool ptdump_check_wx(void) .range = (struct ptdump_range[]) { {_PAGE_OFFSET(vabits_actual), ~0UL}, {0, 0} - } + }, + .note_non_leaf = false } }; diff --git a/arch/powerpc/mm/ptdump/ptdump.c b/arch/powerpc/mm/ptdump/ptdump.c index 9dc239967b77..89e673f5fd3d 100644 --- a/arch/powerpc/mm/ptdump/ptdump.c +++ b/arch/powerpc/mm/ptdump/ptdump.c @@ -307,6 +307,7 @@ static int ptdump_show(struct seq_file *m, void *v) .ptdump = { .note_page = note_page, .range = ptdump_range, + .note_non_leaf = false } }; @@ -340,6 +341,7 @@ bool ptdump_check_wx(void) .ptdump = { .note_page = note_page, .range = ptdump_range, + .note_non_leaf = false } }; diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c index 1289cc6d3700..b355633afcaf 100644 --- a/arch/riscv/mm/ptdump.c +++ b/arch/riscv/mm/ptdump.c @@ -328,7 +328,8 @@ static void ptdump_walk(struct seq_file *s, struct ptd_mm_info *pinfo) .range = (struct ptdump_range[]) { {pinfo->base_addr, pinfo->end}, {0, 0} - } + }, + .note_non_leaf = false } }; @@ -350,7 +351,8 @@ bool ptdump_check_wx(void) .range = (struct ptdump_range[]) { {KERN_VIRT_START, ULONG_MAX}, {0, 0} - } + }, + .note_non_leaf = false } }; diff --git a/arch/s390/mm/dump_pagetables.c b/arch/s390/mm/dump_pagetables.c index ffd07ed7b4af..6468cfd53e2a 100644 --- a/arch/s390/mm/dump_pagetables.c +++ b/arch/s390/mm/dump_pagetables.c @@ -200,7 +200,8 @@ bool ptdump_check_wx(void) .range = (struct ptdump_range[]) { {.start = 0, .end = max_addr}, {.start = 0, .end = 0}, - } + }, + .note_non_leaf = false }, .seq = NULL, .level = -1, @@ -239,7 +240,8 @@ static int ptdump_show(struct seq_file *m, void *v) .range = (struct ptdump_range[]) { {.start = 0, .end = max_addr}, {.start = 0, .end = 0}, - } + }, + .note_non_leaf = false }, .seq = m, .level = -1, diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c index 89079ea73e65..43f00dfb955f 100644 --- a/arch/x86/mm/dump_pagetables.c +++ b/arch/x86/mm/dump_pagetables.c @@ -380,7 +380,8 @@ bool ptdump_walk_pgd_level_core(struct seq_file *m, .ptdump = { .note_page = note_page, .effective_prot = effective_prot, - .range = ptdump_ranges + .range = ptdump_ranges, + .note_non_leaf = false }, .level = -1, .to_dmesg = dmesg, diff --git a/include/linux/ptdump.h b/include/linux/ptdump.h index 8dbd51ea8626..b3e793a5c77f 100644 --- a/include/linux/ptdump.h +++ b/include/linux/ptdump.h @@ -16,6 +16,7 @@ struct ptdump_state { int level, u64 val); void (*effective_prot)(struct ptdump_state *st, int level, u64 val);
Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx
On Tue, Apr 16, 2024 at 10:58:33AM +, Christophe Leroy wrote: > > > Le 15/04/2024 à 21:12, Christophe Leroy a écrit : > > > > > > Le 12/04/2024 à 16:30, Peter Xu a écrit : > >> On Fri, Apr 12, 2024 at 02:08:03PM +, Christophe Leroy wrote: > >>> > >>> > >>> Le 11/04/2024 à 18:15, Peter Xu a écrit : > On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote: > > On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote: > >> This series reimplements hugepages with hugepd on powerpc 8xx. > >> > >> Unlike most architectures, powerpc 8xx HW requires a two-level > >> pagetable topology for all page sizes. So a leaf PMD-contig approach > >> is not feasible as such. > >> > >> Possible sizes are 4k, 16k, 512k and 8M. > >> > >> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD > >> entries > >> must point to a single entry level-2 page table. Until now that was > >> done using hugepd. This series changes it to use standard page tables > >> where the entry is replicated 1024 times on each of the two > >> pagetables > >> refered by the two associated PMD entries for that 8M page. > >> > >> At the moment it has to look into each helper to know if the > >> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or > >> a lower size. I hope this can me handled by core-mm in the future. > >> > >> There are probably several ways to implement stuff, so feedback is > >> very welcome. > > > > I thought it looks pretty good! > > I second it. > > I saw the discussions in patch 1. Christophe, I suppose you're > exploring > the big hammer over hugepd, and perhaps went already with the 32bit pmd > solution for nohash/32bit challenge you mentioned? > > I'm trying to position my next step; it seems like at least I should > not > adding any more hugepd code, then should I go with ARCH_HAS_HUGEPD > checks, > or you're going to have an RFC soon then I can base on top? > >>> > >>> Depends on what you expect by "soon". > >>> > >>> I sure won't be able to send any RFC before end of April. > >>> > >>> Should be possible to have something during May. > >> > >> That's good enough, thanks. I'll see what is the best I can do. > >> > >> Then do you think I can leave p4d/pgd leaves alone? Please check the > >> other > >> email where I'm not sure whether pgd leaves ever existed for any of > >> PowerPC. That's so far what I plan to do, on teaching pgtable walkers > >> recognize pud and lower for all leaves. Then if Power can switch from > >> hugepd to this it should just work. > > > > Well, if I understand correctly, something with no PMD will include > > and will therefore only have pmd > > entries (hence no pgd/p4d/pud entries). Looks odd but that's what it is. > > pgd_populate(), p4d_populate(), pud_populate() are all "do { } while > > (0)" and only pmd_populate exists. So only pmd_leaf() will exist in that > > case. > > > > And therefore including means you > > have p4d entries. Doesn't mean you have p4d_leaf() but that needs to be > > checked. > > > > > >> > >> Even if pgd exists (then something I overlooked..), I'm wondering whether > >> we can push that downwards to be either pud/pmd (and looks like we all > >> agree p4d is never used on Power). That may involve some pgtable > >> operations moving from pgd level to lower, e.g. my pure imagination would > >> look like starting with: > > > > Yes I think there is no doubt that p4d is never used: > > > > arch/powerpc/include/asm/book3s/32/pgtable.h:#include > > > > arch/powerpc/include/asm/book3s/64/pgtable.h:#include > > > > arch/powerpc/include/asm/nohash/32/pgtable.h:#include > > > > arch/powerpc/include/asm/nohash/64/pgtable-4k.h:#include > > > > > > But that means that PPC32 have pmd entries and PPC64 have p4d entries ... > > > >> > >> #define PTE_INDEX_SIZE PTE_SHIFT > >> #define PMD_INDEX_SIZE 0 > >> #define PUD_INDEX_SIZE 0 > >> #define PGD_INDEX_SIZE (32 - PGDIR_SHIFT) > >> > >> To: > >> > >> #define PTE_INDEX_SIZE PTE_SHIFT > >> #define PMD_INDEX_SIZE (32 - PMD_SHIFT) > >> #define PUD_INDEX_SIZE 0 > >> #define PGD_INDEX_SIZE 0 > > > > But then you can't anymore have #define PTRS_PER_PMD 1 from > > > > > >> > >> And the rest will need care too. I hope moving downward is easier > >> (e.g. the walker should always exist for lower levels but not always for > >> higher levels), but I actually have little idea on whether there's any > >> other implications, so please bare with me on stupid mistakes. > >> > >> I just hope pgd leaves don't exist already, then I think it'll be > >> simpler. > >> > >> Thanks, > >> > > Digging into asm-generic/pgtable-nopmd.h, I see a definition of > pud_leaf() always returning 0, introduced by commit 2c8a81dc0cc5 > ("riscv/mm: fix two page table check related issues"
Re: [PATCH 5/5] ptdump: add state parameter for non-leaf callback
On Mon, 15 Apr 2024 14:51:32 -0500 Maxwell Bland wrote: > ptdump can now note non-leaf descriptor entries, a useful addition for > debugging table descriptor permissions when working on related code > > Signed-off-by: Maxwell Bland > --- > arch/arm64/mm/ptdump.c | 6 -- > arch/powerpc/mm/ptdump/ptdump.c | 2 ++ > arch/riscv/mm/ptdump.c | 6 -- > arch/s390/mm/dump_pagetables.c | 6 -- > arch/x86/mm/dump_pagetables.c | 3 ++- > include/linux/ptdump.h | 1 + > mm/ptdump.c | 13 + > 7 files changed, 30 insertions(+), 7 deletions(-) > > diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c > index 796231a4fd63..1a6f4a3513e5 100644 > --- a/arch/arm64/mm/ptdump.c > +++ b/arch/arm64/mm/ptdump.c > @@ -299,7 +299,8 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info > *info) > .range = (struct ptdump_range[]){ > {info->base_addr, end}, > {0, 0} > - } > + }, > + .note_non_leaf = false > } It would be acceptable to omit all of these and rely upon the runtime zeroing which the compiler will emit. Documentation/arch/arm64/ptdump.rst might need updating. Please include sample output in the changelog so we can better understand the user's view of this change.
[PATCH] powerpc/pseries/iommu: LPAR panics when rebooted with a frozen PE
At the time of LPAR reboot, partition firmware provides Open Firmware property ibm,dma-window for the PE. This property is provided on the PCI bus the PE is attached to. There are execptions where the partition firmware might not provide this property for the PE at the time of LPAR reboot. One of the scenario is where the firmware has frozen the PE due to some error conditions. This PE is frozen for 24 hours or unless the whole system is reinitialized. Within this time frame, if the LPAR is rebooted, the frozen PE will be presented to the LPAR but ibm,dma-window property could be missing. Today, under these circumstances, the LPAR oopses with NULL pointer dereference, when configuring the PCI bus the PE is attached to. BUG: Kernel NULL pointer dereference on read at 0x00c8 Faulting instruction address: 0xc01024c0 Oops: Kernel access of bad area, sig: 7 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries Modules linked in: Supported: Yes CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.4.0-150600.9-default #1 Hardware name: IBM,9043-MRX POWER10 (raw) 0x800200 0xf06 of:IBM,FW1060.00 (NM1060_023) hv:phyp pSeries NIP: c01024c0 LR: c01024b0 CTR: c0102450 REGS: c37db5c0 TRAP: 0300 Not tainted (6.4.0-150600.9-default) MSR: 82009033 CR: 28000822 XER: CFAR: c010254c DAR: 00c8 DSISR: 0008 IRQMASK: 0 ... NIP [c01024c0] pci_dma_bus_setup_pSeriesLP+0x70/0x2a0 LR [c01024b0] pci_dma_bus_setup_pSeriesLP+0x60/0x2a0 Call Trace: pci_dma_bus_setup_pSeriesLP+0x60/0x2a0 (unreliable) pcibios_setup_bus_self+0x1c0/0x370 __of_scan_bus+0x2f8/0x330 pcibios_scan_phb+0x280/0x3d0 pcibios_init+0x88/0x12c do_one_initcall+0x60/0x320 kernel_init_freeable+0x344/0x3e4 kernel_init+0x34/0x1d0 ret_from_kernel_user_thread+0x14/0x1c Fixes: b1fc44eaa9ba ("pseries/iommu/ddw: Fix kdump to work in absence of ibm,dma-window") Signed-off-by: Gaurav Batra --- arch/powerpc/platforms/pseries/iommu.c | 8 1 file changed, 8 insertions(+) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index e8c4129697b1..e808d5b1fa49 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -786,8 +786,16 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus) * parent bus. During reboot, there will be ibm,dma-window property to * define DMA window. For kdump, there will at least be default window or DDW * or both. +* There is an exception to the above. In case the PE goes into frozen +* state, firmware may not provide ibm,dma-window property at the time +* of LPAR reboot. */ + if (!pdn) { + pr_debug(" no ibm,dma-window property !\n"); + return; + } + ppci = PCI_DN(pdn); pr_debug(" parent is %pOF, iommu_table: 0x%p\n", base-commit: 2c71fdf02a95b3dd425b42f28fd47fb2b1d22702 -- 2.39.3 (Apple Git-146)
Re: [PATCH 5/5] ptdump: add state parameter for non-leaf callback
> On Tuesday, April 16, 2024 3:11 PM, Andrew Morton wrote: > On Mon, 15 Apr 2024 14:51:32 -0500 Maxwell Bland > wrote: > > > arch/arm64/mm/ptdump.c | 6 -- > > arch/powerpc/mm/ptdump/ptdump.c | 2 ++ > > arch/riscv/mm/ptdump.c | 6 -- > > arch/s390/mm/dump_pagetables.c | 6 -- > > arch/x86/mm/dump_pagetables.c | 3 ++- > > include/linux/ptdump.h | 1 + > > mm/ptdump.c | 13 + > > 7 files changed, 30 insertions(+), 7 deletions(-) > > > > diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c > > index 796231a4fd63..1a6f4a3513e5 100644 > > --- a/arch/arm64/mm/ptdump.c > > +++ b/arch/arm64/mm/ptdump.c > > @@ -299,7 +299,8 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info > *info) > > .range = (struct ptdump_range[]){ > > {info->base_addr, end}, > > {0, 0} > > - } > > + }, > > + .note_non_leaf = false > > } > > It would be acceptable to omit all of these and rely upon the runtime > zeroing which the compiler will emit. Ah, thank you for the pointer to C99 6.7.8.21. I had always figured since structs are stack allocated they are potentially non-initialized! > Documentation/arch/arm64/ptdump.rst might need updating. > > Please include sample output in the changelog so we can better > understand the user's view of this change. Thanks, I will do both in the next few days everything permitting! Right now this patch results in no change until note_non_leaf = true is adopted for each arch. My plan: I will polish then include output of my personal fixes for arm64. Specifically, printing expanded PMD flags and tab indenting the layout according to each level. Hopefully just adding arm64 support for now is OK, unless maybe we want to default this to true on all arches? IMO default true would be sweet, but I wasn't sure everyone would agree. BRs, Maxwell Bland
[PATCH 3/5] mm: add vaddr param to pmd_populate_kernel
This patch affords each architecture the ability to condition the population of page middle directory entries on the virtual address being allocated, matching existing PTE infrastructure, easing the necessity of performing a reverse page table walk in cases where the population context is not readily accessible, i.e. dynamic vmalloc calls on arm64. To achieve this goal, it modifies every call and implementation of the pmd_populate_kernel function across architectures, ensuring uniform adoption across all kernel deployments. Signed-off-by: Maxwell Bland --- Hi all, Thank you for taking the time to review this change. This effects many subarchitectures so the maintainers list is large. Apologies in advance if there is a specific maintainer I should have spoken with directly for deployment across subprojects. The reason for such a sweeping change is from lore.kernel.org/all/cf5409c3-254a-459b-8969-429db2ec6...@redhat.com It is my understanding as well that some subarchitectures may have separate "next" or development branches ahead of the main upstream linux. Please let me know if a cherry-pick to that branch is desired and I will do my best to check out and deploy it as possible. arch/alpha/include/asm/pgalloc.h | 5 +++-- arch/arc/include/asm/pgalloc.h | 3 ++- arch/arc/mm/highmem.c| 2 +- arch/arm/include/asm/kfence.h| 2 +- arch/arm/include/asm/pgalloc.h | 3 ++- arch/arm/mm/kasan_init.c | 2 +- arch/arm/mm/mmu.c| 2 +- arch/arm64/include/asm/pgalloc.h | 3 ++- arch/arm64/mm/trans_pgd.c| 2 +- arch/csky/include/asm/pgalloc.h | 2 +- arch/hexagon/include/asm/pgalloc.h | 2 +- arch/loongarch/include/asm/pgalloc.h | 3 ++- arch/loongarch/mm/init.c | 2 +- arch/loongarch/mm/kasan_init.c | 2 +- arch/m68k/include/asm/mcf_pgalloc.h | 2 +- arch/m68k/include/asm/motorola_pgalloc.h | 3 ++- arch/m68k/include/asm/sun3_pgalloc.h | 3 ++- arch/microblaze/include/asm/pgalloc.h| 2 +- arch/mips/include/asm/pgalloc.h | 2 +- arch/mips/kvm/mmu.c | 2 +- arch/nios2/include/asm/pgalloc.h | 2 +- arch/openrisc/include/asm/pgalloc.h | 2 +- arch/parisc/include/asm/pgalloc.h| 5 +++-- arch/parisc/mm/init.c| 6 +++--- arch/powerpc/include/asm/book3s/32/pgalloc.h | 2 +- arch/powerpc/include/asm/book3s/64/pgalloc.h | 2 +- arch/powerpc/include/asm/nohash/32/pgalloc.h | 2 +- arch/powerpc/include/asm/nohash/64/pgalloc.h | 2 +- arch/powerpc/mm/book3s64/radix_pgtable.c | 2 +- arch/powerpc/mm/kasan/init_32.c | 4 ++-- arch/powerpc/mm/kasan/init_book3e_64.c | 9 ++--- arch/powerpc/mm/kasan/init_book3s_64.c | 7 +-- arch/powerpc/mm/nohash/book3e_pgtable.c | 2 +- arch/powerpc/mm/pgtable_32.c | 4 ++-- arch/riscv/include/asm/pgalloc.h | 2 +- arch/riscv/kernel/hibernate.c| 2 +- arch/s390/include/asm/pgalloc.h | 2 +- arch/sh/include/asm/pgalloc.h| 2 +- arch/sh/mm/init.c| 2 +- arch/sparc/include/asm/pgalloc_32.h | 3 ++- arch/sparc/include/asm/pgalloc_64.h | 4 ++-- arch/sparc/mm/init_64.c | 8 arch/um/include/asm/pgalloc.h| 4 ++-- arch/x86/include/asm/pgalloc.h | 3 ++- arch/x86/mm/init_64.c| 14 +++--- arch/x86/mm/ioremap.c| 2 +- arch/x86/mm/kasan_init_64.c | 2 +- arch/xtensa/include/asm/pgalloc.h| 2 +- include/linux/mm.h | 4 ++-- mm/hugetlb_vmemmap.c | 4 ++-- mm/kasan/init.c | 14 +- mm/memory.c | 4 ++-- mm/percpu.c | 2 +- mm/pgalloc-track.h | 3 ++- mm/sparse-vmemmap.c | 2 +- 55 files changed, 107 insertions(+), 78 deletions(-) diff --git a/arch/alpha/include/asm/pgalloc.h b/arch/alpha/include/asm/pgalloc.h index 68be7adbfe58..1d3d86cad3cc 100644 --- a/arch/alpha/include/asm/pgalloc.h +++ b/arch/alpha/include/asm/pgalloc.h @@ -7,7 +7,7 @@ #include -/* +/* * Allocate and free page tables. The xxx_kernel() versions are * used to allocate a kernel page table - this turns on ASN bits * if any. @@ -20,7 +20,8 @@ pmd_populate(struct mm_struct *mm, pmd_t *pmd, pgtable_t pte) } static inline void -pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte) +pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte, + unsigned long vaddr) { pmd_set(pmd, p
[no subject]
linux-ker...@vger.kernel.org, linux-arm-ker...@lists.infradead.org, linux-ri...@lists.infradead.org, linuxppc-dev@lists.ozlabs.org, Mark Rutland , Greg Kroah-Hartman , Christoph Hellwig , Christophe Leroy , David Hildenbrand , Conor Dooley From: Maxwell Bland Date: Mon, 15 Apr 2024 15:16:08 -0500 Subject: [PATCH 0/5] mm: code and data partitioning improvements Managing allocations to ensure code and data pages are not interleaved is not possible prior to this patch, as ASLR requires programming a dynamic _text offset while the vmalloc infrastructure maintains static VMALLOC_START and VMALLOC_END constants. In systems where code and data are interleaved at a PTE granularity, kernel improvements targeting the prevention of exploit stages which modify page tables are inefficient and less effective as individual PTE updates occur at high frequency and cannot be coarsely grouped at the PMD level or greater. This patch adds minimal arch-specific callbacks to the initialization of vmalloc and when deciding whether to use a specific virtual memory area to satisfy a vmalloc request to provide the capability to prevent the allocation of specific virtual addresses under specific system states. By default these hooks are unimplemented. To further support the practical use of these callbacks, this patch also adds a virtual address parameter to pmd_populate_kernel, so that this interface matches the equivalent pte-level interface and architectures are not required to perform a reverse page table lookup to determine the vaddr being allocated during pmd creation. To demonstrate the impact and value of these changes, this patch implements support for dynamic PXNTable under aarch64 in 71 lines of code (a single "if" check during memory allocation), by checking the virtual address of a given vmalloc call to determine whether it is code or data. From experience in trying to implement kernel page table immutability and protections in KVM to prevent recent CVEs, e.g. CVE-2024-1086, this is a necessary first step. To better help maintainers and future developers, this patch expands ptdump.c so that non-leaf page table descriptors can be more easily noted in debug output by setting a note_non_leaf bool in the ptdump state. Signed-off-by: Maxwell Bland --- First, thank you to a number of maintainers (Mark Rutland, Greg KH, Christoph Hellwig, Christophe Leroy, David Hildenbrand, Conor Dooley) for their feedback on <20240220203256.31153-1-mbl...@motorola.com> and This patch is a further refinement and overhaul of these prior two attempts. Also, apologies for the roughly two months delay between patch submissions! I had Motorola work to do. In support of testing this patch (but not included in this patch), I set note_non_leaf to true under arch/arm64/mm/ptdump.c and added PMD_TABLE_PXN to pte_bits to print out whether the PXNTable bit was set. The txt files under the following directory can be diff'ed to see the result: github.com/maxwell-bland/linux-patch-data/tree/main/code_data_parting/ptdump I also created a script to fetch and cross-compile the kernel for each of the 21 subarchitectures which required fixes to provide a virtual address to pmd_populate_kernel. I have no idea if it is useful and maybe one already exists, but it worked well for me over some alternatives (xcross, buildroot): github.com/maxwell-bland/x-linux As with the last patchset, I also measured performance using Torvald's test-tlb program on an aarch64 QEMU instance, with results here: github.com/maxwell-bland/linux-patch-data/tree/main/code_data_parting/tlbperf As all changes to other arches are effectively no-ops, performance impacts in those domains are negligible. Maxwell Bland (5): mm: allow arch refinement/skip for vmap alloc arm64: mm: code and data partitioning for aslr mm: add vaddr param to pmd_populate_kernel arm64: dynamic enforcement of PXNTable ptdump: add state parameter for non-leaf callback arch/alpha/include/asm/pgalloc.h | 5 +- arch/arc/include/asm/pgalloc.h | 3 +- arch/arc/mm/highmem.c| 2 +- arch/arm/include/asm/kfence.h| 2 +- arch/arm/include/asm/pgalloc.h | 3 +- arch/arm/mm/kasan_init.c | 2 +- arch/arm/mm/mmu.c| 2 +- arch/arm64/include/asm/module.h | 12 arch/arm64/include/asm/pgalloc.h | 15 - arch/arm64/include/asm/vmalloc.h | 17 - arch/arm64/kernel/Makefile | 2 +- arch/arm64/kernel/module.c | 7 +- arch/arm64/kernel/probes/kprobes.c | 7 +- arch/arm64/kernel/setup.c| 4 ++ arch/arm64/kernel/vmalloc.c | 71 arch/arm64/mm/ptdump.c | 10 +-- arch/arm64/mm/trans_pgd.c| 2 +- arch/arm64/net/bpf_jit_comp.c| 8 ++- arch/csky/include/asm/pgalloc.h