[PATCH v8 3/3] Documentation/powerpc: update fadump implementation details
The patch titled ("powerpc: make fadump resilient with memory add/remove events") has made significant changes to the implementation of fadump, particularly to elfcorehdr creation and the fadump crash info header structure. Therefore, update the fadump documentation to reflect those changes. The following updates are made to the firmware-assisted dump documentation: 1. The elfcorehdr is no longer stored after the fadump HDR in the reserved dump area. Instead, the second kernel dynamically allocates memory for the elfcorehdr within the address range from 0 to the boot memory size. Therefore, update figures 1 and 2 of Memory Reservation during the first and second kernels to reflect this change. 2. A version field has been added to the fadump header to manage future changes to the fadump crash info header structure without changing the fadump header magic number. Therefore, remove the corresponding TODO from the document. Signed-off-by: Sourabh Jain Cc: Aditya Gupta Cc: Aneesh Kumar K.V Cc: Hari Bathini Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Naveen N Rao --- .../arch/powerpc/firmware-assisted-dump.rst | 91 +-- 1 file changed, 42 insertions(+), 49 deletions(-) diff --git a/Documentation/arch/powerpc/firmware-assisted-dump.rst b/Documentation/arch/powerpc/firmware-assisted-dump.rst index e363fc48529a..7e37aadd1f77 100644 --- a/Documentation/arch/powerpc/firmware-assisted-dump.rst +++ b/Documentation/arch/powerpc/firmware-assisted-dump.rst @@ -134,12 +134,12 @@ that are run. If there is dump data, then the memory is held. If there is no waiting dump data, then only the memory required to -hold CPU state, HPTE region, boot memory dump, FADump header and -elfcore header, is usually reserved at an offset greater than boot -memory size (see Fig. 1). This area is *not* released: this region -will be kept permanently reserved, so that it can act as a receptacle -for a copy of the boot memory content in addition to CPU state and -HPTE region, in the case a crash does occur. +hold CPU state, HPTE region, boot memory dump, and FADump header is +usually reserved at an offset greater than boot memory size (see Fig. 1). +This area is *not* released: this region will be kept permanently +reserved, so that it can act as a receptacle for a copy of the boot +memory content in addition to CPU state and +HPTE region, in the case +a crash does occur. Since this reserved memory area is used only after the system crash, there is no point in blocking this significant chunk of memory from @@ -153,22 +153,22 @@ that were present in CMA region:: o Memory Reservation during first kernel - Low memory Top of memory - 0 boot memory size |<--- Reserved dump area --->| | - | | |Permanent Reservation | | - V V || V - +---+-/ /---+---++---+-+-++--+ - | | |///|| DUMP | HDR | ELF || | - +---+-/ /---+---++---+-+-++--+ -| ^^ ^ ^ ^ -| || | | | -\ CPU HPTE / | | - -- | | - Boot memory content gets transferred| | - to reserved area by firmware at the | | - time of crash. | | - FADump Header | - (meta area)| + Low memory Top of memory + 0 boot memory size |<-- Reserved dump area ->| | + | | | Permanent Reservation | | + V V | | V + +---+-/ /---+---++---+---++-+ + | | |///||DUMP | HDR || | + +---+-/ /---+---++---+---++-+ +| ^^ ^ ^ ^ +| || | | | +\ CPU HPTE / | | + | | + Boot memory content gets transferred | | + to reserved area by firmware at the | | + time of crash. | | + FADump Header | +(meta area) | | | Metadata: This area holds a metadata structure whose @@ -186,13 +186,20 @@ that were present in CMA
[PATCH v8 2/3] powerpc/fadump: add hotplug_ready sysfs interface
The elfcorehdr describes the CPUs and memory of the crashed kernel to the kernel that captures the dump, known as the second or fadump kernel. The elfcorehdr needs to be updated if the system's memory changes due to memory hotplug or online/offline events. Currently, memory hotplug events are monitored in userspace by udev rules, and fadump is re-registered, which recreates the elfcorehdr with the latest available memory in the system. However, the previous patch ("powerpc: make fadump resilient with memory add/remove events") moved the creation of the elfcorehdr to the second or fadump kernel. This eliminates the need to regenerate the elfcorehdr during memory hotplug or online/offline events. Create a sysfs entry at /sys/kernel/fadump/hotplug_ready to let userspace know that fadump re-registration is not required for memory add/remove events. Signed-off-by: Sourabh Jain Cc: Aditya Gupta Cc: Aneesh Kumar K.V Cc: Hari Bathini Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Naveen N Rao --- Documentation/ABI/testing/sysfs-kernel-fadump | 11 +++ arch/powerpc/kernel/fadump.c | 14 ++ 2 files changed, 25 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-kernel-fadump b/Documentation/ABI/testing/sysfs-kernel-fadump index 8f7a64a81783..e2786a3db8dd 100644 --- a/Documentation/ABI/testing/sysfs-kernel-fadump +++ b/Documentation/ABI/testing/sysfs-kernel-fadump @@ -38,3 +38,14 @@ Contact: linuxppc-dev@lists.ozlabs.org Description: read only Provide information about the amount of memory reserved by FADump to save the crash dump in bytes. + +What: /sys/kernel/fadump/hotplug_ready +Date: Feb 2024 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + The kdump udev rule re-registers fadump on memory add/remove events, + primarily to update the elfcorehdr. This sysfs node indicates to the + kdump udev rule that fadump re-registration is not required on + memory add/remove events because the elfcorehdr is now prepared in + the second/fadump kernel. +User: kexec-tools diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c index a55ad8514745..6478820a2038 100644 --- a/arch/powerpc/kernel/fadump.c +++ b/arch/powerpc/kernel/fadump.c @@ -1444,6 +1444,18 @@ static ssize_t enabled_show(struct kobject *kobj, return sprintf(buf, "%d\n", fw_dump.fadump_enabled); } +/* + * /sys/kernel/fadump/hotplug_ready sysfs node returns 1, which indicates + * to userspace that fadump re-registration is not required on memory + * hotplug events. + */ +static ssize_t hotplug_ready_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + return sprintf(buf, "%d\n", 1); +} + static ssize_t mem_reserved_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) @@ -1516,11 +1528,13 @@ static struct kobj_attribute release_attr = __ATTR_WO(release_mem); static struct kobj_attribute enable_attr = __ATTR_RO(enabled); static struct kobj_attribute register_attr = __ATTR_RW(registered); static struct kobj_attribute mem_reserved_attr = __ATTR_RO(mem_reserved); +static struct kobj_attribute hotplug_ready_attr = __ATTR_RO(hotplug_ready); static struct attribute *fadump_attrs[] = { &enable_attr.attr, &register_attr.attr, &mem_reserved_attr.attr, + &hotplug_ready_attr.attr, NULL, }; -- 2.43.0
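For illustration, here is a minimal userspace sketch of how a tool such as the kdump udev helper could consume the new node. Only the sysfs path comes from the ABI document above; the helper function and program around it are hypothetical:

#include <stdio.h>

/* Hypothetical helper: returns 1 when the kernel prepares the elfcorehdr
 * itself (the node reads "1"), and 0 when fadump re-registration is still
 * needed, including on older kernels where the node does not exist. */
static int fadump_hotplug_ready(void)
{
	FILE *f = fopen("/sys/kernel/fadump/hotplug_ready", "r");
	int val = 0;

	if (!f)
		return 0;	/* node missing: assume re-registration is needed */
	if (fscanf(f, "%d", &val) != 1)
		val = 0;
	fclose(f);
	return val == 1;
}

int main(void)
{
	printf("fadump re-registration %s required on memory add/remove\n",
	       fadump_hotplug_ready() ? "not" : "still");
	return 0;
}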
[PATCH v8 1/3] powerpc: make fadump resilient with memory add/remove events
Due to changes in memory resources caused by either memory hotplug or online/offline events, the elfcorehdr, which describes the CPUs and memory of the crashed kernel to the kernel that collects the dump (known as the second/fadump kernel), becomes outdated. Consequently, attempting dump collection with an outdated elfcorehdr can lead to failed or inaccurate dump collection. Memory hotplug and online/offline events are referred to as memory add/remove events in the rest of the commit message. The current solution to address the aforementioned issue is as follows: Monitor memory add/remove events in userspace using udev rules, and re-register fadump whenever there are changes in memory resources. This leads to the creation of a new elfcorehdr with updated system memory information. There are several notable issues associated with re-registering fadump for every memory add/remove event. 1. Bulk memory add/remove events with udev-based fadump re-registration can lead to race conditions and, more importantly, create a wide window during which fadump is inactive until all memory add/remove events are settled. 2. Re-registering fadump for every memory add/remove event is inefficient. 3. The memory for the elfcorehdr is allocated based on the memblock regions available during early boot and remains fixed thereafter. However, if the elfcorehdr is later recreated with additional memblock regions, its size will increase, potentially leading to memory corruption. Address the aforementioned challenges by shifting the creation of the elfcorehdr from the first kernel (also referred to as the crashed kernel), where it was created and frequently recreated for every memory add/remove event, to the fadump kernel. As a result, the elfcorehdr only needs to be created once, thus eliminating the necessity to re-register fadump during memory add/remove events. At present, the first kernel is responsible for preparing the fadump header and storing it in the fadump reserved area. The fadump header includes the start address of the elfcorehdr, crashing CPU details, and other relevant information. In the event of a crash in the first kernel, the second/fadump kernel boots and accesses the fadump header prepared by the first kernel. It then performs the following steps in a platform-specific function [rtas|opal]_fadump_process: 1. Sanity-check the fadump header 2. Update CPU notes in the elfcorehdr Along with the above, update setup_fadump()/fadump.c to create the elfcorehdr and set its address in the global variable elfcorehdr_addr for the vmcore module to process it in the second/fadump kernel. The section below outlines the information required to create the elfcorehdr and the changes made to make it available to the fadump kernel where it is not already. To create the elfcorehdr, the following crashed kernel information is required: CPU notes, vmcoreinfo, and memory ranges. At present, the CPU notes are already prepared in the fadump kernel, so no changes are needed in that regard. The fadump kernel has access to all crashed kernel memory regions, including boot memory regions that are relocated by firmware to fadump reserved areas, so no changes are needed for that either. However, it is necessary to add new members to the fadump header, i.e., the 'fadump_crash_info_header' structure, in order to pass the crashed kernel's vmcoreinfo address and its size to the fadump kernel. In addition to the vmcoreinfo address and size, a few other attributes are also added to the fadump_crash_info_header structure. 1. 
version: It stores the fadump header version, which is currently set to 1. This provides flexibility to update the fadump crash info header in the future without changing the magic number. For each change in the fadump header, the version will be increased. This will help the updated kernel determine how to handle kernel dumps from older kernels. The magic number remains relevant for checking fadump header corruption. 2. pt_regs_sz/cpu_mask_sz: Store the sizes of the first kernel's pt_regs and cpu_mask structures. These attributes are used to prevent dump processing if the size of the pt_regs or cpu_mask structure differs between the first and fadump kernels. (A sketch of the updated header layout follows this patch.) Note: if either the first/crashed kernel or the second/fadump kernel does not have the changes introduced here, the kernel fails to collect the dump and prints a relevant error message on the console. Signed-off-by: Sourabh Jain Cc: Aditya Gupta Cc: Aneesh Kumar K.V Cc: Hari Bathini Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Naveen N Rao --- arch/powerpc/include/asm/fadump-internal.h | 31 +- arch/powerpc/kernel/fadump.c | 339 +++ arch/powerpc/platforms/powernv/opal-fadump.c | 22 +- arch/powerpc/platforms/pseries/rtas-fadump.c | 30 +- 4 files changed, 232 insertions(+), 190 deletions(-) diff --git a/arch/powerpc/include/asm/fadump-internal.h b/arch/powerpc/include/asm/fadump-internal.h index 27f9e11eda28..5d706a7acc8a 100644 --- a
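For orientation, a hedged sketch of how the extended fadump_crash_info_header described above might look. The member names, types, and ordering are assumptions inferred from the commit message, not copied from the (truncated) patch; the authoritative definition lives in arch/powerpc/include/asm/fadump-internal.h:

#include <stdint.h>

/* Hypothetical reconstruction of the extended crash info header. */
struct fadump_crash_info_header {
	uint64_t magic_number;		/* unchanged; still guards against corruption */
	uint32_t version;		/* new: starts at 1, bumped on layout changes */
	uint32_t crashing_cpu;
	uint64_t vmcoreinfo_raddr;	/* new: crashed kernel's vmcoreinfo address */
	uint64_t vmcoreinfo_size;	/* new: size of the vmcoreinfo note */
	uint32_t pt_regs_sz;		/* new: sizeof(struct pt_regs) in first kernel */
	uint32_t cpu_mask_sz;		/* new: sizeof(struct cpumask) in first kernel */
	/* remaining members (register/CPU state) elided */
};

With such a layout, the fadump kernel can refuse to process the dump when pt_regs_sz or cpu_mask_sz disagrees with its own structure sizes, instead of misinterpreting a mismatched layout.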
[PATCH v8 0/3] powerpc: make fadump resilient with memory add/remove events
Problem: Due to changes in memory resources caused by either memory hotplug or online/offline events, the elfcorehdr, which describes the CPUs and memory of the crashed kernel to the kernel that collects the dump (known as the second/fadump kernel), becomes outdated. Consequently, attempting dump collection with an outdated elfcorehdr can lead to failed or inaccurate dump collection. Memory hotplug and online/offline events are referred to as memory add/remove events in the rest of the patch series. Existing solution: == Monitor memory add/remove events in userspace using udev rules, and re-register fadump whenever there are changes in memory resources. This leads to the creation of a new elfcorehdr with updated system memory information. Challenges with existing solution: == 1. Performing bulk memory add/remove with udev-based fadump re-registration can lead to race conditions and, more importantly, it creates a wide window during which fadump is inactive until all memory add/remove events are settled. 2. Re-registering fadump for every memory add/remove event is inefficient. 3. Memory for the elfcorehdr is allocated based on the memblock regions available during the first kernel's early boot and remains fixed thereafter. However, if the elfcorehdr is later recreated with additional memblock regions, its size will increase, potentially leading to memory corruption. Proposed solution: == Address the aforementioned challenges by shifting the creation of the elfcorehdr from the first kernel (also referred to as the crashed kernel), where it was created and frequently recreated for every memory add/remove event, to the fadump kernel. As a result, the elfcorehdr only needs to be created once, thus eliminating the necessity to re-register fadump during memory add/remove events. To know more about elfcorehdr creation in the fadump kernel, refer to the first patch in this series. The second patch includes a new sysfs interface that tells userspace that fadump re-registration isn't needed for memory add/remove events. Note that userspace changes do not need to be in sync with kernel changes; they can roll out independently. Since there are significant changes in the fadump implementation, the third patch updates the fadump documentation to reflect the changes made in this patch series. Kernel tree rebased on 6.8.0-rc4 with patch series applied: = https://github.com/sourabhjains/linux/tree/fadump-mem-hotplug-v8 Userspace changes: == To realize this feature, one must update the kdump udev rules to prevent fadump re-registration during memory add/remove events. On RHEL, apply the following changes to the file /usr/lib/udev/rules.d/98-kexec.rules -run+="/bin/sh -c '/usr/bin/systemctl is-active kdump.service || exit 0; /usr/bin/systemd-run --quiet --no-block /usr/lib/udev/kdump-udev-throttler'" +# don't re-register fadump if the value of the node +# /sys/kernel/fadump/hotplug_ready is 1. + +run+="/bin/sh -c '/usr/bin/systemctl is-active kdump.service || exit 0; ! test -f /sys/kernel/fadump_enabled || cat /sys/kernel/fadump_enabled | grep 0 || ! test -f /sys/kernel/fadump/hotplug_ready || cat /sys/kernel/fadump/hotplug_ready | grep 0 || exit 0; /usr/bin/systemd-run --quiet --no-block /usr/lib/udev/kdump-udev-throttler'" Changelog: == v8: 16 Feb 2024 - Move `elfcorehdr_addr` and `elfcorehdr_size` struct attributes from `struct fadump_crash_info_header` to `struct fw_dump`. - Make minor changes in commit message 1/3. - Rebase it to 6.8-rc4. 
v7: 11 Jan 2024 https://lore.kernel.org/all/2024040943.297501-1-sourabhj...@linux.ibm.com/ - Rebase it to 6.7 v6: 8 Dec 2023 https://lore.kernel.org/all/20231208115159.82236-1-sourabhj...@linux.ibm.com/ - Add size fields for `pt_regs` and `cpumask` in the fadump header structure - Don't process the dump if the size of `pt_regs` and `cpu_mask` is not the same in the crashed and fadump kernels - Include an additional check for endianness mismatch when the magic number doesn't match, to print the relevant error message - Don't process the dump if the fadump header contains an old magic number - Rebased it to 6.7.0-rc4 v5: 29 Oct 2023 https://lore.kernel.org/all/20231029124548.12198-1-sourabhj...@linux.ibm.com/ - Fix a comment on the first patch v4: 21 Oct 2023 https://lore.kernel.org/all/20231021181733.204311-1-sourabhj...@linux.ibm.com/ - Fix a build warning about type casting v3: 9 Oct 2023 https://lore.kernel.org/all/20231009041953.36139-1-sourabhj...@linux.ibm.com/ - Assign physical address of elfcorehdr to fdh->elfcorehdr_addr - Rename a variable, boot_mem_dest_addr -> boot_mem_dest_offset v2: 25 Sep 2023 https://lore.kernel.org/all/20230925051214.678957-1-sourabhj...@linux.ibm.com/ - Fixed a few indentation issues reported by the checkpatch script. - R
Re: [kvm-unit-tests PATCH v4 8/8] migration: add a migration selftest
On Fri Feb 16, 2024 at 9:15 PM AEST, Thomas Huth wrote: > On 09/02/2024 10.11, Nicholas Piggin wrote: > > Add a selftest for migration support in guest library and test harness > > code. It performs migrations in a tight loop to irritate races and bugs > > in the test harness code. > > > > Include the test in arm, s390, powerpc. > > > > Acked-by: Claudio Imbrenda (s390x) > > Reviewed-by: Thomas Huth > > Signed-off-by: Nicholas Piggin > > --- > > arm/Makefile.common | 1 + > > arm/selftest-migration.c | 1 + > > arm/unittests.cfg| 6 ++ > > Hi Nicholas, > > I just gave the patches a try, but the arm test seems to fail for me: Only > the first getchar() seems to wait for a character, all the subsequent ones > don't wait anymore and just continue immediately ... is this working for > you? Or do I need another patch on top? Hey, sorry, I missed this comment. It does seem to work for me; I've mostly tested pseries but I did test others too (that's how I saw the arm getchar limit). How are you observing it not waiting for migration? I put some sleeps in the migration script before echoing to the console input and it seems to be doing the right thing. Admittedly the test contains no way to programmatically verify that the machine was migrated the expected number of times; it would be nice to try to match that up somehow. Thanks, Nick
Re: [PATCH v2] MAINTAINERS: adjust file entries after crypto vmx file movement
On Thu, Feb 08, 2024 at 10:33:27AM +0100, Lukas Bulwahn wrote: > Commit 109303336a0c ("crypto: vmx - Move to arch/powerpc/crypto") moves the > crypto vmx files to arch/powerpc, but fails to adjust the file entries for > IBM Power VMX Cryptographic instructions and LINUX FOR POWERPC. > > Hence, ./scripts/get_maintainer.pl --self-test=patterns complains about > broken references. > > Adjust these file entries accordingly. To keep the matched files exact > after the movement, spell out each file name in the new directory. > > Signed-off-by: Lukas Bulwahn > --- > v1: > https://lore.kernel.org/lkml/20240129131729.4311-1-lukas.bulw...@gmail.com/ > > v1 -> v2: > - address Herbert Xu's feedback: > keep the matched files exactly those which were in the vmx directory > > Danny, please ack. > Herbert, please pick this minor clean-up patch on your -next tree. > > MAINTAINERS | 18 +++--- > 1 file changed, 11 insertions(+), 7 deletions(-) Patch applied. Thanks. -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] powerpc: remove unused *_syscall_64.o variables in Makefile
+To: Daniel Axtens Maybe we should check if the issue fixed by 2f26ed1764b42a8c40d9c48441c73a70d805decf came back. On Fri, Feb 16, 2024 at 10:55 PM Masahiro Yamada wrote: > > Commit ab1a517d55b0 ("powerpc/syscall: Rename syscall_64.c into > interrupt.c") failed to update these three lines: > > GCOV_PROFILE_syscall_64.o := n > KCOV_INSTRUMENT_syscall_64.o := n > UBSAN_SANITIZE_syscall_64.o := n > > To restore the original behavior, we could replace them with: > > GCOV_PROFILE_interrupt.o := n > KCOV_INSTRUMENT_interrupt.o := n > UBSAN_SANITIZE_interrupt.o := n > > However, nobody has noticed the functional change in the past three > years, so they were unneeded. > > Signed-off-by: Masahiro Yamada > --- > > arch/powerpc/kernel/Makefile | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile > index 2919433be355..72d1cd6443bc 100644 > --- a/arch/powerpc/kernel/Makefile > +++ b/arch/powerpc/kernel/Makefile > @@ -191,9 +191,6 @@ GCOV_PROFILE_kprobes-ftrace.o := n > KCOV_INSTRUMENT_kprobes-ftrace.o := n > KCSAN_SANITIZE_kprobes-ftrace.o := n > UBSAN_SANITIZE_kprobes-ftrace.o := n > -GCOV_PROFILE_syscall_64.o := n > -KCOV_INSTRUMENT_syscall_64.o := n > -UBSAN_SANITIZE_syscall_64.o := n > UBSAN_SANITIZE_vdso.o := n > > # Necessary for booting with kcov enabled on book3e machines > -- > 2.40.1 > -- Best Regards Masahiro Yamada
Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings
On 2/16/24 08:56, Catalin Marinas wrote: ... >> The problem is that the contpte_* symbols are called from the ptep_* inline >> functions. So where those inlines are called from modules, we need to make sure >> the contpte_* symbols are available. John Hubbard originally reported this >> problem against v1 and I enumerated all the drivers that call into the ptep_* >> inlines here: >> https://lore.kernel.org/linux-arm-kernel/b994ff89-1a1f-26ca-9479-b08c77f94...@arm.com/#t >> So they definitely need to be exported. Perhaps we can tighten it to Yes. Let's keep the in-tree modules working. >> EXPORT_SYMBOL_GPL(), but I was being cautious as I didn't want to break anything >> out-of-tree. I'm not sure what the normal policy is? arm64 seems to use ~equal >> amounts of both. EXPORT_SYMBOL_GPL() seems appropriate and low risk. As Catalin says below, these really are deeply core mm routines, and any module operating at this level is not going to be able to survive on EXPORT_SYMBOL alone, IMHO. Now, if only I could find an out of tree module to test that claim on... :) > I don't think we are consistent here. For example set_pte_at() can't be > called from non-GPL modules because of __sync_icache_dcache. OTOH, such > driver is probably doing something dodgy. Same with apply_to_page_range(), > it's GPL-only (called from i915). > Let's see if others have any view over the next week or so, otherwise I'd go > for _GPL and relax it later if someone has a good use-case (can be a patch on > top adding _GPL). I think going directly to _GPL for these is fine, actually. thanks, -- John Hubbard NVIDIA
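To make the linkage issue concrete, here is a standalone, hedged sketch of the pattern under discussion, with pte_t and the validity check stubbed out (the real definitions are in the patch itself): the inline lives in a header and is therefore compiled into every module that calls it, so the out-of-line contpte_* routine it falls back to needs an EXPORT_SYMBOL()/EXPORT_SYMBOL_GPL() entry for module builds to link.

/* Stub types for illustration only; not the kernel's definitions. */
typedef struct { unsigned long val; } pte_t;

static inline int pte_valid_cont(pte_t pte)
{
	return pte.val & 1;	/* stand-in for the real CONT_PTE check */
}

/* Out-of-line helper: defined in contpte.c and exported there. */
pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);

/* Header inline: any module calling this picks up a link-time
 * dependency on contpte_ptep_get(). */
static inline pte_t ptep_get(pte_t *ptep)
{
	pte_t pte = *ptep;

	if (!pte_valid_cont(pte))
		return pte;
	return contpte_ptep_get(ptep, pte);
}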
Re: [PATCH v2] powerpc: Add gpr1 and fpu save/restore functions
- Original Message - > From: "christophe leroy" > To: "Timothy Pearson" , "linuxppc-dev" > > Sent: Tuesday, February 13, 2024 9:13:41 AM > Subject: Re: [PATCH v2] powerpc: Add gpr1 and fpu save/restore functions > Le 12/02/2024 à 18:14, Timothy Pearson a écrit : > All this looks very similar to savegpr0_xx and restgpr0_xx with the > difference being that r12 is used instead of r1. > > Could we avoid duplication by using macros ? > Yes. v3 sent.
[PATCH v3] powerpc: Add gpr1 and fpu save/restore functions
When building the kernel in size optimized mode with the amdgpu module enabled, gcc will begin referencing external gpr1 and fpu save/restore functions. This will then cause a linker failure, as we do not link against libgcc, which normally contains those builtin functions. Implement gpr1 and fpu save/restore functions per the PowerPC 64-bit ELFv2 ABI documentation. Tested on a Talos II with a WX7100 installed and running in DisplayCore mode. Reported-by: kernel test robot Tested-by: Timothy Pearson Signed-off-by: Timothy Pearson --- arch/powerpc/kernel/prom_init_check.sh | 4 +- arch/powerpc/lib/crtsavres.S | 363 + scripts/mod/modpost.c | 4 + 3 files changed, 253 insertions(+), 118 deletions(-) diff --git a/arch/powerpc/kernel/prom_init_check.sh b/arch/powerpc/kernel/prom_init_check.sh index 69623b9045d5..76c5651e29d3 100644 --- a/arch/powerpc/kernel/prom_init_check.sh +++ b/arch/powerpc/kernel/prom_init_check.sh @@ -72,10 +72,10 @@ do # ignore register save/restore funcitons case $UNDEF in - _restgpr_*|_restgpr0_*|_rest32gpr_*) + _restgpr_*|_restgpr0_*|_restgpr1_*|_rest32gpr_*) OK=1 ;; - _savegpr_*|_savegpr0_*|_save32gpr_*) + _savegpr_*|_savegpr0_*|_savegpr1_*|_save32gpr_*) OK=1 ;; esac diff --git a/arch/powerpc/lib/crtsavres.S b/arch/powerpc/lib/crtsavres.S index 7e5e1c28e56a..f97270d36720 100644 --- a/arch/powerpc/lib/crtsavres.S +++ b/arch/powerpc/lib/crtsavres.S @@ -3,6 +3,7 @@ * * Copyright (C) 1995, 1996, 1998, 2000, 2001 Free Software Foundation, Inc. * Copyright 2008 Freescale Semiconductor, Inc. + * Copyright 2024 Raptor Engineering, LLC * Written By Michael Meissner * * Based on gcc/config/rs6000/crtsavres.asm from gcc @@ -314,126 +315,134 @@ _GLOBAL(_restvr_31) #else /* CONFIG_PPC64 */ -.globl _savegpr0_14 -_savegpr0_14: - std r14,-144(r1) -.globl _savegpr0_15 -_savegpr0_15: - std r15,-136(r1) -.globl _savegpr0_16 -_savegpr0_16: - std r16,-128(r1) -.globl _savegpr0_17 -_savegpr0_17: - std r17,-120(r1) -.globl _savegpr0_18 -_savegpr0_18: - std r18,-112(r1) -.globl _savegpr0_19 -_savegpr0_19: - std r19,-104(r1) -.globl _savegpr0_20 -_savegpr0_20: - std r20,-96(r1) -.globl _savegpr0_21 -_savegpr0_21: - std r21,-88(r1) -.globl _savegpr0_22 -_savegpr0_22: - std r22,-80(r1) -.globl _savegpr0_23 -_savegpr0_23: - std r23,-72(r1) -.globl _savegpr0_24 -_savegpr0_24: - std r24,-64(r1) -.globl _savegpr0_25 -_savegpr0_25: - std r25,-56(r1) -.globl _savegpr0_26 -_savegpr0_26: - std r26,-48(r1) -.globl _savegpr0_27 -_savegpr0_27: - std r27,-40(r1) -.globl _savegpr0_28 -_savegpr0_28: - std r28,-32(r1) -.globl _savegpr0_29 -_savegpr0_29: - std r29,-24(r1) -.globl _savegpr0_30 -_savegpr0_30: - std r30,-16(r1) -.globl _savegpr0_31 -_savegpr0_31: - std r31,-8(r1) - std r0,16(r1) +#define __PPC64_SAVEGPR(n,base)\ +.globl _savegpr##n##_14\ +_savegpr##n##_14: \ + std r14,-144(base) \ +.globl _savegpr##n##_15\ +_savegpr##n##_15: \ + std r15,-136(base) \ +.globl _savegpr##n##_16\ +_savegpr##n##_16: \ + std r16,-128(base) \ +.globl _savegpr##n##_17\ +_savegpr##n##_17: \ + std r17,-120(base) \ +.globl _savegpr##n##_18\ +_savegpr##n##_18: \ + std r18,-112(base) \ +.globl _savegpr##n##_19\ +_savegpr##n##_19: \ + std r19,-104(base) \ +.globl _savegpr##n##_20\ +_savegpr##n##_20: \ + std r20,-96(base) \ +.globl _savegpr##n##_21\ +_savegpr##n##_21: \ + std r21,-88(base) \ +.globl _savegpr##n##_22\ +_savegpr##n##_22: \ + std r22,-80(base) \ +.globl _savegpr##n##_23\ +_savegpr##n##_23: \ + std r23,-72(base) \ +.globl _savegpr##n##_24\ +_savegpr##n##_24: \ + std r24,-64(base) \ +.globl _savegpr##n##_25\ 
+_savegpr##n##_25: \ + std r25,-56(base) \ +.globl _savegpr##n##_26\ +_savegpr##n##_26: \ + std r26,-48(base) \ +.globl _savegpr##n##_27\ +_savegpr##n##_27: \ + std r27,-40(base) \ +.globl _savegpr##n##_28\ +_savegpr##n##_28: \ + std r28,-32(base) \ +.globl _savegpr##n##_29\ +_savegpr##n##_29: \ + std r29,-24(base) \ +.globl _savegpr##n##_30\ +_savegpr##n##_30: \ + std r30,-16(base) \ +.globl _savegpr##n##_31\ +_savegpr##n##_31: \ + std r31,-8(base)\ +
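As background to the patch above, here is a hedged illustration of the kind of C code that makes gcc emit calls to these out-of-line routines. Actual code generation depends on the compiler version and flags (-Os in the reported configuration), so this is a plausible trigger rather than a guaranteed reproduction:

/* With gcc -Os on ppc64, a function that keeps many values live across
 * a call must save and restore nonvolatile GPRs (r14-r31). To shrink
 * the prologue/epilogue, gcc may call out-of-line _savegpr0_N and
 * _restgpr0_N helpers (or the _savegpr1_N/_restgpr1_N variants, which
 * use r12 rather than r1 as the frame base) instead of emitting the
 * std/ld sequences inline. */
extern long consume(long, long, long, long, long, long, long, long);

long keep_many_live(long a, long b, long c, long d,
		    long e, long f, long g, long h)
{
	/* The call forces a..h to survive in nonvolatile registers. */
	long s = consume(a, b, c, d, e, f, g, h);

	return s + a + b + c + d + e + f + g + h;
}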
Re: [PATCH] KVM: Get rid of return value from kvm_arch_create_vm_debugfs()
On Fri, 16 Feb 2024 15:59:41 +, Oliver Upton wrote: > > The general expectation with debugfs is that any initialization failure > is nonfatal. Nevertheless, kvm_arch_create_vm_debugfs() allows > implementations to return an error and kvm_create_vm_debugfs() allows > that to fail VM creation. > > Change to a void return to discourage architectures from making debugfs > failures fatal for the VM. Seems like everyone already had the right > idea, as all implementations already return 0 unconditionally. > > Signed-off-by: Oliver Upton Acked-by: Marc Zyngier M. -- Without deviation from the norm, progress is not possible.
Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings
On Fri, Feb 16, 2024 at 12:53:43PM +, Ryan Roberts wrote: > On 16/02/2024 12:25, Catalin Marinas wrote: > > On Thu, Feb 15, 2024 at 10:31:59AM +, Ryan Roberts wrote: > >> arch/arm64/mm/contpte.c | 285 +++ > > > > Nitpick: I think most symbols in contpte.c can be EXPORT_SYMBOL_GPL(). > > We don't expect them to be used by random out of tree modules. In fact, > > do we expect them to end up in modules at all? Most seem to be called > > from the core mm code. > > The problem is that the contpte_* symbols are called from the ptep_* inline > functions. So where those inlines are called from modules, we need to make > sure > the contpte_* symbols are available. > > John Hubbard originally reported this problem against v1 and I enumerated all > the drivers that call into the ptep_* inlines here: > https://lore.kernel.org/linux-arm-kernel/b994ff89-1a1f-26ca-9479-b08c77f94...@arm.com/#t > > So they definitely need to be exported. Perhaps we can tighten it to > EXPORT_SYMBOL_GPL(), but I was being cautious as I didn't want to break > anything > out-of-tree. I'm not sure what the normal policy is? arm64 seems to use ~equal > amounts of both. I don't think we are consistent here. For example set_pte_at() can't be called from non-GPL modules because of __sync_icache_dcache. OTOH, such driver is probably doing something dodgy. Same with apply_to_page_range(), it's GPL-only (called from i915). Let's see if others have any view over the next week or so, otherwise I'd go for _GPL and relax it later if someone has a good use-case (can be a patch on top adding _GPL). > > If you can make this easier to parse (in a few years time) with an > > additional patch adding some more comments, that would be great. For > > this patch: > > I already have a big block comment at the top, which was trying to explain it. > Clearly not well enough though. I'll add more comments as a follow up patch > when > I get back from holiday. I read that comment but it wasn't immediately obvious what the atomicity requirements are - basically we require a single PTE to be atomically read (which it is), the rest is the dirty/young state being added on top. I guess a sentence along these lines would do. Enjoy your holiday! -- Catalin
[PATCH] KVM: Get rid of return value from kvm_arch_create_vm_debugfs()
The general expectation with debugfs is that any initialization failure is nonfatal. Nevertheless, kvm_arch_create_vm_debugfs() allows implementations to return an error and kvm_create_vm_debugfs() allows that to fail VM creation. Change to a void return to discourage architectures from making debugfs failures fatal for the VM. Seems like everyone already had the right idea, as all implementations already return 0 unconditionally. Signed-off-by: Oliver Upton --- Compile-tested on arm64, powerpc, and x86 since I don't trust myself to even get this simple patch right. arch/powerpc/kvm/powerpc.c | 3 +-- arch/x86/kvm/debugfs.c | 3 +-- include/linux/kvm_host.h | 2 +- virt/kvm/kvm_main.c| 8 ++-- 4 files changed, 5 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 23407fbd73c9..d32abe7fe6ab 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -2538,9 +2538,8 @@ void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_ vcpu->kvm->arch.kvm_ops->create_vcpu_debugfs(vcpu, debugfs_dentry); } -int kvm_arch_create_vm_debugfs(struct kvm *kvm) +void kvm_arch_create_vm_debugfs(struct kvm *kvm) { if (kvm->arch.kvm_ops->create_vm_debugfs) kvm->arch.kvm_ops->create_vm_debugfs(kvm); - return 0; } diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c index 95ea1a1f7403..999227fc7c66 100644 --- a/arch/x86/kvm/debugfs.c +++ b/arch/x86/kvm/debugfs.c @@ -189,9 +189,8 @@ static const struct file_operations mmu_rmaps_stat_fops = { .release= kvm_mmu_rmaps_stat_release, }; -int kvm_arch_create_vm_debugfs(struct kvm *kvm) +void kvm_arch_create_vm_debugfs(struct kvm *kvm) { debugfs_create_file("mmu_rmaps_stat", 0644, kvm->debugfs_dentry, kvm, &mmu_rmaps_stat_fops); - return 0; } diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 7e7fd25b09b3..9a45f673f687 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1507,7 +1507,7 @@ bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu); bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu); int kvm_arch_post_init_vm(struct kvm *kvm); void kvm_arch_pre_destroy_vm(struct kvm *kvm); -int kvm_arch_create_vm_debugfs(struct kvm *kvm); +void kvm_arch_create_vm_debugfs(struct kvm *kvm); #ifndef __KVM_HAVE_ARCH_VM_ALLOC /* diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 10bfc88a69f7..c681149c382a 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1150,10 +1150,7 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, const char *fdname) &stat_fops_per_vm); } - ret = kvm_arch_create_vm_debugfs(kvm); - if (ret) - goto out_err; - + kvm_arch_create_vm_debugfs(kvm); return 0; out_err: kvm_destroy_vm_debugfs(kvm); @@ -1183,9 +1180,8 @@ void __weak kvm_arch_pre_destroy_vm(struct kvm *kvm) * Cleanup should be automatic done in kvm_destroy_vm_debugfs() recursively, so * a per-arch destroy interface is not needed. */ -int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm) +void __weak kvm_arch_create_vm_debugfs(struct kvm *kvm) { - return 0; } static struct kvm *kvm_create_vm(unsigned long type, const char *fdname) base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d -- 2.44.0.rc0.258.g7320e95886-goog
Re: [kvm-unit-tests PATCH v1 00/18] arm/arm64: Rework cache maintenance at boot
Hi, On Thu, Nov 30, 2023 at 04:07:02AM -0500, Shaoqin Huang wrote: > Hi, > > I'm posting Alexandru's patch set[1] rebased on the latest branch with the > conflicts resolved. No big changes compared to its original code. > > As version 1 of this series was posted one year ago, let me first recall > what the intention of this series is and what it does. You can view the > cover letter by clicking the link[2]. > > When the series[1] was written, the efi support for arm64[3] hadn't been > merged into kvm-unit-tests, but now it has been. Directly rebasing the > series[1] onto the latest branch would break the efi > tests. This is mainly because Patch #15 ("arm/arm64: Enable the MMU > early") > moves the mmu_enable() out of the setup_mmu(), which means the efi test will > not enable the mmu. So I made a small change in efi_mem_init() which makes > the > efi test also enable the MMU early, and it works. > > Another change worth noting is in Patch #17 ("arm/arm64: Perform > dcache maintenance"). efi_mem_init() disables the mmu, builds > a new pagetable, and re-enables the mmu. If asm_mmu_disable cleaned and > invalidated the data caches for the entire memory, we wouldn't need to care > about the dcache: after the mmu is disabled, we use mmu_setup_early() to > re-enable the mmu, which > takes care of all the cache maintenance. But the situation changes since > Patch > #18 ("arm/arm64: Rework the cache maintenance in asm_mmu_disable") only cleans > and invalidates the data caches for the stack memory area. So we need to clean > and invalidate the data caches manually before disabling the mmu. I'm not > confident about the current cache maintenance in the efi setup path, so I ask > for > your help reviewing whether it's right or not. > > I also dropped one patch ("s390: Do not use the physical allocator") from[1] > since it causes the s390 test to fail. This is unfortunate. What tests do you see failing? I wrote the s390x patch so I can justify dropping the locking in the physical allocator. And I wanted to drop the locking so I wouldn't have to do maintenance operations on the spinlock. Because of how kvm-unit-tests implements spinlocks for arm/arm64, they don't protect against concurrent accesses when the MMU is turned off. And because arm and arm64 use the physical allocator during the test boot sequence, not during a test, using a spin lock is also useless since there will be no concurrent accesses (the boot phase is done on a single CPU). But since replacing the physical allocator causes test failures for s390x, it looks like the physical allocator will still be needed by tests, and that requires having the spinlock. I guess the best approach would be to teach the physical allocator to do cache maintenance on the spinlock. We might as well, since the UART needs it too, and I don't think this series addresses that. Thanks, Alex > > This series may include bugs, so I would really appreciate your review to improve > this > series together. 
> > You can get the code from: > > $ git clone https://gitlab.com/shahuang/kvm-unit-tests.git \ > -b arm-arm64-rework-cache-maintenance-at-boot-v1 > > [1] > https://gitlab.arm.com/linux-arm/kvm-unit-tests-ae/-/tree/arm-arm64-rework-cache-maintenance-at-boot-v2-wip2 > [2] > https://lore.kernel.org/all/20220809091558.14379-1-alexandru.eli...@arm.com/ > [3] > https://patchwork.kernel.org/project/kvm/cover/20230530160924.82158-1-nikos.nikole...@arm.com/ > > Changelog: > -- > RFC->v1: > - Gathered Reviewed-by tags. > - Various changes to commit messages and comments to hopefully make the code > easier to understand. > - Patch #8 ("lib/alloc_phys: Expand documentation with usage and > limitations") > is new. > - Folded patch "arm: page.h: Add missing libcflat.h include" into #17 > ("arm/arm64: Perform dcache maintenance at boot"). > - Reordered the series to group patches that touch approximately the same > code > together - the patches that change the physical allocator come first, > followed by the patches that change how the secondaries are brought > online. > - Fixed several nasty bugs where the r4 register was being clobbered in the > arm > assembly. > - Unmap the early UART address if the DTB address does not match the early > address. > - Added dcache maintenance when a page table is modified with the MMU > disabled. > - Moved the cache maintenance when disabling the MMU to be executed before > the > MMU is disabled. > - Rebased it on the latest branch, where efi support has been merged. > - Made the efi test also enable the MMU early. > - Added cache maintenance on the efi setup path, especially before mmu_disable. > > RFC: > https://lore.kernel.org/all/20220809091558.14379-1-alexandru.eli...@arm.com/ > > Alexandru Elisei (18): > Makefile: Define __ASSEMBLY__ for assembly files > powerpc: Replace the p
Re: [PATCH v1 4/5] powerpc: Remove cpu-as-y completely
On 20/02/2023 at 07:00, Michael Ellerman wrote: > Christophe Leroy writes: >> cpu-as-y is there to force assembler build options. >> But there is no need for that. Gcc is passed the necessary >> options and it automatically passes the appropriate option to >> GAS. >> >> GCC is given -maltivec when relevant, so there is no need >> for -Wa,-maltivec in addition. >> >> And -Wa,-many is wrong as it will hide inappropriate >> instructions. Better to detect them and handle them on a >> case-by-case basis. >> -Wa,-many was added by commit 960e30029863 ("powerpc/Makefile: >> Fix PPC_BOOK3S_64 ASFLAGS") in order to fix an issue with >> clang and the passed -Wa,-mpower4 option. But we have now >> removed it, expecting the compiler to automatically pass the >> proper options and instructions based on -mcpu=power4. > > I wanted to apply this one, but it caused a lot of breakage for big > endian Book3S-64 builds - where we build for power4 but have lots of > code that uses >= power5 instructions. > > I'll try and get those all fixed and pick this up for the next merge > window. > ping ?
Re: [PATCH v2 0/2] PCI endpoint BAR hardware description cleanup
On Fri, Feb 16, 2024 at 02:45:13PM +0100, Niklas Cassel wrote: > The series is based on top of: > https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git/log/?h=endpoint > > > Hello all, > > This series cleans up the hardware description for PCI endpoint BARs. > > The problems with the existing hardware description: > -The documentation is lackluster. > -Some of the names are confusingly similar, e.g. fixed_64bit and > fixed_size, even though these are for completely unrelated things. > -The way that the BARs are defined in the endpoint controller drivers > is messy, because the left hand side is not a BAR, so you can mark a > BAR as e.g. both fixed size and reserved. > > This series tries to address all the problems above. > > Personally, I think that the code is more readable, both the endpoint > controller drivers, but also pci-epc-core.c. > > (I will be sending out a patch series that adds BAR_RESIZABLE to enum > pci_epc_bar_type in the coming week.) > Applied to pci/endpoint! - Mani > > Kind regards, > Niklas > > > Changes since V1: > -Picked up tags from Kishon and Mani. > -Improved commit message for patch 1/2 as suggested by Mani. > -Improved kdoc in patch 2/2 as suggested by Mani. > > > Niklas Cassel (2): > PCI: endpoint: Clean up hardware description for BARs > PCI: endpoint: Drop only_64bit on reserved BARs > > drivers/pci/controller/dwc/pci-imx6.c | 3 +- > drivers/pci/controller/dwc/pci-keystone.c | 12 +++--- > .../pci/controller/dwc/pci-layerscape-ep.c| 5 ++- > drivers/pci/controller/dwc/pcie-keembay.c | 8 +++- > drivers/pci/controller/dwc/pcie-rcar-gen4.c | 4 +- > drivers/pci/controller/dwc/pcie-tegra194.c| 10 +++-- > drivers/pci/controller/dwc/pcie-uniphier-ep.c | 15 +-- > drivers/pci/controller/pcie-rcar-ep.c | 14 --- > drivers/pci/endpoint/functions/pci-epf-ntb.c | 4 +- > drivers/pci/endpoint/functions/pci-epf-test.c | 8 ++-- > drivers/pci/endpoint/functions/pci-epf-vntb.c | 2 +- > drivers/pci/endpoint/pci-epc-core.c | 25 +--- > drivers/pci/endpoint/pci-epf-core.c | 15 +++ > include/linux/pci-epc.h | 39 --- > 14 files changed, 106 insertions(+), 58 deletions(-) > > -- > 2.43.1 > -- மணிவண்ணன் சதாசிவம்
Re: [PATCH v2 1/2] PCI: endpoint: Clean up hardware description for BARs
On Fri, Feb 16, 2024 at 02:45:14PM +0100, Niklas Cassel wrote: > The hardware description for BARs is scattered in many different variables > in pci_epc_features. Some of these things are mutually exclusive, so it > can create confusion over which variable that has precedence over another. > > Improve the situation by creating a struct pci_epc_bar_desc, and a new > enum pci_epc_bar_type, and convert the endpoint controller drivers to use > this more well defined format. > > Additionally, some endpoint controller drivers mark the BAR succeeding a > "64-bit only BAR" as reserved, while some do not. By definition, a 64-bit > BAR uses the succeeding BAR for the upper 32-bits, so an EPF driver cannot > use a BAR succeeding a 64-bit BAR. Ensure that all endpoint controller > drivers are uniform, and actually describe a reserved BAR as reserved. > > Signed-off-by: Niklas Cassel Reviewed-by: Manivannan Sadhasivam - Mani > Reviewed-by: Kishon Vijay Abraham I > --- > drivers/pci/controller/dwc/pci-imx6.c | 3 +- > drivers/pci/controller/dwc/pci-keystone.c | 12 +++ > .../pci/controller/dwc/pci-layerscape-ep.c| 5 ++- > drivers/pci/controller/dwc/pcie-keembay.c | 8 +++-- > drivers/pci/controller/dwc/pcie-rcar-gen4.c | 4 ++- > drivers/pci/controller/dwc/pcie-tegra194.c| 10 -- > drivers/pci/controller/dwc/pcie-uniphier-ep.c | 15 ++-- > drivers/pci/controller/pcie-rcar-ep.c | 14 +--- > drivers/pci/endpoint/functions/pci-epf-ntb.c | 4 +-- > drivers/pci/endpoint/functions/pci-epf-test.c | 8 ++--- > drivers/pci/endpoint/functions/pci-epf-vntb.c | 2 +- > drivers/pci/endpoint/pci-epc-core.c | 32 + > drivers/pci/endpoint/pci-epf-core.c | 15 > include/linux/pci-epc.h | 34 +++ > 14 files changed, 108 insertions(+), 58 deletions(-) > > diff --git a/drivers/pci/controller/dwc/pci-imx6.c > b/drivers/pci/controller/dwc/pci-imx6.c > index dc2c036ab28c..47a9a96484ed 100644 > --- a/drivers/pci/controller/dwc/pci-imx6.c > +++ b/drivers/pci/controller/dwc/pci-imx6.c > @@ -1081,7 +1081,8 @@ static const struct pci_epc_features > imx8m_pcie_epc_features = { > .linkup_notifier = false, > .msi_capable = true, > .msix_capable = false, > - .reserved_bar = 1 << BAR_1 | 1 << BAR_3, > + .bar[BAR_1] = { .type = BAR_RESERVED, }, > + .bar[BAR_3] = { .type = BAR_RESERVED, }, > .align = SZ_64K, > }; > > diff --git a/drivers/pci/controller/dwc/pci-keystone.c > b/drivers/pci/controller/dwc/pci-keystone.c > index c0c62533a3f1..b2b93b4fa82d 100644 > --- a/drivers/pci/controller/dwc/pci-keystone.c > +++ b/drivers/pci/controller/dwc/pci-keystone.c > @@ -924,12 +924,12 @@ static const struct pci_epc_features > ks_pcie_am654_epc_features = { > .linkup_notifier = false, > .msi_capable = true, > .msix_capable = true, > - .reserved_bar = 1 << BAR_0 | 1 << BAR_1, > - .bar_fixed_64bit = 1 << BAR_0, > - .bar_fixed_size[2] = SZ_1M, > - .bar_fixed_size[3] = SZ_64K, > - .bar_fixed_size[4] = 256, > - .bar_fixed_size[5] = SZ_1M, > + .bar[BAR_0] = { .type = BAR_RESERVED, .only_64bit = true, }, > + .bar[BAR_1] = { .type = BAR_RESERVED, }, > + .bar[BAR_2] = { .type = BAR_FIXED, .fixed_size = SZ_1M, }, > + .bar[BAR_3] = { .type = BAR_FIXED, .fixed_size = SZ_64K, }, > + .bar[BAR_4] = { .type = BAR_FIXED, .fixed_size = 256, }, > + .bar[BAR_5] = { .type = BAR_FIXED, .fixed_size = SZ_1M, }, > .align = SZ_1M, > }; > > diff --git a/drivers/pci/controller/dwc/pci-layerscape-ep.c > b/drivers/pci/controller/dwc/pci-layerscape-ep.c > index 2e398494e7c0..1f6ee1460ec2 100644 > --- a/drivers/pci/controller/dwc/pci-layerscape-ep.c > +++ 
b/drivers/pci/controller/dwc/pci-layerscape-ep.c > @@ -250,7 +250,10 @@ static int __init ls_pcie_ep_probe(struct > platform_device *pdev) > pci->dev = dev; > pci->ops = pcie->drvdata->dw_pcie_ops; > > - ls_epc->bar_fixed_64bit = (1 << BAR_2) | (1 << BAR_4); > + ls_epc->bar[BAR_2].only_64bit = true; > + ls_epc->bar[BAR_3].type = BAR_RESERVED; > + ls_epc->bar[BAR_4].only_64bit = true; > + ls_epc->bar[BAR_5].type = BAR_RESERVED; > ls_epc->linkup_notifier = true; > > pcie->pci = pci; > diff --git a/drivers/pci/controller/dwc/pcie-keembay.c > b/drivers/pci/controller/dwc/pcie-keembay.c > index 208d3b0ba196..5e8e54f597dd 100644 > --- a/drivers/pci/controller/dwc/pcie-keembay.c > +++ b/drivers/pci/controller/dwc/pcie-keembay.c > @@ -312,8 +312,12 @@ static const struct pci_epc_features > keembay_pcie_epc_features = { > .linkup_notifier= false, > .msi_capable= true, > .msix_capable = true, > - .reserved_bar = BIT(BAR_1) | BIT(BAR_3) | BIT(BAR_5), > - .bar_fixed_64bit= BIT(BAR_0) | BIT(BAR_2) | BIT(BAR_4), > + .bar[BAR_0] = { .only_64bit = true, }
Re: [kvm-unit-tests PATCH v4 8/8] migration: add a migration selftest
On 09/02/2024 10.11, Nicholas Piggin wrote: Add a selftest for migration support in guest library and test harness code. It performs migrations in a tight loop to irritate races and bugs in the test harness code. Include the test in arm, s390, powerpc. Acked-by: Claudio Imbrenda (s390x) Reviewed-by: Thomas Huth Signed-off-by: Nicholas Piggin --- arm/Makefile.common | 1 + arm/selftest-migration.c | 1 + arm/unittests.cfg| 6 ++ common/selftest-migration.c | 34 ++ powerpc/Makefile.common | 1 + powerpc/selftest-migration.c | 1 + powerpc/unittests.cfg| 4 s390x/Makefile | 1 + s390x/selftest-migration.c | 1 + s390x/unittests.cfg | 4 10 files changed, 54 insertions(+) create mode 12 arm/selftest-migration.c create mode 100644 common/selftest-migration.c create mode 12 powerpc/selftest-migration.c create mode 12 s390x/selftest-migration.c diff --git a/arm/Makefile.common b/arm/Makefile.common index f828dbe0..f107c478 100644 --- a/arm/Makefile.common +++ b/arm/Makefile.common @@ -5,6 +5,7 @@ # tests-common = $(TEST_DIR)/selftest.$(exe) +tests-common += $(TEST_DIR)/selftest-migration.$(exe) tests-common += $(TEST_DIR)/spinlock-test.$(exe) tests-common += $(TEST_DIR)/pci-test.$(exe) tests-common += $(TEST_DIR)/pmu.$(exe) diff --git a/arm/selftest-migration.c b/arm/selftest-migration.c new file mode 12 index ..bd1eb266 --- /dev/null +++ b/arm/selftest-migration.c @@ -0,0 +1 @@ +../common/selftest-migration.c \ No newline at end of file diff --git a/arm/unittests.cfg b/arm/unittests.cfg index fe601cbb..db0e4c9b 100644 --- a/arm/unittests.cfg +++ b/arm/unittests.cfg @@ -55,6 +55,12 @@ smp = $MAX_SMP extra_params = -append 'smp' groups = selftest +# Test migration +[selftest-migration] +file = selftest-migration.flat +groups = selftest migration +arch = arm64 + # Test PCI emulation [pci-test] file = pci-test.flat diff --git a/common/selftest-migration.c b/common/selftest-migration.c new file mode 100644 index ..f70c505f --- /dev/null +++ b/common/selftest-migration.c @@ -0,0 +1,34 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Machine independent migration tests + * + * This is just a very simple test that is intended to stress the migration + * support in the test harness. This could be expanded to test more guest + * library code, but architecture-specific tests should be used to test + * migration of tricky machine state. + */ +#include +#include + +#if defined(__arm__) || defined(__aarch64__) +/* arm can only call getchar 15 times */ +#define NR_MIGRATIONS 15 +#else +#define NR_MIGRATIONS 100 +#endif FYI, I just wrote a patch that will hopefully fix the limitation to 15 times on arm: https://lore.kernel.org/kvm/20240216140210.70280-1-th...@redhat.com/T/#u Thomas
[powerpc:merge] BUILD SUCCESS 35d5936fc3d416d9629c6b53a07bd25dc1c2bc7f
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge branch HEAD: 35d5936fc3d416d9629c6b53a07bd25dc1c2bc7f Automatic merge of 'next' into merge (2024-02-15 23:53) elapsed time: 1445m configs tested: 167 configs skipped: 3 The following configs have been built successfully. More configs may be tested in the coming days. tested configs: alpha allnoconfig gcc alphaallyesconfig gcc alpha defconfig gcc arc allmodconfig gcc arc allnoconfig gcc arc allyesconfig gcc arc defconfig gcc arc haps_hs_smp_defconfig gcc arc randconfig-001-20240216 gcc arc randconfig-002-20240216 gcc arm alldefconfig gcc arm allmodconfig gcc arm allnoconfig clang arm allyesconfig gcc arm collie_defconfig gcc arm defconfig clang armhisi_defconfig gcc armkeystone_defconfig gcc arm moxart_defconfig gcc armneponset_defconfig gcc arm randconfig-003-20240216 gcc arm randconfig-004-20240216 gcc arm64allmodconfig clang arm64 allnoconfig gcc arm64 defconfig gcc arm64 randconfig-002-20240216 gcc arm64 randconfig-004-20240216 gcc csky allmodconfig gcc csky allnoconfig gcc csky allyesconfig gcc cskydefconfig gcc csky randconfig-001-20240216 gcc csky randconfig-002-20240216 gcc hexagon allmodconfig clang hexagon allnoconfig clang hexagon allyesconfig clang hexagon defconfig clang i386 allmodconfig gcc i386 allnoconfig gcc i386 allyesconfig gcc i386 buildonly-randconfig-001-20240216 gcc i386 buildonly-randconfig-002-20240216 clang i386 buildonly-randconfig-003-20240216 clang i386 buildonly-randconfig-004-20240216 gcc i386 buildonly-randconfig-005-20240216 gcc i386 buildonly-randconfig-006-20240216 clang i386defconfig clang i386 randconfig-001-20240216 clang i386 randconfig-002-20240216 clang i386 randconfig-003-20240216 gcc i386 randconfig-004-20240216 clang i386 randconfig-005-20240216 gcc i386 randconfig-006-20240216 clang i386 randconfig-011-20240216 gcc i386 randconfig-012-20240216 gcc i386 randconfig-013-20240216 gcc i386 randconfig-014-20240216 clang i386 randconfig-015-20240216 gcc i386 randconfig-016-20240216 gcc loongarchallmodconfig gcc loongarch allnoconfig gcc loongarchallyesconfig gcc loongarch defconfig gcc loongarch randconfig-001-20240216 gcc loongarch randconfig-002-20240216 gcc m68k allmodconfig gcc m68k allnoconfig gcc m68k allyesconfig gcc m68kdefconfig gcc m68kmvme16x_defconfig gcc microblaze allmodconfig gcc microblazeallnoconfig gcc microblaze allyesconfig gcc microblaze defconfig gcc mips allmodconfig gcc mips allnoconfig gcc mips allyesconfig gcc mips ip28_defconfig gcc mips loongson3_defconfig gcc mips pic32mzda_defconfig gcc nios2allmodconfig gcc nios2 allnoconfig gcc nios2allyesconfig gcc nios2 defconfig gcc nios2 randconfig-001-20240216 g
[PATCH] powerpc: remove unused KCSAN_SANITIZE_early_64.o in Makefile
Commit 2fb857bc9f9e ("powerpc/kcsan: Add exclusions from instrumentation") added KCSAN_SANITIZE_early_64.o to arch/powerpc/kernel/Makefile, but that Makefile does not build early_64.o, so the line has no effect. Signed-off-by: Masahiro Yamada --- arch/powerpc/kernel/Makefile | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile index 72d1cd6443bc..a6f9b53c7490 100644 --- a/arch/powerpc/kernel/Makefile +++ b/arch/powerpc/kernel/Makefile @@ -55,7 +55,6 @@ CFLAGS_btext.o += -DDISABLE_BRANCH_PROFILING endif KCSAN_SANITIZE_early_32.o := n -KCSAN_SANITIZE_early_64.o := n KCSAN_SANITIZE_cputable.o := n KCSAN_SANITIZE_btext.o := n KCSAN_SANITIZE_paca.o := n -- 2.40.1
[PATCH] powerpc: remove unused *_syscall_64.o variables in Makefile
Commit ab1a517d55b0 ("powerpc/syscall: Rename syscall_64.c into interrupt.c") failed to update these three lines: GCOV_PROFILE_syscall_64.o := n KCOV_INSTRUMENT_syscall_64.o := n UBSAN_SANITIZE_syscall_64.o := n To restore the original behavior, we could replace them with: GCOV_PROFILE_interrupt.o := n KCOV_INSTRUMENT_interrupt.o := n UBSAN_SANITIZE_interrupt.o := n However, nobody has noticed the functional change in the past three years, so they were unneeded. Signed-off-by: Masahiro Yamada --- arch/powerpc/kernel/Makefile | 3 --- 1 file changed, 3 deletions(-) diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile index 2919433be355..72d1cd6443bc 100644 --- a/arch/powerpc/kernel/Makefile +++ b/arch/powerpc/kernel/Makefile @@ -191,9 +191,6 @@ GCOV_PROFILE_kprobes-ftrace.o := n KCOV_INSTRUMENT_kprobes-ftrace.o := n KCSAN_SANITIZE_kprobes-ftrace.o := n UBSAN_SANITIZE_kprobes-ftrace.o := n -GCOV_PROFILE_syscall_64.o := n -KCOV_INSTRUMENT_syscall_64.o := n -UBSAN_SANITIZE_syscall_64.o := n UBSAN_SANITIZE_vdso.o := n # Necessary for booting with kcov enabled on book3e machines -- 2.40.1
[PATCH v2 1/2] PCI: endpoint: Clean up hardware description for BARs
The hardware description for BARs is scattered in many different variables in pci_epc_features. Some of these things are mutually exclusive, so it can create confusion over which variable has precedence over another.

Improve the situation by creating a struct pci_epc_bar_desc, and a new enum pci_epc_bar_type, and convert the endpoint controller drivers to use this better-defined format.

Additionally, some endpoint controller drivers mark the BAR succeeding a "64-bit only BAR" as reserved, while some do not. By definition, a 64-bit BAR uses the succeeding BAR for the upper 32 bits, so an EPF driver cannot use a BAR succeeding a 64-bit BAR. Ensure that all endpoint controller drivers are uniform, and actually describe a reserved BAR as reserved.

Signed-off-by: Niklas Cassel
Reviewed-by: Kishon Vijay Abraham I
---
 drivers/pci/controller/dwc/pci-imx6.c         |  3 +-
 drivers/pci/controller/dwc/pci-keystone.c     | 12 +++
 .../pci/controller/dwc/pci-layerscape-ep.c    |  5 ++-
 drivers/pci/controller/dwc/pcie-keembay.c     |  8 +++--
 drivers/pci/controller/dwc/pcie-rcar-gen4.c   |  4 ++-
 drivers/pci/controller/dwc/pcie-tegra194.c    | 10 --
 drivers/pci/controller/dwc/pcie-uniphier-ep.c | 15 ++--
 drivers/pci/controller/pcie-rcar-ep.c         | 14 +---
 drivers/pci/endpoint/functions/pci-epf-ntb.c  |  4 +--
 drivers/pci/endpoint/functions/pci-epf-test.c |  8 ++---
 drivers/pci/endpoint/functions/pci-epf-vntb.c |  2 +-
 drivers/pci/endpoint/pci-epc-core.c           | 32 +
 drivers/pci/endpoint/pci-epf-core.c           | 15
 include/linux/pci-epc.h                       | 34 +++
 14 files changed, 108 insertions(+), 58 deletions(-)

diff --git a/drivers/pci/controller/dwc/pci-imx6.c b/drivers/pci/controller/dwc/pci-imx6.c
index dc2c036ab28c..47a9a96484ed 100644
--- a/drivers/pci/controller/dwc/pci-imx6.c
+++ b/drivers/pci/controller/dwc/pci-imx6.c
@@ -1081,7 +1081,8 @@ static const struct pci_epc_features imx8m_pcie_epc_features = {
 	.linkup_notifier = false,
 	.msi_capable = true,
 	.msix_capable = false,
-	.reserved_bar = 1 << BAR_1 | 1 << BAR_3,
+	.bar[BAR_1] = { .type = BAR_RESERVED, },
+	.bar[BAR_3] = { .type = BAR_RESERVED, },
 	.align = SZ_64K,
 };

diff --git a/drivers/pci/controller/dwc/pci-keystone.c b/drivers/pci/controller/dwc/pci-keystone.c
index c0c62533a3f1..b2b93b4fa82d 100644
--- a/drivers/pci/controller/dwc/pci-keystone.c
+++ b/drivers/pci/controller/dwc/pci-keystone.c
@@ -924,12 +924,12 @@ static const struct pci_epc_features ks_pcie_am654_epc_features = {
 	.linkup_notifier = false,
 	.msi_capable = true,
 	.msix_capable = true,
-	.reserved_bar = 1 << BAR_0 | 1 << BAR_1,
-	.bar_fixed_64bit = 1 << BAR_0,
-	.bar_fixed_size[2] = SZ_1M,
-	.bar_fixed_size[3] = SZ_64K,
-	.bar_fixed_size[4] = 256,
-	.bar_fixed_size[5] = SZ_1M,
+	.bar[BAR_0] = { .type = BAR_RESERVED, .only_64bit = true, },
+	.bar[BAR_1] = { .type = BAR_RESERVED, },
+	.bar[BAR_2] = { .type = BAR_FIXED, .fixed_size = SZ_1M, },
+	.bar[BAR_3] = { .type = BAR_FIXED, .fixed_size = SZ_64K, },
+	.bar[BAR_4] = { .type = BAR_FIXED, .fixed_size = 256, },
+	.bar[BAR_5] = { .type = BAR_FIXED, .fixed_size = SZ_1M, },
 	.align = SZ_1M,
 };

diff --git a/drivers/pci/controller/dwc/pci-layerscape-ep.c b/drivers/pci/controller/dwc/pci-layerscape-ep.c
index 2e398494e7c0..1f6ee1460ec2 100644
--- a/drivers/pci/controller/dwc/pci-layerscape-ep.c
+++ b/drivers/pci/controller/dwc/pci-layerscape-ep.c
@@ -250,7 +250,10 @@ static int __init ls_pcie_ep_probe(struct platform_device *pdev)
 	pci->dev = dev;
 	pci->ops = pcie->drvdata->dw_pcie_ops;

-	ls_epc->bar_fixed_64bit = (1 << BAR_2) | (1 << BAR_4);
+	ls_epc->bar[BAR_2].only_64bit = true;
+	ls_epc->bar[BAR_3].type = BAR_RESERVED;
+	ls_epc->bar[BAR_4].only_64bit = true;
+	ls_epc->bar[BAR_5].type = BAR_RESERVED;
 	ls_epc->linkup_notifier = true;

 	pcie->pci = pci;
diff --git a/drivers/pci/controller/dwc/pcie-keembay.c b/drivers/pci/controller/dwc/pcie-keembay.c
index 208d3b0ba196..5e8e54f597dd 100644
--- a/drivers/pci/controller/dwc/pcie-keembay.c
+++ b/drivers/pci/controller/dwc/pcie-keembay.c
@@ -312,8 +312,12 @@ static const struct pci_epc_features keembay_pcie_epc_features = {
 	.linkup_notifier	= false,
 	.msi_capable		= true,
 	.msix_capable		= true,
-	.reserved_bar		= BIT(BAR_1) | BIT(BAR_3) | BIT(BAR_5),
-	.bar_fixed_64bit	= BIT(BAR_0) | BIT(BAR_2) | BIT(BAR_4),
+	.bar[BAR_0]		= { .only_64bit = true, },
+	.bar[BAR_1]		= { .type = BAR_RESERVED, },
+	.bar[BAR_2]		= { .only_64bit = true, },
+	.bar[BAR_3]		= { .type = BAR_RESERVED, },
+	.bar[BAR_4]		= { .only_64bit = true, },
+	.bar[BAR_5]		= { .type = BAR_RESERVED, },
 	.align = SZ_16K,
 };
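The changes to include/linux/pci-epc.h are cut off in this excerpt of the patch, but the shape of the new types can be inferred from the driver conversions above. Here is a minimal sketch reconstructed from that usage rather than copied from the patch; the field order, the comments, and the assumption that the programmable type is the enum's zero value (so drivers can simply leave .type unset) are guesses:

	/* Sketch only: reconstructed from the driver conversions above. */
	enum pci_epc_bar_type {
		BAR_PROGRAMMABLE = 0,	/* EPF may program size and flags */
		BAR_FIXED,		/* BAR has a fixed size (fixed_size) */
		BAR_RESERVED,		/* BAR must not be used by an EPF */
	};

	struct pci_epc_bar_desc {
		u64 fixed_size;			/* only valid for BAR_FIXED */
		enum pci_epc_bar_type type;
		bool only_64bit;		/* BAR must be configured as 64-bit */
	};

	/* One descriptor per BAR replaces the old scattered bitmasks: */
	struct pci_epc_features {
		/* ... */
		struct pci_epc_bar_desc bar[PCI_STD_NUM_BARS];
		/* ... */
	};

With this layout, the left-hand side of each initializer is a BAR, so mutually exclusive properties can no longer be set on the same BAR by accident.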
[PATCH v2 0/2] PCI endpoint BAR hardware description cleanup
The series is based on top of:
https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git/log/?h=endpoint

Hello all,

This series cleans up the hardware description for PCI endpoint BARs.

The problems with the existing hardware description:
- The documentation is lackluster.
- Some of the names are confusingly similar, e.g. fixed_64bit and
  fixed_size, even though these are for completely unrelated things.
- The way that the BARs are defined in the endpoint controller drivers
  is messy, because the left-hand side is not a BAR, so you can mark a
  BAR as e.g. both fixed size and reserved.

This series tries to address all the problems above. Personally, I think
that the code is now more readable, both in the endpoint controller
drivers and in pci-epc-core.c.

(I will be sending out a patch series that adds BAR_RESIZABLE to enum
pci_epc_bar_type in the coming week.)

Kind regards,
Niklas

Changes since V1:
- Picked up tags from Kishon and Mani.
- Improved commit message for patch 1/2 as suggested by Mani.
- Improved kdoc in patch 2/2 as suggested by Mani.

Niklas Cassel (2):
  PCI: endpoint: Clean up hardware description for BARs
  PCI: endpoint: Drop only_64bit on reserved BARs

 drivers/pci/controller/dwc/pci-imx6.c         |  3 +-
 drivers/pci/controller/dwc/pci-keystone.c     | 12 +++---
 .../pci/controller/dwc/pci-layerscape-ep.c    |  5 ++-
 drivers/pci/controller/dwc/pcie-keembay.c     |  8 +++-
 drivers/pci/controller/dwc/pcie-rcar-gen4.c   |  4 +-
 drivers/pci/controller/dwc/pcie-tegra194.c    | 10 +++--
 drivers/pci/controller/dwc/pcie-uniphier-ep.c | 15 +--
 drivers/pci/controller/pcie-rcar-ep.c         | 14 ---
 drivers/pci/endpoint/functions/pci-epf-ntb.c  |  4 +-
 drivers/pci/endpoint/functions/pci-epf-test.c |  8 ++--
 drivers/pci/endpoint/functions/pci-epf-vntb.c |  2 +-
 drivers/pci/endpoint/pci-epc-core.c           | 25 +---
 drivers/pci/endpoint/pci-epf-core.c           | 15 +++
 include/linux/pci-epc.h                       | 39 ---
 14 files changed, 106 insertions(+), 58 deletions(-)

--
2.43.1
Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings
Hi Catalin, Thanks for the review! Comments below... On 16/02/2024 12:25, Catalin Marinas wrote: > On Thu, Feb 15, 2024 at 10:31:59AM +, Ryan Roberts wrote: >> arch/arm64/mm/contpte.c | 285 +++ > > Nitpick: I think most symbols in contpte.c can be EXPORT_SYMBOL_GPL(). > We don't expect them to be used by random out of tree modules. In fact, > do we expect them to end up in modules at all? Most seem to be called > from the core mm code. The problem is that the contpte_* symbols are called from the ptep_* inline functions. So where those inlines are called from modules, we need to make sure the contpte_* symbols are available. John Hubbard originally reported this problem against v1 and I enumerated all the drivers that call into the ptep_* inlines here: https://lore.kernel.org/linux-arm-kernel/b994ff89-1a1f-26ca-9479-b08c77f94...@arm.com/#t So they definitely need to be exported. Perhaps we can tighten it to EXPORT_SYMBOL_GPL(), but I was being cautious as I didn't want to break anything out-of-tree. I'm not sure what the normal policy is? arm64 seems to use ~equal amounts of both. > >> +#define ptep_get_lockless ptep_get_lockless >> +static inline pte_t ptep_get_lockless(pte_t *ptep) >> +{ >> +pte_t pte = __ptep_get(ptep); >> + >> +if (likely(!pte_valid_cont(pte))) >> +return pte; >> + >> +return contpte_ptep_get_lockless(ptep); >> +} > [...] >> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep) >> +{ >> +/* >> + * Gather access/dirty bits, which may be populated in any of the ptes >> + * of the contig range. We may not be holding the PTL, so any contiguous >> + * range may be unfolded/modified/refolded under our feet. Therefore we >> + * ensure we read a _consistent_ contpte range by checking that all ptes >> + * in the range are valid and have CONT_PTE set, that all pfns are >> + * contiguous and that all pgprots are the same (ignoring access/dirty). >> + * If we find a pte that is not consistent, then we must be racing with >> + * an update so start again. If the target pte does not have CONT_PTE >> + * set then that is considered consistent on its own because it is not >> + * part of a contpte range. >> +*/ > > I can't get my head around this lockless API. Maybe it works fine (and > may have been discussed already) but we should document what the races > are, why it works, what the memory ordering requirements are. For > example, the generic (well, x86 PAE) ptep_get_lockless() only needs to > ensure that the low/high 32 bits of a pte are consistent and there are > some ordering rules on how these are updated. > > Does the arm64 implementation only need to be correct w.r.t. the > access/dirty bits? Since we can read orig_ptep atomically, I assume the > only other updates from unfolding would set the dirty/access bits. > >> + >> +pgprot_t orig_prot; >> +unsigned long pfn; >> +pte_t orig_pte; >> +pgprot_t prot; >> +pte_t *ptep; >> +pte_t pte; >> +int i; >> + >> +retry: >> +orig_pte = __ptep_get(orig_ptep); >> + >> +if (!pte_valid_cont(orig_pte)) >> +return orig_pte; >> + >> +orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte))); >> +ptep = contpte_align_down(orig_ptep); >> +pfn = pte_pfn(orig_pte) - (orig_ptep - ptep); >> + >> +for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) { >> +pte = __ptep_get(ptep); >> +prot = pte_pgprot(pte_mkold(pte_mkclean(pte))); > > We don't have any ordering guarantees in how the ptes in this range are > read or written in the contpte_set_ptes() and the fold/unfold functions. 
> We might not need them given all the other checks below but it's worth > adding a comment. > >> + >> +if (!pte_valid_cont(pte) || >> + pte_pfn(pte) != pfn || >> + pgprot_val(prot) != pgprot_val(orig_prot)) >> +goto retry; > > I think this also needs some comment. I get the !pte_valid_cont() check > to attempt retrying when racing with unfolding. Are the other checks > needed to detect re-folding with different protection or pfn? > >> + >> +if (pte_dirty(pte)) >> +orig_pte = pte_mkdirty(orig_pte); >> + >> +if (pte_young(pte)) >> +orig_pte = pte_mkyoung(orig_pte); >> +} > > After writing the comments above, I think I figured out that the whole > point of this loop is to check that the ptes in the contig range are > still consistent and the only variation allowed is the dirty/young > state to be passed to the orig_pte returned. The original pte may have > been updated by the time this loop finishes but I don't think it > matters, it wouldn't be any different than reading a single pte and > returning it while it is being updated. Correct. The pte can be updated at any time, before after or during the reads. That was always the case. But now we have to cope with a whole
Re: [PATCH v6 18/18] arm64/mm: Automatically fold contpte mappings
On Thu, Feb 15, 2024 at 10:32:05AM +0000, Ryan Roberts wrote:
> There are situations where a change to a single PTE could cause the
> contpte block in which it resides to become foldable (i.e. could be
> repainted with the contiguous bit). Such situations arise, for example,
> when user space temporarily changes protections, via mprotect, for
> individual pages, as can be the case for certain garbage collectors.
>
> We would like to detect when such a PTE change occurs. However this can
> be expensive due to the amount of checking required. Therefore only
> perform the checks when an individual PTE is modified via mprotect
> (ptep_modify_prot_commit() -> set_pte_at() -> set_ptes(nr=1)) and only
> when we are setting the final PTE in a contpte-aligned block.
>
> Signed-off-by: Ryan Roberts

Acked-by: Catalin Marinas
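The trigger condition described above (a single-PTE set_ptes() call that lands on the last entry of a contpte-aligned block) can be pictured with a small sketch. This is illustrative only, not the actual patch; the function name is invented, and CONT_PTES is the number of PTEs covered by one contiguous-bit block:

	/* Illustrative sketch, not the arm64 patch itself. */
	static inline bool worth_trying_contpte_fold(unsigned long addr,
						     unsigned int nr)
	{
		unsigned long contmask = CONT_PTES * PAGE_SIZE - 1;

		if (nr != 1)		/* only single-PTE updates */
			return false;

		/* true only for the final page of its contpte block */
		return (addr | contmask) == addr + PAGE_SIZE - 1;
	}

Gating on the final PTE means a sweep over a block (e.g. mprotect on each page in turn) pays for the expensive fold checks exactly once.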
Re: [PATCH v6 17/18] arm64/mm: __always_inline to improve fork() perf
On Thu, Feb 15, 2024 at 10:32:04AM +0000, Ryan Roberts wrote:
> As set_ptes() and wrprotect_ptes() become a bit more complex, the
> compiler may choose not to inline them. But this is critical for fork()
> performance. So mark the functions, along with contpte_try_unfold()
> which is called by them, as __always_inline. This is worth ~1% on the
> fork() microbenchmark with order-0 folios (the common case).
>
> Acked-by: Mark Rutland
> Signed-off-by: Ryan Roberts

Acked-by: Catalin Marinas
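For readers unfamiliar with the attribute: plain `inline` is only a hint that the compiler is free to ignore once a function grows, while `__always_inline` (from <linux/compiler.h>) forces the decision. A generic illustration, not code from the patch:

	#include <linux/compiler.h>

	/* The compiler may decide this has grown too big to inline. */
	static inline void maybe_inlined(void) { /* ... */ }

	/*
	 * This is inlined into every caller regardless of size heuristics,
	 * trading code size for zero call overhead on hot paths like fork().
	 */
	static __always_inline void always_inlined(void) { /* ... */ }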
Re: [PATCH v6 16/18] arm64/mm: Implement pte_batch_hint()
On Thu, Feb 15, 2024 at 10:32:03AM +0000, Ryan Roberts wrote:
> When core code iterates over a range of ptes and calls ptep_get() for
> each of them, if the range happens to cover contpte mappings, the number
> of pte reads becomes amplified by a factor of the number of PTEs in a
> contpte block. This is because for each call to ptep_get(), the
> implementation must read all of the ptes in the contpte block to which
> it belongs to gather the access and dirty bits.
>
> This causes a hotspot for fork(), as well as operations that unmap
> memory such as munmap(), exit and madvise(MADV_DONTNEED). Fortunately we
> can fix this by implementing pte_batch_hint() which allows their
> iterators to skip getting the contpte tail ptes when gathering the batch
> of ptes to operate on. This results in the number of PTE reads returning
> to 1 per pte.
>
> Acked-by: Mark Rutland
> Reviewed-by: David Hildenbrand
> Tested-by: John Hubbard
> Signed-off-by: Ryan Roberts

Acked-by: Catalin Marinas
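A sketch of what such a hint can look like on arm64, reconstructed from the description above (the real patch may differ in detail): return 1 for a non-contiguous PTE, otherwise the number of entries remaining up to the end of the contpte block, so iterators can jump straight past the tail PTEs:

	/* Sketch: assumes 8-byte PTEs and CONT_PTES entries per block. */
	#define pte_batch_hint pte_batch_hint
	static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
	{
		if (!pte_valid_cont(pte))
			return 1;

		/* entries left in this block, from ptep's slot onwards */
		return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
	}

A caller that reads the head PTE of a block then advances by the returned count, instead of calling ptep_get() on every tail entry.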
Re: [PATCH v6 14/18] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs
On Thu, Feb 15, 2024 at 10:32:01AM +0000, Ryan Roberts wrote:
> Optimize the contpte implementation to fix some of the
> exit/munmap/dontneed performance regression introduced by the initial
> contpte commit. Subsequent patches will solve it entirely.
>
> During exit(), munmap() or madvise(MADV_DONTNEED), mappings must be
> cleared. Previously this was done 1 PTE at a time. But the core-mm
> supports batched clear via the new [get_and_]clear_full_ptes() APIs. So
> let's implement those APIs and for fully covered contpte mappings, we no
> longer need to unfold the contpte. This significantly reduces unfolding
> operations, reducing the number of tlbis that must be issued.
>
> Tested-by: John Hubbard
> Signed-off-by: Ryan Roberts

Acked-by: Catalin Marinas
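For context, the core-mm API being implemented has a generic one-PTE-at-a-time fallback along these lines (a sketch; see include/linux/pgtable.h for the authoritative version). An architecture override like the arm64 one replaces this loop with a batched clear, so a fully covered contpte block never needs to be unfolded first:

	/* Sketch of the generic fallback: clear nr PTEs individually. */
	static inline void clear_full_ptes(struct mm_struct *mm,
					   unsigned long addr, pte_t *ptep,
					   unsigned int nr, int full)
	{
		for (;;) {
			ptep_get_and_clear_full(mm, addr, ptep, full);
			if (--nr == 0)
				break;
			ptep++;
			addr += PAGE_SIZE;
		}
	}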
Re: [PATCH v6 13/18] arm64/mm: Implement new wrprotect_ptes() batch API
On Thu, Feb 15, 2024 at 10:32:00AM +0000, Ryan Roberts wrote:
> Optimize the contpte implementation to fix some of the fork performance
> regression introduced by the initial contpte commit. Subsequent patches
> will solve it entirely.
>
> During fork(), any private memory in the parent must be write-protected.
> Previously this was done 1 PTE at a time. But the core-mm supports
> batched wrprotect via the new wrprotect_ptes() API. So let's implement
> that API and for fully covered contpte mappings, we no longer need to
> unfold the contpte. This has 2 benefits:
>
> - reduced unfolding, reduces the number of tlbis that must be issued.
> - The memory remains contpte-mapped ("folded") in the parent, so it
>   continues to benefit from the more efficient use of the TLB after
>   the fork.
>
> The optimization to wrprotect a whole contpte block without unfolding is
> possible thanks to the tightening of the Arm ARM in respect to the
> definition and behaviour when 'Misprogramming the Contiguous bit'. See
> section D21194 at https://developer.arm.com/documentation/102105/ja-07/
>
> Tested-by: John Hubbard
> Signed-off-by: Ryan Roberts

Acked-by: Catalin Marinas
Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings
On Thu, Feb 15, 2024 at 10:31:59AM +, Ryan Roberts wrote: > arch/arm64/mm/contpte.c | 285 +++ Nitpick: I think most symbols in contpte.c can be EXPORT_SYMBOL_GPL(). We don't expect them to be used by random out of tree modules. In fact, do we expect them to end up in modules at all? Most seem to be called from the core mm code. > +#define ptep_get_lockless ptep_get_lockless > +static inline pte_t ptep_get_lockless(pte_t *ptep) > +{ > + pte_t pte = __ptep_get(ptep); > + > + if (likely(!pte_valid_cont(pte))) > + return pte; > + > + return contpte_ptep_get_lockless(ptep); > +} [...] > +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep) > +{ > + /* > + * Gather access/dirty bits, which may be populated in any of the ptes > + * of the contig range. We may not be holding the PTL, so any contiguous > + * range may be unfolded/modified/refolded under our feet. Therefore we > + * ensure we read a _consistent_ contpte range by checking that all ptes > + * in the range are valid and have CONT_PTE set, that all pfns are > + * contiguous and that all pgprots are the same (ignoring access/dirty). > + * If we find a pte that is not consistent, then we must be racing with > + * an update so start again. If the target pte does not have CONT_PTE > + * set then that is considered consistent on its own because it is not > + * part of a contpte range. > +*/ I can't get my head around this lockless API. Maybe it works fine (and may have been discussed already) but we should document what the races are, why it works, what the memory ordering requirements are. For example, the generic (well, x86 PAE) ptep_get_lockless() only needs to ensure that the low/high 32 bits of a pte are consistent and there are some ordering rules on how these are updated. Does the arm64 implementation only need to be correct w.r.t. the access/dirty bits? Since we can read orig_ptep atomically, I assume the only other updates from unfolding would set the dirty/access bits. > + > + pgprot_t orig_prot; > + unsigned long pfn; > + pte_t orig_pte; > + pgprot_t prot; > + pte_t *ptep; > + pte_t pte; > + int i; > + > +retry: > + orig_pte = __ptep_get(orig_ptep); > + > + if (!pte_valid_cont(orig_pte)) > + return orig_pte; > + > + orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte))); > + ptep = contpte_align_down(orig_ptep); > + pfn = pte_pfn(orig_pte) - (orig_ptep - ptep); > + > + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) { > + pte = __ptep_get(ptep); > + prot = pte_pgprot(pte_mkold(pte_mkclean(pte))); We don't have any ordering guarantees in how the ptes in this range are read or written in the contpte_set_ptes() and the fold/unfold functions. We might not need them given all the other checks below but it's worth adding a comment. > + > + if (!pte_valid_cont(pte) || > +pte_pfn(pte) != pfn || > +pgprot_val(prot) != pgprot_val(orig_prot)) > + goto retry; I think this also needs some comment. I get the !pte_valid_cont() check to attempt retrying when racing with unfolding. Are the other checks needed to detect re-folding with different protection or pfn? > + > + if (pte_dirty(pte)) > + orig_pte = pte_mkdirty(orig_pte); > + > + if (pte_young(pte)) > + orig_pte = pte_mkyoung(orig_pte); > + } After writing the comments above, I think I figured out that the whole point of this loop is to check that the ptes in the contig range are still consistent and the only variation allowed is the dirty/young state to be passed to the orig_pte returned. 
The original pte may have been updated by the time this loop finishes but I don't think it matters, it wouldn't be any different than reading a single pte and returning it while it is being updated. If you can make this easier to parse (in a few years time) with an additional patch adding some more comments, that would be great. For this patch: Reviewed-by: Catalin Marinas -- Catalin
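Distilled to its skeleton, the loop under discussion is a retry-based consistent read: sample the target PTE, validate that every entry in the block still agrees with that sample (everything except access/dirty must match), and accumulate the access/dirty bits into the returned value; any mismatch means a racing unfold/modify/refold, so the read restarts. A simplified restatement for discussion, where block_member_consistent() is a made-up stand-in for the pfn/pgprot/CONT_PTE checks quoted above:

	/* Simplified restatement of contpte_ptep_get_lockless(), not the
	 * actual code; block_member_consistent() is hypothetical. */
	pte_t lockless_read_sketch(pte_t *orig_ptep)
	{
		pte_t *first = contpte_align_down(orig_ptep);
		pte_t orig, pte;
		int i;

	retry:
		orig = __ptep_get(orig_ptep);
		if (!pte_valid_cont(orig))
			return orig;	/* not part of a contpte block */

		for (i = 0; i < CONT_PTES; i++) {
			pte = __ptep_get(first + i);
			if (!block_member_consistent(pte, orig, i))
				goto retry;	/* racing update, resample */
			if (pte_dirty(pte))
				orig = pte_mkdirty(orig);
			if (pte_young(pte))
				orig = pte_mkyoung(orig);
		}
		return orig;
	}

Note that bits accumulated during a failed pass are discarded, because orig is resampled at the retry label.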
Re: [PATCH 1/2] PCI: endpoint: Clean up hardware description for BARs
On Fri, Feb 16, 2024 at 04:49:08PM +0530, Manivannan Sadhasivam wrote: > On Sat, Feb 10, 2024 at 02:26:25AM +0100, Niklas Cassel wrote: > > The hardware description for BARs is scattered in many different variables > > in pci_epc_features. Some of these things are mutually exclusive, so it > > can create confusion over which variable that has precedence over another. > > > > Improve the situation by creating a struct pci_epc_bar_desc, and a new > > enum pci_epc_bar_type, and convert the endpoint controller drivers to use > > this more well defined format. > > > > Signed-off-by: Niklas Cassel > > --- > > drivers/pci/controller/dwc/pci-imx6.c | 3 +- > > drivers/pci/controller/dwc/pci-keystone.c | 12 +++ > > .../pci/controller/dwc/pci-layerscape-ep.c| 5 ++- > > drivers/pci/controller/dwc/pcie-keembay.c | 8 +++-- > > drivers/pci/controller/dwc/pcie-rcar-gen4.c | 4 ++- > > drivers/pci/controller/dwc/pcie-tegra194.c| 10 -- > > drivers/pci/controller/dwc/pcie-uniphier-ep.c | 15 ++-- > > drivers/pci/controller/pcie-rcar-ep.c | 14 +--- > > drivers/pci/endpoint/functions/pci-epf-ntb.c | 4 +-- > > drivers/pci/endpoint/functions/pci-epf-test.c | 8 ++--- > > drivers/pci/endpoint/functions/pci-epf-vntb.c | 2 +- > > drivers/pci/endpoint/pci-epc-core.c | 32 + > > drivers/pci/endpoint/pci-epf-core.c | 15 > > include/linux/pci-epc.h | 34 +++ > > 14 files changed, 108 insertions(+), 58 deletions(-) > > > > diff --git a/drivers/pci/controller/dwc/pci-imx6.c > > b/drivers/pci/controller/dwc/pci-imx6.c > > index dc2c036ab28c..47a9a96484ed 100644 > > --- a/drivers/pci/controller/dwc/pci-imx6.c > > +++ b/drivers/pci/controller/dwc/pci-imx6.c > > @@ -1081,7 +1081,8 @@ static const struct pci_epc_features > > imx8m_pcie_epc_features = { > > .linkup_notifier = false, > > .msi_capable = true, > > .msix_capable = false, > > - .reserved_bar = 1 << BAR_1 | 1 << BAR_3, > > + .bar[BAR_1] = { .type = BAR_RESERVED, }, > > + .bar[BAR_3] = { .type = BAR_RESERVED, }, > > .align = SZ_64K, > > }; > > > > diff --git a/drivers/pci/controller/dwc/pci-keystone.c > > b/drivers/pci/controller/dwc/pci-keystone.c > > index c0c62533a3f1..b2b93b4fa82d 100644 > > --- a/drivers/pci/controller/dwc/pci-keystone.c > > +++ b/drivers/pci/controller/dwc/pci-keystone.c > > @@ -924,12 +924,12 @@ static const struct pci_epc_features > > ks_pcie_am654_epc_features = { > > .linkup_notifier = false, > > .msi_capable = true, > > .msix_capable = true, > > - .reserved_bar = 1 << BAR_0 | 1 << BAR_1, > > - .bar_fixed_64bit = 1 << BAR_0, > > - .bar_fixed_size[2] = SZ_1M, > > - .bar_fixed_size[3] = SZ_64K, > > - .bar_fixed_size[4] = 256, > > - .bar_fixed_size[5] = SZ_1M, > > + .bar[BAR_0] = { .type = BAR_RESERVED, .only_64bit = true, }, > > + .bar[BAR_1] = { .type = BAR_RESERVED, }, > > + .bar[BAR_2] = { .type = BAR_FIXED, .fixed_size = SZ_1M, }, > > + .bar[BAR_3] = { .type = BAR_FIXED, .fixed_size = SZ_64K, }, > > + .bar[BAR_4] = { .type = BAR_FIXED, .fixed_size = 256, }, > > + .bar[BAR_5] = { .type = BAR_FIXED, .fixed_size = SZ_1M, }, > > .align = SZ_1M, > > }; > > > > diff --git a/drivers/pci/controller/dwc/pci-layerscape-ep.c > > b/drivers/pci/controller/dwc/pci-layerscape-ep.c > > index 2e398494e7c0..1f6ee1460ec2 100644 > > --- a/drivers/pci/controller/dwc/pci-layerscape-ep.c > > +++ b/drivers/pci/controller/dwc/pci-layerscape-ep.c > > @@ -250,7 +250,10 @@ static int __init ls_pcie_ep_probe(struct > > platform_device *pdev) > > pci->dev = dev; > > pci->ops = pcie->drvdata->dw_pcie_ops; > > > > - ls_epc->bar_fixed_64bit = (1 << BAR_2) | (1 << BAR_4); > > 
+ ls_epc->bar[BAR_2].only_64bit = true; > > + ls_epc->bar[BAR_3].type = BAR_RESERVED; > > BAR_3 and BAR_4 were not reserved previously. > Okay, looking at patch 2 makes it clear why you have marked it as such. But it should've been mentioned in the commit message. - Mani -- மணிவண்ணன் சதாசிவம்
Re: [PATCH 1/2] PCI: endpoint: Clean up hardware description for BARs
On Sat, Feb 10, 2024 at 02:26:25AM +0100, Niklas Cassel wrote: > The hardware description for BARs is scattered in many different variables > in pci_epc_features. Some of these things are mutually exclusive, so it > can create confusion over which variable that has precedence over another. > > Improve the situation by creating a struct pci_epc_bar_desc, and a new > enum pci_epc_bar_type, and convert the endpoint controller drivers to use > this more well defined format. > > Signed-off-by: Niklas Cassel > --- > drivers/pci/controller/dwc/pci-imx6.c | 3 +- > drivers/pci/controller/dwc/pci-keystone.c | 12 +++ > .../pci/controller/dwc/pci-layerscape-ep.c| 5 ++- > drivers/pci/controller/dwc/pcie-keembay.c | 8 +++-- > drivers/pci/controller/dwc/pcie-rcar-gen4.c | 4 ++- > drivers/pci/controller/dwc/pcie-tegra194.c| 10 -- > drivers/pci/controller/dwc/pcie-uniphier-ep.c | 15 ++-- > drivers/pci/controller/pcie-rcar-ep.c | 14 +--- > drivers/pci/endpoint/functions/pci-epf-ntb.c | 4 +-- > drivers/pci/endpoint/functions/pci-epf-test.c | 8 ++--- > drivers/pci/endpoint/functions/pci-epf-vntb.c | 2 +- > drivers/pci/endpoint/pci-epc-core.c | 32 + > drivers/pci/endpoint/pci-epf-core.c | 15 > include/linux/pci-epc.h | 34 +++ > 14 files changed, 108 insertions(+), 58 deletions(-) > > diff --git a/drivers/pci/controller/dwc/pci-imx6.c > b/drivers/pci/controller/dwc/pci-imx6.c > index dc2c036ab28c..47a9a96484ed 100644 > --- a/drivers/pci/controller/dwc/pci-imx6.c > +++ b/drivers/pci/controller/dwc/pci-imx6.c > @@ -1081,7 +1081,8 @@ static const struct pci_epc_features > imx8m_pcie_epc_features = { > .linkup_notifier = false, > .msi_capable = true, > .msix_capable = false, > - .reserved_bar = 1 << BAR_1 | 1 << BAR_3, > + .bar[BAR_1] = { .type = BAR_RESERVED, }, > + .bar[BAR_3] = { .type = BAR_RESERVED, }, > .align = SZ_64K, > }; > > diff --git a/drivers/pci/controller/dwc/pci-keystone.c > b/drivers/pci/controller/dwc/pci-keystone.c > index c0c62533a3f1..b2b93b4fa82d 100644 > --- a/drivers/pci/controller/dwc/pci-keystone.c > +++ b/drivers/pci/controller/dwc/pci-keystone.c > @@ -924,12 +924,12 @@ static const struct pci_epc_features > ks_pcie_am654_epc_features = { > .linkup_notifier = false, > .msi_capable = true, > .msix_capable = true, > - .reserved_bar = 1 << BAR_0 | 1 << BAR_1, > - .bar_fixed_64bit = 1 << BAR_0, > - .bar_fixed_size[2] = SZ_1M, > - .bar_fixed_size[3] = SZ_64K, > - .bar_fixed_size[4] = 256, > - .bar_fixed_size[5] = SZ_1M, > + .bar[BAR_0] = { .type = BAR_RESERVED, .only_64bit = true, }, > + .bar[BAR_1] = { .type = BAR_RESERVED, }, > + .bar[BAR_2] = { .type = BAR_FIXED, .fixed_size = SZ_1M, }, > + .bar[BAR_3] = { .type = BAR_FIXED, .fixed_size = SZ_64K, }, > + .bar[BAR_4] = { .type = BAR_FIXED, .fixed_size = 256, }, > + .bar[BAR_5] = { .type = BAR_FIXED, .fixed_size = SZ_1M, }, > .align = SZ_1M, > }; > > diff --git a/drivers/pci/controller/dwc/pci-layerscape-ep.c > b/drivers/pci/controller/dwc/pci-layerscape-ep.c > index 2e398494e7c0..1f6ee1460ec2 100644 > --- a/drivers/pci/controller/dwc/pci-layerscape-ep.c > +++ b/drivers/pci/controller/dwc/pci-layerscape-ep.c > @@ -250,7 +250,10 @@ static int __init ls_pcie_ep_probe(struct > platform_device *pdev) > pci->dev = dev; > pci->ops = pcie->drvdata->dw_pcie_ops; > > - ls_epc->bar_fixed_64bit = (1 << BAR_2) | (1 << BAR_4); > + ls_epc->bar[BAR_2].only_64bit = true; > + ls_epc->bar[BAR_3].type = BAR_RESERVED; BAR_3 and BAR_4 were not reserved previously. 
> + ls_epc->bar[BAR_4].only_64bit = true; > + ls_epc->bar[BAR_5].type = BAR_RESERVED; > ls_epc->linkup_notifier = true; > > pcie->pci = pci; > diff --git a/drivers/pci/controller/dwc/pcie-keembay.c > b/drivers/pci/controller/dwc/pcie-keembay.c > index 208d3b0ba196..5e8e54f597dd 100644 > --- a/drivers/pci/controller/dwc/pcie-keembay.c > +++ b/drivers/pci/controller/dwc/pcie-keembay.c > @@ -312,8 +312,12 @@ static const struct pci_epc_features > keembay_pcie_epc_features = { > .linkup_notifier= false, > .msi_capable= true, > .msix_capable = true, > - .reserved_bar = BIT(BAR_1) | BIT(BAR_3) | BIT(BAR_5), > - .bar_fixed_64bit= BIT(BAR_0) | BIT(BAR_2) | BIT(BAR_4), > + .bar[BAR_0] = { .only_64bit = true, }, > + .bar[BAR_1] = { .type = BAR_RESERVED, }, > + .bar[BAR_2] = { .only_64bit = true, }, > + .bar[BAR_3] = { .type = BAR_RESERVED, }, > + .bar[BAR_4] = { .only_64bit = true, }, > + .bar[BAR_5] = { .type = BAR_RESERVED, }, > .align = SZ_16K, > }; > > diff --git a/drivers/pci/controller/dwc/pcie-rcar-gen4.c
Re: [kvm-unit-tests PATCH v4 8/8] migration: add a migration selftest
On 09/02/2024 10.11, Nicholas Piggin wrote:
> Add a selftest for migration support in guest library and test harness
> code. It performs migrations in a tight loop to irritate races and bugs
> in the test harness code.
>
> Include the test in arm, s390, powerpc.
>
> Acked-by: Claudio Imbrenda (s390x)
> Reviewed-by: Thomas Huth
> Signed-off-by: Nicholas Piggin
> ---
>  arm/Makefile.common      | 1 +
>  arm/selftest-migration.c | 1 +
>  arm/unittests.cfg        | 6 ++

Hi Nicholas,

I just gave the patches a try, but the arm test seems to fail for me: Only the first getchar() seems to wait for a character, all the subsequent ones don't wait anymore and just continue immediately ... is this working for you? Or do I need another patch on top?

Thanks,
Thomas
[PATCH 2/2] powerpc: Don't ignore errors from set_memory_{n}p() in __kernel_map_pages()
set_memory_p() and set_memory_np() can fail. As mentioned in linux/mm.h:

	/*
	 * To support DEBUG_PAGEALLOC architecture must ensure that
	 * __kernel_map_pages() never fails
	 */

So panic in case set_memory_p() or set_memory_np() fail in
__kernel_map_pages().

Link: https://github.com/KSPP/linux/issues/7
Signed-off-by: Christophe Leroy
---
 arch/powerpc/include/asm/book3s/64/hash.h |  2 +-
 arch/powerpc/mm/book3s64/hash_utils.c     |  3 ++-
 arch/powerpc/mm/pageattr.c                | 10 +++---
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
index 6e70ae511631..8f47ae79f2a6 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -269,7 +269,7 @@ int hash__create_section_mapping(unsigned long start, unsigned long end,
 				 int nid, pgprot_t prot);
 int hash__remove_section_mapping(unsigned long start, unsigned long end);

-void hash__kernel_map_pages(struct page *page, int numpages, int enable);
+int hash__kernel_map_pages(struct page *page, int numpages, int enable);

 #endif /* !__ASSEMBLY__ */
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index 0626a25b0d72..01c3b4b65241 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -2172,7 +2172,7 @@ static void kernel_unmap_linear_page(unsigned long vaddr, unsigned long lmi)
 			     mmu_kernel_ssize, 0);
 }

-void hash__kernel_map_pages(struct page *page, int numpages, int enable)
+int hash__kernel_map_pages(struct page *page, int numpages, int enable)
 {
 	unsigned long flags, vaddr, lmi;
 	int i;
@@ -2189,6 +2189,7 @@ void hash__kernel_map_pages(struct page *page, int numpages, int enable)
 			kernel_unmap_linear_page(vaddr, lmi);
 	}
 	local_irq_restore(flags);
+	return 0;
 }
 #endif /* CONFIG_DEBUG_PAGEALLOC || CONFIG_KFENCE */
diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
index 16b8d20d6ca8..62b678585878 100644
--- a/arch/powerpc/mm/pageattr.c
+++ b/arch/powerpc/mm/pageattr.c
@@ -106,17 +106,21 @@ int change_memory_attr(unsigned long addr, int numpages, long action)
 #ifdef CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
+	int err;
 	unsigned long addr = (unsigned long)page_address(page);

 	if (PageHighMem(page))
 		return;

 	if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && !radix_enabled())
-		hash__kernel_map_pages(page, numpages, enable);
+		err = hash__kernel_map_pages(page, numpages, enable);
 	else if (enable)
-		set_memory_p(addr, numpages);
+		err = set_memory_p(addr, numpages);
 	else
-		set_memory_np(addr, numpages);
+		err = set_memory_np(addr, numpages);
+
+	if (err)
+		panic("%s: set_memory_%sp() failed\n", __func__, enable ? "" : "n");
 }
 #endif
 #endif
--
2.43.0
[PATCH 1/2] powerpc: Refactor __kernel_map_pages()
__kernel_map_pages() is almost identical for PPC32 and RADIX. Refactor it. On PPC32 it is not needed for KFENCE, but to keep it simple just make it similar to PPC64. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/64/pgtable.h | 10 -- arch/powerpc/include/asm/book3s/64/radix.h | 2 -- arch/powerpc/mm/book3s64/radix_pgtable.c | 14 -- arch/powerpc/mm/pageattr.c | 19 +++ arch/powerpc/mm/pgtable_32.c | 15 --- 5 files changed, 19 insertions(+), 41 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 927d585652bc..62c43d3d80ec 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -1027,16 +1027,6 @@ static inline void vmemmap_remove_mapping(unsigned long start, } #endif -#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KFENCE) -static inline void __kernel_map_pages(struct page *page, int numpages, int enable) -{ - if (radix_enabled()) - radix__kernel_map_pages(page, numpages, enable); - else - hash__kernel_map_pages(page, numpages, enable); -} -#endif - static inline pte_t pmd_pte(pmd_t pmd) { return __pte_raw(pmd_raw(pmd)); diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h index 357e23a403d3..8f55ff74bb68 100644 --- a/arch/powerpc/include/asm/book3s/64/radix.h +++ b/arch/powerpc/include/asm/book3s/64/radix.h @@ -362,8 +362,6 @@ int radix__create_section_mapping(unsigned long start, unsigned long end, int radix__remove_section_mapping(unsigned long start, unsigned long end); #endif /* CONFIG_MEMORY_HOTPLUG */ -void radix__kernel_map_pages(struct page *page, int numpages, int enable); - #ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP #define vmemmap_can_optimize vmemmap_can_optimize bool vmemmap_can_optimize(struct vmem_altmap *altmap, struct dev_pagemap *pgmap); diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index c6a4ac766b2b..e16e2fd104c5 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -1339,20 +1339,6 @@ void __ref radix__vmemmap_free(unsigned long start, unsigned long end, #endif #endif -#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KFENCE) -void radix__kernel_map_pages(struct page *page, int numpages, int enable) -{ - unsigned long addr; - - addr = (unsigned long)page_address(page); - - if (enable) - set_memory_p(addr, numpages); - else - set_memory_np(addr, numpages); -} -#endif - #ifdef CONFIG_TRANSPARENT_HUGEPAGE unsigned long radix__pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c index 421db7c4f2a4..16b8d20d6ca8 100644 --- a/arch/powerpc/mm/pageattr.c +++ b/arch/powerpc/mm/pageattr.c @@ -101,3 +101,22 @@ int change_memory_attr(unsigned long addr, int numpages, long action) return apply_to_existing_page_range(&init_mm, start, size, change_page_attr, (void *)action); } + +#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KFENCE) +#ifdef CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC +void __kernel_map_pages(struct page *page, int numpages, int enable) +{ + unsigned long addr = (unsigned long)page_address(page); + + if (PageHighMem(page)) + return; + + if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && !radix_enabled()) + hash__kernel_map_pages(page, numpages, enable); + else if (enable) + set_memory_p(addr, numpages); + else + set_memory_np(addr, numpages); +} +#endif +#endif diff --git 
a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c index e94919853ca3..4be97b4a44f9 100644 --- a/arch/powerpc/mm/pgtable_32.c +++ b/arch/powerpc/mm/pgtable_32.c @@ -188,18 +188,3 @@ void mark_rodata_ro(void) ptdump_check_wx(); } #endif - -#if defined(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) && defined(CONFIG_DEBUG_PAGEALLOC) -void __kernel_map_pages(struct page *page, int numpages, int enable) -{ - unsigned long addr = (unsigned long)page_address(page); - - if (PageHighMem(page)) - return; - - if (enable) - set_memory_p(addr, numpages); - else - set_memory_np(addr, numpages); -} -#endif /* CONFIG_DEBUG_PAGEALLOC */ -- 2.43.0
[PATCH] powerpc: Handle error in mark_rodata_ro() and mark_initmem_nx()
mark_rodata_ro() and mark_initmem_nx() use functions that can fail like set_memory_nx() and set_memory_ro(), leading to a not protected kernel. In case of failure, panic. Link: https://github.com/KSPP/linux/issues/7 Signed-off-by: Christophe Leroy --- arch/powerpc/mm/book3s32/mmu.c | 7 -- arch/powerpc/mm/mmu_decl.h | 8 +++ arch/powerpc/mm/nohash/8xx.c | 33 +--- arch/powerpc/mm/nohash/e500.c | 10 ++--- arch/powerpc/mm/pgtable_32.c | 39 -- 5 files changed, 65 insertions(+), 32 deletions(-) diff --git a/arch/powerpc/mm/book3s32/mmu.c b/arch/powerpc/mm/book3s32/mmu.c index 5445587bfe84..100f999871bc 100644 --- a/arch/powerpc/mm/book3s32/mmu.c +++ b/arch/powerpc/mm/book3s32/mmu.c @@ -193,7 +193,7 @@ static bool is_module_segment(unsigned long addr) return true; } -void mmu_mark_initmem_nx(void) +int mmu_mark_initmem_nx(void) { int nb = mmu_has_feature(MMU_FTR_USE_HIGH_BATS) ? 8 : 4; int i; @@ -230,9 +230,10 @@ void mmu_mark_initmem_nx(void) mtsr(mfsr(i << 28) | 0x1000, i << 28); } + return 0; } -void mmu_mark_rodata_ro(void) +int mmu_mark_rodata_ro(void) { int nb = mmu_has_feature(MMU_FTR_USE_HIGH_BATS) ? 8 : 4; int i; @@ -245,6 +246,8 @@ void mmu_mark_rodata_ro(void) } update_bats(); + + return 0; } /* diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h index 72341b9fb552..6107e4944509 100644 --- a/arch/powerpc/mm/mmu_decl.h +++ b/arch/powerpc/mm/mmu_decl.h @@ -160,11 +160,11 @@ static inline unsigned long p_block_mapped(phys_addr_t pa) { return 0; } #endif #if defined(CONFIG_PPC_BOOK3S_32) || defined(CONFIG_PPC_8xx) || defined(CONFIG_PPC_E500) -void mmu_mark_initmem_nx(void); -void mmu_mark_rodata_ro(void); +int mmu_mark_initmem_nx(void); +int mmu_mark_rodata_ro(void); #else -static inline void mmu_mark_initmem_nx(void) { } -static inline void mmu_mark_rodata_ro(void) { } +static inline int mmu_mark_initmem_nx(void) { return 0; } +static inline int mmu_mark_rodata_ro(void) { return 0; } #endif #ifdef CONFIG_PPC_8xx diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c index 6be6421086ed..43d4842bb1c7 100644 --- a/arch/powerpc/mm/nohash/8xx.c +++ b/arch/powerpc/mm/nohash/8xx.c @@ -119,23 +119,26 @@ void __init mmu_mapin_immr(void) PAGE_KERNEL_NCG, MMU_PAGE_512K, true); } -static void mmu_mapin_ram_chunk(unsigned long offset, unsigned long top, - pgprot_t prot, bool new) +static int mmu_mapin_ram_chunk(unsigned long offset, unsigned long top, + pgprot_t prot, bool new) { unsigned long v = PAGE_OFFSET + offset; unsigned long p = offset; + int err = 0; WARN_ON(!IS_ALIGNED(offset, SZ_512K) || !IS_ALIGNED(top, SZ_512K)); - for (; p < ALIGN(p, SZ_8M) && p < top; p += SZ_512K, v += SZ_512K) - __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, new); - for (; p < ALIGN_DOWN(top, SZ_8M) && p < top; p += SZ_8M, v += SZ_8M) - __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_8M, new); - for (; p < ALIGN_DOWN(top, SZ_512K) && p < top; p += SZ_512K, v += SZ_512K) - __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, new); + for (; p < ALIGN(p, SZ_8M) && p < top && !err; p += SZ_512K, v += SZ_512K) + err = __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, new); + for (; p < ALIGN_DOWN(top, SZ_8M) && p < top && !err; p += SZ_8M, v += SZ_8M) + err = __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_8M, new); + for (; p < ALIGN_DOWN(top, SZ_512K) && p < top && !err; p += SZ_512K, v += SZ_512K) + err = __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, new); if (!new) flush_tlb_kernel_range(PAGE_OFFSET + v, PAGE_OFFSET + top); + + return err; } unsigned long 
__init mmu_mapin_ram(unsigned long base, unsigned long top) @@ -166,27 +169,33 @@ unsigned long __init mmu_mapin_ram(unsigned long base, unsigned long top) return top; } -void mmu_mark_initmem_nx(void) +int mmu_mark_initmem_nx(void) { unsigned long etext8 = ALIGN(__pa(_etext), SZ_8M); unsigned long sinittext = __pa(_sinittext); unsigned long boundary = strict_kernel_rwx_enabled() ? sinittext : etext8; unsigned long einittext8 = ALIGN(__pa(_einittext), SZ_8M); + int err = 0; if (!debug_pagealloc_enabled_or_kfence()) - mmu_mapin_ram_chunk(boundary, einittext8, PAGE_KERNEL, false); + err = mmu_mapin_ram_chunk(boundary, einittext8, PAGE_KERNEL, false); mmu_pin_tlb(block_mapped_ram, false); + + return err; } #ifdef CONFIG_STRICT_KERNEL_RWX -void mmu_mark_rodata_ro(void) +int mmu_mark_rodata_ro(void) { unsigned long
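The pgtable_32.c hunks are truncated above, but the commit message states the intent: callers panic when the protection change fails. A hedged sketch of what such a caller can look like, using the mmu_decl.h signatures shown earlier; the actual function body in the patch likely differs (for instance, it also has to handle the non-block-mapped case via set_memory_ro()):

	/* Sketch (assumption), not the truncated hunk itself. */
	void mark_rodata_ro(void)
	{
		int err = mmu_mark_rodata_ro();

		if (err)
			panic("%s: failed to mark rodata read-only\n", __func__);
	}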
[PATCH] powerpc/kprobes: Handle error returned by set_memory_rox()
set_memory_rox() can fail. In case it fails, free allocated memory and return NULL.

Link: https://github.com/KSPP/linux/issues/7
Signed-off-by: Christophe Leroy
---
 arch/powerpc/kernel/kprobes.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index b20ee72e873a..bbca90a5e2ec 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -134,10 +134,16 @@ void *alloc_insn_page(void)
 	if (!page)
 		return NULL;

-	if (strict_module_rwx_enabled())
-		set_memory_rox((unsigned long)page, 1);
+	if (strict_module_rwx_enabled()) {
+		int err = set_memory_rox((unsigned long)page, 1);
+		if (err)
+			goto error;
+	}

 	return page;
+error:
+	module_memfree(page);
+	return NULL;
 }

 int arch_prepare_kprobe(struct kprobe *p)
--
2.43.0
[PATCH] powerpc: Implement set_memory_rox()
Same as x86 and s390, add set_memory_rox() to avoid doing one pass with set_memory_ro() and a second pass with set_memory_x(). See commit 60463628c9e0 ("x86/mm: Implement native set_memory_rox()") and commit 22e99fa56443 ("s390/mm: implement set_memory_rox()") for more information. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/set_memory.h | 7 +++ arch/powerpc/mm/pageattr.c| 4 2 files changed, 11 insertions(+) diff --git a/arch/powerpc/include/asm/set_memory.h b/arch/powerpc/include/asm/set_memory.h index 7ebc807aa8cc..9a025b776a4b 100644 --- a/arch/powerpc/include/asm/set_memory.h +++ b/arch/powerpc/include/asm/set_memory.h @@ -8,6 +8,7 @@ #define SET_MEMORY_X 3 #define SET_MEMORY_NP 4 /* Set memory non present */ #define SET_MEMORY_P 5 /* Set memory present */ +#define SET_MEMORY_ROX 6 int change_memory_attr(unsigned long addr, int numpages, long action); @@ -41,4 +42,10 @@ static inline int set_memory_p(unsigned long addr, int numpages) return change_memory_attr(addr, numpages, SET_MEMORY_P); } +static inline int set_memory_rox(unsigned long addr, int numpages) +{ + return change_memory_attr(addr, numpages, SET_MEMORY_ROX); +} +#define set_memory_rox set_memory_rox + #endif diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c index 6163e484bc6d..421db7c4f2a4 100644 --- a/arch/powerpc/mm/pageattr.c +++ b/arch/powerpc/mm/pageattr.c @@ -38,6 +38,10 @@ static int change_page_attr(pte_t *ptep, unsigned long addr, void *data) /* Don't clear DIRTY bit */ pte_update_delta(ptep, addr, _PAGE_KERNEL_RW & ~_PAGE_DIRTY, _PAGE_KERNEL_RO); break; + case SET_MEMORY_ROX: + /* Don't clear DIRTY bit */ + pte_update_delta(ptep, addr, _PAGE_KERNEL_RW & ~_PAGE_DIRTY, _PAGE_KERNEL_ROX); + break; case SET_MEMORY_RW: pte_update_delta(ptep, addr, _PAGE_KERNEL_RO, _PAGE_KERNEL_RW); break; -- 2.43.0
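The benefit shows up at call sites: one walk over the page tables instead of two. A hypothetical caller, before and after (error handling trimmed to the minimum):

	/* Before: two passes over the same range. */
	err = set_memory_ro(addr, numpages);
	if (!err)
		err = set_memory_x(addr, numpages);

	/* After: one pass applying both attributes at once. */
	err = set_memory_rox(addr, numpages);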
[PATCH] powerpc: Use user_mode() macro when possible
There is a nice macro to check user mode. Use it instead of open coding anding with MSR_PR to increase readability and avoid having to comment what that anding is for. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/interrupt.h | 2 +- arch/powerpc/kernel/syscall.c| 2 +- arch/powerpc/kernel/traps.c | 4 ++-- arch/powerpc/lib/sstep.c | 23 +++ arch/powerpc/perf/core-book3s.c | 2 +- arch/powerpc/xmon/xmon.c | 4 ++-- 6 files changed, 18 insertions(+), 19 deletions(-) diff --git a/arch/powerpc/include/asm/interrupt.h b/arch/powerpc/include/asm/interrupt.h index a4196ab1d016..7b610864b364 100644 --- a/arch/powerpc/include/asm/interrupt.h +++ b/arch/powerpc/include/asm/interrupt.h @@ -97,7 +97,7 @@ DECLARE_STATIC_KEY_FALSE(interrupt_exit_not_reentrant); static inline bool is_implicit_soft_masked(struct pt_regs *regs) { - if (regs->msr & MSR_PR) + if (user_mode(regs)) return false; if (regs->nip >= (unsigned long)__end_soft_masked) diff --git a/arch/powerpc/kernel/syscall.c b/arch/powerpc/kernel/syscall.c index 77fedb190c93..f6f868e817e6 100644 --- a/arch/powerpc/kernel/syscall.c +++ b/arch/powerpc/kernel/syscall.c @@ -31,7 +31,7 @@ notrace long system_call_exception(struct pt_regs *regs, unsigned long r0) user_exit_irqoff(); BUG_ON(regs_is_unrecoverable(regs)); - BUG_ON(!(regs->msr & MSR_PR)); + BUG_ON(!user_mode(regs)); BUG_ON(arch_irq_disabled_regs(regs)); #ifdef CONFIG_PPC_PKEY diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index 11e062b47d3f..f23430adb68a 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -404,7 +404,7 @@ noinstr void hv_nmi_check_nonrecoverable(struct pt_regs *regs) return; if (!(regs->msr & MSR_HV)) return; - if (regs->msr & MSR_PR) + if (user_mode(regs)) return; /* @@ -1510,7 +1510,7 @@ static void do_program_check(struct pt_regs *regs) if (!is_kernel_addr(bugaddr) && !(regs->msr & MSR_IR)) bugaddr += PAGE_OFFSET; - if (!(regs->msr & MSR_PR) && /* not user-mode */ + if (!user_mode(regs) && report_bug(bugaddr, regs) == BUG_TRAP_TYPE_WARN) { regs_add_return_ip(regs, 4); return; diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c index 5766180f5380..e65f3fb68d06 100644 --- a/arch/powerpc/lib/sstep.c +++ b/arch/powerpc/lib/sstep.c @@ -1429,7 +1429,7 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs, return 1; case 18:/* rfid, scary */ - if (regs->msr & MSR_PR) + if (user_mode(regs)) goto priv; op->type = RFI; return 0; @@ -1742,13 +1742,13 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs, return 1; #endif case 83:/* mfmsr */ - if (regs->msr & MSR_PR) + if (user_mode(regs)) goto priv; op->type = MFMSR; op->reg = rd; return 0; case 146: /* mtmsr */ - if (regs->msr & MSR_PR) + if (user_mode(regs)) goto priv; op->type = MTMSR; op->reg = rd; @@ -1756,7 +1756,7 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs, return 0; #ifdef CONFIG_PPC64 case 178: /* mtmsrd */ - if (regs->msr & MSR_PR) + if (user_mode(regs)) goto priv; op->type = MTMSR; op->reg = rd; @@ -3437,14 +3437,14 @@ int emulate_loadstore(struct pt_regs *regs, struct instruction_op *op) * stored in the thread_struct. If the instruction is in * the kernel, we must not touch the state in the thread_struct. 
*/ - if (!(regs->msr & MSR_PR) && !(regs->msr & MSR_FP)) + if (!user_mode(regs) && !(regs->msr & MSR_FP)) return 0; err = do_fp_load(op, ea, regs, cross_endian); break; #endif #ifdef CONFIG_ALTIVEC case LOAD_VMX: - if (!(regs->msr & MSR_PR) && !(regs->msr & MSR_VEC)) + if (!user_mode(regs) && !(regs->msr & MSR_VEC)) return 0; err = do_vec_load(op->reg, ea, size, regs, cross_endian); break; @@ -3459,7 +3459,7 @
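For reference, powerpc's user_mode() is itself just the MSR_PR test that the patch removes from the open-coded sites, so the conversion is purely a readability change. Paraphrased from arch/powerpc/include/asm/ptrace.h (check the tree for the exact form):

	/* MSR_PR is set when the interrupted context ran in user mode. */
	#define user_mode(regs) (((regs)->msr & MSR_PR) != 0)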
[PATCH] powerpc/trace: Restrict hash_fault trace event to HASH MMU
'perf list' on powerpc 8xx shows an event named "1:hash_fault".

This event is pointless because trace_hash_fault() is called only from
mm/book3s64/hash_utils.c.

Only define it when CONFIG_PPC_64S_HASH_MMU is selected.

Signed-off-by: Christophe Leroy
---
 arch/powerpc/include/asm/trace.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/trace.h b/arch/powerpc/include/asm/trace.h
index 82cc2c6704e6..d9ac3a4f46e1 100644
--- a/arch/powerpc/include/asm/trace.h
+++ b/arch/powerpc/include/asm/trace.h
@@ -267,6 +267,7 @@ TRACE_EVENT_FN(opal_exit,
 );
 #endif

+#ifdef CONFIG_PPC_64S_HASH_MMU
 TRACE_EVENT(hash_fault,

 	    TP_PROTO(unsigned long addr, unsigned long access, unsigned long trap),
@@ -286,7 +287,7 @@ TRACE_EVENT(hash_fault,
 	    TP_printk("hash fault with addr 0x%lx and access = 0x%lx trap = 0x%lx",
 		      __entry->addr, __entry->access, __entry->trap)
 );
-
+#endif

 TRACE_EVENT(tlbie,
--
2.43.0