Re: [PATCH v6 (proposal)] powerpc/cpu: enable nr_cpus for crash kernel
Hi Christophe,

The latest series is
https://lore.kernel.org/linuxppc-dev/20231017022806.4523-1-pi...@redhat.com/

And Michael has his implementation at:
https://lore.kernel.org/all/20231229120107.2281153-3-...@ellerman.id.au/T/#m46128446bce1095631162a1927415733a3bf0633

Thanks,
Pingfan

On Fri, Jan 26, 2024 at 3:40 AM Christophe Leroy wrote:
>
> Hi,
>
> On 22/05/2018 at 10:23, Pingfan Liu wrote:
> > For kexec -p, the boot cpu can be not the cpu0, this causes the problem
> > to alloc paca[]. In theory, there is no requirement to assign cpu's logical
> > id as its present seq by device tree. But we have something like
> > cpu_first_thread_sibling(), which makes assumption on the mapping inside
> > a core. Hence partially changing the mapping, i.e. unbind the mapping of
> > core while keep the mapping inside a core. After this patch, the core with
> > boot-cpu will always be mapped into core 0.
> >
> > And at present, the code to discovery cpu spreads over two functions:
> > early_init_dt_scan_cpus() and smp_setup_cpu_maps().
> > This patch tries to fold smp_setup_cpu_maps() into the "previous" one
>
> This patch is pretty old and doesn't apply anymore. If still relevant
> can you please rebase and resubmit.
> > Thanks > Christophe > > > > > Signed-off-by: Pingfan Liu > > --- > > v5 -> v6: > >simplify the loop logic (Hope it can answer Benjamin's concern) > >concentrate the cpu recovery code to early stage (Hope it can answer > > Michael's concern) > > Todo: (if this method is accepted) > >fold the whole smp_setup_cpu_maps() > > > > arch/powerpc/include/asm/smp.h | 1 + > > arch/powerpc/kernel/prom.c | 123 > > - > > arch/powerpc/kernel/setup-common.c | 58 ++--- > > drivers/of/fdt.c | 2 +- > > include/linux/of_fdt.h | 2 + > > 5 files changed, 103 insertions(+), 83 deletions(-) > > > > diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h > > index fac963e..80c7693 100644 > > --- a/arch/powerpc/include/asm/smp.h > > +++ b/arch/powerpc/include/asm/smp.h > > @@ -30,6 +30,7 @@ > > #include > > > > extern int boot_cpuid; > > +extern int threads_in_core; > > extern int spinning_secondaries; > > > > extern void cpu_die(void); > > diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c > > index 4922162..2ae0b4a 100644 > > --- a/arch/powerpc/kernel/prom.c > > +++ b/arch/powerpc/kernel/prom.c > > @@ -77,7 +77,6 @@ unsigned long tce_alloc_start, tce_alloc_end; > > u64 ppc64_rma_size; > > #endif > > static phys_addr_t first_memblock_size; > > -static int __initdata boot_cpu_count; > > > > static int __init early_parse_mem(char *p) > > { > > @@ -305,6 +304,14 @@ static void __init > > check_cpu_feature_properties(unsigned long node) > > } > > } > > > > +struct bootinfo { > > + int boot_thread_id; > > + unsigned int cpu_cnt; > > + int cpu_hwids[NR_CPUS]; > > + bool avail[NR_CPUS]; > > +}; > > +static struct bootinfo *bt_info; > > + > > static int __init early_init_dt_scan_cpus(unsigned long node, > > const char *uname, int depth, > > void *data) > > @@ -312,10 +319,12 @@ static int __init early_init_dt_scan_cpus(unsigned > > long node, > > const char *type = of_get_flat_dt_prop(node, "device_type", NULL); > > const __be32 *prop; > > const __be32 
*intserv; > > - int i, nthreads; > > + int i, nthreads, maxidx; > > int len; > > - int found = -1; > > - int found_thread = 0; > > + int found_thread = -1; > > + struct bootinfo *info = data; > > + bool avail; > > + int rotate_cnt, id; > > > > /* We are scanning "cpu" nodes only */ > > if (type == NULL || strcmp(type, "cpu") != 0) > > @@ -325,8 +334,15 @@ static int __init early_init_dt_scan_cpus(unsigned > > long node, > > intserv = of_get_flat_dt_prop(node, "ibm,ppc-interrupt-server#s", > > &len); > > if (!intserv) > > intserv = of_get_flat_dt_prop(node, "reg", &len); > > + avail = of_fdt_device_is_available(initial_boot_params, node); > > +#if 0 > > + //todo > > + if (!avail) > > + avail = !of_fdt_property_match_string(node, > > + "enable-method", "spin-table"); > > +#endif > > > > - nthreads = len / sizeof(int); > > + threads_in_core = nthreads = len / sizeof(int); > > > > /* > >* Now see if any of these threads match our boot cpu. > > @@ -338,9 +354,10 @@ static int __init early_init_dt_scan_cpus(unsigned > > long node, > >* booted proc. > >*/ > > if (fdt_version(initial_boot_params) >= 2) { > > + info->cpu_hwids[info->cpu_cnt] = > > + be32_to_cpu(intserv[i]); > > if (be32_to_cpu(intserv[i]) == > >
RE: [PATCH v2 linux-next 1/3] x86, crash: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope
From: Baoquan He Sent: Monday, January 29, 2024 7:00 PM > > Michael pointed out that the CONFIG_CRASH_DUMP ifdef is nested inside > CONFIG_KEXEC_CODE ifdef scope in some XEN, Hyper-V codes. > > Although the nesting works well too since CONFIG_CRASH_DUMP has > dependency on CONFIG_KEXEC_CORE, it may cause confusion because there > are places where it's not nested, and people may think it needs to be > nested even though it doesn't have to. > > Fix that by moving CONFIG_CRASH_DUMP ifdeffery of codes out of > CONFIG_KEXEC_CODE ifdeffery scope. > > And also put function machine_crash_shutdown() definition inside > CONFIG_CRASH_DUMP ifdef scope instead of CONFIG_KEXEC_CORE ifdef. > > And also fix a building error Nathan reported as below by replacing > CONFIG_KEXEC_CORE ifdef with CONFIG_VMCORE_INFO ifdef. > > > $ curl -LSso .config > https://git.alpinelinux.org/aports/plain/community/linux-edge/config-edge.x86_64 > $ make -skj"$(nproc)" ARCH=x86_64 CROSS_COMPILE=x86_64-linux- > olddefconfig all > ... > x86_64-linux-ld: arch/x86/xen/mmu_pv.o: in function > `paddr_vmcoreinfo_note': > mmu_pv.c:(.text+0x3af3): undefined reference to `vmcoreinfo_note' > > > Link: > https://lore.kernel.org/all/sn6pr02mb4157931105fa68d72e3d3db8d4...@sn6pr02mb4157.namprd02.prod.outlook.com/T/#u > Link: > https://lore.kernel.org/all/20240126045551.GA126645@dev-arch.thelio-3990X/T/#u > Signed-off-by: Baoquan He > --- > v1->v2: > - Add missing words and fix typos in patch log pointed out by Michael. 
> > arch/x86/kernel/cpu/mshyperv.c | 10 ++ > arch/x86/kernel/reboot.c | 2 +- > arch/x86/xen/enlighten_hvm.c | 4 ++-- > arch/x86/xen/mmu_pv.c | 2 +- > 4 files changed, 10 insertions(+), 8 deletions(-) > > diff --git a/arch/x86/kernel/cpu/mshyperv.c > b/arch/x86/kernel/cpu/mshyperv.c > index f8163a59026b..2e8cd5a4ae85 100644 > --- a/arch/x86/kernel/cpu/mshyperv.c > +++ b/arch/x86/kernel/cpu/mshyperv.c > @@ -209,6 +209,7 @@ static void hv_machine_shutdown(void) > if (kexec_in_progress) > hyperv_cleanup(); > } > +#endif /* CONFIG_KEXEC_CORE */ > > #ifdef CONFIG_CRASH_DUMP > static void hv_machine_crash_shutdown(struct pt_regs *regs) > @@ -222,8 +223,7 @@ static void hv_machine_crash_shutdown(struct > pt_regs *regs) > /* Disable the hypercall page when there is only 1 active CPU. */ > hyperv_cleanup(); > } > -#endif > -#endif /* CONFIG_KEXEC_CORE */ > +#endif /* CONFIG_CRASH_DUMP */ > #endif /* CONFIG_HYPERV */ > > static uint32_t __init ms_hyperv_platform(void) > @@ -497,9 +497,11 @@ static void __init ms_hyperv_init_platform(void) > no_timer_check = 1; > #endif > > -#if IS_ENABLED(CONFIG_HYPERV) && defined(CONFIG_KEXEC_CORE) > +#if IS_ENABLED(CONFIG_HYPERV) > +#if defined(CONFIG_KEXEC_CORE) > machine_ops.shutdown = hv_machine_shutdown; > -#ifdef CONFIG_CRASH_DUMP > +#endif > +#if defined(CONFIG_CRASH_DUMP) > machine_ops.crash_shutdown = hv_machine_crash_shutdown; > #endif > #endif > diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c > index 1287b0d5962f..f3130f762784 100644 > --- a/arch/x86/kernel/reboot.c > +++ b/arch/x86/kernel/reboot.c > @@ -826,7 +826,7 @@ void machine_halt(void) > machine_ops.halt(); > } > > -#ifdef CONFIG_KEXEC_CORE > +#ifdef CONFIG_CRASH_DUMP > void machine_crash_shutdown(struct pt_regs *regs) > { > machine_ops.crash_shutdown(regs); > diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c > index 09e3db7ff990..0b367c1e086d 100644 > --- a/arch/x86/xen/enlighten_hvm.c > +++ b/arch/x86/xen/enlighten_hvm.c > @@ 
-148,6 +148,7 @@ static void xen_hvm_shutdown(void) > if (kexec_in_progress) > xen_reboot(SHUTDOWN_soft_reset); > } > +#endif > > #ifdef CONFIG_CRASH_DUMP > static void xen_hvm_crash_shutdown(struct pt_regs *regs) > @@ -156,7 +157,6 @@ static void xen_hvm_crash_shutdown(struct pt_regs > *regs) > xen_reboot(SHUTDOWN_soft_reset); > } > #endif > -#endif > > static int xen_cpu_up_prepare_hvm(unsigned int cpu) > { > @@ -238,10 +238,10 @@ static void __init xen_hvm_guest_init(void) > > #ifdef CONFIG_KEXEC_CORE > machine_ops.shutdown = xen_hvm_shutdown; > +#endif > #ifdef CONFIG_CRASH_DUMP > machine_ops.crash_shutdown = xen_hvm_crash_shutdown; > #endif > -#endif > } > > static __init int xen_parse_nopv(char *arg) > diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c > index 218773cfb009..e21974f2cf2d 100644 > --- a/arch/x86/xen/mmu_pv.c > +++ b/arch/x86/xen/mmu_pv.c > @@ -2520,7 +2520,7 @@ int xen_remap_pfn(struct vm_area_struct *vma, > unsigned long addr, > } > EXPORT_SYMBOL_GPL(xen_remap_pfn); > > -#ifdef CONFIG_KEXEC_CORE > +#ifdef CONFIG_VMCORE_INFO > phys_addr_t paddr_vmcoreinfo_note(void) > { > if (xen_pv_domain()) > -- > 2.41.0 Reviewed-by: Michael Kelley
Re: [PATCH linux-next 1/3] x86, crash: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope
On 01/30/24 at 01:39am, Michael Kelley wrote:
> From: Baoquan He
> >
> > On 01/29/24 at 06:27pm, Michael Kelley wrote:
> > > From: Baoquan He Sent: Monday, January 29, 2024 5:51 AM
> > > >
> > > > Michael pointed out that the #ifdef CONFIG_CRASH_DUMP is nested inside
> > > > arch/x86/xen/enlighten_hvm.c.
> > >
> > > Did some words get left out in the above sentence? It mentions the Xen
> > > case, but not the Hyper-V case. I'm not sure what you intended.
> >
> > Thanks a lot for your careful reviewing.
> >
> > Yeah, I tried to list all affected file names, seems my vim editor threw
> > away some words. And I forgot mentioning the change in reboot.c.
> >
> > I adjusted log as below according to your comments, do you think it's OK
> > now?
>
> Yes -- looks like everything is included and clear up my confusion. But
> I still have two small nits per below. :-)

Right, I will fold them into v2. Thanks again.

> >
> > ===
> > Michael pointed out that the #ifdef CONFIG_CRASH_DUMP is nested inside
> > CONFIG_KEXEC_CODE ifdef scope in some XEN, HyperV codes.
>
> s/Hyper-V/HyperV/
>
> >
> > Although the nesting works well too since CONFIG_CRASH_DUMP has
> > dependency on CONFIG_KEXEC_CORE, it may cause confusion because there
> > are places where it's not nested, and people may think it needs be nested
>
> s/needs to be/needs be/
>
> > even though it doesn't have to.
> >
> > Fix that by moving CONFIG_CRASH_DUMP ifdeffery of codes out of
> > CONFIG_KEXEC_CODE ifdeffery scope.
> >
> > And also put function machine_crash_shutdown() definition inside
> > CONFIG_CRASH_DUMP ifdef scope instead of CONFIG_KEXEC_CORE ifdef.
> >
> > And also fix a building error Nathan reported as below by replacing
> > CONFIG_KEXEC_CORE ifdef with CONFIG_VMCORE_INFO ifdef.
> > ..
> > ===
> >
> > Thanks
> > Baoquan
>
Re: [PATCH v2 linux-next 1/3] x86, crash: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope
Michael pointed out that the CONFIG_CRASH_DUMP ifdef is nested inside the
CONFIG_KEXEC_CORE ifdef scope in some Xen and Hyper-V code.

Although the nesting works well too since CONFIG_CRASH_DUMP has a
dependency on CONFIG_KEXEC_CORE, it may cause confusion because there
are places where it's not nested, and people may think it needs to be
nested even though it doesn't have to.

Fix that by moving the CONFIG_CRASH_DUMP ifdeffery out of the
CONFIG_KEXEC_CORE ifdeffery scope.

Also put the machine_crash_shutdown() definition inside the
CONFIG_CRASH_DUMP ifdef scope instead of the CONFIG_KEXEC_CORE ifdef.

Also fix a build error Nathan reported, shown below, by replacing the
CONFIG_KEXEC_CORE ifdef with a CONFIG_VMCORE_INFO ifdef.

$ curl -LSso .config https://git.alpinelinux.org/aports/plain/community/linux-edge/config-edge.x86_64
$ make -skj"$(nproc)" ARCH=x86_64 CROSS_COMPILE=x86_64-linux- olddefconfig all
...
x86_64-linux-ld: arch/x86/xen/mmu_pv.o: in function `paddr_vmcoreinfo_note':
mmu_pv.c:(.text+0x3af3): undefined reference to `vmcoreinfo_note'

Link: https://lore.kernel.org/all/sn6pr02mb4157931105fa68d72e3d3db8d4...@sn6pr02mb4157.namprd02.prod.outlook.com/T/#u
Link: https://lore.kernel.org/all/20240126045551.GA126645@dev-arch.thelio-3990X/T/#u
Signed-off-by: Baoquan He
---
v1->v2:
- Add missing words and fix typos in patch log pointed out by Michael.
arch/x86/kernel/cpu/mshyperv.c | 10 ++ arch/x86/kernel/reboot.c | 2 +- arch/x86/xen/enlighten_hvm.c | 4 ++-- arch/x86/xen/mmu_pv.c | 2 +- 4 files changed, 10 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c index f8163a59026b..2e8cd5a4ae85 100644 --- a/arch/x86/kernel/cpu/mshyperv.c +++ b/arch/x86/kernel/cpu/mshyperv.c @@ -209,6 +209,7 @@ static void hv_machine_shutdown(void) if (kexec_in_progress) hyperv_cleanup(); } +#endif /* CONFIG_KEXEC_CORE */ #ifdef CONFIG_CRASH_DUMP static void hv_machine_crash_shutdown(struct pt_regs *regs) @@ -222,8 +223,7 @@ static void hv_machine_crash_shutdown(struct pt_regs *regs) /* Disable the hypercall page when there is only 1 active CPU. */ hyperv_cleanup(); } -#endif -#endif /* CONFIG_KEXEC_CORE */ +#endif /* CONFIG_CRASH_DUMP */ #endif /* CONFIG_HYPERV */ static uint32_t __init ms_hyperv_platform(void) @@ -497,9 +497,11 @@ static void __init ms_hyperv_init_platform(void) no_timer_check = 1; #endif -#if IS_ENABLED(CONFIG_HYPERV) && defined(CONFIG_KEXEC_CORE) +#if IS_ENABLED(CONFIG_HYPERV) +#if defined(CONFIG_KEXEC_CORE) machine_ops.shutdown = hv_machine_shutdown; -#ifdef CONFIG_CRASH_DUMP +#endif +#if defined(CONFIG_CRASH_DUMP) machine_ops.crash_shutdown = hv_machine_crash_shutdown; #endif #endif diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c index 1287b0d5962f..f3130f762784 100644 --- a/arch/x86/kernel/reboot.c +++ b/arch/x86/kernel/reboot.c @@ -826,7 +826,7 @@ void machine_halt(void) machine_ops.halt(); } -#ifdef CONFIG_KEXEC_CORE +#ifdef CONFIG_CRASH_DUMP void machine_crash_shutdown(struct pt_regs *regs) { machine_ops.crash_shutdown(regs); diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c index 09e3db7ff990..0b367c1e086d 100644 --- a/arch/x86/xen/enlighten_hvm.c +++ b/arch/x86/xen/enlighten_hvm.c @@ -148,6 +148,7 @@ static void xen_hvm_shutdown(void) if (kexec_in_progress) xen_reboot(SHUTDOWN_soft_reset); } +#endif #ifdef 
CONFIG_CRASH_DUMP static void xen_hvm_crash_shutdown(struct pt_regs *regs) @@ -156,7 +157,6 @@ static void xen_hvm_crash_shutdown(struct pt_regs *regs) xen_reboot(SHUTDOWN_soft_reset); } #endif -#endif static int xen_cpu_up_prepare_hvm(unsigned int cpu) { @@ -238,10 +238,10 @@ static void __init xen_hvm_guest_init(void) #ifdef CONFIG_KEXEC_CORE machine_ops.shutdown = xen_hvm_shutdown; +#endif #ifdef CONFIG_CRASH_DUMP machine_ops.crash_shutdown = xen_hvm_crash_shutdown; #endif -#endif } static __init int xen_parse_nopv(char *arg) diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c index 218773cfb009..e21974f2cf2d 100644 --- a/arch/x86/xen/mmu_pv.c +++ b/arch/x86/xen/mmu_pv.c @@ -2520,7 +2520,7 @@ int xen_remap_pfn(struct vm_area_struct *vma, unsigned long addr, } EXPORT_SYMBOL_GPL(xen_remap_pfn); -#ifdef CONFIG_KEXEC_CORE +#ifdef CONFIG_VMCORE_INFO phys_addr_t paddr_vmcoreinfo_note(void) { if (xen_pv_domain()) -- 2.41.0
RE: [PATCH linux-next 1/3] x86, crash: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope
From: Baoquan He
>
> On 01/29/24 at 06:27pm, Michael Kelley wrote:
> > From: Baoquan He Sent: Monday, January 29, 2024 5:51 AM
> > >
> > > Michael pointed out that the #ifdef CONFIG_CRASH_DUMP is nested inside
> > > arch/x86/xen/enlighten_hvm.c.
> >
> > Did some words get left out in the above sentence? It mentions the Xen
> > case, but not the Hyper-V case. I'm not sure what you intended.
>
> Thanks a lot for your careful reviewing.
>
> Yeah, I tried to list all affected file names, seems my vim editor threw
> away some words. And I forgot mentioning the change in reboot.c.
>
> I adjusted log as below according to your comments, do you think it's OK
> now?

Yes -- looks like everything is included, and that clears up my confusion.
But I still have two small nits per below. :-)

Michael

>
> ===
> Michael pointed out that the #ifdef CONFIG_CRASH_DUMP is nested inside
> CONFIG_KEXEC_CODE ifdef scope in some XEN, HyperV codes.

s/HyperV/Hyper-V/

>
> Although the nesting works well too since CONFIG_CRASH_DUMP has
> dependency on CONFIG_KEXEC_CORE, it may cause confusion because there
> are places where it's not nested, and people may think it needs be nested

s/needs be/needs to be/

> even though it doesn't have to.
>
> Fix that by moving CONFIG_CRASH_DUMP ifdeffery of codes out of
> CONFIG_KEXEC_CODE ifdeffery scope.
>
> And also put function machine_crash_shutdown() definition inside
> CONFIG_CRASH_DUMP ifdef scope instead of CONFIG_KEXEC_CORE ifdef.
>
> And also fix a building error Nathan reported as below by replacing
> CONFIG_KEXEC_CORE ifdef with CONFIG_VMCORE_INFO ifdef.
> ..
> ===
>
> Thanks
> Baoquan
Re: [PATCH linux-next 1/3] x86, crash: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope
On 01/29/24 at 06:27pm, Michael Kelley wrote:
> From: Baoquan He Sent: Monday, January 29, 2024 5:51 AM
> >
> > Michael pointed out that the #ifdef CONFIG_CRASH_DUMP is nested inside
> > arch/x86/xen/enlighten_hvm.c.
>
> Did some words get left out in the above sentence? It mentions the Xen
> case, but not the Hyper-V case. I'm not sure what you intended.

Thanks a lot for your careful review.

Yeah, I tried to list all affected file names, but it seems my vim editor
threw away some words. And I forgot to mention the change in reboot.c.

I adjusted the log as below according to your comments. Do you think it's OK
now?

===
Michael pointed out that the #ifdef CONFIG_CRASH_DUMP is nested inside
CONFIG_KEXEC_CODE ifdef scope in some XEN, HyperV codes.

Although the nesting works well too since CONFIG_CRASH_DUMP has
dependency on CONFIG_KEXEC_CORE, it may cause confusion because there
are places where it's not nested, and people may think it needs be nested
even though it doesn't have to.

Fix that by moving CONFIG_CRASH_DUMP ifdeffery of codes out of
CONFIG_KEXEC_CODE ifdeffery scope.

And also put function machine_crash_shutdown() definition inside
CONFIG_CRASH_DUMP ifdef scope instead of CONFIG_KEXEC_CORE ifdef.

And also fix a building error Nathan reported as below by replacing
CONFIG_KEXEC_CORE ifdef with CONFIG_VMCORE_INFO ifdef.
..
===

Thanks
Baoquan
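For readers following the thread, the un-nested layout that the log describes can be illustrated with a small user-space sketch. This is not the kernel code: struct machine_ops_sketch, sketch_shutdown(), sketch_crash_shutdown() and sketch_init_platform() are hypothetical names, and the two CONFIG_* macros are defined by hand here only to mimic a build with both options enabled.

```c
/*
 * Illustrative user-space sketch only, NOT the kernel code.
 * All *_sketch names are hypothetical stand-ins, and the CONFIG_*
 * macros are defined by hand to mimic a build with both options on.
 */
#include <assert.h>
#include <stddef.h>

#define CONFIG_KEXEC_CORE 1
#define CONFIG_CRASH_DUMP 1

struct machine_ops_sketch {
	void (*shutdown)(void);
	void (*crash_shutdown)(void);
};

#ifdef CONFIG_KEXEC_CORE
static void sketch_shutdown(void) { }
#endif /* CONFIG_KEXEC_CORE */

/* Sibling block: it does not sit inside the CONFIG_KEXEC_CORE block. */
#ifdef CONFIG_CRASH_DUMP
static void sketch_crash_shutdown(void) { }
#endif /* CONFIG_CRASH_DUMP */

/* Each hook is guarded by its own option; neither guard encloses the other. */
static void sketch_init_platform(struct machine_ops_sketch *ops)
{
	assert(ops != NULL);

	ops->shutdown = NULL;
	ops->crash_shutdown = NULL;
#ifdef CONFIG_KEXEC_CORE
	ops->shutdown = sketch_shutdown;
#endif
#ifdef CONFIG_CRASH_DUMP
	ops->crash_shutdown = sketch_crash_shutdown;
#endif
}
```

With this shape, a reader can see at a glance which option controls which hook, without having to know that one option depends on the other in Kconfig.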
Re: [PATCH v10 5/6] arm64: support copy_mc_[user]_highpage()
On Mon, Jan 29, 2024 at 2:47 PM Tong Tiangen wrote: > > Currently, many scenarios that can tolerate memory errors when copying page > have been supported in the kernel[1][2][3], all of which are implemented by > copy_mc_[user]_highpage(). arm64 should also support this mechanism. > > Due to mte, arm64 needs to have its own copy_mc_[user]_highpage() > architecture implementation, macros __HAVE_ARCH_COPY_MC_HIGHPAGE and > __HAVE_ARCH_COPY_MC_USER_HIGHPAGE have been added to control it. > > Add new helper copy_mc_page() which provide a page copy implementation with > machine check safe. The copy_mc_page() in copy_mc_page.S is largely borrows > from copy_page() in copy_page.S and the main difference is copy_mc_page() > add extable entry to every load/store insn to support machine check safe. > > Add new extable type EX_TYPE_COPY_MC_PAGE_ERR_ZERO which used in > copy_mc_page(). > > [1]a873dfe1032a ("mm, hwpoison: try to recover from copy-on write faults") > [2]5f2500b93cc9 ("mm/khugepaged: recover from poisoned anonymous memory") > [3]6b970599e807 ("mm: hwpoison: support recovery from > ksm_might_need_to_copy()") > > Signed-off-by: Tong Tiangen > --- > arch/arm64/include/asm/asm-extable.h | 15 ++ > arch/arm64/include/asm/assembler.h | 4 ++ > arch/arm64/include/asm/mte.h | 5 ++ > arch/arm64/include/asm/page.h| 10 > arch/arm64/lib/Makefile | 2 + > arch/arm64/lib/copy_mc_page.S| 78 > arch/arm64/lib/mte.S | 27 ++ > arch/arm64/mm/copypage.c | 66 --- > arch/arm64/mm/extable.c | 7 +-- > include/linux/highmem.h | 8 +++ > 10 files changed, 213 insertions(+), 9 deletions(-) > create mode 100644 arch/arm64/lib/copy_mc_page.S > > diff --git a/arch/arm64/include/asm/asm-extable.h > b/arch/arm64/include/asm/asm-extable.h > index 980d1dd8e1a3..819044fefbe7 100644 > --- a/arch/arm64/include/asm/asm-extable.h > +++ b/arch/arm64/include/asm/asm-extable.h > @@ -10,6 +10,7 @@ > #define EX_TYPE_UACCESS_ERR_ZERO 2 > #define EX_TYPE_KACCESS_ERR_ZERO 3 > #define EX_TYPE_LOAD_UNALIGNED_ZEROPAD 
4 > +#define EX_TYPE_COPY_MC_PAGE_ERR_ZERO 5 > > /* Data fields for EX_TYPE_UACCESS_ERR_ZERO */ > #define EX_DATA_REG_ERR_SHIFT 0 > @@ -51,6 +52,16 @@ > #define _ASM_EXTABLE_UACCESS(insn, fixup) \ > _ASM_EXTABLE_UACCESS_ERR_ZERO(insn, fixup, wzr, wzr) > > +#define _ASM_EXTABLE_COPY_MC_PAGE_ERR_ZERO(insn, fixup, err, zero) \ > + __ASM_EXTABLE_RAW(insn, fixup, \ > + EX_TYPE_COPY_MC_PAGE_ERR_ZERO,\ > + ( \ > + EX_DATA_REG(ERR, err) | \ > + EX_DATA_REG(ZERO, zero) \ > + )) > + > +#define _ASM_EXTABLE_COPY_MC_PAGE(insn, fixup) \ > + _ASM_EXTABLE_COPY_MC_PAGE_ERR_ZERO(insn, fixup, wzr, wzr) > /* > * Create an exception table entry for uaccess `insn`, which will branch to > `fixup` > * when an unhandled fault is taken. > @@ -59,6 +70,10 @@ > _ASM_EXTABLE_UACCESS(\insn, \fixup) > .endm > > + .macro _asm_extable_copy_mc_page, insn, fixup > + _ASM_EXTABLE_COPY_MC_PAGE(\insn, \fixup) > + .endm > + > /* > * Create an exception table entry for `insn` if `fixup` is provided. > Otherwise > * do nothing. > diff --git a/arch/arm64/include/asm/assembler.h > b/arch/arm64/include/asm/assembler.h > index 513787e43329..e1d8ce155878 100644 > --- a/arch/arm64/include/asm/assembler.h > +++ b/arch/arm64/include/asm/assembler.h > @@ -154,6 +154,10 @@ lr .reqx30 // link register > #define CPU_LE(code...) code > #endif > > +#define CPY_MC(l, x...)\ > +: x; \ > + _asm_extable_copy_mc_pageb, l > + > /* > * Define a macro that constructs a 64-bit value by concatenating two > * 32-bit registers. 
Note that on big endian systems the order of the > diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h > index 91fbd5c8a391..9cdded082dd4 100644 > --- a/arch/arm64/include/asm/mte.h > +++ b/arch/arm64/include/asm/mte.h > @@ -92,6 +92,7 @@ static inline bool try_page_mte_tagging(struct page *page) > void mte_zero_clear_page_tags(void *addr); > void mte_sync_tags(pte_t pte, unsigned int nr_pages); > void mte_copy_page_tags(void *kto, const void *kfrom); > +int mte_copy_mc_page_tags(void *kto, const void *kfrom); > void mte_thread_init_user(void); > void mte_thread_switch(struct task_struct *next); > void mte_cpu_setup(void); > @@ -128,6 +129,10 @@ static inline void mte_sync_tags(pte_t pte, unsigned int > nr_pages) > static inline void mte_copy_page_tags(void *kto, const void *kfrom) > { > } > +static i
Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time
On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
> Declaring rodata_enabled and mark_rodata_ro() at all time
> helps removing related #ifdefery in C files.
>
> Signed-off-by: Christophe Leroy

Very nice cleanup, thanks! Applied and pushed.

  Luis
RE: [PATCH linux-next 1/3] x86, crash: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope
From: Baoquan He Sent: Monday, January 29, 2024 5:51 AM
>
> Michael pointed out that the #ifdef CONFIG_CRASH_DUMP is nested inside
> arch/x86/xen/enlighten_hvm.c.

Did some words get left out in the above sentence? It mentions the Xen
case, but not the Hyper-V case. I'm not sure what you intended.

>
> Although the nesting works well too since CONFIG_CRASH_DUMP has
> dependency on CONFIG_KEXEC_CORE, it may cause confuse because there

s/confuse/confusion/

> are places where it's not nested, and people may think it need be nested

s/need be/needs to be/

> even though it doesn't have to.
>
> Fix that by moving CONFIG_CRASH_DUMP ifdeffery of codes out of
> CONFIG_KEXEC_CODE ifdeffery scope.
>
> And also fix a building error Nathan reported as below by replacing
> CONFIG_KEXEC_CORE ifdef with CONFIG_VMCORE_INFO ifdef.
>
> $ curl -LSso .config
> https://git.alpinelinux.org/aports/plain/community/linux-edge/config-edge.x86_64
>
> $ make -skj"$(nproc)" ARCH=x86_64 CROSS_COMPILE=x86_64-linux-
> olddefconfig all
> ...
> x86_64-linux-ld: arch/x86/xen/mmu_pv.o: in function
> `paddr_vmcoreinfo_note':
> mmu_pv.c:(.text+0x3af3): undefined reference to `vmcoreinfo_note'
>
> Link:
> https://lore.kernel.org/all/sn6pr02mb4157931105fa68d72e3d3db8d4...@sn6pr02mb4157.namprd02.prod.outlook.com/T/#u
> Link:
> https://lore.kernel.org/all/20240126045551.GA126645@dev-arch.thelio-3990X/T/#u
> Signed-off-by: Baoquan He

Modulo the commit message nits, LGTM.
Reviewed-by: Michael Kelley > --- > arch/x86/kernel/cpu/mshyperv.c | 10 ++ > arch/x86/kernel/reboot.c | 2 +- > arch/x86/xen/enlighten_hvm.c | 4 ++-- > arch/x86/xen/mmu_pv.c | 2 +- > 4 files changed, 10 insertions(+), 8 deletions(-) > > diff --git a/arch/x86/kernel/cpu/mshyperv.c > b/arch/x86/kernel/cpu/mshyperv.c > index f8163a59026b..2e8cd5a4ae85 100644 > --- a/arch/x86/kernel/cpu/mshyperv.c > +++ b/arch/x86/kernel/cpu/mshyperv.c > @@ -209,6 +209,7 @@ static void hv_machine_shutdown(void) > if (kexec_in_progress) > hyperv_cleanup(); > } > +#endif /* CONFIG_KEXEC_CORE */ > > #ifdef CONFIG_CRASH_DUMP > static void hv_machine_crash_shutdown(struct pt_regs *regs) > @@ -222,8 +223,7 @@ static void hv_machine_crash_shutdown(struct > pt_regs *regs) > /* Disable the hypercall page when there is only 1 active CPU. */ > hyperv_cleanup(); > } > -#endif > -#endif /* CONFIG_KEXEC_CORE */ > +#endif /* CONFIG_CRASH_DUMP */ > #endif /* CONFIG_HYPERV */ > > static uint32_t __init ms_hyperv_platform(void) > @@ -497,9 +497,11 @@ static void __init ms_hyperv_init_platform(void) > no_timer_check = 1; > #endif > > -#if IS_ENABLED(CONFIG_HYPERV) && defined(CONFIG_KEXEC_CORE) > +#if IS_ENABLED(CONFIG_HYPERV) > +#if defined(CONFIG_KEXEC_CORE) > machine_ops.shutdown = hv_machine_shutdown; > -#ifdef CONFIG_CRASH_DUMP > +#endif > +#if defined(CONFIG_CRASH_DUMP) > machine_ops.crash_shutdown = hv_machine_crash_shutdown; > #endif > #endif > diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c > index 1287b0d5962f..f3130f762784 100644 > --- a/arch/x86/kernel/reboot.c > +++ b/arch/x86/kernel/reboot.c > @@ -826,7 +826,7 @@ void machine_halt(void) > machine_ops.halt(); > } > > -#ifdef CONFIG_KEXEC_CORE > +#ifdef CONFIG_CRASH_DUMP > void machine_crash_shutdown(struct pt_regs *regs) > { > machine_ops.crash_shutdown(regs); > diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c > index 09e3db7ff990..0b367c1e086d 100644 > --- a/arch/x86/xen/enlighten_hvm.c > +++ 
b/arch/x86/xen/enlighten_hvm.c > @@ -148,6 +148,7 @@ static void xen_hvm_shutdown(void) > if (kexec_in_progress) > xen_reboot(SHUTDOWN_soft_reset); > } > +#endif > > #ifdef CONFIG_CRASH_DUMP > static void xen_hvm_crash_shutdown(struct pt_regs *regs) > @@ -156,7 +157,6 @@ static void xen_hvm_crash_shutdown(struct pt_regs > *regs) > xen_reboot(SHUTDOWN_soft_reset); > } > #endif > -#endif > > static int xen_cpu_up_prepare_hvm(unsigned int cpu) > { > @@ -238,10 +238,10 @@ static void __init xen_hvm_guest_init(void) > > #ifdef CONFIG_KEXEC_CORE > machine_ops.shutdown = xen_hvm_shutdown; > +#endif > #ifdef CONFIG_CRASH_DUMP > machine_ops.crash_shutdown = xen_hvm_crash_shutdown; > #endif > -#endif > } > > static __init int xen_parse_nopv(char *arg) > diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c > index 218773cfb009..e21974f2cf2d 100644 > --- a/arch/x86/xen/mmu_pv.c > +++ b/arch/x86/xen/mmu_pv.c > @@ -2520,7 +2520,7 @@ int xen_remap_pfn(struct vm_area_struct *vma, > unsigned long addr, > } > EXPORT_SYMBOL_GPL(xen_remap_pfn); > > -#ifdef CONFIG_KEXEC_CORE > +#ifdef CONFIG_VMCORE_INFO > phys_addr_t paddr_vmcoreinfo_note(void) > { > if (xen_pv_domain()) > -- > 2.41.0
Re: [PATCH v10 3/6] arm64: add uaccess to machine check safe
On Mon, Jan 29, 2024 at 09:46:49PM +0800, Tong Tiangen wrote:
> If user process access memory fails due to hardware memory error, only the
> relevant processes are affected, so it is more reasonable to kill the user
> process and isolate the corrupt page than to panic the kernel.
>
> Signed-off-by: Tong Tiangen
> ---
>  arch/arm64/lib/copy_from_user.S | 10 +-
>  arch/arm64/lib/copy_to_user.S   | 10 +-
>  arch/arm64/mm/extable.c         |  8
>  3 files changed, 14 insertions(+), 14 deletions(-)
>
> diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S
> index 34e317907524..1bf676e9201d 100644
> --- a/arch/arm64/lib/copy_from_user.S
> +++ b/arch/arm64/lib/copy_from_user.S
> @@ -25,7 +25,7 @@
>  	.endm
>
>  	.macro strb1 reg, ptr, val
> -	strb \reg, [\ptr], \val
> +	USER(9998f, strb \reg, [\ptr], \val)
>  	.endm

This is a store to *kernel* memory, not user memory. It should not be marked
with USER().

I understand that you *might* want to handle memory errors on these stores,
but the commit message doesn't describe that and the associated trade-off. For
example, consider that when a copy_from_user fails we'll try to zero the
remaining buffer via memset(); so if a STR* instruction in copy_to_user
faulted, upon handling the fault we'll immediately try to fix that up with some
more stores which will also fault, but won't get fixed up, leading to a panic()
anyway...

Further, this change will also silently fixup unexpected kernel faults if we
pass bad kernel pointers to copy_{to,from}_user, which will hide real bugs.

So NAK to this change as-is; likewise for the addition of USER() to other ldr*
macros in copy_from_user.S and the addition of USER() str* macros in
copy_to_user.S.

If we want to handle memory errors on some kaccesses, we need a new EX_TYPE_*
separate from the usual EX_TYPE_KACCESS_ERR_ZERO that means "handle memory
errors, but treat other faults as fatal". That should come with a rationale and
explanation of why it's actually useful.

[...]
> diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
> index 478e639f8680..28ec35e3d210 100644
> --- a/arch/arm64/mm/extable.c
> +++ b/arch/arm64/mm/extable.c
> @@ -85,10 +85,10 @@ bool fixup_exception_mc(struct pt_regs *regs)
>  	if (!ex)
>  		return false;
>
> -	/*
> -	 * This is not complete, More Machine check safe extable type can
> -	 * be processed here.
> -	 */
> +	switch (ex->type) {
> +	case EX_TYPE_UACCESS_ERR_ZERO:
> +		return ex_handler_uaccess_err_zero(ex, regs);
> +	}

Please fold this part into the prior patch, and start off with *only* handling
errors on accesses already marked with EX_TYPE_UACCESS_ERR_ZERO. I think that
change would be relatively uncontroversial, and it would be much easier to
build atop that.

Thanks,
Mark.
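To make the suggestion above concrete, here is a minimal user-space model of a fixup path that handles *only* entries of the uaccess type and treats everything else as unhandled. It is a sketch under stated assumptions, not the arm64 implementation: every *_sketch name, the flat table lookup, and the use of -14 to stand in for -EFAULT are hypothetical simplifications.

```c
/*
 * User-space model only. The structs, the table search and the handler
 * are hypothetical simplifications of the kernel's exception-table
 * machinery; none of these names exist in the kernel.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ex_type_sketch {
	EX_TYPE_NONE_SKETCH = 0,
	EX_TYPE_UACCESS_ERR_ZERO_SKETCH = 2,
	EX_TYPE_KACCESS_ERR_ZERO_SKETCH = 3,
};

struct ex_entry_sketch {
	unsigned long insn;	/* address of the instruction that may fault */
	unsigned long fixup;	/* address to resume at after a fixup */
	enum ex_type_sketch type;
};

struct regs_sketch {
	unsigned long pc;
	long err;		/* models the register that receives the error */
};

/* Linear search stands in for the kernel's sorted-table lookup. */
static const struct ex_entry_sketch *
search_table_sketch(const struct ex_entry_sketch *tbl, size_t n,
		    unsigned long pc)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].insn == pc)
			return &tbl[i];
	return NULL;
}

/* Returns true only when a uaccess-typed entry covers the faulting PC. */
static bool fixup_memory_error_sketch(const struct ex_entry_sketch *tbl,
				      size_t n, struct regs_sketch *regs)
{
	const struct ex_entry_sketch *ex;

	assert(regs != NULL);

	ex = search_table_sketch(tbl, n, regs->pc);
	if (!ex)
		return false;

	switch (ex->type) {
	case EX_TYPE_UACCESS_ERR_ZERO_SKETCH:
		regs->err = -14;	/* stands in for -EFAULT */
		regs->pc = ex->fixup;
		return true;
	default:
		/* Kernel accesses and anything else stay fatal. */
		return false;
	}
}
```

The point of the default branch is that a memory error on a kernel access is not silently fixed up; widening the set of handled types would be a separate, explicitly justified step.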
Re: [PATCH v10 2/6] arm64: add support for machine check error safe
On Mon, Jan 29, 2024 at 09:46:48PM +0800, Tong Tiangen wrote:
> For the arm64 kernel, when it processes hardware memory errors for
> synchronize notifications(do_sea()), if the errors is consumed within the
> kernel, the current processing is panic. However, it is not optimal.
>
> Take uaccess for example, if the uaccess operation fails due to memory
> error, only the user process will be affected. Killing the user process and
> isolating the corrupt page is a better choice.
>
> This patch only enable machine error check framework and adds an exception
> fixup before the kernel panic in do_sea().
>
> Signed-off-by: Tong Tiangen
> ---
>  arch/arm64/Kconfig               |  1 +
>  arch/arm64/include/asm/extable.h |  1 +
>  arch/arm64/mm/extable.c          | 16
>  arch/arm64/mm/fault.c            | 29 -
>  4 files changed, 46 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index aa7c1d435139..2cc34b5e7abb 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -20,6 +20,7 @@ config ARM64
>  	select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
>  	select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
>  	select ARCH_HAS_CACHE_LINE_SIZE
> +	select ARCH_HAS_COPY_MC if ACPI_APEI_GHES
>  	select ARCH_HAS_CURRENT_STACK_POINTER
>  	select ARCH_HAS_DEBUG_VIRTUAL
>  	select ARCH_HAS_DEBUG_VM_PGTABLE
> diff --git a/arch/arm64/include/asm/extable.h b/arch/arm64/include/asm/extable.h
> index 72b0e71cc3de..f80ebd0addfd 100644
> --- a/arch/arm64/include/asm/extable.h
> +++ b/arch/arm64/include/asm/extable.h
> @@ -46,4 +46,5 @@ bool ex_handler_bpf(const struct exception_table_entry *ex,
>  #endif /* !CONFIG_BPF_JIT */
>
>  bool fixup_exception(struct pt_regs *regs);
> +bool fixup_exception_mc(struct pt_regs *regs);
>  #endif
> diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
> index 228d681a8715..478e639f8680 100644
> --- a/arch/arm64/mm/extable.c
> +++ b/arch/arm64/mm/extable.c
> @@ -76,3 +76,19 @@ bool fixup_exception(struct pt_regs *regs)
>
>  	BUG();
>  }
> +
+bool fixup_exception_mc(struct pt_regs *regs)

Can we please replace 'mc' with something like 'memory_error' ?

There's no "machine check" on arm64, and 'mc' is opaque regardless.

> +{
> +	const struct exception_table_entry *ex;
> +
> +	ex = search_exception_tables(instruction_pointer(regs));
> +	if (!ex)
> +		return false;
> +
> +	/*
> +	 * This is not complete, More Machine check safe extable type can
> +	 * be processed here.
> +	 */
> +
> +	return false;
> +}

As with my comment on the subsequent patch, I'd much prefer that we handle
EX_TYPE_UACCESS_ERR_ZERO from the outset.

> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 55f6455a8284..312932dc100b 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -730,6 +730,31 @@ static int do_bad(unsigned long far, unsigned long esr, struct pt_regs *regs)
> 	return 1; /* "fault" */
> }
>
> +static bool arm64_do_kernel_sea(unsigned long addr, unsigned int esr,
> +				     struct pt_regs *regs, int sig, int code)
> +{
> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_COPY_MC))
> +		return false;
> +
> +	if (user_mode(regs))
> +		return false;

This function is called "arm64_do_kernel_sea"; surely the caller should
*never* call this for a SEA taken from user mode?

> +
> +	if (apei_claim_sea(regs) < 0)
> +		return false;
> +
> +	if (!fixup_exception_mc(regs))
> +		return false;
> +
> +	if (current->flags & PF_KTHREAD)
> +		return true;

I think this needs a comment; why do we allow kthreads to go on, yet kill
user threads? What about helper threads (e.g. for io_uring)?

> +
> +	set_thread_esr(0, esr);

Why do we set the ESR to 0?

Mark.
> + arm64_force_sig_fault(sig, code, addr, > + "Uncorrected memory error on access to user memory\n"); > + > + return true; > +} > + > static int do_sea(unsigned long far, unsigned long esr, struct pt_regs *regs) > { > const struct fault_info *inf; > @@ -755,7 +780,9 @@ static int do_sea(unsigned long far, unsigned long esr, > struct pt_regs *regs) >*/ > siaddr = untagged_addr(far); > } > - arm64_notify_die(inf->name, regs, inf->sig, inf->code, siaddr, esr); > + > + if (!arm64_do_kernel_sea(siaddr, esr, regs, inf->sig, inf->code)) > + arm64_notify_die(inf->name, regs, inf->sig, inf->code, siaddr, > esr); > > return 0; > } > -- > 2.25.1 >
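For readers following the review, the "exception fixup before the kernel panic" idea can be modelled in userspace. This is only a sketch under simplifying assumptions: struct ex_entry, ex_search() and ex_fixup() are hypothetical names, and the table stores absolute addresses, whereas a real kernel exception table is laid out differently. It only shows the lookup-then-redirect pattern that fixup_exception_mc() starts from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified entry: plain absolute addresses. */
struct ex_entry {
	uintptr_t insn;   /* address of an instruction that may fault */
	uintptr_t fixup;  /* address to resume at if it does */
};

/* bsearch() comparator: key is a faulting pc, elem is a table entry. */
static int ex_cmp(const void *key, const void *elem)
{
	uintptr_t pc = *(const uintptr_t *)key;
	const struct ex_entry *e = elem;

	if (pc < e->insn)
		return -1;
	if (pc > e->insn)
		return 1;
	return 0;
}

/* Return the matching entry or NULL. The table must be sorted by insn. */
static const struct ex_entry *ex_search(const struct ex_entry *tbl,
					size_t n, uintptr_t pc)
{
	return bsearch(&pc, tbl, n, sizeof(*tbl), ex_cmp);
}

/*
 * Model of "fix up or give up": on a hit, redirect *pc to the fixup
 * address and return 1; otherwise leave *pc alone and return 0 so the
 * caller can fall back to its fatal path.
 */
static int ex_fixup(const struct ex_entry *tbl, size_t n, uintptr_t *pc)
{
	const struct ex_entry *e;

	assert(pc != NULL);
	e = ex_search(tbl, n, *pc);
	if (!e)
		return 0;
	*pc = e->fixup;
	return 1;
}
```

In this model, a caller such as do_sea() would only take the fatal path when ex_fixup() returns 0, which mirrors the structure of the hunk above without claiming to match its details.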
[PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP
Similar to how we optimized fork(), let's implement PTE batching when consecutive (present) PTEs map consecutive pages of the same large folio. Most infrastructure we need for batching (mmu gather, rmap) is already there. We only have to add get_and_clear_full_ptes() and clear_full_ptes(). Similarly, extend zap_install_uffd_wp_if_needed() to process a PTE range. We won't bother sanity-checking the mapcount of all subpages, but only check the mapcount of the first subpage we process. To keep small folios as fast as possible force inlining of a specialized variant using __always_inline with nr=1. Signed-off-by: David Hildenbrand --- include/linux/pgtable.h | 66 + mm/memory.c | 92 + 2 files changed, 132 insertions(+), 26 deletions(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index aab227e12493..f0feae7f89fb 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -580,6 +580,72 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm, } #endif +#ifndef get_and_clear_full_ptes +/** + * get_and_clear_full_ptes - Clear PTEs that map consecutive pages of the same + * folio, collecting dirty/accessed bits. + * @mm: Address space the pages are mapped into. + * @addr: Address the first page is mapped at. + * @ptep: Page table pointer for the first entry. + * @nr: Number of entries to clear. + * @full: Whether we are clearing a full mm. + * + * May be overridden by the architecture; otherwise, implemented as a simple + * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into + * returned PTE. + * + * Note that PTE bits in the PTE range besides the PFN can differ. For example, + * some PTEs might be write-protected. + * + * Context: The caller holds the page table lock. The PTEs map consecutive + * pages that belong to the same folio. The PTEs are all in the same PMD. 
+ */ +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm, + unsigned long addr, pte_t *ptep, unsigned int nr, int full) +{ + pte_t pte, tmp_pte; + + pte = ptep_get_and_clear_full(mm, addr, ptep, full); + while (--nr) { + ptep++; + addr += PAGE_SIZE; + tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full); + if (pte_dirty(tmp_pte)) + pte = pte_mkdirty(pte); + if (pte_young(tmp_pte)) + pte = pte_mkyoung(pte); + } + return pte; +} +#endif + +#ifndef clear_full_ptes +/** + * clear_full_ptes - Clear PTEs that map consecutive pages of the same folio. + * @mm: Address space the pages are mapped into. + * @addr: Address the first page is mapped at. + * @ptep: Page table pointer for the first entry. + * @nr: Number of entries to clear. + * @full: Whether we are clearing a full mm. + * + * Note that PTE bits in the PTE range besides the PFN can differ. For example, + * some PTEs might be write-protected. + * + * Context: The caller holds the page table lock. The PTEs map consecutive + * pages that belong to the same folio. The PTEs are all in the same PMD. 
+ */ +static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, unsigned int nr, int full) +{ + for (;;) { + ptep_get_and_clear_full(mm, addr, ptep, full); + if (--nr == 0) + break; + ptep++; + addr += PAGE_SIZE; + } +} +#endif /* * If two threads concurrently fault at the same page, the thread that diff --git a/mm/memory.c b/mm/memory.c index a2190d7cfa74..38a010c4d04d 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1515,7 +1515,7 @@ static inline bool zap_drop_file_uffd_wp(struct zap_details *details) */ static inline void zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, - unsigned long addr, pte_t *pte, + unsigned long addr, pte_t *pte, int nr, struct zap_details *details, pte_t pteval) { /* Zap on anonymous always means dropping everything */ @@ -1525,20 +1525,27 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, if (zap_drop_file_uffd_wp(details)) return; - pte_install_uffd_wp_if_needed(vma, addr, pte, pteval); + for (;;) { + /* the PFN in the PTE is irrelevant. */ + pte_install_uffd_wp_if_needed(vma, addr, pte, pteval); + if (--nr == 0) + break; + pte++; + addr += PAGE_SIZE; + } } -static inline void zap_present_folio_pte(struct mmu_gather *tlb, +static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb, struct vm_area_struct *vma, struct folio *folio, - struct page *page, pte_t *pte, pte_t ptent, unsigned long addr,
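The generic get_and_clear_full_ptes() fallback above can be tried out in userspace with a small model. This is a sketch, not kernel code: mock_pte_t, the MOCK_PTE_* bits and both helpers are hypothetical stand-ins, and the bit positions are arbitrary. It shows the same shape of loop: clear the first entry, then fold the dirty and young bits of the remaining entries into the returned value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for pte_t and its bits; not a real PTE layout. */
typedef uint64_t mock_pte_t;
#define MOCK_PTE_DIRTY  (1ull << 0)
#define MOCK_PTE_YOUNG  (1ull << 1)

/* Read one entry and clear it, like one ptep_get_and_clear_full() call. */
static mock_pte_t mock_get_and_clear(mock_pte_t *ptep)
{
	mock_pte_t old = *ptep;

	*ptep = 0;
	return old;
}

/*
 * Clear nr consecutive entries and return the first one, with the dirty
 * and young bits of all later entries merged in. nr must be at least 1,
 * otherwise the "while (--nr)" loop would wrap around.
 */
static mock_pte_t mock_get_and_clear_batch(mock_pte_t *ptep, unsigned int nr)
{
	mock_pte_t pte, tmp;

	assert(nr >= 1);
	pte = mock_get_and_clear(ptep);
	while (--nr) {
		ptep++;
		tmp = mock_get_and_clear(ptep);
		pte |= tmp & (MOCK_PTE_DIRTY | MOCK_PTE_YOUNG);
	}
	return pte;
}
```

Note that only the dirty and young bits are merged; any other per-entry differences (for example write protection) are taken from the first entry alone, which matches the kerneldoc remark that PTE bits besides the PFN can differ across the range.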
[PATCH v1 8/9] mm/mmu_gather: add tlb_remove_tlb_entries()
Let's add a helper that lets us batch-process multiple consecutive PTEs. Note that the loop will get optimized out on all architectures except on powerpc. We have to add an early define of __tlb_remove_tlb_entry() on ppc to make the compiler happy (and avoid making tlb_remove_tlb_entries() a macro). Signed-off-by: David Hildenbrand --- arch/powerpc/include/asm/tlb.h | 2 ++ include/asm-generic/tlb.h | 20 2 files changed, 22 insertions(+) diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h index b3de6102a907..1ca7d4c4b90d 100644 --- a/arch/powerpc/include/asm/tlb.h +++ b/arch/powerpc/include/asm/tlb.h @@ -19,6 +19,8 @@ #include +static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep, + unsigned long address); #define __tlb_remove_tlb_entry __tlb_remove_tlb_entry #define tlb_flush tlb_flush diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 428c3f93addc..bd00dd238b79 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -616,6 +616,26 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb, __tlb_remove_tlb_entry(tlb, ptep, address); \ } while (0) +/** + * tlb_remove_tlb_entries - remember unmapping of multiple consecutive ptes for + * later tlb invalidation. + * + * Similar to tlb_remove_tlb_entry(), but remember unmapping of multiple + * consecutive ptes instead of only a single one. + */ +static inline void tlb_remove_tlb_entries(struct mmu_gather *tlb, + pte_t *ptep, unsigned int nr, unsigned long address) +{ + tlb_flush_pte_range(tlb, address, PAGE_SIZE * nr); + for (;;) { + __tlb_remove_tlb_entry(tlb, ptep, address); + if (--nr == 0) + break; + ptep++; + address += PAGE_SIZE; + } +} + #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \ do {\ unsigned long _sz = huge_page_size(h); \ -- 2.43.0
[PATCH v1 7/9] mm/mmu_gather: add __tlb_remove_folio_pages()
Add __tlb_remove_folio_pages(), which will remove multiple consecutive pages that belong to the same large folio, instead of only a single page. We'll be using this function when optimizing unmapping/zapping of large folios that are mapped by PTEs.

We're using the remaining spare bit in an encoded_page to indicate that the next encoded page in an array actually contains the shifted "nr_pages". Teach swap/freeing code about putting multiple folio references, and delayed rmap handling to remove page ranges of a folio.

This extension allows for still gathering almost as many small folios as we used to (-1, because we have to prepare for a possibly bigger next entry), but still allows for gathering consecutive pages that belong to the same large folio.

Note that we don't pass the folio pointer, because it is not required for now. Further, we don't support page_size != PAGE_SIZE, it won't be required for simple PTE batching.

We have to provide a separate s390 implementation, but it's fairly straightforward.

Another, more invasive and likely more expensive, approach would be to use folio+range or a PFN range instead of page+nr_pages. But, we should do that consistently for the whole mmu_gather. For now, let's keep it simple and add "nr_pages" only.
Signed-off-by: David Hildenbrand --- arch/s390/include/asm/tlb.h | 17 +++ include/asm-generic/tlb.h | 8 + include/linux/mm_types.h| 20 mm/mmu_gather.c | 61 +++-- mm/swap.c | 12 ++-- mm/swap_state.c | 12 ++-- 6 files changed, 116 insertions(+), 14 deletions(-) diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h index 48df896d5b79..abfd2bf29e9e 100644 --- a/arch/s390/include/asm/tlb.h +++ b/arch/s390/include/asm/tlb.h @@ -26,6 +26,8 @@ void __tlb_remove_table(void *_table); static inline void tlb_flush(struct mmu_gather *tlb); static inline bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, bool delay_rmap, int page_size); +static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb, + struct page *page, unsigned int nr_pages, bool delay_rmap); #define tlb_flush tlb_flush #define pte_free_tlb pte_free_tlb @@ -52,6 +54,21 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb, return false; } +static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb, + struct page *page, unsigned int nr_pages, bool delay_rmap) +{ + struct encoded_page *encoded_pages[] = { + encode_page(page, ENCODED_PAGE_BIT_NR_PAGES), + encode_nr_pages(nr_pages), + }; + + VM_WARN_ON_ONCE(delay_rmap); + VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1)); + + free_pages_and_swap_cache(encoded_pages, ARRAY_SIZE(encoded_pages)); + return false; +} + static inline void tlb_flush(struct mmu_gather *tlb) { __tlb_flush_mm_lazy(tlb->mm); diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 2eb7b0d4f5d2..428c3f93addc 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -69,6 +69,7 @@ * * - tlb_remove_page() / __tlb_remove_page() * - tlb_remove_page_size() / __tlb_remove_page_size() + * - __tlb_remove_folio_pages() * *__tlb_remove_page_size() is the basic primitive that queues a page for *freeing. __tlb_remove_page() assumes PAGE_SIZE. 
Both will return a @@ -78,6 +79,11 @@ *tlb_remove_page() and tlb_remove_page_size() imply the call to *tlb_flush_mmu() when required and has no return value. * + *__tlb_remove_folio_pages() is similar to __tlb_remove_page(), however, + *instead of removing a single page, remove the given number of consecutive + *pages that are all part of the same (large) folio: just like calling + *__tlb_remove_page() on each page individually. + * * - tlb_change_page_size() * *call before __tlb_remove_page*() to set the current page-size; implies a @@ -262,6 +268,8 @@ struct mmu_gather_batch { extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, bool delay_rmap, int page_size); +bool __tlb_remove_folio_pages(struct mmu_gather *tlb, struct page *page, + unsigned int nr_pages, bool delay_rmap); #ifdef CONFIG_SMP /* diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 1b89eec0d6df..198662b7a39a 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -226,6 +226,15 @@ struct encoded_page; /* Perform rmap removal after we have flushed the TLB. */ #define ENCODED_PAGE_BIT_DELAY_RMAP1ul +/* + * The next item in an encoded_page array is the "nr_pages" argument, specifying + * the number of consecutive pages starting from this page, that all belong to + * the same folio. For example, "nr_pages" corresponds to the number of folio + * references that mu
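The "next array entry is really nr_pages" scheme from the commit message can be illustrated with a small userspace model. Everything here is an assumption for illustration: the MOCK_* names, the helpers, and in particular the shift of 2 used to store the count are not taken from the (truncated) patch. The sketch only shows how a consumer walking a batch tells a page slot from a count slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_FLAG_DELAY_RMAP  ((uintptr_t)1)  /* bit 0, as in patch 6/9 */
#define MOCK_FLAG_NR_PAGES    ((uintptr_t)2)  /* the remaining spare bit */
#define MOCK_FLAG_MASK        ((uintptr_t)3)
#define MOCK_NR_SHIFT         2               /* assumed shift for the count */

/* Tag an (aligned) page address with flag bits in its two low bits. */
static uintptr_t mock_encode_page(uintptr_t page_addr, uintptr_t flags)
{
	assert((page_addr & MOCK_FLAG_MASK) == 0);
	assert((flags & ~MOCK_FLAG_MASK) == 0);
	return page_addr | flags;
}

/* Store a page count in a slot, shifted so the flag bits stay clear. */
static uintptr_t mock_encode_nr_pages(unsigned long nr)
{
	return (uintptr_t)nr << MOCK_NR_SHIFT;
}

/* Walk a batch and return the total number of pages it describes. */
static unsigned long mock_count_pages(const uintptr_t *batch, size_t n)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (batch[i] & MOCK_FLAG_NR_PAGES) {
			/* The following slot is a count, not a page. */
			assert(i + 1 < n);
			total += batch[++i] >> MOCK_NR_SHIFT;
		} else {
			total += 1;
		}
	}
	return total;
}
```

The model also makes the "-1" remark in the commit message concrete: a multi-page entry needs two slots, so a batch has to keep one slot in reserve in case the next entry turns out to be a large-folio range.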
[PATCH v1 6/9] mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP
Nowadays, encoded pages are only used in mmu_gather handling. Let's update the documentation, and define ENCODED_PAGE_BIT_DELAY_RMAP. While at it, rename ENCODE_PAGE_BITS to ENCODED_PAGE_BITS. If encoded page pointers would ever be used in other context again, we'd likely want to change the defines to reflect their context (e.g., ENCODED_PAGE_FLAG_MMU_GATHER_DELAY_RMAP). For now, let's keep it simple. This is a preparation for using the remaining spare bit to indicate that the next item in an array of encoded pages is a "nr_pages" argument and not an encoded page. Signed-off-by: David Hildenbrand --- include/linux/mm_types.h | 17 +++-- mm/mmu_gather.c | 5 +++-- 2 files changed, 14 insertions(+), 8 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 8b611e13153e..1b89eec0d6df 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -210,8 +210,8 @@ struct page { * * An 'encoded_page' pointer is a pointer to a regular 'struct page', but * with the low bits of the pointer indicating extra context-dependent - * information. Not super-common, but happens in mmu_gather and mlock - * handling, and this acts as a type system check on that use. + * information. Only used in mmu_gather handling, and this acts as a type + * system check on that use. * * We only really have two guaranteed bits in general, although you could * play with 'struct page' alignment (see CONFIG_HAVE_ALIGNED_STRUCT_PAGE) @@ -220,21 +220,26 @@ struct page { * Use the supplied helper functions to endcode/decode the pointer and bits. */ struct encoded_page; -#define ENCODE_PAGE_BITS 3ul + +#define ENCODED_PAGE_BITS 3ul + +/* Perform rmap removal after we have flushed the TLB. 
*/ +#define ENCODED_PAGE_BIT_DELAY_RMAP1ul + static __always_inline struct encoded_page *encode_page(struct page *page, unsigned long flags) { - BUILD_BUG_ON(flags > ENCODE_PAGE_BITS); + BUILD_BUG_ON(flags > ENCODED_PAGE_BITS); return (struct encoded_page *)(flags | (unsigned long)page); } static inline unsigned long encoded_page_flags(struct encoded_page *page) { - return ENCODE_PAGE_BITS & (unsigned long)page; + return ENCODED_PAGE_BITS & (unsigned long)page; } static inline struct page *encoded_page_ptr(struct encoded_page *page) { - return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page); + return (struct page *)(~ENCODED_PAGE_BITS & (unsigned long)page); } /* diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c index ac733d81b112..6540c99c6758 100644 --- a/mm/mmu_gather.c +++ b/mm/mmu_gather.c @@ -53,7 +53,7 @@ static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch, struct vm_area_ for (int i = 0; i < batch->nr; i++) { struct encoded_page *enc = batch->encoded_pages[i]; - if (encoded_page_flags(enc)) { + if (encoded_page_flags(enc) & ENCODED_PAGE_BIT_DELAY_RMAP) { struct page *page = encoded_page_ptr(enc); folio_remove_rmap_pte(page_folio(page), page, vma); } @@ -119,6 +119,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb) bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, bool delay_rmap, int page_size) { + int flags = delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0; struct mmu_gather_batch *batch; VM_BUG_ON(!tlb->end); @@ -132,7 +133,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, * Add the page and check if we are full. If so * force a flush. */ - batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap); + batch->encoded_pages[batch->nr++] = encode_page(page, flags); if (batch->nr == batch->max) { if (!tlb_next_batch(tlb)) return true; -- 2.43.0
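The switch from "if (encoded_page_flags(enc))" to testing ENCODED_PAGE_BIT_DELAY_RMAP explicitly matters as soon as the second low bit carries a different meaning. A minimal userspace sketch of the pointer-tagging helpers, assuming an 8-byte-aligned object stands in for struct page; the mock_* names and MOCK_ENCODED_PAGE_BIT_OTHER are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_ENCODED_PAGE_BITS            3ul
#define MOCK_ENCODED_PAGE_BIT_DELAY_RMAP  1ul
#define MOCK_ENCODED_PAGE_BIT_OTHER       2ul  /* hypothetical second flag */

/* Fold flag bits into the low bits of a sufficiently aligned pointer. */
static uintptr_t mock_encode(void *page, unsigned long flags)
{
	assert(flags <= MOCK_ENCODED_PAGE_BITS);
	assert(((uintptr_t)page & MOCK_ENCODED_PAGE_BITS) == 0);
	return (uintptr_t)page | flags;
}

/* Extract only the flag bits. */
static unsigned long mock_flags(uintptr_t enc)
{
	return enc & MOCK_ENCODED_PAGE_BITS;
}

/* Recover the original pointer by masking the flag bits off again. */
static void *mock_ptr(uintptr_t enc)
{
	return (void *)(enc & ~(uintptr_t)MOCK_ENCODED_PAGE_BITS);
}

/* Test the one bit we care about instead of "any flag set". */
static int mock_wants_delayed_rmap(uintptr_t enc)
{
	return (mock_flags(enc) & MOCK_ENCODED_PAGE_BIT_DELAY_RMAP) != 0;
}
```

With only one flag defined, "any flag set" and "DELAY_RMAP set" are equivalent; an entry tagged with just the other bit is where the two checks start to disagree.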
[PATCH v1 5/9] mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size()
We have two bits available in the encoded page pointer to store additional information. Currently, we use one bit to request delay of the rmap removal until after a TLB flush. We want to make use of the remaining bit internally for batching of multiple pages of the same folio, specifying that the next encoded page pointer in an array is actually "nr_pages". So pass page + delay_rmap flag instead of an encoded page, to handle the encoding internally. Signed-off-by: David Hildenbrand --- arch/s390/include/asm/tlb.h | 13 ++--- include/asm-generic/tlb.h | 12 ++-- mm/mmu_gather.c | 7 --- 3 files changed, 16 insertions(+), 16 deletions(-) diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h index d1455a601adc..48df896d5b79 100644 --- a/arch/s390/include/asm/tlb.h +++ b/arch/s390/include/asm/tlb.h @@ -25,8 +25,7 @@ void __tlb_remove_table(void *_table); static inline void tlb_flush(struct mmu_gather *tlb); static inline bool __tlb_remove_page_size(struct mmu_gather *tlb, - struct encoded_page *page, - int page_size); + struct page *page, bool delay_rmap, int page_size); #define tlb_flush tlb_flush #define pte_free_tlb pte_free_tlb @@ -42,14 +41,14 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb, * tlb_ptep_clear_flush. In both flush modes the tlb for a page cache page * has already been freed, so just do free_page_and_swap_cache. * - * s390 doesn't delay rmap removal, so there is nothing encoded in - * the page pointer. + * s390 doesn't delay rmap removal. 
*/ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb, - struct encoded_page *page, - int page_size) + struct page *page, bool delay_rmap, int page_size) { - free_page_and_swap_cache(encoded_page_ptr(page)); + VM_WARN_ON_ONCE(delay_rmap); + + free_page_and_swap_cache(page); return false; } diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 129a3a759976..2eb7b0d4f5d2 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -260,9 +260,8 @@ struct mmu_gather_batch { */ #define MAX_GATHER_BATCH_COUNT (1UL/MAX_GATHER_BATCH) -extern bool __tlb_remove_page_size(struct mmu_gather *tlb, - struct encoded_page *page, - int page_size); +extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, + bool delay_rmap, int page_size); #ifdef CONFIG_SMP /* @@ -462,13 +461,14 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) static inline void tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size) { - if (__tlb_remove_page_size(tlb, encode_page(page, 0), page_size)) + if (__tlb_remove_page_size(tlb, page, false, page_size)) tlb_flush_mmu(tlb); } -static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page, unsigned int flags) +static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb, + struct page *page, bool delay_rmap) { - return __tlb_remove_page_size(tlb, encode_page(page, flags), PAGE_SIZE); + return __tlb_remove_page_size(tlb, page, delay_rmap, PAGE_SIZE); } /* tlb_remove_page diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c index 604ddf08affe..ac733d81b112 100644 --- a/mm/mmu_gather.c +++ b/mm/mmu_gather.c @@ -116,7 +116,8 @@ static void tlb_batch_list_free(struct mmu_gather *tlb) tlb->local.next = NULL; } -bool __tlb_remove_page_size(struct mmu_gather *tlb, struct encoded_page *page, int page_size) +bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, + bool delay_rmap, int page_size) { struct 
mmu_gather_batch *batch; @@ -131,13 +132,13 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct encoded_page *page, i * Add the page and check if we are full. If so * force a flush. */ - batch->encoded_pages[batch->nr++] = page; + batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap); if (batch->nr == batch->max) { if (!tlb_next_batch(tlb)) return true; batch = tlb->active; } - VM_BUG_ON_PAGE(batch->nr > batch->max, encoded_page_ptr(page)); + VM_BUG_ON_PAGE(batch->nr > batch->max, page); return false; } -- 2.43.0
[PATCH v1 4/9] mm/memory: factor out zapping folio pte into zap_present_folio_pte()
Let's prepare for further changes by factoring it out into a separate function. Signed-off-by: David Hildenbrand --- mm/memory.c | 53 - 1 file changed, 32 insertions(+), 21 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 20bc13ab8db2..a2190d7cfa74 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1528,30 +1528,14 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, pte_install_uffd_wp_if_needed(vma, addr, pte, pteval); } -static inline void zap_present_pte(struct mmu_gather *tlb, - struct vm_area_struct *vma, pte_t *pte, pte_t ptent, - unsigned long addr, struct zap_details *details, - int *rss, bool *force_flush, bool *force_break) +static inline void zap_present_folio_pte(struct mmu_gather *tlb, + struct vm_area_struct *vma, struct folio *folio, + struct page *page, pte_t *pte, pte_t ptent, unsigned long addr, + struct zap_details *details, int *rss, bool *force_flush, + bool *force_break) { struct mm_struct *mm = tlb->mm; bool delay_rmap = false; - struct folio *folio; - struct page *page; - - page = vm_normal_page(vma, addr, ptent); - if (!page) { - /* We don't need up-to-date accessed/dirty bits. 
*/ - ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); - arch_check_zapped_pte(vma, ptent); - tlb_remove_tlb_entry(tlb, pte, addr); - VM_WARN_ON_ONCE(userfaultfd_wp(vma)); - ksm_might_unmap_zero_page(mm, ptent); - return; - } - - folio = page_folio(page); - if (unlikely(!should_zap_folio(details, folio))) - return; if (!folio_test_anon(folio)) { ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); @@ -1586,6 +1570,33 @@ static inline void zap_present_pte(struct mmu_gather *tlb, } } +static inline void zap_present_pte(struct mmu_gather *tlb, + struct vm_area_struct *vma, pte_t *pte, pte_t ptent, + unsigned long addr, struct zap_details *details, + int *rss, bool *force_flush, bool *force_break) +{ + struct mm_struct *mm = tlb->mm; + struct folio *folio; + struct page *page; + + page = vm_normal_page(vma, addr, ptent); + if (!page) { + /* We don't need up-to-date accessed/dirty bits. */ + ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + arch_check_zapped_pte(vma, ptent); + tlb_remove_tlb_entry(tlb, pte, addr); + VM_WARN_ON_ONCE(userfaultfd_wp(vma)); + ksm_might_unmap_zero_page(mm, ptent); + return; + } + + folio = page_folio(page); + if (unlikely(!should_zap_folio(details, folio))) + return; + zap_present_folio_pte(tlb, vma, folio, page, pte, ptent, addr, details, + rss, force_flush, force_break); +} + static unsigned long zap_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, -- 2.43.0
[PATCH v1 3/9] mm/memory: further separate anon and pagecache folio handling in zap_present_pte()
We don't need up-to-date accessed-dirty information for anon folios and can simply work with the ptent we already have. Also, we know the RSS counter we want to update. We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() + zap_install_uffd_wp_if_needed() after updating the folio and RSS. While at it, only call zap_install_uffd_wp_if_needed() if there is even any chance that pte_install_uffd_wp_if_needed() would do *something*. That is, just don't bother if uffd-wp does not apply. Signed-off-by: David Hildenbrand --- mm/memory.c | 16 +++- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 69502cdc0a7d..20bc13ab8db2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1552,12 +1552,9 @@ static inline void zap_present_pte(struct mmu_gather *tlb, folio = page_folio(page); if (unlikely(!should_zap_folio(details, folio))) return; - ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); - arch_check_zapped_pte(vma, ptent); - tlb_remove_tlb_entry(tlb, pte, addr); - zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent); if (!folio_test_anon(folio)) { + ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); if (pte_dirty(ptent)) { folio_mark_dirty(folio); if (tlb_delay_rmap(tlb)) { @@ -1567,8 +1564,17 @@ static inline void zap_present_pte(struct mmu_gather *tlb, } if (pte_young(ptent) && likely(vma_has_recency(vma))) folio_mark_accessed(folio); + rss[mm_counter(folio)]--; + } else { + /* We don't need up-to-date accessed/dirty bits. */ + ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + rss[MM_ANONPAGES]--; } - rss[mm_counter(folio)]--; + arch_check_zapped_pte(vma, ptent); + tlb_remove_tlb_entry(tlb, pte, addr); + if (unlikely(userfaultfd_pte_wp(vma, ptent))) + zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent); + if (!delay_rmap) { folio_remove_rmap_pte(folio, page, vma); if (unlikely(page_mapcount(page) < 0)) -- 2.43.0
[PATCH v1 2/9] mm/memory: handle !page case in zap_present_pte() separately
We don't need up-to-date accessed/dirty bits, so in theory we could replace ptep_get_and_clear_full() by an optimized ptep_clear_full() function. Let's rely on the provided pte.

Further, there is no scenario where we would have to insert uffd-wp markers when zapping something that is not a normal page (i.e., zeropage). Add a sanity check to make sure this remains true.

should_zap_folio() no longer has to handle NULL pointers. This change replaces 2/3 "!page/!folio" checks by a single "!page" one.

Signed-off-by: David Hildenbrand
---
 mm/memory.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 50a6c79c78fc..69502cdc0a7d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1497,10 +1497,6 @@ static inline bool should_zap_folio(struct zap_details *details,
 	if (should_zap_cows(details))
 		return true;
 
-	/* E.g. the caller passes NULL for the case of a zero folio */
-	if (!folio)
-		return true;
-
 	/* Otherwise we should only zap non-anon folios */
 	return !folio_test_anon(folio);
 }
@@ -1543,19 +1539,23 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
 	struct page *page;
 
 	page = vm_normal_page(vma, addr, ptent);
-	if (page)
-		folio = page_folio(page);
+	if (!page) {
+		/* We don't need up-to-date accessed/dirty bits. */
+		ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+		arch_check_zapped_pte(vma, ptent);
+		tlb_remove_tlb_entry(tlb, pte, addr);
+		VM_WARN_ON_ONCE(userfaultfd_wp(vma));
+		ksm_might_unmap_zero_page(mm, ptent);
+		return;
+	}
 
+	folio = page_folio(page);
 	if (unlikely(!should_zap_folio(details, folio)))
 		return;
 	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
 	arch_check_zapped_pte(vma, ptent);
 	tlb_remove_tlb_entry(tlb, pte, addr);
 	zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
-	if (unlikely(!page)) {
-		ksm_might_unmap_zero_page(mm, ptent);
-		return;
-	}
 
 	if (!folio_test_anon(folio)) {
 		if (pte_dirty(ptent)) {
-- 
2.43.0
[PATCH v1 1/9] mm/memory: factor out zapping of present pte into zap_present_pte()
Let's prepare for further changes by factoring out processing of present PTEs. Signed-off-by: David Hildenbrand --- mm/memory.c | 92 ++--- 1 file changed, 52 insertions(+), 40 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index b05fd28dbce1..50a6c79c78fc 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1532,13 +1532,61 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, pte_install_uffd_wp_if_needed(vma, addr, pte, pteval); } +static inline void zap_present_pte(struct mmu_gather *tlb, + struct vm_area_struct *vma, pte_t *pte, pte_t ptent, + unsigned long addr, struct zap_details *details, + int *rss, bool *force_flush, bool *force_break) +{ + struct mm_struct *mm = tlb->mm; + bool delay_rmap = false; + struct folio *folio; + struct page *page; + + page = vm_normal_page(vma, addr, ptent); + if (page) + folio = page_folio(page); + + if (unlikely(!should_zap_folio(details, folio))) + return; + ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + arch_check_zapped_pte(vma, ptent); + tlb_remove_tlb_entry(tlb, pte, addr); + zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent); + if (unlikely(!page)) { + ksm_might_unmap_zero_page(mm, ptent); + return; + } + + if (!folio_test_anon(folio)) { + if (pte_dirty(ptent)) { + folio_mark_dirty(folio); + if (tlb_delay_rmap(tlb)) { + delay_rmap = true; + *force_flush = true; + } + } + if (pte_young(ptent) && likely(vma_has_recency(vma))) + folio_mark_accessed(folio); + } + rss[mm_counter(folio)]--; + if (!delay_rmap) { + folio_remove_rmap_pte(folio, page, vma); + if (unlikely(page_mapcount(page) < 0)) + print_bad_pte(vma, addr, ptent, page); + } + if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) { + *force_flush = true; + *force_break = true; + } +} + static unsigned long zap_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, struct zap_details *details) { + bool force_flush = false, force_break = false; struct mm_struct *mm = 
tlb->mm; - int force_flush = 0; int rss[NR_MM_COUNTERS]; spinlock_t *ptl; pte_t *start_pte; @@ -1565,45 +1613,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, break; if (pte_present(ptent)) { - unsigned int delay_rmap; - - page = vm_normal_page(vma, addr, ptent); - if (page) - folio = page_folio(page); - - if (unlikely(!should_zap_folio(details, folio))) - continue; - ptent = ptep_get_and_clear_full(mm, addr, pte, - tlb->fullmm); - arch_check_zapped_pte(vma, ptent); - tlb_remove_tlb_entry(tlb, pte, addr); - zap_install_uffd_wp_if_needed(vma, addr, pte, details, - ptent); - if (unlikely(!page)) { - ksm_might_unmap_zero_page(mm, ptent); - continue; - } - - delay_rmap = 0; - if (!folio_test_anon(folio)) { - if (pte_dirty(ptent)) { - folio_mark_dirty(folio); - if (tlb_delay_rmap(tlb)) { - delay_rmap = 1; - force_flush = 1; - } - } - if (pte_young(ptent) && likely(vma_has_recency(vma))) - folio_mark_accessed(folio); - } - rss[mm_counter(folio)]--; - if (!delay_rmap) { - folio_remove_rmap_pte(folio, page, vma); - if (unlikely(page_mapcount(page) < 0)) - print_bad_pte(vma, addr, ptent, page); - } - if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) { - force_flush = 1; +
[PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP
This series is based on [1] and must be applied on top of it. Similar to what we did with fork(), let's implement PTE batching during unmap/zap when processing PTE-mapped THPs.

We collect consecutive PTEs that map consecutive pages of the same large folio, making sure that the other PTE bits are compatible, and (a) adjust the refcount only once per batch, (b) call rmap handling functions only once per batch, (c) perform batch PTE setting/updates and (d) perform TLB entry removal once per batch.

Ryan was previously working on this in the context of cont-pte for arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only. This series implements the optimization for all architectures, independent of such PTE bits, teaches MMU gather/TLB code to be fully aware of such large-folio-pages batches as well, and makes use of our new rmap batching function when removing the rmap.

To achieve that, we have to enlighten MMU gather / page freeing code (i.e., everything that consumes encoded_page) to process unmapping of consecutive pages that all belong to the same large folio. I'm being very careful to not degrade order-0 performance, and it looks like I managed to achieve that.

While this series should -- similar to [1] -- be beneficial for adding cont-pte support on arm64[2], it's one of the requirements for maintaining a total mapcount[3] for large folios with minimal added overhead and further changes[4] that build up on top of the total mapcount.

Independent of all that, this series results in a speedup during munmap() and similar unmapping (process teardown, MADV_DONTNEED on larger ranges) with PTE-mapped THP, which is the default with THPs that are smaller than a PMD (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).
On an Intel Xeon Silver 4210R CPU, munmap'ing a 1GiB VMA backed by PTE-mapped folios of the same size (stddev < 1%) results in the following runtimes for munmap() in seconds (shorter is better):

Folio Size | mm-unstable |      New | Change
---------------------------------------------
      4KiB |    0.058110 | 0.057715 |   - 1%
     16KiB |    0.044198 | 0.035469 |   -20%
     32KiB |    0.034216 | 0.023522 |   -31%
     64KiB |    0.029207 | 0.018434 |   -37%
    128KiB |    0.026579 | 0.014026 |   -47%
    256KiB |    0.025130 | 0.011756 |   -53%
    512KiB |    0.024292 | 0.010703 |   -56%
   1024KiB |    0.023812 | 0.010294 |   -57%
   2048KiB |    0.023785 | 0.009910 |   -58%

CCing especially s390x folks, because they have TLB freeing hooks that need adjustment. Only tested on x86-64 for now, will have to do some more stress testing. Compile-tested on most other architectures. The PPC change is negligible and makes my cross-compiler happy.

[1] https://lkml.kernel.org/r/20240129124649.189745-1-da...@redhat.com
[2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.robe...@arm.com
[3] https://lkml.kernel.org/r/20230809083256.699513-1-da...@redhat.com
[4] https://lkml.kernel.org/r/20231124132626.235350-1-da...@redhat.com
[5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.robe...@arm.com

Cc: Andrew Morton
Cc: Matthew Wilcox (Oracle)
Cc: Ryan Roberts
Cc: Catalin Marinas
Cc: Will Deacon
Cc: "Aneesh Kumar K.V"
Cc: Nick Piggin
Cc: Peter Zijlstra
Cc: Michael Ellerman
Cc: Christophe Leroy
Cc: "Naveen N. Rao"
Cc: Heiko Carstens
Cc: Vasily Gorbik
Cc: Alexander Gordeev
Cc: Christian Borntraeger
Cc: Sven Schnelle
Cc: Arnd Bergmann
Cc: linux-a...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org

David Hildenbrand (9):
  mm/memory: factor out zapping of present pte into zap_present_pte()
  mm/memory: handle !page case in zap_present_pte() separately
  mm/memory: further separate anon and pagecache folio handling in zap_present_pte()
  mm/memory: factor out zapping folio pte into zap_present_folio_pte()
  mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size()
  mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP
  mm/mmu_gather: add __tlb_remove_folio_pages()
  mm/mmu_gather: add tlb_remove_tlb_entries()
  mm/memory: optimize unmap/zap with PTE-mapped THP

 arch/powerpc/include/asm/tlb.h |   2 +
 arch/s390/include/asm/tlb.h    |  30 --
 include/asm-generic/tlb.h      |  40 ++--
 include/linux/mm_types.h       |  37 ++--
 include/linux/pgtable.h        |  66 +
 mm/memory.c                    | 167 +++--
 mm/mmu_gather.c                |  63 +++--
 mm/swap.c                      |  12 ++-
 mm/swap_state.c                |  12 ++-
 9 files changed, 347 insertions(+), 82 deletions(-)

--
2.43.0
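As a rough illustration of the batching criterion described in the cover letter above (consecutive PTEs that map consecutive pages of the same large folio, with the remaining bits compatible), here is a minimal user-space sketch in C. It is not code from this series: `struct toy_pte`, `toy_pte_batch()` and the `folio_end_pfn` parameter are made-up names for illustration, and the kernel works on pte_t and folios rather than on a plain array.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a page table entry; NOT the kernel's pte_t. */
struct toy_pte {
	uint64_t pfn;     /* page frame number the entry maps */
	uint32_t flags;   /* bits that must be identical within one batch */
	int present;      /* non-zero if the entry maps a page */
};

/*
 * Hypothetical helper: return how many entries starting at ptes[0] could be
 * processed as one batch. Every entry in the batch must be present, map the
 * pfn directly following its predecessor, carry the same flags as the first
 * entry, and stay below folio_end_pfn (exclusive end of the large folio).
 * max_nr is the number of entries left in the range being unmapped.
 * The caller is assumed to pass a first entry that lies inside the folio.
 */
size_t toy_pte_batch(const struct toy_pte *ptes, size_t max_nr,
		     uint64_t folio_end_pfn)
{
	size_t nr;

	if (max_nr == 0 || !ptes[0].present)
		return 0;

	for (nr = 1; nr < max_nr; nr++) {
		const struct toy_pte *cur = &ptes[nr];

		if (!cur->present)
			break;                          /* hole in the mapping */
		if (cur->pfn != ptes[0].pfn + nr)
			break;                          /* pages not consecutive */
		if (cur->flags != ptes[0].flags)
			break;                          /* incompatible PTE bits */
		if (cur->pfn >= folio_end_pfn)
			break;                          /* left the large folio */
	}
	return nr;
}
```

With a batch length in hand, the per-batch work listed as (a) to (d) in the cover letter would be done once for `nr` entries instead of once per entry; an order-0 page simply yields a batch of one.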
[PATCH v10 6/6] arm64: introduce copy_mc_to_kernel() implementation
The copy_mc_to_kernel() helper is a memory copy implementation that handles source exceptions. It can be used in memory copy scenarios that tolerate hardware memory errors (e.g. pmem_read/dax_copy_to_iter). Currently, only x86 and ppc support this helper. Now that arm64 supports the machine check safe framework, we introduce a copy_mc_to_kernel() implementation. Signed-off-by: Tong Tiangen --- arch/arm64/include/asm/string.h | 5 + arch/arm64/include/asm/uaccess.h | 21 +++ arch/arm64/lib/Makefile | 2 +- arch/arm64/lib/memcpy_mc.S | 257 +++ mm/kasan/shadow.c| 12 ++ 5 files changed, 296 insertions(+), 1 deletion(-) create mode 100644 arch/arm64/lib/memcpy_mc.S diff --git a/arch/arm64/include/asm/string.h b/arch/arm64/include/asm/string.h index 3a3264ff47b9..995b63c26e99 100644 --- a/arch/arm64/include/asm/string.h +++ b/arch/arm64/include/asm/string.h @@ -35,6 +35,10 @@ extern void *memchr(const void *, int, __kernel_size_t); extern void *memcpy(void *, const void *, __kernel_size_t); extern void *__memcpy(void *, const void *, __kernel_size_t); +#define __HAVE_ARCH_MEMCPY_MC +extern int memcpy_mcs(void *, const void *, __kernel_size_t); +extern int __memcpy_mcs(void *, const void *, __kernel_size_t); + #define __HAVE_ARCH_MEMMOVE extern void *memmove(void *, const void *, __kernel_size_t); extern void *__memmove(void *, const void *, __kernel_size_t); @@ -57,6 +61,7 @@ void memcpy_flushcache(void *dst, const void *src, size_t cnt); */ #define memcpy(dst, src, len) __memcpy(dst, src, len) +#define memcpy_mcs(dst, src, len) __memcpy_mcs(dst, src, len) #define memmove(dst, src, len) __memmove(dst, src, len) #define memset(s, c, n) __memset(s, c, n) diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h index 14be5000c5a0..61e28ef2112a 100644 --- a/arch/arm64/include/asm/uaccess.h +++ b/arch/arm64/include/asm/uaccess.h @@ -425,4 +425,25 @@ static inline size_t probe_subpage_writeable(const char __user *uaddr, #endif /* CONFIG_ARCH_HAS_SUBPAGE_FAULTS */ 
+#ifdef CONFIG_ARCH_HAS_COPY_MC +/** + * copy_mc_to_kernel - memory copy that handles source exceptions + * + * @dst: destination address + * @src: source address + * @len: number of bytes to copy + * + * Return 0 for success, or #size if there was an exception. + */ +static inline unsigned long __must_check +copy_mc_to_kernel(void *to, const void *from, unsigned long size) +{ + int ret; + + ret = memcpy_mcs(to, from, size); + return (ret == -EFAULT) ? size : 0; +} +#define copy_mc_to_kernel copy_mc_to_kernel +#endif + #endif /* __ASM_UACCESS_H */ diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile index a2fd865b816d..899d6ae9698c 100644 --- a/arch/arm64/lib/Makefile +++ b/arch/arm64/lib/Makefile @@ -3,7 +3,7 @@ lib-y := clear_user.o delay.o copy_from_user.o \ copy_to_user.o copy_page.o \ clear_page.o csum.o insn.o memchr.o memcpy.o \ memset.o memcmp.o strcmp.o strncmp.o strlen.o\ - strnlen.o strchr.o strrchr.o tishift.o + strnlen.o strchr.o strrchr.o tishift.o memcpy_mc.o ifeq ($(CONFIG_KERNEL_MODE_NEON), y) obj-$(CONFIG_XOR_BLOCKS) += xor-neon.o diff --git a/arch/arm64/lib/memcpy_mc.S b/arch/arm64/lib/memcpy_mc.S new file mode 100644 index ..7076b500d154 --- /dev/null +++ b/arch/arm64/lib/memcpy_mc.S @@ -0,0 +1,257 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (c) 2012-2021, Arm Limited. + * + * Adapted from the original at: + * https://github.com/ARM-software/optimized-routines/blob/afd6244a1f8d9229/string/aarch64/memcpy.S + */ + +#include +#include + +/* Assumptions: + * + * ARMv8-a, AArch64, unaligned accesses. 
+ * + */ + +#define L(label) .L ## label + +#define dstin x0 +#define srcx1 +#define count x2 +#define dstx3 +#define srcend x4 +#define dstend x5 +#define A_lx6 +#define A_lw w6 +#define A_hx7 +#define B_lx8 +#define B_lw w8 +#define B_hx9 +#define C_lx10 +#define C_lw w10 +#define C_hx11 +#define D_lx12 +#define D_hx13 +#define E_lx14 +#define E_hx15 +#define F_lx16 +#define F_hx17 +#define G_lcount +#define G_hdst +#define H_lsrc +#define H_hsrcend +#define tmp1 x14 + +/* This implementation handles overlaps and supports both memcpy and memmove + from a single entry point. It uses unaligned accesses and branchless + sequences to keep the code small, simple and improve performance. + + Copies are split into 3 main cases: small copies of up to 32 bytes, medium + copies of up to 128 bytes, and large copies. The overhead of the overlap + check is negligible since it is only required for large copies. + + Large copies use a software pipelined loop processing 64 bytes per iteration. + The destinatio
[PATCH v10 0/6] arm64: add machine check safe support
With the increase of memory capacity and density, the probability of memory errors also increases. The growing size and density of server RAM in data centers and clouds has led to more uncorrectable memory errors.

Currently, more and more scenarios can tolerate memory errors, such as CoW[1,2], KSM copy[3], coredump copy[4], khugepaged[5,6], uaccess copy[7], etc.

This patchset introduces a new processing framework on ARM64, which enables ARM64 to support error recovery in the above scenarios, and more scenarios can be expanded based on this in the future.

On arm64, memory errors are handled in do_sea(), which distinguishes two cases:
1. If the user state consumed the memory errors, the solution is to kill the user process and isolate the error page.
2. If the kernel state consumed the memory errors, the solution is to panic.

For case 2, an undifferentiated panic may not be the optimal choice, as it can be handled better. In some scenarios we can avoid the panic. Take uaccess: if the uaccess fails due to a memory error, only the user process is affected, so killing the user process and isolating the user page with hardware memory errors is a better choice.

[1] commit d302c2398ba2 ("mm, hwpoison: when copy-on-write hits poison, take page offline")
[2] commit 1cb9dc4b475c ("mm: hwpoison: support recovery from HugePage copy-on-write faults")
[3] commit 6b970599e807 ("mm: hwpoison: support recovery from ksm_might_need_to_copy()")
[4] commit 245f09226893 ("mm: hwpoison: coredump: support recovery from dump_user_range()")
[5] commit 98c76c9f1ef7 ("mm/khugepaged: recover from poisoned anonymous memory")
[6] commit 12904d953364 ("mm/khugepaged: recover from poisoned file-backed memory")
[7] commit 278b917f8cb9 ("x86/mce: Add _ASM_EXTABLE_CPY for copy user access")

Since V9:
1. Rebase to latest kernel version 6.8-rc2.
2. Add patch 6/6 to support copy_mc_to_kernel().

Since V8:
1. Rebase to latest kernel version and fix typos in some of the patches.
2. According to the suggestion of Catalin, I attempted to modify the return value of function copy_mc_[user]_highpage() to bytes not copied. During the modification process, I found that it would be more reasonable to return -EFAULT when a copy error occurs (referring to the newly added patch 4). For ARM64, the implementation of copy_mc_[user]_highpage() needs to consider MTE. Considering the scenario where data copying is successful but the MTE tag copying fails, it is also not reasonable to return bytes not copied.
3. Considering the recent addition of machine check safe support for multiple scenarios, modify the commit message for patch 5 (patch 4 for V8).

Since V7:
Currently, there are patches supporting recovery from poison consumption for the cow scenario[1]. Therefore, supporting the cow scenario on the arm64 architecture only needs to modify the relevant code under arch/.
[1]https://lore.kernel.org/lkml/20221031201029.102123-1-tony.l...@intel.com/

Since V6:
Resend patches that are not merged into the mainline in V6.

Since V5:
1. Add patch2/3 to add uaccess assembly helpers.
2. Optimize the implementation logic of arm64_do_kernel_sea() in patch8.
3. Remove kernel access fixup in patch9.
All suggestions are from Mark.

Since V4:
1. According to Michael's suggestion, add patch5.
2. According to Mark's suggestion, do some restructuring of the arm64 extable, then a new adaptation of machine check safe support is made based on this.
3. According to Mark's suggestion, support machine check safe in do_mte() in the cow scenario.
4. In V4, two patches have been merged into -next, so V5 does not resend these two patches.

Since V3:
1. According to Robin's suggestion, directly modify user_ldst and user_ldp in asm-uaccess.h and modify mte.S.
2. Add new macro USER_MC in asm-uaccess.h, used in copy_from_user.S and copy_to_user.S.
3. According to Robin's suggestion, use a macro in copy_page_mc.S to simplify the code.
4. According to KeFeng's suggestion, modify powerpc code in patch1.
5. According to KeFeng's suggestion, modify mm/extable.c and do some code optimization.

Since V2:
1. According to Mark's suggestion, all uaccess can be recovered due to memory error.
2. The pagecache reading scenario is also supported as part of uaccess (copy_to_user()) and the code duplication problem is also solved. Thanks for Robin's suggestion.
3. According to Mark's suggestion, update the commit message of patch 2/5.
4. According to Borislav's suggestion, update the commit message of patch 1/5.

Since V1:
1. Consistent with PPC/x86, use CONFIG_ARCH_HAS_COPY_MC instead of ARM64_UCE_KERNEL_RECOVERY.
2. Add two new scenarios, cow and pagecache reading.
3. Fix two small bugs (the first two patches).

V1 is here: https://lore.kernel.org/lkml/20220323033705.3966643-1-tongtian...@huawei.com/

Tong Tiangen (6):
  uaccess: add generic fallback version of copy_mc_to_user()
  arm64: add support for machine check error safe
  arm64: add uaccess to mach
[PATCH v10 2/6] arm64: add support for machine check error safe
When the arm64 kernel handles hardware memory errors reported through synchronous notifications (do_sea()), it currently panics if the error is consumed within the kernel. However, this is not optimal. Take uaccess for example: if the uaccess operation fails due to a memory error, only the user process will be affected. Killing the user process and isolating the corrupt page is a better choice. This patch only enables the machine check safe framework and adds an exception fixup before the kernel panic in do_sea(). Signed-off-by: Tong Tiangen --- arch/arm64/Kconfig | 1 + arch/arm64/include/asm/extable.h | 1 + arch/arm64/mm/extable.c | 16 arch/arm64/mm/fault.c| 29 - 4 files changed, 46 insertions(+), 1 deletion(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index aa7c1d435139..2cc34b5e7abb 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -20,6 +20,7 @@ config ARM64 select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2 select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE select ARCH_HAS_CACHE_LINE_SIZE + select ARCH_HAS_COPY_MC if ACPI_APEI_GHES select ARCH_HAS_CURRENT_STACK_POINTER select ARCH_HAS_DEBUG_VIRTUAL select ARCH_HAS_DEBUG_VM_PGTABLE diff --git a/arch/arm64/include/asm/extable.h b/arch/arm64/include/asm/extable.h index 72b0e71cc3de..f80ebd0addfd 100644 --- a/arch/arm64/include/asm/extable.h +++ b/arch/arm64/include/asm/extable.h @@ -46,4 +46,5 @@ bool ex_handler_bpf(const struct exception_table_entry *ex, #endif /* !CONFIG_BPF_JIT */ bool fixup_exception(struct pt_regs *regs); +bool fixup_exception_mc(struct pt_regs *regs); #endif diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c index 228d681a8715..478e639f8680 100644 --- a/arch/arm64/mm/extable.c +++ b/arch/arm64/mm/extable.c @@ -76,3 +76,19 @@ bool fixup_exception(struct pt_regs *regs) BUG(); } + +bool fixup_exception_mc(struct pt_regs *regs) +{ + const struct exception_table_entry *ex; + + ex = search_exception_tables(instruction_pointer(regs)); + if (!ex) + 
return false; + + /* +* This is not complete, More Machine check safe extable type can +* be processed here. +*/ + + return false; +} diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 55f6455a8284..312932dc100b 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -730,6 +730,31 @@ static int do_bad(unsigned long far, unsigned long esr, struct pt_regs *regs) return 1; /* "fault" */ } +static bool arm64_do_kernel_sea(unsigned long addr, unsigned int esr, +struct pt_regs *regs, int sig, int code) +{ + if (!IS_ENABLED(CONFIG_ARCH_HAS_COPY_MC)) + return false; + + if (user_mode(regs)) + return false; + + if (apei_claim_sea(regs) < 0) + return false; + + if (!fixup_exception_mc(regs)) + return false; + + if (current->flags & PF_KTHREAD) + return true; + + set_thread_esr(0, esr); + arm64_force_sig_fault(sig, code, addr, + "Uncorrected memory error on access to user memory\n"); + + return true; +} + static int do_sea(unsigned long far, unsigned long esr, struct pt_regs *regs) { const struct fault_info *inf; @@ -755,7 +780,9 @@ static int do_sea(unsigned long far, unsigned long esr, struct pt_regs *regs) */ siaddr = untagged_addr(far); } - arm64_notify_die(inf->name, regs, inf->sig, inf->code, siaddr, esr); + + if (!arm64_do_kernel_sea(siaddr, esr, regs, inf->sig, inf->code)) + arm64_notify_die(inf->name, regs, inf->sig, inf->code, siaddr, esr); return 0; } -- 2.25.1
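The order of the checks in arm64_do_kernel_sea() in the patch above can be hard to follow in diff form, so here is a small user-space model of it in C. This is only a sketch of the decision flow as it appears in the diff: the `toy_` names, the context struct and the outcome enum are invented for illustration, and the real function acts on pt_regs and kernel state rather than on booleans.

```c
#include <stdbool.h>

/* What the caller (do_sea() in the patch) ends up doing. Names are made up. */
enum toy_sea_outcome {
	TOY_SEA_NOTIFY_DIE,     /* not handled: fall back to arm64_notify_die() */
	TOY_SEA_FIXED_KTHREAD,  /* fixup applied, kernel thread, no signal sent */
	TOY_SEA_FIXED_SIGNAL,   /* fixup applied, signal queued for the task */
};

/* Inputs that the real code reads from config, pt_regs and current. */
struct toy_sea_ctx {
	bool copy_mc_enabled;   /* IS_ENABLED(CONFIG_ARCH_HAS_COPY_MC) */
	bool from_user_mode;    /* user_mode(regs) */
	int  apei_claim_ret;    /* result of apei_claim_sea(regs) */
	bool fixup_found;       /* result of fixup_exception_mc(regs) */
	bool is_kthread;        /* current->flags & PF_KTHREAD */
};

/* Mirror the early-return chain from the diff, one check per line. */
enum toy_sea_outcome toy_do_kernel_sea(const struct toy_sea_ctx *c)
{
	if (!c->copy_mc_enabled)
		return TOY_SEA_NOTIFY_DIE;
	if (c->from_user_mode)
		return TOY_SEA_NOTIFY_DIE;
	if (c->apei_claim_ret < 0)
		return TOY_SEA_NOTIFY_DIE;
	if (!c->fixup_found)
		return TOY_SEA_NOTIFY_DIE;
	if (c->is_kthread)
		return TOY_SEA_FIXED_KTHREAD;
	return TOY_SEA_FIXED_SIGNAL;
}
```

Note that with this patch alone fixup_exception_mc() always returns false, so in terms of the model `fixup_found` is never true until a later patch adds an extable type to handle.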
[PATCH v10 4/6] mm/hwpoison: return -EFAULT when copy fail in copy_mc_[user]_highpage()
If hardware errors are encountered during page copying, returning the bytes not copied is not meaningful, and the caller cannot do any processing on the remaining data. Returning -EFAULT is more reasonable, which represents a hardware error encountered during the copying. Signed-off-by: Tong Tiangen --- include/linux/highmem.h | 8 mm/khugepaged.c | 4 ++-- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/include/linux/highmem.h b/include/linux/highmem.h index 451c1dff0e87..c5ca1a1fc4f5 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -335,8 +335,8 @@ static inline void copy_highpage(struct page *to, struct page *from) /* * If architecture supports machine check exception handling, define the * #MC versions of copy_user_highpage and copy_highpage. They copy a memory - * page with #MC in source page (@from) handled, and return the number - * of bytes not copied if there was a #MC, otherwise 0 for success. + * page with #MC in source page (@from) handled, and return -EFAULT if there + * was a #MC, otherwise 0 for success. */ static inline int copy_mc_user_highpage(struct page *to, struct page *from, unsigned long vaddr, struct vm_area_struct *vma) @@ -352,7 +352,7 @@ static inline int copy_mc_user_highpage(struct page *to, struct page *from, kunmap_local(vto); kunmap_local(vfrom); - return ret; + return ret ? -EFAULT : 0; } static inline int copy_mc_highpage(struct page *to, struct page *from) @@ -368,7 +368,7 @@ static inline int copy_mc_highpage(struct page *to, struct page *from) kunmap_local(vto); kunmap_local(vfrom); - return ret; + return ret ? 
-EFAULT : 0; } #else static inline int copy_mc_user_highpage(struct page *to, struct page *from, diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 2b219acb528e..ba6743a54c86 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -797,7 +797,7 @@ static int __collapse_huge_page_copy(pte_t *pte, continue; } src_page = pte_page(pteval); - if (copy_mc_user_highpage(page, src_page, _address, vma) > 0) { + if (copy_mc_user_highpage(page, src_page, _address, vma)) { result = SCAN_COPY_MC; break; } @@ -2053,7 +2053,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, clear_highpage(hpage + (index % HPAGE_PMD_NR)); index++; } - if (copy_mc_highpage(hpage + (page->index % HPAGE_PMD_NR), page) > 0) { + if (copy_mc_highpage(hpage + (page->index % HPAGE_PMD_NR), page)) { result = SCAN_COPY_MC; goto rollback; } -- 2.25.1
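To illustrate the return-value convention this patch switches to, here is a minimal user-space sketch in C. It assumes a made-up primitive `toy_copy_mc()` that, like copy_mc_to_kernel(), reports the number of bytes not copied; the `poison_at` parameter is purely a test knob to simulate a hardware error and has no counterpart in the kernel API.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a machine-check-safe copy primitive.
 * Copies up to the first "poisoned" byte and returns the number of bytes
 * NOT copied. poison_at >= len means the source has no error.
 */
unsigned long toy_copy_mc(void *dst, const void *src, size_t len,
			  size_t poison_at)
{
	size_t ok = poison_at < len ? poison_at : len;

	memcpy(dst, src, ok);
	return len - ok;
}

/*
 * Page-level wrapper in the style of the patched copy_mc_highpage():
 * the caller cannot use a partially copied page, so any non-zero
 * remainder is collapsed into -EFAULT and success stays 0.
 */
int toy_copy_mc_page(void *dst, const void *src, size_t len, size_t poison_at)
{
	return toy_copy_mc(dst, src, len, poison_at) ? -EFAULT : 0;
}
```

A caller then only needs a truth test, which matches the khugepaged hunks above changing `> 0` to a plain non-zero check, since a negative errno would no longer satisfy `> 0`.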
[PATCH v10 3/6] arm64: add uaccess to machine check safe
If a user process's memory access fails due to a hardware memory error, only the relevant processes are affected, so it is more reasonable to kill the user process and isolate the corrupt page than to panic the kernel. Signed-off-by: Tong Tiangen --- arch/arm64/lib/copy_from_user.S | 10 +- arch/arm64/lib/copy_to_user.S | 10 +- arch/arm64/mm/extable.c | 8 3 files changed, 14 insertions(+), 14 deletions(-) diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S index 34e317907524..1bf676e9201d 100644 --- a/arch/arm64/lib/copy_from_user.S +++ b/arch/arm64/lib/copy_from_user.S @@ -25,7 +25,7 @@ .endm .macro strb1 reg, ptr, val - strb \reg, [\ptr], \val + USER(9998f, strb \reg, [\ptr], \val) .endm .macro ldrh1 reg, ptr, val @@ -33,7 +33,7 @@ .endm .macro strh1 reg, ptr, val - strh \reg, [\ptr], \val + USER(9998f, strh \reg, [\ptr], \val) .endm .macro ldr1 reg, ptr, val @@ -41,7 +41,7 @@ .endm .macro str1 reg, ptr, val - str \reg, [\ptr], \val + USER(9998f, str \reg, [\ptr], \val) .endm .macro ldp1 reg1, reg2, ptr, val @@ -49,7 +49,7 @@ .endm .macro stp1 reg1, reg2, ptr, val - stp \reg1, \reg2, [\ptr], \val + USER(9998f, stp \reg1, \reg2, [\ptr], \val) .endm end.reqx5 @@ -66,7 +66,7 @@ SYM_FUNC_START(__arch_copy_from_user) b.ne9998f // Before being absolutely sure we couldn't copy anything, try harder USER(9998f, ldtrb tmp1w, [srcin]) - strbtmp1w, [dst], #1 +USER(9998f, strb tmp1w, [dst], #1) 9998: sub x0, end, dst// bytes not copied ret SYM_FUNC_END(__arch_copy_from_user) diff --git a/arch/arm64/lib/copy_to_user.S b/arch/arm64/lib/copy_to_user.S index 802231772608..cc031bd87455 100644 --- a/arch/arm64/lib/copy_to_user.S +++ b/arch/arm64/lib/copy_to_user.S @@ -20,7 +20,7 @@ * x0 - bytes not copied */ .macro ldrb1 reg, ptr, val - ldrb \reg, [\ptr], \val + USER(9998f, ldrb \reg, [\ptr], \val) .endm .macro strb1 reg, ptr, val @@ -28,7 +28,7 @@ .endm .macro ldrh1 reg, ptr, val - ldrh \reg, [\ptr], \val + USER(9998f, ldrh \reg, [\ptr], \val) .endm .macro 
strh1 reg, ptr, val @@ -36,7 +36,7 @@ .endm .macro ldr1 reg, ptr, val - ldr \reg, [\ptr], \val + USER(9998f, ldr \reg, [\ptr], \val) .endm .macro str1 reg, ptr, val @@ -44,7 +44,7 @@ .endm .macro ldp1 reg1, reg2, ptr, val - ldp \reg1, \reg2, [\ptr], \val + USER(9998f, ldp \reg1, \reg2, [\ptr], \val) .endm .macro stp1 reg1, reg2, ptr, val @@ -64,7 +64,7 @@ SYM_FUNC_START(__arch_copy_to_user) 9997: cmp dst, dstin b.ne9998f // Before being absolutely sure we couldn't copy anything, try harder - ldrbtmp1w, [srcin] +USER(9998f, ldrb tmp1w, [srcin]) USER(9998f, sttrb tmp1w, [dst]) add dst, dst, #1 9998: sub x0, end, dst// bytes not copied diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c index 478e639f8680..28ec35e3d210 100644 --- a/arch/arm64/mm/extable.c +++ b/arch/arm64/mm/extable.c @@ -85,10 +85,10 @@ bool fixup_exception_mc(struct pt_regs *regs) if (!ex) return false; - /* -* This is not complete, More Machine check safe extable type can -* be processed here. -*/ + switch (ex->type) { + case EX_TYPE_UACCESS_ERR_ZERO: + return ex_handler_uaccess_err_zero(ex, regs); + } return false; } -- 2.25.1
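The extable.c hunk above makes fixup_exception_mc() act only on entries of a type it explicitly knows to be machine check safe. A simplified user-space model of that lookup-then-dispatch pattern might look like the following C sketch. All `toy_` names are invented; the linear search stands in for search_exception_tables(), and the real ex_handler_uaccess_err_zero() also updates the error and zero registers, which is left out here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up entry types; only one of them is treated as machine check safe. */
enum toy_ex_type {
	TOY_EX_TYPE_NONE,
	TOY_EX_TYPE_UACCESS_ERR_ZERO,
	TOY_EX_TYPE_OTHER,
};

/* One exception table entry: faulting instruction -> fixup address. */
struct toy_ex_entry {
	uintptr_t insn;
	uintptr_t fixup;
	enum toy_ex_type type;
};

/* Find the entry covering the faulting pc, or NULL if there is none. */
const struct toy_ex_entry *toy_search_extable(const struct toy_ex_entry *tbl,
					      size_t n, uintptr_t pc)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].insn == pc)
			return &tbl[i];
	return NULL;
}

/*
 * Model of the machine check fixup: redirect *pc to the fixup address only
 * for whitelisted entry types. Everything else reports "not handled" so the
 * caller can take the existing panic/die path.
 */
bool toy_fixup_exception_mc(const struct toy_ex_entry *tbl, size_t n,
			    uintptr_t *pc)
{
	const struct toy_ex_entry *ex = toy_search_extable(tbl, n, *pc);

	if (!ex)
		return false;

	switch (ex->type) {
	case TOY_EX_TYPE_UACCESS_ERR_ZERO:
		*pc = ex->fixup;
		return true;
	default:
		return false;
	}
}
```

Keeping the whitelist in a switch means a new machine check safe extable type only has to add one case, which is the shape a later patch in this series relies on.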
[PATCH v10 5/6] arm64: support copy_mc_[user]_highpage()
Currently, the kernel supports many scenarios that can tolerate memory errors when copying a page[1][2][3], all of which are implemented by copy_mc_[user]_highpage(). arm64 should also support this mechanism. Due to MTE, arm64 needs its own architecture implementation of copy_mc_[user]_highpage(); the macros __HAVE_ARCH_COPY_MC_HIGHPAGE and __HAVE_ARCH_COPY_MC_USER_HIGHPAGE have been added to control it. Add a new helper copy_mc_page(), which provides a machine check safe page copy implementation. The copy_mc_page() in copy_mc_page.S largely borrows from copy_page() in copy_page.S; the main difference is that copy_mc_page() adds an extable entry to every load/store insn to support machine check safe. Add a new extable type EX_TYPE_COPY_MC_PAGE_ERR_ZERO, which is used in copy_mc_page(). [1]a873dfe1032a ("mm, hwpoison: try to recover from copy-on write faults") [2]5f2500b93cc9 ("mm/khugepaged: recover from poisoned anonymous memory") [3]6b970599e807 ("mm: hwpoison: support recovery from ksm_might_need_to_copy()") Signed-off-by: Tong Tiangen --- arch/arm64/include/asm/asm-extable.h | 15 ++ arch/arm64/include/asm/assembler.h | 4 ++ arch/arm64/include/asm/mte.h | 5 ++ arch/arm64/include/asm/page.h| 10 arch/arm64/lib/Makefile | 2 + arch/arm64/lib/copy_mc_page.S| 78 arch/arm64/lib/mte.S | 27 ++ arch/arm64/mm/copypage.c | 66 --- arch/arm64/mm/extable.c | 7 +-- include/linux/highmem.h | 8 +++ 10 files changed, 213 insertions(+), 9 deletions(-) create mode 100644 arch/arm64/lib/copy_mc_page.S diff --git a/arch/arm64/include/asm/asm-extable.h b/arch/arm64/include/asm/asm-extable.h index 980d1dd8e1a3..819044fefbe7 100644 --- a/arch/arm64/include/asm/asm-extable.h +++ b/arch/arm64/include/asm/asm-extable.h @@ -10,6 +10,7 @@ #define EX_TYPE_UACCESS_ERR_ZERO 2 #define EX_TYPE_KACCESS_ERR_ZERO 3 #define EX_TYPE_LOAD_UNALIGNED_ZEROPAD 4 +#define EX_TYPE_COPY_MC_PAGE_ERR_ZERO 5 /* Data fields for EX_TYPE_UACCESS_ERR_ZERO */ #define EX_DATA_REG_ERR_SHIFT 0 @@ -51,6 +52,16 @@ 
#define _ASM_EXTABLE_UACCESS(insn, fixup) \ _ASM_EXTABLE_UACCESS_ERR_ZERO(insn, fixup, wzr, wzr) +#define _ASM_EXTABLE_COPY_MC_PAGE_ERR_ZERO(insn, fixup, err, zero) \ + __ASM_EXTABLE_RAW(insn, fixup, \ + EX_TYPE_COPY_MC_PAGE_ERR_ZERO,\ + ( \ + EX_DATA_REG(ERR, err) | \ + EX_DATA_REG(ZERO, zero) \ + )) + +#define _ASM_EXTABLE_COPY_MC_PAGE(insn, fixup) \ + _ASM_EXTABLE_COPY_MC_PAGE_ERR_ZERO(insn, fixup, wzr, wzr) /* * Create an exception table entry for uaccess `insn`, which will branch to `fixup` * when an unhandled fault is taken. @@ -59,6 +70,10 @@ _ASM_EXTABLE_UACCESS(\insn, \fixup) .endm + .macro _asm_extable_copy_mc_page, insn, fixup + _ASM_EXTABLE_COPY_MC_PAGE(\insn, \fixup) + .endm + /* * Create an exception table entry for `insn` if `fixup` is provided. Otherwise * do nothing. diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h index 513787e43329..e1d8ce155878 100644 --- a/arch/arm64/include/asm/assembler.h +++ b/arch/arm64/include/asm/assembler.h @@ -154,6 +154,10 @@ lr .reqx30 // link register #define CPU_LE(code...) code #endif +#define CPY_MC(l, x...)\ +: x; \ + _asm_extable_copy_mc_pageb, l + /* * Define a macro that constructs a 64-bit value by concatenating two * 32-bit registers. 
Note that on big endian systems the order of the diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h index 91fbd5c8a391..9cdded082dd4 100644 --- a/arch/arm64/include/asm/mte.h +++ b/arch/arm64/include/asm/mte.h @@ -92,6 +92,7 @@ static inline bool try_page_mte_tagging(struct page *page) void mte_zero_clear_page_tags(void *addr); void mte_sync_tags(pte_t pte, unsigned int nr_pages); void mte_copy_page_tags(void *kto, const void *kfrom); +int mte_copy_mc_page_tags(void *kto, const void *kfrom); void mte_thread_init_user(void); void mte_thread_switch(struct task_struct *next); void mte_cpu_setup(void); @@ -128,6 +129,10 @@ static inline void mte_sync_tags(pte_t pte, unsigned int nr_pages) static inline void mte_copy_page_tags(void *kto, const void *kfrom) { } +static inline int mte_copy_mc_page_tags(void *kto, const void *kfrom) +{ + return 0; +} static inline void mte_thread_init_user(void) { } diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h index 2312e6ee595f..304cc86b8a10 100644 --- a/arch/arm6
[PATCH v10 1/6] uaccess: add generic fallback version of copy_mc_to_user()
x86 and powerpc have their own implementations of copy_mc_to_user(). Add a generic fallback in include/linux/uaccess.h to prepare for other architectures enabling CONFIG_ARCH_HAS_COPY_MC. Signed-off-by: Tong Tiangen Acked-by: Michael Ellerman --- arch/powerpc/include/asm/uaccess.h | 1 + arch/x86/include/asm/uaccess.h | 1 + include/linux/uaccess.h| 9 + 3 files changed, 11 insertions(+) diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h index f1f9890f50d3..4bfd1e6f0702 100644 --- a/arch/powerpc/include/asm/uaccess.h +++ b/arch/powerpc/include/asm/uaccess.h @@ -381,6 +381,7 @@ copy_mc_to_user(void __user *to, const void *from, unsigned long n) return n; } +#define copy_mc_to_user copy_mc_to_user #endif extern long __copy_from_user_flushcache(void *dst, const void __user *src, diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index 5c367c1290c3..fd56282ee9a8 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -497,6 +497,7 @@ copy_mc_to_kernel(void *to, const void *from, unsigned len); unsigned long __must_check copy_mc_to_user(void __user *to, const void *from, unsigned len); +#define copy_mc_to_user copy_mc_to_user #endif /* diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index 3064314f4832..550287c92990 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -205,6 +205,15 @@ copy_mc_to_kernel(void *dst, const void *src, size_t cnt) } #endif +#ifndef copy_mc_to_user +static inline unsigned long __must_check +copy_mc_to_user(void *dst, const void *src, size_t cnt) +{ + check_object_size(src, cnt, true); + return raw_copy_to_user(dst, src, cnt); +} +#endif + static __always_inline void pagefault_disabled_inc(void) { current->pagefault_disabled++; -- 2.25.1
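The patch above relies on the common kernel idiom where an architecture announces its own implementation with `#define name name` and the generic header supplies a fallback under `#ifndef name`. Here is a minimal stand-alone C sketch of that idiom. The `toy_` names and the `TOY_ARCH_HAS_COPY_MC` switch are invented for illustration, and plain memcpy() stands in for the kernel copy routines.

```c
#include <stddef.h>
#include <string.h>

/*
 * "Architecture header" part: only present when the (made-up) arch switch
 * is defined. The self-referential #define marks the name as provided.
 */
#ifdef TOY_ARCH_HAS_COPY_MC
static inline unsigned long
toy_copy_mc_to_user(void *dst, const void *src, size_t cnt)
{
	/* An arch would use its machine-check-safe copy routine here. */
	memcpy(dst, src, cnt);
	return 0;
}
#define toy_copy_mc_to_user toy_copy_mc_to_user
#endif

/*
 * "Generic header" part: the fallback is compiled only if no architecture
 * claimed the name above. It returns the number of bytes not copied,
 * which is always 0 for this plain copy.
 */
#ifndef toy_copy_mc_to_user
static inline unsigned long
toy_copy_mc_to_user(void *dst, const void *src, size_t cnt)
{
	memcpy(dst, src, cnt);
	return 0;
}
#endif
```

Callers use `toy_copy_mc_to_user()` the same way in both configurations; building with `-DTOY_ARCH_HAS_COPY_MC` selects the first definition and suppresses the fallback, without any duplicate-definition error.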
[PATCH linux-next 3/3] arch, crash: move arch_crash_save_vmcoreinfo() out to file vmcore_info.c
Nathan reported the build error below:

$ curl -LSso .config https://git.alpinelinux.org/aports/plain/community/linux-edge/config-edge.armv7
$ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- olddefconfig all
...
arm-linux-gnueabi-ld: arch/arm/kernel/machine_kexec.o: in function `arch_crash_save_vmcoreinfo':
machine_kexec.c:(.text+0x488): undefined reference to `vmcoreinfo_append_str'

On architectures like arm, s390, ppc and sh, the function arch_crash_save_vmcoreinfo() is located in machine_kexec.c and is only compiled in when CONFIG_KEXEC_CORE=y.

That's not right, because arch_crash_save_vmcoreinfo() is used to export arch specific vmcoreinfo, so CONFIG_VMCORE_INFO is supposed to control whether it is compiled in. However, CONFIG_VMCORE_INFO can be independent of CONFIG_KEXEC_CORE, e.g. CONFIG_PROC_KCORE=y will select CONFIG_VMCORE_INFO. Likewise, if CONFIG_KEXEC/CONFIG_KEXEC_FILE is set while CONFIG_CRASH_DUMP is not, the build reports a linking error.

So, on arm, s390, ppc and sh, move arch_crash_save_vmcoreinfo() out to a new file vmcore_info.c. Let CONFIG_VMCORE_INFO decide whether arch_crash_save_vmcoreinfo() is compiled in.
Reported-by: Nathan Chancellor Closes: https://lore.kernel.org/all/20240126045551.GA126645@dev-arch.thelio-3990X/T/#u Signed-off-by: Baoquan He --- arch/arm/kernel/Makefile | 1 + arch/arm/kernel/machine_kexec.c | 7 --- arch/arm/kernel/vmcore_info.c| 10 ++ arch/powerpc/kexec/Makefile | 1 + arch/powerpc/kexec/core.c| 28 -- arch/powerpc/kexec/vmcore_info.c | 34 arch/s390/kernel/Makefile| 1 + arch/s390/kernel/machine_kexec.c | 15 -- arch/s390/kernel/vmcore_info.c | 23 + arch/sh/kernel/Makefile | 1 + arch/sh/kernel/machine_kexec.c | 11 --- arch/sh/kernel/vmcore_info.c | 17 12 files changed, 88 insertions(+), 61 deletions(-) create mode 100644 arch/arm/kernel/vmcore_info.c create mode 100644 arch/powerpc/kexec/vmcore_info.c create mode 100644 arch/s390/kernel/vmcore_info.c create mode 100644 arch/sh/kernel/vmcore_info.c diff --git a/arch/arm/kernel/Makefile b/arch/arm/kernel/Makefile index 771264d4726a..6a9de826ffd3 100644 --- a/arch/arm/kernel/Makefile +++ b/arch/arm/kernel/Makefile @@ -60,6 +60,7 @@ obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o insn.o patch.o obj-$(CONFIG_FUNCTION_GRAPH_TRACER)+= ftrace.o insn.o patch.o obj-$(CONFIG_JUMP_LABEL) += jump_label.o insn.o patch.o obj-$(CONFIG_KEXEC_CORE) += machine_kexec.o relocate_kernel.o +obj-$(CONFIG_VMCORE_INFO) += vmcore_info.o # Main staffs in KPROBES are in arch/arm/probes/ . 
obj-$(CONFIG_KPROBES) += patch.o insn.o obj-$(CONFIG_OABI_COMPAT) += sys_oabi-compat.o diff --git a/arch/arm/kernel/machine_kexec.c b/arch/arm/kernel/machine_kexec.c index 5d07cf9e0044..80ceb5bd2680 100644 --- a/arch/arm/kernel/machine_kexec.c +++ b/arch/arm/kernel/machine_kexec.c @@ -198,10 +198,3 @@ void machine_kexec(struct kimage *image) soft_restart(reboot_entry_phys); } - -void arch_crash_save_vmcoreinfo(void) -{ -#ifdef CONFIG_ARM_LPAE - VMCOREINFO_CONFIG(ARM_LPAE); -#endif -} diff --git a/arch/arm/kernel/vmcore_info.c b/arch/arm/kernel/vmcore_info.c new file mode 100644 index ..1437aba47787 --- /dev/null +++ b/arch/arm/kernel/vmcore_info.c @@ -0,0 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0-only + +#include + +void arch_crash_save_vmcoreinfo(void) +{ +#ifdef CONFIG_ARM_LPAE + VMCOREINFO_CONFIG(ARM_LPAE); +#endif +} diff --git a/arch/powerpc/kexec/Makefile b/arch/powerpc/kexec/Makefile index 0c2abe7f9908..91e96f5168b7 100644 --- a/arch/powerpc/kexec/Makefile +++ b/arch/powerpc/kexec/Makefile @@ -8,6 +8,7 @@ obj-y += core.o crash.o core_$(BITS).o obj-$(CONFIG_PPC32)+= relocate_32.o obj-$(CONFIG_KEXEC_FILE) += file_load.o ranges.o file_load_$(BITS).o elf_$(BITS).o +obj-$(CONFIG_VMCORE_INFO) += vmcore_info.o # Disable GCOV, KCOV & sanitizers in odd or sensitive code GCOV_PROFILE_core_$(BITS).o := n diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c index 27fa9098a5b7..3ff4411ed496 100644 --- a/arch/powerpc/kexec/core.c +++ b/arch/powerpc/kexec/core.c @@ -53,34 +53,6 @@ void machine_kexec_cleanup(struct kimage *image) { } -void arch_crash_save_vmcoreinfo(void) -{ - -#ifdef CONFIG_NUMA - VMCOREINFO_SYMBOL(node_data); - VMCOREINFO_LENGTH(node_data, MAX_NUMNODES); -#endif -#ifndef CONFIG_NUMA - VMCOREINFO_SYMBOL(contig_page_data); -#endif -#if defined(CONFIG_PPC64) && defined(CONFIG_SPARSEMEM_VMEMMAP) - VMCOREINFO_SYMBOL(vmemmap_list); - VMCOREINFO_SYMBOL(mmu_vmemmap_psize); - VMCOREINFO_SYMBOL(mmu_psize_defs); - 
VMCOREINFO_STRUCT_SIZE(vmemmap_backing); - VMCOREINFO_OFFSET(vmemmap_backing, list); - VMCOREINFO_OFFSET(vmemmap_backing, phys); - VMCOREIN
[PATCH linux-next 2/3] crash: fix building error in generic codes
Nathan reported some build errors on arm64, as below: == $ curl -LSso .config https://github.com/archlinuxarm/PKGBUILDs/raw/master/core/linux-aarch64/config $ make -skj"$(nproc)" ARCH=arm64 CROSS_COMPILE=aarch64-linux- olddefconfig all ... aarch64-linux-ld: kernel/kexec_file.o: in function `kexec_walk_memblock.constprop.0': kexec_file.c:(.text+0x314): undefined reference to `crashk_res' ... aarch64-linux-ld: drivers/of/kexec.o: in function `of_kexec_alloc_and_setup_fdt': kexec.c:(.text+0x580): undefined reference to `crashk_res' ... aarch64-linux-ld: kexec.c:(.text+0x5c0): undefined reference to `crashk_low_res' == The provided config has: === CONFIG_VMCORE_INFO=y CONFIG_KEXEC_CORE=y CONFIG_KEXEC=y CONFIG_KEXEC_FILE=y === These crash-related code blocks need to be put inside CONFIG_CRASH_DUMP ifdeffery scope to avoid build errors when CONFIG_CRASH_DUMP is not set. Reported-by: Nathan Chancellor Closes: https://lore.kernel.org/all/20240126045551.GA126645@dev-arch.thelio-3990X/T/#u Signed-off-by: Baoquan He --- drivers/of/kexec.c | 2 ++ kernel/kexec_file.c | 2 ++ 2 files changed, 4 insertions(+) diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c index 68278340cecf..9ccde2fd77cb 100644 --- a/drivers/of/kexec.c +++ b/drivers/of/kexec.c @@ -395,6 +395,7 @@ void *of_kexec_alloc_and_setup_fdt(const struct kimage *image, if (ret) goto out; +#ifdef CONFIG_CRASH_DUMP /* add linux,usable-memory-range */ ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,usable-memory-range", crashk_res.start, @@ -410,6 +411,7 @@ void *of_kexec_alloc_and_setup_fdt(const struct kimage *image, if (ret) goto out; } +#endif } /* add bootargs */ diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index ce7ce2ae27cd..2d1db05fbf04 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -540,8 +540,10 @@ static int kexec_walk_memblock(struct kexec_buf *kbuf, phys_addr_t mstart, mend; struct resource res = { }; +#ifdef CONFIG_CRASH_DUMP if (kbuf->image->type == 
KEXEC_TYPE_CRASH) return func(&crashk_res, kbuf); +#endif /* * Using MEMBLOCK_NONE will properly skip MEMBLOCK_DRIVER_MANAGED. See -- 2.41.0
[PATCH linux-next 1/3] x86, crash: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope
Michael pointed out that the #ifdef CONFIG_CRASH_DUMP is nested inside the CONFIG_KEXEC_CORE ifdef in arch/x86/xen/enlighten_hvm.c. Although the nesting works too, since CONFIG_CRASH_DUMP depends on CONFIG_KEXEC_CORE, it may cause confusion because there are places where it's not nested, and people may think it needs to be nested even though it doesn't have to. Fix that by moving the CONFIG_CRASH_DUMP ifdeffery out of the CONFIG_KEXEC_CORE ifdeffery scope. Also fix a build error Nathan reported, shown below, by replacing the CONFIG_KEXEC_CORE ifdef with a CONFIG_VMCORE_INFO ifdef. $ curl -LSso .config https://git.alpinelinux.org/aports/plain/community/linux-edge/config-edge.x86_64 $ make -skj"$(nproc)" ARCH=x86_64 CROSS_COMPILE=x86_64-linux- olddefconfig all ... x86_64-linux-ld: arch/x86/xen/mmu_pv.o: in function `paddr_vmcoreinfo_note': mmu_pv.c:(.text+0x3af3): undefined reference to `vmcoreinfo_note' Link: https://lore.kernel.org/all/sn6pr02mb4157931105fa68d72e3d3db8d4...@sn6pr02mb4157.namprd02.prod.outlook.com/T/#u Link: https://lore.kernel.org/all/20240126045551.GA126645@dev-arch.thelio-3990X/T/#u Signed-off-by: Baoquan He --- arch/x86/kernel/cpu/mshyperv.c | 10 ++ arch/x86/kernel/reboot.c | 2 +- arch/x86/xen/enlighten_hvm.c | 4 ++-- arch/x86/xen/mmu_pv.c | 2 +- 4 files changed, 10 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c index f8163a59026b..2e8cd5a4ae85 100644 --- a/arch/x86/kernel/cpu/mshyperv.c +++ b/arch/x86/kernel/cpu/mshyperv.c @@ -209,6 +209,7 @@ static void hv_machine_shutdown(void) if (kexec_in_progress) hyperv_cleanup(); } +#endif /* CONFIG_KEXEC_CORE */ #ifdef CONFIG_CRASH_DUMP static void hv_machine_crash_shutdown(struct pt_regs *regs) @@ -222,8 +223,7 @@ static void hv_machine_crash_shutdown(struct pt_regs *regs) /* Disable the hypercall page when there is only 1 active CPU. 
*/ hyperv_cleanup(); } -#endif -#endif /* CONFIG_KEXEC_CORE */ +#endif /* CONFIG_CRASH_DUMP */ #endif /* CONFIG_HYPERV */ static uint32_t __init ms_hyperv_platform(void) @@ -497,9 +497,11 @@ static void __init ms_hyperv_init_platform(void) no_timer_check = 1; #endif -#if IS_ENABLED(CONFIG_HYPERV) && defined(CONFIG_KEXEC_CORE) +#if IS_ENABLED(CONFIG_HYPERV) +#if defined(CONFIG_KEXEC_CORE) machine_ops.shutdown = hv_machine_shutdown; -#ifdef CONFIG_CRASH_DUMP +#endif +#if defined(CONFIG_CRASH_DUMP) machine_ops.crash_shutdown = hv_machine_crash_shutdown; #endif #endif diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c index 1287b0d5962f..f3130f762784 100644 --- a/arch/x86/kernel/reboot.c +++ b/arch/x86/kernel/reboot.c @@ -826,7 +826,7 @@ void machine_halt(void) machine_ops.halt(); } -#ifdef CONFIG_KEXEC_CORE +#ifdef CONFIG_CRASH_DUMP void machine_crash_shutdown(struct pt_regs *regs) { machine_ops.crash_shutdown(regs); diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c index 09e3db7ff990..0b367c1e086d 100644 --- a/arch/x86/xen/enlighten_hvm.c +++ b/arch/x86/xen/enlighten_hvm.c @@ -148,6 +148,7 @@ static void xen_hvm_shutdown(void) if (kexec_in_progress) xen_reboot(SHUTDOWN_soft_reset); } +#endif #ifdef CONFIG_CRASH_DUMP static void xen_hvm_crash_shutdown(struct pt_regs *regs) @@ -156,7 +157,6 @@ static void xen_hvm_crash_shutdown(struct pt_regs *regs) xen_reboot(SHUTDOWN_soft_reset); } #endif -#endif static int xen_cpu_up_prepare_hvm(unsigned int cpu) { @@ -238,10 +238,10 @@ static void __init xen_hvm_guest_init(void) #ifdef CONFIG_KEXEC_CORE machine_ops.shutdown = xen_hvm_shutdown; +#endif #ifdef CONFIG_CRASH_DUMP machine_ops.crash_shutdown = xen_hvm_crash_shutdown; #endif -#endif } static __init int xen_parse_nopv(char *arg) diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c index 218773cfb009..e21974f2cf2d 100644 --- a/arch/x86/xen/mmu_pv.c +++ b/arch/x86/xen/mmu_pv.c @@ -2520,7 +2520,7 @@ int xen_remap_pfn(struct 
vm_area_struct *vma, unsigned long addr, } EXPORT_SYMBOL_GPL(xen_remap_pfn); -#ifdef CONFIG_KEXEC_CORE +#ifdef CONFIG_VMCORE_INFO phys_addr_t paddr_vmcoreinfo_note(void) { if (xen_pv_domain()) -- 2.41.0
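The un-nesting in the x86 patch above can be pictured with a small standalone sketch. Everything here is hypothetical: `TOY_KEXEC_CORE` and `TOY_CRASH_DUMP` stand in for the real Kconfig symbols, and `toy_machine_ops` is not the kernel's `machine_ops`. The point is only that each hook sits in its own sibling #ifdef block, so neither guard has to be read in the context of the other.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for CONFIG_KEXEC_CORE and CONFIG_CRASH_DUMP. */
#define TOY_KEXEC_CORE 1
#define TOY_CRASH_DUMP 1

struct toy_machine_ops {
	void (*shutdown)(void);
	void (*crash_shutdown)(void);
};

#ifdef TOY_KEXEC_CORE
static void toy_shutdown(void)
{
	/* kexec-only teardown would go here */
}
#endif /* TOY_KEXEC_CORE */

/* Sibling block, not nested inside the TOY_KEXEC_CORE one. */
#ifdef TOY_CRASH_DUMP
static void toy_crash_shutdown(void)
{
	/* crash-dump-only teardown would go here */
}
#endif /* TOY_CRASH_DUMP */

static void toy_guest_init(struct toy_machine_ops *ops)
{
	assert(ops != NULL);
#ifdef TOY_KEXEC_CORE
	ops->shutdown = toy_shutdown;
#endif
#ifdef TOY_CRASH_DUMP
	ops->crash_shutdown = toy_crash_shutdown;
#endif
}
```

With both toy symbols defined, `toy_guest_init()` fills in both hooks; removing either define drops only the matching function and assignment.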
[PATCH] MAINTAINERS: adjust file entries after crypto vmx file movement
Commit 109303336a0c ("crypto: vmx - Move to arch/powerpc/crypto") moves the crypto vmx files to arch/powerpc, but does not adjust the file entries for IBM Power VMX Cryptographic instructions and LINUX FOR POWERPC. Hence, ./scripts/get_maintainer.pl --self-test=patterns complains about broken references. Adjust these file entries accordingly. Signed-off-by: Lukas Bulwahn --- Danny, please ack. Herbert, please pick this minor clean-up patch on your -next tree. MAINTAINERS | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index 2fb944964be5..15bc79e80e28 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -10307,12 +10307,12 @@ M:Nayna Jain M: Paulo Flabiano Smorigo L: linux-cry...@vger.kernel.org S: Supported -F: drivers/crypto/vmx/Kconfig -F: drivers/crypto/vmx/Makefile -F: drivers/crypto/vmx/aes* -F: drivers/crypto/vmx/ghash* -F: drivers/crypto/vmx/ppc-xlate.pl -F: drivers/crypto/vmx/vmx.c +F: arch/powerpc/crypto/Kconfig +F: arch/powerpc/crypto/Makefile +F: arch/powerpc/crypto/aes* +F: arch/powerpc/crypto/ghash* +F: arch/powerpc/crypto/ppc-xlate.pl +F: arch/powerpc/crypto/vmx.c IBM ServeRAID RAID DRIVER S: Orphan @@ -12397,7 +12397,6 @@ F: drivers/*/*/*pasemi* F: drivers/*/*pasemi* F: drivers/char/tpm/tpm_ibmvtpm* F: drivers/crypto/nx/ -F: drivers/crypto/vmx/ F: drivers/i2c/busses/i2c-opal.c F: drivers/net/ethernet/ibm/ibmveth.* F: drivers/net/ethernet/ibm/ibmvnic.* -- 2.17.1
[PATCH v3 15/15] mm/memory: ignore writable bit in folio_pte_batch()
... and conditionally return to the caller if any PTE except the first one is writable. fork() has to make sure to properly write-protect in case any PTE is writable. Other users (e.g., page unmapping) are expected to not care. Reviewed-by: Ryan Roberts Signed-off-by: David Hildenbrand --- mm/memory.c | 30 -- 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index b2ec2b6b54c7..b05fd28dbce1 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -968,7 +968,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) pte = pte_mkclean(pte); if (likely(flags & FPB_IGNORE_SOFT_DIRTY)) pte = pte_clear_soft_dirty(pte); - return pte_mkold(pte); + return pte_wrprotect(pte_mkold(pte)); } /* @@ -976,21 +976,32 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) * pages of the same folio. * * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN, - * the accessed bit, dirty bit (with FPB_IGNORE_DIRTY) and soft-dirty bit - * (with FPB_IGNORE_SOFT_DIRTY). + * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and + * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY). + * + * If "any_writable" is set, it will indicate if any other PTE besides the + * first (given) PTE is writable. 
*/ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, - pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags) + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags, + bool *any_writable) { unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio); const pte_t *end_ptep = start_ptep + max_nr; pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte), flags); pte_t *ptep = start_ptep + 1; + bool writable; + + if (any_writable) + *any_writable = false; VM_WARN_ON_FOLIO(!pte_present(pte), folio); while (ptep != end_ptep) { - pte = __pte_batch_clear_ignored(ptep_get(ptep), flags); + pte = ptep_get(ptep); + if (any_writable) + writable = !!pte_write(pte); + pte = __pte_batch_clear_ignored(pte, flags); if (!pte_same(pte, expected_pte)) break; @@ -1003,6 +1014,9 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, if (pte_pfn(pte) == folio_end_pfn) break; + if (any_writable) + *any_writable |= writable; + expected_pte = pte_next_pfn(expected_pte); ptep++; } @@ -1024,6 +1038,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma { struct page *page; struct folio *folio; + bool any_writable; fpb_t flags = 0; int err, nr; @@ -1044,7 +1059,8 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma if (!vma_soft_dirty_enabled(src_vma)) flags |= FPB_IGNORE_SOFT_DIRTY; - nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags); + nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags, +&any_writable); folio_ref_add(folio, nr); if (folio_test_anon(folio)) { if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page, @@ -1058,6 +1074,8 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma folio_dup_file_rmap_ptes(folio, page, nr); rss[mm_counter_file(folio)] += nr; } + if (any_writable) + pte = pte_mkwrite(pte, src_vma); __copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte, addr, nr); return nr; -- 2.43.0
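The any_writable reporting added by this patch can be illustrated with a userspace sketch. It is only a model under stated assumptions: PTEs are plain 64-bit integers, `TOY_PTE_WRITE` and `TOY_PFN_SHIFT` are made-up bit positions, and the folio-boundary check of the real function is left out. It shows the two properties the commit message relies on: the writable bit does not end a batch, and the out-parameter reflects only PTEs after the first one that were actually accepted into the batch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PTE_WRITE	(UINT64_C(1) << 3)	/* hypothetical writable bit */
#define TOY_PFN_SHIFT	12			/* hypothetical PFN position */

/*
 * Count how many consecutive entries of ptes[] map consecutive PFNs with
 * otherwise identical bits, ignoring the writable bit. If any_writable is
 * non-NULL, report whether any accepted PTE other than the first is writable.
 */
static int toy_pte_batch_wr(const uint64_t *ptes, int max_nr, bool *any_writable)
{
	uint64_t expected;
	int nr = 1;

	assert(ptes != NULL && max_nr >= 1);
	if (any_writable)
		*any_writable = false;

	expected = (ptes[0] + (UINT64_C(1) << TOY_PFN_SHIFT)) & ~TOY_PTE_WRITE;
	while (nr < max_nr) {
		bool writable = (ptes[nr] & TOY_PTE_WRITE) != 0;

		if ((ptes[nr] & ~TOY_PTE_WRITE) != expected)
			break;
		/* Only record the bit once the PTE is part of the batch. */
		if (any_writable)
			*any_writable |= writable;

		expected += UINT64_C(1) << TOY_PFN_SHIFT;
		nr++;
	}
	return nr;
}
```

A caller in the position of fork() would then re-add the writable bit to the PTE it passes on whenever the flag comes back true, so that the later write-protect step still sees a writable PTE.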
[PATCH v3 14/15] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
Let's always ignore the accessed/young bit: we'll always mark the PTE as old in our child process during fork, and upcoming users will similarly not care. Ignore the dirty bit only if we don't want to duplicate the dirty bit into the child process during fork. Maybe, we could just set all PTEs in the child dirty if any PTE is dirty. For now, let's keep the behavior unchanged, this can be optimized later if required. Ignore the soft-dirty bit only if the bit doesn't have any meaning in the src vma, and similarly won't have any in the copied dst vma. For now, we won't bother with the uffd-wp bit. Reviewed-by: Ryan Roberts Signed-off-by: David Hildenbrand --- mm/memory.c | 36 +++- 1 file changed, 31 insertions(+), 5 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 86f8a0021c8e..b2ec2b6b54c7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -953,24 +953,44 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma, set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr); } +/* Flags for folio_pte_batch(). */ +typedef int __bitwise fpb_t; + +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */ +#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0)) + +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */ +#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1)) + +static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) +{ + if (flags & FPB_IGNORE_DIRTY) + pte = pte_mkclean(pte); + if (likely(flags & FPB_IGNORE_SOFT_DIRTY)) + pte = pte_clear_soft_dirty(pte); + return pte_mkold(pte); +} + /* * Detect a PTE batch: consecutive (present) PTEs that map consecutive * pages of the same folio. * - * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN. + * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN, + * the accessed bit, dirty bit (with FPB_IGNORE_DIRTY) and soft-dirty bit + * (with FPB_IGNORE_SOFT_DIRTY). 
*/ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, - pte_t *start_ptep, pte_t pte, int max_nr) + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags) { unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio); const pte_t *end_ptep = start_ptep + max_nr; - pte_t expected_pte = pte_next_pfn(pte); + pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte), flags); pte_t *ptep = start_ptep + 1; VM_WARN_ON_FOLIO(!pte_present(pte), folio); while (ptep != end_ptep) { - pte = ptep_get(ptep); + pte = __pte_batch_clear_ignored(ptep_get(ptep), flags); if (!pte_same(pte, expected_pte)) break; @@ -1004,6 +1024,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma { struct page *page; struct folio *folio; + fpb_t flags = 0; int err, nr; page = vm_normal_page(src_vma, addr, pte); @@ -1018,7 +1039,12 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma * by keeping the batching logic separate. */ if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) { - nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr); + if (src_vma->vm_flags & VM_SHARED) + flags |= FPB_IGNORE_DIRTY; + if (!vma_soft_dirty_enabled(src_vma)) + flags |= FPB_IGNORE_SOFT_DIRTY; + + nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags); folio_ref_add(folio, nr); if (folio_test_anon(folio)) { if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page, -- 2.43.0
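As a rough userspace model of the flag handling in this patch, the sketch below clears the ignored bits from both the expected and the observed PTE before comparing them. The bit positions, the `toy_` names and the `TOY_FPB_*` values are assumptions made for illustration; the real code works on `pte_t` through arch helpers and also stops at the end of the folio, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical PTE layout: a few attribute bits, PFN from bit 12 upwards. */
#define TOY_PTE_YOUNG		(UINT64_C(1) << 0)
#define TOY_PTE_DIRTY		(UINT64_C(1) << 1)
#define TOY_PTE_SOFT_DIRTY	(UINT64_C(1) << 2)
#define TOY_PFN_SHIFT		12

/* Hypothetical counterparts of FPB_IGNORE_DIRTY / FPB_IGNORE_SOFT_DIRTY. */
#define TOY_FPB_IGNORE_DIRTY		(1u << 0)
#define TOY_FPB_IGNORE_SOFT_DIRTY	(1u << 1)

/* Always drop the young bit; drop dirty/soft-dirty only when asked to. */
static uint64_t toy_clear_ignored(uint64_t pte, unsigned int flags)
{
	if (flags & TOY_FPB_IGNORE_DIRTY)
		pte &= ~TOY_PTE_DIRTY;
	if (flags & TOY_FPB_IGNORE_SOFT_DIRTY)
		pte &= ~TOY_PTE_SOFT_DIRTY;
	return pte & ~TOY_PTE_YOUNG;
}

/* Length of the batch starting at ptes[0], at most max_nr entries. */
static int toy_pte_batch(const uint64_t *ptes, int max_nr, unsigned int flags)
{
	uint64_t expected;
	int nr = 1;

	assert(ptes != NULL && max_nr >= 1);
	expected = toy_clear_ignored(ptes[0] + (UINT64_C(1) << TOY_PFN_SHIFT),
				     flags);
	while (nr < max_nr) {
		if (toy_clear_ignored(ptes[nr], flags) != expected)
			break;
		expected += UINT64_C(1) << TOY_PFN_SHIFT;
		nr++;
	}
	return nr;
}
```

With no flags set, a dirty PTE following a clean one ends the batch; with `TOY_FPB_IGNORE_DIRTY` the same sequence is treated as one batch, which mirrors the shared-mapping case described in the commit message.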
[PATCH v3 12/15] mm/memory: pass PTE to copy_present_pte()
We already read it, let's just forward it. This patch is based on work by Ryan Roberts. Reviewed-by: Ryan Roberts Signed-off-by: David Hildenbrand --- mm/memory.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index a3bdb25f4c8d..41b24da5be38 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -959,10 +959,9 @@ static inline void __copy_present_pte(struct vm_area_struct *dst_vma, */ static inline int copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, -pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss, -struct folio **prealloc) +pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr, +int *rss, struct folio **prealloc) { - pte_t pte = ptep_get(src_pte); struct page *page; struct folio *folio; @@ -1103,7 +1102,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, } /* copy_present_pte() will clear `*prealloc' if consumed */ ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, - addr, rss, &prealloc); + ptent, addr, rss, &prealloc); /* * If we need a pre-allocated page for this pte, drop the * locks, allocate, and try again. -- 2.43.0
[PATCH v3 11/15] mm/memory: factor out copying the actual PTE in copy_present_pte()
Let's prepare for further changes. Reviewed-by: Ryan Roberts Signed-off-by: David Hildenbrand --- mm/memory.c | 63 - 1 file changed, 33 insertions(+), 30 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 8d14ba440929..a3bdb25f4c8d 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -930,6 +930,29 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma return 0; } +static inline void __copy_present_pte(struct vm_area_struct *dst_vma, + struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte, + pte_t pte, unsigned long addr) +{ + struct mm_struct *src_mm = src_vma->vm_mm; + + /* If it's a COW mapping, write protect it both processes. */ + if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) { + ptep_set_wrprotect(src_mm, addr, src_pte); + pte = pte_wrprotect(pte); + } + + /* If it's a shared mapping, mark it clean in the child. */ + if (src_vma->vm_flags & VM_SHARED) + pte = pte_mkclean(pte); + pte = pte_mkold(pte); + + if (!userfaultfd_wp(dst_vma)) + pte = pte_clear_uffd_wp(pte); + + set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); +} + /* * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page * is required to copy this pte. @@ -939,23 +962,23 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss, struct folio **prealloc) { - struct mm_struct *src_mm = src_vma->vm_mm; - unsigned long vm_flags = src_vma->vm_flags; pte_t pte = ptep_get(src_pte); struct page *page; struct folio *folio; page = vm_normal_page(src_vma, addr, pte); - if (page) - folio = page_folio(page); - if (page && folio_test_anon(folio)) { + if (unlikely(!page)) + goto copy_pte; + + folio = page_folio(page); + folio_get(folio); + if (folio_test_anon(folio)) { /* * If this page may have been pinned by the parent process, * copy the page immediately for the child so that we'll always * guarantee the pinned page won't be randomly replaced in the * future. 
*/ - folio_get(folio); if (unlikely(folio_try_dup_anon_rmap_pte(folio, page, src_vma))) { /* Page may be pinned, we have to copy. */ folio_put(folio); @@ -963,34 +986,14 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, addr, rss, prealloc, page); } rss[MM_ANONPAGES]++; - } else if (page) { - folio_get(folio); + VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio); + } else { folio_dup_file_rmap_pte(folio, page); rss[mm_counter_file(folio)]++; } - /* -* If it's a COW mapping, write protect it both -* in the parent and the child -*/ - if (is_cow_mapping(vm_flags) && pte_write(pte)) { - ptep_set_wrprotect(src_mm, addr, src_pte); - pte = pte_wrprotect(pte); - } - VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page)); - - /* -* If it's a shared mapping, mark it clean in -* the child -*/ - if (vm_flags & VM_SHARED) - pte = pte_mkclean(pte); - pte = pte_mkold(pte); - - if (!userfaultfd_wp(dst_vma)) - pte = pte_clear_uffd_wp(pte); - - set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); +copy_pte: + __copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, pte, addr); return 0; } -- 2.43.0
[PATCH v3 10/15] powerpc/mm: use pte_next_pfn() in set_ptes()
Let's use our handy new helper. Note that the implementation is slightly different, but shouldn't really make a difference in practice. Reviewed-by: Christophe Leroy Signed-off-by: David Hildenbrand --- arch/powerpc/mm/pgtable.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index a04ae4449a02..549a440ed7f6 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -220,10 +220,7 @@ void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, break; ptep++; addr += PAGE_SIZE; - /* -* increment the pfn. -*/ - pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte))); + pte = pte_next_pfn(pte); } } -- 2.43.0
[PATCH v3 09/15] arm/mm: use pte_next_pfn() in set_ptes()
Let's use our handy helper now that it's available on all archs. Signed-off-by: David Hildenbrand --- arch/arm/mm/mmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c index 674ed71573a8..c24e29c0b9a4 100644 --- a/arch/arm/mm/mmu.c +++ b/arch/arm/mm/mmu.c @@ -1814,6 +1814,6 @@ void set_ptes(struct mm_struct *mm, unsigned long addr, if (--nr == 0) break; ptep++; - pte_val(pteval) += PAGE_SIZE; + pteval = pte_next_pfn(pteval); } } -- 2.43.0
[PATCH v3 08/15] mm/pgtable: make pte_next_pfn() independent of set_ptes()
Let's provide pte_next_pfn(), independently of set_ptes(). This allows for using the generic pte_next_pfn() version in some arch-specific set_ptes() implementations, and prepares for reusing pte_next_pfn() in other context. Reviewed-by: Christophe Leroy Signed-off-by: David Hildenbrand --- include/linux/pgtable.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index f6d0e3513948..351cd9dc7194 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -212,7 +212,6 @@ static inline int pmd_dirty(pmd_t pmd) #define arch_flush_lazy_mmu_mode() do {} while (0) #endif -#ifndef set_ptes #ifndef pte_next_pfn static inline pte_t pte_next_pfn(pte_t pte) @@ -221,6 +220,7 @@ static inline pte_t pte_next_pfn(pte_t pte) } #endif +#ifndef set_ptes /** * set_ptes - Map consecutive pages to a contiguous range of addresses. * @mm: Address space to map the pages into. -- 2.43.0
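For readers who have not looked at the helper, here is a minimal userspace sketch of what a generic pte_next_pfn() amounts to when the PFN is stored contiguously: add one to the PFN field, which starts at bit PFN_PTE_SHIFT. The `toy_pte_t` type, the `TOY_PFN_PTE_SHIFT` value of 12 and the `toy_` helpers are assumptions for illustration, not kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Toy PTE: PFN in bits [63:12], attribute bits in [11:0] (hypothetical). */
typedef struct { uint64_t val; } toy_pte_t;

#define TOY_PFN_PTE_SHIFT 12

static inline uint64_t toy_pte_pfn(toy_pte_t pte)
{
	return pte.val >> TOY_PFN_PTE_SHIFT;
}

static inline uint64_t toy_pte_attrs(toy_pte_t pte)
{
	return pte.val & ((UINT64_C(1) << TOY_PFN_PTE_SHIFT) - 1);
}

/* Advance to the next PFN, leaving the attribute bits untouched. */
static inline toy_pte_t toy_pte_next_pfn(toy_pte_t pte)
{
	/* The PFN field must not wrap around. */
	assert(toy_pte_pfn(pte) < (UINT64_MAX >> TOY_PFN_PTE_SHIFT));
	pte.val += UINT64_C(1) << TOY_PFN_PTE_SHIFT;
	return pte;
}
```

This is also why the per-architecture patches in this series only need to define PFN_PTE_SHIFT: once the shift is known, the increment is the same everywhere the PFN field is contiguous.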
[PATCH v3 07/15] sparc/pgtable: define PFN_PTE_SHIFT
We want to make use of pte_next_pfn() outside of set_ptes(). Let's simply define PFN_PTE_SHIFT, required by pte_next_pfn(). Signed-off-by: David Hildenbrand --- arch/sparc/include/asm/pgtable_64.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index a8c871b7d786..652af9d63fa2 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -929,6 +929,8 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PAGE_SHIFT); } +#define PFN_PTE_SHIFT PAGE_SHIFT + static inline void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte, unsigned int nr) { -- 2.43.0
[PATCH v3 06/15] s390/pgtable: define PFN_PTE_SHIFT
We want to make use of pte_next_pfn() outside of set_ptes(). Let's simply define PFN_PTE_SHIFT, required by pte_next_pfn(). Signed-off-by: David Hildenbrand --- arch/s390/include/asm/pgtable.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 1299b56e43f6..4b91e65c85d9 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -1316,6 +1316,8 @@ pgprot_t pgprot_writecombine(pgprot_t prot); #define pgprot_writethroughpgprot_writethrough pgprot_t pgprot_writethrough(pgprot_t prot); +#define PFN_PTE_SHIFT PAGE_SHIFT + /* * Set multiple PTEs to consecutive pages with a single call. All PTEs * are within the same folio, PMD and VMA. -- 2.43.0
[PATCH v3 05/15] riscv/pgtable: define PFN_PTE_SHIFT
We want to make use of pte_next_pfn() outside of set_ptes(). Let's simply define PFN_PTE_SHIFT, required by pte_next_pfn(). Reviewed-by: Alexandre Ghiti Signed-off-by: David Hildenbrand --- arch/riscv/include/asm/pgtable.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h index 0c94260b5d0c..add5cd30ab34 100644 --- a/arch/riscv/include/asm/pgtable.h +++ b/arch/riscv/include/asm/pgtable.h @@ -523,6 +523,8 @@ static inline void __set_pte_at(pte_t *ptep, pte_t pteval) set_pte(ptep, pteval); } +#define PFN_PTE_SHIFT _PAGE_PFN_SHIFT + static inline void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pteval, unsigned int nr) { -- 2.43.0
[PATCH v3 04/15] powerpc/pgtable: define PFN_PTE_SHIFT
We want to make use of pte_next_pfn() outside of set_ptes(). Let's simply define PFN_PTE_SHIFT, required by pte_next_pfn(). Reviewed-by: Christophe Leroy Signed-off-by: David Hildenbrand --- arch/powerpc/include/asm/pgtable.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 9224f23065ff..7a1ba8889aea 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -41,6 +41,8 @@ struct mm_struct; #ifndef __ASSEMBLY__ +#define PFN_PTE_SHIFT PTE_RPN_SHIFT + void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte, unsigned int nr); #define set_ptes set_ptes -- 2.43.0
[PATCH v3 03/15] nios2/pgtable: define PFN_PTE_SHIFT
We want to make use of pte_next_pfn() outside of set_ptes(). Let's simply define PFN_PTE_SHIFT, required by pte_next_pfn(). Signed-off-by: David Hildenbrand --- arch/nios2/include/asm/pgtable.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h index 5144506dfa69..d052dfcbe8d3 100644 --- a/arch/nios2/include/asm/pgtable.h +++ b/arch/nios2/include/asm/pgtable.h @@ -178,6 +178,8 @@ static inline void set_pte(pte_t *ptep, pte_t pteval) *ptep = pteval; } +#define PFN_PTE_SHIFT 0 + static inline void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte, unsigned int nr) { -- 2.43.0
[PATCH v3 02/15] arm/pgtable: define PFN_PTE_SHIFT
We want to make use of pte_next_pfn() outside of set_ptes(). Let's simply define PFN_PTE_SHIFT, required by pte_next_pfn(). Signed-off-by: David Hildenbrand --- arch/arm/include/asm/pgtable.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h index d657b84b6bf7..be91e376df79 100644 --- a/arch/arm/include/asm/pgtable.h +++ b/arch/arm/include/asm/pgtable.h @@ -209,6 +209,8 @@ static inline void __sync_icache_dcache(pte_t pteval) extern void __sync_icache_dcache(pte_t pteval); #endif +#define PFN_PTE_SHIFT PAGE_SHIFT + void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pteval, unsigned int nr); #define set_ptes set_ptes -- 2.43.0
[PATCH v3 01/15] arm64/mm: Make set_ptes() robust when OAs cross 48-bit boundary
From: Ryan Roberts Since the high bits [51:48] of an OA are not stored contiguously in the PTE, there is a theoretical bug in set_ptes(), which just adds PAGE_SIZE to the pte to get the pte with the next pfn. This works until the pfn crosses the 48-bit boundary, at which point we overflow into the upper attributes. Of course one could argue (and Matthew Wilcox has :) that we will never see a folio cross this boundary because we only allow naturally aligned power-of-2 allocation, so this would require a half-petabyte folio. So it's only a theoretical bug. But it's better that the code is robust regardless. I've implemented pte_next_pfn() as part of the fix, which is an opt-in core-mm interface. So that is now available to the core-mm, which will be needed shortly to support forthcoming fork()-batching optimizations. Link: https://lkml.kernel.org/r/20240125173534.1659317-1-ryan.robe...@arm.com Fixes: 4a169d61c2ed ("arm64: implement the new page table range API") Closes: https://lore.kernel.org/linux-mm/fdaeb9a5-d890-499a-92c8-d171df43a...@arm.com/ Signed-off-by: Ryan Roberts Reviewed-by: Catalin Marinas Reviewed-by: David Hildenbrand Signed-off-by: David Hildenbrand --- arch/arm64/include/asm/pgtable.h | 28 +--- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index b50270107e2f..9428801c1040 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -341,6 +341,22 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages) mte_sync_tags(pte, nr_pages); } +/* + * Select all bits except the pfn + */ +static inline pgprot_t pte_pgprot(pte_t pte) +{ + unsigned long pfn = pte_pfn(pte); + + return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte)); +} + +#define pte_next_pfn pte_next_pfn +static inline pte_t pte_next_pfn(pte_t pte) +{ + return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte)); +} + static inline void set_ptes(struct mm_struct 
*mm, unsigned long __always_unused addr, pte_t *ptep, pte_t pte, unsigned int nr) @@ -354,7 +370,7 @@ static inline void set_ptes(struct mm_struct *mm, if (--nr == 0) break; ptep++; - pte_val(pte) += PAGE_SIZE; + pte = pte_next_pfn(pte); } } #define set_ptes set_ptes @@ -433,16 +449,6 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte) return clear_pte_bit(pte, __pgprot(PTE_SWP_EXCLUSIVE)); } -/* - * Select all bits except the pfn - */ -static inline pgprot_t pte_pgprot(pte_t pte) -{ - unsigned long pfn = pte_pfn(pte); - - return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte)); -} - #ifdef CONFIG_NUMA_BALANCING /* * See the comment in include/linux/pgtable.h -- 2.43.0
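The overflow described in the arm64 patch above can be reproduced with a toy model. The layout below is an assumption loosely modelled on a 52-bit output address with 64KiB pages: OA[47:16] sits in PTE bits [47:16], OA[51:48] sits in PTE bits [15:12], and everything from bit 48 upwards is treated as attributes. All `TOY_`/`toy_` names are hypothetical. The sketch contrasts the naive "add PAGE_SIZE" step with a decode, increment, re-encode step in the spirit of pte_next_pfn().

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT		16
#define TOY_PAGE_SIZE		(UINT64_C(1) << TOY_PAGE_SHIFT)
#define TOY_ADDR_LOW		UINT64_C(0x0000ffffffff0000)	/* OA[47:16] */
#define TOY_ADDR_HIGH		UINT64_C(0x000000000000f000)	/* OA[51:48] */
#define TOY_ADDR_HIGH_SHIFT	36
#define TOY_ADDR_MASK		(TOY_ADDR_LOW | TOY_ADDR_HIGH)

/* Reassemble the physical address from its two PTE fields. */
static uint64_t toy_pte_to_phys(uint64_t pte)
{
	return (pte & TOY_ADDR_LOW) |
	       ((pte & TOY_ADDR_HIGH) << TOY_ADDR_HIGH_SHIFT);
}

/* Scatter a page-aligned physical address into the two PTE fields. */
static uint64_t toy_phys_to_pte(uint64_t phys)
{
	assert((phys & (TOY_PAGE_SIZE - 1)) == 0);
	return (phys | (phys >> TOY_ADDR_HIGH_SHIFT)) & TOY_ADDR_MASK;
}

/* Naive step: correct only while the carry stays inside bits [47:16]. */
static uint64_t toy_pte_next_naive(uint64_t pte)
{
	return pte + TOY_PAGE_SIZE;
}

/* Robust step: keep the attribute bits, recompute the address fields. */
static uint64_t toy_pte_next_pfn(uint64_t pte)
{
	uint64_t attrs = pte & ~TOY_ADDR_MASK;

	return toy_phys_to_pte(toy_pte_to_phys(pte) + TOY_PAGE_SIZE) | attrs;
}
```

For the last page below the 48-bit boundary, the naive step carries into bit 48 and so changes the attribute bits, while the robust step moves the carry into the high address field and leaves the attributes alone.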
[PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP
Now that the rmap overhaul[1], which provides a clean interface for rmap batching, is upstream, let's implement PTE batching during fork when processing PTE-mapped THPs.

This series is partially based on Ryan's previous work[2] to implement cont-pte support on arm64, but it's a complete rewrite based on [1] to optimize all architectures independent of any such PTE bits, and to use the new rmap batching functions that simplify the code and prepare for further rmap accounting changes.

We collect consecutive PTEs that map consecutive pages of the same large folio, making sure that the other PTE bits are compatible, and (a) adjust the refcount only once per batch, (b) call rmap handling functions only once per batch and (c) perform batch PTE setting/updates.

While this series should be beneficial for adding cont-pte support on ARM64[2], it's one of the requirements for maintaining a total mapcount[3] for large folios with minimal added overhead and further changes[4] that build up on top of the total mapcount.

Independent of all that, this series results in a speedup during fork with PTE-mapped THP, which is the default with THPs that are smaller than a PMD (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios of the same size (stddev < 1%) results in the following runtimes for fork() (shorter is better):

Folio Size | v6.8-rc1 |      New | Change
------------------------------------------
      4KiB | 0.014328 | 0.014035 |   - 2%
     16KiB | 0.014263 | 0.01196  |   -16%
     32KiB | 0.014334 | 0.01094  |   -24%
     64KiB | 0.014046 | 0.010444 |   -26%
    128KiB | 0.014011 | 0.010063 |   -28%
    256KiB | 0.013993 | 0.009938 |   -29%
    512KiB | 0.013983 | 0.00985  |   -30%
   1024KiB | 0.013986 | 0.00982  |   -30%
   2048KiB | 0.014305 | 0.010076 |   -30%

Note that these numbers are even better than the ones from v1 (verified over multiple reboots), even though there were only minimal code changes. Well, I removed a pte_mkclean() call for anon folios, maybe that also plays a role.
But my experience is that fork() is extremely sensitive to code size, inlining, ... so I suspect we'll see on other architectures rather a change of -20% instead of -30%, and it will be easy to "lose" some of that speedup in the future by subtle code changes. Next up is PTE batching when unmapping. Only tested on x86-64. Compile-tested on most other architectures. v2 -> v3: * Rebased on mm-unstable * Picked up RB's * Updated documentation of wrprotect_ptes(). v1 -> v2: * "arm64/mm: Make set_ptes() robust when OAs cross 48-bit boundary" -> Added patch from Ryan * "arm/pgtable: define PFN_PTE_SHIFT" -> Removed the arm64 bits * "mm/pgtable: make pte_next_pfn() independent of set_ptes()" * "arm/mm: use pte_next_pfn() in set_ptes()" * "powerpc/mm: use pte_next_pfn() in set_ptes()" -> Added to use pte_next_pfn() in some arch set_ptes() implementations I tried to make use of pte_next_pfn() also in the others, but it's not trivial because the other archs implement set_ptes() in their asm/pgtable.h. Future work. 
* "mm/memory: factor out copying the actual PTE in copy_present_pte()"
 -> Move common folio_get() out of if/else
* "mm/memory: optimize fork() with PTE-mapped THP"
 -> Add doc for wrprotect_ptes
 -> Extend description to mention handling of pinned folios
 -> Move common folio_ref_add() out of if/else
* "mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()"
 -> Be more conservative with dirty/soft-dirty, let the caller specify
    using flags

[1] https://lkml.kernel.org/r/20231220224504.646757-1-da...@redhat.com
[2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.robe...@arm.com
[3] https://lkml.kernel.org/r/20230809083256.699513-1-da...@redhat.com
[4] https://lkml.kernel.org/r/20231124132626.235350-1-da...@redhat.com
[5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.robe...@arm.com

Cc: Andrew Morton
Cc: Matthew Wilcox (Oracle)
Cc: Ryan Roberts
Cc: Russell King
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Dinh Nguyen
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: "Aneesh Kumar K.V"
Cc: "Naveen N. Rao"
Cc: Paul Walmsley
Cc: Palmer Dabbelt
Cc: Albert Ou
Cc: Alexander Gordeev
Cc: Gerald Schaefer
Cc: Heiko Carstens
Cc: Vasily Gorbik
Cc: Christian Borntraeger
Cc: Sven Schnelle
Cc: "David S. Miller"
Cc: linux-arm-ker...@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ri...@lists.infradead.org
Cc: linux-s...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
---
Andrew asked for a resend based on latest mm-unstable. I am sending this out earlier than I would usually have sent out the next version, so we can pull it into mm-unstable again now that v1 was dropped.

David Hildenbrand (14):
  arm/pgtable: define PFN_PTE_SHIFT
  nios2/pgtable: define PFN_PTE_SHIFT
  powerpc/pgtable: define PFN_PTE_SHIFT
  riscv/pgtable: define PFN_P
[PATCH] perf/pmu-events/powerpc: Update json mapfile with Power11 PVR
Add the Power11 PVR to the json mapfile to enable json events. Power11 is PowerISA v3.1 compliant and supports Power10 events.

Signed-off-by: Madhavan Srinivasan
---
 tools/perf/pmu-events/arch/powerpc/mapfile.csv | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/pmu-events/arch/powerpc/mapfile.csv b/tools/perf/pmu-events/arch/powerpc/mapfile.csv
index 599a588dbeb4..4d5e9138d4cc 100644
--- a/tools/perf/pmu-events/arch/powerpc/mapfile.csv
+++ b/tools/perf/pmu-events/arch/powerpc/mapfile.csv
@@ -15,3 +15,4 @@
 0x0066[[:xdigit:]]{4},1,power8,core
 0x004e[[:xdigit:]]{4},1,power9,core
 0x0080[[:xdigit:]]{4},1,power10,core
+0x0082[[:xdigit:]]{4},1,power10,core
--
2.43.0
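For readers unfamiliar with the mapfile format: the first column is a POSIX extended regular expression that is matched against the CPU's PVR string, and the third column names the event directory to use. The following is only a minimal userspace sketch of that lookup, not perf's actual code. The struct, the lookup_event_table() helper and the whole-string match requirement are assumptions made for illustration; only the four patterns are taken from the diff above.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory copy of the mapfile rows shown in the diff. */
struct pvr_map_entry {
	const char *pvr_pattern;	/* POSIX extended regex for the PVR string */
	const char *event_table;	/* event directory named in column three */
};

static const struct pvr_map_entry pvr_map[] = {
	{ "0x0066[[:xdigit:]]{4}", "power8"  },
	{ "0x004e[[:xdigit:]]{4}", "power9"  },
	{ "0x0080[[:xdigit:]]{4}", "power10" },
	{ "0x0082[[:xdigit:]]{4}", "power10" },	/* Power11 reuses the Power10 events */
};

/*
 * Return the event table for a PVR string such as "0x00820200",
 * or NULL if no row matches. Assumption: the whole string must match.
 */
const char *lookup_event_table(const char *pvr)
{
	size_t i;

	for (i = 0; i < sizeof(pvr_map) / sizeof(pvr_map[0]); i++) {
		regex_t re;
		regmatch_t m;
		int rc;

		if (regcomp(&re, pvr_map[i].pvr_pattern, REG_EXTENDED) != 0)
			continue;	/* skip a pattern that fails to compile */

		rc = regexec(&re, pvr, 1, &m, 0);
		regfree(&re);

		if (rc == 0 && m.rm_so == 0 && (size_t)m.rm_eo == strlen(pvr))
			return pvr_map[i].event_table;
	}
	return NULL;
}
```

With the new row in place, a Power11 PVR string resolves to the same "power10" table as a Power10 one; without it, the lookup in this sketch would return NULL for 0x0082xxxx.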
Re: [PATCH v2 14/15] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
On 25/01/2024 19:32, David Hildenbrand wrote: > Let's always ignore the accessed/young bit: we'll always mark the PTE > as old in our child process during fork, and upcoming users will > similarly not care. > > Ignore the dirty bit only if we don't want to duplicate the dirty bit > into the child process during fork. Maybe, we could just set all PTEs > in the child dirty if any PTE is dirty. For now, let's keep the behavior > unchanged, this can be optimized later if required. > > Ignore the soft-dirty bit only if the bit doesn't have any meaning in > the src vma, and similarly won't have any in the copied dst vma. > > For now, we won't bother with the uffd-wp bit. > > Signed-off-by: David Hildenbrand Reviewed-by: Ryan Roberts > --- > mm/memory.c | 36 +++- > 1 file changed, 31 insertions(+), 5 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 4d1be89a01ee0..b3f035fe54c8d 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -953,24 +953,44 @@ static __always_inline void __copy_present_ptes(struct > vm_area_struct *dst_vma, > set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr); > } > > +/* Flags for folio_pte_batch(). */ > +typedef int __bitwise fpb_t; > + > +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */ > +#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0)) > + > +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */ > +#define FPB_IGNORE_SOFT_DIRTY((__force fpb_t)BIT(1)) > + > +static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) > +{ > + if (flags & FPB_IGNORE_DIRTY) > + pte = pte_mkclean(pte); > + if (likely(flags & FPB_IGNORE_SOFT_DIRTY)) > + pte = pte_clear_soft_dirty(pte); > + return pte_mkold(pte); > +} > + > /* > * Detect a PTE batch: consecutive (present) PTEs that map consecutive > * pages of the same folio. > * > - * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN. 
> + * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN, > + * the accessed bit, dirty bit (with FPB_IGNORE_DIRTY) and soft-dirty bit > + * (with FPB_IGNORE_SOFT_DIRTY). > */ > static inline int folio_pte_batch(struct folio *folio, unsigned long addr, > - pte_t *start_ptep, pte_t pte, int max_nr) > + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags) > { > unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio); > const pte_t *end_ptep = start_ptep + max_nr; > - pte_t expected_pte = pte_next_pfn(pte); > + pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte), > flags); > pte_t *ptep = start_ptep + 1; > > VM_WARN_ON_FOLIO(!pte_present(pte), folio); > > while (ptep != end_ptep) { > - pte = ptep_get(ptep); > + pte = __pte_batch_clear_ignored(ptep_get(ptep), flags); > > if (!pte_same(pte, expected_pte)) > break; > @@ -1004,6 +1024,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, > struct vm_area_struct *src_vma > { > struct page *page; > struct folio *folio; > + fpb_t flags = 0; > int err, nr; > > page = vm_normal_page(src_vma, addr, pte); > @@ -1018,7 +1039,12 @@ copy_present_ptes(struct vm_area_struct *dst_vma, > struct vm_area_struct *src_vma >* by keeping the batching logic separate. >*/ > if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) { > - nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr); > + if (src_vma->vm_flags & VM_SHARED) > + flags |= FPB_IGNORE_DIRTY; > + if (!vma_soft_dirty_enabled(src_vma)) > + flags |= FPB_IGNORE_SOFT_DIRTY; > + > + nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags); > folio_ref_add(folio, nr); > if (folio_test_anon(folio)) { > if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
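To make the effect of the flags easier to follow, here is a minimal userspace model of the comparison, not the kernel code itself. The toy_* names, the 64-bit PTE layout and the bit positions are all hypothetical, and the model leaves out the folio end check that the real folio_pte_batch() performs.

```c
#include <stdint.h>

/* Hypothetical toy PTE layout; not any real architecture's format. */
typedef uint64_t toy_pte_t;

#define TOY_PFN_SHIFT		12
#define TOY_ACCESSED		(1ull << 0)
#define TOY_DIRTY		(1ull << 1)
#define TOY_SOFT_DIRTY		(1ull << 2)
#define TOY_WRITE		(1ull << 3)

/* Stand-ins for FPB_IGNORE_DIRTY / FPB_IGNORE_SOFT_DIRTY. */
#define TOY_IGNORE_DIRTY	(1u << 0)
#define TOY_IGNORE_SOFT_DIRTY	(1u << 1)

/* Clear every bit that must not influence the comparison. */
static toy_pte_t toy_clear_ignored(toy_pte_t pte, unsigned int flags)
{
	if (flags & TOY_IGNORE_DIRTY)
		pte &= ~TOY_DIRTY;
	if (flags & TOY_IGNORE_SOFT_DIRTY)
		pte &= ~TOY_SOFT_DIRTY;
	return pte & ~TOY_ACCESSED;	/* the accessed bit is always ignored */
}

/*
 * Count how many entries, starting at ptes[0], map consecutive PFNs
 * with otherwise identical (non-ignored) bits. max_nr must be >= 1.
 */
int toy_pte_batch(const toy_pte_t *ptes, int max_nr, unsigned int flags)
{
	toy_pte_t expected = toy_clear_ignored(ptes[0], flags);
	int nr = 1;

	while (nr < max_nr) {
		expected += 1ull << TOY_PFN_SHIFT;	/* next PFN, same bits */
		if (toy_clear_ignored(ptes[nr], flags) != expected)
			break;
		nr++;
	}
	return nr;
}
```

In this model, a run of three consecutive PFNs where only the first and third entries are dirty forms a batch of one without TOY_IGNORE_DIRTY and a batch of three with it, which is the behavior the patch selects for shared mappings.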
Re: Re: [PATCH] KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat'
Hi Aneesh, Thanks for looking into the patch. My comments are inline below. On 2024/01/24 01:06 PM, Aneesh Kumar K.V wrote: > Amit Machhiwal writes: > > > Currently, rebooting a pseries nested qemu-kvm guest (L2) results in > > the error below as L1 qemu sends PVR value 'arch_compat' == 0 via > > ppc_set_compat ioctl. This triggers a condition failure in > > kvmppc_set_arch_compat() resulting in an EINVAL. > > > > qemu-system-ppc64: Unable to set CPU compatibility mode in KVM: Invalid > > > > This patch updates kvmppc_set_arch_compat() to use the host PVR value if > > 'compat_pvr' == 0 indicating that qemu doesn't want to enforce any > > specific PVR compat mode. > > > > Signed-off-by: Amit Machhiwal > > --- > > arch/powerpc/kvm/book3s_hv.c | 2 +- > > arch/powerpc/kvm/book3s_hv_nestedv2.c | 12 ++-- > > 2 files changed, 11 insertions(+), 3 deletions(-) > > > > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c > > index 1ed6ec140701..9573d7f4764a 100644 > > --- a/arch/powerpc/kvm/book3s_hv.c > > +++ b/arch/powerpc/kvm/book3s_hv.c > > @@ -439,7 +439,7 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu > > *vcpu, u32 arch_compat) > > if (guest_pcr_bit > host_pcr_bit) > > return -EINVAL; > > > > - if (kvmhv_on_pseries() && kvmhv_is_nestedv2()) { > > + if (kvmhv_on_pseries() && kvmhv_is_nestedv2() && arch_compat) { > > if (!(cap & nested_capabilities)) > > return -EINVAL; > > } > > > > Instead of that arch_compat check, would it be better to do > > if (kvmhv_on_pseries() && kvmhv_is_nestedv2()) { > if (cap && !(cap & nested_capabilities)) > return -EINVAL; > } > > i.e., if a capability is requested, then check against nested_capabilities > to see if the capability exists. The above condition check will cause problems when we try to boot a machine below Power 9. For example, if we passed arch_compat == PVR_ARCH_207, cap would remain 0, making the above check evaluate to false.
Consequently, we would never return -EINVAL in that case, and the arch compatibility request would succeed even when it doesn't support a nested PAPR guest. > > > > diff --git a/arch/powerpc/kvm/book3s_hv_nestedv2.c > > b/arch/powerpc/kvm/book3s_hv_nestedv2.c > > index fd3c4f2d9480..069a1fcfd782 100644 > > --- a/arch/powerpc/kvm/book3s_hv_nestedv2.c > > +++ b/arch/powerpc/kvm/book3s_hv_nestedv2.c > > @@ -138,6 +138,7 @@ static int gs_msg_ops_vcpu_fill_info(struct > > kvmppc_gs_buff *gsb, > > vector128 v; > > int rc, i; > > u16 iden; > > + u32 arch_compat = 0; > > > > vcpu = gsm->data; > > > > @@ -347,8 +348,15 @@ static int gs_msg_ops_vcpu_fill_info(struct > > kvmppc_gs_buff *gsb, > > break; > > } > > case KVMPPC_GSID_LOGICAL_PVR: > > - rc = kvmppc_gse_put_u32(gsb, iden, > > - vcpu->arch.vcore->arch_compat); > > + if (!vcpu->arch.vcore->arch_compat) { > > + if (cpu_has_feature(CPU_FTR_ARCH_31)) > > + arch_compat = PVR_ARCH_31; > > + else if (cpu_has_feature(CPU_FTR_ARCH_300)) > > + arch_compat = PVR_ARCH_300; > > + } else { > > + arch_compat = vcpu->arch.vcore->arch_compat; > > + } > > + rc = kvmppc_gse_put_u32(gsb, iden, arch_compat); > > > > Won't an arch_compat = 0 work here? I.e., were you observing the -EINVAL from > the first hunk or does this hunk have an impact? > No, an arch_compat == 0 won't work in nested API v2. That's because the guest-wide PVR cannot be 0, and if arch_compat == 0, then a supported host PVR value should be specified.
If we were to skip this hunk (keeping the arch_compat == 0), a system reboot of the L2 guest would fail and result in a kernel trap as below: [ 22.106360] reboot: Restarting system KVM: unknown exit, hardware reason ffea NIP 0100 LR fe44 CTR XER 20040092 CPU#0 MSR 1000 HID0 HF 6c00 iidx 3 didx 3 TB DECR 0 GPR00 c2a8c300 7fe0 GPR04 1002 82803033 GPR08 0a00 0004 2fff GPR12 c2e1 000105639200 0004 GPR16 00010563a090 GPR20 000105639e20 0001056399c8 7fffe54abab0 000105639288 GPR24 0001 0001 GPR28 c2b30840 CR [ - - - - - - - - ] RES 000@ SRR0
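The difference between the two proposed conditions in kvmppc_set_arch_compat() can be shown with a small userspace model. This is only a sketch under stated assumptions: the TOY_* constants are placeholders rather than the real PVR or capability values, toy_cap_for() merely encodes the statement from this thread that cap stays 0 for levels below Power 9, and the kvmhv_on_pseries() && kvmhv_is_nestedv2() precondition is assumed to be true.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder compat levels; the real PVR_ARCH_* values are not used here. */
enum { TOY_ARCH_NONE = 0, TOY_ARCH_207, TOY_ARCH_300, TOY_ARCH_31 };

/* Placeholder capability bits for a nested-v2 host. */
#define TOY_CAP_P9	(1u << 0)
#define TOY_CAP_P10	(1u << 1)

/* Per the thread: only Power9 and later compat levels set a capability. */
static uint32_t toy_cap_for(int arch_compat)
{
	switch (arch_compat) {
	case TOY_ARCH_300:
		return TOY_CAP_P9;
	case TOY_ARCH_31:
		return TOY_CAP_P10;
	default:
		return 0;	/* pre-Power9 levels and "no compat mode" */
	}
}

/* Condition from the patch: skip the check only when arch_compat == 0. */
bool toy_rejected_by_patch(int arch_compat, uint32_t nested_caps)
{
	uint32_t cap = toy_cap_for(arch_compat);

	return arch_compat && !(cap & nested_caps);
}

/* Condition suggested in review: skip the check whenever cap == 0. */
bool toy_rejected_by_suggestion(int arch_compat, uint32_t nested_caps)
{
	uint32_t cap = toy_cap_for(arch_compat);

	return cap && !(cap & nested_caps);
}
```

Both variants accept arch_compat == 0, but only the patch's variant still rejects a pre-Power9 request such as TOY_ARCH_207, which is the case described above.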
Re: [PATCH v2 13/15] mm/memory: optimize fork() with PTE-mapped THP
On 25/01/2024 19:32, David Hildenbrand wrote: > Let's implement PTE batching when consecutive (present) PTEs map > consecutive pages of the same large folio, and all other PTE bits besides > the PFNs are equal. > > We will optimize folio_pte_batch() separately, to ignore selected > PTE bits. This patch is based on work by Ryan Roberts. > > Use __always_inline for __copy_present_ptes() and keep the handling for > single PTEs completely separate from the multi-PTE case: we really want > the compiler to optimize for the single-PTE case with small folios, to > not degrade performance. > > Note that PTE batching will never exceed a single page table and will > always stay within VMA boundaries. > > Further, processing PTE-mapped THP that maybe pinned and have > PageAnonExclusive set on at least one subpage should work as expected, > but there is room for improvement: We will repeatedly (1) detect a PTE > batch (2) detect that we have to copy a page (3) fall back and allocate a > single page to copy a single page. For now we won't care as pinned pages > are a corner case, and we should rather look into maintaining only a > single PageAnonExclusive bit for large folios. > > Signed-off-by: David Hildenbrand Reviewed-by: Ryan Roberts > --- > include/linux/pgtable.h | 31 +++ > mm/memory.c | 112 +--- > 2 files changed, 124 insertions(+), 19 deletions(-) > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index 351cd9dc7194f..891ed246978a4 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -650,6 +650,37 @@ static inline void ptep_set_wrprotect(struct mm_struct > *mm, unsigned long addres > } > #endif > > +#ifndef wrprotect_ptes > +/** > + * wrprotect_ptes - Write-protect consecutive pages that are mapped to a > + * contiguous range of addresses. > + * @mm: Address space to map the pages into. > + * @addr: Address the first page is mapped at. > + * @ptep: Page table pointer for the first entry. 
> + * @nr: Number of pages to write-protect. > + * > + * May be overridden by the architecture; otherwise, implemented as a simple > + * loop over ptep_set_wrprotect(). > + * > + * Note that PTE bits in the PTE range besides the PFN can differ. For > example, > + * some PTEs might already be write-protected. > + * > + * Context: The caller holds the page table lock. The pages all belong > + * to the same folio. The PTEs are all in the same PMD. > + */ > +static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr, > + pte_t *ptep, unsigned int nr) > +{ > + for (;;) { > + ptep_set_wrprotect(mm, addr, ptep); > + if (--nr == 0) > + break; > + ptep++; > + addr += PAGE_SIZE; > + } > +} > +#endif > + > /* > * On some architectures hardware does not set page access bit when accessing > * memory page, it is responsibility of software setting this bit. It brings > diff --git a/mm/memory.c b/mm/memory.c > index 729ca4d6a820c..4d1be89a01ee0 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -930,15 +930,15 @@ copy_present_page(struct vm_area_struct *dst_vma, > struct vm_area_struct *src_vma > return 0; > } > > -static inline void __copy_present_pte(struct vm_area_struct *dst_vma, > +static __always_inline void __copy_present_ptes(struct vm_area_struct > *dst_vma, > struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte, > - pte_t pte, unsigned long addr) > + pte_t pte, unsigned long addr, int nr) > { > struct mm_struct *src_mm = src_vma->vm_mm; > > /* If it's a COW mapping, write protect it both processes. 
*/ > if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) { > - ptep_set_wrprotect(src_mm, addr, src_pte); > + wrprotect_ptes(src_mm, addr, src_pte, nr); > pte = pte_wrprotect(pte); > } > > @@ -950,26 +950,93 @@ static inline void __copy_present_pte(struct > vm_area_struct *dst_vma, > if (!userfaultfd_wp(dst_vma)) > pte = pte_clear_uffd_wp(pte); > > - set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); > + set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr); > +} > + > +/* > + * Detect a PTE batch: consecutive (present) PTEs that map consecutive > + * pages of the same folio. > + * > + * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN. > + */ > +static inline int folio_pte_batch(struct folio *folio, unsigned long addr, > + pte_t *start_ptep, pte_t pte, int max_nr) > +{ > + unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio); > + const pte_t *end_ptep = start_ptep + max_nr; > + pte_t expected_pte = pte_next_pfn(pte); > + pte_t *ptep = start_ptep + 1; > + > + VM_WARN_ON_FOLIO(!pte_present(pte), folio); > + > + while (ptep != end_ptep) { > + pte = ptep_get(ptep
Re: [PATCH 5/5] sched/vtime: do not include header
On Sun, Jan 28, 2024 at 08:58:54PM +0100, Alexander Gordeev wrote: > There is no architecture-specific code or data left > that generic needs to know about. > Thus, avoid the inclusion of header. > > Signed-off-by: Alexander Gordeev > --- > include/asm-generic/vtime.h | 1 - > include/linux/vtime.h | 4 > 2 files changed, 5 deletions(-) > delete mode 100644 include/asm-generic/vtime.h I guess you need to get rid of this as well: arch/powerpc/include/asm/Kbuild:generic-y += vtime.h
Re: [PATCH 4/5] s390/irq,nmi: do not include header
On Sun, Jan 28, 2024 at 08:58:53PM +0100, Alexander Gordeev wrote: > update_timer_sys() and update_timer_mcck() are inlines used for > CPU time accounting from the interrupt and machine-check handlers. > These routines are specific to s390 architecture, but declared > via header, which in turn includes . > Avoid the extra loop and include header directly. > > Signed-off-by: Alexander Gordeev > --- > arch/s390/kernel/irq.c | 1 + > arch/s390/kernel/nmi.c | 1 + > 2 files changed, 2 insertions(+) ... > +++ b/arch/s390/kernel/irq.c > +#include ... > +++ b/arch/s390/kernel/nmi.c > +#include It is confusing when the patch subject is "do not include.." and all this patch does is add two includes. I see what this is doing: getting rid of the implicit include of asm/vtime.h, most likely via linux/hardirq.h, but that's not very obvious. Anyway: Acked-by: Heiko Carstens
Re: [PATCH 3/5] s390/vtime: remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
On Sun, Jan 28, 2024 at 08:58:52PM +0100, Alexander Gordeev wrote: > __ARCH_HAS_VTIME_TASK_SWITCH macro is not used anymore. > > Signed-off-by: Alexander Gordeev > --- > arch/s390/include/asm/vtime.h | 2 -- > 1 file changed, 2 deletions(-) Acked-by: Heiko Carstens
Re: [PATCH] mm/debug_vm_pgtable: Fix BUG_ON with pud advanced test
On 1/29/24 12:23 PM, Anshuman Khandual wrote: > > > On 1/29/24 11:56, Aneesh Kumar K.V wrote: >> On 1/29/24 11:52 AM, Anshuman Khandual wrote: >>> >>> >>> On 1/29/24 11:30, Aneesh Kumar K.V (IBM) wrote: Architectures like powerpc add debug checks to ensure we find only devmap PUD pte entries. These debug checks are only done with CONFIG_DEBUG_VM. This patch marks the ptes used for PUD advanced test devmap pte entries so that we don't hit on debug checks on architecture like ppc64 as below. WARNING: CPU: 2 PID: 1 at arch/powerpc/mm/book3s64/radix_pgtable.c:1382 radix__pud_hugepage_update+0x38/0x138 NIP [c00a7004] radix__pud_hugepage_update+0x38/0x138 LR [c00a77a8] radix__pudp_huge_get_and_clear+0x28/0x60 Call Trace: [c4a2f950] [c4a2f9a0] 0xc4a2f9a0 (unreliable) [c4a2f980] [000d34c1] 0xd34c1 [c4a2f9a0] [c206ba98] pud_advanced_tests+0x118/0x334 [c4a2fa40] [c206db34] debug_vm_pgtable+0xcbc/0x1c48 [c4a2fc10] [c000fd28] do_one_initcall+0x60/0x388 Also kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:202! NIP [c0096510] pudp_huge_get_and_clear_full+0x98/0x174 LR [c206bb34] pud_advanced_tests+0x1b4/0x334 Call Trace: [c4a2f950] [000d34c1] 0xd34c1 (unreliable) [c4a2f9a0] [c206bb34] pud_advanced_tests+0x1b4/0x334 [c4a2fa40] [c206db34] debug_vm_pgtable+0xcbc/0x1c48 [c4a2fc10] [c000fd28] do_one_initcall+0x60/0x388 Fixes: 27af67f35631 ("powerpc/book3s64/mm: enable transparent pud hugepage") Signed-off-by: Aneesh Kumar K.V (IBM) --- mm/debug_vm_pgtable.c | 8 1 file changed, 8 insertions(+) diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 5662e29fe253..65c19025da3d 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -362,6 +362,12 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args) vaddr &= HPAGE_PUD_MASK; pud = pfn_pud(args->pud_pfn, args->page_prot); + /* + * Some architectures have debug checks to make sure + * huge pud mapping are only found with devmap entries + * For now test with only devmap entries. 
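As a side note on the x86 helpers quoted above: written that way, the two predicates are mutually exclusive for any given entry, because pud_trans_huge() requires the devmap bit to be clear. The userspace sketch below only illustrates that bit test. The TOY_* bit positions are placeholders, not the real _PAGE_PSE and _PAGE_DEVMAP values, and it says nothing about which kinds of entries the kernel actually creates at PUD level.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit positions; not the real x86 _PAGE_PSE/_PAGE_DEVMAP. */
#define TOY_PAGE_PSE	(1ull << 7)
#define TOY_PAGE_DEVMAP	(1ull << 58)

/* Mirrors the quoted pud_trans_huge(): PSE set and DEVMAP clear. */
bool toy_pud_trans_huge(uint64_t pud_val)
{
	return (pud_val & (TOY_PAGE_PSE | TOY_PAGE_DEVMAP)) == TOY_PAGE_PSE;
}

/* Mirrors the quoted pud_devmap(): DEVMAP set, regardless of PSE. */
bool toy_pud_devmap(uint64_t pud_val)
{
	return (pud_val & TOY_PAGE_DEVMAP) != 0;
}
```

So an entry with both bits set counts as devmap but not as trans_huge in this model, which is why core code checks both predicates together.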
+ */ >>> Do you see this behaviour to be changed in powerpc anytime soon ? Otherwise >>> these pud_mkdevmap() based work arounds, might be required to stick around >>> for longer just to prevent powerpc specific triggers. Given PUD transparent >>> huge pages i.e HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD are just supported on x86 >>> and powerpc platforms, could not this problem be solved in a more uniform >>> manner. >>> >> >> >> IIUC pud level transparent hugepages are only supported with devmap entries >> even >> on x86. We don't do anonymous pud hugepage. > > There are some 'pud_trans_huge(orig_pud) || pud_devmap(orig_pud)' checks in > core paths i.e in mm/memory.c which might suggest pud_trans_huge() to exist > without also being a devmap. I might be missing something here, but on x86 > platform following helpers suggest pud_trans_huge() to exist without being > a devmap as well. > > #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD > static inline int pud_trans_huge(pud_t pud) > { > return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE; > } > #endif > > #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD > static inline int pud_devmap(pud_t pud) > { > return !!(pud_val(pud) & _PAGE_DEVMAP); > } > #else > static inline int pud_devmap(pud_t pud) > { > return 0; > } > #endif > > We might need some more clarity on this regarding x86 platform's pud huge > page implementation. > static vm_fault_t create_huge_pud(struct vm_fault *vmf) { #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \ defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) struct vm_area_struct *vma = vmf->vma; /* No support for anonymous transparent PUD pages yet */ if (vma_is_anonymous(vma)) return VM_FAULT_FALLBACK; if (vma->vm_ops->huge_fault) return vma->vm_ops->huge_fault(vmf, PUD_ORDER); #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ return VM_FAULT_FALLBACK; } -aneesh