Re: [PATCH v6 (proposal)] powerpc/cpu: enable nr_cpus for crash kernel

2024-01-29 Thread Pingfan Liu
Hi Christophe,

The latest series is
https://lore.kernel.org/linuxppc-dev/20231017022806.4523-1-pi...@redhat.com/

And Michael has his implementation at:
https://lore.kernel.org/all/20231229120107.2281153-3-...@ellerman.id.au/T/#m46128446bce1095631162a1927415733a3bf0633

Thanks,

Pingfan

On Fri, Jan 26, 2024 at 3:40 AM Christophe Leroy
 wrote:
>
> Hi,
>
> Le 22/05/2018 à 10:23, Pingfan Liu a écrit :
> > For kexec -p, the boot cpu may not be cpu0, which causes a problem when
> > allocating paca[]. In theory, the device tree imposes no requirement that a
> > cpu's logical id match its order of appearance. But we have helpers such as
> > cpu_first_thread_sibling() which make assumptions about the mapping inside
> > a core. Hence the mapping is only partially changed, i.e. the mapping of
> > cores is unbound while the mapping inside a core is kept. After this patch,
> > the core containing the boot cpu will always be mapped to core 0.
> >
> > At present, the code to discover cpus is spread over two functions:
> > early_init_dt_scan_cpus() and smp_setup_cpu_maps().
> > This patch tries to fold smp_setup_cpu_maps() into the former.
>
> This patch is pretty old and doesn't apply anymore. If still relevant
> can you please rebase and resubmit.
>
> Thanks
> Christophe
>
> >
> > Signed-off-by: Pingfan Liu 
> > ---
> > v5 -> v6:
> >simplify the loop logic (Hope it can answer Benjamin's concern)
> >concentrate the cpu recovery code to the early stage (Hope it can answer Michael's concern)
> > Todo: (if this method is accepted)
> >fold the whole smp_setup_cpu_maps()
> >
> >   arch/powerpc/include/asm/smp.h |   1 +
> >   arch/powerpc/kernel/prom.c | 123 
> > -
> >   arch/powerpc/kernel/setup-common.c |  58 ++---
> >   drivers/of/fdt.c   |   2 +-
> >   include/linux/of_fdt.h |   2 +
> >   5 files changed, 103 insertions(+), 83 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
> > index fac963e..80c7693 100644
> > --- a/arch/powerpc/include/asm/smp.h
> > +++ b/arch/powerpc/include/asm/smp.h
> > @@ -30,6 +30,7 @@
> >   #include 
> >
> >   extern int boot_cpuid;
> > +extern int threads_in_core;
> >   extern int spinning_secondaries;
> >
> >   extern void cpu_die(void);
> > diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> > index 4922162..2ae0b4a 100644
> > --- a/arch/powerpc/kernel/prom.c
> > +++ b/arch/powerpc/kernel/prom.c
> > @@ -77,7 +77,6 @@ unsigned long tce_alloc_start, tce_alloc_end;
> >   u64 ppc64_rma_size;
> >   #endif
> >   static phys_addr_t first_memblock_size;
> > -static int __initdata boot_cpu_count;
> >
> >   static int __init early_parse_mem(char *p)
> >   {
> > @@ -305,6 +304,14 @@ static void __init 
> > check_cpu_feature_properties(unsigned long node)
> >   }
> >   }
> >
> > +struct bootinfo {
> > + int boot_thread_id;
> > + unsigned int cpu_cnt;
> > + int cpu_hwids[NR_CPUS];
> > + bool avail[NR_CPUS];
> > +};
> > +static struct bootinfo *bt_info;
> > +
> >   static int __init early_init_dt_scan_cpus(unsigned long node,
> > const char *uname, int depth,
> > void *data)
> > @@ -312,10 +319,12 @@ static int __init early_init_dt_scan_cpus(unsigned 
> > long node,
> >   const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
> >   const __be32 *prop;
> >   const __be32 *intserv;
> > - int i, nthreads;
> > + int i, nthreads, maxidx;
> >   int len;
> > - int found = -1;
> > - int found_thread = 0;
> > + int found_thread = -1;
> > + struct bootinfo *info = data;
> > + bool avail;
> > + int rotate_cnt, id;
> >
> >   /* We are scanning "cpu" nodes only */
> >   if (type == NULL || strcmp(type, "cpu") != 0)
> > @@ -325,8 +334,15 @@ static int __init early_init_dt_scan_cpus(unsigned 
> > long node,
> >   intserv = of_get_flat_dt_prop(node, "ibm,ppc-interrupt-server#s", 
> > &len);
> >   if (!intserv)
> >   intserv = of_get_flat_dt_prop(node, "reg", &len);
> > + avail = of_fdt_device_is_available(initial_boot_params, node);
> > +#if 0
> > + //todo
> > + if (!avail)
> > + avail = !of_fdt_property_match_string(node,
> > + "enable-method", "spin-table");
> > +#endif
> >
> > - nthreads = len / sizeof(int);
> > + threads_in_core = nthreads = len / sizeof(int);
> >
> >   /*
> >* Now see if any of these threads match our boot cpu.
> > @@ -338,9 +354,10 @@ static int __init early_init_dt_scan_cpus(unsigned 
> > long node,
> >* booted proc.
> >*/
> >   if (fdt_version(initial_boot_params) >= 2) {
> > + info->cpu_hwids[info->cpu_cnt] =
> > + be32_to_cpu(intserv[i]);
> >   if (be32_to_cpu(intserv[i]) ==
> > 

RE: [PATCH v2 linux-next 1/3] x86, crash: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope

2024-01-29 Thread Michael Kelley
From: Baoquan He  Sent: Monday, January 29, 2024 7:00 PM
> 
> Michael pointed out that the CONFIG_CRASH_DUMP ifdef is nested inside
> CONFIG_KEXEC_CODE ifdef scope in some XEN, Hyper-V codes.
> 
> Although the nesting works well too since CONFIG_CRASH_DUMP has
> dependency on CONFIG_KEXEC_CORE, it may cause confusion because there
> are places where it's not nested, and people may think it needs to be
> nested even though it doesn't have to.
> 
> Fix that by moving  CONFIG_CRASH_DUMP ifdeffery of codes out of
> CONFIG_KEXEC_CODE ifdeffery scope.
> 
> And also put function machine_crash_shutdown() definition inside
> CONFIG_CRASH_DUMP ifdef scope instead of CONFIG_KEXEC_CORE ifdef.
> 
> And also fix a building error Nathan reported as below by replacing
> CONFIG_KEXEC_CORE ifdef with CONFIG_VMCORE_INFO ifdef.
> 
> 
> $ curl -LSso .config 
> https://git.alpinelinux.org/aports/plain/community/linux-edge/config-edge.x86_64
> $ make -skj"$(nproc)" ARCH=x86_64 CROSS_COMPILE=x86_64-linux-
> olddefconfig all
> ...
> x86_64-linux-ld: arch/x86/xen/mmu_pv.o: in function
> `paddr_vmcoreinfo_note':
> mmu_pv.c:(.text+0x3af3): undefined reference to `vmcoreinfo_note'
> 
> 
> Link: 
> https://lore.kernel.org/all/sn6pr02mb4157931105fa68d72e3d3db8d4...@sn6pr02mb4157.namprd02.prod.outlook.com/T/#u
> Link: 
> https://lore.kernel.org/all/20240126045551.GA126645@dev-arch.thelio-3990X/T/#u
> Signed-off-by: Baoquan He 
> ---
> v1->v2:
> - Add missing words and fix typos in patch log pointed out by Michael.
> 
>  arch/x86/kernel/cpu/mshyperv.c | 10 ++
>  arch/x86/kernel/reboot.c   |  2 +-
>  arch/x86/xen/enlighten_hvm.c   |  4 ++--
>  arch/x86/xen/mmu_pv.c  |  2 +-
>  4 files changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mshyperv.c
> b/arch/x86/kernel/cpu/mshyperv.c
> index f8163a59026b..2e8cd5a4ae85 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -209,6 +209,7 @@ static void hv_machine_shutdown(void)
>   if (kexec_in_progress)
>   hyperv_cleanup();
>  }
> +#endif /* CONFIG_KEXEC_CORE */
> 
>  #ifdef CONFIG_CRASH_DUMP
>  static void hv_machine_crash_shutdown(struct pt_regs *regs)
> @@ -222,8 +223,7 @@ static void hv_machine_crash_shutdown(struct
> pt_regs *regs)
>   /* Disable the hypercall page when there is only 1 active CPU. */
>   hyperv_cleanup();
>  }
> -#endif
> -#endif /* CONFIG_KEXEC_CORE */
> +#endif /* CONFIG_CRASH_DUMP */
>  #endif /* CONFIG_HYPERV */
> 
>  static uint32_t  __init ms_hyperv_platform(void)
> @@ -497,9 +497,11 @@ static void __init ms_hyperv_init_platform(void)
>   no_timer_check = 1;
>  #endif
> 
> -#if IS_ENABLED(CONFIG_HYPERV) && defined(CONFIG_KEXEC_CORE)
> +#if IS_ENABLED(CONFIG_HYPERV)
> +#if defined(CONFIG_KEXEC_CORE)
>   machine_ops.shutdown = hv_machine_shutdown;
> -#ifdef CONFIG_CRASH_DUMP
> +#endif
> +#if defined(CONFIG_CRASH_DUMP)
>   machine_ops.crash_shutdown = hv_machine_crash_shutdown;
>  #endif
>  #endif
> diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
> index 1287b0d5962f..f3130f762784 100644
> --- a/arch/x86/kernel/reboot.c
> +++ b/arch/x86/kernel/reboot.c
> @@ -826,7 +826,7 @@ void machine_halt(void)
>   machine_ops.halt();
>  }
> 
> -#ifdef CONFIG_KEXEC_CORE
> +#ifdef CONFIG_CRASH_DUMP
>  void machine_crash_shutdown(struct pt_regs *regs)
>  {
>   machine_ops.crash_shutdown(regs);
> diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
> index 09e3db7ff990..0b367c1e086d 100644
> --- a/arch/x86/xen/enlighten_hvm.c
> +++ b/arch/x86/xen/enlighten_hvm.c
> @@ -148,6 +148,7 @@ static void xen_hvm_shutdown(void)
>   if (kexec_in_progress)
>   xen_reboot(SHUTDOWN_soft_reset);
>  }
> +#endif
> 
>  #ifdef CONFIG_CRASH_DUMP
>  static void xen_hvm_crash_shutdown(struct pt_regs *regs)
> @@ -156,7 +157,6 @@ static void xen_hvm_crash_shutdown(struct pt_regs
> *regs)
>   xen_reboot(SHUTDOWN_soft_reset);
>  }
>  #endif
> -#endif
> 
>  static int xen_cpu_up_prepare_hvm(unsigned int cpu)
>  {
> @@ -238,10 +238,10 @@ static void __init xen_hvm_guest_init(void)
> 
>  #ifdef CONFIG_KEXEC_CORE
>   machine_ops.shutdown = xen_hvm_shutdown;
> +#endif
>  #ifdef CONFIG_CRASH_DUMP
>   machine_ops.crash_shutdown = xen_hvm_crash_shutdown;
>  #endif
> -#endif
>  }
> 
>  static __init int xen_parse_nopv(char *arg)
> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
> index 218773cfb009..e21974f2cf2d 100644
> --- a/arch/x86/xen/mmu_pv.c
> +++ b/arch/x86/xen/mmu_pv.c
> @@ -2520,7 +2520,7 @@ int xen_remap_pfn(struct vm_area_struct *vma,
> unsigned long addr,
>  }
>  EXPORT_SYMBOL_GPL(xen_remap_pfn);
> 
> -#ifdef CONFIG_KEXEC_CORE
> +#ifdef CONFIG_VMCORE_INFO
>  phys_addr_t paddr_vmcoreinfo_note(void)
>  {
>   if (xen_pv_domain())
> --
> 2.41.0

Reviewed-by: Michael Kelley 



Re: [PATCH linux-next 1/3] x86, crash: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope

2024-01-29 Thread Baoquan He
On 01/30/24 at 01:39am, Michael Kelley wrote:
> From: Baoquan He 
> > 
> > On 01/29/24 at 06:27pm, Michael Kelley wrote:
> > > From: Baoquan He  Sent: Monday, January 29, 2024
> > 5:51 AM
> > > >
> > > > Michael pointed out that the #ifdef CONFIG_CRASH_DUMP is nested inside
> > > > arch/x86/xen/enlighten_hvm.c.
> > >
> > > Did some words get left out in the above sentence?  It mentions the Xen
> > > case, but not the Hyper-V case.  I'm not sure what you intended.
> > 
> > Thanks a lot for your careful review.
> > 
> > Yeah, I tried to list all the affected file names, but it seems my vim editor
> > threw away some words. And I forgot to mention the change in reboot.c.
> > 
> > I adjusted the log as below according to your comments; do you think it's OK
> > now?
> 
> Yes -- it looks like everything is included and it clears up my confusion.  But
> I still have two small nits, per below. :-)

Right, I will fold them into v2. Thanks again.

> 
> > 
> > ===
> > Michael pointed out that the #ifdef CONFIG_CRASH_DUMP is nested inside
> > CONFIG_KEXEC_CODE ifdef scope in some XEN, HyperV codes.
> 
> s/HyperV/Hyper-V/
> 
> > 
> > Although the nesting works well too since CONFIG_CRASH_DUMP has
> > dependency on CONFIG_KEXEC_CORE, it may cause confusion because there
> > are places where it's not nested, and people may think it needs be nested
> 
> s/needs be/needs to be/
> 
> > even though it doesn't have to.
> > 
> > Fix that by moving  CONFIG_CRASH_DUMP ifdeffery of codes out of
> > CONFIG_KEXEC_CODE ifdeffery scope.
> > 
> > And also put function machine_crash_shutdown() definition inside
> > CONFIG_CRASH_DUMP ifdef scope instead of CONFIG_KEXEC_CORE ifdef.
> > 
> > And also fix a building error Nathan reported as below by replacing
> > CONFIG_KEXEC_CORE ifdef with CONFIG_VMCORE_INFO ifdef.
> > ..
> > ===
> > 
> > Thanks
> > Baoquan
> 



Re: [PATCH v2 linux-next 1/3] x86, crash: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope

2024-01-29 Thread Baoquan He
Michael pointed out that the CONFIG_CRASH_DUMP ifdef is nested inside
CONFIG_KEXEC_CODE ifdef scope in some XEN, Hyper-V codes.

Although the nesting works well too since CONFIG_CRASH_DUMP has
dependency on CONFIG_KEXEC_CORE, it may cause confusion because there
are places where it's not nested, and people may think it needs to be
nested even though it doesn't have to.

Fix that by moving  CONFIG_CRASH_DUMP ifdeffery of codes out of
CONFIG_KEXEC_CODE ifdeffery scope.

And also put function machine_crash_shutdown() definition inside
CONFIG_CRASH_DUMP ifdef scope instead of CONFIG_KEXEC_CORE ifdef.

And also fix a building error Nathan reported as below by replacing
CONFIG_KEXEC_CORE ifdef with CONFIG_VMCORE_INFO ifdef.


$ curl -LSso .config 
https://git.alpinelinux.org/aports/plain/community/linux-edge/config-edge.x86_64
$ make -skj"$(nproc)" ARCH=x86_64 CROSS_COMPILE=x86_64-linux- olddefconfig all
...
x86_64-linux-ld: arch/x86/xen/mmu_pv.o: in function `paddr_vmcoreinfo_note':
mmu_pv.c:(.text+0x3af3): undefined reference to `vmcoreinfo_note'


Link: 
https://lore.kernel.org/all/sn6pr02mb4157931105fa68d72e3d3db8d4...@sn6pr02mb4157.namprd02.prod.outlook.com/T/#u
Link: 
https://lore.kernel.org/all/20240126045551.GA126645@dev-arch.thelio-3990X/T/#u
Signed-off-by: Baoquan He 
---
v1->v2:
- Add missing words and fix typos in patch log pointed out by Michael.
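
As a rough, self-contained sketch of the layout change (the function names here
are placeholders, not the real Hyper-V/Xen helpers -- see the diff below for the
actual code): because CONFIG_CRASH_DUMP depends on CONFIG_KEXEC_CORE, the nested
and the flat forms compile to the same thing, but the flat form states each
function's real dependency.

  /* before: nested -- only correct because CRASH_DUMP implies KEXEC_CORE */
  #ifdef CONFIG_KEXEC_CORE
  static void example_shutdown(void) { /* ... */ }
  #ifdef CONFIG_CRASH_DUMP
  static void example_crash_shutdown(struct pt_regs *regs) { /* ... */ }
  #endif
  #endif

  /* after: flat -- each function guarded by the option it actually needs */
  #ifdef CONFIG_KEXEC_CORE
  static void example_shutdown(void) { /* ... */ }
  #endif /* CONFIG_KEXEC_CORE */

  #ifdef CONFIG_CRASH_DUMP
  static void example_crash_shutdown(struct pt_regs *regs) { /* ... */ }
  #endif /* CONFIG_CRASH_DUMP */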

 arch/x86/kernel/cpu/mshyperv.c | 10 ++
 arch/x86/kernel/reboot.c   |  2 +-
 arch/x86/xen/enlighten_hvm.c   |  4 ++--
 arch/x86/xen/mmu_pv.c  |  2 +-
 4 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index f8163a59026b..2e8cd5a4ae85 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -209,6 +209,7 @@ static void hv_machine_shutdown(void)
if (kexec_in_progress)
hyperv_cleanup();
 }
+#endif /* CONFIG_KEXEC_CORE */
 
 #ifdef CONFIG_CRASH_DUMP
 static void hv_machine_crash_shutdown(struct pt_regs *regs)
@@ -222,8 +223,7 @@ static void hv_machine_crash_shutdown(struct pt_regs *regs)
/* Disable the hypercall page when there is only 1 active CPU. */
hyperv_cleanup();
 }
-#endif
-#endif /* CONFIG_KEXEC_CORE */
+#endif /* CONFIG_CRASH_DUMP */
 #endif /* CONFIG_HYPERV */
 
 static uint32_t  __init ms_hyperv_platform(void)
@@ -497,9 +497,11 @@ static void __init ms_hyperv_init_platform(void)
no_timer_check = 1;
 #endif
 
-#if IS_ENABLED(CONFIG_HYPERV) && defined(CONFIG_KEXEC_CORE)
+#if IS_ENABLED(CONFIG_HYPERV)
+#if defined(CONFIG_KEXEC_CORE)
machine_ops.shutdown = hv_machine_shutdown;
-#ifdef CONFIG_CRASH_DUMP
+#endif
+#if defined(CONFIG_CRASH_DUMP)
machine_ops.crash_shutdown = hv_machine_crash_shutdown;
 #endif
 #endif
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 1287b0d5962f..f3130f762784 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -826,7 +826,7 @@ void machine_halt(void)
machine_ops.halt();
 }
 
-#ifdef CONFIG_KEXEC_CORE
+#ifdef CONFIG_CRASH_DUMP
 void machine_crash_shutdown(struct pt_regs *regs)
 {
machine_ops.crash_shutdown(regs);
diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index 09e3db7ff990..0b367c1e086d 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -148,6 +148,7 @@ static void xen_hvm_shutdown(void)
if (kexec_in_progress)
xen_reboot(SHUTDOWN_soft_reset);
 }
+#endif
 
 #ifdef CONFIG_CRASH_DUMP
 static void xen_hvm_crash_shutdown(struct pt_regs *regs)
@@ -156,7 +157,6 @@ static void xen_hvm_crash_shutdown(struct pt_regs *regs)
xen_reboot(SHUTDOWN_soft_reset);
 }
 #endif
-#endif
 
 static int xen_cpu_up_prepare_hvm(unsigned int cpu)
 {
@@ -238,10 +238,10 @@ static void __init xen_hvm_guest_init(void)
 
 #ifdef CONFIG_KEXEC_CORE
machine_ops.shutdown = xen_hvm_shutdown;
+#endif
 #ifdef CONFIG_CRASH_DUMP
machine_ops.crash_shutdown = xen_hvm_crash_shutdown;
 #endif
-#endif
 }
 
 static __init int xen_parse_nopv(char *arg)
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 218773cfb009..e21974f2cf2d 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2520,7 +2520,7 @@ int xen_remap_pfn(struct vm_area_struct *vma, unsigned 
long addr,
 }
 EXPORT_SYMBOL_GPL(xen_remap_pfn);
 
-#ifdef CONFIG_KEXEC_CORE
+#ifdef CONFIG_VMCORE_INFO
 phys_addr_t paddr_vmcoreinfo_note(void)
 {
if (xen_pv_domain())
-- 
2.41.0



RE: [PATCH linux-next 1/3] x86, crash: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope

2024-01-29 Thread Michael Kelley
From: Baoquan He 
> 
> On 01/29/24 at 06:27pm, Michael Kelley wrote:
> > From: Baoquan He  Sent: Monday, January 29, 2024
> 5:51 AM
> > >
> > > Michael pointed out that the #ifdef CONFIG_CRASH_DUMP is nested inside
> > > arch/x86/xen/enlighten_hvm.c.
> >
> > Did some words get left out in the above sentence?  It mentions the Xen
> > case, but not the Hyper-V case.  I'm not sure what you intended.
> 
> Thanks a lot for your careful review.
> 
> Yeah, I tried to list all the affected file names, but it seems my vim editor
> threw away some words. And I forgot to mention the change in reboot.c.
> 
> I adjusted the log as below according to your comments; do you think it's OK
> now?

Yes -- it looks like everything is included and it clears up my confusion.  But
I still have two small nits, per below. :-)

Michael

> 
> ===
> Michael pointed out that the #ifdef CONFIG_CRASH_DUMP is nested inside
> CONFIG_KEXEC_CODE ifdef scope in some XEN, HyperV codes.

s/HyperV/Hyper-V/

> 
> Although the nesting works well too since CONFIG_CRASH_DUMP has
> dependency on CONFIG_KEXEC_CORE, it may cause confusion because there
> are places where it's not nested, and people may think it needs be nested

s/needs be/needs to be/

> even though it doesn't have to.
> 
> Fix that by moving  CONFIG_CRASH_DUMP ifdeffery of codes out of
> CONFIG_KEXEC_CODE ifdeffery scope.
> 
> And also put function machine_crash_shutdown() definition inside
> CONFIG_CRASH_DUMP ifdef scope instead of CONFIG_KEXEC_CORE ifdef.
> 
> And also fix a building error Nathan reported as below by replacing
> CONFIG_KEXEC_CORE ifdef with CONFIG_VMCORE_INFO ifdef.
> ..
> ===
> 
> Thanks
> Baoquan



Re: [PATCH linux-next 1/3] x86, crash: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope

2024-01-29 Thread Baoquan He
On 01/29/24 at 06:27pm, Michael Kelley wrote:
> From: Baoquan He  Sent: Monday, January 29, 2024 5:51 AM
> > 
> > Michael pointed out that the #ifdef CONFIG_CRASH_DUMP is nested inside
> > arch/x86/xen/enlighten_hvm.c.
> 
> Did some words get left out in the above sentence?  It mentions the Xen
> case, but not the Hyper-V case.  I'm not sure what you intended.

Thanks a lot for your careful review.

Yeah, I tried to list all the affected file names, but it seems my vim editor
threw away some words. And I forgot to mention the change in reboot.c.

I adjusted the log as below according to your comments; do you think it's OK
now?

===
Michael pointed out that the #ifdef CONFIG_CRASH_DUMP is nested inside
CONFIG_KEXEC_CODE ifdef scope in some XEN, HyperV codes. 

Although the nesting works well too since CONFIG_CRASH_DUMP has
dependency on CONFIG_KEXEC_CORE, it may cause confusion because there
are places where it's not nested, and people may think it needs be nested
even though it doesn't have to.

Fix that by moving  CONFIG_CRASH_DUMP ifdeffery of codes out of
CONFIG_KEXEC_CODE ifdeffery scope.

And also put function machine_crash_shutdown() definition inside
CONFIG_CRASH_DUMP ifdef scope instead of CONFIG_KEXEC_CORE ifdef.

And also fix a building error Nathan reported as below by replacing
CONFIG_KEXEC_CORE ifdef with CONFIG_VMCORE_INFO ifdef.
..
===

Thanks
Baoquan



Re: [PATCH v10 5/6] arm64: support copy_mc_[user]_highpage()

2024-01-29 Thread Andrey Konovalov
On Mon, Jan 29, 2024 at 2:47 PM Tong Tiangen  wrote:
>
> Currently, many scenarios that can tolerate memory errors when copying a page
> are supported in the kernel[1][2][3], all of which are implemented by
> copy_mc_[user]_highpage(). arm64 should also support this mechanism.
>
> Due to MTE, arm64 needs its own architecture implementation of
> copy_mc_[user]_highpage(); the macros __HAVE_ARCH_COPY_MC_HIGHPAGE and
> __HAVE_ARCH_COPY_MC_USER_HIGHPAGE have been added to control it.
>
> Add a new helper copy_mc_page() which provides a machine-check-safe page copy
> implementation. copy_mc_page() in copy_mc_page.S largely borrows from
> copy_page() in copy_page.S; the main difference is that copy_mc_page() adds an
> extable entry to every load/store instruction to support machine check safety.
>
> Add a new extable type EX_TYPE_COPY_MC_PAGE_ERR_ZERO which is used in
> copy_mc_page().
>
> [1]a873dfe1032a ("mm, hwpoison: try to recover from copy-on write faults")
> [2]5f2500b93cc9 ("mm/khugepaged: recover from poisoned anonymous memory")
> [3]6b970599e807 ("mm: hwpoison: support recovery from 
> ksm_might_need_to_copy()")
>
> Signed-off-by: Tong Tiangen 
> ---
>  arch/arm64/include/asm/asm-extable.h | 15 ++
>  arch/arm64/include/asm/assembler.h   |  4 ++
>  arch/arm64/include/asm/mte.h |  5 ++
>  arch/arm64/include/asm/page.h| 10 
>  arch/arm64/lib/Makefile  |  2 +
>  arch/arm64/lib/copy_mc_page.S| 78 
>  arch/arm64/lib/mte.S | 27 ++
>  arch/arm64/mm/copypage.c | 66 ---
>  arch/arm64/mm/extable.c  |  7 +--
>  include/linux/highmem.h  |  8 +++
>  10 files changed, 213 insertions(+), 9 deletions(-)
>  create mode 100644 arch/arm64/lib/copy_mc_page.S
>
> diff --git a/arch/arm64/include/asm/asm-extable.h 
> b/arch/arm64/include/asm/asm-extable.h
> index 980d1dd8e1a3..819044fefbe7 100644
> --- a/arch/arm64/include/asm/asm-extable.h
> +++ b/arch/arm64/include/asm/asm-extable.h
> @@ -10,6 +10,7 @@
>  #define EX_TYPE_UACCESS_ERR_ZERO   2
>  #define EX_TYPE_KACCESS_ERR_ZERO   3
>  #define EX_TYPE_LOAD_UNALIGNED_ZEROPAD 4
> +#define EX_TYPE_COPY_MC_PAGE_ERR_ZERO  5
>
>  /* Data fields for EX_TYPE_UACCESS_ERR_ZERO */
>  #define EX_DATA_REG_ERR_SHIFT  0
> @@ -51,6 +52,16 @@
>  #define _ASM_EXTABLE_UACCESS(insn, fixup)  \
> _ASM_EXTABLE_UACCESS_ERR_ZERO(insn, fixup, wzr, wzr)
>
> +#define _ASM_EXTABLE_COPY_MC_PAGE_ERR_ZERO(insn, fixup, err, zero) \
> +   __ASM_EXTABLE_RAW(insn, fixup,  \
> + EX_TYPE_COPY_MC_PAGE_ERR_ZERO,\
> + ( \
> +   EX_DATA_REG(ERR, err) | \
> +   EX_DATA_REG(ZERO, zero) \
> + ))
> +
> +#define _ASM_EXTABLE_COPY_MC_PAGE(insn, fixup) \
> +   _ASM_EXTABLE_COPY_MC_PAGE_ERR_ZERO(insn, fixup, wzr, wzr)
>  /*
>   * Create an exception table entry for uaccess `insn`, which will branch to 
> `fixup`
>   * when an unhandled fault is taken.
> @@ -59,6 +70,10 @@
> _ASM_EXTABLE_UACCESS(\insn, \fixup)
> .endm
>
> +   .macro  _asm_extable_copy_mc_page, insn, fixup
> +   _ASM_EXTABLE_COPY_MC_PAGE(\insn, \fixup)
> +   .endm
> +
>  /*
>   * Create an exception table entry for `insn` if `fixup` is provided. 
> Otherwise
>   * do nothing.
> diff --git a/arch/arm64/include/asm/assembler.h 
> b/arch/arm64/include/asm/assembler.h
> index 513787e43329..e1d8ce155878 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -154,6 +154,10 @@ lr .reqx30 // link register
>  #define CPU_LE(code...) code
>  #endif
>
> +#define CPY_MC(l, x...)\
> +:   x; \
> +   _asm_extable_copy_mc_pageb, l
> +
>  /*
>   * Define a macro that constructs a 64-bit value by concatenating two
>   * 32-bit registers. Note that on big endian systems the order of the
> diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> index 91fbd5c8a391..9cdded082dd4 100644
> --- a/arch/arm64/include/asm/mte.h
> +++ b/arch/arm64/include/asm/mte.h
> @@ -92,6 +92,7 @@ static inline bool try_page_mte_tagging(struct page *page)
>  void mte_zero_clear_page_tags(void *addr);
>  void mte_sync_tags(pte_t pte, unsigned int nr_pages);
>  void mte_copy_page_tags(void *kto, const void *kfrom);
> +int mte_copy_mc_page_tags(void *kto, const void *kfrom);
>  void mte_thread_init_user(void);
>  void mte_thread_switch(struct task_struct *next);
>  void mte_cpu_setup(void);
> @@ -128,6 +129,10 @@ static inline void mte_sync_tags(pte_t pte, unsigned int 
> nr_pages)
>  static inline void mte_copy_page_tags(void *kto, const void *kfrom)
>  {
>  }
> +static i

Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2024-01-29 Thread Luis Chamberlain
On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
> Declaring rodata_enabled and mark_rodata_ro() at all time
> helps removing related #ifdefery in C files.
> 
> Signed-off-by: Christophe Leroy 

Very nice cleanup, thanks! Applied and pushed.

  Luis


RE: [PATCH linux-next 1/3] x86, crash: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope

2024-01-29 Thread Michael Kelley
From: Baoquan He  Sent: Monday, January 29, 2024 5:51 AM
> 
> Michael pointed out that the #ifdef CONFIG_CRASH_DUMP is nested inside
> arch/x86/xen/enlighten_hvm.c.

Did some words get left out in the above sentence?  It mentions the Xen
case, but not the Hyper-V case.  I'm not sure what you intended.

> 
> Although the nesting works well too since CONFIG_CRASH_DUMP has
> dependency on CONFIG_KEXEC_CORE, it may cause confuse because there

s/confuse/confusion/

> are places where it's not nested, and people may think it need be nested

s/need be/needs to be/

> even though it doesn't have to.
> 
> Fix that by moving  CONFIG_CRASH_DUMP ifdeffery of codes out of
> CONFIG_KEXEC_CODE ifdeffery scope.
> 
> And also fix a building error Nathan reported as below by replacing
> CONFIG_KEXEC_CORE ifdef with CONFIG_VMCORE_INFO ifdef.
> 
> 
> $ curl -LSso .config 
> https://git.alpinelinux.org/aports/plain/community/linux-edge/config-edge.x86_64
>  
> $ make -skj"$(nproc)" ARCH=x86_64 CROSS_COMPILE=x86_64-linux-
> olddefconfig all
> ...
> x86_64-linux-ld: arch/x86/xen/mmu_pv.o: in function
> `paddr_vmcoreinfo_note':
> mmu_pv.c:(.text+0x3af3): undefined reference to `vmcoreinfo_note'
> 
> 
> Link: 
> https://lore.kernel.org/all/sn6pr02mb4157931105fa68d72e3d3db8d4...@sn6pr02mb4157.namprd02.prod.outlook.com/T/#u
> Link: 
> https://lore.kernel.org/all/20240126045551.GA126645@dev-arch.thelio-3990X/T/#u
> Signed-off-by: Baoquan He 

Modulo the commit message nits, LGTM.

Reviewed-by: Michael Kelley 

> ---
>  arch/x86/kernel/cpu/mshyperv.c | 10 ++
>  arch/x86/kernel/reboot.c   |  2 +-
>  arch/x86/xen/enlighten_hvm.c   |  4 ++--
>  arch/x86/xen/mmu_pv.c  |  2 +-
>  4 files changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mshyperv.c
> b/arch/x86/kernel/cpu/mshyperv.c
> index f8163a59026b..2e8cd5a4ae85 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -209,6 +209,7 @@ static void hv_machine_shutdown(void)
>   if (kexec_in_progress)
>   hyperv_cleanup();
>  }
> +#endif /* CONFIG_KEXEC_CORE */
> 
>  #ifdef CONFIG_CRASH_DUMP
>  static void hv_machine_crash_shutdown(struct pt_regs *regs)
> @@ -222,8 +223,7 @@ static void hv_machine_crash_shutdown(struct
> pt_regs *regs)
>   /* Disable the hypercall page when there is only 1 active CPU. */
>   hyperv_cleanup();
>  }
> -#endif
> -#endif /* CONFIG_KEXEC_CORE */
> +#endif /* CONFIG_CRASH_DUMP */
>  #endif /* CONFIG_HYPERV */
> 
>  static uint32_t  __init ms_hyperv_platform(void)
> @@ -497,9 +497,11 @@ static void __init ms_hyperv_init_platform(void)
>   no_timer_check = 1;
>  #endif
> 
> -#if IS_ENABLED(CONFIG_HYPERV) && defined(CONFIG_KEXEC_CORE)
> +#if IS_ENABLED(CONFIG_HYPERV)
> +#if defined(CONFIG_KEXEC_CORE)
>   machine_ops.shutdown = hv_machine_shutdown;
> -#ifdef CONFIG_CRASH_DUMP
> +#endif
> +#if defined(CONFIG_CRASH_DUMP)
>   machine_ops.crash_shutdown = hv_machine_crash_shutdown;
>  #endif
>  #endif
> diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
> index 1287b0d5962f..f3130f762784 100644
> --- a/arch/x86/kernel/reboot.c
> +++ b/arch/x86/kernel/reboot.c
> @@ -826,7 +826,7 @@ void machine_halt(void)
>   machine_ops.halt();
>  }
> 
> -#ifdef CONFIG_KEXEC_CORE
> +#ifdef CONFIG_CRASH_DUMP
>  void machine_crash_shutdown(struct pt_regs *regs)
>  {
>   machine_ops.crash_shutdown(regs);
> diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
> index 09e3db7ff990..0b367c1e086d 100644
> --- a/arch/x86/xen/enlighten_hvm.c
> +++ b/arch/x86/xen/enlighten_hvm.c
> @@ -148,6 +148,7 @@ static void xen_hvm_shutdown(void)
>   if (kexec_in_progress)
>   xen_reboot(SHUTDOWN_soft_reset);
>  }
> +#endif
> 
>  #ifdef CONFIG_CRASH_DUMP
>  static void xen_hvm_crash_shutdown(struct pt_regs *regs)
> @@ -156,7 +157,6 @@ static void xen_hvm_crash_shutdown(struct pt_regs
> *regs)
>   xen_reboot(SHUTDOWN_soft_reset);
>  }
>  #endif
> -#endif
> 
>  static int xen_cpu_up_prepare_hvm(unsigned int cpu)
>  {
> @@ -238,10 +238,10 @@ static void __init xen_hvm_guest_init(void)
> 
>  #ifdef CONFIG_KEXEC_CORE
>   machine_ops.shutdown = xen_hvm_shutdown;
> +#endif
>  #ifdef CONFIG_CRASH_DUMP
>   machine_ops.crash_shutdown = xen_hvm_crash_shutdown;
>  #endif
> -#endif
>  }
> 
>  static __init int xen_parse_nopv(char *arg)
> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
> index 218773cfb009..e21974f2cf2d 100644
> --- a/arch/x86/xen/mmu_pv.c
> +++ b/arch/x86/xen/mmu_pv.c
> @@ -2520,7 +2520,7 @@ int xen_remap_pfn(struct vm_area_struct *vma,
> unsigned long addr,
>  }
>  EXPORT_SYMBOL_GPL(xen_remap_pfn);
> 
> -#ifdef CONFIG_KEXEC_CORE
> +#ifdef CONFIG_VMCORE_INFO
>  phys_addr_t paddr_vmcoreinfo_note(void)
>  {
>   if (xen_pv_domain())
> --
> 2.41.0



Re: [PATCH v10 3/6] arm64: add uaccess to machine check safe

2024-01-29 Thread Mark Rutland
On Mon, Jan 29, 2024 at 09:46:49PM +0800, Tong Tiangen wrote:
> If a user process's memory access fails due to a hardware memory error, only
> the relevant process is affected, so it is more reasonable to kill the user
> process and isolate the corrupt page than to panic the kernel.
> 
> Signed-off-by: Tong Tiangen 
> ---
>  arch/arm64/lib/copy_from_user.S | 10 +-
>  arch/arm64/lib/copy_to_user.S   | 10 +-
>  arch/arm64/mm/extable.c |  8 
>  3 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S
> index 34e317907524..1bf676e9201d 100644
> --- a/arch/arm64/lib/copy_from_user.S
> +++ b/arch/arm64/lib/copy_from_user.S
> @@ -25,7 +25,7 @@
>   .endm
>  
>   .macro strb1 reg, ptr, val
> - strb \reg, [\ptr], \val
> + USER(9998f, strb \reg, [\ptr], \val)
>   .endm

This is a store to *kernel* memory, not user memory. It should not be marked
with USER().

I understand that you *might* want to handle memory errors on these stores, but
the commit message doesn't describe that and the associated trade-off. For
example, consider that when a copy_from_user fails we'll try to zero the
remaining buffer via memset(); so if a STR* instruction in copy_from_user
faulted, upon handling the fault we'll immediately try to fix that up with some
more stores which will also fault, but won't get fixed up, leading to a panic()
anyway...

Further, this change will also silently fixup unexpected kernel faults if we
pass bad kernel pointers to copy_{to,from}_user, which will hide real bugs.

So NAK to this change as-is; likewise for the addition of USER() to other ldr*
macros in copy_from_user.S and the addition of USER() to str* macros in
copy_to_user.S.

If we want to handle memory errors on some kaccesses, we need a new EX_TYPE_*
separate from the usual EX_TYPE_KACCESS_ERR_ZERO that means "handle memory
errors, but treat other faults as fatal". That should come with a rationale and
explanation of why it's actually useful.
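
As a rough sketch of that suggestion (the EX_TYPE_KACCESS_MEM_ERR name and its
value are purely illustrative, not something this series defines), the
memory-error path could accept the new type while the normal fixup path keeps
treating such faults as fatal:

  /* Hypothetical extable type: fixed up only for hardware memory errors. */
  #define EX_TYPE_KACCESS_MEM_ERR    6

  bool fixup_exception_mc(struct pt_regs *regs)
  {
      const struct exception_table_entry *ex;

      ex = search_exception_tables(instruction_pointer(regs));
      if (!ex)
          return false;

      switch (ex->type) {
      case EX_TYPE_UACCESS_ERR_ZERO:    /* uaccess: always fine to fix up */
      case EX_TYPE_KACCESS_MEM_ERR:     /* kaccess: only on memory error */
          return ex_handler_uaccess_err_zero(ex, regs);
      }

      /* fixup_exception() would deliberately not handle the new type, so
       * ordinary (non-memory-error) faults on those accesses stay fatal. */
      return false;
  }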

[...]

> diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
> index 478e639f8680..28ec35e3d210 100644
> --- a/arch/arm64/mm/extable.c
> +++ b/arch/arm64/mm/extable.c
> @@ -85,10 +85,10 @@ bool fixup_exception_mc(struct pt_regs *regs)
>   if (!ex)
>   return false;
>  
> - /*
> -  * This is not complete, More Machine check safe extable type can
> -  * be processed here.
> -  */
> + switch (ex->type) {
> + case EX_TYPE_UACCESS_ERR_ZERO:
> + return ex_handler_uaccess_err_zero(ex, regs);
> + }

Please fold this part into the prior patch, and start off with *only* handling
errors on accesses already marked with EX_TYPE_UACCESS_ERR_ZERO. I think that
change would be relatively uncontroversial, and it would be much easier to
build atop that.

Thanks,
Mark.


Re: [PATCH v10 2/6] arm64: add support for machine check error safe

2024-01-29 Thread Mark Rutland
On Mon, Jan 29, 2024 at 09:46:48PM +0800, Tong Tiangen wrote:
> For the arm64 kernel, when it processes hardware memory errors signalled as
> synchronous notifications (do_sea()), if the error is consumed within the
> kernel, the current handling is to panic. However, that is not optimal.
> 
> Take uaccess for example: if the uaccess operation fails due to a memory
> error, only the user process is affected. Killing the user process and
> isolating the corrupt page is a better choice.
> 
> This patch only enables the machine check error framework and adds an
> exception fixup before the kernel panic in do_sea().
> 
> Signed-off-by: Tong Tiangen 
> ---
>  arch/arm64/Kconfig   |  1 +
>  arch/arm64/include/asm/extable.h |  1 +
>  arch/arm64/mm/extable.c  | 16 
>  arch/arm64/mm/fault.c| 29 -
>  4 files changed, 46 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index aa7c1d435139..2cc34b5e7abb 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -20,6 +20,7 @@ config ARM64
>   select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
>   select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
>   select ARCH_HAS_CACHE_LINE_SIZE
> + select ARCH_HAS_COPY_MC if ACPI_APEI_GHES
>   select ARCH_HAS_CURRENT_STACK_POINTER
>   select ARCH_HAS_DEBUG_VIRTUAL
>   select ARCH_HAS_DEBUG_VM_PGTABLE
> diff --git a/arch/arm64/include/asm/extable.h 
> b/arch/arm64/include/asm/extable.h
> index 72b0e71cc3de..f80ebd0addfd 100644
> --- a/arch/arm64/include/asm/extable.h
> +++ b/arch/arm64/include/asm/extable.h
> @@ -46,4 +46,5 @@ bool ex_handler_bpf(const struct exception_table_entry *ex,
>  #endif /* !CONFIG_BPF_JIT */
>  
>  bool fixup_exception(struct pt_regs *regs);
> +bool fixup_exception_mc(struct pt_regs *regs);
>  #endif
> diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
> index 228d681a8715..478e639f8680 100644
> --- a/arch/arm64/mm/extable.c
> +++ b/arch/arm64/mm/extable.c
> @@ -76,3 +76,19 @@ bool fixup_exception(struct pt_regs *regs)
>  
>   BUG();
>  }
> +
> +bool fixup_exception_mc(struct pt_regs *regs)

Can we please replace 'mc' with something like 'memory_error' ?

There's no "machine check" on arm64, and 'mc' is opaque regardless.

> +{
> + const struct exception_table_entry *ex;
> +
> + ex = search_exception_tables(instruction_pointer(regs));
> + if (!ex)
> + return false;
> +
> + /*
> +  * This is not complete, More Machine check safe extable type can
> +  * be processed here.
> +  */
> +
> + return false;
> +}

As with my comment on the subsequent patch, I'd much prefer that we handle
EX_TYPE_UACCESS_ERR_ZERO from the outset.



> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 55f6455a8284..312932dc100b 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -730,6 +730,31 @@ static int do_bad(unsigned long far, unsigned long esr, 
> struct pt_regs *regs)
>   return 1; /* "fault" */
>  }
>  
> +static bool arm64_do_kernel_sea(unsigned long addr, unsigned int esr,
> +  struct pt_regs *regs, int sig, int code)
> +{
> + if (!IS_ENABLED(CONFIG_ARCH_HAS_COPY_MC))
> + return false;
> +
> + if (user_mode(regs))
> + return false;

This function is called "arm64_do_kernel_sea"; surely the caller should *never*
call this for a SEA taken from user mode?

> +
> + if (apei_claim_sea(regs) < 0)
> + return false;
> +
> + if (!fixup_exception_mc(regs))
> + return false;
> +
> + if (current->flags & PF_KTHREAD)
> + return true;

I think this needs a comment; why do we allow kthreads to go on, yet kill user
threads? What about helper threads (e.g. for io_uring)?

> +
> + set_thread_esr(0, esr);

Why do we set the ESR to 0?

Mark.

> + arm64_force_sig_fault(sig, code, addr,
> + "Uncorrected memory error on access to user memory\n");
> +
> + return true;
> +}
> +
>  static int do_sea(unsigned long far, unsigned long esr, struct pt_regs *regs)
>  {
>   const struct fault_info *inf;
> @@ -755,7 +780,9 @@ static int do_sea(unsigned long far, unsigned long esr, 
> struct pt_regs *regs)
>*/
>   siaddr  = untagged_addr(far);
>   }
> - arm64_notify_die(inf->name, regs, inf->sig, inf->code, siaddr, esr);
> +
> + if (!arm64_do_kernel_sea(siaddr, esr, regs, inf->sig, inf->code))
> + arm64_notify_die(inf->name, regs, inf->sig, inf->code, siaddr, 
> esr);
>  
>   return 0;
>  }
> -- 
> 2.25.1
> 


[PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-29 Thread David Hildenbrand
Similar to how we optimized fork(), let's implement PTE batching when
consecutive (present) PTEs map consecutive pages of the same large
folio.

Most infrastructure we need for batching (mmu gather, rmap) is already
there. We only have to add get_and_clear_full_ptes() and
clear_full_ptes(). Similarly, extend zap_install_uffd_wp_if_needed() to
process a PTE range.

We won't bother sanity-checking the mapcount of all subpages, but only
check the mapcount of the first subpage we process.

To keep small folios as fast as possible force inlining of a specialized
variant using __always_inline with nr=1.

Signed-off-by: David Hildenbrand 
---
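
As a condensed illustration of how the new helpers are meant to be driven (the
batching step that computes "nr", plus delay-rmap and uffd-wp handling, is
assumed/omitted here -- see the mm/memory.c hunks below for the real code),
zapping nr consecutive present PTEs of one large folio boils down to:

  /* Clear nr PTEs at once, merging dirty/accessed bits into one pte_t. */
  pte_t pte = get_and_clear_full_ptes(mm, addr, ptep, nr, tlb->fullmm);

  if (!folio_test_anon(folio)) {
      if (pte_dirty(pte))                /* dirty bit of the whole batch */
          folio_mark_dirty(folio);
      if (pte_young(pte) && likely(vma_has_recency(vma)))
          folio_mark_accessed(folio);
  }
  rss[mm_counter(folio)] -= nr;          /* one folio, nr page references */
  tlb_remove_tlb_entries(tlb, ptep, nr, addr);   /* helper from patch 8/9 */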
 include/linux/pgtable.h | 66 +
 mm/memory.c | 92 +
 2 files changed, 132 insertions(+), 26 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index aab227e12493..f0feae7f89fb 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -580,6 +580,72 @@ static inline pte_t ptep_get_and_clear_full(struct 
mm_struct *mm,
 }
 #endif
 
+#ifndef get_and_clear_full_ptes
+/**
+ * get_and_clear_full_ptes - Clear PTEs that map consecutive pages of the same
+ *  folio, collecting dirty/accessed bits.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into
+ * returned PTE.
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ */
+static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
+   unsigned long addr, pte_t *ptep, unsigned int nr, int full)
+{
+   pte_t pte, tmp_pte;
+
+   pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+   while (--nr) {
+   ptep++;
+   addr += PAGE_SIZE;
+   tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+   if (pte_dirty(tmp_pte))
+   pte = pte_mkdirty(pte);
+   if (pte_young(tmp_pte))
+   pte = pte_mkyoung(pte);
+   }
+   return pte;
+}
+#endif
+
+#ifndef clear_full_ptes
+/**
+ * clear_full_ptes - Clear PTEs that map consecutive pages of the same folio.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ */
+static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, unsigned int nr, int full)
+{
+   for (;;) {
+   ptep_get_and_clear_full(mm, addr, ptep, full);
+   if (--nr == 0)
+   break;
+   ptep++;
+   addr += PAGE_SIZE;
+   }
+}
+#endif
 
 /*
  * If two threads concurrently fault at the same page, the thread that
diff --git a/mm/memory.c b/mm/memory.c
index a2190d7cfa74..38a010c4d04d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1515,7 +1515,7 @@ static inline bool zap_drop_file_uffd_wp(struct 
zap_details *details)
  */
 static inline void
 zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
- unsigned long addr, pte_t *pte,
+ unsigned long addr, pte_t *pte, int nr,
  struct zap_details *details, pte_t pteval)
 {
/* Zap on anonymous always means dropping everything */
@@ -1525,20 +1525,27 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct 
*vma,
if (zap_drop_file_uffd_wp(details))
return;
 
-   pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+   for (;;) {
+   /* the PFN in the PTE is irrelevant. */
+   pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+   if (--nr == 0)
+   break;
+   pte++;
+   addr += PAGE_SIZE;
+   }
 }
 
-static inline void zap_present_folio_pte(struct mmu_gather *tlb,
+static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
struct vm_area_struct *vma, struct folio *folio,
-   struct page *page, pte_t *pte, pte_t ptent, unsigned long addr,

[PATCH v1 8/9] mm/mmu_gather: add tlb_remove_tlb_entries()

2024-01-29 Thread David Hildenbrand
Let's add a helper that lets us batch-process multiple consecutive PTEs.

Note that the loop will get optimized out on all architectures except on
powerpc. We have to add an early define of __tlb_remove_tlb_entry() on
ppc to make the compiler happy (and avoid making tlb_remove_tlb_entries() a
macro).

Signed-off-by: David Hildenbrand 
---
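
A minimal usage sketch (the caller that already knows "nr" consecutive PTEs map
one folio is assumed; this patch only adds the helper):

  /* before: one call per PTE */
  tlb_remove_tlb_entry(tlb, pte, addr);

  /* after: one call covering nr consecutive PTEs of the same folio */
  tlb_remove_tlb_entries(tlb, pte, nr, addr);

As the log says, on everything except powerpc __tlb_remove_tlb_entry() is a
no-op, so the internal loop disappears at compile time.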
 arch/powerpc/include/asm/tlb.h |  2 ++
 include/asm-generic/tlb.h  | 20 
 2 files changed, 22 insertions(+)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index b3de6102a907..1ca7d4c4b90d 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -19,6 +19,8 @@
 
 #include 
 
+static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
+ unsigned long address);
 #define __tlb_remove_tlb_entry __tlb_remove_tlb_entry
 
 #define tlb_flush tlb_flush
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 428c3f93addc..bd00dd238b79 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -616,6 +616,26 @@ static inline void tlb_flush_p4d_range(struct mmu_gather 
*tlb,
__tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)
 
+/**
+ * tlb_remove_tlb_entries - remember unmapping of multiple consecutive ptes for
+ * later tlb invalidation.
+ *
+ * Similar to tlb_remove_tlb_entry(), but remember unmapping of multiple
+ * consecutive ptes instead of only a single one.
+ */
+static inline void tlb_remove_tlb_entries(struct mmu_gather *tlb,
+   pte_t *ptep, unsigned int nr, unsigned long address)
+{
+   tlb_flush_pte_range(tlb, address, PAGE_SIZE * nr);
+   for (;;) {
+   __tlb_remove_tlb_entry(tlb, ptep, address);
+   if (--nr == 0)
+   break;
+   ptep++;
+   address += PAGE_SIZE;
+   }
+}
+
 #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)   \
do {\
unsigned long _sz = huge_page_size(h);  \
-- 
2.43.0



[PATCH v1 7/9] mm/mmu_gather: add __tlb_remove_folio_pages()

2024-01-29 Thread David Hildenbrand
Add __tlb_remove_folio_pages(), which will remove multiple consecutive
pages that belong to the same large folio, instead of only a single
page. We'll be using this function when optimizing unmapping/zapping of
large folios that are mapped by PTEs.

We're using the remaining spare bit in an encoded_page to indicate that
the next encoded page in an array actually contains the shifted "nr_pages".
Teach swap/freeing code about putting multiple folio references, and
delayed rmap handling to remove page ranges of a folio.

This extension allows for still gathering almost as many small folios
as we used to (-1, because we have to prepare for a possibly bigger next
entry), but still allows for gathering consecutive pages that belong to the
same large folio.

Note that we don't pass the folio pointer, because it is not required for
now. Further, we don't support page_size != PAGE_SIZE, it won't be
required for simple PTE batching.

We have to provide a separate s390 implementation, but it's fairly
straightforward.

Another, more invasive and likely more expensive, approach would be to
use folio+range or a PFN range instead of page+nr_pages. But, we should
do that consistently for the whole mmu_gather. For now, let's keep it
simple and add "nr_pages" only.

Signed-off-by: David Hildenbrand 
---
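
The array-encoding convention described above, sketched with the helpers this
patch introduces (this mirrors the s390 hunk below; the decode helper name at
the end is assumed, since that part of the diff is not shown here):

  /* Two consecutive slots describe one page range of a large folio: the
   * first slot is the page with ENCODED_PAGE_BIT_NR_PAGES set, the second
   * slot carries the (shifted) nr_pages value instead of a page pointer. */
  struct encoded_page *slots[2] = {
      encode_page(page, ENCODED_PAGE_BIT_NR_PAGES),
      encode_nr_pages(nr_pages),
  };

  /* Consumers check the flag before interpreting the next slot. */
  if (encoded_page_flags(slots[0]) & ENCODED_PAGE_BIT_NR_PAGES)
      nr_pages = encoded_nr_pages(slots[1]);   /* decode helper name assumed */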
---
 arch/s390/include/asm/tlb.h | 17 +++
 include/asm-generic/tlb.h   |  8 +
 include/linux/mm_types.h| 20 
 mm/mmu_gather.c | 61 +++--
 mm/swap.c   | 12 ++--
 mm/swap_state.c | 12 ++--
 6 files changed, 116 insertions(+), 14 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 48df896d5b79..abfd2bf29e9e 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -26,6 +26,8 @@ void __tlb_remove_table(void *_table);
 static inline void tlb_flush(struct mmu_gather *tlb);
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
struct page *page, bool delay_rmap, int page_size);
+static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
+   struct page *page, unsigned int nr_pages, bool delay_rmap);
 
 #define tlb_flush tlb_flush
 #define pte_free_tlb pte_free_tlb
@@ -52,6 +54,21 @@ static inline bool __tlb_remove_page_size(struct mmu_gather 
*tlb,
return false;
 }
 
+static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
+   struct page *page, unsigned int nr_pages, bool delay_rmap)
+{
+   struct encoded_page *encoded_pages[] = {
+   encode_page(page, ENCODED_PAGE_BIT_NR_PAGES),
+   encode_nr_pages(nr_pages),
+   };
+
+   VM_WARN_ON_ONCE(delay_rmap);
+   VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
+
+   free_pages_and_swap_cache(encoded_pages, ARRAY_SIZE(encoded_pages));
+   return false;
+}
+
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
__tlb_flush_mm_lazy(tlb->mm);
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 2eb7b0d4f5d2..428c3f93addc 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -69,6 +69,7 @@
  *
  *  - tlb_remove_page() / __tlb_remove_page()
  *  - tlb_remove_page_size() / __tlb_remove_page_size()
+ *  - __tlb_remove_folio_pages()
  *
  *__tlb_remove_page_size() is the basic primitive that queues a page for
  *freeing. __tlb_remove_page() assumes PAGE_SIZE. Both will return a
@@ -78,6 +79,11 @@
  *tlb_remove_page() and tlb_remove_page_size() imply the call to
  *tlb_flush_mmu() when required and has no return value.
  *
+ *__tlb_remove_folio_pages() is similar to __tlb_remove_page(), however,
+ *instead of removing a single page, remove the given number of consecutive
+ *pages that are all part of the same (large) folio: just like calling
+ *__tlb_remove_page() on each page individually.
+ *
  *  - tlb_change_page_size()
  *
  *call before __tlb_remove_page*() to set the current page-size; implies a
@@ -262,6 +268,8 @@ struct mmu_gather_batch {
 
 extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
bool delay_rmap, int page_size);
+bool __tlb_remove_folio_pages(struct mmu_gather *tlb, struct page *page,
+   unsigned int nr_pages, bool delay_rmap);
 
 #ifdef CONFIG_SMP
 /*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1b89eec0d6df..198662b7a39a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -226,6 +226,15 @@ struct encoded_page;
 /* Perform rmap removal after we have flushed the TLB. */
 #define ENCODED_PAGE_BIT_DELAY_RMAP1ul
 
+/*
+ * The next item in an encoded_page array is the "nr_pages" argument, 
specifying
+ * the number of consecutive pages starting from this page, that all belong to
+ * the same folio. For example, "nr_pages" corresponds to the number of folio
+ * references that mu

[PATCH v1 6/9] mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP

2024-01-29 Thread David Hildenbrand
Nowadays, encoded pages are only used in mmu_gather handling. Let's
update the documentation, and define ENCODED_PAGE_BIT_DELAY_RMAP. While at
it, rename ENCODE_PAGE_BITS to ENCODED_PAGE_BITS.

If encoded page pointers would ever be used in other context again, we'd
likely want to change the defines to reflect their context (e.g.,
ENCODED_PAGE_FLAG_MMU_GATHER_DELAY_RMAP). For now, let's keep it simple.

This is a preparation for using the remaining spare bit to indicate that
the next item in an array of encoded pages is a "nr_pages" argument and
not an encoded page.

Signed-off-by: David Hildenbrand 
---
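
A compact sketch of the round trip after the rename (both halves simply restate
the mmu_gather hunks below):

  /* Encode the delay-rmap request into the pointer's low bits ... */
  struct encoded_page *enc =
      encode_page(page, delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0);

  /* ... and test it again when the batch is flushed. */
  if (encoded_page_flags(enc) & ENCODED_PAGE_BIT_DELAY_RMAP)
      folio_remove_rmap_pte(page_folio(encoded_page_ptr(enc)),
                            encoded_page_ptr(enc), vma);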
 include/linux/mm_types.h | 17 +++--
 mm/mmu_gather.c  |  5 +++--
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8b611e13153e..1b89eec0d6df 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -210,8 +210,8 @@ struct page {
  *
  * An 'encoded_page' pointer is a pointer to a regular 'struct page', but
  * with the low bits of the pointer indicating extra context-dependent
- * information. Not super-common, but happens in mmu_gather and mlock
- * handling, and this acts as a type system check on that use.
+ * information. Only used in mmu_gather handling, and this acts as a type
+ * system check on that use.
  *
  * We only really have two guaranteed bits in general, although you could
  * play with 'struct page' alignment (see CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
@@ -220,21 +220,26 @@ struct page {
  * Use the supplied helper functions to endcode/decode the pointer and bits.
  */
 struct encoded_page;
-#define ENCODE_PAGE_BITS 3ul
+
+#define ENCODED_PAGE_BITS  3ul
+
+/* Perform rmap removal after we have flushed the TLB. */
+#define ENCODED_PAGE_BIT_DELAY_RMAP1ul
+
 static __always_inline struct encoded_page *encode_page(struct page *page, 
unsigned long flags)
 {
-   BUILD_BUG_ON(flags > ENCODE_PAGE_BITS);
+   BUILD_BUG_ON(flags > ENCODED_PAGE_BITS);
return (struct encoded_page *)(flags | (unsigned long)page);
 }
 
 static inline unsigned long encoded_page_flags(struct encoded_page *page)
 {
-   return ENCODE_PAGE_BITS & (unsigned long)page;
+   return ENCODED_PAGE_BITS & (unsigned long)page;
 }
 
 static inline struct page *encoded_page_ptr(struct encoded_page *page)
 {
-   return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
+   return (struct page *)(~ENCODED_PAGE_BITS & (unsigned long)page);
 }
 
 /*
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index ac733d81b112..6540c99c6758 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -53,7 +53,7 @@ static void tlb_flush_rmap_batch(struct mmu_gather_batch 
*batch, struct vm_area_
for (int i = 0; i < batch->nr; i++) {
struct encoded_page *enc = batch->encoded_pages[i];
 
-   if (encoded_page_flags(enc)) {
+   if (encoded_page_flags(enc) & ENCODED_PAGE_BIT_DELAY_RMAP) {
struct page *page = encoded_page_ptr(enc);
folio_remove_rmap_pte(page_folio(page), page, vma);
}
@@ -119,6 +119,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
 bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
bool delay_rmap, int page_size)
 {
+   int flags = delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0;
struct mmu_gather_batch *batch;
 
VM_BUG_ON(!tlb->end);
@@ -132,7 +133,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct 
page *page,
 * Add the page and check if we are full. If so
 * force a flush.
 */
-   batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap);
+   batch->encoded_pages[batch->nr++] = encode_page(page, flags);
if (batch->nr == batch->max) {
if (!tlb_next_batch(tlb))
return true;
-- 
2.43.0



[PATCH v1 5/9] mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size()

2024-01-29 Thread David Hildenbrand
We have two bits available in the encoded page pointer to store
additional information. Currently, we use one bit to request delay of the
rmap removal until after a TLB flush.

We want to make use of the remaining bit internally for batching of
multiple pages of the same folio, specifying that the next encoded page
pointer in an array is actually "nr_pages". So pass page + delay_rmap flag
instead of an encoded page, to handle the encoding internally.

Signed-off-by: David Hildenbrand 
---
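
In call-shape terms (both forms are visible in the asm-generic/tlb.h hunk
below), the interface change is simply:

  /* before: callers built the encoded page themselves */
  __tlb_remove_page_size(tlb, encode_page(page, flags), PAGE_SIZE);

  /* after: callers pass the bool; mmu_gather.c does the encoding */
  __tlb_remove_page_size(tlb, page, delay_rmap, PAGE_SIZE);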
 arch/s390/include/asm/tlb.h | 13 ++---
 include/asm-generic/tlb.h   | 12 ++--
 mm/mmu_gather.c |  7 ---
 3 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index d1455a601adc..48df896d5b79 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -25,8 +25,7 @@
 void __tlb_remove_table(void *_table);
 static inline void tlb_flush(struct mmu_gather *tlb);
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
- struct encoded_page *page,
- int page_size);
+   struct page *page, bool delay_rmap, int page_size);
 
 #define tlb_flush tlb_flush
 #define pte_free_tlb pte_free_tlb
@@ -42,14 +41,14 @@ static inline bool __tlb_remove_page_size(struct mmu_gather 
*tlb,
  * tlb_ptep_clear_flush. In both flush modes the tlb for a page cache page
  * has already been freed, so just do free_page_and_swap_cache.
  *
- * s390 doesn't delay rmap removal, so there is nothing encoded in
- * the page pointer.
+ * s390 doesn't delay rmap removal.
  */
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
- struct encoded_page *page,
- int page_size)
+   struct page *page, bool delay_rmap, int page_size)
 {
-   free_page_and_swap_cache(encoded_page_ptr(page));
+   VM_WARN_ON_ONCE(delay_rmap);
+
+   free_page_and_swap_cache(page);
return false;
 }
 
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 129a3a759976..2eb7b0d4f5d2 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -260,9 +260,8 @@ struct mmu_gather_batch {
  */
 #define MAX_GATHER_BATCH_COUNT (1UL/MAX_GATHER_BATCH)
 
-extern bool __tlb_remove_page_size(struct mmu_gather *tlb,
-  struct encoded_page *page,
-  int page_size);
+extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
+   bool delay_rmap, int page_size);
 
 #ifdef CONFIG_SMP
 /*
@@ -462,13 +461,14 @@ static inline void tlb_flush_mmu_tlbonly(struct 
mmu_gather *tlb)
 static inline void tlb_remove_page_size(struct mmu_gather *tlb,
struct page *page, int page_size)
 {
-   if (__tlb_remove_page_size(tlb, encode_page(page, 0), page_size))
+   if (__tlb_remove_page_size(tlb, page, false, page_size))
tlb_flush_mmu(tlb);
 }
 
-static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb, struct 
page *page, unsigned int flags)
+static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb,
+   struct page *page, bool delay_rmap)
 {
-   return __tlb_remove_page_size(tlb, encode_page(page, flags), PAGE_SIZE);
+   return __tlb_remove_page_size(tlb, page, delay_rmap, PAGE_SIZE);
 }
 
 /* tlb_remove_page
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 604ddf08affe..ac733d81b112 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -116,7 +116,8 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
tlb->local.next = NULL;
 }
 
-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct encoded_page *page, 
int page_size)
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
+   bool delay_rmap, int page_size)
 {
struct mmu_gather_batch *batch;
 
@@ -131,13 +132,13 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, 
struct encoded_page *page, i
 * Add the page and check if we are full. If so
 * force a flush.
 */
-   batch->encoded_pages[batch->nr++] = page;
+   batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap);
if (batch->nr == batch->max) {
if (!tlb_next_batch(tlb))
return true;
batch = tlb->active;
}
-   VM_BUG_ON_PAGE(batch->nr > batch->max, encoded_page_ptr(page));
+   VM_BUG_ON_PAGE(batch->nr > batch->max, page);
 
return false;
 }
-- 
2.43.0



[PATCH v1 4/9] mm/memory: factor out zapping folio pte into zap_present_folio_pte()

2024-01-29 Thread David Hildenbrand
Let's prepare for further changes by factoring it out into a separate
function.

Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 53 -
 1 file changed, 32 insertions(+), 21 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 20bc13ab8db2..a2190d7cfa74 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1528,30 +1528,14 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct 
*vma,
pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
 }
 
-static inline void zap_present_pte(struct mmu_gather *tlb,
-   struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
-   unsigned long addr, struct zap_details *details,
-   int *rss, bool *force_flush, bool *force_break)
+static inline void zap_present_folio_pte(struct mmu_gather *tlb,
+   struct vm_area_struct *vma, struct folio *folio,
+   struct page *page, pte_t *pte, pte_t ptent, unsigned long addr,
+   struct zap_details *details, int *rss, bool *force_flush,
+   bool *force_break)
 {
struct mm_struct *mm = tlb->mm;
bool delay_rmap = false;
-   struct folio *folio;
-   struct page *page;
-
-   page = vm_normal_page(vma, addr, ptent);
-   if (!page) {
-   /* We don't need up-to-date accessed/dirty bits. */
-   ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
-   arch_check_zapped_pte(vma, ptent);
-   tlb_remove_tlb_entry(tlb, pte, addr);
-   VM_WARN_ON_ONCE(userfaultfd_wp(vma));
-   ksm_might_unmap_zero_page(mm, ptent);
-   return;
-   }
-
-   folio = page_folio(page);
-   if (unlikely(!should_zap_folio(details, folio)))
-   return;
 
if (!folio_test_anon(folio)) {
ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
@@ -1586,6 +1570,33 @@ static inline void zap_present_pte(struct mmu_gather 
*tlb,
}
 }
 
+static inline void zap_present_pte(struct mmu_gather *tlb,
+   struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
+   unsigned long addr, struct zap_details *details,
+   int *rss, bool *force_flush, bool *force_break)
+{
+   struct mm_struct *mm = tlb->mm;
+   struct folio *folio;
+   struct page *page;
+
+   page = vm_normal_page(vma, addr, ptent);
+   if (!page) {
+   /* We don't need up-to-date accessed/dirty bits. */
+   ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+   arch_check_zapped_pte(vma, ptent);
+   tlb_remove_tlb_entry(tlb, pte, addr);
+   VM_WARN_ON_ONCE(userfaultfd_wp(vma));
+   ksm_might_unmap_zero_page(mm, ptent);
+   return;
+   }
+
+   folio = page_folio(page);
+   if (unlikely(!should_zap_folio(details, folio)))
+   return;
+   zap_present_folio_pte(tlb, vma, folio, page, pte, ptent, addr, details,
+ rss, force_flush, force_break);
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
-- 
2.43.0



[PATCH v1 3/9] mm/memory: further separate anon and pagecache folio handling in zap_present_pte()

2024-01-29 Thread David Hildenbrand
We don't need up-to-date accessed-dirty information for anon folios and can
simply work with the ptent we already have. Also, we know the RSS counter
we want to update.

We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() +
zap_install_uffd_wp_if_needed() after updating the folio and RSS.

While at it, only call zap_install_uffd_wp_if_needed() if there is even
any chance that pte_install_uffd_wp_if_needed() would do *something*.
That is, just don't bother if uffd-wp does not apply.

Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 69502cdc0a7d..20bc13ab8db2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1552,12 +1552,9 @@ static inline void zap_present_pte(struct mmu_gather 
*tlb,
folio = page_folio(page);
if (unlikely(!should_zap_folio(details, folio)))
return;
-   ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
-   arch_check_zapped_pte(vma, ptent);
-   tlb_remove_tlb_entry(tlb, pte, addr);
-   zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
 
if (!folio_test_anon(folio)) {
+   ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
if (pte_dirty(ptent)) {
folio_mark_dirty(folio);
if (tlb_delay_rmap(tlb)) {
@@ -1567,8 +1564,17 @@ static inline void zap_present_pte(struct mmu_gather 
*tlb,
}
if (pte_young(ptent) && likely(vma_has_recency(vma)))
folio_mark_accessed(folio);
+   rss[mm_counter(folio)]--;
+   } else {
+   /* We don't need up-to-date accessed/dirty bits. */
+   ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+   rss[MM_ANONPAGES]--;
}
-   rss[mm_counter(folio)]--;
+   arch_check_zapped_pte(vma, ptent);
+   tlb_remove_tlb_entry(tlb, pte, addr);
+   if (unlikely(userfaultfd_pte_wp(vma, ptent)))
+   zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
+
if (!delay_rmap) {
folio_remove_rmap_pte(folio, page, vma);
if (unlikely(page_mapcount(page) < 0))
-- 
2.43.0



[PATCH v1 2/9] mm/memory: handle !page case in zap_present_pte() separately

2024-01-29 Thread David Hildenbrand
We don't need up-to-date accessed/dirty bits, so in theory we could
replace ptep_get_and_clear_full() by an optimized ptep_clear_full()
function. Let's rely on the provided pte.

Further, there is no scenario where we would have to insert uffd-wp
markers when zapping something that is not a normal page (i.e., zeropage).
Add a sanity check to make sure this remains true.

should_zap_folio() no longer has to handle NULL pointers. This change
replaces 2/3 "!page/!folio" checks by a single "!page" one.

Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 50a6c79c78fc..69502cdc0a7d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1497,10 +1497,6 @@ static inline bool should_zap_folio(struct zap_details 
*details,
if (should_zap_cows(details))
return true;
 
-   /* E.g. the caller passes NULL for the case of a zero folio */
-   if (!folio)
-   return true;
-
/* Otherwise we should only zap non-anon folios */
return !folio_test_anon(folio);
 }
@@ -1543,19 +1539,23 @@ static inline void zap_present_pte(struct mmu_gather 
*tlb,
struct page *page;
 
page = vm_normal_page(vma, addr, ptent);
-   if (page)
-   folio = page_folio(page);
+   if (!page) {
+   /* We don't need up-to-date accessed/dirty bits. */
+   ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+   arch_check_zapped_pte(vma, ptent);
+   tlb_remove_tlb_entry(tlb, pte, addr);
+   VM_WARN_ON_ONCE(userfaultfd_wp(vma));
+   ksm_might_unmap_zero_page(mm, ptent);
+   return;
+   }
 
+   folio = page_folio(page);
if (unlikely(!should_zap_folio(details, folio)))
return;
ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
arch_check_zapped_pte(vma, ptent);
tlb_remove_tlb_entry(tlb, pte, addr);
zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
-   if (unlikely(!page)) {
-   ksm_might_unmap_zero_page(mm, ptent);
-   return;
-   }
 
if (!folio_test_anon(folio)) {
if (pte_dirty(ptent)) {
-- 
2.43.0



[PATCH v1 1/9] mm/memory: factor out zapping of present pte into zap_present_pte()

2024-01-29 Thread David Hildenbrand
Let's prepare for further changes by factoring out processing of present
PTEs.

Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 92 ++---
 1 file changed, 52 insertions(+), 40 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b05fd28dbce1..50a6c79c78fc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1532,13 +1532,61 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct 
*vma,
pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
 }
 
+static inline void zap_present_pte(struct mmu_gather *tlb,
+   struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
+   unsigned long addr, struct zap_details *details,
+   int *rss, bool *force_flush, bool *force_break)
+{
+   struct mm_struct *mm = tlb->mm;
+   bool delay_rmap = false;
+   struct folio *folio;
+   struct page *page;
+
+   page = vm_normal_page(vma, addr, ptent);
+   if (page)
+   folio = page_folio(page);
+
+   if (unlikely(!should_zap_folio(details, folio)))
+   return;
+   ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+   arch_check_zapped_pte(vma, ptent);
+   tlb_remove_tlb_entry(tlb, pte, addr);
+   zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
+   if (unlikely(!page)) {
+   ksm_might_unmap_zero_page(mm, ptent);
+   return;
+   }
+
+   if (!folio_test_anon(folio)) {
+   if (pte_dirty(ptent)) {
+   folio_mark_dirty(folio);
+   if (tlb_delay_rmap(tlb)) {
+   delay_rmap = true;
+   *force_flush = true;
+   }
+   }
+   if (pte_young(ptent) && likely(vma_has_recency(vma)))
+   folio_mark_accessed(folio);
+   }
+   rss[mm_counter(folio)]--;
+   if (!delay_rmap) {
+   folio_remove_rmap_pte(folio, page, vma);
+   if (unlikely(page_mapcount(page) < 0))
+   print_bad_pte(vma, addr, ptent, page);
+   }
+   if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
+   *force_flush = true;
+   *force_break = true;
+   }
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
struct zap_details *details)
 {
+   bool force_flush = false, force_break = false;
struct mm_struct *mm = tlb->mm;
-   int force_flush = 0;
int rss[NR_MM_COUNTERS];
spinlock_t *ptl;
pte_t *start_pte;
@@ -1565,45 +1613,9 @@ static unsigned long zap_pte_range(struct mmu_gather 
*tlb,
break;
 
if (pte_present(ptent)) {
-   unsigned int delay_rmap;
-
-   page = vm_normal_page(vma, addr, ptent);
-   if (page)
-   folio = page_folio(page);
-
-   if (unlikely(!should_zap_folio(details, folio)))
-   continue;
-   ptent = ptep_get_and_clear_full(mm, addr, pte,
-   tlb->fullmm);
-   arch_check_zapped_pte(vma, ptent);
-   tlb_remove_tlb_entry(tlb, pte, addr);
-   zap_install_uffd_wp_if_needed(vma, addr, pte, details,
- ptent);
-   if (unlikely(!page)) {
-   ksm_might_unmap_zero_page(mm, ptent);
-   continue;
-   }
-
-   delay_rmap = 0;
-   if (!folio_test_anon(folio)) {
-   if (pte_dirty(ptent)) {
-   folio_mark_dirty(folio);
-   if (tlb_delay_rmap(tlb)) {
-   delay_rmap = 1;
-   force_flush = 1;
-   }
-   }
-   if (pte_young(ptent) && 
likely(vma_has_recency(vma)))
-   folio_mark_accessed(folio);
-   }
-   rss[mm_counter(folio)]--;
-   if (!delay_rmap) {
-   folio_remove_rmap_pte(folio, page, vma);
-   if (unlikely(page_mapcount(page) < 0))
-   print_bad_pte(vma, addr, ptent, page);
-   }
-   if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) 
{
-   force_flush = 1;
+ 

[PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-29 Thread David Hildenbrand
This series is based on [1] and must be applied on top of it.
Similar to what we did with fork(), let's implement PTE batching
during unmap/zap when processing PTE-mapped THPs.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
entry removal once per batch.
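
As a rough illustration of steps (a)-(d) (a hedged sketch only; the helper
name zap_batch_sketch() is made up for this cover letter, and the real code
lives in the patches below), the per-PTE loop conceptually becomes:

/*
 * Sketch: batch-process all PTEs of one large folio in a single step,
 * assuming the folio_pte_batch() helper introduced in [1].
 */
static void zap_batch_sketch(struct vm_area_struct *vma, pte_t *pte,
			     unsigned long addr, unsigned long end)
{
	while (addr < end) {
		pte_t ptent = ptep_get(pte);
		int nr = 1;

		if (pte_present(ptent)) {
			struct page *page = vm_normal_page(vma, addr, ptent);
			struct folio *folio = page ? page_folio(page) : NULL;

			if (folio && folio_test_large(folio))
				/* how many consecutive PTEs map this folio? */
				nr = folio_pte_batch(folio, addr, pte, ptent,
						     (end - addr) >> PAGE_SHIFT,
						     0, NULL);
			/*
			 * (a) adjust the folio refcount once by 'nr',
			 * (b) one rmap call covering 'nr' pages,
			 * (c) clear the 'nr' PTEs in one go,
			 * (d) queue one TLB-entry removal for the range.
			 */
		}
		pte += nr;
		addr += nr * PAGE_SIZE;
	}
}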

Ryan was previously working on this in the context of cont-pte for
arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only.
This series implements the optimization for all architectures, independent
of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
large-folio-pages batches as well, and makes use of our new rmap batching
function when removing the rmap.

To achieve that, we have to enlighten MMU gather / page freeing code
(i.e., everything that consumes encoded_page) to process unmapping
of consecutive pages that all belong to the same large folio. I'm being
very careful to not degrade order-0 performance, and it looks like I
managed to achieve that.

While this series should -- similar to [1] -- be beneficial for adding
cont-pte support on arm64[2], it's one of the requirements for maintaining
a total mapcount[3] for large folios with minimal added overhead and
further changes[4] that build up on top of the total mapcount.

Independent of all that, this series results in a speedup during munmap()
and similar unmapping (process teardown, MADV_DONTNEED on larger ranges)
with PTE-mapped THP, which is the default with THPs that are smaller than
a PMD (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, munmap'ing a 1GiB VMA backed by
PTE-mapped folios of the same size (stddev < 1%) results in the following
runtimes for munmap() in seconds (shorter is better):

Folio Size | mm-unstable |      New | Change
--------------------------------------------
      4KiB |    0.058110 | 0.057715 |   - 1%
     16KiB |    0.044198 | 0.035469 |   -20%
     32KiB |    0.034216 | 0.023522 |   -31%
     64KiB |    0.029207 | 0.018434 |   -37%
    128KiB |    0.026579 | 0.014026 |   -47%
    256KiB |    0.025130 | 0.011756 |   -53%
    512KiB |    0.024292 | 0.010703 |   -56%
   1024KiB |    0.023812 | 0.010294 |   -57%
   2048KiB |    0.023785 | 0.009910 |   -58%

CCing especially s390x folks, because they have tlb freeing hooks that
need adjustment. Only tested on x86-64 for now; will have to do some more
stress testing. Compile-tested on most other architectures. The PPC
change is negligible and makes my cross-compiler happy.

[1] https://lkml.kernel.org/r/20240129124649.189745-1-da...@redhat.com
[2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.robe...@arm.com
[3] https://lkml.kernel.org/r/20230809083256.699513-1-da...@redhat.com
[4] https://lkml.kernel.org/r/20231124132626.235350-1-da...@redhat.com
[5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.robe...@arm.com

Cc: Andrew Morton 
Cc: Matthew Wilcox (Oracle) 
Cc: Ryan Roberts 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: "Aneesh Kumar K.V" 
Cc: Nick Piggin 
Cc: Peter Zijlstra 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: "Naveen N. Rao" 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Alexander Gordeev 
Cc: Christian Borntraeger 
Cc: Sven Schnelle 
Cc: Arnd Bergmann 
Cc: linux-a...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org

David Hildenbrand (9):
  mm/memory: factor out zapping of present pte into zap_present_pte()
  mm/memory: handle !page case in zap_present_pte() separately
  mm/memory: further separate anon and pagecache folio handling in
zap_present_pte()
  mm/memory: factor out zapping folio pte into zap_present_folio_pte()
  mm/mmu_gather: pass "delay_rmap" instead of encoded page to
__tlb_remove_page_size()
  mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP
  mm/mmu_gather: add __tlb_remove_folio_pages()
  mm/mmu_gather: add tlb_remove_tlb_entries()
  mm/memory: optimize unmap/zap with PTE-mapped THP

 arch/powerpc/include/asm/tlb.h |   2 +
 arch/s390/include/asm/tlb.h|  30 --
 include/asm-generic/tlb.h  |  40 ++--
 include/linux/mm_types.h   |  37 ++--
 include/linux/pgtable.h|  66 +
 mm/memory.c| 167 +++--
 mm/mmu_gather.c|  63 +++--
 mm/swap.c  |  12 ++-
 mm/swap_state.c|  12 ++-
 9 files changed, 347 insertions(+), 82 deletions(-)

-- 
2.43.0



[PATCH v10 6/6] arm64: introduce copy_mc_to_kernel() implementation

2024-01-29 Thread Tong Tiangen
The copy_mc_to_kernel() helper is a memory copy implementation that handles
source exceptions. It can be used in memory copy scenarios that tolerate
hardware memory errors (e.g. pmem_read/dax_copy_to_iter).

Currently, only x86 and ppc support this helper. Now that arm64 supports
the machine check safe framework, introduce the arm64 copy_mc_to_kernel()
implementation.
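
For illustration, a minimal usage sketch (the caller below is hypothetical
and not part of this patch); per the helper's semantics it returns 0 on
success or the full size if a hardware memory error was consumed:

/* Hypothetical caller: read from a possibly-poisoned kernel buffer. */
static int example_mc_read(void *dst, const void *src, size_t len)
{
	unsigned long rem = copy_mc_to_kernel(dst, src, len);

	if (rem)	/* the source triggered an uncorrected memory error */
		return -EIO;
	return 0;
}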

Signed-off-by: Tong Tiangen 
---
 arch/arm64/include/asm/string.h  |   5 +
 arch/arm64/include/asm/uaccess.h |  21 +++
 arch/arm64/lib/Makefile  |   2 +-
 arch/arm64/lib/memcpy_mc.S   | 257 +++
 mm/kasan/shadow.c|  12 ++
 5 files changed, 296 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/lib/memcpy_mc.S

diff --git a/arch/arm64/include/asm/string.h b/arch/arm64/include/asm/string.h
index 3a3264ff47b9..995b63c26e99 100644
--- a/arch/arm64/include/asm/string.h
+++ b/arch/arm64/include/asm/string.h
@@ -35,6 +35,10 @@ extern void *memchr(const void *, int, __kernel_size_t);
 extern void *memcpy(void *, const void *, __kernel_size_t);
 extern void *__memcpy(void *, const void *, __kernel_size_t);
 
+#define __HAVE_ARCH_MEMCPY_MC
+extern int memcpy_mcs(void *, const void *, __kernel_size_t);
+extern int __memcpy_mcs(void *, const void *, __kernel_size_t);
+
 #define __HAVE_ARCH_MEMMOVE
 extern void *memmove(void *, const void *, __kernel_size_t);
 extern void *__memmove(void *, const void *, __kernel_size_t);
@@ -57,6 +61,7 @@ void memcpy_flushcache(void *dst, const void *src, size_t 
cnt);
  */
 
 #define memcpy(dst, src, len) __memcpy(dst, src, len)
+#define memcpy_mcs(dst, src, len) __memcpy_mcs(dst, src, len)
 #define memmove(dst, src, len) __memmove(dst, src, len)
 #define memset(s, c, n) __memset(s, c, n)
 
diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index 14be5000c5a0..61e28ef2112a 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -425,4 +425,25 @@ static inline size_t probe_subpage_writeable(const char 
__user *uaddr,
 
 #endif /* CONFIG_ARCH_HAS_SUBPAGE_FAULTS */
 
+#ifdef CONFIG_ARCH_HAS_COPY_MC
+/**
+ * copy_mc_to_kernel - memory copy that handles source exceptions
+ *
+ * @dst:   destination address
+ * @src:   source address
+ * @len:   number of bytes to copy
+ *
+ * Return 0 for success, or #size if there was an exception.
+ */
+static inline unsigned long __must_check
+copy_mc_to_kernel(void *to, const void *from, unsigned long size)
+{
+   int ret;
+
+   ret = memcpy_mcs(to, from, size);
+   return (ret == -EFAULT) ? size : 0;
+}
+#define copy_mc_to_kernel copy_mc_to_kernel
+#endif
+
 #endif /* __ASM_UACCESS_H */
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index a2fd865b816d..899d6ae9698c 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -3,7 +3,7 @@ lib-y   := clear_user.o delay.o copy_from_user.o
\
   copy_to_user.o copy_page.o   \
   clear_page.o csum.o insn.o memchr.o memcpy.o \
   memset.o memcmp.o strcmp.o strncmp.o strlen.o\
-  strnlen.o strchr.o strrchr.o tishift.o
+  strnlen.o strchr.o strrchr.o tishift.o memcpy_mc.o
 
 ifeq ($(CONFIG_KERNEL_MODE_NEON), y)
 obj-$(CONFIG_XOR_BLOCKS)   += xor-neon.o
diff --git a/arch/arm64/lib/memcpy_mc.S b/arch/arm64/lib/memcpy_mc.S
new file mode 100644
index ..7076b500d154
--- /dev/null
+++ b/arch/arm64/lib/memcpy_mc.S
@@ -0,0 +1,257 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2012-2021, Arm Limited.
+ *
+ * Adapted from the original at:
+ * 
https://github.com/ARM-software/optimized-routines/blob/afd6244a1f8d9229/string/aarch64/memcpy.S
+ */
+
+#include 
+#include 
+
+/* Assumptions:
+ *
+ * ARMv8-a, AArch64, unaligned accesses.
+ *
+ */
+
+#define L(label) .L ## label
+
+#define dstin  x0
+#define srcx1
+#define count  x2
+#define dstx3
+#define srcend x4
+#define dstend x5
+#define A_lx6
+#define A_lw   w6
+#define A_hx7
+#define B_lx8
+#define B_lw   w8
+#define B_hx9
+#define C_lx10
+#define C_lw   w10
+#define C_hx11
+#define D_lx12
+#define D_hx13
+#define E_lx14
+#define E_hx15
+#define F_lx16
+#define F_hx17
+#define G_lcount
+#define G_hdst
+#define H_lsrc
+#define H_hsrcend
+#define tmp1   x14
+
+/* This implementation handles overlaps and supports both memcpy and memmove
+   from a single entry point.  It uses unaligned accesses and branchless
+   sequences to keep the code small, simple and improve performance.
+
+   Copies are split into 3 main cases: small copies of up to 32 bytes, medium
+   copies of up to 128 bytes, and large copies.  The overhead of the overlap
+   check is negligible since it is only required for large copies.
+
+   Large copies use a software pipelined loop processing 64 bytes per 
iteration.
+   The destinatio

[PATCH v10 0/6]arm64: add machine check safe support

2024-01-29 Thread Tong Tiangen
With the increase of memory capacity and density, the probability of memory
error also increases. The increasing size and density of server RAM in data
centers and clouds have shown increased uncorrectable memory errors.

Currently, more and more scenarios can tolerate memory errors, such as
CoW[1,2], KSM copy[3], coredump copy[4], khugepaged[5,6], uaccess copy[7],
etc.

This patchset introduces a new processing framework on ARM64, which enables
ARM64 to support error recovery in the above scenarios, and more scenarios
can be expanded based on this in the future.

On arm64, memory errors are handled in do_sea(), which is divided into two cases:
 1. If user mode consumed the memory error, the solution is to kill
the user process and isolate the error page.
 2. If kernel mode consumed the memory error, the solution is to
panic.

For case 2, an undifferentiated panic may not be the optimal choice, as it
can be handled better. In some scenarios we can avoid the panic, such as
uaccess: if a uaccess fails due to a memory error, only the user process is
affected, so killing the user process and isolating the page with hardware
memory errors is a better choice.
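
To make the intended flow concrete, here is a simplified sketch of the
kernel-mode path (hedged pseudo-C for this cover letter, not the exact code
from the patches below):

	/* inside do_sea(), kernel-mode consumption of a memory error */
	if (IS_ENABLED(CONFIG_ARCH_HAS_COPY_MC) &&
	    apei_claim_sea(regs) >= 0 && fixup_exception_mc(regs)) {
		/*
		 * A recoverable context (e.g. uaccess) consumed the error:
		 * branch to the extable fixup and signal the task instead
		 * of bringing the whole machine down.
		 */
	} else {
		/* otherwise, keep the existing behaviour: panic */
	}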

[1] commit d302c2398ba2 ("mm, hwpoison: when copy-on-write hits poison, take 
page offline")
[2] commit 1cb9dc4b475c ("mm: hwpoison: support recovery from HugePage 
copy-on-write faults")
[3] commit 6b970599e807 ("mm: hwpoison: support recovery from 
ksm_might_need_to_copy()")
[4] commit 245f09226893 ("mm: hwpoison: coredump: support recovery from 
dump_user_range()")
[5] commit 98c76c9f1ef7 ("mm/khugepaged: recover from poisoned anonymous 
memory")
[6] commit 12904d953364 ("mm/khugepaged: recover from poisoned file-backed 
memory")
[7] commit 278b917f8cb9 ("x86/mce: Add _ASM_EXTABLE_CPY for copy user access")

Since V9:
 1. Rebase to latest kernel version 6.8-rc2.
 2. Add patch 6/6 to support copy_mc_to_kernel().

Since V8:
 1. Rebase to the latest kernel version and fix typos in some of the patches.
 2. According to the suggestion of Catalin, I attempted to modify the
return value of function copy_mc_[user]_highpage() to bytes not copied.
During the modification process, I found that it would be more
reasonable to return -EFAULT when copy error occurs (referring to the
newly added patch 4). 

For ARM64, the implementation of copy_mc_[user]_highpage() needs to
consider MTE. Considering the scenario where data copying is successful
but the MTE tag copying fails, it is also not reasonable to return
bytes not copied.
 3. Considering the recent addition of machine check safe support for
multiple scenarios, modify commit message for patch 5 (patch 4 for V8).

Since V7:
 Currently, there are patches supporting recovery from poison
 consumption for the CoW scenario[1]. Therefore, supporting the CoW
 scenario on the arm64 architecture only requires modifying the relevant
 code under arch/.
 [1]https://lore.kernel.org/lkml/20221031201029.102123-1-tony.l...@intel.com/

Since V6:
 Resend patches that are not merged into the mainline in V6.

Since V5:
 1. Add patch2/3 to add uaccess assembly helpers.
 2. Optimize the implementation logic of arm64_do_kernel_sea() in patch8.
 3. Remove kernel access fixup in patch9.
 All suggestions are from Mark.

Since V4:
 1. According to Michael's suggestion, add patch 5.
 2. According to Mark's suggestion, do some restructuring of the arm64
 extable, then a new adaptation of machine check safe support is made based
 on this.
 3. According to Mark's suggestion, support machine check safe in do_mte() in
 the CoW scenario.
 4. In V4, two patches have been merged into -next, so V5 does not send these
 two patches.

Since V3:
 1. According to Robin's suggestion, directly modify user_ldst and
 user_ldp in asm-uaccess.h and modify mte.S.
 2. Add the new macro USER_MC in asm-uaccess.h, used in copy_from_user.S
 and copy_to_user.S.
 3. According to Robin's suggestion, use a macro in copy_page_mc.S to
 simplify the code.
 4. According to KeFeng's suggestion, modify powerpc code in patch1.
 5. According to KeFeng's suggestion, modify mm/extable.c and some code
 optimization.

Since V2:
 1. According to Mark's suggestion, all uaccess can be recovered from
 memory errors.
 2. The pagecache reading scenario is also supported as part of uaccess
 (copy_to_user()) and the code duplication problem is also solved.
 Thanks for Robin's suggestion.
 3. According to Mark's suggestion, update the commit message of patch 2/5.
 4. According to Borislav's suggestion, update the commit message of patch 1/5.

Since V1:
 1. Consistent with PPC/x86, use CONFIG_ARCH_HAS_COPY_MC instead of
    ARM64_UCE_KERNEL_RECOVERY.
 2. Add two new scenarios, CoW and pagecache reading.
 3. Fix two small bugs (the first two patches).

V1 in here:
https://lore.kernel.org/lkml/20220323033705.3966643-1-tongtian...@huawei.com/

Tong Tiangen (6):
  uaccess: add generic fallback version of copy_mc_to_user()
  arm64: add support for machine check error safe
  arm64: add uaccess to mach

[PATCH v10 2/6] arm64: add support for machine check error safe

2024-01-29 Thread Tong Tiangen
For the arm64 kernel, when it processes hardware memory errors for
synchronous notifications (do_sea()), if the error is consumed within the
kernel, the current handling is to panic. However, that is not optimal.

Take uaccess for example: if the uaccess operation fails due to a memory
error, only the user process will be affected. Killing the user process and
isolating the corrupt page is a better choice.

This patch only enables the machine check error safe framework and adds an
exception fixup before the kernel panics in do_sea().

Signed-off-by: Tong Tiangen 
---
 arch/arm64/Kconfig   |  1 +
 arch/arm64/include/asm/extable.h |  1 +
 arch/arm64/mm/extable.c  | 16 
 arch/arm64/mm/fault.c| 29 -
 4 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index aa7c1d435139..2cc34b5e7abb 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -20,6 +20,7 @@ config ARM64
select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
select ARCH_HAS_CACHE_LINE_SIZE
+   select ARCH_HAS_COPY_MC if ACPI_APEI_GHES
select ARCH_HAS_CURRENT_STACK_POINTER
select ARCH_HAS_DEBUG_VIRTUAL
select ARCH_HAS_DEBUG_VM_PGTABLE
diff --git a/arch/arm64/include/asm/extable.h b/arch/arm64/include/asm/extable.h
index 72b0e71cc3de..f80ebd0addfd 100644
--- a/arch/arm64/include/asm/extable.h
+++ b/arch/arm64/include/asm/extable.h
@@ -46,4 +46,5 @@ bool ex_handler_bpf(const struct exception_table_entry *ex,
 #endif /* !CONFIG_BPF_JIT */
 
 bool fixup_exception(struct pt_regs *regs);
+bool fixup_exception_mc(struct pt_regs *regs);
 #endif
diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
index 228d681a8715..478e639f8680 100644
--- a/arch/arm64/mm/extable.c
+++ b/arch/arm64/mm/extable.c
@@ -76,3 +76,19 @@ bool fixup_exception(struct pt_regs *regs)
 
BUG();
 }
+
+bool fixup_exception_mc(struct pt_regs *regs)
+{
+   const struct exception_table_entry *ex;
+
+   ex = search_exception_tables(instruction_pointer(regs));
+   if (!ex)
+   return false;
+
+   /*
+* This is not complete, More Machine check safe extable type can
+* be processed here.
+*/
+
+   return false;
+}
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 55f6455a8284..312932dc100b 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -730,6 +730,31 @@ static int do_bad(unsigned long far, unsigned long esr, 
struct pt_regs *regs)
return 1; /* "fault" */
 }
 
+static bool arm64_do_kernel_sea(unsigned long addr, unsigned int esr,
+struct pt_regs *regs, int sig, int code)
+{
+   if (!IS_ENABLED(CONFIG_ARCH_HAS_COPY_MC))
+   return false;
+
+   if (user_mode(regs))
+   return false;
+
+   if (apei_claim_sea(regs) < 0)
+   return false;
+
+   if (!fixup_exception_mc(regs))
+   return false;
+
+   if (current->flags & PF_KTHREAD)
+   return true;
+
+   set_thread_esr(0, esr);
+   arm64_force_sig_fault(sig, code, addr,
+   "Uncorrected memory error on access to user memory\n");
+
+   return true;
+}
+
 static int do_sea(unsigned long far, unsigned long esr, struct pt_regs *regs)
 {
const struct fault_info *inf;
@@ -755,7 +780,9 @@ static int do_sea(unsigned long far, unsigned long esr, 
struct pt_regs *regs)
 */
siaddr  = untagged_addr(far);
}
-   arm64_notify_die(inf->name, regs, inf->sig, inf->code, siaddr, esr);
+
+   if (!arm64_do_kernel_sea(siaddr, esr, regs, inf->sig, inf->code))
+   arm64_notify_die(inf->name, regs, inf->sig, inf->code, siaddr, 
esr);
 
return 0;
 }
-- 
2.25.1



[PATCH v10 4/6] mm/hwpoison: return -EFAULT when copy fail in copy_mc_[user]_highpage()

2024-01-29 Thread Tong Tiangen
If hardware errors are encountered during page copying, returning the bytes
not copied is not meaningful, and the caller cannot do any processing on
the remaining data. Returning -EFAULT is more reasonable, as it indicates
that a hardware error was encountered during the copy.

Signed-off-by: Tong Tiangen 
---
 include/linux/highmem.h | 8 
 mm/khugepaged.c | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 451c1dff0e87..c5ca1a1fc4f5 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -335,8 +335,8 @@ static inline void copy_highpage(struct page *to, struct 
page *from)
 /*
  * If architecture supports machine check exception handling, define the
  * #MC versions of copy_user_highpage and copy_highpage. They copy a memory
- * page with #MC in source page (@from) handled, and return the number
- * of bytes not copied if there was a #MC, otherwise 0 for success.
+ * page with #MC in source page (@from) handled, and return -EFAULT if there
+ * was a #MC, otherwise 0 for success.
  */
 static inline int copy_mc_user_highpage(struct page *to, struct page *from,
unsigned long vaddr, struct 
vm_area_struct *vma)
@@ -352,7 +352,7 @@ static inline int copy_mc_user_highpage(struct page *to, 
struct page *from,
kunmap_local(vto);
kunmap_local(vfrom);
 
-   return ret;
+   return ret ? -EFAULT : 0;
 }
 
 static inline int copy_mc_highpage(struct page *to, struct page *from)
@@ -368,7 +368,7 @@ static inline int copy_mc_highpage(struct page *to, struct 
page *from)
kunmap_local(vto);
kunmap_local(vfrom);
 
-   return ret;
+   return ret ? -EFAULT : 0;
 }
 #else
 static inline int copy_mc_user_highpage(struct page *to, struct page *from,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2b219acb528e..ba6743a54c86 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -797,7 +797,7 @@ static int __collapse_huge_page_copy(pte_t *pte,
continue;
}
src_page = pte_page(pteval);
-   if (copy_mc_user_highpage(page, src_page, _address, vma) > 0) {
+   if (copy_mc_user_highpage(page, src_page, _address, vma)) {
result = SCAN_COPY_MC;
break;
}
@@ -2053,7 +2053,7 @@ static int collapse_file(struct mm_struct *mm, unsigned 
long addr,
clear_highpage(hpage + (index % HPAGE_PMD_NR));
index++;
}
-   if (copy_mc_highpage(hpage + (page->index % HPAGE_PMD_NR), 
page) > 0) {
+   if (copy_mc_highpage(hpage + (page->index % HPAGE_PMD_NR), 
page)) {
result = SCAN_COPY_MC;
goto rollback;
}
-- 
2.25.1



[PATCH v10 3/6] arm64: add uaccess to machine check safe

2024-01-29 Thread Tong Tiangen
If a user process's memory access fails due to a hardware memory error, only
the relevant process is affected, so it is more reasonable to kill the user
process and isolate the corrupt page than to panic the kernel.

Signed-off-by: Tong Tiangen 
---
 arch/arm64/lib/copy_from_user.S | 10 +-
 arch/arm64/lib/copy_to_user.S   | 10 +-
 arch/arm64/mm/extable.c |  8 
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S
index 34e317907524..1bf676e9201d 100644
--- a/arch/arm64/lib/copy_from_user.S
+++ b/arch/arm64/lib/copy_from_user.S
@@ -25,7 +25,7 @@
.endm
 
.macro strb1 reg, ptr, val
-   strb \reg, [\ptr], \val
+   USER(9998f, strb \reg, [\ptr], \val)
.endm
 
.macro ldrh1 reg, ptr, val
@@ -33,7 +33,7 @@
.endm
 
.macro strh1 reg, ptr, val
-   strh \reg, [\ptr], \val
+   USER(9998f, strh \reg, [\ptr], \val)
.endm
 
.macro ldr1 reg, ptr, val
@@ -41,7 +41,7 @@
.endm
 
.macro str1 reg, ptr, val
-   str \reg, [\ptr], \val
+   USER(9998f, str \reg, [\ptr], \val)
.endm
 
.macro ldp1 reg1, reg2, ptr, val
@@ -49,7 +49,7 @@
.endm
 
.macro stp1 reg1, reg2, ptr, val
-   stp \reg1, \reg2, [\ptr], \val
+   USER(9998f, stp \reg1, \reg2, [\ptr], \val)
.endm
 
 end.reqx5
@@ -66,7 +66,7 @@ SYM_FUNC_START(__arch_copy_from_user)
b.ne9998f
// Before being absolutely sure we couldn't copy anything, try harder
 USER(9998f, ldtrb tmp1w, [srcin])
-   strbtmp1w, [dst], #1
+USER(9998f, strb   tmp1w, [dst], #1)
 9998:  sub x0, end, dst// bytes not copied
ret
 SYM_FUNC_END(__arch_copy_from_user)
diff --git a/arch/arm64/lib/copy_to_user.S b/arch/arm64/lib/copy_to_user.S
index 802231772608..cc031bd87455 100644
--- a/arch/arm64/lib/copy_to_user.S
+++ b/arch/arm64/lib/copy_to_user.S
@@ -20,7 +20,7 @@
  * x0 - bytes not copied
  */
.macro ldrb1 reg, ptr, val
-   ldrb  \reg, [\ptr], \val
+   USER(9998f, ldrb  \reg, [\ptr], \val)
.endm
 
.macro strb1 reg, ptr, val
@@ -28,7 +28,7 @@
.endm
 
.macro ldrh1 reg, ptr, val
-   ldrh  \reg, [\ptr], \val
+   USER(9998f, ldrh  \reg, [\ptr], \val)
.endm
 
.macro strh1 reg, ptr, val
@@ -36,7 +36,7 @@
.endm
 
.macro ldr1 reg, ptr, val
-   ldr \reg, [\ptr], \val
+   USER(9998f, ldr \reg, [\ptr], \val)
.endm
 
.macro str1 reg, ptr, val
@@ -44,7 +44,7 @@
.endm
 
.macro ldp1 reg1, reg2, ptr, val
-   ldp \reg1, \reg2, [\ptr], \val
+   USER(9998f, ldp \reg1, \reg2, [\ptr], \val)
.endm
 
.macro stp1 reg1, reg2, ptr, val
@@ -64,7 +64,7 @@ SYM_FUNC_START(__arch_copy_to_user)
 9997:  cmp dst, dstin
b.ne9998f
// Before being absolutely sure we couldn't copy anything, try harder
-   ldrbtmp1w, [srcin]
+USER(9998f, ldrb   tmp1w, [srcin])
 USER(9998f, sttrb tmp1w, [dst])
add dst, dst, #1
 9998:  sub x0, end, dst// bytes not copied
diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
index 478e639f8680..28ec35e3d210 100644
--- a/arch/arm64/mm/extable.c
+++ b/arch/arm64/mm/extable.c
@@ -85,10 +85,10 @@ bool fixup_exception_mc(struct pt_regs *regs)
if (!ex)
return false;
 
-   /*
-* This is not complete, More Machine check safe extable type can
-* be processed here.
-*/
+   switch (ex->type) {
+   case EX_TYPE_UACCESS_ERR_ZERO:
+   return ex_handler_uaccess_err_zero(ex, regs);
+   }
 
return false;
 }
-- 
2.25.1



[PATCH v10 5/6] arm64: support copy_mc_[user]_highpage()

2024-01-29 Thread Tong Tiangen
Currently, many scenarios that can tolerate memory errors when copying a page
are supported in the kernel[1][2][3], all of which are implemented by
copy_mc_[user]_highpage(). arm64 should also support this mechanism.

Due to MTE, arm64 needs its own copy_mc_[user]_highpage() architecture
implementation; the macros __HAVE_ARCH_COPY_MC_HIGHPAGE and
__HAVE_ARCH_COPY_MC_USER_HIGHPAGE have been added to control it.

Add the new helper copy_mc_page(), which provides a machine check safe page
copy implementation. copy_mc_page() in copy_mc_page.S largely borrows
from copy_page() in copy_page.S; the main difference is that copy_mc_page()
adds an extable entry to every load/store instruction to support machine
check safe.

Add the new extable type EX_TYPE_COPY_MC_PAGE_ERR_ZERO, which is used in
copy_mc_page().

[1]a873dfe1032a ("mm, hwpoison: try to recover from copy-on write faults")
[2]5f2500b93cc9 ("mm/khugepaged: recover from poisoned anonymous memory")
[3]6b970599e807 ("mm: hwpoison: support recovery from ksm_might_need_to_copy()")
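
As a rough outline only (a hypothetical shape, not a verbatim copy of the
copypage.c hunk below; the _sketch suffix marks it as made up), the arch
override boils down to copying the data and the MTE tags with #MC-safe
primitives and mapping any failure to -EFAULT:

static inline int copy_mc_highpage_sketch(struct page *to, struct page *from)
{
	int ret;

	/* #MC-safe data copy via the new copy_mc_page() helper */
	ret = copy_mc_page(page_address(to), page_address(from));
	if (!ret)
		/* #MC-safe MTE tag copy via mte_copy_mc_page_tags() */
		ret = mte_copy_mc_page_tags(page_address(to),
					    page_address(from));

	return ret ? -EFAULT : 0;
}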

Signed-off-by: Tong Tiangen 
---
 arch/arm64/include/asm/asm-extable.h | 15 ++
 arch/arm64/include/asm/assembler.h   |  4 ++
 arch/arm64/include/asm/mte.h |  5 ++
 arch/arm64/include/asm/page.h| 10 
 arch/arm64/lib/Makefile  |  2 +
 arch/arm64/lib/copy_mc_page.S| 78 
 arch/arm64/lib/mte.S | 27 ++
 arch/arm64/mm/copypage.c | 66 ---
 arch/arm64/mm/extable.c  |  7 +--
 include/linux/highmem.h  |  8 +++
 10 files changed, 213 insertions(+), 9 deletions(-)
 create mode 100644 arch/arm64/lib/copy_mc_page.S

diff --git a/arch/arm64/include/asm/asm-extable.h 
b/arch/arm64/include/asm/asm-extable.h
index 980d1dd8e1a3..819044fefbe7 100644
--- a/arch/arm64/include/asm/asm-extable.h
+++ b/arch/arm64/include/asm/asm-extable.h
@@ -10,6 +10,7 @@
 #define EX_TYPE_UACCESS_ERR_ZERO   2
 #define EX_TYPE_KACCESS_ERR_ZERO   3
 #define EX_TYPE_LOAD_UNALIGNED_ZEROPAD 4
+#define EX_TYPE_COPY_MC_PAGE_ERR_ZERO  5
 
 /* Data fields for EX_TYPE_UACCESS_ERR_ZERO */
 #define EX_DATA_REG_ERR_SHIFT  0
@@ -51,6 +52,16 @@
 #define _ASM_EXTABLE_UACCESS(insn, fixup)  \
_ASM_EXTABLE_UACCESS_ERR_ZERO(insn, fixup, wzr, wzr)
 
+#define _ASM_EXTABLE_COPY_MC_PAGE_ERR_ZERO(insn, fixup, err, zero) \
+   __ASM_EXTABLE_RAW(insn, fixup,  \
+ EX_TYPE_COPY_MC_PAGE_ERR_ZERO,\
+ ( \
+   EX_DATA_REG(ERR, err) | \
+   EX_DATA_REG(ZERO, zero) \
+ ))
+
+#define _ASM_EXTABLE_COPY_MC_PAGE(insn, fixup) \
+   _ASM_EXTABLE_COPY_MC_PAGE_ERR_ZERO(insn, fixup, wzr, wzr)
 /*
  * Create an exception table entry for uaccess `insn`, which will branch to 
`fixup`
  * when an unhandled fault is taken.
@@ -59,6 +70,10 @@
_ASM_EXTABLE_UACCESS(\insn, \fixup)
.endm
 
+   .macro  _asm_extable_copy_mc_page, insn, fixup
+   _ASM_EXTABLE_COPY_MC_PAGE(\insn, \fixup)
+   .endm
+
 /*
  * Create an exception table entry for `insn` if `fixup` is provided. Otherwise
  * do nothing.
diff --git a/arch/arm64/include/asm/assembler.h 
b/arch/arm64/include/asm/assembler.h
index 513787e43329..e1d8ce155878 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -154,6 +154,10 @@ lr .reqx30 // link register
 #define CPU_LE(code...) code
 #endif
 
+#define CPY_MC(l, x...)\
+:   x; \
+   _asm_extable_copy_mc_pageb, l
+
 /*
  * Define a macro that constructs a 64-bit value by concatenating two
  * 32-bit registers. Note that on big endian systems the order of the
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 91fbd5c8a391..9cdded082dd4 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -92,6 +92,7 @@ static inline bool try_page_mte_tagging(struct page *page)
 void mte_zero_clear_page_tags(void *addr);
 void mte_sync_tags(pte_t pte, unsigned int nr_pages);
 void mte_copy_page_tags(void *kto, const void *kfrom);
+int mte_copy_mc_page_tags(void *kto, const void *kfrom);
 void mte_thread_init_user(void);
 void mte_thread_switch(struct task_struct *next);
 void mte_cpu_setup(void);
@@ -128,6 +129,10 @@ static inline void mte_sync_tags(pte_t pte, unsigned int 
nr_pages)
 static inline void mte_copy_page_tags(void *kto, const void *kfrom)
 {
 }
+static inline int mte_copy_mc_page_tags(void *kto, const void *kfrom)
+{
+   return 0;
+}
 static inline void mte_thread_init_user(void)
 {
 }
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 2312e6ee595f..304cc86b8a10 100644
--- a/arch/arm6

[PATCH v10 1/6] uaccess: add generic fallback version of copy_mc_to_user()

2024-01-29 Thread Tong Tiangen
x86/powerpc have their own implementations of copy_mc_to_user(); add a
generic fallback in include/linux/uaccess.h to prepare for other
architectures to enable CONFIG_ARCH_HAS_COPY_MC.
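
The opt-in pattern is the usual one: an architecture that provides its own
machine-check-aware copy defines the function and the matching macro so the
generic fallback added below compiles out. A hedged sketch for a
hypothetical architecture "foo":

/* arch/foo/include/asm/uaccess.h (hypothetical) */
unsigned long __must_check
copy_mc_to_user(void __user *to, const void *from, unsigned long n);
#define copy_mc_to_user copy_mc_to_user	/* hides the generic fallback */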

Signed-off-by: Tong Tiangen 
Acked-by: Michael Ellerman 
---
 arch/powerpc/include/asm/uaccess.h | 1 +
 arch/x86/include/asm/uaccess.h | 1 +
 include/linux/uaccess.h| 9 +
 3 files changed, 11 insertions(+)

diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index f1f9890f50d3..4bfd1e6f0702 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -381,6 +381,7 @@ copy_mc_to_user(void __user *to, const void *from, unsigned 
long n)
 
return n;
 }
+#define copy_mc_to_user copy_mc_to_user
 #endif
 
 extern long __copy_from_user_flushcache(void *dst, const void __user *src,
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 5c367c1290c3..fd56282ee9a8 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -497,6 +497,7 @@ copy_mc_to_kernel(void *to, const void *from, unsigned len);
 
 unsigned long __must_check
 copy_mc_to_user(void __user *to, const void *from, unsigned len);
+#define copy_mc_to_user copy_mc_to_user
 #endif
 
 /*
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 3064314f4832..550287c92990 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -205,6 +205,15 @@ copy_mc_to_kernel(void *dst, const void *src, size_t cnt)
 }
 #endif
 
+#ifndef copy_mc_to_user
+static inline unsigned long __must_check
+copy_mc_to_user(void *dst, const void *src, size_t cnt)
+{
+   check_object_size(src, cnt, true);
+   return raw_copy_to_user(dst, src, cnt);
+}
+#endif
+
 static __always_inline void pagefault_disabled_inc(void)
 {
current->pagefault_disabled++;
-- 
2.25.1



[PATCH linux-next 3/3] arch, crash: move arch_crash_save_vmcoreinfo() out to file vmcore_info.c

2024-01-29 Thread Baoquan He
Nathan reported the build error below:

=
$ curl -LSso .config 
https://git.alpinelinux.org/aports/plain/community/linux-edge/config-edge.armv7
$ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- olddefconfig all
...
arm-linux-gnueabi-ld: arch/arm/kernel/machine_kexec.o: in function 
`arch_crash_save_vmcoreinfo':
machine_kexec.c:(.text+0x488): undefined reference to `vmcoreinfo_append_str'


On architectures like arm, s390, ppc and sh, the function
arch_crash_save_vmcoreinfo() is located in machine_kexec.c and can
only be compiled in when CONFIG_KEXEC_CORE=y.

That's not right, because arch_crash_save_vmcoreinfo() is used to export
arch specific vmcoreinfo, and CONFIG_VMCORE_INFO is supposed to control
whether it is compiled in. However, CONFIG_VMCORE_INFO can be independent of
CONFIG_KEXEC_CORE, e.g. CONFIG_PROC_KCORE=y will select CONFIG_VMCORE_INFO.
Or, if CONFIG_KEXEC/CONFIG_KEXEC_FILE is set while CONFIG_CRASH_DUMP is
not, a linking error is reported.

So, on arm, s390, ppc and sh, move arch_crash_save_vmcoreinfo() out to
a new file, vmcore_info.c, and let CONFIG_VMCORE_INFO decide whether
arch_crash_save_vmcoreinfo() is compiled in.

Reported-by: Nathan Chancellor 
Closes: 
https://lore.kernel.org/all/20240126045551.GA126645@dev-arch.thelio-3990X/T/#u
Signed-off-by: Baoquan He 
---
 arch/arm/kernel/Makefile |  1 +
 arch/arm/kernel/machine_kexec.c  |  7 ---
 arch/arm/kernel/vmcore_info.c| 10 ++
 arch/powerpc/kexec/Makefile  |  1 +
 arch/powerpc/kexec/core.c| 28 --
 arch/powerpc/kexec/vmcore_info.c | 34 
 arch/s390/kernel/Makefile|  1 +
 arch/s390/kernel/machine_kexec.c | 15 --
 arch/s390/kernel/vmcore_info.c   | 23 +
 arch/sh/kernel/Makefile  |  1 +
 arch/sh/kernel/machine_kexec.c   | 11 ---
 arch/sh/kernel/vmcore_info.c | 17 
 12 files changed, 88 insertions(+), 61 deletions(-)
 create mode 100644 arch/arm/kernel/vmcore_info.c
 create mode 100644 arch/powerpc/kexec/vmcore_info.c
 create mode 100644 arch/s390/kernel/vmcore_info.c
 create mode 100644 arch/sh/kernel/vmcore_info.c

diff --git a/arch/arm/kernel/Makefile b/arch/arm/kernel/Makefile
index 771264d4726a..6a9de826ffd3 100644
--- a/arch/arm/kernel/Makefile
+++ b/arch/arm/kernel/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_DYNAMIC_FTRACE)  += ftrace.o insn.o patch.o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER)+= ftrace.o insn.o patch.o
 obj-$(CONFIG_JUMP_LABEL)   += jump_label.o insn.o patch.o
 obj-$(CONFIG_KEXEC_CORE)   += machine_kexec.o relocate_kernel.o
+obj-$(CONFIG_VMCORE_INFO)  += vmcore_info.o
 # Main staffs in KPROBES are in arch/arm/probes/ .
 obj-$(CONFIG_KPROBES)  += patch.o insn.o
 obj-$(CONFIG_OABI_COMPAT)  += sys_oabi-compat.o
diff --git a/arch/arm/kernel/machine_kexec.c b/arch/arm/kernel/machine_kexec.c
index 5d07cf9e0044..80ceb5bd2680 100644
--- a/arch/arm/kernel/machine_kexec.c
+++ b/arch/arm/kernel/machine_kexec.c
@@ -198,10 +198,3 @@ void machine_kexec(struct kimage *image)
 
soft_restart(reboot_entry_phys);
 }
-
-void arch_crash_save_vmcoreinfo(void)
-{
-#ifdef CONFIG_ARM_LPAE
-   VMCOREINFO_CONFIG(ARM_LPAE);
-#endif
-}
diff --git a/arch/arm/kernel/vmcore_info.c b/arch/arm/kernel/vmcore_info.c
new file mode 100644
index ..1437aba47787
--- /dev/null
+++ b/arch/arm/kernel/vmcore_info.c
@@ -0,0 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include 
+
+void arch_crash_save_vmcoreinfo(void)
+{
+#ifdef CONFIG_ARM_LPAE
+   VMCOREINFO_CONFIG(ARM_LPAE);
+#endif
+}
diff --git a/arch/powerpc/kexec/Makefile b/arch/powerpc/kexec/Makefile
index 0c2abe7f9908..91e96f5168b7 100644
--- a/arch/powerpc/kexec/Makefile
+++ b/arch/powerpc/kexec/Makefile
@@ -8,6 +8,7 @@ obj-y   += core.o crash.o core_$(BITS).o
 obj-$(CONFIG_PPC32)+= relocate_32.o
 
 obj-$(CONFIG_KEXEC_FILE)   += file_load.o ranges.o file_load_$(BITS).o 
elf_$(BITS).o
+obj-$(CONFIG_VMCORE_INFO)  += vmcore_info.o
 
 # Disable GCOV, KCOV & sanitizers in odd or sensitive code
 GCOV_PROFILE_core_$(BITS).o := n
diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c
index 27fa9098a5b7..3ff4411ed496 100644
--- a/arch/powerpc/kexec/core.c
+++ b/arch/powerpc/kexec/core.c
@@ -53,34 +53,6 @@ void machine_kexec_cleanup(struct kimage *image)
 {
 }
 
-void arch_crash_save_vmcoreinfo(void)
-{
-
-#ifdef CONFIG_NUMA
-   VMCOREINFO_SYMBOL(node_data);
-   VMCOREINFO_LENGTH(node_data, MAX_NUMNODES);
-#endif
-#ifndef CONFIG_NUMA
-   VMCOREINFO_SYMBOL(contig_page_data);
-#endif
-#if defined(CONFIG_PPC64) && defined(CONFIG_SPARSEMEM_VMEMMAP)
-   VMCOREINFO_SYMBOL(vmemmap_list);
-   VMCOREINFO_SYMBOL(mmu_vmemmap_psize);
-   VMCOREINFO_SYMBOL(mmu_psize_defs);
-   VMCOREINFO_STRUCT_SIZE(vmemmap_backing);
-   VMCOREINFO_OFFSET(vmemmap_backing, list);
-   VMCOREINFO_OFFSET(vmemmap_backing, phys);
-   VMCOREIN

[PATCH linux-next 2/3] crash: fix building error in generic codes

2024-01-29 Thread Baoquan He
Nathan reported some build errors on arm64, shown below:

==
$ curl -LSso .config 
https://github.com/archlinuxarm/PKGBUILDs/raw/master/core/linux-aarch64/config
$ make -skj"$(nproc)" ARCH=arm64 CROSS_COMPILE=aarch64-linux- olddefconfig all
...
aarch64-linux-ld: kernel/kexec_file.o: in function 
`kexec_walk_memblock.constprop.0':
kexec_file.c:(.text+0x314): undefined reference to `crashk_res'
...
aarch64-linux-ld: drivers/of/kexec.o: in function 
`of_kexec_alloc_and_setup_fdt':
kexec.c:(.text+0x580): undefined reference to `crashk_res'
...
aarch64-linux-ld: kexec.c:(.text+0x5c0): undefined reference to `crashk_low_res'
==

On the provided config, it has:
===
CONFIG_VMCORE_INFO=y
CONFIG_KEXEC_CORE=y
CONFIG_KEXEC=y
CONFIG_KEXEC_FILE=y
===

These crash-related code blocks need to be put inside CONFIG_CRASH_DUMP
ifdeffery scope to avoid build errors when CONFIG_CRASH_DUMP is not
set.

Reported-by: Nathan Chancellor 
Closes: 
https://lore.kernel.org/all/20240126045551.GA126645@dev-arch.thelio-3990X/T/#u
Signed-off-by: Baoquan He 
---
 drivers/of/kexec.c  | 2 ++
 kernel/kexec_file.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
index 68278340cecf..9ccde2fd77cb 100644
--- a/drivers/of/kexec.c
+++ b/drivers/of/kexec.c
@@ -395,6 +395,7 @@ void *of_kexec_alloc_and_setup_fdt(const struct kimage 
*image,
if (ret)
goto out;
 
+#ifdef CONFIG_CRASH_DUMP
/* add linux,usable-memory-range */
ret = fdt_appendprop_addrrange(fdt, 0, chosen_node,
"linux,usable-memory-range", crashk_res.start,
@@ -410,6 +411,7 @@ void *of_kexec_alloc_and_setup_fdt(const struct kimage 
*image,
if (ret)
goto out;
}
+#endif
}
 
/* add bootargs */
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index ce7ce2ae27cd..2d1db05fbf04 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -540,8 +540,10 @@ static int kexec_walk_memblock(struct kexec_buf *kbuf,
phys_addr_t mstart, mend;
struct resource res = { };
 
+#ifdef CONFIG_CRASH_DUMP
if (kbuf->image->type == KEXEC_TYPE_CRASH)
return func(&crashk_res, kbuf);
+#endif
 
/*
 * Using MEMBLOCK_NONE will properly skip MEMBLOCK_DRIVER_MANAGED. See
-- 
2.41.0



[PATCH linux-next 1/3] x86, crash: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CORE ifdef scope

2024-01-29 Thread Baoquan He
Michael pointed out that the #ifdef CONFIG_CRASH_DUMP is nested inside
the CONFIG_KEXEC_CORE ifdef scope in arch/x86/xen/enlighten_hvm.c.

Although the nesting works too, since CONFIG_CRASH_DUMP depends on
CONFIG_KEXEC_CORE, it may cause confusion because there
are places where it's not nested, and people may think it needs to be nested
even though it doesn't have to be.

Fix that by moving the CONFIG_CRASH_DUMP ifdeffery out of the
CONFIG_KEXEC_CORE ifdeffery scope.

Also fix a build error Nathan reported (below) by replacing the
CONFIG_KEXEC_CORE ifdef with a CONFIG_VMCORE_INFO ifdef.


$ curl -LSso .config 
https://git.alpinelinux.org/aports/plain/community/linux-edge/config-edge.x86_64
$ make -skj"$(nproc)" ARCH=x86_64 CROSS_COMPILE=x86_64-linux- olddefconfig all
...
x86_64-linux-ld: arch/x86/xen/mmu_pv.o: in function `paddr_vmcoreinfo_note':
mmu_pv.c:(.text+0x3af3): undefined reference to `vmcoreinfo_note'


Link: 
https://lore.kernel.org/all/sn6pr02mb4157931105fa68d72e3d3db8d4...@sn6pr02mb4157.namprd02.prod.outlook.com/T/#u
Link: 
https://lore.kernel.org/all/20240126045551.GA126645@dev-arch.thelio-3990X/T/#u
Signed-off-by: Baoquan He 
---
 arch/x86/kernel/cpu/mshyperv.c | 10 ++
 arch/x86/kernel/reboot.c   |  2 +-
 arch/x86/xen/enlighten_hvm.c   |  4 ++--
 arch/x86/xen/mmu_pv.c  |  2 +-
 4 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index f8163a59026b..2e8cd5a4ae85 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -209,6 +209,7 @@ static void hv_machine_shutdown(void)
if (kexec_in_progress)
hyperv_cleanup();
 }
+#endif /* CONFIG_KEXEC_CORE */
 
 #ifdef CONFIG_CRASH_DUMP
 static void hv_machine_crash_shutdown(struct pt_regs *regs)
@@ -222,8 +223,7 @@ static void hv_machine_crash_shutdown(struct pt_regs *regs)
/* Disable the hypercall page when there is only 1 active CPU. */
hyperv_cleanup();
 }
-#endif
-#endif /* CONFIG_KEXEC_CORE */
+#endif /* CONFIG_CRASH_DUMP */
 #endif /* CONFIG_HYPERV */
 
 static uint32_t  __init ms_hyperv_platform(void)
@@ -497,9 +497,11 @@ static void __init ms_hyperv_init_platform(void)
no_timer_check = 1;
 #endif
 
-#if IS_ENABLED(CONFIG_HYPERV) && defined(CONFIG_KEXEC_CORE)
+#if IS_ENABLED(CONFIG_HYPERV)
+#if defined(CONFIG_KEXEC_CORE)
machine_ops.shutdown = hv_machine_shutdown;
-#ifdef CONFIG_CRASH_DUMP
+#endif
+#if defined(CONFIG_CRASH_DUMP)
machine_ops.crash_shutdown = hv_machine_crash_shutdown;
 #endif
 #endif
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 1287b0d5962f..f3130f762784 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -826,7 +826,7 @@ void machine_halt(void)
machine_ops.halt();
 }
 
-#ifdef CONFIG_KEXEC_CORE
+#ifdef CONFIG_CRASH_DUMP
 void machine_crash_shutdown(struct pt_regs *regs)
 {
machine_ops.crash_shutdown(regs);
diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index 09e3db7ff990..0b367c1e086d 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -148,6 +148,7 @@ static void xen_hvm_shutdown(void)
if (kexec_in_progress)
xen_reboot(SHUTDOWN_soft_reset);
 }
+#endif
 
 #ifdef CONFIG_CRASH_DUMP
 static void xen_hvm_crash_shutdown(struct pt_regs *regs)
@@ -156,7 +157,6 @@ static void xen_hvm_crash_shutdown(struct pt_regs *regs)
xen_reboot(SHUTDOWN_soft_reset);
 }
 #endif
-#endif
 
 static int xen_cpu_up_prepare_hvm(unsigned int cpu)
 {
@@ -238,10 +238,10 @@ static void __init xen_hvm_guest_init(void)
 
 #ifdef CONFIG_KEXEC_CORE
machine_ops.shutdown = xen_hvm_shutdown;
+#endif
 #ifdef CONFIG_CRASH_DUMP
machine_ops.crash_shutdown = xen_hvm_crash_shutdown;
 #endif
-#endif
 }
 
 static __init int xen_parse_nopv(char *arg)
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 218773cfb009..e21974f2cf2d 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2520,7 +2520,7 @@ int xen_remap_pfn(struct vm_area_struct *vma, unsigned 
long addr,
 }
 EXPORT_SYMBOL_GPL(xen_remap_pfn);
 
-#ifdef CONFIG_KEXEC_CORE
+#ifdef CONFIG_VMCORE_INFO
 phys_addr_t paddr_vmcoreinfo_note(void)
 {
if (xen_pv_domain())
-- 
2.41.0



[PATCH] MAINTAINERS: adjust file entries after crypto vmx file movement

2024-01-29 Thread Lukas Bulwahn
Commit 109303336a0c ("crypto: vmx - Move to arch/powerpc/crypto") moves the
crypto vmx files to arch/powerpc, but misses adjusting the file entries for
IBM Power VMX Cryptographic instructions and LINUX FOR POWERPC.

Hence, ./scripts/get_maintainer.pl --self-test=patterns complains about
broken references.

Adjust these file entries accordingly.

Signed-off-by: Lukas Bulwahn 
---
Danny, please ack.
Herbert, please pick this minor clean-up patch on your -next tree.

 MAINTAINERS | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 2fb944964be5..15bc79e80e28 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10307,12 +10307,12 @@ M:Nayna Jain 
 M: Paulo Flabiano Smorigo 
 L: linux-cry...@vger.kernel.org
 S: Supported
-F: drivers/crypto/vmx/Kconfig
-F: drivers/crypto/vmx/Makefile
-F: drivers/crypto/vmx/aes*
-F: drivers/crypto/vmx/ghash*
-F: drivers/crypto/vmx/ppc-xlate.pl
-F: drivers/crypto/vmx/vmx.c
+F: arch/powerpc/crypto/Kconfig
+F: arch/powerpc/crypto/Makefile
+F: arch/powerpc/crypto/aes*
+F: arch/powerpc/crypto/ghash*
+F: arch/powerpc/crypto/ppc-xlate.pl
+F: arch/powerpc/crypto/vmx.c
 
 IBM ServeRAID RAID DRIVER
 S: Orphan
@@ -12397,7 +12397,6 @@ F:  drivers/*/*/*pasemi*
 F: drivers/*/*pasemi*
 F: drivers/char/tpm/tpm_ibmvtpm*
 F: drivers/crypto/nx/
-F: drivers/crypto/vmx/
 F: drivers/i2c/busses/i2c-opal.c
 F: drivers/net/ethernet/ibm/ibmveth.*
 F: drivers/net/ethernet/ibm/ibmvnic.*
-- 
2.17.1



[PATCH v3 15/15] mm/memory: ignore writable bit in folio_pte_batch()

2024-01-29 Thread David Hildenbrand
... and conditionally return to the caller if any PTE except the first one
is writable. fork() has to make sure to properly write-protect in case any
PTE is writable. Other users (e.g., page unmapping) are expected to not
care.

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 30 --
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b2ec2b6b54c7..b05fd28dbce1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -968,7 +968,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, 
fpb_t flags)
pte = pte_mkclean(pte);
if (likely(flags & FPB_IGNORE_SOFT_DIRTY))
pte = pte_clear_soft_dirty(pte);
-   return pte_mkold(pte);
+   return pte_wrprotect(pte_mkold(pte));
 }
 
 /*
@@ -976,21 +976,32 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, 
fpb_t flags)
  * pages of the same folio.
  *
  * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
- * the accessed bit, dirty bit (with FPB_IGNORE_DIRTY) and soft-dirty bit
- * (with FPB_IGNORE_SOFT_DIRTY).
+ * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
+ * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
+ *
+ * If "any_writable" is set, it will indicate if any other PTE besides the
+ * first (given) PTE is writable.
  */
 static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
-   pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags)
+   pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
+   bool *any_writable)
 {
unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
const pte_t *end_ptep = start_ptep + max_nr;
pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte), 
flags);
pte_t *ptep = start_ptep + 1;
+   bool writable;
+
+   if (any_writable)
+   *any_writable = false;
 
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
 
while (ptep != end_ptep) {
-   pte = __pte_batch_clear_ignored(ptep_get(ptep), flags);
+   pte = ptep_get(ptep);
+   if (any_writable)
+   writable = !!pte_write(pte);
+   pte = __pte_batch_clear_ignored(pte, flags);
 
if (!pte_same(pte, expected_pte))
break;
@@ -1003,6 +1014,9 @@ static inline int folio_pte_batch(struct folio *folio, 
unsigned long addr,
if (pte_pfn(pte) == folio_end_pfn)
break;
 
+   if (any_writable)
+   *any_writable |= writable;
+
expected_pte = pte_next_pfn(expected_pte);
ptep++;
}
@@ -1024,6 +1038,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma
 {
struct page *page;
struct folio *folio;
+   bool any_writable;
fpb_t flags = 0;
int err, nr;
 
@@ -1044,7 +1059,8 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma
if (!vma_soft_dirty_enabled(src_vma))
flags |= FPB_IGNORE_SOFT_DIRTY;
 
-   nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags);
+   nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
+&any_writable);
folio_ref_add(folio, nr);
if (folio_test_anon(folio)) {
if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
@@ -1058,6 +1074,8 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma
folio_dup_file_rmap_ptes(folio, page, nr);
rss[mm_counter_file(folio)] += nr;
}
+   if (any_writable)
+   pte = pte_mkwrite(pte, src_vma);
__copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte,
addr, nr);
return nr;
-- 
2.43.0



[PATCH v3 14/15] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()

2024-01-29 Thread David Hildenbrand
Let's always ignore the accessed/young bit: we'll always mark the PTE
as old in our child process during fork, and upcoming users will
similarly not care.

Ignore the dirty bit only if we don't want to duplicate the dirty bit
into the child process during fork. Maybe, we could just set all PTEs
in the child dirty if any PTE is dirty. For now, let's keep the behavior
unchanged, this can be optimized later if required.

Ignore the soft-dirty bit only if the bit doesn't have any meaning in
the src vma, and similarly won't have any in the copied dst vma.

For now, we won't bother with the uffd-wp bit.

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 36 +++-
 1 file changed, 31 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 86f8a0021c8e..b2ec2b6b54c7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -953,24 +953,44 @@ static __always_inline void __copy_present_ptes(struct 
vm_area_struct *dst_vma,
set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
 }
 
+/* Flags for folio_pte_batch(). */
+typedef int __bitwise fpb_t;
+
+/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
+#define FPB_IGNORE_DIRTY   ((__force fpb_t)BIT(0))
+
+/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
+#define FPB_IGNORE_SOFT_DIRTY  ((__force fpb_t)BIT(1))
+
+static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
+{
+   if (flags & FPB_IGNORE_DIRTY)
+   pte = pte_mkclean(pte);
+   if (likely(flags & FPB_IGNORE_SOFT_DIRTY))
+   pte = pte_clear_soft_dirty(pte);
+   return pte_mkold(pte);
+}
+
 /*
  * Detect a PTE batch: consecutive (present) PTEs that map consecutive
  * pages of the same folio.
  *
- * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.
+ * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
+ * the accessed bit, dirty bit (with FPB_IGNORE_DIRTY) and soft-dirty bit
+ * (with FPB_IGNORE_SOFT_DIRTY).
  */
 static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
-   pte_t *start_ptep, pte_t pte, int max_nr)
+   pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags)
 {
unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
const pte_t *end_ptep = start_ptep + max_nr;
-   pte_t expected_pte = pte_next_pfn(pte);
+   pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte), 
flags);
pte_t *ptep = start_ptep + 1;
 
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
 
while (ptep != end_ptep) {
-   pte = ptep_get(ptep);
+   pte = __pte_batch_clear_ignored(ptep_get(ptep), flags);
 
if (!pte_same(pte, expected_pte))
break;
@@ -1004,6 +1024,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma
 {
struct page *page;
struct folio *folio;
+   fpb_t flags = 0;
int err, nr;
 
page = vm_normal_page(src_vma, addr, pte);
@@ -1018,7 +1039,12 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma
 * by keeping the batching logic separate.
 */
if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
-   nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr);
+   if (src_vma->vm_flags & VM_SHARED)
+   flags |= FPB_IGNORE_DIRTY;
+   if (!vma_soft_dirty_enabled(src_vma))
+   flags |= FPB_IGNORE_SOFT_DIRTY;
+
+   nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags);
folio_ref_add(folio, nr);
if (folio_test_anon(folio)) {
if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
-- 
2.43.0



[PATCH v3 12/15] mm/memory: pass PTE to copy_present_pte()

2024-01-29 Thread David Hildenbrand
We already read it, let's just forward it.

This patch is based on work by Ryan Roberts.

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index a3bdb25f4c8d..41b24da5be38 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -959,10 +959,9 @@ static inline void __copy_present_pte(struct 
vm_area_struct *dst_vma,
  */
 static inline int
 copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct 
*src_vma,
-pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
-struct folio **prealloc)
+pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr,
+int *rss, struct folio **prealloc)
 {
-   pte_t pte = ptep_get(src_pte);
struct page *page;
struct folio *folio;
 
@@ -1103,7 +1102,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma,
}
/* copy_present_pte() will clear `*prealloc' if consumed */
ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
-  addr, rss, &prealloc);
+  ptent, addr, rss, &prealloc);
/*
 * If we need a pre-allocated page for this pte, drop the
 * locks, allocate, and try again.
-- 
2.43.0



[PATCH v3 11/15] mm/memory: factor out copying the actual PTE in copy_present_pte()

2024-01-29 Thread David Hildenbrand
Let's prepare for further changes.

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 63 -
 1 file changed, 33 insertions(+), 30 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8d14ba440929..a3bdb25f4c8d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -930,6 +930,29 @@ copy_present_page(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma
return 0;
 }
 
+static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
+   struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte,
+   pte_t pte, unsigned long addr)
+{
+   struct mm_struct *src_mm = src_vma->vm_mm;
+
+   /* If it's a COW mapping, write protect it both processes. */
+   if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) {
+   ptep_set_wrprotect(src_mm, addr, src_pte);
+   pte = pte_wrprotect(pte);
+   }
+
+   /* If it's a shared mapping, mark it clean in the child. */
+   if (src_vma->vm_flags & VM_SHARED)
+   pte = pte_mkclean(pte);
+   pte = pte_mkold(pte);
+
+   if (!userfaultfd_wp(dst_vma))
+   pte = pte_clear_uffd_wp(pte);
+
+   set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
+}
+
 /*
  * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
  * is required to copy this pte.
@@ -939,23 +962,23 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma,
 pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
 struct folio **prealloc)
 {
-   struct mm_struct *src_mm = src_vma->vm_mm;
-   unsigned long vm_flags = src_vma->vm_flags;
pte_t pte = ptep_get(src_pte);
struct page *page;
struct folio *folio;
 
page = vm_normal_page(src_vma, addr, pte);
-   if (page)
-   folio = page_folio(page);
-   if (page && folio_test_anon(folio)) {
+   if (unlikely(!page))
+   goto copy_pte;
+
+   folio = page_folio(page);
+   folio_get(folio);
+   if (folio_test_anon(folio)) {
/*
 * If this page may have been pinned by the parent process,
 * copy the page immediately for the child so that we'll always
 * guarantee the pinned page won't be randomly replaced in the
 * future.
 */
-   folio_get(folio);
if (unlikely(folio_try_dup_anon_rmap_pte(folio, page, 
src_vma))) {
/* Page may be pinned, we have to copy. */
folio_put(folio);
@@ -963,34 +986,14 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma,
 addr, rss, prealloc, page);
}
rss[MM_ANONPAGES]++;
-   } else if (page) {
-   folio_get(folio);
+   VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio);
+   } else {
folio_dup_file_rmap_pte(folio, page);
rss[mm_counter_file(folio)]++;
}
 
-   /*
-* If it's a COW mapping, write protect it both
-* in the parent and the child
-*/
-   if (is_cow_mapping(vm_flags) && pte_write(pte)) {
-   ptep_set_wrprotect(src_mm, addr, src_pte);
-   pte = pte_wrprotect(pte);
-   }
-   VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));
-
-   /*
-* If it's a shared mapping, mark it clean in
-* the child
-*/
-   if (vm_flags & VM_SHARED)
-   pte = pte_mkclean(pte);
-   pte = pte_mkold(pte);
-
-   if (!userfaultfd_wp(dst_vma))
-   pte = pte_clear_uffd_wp(pte);
-
-   set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
+copy_pte:
+   __copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, pte, addr);
return 0;
 }
 
-- 
2.43.0



[PATCH v3 10/15] powerpc/mm: use pte_next_pfn() in set_ptes()

2024-01-29 Thread David Hildenbrand
Let's use our handy new helper. Note that the implementation is slightly
different, but shouldn't really make a difference in practice.

Reviewed-by: Christophe Leroy 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/mm/pgtable.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index a04ae4449a02..549a440ed7f6 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -220,10 +220,7 @@ void set_ptes(struct mm_struct *mm, unsigned long addr, 
pte_t *ptep,
break;
ptep++;
addr += PAGE_SIZE;
-   /*
-* increment the pfn.
-*/
-   pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
+   pte = pte_next_pfn(pte);
}
 }
 
-- 
2.43.0



[PATCH v3 09/15] arm/mm: use pte_next_pfn() in set_ptes()

2024-01-29 Thread David Hildenbrand
Let's use our handy helper now that it's available on all archs.

Signed-off-by: David Hildenbrand 
---
 arch/arm/mm/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 674ed71573a8..c24e29c0b9a4 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -1814,6 +1814,6 @@ void set_ptes(struct mm_struct *mm, unsigned long addr,
if (--nr == 0)
break;
ptep++;
-   pte_val(pteval) += PAGE_SIZE;
+   pteval = pte_next_pfn(pteval);
}
 }
-- 
2.43.0



[PATCH v3 08/15] mm/pgtable: make pte_next_pfn() independent of set_ptes()

2024-01-29 Thread David Hildenbrand
Let's provide pte_next_pfn(), independently of set_ptes(). This allows for
using the generic pte_next_pfn() version in some arch-specific set_ptes()
implementations, and prepares for reusing pte_next_pfn() in other context.

Reviewed-by: Christophe Leroy 
Signed-off-by: David Hildenbrand 
---
 include/linux/pgtable.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f6d0e3513948..351cd9dc7194 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,7 +212,6 @@ static inline int pmd_dirty(pmd_t pmd)
 #define arch_flush_lazy_mmu_mode() do {} while (0)
 #endif
 
-#ifndef set_ptes
 
 #ifndef pte_next_pfn
 static inline pte_t pte_next_pfn(pte_t pte)
@@ -221,6 +220,7 @@ static inline pte_t pte_next_pfn(pte_t pte)
 }
 #endif
 
+#ifndef set_ptes
 /**
  * set_ptes - Map consecutive pages to a contiguous range of addresses.
  * @mm: Address space to map the pages into.
-- 
2.43.0



[PATCH v3 07/15] sparc/pgtable: define PFN_PTE_SHIFT

2024-01-29 Thread David Hildenbrand
We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().
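
For reference, the generic pte_next_pfn() that wants PFN_PTE_SHIFT is
essentially "pte_val(pte) + (1UL << PFN_PTE_SHIFT)" wrapped back into a
pte_t. A tiny userspace sketch of the idea (the model_* names and values
are made up for illustration only):

#include <stdint.h>
#include <stdio.h>

#define MODEL_PFN_PTE_SHIFT	12	/* e.g. PAGE_SHIFT in this model */

typedef uint64_t model_pte_t;

/* Advance the pfn field by one page without touching the other bits. */
static model_pte_t model_pte_next_pfn(model_pte_t pte)
{
	return pte + (1ULL << MODEL_PFN_PTE_SHIFT);
}

int main(void)
{
	model_pte_t pte = 0x0000000000042063ULL;	/* pfn 0x42, low flag bits 0x063 */

	/* prints 0x43063: pfn advanced by one, flag bits preserved */
	printf("next: 0x%llx\n", (unsigned long long)model_pte_next_pfn(pte));
	return 0;
}

So each architecture only has to say where its pfn field starts; the
arithmetic itself stays generic.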

Signed-off-by: David Hildenbrand 
---
 arch/sparc/include/asm/pgtable_64.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index a8c871b7d786..652af9d63fa2 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -929,6 +929,8 @@ static inline void __set_pte_at(struct mm_struct *mm, 
unsigned long addr,
maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PAGE_SHIFT);
 }
 
+#define PFN_PTE_SHIFT  PAGE_SHIFT
+
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, unsigned int nr)
 {
-- 
2.43.0



[PATCH v3 06/15] s390/pgtable: define PFN_PTE_SHIFT

2024-01-29 Thread David Hildenbrand
We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand 
---
 arch/s390/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 1299b56e43f6..4b91e65c85d9 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1316,6 +1316,8 @@ pgprot_t pgprot_writecombine(pgprot_t prot);
 #define pgprot_writethroughpgprot_writethrough
 pgprot_t pgprot_writethrough(pgprot_t prot);
 
+#define PFN_PTE_SHIFT  PAGE_SHIFT
+
 /*
  * Set multiple PTEs to consecutive pages with a single call.  All PTEs
  * are within the same folio, PMD and VMA.
-- 
2.43.0



[PATCH v3 05/15] riscv/pgtable: define PFN_PTE_SHIFT

2024-01-29 Thread David Hildenbrand
We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Reviewed-by: Alexandre Ghiti 
Signed-off-by: David Hildenbrand 
---
 arch/riscv/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 0c94260b5d0c..add5cd30ab34 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -523,6 +523,8 @@ static inline void __set_pte_at(pte_t *ptep, pte_t pteval)
set_pte(ptep, pteval);
 }
 
+#define PFN_PTE_SHIFT  _PAGE_PFN_SHIFT
+
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pteval, unsigned int nr)
 {
-- 
2.43.0



[PATCH v3 04/15] powerpc/pgtable: define PFN_PTE_SHIFT

2024-01-29 Thread David Hildenbrand
We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Reviewed-by: Christophe Leroy 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index 9224f23065ff..7a1ba8889aea 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -41,6 +41,8 @@ struct mm_struct;
 
 #ifndef __ASSEMBLY__
 
+#define PFN_PTE_SHIFT  PTE_RPN_SHIFT
+
 void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
pte_t pte, unsigned int nr);
 #define set_ptes set_ptes
-- 
2.43.0



[PATCH v3 03/15] nios2/pgtable: define PFN_PTE_SHIFT

2024-01-29 Thread David Hildenbrand
We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand 
---
 arch/nios2/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index 5144506dfa69..d052dfcbe8d3 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -178,6 +178,8 @@ static inline void set_pte(pte_t *ptep, pte_t pteval)
*ptep = pteval;
 }
 
+#define PFN_PTE_SHIFT  0
+
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, unsigned int nr)
 {
-- 
2.43.0



[PATCH v3 02/15] arm/pgtable: define PFN_PTE_SHIFT

2024-01-29 Thread David Hildenbrand
We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand 
---
 arch/arm/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index d657b84b6bf7..be91e376df79 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -209,6 +209,8 @@ static inline void __sync_icache_dcache(pte_t pteval)
 extern void __sync_icache_dcache(pte_t pteval);
 #endif
 
+#define PFN_PTE_SHIFT  PAGE_SHIFT
+
 void set_ptes(struct mm_struct *mm, unsigned long addr,
  pte_t *ptep, pte_t pteval, unsigned int nr);
 #define set_ptes set_ptes
-- 
2.43.0



[PATCH v3 01/15] arm64/mm: Make set_ptes() robust when OAs cross 48-bit boundary

2024-01-29 Thread David Hildenbrand
From: Ryan Roberts 

Since the high bits [51:48] of an OA are not stored contiguously in the
PTE, there is a theoretical bug in set_ptes(), which just adds PAGE_SIZE
to the pte to get the pte with the next pfn. This works until the pfn
crosses the 48-bit boundary, at which point we overflow into the upper
attributes.

Of course one could argue (and Matthew Wilcox has :) that we will never
see a folio cross this boundary because we only allow naturally aligned
power-of-2 allocation, so this would require a half-petabyte folio. So
its only a theoretical bug. But its better that the code is robust
regardless.

I've implemented pte_next_pfn() as part of the fix, which is an opt-in
core-mm interface. So that is now available to the core-mm, which will
be needed shortly to support forthcoming fork()-batching optimizations.
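
The overflow is easy to reproduce in a standalone arithmetic model (the
layout below is a simplification, not the real arm64 descriptor format:
it merely assumes the pfn field ends at bit 47 and treats bit 48 as an
attribute bit):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1ULL << PAGE_SHIFT)
/* Simplified model: OA bits [47:12] sit in pte bits [47:12], bit 48 is
 * an attribute bit that must not change. */
#define MODEL_ATTR_BIT	(1ULL << 48)

int main(void)
{
	/* pfn field at its maximum value, attribute bit clear */
	uint64_t pte = 0x0000fffffffff000ULL;
	uint64_t naive = pte + PAGE_SIZE;	/* the old "+= PAGE_SIZE" step */

	printf("attribute bit flipped: %s\n",
	       (naive & MODEL_ATTR_BIT) ? "yes" : "no");	/* prints "yes" */
	return 0;
}

pte_next_pfn() avoids this by going through pte_pfn()/pfn_pte() with
pte_pgprot(), so the pfn is incremented as a pfn rather than as raw pte
bits.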

Link: https://lkml.kernel.org/r/20240125173534.1659317-1-ryan.robe...@arm.com
Fixes: 4a169d61c2ed ("arm64: implement the new page table range API")
Closes: 
https://lore.kernel.org/linux-mm/fdaeb9a5-d890-499a-92c8-d171df43a...@arm.com/
Signed-off-by: Ryan Roberts 
Reviewed-by: Catalin Marinas 
Reviewed-by: David Hildenbrand 
Signed-off-by: David Hildenbrand 
---
 arch/arm64/include/asm/pgtable.h | 28 +---
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b50270107e2f..9428801c1040 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -341,6 +341,22 @@ static inline void __sync_cache_and_tags(pte_t pte, 
unsigned int nr_pages)
mte_sync_tags(pte, nr_pages);
 }
 
+/*
+ * Select all bits except the pfn
+ */
+static inline pgprot_t pte_pgprot(pte_t pte)
+{
+   unsigned long pfn = pte_pfn(pte);
+
+   return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
+}
+
+#define pte_next_pfn pte_next_pfn
+static inline pte_t pte_next_pfn(pte_t pte)
+{
+   return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
+}
+
 static inline void set_ptes(struct mm_struct *mm,
unsigned long __always_unused addr,
pte_t *ptep, pte_t pte, unsigned int nr)
@@ -354,7 +370,7 @@ static inline void set_ptes(struct mm_struct *mm,
if (--nr == 0)
break;
ptep++;
-   pte_val(pte) += PAGE_SIZE;
+   pte = pte_next_pfn(pte);
}
 }
 #define set_ptes set_ptes
@@ -433,16 +449,6 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
return clear_pte_bit(pte, __pgprot(PTE_SWP_EXCLUSIVE));
 }
 
-/*
- * Select all bits except the pfn
- */
-static inline pgprot_t pte_pgprot(pte_t pte)
-{
-   unsigned long pfn = pte_pfn(pte);
-
-   return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
-}
-
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * See the comment in include/linux/pgtable.h
-- 
2.43.0



[PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-29 Thread David Hildenbrand
Now that the rmap overhaul[1] is upstream that provides a clean interface
for rmap batching, let's implement PTE batching during fork when processing
PTE-mapped THPs.

This series is partially based on Ryan's previous work[2] to implement
cont-pte support on arm64, but it's a complete rewrite based on [1] to
optimize all architectures independently of any such PTE bits, and to
use the new rmap batching functions that simplify the code and prepare
for further rmap accounting changes.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch and (c) perform batch PTE setting/updates.
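
To make the batching step concrete, here is a minimal userspace model of
the batch detection loop (all names and the bit layout below are made up
for illustration; the real folio_pte_batch() in this series additionally
checks pte_present(), stops at the folio boundary and normalizes the
ignored PTE bits before comparing):

#include <stdint.h>
#include <stdio.h>

/* Model a pte as a u64: the low bits hold the pfn, the high bits hold
 * flag bits that must be identical across the whole batch. */
static uint64_t model_pte_next_pfn(uint64_t pte)
{
	return pte + 1;		/* pfn lives in the low bits of this model */
}

static int model_pte_batch(const uint64_t *ptep, int max_nr)
{
	uint64_t expected = model_pte_next_pfn(ptep[0]);
	int nr = 1;

	/* Consecutive pfns with identical flag bits extend the batch. */
	while (nr < max_nr && ptep[nr] == expected) {
		expected = model_pte_next_pfn(expected);
		nr++;
	}
	return nr;
}

int main(void)
{
	/* Three consecutive pfns sharing the same flag bits, then a gap. */
	uint64_t ptes[] = { 0x8000000000000100ULL, 0x8000000000000101ULL,
			    0x8000000000000102ULL, 0x8000000000000200ULL };

	printf("batch length: %d\n", model_pte_batch(ptes, 4));	/* prints 3 */
	return 0;
}

The refcount and rmap updates can then be issued once with the resulting
"nr" instead of once per PTE.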

While this series should be beneficial for adding cont-pte support on
ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
for large folios with minimal added overhead and further changes[4] that
build up on top of the total mapcount.

Independent of all that, this series results in a speedup during fork with
PTE-mapped THP, which is the default with THPs that are smaller than a PMD
(for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
of the same size (stddev < 1%) results in the following runtimes
for fork() (shorter is better):

Folio Size |  v6.8-rc1 |      New | Change
-----------|-----------|----------|-------
      4KiB |  0.014328 | 0.014035 |   - 2%
     16KiB |  0.014263 | 0.01196  |   -16%
     32KiB |  0.014334 | 0.01094  |   -24%
     64KiB |  0.014046 | 0.010444 |   -26%
    128KiB |  0.014011 | 0.010063 |   -28%
    256KiB |  0.013993 | 0.009938 |   -29%
    512KiB |  0.013983 | 0.00985  |   -30%
   1024KiB |  0.013986 | 0.00982  |   -30%
   2048KiB |  0.014305 | 0.010076 |   -30%

Note that these numbers are even better than the ones from v1 (verified
over multiple reboots), even though there were only minimal code changes.
Well, I removed a pte_mkclean() call for anon folios, maybe that also
plays a role.

But my experience is that fork() is extremely sensitive to code size,
inlining, ... so I suspect that on other architectures we'll rather see a
change of -20% instead of -30%, and it will be easy to "lose" some of that
speedup
in the future by subtle code changes.

Next up is PTE batching when unmapping. Only tested on x86-64.
Compile-tested on most other architectures.

v2 -> v3:
 * Rebased on mm-unstable
 * Picked up RB's
 * Updated documentation of wrprotect_ptes().

v1 -> v2:
 * "arm64/mm: Make set_ptes() robust when OAs cross 48-bit boundary"
  -> Added patch from Ryan
 * "arm/pgtable: define PFN_PTE_SHIFT"
  -> Removed the arm64 bits
 * "mm/pgtable: make pte_next_pfn() independent of set_ptes()"
 * "arm/mm: use pte_next_pfn() in set_ptes()"
 * "powerpc/mm: use pte_next_pfn() in set_ptes()"
  -> Added to use pte_next_pfn() in some arch set_ptes() implementations
 I tried to make use of pte_next_pfn() also in the others, but it's
 not trivial because the other archs implement set_ptes() in their
 asm/pgtable.h. Future work.
 * "mm/memory: factor out copying the actual PTE in copy_present_pte()"
  -> Move common folio_get() out of if/else
 * "mm/memory: optimize fork() with PTE-mapped THP"
  -> Add doc for wrprotect_ptes
  -> Extend description to mention handling of pinned folios
  -> Move common folio_ref_add() out of if/else
 * "mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()"
  -> Be more conservative with dirty/soft-dirty, let the caller specify
 using flags

[1] https://lkml.kernel.org/r/20231220224504.646757-1-da...@redhat.com
[2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.robe...@arm.com
[3] https://lkml.kernel.org/r/20230809083256.699513-1-da...@redhat.com
[4] https://lkml.kernel.org/r/20231124132626.235350-1-da...@redhat.com
[5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.robe...@arm.com

Cc: Andrew Morton 
Cc: Matthew Wilcox (Oracle) 
Cc: Ryan Roberts 
Cc: Russell King 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Dinh Nguyen 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: "Aneesh Kumar K.V" 
Cc: "Naveen N. Rao" 
Cc: Paul Walmsley 
Cc: Palmer Dabbelt 
Cc: Albert Ou 
Cc: Alexander Gordeev 
Cc: Gerald Schaefer 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: Sven Schnelle 
Cc: "David S. Miller" 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ri...@lists.infradead.org
Cc: linux-s...@vger.kernel.org
Cc: sparcli...@vger.kernel.org

---

Andrew asked for a resend based on latest mm-unstable. I am sending this
out earlier than I would usually have sent out the next version, so we can
pull it into mm-unstable again now that v1 was dropped.

David Hildenbrand (14):
  arm/pgtable: define PFN_PTE_SHIFT
  nios2/pgtable: define PFN_PTE_SHIFT
  powerpc/pgtable: define PFN_PTE_SHIFT
  riscv/pgtable: define PFN_P

[PATCH] perf/pmu-events/powerpc: Update json mapfile with Power11 PVR

2024-01-29 Thread Madhavan Srinivasan
Update the json mapfile with the Power11 PVR to enable
json events. Power11 is PowerISA v3.1 compliant
and supports Power10 events.
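
The new row can be sanity-checked outside of perf with a small regex
test (the PVR string below is a hypothetical Power11 value, and the ^/$
anchors are only added for the standalone check):

#include <regex.h>
#include <stdio.h>

int main(void)
{
	regex_t re;
	const char *pattern = "^0x0082[[:xdigit:]]{4}$";	/* PVR field of the new row */
	const char *pvr = "0x00820100";				/* hypothetical PVR string */

	if (regcomp(&re, pattern, REG_EXTENDED))
		return 1;
	printf("%s %s\n", pvr,
	       regexec(&re, pvr, 0, NULL, 0) == 0 ? "matches" : "does not match");
	regfree(&re);
	return 0;
}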

Signed-off-by: Madhavan Srinivasan 
---
 tools/perf/pmu-events/arch/powerpc/mapfile.csv | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/pmu-events/arch/powerpc/mapfile.csv 
b/tools/perf/pmu-events/arch/powerpc/mapfile.csv
index 599a588dbeb4..4d5e9138d4cc 100644
--- a/tools/perf/pmu-events/arch/powerpc/mapfile.csv
+++ b/tools/perf/pmu-events/arch/powerpc/mapfile.csv
@@ -15,3 +15,4 @@
 0x0066[[:xdigit:]]{4},1,power8,core
 0x004e[[:xdigit:]]{4},1,power9,core
 0x0080[[:xdigit:]]{4},1,power10,core
+0x0082[[:xdigit:]]{4},1,power10,core
-- 
2.43.0



Re: [PATCH v2 14/15] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()

2024-01-29 Thread Ryan Roberts
On 25/01/2024 19:32, David Hildenbrand wrote:
> Let's always ignore the accessed/young bit: we'll always mark the PTE
> as old in our child process during fork, and upcoming users will
> similarly not care.
> 
> Ignore the dirty bit only if we don't want to duplicate the dirty bit
> into the child process during fork. Maybe, we could just set all PTEs
> in the child dirty if any PTE is dirty. For now, let's keep the behavior
> unchanged, this can be optimized later if required.
> 
> Ignore the soft-dirty bit only if the bit doesn't have any meaning in
> the src vma, and similarly won't have any in the copied dst vma.
> 
> For now, we won't bother with the uffd-wp bit.
> 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Ryan Roberts 

> ---
>  mm/memory.c | 36 +++-
>  1 file changed, 31 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 4d1be89a01ee0..b3f035fe54c8d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -953,24 +953,44 @@ static __always_inline void __copy_present_ptes(struct 
> vm_area_struct *dst_vma,
>   set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
>  }
>  
> +/* Flags for folio_pte_batch(). */
> +typedef int __bitwise fpb_t;
> +
> +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> +#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
> +
> +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> +#define FPB_IGNORE_SOFT_DIRTY((__force fpb_t)BIT(1))
> +
> +static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> +{
> + if (flags & FPB_IGNORE_DIRTY)
> + pte = pte_mkclean(pte);
> + if (likely(flags & FPB_IGNORE_SOFT_DIRTY))
> + pte = pte_clear_soft_dirty(pte);
> + return pte_mkold(pte);
> +}
> +
>  /*
>   * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>   * pages of the same folio.
>   *
> - * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.
> + * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
> + * the accessed bit, dirty bit (with FPB_IGNORE_DIRTY) and soft-dirty bit
> + * (with FPB_IGNORE_SOFT_DIRTY).
>   */
>  static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> - pte_t *start_ptep, pte_t pte, int max_nr)
> + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags)
>  {
>   unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>   const pte_t *end_ptep = start_ptep + max_nr;
> - pte_t expected_pte = pte_next_pfn(pte);
> + pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte), 
> flags);
>   pte_t *ptep = start_ptep + 1;
>  
>   VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>  
>   while (ptep != end_ptep) {
> - pte = ptep_get(ptep);
> + pte = __pte_batch_clear_ignored(ptep_get(ptep), flags);
>  
>   if (!pte_same(pte, expected_pte))
>   break;
> @@ -1004,6 +1024,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, 
> struct vm_area_struct *src_vma
>  {
>   struct page *page;
>   struct folio *folio;
> + fpb_t flags = 0;
>   int err, nr;
>  
>   page = vm_normal_page(src_vma, addr, pte);
> @@ -1018,7 +1039,12 @@ copy_present_ptes(struct vm_area_struct *dst_vma, 
> struct vm_area_struct *src_vma
>* by keeping the batching logic separate.
>*/
>   if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
> - nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr);
> + if (src_vma->vm_flags & VM_SHARED)
> + flags |= FPB_IGNORE_DIRTY;
> + if (!vma_soft_dirty_enabled(src_vma))
> + flags |= FPB_IGNORE_SOFT_DIRTY;
> +
> + nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags);
>   folio_ref_add(folio, nr);
>   if (folio_test_anon(folio)) {
>   if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,



Re: Re: [PATCH] KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat'

2024-01-29 Thread Amit Machhiwal
Hi Aneesh,

Thanks for looking into the patch. My comments are inline below.

On 2024/01/24 01:06 PM, Aneesh Kumar K.V wrote:
> Amit Machhiwal  writes:
> 
> > Currently, rebooting a pseries nested qemu-kvm guest (L2) results in
> > below error as L1 qemu sends PVR value 'arch_compat' == 0 via
> > ppc_set_compat ioctl. This triggers a condition failure in
> > kvmppc_set_arch_compat() resulting in an EINVAL.
> >
> > qemu-system-ppc64: Unable to set CPU compatibility mode in KVM: Invalid
> >
> > This patch updates kvmppc_set_arch_compat() to use the host PVR value if
> > 'compat_pvr' == 0 indicating that qemu doesn't want to enforce any
> > specific PVR compat mode.
> >
> > Signed-off-by: Amit Machhiwal 
> > ---
> >  arch/powerpc/kvm/book3s_hv.c  |  2 +-
> >  arch/powerpc/kvm/book3s_hv_nestedv2.c | 12 ++--
> >  2 files changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 1ed6ec140701..9573d7f4764a 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -439,7 +439,7 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu 
> > *vcpu, u32 arch_compat)
> > if (guest_pcr_bit > host_pcr_bit)
> > return -EINVAL;
> >  
> > -   if (kvmhv_on_pseries() && kvmhv_is_nestedv2()) {
> > +   if (kvmhv_on_pseries() && kvmhv_is_nestedv2() && arch_compat) {
> > if (!(cap & nested_capabilities))
> > return -EINVAL;
> > }
> >
> 
> Instead of that arch_compat check, would it better to do
> 
>   if (kvmhv_on_pseries() && kvmhv_is_nestedv2()) {
>   if (cap && !(cap & nested_capabilities))
>   return -EINVAL;
>   }
> 
> ie, if a capability is requested, then check against nested_capbilites
> to see if the capability exist.

The above condition check will cause problems when we try to boot a
machine below Power9.

For example, if we pass arch_compat == PVR_ARCH_207, cap will remain 0,
making the above check evaluate to false. Consequently, we would never
return -EINVAL in that case, and the arch compatibility request would
succeed even though it isn't supported for a nested PAPR guest.
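
To spell out the difference between the two checks, a tiny userspace
model (the constants are stand-ins, not real PVR or capability values;
check_v1() mirrors the hunk in this patch, check_v2() the suggested
"cap && ..." variant):

#include <stdio.h>

#define MODEL_NESTED_CAPS	0x2	/* pretend only one capability is advertised */

/* Hunk in this patch: only skip the capability test when arch_compat == 0. */
static int check_v1(unsigned int arch_compat, unsigned int cap)
{
	if (arch_compat && !(cap & MODEL_NESTED_CAPS))
		return -22;	/* -EINVAL */
	return 0;
}

/* Suggested variant: test only when a capability bit was derived. */
static int check_v2(unsigned int arch_compat, unsigned int cap)
{
	(void)arch_compat;
	if (cap && !(cap & MODEL_NESTED_CAPS))
		return -22;
	return 0;
}

int main(void)
{
	/* A PVR_ARCH_207-like request: arch_compat != 0 but no capability bit derived. */
	printf("v1: %d, v2: %d\n", check_v1(1, 0), check_v2(1, 0));
	/* prints "v1: -22, v2: 0" -- only v1 rejects the unsupported request */
	return 0;
}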

> 
> 
> > diff --git a/arch/powerpc/kvm/book3s_hv_nestedv2.c 
> > b/arch/powerpc/kvm/book3s_hv_nestedv2.c
> > index fd3c4f2d9480..069a1fcfd782 100644
> > --- a/arch/powerpc/kvm/book3s_hv_nestedv2.c
> > +++ b/arch/powerpc/kvm/book3s_hv_nestedv2.c
> > @@ -138,6 +138,7 @@ static int gs_msg_ops_vcpu_fill_info(struct 
> > kvmppc_gs_buff *gsb,
> > vector128 v;
> > int rc, i;
> > u16 iden;
> > +   u32 arch_compat = 0;
> >  
> > vcpu = gsm->data;
> >  
> > @@ -347,8 +348,15 @@ static int gs_msg_ops_vcpu_fill_info(struct 
> > kvmppc_gs_buff *gsb,
> > break;
> > }
> > case KVMPPC_GSID_LOGICAL_PVR:
> > -   rc = kvmppc_gse_put_u32(gsb, iden,
> > -   vcpu->arch.vcore->arch_compat);
> > +   if (!vcpu->arch.vcore->arch_compat) {
> > +   if (cpu_has_feature(CPU_FTR_ARCH_31))
> > +   arch_compat = PVR_ARCH_31;
> > +   else if (cpu_has_feature(CPU_FTR_ARCH_300))
> > +   arch_compat = PVR_ARCH_300;
> > +   } else {
> > +   arch_compat = vcpu->arch.vcore->arch_compat;
> > +   }
> > +   rc = kvmppc_gse_put_u32(gsb, iden, arch_compat);
> >
> 
> Won't a arch_compat = 0 work here?. ie, where you observing the -EINVAL from
> the first hunk or does this hunk have an impact? 
>

No, an arch_compat == 0 won't work in nested API v2. That's because the
guest-wide PVR cannot be 0, and if arch_compat == 0, then a supported
host PVR value should be passed instead.

If we were to skip this hunk (keeping arch_compat == 0), a system
reboot of the L2 guest would fail and result in a kernel trap as below:

[   22.106360] reboot: Restarting system
KVM: unknown exit, hardware reason ffea
NIP 0100   LR fe44 CTR  XER 
20040092 CPU#0
MSR 1000 HID0   HF 6c00 iidx 3 didx 3
TB   DECR 0
GPR00   c2a8c300 7fe0
GPR04   1002 82803033
GPR08 0a00  0004 2fff
GPR12  c2e1 000105639200 0004
GPR16  00010563a090  
GPR20 000105639e20 0001056399c8 7fffe54abab0 000105639288
GPR24  0001 0001 
GPR28   c2b30840 
CR   [ -  -  -  -  -  -  -  -  ] RES 000@
 SRR0   

Re: [PATCH v2 13/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-29 Thread Ryan Roberts
On 25/01/2024 19:32, David Hildenbrand wrote:
> Let's implement PTE batching when consecutive (present) PTEs map
> consecutive pages of the same large folio, and all other PTE bits besides
> the PFNs are equal.
> 
> We will optimize folio_pte_batch() separately, to ignore selected
> PTE bits. This patch is based on work by Ryan Roberts.
> 
> Use __always_inline for __copy_present_ptes() and keep the handling for
> single PTEs completely separate from the multi-PTE case: we really want
> the compiler to optimize for the single-PTE case with small folios, to
> not degrade performance.
> 
> Note that PTE batching will never exceed a single page table and will
> always stay within VMA boundaries.
> 
> Further, processing PTE-mapped THP that maybe pinned and have
> PageAnonExclusive set on at least one subpage should work as expected,
> but there is room for improvement: We will repeatedly (1) detect a PTE
> batch (2) detect that we have to copy a page (3) fall back and allocate a
> single page to copy a single page. For now we won't care as pinned pages
> are a corner case, and we should rather look into maintaining only a
> single PageAnonExclusive bit for large folios.
> 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Ryan Roberts 

> ---
>  include/linux/pgtable.h |  31 +++
>  mm/memory.c | 112 +---
>  2 files changed, 124 insertions(+), 19 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 351cd9dc7194f..891ed246978a4 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -650,6 +650,37 @@ static inline void ptep_set_wrprotect(struct mm_struct 
> *mm, unsigned long addres
>  }
>  #endif
>  
> +#ifndef wrprotect_ptes
> +/**
> + * wrprotect_ptes - Write-protect consecutive pages that are mapped to a
> + *   contiguous range of addresses.
> + * @mm: Address space to map the pages into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of pages to write-protect.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_set_wrprotect().
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For 
> example,
> + * some PTEs might already be write-protected.
> + *
> + * Context: The caller holds the page table lock.  The pages all belong
> + * to the same folio.  The PTEs are all in the same PMD.
> + */
> +static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, unsigned int nr)
> +{
> + for (;;) {
> + ptep_set_wrprotect(mm, addr, ptep);
> + if (--nr == 0)
> + break;
> + ptep++;
> + addr += PAGE_SIZE;
> + }
> +}
> +#endif
> +
>  /*
>   * On some architectures hardware does not set page access bit when accessing
>   * memory page, it is responsibility of software setting this bit. It brings
> diff --git a/mm/memory.c b/mm/memory.c
> index 729ca4d6a820c..4d1be89a01ee0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -930,15 +930,15 @@ copy_present_page(struct vm_area_struct *dst_vma, 
> struct vm_area_struct *src_vma
>   return 0;
>  }
>  
> -static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
> +static __always_inline void __copy_present_ptes(struct vm_area_struct 
> *dst_vma,
>   struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte,
> - pte_t pte, unsigned long addr)
> + pte_t pte, unsigned long addr, int nr)
>  {
>   struct mm_struct *src_mm = src_vma->vm_mm;
>  
>   /* If it's a COW mapping, write protect it both processes. */
>   if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) {
> - ptep_set_wrprotect(src_mm, addr, src_pte);
> + wrprotect_ptes(src_mm, addr, src_pte, nr);
>   pte = pte_wrprotect(pte);
>   }
>  
> @@ -950,26 +950,93 @@ static inline void __copy_present_pte(struct 
> vm_area_struct *dst_vma,
>   if (!userfaultfd_wp(dst_vma))
>   pte = pte_clear_uffd_wp(pte);
>  
> - set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
> + set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
> +}
> +
> +/*
> + * Detect a PTE batch: consecutive (present) PTEs that map consecutive
> + * pages of the same folio.
> + *
> + * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.
> + */
> +static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> + pte_t *start_ptep, pte_t pte, int max_nr)
> +{
> + unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
> + const pte_t *end_ptep = start_ptep + max_nr;
> + pte_t expected_pte = pte_next_pfn(pte);
> + pte_t *ptep = start_ptep + 1;
> +
> + VM_WARN_ON_FOLIO(!pte_present(pte), folio);
> +
> + while (ptep != end_ptep) {
> + pte = ptep_get(ptep

Re: [PATCH 5/5] sched/vtime: do not include header

2024-01-29 Thread Heiko Carstens
On Sun, Jan 28, 2024 at 08:58:54PM +0100, Alexander Gordeev wrote:
> There is no architecture-specific code or data left
> that generic  needs to know about.
> Thus, avoid the inclusion of  header.
> 
> Signed-off-by: Alexander Gordeev 
> ---
>  include/asm-generic/vtime.h | 1 -
>  include/linux/vtime.h   | 4 
>  2 files changed, 5 deletions(-)
>  delete mode 100644 include/asm-generic/vtime.h

I guess you need to get rid of this as well:

arch/powerpc/include/asm/Kbuild:generic-y += vtime.h


Re: [PATCH 4/5] s390/irq,nmi: do not include header

2024-01-29 Thread Heiko Carstens
On Sun, Jan 28, 2024 at 08:58:53PM +0100, Alexander Gordeev wrote:
> update_timer_sys() and update_timer_mcck() are inlines used for
> CPU time accounting from the interrupt and machine-check handlers.
> These routines are specific to s390 architecture, but declared
> via  header, which in turn inludes .
> Avoid the extra loop and include  header directly.
> 
> Signed-off-by: Alexander Gordeev 
> ---
>  arch/s390/kernel/irq.c | 1 +
>  arch/s390/kernel/nmi.c | 1 +
>  2 files changed, 2 insertions(+)
...
> +++ b/arch/s390/kernel/irq.c
> +#include 
...
> +++ b/arch/s390/kernel/nmi.c
> +#include 

It is confusing when the patch subject is "do not include.." and all this
patch does is add two includes. I see what this is
doing: getting rid of the implicit include of asm/vtime.h most likely
via linux/hardirq.h, but that's not very obvious.

Anyway:
Acked-by: Heiko Carstens 


Re: [PATCH 3/5] s390/vtime: remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover

2024-01-29 Thread Heiko Carstens
On Sun, Jan 28, 2024 at 08:58:52PM +0100, Alexander Gordeev wrote:
> __ARCH_HAS_VTIME_TASK_SWITCH macro is not used anymore.
> 
> Signed-off-by: Alexander Gordeev 
> ---
>  arch/s390/include/asm/vtime.h | 2 --
>  1 file changed, 2 deletions(-)

Acked-by: Heiko Carstens 


Re: [PATCH] mm/debug_vm_pgtable: Fix BUG_ON with pud advanced test

2024-01-29 Thread Aneesh Kumar K.V
On 1/29/24 12:23 PM, Anshuman Khandual wrote:
> 
> 
> On 1/29/24 11:56, Aneesh Kumar K.V wrote:
>> On 1/29/24 11:52 AM, Anshuman Khandual wrote:
>>>
>>>
>>> On 1/29/24 11:30, Aneesh Kumar K.V (IBM) wrote:
 Architectures like powerpc add debug checks to ensure we find only devmap
 PUD pte entries. These debug checks are only done with CONFIG_DEBUG_VM.
 This patch marks the ptes used for PUD advanced test devmap pte entries
 so that we don't hit on debug checks on architecture like ppc64 as
 below.

 WARNING: CPU: 2 PID: 1 at arch/powerpc/mm/book3s64/radix_pgtable.c:1382 
 radix__pud_hugepage_update+0x38/0x138
 
 NIP [c00a7004] radix__pud_hugepage_update+0x38/0x138
 LR [c00a77a8] radix__pudp_huge_get_and_clear+0x28/0x60
 Call Trace:
 [c4a2f950] [c4a2f9a0] 0xc4a2f9a0 (unreliable)
 [c4a2f980] [000d34c1] 0xd34c1
 [c4a2f9a0] [c206ba98] pud_advanced_tests+0x118/0x334
 [c4a2fa40] [c206db34] debug_vm_pgtable+0xcbc/0x1c48
 [c4a2fc10] [c000fd28] do_one_initcall+0x60/0x388

 Also

  kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:202!
  

  NIP [c0096510] pudp_huge_get_and_clear_full+0x98/0x174
  LR [c206bb34] pud_advanced_tests+0x1b4/0x334
  Call Trace:
  [c4a2f950] [000d34c1] 0xd34c1 (unreliable)
  [c4a2f9a0] [c206bb34] pud_advanced_tests+0x1b4/0x334
  [c4a2fa40] [c206db34] debug_vm_pgtable+0xcbc/0x1c48
  [c4a2fc10] [c000fd28] do_one_initcall+0x60/0x388

 Fixes: 27af67f35631 ("powerpc/book3s64/mm: enable transparent pud 
 hugepage")
 Signed-off-by: Aneesh Kumar K.V (IBM) 
 ---
  mm/debug_vm_pgtable.c | 8 
  1 file changed, 8 insertions(+)

 diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
 index 5662e29fe253..65c19025da3d 100644
 --- a/mm/debug_vm_pgtable.c
 +++ b/mm/debug_vm_pgtable.c
 @@ -362,6 +362,12 @@ static void __init pud_advanced_tests(struct 
 pgtable_debug_args *args)
vaddr &= HPAGE_PUD_MASK;
  
pud = pfn_pud(args->pud_pfn, args->page_prot);
 +  /*
 +   * Some architectures have debug checks to make sure
 +   * huge pud mapping are only found with devmap entries
 +   * For now test with only devmap entries.
 +   */
>>> Do you see this behaviour to be changed in powerpc anytime soon ? Otherwise
>>> these pud_mkdevmap() based work arounds, might be required to stick around
>>> for longer just to prevent powerpc specific triggers. Given PUD transparent
>>> huge pages i.e HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD are just supported on x86
>>> and powerpc platforms, could not this problem be solved in a more uniform
>>> manner.
>>>
>>
>>
>> IIUC pud level transparent hugepages are only supported with devmap entries 
>> even
>> on x86. We don't do anonymous pud hugepage.
> 
> There are some 'pud_trans_huge(orig_pud) || pud_devmap(orig_pud)' checks in
> core paths i.e in mm/memory.c which might suggest pud_trans_huge() to exist
> without also being a devmap. I might be missing something here, but on x86
> platform following helpers suggest pud_trans_huge() to exist without being
> a devmap as well.
> 
> #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> static inline int pud_trans_huge(pud_t pud)
> {
> return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
> }
> #endif
> 
> #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> static inline int pud_devmap(pud_t pud)
> {
> return !!(pud_val(pud) & _PAGE_DEVMAP);
> }
> #else
> static inline int pud_devmap(pud_t pud)
> {
> return 0;
> }
> #endif
> 
> We might need some more clarity on this regarding x86 platform's pud huge
> page implementation.
> 

static vm_fault_t create_huge_pud(struct vm_fault *vmf)
{
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
struct vm_area_struct *vma = vmf->vma;
/* No support for anonymous transparent PUD pages yet */
if (vma_is_anonymous(vma))
return VM_FAULT_FALLBACK;
if (vma->vm_ops->huge_fault)
return vma->vm_ops->huge_fault(vmf, PUD_ORDER);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
return VM_FAULT_FALLBACK;
}



-aneesh