Re: [PATCH v4 RESEND] powerpc: Replace kretprobe code with rethook on powerpc

2024-09-08 Thread Google
On Fri, 06 Sep 2024 21:52:52 +1000
Michael Ellerman  wrote:

> On Fri, 30 Aug 2024 07:31:31 -0400, Abhishek Dubey wrote:
> > This is an adaptation of commit f3a112c0c40d ("x86,rethook,kprobes:
> > Replace kretprobe with rethook on x86") to powerpc.
> > 
> > Rethook follows the existing kretprobe implementation, but separates
> > it from kprobes so that it can be used by fprobe (ftrace-based
> > function entry/exit probes). As such, this patch also enables fprobe
> > to work on powerpc. The only other change compared to the existing
> > kretprobe implementation is doing the return address fixup in
> > arch_rethook_fixup_return().
> > 
> > [...]
> 
> Applied to powerpc/next.
> 
> [1/1] powerpc: Replace kretprobe code with rethook on powerpc
>   
> https://git.kernel.org/powerpc/c/19f1bc3fb55452739dd3d56cfd06c29ecdbe3e9f

Thanks, and sorry for the late reply, but I don't have any objection.

> 
> cheers


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 1/2] MAINTAINERS: Update email address of Naveen

2024-07-17 Thread Google
On Wed, 17 Jul 2024 13:58:35 +1000
Michael Ellerman  wrote:

> Masami Hiramatsu (Google)  writes:
> > Hi Naveen,
> >
> > On Sun, 14 Jul 2024 14:04:23 +0530
> > Naveen N Rao  wrote:
> >
> >> I have switched to using my @kernel.org id for my contributions. Update
> >> MAINTAINERS and mailmap to reflect the same.
> >> 
> >
> > Looks good to me. 
> >
> > Reviewed-by: Masami Hiramatsu (Google) 
> >
> > Would the powerpc maintainer pick this up?
> 
> Yeah I can take both.

Thank you for picking them up!

-- 
Masami Hiramatsu (Google) 


Re: [PATCH 1/2] MAINTAINERS: Update email address of Naveen

2024-07-16 Thread Google
Hi Naveen,

On Sun, 14 Jul 2024 14:04:23 +0530
Naveen N Rao  wrote:

> I have switched to using my @kernel.org id for my contributions. Update
> MAINTAINERS and mailmap to reflect the same.
> 

Looks good to me. 

Reviewed-by: Masami Hiramatsu (Google) 

Would the powerpc maintainer pick this up?

Thanks,


> Cc: Naveen N. Rao 
> Signed-off-by: Naveen N Rao 
> ---
>  .mailmap| 2 ++
>  MAINTAINERS | 6 +++---
>  2 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/.mailmap b/.mailmap
> index 81ac1e17ac3c..289011ebca00 100644
> --- a/.mailmap
> +++ b/.mailmap
> @@ -473,6 +473,8 @@ Nadia Yvette Chambers  William Lee 
> Irwin III   Naoya Horiguchi  
>  Naoya Horiguchi  
>  Nathan Chancellor  
> +Naveen N Rao  
> +Naveen N Rao  
>  Neeraj Upadhyay  
>  Neeraj Upadhyay  
>  Neil Armstrong  
> diff --git a/MAINTAINERS b/MAINTAINERS
> index fa32e3c035c2..05f14b67cd74 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3878,7 +3878,7 @@ S:  Odd Fixes
>  F:   drivers/net/ethernet/netronome/nfp/bpf/
>  
>  BPF JIT for POWERPC (32-BIT AND 64-BIT)
> -M:   Naveen N. Rao 
> +M:   Naveen N Rao 
>  M:   Michael Ellerman 
>  L:   b...@vger.kernel.org
>  S:   Supported
> @@ -12332,7 +12332,7 @@ F:mm/kmsan/
>  F:   scripts/Makefile.kmsan
>  
>  KPROBES
> -M:   Naveen N. Rao 
> +M:   Naveen N Rao 
>  M:   Anil S Keshavamurthy 
>  M:   "David S. Miller" 
>  M:   Masami Hiramatsu 
> @@ -12708,7 +12708,7 @@ LINUX FOR POWERPC (32-BIT AND 64-BIT)
>  M:   Michael Ellerman 
>  R:   Nicholas Piggin 
>  R:   Christophe Leroy 
> -R:   Naveen N. Rao 
> +R:   Naveen N Rao 
>  L:   linuxppc-dev@lists.ozlabs.org
>  S:   Supported
>  W:   https://github.com/linuxppc/wiki/wiki
> 
> base-commit: 582b0e554593e530b1386eacafee6c412c5673cc
> -- 
> 2.45.2
> 


-- 
Masami Hiramatsu (Google) 


Re: [PATCH 2/2] MAINTAINERS: Update powerpc BPF JIT maintainers

2024-07-16 Thread Google
Hi,

On Tue, 16 Jul 2024 12:36:11 +0530
Hari Bathini  wrote:

> 
> 
> On 14/07/24 2:04 pm, Naveen N Rao wrote:
> > Hari Bathini has been updating and maintaining the powerpc BPF JIT for
> > a while now. Christophe Leroy has been doing the same for 32-bit
> > powerpc. Add them as maintainers for the powerpc BPF JIT.
> > 
> > I am no longer actively looking into the powerpc BPF JIT. Change my role
> > to that of a reviewer so that I can help with the odd query.
> > 
> > Signed-off-by: Naveen N Rao 
> 
> Acked-by: Hari Bathini 

Reviewed-by: Masami Hiramatsu (Google) 

But this should go through the powerpc tree or the bpf tree.

Thank you,

> 
> > ---
> >   MAINTAINERS | 4 +++-
> >   1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 05f14b67cd74..c7a931ee7a2e 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -3878,8 +3878,10 @@ S:   Odd Fixes
> >   F:drivers/net/ethernet/netronome/nfp/bpf/
> >   
> >   BPF JIT for POWERPC (32-BIT AND 64-BIT)
> > -M: Naveen N Rao 
> >   M:Michael Ellerman 
> > +M: Hari Bathini 
> > +M: Christophe Leroy 
> > +R: Naveen N Rao 
> >   L:b...@vger.kernel.org
> >   S:Supported
> >   F:arch/powerpc/net/


-- 
Masami Hiramatsu (Google) 


Re: [PATCH v3] PowerPC: Replace kretprobe with rethook

2024-07-09 Thread Google
On Tue, 09 Jul 2024 12:28:29 +0530
Naveen N Rao  wrote:

> Masami Hiramatsu wrote:
> > On Thu, 27 Jun 2024 09:21:01 -0400
> > Abhishek Dubey  wrote:
> > 
> >> +/* rethook initializer */
> >> +int __init arch_init_kprobes(void)
> >> +{
> >> +  return register_kprobe(&trampoline_p);
> >> +}
> > 
> > No, please don't use arch_init_kprobes() for initializing rethook, since
> > rethook is used by fprobe too (at this moment).
> > 
> > If you want to make it rely on kprobes, you have to add a dependency
> > in powerpc's Kconfig, e.g.
> > 
> > select HAVE_RETHOOK if KPROBES
> > 
> > But I don't recommend it.
> 
> Given that kretprobe has always worked this way on powerpc, I think this
> is a fair tradeoff. We get to enable fprobes on powerpc only if kprobes
> is also enabled.
> 
> Longer term, it would certainly be nice to get rid of that probe, and to
> expand the trampoline to directly invoke the rethook callback.

OK. In the longer term, rethook will be used only for kretprobe, and kretprobe
will be replaced by fprobe [1]. So please add a comment noting that, with a
reference to [1].

[1] 
https://lore.kernel.org/all/172000134410.63468.1374887213469474.stgit@devnote2/
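
For example, something like this above arch_init_kprobes() (just a sketch):

	/*
	 * rethook is initialized via a kprobe on arch_rethook_trampoline,
	 * so rethook (and fprobe) currently depends on KPROBES on powerpc.
	 * Longer term, kretprobe will be replaced by fprobe [1] and the
	 * trampoline should invoke the rethook callback directly, removing
	 * the need for this kprobe.
	 */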

Thank you,

> 
> 
> Thanks,
> Naveen


-- 
Masami Hiramatsu (Google) 


Re: [PATCH v3] PowerPC: Replace kretprobe with rethook

2024-07-01 Thread Google
On Thu, 27 Jun 2024 09:21:01 -0400
Abhishek Dubey  wrote:

> +/* rethook initializer */
> +int __init arch_init_kprobes(void)
> +{
> + return register_kprobe(&trampoline_p);
> +}

No, please don't use arch_init_kprobes() for initializing rethook, since
rethook is used by fprobe too (at this moment).

If you want to make it rely on kprobes, you have to add a dependency
in powerpc's Kconfig, e.g.

select HAVE_RETHOOK if KPROBES

But I don't recommend it.

Thank you,

-- 
Masami Hiramatsu (Google) 


Re: [PATCH v2] PowerPC: Replace kretprobe with rethook

2024-06-17 Thread Google
s, orig_ret_address - 4);
> > +   regs->link = orig_ret_address;
> 
> I think we can move these into arch_rethook_fixup_return(), so 
> trampoline_rethook_handler() can simply invoke 
> rethook_trampoline_handler().
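
The fixup would then look something like this on powerpc (a sketch mirroring
the two quoted lines above; regs_set_return_ip() is the helper for updating
regs->nip):

	void arch_rethook_fixup_return(struct pt_regs *regs,
				       unsigned long orig_ret_address)
	{
		/*
		 * Set nip one instruction back, since the kprobe handler
		 * emulates that instruction and advances nip again, and set
		 * the link register so the 'blr' in the trampoline returns
		 * to the real return address.
		 */
		regs_set_return_ip(regs, orig_ret_address - 4);
		regs->link = orig_ret_address;
	}
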
> 
> > +
> > +   return 0;
> > +}
> > +NOKPROBE_SYMBOL(trampoline_rethook_handler);
> > +
> > +void arch_rethook_prepare(struct rethook_node *rh, struct pt_regs *regs, 
> > bool mcount)
> > +{
> > +   rh->ret_addr = regs->link;
> > +   rh->frame = 0;
> 
> There is additional code to validate our assumption with a frame pointer 
> set, so I think we should set this to regs->gpr[1].

Additional note: If this sets regs->gpr[1], pass it to
rethook_trampoline_handler() too, so that it can find the correct frame.
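
That is, something like (sketch):

	/* in arch_rethook_prepare() */
	rh->frame = regs->gpr[1];

	/* and in the trampoline handler */
	orig_ret_address = rethook_trampoline_handler(regs, regs->gpr[1]);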

BTW, it seems powerpc does not use the kretprobe/rethook shadow stack for
stack unwinding yet, is that right?

Thanks!

> 
> > +
> > +   /* Replace the return addr with trampoline addr */
> > +   regs->link = (unsigned long)arch_rethook_trampoline;
> > +}
> > +NOKPROBE_SYMBOL(arch_rethook_prepare);
> > +
> > +static struct kprobe trampoline_p = {
> > +   .addr = (kprobe_opcode_t *) &arch_rethook_trampoline,
> > +   .pre_handler = trampoline_rethook_handler
> > +};
> > +
> > +/* rethook initializer */
> > +int arch_init_kprobes(void)
> 
> Needs __init annotation.
> 
> > +{
> > +   return register_kprobe(&trampoline_p);
> > +}
> > diff --git a/arch/powerpc/kernel/stacktrace.c 
> > b/arch/powerpc/kernel/stacktrace.c
> > index e6a958a5da27..a31648b45956 100644
> > --- a/arch/powerpc/kernel/stacktrace.c
> > +++ b/arch/powerpc/kernel/stacktrace.c
> > @@ -21,6 +21,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  
> > @@ -133,14 +134,15 @@ int __no_sanitize_address 
> > arch_stack_walk_reliable(stack_trace_consume_fn consum
> >  * arch-dependent code, they are generic.
> >  */
> > ip = ftrace_graph_ret_addr(task, &graph_idx, ip, stack);
> > -#ifdef CONFIG_KPROBES
> > +
> > /*
> >  * Mark stacktraces with kretprobed functions on them
> >  * as unreliable.
> >  */
> > -   if (ip == (unsigned long)__kretprobe_trampoline)
> > -   return -EINVAL;
> > -#endif
> > +   #ifdef CONFIG_RETHOOK
> 
> The #ifdef usually starts at column 0, and there is no need to indent
> the code below.
> 
> - Naveen
> 
> > +   if (ip == (unsigned long)arch_rethook_trampoline)
> > +   return -EINVAL;
> > +   #endif
> >  
> > if (!consume_entry(cookie, ip))
> > return -EINVAL;
> > -- 
> > 2.44.0
> > 


-- 
Masami Hiramatsu (Google) 


Re: [PATCH] PowerPC: Replace kretprobe with rethook

2024-05-23 Thread Google
On Thu, 16 May 2024 09:46:46 -0400
Abhishek Dubey  wrote:

> This is an adaptation of commit f3a112c0c40d ("x86,rethook,kprobes:
> Replace kretprobe with rethook on x86") to Power.
> 
> This replaces the kretprobe code with rethook on Power. With this patch,
> kretprobe on Power uses rethook instead of the kretprobe-specific
> trampoline code.
> 
> Reference to other archs:
> commit b57c2f124098 ("riscv: add riscv rethook implementation")
> commit 7b0a096436c2 ("LoongArch: Replace kretprobe with rethook")
> 

Hi Abhishek,

Thanks for adapting rethook; it looks good. I have some comments below.

> diff --git a/arch/powerpc/kernel/stacktrace.c 
> b/arch/powerpc/kernel/stacktrace.c
> index e6a958a5da27..6de912cf198c 100644
> --- a/arch/powerpc/kernel/stacktrace.c
> +++ b/arch/powerpc/kernel/stacktrace.c
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -133,14 +134,13 @@ int __no_sanitize_address 
> arch_stack_walk_reliable(stack_trace_consume_fn consum
>* arch-dependent code, they are generic.
>*/
>   ip = ftrace_graph_ret_addr(task, &graph_idx, ip, stack);
> -#ifdef CONFIG_KPROBES

This still needs to check CONFIG_RETHOOK.
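
E.g. keep a guard like:

#ifdef CONFIG_RETHOOK
	if (ip == (unsigned long)arch_rethook_trampoline)
		return -EINVAL;
#endif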

> +
>   /*
>* Mark stacktraces with kretprobed functions on them
>* as unreliable.
>*/
> - if (ip == (unsigned long)__kretprobe_trampoline)
> + if (ip == (unsigned long)arch_rethook_trampoline)
>   return -EINVAL;
> -#endif
>  
>   if (!consume_entry(cookie, ip))
>   return -EINVAL;
> diff --git a/include/linux/rethook.h b/include/linux/rethook.h
> index ba60962805f6..9f2fb6abdc60 100644
> --- a/include/linux/rethook.h
> +++ b/include/linux/rethook.h
> @@ -65,7 +65,6 @@ void rethook_recycle(struct rethook_node *node);
>  void rethook_hook(struct rethook_node *node, struct pt_regs *regs, bool 
> mcount);
>  unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long 
> frame,
>   struct llist_node **cur);
> -

nit: this removes an unrelated blank line.

>  /* Arch dependent code must implement arch_* and trampoline code */
>  void arch_rethook_prepare(struct rethook_node *node, struct pt_regs *regs, 
> bool mcount);
>  void arch_rethook_trampoline(void);
> -- 
> 2.44.0
> 

Thank you,

-- 
Masami Hiramatsu (Google) 


Re: [PATCH v3] kprobe/ftrace: bail out if ftrace was killed

2024-05-15 Thread Google
On Wed, 15 May 2024 15:18:08 -0700
Stephen Brennan  wrote:

> Masami Hiramatsu (Google)  writes:
> > On Thu, 2 May 2024 01:35:16 +0800
> > Guo Ren  wrote:
> >
> >> On Thu, May 2, 2024 at 12:30 AM Stephen Brennan
> >>  wrote:
> >> >
> >> > If an error happens in ftrace, ftrace_kill() will prevent disarming
> >> > kprobes. Eventually, the ftrace_ops associated with the kprobes will be
> >> > freed, yet the kprobes will still be active, and when triggered, they
> >> > will use the freed memory, likely resulting in a page fault and panic.
> >> >
> >> > This behavior can be reproduced quite easily, by creating a kprobe and
> >> > then triggering a ftrace_kill(). For simplicity, we can simulate an
> >> > ftrace error with a kernel module like [1]:
> >> >
> >> > [1]: https://github.com/brenns10/kernel_stuff/tree/master/ftrace_killer
> >> >
> >> >   sudo perf probe --add commit_creds
> >> >   sudo perf trace -e probe:commit_creds
> >> >   # In another terminal
> >> >   make
> >> >   sudo insmod ftrace_killer.ko  # calls ftrace_kill(), simulating bug
> >> >   # Back to perf terminal
> >> >   # ctrl-c
> >> >   sudo perf probe --del commit_creds
> >> >
> >> > After a short period, a page fault and panic would occur as the kprobe
> >> > continues to execute and uses the freed ftrace_ops. While ftrace_kill()
> >> > is supposed to be used only in extreme circumstances, it is invoked in
> >> > FTRACE_WARN_ON() and so there are many places where an unexpected bug
> >> > could be triggered, yet the system may continue operating, possibly
> >> > without the administrator noticing. If ftrace_kill() does not panic the
> >> > system, then we should do everything we can to continue operating,
> >> > rather than leave a ticking time bomb.
> >> >
> >> > Signed-off-by: Stephen Brennan 
> >> > ---
> >> > Changes in v3:
> >> >   Don't expose ftrace_is_dead(). Create a "kprobe_ftrace_disabled"
> >> >   variable and check it directly in the kprobe handlers.
> >> > Link to v1/v2 discussion:
> >> >   
> >> > https://lore.kernel.org/all/20240426225834.993353-1-stephen.s.bren...@oracle.com/
> >> >
> >> >  arch/csky/kernel/probes/ftrace.c | 3 +++
> >> >  arch/loongarch/kernel/ftrace_dyn.c   | 3 +++
> >> >  arch/parisc/kernel/ftrace.c  | 3 +++
> >> >  arch/powerpc/kernel/kprobes-ftrace.c | 3 +++
> >> >  arch/riscv/kernel/probes/ftrace.c| 3 +++
> >> >  arch/s390/kernel/ftrace.c| 3 +++
> >> >  arch/x86/kernel/kprobes/ftrace.c | 3 +++
> >> >  include/linux/kprobes.h  | 7 +++
> >> >  kernel/kprobes.c | 6 ++
> >> >  kernel/trace/ftrace.c| 1 +
> >> >  10 files changed, 35 insertions(+)
> >> >
> >> > diff --git a/arch/csky/kernel/probes/ftrace.c 
> >> > b/arch/csky/kernel/probes/ftrace.c
> >> > index 834cffcfbce3..7ba4b98076de 100644
> >> > --- a/arch/csky/kernel/probes/ftrace.c
> >> > +++ b/arch/csky/kernel/probes/ftrace.c
> >> > @@ -12,6 +12,9 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned 
> >> > long parent_ip,
> >> > struct kprobe_ctlblk *kcb;
> >> > struct pt_regs *regs;
> >> >
> >> > +   if (unlikely(kprobe_ftrace_disabled))
> >> > +   return;
> >> > +
> >> For csky part.
> >> Acked-by: Guo Ren 
> >
> > Thanks Stephen, Guo and Steve!
> >
Let me pick this up for probes/for-next!
> 
> Thank you Masami!
> 
> I did want to check, is this the correct git tree to be watching?
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/log/?h=probes/for-next
> 
> ( I'm not trying to pressure on timing, as I know the merge window is
>   hectic. Just making sure I'm watching the correct place! )

Sorry, I forgot to push it from my local tree. Now it should be there.

Thanks,

-- 
Masami Hiramatsu (Google) 


Re: [PATCH RESEND v8 07/16] mm/execmem, arch: convert simple overrides of module_alloc to execmem

2024-05-07 Thread Google
On Sun,  5 May 2024 19:06:19 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (IBM)" 
> 
> Several architectures override module_alloc() only to define address
> range for code allocations different than VMALLOC address space.
> 
> Provide a generic implementation in execmem that uses the parameters for
> address space ranges, required alignment and page protections provided
> by architectures.
> 
> The architectures must fill execmem_info structure and implement
> execmem_arch_setup() that returns a pointer to that structure. This way the
> execmem initialization won't be called from every architecture, but rather
> from a central place, namely a core_initcall() in execmem.
> 
> execmem provides the execmem_alloc() API that wraps __vmalloc_node_range()
> with the parameters defined by the architectures.  If an architecture does
> not implement execmem_arch_setup(), execmem_alloc() will fall back to
> module_alloc().
> 

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

Thanks,

> Signed-off-by: Mike Rapoport (IBM) 
> Acked-by: Song Liu 
> ---
>  arch/loongarch/kernel/module.c | 19 --
>  arch/mips/kernel/module.c  | 20 --
>  arch/nios2/kernel/module.c | 21 ---
>  arch/parisc/kernel/module.c| 24 
>  arch/riscv/kernel/module.c | 24 
>  arch/sparc/kernel/module.c | 20 --
>  include/linux/execmem.h| 47 
>  mm/execmem.c   | 67 --
>  mm/mm_init.c   |  2 +
>  9 files changed, 210 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/loongarch/kernel/module.c b/arch/loongarch/kernel/module.c
> index c7d0338d12c1..ca6dd7ea1610 100644
> --- a/arch/loongarch/kernel/module.c
> +++ b/arch/loongarch/kernel/module.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -490,10 +491,22 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char 
> *strtab,
>   return 0;
>  }
>  
> -void *module_alloc(unsigned long size)
> +static struct execmem_info execmem_info __ro_after_init;
> +
> +struct execmem_info __init *execmem_arch_setup(void)
>  {
> - return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> - GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, 
> __builtin_return_address(0));
> + execmem_info = (struct execmem_info){
> + .ranges = {
> + [EXECMEM_DEFAULT] = {
> + .start  = MODULES_VADDR,
> + .end= MODULES_END,
> + .pgprot = PAGE_KERNEL,
> + .alignment = 1,
> + },
> + },
> + };
> +
> + return &execmem_info;
>  }
>  
>  static void module_init_ftrace_plt(const Elf_Ehdr *hdr,
> diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
> index 9a6c96014904..59225a3cf918 100644
> --- a/arch/mips/kernel/module.c
> +++ b/arch/mips/kernel/module.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  struct mips_hi16 {
> @@ -32,11 +33,22 @@ static LIST_HEAD(dbe_list);
>  static DEFINE_SPINLOCK(dbe_lock);
>  
>  #ifdef MODULES_VADDR
> -void *module_alloc(unsigned long size)
> +static struct execmem_info execmem_info __ro_after_init;
> +
> +struct execmem_info __init *execmem_arch_setup(void)
>  {
> - return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> - GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
> - __builtin_return_address(0));
> + execmem_info = (struct execmem_info){
> + .ranges = {
> + [EXECMEM_DEFAULT] = {
> + .start  = MODULES_VADDR,
> + .end= MODULES_END,
> + .pgprot = PAGE_KERNEL,
> + .alignment = 1,
> + },
> + },
> + };
> +
> + return &execmem_info;
>  }
>  #endif
>  
> diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
> index 9c97b7513853..0d1ee86631fc 100644
> --- a/arch/nios2/kernel/module.c
> +++ b/arch/nios2/kernel/module.c
> @@ -18,15 +18,26 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> -void *module_alloc(unsigned long size)
> +static struct execmem_info execmem_info __ro_after_init;
> +
> +struct execmem_info __init *execmem_arch_setup(void)
>  {
> - return __vmalloc_node_range(siz

Re: [PATCH RESEND v8 05/16] module: make module_memory_{alloc,free} more self-contained

2024-05-07 Thread Google
On Sun,  5 May 2024 19:06:17 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (IBM)" 
> 
> Move the logic related to the memory allocation and freeing into
> module_memory_alloc() and module_memory_free().
> 

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

Thanks,

> Signed-off-by: Mike Rapoport (IBM) 
> Reviewed-by: Philippe Mathieu-Daudé 
> ---
>  kernel/module/main.c | 64 +++-
>  1 file changed, 39 insertions(+), 25 deletions(-)
> 
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index e1e8a7a9d6c1..5b82b069e0d3 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -1203,15 +1203,44 @@ static bool mod_mem_use_vmalloc(enum mod_mem_type 
> type)
>   mod_mem_type_is_core_data(type);
>  }
>  
> -static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
> +static int module_memory_alloc(struct module *mod, enum mod_mem_type type)
>  {
> + unsigned int size = PAGE_ALIGN(mod->mem[type].size);
> + void *ptr;
> +
> + mod->mem[type].size = size;
> +
>   if (mod_mem_use_vmalloc(type))
> - return vzalloc(size);
> - return module_alloc(size);
> + ptr = vmalloc(size);
> + else
> + ptr = module_alloc(size);
> +
> + if (!ptr)
> + return -ENOMEM;
> +
> + /*
> +  * The pointer to these blocks of memory are stored on the module
> +  * structure and we keep that around so long as the module is
> +  * around. We only free that memory when we unload the module.
> +  * Just mark them as not being a leak then. The .init* ELF
> +  * sections *do* get freed after boot so we *could* treat them
> +  * slightly differently with kmemleak_ignore() and only grey
> +  * them out as they work as typical memory allocations which
> +  * *do* eventually get freed, but let's just keep things simple
> +  * and avoid *any* false positives.
> +  */
> + kmemleak_not_leak(ptr);
> +
> + memset(ptr, 0, size);
> + mod->mem[type].base = ptr;
> +
> + return 0;
>  }
>  
> -static void module_memory_free(void *ptr, enum mod_mem_type type)
> +static void module_memory_free(struct module *mod, enum mod_mem_type type)
>  {
> + void *ptr = mod->mem[type].base;
> +
>   if (mod_mem_use_vmalloc(type))
>   vfree(ptr);
>   else
> @@ -1229,12 +1258,12 @@ static void free_mod_mem(struct module *mod)
>   /* Free lock-classes; relies on the preceding sync_rcu(). */
>   lockdep_free_key_range(mod_mem->base, mod_mem->size);
>   if (mod_mem->size)
> - module_memory_free(mod_mem->base, type);
> + module_memory_free(mod, type);
>   }
>  
>   /* MOD_DATA hosts mod, so free it at last */
>   lockdep_free_key_range(mod->mem[MOD_DATA].base, 
> mod->mem[MOD_DATA].size);
> - module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA);
> + module_memory_free(mod, MOD_DATA);
>  }
>  
>  /* Free a module, remove from lists, etc. */
> @@ -2225,7 +2254,6 @@ static int find_module_sections(struct module *mod, 
> struct load_info *info)
>  static int move_module(struct module *mod, struct load_info *info)
>  {
>   int i;
> - void *ptr;
>   enum mod_mem_type t = 0;
>   int ret = -ENOMEM;
>  
> @@ -2234,26 +2262,12 @@ static int move_module(struct module *mod, struct 
> load_info *info)
>   mod->mem[type].base = NULL;
>   continue;
>   }
> - mod->mem[type].size = PAGE_ALIGN(mod->mem[type].size);
> - ptr = module_memory_alloc(mod->mem[type].size, type);
> - /*
> - * The pointer to these blocks of memory are stored on the 
> module
> - * structure and we keep that around so long as the module is
> - * around. We only free that memory when we unload the 
> module.
> - * Just mark them as not being a leak then. The .init* ELF
> - * sections *do* get freed after boot so we *could* treat 
> them
> - * slightly differently with kmemleak_ignore() and only grey
> - * them out as they work as typical memory allocations which
> - * *do* eventually get freed, but let's just keep things 
> simple
> - * and avoid *any* false positives.
> -  */
> - kmemleak_not_leak(ptr);
> - if (!ptr) {
> +
> + ret = module_memory_alloc(mod, type);
>

Re: [PATCH v3] kprobe/ftrace: bail out if ftrace was killed

2024-05-01 Thread Google
b/arch/powerpc/kernel/kprobes-ftrace.c
> > index 072ebe7f290b..f8208c027148 100644
> > --- a/arch/powerpc/kernel/kprobes-ftrace.c
> > +++ b/arch/powerpc/kernel/kprobes-ftrace.c
> > @@ -21,6 +21,9 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned 
> > long parent_nip,
> > struct pt_regs *regs;
> > int bit;
> >
> > +   if (unlikely(kprobe_ftrace_disabled))
> > +   return;
> > +
> > bit = ftrace_test_recursion_trylock(nip, parent_nip);
> > if (bit < 0)
> > return;
> > diff --git a/arch/riscv/kernel/probes/ftrace.c 
> > b/arch/riscv/kernel/probes/ftrace.c
> > index 7142ec42e889..a69dfa610aa8 100644
> > --- a/arch/riscv/kernel/probes/ftrace.c
> > +++ b/arch/riscv/kernel/probes/ftrace.c
> > @@ -11,6 +11,9 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned 
> > long parent_ip,
> > struct kprobe_ctlblk *kcb;
> > int bit;
> >
> > +   if (unlikely(kprobe_ftrace_disabled))
> > +   return;
> > +
> > bit = ftrace_test_recursion_trylock(ip, parent_ip);
> > if (bit < 0)
> > return;
> > diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
> > index c46381ea04ec..7f6f8c438c26 100644
> > --- a/arch/s390/kernel/ftrace.c
> > +++ b/arch/s390/kernel/ftrace.c
> > @@ -296,6 +296,9 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned 
> > long parent_ip,
> > struct kprobe *p;
> > int bit;
> >
> > +   if (unlikely(kprobe_ftrace_disabled))
> > +   return;
> > +
> > bit = ftrace_test_recursion_trylock(ip, parent_ip);
> > if (bit < 0)
> > return;
> > diff --git a/arch/x86/kernel/kprobes/ftrace.c 
> > b/arch/x86/kernel/kprobes/ftrace.c
> > index dd2ec14adb77..15af7e98e161 100644
> > --- a/arch/x86/kernel/kprobes/ftrace.c
> > +++ b/arch/x86/kernel/kprobes/ftrace.c
> > @@ -21,6 +21,9 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned 
> > long parent_ip,
> > struct kprobe_ctlblk *kcb;
> > int bit;
> >
> > +   if (unlikely(kprobe_ftrace_disabled))
> > +   return;
> > +
> > bit = ftrace_test_recursion_trylock(ip, parent_ip);
> > if (bit < 0)
> > return;
> > diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
> > index 0ff44d6633e3..5fcbc254d186 100644
> > --- a/include/linux/kprobes.h
> > +++ b/include/linux/kprobes.h
> > @@ -378,11 +378,15 @@ static inline void wait_for_kprobe_optimizer(void) { }
> >  extern void kprobe_ftrace_handler(unsigned long ip, unsigned long 
> > parent_ip,
> >   struct ftrace_ops *ops, struct 
> > ftrace_regs *fregs);
> >  extern int arch_prepare_kprobe_ftrace(struct kprobe *p);
> > +/* Set when ftrace has been killed: kprobes on ftrace must be disabled for 
> > safety */
> > +extern bool kprobe_ftrace_disabled __read_mostly;
> > +extern void kprobe_ftrace_kill(void);
> >  #else
> >  static inline int arch_prepare_kprobe_ftrace(struct kprobe *p)
> >  {
> > return -EINVAL;
> >  }
> > +static inline void kprobe_ftrace_kill(void) {}
> >  #endif /* CONFIG_KPROBES_ON_FTRACE */
> >
> >  /* Get the kprobe at this addr (if any) - called with preemption disabled 
> > */
> > @@ -495,6 +499,9 @@ static inline void kprobe_flush_task(struct task_struct 
> > *tk)
> >  static inline void kprobe_free_init_mem(void)
> >  {
> >  }
> > +static inline void kprobe_ftrace_kill(void)
> > +{
> > +}
> >  static inline int disable_kprobe(struct kprobe *kp)
> >  {
> > return -EOPNOTSUPP;
> > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > index 65adc815fc6e..166ebf81dc45 100644
> > --- a/kernel/kprobes.c
> > +++ b/kernel/kprobes.c
> > @@ -1068,6 +1068,7 @@ static struct ftrace_ops kprobe_ipmodify_ops 
> > __read_mostly = {
> >
> >  static int kprobe_ipmodify_enabled;
> >  static int kprobe_ftrace_enabled;
> > +bool kprobe_ftrace_disabled;
> >
> >  static int __arm_kprobe_ftrace(struct kprobe *p, struct ftrace_ops *ops,
> >int *cnt)
> > @@ -1136,6 +1137,11 @@ static int disarm_kprobe_ftrace(struct kprobe *p)
> > ipmodify ? &kprobe_ipmodify_ops : &kprobe_ftrace_ops,
> > ipmodify ? &kprobe_ipmodify_enabled : 
> > &kprobe_ftrace_enabled);
> >  }
> > +
> > +void kprobe_ftrace_kill()
> > +{
> > +   kprobe_ftrace_disabled = true;
> > +}
> >  #else  /* !CONFIG_KPROBES_ON_FTRACE */
> >  static inline int arm_kprobe_ftrace(struct kprobe *p)
> >  {
> > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> > index da1710499698..96db99c347b3 100644
> > --- a/kernel/trace/ftrace.c
> > +++ b/kernel/trace/ftrace.c
> > @@ -7895,6 +7895,7 @@ void ftrace_kill(void)
> > ftrace_disabled = 1;
> > ftrace_enabled = 0;
> > ftrace_trace_function = ftrace_stub;
> > +   kprobe_ftrace_kill();
> >  }
> >
> >  /**
> > --
> > 2.39.3
> >
> 
> 
> -- 
> Best Regards
>  Guo Ren


-- 
Masami Hiramatsu (Google) 


Re: [PATCH] kprobe/ftrace: bail out if ftrace was killed

2024-04-29 Thread Google
Hi Stephen,

On Fri, 26 Apr 2024 15:58:34 -0700
Stephen Brennan  wrote:

> If an error happens in ftrace, ftrace_kill() will prevent disarming
> kprobes. Eventually, the ftrace_ops associated with the kprobes will be
> freed, yet the kprobes will still be active, and when triggered, they
> will use the freed memory, likely resulting in a page fault and panic.

Hmm, indeed.

> 
> This behavior can be reproduced quite easily, by creating a kprobe and
> then triggering a ftrace_kill(). For simplicity, we can simulate an
> ftrace error with a kernel module like [1]:
> 
> [1]: https://github.com/brenns10/kernel_stuff/tree/master/ftrace_killer
> 
>   sudo perf probe --add commit_creds
>   sudo perf trace -e probe:commit_creds
>   # In another terminal
>   make
>   sudo insmod ftrace_killer.ko  # calls ftrace_kill(), simulating bug
>   # Back to perf terminal
>   # ctrl-c
>   sudo perf probe --del commit_creds
> 
> After a short period, a page fault and panic would occur as the kprobe
> continues to execute and uses the freed ftrace_ops. While ftrace_kill()
> is supposed to be used only in extreme circumstances, it is invoked in
> FTRACE_WARN_ON() and so there are many places where an unexpected bug
> could be triggered, yet the system may continue operating, possibly
> without the administrator noticing. If ftrace_kill() does not panic the
> system, then we should do everything we can to continue operating,
> rather than leave a ticking time bomb.

OK, the patch looks good to me.

Acked-by: Masami Hiramatsu (Google) 

Thanks!

> 
> Signed-off-by: Stephen Brennan 
> ---
> 
> Apologies for the wide net cast here. I recognize that a change like this
> may need to be split up and go through arch-specific trees. I hoped to get
> feedback on the patch itself. If it's satisfactory and the architecture
> maintainers prefer it split out, I'm glad to do it. Thanks!
> 
>  arch/csky/kernel/probes/ftrace.c | 3 +++
>  arch/loongarch/kernel/ftrace_dyn.c   | 3 +++
>  arch/parisc/kernel/ftrace.c  | 3 +++
>  arch/powerpc/kernel/kprobes-ftrace.c | 3 +++
>  arch/riscv/kernel/probes/ftrace.c| 3 +++
>  arch/s390/kernel/ftrace.c| 3 +++
>  arch/x86/kernel/kprobes/ftrace.c | 3 +++
>  include/linux/ftrace.h   | 2 ++
>  8 files changed, 23 insertions(+)
> 
> diff --git a/arch/csky/kernel/probes/ftrace.c 
> b/arch/csky/kernel/probes/ftrace.c
> index 834cffcfbce3..3931bf9f707b 100644
> --- a/arch/csky/kernel/probes/ftrace.c
> +++ b/arch/csky/kernel/probes/ftrace.c
> @@ -12,6 +12,9 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long 
> parent_ip,
>   struct kprobe_ctlblk *kcb;
>   struct pt_regs *regs;
>  
> + if (unlikely(ftrace_is_dead()))
> + return;
> +
>   bit = ftrace_test_recursion_trylock(ip, parent_ip);
>   if (bit < 0)
>   return;
> diff --git a/arch/loongarch/kernel/ftrace_dyn.c 
> b/arch/loongarch/kernel/ftrace_dyn.c
> index 73858c9029cc..82c952cb5be0 100644
> --- a/arch/loongarch/kernel/ftrace_dyn.c
> +++ b/arch/loongarch/kernel/ftrace_dyn.c
> @@ -287,6 +287,9 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned 
> long parent_ip,
>   struct kprobe *p;
>   struct kprobe_ctlblk *kcb;
>  
> + if (unlikely(ftrace_is_dead()))
> + return;
> +
>   bit = ftrace_test_recursion_trylock(ip, parent_ip);
>   if (bit < 0)
>   return;
> diff --git a/arch/parisc/kernel/ftrace.c b/arch/parisc/kernel/ftrace.c
> index 621a4b386ae4..3660834f54c3 100644
> --- a/arch/parisc/kernel/ftrace.c
> +++ b/arch/parisc/kernel/ftrace.c
> @@ -206,6 +206,9 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned 
> long parent_ip,
>   struct kprobe *p;
>   int bit;
>  
> + if (unlikely(ftrace_is_dead()))
> + return;
> +
>   bit = ftrace_test_recursion_trylock(ip, parent_ip);
>   if (bit < 0)
>   return;
> diff --git a/arch/powerpc/kernel/kprobes-ftrace.c 
> b/arch/powerpc/kernel/kprobes-ftrace.c
> index 072ebe7f290b..85eb55aa1457 100644
> --- a/arch/powerpc/kernel/kprobes-ftrace.c
> +++ b/arch/powerpc/kernel/kprobes-ftrace.c
> @@ -21,6 +21,9 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long 
> parent_nip,
>   struct pt_regs *regs;
>   int bit;
>  
> + if (unlikely(ftrace_is_dead()))
> + return;
> +
>   bit = ftrace_test_recursion_trylock(nip, parent_nip);
>   if (bit < 0)
>   return;
> diff --git a/arch/riscv/kernel/probes/ftrace.c 
> b/arch/riscv/kernel/probes/ftrace.c
> index 7142ec42e889..8814fbe4c888 100644
> --- a/arch/riscv/kernel/probes/ftrace.c
> +

Re: [PATCH v5 14/15] kprobes: remove dependency on CONFIG_MODULES

2024-04-22 Thread Google
On Mon, 22 Apr 2024 12:44:35 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (IBM)" 
> 
> kprobes depended on CONFIG_MODULES because it has to allocate memory for
> code.
> 
> Since code allocations are now implemented with execmem, kprobes can be
> enabled in non-modular kernels.
> 
> Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside
> modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
> dependency of CONFIG_KPROBES on CONFIG_MODULES.

Looks good to me.

Acked-by: Masami Hiramatsu (Google) 

Thank you!

> 
> Signed-off-by: Mike Rapoport (IBM) 
> ---
>  arch/Kconfig|  2 +-
>  include/linux/module.h  |  9 ++
>  kernel/kprobes.c| 55 +++--
>  kernel/trace/trace_kprobe.c | 20 +-
>  4 files changed, 63 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 7006f71f0110..a48ce6a488b3 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -52,9 +52,9 @@ config GENERIC_ENTRY
>  
>  config KPROBES
>   bool "Kprobes"
> - depends on MODULES
>   depends on HAVE_KPROBES
>   select KALLSYMS
> + select EXECMEM
>   select TASKS_RCU if PREEMPTION
>   help
> Kprobes allows you to trap at almost any kernel address and
> diff --git a/include/linux/module.h b/include/linux/module.h
> index 1153b0d99a80..ffa1c603163c 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -605,6 +605,11 @@ static inline bool module_is_live(struct module *mod)
>   return mod->state != MODULE_STATE_GOING;
>  }
>  
> +static inline bool module_is_coming(struct module *mod)
> +{
> +return mod->state == MODULE_STATE_COMING;
> +}
> +
>  struct module *__module_text_address(unsigned long addr);
>  struct module *__module_address(unsigned long addr);
>  bool is_module_address(unsigned long addr);
> @@ -857,6 +862,10 @@ void *dereference_module_function_descriptor(struct 
> module *mod, void *ptr)
>   return ptr;
>  }
>  
> +static inline bool module_is_coming(struct module *mod)
> +{
> + return false;
> +}
>  #endif /* CONFIG_MODULES */
>  
>  #ifdef CONFIG_SYSFS
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index ddd7cdc16edf..ca2c6cbd42d2 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -1588,7 +1588,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
>   }
>  
>   /* Get module refcount and reject __init functions for loaded modules. 
> */
> - if (*probed_mod) {
> + if (IS_ENABLED(CONFIG_MODULES) && *probed_mod) {
>   /*
>* We must hold a refcount of the probed module while updating
>* its code to prohibit unexpected unloading.
> @@ -1603,12 +1603,13 @@ static int check_kprobe_address_safe(struct kprobe *p,
>* kprobes in there.
>*/
>   if (within_module_init((unsigned long)p->addr, *probed_mod) &&
> - (*probed_mod)->state != MODULE_STATE_COMING) {
> + !module_is_coming(*probed_mod)) {
>   module_put(*probed_mod);
>   *probed_mod = NULL;
>   ret = -ENOENT;
>   }
>   }
> +
>  out:
>   preempt_enable();
>   jump_label_unlock();
> @@ -2488,24 +2489,6 @@ int kprobe_add_area_blacklist(unsigned long start, 
> unsigned long end)
>   return 0;
>  }
>  
> -/* Remove all symbols in given area from kprobe blacklist */
> -static void kprobe_remove_area_blacklist(unsigned long start, unsigned long 
> end)
> -{
> - struct kprobe_blacklist_entry *ent, *n;
> -
> - list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
> - if (ent->start_addr < start || ent->start_addr >= end)
> - continue;
> - list_del(&ent->list);
> - kfree(ent);
> - }
> -}
> -
> -static void kprobe_remove_ksym_blacklist(unsigned long entry)
> -{
> - kprobe_remove_area_blacklist(entry, entry + 1);
> -}
> -
>  int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long 
> *value,
>  char *type, char *sym)
>  {
> @@ -2570,6 +2553,25 @@ static int __init populate_kprobe_blacklist(unsigned 
> long *start,
>   return ret ? : arch_populate_kprobe_blacklist();
>  }
>  
> +#ifdef CONFIG_MODULES
> +/* Remove all symbols in given area from kprobe blacklist */
> +static void kprobe_remove_area_blacklist(unsigned long start, unsigned long 
> end)
> +

Re: [PATCH v4 14/15] kprobes: remove dependency on CONFIG_MODULES

2024-04-20 Thread Google
On Sat, 20 Apr 2024 10:33:38 +0300
Mike Rapoport  wrote:

> On Fri, Apr 19, 2024 at 03:59:40PM +, Christophe Leroy wrote:
> > 
> > 
> > Le 19/04/2024 à 17:49, Mike Rapoport a écrit :
> > > Hi Masami,
> > > 
> > > On Thu, Apr 18, 2024 at 06:16:15AM +0900, Masami Hiramatsu wrote:
> > >> Hi Mike,
> > >>
> > >> On Thu, 11 Apr 2024 19:00:50 +0300
> > >> Mike Rapoport  wrote:
> > >>
> > >>> From: "Mike Rapoport (IBM)" 
> > >>>
> > >>> kprobes depended on CONFIG_MODULES because it has to allocate memory for
> > >>> code.
> > >>>
> > >>> Since code allocations are now implemented with execmem, kprobes can be
> > >>> enabled in non-modular kernels.
> > >>>
> > >>> Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside
> > >>> modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
> > >>> dependency of CONFIG_KPROBES on CONFIG_MODULES.
> > >>
> > >> Thanks for this work, but this conflicts with the latest fix in v6.9-rc4.
> > >> Also, can you use IS_ENABLED(CONFIG_MODULES) instead of #ifdefs in the
> > >> function body? We have enough dummy functions for that, so it should
> > >> not make a problem.
> > > 
> > > The code in check_kprobe_address_safe() that gets the module and checks 
> > > for
> > > __init functions does not compile with IS_ENABLED(CONFIG_MODULES).
> > > I can pull it out to a helper or leave #ifdef in the function body,
> > > whichever you prefer.
> > 
> > As far as I can see, the only problem is MODULE_STATE_COMING.
> > Can we move 'enum module_state' out of #ifdef CONFIG_MODULES in module.h?
> 
> There's dereference of 'struct module' there:
>  
>   (*probed_mod)->state != MODULE_STATE_COMING) {
>   ...
>   }
> 
> so moving out 'enum module_state' won't be enough.

Hmm, this part should be an inline function, like:

#ifdef CONFIG_MODULES
static inline bool module_is_coming(struct module *mod)
{
return mod->state == MODULE_STATE_COMING;
}
#else
#define module_is_coming(mod) (false)
#endif

Then we don't need the enum.
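
check_kprobe_address_safe() can then use it directly, e.g.:

	if (within_module_init((unsigned long)p->addr, *probed_mod) &&
	    !module_is_coming(*probed_mod)) {
		module_put(*probed_mod);
		*probed_mod = NULL;
		ret = -ENOENT;
	}
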
Thank you,

>  
> > >   
> > >> -- 
> > >> Masami Hiramatsu
> > > 
> 
> -- 
> Sincerely yours,
> Mike.
> 


-- 
Masami Hiramatsu (Google) 


Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-20 Thread Google
On Sat, 20 Apr 2024 07:22:50 +0300
Mike Rapoport  wrote:

> On Fri, Apr 19, 2024 at 02:42:16PM -0700, Song Liu wrote:
> > On Fri, Apr 19, 2024 at 1:00 PM Mike Rapoport  wrote:
> > >
> > > On Fri, Apr 19, 2024 at 10:32:39AM -0700, Song Liu wrote:
> > > > On Fri, Apr 19, 2024 at 10:03 AM Mike Rapoport  wrote:
> > > > [...]
> > > > > > >
> > > > > > > [1] 
> > > > > > > https://lore.kernel.org/all/20240411160526.2093408-1-r...@kernel.org
> > > > > >
> > > > > > For the ROX to work, we need different users (module text, kprobe, 
> > > > > > etc.) to have
> > > > > > the same execmem_range. From [1]:
> > > > > >
> > > > > > static void *execmem_cache_alloc(struct execmem_range *range, 
> > > > > > size_t size)
> > > > > > {
> > > > > > ...
> > > > > >p = __execmem_cache_alloc(size);
> > > > > >if (p)
> > > > > >return p;
> > > > > >   err = execmem_cache_populate(range, size);
> > > > > > ...
> > > > > > }
> > > > > >
> > > > > > We are calling __execmem_cache_alloc() without range. For this to 
> > > > > > work,
> > > > > > we can only call execmem_cache_alloc() with one execmem_range.
> > > > >
> > > > > Actually, on x86 this will "just work" because everything shares the 
> > > > > same
> > > > > address space :)
> > > > >
> > > > > The 2M pages in the cache will be in the modules space, so
> > > > > __execmem_cache_alloc() will always return memory from that address 
> > > > > space.
> > > > >
> > > > > For other architectures this indeed needs to be fixed with passing the
> > > > > range to __execmem_cache_alloc() and limiting search in the cache for 
> > > > > that
> > > > > range.
> > > >
> > > > I think we at least need the "map to" concept (initially proposed by 
> > > > Thomas)
> > > > to get this work. For example, EXECMEM_BPF and EXECMEM_KPROBE
> > > > maps to EXECMEM_MODULE_TEXT, so that all these actually share
> > > > the same range.
> > >
> > > Why?
> > 
> > IIUC, we need to update __execmem_cache_alloc() to take a range pointer as
> > input. module text will use "range" for EXECMEM_MODULE_TEXT, while kprobe
> > will use "range" for EXECMEM_KPROBE. Without "map to" concept or sharing
> > the "range" object, we will have to compare different range parameters to 
> > check
> > we can share cached pages between module text and kprobe, which is not
> > efficient. Did I miss something?

Song, thanks for trying to explain. I think I need to explain why I used
module_alloc() originally.

This depends on how kprobe features are implemented on the architecture, and
how many kprobe features are supported.

Because kprobe jump optimization and jump-back optimization use a direct
jump instruction to enter the trampoline and to return from it, on an
architecture whose jmp instruction covers only a +-2GB range (like x86),
the trampoline buffer must be allocated within that address space. This
requirement is similar to that of modules (module functions need to call
other functions in the kernel, etc.), which is why kprobes on x86 used
module_alloc().

However, if an architecture only supports breakpoint/trap-based kprobes,
it does not need to care about where the execmem is allocated.

> 
> We can always share large ROX pages as long as they are within the correct
> address space. The permissions for them are ROX and the alignment
> differences are due to KASAN and this is handled during allocation of the
> large page to refill the cache. __execmem_cache_alloc() only needs to limit
> the search for the address space of the range.

So I don't think EXECMEM_KPROBE is always the same as EXECMEM_MODULE_TEXT;
it should be configured per architecture. In particular, if it is only used
as a search parameter, that looks OK to me.
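
For example, an architecture with such a jump-range constraint could describe
a dedicated kprobes range in its execmem_arch_setup() (a sketch following the
structure of the quoted patches; the range bounds and pgprot here are only
placeholders):

	execmem_info = (struct execmem_info){
		.ranges = {
			[EXECMEM_DEFAULT] = {
				.start	= MODULES_VADDR,
				.end	= MODULES_END,
				.pgprot	= PAGE_KERNEL,
				.alignment = 1,
			},
			[EXECMEM_KPROBES] = {
				/* must stay within jmp range of kernel text */
				.start	= MODULES_VADDR,
				.end	= MODULES_END,
				.pgprot	= PAGE_KERNEL_ROX,
				.alignment = 1,
			},
		},
	};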

Thank you,

> 
> And regardless, they way we deal with sharing of the cache can be sorted
> out later.
> 
> > Thanks,
> > Song
> 
> -- 
> Sincerely yours,
> Mike.
> 


-- 
Masami Hiramatsu (Google) 


Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-17 Thread Google
On Thu, 11 Apr 2024 19:00:41 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (IBM)" 
> 
> module_alloc() is used everywhere as a mean to allocate memory for code.
> 
> Beside being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF to modules and
> puts the burden of code allocation to the modules code.
> 
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
> 
> Start splitting code allocation from modules by introducing execmem_alloc()
> and execmem_free() APIs.
> 
> Initially, execmem_alloc() is a wrapper for module_alloc() and
> execmem_free() is a replacement of module_memfree() to allow updating all
> call sites to use the new APIs.
> 
> Since architectures define different restrictions on placement,
> permissions, alignment and other parameters for memory that can be used by
> different subsystems that allocate executable memory, execmem_alloc() takes
> a type argument, that will be used to identify the calling subsystem and to
> allow architectures define parameters for ranges suitable for that
> subsystem.
> 

This looks good to me for the kprobe part.

Acked-by: Masami Hiramatsu (Google) 

Thank you,

> Signed-off-by: Mike Rapoport (IBM) 
> ---
>  arch/powerpc/kernel/kprobes.c|  6 ++--
>  arch/s390/kernel/ftrace.c|  4 +--
>  arch/s390/kernel/kprobes.c   |  4 +--
>  arch/s390/kernel/module.c|  5 +--
>  arch/sparc/net/bpf_jit_comp_32.c |  8 ++---
>  arch/x86/kernel/ftrace.c |  6 ++--
>  arch/x86/kernel/kprobes/core.c   |  4 +--
>  include/linux/execmem.h  | 57 
>  include/linux/moduleloader.h |  3 --
>  kernel/bpf/core.c|  6 ++--
>  kernel/kprobes.c |  8 ++---
>  kernel/module/Kconfig|  1 +
>  kernel/module/main.c | 25 +-
>  mm/Kconfig   |  3 ++
>  mm/Makefile  |  1 +
>  mm/execmem.c | 26 +++
>  16 files changed, 122 insertions(+), 45 deletions(-)
>  create mode 100644 include/linux/execmem.h
>  create mode 100644 mm/execmem.c
> 
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index bbca90a5e2ec..9fcd01bb2ce6 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -19,8 +19,8 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -130,7 +130,7 @@ void *alloc_insn_page(void)
>  {
>   void *page;
>  
> - page = module_alloc(PAGE_SIZE);
> + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
>   if (!page)
>   return NULL;
>  
> @@ -142,7 +142,7 @@ void *alloc_insn_page(void)
>   }
>   return page;
>  error:
> - module_memfree(page);
> + execmem_free(page);
>   return NULL;
>  }
>  
> diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
> index c46381ea04ec..798249ef5646 100644
> --- a/arch/s390/kernel/ftrace.c
> +++ b/arch/s390/kernel/ftrace.c
> @@ -7,13 +7,13 @@
>   *   Author(s): Martin Schwidefsky 
>   */
>  
> -#include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -220,7 +220,7 @@ static int __init ftrace_plt_init(void)
>  {
>   const char *start, *end;
>  
> - ftrace_plt = module_alloc(PAGE_SIZE);
> + ftrace_plt = execmem_alloc(EXECMEM_FTRACE, PAGE_SIZE);
>   if (!ftrace_plt)
>   panic("cannot allocate ftrace plt\n");
>  
> diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c
> index f0cf20d4b3c5..3c1b1be744de 100644
> --- a/arch/s390/kernel/kprobes.c
> +++ b/arch/s390/kernel/kprobes.c
> @@ -9,7 +9,6 @@
>  
>  #define pr_fmt(fmt) "kprobes: " fmt
>  
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -21,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -38,7 +38,7 @@ void *alloc_insn_page(void)
>  {
>   void *page;
>  
> - page = module_alloc(PAGE_SIZE);
> + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
>   if (!page)
>   return NULL;
>   set_memory_rox((unsigned long)page, 1);
> diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
> index 42215f9404af..ac97a905e8cd 100644
> --- a/arch/s3

Re: [RFC PATCH 5/9] powerpc/kprobes: Use ftrace to determine if a probe is at function entry

2023-12-10 Thread Google
On Fri,  8 Dec 2023 22:00:44 +0530
Naveen N Rao  wrote:

> Rather than hard-coding the offset into a function to be used to
> determine if a kprobe is at function entry, use ftrace_location() to
> determine the ftrace location within the function and categorize all
> instructions till that offset to be function entry.
> 
> For functions that cannot be traced, we fall back to using a fixed
> offset of 8 (two instructions) to categorize a probe as being at
> function entry for 64-bit elfv2.
> 

OK, so this can cover both the KPROBES_ON_FTRACE=y and =n cases, whether
or not the function is traced by ftrace.

Looks good to me.

Acked-by: Masami Hiramatsu (Google) 

Thanks,

> Signed-off-by: Naveen N Rao 
> ---
>  arch/powerpc/kernel/kprobes.c | 18 --
>  1 file changed, 8 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index b20ee72e873a..42665dfab59e 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -105,24 +105,22 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, 
> unsigned int offset)
>   return addr;
>  }
>  
> -static bool arch_kprobe_on_func_entry(unsigned long offset)
> +static bool arch_kprobe_on_func_entry(unsigned long addr, unsigned long 
> offset)
>  {
> -#ifdef CONFIG_PPC64_ELF_ABI_V2
> -#ifdef CONFIG_KPROBES_ON_FTRACE
> - return offset <= 16;
> -#else
> - return offset <= 8;
> -#endif
> -#else
> + unsigned long ip = ftrace_location(addr);
> +
> + if (ip)
> + return offset <= (ip - addr);
> + if (IS_ENABLED(CONFIG_PPC64_ELF_ABI_V2))
> + return offset <= 8;
>   return !offset;
> -#endif
>  }
>  
>  /* XXX try and fold the magic of kprobe_lookup_name() in this */
>  kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long 
> offset,
>bool *on_func_entry)
>  {
> - *on_func_entry = arch_kprobe_on_func_entry(offset);
> + *on_func_entry = arch_kprobe_on_func_entry(addr, offset);
>   return (kprobe_opcode_t *)(addr + offset);
>  }
>  
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 


Re: [PATCH 03/22] [RESEND] kprobes: unify kprobes_exceptions_nofify() prototypes

2023-11-08 Thread Google
On Wed,  8 Nov 2023 13:58:24 +0100
Arnd Bergmann  wrote:

> From: Arnd Bergmann 
> 
> Most architectures that support kprobes declare this function in their
> own asm/kprobes.h header and provide an override, but some are missing
> the prototype, which causes a warning for the __weak stub implementation:
> 
> kernel/kprobes.c:1865:12: error: no previous prototype for 
> 'kprobe_exceptions_notify' [-Werror=missing-prototypes]
>  1865 | int __weak kprobe_exceptions_notify(struct notifier_block *self,
> 
> Move the prototype into linux/kprobes.h so it is visible to all
> the definitions.

Thanks, let me pick this up into the linux-trace tree.

> 
> Acked-by: Masami Hiramatsu (Google) 
> Signed-off-by: Arnd Bergmann 
> ---
>  arch/arc/include/asm/kprobes.h | 3 ---
>  arch/arm/include/asm/kprobes.h | 2 --
>  arch/arm64/include/asm/kprobes.h   | 2 --
>  arch/mips/include/asm/kprobes.h| 2 --
>  arch/powerpc/include/asm/kprobes.h | 2 --
>  arch/s390/include/asm/kprobes.h| 2 --
>  arch/sh/include/asm/kprobes.h  | 2 --
>  arch/sparc/include/asm/kprobes.h   | 2 --
>  arch/x86/include/asm/kprobes.h | 2 --
>  include/linux/kprobes.h| 4 
>  10 files changed, 4 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/arc/include/asm/kprobes.h b/arch/arc/include/asm/kprobes.h
> index de1566e32cb8..68e8301c0df2 100644
> --- a/arch/arc/include/asm/kprobes.h
> +++ b/arch/arc/include/asm/kprobes.h
> @@ -32,9 +32,6 @@ struct kprobe;
>  
>  void arch_remove_kprobe(struct kprobe *p);
>  
> -int kprobe_exceptions_notify(struct notifier_block *self,
> -  unsigned long val, void *data);
> -
>  struct prev_kprobe {
>   struct kprobe *kp;
>   unsigned long status;
> diff --git a/arch/arm/include/asm/kprobes.h b/arch/arm/include/asm/kprobes.h
> index e26a278d301a..5b8dbf1b0be4 100644
> --- a/arch/arm/include/asm/kprobes.h
> +++ b/arch/arm/include/asm/kprobes.h
> @@ -40,8 +40,6 @@ struct kprobe_ctlblk {
>  
>  void arch_remove_kprobe(struct kprobe *);
>  int kprobe_fault_handler(struct pt_regs *regs, unsigned int fsr);
> -int kprobe_exceptions_notify(struct notifier_block *self,
> -  unsigned long val, void *data);
>  
>  /* optinsn template addresses */
>  extern __visible kprobe_opcode_t optprobe_template_entry[];
> diff --git a/arch/arm64/include/asm/kprobes.h 
> b/arch/arm64/include/asm/kprobes.h
> index 05cd82eeca13..be7a3680dadf 100644
> --- a/arch/arm64/include/asm/kprobes.h
> +++ b/arch/arm64/include/asm/kprobes.h
> @@ -37,8 +37,6 @@ struct kprobe_ctlblk {
>  
>  void arch_remove_kprobe(struct kprobe *);
>  int kprobe_fault_handler(struct pt_regs *regs, unsigned int fsr);
> -int kprobe_exceptions_notify(struct notifier_block *self,
> -  unsigned long val, void *data);
>  void __kretprobe_trampoline(void);
>  void __kprobes *trampoline_probe_handler(struct pt_regs *regs);
>  
> diff --git a/arch/mips/include/asm/kprobes.h b/arch/mips/include/asm/kprobes.h
> index 68b1e5d458cf..bc27d99c9436 100644
> --- a/arch/mips/include/asm/kprobes.h
> +++ b/arch/mips/include/asm/kprobes.h
> @@ -71,8 +71,6 @@ struct kprobe_ctlblk {
>   struct prev_kprobe prev_kprobe;
>  };
>  
> -extern int kprobe_exceptions_notify(struct notifier_block *self,
> - unsigned long val, void *data);
>  
>  #endif /* CONFIG_KPROBES */
>  #endif /* _ASM_KPROBES_H */
> diff --git a/arch/powerpc/include/asm/kprobes.h 
> b/arch/powerpc/include/asm/kprobes.h
> index c8e4b4fd4e33..4525a9c68260 100644
> --- a/arch/powerpc/include/asm/kprobes.h
> +++ b/arch/powerpc/include/asm/kprobes.h
> @@ -84,8 +84,6 @@ struct arch_optimized_insn {
>   kprobe_opcode_t *insn;
>  };
>  
> -extern int kprobe_exceptions_notify(struct notifier_block *self,
> - unsigned long val, void *data);
>  extern int kprobe_fault_handler(struct pt_regs *regs, int trapnr);
>  extern int kprobe_handler(struct pt_regs *regs);
>  extern int kprobe_post_handler(struct pt_regs *regs);
> diff --git a/arch/s390/include/asm/kprobes.h b/arch/s390/include/asm/kprobes.h
> index 21b9e5290c04..01f1682a73b7 100644
> --- a/arch/s390/include/asm/kprobes.h
> +++ b/arch/s390/include/asm/kprobes.h
> @@ -73,8 +73,6 @@ struct kprobe_ctlblk {
>  void arch_remove_kprobe(struct kprobe *p);
>  
>  int kprobe_fault_handler(struct pt_regs *regs, int trapnr);
> -int kprobe_exceptions_notify(struct notifier_block *self,
> - unsigned long val, void *data);
>  
>  #define flush_insn_slot(p)   do { } while (0)
>  
> diff --git a/arch/sh/include/asm/kprobes.h b/arch/sh/include/asm/kprobes.h
> index eeba83e0a7d

Re: [PATCH 14/17] kprobes: unify kprobes_exceptions_nofify() prototypes

2023-08-11 Thread Google
On Thu, 10 Aug 2023 16:19:32 +0200
Arnd Bergmann  wrote:

> From: Arnd Bergmann 
> 
> Most architectures that support kprobes declare this function in their
> own asm/kprobes.h header and provide an override, but some are missing
> the prototype, which causes a warning for the __weak stub implementation:
> 
> kernel/kprobes.c:1865:12: error: no previous prototype for 
> 'kprobe_exceptions_notify' [-Werror=missing-prototypes]
>  1865 | int __weak kprobe_exceptions_notify(struct notifier_block *self,
> 
> Move the prototype into linux/kprobes.h so it is visible to all
> the definitions.

Good catch! And it seems x86 has no implementation, so this is more reasonable.

Acked-by: Masami Hiramatsu (Google) 

Thank you,

> 
> Signed-off-by: Arnd Bergmann 
> ---
>  arch/arc/include/asm/kprobes.h | 3 ---
>  arch/arm/include/asm/kprobes.h | 2 --
>  arch/arm64/include/asm/kprobes.h   | 2 --
>  arch/ia64/include/asm/kprobes.h| 2 --
>  arch/mips/include/asm/kprobes.h| 2 --
>  arch/powerpc/include/asm/kprobes.h | 2 --
>  arch/s390/include/asm/kprobes.h| 2 --
>  arch/sh/include/asm/kprobes.h  | 2 --
>  arch/sparc/include/asm/kprobes.h   | 2 --
>  arch/x86/include/asm/kprobes.h | 2 --
>  include/linux/kprobes.h| 4 
>  11 files changed, 4 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/arc/include/asm/kprobes.h b/arch/arc/include/asm/kprobes.h
> index de1566e32cb89..68e8301c0df2c 100644
> --- a/arch/arc/include/asm/kprobes.h
> +++ b/arch/arc/include/asm/kprobes.h
> @@ -32,9 +32,6 @@ struct kprobe;
>  
>  void arch_remove_kprobe(struct kprobe *p);
>  
> -int kprobe_exceptions_notify(struct notifier_block *self,
> -  unsigned long val, void *data);
> -
>  struct prev_kprobe {
>   struct kprobe *kp;
>   unsigned long status;
> diff --git a/arch/arm/include/asm/kprobes.h b/arch/arm/include/asm/kprobes.h
> index e26a278d301ab..5b8dbf1b0be49 100644
> --- a/arch/arm/include/asm/kprobes.h
> +++ b/arch/arm/include/asm/kprobes.h
> @@ -40,8 +40,6 @@ struct kprobe_ctlblk {
>  
>  void arch_remove_kprobe(struct kprobe *);
>  int kprobe_fault_handler(struct pt_regs *regs, unsigned int fsr);
> -int kprobe_exceptions_notify(struct notifier_block *self,
> -  unsigned long val, void *data);
>  
>  /* optinsn template addresses */
>  extern __visible kprobe_opcode_t optprobe_template_entry[];
> diff --git a/arch/arm64/include/asm/kprobes.h 
> b/arch/arm64/include/asm/kprobes.h
> index 05cd82eeca136..be7a3680dadff 100644
> --- a/arch/arm64/include/asm/kprobes.h
> +++ b/arch/arm64/include/asm/kprobes.h
> @@ -37,8 +37,6 @@ struct kprobe_ctlblk {
>  
>  void arch_remove_kprobe(struct kprobe *);
>  int kprobe_fault_handler(struct pt_regs *regs, unsigned int fsr);
> -int kprobe_exceptions_notify(struct notifier_block *self,
> -  unsigned long val, void *data);
>  void __kretprobe_trampoline(void);
>  void __kprobes *trampoline_probe_handler(struct pt_regs *regs);
>  
> diff --git a/arch/ia64/include/asm/kprobes.h b/arch/ia64/include/asm/kprobes.h
> index 9e956768946cc..56004f97df6d2 100644
> --- a/arch/ia64/include/asm/kprobes.h
> +++ b/arch/ia64/include/asm/kprobes.h
> @@ -107,8 +107,6 @@ struct arch_specific_insn {
>  };
>  
>  extern int kprobe_fault_handler(struct pt_regs *regs, int trapnr);
> -extern int kprobe_exceptions_notify(struct notifier_block *self,
> - unsigned long val, void *data);
>  
>  extern void arch_remove_kprobe(struct kprobe *p);
>  
> diff --git a/arch/mips/include/asm/kprobes.h b/arch/mips/include/asm/kprobes.h
> index 68b1e5d458cfb..bc27d99c94363 100644
> --- a/arch/mips/include/asm/kprobes.h
> +++ b/arch/mips/include/asm/kprobes.h
> @@ -71,8 +71,6 @@ struct kprobe_ctlblk {
>   struct prev_kprobe prev_kprobe;
>  };
>  
> -extern int kprobe_exceptions_notify(struct notifier_block *self,
> - unsigned long val, void *data);
>  
>  #endif /* CONFIG_KPROBES */
>  #endif /* _ASM_KPROBES_H */
> diff --git a/arch/powerpc/include/asm/kprobes.h 
> b/arch/powerpc/include/asm/kprobes.h
> index c8e4b4fd4e330..4525a9c68260d 100644
> --- a/arch/powerpc/include/asm/kprobes.h
> +++ b/arch/powerpc/include/asm/kprobes.h
> @@ -84,8 +84,6 @@ struct arch_optimized_insn {
>   kprobe_opcode_t *insn;
>  };
>  
> -extern int kprobe_exceptions_notify(struct notifier_block *self,
> - unsigned long val, void *data);
>  extern int kprobe_fault_handler(struct pt_regs *regs, int trapnr);
>  extern int kprobe_handler(struct pt_regs *regs);
>  extern int kprobe_post_handler(struct pt_regs *regs);

Re: [PATCH v3 35/51] trace,hardirq: No moar _rcuidle() tracing

2023-01-16 Thread Google
Hi Peter,

On Thu, 12 Jan 2023 20:43:49 +0100
Peter Zijlstra  wrote:

> Robot reported that trace_hardirqs_{on,off}() tickle the forbidden
> _rcuidle() tracepoint through local_irq_{en,dis}able().
> 
> For 'sane' configs, these calls will only happen with RCU enabled and
> as such can use the regular tracepoint. This also means it's possible
> to trace them from NMI context again.
> 
> Signed-off-by: Peter Zijlstra (Intel) 

The code looks good to me. I just have a question about a comment.

> ---
>  kernel/trace/trace_preemptirq.c |   21 +
>  1 file changed, 13 insertions(+), 8 deletions(-)
> 
> --- a/kernel/trace/trace_preemptirq.c
> +++ b/kernel/trace/trace_preemptirq.c
> @@ -20,6 +20,15 @@
>  static DEFINE_PER_CPU(int, tracing_irq_cpu);
>  
>  /*
> + * ...

Is this intentional? Did you mean to leave a comment here?
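
(For other readers, the expansion this gives is -- a sketch derived from the
hunk below:)

	/* With CONFIG_ARCH_WANTS_NO_INSTR: */
	trace(irq_enable)(CALLER_ADDR0, CALLER_ADDR1);
	/* -> trace_irq_enable(CALLER_ADDR0, CALLER_ADDR1); */

	/* Without CONFIG_ARCH_WANTS_NO_INSTR: */
	trace(irq_enable)(CALLER_ADDR0, CALLER_ADDR1);
	/* -> if (!in_nmi()) trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1); */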

Thank you,

> + */
> +#ifdef CONFIG_ARCH_WANTS_NO_INSTR
> +#define trace(point) trace_##point
> +#else
> +#define trace(point) if (!in_nmi()) trace_##point##_rcuidle
> +#endif
> +
> +/*
>   * Like trace_hardirqs_on() but without the lockdep invocation. This is
>   * used in the low level entry code where the ordering vs. RCU is important
>   * and lockdep uses a staged approach which splits the lockdep hardirq
> @@ -28,8 +37,7 @@ static DEFINE_PER_CPU(int, tracing_irq_c
>  void trace_hardirqs_on_prepare(void)
>  {
>   if (this_cpu_read(tracing_irq_cpu)) {
> - if (!in_nmi())
> - trace_irq_enable(CALLER_ADDR0, CALLER_ADDR1);
> + trace(irq_enable)(CALLER_ADDR0, CALLER_ADDR1);
>   tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
>   this_cpu_write(tracing_irq_cpu, 0);
>   }
> @@ -40,8 +48,7 @@ NOKPROBE_SYMBOL(trace_hardirqs_on_prepar
>  void trace_hardirqs_on(void)
>  {
>   if (this_cpu_read(tracing_irq_cpu)) {
> - if (!in_nmi())
> - trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
> + trace(irq_enable)(CALLER_ADDR0, CALLER_ADDR1);
>   tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
>   this_cpu_write(tracing_irq_cpu, 0);
>   }
> @@ -63,8 +70,7 @@ void trace_hardirqs_off_finish(void)
>   if (!this_cpu_read(tracing_irq_cpu)) {
>   this_cpu_write(tracing_irq_cpu, 1);
>   tracer_hardirqs_off(CALLER_ADDR0, CALLER_ADDR1);
> - if (!in_nmi())
> - trace_irq_disable(CALLER_ADDR0, CALLER_ADDR1);
> + trace(irq_disable)(CALLER_ADDR0, CALLER_ADDR1);
>   }
>  
>  }
> @@ -78,8 +84,7 @@ void trace_hardirqs_off(void)
>   if (!this_cpu_read(tracing_irq_cpu)) {
>   this_cpu_write(tracing_irq_cpu, 1);
>   tracer_hardirqs_off(CALLER_ADDR0, CALLER_ADDR1);
> - if (!in_nmi())
> -     trace_irq_disable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
> + trace(irq_disable)(CALLER_ADDR0, CALLER_ADDR1);
>   }
>  }
>  EXPORT_SYMBOL(trace_hardirqs_off);
> 
> 


-- 
Masami Hiramatsu (Google) 


Re: Questions about kprobe handler

2022-11-24 Thread Google
On Thu, 24 Nov 2022 17:14:27 +0530
"Naveen N. Rao"  wrote:

> Jinyang He wrote:
> > On 2022/11/17 21:09, Masami Hiramatsu (Google) wrote:
> > 
> >> On Thu, 17 Nov 2022 09:07:37 +0800
> >> Tiezhu Yang  wrote:
> >>
> >>> Hi KPROBES maintainers,
> >>>
> >>> There are some differences of kprobe handler implementations on various
> >>> archs, the implementations are almost same on arm64, riscv, csky, the
> >>> code logic is clear and understandable. But on mips and loongarch (not
> >>> upstreamed yet), if get_kprobe() returns NULL, what is the purpose of
> >>> the check "if (addr->word != breakpoint_insn.word)", is it necessary?
> >>> Can we just return directly? Please take a look, thank you.
> 
> Are you seeing any problem with that?
> 
> >> Good question!
> >>
> >> This means that the software breakpoint was hit on that CPU, but
> >> before the kprobe handler was called, another CPU removed that
> >> kprobe from the hash table, because the hash table is not locked.
> >> In that case, get_kprobe(addr) will return NULL, and the software
> >> breakpoint instruction has already been removed (replaced with the
> >> original instruction). Thus it is safe to go back. But this was
> >> originally implemented for x86, which doesn't need stop_machine() to
> >> modify the code. On the other hand, on an architecture which needs
> >> stop_machine() to modify code, the above scenario can never happen.
> >> In that case, you don't need this "if" case.
> 
> This has been a problematic area on powerpc. See, for instance, commits:
> - 9ed5df69b79a22 ("powerpc/kprobes: Use probe_address() to read instructions")
> - 21f8b2fa3ca5b0 ("powerpc/kprobes: Ignore traps that happened in real mode")
> 
> I think we should close this race, perhaps by instroducing another 
> synchronize_rcu() in unregister_kprobe(), allowing architectures using 
> stop_machine() to override that. That would be cleaner.
> 
> 
> >>
> >> Thank you,
> >>
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/mips/kernel/kprobes.c#n323
> >>>   p = get_kprobe(addr);
> >>>   if (p) {
> >>>   ...
> >>>   } else if (addr->word != breakpoint_insn.word) {
> > 
> > Hi,
> > 
> > 
> > Sorry for the late reply, but I think there should be some public
> > comments so that I can get the authoritative response, as offline
> > communication with Tiezhu is always one-sided.
> > 
> > I think the branch you answered here is "if (p)..." rather than
> > "else if (addr->word != breakpoint_insn.word)". It is true that if we
> > do not use stop_machine() here, we need this branch. In fact, Tiezhu
> > and Huacai, the maintainer of LoongArch, are more concerned about the
> > latter: why we need to compare with breakpoint_insn.
> 
> Masami answered why we need the else part comparing the instruction at 
> addr with the breakpoint instruction. It is to check if the breakpoint 
> instruction has been removed since we hit it, but before the processor 
> reached the kprobe handler.
> 
> > 
> > The reason I gave as follows, and I show mips code here,
> > 
> >      p = get_kprobe(addr);
> >      if (!p) {
> >      if (addr->word != breakpoint_insn.word) {
> >          /*
> >           * The breakpoint instruction was removed right
> >           * after we hit it.  Another cpu has removed
> >           * either a probepoint or a debugger breakpoint
> >           * at this address.  In either case, no further
> >           * handling of this interrupt is appropriate.
> >           */
> >          ret = 1;
> >      }
> >      /* Not one of ours: let kernel handle it */
> >      goto no_kprobe;
> >      }
> > 
> > ...
> > int kprobe_exceptions_notify(struct notifier_block *self,
> >                 unsigned long val, void *data)
> > {
> >      struct die_args *args = (struct die_args *)data;
> >      int ret = NOTIFY_DONE;
> > 
> >      switch (val) {
> >      case DIE_BREAK:
> >      if (kprobe_handler(args->regs))
> >          ret = NOTIFY_STOP;
> >      break;
> > ...
> > 
> > The !p means this insn has been moved, and then in most cases the COND
> 
> !p means the kprobe has been removed - there may or ma

Re: [PATCH] kprobes: Enable tracing for monolithic kernel images

2022-06-13 Thread Google
On Sun, 12 Jun 2022 15:59:29 +
Christophe Leroy  wrote:

> 
> 
> On 12/06/2022 at 14:18, Masami Hiramatsu (Google) wrote:
> > On Thu, 9 Jun 2022 15:23:16 +0200
> > Ard Biesheuvel  wrote:
> > 
> >> On Thu, 9 Jun 2022 at 15:14, Jarkko Sakkinen  wrote:
> >>>
> >>> On Wed, Jun 08, 2022 at 09:12:34AM -0700, Song Liu wrote:
> >>>> On Wed, Jun 8, 2022 at 7:21 AM Masami Hiramatsu  
> >>>> wrote:
> >>>>>
> >>>>> Hi Jarkko,
> >>>>>
> >>>>> On Wed, 8 Jun 2022 08:25:38 +0300
> >>>>> Jarkko Sakkinen  wrote:
> >>>>>
> >>>>>> On Wed, Jun 08, 2022 at 10:35:42AM +0800, Guo Ren wrote:
> >>>>>>> .
> >>>>>>>
> >>>>>>> On Wed, Jun 8, 2022 at 8:02 AM Jarkko Sakkinen  
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Tracing with kprobes while running a monolithic kernel is currently
> >>>>>>>> impossible because CONFIG_KPROBES is dependent of CONFIG_MODULES.  
> >>>>>>>> This
> >>>>>>>> dependency is a result of kprobes code using the module allocator 
> >>>>>>>> for the
> >>>>>>>> trampoline code.
> >>>>>>>>
> >>>>>>>> Detaching kprobes from modules helps to squeeze down the user space,
> >>>>>>>> e.g. when developing new core kernel features, while still having all
> >>>>>>>> the nice tracing capabilities.
> >>>>>>>>
> >>>>>>>> For kernel/ and arch/*, move module_alloc() and module_memfree() to
> >>>>>>>> module_alloc.c, and compile as part of vmlinux when either 
> >>>>>>>> CONFIG_MODULES
> >>>>>>>> or CONFIG_KPROBES is enabled.  In addition, flag kernel module 
> >>>>>>>> specific
> >>>>>>>> code with CONFIG_MODULES.
> >>>>>>>>
> >>>>>>>> As the result, kprobes can be used with a monolithic kernel.
> >>>>>>> It's strange when MODULES is n, but vmlinux still obtains 
> >>>>>>> module_alloc.
> >>>>>>>
> >>>>>>> Maybe we need a kprobe_alloc, right?
> >>>>>>
> >>>>>> Perhaps not the best name but at least it documents the fact that
> >>>>>> they use the same allocator.
> >>>>>>
> >>>>>> Few years ago I carved up something "half-way there" for kprobes,
> >>>>>> and I used the name text_alloc() [*].
> >>>>>>
> >>>>>> [*] 
> >>>>>> https://lore.kernel.org/all/20200724050553.1724168-1-jarkko.sakki...@linux.intel.com/
> >>>>>
> >>>>> Yeah, I remember that. Thank you for updating your patch!
> >>>>> I think the idea (splitting module_alloc() out from CONFIG_MODULES)
> >>>>> sounds good to me.
> >>>>> If the module support maintainers think this name is not good, you may
> >>>>> be able to rename it to text_alloc() and make module_alloc() a wrapper
> >>>>> around it.
> >>>>
> >

Re: [PATCH] kprobes: Enable tracing for monolithic kernel images

2022-06-13 Thread Google
On Wed, 8 Jun 2022 11:19:19 -0700
Song Liu  wrote:

> On Wed, Jun 8, 2022 at 9:28 AM Ard Biesheuvel  wrote:
> >
> > Hello Jarkko,
> >
> > On Wed, 8 Jun 2022 at 02:02, Jarkko Sakkinen  wrote:
> > >
> > > Tracing with kprobes while running a monolithic kernel is currently
> > > impossible because CONFIG_KPROBES is dependent of CONFIG_MODULES.  This
> > > dependency is a result of kprobes code using the module allocator for the
> > > trampoline code.
> > >
> > > Detaching kprobes from modules helps to squeeze down the user space,
> > > e.g. when developing new core kernel features, while still having all
> > > the nice tracing capabilities.
> > >
> > > For kernel/ and arch/*, move module_alloc() and module_memfree() to
> > > module_alloc.c, and compile as part of vmlinux when either CONFIG_MODULES
> > > or CONFIG_KPROBES is enabled.  In addition, flag kernel module specific
> > > code with CONFIG_MODULES.
> > >
> > > As the result, kprobes can be used with a monolithic kernel.
> >
> > I think I may have mentioned this the previous time as well, but I
> > don't think this is the right approach.
> >
> > Kprobes uses alloc_insn_page() to allocate executable memory, but the
> > requirements for this memory are radically different compared to
> > loadable modules, which need to be within an arch-specific distance of
> > the core kernel, need KASAN backing etc etc.
> 
> I think the distance of core kernel requirement is the same for kprobe
> alloc_insn_page and modules, no?

This strongly depends on how kprobes (software breakpoint and
single-step) is implemented on the arch. For example, x86 implements
the so-called "kprobe-booster", which jumps back from the
single-stepping trampoline buffer. The buffer address must then be
within the range from which it can jump back to the original address.
However, if the arch implements single-stepping as instruction
emulation, it has no such limitation. As far as I know, arm64
will emulate the instructions which change the PC register,
and will directly execute the other instructions with another
software breakpoint.

The reason I'm using module_alloc() for the generic function is that
it covers those limitations most widely.
Thus, if we had a CONFIG_ARCH_HAVE_ALLOC_INSN_PAGE flag and kprobes
could check it instead of using a __weak function, kprobes might not
need to depend on module_alloc() in general.
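
For reference, the generic fallback today looks roughly like this (a sketch
of the __weak pattern being discussed, not a verbatim copy):

	/* kernel/kprobes.c: generic fallback, overridable per arch */
	void __weak *alloc_insn_page(void)
	{
		return module_alloc(PAGE_SIZE);
	}

	void __weak free_insn_page(void *page)
	{
		module_memfree(page);
	}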

Thank you,

> 
> Thanks,
> Song
> 
> >
> > This is why arm64, for instance, does not implement alloc_insn_page()
> > in terms of module_alloc() [and likely does not belong in this patch
> > for that reason]
> 
> 
> 
> >
> > Is there any reason kprobes cannot simply use vmalloc()?
> >


-- 
Masami Hiramatsu (Google) 


Re: [PATCH] kprobes: Enable tracing for monolithic kernel images

2022-06-13 Thread Google
On Thu, 9 Jun 2022 15:23:16 +0200
Ard Biesheuvel  wrote:

> On Thu, 9 Jun 2022 at 15:14, Jarkko Sakkinen  wrote:
> >
> > On Wed, Jun 08, 2022 at 09:12:34AM -0700, Song Liu wrote:
> > > On Wed, Jun 8, 2022 at 7:21 AM Masami Hiramatsu  
> > > wrote:
> > > >
> > > > Hi Jarkko,
> > > >
> > > > On Wed, 8 Jun 2022 08:25:38 +0300
> > > > Jarkko Sakkinen  wrote:
> > > >
> > > > > On Wed, Jun 08, 2022 at 10:35:42AM +0800, Guo Ren wrote:
> > > > > > .
> > > > > >
> > > > > > On Wed, Jun 8, 2022 at 8:02 AM Jarkko Sakkinen  
> > > > > > wrote:
> > > > > > >
> > > > > > > Tracing with kprobes while running a monolithic kernel is 
> > > > > > > currently
> > > > > > > impossible because CONFIG_KPROBES is dependent of CONFIG_MODULES. 
> > > > > > >  This
> > > > > > > dependency is a result of kprobes code using the module allocator 
> > > > > > > for the
> > > > > > > trampoline code.
> > > > > > >
> > > > > > > Detaching kprobes from modules helps to squeeze down the user 
> > > > > > > space,
> > > > > > > e.g. when developing new core kernel features, while still having 
> > > > > > > all
> > > > > > > the nice tracing capabilities.
> > > > > > >
> > > > > > > For kernel/ and arch/*, move module_alloc() and module_memfree() 
> > > > > > > to
> > > > > > > module_alloc.c, and compile as part of vmlinux when either 
> > > > > > > CONFIG_MODULES
> > > > > > > or CONFIG_KPROBES is enabled.  In addition, flag kernel module 
> > > > > > > specific
> > > > > > > code with CONFIG_MODULES.
> > > > > > >
> > > > > > > As the result, kprobes can be used with a monolithic kernel.
> > > > > > It's strange when MODULES is n, but vmlinux still obtains 
> > > > > > module_alloc.
> > > > > >
> > > > > > Maybe we need a kprobe_alloc, right?
> > > > >
> > > > > Perhaps not the best name but at least it documents the fact that
> > > > > they use the same allocator.
> > > > >
> > > > > Few years ago I carved up something "half-way there" for kprobes,
> > > > > and I used the name text_alloc() [*].
> > > > >
> > > > > [*] 
> > > > > https://lore.kernel.org/all/20200724050553.1724168-1-jarkko.sakki...@linux.intel.com/
> > > >
> > > > Yeah, I remember that. Thank you for updating your patch!
> > > > I think the idea (splitting module_alloc() out from CONFIG_MODULES)
> > > > sounds good to me.
> > > > If the module support maintainers think this name is not good, you may
> > > > be able to rename it to text_alloc() and make module_alloc() a wrapper
> > > > around it.
> > >
> > > IIUC, most users of module_alloc() use it to allocate memory for text, 
> > > except
> > > that module code uses it for both text and data. Therefore, I guess 
> > > calling it
> > > text_alloc() is not 100% accurate until we change the module code (to use
> > > a different API to allocate memory for data).
> >
> > After reading the feedback, I'd stay on using module_alloc() because
> > it has arch-specific quirks baked in. Easier to deal with them in one
> > place.
> >
> 
> In that case, please ensure that you enable this only on architectures
> where it is needed. arm64 implements alloc_insn_page() without relying
> on module_alloc() so I would not expect to see any changes there.

Hmm, what about adding CONFIG_ARCH_HAVE_ALLOC_INSN_PAGE and checking it?
If it is defined, kprobes would not define the __weak function; but
if not, it would keep using module_alloc().
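
Something like this, say (a rough sketch; ARCH_HAVE_ALLOC_INSN_PAGE is just
a strawman name, not an existing symbol):

	/* arch/xyz/Kconfig (sketch): the arch opts out of the fallback */
	config XYZ
		select ARCH_HAVE_ALLOC_INSN_PAGE

	/* kernel/kprobes.c (sketch): */
	#ifndef CONFIG_ARCH_HAVE_ALLOC_INSN_PAGE
	void *alloc_insn_page(void)
	{
		return module_alloc(PAGE_SIZE);
	}
	#endif	/* otherwise the arch provides alloc_insn_page() itself */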

Thank you,

-- 
Masami Hiramatsu (Google) 


Re: [PATCH] kprobes: Enable tracing for monolithic kernel images

2022-06-08 Thread Google
Hi Jarkko,

On Wed, 8 Jun 2022 08:25:38 +0300
Jarkko Sakkinen  wrote:

> On Wed, Jun 08, 2022 at 10:35:42AM +0800, Guo Ren wrote:
> > .
> > 
> > On Wed, Jun 8, 2022 at 8:02 AM Jarkko Sakkinen  wrote:
> > >
> > > Tracing with kprobes while running a monolithic kernel is currently
> > > impossible because CONFIG_KPROBES is dependent of CONFIG_MODULES.  This
> > > dependency is a result of kprobes code using the module allocator for the
> > > trampoline code.
> > >
> > > Detaching kprobes from modules helps to squeeze down the user space,
> > > e.g. when developing new core kernel features, while still having all
> > > the nice tracing capabilities.
> > >
> > > For kernel/ and arch/*, move module_alloc() and module_memfree() to
> > > module_alloc.c, and compile as part of vmlinux when either CONFIG_MODULES
> > > or CONFIG_KPROBES is enabled.  In addition, flag kernel module specific
> > > code with CONFIG_MODULES.
> > >
> > > As the result, kprobes can be used with a monolithic kernel.
> > It's strange when MODULES is n, but vmlinux still obtains module_alloc.
> > 
> > Maybe we need a kprobe_alloc, right?
> 
> Perhaps not the best name but at least it documents the fact that
> they use the same allocator.
> 
> Few years ago I carved up something "half-way there" for kprobes,
> and I used the name text_alloc() [*].
> 
> [*] 
> https://lore.kernel.org/all/20200724050553.1724168-1-jarkko.sakki...@linux.intel.com/
>  

Yeah, I remember that. Thank you for updating your patch!
I think the idea (splitting module_alloc() out from CONFIG_MODULES) sounds
good to me.
If the module support maintainers think this name is not good, you may be
able to rename it to text_alloc() and make module_alloc() a wrapper
around it.

Acked-by: Masami Hiramatsu (Google) 
for kprobe side.

Thank you,

-- 
Masami Hiramatsu (Google) 


[PATCH] Documentation: Better clarify the rwsem non-owner release issue

2020-03-21 Thread Joel Fernandes (Google)
Reword and better clarify the rwsem non-owner release issue.

Link: https://lore.kernel.org/linux-pci/20200321212144.ga6...@google.com/

Signed-off-by: Joel Fernandes (Google) 
---
 Documentation/locking/locktypes.rst | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/Documentation/locking/locktypes.rst 
b/Documentation/locking/locktypes.rst
index 6f2c0f5b041e..656dce72f11f 100644
--- a/Documentation/locking/locktypes.rst
+++ b/Documentation/locking/locktypes.rst
@@ -292,7 +292,7 @@ implementations to provide priority inheritance for all 
lock types except
 the truly spinning ones. Priority inheritance on ownerless locks is
 obviously impossible.
 
-For now the rwsem non-owner release excludes code which utilizes it from
-being used on PREEMPT_RT enabled kernels. In same cases this can be
-mitigated by disabling portions of the code, in other cases the complete
-functionality has to be disabled until a workable solution has been found.
+For now, a PREEMPT_RT kernel just disables code sections that perform a
+non-owner release of an rwsem. In some cases, parts of the code are disabled.
+In other cases, the complete functionality has to be disabled until a workable
+solution has been found.
-- 
2.25.1.696.g5e7596f4ac-goog



[PATCH] treewide: Rename rcu_dereference_raw_notrace to _check

2019-07-11 Thread Joel Fernandes (Google)
The rcu_dereference_raw_notrace() API name is confusing.
It is equivalent to rcu_dereference_raw() except that it also does
sparse pointer checking.

There are only a few users of rcu_dereference_raw_notrace(). This
patch renames all of them to rcu_dereference_raw_check, with the
"check" indicating sparse checking.

Signed-off-by: Joel Fernandes (Google) 

---
Previous discussion is here:
https://lore.kernel.org/linuxppc-dev/20190528200014.gv28...@linux.ibm.com/T/

 Documentation/RCU/Design/Requirements/Requirements.html | 2 +-
 arch/powerpc/include/asm/kvm_book3s_64.h| 2 +-
 include/linux/rculist.h | 4 ++--
 include/linux/rcupdate.h| 2 +-
 kernel/trace/ftrace_internal.h  | 8 
 kernel/trace/trace.c| 4 ++--
 6 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/Documentation/RCU/Design/Requirements/Requirements.html 
b/Documentation/RCU/Design/Requirements/Requirements.html
index f04c467e55c5..467251f7fef6 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -2514,7 +2514,7 @@ disabled across the entire RCU read-side critical section.
 
 It is possible to use tracing on RCU code, but tracing itself
 uses RCU.
-For this reason, rcu_dereference_raw_notrace()
+For this reason, rcu_dereference_raw_check()
 is provided for use by tracing, which avoids the destructive
 recursion that could otherwise ensue.
 This API is also used by virtualization in some architectures,
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 21b1ed5df888..53388a311967 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -546,7 +546,7 @@ static inline void note_hpte_modification(struct kvm *kvm,
  */
 static inline struct kvm_memslots *kvm_memslots_raw(struct kvm *kvm)
 {
-   return rcu_dereference_raw_notrace(kvm->memslots[0]);
+   return rcu_dereference_raw_check(kvm->memslots[0]);
 }
 
 extern void kvmppc_mmu_debugfs_init(struct kvm *kvm);
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index e91ec9ddcd30..10aab1d2d471 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -642,10 +642,10 @@ static inline void hlist_add_behind_rcu(struct hlist_node *n,
  * not do any RCU debugging or tracing.
  */
 #define hlist_for_each_entry_rcu_notrace(pos, head, member)		\
-	for (pos = hlist_entry_safe (rcu_dereference_raw_notrace(hlist_first_rcu(head)),\
+	for (pos = hlist_entry_safe (rcu_dereference_raw_check(hlist_first_rcu(head)),\
 		typeof(*(pos)), member);				\
 		pos;							\
-	pos = hlist_entry_safe(rcu_dereference_raw_notrace(hlist_next_rcu(\
+	pos = hlist_entry_safe(rcu_dereference_raw_check(hlist_next_rcu(\
 		&(pos)->member)), typeof(*(pos)), member))
 
 /**
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 0c9b92799abc..e5161e377ad4 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -478,7 +478,7 @@ do {
  \
  * The no-tracing version of rcu_dereference_raw() must not call
  * rcu_read_lock_held().
  */
-#define rcu_dereference_raw_notrace(p) __rcu_dereference_check((p), 1, __rcu)
+#define rcu_dereference_raw_check(p) __rcu_dereference_check((p), 1, __rcu)
 
 /**
  * rcu_dereference_protected() - fetch RCU pointer when updates prevented
diff --git a/kernel/trace/ftrace_internal.h b/kernel/trace/ftrace_internal.h
index 0515a2096f90..0456e0a3dab1 100644
--- a/kernel/trace/ftrace_internal.h
+++ b/kernel/trace/ftrace_internal.h
@@ -6,22 +6,22 @@
 
 /*
  * Traverse the ftrace_global_list, invoking all entries.  The reason that we
- * can use rcu_dereference_raw_notrace() is that elements removed from this list
+ * can use rcu_dereference_raw_check() is that elements removed from this list
  * are simply leaked, so there is no need to interact with a grace-period
- * mechanism.  The rcu_dereference_raw_notrace() calls are needed to handle
+ * mechanism.  The rcu_dereference_raw_check() calls are needed to handle
  * concurrent insertions into the ftrace_global_list.
  *
  * Silly Alpha and silly pointer-speculation compiler optimizations!
  */
 #define do_for_each_ftrace_op(op, list)\
-   op = rcu_dereference_raw_notrace(list); \
+   op = rcu_dereference_raw_check(list);   \
do
 
 /*
  * Optimized for just a single item in the list (as that is the normal case).
  */
 #define while_for_each_ftrace_op(op)   \
-	while (likely(op = rcu_dereference_raw_notrace((op)->next)) &&	\
+	while (likely(op = rcu_dereference_raw_check((op)->next)) &&	\
 		   unlikely((op) != &ftrace_list_end))

[PATCH RFC 5/5] rcu: Remove rcu_dereference_raw_notrace since no users

2019-05-24 Thread Joel Fernandes (Google)
The series removes all users of the API and, with this patch, the API
itself. It also fixes the documentation.

Signed-off-by: Joel Fernandes (Google) 
---
 Documentation/RCU/Design/Requirements/Requirements.html | 6 +++---
 include/linux/rcupdate.h| 9 -
 2 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/Documentation/RCU/Design/Requirements/Requirements.html 
b/Documentation/RCU/Design/Requirements/Requirements.html
index 5a9238a2883c..9727278893e6 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -2512,9 +2512,9 @@ disabled across the entire RCU read-side critical section.
 
 It is possible to use tracing on RCU code, but tracing itself
 uses RCU.
-For this reason, rcu_dereference_raw_notrace()
-is provided for use by tracing, which avoids the destructive
-recursion that could otherwise ensue.
+This is the other reason for using rcu_dereference_raw(),
+for tracing, which avoids the destructive recursion that could
+otherwise ensue.
 This API is also used by virtualization in some architectures,
 where RCU readers execute in environments in which tracing
 cannot be used.
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 922bb6848813..f917a27fc115 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -472,15 +472,6 @@ static inline void rcu_preempt_sleep_check(void) { }
__rcu_dereference_check((p), (c) || rcu_read_lock_sched_held(), \
__rcu)
 
-/*
- * The tracing infrastructure traces RCU (we want that), but unfortunately
- * some of the RCU checks causes tracing to lock up the system.
- *
- * The no-tracing version of rcu_dereference_raw() must not call
- * rcu_read_lock_held().
- */
-#define rcu_dereference_raw_notrace(p) __rcu_dereference_check((p), 1, __rcu)
-
 /**
  * rcu_dereference_protected() - fetch RCU pointer when updates prevented
  * @p: The pointer to read, prior to dereferencing
-- 
2.22.0.rc1.257.g3120a18244-goog



[PATCH RFC 4/5] rculist: Remove hlist_for_each_entry_rcu_notrace since no users

2019-05-24 Thread Joel Fernandes (Google)
The series removes all users of the API and, with this patch, the API
itself.

Signed-off-by: Joel Fernandes (Google) 
---
 .clang-format   |  1 -
 include/linux/rculist.h | 20 
 2 files changed, 21 deletions(-)

diff --git a/.clang-format b/.clang-format
index 2ffd69afc1a8..aa935923f5cb 100644
--- a/.clang-format
+++ b/.clang-format
@@ -287,7 +287,6 @@ ForEachMacros:
   - 'hlist_for_each_entry_from_rcu'
   - 'hlist_for_each_entry_rcu'
   - 'hlist_for_each_entry_rcu_bh'
-  - 'hlist_for_each_entry_rcu_notrace'
   - 'hlist_for_each_entry_safe'
   - '__hlist_for_each_rcu'
   - 'hlist_for_each_safe'
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index e91ec9ddcd30..0d3d77cf4f07 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -628,26 +628,6 @@ static inline void hlist_add_behind_rcu(struct hlist_node 
*n,
pos = hlist_entry_safe(rcu_dereference_raw(hlist_next_rcu(\
&(pos)->member)), typeof(*(pos)), member))
 
-/**
- * hlist_for_each_entry_rcu_notrace - iterate over rcu list of given type (for 
tracing)
- * @pos:   the type * to use as a loop cursor.
- * @head:  the head for your list.
- * @member:the name of the hlist_node within the struct.
- *
- * This list-traversal primitive may safely run concurrently with
- * the _rcu list-mutation primitives such as hlist_add_head_rcu()
- * as long as the traversal is guarded by rcu_read_lock().
- *
- * This is the same as hlist_for_each_entry_rcu() except that it does
- * not do any RCU debugging or tracing.
- */
-#define hlist_for_each_entry_rcu_notrace(pos, head, member)		\
-	for (pos = hlist_entry_safe (rcu_dereference_raw_notrace(hlist_first_rcu(head)),\
-		typeof(*(pos)), member);				\
-		pos;							\
-	pos = hlist_entry_safe(rcu_dereference_raw_notrace(hlist_next_rcu(\
-		&(pos)->member)), typeof(*(pos)), member))
-
 /**
  * hlist_for_each_entry_rcu_bh - iterate over rcu list of given type
  * @pos:   the type * to use as a loop cursor.
-- 
2.22.0.rc1.257.g3120a18244-goog



[PATCH RFC 3/5] hashtable: Use the regular hlist_for_each_entry_rcu API

2019-05-24 Thread Joel Fernandes (Google)
hlist_for_each_entry_rcu already does not do any tracing. This series
removes the notrace variant of it, so let us just use the regular API.

In a future patch, we can also remove the
hash_for_each_possible_rcu_notrace API that this patch touches.

Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/hashtable.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/hashtable.h b/include/linux/hashtable.h
index 417d2c4bc60d..47fa7b673c1b 100644
--- a/include/linux/hashtable.h
+++ b/include/linux/hashtable.h
@@ -189,7 +189,7 @@ static inline void hash_del_rcu(struct hlist_node *node)
  * not do any RCU debugging or tracing.
  */
 #define hash_for_each_possible_rcu_notrace(name, obj, member, key) \
-   hlist_for_each_entry_rcu_notrace(obj, \
+   hlist_for_each_entry_rcu(obj, \
&name[hash_min(key, HASH_BITS(name))], member)
 
 /**
-- 
2.22.0.rc1.257.g3120a18244-goog



[PATCH RFC 2/5] trace: Use regular rcu_dereference_raw API

2019-05-24 Thread Joel Fernandes (Google)
rcu_dereference_raw already does not do any tracing. There is no need to
use the _notrace variant of it and this series removes that API, so let us
use the regular variant here.

While at it, also replace the only user of
hlist_for_each_entry_rcu_notrace (which indirectly uses the
rcu_dereference_raw_notrace API) with hlist_for_each_entry_rcu which
also does not do any tracing.

Signed-off-by: Joel Fernandes (Google) 
---
 kernel/trace/ftrace.c  | 4 ++--
 kernel/trace/ftrace_internal.h | 8 
 kernel/trace/trace.c   | 4 ++--
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index b920358dd8f7..f7d5f0ee69de 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -706,7 +706,7 @@ ftrace_find_profiled_func(struct ftrace_profile_stat *stat, 
unsigned long ip)
if (hlist_empty(hhd))
return NULL;
 
-   hlist_for_each_entry_rcu_notrace(rec, hhd, node) {
+   hlist_for_each_entry_rcu(rec, hhd, node) {
if (rec->ip == ip)
return rec;
}
@@ -1135,7 +1135,7 @@ __ftrace_lookup_ip(struct ftrace_hash *hash, unsigned 
long ip)
key = ftrace_hash_key(hash, ip);
hhd = &hash->buckets[key];
 
-   hlist_for_each_entry_rcu_notrace(entry, hhd, hlist) {
+   hlist_for_each_entry_rcu(entry, hhd, hlist) {
if (entry->ip == ip)
return entry;
}
diff --git a/kernel/trace/ftrace_internal.h b/kernel/trace/ftrace_internal.h
index 0515a2096f90..e3530a284f46 100644
--- a/kernel/trace/ftrace_internal.h
+++ b/kernel/trace/ftrace_internal.h
@@ -6,22 +6,22 @@
 
 /*
  * Traverse the ftrace_global_list, invoking all entries.  The reason that we
- * can use rcu_dereference_raw_notrace() is that elements removed from this list
+ * can use rcu_dereference_raw() is that elements removed from this list
  * are simply leaked, so there is no need to interact with a grace-period
- * mechanism.  The rcu_dereference_raw_notrace() calls are needed to handle
+ * mechanism.  The rcu_dereference_raw() calls are needed to handle
  * concurrent insertions into the ftrace_global_list.
  *
  * Silly Alpha and silly pointer-speculation compiler optimizations!
  */
 #define do_for_each_ftrace_op(op, list)\
-   op = rcu_dereference_raw_notrace(list); \
+   op = rcu_dereference_raw(list); \
do
 
 /*
  * Optimized for just a single item in the list (as that is the normal case).
  */
 #define while_for_each_ftrace_op(op)   \
-   while (likely(op = rcu_dereference_raw_notrace((op)->next)) &&  \
+   while (likely(op = rcu_dereference_raw((op)->next)) &&  \
   unlikely((op) != &ftrace_list_end))
 
 extern struct ftrace_ops __rcu *ftrace_ops_list;
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ec43f387..cb8d696d9cde 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2638,10 +2638,10 @@ static void ftrace_exports(struct ring_buffer_event 
*event)
 
preempt_disable_notrace();
 
-   export = rcu_dereference_raw_notrace(ftrace_exports_list);
+   export = rcu_dereference_raw(ftrace_exports_list);
while (export) {
trace_process_export(export, event);
-   export = rcu_dereference_raw_notrace(export->next);
+   export = rcu_dereference_raw(export->next);
}
 
preempt_enable_notrace();
-- 
2.22.0.rc1.257.g3120a18244-goog



[PATCH RFC 0/5] Remove some notrace RCU APIs

2019-05-24 Thread Joel Fernandes (Google)
The series removes users of the following APIs, and the APIs themselves, since
the regular non-_notrace variants don't do any tracing anyway.
 * hlist_for_each_entry_rcu_notrace
 * rcu_dereference_raw_notrace

Joel Fernandes (Google) (5):
powerpc: Use regular rcu_dereference_raw API
trace: Use regular rcu_dereference_raw API
hashtable: Use the regular hlist_for_each_entry_rcu API
rculist: Remove hlist_for_each_entry_rcu_notrace since no users
rcu: Remove rcu_dereference_raw_notrace since no users

.clang-format |  1 -
.../RCU/Design/Requirements/Requirements.html |  6 +++---
arch/powerpc/include/asm/kvm_book3s_64.h  |  2 +-
include/linux/hashtable.h |  2 +-
include/linux/rculist.h   | 20 ---
include/linux/rcupdate.h  |  9 -
kernel/trace/ftrace.c |  4 ++--
kernel/trace/ftrace_internal.h|  8 
kernel/trace/trace.c  |  4 ++--
9 files changed, 13 insertions(+), 43 deletions(-)

--
2.22.0.rc1.257.g3120a18244-goog



[PATCH RFC 1/5] powerpc: Use regular rcu_dereference_raw API

2019-05-24 Thread Joel Fernandes (Google)
rcu_dereference_raw already does not do any tracing. There is no need to
use the _notrace variant of it and this series removes that API, so let us
use the regular variant here.

Signed-off-by: Joel Fernandes (Google) 
---
 arch/powerpc/include/asm/kvm_book3s_64.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 21b1ed5df888..c15c9bbf0206 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -546,7 +546,7 @@ static inline void note_hpte_modification(struct kvm *kvm,
  */
 static inline struct kvm_memslots *kvm_memslots_raw(struct kvm *kvm)
 {
-   return rcu_dereference_raw_notrace(kvm->memslots[0]);
+   return rcu_dereference_raw(kvm->memslots[0]);
 }
 
 extern void kvmppc_mmu_debugfs_init(struct kvm *kvm);
-- 
2.22.0.rc1.257.g3120a18244-goog



[PATCH 4/4] x86: select HAVE_MOVE_PMD for faster mremap (v1)

2018-10-12 Thread Joel Fernandes (Google)
Moving page tables at the PMD level on x86 is known to be safe. Enable
this option so that we can do fast mremap when possible.

Signed-off-by: Joel Fernandes (Google) 
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1a0be022f91d..01c02a9d7825 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -171,6 +171,7 @@ config X86
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_MIXED_BREAKPOINTS_REGS
select HAVE_MOD_ARCH_SPECIFIC
+   select HAVE_MOVE_PMD
select HAVE_NMI
select HAVE_OPROFILE
select HAVE_OPTPROBES
-- 
2.19.0.605.g01d371f741-goog



[PATCH 3/4] arm64: select HAVE_MOVE_PMD for faster mremap (v1)

2018-10-12 Thread Joel Fernandes (Google)
Moving page tables at the PMD level on arm64 is known to be safe. Enable
this option so that we can do fast mremap when possible.

Signed-off-by: Joel Fernandes (Google) 
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1b1a0e95c751..5d7c35c6f90c 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -135,6 +135,7 @@ config ARM64
select HAVE_IRQ_TIME_ACCOUNTING
select HAVE_MEMBLOCK
select HAVE_MEMBLOCK_NODE_MAP if NUMA
+   select HAVE_MOVE_PMD
select HAVE_NMI
select HAVE_PATA_PLATFORM
select HAVE_PERF_EVENTS
-- 
2.19.0.605.g01d371f741-goog



[PATCH 2/4] mm: speed up mremap by 500x on large regions (v2)

2018-10-12 Thread Joel Fernandes (Google)
Android needs to mremap large regions of memory during memory management
related operations. The mremap system call can be really slow if THP is
not enabled. The bottleneck is move_page_tables, which copies one
pte at a time, and can be really slow across a large map. Turning on THP
may not be a viable option, and is not for us. This patch speeds up the
performance for non-THP system by copying at the PMD level when possible.

The speed up is three orders of magnitude. On a 1GB mremap, the mremap
completion times drops from 160-250 millesconds to 380-400 microseconds.

Before:
Total mremap time for 1GB data: 242321014 nanoseconds.
Total mremap time for 1GB data: 196842467 nanoseconds.
Total mremap time for 1GB data: 167051162 nanoseconds.

After:
Total mremap time for 1GB data: 385781 nanoseconds.
Total mremap time for 1GB data: 388959 nanoseconds.
Total mremap time for 1GB data: 402813 nanoseconds.

In case THP is enabled, the optimization is skipped. I also flush the
TLB every time we do this optimization since I couldn't find a way to
determine if the low-level PTEs are dirty. It is seen that the cost of
doing so is small compared to the improvement, on both x86-64 and arm64.
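
(For anyone wanting to reproduce the numbers above, a minimal userspace
harness along these lines should do -- a sketch, not the exact program used
for the measurements:)

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <time.h>

	#define SZ (1UL << 30)	/* 1GB */

	int main(void)
	{
		struct timespec t0, t1;
		void *src = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		/* reserve a destination so MREMAP_FIXED forces a move */
		void *dst = mmap(NULL, SZ, PROT_NONE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		memset(src, 1, SZ);	/* fault in and populate page tables */

		clock_gettime(CLOCK_MONOTONIC, &t0);
		void *q = mremap(src, SZ, SZ,
				 MREMAP_MAYMOVE | MREMAP_FIXED, dst);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		printf("Total mremap time for 1GB data: %ld nanoseconds.\n",
		       (t1.tv_sec - t0.tv_sec) * 1000000000L +
		       (t1.tv_nsec - t0.tv_nsec));
		return q == MAP_FAILED;
	}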

Cc: minc...@kernel.org
Cc: pan...@google.com
Cc: hu...@google.com
Cc: lokeshgi...@google.com
Cc: dan...@google.com
Cc: mho...@kernel.org
Cc: kir...@shutemov.name
Cc: a...@linux-foundation.org
Cc: kernel-t...@android.com
Signed-off-by: Joel Fernandes (Google) 
---
 arch/Kconfig |  5 
 mm/mremap.c  | 65 
 2 files changed, 70 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 6801123932a5..9724fe39884f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -518,6 +518,11 @@ config HAVE_IRQ_TIME_ACCOUNTING
  Archs need to ensure they use a high enough resolution clock to
  support irq time accounting and then call 
enable_sched_clock_irqtime().
 
+config HAVE_MOVE_PMD
+   bool
+   help
+ Archs that select this are able to move page tables at the PMD level.
+
 config HAVE_ARCH_TRANSPARENT_HUGEPAGE
bool
 
diff --git a/mm/mremap.c b/mm/mremap.c
index 9e68a02a52b1..2fd163cff406 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t 
*old_pmd,
drop_rmap_locks(vma);
 }
 
+static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, unsigned long old_end,
+ pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
+{
+   spinlock_t *old_ptl, *new_ptl;
+   struct mm_struct *mm = vma->vm_mm;
+
+   if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
+   || old_end - old_addr < PMD_SIZE)
+   return false;
+
+   /*
+* The destination pmd shouldn't be established, free_pgtables()
+* should have released it.
+*/
+   if (WARN_ON(!pmd_none(*new_pmd)))
+   return false;
+
+   /*
+* We don't have to worry about the ordering of src and dst
+* ptlocks because exclusive mmap_sem prevents deadlock.
+*/
+   old_ptl = pmd_lock(vma->vm_mm, old_pmd);
+   if (old_ptl) {
+   pmd_t pmd;
+
+   new_ptl = pmd_lockptr(mm, new_pmd);
+   if (new_ptl != old_ptl)
+   spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+
+   /* Clear the pmd */
+   pmd = *old_pmd;
+   pmd_clear(old_pmd);
+
+   VM_BUG_ON(!pmd_none(*new_pmd));
+
+   /* Set the new pmd */
+   set_pmd_at(mm, new_addr, new_pmd, pmd);
+   if (new_ptl != old_ptl)
+   spin_unlock(new_ptl);
+   spin_unlock(old_ptl);
+
+   *need_flush = true;
+   return true;
+   }
+   return false;
+}
+
 unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long old_addr, struct vm_area_struct *new_vma,
unsigned long new_addr, unsigned long len,
@@ -239,7 +287,24 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
split_huge_pmd(vma, old_pmd, old_addr);
if (pmd_trans_unstable(old_pmd))
continue;
+		} else if (extent == PMD_SIZE && IS_ENABLED(CONFIG_HAVE_MOVE_PMD)) {
+   /*
+* If the extent is PMD-sized, try to speed the move by
+* moving at the PMD level if possible.
+*/
+   bool moved;
+
+   if (need_rmap_locks)
+   take_rmap_locks(vma);
+   moved = move_normal_pmd(vma, old_addr, new_addr,
+   old_end, old_pmd, new_pmd,
+					&need_flush);

[PATCH 1/4] treewide: remove unused address argument from pte_alloc functions (v2)

2018-10-12 Thread Joel Fernandes (Google)
This series speeds up the mremap(2) syscall by copying page tables at the
PMD level even for non-THP systems. There is concern that the extra
'address' argument that mremap passes to pte_alloc may do something
subtle and architecture-related in the future that may make the scheme
not work.  We also find that there is no point in passing the 'address'
to pte_alloc since it is unused. This patch therefore removes the
argument tree-wide, resulting in a nice negative diff as well, while
ensuring along the way that the enabled architectures do not do anything
funky with the 'address' argument that goes unnoticed by the optimization.

Build and boot tested on x86-64. Build tested on arm64.

The changes were obtained by applying the following Coccinelle script
(thanks to Julia for answering all my Coccinelle questions!).
The following fix-ups were done manually:
* Removal of the address argument from pte_fragment_alloc
* Removal of the pte_alloc_one_fast definitions from m68k and microblaze.
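
(For reference, a typical invocation of such a script -- the flags match the
Options line below; the file name is arbitrary:)

	spatch --sp-file pte_alloc.cocci --include-headers --no-includes \
		--in-place --dir .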

// Options: --include-headers --no-includes
// Note: I split the 'identifier fn' line, so if you are manually
// running it, please unsplit it so it runs for you.

virtual patch

@pte_alloc_func_def depends on patch exists@
identifier E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
type T2;
@@

 fn(...
- , T2 E2
 )
 { ... }

@pte_alloc_func_proto_noarg depends on patch exists@
type T1, T2, T3, T4;
identifier fn =~ 
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

(
- T3 fn(T1, T2);
+ T3 fn(T1);
|
- T3 fn(T1, T2, T4);
+ T3 fn(T1, T2);
)

@pte_alloc_func_proto depends on patch exists@
identifier E1, E2, E4;
type T1, T2, T3, T4;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

(
- T3 fn(T1 E1, T2 E2);
+ T3 fn(T1 E1);
|
- T3 fn(T1 E1, T2 E2, T4 E4);
+ T3 fn(T1 E1, T2 E2);
)

@pte_alloc_func_call depends on patch exists@
expression E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

 fn(...
-,  E2
 )

@pte_alloc_macro depends on patch exists@
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
identifier a, b, c;
expression e;
position p;
@@

(
- #define fn(a, b, c) e
+ #define fn(a, b) e
|
- #define fn(a, b) e
+ #define fn(a) e
)

Suggested-by: Kirill A. Shutemov 
Cc: Kirill A. Shutemov 
Cc: Michal Hocko 
Cc: Julia Lawall 
Signed-off-by: Joel Fernandes (Google) 
---
 arch/alpha/include/asm/pgalloc.h |  6 +++---
 arch/arc/include/asm/pgalloc.h   |  5 ++---
 arch/arm/include/asm/pgalloc.h   |  4 ++--
 arch/arm64/include/asm/pgalloc.h |  4 ++--
 arch/hexagon/include/asm/pgalloc.h   |  6 ++
 arch/ia64/include/asm/pgalloc.h  |  5 ++---
 arch/m68k/include/asm/mcf_pgalloc.h  |  8 ++--
 arch/m68k/include/asm/motorola_pgalloc.h |  4 ++--
 arch/m68k/include/asm/sun3_pgalloc.h |  6 ++
 arch/microblaze/include/asm/pgalloc.h| 19 ++-
 arch/microblaze/mm/pgtable.c |  3 +--
 arch/mips/include/asm/pgalloc.h  |  6 ++
 arch/nds32/include/asm/pgalloc.h |  5 ++---
 arch/nios2/include/asm/pgalloc.h |  6 ++
 arch/openrisc/include/asm/pgalloc.h  |  5 ++---
 arch/openrisc/mm/ioremap.c   |  3 +--
 arch/parisc/include/asm/pgalloc.h|  4 ++--
 arch/powerpc/include/asm/book3s/32/pgalloc.h |  4 ++--
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 12 +---
 arch/powerpc/include/asm/nohash/32/pgalloc.h |  4 ++--
 arch/powerpc/include/asm/nohash/64/pgalloc.h |  6 ++
 arch/powerpc/mm/pgtable-book3s64.c   |  2 +-
 arch/powerpc/mm/pgtable_32.c |  4 ++--
 arch/riscv/include/asm/pgalloc.h |  6 ++
 arch/s390/include/asm/pgalloc.h  |  4 ++--
 arch/sh/include/asm/pgalloc.h|  6 ++
 arch/sparc/include/asm/pgalloc_32.h  |  5 ++---
 arch/sparc/include/asm/pgalloc_64.h  |  6 ++
 arch/sparc/mm/init_64.c  |  6 ++
 arch/sparc/mm/srmmu.c|  4 ++--
 arch/um/include/asm/pgalloc.h|  4 ++--
 arch/um/kernel/mem.c |  4 ++--
 arch/unicore32/include/asm/pgalloc.h |  4 ++--
 arch/x86/include/asm/pgalloc.h   |  4 ++--
 arch/x86/mm/pgtable.c|  4 ++--
 arch/xtensa/include/asm/pgalloc.h|  8 +++-
 include/linux/mm.h   | 13 ++---
 mm/huge_memory.c |  8 
 mm/kasan/kasan_init.c|  2 +-
 mm/memory.c  | 17 -
 mm/migrate.c |  2 +-
 mm/mremap.c   

[PATCH 0/4] Add support for fast mremap

2018-10-12 Thread Joel Fernandes (Google)
Hi,
Here is the latest "fast mremap" series. The main change in this submission is
to enable the fast mremap optimization on a per-architecture basis to prevent
possible issues with architectures that may not behave well with such a change.

x86: select HAVE_MOVE_PMD for faster mremap (v1)

arm64: select HAVE_MOVE_PMD for faster mremap (v1)

mm: speed up mremap by 500x on large regions (v2)
v1->v2: Added support for per-arch enablement (Kirill Shutemov)

treewide: remove unused address argument from pte_alloc functions (v2)
v1->v2: fix arch/um/ prototype which was missed in v1 (Anton Ivanov)
update changelog with manual fixups for m68k and microblaze.

Joel Fernandes (Google) (4):
  treewide: remove unused address argument from pte_alloc functions (v2)
  mm: speed up mremap by 500x on large regions (v2)
  arm64: select HAVE_MOVE_PMD for faster mremap (v1)
  x86: select HAVE_MOVE_PMD for faster mremap (v1)

 arch/Kconfig |  5 ++
 arch/alpha/include/asm/pgalloc.h |  6 +-
 arch/arc/include/asm/pgalloc.h   |  5 +-
 arch/arm/include/asm/pgalloc.h   |  4 +-
 arch/arm64/Kconfig   |  1 +
 arch/arm64/include/asm/pgalloc.h |  4 +-
 arch/hexagon/include/asm/pgalloc.h   |  6 +-
 arch/ia64/include/asm/pgalloc.h  |  5 +-
 arch/m68k/include/asm/mcf_pgalloc.h  |  8 +--
 arch/m68k/include/asm/motorola_pgalloc.h |  4 +-
 arch/m68k/include/asm/sun3_pgalloc.h |  6 +-
 arch/microblaze/include/asm/pgalloc.h| 19 +-
 arch/microblaze/mm/pgtable.c |  3 +-
 arch/mips/include/asm/pgalloc.h  |  6 +-
 arch/nds32/include/asm/pgalloc.h |  5 +-
 arch/nios2/include/asm/pgalloc.h |  6 +-
 arch/openrisc/include/asm/pgalloc.h  |  5 +-
 arch/openrisc/mm/ioremap.c   |  3 +-
 arch/parisc/include/asm/pgalloc.h|  4 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h |  4 +-
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 12 ++--
 arch/powerpc/include/asm/nohash/32/pgalloc.h |  4 +-
 arch/powerpc/include/asm/nohash/64/pgalloc.h |  6 +-
 arch/powerpc/mm/pgtable-book3s64.c   |  2 +-
 arch/powerpc/mm/pgtable_32.c |  4 +-
 arch/riscv/include/asm/pgalloc.h |  6 +-
 arch/s390/include/asm/pgalloc.h  |  4 +-
 arch/sh/include/asm/pgalloc.h|  6 +-
 arch/sparc/include/asm/pgalloc_32.h  |  5 +-
 arch/sparc/include/asm/pgalloc_64.h  |  6 +-
 arch/sparc/mm/init_64.c  |  6 +-
 arch/sparc/mm/srmmu.c|  4 +-
 arch/um/include/asm/pgalloc.h|  4 +-
 arch/um/kernel/mem.c |  4 +-
 arch/unicore32/include/asm/pgalloc.h |  4 +-
 arch/x86/Kconfig |  1 +
 arch/x86/include/asm/pgalloc.h   |  4 +-
 arch/x86/mm/pgtable.c|  4 +-
 arch/xtensa/include/asm/pgalloc.h|  8 +--
 include/linux/mm.h   | 13 ++--
 mm/huge_memory.c |  8 +--
 mm/kasan/kasan_init.c|  2 +-
 mm/memory.c  | 17 +++--
 mm/migrate.c |  2 +-
 mm/mremap.c  | 67 +++-
 mm/userfaultfd.c |  2 +-
 virt/kvm/arm/mmu.c   |  2 +-
 47 files changed, 169 insertions(+), 147 deletions(-)

-- 
2.19.0.605.g01d371f741-goog


[PATCH v2 2/2] mm: speed up mremap by 500x on large regions

2018-10-11 Thread Joel Fernandes (Google)
Android needs to mremap large regions of memory during memory management
related operations. The mremap system call can be really slow if THP is
not enabled. The bottleneck is move_page_tables, which copies one
pte at a time, and can be really slow across a large map. Turning on THP
may not be a viable option, and is not for us. This patch speeds up the
performance for non-THP system by copying at the PMD level when possible.

The speed up is three orders of magnitude. On a 1GB mremap, the mremap
completion time drops from 160-250 milliseconds to 380-400 microseconds.

Before:
Total mremap time for 1GB data: 242321014 nanoseconds.
Total mremap time for 1GB data: 196842467 nanoseconds.
Total mremap time for 1GB data: 167051162 nanoseconds.

After:
Total mremap time for 1GB data: 385781 nanoseconds.
Total mremap time for 1GB data: 388959 nanoseconds.
Total mremap time for 1GB data: 402813 nanoseconds.

In case THP is enabled, the optimization is skipped. I also flush the
TLB every time we do this optimization since I couldn't find a way to
determine if the low-level PTEs are dirty. It is seen that the cost of
doing so is small compared to the improvement, on both x86-64 and arm64.

Cc: minc...@kernel.org
Cc: pan...@google.com
Cc: hu...@google.com
Cc: lokeshgi...@google.com
Cc: dan...@google.com
Cc: mho...@kernel.org
Cc: kir...@shutemov.name
Cc: a...@linux-foundation.org
Signed-off-by: Joel Fernandes (Google) 
---
 mm/mremap.c | 62 +
 1 file changed, 62 insertions(+)

diff --git a/mm/mremap.c b/mm/mremap.c
index 9e68a02a52b1..d82c485822ef 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t 
*old_pmd,
drop_rmap_locks(vma);
 }
 
+static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, unsigned long old_end,
+ pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
+{
+   spinlock_t *old_ptl, *new_ptl;
+   struct mm_struct *mm = vma->vm_mm;
+
+   if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
+   || old_end - old_addr < PMD_SIZE)
+   return false;
+
+   /*
+* The destination pmd shouldn't be established, free_pgtables()
+* should have released it.
+*/
+   if (WARN_ON(!pmd_none(*new_pmd)))
+   return false;
+
+   /*
+* We don't have to worry about the ordering of src and dst
+* ptlocks because exclusive mmap_sem prevents deadlock.
+*/
+   old_ptl = pmd_lock(vma->vm_mm, old_pmd);
+   if (old_ptl) {
+   pmd_t pmd;
+
+   new_ptl = pmd_lockptr(mm, new_pmd);
+   if (new_ptl != old_ptl)
+   spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+
+   /* Clear the pmd */
+   pmd = *old_pmd;
+   pmd_clear(old_pmd);
+
+   VM_BUG_ON(!pmd_none(*new_pmd));
+
+   /* Set the new pmd */
+   set_pmd_at(mm, new_addr, new_pmd, pmd);
+   if (new_ptl != old_ptl)
+   spin_unlock(new_ptl);
+   spin_unlock(old_ptl);
+
+   *need_flush = true;
+   return true;
+   }
+   return false;
+}
+
 unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long old_addr, struct vm_area_struct *new_vma,
unsigned long new_addr, unsigned long len,
@@ -239,7 +287,21 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
split_huge_pmd(vma, old_pmd, old_addr);
if (pmd_trans_unstable(old_pmd))
continue;
+   } else if (extent == PMD_SIZE) {
+   bool moved;
+
+   /* See comment in move_ptes() */
+   if (need_rmap_locks)
+   take_rmap_locks(vma);
+   moved = move_normal_pmd(vma, old_addr, new_addr,
+   old_end, old_pmd, new_pmd,
+   &need_flush);
+   if (need_rmap_locks)
+   drop_rmap_locks(vma);
+   if (moved)
+   continue;
}
+
if (pte_alloc(new_vma->vm_mm, new_pmd))
break;
next = (new_addr + PMD_SIZE) & PMD_MASK;
-- 
2.19.0.605.g01d371f741-goog



[PATCH v2 1/2] treewide: remove unused address argument from pte_alloc functions

2018-10-11 Thread Joel Fernandes (Google)
This series speeds up the mremap(2) syscall by copying page tables at the
PMD level even for non-THP systems. There is concern that the extra
'address' argument that mremap passes to pte_alloc may do something
subtle and architecture-related in the future that makes the scheme
not work.  We also find that there is no point in passing the 'address'
to pte_alloc since it is unused.

This patch therefore removes the argument tree-wide, resulting in a nice
negative diff as well, while ensuring along the way that the architecture
does not do anything funky with the 'address' argument that goes unnoticed.

Build and boot tested on x86-64. Build tested on arm64.
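
To make the shape of the conversion concrete, each pte_alloc-family
prototype loses the trailing address parameter; a paraphrased example
(illustrative, not an exact hunk from the diff below):

-pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address);
+pte_t *pte_alloc_one_kernel(struct mm_struct *mm);

with every definition, declaration, and call site of these functions
changed the same way by the script.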

The changes were obtained by applying the following Coccinelle script.
pte_fragment_alloc was fixed up manually since it had only 2
occurrences and could not be easily generalized (and thanks Julia for
answering all my silly and not-silly Coccinelle questions!).

// Options: --include-headers --no-includes
// Note: I split the 'identifier fn' line, so if you are manually
// running it, please unsplit it so it runs for you.

virtual patch

@pte_alloc_func_def depends on patch exists@
identifier E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
type T2;
@@

 fn(...
- , T2 E2
 )
 { ... }

@pte_alloc_func_proto depends on patch exists@
identifier E1, E2, E4;
type T1, T2, T3, T4;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

(
- T3 fn(T1 E1, T2 E2);
+ T3 fn(T1 E1);
|
- T3 fn(T1 E1, T2 E2, T4 E4);
+ T3 fn(T1 E1, T2 E2);
)

@pte_alloc_func_call depends on patch exists@
expression E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

 fn(...
-,  E2
 )

@pte_alloc_macro depends on patch exists@
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
identifier a, b, c;
expression e;
position p;
@@

(
- #define fn(a, b, c)@p e
+ #define fn(a, b) e
|
- #define fn(a, b)@p e
+ #define fn(a) e
)
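
For reference, the script above can be applied with an invocation along
these lines (the .cocci file name is mine; remember to unsplit the
'identifier fn' lines first):

spatch --include-headers --no-includes --sp-file pte_alloc.cocci \
	--in-place --dir .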

Suggested-by: Kirill A. Shutemov 
Cc: Michal Hocko 
Cc: Julia Lawall 
Cc: elfr...@users.sourceforge.net
Signed-off-by: Joel Fernandes (Google) 
---
 arch/alpha/include/asm/pgalloc.h |  6 +++---
 arch/arc/include/asm/pgalloc.h   |  5 ++---
 arch/arm/include/asm/pgalloc.h   |  4 ++--
 arch/arm64/include/asm/pgalloc.h |  4 ++--
 arch/hexagon/include/asm/pgalloc.h   |  6 ++
 arch/ia64/include/asm/pgalloc.h  |  5 ++---
 arch/m68k/include/asm/mcf_pgalloc.h  |  8 ++--
 arch/m68k/include/asm/motorola_pgalloc.h |  4 ++--
 arch/m68k/include/asm/sun3_pgalloc.h |  6 ++
 arch/microblaze/include/asm/pgalloc.h| 19 ++-
 arch/microblaze/mm/pgtable.c |  3 +--
 arch/mips/include/asm/pgalloc.h  |  6 ++
 arch/nds32/include/asm/pgalloc.h |  5 ++---
 arch/nios2/include/asm/pgalloc.h |  6 ++
 arch/openrisc/include/asm/pgalloc.h  |  5 ++---
 arch/openrisc/mm/ioremap.c   |  3 +--
 arch/parisc/include/asm/pgalloc.h|  4 ++--
 arch/powerpc/include/asm/book3s/32/pgalloc.h |  4 ++--
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 12 +---
 arch/powerpc/include/asm/nohash/32/pgalloc.h |  4 ++--
 arch/powerpc/include/asm/nohash/64/pgalloc.h |  6 ++
 arch/powerpc/mm/pgtable-book3s64.c   |  2 +-
 arch/powerpc/mm/pgtable_32.c |  4 ++--
 arch/riscv/include/asm/pgalloc.h |  6 ++
 arch/s390/include/asm/pgalloc.h  |  4 ++--
 arch/sh/include/asm/pgalloc.h|  6 ++
 arch/sparc/include/asm/pgalloc_32.h  |  5 ++---
 arch/sparc/include/asm/pgalloc_64.h  |  6 ++
 arch/sparc/mm/init_64.c  |  6 ++
 arch/sparc/mm/srmmu.c|  4 ++--
 arch/um/kernel/mem.c |  4 ++--
 arch/unicore32/include/asm/pgalloc.h |  4 ++--
 arch/x86/include/asm/pgalloc.h   |  4 ++--
 arch/x86/mm/pgtable.c|  4 ++--
 arch/xtensa/include/asm/pgalloc.h|  8 +++-
 include/linux/mm.h   | 13 ++---
 mm/huge_memory.c |  8 
 mm/kasan/kasan_init.c|  2 +-
 mm/memory.c  | 17 -
 mm/migrate.c |  2 +-
 mm/mremap.c  |  2 +-
 mm/userfaultfd.c |  2 +-
 virt/kvm/arm/mmu.c   |  2 +-
 43 files changed, 95 insertions(+), 145 deletions(-)

diff --git a/arch/alpha/include/asm/pgalloc.h b/arch/alpha/include/asm/pgalloc.h
index ab3e3a8638fb..02f9f91bb4f0 100644
--- a/arch/alpha/include/asm/pgalloc.h
+++ b/arch/al