Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-17 Thread Andy Lutomirski
On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
>
> module_alloc() is used everywhere as a mean to allocate memory for code.
>
> Beside being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF to modules
> and puts the burden of code allocation to the modules code.
>
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
>
> Start splitting code allocation from modules by introducing
> execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
>
> Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> module_alloc() and execmem_free() and jit_free() are replacements of
> module_memfree() to allow updating all call sites to use the new APIs.
>
> The intention semantics for new allocation APIs:
>
> * execmem_text_alloc() should be used to allocate memory that must reside
>   close to the kernel image, like loadable kernel modules and generated
>   code that is restricted by relative addressing.
>
> * jit_text_alloc() should be used to allocate memory for generated code
>   when there are no restrictions for the code placement. For
>   architectures that require that any code is within certain distance
>   from the kernel image, jit_text_alloc() will be essentially aliased to
>   execmem_text_alloc().
>

Is there anything in this series to help users do the appropriate 
synchronization when the actually populate the allocated memory with code?  See 
here, for example:

https://lore.kernel.org/linux-fsdevel/cb6533c6-cea0-4f04-95cf-b8240c6ab...@app.fastmail.com/T/#u


Re: [PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions

2023-06-17 Thread Kent Overstreet
On Sat, Jun 17, 2023 at 09:38:17AM -0700, Song Liu wrote:
> On Sat, Jun 17, 2023 at 8:37 AM Kent Overstreet
>  wrote:
> >
> > On Sat, Jun 17, 2023 at 09:57:59AM +0300, Mike Rapoport wrote:
> > > > This is growing fast. :) We have 3 now: text, data, jit. And it will be
> > > > 5 when we split data into rw data, ro data, ro after init data. I wonder
> > > > whether we should still do some type enum here. But we can revisit
> > > > this topic later.
> > >
> > > I don't think we'd need 5. Four at most :)
> > >
> > > I don't know yet what would be the best way to differentiate RW and RO
> > > data, but ro_after_init surely won't need a new type. It either will be
> > > allocated as RW and then the caller will have to set it RO after
> > > initialization is done, or it will be allocated as RO and the caller will
> > > have to do something like text_poke to update it.
> >
> > Perhaps ro_after_init could use the same allocation interface and share
> > pages with ro pages - if we just added a refcount for "this page
> > currently needs to be rw, module is still loading?"
> 
> If we don't relax rules with read only, we will have to separate rw, ro,
> and ro_after_init. But we can still have page sharing:
> 
> Two modules can put rw data on the same page.
> With text poke (ro data poke to be accurate), two modules can put
> ro data on the same page.
> 
> > text_poke() approach wouldn't be workable, you'd have to audit and fix
> > all module init code in the entire kernel.
> 
> Agreed. For this reason, each module has to have its own page(s) for
> ro_after_init data.

Relaxing page permissions to allow for page sharing could also be a
config option. For archs with 64k pages it seems worthwhile.


Re: [PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions

2023-06-17 Thread Song Liu
On Sat, Jun 17, 2023 at 8:37 AM Kent Overstreet
 wrote:
>
> On Sat, Jun 17, 2023 at 09:57:59AM +0300, Mike Rapoport wrote:
> > > This is growing fast. :) We have 3 now: text, data, jit. And it will be
> > > 5 when we split data into rw data, ro data, ro after init data. I wonder
> > > whether we should still do some type enum here. But we can revisit
> > > this topic later.
> >
> > I don't think we'd need 5. Four at most :)
> >
> > I don't know yet what would be the best way to differentiate RW and RO
> > data, but ro_after_init surely won't need a new type. It either will be
> > allocated as RW and then the caller will have to set it RO after
> > initialization is done, or it will be allocated as RO and the caller will
> > have to do something like text_poke to update it.
>
> Perhaps ro_after_init could use the same allocation interface and share
> pages with ro pages - if we just added a refcount for "this page
> currently needs to be rw, module is still loading?"

If we don't relax rules with read only, we will have to separate rw, ro,
and ro_after_init. But we can still have page sharing:

Two modules can put rw data on the same page.
With text poke (ro data poke to be accurate), two modules can put
ro data on the same page.

> text_poke() approach wouldn't be workable, you'd have to audit and fix
> all module init code in the entire kernel.

Agreed. For this reason, each module has to have its own page(s) for
ro_after_init data.

To eventually remove VM_FLUSH_RESET_PERMS, we want
ro_after_init data to share the same allocation interface.

Thanks,
Song


Re: [PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions

2023-06-17 Thread Kent Overstreet
On Sat, Jun 17, 2023 at 09:57:59AM +0300, Mike Rapoport wrote:
> > This is growing fast. :) We have 3 now: text, data, jit. And it will be
> > 5 when we split data into rw data, ro data, ro after init data. I wonder
> > whether we should still do some type enum here. But we can revisit
> > this topic later.
> 
> I don't think we'd need 5. Four at most :)
> 
> I don't know yet what would be the best way to differentiate RW and RO
> data, but ro_after_init surely won't need a new type. It either will be
> allocated as RW and then the caller will have to set it RO after
> initialization is done, or it will be allocated as RO and the caller will
> have to do something like text_poke to update it.

Perhaps ro_after_init could use the same allocation interface and share
pages with ro pages - if we just added a refcount for "this page
currently needs to be rw, module is still loading?"

text_poke() approach wouldn't be workable, you'd have to audit and fix
all module init code in the entire kernel.


Re: [PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions

2023-06-17 Thread Mike Rapoport
On Fri, Jun 16, 2023 at 01:05:29PM -0700, Song Liu wrote:
> On Fri, Jun 16, 2023 at 1:52 AM Mike Rapoport  wrote:
> >
> > From: "Mike Rapoport (IBM)" 
> >
> > The memory allocations for kprobes on arm64 can be placed anywhere in
> > vmalloc address space and currently this is implemented with an override
> > of alloc_insn_page() in arm64.
> >
> > Extend execmem_params with a range for generated code allocations and
> > make kprobes on arm64 use this extension rather than override
> > alloc_insn_page().
> >
> > Signed-off-by: Mike Rapoport (IBM) 
> > ---
> >  arch/arm64/kernel/module.c |  9 +
> >  arch/arm64/kernel/probes/kprobes.c |  7 ---
> >  include/linux/execmem.h| 11 +++
> >  mm/execmem.c   | 14 +-
> >  4 files changed, 33 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> > index c3d999f3a3dd..52b09626bc0f 100644
> > --- a/arch/arm64/kernel/module.c
> > +++ b/arch/arm64/kernel/module.c
> > @@ -30,6 +30,13 @@ static struct execmem_params execmem_params = {
> > .alignment = MODULE_ALIGN,
> > },
> > },
> > +   .jit = {
> > +   .text = {
> > +   .start = VMALLOC_START,
> > +   .end = VMALLOC_END,
> > +   .alignment = 1,
> > +   },
> > +   },
> >  };
> 
> This is growing fast. :) We have 3 now: text, data, jit. And it will be
> 5 when we split data into rw data, ro data, ro after init data. I wonder
> whether we should still do some type enum here. But we can revisit
> this topic later.

I don't think we'd need 5. Four at most :)

I don't know yet what would be the best way to differentiate RW and RO
data, but ro_after_init surely won't need a new type. It either will be
allocated as RW and then the caller will have to set it RO after
initialization is done, or it will be allocated as RO and the caller will
have to do something like text_poke to update it.
 
> Other than that
> 
> Acked-by: Song Liu 

-- 
Sincerely yours,
Mike.


Re: [PATCH v2 12/12] kprobes: remove dependcy on CONFIG_MODULES

2023-06-17 Thread Mike Rapoport
On Fri, Jun 16, 2023 at 01:44:55PM +0200, Björn Töpel wrote:
> Mike Rapoport  writes:
> 
> > From: "Mike Rapoport (IBM)" 
> >
> > kprobes depended on CONFIG_MODULES because it has to allocate memory for
> > code.
> 
> I think you can remove the MODULES dependency from BPF_JIT as well:

Yeah, I think so. Thanks!
 
> --8<--
> diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
> index 2dfe1079f772..fa4587027f8b 100644
> --- a/kernel/bpf/Kconfig
> +++ b/kernel/bpf/Kconfig
> @@ -41,7 +41,6 @@ config BPF_JIT
> bool "Enable BPF Just In Time compiler"
> depends on BPF
> depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT
> -   depends on MODULES
> help
>   BPF programs are normally handled by a BPF interpreter. This option
>   allows the kernel to generate native code when a program is loaded
> --8<--
> 
> 
> Björn

-- 
Sincerely yours,
Mike.


Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()

2023-06-17 Thread Mike Rapoport
On Fri, Jun 16, 2023 at 01:01:08PM -0700, Song Liu wrote:
> On Fri, Jun 16, 2023 at 1:51 AM Mike Rapoport  wrote:
> >
> > From: "Mike Rapoport (IBM)" 
> >
> > Data related to code allocations, such as module data section, need to
> > comply with architecture constraints for its placement and its
> > allocation right now was done using execmem_text_alloc().
> >
> > Create a dedicated API for allocating data related to code allocations
> > and allow architectures to define address ranges for data allocations.
> >
> > Since currently this is only relevant for powerpc variants that use the
> > VMALLOC address space for module data allocations, automatically reuse
> > address ranges defined for text unless address range for data is
> > explicitly defined by an architecture.
> >
> > With separation of code and data allocations, data sections of the
> > modules are now mapped as PAGE_KERNEL rather than PAGE_KERNEL_EXEC which
> > was a default on many architectures.
> >
> > Signed-off-by: Mike Rapoport (IBM) 
> [...]
> >  static void free_mod_mem(struct module *mod)
> > diff --git a/mm/execmem.c b/mm/execmem.c
> > index a67acd75ffef..f7bf496ad4c3 100644
> > --- a/mm/execmem.c
> > +++ b/mm/execmem.c
> > @@ -63,6 +63,20 @@ void *execmem_text_alloc(size_t size)
> >  fallback_start, fallback_end, kasan);
> >  }
> >
> > +void *execmem_data_alloc(size_t size)
> > +{
> > +   unsigned long start = execmem_params.modules.data.start;
> > +   unsigned long end = execmem_params.modules.data.end;
> > +   pgprot_t pgprot = execmem_params.modules.data.pgprot;
> > +   unsigned int align = execmem_params.modules.data.alignment;
> > +   unsigned long fallback_start = 
> > execmem_params.modules.data.fallback_start;
> > +   unsigned long fallback_end = 
> > execmem_params.modules.data.fallback_end;
> > +   bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW;
> > +
> > +   return execmem_alloc(size, start, end, align, pgprot,
> > +fallback_start, fallback_end, kasan);
> > +}
> > +
> >  void execmem_free(void *ptr)
> >  {
> > /*
> > @@ -101,6 +115,28 @@ static bool execmem_validate_params(struct 
> > execmem_params *p)
> > return true;
> >  }
> >
> > +static void execmem_init_missing(struct execmem_params *p)
> 
> Shall we call this execmem_default_init_data?

This also fills in jit.text (next patch), so _data doesn't work here :)
And it's not really a default, the defaults are set explicitly for arches
that don't have execmem_arch_params.
 
> > +{
> > +   struct execmem_modules_range *m = >modules;
> > +
> > +   if (!pgprot_val(execmem_params.modules.data.pgprot))
> > +   execmem_params.modules.data.pgprot = PAGE_KERNEL;
> 
> Do we really need to check each of these? IOW, can we do:
> 
> if (!pgprot_val(execmem_params.modules.data.pgprot)) {
>execmem_params.modules.data.pgprot = PAGE_KERNEL;
>execmem_params.modules.data.alignment = m->text.alignment;
>execmem_params.modules.data.start = m->text.start;
>execmem_params.modules.data.end = m->text.end;
>execmem_params.modules.data.fallback_start = m->text.fallback_start;
>   execmem_params.modules.data.fallback_end = m->text.fallback_end;
> }

Yes, we can have a single check here.
 
> Thanks,
> Song
> 
> [...]

-- 
Sincerely yours,
Mike.


Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()

2023-06-17 Thread Mike Rapoport
On Fri, Jun 16, 2023 at 04:55:53PM +, Edgecombe, Rick P wrote:
> On Fri, 2023-06-16 at 11:50 +0300, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" 
> > 
> > Data related to code allocations, such as module data section, need
> > to
> > comply with architecture constraints for its placement and its
> > allocation right now was done using execmem_text_alloc().
> > 
> > Create a dedicated API for allocating data related to code
> > allocations
> > and allow architectures to define address ranges for data
> > allocations.
> 
> Right now the cross-arch way to specify kernel memory permissions is
> encoded in the function names of all the set_memory_foo()'s. You can't
> just have unified prot names because some arch's have NX and some have
> X bits, etc. CPA wouldn't know if it needs to set or unset a bit if you
> pass in a PROT.

The idea is that allocator never uses CPA. It allocates with the
permissions defined by architecture at the first place and then the callers
might use CPA to change them. Ideally, that shouldn't be needed for
anything but ro_after_init, but we are far from there.

> But then you end up with a new function for *each* combination (i.e.
> set_memory_rox()). I wish CPA has flags like mmap() does, and I wonder
> if it makes sense here instead of execmem_data_alloc().

I don't think execmem should handle all the combinations. The code is
always allocated as ROX for architectures that support text poking and as
RWX for those that don't.

Maybe execmem_data_alloc() will need to able to handle RW and RO data
differently at some point, but that's the only variant I can think of, but
even then I don't expect CPA will be used by execmem. 

We also might move resetting the permissions to default from vmalloc to
execmem, but again, we are far from there.
 
> Maybe that is an overhaul for another day though...

CPA surely needs some love :)

-- 
Sincerely yours,
Mike.


Re: [PATCH v2 04/12] mm/execmem, arch: convert remaining overrides of module_alloc to execmem

2023-06-17 Thread Mike Rapoport
On Fri, Jun 16, 2023 at 11:53:54AM -0700, Song Liu wrote:
> On Fri, Jun 16, 2023 at 1:51 AM Mike Rapoport  wrote:
> [...]
> > diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> > index 5af4975caeb5..c3d999f3a3dd 100644
> > --- a/arch/arm64/kernel/module.c
> > +++ b/arch/arm64/kernel/module.c
> > @@ -17,56 +17,50 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> >  #include 
> >
> > -void *module_alloc(unsigned long size)
> > +static struct execmem_params execmem_params = {
> > +   .modules = {
> > +   .flags = EXECMEM_KASAN_SHADOW,
> > +   .text = {
> > +   .alignment = MODULE_ALIGN,
> > +   },
> > +   },
> > +};
> > +
> > +struct execmem_params __init *execmem_arch_params(void)
> >  {
> > u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
> > -   gfp_t gfp_mask = GFP_KERNEL;
> > -   void *p;
> > -
> > -   /* Silence the initial allocation */
> > -   if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
> > -   gfp_mask |= __GFP_NOWARN;
> >
> > -   if (IS_ENABLED(CONFIG_KASAN_GENERIC) ||
> > -   IS_ENABLED(CONFIG_KASAN_SW_TAGS))
> > -   /* don't exceed the static module region - see below */
> > -   module_alloc_end = MODULES_END;
> > +   execmem_params.modules.text.pgprot = PAGE_KERNEL;
> > +   execmem_params.modules.text.start = module_alloc_base;
> 
> I think I mentioned this earlier. For arm64 with CONFIG_RANDOMIZE_BASE,
> module_alloc_base is not yet set when execmem_arch_params() is
> called. So we will need some extra logic for this.

Right, this is taken care of in a later patch, but it actually belongs here.
Good catch!
 
> Thanks,
> Song
> 
> 
> > +   execmem_params.modules.text.end = module_alloc_end;
> >
> > -   p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> > -   module_alloc_end, gfp_mask, PAGE_KERNEL, 
> > VM_DEFER_KMEMLEAK,
> > -   NUMA_NO_NODE, __builtin_return_address(0));
> > -
> > -   if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
> > +   /*
> > +* KASAN without KASAN_VMALLOC can only deal with module
> > +* allocations being served from the reserved module region,
> > +* since the remainder of the vmalloc region is already
> > +* backed by zero shadow pages, and punching holes into it
> > +* is non-trivial. Since the module region is not randomized
> > +* when KASAN is enabled without KASAN_VMALLOC, it is even
> > +* less likely that the module region gets exhausted, so we
> > +* can simply omit this fallback in that case.
> > +*/
> > +   if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
> > (IS_ENABLED(CONFIG_KASAN_VMALLOC) ||
> >  (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
> > - !IS_ENABLED(CONFIG_KASAN_SW_TAGS
> > -   /*
> > -* KASAN without KASAN_VMALLOC can only deal with module
> > -* allocations being served from the reserved module region,
> > -* since the remainder of the vmalloc region is already
> > -* backed by zero shadow pages, and punching holes into it
> > -* is non-trivial. Since the module region is not randomized
> > -* when KASAN is enabled without KASAN_VMALLOC, it is even
> > -* less likely that the module region gets exhausted, so we
> > -* can simply omit this fallback in that case.
> > -*/
> > -   p = __vmalloc_node_range(size, MODULE_ALIGN, 
> > module_alloc_base,
> > -   module_alloc_base + SZ_2G, GFP_KERNEL,
> > -   PAGE_KERNEL, 0, NUMA_NO_NODE,
> > -   __builtin_return_address(0));
> > -
> > -   if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
> > -   vfree(p);
> > -   return NULL;
> > + !IS_ENABLED(CONFIG_KASAN_SW_TAGS {
> > +   unsigned long end = module_alloc_base + SZ_2G;
> > +
> > +   execmem_params.modules.text.fallback_start = 
> > module_alloc_base;
> > +   execmem_params.modules.text.fallback_end = end;
> > }
> >
> > -   /* Memory is intended to be executable, reset the pointer tag. */
> > -   return kasan_reset_tag(p);
> > +   return _params;
> >  }
> >
> >  enum aarch64_reloc_op {

-- 
Sincerely yours,
Mike.


Re: [PATCH v2 04/12] mm/execmem, arch: convert remaining overrides of module_alloc to execmem

2023-06-17 Thread Mike Rapoport
On Fri, Jun 16, 2023 at 04:16:28PM +, Edgecombe, Rick P wrote:
> On Fri, 2023-06-16 at 11:50 +0300, Mike Rapoport wrote:
> > -void *module_alloc(unsigned long size)
> > -{
> > -   gfp_t gfp_mask = GFP_KERNEL;
> > -   void *p;
> > -
> > -   if (PAGE_ALIGN(size) > MODULES_LEN)
> > -   return NULL;
> > +static struct execmem_params execmem_params = {
> > +   .modules = {
> > +   .flags = EXECMEM_KASAN_SHADOW,
> > +   .text = {
> > +   .alignment = MODULE_ALIGN,
> > +   },
> > +   },
> > +};
> 
> Did you consider making these execmem_params's ro_after_init? Not that
> it is security sensitive, but it's a nice hint to the reader that it is
> only modified at init. And I guess basically free sanitizing of buggy
> writes to it.

Makes sense.
 
> >  
> > -   p = __vmalloc_node_range(size, MODULE_ALIGN,
> > -    MODULES_VADDR +
> > get_module_load_offset(),
> > -    MODULES_END, gfp_mask, PAGE_KERNEL,
> > -    VM_FLUSH_RESET_PERMS |
> > VM_DEFER_KMEMLEAK,
> > -    NUMA_NO_NODE,
> > __builtin_return_address(0));
> > +struct execmem_params __init *execmem_arch_params(void)
> > +{
> > +   unsigned long start = MODULES_VADDR +
> > get_module_load_offset();
> 
> I think we can drop the mutex's in get_module_load_offset() now, since
> execmem_arch_params() should only be called once at init.

Right. Even more, the entire get_module_load_offset() can be folded into
execmem_arch_params() as 

if (IS_ENABLED(CONFIG_RANDOMIZE_BASE) && kaslr_enabled())
module_load_offset =
get_random_u32_inclusive(1, 1024) * PAGE_SIZE;

> >  
> > -   if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0))
> > {
> > -   vfree(p);
> > -   return NULL;
> > -   }
> > +   execmem_params.modules.text.start = start;
> > +   execmem_params.modules.text.end = MODULES_END;
> > +   execmem_params.modules.text.pgprot = PAGE_KERNEL;
> >  
> > -   return p;
> > +   return _params;
> >  }
> >  
> 

-- 
Sincerely yours,
Mike.