Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-07-17 Thread Andy Lutomirski



On Mon, Jun 26, 2023, at 10:48 AM, Song Liu wrote:
> On Mon, Jun 26, 2023 at 5:31 AM Mark Rutland  wrote:
>>
> [...]
>> >
>> > So the idea was that jit_text_alloc() will have a cache of large pages
>> > mapped ROX, will allocate memory from those caches and there will be
>> > jit_update() that uses text poking for writing to that memory.
>> >
>> > Upon allocation of a large page to increase the cache, that large page will
>> > be "invalidated" by filling it with breakpoint instructions (e.g int3 on
>> > x86)
>>
>> Does that work on x86?
>>
>> That is in no way gauranteed for other architectures; on arm64 you need
>> explicit cache maintenance (with I-cache maintenance at the VA to be executed
>> from) followed by context-synchronization-events (e.g. via ISB instructions, 
>> or
>> IPIs).
>
> I guess we need:
> 1) Invalidate unused part of the huge ROX pages;
> 2) Do not put two jit users (including module text, bpf, etc.) in the
> same cache line;
> 3) Explicit cache maintenance;
> 4) context-synchronization-events.
>
> Would these (or a subset of them) be sufficient to protect us from torn read?

Maybe?  #4 is sufficiently vague that I can't really interpret it.

I have a half-drafted email asking for official clarification on the rules that 
might help shed light on this.  I find that this type of request works best 
when it's really well written :)

>
> Thanks,
> Song


Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-26 Thread Song Liu
On Mon, Jun 26, 2023 at 5:31 AM Mark Rutland  wrote:
>
[...]
> >
> > So the idea was that jit_text_alloc() will have a cache of large pages
> > mapped ROX, will allocate memory from those caches and there will be
> > jit_update() that uses text poking for writing to that memory.
> >
> > Upon allocation of a large page to increase the cache, that large page will
> > be "invalidated" by filling it with breakpoint instructions (e.g int3 on
> > x86)
>
> Does that work on x86?
>
> That is in no way gauranteed for other architectures; on arm64 you need
> explicit cache maintenance (with I-cache maintenance at the VA to be executed
> from) followed by context-synchronization-events (e.g. via ISB instructions, 
> or
> IPIs).

I guess we need:
1) Invalidate unused part of the huge ROX pages;
2) Do not put two jit users (including module text, bpf, etc.) in the
same cache line;
3) Explicit cache maintenance;
4) context-synchronization-events.

Would these (or a subset of them) be sufficient to protect us from torn read?

Thanks,
Song


Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-26 Thread Mark Rutland
On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> >> > From: "Mike Rapoport (IBM)" 
> >> >
> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >> >
> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> >> > and puts the burden of code allocation to the modules code.
> >> >
> >> > Several architectures override module_alloc() because of various
> >> > constraints where the executable memory can be located and this causes
> >> > additional obstacles for improvements of code allocation.
> >> >
> >> > Start splitting code allocation from modules by introducing
> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >> >
> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> >> > module_memfree() to allow updating all call sites to use the new APIs.
> >> >
> >> > The intention semantics for new allocation APIs:
> >> >
> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> >> >   close to the kernel image, like loadable kernel modules and generated
> >> >   code that is restricted by relative addressing.
> >> >
> >> > * jit_text_alloc() should be used to allocate memory for generated code
> >> >   when there are no restrictions for the code placement. For
> >> >   architectures that require that any code is within certain distance
> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >> >   execmem_text_alloc().
> >> >
> >> 
> >> Is there anything in this series to help users do the appropriate
> >> synchronization when the actually populate the allocated memory with
> >> code?  See here, for example:
> >
> > This series only factors out the executable allocations from modules and
> > puts them in a central place.
> > Anything else would go on top after this lands.
> 
> Hmm.
> 
> On the one hand, there's nothing wrong with factoring out common code. On the
> other hand, this is probably the right time to at least start thinking about
> synchronization, at least to the extent that it might make us want to change
> this API.  (I'm not at all saying that this series should require changes --
> I'm just saying that this is a good time to think about how this should
> work.)
> 
> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> look like the one think in the Linux ecosystem that actually intelligently
> and efficiently maps new text into an address space: mmap().
> 
> On x86, you can mmap() an existing file full of executable code PROT_EXEC and
> jump to it with minimal synchronization (just the standard implicit ordering
> in the kernel that populates the pages before setting up the PTEs and
> whatever user synchronization is needed to avoid jumping into the mapping
> before mmap() finishes).  It works across CPUs, and the only possible way
> userspace can screw it up (for a read-only mapping of read-only text, anyway)
> is to jump to the mapping too early, in which case userspace gets a page
> fault.  Incoherence is impossible, and no one needs to "serialize" (in the
> SDM sense).
> 
> I think the same sequence (from userspace's perspective) works on other
> architectures, too, although I think more cache management is needed on the
> kernel's end.  As far as I know, no Linux SMP architecture needs an IPI to
> map executable text into usermode, but I could easily be wrong.  (IIRC RISC-V
> has very developer-unfriendly icache management, but I don't remember the
> details.)

That's my understanding too, with a couple of details:

1) After the copy we perform and complete all the data + instruction cache
   maintenance *before* marking the mapping as executable.

2) Even *after* the mapping is marked executable, a thread could take a
   spurious fault on an instruction fetch for the new instructions. One way to
   think about this is that the CPU attempted to speculate the instructions
   earlier, saw that the mapping was faulting, and placed a "generate a fault
   here" operation into its pipeline to generate that later.

   The CPU pipeline/OoO-engine/whatever is effectively a transient cache for
   operations in-flight which is only ever "invalidated" by a
   context-synchronization-event (akin to an x86 serializing effect).

   We're only guarnateed to have a new instruction fetch (from the I-cache into
   the CPU pipeline) after the next context synchronization event (akin to an 
x86
   serializing effect), and luckily out exception entry/exit is architecturally
   guarnateed to provide that (unless we explicitly opt out via a control bit).

I know we're a bit lax with that 

Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-26 Thread Mark Rutland
On Sun, Jun 25, 2023 at 07:14:17PM +0300, Mike Rapoport wrote:
> On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > 
> > On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > >> > From: "Mike Rapoport (IBM)" 
> > >> >
> > >> > module_alloc() is used everywhere as a mean to allocate memory for 
> > >> > code.
> > >> >
> > >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > >> > and puts the burden of code allocation to the modules code.
> > >> >
> > >> > Several architectures override module_alloc() because of various
> > >> > constraints where the executable memory can be located and this causes
> > >> > additional obstacles for improvements of code allocation.
> > >> >
> > >> > Start splitting code allocation from modules by introducing
> > >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() 
> > >> > APIs.
> > >> >
> > >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > >> > module_alloc() and execmem_free() and jit_free() are replacements of
> > >> > module_memfree() to allow updating all call sites to use the new APIs.
> > >> >
> > >> > The intention semantics for new allocation APIs:
> > >> >
> > >> > * execmem_text_alloc() should be used to allocate memory that must 
> > >> > reside
> > >> >   close to the kernel image, like loadable kernel modules and generated
> > >> >   code that is restricted by relative addressing.
> > >> >
> > >> > * jit_text_alloc() should be used to allocate memory for generated code
> > >> >   when there are no restrictions for the code placement. For
> > >> >   architectures that require that any code is within certain distance
> > >> >   from the kernel image, jit_text_alloc() will be essentially aliased 
> > >> > to
> > >> >   execmem_text_alloc().
> > >> >
> > >> 
> > >> Is there anything in this series to help users do the appropriate
> > >> synchronization when the actually populate the allocated memory with
> > >> code?  See here, for example:
> > >
> > > This series only factors out the executable allocations from modules and
> > > puts them in a central place.
> > > Anything else would go on top after this lands.
> > 
> > Hmm.
> > 
> > On the one hand, there's nothing wrong with factoring out common code. On
> > the other hand, this is probably the right time to at least start
> > thinking about synchronization, at least to the extent that it might make
> > us want to change this API.  (I'm not at all saying that this series
> > should require changes -- I'm just saying that this is a good time to
> > think about how this should work.)
> > 
> > The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> > look like the one think in the Linux ecosystem that actually
> > intelligently and efficiently maps new text into an address space:
> > mmap().
> > 
> > On x86, you can mmap() an existing file full of executable code PROT_EXEC
> > and jump to it with minimal synchronization (just the standard implicit
> > ordering in the kernel that populates the pages before setting up the
> > PTEs and whatever user synchronization is needed to avoid jumping into
> > the mapping before mmap() finishes).  It works across CPUs, and the only
> > possible way userspace can screw it up (for a read-only mapping of
> > read-only text, anyway) is to jump to the mapping too early, in which
> > case userspace gets a page fault.  Incoherence is impossible, and no one
> > needs to "serialize" (in the SDM sense).
> > 
> > I think the same sequence (from userspace's perspective) works on other
> > architectures, too, although I think more cache management is needed on
> > the kernel's end.  As far as I know, no Linux SMP architecture needs an
> > IPI to map executable text into usermode, but I could easily be wrong.
> > (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> > remember the details.)
> > 
> > Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> > rather fraught, and I bet many things do it wrong when userspace is
> > multithreaded.  But not in production because it's mostly not used in
> > production.)
> > 
> > But jit_text_alloc() can't do this, because the order of operations
> > doesn't match.  With jit_text_alloc(), the executable mapping shows up
> > before the text is populated, so there is no atomic change from not-there
> > to populated-and-executable.  Which means that there is an opportunity
> > for CPUs, speculatively or otherwise, to start filling various caches
> > with intermediate states of the text, which means that various
> > architectures (even x86!) may need serialization.
> > 
> > For eBPF- and module- like use cases, where JITting/code gen is quite
> > coarse-grained, perhaps something vaguely 

Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-26 Thread Mark Rutland
On Mon, Jun 26, 2023 at 11:54:02AM +0200, Puranjay Mohan wrote:
> On Mon, Jun 26, 2023 at 8:13 AM Song Liu  wrote:
> >
> > On Sun, Jun 25, 2023 at 11:07 AM Kent Overstreet
> >  wrote:
> > >
> > > On Sun, Jun 25, 2023 at 08:42:57PM +0300, Mike Rapoport wrote:
> > > > On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> > > > >
> > > > >
> > > > > On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > > > > > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > > > > >>
> > > > > >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > > > > >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > > > > >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > > > > >> >> > From: "Mike Rapoport (IBM)" 
> > > > > >> >> >
> > > > > >> >> > module_alloc() is used everywhere as a mean to allocate 
> > > > > >> >> > memory for code.
> > > > > >> >> >
> > > > > >> >> > Beside being semantically wrong, this unnecessarily ties all 
> > > > > >> >> > subsystems
> > > > > >> >> > that need to allocate code, such as ftrace, kprobes and BPF 
> > > > > >> >> > to modules
> > > > > >> >> > and puts the burden of code allocation to the modules code.
> > > > > >> >> >
> > > > > >> >> > Several architectures override module_alloc() because of 
> > > > > >> >> > various
> > > > > >> >> > constraints where the executable memory can be located and 
> > > > > >> >> > this causes
> > > > > >> >> > additional obstacles for improvements of code allocation.
> > > > > >> >> >
> > > > > >> >> > Start splitting code allocation from modules by introducing
> > > > > >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), 
> > > > > >> >> > jit_free() APIs.
> > > > > >> >> >
> > > > > >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are 
> > > > > >> >> > wrappers for
> > > > > >> >> > module_alloc() and execmem_free() and jit_free() are 
> > > > > >> >> > replacements of
> > > > > >> >> > module_memfree() to allow updating all call sites to use the 
> > > > > >> >> > new APIs.
> > > > > >> >> >
> > > > > >> >> > The intention semantics for new allocation APIs:
> > > > > >> >> >
> > > > > >> >> > * execmem_text_alloc() should be used to allocate memory that 
> > > > > >> >> > must reside
> > > > > >> >> >   close to the kernel image, like loadable kernel modules and 
> > > > > >> >> > generated
> > > > > >> >> >   code that is restricted by relative addressing.
> > > > > >> >> >
> > > > > >> >> > * jit_text_alloc() should be used to allocate memory for 
> > > > > >> >> > generated code
> > > > > >> >> >   when there are no restrictions for the code placement. For
> > > > > >> >> >   architectures that require that any code is within certain 
> > > > > >> >> > distance
> > > > > >> >> >   from the kernel image, jit_text_alloc() will be essentially 
> > > > > >> >> > aliased to
> > > > > >> >> >   execmem_text_alloc().
> > > > > >> >> >
> > > > > >> >>
> > > > > >> >> Is there anything in this series to help users do the 
> > > > > >> >> appropriate
> > > > > >> >> synchronization when the actually populate the allocated memory 
> > > > > >> >> with
> > > > > >> >> code?  See here, for example:
> > > > > >> >
> > > > > >> > This series only factors out the executable allocations from 
> > > > > >> > modules and
> > > > > >> > puts them in a central place.
> > > > > >> > Anything else would go on top after this lands.
> > > > > >>
> > > > > >> Hmm.
> > > > > >>
> > > > > >> On the one hand, there's nothing wrong with factoring out common 
> > > > > >> code. On
> > > > > >> the other hand, this is probably the right time to at least start
> > > > > >> thinking about synchronization, at least to the extent that it 
> > > > > >> might make
> > > > > >> us want to change this API.  (I'm not at all saying that this 
> > > > > >> series
> > > > > >> should require changes -- I'm just saying that this is a good time 
> > > > > >> to
> > > > > >> think about how this should work.)
> > > > > >>
> > > > > >> The current APIs, *and* the proposed jit_text_alloc() API, don't 
> > > > > >> actually
> > > > > >> look like the one think in the Linux ecosystem that actually
> > > > > >> intelligently and efficiently maps new text into an address space:
> > > > > >> mmap().
> > > > > >>
> > > > > >> On x86, you can mmap() an existing file full of executable code 
> > > > > >> PROT_EXEC
> > > > > >> and jump to it with minimal synchronization (just the standard 
> > > > > >> implicit
> > > > > >> ordering in the kernel that populates the pages before setting up 
> > > > > >> the
> > > > > >> PTEs and whatever user synchronization is needed to avoid jumping 
> > > > > >> into
> > > > > >> the mapping before mmap() finishes).  It works across CPUs, and 
> > > > > >> the only
> > > > > >> possible way userspace can screw it up (for a read-only mapping of
> > > > > >> read-only text, anyway) is to jump to the mapping too early, in 
> > > > > >> which
> > > > > >> case userspace 

Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-26 Thread Puranjay Mohan
On Mon, Jun 26, 2023 at 8:13 AM Song Liu  wrote:
>
> On Sun, Jun 25, 2023 at 11:07 AM Kent Overstreet
>  wrote:
> >
> > On Sun, Jun 25, 2023 at 08:42:57PM +0300, Mike Rapoport wrote:
> > > On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> > > >
> > > >
> > > > On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > > > > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > > > >>
> > > > >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > > > >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > > > >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > > > >> >> > From: "Mike Rapoport (IBM)" 
> > > > >> >> >
> > > > >> >> > module_alloc() is used everywhere as a mean to allocate memory 
> > > > >> >> > for code.
> > > > >> >> >
> > > > >> >> > Beside being semantically wrong, this unnecessarily ties all 
> > > > >> >> > subsystems
> > > > >> >> > that need to allocate code, such as ftrace, kprobes and BPF to 
> > > > >> >> > modules
> > > > >> >> > and puts the burden of code allocation to the modules code.
> > > > >> >> >
> > > > >> >> > Several architectures override module_alloc() because of various
> > > > >> >> > constraints where the executable memory can be located and this 
> > > > >> >> > causes
> > > > >> >> > additional obstacles for improvements of code allocation.
> > > > >> >> >
> > > > >> >> > Start splitting code allocation from modules by introducing
> > > > >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), 
> > > > >> >> > jit_free() APIs.
> > > > >> >> >
> > > > >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are 
> > > > >> >> > wrappers for
> > > > >> >> > module_alloc() and execmem_free() and jit_free() are 
> > > > >> >> > replacements of
> > > > >> >> > module_memfree() to allow updating all call sites to use the 
> > > > >> >> > new APIs.
> > > > >> >> >
> > > > >> >> > The intention semantics for new allocation APIs:
> > > > >> >> >
> > > > >> >> > * execmem_text_alloc() should be used to allocate memory that 
> > > > >> >> > must reside
> > > > >> >> >   close to the kernel image, like loadable kernel modules and 
> > > > >> >> > generated
> > > > >> >> >   code that is restricted by relative addressing.
> > > > >> >> >
> > > > >> >> > * jit_text_alloc() should be used to allocate memory for 
> > > > >> >> > generated code
> > > > >> >> >   when there are no restrictions for the code placement. For
> > > > >> >> >   architectures that require that any code is within certain 
> > > > >> >> > distance
> > > > >> >> >   from the kernel image, jit_text_alloc() will be essentially 
> > > > >> >> > aliased to
> > > > >> >> >   execmem_text_alloc().
> > > > >> >> >
> > > > >> >>
> > > > >> >> Is there anything in this series to help users do the appropriate
> > > > >> >> synchronization when the actually populate the allocated memory 
> > > > >> >> with
> > > > >> >> code?  See here, for example:
> > > > >> >
> > > > >> > This series only factors out the executable allocations from 
> > > > >> > modules and
> > > > >> > puts them in a central place.
> > > > >> > Anything else would go on top after this lands.
> > > > >>
> > > > >> Hmm.
> > > > >>
> > > > >> On the one hand, there's nothing wrong with factoring out common 
> > > > >> code. On
> > > > >> the other hand, this is probably the right time to at least start
> > > > >> thinking about synchronization, at least to the extent that it might 
> > > > >> make
> > > > >> us want to change this API.  (I'm not at all saying that this series
> > > > >> should require changes -- I'm just saying that this is a good time to
> > > > >> think about how this should work.)
> > > > >>
> > > > >> The current APIs, *and* the proposed jit_text_alloc() API, don't 
> > > > >> actually
> > > > >> look like the one think in the Linux ecosystem that actually
> > > > >> intelligently and efficiently maps new text into an address space:
> > > > >> mmap().
> > > > >>
> > > > >> On x86, you can mmap() an existing file full of executable code 
> > > > >> PROT_EXEC
> > > > >> and jump to it with minimal synchronization (just the standard 
> > > > >> implicit
> > > > >> ordering in the kernel that populates the pages before setting up the
> > > > >> PTEs and whatever user synchronization is needed to avoid jumping 
> > > > >> into
> > > > >> the mapping before mmap() finishes).  It works across CPUs, and the 
> > > > >> only
> > > > >> possible way userspace can screw it up (for a read-only mapping of
> > > > >> read-only text, anyway) is to jump to the mapping too early, in which
> > > > >> case userspace gets a page fault.  Incoherence is impossible, and no 
> > > > >> one
> > > > >> needs to "serialize" (in the SDM sense).
> > > > >>
> > > > >> I think the same sequence (from userspace's perspective) works on 
> > > > >> other
> > > > >> architectures, too, although I think more cache management is needed 
> > > > >> on
> > > > >> the kernel's end.  As 

Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-26 Thread Song Liu
On Sun, Jun 25, 2023 at 11:07 AM Kent Overstreet
 wrote:
>
> On Sun, Jun 25, 2023 at 08:42:57PM +0300, Mike Rapoport wrote:
> > On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> > >
> > >
> > > On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > > > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > > >>
> > > >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > > >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > > >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > > >> >> > From: "Mike Rapoport (IBM)" 
> > > >> >> >
> > > >> >> > module_alloc() is used everywhere as a mean to allocate memory 
> > > >> >> > for code.
> > > >> >> >
> > > >> >> > Beside being semantically wrong, this unnecessarily ties all 
> > > >> >> > subsystems
> > > >> >> > that need to allocate code, such as ftrace, kprobes and BPF to 
> > > >> >> > modules
> > > >> >> > and puts the burden of code allocation to the modules code.
> > > >> >> >
> > > >> >> > Several architectures override module_alloc() because of various
> > > >> >> > constraints where the executable memory can be located and this 
> > > >> >> > causes
> > > >> >> > additional obstacles for improvements of code allocation.
> > > >> >> >
> > > >> >> > Start splitting code allocation from modules by introducing
> > > >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), 
> > > >> >> > jit_free() APIs.
> > > >> >> >
> > > >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers 
> > > >> >> > for
> > > >> >> > module_alloc() and execmem_free() and jit_free() are replacements 
> > > >> >> > of
> > > >> >> > module_memfree() to allow updating all call sites to use the new 
> > > >> >> > APIs.
> > > >> >> >
> > > >> >> > The intention semantics for new allocation APIs:
> > > >> >> >
> > > >> >> > * execmem_text_alloc() should be used to allocate memory that 
> > > >> >> > must reside
> > > >> >> >   close to the kernel image, like loadable kernel modules and 
> > > >> >> > generated
> > > >> >> >   code that is restricted by relative addressing.
> > > >> >> >
> > > >> >> > * jit_text_alloc() should be used to allocate memory for 
> > > >> >> > generated code
> > > >> >> >   when there are no restrictions for the code placement. For
> > > >> >> >   architectures that require that any code is within certain 
> > > >> >> > distance
> > > >> >> >   from the kernel image, jit_text_alloc() will be essentially 
> > > >> >> > aliased to
> > > >> >> >   execmem_text_alloc().
> > > >> >> >
> > > >> >>
> > > >> >> Is there anything in this series to help users do the appropriate
> > > >> >> synchronization when the actually populate the allocated memory with
> > > >> >> code?  See here, for example:
> > > >> >
> > > >> > This series only factors out the executable allocations from modules 
> > > >> > and
> > > >> > puts them in a central place.
> > > >> > Anything else would go on top after this lands.
> > > >>
> > > >> Hmm.
> > > >>
> > > >> On the one hand, there's nothing wrong with factoring out common code. 
> > > >> On
> > > >> the other hand, this is probably the right time to at least start
> > > >> thinking about synchronization, at least to the extent that it might 
> > > >> make
> > > >> us want to change this API.  (I'm not at all saying that this series
> > > >> should require changes -- I'm just saying that this is a good time to
> > > >> think about how this should work.)
> > > >>
> > > >> The current APIs, *and* the proposed jit_text_alloc() API, don't 
> > > >> actually
> > > >> look like the one think in the Linux ecosystem that actually
> > > >> intelligently and efficiently maps new text into an address space:
> > > >> mmap().
> > > >>
> > > >> On x86, you can mmap() an existing file full of executable code 
> > > >> PROT_EXEC
> > > >> and jump to it with minimal synchronization (just the standard implicit
> > > >> ordering in the kernel that populates the pages before setting up the
> > > >> PTEs and whatever user synchronization is needed to avoid jumping into
> > > >> the mapping before mmap() finishes).  It works across CPUs, and the 
> > > >> only
> > > >> possible way userspace can screw it up (for a read-only mapping of
> > > >> read-only text, anyway) is to jump to the mapping too early, in which
> > > >> case userspace gets a page fault.  Incoherence is impossible, and no 
> > > >> one
> > > >> needs to "serialize" (in the SDM sense).
> > > >>
> > > >> I think the same sequence (from userspace's perspective) works on other
> > > >> architectures, too, although I think more cache management is needed on
> > > >> the kernel's end.  As far as I know, no Linux SMP architecture needs an
> > > >> IPI to map executable text into usermode, but I could easily be wrong.
> > > >> (IIRC RISC-V has very developer-unfriendly icache management, but I 
> > > >> don't
> > > >> remember the details.)
> > > >>
> > > >> Of course, using ptrace or any other 

Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-25 Thread Kent Overstreet
On Sun, Jun 25, 2023 at 08:42:57PM +0300, Mike Rapoport wrote:
> On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> > 
> > 
> > On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > >> 
> > >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > >> >> > From: "Mike Rapoport (IBM)" 
> > >> >> >
> > >> >> > module_alloc() is used everywhere as a mean to allocate memory for 
> > >> >> > code.
> > >> >> >
> > >> >> > Beside being semantically wrong, this unnecessarily ties all 
> > >> >> > subsystems
> > >> >> > that need to allocate code, such as ftrace, kprobes and BPF to 
> > >> >> > modules
> > >> >> > and puts the burden of code allocation to the modules code.
> > >> >> >
> > >> >> > Several architectures override module_alloc() because of various
> > >> >> > constraints where the executable memory can be located and this 
> > >> >> > causes
> > >> >> > additional obstacles for improvements of code allocation.
> > >> >> >
> > >> >> > Start splitting code allocation from modules by introducing
> > >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() 
> > >> >> > APIs.
> > >> >> >
> > >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers 
> > >> >> > for
> > >> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> > >> >> > module_memfree() to allow updating all call sites to use the new 
> > >> >> > APIs.
> > >> >> >
> > >> >> > The intention semantics for new allocation APIs:
> > >> >> >
> > >> >> > * execmem_text_alloc() should be used to allocate memory that must 
> > >> >> > reside
> > >> >> >   close to the kernel image, like loadable kernel modules and 
> > >> >> > generated
> > >> >> >   code that is restricted by relative addressing.
> > >> >> >
> > >> >> > * jit_text_alloc() should be used to allocate memory for generated 
> > >> >> > code
> > >> >> >   when there are no restrictions for the code placement. For
> > >> >> >   architectures that require that any code is within certain 
> > >> >> > distance
> > >> >> >   from the kernel image, jit_text_alloc() will be essentially 
> > >> >> > aliased to
> > >> >> >   execmem_text_alloc().
> > >> >> >
> > >> >> 
> > >> >> Is there anything in this series to help users do the appropriate
> > >> >> synchronization when the actually populate the allocated memory with
> > >> >> code?  See here, for example:
> > >> >
> > >> > This series only factors out the executable allocations from modules 
> > >> > and
> > >> > puts them in a central place.
> > >> > Anything else would go on top after this lands.
> > >> 
> > >> Hmm.
> > >> 
> > >> On the one hand, there's nothing wrong with factoring out common code. On
> > >> the other hand, this is probably the right time to at least start
> > >> thinking about synchronization, at least to the extent that it might make
> > >> us want to change this API.  (I'm not at all saying that this series
> > >> should require changes -- I'm just saying that this is a good time to
> > >> think about how this should work.)
> > >> 
> > >> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> > >> look like the one think in the Linux ecosystem that actually
> > >> intelligently and efficiently maps new text into an address space:
> > >> mmap().
> > >> 
> > >> On x86, you can mmap() an existing file full of executable code PROT_EXEC
> > >> and jump to it with minimal synchronization (just the standard implicit
> > >> ordering in the kernel that populates the pages before setting up the
> > >> PTEs and whatever user synchronization is needed to avoid jumping into
> > >> the mapping before mmap() finishes).  It works across CPUs, and the only
> > >> possible way userspace can screw it up (for a read-only mapping of
> > >> read-only text, anyway) is to jump to the mapping too early, in which
> > >> case userspace gets a page fault.  Incoherence is impossible, and no one
> > >> needs to "serialize" (in the SDM sense).
> > >> 
> > >> I think the same sequence (from userspace's perspective) works on other
> > >> architectures, too, although I think more cache management is needed on
> > >> the kernel's end.  As far as I know, no Linux SMP architecture needs an
> > >> IPI to map executable text into usermode, but I could easily be wrong.
> > >> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> > >> remember the details.)
> > >> 
> > >> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> > >> rather fraught, and I bet many things do it wrong when userspace is
> > >> multithreaded.  But not in production because it's mostly not used in
> > >> production.)
> > >> 
> > >> But jit_text_alloc() can't do this, because the order of operations
> > >> doesn't match.  With 

Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-25 Thread Mike Rapoport
On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> 
> 
> On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> >> 
> >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> >> >> > From: "Mike Rapoport (IBM)" 
> >> >> >
> >> >> > module_alloc() is used everywhere as a mean to allocate memory for 
> >> >> > code.
> >> >> >
> >> >> > Beside being semantically wrong, this unnecessarily ties all 
> >> >> > subsystems
> >> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> >> >> > and puts the burden of code allocation to the modules code.
> >> >> >
> >> >> > Several architectures override module_alloc() because of various
> >> >> > constraints where the executable memory can be located and this causes
> >> >> > additional obstacles for improvements of code allocation.
> >> >> >
> >> >> > Start splitting code allocation from modules by introducing
> >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() 
> >> >> > APIs.
> >> >> >
> >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> >> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> >> >> > module_memfree() to allow updating all call sites to use the new APIs.
> >> >> >
> >> >> > The intention semantics for new allocation APIs:
> >> >> >
> >> >> > * execmem_text_alloc() should be used to allocate memory that must 
> >> >> > reside
> >> >> >   close to the kernel image, like loadable kernel modules and 
> >> >> > generated
> >> >> >   code that is restricted by relative addressing.
> >> >> >
> >> >> > * jit_text_alloc() should be used to allocate memory for generated 
> >> >> > code
> >> >> >   when there are no restrictions for the code placement. For
> >> >> >   architectures that require that any code is within certain distance
> >> >> >   from the kernel image, jit_text_alloc() will be essentially aliased 
> >> >> > to
> >> >> >   execmem_text_alloc().
> >> >> >
> >> >> 
> >> >> Is there anything in this series to help users do the appropriate
> >> >> synchronization when the actually populate the allocated memory with
> >> >> code?  See here, for example:
> >> >
> >> > This series only factors out the executable allocations from modules and
> >> > puts them in a central place.
> >> > Anything else would go on top after this lands.
> >> 
> >> Hmm.
> >> 
> >> On the one hand, there's nothing wrong with factoring out common code. On
> >> the other hand, this is probably the right time to at least start
> >> thinking about synchronization, at least to the extent that it might make
> >> us want to change this API.  (I'm not at all saying that this series
> >> should require changes -- I'm just saying that this is a good time to
> >> think about how this should work.)
> >> 
> >> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> >> look like the one think in the Linux ecosystem that actually
> >> intelligently and efficiently maps new text into an address space:
> >> mmap().
> >> 
> >> On x86, you can mmap() an existing file full of executable code PROT_EXEC
> >> and jump to it with minimal synchronization (just the standard implicit
> >> ordering in the kernel that populates the pages before setting up the
> >> PTEs and whatever user synchronization is needed to avoid jumping into
> >> the mapping before mmap() finishes).  It works across CPUs, and the only
> >> possible way userspace can screw it up (for a read-only mapping of
> >> read-only text, anyway) is to jump to the mapping too early, in which
> >> case userspace gets a page fault.  Incoherence is impossible, and no one
> >> needs to "serialize" (in the SDM sense).
> >> 
> >> I think the same sequence (from userspace's perspective) works on other
> >> architectures, too, although I think more cache management is needed on
> >> the kernel's end.  As far as I know, no Linux SMP architecture needs an
> >> IPI to map executable text into usermode, but I could easily be wrong.
> >> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> >> remember the details.)
> >> 
> >> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> >> rather fraught, and I bet many things do it wrong when userspace is
> >> multithreaded.  But not in production because it's mostly not used in
> >> production.)
> >> 
> >> But jit_text_alloc() can't do this, because the order of operations
> >> doesn't match.  With jit_text_alloc(), the executable mapping shows up
> >> before the text is populated, so there is no atomic change from not-there
> >> to populated-and-executable.  Which means that there is an opportunity
> >> for CPUs, speculatively or otherwise, to start filling various caches
> >> with intermediate states of the text, which means 

Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-25 Thread Andy Lutomirski



On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
>> 
>> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
>> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
>> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
>> >> > From: "Mike Rapoport (IBM)" 
>> >> >
>> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
>> >> >
>> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
>> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
>> >> > and puts the burden of code allocation to the modules code.
>> >> >
>> >> > Several architectures override module_alloc() because of various
>> >> > constraints where the executable memory can be located and this causes
>> >> > additional obstacles for improvements of code allocation.
>> >> >
>> >> > Start splitting code allocation from modules by introducing
>> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
>> >> >
>> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
>> >> > module_alloc() and execmem_free() and jit_free() are replacements of
>> >> > module_memfree() to allow updating all call sites to use the new APIs.
>> >> >
>> >> > The intention semantics for new allocation APIs:
>> >> >
>> >> > * execmem_text_alloc() should be used to allocate memory that must 
>> >> > reside
>> >> >   close to the kernel image, like loadable kernel modules and generated
>> >> >   code that is restricted by relative addressing.
>> >> >
>> >> > * jit_text_alloc() should be used to allocate memory for generated code
>> >> >   when there are no restrictions for the code placement. For
>> >> >   architectures that require that any code is within certain distance
>> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
>> >> >   execmem_text_alloc().
>> >> >
>> >> 
>> >> Is there anything in this series to help users do the appropriate
>> >> synchronization when the actually populate the allocated memory with
>> >> code?  See here, for example:
>> >
>> > This series only factors out the executable allocations from modules and
>> > puts them in a central place.
>> > Anything else would go on top after this lands.
>> 
>> Hmm.
>> 
>> On the one hand, there's nothing wrong with factoring out common code. On
>> the other hand, this is probably the right time to at least start
>> thinking about synchronization, at least to the extent that it might make
>> us want to change this API.  (I'm not at all saying that this series
>> should require changes -- I'm just saying that this is a good time to
>> think about how this should work.)
>> 
>> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
>> look like the one think in the Linux ecosystem that actually
>> intelligently and efficiently maps new text into an address space:
>> mmap().
>> 
>> On x86, you can mmap() an existing file full of executable code PROT_EXEC
>> and jump to it with minimal synchronization (just the standard implicit
>> ordering in the kernel that populates the pages before setting up the
>> PTEs and whatever user synchronization is needed to avoid jumping into
>> the mapping before mmap() finishes).  It works across CPUs, and the only
>> possible way userspace can screw it up (for a read-only mapping of
>> read-only text, anyway) is to jump to the mapping too early, in which
>> case userspace gets a page fault.  Incoherence is impossible, and no one
>> needs to "serialize" (in the SDM sense).
>> 
>> I think the same sequence (from userspace's perspective) works on other
>> architectures, too, although I think more cache management is needed on
>> the kernel's end.  As far as I know, no Linux SMP architecture needs an
>> IPI to map executable text into usermode, but I could easily be wrong.
>> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
>> remember the details.)
>> 
>> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
>> rather fraught, and I bet many things do it wrong when userspace is
>> multithreaded.  But not in production because it's mostly not used in
>> production.)
>> 
>> But jit_text_alloc() can't do this, because the order of operations
>> doesn't match.  With jit_text_alloc(), the executable mapping shows up
>> before the text is populated, so there is no atomic change from not-there
>> to populated-and-executable.  Which means that there is an opportunity
>> for CPUs, speculatively or otherwise, to start filling various caches
>> with intermediate states of the text, which means that various
>> architectures (even x86!) may need serialization.
>> 
>> For eBPF- and module- like use cases, where JITting/code gen is quite
>> coarse-grained, perhaps something vaguely like:
>> 
>> jit_text_alloc() -> returns a handle and an executable virtual address,
>> but does *not* map it there
>> 

Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-25 Thread Mike Rapoport
On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> 
> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> >> > From: "Mike Rapoport (IBM)" 
> >> >
> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >> >
> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> >> > and puts the burden of code allocation to the modules code.
> >> >
> >> > Several architectures override module_alloc() because of various
> >> > constraints where the executable memory can be located and this causes
> >> > additional obstacles for improvements of code allocation.
> >> >
> >> > Start splitting code allocation from modules by introducing
> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >> >
> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> >> > module_memfree() to allow updating all call sites to use the new APIs.
> >> >
> >> > The intention semantics for new allocation APIs:
> >> >
> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> >> >   close to the kernel image, like loadable kernel modules and generated
> >> >   code that is restricted by relative addressing.
> >> >
> >> > * jit_text_alloc() should be used to allocate memory for generated code
> >> >   when there are no restrictions for the code placement. For
> >> >   architectures that require that any code is within certain distance
> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >> >   execmem_text_alloc().
> >> >
> >> 
> >> Is there anything in this series to help users do the appropriate
> >> synchronization when the actually populate the allocated memory with
> >> code?  See here, for example:
> >
> > This series only factors out the executable allocations from modules and
> > puts them in a central place.
> > Anything else would go on top after this lands.
> 
> Hmm.
> 
> On the one hand, there's nothing wrong with factoring out common code. On
> the other hand, this is probably the right time to at least start
> thinking about synchronization, at least to the extent that it might make
> us want to change this API.  (I'm not at all saying that this series
> should require changes -- I'm just saying that this is a good time to
> think about how this should work.)
> 
> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> look like the one think in the Linux ecosystem that actually
> intelligently and efficiently maps new text into an address space:
> mmap().
> 
> On x86, you can mmap() an existing file full of executable code PROT_EXEC
> and jump to it with minimal synchronization (just the standard implicit
> ordering in the kernel that populates the pages before setting up the
> PTEs and whatever user synchronization is needed to avoid jumping into
> the mapping before mmap() finishes).  It works across CPUs, and the only
> possible way userspace can screw it up (for a read-only mapping of
> read-only text, anyway) is to jump to the mapping too early, in which
> case userspace gets a page fault.  Incoherence is impossible, and no one
> needs to "serialize" (in the SDM sense).
> 
> I think the same sequence (from userspace's perspective) works on other
> architectures, too, although I think more cache management is needed on
> the kernel's end.  As far as I know, no Linux SMP architecture needs an
> IPI to map executable text into usermode, but I could easily be wrong.
> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> remember the details.)
> 
> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> rather fraught, and I bet many things do it wrong when userspace is
> multithreaded.  But not in production because it's mostly not used in
> production.)
> 
> But jit_text_alloc() can't do this, because the order of operations
> doesn't match.  With jit_text_alloc(), the executable mapping shows up
> before the text is populated, so there is no atomic change from not-there
> to populated-and-executable.  Which means that there is an opportunity
> for CPUs, speculatively or otherwise, to start filling various caches
> with intermediate states of the text, which means that various
> architectures (even x86!) may need serialization.
> 
> For eBPF- and module- like use cases, where JITting/code gen is quite
> coarse-grained, perhaps something vaguely like:
> 
> jit_text_alloc() -> returns a handle and an executable virtual address,
> but does *not* map it there
> jit_text_write() -> write to that handle
> jit_text_map() -> map it and synchronize if needed (no sync needed on
> x86, I think)
> 
> could be more efficient and/or 

Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-20 Thread Andy Lutomirski



On Mon, Jun 19, 2023, at 1:18 PM, Nadav Amit wrote:
>> On Jun 19, 2023, at 10:09 AM, Andy Lutomirski  wrote:
>> 
>> But jit_text_alloc() can't do this, because the order of operations doesn't 
>> match.  With jit_text_alloc(), the executable mapping shows up before the 
>> text is populated, so there is no atomic change from not-there to 
>> populated-and-executable.  Which means that there is an opportunity for 
>> CPUs, speculatively or otherwise, to start filling various caches with 
>> intermediate states of the text, which means that various architectures 
>> (even x86!) may need serialization.
>> 
>> For eBPF- and module- like use cases, where JITting/code gen is quite 
>> coarse-grained, perhaps something vaguely like:
>> 
>> jit_text_alloc() -> returns a handle and an executable virtual address, but 
>> does *not* map it there
>> jit_text_write() -> write to that handle
>> jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I 
>> think)
>
> Andy, would you mind explaining why you think a sync is not needed? I 
> mean I have a “feeling” that perhaps TSO can guarantee something based 
> on the order of write and page-table update. Is that the argument?

Sorry, when I say "no sync" I mean no cross-CPU synchronization.  I'm assuming 
the underlying sequence of events is:

allocate physical pages (jit_text_alloc)

write to them (with MOV, memcpy, whatever), via the direct map or via a 
temporary mm

do an appropriate *local* barrier (which, on x86, is probably implied by TSO, 
as the subsequent pagetable change is at least a release; also, any any 
previous temporary mm stuff would have done MOV CR3 afterwards, which is a full 
"serializing" barrier)

optionally zap the direct map via IPI, assuming the pages are direct mapped 
(but this could be avoided with a smart enough allocator and temporary_mm above)

install the final RX PTE (jit_text_map), which does a MOV or maybe a LOCK 
CMPXCHG16B.  Note that the virtual address in question was not readable or 
executable before this, and all CPUs have serialized since the last time it was 
executable.

either jump to the new text locally, or:

1. Do a store-release to tell other CPUs that the text is mapped
2. Other CPU does a load-acquire to detect that the text is mapped and jumps to 
the text

This is all approximately the same thing that plain old mmap(..., PROT_EXEC, 
...) does.

>
> On this regard, one thing that I clearly do not understand is why 
> *today* it is ok for users of bpf_arch_text_copy() not to call 
> text_poke_sync(). Am I missing something?

I cannot explain this, because I suspect the current code is wrong.  But it's 
only wrong across CPUs, because bpf_arch_text_copy goes through text_poke_copy, 
which calls unuse_temporary_mm(), which is serializing.  And it's plausible 
that most eBPF use cases don't actually cause the loaded program to get used on 
a different CPU without first serializing on the CPU that ends up using it.  
(Context switches and interrupts are serializing.)

FRED could make interrupts non-serializing. I sincerely hope that FRED doesn't 
cause this all to fall apart.

--Andy


Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-19 Thread Nadav Amit



> On Jun 19, 2023, at 10:09 AM, Andy Lutomirski  wrote:
> 
> But jit_text_alloc() can't do this, because the order of operations doesn't 
> match.  With jit_text_alloc(), the executable mapping shows up before the 
> text is populated, so there is no atomic change from not-there to 
> populated-and-executable.  Which means that there is an opportunity for CPUs, 
> speculatively or otherwise, to start filling various caches with intermediate 
> states of the text, which means that various architectures (even x86!) may 
> need serialization.
> 
> For eBPF- and module- like use cases, where JITting/code gen is quite 
> coarse-grained, perhaps something vaguely like:
> 
> jit_text_alloc() -> returns a handle and an executable virtual address, but 
> does *not* map it there
> jit_text_write() -> write to that handle
> jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I 
> think)

Andy, would you mind explaining why you think a sync is not needed? I mean I 
have a “feeling” that perhaps TSO can guarantee something based on the order of 
write and page-table update. Is that the argument?

On this regard, one thing that I clearly do not understand is why *today* it is 
ok for users of bpf_arch_text_copy() not to call text_poke_sync(). Am I missing 
something?



Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-19 Thread Andy Lutomirski



On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
>> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
>> > From: "Mike Rapoport (IBM)" 
>> >
>> > module_alloc() is used everywhere as a mean to allocate memory for code.
>> >
>> > Beside being semantically wrong, this unnecessarily ties all subsystems
>> > that need to allocate code, such as ftrace, kprobes and BPF to modules
>> > and puts the burden of code allocation to the modules code.
>> >
>> > Several architectures override module_alloc() because of various
>> > constraints where the executable memory can be located and this causes
>> > additional obstacles for improvements of code allocation.
>> >
>> > Start splitting code allocation from modules by introducing
>> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
>> >
>> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
>> > module_alloc() and execmem_free() and jit_free() are replacements of
>> > module_memfree() to allow updating all call sites to use the new APIs.
>> >
>> > The intention semantics for new allocation APIs:
>> >
>> > * execmem_text_alloc() should be used to allocate memory that must reside
>> >   close to the kernel image, like loadable kernel modules and generated
>> >   code that is restricted by relative addressing.
>> >
>> > * jit_text_alloc() should be used to allocate memory for generated code
>> >   when there are no restrictions for the code placement. For
>> >   architectures that require that any code is within certain distance
>> >   from the kernel image, jit_text_alloc() will be essentially aliased to
>> >   execmem_text_alloc().
>> >
>> 
>> Is there anything in this series to help users do the appropriate
>> synchronization when the actually populate the allocated memory with
>> code?  See here, for example:
>
> This series only factors out the executable allocations from modules and
> puts them in a central place.
> Anything else would go on top after this lands.

Hmm.

On the one hand, there's nothing wrong with factoring out common code. On the 
other hand, this is probably the right time to at least start thinking about 
synchronization, at least to the extent that it might make us want to change 
this API.  (I'm not at all saying that this series should require changes -- 
I'm just saying that this is a good time to think about how this should work.)

The current APIs, *and* the proposed jit_text_alloc() API, don't actually look 
like the one think in the Linux ecosystem that actually intelligently and 
efficiently maps new text into an address space: mmap().

On x86, you can mmap() an existing file full of executable code PROT_EXEC and 
jump to it with minimal synchronization (just the standard implicit ordering in 
the kernel that populates the pages before setting up the PTEs and whatever 
user synchronization is needed to avoid jumping into the mapping before mmap() 
finishes).  It works across CPUs, and the only possible way userspace can screw 
it up (for a read-only mapping of read-only text, anyway) is to jump to the 
mapping too early, in which case userspace gets a page fault.  Incoherence is 
impossible, and no one needs to "serialize" (in the SDM sense).

I think the same sequence (from userspace's perspective) works on other 
architectures, too, although I think more cache management is needed on the 
kernel's end.  As far as I know, no Linux SMP architecture needs an IPI to map 
executable text into usermode, but I could easily be wrong.  (IIRC RISC-V has 
very developer-unfriendly icache management, but I don't remember the details.)

Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is rather 
fraught, and I bet many things do it wrong when userspace is multithreaded.  
But not in production because it's mostly not used in production.)

But jit_text_alloc() can't do this, because the order of operations doesn't 
match.  With jit_text_alloc(), the executable mapping shows up before the text 
is populated, so there is no atomic change from not-there to 
populated-and-executable.  Which means that there is an opportunity for CPUs, 
speculatively or otherwise, to start filling various caches with intermediate 
states of the text, which means that various architectures (even x86!) may need 
serialization.

For eBPF- and module- like use cases, where JITting/code gen is quite 
coarse-grained, perhaps something vaguely like:

jit_text_alloc() -> returns a handle and an executable virtual address, but 
does *not* map it there
jit_text_write() -> write to that handle
jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I 
think)

could be more efficient and/or safer.

(Modules could use this too.  Getting alternatives right might take some 
fiddling, because off the top of my head, this doesn't match how it works now.)

To make alternatives easier, this could work, maybe (haven't fully 

Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-19 Thread Kent Overstreet
On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" 
> >
> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >
> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > and puts the burden of code allocation to the modules code.
> >
> > Several architectures override module_alloc() because of various
> > constraints where the executable memory can be located and this causes
> > additional obstacles for improvements of code allocation.
> >
> > Start splitting code allocation from modules by introducing
> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >
> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > module_alloc() and execmem_free() and jit_free() are replacements of
> > module_memfree() to allow updating all call sites to use the new APIs.
> >
> > The intention semantics for new allocation APIs:
> >
> > * execmem_text_alloc() should be used to allocate memory that must reside
> >   close to the kernel image, like loadable kernel modules and generated
> >   code that is restricted by relative addressing.
> >
> > * jit_text_alloc() should be used to allocate memory for generated code
> >   when there are no restrictions for the code placement. For
> >   architectures that require that any code is within certain distance
> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >   execmem_text_alloc().
> >
> 
> Is there anything in this series to help users do the appropriate 
> synchronization when the actually populate the allocated memory with code?  
> See here, for example:
> 
> https://lore.kernel.org/linux-fsdevel/cb6533c6-cea0-4f04-95cf-b8240c6ab...@app.fastmail.com/T/#u

We're still in need of an arch independent text_poke() api.


Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-18 Thread Mike Rapoport
On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" 
> >
> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >
> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > and puts the burden of code allocation to the modules code.
> >
> > Several architectures override module_alloc() because of various
> > constraints where the executable memory can be located and this causes
> > additional obstacles for improvements of code allocation.
> >
> > Start splitting code allocation from modules by introducing
> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >
> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > module_alloc() and execmem_free() and jit_free() are replacements of
> > module_memfree() to allow updating all call sites to use the new APIs.
> >
> > The intention semantics for new allocation APIs:
> >
> > * execmem_text_alloc() should be used to allocate memory that must reside
> >   close to the kernel image, like loadable kernel modules and generated
> >   code that is restricted by relative addressing.
> >
> > * jit_text_alloc() should be used to allocate memory for generated code
> >   when there are no restrictions for the code placement. For
> >   architectures that require that any code is within certain distance
> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >   execmem_text_alloc().
> >
> 
> Is there anything in this series to help users do the appropriate
> synchronization when the actually populate the allocated memory with
> code?  See here, for example:

This series only factors out the executable allocations from modules and
puts them in a central place.
Anything else would go on top after this lands.
 
> https://lore.kernel.org/linux-fsdevel/cb6533c6-cea0-4f04-95cf-b8240c6ab...@app.fastmail.com/T/#u

-- 
Sincerely yours,
Mike.


Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-17 Thread Andy Lutomirski
On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
>
> module_alloc() is used everywhere as a mean to allocate memory for code.
>
> Beside being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF to modules
> and puts the burden of code allocation to the modules code.
>
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
>
> Start splitting code allocation from modules by introducing
> execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
>
> Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> module_alloc() and execmem_free() and jit_free() are replacements of
> module_memfree() to allow updating all call sites to use the new APIs.
>
> The intention semantics for new allocation APIs:
>
> * execmem_text_alloc() should be used to allocate memory that must reside
>   close to the kernel image, like loadable kernel modules and generated
>   code that is restricted by relative addressing.
>
> * jit_text_alloc() should be used to allocate memory for generated code
>   when there are no restrictions for the code placement. For
>   architectures that require that any code is within certain distance
>   from the kernel image, jit_text_alloc() will be essentially aliased to
>   execmem_text_alloc().
>

Is there anything in this series to help users do the appropriate 
synchronization when the actually populate the allocated memory with code?  See 
here, for example:

https://lore.kernel.org/linux-fsdevel/cb6533c6-cea0-4f04-95cf-b8240c6ab...@app.fastmail.com/T/#u


Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-16 Thread Mike Rapoport
On Fri, Jun 16, 2023 at 12:48:02PM -0400, Kent Overstreet wrote:
> On Fri, Jun 16, 2023 at 11:50:28AM +0300, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" 
> > 
> > module_alloc() is used everywhere as a mean to allocate memory for code.
> > 
> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > and puts the burden of code allocation to the modules code.
> > 
> > Several architectures override module_alloc() because of various
> > constraints where the executable memory can be located and this causes
> > additional obstacles for improvements of code allocation.
> > 
> > Start splitting code allocation from modules by introducing
> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> > 
> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > module_alloc() and execmem_free() and jit_free() are replacements of
> > module_memfree() to allow updating all call sites to use the new APIs.
> > 
> > The intention semantics for new allocation APIs:
> > 
> > * execmem_text_alloc() should be used to allocate memory that must reside
> >   close to the kernel image, like loadable kernel modules and generated
> >   code that is restricted by relative addressing.
> > 
> > * jit_text_alloc() should be used to allocate memory for generated code
> >   when there are no restrictions for the code placement. For
> >   architectures that require that any code is within certain distance
> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >   execmem_text_alloc().
> > 
> > The names execmem_text_alloc() and jit_text_alloc() emphasize that the
> > allocated memory is for executable code, the allocations of the
> > associated data, like data sections of a module will use
> > execmem_data_alloc() interface that will be added later.
> 
> I like the API split - at the risk of further bikeshedding, perhaps
> near_text_alloc() and far_text_alloc()? Would be more explicit.

With near and far it should mention from where and that's getting too long.
I don't mind changing the names, but I couldn't think about something
better than Song's execmem and your jit.
 
> Reviewed-by: Kent Overstreet 

Thanks!

-- 
Sincerely yours,
Mike.


Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-16 Thread Song Liu
On Fri, Jun 16, 2023 at 9:48 AM Kent Overstreet
 wrote:
>
> On Fri, Jun 16, 2023 at 11:50:28AM +0300, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" 
> >
> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >
> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > and puts the burden of code allocation to the modules code.
> >
> > Several architectures override module_alloc() because of various
> > constraints where the executable memory can be located and this causes
> > additional obstacles for improvements of code allocation.
> >
> > Start splitting code allocation from modules by introducing
> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >
> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > module_alloc() and execmem_free() and jit_free() are replacements of
> > module_memfree() to allow updating all call sites to use the new APIs.
> >
> > The intention semantics for new allocation APIs:
> >
> > * execmem_text_alloc() should be used to allocate memory that must reside
> >   close to the kernel image, like loadable kernel modules and generated
> >   code that is restricted by relative addressing.
> >
> > * jit_text_alloc() should be used to allocate memory for generated code
> >   when there are no restrictions for the code placement. For
> >   architectures that require that any code is within certain distance
> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >   execmem_text_alloc().
> >
> > The names execmem_text_alloc() and jit_text_alloc() emphasize that the
> > allocated memory is for executable code, the allocations of the
> > associated data, like data sections of a module will use
> > execmem_data_alloc() interface that will be added later.
>
> I like the API split - at the risk of further bikeshedding, perhaps
> near_text_alloc() and far_text_alloc()? Would be more explicit.
>
> Reviewed-by: Kent Overstreet 

Acked-by: Song Liu 


Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-16 Thread Kent Overstreet
On Fri, Jun 16, 2023 at 11:50:28AM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> module_alloc() is used everywhere as a mean to allocate memory for code.
> 
> Beside being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF to modules
> and puts the burden of code allocation to the modules code.
> 
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
> 
> Start splitting code allocation from modules by introducing
> execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> 
> Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> module_alloc() and execmem_free() and jit_free() are replacements of
> module_memfree() to allow updating all call sites to use the new APIs.
> 
> The intention semantics for new allocation APIs:
> 
> * execmem_text_alloc() should be used to allocate memory that must reside
>   close to the kernel image, like loadable kernel modules and generated
>   code that is restricted by relative addressing.
> 
> * jit_text_alloc() should be used to allocate memory for generated code
>   when there are no restrictions for the code placement. For
>   architectures that require that any code is within certain distance
>   from the kernel image, jit_text_alloc() will be essentially aliased to
>   execmem_text_alloc().
> 
> The names execmem_text_alloc() and jit_text_alloc() emphasize that the
> allocated memory is for executable code, the allocations of the
> associated data, like data sections of a module will use
> execmem_data_alloc() interface that will be added later.

I like the API split - at the risk of further bikeshedding, perhaps
near_text_alloc() and far_text_alloc()? Would be more explicit.

Reviewed-by: Kent Overstreet 


[PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-16 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

module_alloc() is used everywhere as a mean to allocate memory for code.

Beside being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF to modules
and puts the burden of code allocation to the modules code.

Several architectures override module_alloc() because of various
constraints where the executable memory can be located and this causes
additional obstacles for improvements of code allocation.

Start splitting code allocation from modules by introducing
execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.

Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
module_alloc() and execmem_free() and jit_free() are replacements of
module_memfree() to allow updating all call sites to use the new APIs.

The intention semantics for new allocation APIs:

* execmem_text_alloc() should be used to allocate memory that must reside
  close to the kernel image, like loadable kernel modules and generated
  code that is restricted by relative addressing.

* jit_text_alloc() should be used to allocate memory for generated code
  when there are no restrictions for the code placement. For
  architectures that require that any code is within certain distance
  from the kernel image, jit_text_alloc() will be essentially aliased to
  execmem_text_alloc().

The names execmem_text_alloc() and jit_text_alloc() emphasize that the
allocated memory is for executable code, the allocations of the
associated data, like data sections of a module will use
execmem_data_alloc() interface that will be added later.

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/powerpc/kernel/kprobes.c|  4 +--
 arch/s390/kernel/ftrace.c|  4 +--
 arch/s390/kernel/kprobes.c   |  4 +--
 arch/s390/kernel/module.c|  5 +--
 arch/sparc/net/bpf_jit_comp_32.c |  8 ++---
 arch/x86/kernel/ftrace.c |  6 ++--
 arch/x86/kernel/kprobes/core.c   |  4 +--
 include/linux/execmem.h  | 52 
 include/linux/moduleloader.h |  3 --
 kernel/bpf/core.c| 14 -
 kernel/kprobes.c |  8 ++---
 kernel/module/Kconfig|  1 +
 kernel/module/main.c | 25 +--
 mm/Kconfig   |  3 ++
 mm/Makefile  |  1 +
 mm/execmem.c | 36 ++
 16 files changed, 130 insertions(+), 48 deletions(-)
 create mode 100644 include/linux/execmem.h
 create mode 100644 mm/execmem.c

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index b20ee72e873a..5db8df5e3657 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -19,8 +19,8 @@
 #include 
 #include 
 #include 
-#include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -130,7 +130,7 @@ void *alloc_insn_page(void)
 {
void *page;
 
-   page = module_alloc(PAGE_SIZE);
+   page = jit_text_alloc(PAGE_SIZE);
if (!page)
return NULL;
 
diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
index c46381ea04ec..65343f944101 100644
--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ -7,13 +7,13 @@
  *   Author(s): Martin Schwidefsky 
  */
 
-#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -220,7 +220,7 @@ static int __init ftrace_plt_init(void)
 {
const char *start, *end;
 
-   ftrace_plt = module_alloc(PAGE_SIZE);
+   ftrace_plt = execmem_text_alloc(PAGE_SIZE);
if (!ftrace_plt)
panic("cannot allocate ftrace plt\n");
 
diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c
index d4b863ed0aa7..459cd5141346 100644
--- a/arch/s390/kernel/kprobes.c
+++ b/arch/s390/kernel/kprobes.c
@@ -9,7 +9,6 @@
 
 #define pr_fmt(fmt) "kprobes: " fmt
 
-#include 
 #include 
 #include 
 #include 
@@ -21,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -38,7 +38,7 @@ void *alloc_insn_page(void)
 {
void *page;
 
-   page = module_alloc(PAGE_SIZE);
+   page = execmem_text_alloc(PAGE_SIZE);
if (!page)
return NULL;
set_memory_rox((unsigned long)page, 1);
diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
index f1b35dcdf3eb..4a844683dc76 100644
--- a/arch/s390/kernel/module.c
+++ b/arch/s390/kernel/module.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -76,7 +77,7 @@ void *module_alloc(unsigned long size)
 #ifdef CONFIG_FUNCTION_TRACER
 void module_arch_cleanup(struct module *mod)
 {
-   module_memfree(mod->arch.trampolines_start);
+   execmem_free(mod->arch.trampolines_start);
 }
 #endif
 
@@ -509,7 +510,7 @@ static int module_alloc_ftrace_hotpatch_trampolines(struct 
module *me,
 
size =