Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-10-24 Thread Steven Price
On 23/10/2024 19:10, Liam R. Howlett wrote:
> * Steven Price  [241023 05:31]:
   * Box64 seems to have a custom allocator based on reading 
 /proc/self/maps to allocate a block of VA space with a low enough 
 address [1]

   * PHP has code reading /proc/self/maps - I think this is to find a 
 segment which is close enough to the text segment [2]

   * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit
 addresses [3][4]
>>>
>>> Can't the limited number of applications that need to restrict the upper
>>> bound use an LD_PRELOAD compatible library to do this?
>>
>> I'm not entirely sure what point you are making here. Yes an LD_PRELOAD
>> approach could be used instead of a personality type as a 'hack' to
>> preallocate the upper address space. The obvious disadvantage is that
>> you can't (easily) layer LD_PRELOAD so it won't work in the general case.
> 
> My point is that riscv could work around the limited number of
> applications that require this.  It's not really viable for you.

Ah ok - thanks for the clarification.

>>

   * pmdk has some funky code to find the lowest address that meets 
 certain requirements - this does look like an ASLR alternative and 
 probably couldn't directly use MAP_BELOW_HINT, although maybe this 
 suggests we need a mechanism to map without a VA-range? [5]

   * MIT-Scheme parses /proc/self/maps to find the lowest mapping within 
 a range [6]

   * LuaJIT uses an approach to 'probe' to find a suitable low address 
 for allocation [7]

>>>
>>> Although I did not take a deep dive into each example above, there are
>>> some very odd things being done; we will never cover all the use cases
>>> with an exact API match.  What we have today can be made to work for
>>> these users as they have figured ways to do it.
>>>
>>> Are they pretty? no.  Are they common? no.  I'm not sure it's worth
>>> plumbing new MM code in for these users.
>>
>> My issue with the existing 'solutions' is that they all seem to be fragile:
>>
>>  * Using /proc/self/maps is inherently racy if there could be any other
>> code running in the process at the same time.
> 
> Yes, it is not thread safe.  Parsing text is also undesirable.
> 
>>
>>  * Attempting to map the upper part of the address space only works if
>> done early enough - once an allocation arrives there, there's very
>> little you can robustly do (because the stray allocation might be freed).
>>
>>  * LuaJIT's probing mechanism is probably robust, but it's inefficient -
>> LuaJIT has a fallback of linear probing, followed by no hint (ASLR),
>> followed by pseudo-random probing. I don't know the history of the code
>> but it looks like it's probably been tweaked to try to avoid performance
>> issues.
>>
 The biggest benefit I see of MAP_BELOW_HINT is that it would allow a
 library to get low addresses without causing any problems for the rest
 of the application. The use case I'm looking at is in a library and 
 therefore a personality mode wouldn't be appropriate (because I don't 
 want to affect the rest of the application). Reading /proc/self/maps
 is also problematic because other threads could be allocating/freeing
 at the same time.
>>>
>>> As long as you don't exhaust the lower limit you are trying to allocate
>>> within - which is exactly the issue riscv is hitting.
>>
>> Obviously if you actually exhaust the lower limit then any
>> MAP_BELOW_HINT API would also fail - there's really not much that can be
>> done in that case.
> 
> Today we reverse the search, so you end up at a higher address
> (bottom-up vs top-down) - although the direction is arch dependent.
> 
> If the allocation is too high/low then you could detect, free, and
> handle the failure.

Agreed, that's fine.

>>
>>> I understand that you are providing examples to prove that this is
>>> needed, but I feel like you are better demonstrating the flexibility
>>> exists to implement solutions in different ways using today's API.
>>
>> My intention is to show that today's API doesn't provide a robust way of
>> doing this. Although I'm quite happy if you can point me at a robust way
>> with the current API. As I mentioned my goal is to be able to map memory
>> in a (multithreaded) library with a (ideally configurable) number of VA
>> bits. I don't particularly want to restrict the whole process, just
>> specific allocations.
> 
> If you don't need to restrict everything, won't the hint work for your
> use case?  I must be missing something from your requirements.

The hint only works if the hint address is actually free. Otherwise
mmap() falls back to as if the hint address wasn't specified.

E.g.

>   for (int i = 0; i < 2; i++) {
>           void *addr = mmap((void *)(1UL << 32), PAGE_SIZE, PROT_NONE,
>                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>           printf("%p\n", addr);
>   }

Prints something like:

Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-10-23 Thread Liam R. Howlett
* Steven Price  [241023 05:31]:
> >>   * Box64 seems to have a custom allocator based on reading 
> >> /proc/self/maps to allocate a block of VA space with a low enough 
> >> address [1]
> >>
> >>   * PHP has code reading /proc/self/maps - I think this is to find a 
> >> segment which is close enough to the text segment [2]
> >>
> >>   * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit
> >> addresses [3][4]
> > 
> > Can't the limited number of applications that need to restrict the upper
> > bound use an LD_PRELOAD compatible library to do this?
> 
> I'm not entirely sure what point you are making here. Yes an LD_PRELOAD
> approach could be used instead of a personality type as a 'hack' to
> preallocate the upper address space. The obvious disadvantage is that
> you can't (easily) layer LD_PRELOAD so it won't work in the general case.

My point is that riscv could work around the limited number of
applications that require this.  It's not really viable for you.

> 
> >>
> >>   * pmdk has some funky code to find the lowest address that meets 
> >> certain requirements - this does look like an ASLR alternative and 
> >> probably couldn't directly use MAP_BELOW_HINT, although maybe this 
> >> suggests we need a mechanism to map without a VA-range? [5]
> >>
> >>   * MIT-Scheme parses /proc/self/maps to find the lowest mapping within 
> >> a range [6]
> >>
> >>   * LuaJIT uses an approach to 'probe' to find a suitable low address 
> >> for allocation [7]
> >>
> > 
> > Although I did not take a deep dive into each example above, there are
> > some very odd things being done; we will never cover all the use cases
> > with an exact API match.  What we have today can be made to work for
> > these users as they have figured ways to do it.
> > 
> > Are they pretty? no.  Are they common? no.  I'm not sure it's worth
> > plumbing new MM code in for these users.
> 
> My issue with the existing 'solutions' is that they all seem to be fragile:
> 
>  * Using /proc/self/maps is inherently racy if there could be any other
> code running in the process at the same time.

Yes, it is not thread safe.  Parsing text is also undesirable.

> 
>  * Attempting to map the upper part of the address space only works if
> done early enough - once an allocation arrives there, there's very
> little you can robustly do (because the stray allocation might be freed).
> 
>  * LuaJIT's probing mechanism is probably robust, but it's inefficient -
> LuaJIT has a fallback of linear probing, followed by no hint (ASLR),
> followed by pseudo-random probing. I don't know the history of the code
> but it looks like it's probably been tweaked to try to avoid performance
> issues.
> 
> >> The biggest benefit I see of MAP_BELOW_HINT is that it would allow a
> >> library to get low addresses without causing any problems for the rest
> >> of the application. The use case I'm looking at is in a library and 
> >> therefore a personality mode wouldn't be appropriate (because I don't 
> >> want to affect the rest of the application). Reading /proc/self/maps
> >> is also problematic because other threads could be allocating/freeing
> >> at the same time.
> > 
> > As long as you don't exhaust the lower limit you are trying to allocate
> > within - which is exactly the issue riscv is hitting.
> 
> Obviously if you actually exhaust the lower limit then any
> MAP_BELOW_HINT API would also fail - there's really not much that can be
> done in that case.

Today we reverse the search, so you end up at a higher address
(bottom-up vs top-down) - although the direction is arch dependent.

If the allocation is too high/low then you could detect, free, and
handle the failure.

> 
> > I understand that you are providing examples to prove that this is
> > needed, but I feel like you are better demonstrating the flexibility
> > exists to implement solutions in different ways using today's API.
> 
> My intention is to show that today's API doesn't provide a robust way of
> doing this. Although I'm quite happy if you can point me at a robust way
> with the current API. As I mentioned my goal is to be able to map memory
> in a (multithreaded) library with a (ideally configurable) number of VA
> bits. I don't particularly want to restrict the whole process, just
> specific allocations.

If you don't need to restrict everything, won't the hint work for your
use case?  I must be missing something from your requirements.

> 
> I had typed up a series similar to this one as a MAP_BELOW flag would
> fit my use-case well.
> 
> > I think it would be best to use the existing methods and work around the
> > issue that was created in riscv while future changes could mirror amd64
> > and arm64.
> 
> The riscv issue is a different issue to the one I'm trying to solve. I
> agree MAP_BELOW_HINT isn't a great fix for that because we already have
> differences between amd64 and arm64 and obviously no software currently
> out there uses this

Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-10-23 Thread Steven Price
Hi Liam,

On 21/10/2024 20:48, Liam R. Howlett wrote:
> * Steven Price  [241021 09:23]:
>> On 09/09/2024 10:46, Kirill A. Shutemov wrote:
>>> On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote:
 On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
> On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
>> Some applications rely on placing data in free bits of addresses allocated
>> by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
>> address returned by mmap to be less than the 48-bit address space,
>> unless the hint address uses more than 47 bits (the 48th bit is reserved
>> for the kernel address space).
>>
>> The riscv architecture needs a way to similarly restrict the virtual
>> address space. On the riscv port of OpenJDK an error is thrown if
>> it is run on the 57-bit address space, called sv57 [1].  golang
>> has a comment that sv57 support is not complete, but there are some
>> workarounds to get it to mostly work [2].
>>>
>>> I also saw libmozjs crashing with 57-bit address space on x86.
>>>
>> These applications work on x86 because x86 does an implicit 47-bit
>> restriction of mmap() addresses that contain a hint address that is less
>> than 48 bits.
>>
>> Instead of implicitly restricting the address space on riscv (or any
>> current/future architecture), a flag would allow users to opt-in to this
>> behavior rather than opt-out as is done on other architectures. This is
>> desirable because it is a small class of applications that do pointer
>> masking.
>>>
>>> You reiterate the argument about "small class of applications". But it
>>> makes no sense to me.
>>
>> Sorry to chime in late on this - I had been considering implementing
>> something like MAP_BELOW_HINT and found this thread.
>>
>> While the examples of applications that want to use high VA bits and get
>> bitten by future upgrades are not very persuasive, it's worth pointing
>> out that there are a variety of somewhat horrid hacks out there to work
>> around this feature not existing.
>>
>> E.g. from my brief research into other code:
>>
>>   * Box64 seems to have a custom allocator based on reading 
>> /proc/self/maps to allocate a block of VA space with a low enough 
>> address [1]
>>
>>   * PHP has code reading /proc/self/maps - I think this is to find a 
>> segment which is close enough to the text segment [2]
>>
>>   * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit
>> addresses [3][4]
> 
> Can't the limited number of applications that need to restrict the upper
> bound use an LD_PRELOAD compatible library to do this?

I'm not entirely sure what point you are making here. Yes an LD_PRELOAD
approach could be used instead of a personality type as a 'hack' to
preallocate the upper address space. The obvious disadvantage is that
you can't (easily) layer LD_PRELOAD so it won't work in the general case.

>>
>>   * pmdk has some funky code to find the lowest address that meets 
>> certain requirements - this does look like an ASLR alternative and 
>> probably couldn't directly use MAP_BELOW_HINT, although maybe this 
>> suggests we need a mechanism to map without a VA-range? [5]
>>
>>   * MIT-Scheme parses /proc/self/maps to find the lowest mapping within 
>> a range [6]
>>
>>   * LuaJIT uses an approach to 'probe' to find a suitable low address 
>> for allocation [7]
>>
> 
> Although I did not take a deep dive into each example above, there are
> some very odd things being done; we will never cover all the use cases
> with an exact API match.  What we have today can be made to work for
> these users as they have figured ways to do it.
> 
> Are they pretty? no.  Are they common? no.  I'm not sure it's worth
> plumbing new MM code in for these users.

My issue with the existing 'solutions' is that they all seem to be fragile:

 * Using /proc/self/maps is inherently racy if there could be any other
code running in the process at the same time.

 * Attempting to map the upper part of the address space only works if
done early enough - once an allocation arrives there, there's very
little you can robustly do (because the stray allocation might be freed).

 * LuaJIT's probing mechanism is probably robust, but it's inefficient -
LuaJIT has a fallback of linear probing, followed by no hint (ASLR),
followed by pseudo-random probing. I don't know the history of the code
but it looks like it's probably been tweaked to try to avoid performance
issues.

>> The biggest benefit I see of MAP_BELOW_HINT is that it would allow a
>> library to get low addresses without causing any problems for the rest
>> of the application. The use case I'm looking at is in a library and 
>> therefore a personality mode wouldn't be appropriate (because I don't 
>> want to affect the rest of the application). Reading /proc/self/maps
>> is also problematic because other threads could be allocating/freeing
>> at the same time.

Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-10-21 Thread Steven Price
On 09/09/2024 10:46, Kirill A. Shutemov wrote:
> On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote:
>> On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
>>> On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
 Some applications rely on placing data in free bits of addresses allocated
 by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
 address returned by mmap to be less than the 48-bit address space,
 unless the hint address uses more than 47 bits (the 48th bit is reserved
 for the kernel address space).

 The riscv architecture needs a way to similarly restrict the virtual
 address space. On the riscv port of OpenJDK an error is thrown if
 it is run on the 57-bit address space, called sv57 [1].  golang
 has a comment that sv57 support is not complete, but there are some
 workarounds to get it to mostly work [2].
> 
> I also saw libmozjs crashing with 57-bit address space on x86.
> 
 These applications work on x86 because x86 does an implicit 47-bit
 restriction of mmap() addresses that contain a hint address that is less
 than 48 bits.

 Instead of implicitly restricting the address space on riscv (or any
 current/future architecture), a flag would allow users to opt-in to this
 behavior rather than opt-out as is done on other architectures. This is
 desirable because it is a small class of applications that do pointer
 masking.
> 
> You reiterate the argument about "small class of applications". But it
> makes no sense to me.

Sorry to chime in late on this - I had been considering implementing
something like MAP_BELOW_HINT and found this thread.

While the examples of applications that want to use high VA bits and get
bitten by future upgrades are not very persuasive, it's worth pointing
out that there are a variety of somewhat horrid hacks out there to work
around this feature not existing.

E.g. from my brief research into other code:

  * Box64 seems to have a custom allocator based on reading 
/proc/self/maps to allocate a block of VA space with a low enough 
address [1]

  * PHP has code reading /proc/self/maps - I think this is to find a 
segment which is close enough to the text segment [2]

  * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit
addresses [3][4]

  * pmdk has some funky code to find the lowest address that meets 
certain requirements - this does look like an ASLR alternative and 
probably couldn't directly use MAP_BELOW_HINT, although maybe this 
suggests we need a mechanism to map without a VA-range? [5]

  * MIT-Scheme parses /proc/self/maps to find the lowest mapping within 
a range [6]

  * LuaJIT uses an approach to 'probe' to find a suitable low address 
for allocation [7]

The biggest benefit I see of MAP_BELOW_HINT is that it would allow a
library to get low addresses without causing any problems for the rest
of the application. The use case I'm looking at is in a library and 
therefore a personality mode wouldn't be appropriate (because I don't 
want to affect the rest of the application). Reading /proc/self/maps
is also problematic because other threads could be allocating/freeing
at the same time.

Thanks,
Steve


[1] https://sources.debian.org/src/box64/0.3.0+dfsg-1/src/custommem.c/
[2] 
https://sources.debian.org/src/php8.2/8.2.24-1/ext/opcache/shared_alloc_mmap.c/#L62
[3] https://github.com/FEX-Emu/FEX/blob/main/FEXCore/Source/Utils/Allocator.cpp
[4] 
https://github.com/FEX-Emu/FEX/commit/df2f1ad074e5cdfb19a0bd4639b7604f777fb05c
[5] 
https://sources.debian.org/src/pmdk/1.13.1-1.1/src/common/mmap_posix.c/?hl=29#L29
[6] https://sources.debian.org/src/mit-scheme/12.1-3/src/microcode/ux.c/#L826
[7] 
https://sources.debian.org/src/luajit/2.1.0+openresty20240815-1/src/lj_alloc.c/

> With full address space by default, this small class of applications is
> going to *broken* unless they would handle RISC-V case specifically.
> 
> On other hand, if you limit VA to 128TiB by default (like many
> architectures do[1]) everything would work without intervention.
> And if an app needs wider address space it would get it with hint opt-in,
> because it is required on x86-64 anyway. Again, no RISC-V-specific code.
> 
> I see no upside with your approach. Just worse user experience.
> 
> [1] See va_high_addr_switch test case in 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/mm/Makefile#n115
> 




Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-10-21 Thread Liam R. Howlett
* Steven Price  [241021 09:23]:
> On 09/09/2024 10:46, Kirill A. Shutemov wrote:
> > On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote:
> >> On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
> >>> On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
>  Some applications rely on placing data in free bits of addresses allocated
>  by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
>  address returned by mmap to be less than the 48-bit address space,
>  unless the hint address uses more than 47 bits (the 48th bit is reserved
>  for the kernel address space).
> 
>  The riscv architecture needs a way to similarly restrict the virtual
>  address space. On the riscv port of OpenJDK an error is thrown if
>  it is run on the 57-bit address space, called sv57 [1].  golang
>  has a comment that sv57 support is not complete, but there are some
>  workarounds to get it to mostly work [2].
> > 
> > I also saw libmozjs crashing with 57-bit address space on x86.
> > 
>  These applications work on x86 because x86 does an implicit 47-bit
>  restriction of mmap() addresses that contain a hint address that is less
>  than 48 bits.
> 
>  Instead of implicitly restricting the address space on riscv (or any
>  current/future architecture), a flag would allow users to opt-in to this
>  behavior rather than opt-out as is done on other architectures. This is
>  desirable because it is a small class of applications that do pointer
>  masking.
> > 
> > You reiterate the argument about "small class of applications". But it
> > makes no sense to me.
> 
> Sorry to chime in late on this - I had been considering implementing
> something like MAP_BELOW_HINT and found this thread.
> 
> While the examples of applications that want to use high VA bits and get
> bitten by future upgrades are not very persuasive, it's worth pointing
> out that there are a variety of somewhat horrid hacks out there to work
> around this feature not existing.
> 
> E.g. from my brief research into other code:
> 
>   * Box64 seems to have a custom allocator based on reading 
> /proc/self/maps to allocate a block of VA space with a low enough 
> address [1]
> 
>   * PHP has code reading /proc/self/maps - I think this is to find a 
> segment which is close enough to the text segment [2]
> 
>   * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit
> addresses [3][4]

Can't the limited number of applications that need to restrict the upper
bound use an LD_PRELOAD compatible library to do this?

> 
>   * pmdk has some funky code to find the lowest address that meets 
> certain requirements - this does look like an ASLR alternative and 
> probably couldn't directly use MAP_BELOW_HINT, although maybe this 
> suggests we need a mechanism to map without a VA-range? [5]
> 
>   * MIT-Scheme parses /proc/self/maps to find the lowest mapping within 
> a range [6]
> 
>   * LuaJIT uses an approach to 'probe' to find a suitable low address 
> for allocation [7]
> 

Although I did not take a deep dive into each example above, there are
some very odd things being done; we will never cover all the use cases
with an exact API match.  What we have today can be made to work for
these users as they have figured ways to do it.

Are they pretty? no.  Are they common? no.  I'm not sure it's worth
plumbing new MM code in for these users.

> The biggest benefit I see of MAP_BELOW_HINT is that it would allow a
> library to get low addresses without causing any problems for the rest
> of the application. The use case I'm looking at is in a library and 
> therefore a personality mode wouldn't be appropriate (because I don't 
> want to affect the rest of the application). Reading /proc/self/maps
> is also problematic because other threads could be allocating/freeing
> at the same time.

As long as you don't exhaust the lower limit you are trying to allocate
within - which is exactly the issue riscv is hitting.

I understand that you are providing examples to prove that this is
needed, but I feel like you are better demonstrating the flexibility
exists to implement solutions in different ways using today's API.

I think it would be best to use the existing methods and work around the
issue that was created in riscv while future changes could mirror amd64
and arm64.

...
> 
> 
> [1] https://sources.debian.org/src/box64/0.3.0+dfsg-1/src/custommem.c/
> [2] 
> https://sources.debian.org/src/php8.2/8.2.24-1/ext/opcache/shared_alloc_mmap.c/#L62
> [3] 
> https://github.com/FEX-Emu/FEX/blob/main/FEXCore/Source/Utils/Allocator.cpp
> [4] 
> https://github.com/FEX-Emu/FEX/commit/df2f1ad074e5cdfb19a0bd4639b7604f777fb05c
> [5] 
> https://sources.debian.org/src/pmdk/1.13.1-1.1/src/common/mmap_posix.c/?hl=29#L29
> [6] https://sources.debian.org/src/mit-scheme/12.1-3/src/microcode/ux.c/#L826
> [7] 
> https://sources.debi

Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-10-02 Thread Palmer Dabbelt
On Fri, 13 Sep 2024 14:04:06 PDT (-0700), Charlie Jenkins wrote:
> On Fri, Sep 13, 2024 at 08:41:34AM +0100, Lorenzo Stoakes wrote:
>> On Wed, Sep 11, 2024 at 11:18:12PM GMT, Charlie Jenkins wrote:
>> > On Wed, Sep 11, 2024 at 07:21:27PM +0100, Catalin Marinas wrote:
>> > > On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
>> > > > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
>> > > > > * Catalin Marinas  [240906 07:44]:
>> > > > > > On Fri, Sep 06, 2024 at 09:55:42AM +, Arnd Bergmann wrote:
>> > > > > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
>> > > > > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann  
>> > > > > > > > wrote:
>> > > > > > > >> It's also unclear to me how we want this flag to interact with
>> > > > > > > >> the existing logic in arch_get_mmap_end(), which attempts to
>> > > > > > > >> limit the default mapping to a 47-bit address space already.
>> > > > > > > >
>> > > > > > > > To optimize RISC-V progress, I recommend:
>> > > > > > > >
>> > > > > > > > Step 1: Approve the patch.
>> > > > > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
>> > > > > > > > Step 3: Wait approximately several iterations for Go & OpenJDK
>> > > > > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
>> > >
>> > > Point 4 is an ABI change. What guarantees that there isn't still
>> > > software out there that relies on the old behaviour?
>> >
>> > Yeah I don't think it would be desirable to remove the 47 bit
>> > constraint in architectures that already have it.
>> >
>> > >
>> > > > > > > I really want to first see a plausible explanation about why
>> > > > > > > RISC-V can't just implement this using a 47-bit 
>> > > > > > > DEFAULT_MAP_WINDOW
>> > > > > > > like all the other major architectures (x86, arm64, powerpc64),
>> > > > > >
>> > > > > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the 
>> > > > > > default
>> > > > > > configuration. We end up with a 47-bit with 16K pages but for a
>> > > > > > different reason that has to do with LPA2 support (I doubt we need 
>> > > > > > this
>> > > > > > for the user mapping but we need to untangle some of the macros 
>> > > > > > there;
>> > > > > > that's for a separate discussion).
>> > > > > >
>> > > > > > That said, we haven't encountered any user space problems with a 
>> > > > > > 48-bit
>> > > > > > DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar
>> > > > > > approach (47 or 48 bit default limit). Better to have some ABI
>> > > > > > consistency between architectures. One can still ask for addresses 
>> > > > > > above
>> > > > > > this default limit via mmap().
>> > > > >
>> > > > > I think that is best as well.
>> > > > >
>> > > > > Can we please just do what x86 and arm64 does?
>> > > >
>> > > > I responded to Arnd in the other thread, but I am still not convinced
>> > > > that the solution that x86 and arm64 have selected is the best 
>> > > > solution.
>> > > > The solution of defaulting to 47 bits does allow applications the
>> > > > ability to get addresses that are below 47 bits. However, due to
>> > > > differences across architectures it doesn't seem possible to have all
>> > > > architectures default to the same value. Additionally, this flag will 
>> > > > be
>> > > > able to help users avoid potential bugs where a hint address is passed
>> > > > that causes upper bits of a VA to be used.
>> > >
>> > > The reason we added this limit on arm64 is that we noticed programs
>> > > using the top 8 bits of a 64-bit pointer for additional information.
>> > > IIRC, it wasn't even openJDK but some JavaScript JIT. We could have
>> > > taught those programs of a new flag but since we couldn't tell how many
>> > > are out there, it was the safest to default to a smaller limit and opt
>> > > in to the higher one. Such opt-in is via mmap() but if you prefer a
>> > > prctl() flag, that's fine by me as well (though I think this should be
>> > > opt-in to higher addresses rather than opt-out of the higher addresses).
>> >
>> > The mmap() flag was used in previous versions but was decided against
>> > because this feature is more useful if it is process-wide. A
>> > personality() flag was chosen instead of a prctl() flag because there
>> > existed other flags in personality() that were similar. I am tempted to
>> > use prctl() however because then we could have an additional arg to
>> > select the exact number of bits that should be reserved (rather than
>> > being fixed at 47 bits).
>>
>> I am very much not in favour of a prctl(), it would require us to add state
>> limiting the address space and the timing of it becomes critical. Then we
>> have the same issue we do with the other proposals as to - what happens if
>> this is too low?
>>
>> What is 'too low' varies by architecture, and for 32-bit architectures
>> could get quite... problematic.
>>
>> And again, what is the RoI here - we're introducing maintenance burden and edge
>> cases vs. the x86 s

Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-19 Thread Michael Ellerman
Charlie Jenkins  writes:
> On Wed, Sep 11, 2024 at 11:38:55PM +1000, Michael Ellerman wrote:
>> Geert Uytterhoeven  writes:
>> > Hi Christophe,
>> >
>> > On Tue, Sep 10, 2024 at 11:21 AM Christophe Leroy
>> >  wrote:
>> >> >>> diff --git a/include/uapi/linux/personality.h 
>> >> >>> b/include/uapi/linux/personality.h
>> >> >>> index 49796b7756af..cd3b8c154d9b 100644
>> >> >>> --- a/include/uapi/linux/personality.h
>> >> >>> +++ b/include/uapi/linux/personality.h
>> >> >>> @@ -22,6 +22,7 @@ enum {
>> >> >>> WHOLE_SECONDS = 0x200,
>> >> >>> STICKY_TIMEOUTS =   0x400,
>> >> >>> ADDR_LIMIT_3GB =0x800,
>> >> >>> +   ADDR_LIMIT_47BIT =  0x1000,
>> >> >>>   };
>> >> >>
>> >> >> I wonder if ADDR_LIMIT_128T would be clearer?
>> >> >>
>> >> >
>> >> > I don't follow, what does 128T represent?
>> >>
>> >> 128T is 128 Terabytes, that's the maximum size achievable with a 47BIT
>> >> address, that naming would be more consistent with the ADDR_LIMIT_3GB
>> >> just above that means a 3 Gigabytes limit.
>> >
>> > Hence ADDR_LIMIT_128TB?
>> 
>> Yes it should be 128TB. Typo by me.
>
> 47BIT was selected because the usecase for this flag is for applications
> that want to store data in the upper bits of a virtual address space. In
> this case, how large the virtual address space is does not matter; what
> matters is the number of bits that are being used, and hence the number
> of bits that are free.

Yeah I understand that's how you came to the problem.

But for the user API I think using the size of the address space is
clearer, easier to explain, and matches the existing ADDR_LIMIT_3GB.

cheers



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-13 Thread Charlie Jenkins
On Fri, Sep 13, 2024 at 08:41:34AM +0100, Lorenzo Stoakes wrote:
> On Wed, Sep 11, 2024 at 11:18:12PM GMT, Charlie Jenkins wrote:
> > On Wed, Sep 11, 2024 at 07:21:27PM +0100, Catalin Marinas wrote:
> > > On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
> > > > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
> > > > > * Catalin Marinas  [240906 07:44]:
> > > > > > On Fri, Sep 06, 2024 at 09:55:42AM +, Arnd Bergmann wrote:
> > > > > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > > > > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann  
> > > > > > > > wrote:
> > > > > > > >> It's also unclear to me how we want this flag to interact with
> > > > > > > >> the existing logic in arch_get_mmap_end(), which attempts to
> > > > > > > >> limit the default mapping to a 47-bit address space already.
> > > > > > > >
> > > > > > > > To optimize RISC-V progress, I recommend:
> > > > > > > >
> > > > > > > > Step 1: Approve the patch.
> > > > > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > > > > > > > Step 3: Wait approximately several iterations for Go & OpenJDK
> > > > > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
> > >
> > > Point 4 is an ABI change. What guarantees that there isn't still
> > > software out there that relies on the old behaviour?
> >
> > Yeah I don't think it would be desirable to remove the 47 bit
> > constraint in architectures that already have it.
> >
> > >
> > > > > > > I really want to first see a plausible explanation about why
> > > > > > > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
> > > > > > > like all the other major architectures (x86, arm64, powerpc64),
> > > > > >
> > > > > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the 
> > > > > > default
> > > > > > configuration. We end up with a 47-bit with 16K pages but for a
> > > > > > different reason that has to do with LPA2 support (I doubt we need 
> > > > > > this
> > > > > > for the user mapping but we need to untangle some of the macros 
> > > > > > there;
> > > > > > that's for a separate discussion).
> > > > > >
> > > > > > That said, we haven't encountered any user space problems with a 
> > > > > > 48-bit
> > > > > > DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar
> > > > > > approach (47 or 48 bit default limit). Better to have some ABI
> > > > > > consistency between architectures. One can still ask for addresses 
> > > > > > above
> > > > > > this default limit via mmap().
> > > > >
> > > > > I think that is best as well.
> > > > >
> > > > > Can we please just do what x86 and arm64 does?
> > > >
> > > > I responded to Arnd in the other thread, but I am still not convinced
> > > > that the solution that x86 and arm64 have selected is the best solution.
> > > > The solution of defaulting to 47 bits does allow applications the
> > > > ability to get addresses that are below 47 bits. However, due to
> > > > differences across architectures it doesn't seem possible to have all
> > > > architectures default to the same value. Additionally, this flag will be
> > > > able to help users avoid potential bugs where a hint address is passed
> > > > that causes upper bits of a VA to be used.
> > >
> > > The reason we added this limit on arm64 is that we noticed programs
> > > using the top 8 bits of a 64-bit pointer for additional information.
> > > IIRC, it wasn't even openJDK but some JavaScript JIT. We could have
> > > taught those programs of a new flag but since we couldn't tell how many
> > > are out there, it was the safest to default to a smaller limit and opt
> > > in to the higher one. Such opt-in is via mmap() but if you prefer a
> > > prctl() flag, that's fine by me as well (though I think this should be
> > > opt-in to higher addresses rather than opt-out of the higher addresses).
> >
> > The mmap() flag was used in previous versions but was decided against
> > because this feature is more useful if it is process-wide. A
> > personality() flag was chosen instead of a prctl() flag because there
> > existed other flags in personality() that were similar. I am tempted to
> > use prctl() however because then we could have an additional arg to
> > select the exact number of bits that should be reserved (rather than
> > being fixed at 47 bits).
> 
> I am very much not in favour of a prctl(), it would require us to add state
> limiting the address space and the timing of it becomes critical. Then we
> have the same issue we do with the other proposals as to - what happens if
> this is too low?
> 
> What is 'too low' varies by architecture, and for 32-bit architectures
> could get quite... problematic.
> 
> And again, what is the RoI here - we are introducing maintenance burden and
> edge
> than 128 TiB of address space? A problem that does not appear to exist in
> reality?
> 
> I suggested the personality approach as t

Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-13 Thread Charlie Jenkins
On Fri, Sep 13, 2024 at 11:08:23AM +0100, Catalin Marinas wrote:
> On Thu, Sep 12, 2024 at 02:15:59PM -0700, Charlie Jenkins wrote:
> > On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote:
> > > On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
> > > > Opting-in to the higher address space is reasonable. However, it is not
> > > > my preference, because the purpose of this flag is to ensure that
> > > > allocations do not exceed 47-bits, so it is a clearer ABI to have the
> > > > applications that want this guarantee to be the ones setting the flag,
> > > > rather than the applications that want the higher bits setting the flag.
> > > 
> > > Yes, this would be ideal. Unfortunately those applications don't know
> > > they need to set a flag in order to work.
> > 
> > It's not a regression, the applications never worked (on platforms that
> > do not have this default). The 47-bit default would allow applications
> > that didn't work to start working at the cost of a non-ideal ABI. That
> > doesn't seem like a reasonable tradeoff to me.  If applications want to
> > run on new hardware that has different requirements, shouldn't they be
> > required to update rather than expect the kernel will solve their
> > problems for them?
> 
> That's a valid point but it depends on the application and how much you
> want to spend updating user-space. OpenJDK is fine, if you need a JIT
> you'll have to add support for that architecture anyway. But others are
> arch-agnostic, you just recompile to your target. It's not an ABI
> problem, more of an API one.

The arch-agnosticism is my hope with this personality flag: it can be
added to arch-agnostic userspace code and allow the application to work
everywhere, but it does have the downside of requiring that change to
user-space code.

> 
> The x86 case (and powerpc/arm64) was different, the 47-bit worked for a
> long time before expanding it. So it made a lot of sense to keep the
> same default.

Yes it is very reasonable that this solution was selected for those
architectures since the support for higher address spaces evolved in the
manner that it did!

- Charlie

> 
> Anyway, the prctl() can go both ways, either expanding or limiting the
> default address space. So I'd be fine with such interface.
> 
> -- 
> Catalin



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-13 Thread Catalin Marinas
On Fri, Sep 13, 2024 at 11:08:23AM +0100, Catalin Marinas wrote:
> On Thu, Sep 12, 2024 at 02:15:59PM -0700, Charlie Jenkins wrote:
> > On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote:
> > > On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
> > > > Opting-in to the higher address space is reasonable. However, it is not
> > > > my preference, because the purpose of this flag is to ensure that
> > > > allocations do not exceed 47-bits, so it is a clearer ABI to have the
> > > > applications that want this guarantee to be the ones setting the flag,
> > > > rather than the applications that want the higher bits setting the flag.
[...]
> Anyway, the prctl() can go both ways, either expanding or limiting the
> default address space. So I'd be fine with such interface.

Ah, I just realised (while reading Lorenzo's reply) that we can't really
restrict the space via a prctl() as we have the main thread stack
already allocated by the kernel before the user code starts. You may
need to limit this stack as well, not just the later heap allocations
(anonymous mmap()).

-- 
Catalin



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-13 Thread Catalin Marinas
On Thu, Sep 12, 2024 at 02:15:59PM -0700, Charlie Jenkins wrote:
> On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote:
> > On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
> > > Opting-in to the higher address space is reasonable. However, it is not
> > > my preference, because the purpose of this flag is to ensure that
> > > allocations do not exceed 47-bits, so it is a clearer ABI to have the
> > > applications that want this guarantee to be the ones setting the flag,
> > > rather than the applications that want the higher bits setting the flag.
> > 
> > Yes, this would be ideal. Unfortunately those applications don't know
> > they need to set a flag in order to work.
> 
> It's not a regression, the applications never worked (on platforms that
> do not have this default). The 47-bit default would allow applications
> that didn't work to start working at the cost of a non-ideal ABI. That
> doesn't seem like a reasonable tradeoff to me.  If applications want to
> run on new hardware that has different requirements, shouldn't they be
> required to update rather than expect the kernel will solve their
> problems for them?

That's a valid point but it depends on the application and how much you
want to spend updating user-space. OpenJDK is fine, if you need a JIT
you'll have to add support for that architecture anyway. But others are
arch-agnostic, you just recompile to your target. It's not an ABI
problem, more of an API one.

The x86 case (and powerpc/arm64) was different, the 47-bit worked for a
long time before expanding it. So it made a lot of sense to keep the
same default.

Anyway, the prctl() can go both ways, either expanding or limiting the
default address space. So I'd be fine with such interface.

-- 
Catalin



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-13 Thread Lorenzo Stoakes
On Wed, Sep 11, 2024 at 11:18:12PM GMT, Charlie Jenkins wrote:
> On Wed, Sep 11, 2024 at 07:21:27PM +0100, Catalin Marinas wrote:
> > On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
> > > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
> > > > * Catalin Marinas  [240906 07:44]:
> > > > > On Fri, Sep 06, 2024 at 09:55:42AM +, Arnd Bergmann wrote:
> > > > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > > > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann  
> > > > > > > wrote:
> > > > > > >> It's also unclear to me how we want this flag to interact with
> > > > > > >> the existing logic in arch_get_mmap_end(), which attempts to
> > > > > > >> limit the default mapping to a 47-bit address space already.
> > > > > > >
> > > > > > > To optimize RISC-V progress, I recommend:
> > > > > > >
> > > > > > > Step 1: Approve the patch.
> > > > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > > > > > > Step 3: Wait approximately several iterations for Go & OpenJDK
> > > > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
> >
> > Point 4 is an ABI change. What guarantees that there isn't still
> > software out there that relies on the old behaviour?
>
> Yeah I don't think it would be desirable to remove the 47 bit
> constraint in architectures that already have it.
>
> >
> > > > > > I really want to first see a plausible explanation about why
> > > > > > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
> > > > > > like all the other major architectures (x86, arm64, powerpc64),
> > > > >
> > > > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default
> > > > > configuration. We end up with a 47-bit with 16K pages but for a
> > > > > different reason that has to do with LPA2 support (I doubt we need 
> > > > > this
> > > > > for the user mapping but we need to untangle some of the macros there;
> > > > > that's for a separate discussion).
> > > > >
> > > > > That said, we haven't encountered any user space problems with a 
> > > > > 48-bit
> > > > > DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar
> > > > > approach (47 or 48 bit default limit). Better to have some ABI
> > > > > consistency between architectures. One can still ask for addresses 
> > > > > above
> > > > > this default limit via mmap().
> > > >
> > > > I think that is best as well.
> > > >
> > > > Can we please just do what x86 and arm64 does?
> > >
> > > I responded to Arnd in the other thread, but I am still not convinced
> > > that the solution that x86 and arm64 have selected is the best solution.
> > > The solution of defaulting to 47 bits does allow applications the
> > > ability to get addresses that are below 47 bits. However, due to
> > > differences across architectures it doesn't seem possible to have all
> > > architectures default to the same value. Additionally, this flag will be
> > > able to help users avoid potential bugs where a hint address is passed
> > > that causes upper bits of a VA to be used.
> >
> > The reason we added this limit on arm64 is that we noticed programs
> > using the top 8 bits of a 64-bit pointer for additional information.
> > IIRC, it wasn't even openJDK but some JavaScript JIT. We could have
> > taught those programs of a new flag but since we couldn't tell how many
> > are out there, it was the safest to default to a smaller limit and opt
> > in to the higher one. Such opt-in is via mmap() but if you prefer a
> > prctl() flag, that's fine by me as well (though I think this should be
> > opt-in to higher addresses rather than opt-out of the higher addresses).
>
> The mmap() flag was used in previous versions but was decided against
> because this feature is more useful if it is process-wide. A
> personality() flag was chosen instead of a prctl() flag because there
> existed other flags in personality() that were similar. I am tempted to
> use prctl() however because then we could have an additional arg to
> select the exact number of bits that should be reserved (rather than
> being fixed at 47 bits).

I am very much not in favour of a prctl(), it would require us to add state
limiting the address space and the timing of it becomes critical. Then we
have the same issue we do with the other proposals as to - what happens if
this is too low?

What is 'too low' varies by architecture, and for 32-bit architectures
could get quite... problematic.

And again, what is the RoI here - we are introducing maintenance burden and
edge cases vs. the x86 solution in order to... accommodate things that need more
than 128 TiB of address space? A problem that does not appear to exist in
reality?

I suggested the personality approach as the least impactful compromise way
of this series working, but I think after what Arnd has said (and please
forgive me if I've missed further discussion - I have been dipping in and out
of this!) - adapting RISC-V to the approach we take elsewhere seems the
most sensible solution.

Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-12 Thread Charlie Jenkins
On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote:
> On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
> > Opting-in to the higher address space is reasonable. However, it is not
> > my preference, because the purpose of this flag is to ensure that
> > allocations do not exceed 47-bits, so it is a clearer ABI to have the
> > applications that want this guarantee to be the ones setting the flag,
> > rather than the applications that want the higher bits setting the flag.
> 
> Yes, this would be ideal. Unfortunately those applications don't know
> they need to set a flag in order to work.

It's not a regression, the applications never worked (on platforms that
do not have this default). The 47-bit default would allow applications
that didn't work to start working at the cost of a non-ideal ABI. That
doesn't seem like a reasonable tradeoff to me.  If applications want to
run on new hardware that has different requirements, shouldn't they be
required to update rather than expect the kernel will solve their
problems for them?

> 
> A slightly better option is to leave the default 47-bit at the kernel
> ABI level and have the libc/dynamic loader issue the prctl(). You can
> control the default with environment variables if needed.

Having glibc set the 47-bit requirement could make it slightly easier
for applications since they would only have to set the environment
variable. After the kernel interface is approved I can look into
supporting that.

- Charlie

> 
> We do something similar in glibc for arm64 MTE. When MTE is enabled, the
> top byte of an allocated pointer contains the tag that must not be
> corrupted. We left the decision to the C library via the
> glibc.mem.tagging tunable (Android has something similar via the app
> manifest). An app can change the default if it wants but if you run with
> old glibc or no environment variable to say otherwise, the default would
> be safe. Distros can set the environment to be the maximum range by
> default if they know the apps included have been upgraded and tested.
> 
> -- 
> Catalin



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-12 Thread Catalin Marinas
On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
> Opting-in to the higher address space is reasonable. However, it is not
> my preference, because the purpose of this flag is to ensure that
> allocations do not exceed 47-bits, so it is a clearer ABI to have the
> applications that want this guarantee to be the ones setting the flag,
> rather than the applications that want the higher bits setting the flag.

Yes, this would be ideal. Unfortunately those applications don't know
they need to set a flag in order to work.

A slightly better option is to leave the default 47-bit at the kernel
ABI level and have the libc/dynamic loader issue the prctl(). You can
control the default with environment variables if needed.

We do something similar in glibc for arm64 MTE. When MTE is enabled, the
top byte of an allocated pointer contains the tag that must not be
corrupted. We left the decision to the C library via the
glibc.mem.tagging tunable (Android has something similar via the app
manifest). An app can change the default if it wants but if you run with
old glibc or no environment variable to say otherwise, the default would
be safe. Distros can set the environment to be the maximum range by
default if they know the apps included have been upgraded and tested.

-- 
Catalin



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-11 Thread Charlie Jenkins
On Wed, Sep 11, 2024 at 11:38:55PM +1000, Michael Ellerman wrote:
> Geert Uytterhoeven  writes:
> > Hi Christophe,
> >
> > On Tue, Sep 10, 2024 at 11:21 AM Christophe Leroy
> >  wrote:
> >> >>> diff --git a/include/uapi/linux/personality.h 
> >> >>> b/include/uapi/linux/personality.h
> >> >>> index 49796b7756af..cd3b8c154d9b 100644
> >> >>> --- a/include/uapi/linux/personality.h
> >> >>> +++ b/include/uapi/linux/personality.h
> >> >>> @@ -22,6 +22,7 @@ enum {
>> >>> WHOLE_SECONDS = 0x2000000,
>> >>> STICKY_TIMEOUTS =   0x4000000,
>> >>> ADDR_LIMIT_3GB =0x8000000,
>> >>> +   ADDR_LIMIT_47BIT =  0x10000000,
> >> >>>   };
> >> >>
> >> >> I wonder if ADDR_LIMIT_128T would be clearer?
> >> >>
> >> >
> >> > I don't follow, what does 128T represent?
> >>
>> 128T is 128 Terabytes, that's the maximum size achievable with a 47BIT
>> address, that naming would be more consistent with the ADDR_LIMIT_3GB
>> just above, which means a 3 Gigabyte limit.
> >
> > Hence ADDR_LIMIT_128TB?
> 
> Yes it should be 128TB. Typo by me.
> 
> cheers

47BIT was selected because the usecase for this flag is for applications
that want to store data in the upper bits of a virtual address space. In
this case, how large the virtual address space is doesn't matter; what
matters is the number of bits that are being used, and hence the number
of bits that are free.

- Charlie




Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-11 Thread Charlie Jenkins
On Wed, Sep 11, 2024 at 07:21:27PM +0100, Catalin Marinas wrote:
> On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
> > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
> > > * Catalin Marinas  [240906 07:44]:
> > > > On Fri, Sep 06, 2024 at 09:55:42AM +, Arnd Bergmann wrote:
> > > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann  wrote:
> > > > > >> It's also unclear to me how we want this flag to interact with
> > > > > >> the existing logic in arch_get_mmap_end(), which attempts to
> > > > > >> limit the default mapping to a 47-bit address space already.
> > > > > >
> > > > > > To optimize RISC-V progress, I recommend:
> > > > > >
> > > > > > Step 1: Approve the patch.
> > > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > > > > > Step 3: Wait approximately several iterations for Go & OpenJDK
> > > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
> 
> Point 4 is an ABI change. What guarantees that there isn't still
> software out there that relies on the old behaviour?

Yeah I don't think it would be desirable to remove the 47 bit
constraint in architectures that already have it.

> 
> > > > > I really want to first see a plausible explanation about why
> > > > > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
> > > > > like all the other major architectures (x86, arm64, powerpc64),
> > > > 
> > > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default
> > > > configuration. We end up with a 47-bit with 16K pages but for a
> > > > different reason that has to do with LPA2 support (I doubt we need this
> > > > for the user mapping but we need to untangle some of the macros there;
> > > > that's for a separate discussion).
> > > > 
> > > > That said, we haven't encountered any user space problems with a 48-bit
> > > > DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar
> > > > approach (47 or 48 bit default limit). Better to have some ABI
> > > > consistency between architectures. One can still ask for addresses above
> > > > this default limit via mmap().
> > > 
> > > I think that is best as well.
> > > 
> > > Can we please just do what x86 and arm64 does?
> > 
> > I responded to Arnd in the other thread, but I am still not convinced
> > that the solution that x86 and arm64 have selected is the best solution.
> > The solution of defaulting to 47 bits does allow applications the
> > ability to get addresses that are below 47 bits. However, due to
> > differences across architectures it doesn't seem possible to have all
> > architectures default to the same value. Additionally, this flag will be
> > able to help users avoid potential bugs where a hint address is passed
> > that causes upper bits of a VA to be used.
> 
> The reason we added this limit on arm64 is that we noticed programs
> using the top 8 bits of a 64-bit pointer for additional information.
> IIRC, it wasn't even openJDK but some JavaScript JIT. We could have
> taught those programs of a new flag but since we couldn't tell how many
> are out there, it was the safest to default to a smaller limit and opt
> in to the higher one. Such opt-in is via mmap() but if you prefer a
> prctl() flag, that's fine by me as well (though I think this should be
> opt-in to higher addresses rather than opt-out of the higher addresses).

The mmap() flag was used in previous versions but was decided against
because this feature is more useful if it is process-wide. A
personality() flag was chosen instead of a prctl() flag because there
existed other flags in personality() that were similar. I am tempted to
use prctl() however because then we could have an additional arg to
select the exact number of bits that should be reserved (rather than
being fixed at 47 bits).

Opting-in to the higher address space is reasonable. However, it is not
my preference, because the purpose of this flag is to ensure that
allocations do not exceed 47-bits, so it is a clearer ABI to have the
applications that want this guarantee to be the ones setting the flag,
rather than the applications that want the higher bits setting the flag.

- Charlie

> 
> -- 
> Catalin






Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-11 Thread Charlie Jenkins
On Wed, Sep 11, 2024 at 07:25:08AM +, Arnd Bergmann wrote:
> On Wed, Sep 11, 2024, at 00:45, Charlie Jenkins wrote:
> > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
> >
> > I responded to Arnd in the other thread, but I am still not convinced
> > that the solution that x86 and arm64 have selected is the best solution.
> > The solution of defaulting to 47 bits does allow applications the
> > ability to get addresses that are below 47 bits. However, due to
> > differences across architectures it doesn't seem possible to have all
> > architectures default to the same value. Additionally, this flag will be
> > able to help users avoid potential bugs where a hint address is passed
> > that causes upper bits of a VA to be used.
> >
> > The other issue I have with this is that if there is not a hint address
> > specified to be greater than 47 bits on x86, then mmap() may return an
> > address that is greater than 47-bits. The documentation in
> > Documentation/arch/x86/x86_64/5level-paging.rst says:
> >
> > "If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> > to look for unmapped area by specified address. If it's already
> > occupied, we look for unmapped area in *full* address space, rather than
> > from 47-bit window."
> 
> This is also in the commit message of b569bab78d8d ("x86/mm: Prepare
> to expose larger address space to userspace"), which introduced it.
> However, I don't actually see the fallback to the full address space,
> instead the actual behavior seems to be the same as arm64.
> 
> Am I missing something in the x86 implementation, or do we just
> need to update the documentation?
> 
>   Arnd

Yeah I guess it is incorrect documentation then? It seems more
reasonable to me to have a hint address fall back onto the larger
address space because otherwise the "hint" address can cause allocations
to fail even if there is space above the 47-bit limit. This is another
reason I wanted to avoid having this default behavior on riscv, to not
have this abuse of the hint address.

- Charlie




Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-11 Thread Catalin Marinas
On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
> On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
> > * Catalin Marinas  [240906 07:44]:
> > > On Fri, Sep 06, 2024 at 09:55:42AM +, Arnd Bergmann wrote:
> > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann  wrote:
> > > > >> It's also unclear to me how we want this flag to interact with
> > > > >> the existing logic in arch_get_mmap_end(), which attempts to
> > > > >> limit the default mapping to a 47-bit address space already.
> > > > >
> > > > > To optimize RISC-V progress, I recommend:
> > > > >
> > > > > Step 1: Approve the patch.
> > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > > > > Step 3: Wait approximately several iterations for Go & OpenJDK
> > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()

Point 4 is an ABI change. What guarantees that there isn't still
software out there that relies on the old behaviour?

> > > > I really want to first see a plausible explanation about why
> > > > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
> > > > like all the other major architectures (x86, arm64, powerpc64),
> > > 
> > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default
> > > configuration. We end up with a 47-bit with 16K pages but for a
> > > different reason that has to do with LPA2 support (I doubt we need this
> > > for the user mapping but we need to untangle some of the macros there;
> > > that's for a separate discussion).
> > > 
> > > That said, we haven't encountered any user space problems with a 48-bit
> > > DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar
> > > approach (47 or 48 bit default limit). Better to have some ABI
> > > consistency between architectures. One can still ask for addresses above
> > > this default limit via mmap().
> > 
> > I think that is best as well.
> > 
> > Can we please just do what x86 and arm64 does?
> 
> I responded to Arnd in the other thread, but I am still not convinced
> that the solution that x86 and arm64 have selected is the best solution.
> The solution of defaulting to 47 bits does allow applications the
> ability to get addresses that are below 47 bits. However, due to
> differences across architectures it doesn't seem possible to have all
> architectures default to the same value. Additionally, this flag will be
> able to help users avoid potential bugs where a hint address is passed
> that causes upper bits of a VA to be used.

The reason we added this limit on arm64 is that we noticed programs
using the top 8 bits of a 64-bit pointer for additional information.
IIRC, it wasn't even openJDK but some JavaScript JIT. We could have
taught those programs of a new flag but since we couldn't tell how many
are out there, it was the safest to default to a smaller limit and opt
in to the higher one. Such opt-in is via mmap() but if you prefer a
prctl() flag, that's fine by me as well (though I think this should be
opt-in to higher addresses rather than opt-out of the higher addresses).

-- 
Catalin



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-11 Thread Michael Ellerman
"Arnd Bergmann"  writes:
> On Mon, Sep 9, 2024, at 23:22, Charlie Jenkins wrote:
>> On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote:
>>> On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote:
>>> The intent is to optionally be able to run a process that keeps higher bits
>>> free for tagging and to be sure no memory mapping in the process will
>>> clobber these (correct me if I'm wrong Charlie! :)
...
> Let's see what the other architectures do and then come up with
> a way that fixes the pointer tagging case first on those that are
> broken. We can see if there needs to be an extra flag after that.
> Here is what I found:
>
> - x86_64 uses DEFAULT_MAP_WINDOW of BIT(47), uses a 57 bit
>   address space when an addr hint is passed.
> - arm64 uses DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns
>   higher 52-bit addresses when either a hint is passed or
>   CONFIG_EXPERT and CONFIG_ARM64_FORCE_52BIT is set (this
>   is a debugging option)
> - ppc64 uses a DEFAULT_MAP_WINDOW of BIT(47) or BIT(48),
>   returns 52 bit address when an addr hint is passed
It's 46 or 47 depending on PAGE_SIZE (4K or 64K):

  $ git grep "define DEFAULT_MAP_WINDOW_USER64" arch/powerpc/include/asm/task_size_64.h
  arch/powerpc/include/asm/task_size_64.h:#define DEFAULT_MAP_WINDOW_USER64 TASK_SIZE_128TB
  arch/powerpc/include/asm/task_size_64.h:#define DEFAULT_MAP_WINDOW_USER64 TASK_SIZE_64TB

cheers



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-11 Thread Michael Ellerman
Geert Uytterhoeven  writes:
> Hi Christophe,
>
> On Tue, Sep 10, 2024 at 11:21 AM Christophe Leroy
>  wrote:
>> >>> diff --git a/include/uapi/linux/personality.h 
>> >>> b/include/uapi/linux/personality.h
>> >>> index 49796b7756af..cd3b8c154d9b 100644
>> >>> --- a/include/uapi/linux/personality.h
>> >>> +++ b/include/uapi/linux/personality.h
>> >>> @@ -22,6 +22,7 @@ enum {
>> >>> WHOLE_SECONDS = 0x2000000,
>> >>> STICKY_TIMEOUTS =   0x4000000,
>> >>> ADDR_LIMIT_3GB =0x8000000,
>> >>> +   ADDR_LIMIT_47BIT =  0x10000000,
>> >>>   };
>> >>
>> >> I wonder if ADDR_LIMIT_128T would be clearer?
>> >>
>> >
>> > I don't follow, what does 128T represent?
>>
>> 128T is 128 Terabytes, that's the maximum size achievable with a 47BIT
>> address, that naming would be more consistent with the ADDR_LIMIT_3GB
>> just above that means a 3 Gigabytes limit.
>
> Hence ADDR_LIMIT_128TB?

Yes it should be 128TB. Typo by me.

cheers



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-11 Thread Michael Ellerman
Charlie Jenkins  writes:
> On Fri, Sep 06, 2024 at 04:59:40PM +1000, Michael Ellerman wrote:
>> Charlie Jenkins  writes:
>> > Create a personality flag ADDR_LIMIT_47BIT to support applications
>> > that wish to transition from running in environments that support at
>> > most 47-bit VAs to environments that support larger VAs. This
>> > personality can be set to cause all allocations to be below the 47-bit
>> > boundary. Using MAP_FIXED with mmap() will bypass this restriction.
>> >
>> > Signed-off-by: Charlie Jenkins 
>> > ---
>> >  include/uapi/linux/personality.h | 1 +
>> >  mm/mmap.c| 3 +++
>> >  2 files changed, 4 insertions(+)
>> >
>> > diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
>> > index 49796b7756af..cd3b8c154d9b 100644
>> > --- a/include/uapi/linux/personality.h
>> > +++ b/include/uapi/linux/personality.h
>> > @@ -22,6 +22,7 @@ enum {
>> > 	WHOLE_SECONDS =		0x2000000,
>> > 	STICKY_TIMEOUTS =	0x4000000,
>> > 	ADDR_LIMIT_3GB =	0x8000000,
>> > +	ADDR_LIMIT_47BIT =	0x10000000,
>> >  };
>> 
>> I wonder if ADDR_LIMIT_128T would be clearer?
>> 
>
> I don't follow, what does 128T represent?

Sorry, as Christophe explained it's 128 Terabytes, which is the actual
value of the address limit.

I think expressing it as the address value is probably more widely
understood, and would also match ADDR_LIMIT_3GB.

>> Have you looked at writing an update for the personality(2) man page? :)
>
> I will write an update to the man page if this patch is approved!

Yeah fair enough.

My (poorly expressed) point was that trying to describe the flag for the
man page might highlight that using the 47BIT name requires more
explanation.

cheers



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-11 Thread Arnd Bergmann
On Wed, Sep 11, 2024, at 00:45, Charlie Jenkins wrote:
> On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
>
> I responded to Arnd in the other thread, but I am still not convinced
> that the solution that x86 and arm64 have selected is the best solution.
> The solution of defaulting to 47 bits does allow applications the
> ability to get addresses that are below 47 bits. However, due to
> differences across architectures it doesn't seem possible to have all
> architectures default to the same value. Additionally, this flag will be
> able to help users avoid potential bugs where a hint address is passed
> that causes upper bits of a VA to be used.
>
> The other issue I have with this is that if there is not a hint address
> specified to be greater than 47 bits on x86, then mmap() may return an
> address that is greater than 47-bits. The documentation in
> Documentation/arch/x86/x86_64/5level-paging.rst says:
>
> "If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> to look for unmapped area by specified address. If it's already
> occupied, we look for unmapped area in *full* address space, rather than
> from 47-bit window."

This is also in the commit message of b569bab78d8d ("x86/mm: Prepare
to expose larger address space to userspace"), which introduced it.
However, I don't actually see the fallback to the full address space,
instead the actual behavior seems to be the same as arm64.

Am I missing something in the x86 implementation, or do we just
need to update the documentation?

  Arnd



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-10 Thread Charlie Jenkins
On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
> * Catalin Marinas  [240906 07:44]:
> > On Fri, Sep 06, 2024 at 09:55:42AM +, Arnd Bergmann wrote:
> > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann  wrote:
> > > >> It's also unclear to me how we want this flag to interact with
> > > >> the existing logic in arch_get_mmap_end(), which attempts to
> > > >> limit the default mapping to a 47-bit address space already.
> > > >
> > > > To optimize RISC-V progress, I recommend:
> > > >
> > > > Step 1: Approve the patch.
> > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > > > Step 3: Wait approximately several iterations for Go & OpenJDK
> > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
> > > 
> > > I really want to first see a plausible explanation about why
> > > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
> > > like all the other major architectures (x86, arm64, powerpc64),
> > 
> > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default
> > configuration. We end up with a 47-bit limit with 16K pages but for a
> > different reason that has to do with LPA2 support (I doubt we need this
> > for the user mapping but we need to untangle some of the macros there;
> > that's for a separate discussion).
> > 
> > That said, we haven't encountered any user space problems with a 48-bit
> > DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar
> > approach (47 or 48 bit default limit). Better to have some ABI
> > consistency between architectures. One can still ask for addresses above
> > this default limit via mmap().
> 
> I think that is best as well.
> 
> Can we please just do what x86 and arm64 does?
> 
> Thanks,
> Liam

I responded to Arnd in the other thread, but I am still not convinced
that the solution that x86 and arm64 have selected is the best solution.
The solution of defaulting to 47 bits does allow applications the
ability to get addresses that are below 47 bits. However, due to
differences across architectures it doesn't seem possible to have all
architectures default to the same value. Additionally, this flag will be
able to help users avoid potential bugs where a hint address is passed
that causes upper bits of a VA to be used.

The other issue I have with this is that if there is not a hint address
specified to be greater than 47 bits on x86, then mmap() may return an
address that is greater than 47-bits. The documentation in
Documentation/arch/x86/x86_64/5level-paging.rst says:

"If hint address set above 47-bit, but MAP_FIXED is not specified, we try
to look for unmapped area by specified address. If it's already
occupied, we look for unmapped area in *full* address space, rather than
from 47-bit window."

arm64 on the other hand defines this as only being able to opt-into the
52-bit VA space with the hint address, and my understanding is that
mmap() will not fall back to the 52-bit address space. Please correct me
if I am wrong. From Documentation/arch/arm64/memory.rst:

"To maintain compatibility with software that relies on the ARMv8.0
VA space maximum size of 48-bits, the kernel will, by default,
return virtual addresses to userspace from a 48-bit range.

"Software can "opt-in" to receiving VAs from a 52-bit space by
specifying an mmap hint parameter that is larger than 48-bit."

This is an inconsistency I am trying to solve with this personality
flag.

- Charlie




Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-10 Thread Charlie Jenkins
On Tue, Sep 10, 2024 at 09:13:33AM +, Arnd Bergmann wrote:
> On Mon, Sep 9, 2024, at 23:22, Charlie Jenkins wrote:
> > On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote:
> >> On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote:
> >> The intent is to optionally be able to run a process that keeps higher bits
> >> free for tagging and to be sure no memory mapping in the process will
> >> clobber these (correct me if I'm wrong Charlie! :)
> >> 
> >> So you really wouldn't want this if you are using tagged pointers, you'd
> >> want to be sure literally nothing touches the higher bits.
> 
> My understanding was that the purpose of the existing design
> is to allow applications to ask for a high address without having
> to resort to the complexity of MAP_FIXED.
> 
> In particular, I'm sure there is precedent for applications that
> want both tagged pointers (for most mappings) and untagged pointers
> (for large mappings). With a per-mm_struct or per-task_struct
> setting you can't do that.
> 
> > Various architectures handle the hint address differently, but it
> > appears that the only case across any architecture where an address
> > above 47 bits will be returned is if the application had a hint address
> > with a value greater than 47 bits and was using the MAP_FIXED flag.
> > MAP_FIXED bypasses all other checks so I was assuming that it would be
> > logical for MAP_FIXED to bypass this as well. If MAP_FIXED is not set,
> > then the intent is for no hint address to cause a value greater than 47
> > bits to be returned.
> 
> I don't think the MAP_FIXED case is that interesting here because
> it has to work in both fixed and non-fixed mappings.
> 
> >> This would be more consistent vs. other arches.
> >
> > Yes riscv is an outlier here. The reason I am pushing for something like
> > a flag to restrict the address space rather than setting it to be the
> > default is it seems like if applications are relying on upper bits to be
> > free, then they should be explicitly asking the kernel to keep them free
> > rather than assuming them to be free.
> 
> Let's see what the other architectures do and then come up with
> a way that fixes the pointer tagging case first on those that are
> broken. We can see if there needs to be an extra flag after that.
> Here is what I found:
> 
> - x86_64 uses DEFAULT_MAP_WINDOW of BIT(47), uses a 57 bit
>   address space when an addr hint is passed.
> - arm64 uses DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns
>   higher 52-bit addresses when either a hint is passed or
>   CONFIG_EXPERT and CONFIG_ARM64_FORCE_52BIT is set (this
>   is a debugging option)
> - ppc64 uses a DEFAULT_MAP_WINDOW of BIT(47) or BIT(48),
>   returns 52 bit address when an addr hint is passed
> - riscv uses a DEFAULT_MAP_WINDOW of BIT(47) but only uses
>   it for allocating the stack below, ignoring it for normal
>   mappings
> - s390 has no DEFAULT_MAP_WINDOW but tries to allocate within
>   the current number of pgtable levels and only upgrades to
>   the next level (31, 42, 53, 64 bits) if a hint is passed or
>   the current level is exhausted.
> - loongarch64 has no DEFAULT_MAP_WINDOW, and a default VA
>   space of 47 bits (16K pages, 3 levels), but can support
>   a 55 bit space (64K pages, 3 levels).
> - sparc has no DEFAULT_MAP_WINDOW and up to 52 bit VA space.
>   It may allocate both positive and negative addresses in
>   there. (?)
> - mips64, parisc64 and alpha have no DEFAULT_MAP_WINDOW and
>   at most 48, 41 or 39 address bits, respectively.
> 
> I would suggest these changes:
> 
> - make riscv enforce DEFAULT_MAP_WINDOW like x86_64, arm64
>and ppc64, leave it at 47
> 
> - add DEFAULT_MAP_WINDOW on loongarch64 (47/48 bits
>   based on page size), sparc (48 bits) and s390 (unsure if
>   42, 53, 47 or 48 bits)
> 
> - leave the rest unchanged.
> 
>Arnd

Changing all architectures to have a standardized DEFAULT_MAP_WINDOW
mostly solves the problem. However, I am concerned that it is fragile
for applications to rely on a default like this. Having the personality
bit flag is supposed to provide an intuitive ABI for users that
guarantees that they will not accidentally request for memory outside of
the boundary that they specified.

Also you bring up that the DEFAULT_MAP_WINDOW would not be able to be
standardized across architectures, so we still have the problem that
this default behavior will be different across architectures which I am
trying to solve.

- Charlie




Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-10 Thread Liam R. Howlett
* Catalin Marinas  [240906 07:44]:
> On Fri, Sep 06, 2024 at 09:55:42AM +, Arnd Bergmann wrote:
> > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann  wrote:
> > >> It's also unclear to me how we want this flag to interact with
> > >> the existing logic in arch_get_mmap_end(), which attempts to
> > >> limit the default mapping to a 47-bit address space already.
> > >
> > > To optimize RISC-V progress, I recommend:
> > >
> > > Step 1: Approve the patch.
> > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > > Step 3: Wait approximately several iterations for Go & OpenJDK
> > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
> > 
> > I really want to first see a plausible explanation about why
> > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
> > like all the other major architectures (x86, arm64, powerpc64),
> 
> FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default
> > configuration. We end up with a 47-bit limit with 16K pages but for a
> different reason that has to do with LPA2 support (I doubt we need this
> for the user mapping but we need to untangle some of the macros there;
> that's for a separate discussion).
> 
> That said, we haven't encountered any user space problems with a 48-bit
> DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar
> approach (47 or 48 bit default limit). Better to have some ABI
> consistency between architectures. One can still ask for addresses above
> this default limit via mmap().

I think that is best as well.

Can we please just do what x86 and arm64 does?

Thanks,
Liam



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-10 Thread Geert Uytterhoeven
Hi Christophe,

On Tue, Sep 10, 2024 at 11:21 AM Christophe Leroy
 wrote:
> >>> diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
> >>> index 49796b7756af..cd3b8c154d9b 100644
> >>> --- a/include/uapi/linux/personality.h
> >>> +++ b/include/uapi/linux/personality.h
> >>> @@ -22,6 +22,7 @@ enum {
> >>> 	WHOLE_SECONDS =		0x2000000,
> >>> 	STICKY_TIMEOUTS =	0x4000000,
> >>> 	ADDR_LIMIT_3GB =	0x8000000,
> >>> +	ADDR_LIMIT_47BIT =	0x10000000,
> >>>   };
> >>
> >> I wonder if ADDR_LIMIT_128T would be clearer?
> >>
> >
> > I don't follow, what does 128T represent?
>
> 128T is 128 Terabytes, that's the maximum size achievable with a 47BIT
> address, that naming would be more consistent with the ADDR_LIMIT_3GB
> just above that means a 3 Gigabytes limit.

Hence ADDR_LIMIT_128TB?

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-10 Thread Christophe Leroy

>>> diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
>>> index 49796b7756af..cd3b8c154d9b 100644
>>> --- a/include/uapi/linux/personality.h
>>> +++ b/include/uapi/linux/personality.h
>>> @@ -22,6 +22,7 @@ enum {
>>> 	WHOLE_SECONDS =		0x2000000,
>>> 	STICKY_TIMEOUTS =	0x4000000,
>>> 	ADDR_LIMIT_3GB =	0x8000000,
>>> +	ADDR_LIMIT_47BIT =	0x10000000,
>>>  };

>> I wonder if ADDR_LIMIT_128T would be clearer?

> I don't follow, what does 128T represent?

128T is 128 Terabytes, that's the maximum size achievable with a 47BIT
address, that naming would be more consistent with the ADDR_LIMIT_3GB
just above that means a 3 Gigabytes limit.

Christophe



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-10 Thread Arnd Bergmann
On Mon, Sep 9, 2024, at 23:22, Charlie Jenkins wrote:
> On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote:
>> On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote:
>> The intent is to optionally be able to run a process that keeps higher bits
>> free for tagging and to be sure no memory mapping in the process will
>> clobber these (correct me if I'm wrong Charlie! :)
>> 
>> So you really wouldn't want this if you are using tagged pointers, you'd
>> want to be sure literally nothing touches the higher bits.

My understanding was that the purpose of the existing design
is to allow applications to ask for a high address without having
to resort to the complexity of MAP_FIXED.

In particular, I'm sure there is precedent for applications that
want both tagged pointers (for most mappings) and untagged pointers
(for large mappings). With a per-mm_struct or per-task_struct
setting you can't do that.

> Various architectures handle the hint address differently, but it
> appears that the only case across any architecture where an address
> above 47 bits will be returned is if the application had a hint address
> with a value greater than 47 bits and was using the MAP_FIXED flag.
> MAP_FIXED bypasses all other checks so I was assuming that it would be
> logical for MAP_FIXED to bypass this as well. If MAP_FIXED is not set,
> then the intent is for no hint address to cause a value greater than 47
> bits to be returned.

I don't think the MAP_FIXED case is that interesting here because
it has to work in both fixed and non-fixed mappings.

>> This would be more consistent vs. other arches.
>
> Yes riscv is an outlier here. The reason I am pushing for something like
> a flag to restrict the address space rather than setting it to be the
> default is it seems like if applications are relying on upper bits to be
> free, then they should be explicitly asking the kernel to keep them free
> rather than assuming them to be free.

Let's see what the other architectures do and then come up with
a way that fixes the pointer tagging case first on those that are
broken. We can see if there needs to be an extra flag after that.
Here is what I found:

- x86_64 uses DEFAULT_MAP_WINDOW of BIT(47), uses a 57 bit
  address space when an addr hint is passed.
- arm64 uses DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns
  higher 52-bit addresses when either a hint is passed or
  CONFIG_EXPERT and CONFIG_ARM64_FORCE_52BIT is set (this
  is a debugging option)
- ppc64 uses a DEFAULT_MAP_WINDOW of BIT(47) or BIT(48),
  returns 52 bit address when an addr hint is passed
- riscv uses a DEFAULT_MAP_WINDOW of BIT(47) but only uses
  it for allocating the stack below, ignoring it for normal
  mappings
- s390 has no DEFAULT_MAP_WINDOW but tries to allocate within
  the current number of pgtable levels and only upgrades to
  the next level (31, 42, 53, 64 bits) if a hint is passed or
  the current level is exhausted.
- loongarch64 has no DEFAULT_MAP_WINDOW, and a default VA
  space of 47 bits (16K pages, 3 levels), but can support
  a 55 bit space (64K pages, 3 levels).
- sparc has no DEFAULT_MAP_WINDOW and up to 52 bit VA space.
  It may allocate both positive and negative addresses in
  there. (?)
- mips64, parisc64 and alpha have no DEFAULT_MAP_WINDOW and
  at most 48, 41 or 39 address bits, respectively.

I would suggest these changes:

- make riscv enforce DEFAULT_MAP_WINDOW like x86_64, arm64
   and ppc64, leave it at 47

- add DEFAULT_MAP_WINDOW on loongarch64 (47/48 bits
  based on page size), sparc (48 bits) and s390 (unsure if
  42, 53, 47 or 48 bits)

- leave the rest unchanged.

   Arnd



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-09 Thread Charlie Jenkins
On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote:
> (Sorry having issues with my IPv6 setup that duplicated the original email...
> 
> On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote:
> > On Fri, Sep 6, 2024, at 08:14, Lorenzo Stoakes wrote:
> > > On Fri, Sep 06, 2024 at 07:17:44AM GMT, Arnd Bergmann wrote:
> > >> On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote:
> > >> > Create a personality flag ADDR_LIMIT_47BIT to support applications
> > >> > that wish to transition from running in environments that support at
> > >> > most 47-bit VAs to environments that support larger VAs. This
> > >> > personality can be set to cause all allocations to be below the 47-bit
> > >> > boundary. Using MAP_FIXED with mmap() will bypass this restriction.
> > >> >
> > >> > Signed-off-by: Charlie Jenkins 
> > >>
> > >> I think having an architecture-independent mechanism to limit the size
> > >> of the 64-bit address space is useful in general, and we've discussed
> > >> the same thing for arm64 in the past, though we have not actually
> > >> reached an agreement on the ABI previously.
> > >
> > > The thread on the original proposals attests to this being rather a fraught
> > > topic, and I think the weight of opinion was more so in favour of opt-in
> > > rather than opt-out.
> > > rather than opt-out.
> >
> > You mean opt-in to using the larger addresses like we do on arm64 and
> > powerpc, while "opt-out" means a limit as Charlie suggested?
> 
> I guess I'm not using brilliant terminology here haha!
> 
> To clarify - the weight of opinion was for a situation where the address
> space is limited, except if you set a hint above that (you could call that
> opt-out or opt-in depending which way you look at it, so yeah ok very
> unclear sorry!).
> 
> It was against the MAP_ flag and also I think a _flexible_ per-process
> limit is also questionable as you might end up setting a limit which breaks
> something else, and this starts getting messy quick.
> 
> To be clear, the ADDR_LIMIT_47BIT suggestion is absolutely a compromise and
> practical suggestion.
> 
> >
> > >> > @@ -22,6 +22,7 @@ enum {
> > >> >	WHOLE_SECONDS =		0x2000000,
> > >> >	STICKY_TIMEOUTS =	0x4000000,
> > >> >	ADDR_LIMIT_3GB =	0x8000000,
> > >> > +	ADDR_LIMIT_47BIT =	0x10000000,
> > >> > };
> > >>
> > >> I'm a bit worried about having this done specifically in the
> > >> personality flag bits, as they are rather limited. We obviously
> > >> don't want to add many more such flags when there could be
> > >> a way to just set the default limit.
> > >
> > > Since I'm the one who suggested it, I feel I should offer some kind of
> > > vague defence here :)
> > >
> > > We shouldn't let perfect be the enemy of the good. This is a relatively
> > > straightforward means of achieving the aim (assuming your concern about
> > > arch_get_mmap_end() below isn't a blocker) which has the least impact on
> > > existing code.
> > >
> > > Of course we can end up in absurdities where we start doing
> > > ADDR_LIMIT_xxBIT... but again - it's simple, shouldn't represent an
> > > egregious maintenance burden and is entirely opt-in so has things going 
> > > for
> > > it.
> >
> > I'm more confused now, I think most importantly we should try to
> > handle this consistently across all architectures. The proposed
> > implementation seems to completely block addresses above BIT(47)
> > even for applications that opt in by calling mmap(BIT(47), ...),
> > which seems to break the existing applications.
> 
> Hm, I thought the commit message suggested the hint overrides it still?
> 
> The intent is to optionally be able to run a process that keeps higher bits
> free for tagging and to be sure no memory mapping in the process will
> clobber these (correct me if I'm wrong Charlie! :)
> 
> So you really wouldn't want this if you are using tagged pointers, you'd
> want to be sure literally nothing touches the higher bits.

Various architectures handle the hint address differently, but it
appears that the only case across any architecture where an address
above 47 bits will be returned is if the application had a hint address
with a value greater than 47 bits and was using the MAP_FIXED flag.
MAP_FIXED bypasses all other checks so I was assuming that it would be
logical for MAP_FIXED to bypass this as well. If MAP_FIXED is not set,
then the intent is for no hint address to cause a value greater than 47
bits to be returned.

This does have the issue that if MAP_FIXED is used then an address can
be returned above 47-bits, but if an application does not want addresses
above 47-bits then they shouldn't ask for a fixed address above that
range.

> 
> >
> > If we want this flag for RISC-V and also keep the behavior of
> > defaulting to >BIT(47) addresses for mmap(0, ...) how about
> > changing arch_get_mmap_end() to return the limit based on
> > ADDR_LIMIT_47BIT and then make this default to enabled on
> arm64 and powerpc but disabled on riscv

Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-09 Thread Charlie Jenkins
On Fri, Sep 06, 2024 at 04:59:40PM +1000, Michael Ellerman wrote:
> Charlie Jenkins  writes:
> > Create a personality flag ADDR_LIMIT_47BIT to support applications
> > that wish to transition from running in environments that support at
> > most 47-bit VAs to environments that support larger VAs. This
> > personality can be set to cause all allocations to be below the 47-bit
> > boundary. Using MAP_FIXED with mmap() will bypass this restriction.
> >
> > Signed-off-by: Charlie Jenkins 
> > ---
> >  include/uapi/linux/personality.h | 1 +
> >  mm/mmap.c| 3 +++
> >  2 files changed, 4 insertions(+)
> >
> > diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
> > index 49796b7756af..cd3b8c154d9b 100644
> > --- a/include/uapi/linux/personality.h
> > +++ b/include/uapi/linux/personality.h
> > @@ -22,6 +22,7 @@ enum {
> > 	WHOLE_SECONDS =		0x2000000,
> > 	STICKY_TIMEOUTS =	0x4000000,
> > 	ADDR_LIMIT_3GB =	0x8000000,
> > +	ADDR_LIMIT_47BIT =	0x10000000,
> >  };
> 
> I wonder if ADDR_LIMIT_128T would be clearer?
> 

I don't follow, what does 128T represent?

> Have you looked at writing an update for the personality(2) man page? :)

I will write an update to the man page if this patch is approved!

> 
> cheers

- Charlie




Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-09-09 Thread Kirill A. Shutemov
On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote:
> On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
> > On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
> > > Some applications rely on placing data in the free bits of addresses
> > > allocated by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> > > address returned by mmap to be less than the 48-bit address space,
> > > unless the hint address uses more than 47 bits (the 48th bit is reserved
> > > for the kernel address space).
> > > 
> > > The riscv architecture needs a way to similarly restrict the virtual
> > > address space. On the riscv port of OpenJDK an error is thrown when it is
> > > run on the 57-bit address space, called sv57 [1].  golang
> > > has a comment that sv57 support is not complete, but there are some
> > > workarounds to get it to mostly work [2].

I also saw libmozjs crashing with 57-bit address space on x86.

> > > These applications work on x86 because x86 does an implicit 47-bit
> > > restriction of mmap() address that contain a hint address that is less
> > > than 48 bits.
> > > 
> > > Instead of implicitly restricting the address space on riscv (or any
> > > current/future architecture), a flag would allow users to opt-in to this
> > > behavior rather than opt-out as is done on other architectures. This is
> > > desirable because it is a small class of applications that do pointer
> > > masking.

You reiterate the argument about "small class of applications". But it
makes no sense to me.

With full address space by default, this small class of applications is
going to be *broken* unless they handle the RISC-V case specifically.

On other hand, if you limit VA to 128TiB by default (like many
architectures do[1]) everything would work without intervention.
And if an app needs wider address space it would get it with hint opt-in,
because it is required on x86-64 anyway. Again, no RISC-V-specific code.

I see no upside with your approach. Just worse user experience.

[1] See va_high_addr_switch test case in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/mm/Makefile#n115

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH RFC v3 0/2] mm: Introduce ADDR_LIMIT_47BIT personality flag

2024-09-08 Thread Jiaxun Yang



On September 5, 2024, at 10:15 PM, Charlie Jenkins wrote:
> Some applications rely on placing data in the free bits of addresses
> allocated by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> address returned by mmap to be less than the 48-bit address space,
> unless the hint address uses more than 47 bits (the 48th bit is reserved
> for the kernel address space).
>
> The riscv architecture needs a way to similarly restrict the virtual
> address space. On the riscv port of OpenJDK an error is thrown when it is
> run on the 57-bit address space, called sv57 [1].  golang
> has a comment that sv57 support is not complete, but there are some
> workarounds to get it to mostly work [2].
>
> These applications work on x86 because x86 does an implicit 47-bit
> restriction of mmap() address that contain a hint address that is less
> than 48 bits.
>
> Instead of implicitly restricting the address space on riscv (or any
> current/future architecture), provide a flag to the personality syscall
> that can be used to ensure an application works in any arbitrary VA
> space. A similar feature has already been implemented by the personality
> syscall in ADDR_LIMIT_32BIT.
>
> This flag will also allow seamless compatibility between all
> architectures, so applications like Go and OpenJDK that use bits in a
> virtual address can request the exact number of bits they need in a
> generic way. The flag can be checked inside of vm_unmapped_area() so
> that this flag does not have to be handled individually by each
> architecture. 

Tested-by: Jiaxun Yang 

Tested on MIPS VA 48 system, fixed pointer tagging on mozjs!

Thanks!

[...]
-- 
- Jiaxun



Re: [PATCH] [RFC] crash: Lock-free crash hotplug support reporting

2024-09-08 Thread Baoquan He
On 09/07/24 at 10:30am, Sourabh Jain wrote:
> Hello Baoquan,
> 
> Do you think this patch would help reduce lock contention when
> CPU/Memory resources are removed in bulk from a system?
.snip...
--
> >   include/linux/kexec.h | 11 ---
> >   kernel/crash_core.c   | 27 +--
> >   kernel/kexec.c|  5 -
> >   kernel/kexec_file.c   |  7 ++-
> >   4 files changed, 23 insertions(+), 27 deletions(-)
> > 
> > diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> > index f0e9f8eda7a3..bd755ba6bac4 100644
> > --- a/include/linux/kexec.h
> > +++ b/include/linux/kexec.h
> > @@ -318,13 +318,6 @@ struct kimage {
> > unsigned int preserve_context : 1;
> > /* If set, we are using file mode kexec syscall */
> > unsigned int file_mode:1;
> > -#ifdef CONFIG_CRASH_HOTPLUG
> > -   /* If set, it is safe to update kexec segments that are
> > -* excluded from SHA calculation.
> > -*/
> > -   unsigned int hotplug_support:1;
> > -#endif
> > -
> >   #ifdef ARCH_HAS_KIMAGE_ARCH
> > struct kimage_arch arch;
> >   #endif
> > @@ -370,6 +363,10 @@ struct kimage {
> > unsigned long elf_load_addr;
> >   };
> > +#ifdef CONFIG_CRASH_HOTPLUG
> > +extern unsigned int crash_hotplug_support;
> > +#endif
> > +
> >   /* kexec interface functions */
> >   extern void machine_kexec(struct kimage *image);
> >   extern int machine_kexec_prepare(struct kimage *image);
> > diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> > index 63cf89393c6e..3428deba0070 100644
> > --- a/kernel/crash_core.c
> > +++ b/kernel/crash_core.c
> > @@ -30,6 +30,13 @@
> >   #include "kallsyms_internal.h"
> >   #include "kexec_internal.h"
> > +#ifdef CONFIG_CRASH_HOTPLUG
> > +/* if set, it is safe to update kexec segments that are
> > + * excluded from sha calculation.
> > + */
> > +unsigned int crash_hotplug_support;
> > +#endif
> > +
> >   /* Per cpu memory for storing cpu states in case of system crash. */
> >   note_buf_t __percpu *crash_notes;
> > @@ -500,23 +507,7 @@ static DEFINE_MUTEX(__crash_hotplug_lock);
> >*/
> >   int crash_check_hotplug_support(void)
> >   {
> > -   int rc = 0;
> > -
> > -   crash_hotplug_lock();
> > -   /* Obtain lock while reading crash information */
> > -   if (!kexec_trylock()) {
> > -   pr_info("kexec_trylock() failed, elfcorehdr may be 
> > inaccurate\n");
> > -   crash_hotplug_unlock();
> > -   return 0;
> > -   }
> > -   if (kexec_crash_image) {
> > -   rc = kexec_crash_image->hotplug_support;
> > -   }
> > -   /* Release lock now that update complete */
> > -   kexec_unlock();
> > -   crash_hotplug_unlock();
> > -
> > -   return rc;
> > +   return crash_hotplug_support;


I may not understand this well. Both kexec_load and kexec_file_load set
hotplug_support, while crash_check_hotplug_support and
crash_handle_hotplug_event check the flag. How do you guarantee that the
cpu/memory sysfs checking won't race with kexec_load and
kexec_file_load?

And here I see that taking crash_hotplug_lock() is unnecessary in
crash_check_hotplug_support() because it doesn't race with
crash_handle_hotplug_event().

> >   }
> >   /*
> > @@ -552,7 +543,7 @@ static void crash_handle_hotplug_event(unsigned int 
> > hp_action, unsigned int cpu,
> > image = kexec_crash_image;
> > /* Check that kexec segments update is permitted */
> > -   if (!image->hotplug_support)
> > +   if (!crash_hotplug_support)
> > goto out;
> > if (hp_action == KEXEC_CRASH_HP_ADD_CPU ||
> > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > index a6b3f96bb50c..d5c6b51eaa8b 100644
> > --- a/kernel/kexec.c
> > +++ b/kernel/kexec.c
> > @@ -116,6 +116,9 @@ static int do_kexec_load(unsigned long entry, unsigned 
> > long nr_segments,
> > /* Uninstall image */
> > kimage_free(xchg(dest_image, NULL));
> > ret = 0;
> > +#ifdef CONFIG_CRASH_HOTPLUG
> > +   crash_hotplug_support = 0;
> > +#endif
> > goto out_unlock;
> > }
> > if (flags & KEXEC_ON_CRASH) {
> > @@ -136,7 +139,7 @@ static int do_kexec_load(unsigned long entry, unsigned 
> > long nr_segments,
> >   #ifdef CONFIG_CRASH_HOTPLUG
> > if ((flags & KEXEC_ON_CRASH) && arch_crash_hotplug_support(image, 
> > flags))
> > -   image->hotplug_support = 1;
> > +   crash_hotplug_support = 1;
> >   #endif
> > ret = machine_kexec_prepare(image);
> > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> > index 3d64290d24c9..b326edb90fd7 100644
> > --- a/kernel/kexec_file.c
> > +++ b/kernel/kexec_file.c
> > @@ -378,7 +378,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, 
> > initrd_fd,
> >   #ifdef CONFIG_CRASH_HOTPLUG
> > if ((flags & KEXEC_FILE_ON_CRASH) && arch_crash_hotplug_support(image, 
> > flags))
> > -   image->hotplug_support = 1;
> > +   crash_hotplug_support = 1;
> >   #endif
> > ret = machine_kexec_prepare(image);
> > @@ -432,6 +432,11 @@ SYSCALL_DEFINE5(kexec_file_load,

Re: [PATCH] [RFC] crash: Lock-free crash hotplug support reporting

2024-09-06 Thread Sourabh Jain

Hello Baoquan,

Do you think this patch would help reduce lock contention when
CPU/Memory resources are removed in bulk from a system?


Thanks,
Sourabh Jain


On 23/08/24 17:22, Sourabh Jain wrote:

On a CPU/Memory hotplug event, the kexec lock is taken to update the
kdump image. At the same time, this lock is also required to report
the support for crash hotplug to user-space via the
/sys/devices/system/[cpu|memory]/crash_hotplug sysfs interface, to
avoid kdump reload.

The kexec lock is needed to report crash hotplug support because the
crash_hotplug variable, which tracks crash hotplug support, is part of
the kdump image, and the kdump image needs to be updated during a
hotplug event.

Given that only one kdump image can be loaded at any given time, the
crash_hotplug variable can be placed outside the kdump image and set or
reset during kdump image load and unload. This allows crash hotplug
support to be reported without taking the kexec lock.

This would help in situations where CPU/memory resources are
hot-unplugged from the system in bulk.

Commit e2a8f20dd8e9 ("Crash: add lock to serialize crash hotplug
handling") was introduced to serialize crash hotplug handling during bulk
CPU/memory hotplug events. However, with these changes, taking the kexec
lock for crash hotplug support reporting can be avoided altogether.

Cc: Hari Bathini 
Cc: Mahesh Salgaonkar 
Cc: ke...@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ker...@vger.kernel.org
Cc: x...@kernel.org
Signed-off-by: Sourabh Jain 
---
  include/linux/kexec.h | 11 ---
  kernel/crash_core.c   | 27 +--
  kernel/kexec.c|  5 -
  kernel/kexec_file.c   |  7 ++-
  4 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index f0e9f8eda7a3..bd755ba6bac4 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -318,13 +318,6 @@ struct kimage {
unsigned int preserve_context : 1;
/* If set, we are using file mode kexec syscall */
unsigned int file_mode:1;
-#ifdef CONFIG_CRASH_HOTPLUG
-   /* If set, it is safe to update kexec segments that are
-* excluded from SHA calculation.
-*/
-   unsigned int hotplug_support:1;
-#endif
-
  #ifdef ARCH_HAS_KIMAGE_ARCH
struct kimage_arch arch;
  #endif
@@ -370,6 +363,10 @@ struct kimage {
unsigned long elf_load_addr;
  };
  
+#ifdef CONFIG_CRASH_HOTPLUG

+extern unsigned int crash_hotplug_support;
+#endif
+
  /* kexec interface functions */
  extern void machine_kexec(struct kimage *image);
  extern int machine_kexec_prepare(struct kimage *image);
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 63cf89393c6e..3428deba0070 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -30,6 +30,13 @@
  #include "kallsyms_internal.h"
  #include "kexec_internal.h"
  
+#ifdef CONFIG_CRASH_HOTPLUG

+/* if set, it is safe to update kexec segments that are
+ * excluded from sha calculation.
+ */
+unsigned int crash_hotplug_support;
+#endif
+
  /* Per cpu memory for storing cpu states in case of system crash. */
  note_buf_t __percpu *crash_notes;
  
@@ -500,23 +507,7 @@ static DEFINE_MUTEX(__crash_hotplug_lock);

   */
  int crash_check_hotplug_support(void)
  {
-   int rc = 0;
-
-   crash_hotplug_lock();
-   /* Obtain lock while reading crash information */
-   if (!kexec_trylock()) {
-   pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
-   crash_hotplug_unlock();
-   return 0;
-   }
-   if (kexec_crash_image) {
-   rc = kexec_crash_image->hotplug_support;
-   }
-   /* Release lock now that update complete */
-   kexec_unlock();
-   crash_hotplug_unlock();
-
-   return rc;
+   return crash_hotplug_support;
  }
  
  /*

@@ -552,7 +543,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu,
image = kexec_crash_image;
  
  	/* Check that kexec segments update is permitted */

-   if (!image->hotplug_support)
+   if (!crash_hotplug_support)
goto out;
  
  	if (hp_action == KEXEC_CRASH_HP_ADD_CPU ||

diff --git a/kernel/kexec.c b/kernel/kexec.c
index a6b3f96bb50c..d5c6b51eaa8b 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -116,6 +116,9 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
/* Uninstall image */
kimage_free(xchg(dest_image, NULL));
ret = 0;
+#ifdef CONFIG_CRASH_HOTPLUG
+   crash_hotplug_support = 0;
+#endif
goto out_unlock;
}
if (flags & KEXEC_ON_CRASH) {
@@ -136,7 +139,7 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
  
  #ifdef CONFIG_CRASH_HOTPLUG

if ((flags & KEXEC_ON_CRASH) && arch_crash_hotplug_support(image, 
flags))
-   image->hotplug_support = 1;
+   crash_hotplug_support = 1;
  #endif

Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-06 Thread Catalin Marinas
On Fri, Sep 06, 2024 at 09:55:42AM +, Arnd Bergmann wrote:
> On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann  wrote:
> >> It's also unclear to me how we want this flag to interact with
> >> the existing logic in arch_get_mmap_end(), which attempts to
> >> limit the default mapping to a 47-bit address space already.
> >
> > To optimize RISC-V progress, I recommend:
> >
> > Step 1: Approve the patch.
> > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > Step 3: Wait approximately several iterations for Go & OpenJDK
> > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
> 
> I really want to first see a plausible explanation about why
> RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
> like all the other major architectures (x86, arm64, powerpc64),

FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default
configuration. We end up with a 47-bit with 16K pages but for a
different reason that has to do with LPA2 support (I doubt we need this
for the user mapping but we need to untangle some of the macros there;
that's for a separate discussion).

That said, we haven't encountered any user space problems with a 48-bit
DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar
approach (47 or 48 bit default limit). Better to have some ABI
consistency between architectures. One can still ask for addresses above
this default limit via mmap().

-- 
Catalin



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-06 Thread Arnd Bergmann
On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann  wrote:
>>
>> It's also unclear to me how we want this flag to interact with
>> the existing logic in arch_get_mmap_end(), which attempts to
>> limit the default mapping to a 47-bit address space already.
>
> To optimize RISC-V progress, I recommend:
>
> Step 1: Approve the patch.
> Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> Step 3: Wait approximately several iterations for Go & OpenJDK
> Step 4: Remove the 47-bit constraint in arch_get_mmap_end()

I really want to first see a plausible explanation about why
RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
like all the other major architectures (x86, arm64, powerpc64),
e.g. something like the patch below (untested, probably slightly
wrong, but it should illustrate my point).

 Arnd

diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index 8702b8721a27..de9863be1efd 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -20,17 +20,8 @@
  * mmap_end < addr, being mmap_end the top of that address space.
  * See Documentation/arch/riscv/vm-layout.rst for more details.
  */
-#define arch_get_mmap_end(addr, len, flags)\
-({ \
-   unsigned long mmap_end; \
-   typeof(addr) _addr = (addr);\
-   if ((_addr) == 0 || is_compat_task() || \
-   ((_addr + len) > BIT(VA_BITS - 1))) \
-   mmap_end = STACK_TOP_MAX;   \
-   else\
-   mmap_end = (_addr + len);   \
-   mmap_end;   \
-})
+#define arch_get_mmap_end(addr, len, flags) \
+   (((addr) > DEFAULT_MAP_WINDOW) ? TASK_SIZE : DEFAULT_MAP_WINDOW)
 
 #define arch_get_mmap_base(addr, base) \
 ({ \
@@ -47,7 +38,7 @@
 })
 
 #ifdef CONFIG_64BIT
-#define DEFAULT_MAP_WINDOW (UL(1) << (MMAP_VA_BITS - 1))
+#define DEFAULT_MAP_WINDOW (is_compat_task() ? (UL(1) << (MMAP_VA_BITS - 1)) : TASK_SIZE_32)
 #define STACK_TOP_MAX  TASK_SIZE_64
 #else
 #define DEFAULT_MAP_WINDOW TASK_SIZE



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-06 Thread Lorenzo Stoakes
(Sorry, having issues with my IPv6 setup that duplicated the original email...)

On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote:
> On Fri, Sep 6, 2024, at 08:14, Lorenzo Stoakes wrote:
> > On Fri, Sep 06, 2024 at 07:17:44AM GMT, Arnd Bergmann wrote:
> >> On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote:
> >> > Create a personality flag ADDR_LIMIT_47BIT to support applications
> >> > that wish to transition from running in environments that support at
> >> > most 47-bit VAs to environments that support larger VAs. This
> >> > personality can be set to cause all allocations to be below the 47-bit
> >> > boundary. Using MAP_FIXED with mmap() will bypass this restriction.
> >> >
> >> > Signed-off-by: Charlie Jenkins 
> >>
> >> I think having an architecture-independent mechanism to limit the size
> >> of the 64-bit address space is useful in general, and we've discussed
> >> the same thing for arm64 in the past, though we have not actually
> >> reached an agreement on the ABI previously.
> >
> > The thread on the original proposals attests to this being rather a fraught
> > topic, and I think the weight of opinion was more so in favour of opt-in
> > rather than opt-out.
>
> You mean opt-in to using the larger addresses like we do on arm64 and
> powerpc, while "opt-out" means a limit as Charlie suggested?

I guess I'm not using brilliant terminology here haha!

To clarify - the weight of opinion was for a situation where the address
space is limited, except if you set a hint above that (you could call that
opt-out or opt-in depending which way you look at it, so yeah ok very
unclear sorry!).

It was against the MAP_ flag, and I think a _flexible_ per-process limit
is also questionable, as you might end up setting a limit which breaks
something else; this starts getting messy quickly.

To be clear, the ADDR_LIMIT_47BIT suggestion is absolutely a compromise and
practical suggestion.

>
> >> > @@ -22,6 +22,7 @@ enum {
> >> >  WHOLE_SECONDS = 0x200,
> >> >  STICKY_TIMEOUTS =   0x400,
> >> >  ADDR_LIMIT_3GB =0x800,
> >> > +ADDR_LIMIT_47BIT =  0x1000,
> >> > };
> >>
> >> I'm a bit worried about having this done specifically in the
> >> personality flag bits, as they are rather limited. We obviously
> >> don't want to add many more such flags when there could be
> >> a way to just set the default limit.
> >
> > Since I'm the one who suggested it, I feel I should offer some kind of
> > vague defence here :)
> >
> > We shouldn't let perfect be the enemy of the good. This is a relatively
> > straightforward means of achieving the aim (assuming your concern about
> > arch_get_mmap_end() below isn't a blocker) which has the least impact on
> > existing code.
> >
> > Of course we can end up in absurdities where we start doing
> > ADDR_LIMIT_xxBIT... but again - it's simple, shouldn't represent an
> > egregious maintenance burden and is entirely opt-in so has things going for
> > it.
>
> I'm more confused now, I think most importantly we should try to
> handle this consistently across all architectures. The proposed
> implementation seems to completely block addresses above BIT(47)
> even for applications that opt in by calling mmap(BIT(47), ...),
> which seems to break the existing applications.

Hm, I thought the commit message suggested the hint overrides it still?

The intent is to optionally be able to run a process that keeps higher bits
free for tagging and to be sure no memory mapping in the process will
clobber these (correct me if I'm wrong Charlie! :)

So you really wouldn't want this if you are using tagged pointers, you'd
want to be sure literally nothing touches the higher bits.

>
> If we want this flag for RISC-V and also keep the behavior of
> defaulting to >BIT(47) addresses for mmap(0, ...) how about
> changing arch_get_mmap_end() to return the limit based on
> ADDR_LIMIT_47BIT and then make this default to enabled on
> arm64 and powerpc but disabled on riscv?

But you wouldn't necessarily want all processes to be so restricted, I
think this is what Charlie's trying to avoid :)

On the other hand - I'm not sure there are many processes on any arch
that'd want the higher mappings.

So that'd push us again towards RISC-V just limiting to 48 bits and only
mapping above this if a hint is provided like x86-64 does (and as you
mentioned via irc - it seems RISC-V is an outlier in that
DEFAULT_MAP_WINDOW == TASK_SIZE).

This would be more consistent vs. other arches.

>
> >> It's also unclear to me how we want this flag to interact with
> >> the existing logic in arch_get_mmap_end(), which attempts to
> >> limit the default mapping to a 47-bit address space already.
> >
> > How does ADDR_LIMIT_3GB presently interact with that?
>
> That is x86 specific and only relevant to compat tasks, limiting
> them to 3 instead of 4 GB. There is also ADDR_LIMIT_32BIT, which
> on arm32 is always set in practice to allow 32-bit addressing

Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-06 Thread Guo Ren
On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann  wrote:
>
> On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote:
> > Create a personality flag ADDR_LIMIT_47BIT to support applications
> > that wish to transition from running in environments that support at
> > most 47-bit VAs to environments that support larger VAs. This
> > personality can be set to cause all allocations to be below the 47-bit
> > boundary. Using MAP_FIXED with mmap() will bypass this restriction.
> >
> > Signed-off-by: Charlie Jenkins 
>
> I think having an architecture-independent mechanism to limit the size
> of the 64-bit address space is useful in general, and we've discussed
> the same thing for arm64 in the past, though we have not actually
> reached an agreement on the ABI previously.
>
> > @@ -22,6 +22,7 @@ enum {
> >   WHOLE_SECONDS = 0x200,
> >   STICKY_TIMEOUTS =   0x400,
> >   ADDR_LIMIT_3GB =0x800,
> > + ADDR_LIMIT_47BIT =  0x1000,
> > };
>
> I'm a bit worried about having this done specifically in the
> personality flag bits, as they are rather limited. We obviously
> don't want to add many more such flags when there could be
> a way to just set the default limit.
>
> It's also unclear to me how we want this flag to interact with
> the existing logic in arch_get_mmap_end(), which attempts to
> limit the default mapping to a 47-bit address space already.

To optimize RISC-V progress, I recommend:

Step 1: Approve the patch.
Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
Step 3: Wait approximately several iterations for Go & OpenJDK
Step 4: Remove the 47-bit constraint in arch_get_mmap_end()


>
> For some reason, it appears that the arch_get_mmap_end()
> logic on RISC-V defaults to the maximum address
> space for the 'addr==0' case, which is inconsistent with
> the other architectures, so we should probably fix that
> part first, possibly moving more of that logic into a
> shared implementation.
>
>   Arnd
>


-- 
Best Regards
 Guo Ren



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-06 Thread Arnd Bergmann
On Fri, Sep 6, 2024, at 08:14, Lorenzo Stoakes wrote:
> On Fri, Sep 06, 2024 at 07:17:44AM GMT, Arnd Bergmann wrote:
>> On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote:
>> > Create a personality flag ADDR_LIMIT_47BIT to support applications
>> > that wish to transition from running in environments that support at
>> > most 47-bit VAs to environments that support larger VAs. This
>> > personality can be set to cause all allocations to be below the 47-bit
>> > boundary. Using MAP_FIXED with mmap() will bypass this restriction.
>> >
>> > Signed-off-by: Charlie Jenkins 
>>
>> I think having an architecture-independent mechanism to limit the size
>> of the 64-bit address space is useful in general, and we've discussed
>> the same thing for arm64 in the past, though we have not actually
>> reached an agreement on the ABI previously.
>
> The thread on the original proposals attests to this being rather a fraught
> topic, and I think the weight of opinion was more so in favour of opt-in
> rather than opt-out.

You mean opt-in to using the larger addresses like we do on arm64 and
powerpc, while "opt-out" means a limit as Charlie suggested?

>> > @@ -22,6 +22,7 @@ enum {
>> >WHOLE_SECONDS = 0x200,
>> >STICKY_TIMEOUTS =   0x400,
>> >ADDR_LIMIT_3GB =0x800,
>> > +  ADDR_LIMIT_47BIT =  0x1000,
>> > };
>>
>> I'm a bit worried about having this done specifically in the
>> personality flag bits, as they are rather limited. We obviously
>> don't want to add many more such flags when there could be
>> a way to just set the default limit.
>
> Since I'm the one who suggested it, I feel I should offer some kind of
> vague defence here :)
>
> We shouldn't let perfect be the enemy of the good. This is a relatively
> straightforward means of achieving the aim (assuming your concern about
> arch_get_mmap_end() below isn't a blocker) which has the least impact on
> existing code.
>
> Of course we can end up in absurdities where we start doing
> ADDR_LIMIT_xxBIT... but again - it's simple, shouldn't represent an
> egregious maintenance burden and is entirely opt-in so has things going for
> it.

I'm more confused now, I think most importantly we should try to
handle this consistently across all architectures. The proposed
implementation seems to completely block addresses above BIT(47)
even for applications that opt in by calling mmap(BIT(47), ...),
which seems to break the existing applications.

If we want this flag for RISC-V and also keep the behavior of
defaulting to >BIT(47) addresses for mmap(0, ...) how about
changing arch_get_mmap_end() to return the limit based on
ADDR_LIMIT_47BIT and then make this default to enabled on
arm64 and powerpc but disabled on riscv?

>> It's also unclear to me how we want this flag to interact with
>> the existing logic in arch_get_mmap_end(), which attempts to
>> limit the default mapping to a 47-bit address space already.
>
> How does ADDR_LIMIT_3GB presently interact with that?

That is x86 specific and only relevant to compat tasks, limiting
them to 3 instead of 4 GB. There is also ADDR_LIMIT_32BIT, which
on arm32 is always set in practice to allow 32-bit addressing 
as opposed to ARMv2 style 26-bit addressing (IIRC ARMv3 supported
both 26-bit and 32-bit addressing, while ARMv4 through ARMv7 are
32-bit only).

  Arnd



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-06 Thread Lorenzo Stoakes
On Fri, Sep 06, 2024 at 07:17:44AM GMT, Arnd Bergmann wrote:
> On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote:
> > Create a personality flag ADDR_LIMIT_47BIT to support applications
> > that wish to transition from running in environments that support at
> > most 47-bit VAs to environments that support larger VAs. This
> > personality can be set to cause all allocations to be below the 47-bit
> > boundary. Using MAP_FIXED with mmap() will bypass this restriction.
> >
> > Signed-off-by: Charlie Jenkins 
>
> I think having an architecture-independent mechanism to limit the size
> of the 64-bit address space is useful in general, and we've discussed
> the same thing for arm64 in the past, though we have not actually
> reached an agreement on the ABI previously.

The thread on the original proposals attests to this being rather a fraught
topic, and I think the weight of opinion was more so in favour of opt-in
rather than opt-out.

>
> > @@ -22,6 +22,7 @@ enum {
> > WHOLE_SECONDS = 0x200,
> > STICKY_TIMEOUTS =   0x400,
> > ADDR_LIMIT_3GB =0x800,
> > +   ADDR_LIMIT_47BIT =  0x1000,
> > };
>
> I'm a bit worried about having this done specifically in the
> personality flag bits, as they are rather limited. We obviously
> don't want to add many more such flags when there could be
> a way to just set the default limit.

Since I'm the one who suggested it, I feel I should offer some kind of
vague defence here :)

We shouldn't let perfect be the enemy of the good. This is a relatively
straightforward means of achieving the aim (assuming your concern about
arch_get_mmap_end() below isn't a blocker) which has the least impact on
existing code.

Of course we can end up in absurdities where we start doing
ADDR_LIMIT_xxBIT... but again - it's simple, shouldn't represent an
egregious maintenance burden and is entirely opt-in so has things going for
it.

>
> It's also unclear to me how we want this flag to interact with
> the existing logic in arch_get_mmap_end(), which attempts to
> limit the default mapping to a 47-bit address space already.

How does ADDR_LIMIT_3GB presently interact with that?

>
> For some reason, it appears that the arch_get_mmap_end()
> logic on RISC-V defaults to the maximum address
> space for the 'addr==0' case, which is inconsistent with
> the other architectures, so we should probably fix that
> part first, possibly moving more of that logic into a
> shared implementation.
>
>   Arnd



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-06 Thread Arnd Bergmann
On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote:
> Create a personality flag ADDR_LIMIT_47BIT to support applications
> that wish to transition from running in environments that support at
> most 47-bit VAs to environments that support larger VAs. This
> personality can be set to cause all allocations to be below the 47-bit
> boundary. Using MAP_FIXED with mmap() will bypass this restriction.
>
> Signed-off-by: Charlie Jenkins 

I think having an architecture-independent mechanism to limit the size
of the 64-bit address space is useful in general, and we've discussed
the same thing for arm64 in the past, though we have not actually
reached an agreement on the ABI previously.

> @@ -22,6 +22,7 @@ enum {
>   WHOLE_SECONDS = 0x200,
>   STICKY_TIMEOUTS =   0x400,
>   ADDR_LIMIT_3GB =0x800,
> + ADDR_LIMIT_47BIT =  0x1000,
> };

I'm a bit worried about having this done specifically in the
personality flag bits, as they are rather limited. We obviously
don't want to add many more such flags when there could be
a way to just set the default limit.

It's also unclear to me how we want this flag to interact with
the existing logic in arch_get_mmap_end(), which attempts to
limit the default mapping to a 47-bit address space already.

For some reason, it appears that the arch_get_mmap_end()
logic on RISC-V defaults to the maximum address
space for the 'addr==0' case, which is inconsistent with
the other architectures, so we should probably fix that
part first, possibly moving more of that logic into a
shared implementation.

  Arnd



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-05 Thread Michael Ellerman
Charlie Jenkins  writes:
> Create a personality flag ADDR_LIMIT_47BIT to support applications
> that wish to transition from running in environments that support at
> most 47-bit VAs to environments that support larger VAs. This
> personality can be set to cause all allocations to be below the 47-bit
> boundary. Using MAP_FIXED with mmap() will bypass this restriction.
>
> Signed-off-by: Charlie Jenkins 
> ---
>  include/uapi/linux/personality.h | 1 +
>  mm/mmap.c| 3 +++
>  2 files changed, 4 insertions(+)
>
> diff --git a/include/uapi/linux/personality.h 
> b/include/uapi/linux/personality.h
> index 49796b7756af..cd3b8c154d9b 100644
> --- a/include/uapi/linux/personality.h
> +++ b/include/uapi/linux/personality.h
> @@ -22,6 +22,7 @@ enum {
>   WHOLE_SECONDS = 0x200,
>   STICKY_TIMEOUTS =   0x400,
>   ADDR_LIMIT_3GB =0x800,
> + ADDR_LIMIT_47BIT =  0x1000,
>  };

I wonder if ADDR_LIMIT_128T would be clearer?

Have you looked at writing an update for the personality(2) man page? :)

cheers



Re: [PATCH RFC v3 0/2] mm: Introduce ADDR_LIMIT_47BIT personality flag

2024-09-05 Thread John Paul Adrian Glaubitz
Hi Charlie,

On Thu, 2024-09-05 at 14:15 -0700, Charlie Jenkins wrote:
> Some applications rely on placing data in the free bits of addresses allocated
> by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> address returned by mmap to be less than the 48-bit address space,
> unless the hint address uses more than 47 bits (the 48th bit is reserved
> for the kernel address space).
> 
> The riscv architecture needs a way to similarly restrict the virtual
> address space. On the riscv port of OpenJDK, an error is thrown if it is
> run on the 57-bit address space, called sv57 [1].  golang
> has a comment that sv57 support is not complete, but there are some
> workarounds to get it to mostly work [2].
> 
> These applications work on x86 because x86 does an implicit 47-bit
> restriction of mmap() address that contain a hint address that is less
> than 48 bits.
> 
> Instead of implicitly restricting the address space on riscv (or any
> current/future architecture), provide a flag to the personality syscall
> that can be used to ensure an application works in any arbitrary VA
> space. A similar feature has already been implemented by the personality
> syscall in ADDR_LIMIT_32BIT.
> 
> This flag will also allow seamless compatibility between all
> architectures, so applications like Go and OpenJDK that use bits in a
> virtual address can request the exact number of bits they need in a
> generic way. The flag can be checked inside of vm_unmapped_area() so
> that this flag does not have to be handled individually by each
> architecture. 
> 
> Link:
> https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79
> [1]
> Link:
> https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47
> [2]
> 
> To: Arnd Bergmann 
> To: Richard Henderson 
> To: Ivan Kokshaysky 
> To: Matt Turner 
> To: Vineet Gupta 
> To: Russell King 
> To: Guo Ren 
> To: Huacai Chen 
> To: WANG Xuerui 
> To: Thomas Bogendoerfer 
> To: James E.J. Bottomley 
> To: Helge Deller 
> To: Michael Ellerman 
> To: Nicholas Piggin 
> To: Christophe Leroy 
> To: Naveen N Rao 
> To: Alexander Gordeev 
> To: Gerald Schaefer 
> To: Heiko Carstens 
> To: Vasily Gorbik 
> To: Christian Borntraeger 
> To: Sven Schnelle 
> To: Yoshinori Sato 
> To: Rich Felker 
> To: John Paul Adrian Glaubitz 
> To: David S. Miller 
> To: Andreas Larsson 
> To: Thomas Gleixner 
> To: Ingo Molnar 
> To: Borislav Petkov 
> To: Dave Hansen 
> To: x...@kernel.org
> To: H. Peter Anvin 
> To: Andy Lutomirski 
> To: Peter Zijlstra 
> To: Muchun Song 
> To: Andrew Morton 
> To: Liam R. Howlett 
> To: Vlastimil Babka 
> To: Lorenzo Stoakes 
> To: Shuah Khan 
> To: Christoph Hellwig 
> To: Michal Hocko 
> To: "Kirill A. Shutemov" 
> To: Chris Torek 
> Cc: linux-a...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-al...@vger.kernel.org
> Cc: linux-snps-...@lists.infradead.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-c...@vger.kernel.org
> Cc: loonga...@lists.linux.dev
> Cc: linux-m...@vger.kernel.org
> Cc: linux-par...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: sparcli...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-kselft...@vger.kernel.org
> Cc: linux-abi-de...@lists.sourceforge.net
> Signed-off-by: Charlie Jenkins 
> 
> Changes in v2:
> - Added much greater detail to cover letter
> - Removed all code that touched architecture specific code and was able
>   to factor this out into all generic functions, except for flags that
>   needed to be added to vm_unmapped_area_info
> - Made this an RFC since I have only tested it on riscv and x86
> - Link to v1: 
> https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb90...@rivosinc.com
> 
> Changes in v3:
> - Use a personality flag instead of an mmap flag
> - Link to v2: 
> https://lore.kernel.org/r/20240829-patches-below_hint_mmap-v2-0-638a28d9e...@rivosinc.com
> 
> ---
> Charlie Jenkins (2):
>   mm: Add personality flag to limit address to 47 bits
>   selftests/mm: Create ADDR_LIMIT_47BIT test
> 
>  include/uapi/linux/personality.h   |  1 +
>  mm/mmap.c  |  3 ++
>  tools/testing/selftests/mm/.gitignore  |  1 +
>  tools/testing/selftests/mm/Makefile|  1 +
>  tools/testing/selftests/mm/map_47bit_personality.c | 34 
> ++
>  5 files changed, 40 insertions(+)
> ---
> base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
> change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55

Wow, this issue has been plaguing SPARC users for years already as the
architecture uses a 52-bit virtual address space and Javascript engines
such as the one in Firefox or WebKit have been crashing ever since.

I should definitely give this series a try and see if that fixes
Javascript crashes on SPARC.

Thanks,

Re: [PATCH RFC v3 0/2] mm: Introduce ADDR_LIMIT_47BIT personality flag

2024-09-05 Thread Guo Ren
On Fri, Sep 6, 2024 at 5:16 AM Charlie Jenkins  wrote:
>
> Some applications rely on placing data in the free bits of addresses
> allocated by mmap. Various architectures (e.g. x86, arm64, powerpc)
> restrict the address returned by mmap to be less than the 48-bit address
> space, unless the hint address uses more than 47 bits (the 48th bit is
> reserved for the kernel address space).
>
> The riscv architecture needs a way to similarly restrict the virtual
> address space. The riscv port of OpenJDK throws an error if it is run
> on the 57-bit address space, called sv57 [1]. golang has a comment that
> sv57 support is not complete, but there are some workarounds to get it
> to mostly work [2].
>
> These applications work on x86 because x86 applies an implicit 47-bit
> restriction to mmap() addresses when the hint address is less than 48
> bits.
>
> Instead of implicitly restricting the address space on riscv (or any
> current/future architecture), provide a flag to the personality syscall
> that can be used to ensure an application works in any arbitrary VA
> space. A similar feature has already been implemented by the personality
> syscall in ADDR_LIMIT_32BIT.
>
> This flag will also allow seamless compatibility between all
> architectures, so applications like Go and OpenJDK that use bits in a
> virtual address can request the exact number of bits they need in a
> generic way. The flag can be checked inside of vm_unmapped_area() so
> that this flag does not have to be handled individually by each
> architecture.
Acked-by: Guo Ren 

Sv57's pain finds its cure in this antidote.



-- 
Best Regards
 Guo Ren



[PATCH RFC v3 2/2] selftests/mm: Create ADDR_LIMIT_47BIT test

2024-09-05 Thread Charlie Jenkins
Add a selftest for the ADDR_LIMIT_47BIT personality flag that mmaps
until it runs out of space and ensures no addresses are allocated above
47 bits.

Signed-off-by: Charlie Jenkins 
---
 tools/testing/selftests/mm/.gitignore  |  1 +
 tools/testing/selftests/mm/Makefile|  1 +
 tools/testing/selftests/mm/map_47bit_personality.c | 34 ++
 3 files changed, 36 insertions(+)

diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index da030b43e43b..918ef05e180d 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -32,6 +32,7 @@ mlock-random-test
 virtual_address_range
 gup_test
 va_128TBswitch
+map_47bit_personality
 map_fixed_noreplace
 write_to_hugetlbfs
 hmm-tests
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index cfad627e8d94..2e95fd545409 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -50,6 +50,7 @@ TEST_GEN_FILES += hugepage-shm
 TEST_GEN_FILES += hugepage-vmemmap
 TEST_GEN_FILES += khugepaged
 TEST_GEN_FILES += madv_populate
+TEST_GEN_FILES += map_47bit_personality
 TEST_GEN_FILES += map_fixed_noreplace
 TEST_GEN_FILES += map_hugetlb
 TEST_GEN_FILES += map_populate
diff --git a/tools/testing/selftests/mm/map_47bit_personality.c b/tools/testing/selftests/mm/map_47bit_personality.c
new file mode 100644
index ..453412990c21
--- /dev/null
+++ b/tools/testing/selftests/mm/map_47bit_personality.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test the ADDR_LIMIT_47BIT personality flag.
+ */
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+#include "../kselftest.h"
+
+#define LENGTH (1)
+
+#define ADDR_LIMIT_47BIT   0x1000
+#define BIT47  (1UL << 47)
+
+/*
+ * Map memory with ADDR_LIMIT_47BIT until no memory left. Ensure that all
+ * returned addresses are below 47 bits.
+ */
+int main(int argc, char **argv)
+{
+   void *addr;
+
+   syscall(__NR_personality, ADDR_LIMIT_47BIT);
+
+   do {
+   addr = mmap(0, LENGTH, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+   } while (addr != MAP_FAILED && (unsigned long)addr < BIT47);
+
+   if (errno == ENOMEM)
+   ksft_test_result_pass("ADDR_LIMIT_47BIT works\n");
+   else
+   ksft_test_result_fail("mmap returned address above 47 bits with ADDR_LIMIT_47BIT with addr: %p and err: %s\n",
+ addr, strerror(errno));
+}

-- 
2.45.0




[PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-05 Thread Charlie Jenkins
Create a personality flag ADDR_LIMIT_47BIT to support applications
that wish to transition from running in environments that support at
most 47-bit VAs to environments that support larger VAs. This
personality can be set to cause all allocations to be below the 47-bit
boundary. Using MAP_FIXED with mmap() will bypass this restriction.

Signed-off-by: Charlie Jenkins 
---
 include/uapi/linux/personality.h | 1 +
 mm/mmap.c| 3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
index 49796b7756af..cd3b8c154d9b 100644
--- a/include/uapi/linux/personality.h
+++ b/include/uapi/linux/personality.h
@@ -22,6 +22,7 @@ enum {
WHOLE_SECONDS = 0x200,
STICKY_TIMEOUTS =   0x400,
ADDR_LIMIT_3GB =0x800,
+   ADDR_LIMIT_47BIT =  0x1000,
 };
 
 /*
diff --git a/mm/mmap.c b/mm/mmap.c
index d0dfc85b209b..a5c7544853e5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1766,6 +1766,9 @@ unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info)
 {
unsigned long addr;
 
+   if (current->personality & ADDR_LIMIT_47BIT)
+   info->high_limit = MIN(info->high_limit, BIT(47) - 1);
+
if (info->flags & VM_UNMAPPED_AREA_TOPDOWN)
addr = unmapped_area_topdown(info);
else

-- 
2.45.0




[PATCH RFC v3 0/2] mm: Introduce ADDR_LIMIT_47BIT personality flag

2024-09-05 Thread Charlie Jenkins
Some applications rely on placing data in the free bits of addresses
allocated by mmap. Various architectures (e.g. x86, arm64, powerpc)
restrict the address returned by mmap to be less than the 48-bit address
space, unless the hint address uses more than 47 bits (the 48th bit is
reserved for the kernel address space).

The riscv architecture needs a way to similarly restrict the virtual
address space. The riscv port of OpenJDK throws an error if it is run on
the 57-bit address space, called sv57 [1]. golang has a comment that
sv57 support is not complete, but there are some workarounds to get it
to mostly work [2].

These applications work on x86 because x86 applies an implicit 47-bit
restriction to mmap() addresses when the hint address is less than 48
bits.

Instead of implicitly restricting the address space on riscv (or any
current/future architecture), provide a flag to the personality syscall
that can be used to ensure an application works in any arbitrary VA
space. A similar feature has already been implemented by the personality
syscall in ADDR_LIMIT_32BIT.

This flag will also allow seamless compatibility between all
architectures, so applications like Go and OpenJDK that use bits in a
virtual address can request the exact number of bits they need in a
generic way. The flag can be checked inside of vm_unmapped_area() so
that this flag does not have to be handled individually by each
architecture. 

Link: https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79 [1]
Link: https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47 [2]

To: Arnd Bergmann 
To: Richard Henderson 
To: Ivan Kokshaysky 
To: Matt Turner 
To: Vineet Gupta 
To: Russell King 
To: Guo Ren 
To: Huacai Chen 
To: WANG Xuerui 
To: Thomas Bogendoerfer 
To: James E.J. Bottomley 
To: Helge Deller 
To: Michael Ellerman 
To: Nicholas Piggin 
To: Christophe Leroy 
To: Naveen N Rao 
To: Alexander Gordeev 
To: Gerald Schaefer 
To: Heiko Carstens 
To: Vasily Gorbik 
To: Christian Borntraeger 
To: Sven Schnelle 
To: Yoshinori Sato 
To: Rich Felker 
To: John Paul Adrian Glaubitz 
To: David S. Miller 
To: Andreas Larsson 
To: Thomas Gleixner 
To: Ingo Molnar 
To: Borislav Petkov 
To: Dave Hansen 
To: x...@kernel.org
To: H. Peter Anvin 
To: Andy Lutomirski 
To: Peter Zijlstra 
To: Muchun Song 
To: Andrew Morton 
To: Liam R. Howlett 
To: Vlastimil Babka 
To: Lorenzo Stoakes 
To: Shuah Khan 
To: Christoph Hellwig 
To: Michal Hocko 
To: "Kirill A. Shutemov" 
To: Chris Torek 
Cc: linux-a...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux-al...@vger.kernel.org
Cc: linux-snps-...@lists.infradead.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-c...@vger.kernel.org
Cc: loonga...@lists.linux.dev
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-kselft...@vger.kernel.org
Cc: linux-abi-de...@lists.sourceforge.net
Signed-off-by: Charlie Jenkins 

Changes in v2:
- Added much greater detail to cover letter
- Removed all code that touched architecture specific code and was able
  to factor this out into all generic functions, except for flags that
  needed to be added to vm_unmapped_area_info
- Made this an RFC since I have only tested it on riscv and x86
- Link to v1: 
https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb90...@rivosinc.com

Changes in v3:
- Use a personality flag instead of an mmap flag
- Link to v2: 
https://lore.kernel.org/r/20240829-patches-below_hint_mmap-v2-0-638a28d9e...@rivosinc.com

---
Charlie Jenkins (2):
  mm: Add personality flag to limit address to 47 bits
  selftests/mm: Create ADDR_LIMIT_47BIT test

 include/uapi/linux/personality.h   |  1 +
 mm/mmap.c  |  3 ++
 tools/testing/selftests/mm/.gitignore  |  1 +
 tools/testing/selftests/mm/Makefile|  1 +
 tools/testing/selftests/mm/map_47bit_personality.c | 34 ++
 5 files changed, 40 insertions(+)
---
base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
-- 
- Charlie




Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-09-05 Thread Charlie Jenkins
On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
> On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
> > Some applications rely on placing data in the free bits of addresses
> > allocated by mmap. Various architectures (e.g. x86, arm64, powerpc)
> > restrict the address returned by mmap to be less than the 48-bit
> > address space, unless the hint address uses more than 47 bits (the
> > 48th bit is reserved for the kernel address space).
> > 
> > The riscv architecture needs a way to similarly restrict the virtual
> > address space. The riscv port of OpenJDK throws an error if it is run
> > on the 57-bit address space, called sv57 [1]. golang has a comment
> > that sv57 support is not complete, but there are some workarounds to
> > get it to mostly work [2].
> > 
> > These applications work on x86 because x86 applies an implicit
> > 47-bit restriction to mmap() addresses when the hint address is less
> > than 48 bits.
> > 
> > Instead of implicitly restricting the address space on riscv (or any
> > current/future architecture), a flag would allow users to opt in to
> > this behavior rather than opt out as is done on other architectures.
> > This is desirable because only a small class of applications does
> > pointer masking.
> 
> This argument looks broken to me.
> 
> The "small class of applications" is going to be broken unless they got
> patched to use your new mmap() flag. You are asking for bugs.
> 
> Consider the case where you write, compile and validate a piece of
> software on a machine that has a <=47-bit VA. The binary gets shipped
> to customers. Later, a customer gets a shiny new machine that supports
> a larger address space and your previously working software is broken.
> Such binaries might exist today.
> 
> It is a bad idea to use a >47-bit VA by default. Most software has
> been tested on x86 with a 47-bit VA.
> 
> We can consider more options to opt in to a wider address space, like
> a personality or prctl() handle. But opt-out is a no-go from what I
> see.
> 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov

riscv is in an interesting state in regards to this because the software
ecosystem is much less mature than other architectures. The existing
riscv hardware supports either 38 or 47 bit userspace VAs, but a lot of
people test on QEMU which defaults to 56 bit. As a result, a lot of
code is tested with the larger address space. Applications that don't
work on the larger address space, like OpenJDK, currently throw an error
and exit.

Since riscv does not currently have the address space default to 47
bits, some applications just don't work on 56 bits. We could change the
kernel so that these applications start working without the need for
them to change their code, but that seems like the kernel is
overstepping and fixing binaries rather than providing users tools to
fix the binaries themselves.

This mmap flag was an attempt to provide a tool for these applications
that work on the existing 47 bit VA hardware to also work on different
hardware that supports a 56 bit VA space. After feedback, it looks like
a better solution than the mmap flag is to use the personality syscall
to set a process wide restriction to 47 bits instead, which matches the
32 bit flag that already exists.

- Charlie




Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-09-05 Thread Kirill A. Shutemov
On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
> Some applications rely on placing data in the free bits of addresses
> allocated by mmap. Various architectures (e.g. x86, arm64, powerpc)
> restrict the address returned by mmap to be less than the 48-bit address
> space, unless the hint address uses more than 47 bits (the 48th bit is
> reserved for the kernel address space).
> 
> The riscv architecture needs a way to similarly restrict the virtual
> address space. The riscv port of OpenJDK throws an error if it is run
> on the 57-bit address space, called sv57 [1]. golang has a comment that
> sv57 support is not complete, but there are some workarounds to get it
> to mostly work [2].
> 
> These applications work on x86 because x86 applies an implicit 47-bit
> restriction to mmap() addresses when the hint address is less than 48
> bits.
> 
> Instead of implicitly restricting the address space on riscv (or any
> current/future architecture), a flag would allow users to opt in to
> this behavior rather than opt out as is done on other architectures.
> This is desirable because only a small class of applications does
> pointer masking.

This argument looks broken to me.

The "small class of applications" is going to be broken unless they got
patched to use your new mmap() flag. You are asking for bugs.

Consider the case where you write, compile and validate a piece of
software on a machine that has a <=47-bit VA. The binary gets shipped to
customers. Later, a customer gets a shiny new machine that supports a
larger address space and your previously working software is broken.
Such binaries might exist today.

It is a bad idea to use a >47-bit VA by default. Most software has been
tested on x86 with a 47-bit VA.

We can consider more options to opt in to a wider address space, like a
personality or prctl() handle. But opt-out is a no-go from what I see.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH RFC v2 2/4] mm: Add hint and mmap_flags to struct vm_unmapped_area_info

2024-09-03 Thread Christophe Leroy

Hi Charlie,

On 29/08/2024 at 09:15, Charlie Jenkins wrote:

The hint address and mmap_flags are necessary to determine if
MAP_BELOW_HINT requirements are satisfied.

Signed-off-by: Charlie Jenkins 
---
  arch/alpha/kernel/osf_sys.c  | 2 ++
  arch/arc/mm/mmap.c   | 3 +++
  arch/arm/mm/mmap.c   | 7 +++
  arch/csky/abiv1/mmap.c   | 3 +++
  arch/loongarch/mm/mmap.c | 3 +++
  arch/mips/mm/mmap.c  | 3 +++
  arch/parisc/kernel/sys_parisc.c  | 3 +++
  arch/powerpc/mm/book3s64/slice.c | 7 +++
  arch/s390/mm/hugetlbpage.c   | 4 
  arch/s390/mm/mmap.c  | 6 ++
  arch/sh/mm/mmap.c| 6 ++
  arch/sparc/kernel/sys_sparc_32.c | 3 +++
  arch/sparc/kernel/sys_sparc_64.c | 6 ++
  arch/sparc/mm/hugetlbpage.c  | 4 
  arch/x86/kernel/sys_x86_64.c | 6 ++
  arch/x86/mm/hugetlbpage.c| 4 
  fs/hugetlbfs/inode.c | 4 
  include/linux/mm.h   | 2 ++
  mm/mmap.c| 6 ++
  19 files changed, 82 insertions(+)



  
diff --git a/arch/powerpc/mm/book3s64/slice.c b/arch/powerpc/mm/book3s64/slice.c

index ef3ce37f1bb3..f0e2550af6d0 100644
--- a/arch/powerpc/mm/book3s64/slice.c
+++ b/arch/powerpc/mm/book3s64/slice.c
@@ -286,6 +286,10 @@ static unsigned long slice_find_area_bottomup(struct mm_struct *mm,
.length = len,
.align_mask = PAGE_MASK & ((1ul << pshift) - 1),
};
+
+   info.hint = addr;
+   info.mmap_flags = flags;
+
/*
 * Check till the allow max value for this mmap request
 */
@@ -331,6 +335,9 @@ static unsigned long slice_find_area_topdown(struct mm_struct *mm,
};
unsigned long min_addr = max(PAGE_SIZE, mmap_min_addr);
  
+	info.hint = addr;
+   info.mmap_flags = flags;
+
/*
 * If we are trying to allocate above DEFAULT_MAP_WINDOW
 * Add the different to the mmap_base.


ppc64_defconfig:

  CC  arch/powerpc/mm/book3s64/slice.o
arch/powerpc/mm/book3s64/slice.c: In function 'slice_find_area_bottomup':
arch/powerpc/mm/book3s64/slice.c:291:27: error: 'flags' undeclared (first use in this function)

  291 | info.mmap_flags = flags;
  |   ^
arch/powerpc/mm/book3s64/slice.c:291:27: note: each undeclared identifier is reported only once for each function it appears in

arch/powerpc/mm/book3s64/slice.c: In function 'slice_find_area_topdown':
arch/powerpc/mm/book3s64/slice.c:339:27: error: 'flags' undeclared (first use in this function)

  339 | info.mmap_flags = flags;
  |   ^
make[5]: *** [scripts/Makefile.build:244: arch/powerpc/mm/book3s64/slice.o] Error 1




Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-09-03 Thread Michal Hocko
On Thu 29-08-24 10:33:22, Charlie Jenkins wrote:
> On Thu, Aug 29, 2024 at 10:30:56AM +0200, Michal Hocko wrote:
> > On Thu 29-08-24 00:15:57, Charlie Jenkins wrote:
> > > Some applications rely on placing data in the free bits of
> > > addresses allocated by mmap. Various architectures (e.g. x86,
> > > arm64, powerpc) restrict the address returned by mmap to be less
> > > than the 48-bit address space, unless the hint address uses more
> > > than 47 bits (the 48th bit is reserved for the kernel address space).
> > > 
> > > The riscv architecture needs a way to similarly restrict the
> > > virtual address space. The riscv port of OpenJDK throws an error if
> > > it is run on the 57-bit address space, called sv57 [1]. golang has
> > > a comment that sv57 support is not complete, but there are some
> > > workarounds to get it to mostly work [2].
> > > 
> > > These applications work on x86 because x86 applies an implicit
> > > 47-bit restriction to mmap() addresses when the hint address is
> > > less than 48 bits.
> > > 
> > > Instead of implicitly restricting the address space on riscv (or
> > > any current/future architecture), a flag would allow users to opt
> > > in to this behavior rather than opt out as is done on other
> > > architectures. This is desirable because only a small class of
> > > applications does pointer masking.
> > 
> > IIRC this has been discussed at length when 5-level page table
> > support was proposed for x86. Sorry, I do not have a link handy but
> > lore should help you. Linus was not really convinced and in the end
> > vetoed it, preferring that the few applications that benefit from a
> > greater address space opt in explicitly rather than the other way
> > around.
> 
> I believe I found the conversation you were referring to. Ingo Molnar
> recommended a flag similar to what I have proposed [1]. Catalin
> recommended to make 52-bit opt-in on arm64 [2]. Dave Hansen brought up
> MPX [3].
> 
> However these conversations are tangential to what I am proposing. arm64
> and x86 decided to have the default address space be 48 bits. However
> this was done on a per-architecture basis with no way for applications
> to have guarantees between architectures. Even this behavior to restrict
> to 48 bits does not even appear in the man pages, so would require
> reading the kernel source code to understand that this feature is
> available. Then to opt-in to larger address spaces, applications have to
> know to provide a hint address that is greater than 47 bits, mmap() will
> then return an address that contains up to 56 bits on x86 and 52 bits on
> arm64. This difference of 4 bits causes inconsistency and is part of the
> problem I am trying to solve with this flag.

Yes, I guess I do understand where you are heading. Our existing model
assumes that anybody requiring more address space know what they are
doing and deal with the reality. This is the way Linus has pushed this
and I am not really convinced it is the right way TBH. On the other hand
it is true that this allows a safe(r) transition to larger address
spaces.

> I am not proposing to change x86 and arm64 away from using their opt-out
> feature, I am instead proposing a standard ABI for applications that
> need some guarantees of the bits used in pointers.

Right, but this is not really different from earlier attempts to achieve
this IIRC. Extending mmap for that purpose seems quite tricky, as already
pointed out in other sub-threads. Quite honestly I am not really sure
what the right and backwards-compatible way is. I just wanted to make
you aware this has been discussed at length in the past.
-- 
Michal Hocko
SUSE Labs



Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-08-30 Thread Charlie Jenkins
On Fri, Aug 30, 2024 at 10:52:01AM +0100, Lorenzo Stoakes wrote:
> On Thu, Aug 29, 2024 at 03:16:53PM GMT, Charlie Jenkins wrote:
> > On Thu, Aug 29, 2024 at 10:54:25AM +0100, Lorenzo Stoakes wrote:
> > > On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote:
> > > > On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
> > > > > Some applications rely on placing data in free bits addresses 
> > > > > allocated
> > > > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> > > > > address returned by mmap to be less than the 48-bit address space,
> > > > > unless the hint address uses more than 47 bits (the 48th bit is 
> > > > > reserved
> > > > > for the kernel address space).
> > > >
> > > > I'm still confused as to why, if an mmap flag is desired, and thus 
> > > > programs
> > > > are having to be heavily modified and controlled to be able to do this, 
> > > > why
> > > > you can't just do an mmap() with PROT_NONE early, around a hinted 
> > > > address
> > > > that, sits below the required limit, and then mprotect() or mmap() over 
> > > > it?
> > > >
> > > > Your feature is a major adjustment to mmap(), it needs to be pretty
> > > > significantly justified, especially if taking up a new flag.
> > > >
> > > > >
> > > > > The riscv architecture needs a way to similarly restrict the virtual
> > > > > address space. On the riscv port of OpenJDK an error is thrown if
> > > > > attempted to run on the 57-bit address space, called sv57 [1].  golang
> > > > > has a comment that sv57 support is not complete, but there are some
> > > > > workarounds to get it to mostly work [2].
> > > > >
> > > > > These applications work on x86 because x86 does an implicit 47-bit
> > > > > restriction of mmap() address that contain a hint address that is less
> > > > > than 48 bits.
> > > >
> > > > You mean x86 _has_ to limit to physically available bits in a canonical
> > > > format :) this will not be the case for 5-page table levels though...
> >
> > I might be misunderstanding but I am not talking about pointer masking
> > or canonical addresses here. I am referring to the pattern of:
> >
> > 1. Getting an address from mmap()
> > 2. Writing data into bits assumed to be unused in the address
> > 3. Using the data stored in the address
> > 4. Clearing the data from the address and sign extending
> > 5. Dereferencing the now sign-extended address to conform to canonical
> >addresses
> >
> > I am just talking about step 1 and 2 here -- getting an address from
> > mmap() that only uses bits that will allow your application to not
> > break. How canonicalization happens is a separate conversation that
> > can be handled by LAM for x86, TBI for arm64, or Ssnpm for riscv.
> > While LAM for x86 is only capable of masking addresses to 48 or 57 bits,
> > Ssnpm for riscv allow an arbitrary number of bits to be masked out.
> > A design goal here is to be able to support all of the pointer masking
> > flavors, and not just x86.
> 
> Right I get that, I was just saying that the implicit limitation in x86 is
> due to virtual addresses _having_ to be less than 48 bits. So that's why
> that is right? I mean perhaps I'm mistaken?
> 
> Or is it such that x86 can provide a space for tagging for CPU technology
> that supports it (UAI perhaps?).
> 
> I agree with what Michal and others said about the decision to default to
> the reduced address space size and opt-in for higher bits. Your series
> doesn't do this...
> 
> >
> > > >
> > > > >
> > > > > Instead of implicitly restricting the address space on riscv (or any
> > > > > current/future architecture), a flag would allow users to opt-in to 
> > > > > this
> > > > > behavior rather than opt-out as is done on other architectures. This 
> > > > > is
> > > > > desirable because it is a small class of applications that do pointer
> > > > > masking.
> > > >
> > > > I raised this last time and you didn't seem to address it so to be more
> > > > blunt:
> > > >
> > > > I don't understand why this needs to be an mmap() flag. From this it 
> > > > seems
> > > > the whole process needs allocations to be below a certain limit.
> >
> > Yeah making it per-process does seem logical, as it would help with
> > pointer masking.
> 
> To me it's the only feasible way forward: you can't control all libraries,
> and a map flag continues to seem a strange way to implement this. I
> understand that your justification is that it is the _least invasive_ way
> of doing this, but as I've said below, it's actually pretty invasive if you
> think about it; the current implementation seems to me to be insufficient
> without having VMA flags etc.
> 
> >
> > > >
> > > > That _could_ be achieved through a 'personality' or similar (though a
> > > > personality is on/off, rather than allowing configuration so maybe
> > > > something else would be needed).
> > > >
> > > > From what you're saying 57-bit is all you really need right? So maybe
> > > > ADDR_LIMIT_57BIT?
> >
> > Addresses will always be limited to 57 bits on riscv and x86 (but not
> > necessarily on other architectures). A flag like that would have no
> > impact, I do not understand what you are suggesting. This patch is to
> > have a configurable number of bits be restricted.

Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-08-30 Thread Lorenzo Stoakes
On Fri, Aug 30, 2024 at 08:03:25AM GMT, Dave Hansen wrote:
> On 8/29/24 01:42, Lorenzo Stoakes wrote:
> >> These applications work on x86 because x86 implicitly restricts mmap()
> >> addresses to 47 bits when the hint address is less than 48 bits.
> > You mean x86 _has_ to limit to physically available bits in a canonical
> > format 🙂 this will not be the case with 5-level page tables though...
>
> By "physically available bits" are you referring to the bits that can be
> used as a part of the virtual address?  "Physically" may not have been
> the best choice of words. ;)
>
> There's a canonical hole in 4-level paging and 5-level paging on x86.
> The 5-level canonical hole is just smaller.

Yeah sorry this is what I meant!

>
> Also, I should probably say that the >47-bit mmap() access hint was more
> of a crutch than something that we wanted to make ABI forever.  We knew
> that high addresses might break some apps and we hoped that the list of
> things it would break would go down over time so that we could
> eventually just let mmap() access the whole address space by default.
>
> That optimism may have been misplaced.

Interesting, thanks. This speaks again I think to it being unwise to rely
on these things.

I do think the only workable form of this series is a fixed
personality-based mapping limit.



Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-08-30 Thread Dave Hansen
On 8/29/24 01:42, Lorenzo Stoakes wrote:
>> These applications work on x86 because x86 implicitly restricts mmap()
>> addresses to 47 bits when the hint address is less than 48 bits.
> You mean x86 _has_ to limit to physically available bits in a canonical
> format 🙂 this will not be the case with 5-level page tables though...

By "physically available bits" are you referring to the bits that can be
used as a part of the virtual address?  "Physically" may not have been
the best choice of words. ;)

There's a canonical hole in 4-level paging and 5-level paging on x86.
The 5-level canonical hole is just smaller.

Also, I should probably say that the >47-bit mmap() access hint was more
of a crutch than something that we wanted to make ABI forever.  We knew
that high addresses might break some apps and we hoped that the list of
things it would break would go down over time so that we could
eventually just let mmap() access the whole address space by default.

That optimism may have been misplaced.



Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-08-30 Thread Lorenzo Stoakes
On Thu, Aug 29, 2024 at 03:16:53PM GMT, Charlie Jenkins wrote:
> On Thu, Aug 29, 2024 at 10:54:25AM +0100, Lorenzo Stoakes wrote:
> > On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote:
> > > On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
> > > > Some applications rely on placing data in the free bits of addresses
> > > > allocated by mmap. Various architectures (eg. x86, arm64, powerpc)
> > > > restrict the address returned by mmap to be less than the 48-bit
> > > > address space, unless the hint address uses more than 47 bits (the
> > > > 48th bit is reserved for the kernel address space).
> > >
> > > I'm still confused as to why, if an mmap flag is desired, and thus
> > > programs are having to be heavily modified and controlled to be able
> > > to do this, why you can't just do an mmap() with PROT_NONE early,
> > > around a hinted address that sits below the required limit, and then
> > > mprotect() or mmap() over it?
> > >
> > > Your feature is a major adjustment to mmap(), it needs to be pretty
> > > significantly justified, especially if taking up a new flag.
> > >
> > > >
> > > > The riscv architecture needs a way to similarly restrict the virtual
> > > > address space. The riscv port of OpenJDK throws an error when run on
> > > > the 57-bit address space, called sv57 [1].  golang has a comment that
> > > > sv57 support is not complete, but there are some workarounds to get
> > > > it to mostly work [2].
> > > >
> > > > These applications work on x86 because x86 implicitly restricts
> > > > mmap() addresses to 47 bits when the hint address is less than 48
> > > > bits.
> > >
> > > You mean x86 _has_ to limit to physically available bits in a canonical
> > > format :) this will not be the case with 5-level page tables though...
>
> I might be misunderstanding but I am not talking about pointer masking
> or canonical addresses here. I am referring to the pattern of:
>
> 1. Getting an address from mmap()
> 2. Writing data into bits assumed to be unused in the address
> 3. Using the data stored in the address
> 4. Clearing the data from the address and sign extending
> 5. Dereferencing the now sign-extended address to conform to canonical
>addresses
>
> I am just talking about step 1 and 2 here -- getting an address from
> mmap() that only uses bits that will allow your application to not
> break. How canonicalization happens is a separate conversation that
> can be handled by LAM for x86, TBI for arm64, or Ssnpm for riscv.
> While LAM for x86 is only capable of masking addresses to 48 or 57 bits,
> Ssnpm for riscv allows an arbitrary number of bits to be masked out.
> A design goal here is to be able to support all of the pointer masking
> flavors, and not just x86.

Right I get that, I was just saying that the implicit limitation in x86 is
due to virtual addresses _having_ to be less than 48 bits. So that's why
that is, right? Or perhaps I'm mistaken?

Or is it such that x86 can provide a space for tagging for CPU technology
that supports it (UAI perhaps?).

I agree with what Michal and others said about the decision to default to
the reduced address space size and opt-in for higher bits. Your series
doesn't do this...

>
> > >
> > > >
> > > > Instead of implicitly restricting the address space on riscv (or any
> > > > current/future architecture), a flag would allow users to opt-in to this
> > > > behavior rather than opt-out as is done on other architectures. This is
> > > > desirable because it is a small class of applications that do pointer
> > > > masking.
> > >
> > > I raised this last time and you didn't seem to address it so to be more
> > > blunt:
> > >
> > > I don't understand why this needs to be an mmap() flag. From this it seems
> > > the whole process needs allocations to be below a certain limit.
>
> Yeah making it per-process does seem logical, as it would help with
> pointer masking.

To me it's the only feasible way forward: you can't control all libraries,
and a map flag continues to seem a strange way to implement this. I
understand that your justification is that it is the _least invasive_ way
of doing this, but as I've said below, it's actually pretty invasive if you
think about it; the current implementation seems to me to be insufficient
without having VMA flags etc.

>
> > >
> > > That _could_ be achieved through a 'personality' or similar (though a
> > > personality is on/off, rather than allowing configuration so maybe
> > > something else would be needed).
> > >
> > > From what you're saying 57-bit is all you really need right? So maybe
> > > ADDR_LIMIT_57BIT?
>
> Addresses will always be limited to 57 bits on riscv and x86 (but not
> necessarily on other architectures). A flag like that would have no
> impact, I do not understand what you are suggesting. This patch is to
> have a configurable number of bits be restricted.

I get that, but as I say below, I don't think a customisable limit is
wor

Re: [PATCH RFC v2 1/4] mm: Add MAP_BELOW_HINT

2024-08-29 Thread Charlie Jenkins
On Thu, Aug 29, 2024 at 06:26:41PM +1000, Michael Ellerman wrote:
> Charlie Jenkins  writes:
> > Some applications rely on placing data in the free bits of addresses
> > allocated by mmap. Various architectures (eg. x86, arm64, powerpc)
> > restrict the address returned by mmap to be less than the 48-bit
> > address space, unless the hint address uses more than 47 bits (the
> > 48th bit is reserved for the kernel address space).
> >
> > To make this behavior explicit and more versatile across all
> > architectures, define a mmap flag that allows users to define an
> > arbitrary upper limit on addresses returned by mmap.
> >
> > Signed-off-by: Charlie Jenkins 
> > ---
> >  include/uapi/asm-generic/mman-common.h   | 1 +
> >  tools/include/uapi/asm-generic/mman-common.h | 1 +
>   
> You're not meant to update the headers in tools/ directly. There's a
> mail somewhere from acme somewhere describing the proper process, but
> the tldr is leave it up to him.

Oh okay, thank you.

> 
> > diff --git a/include/uapi/asm-generic/mman-common.h 
> > b/include/uapi/asm-generic/mman-common.h
> > index 6ce1f1ceb432..03ac13d9aa37 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -32,6 +32,7 @@
> >  
> >  #define MAP_UNINITIALIZED 0x400/* For anonymous mmap, memory 
> > could be
> >  * uninitialized */
> > +#define MAP_BELOW_HINT   0x800 /* give out address that is 
> > below (inclusive) hint address */
> 
> IMHO the API would be clearer if this actually forced the address to be
> below the hint. That's what the flag name implies after all.
> 
> It would also mean the application doesn't need to take into account the
> length of the mapping when passing the hint.
> 
> cheers

That's a good point. The reason I did it this way was to allow mmap the
possibility of returning the same address as the hint. If it must be
strictly less than the hint then the hint address can never be returned.
Maybe that doesn't matter though.

- Charlie




Re: [PATCH RFC v2 2/4] mm: Add hint and mmap_flags to struct vm_unmapped_area_info

2024-08-29 Thread Charlie Jenkins
On Thu, Aug 29, 2024 at 09:48:44AM +0100, Lorenzo Stoakes wrote:
> On Thu, Aug 29, 2024 at 12:15:59AM GMT, Charlie Jenkins wrote:
> 
> [snip]
> 
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index d0dfc85b209b..34ba0db23678 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1796,6 +1796,9 @@ generic_get_unmapped_area(struct file *filp, unsigned 
> > long addr,
> > struct vm_unmapped_area_info info = {};
> > const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
> >
> > +   info.hint = addr;
> > +   info.mmap_flags = flags;
> > +
> > if (len > mmap_end - mmap_min_addr)
> > return -ENOMEM;
> >
> > @@ -1841,6 +1844,9 @@ generic_get_unmapped_area_topdown(struct file *filp, 
> > unsigned long addr,
> > struct vm_unmapped_area_info info = {};
> > const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
> >
> > +   info.hint = addr;
> > +   info.mmap_flags = flags;
> > +
> > /* requested length too big for entire address space */
> > if (len > mmap_end - mmap_min_addr)
> > return -ENOMEM;
> >
> 
> These line numbers suggest you're working against Linus's tree, mm/mmap.c
> has changed a lot recently, so to avoid conflicts please base your changes
> on mm-unstable in Andrew's tree (if looking to go through that) or at least
> -next.

I will make sure that I base off of mm-unstable for future revisions.

- Charlie

> 
> > --
> > 2.45.0
> >



Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-08-29 Thread Palmer Dabbelt
On Thu, 29 Aug 2024 02:02:34 PDT (-0700), vba...@suse.cz wrote:
> Such a large recipient list and no linux-api. CC'd, please include it on
> future postings.
>
> On 8/29/24 09:15, Charlie Jenkins wrote:
>> Some applications rely on placing data in the free bits of addresses
>> allocated by mmap. Various architectures (eg. x86, arm64, powerpc)
>> restrict the address returned by mmap to be less than the 48-bit
>> address space, unless the hint address uses more than 47 bits (the
>> 48th bit is reserved for the kernel address space).
>>
>> The riscv architecture needs a way to similarly restrict the virtual
>> address space. The riscv port of OpenJDK throws an error when run on
>> the 57-bit address space, called sv57 [1].  golang has a comment that
>> sv57 support is not complete, but there are some workarounds to get
>> it to mostly work [2].
>>
>> These applications work on x86 because x86 implicitly restricts mmap()
>> addresses to 47 bits when the hint address is less than 48 bits.
>>
>> Instead of implicitly restricting the address space on riscv (or any
>> current/future architecture), a flag would allow users to opt-in to this
>> behavior rather than opt-out as is done on other architectures. This is
>> desirable because it is a small class of applications that do pointer
>> masking.
>
> I doubt it's desirable to have different behavior depending on architecture.
> Also you could say it's a small class of applications that need more than 47
> bits.

We're sort of stuck with the architecture-dependent behavior here: for 
the first few years RISC-V only had 39-bit VAs, so the de facto uABI ended 
up being that userspace can ignore way more bits.  While 48 bits might 
be enough for everyone, 39 doesn't seem to be -- or at least IIRC when 
we tried restricting the default to that, we broke stuff.  There's also 
some other wrinkles like arbitrary bit boundaries in pointer masking and 
vendor-specific paging formats, but at some point we just end up down a 
rabbit hole of insanity there...

FWIW, I think that userspace depending on just tossing some VA bits 
because some kernels happened to never allocate from them is just 
broken, but it seems like other ports worked around the 48->57 bit 
transition and we're trying to do something similar for 39->48 (and that 
works with 48->57, as we'll have to deal with that eventually).

So that's basically how we ended up with this sort of thing: trying to 
do something similar without a flag broke userspace because we were 
trying to jam too much into the hints.  I couldn't really figure out a 
way to satisfy all the userspace constraints by just implicitly 
retrofitting behavior based on the hints, so we figured having an 
explicit flag to control the behavior would be the sanest way to go.

That said: I'm not opposed to just saying "depending on 39-bit VAs is 
broken" and just forcing people to fix it.

>> This flag will also allow seamless compatibility between all
>> architectures, so applications like Go and OpenJDK that use bits in a
>> virtual address can request the exact number of bits they need in a
>> generic way. The flag can be checked inside of vm_unmapped_area() so
>> that this flag does not have to be handled individually by each
>> architecture.
>>
>> Link: https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79 [1]
>> Link: https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47 [2]
>>
>>
>> To: Arnd Bergmann 
>> To: Richard Henderson 
>> To: Ivan Kokshaysky 
>> To: Matt Turner 
>> To: Vineet Gupta 
>> To: Russell King 
>> To: Guo Ren 
>> To: Huacai Chen 
>> To: WANG Xuerui 
>> To: Thomas Bogendoerfer 
>> To: James E.J. Bottomley 
>> To: Helge Deller 
>> To: Michael Ellerman 
>> To: Nicholas Piggin 
>> To: Christophe Leroy 
>> To: Naveen N Rao 
>> To: Alexander Gordeev 
>> To: Gerald Schaefer 
>> To: Heiko Carstens 
>> To: Vasily Gorbik 
>> To: Christian Borntraeger 
>> To: Sven Schnelle 
>> To: Yoshinori Sato 
>> To: Rich Felker 
>> To: John Paul Adrian Glaubitz 
>> To: David S. Miller 
>> To: Andreas Larsson 
>> To: Thomas Gleixner 
>> To: Ingo Molnar 
>> To: Borislav Petkov 
>> To: Dave Hansen 
>> To: x...@kernel.org
>> To: H. Peter Anvin 
>> To: Andy Lutomirski 
>> To: Peter Zijlstra 
>> To: Muchun Song 
>> To: Andrew Morton 
>> To: Liam R. Howlett 
>> To: Vlastimil Babka 
>> To: Lorenzo Stoakes 
>> To: Shuah Khan 
>> Cc: linux-a...@vger.kernel.org
>> Cc: linux-ker...@vger.kernel.org
>> Cc: linux-al...@vger.kernel.org
>> Cc: linux-snps-...@lists.infradead.org
>> Cc: linux-arm-ker...@lists.infradead.org
>> Cc: linux-c...@vger.kernel.org
>> Cc: loonga...@lists.linux.dev
>> Cc: linux-m...@vger.kernel.org
>> Cc: linux-par...@vger.kernel.org
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Cc: linux-s...@vger.kernel.org
>> Cc: linux...@vger.kernel.org
>> Cc: sparcli...@vger.kernel.org
>> Cc: linux...@kvack

Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-08-29 Thread Charlie Jenkins
On Thu, Aug 29, 2024 at 10:54:25AM +0100, Lorenzo Stoakes wrote:
> On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote:
> > On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
> > > Some applications rely on placing data in the free bits of addresses
> > > allocated by mmap. Various architectures (eg. x86, arm64, powerpc)
> > > restrict the address returned by mmap to be less than the 48-bit
> > > address space, unless the hint address uses more than 47 bits (the
> > > 48th bit is reserved for the kernel address space).
> >
> > I'm still confused as to why, if an mmap flag is desired, and thus programs
> > are having to be heavily modified and controlled to be able to do this, why
> > you can't just do an mmap() with PROT_NONE early, around a hinted address
> > that sits below the required limit, and then mprotect() or mmap() over it?
> >
> > Your feature is a major adjustment to mmap(), it needs to be pretty
> > significantly justified, especially if taking up a new flag.
> >
> > >
> > > The riscv architecture needs a way to similarly restrict the virtual
> > > address space. The riscv port of OpenJDK throws an error when run on
> > > the 57-bit address space, called sv57 [1].  golang has a comment that
> > > sv57 support is not complete, but there are some workarounds to get
> > > it to mostly work [2].
> > >
> > > These applications work on x86 because x86 implicitly restricts
> > > mmap() addresses to 47 bits when the hint address is less than 48
> > > bits.
> >
> > You mean x86 _has_ to limit to physically available bits in a canonical
> > format :) this will not be the case with 5-level page tables though...

I might be misunderstanding but I am not talking about pointer masking
or canonical addresses here. I am referring to the pattern of:

1. Getting an address from mmap()
2. Writing data into bits assumed to be unused in the address
3. Using the data stored in the address
4. Clearing the data from the address and sign extending
5. Dereferencing the now sign-extended address to conform to canonical
   addresses

I am just talking about step 1 and 2 here -- getting an address from
mmap() that only uses bits that will allow your application to not
break. How canonicalization happens is a separate conversation that
can be handled by LAM for x86, TBI for arm64, or Ssnpm for riscv.
While LAM for x86 is only capable of masking addresses to 48 or 57 bits,
Ssnpm for riscv allows an arbitrary number of bits to be masked out.
A design goal here is to be able to support all of the pointer masking
flavors, and not just x86.

> >
> > >
> > > Instead of implicitly restricting the address space on riscv (or any
> > > current/future architecture), a flag would allow users to opt-in to this
> > > behavior rather than opt-out as is done on other architectures. This is
> > > desirable because it is a small class of applications that do pointer
> > > masking.
> >
> > I raised this last time and you didn't seem to address it so to be more
> > blunt:
> >
> > I don't understand why this needs to be an mmap() flag. From this it seems
> > the whole process needs allocations to be below a certain limit.

Yeah making it per-process does seem logical, as it would help with
pointer masking.

> >
> > That _could_ be achieved through a 'personality' or similar (though a
> > personality is on/off, rather than allowing configuration so maybe
> > something else would be needed).
> >
> > From what you're saying 57-bit is all you really need right? So maybe
> > ADDR_LIMIT_57BIT?

Addresses will always be limited to 57 bits on riscv and x86 (but not
necessarily on other architectures). A flag like that would have no
impact, I do not understand what you are suggesting. This patch is to
have a configurable number of bits be restricted.

If anything, a personality that was ADDR_LIMIT_48BIT would be the
closest to what I am trying to achieve. Since the issue is that
applications fail to work when the address space is greater than 48
bits.

> >
> > I don't see how you're going to actually enforce this in a process either
> > via an mmap flag, as a library might decide not to use it, so you'd need to
> > control the allocator, the thread library implementation, and everything
> > that might allocate.

It is reasonable to change the implementation to be per-process but that
is not the current proposal.

This flag was designed for applications which already directly manage
all of their addresses like OpenJDK and Go.

This flag implementation was an attempt to make this feature as minimally
invasive as possible, to reduce maintenance burden and implementation
complexity.

> >
> > Liam also raised various points about VMA particulars that I'm not sure are
> > addressed either.
> >
> > I just find it hard to believe that everything will fit together.
> >
> > I'd _really_ need to be convinced that this MAP_ flag is justified, and I'm
> > just not.
> >
> > >
> > > This flag will also allow seamless comp

Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-08-29 Thread Charlie Jenkins
On Thu, Aug 29, 2024 at 10:30:56AM +0200, Michal Hocko wrote:
> On Thu 29-08-24 00:15:57, Charlie Jenkins wrote:
> > Some applications rely on placing data in the free bits of addresses
> > allocated by mmap. Various architectures (eg. x86, arm64, powerpc)
> > restrict the address returned by mmap to be less than the 48-bit
> > address space, unless the hint address uses more than 47 bits (the
> > 48th bit is reserved for the kernel address space).
> > 
> > The riscv architecture needs a way to similarly restrict the virtual
> > address space. The riscv port of OpenJDK throws an error when run on
> > the 57-bit address space, called sv57 [1].  golang has a comment that
> > sv57 support is not complete, but there are some workarounds to get
> > it to mostly work [2].
> > 
> > These applications work on x86 because x86 implicitly restricts mmap()
> > addresses to 47 bits when the hint address is less than 48 bits.
> > 
> > Instead of implicitly restricting the address space on riscv (or any
> > current/future architecture), a flag would allow users to opt-in to this
> > behavior rather than opt-out as is done on other architectures. This is
> > desirable because it is a small class of applications that do pointer
> > masking.
> 
> IIRC this has been discussed at length when 5-level page tables support
> has been proposed for x86. Sorry I do not have a link handy but lore
> should help you. Linus was not really convinced and in the end vetoed it
> and prefer that those few applications that benefit from greater address
> space would do that explicitly than other way around.

I believe I found the conversation you were referring to. Ingo Molnar
recommended a flag similar to what I have proposed [1]. Catalin
recommended to make 52-bit opt-in on arm64 [2]. Dave Hansen brought up
MPX [3].

However these conversations are tangential to what I am proposing. arm64
and x86 decided to have the default address space be 48 bits. However
this was done on a per-architecture basis with no way for applications
to have guarantees between architectures. This restriction to 48 bits does
not even appear in the man pages, so discovering it requires reading the
kernel source code. Then to opt-in to larger address spaces, applications
have to
know to provide a hint address that is greater than 47 bits, mmap() will
then return an address that contains up to 56 bits on x86 and 52 bits on
arm64. This difference of 4 bits causes inconsistency and is part of the
problem I am trying to solve with this flag.

I am not proposing to change x86 and arm64 away from using their opt-out
feature, I am instead proposing a standard ABI for applications that
need some guarantees of the bits used in pointers.

- Charlie

Link: https://lore.kernel.org/lkml/20161209050130.gc2...@gmail.com/ [1]
Link: https://lore.kernel.org/lkml/20161209105120.ga3...@e104818-lin.cambridge.arm.com/ [2]
Link: https://lore.kernel.org/lkml/a2f86495-b55f-fda0-40d2-242c45d3c...@intel.com/ [3]

> 
> -- 
> Michal Hocko
> SUSE Labs



Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-08-29 Thread Lorenzo Stoakes
On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote:
> On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
> > Some applications rely on placing data in the free bits of addresses
> > allocated by mmap. Various architectures (eg. x86, arm64, powerpc)
> > restrict the address returned by mmap to be less than the 48-bit
> > address space, unless the hint address uses more than 47 bits (the
> > 48th bit is reserved for the kernel address space).
>
> I'm still confused as to why, if an mmap flag is desired, and thus programs
> are having to be heavily modified and controlled to be able to do this, why
> you can't just do an mmap() with PROT_NONE early, around a hinted address
> that sits below the required limit, and then mprotect() or mmap() over it?
>
> Your feature is a major adjustment to mmap(), it needs to be pretty
> significantly justified, especially if taking up a new flag.
>
> >
> > The riscv architecture needs a way to similarly restrict the virtual
> > address space. The riscv port of OpenJDK throws an error when run on
> > the 57-bit address space, called sv57 [1].  golang has a comment that
> > sv57 support is not complete, but there are some workarounds to get
> > it to mostly work [2].
> >
> > These applications work on x86 because x86 implicitly restricts mmap()
> > addresses to 47 bits when the hint address is less than 48 bits.
>
> You mean x86 _has_ to limit to physically available bits in a canonical
> format :) this will not be the case with 5-level page tables though...
>
> >
> > Instead of implicitly restricting the address space on riscv (or any
> > current/future architecture), a flag would allow users to opt-in to this
> > behavior rather than opt-out as is done on other architectures. This is
> > desirable because it is a small class of applications that do pointer
> > masking.
>
> I raised this last time and you didn't seem to address it so to be more
> blunt:
>
> I don't understand why this needs to be an mmap() flag. From this it seems
> the whole process needs allocations to be below a certain limit.
>
> That _could_ be achieved through a 'personality' or similar (though a
> personality is on/off, rather than allowing configuration so maybe
> something else would be needed).
>
> From what you're saying 57-bit is all you really need right? So maybe
> ADDR_LIMIT_57BIT?
>
> I don't see how you're going to actually enforce this in a process either
> via an mmap flag, as a library might decide not to use it, so you'd need to
> control the allocator, the thread library implementation, and everything
> that might allocate.
>
> Liam also raised various points about VMA particulars that I'm not sure are
> addressed either.
>
> I just find it hard to believe that everything will fit together.
>
> I'd _really_ need to be convinced that this MAP_ flag is justified, and I'm
> just not.
>
> >
> > This flag will also allow seamless compatibility between all
> > architectures, so applications like Go and OpenJDK that use bits in a
> > virtual address can request the exact number of bits they need in a
> > generic way. The flag can be checked inside of vm_unmapped_area() so
> > that this flag does not have to be handled individually by each
> > architecture.
>
> I'm still very unconvinced and feel the bar needs to be high for making
> changes like this that carry maintainership burden.
>
> So for me, it's a no really as an overall concept.
>
> Happy to be convinced otherwise, however... (I may be missing details or
> context that provide more justification).
>

Some more thoughts:

* If you absolutely must keep allocations below a certain limit, you'd
  probably need to actually associate this information with the VMA so the
  memory can't be mremap()'d somewhere invalid (you might not control all
  code so you can't guarantee this won't happen).
* Keeping a map limit associated with a VMA would be horrid and keeping
  VMAs as small as possible is a key aim, so that'd be a no go. VMA flags
  are in limited supply also.
* If we did implement a per-process thing, but it were arbitrary, we'd then
  have to handle all kinds of corner cases forever (this is UAPI, can't
  break it etc.) with crazy-low values, or determine a minimum that might
  vary by arch...
* If we did this we'd absolutely have to implement a check in the brk()
  implementation, which is a very very sensitive bit of code. And of
  course, in mmap() and mremap()... and any arch-specific code that might
  interface with this stuff (these functions are hooked).
* A fixed address limit would make more sense, but it seems difficult to
  know what would work for everybody, and again we'd have to deal with edge
  cases and having a permanent maintenance burden.
* If you did have a map flag what about merging between VMAs above the
  limit and below it? To avoid that you'd need to implement some kind of a
  'VMA flag that has an arbitrary characteristic' or a 'limit' field,
  adjust all t

Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-08-29 Thread Vlastimil Babka
Such a large recipient list and no linux-api. CC'd, please include it on
future postings.

On 8/29/24 09:15, Charlie Jenkins wrote:
> Some applications rely on placing data in the free bits of addresses
> allocated by mmap. Various architectures (eg. x86, arm64, powerpc)
> restrict the address returned by mmap to be less than the 48-bit
> address space, unless the hint address uses more than 47 bits (the
> 48th bit is reserved for the kernel address space).
> 
> The riscv architecture needs a way to similarly restrict the virtual
> address space. The riscv port of OpenJDK throws an error when run on
> the 57-bit address space, called sv57 [1].  golang has a comment that
> sv57 support is not complete, but there are some workarounds to get
> it to mostly work [2].
> 
> These applications work on x86 because x86 implicitly restricts mmap()
> addresses to 47 bits when the hint address is less than 48 bits.
> 
> Instead of implicitly restricting the address space on riscv (or any
> current/future architecture), a flag would allow users to opt-in to this
> behavior rather than opt-out as is done on other architectures. This is
> desirable because it is a small class of applications that do pointer
> masking.

I doubt it's desirable to have different behavior depending on architecture.
Also you could say it's a small class of applications that need more than 47
bits.

> This flag will also allow seamless compatibility between all
> architectures, so applications like Go and OpenJDK that use bits in a
> virtual address can request the exact number of bits they need in a
> generic way. The flag can be checked inside of vm_unmapped_area() so
> that this flag does not have to be handled individually by each
> architecture. 
> 
> Link: https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79 [1]
> Link: https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47 [2]
> 
> 
> To: Arnd Bergmann 
> To: Richard Henderson 
> To: Ivan Kokshaysky 
> To: Matt Turner 
> To: Vineet Gupta 
> To: Russell King 
> To: Guo Ren 
> To: Huacai Chen 
> To: WANG Xuerui 
> To: Thomas Bogendoerfer 
> To: James E.J. Bottomley 
> To: Helge Deller 
> To: Michael Ellerman 
> To: Nicholas Piggin 
> To: Christophe Leroy 
> To: Naveen N Rao 
> To: Alexander Gordeev 
> To: Gerald Schaefer 
> To: Heiko Carstens 
> To: Vasily Gorbik 
> To: Christian Borntraeger 
> To: Sven Schnelle 
> To: Yoshinori Sato 
> To: Rich Felker 
> To: John Paul Adrian Glaubitz 
> To: David S. Miller 
> To: Andreas Larsson 
> To: Thomas Gleixner 
> To: Ingo Molnar 
> To: Borislav Petkov 
> To: Dave Hansen 
> To: x...@kernel.org
> To: H. Peter Anvin 
> To: Andy Lutomirski 
> To: Peter Zijlstra 
> To: Muchun Song 
> To: Andrew Morton 
> To: Liam R. Howlett 
> To: Vlastimil Babka 
> To: Lorenzo Stoakes 
> To: Shuah Khan 
> Cc: linux-a...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-al...@vger.kernel.org
> Cc: linux-snps-...@lists.infradead.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-c...@vger.kernel.org
> Cc: loonga...@lists.linux.dev
> Cc: linux-m...@vger.kernel.org
> Cc: linux-par...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: sparcli...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-kselft...@vger.kernel.org
> Signed-off-by: Charlie Jenkins 
> 
> Changes in v2:
> - Added much greater detail to cover letter
> - Removed all code that touched architecture specific code and was able
>   to factor this out into all generic functions, except for flags that
>   needed to be added to vm_unmapped_area_info
> - Made this an RFC since I have only tested it on riscv and x86
> - Link to v1: 
> https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb90...@rivosinc.com
> 
> ---
> Charlie Jenkins (4):
>   mm: Add MAP_BELOW_HINT
>   mm: Add hint and mmap_flags to struct vm_unmapped_area_info
>   mm: Support MAP_BELOW_HINT in vm_unmapped_area()
>   selftests/mm: Create MAP_BELOW_HINT test
> 
>  arch/alpha/kernel/osf_sys.c  |  2 ++
>  arch/arc/mm/mmap.c   |  3 +++
>  arch/arm/mm/mmap.c   |  7 ++
>  arch/csky/abiv1/mmap.c   |  3 +++
>  arch/loongarch/mm/mmap.c |  3 +++
>  arch/mips/mm/mmap.c  |  3 +++
>  arch/parisc/kernel/sys_parisc.c  |  3 +++
>  arch/powerpc/mm/book3s64/slice.c |  7 ++
>  arch/s390/mm/hugetlbpage.c   |  4 
>  arch/s390/mm/mmap.c  |  6 ++
>  arch/sh/mm/mmap.c|  6 ++
>  arch/sparc/kernel/sys_sparc_32.c |  3 +++
>  arch/sparc/kernel/sys_sparc_64.c |  6 ++
>  arch/sparc/mm/hugetlbpage.c  |  4 
>  arch/x86

Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-08-29 Thread Michal Hocko
On Thu 29-08-24 00:15:57, Charlie Jenkins wrote:
> Some applications rely on placing data in free bits addresses allocated
> by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> address returned by mmap to be less than the 48-bit address space,
> unless the hint address uses more than 47 bits (the 48th bit is reserved
> for the kernel address space).
> 
> The riscv architecture needs a way to similarly restrict the virtual
> address space. On the riscv port of OpenJDK an error is thrown if
> attempted to run on the 57-bit address space, called sv57 [1].  golang
> has a comment that sv57 support is not complete, but there are some
> workarounds to get it to mostly work [2].
> 
> These applications work on x86 because x86 does an implicit 47-bit
> restriction of mmap() address that contain a hint address that is less
> than 48 bits.
> 
> Instead of implicitly restricting the address space on riscv (or any
> current/future architecture), a flag would allow users to opt-in to this
> behavior rather than opt-out as is done on other architectures. This is
> desirable because it is a small class of applications that do pointer
> masking.

IIRC this was discussed at length when 5-level page table support was
proposed for x86. Sorry, I do not have a link handy but lore should help
you. Linus was not really convinced and in the end vetoed it, preferring
that those few applications that benefit from the greater address space
request it explicitly rather than the other way around.

-- 
Michal Hocko
SUSE Labs



Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-08-29 Thread Lorenzo Stoakes
On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
> Some applications rely on placing data in free bits addresses allocated
> by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> address returned by mmap to be less than the 48-bit address space,
> unless the hint address uses more than 47 bits (the 48th bit is reserved
> for the kernel address space).

I'm still confused as to why, if an mmap flag is desired, and thus programs
are having to be heavily modified and controlled to be able to do this, why
you can't just do an mmap() with PROT_NONE early, around a hinted address
that sits below the required limit, and then mprotect() or mmap() over it?

Your feature is a major adjustment to mmap(), it needs to be pretty
significantly justified, especially if taking up a new flag.

>
> The riscv architecture needs a way to similarly restrict the virtual
> address space. On the riscv port of OpenJDK an error is thrown if
> attempted to run on the 57-bit address space, called sv57 [1].  golang
> has a comment that sv57 support is not complete, but there are some
> workarounds to get it to mostly work [2].
>
> These applications work on x86 because x86 does an implicit 47-bit
> restriction of mmap() address that contain a hint address that is less
> than 48 bits.

You mean x86 _has_ to limit addresses to the physically available bits in a
canonical format :) this will not be the case with 5-level page tables though...

>
> Instead of implicitly restricting the address space on riscv (or any
> current/future architecture), a flag would allow users to opt-in to this
> behavior rather than opt-out as is done on other architectures. This is
> desirable because it is a small class of applications that do pointer
> masking.

I raised this last time and you didn't seem to address it so to be more
blunt:

I don't understand why this needs to be an mmap() flag. From this it seems
the whole process needs allocations to be below a certain limit.

That _could_ be achieved through a 'personality' or similar (though a
personality is on/off, rather than allowing configuration so maybe
something else would be needed).

From what you're saying 57-bit is all you really need right? So maybe
ADDR_LIMIT_57BIT?

I don't see how you're going to actually enforce this in a process either
via an mmap flag, as a library might decide not to use it, so you'd need to
control the allocator, the thread library implementation, and everything
that might allocate.

Liam also raised various points about VMA particulars that I'm not sure are
addressed either.

I just find it hard to believe that everything will fit together.

I'd _really_ need to be convinced that this MAP_ flag is justified, and I'm
just not.

>
> This flag will also allow seemless compatibility between all
> architectures, so applications like Go and OpenJDK that use bits in a
> virtual address can request the exact number of bits they need in a
> generic way. The flag can be checked inside of vm_unmapped_area() so
> that this flag does not have to be handled individually by each
> architecture.

I'm still very unconvinced and feel the bar needs to be high for making
changes like this that carry maintainership burden.

So for me, it's a no really as an overall concept.

Happy to be convinced otherwise, however... (I may be missing details or
context that provide more justification).

>
> Link:
> https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79
> [1]
> Link:
> https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47
> [2]
>
> To: Arnd Bergmann 
> To: Richard Henderson 
> To: Ivan Kokshaysky 
> To: Matt Turner 
> To: Vineet Gupta 
> To: Russell King 
> To: Guo Ren 
> To: Huacai Chen 
> To: WANG Xuerui 
> To: Thomas Bogendoerfer 
> To: James E.J. Bottomley 
> To: Helge Deller 
> To: Michael Ellerman 
> To: Nicholas Piggin 
> To: Christophe Leroy 
> To: Naveen N Rao 
> To: Alexander Gordeev 
> To: Gerald Schaefer 
> To: Heiko Carstens 
> To: Vasily Gorbik 
> To: Christian Borntraeger 
> To: Sven Schnelle 
> To: Yoshinori Sato 
> To: Rich Felker 
> To: John Paul Adrian Glaubitz 
> To: David S. Miller 
> To: Andreas Larsson 
> To: Thomas Gleixner 
> To: Ingo Molnar 
> To: Borislav Petkov 
> To: Dave Hansen 
> To: x...@kernel.org
> To: H. Peter Anvin 
> To: Andy Lutomirski 
> To: Peter Zijlstra 
> To: Muchun Song 
> To: Andrew Morton 
> To: Liam R. Howlett 
> To: Vlastimil Babka 
> To: Lorenzo Stoakes 
> To: Shuah Khan 
> Cc: linux-a...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-al...@vger.kernel.org
> Cc: linux-snps-...@lists.infradead.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-c...@vger.kernel.org
> Cc: loonga...@lists.linux.dev
> Cc: linux-m...@vger.kernel.org
> Cc: linux-par...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: sparcli...@

Re: [PATCH RFC v2 2/4] mm: Add hint and mmap_flags to struct vm_unmapped_area_info

2024-08-29 Thread Lorenzo Stoakes
On Thu, Aug 29, 2024 at 12:15:59AM GMT, Charlie Jenkins wrote:

[snip]

> diff --git a/mm/mmap.c b/mm/mmap.c
> index d0dfc85b209b..34ba0db23678 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1796,6 +1796,9 @@ generic_get_unmapped_area(struct file *filp, unsigned long addr,
>   struct vm_unmapped_area_info info = {};
>   const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
>
> + info.hint = addr;
> + info.mmap_flags = flags;
> +
>   if (len > mmap_end - mmap_min_addr)
>   return -ENOMEM;
>
> @@ -1841,6 +1844,9 @@ generic_get_unmapped_area_topdown(struct file *filp, unsigned long addr,
>   struct vm_unmapped_area_info info = {};
>   const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
>
> + info.hint = addr;
> + info.mmap_flags = flags;
> +
>   /* requested length too big for entire address space */
>   if (len > mmap_end - mmap_min_addr)
>   return -ENOMEM;
>

These line numbers suggest you're working against Linus's tree, mm/mmap.c
has changed a lot recently, so to avoid conflicts please base your changes
on mm-unstable in Andrew's tree (if looking to go through that) or at least
-next.

> --
> 2.45.0
>



Re: [PATCH RFC v2 1/4] mm: Add MAP_BELOW_HINT

2024-08-29 Thread Michael Ellerman
Charlie Jenkins  writes:
> Some applications rely on placing data in free bits addresses allocated
> by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> address returned by mmap to be less than the 48-bit address space,
> unless the hint address uses more than 47 bits (the 48th bit is reserved
> for the kernel address space).
>
> To make this behavior explicit and more versatile across all
> architectures, define a mmap flag that allows users to define an
> arbitrary upper limit on addresses returned by mmap.
>
> Signed-off-by: Charlie Jenkins 
> ---
>  include/uapi/asm-generic/mman-common.h   | 1 +
>  tools/include/uapi/asm-generic/mman-common.h | 1 +
  
You're not meant to update the headers in tools/ directly. There's a
mail from acme somewhere describing the proper process, but the tl;dr
is to leave it up to him.

> diff --git a/include/uapi/asm-generic/mman-common.h 
> b/include/uapi/asm-generic/mman-common.h
> index 6ce1f1ceb432..03ac13d9aa37 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -32,6 +32,7 @@
>  
>  #define MAP_UNINITIALIZED 0x400  /* For anonymous mmap, memory could be
>* uninitialized */
> +#define MAP_BELOW_HINT 0x800 /* give out address that is below (inclusive) hint address */

IMHO the API would be clearer if this actually forced the address to be
below the hint. That's what the flag name implies after all.

It would also mean the application doesn't need to take into account the
length of the mapping when passing the hint.

cheers



[PATCH RFC v2 4/4] selftests/mm: Create MAP_BELOW_HINT test

2024-08-29 Thread Charlie Jenkins
Add a selftest for MAP_BELOW_HINT that maps until it runs out of space
below the hint address.

Signed-off-by: Charlie Jenkins 
---
 tools/testing/selftests/mm/Makefile |  1 +
 tools/testing/selftests/mm/map_below_hint.c | 32 +
 2 files changed, 33 insertions(+)

diff --git a/tools/testing/selftests/mm/Makefile 
b/tools/testing/selftests/mm/Makefile
index cfad627e8d94..4e2de85267b5 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -50,6 +50,7 @@ TEST_GEN_FILES += hugepage-shm
 TEST_GEN_FILES += hugepage-vmemmap
 TEST_GEN_FILES += khugepaged
 TEST_GEN_FILES += madv_populate
+TEST_GEN_FILES += map_below_hint
 TEST_GEN_FILES += map_fixed_noreplace
 TEST_GEN_FILES += map_hugetlb
 TEST_GEN_FILES += map_populate
diff --git a/tools/testing/selftests/mm/map_below_hint.c 
b/tools/testing/selftests/mm/map_below_hint.c
new file mode 100644
index ..55d6cbf90645
--- /dev/null
+++ b/tools/testing/selftests/mm/map_below_hint.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test the MAP_BELOW_HINT mmap flag.
+ */
+#include <sys/mman.h>
+#include <errno.h>
+#include "../kselftest.h"
+
+#define ADDR (1 << 20)
+#define LENGTH (ADDR / 1)
+
+#define MAP_BELOW_HINT   0x800 /* Not defined in all libc */
+
+/*
+ * Map memory with MAP_BELOW_HINT until no memory left. Ensure that all
+ * returned addresses are below the hint.
+ */
+int main(int argc, char **argv)
+{
+   void *addr;
+
+   do {
+   addr = mmap((void *)ADDR, LENGTH, PROT_READ | PROT_WRITE,
+   MAP_PRIVATE | MAP_ANONYMOUS | MAP_BELOW_HINT, -1, 0);
+   } while (addr != MAP_FAILED && (unsigned long)addr <= ADDR);
+
+   if (errno == ENOMEM)
+   ksft_test_result_pass("MAP_BELOW_HINT works\n");
+   else
+   ksft_test_result_fail("mmap returned address above hint with MAP_BELOW_HINT with error: %s\n",
+ strerror(errno));
+}

-- 
2.45.0




[PATCH RFC v2 2/4] mm: Add hint and mmap_flags to struct vm_unmapped_area_info

2024-08-29 Thread Charlie Jenkins
The hint address and mmap_flags are necessary to determine if
MAP_BELOW_HINT requirements are satisfied.

Signed-off-by: Charlie Jenkins 
---
 arch/alpha/kernel/osf_sys.c  | 2 ++
 arch/arc/mm/mmap.c   | 3 +++
 arch/arm/mm/mmap.c   | 7 +++
 arch/csky/abiv1/mmap.c   | 3 +++
 arch/loongarch/mm/mmap.c | 3 +++
 arch/mips/mm/mmap.c  | 3 +++
 arch/parisc/kernel/sys_parisc.c  | 3 +++
 arch/powerpc/mm/book3s64/slice.c | 7 +++
 arch/s390/mm/hugetlbpage.c   | 4 
 arch/s390/mm/mmap.c  | 6 ++
 arch/sh/mm/mmap.c| 6 ++
 arch/sparc/kernel/sys_sparc_32.c | 3 +++
 arch/sparc/kernel/sys_sparc_64.c | 6 ++
 arch/sparc/mm/hugetlbpage.c  | 4 
 arch/x86/kernel/sys_x86_64.c | 6 ++
 arch/x86/mm/hugetlbpage.c| 4 
 fs/hugetlbfs/inode.c | 4 
 include/linux/mm.h   | 2 ++
 mm/mmap.c| 6 ++
 19 files changed, 82 insertions(+)

diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c
index e5f881bc8288..6903700afd12 100644
--- a/arch/alpha/kernel/osf_sys.c
+++ b/arch/alpha/kernel/osf_sys.c
@@ -1223,6 +1223,8 @@ arch_get_unmapped_area_1(unsigned long addr, unsigned long len,
info.length = len;
info.low_limit = addr;
info.high_limit = limit;
+   info.hint = addr;
+   info.mmap_flags = flags;
return vm_unmapped_area(&info);
 }
 
diff --git a/arch/arc/mm/mmap.c b/arch/arc/mm/mmap.c
index 69a915297155..5922cb51e029 100644
--- a/arch/arc/mm/mmap.c
+++ b/arch/arc/mm/mmap.c
@@ -29,6 +29,9 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
struct vm_area_struct *vma;
struct vm_unmapped_area_info info = {};
 
+   info.hint = addr;
+   info.mmap_flags = flags;
+
/*
 * We enforce the MAP_FIXED case.
 */
diff --git a/arch/arm/mm/mmap.c b/arch/arm/mm/mmap.c
index d65d0e6ed10a..04d9234f049a 100644
--- a/arch/arm/mm/mmap.c
+++ b/arch/arm/mm/mmap.c
@@ -36,6 +36,9 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
int aliasing = cache_is_vipt_aliasing();
struct vm_unmapped_area_info info = {};
 
+   info.hint = addr;
+   info.mmap_flags = flags;
+
/*
 * We only need to do colour alignment if either the I or D
 * caches alias.
@@ -56,6 +59,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
if (len > TASK_SIZE)
return -ENOMEM;
 
+
if (addr) {
if (do_align)
addr = COLOUR_ALIGN(addr, pgoff);
@@ -88,6 +92,9 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
int aliasing = cache_is_vipt_aliasing();
struct vm_unmapped_area_info info = {};
 
+   info.hint = addr;
+   info.mmap_flags = flags;
+
/*
 * We only need to do colour alignment if either the I or D
 * caches alias.
diff --git a/arch/csky/abiv1/mmap.c b/arch/csky/abiv1/mmap.c
index 7f826331d409..0be4913c6cf3 100644
--- a/arch/csky/abiv1/mmap.c
+++ b/arch/csky/abiv1/mmap.c
@@ -35,6 +35,9 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
.align_offset = pgoff << PAGE_SHIFT
};
 
+   info.hint = addr;
+   info.mmap_flags = flags;
+
/*
 * We only need to do colour alignment if either the I or D
 * caches alias.
diff --git a/arch/loongarch/mm/mmap.c b/arch/loongarch/mm/mmap.c
index 889030985135..7d1e8be20519 100644
--- a/arch/loongarch/mm/mmap.c
+++ b/arch/loongarch/mm/mmap.c
@@ -27,6 +27,9 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp,
int do_color_align;
struct vm_unmapped_area_info info = {};
 
+   info.hint = addr;
+   info.mmap_flags = flags;
+
if (unlikely(len > TASK_SIZE))
return -ENOMEM;
 
diff --git a/arch/mips/mm/mmap.c b/arch/mips/mm/mmap.c
index 7e11d7b58761..22e8f9c8eaa0 100644
--- a/arch/mips/mm/mmap.c
+++ b/arch/mips/mm/mmap.c
@@ -36,6 +36,9 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp,
int do_color_align;
struct vm_unmapped_area_info info = {};
 
+   info.hint = addr;
+   info.mmap_flags = flags;
+
if (unlikely(len > TASK_SIZE))
return -ENOMEM;
 
diff --git a/arch/parisc/kernel/sys_parisc.c b/arch/parisc/kernel/sys_parisc.c
index f7722451276e..2ac53f148624 100644
--- a/arch/parisc/kernel/sys_parisc.c
+++ b/arch/parisc/kernel/sys_parisc.c
@@ -108,6 +108,9 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp,
.length = len
};
 
+   info.hint = addr;
+   info.mmap_flags = flags;
+
if (unlikely(len > TASK_SIZE))
return -ENOMEM;
 
diff --git a/arch/powerpc/mm/book3s64/slice.c b/arch/powerpc/mm/book3s64/slice.c
index ef3ce37f1bb3..f0e2550af6d0 100644
--- a/arch/powerpc/mm/book3s64/s

[PATCH RFC v2 3/4] mm: Support MAP_BELOW_HINT in vm_unmapped_area()

2024-08-29 Thread Charlie Jenkins
To ensure that all memory allocations comply with the new MAP_BELOW_HINT
flag, set the high_limit in vm_unmapped_area() to the hint address +
length at most. All callers to this function set the high_limit to
something reasonable, usually with space for a random offset and a gap
for the stack. To respect the provided high_limit, take the minimum of
hint+length and the given high_limit.

Signed-off-by: Charlie Jenkins 
---
 mm/mmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 34ba0db23678..459ad380c673 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1766,6 +1766,9 @@ unsigned long vm_unmapped_area(struct 
vm_unmapped_area_info *info)
 {
unsigned long addr;
 
+   if (info->hint != 0 && info->mmap_flags & MAP_BELOW_HINT)
+   info->high_limit = MIN(info->high_limit, info->hint + info->length);
+
if (info->flags & VM_UNMAPPED_AREA_TOPDOWN)
addr = unmapped_area_topdown(info);
else

-- 
2.45.0




[PATCH RFC v2 1/4] mm: Add MAP_BELOW_HINT

2024-08-29 Thread Charlie Jenkins
Some applications rely on placing data in the free upper bits of addresses
allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict
the address returned by mmap to be less than the 48-bit address space,
unless the hint address uses more than 47 bits (the 48th bit is reserved
for the kernel address space).

To make this behavior explicit and more versatile across all
architectures, define a mmap flag that allows users to define an
arbitrary upper limit on addresses returned by mmap.

Signed-off-by: Charlie Jenkins 
---
 include/uapi/asm-generic/mman-common.h   | 1 +
 tools/include/uapi/asm-generic/mman-common.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/uapi/asm-generic/mman-common.h 
b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..03ac13d9aa37 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -32,6 +32,7 @@
 
 #define MAP_UNINITIALIZED 0x400/* For anonymous mmap, memory could be
 * uninitialized */
+#define MAP_BELOW_HINT   0x800 /* give out address that is below (inclusive) hint address */
 
 /*
  * Flags for mlock
diff --git a/tools/include/uapi/asm-generic/mman-common.h 
b/tools/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..03ac13d9aa37 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -32,6 +32,7 @@
 
 #define MAP_UNINITIALIZED 0x400/* For anonymous mmap, memory could be
 * uninitialized */
+#define MAP_BELOW_HINT   0x800 /* give out address that is below (inclusive) hint address */
 
 /*
  * Flags for mlock

-- 
2.45.0




[PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-08-29 Thread Charlie Jenkins
Some applications rely on placing data in the free upper bits of addresses
allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict
the address returned by mmap to be less than the 48-bit address space,
unless the hint address uses more than 47 bits (the 48th bit is reserved
for the kernel address space).

The riscv architecture needs a way to similarly restrict the virtual
address space. The riscv port of OpenJDK throws an error if it is run on
the 57-bit address space, called sv57 [1]. golang has a comment that sv57
support is not complete, but there are some workarounds to get it to
mostly work [2].

These applications work on x86 because x86 implicitly restricts mmap()
to 47-bit addresses when the hint address is less than 48 bits.

Instead of implicitly restricting the address space on riscv (or any
current/future architecture), a flag would allow users to opt in to this
behavior rather than opt out as is done on other architectures. This is
desirable because only a small class of applications does pointer
masking.

This flag will also allow seamless compatibility between all
architectures, so applications like Go and OpenJDK that use bits in a
virtual address can request the exact number of bits they need in a
generic way. The flag can be checked inside of vm_unmapped_area() so
that this flag does not have to be handled individually by each
architecture.

Link:
https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79
[1]
Link:
https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47
[2]

To: Arnd Bergmann 
To: Richard Henderson 
To: Ivan Kokshaysky 
To: Matt Turner 
To: Vineet Gupta 
To: Russell King 
To: Guo Ren 
To: Huacai Chen 
To: WANG Xuerui 
To: Thomas Bogendoerfer 
To: James E.J. Bottomley 
To: Helge Deller 
To: Michael Ellerman 
To: Nicholas Piggin 
To: Christophe Leroy 
To: Naveen N Rao 
To: Alexander Gordeev 
To: Gerald Schaefer 
To: Heiko Carstens 
To: Vasily Gorbik 
To: Christian Borntraeger 
To: Sven Schnelle 
To: Yoshinori Sato 
To: Rich Felker 
To: John Paul Adrian Glaubitz 
To: David S. Miller 
To: Andreas Larsson 
To: Thomas Gleixner 
To: Ingo Molnar 
To: Borislav Petkov 
To: Dave Hansen 
To: x...@kernel.org
To: H. Peter Anvin 
To: Andy Lutomirski 
To: Peter Zijlstra 
To: Muchun Song 
To: Andrew Morton 
To: Liam R. Howlett 
To: Vlastimil Babka 
To: Lorenzo Stoakes 
To: Shuah Khan 
Cc: linux-a...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux-al...@vger.kernel.org
Cc: linux-snps-...@lists.infradead.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-c...@vger.kernel.org
Cc: loonga...@lists.linux.dev
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-kselft...@vger.kernel.org
Signed-off-by: Charlie Jenkins 

Changes in v2:
- Added much greater detail to cover letter
- Removed all code that touched architecture specific code and was able
  to factor this out into all generic functions, except for flags that
  needed to be added to vm_unmapped_area_info
- Made this an RFC since I have only tested it on riscv and x86
- Link to v1: 
https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb90...@rivosinc.com

---
Charlie Jenkins (4):
  mm: Add MAP_BELOW_HINT
  mm: Add hint and mmap_flags to struct vm_unmapped_area_info
  mm: Support MAP_BELOW_HINT in vm_unmapped_area()
  selftests/mm: Create MAP_BELOW_HINT test

 arch/alpha/kernel/osf_sys.c  |  2 ++
 arch/arc/mm/mmap.c   |  3 +++
 arch/arm/mm/mmap.c   |  7 ++
 arch/csky/abiv1/mmap.c   |  3 +++
 arch/loongarch/mm/mmap.c |  3 +++
 arch/mips/mm/mmap.c  |  3 +++
 arch/parisc/kernel/sys_parisc.c  |  3 +++
 arch/powerpc/mm/book3s64/slice.c |  7 ++
 arch/s390/mm/hugetlbpage.c   |  4 
 arch/s390/mm/mmap.c  |  6 ++
 arch/sh/mm/mmap.c|  6 ++
 arch/sparc/kernel/sys_sparc_32.c |  3 +++
 arch/sparc/kernel/sys_sparc_64.c |  6 ++
 arch/sparc/mm/hugetlbpage.c  |  4 
 arch/x86/kernel/sys_x86_64.c |  6 ++
 arch/x86/mm/hugetlbpage.c|  4 
 fs/hugetlbfs/inode.c |  4 
 include/linux/mm.h   |  2 ++
 include/uapi/asm-generic/mman-common.h   |  1 +
 mm/mmap.c|  9 
 tools/include/uapi/asm-generic/mman-common.h |  1 +
 tools/testing/selftests/mm/Makefile  |  1 +
 tools/testing/selftests/mm/map_below_hint.c  | 32 
 23 files changed, 120 inse

Re: [PATCH RFC 2/6] mm: PGTABLE_HAS_P[MU]D_LEAVES config options

2024-08-26 Thread Peter Xu
On Fri, Aug 23, 2024 at 06:19:52AM +, LEROY Christophe wrote:
> Why is an option needed for that? If pmd_leaf() always returns false,
> it means the arch doesn't support pmd mappings, and if properly used all
> related code should fold away without a config option, shouldn't it?

It's not always easy to leverage an "if" clause there, IIUC.  Take the case
of when a driver wants to inject a pmd pfnmap, we may want something like:

  if (pmd_leaf_supported())
  inject_pmd_leaf(&pmd);

We don't have a pmd entry to reference at the point of pmd_leaf_supported()
when making the decision.

Thanks,

-- 
Peter Xu




[PATCH] [RFC] crash: Lock-free crash hotplug support reporting

2024-08-23 Thread Sourabh Jain
On a CPU/Memory hotplug event, the kexec lock is taken to update the
kdump image. At the same time, this lock is also required to report
the support for crash hotplug to user-space via the
/sys/devices/system/[cpu|memory]/crash_hotplug sysfs interface, to
avoid kdump reload.

The kexec lock is needed to report crash hotplug support because the
crash_hotplug variable, which tracks crash hotplug support, is part of
the kdump image, and the kdump image needs to be updated during a
hotplug event.

Given that only one kdump image can be loaded at any given time, the
crash_hotplug variable can be placed outside the kdump image and set or
reset during kdump image load and unload. This allows crash hotplug
support to be reported without taking the kexec lock.

This helps in situations where CPU/memory resources are hotplugged
from the system in bulk.

Commit e2a8f20dd8e9 ("Crash: add lock to serialize crash hotplug
handling") was introduced to serialize crash hotplug handling during bulk
CPU/memory hotplug events. However, with these changes, taking the kexec
lock for crash hotplug support reporting can be avoided altogether.

Cc: Hari Bathini 
Cc: Mahesh Salgaonkar 
Cc: ke...@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ker...@vger.kernel.org
Cc: x...@kernel.org
Signed-off-by: Sourabh Jain 
---
 include/linux/kexec.h | 11 ---
 kernel/crash_core.c   | 27 +--
 kernel/kexec.c|  5 -
 kernel/kexec_file.c   |  7 ++-
 4 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index f0e9f8eda7a3..bd755ba6bac4 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -318,13 +318,6 @@ struct kimage {
unsigned int preserve_context : 1;
/* If set, we are using file mode kexec syscall */
unsigned int file_mode:1;
-#ifdef CONFIG_CRASH_HOTPLUG
-   /* If set, it is safe to update kexec segments that are
-* excluded from SHA calculation.
-*/
-   unsigned int hotplug_support:1;
-#endif
-
 #ifdef ARCH_HAS_KIMAGE_ARCH
struct kimage_arch arch;
 #endif
@@ -370,6 +363,10 @@ struct kimage {
unsigned long elf_load_addr;
 };
 
+#ifdef CONFIG_CRASH_HOTPLUG
+extern unsigned int crash_hotplug_support;
+#endif
+
 /* kexec interface functions */
 extern void machine_kexec(struct kimage *image);
 extern int machine_kexec_prepare(struct kimage *image);
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 63cf89393c6e..3428deba0070 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -30,6 +30,13 @@
 #include "kallsyms_internal.h"
 #include "kexec_internal.h"
 
+#ifdef CONFIG_CRASH_HOTPLUG
+/* if set, it is safe to update kexec segments that are
+ * excluded from sha calculation.
+ */
+unsigned int crash_hotplug_support;
+#endif
+
 /* Per cpu memory for storing cpu states in case of system crash. */
 note_buf_t __percpu *crash_notes;
 
@@ -500,23 +507,7 @@ static DEFINE_MUTEX(__crash_hotplug_lock);
  */
 int crash_check_hotplug_support(void)
 {
-   int rc = 0;
-
-   crash_hotplug_lock();
-   /* Obtain lock while reading crash information */
-   if (!kexec_trylock()) {
-   pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
-   crash_hotplug_unlock();
-   return 0;
-   }
-   if (kexec_crash_image) {
-   rc = kexec_crash_image->hotplug_support;
-   }
-   /* Release lock now that update complete */
-   kexec_unlock();
-   crash_hotplug_unlock();
-
-   return rc;
+   return crash_hotplug_support;
 }
 
 /*
@@ -552,7 +543,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu,
image = kexec_crash_image;
 
/* Check that kexec segments update is permitted */
-   if (!image->hotplug_support)
+   if (!crash_hotplug_support)
goto out;
 
if (hp_action == KEXEC_CRASH_HP_ADD_CPU ||
diff --git a/kernel/kexec.c b/kernel/kexec.c
index a6b3f96bb50c..d5c6b51eaa8b 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -116,6 +116,9 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
/* Uninstall image */
kimage_free(xchg(dest_image, NULL));
ret = 0;
+#ifdef CONFIG_CRASH_HOTPLUG
+   crash_hotplug_support = 0;
+#endif
goto out_unlock;
}
if (flags & KEXEC_ON_CRASH) {
@@ -136,7 +139,7 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
 
 #ifdef CONFIG_CRASH_HOTPLUG
	if ((flags & KEXEC_ON_CRASH) && arch_crash_hotplug_support(image, flags))
-   image->hotplug_support = 1;
+   crash_hotplug_support = 1;
 #endif
 
ret = machine_kexec_prepare(image);
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 3d64290d24c9..b326edb90fd7 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -378,7 +378,7 @@ SYSCALL_DEFINE

Re: [PATCH RFC 2/6] mm: PGTABLE_HAS_P[MU]D_LEAVES config options

2024-08-22 Thread LEROY Christophe


Le 22/08/2024 à 21:16, Peter Xu a écrit :
> On Thu, Aug 22, 2024 at 05:22:03PM +, LEROY Christophe wrote:
>>
>>
>> Le 18/07/2024 à 00:02, Peter Xu a écrit :
>>> Introduce two more sub-options for PGTABLE_HAS_HUGE_LEAVES:
>>>
>>> - PGTABLE_HAS_PMD_LEAVES: set when there can be PMD mappings
>>> - PGTABLE_HAS_PUD_LEAVES: set when there can be PUD mappings
>>>
>>> It will help to identify whether the current build may only want PMD
>>> helpers but not PUD ones, as these sub-options will also check against the
>>> arch support over HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD].
>>>
>>> Note that having them depend on HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD] is
>>> still some intermediate step.  The best way is to have an option say
>>> "whether arch XXX supports PMD/PUD mappings" and so on.  However let's
>>> leave that for later as that's the easy part.  So far, we use these options
>>> to stably detect per-arch huge mapping support.
>>>
>>> Signed-off-by: Peter Xu 
>>> ---
>>>include/linux/huge_mm.h | 10 +++---
>>>mm/Kconfig  |  6 ++
>>>2 files changed, 13 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 711632df7edf..37482c8445d1 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -96,14 +96,18 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>>>#define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
>>> (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
>>>
>>> -#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
>>> -#define HPAGE_PMD_SHIFT PMD_SHIFT
>>> +#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
>>>#define HPAGE_PUD_SHIFT PUD_SHIFT
>>>#else
>>> -#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
>>>#define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; })
>>>#endif
>>>
>>> +#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
>>> +#define HPAGE_PMD_SHIFT PMD_SHIFT
>>> +#else
>>> +#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
>>> +#endif
>>> +
>>>#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
>>>#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>>>#define HPAGE_PMD_MASK   (~(HPAGE_PMD_SIZE - 1))
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index 60796402850e..2dbdc088dee8 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -860,6 +860,12 @@ endif # TRANSPARENT_HUGEPAGE
>>>config PGTABLE_HAS_HUGE_LEAVES
>>> def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
>>>
>>> +config PGTABLE_HAS_PMD_LEAVES
>>> +   def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE && PGTABLE_HAS_HUGE_LEAVES
>>> +
>>> +config PGTABLE_HAS_PUD_LEAVES
>>> +   def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD && PGTABLE_HAS_HUGE_LEAVES
>>> +
>>
>> What if an architecture has hugepages at PMD and/or PUD level and
>> doesn't support THP ?
> 
> What's the arch to be discussed here?

It is LoongArch and MIPS; they provide pud_leaf() that can return true 
even when they have no PUD.

> 
> The whole purpose of this series so far is trying to make some pmd/pud
> helpers that only defined with CONFIG_THP=on to be available even if not.
> It means this series alone (or any future plan) shouldn't affect any arch
> that has CONFIG_THP=off always.
> 
> But logically I think we should need some config option just to say "this
> arch supports pmd mappings" indeed, even if CONFIG_THP=off.  When that's
> there, we should perhaps add that option into this equation so
> PGTABLE_HAS_*_LEAVES will also be selected in that case.
> 

Why is an option needed for that? If pmd_leaf() always returns false, 
it means the arch doesn't support pmd mappings, and if properly used all 
related code should fold away without a config option, shouldn't it?
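
As a side note, the fold-away behaviour argued above can be sketched in plain C: when the predicate is a compile-time constant false, the branch it guards is dead code and the compiler discards it with no #ifdef. The names below (pmd_leaf_stub, handle_pmd, handled_huge) are invented for illustration and are not the kernel's helpers.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an arch with no PMD leaves: pmd_leaf_stub() is
 * a compile-time constant false, so the branch it guards is dead code and
 * folds away without any #ifdef or Kconfig option. */
static inline bool pmd_leaf_stub(void)
{
	return false;	/* this arch never has PMD-level leaf entries */
}

int handled_huge = 0;

void handle_pmd(void)
{
	if (pmd_leaf_stub())
		handled_huge = 1;	/* unreachable: eliminated at compile time */
}
```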


Re: [PATCH RFC 2/6] mm: PGTABLE_HAS_P[MU]D_LEAVES config options

2024-08-22 Thread Peter Xu
On Thu, Aug 22, 2024 at 05:22:03PM +, LEROY Christophe wrote:
> 
> 
> Le 18/07/2024 à 00:02, Peter Xu a écrit :
> > Introduce two more sub-options for PGTABLE_HAS_HUGE_LEAVES:
> > 
> >- PGTABLE_HAS_PMD_LEAVES: set when there can be PMD mappings
> >- PGTABLE_HAS_PUD_LEAVES: set when there can be PUD mappings
> > 
> > It will help to identify whether the current build may only want PMD
> > helpers but not PUD ones, as these sub-options will also check against the
> > arch support over HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD].
> > 
> > Note that having them depend on HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD] is
> > still some intermediate step.  The best way is to have an option say
> > "whether arch XXX supports PMD/PUD mappings" and so on.  However let's
> > leave that for later as that's the easy part.  So far, we use these options
> > to stably detect per-arch huge mapping support.
> > 
> > Signed-off-by: Peter Xu 
> > ---
> >   include/linux/huge_mm.h | 10 +++---
> >   mm/Kconfig  |  6 ++
> >   2 files changed, 13 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 711632df7edf..37482c8445d1 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -96,14 +96,18 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
> >   #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
> > (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
> >   
> > -#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
> > -#define HPAGE_PMD_SHIFT PMD_SHIFT
> > +#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
> >   #define HPAGE_PUD_SHIFT PUD_SHIFT
> >   #else
> > -#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> >   #define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; })
> >   #endif
> >   
> > +#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
> > +#define HPAGE_PMD_SHIFT PMD_SHIFT
> > +#else
> > +#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> > +#endif
> > +
> >   #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> >   #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> >   #define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 60796402850e..2dbdc088dee8 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -860,6 +860,12 @@ endif # TRANSPARENT_HUGEPAGE
> >   config PGTABLE_HAS_HUGE_LEAVES
> > def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
> >   
> > +config PGTABLE_HAS_PMD_LEAVES
> > +   def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE && PGTABLE_HAS_HUGE_LEAVES
> > +
> > +config PGTABLE_HAS_PUD_LEAVES
> > +   def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD && PGTABLE_HAS_HUGE_LEAVES
> > +
> 
> What if an architecture has hugepages at PMD and/or PUD level and 
> doesn't support THP ?

What's the arch to be discussed here?

The whole purpose of this series so far is trying to make some pmd/pud
helpers that only defined with CONFIG_THP=on to be available even if not.
It means this series alone (or any future plan) shouldn't affect any arch
that has CONFIG_THP=off always.

But logically I think we should need some config option just to say "this
arch supports pmd mappings" indeed, even if CONFIG_THP=off.  When that's
there, we should perhaps add that option into this equation so
PGTABLE_HAS_*_LEAVES will also be selected in that case.

-- 
Peter Xu
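
The missing option described above could, as a sketch, take the shape of an arch-declared capability symbol feeding the same equation. The ARCH_SUPPORTS_PMD_MAPPINGS name below is hypothetical, not an existing Kconfig symbol:

```kconfig
# Hypothetical sketch: an arch advertises PMD-mapping support
# independently of THP, and the generic option picks it up, so
# PGTABLE_HAS_PMD_LEAVES can be set even with CONFIG_THP=off.
config ARCH_SUPPORTS_PMD_MAPPINGS
	bool

config PGTABLE_HAS_PMD_LEAVES
	def_bool (HAVE_ARCH_TRANSPARENT_HUGEPAGE || ARCH_SUPPORTS_PMD_MAPPINGS) && \
		 PGTABLE_HAS_HUGE_LEAVES
```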




Re: [PATCH RFC 2/6] mm: PGTABLE_HAS_P[MU]D_LEAVES config options

2024-08-22 Thread LEROY Christophe


Le 18/07/2024 à 00:02, Peter Xu a écrit :
> Introduce two more sub-options for PGTABLE_HAS_HUGE_LEAVES:
> 
>- PGTABLE_HAS_PMD_LEAVES: set when there can be PMD mappings
>- PGTABLE_HAS_PUD_LEAVES: set when there can be PUD mappings
> 
> It will help to identify whether the current build may only want PMD
> helpers but not PUD ones, as these sub-options will also check against the
> arch support over HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD].
> 
> Note that having them depend on HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD] is
> still some intermediate step.  The best way is to have an option say
> "whether arch XXX supports PMD/PUD mappings" and so on.  However let's
> leave that for later as that's the easy part.  So far, we use these options
> to stably detect per-arch huge mapping support.
> 
> Signed-off-by: Peter Xu 
> ---
>   include/linux/huge_mm.h | 10 +++---
>   mm/Kconfig  |  6 ++
>   2 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 711632df7edf..37482c8445d1 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -96,14 +96,18 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>   #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
>   (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
>   
> -#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
> -#define HPAGE_PMD_SHIFT PMD_SHIFT
> +#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
>   #define HPAGE_PUD_SHIFT PUD_SHIFT
>   #else
> -#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
>   #define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; })
>   #endif
>   
> +#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
> +#define HPAGE_PMD_SHIFT PMD_SHIFT
> +#else
> +#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> +#endif
> +
>   #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
>   #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>   #define HPAGE_PMD_MASK  (~(HPAGE_PMD_SIZE - 1))
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 60796402850e..2dbdc088dee8 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -860,6 +860,12 @@ endif # TRANSPARENT_HUGEPAGE
>   config PGTABLE_HAS_HUGE_LEAVES
>   def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
>   
> +config PGTABLE_HAS_PMD_LEAVES
> + def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE && PGTABLE_HAS_HUGE_LEAVES
> +
> +config PGTABLE_HAS_PUD_LEAVES
> + def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD && PGTABLE_HAS_HUGE_LEAVES
> +

What if an architecture has hugepages at PMD and/or PUD level and 
doesn't support THP ?

Christophe

>   #
>   # UP and nommu archs use km based percpu allocator
>   #


Re: [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings

2024-08-22 Thread LEROY Christophe


Le 23/07/2024 à 23:04, Peter Xu a écrit :
>>
>>>
>>> Normally I don't see this as much of a "code churn" category, because it
>>> doesn't change the code itself but only moves things.  I personally also
>>> prefer without code churns, but only in the case where there'll be tiny
>>> little functional changes here and there without real benefit.
>>>
>>> It's pretty unavoidable to me when one file grows too large and we'll need
>>> to split, and in this case git doesn't have a good way to track such
>>> movement..
>>
>> Yes, that's what I mean.
>>
>> I've been recently thinking if we should pursue a different direction:
>>
>> Just as we recently relocated most follow_huge_* stuff into gup.c, likely we
>> should rather look into moving copy_huge_pmd, change_huge_pmd, copy_huge_pmd
>> ... into the files where they logically belong to.
>>
>> In madvise.c, we've been doing that in some places already: For
>> madvise_cold_or_pageout_pte_range() we inline the code, but not for
>> madvise_free_huge_pmd().
>>
>> pmd_trans_huge() would already compile to a NOP without
>> CONFIG_TRANSPARENT_HUGEPAGE, but to make that code avoid most
>> CONFIG_TRANSPARENT_HUGEPAGE, we'd need a couple more function stubs to make
>> the compiler happy while still being able to compile that code out when not
>> required.
> 
> Right, I had a patch does exactly that, where it's called pmd_is_leaf(),
> for example, but taking CONFIG_* into account.
> 
> I remember I had some issue with that, e.g. I used to see pmd_trans_huge()
> (when !THP) can optimize some path but pmd_is_leaf() didn't do the same job
> even if all configs were off.  But that's another story and I didn't yet
> dig deeper.  Could be something small but overlooked.

When I prepared series 
https://patchwork.kernel.org/project/linux-mm/list/?series=871008 , I 
detected that some architectures define some pXd_leaf() for levels they 
don't support, that's the reason why Andrew had to drop v2 at the last 
minute.

And that's maybe the reason why some of the code you expect to get 
folded-off remains.

Since then I sent v3 that fixes that. Don't know if Andrew is planning 
to take it.

Christophe


Re: [PATCH RFC v2 2/5] of: get dma area lower limit

2024-07-30 Thread Catalin Marinas
On Thu, Jul 25, 2024 at 02:49:01PM +0300, Baruch Siach wrote:
> Hi Catalin,
> 
> On Tue, Jun 18 2024, Catalin Marinas wrote:
> > On Tue, Apr 09, 2024 at 09:17:55AM +0300, Baruch Siach wrote:
> >> of_dma_get_max_cpu_address() returns the highest CPU address that
> >> devices can use for DMA. The implicit assumption is that all CPU
> >> addresses below that limit are suitable for DMA. However the
> >> 'dma-ranges' property this code uses also encodes a lower limit for DMA
> >> that is potentially non zero.
> >> 
> >> Rename to of_dma_get_cpu_limits(), and extend to retrieve also the lower
> >> limit for the same 'dma-ranges' property describing the high limit.
> >
> > I don't understand the reason for the lower limit. The way the Linux
> > zones work is that ZONE_DMA always starts from the start of the RAM. It
> > doesn't matter whether it's 0 or not, you'd not allocate below the start
> > of RAM anyway. If you have a device that cannot use the bottom of the
> > RAM, it is pretty broken and not supported by Linux.
> 
> I won't argue with that assertion. My target system RAM happens to start
> at the lower end of the devices' DMA zone, so I'm fine with skipping
> this patch.
> 
> Just curious. What is the inherent limitation that prevents Linux from
> supporting DMA zone with lower limit above RAM start?

It's the way the zone allocation fallback mechanism works. Let's say a
ZONE_DMA32 allocation fails, it falls back to ZONE_DMA and it's supposed
to be compatible with the GFP_DMA32 request. If you have some other zone
below ZONE_DMA, it should also be compatible with GFP_DMA allocations.

-- 
Catalin
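
The fallback rule described above can be modelled in a few lines: an allocation walks from the requested zone downward, so every lower zone must itself satisfy the caller's address limit. The toy_* names and limits below are invented for illustration and are not the kernel's real zone code.

```c
#include <assert.h>
#include <stdint.h>

/* Toy zones ordered low to high, mirroring ZONE_DMA < ZONE_DMA32 < NORMAL. */
enum { TOY_ZONE_DMA, TOY_ZONE_DMA32, TOY_ZONE_NORMAL, TOY_NR_ZONES };

static const uint64_t toy_zone_top[TOY_NR_ZONES] = {
	[TOY_ZONE_DMA]    = 1ULL << 30,	/* illustrative 1 GiB limit */
	[TOY_ZONE_DMA32]  = 1ULL << 32,	/* 4 GiB */
	[TOY_ZONE_NORMAL] = ~0ULL,
};

/* Try the requested zone, then fall back to lower zones, as the page
 * allocator does.  Each fallback zone must still respect the caller's
 * address limit -- which only holds if every lower zone covers strictly
 * lower addresses.  Returns the zone used, or -1 on failure. */
static int toy_alloc(int requested_zone, uint64_t limit, const int *has_pages)
{
	for (int z = requested_zone; z >= 0; z--) {
		if (!has_pages[z])
			continue;
		if (toy_zone_top[z] <= limit)
			return z;
	}
	return -1;
}
```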


Re: [PATCH RFC v2 2/5] of: get dma area lower limit

2024-07-25 Thread Baruch Siach
Hi Catalin,

On Tue, Jun 18 2024, Catalin Marinas wrote:
> On Tue, Apr 09, 2024 at 09:17:55AM +0300, Baruch Siach wrote:
>> of_dma_get_max_cpu_address() returns the highest CPU address that
>> devices can use for DMA. The implicit assumption is that all CPU
>> addresses below that limit are suitable for DMA. However the
>> 'dma-ranges' property this code uses also encodes a lower limit for DMA
>> that is potentially non zero.
>> 
>> Rename to of_dma_get_cpu_limits(), and extend to retrieve also the lower
>> limit for the same 'dma-ranges' property describing the high limit.
>
> I don't understand the reason for the lower limit. The way the Linux
> zones work is that ZONE_DMA always starts from the start of the RAM. It
> doesn't matter whether it's 0 or not, you'd not allocate below the start
> of RAM anyway. If you have a device that cannot use the bottom of the
> RAM, it is pretty broken and not supported by Linux.

I won't argue with that assertion. My target system RAM happens to start
at the lower end of the devices' DMA zone, so I'm fine with skipping
this patch.

Just curious. What is the inherent limitation that prevents Linux from
supporting DMA zone with lower limit above RAM start?

Thanks,
baruch

-- 
 ~. .~   Tk Open Systems
=}ooO--U--Ooo{=
   - bar...@tkos.co.il - tel: +972.52.368.4656, http://www.tkos.co.il -


Re: [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings

2024-07-23 Thread David Hildenbrand

On 23.07.24 23:04, Peter Xu wrote:

On Tue, Jul 23, 2024 at 10:18:37AM +0200, David Hildenbrand wrote:

On 22.07.24 17:31, Peter Xu wrote:

On Mon, Jul 22, 2024 at 03:29:43PM +0200, David Hildenbrand wrote:

On 18.07.24 00:02, Peter Xu wrote:

This is an RFC series, so not yet for merging.  Please don't be scared by
the code changes: most of them are code movements only.

This series is based on the dax mprotect fix series here (while that one is
based on mm-unstable):

 [PATCH v3 0/8] mm/mprotect: Fix dax puds
 https://lore.kernel.org/r/20240715192142.3241557-1-pet...@redhat.com

Overview


This series doesn't provide any feature change.  The only goal of this
series is to start decoupling two ideas: "THP" and "huge mapping".  We
already started with having PGTABLE_HAS_HUGE_LEAVES config option, and this
one extends that idea into the code.

The issue is that we have so many functions that only compile with
CONFIG_THP=on, even though they're about huge mappings, and huge mapping is
a pretty common concept, which can apply to many things besides THPs
nowadays.  The major THP file is mm/huge_memory.c as of now.

The first example of such huge mapping users will be hugetlb.  We lived
until now with no problem simply because Linux almost duplicated all the
logics there in the "THP" files into hugetlb APIs.  If we want to get rid
of hugetlb specific APIs and paths, this _might_ be the first thing we want
to do, because we want to be able to e.g., zapping a hugetlb pmd entry even
if !CONFIG_THP.

Then consider other things like dax / pfnmaps.  Dax can depend on THP, then
it'll naturally be able to use pmd/pud helpers, that's okay.  However is it
a must?  Do we also want to have every new pmd/pud mappings in the future
to depend on THP (like PFNMAP)?  My answer is no, but I'm open to opinions.

If anyone agrees with me that "huge mapping" (aka, PMD/PUD mappings that
are larger than PAGE_SIZE) is a more generic concept than THP, then I think
at some point we need to move the generic code out of THP code into a
common code base.

This is what this series does as a start.


Hi Peter!

  From a quick glimpse, patch #1-#4 do make sense independent of patch #5.

I am not so sure about all of the code movement in patch #5. If large folios
are the future, then likely huge_memory.c should simply be the home for all
that logic.

Maybe the goal should better be to compile huge_memory.c not only for THP,
but also for other use cases that require that logic, and fence off all THP
specific stuff using #ifdef?

Not sure, though. But a lot of this code movements/churn might be avoidable.


I'm fine using ifdefs in the current file, but IMHO it's a matter of
whether we want to keep huge_memory.c growing into even larger file, and
keep all large folio logics only in that file.  Currently it's ~4000 LOCs.


Depends on "how much" for sure. huge_memory.c is currently on place 12 of
the biggest files in mm/. So there might not be immediate cause for action
... just yet :) [guess which file is on #2 :) ]


7821, hugetlb.c
7602, vmscan.c
7275, slub.c
7072, page_alloc.c
6673, memory.c
5402, memcontrol.c
5239, shmem.c
5155, vmalloc.c
4419, filemap.c
4060, mmap.c
3882, huge_memory.c

IMHO a split is normally better than keeping everything in one file, but
yeah I'd confess the THP file isn't that bad compared to others..  And I'm
definitely surprised it's even out of the top ten.


It's always interesting looking at the numbers here. For v6.10 we had:

8521 mm/memcontrol.c
7813 mm/hugetlb.c
7550 mm/vmscan.c
7266 mm/slub.c
7018 mm/page_alloc.c
6468 mm/memory.c
5154 mm/vmalloc.c
5002 mm/shmem.c
4419 mm/filemap.c
4019 mm/mmap.c
3954 mm/ksm.c
3740 mm/swapfile.c
3730 mm/huge_memory.c
3689 mm/gup.c
3542 mm/mempolicy.c

I suspect memcontrol.c shrunk because of the v1 split-off, leaving 
hugetlb.c now at #1 :)


--
Cheers,

David / dhildenb



Re: [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings

2024-07-23 Thread Peter Xu
On Tue, Jul 23, 2024 at 10:18:37AM +0200, David Hildenbrand wrote:
> On 22.07.24 17:31, Peter Xu wrote:
> > On Mon, Jul 22, 2024 at 03:29:43PM +0200, David Hildenbrand wrote:
> > > On 18.07.24 00:02, Peter Xu wrote:
> > > > This is an RFC series, so not yet for merging.  Please don't be scared 
> > > > by
> > > > the code changes: most of them are code movements only.
> > > > 
> > > > This series is based on the dax mprotect fix series here (while that 
> > > > one is
> > > > based on mm-unstable):
> > > > 
> > > > [PATCH v3 0/8] mm/mprotect: Fix dax puds
> > > > https://lore.kernel.org/r/20240715192142.3241557-1-pet...@redhat.com
> > > > 
> > > > Overview
> > > > 
> > > > 
> > > > This series doesn't provide any feature change.  The only goal of this
> > > > series is to start decoupling two ideas: "THP" and "huge mapping".  We
> > > > already started with having PGTABLE_HAS_HUGE_LEAVES config option, and 
> > > > this
> > > > one extends that idea into the code.
> > > > 
> > > > The issue is that we have so many functions that only compile with
> > > > CONFIG_THP=on, even though they're about huge mappings, and huge 
> > > > mapping is
> > > > a pretty common concept, which can apply to many things besides THPs
> > > > nowadays.  The major THP file is mm/huge_memory.c as of now.
> > > > 
> > > > The first example of such huge mapping users will be hugetlb.  We lived
> > > > until now with no problem simply because Linux almost duplicated all the
> > > > logics there in the "THP" files into hugetlb APIs.  If we want to get 
> > > > rid
> > > > of hugetlb specific APIs and paths, this _might_ be the first thing we 
> > > > want
> > > > to do, because we want to be able to e.g., zapping a hugetlb pmd entry 
> > > > even
> > > > if !CONFIG_THP.
> > > > 
> > > > Then consider other things like dax / pfnmaps.  Dax can depend on THP, 
> > > > then
> > > > it'll naturally be able to use pmd/pud helpers, that's okay.  However 
> > > > is it
> > > > a must?  Do we also want to have every new pmd/pud mappings in the 
> > > > future
> > > > to depend on THP (like PFNMAP)?  My answer is no, but I'm open to 
> > > > opinions.
> > > > 
> > > > If anyone agrees with me that "huge mapping" (aka, PMD/PUD mappings that
> > > > are larger than PAGE_SIZE) is a more generic concept than THP, then I 
> > > > think
> > > > at some point we need to move the generic code out of THP code into a
> > > > common code base.
> > > > 
> > > > This is what this series does as a start.
> > > 
> > > Hi Peter!
> > > 
> > >  From a quick glimpse, patch #1-#4 do make sense independent of patch #5.
> > > 
> > > I am not so sure about all of the code movement in patch #5. If large 
> > > folios
> > > are the future, then likely huge_memory.c should simply be the home for 
> > > all
> > > that logic.
> > > 
> > > Maybe the goal should better be to compile huge_memory.c not only for THP,
> > > but also for other use cases that require that logic, and fence off all 
> > > THP
> > > specific stuff using #ifdef?
> > > 
> > > Not sure, though. But a lot of this code movements/churn might be 
> > > avoidable.
> > 
> > I'm fine using ifdefs in the current file, but IMHO it's a matter of
> > whether we want to keep huge_memory.c growing into even larger file, and
> > keep all large folio logics only in that file.  Currently it's ~4000 LOCs.
> 
> Depends on "how much" for sure. huge_memory.c is currently on place 12 of
> the biggest files in mm/. So there might not be immediate cause for action
> ... just yet :) [guess which file is on #2 :) ]

7821, hugetlb.c  
7602, vmscan.c  
7275, slub.c   
7072, page_alloc.c
6673, memory.c 
5402, memcontrol.c 
5239, shmem.c   
5155, vmalloc.c  
4419, filemap.c   
4060, mmap.c   
3882, huge_memory.c

IMHO a split is normally better than keeping everything in one file, but
yeah I'd confess the THP file isn't that bad compared to others..  And I'm
definitely surprised it's even out of the top ten.

> 
> > 
> > Normally I don't see this as much of a "code churn" category, because it
> > doesn't change the code itself but only moves things.  I personally also
> > prefer without code churns, but only in the case where there'll be tiny
> > little functional changes here and there without real benefit.
> > 
> > It's pretty unavoidable to me when one file grows too large and we'll need
> > to split, and in this case git doesn't have a good way to track such
> > movement..
> 
> Yes, that's what I mean.
> 
> I've been recently thinking if we should pursue a different direction:
> 
> Just as we recently relocated most follow_huge_* stuff into gup.c, likely we
> should rather look into moving copy_huge_pmd, change_huge_pmd, copy_huge_pmd
> ... into the files where they logically belong to.
> 
> In madvise.c, we've been doing that in some places already: For
> madvise_cold_or_pageout_pte_range() we inline the code, but not for
> madvise_free_huge_pmd().
> 
> pmd_trans_huge() would already compile to a NOP without
> CONFIG_TRANSPARENT_HUGEPAGE, but to make that code avoid most
> CONFIG_TRANSPARENT_HUGEPAGE, we'd need a couple more function stubs to
> make the compiler happy while still being able to compile that code out
> when not required.

Re: [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings

2024-07-23 Thread David Hildenbrand

On 22.07.24 17:31, Peter Xu wrote:

On Mon, Jul 22, 2024 at 03:29:43PM +0200, David Hildenbrand wrote:

On 18.07.24 00:02, Peter Xu wrote:

This is an RFC series, so not yet for merging.  Please don't be scared by
the code changes: most of them are code movements only.

This series is based on the dax mprotect fix series here (while that one is
based on mm-unstable):

[PATCH v3 0/8] mm/mprotect: Fix dax puds
https://lore.kernel.org/r/20240715192142.3241557-1-pet...@redhat.com

Overview


This series doesn't provide any feature change.  The only goal of this
series is to start decoupling two ideas: "THP" and "huge mapping".  We
already started with having PGTABLE_HAS_HUGE_LEAVES config option, and this
one extends that idea into the code.

The issue is that we have so many functions that only compile with
CONFIG_THP=on, even though they're about huge mappings, and huge mapping is
a pretty common concept, which can apply to many things besides THPs
nowadays.  The major THP file is mm/huge_memory.c as of now.

The first example of such huge mapping users will be hugetlb.  We lived
until now with no problem simply because Linux almost duplicated all the
logics there in the "THP" files into hugetlb APIs.  If we want to get rid
of hugetlb specific APIs and paths, this _might_ be the first thing we want
to do, because we want to be able to e.g., zapping a hugetlb pmd entry even
if !CONFIG_THP.

Then consider other things like dax / pfnmaps.  Dax can depend on THP, then
it'll naturally be able to use pmd/pud helpers, that's okay.  However is it
a must?  Do we also want to have every new pmd/pud mappings in the future
to depend on THP (like PFNMAP)?  My answer is no, but I'm open to opinions.

If anyone agrees with me that "huge mapping" (aka, PMD/PUD mappings that
are larger than PAGE_SIZE) is a more generic concept than THP, then I think
at some point we need to move the generic code out of THP code into a
common code base.

This is what this series does as a start.


Hi Peter!

 From a quick glimpse, patch #1-#4 do make sense independent of patch #5.

I am not so sure about all of the code movement in patch #5. If large folios
are the future, then likely huge_memory.c should simply be the home for all
that logic.

Maybe the goal should better be to compile huge_memory.c not only for THP,
but also for other use cases that require that logic, and fence off all THP
specific stuff using #ifdef?

Not sure, though. But a lot of this code movements/churn might be avoidable.


I'm fine using ifdefs in the current file, but IMHO it's a matter of
whether we want to keep huge_memory.c growing into even larger file, and
keep all large folio logics only in that file.  Currently it's ~4000 LOCs.


Depends on "how much" for sure. huge_memory.c is currently on place 12 
of the biggest files in mm/. So there might not be immediate cause for 
action ... just yet :) [guess which file is on #2 :) ]




Nornally I don't see this as much of a "code churn" category, because it
doesn't changes the code itself but only move things.  I personally also
prefer without code churns, but only in the case where there'll be tiny
little functional changes here and there without real benefit.

It's pretty unavoidable to me when one file grows too large and we'll need
to split, and in this case git doesn't have a good way to track such
movement..


Yes, that's what I mean.

I've been recently thinking if we should pursue a different direction:

Just as we recently relocated most follow_huge_* stuff into gup.c, 
likely we should rather look into moving copy_huge_pmd, change_huge_pmd, 
copy_huge_pmd ... into the files where they logically belong to.


In madvise.c, we've been doing that in some places already: For 
madvise_cold_or_pageout_pte_range() we inline the code, but not for 
madvise_free_huge_pmd().


pmd_trans_huge() would already compile to a NOP without 
CONFIG_TRANSPARENT_HUGEPAGE, but to make that code avoid most 
CONFIG_TRANSPARENT_HUGEPAGE, we'd need a couple more function stubs to 
make the compiler happy while still being able to compile that code out 
when not required.


The idea would be that e.g., pmd_leaf() would return "false" at compile 
time if no active configuration (THP, HUGETLB, ...) would be active. So 
we could just use pmd_leaf() similar to pmd_trans_huge() in relevant 
code and have the compiler optimize it all out without putting it into 
separate files.


That means, large folios and PMD/PUD mappings will become "more common" 
and better integrated, without the need to jump between files.


Just some thought about an alternative that would make sense to me.



Irrelevant of this, just to mention I think there's still one option that I
at least can make the huge pfnmap depends on THP again which shouldn't be a
huge deal (I don't have any use case that needs huge pfnmap but disable
THP, anyway..), so this series isn't an immediate concern to me for that
route.  But for a hugetlb rework this might be something we need to do,
because we simply can't make CONFIG_HUGETLB rely on CONFIG_THP..

Re: [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings

2024-07-22 Thread Peter Xu
On Mon, Jul 22, 2024 at 03:29:43PM +0200, David Hildenbrand wrote:
> On 18.07.24 00:02, Peter Xu wrote:
> > This is an RFC series, so not yet for merging.  Please don't be scared by
> > the code changes: most of them are code movements only.
> > 
> > This series is based on the dax mprotect fix series here (while that one is
> > based on mm-unstable):
> > 
> >[PATCH v3 0/8] mm/mprotect: Fix dax puds
> >https://lore.kernel.org/r/20240715192142.3241557-1-pet...@redhat.com
> > 
> > Overview
> > 
> > 
> > This series doesn't provide any feature change.  The only goal of this
> > series is to start decoupling two ideas: "THP" and "huge mapping".  We
> > already started with having PGTABLE_HAS_HUGE_LEAVES config option, and this
> > one extends that idea into the code.
> > 
> > The issue is that we have so many functions that only compile with
> > CONFIG_THP=on, even though they're about huge mappings, and huge mapping is
> > a pretty common concept, which can apply to many things besides THPs
> > nowadays.  The major THP file is mm/huge_memory.c as of now.
> > 
> > The first example of such huge mapping users will be hugetlb.  We lived
> > until now with no problem simply because Linux almost duplicated all the
> > logics there in the "THP" files into hugetlb APIs.  If we want to get rid
> > of hugetlb specific APIs and paths, this _might_ be the first thing we want
> > to do, because we want to be able to e.g., zapping a hugetlb pmd entry even
> > if !CONFIG_THP.
> > 
> > Then consider other things like dax / pfnmaps.  Dax can depend on THP, then
> > it'll naturally be able to use pmd/pud helpers, that's okay.  However is it
> > a must?  Do we also want to have every new pmd/pud mappings in the future
> > to depend on THP (like PFNMAP)?  My answer is no, but I'm open to opinions.
> > 
> > If anyone agrees with me that "huge mapping" (aka, PMD/PUD mappings that
> > are larger than PAGE_SIZE) is a more generic concept than THP, then I think
> > at some point we need to move the generic code out of THP code into a
> > common code base.
> > 
> > This is what this series does as a start.
> 
> Hi Peter!
> 
> From a quick glimpse, patch #1-#4 do make sense independent of patch #5.
> 
> I am not so sure about all of the code movement in patch #5. If large folios
> are the future, then likely huge_memory.c should simply be the home for all
> that logic.
> 
> Maybe the goal should better be to compile huge_memory.c not only for THP,
> but also for other use cases that require that logic, and fence off all THP
> specific stuff using #ifdef?
> 
> Not sure, though. But a lot of this code movements/churn might be avoidable.

I'm fine using ifdefs in the current file, but IMHO it's a matter of
whether we want to keep huge_memory.c growing into even larger file, and
keep all large folio logics only in that file.  Currently it's ~4000 LOCs.

Normally I don't see this as much of a "code churn" category, because it
doesn't change the code itself but only moves things.  I personally also
prefer without code churns, but only in the case where there'll be tiny
little functional changes here and there without real benefit.

It's pretty much unavoidable to me when one file grows too large and needs
to be split, and in this case git doesn't have a good way to track such
movement.

Independent of this, just to mention I think there's still one option: I can
at least make huge pfnmap depend on THP again, which shouldn't be a huge
deal (I don't have any use case that needs huge pfnmap with THP disabled,
anyway), so this series isn't an immediate concern to me for that route.
But for a hugetlb rework this might be something we need to do, because we
simply can't make CONFIG_HUGETLB rely on CONFIG_THP.

Thanks,

-- 
Peter Xu



Re: [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings

2024-07-22 Thread David Hildenbrand

On 18.07.24 00:02, Peter Xu wrote:

This is an RFC series, so not yet for merging.  Please don't be scared by
the code changes: most of them are code movements only.

This series is based on the dax mprotect fix series here (while that one is
based on mm-unstable):

   [PATCH v3 0/8] mm/mprotect: Fix dax puds
   https://lore.kernel.org/r/20240715192142.3241557-1-pet...@redhat.com

Overview
========

This series doesn't provide any feature change.  The only goal of this
series is to start decoupling two ideas: "THP" and "huge mapping".  We
already started with having PGTABLE_HAS_HUGE_LEAVES config option, and this
one extends that idea into the code.

The issue is that we have so many functions that only compile with
CONFIG_THP=on, even though they're about huge mappings, and huge mapping is
a pretty common concept, which can apply to many things besides THPs
nowadays.  The major THP file is mm/huge_memory.c as of now.

The first example of such huge mapping users will be hugetlb.  We lived
until now with no problem simply because Linux almost duplicated all the
logic there in the "THP" files into hugetlb APIs.  If we want to get rid
of hugetlb-specific APIs and paths, this _might_ be the first thing we want
to do, because we want to be able to, e.g., zap a hugetlb pmd entry even
if !CONFIG_THP.

Then consider other things like dax / pfnmaps.  Dax can depend on THP, and
then it'll naturally be able to use the pmd/pud helpers; that's okay.  However,
is it a must?  Do we also want every new pmd/pud mapping in the future
to depend on THP (like PFNMAP)?  My answer is no, but I'm open to opinions.

If anyone agrees with me that "huge mapping" (aka, PMD/PUD mappings that
are larger than PAGE_SIZE) is a more generic concept than THP, then I think
at some point we need to move the generic code out of THP code into a
common code base.

This is what this series does as a start.


Hi Peter!

From a quick glimpse, patches #1-#4 make sense independent of patch #5.

I am not so sure about all of the code movement in patch #5. If large 
folios are the future, then likely huge_memory.c should simply be the 
home for all that logic.


Maybe the better goal would be to compile huge_memory.c not only for 
THP, but also for other use cases that require that logic, and fence off 
all THP-specific stuff using #ifdef?


Not sure, though. But a lot of this code movement/churn might be avoidable.

--
Cheers,

David / dhildenb



[PATCH RFC 5/6] mm/huge_mapping: Create huge_mapping_pxx.c

2024-07-17 Thread Peter Xu
At some point, we need to decouple "huge mapping" from THP, for any non-THP
huge mappings in the future (hugetlb, pfnmap, etc.).  This is the first
step towards it.

Or rather, we already started to do this when the PGTABLE_HAS_HUGE_LEAVES
option was introduced: that was the first time Linux started to describe
LEAVEs rather than THPs when it comes to huge mappings.  Before that, almost
any huge mapping would have THP involved, like devmap.  Hugetlb is special
only because we duplicated the whole world there, but we also have a demand
to decouple that now.

Linux has long had huge_memory.c, which only compiles with THP enabled; I
wish it had been called thp.c from the start.  In reality, it contains more
than THP processing: any huge mapping (even if not falling into the THP
category) can leverage many of these helpers, but unfortunately this file
is not compiled if !THP.  These helpers are normally only about pgtable
operations, which may not be directly relevant to the type of huge folio
(e.g. THP) underneath, or perhaps even whether there's a vmemmap to back it.

It's better that we move them out of the THP world.

Create a new set of files, huge_mapping_p[mu]d.c.  This patch starts to move
quite a few essential helpers from huge_memory.c into these new files, so
that they compile depending on PGTABLE_HAS_PXX_LEAVES rather than THP.
They are split into two files by nature so that e.g. archs that only
support PMD huge mappings can avoid compiling the whole -pud file, with the
hope of reducing the size of the object compiled and linked.

No functional change intended, only code movement.  That said, there will be
some "ifdef" machinery changes to keep all configurations compiling.

Cc: Jason Gunthorpe 
Cc: Matthew Wilcox 
Cc: Oscar Salvador 
Signed-off-by: Peter Xu 
---
 include/linux/huge_mm.h |  318 +---
 include/linux/pgtable.h |   23 +-
 include/trace/events/huge_mapping.h |   41 +
 include/trace/events/thp.h  |   28 -
 mm/Makefile |2 +
 mm/huge_mapping_pmd.c   |  979 +++
 mm/huge_mapping_pud.c   |  235 ++
 mm/huge_memory.c| 1125 +--
 8 files changed, 1472 insertions(+), 1279 deletions(-)
 create mode 100644 include/trace/events/huge_mapping.h
 create mode 100644 mm/huge_mapping_pmd.c
 create mode 100644 mm/huge_mapping_pud.c

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index d8b642ad512d..aea2784df8ef 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -8,43 +8,214 @@
 #include  /* only for vma_is_dax() */
 #include 
 
+#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
+void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
 void touch_pud(struct vm_area_struct *vma, unsigned long addr,
   pud_t *pud, bool write);
-void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
-  pmd_t *pmd, bool write);
-pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
-vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
-int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
- pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
- struct vm_area_struct *dst_vma, struct vm_area_struct 
*src_vma);
-void huge_pmd_set_accessed(struct vm_fault *vmf);
 int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
  struct vm_area_struct *vma);
+int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t 
*pud,
+unsigned long addr);
+int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
+   pud_t *pudp, unsigned long addr, pgprot_t newprot,
+   unsigned long cp_flags);
+void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
+   unsigned long address);
+spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma);
 
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
-#else
-static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
+static inline spinlock_t *
+pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
 {
+   if (pud_trans_huge(*pud) || pud_devmap(*pud))
+   return __pud_trans_huge_lock(pud, vma);
+   else
+   return NULL;
 }
-#endif
 
+#define split_huge_pud(__vma, __pud, __address)
\
+   do {\
+   pud_t *pud = (__pud);   \
+   if (pud_trans_huge(*pud) || pud_devmap(*pud))   \
+   __split_huge_pud(__vma, __pud, __address);  \
+   }  while (0)
+#else  /* CONFIG_PGTABLE_HAS_PUD_LEAVES */
+static inline void
+huge_pud_set_accessed(

[PATCH RFC 6/6] mm: Convert "*_trans_huge() || *_devmap()" to use *_leaf()

2024-07-17 Thread Peter Xu
This patch converts all such checks into a single *_leaf() check under
common mm/, as "thp+devmap" should compose everything for a *_leaf() for
now.  I didn't yet touch arch code in other directories, as some archs may
need special attention, so I left those for separate patches.

It should start to save some cycles on such checks and pave the way for new
leaf types.  E.g., when a new type of leaf is introduced, it'll naturally
take the same route as what we have now for thp+devmap.

One issue with the pxx_leaf() API is that it is defined by arch code
without considering the kernel config.  For example, the "if" branch below
cannot be automatically optimized out:

  if (pmd_leaf()) { ... }

even if both THP and HUGETLB are disabled (which means pmd_leaf() can
never return true).

To give compilers a chance to optimize and omit code when possible,
introduce light wrappers called pxx_is_leaf().  These take the kernel
config into account and allow branches to be omitted when the compiler
knows they constantly return false.  This mimics what we used to have with
pxx_trans_huge() when !THP, now also applied to the pxx_leaf() API.

Cc: Alistair Popple 
Cc: Dan Williams 
Cc: Jason Gunthorpe 
Signed-off-by: Peter Xu 
---
 include/linux/huge_mm.h|  6 +++---
 include/linux/pgtable.h| 30 +-
 mm/hmm.c   |  4 ++--
 mm/huge_mapping_pmd.c  |  9 +++--
 mm/huge_mapping_pud.c  |  6 +++---
 mm/mapping_dirty_helpers.c |  4 ++--
 mm/memory.c| 14 ++
 mm/migrate_device.c|  2 +-
 mm/mprotect.c  |  4 ++--
 mm/mremap.c|  5 ++---
 mm/page_vma_mapped.c   |  5 ++---
 mm/pgtable-generic.c   |  7 +++
 12 files changed, 58 insertions(+), 38 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index aea2784df8ef..a5b026d0731e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -27,7 +27,7 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct 
vm_area_struct *vma);
 static inline spinlock_t *
 pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
 {
-   if (pud_trans_huge(*pud) || pud_devmap(*pud))
+   if (pud_is_leaf(*pud))
return __pud_trans_huge_lock(pud, vma);
else
return NULL;
@@ -36,7 +36,7 @@ pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
 #define split_huge_pud(__vma, __pud, __address)
\
do {\
pud_t *pud = (__pud);   \
-   if (pud_trans_huge(*pud) || pud_devmap(*pud))   \
+   if (pud_is_leaf(*pud))  \
__split_huge_pud(__vma, __pud, __address);  \
}  while (0)
 #else  /* CONFIG_PGTABLE_HAS_PUD_LEAVES */
@@ -125,7 +125,7 @@ static inline int is_swap_pmd(pmd_t pmd)
 static inline spinlock_t *
 pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 {
-   if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+   if (is_swap_pmd(*pmd) || pmd_is_leaf(*pmd))
return __pmd_trans_huge_lock(pmd, vma);
else
return NULL;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5e505373b113..af7709a132aa 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1641,7 +1641,7 @@ static inline int pud_trans_unstable(pud_t *pud)
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
pud_t pudval = READ_ONCE(*pud);
 
-   if (pud_none(pudval) || pud_trans_huge(pudval) || pud_devmap(pudval))
+   if (pud_none(pudval) || pud_leaf(pudval))
return 1;
if (unlikely(pud_bad(pudval))) {
pud_clear_bad(pud);
@@ -1901,6 +1901,34 @@ typedef unsigned int pgtbl_mod_mask;
 #define pmd_leaf(x)false
 #endif
 
+/*
+ * Wrapper of pxx_leaf() helpers.
+ *
+ * Comparing to pxx_leaf() API, the only difference is: using these macros
+ * can help code generation, so unnecessary code can be omitted when the
+ * specific level of leaf is not possible due to kernel config.  It is
+ * needed because normally pxx_leaf() can be defined in arch code without
+ * knowing the kernel config.
+ *
+ * Currently we only need pmd/pud versions, because the largest leaf Linux
+ * supports so far is pud.
+ *
+ * Defining here also means that in arch's pgtable headers these macros
+ * cannot be used, pxx_leaf()s need to be used instead, because this file
+ * will not be included in arch's pgtable headers.
+ */
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
+#define pmd_is_leaf(x)  pmd_leaf(x)
+#else
+#define pmd_is_leaf(x)  false
+#endif
+
+#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
+#define pud_is_leaf(x)  pud_leaf(x)
+#else
+#define pud_is_leaf(x)  false
+#endif
+
 #ifndef pgd_leaf_size
 #define pgd_leaf_size(x) (1ULL << PGDIR_SHIFT)
 #e

[PATCH RFC 4/6] mm: Move huge mapping declarations from internal.h to huge_mm.h

2024-07-17 Thread Peter Xu
Most of the huge-mapping-relevant helpers are declared in huge_mm.h, not
internal.h.  Move the remaining few from internal.h into huge_mm.h.

To move pmd_needs_soft_dirty_wp() over, we also need to move
vma_soft_dirty_enabled() into mm.h, as it will be needed in two headers
later (internal.h, huge_mm.h).

Signed-off-by: Peter Xu 
---
 include/linux/huge_mm.h | 10 ++
 include/linux/mm.h  | 18 ++
 mm/internal.h   | 33 -
 3 files changed, 28 insertions(+), 33 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 37482c8445d1..d8b642ad512d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -8,6 +8,11 @@
 #include  /* only for vma_is_dax() */
 #include 
 
+void touch_pud(struct vm_area_struct *vma, unsigned long addr,
+  pud_t *pud, bool write);
+void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
+  pmd_t *pmd, bool write);
+pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
@@ -629,4 +634,9 @@ static inline int split_folio_to_order(struct folio *folio, 
int new_order)
 #define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
 #define split_folio(f) split_folio_to_order(f, 0)
 
+static inline bool pmd_needs_soft_dirty_wp(struct vm_area_struct *vma, pmd_t 
pmd)
+{
+   return vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd);
+}
+
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5f1075d19600..fa10802d8faa 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1117,6 +1117,24 @@ static inline unsigned int folio_order(struct folio 
*folio)
return folio->_flags_1 & 0xff;
 }
 
+static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma)
+{
+   /*
+* NOTE: we must check this before VM_SOFTDIRTY on soft-dirty
+* enablements, because when without soft-dirty being compiled in,
+* VM_SOFTDIRTY is defined as 0x0, then !(vm_flags & VM_SOFTDIRTY)
+* will be constantly true.
+*/
+   if (!IS_ENABLED(CONFIG_MEM_SOFT_DIRTY))
+   return false;
+
+   /*
+* Soft-dirty is kind of special: its tracking is enabled when the
+* vma flags not set.
+*/
+   return !(vma->vm_flags & VM_SOFTDIRTY);
+}
+
 #include 
 
 /*
diff --git a/mm/internal.h b/mm/internal.h
index b4d86436565b..e49941747749 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -917,8 +917,6 @@ bool need_mlock_drain(int cpu);
 void mlock_drain_local(void);
 void mlock_drain_remote(int cpu);
 
-extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
-
 /**
  * vma_address - Find the virtual address a page range is mapped at
  * @vma: The vma which maps this object.
@@ -1229,14 +1227,6 @@ int migrate_device_coherent_page(struct page *page);
 int __must_check try_grab_folio(struct folio *folio, int refs,
unsigned int flags);
 
-/*
- * mm/huge_memory.c
- */
-void touch_pud(struct vm_area_struct *vma, unsigned long addr,
-  pud_t *pud, bool write);
-void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
-  pmd_t *pmd, bool write);
-
 /*
  * mm/mmap.c
  */
@@ -1342,29 +1332,6 @@ static __always_inline void vma_set_range(struct 
vm_area_struct *vma,
vma->vm_pgoff = pgoff;
 }
 
-static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma)
-{
-   /*
-* NOTE: we must check this before VM_SOFTDIRTY on soft-dirty
-* enablements, because when without soft-dirty being compiled in,
-* VM_SOFTDIRTY is defined as 0x0, then !(vm_flags & VM_SOFTDIRTY)
-* will be constantly true.
-*/
-   if (!IS_ENABLED(CONFIG_MEM_SOFT_DIRTY))
-   return false;
-
-   /*
-* Soft-dirty is kind of special: its tracking is enabled when the
-* vma flags not set.
-*/
-   return !(vma->vm_flags & VM_SOFTDIRTY);
-}
-
-static inline bool pmd_needs_soft_dirty_wp(struct vm_area_struct *vma, pmd_t 
pmd)
-{
-   return vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd);
-}
-
 static inline bool pte_needs_soft_dirty_wp(struct vm_area_struct *vma, pte_t 
pte)
 {
return vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte);
-- 
2.45.0



[PATCH RFC 3/6] mm/treewide: Make pgtable-generic.c THP agnostic

2024-07-17 Thread Peter Xu
Make the pmd/pud helpers rely on the new PGTABLE_HAS_*_LEAVES options,
rather than on THP alone, as THP is only one form of huge mapping.

Signed-off-by: Peter Xu 
---
 arch/arm64/include/asm/pgtable.h |  6 ++--
 arch/powerpc/include/asm/book3s/64/pgtable.h |  2 +-
 arch/powerpc/mm/book3s64/pgtable.c   |  2 +-
 arch/riscv/include/asm/pgtable.h |  4 +--
 arch/s390/include/asm/pgtable.h  |  2 +-
 arch/s390/mm/pgtable.c   |  4 +--
 arch/sparc/mm/tlb.c  |  2 +-
 arch/x86/mm/pgtable.c| 15 -
 include/linux/mm_types.h |  2 +-
 include/linux/pgtable.h  |  4 +--
 mm/memory.c  |  2 +-
 mm/pgtable-generic.c | 32 ++--
 12 files changed, 40 insertions(+), 37 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 5d5d1b18b837..b93c03256ada 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1105,7 +1105,7 @@ extern int __ptep_set_access_flags(struct vm_area_struct 
*vma,
 unsigned long address, pte_t *ptep,
 pte_t entry, int dirty);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
 #define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
 static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp,
@@ -1114,7 +1114,9 @@ static inline int pmdp_set_access_flags(struct 
vm_area_struct *vma,
return __ptep_set_access_flags(vma, address, (pte_t *)pmdp,
pmd_pte(entry), dirty);
 }
+#endif
 
+#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
 static inline int pud_devmap(pud_t pud)
 {
return 0;
@@ -1178,7 +1180,7 @@ static inline int __ptep_clear_flush_young(struct 
vm_area_struct *vma,
return young;
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
 #define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
 static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long address,
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 051b1b6d729c..84cf55e18334 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1119,7 +1119,7 @@ static inline bool pmd_access_permitted(pmd_t pmd, bool 
write)
return pte_access_permitted(pmd_pte(pmd), write);
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
 extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
 extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot);
 extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index 5a4a75369043..d6a5457627df 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -37,7 +37,7 @@ EXPORT_SYMBOL(__pmd_frag_nr);
 unsigned long __pmd_frag_size_shift;
 EXPORT_SYMBOL(__pmd_frag_size_shift);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 /*
  * This is called when relaxing access to a hugepage. It's also called in the 
page
  * fault path when we don't hit any of the major fault cases, ie, a minor
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index ebfe8faafb79..8c28f15f601b 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -752,7 +752,7 @@ static inline bool pud_user_accessible_page(pud_t pud)
 }
 #endif
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
 static inline int pmd_trans_huge(pmd_t pmd)
 {
return pmd_leaf(pmd);
@@ -802,7 +802,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct 
*vma,
 #define pmdp_collapse_flush pmdp_collapse_flush
 extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 unsigned long address, pmd_t *pmdp);
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_PGTABLE_HAS_PMD_LEAVES */
 
 /*
  * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index fb6870384b97..398bbed20dee 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1710,7 +1710,7 @@ pmd_t pmdp_xchg_direct(struct mm_struct *, unsigned long, 
pmd_t *, pmd_t);
 pmd_t pmdp_xchg_lazy(struct mm_struct *, unsigned long, pmd_t *, pmd_t);
 pud_t pudp_xchg_direct(struct mm_struct *, unsigned long, pud_t *, pud_t);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
 
 #define __HAVE_ARCH_PGTABLE_DEPOSIT
 void pgtable_trans_huge_deposit(struct mm_st

[PATCH RFC 1/6] mm/treewide: Remove pgd_devmap()

2024-07-17 Thread Peter Xu
It's always 0 on all archs, and there's no sign that p4d entries will be
supported in the near future.  Remove it until it's needed for real.

Signed-off-by: Peter Xu 
---
 arch/arm64/include/asm/pgtable.h | 5 -
 arch/powerpc/include/asm/book3s/64/pgtable.h | 5 -
 arch/x86/include/asm/pgtable.h   | 5 -
 include/linux/pgtable.h  | 4 
 mm/gup.c | 2 --
 5 files changed, 21 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index f8efbc128446..5d5d1b18b837 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1119,11 +1119,6 @@ static inline int pud_devmap(pud_t pud)
 {
return 0;
 }
-
-static inline int pgd_devmap(pgd_t pgd)
-{
-   return 0;
-}
 #endif
 
 #ifdef CONFIG_PAGE_TABLE_CHECK
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 5da92ba68a45..051b1b6d729c 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1431,11 +1431,6 @@ static inline int pud_devmap(pud_t pud)
 {
return pte_devmap(pud_pte(pud));
 }
-
-static inline int pgd_devmap(pgd_t pgd)
-{
-   return 0;
-}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 701593c53f3b..0d234f48ceeb 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -311,11 +311,6 @@ static inline int pud_devmap(pud_t pud)
return 0;
 }
 #endif
-
-static inline int pgd_devmap(pgd_t pgd)
-{
-   return 0;
-}
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 2289e9f7aa1b..0a904300ac90 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1626,10 +1626,6 @@ static inline int pud_devmap(pud_t pud)
 {
return 0;
 }
-static inline int pgd_devmap(pgd_t pgd)
-{
-   return 0;
-}
 #endif
 
 #if !defined(CONFIG_TRANSPARENT_HUGEPAGE) || \
diff --git a/mm/gup.c b/mm/gup.c
index 54d0dc3831fb..b023bcd38235 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3149,8 +3149,6 @@ static int gup_fast_pgd_leaf(pgd_t orig, pgd_t *pgdp, 
unsigned long addr,
if (!pgd_access_permitted(orig, flags & FOLL_WRITE))
return 0;
 
-   BUILD_BUG_ON(pgd_devmap(orig));
-
page = pgd_page(orig);
refs = record_subpages(page, PGDIR_SIZE, addr, end, pages + *nr);
 
-- 
2.45.0



[PATCH RFC 2/6] mm: PGTABLE_HAS_P[MU]D_LEAVES config options

2024-07-17 Thread Peter Xu
Introduce two more sub-options for PGTABLE_HAS_HUGE_LEAVES:

  - PGTABLE_HAS_PMD_LEAVES: set when there can be PMD mappings
  - PGTABLE_HAS_PUD_LEAVES: set when there can be PUD mappings

It will help identify whether the current build may want only PMD helpers
but not PUD ones, as these sub-options also check the arch support via
HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD].

Note that having them depend on HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD] is
still an intermediate step.  The best way would be an option saying
"arch XXX supports PMD/PUD mappings" and so on.  However, let's leave that
for later, as that's the easy part.  For now, we use these options to
reliably detect per-arch huge mapping support.

Signed-off-by: Peter Xu 
---
 include/linux/huge_mm.h | 10 +++---
 mm/Kconfig  |  6 ++
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 711632df7edf..37482c8445d1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -96,14 +96,18 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
 
-#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
-#define HPAGE_PMD_SHIFT PMD_SHIFT
+#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
 #define HPAGE_PUD_SHIFT PUD_SHIFT
 #else
-#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; })
 #endif
 
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
+#define HPAGE_PMD_SHIFT PMD_SHIFT
+#else
+#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
+#endif
+
 #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
 #define HPAGE_PMD_NR (1<

[PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings

2024-07-17 Thread Peter Xu
This is an RFC series, so not yet for merging.  Please don't be scared by
the code changes: most of them are code movements only.

This series is based on the dax mprotect fix series here (while that one is
based on mm-unstable):

  [PATCH v3 0/8] mm/mprotect: Fix dax puds
  https://lore.kernel.org/r/20240715192142.3241557-1-pet...@redhat.com

Overview
========

This series doesn't provide any feature change.  The only goal of this
series is to start decoupling two ideas: "THP" and "huge mapping".  We
already started with having PGTABLE_HAS_HUGE_LEAVES config option, and this
one extends that idea into the code.

The issue is that we have so many functions that only compile with
CONFIG_THP=on, even though they're about huge mappings, and huge mapping is
a pretty common concept, which can apply to many things besides THPs
nowadays.  The major THP file is mm/huge_memory.c as of now.

The first example of such huge mapping users will be hugetlb.  We lived
until now with no problem simply because Linux almost duplicated all the
logic there in the "THP" files into hugetlb APIs.  If we want to get rid
of hugetlb-specific APIs and paths, this _might_ be the first thing we want
to do, because we want to be able to, e.g., zap a hugetlb pmd entry even
if !CONFIG_THP.

Then consider other things like dax / pfnmaps.  Dax can depend on THP, and
then it'll naturally be able to use the pmd/pud helpers; that's okay.  However,
is it a must?  Do we also want every new pmd/pud mapping in the future
to depend on THP (like PFNMAP)?  My answer is no, but I'm open to opinions.

If anyone agrees with me that "huge mapping" (aka, PMD/PUD mappings that
are larger than PAGE_SIZE) is a more generic concept than THP, then I think
at some point we need to move the generic code out of THP code into a
common code base.

This is what this series does as a start.

In general, this series tries to move many THP things (mostly residing in
huge_memory.c right now) into two new files: huge_mapping_{pmd|pud}.c.
When moving them out, I also split them into different files by layer.
Then if an arch supports e.g. only PMD, it can avoid compiling the PUD
helpers, with things like:

CONFIG_PGTABLE_HAS_PUD_LEAVES=n
obj-$(CONFIG_PGTABLE_HAS_PUD_LEAVES) += huge_mapping_pud.o

Note that there are a few tree-wide changes in arch/, but not a lot; to
avoid disturbing too many people, I only copied the open lists of each
arch, not yet the arch maintainers.

Tests
=====

My usual 19-arch cross-compilation tests pass with it, and I smoke tested
on x86_64 with a local config of mine.

Comments welcomed, thanks.

Peter Xu (6):
  mm/treewide: Remove pgd_devmap()
  mm: PGTABLE_HAS_P[MU]D_LEAVES config options
  mm/treewide: Make pgtable-generic.c THP agnostic
  mm: Move huge mapping declarations from internal.h to huge_mm.h
  mm/huge_mapping: Create huge_mapping_pxx.c
  mm: Convert "*_trans_huge() || *_devmap()" to use *_leaf()

 arch/arm64/include/asm/pgtable.h |   11 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h |7 +-
 arch/powerpc/mm/book3s64/pgtable.c   |2 +-
 arch/riscv/include/asm/pgtable.h |4 +-
 arch/s390/include/asm/pgtable.h  |2 +-
 arch/s390/mm/pgtable.c   |4 +-
 arch/sparc/mm/tlb.c  |2 +-
 arch/x86/include/asm/pgtable.h   |5 -
 arch/x86/mm/pgtable.c|   15 +-
 include/linux/huge_mm.h  |  332 --
 include/linux/mm.h   |   18 +
 include/linux/mm_types.h |2 +-
 include/linux/pgtable.h  |   61 +-
 include/trace/events/huge_mapping.h  |   41 +
 include/trace/events/thp.h   |   28 -
 mm/Kconfig   |6 +
 mm/Makefile  |2 +
 mm/gup.c |2 -
 mm/hmm.c |4 +-
 mm/huge_mapping_pmd.c|  976 +++
 mm/huge_mapping_pud.c|  235 
 mm/huge_memory.c | 1125 +-
 mm/internal.h|   33 -
 mm/mapping_dirty_helpers.c   |4 +-
 mm/memory.c  |   16 +-
 mm/migrate_device.c  |2 +-
 mm/mprotect.c|4 +-
 mm/mremap.c  |5 +-
 mm/page_vma_mapped.c |5 +-
 mm/pgtable-generic.c |   37 +-
 30 files changed, 1595 insertions(+), 1395 deletions(-)
 create mode 100644 include/trace/events/huge_mapping.h
 create mode 100644 mm/huge_mapping_pmd.c
 create mode 100644 mm/huge_mapping_pud.c

-- 
2.45.0



Re: [PATCH RFC v2 4/5] dma-direct: add base offset to zone_dma_bits

2024-06-18 Thread Catalin Marinas
On Tue, Apr 09, 2024 at 09:17:57AM +0300, Baruch Siach wrote:
> Current code using zone_dma_bits assumes that all addresses in the
> bit-mask range are suitable for DMA. For some existing platforms this
> assumption is not correct. The DMA range might have a non-zero lower limit.
[...]
> @@ -59,7 +60,7 @@ static gfp_t dma_direct_optimal_gfp_mask(struct device 
> *dev, u64 *phys_limit)
>* zones.
>*/
>   *phys_limit = dma_to_phys(dev, dma_limit);
> - if (*phys_limit <= zone_dma_limit)
> + if (*phys_limit <= zone_dma_base + zone_dma_limit)
>   return GFP_DMA;
>   if (*phys_limit <= DMA_BIT_MASK(32))
>   return GFP_DMA32;

As I said previously, we no longer have zone_dma_bits after the first
patch, so adding this limit no longer makes sense. In v1, you wanted a
limit like 32G to be added to the 30-bit zone_dma_bits to give you a 33G
upper limit for ZONE_DMA. But since the first patch sets zone_dma_limit
to 33G already, this is no longer needed.

-- 
Catalin


Re: [PATCH RFC v2 2/5] of: get dma area lower limit

2024-06-18 Thread Catalin Marinas
On Tue, Apr 09, 2024 at 09:17:55AM +0300, Baruch Siach wrote:
> of_dma_get_max_cpu_address() returns the highest CPU address that
> devices can use for DMA. The implicit assumption is that all CPU
> addresses below that limit are suitable for DMA. However, the
> 'dma-ranges' property this code uses also encodes a lower limit for DMA
> that is potentially non-zero.
> 
> Rename it to of_dma_get_cpu_limits(), and extend it to also retrieve the
> lower limit from the same 'dma-ranges' property that describes the upper limit.

I don't understand the reason for the lower limit. The way the Linux
zones work is that ZONE_DMA always starts from the start of the RAM. It
doesn't matter whether it's 0 or not, you'd not allocate below the start
of RAM anyway. If you have a device that cannot use the bottom of the
RAM, it is pretty broken and not supported by Linux.

I think you added this limit before we tried to move away from
zone_dma_bits to a non-power-of-two limit (zone_dma_limit). With the
latter, we no longer need tricks with the lower limit;
of_dma_get_max_cpu_address() should capture the smallest upper CPU
address limit supported by all devices (and that's where ZONE_DMA should
end).

-- 
Catalin


Re: [PATCH RFC v2 1/5] dma-mapping: replace zone_dma_bits by zone_dma_limit

2024-06-18 Thread Catalin Marinas
(finally getting around to looking at this series, sorry for the delay)

On Tue, Apr 09, 2024 at 09:17:54AM +0300, Baruch Siach wrote:
> From: Catalin Marinas 
> 
> Hardware DMA limit might not be power of 2. When RAM range starts above
> 0, say 4GB, DMA limit of 30 bits should end at 5GB. A single high bit
> can not encode this limit.
> 
> Use direct phys_addr_t limit address for DMA zone limit.
> 
> Following commits will add explicit base address to DMA zone.
> 
> ---
> Catalin,
> 
> This is taken almost verbatim from your email:
> 
>   https://lore.kernel.org/all/zz2hnhjv3gdzu...@arm.com/
> 
> Would you provide your sign-off?

Signed-off-by: Catalin Marinas 

Thanks for writing a commit log. However, I think more work is needed.
See below.

> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 03efd86dce0a..00508c69ca9e 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -113,36 +113,24 @@ static void __init arch_reserve_crashkernel(void)
>   low_size, high);
>  }
>  
> -/*
> - * Return the maximum physical address for a zone accessible by the given 
> bits
> - * limit. If DRAM starts above 32-bit, expand the zone to the maximum
> - * available memory, otherwise cap it at 32-bit.
> - */
> -static phys_addr_t __init max_zone_phys(unsigned int zone_bits)
> +static phys_addr_t __init max_zone_phys(phys_addr_t zone_limit)
>  {
> - phys_addr_t zone_mask = DMA_BIT_MASK(zone_bits);
> - phys_addr_t phys_start = memblock_start_of_DRAM();
> -
> - if (phys_start > U32_MAX)
> - zone_mask = PHYS_ADDR_MAX;
> - else if (phys_start > zone_mask)
> - zone_mask = U32_MAX;
> -
> - return min(zone_mask, memblock_end_of_DRAM() - 1) + 1;
> + return min(zone_limit, memblock_end_of_DRAM() - 1) + 1;
>  }
>  
>  static void __init zone_sizes_init(void)
>  {
>   unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};
> - unsigned int __maybe_unused acpi_zone_dma_bits;
> - unsigned int __maybe_unused dt_zone_dma_bits;
> - phys_addr_t __maybe_unused dma32_phys_limit = max_zone_phys(32);
> + phys_addr_t __maybe_unused acpi_zone_dma_limit;
> + phys_addr_t __maybe_unused dt_zone_dma_limit;
> + phys_addr_t __maybe_unused dma32_phys_limit =
> + max_zone_phys(DMA_BIT_MASK(32));
>  
>  #ifdef CONFIG_ZONE_DMA
> - acpi_zone_dma_bits = fls64(acpi_iort_dma_get_max_cpu_address());
> - dt_zone_dma_bits = fls64(of_dma_get_max_cpu_address(NULL));
> - zone_dma_bits = min3(32U, dt_zone_dma_bits, acpi_zone_dma_bits);
> - arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
> + acpi_zone_dma_limit = acpi_iort_dma_get_max_cpu_address();
> + dt_zone_dma_limit = of_dma_get_max_cpu_address(NULL);
> + zone_dma_limit = min(dt_zone_dma_limit, acpi_zone_dma_limit);
> + arm64_dma_phys_limit = max_zone_phys(zone_dma_limit);
>   max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
>  #endif
>  #ifdef CONFIG_ZONE_DMA32

I think this goes wrong if zone_dma_limit ends up above 32-bit (e.g. no
restrictive dma-ranges properties) but the start of RAM is below 4G.
We'd simply reduce ZONE_DMA32 to zero, with ZONE_DMA potentially covering
the whole RAM. Prior to this change, we capped zone_dma_bits to 32 via
min3(). I think we should maintain this cap if memblock_start_of_DRAM()
is below 4G.

We could fix this up in max_zone_phys() above:

	if (memblock_start_of_DRAM() < U32_MAX)
		zone_limit = min(U32_MAX, zone_limit);

	return min(zone_limit, memblock_end_of_DRAM() - 1) + 1;

> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 4d543b1e9d57..3b2ebcd4f576 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -20,7 +20,7 @@
>   * it for entirely different regions. In that case the arch code needs to
>   * override the variable below for dma-direct to work properly.
>   */
> -unsigned int zone_dma_bits __ro_after_init = 24;
> +phys_addr_t zone_dma_limit __ro_after_init = DMA_BIT_MASK(24);
>  
>  static inline dma_addr_t phys_to_dma_direct(struct device *dev,
>   phys_addr_t phys)
> @@ -59,7 +59,7 @@ static gfp_t dma_direct_optimal_gfp_mask(struct device 
> *dev, u64 *phys_limit)
>* zones.
>*/
>   *phys_limit = dma_to_phys(dev, dma_limit);
> - if (*phys_limit <= DMA_BIT_MASK(zone_dma_bits))
> + if (*phys_limit <= zone_dma_limit)
>   return GFP_DMA;
>   if (*phys_limit <= DMA_BIT_MASK(32))
>   return GFP_DMA32;

It's worth noting that if ZONE_DMA ends up entirely above 32-bit, there
won't be any ZONE_DMA32. Thinking about it, this could be a potential
problem. For example, if a device has a 32-bit DMA mask and an offset
that lifts this into the 32-36G range, the above may fail to set
GFP_DMA32.

Actually, I think these checks can go wrong even with the current
implementation, assuming RAM below 4G and no DMA offsets. For example,
we have two devices, one with a coherent ma

Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings

2024-06-04 Thread Conor Dooley
On Tue, Jun 04, 2024 at 01:44:15PM +0200, Alexandre Ghiti wrote:
> On Tue, Jun 4, 2024 at 10:52 AM Conor Dooley  wrote:
> >
> > On Tue, Jun 04, 2024 at 09:17:26AM +0200, Alexandre Ghiti wrote:
> > > On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti  
> > > wrote:
> > > > On Tue, Jun 4, 2024 at 8:21 AM yunhui cui  
> > > > wrote:
> > > > >
> > > > > As for the current status of the patch, there are two points that can
> > > > > be optimized:
> > > > > 1. Some chip hardware implementations may not cache TLB invalid
> > > > > entries, so it doesn't matter whether svvptc is available or not. Can
> > > > > we consider adding a CONFIG_RISCV_SVVPTC to control it?
> > >
> > > That would produce a non-portable kernel. But I'm not opposed to that
> > > at all, let me check how we handle other extensions. Maybe @Conor
> > > Dooley has some feedback here?
> >
> > To be honest, not really sure what to give feedback on. Could you
> > elaborate on exactly what the option is going to do? Given the
> > portability concern, I guess you were proposing that the option would
> > remove the preventative fences, rather than your current patch that
> > removes them via an alternative?
> 
> No no, I won't do that, we need a generic kernel for distros so that's
> not even a question. What Yunhui was asking about (to me) is: can we
> introduce a Kconfig option to always remove the preventive fences,
> bypassing the use of alternatives altogether?
> 
> To me, it won't make a difference in terms of performance. But if we
> already offer such a possibility for other extensions, well I'll do
> it. Otherwise, the question is: should we start doing that?

We don't do that for other extensions yet, because all the extensions we
currently have options for are additive. There are only about three
alternative patch sites, and each is just a single nop. I don't see the
point of having a Kconfig knob for that.


signature.asc
Description: PGP signature


Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings

2024-06-04 Thread Alexandre Ghiti
On Tue, Jun 4, 2024 at 10:52 AM Conor Dooley  wrote:
>
> On Tue, Jun 04, 2024 at 09:17:26AM +0200, Alexandre Ghiti wrote:
> > On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti  
> > wrote:
> > > On Tue, Jun 4, 2024 at 8:21 AM yunhui cui  wrote:
> > > >
> > > > As for the current status of the patch, there are two points that can
> > > > be optimized:
> > > > 1. Some chip hardware implementations may not cache TLB invalid
> > > > entries, so it doesn't matter whether svvptc is available or not. Can
> > > > we consider adding a CONFIG_RISCV_SVVPTC to control it?
> >
> > That would produce a non-portable kernel. But I'm not opposed to that
> > at all, let me check how we handle other extensions. Maybe @Conor
> > Dooley has some feedback here?
>
> To be honest, not really sure what to give feedback on. Could you
> elaborate on exactly what the option is going to do? Given the
> portability concern, I guess you were proposing that the option would
> remove the preventative fences, rather than your current patch that
> removes them via an alternative?

No no, I won't do that, we need a generic kernel for distros so that's
not even a question. What Yunhui was asking about (to me) is: can we
introduce a Kconfig option to always remove the preventive fences,
bypassing the use of alternatives altogether?

To me, it won't make a difference in terms of performance. But if we
already offer such a possibility for other extensions, well I'll do
it. Otherwise, the question is: should we start doing that?

> I don't think we have any extension
> related options that work like that at the moment, and making that an
> option will just mean that distros that look to cater for multiple
> platforms won't be able to turn it on.
>
> Thanks,
> Conor.


Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings

2024-06-04 Thread Conor Dooley
On Tue, Jun 04, 2024 at 09:17:26AM +0200, Alexandre Ghiti wrote:
> On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti  wrote:
> > On Tue, Jun 4, 2024 at 8:21 AM yunhui cui  wrote:
> > >
> > > As for the current status of the patch, there are two points that can
> > > be optimized:
> > > 1. Some chip hardware implementations may not cache TLB invalid
> > > entries, so it doesn't matter whether svvptc is available or not. Can
> > > we consider adding a CONFIG_RISCV_SVVPTC to control it?
> 
> That would produce a non-portable kernel. But I'm not opposed to that
> at all, let me check how we handle other extensions. Maybe @Conor
> Dooley has some feedback here?

To be honest, not really sure what to give feedback on. Could you
elaborate on exactly what the option is going to do? Given the
portability concern, I guess you were proposing that the option would
remove the preventative fences, rather than your current patch that
removes them via an alternative? I don't think we have any extension
related options that work like that at the moment, and making that an
option will just mean that distros that look to cater for multiple
platforms won't be able to turn it on.

Thanks,
Conor.




Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings

2024-06-04 Thread Alexandre Ghiti
On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti  wrote:
>
> Hi Yunhui,
>
> On Tue, Jun 4, 2024 at 8:21 AM yunhui cui  wrote:
> >
> > Hi Alexandre,
> >
> > On Mon, Jun 3, 2024 at 8:02 PM Alexandre Ghiti  
> > wrote:
> > >
> > > Hi Yunhui,
> > >
> > > On Mon, Jun 3, 2024 at 4:26 AM yunhui cui  wrote:
> > > >
> > > > Hi Alexandre,
> > > >
> > > > On Thu, Feb 1, 2024 at 12:03 AM Alexandre Ghiti 
> > > >  wrote:
> > > > >
> > > > > In 6.5, we removed the vmalloc fault path because that can't work (see
> > > > > [1] [2]). Then in order to make sure that new page table entries were
> > > > > seen by the page table walker, we had to preventively emit a 
> > > > > sfence.vma
> > > > > on all harts [3] but this solution is very costly since it relies on 
> > > > > IPI.
> > > > >
> > > > > And even there, we could end up in a loop of vmalloc faults if a 
> > > > > vmalloc
> > > > > allocation is done in the IPI path (for example if it is traced, see
> > > > > [4]), which could result in a kernel stack overflow.
> > > > >
> > > > > Those preventive sfence.vma needed to be emitted because:
> > > > >
> > > > > - if the uarch caches invalid entries, the new mapping may not be
> > > > >   observed by the page table walker and an invalidation may be needed.
> > > > > - if the uarch does not cache invalid entries, a reordered access
> > > > >   could "miss" the new mapping and traps: in that case, we would 
> > > > > actually
> > > > >   only need to retry the access, no sfence.vma is required.
> > > > >
> > > > > So this patch removes those preventive sfence.vma and actually handles
> > > > > the possible (and unlikely) exceptions. And since the kernel stacks
> > > > > mappings lie in the vmalloc area, this handling must be done very 
> > > > > early
> > > > > when the trap is taken, at the very beginning of handle_exception: 
> > > > > this
> > > > > also rules out the vmalloc allocations in the fault path.
> > > > >
> > > > > Link: 
> > > > > https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bj...@kernel.org/
> > > > >  [1]
> > > > > Link: 
> > > > > https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dy...@andestech.com
> > > > >  [2]
> > > > > Link: 
> > > > > https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexgh...@rivosinc.com/
> > > > >  [3]
> > > > > Link: 
> > > > > https://lore.kernel.org/lkml/20200508144043.13893-1-j...@8bytes.org/ 
> > > > > [4]
> > > > > Signed-off-by: Alexandre Ghiti 
> > > > > ---
> > > > >  arch/riscv/include/asm/cacheflush.h  | 18 +-
> > > > >  arch/riscv/include/asm/thread_info.h |  5 ++
> > > > >  arch/riscv/kernel/asm-offsets.c  |  5 ++
> > > > >  arch/riscv/kernel/entry.S| 84 
> > > > > 
> > > > >  arch/riscv/mm/init.c |  2 +
> > > > >  5 files changed, 113 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/arch/riscv/include/asm/cacheflush.h 
> > > > > b/arch/riscv/include/asm/cacheflush.h
> > > > > index a129dac4521d..b0d631701757 100644
> > > > > --- a/arch/riscv/include/asm/cacheflush.h
> > > > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > > > @@ -37,7 +37,23 @@ static inline void flush_dcache_page(struct page 
> > > > > *page)
> > > > > flush_icache_mm(vma->vm_mm, 0)
> > > > >
> > > > >  #ifdef CONFIG_64BIT
> > > > > -#define flush_cache_vmap(start, end)   
> > > > > flush_tlb_kernel_range(start, end)
> > > > > +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > > > > +extern char _end[];
> > > > > +#define flush_cache_vmap flush_cache_vmap
> > > > > +static inline void flush_cache_vmap(unsigned long start, unsigned 
> > > > > long end)
> > > > > +{
> > > > > +   if (is_vmalloc_or_module_addr((void *)start)) {
> > > > > +   int i;
> > > > > +
> > > > > +   /*
> > > > > +* We don't care if concurrently a cpu resets this 
> > > > > value since
> > > > > +* the only place this can happen is in 
> > > > > handle_exception() where
> > > > > +* an sfence.vma is emitted.
> > > > > +*/
> > > > > +   for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
> > > > > +   new_vmalloc[i] = -1ULL;
> > > > > +   }
> > > > > +}
> > > > >  #define flush_cache_vmap_early(start, end) 
> > > > > local_flush_tlb_kernel_range(start, end)
> > > > >  #endif
> > > > >
> > > > > diff --git a/arch/riscv/include/asm/thread_info.h 
> > > > > b/arch/riscv/include/asm/thread_info.h
> > > > > index 5d473343634b..32631acdcdd4 100644
> > > > > --- a/arch/riscv/include/asm/thread_info.h
> > > > > +++ b/arch/riscv/include/asm/thread_info.h
> > > > > @@ -60,6 +60,11 @@ struct thread_info {
> > > > > void*scs_base;
> > > > > void*scs_sp;
> > > > >  #endif
> > > > > +   /*
> > > > > +* Used in handle_exception() to save a0, a1 and a2 before 
> > > > > knowing if we
> > > > > +* can access the kern

Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings

2024-06-04 Thread Alexandre Ghiti
Hi Yunhui,

On Tue, Jun 4, 2024 at 8:21 AM yunhui cui  wrote:
>
> Hi Alexandre,
>
> On Mon, Jun 3, 2024 at 8:02 PM Alexandre Ghiti  wrote:
> >
> > Hi Yunhui,
> >
> > On Mon, Jun 3, 2024 at 4:26 AM yunhui cui  wrote:
> > >
> > > Hi Alexandre,
> > >
> > > On Thu, Feb 1, 2024 at 12:03 AM Alexandre Ghiti  
> > > wrote:
> > > >
> > > > In 6.5, we removed the vmalloc fault path because that can't work (see
> > > > [1] [2]). Then in order to make sure that new page table entries were
> > > > seen by the page table walker, we had to preventively emit a sfence.vma
> > > > on all harts [3] but this solution is very costly since it relies on 
> > > > IPI.
> > > >
> > > > And even there, we could end up in a loop of vmalloc faults if a vmalloc
> > > > allocation is done in the IPI path (for example if it is traced, see
> > > > [4]), which could result in a kernel stack overflow.
> > > >
> > > > Those preventive sfence.vma needed to be emitted because:
> > > >
> > > > - if the uarch caches invalid entries, the new mapping may not be
> > > >   observed by the page table walker and an invalidation may be needed.
> > > > - if the uarch does not cache invalid entries, a reordered access
> > > >   could "miss" the new mapping and traps: in that case, we would 
> > > > actually
> > > >   only need to retry the access, no sfence.vma is required.
> > > >
> > > > So this patch removes those preventive sfence.vma and actually handles
> > > > the possible (and unlikely) exceptions. And since the kernel stacks
> > > > mappings lie in the vmalloc area, this handling must be done very early
> > > > when the trap is taken, at the very beginning of handle_exception: this
> > > > also rules out the vmalloc allocations in the fault path.
> > > >
> > > > Link: 
> > > > https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bj...@kernel.org/
> > > >  [1]
> > > > Link: 
> > > > https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dy...@andestech.com
> > > >  [2]
> > > > Link: 
> > > > https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexgh...@rivosinc.com/
> > > >  [3]
> > > > Link: 
> > > > https://lore.kernel.org/lkml/20200508144043.13893-1-j...@8bytes.org/ [4]
> > > > Signed-off-by: Alexandre Ghiti 
> > > > ---
> > > >  arch/riscv/include/asm/cacheflush.h  | 18 +-
> > > >  arch/riscv/include/asm/thread_info.h |  5 ++
> > > >  arch/riscv/kernel/asm-offsets.c  |  5 ++
> > > >  arch/riscv/kernel/entry.S| 84 
> > > >  arch/riscv/mm/init.c |  2 +
> > > >  5 files changed, 113 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/arch/riscv/include/asm/cacheflush.h 
> > > > b/arch/riscv/include/asm/cacheflush.h
> > > > index a129dac4521d..b0d631701757 100644
> > > > --- a/arch/riscv/include/asm/cacheflush.h
> > > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > > @@ -37,7 +37,23 @@ static inline void flush_dcache_page(struct page 
> > > > *page)
> > > > flush_icache_mm(vma->vm_mm, 0)
> > > >
> > > >  #ifdef CONFIG_64BIT
> > > > -#define flush_cache_vmap(start, end)   
> > > > flush_tlb_kernel_range(start, end)
> > > > +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > > > +extern char _end[];
> > > > +#define flush_cache_vmap flush_cache_vmap
> > > > +static inline void flush_cache_vmap(unsigned long start, unsigned long 
> > > > end)
> > > > +{
> > > > +   if (is_vmalloc_or_module_addr((void *)start)) {
> > > > +   int i;
> > > > +
> > > > +   /*
> > > > +* We don't care if concurrently a cpu resets this 
> > > > value since
> > > > +* the only place this can happen is in 
> > > > handle_exception() where
> > > > +* an sfence.vma is emitted.
> > > > +*/
> > > > +   for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
> > > > +   new_vmalloc[i] = -1ULL;
> > > > +   }
> > > > +}
> > > >  #define flush_cache_vmap_early(start, end) 
> > > > local_flush_tlb_kernel_range(start, end)
> > > >  #endif
> > > >
> > > > diff --git a/arch/riscv/include/asm/thread_info.h 
> > > > b/arch/riscv/include/asm/thread_info.h
> > > > index 5d473343634b..32631acdcdd4 100644
> > > > --- a/arch/riscv/include/asm/thread_info.h
> > > > +++ b/arch/riscv/include/asm/thread_info.h
> > > > @@ -60,6 +60,11 @@ struct thread_info {
> > > > void*scs_base;
> > > > void*scs_sp;
> > > >  #endif
> > > > +   /*
> > > > +* Used in handle_exception() to save a0, a1 and a2 before 
> > > > knowing if we
> > > > +* can access the kernel stack.
> > > > +*/
> > > > +   unsigned long   a0, a1, a2;
> > > >  };
> > > >
> > > >  #ifdef CONFIG_SHADOW_CALL_STACK
> > > > diff --git a/arch/riscv/kernel/asm-offsets.c 
> > > > b/arch/riscv/kernel/asm-offsets.c
> > > > index a03129f40c46..939ddc0e3c6e 100644
> > > > --- a/arch/riscv/kernel/asm-offsets.c
> > > > +++ b/ar
