Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On 23/10/2024 19:10, Liam R. Howlett wrote:
> * Steven Price [241023 05:31]:
>>>> * Box64 seems to have a custom allocator based on reading /proc/self/maps to allocate a block of VA space with a low enough address [1]
>>>>
>>>> * PHP has code reading /proc/self/maps - I think this is to find a segment which is close enough to the text segment [2]
>>>>
>>>> * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit addresses [3][4]
>>>
>>> Can't the limited number of applications that need to restrict the upper bound use an LD_PRELOAD compatible library to do this?
>>
>> I'm not entirely sure what point you are making here. Yes an LD_PRELOAD approach could be used instead of a personality type as a 'hack' to preallocate the upper address space. The obvious disadvantage is that you can't (easily) layer LD_PRELOAD so it won't work in the general case.
>
> My point is that riscv could work around the limited number of applications that require this. It's not really viable for you.

Ah ok - thanks for the clarification.

>>>> * pmdk has some funky code to find the lowest address that meets certain requirements - this does look like an ASLR alternative and probably couldn't directly use MAP_BELOW_HINT, although maybe this suggests we need a mechanism to map without a VA-range? [5]
>>>>
>>>> * MIT-Scheme parses /proc/self/maps to find the lowest mapping within a range [6]
>>>>
>>>> * LuaJIT uses an approach to 'probe' to find a suitable low address for allocation [7]
>>>
>>> Although I did not take a deep dive into each example above, there are some very odd things being done; we will never cover all the use cases with an exact API match. What we have today can be made to work for these users, as they have figured out ways to do it.
>>>
>>> Are they pretty? No. Are they common? No. I'm not sure it's worth plumbing new MM code in for these users.
>>
>> My issue with the existing 'solutions' is that they all seem to be fragile:
>>
>> * Using /proc/self/maps is inherently racy if there could be any other code running in the process at the same time.
>
> Yes, it is not thread safe. Parsing text is also undesirable.
>
>> * Attempting to map the upper part of the address space only works if done early enough - once an allocation arrives there, there's very little you can robustly do (because the stray allocation might be freed).
>>
>> * LuaJIT's probing mechanism is probably robust, but it's inefficient - LuaJIT has a fallback of linear probing, followed by no hint (ASLR), followed by pseudo-random probing. I don't know the history of the code but it looks like it's probably been tweaked to try to avoid performance issues.
>>
>>>> The biggest benefit I see of MAP_BELOW_HINT is that it would allow a library to get low addresses without causing any problems for the rest of the application. The use case I'm looking at is in a library and therefore a personality mode wouldn't be appropriate (because I don't want to affect the rest of the application). Reading /proc/self/maps is also problematic because other threads could be allocating/freeing at the same time.
>>>
>>> As long as you don't exhaust the lower limit you are trying to allocate within - which is exactly the issue riscv is hitting.
>>
>> Obviously if you actually exhaust the lower limit then any MAP_BELOW_HINT API would also fail - there's really not much that can be done in that case.
>
> Today we reverse the search, so you end up in the higher address (bottom-up vs top-down) - although the direction is arch dependent.
>
> If the allocation is too high/low then you could detect, free, and handle the failure.

Agreed, that's fine.

>>> I understand that you are providing examples to prove that this is needed, but I feel like you are better demonstrating that the flexibility exists to implement solutions in different ways using today's API.
>>
>> My intention is to show that today's API doesn't provide a robust way of doing this. Although I'm quite happy if you can point me at a robust way with the current API. As I mentioned, my goal is to be able to map memory in a (multithreaded) library with a (ideally configurable) number of VA bits. I don't particularly want to restrict the whole process, just specific allocations.
>
> If you don't need to restrict everything, won't the hint work for your use case? I must be missing something from your requirements.

The hint only works if the hint address is actually free. Otherwise mmap() falls back to behaving as if the hint address wasn't specified. E.g.

    for (int i = 0; i < 2; i++) {
        void *addr = mmap((void *)(1UL << 32), PAGE_SIZE, PROT_NONE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        printf("%p\n", addr);
    }

Prints something like
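For illustration, the probing workaround this subthread keeps circling around amounts to something like the sketch below. This is not LuaJIT's actual code, just a minimal example assuming Linux's MAP_FIXED_NOREPLACE (4.17+), which makes each probe fail with EEXIST instead of clobbering or silently relocating an existing mapping:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* Probe upwards in 1GiB steps (an arbitrary stride) for a free
     * region that ends below 'limit'. MAP_FIXED_NOREPLACE keeps each
     * probe safe: it fails with EEXIST rather than replacing an
     * existing mapping, but the linear scan can take many syscalls. */
    static void *map_below(uintptr_t limit, size_t len)
    {
        for (uintptr_t a = 1UL << 30; a + len <= limit; a += 1UL << 30) {
            void *p = mmap((void *)a, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS |
                           MAP_FIXED_NOREPLACE, -1, 0);
            if (p != MAP_FAILED)
                return p;           /* success: p < limit by construction */
            if (errno != EEXIST)
                return MAP_FAILED;  /* real failure, give up */
        }
        return MAP_FAILED;          /* window exhausted */
    }

This is robust but inefficient, which is exactly the trade-off described for LuaJIT above; a MAP_BELOW_HINT flag would replace the whole loop with one syscall.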
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
* Steven Price [241023 05:31]:
> >> * Box64 seems to have a custom allocator based on reading /proc/self/maps to allocate a block of VA space with a low enough address [1]
> >>
> >> * PHP has code reading /proc/self/maps - I think this is to find a segment which is close enough to the text segment [2]
> >>
> >> * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit addresses [3][4]
> >
> > Can't the limited number of applications that need to restrict the upper bound use an LD_PRELOAD compatible library to do this?
>
> I'm not entirely sure what point you are making here. Yes an LD_PRELOAD approach could be used instead of a personality type as a 'hack' to preallocate the upper address space. The obvious disadvantage is that you can't (easily) layer LD_PRELOAD so it won't work in the general case.

My point is that riscv could work around the limited number of applications that require this. It's not really viable for you.

> >> * pmdk has some funky code to find the lowest address that meets certain requirements - this does look like an ASLR alternative and probably couldn't directly use MAP_BELOW_HINT, although maybe this suggests we need a mechanism to map without a VA-range? [5]
> >>
> >> * MIT-Scheme parses /proc/self/maps to find the lowest mapping within a range [6]
> >>
> >> * LuaJIT uses an approach to 'probe' to find a suitable low address for allocation [7]
> >
> > Although I did not take a deep dive into each example above, there are some very odd things being done; we will never cover all the use cases with an exact API match. What we have today can be made to work for these users, as they have figured out ways to do it.
> >
> > Are they pretty? No. Are they common? No. I'm not sure it's worth plumbing new MM code in for these users.
>
> My issue with the existing 'solutions' is that they all seem to be fragile:
>
> * Using /proc/self/maps is inherently racy if there could be any other code running in the process at the same time.

Yes, it is not thread safe. Parsing text is also undesirable.

> * Attempting to map the upper part of the address space only works if done early enough - once an allocation arrives there, there's very little you can robustly do (because the stray allocation might be freed).
>
> * LuaJIT's probing mechanism is probably robust, but it's inefficient - LuaJIT has a fallback of linear probing, followed by no hint (ASLR), followed by pseudo-random probing. I don't know the history of the code but it looks like it's probably been tweaked to try to avoid performance issues.
>
> >> The biggest benefit I see of MAP_BELOW_HINT is that it would allow a library to get low addresses without causing any problems for the rest of the application. The use case I'm looking at is in a library and therefore a personality mode wouldn't be appropriate (because I don't want to affect the rest of the application). Reading /proc/self/maps is also problematic because other threads could be allocating/freeing at the same time.
> >
> > As long as you don't exhaust the lower limit you are trying to allocate within - which is exactly the issue riscv is hitting.
>
> Obviously if you actually exhaust the lower limit then any MAP_BELOW_HINT API would also fail - there's really not much that can be done in that case.

Today we reverse the search, so you end up in the higher address (bottom-up vs top-down) - although the direction is arch dependent.

If the allocation is too high/low then you could detect, free, and handle the failure.

> > I understand that you are providing examples to prove that this is needed, but I feel like you are better demonstrating that the flexibility exists to implement solutions in different ways using today's API.
>
> My intention is to show that today's API doesn't provide a robust way of doing this. Although I'm quite happy if you can point me at a robust way with the current API. As I mentioned, my goal is to be able to map memory in a (multithreaded) library with a (ideally configurable) number of VA bits. I don't particularly want to restrict the whole process, just specific allocations.

If you don't need to restrict everything, won't the hint work for your use case? I must be missing something from your requirements.

> I had typed up a series similar to this one as a MAP_BELOW flag would fit my use-case well.
>
> > I think it would be best to use the existing methods and work around the issue that was created in riscv while future changes could mirror amd64 and arm64.
>
> The riscv issue is a different issue to the one I'm trying to solve. I agree MAP_BELOW_HINT isn't a great fix for that because we already have differences between amd64 and arm64 and obviously no software currently out there uses this
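The detect-free-handle pattern Liam describes would look something like this sketch (illustrative only, not proposed kernel or library code):

    #include <errno.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* Take whatever address mmap() returns and back out if it landed
     * too high. The caller then handles the failure, e.g. by falling
     * back to probing for a free low region. */
    static void *mmap_checked_below(uintptr_t limit, size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return p;
        if ((uintptr_t)p + len > limit) {
            munmap(p, len);     /* detect and free */
            errno = ENOMEM;     /* arbitrary choice of error to report */
            return MAP_FAILED;
        }
        return p;
    }

The weakness discussed in the thread applies: in a multithreaded process, nothing prevents another thread from racing for the same low region, so the caller has to loop or give up.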
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
Hi Liam,

On 21/10/2024 20:48, Liam R. Howlett wrote:
> * Steven Price [241021 09:23]:
>> On 09/09/2024 10:46, Kirill A. Shutemov wrote:
>>> On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote:
>>>> On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
>>>>> On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
>>>>>> Some applications rely on placing data in free bits of addresses allocated by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
>>>>>>
>>>>>> The riscv architecture needs a way to similarly restrict the virtual address space. On the riscv port of OpenJDK an error is thrown if attempting to run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2].
>>>
>>> I also saw libmozjs crashing with 57-bit address space on x86.
>>>
>>>>>> These applications work on x86 because x86 does an implicit 47-bit restriction of mmap() addresses that contain a hint address that is less than 48 bits.
>>>>>>
>>>>>> Instead of implicitly restricting the address space on riscv (or any current/future architecture), a flag would allow users to opt-in to this behavior rather than opt-out as is done on other architectures. This is desirable because it is a small class of applications that do pointer masking.
>>>
>>> You reiterate the argument about "small class of applications". But it makes no sense to me.
>>
>> Sorry to chime in late on this - I had been considering implementing something like MAP_BELOW_HINT and found this thread.
>>
>> While the examples of applications that want to use high VA bits and get bitten by future upgrades are not very persuasive, it's worth pointing out that there are a variety of somewhat horrid hacks out there to work around this feature not existing.
>>
>> E.g. from my brief research into other code:
>>
>> * Box64 seems to have a custom allocator based on reading /proc/self/maps to allocate a block of VA space with a low enough address [1]
>>
>> * PHP has code reading /proc/self/maps - I think this is to find a segment which is close enough to the text segment [2]
>>
>> * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit addresses [3][4]
>
> Can't the limited number of applications that need to restrict the upper bound use an LD_PRELOAD compatible library to do this?

I'm not entirely sure what point you are making here. Yes an LD_PRELOAD approach could be used instead of a personality type as a 'hack' to preallocate the upper address space. The obvious disadvantage is that you can't (easily) layer LD_PRELOAD so it won't work in the general case.

>> * pmdk has some funky code to find the lowest address that meets certain requirements - this does look like an ASLR alternative and probably couldn't directly use MAP_BELOW_HINT, although maybe this suggests we need a mechanism to map without a VA-range? [5]
>>
>> * MIT-Scheme parses /proc/self/maps to find the lowest mapping within a range [6]
>>
>> * LuaJIT uses an approach to 'probe' to find a suitable low address for allocation [7]
>
> Although I did not take a deep dive into each example above, there are some very odd things being done; we will never cover all the use cases with an exact API match. What we have today can be made to work for these users, as they have figured out ways to do it.
>
> Are they pretty? No. Are they common? No. I'm not sure it's worth plumbing new MM code in for these users.

My issue with the existing 'solutions' is that they all seem to be fragile:

* Using /proc/self/maps is inherently racy if there could be any other code running in the process at the same time.

* Attempting to map the upper part of the address space only works if done early enough - once an allocation arrives there, there's very little you can robustly do (because the stray allocation might be freed).

* LuaJIT's probing mechanism is probably robust, but it's inefficient - LuaJIT has a fallback of linear probing, followed by no hint (ASLR), followed by pseudo-random probing. I don't know the history of the code but it looks like it's probably been tweaked to try to avoid performance issues.

>> The biggest benefit I see of MAP_BELOW_HINT is that it would allow a library to get low addresses without causing any problems for the rest of the application. The use case I'm looking at is in a library and therefore a personality mode wouldn't be appropriate (because I don't want to affect the rest of the application). Reading /proc/self/maps is also problematic because other threads could be allocating/freeing at the same time.
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On 09/09/2024 10:46, Kirill A. Shutemov wrote:
> On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote:
>> On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
>>> On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
>>>> Some applications rely on placing data in free bits of addresses allocated by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
>>>>
>>>> The riscv architecture needs a way to similarly restrict the virtual address space. On the riscv port of OpenJDK an error is thrown if attempting to run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2].
>
> I also saw libmozjs crashing with 57-bit address space on x86.
>
>>>> These applications work on x86 because x86 does an implicit 47-bit restriction of mmap() addresses that contain a hint address that is less than 48 bits.
>>>>
>>>> Instead of implicitly restricting the address space on riscv (or any current/future architecture), a flag would allow users to opt-in to this behavior rather than opt-out as is done on other architectures. This is desirable because it is a small class of applications that do pointer masking.
>
> You reiterate the argument about "small class of applications". But it makes no sense to me.

Sorry to chime in late on this - I had been considering implementing something like MAP_BELOW_HINT and found this thread.

While the examples of applications that want to use high VA bits and get bitten by future upgrades are not very persuasive, it's worth pointing out that there are a variety of somewhat horrid hacks out there to work around this feature not existing.

E.g. from my brief research into other code:

* Box64 seems to have a custom allocator based on reading /proc/self/maps to allocate a block of VA space with a low enough address [1]

* PHP has code reading /proc/self/maps - I think this is to find a segment which is close enough to the text segment [2]

* FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit addresses [3][4]

* pmdk has some funky code to find the lowest address that meets certain requirements - this does look like an ASLR alternative and probably couldn't directly use MAP_BELOW_HINT, although maybe this suggests we need a mechanism to map without a VA-range? [5]

* MIT-Scheme parses /proc/self/maps to find the lowest mapping within a range [6]

* LuaJIT uses an approach to 'probe' to find a suitable low address for allocation [7]

The biggest benefit I see of MAP_BELOW_HINT is that it would allow a library to get low addresses without causing any problems for the rest of the application. The use case I'm looking at is in a library and therefore a personality mode wouldn't be appropriate (because I don't want to affect the rest of the application). Reading /proc/self/maps is also problematic because other threads could be allocating/freeing at the same time.

Thanks,
Steve

[1] https://sources.debian.org/src/box64/0.3.0+dfsg-1/src/custommem.c/
[2] https://sources.debian.org/src/php8.2/8.2.24-1/ext/opcache/shared_alloc_mmap.c/#L62
[3] https://github.com/FEX-Emu/FEX/blob/main/FEXCore/Source/Utils/Allocator.cpp
[4] https://github.com/FEX-Emu/FEX/commit/df2f1ad074e5cdfb19a0bd4639b7604f777fb05c
[5] https://sources.debian.org/src/pmdk/1.13.1-1.1/src/common/mmap_posix.c/?hl=29#L29
[6] https://sources.debian.org/src/mit-scheme/12.1-3/src/microcode/ux.c/#L826
[7] https://sources.debian.org/src/luajit/2.1.0+openresty20240815-1/src/lj_alloc.c/

> With full address space by default, this small class of applications is going to be *broken* unless they handle the RISC-V case specifically.
>
> On the other hand, if you limit VA to 128TiB by default (like many architectures do[1]) everything would work without intervention. And if an app needs a wider address space it would get it with the hint opt-in, because it is required on x86-64 anyway. Again, no RISC-V-specific code.
>
> I see no upside with your approach. Just worse user experience.
>
> [1] See the va_high_addr_switch test case in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/mm/Makefile#n115
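The /proc/self/maps technique that several of the projects listed above rely on boils down to something like the sketch below (illustrative only, not taken from any of them). The race Steven describes is visible in the structure: nothing stops another thread from mapping into the hole between the scan and the caller's subsequent mmap().

    #include <stdio.h>
    #include <stdint.h>

    /* Scan /proc/self/maps for a hole of at least 'len' bytes ending
     * below 'limit'. Returns 0 if none is found. Inherently racy in a
     * multithreaded process: the hole can be taken before it is used. */
    static uintptr_t find_gap_below(uintptr_t limit, size_t len)
    {
        FILE *f = fopen("/proc/self/maps", "r");
        unsigned long start, end, prev_end = 0x10000; /* skip the low guard area */

        if (!f)
            return 0;
        while (fscanf(f, "%lx-%lx%*[^\n]", &start, &end) == 2) {
            if (start >= limit || start - prev_end >= len)
                break;          /* past the limit, or found a hole */
            prev_end = end;
        }
        fclose(f);
        return (prev_end + len <= limit) ? prev_end : 0;
    }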
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
* Steven Price [241021 09:23]:
> On 09/09/2024 10:46, Kirill A. Shutemov wrote:
> > On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote:
> >> On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
> >>> On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
> >>>> Some applications rely on placing data in free bits of addresses allocated by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
> >>>>
> >>>> The riscv architecture needs a way to similarly restrict the virtual address space. On the riscv port of OpenJDK an error is thrown if attempting to run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2].
> >
> > I also saw libmozjs crashing with 57-bit address space on x86.
> >
> >>>> These applications work on x86 because x86 does an implicit 47-bit restriction of mmap() addresses that contain a hint address that is less than 48 bits.
> >>>>
> >>>> Instead of implicitly restricting the address space on riscv (or any current/future architecture), a flag would allow users to opt-in to this behavior rather than opt-out as is done on other architectures. This is desirable because it is a small class of applications that do pointer masking.
> >
> > You reiterate the argument about "small class of applications". But it makes no sense to me.
>
> Sorry to chime in late on this - I had been considering implementing something like MAP_BELOW_HINT and found this thread.
>
> While the examples of applications that want to use high VA bits and get bitten by future upgrades are not very persuasive, it's worth pointing out that there are a variety of somewhat horrid hacks out there to work around this feature not existing.
>
> E.g. from my brief research into other code:
>
> * Box64 seems to have a custom allocator based on reading /proc/self/maps to allocate a block of VA space with a low enough address [1]
>
> * PHP has code reading /proc/self/maps - I think this is to find a segment which is close enough to the text segment [2]
>
> * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit addresses [3][4]

Can't the limited number of applications that need to restrict the upper bound use an LD_PRELOAD compatible library to do this?

> * pmdk has some funky code to find the lowest address that meets certain requirements - this does look like an ASLR alternative and probably couldn't directly use MAP_BELOW_HINT, although maybe this suggests we need a mechanism to map without a VA-range? [5]
>
> * MIT-Scheme parses /proc/self/maps to find the lowest mapping within a range [6]
>
> * LuaJIT uses an approach to 'probe' to find a suitable low address for allocation [7]

Although I did not take a deep dive into each example above, there are some very odd things being done; we will never cover all the use cases with an exact API match. What we have today can be made to work for these users, as they have figured out ways to do it.

Are they pretty? No. Are they common? No. I'm not sure it's worth plumbing new MM code in for these users.

> The biggest benefit I see of MAP_BELOW_HINT is that it would allow a library to get low addresses without causing any problems for the rest of the application. The use case I'm looking at is in a library and therefore a personality mode wouldn't be appropriate (because I don't want to affect the rest of the application). Reading /proc/self/maps is also problematic because other threads could be allocating/freeing at the same time.

As long as you don't exhaust the lower limit you are trying to allocate within - which is exactly the issue riscv is hitting.

I understand that you are providing examples to prove that this is needed, but I feel like you are better demonstrating that the flexibility exists to implement solutions in different ways using today's API.

I think it would be best to use the existing methods and work around the issue that was created in riscv while future changes could mirror amd64 and arm64.

...

> [1] https://sources.debian.org/src/box64/0.3.0+dfsg-1/src/custommem.c/
> [2] https://sources.debian.org/src/php8.2/8.2.24-1/ext/opcache/shared_alloc_mmap.c/#L62
> [3] https://github.com/FEX-Emu/FEX/blob/main/FEXCore/Source/Utils/Allocator.cpp
> [4] https://github.com/FEX-Emu/FEX/commit/df2f1ad074e5cdfb19a0bd4639b7604f777fb05c
> [5] https://sources.debian.org/src/pmdk/1.13.1-1.1/src/common/mmap_posix.c/?hl=29#L29
> [6] https://sources.debian.org/src/mit-scheme/12.1-3/src/microcode/ux.c/#L826
> [7] https://sources.debian.org/src/luajit/2.1.0+openresty20240815-1/src/lj_alloc.c/
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Fri, 13 Sep 2024 14:04:06 PDT (-0700), Charlie Jenkins wrote:
> On Fri, Sep 13, 2024 at 08:41:34AM +0100, Lorenzo Stoakes wrote:
>> On Wed, Sep 11, 2024 at 11:18:12PM GMT, Charlie Jenkins wrote:
>> > On Wed, Sep 11, 2024 at 07:21:27PM +0100, Catalin Marinas wrote:
>> > > On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
>> > > > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
>> > > > > * Catalin Marinas [240906 07:44]:
>> > > > > > On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
>> > > > > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
>> > > > > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann wrote:
>> > > > > > > >> It's also unclear to me how we want this flag to interact with the existing logic in arch_get_mmap_end(), which attempts to limit the default mapping to a 47-bit address space already.
>> > > > > > > >
>> > > > > > > > To optimize RISC-V progress, I recommend:
>> > > > > > > >
>> > > > > > > > Step 1: Approve the patch.
>> > > > > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
>> > > > > > > > Step 3: Wait approximately several iterations for Go & OpenJDK
>> > > > > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
>> > >
>> > > Point 4 is an ABI change. What guarantees that there isn't still software out there that relies on the old behaviour?
>> >
>> > Yeah I don't think it would be desirable to remove the 47 bit constraint in architectures that already have it.
>> >
>> > > > > > > I really want to first see a plausible explanation about why RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW like all the other major architectures (x86, arm64, powerpc64),
>> > > > > >
>> > > > > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default configuration. We end up with a 47-bit limit with 16K pages but for a different reason that has to do with LPA2 support (I doubt we need this for the user mapping but we need to untangle some of the macros there; that's for a separate discussion).
>> > > > > >
>> > > > > > That said, we haven't encountered any user space problems with a 48-bit DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar approach (47 or 48 bit default limit). Better to have some ABI consistency between architectures. One can still ask for addresses above this default limit via mmap().
>> > > > >
>> > > > > I think that is best as well.
>> > > > >
>> > > > > Can we please just do what x86 and arm64 do?
>> > > >
>> > > > I responded to Arnd in the other thread, but I am still not convinced that the solution that x86 and arm64 have selected is the best solution. The solution of defaulting to 47 bits does allow applications the ability to get addresses that are below 47 bits. However, due to differences across architectures it doesn't seem possible to have all architectures default to the same value. Additionally, this flag will be able to help users avoid potential bugs where a hint address is passed that causes upper bits of a VA to be used.
>> > >
>> > > The reason we added this limit on arm64 is that we noticed programs using the top 8 bits of a 64-bit pointer for additional information. IIRC, it wasn't even openJDK but some JavaScript JIT. We could have taught those programs of a new flag but since we couldn't tell how many are out there, it was the safest to default to a smaller limit and opt in to the higher one. Such opt-in is via mmap() but if you prefer a prctl() flag, that's fine by me as well (though I think this should be opt-in to higher addresses rather than opt-out of the higher addresses).
>> >
>> > The mmap() flag was used in previous versions but was decided against because this feature is more useful if it is process-wide. A personality() flag was chosen instead of a prctl() flag because there existed other flags in personality() that were similar. I am tempted to use prctl() however because then we could have an additional arg to select the exact number of bits that should be reserved (rather than being fixed at 47 bits).
>>
>> I am very much not in favour of a prctl(); it would require us to add state limiting the address space and the timing of it becomes critical. Then we have the same issue we do with the other proposals as to - what happens if this is too low?
>>
>> What is 'too low' varies by architecture, and for 32-bit architectures could get quite... problematic.
>>
>> And again, what is the RoI here - we're introducing maintenance burden and edge cases vs. the x86 solution
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
Charlie Jenkins writes:
> On Wed, Sep 11, 2024 at 11:38:55PM +1000, Michael Ellerman wrote:
>> Geert Uytterhoeven writes:
>> > Hi Christophe,
>> >
>> > On Tue, Sep 10, 2024 at 11:21 AM Christophe Leroy wrote:
>> >> >>> diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
>> >> >>> index 49796b7756af..cd3b8c154d9b 100644
>> >> >>> --- a/include/uapi/linux/personality.h
>> >> >>> +++ b/include/uapi/linux/personality.h
>> >> >>> @@ -22,6 +22,7 @@ enum {
>> >> >>>  	WHOLE_SECONDS =		0x2000000,
>> >> >>>  	STICKY_TIMEOUTS =	0x4000000,
>> >> >>>  	ADDR_LIMIT_3GB =	0x8000000,
>> >> >>> +	ADDR_LIMIT_47BIT =	0x10000000,
>> >> >>>  };
>> >> >>
>> >> >> I wonder if ADDR_LIMIT_128T would be clearer?
>> >> >
>> >> > I don't follow, what does 128T represent?
>> >>
>> >> 128T is 128 Terabytes, that's the maximum size achievable with a 47BIT address. That naming would be more consistent with the ADDR_LIMIT_3GB just above, which means a 3 Gigabytes limit.
>> >
>> > Hence ADDR_LIMIT_128TB?
>>
>> Yes it should be 128TB. Typo by me.
>
> 47BIT was selected because the use case for this flag is for applications that want to store data in the upper bits of a virtual address space. In this case, how large the virtual address space is doesn't matter; what matters is the number of bits that are being used, and hence the number of bits that are free.

Yeah I understand that's how you came to the problem. But for the user API I think using the size of the address space is clearer, easier to explain, and matches the existing ADDR_LIMIT_3GB.

cheers
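Whatever the final name, the two describe the same limit (2^47 bytes = 128 x 2^40 bytes = 128TB), and the flag would be used like the existing personality flags. A sketch, assuming the 0x10000000 value from the quoted diff and a kernel with the patch applied - the constant is not in any released uapi header:

    #include <sys/personality.h>
    #include <sys/mman.h>

    #ifndef ADDR_LIMIT_47BIT
    #define ADDR_LIMIT_47BIT 0x10000000	/* value from the patch above */
    #endif

    int main(void)
    {
        /* personality(0xffffffff) is the documented idiom for querying
         * the current persona; OR in the proposed flag. */
        personality(personality(0xffffffffUL) | ADDR_LIMIT_47BIT);

        /* With the patch applied, this mapping would be placed below
         * 1UL << 47, i.e. below 128TB. */
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED;
    }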
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Fri, Sep 13, 2024 at 08:41:34AM +0100, Lorenzo Stoakes wrote:
> On Wed, Sep 11, 2024 at 11:18:12PM GMT, Charlie Jenkins wrote:
> > On Wed, Sep 11, 2024 at 07:21:27PM +0100, Catalin Marinas wrote:
> > > On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
> > > > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
> > > > > * Catalin Marinas [240906 07:44]:
> > > > > > On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
> > > > > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > > > > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann wrote:
> > > > > > > >> It's also unclear to me how we want this flag to interact with the existing logic in arch_get_mmap_end(), which attempts to limit the default mapping to a 47-bit address space already.
> > > > > > > >
> > > > > > > > To optimize RISC-V progress, I recommend:
> > > > > > > >
> > > > > > > > Step 1: Approve the patch.
> > > > > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > > > > > > > Step 3: Wait approximately several iterations for Go & OpenJDK
> > > > > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
> > >
> > > Point 4 is an ABI change. What guarantees that there isn't still software out there that relies on the old behaviour?
> >
> > Yeah I don't think it would be desirable to remove the 47 bit constraint in architectures that already have it.
> >
> > > > > > > I really want to first see a plausible explanation about why RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW like all the other major architectures (x86, arm64, powerpc64),
> > > > > >
> > > > > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default configuration. We end up with a 47-bit limit with 16K pages but for a different reason that has to do with LPA2 support (I doubt we need this for the user mapping but we need to untangle some of the macros there; that's for a separate discussion).
> > > > > >
> > > > > > That said, we haven't encountered any user space problems with a 48-bit DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar approach (47 or 48 bit default limit). Better to have some ABI consistency between architectures. One can still ask for addresses above this default limit via mmap().
> > > > >
> > > > > I think that is best as well.
> > > > >
> > > > > Can we please just do what x86 and arm64 do?
> > > >
> > > > I responded to Arnd in the other thread, but I am still not convinced that the solution that x86 and arm64 have selected is the best solution. The solution of defaulting to 47 bits does allow applications the ability to get addresses that are below 47 bits. However, due to differences across architectures it doesn't seem possible to have all architectures default to the same value. Additionally, this flag will be able to help users avoid potential bugs where a hint address is passed that causes upper bits of a VA to be used.
> > >
> > > The reason we added this limit on arm64 is that we noticed programs using the top 8 bits of a 64-bit pointer for additional information. IIRC, it wasn't even openJDK but some JavaScript JIT. We could have taught those programs of a new flag but since we couldn't tell how many are out there, it was the safest to default to a smaller limit and opt in to the higher one. Such opt-in is via mmap() but if you prefer a prctl() flag, that's fine by me as well (though I think this should be opt-in to higher addresses rather than opt-out of the higher addresses).
> >
> > The mmap() flag was used in previous versions but was decided against because this feature is more useful if it is process-wide. A personality() flag was chosen instead of a prctl() flag because there existed other flags in personality() that were similar. I am tempted to use prctl() however because then we could have an additional arg to select the exact number of bits that should be reserved (rather than being fixed at 47 bits).
>
> I am very much not in favour of a prctl(); it would require us to add state limiting the address space and the timing of it becomes critical. Then we have the same issue we do with the other proposals as to - what happens if this is too low?
>
> What is 'too low' varies by architecture, and for 32-bit architectures could get quite... problematic.
>
> And again, what is the RoI here - we're introducing maintenance burden and edge cases vs. the x86 solution in order to... accommodate things that need more than 128 TiB of address space? A problem that does not appear to exist in reality?
>
> I suggested the personality approach as the least impactful compromise way of this series working, but I think after what Arnd has said (and please forgive me if I've missed further discussion, have been dipping in and out of this!) - adapting riscv to the approach we take elsewhere seems the most sensible solution
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Fri, Sep 13, 2024 at 11:08:23AM +0100, Catalin Marinas wrote:
> On Thu, Sep 12, 2024 at 02:15:59PM -0700, Charlie Jenkins wrote:
> > On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote:
> > > On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
> > > > Opting-in to the higher address space is reasonable. However, it is not my preference, because the purpose of this flag is to ensure that allocations do not exceed 47-bits, so it is a clearer ABI to have the applications that want this guarantee to be the ones setting the flag, rather than the applications that want the higher bits setting the flag.
> > >
> > > Yes, this would be ideal. Unfortunately those applications don't know they need to set a flag in order to work.
> >
> > It's not a regression: the applications never worked (on platforms that do not have this default). The 47-bit default would allow applications that didn't work to start working at the cost of a non-ideal ABI. That doesn't seem like a reasonable tradeoff to me. If applications want to run on new hardware that has different requirements, shouldn't they be required to update rather than expect the kernel will solve their problems for them?
>
> That's a valid point but it depends on the application and how much you want to spend updating user-space. OpenJDK is fine, if you need a JIT you'll have to add support for that architecture anyway. But others are arch-agnostic, you just recompile to your target. It's not an ABI problem, more of an API one.

The arch-agnosticism is my hope with this personality flag: it can be added to arch-agnostic userspace code and allow the application to work everywhere, but it does have the downside of requiring that change to user-space code.

> The x86 case (and powerpc/arm64) was different, the 47-bit worked for a long time before expanding it. So it made a lot of sense to keep the same default.

Yes it is very reasonable that this solution was selected for those architectures since the support for higher address spaces evolved in the manner that it did!

- Charlie

> Anyway, the prctl() can go both ways, either expanding or limiting the default address space. So I'd be fine with such an interface.
>
> --
> Catalin
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Fri, Sep 13, 2024 at 11:08:23AM +0100, Catalin Marinas wrote:
> On Thu, Sep 12, 2024 at 02:15:59PM -0700, Charlie Jenkins wrote:
> > On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote:
> > > On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
> > > > Opting-in to the higher address space is reasonable. However, it is not my preference, because the purpose of this flag is to ensure that allocations do not exceed 47-bits, so it is a clearer ABI to have the applications that want this guarantee to be the ones setting the flag, rather than the applications that want the higher bits setting the flag.

[...]

> Anyway, the prctl() can go both ways, either expanding or limiting the default address space. So I'd be fine with such an interface.

Ah, I just realised (while reading Lorenzo's reply) that we can't really restrict the space via a prctl() as we have the main thread stack already allocated by the kernel before the user code starts. You may need to limit this stack as well, not just the later heap allocations (anonymous mmap()).

--
Catalin
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Thu, Sep 12, 2024 at 02:15:59PM -0700, Charlie Jenkins wrote:
> On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote:
> > On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
> > > Opting-in to the higher address space is reasonable. However, it is not my preference, because the purpose of this flag is to ensure that allocations do not exceed 47-bits, so it is a clearer ABI to have the applications that want this guarantee to be the ones setting the flag, rather than the applications that want the higher bits setting the flag.
> >
> > Yes, this would be ideal. Unfortunately those applications don't know they need to set a flag in order to work.
>
> It's not a regression: the applications never worked (on platforms that do not have this default). The 47-bit default would allow applications that didn't work to start working at the cost of a non-ideal ABI. That doesn't seem like a reasonable tradeoff to me. If applications want to run on new hardware that has different requirements, shouldn't they be required to update rather than expect the kernel will solve their problems for them?

That's a valid point but it depends on the application and how much you want to spend updating user-space. OpenJDK is fine, if you need a JIT you'll have to add support for that architecture anyway. But others are arch-agnostic, you just recompile to your target. It's not an ABI problem, more of an API one.

The x86 case (and powerpc/arm64) was different, the 47-bit worked for a long time before expanding it. So it made a lot of sense to keep the same default.

Anyway, the prctl() can go both ways, either expanding or limiting the default address space. So I'd be fine with such an interface.

--
Catalin
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Wed, Sep 11, 2024 at 11:18:12PM GMT, Charlie Jenkins wrote:
> On Wed, Sep 11, 2024 at 07:21:27PM +0100, Catalin Marinas wrote:
> > On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
> > > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
> > > > * Catalin Marinas [240906 07:44]:
> > > > > On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
> > > > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > > > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann wrote:
> > > > > > >> It's also unclear to me how we want this flag to interact with the existing logic in arch_get_mmap_end(), which attempts to limit the default mapping to a 47-bit address space already.
> > > > > > >
> > > > > > > To optimize RISC-V progress, I recommend:
> > > > > > >
> > > > > > > Step 1: Approve the patch.
> > > > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > > > > > > Step 3: Wait approximately several iterations for Go & OpenJDK
> > > > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
> >
> > Point 4 is an ABI change. What guarantees that there isn't still software out there that relies on the old behaviour?
>
> Yeah I don't think it would be desirable to remove the 47 bit constraint in architectures that already have it.
>
> > > > > > I really want to first see a plausible explanation about why RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW like all the other major architectures (x86, arm64, powerpc64),
> > > > >
> > > > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default configuration. We end up with a 47-bit limit with 16K pages but for a different reason that has to do with LPA2 support (I doubt we need this for the user mapping but we need to untangle some of the macros there; that's for a separate discussion).
> > > > >
> > > > > That said, we haven't encountered any user space problems with a 48-bit DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar approach (47 or 48 bit default limit). Better to have some ABI consistency between architectures. One can still ask for addresses above this default limit via mmap().
> > > >
> > > > I think that is best as well.
> > > >
> > > > Can we please just do what x86 and arm64 do?
> > >
> > > I responded to Arnd in the other thread, but I am still not convinced that the solution that x86 and arm64 have selected is the best solution. The solution of defaulting to 47 bits does allow applications the ability to get addresses that are below 47 bits. However, due to differences across architectures it doesn't seem possible to have all architectures default to the same value. Additionally, this flag will be able to help users avoid potential bugs where a hint address is passed that causes upper bits of a VA to be used.
> >
> > The reason we added this limit on arm64 is that we noticed programs using the top 8 bits of a 64-bit pointer for additional information. IIRC, it wasn't even openJDK but some JavaScript JIT. We could have taught those programs of a new flag but since we couldn't tell how many are out there, it was the safest to default to a smaller limit and opt in to the higher one. Such opt-in is via mmap() but if you prefer a prctl() flag, that's fine by me as well (though I think this should be opt-in to higher addresses rather than opt-out of the higher addresses).
>
> The mmap() flag was used in previous versions but was decided against because this feature is more useful if it is process-wide. A personality() flag was chosen instead of a prctl() flag because there existed other flags in personality() that were similar. I am tempted to use prctl() however because then we could have an additional arg to select the exact number of bits that should be reserved (rather than being fixed at 47 bits).

I am very much not in favour of a prctl(); it would require us to add state limiting the address space and the timing of it becomes critical. Then we have the same issue we do with the other proposals as to - what happens if this is too low?

What is 'too low' varies by architecture, and for 32-bit architectures could get quite... problematic.

And again, what is the RoI here - we're introducing maintenance burden and edge cases vs. the x86 solution in order to... accommodate things that need more than 128 TiB of address space? A problem that does not appear to exist in reality?

I suggested the personality approach as the least impactful compromise way of this series working, but I think after what Arnd has said (and please forgive me if I've missed further discussion, have been dipping in and out of this!) - adapting riscv to the approach we take elsewhere seems the most sensible solution
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote:
> On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
> > Opting-in to the higher address space is reasonable. However, it is not my preference, because the purpose of this flag is to ensure that allocations do not exceed 47-bits, so it is a clearer ABI to have the applications that want this guarantee to be the ones setting the flag, rather than the applications that want the higher bits setting the flag.
>
> Yes, this would be ideal. Unfortunately those applications don't know they need to set a flag in order to work.

It's not a regression: the applications never worked (on platforms that do not have this default). The 47-bit default would allow applications that didn't work to start working at the cost of a non-ideal ABI. That doesn't seem like a reasonable tradeoff to me. If applications want to run on new hardware that has different requirements, shouldn't they be required to update rather than expect the kernel will solve their problems for them?

> A slightly better option is to leave the default 47-bit at the kernel ABI level and have the libc/dynamic loader issue the prctl(). You can control the default with environment variables if needed.

Having glibc set the 47-bit requirement could make it slightly easier for applications since they would only have to set the environment variable. After the kernel interface is approved I can look into supporting that.

- Charlie

> We do something similar in glibc for arm64 MTE. When MTE is enabled, the top byte of an allocated pointer contains the tag that must not be corrupted. We left the decision to the C library via the glibc.mem.tagging tunable (Android has something similar via the app manifest). An app can change the default if it wants but if you run with old glibc or no environment variable to say otherwise, the default would be safe. Distros can set the environment to be the maximum range by default if they know the apps included have been upgraded and tested.
>
> --
> Catalin
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
> Opting-in to the higher address space is reasonable. However, it is not my preference, because the purpose of this flag is to ensure that allocations do not exceed 47-bits, so it is a clearer ABI to have the applications that want this guarantee to be the ones setting the flag, rather than the applications that want the higher bits setting the flag.

Yes, this would be ideal. Unfortunately those applications don't know they need to set a flag in order to work.

A slightly better option is to leave the default 47-bit at the kernel ABI level and have the libc/dynamic loader issue the prctl(). You can control the default with environment variables if needed.

We do something similar in glibc for arm64 MTE. When MTE is enabled, the top byte of an allocated pointer contains the tag that must not be corrupted. We left the decision to the C library via the glibc.mem.tagging tunable (Android has something similar via the app manifest). An app can change the default if it wants but if you run with old glibc or no environment variable to say otherwise, the default would be safe. Distros can set the environment to be the maximum range by default if they know the apps included have been upgraded and tested.

--
Catalin
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Wed, Sep 11, 2024 at 11:38:55PM +1000, Michael Ellerman wrote:
> Geert Uytterhoeven writes:
> > Hi Christophe,
> >
> > On Tue, Sep 10, 2024 at 11:21 AM Christophe Leroy wrote:
> >> >>> diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
> >> >>> index 49796b7756af..cd3b8c154d9b 100644
> >> >>> --- a/include/uapi/linux/personality.h
> >> >>> +++ b/include/uapi/linux/personality.h
> >> >>> @@ -22,6 +22,7 @@ enum {
> >> >>>  	WHOLE_SECONDS =		0x2000000,
> >> >>>  	STICKY_TIMEOUTS =	0x4000000,
> >> >>>  	ADDR_LIMIT_3GB =	0x8000000,
> >> >>> +	ADDR_LIMIT_47BIT =	0x10000000,
> >> >>>  };
> >> >>
> >> >> I wonder if ADDR_LIMIT_128T would be clearer?
> >> >
> >> > I don't follow, what does 128T represent?
> >>
> >> 128T is 128 Terabytes, that's the maximum size achievable with a 47BIT address. That naming would be more consistent with the ADDR_LIMIT_3GB just above, which means a 3 Gigabytes limit.
> >
> > Hence ADDR_LIMIT_128TB?
>
> Yes it should be 128TB. Typo by me.
>
> cheers

47BIT was selected because the use case for this flag is for applications that want to store data in the upper bits of a virtual address space. In this case, how large the virtual address space is doesn't matter; what matters is the number of bits that are being used, and hence the number of bits that are free.

- Charlie
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Wed, Sep 11, 2024 at 07:21:27PM +0100, Catalin Marinas wrote:
> On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
> > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
> > > * Catalin Marinas [240906 07:44]:
> > > > On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
> > > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann wrote:
> > > > > >> It's also unclear to me how we want this flag to interact with the existing logic in arch_get_mmap_end(), which attempts to limit the default mapping to a 47-bit address space already.
> > > > > >
> > > > > > To optimize RISC-V progress, I recommend:
> > > > > >
> > > > > > Step 1: Approve the patch.
> > > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > > > > > Step 3: Wait approximately several iterations for Go & OpenJDK
> > > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
>
> Point 4 is an ABI change. What guarantees that there isn't still software out there that relies on the old behaviour?

Yeah I don't think it would be desirable to remove the 47 bit constraint in architectures that already have it.

> > > > > I really want to first see a plausible explanation about why RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW like all the other major architectures (x86, arm64, powerpc64),
> > > >
> > > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default configuration. We end up with a 47-bit limit with 16K pages but for a different reason that has to do with LPA2 support (I doubt we need this for the user mapping but we need to untangle some of the macros there; that's for a separate discussion).
> > > >
> > > > That said, we haven't encountered any user space problems with a 48-bit DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar approach (47 or 48 bit default limit). Better to have some ABI consistency between architectures. One can still ask for addresses above this default limit via mmap().
> > >
> > > I think that is best as well.
> > >
> > > Can we please just do what x86 and arm64 do?
> >
> > I responded to Arnd in the other thread, but I am still not convinced that the solution that x86 and arm64 have selected is the best solution. The solution of defaulting to 47 bits does allow applications the ability to get addresses that are below 47 bits. However, due to differences across architectures it doesn't seem possible to have all architectures default to the same value. Additionally, this flag will be able to help users avoid potential bugs where a hint address is passed that causes upper bits of a VA to be used.
>
> The reason we added this limit on arm64 is that we noticed programs using the top 8 bits of a 64-bit pointer for additional information. IIRC, it wasn't even openJDK but some JavaScript JIT. We could have taught those programs of a new flag but since we couldn't tell how many are out there, it was the safest to default to a smaller limit and opt in to the higher one. Such opt-in is via mmap() but if you prefer a prctl() flag, that's fine by me as well (though I think this should be opt-in to higher addresses rather than opt-out of the higher addresses).

The mmap() flag was used in previous versions but was decided against because this feature is more useful if it is process-wide. A personality() flag was chosen instead of a prctl() flag because there existed other flags in personality() that were similar. I am tempted to use prctl() however because then we could have an additional arg to select the exact number of bits that should be reserved (rather than being fixed at 47 bits).

Opting-in to the higher address space is reasonable. However, it is not my preference, because the purpose of this flag is to ensure that allocations do not exceed 47-bits, so it is a clearer ABI to have the applications that want this guarantee to be the ones setting the flag, rather than the applications that want the higher bits setting the flag.

- Charlie

> --
> Catalin
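The prctl()-with-an-argument idea Charlie floats might look roughly like the sketch below. To be clear, PR_SET_ADDR_LIMIT_BITS is entirely made up for illustration; no such prctl() exists in any kernel:

    #include <sys/prctl.h>

    /* Hypothetical command number - illustrative only, not a real prctl(). */
    #define PR_SET_ADDR_LIMIT_BITS 0x100

    static int limit_va_bits(unsigned long bits)
    {
        /* e.g. limit_va_bits(47) would confine future mappings to 47
         * bits, limit_va_bits(39) to an sv39-sized space, and so on. */
        return prctl(PR_SET_ADDR_LIMIT_BITS, bits, 0, 0, 0);
    }

Lorenzo's objection below is about exactly this shape of interface: the limit becomes per-process state whose timing relative to existing mappings (and the stack, as Catalin later notes) is hard to get right.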
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Wed, Sep 11, 2024 at 07:25:08AM +0000, Arnd Bergmann wrote:
> On Wed, Sep 11, 2024, at 00:45, Charlie Jenkins wrote:
> > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
> >
> > I responded to Arnd in the other thread, but I am still not convinced that the solution that x86 and arm64 have selected is the best solution. The solution of defaulting to 47 bits does allow applications the ability to get addresses that are below 47 bits. However, due to differences across architectures it doesn't seem possible to have all architectures default to the same value. Additionally, this flag will be able to help users avoid potential bugs where a hint address is passed that causes upper bits of a VA to be used.
> >
> > The other issue I have with this is that if there is not a hint address specified to be greater than 47 bits on x86, then mmap() may return an address that is greater than 47-bits. The documentation in Documentation/arch/x86/x86_64/5level-paging.rst says:
> >
> > "If hint address set above 47-bit, but MAP_FIXED is not specified, we try to look for unmapped area by specified address. If it's already occupied, we look for unmapped area in *full* address space, rather than from 47-bit window."
>
> This is also in the commit message of b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace"), which introduced it. However, I don't actually see the fallback to the full address space; instead the actual behavior seems to be the same as arm64.
>
> Am I missing something in the x86 implementation, or do we just need to update the documentation?
>
> Arnd

Yeah I guess it is incorrect documentation then? It seems more reasonable to me to have a hint address fall back onto the larger address space, because otherwise the "hint" address can cause allocations to fail even if there is space above the 47-bit limit. This is another reason I wanted to avoid having this default behavior on riscv, to not have this abuse of the hint address.

- Charlie
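The hint behaviour under discussion is easy to observe on an x86-64 machine with 5-level paging enabled; a small test along these lines (illustrative, with the caveat that the fallback behaviour when the hinted range is occupied is exactly what this message disputes):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 4096;

        /* No hint: the kernel keeps this inside the 47-bit window. */
        void *lo = mmap(NULL, len, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Hint above 47 bits: on x86-64 with 5-level paging this opts
         * the mapping into the wider 57-bit space. */
        void *hi = mmap((void *)(1UL << 48), len, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        printf("%p %p\n", lo, hi);
        return 0;
    }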
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote: > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote: > > * Catalin Marinas [240906 07:44]: > > > On Fri, Sep 06, 2024 at 09:55:42AM +, Arnd Bergmann wrote: > > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote: > > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann wrote: > > > > >> It's also unclear to me how we want this flag to interact with > > > > >> the existing logic in arch_get_mmap_end(), which attempts to > > > > >> limit the default mapping to a 47-bit address space already. > > > > > > > > > > To optimize RISC-V progress, I recommend: > > > > > > > > > > Step 1: Approve the patch. > > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it. > > > > > Step 3: Wait approximately several iterations for Go & OpenJDK > > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end() Point 4 is an ABI change. What guarantees that there isn't still software out there that relies on the old behaviour? > > > > I really want to first see a plausible explanation about why > > > > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW > > > > like all the other major architectures (x86, arm64, powerpc64), > > > > > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default > > > configuration. We end up with a 47-bit with 16K pages but for a > > > different reason that has to do with LPA2 support (I doubt we need this > > > for the user mapping but we need to untangle some of the macros there; > > > that's for a separate discussion). > > > > > > That said, we haven't encountered any user space problems with a 48-bit > > > DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar > > > approach (47 or 48 bit default limit). Better to have some ABI > > > consistency between architectures. One can still ask for addresses above > > > this default limit via mmap(). > > > > I think that is best as well. > > > > Can we please just do what x86 and arm64 does? > > I responded to Arnd in the other thread, but I am still not convinced > that the solution that x86 and arm64 have selected is the best solution. > The solution of defaulting to 47 bits does allow applications the > ability to get addresses that are below 47 bits. However, due to > differences across architectures it doesn't seem possible to have all > architectures default to the same value. Additionally, this flag will be > able to help users avoid potential bugs where a hint address is passed > that causes upper bits of a VA to be used. The reason we added this limit on arm64 is that we noticed programs using the top 8 bits of a 64-bit pointer for additional information. IIRC, it wasn't even openJDK but some JavaScript JIT. We could have taught those programs of a new flag but since we couldn't tell how many are out there, it was the safest to default to a smaller limit and opt in to the higher one. Such opt-in is via mmap() but if you prefer a prctl() flag, that's fine by me as well (though I think this should be opt-in to higher addresses rather than opt-out of the higher addresses). -- Catalin
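The pattern Catalin describes looks roughly like the following (a schematic of top-bits tagging, not any particular JIT's encoding; the JavaScript case used the top byte, while the masks here assume a 47-bit usable range):

#include <stdint.h>

#define TAG_SHIFT 47
#define ADDR_MASK ((UINT64_C(1) << TAG_SHIFT) - 1)

/* Stash a small tag above bit 46; this is only safe while every
 * address the kernel hands out fits in 47 bits. */
static inline void *tag_ptr(void *p, uint64_t tag)
{
	return (void *)(uintptr_t)(((uintptr_t)p & ADDR_MASK) |
				   (tag << TAG_SHIFT));
}

static inline void *untag_ptr(void *p)
{
	return (void *)((uintptr_t)p & ADDR_MASK);
}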
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
"Arnd Bergmann" writes: > On Mon, Sep 9, 2024, at 23:22, Charlie Jenkins wrote: >> On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote: >>> On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote: >>> The intent is to optionally be able to run a process that keeps higher bits >>> free for tagging and to be sure no memory mapping in the process will >>> clobber these (correct me if I'm wrong Charlie! :) ... > Let's see what the other architectures do and then come up with > a way that fixes the pointer tagging case first on those that are > broken. We can see if there needs to be an extra flag after that. > Here is what I found: > > - x86_64 uses DEFAULT_MAP_WINDOW of BIT(47), uses a 57 bit > address space when an addr hint is passed. > - arm64 uses DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns > higher 52-bit addresses when either a hint is passed or > CONFIG_EXPERT and CONFIG_ARM64_FORCE_52BIT is set (this > is a debugging option) > - ppc64 uses a DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), > returns 52 bit address when an addr hint is passed It's 46 or 47 depending on PAGE_SIZE (4K or 64K): $ git grep "define DEFAULT_MAP_WINDOW_USER64" arch/powerpc/include/asm/task_size_64.h arch/powerpc/include/asm/task_size_64.h:#define DEFAULT_MAP_WINDOW_USER64 TASK_SIZE_128TB arch/powerpc/include/asm/task_size_64.h:#define DEFAULT_MAP_WINDOW_USER64 TASK_SIZE_64TB cheers
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
Geert Uytterhoeven writes: > Hi Christophe, > > On Tue, Sep 10, 2024 at 11:21 AM Christophe Leroy > wrote: >> >>> diff --git a/include/uapi/linux/personality.h >> >>> b/include/uapi/linux/personality.h >> >>> index 49796b7756af..cd3b8c154d9b 100644 >> >>> --- a/include/uapi/linux/personality.h >> >>> +++ b/include/uapi/linux/personality.h >> >>> @@ -22,6 +22,7 @@ enum { >> >>> WHOLE_SECONDS = 0x200, >> >>> STICKY_TIMEOUTS = 0x400, >> >>> ADDR_LIMIT_3GB =0x800, >> >>> + ADDR_LIMIT_47BIT = 0x1000, >> >>> }; >> >> >> >> I wonder if ADDR_LIMIT_128T would be clearer? >> >> >> > >> > I don't follow, what does 128T represent? >> >> 128T is 128 Terabytes, that's the maximum size achievable with a 47BIT >> address, that naming would be more consistent with the ADDR_LIMIT_3GB >> just above that means a 3 Gigabytes limit. > > Hence ADDR_LIMIT_128TB? Yes it should be 128TB. Typo by me. cheers
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
Charlie Jenkins writes: > On Fri, Sep 06, 2024 at 04:59:40PM +1000, Michael Ellerman wrote: >> Charlie Jenkins writes: >> > Create a personality flag ADDR_LIMIT_47BIT to support applications >> > that wish to transition from running in environments that support at >> > most 47-bit VAs to environments that support larger VAs. This >> > personality can be set to cause all allocations to be below the 47-bit >> > boundary. Using MAP_FIXED with mmap() will bypass this restriction. >> > >> > Signed-off-by: Charlie Jenkins >> > --- >> > include/uapi/linux/personality.h | 1 + >> > mm/mmap.c| 3 +++ >> > 2 files changed, 4 insertions(+) >> > >> > diff --git a/include/uapi/linux/personality.h >> > b/include/uapi/linux/personality.h >> > index 49796b7756af..cd3b8c154d9b 100644 >> > --- a/include/uapi/linux/personality.h >> > +++ b/include/uapi/linux/personality.h >> > @@ -22,6 +22,7 @@ enum { >> >WHOLE_SECONDS = 0x200, >> >STICKY_TIMEOUTS = 0x400, >> >ADDR_LIMIT_3GB =0x800, >> > + ADDR_LIMIT_47BIT = 0x1000, >> > }; >> >> I wonder if ADDR_LIMIT_128T would be clearer? >> > > I don't follow, what does 128T represent? Sorry, as Christophe explained it's 128 Terabytes, which is the actual value of the address limit. I think expressing it as the address value is probably more widely understood, and would also match ADDR_LIMIT_3GB. >> Have you looked at writing an update for the personality(2) man page? :) > > I will write an update to the man page if this patch is approved! Yeah fair enough. My (poorly expressed) point was that trying to describe the flag for the man page might highlight that using the 47BIT name requires more explanation. cheers
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Wed, Sep 11, 2024, at 00:45, Charlie Jenkins wrote: > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote: > > I responded to Arnd in the other thread, but I am still not convinced > that the solution that x86 and arm64 have selected is the best solution. > The solution of defaulting to 47 bits does allow applications the > ability to get addresses that are below 47 bits. However, due to > differences across architectures it doesn't seem possible to have all > architectures default to the same value. Additionally, this flag will be > able to help users avoid potential bugs where a hint address is passed > that causes upper bits of a VA to be used. > > The other issue I have with this is that if there is not a hint address > specified to be greater than 47 bits on x86, then mmap() may return an > address that is greater than 47-bits. The documentation in > Documentation/arch/x86/x86_64/5level-paging.rst says: > > "If hint address set above 47-bit, but MAP_FIXED is not specified, we try > to look for unmapped area by specified address. If it's already > occupied, we look for unmapped area in *full* address space, rather than > from 47-bit window." This is also in the commit message of b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace"), which introduced it. However, I don't actually see the fallback to the full address space, instead the actual behavior seems to be the same as arm64. Am I missing something in the x86 implementation, or do we just need to update the documentation? Arnd
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote: > * Catalin Marinas [240906 07:44]: > > On Fri, Sep 06, 2024 at 09:55:42AM +, Arnd Bergmann wrote: > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote: > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann wrote: > > > >> It's also unclear to me how we want this flag to interact with > > > >> the existing logic in arch_get_mmap_end(), which attempts to > > > >> limit the default mapping to a 47-bit address space already. > > > > > > > > To optimize RISC-V progress, I recommend: > > > > > > > > Step 1: Approve the patch. > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it. > > > > Step 3: Wait approximately several iterations for Go & OpenJDK > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end() > > > > > > I really want to first see a plausible explanation about why > > > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW > > > like all the other major architectures (x86, arm64, powerpc64), > > > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default > > configuration. We end up with a 47-bit with 16K pages but for a > > different reason that has to do with LPA2 support (I doubt we need this > > for the user mapping but we need to untangle some of the macros there; > > that's for a separate discussion). > > > > That said, we haven't encountered any user space problems with a 48-bit > > DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar > > approach (47 or 48 bit default limit). Better to have some ABI > > consistency between architectures. One can still ask for addresses above > > this default limit via mmap(). > > I think that is best as well. > > Can we please just do what x86 and arm64 does? > > Thanks, > Liam I responded to Arnd in the other thread, but I am still not convinced that the solution that x86 and arm64 have selected is the best solution. The solution of defaulting to 47 bits does allow applications the ability to get addresses that are below 47 bits. However, due to differences across architectures it doesn't seem possible to have all architectures default to the same value. Additionally, this flag will be able to help users avoid potential bugs where a hint address is passed that causes upper bits of a VA to be used. The other issue I have with this is that if there is not a hint address specified to be greater than 47 bits on x86, then mmap() may return an address that is greater than 47-bits. The documentation in Documentation/arch/x86/x86_64/5level-paging.rst says: "If hint address set above 47-bit, but MAP_FIXED is not specified, we try to look for unmapped area by specified address. If it's already occupied, we look for unmapped area in *full* address space, rather than from 47-bit window." arm64 on the other hand defines this as only being able to opt-into the 52-bit VA space with the hint address, and my understanding is that mmap() will not fall back to the 52-bit address space. Please correct me if I am wrong. From Documentation/arch/arm64/memory.rst: "To maintain compatibility with software that relies on the ARMv8.0 VA space maximum size of 48-bits, the kernel will, by default, return virtual addresses to userspace from a 48-bit range. "Software can "opt-in" to receiving VAs from a 52-bit space by specifying an mmap hint parameter that is larger than 48-bit." This is an inconsistency I am trying to solve with this personality flag. - Charlie
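The difference can be probed with a pair of calls like this (a sketch assuming hardware and kernel support for the larger VA space; where the second mapping lands is exactly the x86-versus-arm64 behaviour in question):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/* No hint: the kernel allocates inside DEFAULT_MAP_WINDOW. */
	void *lo = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Hint well above both the 47- and 48-bit windows: opts this
	 * one mapping into the full space (57-bit on x86 with 5-level
	 * paging, 52-bit on arm64 with LVA). */
	void *hi = mmap((void *)(1UL << 50), 4096, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	printf("default window: %p, high hint: %p\n", lo, hi);
	return 0;
}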
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Tue, Sep 10, 2024 at 09:13:33AM +, Arnd Bergmann wrote: > On Mon, Sep 9, 2024, at 23:22, Charlie Jenkins wrote: > > On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote: > >> On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote: > >> The intent is to optionally be able to run a process that keeps higher bits > >> free for tagging and to be sure no memory mapping in the process will > >> clobber these (correct me if I'm wrong Charlie! :) > >> > >> So you really wouldn't want this if you are using tagged pointers, you'd > >> want to be sure literally nothing touches the higher bits. > > My understanding was that the purpose of the existing design > is to allow applications to ask for a high address without having > to resort to the complexity of MAP_FIXED. > > In particular, I'm sure there is precedent for applications that > want both tagged pointers (for most mappings) and untagged pointers > (for large mappings). With a per-mm_struct or per-task_struct > setting you can't do that. > > > Various architectures handle the hint address differently, but it > > appears that the only case across any architecture where an address > > above 47 bits will be returned is if the application had a hint address > > with a value greater than 47 bits and was using the MAP_FIXED flag. > > MAP_FIXED bypasses all other checks so I was assuming that it would be > > logical for MAP_FIXED to bypass this as well. If MAP_FIXED is not set, > > then the intent is for no hint address to cause a value greater than 47 > > bits to be returned. > > I don't think the MAP_FIXED case is that interesting here because > it has to work in both fixed and non-fixed mappings. > > >> This would be more consistent vs. other arches. > > > > Yes riscv is an outlier here. The reason I am pushing for something like > > a flag to restrict the address space rather than setting it to be the > > default is it seems like if applications are relying on upper bits to be > > free, then they should be explicitly asking the kernel to keep them free > > rather than assuming them to be free. > > Let's see what the other architectures do and then come up with > a way that fixes the pointer tagging case first on those that are > broken. We can see if there needs to be an extra flag after that. > Here is what I found: > > - x86_64 uses DEFAULT_MAP_WINDOW of BIT(47), uses a 57 bit > address space when an addr hint is passed. > - arm64 uses DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns > higher 52-bit addresses when either a hint is passed or > CONFIG_EXPERT and CONFIG_ARM64_FORCE_52BIT is set (this > is a debugging option) > - ppc64 uses a DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), > returns 52 bit address when an addr hint is passed > - riscv uses a DEFAULT_MAP_WINDOW of BIT(47) but only uses > it for allocating the stack below, ignoring it for normal > mappings > - s390 has no DEFAULT_MAP_WINDOW but tried to allocate in > the current number of pgtable levels and only upgrades to > the next level (31, 42, 53, 64 bits) if a hint is passed or > the current level is exhausted. > - loongarch64 has no DEFAULT_MAP_WINDOW, and a default VA > space of 47 bits (16K pages, 3 levels), but can support > a 55 bit space (64K pages, 3 levels). > - sparc has no DEFAULT_MAP_WINDOW and up to 52 bit VA space. > It may allocate both positive and negative addresses in > there. (?) > - mips64, parisc64 and alpha have no DEFAULT_MAP_WINDOW and > at most 48, 41 or 39 address bits, respectively. 
> > I would suggest these changes: > > - make riscv enforce DEFAULT_MAP_WINDOW like x86_64, arm64 >and ppc64, leave it at 47 > > - add DEFAULT_MAP_WINDOW on loongarch64 (47/48 bits > based on page size), sparc (48 bits) and s390 (unsure if > 42, 53, 47 or 48 bits) > > - leave the rest unchanged. > >Arnd Changing all architectures to have a standardized DEFAULT_MAP_WINDOW mostly solves the problem. However, I am concerned that it is fragile for applications to rely on a default like this. Having the personality bit flag is supposed to provide an intuitive ABI for users that guarantees that they will not accidentally request for memory outside of the boundary that they specified. Also you bring up that the DEFAULT_MAP_WINDOW would not be able to be standardized across architectures, so we still have the problem that this default behavior will be different across architectures which I am trying to solve. - Charlie
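The fragility argument can be made concrete: without a process-wide guarantee, a library that tags pointers has to validate every mapping itself, along the lines of this sketch (helper name invented here for illustration):

#include <stddef.h>
#include <stdint.h>

/* Check that a mapping stayed inside the tag-safe range; needed on
 * every allocation if the default window differs per architecture. */
static int fits_47bit(void *p, size_t len)
{
	return ((uintptr_t)p + len) <= (UINT64_C(1) << 47);
}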
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
* Catalin Marinas [240906 07:44]: > On Fri, Sep 06, 2024 at 09:55:42AM +, Arnd Bergmann wrote: > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote: > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann wrote: > > >> It's also unclear to me how we want this flag to interact with > > >> the existing logic in arch_get_mmap_end(), which attempts to > > >> limit the default mapping to a 47-bit address space already. > > > > > > To optimize RISC-V progress, I recommend: > > > > > > Step 1: Approve the patch. > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it. > > > Step 3: Wait approximately several iterations for Go & OpenJDK > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end() > > > > I really want to first see a plausible explanation about why > > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW > > like all the other major architectures (x86, arm64, powerpc64), > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default > configuration. We end up with a 47-bit with 16K pages but for a > different reason that has to do with LPA2 support (I doubt we need this > for the user mapping but we need to untangle some of the macros there; > that's for a separate discussion). > > That said, we haven't encountered any user space problems with a 48-bit > DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar > approach (47 or 48 bit default limit). Better to have some ABI > consistency between architectures. One can still ask for addresses above > this default limit via mmap(). I think that is best as well. Can we please just do what x86 and arm64 does? Thanks, Liam
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
Hi Christophe, On Tue, Sep 10, 2024 at 11:21 AM Christophe Leroy wrote: > >>> diff --git a/include/uapi/linux/personality.h > >>> b/include/uapi/linux/personality.h > >>> index 49796b7756af..cd3b8c154d9b 100644 > >>> --- a/include/uapi/linux/personality.h > >>> +++ b/include/uapi/linux/personality.h > >>> @@ -22,6 +22,7 @@ enum { > >>> WHOLE_SECONDS = 0x200, > >>> STICKY_TIMEOUTS = 0x400, > >>> ADDR_LIMIT_3GB =0x800, > >>> + ADDR_LIMIT_47BIT = 0x1000, > >>> }; > >> > >> I wonder if ADDR_LIMIT_128T would be clearer? > >> > > > > I don't follow, what does 128T represent? > > 128T is 128 Terabytes, that's the maximum size achievable with a 47BIT > address, that naming would be more consistent with the ADDR_LIMIT_3GB > just above that means a 3 Gigabytes limit. Hence ADDR_LIMIT_128TB? Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
>>> diff --git a/include/uapi/linux/personality.h >>> b/include/uapi/linux/personality.h >>> index 49796b7756af..cd3b8c154d9b 100644 >>> --- a/include/uapi/linux/personality.h >>> +++ b/include/uapi/linux/personality.h >>> @@ -22,6 +22,7 @@ enum { >>> WHOLE_SECONDS = 0x200, >>> STICKY_TIMEOUTS = 0x400, >>> ADDR_LIMIT_3GB =0x800, >>> + ADDR_LIMIT_47BIT = 0x1000, >>> }; >> >> I wonder if ADDR_LIMIT_128T would be clearer? > > I don't follow, what does 128T represent? 128T is 128 Terabytes, that's the maximum size achievable with a 47BIT address, that naming would be more consistent with the ADDR_LIMIT_3GB just above that means a 3 Gigabytes limit. Christophe
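Spelling the arithmetic out: 2^47 bytes = 2^7 * 2^40 bytes = 128 TiB, in the same way that the 3 GB of ADDR_LIMIT_3GB is 3 * 2^30 bytes.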
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Mon, Sep 9, 2024, at 23:22, Charlie Jenkins wrote: > On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote: >> On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote: >> The intent is to optionally be able to run a process that keeps higher bits >> free for tagging and to be sure no memory mapping in the process will >> clobber these (correct me if I'm wrong Charlie! :) >> >> So you really wouldn't want this if you are using tagged pointers, you'd >> want to be sure literally nothing touches the higher bits. My understanding was that the purpose of the existing design is to allow applications to ask for a high address without having to resort to the complexity of MAP_FIXED. In particular, I'm sure there is precedent for applications that want both tagged pointers (for most mappings) and untagged pointers (for large mappings). With a per-mm_struct or per-task_struct setting you can't do that. > Various architectures handle the hint address differently, but it > appears that the only case across any architecture where an address > above 47 bits will be returned is if the application had a hint address > with a value greater than 47 bits and was using the MAP_FIXED flag. > MAP_FIXED bypasses all other checks so I was assuming that it would be > logical for MAP_FIXED to bypass this as well. If MAP_FIXED is not set, > then the intent is for no hint address to cause a value greater than 47 > bits to be returned. I don't think the MAP_FIXED case is that interesting here because it has to work in both fixed and non-fixed mappings. >> This would be more consistent vs. other arches. > > Yes riscv is an outlier here. The reason I am pushing for something like > a flag to restrict the address space rather than setting it to be the > default is it seems like if applications are relying on upper bits to be > free, then they should be explicitly asking the kernel to keep them free > rather than assuming them to be free. Let's see what the other architectures do and then come up with a way that fixes the pointer tagging case first on those that are broken. We can see if there needs to be an extra flag after that. Here is what I found: - x86_64 uses DEFAULT_MAP_WINDOW of BIT(47), uses a 57 bit address space when an addr hint is passed. - arm64 uses DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns higher 52-bit addresses when either a hint is passed or CONFIG_EXPERT and CONFIG_ARM64_FORCE_52BIT is set (this is a debugging option) - ppc64 uses a DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns 52 bit address when an addr hint is passed - riscv uses a DEFAULT_MAP_WINDOW of BIT(47) but only uses it for allocating the stack below, ignoring it for normal mappings - s390 has no DEFAULT_MAP_WINDOW but tried to allocate in the current number of pgtable levels and only upgrades to the next level (31, 42, 53, 64 bits) if a hint is passed or the current level is exhausted. - loongarch64 has no DEFAULT_MAP_WINDOW, and a default VA space of 47 bits (16K pages, 3 levels), but can support a 55 bit space (64K pages, 3 levels). - sparc has no DEFAULT_MAP_WINDOW and up to 52 bit VA space. It may allocate both positive and negative addresses in there. (?) - mips64, parisc64 and alpha have no DEFAULT_MAP_WINDOW and at most 48, 41 or 39 address bits, respectively. 
I would suggest these changes: - make riscv enforce DEFAULT_MAP_WINDOW like x86_64, arm64 and ppc64, leave it at 47 - add DEFAULT_MAP_WINDOW on loongarch64 (47/48 bits based on page size), sparc (48 bits) and s390 (unsure if 42, 53, 47 or 48 bits) - leave the rest unchanged. Arnd
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote: > (Sorry having issues with my IPv6 setup that duplicated the original email... > > On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote: > > On Fri, Sep 6, 2024, at 08:14, Lorenzo Stoakes wrote: > > > On Fri, Sep 06, 2024 at 07:17:44AM GMT, Arnd Bergmann wrote: > > >> On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote: > > >> > Create a personality flag ADDR_LIMIT_47BIT to support applications > > >> > that wish to transition from running in environments that support at > > >> > most 47-bit VAs to environments that support larger VAs. This > > >> > personality can be set to cause all allocations to be below the 47-bit > > >> > boundary. Using MAP_FIXED with mmap() will bypass this restriction. > > >> > > > >> > Signed-off-by: Charlie Jenkins > > >> > > >> I think having an architecture-independent mechanism to limit the size > > >> of the 64-bit address space is useful in general, and we've discussed > > >> the same thing for arm64 in the past, though we have not actually > > >> reached an agreement on the ABI previously. > > > > > > The thread on the original proposals attests to this being rather a > > > fraught > > > topic, and I think the weight of opinion was more so in favour of opt-in > > > rather than opt-out. > > > > You mean opt-in to using the larger addresses like we do on arm64 and > > powerpc, while "opt-out" means a limit as Charlie suggested? > > I guess I'm not using brilliant terminology here haha! > > To clarify - the weight of opinion was for a situation where the address > space is limited, except if you set a hint above that (you could call that > opt-out or opt-in depending which way you look at it, so yeah ok very > unclear sorry!). > > It was against the MAP_ flag and also I think a _flexible_ per-process > limit is also questionable as you might end up setting a limit which breaks > something else, and this starts getting messy quick. > > To be clear, the ADDR_LIMIT_47BIT suggestion is absolutely a compromise and > practical suggestion. > > > > > >> > @@ -22,6 +22,7 @@ enum { > > >> >WHOLE_SECONDS = 0x200, > > >> >STICKY_TIMEOUTS = 0x400, > > >> >ADDR_LIMIT_3GB =0x800, > > >> > + ADDR_LIMIT_47BIT = 0x1000, > > >> > }; > > >> > > >> I'm a bit worried about having this done specifically in the > > >> personality flag bits, as they are rather limited. We obviously > > >> don't want to add many more such flags when there could be > > >> a way to just set the default limit. > > > > > > Since I'm the one who suggested it, I feel I should offer some kind of > > > vague defence here :) > > > > > > We shouldn't let perfect be the enemy of the good. This is a relatively > > > straightforward means of achieving the aim (assuming your concern about > > > arch_get_mmap_end() below isn't a blocker) which has the least impact on > > > existing code. > > > > > > Of course we can end up in absurdities where we start doing > > > ADDR_LIMIT_xxBIT... but again - it's simple, shouldn't represent an > > > egregious maintenance burden and is entirely opt-in so has things going > > > for > > > it. > > > > I'm more confused now, I think most importantly we should try to > > handle this consistently across all architectures. The proposed > > implementation seems to completely block addresses above BIT(47) > > even for applications that opt in by calling mmap(BIT(47), ...), > > which seems to break the existing applications. > > Hm, I thought the commit message suggested the hint overrides it still? 
> > The intent is to optionally be able to run a process that keeps higher bits > free for tagging and to be sure no memory mapping in the process will > clobber these (correct me if I'm wrong Charlie! :) > > So you really wouldn't want this if you are using tagged pointers, you'd > want to be sure literally nothing touches the higher bits. Various architectures handle the hint address differently, but it appears that the only case across any architecture where an address above 47 bits will be returned is if the application had a hint address with a value greater than 47 bits and was using the MAP_FIXED flag. MAP_FIXED bypasses all other checks so I was assuming that it would be logical for MAP_FIXED to bypass this as well. If MAP_FIXED is not set, then the intent is for no hint address to cause a value greater than 47 bits to be returned. This does have the issue that if MAP_FIXED is used then an address can be returned above 47-bits, but if an application does not want addresses above 47-bits then they shouldn't ask for a fixed address above that range. > > > > > If we want this flag for RISC-V and also keep the behavior of > defaulting to >BIT(47) addresses for mmap(0, ...) how about > changing arch_get_mmap_end() to return the limit based on > ADDR_LIMIT_47BIT and then make this default to enabled on > arm64 and powerpc but disabled on riscv?
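Concretely, the escape hatch described above would look like this (a sketch; it assumes sv57 or 5-level-paging hardware so that the high address is mappable at all):

#include <sys/mman.h>

/* Even with the 47-bit persona set, MAP_FIXED skips the
 * unmapped-area search and can still place a mapping above the
 * limit. */
static void *map_high_fixed(void)
{
	return mmap((void *)(1UL << 50), 4096, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}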
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Fri, Sep 06, 2024 at 04:59:40PM +1000, Michael Ellerman wrote: > Charlie Jenkins writes: > > Create a personality flag ADDR_LIMIT_47BIT to support applications > > that wish to transition from running in environments that support at > > most 47-bit VAs to environments that support larger VAs. This > > personality can be set to cause all allocations to be below the 47-bit > > boundary. Using MAP_FIXED with mmap() will bypass this restriction. > > > > Signed-off-by: Charlie Jenkins > > --- > > include/uapi/linux/personality.h | 1 + > > mm/mmap.c| 3 +++ > > 2 files changed, 4 insertions(+) > > > > diff --git a/include/uapi/linux/personality.h > > b/include/uapi/linux/personality.h > > index 49796b7756af..cd3b8c154d9b 100644 > > --- a/include/uapi/linux/personality.h > > +++ b/include/uapi/linux/personality.h > > @@ -22,6 +22,7 @@ enum { > > WHOLE_SECONDS = 0x200, > > STICKY_TIMEOUTS = 0x400, > > ADDR_LIMIT_3GB =0x800, > > + ADDR_LIMIT_47BIT = 0x1000, > > }; > > I wonder if ADDR_LIMIT_128T would be clearer? > I don't follow, what does 128T represent? > Have you looked at writing an update for the personality(2) man page? :) I will write an update to the man page if this patch is approved! > > cheers - Charlie
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote: > On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote: > > On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote: > > > Some applications rely on placing data in free bits of addresses allocated > > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > > > address returned by mmap to be less than the 48-bit address space, > > > unless the hint address uses more than 47 bits (the 48th bit is reserved > > > for the kernel address space). > > > > > > The riscv architecture needs a way to similarly restrict the virtual > > > address space. On the riscv port of OpenJDK an error is thrown if > > > attempted to run on the 57-bit address space, called sv57 [1]. golang > > > has a comment that sv57 support is not complete, but there are some > > > workarounds to get it to mostly work [2]. I also saw libmozjs crashing with 57-bit address space on x86. > > > These applications work on x86 because x86 does an implicit 47-bit > > > restriction of mmap() addresses that contain a hint address that is less > > > than 48 bits. > > > > > > Instead of implicitly restricting the address space on riscv (or any > > > current/future architecture), a flag would allow users to opt-in to this > > > behavior rather than opt-out as is done on other architectures. This is > > > desirable because it is a small class of applications that do pointer > > > masking. You reiterate the argument about "small class of applications". But it makes no sense to me. With full address space by default, this small class of applications is going to be *broken* unless they handle the RISC-V case specifically. On the other hand, if you limit VA to 128TiB by default (like many architectures do[1]) everything would work without intervention. And if an app needs wider address space it would get it with hint opt-in, because it is required on x86-64 anyway. Again, no RISC-V-specific code. I see no upside with your approach. Just worse user experience. [1] See va_high_addr_switch test case in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/mm/Makefile#n115 -- Kiryl Shutsemau / Kirill A. Shutemov
Re: [PATCH RFC v3 0/2] mm: Introduce ADDR_LIMIT_47BIT personality flag
On September 5, 2024 at 10:15 PM, Charlie Jenkins wrote: > Some applications rely on placing data in free bits of addresses allocated > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > address returned by mmap to be less than the 48-bit address space, > unless the hint address uses more than 47 bits (the 48th bit is reserved > for the kernel address space). > > The riscv architecture needs a way to similarly restrict the virtual > address space. On the riscv port of OpenJDK an error is thrown if > attempted to run on the 57-bit address space, called sv57 [1]. golang > has a comment that sv57 support is not complete, but there are some > workarounds to get it to mostly work [2]. > > These applications work on x86 because x86 does an implicit 47-bit > restriction of mmap() addresses that contain a hint address that is less > than 48 bits. > > Instead of implicitly restricting the address space on riscv (or any > current/future architecture), provide a flag to the personality syscall > that can be used to ensure an application works in any arbitrary VA > space. A similar feature has already been implemented by the personality > syscall in ADDR_LIMIT_32BIT. > > This flag will also allow seamless compatibility between all > architectures, so applications like Go and OpenJDK that use bits in a > virtual address can request the exact number of bits they need in a > generic way. The flag can be checked inside of vm_unmapped_area() so > that this flag does not have to be handled individually by each > architecture. Tested-by: Jiaxun Yang Tested on MIPS VA 48 system, fixed pointer tagging on mozjs! Thanks! [...] -- - Jiaxun
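The point about checking the flag inside vm_unmapped_area() could take a shape like the following (a sketch only, with an invented helper name; not the actual patch):

#include <linux/bits.h>
#include <linux/minmax.h>
#include <linux/personality.h>
#include <linux/sched.h>

/* Clamp the top of the unmapped-area search for tasks that set the
 * proposed persona bit, so no per-architecture code is needed. */
static unsigned long addr_limit_47bit_clamp(unsigned long high_limit)
{
	if (current->personality & ADDR_LIMIT_47BIT)
		return min(high_limit, BIT(47));
	return high_limit;
}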
Re: [PATCH] [RFC] crash: Lock-free crash hotplug support reporting
On 09/07/24 at 10:30am, Sourabh Jain wrote: > Hello Baoquan, > > Do you think this patch would help reduce lock contention when > CPU/Memory resources are removed in bulk from a system? .snip... -- > > include/linux/kexec.h | 11 --- > > kernel/crash_core.c | 27 +-- > > kernel/kexec.c| 5 - > > kernel/kexec_file.c | 7 ++- > > 4 files changed, 23 insertions(+), 27 deletions(-) > > > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h > > index f0e9f8eda7a3..bd755ba6bac4 100644 > > --- a/include/linux/kexec.h > > +++ b/include/linux/kexec.h > > @@ -318,13 +318,6 @@ struct kimage { > > unsigned int preserve_context : 1; > > /* If set, we are using file mode kexec syscall */ > > unsigned int file_mode:1; > > -#ifdef CONFIG_CRASH_HOTPLUG > > - /* If set, it is safe to update kexec segments that are > > -* excluded from SHA calculation. > > -*/ > > - unsigned int hotplug_support:1; > > -#endif > > - > > #ifdef ARCH_HAS_KIMAGE_ARCH > > struct kimage_arch arch; > > #endif > > @@ -370,6 +363,10 @@ struct kimage { > > unsigned long elf_load_addr; > > }; > > +#ifdef CONFIG_CRASH_HOTPLUG > > +extern unsigned int crash_hotplug_support; > > +#endif > > + > > /* kexec interface functions */ > > extern void machine_kexec(struct kimage *image); > > extern int machine_kexec_prepare(struct kimage *image); > > diff --git a/kernel/crash_core.c b/kernel/crash_core.c > > index 63cf89393c6e..3428deba0070 100644 > > --- a/kernel/crash_core.c > > +++ b/kernel/crash_core.c > > @@ -30,6 +30,13 @@ > > #include "kallsyms_internal.h" > > #include "kexec_internal.h" > > +#ifdef CONFIG_CRASH_HOTPLUG > > +/* if set, it is safe to update kexec segments that are > > + * excluded from sha calculation. > > + */ > > +unsigned int crash_hotplug_support; > > +#endif > > + > > /* Per cpu memory for storing cpu states in case of system crash. */ > > note_buf_t __percpu *crash_notes; > > @@ -500,23 +507,7 @@ static DEFINE_MUTEX(__crash_hotplug_lock); > >*/ > > int crash_check_hotplug_support(void) > > { > > - int rc = 0; > > - > > - crash_hotplug_lock(); > > - /* Obtain lock while reading crash information */ > > - if (!kexec_trylock()) { > > - pr_info("kexec_trylock() failed, elfcorehdr may be > > inaccurate\n"); > > - crash_hotplug_unlock(); > > - return 0; > > - } > > - if (kexec_crash_image) { > > - rc = kexec_crash_image->hotplug_support; > > - } > > - /* Release lock now that update complete */ > > - kexec_unlock(); > > - crash_hotplug_unlock(); > > - > > - return rc; > > + return crash_hotplug_support; I may not understand this well. Both kexec_load and kexec_file_load set hotplug_support, while crash_check_hotplug_support and crash_handle_hotplug_event check the flag. How do you guarantee the cpu/memory sysfs checking won't race with kexec_load and kexec_file_load? And here I see that taking crash_hotplug_lock() is unnecessary in crash_check_hotplug_support() because it doesn't race with crash_handle_hotplug_event().
> > } > > /* > > @@ -552,7 +543,7 @@ static void crash_handle_hotplug_event(unsigned int > > hp_action, unsigned int cpu, > > image = kexec_crash_image; > > /* Check that kexec segments update is permitted */ > > - if (!image->hotplug_support) > > + if (!crash_hotplug_support) > > goto out; > > if (hp_action == KEXEC_CRASH_HP_ADD_CPU || > > diff --git a/kernel/kexec.c b/kernel/kexec.c > > index a6b3f96bb50c..d5c6b51eaa8b 100644 > > --- a/kernel/kexec.c > > +++ b/kernel/kexec.c > > @@ -116,6 +116,9 @@ static int do_kexec_load(unsigned long entry, unsigned > > long nr_segments, > > /* Uninstall image */ > > kimage_free(xchg(dest_image, NULL)); > > ret = 0; > > +#ifdef CONFIG_CRASH_HOTPLUG > > + crash_hotplug_support = 0; > > +#endif > > goto out_unlock; > > } > > if (flags & KEXEC_ON_CRASH) { > > @@ -136,7 +139,7 @@ static int do_kexec_load(unsigned long entry, unsigned > > long nr_segments, > > #ifdef CONFIG_CRASH_HOTPLUG > > if ((flags & KEXEC_ON_CRASH) && arch_crash_hotplug_support(image, > > flags)) > > - image->hotplug_support = 1; > > + crash_hotplug_support = 1; > > #endif > > ret = machine_kexec_prepare(image); > > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c > > index 3d64290d24c9..b326edb90fd7 100644 > > --- a/kernel/kexec_file.c > > +++ b/kernel/kexec_file.c > > @@ -378,7 +378,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, > > initrd_fd, > > #ifdef CONFIG_CRASH_HOTPLUG > > if ((flags & KEXEC_FILE_ON_CRASH) && arch_crash_hotplug_support(image, > > flags)) > > - image->hotplug_support = 1; > > + crash_hotplug_support = 1; > > #endif > > ret = machine_kexec_prepare(image); > > @@ -432,6 +432,11 @@ SYSCALL_DEFINE5(kexec_file_load,
Re: [PATCH] [RFC] crash: Lock-free crash hotplug support reporting
Hello Baoquan, Do you think this patch would help reduce lock contention when CPU/Memory resources are removed in bulk from a system? Thanks, Sourabh Jain On 23/08/24 17:22, Sourabh Jain wrote: On a CPU/Memory hotplug event, the kexec lock is taken to update the kdump image. At the same time, this lock is also required to report the support for crash hotplug to user-space via the /sys/devices/system/[cpu|memory]/crash_hotplug sysfs interface, to avoid kdump reload. The kexec lock is needed to report crash hotplug support because the crash_hotplug variable, which tracks crash hotplug support, is part of the kdump image, and the kdump image needs to be updated during a hotplug event. Given that only one kdump image can be loaded at any given time, the crash_hotplug variable can be placed outside the kdump image and set or reset during kdump image load and unload. This allows crash hotplug support to be reported without taking the kexec lock. This would help in situations where CPU/Memory resources are hot-removed from the system in bulk. Commit e2a8f20dd8e9 ("Crash: add lock to serialize crash hotplug handling") was introduced to serialize the kexec lock during bulk CPU/Memory hotplug events. However, with these changes, the kexec lock for crash hotplug support reporting can be avoided altogether. Cc: Hari Bathini Cc: Mahesh Salgaonkar Cc: ke...@lists.infradead.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-ker...@vger.kernel.org Cc: x...@kernel.org Signed-off-by: Sourabh Jain --- include/linux/kexec.h | 11 --- kernel/crash_core.c | 27 +-- kernel/kexec.c| 5 - kernel/kexec_file.c | 7 ++- 4 files changed, 23 insertions(+), 27 deletions(-) diff --git a/include/linux/kexec.h b/include/linux/kexec.h index f0e9f8eda7a3..bd755ba6bac4 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -318,13 +318,6 @@ struct kimage { unsigned int preserve_context : 1; /* If set, we are using file mode kexec syscall */ unsigned int file_mode:1; -#ifdef CONFIG_CRASH_HOTPLUG - /* If set, it is safe to update kexec segments that are -* excluded from SHA calculation. -*/ - unsigned int hotplug_support:1; -#endif - #ifdef ARCH_HAS_KIMAGE_ARCH struct kimage_arch arch; #endif @@ -370,6 +363,10 @@ struct kimage { unsigned long elf_load_addr; }; +#ifdef CONFIG_CRASH_HOTPLUG +extern unsigned int crash_hotplug_support; +#endif + /* kexec interface functions */ extern void machine_kexec(struct kimage *image); extern int machine_kexec_prepare(struct kimage *image); diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 63cf89393c6e..3428deba0070 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -30,6 +30,13 @@ #include "kallsyms_internal.h" #include "kexec_internal.h" +#ifdef CONFIG_CRASH_HOTPLUG +/* if set, it is safe to update kexec segments that are + * excluded from sha calculation. + */ +unsigned int crash_hotplug_support; +#endif + /* Per cpu memory for storing cpu states in case of system crash.
*/ note_buf_t __percpu *crash_notes; @@ -500,23 +507,7 @@ static DEFINE_MUTEX(__crash_hotplug_lock); */ int crash_check_hotplug_support(void) { - int rc = 0; - - crash_hotplug_lock(); - /* Obtain lock while reading crash information */ - if (!kexec_trylock()) { - pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n"); - crash_hotplug_unlock(); - return 0; - } - if (kexec_crash_image) { - rc = kexec_crash_image->hotplug_support; - } - /* Release lock now that update complete */ - kexec_unlock(); - crash_hotplug_unlock(); - - return rc; + return crash_hotplug_support; } /* @@ -552,7 +543,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu, image = kexec_crash_image; /* Check that kexec segments update is permitted */ - if (!image->hotplug_support) + if (!crash_hotplug_support) goto out; if (hp_action == KEXEC_CRASH_HP_ADD_CPU || diff --git a/kernel/kexec.c b/kernel/kexec.c index a6b3f96bb50c..d5c6b51eaa8b 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -116,6 +116,9 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, /* Uninstall image */ kimage_free(xchg(dest_image, NULL)); ret = 0; +#ifdef CONFIG_CRASH_HOTPLUG + crash_hotplug_support = 0; +#endif goto out_unlock; } if (flags & KEXEC_ON_CRASH) { @@ -136,7 +139,7 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, #ifdef CONFIG_CRASH_HOTPLUG if ((flags & KEXEC_ON_CRASH) && arch_crash_hotplug_support(image, flags)) - image->hotplug_support = 1; + crash_hotplug_support = 1; #endif
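For reference, the consumer side of this interface is a plain sysfs read; a trivial reader of the files named in the commit message might look like:

#include <stdio.h>

int main(void)
{
	const char *paths[] = {
		"/sys/devices/system/cpu/crash_hotplug",
		"/sys/devices/system/memory/crash_hotplug",
	};
	char buf[8];

	for (int i = 0; i < 2; i++) {
		FILE *f = fopen(paths[i], "r");
		if (!f)
			continue;
		if (fgets(buf, sizeof(buf), f))
			printf("%s: %s", paths[i], buf);
		fclose(f);
	}
	return 0;
}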
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Fri, Sep 06, 2024 at 09:55:42AM +, Arnd Bergmann wrote: > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote: > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann wrote: > >> It's also unclear to me how we want this flag to interact with > >> the existing logic in arch_get_mmap_end(), which attempts to > >> limit the default mapping to a 47-bit address space already. > > > > To optimize RISC-V progress, I recommend: > > > > Step 1: Approve the patch. > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it. > > Step 3: Wait approximately several iterations for Go & OpenJDK > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end() > > I really want to first see a plausible explanation about why > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW > like all the other major architectures (x86, arm64, powerpc64), FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default configuration. We end up with a 47-bit with 16K pages but for a different reason that has to do with LPA2 support (I doubt we need this for the user mapping but we need to untangle some of the macros there; that's for a separate discussion). That said, we haven't encountered any user space problems with a 48-bit DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar approach (47 or 48 bit default limit). Better to have some ABI consistency between architectures. One can still ask for addresses above this default limit via mmap(). -- Catalin
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote: > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann wrote: >> >> It's also unclear to me how we want this flag to interact with >> the existing logic in arch_get_mmap_end(), which attempts to >> limit the default mapping to a 47-bit address space already. > > To optimize RISC-V progress, I recommend: > > Step 1: Approve the patch. > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it. > Step 3: Wait approximately several iterations for Go & OpenJDK > Step 4: Remove the 47-bit constraint in arch_get_mmap_end() I really want to first see a plausible explanation about why RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW like all the other major architectures (x86, arm64, powerpc64), e.g. something like the patch below (untested, probably slightly wrong but should illustrate my point). Arnd diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h index 8702b8721a27..de9863be1efd 100644 --- a/arch/riscv/include/asm/processor.h +++ b/arch/riscv/include/asm/processor.h @@ -20,17 +20,8 @@ * mmap_end < addr, being mmap_end the top of that address space. * See Documentation/arch/riscv/vm-layout.rst for more details. */ -#define arch_get_mmap_end(addr, len, flags)\ -({ \ - unsigned long mmap_end; \ - typeof(addr) _addr = (addr);\ - if ((_addr) == 0 || is_compat_task() || \ - ((_addr + len) > BIT(VA_BITS - 1))) \ - mmap_end = STACK_TOP_MAX; \ - else\ - mmap_end = (_addr + len); \ - mmap_end; \ -}) +#define arch_get_mmap_end(addr, len, flags) \ + (((addr) > DEFAULT_MAP_WINDOW) ? TASK_SIZE : DEFAULT_MAP_WINDOW) #define arch_get_mmap_base(addr, base) \ ({ \ @@ -47,7 +38,7 @@ }) #ifdef CONFIG_64BIT -#define DEFAULT_MAP_WINDOW (UL(1) << (MMAP_VA_BITS - 1)) +#define DEFAULT_MAP_WINDOW (is_compat_task() ? (UL(1) << (MMAP_VA_BITS - 1)) : TASK_SIZE_32) #define STACK_TOP_MAX TASK_SIZE_64 #else #define DEFAULT_MAP_WINDOW TASK_SIZE
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
(Sorry having issues with my IPv6 setup that duplicated the original email... On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote: > On Fri, Sep 6, 2024, at 08:14, Lorenzo Stoakes wrote: > > On Fri, Sep 06, 2024 at 07:17:44AM GMT, Arnd Bergmann wrote: > >> On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote: > >> > Create a personality flag ADDR_LIMIT_47BIT to support applications > >> > that wish to transition from running in environments that support at > >> > most 47-bit VAs to environments that support larger VAs. This > >> > personality can be set to cause all allocations to be below the 47-bit > >> > boundary. Using MAP_FIXED with mmap() will bypass this restriction. > >> > > >> > Signed-off-by: Charlie Jenkins > >> > >> I think having an architecture-independent mechanism to limit the size > >> of the 64-bit address space is useful in general, and we've discussed > >> the same thing for arm64 in the past, though we have not actually > >> reached an agreement on the ABI previously. > > > > The thread on the original proposals attests to this being rather a fraught > > topic, and I think the weight of opinion was more so in favour of opt-in > > rather than opt-out. > > You mean opt-in to using the larger addresses like we do on arm64 and > powerpc, while "opt-out" means a limit as Charlie suggested? I guess I'm not using brilliant terminology here haha! To clarify - the weight of opinion was for a situation where the address space is limited, except if you set a hint above that (you could call that opt-out or opt-in depending which way you look at it, so yeah ok very unclear sorry!). It was against the MAP_ flag and also I think a _flexible_ per-process limit is also questionable as you might end up setting a limit which breaks something else, and this starts getting messy quick. To be clear, the ADDR_LIMIT_47BIT suggestion is absolutely a compromise and practical suggestion. > > >> > @@ -22,6 +22,7 @@ enum { > >> > WHOLE_SECONDS = 0x200, > >> > STICKY_TIMEOUTS = 0x400, > >> > ADDR_LIMIT_3GB =0x800, > >> > +ADDR_LIMIT_47BIT = 0x1000, > >> > }; > >> > >> I'm a bit worried about having this done specifically in the > >> personality flag bits, as they are rather limited. We obviously > >> don't want to add many more such flags when there could be > >> a way to just set the default limit. > > > > Since I'm the one who suggested it, I feel I should offer some kind of > > vague defence here :) > > > > We shouldn't let perfect be the enemy of the good. This is a relatively > > straightforward means of achieving the aim (assuming your concern about > > arch_get_mmap_end() below isn't a blocker) which has the least impact on > > existing code. > > > > Of course we can end up in absurdities where we start doing > > ADDR_LIMIT_xxBIT... but again - it's simple, shouldn't represent an > > egregious maintenance burden and is entirely opt-in so has things going for > > it. > > I'm more confused now, I think most importantly we should try to > handle this consistently across all architectures. The proposed > implementation seems to completely block addresses above BIT(47) > even for applications that opt in by calling mmap(BIT(47), ...), > which seems to break the existing applications. Hm, I thought the commit message suggested the hint overrides it still? The intent is to optionally be able to run a process that keeps higher bits free for tagging and to be sure no memory mapping in the process will clobber these (correct me if I'm wrong Charlie! 
:) So you really wouldn't want this if you are using tagged pointers, you'd want to be sure literally nothing touches the higher bits. > > If we want this flag for RISC-V and also keep the behavior of > defaulting to >BIT(47) addresses for mmap(0, ...) how about > changing arch_get_mmap_end() to return the limit based on > ADDR_LIMIT_47BIT and then make this default to enabled on > arm64 and powerpc but disabled on riscv? But you wouldn't necessarily want all processes to be so restricted, I think this is what Charlie's trying to avoid :) On the other hand - I'm not sure there are many processes on any arch that'd want the higher mappings. So that'd push us again towards riscv just limiting to 48-bits and only mapping above this if a hint is provided like x86-64 does (and as you mentioned via IRC - it seems riscv is an outlier in that DEFAULT_MAP_WINDOW == TASK_SIZE). This would be more consistent vs. other arches. > > >> It's also unclear to me how we want this flag to interact with >> the existing logic in arch_get_mmap_end(), which attempts to >> limit the default mapping to a 47-bit address space already. > > How does ADDR_LIMIT_3GB presently interact with that? > That is x86 specific and only relevant to compat tasks, limiting > them to 3 instead of 4 GB. There is also ADDR_LIMIT_32BIT, which > on arm32 is always set in practice to allow 32-bit addressing > as opposed to ARMv2 style 26-bit addressing (IIRC ARMv3 supported > both 26-bit and 32-bit addressing, while ARMv4 through ARMv7 are > 32-bit only.) > > Arnd
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann wrote: > > On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote: > > Create a personality flag ADDR_LIMIT_47BIT to support applications > > that wish to transition from running in environments that support at > > most 47-bit VAs to environments that support larger VAs. This > > personality can be set to cause all allocations to be below the 47-bit > > boundary. Using MAP_FIXED with mmap() will bypass this restriction. > > > > Signed-off-by: Charlie Jenkins > > I think having an architecture-independent mechanism to limit the size > of the 64-bit address space is useful in general, and we've discussed > the same thing for arm64 in the past, though we have not actually > reached an agreement on the ABI previously. > > > @@ -22,6 +22,7 @@ enum { > > WHOLE_SECONDS = 0x200, > > STICKY_TIMEOUTS = 0x400, > > ADDR_LIMIT_3GB =0x800, > > + ADDR_LIMIT_47BIT = 0x1000, > > }; > > I'm a bit worried about having this done specifically in the > personality flag bits, as they are rather limited. We obviously > don't want to add many more such flags when there could be > a way to just set the default limit. > > It's also unclear to me how we want this flag to interact with > the existing logic in arch_get_mmap_end(), which attempts to > limit the default mapping to a 47-bit address space already. To optimize RISC-V progress, I recommend: Step 1: Approve the patch. Step 2: Update Go and OpenJDK's RISC-V backend to utilize it. Step 3: Wait approximately several iterations for Go & OpenJDK Step 4: Remove the 47-bit constraint in arch_get_mmap_end() > > For some reason, it appears that the arch_get_mmap_end() > logic on RISC-V defaults to the maximum address > space for the 'addr==0' case which is inconsistent with > the other architectures, so we should probably fix that > part first, possibly moving more of that logic into a > shared implementation. > > Arnd > -- Best Regards Guo Ren
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Fri, Sep 6, 2024, at 08:14, Lorenzo Stoakes wrote: > On Fri, Sep 06, 2024 at 07:17:44AM GMT, Arnd Bergmann wrote: >> On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote: >> > Create a personality flag ADDR_LIMIT_47BIT to support applications >> > that wish to transition from running in environments that support at >> > most 47-bit VAs to environments that support larger VAs. This >> > personality can be set to cause all allocations to be below the 47-bit >> > boundary. Using MAP_FIXED with mmap() will bypass this restriction. >> > >> > Signed-off-by: Charlie Jenkins >> >> I think having an architecture-independent mechanism to limit the size >> of the 64-bit address space is useful in general, and we've discussed >> the same thing for arm64 in the past, though we have not actually >> reached an agreement on the ABI previously. > > The thread on the original proposals attests to this being rather a fraught > topic, and I think the weight of opinion was more so in favour of opt-in > rather than opt-out. You mean opt-in to using the larger addresses like we do on arm64 and powerpc, while "opt-out" means a limit as Charlie suggested? >> > @@ -22,6 +22,7 @@ enum { >> >WHOLE_SECONDS = 0x200, >> >STICKY_TIMEOUTS = 0x400, >> >ADDR_LIMIT_3GB =0x800, >> > + ADDR_LIMIT_47BIT = 0x1000, >> > }; >> >> I'm a bit worried about having this done specifically in the >> personality flag bits, as they are rather limited. We obviously >> don't want to add many more such flags when there could be >> a way to just set the default limit. > > Since I'm the one who suggested it, I feel I should offer some kind of > vague defence here :) > > We shouldn't let perfect be the enemy of the good. This is a relatively > straightforward means of achieving the aim (assuming your concern about > arch_get_mmap_end() below isn't a blocker) which has the least impact on > existing code. > > Of course we can end up in absurdities where we start doing > ADDR_LIMIT_xxBIT... but again - it's simple, shouldn't represent an > egregious maintenance burden and is entirely opt-in so has things going for > it. I'm more confused now, I think most importantly we should try to handle this consistently across all architectures. The proposed implementation seems to completely block addresses above BIT(47) even for applications that opt in by calling mmap(BIT(47), ...), which seems to break the existing applications. If we want this flag for RISC-V and also keep the behavior of defaulting to >BIT(47) addresses for mmap(0, ...) how about changing arch_get_mmap_end() to return the limit based on ADDR_LIMIT_47BIT and then make this default to enabled on arm64 and powerpc but disabled on riscv? >> It's also unclear to me how we want this flag to interact with >> the existing logic in arch_get_mmap_end(), which attempts to >> limit the default mapping to a 47-bit address space already. > > How does ADDR_LIMIT_3GB presently interact with that? That is x86 specific and only relevant to compat tasks, limiting them to 3 instead of 4 GB. There is also ADDR_LIMIT_32BIT, which on arm32 is always set in practice to allow 32-bit addressing as opposed to ARMv2 style 26-bit addressing (IIRC ARMv3 supported both 26-bit and 32-bit addressing, while ARMv4 through ARMv7 are 32-bit only.) Arnd
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Fri, Sep 06, 2024 at 07:17:44AM GMT, Arnd Bergmann wrote: > On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote: > > Create a personality flag ADDR_LIMIT_47BIT to support applications > > that wish to transition from running in environments that support at > > most 47-bit VAs to environments that support larger VAs. This > > personality can be set to cause all allocations to be below the 47-bit > > boundary. Using MAP_FIXED with mmap() will bypass this restriction. > > > > Signed-off-by: Charlie Jenkins > > I think having an architecture-independent mechanism to limit the size > of the 64-bit address space is useful in general, and we've discussed > the same thing for arm64 in the past, though we have not actually > reached an agreement on the ABI previously. The thread on the original proposals attests to this being rather a fraught topic, and I think the weight of opinion was more so in favour of opt-in rather than opt-out. > > > @@ -22,6 +22,7 @@ enum { > > WHOLE_SECONDS = 0x2000000, > > STICKY_TIMEOUTS = 0x4000000, > > ADDR_LIMIT_3GB = 0x8000000, > > + ADDR_LIMIT_47BIT = 0x10000000, > > }; > > I'm a bit worried about having this done specifically in the > personality flag bits, as they are rather limited. We obviously > don't want to add many more such flags when there could be > a way to just set the default limit. Since I'm the one who suggested it, I feel I should offer some kind of vague defence here :) We shouldn't let perfect be the enemy of the good. This is a relatively straightforward means of achieving the aim (assuming your concern about arch_get_mmap_end() below isn't a blocker) which has the least impact on existing code. Of course we can end up in absurdities where we start doing ADDR_LIMIT_xxBIT... but again - it's simple, shouldn't represent an egregious maintenance burden and is entirely opt-in so has things going for it. > > It's also unclear to me how we want this flag to interact with > the existing logic in arch_get_mmap_end(), which attempts to > limit the default mapping to a 47-bit address space already. How does ADDR_LIMIT_3GB presently interact with that? > > For some reason, it appears that the arch_get_mmap_end() > logic on RISC-V defaults to the maximum address > space for the 'addr==0' case, which is inconsistent with > the other architectures, so we should probably fix that > part first, possibly moving more of that logic into a > shared implementation. > > Arnd
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote: > Create a personality flag ADDR_LIMIT_47BIT to support applications > that wish to transition from running in environments that support at > most 47-bit VAs to environments that support larger VAs. This > personality can be set to cause all allocations to be below the 47-bit > boundary. Using MAP_FIXED with mmap() will bypass this restriction. > > Signed-off-by: Charlie Jenkins I think having an architecture-independent mechanism to limit the size of the 64-bit address space is useful in general, and we've discussed the same thing for arm64 in the past, though we have not actually reached an agreement on the ABI previously. > @@ -22,6 +22,7 @@ enum { > WHOLE_SECONDS = 0x2000000, > STICKY_TIMEOUTS = 0x4000000, > ADDR_LIMIT_3GB = 0x8000000, > + ADDR_LIMIT_47BIT = 0x10000000, > }; I'm a bit worried about having this done specifically in the personality flag bits, as they are rather limited. We obviously don't want to add many more such flags when there could be a way to just set the default limit. It's also unclear to me how we want this flag to interact with the existing logic in arch_get_mmap_end(), which attempts to limit the default mapping to a 47-bit address space already. For some reason, it appears that the arch_get_mmap_end() logic on RISC-V defaults to the maximum address space for the 'addr==0' case, which is inconsistent with the other architectures, so we should probably fix that part first, possibly moving more of that logic into a shared implementation. Arnd
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
Charlie Jenkins writes: > Create a personality flag ADDR_LIMIT_47BIT to support applications > that wish to transition from running in environments that support at > most 47-bit VAs to environments that support larger VAs. This > personality can be set to cause all allocations to be below the 47-bit > boundary. Using MAP_FIXED with mmap() will bypass this restriction. > > Signed-off-by: Charlie Jenkins > --- > include/uapi/linux/personality.h | 1 + > mm/mmap.c | 3 +++ > 2 files changed, 4 insertions(+) > > diff --git a/include/uapi/linux/personality.h > b/include/uapi/linux/personality.h > index 49796b7756af..cd3b8c154d9b 100644 > --- a/include/uapi/linux/personality.h > +++ b/include/uapi/linux/personality.h > @@ -22,6 +22,7 @@ enum { > WHOLE_SECONDS = 0x2000000, > STICKY_TIMEOUTS = 0x4000000, > ADDR_LIMIT_3GB = 0x8000000, > + ADDR_LIMIT_47BIT = 0x10000000, > }; I wonder if ADDR_LIMIT_128T would be clearer? Have you looked at writing an update for the personality(2) man page? :) cheers
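If such a personality(2) man-page update were written, one possible wording, based only on the behaviour described in this series and purely illustrative, might be:

    ADDR_LIMIT_47BIT (since Linux ...)
           Limit the process address space to 47 bits.  Addresses
           returned by mmap(2) are guaranteed to be below
           0x800000000000 unless MAP_FIXED is used, which bypasses
           the restriction.  Intended for programs that store tags
           in the upper bits of pointers.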
Re: [PATCH RFC v3 0/2] mm: Introduce ADDR_LIMIT_47BIT personality flag
Hi Charlie, On Thu, 2024-09-05 at 14:15 -0700, Charlie Jenkins wrote: > Some applications rely on placing data in free bits addresses allocated > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > address returned by mmap to be less than the 48-bit address space, > unless the hint address uses more than 47 bits (the 48th bit is reserved > for the kernel address space). > > The riscv architecture needs a way to similarly restrict the virtual > address space. On the riscv port of OpenJDK an error is thrown if > attempted to run on the 57-bit address space, called sv57 [1]. golang > has a comment that sv57 support is not complete, but there are some > workarounds to get it to mostly work [2]. > > These applications work on x86 because x86 does an implicit 47-bit > restriction of mmap() address that contain a hint address that is less > than 48 bits. > > Instead of implicitly restricting the address space on riscv (or any > current/future architecture), provide a flag to the personality syscall > that can be used to ensure an application works in any arbitrary VA > space. A similar feature has already been implemented by the personality > syscall in ADDR_LIMIT_32BIT. > > This flag will also allow seemless compatibility between all > architectures, so applications like Go and OpenJDK that use bits in a > virtual address can request the exact number of bits they need in a > generic way. The flag can be checked inside of vm_unmapped_area() so > that this flag does not have to be handled individually by each > architecture. > > Link: > https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79 > [1] > Link: > https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47 > [2] > > To: Arnd Bergmann > To: Richard Henderson > To: Ivan Kokshaysky > To: Matt Turner > To: Vineet Gupta > To: Russell King > To: Guo Ren > To: Huacai Chen > To: WANG Xuerui > To: Thomas Bogendoerfer > To: James E.J. Bottomley > To: Helge Deller > To: Michael Ellerman > To: Nicholas Piggin > To: Christophe Leroy > To: Naveen N Rao > To: Alexander Gordeev > To: Gerald Schaefer > To: Heiko Carstens > To: Vasily Gorbik > To: Christian Borntraeger > To: Sven Schnelle > To: Yoshinori Sato > To: Rich Felker > To: John Paul Adrian Glaubitz > To: David S. Miller > To: Andreas Larsson > To: Thomas Gleixner > To: Ingo Molnar > To: Borislav Petkov > To: Dave Hansen > To: x...@kernel.org > To: H. Peter Anvin > To: Andy Lutomirski > To: Peter Zijlstra > To: Muchun Song > To: Andrew Morton > To: Liam R. Howlett > To: Vlastimil Babka > To: Lorenzo Stoakes > To: Shuah Khan > To: Christoph Hellwig > To: Michal Hocko > To: "Kirill A. 
Shutemov" > To: Chris Torek > Cc: linux-a...@vger.kernel.org > Cc: linux-ker...@vger.kernel.org > Cc: linux-al...@vger.kernel.org > Cc: linux-snps-...@lists.infradead.org > Cc: linux-arm-ker...@lists.infradead.org > Cc: linux-c...@vger.kernel.org > Cc: loonga...@lists.linux.dev > Cc: linux-m...@vger.kernel.org > Cc: linux-par...@vger.kernel.org > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux-s...@vger.kernel.org > Cc: linux...@vger.kernel.org > Cc: sparcli...@vger.kernel.org > Cc: linux...@kvack.org > Cc: linux-kselft...@vger.kernel.org > Cc: linux-abi-de...@lists.sourceforge.net > Signed-off-by: Charlie Jenkins > > Changes in v2: > - Added much greater detail to cover letter > - Removed all code that touched architecture specific code and was able > to factor this out into all generic functions, except for flags that > needed to be added to vm_unmapped_area_info > - Made this an RFC since I have only tested it on riscv and x86 > - Link to v1: > https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb90...@rivosinc.com > > Changes in v3: > - Use a personality flag instead of an mmap flag > - Link to v2: > https://lore.kernel.org/r/20240829-patches-below_hint_mmap-v2-0-638a28d9e...@rivosinc.com > > --- > Charlie Jenkins (2): > mm: Add personality flag to limit address to 47 bits > selftests/mm: Create ADDR_LIMIT_47BIT test > > include/uapi/linux/personality.h | 1 + > mm/mmap.c | 3 ++ > tools/testing/selftests/mm/.gitignore | 1 + > tools/testing/selftests/mm/Makefile| 1 + > tools/testing/selftests/mm/map_47bit_personality.c | 34 > ++ > 5 files changed, 40 insertions(+) > --- > base-commit: 5be63fc19fcaa4c236b307420483578a56986a37 > change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55 Wow, this issue has been plaguing SPARC users for years already as the architecture uses a 52-bit virtual address space and Javascript engines such as the one in Firefox or Webkit have been crashing ever since. I should definitely give this series a try and see if that fixes Javascript crashes on SPARC. Thank
Re: [PATCH RFC v3 0/2] mm: Introduce ADDR_LIMIT_47BIT personality flag
On Fri, Sep 6, 2024 at 5:16 AM Charlie Jenkins wrote: > > Some applications rely on placing data in free bits addresses allocated > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > address returned by mmap to be less than the 48-bit address space, > unless the hint address uses more than 47 bits (the 48th bit is reserved > for the kernel address space). > > The riscv architecture needs a way to similarly restrict the virtual > address space. On the riscv port of OpenJDK an error is thrown if > attempted to run on the 57-bit address space, called sv57 [1]. golang > has a comment that sv57 support is not complete, but there are some > workarounds to get it to mostly work [2]. > > These applications work on x86 because x86 does an implicit 47-bit > restriction of mmap() address that contain a hint address that is less > than 48 bits. > > Instead of implicitly restricting the address space on riscv (or any > current/future architecture), provide a flag to the personality syscall > that can be used to ensure an application works in any arbitrary VA > space. A similar feature has already been implemented by the personality > syscall in ADDR_LIMIT_32BIT. > > This flag will also allow seemless compatibility between all > architectures, so applications like Go and OpenJDK that use bits in a > virtual address can request the exact number of bits they need in a > generic way. The flag can be checked inside of vm_unmapped_area() so > that this flag does not have to be handled individually by each > architecture. Acked-by: Guo Ren Sv57's pain finds its cure in this antidote. > > Link: > https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79 > [1] > Link: > https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47 > [2] > > To: Arnd Bergmann > To: Richard Henderson > To: Ivan Kokshaysky > To: Matt Turner > To: Vineet Gupta > To: Russell King > To: Guo Ren > To: Huacai Chen > To: WANG Xuerui > To: Thomas Bogendoerfer > To: James E.J. Bottomley > To: Helge Deller > To: Michael Ellerman > To: Nicholas Piggin > To: Christophe Leroy > To: Naveen N Rao > To: Alexander Gordeev > To: Gerald Schaefer > To: Heiko Carstens > To: Vasily Gorbik > To: Christian Borntraeger > To: Sven Schnelle > To: Yoshinori Sato > To: Rich Felker > To: John Paul Adrian Glaubitz > To: David S. Miller > To: Andreas Larsson > To: Thomas Gleixner > To: Ingo Molnar > To: Borislav Petkov > To: Dave Hansen > To: x...@kernel.org > To: H. Peter Anvin > To: Andy Lutomirski > To: Peter Zijlstra > To: Muchun Song > To: Andrew Morton > To: Liam R. Howlett > To: Vlastimil Babka > To: Lorenzo Stoakes > To: Shuah Khan > To: Christoph Hellwig > To: Michal Hocko > To: "Kirill A. 
Shutemov" > To: Chris Torek > Cc: linux-a...@vger.kernel.org > Cc: linux-ker...@vger.kernel.org > Cc: linux-al...@vger.kernel.org > Cc: linux-snps-...@lists.infradead.org > Cc: linux-arm-ker...@lists.infradead.org > Cc: linux-c...@vger.kernel.org > Cc: loonga...@lists.linux.dev > Cc: linux-m...@vger.kernel.org > Cc: linux-par...@vger.kernel.org > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux-s...@vger.kernel.org > Cc: linux...@vger.kernel.org > Cc: sparcli...@vger.kernel.org > Cc: linux...@kvack.org > Cc: linux-kselft...@vger.kernel.org > Cc: linux-abi-de...@lists.sourceforge.net > Signed-off-by: Charlie Jenkins > > Changes in v2: > - Added much greater detail to cover letter > - Removed all code that touched architecture specific code and was able > to factor this out into all generic functions, except for flags that > needed to be added to vm_unmapped_area_info > - Made this an RFC since I have only tested it on riscv and x86 > - Link to v1: > https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb90...@rivosinc.com > > Changes in v3: > - Use a personality flag instead of an mmap flag > - Link to v2: > https://lore.kernel.org/r/20240829-patches-below_hint_mmap-v2-0-638a28d9e...@rivosinc.com > > --- > Charlie Jenkins (2): > mm: Add personality flag to limit address to 47 bits > selftests/mm: Create ADDR_LIMIT_47BIT test > > include/uapi/linux/personality.h | 1 + > mm/mmap.c | 3 ++ > tools/testing/selftests/mm/.gitignore | 1 + > tools/testing/selftests/mm/Makefile| 1 + > tools/testing/selftests/mm/map_47bit_personality.c | 34 > ++ > 5 files changed, 40 insertions(+) > --- > base-commit: 5be63fc19fcaa4c236b307420483578a56986a37 > change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55 > -- > - Charlie > -- Best Regards Guo Ren
[PATCH RFC v3 2/2] selftests/mm: Create ADDR_LIMIT_47BIT test
Add a selftest for the ADDR_LIMIT_47BIT personality flag that mmaps until it runs out of space and ensures no addresses are allocated above 47 bits. Signed-off-by: Charlie Jenkins --- tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/map_47bit_personality.c | 34 ++++++++++++++++++++++++++++++++++ 3 files changed, 36 insertions(+) diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index da030b43e43b..918ef05e180d 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -32,6 +32,7 @@ mlock-random-test virtual_address_range gup_test va_128TBswitch +map_47bit_personality map_fixed_noreplace write_to_hugetlbfs hmm-tests diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index cfad627e8d94..2e95fd545409 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -50,6 +50,7 @@ TEST_GEN_FILES += hugepage-shm TEST_GEN_FILES += hugepage-vmemmap TEST_GEN_FILES += khugepaged TEST_GEN_FILES += madv_populate +TEST_GEN_FILES += map_47bit_personality TEST_GEN_FILES += map_fixed_noreplace TEST_GEN_FILES += map_hugetlb TEST_GEN_FILES += map_populate diff --git a/tools/testing/selftests/mm/map_47bit_personality.c b/tools/testing/selftests/mm/map_47bit_personality.c new file mode 100644 index 000000000000..453412990c21 --- /dev/null +++ b/tools/testing/selftests/mm/map_47bit_personality.c @@ -0,0 +1,34 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Test the ADDR_LIMIT_47BIT personality flag. + */ +#include <sys/syscall.h> +#include <sys/mman.h> +#include <errno.h> +#include "../kselftest.h" + +#define LENGTH (1) + +#define ADDR_LIMIT_47BIT 0x10000000 +#define BIT47 1UL << 47 + +/* + * Map memory with ADDR_LIMIT_47BIT until no memory left. Ensure that all returned + * addresses are below 47 bits. + */ +int main(int argc, char **argv) +{ + void *addr; + + syscall(__NR_personality, ADDR_LIMIT_47BIT); + + do { + addr = mmap(0, LENGTH, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + } while (addr != MAP_FAILED && (unsigned long)addr < BIT47); + + if (errno == ENOMEM) + ksft_test_result_pass("ADDR_LIMIT_47BIT works\n"); + else + ksft_test_result_fail("mmap returned address above 47 bits with ADDR_LIMIT_47BIT with addr: %p and err: %s\n", + addr, strerror(errno)); +} -- 2.45.0
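For reference, once this patch is applied the new test builds and runs like the other mm kselftests, roughly:

    $ make -C tools/testing/selftests/mm
    $ ./tools/testing/selftests/mm/map_47bit_personality

On a kernel without the series the loop simply walks the default (sub-47-bit) address space until ENOMEM, so the test only meaningfully exercises the flag on patched kernels with larger VA spaces.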
[PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
Create a personality flag ADDR_LIMIT_47BIT to support applications that wish to transition from running in environments that support at most 47-bit VAs to environments that support larger VAs. This personality can be set to cause all allocations to be below the 47-bit boundary. Using MAP_FIXED with mmap() will bypass this restriction. Signed-off-by: Charlie Jenkins --- include/uapi/linux/personality.h | 1 + mm/mmap.c | 3 +++ 2 files changed, 4 insertions(+) diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h index 49796b7756af..cd3b8c154d9b 100644 --- a/include/uapi/linux/personality.h +++ b/include/uapi/linux/personality.h @@ -22,6 +22,7 @@ enum { WHOLE_SECONDS = 0x2000000, STICKY_TIMEOUTS = 0x4000000, ADDR_LIMIT_3GB = 0x8000000, + ADDR_LIMIT_47BIT = 0x10000000, }; /* diff --git a/mm/mmap.c b/mm/mmap.c index d0dfc85b209b..a5c7544853e5 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1766,6 +1766,9 @@ unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info) { unsigned long addr; + if (current->personality & ADDR_LIMIT_47BIT) + info->high_limit = MIN(info->high_limit, BIT(47) - 1); + if (info->flags & VM_UNMAPPED_AREA_TOPDOWN) addr = unmapped_area_topdown(info); else -- 2.45.0
[PATCH RFC v3 0/2] mm: Introduce ADDR_LIMIT_47BIT personality flag
Some applications rely on placing data in free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space). The riscv architecture needs a way to similarly restrict the virtual address space. On the riscv port of OpenJDK an error is thrown if it is run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2]. These applications work on x86 because x86 does an implicit 47-bit restriction of mmap() addresses that contain a hint address that is less than 48 bits. Instead of implicitly restricting the address space on riscv (or any current/future architecture), provide a flag to the personality syscall that can be used to ensure an application works in any arbitrary VA space. A similar feature has already been implemented by the personality syscall in ADDR_LIMIT_32BIT. This flag will also allow seamless compatibility between all architectures, so applications like Go and OpenJDK that use bits in a virtual address can request the exact number of bits they need in a generic way. The flag can be checked inside of vm_unmapped_area() so that this flag does not have to be handled individually by each architecture. Link: https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79 [1] Link: https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47 [2] To: Arnd Bergmann To: Richard Henderson To: Ivan Kokshaysky To: Matt Turner To: Vineet Gupta To: Russell King To: Guo Ren To: Huacai Chen To: WANG Xuerui To: Thomas Bogendoerfer To: James E.J. Bottomley To: Helge Deller To: Michael Ellerman To: Nicholas Piggin To: Christophe Leroy To: Naveen N Rao To: Alexander Gordeev To: Gerald Schaefer To: Heiko Carstens To: Vasily Gorbik To: Christian Borntraeger To: Sven Schnelle To: Yoshinori Sato To: Rich Felker To: John Paul Adrian Glaubitz To: David S. Miller To: Andreas Larsson To: Thomas Gleixner To: Ingo Molnar To: Borislav Petkov To: Dave Hansen To: x...@kernel.org To: H. Peter Anvin To: Andy Lutomirski To: Peter Zijlstra To: Muchun Song To: Andrew Morton To: Liam R. Howlett To: Vlastimil Babka To: Lorenzo Stoakes To: Shuah Khan To: Christoph Hellwig To: Michal Hocko To: "Kirill A. 
Shutemov" To: Chris Torek Cc: linux-a...@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux-al...@vger.kernel.org Cc: linux-snps-...@lists.infradead.org Cc: linux-arm-ker...@lists.infradead.org Cc: linux-c...@vger.kernel.org Cc: loonga...@lists.linux.dev Cc: linux-m...@vger.kernel.org Cc: linux-par...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s...@vger.kernel.org Cc: linux...@vger.kernel.org Cc: sparcli...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-kselft...@vger.kernel.org Cc: linux-abi-de...@lists.sourceforge.net Signed-off-by: Charlie Jenkins Changes in v2: - Added much greater detail to cover letter - Removed all code that touched architecture specific code and was able to factor this out into all generic functions, except for flags that needed to be added to vm_unmapped_area_info - Made this an RFC since I have only tested it on riscv and x86 - Link to v1: https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb90...@rivosinc.com Changes in v3: - Use a personality flag instead of an mmap flag - Link to v2: https://lore.kernel.org/r/20240829-patches-below_hint_mmap-v2-0-638a28d9e...@rivosinc.com --- Charlie Jenkins (2): mm: Add personality flag to limit address to 47 bits selftests/mm: Create ADDR_LIMIT_47BIT test include/uapi/linux/personality.h | 1 + mm/mmap.c | 3 ++ tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile| 1 + tools/testing/selftests/mm/map_47bit_personality.c | 34 ++ 5 files changed, 40 insertions(+) --- base-commit: 5be63fc19fcaa4c236b307420483578a56986a37 change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55 -- - Charlie
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote: > On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote: > > Some applications rely on placing data in free bits addresses allocated > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > > address returned by mmap to be less than the 48-bit address space, > > unless the hint address uses more than 47 bits (the 48th bit is reserved > > for the kernel address space). > > > > The riscv architecture needs a way to similarly restrict the virtual > > address space. On the riscv port of OpenJDK an error is thrown if > > attempted to run on the 57-bit address space, called sv57 [1]. golang > > has a comment that sv57 support is not complete, but there are some > > workarounds to get it to mostly work [2]. > > > > These applications work on x86 because x86 does an implicit 47-bit > > restriction of mmap() address that contain a hint address that is less > > than 48 bits. > > > > Instead of implicitly restricting the address space on riscv (or any > > current/future architecture), a flag would allow users to opt-in to this > > behavior rather than opt-out as is done on other architectures. This is > > desirable because it is a small class of applications that do pointer > > masking. > > This argument looks broken to me. > > The "small class of applications" is going to be broken unless they got > patched to use your new mmap() flag. You are asking for bugs. > > Consider the case when you write, compile and validate a piece of software > on machine that has <=47bit VA. The binary got shipped to customers. > Later, customer gets a new shiny machine that supports larger address > space and your previously working software is broken. Such binaries might > exist today. > > It is bad idea to use >47bit VA by default. Most of software got tested on > x86 with 47bit VA. > > We can consider more options to opt-in into wider address space like > personality or prctl() handle. But opt-out is no-go from what I see. > > -- > Kiryl Shutsemau / Kirill A. Shutemov riscv is in an interesting state in regards to this because the software ecosystem is much less mature than other architectures. The existing riscv hardware supports either 38 or 47 bit userspace VAs, but a lot of people test on QEMU which defaults to 56 bit. As a result, a lot of code is tested with the larger address space. Applications that don't work on the larger address space, like OpenJDK, currently throw an error and exit. Since riscv does not currently have the address space default to 47 bits, some applications just don't work on 56 bits. We could change the kernel so that these applications start working without the need for them to change their code, but that seems like the kernel is overstepping and fixing binaries rather than providing users tools to fix the binaries themselves. This mmap flag was an attempt to provide a tool for these applications that work on the existing 47 bit VA hardware to also work on different hardware that supports a 56 bit VA space. After feedback, it looks like a better solution than the mmap flag is to use the personality syscall to set a process wide restriction to 47 bits instead, which matches the 32 bit flag that already exists. - Charlie
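The implicit, per-mapping opt-in Charlie describes for x86 can be exercised explicitly with a hint above the 47-bit boundary; a rough sketch (the hint value is arbitrary, and the result depends on the kernel and the hardware VA size):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            /* NULL hint: x86/arm64/powerpc keep this below 1UL << 47. */
            void *lo = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            /* Hint above 47 bits: opts this one mapping into the full VA space. */
            void *hi = mmap((void *)(1UL << 55), 4096, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            printf("default hint: %p, high hint: %p\n", lo, hi);
            return 0;
    }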
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote: > Some applications rely on placing data in free bits of addresses allocated > by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the > address returned by mmap to be less than the 48-bit address space, > unless the hint address uses more than 47 bits (the 48th bit is reserved > for the kernel address space). > > The riscv architecture needs a way to similarly restrict the virtual > address space. On the riscv port of OpenJDK an error is thrown if > it is run on the 57-bit address space, called sv57 [1]. golang > has a comment that sv57 support is not complete, but there are some > workarounds to get it to mostly work [2]. > > These applications work on x86 because x86 does an implicit 47-bit > restriction of mmap() addresses that contain a hint address that is less > than 48 bits. > > Instead of implicitly restricting the address space on riscv (or any > current/future architecture), a flag would allow users to opt-in to this > behavior rather than opt-out as is done on other architectures. This is > desirable because it is a small class of applications that do pointer > masking. This argument looks broken to me. The "small class of applications" is going to be broken unless they get patched to use your new mmap() flag. You are asking for bugs. Consider the case when you write, compile and validate a piece of software on a machine that has a <=47-bit VA. The binary gets shipped to customers. Later, a customer gets a shiny new machine that supports a larger address space and your previously working software is broken. Such binaries might exist today. It is a bad idea to use a >47-bit VA by default. Most software got tested on x86 with a 47-bit VA. We can consider more options for opting in to the wider address space, like a personality or prctl() handle. But opt-out is a no-go from what I see. -- Kiryl Shutsemau / Kirill A. Shutemov
Re: [PATCH RFC v2 2/4] mm: Add hint and mmap_flags to struct vm_unmapped_area_info
Hi Charlie, On 29/08/2024 at 09:15, Charlie Jenkins wrote: The hint address and mmap_flags are necessary to determine if MAP_BELOW_HINT requirements are satisfied. Signed-off-by: Charlie Jenkins --- arch/alpha/kernel/osf_sys.c | 2 ++ arch/arc/mm/mmap.c | 3 +++ arch/arm/mm/mmap.c | 7 +++++++ arch/csky/abiv1/mmap.c | 3 +++ arch/loongarch/mm/mmap.c | 3 +++ arch/mips/mm/mmap.c | 3 +++ arch/parisc/kernel/sys_parisc.c | 3 +++ arch/powerpc/mm/book3s64/slice.c | 7 +++++++ arch/s390/mm/hugetlbpage.c | 4 ++++ arch/s390/mm/mmap.c | 6 ++++++ arch/sh/mm/mmap.c | 6 ++++++ arch/sparc/kernel/sys_sparc_32.c | 3 +++ arch/sparc/kernel/sys_sparc_64.c | 6 ++++++ arch/sparc/mm/hugetlbpage.c | 4 ++++ arch/x86/kernel/sys_x86_64.c | 6 ++++++ arch/x86/mm/hugetlbpage.c | 4 ++++ fs/hugetlbfs/inode.c | 4 ++++ include/linux/mm.h | 2 ++ mm/mmap.c | 6 ++++++ 19 files changed, 82 insertions(+) diff --git a/arch/powerpc/mm/book3s64/slice.c b/arch/powerpc/mm/book3s64/slice.c index ef3ce37f1bb3..f0e2550af6d0 100644 --- a/arch/powerpc/mm/book3s64/slice.c +++ b/arch/powerpc/mm/book3s64/slice.c @@ -286,6 +286,10 @@ static unsigned long slice_find_area_bottomup(struct mm_struct *mm, .length = len, .align_mask = PAGE_MASK & ((1ul << pshift) - 1), }; + + info.hint = addr; + info.mmap_flags = flags; + /* * Check till the allow max value for this mmap request */ @@ -331,6 +335,9 @@ static unsigned long slice_find_area_topdown(struct mm_struct *mm, }; unsigned long min_addr = max(PAGE_SIZE, mmap_min_addr); + info.hint = addr; + info.mmap_flags = flags; + /* * If we are trying to allocate above DEFAULT_MAP_WINDOW * Add the difference to the mmap_base. ppc64_defconfig: CC arch/powerpc/mm/book3s64/slice.o arch/powerpc/mm/book3s64/slice.c: In function 'slice_find_area_bottomup': arch/powerpc/mm/book3s64/slice.c:291:27: error: 'flags' undeclared (first use in this function) 291 | info.mmap_flags = flags; | ^ arch/powerpc/mm/book3s64/slice.c:291:27: note: each undeclared identifier is reported only once for each function it appears in arch/powerpc/mm/book3s64/slice.c: In function 'slice_find_area_topdown': arch/powerpc/mm/book3s64/slice.c:339:27: error: 'flags' undeclared (first use in this function) 339 | info.mmap_flags = flags; | ^ make[5]: *** [scripts/Makefile.build:244: arch/powerpc/mm/book3s64/slice.o] Error 1
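The error is simply that the slice helpers never receive the mmap flags. A sketch of the kind of plumbing a next revision would need -- signatures abridged, parameter placement assumed, not a tested patch:

    /* Thread the mmap flags from slice_get_unmapped_area() down into the
     * area finders so they can populate info.mmap_flags themselves. */
    static unsigned long slice_find_area_bottomup(struct mm_struct *mm,
                            unsigned long addr, unsigned long len,
                            unsigned long flags, /* new parameter */
                            struct slice_mask *available, int psize,
                            unsigned long high_limit)
    {
            struct vm_unmapped_area_info info = {
                    .length     = len,
                    .hint       = addr,
                    .mmap_flags = flags,
            };
            /* ... existing search logic unchanged ... */
    }

slice_find_area_topdown() and their common caller would need the same treatment.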
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On Thu 29-08-24 10:33:22, Charlie Jenkins wrote: > On Thu, Aug 29, 2024 at 10:30:56AM +0200, Michal Hocko wrote: > > On Thu 29-08-24 00:15:57, Charlie Jenkins wrote: > > > Some applications rely on placing data in free bits of addresses allocated > > > by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the > > > address returned by mmap to be less than the 48-bit address space, > > > unless the hint address uses more than 47 bits (the 48th bit is reserved > > > for the kernel address space). > > > > > > The riscv architecture needs a way to similarly restrict the virtual > > > address space. On the riscv port of OpenJDK an error is thrown if > > > it is run on the 57-bit address space, called sv57 [1]. golang > > > has a comment that sv57 support is not complete, but there are some > > > workarounds to get it to mostly work [2]. > > > > > > These applications work on x86 because x86 does an implicit 47-bit > > > restriction of mmap() addresses that contain a hint address that is less > > > than 48 bits. > > > > > > Instead of implicitly restricting the address space on riscv (or any > > > current/future architecture), a flag would allow users to opt-in to this > > > behavior rather than opt-out as is done on other architectures. This is > > > desirable because it is a small class of applications that do pointer > > > masking. > > > > IIRC this has been discussed at length when 5-level page tables support > > was proposed for x86. Sorry I do not have a link handy but lore > > should help you. Linus was not really convinced and in the end vetoed it > > and preferred that those few applications that benefit from a greater address > > space would do that explicitly rather than the other way around. > > I believe I found the conversation you were referring to. Ingo Molnar > recommended a flag similar to what I have proposed [1]. Catalin > recommended making 52-bit addresses opt-in on arm64 [2]. Dave Hansen brought up > MPX [3]. > > However these conversations are tangential to what I am proposing. arm64 > and x86 decided to have the default address space be 48 bits. However > this was done on a per-architecture basis with no way for applications > to have guarantees between architectures. Even this behavior to restrict > to 48 bits does not appear in the man pages, so it would require > reading the kernel source code to understand that this feature is > available. Then to opt in to larger address spaces, applications have to > know to provide a hint address that is greater than 47 bits; mmap() will > then return an address that contains up to 56 bits on x86 and 52 bits on > arm64. This difference of 4 bits causes inconsistency and is part of the > problem I am trying to solve with this flag. Yes, I guess I do understand where you are heading. Our existing model assumes that anybody requiring more address space knows what they are doing and deals with the reality. This is the way Linus has pushed this and I am not really convinced it is the right way TBH. On the other hand it is true that this allows a safe(r) transition to larger address spaces. > I am not proposing to change x86 and arm64 away from using their opt-out > feature, I am instead proposing a standard ABI for applications that > need some guarantees of the bits used in pointers. Right, but this is not really different from earlier attempts to achieve this IIRC. Extending mmap for that purpose seems quite tricky as already pointed out in other sub-threads. 
Quite honestly I am not really sure what the right and backwards-compatible way is. I just wanted to make you aware this has been discussed at length in the past. -- Michal Hocko SUSE Labs
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On Fri, Aug 30, 2024 at 10:52:01AM +0100, Lorenzo Stoakes wrote: > On Thu, Aug 29, 2024 at 03:16:53PM GMT, Charlie Jenkins wrote: > > On Thu, Aug 29, 2024 at 10:54:25AM +0100, Lorenzo Stoakes wrote: > > > On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote: > > > > On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote: > > > > > Some applications rely on placing data in free bits addresses > > > > > allocated > > > > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > > > > > address returned by mmap to be less than the 48-bit address space, > > > > > unless the hint address uses more than 47 bits (the 48th bit is > > > > > reserved > > > > > for the kernel address space). > > > > > > > > I'm still confused as to why, if an mmap flag is desired, and thus > > > > programs > > > > are having to be heavily modified and controlled to be able to do this, > > > > why > > > > you can't just do an mmap() with PROT_NONE early, around a hinted > > > > address > > > > that, sits below the required limit, and then mprotect() or mmap() over > > > > it? > > > > > > > > Your feature is a major adjustment to mmap(), it needs to be pretty > > > > significantly justified, especially if taking up a new flag. > > > > > > > > > > > > > > The riscv architecture needs a way to similarly restrict the virtual > > > > > address space. On the riscv port of OpenJDK an error is thrown if > > > > > attempted to run on the 57-bit address space, called sv57 [1]. golang > > > > > has a comment that sv57 support is not complete, but there are some > > > > > workarounds to get it to mostly work [2]. > > > > > > > > > > These applications work on x86 because x86 does an implicit 47-bit > > > > > restriction of mmap() address that contain a hint address that is less > > > > > than 48 bits. > > > > > > > > You mean x86 _has_ to limit to physically available bits in a canonical > > > > format :) this will not be the case for 5-page table levels though... > > > > I might be misunderstanding but I am not talking about pointer masking > > or canonical addresses here. I am referring to the pattern of: > > > > 1. Getting an address from mmap() > > 2. Writing data into bits assumed to be unused in the address > > 3. Using the data stored in the address > > 4. Clearing the data from the address and sign extending > > 5. Dereferencing the now sign-extended address to conform to canonical > >addresses > > > > I am just talking about step 1 and 2 here -- getting an address from > > mmap() that only uses bits that will allow your application to not > > break. How canonicalization happens is a a separate conversation, that > > can be handled by LAM for x86, TBI for arm64, or Ssnpm for riscv. > > While LAM for x86 is only capable of masking addresses to 48 or 57 bits, > > Ssnpm for riscv allow an arbitrary number of bits to be masked out. > > A design goal here is to be able to support all of the pointer masking > > flavors, and not just x86. > > Right I get that, I was just saying that the implicit limitation in x86 is > due to virtual addresses _having_ to be less than 48 bits. So that's why > that is right? I mean perhaps I'm mistaken? > > Or is it such that x86 can provide a space for tagging for CPU technology > that supports it (UAI perhaps?). > > I agree with what Michal and others said about the decision to default to > the reduced address space size and opt-in for higher bits. Your series > doesn't do this... 
> > > > > > > > > > > > > > > > > Instead of implicitly restricting the address space on riscv (or any > > > > > current/future architecture), a flag would allow users to opt-in to > > > > > this > > > > > behavior rather than opt-out as is done on other architectures. This > > > > > is > > > > > desirable because it is a small class of applications that do pointer > > > > > masking. > > > > > > > > I raised this last time and you didn't seem to address it so to be more > > > > blunt: > > > > > > > > I don't understand why this needs to be an mmap() flag. From this it > > > > seems > > > > the whole process needs allocations to be below a certain limit. > > > > Yeah making it per-process does seem logical, as it would help with > > pointer masking. > > To me it's the only feasible way forward, you can't control all libraries, > a map flag continues to seem a strange way to implement this, and I > understand that your justification is that it is the _least invasive_ way > of doing this, but as I've said below, it's actually pretty invasive if you > think about it, the current implementation seems to me to be insufficient > without having VMA flags etc. > > > > > > > > > > > That _could_ be achieved through a 'personality' or similar (though a > > > > personality is on/off, rather than allowing configuration so maybe > > > > something else would be needed). > > > > > > > > From what you're saying 57-bit is all you really need right? So maybe > > > > ADDR_LIMIT_57BIT? > > > > Addresses w
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On Fri, Aug 30, 2024 at 08:03:25AM GMT, Dave Hansen wrote: > On 8/29/24 01:42, Lorenzo Stoakes wrote: > >> These applications work on x86 because x86 does an implicit 47-bit > >> restriction of mmap() address that contain a hint address that is less > >> than 48 bits. > > You mean x86 _has_ to limit to physically available bits in a canonical > > format 🙂 this will not be the case for 5-page table levels though... > > By "physically available bits" are you referring to the bits that can be > used as a part of the virtual address? "Physically" may not have been > the best choice of words. ;) > > There's a canonical hole in 4-level paging and 5-level paging on x86. > The 5-level canonical hole is just smaller. Yeah sorry this is what I meant! > > Also, I should probably say that the >47-bit mmap() access hint was more > of a crutch than something that we wanted to make ABI forever. We knew > that high addresses might break some apps and we hoped that the list of > things it would break would go down over time so that we could > eventually just let mmap() access the whole address space by default. > > That optimism may have been misplaced. Interesting, thanks. This speaks again I think to it being unwise to rely on these things. I do think the only workable form of this series is a fixed personality-based mapping limit.
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On 8/29/24 01:42, Lorenzo Stoakes wrote: >> These applications work on x86 because x86 does an implicit 47-bit >> restriction of mmap() address that contain a hint address that is less >> than 48 bits. > You mean x86 _has_ to limit to physically available bits in a canonical > format 🙂 this will not be the case for 5-page table levels though... By "physically available bits" are you referring to the bits that can be used as a part of the virtual address? "Physically" may not have been the best choice of words. ;) There's a canonical hole in 4-level paging and 5-level paging on x86. The 5-level canonical hole is just smaller. Also, I should probably say that the >47-bit mmap() access hint was more of a crutch than something that we wanted to make ABI forever. We knew that high addresses might break some apps and we hoped that the list of things it would break would go down over time so that we could eventually just let mmap() access the whole address space by default. That optimism may have been misplaced.
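To make the canonical hole Dave describes concrete: with 4-level paging bits 63:48 must be copies of bit 47, and with 5-level paging bits 63:57 must be copies of bit 56. A small sketch of the check:

    #include <stdbool.h>
    #include <stdint.h>

    /* True when va is canonical for the given VA width (48 or 57 on x86):
     * the top bits must be a sign extension of the highest usable bit. */
    static bool is_canonical(uint64_t va, unsigned int va_bits)
    {
            int64_t sext = (int64_t)(va << (64 - va_bits)) >> (64 - va_bits);
            return (uint64_t)sext == va;
    }

Anything in the hole between the two canonical halves -- for 48-bit VAs, 0x0000800000000000 through 0xffff7fffffffffff -- faults on access rather than behaving like a normal address, which is why tag bits left in a dereferenced pointer are caught immediately on x86.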
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On Thu, Aug 29, 2024 at 03:16:53PM GMT, Charlie Jenkins wrote: > On Thu, Aug 29, 2024 at 10:54:25AM +0100, Lorenzo Stoakes wrote: > > On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote: > > > On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote: > > > > Some applications rely on placing data in free bits addresses allocated > > > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > > > > address returned by mmap to be less than the 48-bit address space, > > > > unless the hint address uses more than 47 bits (the 48th bit is reserved > > > > for the kernel address space). > > > > > > I'm still confused as to why, if an mmap flag is desired, and thus > > > programs > > > are having to be heavily modified and controlled to be able to do this, > > > why > > > you can't just do an mmap() with PROT_NONE early, around a hinted address > > > that, sits below the required limit, and then mprotect() or mmap() over > > > it? > > > > > > Your feature is a major adjustment to mmap(), it needs to be pretty > > > significantly justified, especially if taking up a new flag. > > > > > > > > > > > The riscv architecture needs a way to similarly restrict the virtual > > > > address space. On the riscv port of OpenJDK an error is thrown if > > > > attempted to run on the 57-bit address space, called sv57 [1]. golang > > > > has a comment that sv57 support is not complete, but there are some > > > > workarounds to get it to mostly work [2]. > > > > > > > > These applications work on x86 because x86 does an implicit 47-bit > > > > restriction of mmap() address that contain a hint address that is less > > > > than 48 bits. > > > > > > You mean x86 _has_ to limit to physically available bits in a canonical > > > format :) this will not be the case for 5-page table levels though... > > I might be misunderstanding but I am not talking about pointer masking > or canonical addresses here. I am referring to the pattern of: > > 1. Getting an address from mmap() > 2. Writing data into bits assumed to be unused in the address > 3. Using the data stored in the address > 4. Clearing the data from the address and sign extending > 5. Dereferencing the now sign-extended address to conform to canonical >addresses > > I am just talking about step 1 and 2 here -- getting an address from > mmap() that only uses bits that will allow your application to not > break. How canonicalization happens is a a separate conversation, that > can be handled by LAM for x86, TBI for arm64, or Ssnpm for riscv. > While LAM for x86 is only capable of masking addresses to 48 or 57 bits, > Ssnpm for riscv allow an arbitrary number of bits to be masked out. > A design goal here is to be able to support all of the pointer masking > flavors, and not just x86. Right I get that, I was just saying that the implicit limitation in x86 is due to virtual addresses _having_ to be less than 48 bits. So that's why that is right? I mean perhaps I'm mistaken? Or is it such that x86 can provide a space for tagging for CPU technology that supports it (UAI perhaps?). I agree with what Michal and others said about the decision to default to the reduced address space size and opt-in for higher bits. Your series doesn't do this... > > > > > > > > > > > > Instead of implicitly restricting the address space on riscv (or any > > > > current/future architecture), a flag would allow users to opt-in to this > > > > behavior rather than opt-out as is done on other architectures. 
This is > > > > desirable because it is a small class of applications that do pointer > > > > masking. > > > > > > I raised this last time and you didn't seem to address it so to be more > > > blunt: > > > > > > I don't understand why this needs to be an mmap() flag. From this it seems > > > the whole process needs allocations to be below a certain limit. > > Yeah making it per-process does seem logical, as it would help with > pointer masking. To me it's the only feasible way forward, you can't control all libraries, a map flag continues to seem a strange way to implement this, and I understand that your justification is that it is the _least invasive_ way of doing this, but as I've said below, it's actually pretty invasive if you think about it, the current implementation seems to me to be insufficient without having VMA flags etc. > > > > > > > That _could_ be achieved through a 'personality' or similar (though a > > > personality is on/off, rather than allowing configuration so maybe > > > something else would be needed). > > > > > > From what you're saying 57-bit is all you really need right? So maybe > > > ADDR_LIMIT_57BIT? > > Addresses will always be limited to 57 bits on riscv and x86 (but not > necessarily on other architectures). A flag like that would have no > impact, I do not understand what you are suggesting. This patch is to > have a configurable number of bits be restricted. I get that, but as I say below, I don't think a customisable limit is wor
Re: [PATCH RFC v2 1/4] mm: Add MAP_BELOW_HINT
On Thu, Aug 29, 2024 at 06:26:41PM +1000, Michael Ellerman wrote: > Charlie Jenkins writes: > > Some applications rely on placing data in free bits addresses allocated > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > > address returned by mmap to be less than the 48-bit address space, > > unless the hint address uses more than 47 bits (the 48th bit is reserved > > for the kernel address space). > > > > To make this behavior explicit and more versatile across all > > architectures, define a mmap flag that allows users to define an > > arbitrary upper limit on addresses returned by mmap. > > > > Signed-off-by: Charlie Jenkins > > --- > > include/uapi/asm-generic/mman-common.h | 1 + > > tools/include/uapi/asm-generic/mman-common.h | 1 + > > You're not meant to update the headers in tools/ directly. There's a > mail somewhere from acme somewhere describing the proper process, but > the tldr is leave it up to him. Oh okay, thank you. > > > diff --git a/include/uapi/asm-generic/mman-common.h > > b/include/uapi/asm-generic/mman-common.h > > index 6ce1f1ceb432..03ac13d9aa37 100644 > > --- a/include/uapi/asm-generic/mman-common.h > > +++ b/include/uapi/asm-generic/mman-common.h > > @@ -32,6 +32,7 @@ > > > > #define MAP_UNINITIALIZED 0x400/* For anonymous mmap, memory > > could be > > * uninitialized */ > > +#define MAP_BELOW_HINT 0x800 /* give out address that is > > below (inclusive) hint address */ > > IMHO the API would be clearer if this actually forced the address to be > below the hint. That's what the flag name implies after all. > > It would also mean the application doesn't need to take into account the > length of the mapping when passing the hint. > > cheers That's a good point. The reason I did it this way was to allow mmap the possibility of returning the same address as the hint. If it must be strictly less than the hint then the hint address can never be returned. Maybe that doesn't matter though. - Charlie
Re: [PATCH RFC v2 2/4] mm: Add hint and mmap_flags to struct vm_unmapped_area_info
On Thu, Aug 29, 2024 at 09:48:44AM +0100, Lorenzo Stoakes wrote: > On Thu, Aug 29, 2024 at 12:15:59AM GMT, Charlie Jenkins wrote: > > [snip] > > > diff --git a/mm/mmap.c b/mm/mmap.c > > index d0dfc85b209b..34ba0db23678 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -1796,6 +1796,9 @@ generic_get_unmapped_area(struct file *filp, unsigned > > long addr, > > struct vm_unmapped_area_info info = {}; > > const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags); > > > > + info.hint = addr; > > + info.mmap_flags = flags; > > + > > if (len > mmap_end - mmap_min_addr) > > return -ENOMEM; > > > > @@ -1841,6 +1844,9 @@ generic_get_unmapped_area_topdown(struct file *filp, > > unsigned long addr, > > struct vm_unmapped_area_info info = {}; > > const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags); > > > > + info.hint = addr; > > + info.mmap_flags = flags; > > + > > /* requested length too big for entire address space */ > > if (len > mmap_end - mmap_min_addr) > > return -ENOMEM; > > > > These line numbers suggest you're working against Linus's tree, mm/mmap.c > has changed a lot recently, so to avoid conflicts please base your changes > on mm-unstable in Andrew's tree (if looking to go through that) or at least > -next. I will make sure that I base off of mm-unstable for future revisions. - Charlie > > > -- > > 2.45.0 > >
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On Thu, 29 Aug 2024 02:02:34 PDT (-0700), vba...@suse.cz wrote: > Such a large recipient list and no linux-api. CC'd, please include it on > future postings. > > On 8/29/24 09:15, Charlie Jenkins wrote: >> Some applications rely on placing data in free bits of addresses allocated >> by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the >> address returned by mmap to be less than the 48-bit address space, >> unless the hint address uses more than 47 bits (the 48th bit is reserved >> for the kernel address space). >> >> The riscv architecture needs a way to similarly restrict the virtual >> address space. On the riscv port of OpenJDK an error is thrown if >> it is run on the 57-bit address space, called sv57 [1]. golang >> has a comment that sv57 support is not complete, but there are some >> workarounds to get it to mostly work [2]. >> >> These applications work on x86 because x86 does an implicit 47-bit >> restriction of mmap() addresses that contain a hint address that is less >> than 48 bits. >> >> Instead of implicitly restricting the address space on riscv (or any >> current/future architecture), a flag would allow users to opt-in to this >> behavior rather than opt-out as is done on other architectures. This is >> desirable because it is a small class of applications that do pointer >> masking. > > I doubt it's desirable to have different behavior depending on architecture. > Also you could say it's a small class of applications that need more than 47 > bits. We're sort of stuck with the architecture-dependent behavior here: for the first few years RISC-V only had 39-bit VAs, so the de facto uABI ended up being that userspace can ignore way more bits. While 48 bits might be enough for everyone, 39 doesn't seem to be -- or at least IIRC when we tried restricting the default to that, we broke stuff. There are also some other wrinkles like arbitrary bit boundaries in pointer masking and vendor-specific paging formats, but at some point we just end up down a rabbit hole of insanity there... FWIW, I think that userspace depending on just tossing some VA bits because some kernels happened to never allocate from them is just broken, but it seems like other ports worked around the 48->57 bit transition and we're trying to do something similar for 39->48 (and that works with 48->57, as we'll have to deal with that eventually). So that's basically how we ended up with this sort of thing: trying to do something similar without a flag broke userspace because we were trying to jam too much into the hints. I couldn't really figure out a way to satisfy all the userspace constraints by just implicitly retrofitting behavior based on the hints, so we figured having an explicit flag to control the behavior would be the sanest way to go. That said: I'm not opposed to just saying "depending on 39-bit VAs is broken" and just forcing people to fix it. >> This flag will also allow seamless compatibility between all >> architectures, so applications like Go and OpenJDK that use bits in a >> virtual address can request the exact number of bits they need in a >> generic way. The flag can be checked inside of vm_unmapped_area() so >> that this flag does not have to be handled individually by each >> architecture. 
>> >> Link: >> https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79 >> [1] >> Link: >> https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47 >> [2] >> >> To: Arnd Bergmann >> To: Richard Henderson >> To: Ivan Kokshaysky >> To: Matt Turner >> To: Vineet Gupta >> To: Russell King >> To: Guo Ren >> To: Huacai Chen >> To: WANG Xuerui >> To: Thomas Bogendoerfer >> To: James E.J. Bottomley >> To: Helge Deller >> To: Michael Ellerman >> To: Nicholas Piggin >> To: Christophe Leroy >> To: Naveen N Rao >> To: Alexander Gordeev >> To: Gerald Schaefer >> To: Heiko Carstens >> To: Vasily Gorbik >> To: Christian Borntraeger >> To: Sven Schnelle >> To: Yoshinori Sato >> To: Rich Felker >> To: John Paul Adrian Glaubitz >> To: David S. Miller >> To: Andreas Larsson >> To: Thomas Gleixner >> To: Ingo Molnar >> To: Borislav Petkov >> To: Dave Hansen >> To: x...@kernel.org >> To: H. Peter Anvin >> To: Andy Lutomirski >> To: Peter Zijlstra >> To: Muchun Song >> To: Andrew Morton >> To: Liam R. Howlett >> To: Vlastimil Babka >> To: Lorenzo Stoakes >> To: Shuah Khan >> Cc: linux-a...@vger.kernel.org >> Cc: linux-ker...@vger.kernel.org >> Cc: linux-al...@vger.kernel.org >> Cc: linux-snps-...@lists.infradead.org >> Cc: linux-arm-ker...@lists.infradead.org >> Cc: linux-c...@vger.kernel.org >> Cc: loonga...@lists.linux.dev >> Cc: linux-m...@vger.kernel.org >> Cc: linux-par...@vger.kernel.org >> Cc: linuxppc-dev@lists.ozlabs.org >> Cc: linux-s...@vger.kernel.org >> Cc: linux...@vger.kernel.org >> Cc: sparcli...@vger.kernel.org >> Cc: linux...@kvack
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On Thu, Aug 29, 2024 at 10:54:25AM +0100, Lorenzo Stoakes wrote:
> On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote:
> > On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
> > > Some applications rely on placing data in free bits of addresses allocated
> > > by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the
> > > address returned by mmap to be less than the 48-bit address space,
> > > unless the hint address uses more than 47 bits (the 48th bit is reserved
> > > for the kernel address space).
> >
> > I'm still confused as to why, if an mmap flag is desired, and thus programs
> > are having to be heavily modified and controlled to be able to do this, why
> > you can't just do an mmap() with PROT_NONE early, around a hinted address
> > that sits below the required limit, and then mprotect() or mmap() over it?
> >
> > Your feature is a major adjustment to mmap(), it needs to be pretty
> > significantly justified, especially if taking up a new flag.
> >
> > >
> > > The riscv architecture needs a way to similarly restrict the virtual
> > > address space. On the riscv port of OpenJDK an error is thrown if it is
> > > run on the 57-bit address space, called sv57 [1]. golang has a comment
> > > that sv57 support is not complete, but there are some workarounds to get
> > > it to mostly work [2].
> > >
> > > These applications work on x86 because x86 does an implicit 47-bit
> > > restriction of mmap() addresses when the hint address is less than
> > > 48 bits.
> >
> > You mean x86 _has_ to limit to physically available bits in a canonical
> > format :) this will not be the case for 5-level page tables though...

I might be misunderstanding but I am not talking about pointer masking or canonical addresses here. I am referring to the pattern of:

1. Getting an address from mmap()
2. Writing data into bits assumed to be unused in the address
3. Using the data stored in the address
4. Clearing the data from the address and sign extending
5. Dereferencing the now sign-extended address to conform to canonical addresses

I am just talking about steps 1 and 2 here -- getting an address from mmap() that only uses bits that will allow your application to not break. How canonicalization happens is a separate conversation that can be handled by LAM for x86, TBI for arm64, or Ssnpm for riscv. While LAM for x86 is only capable of masking addresses to 48 or 57 bits, Ssnpm for riscv allows an arbitrary number of bits to be masked out. A design goal here is to be able to support all of the pointer masking flavors, and not just x86.

> > >
> > > Instead of implicitly restricting the address space on riscv (or any
> > > current/future architecture), a flag would allow users to opt in to this
> > > behavior rather than opt out as is done on other architectures. This is
> > > desirable because it is a small class of applications that do pointer
> > > masking.
> >
> > I raised this last time and you didn't seem to address it so to be more
> > blunt:
> >
> > I don't understand why this needs to be an mmap() flag. From this it seems
> > the whole process needs allocations to be below a certain limit.

Yeah, making it per-process does seem logical, as it would help with pointer masking.

> >
> > That _could_ be achieved through a 'personality' or similar (though a
> > personality is on/off, rather than allowing configuration so maybe
> > something else would be needed).
> >
> > From what you're saying 57-bit is all you really need right? So maybe
> > ADDR_LIMIT_57BIT?
Addresses will always be limited to 57 bits on riscv and x86 (but not necessarily on other architectures), so a flag like that would have no impact; I do not understand what you are suggesting. This patch is to have a configurable number of bits be restricted. If anything, a personality of ADDR_LIMIT_48BIT would be the closest to what I am trying to achieve, since the issue is that applications fail to work when the address space is greater than 48 bits.

> >
> > I don't see how you're going to actually enforce this in a process either
> > via an mmap flag, as a library might decide not to use it, so you'd need to
> > control the allocator, the thread library implementation, and everything
> > that might allocate.

It is reasonable to change the implementation to be per-process, but that is not the current proposal. This flag was designed for applications which already directly manage all of their addresses, like OpenJDK and Go. This flag implementation was an attempt to make this feature as minimally invasive as possible, to reduce maintenance burden and implementation complexity.

> >
> > Liam also raised various points about VMA particulars that I'm not sure are
> > addressed either.
> >
> > I just find it hard to believe that everything will fit together.
> >
> > I'd _really_ need to be convinced that this MAP_ flag is justified, and I'm
> > just not.
> >
> > >
> > > This flag will also allow seamless compatibility between all
> > > architectures, so applications like Go and OpenJDK that use bits in a
> > > virtual address can request the exact number of bits they need in a
> > > generic way.
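[For reference, the five-step tagging pattern described in the reply above looks roughly like this in C. This is a minimal sketch: the 48-bit assumption, the tag layout, and all names are illustrative, and are not taken from OpenJDK, Go, or the posted patches.]

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    #define TAG_SHIFT 48                        /* assumes a sub-48-bit address */
    #define TAG_MASK  (0xffffULL << TAG_SHIFT)

    int main(void)
    {
            /* 1. Get an address from mmap(); assume the kernel kept it below 2^48. */
            void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            assert(p != MAP_FAILED && ((uintptr_t)p & TAG_MASK) == 0);

            /* 2. Write data into the bits assumed to be unused. */
            uintptr_t tagged = (uintptr_t)p | (0x2aULL << TAG_SHIFT);

            /* 3. Use the data stored in the address. */
            uint16_t tag = (uint16_t)(tagged >> TAG_SHIFT);
            (void)tag;

            /* 4. Clear the data again; for a user pointer, "sign extending"
             *    amounts to zeroing the top bits to make it canonical. */
            void *q = (void *)(tagged & ~TAG_MASK);

            /* 5. Dereference the restored address. */
            memset(q, 0, 4096);
            return 0;
    }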
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On Thu, Aug 29, 2024 at 10:30:56AM +0200, Michal Hocko wrote:
> On Thu 29-08-24 00:15:57, Charlie Jenkins wrote:
> > Some applications rely on placing data in free bits of addresses allocated
> > by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the
> > address returned by mmap to be less than the 48-bit address space,
> > unless the hint address uses more than 47 bits (the 48th bit is reserved
> > for the kernel address space).
> >
> > The riscv architecture needs a way to similarly restrict the virtual
> > address space. On the riscv port of OpenJDK an error is thrown if it is
> > run on the 57-bit address space, called sv57 [1]. golang has a comment
> > that sv57 support is not complete, but there are some workarounds to get
> > it to mostly work [2].
> >
> > These applications work on x86 because x86 does an implicit 47-bit
> > restriction of mmap() addresses when the hint address is less than
> > 48 bits.
> >
> > Instead of implicitly restricting the address space on riscv (or any
> > current/future architecture), a flag would allow users to opt in to this
> > behavior rather than opt out as is done on other architectures. This is
> > desirable because it is a small class of applications that do pointer
> > masking.
>
> IIRC this has been discussed at length when 5-level page table support
> was proposed for x86. Sorry I do not have a link handy but lore should
> help you. Linus was not really convinced and in the end vetoed it,
> preferring that the few applications that benefit from a greater address
> space do so explicitly rather than the other way around.

I believe I found the conversation you were referring to. Ingo Molnar recommended a flag similar to what I have proposed [1]. Catalin recommended making 52-bit addresses opt-in on arm64 [2]. Dave Hansen brought up MPX [3].

However, these conversations are tangential to what I am proposing. arm64 and x86 decided to have the default address space be 48 bits, but this was done on a per-architecture basis with no way for applications to have guarantees between architectures. This behavior of restricting to 48 bits does not even appear in the man pages, so discovering that the feature exists requires reading the kernel source code. Then, to opt in to larger address spaces, applications have to know to provide a hint address that is greater than 47 bits; mmap() will then return an address that contains up to 56 bits on x86 and 52 bits on arm64. This difference of 4 bits causes inconsistency and is part of the problem I am trying to solve with this flag.

I am not proposing to change x86 and arm64 away from their opt-out behavior; I am instead proposing a standard ABI for applications that need some guarantees about the bits used in pointers.

- Charlie

Link: https://lore.kernel.org/lkml/20161209050130.gc2...@gmail.com/ [1]
Link: https://lore.kernel.org/lkml/20161209105120.ga3...@e104818-lin.cambridge.arm.com/ [2]
Link: https://lore.kernel.org/lkml/a2f86495-b55f-fda0-40d2-242c45d3c...@intel.com/ [3]

>
> --
> Michal Hocko
> SUSE Labs
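[To make the opt-out behavior described above concrete, a userspace sketch follows. It is illustrative only; the exact ceilings are the architecture-dependent ones mentioned in the reply.]

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            /* Default: no hint, so the kernel stays below 2^47 on x86/arm64. */
            void *lo = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            /* Opt-in: a hint above 47 bits unlocks the larger address space
             * (up to 56 bits of user VA on x86 with 5-level tables, 52 on arm64). */
            void *hi = mmap((void *)(1UL << 48), 4096, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            printf("default: %p, high hint: %p\n", lo, hi);
            return 0;
    }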
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote:
> On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
> > Some applications rely on placing data in free bits of addresses allocated
> > by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the
> > address returned by mmap to be less than the 48-bit address space,
> > unless the hint address uses more than 47 bits (the 48th bit is reserved
> > for the kernel address space).
>
> I'm still confused as to why, if an mmap flag is desired, and thus programs
> are having to be heavily modified and controlled to be able to do this, why
> you can't just do an mmap() with PROT_NONE early, around a hinted address
> that sits below the required limit, and then mprotect() or mmap() over it?
>
> Your feature is a major adjustment to mmap(), it needs to be pretty
> significantly justified, especially if taking up a new flag.
>
> >
> > The riscv architecture needs a way to similarly restrict the virtual
> > address space. On the riscv port of OpenJDK an error is thrown if it is
> > run on the 57-bit address space, called sv57 [1]. golang has a comment
> > that sv57 support is not complete, but there are some workarounds to get
> > it to mostly work [2].
> >
> > These applications work on x86 because x86 does an implicit 47-bit
> > restriction of mmap() addresses when the hint address is less than
> > 48 bits.
>
> You mean x86 _has_ to limit to physically available bits in a canonical
> format :) this will not be the case for 5-level page tables though...
>
> >
> > Instead of implicitly restricting the address space on riscv (or any
> > current/future architecture), a flag would allow users to opt in to this
> > behavior rather than opt out as is done on other architectures. This is
> > desirable because it is a small class of applications that do pointer
> > masking.
>
> I raised this last time and you didn't seem to address it so to be more
> blunt:
>
> I don't understand why this needs to be an mmap() flag. From this it seems
> the whole process needs allocations to be below a certain limit.
>
> That _could_ be achieved through a 'personality' or similar (though a
> personality is on/off, rather than allowing configuration so maybe
> something else would be needed).
>
> From what you're saying 57-bit is all you really need right? So maybe
> ADDR_LIMIT_57BIT?
>
> I don't see how you're going to actually enforce this in a process either
> via an mmap flag, as a library might decide not to use it, so you'd need to
> control the allocator, the thread library implementation, and everything
> that might allocate.
>
> Liam also raised various points about VMA particulars that I'm not sure are
> addressed either.
>
> I just find it hard to believe that everything will fit together.
>
> I'd _really_ need to be convinced that this MAP_ flag is justified, and I'm
> just not.
>
> >
> > This flag will also allow seamless compatibility between all
> > architectures, so applications like Go and OpenJDK that use bits in a
> > virtual address can request the exact number of bits they need in a
> > generic way. The flag can be checked inside of vm_unmapped_area() so
> > that this flag does not have to be handled individually by each
> > architecture.
>
> I'm still very unconvinced and feel the bar needs to be high for making
> changes like this that carry maintainership burden.
>
> So for me, it's a no really as an overall concept.
>
> Happy to be convinced otherwise, however...
> (I may be missing details or context that provide more justification).

Some more thoughts:

* If you absolutely must keep allocations below a certain limit, you'd probably need to actually associate this information with the VMA so the memory can't be mremap()'d somewhere invalid (you might not control all code so you can't guarantee this won't happen).

* Keeping a map limit associated with a VMA would be horrid and keeping VMAs as small as possible is a key aim, so that'd be a no go. VMA flags are in limited supply also.

* If we did implement a per-process thing, but it were arbitrary, we'd then have to handle all kinds of corner cases forever (this is UAPI, can't break it etc.) with crazy-low values, or determine a minimum that might vary by arch...

* If we did this we'd absolutely have to implement a check in the brk() implementation, which is a very very sensitive bit of code. And of course, in mmap() and mremap()... and any arch-specific code that might interface with this stuff (these functions are hooked).

* A fixed address limit would make more sense, but it seems difficult to know what would work for everybody, and again we'd have to deal with edge cases and having a permanent maintenance burden.

* If you did have a map flag, what about merging between VMAs above the limit and below it? To avoid that you'd need to implement some kind of a 'VMA flag that has an arbitrary characteristic' or a 'limit' field, adjust all t
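[The early-reservation approach Lorenzo suggests might look something like the sketch below. The helpers and the hint value are hypothetical and error handling is omitted; note the initial hint is only honoured if that range happens to be free.]

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    /* Reserve low address space early, before other threads can allocate
     * there. PROT_NONE + MAP_NORESERVE claims address space only, without
     * committing memory. */
    static void *reserve_low(size_t size)
    {
            return mmap((void *)(1UL << 30), size, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    }

    /* Later, atomically replace a piece of the reservation with a usable
     * mapping; MAP_FIXED is safe here because we own the reserved range. */
    static void *take_from_reservation(void *slot, size_t size)
    {
            return mmap(slot, size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    }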
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
Such a large recipient list and no linux-api. CC'd, please include it on future postings.

On 8/29/24 09:15, Charlie Jenkins wrote:
> Some applications rely on placing data in free bits of addresses allocated
> by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the
> address returned by mmap to be less than the 48-bit address space,
> unless the hint address uses more than 47 bits (the 48th bit is reserved
> for the kernel address space).
>
> The riscv architecture needs a way to similarly restrict the virtual
> address space. On the riscv port of OpenJDK an error is thrown if it is
> run on the 57-bit address space, called sv57 [1]. golang has a comment
> that sv57 support is not complete, but there are some workarounds to get
> it to mostly work [2].
>
> These applications work on x86 because x86 does an implicit 47-bit
> restriction of mmap() addresses when the hint address is less than
> 48 bits.
>
> Instead of implicitly restricting the address space on riscv (or any
> current/future architecture), a flag would allow users to opt in to this
> behavior rather than opt out as is done on other architectures. This is
> desirable because it is a small class of applications that do pointer
> masking.

I doubt it's desirable to have different behavior depending on architecture. Also you could say it's a small class of applications that need more than 47 bits.

> This flag will also allow seamless compatibility between all
> architectures, so applications like Go and OpenJDK that use bits in a
> virtual address can request the exact number of bits they need in a
> generic way. The flag can be checked inside of vm_unmapped_area() so
> that this flag does not have to be handled individually by each
> architecture.
>
> Link: https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79 [1]
> Link: https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47 [2]
>
> [snip]
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On Thu 29-08-24 00:15:57, Charlie Jenkins wrote:
> Some applications rely on placing data in free bits of addresses allocated
> by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the
> address returned by mmap to be less than the 48-bit address space,
> unless the hint address uses more than 47 bits (the 48th bit is reserved
> for the kernel address space).
>
> The riscv architecture needs a way to similarly restrict the virtual
> address space. On the riscv port of OpenJDK an error is thrown if it is
> run on the 57-bit address space, called sv57 [1]. golang has a comment
> that sv57 support is not complete, but there are some workarounds to get
> it to mostly work [2].
>
> These applications work on x86 because x86 does an implicit 47-bit
> restriction of mmap() addresses when the hint address is less than
> 48 bits.
>
> Instead of implicitly restricting the address space on riscv (or any
> current/future architecture), a flag would allow users to opt in to this
> behavior rather than opt out as is done on other architectures. This is
> desirable because it is a small class of applications that do pointer
> masking.

IIRC this has been discussed at length when 5-level page table support was proposed for x86. Sorry I do not have a link handy but lore should help you. Linus was not really convinced and in the end vetoed it, preferring that the few applications that benefit from a greater address space do so explicitly rather than the other way around.

--
Michal Hocko
SUSE Labs
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
> Some applications rely on placing data in free bits of addresses allocated
> by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the
> address returned by mmap to be less than the 48-bit address space,
> unless the hint address uses more than 47 bits (the 48th bit is reserved
> for the kernel address space).

I'm still confused as to why, if an mmap flag is desired, and thus programs are having to be heavily modified and controlled to be able to do this, why you can't just do an mmap() with PROT_NONE early, around a hinted address that sits below the required limit, and then mprotect() or mmap() over it?

Your feature is a major adjustment to mmap(), it needs to be pretty significantly justified, especially if taking up a new flag.

> The riscv architecture needs a way to similarly restrict the virtual
> address space. On the riscv port of OpenJDK an error is thrown if it is
> run on the 57-bit address space, called sv57 [1]. golang has a comment
> that sv57 support is not complete, but there are some workarounds to get
> it to mostly work [2].
>
> These applications work on x86 because x86 does an implicit 47-bit
> restriction of mmap() addresses when the hint address is less than
> 48 bits.

You mean x86 _has_ to limit to physically available bits in a canonical format :) this will not be the case for 5-level page tables though...

> Instead of implicitly restricting the address space on riscv (or any
> current/future architecture), a flag would allow users to opt in to this
> behavior rather than opt out as is done on other architectures. This is
> desirable because it is a small class of applications that do pointer
> masking.

I raised this last time and you didn't seem to address it so to be more blunt:

I don't understand why this needs to be an mmap() flag. From this it seems the whole process needs allocations to be below a certain limit.

That _could_ be achieved through a 'personality' or similar (though a personality is on/off, rather than allowing configuration so maybe something else would be needed).

From what you're saying 57-bit is all you really need right? So maybe ADDR_LIMIT_57BIT?

I don't see how you're going to actually enforce this in a process either via an mmap flag, as a library might decide not to use it, so you'd need to control the allocator, the thread library implementation, and everything that might allocate.

Liam also raised various points about VMA particulars that I'm not sure are addressed either.

I just find it hard to believe that everything will fit together.

I'd _really_ need to be convinced that this MAP_ flag is justified, and I'm just not.

> This flag will also allow seamless compatibility between all
> architectures, so applications like Go and OpenJDK that use bits in a
> virtual address can request the exact number of bits they need in a
> generic way. The flag can be checked inside of vm_unmapped_area() so
> that this flag does not have to be handled individually by each
> architecture.

I'm still very unconvinced and feel the bar needs to be high for making changes like this that carry maintainership burden.

So for me, it's a no really as an overall concept.

Happy to be convinced otherwise, however... (I may be missing details or context that provide more justification).
>
> Link: https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79 [1]
> Link: https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47 [2]
>
> [snip]
Re: [PATCH RFC v2 2/4] mm: Add hint and mmap_flags to struct vm_unmapped_area_info
On Thu, Aug 29, 2024 at 12:15:59AM GMT, Charlie Jenkins wrote: [snip] > diff --git a/mm/mmap.c b/mm/mmap.c > index d0dfc85b209b..34ba0db23678 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1796,6 +1796,9 @@ generic_get_unmapped_area(struct file *filp, unsigned > long addr, > struct vm_unmapped_area_info info = {}; > const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags); > > + info.hint = addr; > + info.mmap_flags = flags; > + > if (len > mmap_end - mmap_min_addr) > return -ENOMEM; > > @@ -1841,6 +1844,9 @@ generic_get_unmapped_area_topdown(struct file *filp, > unsigned long addr, > struct vm_unmapped_area_info info = {}; > const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags); > > + info.hint = addr; > + info.mmap_flags = flags; > + > /* requested length too big for entire address space */ > if (len > mmap_end - mmap_min_addr) > return -ENOMEM; > These line numbers suggest you're working against Linus's tree, mm/mmap.c has changed a lot recently, so to avoid conflicts please base your changes on mm-unstable in Andrew's tree (if looking to go through that) or at least -next. > -- > 2.45.0 >
Re: [PATCH RFC v2 1/4] mm: Add MAP_BELOW_HINT
Charlie Jenkins writes:
> Some applications rely on placing data in free bits of addresses allocated
> by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the
> address returned by mmap to be less than the 48-bit address space,
> unless the hint address uses more than 47 bits (the 48th bit is reserved
> for the kernel address space).
>
> To make this behavior explicit and more versatile across all
> architectures, define a mmap flag that allows users to define an
> arbitrary upper limit on addresses returned by mmap.
>
> Signed-off-by: Charlie Jenkins
> ---
> include/uapi/asm-generic/mman-common.h | 1 +
> tools/include/uapi/asm-generic/mman-common.h | 1 +

You're not meant to update the headers in tools/ directly. There's a mail from acme somewhere describing the proper process, but the tl;dr is: leave it up to him.

> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 6ce1f1ceb432..03ac13d9aa37 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -32,6 +32,7 @@
>
> #define MAP_UNINITIALIZED 0x400 /* For anonymous mmap, memory could be
>  * uninitialized */
> +#define MAP_BELOW_HINT 0x800 /* give out address that is below (inclusive) hint address */

IMHO the API would be clearer if this actually forced the address to be below the hint. That's what the flag name implies after all. It would also mean the application doesn't need to take into account the length of the mapping when passing the hint.

cheers
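[Michael's point can be illustrated with the flag as currently proposed. This is a sketch only: the MAP_BELOW_HINT value is taken from the RFC, the numbers are invented, and the guarantees stated are those of the posted patches as this reviewer reads them.]

    #include <sys/mman.h>

    #define MAP_BELOW_HINT 0x800   /* from this RFC; not in any libc yet */

    int main(void)
    {
            unsigned long hint = 1UL << 47;   /* want pointers that fit in 47 bits */
            unsigned long len  = 2UL << 20;

            /* As proposed, the returned address p satisfies p <= hint, but
             * the last byte p + len - 1 may still lie above the hint ... */
            void *p = mmap((void *)hint, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_BELOW_HINT, -1, 0);

            /* ... so a caller that needs the *whole* mapping below the limit
             * has to pass hint - len instead, which is Michael's point. */
            void *q = mmap((void *)(hint - len), len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_BELOW_HINT, -1, 0);
            (void)p; (void)q;
            return 0;
    }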
[PATCH RFC v2 4/4] selftests/mm: Create MAP_BELOW_HINT test
Add a selftest for MAP_BELOW_HINT that maps until it runs out of space below the hint address.

Signed-off-by: Charlie Jenkins
---
 tools/testing/selftests/mm/Makefile | 1 +
 tools/testing/selftests/mm/map_below_hint.c | 32 +
 2 files changed, 33 insertions(+)

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index cfad627e8d94..4e2de85267b5 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -50,6 +50,7 @@ TEST_GEN_FILES += hugepage-shm
 TEST_GEN_FILES += hugepage-vmemmap
 TEST_GEN_FILES += khugepaged
 TEST_GEN_FILES += madv_populate
+TEST_GEN_FILES += map_below_hint
 TEST_GEN_FILES += map_fixed_noreplace
 TEST_GEN_FILES += map_hugetlb
 TEST_GEN_FILES += map_populate
diff --git a/tools/testing/selftests/mm/map_below_hint.c b/tools/testing/selftests/mm/map_below_hint.c
new file mode 100644
index ..55d6cbf90645
--- /dev/null
+++ b/tools/testing/selftests/mm/map_below_hint.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test the MAP_BELOW_HINT mmap flag.
+ */
+#include <sys/mman.h>
+#include <errno.h>
+#include "../kselftest.h"
+
+#define ADDR (1 << 20)
+#define LENGTH (ADDR / 1)
+
+#define MAP_BELOW_HINT 0x800 /* Not defined in all libc */
+
+/*
+ * Map memory with MAP_BELOW_HINT until no memory left. Ensure that all returned
+ * addresses are below the hint.
+ */
+int main(int argc, char **argv)
+{
+	void *addr;
+
+	do {
+		addr = mmap((void *)ADDR, LENGTH, PROT_READ | PROT_WRITE,
+			    MAP_PRIVATE | MAP_ANONYMOUS | MAP_BELOW_HINT, -1, 0);
+	} while (addr != MAP_FAILED && (unsigned long)addr <= ADDR);
+
+	if (errno == ENOMEM)
+		ksft_test_result_pass("MAP_BELOW_HINT works\n");
+	else
+		ksft_test_result_fail("mmap returned address above hint with MAP_BELOW_HINT with error: %s\n",
+				      strerror(errno));
+}
--
2.45.0
[PATCH RFC v2 2/4] mm: Add hint and mmap_flags to struct vm_unmapped_area_info
The hint address and mmap_flags are necessary to determine if MAP_BELOW_HINT requirements are satisfied. Signed-off-by: Charlie Jenkins --- arch/alpha/kernel/osf_sys.c | 2 ++ arch/arc/mm/mmap.c | 3 +++ arch/arm/mm/mmap.c | 7 +++ arch/csky/abiv1/mmap.c | 3 +++ arch/loongarch/mm/mmap.c | 3 +++ arch/mips/mm/mmap.c | 3 +++ arch/parisc/kernel/sys_parisc.c | 3 +++ arch/powerpc/mm/book3s64/slice.c | 7 +++ arch/s390/mm/hugetlbpage.c | 4 arch/s390/mm/mmap.c | 6 ++ arch/sh/mm/mmap.c| 6 ++ arch/sparc/kernel/sys_sparc_32.c | 3 +++ arch/sparc/kernel/sys_sparc_64.c | 6 ++ arch/sparc/mm/hugetlbpage.c | 4 arch/x86/kernel/sys_x86_64.c | 6 ++ arch/x86/mm/hugetlbpage.c| 4 fs/hugetlbfs/inode.c | 4 include/linux/mm.h | 2 ++ mm/mmap.c| 6 ++ 19 files changed, 82 insertions(+) diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c index e5f881bc8288..6903700afd12 100644 --- a/arch/alpha/kernel/osf_sys.c +++ b/arch/alpha/kernel/osf_sys.c @@ -1223,6 +1223,8 @@ arch_get_unmapped_area_1(unsigned long addr, unsigned long len, info.length = len; info.low_limit = addr; info.high_limit = limit; + info.hint = addr; + info.mmap_flags = flags; return vm_unmapped_area(&info); } diff --git a/arch/arc/mm/mmap.c b/arch/arc/mm/mmap.c index 69a915297155..5922cb51e029 100644 --- a/arch/arc/mm/mmap.c +++ b/arch/arc/mm/mmap.c @@ -29,6 +29,9 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr, struct vm_area_struct *vma; struct vm_unmapped_area_info info = {}; + info.hint = addr; + info.mmap_flags = flags; + /* * We enforce the MAP_FIXED case. */ diff --git a/arch/arm/mm/mmap.c b/arch/arm/mm/mmap.c index d65d0e6ed10a..04d9234f049a 100644 --- a/arch/arm/mm/mmap.c +++ b/arch/arm/mm/mmap.c @@ -36,6 +36,9 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr, int aliasing = cache_is_vipt_aliasing(); struct vm_unmapped_area_info info = {}; + info.hint = addr; + info.mmap_flags = flags; + /* * We only need to do colour alignment if either the I or D * caches alias. @@ -56,6 +59,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr, if (len > TASK_SIZE) return -ENOMEM; + if (addr) { if (do_align) addr = COLOUR_ALIGN(addr, pgoff); @@ -88,6 +92,9 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, int aliasing = cache_is_vipt_aliasing(); struct vm_unmapped_area_info info = {}; + info.hint = addr; + info.mmap_flags = flags; + /* * We only need to do colour alignment if either the I or D * caches alias. diff --git a/arch/csky/abiv1/mmap.c b/arch/csky/abiv1/mmap.c index 7f826331d409..0be4913c6cf3 100644 --- a/arch/csky/abiv1/mmap.c +++ b/arch/csky/abiv1/mmap.c @@ -35,6 +35,9 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr, .align_offset = pgoff << PAGE_SHIFT }; + info.hint = addr; + info.mmap_flags = flags; + /* * We only need to do colour alignment if either the I or D * caches alias. 
diff --git a/arch/loongarch/mm/mmap.c b/arch/loongarch/mm/mmap.c index 889030985135..7d1e8be20519 100644 --- a/arch/loongarch/mm/mmap.c +++ b/arch/loongarch/mm/mmap.c @@ -27,6 +27,9 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp, int do_color_align; struct vm_unmapped_area_info info = {}; + info.hint = addr; + info.mmap_flags = flags; + if (unlikely(len > TASK_SIZE)) return -ENOMEM; diff --git a/arch/mips/mm/mmap.c b/arch/mips/mm/mmap.c index 7e11d7b58761..22e8f9c8eaa0 100644 --- a/arch/mips/mm/mmap.c +++ b/arch/mips/mm/mmap.c @@ -36,6 +36,9 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp, int do_color_align; struct vm_unmapped_area_info info = {}; + info.hint = addr; + info.mmap_flags = flags; + if (unlikely(len > TASK_SIZE)) return -ENOMEM; diff --git a/arch/parisc/kernel/sys_parisc.c b/arch/parisc/kernel/sys_parisc.c index f7722451276e..2ac53f148624 100644 --- a/arch/parisc/kernel/sys_parisc.c +++ b/arch/parisc/kernel/sys_parisc.c @@ -108,6 +108,9 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp, .length = len }; + info.hint = addr; + info.mmap_flags = flags; + if (unlikely(len > TASK_SIZE)) return -ENOMEM; diff --git a/arch/powerpc/mm/book3s64/slice.c b/arch/powerpc/mm/book3s64/slice.c index ef3ce37f1bb3..f0e2550af6d0 100644 --- a/arch/powerpc/mm/book3s64/s
[PATCH RFC v2 3/4] mm: Support MAP_BELOW_HINT in vm_unmapped_area()
To ensure that all memory allocations comply with the new MAP_BELOW_HINT flag, set the high_limit in vm_unmapped_area() to the hint address + length at most. All callers to this function set the high_limit to something reasonable, usually with space for a random offset and a gap for the stack. To respect the provided high_limit, take the minimum of hint+length and the given high_limit. Signed-off-by: Charlie Jenkins --- mm/mmap.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/mmap.c b/mm/mmap.c index 34ba0db23678..459ad380c673 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1766,6 +1766,9 @@ unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info) { unsigned long addr; + if (info->hint != 0 && info->mmap_flags & MAP_BELOW_HINT) + info->high_limit = MIN(info->high_limit, info->hint + info->length); + if (info->flags & VM_UNMAPPED_AREA_TOPDOWN) addr = unmapped_area_topdown(info); else -- 2.45.0
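[A quick userspace model of the clamp's arithmetic, for illustration. The values are made up and MIN here is a local macro, not the kernel's; this only checks that the clamp keeps the returned start address at or below the hint.]

    #include <assert.h>
    #include <stdint.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    int main(void)
    {
            uint64_t hint = 1ULL << 47;        /* caller's hint           */
            uint64_t len  = 2ULL << 20;        /* 2 MiB request           */
            uint64_t high_limit = 1ULL << 56;  /* arch limit, e.g. sv57   */

            high_limit = MIN(high_limit, hint + len);  /* the clamp above */

            /* vm_unmapped_area() returns addr with addr + len <= high_limit,
             * so the highest possible start address is exactly the hint. */
            uint64_t max_addr = high_limit - len;
            assert(max_addr == hint);
            return 0;
    }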
[PATCH RFC v2 1/4] mm: Add MAP_BELOW_HINT
Some applications rely on placing data in free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).

To make this behavior explicit and more versatile across all architectures, define a mmap flag that allows users to define an arbitrary upper limit on addresses returned by mmap.

Signed-off-by: Charlie Jenkins
---
 include/uapi/asm-generic/mman-common.h | 1 +
 tools/include/uapi/asm-generic/mman-common.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..03ac13d9aa37 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -32,6 +32,7 @@
 
 #define MAP_UNINITIALIZED 0x400	/* For anonymous mmap, memory could be
 				 * uninitialized */
+#define MAP_BELOW_HINT	0x800	/* give out address that is below (inclusive) hint address */
 
 /*
  * Flags for mlock
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..03ac13d9aa37 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -32,6 +32,7 @@
 
 #define MAP_UNINITIALIZED 0x400	/* For anonymous mmap, memory could be
 				 * uninitialized */
+#define MAP_BELOW_HINT	0x800	/* give out address that is below (inclusive) hint address */
 
 /*
  * Flags for mlock
--
2.45.0
[PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
Some applications rely on placing data in free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).

The riscv architecture needs a way to similarly restrict the virtual address space. On the riscv port of OpenJDK an error is thrown if it is run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2].

These applications work on x86 because x86 does an implicit 47-bit restriction of mmap() addresses when the hint address is less than 48 bits.

Instead of implicitly restricting the address space on riscv (or any current/future architecture), a flag would allow users to opt in to this behavior rather than opt out as is done on other architectures. This is desirable because it is a small class of applications that do pointer masking.

This flag will also allow seamless compatibility between all architectures, so applications like Go and OpenJDK that use bits in a virtual address can request the exact number of bits they need in a generic way. The flag can be checked inside of vm_unmapped_area() so that this flag does not have to be handled individually by each architecture.

Link: https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79 [1]
Link: https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47 [2]

To: Arnd Bergmann To: Richard Henderson To: Ivan Kokshaysky To: Matt Turner To: Vineet Gupta To: Russell King To: Guo Ren To: Huacai Chen To: WANG Xuerui To: Thomas Bogendoerfer To: James E.J. Bottomley To: Helge Deller To: Michael Ellerman To: Nicholas Piggin To: Christophe Leroy To: Naveen N Rao To: Alexander Gordeev To: Gerald Schaefer To: Heiko Carstens To: Vasily Gorbik To: Christian Borntraeger To: Sven Schnelle To: Yoshinori Sato To: Rich Felker To: John Paul Adrian Glaubitz To: David S. Miller To: Andreas Larsson To: Thomas Gleixner To: Ingo Molnar To: Borislav Petkov To: Dave Hansen To: x...@kernel.org To: H. Peter Anvin To: Andy Lutomirski To: Peter Zijlstra To: Muchun Song To: Andrew Morton To: Liam R.
Howlett To: Vlastimil Babka To: Lorenzo Stoakes To: Shuah Khan Cc: linux-a...@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux-al...@vger.kernel.org Cc: linux-snps-...@lists.infradead.org Cc: linux-arm-ker...@lists.infradead.org Cc: linux-c...@vger.kernel.org Cc: loonga...@lists.linux.dev Cc: linux-m...@vger.kernel.org Cc: linux-par...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s...@vger.kernel.org Cc: linux...@vger.kernel.org Cc: sparcli...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-kselft...@vger.kernel.org Signed-off-by: Charlie Jenkins Changes in v2: - Added much greater detail to cover letter - Removed all code that touched architecture specific code and was able to factor this out into all generic functions, except for flags that needed to be added to vm_unmapped_area_info - Made this an RFC since I have only tested it on riscv and x86 - Link to v1: https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb90...@rivosinc.com --- Charlie Jenkins (4): mm: Add MAP_BELOW_HINT mm: Add hint and mmap_flags to struct vm_unmapped_area_info mm: Support MAP_BELOW_HINT in vm_unmapped_area() selftests/mm: Create MAP_BELOW_HINT test arch/alpha/kernel/osf_sys.c | 2 ++ arch/arc/mm/mmap.c | 3 +++ arch/arm/mm/mmap.c | 7 ++ arch/csky/abiv1/mmap.c | 3 +++ arch/loongarch/mm/mmap.c | 3 +++ arch/mips/mm/mmap.c | 3 +++ arch/parisc/kernel/sys_parisc.c | 3 +++ arch/powerpc/mm/book3s64/slice.c | 7 ++ arch/s390/mm/hugetlbpage.c | 4 arch/s390/mm/mmap.c | 6 ++ arch/sh/mm/mmap.c| 6 ++ arch/sparc/kernel/sys_sparc_32.c | 3 +++ arch/sparc/kernel/sys_sparc_64.c | 6 ++ arch/sparc/mm/hugetlbpage.c | 4 arch/x86/kernel/sys_x86_64.c | 6 ++ arch/x86/mm/hugetlbpage.c| 4 fs/hugetlbfs/inode.c | 4 include/linux/mm.h | 2 ++ include/uapi/asm-generic/mman-common.h | 1 + mm/mmap.c| 9 tools/include/uapi/asm-generic/mman-common.h | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/map_below_hint.c | 32 23 files changed, 120 inse
Re: [PATCH RFC 2/6] mm: PGTABLE_HAS_P[MU]D_LEAVES config options
On Fri, Aug 23, 2024 at 06:19:52AM +, LEROY Christophe wrote:
> Why is an option needed for that? If pmd_leaf() always returns false,
> it means the arch doesn't support pmd mappings, and if properly used all
> related code should fold away without a config option, shouldn't it?

It's not always easy to leverage an "if" clause there, IIUC. Take the case where a driver wants to inject a pmd pfnmap; we may want something like:

  if (pmd_leaf_supported())
          inject_pmd_leaf(&pmd);

We don't have a pmd entry to reference at the point of pmd_leaf_supported() when making the decision.

Thanks,

--
Peter Xu
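[A sketch of what such a predicate could look like. pmd_leaf_supported() is hypothetical and not part of the posted series; the config symbol is the one patch 2/6 introduces.]

    /* Answers "can this build ever see PMD leaves?" without needing a
     * pmd_t in hand, unlike pmd_leaf(pmd). In the constant-false case
     * the compiler can drop the injection path entirely. */
    #ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
    static inline bool pmd_leaf_supported(void) { return true; }
    #else
    static inline bool pmd_leaf_supported(void) { return false; }
    #endif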
[PATCH] [RFC] crash: Lock-free crash hotplug support reporting
On a CPU/Memory hotplug event, the kexec lock is taken to update the kdump image. At the same time, this lock is also required to report crash hotplug support to user-space via the /sys/devices/system/[cpu|memory]/crash_hotplug sysfs interface, which applications read to avoid a kdump reload.

The kexec lock is needed to report crash hotplug support because the crash_hotplug variable, which tracks crash hotplug support, is part of the kdump image, and the kdump image needs to be updated during a hotplug event.

Given that only one kdump image can be loaded at any given time, the crash_hotplug variable can be placed outside the kdump image and set or reset during kdump image load and unload. This allows crash hotplug support to be reported without taking the kexec lock, which helps in situations where CPU/memory resources are hot-added or hot-removed from the system in bulk.

Commit e2a8f20dd8e9 ("Crash: add lock to serialize crash hotplug handling") was introduced to serialize crash hotplug handling during bulk CPU/Memory hotplug events. However, with these changes, taking the kexec lock for crash hotplug support reporting can be avoided altogether.

Cc: Hari Bathini Cc: Mahesh Salgaonkar Cc: ke...@lists.infradead.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-ker...@vger.kernel.org Cc: x...@kernel.org Signed-off-by: Sourabh Jain --- include/linux/kexec.h | 11 --- kernel/crash_core.c | 27 +-- kernel/kexec.c| 5 - kernel/kexec_file.c | 7 ++- 4 files changed, 23 insertions(+), 27 deletions(-) diff --git a/include/linux/kexec.h b/include/linux/kexec.h index f0e9f8eda7a3..bd755ba6bac4 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -318,13 +318,6 @@ struct kimage { unsigned int preserve_context : 1; /* If set, we are using file mode kexec syscall */ unsigned int file_mode:1; -#ifdef CONFIG_CRASH_HOTPLUG - /* If set, it is safe to update kexec segments that are -* excluded from SHA calculation. -*/ - unsigned int hotplug_support:1; -#endif - #ifdef ARCH_HAS_KIMAGE_ARCH struct kimage_arch arch; #endif @@ -370,6 +363,10 @@ struct kimage { unsigned long elf_load_addr; }; +#ifdef CONFIG_CRASH_HOTPLUG +extern unsigned int crash_hotplug_support; +#endif + /* kexec interface functions */ extern void machine_kexec(struct kimage *image); extern int machine_kexec_prepare(struct kimage *image); diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 63cf89393c6e..3428deba0070 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -30,6 +30,13 @@ #include "kallsyms_internal.h" #include "kexec_internal.h" +#ifdef CONFIG_CRASH_HOTPLUG +/* if set, it is safe to update kexec segments that are + * excluded from sha calculation. + */ +unsigned int crash_hotplug_support; +#endif + /* Per cpu memory for storing cpu states in case of system crash.
*/ note_buf_t __percpu *crash_notes; @@ -500,23 +507,7 @@ static DEFINE_MUTEX(__crash_hotplug_lock); */ int crash_check_hotplug_support(void) { - int rc = 0; - - crash_hotplug_lock(); - /* Obtain lock while reading crash information */ - if (!kexec_trylock()) { - pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n"); - crash_hotplug_unlock(); - return 0; - } - if (kexec_crash_image) { - rc = kexec_crash_image->hotplug_support; - } - /* Release lock now that update complete */ - kexec_unlock(); - crash_hotplug_unlock(); - - return rc; + return crash_hotplug_support; } /* @@ -552,7 +543,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu, image = kexec_crash_image; /* Check that kexec segments update is permitted */ - if (!image->hotplug_support) + if (!crash_hotplug_support) goto out; if (hp_action == KEXEC_CRASH_HP_ADD_CPU || diff --git a/kernel/kexec.c b/kernel/kexec.c index a6b3f96bb50c..d5c6b51eaa8b 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -116,6 +116,9 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, /* Uninstall image */ kimage_free(xchg(dest_image, NULL)); ret = 0; +#ifdef CONFIG_CRASH_HOTPLUG + crash_hotplug_support = 0; +#endif goto out_unlock; } if (flags & KEXEC_ON_CRASH) { @@ -136,7 +139,7 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, #ifdef CONFIG_CRASH_HOTPLUG if ((flags & KEXEC_ON_CRASH) && arch_crash_hotplug_support(image, flags)) - image->hotplug_support = 1; + crash_hotplug_support = 1; #endif ret = machine_kexec_prepare(image); diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index 3d64290d24c9..b326edb90fd7 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -378,7 +378,7 @@ SYSCALL_DEFINE
Re: [PATCH RFC 2/6] mm: PGTABLE_HAS_P[MU]D_LEAVES config options
Le 22/08/2024 à 21:16, Peter Xu a écrit : > On Thu, Aug 22, 2024 at 05:22:03PM +, LEROY Christophe wrote: >> >> >> Le 18/07/2024 à 00:02, Peter Xu a écrit : >>> Introduce two more sub-options for PGTABLE_HAS_HUGE_LEAVES: >>> >>> - PGTABLE_HAS_PMD_LEAVES: set when there can be PMD mappings >>> - PGTABLE_HAS_PUD_LEAVES: set when there can be PUD mappings >>> >>> It will help to identify whether the current build may only want PMD >>> helpers but not PUD ones, as these sub-options will also check against the >>> arch support over HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD]. >>> >>> Note that having them depend on HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD] is >>> still some intermediate step. The best way is to have an option say >>> "whether arch XXX supports PMD/PUD mappings" and so on. However let's >>> leave that for later as that's the easy part. So far, we use these options >>> to stably detect per-arch huge mapping support. >>> >>> Signed-off-by: Peter Xu >>> --- >>>include/linux/huge_mm.h | 10 +++--- >>>mm/Kconfig | 6 ++ >>>2 files changed, 13 insertions(+), 3 deletions(-) >>> >>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h >>> index 711632df7edf..37482c8445d1 100644 >>> --- a/include/linux/huge_mm.h >>> +++ b/include/linux/huge_mm.h >>> @@ -96,14 +96,18 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr; >>>#define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \ >>> (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order))) >>> >>> -#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES >>> -#define HPAGE_PMD_SHIFT PMD_SHIFT >>> +#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES >>>#define HPAGE_PUD_SHIFT PUD_SHIFT >>>#else >>> -#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) >>>#define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; }) >>>#endif >>> >>> +#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES >>> +#define HPAGE_PMD_SHIFT PMD_SHIFT >>> +#else >>> +#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) >>> +#endif >>> + >>>#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) >>>#define HPAGE_PMD_NR (1<>>#define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1)) >>> diff --git a/mm/Kconfig b/mm/Kconfig >>> index 60796402850e..2dbdc088dee8 100644 >>> --- a/mm/Kconfig >>> +++ b/mm/Kconfig >>> @@ -860,6 +860,12 @@ endif # TRANSPARENT_HUGEPAGE >>>config PGTABLE_HAS_HUGE_LEAVES >>> def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE >>> >>> +config PGTABLE_HAS_PMD_LEAVES >>> + def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE && PGTABLE_HAS_HUGE_LEAVES >>> + >>> +config PGTABLE_HAS_PUD_LEAVES >>> + def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD && PGTABLE_HAS_HUGE_LEAVES >>> + >> >> What if an architecture has hugepages at PMD and/or PUD level and >> doesn't support THP ? > > What's the arch to be discussed here? It is LOONGARCH and MIPS, they provide pud_leaf() that can return true even when they have no PUD. > > The whole purpose of this series so far is trying to make some pmd/pud > helpers that only defined with CONFIG_THP=on to be available even if not. > It means this series alone (or any future plan) shouldn't affect any arch > that has CONFIG_THP=off always. > > But logically I think we should need some config option just to say "this > arch supports pmd mappings" indeed, even if CONFIG_THP=off. When that's > there, we should perhaps add that option into this equation so > PGTABLE_HAS_*_LEAVES will also be selected in that case. > Why is an option needed for that ? 
If pmd_leaf() always returns false, it means the arch doesn't support pmd mappings, and if properly used all related code should fold away without a config option, shouldn't it?
Re: [PATCH RFC 2/6] mm: PGTABLE_HAS_P[MU]D_LEAVES config options
On Thu, Aug 22, 2024 at 05:22:03PM +, LEROY Christophe wrote: > > > Le 18/07/2024 à 00:02, Peter Xu a écrit : > > Introduce two more sub-options for PGTABLE_HAS_HUGE_LEAVES: > > > >- PGTABLE_HAS_PMD_LEAVES: set when there can be PMD mappings > >- PGTABLE_HAS_PUD_LEAVES: set when there can be PUD mappings > > > > It will help to identify whether the current build may only want PMD > > helpers but not PUD ones, as these sub-options will also check against the > > arch support over HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD]. > > > > Note that having them depend on HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD] is > > still some intermediate step. The best way is to have an option say > > "whether arch XXX supports PMD/PUD mappings" and so on. However let's > > leave that for later as that's the easy part. So far, we use these options > > to stably detect per-arch huge mapping support. > > > > Signed-off-by: Peter Xu > > --- > > include/linux/huge_mm.h | 10 +++--- > > mm/Kconfig | 6 ++ > > 2 files changed, 13 insertions(+), 3 deletions(-) > > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > > index 711632df7edf..37482c8445d1 100644 > > --- a/include/linux/huge_mm.h > > +++ b/include/linux/huge_mm.h > > @@ -96,14 +96,18 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr; > > #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \ > > (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order))) > > > > -#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES > > -#define HPAGE_PMD_SHIFT PMD_SHIFT > > +#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES > > #define HPAGE_PUD_SHIFT PUD_SHIFT > > #else > > -#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) > > #define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; }) > > #endif > > > > +#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES > > +#define HPAGE_PMD_SHIFT PMD_SHIFT > > +#else > > +#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) > > +#endif > > + > > #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) > > #define HPAGE_PMD_NR (1< > #define HPAGE_PMD_MASK(~(HPAGE_PMD_SIZE - 1)) > > diff --git a/mm/Kconfig b/mm/Kconfig > > index 60796402850e..2dbdc088dee8 100644 > > --- a/mm/Kconfig > > +++ b/mm/Kconfig > > @@ -860,6 +860,12 @@ endif # TRANSPARENT_HUGEPAGE > > config PGTABLE_HAS_HUGE_LEAVES > > def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE > > > > +config PGTABLE_HAS_PMD_LEAVES > > + def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE && PGTABLE_HAS_HUGE_LEAVES > > + > > +config PGTABLE_HAS_PUD_LEAVES > > + def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD && PGTABLE_HAS_HUGE_LEAVES > > + > > What if an architecture has hugepages at PMD and/or PUD level and > doesn't support THP ? What's the arch to be discussed here? The whole purpose of this series so far is trying to make some pmd/pud helpers that only defined with CONFIG_THP=on to be available even if not. It means this series alone (or any future plan) shouldn't affect any arch that has CONFIG_THP=off always. But logically I think we should need some config option just to say "this arch supports pmd mappings" indeed, even if CONFIG_THP=off. When that's there, we should perhaps add that option into this equation so PGTABLE_HAS_*_LEAVES will also be selected in that case. -- Peter Xu
Re: [PATCH RFC 2/6] mm: PGTABLE_HAS_P[MU]D_LEAVES config options
Le 18/07/2024 à 00:02, Peter Xu a écrit : > Introduce two more sub-options for PGTABLE_HAS_HUGE_LEAVES: > >- PGTABLE_HAS_PMD_LEAVES: set when there can be PMD mappings >- PGTABLE_HAS_PUD_LEAVES: set when there can be PUD mappings > > It will help to identify whether the current build may only want PMD > helpers but not PUD ones, as these sub-options will also check against the > arch support over HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD]. > > Note that having them depend on HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD] is > still some intermediate step. The best way is to have an option say > "whether arch XXX supports PMD/PUD mappings" and so on. However let's > leave that for later as that's the easy part. So far, we use these options > to stably detect per-arch huge mapping support. > > Signed-off-by: Peter Xu > --- > include/linux/huge_mm.h | 10 +++--- > mm/Kconfig | 6 ++ > 2 files changed, 13 insertions(+), 3 deletions(-) > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index 711632df7edf..37482c8445d1 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -96,14 +96,18 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr; > #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \ > (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order))) > > -#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES > -#define HPAGE_PMD_SHIFT PMD_SHIFT > +#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES > #define HPAGE_PUD_SHIFT PUD_SHIFT > #else > -#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) > #define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; }) > #endif > > +#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES > +#define HPAGE_PMD_SHIFT PMD_SHIFT > +#else > +#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) > +#endif > + > #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) > #define HPAGE_PMD_NR (1< #define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1)) > diff --git a/mm/Kconfig b/mm/Kconfig > index 60796402850e..2dbdc088dee8 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -860,6 +860,12 @@ endif # TRANSPARENT_HUGEPAGE > config PGTABLE_HAS_HUGE_LEAVES > def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE > > +config PGTABLE_HAS_PMD_LEAVES > + def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE && PGTABLE_HAS_HUGE_LEAVES > + > +config PGTABLE_HAS_PUD_LEAVES > + def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD && PGTABLE_HAS_HUGE_LEAVES > + What if an architecture has hugepages at PMD and/or PUD level and doesn't support THP ? Christophe > # > # UP and nommu archs use km based percpu allocator > #
Re: [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings
Le 23/07/2024 à 23:04, Peter Xu a écrit :
>>
>>> Normally I don't see this as much of a "code churn" category, because it
>>> doesn't change the code itself but only moves things. I personally also
>>> prefer to avoid code churn, but only in the case where there'll be tiny
>>> little functional changes here and there without real benefit.
>>>
>>> It's pretty unavoidable to me when one file grows too large and we'll need
>>> to split, and in this case git doesn't have a good way to track such
>>> movement..
>>
>> Yes, that's what I mean.
>>
>> I've been recently thinking if we should pursue a different direction:
>>
>> Just as we recently relocated most follow_huge_* stuff into gup.c, likely we
>> should rather look into moving copy_huge_pmd, change_huge_pmd, copy_huge_pmd
>> ... into the files where they logically belong.
>>
>> In madvise.c, we've been doing that in some places already: For
>> madvise_cold_or_pageout_pte_range() we inline the code, but not for
>> madvise_free_huge_pmd().
>>
>> pmd_trans_huge() would already compile to a NOP without
>> CONFIG_TRANSPARENT_HUGEPAGE, but to make that code avoid most
>> CONFIG_TRANSPARENT_HUGEPAGE, we'd need a couple more function stubs to make
>> the compiler happy while still being able to compile that code out when not
>> required.
>
> Right, I had a patch that does exactly that, where it's called pmd_is_leaf(),
> for example, but taking CONFIG_* into account.
>
> I remember I had some issue with that, e.g. I used to see that pmd_trans_huge()
> (when !THP) can optimize some path but pmd_is_leaf() didn't do the same job
> even if all configs were off. But that's another story and I didn't yet
> dig deeper. Could be something small but overlooked.

When I prepared series https://patchwork.kernel.org/project/linux-mm/list/?series=871008 , I detected that some architectures define some pXd_leaf() for levels they don't support; that's the reason why Andrew had to drop v2 at the last minute. And that's maybe the reason why some of the code you expect to get folded-off remains.

Since then I sent v3 that fixes that. Don't know if Andrew is planning to take it.

Christophe
Re: [PATCH RFC v2 2/5] of: get dma area lower limit
On Thu, Jul 25, 2024 at 02:49:01PM +0300, Baruch Siach wrote: > Hi Catalin, > > On Tue, Jun 18 2024, Catalin Marinas wrote: > > On Tue, Apr 09, 2024 at 09:17:55AM +0300, Baruch Siach wrote: > >> of_dma_get_max_cpu_address() returns the highest CPU address that > >> devices can use for DMA. The implicit assumption is that all CPU > >> addresses below that limit are suitable for DMA. However the > >> 'dma-ranges' property this code uses also encodes a lower limit for DMA > >> that is potentially non zero. > >> > >> Rename to of_dma_get_cpu_limits(), and extend to retrieve also the lower > >> limit for the same 'dma-ranges' property describing the high limit. > > > > I don't understand the reason for the lower limit. The way the Linux > > zones work is that ZONE_DMA always starts from the start of the RAM. It > > doesn't matter whether it's 0 or not, you'd not allocate below the start > > of RAM anyway. If you have a device that cannot use the bottom of the > > RAM, it is pretty broken and not supported by Linux. > > I won't argue with that assertion. My target system RAM happens to start > at the lower end of the devices' DMA zone, so I'm fine with skipping > this patch. > > Just curious. What is the inherent limitation that prevents Linux from > supporting a DMA zone with a lower limit above the RAM start? It's the way the zone allocation fallback mechanism works. Say a ZONE_DMA32 allocation fails: it falls back to ZONE_DMA, and the memory it returns is still supposed to be compatible with the GFP_DMA32 request. If you have some other zone below ZONE_DMA, it should also be compatible with GFP_DMA allocations. -- Catalin
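A condensed sketch of that fallback contract, modeled on the dma_direct_optimal_gfp_mask() logic quoted later in this thread; the wrapper name is illustrative only, not a kernel API:

#include <linux/dma-direct.h>
#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/* Illustrative only: pick GFP zone bits for a device's physical limit. */
static gfp_t zone_gfp_for_limit(u64 phys_limit)
{
	/*
	 * A failed GFP_DMA32 allocation falls back to ZONE_DMA, so every
	 * ZONE_DMA page must also satisfy a GFP_DMA32 request -- which
	 * only holds if ZONE_DMA starts at the bottom of RAM.
	 */
	if (phys_limit <= zone_dma_limit)
		return GFP_DMA;
	if (phys_limit <= DMA_BIT_MASK(32))
		return GFP_DMA32;
	return 0;	/* any zone will do */
}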
Re: [PATCH RFC v2 2/5] of: get dma area lower limit
Hi Catalin, On Tue, Jun 18 2024, Catalin Marinas wrote: > On Tue, Apr 09, 2024 at 09:17:55AM +0300, Baruch Siach wrote: >> of_dma_get_max_cpu_address() returns the highest CPU address that >> devices can use for DMA. The implicit assumption is that all CPU >> addresses below that limit are suitable for DMA. However the >> 'dma-ranges' property this code uses also encodes a lower limit for DMA >> that is potentially non zero. >> >> Rename to of_dma_get_cpu_limits(), and extend to retrieve also the lower >> limit for the same 'dma-ranges' property describing the high limit. > > I don't understand the reason for the lower limit. The way the Linux > zones work is that ZONE_DMA always starts from the start of the RAM. It > doesn't matter whether it's 0 or not, you'd not allocate below the start > of RAM anyway. If you have a device that cannot use the bottom of the > RAM, it is pretty broken and not supported by Linux. I won't argue with that assertion. My target system RAM happens to start at the lower end of the devices' DMA zone, so I'm fine with skipping this patch. Just curious. What is the inherent limitation that prevents Linux from supporting a DMA zone with a lower limit above the RAM start? Thanks, baruch -- ~. .~ Tk Open Systems =}ooO--U--Ooo{= - bar...@tkos.co.il - tel: +972.52.368.4656, http://www.tkos.co.il -
Re: [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings
On 23.07.24 23:04, Peter Xu wrote: On Tue, Jul 23, 2024 at 10:18:37AM +0200, David Hildenbrand wrote: On 22.07.24 17:31, Peter Xu wrote: On Mon, Jul 22, 2024 at 03:29:43PM +0200, David Hildenbrand wrote: On 18.07.24 00:02, Peter Xu wrote: This is an RFC series, so not yet for merging. Please don't be scared by the code changes: most of them are code movements only. This series is based on the dax mprotect fix series here (while that one is based on mm-unstable): [PATCH v3 0/8] mm/mprotect: Fix dax puds https://lore.kernel.org/r/20240715192142.3241557-1-pet...@redhat.com Overview This series doesn't provide any feature change. The only goal of this series is to start decoupling two ideas: "THP" and "huge mapping". We already started with having PGTABLE_HAS_HUGE_LEAVES config option, and this one extends that idea into the code. The issue is that we have so many functions that only compile with CONFIG_THP=on, even though they're about huge mappings, and huge mapping is a pretty common concept, which can apply to many things besides THPs nowadays. The major THP file is mm/huge_memory.c as of now. The first example of such huge mapping users will be hugetlb. We lived until now with no problem simply because Linux almost duplicated all the logics there in the "THP" files into hugetlb APIs. If we want to get rid of hugetlb specific APIs and paths, this _might_ be the first thing we want to do, because we want to be able to e.g., zapping a hugetlb pmd entry even if !CONFIG_THP. Then consider other things like dax / pfnmaps. Dax can depend on THP, then it'll naturally be able to use pmd/pud helpers, that's okay. However is it a must? Do we also want to have every new pmd/pud mappings in the future to depend on THP (like PFNMAP)? My answer is no, but I'm open to opinions. If anyone agrees with me that "huge mapping" (aka, PMD/PUD mappings that are larger than PAGE_SIZE) is a more generic concept than THP, then I think at some point we need to move the generic code out of THP code into a common code base. This is what this series does as a start. Hi Peter! From a quick glimpse, patch #1-#4 do make sense independent of patch #5. I am not so sure about all of the code movement in patch #5. If large folios are the future, then likely huge_memory.c should simply be the home for all that logic. Maybe the goal should better be to compile huge_memory.c not only for THP, but also for other use cases that require that logic, and fence off all THP specific stuff using #ifdef? Not sure, though. But a lot of this code movements/churn might be avoidable. I'm fine using ifdefs in the current file, but IMHO it's a matter of whether we want to keep huge_memory.c growing into even larger file, and keep all large folio logics only in that file. Currently it's ~4000 LOCs. Depends on "how much" for sure. huge_memory.c is currently on place 12 of the biggest files in mm/. So there might not be immediate cause for action ... just yet :) [guess which file is on #2 :) ]

7821, hugetlb.c
7602, vmscan.c
7275, slub.c
7072, page_alloc.c
6673, memory.c
5402, memcontrol.c
5239, shmem.c
5155, vmalloc.c
4419, filemap.c
4060, mmap.c
3882, huge_memory.c

IMHO a split is normally better than keeping everything in one file, but yeah I'd confess THP file isn't that bad compared to others.. And I'm definitely surprised it's even out of top ten. It's always interesting looking at the numbers here.
For v6.10 we had:

8521 mm/memcontrol.c
7813 mm/hugetlb.c
7550 mm/vmscan.c
7266 mm/slub.c
7018 mm/page_alloc.c
6468 mm/memory.c
5154 mm/vmalloc.c
5002 mm/shmem.c
4419 mm/filemap.c
4019 mm/mmap.c
3954 mm/ksm.c
3740 mm/swapfile.c
3730 mm/huge_memory.c
3689 mm/gup.c
3542 mm/mempolicy.c

I suspect memcontrol.c shrunk because of the v1 split-off, leaving hugetlb.c now at #1 :) -- Cheers, David / dhildenb
Re: [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings
On Tue, Jul 23, 2024 at 10:18:37AM +0200, David Hildenbrand wrote: > On 22.07.24 17:31, Peter Xu wrote: > > On Mon, Jul 22, 2024 at 03:29:43PM +0200, David Hildenbrand wrote: > > > On 18.07.24 00:02, Peter Xu wrote: > > > > This is an RFC series, so not yet for merging. Please don't be scared > > > > by > > > > the code changes: most of them are code movements only. > > > > > > > > This series is based on the dax mprotect fix series here (while that > > > > one is > > > > based on mm-unstable): > > > > > > > > [PATCH v3 0/8] mm/mprotect: Fix dax puds > > > > https://lore.kernel.org/r/20240715192142.3241557-1-pet...@redhat.com > > > > > > > > Overview > > > > > > > > > > > > This series doesn't provide any feature change. The only goal of this > > > > series is to start decoupling two ideas: "THP" and "huge mapping". We > > > > already started with having PGTABLE_HAS_HUGE_LEAVES config option, and > > > > this > > > > one extends that idea into the code. > > > > > > > > The issue is that we have so many functions that only compile with > > > > CONFIG_THP=on, even though they're about huge mappings, and huge > > > > mapping is > > > > a pretty common concept, which can apply to many things besides THPs > > > > nowadays. The major THP file is mm/huge_memory.c as of now. > > > > > > > > The first example of such huge mapping users will be hugetlb. We lived > > > > until now with no problem simply because Linux almost duplicated all the > > > > logics there in the "THP" files into hugetlb APIs. If we want to get > > > > rid > > > > of hugetlb specific APIs and paths, this _might_ be the first thing we > > > > want > > > > to do, because we want to be able to e.g., zapping a hugetlb pmd entry > > > > even > > > > if !CONFIG_THP. > > > > > > > > Then consider other things like dax / pfnmaps. Dax can depend on THP, > > > > then > > > > it'll naturally be able to use pmd/pud helpers, that's okay. However > > > > is it > > > > a must? Do we also want to have every new pmd/pud mappings in the > > > > future > > > > to depend on THP (like PFNMAP)? My answer is no, but I'm open to > > > > opinions. > > > > > > > > If anyone agrees with me that "huge mapping" (aka, PMD/PUD mappings that > > > > are larger than PAGE_SIZE) is a more generic concept than THP, then I > > > > think > > > > at some point we need to move the generic code out of THP code into a > > > > common code base. > > > > > > > > This is what this series does as a start. > > > > > > Hi Peter! > > > > > > From a quick glimpse, patch #1-#4 do make sense independent of patch #5. > > > > > > I am not so sure about all of the code movement in patch #5. If large > > > folios > > > are the future, then likely huge_memory.c should simply be the home for > > > all > > > that logic. > > > > > > Maybe the goal should better be to compile huge_memory.c not only for THP, > > > but also for other use cases that require that logic, and fence off all > > > THP > > > specific stuff using #ifdef? > > > > > > Not sure, though. But a lot of this code movements/churn might be > > > avoidable. > > > > I'm fine using ifdefs in the current fine, but IMHO it's a matter of > > whether we want to keep huge_memory.c growing into even larger file, and > > keep all large folio logics only in that file. Currently it's ~4000 LOCs. > > Depends on "how much" for sure. huge_memory.c is currently on place 12 of > the biggest files in mm/. So there might not be immediate cause for action > ... 
just yet :) [guess which file is on #2 :) ]

7821, hugetlb.c
7602, vmscan.c
7275, slub.c
7072, page_alloc.c
6673, memory.c
5402, memcontrol.c
5239, shmem.c
5155, vmalloc.c
4419, filemap.c
4060, mmap.c
3882, huge_memory.c

IMHO a split is normally better than keeping everything in one file, but yeah I'd confess THP file isn't that bad compared to others.. And I'm definitely surprised it's even out of top ten. > > > > > Normally I don't see this as much of a "code churn" category, because it > > doesn't change the code itself but only moves things. I personally also > > prefer without code churns, but only in the case where there'll be tiny > > little functional changes here and there without real benefit. > > > > It's pretty unavoidable to me when one file grows too large and we'll need > > to split, and in this case git doesn't have a good way to track such > > movement.. > > Yes, that's what I mean. > > I've been recently thinking if we should pursue a different direction: > > Just as we recently relocated most follow_huge_* stuff into gup.c, likely we > should rather look into moving copy_huge_pmd, change_huge_pmd, copy_huge_pmd > ... into the files where they logically belong. > > In madvise.c, we've been doing that in some places already: For > madvise_cold_or_pageout_pte_range() we inline the code, but not for > madvise_free_huge_pmd(). > > pmd_trans_huge() would already compile to a NOP without > CONFIG_TRANSPARENT_HUGEPAGE, but to make that code avoid most > CONFIG_TRANSPARENT_HUGEPAGE, we'd need a couple more function stubs to make > the compiler happy while still being able to compile that code out when not > required.
Re: [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings
On 22.07.24 17:31, Peter Xu wrote: On Mon, Jul 22, 2024 at 03:29:43PM +0200, David Hildenbrand wrote: On 18.07.24 00:02, Peter Xu wrote: This is an RFC series, so not yet for merging. Please don't be scared by the code changes: most of them are code movements only. This series is based on the dax mprotect fix series here (while that one is based on mm-unstable): [PATCH v3 0/8] mm/mprotect: Fix dax puds https://lore.kernel.org/r/20240715192142.3241557-1-pet...@redhat.com Overview This series doesn't provide any feature change. The only goal of this series is to start decoupling two ideas: "THP" and "huge mapping". We already started with having PGTABLE_HAS_HUGE_LEAVES config option, and this one extends that idea into the code. The issue is that we have so many functions that only compile with CONFIG_THP=on, even though they're about huge mappings, and huge mapping is a pretty common concept, which can apply to many things besides THPs nowadays. The major THP file is mm/huge_memory.c as of now. The first example of such huge mapping users will be hugetlb. We lived until now with no problem simply because Linux almost duplicated all the logics there in the "THP" files into hugetlb APIs. If we want to get rid of hugetlb specific APIs and paths, this _might_ be the first thing we want to do, because we want to be able to e.g., zapping a hugetlb pmd entry even if !CONFIG_THP. Then consider other things like dax / pfnmaps. Dax can depend on THP, then it'll naturally be able to use pmd/pud helpers, that's okay. However is it a must? Do we also want to have every new pmd/pud mappings in the future to depend on THP (like PFNMAP)? My answer is no, but I'm open to opinions. If anyone agrees with me that "huge mapping" (aka, PMD/PUD mappings that are larger than PAGE_SIZE) is a more generic concept than THP, then I think at some point we need to move the generic code out of THP code into a common code base. This is what this series does as a start. Hi Peter! From a quick glimpse, patch #1-#4 do make sense independent of patch #5. I am not so sure about all of the code movement in patch #5. If large folios are the future, then likely huge_memory.c should simply be the home for all that logic. Maybe the goal should better be to compile huge_memory.c not only for THP, but also for other use cases that require that logic, and fence off all THP specific stuff using #ifdef? Not sure, though. But a lot of this code movements/churn might be avoidable. I'm fine using ifdefs in the current fine, but IMHO it's a matter of whether we want to keep huge_memory.c growing into even larger file, and keep all large folio logics only in that file. Currently it's ~4000 LOCs. Depends on "how much" for sure. huge_memory.c is currently on place 12 of the biggest files in mm/. So there might not be immediate cause for action ... just yet :) [guess which file is on #2 :) ] Nornally I don't see this as much of a "code churn" category, because it doesn't changes the code itself but only move things. I personally also prefer without code churns, but only in the case where there'll be tiny little functional changes here and there without real benefit. It's pretty unavoidable to me when one file grows too large and we'll need to split, and in this case git doesn't have a good way to track such movement.. Yes, that's what I mean. 
I've been recently thinking if we should pursue a different direction: Just as we recently relocated most follow_huge_* stuff into gup.c, likely we should rather look into moving copy_huge_pmd, change_huge_pmd, copy_huge_pmd ... into the files where they logically belong. In madvise.c, we've been doing that in some places already: For madvise_cold_or_pageout_pte_range() we inline the code, but not for madvise_free_huge_pmd(). pmd_trans_huge() would already compile to a NOP without CONFIG_TRANSPARENT_HUGEPAGE, but to make that code avoid most CONFIG_TRANSPARENT_HUGEPAGE, we'd need a couple more function stubs to make the compiler happy while still being able to compile that code out when not required. The idea would be that e.g., pmd_leaf() would return "false" at compile time if no active configuration (THP, HUGETLB, ...) would be active. So we could just use pmd_leaf() similar to pmd_trans_huge() in relevant code and have the compiler optimize it all out without putting it into separate files. That means, large folios and PMD/PUD mappings will become "more common" and better integrated, without the need to jump between files. Just some thought about an alternative that would make sense to me. Irrelevant of this, just to mention I think there's still one option that I at least can make the huge pfnmap depend on THP again which shouldn't be a huge deal (I don't have any use case that needs huge pfnmap but disables THP, anyway..), so this series isn't an immediate concern to me for that route. But for a hugetlb rework this might be something we need to do, because we simply can't make CONFIG_HUGETLB rely on CONFIG_THP..
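A sketch of the stub pattern David is describing, using madvise_free_huge_pmd() as the example he names; the signature follows mm/internal.h but should be treated as illustrative:

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
int madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
			  pmd_t *pmd, unsigned long addr, unsigned long next);
#else
/*
 * Stub so that callers compile without THP. The call site is guarded
 * by pmd_trans_huge()/pmd_leaf(), which folds to constant false, so
 * both this stub and the caller's branch are eliminated.
 */
static inline int madvise_free_huge_pmd(struct mmu_gather *tlb,
					struct vm_area_struct *vma, pmd_t *pmd,
					unsigned long addr, unsigned long next)
{
	return 0;
}
#endif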
Re: [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings
On Mon, Jul 22, 2024 at 03:29:43PM +0200, David Hildenbrand wrote: > On 18.07.24 00:02, Peter Xu wrote: > > This is an RFC series, so not yet for merging. Please don't be scared by > > the code changes: most of them are code movements only. > > > > This series is based on the dax mprotect fix series here (while that one is > > based on mm-unstable): > > > >[PATCH v3 0/8] mm/mprotect: Fix dax puds > >https://lore.kernel.org/r/20240715192142.3241557-1-pet...@redhat.com > > > > Overview > > > > > > This series doesn't provide any feature change. The only goal of this > > series is to start decoupling two ideas: "THP" and "huge mapping". We > > already started with having PGTABLE_HAS_HUGE_LEAVES config option, and this > > one extends that idea into the code. > > > > The issue is that we have so many functions that only compile with > > CONFIG_THP=on, even though they're about huge mappings, and huge mapping is > > a pretty common concept, which can apply to many things besides THPs > > nowadays. The major THP file is mm/huge_memory.c as of now. > > > > The first example of such huge mapping users will be hugetlb. We lived > > until now with no problem simply because Linux almost duplicated all the > > logics there in the "THP" files into hugetlb APIs. If we want to get rid > > of hugetlb specific APIs and paths, this _might_ be the first thing we want > > to do, because we want to be able to e.g., zapping a hugetlb pmd entry even > > if !CONFIG_THP. > > > > Then consider other things like dax / pfnmaps. Dax can depend on THP, then > > it'll naturally be able to use pmd/pud helpers, that's okay. However is it > > a must? Do we also want to have every new pmd/pud mappings in the future > > to depend on THP (like PFNMAP)? My answer is no, but I'm open to opinions. > > > > If anyone agrees with me that "huge mapping" (aka, PMD/PUD mappings that > > are larger than PAGE_SIZE) is a more generic concept than THP, then I think > > at some point we need to move the generic code out of THP code into a > > common code base. > > > > This is what this series does as a start. > > Hi Peter! > > From a quick glimpse, patch #1-#4 do make sense independent of patch #5. > > I am not so sure about all of the code movement in patch #5. If large folios > are the future, then likely huge_memory.c should simply be the home for all > that logic. > > Maybe the goal should better be to compile huge_memory.c not only for THP, > but also for other use cases that require that logic, and fence off all THP > specific stuff using #ifdef? > > Not sure, though. But a lot of this code movements/churn might be avoidable. I'm fine using ifdefs in the current file, but IMHO it's a matter of whether we want to keep huge_memory.c growing into even larger file, and keep all large folio logics only in that file. Currently it's ~4000 LOCs. Normally I don't see this as much of a "code churn" category, because it doesn't change the code itself but only moves things. I personally also prefer without code churns, but only in the case where there'll be tiny little functional changes here and there without real benefit. It's pretty unavoidable to me when one file grows too large and we'll need to split, and in this case git doesn't have a good way to track such movement..
Irrelevant of this, just to mention I think there's still one option that I at least can make the huge pfnmap depend on THP again, which shouldn't be a huge deal (I don't have any use case that needs huge pfnmap but disables THP, anyway..), so this series isn't an immediate concern to me for that route. But for a hugetlb rework this might be something we need to do, because we simply can't make CONFIG_HUGETLB rely on CONFIG_THP.. Thanks, -- Peter Xu
Re: [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings
On 18.07.24 00:02, Peter Xu wrote: This is an RFC series, so not yet for merging. Please don't be scared by the code changes: most of them are code movements only. This series is based on the dax mprotect fix series here (while that one is based on mm-unstable): [PATCH v3 0/8] mm/mprotect: Fix dax puds https://lore.kernel.org/r/20240715192142.3241557-1-pet...@redhat.com Overview This series doesn't provide any feature change. The only goal of this series is to start decoupling two ideas: "THP" and "huge mapping". We already started with having PGTABLE_HAS_HUGE_LEAVES config option, and this one extends that idea into the code. The issue is that we have so many functions that only compile with CONFIG_THP=on, even though they're about huge mappings, and huge mapping is a pretty common concept, which can apply to many things besides THPs nowadays. The major THP file is mm/huge_memory.c as of now. The first example of such huge mapping users will be hugetlb. We lived until now with no problem simply because Linux almost duplicated all the logics there in the "THP" files into hugetlb APIs. If we want to get rid of hugetlb specific APIs and paths, this _might_ be the first thing we want to do, because we want to be able to e.g., zapping a hugetlb pmd entry even if !CONFIG_THP. Then consider other things like dax / pfnmaps. Dax can depend on THP, then it'll naturally be able to use pmd/pud helpers, that's okay. However is it a must? Do we also want to have every new pmd/pud mappings in the future to depend on THP (like PFNMAP)? My answer is no, but I'm open to opinions. If anyone agrees with me that "huge mapping" (aka, PMD/PUD mappings that are larger than PAGE_SIZE) is a more generic concept than THP, then I think at some point we need to move the generic code out of THP code into a common code base. This is what this series does as a start. Hi Peter! From a quick glimpse, patch #1-#4 do make sense independent of patch #5. I am not so sure about all of the code movement in patch #5. If large folios are the future, then likely huge_memory.c should simply be the home for all that logic. Maybe the goal should better be to compile huge_memory.c not only for THP, but also for other use cases that require that logic, and fence off all THP specific stuff using #ifdef? Not sure, though. But a lot of this code movements/churn might be avoidable. -- Cheers, David / dhildenb
[PATCH RFC 5/6] mm/huge_mapping: Create huge_mapping_pxx.c
At some point, we need to decouple "huge mapping" with THP, for any non-THP huge mappings in the future (hugetlb, pfnmap, etc..). This is the first step towards it. Or say, we already started to do this when PGTABLE_HAS_HUGE_LEAVES option was introduced: that is the first thing Linux start to describe LEAVEs rather than THPs when it is about huge mappings. Before that, mostly any huge mapping will have THP involved, like devmap. Hugetlb is special only because we duplicated the whole world there, but we also have a demand to decouple that now. Linux used to have huge_memory.c which only compiles with THP enabled, I wished it was called thp.c from the start. In reality, it contains more than processing THP: any huge mapping (even if not falling into THP category) will be able to leverage many of these helpers, but unfortunately this file is not compiled if !THP. These helpers are normally only about the pgtable operations, which may not be directly relevant to what type of huge folio (e.g. THP) underneath, or perhaps even if there's no vmemmap to back it. It's better we move them out of THP world. Create a new set of files huge_mapping_p[mu]d.c. This patch starts to move quite a few essential helpers from huge_memory.c into these new files, so that they'll start to work and compile rely on PGTABLE_HAS_PXX_LEAVES rather than THP. Split them into two files by nature so that e.g. archs that only supports PMD huge mapping can avoid compiling the whole -pud file, with the hope to reduce the size of object compiled and linked. No functional change intended, but only code movement. Said that, there will be some "ifdef" machinery changes to pass all kinds of compilations. Cc: Jason Gunthorpe Cc: Matthew Wilcox Cc: Oscar Salvador Signed-off-by: Peter Xu --- include/linux/huge_mm.h | 318 +--- include/linux/pgtable.h | 23 +- include/trace/events/huge_mapping.h | 41 + include/trace/events/thp.h | 28 - mm/Makefile |2 + mm/huge_mapping_pmd.c | 979 +++ mm/huge_mapping_pud.c | 235 ++ mm/huge_memory.c| 1125 +-- 8 files changed, 1472 insertions(+), 1279 deletions(-) create mode 100644 include/trace/events/huge_mapping.h create mode 100644 mm/huge_mapping_pmd.c create mode 100644 mm/huge_mapping_pud.c diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index d8b642ad512d..aea2784df8ef 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -8,43 +8,214 @@ #include /* only for vma_is_dax() */ #include +#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES +void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud); void touch_pud(struct vm_area_struct *vma, unsigned long addr, pud_t *pud, bool write); -void touch_pmd(struct vm_area_struct *vma, unsigned long addr, - pmd_t *pmd, bool write); -pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma); -vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf); -int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, - struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma); -void huge_pmd_set_accessed(struct vm_fault *vmf); int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm, pud_t *dst_pud, pud_t *src_pud, unsigned long addr, struct vm_area_struct *vma); +int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud, +unsigned long addr); +int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, + pud_t *pudp, unsigned long addr, pgprot_t newprot, + unsigned long cp_flags); +void __split_huge_pud(struct 
vm_area_struct *vma, pud_t *pud, + unsigned long address); +spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma); -#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD -void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud); -#else -static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud) +static inline spinlock_t * +pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma) { + if (pud_trans_huge(*pud) || pud_devmap(*pud)) + return __pud_trans_huge_lock(pud, vma); + else + return NULL; } -#endif +#define split_huge_pud(__vma, __pud, __address) \ + do {\ + pud_t *pud = (__pud); \ + if (pud_trans_huge(*pud) || pud_devmap(*pud)) \ + __split_huge_pud(__vma, __pud, __address); \ + } while (0) +#else /* CONFIG_PGTABLE_HAS_PUD_LEAVES */ +static inline void +huge_pud_set_accessed(
[PATCH RFC 6/6] mm: Convert "*_trans_huge() || *_devmap()" to use *_leaf()
This patch converted all such checks into one *_leaf() check under common mm/, as "thp+devmap" should compose everything for a *_leaf() for now. I didn't yet touch arch code in other directories, as some arch may need some special attention, so I left those separately. It should start to save some cycles on such check and pave way for the new leaf types. E.g., when a new type of leaf is introduced, it'll naturally go the same route to what we have now for thp+devmap. Here one issue with pxx_leaf() API is that such API will be defined by arch but it doesn't consider kernel config. For example, below "if" branch cannot be automatically optimized: if (pmd_leaf()) { ... } Even if both THP && HUGETLB are not enabled (which means pmd_leaf() can never return true). To provide a chance for compilers to optimize and omit code when possible, introduce a light wrapper for them and call them pxx_is_leaf(). That will take kernel config into account and properly allow omitting branches when the compiler knows it'll constantly returns false. This tries to mimic what we used to have with pxx_trans_huge() when !THP, so it now also applies to pxx_leaf() API. Cc: Alistair Popple Cc: Dan Williams Cc: Jason Gunthorpe Signed-off-by: Peter Xu --- include/linux/huge_mm.h| 6 +++--- include/linux/pgtable.h| 30 +- mm/hmm.c | 4 ++-- mm/huge_mapping_pmd.c | 9 +++-- mm/huge_mapping_pud.c | 6 +++--- mm/mapping_dirty_helpers.c | 4 ++-- mm/memory.c| 14 ++ mm/migrate_device.c| 2 +- mm/mprotect.c | 4 ++-- mm/mremap.c| 5 ++--- mm/page_vma_mapped.c | 5 ++--- mm/pgtable-generic.c | 7 +++ 12 files changed, 58 insertions(+), 38 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index aea2784df8ef..a5b026d0731e 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -27,7 +27,7 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma); static inline spinlock_t * pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma) { - if (pud_trans_huge(*pud) || pud_devmap(*pud)) + if (pud_is_leaf(*pud)) return __pud_trans_huge_lock(pud, vma); else return NULL; @@ -36,7 +36,7 @@ pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma) #define split_huge_pud(__vma, __pud, __address) \ do {\ pud_t *pud = (__pud); \ - if (pud_trans_huge(*pud) || pud_devmap(*pud)) \ + if (pud_is_leaf(*pud)) \ __split_huge_pud(__vma, __pud, __address); \ } while (0) #else /* CONFIG_PGTABLE_HAS_PUD_LEAVES */ @@ -125,7 +125,7 @@ static inline int is_swap_pmd(pmd_t pmd) static inline spinlock_t * pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma) { - if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) + if (is_swap_pmd(*pmd) || pmd_is_leaf(*pmd)) return __pmd_trans_huge_lock(pmd, vma); else return NULL; diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 5e505373b113..af7709a132aa 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1641,7 +1641,7 @@ static inline int pud_trans_unstable(pud_t *pud) defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) pud_t pudval = READ_ONCE(*pud); - if (pud_none(pudval) || pud_trans_huge(pudval) || pud_devmap(pudval)) + if (pud_none(pudval) || pud_leaf(pudval)) return 1; if (unlikely(pud_bad(pudval))) { pud_clear_bad(pud); @@ -1901,6 +1901,34 @@ typedef unsigned int pgtbl_mod_mask; #define pmd_leaf(x)false #endif +/* + * Wrapper of pxx_leaf() helpers. 
+ * + * Comparing to pxx_leaf() API, the only difference is: using these macros + * can help code generation, so unnecessary code can be omitted when the + * specific level of leaf is not possible due to kernel config. It is + * needed because normally pxx_leaf() can be defined in arch code without + * knowing the kernel config. + * + * Currently we only need pmd/pud versions, because the largest leaf Linux + * supports so far is pud. + * + * Defining here also means that in arch's pgtable headers these macros + * cannot be used, pxx_leaf()s need to be used instead, because this file + * will not be included in arch's pgtable headers. + */ +#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES +#define pmd_is_leaf(x) pmd_leaf(x) +#else +#define pmd_is_leaf(x) false +#endif + +#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES +#define pud_is_leaf(x) pud_leaf(x) +#else +#define pud_is_leaf(x) false +#endif + #ifndef pgd_leaf_size #define pgd_leaf_size(x) (1ULL << PGDIR_SHIFT) #e
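To make the intended code generation concrete, a hypothetical caller of the pxx_is_leaf() wrappers introduced above; both handle_* helpers are assumptions, not kernel APIs:

/* Hypothetical generic walker, for illustration only. */
static int walk_one_pmd(struct vm_area_struct *vma, pmd_t *pmd,
			unsigned long addr)
{
	if (pmd_is_leaf(*pmd)) {
		/*
		 * With CONFIG_PGTABLE_HAS_PMD_LEAVES=n, pmd_is_leaf()
		 * expands to constant false, so this block and any
		 * helpers it references are compiled out -- mirroring
		 * what pmd_trans_huge() achieved under !THP.
		 */
		return handle_pmd_leaf(vma, pmd, addr);
	}
	return handle_pte_table(vma, pmd, addr);
}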
[PATCH RFC 4/6] mm: Move huge mapping declarations from internal.h to huge_mm.h
Most of the huge mapping relevant helpers are declared in huge_mm.h, not internal.h. Move the only few from internal.h into huge_mm.h. Here to move pmd_needs_soft_dirty_wp() over, we'll also need to move vma_soft_dirty_enabled() into mm.h as it'll be needed in two headers later (internal.h, huge_mm.h). Signed-off-by: Peter Xu --- include/linux/huge_mm.h | 10 ++ include/linux/mm.h | 18 ++ mm/internal.h | 33 - 3 files changed, 28 insertions(+), 33 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 37482c8445d1..d8b642ad512d 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -8,6 +8,11 @@ #include /* only for vma_is_dax() */ #include +void touch_pud(struct vm_area_struct *vma, unsigned long addr, + pud_t *pud, bool write); +void touch_pmd(struct vm_area_struct *vma, unsigned long addr, + pmd_t *pmd, bool write); +pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma); vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf); int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, @@ -629,4 +634,9 @@ static inline int split_folio_to_order(struct folio *folio, int new_order) #define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0) #define split_folio(f) split_folio_to_order(f, 0) +static inline bool pmd_needs_soft_dirty_wp(struct vm_area_struct *vma, pmd_t pmd) +{ + return vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd); +} + #endif /* _LINUX_HUGE_MM_H */ diff --git a/include/linux/mm.h b/include/linux/mm.h index 5f1075d19600..fa10802d8faa 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1117,6 +1117,24 @@ static inline unsigned int folio_order(struct folio *folio) return folio->_flags_1 & 0xff; } +static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma) +{ + /* +* NOTE: we must check this before VM_SOFTDIRTY on soft-dirty +* enablements, because when without soft-dirty being compiled in, +* VM_SOFTDIRTY is defined as 0x0, then !(vm_flags & VM_SOFTDIRTY) +* will be constantly true. +*/ + if (!IS_ENABLED(CONFIG_MEM_SOFT_DIRTY)) + return false; + + /* +* Soft-dirty is kind of special: its tracking is enabled when the +* vma flags not set. +*/ + return !(vma->vm_flags & VM_SOFTDIRTY); +} + #include /* diff --git a/mm/internal.h b/mm/internal.h index b4d86436565b..e49941747749 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -917,8 +917,6 @@ bool need_mlock_drain(int cpu); void mlock_drain_local(void); void mlock_drain_remote(int cpu); -extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma); - /** * vma_address - Find the virtual address a page range is mapped at * @vma: The vma which maps this object. @@ -1229,14 +1227,6 @@ int migrate_device_coherent_page(struct page *page); int __must_check try_grab_folio(struct folio *folio, int refs, unsigned int flags); -/* - * mm/huge_memory.c - */ -void touch_pud(struct vm_area_struct *vma, unsigned long addr, - pud_t *pud, bool write); -void touch_pmd(struct vm_area_struct *vma, unsigned long addr, - pmd_t *pmd, bool write); - /* * mm/mmap.c */ @@ -1342,29 +1332,6 @@ static __always_inline void vma_set_range(struct vm_area_struct *vma, vma->vm_pgoff = pgoff; } -static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma) -{ - /* -* NOTE: we must check this before VM_SOFTDIRTY on soft-dirty -* enablements, because when without soft-dirty being compiled in, -* VM_SOFTDIRTY is defined as 0x0, then !(vm_flags & VM_SOFTDIRTY) -* will be constantly true. 
-*/ - if (!IS_ENABLED(CONFIG_MEM_SOFT_DIRTY)) - return false; - - /* -* Soft-dirty is kind of special: its tracking is enabled when the -* vma flags not set. -*/ - return !(vma->vm_flags & VM_SOFTDIRTY); -} - -static inline bool pmd_needs_soft_dirty_wp(struct vm_area_struct *vma, pmd_t pmd) -{ - return vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd); -} - static inline bool pte_needs_soft_dirty_wp(struct vm_area_struct *vma, pte_t pte) { return vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte); -- 2.45.0
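The NOTE inside vma_soft_dirty_enabled() is worth spelling out. A sketch of the trap it avoids; the "naive" helper is a deliberately wrong counter-example, not kernel code:

/*
 * With CONFIG_MEM_SOFT_DIRTY=n the kernel defines VM_SOFTDIRTY as 0,
 * so a flags-only test degenerates to !(vm_flags & 0), i.e. true for
 * every VMA: soft-dirty would look enabled everywhere.
 */
static inline bool naive_soft_dirty_enabled(struct vm_area_struct *vma)
{
	return !(vma->vm_flags & VM_SOFTDIRTY);	/* always true when !SOFT_DIRTY */
}

/* Hence the IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) check must come first. */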
[PATCH RFC 3/6] mm/treewide: Make pgtable-generic.c THP agnostic
Make pmd/pud helpers to rely on the new PGTABLE_HAS_*_LEAVES option, rather than THP alone, as THP is only one form of huge mapping. Signed-off-by: Peter Xu --- arch/arm64/include/asm/pgtable.h | 6 ++-- arch/powerpc/include/asm/book3s/64/pgtable.h | 2 +- arch/powerpc/mm/book3s64/pgtable.c | 2 +- arch/riscv/include/asm/pgtable.h | 4 +-- arch/s390/include/asm/pgtable.h | 2 +- arch/s390/mm/pgtable.c | 4 +-- arch/sparc/mm/tlb.c | 2 +- arch/x86/mm/pgtable.c| 15 - include/linux/mm_types.h | 2 +- include/linux/pgtable.h | 4 +-- mm/memory.c | 2 +- mm/pgtable-generic.c | 32 ++-- 12 files changed, 40 insertions(+), 37 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 5d5d1b18b837..b93c03256ada 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -1105,7 +1105,7 @@ extern int __ptep_set_access_flags(struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t entry, int dirty); -#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES #define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS static inline int pmdp_set_access_flags(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp, @@ -1114,7 +1114,9 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma, return __ptep_set_access_flags(vma, address, (pte_t *)pmdp, pmd_pte(entry), dirty); } +#endif +#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES static inline int pud_devmap(pud_t pud) { return 0; @@ -1178,7 +1180,7 @@ static inline int __ptep_clear_flush_young(struct vm_area_struct *vma, return young; } -#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES #define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, unsigned long address, diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 051b1b6d729c..84cf55e18334 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -1119,7 +1119,7 @@ static inline bool pmd_access_permitted(pmd_t pmd, bool write) return pte_access_permitted(pmd_pte(pmd), write); } -#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot); extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot); extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot); diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c index 5a4a75369043..d6a5457627df 100644 --- a/arch/powerpc/mm/book3s64/pgtable.c +++ b/arch/powerpc/mm/book3s64/pgtable.c @@ -37,7 +37,7 @@ EXPORT_SYMBOL(__pmd_frag_nr); unsigned long __pmd_frag_size_shift; EXPORT_SYMBOL(__pmd_frag_size_shift); -#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES /* * This is called when relaxing access to a hugepage. 
It's also called in the page * fault path when we don't hit any of the major fault cases, ie, a minor diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h index ebfe8faafb79..8c28f15f601b 100644 --- a/arch/riscv/include/asm/pgtable.h +++ b/arch/riscv/include/asm/pgtable.h @@ -752,7 +752,7 @@ static inline bool pud_user_accessible_page(pud_t pud) } #endif -#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES static inline int pmd_trans_huge(pmd_t pmd) { return pmd_leaf(pmd); @@ -802,7 +802,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma, #define pmdp_collapse_flush pmdp_collapse_flush extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp); -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#endif /* CONFIG_PGTABLE_HAS_PMD_LEAVES */ /* * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index fb6870384b97..398bbed20dee 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -1710,7 +1710,7 @@ pmd_t pmdp_xchg_direct(struct mm_struct *, unsigned long, pmd_t *, pmd_t); pmd_t pmdp_xchg_lazy(struct mm_struct *, unsigned long, pmd_t *, pmd_t); pud_t pudp_xchg_direct(struct mm_struct *, unsigned long, pud_t *, pud_t); -#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES #define __HAVE_ARCH_PGTABLE_DEPOSIT void pgtable_trans_huge_deposit(struct mm_st
[PATCH RFC 1/6] mm/treewide: Remove pgd_devmap()
It's always 0 for all archs, and there's no sign to even support p4d entry in the near future. Remove it until it's needed for real. Signed-off-by: Peter Xu --- arch/arm64/include/asm/pgtable.h | 5 - arch/powerpc/include/asm/book3s/64/pgtable.h | 5 - arch/x86/include/asm/pgtable.h | 5 - include/linux/pgtable.h | 4 mm/gup.c | 2 -- 5 files changed, 21 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index f8efbc128446..5d5d1b18b837 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -1119,11 +1119,6 @@ static inline int pud_devmap(pud_t pud) { return 0; } - -static inline int pgd_devmap(pgd_t pgd) -{ - return 0; -} #endif #ifdef CONFIG_PAGE_TABLE_CHECK diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 5da92ba68a45..051b1b6d729c 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -1431,11 +1431,6 @@ static inline int pud_devmap(pud_t pud) { return pte_devmap(pud_pte(pud)); } - -static inline int pgd_devmap(pgd_t pgd) -{ - return 0; -} #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 701593c53f3b..0d234f48ceeb 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -311,11 +311,6 @@ static inline int pud_devmap(pud_t pud) return 0; } #endif - -static inline int pgd_devmap(pgd_t pgd) -{ - return 0; -} #endif #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 2289e9f7aa1b..0a904300ac90 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1626,10 +1626,6 @@ static inline int pud_devmap(pud_t pud) { return 0; } -static inline int pgd_devmap(pgd_t pgd) -{ - return 0; -} #endif #if !defined(CONFIG_TRANSPARENT_HUGEPAGE) || \ diff --git a/mm/gup.c b/mm/gup.c index 54d0dc3831fb..b023bcd38235 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -3149,8 +3149,6 @@ static int gup_fast_pgd_leaf(pgd_t orig, pgd_t *pgdp, unsigned long addr, if (!pgd_access_permitted(orig, flags & FOLL_WRITE)) return 0; - BUILD_BUG_ON(pgd_devmap(orig)); - page = pgd_page(orig); refs = record_subpages(page, PGDIR_SIZE, addr, end, pages + *nr); -- 2.45.0
[PATCH RFC 2/6] mm: PGTABLE_HAS_P[MU]D_LEAVES config options
Introduce two more sub-options for PGTABLE_HAS_HUGE_LEAVES: - PGTABLE_HAS_PMD_LEAVES: set when there can be PMD mappings - PGTABLE_HAS_PUD_LEAVES: set when there can be PUD mappings It will help to identify whether the current build may only want PMD helpers but not PUD ones, as these sub-options will also check against the arch support over HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD]. Note that having them depend on HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD] is still some intermediate step. The best way is to have an option say "whether arch XXX supports PMD/PUD mappings" and so on. However let's leave that for later as that's the easy part. So far, we use these options to stably detect per-arch huge mapping support. Signed-off-by: Peter Xu --- include/linux/huge_mm.h | 10 +++--- mm/Kconfig | 6 ++ 2 files changed, 13 insertions(+), 3 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 711632df7edf..37482c8445d1 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -96,14 +96,18 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr; #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \ (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order))) -#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES -#define HPAGE_PMD_SHIFT PMD_SHIFT +#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES #define HPAGE_PUD_SHIFT PUD_SHIFT #else -#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) #define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; }) #endif +#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES +#define HPAGE_PMD_SHIFT PMD_SHIFT +#else +#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) +#endif + #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER) #define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1)) diff --git a/mm/Kconfig b/mm/Kconfig index 60796402850e..2dbdc088dee8 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -860,6 +860,12 @@ endif # TRANSPARENT_HUGEPAGE config PGTABLE_HAS_HUGE_LEAVES def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE +config PGTABLE_HAS_PMD_LEAVES + def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE && PGTABLE_HAS_HUGE_LEAVES + +config PGTABLE_HAS_PUD_LEAVES + def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD && PGTABLE_HAS_HUGE_LEAVES + # # UP and nommu archs use km based percpu allocator # -- 2.45.0
[PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings
This is an RFC series, so not yet for merging. Please don't be scared by the code changes: most of them are code movements only. This series is based on the dax mprotect fix series here (while that one is based on mm-unstable): [PATCH v3 0/8] mm/mprotect: Fix dax puds https://lore.kernel.org/r/20240715192142.3241557-1-pet...@redhat.com Overview This series doesn't provide any feature change. The only goal of this series is to start decoupling two ideas: "THP" and "huge mapping". We already started with having PGTABLE_HAS_HUGE_LEAVES config option, and this one extends that idea into the code. The issue is that we have so many functions that only compile with CONFIG_THP=on, even though they're about huge mappings, and huge mapping is a pretty common concept, which can apply to many things besides THPs nowadays. The major THP file is mm/huge_memory.c as of now. The first example of such huge mapping users will be hugetlb. We lived until now with no problem simply because Linux almost duplicated all the logics there in the "THP" files into hugetlb APIs. If we want to get rid of hugetlb specific APIs and paths, this _might_ be the first thing we want to do, because we want to be able to e.g., zapping a hugetlb pmd entry even if !CONFIG_THP. Then consider other things like dax / pfnmaps. Dax can depend on THP, then it'll naturally be able to use pmd/pud helpers, that's okay. However is it a must? Do we also want to have every new pmd/pud mappings in the future to depend on THP (like PFNMAP)? My answer is no, but I'm open to opinions. If anyone agrees with me that "huge mapping" (aka, PMD/PUD mappings that are larger than PAGE_SIZE) is a more generic concept than THP, then I think at some point we need to move the generic code out of THP code into a common code base. This is what this series does as a start. In general, this series tries to move many THP things (mostly residing in huge_memory.c right now) into two new files: huge_mapping_{pmd|pud}.c. When I move them out, I also put them separately into different files for different layers. Then if an arch supports e.g. only PMD, it can avoid compiling the PUD helpers, with things like: CONFIG_PGTABLE_HAS_PUD_LEAVES=n obj-$(CONFIG_PGTABLE_HAS_PUD_LEAVES) += huge_mapping_pud.o Note that there are a few tree-wide changes in arch/, but not a lot; to avoid disturbing too many people, I only copied the open lists of each arch, not yet the arch maintainers. Tests = My normal 19-arch cross-compilation tests pass with it, and I smoke tested on x86_64 with a local config of mine. Comments welcome, thanks.
Peter Xu (6): mm/treewide: Remove pgd_devmap() mm: PGTABLE_HAS_P[MU]D_LEAVES config options mm/treewide: Make pgtable-generic.c THP agnostic mm: Move huge mapping declarations from internal.h to huge_mm.h mm/huge_mapping: Create huge_mapping_pxx.c mm: Convert "*_trans_huge() || *_devmap()" to use *_leaf() arch/arm64/include/asm/pgtable.h | 11 +- arch/powerpc/include/asm/book3s/64/pgtable.h |7 +- arch/powerpc/mm/book3s64/pgtable.c |2 +- arch/riscv/include/asm/pgtable.h |4 +- arch/s390/include/asm/pgtable.h |2 +- arch/s390/mm/pgtable.c |4 +- arch/sparc/mm/tlb.c |2 +- arch/x86/include/asm/pgtable.h |5 - arch/x86/mm/pgtable.c| 15 +- include/linux/huge_mm.h | 332 -- include/linux/mm.h | 18 + include/linux/mm_types.h |2 +- include/linux/pgtable.h | 61 +- include/trace/events/huge_mapping.h | 41 + include/trace/events/thp.h | 28 - mm/Kconfig |6 + mm/Makefile |2 + mm/gup.c |2 - mm/hmm.c |4 +- mm/huge_mapping_pmd.c| 976 +++ mm/huge_mapping_pud.c| 235 mm/huge_memory.c | 1125 +- mm/internal.h| 33 - mm/mapping_dirty_helpers.c |4 +- mm/memory.c | 16 +- mm/migrate_device.c |2 +- mm/mprotect.c|4 +- mm/mremap.c |5 +- mm/page_vma_mapped.c |5 +- mm/pgtable-generic.c | 37 +- 30 files changed, 1595 insertions(+), 1395 deletions(-) create mode 100644 include/trace/events/huge_mapping.h create mode 100644 mm/huge_mapping_pmd.c create mode 100644 mm/huge_mapping_pud.c -- 2.45.0
Re: [PATCH RFC v2 4/5] dma-direct: add base offset to zone_dma_bits
On Tue, Apr 09, 2024 at 09:17:57AM +0300, Baruch Siach wrote: > Current code using zone_dma_bits assumes that all addresses within the > bit mask range are suitable for DMA. For some existing platforms this > assumption is not correct. The DMA range might have a non-zero lower limit. [...] > @@ -59,7 +60,7 @@ static gfp_t dma_direct_optimal_gfp_mask(struct device > *dev, u64 *phys_limit) >* zones. >*/ > *phys_limit = dma_to_phys(dev, dma_limit); > - if (*phys_limit <= zone_dma_limit) > + if (*phys_limit <= zone_dma_base + zone_dma_limit) > return GFP_DMA; > if (*phys_limit <= DMA_BIT_MASK(32)) > return GFP_DMA32; As I said previously, we no longer have zone_dma_bits after the first patch, so adding this limit no longer makes sense. In v1, you wanted a limit like 32G to be added to the 30-bit zone_dma_bits to give you a 33G upper limit for ZONE_DMA. But since the first patch sets zone_dma_limit to 33G already, this is no longer needed. -- Catalin
Re: [PATCH RFC v2 2/5] of: get dma area lower limit
On Tue, Apr 09, 2024 at 09:17:55AM +0300, Baruch Siach wrote: > of_dma_get_max_cpu_address() returns the highest CPU address that > devices can use for DMA. The implicit assumption is that all CPU > addresses below that limit are suitable for DMA. However the > 'dma-ranges' property this code uses also encodes a lower limit for DMA > that is potentially non zero. > > Rename to of_dma_get_cpu_limits(), and extend to retrieve also the lower > limit for the same 'dma-ranges' property describing the high limit. I don't understand the reason for the lower limit. The way the Linux zones work is that ZONE_DMA always starts from the start of the RAM. It doesn't matter whether it's 0 or not, you'd not allocate below the start of RAM anyway. If you have a device that cannot use the bottom of the RAM, it is pretty broken and not supported by Linux. I think you added this limit before we tried to move away from zone_dma_bits to a non-power-of-two limit (zone_dma_limit). With the latter, we no longer need tricks with the lower limit, of_dma_get_max_cpu_address() should capture the smallest upper CPU address limit supported by all devices (and that's where ZONE_DMA should end). -- Catalin
Re: [PATCH RFC v2 1/5] dma-mapping: replace zone_dma_bits by zone_dma_limit
(finally getting around to looking at this series, sorry for the delay) On Tue, Apr 09, 2024 at 09:17:54AM +0300, Baruch Siach wrote: > From: Catalin Marinas > > Hardware DMA limit might not be power of 2. When RAM range starts above > 0, say 4GB, DMA limit of 30 bits should end at 5GB. A single high bit > can not encode this limit. > > Use direct phys_addr_t limit address for DMA zone limit. > > Following commits will add explicit base address to DMA zone. > > --- > Catalin, > > This is taken almost verbatim from your email: > > https://lore.kernel.org/all/zz2hnhjv3gdzu...@arm.com/ > > Would you provide your sign-off? Signed-off-by: Catalin Marinas Thanks for writing a commit log. However, I think more work is needed. See below. > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c > index 03efd86dce0a..00508c69ca9e 100644 > --- a/arch/arm64/mm/init.c > +++ b/arch/arm64/mm/init.c > @@ -113,36 +113,24 @@ static void __init arch_reserve_crashkernel(void) > low_size, high); > } > > -/* > - * Return the maximum physical address for a zone accessible by the given > bits > - * limit. If DRAM starts above 32-bit, expand the zone to the maximum > - * available memory, otherwise cap it at 32-bit. > - */ > -static phys_addr_t __init max_zone_phys(unsigned int zone_bits) > +static phys_addr_t __init max_zone_phys(phys_addr_t zone_limit) > { > - phys_addr_t zone_mask = DMA_BIT_MASK(zone_bits); > - phys_addr_t phys_start = memblock_start_of_DRAM(); > - > - if (phys_start > U32_MAX) > - zone_mask = PHYS_ADDR_MAX; > - else if (phys_start > zone_mask) > - zone_mask = U32_MAX; > - > - return min(zone_mask, memblock_end_of_DRAM() - 1) + 1; > + return min(zone_limit, memblock_end_of_DRAM() - 1) + 1; > } > > static void __init zone_sizes_init(void) > { > unsigned long max_zone_pfns[MAX_NR_ZONES] = {0}; > - unsigned int __maybe_unused acpi_zone_dma_bits; > - unsigned int __maybe_unused dt_zone_dma_bits; > - phys_addr_t __maybe_unused dma32_phys_limit = max_zone_phys(32); > + phys_addr_t __maybe_unused acpi_zone_dma_limit; > + phys_addr_t __maybe_unused dt_zone_dma_limit; > + phys_addr_t __maybe_unused dma32_phys_limit = > + max_zone_phys(DMA_BIT_MASK(32)); > > #ifdef CONFIG_ZONE_DMA > - acpi_zone_dma_bits = fls64(acpi_iort_dma_get_max_cpu_address()); > - dt_zone_dma_bits = fls64(of_dma_get_max_cpu_address(NULL)); > - zone_dma_bits = min3(32U, dt_zone_dma_bits, acpi_zone_dma_bits); > - arm64_dma_phys_limit = max_zone_phys(zone_dma_bits); > + acpi_zone_dma_limit = acpi_iort_dma_get_max_cpu_address(); > + dt_zone_dma_limit = of_dma_get_max_cpu_address(NULL); > + zone_dma_limit = min(dt_zone_dma_limit, acpi_zone_dma_limit); > + arm64_dma_phys_limit = max_zone_phys(zone_dma_limit); > max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit); > #endif > #ifdef CONFIG_ZONE_DMA32 I think this goes wrong if zone_dma_limit ends up above 32-bit (e.g. no restrictive dma-ranges properties) but the start of RAM is below 4G. We'd simply reduce ZONE_DMA32 to zero and ZONE_DMA potentially covering the whole RAM. Prior to this change, we capped zone_dma_bits to 32 via min3(). I think we should maintain this cap if memblock_start_of_DRAM() is below 4G. 
We could fix this up in max_zone_phys() above: if (memblock_start_of_DRAM() < U32_MAX) zone_limit = min(U32_MAX, zone_limit); return min(zone_limit, memblock_end_of_DRAM() - 1) + 1; > diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c > index 4d543b1e9d57..3b2ebcd4f576 100644 > --- a/kernel/dma/direct.c > +++ b/kernel/dma/direct.c > @@ -20,7 +20,7 @@ > * it for entirely different regions. In that case the arch code needs to > * override the variable below for dma-direct to work properly. > */ > -unsigned int zone_dma_bits __ro_after_init = 24; > +phys_addr_t zone_dma_limit __ro_after_init = DMA_BIT_MASK(24); > > static inline dma_addr_t phys_to_dma_direct(struct device *dev, > phys_addr_t phys) > @@ -59,7 +59,7 @@ static gfp_t dma_direct_optimal_gfp_mask(struct device > *dev, u64 *phys_limit) >* zones. >*/ > *phys_limit = dma_to_phys(dev, dma_limit); > - if (*phys_limit <= DMA_BIT_MASK(zone_dma_bits)) > + if (*phys_limit <= zone_dma_limit) > return GFP_DMA; > if (*phys_limit <= DMA_BIT_MASK(32)) > return GFP_DMA32; It's worth noting that if ZONE_DMA ends up entirely above 32-bit, there won't be any ZONE_DMA32. Thinking about it, this could be a potential problem. For example, if a device has a 32-bit DMA mask and an offset that lifts this into the 32-36G range, the above may fail to set GFP_DMA32. Actually, I think these checks can go wrong even with the current implementation, assuming RAM below 4G and no DMA offsets. For example, we have two devices, one with a coherent ma
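Spelled out as a whole function, Catalin's suggested cap might look like the sketch below; min_t() is added here only to keep the phys_addr_t/u32 types consistent, otherwise it follows the snippet quoted above:

static phys_addr_t __init max_zone_phys(phys_addr_t zone_limit)
{
	/*
	 * Keep ZONE_DMA below 4G when RAM itself starts below 4G, so a
	 * large zone_dma_limit cannot squeeze ZONE_DMA32 down to zero.
	 */
	if (memblock_start_of_DRAM() < U32_MAX)
		zone_limit = min_t(phys_addr_t, zone_limit, U32_MAX);

	return min(zone_limit, memblock_end_of_DRAM() - 1) + 1;
}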
Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
On Tue, Jun 04, 2024 at 01:44:15PM +0200, Alexandre Ghiti wrote: > On Tue, Jun 4, 2024 at 10:52 AM Conor Dooley wrote: > > > > On Tue, Jun 04, 2024 at 09:17:26AM +0200, Alexandre Ghiti wrote: > > > On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti > > > wrote: > > > > On Tue, Jun 4, 2024 at 8:21 AM yunhui cui > > > > wrote: > > > > > > > > > > As for the current status of the patch, there are two points that can > > > > > be optimized: > > > > > 1. Some chip hardware implementations may not cache TLB invalid > > > > > entries, so it doesn't matter whether svvptc is available or not. Can > > > > > we consider adding a CONFIG_RISCV_SVVPTC to control it? > > > > > > That would produce a non-portable kernel. But I'm not opposed to that > > > at all, let me check how we handle other extensions. Maybe @Conor > > > Dooley has some feedback here? > > > > To be honest, not really sure what to give feedback on. Could you > > elaborate on exactly what the option is going to do? Given the > > portability concern, I guess you were proposing that the option would > > remove the preventative fences, rather than your current patch that > > removes them via an alternative? > > No no, I won't do that, we need a generic kernel for distros so that's > not even a question. What Yunhui was asking about (to me) is: can we > introduce a Kconfig option to always remove the preventive fences, > bypassing the use of alternatives altogether? > > To me, it won't make a difference in terms of performance. But if we > already offer such a possibility for other extensions, well I'll do > it. Otherwise, the question is: should we start doing that? We don't do that for other extensions yet, because currently all the extensions we have options for are additive. There's like 3 alternative patchsites, and they are all just one nop? I don't see the point of having a Kconfig knob for that.
Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
On Tue, Jun 4, 2024 at 10:52 AM Conor Dooley wrote:
>
> On Tue, Jun 04, 2024 at 09:17:26AM +0200, Alexandre Ghiti wrote:
> > On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti wrote:
> > > On Tue, Jun 4, 2024 at 8:21 AM yunhui cui wrote:
> > > >
> > > > As for the current status of the patch, there are two points that can
> > > > be optimized:
> > > > 1. Some chip hardware implementations may not cache TLB invalid
> > > > entries, so it doesn't matter whether svvptc is available or not. Can
> > > > we consider adding a CONFIG_RISCV_SVVPTC to control it?
> >
> > That would produce a non-portable kernel. But I'm not opposed to that
> > at all, let me check how we handle other extensions. Maybe @Conor
> > Dooley has some feedback here?
>
> To be honest, not really sure what to give feedback on. Could you
> elaborate on exactly what the option is going to do? Given the
> portability concern, I guess you were proposing that the option would
> remove the preventative fences, rather than your current patch that
> removes them via an alternative?

No no, I won't do that, we need a generic kernel for distros so that's
not even a question. What Yunhui was asking about (to me) is: can we
introduce a Kconfig option to always remove the preventive fences,
bypassing the use of alternatives altogether?

To me, it won't make a difference in terms of performance. But if we
already offer such a possibility for other extensions, well I'll do
it. Otherwise, the question is: should we start doing that?

> I don't think we have any extension
> related options that work like that at the moment, and making that an
> option will just mean that distros that look to cater for multiple
> platforms won't be able to turn it on.
>
> Thanks,
> Conor.
Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
On Tue, Jun 04, 2024 at 09:17:26AM +0200, Alexandre Ghiti wrote:
> On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti wrote:
> > On Tue, Jun 4, 2024 at 8:21 AM yunhui cui wrote:
> > >
> > > As for the current status of the patch, there are two points that can
> > > be optimized:
> > > 1. Some chip hardware implementations may not cache TLB invalid
> > > entries, so it doesn't matter whether svvptc is available or not. Can
> > > we consider adding a CONFIG_RISCV_SVVPTC to control it?
>
> That would produce a non-portable kernel. But I'm not opposed to that
> at all, let me check how we handle other extensions. Maybe @Conor
> Dooley has some feedback here?

To be honest, not really sure what to give feedback on. Could you
elaborate on exactly what the option is going to do? Given the
portability concern, I guess you were proposing that the option would
remove the preventative fences, rather than your current patch that
removes them via an alternative? I don't think we have any extension
related options that work like that at the moment, and making that an
option will just mean that distros that look to cater for multiple
platforms won't be able to turn it on.

Thanks,
Conor.
Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
Hi Yunhui,

On Tue, Jun 4, 2024 at 8:21 AM yunhui cui wrote:
>
> Hi Alexandre,
>
> On Mon, Jun 3, 2024 at 8:02 PM Alexandre Ghiti wrote:
> >
> > Hi Yunhui,
> >
> > On Mon, Jun 3, 2024 at 4:26 AM yunhui cui wrote:
> > >
> > > Hi Alexandre,
> > >
> > > On Thu, Feb 1, 2024 at 12:03 AM Alexandre Ghiti wrote:
> > > >
> > > > In 6.5, we removed the vmalloc fault path because that can't work (see
> > > > [1] [2]). Then in order to make sure that new page table entries were
> > > > seen by the page table walker, we had to preventively emit a sfence.vma
> > > > on all harts [3], but this solution is very costly since it relies on
> > > > IPI.
> > > >
> > > > And even there, we could end up in a loop of vmalloc faults if a vmalloc
> > > > allocation is done in the IPI path (for example if it is traced, see
> > > > [4]), which could result in a kernel stack overflow.
> > > >
> > > > Those preventive sfence.vma needed to be emitted because:
> > > >
> > > > - if the uarch caches invalid entries, the new mapping may not be
> > > >   observed by the page table walker and an invalidation may be needed.
> > > > - if the uarch does not cache invalid entries, a reordered access
> > > >   could "miss" the new mapping and trap: in that case, we would actually
> > > >   only need to retry the access, no sfence.vma is required.
> > > >
> > > > So this patch removes those preventive sfence.vma and actually handles
> > > > the possible (and unlikely) exceptions. And since the kernel stacks
> > > > mappings lie in the vmalloc area, this handling must be done very early
> > > > when the trap is taken, at the very beginning of handle_exception: this
> > > > also rules out the vmalloc allocations in the fault path.
> > > >
> > > > Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bj...@kernel.org/ [1]
> > > > Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dy...@andestech.com [2]
> > > > Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexgh...@rivosinc.com/ [3]
> > > > Link: https://lore.kernel.org/lkml/20200508144043.13893-1-j...@8bytes.org/ [4]
> > > > Signed-off-by: Alexandre Ghiti
> > > > ---
> > > >  arch/riscv/include/asm/cacheflush.h  | 18 +-
> > > >  arch/riscv/include/asm/thread_info.h |  5 ++
> > > >  arch/riscv/kernel/asm-offsets.c      |  5 ++
> > > >  arch/riscv/kernel/entry.S            | 84
> > > >  arch/riscv/mm/init.c                 |  2 +
> > > >  5 files changed, 113 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > > > index a129dac4521d..b0d631701757 100644
> > > > --- a/arch/riscv/include/asm/cacheflush.h
> > > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > > @@ -37,7 +37,23 @@ static inline void flush_dcache_page(struct page *page)
> > > >         flush_icache_mm(vma->vm_mm, 0)
> > > >
> > > >  #ifdef CONFIG_64BIT
> > > > -#define flush_cache_vmap(start, end)           flush_tlb_kernel_range(start, end)
> > > > +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > > > +extern char _end[];
> > > > +#define flush_cache_vmap flush_cache_vmap
> > > > +static inline void flush_cache_vmap(unsigned long start, unsigned long end)
> > > > +{
> > > > +       if (is_vmalloc_or_module_addr((void *)start)) {
> > > > +               int i;
> > > > +
> > > > +               /*
> > > > +                * We don't care if concurrently a cpu resets this value since
> > > > +                * the only place this can happen is in handle_exception() where
> > > > +                * an sfence.vma is emitted.
> > > > +                */
> > > > +               for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
> > > > +                       new_vmalloc[i] = -1ULL;
> > > > +       }
> > > > +}
> > > >  #define flush_cache_vmap_early(start, end)     local_flush_tlb_kernel_range(start, end)
> > > >  #endif
> > > >
> > > > diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
> > > > index 5d473343634b..32631acdcdd4 100644
> > > > --- a/arch/riscv/include/asm/thread_info.h
> > > > +++ b/arch/riscv/include/asm/thread_info.h
> > > > @@ -60,6 +60,11 @@ struct thread_info {
> > > >         void                    *scs_base;
> > > >         void                    *scs_sp;
> > > >  #endif
> > > > +       /*
> > > > +        * Used in handle_exception() to save a0, a1 and a2 before knowing if we
> > > > +        * can access the kernel stack.
> > > > +        */
> > > > +       unsigned long a0, a1, a2;
> > > >  };
> > > >
> > > >  #ifdef CONFIG_SHADOW_CALL_STACK
> > > > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > > > index a03129f40c46..939ddc0e3c6e 100644
> > > > --- a/arch/riscv/kernel/asm-offsets.c
> > > > +++ b/ar
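For readers following the quoted patch, here is a rough C rendering of
its mechanism (a sketch only: the real trap-side check runs in assembly
at the top of handle_exception, before the kernel stack is known to be
mapped, and the patch sizes and indexes the real array differently; the
helpers READ_ONCE/WRITE_ONCE, ARRAY_SIZE and BIT_ULL are the usual kernel
ones):

	/* One bit per hart meaning "a new vmalloc mapping appeared since
	 * your last sfence.vma". */
	static u64 new_vmalloc[NR_CPUS / 64 + 1];

	/* Mapping side, cf. flush_cache_vmap(): mark every hart stale.
	 * A racing clear in the trap path is harmless; at worst a hart
	 * emits one extra sfence.vma. */
	static void mark_new_vmalloc(void)
	{
		int i;

		for (i = 0; i < ARRAY_SIZE(new_vmalloc); i++)
			WRITE_ONCE(new_vmalloc[i], -1ULL);
	}

	/* Trap side, cf. the check in handle_exception(): if this hart's
	 * bit is set, clear it, fence, and retry the faulting access
	 * instead of taking a (possibly recursive) vmalloc fault. */
	static bool new_vmalloc_check(int cpu)
	{
		u64 *word = &new_vmalloc[cpu / 64];
		u64 bit = BIT_ULL(cpu % 64);

		if (!(READ_ONCE(*word) & bit))
			return false;

		WRITE_ONCE(*word, *word & ~bit);
		local_flush_tlb_all();		/* sfence.vma */
		return true;			/* sret and retry the access */
	}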