On 18/09/2025 19:31, Edgecombe, Rick P wrote:
> On Thu, 2025-09-18 at 16:15 +0200, Kevin Brodsky wrote:
>> This is where I have to apologise to Rick for not having studied his
>> series more thoroughly, as patch 17 [2] covers this issue very well in
>> the commit message.
>>
>> It seems fair to say there is no ideal or simple solution, though.
>> Rick's patch reserves enough (PTE-mapped) memory for fully splitting the
>> linear map, which is relatively simple but not very pleasant. Chatting
>> with Ryan Roberts, we figured another approach, improving on solution 1
>> mentioned in [2]. It would rely on allocating all PTPs from a special
>> pool (without using set_memory_pkey() in pagetable_*_ctor), along those
>> lines:
> Oh I didn't realize ARM split the direct map now at runtime. IIRC it used to
> just map at 4k if there were any permissions configured.

Until recently the linear map was always PTE-mapped on arm64 if
rodata=full (default) or in other situations (e.g. DEBUG_PAGEALLOC), so
that it never needed to be split at runtime. Since [1b] landed though,
there is support for setting permissions at the block level and
splitting, meaning that the linear map can be block-mapped in most cases
(see force_pte_mapping() in patch 3 for details). This is only enabled
on systems with the BBML2_NOABORT feature though.

[1b]
https://lore.kernel.org/all/[email protected]/

>> 1. 2 pages are reserved at all times (with the appropriate pkey)
>> 2. Try to allocate a 2M block. If needed, use a reserved page as PMD to
>> split a PUD. If successful, set its pkey - the entire block can now be
>> used for PTPs. Replenish the reserve from the block if needed.
>> 3. If no block is available, make an order-2 allocation (4 pages). If
>> needed, use 1-2 reserved pages to split PUD/PMD. Set the pkey of the 4
>> pages, take 1-2 pages to replenish the reserve if needed.
> Oh, good idea!
>
>> This ensures that we never run out of PTPs for splitting. We may get
>> into an OOM situation more easily due to the order-2 requirement, but
>> the risk remains low compared to requiring a 2M block. A bigger concern
>> is concurrency - do we need a per-CPU cache? Reserving a 2M block per
>> CPU could be very much overkill.
>>
>> No matter which solution is used, this clearly increases the complexity
>> of kpkeys_hardened_pgtables. Mike Rapoport has posted a number of RFCs
>> [3][4] that aim at addressing this problem more generally, but no
>> consensus seems to have emerged and I'm not sure they would completely
>> solve this specific problem either.
>>
>> For now, my plan is to stick to solution 3 from [2], i.e. force the
>> linear map to be PTE-mapped. This is easily done on arm64 as it is the
>> default, and is required for rodata=full, unless [1] is applied and the
>> system supports BBML2_NOABORT. See [1] for the potential performance
>> improvements we'd be missing out on (~5% ballpark).
>>
> I continue to be surprised that allocation time pkey conversion is not a
> performance disaster, even with the directmap pre-split.
>
>> I'm not quite sure
>> what the picture looks like on x86 - it may well be more significant as
>> Rick suggested.
> I think having more efficient direct map permissions is a solvable problem,
> but each usage is just a little too small to justify the infrastructure for
> a good solution. And each simple solution is a little too much overhead to
> justify the usage. So there is a long tail of blocked usages:
> - pkeys usages (page tables and secret protection)
> - kernel shadow stacks
> - More efficient executable code allocations (BPF, kprobe trampolines, etc)
>
> Although the BPF folks started doing their own thing for this. But I don't
> think there are any fundamentally unsolvable problems for a generic
> solution. It's a question of a leading killer usage to justify the
> infrastructure. Maybe it will be kernel shadow stack.

It seems to be exactly the situation yes. Given Will's feedback, I'll
try to implement such a dedicated allocator one more time (based on the
scheme I suggested above) and see how it goes. Hopefully that will
create more momentum for a generic infrastructure :)

- Kevin
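
For illustration only, the three-step refill scheme quoted above (steps 1-3) can be modelled in userspace C as plain page accounting. This is a sketch under stated assumptions, not code from the series: struct ptp_pool, ptp_pool_refill() and the constants are hypothetical names, 4K pages are assumed (so a 2M block is 512 pages), and the real work (allocating memory, splitting the linear map, setting the pkey) is reduced to counters.

```c
#include <assert.h>
#include <stdbool.h>

#define RESERVE_PAGES   2    /* step 1: pages always held back for splitting */
#define BLOCK_PAGES     512  /* one 2M block, assuming 4K pages */
#define FALLBACK_PAGES  4    /* one order-2 allocation */

/* Hypothetical pool state; only page counts are tracked in this model. */
struct ptp_pool {
	int reserve;  /* pages set aside to serve as new PMD/PTE tables */
	int free;     /* pkey-tagged pages ready to be handed out as PTPs */
};

/*
 * Model one refill. split_pages is the number of new page tables needed to
 * split the linear map around the freshly allocated memory: at most 1 for a
 * 2M block (a PMD to split a PUD), at most 2 for the order-2 fallback.
 * Returns the number of pages added to pool->free, or -1 if the request is
 * outside what the scheme allows.
 */
static int ptp_pool_refill(struct ptp_pool *pool, bool block_available,
			   int split_pages)
{
	int got = block_available ? BLOCK_PAGES : FALLBACK_PAGES;
	int max_split = block_available ? 1 : 2;
	int refill;

	if (split_pages < 0 || split_pages > max_split)
		return -1;

	/* Invariant: the reserve is full before every refill. */
	assert(pool->reserve == RESERVE_PAGES);
	pool->reserve -= split_pages;

	/* Top the reserve back up from the memory that was just tagged. */
	refill = RESERVE_PAGES - pool->reserve;
	pool->reserve += refill;
	pool->free += got - refill;

	return got - refill;
}
```

Starting from a full reserve, a 2M refill that needs one split yields 511 usable pages, and an order-2 refill that needs two splits yields 2. In both cases the reserve ends up full again, which is the property that guarantees a later split can always be served. The model says nothing about concurrency or per-CPU caching, which the thread leaves as an open question.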
