On Thu, 2025-09-18 at 16:15 +0200, Kevin Brodsky wrote:
> This is where I have to apologise to Rick for not having studied his
> series more thoroughly, as patch 17 [2] covers this issue very well in
> the commit message.
>
> It seems fair to say there is no ideal or simple solution, though.
> Rick's patch reserves enough (PTE-mapped) memory for fully splitting the
> linear map, which is relatively simple but not very pleasant. Chatting
> with Ryan Roberts, we figured another approach, improving on solution 1
> mentioned in [2]. It would rely on allocating all PTPs from a special
> pool (without using set_memory_pkey() in pagetable_*_ctor), along those
> lines:

Oh I didn't realize ARM split the direct map now at runtime. IIRC it used to
just map at 4k if there were any permissions configured.

>
> 1. 2 pages are reserved at all times (with the appropriate pkey)
> 2. Try to allocate a 2M block. If needed, use a reserved page as PMD to
> split a PUD. If successful, set its pkey - the entire block can now be
> used for PTPs. Replenish the reserve from the block if needed.
> 3. If no block is available, make an order-2 allocation (4 pages). If
> needed, use 1-2 reserved pages to split PUD/PMD. Set the pkey of the 4
> pages, take 1-2 pages to replenish the reserve if needed.

Oh, good idea!

>
> This ensures that we never run out of PTPs for splitting. We may get
> into an OOM situation more easily due to the order-2 requirement, but
> the risk remains low compared to requiring a 2M block. A bigger concern
> is concurrency - do we need a per-CPU cache? Reserving a 2M block per
> CPU could be very much overkill.
>
> No matter which solution is used, this clearly increases the complexity
> of kpkeys_hardened_pgtables. Mike Rapoport has posted a number of RFCs
> [3][4] that aim at addressing this problem more generally, but no
> consensus seems to have emerged and I'm not sure they would completely
> solve this specific problem either.
>
> For now, my plan is to stick to solution 3 from [2], i.e. force the
> linear map to be PTE-mapped. This is easily done on arm64 as it is the
> default, and is required for rodata=full, unless [1] is applied and the
> system supports BBML2_NOABORT. See [1] for the potential performance
> improvements we'd be missing out on (~5% ballpark).
>

I continue to be surprised that allocation time pkey conversion is not a
performance disaster, even with the directmap pre-split.

> I'm not quite sure
> what the picture looks like on x86 - it may well be more significant as
> Rick suggested.

I think having more efficient direct map permissions is a solvable problem, but
each usage is just a little too small to justify the infrastructure for a good
solution. And each simple solution is a little too much overhead to justify the
usage. So there is a long tail of blocked usages:
 - pkeys usages (page tables and secret protection)
 - kernel shadow stacks
 - More efficient executable code allocations (BPF, kprobe trampolines, etc)

Although the BPF folks started doing their own thing for this.

But I don't think there are any fundamentally unsolvable problems for a generic
solution. It's a question of a leading killer usage to justify the
infrastructure. Maybe it will be kernel shadow stack.
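
For what it's worth, the page accounting behind the reserve-and-replenish
scheme quoted above (steps 1-3) can be modelled in a few lines of userspace C.
This is only a sketch under my own assumptions, not kernel code: struct
ptp_pool, ptp_pool_refill(), RESERVE_TARGET and the splits_needed parameter are
all made up for illustration, and the real allocation, linear map splitting and
pkey setting are reduced to counters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; none of these names exist in the kernel. */
#define RESERVE_TARGET   2    /* step 1: pages held in reserve at all times */
#define PAGES_PER_BLOCK  512  /* a 2M block of 4K pages */
#define PAGES_PER_ORDER2 4    /* an order-2 allocation */

struct ptp_pool {
	int reserve; /* pkey-tagged pages kept back for splitting */
	int free;    /* pkey-tagged pages usable as page table pages */
};

/*
 * Model one refill of the pool.
 *
 * block_available: true if a 2M block could be allocated (step 2),
 *                  false to fall back to an order-2 allocation (step 3).
 * splits_needed:   reserved pages consumed to split the linear map around
 *                  the new pages (0-1 for a block, 0-2 for order-2).
 *
 * Returns false if the request does not fit the scheme.
 */
bool ptp_pool_refill(struct ptp_pool *pool, bool block_available,
		     int splits_needed)
{
	int got = block_available ? PAGES_PER_BLOCK : PAGES_PER_ORDER2;
	int max_splits = block_available ? 1 : 2;
	int refill;

	if (splits_needed < 0 || splits_needed > max_splits ||
	    splits_needed > pool->reserve)
		return false;

	/* Reserved pages become the PMD/PTE tables used for the split. */
	pool->reserve -= splits_needed;

	/* The new pages now carry the pkey; top up the reserve first. */
	refill = RESERVE_TARGET - pool->reserve;
	pool->reserve += refill;
	pool->free += got - refill;

	/* The reserve is always full again after a successful refill. */
	assert(pool->reserve == RESERVE_TARGET);
	return true;
}
```

In this model the worst case for step 3 (both reserved pages used for
splitting) puts the reserve back at 2 and leaves 2 of the 4 pages usable as
PTPs, while a 2M refill that uses one reserved page leaves 511. It says nothing
about the concurrency question raised above.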
