On Mon, 2025-08-18 at 18:02 +0200, Kevin Brodsky wrote:
> The benchmarking results (see cover letter) don't seem to point to a
> major performance hit from setting the pkey on arm64 (worth noting that
> the linear mapping is PTE-mapped on arm64 today so no splitting should
> occur when setting the pkey). The overhead may well be substantially
> higher on x86.

It's surprising to me. The batching seems to be about switching the pkey, not
the conversion of the direct map. And with batching you measured a fork
benchmark actually sped up a tiny bit. Shouldn't it involve a pile of page table
allocations and so extra direct map work? I don't know if it's possible the mock
implementation skipped some set_memory() work somehow?

> 
> I agree this is worth looking into, though. I will check the overhead
> added by set_memory_pkey() specifically (ignoring pkey register
> switches), and maybe try to allocate page tables with a dedicated
> kmem_cache instead, reusing this patch [1] from my other kpkeys series.
> A kmem_cache won't be as optimal as a dedicated allocator, but batching
> the page freeing may already improve things substantially.

I actually never got to the benchmark on real HW stage either, but I'd be
surprised if this approach would have acceptable performance for x86. There are
so many optimizations around minimizing TLB flushes in Linux. Dunno. Maybe my
arm knowledge is too lacking.
