On Fri, Jan 02, 2026, Fred Griffoul wrote:
> From: Fred Griffoul <[email protected]>
>
> Add infrastructure to persist nested virtualization state when L2 vCPUs
> are switched on an L1 vCPU or migrated between L1 vCPUs.

Please be more transparent about what exactly is being persisted.

> The nested context table uses a hash table for fast lookup by nested
> control block GPA (VMPTR for VMX, VMCB for SVM) and maintains a free
> list for context management.
>
> The kvm_nested_context_load() function searches for a context indexed
> by the target GPA; if not found, it allocates a new context up to the
> configured maximum.  If at capacity, it recycles the oldest context
> from the free list.
>
> The oversubscription is hardcoded to support up to 8 L2 vCPUs per L1
> vCPU.
>
> The kvm_nested_context_clear() function moves the context to the free
> list while keeping it in the hash table for potential reuse.
>
> This allows nested hypervisors to multiplex multiple L2 vCPUs on L1
> vCPUs without losing cached nested state, significantly improving
> performance for workloads with frequent L2 context switches.
>
> This patch adds the basic infrastructure.  Subsequent patches will add
> the nested VMX and SVM specific support to populate and utilize the
> cached nested state.
>
> Signed-off-by: Fred Griffoul <[email protected]>
> ---
>  arch/x86/include/asm/kvm_host.h |  31 +++++
>  arch/x86/include/uapi/asm/kvm.h |   2 +
>  arch/x86/kvm/Makefile           |   2 +-
>  arch/x86/kvm/nested.c           | 199 ++++++++++++++++++++++++++++++++
>  arch/x86/kvm/x86.c              |   5 +-
>  5 files changed, 237 insertions(+), 2 deletions(-)

Please provide concrete performance numbers.  They need to be isolated
from the switch to gpcs, and they need to show how much benefit a
per-VM hash table provides over (much) simpler approaches, e.g. versus
a stupid simple per-vCPU LRU cache, a la KVM's pgd caching.  There also
needs to be an analysis of the downsides that come with the performance
gains.
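To make the comparison concrete, below is a rough, untested userspace
sketch of the sort of per-vCPU LRU cache I have in mind, loosely
modeled on KVM's prev_roots pgd caching.  All names (nested_ctx_load()
and friends) are made up for illustration and are not the API from this
patch:

```c
/*
 * Hypothetical sketch: a fixed-size, per-vCPU LRU cache of nested
 * contexts keyed by control-block GPA.  No hash table, no free list,
 * no per-VM sharing or locking to reason about.
 */
#include <assert.h>
#include <string.h>

#define NESTED_CTX_CACHE_SIZE 4   /* arbitrary, cf. KVM_MMU_NUM_PREV_ROOTS */
#define INVALID_GPA (~0ULL)

typedef unsigned long long gpa_t;

struct nested_ctx {
	gpa_t vmptr;    /* VMPTR (VMX) / VMCB (SVM) GPA keying the entry */
	int data;       /* stand-in for the cached nested state */
};

struct nested_ctx_cache {
	/* MRU order: slot 0 is most recently used, last slot is the LRU. */
	struct nested_ctx slots[NESTED_CTX_CACHE_SIZE];
};

static void nested_ctx_cache_init(struct nested_ctx_cache *c)
{
	int i;

	for (i = 0; i < NESTED_CTX_CACHE_SIZE; i++)
		c->slots[i].vmptr = INVALID_GPA;
}

/*
 * Look up @vmptr: on a hit, rotate the entry to the MRU slot; on a
 * miss, recycle the LRU slot for @vmptr.
 */
static struct nested_ctx *nested_ctx_load(struct nested_ctx_cache *c,
					  gpa_t vmptr, int *hit)
{
	struct nested_ctx tmp;
	int i;

	for (i = 0; i < NESTED_CTX_CACHE_SIZE; i++) {
		if (c->slots[i].vmptr != vmptr)
			continue;
		tmp = c->slots[i];
		memmove(&c->slots[1], &c->slots[0], i * sizeof(tmp));
		c->slots[0] = tmp;
		*hit = 1;
		return &c->slots[0];
	}

	/* Miss: shift everything down, dropping the LRU entry. */
	memmove(&c->slots[1], &c->slots[0],
		(NESTED_CTX_CACHE_SIZE - 1) * sizeof(tmp));
	c->slots[0].vmptr = vmptr;
	c->slots[0].data = 0;
	*hit = 0;
	return &c->slots[0];
}
```

Point being: that's ~30 lines with trivial lifetime rules, and for a
small number of L2 vCPUs per L1 vCPU it plausibly behaves just as well.
Data showing where something like this falls over is exactly the kind
of justification the per-VM hash table needs.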
If I'm putting the pieces together correctly, quoting a snippet from
the cover letter, the performance benefits come from:

  The pfncache infrastructure maintains persistent mappings as long as
  the page GPA does not change, eliminating the memremap/memunmap
  overhead on every VM entry/exit cycle.

Which means that this caching effectively eliminates the security value
added by removing memory from the kernel's direct map.

If, in the long term, we're collectively moving towards guest_memfd
(for setups that don't want all of the overcommit goodness provided by
mm/), then the performance provided by this approach is directly at
odds with the efforts to remove guest_memfd memory from the direct map
for added security.  E.g. if the ratio of L2:L1 contexts is pushed high
enough, it would be possible to have the majority of guest memory
mapped into the host kernel.

That then raises the question of whether or not we are optimizing the
right thing.  E.g. if we can somehow make map+unmap blazing fast for
"all" real world usage that matters, then maybe we don't need this type
of caching.

In general, this needs a _lot_ more justification on the design
decisions.  A lot, a lot, a _lot_ more.  This is too much code and
complexity for me to even start reviewing without hard data.

