On Fri, Jan 02, 2026, Fred Griffoul wrote:
> From: Fred Griffoul <[email protected]>
> 
> Add infrastructure to persist nested virtualization state when L2 vCPUs

Please be more transparent about what exactly is being persisted.

> are switched on an L1 vCPU or migrated between L1 vCPUs.
> 
> The nested context table uses a hash table for fast lookup by nested
> control block GPA (VMPTR for VMX, VMCB for SVM) and maintains a free
> list for context management.
>
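If I'm parsing this correctly, the layout is presumably something along the
lines of the below.  Struct and field names are my guesses, not necessarily
what the patch actually uses:

	/*
	 * Guess at the per-VM table: contexts hashed by the GPA of the
	 * nested control block (VMPTR/VMCB), plus a free list for recycling.
	 */
	struct kvm_nested_context {
		gpa_t control_block_gpa;	/* VMPTR (VMX) or VMCB (SVM) */
		struct hlist_node hash_node;	/* lookup by GPA */
		struct list_head free_node;	/* linkage for the free list */
		/* cached VMX/SVM state to be added by later patches */
	};

	struct kvm_nested_context_table {
		DECLARE_HASHTABLE(contexts, 4);	/* keyed by control block GPA */
		struct list_head free_list;	/* oldest entries recycled first */
		unsigned int nr_contexts;
		unsigned int max_contexts;	/* 8 per L1 vCPU, per below? */
	};
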
> The kvm_nested_context_load() function searches for a context indexed by
> the target GPA; if not found, it allocates a new context up to the
> configured maximum. If at capacity, it recycles the oldest context from
> the free list.
> 
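And presumably the load path is roughly the below (again my reconstruction,
with a made-up kvm->arch.nested_table hook, just to confirm I understand the
recycling behavior):

	static struct kvm_nested_context *
	kvm_nested_context_load(struct kvm *kvm, gpa_t gpa)
	{
		struct kvm_nested_context_table *table = &kvm->arch.nested_table;
		struct kvm_nested_context *ctx;

		/* Fast path: the context for this VMPTR/VMCB GPA is cached. */
		hash_for_each_possible(table->contexts, ctx, hash_node, gpa) {
			if (ctx->control_block_gpa == gpa) {
				/* It may be parked on the free list; reclaim it. */
				if (!list_empty(&ctx->free_node))
					list_del_init(&ctx->free_node);
				return ctx;
			}
		}

		if (table->nr_contexts < table->max_contexts) {
			ctx = kzalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
			if (!ctx)
				return NULL;
			INIT_LIST_HEAD(&ctx->free_node);
			table->nr_contexts++;
		} else {
			/*
			 * At capacity: recycle the oldest parked context.  What
			 * happens if the free list is empty here, i.e. if every
			 * context is live?
			 */
			ctx = list_first_entry(&table->free_list,
					       struct kvm_nested_context,
					       free_node);
			list_del_init(&ctx->free_node);
			hash_del(&ctx->hash_node);
		}

		ctx->control_block_gpa = gpa;
		hash_add(table->contexts, &ctx->hash_node, gpa);
		return ctx;
	}
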
> The oversubscription is hardcoded to support up to 8 L2 vCPUs per L1
> vCPU.
> 
> The kvm_nested_context_clear() function moves the context to the free
> list while keeping it in the hash table for potential reuse.
> 
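So "clear" is basically just parking the entry, i.e. something like this
(continuing with the made-up names above)?

	static void kvm_nested_context_clear(struct kvm *kvm,
					     struct kvm_nested_context *ctx)
	{
		/*
		 * Park the context on the free list but leave it hashed so a
		 * later load of the same GPA can reuse the cached state.
		 */
		list_add_tail(&ctx->free_node,
			      &kvm->arch.nested_table.free_list);
	}

If so, please say that explicitly in the changelog.
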
> This allows nested hypervisors to multiplex multiple L2 vCPUs on L1
> vCPUs without losing cached nested state, significantly improving
> performance for workloads with frequent L2 context switches.
> 
> This patch adds the basic infrastructure. Subsequent patches will add
> the nested VMX and SVM specific support to populate and utilize the
> cached nested state.
> 
> Signed-off-by: Fred Griffoul <[email protected]>
> ---
>  arch/x86/include/asm/kvm_host.h |  31 +++++
>  arch/x86/include/uapi/asm/kvm.h |   2 +
>  arch/x86/kvm/Makefile           |   2 +-
>  arch/x86/kvm/nested.c           | 199 ++++++++++++++++++++++++++++++++
>  arch/x86/kvm/x86.c              |   5 +-
>  5 files changed, 237 insertions(+), 2 deletions(-)

Please provide concrete performance numbers.  They need to be isolated from the
switch to gpcs, and need to show how much benefit is provided for a per-VM hash
table vs. (much) simpler approaches, e.g. versus a stupid simple per-vCPU LRU
cache, a la KVM's pgd caching.
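
E.g. the comparison point I have in mind is something in the spirit of the
below, which is purely illustrative; the names and the array size are made up:

	#define NESTED_CTX_CACHE_SIZE	4	/* arbitrary */

	struct nested_ctx_cache {
		/* Tiny per-vCPU array, a la kvm_mmu.prev_roots. */
		struct kvm_nested_context *slots[NESTED_CTX_CACHE_SIZE];
	};

	static struct kvm_nested_context *
	nested_ctx_cache_get(struct nested_ctx_cache *cache, gpa_t gpa)
	{
		int i;

		for (i = 0; i < NESTED_CTX_CACHE_SIZE; i++) {
			struct kvm_nested_context *ctx = cache->slots[i];

			if (ctx && ctx->control_block_gpa == gpa) {
				/* Move-to-front so the hot context stays cheap. */
				swap(cache->slots[0], cache->slots[i]);
				return cache->slots[0];
			}
		}

		/* Miss: caller allocates and evicts the last slot. */
		return NULL;
	}

Linear search over a handful of entries, no per-VM hash table, no free list,
and no locking beyond the vCPU itself.  If the fancy version doesn't clearly
beat that, the complexity isn't justified.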

There also needs to be an analysis of the downsides of the performance gains.
If I'm putting the pieces together correctly, quoting a snippet from the cover
letter, the performance benefits come from:

  The pfncache infrastructure maintains persistent mappings as long as the
  page GPA does not change, eliminating the memremap/memunmap overhead on
  every VM entry/exit cycle. 

Which means that this caching effectively eliminates the security value added by
removing memory from the kernel's direct map.  If, in the long term, we're
collectively moving towards guest_memfd (for setups that don't want all of the
overcommit goodness provided by mm/), then the performance provided by this
approach is directly at odds with the efforts to remove guest_memfd memory from
the direct map for added security.

E.g. if the ratio of L2:L1 contexts is pushed high enough, it would be possible
to have the majority of guest memory mapped into the host kernel.

That then raises the question of whether or not we are optimizing the right
thing.  E.g. if we can somehow make map+unmap blazing fast for "all" real world
usage that matters, then maybe we don't need this type of caching.

In general, this needs a _lot_ more justification on the design decisions.  A
lot, a lot, a _lot_ more.  This is too much code and complexity for me to even
start reviewing without hard data.
