Re: [PATCH v12 48/84] KVM: Move x86's API to release a faultin page to common KVM

2024-07-31 Thread Paolo Bonzini

On 7/30/24 21:15, Sean Christopherson wrote:

Does it make sense to move RET_PF_* to common code, and avoid a bool
argument here?

After this series, probably?  Especially if/when we make "struct kvm_page_fault"
a common structure and converge all arch code.  In this series, definitely not,
as it would require even more patches to convert other architectures, and it's
not clear that it would be a net win, at least not without even more massaging.


It does not seem to be hard, but I agree that all the other 
architectures right now use 0/-errno in the callers of 
kvm_release_faultin_page().
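
Just to illustrate what I mean, if RET_PF_* ever does move to common code, the
helper this patch adds could lose the bool along these lines (rough sketch
only, mirroring the code below, definitely not for this series):

static inline void kvm_release_faultin_page(struct kvm *kvm, struct page *page,
					    int r, bool dirty)
{
	lockdep_assert_once(lockdep_is_held(&kvm->mmu_lock) ||
			    r == RET_PF_RETRY);

	if (!page)
		return;

	if (r == RET_PF_RETRY)
		kvm_release_page_unused(page);
	else if (dirty)
		kvm_release_page_dirty(page);
	else
		kvm_release_page_clean(page);
}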


Paolo



Re: [PATCH v12 45/84] KVM: guest_memfd: Provide "struct page" as output from kvm_gmem_get_pfn()

2024-07-31 Thread Paolo Bonzini

On 7/30/24 22:00, Sean Christopherson wrote:

The probability of guest_memfd not having struct page for mapped pfns is likely
very low, but at the same time, providing a pfn+page pair doesn't cost us much.
And if it turns out that not having struct page is nonsensical, deferring the
kvm_gmem_get_pfn() => kvm_gmem_get_page() conversion could be annoying, but
highly unlikely to be painful since it should be 100% mechanical.  Whereas
reverting back to kvm_gmem_get_pfn() if we make the wrong decision now could
mean doing surgery on a pile of arch code.


Ok, fair enough.  The conflict resolution is trivial either way (I also 
checked the TDX series and miraculously it has only one conflict which 
is also trivial).


Paolo



Re: [PATCH v12 34/84] KVM: Add a helper to lookup a pfn without grabbing a reference

2024-07-31 Thread Paolo Bonzini

On 7/30/24 22:15, Sean Christopherson wrote:

On Tue, Jul 30, 2024, Paolo Bonzini wrote:

On 7/27/24 01:51, Sean Christopherson wrote:

Add a kvm_follow_pfn() wrapper, kvm_lookup_pfn(), to allow looking up a
gfn=>pfn mapping without the caller getting a reference to any underlying
page.  The API will be used in flows that want to know if a gfn points at
a valid pfn, but don't actually need to do anything with the pfn.


Can you rename the function kvm_gfn_has_pfn(), or kvm_gfn_can_be_mapped(),
and make it return a bool?


Heh, sure.  I initially planned on having it return a bool, but I couldn't
figure out a name, mainly because the kernel's pfn_valid() makes things like
kvm_gfn_has_valid_pfn() confusing/misleading :-(


(As an aside, I wonder if reexecute_instruction() could just use
kvm_is_error_hva(kvm_vcpu_gfn_to_hva(vcpu, gpa_to_gfn(gpa)) instead of going
all the way to a pfn.  But it's ok to be more restrictive).


Heh #2, I wondered the same thing.  I think it would work?  Verifying that
there's a usable pfn also protects against retrying an access that hit
-EHWPOISON, but I'm pretty sure that would require a rare race, and I don't
think it could result in the guest being put into an infinite loop.


Indeed, and even the check in kvm_alloc_apic_access_page() is totally 
useless.  The page can go away at any time between the call and 
vmx_set_apic_access_page_addr() or, for AMD, the #NPF on 
APIC_DEFAULT_PHYS_BASE.


Yes, it's verifying that the system isn't under extreme memory pressure, 
but in practice a 4K get_user_pages is never going to fail, it's just 
going to cause something else to be swapped.  I'd just get rid of both 
of them, so there's no need for kvm_lookup_pfn().
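
For completeness, the hva-based check for reexecute_instruction() mentioned
above would boil down to something like this (sketch only, the helper name is
invented; kvm_vcpu_gfn_to_hva() and kvm_is_error_hva() are the existing
helpers):

static bool gpa_is_backed_by_memslot(struct kvm_vcpu *vcpu, gpa_t gpa)
{
	/*
	 * Only check that the GPA is backed by a memslot/hva; no need to
	 * resolve (and then immediately release) the pfn.
	 */
	unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa_to_gfn(gpa));

	return !kvm_is_error_hva(hva);
}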


Paolo



Re: [PATCH v12 84/84] KVM: Don't grab reference on VM_MIXEDMAP pfns that have a "struct page"

2024-07-31 Thread Paolo Bonzini

On 7/30/24 22:21, Sean Christopherson wrote:

On Tue, Jul 30, 2024, Paolo Bonzini wrote:

On 7/27/24 01:52, Sean Christopherson wrote:

Now that KVM no longer relies on an ugly heuristic to find its struct page
references, i.e. now that KVM can't get false positives on VM_MIXEDMAP
pfns, remove KVM's hack to elevate the refcount for pfns that happen to
have a valid struct page.  In addition to removing a long-standing wart
in KVM, this allows KVM to map non-refcounted struct page memory into the
guest, e.g. for exposing GPU TTM buffers to KVM guests.


Feel free to leave it to me for later, but there are more cleanups that
can be made, given how simple kvm_resolve_pfn() is now:


I'll revisit kvm_resolve_pfn(), Maxim also wasn't a fan of a similar helper that
existed in v11.


FWIW kvm_resolve_pfn() is totally fine as an intermediate step.  Just 
food for thought for possible follow-ups.



Also, check_user_page_hwpoison() should not be needed anymore, probably
not since commit 234b239bea39 ("kvm: Faults which trigger IO release the
mmap_sem", 2014-09-24) removed get_user_pages_fast() from hva_to_pfn_slow().


Ha, I *knew* this sounded familiar.  Past me apparently came to the same
conclusion[*], though I wrongly suspected a memory leak and promptly forgot to
ever send a patch.  I'll tack one on this time around.


As you prefer.

Paolo



Re: [PATCH v12 00/84] KVM: Stop grabbing references to PFNMAP'd pages

2024-07-30 Thread Paolo Bonzini

On 7/27/24 01:51, Sean Christopherson wrote:

arm64 folks, the first two patches are bug fixes, but I have very low
confidence that they are correct and/or desirable.  If they are more or
less correct, I can post them separately if that'd make life easier.  I
included them here to avoid conflicts, and because I'm pretty sure how
KVM deals with MTE tags vs. dirty logging will impact what APIs KVM needs
to provide to arch code.

On to the series...  The TL;DR is that I would like to get input on two
things:

  1. Marking folios dirty/accessed only on the initial stage-2 page fault
  2. The new APIs for faulting, prefetching, and doing "lookups" on pfns


Wow!

Splitting out prefetching makes a lot of sense, as it's the only one 
with npages > 1 and it doesn't need all the complexity of hva_to_pfn().


I've left a comment on the lookup API, which is probably the only one 
that can be simplified further.


The faulting API looks good as a first iteration.  Code-wise, 
kvm_resolve_pfn() is probably unnecessary at the end of the series but I 
can see why you had to restrain yourself and declare it done. :)


An interesting evolution of the API could be to pass a struct 
kvm_follow_pfn pointer to {,__}kvm_faultin_pfn() and __gfn_to_page() 
(the "constructors"); and on the other side to 
kvm_release_faultin_page() and kvm_release_page_*().  The struct 
kvm_follow_pfn could be embedded in the (x86) kvm_page_fault and 
(generic) kvm_host_map structs.  But certainly not as part of this 
already huge work.
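
To make the idea a bit more concrete, something along these lines (all names
and signatures purely illustrative, not a proposal for this series):

struct kvm_page_fault {
	/* ... existing fields ... */
	struct kvm_follow_pfn kfp;
};

static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
			       struct kvm_page_fault *fault)
{
	/* "constructor": fills pfn, map_writable, refcounted_page, ... */
	fault->pfn = __kvm_faultin_pfn(&fault->kfp);

	return is_error_noslot_pfn(fault->pfn) ? -EFAULT : 0;
}

static void kvm_mmu_finish_page_fault(struct kvm_vcpu *vcpu,
				      struct kvm_page_fault *fault, int r)
{
	/* "destructor": releases whatever the constructor grabbed */
	kvm_release_faultin_page(vcpu->kvm, &fault->kfp, r == RET_PF_RETRY,
				 fault->map_writable);
}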


Paolo



Re: [PATCH v12 84/84] KVM: Don't grab reference on VM_MIXEDMAP pfns that have a "struct page"

2024-07-30 Thread Paolo Bonzini

On 7/27/24 01:52, Sean Christopherson wrote:

Now that KVM no longer relies on an ugly heuristic to find its struct page
references, i.e. now that KVM can't get false positives on VM_MIXEDMAP
pfns, remove KVM's hack to elevate the refcount for pfns that happen to
have a valid struct page.  In addition to removing a long-standing wart
in KVM, this allows KVM to map non-refcounted struct page memory into the
guest, e.g. for exposing GPU TTM buffers to KVM guests.


Feel free to leave it to me for later, but there are more cleanups that
can be made, given how simple kvm_resolve_pfn() is now:


@@ -2814,35 +2768,10 @@ static kvm_pfn_t kvm_resolve_pfn(struct kvm_follow_pfn *kfp, struct page *page,
 	if (kfp->map_writable)
 		*kfp->map_writable = writable;
 
 	if (pte)
 		pfn = pte_pfn(*pte);
 	else
 		pfn = page_to_pfn(page);
 
 	*kfp->refcounted_page = page;
 

Something like (untested/uncompiled):

--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2758,32 +2758,12 @@ static inline int check_user_page_hwpois
return rc == -EHWPOISON;
 }
 
-static kvm_pfn_t kvm_resolve_pfn(struct kvm_follow_pfn *kfp, struct page *page,
-				 pte_t *pte, bool writable)
-{
-   kvm_pfn_t pfn;
-
-   WARN_ON_ONCE(!!page == !!pte);
-
-   if (kfp->map_writable)
-   *kfp->map_writable = writable;
-
-   if (pte)
-   pfn = pte_pfn(*pte);
-   else
-   pfn = page_to_pfn(page);
-
-   *kfp->refcounted_page = page;
-
-   return pfn;
-}
-
 /*
  * The fast path to get the writable pfn which will be stored in @pfn,
  * true indicates success, otherwise false is returned.  It's also the
  * only part that runs if we can in atomic context.
  */
-static bool hva_to_pfn_fast(struct kvm_follow_pfn *kfp, kvm_pfn_t *pfn)
+static bool hva_to_page_fast(struct kvm_follow_pfn *kfp)
 {
struct page *page;
bool r;
@@ -2799,23 +2779,21 @@ static bool hva_to_pfn_fast(struct kvm_f
return false;
 
 	if (kfp->pin)

-		r = pin_user_pages_fast(kfp->hva, 1, FOLL_WRITE, &page) == 1;
+		r = pin_user_pages_fast(kfp->hva, 1, FOLL_WRITE, kfp->refcounted_page) == 1;
 	else
-		r = get_user_page_fast_only(kfp->hva, FOLL_WRITE, &page);
+		r = get_user_page_fast_only(kfp->hva, FOLL_WRITE, kfp->refcounted_page);
 
-	if (r) {
-		*pfn = kvm_resolve_pfn(kfp, page, NULL, true);
-		return true;
-	}
+	if (r)
+		kfp->flags |= FOLL_WRITE;
 
-	return false;
+	return r;
 }
 
 /*

  * The slow path to get the pfn of the specified host virtual address,
  * 1 indicates success, -errno is returned if error is detected.
  */
-static int hva_to_pfn_slow(struct kvm_follow_pfn *kfp, kvm_pfn_t *pfn)
+static int hva_to_page(struct kvm_follow_pfn *kfp)
 {
/*
 * When a VCPU accesses a page that is not mapped into the secondary
@@ -2829,34 +2807,32 @@ static int hva_to_pfn_slow(struct kvm_fo
 * implicitly honor NUMA hinting faults and don't need this flag.
 */
unsigned int flags = FOLL_HWPOISON | FOLL_HONOR_NUMA_FAULT | kfp->flags;
-   struct page *page, *wpage;
+   struct page *wpage;
int npages;
 
+	if (hva_to_page_fast(kfp))
+		return 1;
+
 	if (kfp->pin)
-		npages = pin_user_pages_unlocked(kfp->hva, 1, &page, flags);
+		npages = pin_user_pages_unlocked(kfp->hva, 1, kfp->refcounted_page, flags);
 	else
-		npages = get_user_pages_unlocked(kfp->hva, 1, &page, flags);
-	if (npages != 1)
-		return npages;
+		npages = get_user_pages_unlocked(kfp->hva, 1, kfp->refcounted_page, flags);
 
 	/*
-	 * Pinning is mutually exclusive with opportunistically mapping a read
-	 * fault as writable, as KVM should never pin pages when mapping memory
-	 * into the guest (pinning is only for direct accesses from KVM).
+	 * Map read fault as writable if possible; pinning is mutually exclusive
+	 * with opportunistically mapping a read fault as writable, as KVM
+	 * should never pin pages when mapping memory into the guest (pinning is
+	 * only for direct accesses from KVM).
 	 */
-   if (WARN_ON_ONCE(kfp->map_writable && kfp->pin))
-   goto out;
-
-   /* map read fault as writable if possible */
-   if (!(flags & FOLL_WRITE) && kfp->map_writable &&
+   if (npages == 1 &&
+   kfp->map_writable && !WARN_ON_ONCE(kfp->pin) &&
+   !(flags & FOLL_WRITE) &&
get_user_page_fast_only(kfp->hva, FOLL_WRITE, &wpage)) {
-   put_page(page);
-   page = wpage;
-   flags |= FOLL_WRITE;
+   put_page(kfp->refcounted_page);
+   kfp->refcounted_page = wpage;
+   kfp->flags |= FOLL_WRITE;
}
 

Re: [PATCH v12 34/84] KVM: Add a helper to lookup a pfn without grabbing a reference

2024-07-30 Thread Paolo Bonzini

On 7/27/24 01:51, Sean Christopherson wrote:

Add a kvm_follow_pfn() wrapper, kvm_lookup_pfn(), to allow looking up a
gfn=>pfn mapping without the caller getting a reference to any underlying
page.  The API will be used in flows that want to know if a gfn points at
a valid pfn, but don't actually need to do anything with the pfn.


Can you rename the function kvm_gfn_has_pfn(), or 
kvm_gfn_can_be_mapped(), and make it return a bool?


(As an aside, I wonder if reexecute_instruction() could just use 
kvm_is_error_hva(kvm_vcpu_gfn_to_hva(vcpu, gpa_to_gfn(gpa)) instead of 
going all the way to a pfn.  But it's ok to be more restrictive).


Paolo



Re: [PATCH v12 45/84] KVM: guest_memfd: Provide "struct page" as output from kvm_gmem_get_pfn()

2024-07-30 Thread Paolo Bonzini

On 7/27/24 01:51, Sean Christopherson wrote:

Provide the "struct page" associated with a guest_memfd pfn as an output
from __kvm_gmem_get_pfn() so that KVM guest page fault handlers can

   

Just "kvm_gmem_get_pfn()".


directly put the page instead of having to rely on
kvm_pfn_to_refcounted_page().


This will conflict with my series, where I'm introducing
folio_file_pfn() and using it here:

-   page = folio_file_page(folio, index);
+   *page = folio_file_page(folio, index);
  
-	*pfn = page_to_pfn(page);

+   *pfn = page_to_pfn(*page);
if (max_order)
*max_order = 0;


That said, I think it's better to turn kvm_gmem_get_pfn() into
kvm_gmem_get_page() here, and pull the page_to_pfn() or page_to_phys()
to the caller as applicable.  This highlights that the caller always
gets a refcounted page with guest_memfd.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 901be9e420a4..bcc4a4c594ef 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4348,13 +4348,14 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
 		return -EFAULT;
 	}
 
-	r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
+	r = kvm_gmem_get_page(vcpu->kvm, fault->slot, fault->gfn, &fault->refcounted_page,
 			     &max_order);
if (r) {
kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
return r;
}
 
+	fault->pfn = page_to_pfn(fault->refcounted_page);

fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
fault->max_level = kvm_max_private_mapping_level(vcpu->kvm, fault->pfn,
							 fault->max_level, max_order);
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index a16c873b3232..db4181d11f2e 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3847,7 +3847,7 @@ static int __sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu)
if (VALID_PAGE(svm->sev_es.snp_vmsa_gpa)) {
gfn_t gfn = gpa_to_gfn(svm->sev_es.snp_vmsa_gpa);
struct kvm_memory_slot *slot;
-   kvm_pfn_t pfn;
+   struct page *page;
 
 		slot = gfn_to_memslot(vcpu->kvm, gfn);

if (!slot)
@@ -3857,7 +3857,7 @@ static int __sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu)
 * The new VMSA will be private memory guest memory, so
 * retrieve the PFN from the gmem backend.
 */
-   if (kvm_gmem_get_pfn(vcpu->kvm, slot, gfn, &pfn, NULL))
+   if (kvm_gmem_get_page(vcpu->kvm, slot, gfn, &page, NULL))
return -EINVAL;
 
 		/*

@@ -3873,7 +3873,7 @@ static int __sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu)
svm->sev_es.snp_has_guest_vmsa = true;
 
 		/* Use the new VMSA */

-   svm->vmcb->control.vmsa_pa = pfn_to_hpa(pfn);
+   svm->vmcb->control.vmsa_pa = page_to_phys(page);
 
 		/* Mark the vCPU as runnable */

vcpu->arch.pv.pv_unhalted = false;
@@ -3886,7 +3886,7 @@ static int __sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu)
 * changes then care should be taken to ensure
 * svm->sev_es.vmsa is pinned through some other means.
 */
-   kvm_release_pfn_clean(pfn);
+   kvm_release_page_clean(page);
}
 
 	/*

@@ -4687,6 +4687,7 @@ void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
struct kvm *kvm = vcpu->kvm;
int order, rmp_level, ret;
bool assigned;
+   struct page *page;
kvm_pfn_t pfn;
gfn_t gfn;
 
@@ -4712,13 +4713,14 @@ void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)

return;
}
 
-	ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, &order);

+   ret = kvm_gmem_get_page(kvm, slot, gfn, &page, &order);
if (ret) {
		pr_warn_ratelimited("SEV: Unexpected RMP fault, no backing page for private GPA 0x%llx\n",
gpa);
return;
}
 
+	pfn = page_to_pfn(page);

ret = snp_lookup_rmpentry(pfn, &assigned, &rmp_level);
if (ret || !assigned) {
		pr_warn_ratelimited("SEV: Unexpected RMP fault, no assigned RMP entry found for GPA 0x%llx PFN 0x%llx error %d\n",
@@ -4770,7 +4772,7 @@ void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
 out:
trace_kvm_rmp_fault(vcpu, gpa, pfn, error_code, rmp_level, ret);
 out_no_trace:
-   put_page(pfn_to_page(pfn));
+   kvm_release_page_unused(page);
 }
 
 static bool is_pfn_range_shared(kvm_pfn_t start, kvm_pfn_t end)



And the change in virt/kvm/guest_memfd.c then is just as trivial, apart
from all the renaming:

-   *pfn = folio_file_pfn(folio, index);
+  

Re: [PATCH v12 50/84] KVM: VMX: Use __kvm_faultin_page() to get APIC access page/pfn

2024-07-30 Thread Paolo Bonzini

On 7/27/24 01:51, Sean Christopherson wrote:

Use __kvm_faultin_page() to get the APIC access page so that KVM can
precisely release the refcounted page, i.e. to remove yet another user
of kvm_pfn_to_refcounted_page().  While the path isn't handling a guest
page fault, the semantics are effectively the same; KVM just happens to
be mapping the pfn into a VMCS field instead of a secondary MMU.

Signed-off-by: Sean Christopherson 
---
  arch/x86/kvm/vmx/vmx.c | 13 +
  1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 30032585f7dc..b109bd282a52 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6786,8 +6786,10 @@ void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
struct kvm *kvm = vcpu->kvm;
struct kvm_memslots *slots = kvm_memslots(kvm);
struct kvm_memory_slot *slot;
+   struct page *refcounted_page;
unsigned long mmu_seq;
kvm_pfn_t pfn;
+   bool ign;


Even if you don't use it, call the out argument "writable".

Paolo

  
  	/* Defer reload until vmcs01 is the current VMCS. */

if (is_guest_mode(vcpu)) {
@@ -6823,7 +6825,7 @@ void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 * controls the APIC-access page memslot, and only deletes the memslot
 * if APICv is permanently inhibited, i.e. the memslot won't reappear.
 */
-   pfn = gfn_to_pfn_memslot(slot, gfn);
+   pfn = __kvm_faultin_pfn(slot, gfn, FOLL_WRITE, &ign, &refcounted_page);
if (is_error_noslot_pfn(pfn))
return;
  
@@ -6834,10 +6836,13 @@ void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)

vmcs_write64(APIC_ACCESS_ADDR, pfn_to_hpa(pfn));
  
  	/*

-* Do not pin apic access page in memory, the MMU notifier
-* will call us again if it is migrated or swapped out.
+* Do not pin the APIC access page in memory so that it can be freely
+* migrated, the MMU notifier will call us again if it is migrated or
+* swapped out.  KVM backs the memslot with anonymous memory, the pfn
+* should always point at a refcounted page (if the pfn is valid).
 */
-   kvm_release_pfn_clean(pfn);
+   if (!WARN_ON_ONCE(!refcounted_page))
+   kvm_release_page_clean(refcounted_page);
  
  	/*

 * No need for a manual TLB flush at this point, KVM has already done a




Re: [PATCH v12 48/84] KVM: Move x86's API to release a faultin page to common KVM

2024-07-30 Thread Paolo Bonzini

On 7/27/24 01:51, Sean Christopherson wrote:

Move KVM x86's helper that "finishes" the faultin process to common KVM
so that the logic can be shared across all architectures.  Note, not all
architectures implement a fast page fault path, but the gist of the
comment applies to all architectures.

Signed-off-by: Sean Christopherson 
---
  arch/x86/kvm/mmu/mmu.c   | 24 ++--
  include/linux/kvm_host.h | 26 ++
  2 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 95beb50748fc..2a0cfa225c8d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4323,28 +4323,8 @@ static u8 kvm_max_private_mapping_level(struct kvm *kvm, kvm_pfn_t pfn,
  static void kvm_mmu_finish_page_fault(struct kvm_vcpu *vcpu,
  struct kvm_page_fault *fault, int r)
  {
-   lockdep_assert_once(lockdep_is_held(&vcpu->kvm->mmu_lock) ||
-   r == RET_PF_RETRY);
-
-   if (!fault->refcounted_page)
-   return;
-
-   /*
-* If the page that KVM got from the *primary MMU* is writable, and KVM
-* installed or reused a SPTE, mark the page/folio dirty.  Note, this
-* may mark a folio dirty even if KVM created a read-only SPTE, e.g. if
-* the GFN is write-protected.  Folios can't be safely marked dirty
-* outside of mmu_lock as doing so could race with writeback on the
-* folio.  As a result, KVM can't mark folios dirty in the fast page
-* fault handler, and so KVM must (somewhat) speculatively mark the
-* folio dirty if KVM could locklessly make the SPTE writable.
-*/
-   if (r == RET_PF_RETRY)
-   kvm_release_page_unused(fault->refcounted_page);
-   else if (!fault->map_writable)
-   kvm_release_page_clean(fault->refcounted_page);
-   else
-   kvm_release_page_dirty(fault->refcounted_page);
+   kvm_release_faultin_page(vcpu->kvm, fault->refcounted_page,
+r == RET_PF_RETRY, fault->map_writable);


Does it make sense to move RET_PF_* to common code, and avoid a bool 
argument here?


Paolo


  }
  
  static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu,

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d2a97eb30e4..91341cdc6562 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1216,6 +1216,32 @@ static inline void kvm_release_page_unused(struct page *page)
  void kvm_release_page_clean(struct page *page);
  void kvm_release_page_dirty(struct page *page);
  
+static inline void kvm_release_faultin_page(struct kvm *kvm, struct page *page,
+					    bool unused, bool dirty)
+{
+   lockdep_assert_once(lockdep_is_held(&kvm->mmu_lock) || unused);
+
+   if (!page)
+   return;
+
+   /*
+* If the page that KVM got from the *primary MMU* is writable, and KVM
+* installed or reused a SPTE, mark the page/folio dirty.  Note, this
+* may mark a folio dirty even if KVM created a read-only SPTE, e.g. if
+* the GFN is write-protected.  Folios can't be safely marked dirty
+* outside of mmu_lock as doing so could race with writeback on the
+* folio.  As a result, KVM can't mark folios dirty in the fast page
+* fault handler, and so KVM must (somewhat) speculatively mark the
+* folio dirty if KVM could locklessly make the SPTE writable.
+*/
+   if (unused)
+   kvm_release_page_unused(page);
+   else if (dirty)
+   kvm_release_page_dirty(page);
+   else
+   kvm_release_page_clean(page);
+}
+
  kvm_pfn_t kvm_lookup_pfn(struct kvm *kvm, gfn_t gfn);
  kvm_pfn_t __kvm_faultin_pfn(const struct kvm_memory_slot *slot, gfn_t gfn,
unsigned int foll, bool *writable,




Re: [PATCH v12 41/84] KVM: x86/mmu: Mark pages/folios dirty at the origin of make_spte()

2024-07-30 Thread Paolo Bonzini

On 7/27/24 01:51, Sean Christopherson wrote:

Move the marking of folios dirty from make_spte() out to its callers,
which have access to the _struct page_, not just the underlying pfn.
Once all architectures follow suit, this will allow removing KVM's ugly
hack where KVM elevates the refcount of VM_MIXEDMAP pfns that happen to
be struct page memory.

Signed-off-by: Sean Christopherson 
---
  arch/x86/kvm/mmu/mmu.c | 29 +++--
  arch/x86/kvm/mmu/paging_tmpl.h |  5 +
  arch/x86/kvm/mmu/spte.c| 11 ---
  3 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1cdd67707461..7e7b855ce1e1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2918,7 +2918,16 @@ static bool kvm_mmu_prefetch_sptes(struct kvm_vcpu *vcpu, gfn_t gfn, u64 *sptep,
for (i = 0; i < nr_pages; i++, gfn++, sptep++) {
mmu_set_spte(vcpu, slot, sptep, access, gfn,
 page_to_pfn(pages[i]), NULL);
-   kvm_release_page_clean(pages[i]);
+
+   /*
+* KVM always prefetches writable pages from the primary MMU,
+* and KVM can make its SPTE writable in the fast page, without


"with a fast page fault"

Paolo


+* notifying the primary MMU.  Mark pages/folios dirty now to
+* ensure file data is written back if it ends up being written
+* by the guest.  Because KVM's prefetching GUPs writable PTEs,
+* the probability of unnecessary writeback is extremely low.
+*/
+   kvm_release_page_dirty(pages[i]);
}
  
  	return true;

@@ -4314,7 +4323,23 @@ static u8 kvm_max_private_mapping_level(struct kvm *kvm, kvm_pfn_t pfn,
  static void kvm_mmu_finish_page_fault(struct kvm_vcpu *vcpu,
  struct kvm_page_fault *fault, int r)
  {
-   kvm_release_pfn_clean(fault->pfn);
+   lockdep_assert_once(lockdep_is_held(&vcpu->kvm->mmu_lock) ||
+   r == RET_PF_RETRY);
+
+   /*
+* If the page that KVM got from the *primary MMU* is writable, and KVM
+* installed or reused a SPTE, mark the page/folio dirty.  Note, this
+* may mark a folio dirty even if KVM created a read-only SPTE, e.g. if
+* the GFN is write-protected.  Folios can't be safely marked dirty
+* outside of mmu_lock as doing so could race with writeback on the
+* folio.  As a result, KVM can't mark folios dirty in the fast page
+* fault handler, and so KVM must (somewhat) speculatively mark the
+* folio dirty if KVM could locklessly make the SPTE writable.
+*/
+   if (!fault->map_writable || r == RET_PF_RETRY)
+   kvm_release_pfn_clean(fault->pfn);
+   else
+   kvm_release_pfn_dirty(fault->pfn);
  }
  
  static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu,

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index b6897916c76b..2e2d87a925ac 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -953,6 +953,11 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
  spte_to_pfn(spte), spte, true, false,
  host_writable, &spte);
  
+	/*

+* There is no need to mark the pfn dirty, as the new protections must
+* be a subset of the old protections, i.e. synchronizing a SPTE cannot
+* change the SPTE from read-only to writable.
+*/
return mmu_spte_update(sptep, spte);
  }
  
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c

index 9b8795bd2f04..2c5650390d3b 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -277,17 +277,6 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
}
  
-	/*

-* If the page that KVM got from the primary MMU is writable, i.e. if
-* it's host-writable, mark the page/folio dirty.  As alluded to above,
-* folios can't be safely marked dirty in the fast page fault handler,
-* and so KVM must (somewhat) speculatively mark the folio dirty even
-* though it isn't guaranteed to be written as KVM won't mark the folio
-* dirty if/when the SPTE is made writable.
-*/
-   if (host_writable)
-   kvm_set_pfn_dirty(pfn);
-
*new_spte = spte;
return wrprot;
  }




Re: [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify()

2024-04-11 Thread Paolo Bonzini
On Wed, Apr 10, 2024 at 11:30 PM Andrew Morton
 wrote:
> On Fri,  5 Apr 2024 07:58:11 -0400 Paolo Bonzini  wrote:
> > Please review!  Also feel free to take the KVM patches through the mm
> > tree, as I don't expect any conflicts.
>
> It's mainly a KVM thing and the MM changes are small and simple.
> I'd say that the KVM tree would be a better home?

Sure! I'll queue them on my side then.

Paolo



Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-11 Thread Paolo Bonzini
On Mon, Apr 8, 2024 at 3:56 PM Peter Xu  wrote:
> Paolo,
>
> I may miss a bunch of details here (as I still remember some change_pte
> patches previously on the list..), however not sure whether we considered
> enable it?  Asked because I remember Andrea used to have a custom tree
> maintaining that part:
>
> https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968

The patch enables it only for KSM, so it would still require a bunch
of cleanups, for example I also would still use set_pte_at() in all
the places that are not KSM. This would at least fix the issue with
the poor documentation of where to use set_pte_at_notify() vs
set_pte_at().

With regard to the implementation, I like the idea of disabling the
invalidation on the MMU notifier side, but I would rather have
MMU_NOTIFIER_CHANGE_PTE as a separate field in the range instead of
overloading the event field.
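
Concretely, something like this (flag name invented, just to illustrate the
"separate field/flag instead of a new event value" idea):

    /* Illustration only, not an actual proposal. */
    #define MMU_NOTIFIER_RANGE_CHANGE_PTE	(1 << 1)

    if (range->flags & MMU_NOTIFIER_RANGE_CHANGE_PTE)
        /* the secondary MMU may update the mapping in place
         * instead of zapping it */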

> Maybe it can't be enabled for some reason that I overlooked in the current
> tree, or we just decided to not to?

I have just learnt about the patch, nobody had ever mentioned it even
though it's almost 2 years old... It's a lot of code though and no one
has ever reported an issue for over 10 years, so I think it's easiest
to just rip the code out.

Paolo

> Thanks,
>
> --
> Peter Xu
>



[PATCH 4/4] mm: replace set_pte_at_notify() with just set_pte_at()

2024-04-05 Thread Paolo Bonzini
With the demise of the .change_pte() MMU notifier callback, there is no
notification happening in set_pte_at_notify().  It is a synonym of
set_pte_at() and can be replaced with it.

Signed-off-by: Paolo Bonzini 
---
 include/linux/mmu_notifier.h | 2 --
 kernel/events/uprobes.c  | 5 ++---
 mm/ksm.c | 4 ++--
 mm/memory.c  | 7 +--
 mm/migrate_device.c  | 8 ++--
 5 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 8c72bf651606..d39ebb10caeb 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -657,6 +657,4 @@ static inline void mmu_notifier_synchronize(void)
 
 #endif /* CONFIG_MMU_NOTIFIER */
 
-#define set_pte_at_notify set_pte_at
-
 #endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index e4834d23e1d1..f4523b95c945 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -18,7 +18,6 @@
 #include 
 #include 
 #include /* anon_vma_prepare */
-#include /* set_pte_at_notify */
 #include /* folio_free_swap */
 #include   /* user_enable_single_step */
 #include   /* notifier mechanism */
@@ -195,8 +194,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
flush_cache_page(vma, addr, pte_pfn(ptep_get(pvmw.pte)));
ptep_clear_flush(vma, addr, pvmw.pte);
if (new_page)
-   set_pte_at_notify(mm, addr, pvmw.pte,
- mk_pte(new_page, vma->vm_page_prot));
+   set_pte_at(mm, addr, pvmw.pte,
+  mk_pte(new_page, vma->vm_page_prot));
 
folio_remove_rmap_pte(old_folio, old_page, vma);
if (!folio_mapped(old_folio))
diff --git a/mm/ksm.c b/mm/ksm.c
index 8c001819cf10..108a4d167824 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1345,7 +1345,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
if (pte_write(entry))
entry = pte_wrprotect(entry);
 
-   set_pte_at_notify(mm, pvmw.address, pvmw.pte, entry);
+   set_pte_at(mm, pvmw.address, pvmw.pte, entry);
}
*orig_pte = entry;
err = 0;
@@ -1447,7 +1447,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 * See Documentation/mm/mmu_notifier.rst
 */
ptep_clear_flush(vma, addr, ptep);
-   set_pte_at_notify(mm, addr, ptep, newpte);
+   set_pte_at(mm, addr, ptep, newpte);
 
folio = page_folio(page);
folio_remove_rmap_pte(folio, page, vma);
diff --git a/mm/memory.c b/mm/memory.c
index f2bc6dd15eb8..9a6f4d8aa379 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3327,13 +3327,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
ptep_clear_flush(vma, vmf->address, vmf->pte);
folio_add_new_anon_rmap(new_folio, vma, vmf->address);
folio_add_lru_vma(new_folio, vma);
-   /*
-* We call the notify macro here because, when using secondary
-* mmu page tables (such as kvm shadow page tables), we want the
-* new page to be mapped directly into the secondary page table.
-*/
BUG_ON(unshare && pte_write(entry));
-   set_pte_at_notify(mm, vmf->address, vmf->pte, entry);
+   set_pte_at(mm, vmf->address, vmf->pte, entry);
update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
if (old_folio) {
/*
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index b6c27c76e1a0..66206734b1b9 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -664,13 +664,9 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
if (flush) {
flush_cache_page(vma, addr, pte_pfn(orig_pte));
ptep_clear_flush(vma, addr, ptep);
-   set_pte_at_notify(mm, addr, ptep, entry);
-   update_mmu_cache(vma, addr, ptep);
-   } else {
-   /* No need to invalidate - it was non-present before */
-   set_pte_at(mm, addr, ptep, entry);
-   update_mmu_cache(vma, addr, ptep);
}
+   set_pte_at(mm, addr, ptep, entry);
+   update_mmu_cache(vma, addr, ptep);
 
pte_unmap_unlock(ptep, ptl);
*src = MIGRATE_PFN_MIGRATE;
-- 
2.43.0



[PATCH 2/4] KVM: remove unused argument of kvm_handle_hva_range()

2024-04-05 Thread Paolo Bonzini
The only user was kvm_mmu_notifier_change_pte(), which is now gone.

Signed-off-by: Paolo Bonzini 
---
 virt/kvm/kvm_main.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2fcd9979752a..970111ad 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -595,8 +595,6 @@ static void kvm_null_fn(void)
 }
 #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
 
-static const union kvm_mmu_notifier_arg KVM_MMU_NOTIFIER_NO_ARG;
-
 /* Iterate over each memslot intersecting [start, last] (inclusive) range */
 #define kvm_for_each_memslot_in_hva_range(node, slots, start, last) \
for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
@@ -682,14 +680,12 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
unsigned long start,
unsigned long end,
-   union kvm_mmu_notifier_arg arg,
gfn_handler_t handler)
 {
struct kvm *kvm = mmu_notifier_to_kvm(mn);
const struct kvm_mmu_notifier_range range = {
.start  = start,
.end= end,
-   .arg= arg,
.handler= handler,
.on_lock= (void *)kvm_null_fn,
.flush_on_ret   = true,
@@ -880,8 +876,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 {
trace_kvm_age_hva(start, end);
 
-   return kvm_handle_hva_range(mn, start, end, KVM_MMU_NOTIFIER_NO_ARG,
-   kvm_age_gfn);
+   return kvm_handle_hva_range(mn, start, end, kvm_age_gfn);
 }
 
 static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
-- 
2.43.0




[PATCH 3/4] mmu_notifier: remove the .change_pte() callback

2024-04-05 Thread Paolo Bonzini
The scope of set_pte_at_notify() has reduced more and more through the
years.  Initially, it was meant for when the change to the PTE was
not bracketed by mmu_notifier_invalidate_range_{start,end}().  However,
that has not been so for over ten years.  During all this period
the only implementation of .change_pte() was KVM and it
had no actual functionality, because it was called after
mmu_notifier_invalidate_range_start() zapped the secondary PTE.

Now that this (nonfunctional) user of the .change_pte() callback is
gone, the whole callback can be removed.  For now, leave in place
set_pte_at_notify() even though it is just a synonym for set_pte_at().

Signed-off-by: Paolo Bonzini 
---
 include/linux/mmu_notifier.h | 46 ++--
 mm/mmu_notifier.c| 17 -
 2 files changed, 2 insertions(+), 61 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index f349e08a9dfe..8c72bf651606 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -122,15 +122,6 @@ struct mmu_notifier_ops {
  struct mm_struct *mm,
  unsigned long address);
 
-   /*
-* change_pte is called in cases that pte mapping to page is changed:
-* for example, when ksm remaps pte to point to a new shared page.
-*/
-   void (*change_pte)(struct mmu_notifier *subscription,
-  struct mm_struct *mm,
-  unsigned long address,
-  pte_t pte);
-
/*
 * invalidate_range_start() and invalidate_range_end() must be
 * paired and are called only when the mmap_lock and/or the
@@ -392,8 +383,6 @@ extern int __mmu_notifier_clear_young(struct mm_struct *mm,
  unsigned long end);
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 unsigned long address);
-extern void __mmu_notifier_change_pte(struct mm_struct *mm,
- unsigned long address, pte_t pte);
 extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
 extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r);
 extern void __mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm,
@@ -439,13 +428,6 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
return 0;
 }
 
-static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-  unsigned long address, pte_t pte)
-{
-   if (mm_has_notifiers(mm))
-   __mmu_notifier_change_pte(mm, address, pte);
-}
-
 static inline void
 mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 {
@@ -581,26 +563,6 @@ static inline void mmu_notifier_range_init_owner(
__young;\
 })
 
-/*
- * set_pte_at_notify() sets the pte _after_ running the notifier.
- * This is safe to start by updating the secondary MMUs, because the primary MMU
- * pte invalidate must have already happened with a ptep_clear_flush() before
- * set_pte_at_notify() has been invoked.  Updating the secondary MMUs first is
- * required when we change both the protection of the mapping from read-only to
- * read-write and the pfn (like during copy on write page faults). Otherwise the
- * old page would remain mapped readonly in the secondary MMUs after the new
- * page is already writable by some CPU through the primary MMU.
- */
-#define set_pte_at_notify(__mm, __address, __ptep, __pte)  \
-({ \
-   struct mm_struct *___mm = __mm; \
-   unsigned long ___address = __address;   \
-   pte_t ___pte = __pte;   \
-   \
-   mmu_notifier_change_pte(___mm, ___address, ___pte); \
-   set_pte_at(___mm, ___address, __ptep, ___pte);  \
-})
-
 #else /* CONFIG_MMU_NOTIFIER */
 
 struct mmu_notifier_range {
@@ -650,11 +612,6 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
return 0;
 }
 
-static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-  unsigned long address, pte_t pte)
-{
-}
-
 static inline void
 mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 {
@@ -693,7 +650,6 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
 #defineptep_clear_flush_notify ptep_clear_flush
 #define pmdp_huge_clear_flush_notify pmdp_huge_clear_flush
 #define pudp_huge_clear_flush_notify pudp_huge_clear_flush
-#define set_pte_at_notify set_pte_at
 
 static inline void mmu_notifier_synchronize(void)
 {
@@ -701,4 +657,6 @@ static

[PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-05 Thread Paolo Bonzini
The .change_pte() MMU notifier callback was intended as an
optimization. The original point of it was that KSM could tell KVM to flip
its secondary PTE to a new location without having to first zap it. At
the time there was also an .invalidate_page() callback; both of them were
*not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
and .invalidate_page() also doubled as a fallback implementation of
.change_pte().

Later on, however, both callbacks were changed to occur within an
invalidate_range_start/end() block.

In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
set_pte_at_notify with invalidate_range_start and invalidate_range_end",
2012-10-09) did so to remove the fallback from .invalidate_page() to
.change_pte() and allow sleepable .invalidate_page() hooks.

This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start()
and therefore .change_pte() has no hope of finding a sPTE to change.
Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
well as all the architecture specific implementations.

Signed-off-by: Paolo Bonzini 
---
 arch/arm64/kvm/mmu.c  | 34 -
 arch/loongarch/include/asm/kvm_host.h |  1 -
 arch/loongarch/kvm/mmu.c  | 32 
 arch/mips/kvm/mmu.c   | 30 ---
 arch/powerpc/include/asm/kvm_ppc.h|  1 -
 arch/powerpc/kvm/book3s.c |  5 ---
 arch/powerpc/kvm/book3s.h |  1 -
 arch/powerpc/kvm/book3s_64_mmu_hv.c   | 12 --
 arch/powerpc/kvm/book3s_hv.c  |  1 -
 arch/powerpc/kvm/book3s_pr.c  |  7 
 arch/powerpc/kvm/e500_mmu_host.c  |  6 ---
 arch/riscv/kvm/mmu.c  | 20 --
 arch/x86/kvm/mmu/mmu.c| 54 +--
 arch/x86/kvm/mmu/spte.c   | 16 
 arch/x86/kvm/mmu/spte.h   |  2 -
 arch/x86/kvm/mmu/tdp_mmu.c| 46 ---
 arch/x86/kvm/mmu/tdp_mmu.h|  1 -
 include/linux/kvm_host.h  |  2 -
 include/trace/events/kvm.h| 15 
 virt/kvm/kvm_main.c   | 43 -
 20 files changed, 2 insertions(+), 327 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index dc04bc767865..ff17849be9f4 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
return false;
 }
 
-bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
-{
-   kvm_pfn_t pfn = pte_pfn(range->arg.pte);
-
-   if (!kvm->arch.mmu.pgt)
-   return false;
-
-   WARN_ON(range->end - range->start != 1);
-
-   /*
-* If the page isn't tagged, defer to user_mem_abort() for sanitising
-* the MTE tags. The S2 pte should have been unmapped by
-* mmu_notifier_invalidate_range_end().
-*/
-   if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn)))
-   return false;
-
-   /*
-* We've moved a page around, probably through CoW, so let's treat
-* it just like a translation fault and the map handler will clean
-* the cache to the PoC.
-*
-* The MMU notifiers will have unmapped a huge PMD before calling
-* ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
-* therefore we never need to clear out a huge PMD through this
-* calling path and a memcache is not required.
-*/
-   kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT,
-  PAGE_SIZE, __pfn_to_phys(pfn),
-  KVM_PGTABLE_PROT_R, NULL, 0);
-
-   return false;
-}
-
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
u64 size = (range->end - range->start) << PAGE_SHIFT;
diff --git a/arch/loongarch/include/asm/kvm_host.h b/arch/loongarch/include/asm/kvm_host.h
index 2d62f7b0d377..69305441f40d 100644
--- a/arch/loongarch/include/asm/kvm_host.h
+++ b/arch/loongarch/include/asm/kvm_host.h
@@ -203,7 +203,6 @@ void kvm_flush_tlb_all(void);
 void kvm_flush_tlb_gpa(struct kvm_vcpu *vcpu, unsigned long gpa);
 int kvm_handle_mm_fault(struct kvm_vcpu *vcpu, unsigned long badv, bool write);
 
-void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end, bool blockable);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c
index a556cff35740..98883aa23ab8 100644
--- a/arch/loongarch/kvm/mmu.c
+++ b/arch/loongarch/kvm/mmu.c
@@ -494,38 +494,6 @@ bool kvm_unmap_gfn_range(stru

[PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify()

2024-04-05 Thread Paolo Bonzini
The .change_pte() MMU notifier callback was intended as an optimization
and for this reason it was initially called without a surrounding
mmu_notifier_invalidate_range_{start,end}() pair.  It was only ever
implemented by KVM (which was also the original user of MMU notifiers)
and the rules on when to call set_pte_at_notify() rather than set_pte_at()
have always been pretty obscure.

It may seem a miracle that it has never caused any hard to trigger
bugs, but there's a good reason for that: KVM's implementation has
been nonfunctional for a good part of its existence.  Already in
2012, commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with
invalidate_range_start and invalidate_range_end", 2012-10-09) changed the
.change_pte() callback to occur within an invalidate_range_start/end()
pair; and because KVM unmaps the sPTEs during .invalidate_range_start(),
.change_pte() has no hope of finding a sPTE to change.

Therefore, all the code for .change_pte() can be removed from both KVM
and mm/, and set_pte_at_notify() can be replaced with just set_pte_at().

Please review!  Also feel free to take the KVM patches through the mm
tree, as I don't expect any conflicts.

Thanks,

Paolo

Paolo Bonzini (4):
  KVM: delete .change_pte MMU notifier callback
  KVM: remove unused argument of kvm_handle_hva_range()
  mmu_notifier: remove the .change_pte() callback
  mm: replace set_pte_at_notify() with just set_pte_at()

 arch/arm64/kvm/mmu.c  | 34 -
 arch/loongarch/include/asm/kvm_host.h |  1 -
 arch/loongarch/kvm/mmu.c  | 32 
 arch/mips/kvm/mmu.c   | 30 ---
 arch/powerpc/include/asm/kvm_ppc.h|  1 -
 arch/powerpc/kvm/book3s.c |  5 ---
 arch/powerpc/kvm/book3s.h |  1 -
 arch/powerpc/kvm/book3s_64_mmu_hv.c   | 12 --
 arch/powerpc/kvm/book3s_hv.c  |  1 -
 arch/powerpc/kvm/book3s_pr.c  |  7 
 arch/powerpc/kvm/e500_mmu_host.c  |  6 ---
 arch/riscv/kvm/mmu.c  | 20 --
 arch/x86/kvm/mmu/mmu.c| 54 +--
 arch/x86/kvm/mmu/spte.c   | 16 
 arch/x86/kvm/mmu/spte.h   |  2 -
 arch/x86/kvm/mmu/tdp_mmu.c| 46 ---
 arch/x86/kvm/mmu/tdp_mmu.h|  1 -
 include/linux/kvm_host.h  |  2 -
 include/linux/mmu_notifier.h  | 44 --
 include/trace/events/kvm.h| 15 
 kernel/events/uprobes.c   |  5 +--
 mm/ksm.c  |  4 +-
 mm/memory.c   |  7 +---
 mm/migrate_device.c   |  8 +---
 mm/mmu_notifier.c | 17 -
 virt/kvm/kvm_main.c   | 50 +
 26 files changed, 10 insertions(+), 411 deletions(-)

-- 
2.43.0



Re: [PATCH] KVM: Get rid of return value from kvm_arch_create_vm_debugfs()

2024-02-23 Thread Paolo Bonzini
On Fri, Feb 16, 2024 at 5:00 PM Oliver Upton  wrote:
>
> The general expectation with debugfs is that any initialization failure
> is nonfatal. Nevertheless, kvm_arch_create_vm_debugfs() allows
> implementations to return an error and kvm_create_vm_debugfs() allows
> that to fail VM creation.
>
> Change to a void return to discourage architectures from making debugfs
> failures fatal for the VM. Seems like everyone already had the right
> idea, as all implementations already return 0 unconditionally.
>
> Signed-off-by: Oliver Upton 

Acked-by: Paolo Bonzini 

Feel free to place it in kvm-arm.

Paolo



Re: [PATCH v2 2/4] eventfd: simplify eventfd_signal()

2024-02-07 Thread Paolo Bonzini
On Wed, Nov 22, 2023 at 1:49 PM Christian Brauner  wrote:
>
> Ever since the evenfd type was introduced back in 2007 in commit
> e1ad7468c77d ("signal/timer/event: eventfd core") the eventfd_signal()
> function only ever passed 1 as a value for @n. There's no point in
> keeping that additional argument.
>
> Signed-off-by: Christian Brauner 
> ---
>  arch/x86/kvm/hyperv.c |  2 +-
>  arch/x86/kvm/xen.c|  2 +-
>  virt/kvm/eventfd.c|  4 ++--
>  30 files changed, 60 insertions(+), 63 deletions(-)

For KVM:

Acked-by: Paolo Bonzini 



Re: [PATCH 34/34] KVM: selftests: Add a memory region subtest to validate invalid flags

2023-11-21 Thread Paolo Bonzini
On Mon, Nov 20, 2023 at 3:09 PM Mark Brown  wrote:
>
> On Wed, Nov 08, 2023 at 05:08:01PM -0800, Anish Moorthy wrote:
> > Applying [1] and [2] reveals that this also breaks non-x86 builds- the
> > MEM_REGION_GPA/SLOT definitions are guarded behind an #ifdef
> > __x86_64__, while the usages introduced here aren't.
> >
> > Should
> >
> > On Sun, Nov 5, 2023 at 8:35 AM Paolo Bonzini  wrote:
> > >
> > > +   test_invalid_memory_region_flags();
> >
> > be #ifdef'd, perhaps? I'm not quite sure what the intent is.
>
> This has been broken in -next for a week now, do we have any progress
> on a fix or should we just revert the patch?

Sorry, I was away last week. I have now posted a patch.

Paolo



Re: [PATCH v14 00/34] KVM: guest_memfd() and per-page attributes

2023-11-13 Thread Paolo Bonzini

On 11/5/23 17:30, Paolo Bonzini wrote:

The "development cycle" for this version is going to be very short;
ideally, next week I will merge it as is in kvm/next, taking this through
the KVM tree for 6.8 immediately after the end of the merge window.
The series is still based on 6.6 (plus KVM changes for 6.7) so it
will require a small fixup for changes to get_file_rcu() introduced in
6.7 by commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU").
The fixup will be done as part of the merge commit, and most of the text
above will become the commit message for the merge.


The changes from review are small enough and entirely in tests, so
I went ahead and pushed it to kvm/next, together with "selftests: kvm/s390x: use 
vm_create_barebones()" which also fixed testcase failures (similar to the 
aarch64/page_fault_test.c hunk below).

The guestmemfd branch on kvm.git was force-pushed, and can be used for further
development if you don't want to run 6.7-rc1 for whatever reason.

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 38882263278d..926241e23aeb 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1359,7 +1359,6 @@ yet and must be cleared on entry.
__u64 guest_phys_addr;
__u64 memory_size; /* bytes */
__u64 userspace_addr; /* start of the userspace allocated memory */
-   __u64 pad[16];
   };
 
   /* for kvm_userspace_memory_region::flags */

diff --git a/tools/testing/selftests/kvm/aarch64/page_fault_test.c b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
index eb4217b7c768..08a5ca5bed56 100644
--- a/tools/testing/selftests/kvm/aarch64/page_fault_test.c
+++ b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
@@ -705,7 +705,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 
 	print_test_banner(mode, p);
 
-	vm = vm_create(mode);

+   vm = vm_create(VM_SHAPE(mode));
setup_memslots(vm, p);
kvm_vm_elf_load(vm, program_invocation_name);
setup_ucall(vm);
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index ea0ae7e25330..fd389663c49b 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -6,14 +6,6 @@
  */
 
 #define _GNU_SOURCE

-#include "test_util.h"
-#include "kvm_util_base.h"
-#include 
-#include 
-#include 
-#include 
-#include 
-
 #include 
 #include 
 #include 
@@ -21,6 +13,15 @@
 #include 
 #include 
 
+#include 

+#include 
+#include 
+#include 
+#include 
+
+#include "test_util.h"
+#include "kvm_util_base.h"
+
 static void test_file_read_write(int fd)
 {
char buf[64];
diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index e4d2cd9218b2..1b58f943562f 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -819,6 +819,7 @@ static inline struct kvm_vm *vm_create_barebones(void)
return vm_create(VM_SHAPE_DEFAULT);
 }
 
+#ifdef __x86_64__

 static inline struct kvm_vm *vm_create_barebones_protected_vm(void)
 {
const struct vm_shape shape = {
@@ -828,6 +829,7 @@ static inline struct kvm_vm *vm_create_barebones_protected_vm(void)
 
 	return vm_create(shape);

 }
+#endif
 
 static inline struct kvm_vm *vm_create(uint32_t nr_runnable_vcpus)

 {
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index d05d95cc3693..9b29cbf49476 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1214,7 +1214,7 @@ void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t base, uint64_t size,
 		TEST_ASSERT(region && region->region.flags & KVM_MEM_GUEST_MEMFD,
 			    "Private memory region not found for GPA 0x%lx", gpa);
 
-		offset = (gpa - region->region.guest_phys_addr);

+   offset = gpa - region->region.guest_phys_addr;
fd_offset = region->region.guest_memfd_offset + offset;
		len = min_t(uint64_t, end - gpa, region->region.memory_size - offset);
 
diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/testing/selftests/kvm/set_memory_region_test.c

index 343e807043e1..1efee1cfcff0 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -433,6 +433,7 @@ static void test_add_max_memory_regions(void)
 }
 
 
+#ifdef __x86_64__

 static void test_invalid_guest_memfd(struct kvm_vm *vm, int memfd,
 size_t offset, const char *msg)
 {
@@ -523,14 +524,13 @@ static void test_add_overlapping_private_memory_regions(void)
close(memfd);
kvm_vm_free(vm

Re: [PATCH 27/34] KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type

2023-11-09 Thread Paolo Bonzini

On 11/9/23 00:37, Anish Moorthy wrote:

On Wed, Nov 8, 2023 at 9:00 AM Anish Moorthy  wrote:


This commit breaks the arm64 selftests build btw: looks like a simple oversight?


Yup, fix is a one-liner. Posted below.

diff --git a/tools/testing/selftests/kvm/aarch64/page_fault_test.c b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
index eb4217b7c768..08a5ca5bed56 100644
--- a/tools/testing/selftests/kvm/aarch64/page_fault_test.c
+++ b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
@@ -705,7 +705,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
  
  	print_test_banner(mode, p);
  
-	vm = vm_create(mode);

+   vm = vm_create(VM_SHAPE(mode));


Yes, this is similar to the s390 patch I sent yesterday 
(https://patchew.org/linux/20231108094055.221234-1-pbonz...@redhat.com/).


Paolo



Re: [PATCH 31/34] KVM: selftests: Expand set_memory_region_test to validate guest_memfd()

2023-11-06 Thread Paolo Bonzini

On 11/5/23 17:30, Paolo Bonzini wrote:

From: Chao Peng 

Expand set_memory_region_test to exercise various positive and negative
testcases for private memory.

  - Non-guest_memfd() file descriptor for private memory
  - guest_memfd() from different VM
  - Overlapping bindings
  - Unaligned bindings


This needs a small fixup:

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index e4d2cd9218b2..1b58f943562f 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -819,6 +819,7 @@ static inline struct kvm_vm *vm_create_barebones(void)
return vm_create(VM_SHAPE_DEFAULT);
 }
 
+#ifdef __x86_64__

 static inline struct kvm_vm *vm_create_barebones_protected_vm(void)
 {
const struct vm_shape shape = {
@@ -828,6 +829,7 @@ static inline struct kvm_vm *vm_create_barebones_protected_vm(void)
 
 	return vm_create(shape);

 }
+#endif
 
 static inline struct kvm_vm *vm_create(uint32_t nr_runnable_vcpus)

 {
diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/testing/selftests/kvm/set_memory_region_test.c
index 1891774eb6d4..302c7a46955b 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -386,6 +386,7 @@ static void test_add_max_memory_regions(void)
 }
 
 
+#ifdef __x86_64__

 static void test_invalid_guest_memfd(struct kvm_vm *vm, int memfd,
 size_t offset, const char *msg)
 {
@@ -476,14 +477,13 @@ static void test_add_overlapping_private_memory_regions(void)
close(memfd);
kvm_vm_free(vm);
 }
+#endif
 
 int main(int argc, char *argv[])

 {
 #ifdef __x86_64__
int i, loops;
-#endif
 
-#ifdef __x86_64__

/*
 * FIXME: the zero-memslot test fails on aarch64 and s390x because
 * KVM_RUN fails with ENOEXEC or EFAULT.
@@ -493,6 +493,7 @@ int main(int argc, char *argv[])
 
 	test_add_max_memory_regions();
 
+#ifdef __x86_64__

if (kvm_has_cap(KVM_CAP_GUEST_MEMFD) &&
(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM))) {
test_add_private_memory_region();
@@ -501,7 +502,6 @@ int main(int argc, char *argv[])
		pr_info("Skipping tests for KVM_MEM_GUEST_MEMFD memory regions\n");
}
 
-#ifdef __x86_64__

if (argc > 1)
loops = atoi_positive("Number of iterations", argv[1]);
else

in order to compile successfully on non-x86 platforms.



Re: [PATCH v13 23/35] KVM: x86: Add support for "protected VMs" that can utilize private memory

2023-11-06 Thread Paolo Bonzini

On 11/6/23 12:00, Fuad Tabba wrote:

Hi,


On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson  wrote:


Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
and testing vehicle for Confidential (CoCo) VMs, and potentially to even
become a "real" product in the distant future, e.g. a la pKVM.

The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
Intel's TDX, but those technologies are extremely complex (understatement),
difficult to debug, don't support running as nested guests, and require
hardware that's isn't universally accessible.  I.e. relying SEV-SNP or TDX


nit: "that isn't"

Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 


Hi Fuad,

thanks for your reviews and tests of the gmem patches!  Can you please 
continue replying to v14?


Thanks,

Paolo


Cheers,
/fuad


for maintaining guest private memory isn't a realistic option.

At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
selftests for guest_memfd and private memory support without requiring
unique hardware.

Signed-off-by: Sean Christopherson 
---
  Documentation/virt/kvm/api.rst  | 32 
  arch/x86/include/asm/kvm_host.h | 15 +--
  arch/x86/include/uapi/asm/kvm.h |  3 +++
  arch/x86/kvm/Kconfig| 12 
  arch/x86/kvm/mmu/mmu_internal.h |  1 +
  arch/x86/kvm/x86.c  | 16 +++-
  include/uapi/linux/kvm.h|  1 +
  virt/kvm/Kconfig|  5 +
  8 files changed, 78 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 38dc1fda4f45..00029436ac5b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -147,10 +147,29 @@ described as 'basic' will be available.
  The new VM has no virtual cpus and no memory.
  You probably want to use 0 as machine type.

+X86:
+
+
+Supported X86 VM types can be queried via KVM_CAP_VM_TYPES.
+
+S390:
+^
+
  In order to create user controlled virtual machines on S390, check
  KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
  privileged user (CAP_SYS_ADMIN).

+MIPS:
+^
+
+To use hardware assisted virtualization on MIPS (VZ ASE) rather than
+the default trap & emulate implementation (which changes the virtual
+memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
+flag KVM_VM_MIPS_VZ.
+
+ARM64:
+^^
+
  On arm64, the physical address size for a VM (IPA Size limit) is limited
  to 40bits by default. The limit can be configured if the host supports the
  extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
@@ -8650,6 +8669,19 @@ block sizes is exposed in 
KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
  64-bit bitmap (each bit describing a block size). The default value is
  0, to disable the eager page splitting.

+8.41 KVM_CAP_VM_TYPES
+-
+
+:Capability: KVM_CAP_VM_TYPES
+:Architectures: x86
+:Type: system ioctl
+
+This capability returns a bitmap of supported VM types.  The 1-setting of bit @n
+means the VM type with value @n is supported.  Possible values of @n are::
+
+  #define KVM_X86_DEFAULT_VM   0
+  #define KVM_X86_SW_PROTECTED_VM  1
+
  9. Known KVM API problems
  =

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f9e8d5642069..dff10051e9b6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1244,6 +1244,7 @@ enum kvm_apicv_inhibit {
  };

  struct kvm_arch {
+   unsigned long vm_type;
 unsigned long n_used_mmu_pages;
 unsigned long n_requested_mmu_pages;
 unsigned long n_max_mmu_pages;
@@ -2077,6 +2078,12 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t 
new_pgd);
  void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
int tdp_max_root_level, int tdp_huge_page_level);

+#ifdef CONFIG_KVM_PRIVATE_MEM
+#define kvm_arch_has_private_mem(kvm) ((kvm)->arch.vm_type != 
KVM_X86_DEFAULT_VM)
+#else
+#define kvm_arch_has_private_mem(kvm) false
+#endif
+
  static inline u16 kvm_read_ldt(void)
  {
 u16 ldt;
@@ -2125,14 +2132,10 @@ enum {
  #define HF_SMM_INSIDE_NMI_MASK (1 << 2)

  # define KVM_MAX_NR_ADDRESS_SPACES 2
+/* SMM is currently unsupported for guests with private memory. */
+# define kvm_arch_nr_memslot_as_ids(kvm) (kvm_arch_has_private_mem(kvm) ? 1 : 
2)
  # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 
1 : 0)
  # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
-
-static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
-{
-   return KVM_MAX_NR_ADDRESS_SPACES;
-}
-
  #else
  # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
  #endif
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 1a6a1f987949..a448d0964fc0 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -562,4 +

[PATCH 36/34] KVM: Add transparent hugepage support for dedicated guest memory

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Extended guest_memfd to allow backing guest memory with transparent
hugepages.  Require userspace to opt-in via a flag even though there's no
known/anticipated use case for forcing small pages as THP is optional,
i.e. to avoid ending up in a situation where userspace is unaware that
KVM can't provide hugepages.

For simplicity, require the guest_memfd size to be a multiple of the
hugepage size, e.g. so that KVM doesn't need to do bounds checking when
deciding whether or not to allocate a huge folio.

When reporting the max order when KVM gets a pfn from guest_memfd, force
order-0 pages if the hugepage is not fully contained by the memslot
binding, e.g. if userspace requested hugepages but punches a hole in the
memslot bindings in order to emulate x86's VGA hole.
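
A selftest-flavored sketch of the opt-in and the alignment rule (hedged: it
reuses helpers added elsewhere in this series, e.g. vm_create_guest_memfd()
and get_trans_hugepagesz(), and elides setup of "vm"):

	size_t thp_size = get_trans_hugepagesz();
	size_t size = align_up(64 * SZ_2M, thp_size);
	int fd;

	/* Opt in to best-effort hugepage backing; size is THP-aligned. */
	fd = vm_create_guest_memfd(vm, size, KVM_GUEST_MEMFD_ALLOW_HUGEPAGE);
	close(fd);

	/* A size that isn't a multiple of the THP size is rejected. */
	fd = __vm_create_guest_memfd(vm, size + getpagesize(),
				     KVM_GUEST_MEMFD_ALLOW_HUGEPAGE);
	TEST_ASSERT(fd == -1 && errno == EINVAL,
		    "guest_memfd() with a misaligned size should fail");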

Signed-off-by: Sean Christopherson 
Message-Id: <20231027182217.3615211-18-sea...@google.com>
[Allow even with CONFIG_TRANSPARENT_HUGEPAGE; dropped momentarily due to
 uneasiness about the API. - Paolo]
Signed-off-by: Paolo Bonzini 
---
 Documentation/virt/kvm/api.rst|  7 ++
 include/uapi/linux/kvm.h  |  2 +
 .../testing/selftests/kvm/guest_memfd_test.c  | 15 
 tools/testing/selftests/kvm/lib/kvm_util.c|  9 +++
 .../kvm/x86_64/private_mem_conversions_test.c |  7 +-
 virt/kvm/guest_memfd.c| 70 ---
 6 files changed, 101 insertions(+), 9 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 38882263278d..c13ede498369 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6318,6 +6318,8 @@ and cannot be resized  (guest_memfd files do however 
support PUNCH_HOLE).
__u64 reserved[6];
   };
 
+  #define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE (1ULL << 0)
+
 Conceptually, the inode backing a guest_memfd file represents physical memory,
 i.e. is coupled to the virtual machine as a thing, not to a "struct kvm".  The
 file itself, which is bound to a "struct kvm", is that instance's view of the
@@ -6334,6 +6336,11 @@ most one mapping per page, i.e. binding multiple memory 
regions to a single
 guest_memfd range is not allowed (any number of memory regions can be bound to
 a single guest_memfd file, but the bound ranges must not overlap).
 
+If KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set in flags, KVM will attempt to allocate
+and map hugepages for the guest_memfd file.  This is currently best effort.  If
+KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set, the size must be aligned to the maximum
+transparent hugepage size supported by the kernel
+
 See KVM_SET_USER_MEMORY_REGION2 for additional details.
 
 5. The kvm_run structure
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e9cb2df67a1d..b4ba4b53b834 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2316,4 +2316,6 @@ struct kvm_create_guest_memfd {
__u64 reserved[6];
 };
 
+#define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE (1ULL << 0)
+
 #endif /* __LINUX_KVM_H */
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c 
b/tools/testing/selftests/kvm/guest_memfd_test.c
index ea0ae7e25330..c15de9852316 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -123,6 +123,7 @@ static void test_invalid_punch_hole(int fd, size_t 
page_size, size_t total_size)
 
 static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
 {
+   uint64_t valid_flags = 0;
size_t page_size = getpagesize();
uint64_t flag;
size_t size;
@@ -135,9 +136,23 @@ static void test_create_guest_memfd_invalid(struct kvm_vm 
*vm)
size);
}
 
+   if (thp_configured()) {
+   for (size = page_size * 2; size < get_trans_hugepagesz(); size 
+= page_size) {
+   fd = __vm_create_guest_memfd(vm, size, 
KVM_GUEST_MEMFD_ALLOW_HUGEPAGE);
+   TEST_ASSERT(fd == -1 && errno == EINVAL,
+   "guest_memfd() with non-hugepage-aligned 
page size '0x%lx' should fail with EINVAL",
+   size);
+   }
+
+   valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+   }
+
for (flag = 1; flag; flag <<= 1) {
uint64_t bit;
 
+   if (flag & valid_flags)
+   continue;
+
fd = __vm_create_guest_memfd(vm, page_size, flag);
TEST_ASSERT(fd == -1 && errno == EINVAL,
"guest_memfd() with flag '0x%lx' should fail with 
EINVAL",
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index d05d95cc3693..ed81a00e5df1 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kv

[PATCH 35/34] KVM: Prepare for handling only shared mappings in mmu_notifier events

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Add flags to "struct kvm_gfn_range" to let notifier events target only
shared and only private mappings, and write up the existing mmu_notifier
events to be shared-only (private memory is never associated with a
userspace virtual address, i.e. can't be reached via mmu_notifiers).

Add two flags so that KVM can handle the three possibilities (shared,
private, and shared+private) without needing something like a tri-state
enum.
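
Purely for illustration (the process_*() callees are hypothetical, not part
of KVM), a consumer distinguishes the three cases like so:

	if (range->only_private)
		process_private_only(kvm, range);	/* hypothetical helper */
	else if (range->only_shared)
		process_shared_only(kvm, range);	/* hypothetical helper */
	else
		process_both(kvm, range);		/* hypothetical helper */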

Link: https://lore.kernel.org/all/zjx0hk+kpqp0k...@google.com
Signed-off-by: Sean Christopherson 
Message-Id: <20231027182217.3615211-13-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c  | 17 +
 2 files changed, 19 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3ebc6912c54a..4d5d139b0bde 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -264,6 +264,8 @@ struct kvm_gfn_range {
gfn_t start;
gfn_t end;
union kvm_mmu_notifier_arg arg;
+   bool only_private;
+   bool only_shared;
bool may_block;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8758cb799e18..9170a61ea99f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t 
__kvm_handle_hva_range(struct kvm *kvm,
 * the second or later invocation of the handler).
 */
gfn_range.arg = range->arg;
+
+   /*
+* HVA-based notifications provide a userspace address,
+* and as such are only relevant for shared mappings.
+*/
+   gfn_range.only_private = false;
+   gfn_range.only_shared = true;
gfn_range.may_block = range->may_block;
 
/*
@@ -2493,6 +2500,16 @@ static __always_inline void kvm_handle_gfn_range(struct 
kvm *kvm,
gfn_range.arg = range->arg;
gfn_range.may_block = range->may_block;
 
+   /*
+* If/when KVM supports more attributes beyond private .vs shared, this
+* _could_ set only_{private,shared} appropriately if the entire target
+* range already has the desired private vs. shared state (it's unclear
+* if that is a net win).  For now, KVM reaches this point if and only
+* if the private flag is being toggled, i.e. all mappings are in play.
+*/
+   gfn_range.only_private = false;
+   gfn_range.only_shared = false;
+
for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
slots = __kvm_memslots(kvm, i);
 
-- 
2.39.1




[PATCH 34/34] KVM: selftests: Add a memory region subtest to validate invalid flags

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Add a subtest to set_memory_region_test to verify that KVM rejects invalid
flags and combinations with -EINVAL.  KVM might or might not fail with
EINVAL anyways, but we can at least try.

Signed-off-by: Sean Christopherson 
Message-Id: <20231031002049.3915752-1-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 .../selftests/kvm/set_memory_region_test.c| 49 +++
 1 file changed, 49 insertions(+)

diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c 
b/tools/testing/selftests/kvm/set_memory_region_test.c
index 1891774eb6d4..343e807043e1 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -326,6 +326,53 @@ static void test_zero_memory_regions(void)
 }
 #endif /* __x86_64__ */
 
+static void test_invalid_memory_region_flags(void)
+{
+   uint32_t supported_flags = KVM_MEM_LOG_DIRTY_PAGES;
+   const uint32_t v2_only_flags = KVM_MEM_GUEST_MEMFD;
+   struct kvm_vm *vm;
+   int r, i;
+
+#ifdef __x86_64__
+   supported_flags |= KVM_MEM_READONLY;
+
+   if (kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM))
+   vm = vm_create_barebones_protected_vm();
+   else
+#endif
+   vm = vm_create_barebones();
+
+   if (kvm_check_cap(KVM_CAP_MEMORY_ATTRIBUTES) & 
KVM_MEMORY_ATTRIBUTE_PRIVATE)
+   supported_flags |= KVM_MEM_GUEST_MEMFD;
+
+   for (i = 0; i < 32; i++) {
+   if ((supported_flags & BIT(i)) && !(v2_only_flags & BIT(i)))
+   continue;
+
+   r = __vm_set_user_memory_region(vm, MEM_REGION_SLOT, BIT(i),
+   MEM_REGION_GPA, 
MEM_REGION_SIZE, NULL);
+
+   TEST_ASSERT(r && errno == EINVAL,
+   "KVM_SET_USER_MEMORY_REGION should have failed on 
v2 only flag 0x%lx", BIT(i));
+
+   if (supported_flags & BIT(i))
+   continue;
+
+   r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, BIT(i),
+MEM_REGION_GPA, 
MEM_REGION_SIZE, NULL, 0, 0);
+   TEST_ASSERT(r && errno == EINVAL,
+   "KVM_SET_USER_MEMORY_REGION2 should have failed on 
unsupported flag 0x%lx", BIT(i));
+   }
+
+   if (supported_flags & KVM_MEM_GUEST_MEMFD) {
+   r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT,
+KVM_MEM_LOG_DIRTY_PAGES | 
KVM_MEM_GUEST_MEMFD,
+MEM_REGION_GPA, 
MEM_REGION_SIZE, NULL, 0, 0);
+   TEST_ASSERT(r && errno == EINVAL,
+   "KVM_SET_USER_MEMORY_REGION2 should have failed, 
dirty logging private memory is unsupported");
+   }
+}
+
 /*
  * Test it can be added memory slots up to KVM_CAP_NR_MEMSLOTS, then any
  * tentative to add further slots should fail.
@@ -491,6 +538,8 @@ int main(int argc, char *argv[])
test_zero_memory_regions();
 #endif
 
+   test_invalid_memory_region_flags();
+
test_add_max_memory_regions();
 
if (kvm_has_cap(KVM_CAP_GUEST_MEMFD) &&
-- 
2.39.1




[PATCH 33/34] KVM: selftests: Test KVM exit behavior for private memory/access

2023-11-05 Thread Paolo Bonzini
From: Ackerley Tng 

"Testing private access when memslot gets deleted" tests the behavior
of KVM when a private memslot gets deleted while the VM is using the
private memslot. When KVM looks up the deleted (slot = NULL) memslot,
KVM should exit to userspace with KVM_EXIT_MEMORY_FAULT.

In the second test, upon a private access to non-private memslot, KVM
should also exit to userspace with KVM_EXIT_MEMORY_FAULT.

Intentionally don't take a requirement on KVM_CAP_GUEST_MEMFD,
KVM_CAP_MEMORY_FAULT_INFO, KVM_MEMORY_ATTRIBUTE_PRIVATE, etc., as it's a
KVM bug to advertise KVM_X86_SW_PROTECTED_VM without its prerequisites.

Signed-off-by: Ackerley Tng 
[sean: call out the similarities with set_memory_region_test]
Signed-off-by: Sean Christopherson 
Message-Id: <20231027182217.3615211-36-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 tools/testing/selftests/kvm/Makefile  |   1 +
 .../kvm/x86_64/private_mem_kvm_exits_test.c   | 120 ++
 2 files changed, 121 insertions(+)
 create mode 100644 
tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c

diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index fd3b30a4ca7b..69ce8e06b3a3 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -92,6 +92,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test
 TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
 TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
 TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
+TEST_GEN_PROGS_x86_64 += x86_64/private_mem_kvm_exits_test
 TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
 TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
 TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c 
b/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
new file mode 100644
index ..2f02f6128482
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
@@ -0,0 +1,120 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022, Google LLC.
+ */
+#include 
+#include 
+#include 
+
+#include "kvm_util.h"
+#include "processor.h"
+#include "test_util.h"
+
+/* Arbitrarily selected to avoid overlaps with anything else */
+#define EXITS_TEST_GVA 0xc000
+#define EXITS_TEST_GPA EXITS_TEST_GVA
+#define EXITS_TEST_NPAGES 1
+#define EXITS_TEST_SIZE (EXITS_TEST_NPAGES * PAGE_SIZE)
+#define EXITS_TEST_SLOT 10
+
+static uint64_t guest_repeatedly_read(void)
+{
+   volatile uint64_t value;
+
+   while (true)
+   value = *((uint64_t *) EXITS_TEST_GVA);
+
+   return value;
+}
+
+static uint32_t run_vcpu_get_exit_reason(struct kvm_vcpu *vcpu)
+{
+   int r;
+
+   r = _vcpu_run(vcpu);
+   if (r) {
+   TEST_ASSERT(errno == EFAULT, KVM_IOCTL_ERROR(KVM_RUN, r));
+   TEST_ASSERT_EQ(vcpu->run->exit_reason, KVM_EXIT_MEMORY_FAULT);
+   }
+   return vcpu->run->exit_reason;
+}
+
+const struct vm_shape protected_vm_shape = {
+   .mode = VM_MODE_DEFAULT,
+   .type = KVM_X86_SW_PROTECTED_VM,
+};
+
+static void test_private_access_memslot_deleted(void)
+{
+   struct kvm_vm *vm;
+   struct kvm_vcpu *vcpu;
+   pthread_t vm_thread;
+   void *thread_return;
+   uint32_t exit_reason;
+
+   vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
+  guest_repeatedly_read);
+
+   vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+   EXITS_TEST_GPA, EXITS_TEST_SLOT,
+   EXITS_TEST_NPAGES,
+   KVM_MEM_GUEST_MEMFD);
+
+   virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
+
+   /* Request to access page privately */
+   vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
+
+   pthread_create(&vm_thread, NULL,
+  (void *(*)(void *))run_vcpu_get_exit_reason,
+  (void *)vcpu);
+
+   vm_mem_region_delete(vm, EXITS_TEST_SLOT);
+
+   pthread_join(vm_thread, &thread_return);
+   exit_reason = (uint32_t)(uint64_t)thread_return;
+
+   TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
+   TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, 
KVM_MEMORY_EXIT_FLAG_PRIVATE);
+   TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
+   TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
+
+   kvm_vm_free(vm);
+}
+
+static void test_private_access_memslot_not_private(void)
+{
+   struct kvm_vm *vm;
+   struct kvm_vcpu *vcpu;
+   uint32_t exit_reason;
+
+   vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
+  guest_repeatedly_read);
+
+   /*

[PATCH 32/34] KVM: selftests: Add basic selftest for guest_memfd()

2023-11-05 Thread Paolo Bonzini
From: Chao Peng 

Add a selftest to verify the basic functionality of guest_memfd():

+ file descriptor created with the guest_memfd() ioctl does not allow
  read/write/mmap operations
+ file size and block size as returned from fstat are as expected
+ fallocate on the fd checks that offset/length on
  fallocate(FALLOC_FL_PUNCH_HOLE) should be page aligned
+ invalid inputs (misaligned size, invalid flags) are rejected
+ file size and inode are unique (the innocuous-sounding
  anon_inode_getfile() backs all files with a single inode...)

Signed-off-by: Chao Peng 
Co-developed-by: Ackerley Tng 
Signed-off-by: Ackerley Tng 
Co-developed-by: Paolo Bonzini 
Signed-off-by: Paolo Bonzini 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Message-Id: <20231027182217.3615211-35-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 tools/testing/selftests/kvm/Makefile  |   1 +
 .../testing/selftests/kvm/guest_memfd_test.c  | 206 ++
 2 files changed, 207 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_test.c

diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index ecdea5e7afa8..fd3b30a4ca7b 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -134,6 +134,7 @@ TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
 TEST_GEN_PROGS_x86_64 += demand_paging_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
 TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
+TEST_GEN_PROGS_x86_64 += guest_memfd_test
 TEST_GEN_PROGS_x86_64 += guest_print_test
 TEST_GEN_PROGS_x86_64 += hardware_disable_test
 TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c 
b/tools/testing/selftests/kvm/guest_memfd_test.c
new file mode 100644
index ..ea0ae7e25330
--- /dev/null
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -0,0 +1,206 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright Intel Corporation, 2023
+ *
+ * Author: Chao Peng 
+ */
+
+#define _GNU_SOURCE
+#include "test_util.h"
+#include "kvm_util_base.h"
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static void test_file_read_write(int fd)
+{
+   char buf[64];
+
+   TEST_ASSERT(read(fd, buf, sizeof(buf)) < 0,
+   "read on a guest_mem fd should fail");
+   TEST_ASSERT(write(fd, buf, sizeof(buf)) < 0,
+   "write on a guest_mem fd should fail");
+   TEST_ASSERT(pread(fd, buf, sizeof(buf), 0) < 0,
+   "pread on a guest_mem fd should fail");
+   TEST_ASSERT(pwrite(fd, buf, sizeof(buf), 0) < 0,
+   "pwrite on a guest_mem fd should fail");
+}
+
+static void test_mmap(int fd, size_t page_size)
+{
+   char *mem;
+
+   mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+   TEST_ASSERT_EQ(mem, MAP_FAILED);
+}
+
+static void test_file_size(int fd, size_t page_size, size_t total_size)
+{
+   struct stat sb;
+   int ret;
+
+   ret = fstat(fd, &sb);
+   TEST_ASSERT(!ret, "fstat should succeed");
+   TEST_ASSERT_EQ(sb.st_size, total_size);
+   TEST_ASSERT_EQ(sb.st_blksize, page_size);
+}
+
+static void test_fallocate(int fd, size_t page_size, size_t total_size)
+{
+   int ret;
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, total_size);
+   TEST_ASSERT(!ret, "fallocate with aligned offset and size should 
succeed");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+   page_size - 1, page_size);
+   TEST_ASSERT(ret, "fallocate with unaligned offset should fail");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size, page_size);
+   TEST_ASSERT(ret, "fallocate beginning at total_size should fail");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size + page_size, 
page_size);
+   TEST_ASSERT(ret, "fallocate beginning after total_size should fail");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+   total_size, page_size);
+   TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) at total_size should succeed");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+   total_size + page_size, page_size);
+   TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) after total_size should 
succeed");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+   page_size, page_size - 1);
+   TEST_ASSERT(ret, "fallocate with unaligned size should fail");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+   page_size, page_size);
+   TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) wi

[PATCH 31/34] KVM: selftests: Expand set_memory_region_test to validate guest_memfd()

2023-11-05 Thread Paolo Bonzini
From: Chao Peng 

Expand set_memory_region_test to exercise various positive and negative
testcases for private memory.

 - Non-guest_memfd() file descriptor for private memory
 - guest_memfd() from different VM
 - Overlapping bindings
 - Unaligned bindings

Signed-off-by: Chao Peng 
Co-developed-by: Ackerley Tng 
Signed-off-by: Ackerley Tng 
[sean: trim the testcases to remove duplicate coverage]
Signed-off-by: Sean Christopherson 
Message-Id: <20231027182217.3615211-34-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 .../selftests/kvm/include/kvm_util_base.h |  10 ++
 .../selftests/kvm/set_memory_region_test.c| 100 ++
 2 files changed, 110 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 8ec122f5fcc8..e4d2cd9218b2 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -819,6 +819,16 @@ static inline struct kvm_vm *vm_create_barebones(void)
return vm_create(VM_SHAPE_DEFAULT);
 }
 
+static inline struct kvm_vm *vm_create_barebones_protected_vm(void)
+{
+   const struct vm_shape shape = {
+   .mode = VM_MODE_DEFAULT,
+   .type = KVM_X86_SW_PROTECTED_VM,
+   };
+
+   return vm_create(shape);
+}
+
 static inline struct kvm_vm *vm_create(uint32_t nr_runnable_vcpus)
 {
return __vm_create(VM_SHAPE_DEFAULT, nr_runnable_vcpus, 0);
diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c 
b/tools/testing/selftests/kvm/set_memory_region_test.c
index b32960189f5f..1891774eb6d4 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -385,6 +385,98 @@ static void test_add_max_memory_regions(void)
kvm_vm_free(vm);
 }
 
+
+static void test_invalid_guest_memfd(struct kvm_vm *vm, int memfd,
+size_t offset, const char *msg)
+{
+   int r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, 
KVM_MEM_GUEST_MEMFD,
+MEM_REGION_GPA, MEM_REGION_SIZE,
+0, memfd, offset);
+   TEST_ASSERT(r == -1 && errno == EINVAL, "%s", msg);
+}
+
+static void test_add_private_memory_region(void)
+{
+   struct kvm_vm *vm, *vm2;
+   int memfd, i;
+
+   pr_info("Testing ADD of KVM_MEM_GUEST_MEMFD memory regions\n");
+
+   vm = vm_create_barebones_protected_vm();
+
+   test_invalid_guest_memfd(vm, vm->kvm_fd, 0, "KVM fd should fail");
+   test_invalid_guest_memfd(vm, vm->fd, 0, "VM's fd should fail");
+
+   memfd = kvm_memfd_alloc(MEM_REGION_SIZE, false);
+   test_invalid_guest_memfd(vm, memfd, 0, "Regular memfd() should fail");
+   close(memfd);
+
+   vm2 = vm_create_barebones_protected_vm();
+   memfd = vm_create_guest_memfd(vm2, MEM_REGION_SIZE, 0);
+   test_invalid_guest_memfd(vm, memfd, 0, "Other VM's guest_memfd() should 
fail");
+
+   vm_set_user_memory_region2(vm2, MEM_REGION_SLOT, KVM_MEM_GUEST_MEMFD,
+  MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 
0);
+   close(memfd);
+   kvm_vm_free(vm2);
+
+   memfd = vm_create_guest_memfd(vm, MEM_REGION_SIZE, 0);
+   for (i = 1; i < PAGE_SIZE; i++)
+   test_invalid_guest_memfd(vm, memfd, i, "Unaligned offset should 
fail");
+
+   vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_GUEST_MEMFD,
+  MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 
0);
+   close(memfd);
+
+   kvm_vm_free(vm);
+}
+
+static void test_add_overlapping_private_memory_regions(void)
+{
+   struct kvm_vm *vm;
+   int memfd;
+   int r;
+
+   pr_info("Testing ADD of overlapping KVM_MEM_GUEST_MEMFD memory 
regions\n");
+
+   vm = vm_create_barebones_protected_vm();
+
+   memfd = vm_create_guest_memfd(vm, MEM_REGION_SIZE * 4, 0);
+
+   vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_GUEST_MEMFD,
+  MEM_REGION_GPA, MEM_REGION_SIZE * 2, 0, 
memfd, 0);
+
+   vm_set_user_memory_region2(vm, MEM_REGION_SLOT + 1, KVM_MEM_GUEST_MEMFD,
+  MEM_REGION_GPA * 2, MEM_REGION_SIZE * 2,
+  0, memfd, MEM_REGION_SIZE * 2);
+
+   /*
+* Delete the first memslot, and then attempt to recreate it except
+* with a "bad" offset that results in overlap in the guest_memfd().
+*/
+   vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_GUEST_MEMFD,
+  MEM_REGION_GPA, 0, NULL, -1, 0);
+
+   /* Overlap the front half of the other slot. */
+   r = __vm_set_user_memory_regio

[PATCH 30/34] KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper

2023-11-05 Thread Paolo Bonzini
From: Chao Peng 

Add helpers to invoke KVM_SET_USER_MEMORY_REGION2 directly so that tests
can validate features that are unique to "version 2" of "set user
memory region", e.g. do negative testing on gmem_fd and gmem_offset.

Provide a raw version as well as an assert-success version to reduce
the amount of boilerplate code needed for basic usage.
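
For example (a sketch; the MEM_REGION_* constants and the fds come from the
tests added later in the series):

	/* Raw variant for negative testing, e.g. a bogus guest_memfd fd: */
	r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_GUEST_MEMFD,
					 MEM_REGION_GPA, MEM_REGION_SIZE, NULL,
					 bogus_fd, 0);
	TEST_ASSERT(r == -1 && errno == EINVAL, "bogus fd should be rejected");

	/* Assert-success variant for the happy path: */
	vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_GUEST_MEMFD,
				   MEM_REGION_GPA, MEM_REGION_SIZE, NULL,
				   memfd, 0);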

Signed-off-by: Chao Peng 
Signed-off-by: Ackerley Tng 
Signed-off-by: Sean Christopherson 
Message-Id: <20231027182217.3615211-33-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 .../selftests/kvm/include/kvm_util_base.h |  7 +
 tools/testing/selftests/kvm/lib/kvm_util.c| 29 +++
 2 files changed, 36 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 157508c071f3..8ec122f5fcc8 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -522,6 +522,13 @@ void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t 
slot, uint32_t flags,
   uint64_t gpa, uint64_t size, void *hva);
 int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t 
flags,
uint64_t gpa, uint64_t size, void *hva);
+void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t 
flags,
+   uint64_t gpa, uint64_t size, void *hva,
+   uint32_t guest_memfd, uint64_t 
guest_memfd_offset);
+int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t 
flags,
+uint64_t gpa, uint64_t size, void *hva,
+uint32_t guest_memfd, uint64_t 
guest_memfd_offset);
+
 void vm_userspace_mem_region_add(struct kvm_vm *vm,
enum vm_mem_backing_src_type src_type,
uint64_t guest_paddr, uint32_t slot, uint64_t npages,
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index 1c74310f1d44..d05d95cc3693 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -873,6 +873,35 @@ void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t 
slot, uint32_t flags,
errno, strerror(errno));
 }
 
+int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t 
flags,
+uint64_t gpa, uint64_t size, void *hva,
+uint32_t guest_memfd, uint64_t 
guest_memfd_offset)
+{
+   struct kvm_userspace_memory_region2 region = {
+   .slot = slot,
+   .flags = flags,
+   .guest_phys_addr = gpa,
+   .memory_size = size,
+   .userspace_addr = (uintptr_t)hva,
+   .guest_memfd = guest_memfd,
+   .guest_memfd_offset = guest_memfd_offset,
+   };
+
+   return ioctl(vm->fd, KVM_SET_USER_MEMORY_REGION2, ®ion);
+}
+
+void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t 
flags,
+   uint64_t gpa, uint64_t size, void *hva,
+   uint32_t guest_memfd, uint64_t 
guest_memfd_offset)
+{
+   int ret = __vm_set_user_memory_region2(vm, slot, flags, gpa, size, hva,
+  guest_memfd, guest_memfd_offset);
+
+   TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION2 failed, errno = %d (%s)",
+   errno, strerror(errno));
+}
+
+
 /* FIXME: This thing needs to be ripped apart and rewritten. */
 void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
uint64_t guest_paddr, uint32_t slot, uint64_t npages,
-- 
2.39.1




[PATCH 29/34] KVM: selftests: Add x86-only selftest for private memory conversions

2023-11-05 Thread Paolo Bonzini
From: Vishal Annapurve 

Add a selftest to exercise implicit/explicit conversion functionality
within KVM and verify:

 - Shared memory is visible to host userspace
 - Private memory is not visible to host userspace
 - Host userspace and guest can communicate over shared memory
 - Data in shared backing is preserved across conversions (test's
   host userspace doesn't free the data)
 - Private memory is bound to the lifetime of the VM

Ideally, KVM's selftests infrastructure would be reworked to allow backing
a single region of guest memory with multiple memslots for _all_ backing
types and shapes, i.e. ideally the code for using a single backing fd
across multiple memslots would work for "regular" memory as well.  But
sadly, support for KVM_CREATE_GUEST_MEMFD has languished for far too long,
and overhauling selftests' memslots infrastructure would likely open a can
of worms, i.e. delay things even further.

In addition to the more obvious tests, verify that PUNCH_HOLE actually
frees memory.  Directly verifying that KVM frees memory is impractical, if
it's even possible, so instead indirectly verify memory is freed by
asserting that the guest reads zeroes after a PUNCH_HOLE.  E.g. if KVM
zaps SPTEs but doesn't actually punch a hole in the inode, the subsequent
read will still see the previous value.  And obviously punching a hole
shouldn't cause explosions.
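
For instance, the guest-side check boils down to (memcmp_g() is the helper
defined in the new test; gpa/size are placeholders):

	/* The host punched a hole in the private range, so the stale pattern
	 * must be gone and the guest must read back zeroes. */
	memcmp_g(gpa, 0, size);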

Let the user specify the number of memslots in the private mem conversion
test, i.e. don't require the number of memslots to be '1' or "nr_vcpus".
Creating more memslots than vCPUs is particularly interesting, e.g. it can
result in a single KVM_SET_MEMORY_ATTRIBUTES spanning multiple memslots.
To keep the math reasonable, align each vCPU's chunk to at least 2MiB (the
size is 2MiB+4KiB), and require the total size to be cleanly divisible by
the number of memslots.  The goal is to be able to validate that KVM plays
nice with multiple memslots, being able to create a truly arbitrary number
of memslots doesn't add meaningful value, i.e. isn't worth the cost.

Intentionally don't take a requirement on KVM_CAP_GUEST_MEMFD,
KVM_CAP_MEMORY_FAULT_INFO, KVM_MEMORY_ATTRIBUTE_PRIVATE, etc., as it's a
KVM bug to advertise KVM_X86_SW_PROTECTED_VM without its prerequisites.

Signed-off-by: Vishal Annapurve 
Co-developed-by: Ackerley Tng 
Signed-off-by: Ackerley Tng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Message-Id: <20231027182217.3615211-32-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 tools/testing/selftests/kvm/Makefile  |   1 +
 .../kvm/x86_64/private_mem_conversions_test.c | 482 ++
 2 files changed, 483 insertions(+)
 create mode 100644 
tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c

diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index a5963ab9215b..ecdea5e7afa8 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -91,6 +91,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/monitor_mwait_test
 TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test
 TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
 TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
+TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
 TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
 TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
 TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c 
b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
new file mode 100644
index ..4d6a37a5d896
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
@@ -0,0 +1,482 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022, Google LLC.
+ */
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#define BASE_DATA_SLOT 10
+#define BASE_DATA_GPA  ((uint64_t)(1ull << 32))
+#define PER_CPU_DATA_SIZE  ((uint64_t)(SZ_2M + PAGE_SIZE))
+
+/* Horrific macro so that the line info is captured accurately :-( */
+#define memcmp_g(gpa, pattern,  size)  
\
+do {   
\
+   uint8_t *mem = (uint8_t *)gpa;  
\
+   size_t i;   
\
+   
\
+   for (i = 0; i < size; i++)   

[PATCH 28/34] KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Add GUEST_SYNC[1-6]() so that tests can pass the maximum amount of
information supported via ucall(), without needing to resort to shared
memory.
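
A hedged usage sketch (handle_sync() is a made-up host-side handler;
get_ucall() and struct ucall are the existing selftest plumbing):

	/* Guest: report a GPA, a size and two flags in a single exit. */
	GUEST_SYNC4(gpa, size, flag_a, flag_b);

	/* Host: the values arrive in uc.args[0..3]. */
	struct ucall uc;

	switch (get_ucall(vcpu, &uc)) {
	case UCALL_SYNC:
		handle_sync(vcpu, uc.args[0], uc.args[1], uc.args[2], uc.args[3]);
		break;
	case UCALL_DONE:
		break;
	}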

Signed-off-by: Sean Christopherson 
Message-Id: <20231027182217.3615211-31-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 tools/testing/selftests/kvm/include/ucall_common.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/ucall_common.h 
b/tools/testing/selftests/kvm/include/ucall_common.h
index ce33d306c2cb..0fb472a5a058 100644
--- a/tools/testing/selftests/kvm/include/ucall_common.h
+++ b/tools/testing/selftests/kvm/include/ucall_common.h
@@ -52,6 +52,17 @@ int ucall_nr_pages_required(uint64_t page_size);
 #define GUEST_SYNC_ARGS(stage, arg1, arg2, arg3, arg4) \
ucall(UCALL_SYNC, 6, "hello", stage, arg1, 
arg2, arg3, arg4)
 #define GUEST_SYNC(stage)  ucall(UCALL_SYNC, 2, "hello", stage)
+#define GUEST_SYNC1(arg0)  ucall(UCALL_SYNC, 1, arg0)
+#define GUEST_SYNC2(arg0, arg1)ucall(UCALL_SYNC, 2, arg0, arg1)
+#define GUEST_SYNC3(arg0, arg1, arg2) \
+   ucall(UCALL_SYNC, 3, arg0, arg1, arg2)
+#define GUEST_SYNC4(arg0, arg1, arg2, arg3) \
+   ucall(UCALL_SYNC, 4, arg0, arg1, arg2, arg3)
+#define GUEST_SYNC5(arg0, arg1, arg2, arg3, arg4) \
+   ucall(UCALL_SYNC, 5, arg0, arg1, arg2, arg3, 
arg4)
+#define GUEST_SYNC6(arg0, arg1, arg2, arg3, arg4, arg5) \
+   ucall(UCALL_SYNC, 6, arg0, arg1, arg2, arg3, 
arg4, arg5)
+
 #define GUEST_PRINTF(_fmt, _args...) ucall_fmt(UCALL_PRINTF, _fmt, ##_args)
 #define GUEST_DONE()   ucall(UCALL_DONE, 0)
 
-- 
2.39.1




[PATCH 27/34] KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Add a "vm_shape" structure to encapsulate the selftests-defined "mode",
along with the KVM-defined "type" for use when creating a new VM.  "mode"
tracks physical and virtual address properties, as well as the preferred
backing memory type, while "type" corresponds to the VM type.

Taking the VM type will allow adding tests for KVM_CREATE_GUEST_MEMFD,
a.k.a. guest private memory, without needing an entirely separate set of
helpers.  Guest private memory is effectively usable only by confidential
VM types, and it's expected that x86 will double down and require unique
VM types for TDX and SNP guests.
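
For example (a sketch; KVM_X86_SW_PROTECTED_VM is the x86 type added by
another patch in this series):

	/* Existing tests keep using the default mode and type: */
	vm = __vm_create(VM_SHAPE(VM_MODE_DEFAULT), 1, 0);

	/* New tests can pick an explicit type, e.g. a software-protected VM: */
	const struct vm_shape shape = {
		.mode = VM_MODE_DEFAULT,
		.type = KVM_X86_SW_PROTECTED_VM,
	};

	vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_code);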

Signed-off-by: Sean Christopherson 
Message-Id: <20231027182217.3615211-30-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 tools/testing/selftests/kvm/dirty_log_test.c  |  2 +-
 .../selftests/kvm/include/kvm_util_base.h | 54 +++
 .../selftests/kvm/kvm_page_table_test.c   |  2 +-
 tools/testing/selftests/kvm/lib/kvm_util.c| 43 +++
 tools/testing/selftests/kvm/lib/memstress.c   |  3 +-
 .../kvm/x86_64/ucna_injection_test.c  |  2 +-
 6 files changed, 72 insertions(+), 34 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c 
b/tools/testing/selftests/kvm/dirty_log_test.c
index 936f3a8d1b83..6cbecf499767 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -699,7 +699,7 @@ static struct kvm_vm *create_vm(enum vm_guest_mode mode, 
struct kvm_vcpu **vcpu,
 
pr_info("Testing guest mode: %s\n", vm_guest_mode_string(mode));
 
-   vm = __vm_create(mode, 1, extra_mem_pages);
+   vm = __vm_create(VM_SHAPE(mode), 1, extra_mem_pages);
 
log_mode_create_vm_done(vm);
*vcpu = vm_vcpu_add(vm, 0, guest_code);
diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 1441fca6c273..157508c071f3 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -188,6 +188,23 @@ enum vm_guest_mode {
NUM_VM_MODES,
 };
 
+struct vm_shape {
+   enum vm_guest_mode mode;
+   unsigned int type;
+};
+
+#define VM_TYPE_DEFAULT0
+
+#define VM_SHAPE(__mode)   \
+({ \
+   struct vm_shape shape = {   \
+   .mode = (__mode),   \
+   .type = VM_TYPE_DEFAULT \
+   };  \
+   \
+   shape;  \
+})
+
 #if defined(__aarch64__)
 
 extern enum vm_guest_mode vm_mode_default;
@@ -220,6 +237,8 @@ extern enum vm_guest_mode vm_mode_default;
 
 #endif
 
+#define VM_SHAPE_DEFAULT   VM_SHAPE(VM_MODE_DEFAULT)
+
 #define MIN_PAGE_SIZE  (1U << MIN_PAGE_SHIFT)
 #define PTES_PER_MIN_PAGE  ptes_per_page(MIN_PAGE_SIZE)
 
@@ -784,21 +803,21 @@ vm_paddr_t vm_alloc_page_table(struct kvm_vm *vm);
  * __vm_create() does NOT create vCPUs, @nr_runnable_vcpus is used purely to
  * calculate the amount of memory needed for per-vCPU data, e.g. stacks.
  */
-struct kvm_vm *vm_create(enum vm_guest_mode mode);
-struct kvm_vm *__vm_create(enum vm_guest_mode mode, uint32_t nr_runnable_vcpus,
+struct kvm_vm *vm_create(struct vm_shape shape);
+struct kvm_vm *__vm_create(struct vm_shape shape, uint32_t nr_runnable_vcpus,
   uint64_t nr_extra_pages);
 
 static inline struct kvm_vm *vm_create_barebones(void)
 {
-   return vm_create(VM_MODE_DEFAULT);
+   return vm_create(VM_SHAPE_DEFAULT);
 }
 
 static inline struct kvm_vm *vm_create(uint32_t nr_runnable_vcpus)
 {
-   return __vm_create(VM_MODE_DEFAULT, nr_runnable_vcpus, 0);
+   return __vm_create(VM_SHAPE_DEFAULT, nr_runnable_vcpus, 0);
 }
 
-struct kvm_vm *__vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t 
nr_vcpus,
+struct kvm_vm *__vm_create_with_vcpus(struct vm_shape shape, uint32_t nr_vcpus,
  uint64_t extra_mem_pages,
  void *guest_code, struct kvm_vcpu 
*vcpus[]);
 
@@ -806,17 +825,27 @@ static inline struct kvm_vm 
*vm_create_with_vcpus(uint32_t nr_vcpus,
  void *guest_code,
  struct kvm_vcpu *vcpus[])
 {
-   return __vm_create_with_vcpus(VM_MODE_DEFAULT, nr_vcpus, 0,
+   return __vm_create_with_vcpus(VM_SHAPE_DEFAULT, nr_vcpus, 0,
  guest_code, vcpus);
 }
 
+
+struct kvm_vm *__vm_create_shape_with_one_vcpu(struct vm_shape shape,
+  struct kvm_vcpu **vcpu,
+  uint64_t extra_mem_pages,
+  

[PATCH 26/34] KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86)

2023-11-05 Thread Paolo Bonzini
From: Vishal Annapurve 

Add helpers for x86 guests to invoke the KVM_HC_MAP_GPA_RANGE hypercall,
which KVM will forward to userspace and thus can be used by tests to
coordinate private<=>shared conversions between host userspace code and
guest code.
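
Guest-side usage sketch (KVM_MAP_GPA_RANGE_DECRYPTED/_ENCRYPTED are assumed
to be the existing uapi flags for KVM_HC_MAP_GPA_RANGE):

	/* Ask the host to convert the page to shared (decrypted)... */
	kvm_hypercall_map_gpa_range(gpa, PAGE_SIZE, KVM_MAP_GPA_RANGE_DECRYPTED);

	/* ... and later back to private (encrypted). */
	kvm_hypercall_map_gpa_range(gpa, PAGE_SIZE, KVM_MAP_GPA_RANGE_ENCRYPTED);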

Signed-off-by: Vishal Annapurve 
[sean: drop shared/private helpers (let tests specify flags)]
Signed-off-by: Sean Christopherson 
Message-Id: <20231027182217.3615211-29-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 .../selftests/kvm/include/x86_64/processor.h  | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h 
b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 25bc61dac5fb..a84863503fcb 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -15,6 +15,7 @@
 #include 
 #include 
 
+#include 
 #include 
 
 #include "../kvm_util.h"
@@ -1194,6 +1195,20 @@ uint64_t kvm_hypercall(uint64_t nr, uint64_t a0, 
uint64_t a1, uint64_t a2,
 uint64_t __xen_hypercall(uint64_t nr, uint64_t a0, void *a1);
 void xen_hypercall(uint64_t nr, uint64_t a0, void *a1);
 
+static inline uint64_t __kvm_hypercall_map_gpa_range(uint64_t gpa,
+uint64_t size, uint64_t 
flags)
+{
+   return kvm_hypercall(KVM_HC_MAP_GPA_RANGE, gpa, size >> PAGE_SHIFT, 
flags, 0);
+}
+
+static inline void kvm_hypercall_map_gpa_range(uint64_t gpa, uint64_t size,
+  uint64_t flags)
+{
+   uint64_t ret = __kvm_hypercall_map_gpa_range(gpa, size, flags);
+
+   GUEST_ASSERT(!ret);
+}
+
 void __vm_xsave_require_permission(uint64_t xfeature, const char *name);
 
 #define vm_xsave_require_permission(xfeature)  \
-- 
2.39.1




[PATCH 25/34] KVM: selftests: Add helpers to convert guest memory b/w private and shared

2023-11-05 Thread Paolo Bonzini
From: Vishal Annapurve 

Add helpers to convert memory between private and shared via KVM's
memory attributes, as well as helpers to free/allocate guest_memfd memory
via fallocate().  Userspace, i.e. tests, is NOT required to do fallocate()
when converting memory, as the attributes are the single source of truth.
Provide allocate() helpers so that tests can mimic a userspace that frees
private memory on conversion, e.g. to prioritize memory usage over
performance.
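
A usage sketch built on the helpers below (gpa/size are placeholders):

	/* Convert a range to shared and free the now-unused private backing,
	 * mimicking a VMM that prioritizes memory usage over performance. */
	vm_mem_set_shared(vm, gpa, size);
	vm_guest_mem_punch_hole(vm, gpa, size);

	/* Convert back to private; no fallocate() is required, the attributes
	 * are the single source of truth. */
	vm_mem_set_private(vm, gpa, size);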

Signed-off-by: Vishal Annapurve 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Message-Id: <20231027182217.3615211-28-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 .../selftests/kvm/include/kvm_util_base.h | 48 +++
 tools/testing/selftests/kvm/lib/kvm_util.c| 28 +++
 2 files changed, 76 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 9f861182c02a..1441fca6c273 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -333,6 +333,54 @@ static inline void vm_enable_cap(struct kvm_vm *vm, 
uint32_t cap, uint64_t arg0)
vm_ioctl(vm, KVM_ENABLE_CAP, &enable_cap);
 }
 
+static inline void vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
+   uint64_t size, uint64_t attributes)
+{
+   struct kvm_memory_attributes attr = {
+   .attributes = attributes,
+   .address = gpa,
+   .size = size,
+   .flags = 0,
+   };
+
+   /*
+* KVM_SET_MEMORY_ATTRIBUTES overwrites _all_ attributes.  These flows
+* need significant enhancements to support multiple attributes.
+*/
+   TEST_ASSERT(!attributes || attributes == KVM_MEMORY_ATTRIBUTE_PRIVATE,
+   "Update me to support multiple attributes!");
+
+   vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES, &attr);
+}
+
+
+static inline void vm_mem_set_private(struct kvm_vm *vm, uint64_t gpa,
+ uint64_t size)
+{
+   vm_set_memory_attributes(vm, gpa, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
+}
+
+static inline void vm_mem_set_shared(struct kvm_vm *vm, uint64_t gpa,
+uint64_t size)
+{
+   vm_set_memory_attributes(vm, gpa, size, 0);
+}
+
+void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t gpa, uint64_t size,
+   bool punch_hole);
+
+static inline void vm_guest_mem_punch_hole(struct kvm_vm *vm, uint64_t gpa,
+  uint64_t size)
+{
+   vm_guest_mem_fallocate(vm, gpa, size, true);
+}
+
+static inline void vm_guest_mem_allocate(struct kvm_vm *vm, uint64_t gpa,
+uint64_t size)
+{
+   vm_guest_mem_fallocate(vm, gpa, size, false);
+}
+
 void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size);
 const char *vm_guest_mode_string(uint32_t i);
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index b63500fca627..95a553400ea9 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1167,6 +1167,34 @@ void vm_mem_region_delete(struct kvm_vm *vm, uint32_t 
slot)
__vm_mem_region_delete(vm, memslot2region(vm, slot), true);
 }
 
+void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t base, uint64_t size,
+   bool punch_hole)
+{
+   const int mode = FALLOC_FL_KEEP_SIZE | (punch_hole ? 
FALLOC_FL_PUNCH_HOLE : 0);
+   struct userspace_mem_region *region;
+   uint64_t end = base + size;
+   uint64_t gpa, len;
+   off_t fd_offset;
+   int ret;
+
+   for (gpa = base; gpa < end; gpa += len) {
+   uint64_t offset;
+
+   region = userspace_mem_region_find(vm, gpa, gpa);
+   TEST_ASSERT(region && region->region.flags & 
KVM_MEM_GUEST_MEMFD,
+   "Private memory region not found for GPA 0x%lx", 
gpa);
+
+   offset = (gpa - region->region.guest_phys_addr);
+   fd_offset = region->region.guest_memfd_offset + offset;
+   len = min_t(uint64_t, end - gpa, region->region.memory_size - 
offset);
+
+   ret = fallocate(region->region.guest_memfd, mode, fd_offset, 
len);
+   TEST_ASSERT(!ret, "fallocate() failed to %s at %lx (len = %lu), 
fd = %d, mode = %x, offset = %lx\n",
+   punch_hole ? "punch hole" : "allocate", gpa, len,
+   region->region.guest_memfd, mode, fd_offset);
+   }
+}
+
 /* Returns the size of a vCPU's kvm_run structure. */
 static int vcpu_mmap_sz(void)
 {
-- 
2.39.1




[PATCH 24/34] KVM: selftests: Add support for creating private memslots

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Add support for creating "private" memslots via KVM_CREATE_GUEST_MEMFD and
KVM_SET_USER_MEMORY_REGION2.  Make vm_userspace_mem_region_add() a wrapper
to its effective replacement, vm_mem_add(), so that private memslots are
fully opt-in, i.e. don't require updating all tests that add memory regions.

Pivot on the KVM_MEM_GUEST_MEMFD flag instead of the validity of the "gmem"
file descriptor so that simple tests can let vm_mem_add() do the heavy
lifting of creating the guest memfd, but also allow the caller to pass in
an explicit fd+offset so that fancier tests can do things like back
multiple memslots with a single file.  If the caller passes in a fd, dup()
the fd so that (a) __vm_mem_region_delete() can close the fd associated
with the memory region without needing yet another flag, and (b) so that
the caller can safely close its copy of the fd without having to first
destroy memslots.
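
A hedged sketch of the "fancier" case, backing two memslots with a single
guest_memfd (slot/gpa/size/npages are placeholders):

	int fd = vm_create_guest_memfd(vm, 2 * size, 0);

	vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, gpa, slot, npages,
		   KVM_MEM_GUEST_MEMFD, fd, 0);
	vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, gpa + size, slot + 1, npages,
		   KVM_MEM_GUEST_MEMFD, fd, size);

	/* vm_mem_add() dup()s the fd, so the caller's copy can be closed now. */
	close(fd);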

Co-developed-by: Ackerley Tng 
Signed-off-by: Ackerley Tng 
Signed-off-by: Sean Christopherson 
Message-Id: <20231027182217.3615211-27-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 .../selftests/kvm/include/kvm_util_base.h | 23 ++
 .../testing/selftests/kvm/include/test_util.h |  5 ++
 tools/testing/selftests/kvm/lib/kvm_util.c| 76 +++
 3 files changed, 73 insertions(+), 31 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 9f144841c2ee..9f861182c02a 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -431,6 +431,26 @@ static inline uint64_t vm_get_stat(struct kvm_vm *vm, 
const char *stat_name)
 
 void vm_create_irqchip(struct kvm_vm *vm);
 
+static inline int __vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size,
+   uint64_t flags)
+{
+   struct kvm_create_guest_memfd guest_memfd = {
+   .size = size,
+   .flags = flags,
+   };
+
+   return __vm_ioctl(vm, KVM_CREATE_GUEST_MEMFD, &guest_memfd);
+}
+
+static inline int vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size,
+   uint64_t flags)
+{
+   int fd = __vm_create_guest_memfd(vm, size, flags);
+
+   TEST_ASSERT(fd >= 0, KVM_IOCTL_ERROR(KVM_CREATE_GUEST_MEMFD, fd));
+   return fd;
+}
+
 void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t 
flags,
   uint64_t gpa, uint64_t size, void *hva);
 int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t 
flags,
@@ -439,6 +459,9 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
enum vm_mem_backing_src_type src_type,
uint64_t guest_paddr, uint32_t slot, uint64_t npages,
uint32_t flags);
+void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
+   uint64_t guest_paddr, uint32_t slot, uint64_t npages,
+   uint32_t flags, int guest_memfd_fd, uint64_t 
guest_memfd_offset);
 
 void vm_mem_region_set_flags(struct kvm_vm *vm, uint32_t slot, uint32_t flags);
 void vm_mem_region_move(struct kvm_vm *vm, uint32_t slot, uint64_t new_gpa);
diff --git a/tools/testing/selftests/kvm/include/test_util.h 
b/tools/testing/selftests/kvm/include/test_util.h
index 7e614adc6cf4..7257f2243ab9 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -142,6 +142,11 @@ static inline bool backing_src_is_shared(enum 
vm_mem_backing_src_type t)
return vm_mem_backing_src_alias(t)->flag & MAP_SHARED;
 }
 
+static inline bool backing_src_can_be_huge(enum vm_mem_backing_src_type t)
+{
+   return t != VM_MEM_SRC_ANONYMOUS && t != VM_MEM_SRC_SHMEM;
+}
+
 /* Aligns x up to the next multiple of size. Size must be a power of 2. */
 static inline uint64_t align_up(uint64_t x, uint64_t size)
 {
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index 3676b37bea38..b63500fca627 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -669,6 +669,8 @@ static void __vm_mem_region_delete(struct kvm_vm *vm,
TEST_ASSERT(!ret, __KVM_SYSCALL_ERROR("munmap()", ret));
close(region->fd);
}
+   if (region->region.guest_memfd >= 0)
+   close(region->region.guest_memfd);
 
free(region);
 }
@@ -870,36 +872,15 @@ void vm_set_user_memory_region(struct kvm_vm *vm, 
uint32_t slot, uint32_t flags,
errno, strerror(errno));
 }
 
-/*
- * VM Userspace Memory Region Add
- *
- * Input Args:
- *   vm - Virtual Machine
- *   src_type - Storage source for this region.
- *  NULL to use anonymous memory.
- *   guest_paddr - Starting guest physical address
- *   slot - KVM region s

[PATCH 23/34] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Use KVM_SET_USER_MEMORY_REGION2 throughout KVM's selftests library so that
support for guest private memory can be added without needing an entirely
separate set of helpers.

Note, this obviously makes selftests backwards-incompatible with older KVM
versions from this point forward.

Signed-off-by: Sean Christopherson 
Message-Id: <20231027182217.3615211-26-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 .../selftests/kvm/include/kvm_util_base.h |  2 +-
 tools/testing/selftests/kvm/lib/kvm_util.c| 19 ++-
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 967eaaeacd75..9f144841c2ee 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -44,7 +44,7 @@ typedef uint64_t vm_paddr_t; /* Virtual Machine (Guest) 
physical address */
 typedef uint64_t vm_vaddr_t; /* Virtual Machine (Guest) virtual address */
 
 struct userspace_mem_region {
-   struct kvm_userspace_memory_region region;
+   struct kvm_userspace_memory_region2 region;
struct sparsebit *unused_phy_pages;
int fd;
off_t offset;
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index f09295d56c23..3676b37bea38 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -453,8 +453,9 @@ void kvm_vm_restart(struct kvm_vm *vmp)
vm_create_irqchip(vmp);
 
hash_for_each(vmp->regions.slot_hash, ctr, region, slot_node) {
-   int ret = ioctl(vmp->fd, KVM_SET_USER_MEMORY_REGION, 
®ion->region);
-   TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL 
failed,\n"
+   int ret = ioctl(vmp->fd, KVM_SET_USER_MEMORY_REGION2, 
®ion->region);
+
+   TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL 
failed,\n"
"  rc: %i errno: %i\n"
"  slot: %u flags: 0x%x\n"
"  guest_phys_addr: 0x%llx size: 0x%llx",
@@ -657,7 +658,7 @@ static void __vm_mem_region_delete(struct kvm_vm *vm,
}
 
region->region.memory_size = 0;
-   vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, ®ion->region);
+   vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, ®ion->region);
 
sparsebit_free(®ion->unused_phy_pages);
ret = munmap(region->mmap_start, region->mmap_size);
@@ -1014,8 +1015,8 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
region->region.guest_phys_addr = guest_paddr;
region->region.memory_size = npages * vm->page_size;
region->region.userspace_addr = (uintptr_t) region->host_mem;
-   ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, ®ion->region);
-   TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
+   ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, ®ion->region);
+   TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
"  rc: %i errno: %i\n"
"  slot: %u flags: 0x%x\n"
"  guest_phys_addr: 0x%lx size: 0x%lx",
@@ -1097,9 +1098,9 @@ void vm_mem_region_set_flags(struct kvm_vm *vm, uint32_t 
slot, uint32_t flags)
 
region->region.flags = flags;
 
-   ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, ®ion->region);
+   ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, ®ion->region);
 
-   TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
+   TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
"  rc: %i errno: %i slot: %u flags: 0x%x",
ret, errno, slot, flags);
 }
@@ -1127,9 +1128,9 @@ void vm_mem_region_move(struct kvm_vm *vm, uint32_t slot, 
uint64_t new_gpa)
 
region->region.guest_phys_addr = new_gpa;
 
-   ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, ®ion->region);
+   ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, ®ion->region);
 
-   TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION failed\n"
+   TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION2 failed\n"
"ret: %i errno: %i slot: %u new_gpa: 0x%lx",
ret, errno, slot, new_gpa);
 }
-- 
2.39.1




[PATCH 22/34] KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Drop kvm_userspace_memory_region_find(), it's unused and a terrible API
(probably why it's unused).  If anything outside of kvm_util.c needs to
get at the memslot, userspace_mem_region_find() can be exposed to give
others full access to all memory region/slot information.

Signed-off-by: Sean Christopherson 
Message-Id: <20231027182217.3615211-25-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 .../selftests/kvm/include/kvm_util_base.h |  4 ---
 tools/testing/selftests/kvm/lib/kvm_util.c| 29 ---
 2 files changed, 33 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index a18db6a7b3cf..967eaaeacd75 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -776,10 +776,6 @@ vm_adjust_num_guest_pages(enum vm_guest_mode mode, 
unsigned int num_guest_pages)
return n;
 }
 
-struct kvm_userspace_memory_region *
-kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start,
-uint64_t end);
-
 #define sync_global_to_guest(vm, g) ({ \
typeof(g) *_p = addr_gva2hva(vm, (vm_vaddr_t)&(g)); \
memcpy(_p, &(g), sizeof(g));\
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index 7a8af1821f5d..f09295d56c23 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -590,35 +590,6 @@ userspace_mem_region_find(struct kvm_vm *vm, uint64_t 
start, uint64_t end)
return NULL;
 }
 
-/*
- * KVM Userspace Memory Region Find
- *
- * Input Args:
- *   vm - Virtual Machine
- *   start - Starting VM physical address
- *   end - Ending VM physical address, inclusive.
- *
- * Output Args: None
- *
- * Return:
- *   Pointer to overlapping region, NULL if no such region.
- *
- * Public interface to userspace_mem_region_find. Allows tests to look up
- * the memslot datastructure for a given range of guest physical memory.
- */
-struct kvm_userspace_memory_region *
-kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start,
-uint64_t end)
-{
-   struct userspace_mem_region *region;
-
-   region = userspace_mem_region_find(vm, start, end);
-   if (!region)
-   return NULL;
-
-   return &region->region;
-}
-
 __weak void vcpu_arch_free(struct kvm_vcpu *vcpu)
 {
 
-- 
2.39.1




[PATCH 21/34] KVM: x86: Add support for "protected VMs" that can utilize private memory

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
and testing vehicle for Confidential (CoCo) VMs, and potentially to even
become a "real" product in the distant future, e.g. a la pKVM.

The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
Intel's TDX, but those technologies are extremely complex (understatement),
difficult to debug, don't support running as nested guests, and require
hardware that isn't universally accessible.  I.e. relying on SEV-SNP or TDX
for maintaining guest private memory isn't a realistic option.

At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
selftests for guest_memfd and private memory support without requiring
unique hardware.

Signed-off-by: Sean Christopherson 
Reviewed-by: Paolo Bonzini 
Message-Id: <20231027182217.3615211-24-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 Documentation/virt/kvm/api.rst  | 32 
 arch/x86/include/asm/kvm_host.h | 15 +--
 arch/x86/include/uapi/asm/kvm.h |  3 +++
 arch/x86/kvm/Kconfig| 12 
 arch/x86/kvm/mmu/mmu_internal.h |  1 +
 arch/x86/kvm/x86.c  | 16 +++-
 include/uapi/linux/kvm.h|  1 +
 virt/kvm/Kconfig|  5 +
 8 files changed, 78 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 4a9a291380ad..38882263278d 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -147,10 +147,29 @@ described as 'basic' will be available.
 The new VM has no virtual cpus and no memory.
 You probably want to use 0 as machine type.
 
+X86:
+
+
+Supported X86 VM types can be queried via KVM_CAP_VM_TYPES.
+
+S390:
+^
+
 In order to create user controlled virtual machines on S390, check
 KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
 privileged user (CAP_SYS_ADMIN).
 
+MIPS:
+^
+
+To use hardware assisted virtualization on MIPS (VZ ASE) rather than
+the default trap & emulate implementation (which changes the virtual
+memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
+flag KVM_VM_MIPS_VZ.
+
+ARM64:
+^^
+
 On arm64, the physical address size for a VM (IPA Size limit) is limited
 to 40bits by default. The limit can be configured if the host supports the
 extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
@@ -8766,6 +8785,19 @@ block sizes is exposed in 
KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
 64-bit bitmap (each bit describing a block size). The default value is
 0, to disable the eager page splitting.
 
+8.41 KVM_CAP_VM_TYPES
+-
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: system ioctl
+
+This capability returns a bitmap of support VM types.  The 1-setting of bit @n
+means the VM type with value @n is supported.  Possible values of @n are::
+
+  #define KVM_X86_DEFAULT_VM   0
+  #define KVM_X86_SW_PROTECTED_VM  1
+
 9. Known KVM API problems
 =
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 75ab0da06e64..a565a2e70f30 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1255,6 +1255,7 @@ enum kvm_apicv_inhibit {
 };
 
 struct kvm_arch {
+   unsigned long vm_type;
unsigned long n_used_mmu_pages;
unsigned long n_requested_mmu_pages;
unsigned long n_max_mmu_pages;
@@ -2089,6 +2090,12 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t 
new_pgd);
 void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
   int tdp_max_root_level, int tdp_huge_page_level);
 
+#ifdef CONFIG_KVM_PRIVATE_MEM
+#define kvm_arch_has_private_mem(kvm) ((kvm)->arch.vm_type != 
KVM_X86_DEFAULT_VM)
+#else
+#define kvm_arch_has_private_mem(kvm) false
+#endif
+
 static inline u16 kvm_read_ldt(void)
 {
u16 ldt;
@@ -2137,14 +2144,10 @@ enum {
 #define HF_SMM_INSIDE_NMI_MASK (1 << 2)
 
 # define KVM_MAX_NR_ADDRESS_SPACES 2
+/* SMM is currently unsupported for guests with private memory. */
+# define kvm_arch_nr_memslot_as_ids(kvm) (kvm_arch_has_private_mem(kvm) ? 1 : 
2)
 # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 
1 : 0)
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
-
-static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
-{
-   return KVM_MAX_NR_ADDRESS_SPACES;
-}
-
 #else
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
 #endif
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 1a6a1f987949..a448d0964fc0 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -562,4 +562,7 @@ struct kvm_pmu_event_filter {
 /* x86-specific KVM_EXIT_HYPERCALL flags. */
 #define KVM_EXIT_HYPERCALL_LONG_MODE  

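As a rough illustration of the new capability and VM type (a sketch, not code
from this series; it assumes a uapi header that defines KVM_CAP_VM_TYPES and
KVM_X86_SW_PROTECTED_VM with the values shown in the documentation hunk above):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm_fd = open("/dev/kvm", O_RDWR);
	long types = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_VM_TYPES);

	/* Bit @n set => VM type with value @n is supported. */
	if (types & (1u << KVM_X86_SW_PROTECTED_VM)) {
		int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_SW_PROTECTED_VM);
		printf("sw-protected VM fd = %d\n", vm_fd);
	}
	return 0;
}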
[PATCH 20/34] KVM: Allow arch code to track number of memslot address spaces per VM

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Let x86 track the number of address spaces on a per-VM basis so that KVM
can disallow SMM memslots for confidential VMs.  Confidential VMs are
fundamentally incompatible with emulating SMM, which as the name suggests
requires being able to read and write guest memory and register state.

Disallowing SMM will simplify support for guest private memory, as KVM
will not need to worry about tracking memory attributes for multiple
address spaces (SMM is the only "non-default" address space across all
architectures).

Signed-off-by: Sean Christopherson 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
Message-Id: <20231027182217.3615211-23-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/powerpc/kvm/book3s_hv.c|  2 +-
 arch/x86/include/asm/kvm_host.h |  8 +++-
 arch/x86/kvm/debugfs.c  |  2 +-
 arch/x86/kvm/mmu/mmu.c  |  6 +++---
 arch/x86/kvm/x86.c  |  2 +-
 include/linux/kvm_host.h| 17 +++--
 virt/kvm/dirty_ring.c   |  2 +-
 virt/kvm/kvm_main.c | 26 ++
 8 files changed, 39 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 130bafdb1430..9b0eaa17275a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -6084,7 +6084,7 @@ static int kvmhv_svm_off(struct kvm *kvm)
}
 
srcu_idx = srcu_read_lock(&kvm->srcu);
-   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
struct kvm_memory_slot *memslot;
struct kvm_memslots *slots = __kvm_memslots(kvm, i);
int bkt;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 061eec231299..75ab0da06e64 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2136,9 +2136,15 @@ enum {
 #define HF_SMM_MASK(1 << 1)
 #define HF_SMM_INSIDE_NMI_MASK (1 << 2)
 
-# define KVM_ADDRESS_SPACE_NUM 2
+# define KVM_MAX_NR_ADDRESS_SPACES 2
 # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 
1 : 0)
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
+
+static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
+{
+   return KVM_MAX_NR_ADDRESS_SPACES;
+}
+
 #else
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
 #endif
diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
index ee8c4c3496ed..42026b3f3ff3 100644
--- a/arch/x86/kvm/debugfs.c
+++ b/arch/x86/kvm/debugfs.c
@@ -111,7 +111,7 @@ static int kvm_mmu_rmaps_stat_show(struct seq_file *m, void 
*v)
mutex_lock(&kvm->slots_lock);
write_lock(&kvm->mmu_lock);
 
-   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
int bkt;
 
slots = __kvm_memslots(kvm, i);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 754a5aaebee5..4de7670d5976 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3763,7 +3763,7 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
kvm_page_track_write_tracking_enabled(kvm))
goto out_success;
 
-   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
slots = __kvm_memslots(kvm, i);
kvm_for_each_memslot(slot, bkt, slots) {
/*
@@ -6309,7 +6309,7 @@ static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t 
gfn_start, gfn_t gfn_e
if (!kvm_memslots_have_rmaps(kvm))
return flush;
 
-   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
slots = __kvm_memslots(kvm, i);
 
kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, 
gfn_end) {
@@ -6806,7 +6806,7 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 
gen)
 * modifier prior to checking for a wrap of the MMIO generation so
 * that a wrap in any address space is detected.
 */
-   gen &= ~((u64)KVM_ADDRESS_SPACE_NUM - 1);
+   gen &= ~((u64)kvm_arch_nr_memslot_as_ids(kvm) - 1);
 
/*
 * The very rare case: if the MMIO generation number has wrapped,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e1aad0c81f6f..f521c97f5c64 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12577,7 +12577,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, 
int id, gpa_t gpa,
hva = slot->userspace_addr;
}
 
-   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
struct 

[PATCH 19/34] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Drop __KVM_VCPU_MULTIPLE_ADDRESS_SPACE and instead check the value of
KVM_ADDRESS_SPACE_NUM.

No functional change intended.

Reviewed-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
Message-Id: <20231027182217.3615211-22-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/include/asm/kvm_host.h | 1 -
 include/linux/kvm_host.h| 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fa0d42202405..061eec231299 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2136,7 +2136,6 @@ enum {
 #define HF_SMM_MASK(1 << 1)
 #define HF_SMM_INSIDE_NMI_MASK (1 << 2)
 
-# define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
 # define KVM_ADDRESS_SPACE_NUM 2
 # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 
1 : 0)
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 67dfd4d79529..db423ea9e3a4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -690,7 +690,7 @@ bool kvm_arch_irqchip_in_kernel(struct kvm *kvm);
 #define KVM_MEM_SLOTS_NUM SHRT_MAX
 #define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
 
-#ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
+#if KVM_ADDRESS_SPACE_NUM == 1
 static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 {
return 0;
-- 
2.39.1




[PATCH 18/34] KVM: x86/mmu: Handle page fault for private memory

2023-11-05 Thread Paolo Bonzini
From: Chao Peng 

Add support for resolving page faults on guest private memory for VMs
that differentiate between "shared" and "private" memory.  For such VMs,
KVM_MEM_PRIVATE memslots can include both fd-based private memory and
hva-based shared memory, and KVM needs to map in the "correct" variant,
i.e. KVM needs to map the gfn shared/private as appropriate based on the
current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag.

For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request
shared vs. private via a bit in the guest page tables, i.e. what the guest
wants may conflict with the current memory attributes.  To support such
"implicit" conversion requests, exit to user with KVM_EXIT_MEMORY_FAULT
to forward the request to userspace.  Add a new flag for memory faults,
KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to
map memory as shared vs. private.

Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory
so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace
needs such information, e.g. a likely user of KVM_EXIT_MEMORY_FAULT is to
exit on missing mappings when handling guest page fault VM-Exits.  In
that case, userspace will want to know RWX information in order to
correctly/precisely resolve the fault.

Note, private memory *must* be backed by guest_memfd, i.e. shared mappings
always come from the host userspace page tables, and private mappings
always come from a guest_memfd instance.

Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Signed-off-by: Chao Peng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
Message-Id: <20231027182217.3615211-21-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 Documentation/virt/kvm/api.rst  |   8 ++-
 arch/x86/kvm/mmu/mmu.c  | 101 ++--
 arch/x86/kvm/mmu/mmu_internal.h |   1 +
 include/linux/kvm_host.h|   8 ++-
 include/uapi/linux/kvm.h|   1 +
 5 files changed, 110 insertions(+), 9 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 6d681f45969e..4a9a291380ad 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6953,6 +6953,7 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
 
/* KVM_EXIT_MEMORY_FAULT */
struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)
__u64 flags;
__u64 gpa;
__u64 size;
@@ -6961,8 +6962,11 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
 KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
 could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
 guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
-describes properties of the faulting access that are likely pertinent.
-Currently, no flags are defined.
+describes properties of the faulting access that are likely pertinent:
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
+   on a private memory access.  When clear, indicates the fault occurred on a
+   shared access.
 
 Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
 accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f5c6b0643645..754a5aaebee5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3147,9 +3147,9 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t 
gfn,
return level;
 }
 
-int kvm_mmu_max_mapping_level(struct kvm *kvm,
- const struct kvm_memory_slot *slot, gfn_t gfn,
- int max_level)
+static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
+  const struct kvm_memory_slot *slot,
+  gfn_t gfn, int max_level, bool 
is_private)
 {
struct kvm_lpage_info *linfo;
int host_level;
@@ -3161,6 +3161,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
break;
}
 
+   if (is_private)
+   return max_level;
+
if (max_level == PG_LEVEL_4K)
return PG_LEVEL_4K;
 
@@ -3168,6 +3171,16 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
return min(host_level, max_level);
 }
 
+int kvm_mmu_max_mapping_level(struct kvm *kvm,
+ const struct kvm_memory_slot *slot, gfn_t gfn,
+ int max_level)
+{
+   bool is_private = kvm_slot_can_be_private(slot) &&
+ kvm_mem_is_private(kvm, gfn);
+
+   return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level, 
is_private);
+}

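As a rough illustration of how a VMM might service the implicit-conversion
exits described above (a sketch, not code from this series; vm_fd, the mmap'd
run structure, and the helper name are assumptions):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Flip the gfn range's PRIVATE attribute to match what the guest accessed. */
static void handle_implicit_conversion(int vm_fd, struct kvm_run *run)
{
	struct kvm_memory_attributes attrs = {
		.address = run->memory_fault.gpa,
		.size    = run->memory_fault.size,
	};

	/* Guest accessed the range as private => mark it private, else shared. */
	if (run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)
		attrs.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE;

	ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}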
[PATCH 17/34] KVM: x86: Disallow hugepages when memory attributes are mixed

2023-11-05 Thread Paolo Bonzini
From: Chao Peng 

Disallow creating hugepages with mixed memory attributes, e.g. shared
versus private, as mapping a hugepage in this case would allow the guest
to access memory with the wrong attributes, e.g. overlaying private memory
with a shared hugepage.

Track whether or not attributes are mixed via the existing
disallow_lpage field, but use the most significant bit in 'disallow_lpage'
to indicate a hugepage has mixed attributes instead of using the normal
refcounting.  Whether or not attributes are mixed is binary; either they
are or they aren't.  Attempting to squeeze that info into the refcount is
unnecessarily complex as it would require knowing the previous state of
the mixed count when updating attributes.  Using a flag means KVM just
needs to ensure the current status is reflected in the memslots.

Signed-off-by: Chao Peng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Message-Id: <20231027182217.3615211-20-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/include/asm/kvm_host.h |   3 +
 arch/x86/kvm/mmu/mmu.c  | 154 +++-
 arch/x86/kvm/x86.c  |   4 +
 3 files changed, 159 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6f559fb75e6d..fa0d42202405 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1848,6 +1848,9 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu);
 void kvm_mmu_init_vm(struct kvm *kvm);
 void kvm_mmu_uninit_vm(struct kvm *kvm);
 
+void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
+   struct kvm_memory_slot *slot);
+
 void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu);
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b2d916f786ca..f5c6b0643645 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -795,16 +795,26 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
return &slot->arch.lpage_info[level - 2][idx];
 }
 
+/*
+ * The most significant bit in disallow_lpage tracks whether or not memory
+ * attributes are mixed, i.e. not identical for all gfns at the current level.
+ * The lower order bits are used to refcount other cases where a hugepage is
+ * disallowed, e.g. if KVM has shadow a page table at the gfn.
+ */
+#define KVM_LPAGE_MIXED_FLAG   BIT(31)
+
 static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
gfn_t gfn, int count)
 {
struct kvm_lpage_info *linfo;
-   int i;
+   int old, i;
 
for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
linfo = lpage_info_slot(gfn, slot, i);
+
+   old = linfo->disallow_lpage;
linfo->disallow_lpage += count;
-   WARN_ON_ONCE(linfo->disallow_lpage < 0);
+   WARN_ON_ONCE((old ^ linfo->disallow_lpage) & 
KVM_LPAGE_MIXED_FLAG);
}
 }
 
@@ -7176,3 +7186,143 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
if (kvm->arch.nx_huge_page_recovery_thread)
kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
 }
+
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
+   int level)
+{
+   return lpage_info_slot(gfn, slot, level)->disallow_lpage & 
KVM_LPAGE_MIXED_FLAG;
+}
+
+static void hugepage_clear_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
+int level)
+{
+   lpage_info_slot(gfn, slot, level)->disallow_lpage &= 
~KVM_LPAGE_MIXED_FLAG;
+}
+
+static void hugepage_set_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
+  int level)
+{
+   lpage_info_slot(gfn, slot, level)->disallow_lpage |= 
KVM_LPAGE_MIXED_FLAG;
+}
+
+static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
+  gfn_t gfn, int level, unsigned long attrs)
+{
+   const unsigned long start = gfn;
+   const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
+
+   if (level == PG_LEVEL_2M)
+   return kvm_range_has_memory_attributes(kvm, start, end, attrs);
+
+   for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
+   if (hugepage_test_mixed(slot, gfn, level - 1) ||
+   attrs != kvm_get_memory_attributes(kvm, gfn))
+   return false;
+   }
+   return true;
+}
+
+bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+struct kvm_gfn_range *range)
+{
+   unsigned long attrs = range->arg.attributes;
+   struct kvm_memory_slot *slot = range->slot;
+   int level;
+
+

[PATCH 16/34] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Initialize run->exit_reason to KVM_EXIT_UNKNOWN early in KVM_RUN to reduce
the probability of exiting to userspace with a stale run->exit_reason that
*appears* to be valid.

To support fd-based guest memory (guest memory without a corresponding
userspace virtual address), KVM will exit to userspace for various memory
related errors, which userspace *may* be able to resolve, instead of using
e.g. BUS_MCEERR_AR.  And in the more distant future, KVM will also likely
utilize the same functionality to let userspace "intercept" and handle
memory faults when the userspace mapping is missing, i.e. when fast gup()
fails.

Because many of KVM's internal APIs related to guest memory use '0' to
indicate "success, continue on" and not "exit to userspace", reporting
memory faults/errors to userspace will set run->exit_reason and
corresponding fields in the run structure in conjunction with a
non-zero, negative return code, e.g. -EFAULT or -EHWPOISON.  And because
KVM already returns -EFAULT in many paths, there's a relatively high
probability that KVM could return -EFAULT without setting run->exit_reason,
in which case reporting KVM_EXIT_UNKNOWN is much better than reporting
whatever exit reason happened to be in the run structure.

Note, KVM must wait until after run->immediate_exit is serviced to
sanitize run->exit_reason as KVM's ABI is that run->exit_reason is
preserved across KVM_RUN when run->immediate_exit is true.

Link: https://lore.kernel.org/all/20230908222905.1321305-1-amoor...@google.com
Link: https://lore.kernel.org/all/zffbwoxz5ui%2fg...@google.com
Signed-off-by: Sean Christopherson 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
Message-Id: <20231027182217.3615211-19-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/x86.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8f9d8939b63b..f661acb01c58 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11082,6 +11082,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 {
int r;
 
+   vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
vcpu->arch.l1tf_flush_l1d = true;
 
for (;;) {
-- 
2.39.1




[PATCH 15/34] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-11-05 Thread Paolo Bonzini
imil Babka 
Cc: David Hildenbrand 
Cc: Quentin Perret 
Cc: Michael Roth 
Cc: Wang 
Cc: Liam Merwick 
Cc: Isaku Yamahata 
Co-developed-by: Kirill A. Shutemov 
Signed-off-by: Kirill A. Shutemov 
Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Co-developed-by: Chao Peng 
Signed-off-by: Chao Peng 
Co-developed-by: Ackerley Tng 
Signed-off-by: Ackerley Tng 
Co-developed-by: Isaku Yamahata 
Signed-off-by: Isaku Yamahata 
Co-developed-by: Paolo Bonzini 
Signed-off-by: Paolo Bonzini 
Co-developed-by: Michael Roth 
Signed-off-by: Michael Roth 
Signed-off-by: Sean Christopherson 
Message-Id: <20231027182217.3615211-17-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 Documentation/virt/kvm/api.rst |  69 -
 fs/anon_inodes.c   |   1 +
 include/linux/kvm_host.h   |  48 +++
 include/uapi/linux/kvm.h   |  15 +-
 virt/kvm/Kconfig   |   4 +
 virt/kvm/Makefile.kvm  |   1 +
 virt/kvm/guest_memfd.c | 538 +
 virt/kvm/kvm_main.c|  59 +++-
 virt/kvm/kvm_mm.h  |  26 ++
 9 files changed, 754 insertions(+), 7 deletions(-)
 create mode 100644 virt/kvm/guest_memfd.c

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 083ed507e200..6d681f45969e 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6202,6 +6202,15 @@ superset of the features supported by the system.
 :Parameters: struct kvm_userspace_memory_region2 (in)
 :Returns: 0 on success, -1 on error
 
+KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that
+allows mapping guest_memfd memory into a guest.  All fields shared with
+KVM_SET_USER_MEMORY_REGION identically.  Userspace can set KVM_MEM_GUEST_MEMFD
+in flags to have KVM bind the memory region to a given guest_memfd range of
+[guest_memfd_offset, guest_memfd_offset + memory_size].  The target guest_memfd
+must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, and
+the target range must not be bound to any other memory region.  All standard
+bounds checks apply (use common sense).
+
 ::
 
   struct kvm_userspace_memory_region2 {
@@ -6210,9 +6219,24 @@ superset of the features supported by the system.
__u64 guest_phys_addr;
__u64 memory_size; /* bytes */
__u64 userspace_addr; /* start of the userspace allocated memory */
+   __u64 guest_memfd_offset;
+   __u32 guest_memfd;
+   __u32 pad1;
+   __u64 pad2[14];
   };
 
-See KVM_SET_USER_MEMORY_REGION.
+A KVM_MEM_GUEST_MEMFD region _must_ have a valid guest_memfd (private memory) 
and
+userspace_addr (shared memory).  However, "valid" for userspace_addr simply
+means that the address itself must be a legal userspace address.  The backing
+mapping for userspace_addr is not required to be valid/populated at the time of
+KVM_SET_USER_MEMORY_REGION2, e.g. shared memory can be lazily mapped/allocated
+on-demand.
+
+When mapping a gfn into the guest, KVM selects shared vs. private, i.e consumes
+userspace_addr vs. guest_memfd, based on the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE
+state.  At VM creation time, all memory is shared, i.e. the PRIVATE attribute
+is '0' for all gfns.  Userspace can control whether memory is shared/private by
+toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as needed.
 
 4.141 KVM_SET_MEMORY_ATTRIBUTES
 ---
@@ -6250,6 +6274,49 @@ the state of a gfn/page as needed.
 
 The "flags" field is reserved for future extensions and must be '0'.
 
+4.142 KVM_CREATE_GUEST_MEMFD
+
+
+:Capability: KVM_CAP_GUEST_MEMFD
+:Architectures: none
+:Type: vm ioctl
+:Parameters: struct kvm_create_guest_memfd(in)
+:Returns: 0 on success, <0 on error
+
+KVM_CREATE_GUEST_MEMFD creates an anonymous file and returns a file descriptor
+that refers to it.  guest_memfd files are roughly analogous to files created
+via memfd_create(), e.g. guest_memfd files live in RAM, have volatile storage,
+and are automatically released when the last reference is dropped.  Unlike
+"regular" memfd_create() files, guest_memfd files are bound to their owning
+virtual machine (see below), cannot be mapped, read, or written by userspace,
+and cannot be resized  (guest_memfd files do however support PUNCH_HOLE).
+
+::
+
+  struct kvm_create_guest_memfd {
+   __u64 size;
+   __u64 flags;
+   __u64 reserved[6];
+  };
+
+Conceptually, the inode backing a guest_memfd file represents physical memory,
+i.e. is coupled to the virtual machine as a thing, not to a "struct kvm".  The
+file itself, which is bound to a "struct kvm", is that instance's view of the
+underlying memory, e.g. effectively provides the translation of guest addresses
+to host memory.  This allows for use cases where multiple KVM structures are
+used to manage a single virtual machine, e.g. when performing intrahost

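As a rough illustration of the userspace flow (a sketch, not code from this
series; vm_fd and the page-aligned shared buffer are assumed to exist, and the
slot number and size are arbitrary):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Create a guest_memfd on 'vm_fd' and bind it to memslot 0. */
static int bind_guest_memfd(int vm_fd, void *shared_buf, __u64 size)
{
	struct kvm_create_guest_memfd gmem = { .size = size };
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	struct kvm_userspace_memory_region2 region = {
		.slot               = 0,
		.flags              = KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr    = 0,
		.memory_size        = size,
		.userspace_addr     = (__u64)(unsigned long)shared_buf,
		.guest_memfd        = gmem_fd,
		.guest_memfd_offset = 0,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
}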
[PATCH 14/34] fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()

2023-11-05 Thread Paolo Bonzini
The call to the inode_init_security_anon() LSM hook is not the sole
reason to use anon_inode_getfile_secure() or anon_inode_getfd_secure().
For example, the functions also allow one to create a file with non-zero
size, without needing a full-blown filesystem.  In this case, you don't
need a "secure" version, just unique inodes; the current name of the
functions is confusing and does not explain well the difference with
the more "standard" anon_inode_getfile() and anon_inode_getfd().

Of course, there is another side of the coin; neither io_uring nor
userfaultfd strictly speaking need distinct inodes, and it is not
that clear anymore that anon_inode_create_get{file,fd}() allow the LSM
to intercept and block the inode's creation.  If one was so inclined,
anon_inode_getfile_secure() and anon_inode_getfd_secure() could be kept,
using the shared inode or a new one depending on CONFIG_SECURITY.
However, this is probably overkill, and potentially a cause of bugs in
different configurations.  Therefore, just add a comment to io_uring
and userfaultfd explaining the choice of the function.

While at it, remove the export for what is now anon_inode_create_getfd().
There is no in-tree module that uses it, and the old name is gone anyway.
If anybody actually needs the symbol, they can ask or they can just use
anon_inode_create_getfile(), which will be exported very soon for use
in KVM.

Suggested-by: Christian Brauner 
Signed-off-by: Paolo Bonzini 
---
 fs/anon_inodes.c| 46 +++--
 fs/userfaultfd.c|  5 ++--
 include/linux/anon_inodes.h |  4 ++--
 io_uring/io_uring.c |  3 ++-
 4 files changed, 36 insertions(+), 22 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 24192a7667ed..3d4a27f8b4fe 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -79,7 +79,7 @@ static struct file *__anon_inode_getfile(const char *name,
 const struct file_operations *fops,
 void *priv, int flags,
 const struct inode *context_inode,
-bool secure)
+bool make_inode)
 {
struct inode *inode;
struct file *file;
@@ -87,7 +87,7 @@ static struct file *__anon_inode_getfile(const char *name,
if (fops->owner && !try_module_get(fops->owner))
return ERR_PTR(-ENOENT);
 
-   if (secure) {
+   if (make_inode) {
inode = anon_inode_make_secure_inode(name, context_inode);
if (IS_ERR(inode)) {
file = ERR_CAST(inode);
@@ -149,13 +149,10 @@ struct file *anon_inode_getfile(const char *name,
 EXPORT_SYMBOL_GPL(anon_inode_getfile);
 
 /**
- * anon_inode_getfile_secure - Like anon_inode_getfile(), but creates a new
+ * anon_inode_create_getfile - Like anon_inode_getfile(), but creates a new
  * !S_PRIVATE anon inode rather than reuse the
  * singleton anon inode and calls the
- * inode_init_security_anon() LSM hook.  This
- * allows for both the inode to have its own
- * security context and for the LSM to enforce
- * policy on the inode's creation.
+ * inode_init_security_anon() LSM hook.
  *
  * @name:[in]name of the "class" of the new file
  * @fops:[in]file operations for the new file
@@ -164,11 +161,19 @@ EXPORT_SYMBOL_GPL(anon_inode_getfile);
  * @context_inode:
  *   [in]the logical relationship with the new inode (optional)
  *
+ * Create a new anonymous inode and file pair.  This can be done for two
+ * reasons:
+ * - for the inode to have its own security context, so that LSMs can enforce
+ *   policy on the inode's creation;
+ * - if the caller needs a unique inode, for example in order to customize
+ *   the size returned by fstat()
+ *
  * The LSM may use @context_inode in inode_init_security_anon(), but a
- * reference to it is not held.  Returns the newly created file* or an error
- * pointer.  See the anon_inode_getfile() documentation for more information.
+ * reference to it is not held.
+ *
+ * Returns the newly created file* or an error pointer.
  */
-struct file *anon_inode_getfile_secure(const char *name,
+struct file *anon_inode_create_getfile(const char *name,
   const struct file_operations *fops,
   void *priv, int flags,
   const struct inode *context_inode)
@@ -181,7 +186,7 @@ static int __anon_inode_getfd(const char *name,
  const struct file_operations *fops,
  void *priv, int flags,

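As a rough illustration of the renamed helper (a sketch, not code from this
patch; example_fops and the surrounding function are hypothetical, along the
lines of how guest_memfd uses it later in the series):

static struct file *example_create_file(void *priv, loff_t size)
{
	struct file *file;

	/* A fresh inode means fstat() on this file can report a per-file size. */
	file = anon_inode_create_getfile("[example]", &example_fops, priv,
					 O_RDWR, NULL);
	if (IS_ERR(file))
		return file;

	i_size_write(file_inode(file), size);
	return file;
}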
[PATCH 13/34] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Add an "unmovable" flag for mappings that cannot be migrated under any
circumstance.  KVM will use the flag for its upcoming GUEST_MEMFD support,
which will not support compaction/migration, at least not in the
foreseeable future.

Test AS_UNMOVABLE under folio lock as already done for the async
compaction/dirty folio case, as the mapping can be removed by truncation
while compaction is running.  To avoid having to lock every folio with a
mapping, assume/require that unmovable mappings are also unevictable, and
have mapping_set_unmovable() also set AS_UNEVICTABLE.

Cc: Matthew Wilcox 
Co-developed-by: Vlastimil Babka 
Signed-off-by: Vlastimil Babka 
Signed-off-by: Sean Christopherson 
Message-Id: <20231027182217.3615211-15-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 include/linux/pagemap.h | 19 +-
 mm/compaction.c | 43 +
 mm/migrate.c|  2 ++
 3 files changed, 51 insertions(+), 13 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 351c3b7f93a1..82c9bf506b79 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -203,7 +203,8 @@ enum mapping_flags {
/* writeback related tags are not used */
AS_NO_WRITEBACK_TAGS = 5,
AS_LARGE_FOLIO_SUPPORT = 6,
-   AS_RELEASE_ALWAYS,  /* Call ->release_folio(), even if no private 
data */
+   AS_RELEASE_ALWAYS = 7,  /* Call ->release_folio(), even if no private 
data */
+   AS_UNMOVABLE= 8,/* The mapping cannot be moved, ever */
 };
 
 /**
@@ -289,6 +290,22 @@ static inline void mapping_clear_release_always(struct 
address_space *mapping)
clear_bit(AS_RELEASE_ALWAYS, &mapping->flags);
 }
 
+static inline void mapping_set_unmovable(struct address_space *mapping)
+{
+   /*
+* It's expected unmovable mappings are also unevictable. Compaction
+* migrate scanner (isolate_migratepages_block()) relies on this to
+* reduce page locking.
+*/
+   set_bit(AS_UNEVICTABLE, &mapping->flags);
+   set_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
+static inline bool mapping_unmovable(struct address_space *mapping)
+{
+   return test_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
return mapping->gfp_mask;
diff --git a/mm/compaction.c b/mm/compaction.c
index 38c8d216c6a3..12b828aed7c8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -883,6 +883,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 
/* Time to isolate some pages for migration */
for (; low_pfn < end_pfn; low_pfn++) {
+   bool is_dirty, is_unevictable;
 
if (skip_on_failure && low_pfn >= next_skip_pfn) {
/*
@@ -1080,8 +1081,10 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
if (!folio_test_lru(folio))
goto isolate_fail_put;
 
+   is_unevictable = folio_test_unevictable(folio);
+
/* Compaction might skip unevictable pages but CMA takes them */
-   if (!(mode & ISOLATE_UNEVICTABLE) && 
folio_test_unevictable(folio))
+   if (!(mode & ISOLATE_UNEVICTABLE) && is_unevictable)
goto isolate_fail_put;
 
/*
@@ -1093,26 +1096,42 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
if ((mode & ISOLATE_ASYNC_MIGRATE) && 
folio_test_writeback(folio))
goto isolate_fail_put;
 
-   if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_dirty(folio)) {
-   bool migrate_dirty;
+   is_dirty = folio_test_dirty(folio);
+
+   if (((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty) ||
+   (mapping && is_unevictable)) {
+   bool migrate_dirty = true;
+   bool is_unmovable;
 
/*
 * Only folios without mappings or that have
-* a ->migrate_folio callback are possible to
-* migrate without blocking.  However, we may
-* be racing with truncation, which can free
-* the mapping.  Truncation holds the folio lock
-* until after the folio is removed from the page
-* cache so holding it ourselves is sufficient.
+* a ->migrate_folio callback are possible to migrate
+* without blocking.
+*
+* Folios from unmovable mappings are not migratable.
+*
+  

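As a rough illustration of how an in-kernel user would consume the new flag
(a sketch, not code from this patch; the surrounding function is hypothetical):

static void example_init_mapping(struct inode *inode)
{
	/* The helper sets both AS_UNMOVABLE and AS_UNEVICTABLE, which is what
	 * the compaction changes above rely on to reduce folio locking.
	 */
	mapping_set_unmovable(inode->i_mapping);
	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
}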
[PATCH 12/34] KVM: Introduce per-page memory attributes

2023-11-05 Thread Paolo Bonzini
From: Chao Peng 

In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.

Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by
KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory
attributes to a guest memory range.

Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.

Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.

Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.

To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation.  For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.

It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.

Suggested-by: Sean Christopherson 
Link: https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com
Cc: Fuad Tabba 
Cc: Xu Yilun 
Cc: Mickaël Salaün 
Signed-off-by: Chao Peng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Message-Id: <20231027182217.3615211-14-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 Documentation/virt/kvm/api.rst |  36 ++
 include/linux/kvm_host.h   |  19 +++
 include/uapi/linux/kvm.h   |  13 ++
 virt/kvm/Kconfig   |   4 +
 virt/kvm/kvm_main.c| 216 +
 5 files changed, 288 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 481fb0e2ce90..083ed507e200 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6214,6 +6214,42 @@ superset of the features supported by the system.
 
 See KVM_SET_USER_MEMORY_REGION.
 
+4.141 KVM_SET_MEMORY_ATTRIBUTES
+---
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_memory_attributes (in)
+:Returns: 0 on success, <0 on error
+
+KVM_SET_MEMORY_ATTRIBUTES allows userspace to set memory attributes for a range
+of guest physical memory.
+
+::
+
+  struct kvm_memory_attributes {
+   __u64 address;
+   __u64 size;
+   __u64 attributes;
+   __u64 flags;
+  };
+
+  #define KVM_MEMORY_ATTRIBUTE_PRIVATE   (1ULL << 3)
+
+The address and size must be page aligned.  The supported attributes can be
+retrieved via ioctl(KVM_CHECK_EXTENSION) on KVM_CAP_MEMORY_ATTRIBUTES.  If
+executed on a VM, KVM_CAP_MEMORY_ATTRIBUTES precisely returns the attributes
+supported by that VM.  If executed at system scope, KVM_CAP_MEMORY_ATTRIBUTES
+returns all attributes supported by KVM.  The only attribute defined at this
+time is KVM_MEMORY_ATTRIBUTE_PRIVATE, which marks the associated gfn as being
+guest private memory.
+
+Note, there is no "get" API.  Userspace is responsible for explicitly tracking
+the state of a gfn/page as needed.
+
+The "flags" field is reserved for future extensions and must be '0'.
+
 5. The kvm_run structure
 
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 96aa930536b1..68a144cb7dbc 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -256,6 +256,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 union kvm_mmu_notifier_arg {
pte_t pte;
+   unsigned long attributes;
 };
 
 struct kvm_gfn_range {
@@ -806,6 +807,10 @@ struct kvm {
 
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
struct notifier_block pm_notifier;
+#endif
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+   /* Protected by slots_locks (for writes) and RCU (for reads) */
+   struct xarray mem_attr_array;
 #endif
char stats_id[KVM_STATS_NAME_SIZE];
 };
@@ -2338,4 +2343,18 @@ static inline void kvm_prepare_memory_fault_exit(struct 
kvm_vcpu *vcpu,
vcpu->run->memory_fault.flags = 0;
 }
 
+#ifdef CONFIG_KVM_GENERIC_MEMORY_

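As a rough illustration of the new ioctl from userspace (a sketch, not code
from this series; vm_fd is assumed to be an existing VM fd):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Mark [gpa, gpa + size) as guest-private; pass attributes = 0 to make the
 * range shared again.
 */
static int set_private(int vm_fd, __u64 gpa, __u64 size)
{
	struct kvm_memory_attributes attrs = {
		.address    = gpa,
		.size       = size,
		.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
		.flags      = 0,
	};

	return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}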
[PATCH 11/34] KVM: Drop .on_unlock() mmu_notifier hook

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Drop the .on_unlock() mmu_notifer hook now that it's no longer used for
notifying arch code that memory has been reclaimed.  Adding .on_unlock()
and invoking it *after* dropping mmu_lock was a terrible idea, as doing so
resulted in .on_lock() and .on_unlock() having divergent and asymmetric
behavior, and set future developers up for failure, i.e. all but asked for
bugs where KVM relied on using .on_unlock() to try to run a callback while
holding mmu_lock.

Opportunistically add a lockdep assertion in kvm_mmu_invalidate_end() to
guard against future bugs of this nature.

Reported-by: Isaku Yamahata 
Link: https://lore.kernel.org/all/20230802203119.gb2021...@ls.amr.corp.intel.com
Signed-off-by: Sean Christopherson 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
Message-Id: <20231027182217.3615211-12-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 virt/kvm/kvm_main.c | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e18a7f152c0b..7f3291dec7a6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -544,7 +544,6 @@ static inline struct kvm *mmu_notifier_to_kvm(struct 
mmu_notifier *mn)
 typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
 typedef void (*on_lock_fn_t)(struct kvm *kvm);
-typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
 struct kvm_mmu_notifier_range {
/*
@@ -556,7 +555,6 @@ struct kvm_mmu_notifier_range {
union kvm_mmu_notifier_arg arg;
gfn_handler_t handler;
on_lock_fn_t on_lock;
-   on_unlock_fn_t on_unlock;
bool flush_on_ret;
bool may_block;
 };
@@ -663,11 +661,8 @@ static __always_inline kvm_mn_ret_t 
__kvm_handle_hva_range(struct kvm *kvm,
if (range->flush_on_ret && r.ret)
kvm_flush_remote_tlbs(kvm);
 
-   if (r.found_memslot) {
+   if (r.found_memslot)
KVM_MMU_UNLOCK(kvm);
-   if (!IS_KVM_NULL_FN(range->on_unlock))
-   range->on_unlock(kvm);
-   }
 
srcu_read_unlock(&kvm->srcu, idx);
 
@@ -687,7 +682,6 @@ static __always_inline int kvm_handle_hva_range(struct 
mmu_notifier *mn,
.arg= arg,
.handler= handler,
.on_lock= (void *)kvm_null_fn,
-   .on_unlock  = (void *)kvm_null_fn,
.flush_on_ret   = true,
.may_block  = false,
};
@@ -706,7 +700,6 @@ static __always_inline int 
kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
.end= end,
.handler= handler,
.on_lock= (void *)kvm_null_fn,
-   .on_unlock  = (void *)kvm_null_fn,
.flush_on_ret   = false,
.may_block  = false,
};
@@ -813,7 +806,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct 
mmu_notifier *mn,
.end= range->end,
.handler= kvm_mmu_unmap_gfn_range,
.on_lock= kvm_mmu_invalidate_begin,
-   .on_unlock  = (void *)kvm_null_fn,
.flush_on_ret   = true,
.may_block  = mmu_notifier_range_blockable(range),
};
@@ -891,7 +883,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct 
mmu_notifier *mn,
.end= range->end,
.handler= (void *)kvm_null_fn,
.on_lock= kvm_mmu_invalidate_end,
-   .on_unlock  = (void *)kvm_null_fn,
.flush_on_ret   = false,
.may_block  = mmu_notifier_range_blockable(range),
};
-- 
2.39.1




[PATCH 10/34] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Handle AMD SEV's kvm_arch_guest_memory_reclaimed() hook by having
__kvm_handle_hva_range() return whether or not an overlapping memslot
was found, i.e. mmu_lock was acquired.  Using the .on_unlock() hook
works, but kvm_arch_guest_memory_reclaimed() needs to run after dropping
mmu_lock, which makes .on_lock() and .on_unlock() asymmetrical.

Use a small struct to return the tuple of the notifier-specific return,
plus whether or not overlap was found.  Because the iteration helpers are
__always_inlined, practically speaking, the struct will never actually be
returned from a function call (not to mention the size of the struct will
be two bytes in practice).

Signed-off-by: Sean Christopherson 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
Message-Id: <20231027182217.3615211-11-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 virt/kvm/kvm_main.c | 53 +++--
 1 file changed, 37 insertions(+), 16 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 756b94ecd511..e18a7f152c0b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -561,6 +561,19 @@ struct kvm_mmu_notifier_range {
bool may_block;
 };
 
+/*
+ * The inner-most helper returns a tuple containing the return value from the
+ * arch- and action-specific handler, plus a flag indicating whether or not at
+ * least one memslot was found, i.e. if the handler found guest memory.
+ *
+ * Note, most notifiers are averse to booleans, so even though KVM tracks the
+ * return from arch code as a bool, outer helpers will cast it to an int. :-(
+ */
+typedef struct kvm_mmu_notifier_return {
+   bool ret;
+   bool found_memslot;
+} kvm_mn_ret_t;
+
 /*
  * Use a dedicated stub instead of NULL to indicate that there is no callback
  * function/handler.  The compiler technically can't guarantee that a real
@@ -582,22 +595,25 @@ static const union kvm_mmu_notifier_arg 
KVM_MMU_NOTIFIER_NO_ARG;
 node;   \
 node = interval_tree_iter_next(node, start, last))  \
 
-static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
- const struct 
kvm_mmu_notifier_range *range)
+static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
+  const struct 
kvm_mmu_notifier_range *range)
 {
-   bool ret = false, locked = false;
+   struct kvm_mmu_notifier_return r = {
+   .ret = false,
+   .found_memslot = false,
+   };
struct kvm_gfn_range gfn_range;
struct kvm_memory_slot *slot;
struct kvm_memslots *slots;
int i, idx;
 
if (WARN_ON_ONCE(range->end <= range->start))
-   return 0;
+   return r;
 
/* A null handler is allowed if and only if on_lock() is provided. */
if (WARN_ON_ONCE(IS_KVM_NULL_FN(range->on_lock) &&
 IS_KVM_NULL_FN(range->handler)))
-   return 0;
+   return r;
 
idx = srcu_read_lock(&kvm->srcu);
 
@@ -631,8 +647,8 @@ static __always_inline int __kvm_handle_hva_range(struct 
kvm *kvm,
gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE 
- 1, slot);
gfn_range.slot = slot;
 
-   if (!locked) {
-   locked = true;
+   if (!r.found_memslot) {
+   r.found_memslot = true;
KVM_MMU_LOCK(kvm);
if (!IS_KVM_NULL_FN(range->on_lock))
range->on_lock(kvm);
@@ -640,14 +656,14 @@ static __always_inline int __kvm_handle_hva_range(struct 
kvm *kvm,
if (IS_KVM_NULL_FN(range->handler))
break;
}
-   ret |= range->handler(kvm, &gfn_range);
+   r.ret |= range->handler(kvm, &gfn_range);
}
}
 
-   if (range->flush_on_ret && ret)
+   if (range->flush_on_ret && r.ret)
kvm_flush_remote_tlbs(kvm);
 
-   if (locked) {
+   if (r.found_memslot) {
KVM_MMU_UNLOCK(kvm);
if (!IS_KVM_NULL_FN(range->on_unlock))
range->on_unlock(kvm);
@@ -655,8 +671,7 @@ static __always_inline int __kvm_handle_hva_range(struct 
kvm *kvm,
 
srcu_read_unlock(&kvm->srcu, idx);
 
-   /* The notifiers are averse to booleans. :-( */
-   return (int)ret;
+   return r;
 }
 
 static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
@@ -677,7 +692,7 @@ static __always_inline int kvm_handle_

[PATCH 09/34] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-11-05 Thread Paolo Bonzini
From: Chao Peng 

Add a new KVM exit type to allow userspace to handle memory faults that
KVM cannot resolve, but that userspace *may* be able to handle (without
terminating the guest).

KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
conversions between private and shared memory.  With guest private memory,
there will be two kind of memory conversions:

  - explicit conversion: happens when the guest explicitly calls into KVM
to map a range (as private or shared)

  - implicit conversion: happens when the guest attempts to access a gfn
that is configured in the "wrong" state (private vs. shared)

On x86 (first architecture to support guest private memory), explicit
conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
as there is (obviously) no hypercall, and there is no guarantee that the
guest actually intends to convert between private and shared, i.e. what
KVM thinks is an implicit conversion "request" could actually be the
result of a guest code bug.

KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
be implicit conversions.

Note!  To allow for future possibilities where KVM reports
KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
perspective), not '0'!  Due to historical baggage within KVM, exiting to
userspace with '0' from deep callstacks, e.g. in emulation paths, is
infeasible as doing so would require a near-complete overhaul of KVM,
whereas KVM already propagates -errno return codes to userspace even when
the -errno originated in a low level helper.

Report the gpa+size instead of a single gfn even though the initial usage
is expected to always report single pages.  It's entirely possible, likely
even, that KVM will someday support sub-page granularity faults, e.g.
Intel's sub-page protection feature allows for additional protections at
128-byte granularity.

Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoor...@google.com
Link: https://lore.kernel.org/all/zq3amlo2syv3d...@google.com
Cc: Anish Moorthy 
Cc: David Matlack 
Suggested-by: Sean Christopherson 
Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Signed-off-by: Chao Peng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Reviewed-by: Paolo Bonzini 
Message-Id: <20231027182217.3615211-10-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 Documentation/virt/kvm/api.rst | 41 ++
 arch/x86/kvm/x86.c |  1 +
 include/linux/kvm_host.h   | 11 +
 include/uapi/linux/kvm.h   |  8 +++
 4 files changed, 61 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index bdea1423c5f8..481fb0e2ce90 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6846,6 +6846,26 @@ array field represents return values. The userspace 
should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+   /* KVM_EXIT_MEMORY_FAULT */
+   struct {
+   __u64 flags;
+   __u64 gpa;
+   __u64 size;
+   } memory_fault;
+
+KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
+could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
+guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
+describes properties of the faulting access that are likely pertinent.
+Currently, no flags are defined.
+
+Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
+accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
+or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
+kvm_run.exit_reason is stale/undefined for all other error numbers.
+
 ::
 
 /* KVM_EXIT_NOTIFY */
@@ -7880,6 +7900,27 @@ This capability is aimed to mitigate the threat that 
malicious VMs can
 cause CPU stuck (due to event windows don't open up) and make the CPU
 unavailable to host or other VMs.
 
+7.34 KVM_CAP_MEMORY_FAULT_INFO
+--
+
+:Architectures: x86
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN will fill
+kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
+there is a valid memslot but no backing VMA for the corresponding host virtual
+address.
+
+The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
+an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
+to KVM_EXIT_MEMORY_

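As a rough illustration of the userspace-visible contract described above
(a sketch, not code from this series; vcpu_fd and the mmap'd run structure
are assumed to exist):

#include <errno.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static void run_once(int vcpu_fd, struct kvm_run *run)
{
	int ret = ioctl(vcpu_fd, KVM_RUN, NULL);

	/* Per the note above, exit_reason is only meaningful on error when
	 * errno is EFAULT or EHWPOISON.
	 */
	if (ret < 0 && (errno == EFAULT || errno == EHWPOISON) &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT)
		fprintf(stderr, "unresolved fault at gpa 0x%llx (+0x%llx)\n",
			run->memory_fault.gpa, run->memory_fault.size);
}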
[PATCH 08/34] KVM: Introduce KVM_SET_USER_MEMORY_REGION2

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional
information can be supplied without setting userspace up to fail.  The
padding in the new kvm_userspace_memory_region2 structure will be used to
pass a file descriptor in addition to the userspace_addr, i.e. allow
userspace to point at a file descriptor and map memory into a guest that
is NOT mapped into host userspace.

Alternatively, KVM could simply add "struct kvm_userspace_memory_region2"
without a new ioctl(), but as Paolo pointed out, adding a new ioctl()
makes detection of bad flags a bit more robust, e.g. if the new fd field
is guarded only by a flag and not a new ioctl(), then a userspace bug
(setting a "bad" flag) would generate out-of-bounds access instead of an
-EINVAL error.

Cc: Jarkko Sakkinen 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Xiaoyao Li 
Signed-off-by: Sean Christopherson 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
Message-Id: <20231027182217.3615211-9-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 Documentation/virt/kvm/api.rst | 22 +
 arch/x86/kvm/x86.c |  2 +-
 include/linux/kvm_host.h   |  4 +--
 include/uapi/linux/kvm.h   | 13 
 virt/kvm/kvm_main.c| 57 +-
 5 files changed, 87 insertions(+), 11 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 7025b3751027..bdea1423c5f8 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1340,6 +1340,7 @@ yet and must be cleared on entry.
__u64 guest_phys_addr;
__u64 memory_size; /* bytes */
__u64 userspace_addr; /* start of the userspace allocated memory */
+   __u64 pad[16];
   };
 
   /* for kvm_userspace_memory_region::flags */
@@ -6192,6 +6193,27 @@ to know what fields can be changed for the system 
register described by
 ``op0, op1, crn, crm, op2``. KVM rejects ID register values that describe a
 superset of the features supported by the system.
 
+4.140 KVM_SET_USER_MEMORY_REGION2
+-
+
+:Capability: KVM_CAP_USER_MEMORY2
+:Architectures: all
+:Type: vm ioctl
+:Parameters: struct kvm_userspace_memory_region2 (in)
+:Returns: 0 on success, -1 on error
+
+::
+
+  struct kvm_userspace_memory_region2 {
+   __u32 slot;
+   __u32 flags;
+   __u64 guest_phys_addr;
+   __u64 memory_size; /* bytes */
+   __u64 userspace_addr; /* start of the userspace allocated memory */
+  };
+
+See KVM_SET_USER_MEMORY_REGION.
+
 5. The kvm_run structure
 
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2c924075f6f1..7b389f27dffc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12576,7 +12576,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, 
int id, gpa_t gpa,
}
 
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-   struct kvm_userspace_memory_region m;
+   struct kvm_userspace_memory_region2 m;
 
m.slot = id | (i << 16);
m.flags = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5faba69403ac..4e741ff27af3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1146,9 +1146,9 @@ enum kvm_mr_change {
 };
 
 int kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem);
+ const struct kvm_userspace_memory_region2 *mem);
 int __kvm_set_memory_region(struct kvm *kvm,
-   const struct kvm_userspace_memory_region *mem);
+   const struct kvm_userspace_memory_region2 *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 211b86de35ac..308cc70bd6ab 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -95,6 +95,16 @@ struct kvm_userspace_memory_region {
__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+/* for KVM_SET_USER_MEMORY_REGION2 */
+struct kvm_userspace_memory_region2 {
+   __u32 slot;
+   __u32 flags;
+   __u64 guest_phys_addr;
+   __u64 memory_size;
+   __u64 userspace_addr;
+   __u64 pad[16];
+};
+
 /*
  * The bit 0 ~ bit 15 of kvm_userspace_memory_region::flags are visible for
  * userspace, other bits are reserved for kvm internal use which are defined
@@ -1201,6 +1211,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
 #define KVM_CAP_ARM_SUPPORTED_REG_MASK_RANGES 230
+#define KVM_CAP_USER_MEMORY2 231
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1483,6 +1494,8 @@ struct kvm_vfio_spapr_tce {
 

[PATCH 07/34] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where
appropriate to effectively maintain existing behavior.  Using a proper
Kconfig will simplify building more functionality on top of KVM's
mmu_notifier infrastructure.

Add a forward declaration of kvm_gfn_range to kvm_types.h so that
including arch/powerpc/include/asm/kvm_ppc.h's with CONFIG_KVM=n doesn't
generate warnings due to kvm_gfn_range being undeclared.  PPC defines
hooks for PR vs. HV without guarding them via #ifdeffery, e.g.

  bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);

Alternatively, PPC could forward declare kvm_gfn_range, but there's no
good reason not to define it in common KVM.

Acked-by: Anup Patel 
Signed-off-by: Sean Christopherson 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
Message-Id: <20231027182217.3615211-8-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/arm64/include/asm/kvm_host.h |  2 --
 arch/arm64/kvm/Kconfig|  2 +-
 arch/loongarch/include/asm/kvm_host.h |  1 -
 arch/loongarch/kvm/Kconfig|  2 +-
 arch/mips/include/asm/kvm_host.h  |  2 --
 arch/mips/kvm/Kconfig |  2 +-
 arch/powerpc/include/asm/kvm_host.h   |  2 --
 arch/powerpc/kvm/Kconfig  |  8 
 arch/powerpc/kvm/powerpc.c|  4 +---
 arch/riscv/include/asm/kvm_host.h |  2 --
 arch/riscv/kvm/Kconfig|  2 +-
 arch/x86/include/asm/kvm_host.h   |  2 --
 arch/x86/kvm/Kconfig  |  2 +-
 include/linux/kvm_host.h  |  6 +++---
 include/linux/kvm_types.h |  1 +
 virt/kvm/Kconfig  |  4 
 virt/kvm/kvm_main.c   | 10 +-
 17 files changed, 23 insertions(+), 31 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 5653d3553e3e..9029fe09f3f6 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -954,8 +954,6 @@ int __kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
 int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
  struct kvm_vcpu_events *events);
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
 
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 83c1e09be42e..1a15199f 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -22,7 +22,7 @@ menuconfig KVM
bool "Kernel-based Virtual Machine (KVM) support"
depends on HAVE_KVM
select KVM_GENERIC_HARDWARE_ENABLING
-   select MMU_NOTIFIER
+   select KVM_GENERIC_MMU_NOTIFIER
select PREEMPT_NOTIFIERS
select HAVE_KVM_CPU_RELAX_INTERCEPT
select KVM_MMIO
diff --git a/arch/loongarch/include/asm/kvm_host.h 
b/arch/loongarch/include/asm/kvm_host.h
index 11328700d4fa..b108301c2e5a 100644
--- a/arch/loongarch/include/asm/kvm_host.h
+++ b/arch/loongarch/include/asm/kvm_host.h
@@ -183,7 +183,6 @@ void kvm_flush_tlb_all(void);
 void kvm_flush_tlb_gpa(struct kvm_vcpu *vcpu, unsigned long gpa);
 int kvm_handle_mm_fault(struct kvm_vcpu *vcpu, unsigned long badv, bool write);
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long 
end, bool blockable);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
diff --git a/arch/loongarch/kvm/Kconfig b/arch/loongarch/kvm/Kconfig
index fda425babfb2..f22bae89b07d 100644
--- a/arch/loongarch/kvm/Kconfig
+++ b/arch/loongarch/kvm/Kconfig
@@ -26,9 +26,9 @@ config KVM
select HAVE_KVM_VCPU_ASYNC_IOCTL
select KVM_GENERIC_DIRTYLOG_READ_PROTECT
select KVM_GENERIC_HARDWARE_ENABLING
+   select KVM_GENERIC_MMU_NOTIFIER
select KVM_MMIO
select KVM_XFER_TO_GUEST_WORK
-   select MMU_NOTIFIER
select PREEMPT_NOTIFIERS
help
  Support hosting virtualized guest machines using
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 54a85f1d4f2c..179f320cc231 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -810,8 +810,6 @@ int kvm_mips_mkclean_gpa_pt(struct kvm *kvm, gfn_t 
start_gfn, gfn_t end_gfn);
 pgd_t *kvm_pgd_alloc(void);
 void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu);
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 /* Emulation */
 enum emulation_result update_pc(struct kvm_vcpu *vcpu, u32 cause);
 int kvm_get_badinstr(u32 *opc, struct kvm_vcpu *vcpu, u32 *out);
diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
ind

[PATCH 06/34] KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Advertise that KVM's MMU is synchronized with the primary MMU for all
flavors of PPC KVM support, i.e. advertise that the MMU is synchronized
when CONFIG_KVM_BOOK3S_HV_POSSIBLE=y but the VM is not using hypervisor
mode (a.k.a. PR VMs).  PR VMs, via kvm_unmap_gfn_range_pr(), do the right
thing for mmu_notifier invalidation events, and more tellingly, KVM
returns '1' for KVM_CAP_SYNC_MMU when CONFIG_KVM_BOOK3S_HV_POSSIBLE=n
and CONFIG_KVM_BOOK3S_PR_POSSIBLE=y, i.e. KVM already advertises a
synchronized MMU for PR VMs, just not when CONFIG_KVM_BOOK3S_HV_POSSIBLE=y.

Suggested-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
Message-Id: <20231027182217.3615211-7-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/powerpc/kvm/powerpc.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index b0a512ede764..8d3ec483bc2b 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -635,11 +635,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
BUILD_BUG();
 #endif
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-   r = hv_enabled;
-#else
r = 1;
-#endif
break;
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
case KVM_CAP_PPC_HTAB_FD:
-- 
2.39.1




[PATCH 05/34] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Assert that both KVM_ARCH_WANT_MMU_NOTIFIER and CONFIG_MMU_NOTIFIER are
defined when KVM is enabled, and return '1' unconditionally for the
CONFIG_KVM_BOOK3S_HV_POSSIBLE=n path.  All flavors of PPC support for KVM
select MMU_NOTIFIER, and KVM_ARCH_WANT_MMU_NOTIFIER is unconditionally
defined by arch/powerpc/include/asm/kvm_host.h.

Effectively dropping use of KVM_ARCH_WANT_MMU_NOTIFIER will simplify a
future cleanup to turn KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig, i.e.
will allow combining all of the

  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)

checks into a single

  #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER

without having to worry about PPC's "bare" usage of
KVM_ARCH_WANT_MMU_NOTIFIER.

Signed-off-by: Sean Christopherson 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Fuad Tabba 
Message-Id: <20231027182217.3615211-6-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/powerpc/kvm/powerpc.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 7197c8256668..b0a512ede764 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -632,12 +632,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
ext)
break;
 #endif
case KVM_CAP_SYNC_MMU:
+#if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+   BUILD_BUG();
+#endif
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
r = hv_enabled;
-#elif defined(KVM_ARCH_WANT_MMU_NOTIFIER)
-   r = 1;
 #else
-   r = 0;
+   r = 1;
 #endif
break;
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-- 
2.39.1




[PATCH 04/34] KVM: WARN if there are dangling MMU invalidations at VM destruction

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Add an assertion that there are no in-progress MMU invalidations when a
VM is being destroyed, with the exception of the scenario where KVM
unregisters its MMU notifier between an .invalidate_range_start() call and
the corresponding .invalidate_range_end().

KVM can't detect unpaired calls from the mmu_notifier due to the above
exception waiver, but the assertion can detect KVM bugs, e.g. such as the
bug that *almost* escaped initial guest_memfd development.

Link: 
https://lore.kernel.org/all/e397d30c-c6af-e68f-d18e-b4e3739c5...@linux.intel.com
Signed-off-by: Sean Christopherson 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
Message-Id: <20231027182217.3615211-5-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 virt/kvm/kvm_main.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9cc57b23ec81..5422ce20dcba 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1358,9 +1358,16 @@ static void kvm_destroy_vm(struct kvm *kvm)
 * No threads can be waiting in kvm_swap_active_memslots() as the
 * last reference on KVM has been dropped, but freeing
 * memslots would deadlock without this manual intervention.
+*
+* If the count isn't unbalanced, i.e. KVM did NOT unregister its MMU
+* notifier between a start() and end(), then there shouldn't be any
+* in-progress invalidations.
 */
WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
-   kvm->mn_active_invalidate_count = 0;
+   if (kvm->mn_active_invalidate_count)
+   kvm->mn_active_invalidate_count = 0;
+   else
+   WARN_ON(kvm->mmu_invalidate_in_progress);
 #else
kvm_flush_shadow_all(kvm);
 #endif
-- 
2.39.1




[PATCH 03/34] KVM: Use gfn instead of hva for mmu_notifier_retry

2023-11-05 Thread Paolo Bonzini
From: Chao Peng 

Currently in mmu_notifier invalidate path, hva range is recorded and then
checked against by mmu_invalidate_retry_hva() in the page fault handling
path. However, for the soon-to-be-introduced private memory, a page fault
may not have a hva associated, checking gfn(gpa) makes more sense.

For existing hva based shared memory, gfn is expected to also work. The
only downside is when aliasing multiple gfns to a single hva, the
current algorithm of checking multiple ranges could result in a much
larger range being rejected. Such aliasing should be uncommon, so the
impact is expected small.

Suggested-by: Sean Christopherson 
Cc: Xu Yilun 
Signed-off-by: Chao Peng 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
[sean: convert vmx_set_apic_access_page_addr() to gfn-based API]
Signed-off-by: Sean Christopherson 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Xu Yilun 
Message-Id: <20231027182217.3615211-4-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c   | 10 ++
 arch/x86/kvm/vmx/vmx.c   | 11 +-
 include/linux/kvm_host.h | 33 +++---
 virt/kvm/kvm_main.c  | 43 +++-
 4 files changed, 66 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b0f01d605617..b2d916f786ca 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3056,7 +3056,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, 
u64 *sptep)
  *
  * There are several ways to safely use this helper:
  *
- * - Check mmu_invalidate_retry_hva() after grabbing the mapping level, before
+ * - Check mmu_invalidate_retry_gfn() after grabbing the mapping level, before
  *   consuming it.  In this case, mmu_lock doesn't need to be held during the
  *   lookup, but it does need to be held while checking the MMU notifier.
  *
@@ -4366,7 +4366,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
return true;
 
return fault->slot &&
-  mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
+  mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn);
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault 
*fault)
@@ -6260,7 +6260,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, 
gfn_t gfn_end)
 
write_lock(&kvm->mmu_lock);
 
-   kvm_mmu_invalidate_begin(kvm, 0, -1ul);
+   kvm_mmu_invalidate_begin(kvm);
+
+   kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
 
flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
 
@@ -6270,7 +6272,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, 
gfn_t gfn_end)
if (flush)
kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - 
gfn_start);
 
-   kvm_mmu_invalidate_end(kvm, 0, -1ul);
+   kvm_mmu_invalidate_end(kvm);
 
write_unlock(&kvm->mmu_lock);
 }
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index be20a60047b1..40e3780d73ae 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6757,10 +6757,10 @@ static void vmx_set_apic_access_page_addr(struct 
kvm_vcpu *vcpu)
return;
 
/*
-* Grab the memslot so that the hva lookup for the mmu_notifier retry
-* is guaranteed to use the same memslot as the pfn lookup, i.e. rely
-* on the pfn lookup's validation of the memslot to ensure a valid hva
-* is used for the retry check.
+* Explicitly grab the memslot using KVM's internal slot ID to ensure
+* KVM doesn't unintentionally grab a userspace memslot.  It _should_
+* be impossible for userspace to create a memslot for the APIC when
+* APICv is enabled, but paranoia won't hurt in this case.
 */
slot = id_to_memslot(slots, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT);
if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
@@ -6785,8 +6785,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu 
*vcpu)
return;
 
read_lock(&vcpu->kvm->mmu_lock);
-   if (mmu_invalidate_retry_hva(kvm, mmu_seq,
-gfn_to_hva_memslot(slot, gfn))) {
+   if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) {
kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
read_unlock(&vcpu->kvm->mmu_lock);
goto out;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fb6c6109fdca..11d091688346 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -787,8 +787,8 @@ struct kvm {
struct mmu_notifier mmu_notifier;
unsigned long mmu_invalidate_seq;
long mmu_invalidate_in_progress;
-   unsigned long mmu_invalidate_range_start;
-   unsigned long mmu_invalidate_range_end;
+   gfn_t mmu_invalidate_range

[PATCH 02/34] KVM: Assert that mmu_invalidate_in_progress *never* goes negative

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Move the assertion on the in-progress invalidation count from the primary
MMU's notifier path to KVM's common notification path, i.e. assert that
the count doesn't go negative even when the invalidation is coming from
KVM itself.

Opportunistically convert the assertion to a KVM_BUG_ON(), i.e. kill only
the affected VM, not the entire kernel.  A corrupted count is fatal to the
VM, e.g. the non-zero (negative) count will cause mmu_invalidate_retry()
to block any and all attempts to install new mappings.  But it's far from
guaranteed that an end() without a start() is fatal or even problematic to
anything other than the target VM, e.g. the underlying bug could simply be
a duplicate call to end().  And it's much more likely that a missed
invalidation, i.e. a potential use-after-free, would manifest as no
notification whatsoever, not an end() without a start().

Signed-off-by: Sean Christopherson 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
Message-Id: <20231027182217.3615211-3-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 virt/kvm/kvm_main.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0524933856d4..5a97e6c7d9c2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -833,6 +833,7 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long 
start,
 * in conjunction with the smp_rmb in mmu_invalidate_retry().
 */
kvm->mmu_invalidate_in_progress--;
+   KVM_BUG_ON(kvm->mmu_invalidate_in_progress < 0, kvm);
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
@@ -863,8 +864,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct 
mmu_notifier *mn,
 */
if (wake)
rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
-
-   BUG_ON(kvm->mmu_invalidate_in_progress < 0);
 }
 
 static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
-- 
2.39.1




[PATCH 01/34] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Rework and rename "struct kvm_hva_range" into "kvm_mmu_notifier_range" so
that the structure can be used to handle notifications that operate on gfn
context, i.e. that aren't tied to a host virtual address.  Rename the
handler typedef too (arguably it should always have been gfn_handler_t).

Practically speaking, this is a nop for 64-bit kernels as the only
meaningful change is to store start+end as u64s instead of unsigned longs.

Reviewed-by: Paolo Bonzini 
Reviewed-by: Xiaoyao Li 
Signed-off-by: Sean Christopherson 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
Message-Id: <20231027182217.3615211-2-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 virt/kvm/kvm_main.c | 34 +++---
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 486800a7024b..0524933856d4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -541,18 +541,22 @@ static inline struct kvm *mmu_notifier_to_kvm(struct 
mmu_notifier *mn)
return container_of(mn, struct kvm, mmu_notifier);
 }
 
-typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
+typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
 typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
 unsigned long end);
 
 typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
-struct kvm_hva_range {
-   unsigned long start;
-   unsigned long end;
+struct kvm_mmu_notifier_range {
+   /*
+* 64-bit addresses, as KVM notifiers can operate on host virtual
+* addresses (unsigned long) and guest physical addresses (64-bit).
+*/
+   u64 start;
+   u64 end;
union kvm_mmu_notifier_arg arg;
-   hva_handler_t handler;
+   gfn_handler_t handler;
on_lock_fn_t on_lock;
on_unlock_fn_t on_unlock;
bool flush_on_ret;
@@ -581,7 +585,7 @@ static const union kvm_mmu_notifier_arg 
KVM_MMU_NOTIFIER_NO_ARG;
 node = interval_tree_iter_next(node, start, last))  \
 
 static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
- const struct kvm_hva_range 
*range)
+ const struct 
kvm_mmu_notifier_range *range)
 {
bool ret = false, locked = false;
struct kvm_gfn_range gfn_range;
@@ -608,9 +612,9 @@ static __always_inline int __kvm_handle_hva_range(struct 
kvm *kvm,
unsigned long hva_start, hva_end;
 
slot = container_of(node, struct kvm_memory_slot, 
hva_node[slots->node_idx]);
-   hva_start = max(range->start, slot->userspace_addr);
-   hva_end = min(range->end, slot->userspace_addr +
- (slot->npages << PAGE_SHIFT));
+   hva_start = max_t(unsigned long, range->start, 
slot->userspace_addr);
+   hva_end = min_t(unsigned long, range->end,
+   slot->userspace_addr + (slot->npages << 
PAGE_SHIFT));
 
/*
 * To optimize for the likely case where the address
@@ -660,10 +664,10 @@ static __always_inline int kvm_handle_hva_range(struct 
mmu_notifier *mn,
unsigned long start,
unsigned long end,
union kvm_mmu_notifier_arg arg,
-   hva_handler_t handler)
+   gfn_handler_t handler)
 {
struct kvm *kvm = mmu_notifier_to_kvm(mn);
-   const struct kvm_hva_range range = {
+   const struct kvm_mmu_notifier_range range = {
.start  = start,
.end= end,
.arg= arg,
@@ -680,10 +684,10 @@ static __always_inline int kvm_handle_hva_range(struct 
mmu_notifier *mn,
 static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier 
*mn,
 unsigned long start,
 unsigned long end,
-hva_handler_t handler)
+gfn_handler_t handler)
 {
struct kvm *kvm = mmu_notifier_to_kvm(mn);
-   const struct kvm_hva_range range = {
+   const struct kvm_mmu_notifier_range range = {
.start  = start,
.end= end,
.handler= handler,
@@ -771,7 +775,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct 
mmu_notifier *mn,
  

[PATCH v14 00/34] KVM: guest_memfd() and per-page attributes

2023-11-05 Thread Paolo Bonzini
Adding AS_UNMOVABLE isn't strictly required as it's "just" an
optimization, but we'd prefer to have it in place straightaway.

If you would like to see a range-diff, I suggest using Patchew; start
from https://patchew.org/linux/20231027182217.3615211-1-sea...@google.com/
and click v14 on top.

Thanks,

Paolo

Ackerley Tng (1):
  KVM: selftests: Test KVM exit behavior for private memory/access

Chao Peng (8):
  KVM: Use gfn instead of hva for mmu_notifier_retry
  KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  KVM: Introduce per-page memory attributes
  KVM: x86: Disallow hugepages when memory attributes are mixed
  KVM: x86/mmu: Handle page fault for private memory
  KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper
  KVM: selftests: Expand set_memory_region_test to validate
    guest_memfd()
  KVM: selftests: Add basic selftest for guest_memfd()

Paolo Bonzini (1):
  fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()

Sean Christopherson (23):
  KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn
ranges
  KVM: Assert that mmu_invalidate_in_progress *never* goes negative
  KVM: WARN if there are dangling MMU invalidations at VM destruction
  KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER
  KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU
  KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to
CONFIG_KVM_GENERIC_MMU_NOTIFIER
  KVM: Introduce KVM_SET_USER_MEMORY_REGION2
  KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory
  KVM: Drop .on_unlock() mmu_notifier hook
  mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
  KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing
memory
  KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN
  KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro
  KVM: Allow arch code to track number of memslot address spaces per VM
  KVM: x86: Add support for "protected VMs" that can utilize private
memory
  KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper
  KVM: selftests: Convert lib's mem regions to
KVM_SET_USER_MEMORY_REGION2
  KVM: selftests: Add support for creating private memslots
  KVM: selftests: Introduce VM "shape" to allow tests to specify the VM
type
  KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data
  KVM: selftests: Add a memory region subtest to validate invalid flags
  KVM: Prepare for handling only shared mappings in mmu_notifier events
  KVM: Add transparent hugepage support for dedicated guest memory

Vishal Annapurve (3):
  KVM: selftests: Add helpers to convert guest memory b/w private and
shared
  KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls
(x86)
  KVM: selftests: Add x86-only selftest for private memory conversions


 Documentation/virt/kvm/api.rst| 209 +++
 arch/arm64/include/asm/kvm_host.h |   2 -
 arch/arm64/kvm/Kconfig|   2 +-
 arch/loongarch/include/asm/kvm_host.h |   1 -
 arch/loongarch/kvm/Kconfig|   2 +-
 arch/mips/include/asm/kvm_host.h  |   2 -
 arch/mips/kvm/Kconfig |   2 +-
 arch/powerpc/include/asm/kvm_host.h   |   2 -
 arch/powerpc/kvm/Kconfig  |   8 +-
 arch/powerpc/kvm/book3s_hv.c  |   2 +-
 arch/powerpc/kvm/powerpc.c|   7 +-
 arch/riscv/include/asm/kvm_host.h |   2 -
 arch/riscv/kvm/Kconfig|   2 +-
 arch/x86/include/asm/kvm_host.h   |  17 +-
 arch/x86/include/uapi/asm/kvm.h   |   3 +
 arch/x86/kvm/Kconfig  |  14 +-
 arch/x86/kvm/debugfs.c|   2 +-
 arch/x86/kvm/mmu/mmu.c| 271 +++-
 arch/x86/kvm/mmu/mmu_internal.h   |   2 +
 arch/x86/kvm/vmx/vmx.c|  11 +-
 arch/x86/kvm/x86.c|  26 +-
 fs/anon_inodes.c  |  47 +-
 fs/userfaultfd.c  |   5 +-
 include/linux/anon_inodes.h   |   4 +-
 include/linux/kvm_host.h  | 144 -
 include/linux/kvm_types.h |   1 +
 include/linux/pagemap.h   |  19 +-
 include/uapi/linux/kvm.h  |  51 ++
 io_uring/io_uring.c   |   3 +-
 mm/compaction.c   |  43 +-
 mm/migrate.c  |   2 +
 tools/testing/selftests/kvm/Makefile  |   3 +
 tools/testing/selftests/kvm/dirty_log_test.c  |   2 +-
 .../testing/selftests/kvm/guest_memfd_test.c  | 221 +++
 .../selftests/kvm/include/kvm_util_base.h | 148 -
 .../testing/selftests/kvm/include/test_util.h |   5 +
 .../selftests/kvm/include/ucall_common.h  |  11 +
 .../selftests/kvm

Re: [PATCH v13 20/35] KVM: x86/mmu: Handle page fault for private memory

2023-11-05 Thread Paolo Bonzini
On Sun, Nov 5, 2023 at 2:04 PM Xu Yilun  wrote:
>
> > +static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> > +   struct kvm_page_fault *fault)
> > +{
> > + kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> > +   PAGE_SIZE, fault->write, fault->exec,
> > +   fault->is_private);
> > +}
> > +
> > +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > +struct kvm_page_fault *fault)
> > +{
> > + int max_order, r;
> > +
> > + if (!kvm_slot_can_be_private(fault->slot)) {
> > + kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > + return -EFAULT;
> > + }
> > +
> > + r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
> > +  &max_order);
> > + if (r) {
> > + kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > + return r;
>
> Why report KVM_EXIT_MEMORY_FAULT here? even with a ret != -EFAULT?

The cases are EFAULT, EHWPOISON (which can report
KVM_EXIT_MEMORY_FAULT) and ENOMEM.  I think it's fine that even
-ENOMEM can return KVM_EXIT_MEMORY_FAULT, and it doesn't violate the
documentation.  The docs tell you "what can you do if the error is
EFAULT or EHWPOISON?"; they don't exclude that other errnos result in
KVM_EXIT_MEMORY_FAULT, it's just that you're not supposed to look at
the exit reason in that case.
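
To make the contract concrete, here is a minimal userspace sketch (not
code from this series; handle_memory_fault() and handle_run_error() are
hypothetical helpers) that only consults kvm_run on the two documented
errnos:

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

void handle_memory_fault(__u64 gpa, __u64 size, __u64 flags);	/* hypothetical */
void handle_run_error(int err);					/* hypothetical */

static void run_vcpu_once(int vcpu_fd, struct kvm_run *run)
{
	if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
		if ((errno == EFAULT || errno == EHWPOISON) &&
		    run->exit_reason == KVM_EXIT_MEMORY_FAULT)
			/* Documented case: run->memory_fault is valid. */
			handle_memory_fault(run->memory_fault.gpa,
					    run->memory_fault.size,
					    run->memory_fault.flags);
		else
			/* e.g. -ENOMEM: don't look at exit_reason at all. */
			handle_run_error(errno);
	}
}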

Paolo



Re: [PATCH v13 15/35] fs: Export anon_inode_getfile_secure() for use by KVM

2023-11-03 Thread Paolo Bonzini

On 11/2/23 17:24, Christian Brauner wrote:

On Fri, Oct 27, 2023 at 11:21:57AM -0700, Sean Christopherson wrote:

Export anon_inode_getfile_secure() so that it can be used by KVM to create
and manage file-based guest memory without needing a full-blown filesystem.
The "standard" anon_inode_getfd() doesn't work for KVM's use case as KVM
needs a unique inode for each file, e.g. to be able to independently
manage the size and lifecycle of a given file.

Note, KVM doesn't need a "secure" version, just unique inodes, i.e. ignore
the name.

Signed-off-by: Sean Christopherson 
---


Before we enshrine this misleading name let's rename this to:

create_anon_inode_getfile()

I don't claim it's a great name but it's better than *_secure() which is
very confusing. So just:

struct file *create_anon_inode_getfile(const char *name,
const struct file_operations *fops,
void *priv, int flags)


I slightly prefer anon_inode_create_getfile(); grepping include/linux 
for '\

Neither userfaultfd (which uses anon_inode_getfd_secure()) nor io_uring 
strictly speaking need separate inodes; they do want the call to 
inode_init_security_anon().  But I agree that the new name is better and 
I will adjust the comments so that it is clear why you'd use this 
function instead of anon_inode_get{file,fd}().



May also just remove that context_inode argument from the exported
function. The only other caller is io_uring. And neither it nor this
patchset need the context_inode thing afaict.


True, OTOH we might as well rename anon_inode_getfd_secure() to 
anon_inode_create_getfd(), and that one does need context_inode.
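
Purely as a sketch of where the naming discussion seems to be heading
(not the final API; in particular whether context_inode stays in the
getfile variant is still open), the renamed declarations could look
like this, keeping the current *_secure() parameter lists:

/* Hypothetical renamed declarations in include/linux/anon_inodes.h. */
struct file *anon_inode_create_getfile(const char *name,
				       const struct file_operations *fops,
				       void *priv, int flags,
				       const struct inode *context_inode);
int anon_inode_create_getfd(const char *name,
			    const struct file_operations *fops,
			    void *priv, int flags,
			    const struct inode *context_inode);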


I'll Cc you on v14 and will carry the patch in my tree.

Paolo


Merge conflict risk is
extremely low so carrying that as part of this patchset is fine and
shouldn't cause huge issues for you.





Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

2023-11-02 Thread Paolo Bonzini
On Thu, Nov 2, 2023 at 4:38 PM Sean Christopherson  wrote:
> Actually, looking at this again, there's not actually a hard dependency on 
> THP.
> A THP-enabled kernel _probably_  gives a higher probability of using 
> hugepages,
> but mostly because THP selects COMPACTION, and I suppose because using THP for
> other allocations reduces overall fragmentation.

Yes, that's why I didn't even bother enabling it unless THP is
enabled, but it makes even more sense to just try.

> So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I 
> think
> we should do the below (I verified KVM can create hugepages with THP=n).  
> We'll
> need another capability, but (a) we probably should have that anyways and (b) 
> it
> provides a cleaner path to adding PUD-sized hugepage support in the future.

I wonder if we need KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE though. This
should be a generic kernel API and in fact the sizes are available in
a not-so-friendly format in /sys/kernel/mm/hugepages.

We should just add /sys/kernel/mm/hugepages/sizes that contains
"2097152 1073741824" on x86 (only the former if 1G pages are not
supported).

Plus: is this the best API if we need something else for 1G pages?
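
For illustration only (the file below does not exist; it is the
interface proposed above), userspace discovery could then be as simple
as parsing a space-separated list of byte sizes:

#include <stdio.h>

/* Hypothetical: parse the proposed /sys/kernel/mm/hugepages/sizes file. */
static int read_hugepage_sizes(unsigned long long *sizes, int max)
{
	FILE *f = fopen("/sys/kernel/mm/hugepages/sizes", "r");
	int n = 0;

	if (!f)
		return -1;
	while (n < max && fscanf(f, "%llu", &sizes[n]) == 1)
		n++;
	fclose(f);
	return n;
}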

Let's drop *this* patch and proceed incrementally. (Again, this is
what I want to do with this final review: identify places that are
still sticky, and don't let them block the rest).

Coincidentially we have an open spot next week at plumbers. Let's
extend Fuad's section to cover more guestmem work.

Paolo

> diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c 
> b/tools/testing/selftests/kvm/guest_memfd_test.c
> index c15de9852316..c9f449718fce 100644
> --- a/tools/testing/selftests/kvm/guest_memfd_test.c
> +++ b/tools/testing/selftests/kvm/guest_memfd_test.c
> @@ -201,6 +201,10 @@ int main(int argc, char *argv[])
>
> TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
>
> +   if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE) && 
> thp_configured())
> +   TEST_ASSERT_EQ(get_trans_hugepagesz(),
> +  
> kvm_check_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE));
> +
> page_size = getpagesize();
> total_size = page_size * 4;
>
> diff --git 
> a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c 
> b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
> index be311944e90a..245901587ed2 100644
> --- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
> +++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
> @@ -396,7 +396,7 @@ static void test_mem_conversions(enum 
> vm_mem_backing_src_type src_type, uint32_t
>
> vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << 
> KVM_HC_MAP_GPA_RANGE));
>
> -   if (backing_src_can_be_huge(src_type))
> +   if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE))
> memfd_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> else
> memfd_flags = 0;
>
> --
> From: Sean Christopherson 
> Date: Wed, 25 Oct 2023 16:26:41 -0700
> Subject: [PATCH] KVM: Add best-effort hugepage support for dedicated guest
>  memory
>
> Extend guest_memfd to allow backing guest memory with hugepages.  For now,
> make hugepage utilization best-effort, i.e. fall back to non-huge mappings
> if a hugepage can't be allocated.  Guaranteeing hugepages would require a
> dedicated memory pool and significantly more complexity and churn.
>
> Require userspace to opt-in via a flag even though it's unlikely real use
> cases will ever want to use order-0 pages, e.g. to give userspace a safety
> valve in case hugepage support is buggy, and to allow for easier testing
> of both paths.
>
> Do not take a dependency on CONFIG_TRANSPARENT_HUGEPAGE, as THP enabling
> primarily deals with userspace page tables, which are explicitly not in
> play for guest_memfd.  Selecting THP does make obtaining hugepages more
> likely, but only because THP selects CONFIG_COMPACTION.  Don't select
> CONFIG_COMPACTION either, because again it's not a hard dependency.
>
> For simplicity, require the guest_memfd size to be a multiple of the
> hugepage size, e.g. so that KVM doesn't need to do bounds checking when
> deciding whether or not to allocate a huge folio.
>
> When reporting the max order when KVM gets a pfn from guest_memfd, force
> order-0 pages if the hugepage is not fully contained by the memslot
> binding, e.g. if userspace requested hugepages but punches a hole in the
> memslot bindings in order to emulate x86's VGA hole.
>
> Signed-off-by: Sean Christopherson 
> ---
>  Documentation/virt/kvm/api.rst | 17 +
>  include/uapi/linux/kvm.h   |  3 ++
>  virt/kvm/guest_memfd.c | 69 +-
>  virt/kvm/kvm_main.c|  4 ++
>  4 files changed, 84 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index e82c69d5e755..ccdd5413920d 100644
> --- a/Documen

Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-11-02 Thread Paolo Bonzini

On 10/31/23 23:39, David Matlack wrote:

Maybe can you sketch out how you see this proposal being extensible to
using guest_memfd for shared mappings?

For in-place conversions, e.g. pKVM, no additional guest_memfd is needed.  
What's
missing there is the ability to (safely) mmap() guest_memfd, e.g. KVM needs to
ensure there are no outstanding references when converting back to private.

For TDX/SNP, assuming we don't find a performant and robust way to do in-place
conversions, a second fd+offset pair would be needed.

Is there a way to support non-in-place conversions within a single guest_memfd?


For TDX/SNP, you could have a hook from KVM_SET_MEMORY_ATTRIBUTES to 
guest memory.  The hook would invalidate now-private parts if they have 
a VMA, causing a SIGSEGV/EFAULT if the host touches them.


It would forbid mappings from multiple gfns to a single offset of the 
guest_memfd, because then the shared vs. private attribute would be tied 
to the offset.  This should not be a problem; for example, in the case 
of SNP, the RMP already requires a single mapping from host physical 
address to guest physical address.


Paolo



Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events

2023-11-02 Thread Paolo Bonzini

On 11/2/23 06:59, Binbin Wu wrote:



Add flags to "struct kvm_gfn_range" to let notifier events target
only shared and only private mappings, and wire up the existing
mmu_notifier events to be shared-only (private memory is never
associated with a userspace virtual address, i.e. can't be reached
via mmu_notifiers).

Add two flags so that KVM can handle the three possibilities
(shared, private, and shared+private) without needing something
like a tri-state enum.


I see the two flags are set/cleared in __kvm_handle_hva_range() in
this patch and kvm_handle_gfn_range() from the later patch 13/35, but
I didn't see they are used/read in this patch series if I didn't miss
anything.  How are they supposed to be used in KVM?


They are going to be used by SNP/TDX patches.

Paolo



Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-11-02 Thread Paolo Bonzini

On 11/2/23 10:35, Huang, Kai wrote:

IIUC KVM can already handle the case of poisoned
page by sending signal to user app: 

	static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu,
					struct kvm_page_fault *fault)
	{
		...

		if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
			kvm_send_hwpoison_signal(fault->slot, fault->gfn);
			return RET_PF_RETRY;
		}
	}


EHWPOISON is not implemented by this series, so it should be left out of 
the documentation.




Currently as mentioned above when
vepc fault handler cannot allocate EPC page KVM returns -EFAULT to Qemu, and
Qemu prints ...

...: Bad address


... which is nonsense.

If we can use memory_fault.flags (or is 'fault_reason' a better name?) to carry
a specific value for EPC to let Qemu know and Qemu can then do more reasonable
things.


Yes, that's a good idea that can be implemented on top.

Paolo



Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-11-02 Thread Paolo Bonzini

On 11/1/23 18:36, Sean Christopherson wrote:

A good example is KVM_RUN with -EINTR; if KVM were to return something other 
than
-EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall 
over.


And dually if KVM were to return KVM_EXIT_INTR together with something 
other than -EINTR.


And purging exit_reason super early is subtly tricky because KVM's 
(again, poorly documented) ABI is that *some* exit reasons are preserved 
across KVM_RUN with vcpu->run->immediate_exit (or with a pending 
signal). https://lore.kernel.org/all/zffbwoxz5ui%2fg...@google.com


vcpu->run->immediate_exit preserves all exit reasons, but it's not a 
good idea that immediate_exit behaves differently from a pending signal on 
entry to KVM_RUN (remember that immediate_exit was meant to be a better 
performing alternative to KVM_SET_SIGNAL_MASK).


In principle, vcpu->run->immediate_exit could return KVM_EXIT_INTR 
(perhaps even _should_, except that breaks selftests so at this point it 
is ABI).
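
As a concrete illustration (a sketch, not code from this series): with
today's ABI a VMM that wants a guaranteed immediate return does
something like the following, and must check for -EINTR rather than
for a KVM_EXIT_INTR exit reason:

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Minimal sketch of the immediate_exit contract as it exists today. */
static int kick_and_run(int vcpu_fd, struct kvm_run *run)
{
	run->immediate_exit = 1;	/* ask KVM_RUN to bail out right away */
	if (ioctl(vcpu_fd, KVM_RUN, 0) < 0 && errno == EINTR) {
		/*
		 * KVM returns -EINTR; run->exit_reason is whatever the
		 * previous exit left there, not KVM_EXIT_INTR.
		 */
		run->immediate_exit = 0;
		return 0;
	}
	run->immediate_exit = 0;
	return -1;
}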


Paolo



Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes

2023-11-02 Thread Paolo Bonzini

On 11/2/23 04:01, Huang, Kai wrote:

On Fri, 2023-10-27 at 11:21 -0700, Sean Christopherson wrote:

From: Chao Peng 

In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.

Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
userspace to operate on the per-page memory attributes.
   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
 a guest memory range.
   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
 memory attributes.

Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.

Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.

Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.

To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation.  For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.

It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.

Suggested-by: Sean Christopherson 
Link: https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com
Cc: Fuad Tabba 
Cc: Xu Yilun 
Cc: Mickaël Salaün 
Signed-off-by: Chao Peng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 



[...]


+Note, there is no "get" API.  Userspace is responsible for explicitly tracking
+the state of a gfn/page as needed.
+



[...]

  
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
+{
+	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
+}


Only call xa_to_value() when xa_load() returns !NULL?


This xarray does not store a pointer, therefore xa_load() actually 
returns an integer that is tagged with 1 in the low bit:


static inline unsigned long xa_to_value(const void *entry)
{
	return (unsigned long)entry >> 1;
}

Returning zero for an empty entry is okay, so the result of xa_load() 
can be used directly.
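
To spell that out with a userspace mock of the tagging scheme (for
illustration only, this is not kernel code): an absent index makes
xa_load() return NULL, and NULL decodes to 0, so "no entry" and
"attributes == 0" are intentionally the same thing.

#include <assert.h>
#include <stdio.h>

/* Userspace re-implementation of the xarray value tagging, for illustration. */
static void *xa_mk_value_mock(unsigned long v)
{
	return (void *)((v << 1) | 1);	/* bit 0 tags the entry as a value */
}

static unsigned long xa_to_value_mock(const void *entry)
{
	return (unsigned long)entry >> 1;
}

int main(void)
{
	void *present = xa_mk_value_mock(0x8);	/* e.g. KVM_MEMORY_ATTRIBUTE_PRIVATE */
	void *absent = NULL;			/* what xa_load() returns for an empty index */

	assert(xa_to_value_mock(present) == 0x8);
	assert(xa_to_value_mock(absent) == 0);	/* empty entry decodes to "no attributes" */
	printf("present=%#lx absent=0x%lx\n",
	       xa_to_value_mock(present), xa_to_value_mock(absent));
	return 0;
}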




+
+bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+unsigned long attrs);


Seems it's not immediately clear why this function is needed in this patch,
especially when you said there is no "get" API above.  Add some material to
changelog?


It's used by later patches; even without a "get" API, it's a pretty 
fundamental functionality.



+bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+   struct kvm_gfn_range *range);
+bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+struct kvm_gfn_range *range);


Looks if this Kconfig is on, the above two arch hooks won't have implementation.

Is it better to have two __weak empty versions here in this patch?

Anyway, from the changelog it seems it's not mandatory for some ARCH to provide
the above two if one wants to turn this on, i.e., the two hooks can be empty and
the ARCH can just use the __weak version.


I think this can be added by the first arch that needs memory attributes 
and also doesn't need one of these hooks.  Or perhaps the x86 
kvm_arch_pre_set_memory_attributes() could be made generic and thus that 
would be the __weak version.  It's too early to tell, so it's better to 
leave the implementation to the architectures for now.


Paolo



Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

2023-11-01 Thread Paolo Bonzini
On Wed, Nov 1, 2023 at 11:35 PM Sean Christopherson  wrote:
>
> On Wed, Nov 01, 2023, Paolo Bonzini wrote:
> > On 11/1/23 17:36, Sean Christopherson wrote:
> > > > > "Allow" isn't perfect, e.g. I would much prefer a straight 
> > > > > KVM_GUEST_MEMFD_USE_HUGEPAGES
> > > > > or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey 
> > > > > that KVM doesn't
> > > > > (yet) guarantee hugepages.  I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is 
> > > > > stronger than
> > > > > a hint, but weaker than a requirement.  And if/when KVM supports a 
> > > > > dedicated memory
> > > > > pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.
> > > > I think that the current patch is fine, but I will adjust it to always
> > > > allow the flag, and to make the size check even if 
> > > > !CONFIG_TRANSPARENT_HUGEPAGE.
> > > > If hugepages are not guaranteed, and (theoretically) you could have no
> > > > hugepage at all in the result, it's okay to get this result even if THP 
> > > > is not
> > > > available in the kernel.
> > > Can you post a fixup patch?  It's not clear to me exactly what behavior 
> > > you intend
> > > to end up with.
> >
> > Sure, just this:
> >
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index 7d1a33c2ad42..34fd070e03d9 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct 
> > kvm_create_guest_memfd *args)
> >  {
> >   loff_t size = args->size;
> >   u64 flags = args->flags;
> > - u64 valid_flags = 0;
> > -
> > - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > - valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> > + u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> >   if (flags & ~valid_flags)
> >   return -EINVAL;
> > @@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct 
> > kvm_create_guest_memfd *args)
> >   if (size < 0 || !PAGE_ALIGNED(size))
> >   return -EINVAL;
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >   if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
> >   !IS_ALIGNED(size, HPAGE_PMD_SIZE))
> >   return -EINVAL;
> > -#endif
>
> That won't work, HPAGE_PMD_SIZE is valid only for 
> CONFIG_TRANSPARENT_HUGEPAGE=y.
>
> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
> #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })

Would have caught it when actually testing it, I guess. :) It has to
be PMD_SIZE, possibly with

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
BUILD_BUG_ON(HPAGE_PMD_SIZE != PMD_SIZE);
#endif

for extra safety.

Paolo



Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

2023-11-01 Thread Paolo Bonzini

On 11/1/23 17:36, Sean Christopherson wrote:

"Allow" isn't perfect, e.g. I would much prefer a straight 
KVM_GUEST_MEMFD_USE_HUGEPAGES
or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM 
doesn't
(yet) guarantee hugepages.  I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than
a hint, but weaker than a requirement.  And if/when KVM supports a dedicated 
memory
pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.

I think that the current patch is fine, but I will adjust it to always
allow the flag, and to make the size check even if !CONFIG_TRANSPARENT_HUGEPAGE.
If hugepages are not guaranteed, and (theoretically) you could have no
hugepage at all in the result, it's okay to get this result even if THP is not
available in the kernel.

Can you post a fixup patch?  It's not clear to me exactly what behavior you 
intend
to end up with.


Sure, just this:

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 7d1a33c2ad42..34fd070e03d9 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 {
 	loff_t size = args->size;
 	u64 flags = args->flags;
-	u64 valid_flags = 0;
-
-	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
-		valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+	u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
 
 	if (flags & ~valid_flags)
 		return -EINVAL;
@@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 	if (size < 0 || !PAGE_ALIGNED(size))
 		return -EINVAL;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
 	    !IS_ALIGNED(size, HPAGE_PMD_SIZE))
 		return -EINVAL;
-#endif
 
 	return __kvm_gmem_create(kvm, size, flags);
 }

Paolo



Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

2023-11-01 Thread Paolo Bonzini
On Wed, Nov 1, 2023 at 2:41 PM Sean Christopherson  wrote:
>
> On Wed, Nov 01, 2023, Xiaoyao Li wrote:
> > On 10/31/2023 10:16 PM, Sean Christopherson wrote:
> > > On Tue, Oct 31, 2023, Xiaoyao Li wrote:
> > > > On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> > > > > Extended guest_memfd to allow backing guest memory with transparent
> > > > > hugepages. Require userspace to opt-in via a flag even though there's 
> > > > > no
> > > > > known/anticipated use case for forcing small pages as THP is optional,
> > > > > i.e. to avoid ending up in a situation where userspace is unaware that
> > > > > KVM can't provide hugepages.
> > > >
> > > > Personally, it seems not so "transparent" if requiring userspace to 
> > > > opt-in.
> > > >
> > > > People need to 1) check if the kernel built with TRANSPARENT_HUGEPAGE
> > > > support, or check is the sysfs of transparent hugepage exists; 2)get the
> > > > maximum support hugepage size 3) ensure the size satisfies the 
> > > > alignment;
> > > > before opt-in it.
> > > >
> > > > Even simpler, userspace can blindly try to create guest memfd with
> > > > transparent hugapage flag. If getting error, fallback to create without 
> > > > the
> > > > transparent hugepage flag.
> > > >
> > > > However, it doesn't look transparent to me.
> > >
> > > The "transparent" part is referring to the underlying kernel mechanism, 
> > > it's not
> > > saying anything about the API.  The "transparent" part of THP is that the 
> > > kernel
> > > doesn't guarantee hugepages, i.e. whether or not hugepages are actually 
> > > used is
> > > (mostly) transparent to userspace.
> > >
> > > Paolo also isn't the biggest fan[*], but there are also downsides to 
> > > always
> > > allowing hugepages, e.g. silent failure due to lack of THP or unaligned 
> > > size,
> > > and there's precedent in the form of MADV_HUGEPAGE.
> > >
> > > [*] 
> > > https://lore.kernel.org/all/84a908ae-04c7-51c7-c9a8-119e1933a...@redhat.com
> >
> > But it's different than MADV_HUGEPAGE, in a way. Per my understanding, the
> > failure of MADV_HUGEPAGE is not fatal, user space can ignore it and
> > continue.
> >
> > However, the failure of KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is fatal, which leads
> > to failure of guest memfd creation.
>
> Failing KVM_CREATE_GUEST_MEMFD isn't truly fatal, it just requires different
> action from userspace, i.e. instead of ignoring the error, userspace could 
> redo
> KVM_CREATE_GUEST_MEMFD with KVM_GUEST_MEMFD_ALLOW_HUGEPAGE=0.
>
> We could make the behavior more like MADV_HUGEPAGE, e.g. theoretically we 
> could
> extend fadvise() with FADV_HUGEPAGE, or add a guest_memfd knob/ioctl() to let
> userspace provide advice/hints after creating a guest_memfd.  But I suspect 
> that
> guest_memfd would be the only user of FADV_HUGEPAGE, and IMO a post-creation 
> hint
> is actually less desirable.
>
> KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will fail only if userspace didn't provide a
> compatible size or the kernel doesn't support THP.  An incompatible size is 
> likely
> a userspace bug, and for most setups that want to utilize guest_memfd, lack 
> of THP
> support is likely a configuration bug.  I.e. many/most uses *want* failures 
> due to
> KVM_GUEST_MEMFD_ALLOW_HUGEPAGE to be fatal.
>
> > For current implementation, I think maybe KVM_GUEST_MEMFD_DESIRE_HUGEPAGE
> > fits better than KVM_GUEST_MEMFD_ALLOW_HUGEPAGE? or maybe *PREFER*?
>
> Why?  Verbs like "prefer" and "desire" aren't a good fit IMO because they 
> suggest
> the flag is a hint, and hints are usually best effort only, i.e. are ignored 
> if
> there is a fundamental incompatibility.
>
> "Allow" isn't perfect, e.g. I would much prefer a straight 
> KVM_GUEST_MEMFD_USE_HUGEPAGES
> or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM 
> doesn't
> (yet) guarantee hugepages.  I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger 
> than
> a hint, but weaker than a requirement.  And if/when KVM supports a dedicated 
> memory
> pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.

I think that the current patch is fine, but I will adjust it to always
allow the flag,
and to make the size check even if !CONFIG_TRANSPARENT_HUGEPAGE.
If hugepages are not guaranteed, and (theoretically) you could have no
hugepage at all in the result, it's okay to get this result even if THP is not
available in the kernel.

Paolo



Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-10-31 Thread Paolo Bonzini
On Tue, Oct 31, 2023 at 11:13 PM Sean Christopherson  wrote:
> On Tue, Oct 31, 2023, Fuad Tabba wrote:
> > On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson  
> > wrote:
> Since we now know that at least pKVM will use guest_memfd for shared memory, 
> and
> odds are quite good that "regular" VMs will also do the same, i.e. will want
> guest_memfd with the concept of private memory, I agree that we should avoid
> PRIVATE.
>
> Though I vote for KVM_MEM_GUEST_MEMFD (or KVM_MEM_GUEST_MEMFD_VALID or
> KVM_MEM_USE_GUEST_MEMFD).  I.e. do our best to avoid ambiguity between 
> referring
> to "guest memory" at-large and guest_memfd.

I was going to propose KVM_MEM_HAS_GUESTMEMFD.  Any option
is okay for me so, if no one complains, I'll go for KVM_MEM_GUESTMEMFD
(no underscore because I found the repeated "_MEM" distracting).

Paolo



Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2

2023-10-30 Thread Paolo Bonzini

On 10/30/23 21:25, Sean Christopherson wrote:

On Mon, Oct 30, 2023, Paolo Bonzini wrote:

On 10/27/23 20:21, Sean Christopherson wrote:


+   if (ioctl == KVM_SET_USER_MEMORY_REGION)
+   size = sizeof(struct kvm_userspace_memory_region);


This also needs a memset(&mem, 0, sizeof(mem)), otherwise the out-of-bounds
access of the commit message becomes a kernel stack read.


Ouch.  There's some irony.  Might be worth doing memset(&mem, -1, sizeof(mem))
though as '0' is a valid file descriptor and a valid file offset.


Either is okay, because unless the flags check is screwed up it should
not matter.  The memset is actually unnecessary, though it may be a good
idea anyway to keep it, aka belt-and-suspenders.


Probably worth adding a check on valid flags here.


Definitely needed.  There's a very real bug here.  But rather than duplicate 
flags
checking or plumb @ioctl all the way to __kvm_set_memory_region(), now that we
have the fancy guard(mutex) and there are no internal calls to 
kvm_set_memory_region(),
what if we:

   1. Acquire/release slots_lock in __kvm_set_memory_region()
   2. Call kvm_set_memory_region() from x86 code for the internal memslots
   3. Disallow *any* flags for internal memslots
   4. Open code check_memory_region_flags in kvm_vm_ioctl_set_memory_region()


I dislike this step; there is a clear point where all paths meet
(ioctl/internal, locked/unlocked), and that's __kvm_set_memory_region().
I think that's the place where flags should be checked.  (I don't mind
the restriction on internal memslots; it's just that to me it's not a
particularly natural way to structure the checks).

On the other hand, the place to protect against out-of-bounds
accesses is the place where you stop caring about struct
kvm_userspace_memory_region vs kvm_userspace_memory_region2 (and
your code gets it right, by dropping "ioctl" as soon as possible).

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 87f45aa91ced..fe5a2af14fff 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1635,6 +1635,14 @@ bool __weak kvm_arch_dirty_log_supported(struct kvm *kvm)
return true;
 }
 
+/*
+ * Flags that do not access any of the extra space of struct
+ * kvm_userspace_memory_region2.  KVM_SET_USER_MEMORY_REGION_FLAGS
+ * only allows these.
+ */
+#define KVM_SET_USER_MEMORY_REGION_FLAGS \
+   (KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)
+
 static int check_memory_region_flags(struct kvm *kvm,
 				     const struct kvm_userspace_memory_region2 *mem)
 {
@@ -5149,10 +5149,16 @@ static long kvm_vm_ioctl(struct file *filp,
 		struct kvm_userspace_memory_region2 mem;
 		unsigned long size;
 
-		if (ioctl == KVM_SET_USER_MEMORY_REGION)
+		if (ioctl == KVM_SET_USER_MEMORY_REGION) {
+			/*
+			 * Fields beyond struct kvm_userspace_memory_region shouldn't be
+			 * accessed, but avoid leaking kernel memory in case of a bug.
+			 */
+			memset(&mem, 0, sizeof(mem));
 			size = sizeof(struct kvm_userspace_memory_region);
-		else
+		} else {
 			size = sizeof(struct kvm_userspace_memory_region2);
+		}
 
 		/* Ensure the common parts of the two structs are identical. */
 		SANITY_CHECK_MEM_REGION_FIELD(slot);
@@ -5165,6 +5167,11 @@ static long kvm_vm_ioctl(struct file *filp,
 		if (copy_from_user(&mem, argp, size))
 			goto out;
 
+		r = -EINVAL;
+		if (ioctl == KVM_SET_USER_MEMORY_REGION &&
+		    (mem.flags & ~KVM_SET_USER_MEMORY_REGION_FLAGS))
+			goto out;
+
 		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}


That's a kind of patch that you can't really get wrong (though I have
the brown paper bag ready).

Maintenance-wise it's fine, since flags are being added at a pace of
roughly one every five years, and anyway it's also future proof: I placed
the #define near check_memory_region_flags so that in five years we remember
to keep it up to date.  But worst case, the new flags will only be allowed
by KVM_SET_USER_MEMORY_REGION2 unnecessarily; there are no security issues
waiting to bite us.

In sum, this is exactly the only kind of fix that should be in the v13->v14
delta.

Paolo



Re: [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:

Non-KVM people, please take a gander at two small-ish patches buried in the
middle of this series:

   fs: Export anon_inode_getfile_secure() for use by KVM
   mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

Our plan/hope is to take this through the KVM tree for 6.8, reviews (and acks!)
would be much appreciated.  Note, adding AS_UNMOVABLE isn't strictly required as
it's "just" an optimization, but we'd prefer to have it in place straightaway.


Reporting what I wrote in the other thread, for wider distribution:

I'm going to wait a couple days more for reviews to come in, post a v14
myself, and apply the series to kvm/next as soon as Linus merges the 6.7
changes.  The series will be based on the 6.7 tags/for-linus, and when
6.7-rc1 comes up, I'll do this to straighten the history:

git checkout kvm/next
git tag -s -f kvm-gmem HEAD
git reset --hard v6.7-rc1
git merge tags/kvm-gmem
# fix conflict with Christian Brauner's VFS series
git commit
git push kvm

6.8 is not going to be out for four months, and I'm pretty sure that
anything that would be discovered within "a few weeks" can also be
applied on top, and the heaviness of a 35-patch series will outweigh any
imperfections by a long margin.

(Full disclosure: this is _also_ because I want to apply this series to
the RHEL kernel, and Red Hat has a high level of disdain for
non-upstream patches.  But it's mostly because I want all dependencies
to be able to move on and be developed on top of stock kvm/next).

Paolo



Re: [PATCH v13 23/35] KVM: x86: Add support for "protected VMs" that can utilize private memory

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:22, Sean Christopherson wrote:

Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
and testing vehicle for Confidential (CoCo) VMs, and potentially to even
become a "real" product in the distant future, e.g. a la pKVM.

The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
Intel's TDX, but those technologies are extremely complex (understatement),
difficult to debug, don't support running as nested guests, and require
hardware that isn't universally accessible.  I.e. relying on SEV-SNP or TDX
for maintaining guest private memory isn't a realistic option.

At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
selftests for guest_memfd and private memory support without requiring
unique hardware.

Signed-off-by: Sean Christopherson 


Reviewed-by: Paolo Bonzini 

with one nit:


+-
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: system ioctl
+
+This capability returns a bitmap of support VM types.  The 1-setting of bit @n


s/support/supported/

Paolo



Re: [PATCH v13 22/35] KVM: Allow arch code to track number of memslot address spaces per VM

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:22, Sean Christopherson wrote:

Let x86 track the number of address spaces on a per-VM basis so that KVM
can disallow SMM memslots for confidential VMs.  Confidential VMs are
fundamentally incompatible with emulating SMM, which as the name suggests
requires being able to read and write guest memory and register state.

Disallowing SMM will simplify support for guest private memory, as KVM
will not need to worry about tracking memory attributes for multiple
address spaces (SMM is the only "non-default" address space across all
architectures).


Reviewed-by: Paolo Bonzini 



Re: [PATCH v13 18/35] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:22, Sean Christopherson wrote:

Initialize run->exit_reason to KVM_EXIT_UNKNOWN early in KVM_RUN to reduce
the probability of exiting to userspace with a stale run->exit_reason that
*appears* to be valid.

To support fd-based guest memory (guest memory without a corresponding
userspace virtual address), KVM will exit to userspace for various memory
related errors, which userspace *may* be able to resolve, instead of using
e.g. BUS_MCEERR_AR.  And in the more distant future, KVM will also likely
utilize the same functionality to let userspace "intercept" and handle
memory faults when the userspace mapping is missing, i.e. when fast gup()
fails.

Because many of KVM's internal APIs related to guest memory use '0' to
indicate "success, continue on" and not "exit to userspace", reporting
memory faults/errors to userspace will set run->exit_reason and
corresponding fields in the run structure in conjunction with a
non-zero, negative return code, e.g. -EFAULT or -EHWPOISON.  And because
KVM already returns -EFAULT in many paths, there's a relatively high
probability that KVM could return -EFAULT without setting run->exit_reason,
in which case reporting KVM_EXIT_UNKNOWN is much better than reporting
whatever exit reason happened to be in the run structure.

Note, KVM must wait until after run->immediate_exit is serviced to
sanitize run->exit_reason as KVM's ABI is that run->exit_reason is
preserved across KVM_RUN when run->immediate_exit is true.

Link: https://lore.kernel.org/all/20230908222905.1321305-1-amoor...@google.com
Link: https://lore.kernel.org/all/zffbwoxz5ui%2fg...@google.com
Signed-off-by: Sean Christopherson 
---
  arch/x86/kvm/x86.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ee3cd8c3c0ef..f41dbb1465a0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10963,6 +10963,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
  {
int r;
  
+	vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;

    vcpu->arch.l1tf_flush_l1d = true;
  
  	for (;;) {


Reviewed-by: Paolo Bonzini 



Re: [PATCH v13 15/35] fs: Export anon_inode_getfile_secure() for use by KVM

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:
Export anon_inode_getfile_secure() so that it can be used by KVM to 
create and manage file-based guest memory without need a fullblow 


without introducing a full-blown

Otherwise,

Reviewed-by: Paolo Bonzini 

Paolo

filesystem. The "standard" anon_inode_getfd() doesn't work for KVM's use 
case as KVM needs a unique inode for each file, e.g. to be able to 
independently manage the size and lifecycle of a given file.
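
For reference, a minimal sketch of the intended call site (not the posted
patch; kvm_gmem_fops and the gmem pointer are placeholders for whatever the
guest_memfd implementation defines, only anon_inode_getfile_secure() itself
is the real helper being exported):

	file = anon_inode_getfile_secure("[kvm-gmem]", &kvm_gmem_fops, gmem, O_RDWR);
	if (IS_ERR(file))
		return PTR_ERR(file);

Each call gets its own inode, so two guest_memfd instances never share
i_size or address_space state.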




Re: [PATCH v13 14/35] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:

Add an "unmovable" flag for mappings that cannot be migrated under any
circumstance.  KVM will use the flag for its upcoming GUEST_MEMFD support,
which will not support compaction/migration, at least not in the
foreseeable future.

Test AS_UNMOVABLE under folio lock as already done for the async
compaction/dirty folio case, as the mapping can be removed by truncation
while compaction is running.  To avoid having to lock every folio with a
mapping, assume/require that unmovable mappings are also unevictable, and
have mapping_set_unmovable() also set AS_UNEVICTABLE.

Cc: Matthew Wilcox 
Co-developed-by: Vlastimil Babka 
Signed-off-by: Vlastimil Babka 
Signed-off-by: Sean Christopherson 


I think this could even be "From: Vlastimil", but no biggie.

Paolo


---
  include/linux/pagemap.h | 19 +-
  mm/compaction.c | 43 +
  mm/migrate.c|  2 ++
  3 files changed, 51 insertions(+), 13 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 351c3b7f93a1..82c9bf506b79 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -203,7 +203,8 @@ enum mapping_flags {
/* writeback related tags are not used */
AS_NO_WRITEBACK_TAGS = 5,
AS_LARGE_FOLIO_SUPPORT = 6,
-	AS_RELEASE_ALWAYS,	/* Call ->release_folio(), even if no private data */
+	AS_RELEASE_ALWAYS = 7,	/* Call ->release_folio(), even if no private data */
+	AS_UNMOVABLE	= 8,	/* The mapping cannot be moved, ever */
  };
  
  /**

@@ -289,6 +290,22 @@ static inline void mapping_clear_release_always(struct 
address_space *mapping)
clear_bit(AS_RELEASE_ALWAYS, &mapping->flags);
  }
  
+static inline void mapping_set_unmovable(struct address_space *mapping)

+{
+   /*
+* It's expected unmovable mappings are also unevictable. Compaction
+* migrate scanner (isolate_migratepages_block()) relies on this to
+* reduce page locking.
+*/
+   set_bit(AS_UNEVICTABLE, &mapping->flags);
+   set_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
+static inline bool mapping_unmovable(struct address_space *mapping)
+{
+   return test_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
  static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
  {
return mapping->gfp_mask;
diff --git a/mm/compaction.c b/mm/compaction.c
index 38c8d216c6a3..12b828aed7c8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -883,6 +883,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
  
  	/* Time to isolate some pages for migration */

for (; low_pfn < end_pfn; low_pfn++) {
+   bool is_dirty, is_unevictable;
  
  		if (skip_on_failure && low_pfn >= next_skip_pfn) {

/*
@@ -1080,8 +1081,10 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
if (!folio_test_lru(folio))
goto isolate_fail_put;
  
+		is_unevictable = folio_test_unevictable(folio);

+
/* Compaction might skip unevictable pages but CMA takes them */
-		if (!(mode & ISOLATE_UNEVICTABLE) && folio_test_unevictable(folio))
+		if (!(mode & ISOLATE_UNEVICTABLE) && is_unevictable)
 			goto isolate_fail_put;
  
  		/*

@@ -1093,26 +1096,42 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 		if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_writeback(folio))
 			goto isolate_fail_put;
  
-		if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_dirty(folio)) {
-			bool migrate_dirty;
+		is_dirty = folio_test_dirty(folio);
+
+		if (((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty) ||
+		    (mapping && is_unevictable)) {
+			bool migrate_dirty = true;
+			bool is_unmovable;
  
  			/*

 * Only folios without mappings or that have
-* a ->migrate_folio callback are possible to
-* migrate without blocking.  However, we may
-* be racing with truncation, which can free
-* the mapping.  Truncation holds the folio lock
-* until after the folio is removed from the page
-* cache so holding it ourselves is sufficient.
+* a ->migrate_folio callback are possible to migrate
+* without blocking.
+*
+* Folios from unmovable mappings are not migratable.
+*
+* However, we can be racing with truncation, which can
+* free the mapping that we need to check. Truncation
+* holds the folio loc

Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:

From: Chao Peng 

Add a new KVM exit type to allow userspace to handle memory faults that
KVM cannot resolve, but that userspace *may* be able to handle (without
terminating the guest).

KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
conversions between private and shared memory.  With guest private memory,
there will be two kind of memory conversions:

   - explicit conversion: happens when the guest explicitly calls into KVM
 to map a range (as private or shared)

   - implicit conversion: happens when the guest attempts to access a gfn
 that is configured in the "wrong" state (private vs. shared)

On x86 (first architecture to support guest private memory), explicit
conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesriable
as there is (obviously) no hypercall, and there is no guarantee that the
guest actually intends to convert between private and shared, i.e. what
KVM thinks is an implicit conversion "request" could actually be the
result of a guest code bug.

KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
be implicit conversions.

Note!  To allow for future possibilities where KVM reports
KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
perspective), not '0'!  Due to historical baggage within KVM, exiting to
userspace with '0' from deep callstacks, e.g. in emulation paths, is
infeasible as doing so would require a near-complete overhaul of KVM,
whereas KVM already propagates -errno return codes to userspace even when
the -errno originated in a low level helper.

Report the gpa+size instead of a single gfn even though the initial usage
is expected to always report single pages.  It's entirely possible, likely
even, that KVM will someday support sub-page granularity faults, e.g.
Intel's sub-page protection feature allows for additional protections at
128-byte granularity.

Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoor...@google.com
Link: https://lore.kernel.org/all/zq3amlo2syv3d...@google.com
Cc: Anish Moorthy 
Cc: David Matlack 
Suggested-by: Sean Christopherson 
Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Signed-off-by: Chao Peng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 


Reviewed-by: Paolo Bonzini 


---
  Documentation/virt/kvm/api.rst | 41 ++
  arch/x86/kvm/x86.c |  1 +
  include/linux/kvm_host.h   | 11 +
  include/uapi/linux/kvm.h   |  8 +++
  4 files changed, 61 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index ace984acc125..860216536810 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6723,6 +6723,26 @@ array field represents return values. The userspace 
should update the return
  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
  spec refer, https://github.com/riscv/riscv-sbi-doc.
  
+::

+
+   /* KVM_EXIT_MEMORY_FAULT */
+   struct {
+   __u64 flags;
+   __u64 gpa;
+   __u64 size;
+   } memory;
+
+KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
+could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
+guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
+describes properties of the faulting access that are likely pertinent.
+Currently, no flags are defined.
+
+Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
+accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
+or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
+kvm_run.exit_reason is stale/undefined for all other error numbers.
+
  ::
  
  /* KVM_EXIT_NOTIFY */

@@ -7757,6 +,27 @@ This capability is aimed to mitigate the threat that 
malicious VMs can
  cause CPU stuck (due to event windows don't open up) and make the CPU
  unavailable to host or other VMs.
  
+7.34 KVM_CAP_MEMORY_FAULT_INFO

+--
+
+:Architectures: x86
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN will fill
+kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
+there is a valid memslot but no backing VMA for the corresponding host virtual
+address.
+
+The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
+an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
+to KVM_EXIT_MEMORY_FAULT.
+
+Note: Userspaces 
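
The quoted documentation spells out the contract; a hedged sketch of how a
VMM consumes it (handle_memory_fault() is a hypothetical helper, and the
memory_fault field name follows the KVM_CAP_MEMORY_FAULT_INFO text above):

	ret = ioctl(vcpu_fd, KVM_RUN, NULL);
	if (ret < 0 && (errno == EFAULT || errno == EHWPOISON) &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		/* e.g. convert [gpa, gpa + size) between shared and private */
		handle_memory_fault(run->memory_fault.gpa,
				    run->memory_fault.size,
				    run->memory_fault.flags);
	}

For any other negative return, exit_reason must be treated as stale.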

Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:

@@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t 
__kvm_handle_hva_range(struct kvm *kvm,
 * the second or later invocation of the handler).
 */
gfn_range.arg = range->arg;
+
+   /*
+* HVA-based notifications aren't relevant to private
+* mappings as they don't have a userspace mapping.


It's confusing who "they" is.  Maybe

 * HVA-based notifications provide a userspace address,
 * and as such are only relevant for shared mappings.

Paolo


+*/
+   gfn_range.only_private = false;
+   gfn_range.only_shared = true;
gfn_range.may_block = range->may_block;
 
 			/*





Re: [PATCH v13 11/35] KVM: Drop .on_unlock() mmu_notifier hook

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:

Drop the .on_unlock() mmu_notifer hook now that it's no longer used for
notifying arch code that memory has been reclaimed.  Adding .on_unlock()
and invoking it *after* dropping mmu_lock was a terrible idea, as doing so
resulted in .on_lock() and .on_unlock() having divergent and asymmetric
behavior, and set future developers up for failure, i.e. all but asked for
bugs where KVM relied on using .on_unlock() to try to run a callback while
holding mmu_lock.

Opportunistically add a lockdep assertion in kvm_mmu_invalidate_end() to
guard against future bugs of this nature.


This is what David suggested to do in patch 3, FWIW.

Reviewed-by: Paolo Bonzini 

Paolo


Reported-by: Isaku Yamahata 
Link: https://lore.kernel.org/all/20230802203119.gb2021...@ls.amr.corp.intel.com
Signed-off-by: Sean Christopherson 
---





Re: [PATCH v13 10/35] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:

Handle AMD SEV's kvm_arch_guest_memory_reclaimed() hook by having
__kvm_handle_hva_range() return whether or not an overlapping memslot
was found, i.e. mmu_lock was acquired.  Using the .on_unlock() hook
works, but kvm_arch_guest_memory_reclaimed() needs to run after dropping
mmu_lock, which makes .on_lock() and .on_unlock() asymmetrical.

Use a small struct to return the tuple of the notifier-specific return,
plus whether or not overlap was found.  Because the iteration helpers are
__always_inlined, practically speaking, the struct will never actually be
returned from a function call (not to mention the size of the struct will
be two bytes in practice).


Could have been split in two patches, but it's fine anyway.

Reviewed-by: Paolo Bonzini 

Paolo
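
For readers following along, the "small struct" in question boils down to
something like this (names follow the series; a sketch, not the patch
itself):

	typedef struct kvm_mmu_notifier_return {
		bool ret;		/* notifier-specific return value */
		bool found_memslot;	/* overlapped a memslot, i.e. took mmu_lock */
	} kvm_mn_ret_t;

Since the iteration helpers are __always_inline, the two bools effectively
stay in registers and no struct is really returned through memory.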



Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry

2023-10-30 Thread Paolo Bonzini
On Mon, Oct 30, 2023 at 5:53 PM David Matlack  wrote:
>
> On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> > From: Chao Peng 
> >
> > Currently in mmu_notifier invalidate path, hva range is recorded and
> > then checked against by mmu_notifier_retry_hva() in the page fault
> > handling path. However, for the to be introduced private memory, a page
>   
>
> Is there a missing word here?

No but there could be missing hyphens ("for the to-be-introduced
private memory"); possibly a "soon" could help parsing and that is
what you were talking about?

> >   if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > + kvm->mmu_invalidate_range_start = INVALID_GPA;
> > + kvm->mmu_invalidate_range_end = INVALID_GPA;
>
> I don't think this is incorrect, but I was a little suprised to see this
> here rather than in end() when mmu_invalidate_in_progress decrements to
> 0.

I think that would be incorrect on the very first start?

> > + }
> > +}
> > +
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > + lockdep_assert_held_write(&kvm->mmu_lock);
>
> Does this compile/function on KVM architectures with
> !KVM_HAVE_MMU_RWLOCK?

Yes:

#define lockdep_assert_held_write(l)\
lockdep_assert(lockdep_is_held_type(l, 0))

where 0 is the lock-held type used by lock_acquire_exclusive, which in
turn is what you get for a spinlock or mutex, in addition to an rwlock or
rwsem that is taken for write.

Instead, lockdep_assert_held() asserts that the lock is taken without
asserting a particular lock-held type.

> > @@ -834,6 +851,12 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned 
> > long start,
>
> Let's add a lockdep_assert_held_write(&kvm->mmu_lock) here too while
> we're at it?

Yes, good idea.
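
Concretely, the follow-up amounts to something like this (sketch only;
signature as in the quoted hunk):

	void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
				    unsigned long end)
	{
		lockdep_assert_held_write(&kvm->mmu_lock);

		/* ... existing end-of-invalidation bookkeeping ... */
	}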

Paolo



Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:


+   if (ioctl == KVM_SET_USER_MEMORY_REGION)
+   size = sizeof(struct kvm_userspace_memory_region);


This also needs a memset(&mem, 0, sizeof(mem)), otherwise the 
out-of-bounds access of the commit message becomes a kernel stack read.


Probably worth adding a check on valid flags here.

Paolo


+   else
+   size = sizeof(struct kvm_userspace_memory_region2);
+
+   /* Ensure the common parts of the two structs are identical. */
+   SANITY_CHECK_MEM_REGION_FIELD(slot);
+   SANITY_CHECK_MEM_REGION_FIELD(flags);
+   SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
+   SANITY_CHECK_MEM_REGION_FIELD(memory_size);
+   SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
 





Re: [PATCH v13 07/35] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:

Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where
appropriate to effectively maintain existing behavior.  Using a proper
Kconfig will simplify building more functionality on top of KVM's
mmu_notifier infrastructure.

Add a forward declaration of kvm_gfn_range to kvm_types.h so that
including arch/powerpc/include/asm/kvm_ppc.h's with CONFIG_KVM=n doesn't
generate warnings due to kvm_gfn_range being undeclared.  PPC defines
hooks for PR vs. HV without guarding them via #ifdeffery, e.g.

  bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);

Alternatively, PPC could forward declare kvm_gfn_range, but there's no
good reason not to define it in common KVM.


The new #define should also imply KVM_CAP_SYNC_MMU, or even: 
KVM_CAP_SYNC_MMU should just be enabled by all architectures at this 
point.  You don't need to care about it, I have a larger series for caps 
that are enabled by all architectures and I'll post it for 6.8.


Reviewed-by: Paolo Bonzini 

Paolo



Re: [PATCH v13 05/35] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:

Assert that both KVM_ARCH_WANT_MMU_NOTIFIER and CONFIG_MMU_NOTIFIER are
defined when KVM is enabled, and return '1' unconditionally for the
CONFIG_KVM_BOOK3S_HV_POSSIBLE=n path.  All flavors of PPC support for KVM
select MMU_NOTIFIER, and KVM_ARCH_WANT_MMU_NOTIFIER is unconditionally
defined by arch/powerpc/include/asm/kvm_host.h.

Effectively dropping use of KVM_ARCH_WANT_MMU_NOTIFIER will simplify a
future cleanup to turn KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig, i.e.
will allow combining all of the

   #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)

checks into a single

   #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER

without having to worry about PPC's "bare" usage of
KVM_ARCH_WANT_MMU_NOTIFIER.

Signed-off-by: Sean Christopherson 
---
  arch/powerpc/kvm/powerpc.c | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 7197c8256668..b0a512ede764 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -632,12 +632,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
ext)
break;
  #endif
case KVM_CAP_SYNC_MMU:
+#if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+   BUILD_BUG();
+#endif
  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
r = hv_enabled;
-#elif defined(KVM_ARCH_WANT_MMU_NOTIFIER)
-   r = 1;
  #else
-   r = 0;
+   r = 1;
  #endif
break;
  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE


Reviewed-by: Paolo Bonzini 



Re: [PATCH v13 04/35] KVM: WARN if there are dangling MMU invalidations at VM destruction

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:
Add an assertion that there are no in-progress MMU invalidations when a 
VM is being destroyed, with the exception of the scenario where KVM 
unregisters its MMU notifier between an .invalidate_range_start() call 
and the corresponding .invalidate_range_end(). KVM can't detect unpaired 
calls from the mmu_notifier due to the above exception waiver, but the 
assertion can detect KVM bugs, e.g. such as the bug that *almost* 
escaped initial guest_memfd development.


Link: 
https://lore.kernel.org/all/e397d30c-c6af-e68f-d18e-b4e3739c5...@linux.intel.com
Signed-off-by: Sean Christopherson 


Reviewed-by: Paolo Bonzini 

Paolo



Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:
From: Chao Peng 

Currently in mmu_notifier
invalidate path, hva range is recorded and then checked against by 
mmu_notifier_retry_hva() in the page fault handling path. However, for 
the to be introduced private memory, a page fault may not have a hva 
associated, checking gfn(gpa) makes more sense. For existing hva based 
shared memory, gfn is expected to also work. The only downside is when 
aliasing multiple gfns to a single hva, the current algorithm of 
checking multiple ranges could result in a much larger range being 
rejected. Such aliasing should be uncommon, so the impact is expected 
small.


Reviewed-by: Paolo Bonzini 

Paolo
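
For context, the gfn-based retry check described in the quoted message ends
up looking roughly like this (illustrative sketch, not the exact patch):

	static inline bool mmu_invalidate_retry_gfn(struct kvm *kvm,
						    unsigned long mmu_seq, gfn_t gfn)
	{
		lockdep_assert_held(&kvm->mmu_lock);
		/*
		 * Retry if an invalidation covering this gfn is in flight, or
		 * if any invalidation completed since the fault started.
		 */
		if (unlikely(kvm->mmu_invalidate_in_progress) &&
		    gfn >= kvm->mmu_invalidate_range_start &&
		    gfn < kvm->mmu_invalidate_range_end)
			return true;

		return kvm->mmu_invalidate_seq != mmu_seq;
	}

The aliasing caveat in the commit message follows directly: if several gfns
map to one hva, the recorded [start, end) gfn range can be wider than
strictly necessary, so more faults get retried than with the hva check.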



Re: [PATCH v13 02/35] KVM: Assert that mmu_invalidate_in_progress *never* goes negative

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:

Move the assertion on the in-progress invalidation count from the primary
MMU's notifier path to KVM's common notification path, i.e. assert that
the count doesn't go negative even when the invalidation is coming from
KVM itself.

Opportunistically convert the assertion to a KVM_BUG_ON(), i.e. kill only
the affected VM, not the entire kernel.  A corrupted count is fatal to the
VM, e.g. the non-zero (negative) count will cause mmu_invalidate_retry()
to block any and all attempts to install new mappings.  But it's far from
guaranteed that an end() without a start() is fatal or even problematic to
anything other than the target VM, e.g. the underlying bug could simply be
a duplicate call to end().  And it's much more likely that a missed
invalidation, i.e. a potential use-after-free, would manifest as no
notification whatsoever, not an end() without a start().


Reviewed-by: Paolo Bonzini 


Signed-off-by: Sean Christopherson 
---
  virt/kvm/kvm_main.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0524933856d4..5a97e6c7d9c2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -833,6 +833,7 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long 
start,
 * in conjunction with the smp_rmb in mmu_invalidate_retry().
 */
kvm->mmu_invalidate_in_progress--;
+   KVM_BUG_ON(kvm->mmu_invalidate_in_progress < 0, kvm);
  }
  
  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,

@@ -863,8 +864,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct 
mmu_notifier *mn,
 */
if (wake)
rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
-
-   BUG_ON(kvm->mmu_invalidate_in_progress < 0);
  }
  
  static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,




Re: [RFC PATCH v11 13/29] KVM: Add transparent hugepage support for dedicated guest memory

2023-09-06 Thread Paolo Bonzini
On Fri, Jul 21, 2023 at 7:13 PM Sean Christopherson  wrote:
> On Fri, Jul 21, 2023, Paolo Bonzini wrote:
> > On 7/19/23 01:44, Sean Christopherson wrote:
> > > @@ -413,6 +454,9 @@ int kvm_gmem_create(struct kvm *kvm, struct 
> > > kvm_create_guest_memfd *args)
> > > u64 flags = args->flags;
> > > u64 valid_flags = 0;
> > > +   if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > > +   valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> > > +
> >
> > I think it should be always allowed.  The outcome would just be "never have
> > a hugepage" if thp is not enabled in the kernel.
>
> I don't have a strong preference.  My thinking was that userspace would 
> probably
> rather have an explicit error, as opposed to silently running with a 
> misconfigured
> setup.

Considering that is how madvise(MADV_HUGEPAGE) behaves, your patch is
good. I disagree but consistency is better.

Paolo
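
For the record, the userspace-visible behaviour being debated is roughly
the following (sketch; struct and ioctl names follow the series, vm_fd is
assumed to be an open VM file descriptor):

	struct kvm_create_guest_memfd args = {
		.size  = 512 * 1024 * 1024,
		.flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE,
	};
	int fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);

With CONFIG_TRANSPARENT_HUGEPAGE=n the flag is rejected and the ioctl fails
with EINVAL, rather than silently creating a memfd that can never use
hugepages.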



Re: [RFC PATCH v11 06/29] KVM: Introduce KVM_SET_USER_MEMORY_REGION2

2023-07-31 Thread Paolo Bonzini

On 7/29/23 02:03, Sean Christopherson wrote:

KVM would need to do multiple uaccess reads, but that's not a big
deal.  Am I missing something, or did past us just get too clever and
miss the obvious solution?


You would have to introduce struct kvm_userspace_memory_region2 anyway, 
though not a new ioctl, for two reasons:


1) the current size of the struct is part of the userspace API via the 
KVM_SET_USER_MEMORY_REGION #define, so introducing a new struct is the 
easiest way to preserve this


2) the struct can (at least theoretically) enter the ABI of a shared 
library, and such mismatches are really hard to detect and resolve.  So 
it's better to add the padding to a new struct, and keep struct 
kvm_userspace_memory_region backwards-compatible.
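
(For illustration only: the padded successor ends up shaped roughly like
the following; the exact new members and the amount of padding are not
implied by this mail, only the idea that extensions plus reserved space
live in a new struct.)

	struct kvm_userspace_memory_region2 {
		__u32 slot;
		__u32 flags;
		__u64 guest_phys_addr;
		__u64 memory_size;
		__u64 userspace_addr;
		/* new guest_memfd members go here ... */
		__u64 pad[16];		/* reserved for future extensions */
	};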



As to whether we should introduce a new ioctl: doing so makes 
KVM_SET_USER_MEMORY_REGION's detection of bad flags a bit more robust; 
it's not like we cannot introduce new flags at all, of course, but 
having out-of-bounds reads as a side effect of new flags is a bit nasty. 
 Protecting programs from their own bugs gets into diminishing returns 
very quickly, but introducing a new ioctl can make exploits a bit harder 
when struct kvm_userspace_memory_region is on the stack and adjacent to 
an attacker-controlled location.


Paolo



Re: [RFC PATCH v11 10/29] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

2023-07-28 Thread Paolo Bonzini

On 7/28/23 18:02, Vlastimil Babka wrote:

There's even a comment to that effect later on in the function:

Hmm, well spotted. But it wouldn't be so great if we now had to lock every
inspected page (and not just dirty pages), just to check the AS_ bit.

But I wonder if this is leftover from previous versions. Are the guest pages
even PageLRU currently? (and should they be, given how they can't be swapped
out or anything?) If not, isolate_migratepages_block will skip them anyway.


No, they're not (migration or even swap-out is not excluded for the
future, but for now it's left for future work).


Paolo



Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-07-21 Thread Paolo Bonzini

On 7/19/23 01:44, Sean Christopherson wrote:

+   inode = alloc_anon_inode(mnt->mnt_sb);
+   if (IS_ERR(inode))
+   return PTR_ERR(inode);
+
+   err = security_inode_init_security_anon(inode, &qname, NULL);
+   if (err)
+   goto err_inode;
+


I don't understand the need to have a separate filesystem.  If it is to 
fully set up the inode before it's given a struct file, why not just 
export anon_inode_make_secure_inode instead of 
security_inode_init_security_anon?


Paolo



Re: [RFC PATCH v11 18/29] KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper

2023-07-21 Thread Paolo Bonzini

On 7/19/23 01:45, Sean Christopherson wrote:

Drop kvm_userspace_memory_region_find(), it's unused and a terrible API
(probably why it's unused).  If anything outside of kvm_util.c needs to
get at the memslot, userspace_mem_region_find() can be exposed to give
others full access to all memory region/slot information.

Signed-off-by: Sean Christopherson 
---
  .../selftests/kvm/include/kvm_util_base.h |  4 ---
  tools/testing/selftests/kvm/lib/kvm_util.c| 29 ---
  2 files changed, 33 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 07732a157ccd..6aeb008dd668 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -753,10 +753,6 @@ vm_adjust_num_guest_pages(enum vm_guest_mode mode, 
unsigned int num_guest_pages)
return n;
  }
  
-struct kvm_userspace_memory_region *

-kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start,
-uint64_t end);
-
  #define sync_global_to_guest(vm, g) ({\
typeof(g) *_p = addr_gva2hva(vm, (vm_vaddr_t)&(g)); \
memcpy(_p, &(g), sizeof(g));\
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index 9741a7ff6380..45d21e052db0 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -586,35 +586,6 @@ userspace_mem_region_find(struct kvm_vm *vm, uint64_t 
start, uint64_t end)
return NULL;
  }
  
-/*

- * KVM Userspace Memory Region Find
- *
- * Input Args:
- *   vm - Virtual Machine
- *   start - Starting VM physical address
- *   end - Ending VM physical address, inclusive.
- *
- * Output Args: None
- *
- * Return:
- *   Pointer to overlapping region, NULL if no such region.
- *
- * Public interface to userspace_mem_region_find. Allows tests to look up
- * the memslot datastructure for a given range of guest physical memory.
- */
-struct kvm_userspace_memory_region *
-kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start,
-uint64_t end)
-{
-   struct userspace_mem_region *region;
-
-   region = userspace_mem_region_find(vm, start, end);
-   if (!region)
-   return NULL;
-
-	return &region->region;
-}
-
  __weak void vcpu_arch_free(struct kvm_vcpu *vcpu)
  {
  


Will queue this.

Paolo



Re: [RFC PATCH v11 16/29] KVM: Allow arch code to track number of memslot address spaces per VM

2023-07-21 Thread Paolo Bonzini

On 7/19/23 01:44, Sean Christopherson wrote:

@@ -4725,9 +4725,9 @@ static int kvm_vm_ioctl_check_extension_generic(struct 
kvm *kvm, long arg)
case KVM_CAP_IRQ_ROUTING:
return KVM_MAX_IRQ_ROUTES;
  #endif
-#if KVM_ADDRESS_SPACE_NUM > 1
+#if KVM_MAX_NR_ADDRESS_SPACES > 1
case KVM_CAP_MULTI_ADDRESS_SPACE:
-   return KVM_ADDRESS_SPACE_NUM;
+   return KVM_MAX_NR_ADDRESS_SPACES;
  #endif


Since this is a VM ioctl, it should return 
kvm_arch_nr_memslot_as_ids(kvm) if kvm != NULL.
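I.e. something along these lines (sketch of the suggestion, not a tested
patch):

	case KVM_CAP_MULTI_ADDRESS_SPACE:
		if (kvm)
			return kvm_arch_nr_memslot_as_ids(kvm);
		return KVM_MAX_NR_ADDRESS_SPACES;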


Paolo



Re: [RFC PATCH v11 14/29] KVM: x86/mmu: Handle page fault for private memory

2023-07-21 Thread Paolo Bonzini
urn RET_PF_USER;
+}
+
+static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
+  struct kvm_page_fault *fault)
+{
+   int max_order, r;
+
+   if (!kvm_slot_can_be_private(fault->slot))
+   return kvm_do_memory_fault_exit(vcpu, fault);
+
+   r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
+&max_order);
+   if (r)
+   return r;
+
+   fault->max_level = min(kvm_max_level_for_order(max_order),
+  fault->max_level);
+   fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
+   return RET_PF_CONTINUE;
+}
+
  static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault 
*fault)
  {
struct kvm_memory_slot *slot = fault->slot;
@@ -4336,6 +4399,12 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, 
struct kvm_page_fault *fault
return RET_PF_EMULATE;
}
  
+	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))

+   return kvm_do_memory_fault_exit(vcpu, fault);
+
+   if (fault->is_private)
+   return kvm_faultin_pfn_private(vcpu, fault);
+
async = false;
fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, 
&async,
  fault->write, &fault->map_writable,
@@ -5771,6 +5840,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, 
gpa_t cr2_or_gpa, u64 err
return -EIO;
}
  
+	if (r == RET_PF_USER)

+   return 0;
+
if (r < 0)
return r;
if (r != RET_PF_EMULATE)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index d39af5639ce9..268b517e88cb 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -203,6 +203,7 @@ struct kvm_page_fault {
  
  	/* Derived from mmu and global state.  */

const bool is_tdp;
+   const bool is_private;
const bool nx_huge_page_workaround_enabled;
  
  	/*

@@ -259,6 +260,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct 
kvm_page_fault *fault);
   * RET_PF_RETRY: let CPU fault again on the address.
   * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
   * RET_PF_INVALID: the spte is invalid, let the real page fault path update 
it.
+ * RET_PF_USER: need to exit to userspace to handle this fault.
   * RET_PF_FIXED: The faulting entry has been fixed.
   * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another 
vCPU.
   *
@@ -275,6 +277,7 @@ enum {
RET_PF_RETRY,
RET_PF_EMULATE,
RET_PF_INVALID,
+   RET_PF_USER,
RET_PF_FIXED,
RET_PF_SPURIOUS,
  };
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index ae86820cef69..2d7555381955 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
  TRACE_DEFINE_ENUM(RET_PF_RETRY);
  TRACE_DEFINE_ENUM(RET_PF_EMULATE);
  TRACE_DEFINE_ENUM(RET_PF_INVALID);
+TRACE_DEFINE_ENUM(RET_PF_USER);
  TRACE_DEFINE_ENUM(RET_PF_FIXED);
  TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
  


Reviewed-by: Paolo Bonzini 



Re: [RFC PATCH v11 15/29] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro

2023-07-21 Thread Paolo Bonzini

On 7/19/23 01:44, Sean Christopherson wrote:

Signed-off-by: Sean Christopherson 
---
  arch/x86/include/asm/kvm_host.h | 1 -
  include/linux/kvm_host.h| 2 +-
  2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b87ff7b601fa..7a905e033932 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2105,7 +2105,6 @@ enum {
  #define HF_SMM_MASK   (1 << 1)
  #define HF_SMM_INSIDE_NMI_MASK(1 << 2)
  
-# define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE

  # define KVM_ADDRESS_SPACE_NUM 2
  # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 
1 : 0)
  # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0d1e2ee8ae7a..5839ef44e145 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -693,7 +693,7 @@ bool kvm_arch_irqchip_in_kernel(struct kvm *kvm);
  #define KVM_MEM_SLOTS_NUM SHRT_MAX
  #define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
  
-#ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE

+#if KVM_ADDRESS_SPACE_NUM == 1
  static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
  {
return 0;


Reviewed-by: Paolo Bonzini 


