Re: [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify()

2024-04-11 Thread Paolo Bonzini
On Wed, Apr 10, 2024 at 11:30 PM Andrew Morton wrote:
> On Fri,  5 Apr 2024 07:58:11 -0400 Paolo Bonzini  wrote:
> > Please review!  Also feel free to take the KVM patches through the mm
> > tree, as I don't expect any conflicts.
>
> It's mainly a KVM thing and the MM changes are small and simple.
> I'd say that the KVM tree would be a better home?

Sure! I'll queue them on my side then.

Paolo




Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-11 Thread Paolo Bonzini
On Mon, Apr 8, 2024 at 3:56 PM Peter Xu  wrote:
> Paolo,
>
> I may miss a bunch of details here (as I still remember some change_pte
> patches previously on the list..), however not sure whether we considered
> enable it?  Asked because I remember Andrea used to have a custom tree
> maintaining that part:
>
> https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968

The patch enables it only for KSM, so it would still require a bunch
of cleanups; for example, I would also still use set_pte_at() in all
the places that are not KSM. That would at least fix the issue of the
poor documentation on when to use set_pte_at_notify() vs.
set_pte_at().

With regard to the implementation, I like the idea of disabling the
invalidation on the MMU notifier side, but I would rather have
MMU_NOTIFIER_CHANGE_PTE as a separate field in the range instead of
overloading the event field.
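For concreteness, a rough sketch of that alternative; the layout and the new
field name are illustrative only, not taken from Andrea's tree or from
mainline:

	struct mmu_notifier_range {
		struct mm_struct	*mm;
		unsigned long		start;
		unsigned long		end;
		unsigned		flags;
		enum mmu_notifier_event	event;		/* unchanged */
		bool			change_pte;	/* new: set by KSM when only the PTE moves */
		void			*owner;
	};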

> Maybe it can't be enabled for some reason that I overlooked in the current
> tree, or we just decided to not to?

I have just learnt about the patch; nobody had ever mentioned it even
though it's almost 2 years old... It's a lot of code, though, and no one
has reported an issue in over 10 years, so I think it's easiest to just
rip the code out.

Paolo

> Thanks,
>
> --
> Peter Xu
>




[PATCH 4/4] mm: replace set_pte_at_notify() with just set_pte_at()

2024-04-05 Thread Paolo Bonzini
With the demise of the .change_pte() MMU notifier callback, there is no
notification happening in set_pte_at_notify().  It is a synonym of
set_pte_at() and can be replaced with it.

Signed-off-by: Paolo Bonzini 
---
 include/linux/mmu_notifier.h | 2 --
 kernel/events/uprobes.c  | 5 ++---
 mm/ksm.c | 4 ++--
 mm/memory.c  | 7 +--
 mm/migrate_device.c  | 8 ++--
 5 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 8c72bf651606..d39ebb10caeb 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -657,6 +657,4 @@ static inline void mmu_notifier_synchronize(void)
 
 #endif /* CONFIG_MMU_NOTIFIER */
 
-#define set_pte_at_notify set_pte_at
-
 #endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index e4834d23e1d1..f4523b95c945 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -18,7 +18,6 @@
 #include 
 #include 
 #include /* anon_vma_prepare */
-#include /* set_pte_at_notify */
 #include /* folio_free_swap */
 #include   /* user_enable_single_step */
 #include   /* notifier mechanism */
@@ -195,8 +194,8 @@ static int __replace_page(struct vm_area_struct *vma, 
unsigned long addr,
flush_cache_page(vma, addr, pte_pfn(ptep_get(pvmw.pte)));
ptep_clear_flush(vma, addr, pvmw.pte);
if (new_page)
-   set_pte_at_notify(mm, addr, pvmw.pte,
- mk_pte(new_page, vma->vm_page_prot));
+   set_pte_at(mm, addr, pvmw.pte,
+  mk_pte(new_page, vma->vm_page_prot));
 
folio_remove_rmap_pte(old_folio, old_page, vma);
if (!folio_mapped(old_folio))
diff --git a/mm/ksm.c b/mm/ksm.c
index 8c001819cf10..108a4d167824 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1345,7 +1345,7 @@ static int write_protect_page(struct vm_area_struct *vma, 
struct page *page,
if (pte_write(entry))
entry = pte_wrprotect(entry);
 
-   set_pte_at_notify(mm, pvmw.address, pvmw.pte, entry);
+   set_pte_at(mm, pvmw.address, pvmw.pte, entry);
}
*orig_pte = entry;
err = 0;
@@ -1447,7 +1447,7 @@ static int replace_page(struct vm_area_struct *vma, 
struct page *page,
 * See Documentation/mm/mmu_notifier.rst
 */
ptep_clear_flush(vma, addr, ptep);
-   set_pte_at_notify(mm, addr, ptep, newpte);
+   set_pte_at(mm, addr, ptep, newpte);
 
folio = page_folio(page);
folio_remove_rmap_pte(folio, page, vma);
diff --git a/mm/memory.c b/mm/memory.c
index f2bc6dd15eb8..9a6f4d8aa379 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3327,13 +3327,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
ptep_clear_flush(vma, vmf->address, vmf->pte);
folio_add_new_anon_rmap(new_folio, vma, vmf->address);
folio_add_lru_vma(new_folio, vma);
-   /*
-* We call the notify macro here because, when using secondary
-* mmu page tables (such as kvm shadow page tables), we want the
-* new page to be mapped directly into the secondary page table.
-*/
BUG_ON(unshare && pte_write(entry));
-   set_pte_at_notify(mm, vmf->address, vmf->pte, entry);
+   set_pte_at(mm, vmf->address, vmf->pte, entry);
update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
if (old_folio) {
/*
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index b6c27c76e1a0..66206734b1b9 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -664,13 +664,9 @@ static void migrate_vma_insert_page(struct migrate_vma 
*migrate,
if (flush) {
flush_cache_page(vma, addr, pte_pfn(orig_pte));
ptep_clear_flush(vma, addr, ptep);
-   set_pte_at_notify(mm, addr, ptep, entry);
-   update_mmu_cache(vma, addr, ptep);
-   } else {
-   /* No need to invalidate - it was non-present before */
-   set_pte_at(mm, addr, ptep, entry);
-   update_mmu_cache(vma, addr, ptep);
}
+   set_pte_at(mm, addr, ptep, entry);
+   update_mmu_cache(vma, addr, ptep);
 
pte_unmap_unlock(ptep, ptl);
*src = MIGRATE_PFN_MIGRATE;
-- 
2.43.0




[PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify()

2024-04-05 Thread Paolo Bonzini
The .change_pte() MMU notifier callback was intended as an optimization
and for this reason it was initially called without a surrounding
mmu_notifier_invalidate_range_{start,end}() pair.  It was only ever
implemented by KVM (which was also the original user of MMU notifiers)
and the rules on when to call set_pte_at_notify() rather than set_pte_at()
have always been pretty obscure.

It may seem a miracle that it has never caused any hard-to-trigger
bugs, but there's a good reason for that: KVM's implementation has
been nonfunctional for a good part of its existence.  Already in
2012, commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with
invalidate_range_start and invalidate_range_end", 2012-10-09) changed the
.change_pte() callback to occur within an invalidate_range_start/end()
pair; and because KVM unmaps the sPTEs during .invalidate_range_start(),
.change_pte() has no hope of finding a sPTE to change.
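
To make the ordering concrete, here is a condensed, illustrative sketch of
the KSM replace_page() path as it has looked since that commit; KVM reacts
to invalidate_range_start() by zapping the sPTE, so the later notification
finds nothing to update:

	static void ksm_style_replace(struct mm_struct *mm, struct vm_area_struct *vma,
				      unsigned long addr, pte_t *ptep, pte_t newpte)
	{
		struct mmu_notifier_range range;

		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
					addr, addr + PAGE_SIZE);
		mmu_notifier_invalidate_range_start(&range);	/* KVM unmaps the sPTE here */

		ptep_clear_flush(vma, addr, ptep);		/* primary PTE goes away */
		set_pte_at_notify(mm, addr, ptep, newpte);	/* .change_pte(): no sPTE left, no-op */

		mmu_notifier_invalidate_range_end(&range);	/* KVM re-faults on the next access */
	}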

Therefore, all the code for .change_pte() can be removed from both KVM
and mm/, and set_pte_at_notify() can be replaced with just set_pte_at().

Please review!  Also feel free to take the KVM patches through the mm
tree, as I don't expect any conflicts.

Thanks,

Paolo

Paolo Bonzini (4):
  KVM: delete .change_pte MMU notifier callback
  KVM: remove unused argument of kvm_handle_hva_range()
  mmu_notifier: remove the .change_pte() callback
  mm: replace set_pte_at_notify() with just set_pte_at()

 arch/arm64/kvm/mmu.c  | 34 -
 arch/loongarch/include/asm/kvm_host.h |  1 -
 arch/loongarch/kvm/mmu.c  | 32 
 arch/mips/kvm/mmu.c   | 30 ---
 arch/powerpc/include/asm/kvm_ppc.h|  1 -
 arch/powerpc/kvm/book3s.c |  5 ---
 arch/powerpc/kvm/book3s.h |  1 -
 arch/powerpc/kvm/book3s_64_mmu_hv.c   | 12 --
 arch/powerpc/kvm/book3s_hv.c  |  1 -
 arch/powerpc/kvm/book3s_pr.c  |  7 
 arch/powerpc/kvm/e500_mmu_host.c  |  6 ---
 arch/riscv/kvm/mmu.c  | 20 --
 arch/x86/kvm/mmu/mmu.c| 54 +--
 arch/x86/kvm/mmu/spte.c   | 16 
 arch/x86/kvm/mmu/spte.h   |  2 -
 arch/x86/kvm/mmu/tdp_mmu.c| 46 ---
 arch/x86/kvm/mmu/tdp_mmu.h|  1 -
 include/linux/kvm_host.h  |  2 -
 include/linux/mmu_notifier.h  | 44 --
 include/trace/events/kvm.h| 15 
 kernel/events/uprobes.c   |  5 +--
 mm/ksm.c  |  4 +-
 mm/memory.c   |  7 +---
 mm/migrate_device.c   |  8 +---
 mm/mmu_notifier.c | 17 -
 virt/kvm/kvm_main.c   | 50 +
 26 files changed, 10 insertions(+), 411 deletions(-)

-- 
2.43.0




[PATCH 3/4] mmu_notifier: remove the .change_pte() callback

2024-04-05 Thread Paolo Bonzini
The scope of set_pte_at_notify() has reduced more and more through the
years.  Initially, it was meant for when the change to the PTE was
not bracketed by mmu_notifier_invalidate_range_{start,end}().  However,
that has not been so for over ten years.  During all this period
the only implementation of .change_pte() was KVM and it
had no actual functionality, because it was called after
mmu_notifier_invalidate_range_start() zapped the secondary PTE.

Now that this (nonfunctional) user of the .change_pte() callback is
gone, the whole callback can be removed.  For now, leave in place
set_pte_at_notify() even though it is just a synonym for set_pte_at().

Signed-off-by: Paolo Bonzini 
---
 include/linux/mmu_notifier.h | 46 ++--
 mm/mmu_notifier.c| 17 -
 2 files changed, 2 insertions(+), 61 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index f349e08a9dfe..8c72bf651606 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -122,15 +122,6 @@ struct mmu_notifier_ops {
  struct mm_struct *mm,
  unsigned long address);
 
-   /*
-* change_pte is called in cases that pte mapping to page is changed:
-* for example, when ksm remaps pte to point to a new shared page.
-*/
-   void (*change_pte)(struct mmu_notifier *subscription,
-  struct mm_struct *mm,
-  unsigned long address,
-  pte_t pte);
-
/*
 * invalidate_range_start() and invalidate_range_end() must be
 * paired and are called only when the mmap_lock and/or the
@@ -392,8 +383,6 @@ extern int __mmu_notifier_clear_young(struct mm_struct *mm,
  unsigned long end);
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 unsigned long address);
-extern void __mmu_notifier_change_pte(struct mm_struct *mm,
- unsigned long address, pte_t pte);
 extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
 extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r);
 extern void __mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm,
@@ -439,13 +428,6 @@ static inline int mmu_notifier_test_young(struct mm_struct 
*mm,
return 0;
 }
 
-static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-  unsigned long address, pte_t pte)
-{
-   if (mm_has_notifiers(mm))
-   __mmu_notifier_change_pte(mm, address, pte);
-}
-
 static inline void
 mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 {
@@ -581,26 +563,6 @@ static inline void mmu_notifier_range_init_owner(
__young;\
 })
 
-/*
- * set_pte_at_notify() sets the pte _after_ running the notifier.
- * This is safe to start by updating the secondary MMUs, because the primary 
MMU
- * pte invalidate must have already happened with a ptep_clear_flush() before
- * set_pte_at_notify() has been invoked.  Updating the secondary MMUs first is
- * required when we change both the protection of the mapping from read-only to
- * read-write and the pfn (like during copy on write page faults). Otherwise 
the
- * old page would remain mapped readonly in the secondary MMUs after the new
- * page is already writable by some CPU through the primary MMU.
- */
-#define set_pte_at_notify(__mm, __address, __ptep, __pte)  \
-({ \
-   struct mm_struct *___mm = __mm; \
-   unsigned long ___address = __address;   \
-   pte_t ___pte = __pte;   \
-   \
-   mmu_notifier_change_pte(___mm, ___address, ___pte); \
-   set_pte_at(___mm, ___address, __ptep, ___pte);  \
-})
-
 #else /* CONFIG_MMU_NOTIFIER */
 
 struct mmu_notifier_range {
@@ -650,11 +612,6 @@ static inline int mmu_notifier_test_young(struct mm_struct 
*mm,
return 0;
 }
 
-static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-  unsigned long address, pte_t pte)
-{
-}
-
 static inline void
 mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 {
@@ -693,7 +650,6 @@ static inline void 
mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
 #defineptep_clear_flush_notify ptep_clear_flush
 #define pmdp_huge_clear_flush_notify pmdp_huge_clear_flush
 #define pudp_huge_clear_flush_notify pudp_huge_clear_flush
-#define set_pte_at_notify set_pte_at
 
 static inline void mmu_notifier_synchronize(void)
 {
@@ -701,4 +657,6 @@ static

[PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-05 Thread Paolo Bonzini
The .change_pte() MMU notifier callback was intended as an
optimization. The original point of it was that KSM could tell KVM to flip
its secondary PTE to a new location without having to first zap it. At
the time there was also an .invalidate_page() callback; both of them were
*not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
and .invalidate_page() also doubled as a fallback implementation of
.change_pte().

Later on, however, both callbacks were changed to occur within an
invalidate_range_start/end() block.

In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
set_pte_at_notify with invalidate_range_start and invalidate_range_end",
2012-10-09) did so to remove the fallback from .invalidate_page() to
.change_pte() and allow sleepable .invalidate_page() hooks.

This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start()
and therefore .change_pte() has no hope of finding a sPTE to change.
Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
well as all the architecture specific implementations.

Signed-off-by: Paolo Bonzini 
---
 arch/arm64/kvm/mmu.c  | 34 -
 arch/loongarch/include/asm/kvm_host.h |  1 -
 arch/loongarch/kvm/mmu.c  | 32 
 arch/mips/kvm/mmu.c   | 30 ---
 arch/powerpc/include/asm/kvm_ppc.h|  1 -
 arch/powerpc/kvm/book3s.c |  5 ---
 arch/powerpc/kvm/book3s.h |  1 -
 arch/powerpc/kvm/book3s_64_mmu_hv.c   | 12 --
 arch/powerpc/kvm/book3s_hv.c  |  1 -
 arch/powerpc/kvm/book3s_pr.c  |  7 
 arch/powerpc/kvm/e500_mmu_host.c  |  6 ---
 arch/riscv/kvm/mmu.c  | 20 --
 arch/x86/kvm/mmu/mmu.c| 54 +--
 arch/x86/kvm/mmu/spte.c   | 16 
 arch/x86/kvm/mmu/spte.h   |  2 -
 arch/x86/kvm/mmu/tdp_mmu.c| 46 ---
 arch/x86/kvm/mmu/tdp_mmu.h|  1 -
 include/linux/kvm_host.h  |  2 -
 include/trace/events/kvm.h| 15 
 virt/kvm/kvm_main.c   | 43 -
 20 files changed, 2 insertions(+), 327 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index dc04bc767865..ff17849be9f4 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct 
kvm_gfn_range *range)
return false;
 }
 
-bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
-{
-   kvm_pfn_t pfn = pte_pfn(range->arg.pte);
-
-   if (!kvm->arch.mmu.pgt)
-   return false;
-
-   WARN_ON(range->end - range->start != 1);
-
-   /*
-* If the page isn't tagged, defer to user_mem_abort() for sanitising
-* the MTE tags. The S2 pte should have been unmapped by
-* mmu_notifier_invalidate_range_end().
-*/
-   if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn)))
-   return false;
-
-   /*
-* We've moved a page around, probably through CoW, so let's treat
-* it just like a translation fault and the map handler will clean
-* the cache to the PoC.
-*
-* The MMU notifiers will have unmapped a huge PMD before calling
-* ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
-* therefore we never need to clear out a huge PMD through this
-* calling path and a memcache is not required.
-*/
-   kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT,
-  PAGE_SIZE, __pfn_to_phys(pfn),
-  KVM_PGTABLE_PROT_R, NULL, 0);
-
-   return false;
-}
-
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
u64 size = (range->end - range->start) << PAGE_SHIFT;
diff --git a/arch/loongarch/include/asm/kvm_host.h 
b/arch/loongarch/include/asm/kvm_host.h
index 2d62f7b0d377..69305441f40d 100644
--- a/arch/loongarch/include/asm/kvm_host.h
+++ b/arch/loongarch/include/asm/kvm_host.h
@@ -203,7 +203,6 @@ void kvm_flush_tlb_all(void);
 void kvm_flush_tlb_gpa(struct kvm_vcpu *vcpu, unsigned long gpa);
 int kvm_handle_mm_fault(struct kvm_vcpu *vcpu, unsigned long badv, bool write);
 
-void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long 
end, bool blockable);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c
index a556cff35740..98883aa23ab8 100644
--- a/arch/loongarch/kvm/mmu.c
+++ b/arch/loongarch/kvm/mmu.c
@@ -494,38 +494,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct 

[PATCH 2/4] KVM: remove unused argument of kvm_handle_hva_range()

2024-04-05 Thread Paolo Bonzini
The only user was kvm_mmu_notifier_change_pte(), which is now gone.

Signed-off-by: Paolo Bonzini 
---
 virt/kvm/kvm_main.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2fcd9979752a..970111ad 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -595,8 +595,6 @@ static void kvm_null_fn(void)
 }
 #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
 
-static const union kvm_mmu_notifier_arg KVM_MMU_NOTIFIER_NO_ARG;
-
 /* Iterate over each memslot intersecting [start, last] (inclusive) range */
 #define kvm_for_each_memslot_in_hva_range(node, slots, start, last) \
	for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
@@ -682,14 +680,12 @@ static __always_inline kvm_mn_ret_t 
__kvm_handle_hva_range(struct kvm *kvm,
 static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
unsigned long start,
unsigned long end,
-   union kvm_mmu_notifier_arg arg,
gfn_handler_t handler)
 {
struct kvm *kvm = mmu_notifier_to_kvm(mn);
const struct kvm_mmu_notifier_range range = {
.start  = start,
.end= end,
-   .arg= arg,
.handler= handler,
.on_lock= (void *)kvm_null_fn,
.flush_on_ret   = true,
@@ -880,8 +876,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct 
mmu_notifier *mn,
 {
trace_kvm_age_hva(start, end);
 
-   return kvm_handle_hva_range(mn, start, end, KVM_MMU_NOTIFIER_NO_ARG,
-   kvm_age_gfn);
+   return kvm_handle_hva_range(mn, start, end, kvm_age_gfn);
 }
 
 static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
-- 
2.43.0





Re: [PATCH] KVM: deprecate KVM_WERROR in favor of general WERROR

2023-10-12 Thread Paolo Bonzini
On Wed, Oct 11, 2023 at 1:46 AM Sean Christopherson  wrote:
> > > The DRM_I915_WERROR config depends on EXPERT and !COMPILE_TEST, and to
> > > my knowledge this has never caused issues outside of i915 developers and
> > > CI.
> >
> > Ack, I think you do it right. I was trying to establish a precedent
> > so that we can delete these as soon as they cause an issue, not sooner.
>
> So isn't the underlying problem simply that KVM_WERROR is enabled by default 
> for
> some configurations?  If that's the case, then my proposal to make KVM_WERROR
> always off by default, and "depends on KVM && EXPERT && !KASAN", would make 
> this
> go away, no?

No objection to adding EXPERT. Adding W=1 when build-testing KVM patches
is a good idea; you will still get the occasional patch from someone who
didn't have it, but that's fine.

I added KVM_WERROR a relatively long time ago, after a warning scrolled
away too quickly (a harmless one, but also of a kind that honestly
shouldn't have made it to Linus). At the time there were still too many
warnings to enable WERROR globally, and I feel that now we're in the same
boat with W=1. I think we should keep KVM_WERROR (which was indeed based on
DRM_I915_WERROR) and maintainers should just add W=1 when build-testing
KVM patches.

Paolo



Re: [PATCH 0/3] KVM: x86: guest interface for SEV live migration

2021-04-20 Thread Paolo Bonzini

On 20/04/21 22:16, Sean Christopherson wrote:

On Tue, Apr 20, 2021, Sean Christopherson wrote:

On Tue, Apr 20, 2021, Paolo Bonzini wrote:

In this particular case, if userspace sets the bit in CPUID2 but doesn't
handle KVM_EXIT_HYPERCALL, the guest will probably trigger some kind of
assertion failure as soon as it invokes the HC_PAGE_ENC_STATUS hypercall.


Oh!  Almost forgot my hail mary idea.  Instead of a new capability, can we
reject the hypercall if userspace has _not_ set 
KVM_CAP_ENFORCE_PV_FEATURE_CPUID?

if (vcpu->arch.pv_cpuid.enforce &&
!guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS)
break;


Couldn't userspace enable that capability and _still_ copy the supported 
CPUID blindly to the guest CPUID, without supporting the hypercall?


(BTW, it's better to return a bitmask of hypercalls that will exit to 
userspace from KVM_CHECK_EXTENSION.  Userspace can still reject with 
-ENOSYS those that it doesn't know, but it's important that it knows in 
general how to handle KVM_EXIT_HYPERCALL).
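
Concretely, a hypothetical userspace flow could look like the sketch below;
the capability name and bitmask layout are assumptions for illustration, not
something this series defines:

	long mask = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_EXIT_HYPERCALL);

	if (mask > 0 && (mask & (1ul << KVM_HC_PAGE_ENC_STATUS))) {
		struct kvm_enable_cap cap = {
			.cap = KVM_CAP_EXIT_HYPERCALL,		/* assumed name */
			.args[0] = 1ul << KVM_HC_PAGE_ENC_STATUS,
		};

		ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
		/* Only now advertise the corresponding CPUID bit to the guest. */
	}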


Paolo



Re: [PATCH v13 00/12] Add AMD SEV guest live migration support

2021-04-20 Thread Paolo Bonzini

On 20/04/21 20:51, Borislav Petkov wrote:

Hey Paolo,

On Tue, Apr 20, 2021 at 01:11:31PM +0200, Paolo Bonzini wrote:

I have queued patches 1-6.

For patches 8 and 10 I will post my own version based on my review and
feedback.


can you pls push that tree up to here to a branch somewhere so that ...


Yup, for now it's all at kvm/queue and it will land in kvm/next tomorrow 
(hopefully).  The guest interface patches in KVM are very near the top.


Paolo


For guest patches, please repost separately so that x86 maintainers will
notice them and ack them.


... I can take a look at the guest bits in the full context of the
changes?




Re: [PATCH 0/3] KVM: x86: guest interface for SEV live migration

2021-04-20 Thread Paolo Bonzini

On 20/04/21 19:31, Sean Christopherson wrote:

+   case KVM_HC_PAGE_ENC_STATUS: {
+   u64 gpa = a0, npages = a1, enc = a2;
+
+   ret = -KVM_ENOSYS;
+   if (!vcpu->kvm->arch.hypercall_exit_enabled)


I don't follow, why does the hypercall need to be gated by a capability?  What
would break if this were changed to?

if (!guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS))


The problem is that it's valid to take KVM_GET_SUPPORTED_CPUID and send 
it unmodified to KVM_SET_CPUID2.  For this reason, features that are 
conditional on other ioctls, or that require some kind of userspace 
support, must not be in KVM_GET_SUPPORTED_CPUID.  For example:


- TSC_DEADLINE because it is only implemented after KVM_CREATE_IRQCHIP 
(or after KVM_ENABLE_CAP of KVM_CAP_IRQCHIP_SPLIT)


- MONITOR only makes sense if userspace enables KVM_CAP_X86_DISABLE_EXITS

X2APIC is reported even though it shouldn't be.  Too late to fix that, I 
think.


In this particular case, if userspace sets the bit in CPUID2 but doesn't 
handle KVM_EXIT_HYPERCALL, the guest will probably trigger some kind of 
assertion failure as soon as it invokes the HC_PAGE_ENC_STATUS hypercall.
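
For reference, the legal-but-blind userspace pattern that makes this a
problem, sketched without error handling or the surrounding VM/vCPU setup:

	#include <linux/kvm.h>
	#include <sys/ioctl.h>
	#include <stdlib.h>

	static void copy_supported_cpuid(int kvm_fd, int vcpu_fd)
	{
		struct kvm_cpuid2 *cpuid;

		cpuid = calloc(1, sizeof(*cpuid) +
			       100 * sizeof(struct kvm_cpuid_entry2));
		cpuid->nent = 100;

		ioctl(kvm_fd, KVM_GET_SUPPORTED_CPUID, cpuid);	/* /dev/kvm ioctl */
		ioctl(vcpu_fd, KVM_SET_CPUID2, cpuid);		/* per-vCPU ioctl */
		free(cpuid);

		/* If a bit in there implies "userspace handles KVM_EXIT_HYPERCALL"
		 * but this VMM does not, the guest breaks as soon as it uses it. */
	}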


(I should document that, Jim asked for documentation around 
KVM_GET_SUPPORTED_CPUID and KVM_GET_MSR_INDEX_LIST many times).


Paolo


+   break;
+
+   if (!PAGE_ALIGNED(gpa) || !npages ||
+   gpa_to_gfn(gpa) + npages <= gpa_to_gfn(gpa)) {
+   ret = -EINVAL;
+   break;
+   }
+
+   vcpu->run->exit_reason= KVM_EXIT_HYPERCALL;
+   vcpu->run->hypercall.nr   = KVM_HC_PAGE_ENC_STATUS;
+   vcpu->run->hypercall.args[0]  = gpa;
+   vcpu->run->hypercall.args[1]  = npages;
+   vcpu->run->hypercall.args[2]  = enc;
+   vcpu->run->hypercall.longmode = op_64_bit;
+   vcpu->arch.complete_userspace_io = complete_hypercall_exit;
+   return 0;
+   }
default:
ret = -KVM_ENOSYS;
break;


...


diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 590cc811c99a..d696a9f13e33 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3258,6 +3258,14 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
vcpu->arch.msr_kvm_poll_control = data;
break;
  
+	case MSR_KVM_MIGRATION_CONTROL:

+   if (data & ~KVM_PAGE_ENC_STATUS_UPTODATE)
+   return 1;
+
+   if (data && !guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS))


Why let the guest write '0'?  Letting the guest do WRMSR but not RDMSR is
bizarre.


Because it was the simplest way to write the code, but returning 0 
unconditionally from RDMSR is actually simpler.


Paolo


+   return 1;
+   break;
+
case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
@@ -3549,6 +3557,12 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
if (!guest_pv_has(vcpu, KVM_FEATURE_ASYNC_PF))
return 1;
  
+		msr_info->data = 0;

+   break;
+   case MSR_KVM_MIGRATION_CONTROL:
+   if (!guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS))
+   return 1;
+
msr_info->data = 0;
break;
case MSR_KVM_STEAL_TIME:
--
2.26.2







Re: [PATCH 0/3] KVM: x86: guest interface for SEV live migration

2021-04-20 Thread Paolo Bonzini

On 20/04/21 17:15, Sean Christopherson wrote:

On Tue, Apr 20, 2021, Paolo Bonzini wrote:

Do not return the SEV-ES bit from KVM_GET_SUPPORTED_CPUID unless
the corresponding module parameter is 1, and clear the memory encryption
leaf completely if SEV is disabled.


Impeccable timing, I was planning on refreshing my SEV cleanup series[*] today.
There's going to be an annoying conflict with the svm_set_cpu_caps() change
(see below); any objection to folding your unintentional feedback into my
series?


That's fine of course.


diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 888e88b42e8d..e873a60a4830 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -99,6 +99,7 @@ static const struct cpuid_reg reverse_cpuid[] = {
[CPUID_7_EDX] = { 7, 0, CPUID_EDX},
[CPUID_7_1_EAX]   = { 7, 1, CPUID_EAX},
[CPUID_12_EAX]= {0x0012, 0, CPUID_EAX},
+   [CPUID_8000_001F_EAX] = {0x801F, 0, CPUID_EAX},
  };
  
  /*

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index cd8c333ed2dc..acdb8457289e 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -923,6 +923,13 @@ static __init void svm_set_cpu_caps(void)
if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) ||
boot_cpu_has(X86_FEATURE_AMD_SSBD))
kvm_cpu_cap_set(X86_FEATURE_VIRT_SSBD);
+
+   /* CPUID 0x801F */
+   if (sev) {
+   kvm_cpu_cap_set(X86_FEATURE_SEV);
+   if (sev_es)
+   kvm_cpu_cap_set(X86_FEATURE_SEV_ES);


Gah, I completely spaced on the module params in my series, which is more
problematic than normal because it also moves "sev" and "sev_es" to sev.c.  The
easy solution is to add sev_set_cpu_caps().


Sounds good.

Paolo



Re: [PATCH] KVM: selftests: Always run vCPU thread with blocked SIG_IPI

2021-04-20 Thread Paolo Bonzini

On 20/04/21 17:32, Peter Xu wrote:

On Tue, Apr 20, 2021 at 10:37:39AM -0400, Peter Xu wrote:

On Tue, Apr 20, 2021 at 04:16:14AM -0400, Paolo Bonzini wrote:

The main thread could start to send SIG_IPI at any time, even before the
signal is blocked on the vcpu thread.  Therefore, start the vcpu thread with
the signal blocked.

Without this patch, on very busy cores the dirty_log_test could fail directly
on receiving a SIGUSR1 without a handler (when vcpu runs far slower than main).

Reported-by: Peter Xu 
Cc: sta...@vger.kernel.org
Signed-off-by: Paolo Bonzini 


Yes, indeed better! :)

Reviewed-by: Peter Xu 


I just remembered one thing: this will avoid program quits, but still we'll get
the signal missing.


In what sense the signal will be missing?  As long as the thread exists, 
the signal will be accepted (but not delivered because it is blocked); 
it will then be delivered on the first KVM_RUN.
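
In other words, roughly this (a simplified sketch, not the literal selftest
code; thread and worker names are approximate, and SIG_IPI is SIGUSR1 in
dirty_log_test):

	sigset_t set;

	sigemptyset(&set);
	sigaddset(&set, SIGUSR1);
	pthread_sigmask(SIG_BLOCK, &set, NULL);	/* inherited by the new thread */

	pthread_create(&vcpu_thread, NULL, vcpu_worker, NULL);

	/* A kick sent now is accepted but stays pending.  The vCPU thread asks
	 * KVM (via KVM_SET_SIGNAL_MASK) to unblock SIG_IPI only inside KVM_RUN,
	 * so the pending signal is delivered there and KVM_RUN returns -EINTR. */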


Paolo

  From that pov I slightly prefer the old patch.  However, it's not a big
deal insofar as only dirty ring uses SIG_IPI, so there's always a ring-full
event which will just delay the kick.  It's just that we need to remember
this when we extend IPI to non-dirty-ring tests, as the kick is prone to be
lost then.

Thanks,





Re: [PATCH v3 1/2] KVM: selftests: Sync data verify of dirty logging with guest sync

2021-04-20 Thread Paolo Bonzini

On 20/04/21 15:10, Peter Xu wrote:

On Tue, Apr 20, 2021 at 10:07:16AM +0200, Paolo Bonzini wrote:

On 18/04/21 14:43, Peter Xu wrote:

8<-
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c 
b/tools/testing/selftests/kvm/dirty_log_test.c
index 25230e799bc4..d3050d1c2cd0 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -377,7 +377,7 @@ static void dirty_ring_after_vcpu_run(struct kvm_vm *vm, 
int ret, int err)
  /* A ucall-sync or ring-full event is allowed */
  if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
  /* We should allow this to continue */
-   ;
+   vcpu_handle_sync_stop();
  } else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL ||
 (ret == -1 && err == EINTR)) {
  /* Update the flag first before pause */
8<-

That's my intention when I introduce vcpu_handle_sync_stop(), but forgot to
add...


And possibly even this (untested though):

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c 
b/tools/testing/selftests/kvm/dirty_log_test.c
index ffa4e2791926..918954f01cef 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -383,6 +383,7 @@ static void dirty_ring_after_vcpu_run(struct kvm_vm *vm, 
int ret, int err)
/* Update the flag first before pause */
WRITE_ONCE(dirty_ring_vcpu_ring_full,
   run->exit_reason == KVM_EXIT_DIRTY_RING_FULL);
+   atomic_set(&vcpu_sync_stop_requested, false);
	sem_post(&sem_vcpu_stop);
pr_info("vcpu stops because %s...\n",
dirty_ring_vcpu_ring_full ?
@@ -804,8 +805,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 * the flush of the last page, and since we handle the last
 * page specially verification will succeed anyway.
 */
-   assert(host_log_mode == LOG_MODE_DIRTY_RING ||
-  atomic_read(&vcpu_sync_stop_requested) == false);
+   assert(atomic_read(&vcpu_sync_stop_requested) == false);
vm_dirty_log_verify(mode, bmap);
	sem_post(&sem_vcpu_cont);

You can submit all these as a separate patch.


But it could race, then?

 main thread vcpu thread
 --- ---
   ring full
 vcpu_sync_stop_requested=0
 sem_post(&sem_vcpu_stop)
  vcpu_sync_stop_requested=1
  sem_wait(&sem_vcpu_stop)
  assert(vcpu_sync_stop_requested==0)   <


Yes, it could indeed.

Thanks,

Paolo



[PATCH 0/3] KVM: x86: guest interface for SEV live migration

2021-04-20 Thread Paolo Bonzini
This is a reviewed version of the guest interface (hypercall+MSR)
for SEV live migration.  The differences lie mostly in the API
for userspace.  In particular:

- the CPUID feature is not exposed in KVM_GET_SUPPORTED_CPUID

- the hypercall must be enabled manually with KVM_ENABLE_CAP

- the MSR has sensible behavior if not filtered (reads as zero,
  writes must leave undefined bits as zero or they #GP)

Patch 1 is an unrelated cleanup, that was made moot by not
exposing the bit in KVM_GET_SUPPORTED_CPUID.

Paolo

Ashish Kalra (1):
  KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall

Paolo Bonzini (2):
  KVM: SEV: mask CPUID[0x801F].eax according to supported features
  KVM: x86: add MSR_KVM_MIGRATION_CONTROL

 Documentation/virt/kvm/api.rst| 11 ++
 Documentation/virt/kvm/cpuid.rst  |  6 
 Documentation/virt/kvm/hypercalls.rst | 15 +
 Documentation/virt/kvm/msr.rst| 10 ++
 arch/x86/include/asm/kvm_host.h   |  2 ++
 arch/x86/include/uapi/asm/kvm_para.h  |  4 +++
 arch/x86/kvm/cpuid.c  |  5 ++-
 arch/x86/kvm/cpuid.h  |  1 +
 arch/x86/kvm/svm/svm.c|  7 
 arch/x86/kvm/x86.c| 48 +++
 include/uapi/linux/kvm.h  |  1 +
 include/uapi/linux/kvm_para.h |  1 +
 12 files changed, 110 insertions(+), 1 deletion(-)

-- 
2.26.2

From 547d4d4edcd05fdfac6ce650d65db1d42bcd2807 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini 
Date: Tue, 20 Apr 2021 05:49:11 -0400
Subject: [PATCH 1/3] KVM: SEV: mask CPUID[0x801F].eax according to
 supported features

Do not return the SEV-ES bit from KVM_GET_SUPPORTED_CPUID unless
the corresponding module parameter is 1, and clear the memory encryption
leaf completely if SEV is disabled.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/cpuid.c   | 5 -
 arch/x86/kvm/cpuid.h   | 1 +
 arch/x86/kvm/svm/svm.c | 7 +++
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 2ae061586677..d791d1f093ab 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -944,8 +944,11 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array 
*array, u32 function)
break;
/* Support memory encryption cpuid if host supports it */
case 0x801F:
-   if (!boot_cpu_has(X86_FEATURE_SEV))
+   if (!kvm_cpu_cap_has(X86_FEATURE_SEV)) {
entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+   break;
+   }
+   cpuid_entry_override(entry, CPUID_8000_001F_EAX);
break;
/*Add support for Centaur's CPUID instruction*/
case 0xC000:
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 888e88b42e8d..e873a60a4830 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -99,6 +99,7 @@ static const struct cpuid_reg reverse_cpuid[] = {
[CPUID_7_EDX] = { 7, 0, CPUID_EDX},
[CPUID_7_1_EAX]   = { 7, 1, CPUID_EAX},
[CPUID_12_EAX]= {0x0012, 0, CPUID_EAX},
+   [CPUID_8000_001F_EAX] = {0x801F, 0, CPUID_EAX},
 };
 
 /*
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index cd8c333ed2dc..acdb8457289e 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -923,6 +923,13 @@ static __init void svm_set_cpu_caps(void)
if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) ||
boot_cpu_has(X86_FEATURE_AMD_SSBD))
kvm_cpu_cap_set(X86_FEATURE_VIRT_SSBD);
+
+   /* CPUID 0x801F */
+   if (sev) {
+   kvm_cpu_cap_set(X86_FEATURE_SEV);
+   if (sev_es)
+   kvm_cpu_cap_set(X86_FEATURE_SEV_ES);
+   }
 }
 
 static __init int svm_hardware_setup(void)
-- 
2.26.2


From ef78673f78e3f2eedc498c1fbf9271146caa83cb Mon Sep 17 00:00:00 2001
From: Ashish Kalra 
Date: Thu, 15 Apr 2021 15:57:02 +
Subject: [PATCH 2/3] KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall

This hypercall is used by the SEV guest to notify a change in the page
encryption status to the hypervisor. The hypercall should be invoked
only when the encryption attribute is changed from encrypted -> decrypted
and vice versa. By default all guest pages are considered encrypted.

The hypercall exits to userspace to manage the guest shared regions and
integrate with the userspace VMM's migration code.
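
For illustration (not part of the patch), the guest side boils down to
something like this sketch:

	/* Report that [gpa, gpa + npages * PAGE_SIZE) changed encryption
	 * status.  enc == 1 means encrypted/private, enc == 0 means
	 * decrypted/shared. */
	static void notify_page_enc_status(unsigned long gpa, int npages, bool enc)
	{
		if (!kvm_para_has_feature(KVM_FEATURE_HC_PAGE_ENC_STATUS))
			return;

		kvm_hypercall3(KVM_HC_PAGE_ENC_STATUS, gpa, npages, enc);
	}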

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Paolo Bonzini 
Cc: Joerg Roedel 
Cc: Borislav Petkov 
Cc: Tom Lendacky 
Cc: x...@kernel.org
Cc: k...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Steve Rutherford 
Signed-off-by: Brijesh Singh 
Signed-off-by: Ashish Kalra 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Message-Id: 
<93d7f2c2888315adc48905722574d89699edde33.1618498113.git.ashish.ka...@amd.com>
Signed-off-

Re: [PATCH v13 00/12] Add AMD SEV guest live migration support

2021-04-20 Thread Paolo Bonzini

On 15/04/21 17:52, Ashish Kalra wrote:

From: Ashish Kalra 

The series adds support for AMD SEV guest live migration commands. To protect the
confidentiality of an SEV protected guest memory while in transit we need to
use the SEV commands defined in SEV API spec [1].

SEV guest VMs have the concept of private and shared memory. Private memory
is encrypted with the guest-specific key, while shared memory may be encrypted
with hypervisor key. The commands provided by the SEV FW are meant to be used
for the private memory only. The patch series introduces a new hypercall.
The guest OS can use this hypercall to notify the page encryption status.
If the page is encrypted with guest specific-key then we use SEV command during
the migration. If page is not encrypted then fallback to default.

The patch uses the KVM_EXIT_HYPERCALL exit code and the hypercall-to-userspace
exit functionality as a common interface from the guest back to the VMM,
passing the guest shared/unencrypted page information on to the userspace
VMM/Qemu. Qemu can consult this information during migration to know
whether a page is encrypted.

This section describes how the SEV live migration feature is negotiated
between the host and guest. The host indicates support for this feature via
KVM_FEATURE_CPUID. The guest firmware (OVMF) detects this feature and
sets a UEFI environment variable indicating OVMF support for live
migration. The guest kernel also detects the host support for this
feature via cpuid and, in case of an EFI boot, verifies that OVMF also
supports this feature by reading the UEFI environment variable; if it is
set, it enables the live migration feature on the host by writing to a
custom MSR. If not booted under EFI, it simply enables the feature by
writing to the custom MSR directly. The MSR is also handled by the
userspace VMM/Qemu.

A branch containing these patches is available here:
https://github.com/AMDESE/linux/tree/sev-migration-v13

[1] https://developer.amd.com/wp-content/resources/55766.PDF


I have queued patches 1-6.

For patches 8 and 10 I will post my own version based on my review and 
feedback.


For guest patches, please repost separately so that x86 maintainers will 
notice them and ack them.


Paolo


Changes since v12:
- Reset page encryption status during early boot instead of just
   before the kexec to avoid SMP races during kvm_pv_guest_cpu_reboot().
- Remove incorrect log message in case of non-EFI boot and implicit
   enabling of SEV live migration feature.

Changes since v11:
- Clean up and remove kvm_x86_ops callback for page_enc_status_hc and
   instead add a new per-VM flag to support/enable the page encryption
   status hypercall.
- Remove KVM_EXIT_DMA_SHARE/KVM_EXIT_DMA_UNSHARE exitcodes and instead
   use the KVM_EXIT_HYPERCALL exitcode for page encryption status
   hypercall to userspace functionality.

Changes since v10:
- Adds new KVM_EXIT_DMA_SHARE/KVM_EXIT_DMA_UNSHARE hypercall to
   userspace exit functionality as a common interface from the guest back to the
   KVM and passing on the guest shared/unencrypted region information to the
   userspace VMM/Qemu. KVM/host kernel does not maintain the guest shared
   memory regions information anymore.
- Remove implicit enabling of SEV live migration feature for an SEV
   guest, now this is explicitly in control of the userspace VMM/Qemu.
- Custom MSR handling is also now moved into userspace VMM/Qemu.
- As KVM does not maintain the guest shared memory region information
   anymore, sev_dbg_crypt() cannot bypass unencrypted guest memory
   regions without support from userspace VMM/Qemu.

Changes since v9:
- Transitioning from page encryption bitmap to the shared pages list
   to keep track of guest's shared/unencrypted memory regions.
- Move back to marking the complete _bss_decrypted section as
   decrypted in the shared pages list.
- Invoke a new function check_kvm_sev_migration() via kvm_init_platform()
   for guest to query for host-side support for SEV live migration
   and to enable the SEV live migration feature, to avoid
   #ifdefs in code
- Rename MSR_KVM_SEV_LIVE_MIG_EN to MSR_KVM_SEV_LIVE_MIGRATION.
- Invoke a new function handle_unencrypted_region() from
   sev_dbg_crypt() to bypass unencrypted guest memory regions.

Changes since v8:
- Rebasing to kvm next branch.
- Fixed and added comments as per review feedback on v8 patches.
- Removed implicitly enabling live migration for incoming VMs in
   in KVM_SET_PAGE_ENC_BITMAP, it is now done via KVM_SET_MSR ioctl.
- Adds support for bypassing unencrypted guest memory regions for
   DBG_DECRYPT API calls, guest memory region encryption status in
   sev_dbg_decrypt() is referenced using the page encryption bitmap.

Changes since v7:
- Removed the hypervisor specific hypercall/paravirt callback for
   SEV live migration and moved back to calling kvm_sev_hypercall3
   directly.
- Fix build errors as
   Reported-by: kbuild test robot , specifically fixed
   build error when CONFIG_HYPERVISOR_GUEST=y and

Re: [PATCH v13 08/12] KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall

2021-04-20 Thread Paolo Bonzini

On 15/04/21 17:57, Ashish Kalra wrote:

From: Ashish Kalra 

This hypercall is used by the SEV guest to notify a change in the page
encryption status to the hypervisor. The hypercall should be invoked
only when the encryption attribute is changed from encrypted -> decrypted
and vice versa. By default all guest pages are considered encrypted.

The hypercall exits to userspace to manage the guest shared regions and
integrate with the userspace VMM's migration code.


I think this should be exposed to userspace as a capability, rather than 
as a CPUID bit.  Userspace then can enable the capability and set the 
CPUID bit if it wants.


The reason is that userspace could pass KVM_GET_SUPPORTED_CPUID to
KVM_SET_CPUID2 and the hypercall then would break the guest.

Paolo


Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Paolo Bonzini 
Cc: Joerg Roedel 
Cc: Borislav Petkov 
Cc: Tom Lendacky 
Cc: x...@kernel.org
Cc: k...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Steve Rutherford 
Signed-off-by: Brijesh Singh 
Signed-off-by: Ashish Kalra 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
  Documentation/virt/kvm/hypercalls.rst | 15 ++
  arch/x86/include/asm/kvm_host.h   |  2 ++
  arch/x86/kvm/svm/sev.c|  1 +
  arch/x86/kvm/x86.c| 29 +++
  include/uapi/linux/kvm_para.h |  1 +
  5 files changed, 48 insertions(+)

diff --git a/Documentation/virt/kvm/hypercalls.rst 
b/Documentation/virt/kvm/hypercalls.rst
index ed4fddd364ea..7aff0cebab7c 100644
--- a/Documentation/virt/kvm/hypercalls.rst
+++ b/Documentation/virt/kvm/hypercalls.rst
@@ -169,3 +169,18 @@ a0: destination APIC ID
  
  :Usage example: When sending a call-function IPI-many to vCPUs, yield if

any of the IPI target vCPUs was preempted.
+
+
+8. KVM_HC_PAGE_ENC_STATUS
+-
+:Architecture: x86
+:Status: active
+:Purpose: Notify the encryption status changes in guest page table (SEV guest)
+
+a0: the guest physical address of the start page
+a1: the number of pages
+a2: encryption attribute
+
+   Where:
+   * 1: Encryption attribute is set
+   * 0: Encryption attribute is cleared
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3768819693e5..42eb0fe3df5d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1050,6 +1050,8 @@ struct kvm_arch {
  
  	bool bus_lock_detection_enabled;
  
+	bool page_enc_hc_enable;

+
/* Deflect RDMSR and WRMSR to user space when they trigger a #GP */
u32 user_space_msr_mask;
struct kvm_x86_msr_filter __rcu *msr_filter;
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index c9795a22e502..5184a0c0131a 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -197,6 +197,7 @@ static int sev_guest_init(struct kvm *kvm, struct 
kvm_sev_cmd *argp)
sev->active = true;
sev->asid = asid;
	INIT_LIST_HEAD(&sev->regions_list);
+   kvm->arch.page_enc_hc_enable = true;
  
  	return 0;
  
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c

index f7d12fca397b..e8986478b653 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8208,6 +8208,13 @@ static void kvm_sched_yield(struct kvm *kvm, unsigned 
long dest_id)
kvm_vcpu_yield_to(target);
  }
  
+static int complete_hypercall_exit(struct kvm_vcpu *vcpu)

+{
+   kvm_rax_write(vcpu, vcpu->run->hypercall.ret);
+   ++vcpu->stat.hypercalls;
+   return kvm_skip_emulated_instruction(vcpu);
+}
+
  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
  {
unsigned long nr, a0, a1, a2, a3, ret;
@@ -8273,6 +8280,28 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
kvm_sched_yield(vcpu->kvm, a0);
ret = 0;
break;
+   case KVM_HC_PAGE_ENC_STATUS: {
+   u64 gpa = a0, npages = a1, enc = a2;
+
+   ret = -KVM_ENOSYS;
+   if (!vcpu->kvm->arch.page_enc_hc_enable)
+   break;
+
+   if (!PAGE_ALIGNED(gpa) || !npages ||
+   gpa_to_gfn(gpa) + npages <= gpa_to_gfn(gpa)) {
+   ret = -EINVAL;
+   break;
+   }
+
+   vcpu->run->exit_reason= KVM_EXIT_HYPERCALL;
+   vcpu->run->hypercall.nr   = KVM_HC_PAGE_ENC_STATUS;
+   vcpu->run->hypercall.args[0]  = gpa;
+   vcpu->run->hypercall.args[1]  = npages;
+   vcpu->run->hypercall.args[2]  = enc;
+   vcpu->run->hypercall.longmode = op_64_bit;
+   vcpu->arch.complete_userspace_io = complete_hypercall_exit;
+   return 0;
+   }
default:
ret = -KVM_ENOSYS;
break;
diff --git a/include/uapi/linux/kvm_para.h 

[PATCH] KVM: x86: add MSR_KVM_MIGRATION_CONTROL

2021-04-20 Thread Paolo Bonzini
Add a new MSR that can be used to communicate whether the page
encryption status bitmap is up to date and therefore whether live
migration of an encrypted guest is possible.

The MSR should be processed by userspace if it is going to live
migrate the guest; the default implementation does nothing.
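
As an illustration of the intended guest usage (not part of this patch):

	/* Once all shared ranges have been reported via KVM_HC_PAGE_ENC_STATUS,
	 * tell the host that the encryption status bitmap is up to date and
	 * live migration is safe. */
	if (kvm_para_has_feature(KVM_FEATURE_HC_PAGE_ENC_STATUS))
		wrmsrl(MSR_KVM_MIGRATION_CONTROL, KVM_PAGE_ENC_STATUS_UPTODATE);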

Signed-off-by: Paolo Bonzini 
---
 Documentation/virt/kvm/cpuid.rst |  3 ++-
 Documentation/virt/kvm/msr.rst   | 10 ++
 arch/x86/include/uapi/asm/kvm_para.h |  3 +++
 arch/x86/kvm/x86.c   | 14 ++
 4 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
index c9378d163b5a..15923857de56 100644
--- a/Documentation/virt/kvm/cpuid.rst
+++ b/Documentation/virt/kvm/cpuid.rst
@@ -99,7 +99,8 @@ KVM_FEATURE_MSI_EXT_DEST_ID15  guest checks 
this feature bit
 KVM_FEATURE_HC_PAGE_ENC_STATUS 16  guest checks this feature bit 
before
using the page encryption state
hypercall to notify the page 
state
-   change
+   change, and before setting bit 
0 of
+   MSR_KVM_MIGRATION_CONTROL
 
 KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24  host will warn if no guest-side
per-cpu warps are expected in
diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
index e37a14c323d2..691d718df60a 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/msr.rst
@@ -376,3 +376,13 @@ data:
write '1' to bit 0 of the MSR, this causes the host to re-scan its queue
and check if there are more notifications pending. The MSR is available
if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
+
+MSR_KVM_MIGRATION_CONTROL:
+0x4b564d08
+
+data:
+If the guest is running with encrypted memory and it is communicating
+page encryption status to the host using the ``KVM_HC_PAGE_ENC_STATUS``
+hypercall, it can set bit 0 in this MSR to allow live migration of
+the guest.  The bit can be set if KVM_FEATURE_HC_PAGE_ENC_STATUS is
+present in CPUID.
diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
b/arch/x86/include/uapi/asm/kvm_para.h
index be49956b603f..c5ee419775d8 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -55,6 +55,7 @@
 #define MSR_KVM_POLL_CONTROL   0x4b564d05
 #define MSR_KVM_ASYNC_PF_INT   0x4b564d06
 #define MSR_KVM_ASYNC_PF_ACK   0x4b564d07
+#define MSR_KVM_MIGRATION_CONTROL  0x4b564d08
 
 struct kvm_steal_time {
__u64 steal;
@@ -91,6 +92,8 @@ struct kvm_clock_pairing {
 /* MSR_KVM_ASYNC_PF_INT */
 #define KVM_ASYNC_PF_VEC_MASK  GENMASK(7, 0)
 
+/* MSR_KVM_MIGRATION_CONTROL */
+#define KVM_PAGE_ENC_STATUS_UPTODATE   (1 << 0)
 
 /* Operations for KVM_HC_MMU_OP */
 #define KVM_MMU_OP_WRITE_PTE1
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 36d302009fd8..efb98be3338d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3258,6 +3258,14 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
vcpu->arch.msr_kvm_poll_control = data;
break;
 
+   case MSR_KVM_MIGRATION_CONTROL:
+   if (data & ~KVM_PAGE_ENC_STATUS_UPTODATE)
+   return 1;
+
+   if (data && !guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS))
+   return 1;
+   break;
+
case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
@@ -3549,6 +3557,12 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
if (!guest_pv_has(vcpu, KVM_FEATURE_ASYNC_PF))
return 1;
 
+   msr_info->data = 0;
+   break;
+   case MSR_KVM_MIGRATION_CONTROL:
+   if (!guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS))
+   return 1;
+
msr_info->data = 0;
break;
case MSR_KVM_STEAL_TIME:
-- 
2.26.2



Re: [PATCH v13 12/12] x86/kvm: Add guest support for detecting and enabling SEV Live Migration feature.

2021-04-20 Thread Paolo Bonzini

On 15/04/21 18:01, Ashish Kalra wrote:

From: Ashish Kalra 

The guest support for detecting and enabling the SEV live migration
feature uses the following logic:

  - kvm_init_platform() invokes check_kvm_sev_migration(), which
    checks whether the guest is booted under EFI

- If not EFI,

  i) check for the KVM_FEATURE_CPUID

  ii) if CPUID reports that migration is supported, issue a wrmsrl()
  to enable the SEV live migration support

- If EFI,

  i) check for the KVM_FEATURE_CPUID

  ii) If CPUID reports that migration is supported, read the UEFI variable
      which indicates OVMF support for live migration

  iii) If the variable indicates live migration is supported, issue a wrmsrl()
       to enable the SEV live migration support

The EFI live migration check is done using a late_initcall() callback.

Also, ensure that _bss_decrypted section is marked as decrypted in the
shared pages list.

Also adds kexec support for SEV Live Migration.

Reset the host's shared pages list related to kernel
specific page encryption status settings before we load a
new kernel by kexec. We cannot reset the complete
shared pages list here as we need to retain the
UEFI/OVMF firmware specific settings.

The host's shared pages list is maintained for the
guest to keep track of all unencrypted guest memory regions,
therefore we need to explicitly mark all shared pages as
encrypted again before rebooting into the new guest kernel.

Signed-off-by: Ashish Kalra 


Boris, this one needs an ACK as well.

Paolo


---
  arch/x86/include/asm/mem_encrypt.h |  8 
  arch/x86/kernel/kvm.c  | 55 +
  arch/x86/mm/mem_encrypt.c  | 64 ++
  3 files changed, 127 insertions(+)

diff --git a/arch/x86/include/asm/mem_encrypt.h 
b/arch/x86/include/asm/mem_encrypt.h
index 31c4df123aa0..19b77f3a62dc 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -21,6 +21,7 @@
  extern u64 sme_me_mask;
  extern u64 sev_status;
  extern bool sev_enabled;
+extern bool sev_live_migration_enabled;
  
  void sme_encrypt_execute(unsigned long encrypted_kernel_vaddr,

 unsigned long decrypted_kernel_vaddr,
@@ -44,8 +45,11 @@ void __init sme_enable(struct boot_params *bp);
  
  int __init early_set_memory_decrypted(unsigned long vaddr, unsigned long size);

  int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long 
size);
+void __init early_set_mem_enc_dec_hypercall(unsigned long vaddr, int npages,
+   bool enc);
  
  void __init mem_encrypt_free_decrypted_mem(void);

+void __init check_kvm_sev_migration(void);
  
  /* Architecture __weak replacement functions */

  void __init mem_encrypt_init(void);
@@ -60,6 +64,7 @@ bool sev_es_active(void);
  #else /* !CONFIG_AMD_MEM_ENCRYPT */
  
  #define sme_me_mask	0ULL

+#define sev_live_migration_enabled false
  
  static inline void __init sme_early_encrypt(resource_size_t paddr,

unsigned long size) { }
@@ -84,8 +89,11 @@ static inline int __init
  early_set_memory_decrypted(unsigned long vaddr, unsigned long size) { return 
0; }
  static inline int __init
  early_set_memory_encrypted(unsigned long vaddr, unsigned long size) { return 
0; }
+static inline void __init
+early_set_mem_enc_dec_hypercall(unsigned long vaddr, int npages, bool enc) {}
  
  static inline void mem_encrypt_free_decrypted_mem(void) { }

+static inline void check_kvm_sev_migration(void) { }
  
  #define __bss_decrypted
  
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c

index 78bb0fae3982..94ef16d263a7 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -26,6 +26,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -429,6 +430,59 @@ static inline void __set_percpu_decrypted(void *ptr, 
unsigned long size)
early_set_memory_decrypted((unsigned long) ptr, size);
  }
  
+static int __init setup_kvm_sev_migration(void)

+{
+   efi_char16_t efi_sev_live_migration_enabled[] = 
L"SevLiveMigrationEnabled";
+   efi_guid_t efi_variable_guid = MEM_ENCRYPT_GUID;
+   efi_status_t status;
+   unsigned long size;
+   bool enabled;
+
+   /*
+* check_kvm_sev_migration() invoked via kvm_init_platform() before
+* this callback would have setup the indicator that live migration
+* feature is supported/enabled.
+*/
+   if (!sev_live_migration_enabled)
+   return 0;
+
+   if (!efi_enabled(EFI_BOOT))
+   return 0;
+
+   if (!efi_enabled(EFI_RUNTIME_SERVICES)) {
+   pr_info("%s : EFI runtime services are not enabled\n", 
__func__);
+   return 0;
+   }
+
+   size = sizeof(enabled);
+
+   /* Get variable contents into buffer */
+   status = efi.get_variable(efi_sev_live_migration_enabled,
+

Re: [PATCH v13 10/12] KVM: x86: Introduce new KVM_FEATURE_SEV_LIVE_MIGRATION feature & Custom MSR.

2021-04-20 Thread Paolo Bonzini

On 20/04/21 01:06, Sean Christopherson wrote:

diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
b/arch/x86/include/uapi/asm/kvm_para.h
index 950afebfba88..f6bfa138874f 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -33,6 +33,7 @@
  #define KVM_FEATURE_PV_SCHED_YIELD13
  #define KVM_FEATURE_ASYNC_PF_INT  14
  #define KVM_FEATURE_MSI_EXT_DEST_ID   15
+#define KVM_FEATURE_SEV_LIVE_MIGRATION 16
  
  #define KVM_HINTS_REALTIME  0
  
@@ -54,6 +55,7 @@

  #define MSR_KVM_POLL_CONTROL  0x4b564d05
  #define MSR_KVM_ASYNC_PF_INT  0x4b564d06
  #define MSR_KVM_ASYNC_PF_ACK  0x4b564d07
+#define MSR_KVM_SEV_LIVE_MIGRATION 0x4b564d08
  
  struct kvm_steal_time {

__u64 steal;
@@ -136,4 +138,6 @@ struct kvm_vcpu_pv_apf_data {
  #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK
  #define KVM_PV_EOI_DISABLED 0x0
  
+#define KVM_SEV_LIVE_MIGRATION_ENABLED BIT_ULL(0)


Even though the intent is to "force" userspace to intercept the MSR, I think KVM
should at least emulate the legal bits as a nop.  Deferring completely to
userspace is rather bizarre as there's not really anything to justify KVM
getting involved.  It would also force userspace to filter the MSR just to
support the hypercall.


I think this is the intention: the hypercall by itself cannot do much if
you cannot tell userspace that it's up to date.

On the other hand it is kind of wrong that KVM_GET_SUPPORTED_CPUID
returns the feature, but the MSR is not supported.


Somewhat of a nit, but I think we should do something like s/ENABLED/READY,


Agreed.  I'll send a patch that puts everything together.

Paolo



Re: [PATCH] KVM: Boost vCPU candidiate in user mode which is delivering interrupt

2021-04-20 Thread Paolo Bonzini

On 20/04/21 10:48, Wanpeng Li wrote:

I was thinking of something simpler:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9b8e30dd5b9b..455c648f9adc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3198,10 +3198,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool 
yield_to_kernel_mode)
   {
 struct kvm *kvm = me->kvm;
 struct kvm_vcpu *vcpu;
-   int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
 int yielded = 0;
 int try = 3;
-   int pass;
+   int pass, num_passes = 1;
 int i;

 kvm_vcpu_set_in_spin_loop(me, true);
@@ -3212,13 +3211,14 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool 
yield_to_kernel_mode)
  * VCPU is holding the lock that we need and will release it.
  * We approximate round-robin by starting at the last boosted VCPU.
  */
-   for (pass = 0; pass < 2 && !yielded && try; pass++) {
-   kvm_for_each_vcpu(i, vcpu, kvm) {
-   if (!pass && i <= last_boosted_vcpu) {
-   i = last_boosted_vcpu;
-   continue;
-   } else if (pass && i > last_boosted_vcpu)
-   break;
+   for (pass = 0; pass < num_passes; pass++) {
+   int idx = me->kvm->last_boosted_vcpu;
+   int n = atomic_read(&kvm->online_vcpus);
+   for (i = 0; i < n; i++, idx++) {
+   if (idx == n)
+   idx = 0;
+
+   vcpu = kvm_get_vcpu(kvm, idx);
 if (!READ_ONCE(vcpu->ready))
 continue;
 if (vcpu == me)
@@ -3226,23 +3226,36 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool 
yield_to_kernel_mode)
  if (rcuwait_active(&vcpu->wait) &&
 !vcpu_dy_runnable(vcpu))
 continue;
-   if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode 
&&
-   !kvm_arch_vcpu_in_kernel(vcpu))
-   continue;
 if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
 continue;

+   if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode 
&&
+   !kvm_arch_vcpu_in_kernel(vcpu)) {
+   /*
+* A vCPU running in userspace can get to kernel 
mode via
+* an interrupt.  That's a worse choice than a CPU 
already
+* in kernel mode so only do it on a second pass.
+*/
+   if (!vcpu_dy_runnable(vcpu))
+   continue;
+   if (pass == 0) {
+   num_passes = 2;
+   continue;
+   }
+   }
+
 yielded = kvm_vcpu_yield_to(vcpu);
 if (yielded > 0) {
 kvm->last_boosted_vcpu = i;
-   break;
+   goto done;
 } else if (yielded < 0) {
 try--;
 if (!try)
-   break;
+   goto done;
 }
 }
 }
+done:


We just tested the above post against a 96-vCPU VM in an over-subscribe
scenario; the score of pbzip2 fluctuated drastically. Sometimes it is
worse than vanilla, but the average improvement is around 2.2%. The
new version of my post is around 9.3%, the originally posted patch is
around 10%, which is totally as expected since now both IPI receivers
in user mode and lock-waiters are second class citizens.


Fair enough.  Of the two patches you posted I prefer the original, so 
I'll go with that one.


Paolo



Re: [PATCH v13 10/12] KVM: x86: Introduce new KVM_FEATURE_SEV_LIVE_MIGRATION feature & Custom MSR.

2021-04-20 Thread Paolo Bonzini

On 15/04/21 17:58, Ashish Kalra wrote:

From: Ashish Kalra 

Add new KVM_FEATURE_SEV_LIVE_MIGRATION feature for guest to check
for host-side support for SEV live migration. Also add a new custom
MSR_KVM_SEV_LIVE_MIGRATION for guest to enable the SEV live migration
feature.

MSR is handled by userspace using MSR filters.

Signed-off-by: Ashish Kalra 
Reviewed-by: Steve Rutherford 


Let's leave the MSR out for now and rename the feature to 
KVM_FEATURE_HC_PAGE_ENC_STATUS.


Paolo


---
  Documentation/virt/kvm/cpuid.rst |  5 +
  Documentation/virt/kvm/msr.rst   | 12 
  arch/x86/include/uapi/asm/kvm_para.h |  4 
  arch/x86/kvm/cpuid.c |  3 ++-
  4 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
index cf62162d4be2..0bdb6cdb12d3 100644
--- a/Documentation/virt/kvm/cpuid.rst
+++ b/Documentation/virt/kvm/cpuid.rst
@@ -96,6 +96,11 @@ KVM_FEATURE_MSI_EXT_DEST_ID15  guest checks 
this feature bit
 before using extended 
destination
 ID bits in MSI address bits 
11-5.
  
+KVM_FEATURE_SEV_LIVE_MIGRATION 16  guest checks this feature bit before

+   using the page encryption state
+   hypercall to notify the page 
state
+   change
+
  KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24  host will warn if no guest-side
 per-cpu warps are expected in
 kvmclock
diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
index e37a14c323d2..020245d16087 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/msr.rst
@@ -376,3 +376,15 @@ data:
write '1' to bit 0 of the MSR, this causes the host to re-scan its queue
and check if there are more notifications pending. The MSR is available
if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
+
+MSR_KVM_SEV_LIVE_MIGRATION:
+0x4b564d08
+
+   Control SEV Live Migration features.
+
+data:
+Bit 0 enables (1) or disables (0) host-side SEV Live Migration feature,
+in other words, this is guest->host communication that it's properly
+handling the shared pages list.
+
+All other bits are reserved.
diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
b/arch/x86/include/uapi/asm/kvm_para.h
index 950afebfba88..f6bfa138874f 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -33,6 +33,7 @@
  #define KVM_FEATURE_PV_SCHED_YIELD13
  #define KVM_FEATURE_ASYNC_PF_INT  14
  #define KVM_FEATURE_MSI_EXT_DEST_ID   15
+#define KVM_FEATURE_SEV_LIVE_MIGRATION 16
  
  #define KVM_HINTS_REALTIME  0
  
@@ -54,6 +55,7 @@

  #define MSR_KVM_POLL_CONTROL  0x4b564d05
  #define MSR_KVM_ASYNC_PF_INT  0x4b564d06
  #define MSR_KVM_ASYNC_PF_ACK  0x4b564d07
+#define MSR_KVM_SEV_LIVE_MIGRATION 0x4b564d08
  
  struct kvm_steal_time {

__u64 steal;
@@ -136,4 +138,6 @@ struct kvm_vcpu_pv_apf_data {
  #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK
  #define KVM_PV_EOI_DISABLED 0x0
  
+#define KVM_SEV_LIVE_MIGRATION_ENABLED BIT_ULL(0)

+
  #endif /* _UAPI_ASM_X86_KVM_PARA_H */
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 6bd2f8b830e4..4e2e69a692aa 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -812,7 +812,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array 
*array, u32 function)
 (1 << KVM_FEATURE_PV_SEND_IPI) |
 (1 << KVM_FEATURE_POLL_CONTROL) |
 (1 << KVM_FEATURE_PV_SCHED_YIELD) |
-(1 << KVM_FEATURE_ASYNC_PF_INT);
+(1 << KVM_FEATURE_ASYNC_PF_INT) |
+(1 << KVM_FEATURE_SEV_LIVE_MIGRATION);
  
  		if (sched_info_on())

entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);





Re: [PATCH v13 09/12] mm: x86: Invoke hypercall when page encryption status is changed

2021-04-20 Thread Paolo Bonzini

On 15/04/21 17:57, Ashish Kalra wrote:

From: Brijesh Singh 

Invoke a hypercall when a memory region is changed from encrypted ->
decrypted and vice versa. Hypervisor needs to know the page encryption
status during the guest migration.


Boris, can you ack this patch?

Paolo


Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Paolo Bonzini 
Cc: Joerg Roedel 
Cc: Borislav Petkov 
Cc: Tom Lendacky 
Cc: x...@kernel.org
Cc: k...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Steve Rutherford 
Reviewed-by: Venu Busireddy 
Signed-off-by: Brijesh Singh 
Signed-off-by: Ashish Kalra 
---
  arch/x86/include/asm/paravirt.h   | 10 +
  arch/x86/include/asm/paravirt_types.h |  2 +
  arch/x86/kernel/paravirt.c|  1 +
  arch/x86/mm/mem_encrypt.c | 57 ++-
  arch/x86/mm/pat/set_memory.c  |  7 
  5 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 4abf110e2243..efaa3e628967 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -84,6 +84,12 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct 
*mm)
PVOP_VCALL1(mmu.exit_mmap, mm);
  }
  
+static inline void page_encryption_changed(unsigned long vaddr, int npages,

+   bool enc)
+{
+   PVOP_VCALL3(mmu.page_encryption_changed, vaddr, npages, enc);
+}
+
  #ifdef CONFIG_PARAVIRT_XXL
  static inline void load_sp0(unsigned long sp0)
  {
@@ -799,6 +805,10 @@ static inline void paravirt_arch_dup_mmap(struct mm_struct 
*oldmm,
  static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
  {
  }
+
+static inline void page_encryption_changed(unsigned long vaddr, int npages, 
bool enc)
+{
+}
  #endif
  #endif /* __ASSEMBLY__ */
  #endif /* _ASM_X86_PARAVIRT_H */
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index de87087d3bde..69ef9c207b38 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -195,6 +195,8 @@ struct pv_mmu_ops {
  
  	/* Hook for intercepting the destruction of an mm_struct. */

void (*exit_mmap)(struct mm_struct *mm);
+   void (*page_encryption_changed)(unsigned long vaddr, int npages,
+   bool enc);
  
  #ifdef CONFIG_PARAVIRT_XXL

struct paravirt_callee_save read_cr2;
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c60222ab8ab9..9f206e192f6b 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -335,6 +335,7 @@ struct paravirt_patch_template pv_ops = {
(void (*)(struct mmu_gather *, void *))tlb_remove_page,
  
  	.mmu.exit_mmap		= paravirt_nop,

+   .mmu.page_encryption_changed= paravirt_nop,
  
  #ifdef CONFIG_PARAVIRT_XXL

.mmu.read_cr2   = __PV_IS_CALLEE_SAVE(native_read_cr2),
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index ae78cef79980..fae9ccbd0da7 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -19,6 +19,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include 

  #include 
@@ -29,6 +30,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include "mm_internal.h"
  
@@ -229,6 +231,47 @@ void __init sev_setup_arch(void)

swiotlb_adjust_size(size);
  }
  
+static void set_memory_enc_dec_hypercall(unsigned long vaddr, int npages,

+   bool enc)
+{
+   unsigned long sz = npages << PAGE_SHIFT;
+   unsigned long vaddr_end, vaddr_next;
+
+   vaddr_end = vaddr + sz;
+
+   for (; vaddr < vaddr_end; vaddr = vaddr_next) {
+   int psize, pmask, level;
+   unsigned long pfn;
+   pte_t *kpte;
+
+   kpte = lookup_address(vaddr, &level);
+   if (!kpte || pte_none(*kpte))
+   return;
+
+   switch (level) {
+   case PG_LEVEL_4K:
+   pfn = pte_pfn(*kpte);
+   break;
+   case PG_LEVEL_2M:
+   pfn = pmd_pfn(*(pmd_t *)kpte);
+   break;
+   case PG_LEVEL_1G:
+   pfn = pud_pfn(*(pud_t *)kpte);
+   break;
+   default:
+   return;
+   }
+
+   psize = page_level_size(level);
+   pmask = page_level_mask(level);
+
+   kvm_sev_hypercall3(KVM_HC_PAGE_ENC_STATUS,
+  pfn << PAGE_SHIFT, psize >> PAGE_SHIFT, enc);
+
+   vaddr_next = (vaddr & pmask) + psize;
+   }
+}
+
  static void __init __set_clr_pte_enc(pte_t *kpte, int level, bool enc)
  {
pgprot_t old_prot, new_prot;
@@ -286,12 +329,13 @@ static void __init __set_clr_pte_enc(pte_t 

[PATCH] KVM: x86: document behavior of measurement ioctls with len==0

2021-04-20 Thread Paolo Bonzini
Signed-off-by: Paolo Bonzini 
---
 Documentation/virt/kvm/amd-memory-encryption.rst | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/Documentation/virt/kvm/amd-memory-encryption.rst 
b/Documentation/virt/kvm/amd-memory-encryption.rst
index 469a6308765b..34ce2d1fcb89 100644
--- a/Documentation/virt/kvm/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/amd-memory-encryption.rst
@@ -148,6 +148,9 @@ measurement. Since the guest owner knows the initial 
contents of the guest at
 boot, the measurement can be verified by comparing it to what the guest owner
 expects.
 
+If len is zero on entry, the measurement blob length is written to len and
+uaddr is unused.
+
 Parameters (in): struct  kvm_sev_launch_measure
 
 Returns: 0 on success, -negative on error
@@ -271,6 +274,9 @@ report containing the SHA-256 digest of the guest memory 
and VMSA passed through
 commands and signed with the PEK. The digest returned by the command should 
match the digest
 used by the guest owner with the KVM_SEV_LAUNCH_MEASURE.
 
+If len is zero on entry, the measurement blob length is written to len and
+uaddr is unused.
+
 Parameters (in): struct kvm_sev_attestation
 
 Returns: 0 on success, -negative on error
-- 
2.26.2
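For reference, the two-call pattern documented above looks roughly like the
following from userspace (illustrative snippet; error handling omitted, vm_fd
and sev_fd are assumed to be set up already, and <linux/kvm.h>, <sys/ioctl.h>
and <stdlib.h> are needed):

	struct kvm_sev_launch_measure measure = {};
	struct kvm_sev_cmd cmd = {
		.id     = KVM_SEV_LAUNCH_MEASURE,
		.data   = (__u64)(uintptr_t)&measure,
		.sev_fd = sev_fd,
	};

	/* First call with len == 0: only learn the required blob length. */
	ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);

	measure.uaddr = (__u64)(uintptr_t)malloc(measure.len);

	/* Second call: the measurement blob is copied to uaddr. */
	ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);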



Re: [PATCH v13 04/12] KVM: SVM: Add support for KVM_SEV_RECEIVE_START command

2021-04-20 Thread Paolo Bonzini

On 20/04/21 10:38, Paolo Bonzini wrote:

On 15/04/21 17:54, Ashish Kalra wrote:

+    }
+
+    sev->handle = start->handle;
+    sev->fd = argp->sev_fd;


These two lines are spurious, I'll delete them.


And this is wrong as well.  My apologies.

Paolo



Re: [PATCH v13 01/12] KVM: SVM: Add KVM_SEV SEND_START command

2021-04-20 Thread Paolo Bonzini

On 15/04/21 17:53, Ashish Kalra wrote:

From: Brijesh Singh 

The command is used to create an outgoing SEV guest encryption context.

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Paolo Bonzini 
Cc: Joerg Roedel 
Cc: Borislav Petkov 
Cc: Tom Lendacky 
Cc: x...@kernel.org
Cc: k...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Steve Rutherford 
Reviewed-by: Venu Busireddy 
Signed-off-by: Brijesh Singh 
Signed-off-by: Ashish Kalra 
---
  .../virt/kvm/amd-memory-encryption.rst|  27 
  arch/x86/kvm/svm/sev.c| 125 ++
  include/linux/psp-sev.h   |   8 +-
  include/uapi/linux/kvm.h  |  12 ++
  4 files changed, 168 insertions(+), 4 deletions(-)

diff --git a/Documentation/virt/kvm/amd-memory-encryption.rst 
b/Documentation/virt/kvm/amd-memory-encryption.rst
index 469a6308765b..ac799dd7a618 100644
--- a/Documentation/virt/kvm/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/amd-memory-encryption.rst
@@ -284,6 +284,33 @@ Returns: 0 on success, -negative on error
  __u32 len;
  };
  
+10. KVM_SEV_SEND_START

+--
+
+The KVM_SEV_SEND_START command can be used by the hypervisor to create an
+outgoing guest encryption context.
+
+Parameters (in): struct kvm_sev_send_start
+
+Returns: 0 on success, -negative on error
+
+::
+struct kvm_sev_send_start {
+__u32 policy; /* guest policy */
+
+__u64 pdh_cert_uaddr; /* platform Diffie-Hellman 
certificate */
+__u32 pdh_cert_len;
+
+__u64 plat_certs_uaddr;/* platform certificate chain */
+__u32 plat_certs_len;
+
+__u64 amd_certs_uaddr;/* AMD certificate */
+__u32 amd_certs_len;
+
+__u64 session_uaddr;  /* Guest session information */
+__u32 session_len;
+};
+
  References
  ==
  
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c

index 874ea309279f..2b65900c05d6 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -1110,6 +1110,128 @@ static int sev_get_attestation_report(struct kvm *kvm, 
struct kvm_sev_cmd *argp)
return ret;
  }
  
+/* Userspace wants to query session length. */

+static int
+__sev_send_start_query_session_length(struct kvm *kvm, struct kvm_sev_cmd 
*argp,
+ struct kvm_sev_send_start *params)
+{
+   struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+   struct sev_data_send_start *data;
+   int ret;
+
+   data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT);
+   if (data == NULL)
+   return -ENOMEM;
+
+   data->handle = sev->handle;
+   ret = sev_issue_cmd(kvm, SEV_CMD_SEND_START, data, &argp->error);


This is missing an "if (ret < 0)" (and this time I'm pretty sure it's 
indeed the case :)), otherwise you miss for example the EBADF return 
code if the SEV file descriptor is closed or reused.  Same for 
KVM_SEND_UPDATE_DATA.  Also, the length==0 case is not documented.
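Concretely, the missing check would look something like this on top of the
quoted code (sketch only, not the final patch):

	ret = sev_issue_cmd(kvm, SEV_CMD_SEND_START, data, &argp->error);
	if (ret < 0)
		goto out_free;

	params->session_len = data->session_len;
	if (copy_to_user((void __user *)(uintptr_t)argp->data, params,
			 sizeof(struct kvm_sev_send_start)))
		ret = -EFAULT;

out_free:
	kfree(data);
	return ret;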


Paolo


+   params->session_len = data->session_len;
+   if (copy_to_user((void __user *)(uintptr_t)argp->data, params,
+   sizeof(struct kvm_sev_send_start)))
+   ret = -EFAULT;
+
+   kfree(data);
+   return ret;
+}
+
+static int sev_send_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+   struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+   struct sev_data_send_start *data;
+   struct kvm_sev_send_start params;
+   void *amd_certs, *session_data;
+   void *pdh_cert, *plat_certs;
+   int ret;
+
+   if (!sev_guest(kvm))
+   return -ENOTTY;
+
+   if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data,
+   sizeof(struct kvm_sev_send_start)))
+   return -EFAULT;
+
+   /* if session_len is zero, userspace wants to query the session length 
*/
+   if (!params.session_len)
+   return __sev_send_start_query_session_length(kvm, argp,
+   );
+
+   /* some sanity checks */
+   if (!params.pdh_cert_uaddr || !params.pdh_cert_len ||
+   !params.session_uaddr || params.session_len > SEV_FW_BLOB_MAX_SIZE)
+   return -EINVAL;
+
+   /* allocate the memory to hold the session data blob */
+   session_data = kmalloc(params.session_len, GFP_KERNEL_ACCOUNT);
+   if (!session_data)
+   return -ENOMEM;
+
+   /* copy the certificate blobs from userspace */
+   pdh_cert = psp_copy_user_blob(params.pdh_cert_uaddr,
+   params.pdh_cert_len);
+   if (IS_ERR(pdh_cert)) {
+   ret = PTR_ERR(pdh_cert);
+   goto e_free_session;
+   }
+
+   pl

Re: [PATCH v13 05/12] KVM: SVM: Add KVM_SEV_RECEIVE_UPDATE_DATA command

2021-04-20 Thread Paolo Bonzini

On 20/04/21 10:40, Paolo Bonzini wrote:

On 15/04/21 17:55, Ashish Kalra wrote:

+    if (!guest_page)
+    goto e_free;
+


Missing unpin on error (but it won't be needed with Sean's patches that 
move the data block to the stack, so I can fix this too).


No, sorry---the initialization order is different between 
send_update_data and receive_update_data, so it's okay.


Paolo


Re: [PATCH v13 05/12] KVM: SVM: Add KVM_SEV_RECEIVE_UPDATE_DATA command

2021-04-20 Thread Paolo Bonzini

On 15/04/21 17:55, Ashish Kalra wrote:

+   if (!guest_page)
+   goto e_free;
+


Missing unpin on error (but it won't be needed with Sean's patches that 
move the data block to the stack, so I can fix this too).


Paolo



Re: [PATCH v13 04/12] KVM: SVM: Add support for KVM_SEV_RECEIVE_START command

2021-04-20 Thread Paolo Bonzini

On 15/04/21 17:54, Ashish Kalra wrote:

+   }
+
+   sev->handle = start->handle;
+   sev->fd = argp->sev_fd;


These two lines are spurious, I'll delete them.

Paolo



[PATCH] KVM: selftests: Always run vCPU thread with blocked SIG_IPI

2021-04-20 Thread Paolo Bonzini
The main thread could start to send SIG_IPI at any time, even before signal
blocked on vcpu thread.  Therefore, start the vcpu thread with the signal
blocked.

Without this patch, on very busy cores the dirty_log_test could fail directly
on receiving a SIGUSR1 without a handler (when vcpu runs far slower than main).

Reported-by: Peter Xu 
Cc: sta...@vger.kernel.org
Signed-off-by: Paolo Bonzini 
---
 tools/testing/selftests/kvm/dirty_log_test.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c 
b/tools/testing/selftests/kvm/dirty_log_test.c
index ffa4e2791926..81edbd23d371 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -527,9 +527,8 @@ static void *vcpu_worker(void *data)
 */
sigmask->len = 8;
pthread_sigmask(0, NULL, sigset);
+   sigdelset(sigset, SIG_IPI);
vcpu_ioctl(vm, VCPU_ID, KVM_SET_SIGNAL_MASK, sigmask);
-   sigaddset(sigset, SIG_IPI);
-   pthread_sigmask(SIG_BLOCK, sigset, NULL);
 
sigemptyset(sigset);
sigaddset(sigset, SIG_IPI);
@@ -858,6 +857,7 @@ int main(int argc, char *argv[])
.interval = TEST_HOST_LOOP_INTERVAL,
};
int opt, i;
+   sigset_t sigset;
 
	sem_init(&sem_vcpu_stop, 0, 0);
	sem_init(&sem_vcpu_cont, 0, 0);
@@ -916,6 +916,11 @@ int main(int argc, char *argv[])
 
srandom(time(0));
 
+   /* Ensure that vCPU threads start with SIG_IPI blocked.  */
+   sigemptyset(&sigset);
+   sigaddset(&sigset, SIG_IPI);
+   pthread_sigmask(SIG_BLOCK, &sigset, NULL);
+
if (host_log_mode_option == LOG_MODE_ALL) {
/* Run each log mode */
for (i = 0; i < LOG_MODE_NUM; i++) {
-- 
2.26.2



Re: [PATCH v3 2/2] KVM: selftests: Wait for vcpu thread before signal setup

2021-04-20 Thread Paolo Bonzini

On 17/04/21 16:36, Peter Xu wrote:

The main thread could start to send SIG_IPI at any time, even before signal
blocked on vcpu thread.  Reuse the sem_vcpu_stop to sync on that, so when
SIG_IPI is sent the signal will always land correctly as an -EINTR.

Without this patch, on very busy cores the dirty_log_test could fail directly
on receiving a SIG_USR1 without a handler (when vcpu runs far slower than main).

Signed-off-by: Peter Xu 


This should be a better way to do the same:

--- 8< 
From: Paolo Bonzini 
Subject: [PATCH] KVM: selftests: Wait for vcpu thread before signal setup

The main thread could start to send SIG_IPI at any time, even before signal
blocked on vcpu thread.  Therefore, start the vcpu thread with the signal
blocked.

Without this patch, on very busy cores the dirty_log_test could fail directly
on receiving a SIGUSR1 without a handler (when vcpu runs far slower than main).

Reported-by: Peter Xu 
Cc: sta...@vger.kernel.org
Signed-off-by: Paolo Bonzini 

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c 
b/tools/testing/selftests/kvm/dirty_log_test.c
index ffa4e2791926..048973d50a45 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -527,9 +527,8 @@ static void *vcpu_worker(void *data)
 */
sigmask->len = 8;
pthread_sigmask(0, NULL, sigset);
+   sigdelset(sigset, SIG_IPI);
vcpu_ioctl(vm, VCPU_ID, KVM_SET_SIGNAL_MASK, sigmask);
-   sigaddset(sigset, SIG_IPI);
-   pthread_sigmask(SIG_BLOCK, sigset, NULL);
 
 	sigemptyset(sigset);

sigaddset(sigset, SIG_IPI);
@@ -858,6 +857,7 @@ int main(int argc, char *argv[])
.interval = TEST_HOST_LOOP_INTERVAL,
};
int opt, i;
+   sigset_t sigset;
 
	sem_init(&sem_vcpu_stop, 0, 0);

sem_init(&sem_vcpu_cont, 0, 0);
@@ -916,6 +916,11 @@ int main(int argc, char *argv[])
 
 	srandom(time(0));
 
+	/* Ensure that vCPU threads start with SIG_IPI blocked.  */

+   sigemptyset(&sigset);
+   sigaddset(&sigset, SIG_IPI);
+   pthread_sigmask(SIG_BLOCK, &sigset, NULL);
+
if (host_log_mode_option == LOG_MODE_ALL) {
/* Run each log mode */
for (i = 0; i < LOG_MODE_NUM; i++) {



---
  tools/testing/selftests/kvm/dirty_log_test.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c 
b/tools/testing/selftests/kvm/dirty_log_test.c
index 510884f0eab8..25230e799bc4 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -534,6 +534,12 @@ static void *vcpu_worker(void *data)
sigemptyset(sigset);
sigaddset(sigset, SIG_IPI);
  
+	/*

+* Tell the main thread that signals are setup already; let's borrow
+* sem_vcpu_stop even if it's not for it.
+*/
+   sem_post(&sem_vcpu_stop);
+
guest_array = addr_gva2hva(vm, (vm_vaddr_t)random_array);
  
  	while (!READ_ONCE(host_quit)) {

@@ -785,6 +791,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
  
  	pthread_create(_thread, NULL, vcpu_worker, vm);
  
+	sem_wait_until(&sem_vcpu_stop);

+
while (iteration < p->iterations) {
/* Give the vcpu thread some time to dirty some pages */
usleep(p->interval * 1000);





Re: [PATCH v3 1/2] KVM: selftests: Sync data verify of dirty logging with guest sync

2021-04-20 Thread Paolo Bonzini

On 18/04/21 14:43, Peter Xu wrote:

8<-
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c 
b/tools/testing/selftests/kvm/dirty_log_test.c
index 25230e799bc4..d3050d1c2cd0 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -377,7 +377,7 @@ static void dirty_ring_after_vcpu_run(struct kvm_vm *vm, 
int ret, int err)
 /* A ucall-sync or ring-full event is allowed */
 if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
 /* We should allow this to continue */
-   ;
+   vcpu_handle_sync_stop();
 } else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL ||
(ret == -1 && err == EINTR)) {
 /* Update the flag first before pause */
8<-

That's my intention when I introduce vcpu_handle_sync_stop(), but forgot to
add...


And possibly even this (untested though):

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c 
b/tools/testing/selftests/kvm/dirty_log_test.c
index ffa4e2791926..918954f01cef 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -383,6 +383,7 @@ static void dirty_ring_after_vcpu_run(struct kvm_vm *vm, 
int ret, int err)
/* Update the flag first before pause */
WRITE_ONCE(dirty_ring_vcpu_ring_full,
   run->exit_reason == KVM_EXIT_DIRTY_RING_FULL);
+   atomic_set(&vcpu_sync_stop_requested, false);
	sem_post(&sem_vcpu_stop);
pr_info("vcpu stops because %s...\n",
dirty_ring_vcpu_ring_full ?
@@ -804,8 +805,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 * the flush of the last page, and since we handle the last
 * page specially verification will succeed anyway.
 */
-   assert(host_log_mode == LOG_MODE_DIRTY_RING ||
-  atomic_read(&vcpu_sync_stop_requested) == false);
+   assert(atomic_read(&vcpu_sync_stop_requested) == false);
vm_dirty_log_verify(mode, bmap);
sem_post(&sem_vcpu_cont);
 


You can submit all these as a separate patch.

Paolo



Re: [PATCH] KVM: Boost vCPU candidate in user mode which is delivering interrupt

2021-04-20 Thread Paolo Bonzini

On 20/04/21 08:08, Wanpeng Li wrote:

On Tue, 20 Apr 2021 at 14:02, Wanpeng Li  wrote:


On Tue, 20 Apr 2021 at 00:59, Paolo Bonzini  wrote:


On 19/04/21 18:32, Sean Christopherson wrote:

If false positives are a big concern, what about adding another pass to the loop
and only yielding to usermode vCPUs with interrupts in the second full pass?
I.e. give vCPUs that are already in kernel mode priority, and only yield to
handle an interrupt if there are no vCPUs in kernel mode.

kvm_arch_dy_runnable() pulls in pv_unhalted, which seems like a good thing.


pv_unhalted won't help if you're waiting for a kernel spinlock though,
would it?  Doing two passes (or looking for a "best" candidate that
prefers kernel mode vCPUs to user mode vCPUs waiting for an interrupt)
seems like the best choice overall.


How about something like this:


I was thinking of something simpler:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9b8e30dd5b9b..455c648f9adc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3198,10 +3198,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool 
yield_to_kernel_mode)
 {
struct kvm *kvm = me->kvm;
struct kvm_vcpu *vcpu;
-   int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
int yielded = 0;
int try = 3;
-   int pass;
+   int pass, num_passes = 1;
int i;
 
 	kvm_vcpu_set_in_spin_loop(me, true);

@@ -3212,13 +3211,14 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool 
yield_to_kernel_mode)
 * VCPU is holding the lock that we need and will release it.
 * We approximate round-robin by starting at the last boosted VCPU.
 */
-   for (pass = 0; pass < 2 && !yielded && try; pass++) {
-   kvm_for_each_vcpu(i, vcpu, kvm) {
-   if (!pass && i <= last_boosted_vcpu) {
-   i = last_boosted_vcpu;
-   continue;
-   } else if (pass && i > last_boosted_vcpu)
-   break;
+   for (pass = 0; pass < num_passes; pass++) {
+   int idx = me->kvm->last_boosted_vcpu;
+   int n = atomic_read(&kvm->online_vcpus);
+   for (i = 0; i < n; i++, idx++) {
+   if (idx == n)
+   idx = 0;
+
+   vcpu = kvm_get_vcpu(kvm, idx);
if (!READ_ONCE(vcpu->ready))
continue;
if (vcpu == me)
@@ -3226,23 +3226,36 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool 
yield_to_kernel_mode)
if (rcuwait_active(&vcpu->wait) &&
!vcpu_dy_runnable(vcpu))
continue;
-   if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode 
&&
-   !kvm_arch_vcpu_in_kernel(vcpu))
-   continue;
if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
continue;
 
+			if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&

+   !kvm_arch_vcpu_in_kernel(vcpu)) {
+   /*
+* A vCPU running in userspace can get to kernel 
mode via
+* an interrupt.  That's a worse choice than a CPU 
already
+* in kernel mode so only do it on a second pass.
+*/
+   if (!vcpu_dy_runnable(vcpu))
+   continue;
+   if (pass == 0) {
+   num_passes = 2;
+   continue;
+   }
+   }
+
yielded = kvm_vcpu_yield_to(vcpu);
if (yielded > 0) {
kvm->last_boosted_vcpu = i;
-   break;
+   goto done;
} else if (yielded < 0) {
try--;
if (!try)
-   break;
+   goto done;
}
}
}
+done:
kvm_vcpu_set_in_spin_loop(me, false);
 
 	/* Ensure vcpu is not eligible during next spinloop */


Paolo



Re: Doubt regarding memory allocation in KVM

2021-04-20 Thread Paolo Bonzini

On 20/04/21 07:45, Shivank Garg wrote:

Hi,
I'm learning about QEMU/KVM, looking into the code and experimenting with
it. I have the following doubts regarding it; I would be grateful if
you could help me get some idea about them.

1. I observe that KVM allocates memory to guests when it needs it but
doesn't take it back (except for the ballooning case).
Also, the QEMU/KVM process does not free the memory even when the
guest is rebooted. In this case, does the guest VM get access to
memory already pre-filled with some garbage from the previous run?


Yes.


(Since the host would allocate zeroed pages to guests the first time
it requests but after that it's up to guests). Can it be a security
issue?


No, it's the same as what happens on a non-virtual machine.


2. How does the KVM know if GPFN (guest physical frame number) is
backed by an actual machine frame number in host? If not mapped, then
it faults in the host and allocates a physical frame for guests in the
host. (kvm_mmu_page_fault)


It's all handled by Linux.  KVM only does a call to get_user_pages.  See 
the functions whose names start with hva_to_pfn in virt/kvm/kvm_main.c.


Given a GPA, the GFN is simply the guest physical address with bits 0:11 
stripped off, i.e. shifted right by 12.
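In code form that is essentially what gpa_to_gfn() in include/linux/kvm_host.h
does:

	static inline gfn_t gpa_to_gfn(gpa_t gpa)
	{
		return (gfn_t)(gpa >> PAGE_SHIFT);
	}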


Paolo



Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-04-19 Thread Paolo Bonzini

On 19/04/21 17:09, Sean Christopherson wrote:

- this loses the rwsem fairness.  On the other hand, mm/mmu_notifier.c's
own interval-tree-based filter is also using a similar mechanism that is
likewise not fair, so it should be okay.


The one concern I had with an unfair mechanism of this nature is that, in 
theory,
the memslot update could be blocked indefinitely.


Yep, that's why I mentioned it.


@@ -1333,9 +1351,22 @@ static struct kvm_memslots *install_new_memslots(struct 
kvm *kvm,
WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
-   down_write(&kvm->mmu_notifier_slots_lock);
+   /*
+* This cannot be an rwsem because the MMU notifier must not run
+* inside the critical section.  A sleeping rwsem cannot exclude
+* that.


How on earth did you decipher that from the splat?  I stared at it for a good
five minutes and was completely befuddled.


Just scratch that, it makes no sense.  It's much simpler, but you have
to look at include/linux/mmu_notifier.h to figure it out:

invalidate_range_start
  take pseudo lock
  down_read()   (*)
  release pseudo lock
invalidate_range_end
  take pseudo lock  (**)
  up_read()
  release pseudo lock

At point (*) we take the mmu_notifier_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifier_slots_lock.

This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):

- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots

- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes

- invalidate_range_end waits on the pseudo lock, held by
invalidate_range_start.

Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers).  This also means that there's no need for a raw spinlock.

Given this simple explanation, I think it's okay to include this
patch in the merge window pull request, with the fix after my
signature squashed in.  The fix actually undoes a lot of the
changes to __kvm_handle_hva_range that this patch made, so the
result is relatively simple.  You can already find the result
in kvm/queue.

Paolo

From daefeeb229ba8be5bd819a51875bc1fd5e74fc85 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini 
Date: Mon, 19 Apr 2021 09:01:46 -0400
Subject: [PATCH] KVM: avoid "deadlock" between install_new_memslots and MMU
 notifier

Wanpeng Li is reporting this splat:

 ==
 WARNING: possible circular locking dependency detected
 5.12.0-rc3+ #6 Tainted: G   OE
 --
 qemu-system-x86/3069 is trying to acquire lock:
 9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: 
__mmu_notifier_invalidate_range_end+0x5/0x190

 but task is already holding lock:
aff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: 
kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]

 which lock already depends on the new lock.

This corresponds to the following MMU notifier logic:

invalidate_range_start
  take pseudo lock
  down_read()   (*)
  release pseudo lock
invalidate_range_end
  take pseudo lock  (**)
  up_read()
  release pseudo lock

At point (*) we take the mmu_notifier_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifier_slots_lock.

This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):

- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots

- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes

- invalidate_range_end waits on the pseudo lock, held by
invalidate_range_start.

Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock.  This also allows handling blockable and non-blockable
critical sections in the same way.

Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever.  Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.

Signed-off-by: Paolo Bonzini 
---
 Documentation/virt/kvm/locking.rst |   9 +--
 include/linux/kvm_host.h   |   8 +-
 virt/kvm/kvm_main.c| 119 ++---
 3 files changed, 67 insertions(+), 69 deletions(-)

diff --git a/Documentation/virt/kvm/locking.rst 
b/Documentation/virt/kvm/locking.rst
index 8f5d5bcf5689..e628f48dfdda 100644
--- a/Documentation/virt/kvm/locking.
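(The quoted diff is cut short here; for the record, the open-coded wait in
install_new_memslots ends up with roughly the following shape. This is a
simplified sketch using the field names introduced above:)

	/*
	 * invalidate_range_start() bumps kvm->mn_active_invalidate_count under
	 * kvm->mn_invalidate_lock, and invalidate_range_end() decrements it and
	 * calls rcuwait_wake_up() when it drops to zero.  install_new_memslots()
	 * waits for in-flight invalidations before publishing the new array:
	 */
	spin_lock(&kvm->mn_invalidate_lock);
	prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
	while (kvm->mn_active_invalidate_count) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		spin_unlock(&kvm->mn_invalidate_lock);
		schedule();
		spin_lock(&kvm->mn_invalidate_lock);
	}
	finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
	rcu_assign_pointer(kvm->memslots[as_id], slots);
	spin_unlock(&kvm->mn_invalidate_lock);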

[PATCH fixed] KVM: x86/mmu: Fast invalidation for TDP MMU

2021-04-19 Thread Paolo Bonzini
From: Ben Gardon 

Provide a real mechanism for fast invalidation by marking roots as
invalid so that their reference count will quickly fall to zero
and they will be torn down.

One negative side effect of this approach is that a vCPU thread will
likely drop the last reference to a root and be saddled with the work of
tearing down an entire paging structure. This issue will be resolved in
a later commit.

Signed-off-by: Ben Gardon 
Message-Id: <20210401233736.638171-13-bgar...@google.com>
[Move the loop to tdp_mmu.c, otherwise compilation fails on 32-bit. - Paolo]
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 12 +---
 arch/x86/kvm/mmu/tdp_mmu.c | 17 +
 arch/x86/kvm/mmu/tdp_mmu.h |  5 +
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 61c0d1091fc5..3323ab281f34 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5421,6 +5421,15 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 */
kvm->arch.mmu_valid_gen = kvm->arch.mmu_valid_gen ? 0 : 1;
 
+   /* In order to ensure all threads see this change when
+* handling the MMU reload signal, this must happen in the
+* same critical section as kvm_reload_remote_mmus, and
+* before kvm_zap_obsolete_pages as kvm_zap_obsolete_pages
+* could drop the MMU lock and yield.
+*/
+   if (is_tdp_mmu_enabled(kvm))
+   kvm_tdp_mmu_invalidate_all_roots(kvm);
+
/*
 * Notify all vcpus to reload its shadow page table and flush TLB.
 * Then all vcpus will switch to new shadow page table with the new
@@ -5433,9 +5442,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 
kvm_zap_obsolete_pages(kvm);
 
-   if (is_tdp_mmu_enabled(kvm))
-   kvm_tdp_mmu_zap_all(kvm);
-
write_unlock(&kvm->mmu_lock);
 }
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 335c126d9860..bc9308a104b1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -797,6 +797,23 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
kvm_flush_remote_tlbs(kvm);
 }
 
+/*
+ * Mark each TDP MMU root as invalid so that other threads
+ * will drop their references and allow the root count to
+ * go to 0.
+ *
+ * This has essentially the same effect for the TDP MMU
+ * as updating mmu_valid_gen does for the shadow MMU.
+ */
+void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
+{
+   struct kvm_mmu_page *root;
+
+   lockdep_assert_held_write(&kvm->mmu_lock);
+   list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link)
+   root->role.invalid = true;
+}
+
 /*
  * Installs a last-level SPTE to handle a TDP page fault.
  * (NPT/EPT violation/misconfiguration)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 2e1913bbc0ba..25ec0173700e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -10,6 +10,9 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
 __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm *kvm,
 struct kvm_mmu_page *root)
 {
+   if (root->role.invalid)
+   return false;
+
return refcount_inc_not_zero(&root->tdp_mmu_root_count);
 }
 
@@ -43,7 +46,9 @@ static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct 
kvm_mmu_page *sp)
return __kvm_tdp_mmu_zap_gfn_range(kvm, kvm_mmu_page_as_id(sp),
   sp->gfn, end, false, false, false);
 }
+
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
+void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
 
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
int map_writable, int max_level, kvm_pfn_t pfn,
-- 
2.26.2



Re: [PATCH 00/13] [RFC] Rust support

2021-04-19 Thread Paolo Bonzini

On 19/04/21 19:14, Linus Torvalds wrote:

On Mon, Apr 19, 2021 at 2:36 AM Peter Zijlstra  wrote:


I also don't see how this is better than seq_cst.

But yes, not broken, but also very much not optimal.


I continue to feel like kernel people should just entirely ignore the
C++ memory ordering standard.

It's inferior to what we already have, and simply not helpful. It
doesn't actually solve any problems as far as the kernel is concerned,
and it generates its own set of issues (ie assuming that the compiler
supports it, and assuming the compiler gets it right).

The really subtle cases that it could have been helpful for (eg RCU,
or the load-store control dependencies) were _too_ subtle for the
standard.

And I do not believe Rust changes _any_ of that.


It changes it for the worse, in that access to fields that are shared 
across threads *must* either use atomic types (which boil down to the 
same compiler intrinsics as the C/C++ memory model) or synchronization 
primitives.  LKMM operates in the grey area between the C standard and 
what gcc/clang actually implement, but there's no such grey area in Rust 
unless somebody wants to rewrite arch/*/asm atomic access primitives and 
memory barriers in Rust.


Of course it's possible to say Rust code just uses the C/C++/Rust model 
and C code follows the LKMM, but that really only delays the inevitable 
until a driver is written part in C part in Rust, and needs to perform 
accesses outside synchronization primitives.


Paolo


Any kernel Rust code will simply have to follow the LKMM rules, and
use the kernel model for the interfaces. Things like the C++ memory
model is simply not _relevant_ to the kernel.




Re: [PATCH] KVM: Boost vCPU candidate in user mode which is delivering interrupt

2021-04-19 Thread Paolo Bonzini

On 19/04/21 18:32, Sean Christopherson wrote:

If false positives are a big concern, what about adding another pass to the loop
and only yielding to usermode vCPUs with interrupts in the second full pass?
I.e. give vCPUs that are already in kernel mode priority, and only yield to
handle an interrupt if there are no vCPUs in kernel mode.

kvm_arch_dy_runnable() pulls in pv_unhalted, which seems like a good thing.


pv_unhalted won't help if you're waiting for a kernel spinlock though, 
would it?  Doing two passes (or looking for a "best" candidate that 
prefers kernel mode vCPUs to user mode vCPUs waiting for an interrupt) 
seems like the best choice overall.


Paolo



Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-04-19 Thread Paolo Bonzini

On 19/04/21 10:49, Wanpeng Li wrote:

I saw this splatting:

  ==
  WARNING: possible circular locking dependency detected
  5.12.0-rc3+ #6 Tainted: G   OE
  --
  qemu-system-x86/3069 is trying to acquire lock:
  9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0},
at: __mmu_notifier_invalidate_range_end+0x5/0x190

  but task is already holding lock:
aff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at:
kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]


I guess it is possible to open-code the wait using a readers count and a
spinlock (see patch after signature).  This allows including the
rcu_assign_pointer in the same critical section that checks the number
of readers.  Also on the plus side, the init_rwsem() is replaced by
slightly nicer code.

IIUC this could be extended to non-sleeping invalidations too, but I
am not really sure about that.

There are some issues with the patch though:

- I am not sure if this should be a raw spin lock to avoid the same issue
on PREEMPT_RT kernel.  That said the critical section is so tiny that using
a raw spin lock may make sense anyway

- this loses the rwsem fairness.  On the other hand, mm/mmu_notifier.c's
own interval-tree-based filter is also using a similar mechanism that is
likewise not fair, so it should be okay.

Any opinions?  For now I placed the change below in kvm/queue, but I'm
leaning towards delaying this optimization to the next merge window.

Paolo

diff --git a/Documentation/virt/kvm/locking.rst 
b/Documentation/virt/kvm/locking.rst
index 8f5d5bcf5689..e628f48dfdda 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -16,12 +16,11 @@ The acquisition orders for mutexes are as follows:
 - kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
   them together is quite rare.
 
-- The kvm->mmu_notifier_slots_lock rwsem ensures that pairs of

+- kvm->mn_active_invalidate_count ensures that pairs of
   invalidate_range_start() and invalidate_range_end() callbacks
-  use the same memslots array.  kvm->slots_lock is taken outside the
-  write-side critical section of kvm->mmu_notifier_slots_lock, so
-  MMU notifiers must not take kvm->slots_lock.  No other write-side
-  critical sections should be added.
+  use the same memslots array.  kvm->slots_lock is taken on the
+  waiting side in install_new_memslots, so MMU notifiers must not
+  take kvm->slots_lock.
 
 On x86:
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h

index 76b340dd6981..44a4a0c5148a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -472,11 +472,15 @@ struct kvm {
 #endif /* KVM_HAVE_MMU_RWLOCK */
 
 	struct mutex slots_lock;

-   struct rw_semaphore mmu_notifier_slots_lock;
struct mm_struct *mm; /* userspace tied to this vm */
struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 
+	/* Used to wait for completion of MMU notifiers.  */

+   spinlock_t mn_invalidate_lock;
+   unsigned long mn_active_invalidate_count;
+   struct rcuwait mn_memslots_update_rcuwait;
+
/*
 * created_vcpus is protected by kvm->lock, and is incremented
 * at the beginning of KVM_CREATE_VCPU.  online_vcpus is only
@@ -662,7 +666,7 @@ static inline struct kvm_memslots *__kvm_memslots(struct 
kvm *kvm, int as_id)
as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
	return srcu_dereference_check(kvm->memslots[as_id], &kvm->srcu,
				      lockdep_is_held(&kvm->slots_lock) ||
-				      lockdep_is_held(&kvm->mmu_notifier_slots_lock) ||
+				      READ_ONCE(kvm->mn_active_invalidate_count) ||
				      !refcount_read(&kvm->users_count));
 }
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c

index ff9e95eb6960..cdaa1841e725 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -624,7 +624,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier 
*mn,
 * otherwise, mmu_notifier_count is incremented unconditionally.
 */
if (!kvm->mmu_notifier_count) {
-   lockdep_assert_held(&kvm->mmu_notifier_slots_lock);
+   WARN_ON(!READ_ONCE(kvm->mn_active_invalidate_count));
return;
}
 
@@ -689,10 +689,13 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,

 * The complexity required to handle conditional locking for this case
 * is not worth the marginal benefits, the VM is likely doomed anyways.
 *
-* Pairs with the up_read in range_end().
+* Pairs with the decrement in range_end().
 */
-   if (blockable)
-   down_read(&kvm->mmu_notifier_slots_lock);
+   if (blockable) {
+   

Re: [PATCH 00/13] [RFC] Rust support

2021-04-19 Thread Paolo Bonzini

On 19/04/21 11:36, Peter Zijlstra wrote:

On Mon, Apr 19, 2021 at 11:02:12AM +0200, Paolo Bonzini wrote:

void writer(void)
{
  atomic_store_explicit(&seq, seq+1, memory_order_relaxed);
  atomic_thread_fence(memory_order_acquire);


This needs to be memory_order_release.  The only change in the resulting
assembly is that "dmb ishld" becomes "dmb ish", which is not as good as the
"dmb ishst" you get from smp_wmb() but not buggy either.


Yuck! And that is what requires the insides to be
atomic_store_explicit(), otherwise this fence doesn't have to affect
them.


Not just that, even the write needs to be atomic_store_explicit in order 
to avoid a data race.



I also don't see how this is better than seq_cst.


It is better than seq_cst on TSO architectures.  Another possibility is 
to use release stores for everything (both increments and the stores 
between them).



But yes, not broken, but also very much not optimal.


Agreed on that, just like RCU/memory_order_consume.

Paolo



Re: [PATCH 00/13] [RFC] Rust support

2021-04-19 Thread Paolo Bonzini

On 19/04/21 10:26, Peter Zijlstra wrote:

On Mon, Apr 19, 2021 at 09:53:06AM +0200, Paolo Bonzini wrote:

On 19/04/21 09:32, Peter Zijlstra wrote:

On Sat, Apr 17, 2021 at 04:51:58PM +0200, Paolo Bonzini wrote:

On 16/04/21 09:09, Peter Zijlstra wrote:

Well, the obvious example would be seqlocks. C11 can't do them


Sure it can.  C11 requires annotating with (the equivalent of) READ_ONCE all
reads of seqlock-protected fields, but the memory model supports seqlocks
just fine.


How does that help?

IIRC there's two problems, one on each side the lock. On the write side
we have:

seq++;
smp_wmb();
X = r;
Y = r;
smp_wmb();
seq++;

Which C11 simply cannot do right because it doesn't have wmb.


It has atomic_thread_fence(memory_order_release), and
atomic_thread_fence(memory_order_acquire) on the read side.


https://godbolt.org/z/85xoPxeE5

void writer(void)
{
 atomic_store_explicit(&seq, seq+1, memory_order_relaxed);
 atomic_thread_fence(memory_order_acquire);


This needs to be memory_order_release.  The only change in the resulting 
assembly is that "dmb ishld" becomes "dmb ish", which is not as good as 
the "dmb ishst" you get from smp_wmb() but not buggy either.


The read side can use "dmb ishld" so it gets the same code as Linux.
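For completeness, a sketch of the matching C11 read side (illustrative only; it 
assumes seq, X and Y are C11 atomics, per the data-race point made elsewhere in 
this thread, and that the writer bumps seq before and after the data stores):

	unsigned reader(int *x, int *y)
	{
		unsigned s1, s2;

		do {
			s1 = atomic_load_explicit(&seq, memory_order_relaxed);
			atomic_thread_fence(memory_order_acquire);	/* like smp_rmb() */
			*x = atomic_load_explicit(&X, memory_order_relaxed);
			*y = atomic_load_explicit(&Y, memory_order_relaxed);
			atomic_thread_fence(memory_order_acquire);	/* like smp_rmb() */
			s2 = atomic_load_explicit(&seq, memory_order_relaxed);
		} while (s1 != s2 || (s1 & 1));

		return s1;
	}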

LWN needs a "C11 memory model for kernel folks" article.  In the 
meanwhile there is 
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0124r4.html 
which is the opposite (Linux kernel memory model for C11 folks).


Paolo



 X = 1;
 Y = 2;

 atomic_store_explicit(&seq, seq+1, memory_order_release);
}

gives:

writer:
 adrp    x1, .LANCHOR0
 add x0, x1, :lo12:.LANCHOR0
 ldr w2, [x1, #:lo12:.LANCHOR0]
 add w2, w2, 1
 str w2, [x0]
 dmb ishld
 ldr w1, [x1, #:lo12:.LANCHOR0]
 mov w3, 1
 mov w2, 2
 stp w3, w2, [x0, 4]
 add w1, w1, w3
 stlr    w1, [x0]
 ret

Which, afaict, is completely buggered. What it seems to be doing is
turning the seq load into a load-acquire, but what we really need is to
make sure the seq store (increment) is ordered before the other stores.






Re: [PATCH 00/13] [RFC] Rust support

2021-04-19 Thread Paolo Bonzini

On 19/04/21 09:32, Peter Zijlstra wrote:

On Sat, Apr 17, 2021 at 04:51:58PM +0200, Paolo Bonzini wrote:

On 16/04/21 09:09, Peter Zijlstra wrote:

Well, the obvious example would be seqlocks. C11 can't do them


Sure it can.  C11 requires annotating with (the equivalent of) READ_ONCE all
reads of seqlock-protected fields, but the memory model supports seqlocks
just fine.


How does that help?

IIRC there's two problems, one on each side the lock. On the write side
we have:

seq++;
smp_wmb();
X = r;
Y = r;
smp_wmb();
seq++;

Which C11 simply cannot do right because it doesn't have wmb.


It has atomic_thread_fence(memory_order_release), and 
atomic_thread_fence(memory_order_acquire) on the read side.



You end up
having to use seq_cst for the first wmb or make both X and Y (on top of
the last seq) a store-release, both options are sub-optimal.


seq_cst (except for the fence which is just smp_mb) is a pile of manure, 
no doubt about that. :)


Paolo



Re: [PATCH 00/13] [RFC] Rust support

2021-04-17 Thread Paolo Bonzini

On 16/04/21 17:58, Theodore Ts'o wrote:

Another fairly common use case is a lockless, racy test of a
particular field, as an optimization before we take the lock before we
test it for realsies.  In this particular case, we can't allocate
memory while holding a spinlock, so we check to see without taking the
spinlock to see whether we should allocate memory (which is expensive,
and unnecessasry most of the time):

alloc_transaction:
/*
 * This check is racy but it is just an optimization of allocating new
 * transaction early if there are high chances we'll need it. If we
 * guess wrong, we'll retry or free the unused transaction.
 */
if (!data_race(journal->j_running_transaction)) {
/*
 * If __GFP_FS is not present, then we may be being called from
 * inside the fs writeback layer, so we MUST NOT fail.
 */
if ((gfp_mask & __GFP_FS) == 0)
gfp_mask |= __GFP_NOFAIL;
new_transaction = kmem_cache_zalloc(transaction_cache,
gfp_mask);
if (!new_transaction)
return -ENOMEM;
}


From my limited experience with Rust, things like these are a bit 
annoying indeed, sooner or later Mutex<> just doesn't cut it and you 
have to deal with its limitations.


In this particular case you would use an AtomicBool field, place it 
outside the Mutex-protected struct, and make sure that it is only accessed 
under the lock, just like in C.
One easy way out is to make the Mutex protect (officially) nothing, i.e. 
Mutex<()>, and handle the mutable fields yourself using RefCell (which 
gives you run-time checking but has some space cost) or UnsafeCell 
(which is unsafe as the name says).  Rust makes it pretty easy to write 
smart pointers (Mutex<>'s lock guard itself is a smart pointer) so you 
also have the possibility of writing a safe wrapper for the combination 
of Mutex<()> and UnsafeCell.


Another example is when you have a list of XYZ objects and use the same 
mutex for both the list of XYZ and a field in struct XYZ.  You could 
place that field in an UnsafeCell and write a function that receives a 
guard for the list lock and returns the field, or something like that. 
It *is* quite ugly though.


As an aside, from a teaching standpoint associating a Mutex with a 
specific data structure is bad IMNSHO, because it encourages too 
fine-grained locking.  Sometimes the easiest path to scalability is to 
use a more coarse lock and ensure that contention is extremely rare. 
But it does work for most simple use cases (and device drivers would 
qualify as simple more often than not).


Paolo



Re: [PATCH 00/13] [RFC] Rust support

2021-04-17 Thread Paolo Bonzini

On 16/04/21 09:09, Peter Zijlstra wrote:

Well, the obvious example would be seqlocks. C11 can't do them


Sure it can.  C11 requires annotating with (the equivalent of) READ_ONCE 
all reads of seqlock-protected fields, but the memory model supports 
seqlocks just fine.



Simlar thing for RCU; C11 can't optimally do that


Technically if you know what you're doing (i.e. that you're not on 
Alpha) you can do RCU using a relaxed load followed by an 
atomic_signal_fence(memory_order_consume).  Which I agree is horrible 
and not entirely within the standard, but it works in practice.  The 
Linux implementation of memory barriers, atomic RMW primitives, 
load-acquire/store-release etc. is also completely outside the standard, 
so it's not much different and more portable.
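For concreteness, the trick looks roughly like this (illustrative only; gp and
struct foo are made-up names):

	#include <stdatomic.h>

	struct foo { int a; };
	static _Atomic(struct foo *) gp;

	int read_foo(void)
	{
		struct foo *p = atomic_load_explicit(&gp, memory_order_relaxed);

		/*
		 * Compiler-only barrier; the address dependency provides the
		 * ordering on everything except Alpha.
		 */
		atomic_signal_fence(memory_order_consume);

		return p ? p->a : -1;
	}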


The only thing that I really, really miss when programming with C11 
atomics is smp_mb__{before,after}_atomic().


Paolo



Re: [PATCH v6 03/10] KVM: selftests: Use flag CLOCK_MONOTONIC_RAW for timing

2021-04-17 Thread Paolo Bonzini

On 30/03/21 10:08, Yanan Wang wrote:

In addition to the function of CLOCK_MONOTONIC, the flag CLOCK_MONOTONIC_RAW can
also shield the possible impact of NTP, which can provide more robustness.

Suggested-by: Vitaly Kuznetsov
Signed-off-by: Yanan Wang
Reviewed-by: Ben Gardon
Reviewed-by: Andrew Jones


I'm not sure about this one, is the effect visible?

Paolo



Re: [PATCH v6 00/10] KVM: selftests: some improvement and a new test for kvm page table

2021-04-17 Thread Paolo Bonzini

On 06/04/21 05:05, wangyanan (Y) wrote:

Kindly ping...

Hi Paolo,
Will this series be picked up soon, or is there any other work for me to 
do?


Regards,
Yanan


On 2021/3/30 16:08, Yanan Wang wrote:

Hi,
This v6 series can mainly include two parts.
Rebased on kvm queue branch: 
https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=queue


In the first part, all the known hugetlb backing src types specified
with different hugepage sizes are listed, so that we can specify use
of hugetlb source of the exact granularity that we want, instead of
the system default ones. And as all the known hugetlb page sizes are
listed, it's appropriate for all architectures. Besides, a helper that
can get granularity of different backing src types (anonymous/thp/hugetlb)
is added, so that we can use the accurate backing src granularity for
kinds of alignment or guest memory accessing of vcpus.

In the second part, a new test is added:
This test is added to serve as a performance tester and a bug reproducer
for kvm page table code (GPA->HPA mappings), it gives guidance for the
people trying to make some improvement for kvm. And the following 
explains

what we can exactly do through this test.

The function guest_code() can cover the conditions where a single vcpu or
multiple vcpus access guest pages within the same memory region, in three
VM stages(before dirty logging, during dirty logging, after dirty 
logging).

Besides, the backing src memory type(ANONYMOUS/THP/HUGETLB) of the tested
memory region can be specified by users, which means normal page mappings
or block mappings can be chosen by users to be created in the test.

If ANONYMOUS memory is specified, kvm will create normal page mappings
for the tested memory region before dirty logging, and update attributes
of the page mappings from RO to RW during dirty logging. If THP/HUGETLB
memory is specified, kvm will create block mappings for the tested memory
region before dirty logging, and split the block mappings into normal 
page

mappings during dirty logging, and coalesce the page mappings back into
block mappings after dirty logging is stopped.

So in summary, as a performance tester, this test can present the
performance of kvm creating/updating normal page mappings, or the
performance of kvm creating/splitting/recovering block mappings,
through execution time.

When we need to coalesce the page mappings back to block mappings after
dirty logging is stopped, we have to firstly invalidate *all* the TLB
entries for the page mappings right before installation of the block 
entry,

because a TLB conflict abort error could occur if we can't invalidate the
TLB entries fully. We have hit this TLB conflict twice on aarch64 
software

implementation and fixed it. As this test can simulate the process from dirty
logging enabled to dirty logging stopped of a VM with block mappings,
so it can also reproduce this TLB conflict abort due to inadequate TLB
invalidation when coalescing tables.

Links about the TLB conflict abort:
https://lore.kernel.org/lkml/20201201201034.116760-3-wangyana...@huawei.com/ 



---

Change logs:

v5->v6:
- Address Andrew Jones's comments for v5 series
- Add Andrew Jones's R-b tags in some patches
- Rebased on newest kvm/queue tree
- v5: 
https://lore.kernel.org/lkml/20210323135231.24948-1-wangyana...@huawei.com/ 



v4->v5:
- Use synchronization(sem_wait) for time measurement
- Add a new patch about TEST_ASSERT(patch 4)
- Address Andrew Jones's comments for v4 series
- Add Andrew Jones's R-b tags in some patches
- v4: 
https://lore.kernel.org/lkml/20210302125751.19080-1-wangyana...@huawei.com/ 



v3->v4:
- Add a helper to get system default hugetlb page size
- Add tags of Reviewed-by of Ben in the patches
- v3: 
https://lore.kernel.org/lkml/20210301065916.11484-1-wangyana...@huawei.com/ 



v2->v3:
- Add tags of Suggested-by, Reviewed-by in the patches
- Add a generic macro to get hugetlb page sizes
- Some changes for suggestions about v2 series
- v2: 
https://lore.kernel.org/lkml/20210225055940.18748-1-wangyana...@huawei.com/ 



v1->v2:
- Add a patch to sync header files
- Add helpers to get granularity of different backing src types
- Some changes for suggestions about v1 series
- v1: 
https://lore.kernel.org/lkml/20210208090841.333724-1-wangyana...@huawei.com/ 



---

Yanan Wang (10):
   tools headers: sync headers of asm-generic/hugetlb_encode.h
   mm/hugetlb: Add a macro to get HUGETLB page sizes for mmap
   KVM: selftests: Use flag CLOCK_MONOTONIC_RAW for timing
   KVM: selftests: Print the errno besides error-string in TEST_ASSERT
   KVM: selftests: Make a generic helper to get vm guest mode strings
   KVM: selftests: Add a helper to get system configured THP page size
   KVM: selftests: Add a helper to get system default hugetlb page size
   KVM: selftests: List all hugetlb src types specified with page sizes
   KVM: selftests: Adapt vm_userspace_mem_region_add to new helpers
   KVM: selftests: Add a test for kvm page table code

  

Re: [PATCH v2] KVM: vmx: add mismatched size assertions in vmcs_check32()

2021-04-17 Thread Paolo Bonzini

On 09/04/21 04:24, lihaiwei.ker...@gmail.com wrote:

From: Haiwei Li 

Add compile-time assertions in vmcs_check32() to disallow accesses to
64-bit and 64-bit high fields via vmcs_{read,write}32().  Upper level KVM
code should never do partial accesses to VMCS fields.  KVM handles the
split accesses automatically in vmcs_{read,write}64() when running as a
32-bit kernel.

Reviewed-and-tested-by: Sean Christopherson 
Signed-off-by: Haiwei Li 
---
v1 -> v2:
* Improve the changelog

  arch/x86/kvm/vmx/vmx_ops.h | 4 
  1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx_ops.h b/arch/x86/kvm/vmx/vmx_ops.h
index 692b0c3..164b64f 100644
--- a/arch/x86/kvm/vmx/vmx_ops.h
+++ b/arch/x86/kvm/vmx/vmx_ops.h
@@ -37,6 +37,10 @@ static __always_inline void vmcs_check32(unsigned long field)
  {
BUILD_BUG_ON_MSG(__builtin_constant_p(field) && ((field) & 0x6000) == 0,
 "32-bit accessor invalid for 16-bit field");
+   BUILD_BUG_ON_MSG(__builtin_constant_p(field) && ((field) & 0x6001) == 0x2000,
+"32-bit accessor invalid for 64-bit field");
+   BUILD_BUG_ON_MSG(__builtin_constant_p(field) && ((field) & 0x6001) == 0x2001,
+"32-bit accessor invalid for 64-bit high field");
	BUILD_BUG_ON_MSG(__builtin_constant_p(field) && ((field) & 0x6000) == 0x6000,
 "32-bit accessor invalid for natural width field");
  }



Queued, thanks.

paolo



Re: [PATCH 0/3] KVM: Fixes and a cleanup for coalesced MMIO

2021-04-17 Thread Paolo Bonzini

On 13/04/21 00:20, Sean Christopherson wrote:

Fix two bugs that are exposed if unregistering a device on an I/O bus fails
due to OOM.  Tack on opportunistic cleanup in the related code.
  
Sean Christopherson (3):

   KVM: Destroy I/O bus devices on unregister failure _after_ sync'ing
 SRCU
   KVM: Stop looking for coalesced MMIO zones if the bus is destroyed
   KVM: Add proper lockdep assertion in I/O bus unregister

  include/linux/kvm_host.h  |  4 ++--
  virt/kvm/coalesced_mmio.c | 19 +--
  virt/kvm/kvm_main.c   | 26 --
  3 files changed, 35 insertions(+), 14 deletions(-)



Queued, thanks.

Paolo



Re: [PATCH] KVM: Boost vCPU candidiate in user mode which is delivering interrupt

2021-04-17 Thread Paolo Bonzini

On 16/04/21 05:08, Wanpeng Li wrote:

From: Wanpeng Li 

Both the lock holder vCPU and an IPI receiver that has halted are candidates for
boost. However, the PLE handler was originally designed to deal with the
lock holder preemption problem. The Intel PLE occurs when the spinlock
waiter is in kernel mode. This assumption doesn't hold for IPI receivers:
they can be in either kernel or user mode. The vCPU candidate in user mode
will not be boosted even if it should respond to IPIs. Some benchmarks
like pbzip2, swaptions etc. do the TLB shootdown in kernel mode and most
of the time they are running in user mode. It can lead to a large number
of continuous PLE events because the IPI sender causes PLE events
repeatedly until the receiver is scheduled, while the receiver is not a
candidate for a boost.

This patch boosts the vCPU candidate in user mode which is delivering an
interrupt. We can observe the speed of pbzip2 improve 10% in a 96 vCPU
VM in an over-subscribe scenario (the host machine is a 2 socket, 48 core,
96 HT Intel CLX box). There is no performance regression for other
benchmarks like Unixbench spawn (most of the time contending a read/write
lock in kernel mode) and ebizzy (most of the time contending a read/write sem
and TLB shootdown in kernel mode).
  
+bool kvm_arch_interrupt_delivery(struct kvm_vcpu *vcpu)

+{
+   if (vcpu->arch.apicv_active && static_call(kvm_x86_dy_apicv_has_pending_interrupt)(vcpu))
+   return true;
+
+   return false;
+}


Can you reuse vcpu_dy_runnable instead of this new function?

Paolo




  bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
  {
return vcpu->arch.preempted_in_kernel;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3b06d12..5012fc4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -954,6 +954,7 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
  bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu);
  int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
  bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu);
+bool kvm_arch_interrupt_delivery(struct kvm_vcpu *vcpu);
  int kvm_arch_post_init_vm(struct kvm *kvm);
  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
  
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c

index 0a481e7..781d2db 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3012,6 +3012,11 @@ static bool vcpu_dy_runnable(struct kvm_vcpu *vcpu)
return false;
  }
  
+bool __weak kvm_arch_interrupt_delivery(struct kvm_vcpu *vcpu)

+{
+   return false;
+}
+
  void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
  {
struct kvm *kvm = me->kvm;
@@ -3045,6 +3050,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
!vcpu_dy_runnable(vcpu))
continue;
if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
+   !kvm_arch_interrupt_delivery(vcpu) &&
!kvm_arch_vcpu_in_kernel(vcpu))
continue;
if (!kvm_vcpu_eligible_for_directed_yield(vcpu))





Re: [PATCH v2] kvm/selftests: Fix race condition with dirty_log_test

2021-04-17 Thread Paolo Bonzini

On 13/04/21 23:36, Peter Xu wrote:

This patch closes this race by allowing the main thread to give the vcpu thread
chance to do a VMENTER to complete that write operation.  It's done by adding a
vcpu loop counter (must be defined as volatile as main thread will do read
loop), then the main thread can guarantee the vcpu got at least another VMENTER
by making sure the guest_vcpu_loops increases by 2.

Dirty ring does not need this since dirty_ring_last_page would already help
avoid this specific race condition.


Just a nit, the comment and commit message should mention KVM_RUN rather 
than vmentry; it's possible to be preempted many times in 
vcpu_enter_guest without making progress, but those wouldn't return to 
userspace and thus would not update guest_vcpu_loops.


Also, volatile is considered harmful even in userspace/test code[1]. 
Technically rather than volatile one should use an atomic load (even a 
relaxed one), but in practice it's okay to use volatile too *for this 
specific use* (READ_ONCE/WRITE_ONCE are volatile reads and writes as 
well).  If the selftests gained 32-bit support, one should not use
volatile because neither reads nor writes to uint64_t variables would be
guaranteed to be atomic.
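
For illustration only, here is a minimal userspace sketch of the polling
pattern being discussed, using a READ_ONCE-style volatile load on the counter
instead of declaring the variable itself volatile.  The names
(guest_vcpu_loops, wait_for_two_more_runs) are hypothetical, not the
selftest's actual code:

#include <stdint.h>

#define READ_ONCE(x) (*(const volatile typeof(x) *)&(x))

static uint64_t guest_vcpu_loops;	/* incremented by the vcpu thread after each KVM_RUN */

static void wait_for_two_more_runs(void)
{
	uint64_t start = READ_ONCE(guest_vcpu_loops);

	/* wait until the vcpu thread has returned from KVM_RUN at least twice more */
	while (READ_ONCE(guest_vcpu_loops) < start + 2)
		;	/* could also sleep briefly to avoid burning CPU */
}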


Queued, thanks.

Paolo

[1] Documentation/process/volatile-considered-harmful.rst



Re: [PATCH] doc/virt/kvm: move KVM_X86_SET_MSR_FILTER in section 8

2021-04-17 Thread Paolo Bonzini

On 16/03/21 18:08, Emanuele Giuseppe Esposito wrote:

KVM_X86_SET_MSR_FILTER is a capability, not an ioctl.
Therefore move it from section 4.97 to the new 8.31 (other capabilities).


Here and in the subject I think KVM_X86_SET_MSR_FILTER should be 
replaced by KVM_CAP_PPC_MULTITCE.


Otherwise looks good, so I queued it.

Thanks,

Paolo


To fill the gap, move KVM_X86_SET_MSR_FILTER (was 4.126) to
4.97, and shifted Xen-related ioctl (were 4.127 - 4.130) by
one place (4.126 - 4.129).

Also fixed minor typo in KVM_GET_MSR_INDEX_LIST ioctl description
(section 4.3).




Re: [PATCH] KVM: x86: Remove unused function declaration

2021-04-17 Thread Paolo Bonzini

On 06/04/21 08:35, Keqian Zhu wrote:

kvm_mmu_slot_largepage_remove_write_access() is declared but not used,
just remove it.

Signed-off-by: Keqian Zhu 
---
  arch/x86/include/asm/kvm_host.h | 2 --
  1 file changed, 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3768819693e5..9c0af0971c9f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1440,8 +1440,6 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
   const struct kvm_memory_slot *memslot);
  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
   struct kvm_memory_slot *memslot);
-void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
-   struct kvm_memory_slot *memslot);
  void kvm_mmu_zap_all(struct kvm *kvm);
  void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
  unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm);



Queued, thanks.

Paolo



Re: [PATCH 0/4] KVM: SVM: A fix and cleanups for vmcb tracking

2021-04-17 Thread Paolo Bonzini

On 06/04/21 19:18, Sean Christopherson wrote:

Belated code review for the vmcb changes that are queued for 5.13.

Sean Christopherson (4):
   KVM: SVM: Don't set current_vmcb->cpu when switching vmcb
   KVM: SVM: Drop vcpu_svm.vmcb_pa
   KVM: SVM: Add a comment to clarify what vcpu_svm.vmcb points at
   KVM: SVM: Enhance and clean up the vmcb tracking comment in
 pre_svm_run()

  arch/x86/kvm/svm/svm.c | 29 +
  arch/x86/kvm/svm/svm.h |  2 +-
  2 files changed, 14 insertions(+), 17 deletions(-)



Queued, thanks -- especially for the bug in patch 1, which avoided review.

Paolo



Re: [PATCH v2 0/8] ccp: KVM: SVM: Use stack for SEV command buffers

2021-04-17 Thread Paolo Bonzini

On 07/04/21 00:49, Sean Christopherson wrote:

This series teaches __sev_do_cmd_locked() to gracefully handle vmalloc'd
command buffers by copying _all_ incoming data pointers to an internal
buffer before sending the command to the PSP.  The SEV driver and KVM are
then converted to use the stack for all command buffers.

Tested everything except sev_ioctl_do_pek_import(), I don't know anywhere
near enough about the PSP to give it the right input.

v2:
   - Rebase to kvm/queue, commit f96be2deac9b ("KVM: x86: Support KVM VMs
 sharing SEV context").
   - Unconditionally copy @data to the internal buffer. [Christophe, Brijesh]
   - Allocate a full page for the buffer. [Brijesh]
   - Drop one set of the "!"s. [Christophe]
   - Use virt_addr_valid() instead of is_vmalloc_addr() for the temporary
 patch (definitely feel free to drop the patch if it's not worth
 backporting). [Christophe]
   - s/intput/input/. [Tom]
   - Add a patch to free "sev" if init fails.  This is not strictly
 necessary (I think; I suck horribly when it comes to the driver
 framework).   But it felt wrong to not free cmd_buf on failure, and
 even more wrong to free cmd_buf but not sev.

v1:
   - https://lkml.kernel.org/r/20210402233702.3291792-1-sea...@google.com

Sean Christopherson (8):
   crypto: ccp: Free SEV device if SEV init fails
   crypto: ccp: Detect and reject "invalid" addresses destined for PSP
   crypto: ccp: Reject SEV commands with mismatching command buffer
   crypto: ccp: Play nice with vmalloc'd memory for SEV command structs
   crypto: ccp: Use the stack for small SEV command buffers
   crypto: ccp: Use the stack and common buffer for status commands
   crypto: ccp: Use the stack and common buffer for INIT command
   KVM: SVM: Allocate SEV command structures on local stack

  arch/x86/kvm/svm/sev.c   | 262 +--
  drivers/crypto/ccp/sev-dev.c | 197 +-
  drivers/crypto/ccp/sev-dev.h |   4 +-
  3 files changed, 196 insertions(+), 267 deletions(-)



Queued, thanks.

Paolo



Re: [PATCH v2 8/8] KVM: SVM: Allocate SEV command structures on local stack

2021-04-17 Thread Paolo Bonzini

On 07/04/21 19:34, Borislav Petkov wrote:

On Wed, Apr 07, 2021 at 05:05:07PM +, Sean Christopherson wrote:

I used memset() to defer initialization until after the various sanity
checks,


I'd actually vote for that too - I don't like doing stuff which is not
going to be used. I.e., don't change what you have.


It's three votes for that then. :)

Sean, I squashed in this change

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index ec4c01807272..a4d0ca8c4710 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -1039,21 +1039,14 @@ static int sev_get_attestation_report(struct kvm *kvm, struct kvm_sev_cmd *argp)
 static int sev_send_cancel(struct kvm *kvm, struct kvm_sev_cmd *argp)
 {
struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
-   struct sev_data_send_cancel *data;
+   struct sev_data_send_cancel data;
int ret;
 
 	if (!sev_guest(kvm))

return -ENOTTY;
 
-	data = kzalloc(sizeof(*data), GFP_KERNEL);

-   if (!data)
-   return -ENOMEM;
-
-   data->handle = sev->handle;
-   ret = sev_issue_cmd(kvm, SEV_CMD_SEND_CANCEL, data, &argp->error);
-
-   kfree(data);
-   return ret;
+   data.handle = sev->handle;
+   return sev_issue_cmd(kvm, SEV_CMD_SEND_CANCEL, &data, &argp->error);
 }
 
 int svm_mem_enc_op(struct kvm *kvm, void __user *argp)


to handle SEV_CMD_SEND_CANCEL which I merged before this series.

Paolo



Re: [PATCH v2 7/8] crypto: ccp: Use the stack and common buffer for INIT command

2021-04-17 Thread Paolo Bonzini

On 07/04/21 07:20, Christophe Leroy wrote:


+    struct sev_data_init data;


struct sev_data_init data = {0, 0, 0, 0};


Having to count the number of items is suboptimal.  The alternative 
could be {} (which however is technically not standard C), {0} (a bit 
mysterious, but it works) and memset.  I kept the latter to avoid 
touching the submitter's patch too much.
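
As a stand-alone illustration of the alternatives being weighed here (the
struct below is just a stub, not the real sev_data_init):

#include <stdint.h>
#include <string.h>

struct sev_data_init_stub {	/* hypothetical stand-in */
	uint32_t flags;
	uint32_t reserved;
	uint64_t tmr_address;
	uint32_t tmr_len;
};

static void zero_init_examples(void)
{
	struct sev_data_init_stub a = { 0, 0, 0, 0 };	/* member count must be kept in sync */
	struct sev_data_init_stub b = { 0 };		/* standard C, remaining members are zeroed */
	struct sev_data_init_stub c = {};		/* GNU extension, not standard C */
	struct sev_data_init_stub d;

	memset(&d, 0, sizeof(d));			/* the option kept in the patch */
	(void)a; (void)b; (void)c; (void)d;
}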


Paolo



Re: [PATCH v2 5/8] crypto: ccp: Use the stack for small SEV command buffers

2021-04-17 Thread Paolo Bonzini

On 07/04/21 00:49, Sean Christopherson wrote:

For commands with small input/output buffers, use the local stack to
"allocate" the structures used to communicate with the PSP.   Now that
__sev_do_cmd_locked() gracefully handles vmalloc'd buffers, there's no
reason to avoid using the stack, e.g. CONFIG_VMAP_STACK=y will just work.

Signed-off-by: Sean Christopherson 


Squashing this in (inspired by Christophe's review, though not quite
matching his suggestion).

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 0f5644a3b138..246b281b6376 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -408,12 +408,11 @@ static int sev_ioctl_do_pek_csr(struct sev_issue_cmd *argp, bool writable)
if (copy_from_user(&input, (void __user *)argp->data, sizeof(input)))
return -EFAULT;
 
+	memset(&data, 0, sizeof(data));

+
/* userspace wants to query CSR length */
-   if (!input.address || !input.length) {
-   data.address = 0;
-   data.len = 0;
+   if (!input.address || !input.length)
goto cmd;
-   }
 
 	/* allocate a physically contiguous buffer to store the CSR blob */

input_address = (void __user *)input.address;




Re: [PATCH v2 1/3] x86/kvm: Don't bother __pv_cpu_mask when !CONFIG_SMP

2021-04-17 Thread Paolo Bonzini

On 09/04/21 06:18, Wanpeng Li wrote:

From: Wanpeng Li 

Enabling PV TLB shootdown when !CONFIG_SMP doesn't make sense. Let's
move it inside CONFIG_SMP. In addition, we can avoid defining and
allocating __pv_cpu_mask when !CONFIG_SMP and get rid of the 'alloc'
variable in kvm_alloc_cpumask.

Signed-off-by: Wanpeng Li 
---
v1 -> v2:
  * shuffle things around a bit more

  arch/x86/kernel/kvm.c | 118 +++---
  1 file changed, 55 insertions(+), 63 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5e78e01..224a7a1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -451,6 +451,10 @@ static void __init sev_map_percpu_data(void)
}
  }
  
+#ifdef CONFIG_SMP

+
+static DEFINE_PER_CPU(cpumask_var_t, __pv_cpu_mask);
+
  static bool pv_tlb_flush_supported(void)
  {
return (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
@@ -458,10 +462,6 @@ static bool pv_tlb_flush_supported(void)
kvm_para_has_feature(KVM_FEATURE_STEAL_TIME));
  }
  
-static DEFINE_PER_CPU(cpumask_var_t, __pv_cpu_mask);

-
-#ifdef CONFIG_SMP
-
  static bool pv_ipi_supported(void)
  {
return kvm_para_has_feature(KVM_FEATURE_PV_SEND_IPI);
@@ -574,6 +574,49 @@ static void kvm_smp_send_call_func_ipi(const struct cpumask *mask)
}
  }
  
+static void kvm_flush_tlb_others(const struct cpumask *cpumask,

+   const struct flush_tlb_info *info)
+{
+   u8 state;
+   int cpu;
+   struct kvm_steal_time *src;
+   struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_cpu_mask);
+
+   cpumask_copy(flushmask, cpumask);
+   /*
+* We have to call flush only on online vCPUs. And
+* queue flush_on_enter for pre-empted vCPUs
+*/
+   for_each_cpu(cpu, flushmask) {
+   src = &per_cpu(steal_time, cpu);
+   state = READ_ONCE(src->preempted);
+   if ((state & KVM_VCPU_PREEMPTED)) {
+   if (try_cmpxchg(&src->preempted, &state,
+   state | KVM_VCPU_FLUSH_TLB))
+   __cpumask_clear_cpu(cpu, flushmask);
+   }
+   }
+
+   native_flush_tlb_others(flushmask, info);
+}
+
+static __init int kvm_alloc_cpumask(void)
+{
+   int cpu;
+
+   if (!kvm_para_available() || nopv)
+   return 0;
+
+   if (pv_tlb_flush_supported() || pv_ipi_supported())
+   for_each_possible_cpu(cpu) {
+   zalloc_cpumask_var_node(per_cpu_ptr(&__pv_cpu_mask, cpu),
+   GFP_KERNEL, cpu_to_node(cpu));
+   }
+
+   return 0;
+}
+arch_initcall(kvm_alloc_cpumask);
+
  static void __init kvm_smp_prepare_boot_cpu(void)
  {
/*
@@ -611,33 +654,8 @@ static int kvm_cpu_down_prepare(unsigned int cpu)
local_irq_enable();
return 0;
  }
-#endif
-
-static void kvm_flush_tlb_others(const struct cpumask *cpumask,
-   const struct flush_tlb_info *info)
-{
-   u8 state;
-   int cpu;
-   struct kvm_steal_time *src;
-   struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_cpu_mask);
-
-   cpumask_copy(flushmask, cpumask);
-   /*
-* We have to call flush only on online vCPUs. And
-* queue flush_on_enter for pre-empted vCPUs
-*/
-   for_each_cpu(cpu, flushmask) {
-   src = &per_cpu(steal_time, cpu);
-   state = READ_ONCE(src->preempted);
-   if ((state & KVM_VCPU_PREEMPTED)) {
-   if (try_cmpxchg(&src->preempted, &state,
-   state | KVM_VCPU_FLUSH_TLB))
-   __cpumask_clear_cpu(cpu, flushmask);
-   }
-   }
  
-	native_flush_tlb_others(flushmask, info);

-}
+#endif
  
  static void __init kvm_guest_init(void)

  {
@@ -653,12 +671,6 @@ static void __init kvm_guest_init(void)
pv_ops.time.steal_clock = kvm_steal_clock;
}
  
-	if (pv_tlb_flush_supported()) {

-   pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
-   pv_ops.mmu.tlb_remove_table = tlb_remove_table;
-   pr_info("KVM setup pv remote TLB flush\n");
-   }
-
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
apic_set_eoi_write(kvm_guest_apic_eoi_write);
  
@@ -668,6 +680,12 @@ static void __init kvm_guest_init(void)

}
  
  #ifdef CONFIG_SMP

+   if (pv_tlb_flush_supported()) {
+   pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
+   pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+   pr_info("KVM setup pv remote TLB flush\n");
+   }
+
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
if (pv_sched_yield_supported()) {
smp_ops.send_call_func_ipi = kvm_smp_send_call_func_ipi;
@@ -734,7 +752,7 @@ static uint32_t __init kvm_detect(void)
  
  static void __init kvm_apic_init(void)

Re: linux-next: Fixes tag needs some work in the kvm tree

2021-04-16 Thread Paolo Bonzini

On 16/04/21 14:38, Christian Borntraeger wrote:

On 16.04.21 14:27, Stephen Rothwell wrote:

Hi all,

In commit

   c3171e94cc1c ("KVM: s390: VSIE: fix MVPG handling for prefixing and 
MSO")


Fixes tag

   Fixes: bdf7509bbefa ("s390/kvm: VSIE: correctly handle MVPG when in 
VSIE")


has these problem(s):

   - Subject does not match target commit subject
 Just use
git log -1 --format='Fixes: %h ("%s")'


Hmm, this has been sitting in kvms390/next for some time now. Is this a 
new check?




Maybe you just missed it when it was reported for kvms390?

https://www.spinics.net/lists/linux-next/msg59652.html

The SHA1 is stable now so it's too late.  It's not a big deal I guess.

Paolo



Re: [PATCH v2] tools: do not include scripts/Kbuild.include

2021-04-16 Thread Paolo Bonzini

On 16/04/21 15:26, Christian Borntraeger wrote:



On 16.04.21 15:00, Masahiro Yamada wrote:

Since commit d9f4ff50d2aa ("kbuild: spilt cc-option and friends to
scripts/Makefile.compiler"), some kselftests fail to build.

The tools/ directory opted out Kbuild, and went in a different
direction. They copy any kind of files to the tools/ directory
in order to do whatever they want in their world.

tools/build/Build.include mimics scripts/Kbuild.include, but some
tool Makefiles included the Kbuild one to import a feature that is
missing in tools/build/Build.include:

  - Commit ec04aa3ae87b ("tools/thermal: tmon: use "-fstack-protector"
    only if supported") included scripts/Kbuild.include from
    tools/thermal/tmon/Makefile to import the cc-option macro.

  - Commit c2390f16fc5b ("selftests: kvm: fix for compilers that do
    not support -no-pie") included scripts/Kbuild.include from
    tools/testing/selftests/kvm/Makefile to import the try-run macro.

  - Commit 9cae4ace80ef ("selftests/bpf: do not ignore clang
    failures") included scripts/Kbuild.include from
    tools/testing/selftests/bpf/Makefile to import the .DELETE_ON_ERROR
    target.

  - Commit 0695f8bca93e ("selftests/powerpc: Handle Makefile for
    unrecognized option") included scripts/Kbuild.include from
    tools/testing/selftests/powerpc/pmu/ebb/Makefile to import the
    try-run macro.

Copy what they need into tools/build/Build.include, and make them
include it instead of scripts/Kbuild.include.

Link: 
https://lore.kernel.org/lkml/86dadf33-70f7-a5ac-cb8c-64966d2f4...@linux.ibm.com/ 

Fixes: d9f4ff50d2aa ("kbuild: spilt cc-option and friends to 
scripts/Makefile.compiler")

Reported-by: Janosch Frank 
Reported-by: Christian Borntraeger 
Signed-off-by: Masahiro Yamada 


looks better.
Tested-by: Christian Borntraeger 



Thank you very much Masahiro, this look great.

Paolo



Re: [PATCH v2 3/7] KVM: x86: hyper-v: Move the remote TLB flush logic out of vmx

2021-04-16 Thread Paolo Bonzini

On 16/04/21 10:36, Vitaly Kuznetsov wrote:

- Create a dedicated set of files, e.g. 'kvmonhyperv.[ch]' (I also
thought about 'hyperv_host.[ch]' but then I realized it's equally
misleading as one can read this as 'KVM is acting as Hyper-V host').

Personally, I'd vote for the latter. Besides eliminating confusion, the
benefit of having dedicated files is that we can avoid compiling them
completely when !IS_ENABLED(CONFIG_HYPERV) (#ifdefs in C are ugly).


Indeed.  For the file, kvm-on-hv.[ch] can do.

Paolo



Re: [PATCH v3 5/8] docs: vcpu-requests.rst: fix reference for atomic ops

2021-04-15 Thread Paolo Bonzini

On 09/04/21 14:47, Mauro Carvalho Chehab wrote:

Changeset f0400a77ebdc ("atomic: Delete obsolete documentation")
got rid of atomic_ops.rst, pointing that this was superseded by
Documentation/atomic_*.txt.

Update its reference accordingly.

Fixes: f0400a77ebdc ("atomic: Delete obsolete documentation")
Signed-off-by: Mauro Carvalho Chehab 
---
  Documentation/virt/kvm/vcpu-requests.rst | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/vcpu-requests.rst 
b/Documentation/virt/kvm/vcpu-requests.rst
index 5feb3706a7ae..5f8798e7fdf8 100644
--- a/Documentation/virt/kvm/vcpu-requests.rst
+++ b/Documentation/virt/kvm/vcpu-requests.rst
@@ -302,6 +302,6 @@ VCPU returns from the call.
  References
  ==
  
-.. [atomic-ops] Documentation/core-api/atomic_ops.rst

+.. [atomic-ops] Documentation/atomic_bitops.txt and Documentation/atomic_t.txt
  .. [memory-barriers] Documentation/memory-barriers.txt
  .. [lwn-mb] https://lwn.net/Articles/573436/



Acked-by: Paolo Bonzini 



Re: [PATCH v2 0/8] ccp: KVM: SVM: Use stack for SEV command buffers

2021-04-15 Thread Paolo Bonzini

On 07/04/21 20:00, Tom Lendacky wrote:

For the series:

Acked-by: Tom Lendacky


Shall I take this as a request (or permission, whatever :)) to merge it 
through the KVM tree?


Paolo



Re: [PATCH 2/2] tools: do not include scripts/Kbuild.include

2021-04-15 Thread Paolo Bonzini

On 15/04/21 10:04, Masahiro Yamada wrote:

On Thu, Apr 15, 2021 at 4:40 PM Paolo Bonzini  wrote:

I think it would make sense to add try-run, cc-option and
.DELETE_ON_ERROR to tools/build/Build.include?


To be safe, I just copy-pasted what the makefiles need.
If someone wants to refactor the tool build system, that is fine,
but, to me, I do not see consistent rules or policy under tools/.


"Please put this in a common file instead of introducing duplication" is 
not asking for wholesale refactoring.


Paolo



Re: [PATCH 2/2] tools: do not include scripts/Kbuild.include

2021-04-15 Thread Paolo Bonzini

On 15/04/21 09:27, Masahiro Yamada wrote:

Since commit d9f4ff50d2aa ("kbuild: spilt cc-option and friends to
scripts/Makefile.compiler"), some kselftests fail to build.

The tools/ directory opted out Kbuild, and went in a different
direction. They copy any kind of files to the tools/ directory
in order to do whatever they want to do in their world.

tools/build/Build.include mimics scripts/Kbuild.include, but some
tool Makefiles included the Kbuild one to import a feature that is
missing in tools/build/Build.include:

  - Commit ec04aa3ae87b ("tools/thermal: tmon: use "-fstack-protector"
only if supported") included scripts/Kbuild.include from
tools/thermal/tmon/Makefile to import the cc-option macro.

  - Commit c2390f16fc5b ("selftests: kvm: fix for compilers that do
not support -no-pie") included scripts/Kbuild.include from
tools/testing/selftests/kvm/Makefile to import the try-run macro.

  - Commit 9cae4ace80ef ("selftests/bpf: do not ignore clang
failures") included scripts/Kbuild.include from
tools/testing/selftests/bpf/Makefile to import the .DELETE_ON_ERROR
target.

  - Commit 0695f8bca93e ("selftests/powerpc: Handle Makefile for
unrecognized option") included scripts/Kbuild.include from
tools/testing/selftests/powerpc/pmu/ebb/Makefile to import the
try-run macro.

Copy what they want there, and stop including scripts/Kbuild.include
from the tool Makefiles.


I think it would make sense to add try-run, cc-option and 
.DELETE_ON_ERROR to tools/build/Build.include?


Paolo


Link: 
https://lore.kernel.org/lkml/86dadf33-70f7-a5ac-cb8c-64966d2f4...@linux.ibm.com/
Fixes: d9f4ff50d2aa ("kbuild: spilt cc-option and friends to 
scripts/Makefile.compiler")
Reported-by: Janosch Frank 
Reported-by: Christian Borntraeger 
Signed-off-by: Masahiro Yamada 
---

  tools/testing/selftests/bpf/Makefile  |  3 ++-
  tools/testing/selftests/kvm/Makefile  | 12 +++-
  .../selftests/powerpc/pmu/ebb/Makefile| 11 ++-
  tools/thermal/tmon/Makefile   | 19 +--
  4 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 044bfdcf5b74..d872b9f41543 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -1,5 +1,4 @@
  # SPDX-License-Identifier: GPL-2.0
-include ../../../../scripts/Kbuild.include
  include ../../../scripts/Makefile.arch
  include ../../../scripts/Makefile.include
  
@@ -476,3 +475,5 @@ EXTRA_CLEAN := $(TEST_CUSTOM_PROGS) $(SCRATCH_DIR) $(HOST_SCRATCH_DIR)	\

prog_tests/tests.h map_tests/tests.h verifier/tests.h   \
feature \
$(addprefix $(OUTPUT)/,*.o *.skel.h no_alu32 bpf_gcc bpf_testmod.ko)
+
+.DELETE_ON_ERROR:
diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index a6d61f451f88..8b45bc417d83 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -1,5 +1,15 @@
  # SPDX-License-Identifier: GPL-2.0-only
-include ../../../../scripts/Kbuild.include
+
+TMPOUT = .tmp_
+
+try-run = $(shell set -e;  \
+   TMP=$(TMPOUT)/tmp;  \
+   mkdir -p $(TMPOUT); \
+   trap "rm -rf $(TMPOUT)" EXIT; \
+   if ($(1)) >/dev/null 2>&1;\
+   then echo "$(2)"; \
+   else echo "$(3)"; \
+   fi)
  
  all:
  
diff --git a/tools/testing/selftests/powerpc/pmu/ebb/Makefile b/tools/testing/selftests/powerpc/pmu/ebb/Makefile

index af3df79d8163..d5d3e869df93 100644
--- a/tools/testing/selftests/powerpc/pmu/ebb/Makefile
+++ b/tools/testing/selftests/powerpc/pmu/ebb/Makefile
@@ -1,5 +1,4 @@
  # SPDX-License-Identifier: GPL-2.0
-include ../../../../../../scripts/Kbuild.include
  
  noarg:

$(MAKE) -C ../../
@@ -8,6 +7,16 @@ noarg:
  CFLAGS += -m64
  
  TMPOUT = $(OUTPUT)/TMPDIR/

+
+try-run = $(shell set -e;  \
+   TMP=$(TMPOUT)/tmp;  \
+   mkdir -p $(TMPOUT); \
+   trap "rm -rf $(TMPOUT)" EXIT; \
+   if ($(1)) >/dev/null 2>&1;\
+   then echo "$(2)"; \
+   else echo "$(3)"; \
+   fi)
+
  # Toolchains may build PIE by default which breaks the assembly
  no-pie-option := $(call try-run, echo 'int main() { return 0; }' | \
  $(CC) -Werror $(KBUILD_CPPFLAGS) $(CC_OPTION_CFLAGS) -no-pie -x c - -o 
"$$TMP", -no-pie)
diff --git a/tools/thermal/tmon/Makefile b/tools/thermal/tmon/Makefile
index 59e417ec3e13..92a683e4866c 100644
--- a/tools/thermal/tmon/Makefile
+++ b/tools/thermal/tmon/Makefile
@@ -1,6 +1,21 @@
  # SPDX-License-Identifier: GPL-2.0
-# We need this for the "cc-option" macro.
-include ../../../scripts/Kbuild.include
+
+TMPOUT = .tmp_
+
+try-run = $(shell set -e;  \
+   TMP=$(TMPOUT)/tmp;  \
+   mkdir -p 

Re: [PATCH 2/2] KVM: x86: Fix split-irqchip vs interrupt injection window request

2021-04-15 Thread Paolo Bonzini

On 15/04/21 02:59, Lai Jiangshan wrote:

The next call to inject_pending_event() will reach here AT FIRST with
vcpu->arch.exception.injected==false and vcpu->arch.exception.pending==false


  ... if (!vcpu->arch.exception.pending) {
  if (vcpu->arch.nmi_injected) {
  static_call(kvm_x86_set_nmi)(vcpu);
  can_inject = false;
  } else if (vcpu->arch.interrupt.injected) {
  static_call(kvm_x86_set_irq)(vcpu);
  can_inject = false;


And it comes here, and vcpu->arch.interrupt.injected is true because there is
an interrupt queued by KVM_INTERRUPT for a pure userspace irqchip. It then does
the injection of the interrupt without checking EFLAGS.IF.


Ok, understood now.  Yeah, that could be a problem for userspace irqchip 
so we should switch it to use pending_external_vector instead.  Are you 
going to write the patch or should I?


Thanks!

Paolo


My question is what stops the next call to inject_pending_event()
from reaching here when KVM_INTERRUPT is called with an exception pending.




Re: [PATCH 2/2] KVM: x86: Fix split-irqchip vs interrupt injection window request

2021-04-14 Thread Paolo Bonzini

On 14/04/21 04:28, Lai Jiangshan wrote:

On Tue, Apr 13, 2021 at 8:15 PM Paolo Bonzini  wrote:


On 13/04/21 13:03, Lai Jiangshan wrote:

This patch claims that it has a place to
stash the IRQ when EFLAGS.IF=0, but inject_pending_event() seems to ignore
EFLAGS.IF and queues the IRQ to the guest directly in the first branch
of using "kvm_x86_ops.set_irq(vcpu)".


This is only true for pure-userspace irqchip.  For split-irqchip, in
which case the "place to stash" the interrupt is
vcpu->arch.pending_external_vector.

For pure-userspace irqchip, KVM_INTERRUPT only cares about being able to
stash the interrupt in vcpu->arch.interrupt.injected.  It is indeed
wrong for userspace to call KVM_INTERRUPT if the vCPU is not ready for
interrupt injection, but KVM_INTERRUPT does not return an error.


Thanks for the reply.

May I ask what is the correct/practical way of using KVM_INTERRUPT ABI
for pure-userspace irqchip.

gVisor is indeed a pure-userspace irqchip, it will call KVM_INTERRUPT
when kvm_run->ready_for_interrupt_injection=1 (along with other conditions
unrelated to our discussion).

https://github.com/google/gvisor/blob/a9441aea2780da8c93da1c73da860219f98438de/pkg/sentry/platform/kvm/bluepill_amd64_unsafe.go#L105

if kvm_run->ready_for_interrupt_injection=1 when an exception is pending or
EFLAGS.IF=0, it would be unexpected for gVisor.


Not with EFLAGS.IF=0.  For pending exception, there is code to handle it 
in inject_pending_event:


... if (!vcpu->arch.exception.pending) {
if (vcpu->arch.nmi_injected) {
static_call(kvm_x86_set_nmi)(vcpu);
can_inject = false;
} else if (vcpu->arch.interrupt.injected) {
static_call(kvm_x86_set_irq)(vcpu);
can_inject = false;
}
}
...
if (vcpu->arch.exception.pending) {
...
can_inject = false;
}
// this is vcpu->arch.interrupt.injected for userspace LAPIC
if (kvm_cpu_has_injectable_intr(vcpu)) {
r = can_inject ? 
static_call(kvm_x86_interrupt_allowed)(vcpu, true) : -EBUSY;

if (r < 0)
goto busy;
...
}

so what happens is:

- the interrupt will not be injected before the exception

- KVM will schedule an immediate vmexit to inject the interrupt as well

- if (as is likely) the exception has turned off interrupts, the next 
call to inject_pending_event will reach 
static_call(kvm_x86_enable_irq_window) and the interrupt will only be 
injected when IF becomes 1 again.


Paolo



[GIT PULL] KVM changes for 5.12-rc8 or final

2021-04-13 Thread Paolo Bonzini
Linus,

The following changes since commit 315f02c60d9425b38eb8ad7f21b8a35e40db23f9:

  KVM: x86/mmu: preserve pending TLB flush across calls to kvm_tdp_mmu_zap_sp 
(2021-04-08 07:48:18 -0400)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus

for you to fetch changes up to 04c4f2ee3f68c9a4bf1653d15f1a9a435ae33f7a:

  KVM: VMX: Don't use vcpu->run->internal.ndata as an array index (2021-04-13 
18:23:41 -0400)


Fix for a possible out-of-bounds access.


Reiji Watanabe (1):
  KVM: VMX: Don't use vcpu->run->internal.ndata as an array index

 arch/x86/kvm/vmx/vmx.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)



Re: [PATCH] doc/virt/kvm: move KVM_X86_SET_MSR_FILTER in section 8

2021-04-13 Thread Paolo Bonzini

On 13/04/21 23:20, Jonathan Corbet wrote:

Emanuele Giuseppe Esposito  writes:


KVM_X86_SET_MSR_FILTER is a capability, not an ioctl.
Therefore move it from section 4.97 to the new 8.31 (other capabilities).

To fill the gap, move KVM_X86_SET_MSR_FILTER (was 4.126) to
4.97, and shifted Xen-related ioctl (were 4.127 - 4.130) by
one place (4.126 - 4.129).

Also fixed minor typo in KVM_GET_MSR_INDEX_LIST ioctl description
(section 4.3).

Signed-off-by: Emanuele Giuseppe Esposito 
---
  Documentation/virt/kvm/api.rst | 250 -
  1 file changed, 125 insertions(+), 125 deletions(-)


Paolo, what's your thought on this one?  If it's OK should I pick it up?


I missed the patch, I'll queue it up for 5.13.

Paolo


Thanks,

jon



diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 1a2b5210cdbf..a230140d6a7f 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -201,7 +201,7 @@ Errors:
  
== 

EFAULT the msr index list cannot be read from or written to
-  E2BIG  the msr index list is to be to fit in the array specified by
+  E2BIG  the msr index list is too big to fit in the array specified by
   the user.
== 
  
@@ -3686,31 +3686,105 @@ which is the maximum number of possibly pending cpu-local interrupts.
  
  Queues an SMI on the thread's vcpu.
  
-4.97 KVM_CAP_PPC_MULTITCE

--
+4.97 KVM_X86_SET_MSR_FILTER
+
  
-:Capability: KVM_CAP_PPC_MULTITCE

-:Architectures: ppc
-:Type: vm
+:Capability: KVM_X86_SET_MSR_FILTER
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_msr_filter
+:Returns: 0 on success, < 0 on error
  
-This capability means the kernel is capable of handling hypercalls

-H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
-space. This significantly accelerates DMA operations for PPC KVM guests.
-User space should expect that its handlers for these hypercalls
-are not going to be called if user space previously registered LIOBN
-in KVM (via KVM_CREATE_SPAPR_TCE or similar calls).
+::
  
-In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,

-user space might have to advertise it for the guest. For example,
-IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is
-present in the "ibm,hypertas-functions" device-tree property.
+  struct kvm_msr_filter_range {
+  #define KVM_MSR_FILTER_READ  (1 << 0)
+  #define KVM_MSR_FILTER_WRITE (1 << 1)
+   __u32 flags;
+   __u32 nmsrs; /* number of msrs in bitmap */
+   __u32 base;  /* MSR index the bitmap starts at */
+   __u8 *bitmap; /* a 1 bit allows the operations in flags, 0 denies */
+  };
  
-The hypercalls mentioned above may or may not be processed successfully

-in the kernel based fast path. If they can not be handled by the kernel,
-they will get passed on to user space. So user space still has to have
-an implementation for these despite the in kernel acceleration.
+  #define KVM_MSR_FILTER_MAX_RANGES 16
+  struct kvm_msr_filter {
+  #define KVM_MSR_FILTER_DEFAULT_ALLOW (0 << 0)
+  #define KVM_MSR_FILTER_DEFAULT_DENY  (1 << 0)
+   __u32 flags;
+   struct kvm_msr_filter_range ranges[KVM_MSR_FILTER_MAX_RANGES];
+  };
  
-This capability is always enabled.

+flags values for ``struct kvm_msr_filter_range``:
+
+``KVM_MSR_FILTER_READ``
+
+  Filter read accesses to MSRs using the given bitmap. A 0 in the bitmap
+  indicates that a read should immediately fail, while a 1 indicates that
+  a read for a particular MSR should be handled regardless of the default
+  filter action.
+
+``KVM_MSR_FILTER_WRITE``
+
+  Filter write accesses to MSRs using the given bitmap. A 0 in the bitmap
+  indicates that a write should immediately fail, while a 1 indicates that
+  a write for a particular MSR should be handled regardless of the default
+  filter action.
+
+``KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE``
+
+  Filter both read and write accesses to MSRs using the given bitmap. A 0
+  in the bitmap indicates that both reads and writes should immediately fail,
+  while a 1 indicates that reads and writes for a particular MSR are not
+  filtered by this range.
+
+flags values for ``struct kvm_msr_filter``:
+
+``KVM_MSR_FILTER_DEFAULT_ALLOW``
+
+  If no filter range matches an MSR index that is getting accessed, KVM will
+  fall back to allowing access to the MSR.
+
+``KVM_MSR_FILTER_DEFAULT_DENY``
+
+  If no filter range matches an MSR index that is getting accessed, KVM will
+  fall back to rejecting access to the MSR. In this mode, all MSRs that should
+  be processed by KVM need to explicitly be marked as allowed in the bitmaps.
+
+This ioctl allows user space to define up to 16 bitmaps of MSR ranges to
+specify whether a certain MSR access should be explicitly filtered for 

Re: [PATCH 2/2] KVM: x86: Fix split-irqchip vs interrupt injection window request

2021-04-13 Thread Paolo Bonzini

On 13/04/21 13:03, Lai Jiangshan wrote:

This patch claims that it has a place to
stash the IRQ when EFLAGS.IF=0, but inject_pending_event() seems to ignore
EFLAGS.IF and queues the IRQ to the guest directly in the first branch
of using "kvm_x86_ops.set_irq(vcpu)".


This is only true for pure-userspace irqchip.  For split-irqchip, in 
which case the "place to stash" the interrupt is 
vcpu->arch.pending_external_vector.


For pure-userspace irqchip, KVM_INTERRUPT only cares about being able to 
stash the interrupt in vcpu->arch.interrupt.injected.  It is indeed 
wrong for userspace to call KVM_INTERRUPT if the vCPU is not ready for 
interrupt injection, but KVM_INTERRUPT does not return an error.


Ignoring the fact that this would be incorrect use of the API, are you 
saying that the incorrect injection was not possible before this patch?


Paolo



Re: [PATCH 2/2] KVM: x86: Fix split-irqchip vs interrupt injection window request

2021-04-13 Thread Paolo Bonzini

On 12/04/21 23:43, Sean Christopherson wrote:

where kvm_arch_interrupt_allowed() checks EFLAGS.IF (and an edge case related to
nested virtualization).  KVM also captures EFLAGS.IF in vcpu->run->if_flag.
For whatever reason, QEMU checks both vcpu->run flags before injecting an IRQ,
maybe to handle a case where QEMU itself clears EFLAGS.IF?


It's mostly obsolete code (that will be deprecated in the next version 
and removed in about a year) so I wouldn't read much into it.  if_flag 
itself is obsolete; it is not provided by SEV-ES, and a useless subset 
of ready_for_interrupt_injection.


Paolo



Re: [RFC v2] KVM: x86: Support KVM VMs sharing SEV context

2021-04-09 Thread Paolo Bonzini

On 09/04/21 03:18, James Bottomley wrote:

If you want to share ASIDs you have to share the firmware that the
running VM has been attested to.  Once the VM moves from LAUNCH to
RUNNING, the PSP won't allow the VMM to inject any more firmware or do
any more attestations.


I think Steve is suggesting to just change the RIP of the mirror VM, 
which would work for SEV but not SEV-ES (the RAM migration helper won't 
*suffice* for SEV-ES, but perhaps you could use the PSP to migrate the 
VMSA and the migration helper for the rest?).


If you want to use a single firmware binary, SEC does almost no I/O 
accesses (the exception being the library constructor from 
SourceLevelDebugPkg's SecPeiDebugAgentLib), so you probably can:


- detect the migration helper hardware in PlatformPei, either from 
fw_cfg or based on the lack of it


- either divert execution to the migration helper through 
gEfiEndOfPeiSignalPpiGuid, or if it's too late add a new boot mode and 
PPI to DxeLoadCore.


Paolo


What you mirror after this point can thus only
contain what has already been measured or what the guest added.  This
is why we think there has to be a new entry path into the VM for the
mirror vCPU.




Re: [PATCH v2 07/17] KVM: x86/mmu: Check PDPTRs before allocating PAE roots

2021-04-08 Thread Paolo Bonzini

On 08/04/21 18:27, Sean Christopherson wrote:

For your approach, can we put the out label after the success path?  Setting
mmu->root_pgd isn't wrong per se, but doing so might mislead future readers into
thinking that it's functionally necessary.


Indeed, thanks for the speedy review.  I'll get it queued tomorrow.

Paolo



Re: [PATCH] KVM: vmx: add mismatched size in vmcs_check32

2021-04-08 Thread Paolo Bonzini

On 08/04/21 18:05, Sean Christopherson wrote:

   Add compile-time assertions in vmcs_check32() to disallow accesses to
   64-bit and 64-bit high fields via vmcs_{read,write}32().  Upper level
   KVM code should never do partial accesses to VMCS fields.  KVM handles
   the split accesses automatically in vmcs_{read,write}64() when running
   as a 32-bit kernel.


KVM also uses raw vmread/vmwrite (__vmcs_readl/__vmcs_writel) when 
copying to and from the shadow VMCS, so that path will not go through 
vmcs_check32 either.


Paolo



Re: [PATCH] KVM: SVM: Make sure GHCB is mapped before updating

2021-04-08 Thread Paolo Bonzini

On 08/04/21 18:04, Tom Lendacky wrote:

+   if (!err || !sev_es_guest(vcpu->kvm) || !WARN_ON_ONCE(svm->ghcb))

This should be WARN_ON_ONCE(!svm->ghcb), otherwise you'll get the right
result, but get a stack trace immediately.

Doh, yep.

Actually, because of the "or's", this needs to be:

if (!err || !sev_es_guest(vcpu->kvm) || (sev_es_guest(vcpu->kvm) && WARN_ON_ONCE(!svm->ghcb)))


No, || cuts the right-hand side if the left-hand side is true.  So:

- if err == 0, the rest is not evaluated

- if !sev_es_guest(vcpu->kvm), WARN_ON_ONCE(!svm->ghcb) is not evaluated
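
A trivial stand-alone demonstration of the short-circuit rule (nothing
KVM-specific; side_effect() is just a placeholder):

#include <stdio.h>

static int side_effect(const char *what)
{
	printf("evaluated %s\n", what);
	return 1;
}

int main(void)
{
	int err = 0;

	/*
	 * With err == 0 the left-hand side of || is true, so the
	 * right-hand side is never evaluated and nothing is printed
	 * by side_effect() before the message below.
	 */
	if (!err || side_effect("right-hand side"))
		printf("condition taken without evaluating the right-hand side\n");
	return 0;
}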

Paolo



Re: [PATCH v2 07/17] KVM: x86/mmu: Check PDPTRs before allocating PAE roots

2021-04-08 Thread Paolo Bonzini

On 08/04/21 17:48, Sean Christopherson wrote:

Freaking PDPTRs.  I was really hoping we could keep the lock and 
pages_available()
logic outside of the helpers.  What if kvm_mmu_load() reads the PDPTRs and
passes them into mmu_alloc_shadow_roots()?  Or is that too ugly?


The patch I have posted (though untested) tries to do that in a slightly 
less ugly way by pushing make_mmu_pages_available down to mmu_alloc_*_roots.


Paolo


diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index efb41f31e80a..e3c4938cd665 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3275,11 +3275,11 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 return 0;
  }

-static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
+static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu, u64 pdptrs[4])
  {
 struct kvm_mmu *mmu = vcpu->arch.mmu;
-   u64 pdptrs[4], pm_mask;
 gfn_t root_gfn, root_pgd;
+   u64 pm_mask;
 hpa_t root;
 int i;

@@ -3291,11 +3291,8 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)

 if (mmu->root_level == PT32E_ROOT_LEVEL) {
 for (i = 0; i < 4; ++i) {
-   pdptrs[i] = mmu->get_pdptr(vcpu, i);
-   if (!(pdptrs[i] & PT_PRESENT_MASK))
-   continue;
-
-   if (mmu_check_root(vcpu, pdptrs[i] >> PAGE_SHIFT))
+   if ((pdptrs[i] & PT_PRESENT_MASK) &&
+   mmu_check_root(vcpu, pdptrs[i] >> PAGE_SHIFT))
 return 1;
 }
 }
@@ -4844,21 +4841,33 @@ EXPORT_SYMBOL_GPL(kvm_mmu_reset_context);

  int kvm_mmu_load(struct kvm_vcpu *vcpu)
  {
-   int r;
+   struct kvm_mmu *mmu = vcpu->arch.mmu;
+   u64 pdptrs[4];
+   int r, i;

-   r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->direct_map);
+   r = mmu_topup_memory_caches(vcpu, !mmu->direct_map);
 if (r)
 goto out;
 r = mmu_alloc_special_roots(vcpu);
 if (r)
 goto out;
+
+   /*
+* On SVM, reading PDPTRs might access guest memory, which might fault
+* and thus might sleep.  Grab the PDPTRs before acquiring mmu_lock.
+*/
+   if (!mmu->direct_map && mmu->root_level == PT32E_ROOT_LEVEL) {
+   for (i = 0; i < 4; ++i)
+   pdptrs[i] = mmu->get_pdptr(vcpu, i);
+   }
+
write_lock(&vcpu->kvm->mmu_lock);
 if (make_mmu_pages_available(vcpu))
 r = -ENOSPC;
 else if (vcpu->arch.mmu->direct_map)
 r = mmu_alloc_direct_roots(vcpu);
 else
-   r = mmu_alloc_shadow_roots(vcpu);
+   r = mmu_alloc_shadow_roots(vcpu, pdptrs);
write_unlock(&vcpu->kvm->mmu_lock);
 if (r)
 goto out;





Re: [PATCH 0/4] Add support for XMM fast hypercalls

2021-04-08 Thread Paolo Bonzini

On 08/04/21 17:40, Siddharth Chandrasekaran wrote:

Although the Hyper-v TLFS mentions that a guest cannot use this feature
unless the hypervisor advertises support for it, some hypercalls which
we plan on upstreaming in the future use them anyway.

No, please don't do this. Check the feature bit(s) before you issue
hypercalls which rely on the extended interface.

Perhaps Siddharth should clarify this, but I read it as Hyper-V being
buggy and using XMM arguments unconditionally.

The guest is at fault here as it expects Hyper-V to consume arguments
from XMM registers for certain hypercalls (that we are working) even if
we didn't expose the feature via CPUID bits.


What guest is that?

Paolo



Re: [PATCH 6/7] KVM: SVM: hyper-v: Enlightened MSR-Bitmap support

2021-04-08 Thread Paolo Bonzini

On 07/04/21 16:41, Vineeth Pillai wrote:
  
+#if IS_ENABLED(CONFIG_HYPERV)

+static inline void hv_vmcb_dirty_nested_enlightenments(struct kvm_vcpu *vcpu)
+{
+   struct vmcb *vmcb = to_svm(vcpu)->vmcb;
+
+   /*
+* vmcb can be NULL if called during early vcpu init.
+* And its okay not to mark vmcb dirty during vcpu init
+* as we mark it dirty unconditionally towards end of vcpu
+* init phase.
+*/
+   if (vmcb && vmcb_is_clean(vmcb, VMCB_HV_NESTED_ENLIGHTENMENTS) &&
+   vmcb->hv_enlightenments.hv_enlightenments_control.msr_bitmap)
+   vmcb_mark_dirty(vmcb, VMCB_HV_NESTED_ENLIGHTENMENTS);
+}


In addition to what Vitaly said, can svm->vmcb really be NULL?  If so it 
might be better to reorder initializations and skip the NULL check.


Paolo



Re: [PATCH 4/7] KVM: SVM: hyper-v: Nested enlightenments in VMCB

2021-04-08 Thread Paolo Bonzini

On 07/04/21 16:41, Vineeth Pillai wrote:

+#define VMCB_ALL_CLEAN_MASK (__CLEAN_MASK | (1U << VMCB_HV_NESTED_ENLIGHTENMENTS))
+#else
+#define VMCB_ALL_CLEAN_MASK __CLEAN_MASK
+#endif


I think this should depend on whether KVM is running on top of Hyper-V; 
not on whether KVM is *compiled* with Hyper-V support.


So you should turn VMCB_ALL_CLEAN_MASK into a __read_mostly variable.
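
Roughly along these lines (a sketch only; hv_running_on_hyperv() stands in
for whatever runtime check gets used, e.g. something based on ms_hyperv, and
the placement of the init hook is up to you):

static u32 __read_mostly vmcb_all_clean_mask = __CLEAN_MASK;

static __init void svm_hv_init_clean_mask(void)
{
	/* only extend the mask when KVM is actually running on Hyper-V */
	if (IS_ENABLED(CONFIG_HYPERV) && hv_running_on_hyperv())
		vmcb_all_clean_mask |= (1U << VMCB_HV_NESTED_ENLIGHTENMENTS);
}

static inline void vmcb_mark_all_clean(struct vmcb *vmcb)
{
	vmcb->control.clean = vmcb_all_clean_mask & ~VMCB_ALWAYS_DIRTY_MASK;
}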

Paolo


  /* TPR and CR2 are always written before VMRUN */
  #define VMCB_ALWAYS_DIRTY_MASK((1U << VMCB_INTR) | (1U << VMCB_CR2))
  
@@ -230,7 +251,7 @@ static inline void vmcb_mark_all_dirty(struct vmcb *vmcb)
  
  static inline void vmcb_mark_all_clean(struct vmcb *vmcb)

  {
-   vmcb->control.clean = ((1 << VMCB_DIRTY_MAX) - 1)
+   vmcb->control.clean = VMCB_ALL_CLEAN_MASK
   & ~VMCB_ALWAYS_DIRTY_MASK;
  }




Re: [PATCH 0/4] Add support for XMM fast hypercalls

2021-04-08 Thread Paolo Bonzini

On 08/04/21 17:28, Wei Liu wrote:

Although the Hyper-v TLFS mentions that a guest cannot use this feature
unless the hypervisor advertises support for it, some hypercalls which
we plan on upstreaming in the future use them anyway.


No, please don't do this. Check the feature bit(s) before you issue
hypercalls which rely on the extended interface.


Perhaps Siddharth should clarify this, but I read it as Hyper-V being 
buggy and using XMM arguments unconditionally.


Paolo



Re: [PATCH 3/4] KVM: x86: kvm_hv_flush_tlb use inputs from XMM registers

2021-04-08 Thread Paolo Bonzini

On 08/04/21 14:01, Vitaly Kuznetsov wrote:


Also, we can probably defer kvm_hv_hypercall_read_xmm() until we know
how many regs we actually need to not read them all (we will always
need xmm[0] I guess so we can as well read it here).


The cost is get/put FPU, so I think there's not much to gain from that.

Paolo



Re: [PATCH 1/2] KVM: x86: reduce pvclock_gtod_sync_lock critical sections

2021-04-08 Thread Paolo Bonzini

On 08/04/21 14:00, Marcelo Tosatti wrote:


KVM_REQ_MCLOCK_INPROGRESS is only needed to kick running vCPUs out of the
execution loop;

We do not want vcpus with different system_timestamp/tsc_timestamp
pair:

  * To avoid that problem, do not allow visibility of distinct
  * system_timestamp/tsc_timestamp values simultaneously: use a master
  * copy of host monotonic time values. Update that master copy
  * in lockstep.

So KVM_REQ_MCLOCK_INPROGRESS also ensures that no vcpu enters
guest mode (via vcpu->requests check before VM-entry) with a
different system_timestamp/tsc_timestamp pair.


Yes this is what KVM_REQ_MCLOCK_INPROGRESS does, but it does not have to 
be done that way.  All you really need is the IPI with KVM_REQUEST_WAIT, 
which ensures that updates happen after the vCPUs have exited guest 
mode.  You don't need to loop on vcpu->requests for example, because 
kvm_guest_time_update could just spin on pvclock_gtod_sync_lock until 
pvclock_update_vm_gtod_copy is done.


So this morning I tried protecting the kvm->arch fields for kvmclock 
using a seqcount, which is nice also because get_kvmclock_ns() does not 
have to bounce the cacheline of pvclock_gtod_sync_lock anymore.  I'll 
post it tomorrow or next week.
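
For reference, the pattern I have in mind looks roughly like this (the
pvclock_sc field and the __get_kvmclock_ns() helper are illustrative names,
not the final patch):

/* writer side, still serialized by pvclock_gtod_sync_lock */
write_seqcount_begin(&kvm->arch.pvclock_sc);
pvclock_update_vm_gtod_copy(kvm);	/* refreshes the master clock copy */
write_seqcount_end(&kvm->arch.pvclock_sc);

/* lock-free reader, e.g. in get_kvmclock_ns() */
unsigned int seq;
u64 ns;

do {
	seq = read_seqcount_begin(&kvm->arch.pvclock_sc);
	ns = __get_kvmclock_ns(kvm);	/* illustrative helper */
} while (read_seqcount_retry(&kvm->arch.pvclock_sc, seq));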


Paolo



[GIT PULL] KVM changes for 5.12-rc7

2021-04-08 Thread Paolo Bonzini
Linus,

The following changes since commit 55626ca9c6909d077eca71bccbe15fef6e5ad917:

  selftests: kvm: Check that TSC page value is small after KVM_SET_CLOCK(0) 
(2021-04-01 05:14:19 -0400)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus

for you to fetch changes up to 315f02c60d9425b38eb8ad7f21b8a35e40db23f9:

  KVM: x86/mmu: preserve pending TLB flush across calls to kvm_tdp_mmu_zap_sp 
(2021-04-08 07:48:18 -0400)


A lone x86 patch, for a bug found while developing a backport to
stable versions.


Paolo Bonzini (1):
  KVM: x86/mmu: preserve pending TLB flush across calls to 
kvm_tdp_mmu_zap_sp

 arch/x86/kvm/mmu/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



Re: [PATCH v2 07/17] KVM: x86/mmu: Check PDPTRs before allocating PAE roots

2021-04-08 Thread Paolo Bonzini

On 08/04/21 13:15, Wanpeng Li wrote:

I saw this splatting:

  BUG: sleeping function called from invalid context at
arch/x86/kvm/kvm_cache_regs.h:115
   kvm_pdptr_read+0x20/0x60 [kvm]
   kvm_mmu_load+0x3bd/0x540 [kvm]

There is a might_sleep() in kvm_pdptr_read(), however, the original
commit didn't explain more. I can send a formal one if the below fix
is acceptable.


I think we can just push make_mmu_pages_available down into
kvm_mmu_load's callees.  This way it's not necessary to hold the lock
until after the PDPTR check:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0d92a269c5fa..f92c3695bfeb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3244,6 +3244,12 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
u8 shadow_root_level = mmu->shadow_root_level;
hpa_t root;
unsigned i;
+   int r;
+
+   write_lock(&vcpu->kvm->mmu_lock);
+   r = make_mmu_pages_available(vcpu);
+   if (r < 0)
+   goto out_unlock;
 
 	if (is_tdp_mmu_enabled(vcpu->kvm)) {

root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
@@ -3266,13 +3272,16 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
mmu->root_hpa = __pa(mmu->pae_root);
} else {
WARN_ONCE(1, "Bad TDP root level = %d\n", shadow_root_level);
-   return -EIO;
+   r = -EIO;
}
 
+out_unlock:

+   write_unlock(&vcpu->kvm->mmu_lock);
+
/* root_pgd is ignored for direct MMUs. */
mmu->root_pgd = 0;
 
-	return 0;

+   return r;
 }
 
 static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)

@@ -3282,6 +3291,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
gfn_t root_gfn, root_pgd;
hpa_t root;
int i;
+   int r;
 
 	root_pgd = mmu->get_guest_pgd(vcpu);

root_gfn = root_pgd >> PAGE_SHIFT;
@@ -3300,6 +3310,11 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
}
}
 
+	write_lock(&vcpu->kvm->mmu_lock);

+   r = make_mmu_pages_available(vcpu);
+   if (r < 0)
+   goto out_unlock;
+
/*
 * Do we shadow a long mode page table? If so we need to
 * write-protect the guests page table root.
@@ -3308,7 +3323,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
root = mmu_alloc_root(vcpu, root_gfn, 0,
  mmu->shadow_root_level, false);
mmu->root_hpa = root;
-   goto set_root_pgd;
+   goto out_unlock;
}
 
 	if (WARN_ON_ONCE(!mmu->pae_root))

@@ -3350,7 +3365,8 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
else
mmu->root_hpa = __pa(mmu->pae_root);
 
-set_root_pgd:

+out_unlock:
+   write_unlock(&vcpu->kvm->mmu_lock);
mmu->root_pgd = root_pgd;
 
 	return 0;

@@ -4852,14 +4868,10 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
r = mmu_alloc_special_roots(vcpu);
if (r)
goto out;
-   write_lock(&vcpu->kvm->mmu_lock);
-   if (make_mmu_pages_available(vcpu))
-   r = -ENOSPC;
-   else if (vcpu->arch.mmu->direct_map)
+   if (vcpu->arch.mmu->direct_map)
r = mmu_alloc_direct_roots(vcpu);
else
r = mmu_alloc_shadow_roots(vcpu);
-   write_unlock(&vcpu->kvm->mmu_lock);
if (r)
goto out;
 


Paolo



Re: [PATCH v2] KVM: Explicitly use GFP_KERNEL_ACCOUNT for 'struct kvm_vcpu' allocations

2021-04-08 Thread Paolo Bonzini

On 06/04/21 21:07, Sean Christopherson wrote:

Use GFP_KERNEL_ACCOUNT when allocating vCPUs to make it more obvious
that the allocations are accounted, to make it easier to audit KVM's
allocations in the future, and to be consistent with other cache usage in
KVM.

When using SLAB/SLUB, this is a nop as the cache itself is created with
SLAB_ACCOUNT.

When using SLOB, there are caveats within caveats.  SLOB doesn't honor
SLAB_ACCOUNT, so passing GFP_KERNEL_ACCOUNT will result in vCPU
allocations now being accounted.   But, even that depends on internal
SLOB details as SLOB will only go to the page allocator when its cache is
depleted.  That just happens to be extremely likely for vCPUs because the
size of kvm_vcpu is larger than a page for almost all combinations of
architecture and page size.  Whether or not the SLOB behavior is by
design is unknown; it's just as likely that no SLOB users care about
accounting and so no one has bothered to implement support in SLOB.
Regardless, accounting vCPU allocations will not break SLOB+KVM+cgroup
users, if any exist.

Cc: Wanpeng Li 
Signed-off-by: Sean Christopherson 
---

v2: Drop the Fixes tag and rewrite the changelog since this is a nop when
 using SLUB or SLAB. [Wanpeng]

  virt/kvm/kvm_main.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0a481e7780f0..580f98386b42 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3192,7 +3192,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 
id)
if (r)
goto vcpu_decrement;
  
-	vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL);

+   vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT);
if (!vcpu) {
r = -ENOMEM;
goto vcpu_decrement;



Queued, thanks.

Paolo



Re: [PATCH 3/3] mm: unexport follow_pfn

2021-04-08 Thread Paolo Bonzini

On 08/04/21 12:05, Daniel Vetter wrote:

On Mon, Mar 29, 2021 at 10:31:01AM -0300, Jason Gunthorpe wrote:

On Tue, Mar 16, 2021 at 04:33:03PM +0100, Daniel Vetter wrote:

Both kvm (in bd2fae8da794 ("KVM: do not assume PTE is writable after
follow_pfn")) and vfio (in 07956b6269d3 ("vfio/type1: Use
follow_pte()")) have lost their callsites of follow_pfn(). All the
other ones have been switched over to unsafe_follow_pfn because they
cannot be fixed without breaking userspace api.

Arguably the vfio code is still racy, but that's kinda a bigger


vfio and kvm


Hm I thought kvm is non-racy due to the mmu notifier catch races?


No, but the plan is indeed to have some struct for each page that uses 
follow_pfn and update it from the MMU notifiers.


Paolo




picture. But since it does leak the pte beyond where it drops the pt
lock, without anything else like an mmu notifier guaranteeing
coherence, the problem is at least clearly visible in the vfio code.
So good enough with me.

I've decided to keep the explanation that after dropping the pt lock
you must have an mmu notifier if you keep using the pte somehow by
adjusting it and moving it into the kerneldoc for the new follow_pte()
function.

Cc: 3...@google.com
Cc: Jann Horn 
Cc: Paolo Bonzini 
Cc: Jason Gunthorpe 
Cc: Cornelia Huck 
Cc: Peter Xu 
Cc: Alex Williamson 
Cc: linux...@kvack.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-samsung-...@vger.kernel.org
Cc: linux-me...@vger.kernel.org
Cc: k...@vger.kernel.org
Signed-off-by: Daniel Vetter 
---
  include/linux/mm.h |  2 --
  mm/memory.c| 26 +-
  mm/nommu.c | 13 +
  3 files changed, 6 insertions(+), 35 deletions(-)


Reviewed-by: Jason Gunthorpe 


Thanks for your r-b tags, I'll add them.
-Daniel



Jason






Re: [PATCH 1/2] KVM: x86: reduce pvclock_gtod_sync_lock critical sections

2021-04-08 Thread Paolo Bonzini

On 07/04/21 19:40, Marcelo Tosatti wrote:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fe806e894212..0a83eff40b43 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2562,10 +2562,12 @@ static void kvm_gen_update_masterclock(struct kvm *kvm)
  
  	kvm_hv_invalidate_tsc_page(kvm);
  
-	spin_lock(&ka->pvclock_gtod_sync_lock);
 	kvm_make_mclock_inprogress_request(kvm);
+

Might be good to serialize against two kvm_gen_update_masterclock
callers? Otherwise one caller could clear KVM_REQ_MCLOCK_INPROGRESS,
while the other is still at pvclock_update_vm_gtod_copy().


Makes sense, but this stuff has always seemed unnecessarily complicated 
to me.


KVM_REQ_MCLOCK_INPROGRESS is only needed to kick running vCPUs out of 
the execution loop; clearing it in kvm_gen_update_masterclock is 
unnecessary, because KVM_REQ_CLOCK_UPDATE takes pvclock_gtod_sync_lock 
too and thus will already wait for pvclock_update_vm_gtod_copy to end.


I think it's possible to use a seqcount in KVM_REQ_CLOCK_UPDATE instead 
of KVM_REQ_MCLOCK_INPROGRESS.  Both cause the vCPUs to spin. I'll take a 
look.


Paolo
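
To make the seqcount idea concrete, a rough sketch follows; it is not actual KVM code, the names are invented, and writers are assumed to be serialized by an existing lock such as pvclock_gtod_sync_lock.  Readers simply retry until the update is complete, which is what would make the explicit KVM_REQ_MCLOCK_INPROGRESS handshake unnecessary:

#include <linux/seqlock.h>
#include <linux/types.h>

static seqcount_t masterclock_seq = SEQCNT_ZERO(masterclock_seq);
static u64 master_kernel_ns;
static u64 master_cycle_now;

/* Writer: callers must hold the lock that serializes updates. */
static void masterclock_update(u64 kernel_ns, u64 cycle_now)
{
	write_seqcount_begin(&masterclock_seq);
	master_kernel_ns = kernel_ns;
	master_cycle_now = cycle_now;
	write_seqcount_end(&masterclock_seq);
}

/* Reader: spins (retries) while an update is in progress. */
static void masterclock_read(u64 *kernel_ns, u64 *cycle_now)
{
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&masterclock_seq);
		*kernel_ns = master_kernel_ns;
		*cycle_now = master_cycle_now;
	} while (read_seqcount_retry(&masterclock_seq, seq));
}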



Re: [PATCH 5.10 096/126] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map

2021-04-06 Thread Paolo Bonzini

On 06/04/21 21:44, Sasha Levin wrote:

On Tue, Apr 06, 2021 at 08:28:27PM +0200, Paolo Bonzini wrote:
If a patch doesn't (more or less trivially) apply, the maintainer 
should take action.  Distro maintainers can also jump in and post the 
backport to subsystem mailing lists.  If the stable kernel loses a 
patch because a maintainer doesn't have the time to do a backport, 
it's not the end of the world.


This quickly went from a "world class" to "fuck it".


Known bugs are better than unknown bugs.  If something is reported on 
4.19 and the stable@ backports were only done up to 5.4 because the 
backport was a bit more messy, it's okay.  If a user comes up with a 
weird bug that I've never seen and that it's caused by a botched 
backport, it's much worse.


In this specific case we're talking of 5.10; but in many cases users of 
really old kernels, let's say 4.4-4.9 at this point, won't care about 
having *all* bugfixes.  Some newer (and thus more buggy) features may be 
so distant from the mainline in those kernels, or so immature, that no 
one will consider them while staying on such an old kernel.


Again, kernel necrophilia pays my bills, so I have some experience there. :)


It's understandable that maintainers don't have all the time in the
world for this, but are you really asking not to backport fixes to
stable trees because you don't have the time for it and don't want
anyone else to do it instead?


If a bug is serious I *will* do the backport; I literally did this 
specific backport on the first working day after the failure report. 
But not all bugs are created equal and neither are all stable@-worthy 
bugs.  I try to use the "Fixes" tag correctly, but sometimes a bug that 
*technically* is 10-years-old may not be worthwhile or even possible to 
fix even in 5.4.


That said... one thing that would be really, really awesome would be a 
website where you navigate a Linux checkout and for each directory you 
can choose to get a list of stable patches that were Cc-ed to stable@ 
and failed to apply.  A pipedream maybe, but also a great internship 
project. :)


Paolo


Maybe consider designating someone who knows the subsystem well and does
have time for this?




Re: [PATCH] KVM: x86/mmu: preserve pending TLB flush across calls to kvm_tdp_mmu_zap_sp

2021-04-06 Thread Paolo Bonzini

On 06/04/21 20:25, Greg KH wrote:

On Tue, Apr 06, 2021 at 12:25:50PM -0400, Paolo Bonzini wrote:

Right now, if a call to kvm_tdp_mmu_zap_sp returns false, the caller
will skip the TLB flush, which is wrong.  There are two ways to fix
it:

- since kvm_tdp_mmu_zap_sp will not yield and therefore will not flush
   the TLB itself, we could change the call to kvm_tdp_mmu_zap_sp to
   use "flush |= ..."

- or we can chain the flush argument through kvm_tdp_mmu_zap_sp down
   to __kvm_tdp_mmu_zap_gfn_range.

This patch does the former to simplify application to stable kernels.

Cc: sea...@google.com
Fixes: 048f49809c526 ("KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX 
zapping")
Cc:  # 5.10.x: 048f49809c: KVM: x86/mmu: Ensure TLBs 
are flushed for TDP MMU during NX zapping
Cc:  # 5.10.x: 33a3164161: KVM: x86/mmu: Don't allow 
TDP MMU to yield when recovering NX pages
Cc: 
Signed-off-by: Paolo Bonzini 
---
  arch/x86/kvm/mmu/mmu.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)


Is this for only the stable kernels, or is it addressed toward upstream
merges?

Confused,


It's for upstream.  I'll include it (with the expected "[ Upstream 
commit abcd ]" header) when I post the complete backport.  I'll send 
this patch to Linus as soon as I get a review even if I don't have 
anything else in the queue, so (as a general idea) the full backport 
should be sent and tested on Thursday-Friday.


Paolo



Re: [PATCH 5.10 096/126] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map

2021-04-06 Thread Paolo Bonzini

On 06/04/21 20:01, Sasha Levin wrote:

On Tue, Apr 06, 2021 at 05:48:50PM +0200, Paolo Bonzini wrote:

On 06/04/21 15:49, Sasha Levin wrote:

Yup. Is there anything wrong with those patches?


The big issue, and the one that you ignored every time we discuss 
this topic, is that this particular subset of 17 has AFAIK never been 
tested by anyone.


Few of the CI systems that run on stable(-rc) releases run
kvm-unit-tests, which passed. So yes, this was tested.


The code didn't run unless those CI systems set the module parameter 
that gates the experimental code (they don't).  A compile test is better 
than nothing I guess, but code that didn't run cannot be considered 
tested.  Again, I don't expect that anyone would notice a botched 
backport to 5.10 or 5.11 of this code, but that isn't an excuse for a 
poor process.



Right, I looked at what needed to be backported, took it back to 5.4,
and ran kvm-unit-tests on it. 


I guess that's a typo since the code was added in 5.10, but anyway...


What other hoops should we jump through so we won't need to "hope"
anymore?


... you should jump through _fewer_ hoops.  You are not expected to know 
the status of the code in each and every subsystem, not even Linus does.


If a patch doesn't (more or less trivially) apply, the maintainer should 
take action.  Distro maintainers can also jump in and post the backport 
to subsystem mailing lists.  If the stable kernel loses a patch because 
a maintainer doesn't have the time to do a backport, it's not the end of 
the world.


Paolo



[PATCH] KVM: x86/mmu: preserve pending TLB flush across calls to kvm_tdp_mmu_zap_sp

2021-04-06 Thread Paolo Bonzini
Right now, if a call to kvm_tdp_mmu_zap_sp returns false, the caller
will skip the TLB flush, which is wrong.  There are two ways to fix
it:

- since kvm_tdp_mmu_zap_sp will not yield and therefore will not flush
  the TLB itself, we could change the call to kvm_tdp_mmu_zap_sp to
  use "flush |= ..."

- or we can chain the flush argument through kvm_tdp_mmu_zap_sp down
  to __kvm_tdp_mmu_zap_gfn_range.

This patch does the former to simplify application to stable kernels.

Cc: sea...@google.com
Fixes: 048f49809c526 ("KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during 
NX zapping")
Cc:  # 5.10.x: 048f49809c: KVM: x86/mmu: Ensure TLBs 
are flushed for TDP MMU during NX zapping
Cc:  # 5.10.x: 33a3164161: KVM: x86/mmu: Don't allow 
TDP MMU to yield when recovering NX pages
Cc: 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 486aa94ecf1d..951dae4e7175 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5906,7 +5906,7 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
  lpage_disallowed_link);
WARN_ON_ONCE(!sp->lpage_disallowed);
if (is_tdp_mmu_page(sp)) {
-   flush = kvm_tdp_mmu_zap_sp(kvm, sp);
+   flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
} else {
kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
WARN_ON_ONCE(sp->lpage_disallowed);
-- 
2.26.2
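
The pattern the one-character change preserves is, roughly, the following; a hedged sketch with an invented helper name, not the actual kvm_recover_nx_lpages() body.  The flush requirement is accumulated across iterations and honored once at the end, so a plain assignment could silently drop a flush that an earlier iteration requested:

static void zap_shadow_pages(struct kvm *kvm, struct kvm_mmu_page **pages,
			     int nr_pages)
{
	bool flush = false;
	int i;

	for (i = 0; i < nr_pages; i++)
		flush |= kvm_tdp_mmu_zap_sp(kvm, pages[i]);	/* "=" would lose earlier requests */

	if (flush)
		kvm_flush_remote_tlbs(kvm);
}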



Re: [PATCH 5.10 096/126] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map

2021-04-06 Thread Paolo Bonzini

On 06/04/21 15:49, Sasha Levin wrote:

On Tue, Apr 06, 2021 at 08:09:26AM +0200, Paolo Bonzini wrote:
Whoa no, you have included basically a whole new feature, except for 
the final patch that actually enables the feature.  The whole new MMU 


Right, we would usually grab dependencies rather than modifying the
patch. It means we diverge less with upstream, and custom backports tend
to be buggier than just grabbing dependencies.


In general I can't disagree.  However, you *are* modifying at least 
commit 048f49809c in your backport, so it's not clear where you draw the 
line and why you didn't ask for help (more on this below).


Only the first five patches here are actual prerequisites for an easy 
backport of the two commits that actually matter (a835429cda91, "KVM: 
x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap"; 
and 048f49809c52, "KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU 
during NX zapping", 2021-03-30).  Everything after "KVM: x86/mmu: Yield 
in TDU MMU iter even if no SPTES changed" could be dropped without 
making it any harder to backport those two.


is still not meant to be used in production and development is still 
happening as of 5.13.


Unrelated to this discussion, but how are folks supposed to know which
feature can and which feature can't be used in production? If it's a
released kernel, in theory anyone can pick up 5.12 and use it in
production.


It's not enabled by default and requires turning on a module parameter.

That also means that something like CKI will not notice if anything's 
wrong with these patches.  It also means that I could just shrug and 
hope that no one ever runs this code in 5.10 and 5.11 (which is actually 
the most likely case), but it is the process that is *just wrong*.


Were all these patches (82-97) included just to enable patch 98 ("KVM: 
x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping")? Same 
for 105-120 in 5.11.


Yup. Is there anything wrong with those patches?


The big issue, and the one that you ignored every time we discuss this 
topic, is that this particular subset of 17 has AFAIK never been tested 
by anyone.


There's plenty of locking changes in here, one patch that you didn't 
backport has this in its commit message:


   This isn't technically a bug fix in the current code [...] but that
   is all very, very subtle, and will break at the slightest sneeze,

meaning that the locking in 5.10 and 5.11 was also less robust to 
changes elsewhere in the code.


Let's also talk about the process and the timing.  I got the "failed to 
apply" automated message last Friday and I was going to work on the 
backport today since yesterday was a holiday here.  I was *never* CCed 
on a post of this backport for maintainers to review; you guys 
*literally* took random subsets of patches from a feature that is new 
and in active development, and hoped that they worked on a past release.


I could be happy because you just provided me with a perfect example of 
why to use my employer's franken-kernel instead of upstream stable 
kernels... ;) but this is not how a world-class operating system is 
developed.  Who cares if a VM breaks or even if my laptop panics; but 
I'd seriously fear for my data if you applied the same attitude to XFS 
or ext4fs.


For now, please drop all 17 patches from 5.10 and 5.11.  I'll send a 
tested backport as soon as possible.


Paolo



Re: [PATCH v11 00/13] Add AMD SEV guest live migration support

2021-04-06 Thread Paolo Bonzini

On 05/04/21 20:27, Steve Rutherford wrote:

On Mon, Apr 5, 2021 at 8:17 AM Peter Gonda  wrote:


Could this patch set include support for the SEND_CANCEL command?


That's separate from this patchset. I sent up an implementation last week.


And I was going to queue it today.

Paolo



Re: [PATCH v11 10/13] KVM: x86: Introduce new KVM_FEATURE_SEV_LIVE_MIGRATION feature & Custom MSR.

2021-04-06 Thread Paolo Bonzini

On 06/04/21 15:26, Ashish Kalra wrote:

It's a little unintuitive to see KVM_MSR_RET_FILTERED here, since
userspace can make this happen on its own without having an entry in
this switch statement (by setting it in the msr filter bitmaps). When
using MSR filters, I would only expect to get MSR filter exits for
MSRs I specifically asked for.

Not a huge deal, just a little unintuitive. I'm not sure other options
are much better (you could put KVM_MSR_RET_INVALID, or you could just
not have these entries in svm_{get,set}_msr).


Actually KVM_MSR_RET_FILTERED seems more logical to use, especially in
comparison with KVM_MSR_RET_INVALID.

Also, hooking this MSR in svm_{get,set}_msr allows some in-kernel error
pre-processing before doing the pass-through to userspace.


I agree that it should be up to userspace to set up the filter since we 
now have that functionality.


Let me read the whole threads for the past versions to see what the 
objections were...


Paolo
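
For reference, the userspace side of "set up the filter" is roughly the following; a hedged sketch, not part of the series, which assumes KVM_CAP_X86_USER_SPACE_MSR has been enabled with KVM_MSR_EXIT_REASON_FILTER so that denied accesses exit to userspace instead of injecting #GP:

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

/* Deny-list a single MSR so that guest accesses to it exit to userspace. */
static int trap_one_msr(int vm_fd, __u32 msr_index)
{
	__u8 bitmap = 0x00;			/* bit clear = access denied */
	struct kvm_msr_filter filter;

	memset(&filter, 0, sizeof(filter));
	filter.flags = KVM_MSR_FILTER_DEFAULT_ALLOW;
	filter.ranges[0].flags = KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE;
	filter.ranges[0].base = msr_index;
	filter.ranges[0].nmsrs = 1;
	filter.ranges[0].bitmap = &bitmap;

	return ioctl(vm_fd, KVM_X86_SET_MSR_FILTER, &filter);
}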



Re: [PATCH 2/4] KVM: MIPS: rework flush_shadow_* callbacks into one that prepares the flush

2021-04-06 Thread Paolo Bonzini

On 06/04/21 13:39, Thomas Bogendoerfer wrote:

On Tue, Apr 06, 2021 at 08:05:40AM +0200, Paolo Bonzini wrote:

On 06/04/21 03:36, Huacai Chen wrote:

I tried the merge and it will be enough for Linus to remove
arch/mips/kvm/trap_emul.c.  So I will leave it as is, but next time I'd
prefer KVM MIPS changes to go through either my tree or a common topic
branch.

Emmm, the TE removal series is done by Thomas, not me.:)


Sure, sorry if the sentence sounded like it was directed to you.  No matter
who wrote the code, synchronization between trees is only the maintainers'
task. :)


Sorry about the mess. I'll leave arch/mips/kvm to go via your tree then.


No problem.  In this specific case, at some point I'll pull from 
mips-next, prior to sending the pull request to Linus.  It will look 
just like other KVM submaintainer pulls.


Paolo



Re: [PATCH 5.10 096/126] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map

2021-04-06 Thread Paolo Bonzini

On 05/04/21 10:54, Greg Kroah-Hartman wrote:

From: Ben Gardon 

[ Upstream commit 9a77daacc87dee9fd63e31243f21894132ed8407 ]

To prepare for handling page faults in parallel, change the TDP MMU
page fault handler to use atomic operations to set SPTEs so that changes
are not lost if multiple threads attempt to modify the same SPTE.

Reviewed-by: Peter Feiner 
Signed-off-by: Ben Gardon 

Message-Id: <20210202185734.1680553-21-bgar...@google.com>
[Document new locking rules. - Paolo]
Signed-off-by: Paolo Bonzini 
Signed-off-by: Sasha Levin 
---
  Documentation/virt/kvm/locking.rst |   9 +-
  arch/x86/include/asm/kvm_host.h|  13 +++
  arch/x86/kvm/mmu/tdp_mmu.c | 142 ++---
  3 files changed, 130 insertions(+), 34 deletions(-)

diff --git a/Documentation/virt/kvm/locking.rst 
b/Documentation/virt/kvm/locking.rst
index b21a34c34a21..0aa4817b466d 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -16,7 +16,14 @@ The acquisition orders for mutexes are as follows:
  - kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
them together is quite rare.
  
-On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.

+On x86:
+
+- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock
+
+- kvm->arch.mmu_lock is an rwlock.  kvm->arch.tdp_mmu_pages_lock is
+  taken inside kvm->arch.mmu_lock, and cannot be taken without already
+  holding kvm->arch.mmu_lock (typically with ``read_lock``, otherwise
+  there's no need to take kvm->arch.tdp_mmu_pages_lock at all).
  
  Everything else is a leaf: no other lock is taken inside the critical

  sections.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 02d4c74d30e2..47cd8f9b3fe7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1014,6 +1014,19 @@ struct kvm_arch {
struct list_head tdp_mmu_roots;
/* List of struct tdp_mmu_pages not being used as roots */
struct list_head tdp_mmu_pages;
+
+   /*
+* Protects accesses to the following fields when the MMU lock
+* is held in read mode:
+*  - tdp_mmu_pages (above)
+*  - the link field of struct kvm_mmu_pages used by the TDP MMU
+*  - lpage_disallowed_mmu_pages
+*  - the lpage_disallowed_link field of struct kvm_mmu_pages used
+*by the TDP MMU
+* It is acceptable, but not necessary, to acquire this lock when
+* the thread holds the MMU lock in write mode.
+*/
+   spinlock_t tdp_mmu_pages_lock;
  };
  
  struct kvm_vm_stat {

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 14d69c01c710..eb38f74af3f2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -7,6 +7,7 @@
  #include "tdp_mmu.h"
  #include "spte.h"
  
+#include <asm/cmpxchg.h>
 #include <trace/events/kvm.h>
  
  #ifdef CONFIG_X86_64

@@ -33,6 +34,7 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
kvm->arch.tdp_mmu_enabled = true;
  
 	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
+	spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
 	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
  }
  
@@ -225,7 +227,8 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)

  }
  
  static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,

-   u64 old_spte, u64 new_spte, int level);
+   u64 old_spte, u64 new_spte, int level,
+   bool shared);
  
  static int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)

  {
@@ -267,17 +270,26 @@ static void handle_changed_spte_dirty_log(struct kvm 
*kvm, int as_id, gfn_t gfn,
   *
   * @kvm: kvm instance
   * @sp: the new page
+ * @shared: This operation may not be running under the exclusive use of
+ * the MMU lock and the operation must synchronize with other
+ * threads that might be adding or removing pages.
   * @account_nx: This page replaces a NX large page and should be marked for
   *eventual reclaim.
   */
  static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp,
- bool account_nx)
+ bool shared, bool account_nx)
  {
-	lockdep_assert_held_write(&kvm->mmu_lock);
+	if (shared)
+		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+	else
+		lockdep_assert_held_write(&kvm->mmu_lock);
 
 	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);

if (account_nx)
account_huge_nx_page(kvm, sp);
+
+   if (shared)
+		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
  }
  
  /**

@@ -285,14 +297,24 @@ static void tdp_mmu_link_page(struct kvm *kvm, struct 
kvm_mmu_page *sp,
   *
   * @kvm: kvm instance
   * @sp: the page to be removed
+ * @shared: This operation may not be running under the exclusive use of
+ * the MMU lock and the operati
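
The locking rule documented in the hunks above boils down to the nesting shown below; a rough sketch, not a real KVM function, with field names taken from the quoted code.  The new spinlock is only needed while the MMU lock is held for read, and is always taken inside it:

static void add_tdp_mmu_page_shared(struct kvm *kvm, struct kvm_mmu_page *sp)
{
	read_lock(&kvm->mmu_lock);			/* MMU lock in read (shared) mode */

	spin_lock(&kvm->arch.tdp_mmu_pages_lock);	/* always nested inside mmu_lock */
	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

	read_unlock(&kvm->mmu_lock);
}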

Re: [PATCH 2/4] KVM: MIPS: rework flush_shadow_* callbacks into one that prepares the flush

2021-04-06 Thread Paolo Bonzini

On 06/04/21 03:36, Huacai Chen wrote:

I tried the merge and it will be enough for Linus to remove
arch/mips/kvm/trap_emul.c.  So I will leave it as is, but next time I'd
prefer KVM MIPS changes to go through either my tree or a common topic
branch.

Emmm, the TE removal series is done by Thomas, not me.:)


Sure, sorry if the sentence sounded like it was directed to you.  No 
matter who wrote the code, synchronization between trees is only the 
maintainers' task. :)


Paolo



Re: [PATCH 2/4] KVM: MIPS: rework flush_shadow_* callbacks into one that prepares the flush

2021-04-03 Thread Paolo Bonzini

On 03/04/21 04:31, Huacai Chen wrote:

Hi, Paolo,

TE mode has been removed in the MIPS tree, can we also remove it in
KVM tree before this rework?


I tried the merge and it will be enough for Linus to remove 
arch/mips/kvm/trap_emul.c.  So I will leave it as is, but next time I'd 
prefer KVM MIPS changes to go through either my tree or a common topic 
branch.


Paolo



Re: [PATCH 2/4] KVM: MIPS: rework flush_shadow_* callbacks into one that prepares the flush

2021-04-03 Thread Paolo Bonzini

On 03/04/21 04:31, Huacai Chen wrote:

Hi, Paolo,

TE mode has been removed in the MIPS tree, can we also remove it in
KVM tree before this rework?


Fortunately I can pull the exact commit that was applied to the MIPS 
tree, as it was the first patch that was applied to the tree, but next 
time please send KVM changes through the KVM tree.


Paolo



Re: [PATCH v2 0/9] KVM: my debug patch queue

2021-04-02 Thread Paolo Bonzini

On 01/04/21 15:54, Maxim Levitsky wrote:

Hi!

I would like to publish two debug features which were needed for other stuff
I work on.

One is the reworked lx-symbols script which now actually works on at least
gdb 9.1 (gdb 9.2 was reported to fail to load the debug symbols from the kernel
for some reason, not related to this patch) and upstream qemu.


Queued patches 2-5 for now.  Patch 6 is okay but it needs a selftest
(e.g. using KVM_VCPU_SET_EVENTS) and the correct name for the constant.


Paolo


The other feature is the ability to trap all guest exceptions (on SVM for now)
and see them in kvmtrace prior to potential merge to double/triple fault.

This can be very useful and I already had to manually patch KVM a few
times for this.
I will, once time permits, implement this feature on Intel as well.

V2:

  * Some more refactoring and workarounds for lx-symbols script

  * added a KVM_GUESTDBG_BLOCKEVENTS flag to enable 'block interrupts on
single step', together with a KVM_CAP_SET_GUEST_DEBUG2 capability
to indicate which guest debug flags are supported (a userspace usage
sketch follows the diffstat below).

This is a replacement for the unconditional blocking of interrupts on
single step that was done in the previous version of this patch set.
Patches to qemu to use that feature will be sent soon.

  * Reworked the 'intercept all exceptions for debug' feature according
to the review feedback:

- renamed the parameter that enables the feature and
  moved it to common kvm module.
  (only SVM part is currently implemented though)

- disable the feature for SEV guests as was suggested during the review
- made the vmexit table const again, as was suggested in the review as well.

Best regards,
Maxim Levitsky

Maxim Levitsky (9):
   scripts/gdb: rework lx-symbols gdb script
   KVM: introduce KVM_CAP_SET_GUEST_DEBUG2
   KVM: x86: implement KVM_CAP_SET_GUEST_DEBUG2
   KVM: aarch64: implement KVM_CAP_SET_GUEST_DEBUG2
   KVM: s390x: implement KVM_CAP_SET_GUEST_DEBUG2
   KVM: x86: implement KVM_GUESTDBG_BLOCKEVENTS
   KVM: SVM: split svm_handle_invalid_exit
   KVM: x86: add force_intercept_exceptions_mask
   KVM: SVM: implement force_intercept_exceptions_mask

  Documentation/virt/kvm/api.rst|   4 +
  arch/arm64/include/asm/kvm_host.h |   4 +
  arch/arm64/kvm/arm.c  |   2 +
  arch/arm64/kvm/guest.c|   5 -
  arch/s390/include/asm/kvm_host.h  |   4 +
  arch/s390/kvm/kvm-s390.c  |   3 +
  arch/x86/include/asm/kvm_host.h   |  12 ++
  arch/x86/include/uapi/asm/kvm.h   |   1 +
  arch/x86/kvm/svm/svm.c|  87 +++--
  arch/x86/kvm/svm/svm.h|   6 +-
  arch/x86/kvm/x86.c|  14 ++-
  arch/x86/kvm/x86.h|   2 +
  include/uapi/linux/kvm.h  |   1 +
  kernel/module.c   |   8 +-
  scripts/gdb/linux/symbols.py  | 203 --
  15 files changed, 272 insertions(+), 84 deletions(-)
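
A possible userspace consumer of the capability from patch 2, sketched under the assumption that KVM_CAP_SET_GUEST_DEBUG2 (introduced by this series) returns a bitmask of the KVM_GUESTDBG_* flags the kernel supports; the helper name is made up:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Returns non-zero if every flag in 'flags' is reported as supported. */
static int guestdbg_flags_supported(int vm_fd, __u64 flags)
{
	int mask = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SET_GUEST_DEBUG2);

	return mask > 0 && (mask & flags) == flags;
}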




