Re: [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify()
On Wed, Apr 10, 2024 at 11:30 PM Andrew Morton wrote:
> On Fri, 5 Apr 2024 07:58:11 -0400 Paolo Bonzini wrote:
> > Please review! Also feel free to take the KVM patches through the mm
> > tree, as I don't expect any conflicts.
>
> It's mainly a KVM thing and the MM changes are small and simple.
> I'd say that the KVM tree would be a better home?

Sure! I'll queue them on my side then.

Paolo
Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
On Mon, Apr 8, 2024 at 3:56 PM Peter Xu wrote:
> Paolo,
>
> I may miss a bunch of details here (as I still remember some change_pte
> patches previously on the list..), however not sure whether we considered
> enable it?  Asked because I remember Andrea used to have a custom tree
> maintaining that part:
>
> https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968

The patch enables it only for KSM, so it would still require a bunch of
cleanups; for example, I would still use set_pte_at() in all the places
that are not KSM. This would at least fix the issue with the poor
documentation of where to use set_pte_at_notify() vs. set_pte_at().

With regard to the implementation, I like the idea of disabling the
invalidation on the MMU notifier side, but I would rather have
MMU_NOTIFIER_CHANGE_PTE as a separate field in the range instead of
overloading the event field.

> Maybe it can't be enabled for some reason that I overlooked in the current
> tree, or we just decided to not to?

I have just learnt about the patch; nobody had ever mentioned it even
though it's almost two years old... It's a lot of code, though, and no one
has reported an issue for over ten years, so I think it's easiest to just
rip the code out.

Paolo

> Thanks,
>
> --
> Peter Xu
[PATCH 4/4] mm: replace set_pte_at_notify() with just set_pte_at()
With the demise of the .change_pte() MMU notifier callback, there is no notification happening in set_pte_at_notify(). It is a synonym of set_pte_at() and can be replaced with it. Signed-off-by: Paolo Bonzini --- include/linux/mmu_notifier.h | 2 -- kernel/events/uprobes.c | 5 ++--- mm/ksm.c | 4 ++-- mm/memory.c | 7 +-- mm/migrate_device.c | 8 ++-- 5 files changed, 7 insertions(+), 19 deletions(-) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 8c72bf651606..d39ebb10caeb 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -657,6 +657,4 @@ static inline void mmu_notifier_synchronize(void) #endif /* CONFIG_MMU_NOTIFIER */ -#define set_pte_at_notify set_pte_at - #endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index e4834d23e1d1..f4523b95c945 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -18,7 +18,6 @@ #include #include #include /* anon_vma_prepare */ -#include /* set_pte_at_notify */ #include /* folio_free_swap */ #include /* user_enable_single_step */ #include /* notifier mechanism */ @@ -195,8 +194,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, flush_cache_page(vma, addr, pte_pfn(ptep_get(pvmw.pte))); ptep_clear_flush(vma, addr, pvmw.pte); if (new_page) - set_pte_at_notify(mm, addr, pvmw.pte, - mk_pte(new_page, vma->vm_page_prot)); + set_pte_at(mm, addr, pvmw.pte, + mk_pte(new_page, vma->vm_page_prot)); folio_remove_rmap_pte(old_folio, old_page, vma); if (!folio_mapped(old_folio)) diff --git a/mm/ksm.c b/mm/ksm.c index 8c001819cf10..108a4d167824 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1345,7 +1345,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, if (pte_write(entry)) entry = pte_wrprotect(entry); - set_pte_at_notify(mm, pvmw.address, pvmw.pte, entry); + set_pte_at(mm, pvmw.address, pvmw.pte, entry); } *orig_pte = entry; err = 0; @@ -1447,7 +1447,7 @@ static int 
replace_page(struct vm_area_struct *vma, struct page *page, * See Documentation/mm/mmu_notifier.rst */ ptep_clear_flush(vma, addr, ptep); - set_pte_at_notify(mm, addr, ptep, newpte); + set_pte_at(mm, addr, ptep, newpte); folio = page_folio(page); folio_remove_rmap_pte(folio, page, vma); diff --git a/mm/memory.c b/mm/memory.c index f2bc6dd15eb8..9a6f4d8aa379 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3327,13 +3327,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) ptep_clear_flush(vma, vmf->address, vmf->pte); folio_add_new_anon_rmap(new_folio, vma, vmf->address); folio_add_lru_vma(new_folio, vma); - /* -* We call the notify macro here because, when using secondary -* mmu page tables (such as kvm shadow page tables), we want the -* new page to be mapped directly into the secondary page table. -*/ BUG_ON(unshare && pte_write(entry)); - set_pte_at_notify(mm, vmf->address, vmf->pte, entry); + set_pte_at(mm, vmf->address, vmf->pte, entry); update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1); if (old_folio) { /* diff --git a/mm/migrate_device.c b/mm/migrate_device.c index b6c27c76e1a0..66206734b1b9 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -664,13 +664,9 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, if (flush) { flush_cache_page(vma, addr, pte_pfn(orig_pte)); ptep_clear_flush(vma, addr, ptep); - set_pte_at_notify(mm, addr, ptep, entry); - update_mmu_cache(vma, addr, ptep); - } else { - /* No need to invalidate - it was non-present before */ - set_pte_at(mm, addr, ptep, entry); - update_mmu_cache(vma, addr, ptep); } + set_pte_at(mm, addr, ptep, entry); + update_mmu_cache(vma, addr, ptep); pte_unmap_unlock(ptep, ptl); *src = MIGRATE_PFN_MIGRATE; -- 2.43.0
[PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify()
The .change_pte() MMU notifier callback was intended as an optimization
and for this reason it was initially called without a surrounding
mmu_notifier_invalidate_range_{start,end}() pair. It was only ever
implemented by KVM (which was also the original user of MMU notifiers)
and the rules on when to call set_pte_at_notify() rather than
set_pte_at() have always been pretty obscure.

It may seem a miracle that it has never caused any hard-to-trigger bugs,
but there's a good reason for that: KVM's implementation has been
nonfunctional for a good part of its existence. Already in 2012, commit
6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with
invalidate_range_start and invalidate_range_end", 2012-10-09) changed the
.change_pte() callback to occur within an invalidate_range_start/end()
pair; and because KVM unmaps the sPTEs during .invalidate_range_start(),
.change_pte() has no hope of finding a sPTE to change.

Therefore, all the code for .change_pte() can be removed from both KVM
and mm/, and set_pte_at_notify() can be replaced with just set_pte_at().

Please review! Also feel free to take the KVM patches through the mm
tree, as I don't expect any conflicts.
Thanks, Paolo Paolo Bonzini (4): KVM: delete .change_pte MMU notifier callback KVM: remove unused argument of kvm_handle_hva_range() mmu_notifier: remove the .change_pte() callback mm: replace set_pte_at_notify() with just set_pte_at() arch/arm64/kvm/mmu.c | 34 - arch/loongarch/include/asm/kvm_host.h | 1 - arch/loongarch/kvm/mmu.c | 32 arch/mips/kvm/mmu.c | 30 --- arch/powerpc/include/asm/kvm_ppc.h| 1 - arch/powerpc/kvm/book3s.c | 5 --- arch/powerpc/kvm/book3s.h | 1 - arch/powerpc/kvm/book3s_64_mmu_hv.c | 12 -- arch/powerpc/kvm/book3s_hv.c | 1 - arch/powerpc/kvm/book3s_pr.c | 7 arch/powerpc/kvm/e500_mmu_host.c | 6 --- arch/riscv/kvm/mmu.c | 20 -- arch/x86/kvm/mmu/mmu.c| 54 +-- arch/x86/kvm/mmu/spte.c | 16 arch/x86/kvm/mmu/spte.h | 2 - arch/x86/kvm/mmu/tdp_mmu.c| 46 --- arch/x86/kvm/mmu/tdp_mmu.h| 1 - include/linux/kvm_host.h | 2 - include/linux/mmu_notifier.h | 44 -- include/trace/events/kvm.h| 15 kernel/events/uprobes.c | 5 +-- mm/ksm.c | 4 +- mm/memory.c | 7 +--- mm/migrate_device.c | 8 +--- mm/mmu_notifier.c | 17 - virt/kvm/kvm_main.c | 50 + 26 files changed, 10 insertions(+), 411 deletions(-) -- 2.43.0
[PATCH 3/4] mmu_notifier: remove the .change_pte() callback
The scope of set_pte_at_notify() has reduced more and more through the years. Initially, it was meant for when the change to the PTE was not bracketed by mmu_notifier_invalidate_range_{start,end}(). However, that has not been so for over ten years. During all this period the only implementation of .change_pte() was KVM and it had no actual functionality, because it was called after mmu_notifier_invalidate_range_start() zapped the secondary PTE. Now that this (nonfunctional) user of the .change_pte() callback is gone, the whole callback can be removed. For now, leave in place set_pte_at_notify() even though it is just a synonym for set_pte_at(). Signed-off-by: Paolo Bonzini --- include/linux/mmu_notifier.h | 46 ++-- mm/mmu_notifier.c| 17 - 2 files changed, 2 insertions(+), 61 deletions(-) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index f349e08a9dfe..8c72bf651606 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -122,15 +122,6 @@ struct mmu_notifier_ops { struct mm_struct *mm, unsigned long address); - /* -* change_pte is called in cases that pte mapping to page is changed: -* for example, when ksm remaps pte to point to a new shared page. 
-*/ - void (*change_pte)(struct mmu_notifier *subscription, - struct mm_struct *mm, - unsigned long address, - pte_t pte); - /* * invalidate_range_start() and invalidate_range_end() must be * paired and are called only when the mmap_lock and/or the @@ -392,8 +383,6 @@ extern int __mmu_notifier_clear_young(struct mm_struct *mm, unsigned long end); extern int __mmu_notifier_test_young(struct mm_struct *mm, unsigned long address); -extern void __mmu_notifier_change_pte(struct mm_struct *mm, - unsigned long address, pte_t pte); extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r); extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r); extern void __mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm, @@ -439,13 +428,6 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm, return 0; } -static inline void mmu_notifier_change_pte(struct mm_struct *mm, - unsigned long address, pte_t pte) -{ - if (mm_has_notifiers(mm)) - __mmu_notifier_change_pte(mm, address, pte); -} - static inline void mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range) { @@ -581,26 +563,6 @@ static inline void mmu_notifier_range_init_owner( __young;\ }) -/* - * set_pte_at_notify() sets the pte _after_ running the notifier. - * This is safe to start by updating the secondary MMUs, because the primary MMU - * pte invalidate must have already happened with a ptep_clear_flush() before - * set_pte_at_notify() has been invoked. Updating the secondary MMUs first is - * required when we change both the protection of the mapping from read-only to - * read-write and the pfn (like during copy on write page faults). Otherwise the - * old page would remain mapped readonly in the secondary MMUs after the new - * page is already writable by some CPU through the primary MMU. 
- */ -#define set_pte_at_notify(__mm, __address, __ptep, __pte) \ -({ \ - struct mm_struct *___mm = __mm; \ - unsigned long ___address = __address; \ - pte_t ___pte = __pte; \ - \ - mmu_notifier_change_pte(___mm, ___address, ___pte); \ - set_pte_at(___mm, ___address, __ptep, ___pte); \ -}) - #else /* CONFIG_MMU_NOTIFIER */ struct mmu_notifier_range { @@ -650,11 +612,6 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm, return 0; } -static inline void mmu_notifier_change_pte(struct mm_struct *mm, - unsigned long address, pte_t pte) -{ -} - static inline void mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range) { @@ -693,7 +650,6 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm) #defineptep_clear_flush_notify ptep_clear_flush #define pmdp_huge_clear_flush_notify pmdp_huge_clear_flush #define pudp_huge_clear_flush_notify pudp_huge_clear_flush -#define set_pte_at_notify set_pte_at static inline void mmu_notifier_synchronize(void) { @@ -701,4 +657,6 @@ static
[PATCH 1/4] KVM: delete .change_pte MMU notifier callback
The .change_pte() MMU notifier callback was intended as an optimization.
The original point of it was that KSM could tell KVM to flip its
secondary PTE to a new location without having to first zap it. At the
time there was also an .invalidate_page() callback; both of them were
*not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
and .invalidate_page() also doubled as a fallback implementation of
.change_pte().

Later on, however, both callbacks were changed to occur within an
invalidate_range_start/end() block. In the case of .change_pte(), commit
6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with
invalidate_range_start and invalidate_range_end", 2012-10-09) did so to
remove the fallback from .invalidate_page() to .change_pte() and allow
sleepable .invalidate_page() hooks.

This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start() and
therefore .change_pte() has no hope of finding a sPTE to change.

Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as well
as all the architecture specific implementations.
Signed-off-by: Paolo Bonzini --- arch/arm64/kvm/mmu.c | 34 - arch/loongarch/include/asm/kvm_host.h | 1 - arch/loongarch/kvm/mmu.c | 32 arch/mips/kvm/mmu.c | 30 --- arch/powerpc/include/asm/kvm_ppc.h| 1 - arch/powerpc/kvm/book3s.c | 5 --- arch/powerpc/kvm/book3s.h | 1 - arch/powerpc/kvm/book3s_64_mmu_hv.c | 12 -- arch/powerpc/kvm/book3s_hv.c | 1 - arch/powerpc/kvm/book3s_pr.c | 7 arch/powerpc/kvm/e500_mmu_host.c | 6 --- arch/riscv/kvm/mmu.c | 20 -- arch/x86/kvm/mmu/mmu.c| 54 +-- arch/x86/kvm/mmu/spte.c | 16 arch/x86/kvm/mmu/spte.h | 2 - arch/x86/kvm/mmu/tdp_mmu.c| 46 --- arch/x86/kvm/mmu/tdp_mmu.h| 1 - include/linux/kvm_host.h | 2 - include/trace/events/kvm.h| 15 virt/kvm/kvm_main.c | 43 - 20 files changed, 2 insertions(+), 327 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index dc04bc767865..ff17849be9f4 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) return false; } -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) -{ - kvm_pfn_t pfn = pte_pfn(range->arg.pte); - - if (!kvm->arch.mmu.pgt) - return false; - - WARN_ON(range->end - range->start != 1); - - /* -* If the page isn't tagged, defer to user_mem_abort() for sanitising -* the MTE tags. The S2 pte should have been unmapped by -* mmu_notifier_invalidate_range_end(). -*/ - if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn))) - return false; - - /* -* We've moved a page around, probably through CoW, so let's treat -* it just like a translation fault and the map handler will clean -* the cache to the PoC. -* -* The MMU notifiers will have unmapped a huge PMD before calling -* ->change_pte() (which in turn calls kvm_set_spte_gfn()) and -* therefore we never need to clear out a huge PMD through this -* calling path and a memcache is not required. 
-*/ - kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT, - PAGE_SIZE, __pfn_to_phys(pfn), - KVM_PGTABLE_PROT_R, NULL, 0); - - return false; -} - bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { u64 size = (range->end - range->start) << PAGE_SHIFT; diff --git a/arch/loongarch/include/asm/kvm_host.h b/arch/loongarch/include/asm/kvm_host.h index 2d62f7b0d377..69305441f40d 100644 --- a/arch/loongarch/include/asm/kvm_host.h +++ b/arch/loongarch/include/asm/kvm_host.h @@ -203,7 +203,6 @@ void kvm_flush_tlb_all(void); void kvm_flush_tlb_gpa(struct kvm_vcpu *vcpu, unsigned long gpa); int kvm_handle_mm_fault(struct kvm_vcpu *vcpu, unsigned long badv, bool write); -void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte); int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end, bool blockable); int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end); int kvm_test_age_hva(struct kvm *kvm, unsigned long hva); diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c index a556cff35740..98883aa23ab8 100644 --- a/arch/loongarch/kvm/mmu.c +++ b/arch/loongarch/kvm/mmu.c @@ -494,38 +494,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct
[PATCH 2/4] KVM: remove unused argument of kvm_handle_hva_range()
The only user was kvm_mmu_notifier_change_pte(), which is now gone. Signed-off-by: Paolo Bonzini --- virt/kvm/kvm_main.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 2fcd9979752a..970111ad 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -595,8 +595,6 @@ static void kvm_null_fn(void) } #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn) -static const union kvm_mmu_notifier_arg KVM_MMU_NOTIFIER_NO_ARG; - /* Iterate over each memslot intersecting [start, last] (inclusive) range */ #define kvm_for_each_memslot_in_hva_range(node, slots, start, last) \ for (node = interval_tree_iter_first(>hva_tree, start, last); \ @@ -682,14 +680,12 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn, unsigned long start, unsigned long end, - union kvm_mmu_notifier_arg arg, gfn_handler_t handler) { struct kvm *kvm = mmu_notifier_to_kvm(mn); const struct kvm_mmu_notifier_range range = { .start = start, .end= end, - .arg= arg, .handler= handler, .on_lock= (void *)kvm_null_fn, .flush_on_ret = true, @@ -880,8 +876,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn, { trace_kvm_age_hva(start, end); - return kvm_handle_hva_range(mn, start, end, KVM_MMU_NOTIFIER_NO_ARG, - kvm_age_gfn); + return kvm_handle_hva_range(mn, start, end, kvm_age_gfn); } static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn, -- 2.43.0
Re: [PATCH] KVM: deprecate KVM_WERROR in favor of general WERROR
On Wed, Oct 11, 2023 at 1:46 AM Sean Christopherson wrote:
> > > The DRM_I915_WERROR config depends on EXPERT and !COMPILE_TEST, and to
> > > my knowledge this has never caused issues outside of i915 developers
> > > and CI.
> >
> > Ack, I think you do it right. I was trying to establish a precedent
> > so that we can delete these as soon as they cause an issue, not sooner.
>
> So isn't the underlying problem simply that KVM_WERROR is enabled by
> default for some configurations? If that's the case, then my proposal to
> make KVM_WERROR always off by default, and "depends on KVM && EXPERT &&
> !KASAN", would make this go away, no?

No objection to adding EXPERT. Adding W=1 when build-testing KVM patches
is a good idea; you will still get the occasional patch from someone who
didn't have it, but that's fine.

I added KVM_WERROR a relatively long time ago after a warning scrolled
away too quickly (a harmless one, but also a kind that honestly shouldn't
have made it to Linus). At the time there were still too many warnings to
enable WERROR globally, and I feel that now we're in the same boat with
W=1. I think we should keep KVM_WERROR (which was based on
DRM_I915_WERROR indeed) and maintainers should just add W=1 when
build-testing KVM patches.

Paolo
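Sean's proposal, as quoted above, would amount to a Kconfig stanza along these lines (a sketch of the proposal only, not the stanza actually in the tree, whose dependency expression differs):

```kconfig
config KVM_WERROR
	bool "Compile KVM with -Werror"
	# Sketch of the proposal discussed above: always off by default,
	# and only offered to EXPERT configs that are not built with KASAN.
	depends on KVM && EXPERT && !KASAN
	default n
	help
	  Add the -Werror flag to the KVM build, turning all compiler
	  warnings into errors.

	  If in doubt, say "N".
```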
Re: [PATCH 0/3] KVM: x86: guest interface for SEV live migration
On 20/04/21 22:16, Sean Christopherson wrote:
> On Tue, Apr 20, 2021, Sean Christopherson wrote:
> > On Tue, Apr 20, 2021, Paolo Bonzini wrote:
> > > In this particular case, if userspace sets the bit in CPUID2 but
> > > doesn't handle KVM_EXIT_HYPERCALL, the guest will probably trigger
> > > some kind of assertion failure as soon as it invokes the
> > > HC_PAGE_ENC_STATUS hypercall.
>
> Oh! Almost forgot my hail mary idea. Instead of a new capability, can we
> reject the hypercall if userspace has _not_ set
> KVM_CAP_ENFORCE_PV_FEATURE_CPUID?
>
>	if (vcpu->arch.pv_cpuid.enforce &&
>	    !guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS))
>		break;

Couldn't userspace enable that capability and _still_ copy the supported
CPUID blindly to the guest CPUID, without supporting the hypercall?

(BTW, it's better to return a bitmask of hypercalls that will exit to
userspace from KVM_CHECK_EXTENSION. Userspace can still reject with
-ENOSYS those that it doesn't know, but it's important that it knows in
general how to handle KVM_EXIT_HYPERCALL.)

Paolo
Re: [PATCH v13 00/12] Add AMD SEV guest live migration support
On 20/04/21 20:51, Borislav Petkov wrote:
> Hey Paolo,
>
> On Tue, Apr 20, 2021 at 01:11:31PM +0200, Paolo Bonzini wrote:
> > I have queued patches 1-6. For patches 8 and 10 I will post my own
> > version based on my review and feedback.
>
> can you pls push that tree up to here to a branch somewhere so that ...

Yup, for now it's all at kvm/queue and it will land in kvm/next tomorrow
(hopefully). The guest interface patches in KVM are very near the top.

Paolo

> > For guest patches, please repost separately so that x86 maintainers
> > will notice them and ack them.
>
> ... I can take a look at the guest bits in the full context of the
> changes?
Re: [PATCH 0/3] KVM: x86: guest interface for SEV live migration
On 20/04/21 19:31, Sean Christopherson wrote: + case KVM_HC_PAGE_ENC_STATUS: { + u64 gpa = a0, npages = a1, enc = a2; + + ret = -KVM_ENOSYS; + if (!vcpu->kvm->arch.hypercall_exit_enabled) I don't follow, why does the hypercall need to be gated by a capability? What would break if this were changed to? if (!guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS)) The problem is that it's valid to take KVM_GET_SUPPORTED_CPUID and send it unmodified to KVM_SET_CPUID2. For this reason, features that are conditional on other ioctls, or that require some kind of userspace support, must not be in KVM_GET_SUPPORTED_CPUID. For example: - TSC_DEADLINE because it is only implemented after KVM_CREATE_IRQCHIP (or after KVM_ENABLE_CAP of KVM_CAP_IRQCHIP_SPLIT) - MONITOR only makes sense if userspace enables KVM_CAP_X86_DISABLE_EXITS X2APIC is reported even though it shouldn't be. Too late to fix that, I think. In this particular case, if userspace sets the bit in CPUID2 but doesn't handle KVM_EXIT_HYPERCALL, the guest will probably trigger some kind of assertion failure as soon as it invokes the HC_PAGE_ENC_STATUS hypercall. (I should document that, Jim asked for documentation around KVM_GET_SUPPORTED_CPUID and KVM_GET_MSR_INDEX_LIST many times). Paolo + break; + + if (!PAGE_ALIGNED(gpa) || !npages || + gpa_to_gfn(gpa) + npages <= gpa_to_gfn(gpa)) { + ret = -EINVAL; + break; + } + + vcpu->run->exit_reason= KVM_EXIT_HYPERCALL; + vcpu->run->hypercall.nr = KVM_HC_PAGE_ENC_STATUS; + vcpu->run->hypercall.args[0] = gpa; + vcpu->run->hypercall.args[1] = npages; + vcpu->run->hypercall.args[2] = enc; + vcpu->run->hypercall.longmode = op_64_bit; + vcpu->arch.complete_userspace_io = complete_hypercall_exit; + return 0; + } default: ret = -KVM_ENOSYS; break; ... 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 590cc811c99a..d696a9f13e33 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3258,6 +3258,14 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) vcpu->arch.msr_kvm_poll_control = data; break; + case MSR_KVM_MIGRATION_CONTROL: + if (data & ~KVM_PAGE_ENC_STATUS_UPTODATE) + return 1; + + if (data && !guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS)) Why let the guest write '0'? Letting the guest do WRMSR but not RDMSR is bizarre. Because it was the simplest way to write the code, but returning 0 unconditionally from RDMSR is actually simpler. Paolo + return 1; + break; + case MSR_IA32_MCG_CTL: case MSR_IA32_MCG_STATUS: case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1: @@ -3549,6 +3557,12 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) if (!guest_pv_has(vcpu, KVM_FEATURE_ASYNC_PF)) return 1; + msr_info->data = 0; + break; + case MSR_KVM_MIGRATION_CONTROL: + if (!guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS)) + return 1; + msr_info->data = 0; break; case MSR_KVM_STEAL_TIME: -- 2.26.2
Re: [PATCH 0/3] KVM: x86: guest interface for SEV live migration
On 20/04/21 17:15, Sean Christopherson wrote: On Tue, Apr 20, 2021, Paolo Bonzini wrote: Do not return the SEV-ES bit from KVM_GET_SUPPORTED_CPUID unless the corresponding module parameter is 1, and clear the memory encryption leaf completely if SEV is disabled. Impeccable timing, I was planning on refreshing my SEV cleanup series[*] today. There's going to be an annoying conflict with the svm_set_cpu_caps() change (see below), any objecting to folding your unintentional feedback into my series? That's fine of course. diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index 888e88b42e8d..e873a60a4830 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -99,6 +99,7 @@ static const struct cpuid_reg reverse_cpuid[] = { [CPUID_7_EDX] = { 7, 0, CPUID_EDX}, [CPUID_7_1_EAX] = { 7, 1, CPUID_EAX}, [CPUID_12_EAX]= {0x0012, 0, CPUID_EAX}, + [CPUID_8000_001F_EAX] = {0x801F, 0, CPUID_EAX}, }; /* diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index cd8c333ed2dc..acdb8457289e 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -923,6 +923,13 @@ static __init void svm_set_cpu_caps(void) if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) || boot_cpu_has(X86_FEATURE_AMD_SSBD)) kvm_cpu_cap_set(X86_FEATURE_VIRT_SSBD); + + /* CPUID 0x801F */ + if (sev) { + kvm_cpu_cap_set(X86_FEATURE_SEV); + if (sev_es) + kvm_cpu_cap_set(X86_FEATURE_SEV_ES); Gah, I completely spaced on the module params in my series, which is more problematic than normal because it also moves "sev" and "sev_es" to sev.c. The easy solution is to add sev_set_cpu_caps(). Sounds good. Paolo
Re: [PATCH] KVM: selftests: Always run vCPU thread with blocked SIG_IPI
On 20/04/21 17:32, Peter Xu wrote:
> On Tue, Apr 20, 2021 at 10:37:39AM -0400, Peter Xu wrote:
> > On Tue, Apr 20, 2021 at 04:16:14AM -0400, Paolo Bonzini wrote:
> > > The main thread could start to send SIG_IPI at any time, even before
> > > signal blocked on vcpu thread. Therefore, start the vcpu thread with
> > > the signal blocked.
> > >
> > > Without this patch, on very busy cores the dirty_log_test could fail
> > > directly on receiving a SIGUSR1 without a handler (when vcpu runs far
> > > slower than main).
> > >
> > > Reported-by: Peter Xu
> > > Cc: sta...@vger.kernel.org
> > > Signed-off-by: Paolo Bonzini
> >
> > Yes, indeed better! :)
> >
> > Reviewed-by: Peter Xu
>
> I just remembered one thing: this will avoid program quits, but still
> we'll get the signal missing.

In what sense will the signal be missing? As long as the thread exists,
the signal will be accepted (but not delivered, because it is blocked);
it will then be delivered on the first KVM_RUN.

Paolo

> From that pov I slightly prefer the old patch. However not a big deal so
> far as only dirty ring uses SIG_IPI, so there's always ring full which
> will just delay the kick. It's just we need to remember this when we
> extend IPI to non-dirty-ring tests as the kick is prone to be lost then.
>
> Thanks,
Re: [PATCH v3 1/2] KVM: selftests: Sync data verify of dirty logging with guest sync
On 20/04/21 15:10, Peter Xu wrote: On Tue, Apr 20, 2021 at 10:07:16AM +0200, Paolo Bonzini wrote: On 18/04/21 14:43, Peter Xu wrote: 8<- diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c index 25230e799bc4..d3050d1c2cd0 100644 --- a/tools/testing/selftests/kvm/dirty_log_test.c +++ b/tools/testing/selftests/kvm/dirty_log_test.c @@ -377,7 +377,7 @@ static void dirty_ring_after_vcpu_run(struct kvm_vm *vm, int ret, int err) /* A ucall-sync or ring-full event is allowed */ if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) { /* We should allow this to continue */ - ; + vcpu_handle_sync_stop(); } else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL || (ret == -1 && err == EINTR)) { /* Update the flag first before pause */ 8<- That's my intention when I introduce vcpu_handle_sync_stop(), but forgot to add... And possibly even this (untested though): diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c index ffa4e2791926..918954f01cef 100644 --- a/tools/testing/selftests/kvm/dirty_log_test.c +++ b/tools/testing/selftests/kvm/dirty_log_test.c @@ -383,6 +383,7 @@ static void dirty_ring_after_vcpu_run(struct kvm_vm *vm, int ret, int err) /* Update the flag first before pause */ WRITE_ONCE(dirty_ring_vcpu_ring_full, run->exit_reason == KVM_EXIT_DIRTY_RING_FULL); + atomic_set(_sync_stop_requested, false); sem_post(_vcpu_stop); pr_info("vcpu stops because %s...\n", dirty_ring_vcpu_ring_full ? @@ -804,8 +805,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) * the flush of the last page, and since we handle the last * page specially verification will succeed anyway. */ - assert(host_log_mode == LOG_MODE_DIRTY_RING || - atomic_read(_sync_stop_requested) == false); + assert(atomic_read(_sync_stop_requested) == false); vm_dirty_log_verify(mode, bmap); sem_post(_vcpu_cont); You can submit all these as a separate patch. But it could race, then? 
    main thread                          vcpu thread
    -----------                          -----------
                                         ring full
                                         vcpu_sync_stop_requested=0
                                         sem_post(&sem_vcpu_stop)
    vcpu_sync_stop_requested=1
    sem_wait(&sem_vcpu_stop)
    assert(vcpu_sync_stop_requested==0) <-------

Yes, it could indeed.

Thanks,

Paolo
[PATCH 0/3] KVM: x86: guest interface for SEV live migration
This is a reviewed version of the guest interface (hypercall+MSR) for SEV live migration. The differences lie mostly in the API for userspace. In particular: - the CPUID feature is not exposed in KVM_GET_SUPPORTED_CPUID - the hypercall must be enabled manually with KVM_ENABLE_CAP - the MSR has sensible behavior if not filtered (reads as zero, writes must leave undefined bits as zero or they #GP) Patch 1 is an unrelated cleanup, that was made moot by not exposing the bit in KVM_GET_SUPPORTED_CPUID. Paolo Ashish Kalra (1): KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall Paolo Bonzini (2): KVM: SEV: mask CPUID[0x801F].eax according to supported features KVM: x86: add MSR_KVM_MIGRATION_CONTROL Documentation/virt/kvm/api.rst| 11 ++ Documentation/virt/kvm/cpuid.rst | 6 Documentation/virt/kvm/hypercalls.rst | 15 + Documentation/virt/kvm/msr.rst| 10 ++ arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/include/uapi/asm/kvm_para.h | 4 +++ arch/x86/kvm/cpuid.c | 5 ++- arch/x86/kvm/cpuid.h | 1 + arch/x86/kvm/svm/svm.c| 7 arch/x86/kvm/x86.c| 48 +++ include/uapi/linux/kvm.h | 1 + include/uapi/linux/kvm_para.h | 1 + 12 files changed, 110 insertions(+), 1 deletion(-) -- 2.26.2 >From 547d4d4edcd05fdfac6ce650d65db1d42bcd2807 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Tue, 20 Apr 2021 05:49:11 -0400 Subject: [PATCH 1/3] KVM: SEV: mask CPUID[0x801F].eax according to supported features Do not return the SEV-ES bit from KVM_GET_SUPPORTED_CPUID unless the corresponding module parameter is 1, and clear the memory encryption leaf completely if SEV is disabled. 
Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 5 - arch/x86/kvm/cpuid.h | 1 + arch/x86/kvm/svm/svm.c | 7 +++ 3 files changed, 12 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 2ae061586677..d791d1f093ab 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -944,8 +944,11 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) break; /* Support memory encryption cpuid if host supports it */ case 0x801F: - if (!boot_cpu_has(X86_FEATURE_SEV)) + if (!kvm_cpu_cap_has(X86_FEATURE_SEV)) { entry->eax = entry->ebx = entry->ecx = entry->edx = 0; + break; + } + cpuid_entry_override(entry, CPUID_8000_001F_EAX); break; /*Add support for Centaur's CPUID instruction*/ case 0xC000: diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index 888e88b42e8d..e873a60a4830 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -99,6 +99,7 @@ static const struct cpuid_reg reverse_cpuid[] = { [CPUID_7_EDX] = { 7, 0, CPUID_EDX}, [CPUID_7_1_EAX] = { 7, 1, CPUID_EAX}, [CPUID_12_EAX]= {0x0012, 0, CPUID_EAX}, + [CPUID_8000_001F_EAX] = {0x801F, 0, CPUID_EAX}, }; /* diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index cd8c333ed2dc..acdb8457289e 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -923,6 +923,13 @@ static __init void svm_set_cpu_caps(void) if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) || boot_cpu_has(X86_FEATURE_AMD_SSBD)) kvm_cpu_cap_set(X86_FEATURE_VIRT_SSBD); + + /* CPUID 0x801F */ + if (sev) { + kvm_cpu_cap_set(X86_FEATURE_SEV); + if (sev_es) + kvm_cpu_cap_set(X86_FEATURE_SEV_ES); + } } static __init int svm_hardware_setup(void) -- 2.26.2 >From ef78673f78e3f2eedc498c1fbf9271146caa83cb Mon Sep 17 00:00:00 2001 From: Ashish Kalra Date: Thu, 15 Apr 2021 15:57:02 + Subject: [PATCH 2/3] KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall This hypercall is used by the SEV guest to notify a change in the page encryption status to the hypervisor. 
The hypercall should be invoked only when the encryption attribute is changed from encrypted -> decrypted and vice versa. By default all guest pages are considered encrypted. The hypercall exits to userspace to manage the guest shared regions and integrate with the userspace VMM's migration code. Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Paolo Bonzini Cc: Joerg Roedel Cc: Borislav Petkov Cc: Tom Lendacky Cc: x...@kernel.org Cc: k...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Steve Rutherford Signed-off-by: Brijesh Singh Signed-off-by: Ashish Kalra Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Message-Id: <93d7f2c2888315adc48905722574d89699edde33.1618498113.git.ashish.ka...@amd.com> Signed-off-
Re: [PATCH v13 00/12] Add AMD SEV guest live migration support
On 15/04/21 17:52, Ashish Kalra wrote:
From: Ashish Kalra

The series adds support for the AMD SEV guest live migration commands. To protect the confidentiality of an SEV-protected guest's memory while in transit, we need to use the SEV commands defined in the SEV API spec [1].

SEV guest VMs have the concept of private and shared memory. Private memory is encrypted with the guest-specific key, while shared memory may be encrypted with the hypervisor key. The commands provided by the SEV firmware are meant to be used for the private memory only. The patch series introduces a new hypercall. The guest OS can use this hypercall to notify the page encryption status. If the page is encrypted with the guest-specific key, then we use an SEV command during the migration. If the page is not encrypted, then we fall back to the default.

The patch uses the KVM_EXIT_HYPERCALL exit code and the hypercall-to-userspace exit functionality as a common interface from the guest back to the VMM, passing on the guest shared/unencrypted page information to the userspace VMM/QEMU. QEMU can consult this information during migration to know whether the page is encrypted.

This section describes how the SEV live migration feature is negotiated between the host and guest. The host indicates support for this feature via KVM_FEATURE_CPUID. The guest firmware (OVMF) detects this feature and sets a UEFI environment variable indicating OVMF support for live migration. The guest kernel also detects the host support for this feature via CPUID; in case of an EFI boot, it verifies that OVMF also supports this feature by reading the UEFI environment variable, and if it is set, enables the live migration feature on the host by writing to a custom MSR. If not booted under EFI, it simply enables the feature by writing to the custom MSR. The MSR is also handled by the userspace VMM/QEMU.
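The negotiation flow described above boils down to a small decision. The following is a minimal sketch in plain C — the function and parameter names are invented for illustration and do not match the kernel code:

```c
#include <stdbool.h>

/*
 * Illustrative model of the guest-side negotiation: the guest enables
 * live migration only when the host advertises the feature via CPUID
 * and, on EFI boots, OVMF has also opted in through its UEFI variable.
 * On success, the real code would follow up with a wrmsrl() to the
 * custom MSR.
 */
static bool should_enable_live_migration(bool host_cpuid_feature,
                                         bool efi_boot,
                                         bool ovmf_variable_set)
{
    if (!host_cpuid_feature)
        return false;           /* host does not support the feature */
    if (efi_boot && !ovmf_variable_set)
        return false;           /* firmware cannot track shared pages */
    return true;
}
```

For instance, an EFI guest whose OVMF does not set the variable keeps the feature off even though the host advertises it.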
A branch containing these patches is available here: https://github.com/AMDESE/linux/tree/sev-migration-v13 [1] https://developer.amd.com/wp-content/resources/55766.PDF I have queued patches 1-6. For patches 8 and 10 I will post my own version based on my review and feedback. For guest patches, please repost separately so that x86 maintainers will notice them and ack them. Paolo Changes since v12: - Reset page encryption status during early boot instead of just before the kexec to avoid SMP races during kvm_pv_guest_cpu_reboot(). - Remove incorrect log message in case of non-EFI boot and implicit enabling of SEV live migration feature. Changes since v11: - Clean up and remove kvm_x86_ops callback for page_enc_status_hc and instead add a new per-VM flag to support/enable the page encryption status hypercall. - Remove KVM_EXIT_DMA_SHARE/KVM_EXIT_DMA_UNSHARE exitcodes and instead use the KVM_EXIT_HYPERCALL exitcode for page encryption status hypercall to userspace functionality. Changes since v10: - Adds new KVM_EXIT_DMA_SHARE/KVM_EXIT_DMA_UNSHARE hypercall to userspace exit functionality as a common interface from the guest back to the KVM and passing on the guest shared/unencrypted region information to the userspace VMM/Qemu. KVM/host kernel does not maintain the guest shared memory regions information anymore. - Remove implicit enabling of SEV live migration feature for an SEV guest, now this is explicitly in control of the userspace VMM/Qemu. - Custom MSR handling is also now moved into userspace VMM/Qemu. - As KVM does not maintain the guest shared memory region information anymore, sev_dbg_crypt() cannot bypass unencrypted guest memory regions without support from userspace VMM/Qemu. Changes since v9: - Transitioning from page encryption bitmap to the shared pages list to keep track of guest's shared/unencrypted memory regions. - Move back to marking the complete _bss_decrypted section as decrypted in the shared pages list. 
- Invoke a new function check_kvm_sev_migration() via kvm_init_platform() for the guest to query host-side support for SEV live migration and to enable the SEV live migration feature, to avoid #ifdefs in code.
- Rename MSR_KVM_SEV_LIVE_MIG_EN to MSR_KVM_SEV_LIVE_MIGRATION.
- Invoke a new function handle_unencrypted_region() from sev_dbg_crypt() to bypass unencrypted guest memory regions.

Changes since v8:
- Rebasing to kvm next branch.
- Fixed and added comments as per review feedback on v8 patches.
- Removed implicitly enabling live migration for incoming VMs in KVM_SET_PAGE_ENC_BITMAP; it is now done via the KVM_SET_MSR ioctl.
- Adds support for bypassing unencrypted guest memory regions for DBG_DECRYPT API calls; guest memory region encryption status in sev_dbg_decrypt() is referenced using the page encryption bitmap.

Changes since v7:
- Removed the hypervisor specific hypercall/paravirt callback for SEV live migration and moved back to calling kvm_sev_hypercall3 directly.
- Fix build errors as Reported-by: kbuild test robot, specifically fixed build error when CONFIG_HYPERVISOR_GUEST=y and
Re: [PATCH v13 08/12] KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall
On 15/04/21 17:57, Ashish Kalra wrote: From: Ashish Kalra This hypercall is used by the SEV guest to notify a change in the page encryption status to the hypervisor. The hypercall should be invoked only when the encryption attribute is changed from encrypted -> decrypted and vice versa. By default all guest pages are considered encrypted. The hypercall exits to userspace to manage the guest shared regions and integrate with the userspace VMM's migration code. I think this should be exposed to userspace as a capability, rather than as a CPUID bit. Userspace then can enable the capability and set the CPUID bit if it wants. The reason is that userspace could pass KVM_GET_SUPPORTED_CPUID to KVM_SET_CPUID2 and the hypercall then would break the guest. Paolo Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Paolo Bonzini Cc: Joerg Roedel Cc: Borislav Petkov Cc: Tom Lendacky Cc: x...@kernel.org Cc: k...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Steve Rutherford Signed-off-by: Brijesh Singh Signed-off-by: Ashish Kalra Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson --- Documentation/virt/kvm/hypercalls.rst | 15 ++ arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/svm/sev.c| 1 + arch/x86/kvm/x86.c| 29 +++ include/uapi/linux/kvm_para.h | 1 + 5 files changed, 48 insertions(+) diff --git a/Documentation/virt/kvm/hypercalls.rst b/Documentation/virt/kvm/hypercalls.rst index ed4fddd364ea..7aff0cebab7c 100644 --- a/Documentation/virt/kvm/hypercalls.rst +++ b/Documentation/virt/kvm/hypercalls.rst @@ -169,3 +169,18 @@ a0: destination APIC ID :Usage example: When sending a call-function IPI-many to vCPUs, yield if any of the IPI target vCPUs was preempted. + + +8. 
KVM_HC_PAGE_ENC_STATUS +- +:Architecture: x86 +:Status: active +:Purpose: Notify the encryption status changes in guest page table (SEV guest) + +a0: the guest physical address of the start page +a1: the number of pages +a2: encryption attribute + + Where: + * 1: Encryption attribute is set + * 0: Encryption attribute is cleared diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 3768819693e5..42eb0fe3df5d 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1050,6 +1050,8 @@ struct kvm_arch { bool bus_lock_detection_enabled; + bool page_enc_hc_enable; + /* Deflect RDMSR and WRMSR to user space when they trigger a #GP */ u32 user_space_msr_mask; struct kvm_x86_msr_filter __rcu *msr_filter; diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index c9795a22e502..5184a0c0131a 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -197,6 +197,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp) sev->active = true; sev->asid = asid; INIT_LIST_HEAD(>regions_list); + kvm->arch.page_enc_hc_enable = true; return 0; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index f7d12fca397b..e8986478b653 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8208,6 +8208,13 @@ static void kvm_sched_yield(struct kvm *kvm, unsigned long dest_id) kvm_vcpu_yield_to(target); } +static int complete_hypercall_exit(struct kvm_vcpu *vcpu) +{ + kvm_rax_write(vcpu, vcpu->run->hypercall.ret); + ++vcpu->stat.hypercalls; + return kvm_skip_emulated_instruction(vcpu); +} + int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) { unsigned long nr, a0, a1, a2, a3, ret; @@ -8273,6 +8280,28 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) kvm_sched_yield(vcpu->kvm, a0); ret = 0; break; + case KVM_HC_PAGE_ENC_STATUS: { + u64 gpa = a0, npages = a1, enc = a2; + + ret = -KVM_ENOSYS; + if (!vcpu->kvm->arch.page_enc_hc_enable) + break; + + if (!PAGE_ALIGNED(gpa) || !npages || + gpa_to_gfn(gpa) 
+ npages <= gpa_to_gfn(gpa)) { + ret = -EINVAL; + break; + } + + vcpu->run->exit_reason= KVM_EXIT_HYPERCALL; + vcpu->run->hypercall.nr = KVM_HC_PAGE_ENC_STATUS; + vcpu->run->hypercall.args[0] = gpa; + vcpu->run->hypercall.args[1] = npages; + vcpu->run->hypercall.args[2] = enc; + vcpu->run->hypercall.longmode = op_64_bit; + vcpu->arch.complete_userspace_io = complete_hypercall_exit; + return 0; + } default: ret = -KVM_ENOSYS; break; diff --git a/include/uapi/linux/kvm_para.h
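The argument sanity check in the hypercall handler above can be read as a standalone predicate. Here is a minimal userspace model of it, assuming 4 KiB pages; `gpa_to_gfn()` is reimplemented here purely for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Guest physical address -> guest frame number, as in the patch. */
static uint64_t gpa_to_gfn(uint64_t gpa)
{
    return gpa >> PAGE_SHIFT;
}

/*
 * Model of the -EINVAL check: the GPA must be page aligned, the page
 * count non-zero, and gfn + npages must not wrap around 2^64.
 */
static bool page_enc_status_args_valid(uint64_t gpa, uint64_t npages)
{
    if (gpa & (PAGE_SIZE - 1))
        return false;                       /* not page aligned */
    if (!npages)
        return false;                       /* empty range */
    if (gpa_to_gfn(gpa) + npages <= gpa_to_gfn(gpa))
        return false;                       /* gfn range wraps around */
    return true;
}
```

With `npages` non-zero, the sum can only be less than or equal to the starting gfn when the unsigned addition wrapped, which is exactly the overflow case the patch rejects.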
[PATCH] KVM: x86: add MSR_KVM_MIGRATION_CONTROL
Add a new MSR that can be used to communicate whether the page encryption status bitmap is up to date and therefore whether live migration of an encrypted guest is possible. The MSR should be processed by userspace if it is going to live migrate the guest; the default implementation does nothing. Signed-off-by: Paolo Bonzini --- Documentation/virt/kvm/cpuid.rst | 3 ++- Documentation/virt/kvm/msr.rst | 10 ++ arch/x86/include/uapi/asm/kvm_para.h | 3 +++ arch/x86/kvm/x86.c | 14 ++ 4 files changed, 29 insertions(+), 1 deletion(-) diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst index c9378d163b5a..15923857de56 100644 --- a/Documentation/virt/kvm/cpuid.rst +++ b/Documentation/virt/kvm/cpuid.rst @@ -99,7 +99,8 @@ KVM_FEATURE_MSI_EXT_DEST_ID15 guest checks this feature bit KVM_FEATURE_HC_PAGE_ENC_STATUS 16 guest checks this feature bit before using the page encryption state hypercall to notify the page state - change + change, and before setting bit 0 of + MSR_KVM_MIGRATION_CONTROL KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24 host will warn if no guest-side per-cpu warps are expected in diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst index e37a14c323d2..691d718df60a 100644 --- a/Documentation/virt/kvm/msr.rst +++ b/Documentation/virt/kvm/msr.rst @@ -376,3 +376,13 @@ data: write '1' to bit 0 of the MSR, this causes the host to re-scan its queue and check if there are more notifications pending. The MSR is available if KVM_FEATURE_ASYNC_PF_INT is present in CPUID. + +MSR_KVM_MIGRATION_CONTROL: +0x4b564d08 + +data: +If the guest is running with encrypted memory and it is communicating +page encryption status to the host using the ``KVM_HC_PAGE_ENC_STATUS`` +hypercall, it can set bit 0 in this MSR to allow live migration of +the guest. The bit can be set if KVM_FEATURE_HC_PAGE_ENC_STATUS is +present in CPUID. 
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index be49956b603f..c5ee419775d8 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -55,6 +55,7 @@ #define MSR_KVM_POLL_CONTROL 0x4b564d05 #define MSR_KVM_ASYNC_PF_INT 0x4b564d06 #define MSR_KVM_ASYNC_PF_ACK 0x4b564d07 +#define MSR_KVM_MIGRATION_CONTROL 0x4b564d08 struct kvm_steal_time { __u64 steal; @@ -91,6 +92,8 @@ struct kvm_clock_pairing { /* MSR_KVM_ASYNC_PF_INT */ #define KVM_ASYNC_PF_VEC_MASK GENMASK(7, 0) +/* MSR_KVM_MIGRATION_CONTROL */ +#define KVM_PAGE_ENC_STATUS_UPTODATE (1 << 0) /* Operations for KVM_HC_MMU_OP */ #define KVM_MMU_OP_WRITE_PTE1 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 36d302009fd8..efb98be3338d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3258,6 +3258,14 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) vcpu->arch.msr_kvm_poll_control = data; break; + case MSR_KVM_MIGRATION_CONTROL: + if (data & ~KVM_PAGE_ENC_STATUS_UPTODATE) + return 1; + + if (data && !guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS)) + return 1; + break; + case MSR_IA32_MCG_CTL: case MSR_IA32_MCG_STATUS: case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1: @@ -3549,6 +3557,12 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) if (!guest_pv_has(vcpu, KVM_FEATURE_ASYNC_PF)) return 1; + msr_info->data = 0; + break; + case MSR_KVM_MIGRATION_CONTROL: + if (!guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS)) + return 1; + msr_info->data = 0; break; case MSR_KVM_STEAL_TIME: -- 2.26.2
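The WRMSR validation added above can be modeled as a small pure function. The function name is invented for this sketch; it mirrors the two `return 1` paths in `kvm_set_msr_common()`, which surface as a #GP in the guest:

```c
#include <stdbool.h>
#include <stdint.h>

#define KVM_PAGE_ENC_STATUS_UPTODATE (1ULL << 0)

/*
 * Model of the MSR_KVM_MIGRATION_CONTROL write handling: writes with
 * reserved bits set fail, and a non-zero write also fails unless the
 * guest was given the KVM_FEATURE_HC_PAGE_ENC_STATUS CPUID bit.
 * Returns 0 on success and 1 on #GP, matching KVM's convention.
 */
static int migration_control_wrmsr(uint64_t data, bool guest_has_feature)
{
    if (data & ~KVM_PAGE_ENC_STATUS_UPTODATE)
        return 1;                   /* reserved bits must be zero */
    if (data && !guest_has_feature)
        return 1;                   /* feature not exposed to guest */
    return 0;
}
```

Note that writing zero always succeeds, so the MSR has the "sensible behavior if not filtered" described in the cover letter.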
Re: [PATCH v13 12/12] x86/kvm: Add guest support for detecting and enabling SEV Live Migration feature.
On 15/04/21 18:01, Ashish Kalra wrote:
From: Ashish Kalra

The guest support for detecting and enabling the SEV live migration feature uses the following logic:

- kvm_init_platform() invokes check_kvm_sev_migration(), which checks if the guest is booted under EFI

- If not EFI:
  i) check for the KVM_FEATURE_CPUID
  ii) if CPUID reports that migration is supported, issue a wrmsrl() to enable the SEV live migration support

- If EFI:
  i) check for the KVM_FEATURE_CPUID
  ii) if CPUID reports that migration is supported, read the UEFI variable which indicates OVMF support for live migration
  iii) if the variable indicates live migration is supported, issue a wrmsrl() to enable the SEV live migration support

The EFI live migration check is done using a late_initcall() callback.

Also, ensure that the _bss_decrypted section is marked as decrypted in the shared pages list.

Also adds kexec support for SEV live migration. Reset the host's shared pages list related to kernel-specific page encryption status settings before we load a new kernel by kexec. We cannot reset the complete shared pages list here as we need to retain the UEFI/OVMF firmware-specific settings. The host's shared pages list is maintained for the guest to keep track of all unencrypted guest memory regions; therefore we need to explicitly mark all shared pages as encrypted again before rebooting into the new guest kernel.

Signed-off-by: Ashish Kalra

Boris, this one needs an ACK as well.
Paolo --- arch/x86/include/asm/mem_encrypt.h | 8 arch/x86/kernel/kvm.c | 55 + arch/x86/mm/mem_encrypt.c | 64 ++ 3 files changed, 127 insertions(+) diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h index 31c4df123aa0..19b77f3a62dc 100644 --- a/arch/x86/include/asm/mem_encrypt.h +++ b/arch/x86/include/asm/mem_encrypt.h @@ -21,6 +21,7 @@ extern u64 sme_me_mask; extern u64 sev_status; extern bool sev_enabled; +extern bool sev_live_migration_enabled; void sme_encrypt_execute(unsigned long encrypted_kernel_vaddr, unsigned long decrypted_kernel_vaddr, @@ -44,8 +45,11 @@ void __init sme_enable(struct boot_params *bp); int __init early_set_memory_decrypted(unsigned long vaddr, unsigned long size); int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size); +void __init early_set_mem_enc_dec_hypercall(unsigned long vaddr, int npages, + bool enc); void __init mem_encrypt_free_decrypted_mem(void); +void __init check_kvm_sev_migration(void); /* Architecture __weak replacement functions */ void __init mem_encrypt_init(void); @@ -60,6 +64,7 @@ bool sev_es_active(void); #else /* !CONFIG_AMD_MEM_ENCRYPT */ #define sme_me_mask 0ULL +#define sev_live_migration_enabled false static inline void __init sme_early_encrypt(resource_size_t paddr, unsigned long size) { } @@ -84,8 +89,11 @@ static inline int __init early_set_memory_decrypted(unsigned long vaddr, unsigned long size) { return 0; } static inline int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size) { return 0; } +static inline void __init +early_set_mem_enc_dec_hypercall(unsigned long vaddr, int npages, bool enc) {} static inline void mem_encrypt_free_decrypted_mem(void) { } +static inline void check_kvm_sev_migration(void) { } #define __bss_decrypted diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 78bb0fae3982..94ef16d263a7 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -26,6 +26,7 @@ #include #include #include 
+#include #include #include #include @@ -429,6 +430,59 @@ static inline void __set_percpu_decrypted(void *ptr, unsigned long size) early_set_memory_decrypted((unsigned long) ptr, size); } +static int __init setup_kvm_sev_migration(void) +{ + efi_char16_t efi_sev_live_migration_enabled[] = L"SevLiveMigrationEnabled"; + efi_guid_t efi_variable_guid = MEM_ENCRYPT_GUID; + efi_status_t status; + unsigned long size; + bool enabled; + + /* +* check_kvm_sev_migration() invoked via kvm_init_platform() before +* this callback would have setup the indicator that live migration +* feature is supported/enabled. +*/ + if (!sev_live_migration_enabled) + return 0; + + if (!efi_enabled(EFI_BOOT)) + return 0; + + if (!efi_enabled(EFI_RUNTIME_SERVICES)) { + pr_info("%s : EFI runtime services are not enabled\n", __func__); + return 0; + } + + size = sizeof(enabled); + + /* Get variable contents into buffer */ + status = efi.get_variable(efi_sev_live_migration_enabled, +
Re: [PATCH v13 10/12] KVM: x86: Introduce new KVM_FEATURE_SEV_LIVE_MIGRATION feature & Custom MSR.
On 20/04/21 01:06, Sean Christopherson wrote: diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 950afebfba88..f6bfa138874f 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -33,6 +33,7 @@ #define KVM_FEATURE_PV_SCHED_YIELD13 #define KVM_FEATURE_ASYNC_PF_INT 14 #define KVM_FEATURE_MSI_EXT_DEST_ID 15 +#define KVM_FEATURE_SEV_LIVE_MIGRATION 16 #define KVM_HINTS_REALTIME 0 @@ -54,6 +55,7 @@ #define MSR_KVM_POLL_CONTROL 0x4b564d05 #define MSR_KVM_ASYNC_PF_INT 0x4b564d06 #define MSR_KVM_ASYNC_PF_ACK 0x4b564d07 +#define MSR_KVM_SEV_LIVE_MIGRATION 0x4b564d08 struct kvm_steal_time { __u64 steal; @@ -136,4 +138,6 @@ struct kvm_vcpu_pv_apf_data { #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK #define KVM_PV_EOI_DISABLED 0x0 +#define KVM_SEV_LIVE_MIGRATION_ENABLED BIT_ULL(0) Even though the intent is to "force" userspace to intercept the MSR, I think KVM should at least emulate the legal bits as a nop. Deferring completely to userspace is rather bizarre as there's not really anything to justify KVM getting involved. It would also force userspace to filter the MSR just to support the hypercall. I think this is the intention, the hypercall by itself cannot do much if you cannot tell userspace that it's up-to-date. On the other hand it is kind of wrong that KVM_GET_SUPPORTED_CPUID returns the feature, but the MSR is not supported. Somewhat of a nit, but I think we should do something like s/ENABLED/READY, Agreed. I'll send a patch that puts everything together. Paolo
Re: [PATCH] KVM: Boost vCPU candidate in user mode which is delivering interrupt
On 20/04/21 10:48, Wanpeng Li wrote:

I was thinking of something simpler:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9b8e30dd5b9b..455c648f9adc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3198,10 +3198,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 {
 	struct kvm *kvm = me->kvm;
 	struct kvm_vcpu *vcpu;
-	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
 	int yielded = 0;
 	int try = 3;
-	int pass;
+	int pass, num_passes = 1;
 	int i;
 
 	kvm_vcpu_set_in_spin_loop(me, true);
@@ -3212,13 +3211,14 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
 	 */
-	for (pass = 0; pass < 2 && !yielded && try; pass++) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i <= last_boosted_vcpu) {
-				i = last_boosted_vcpu;
-				continue;
-			} else if (pass && i > last_boosted_vcpu)
-				break;
+	for (pass = 0; pass < num_passes; pass++) {
+		int idx = me->kvm->last_boosted_vcpu;
+		int n = atomic_read(&kvm->online_vcpus);
+		for (i = 0; i < n; i++, idx++) {
+			if (idx == n)
+				idx = 0;
+
+			vcpu = kvm_get_vcpu(kvm, idx);
 			if (!READ_ONCE(vcpu->ready))
 				continue;
 			if (vcpu == me)
@@ -3226,23 +3226,36 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 			if (rcuwait_active(&vcpu->wait) &&
 			    !vcpu_dy_runnable(vcpu))
 				continue;
-			if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
-			    !kvm_arch_vcpu_in_kernel(vcpu))
-				continue;
 			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
 				continue;
+			if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
+			    !kvm_arch_vcpu_in_kernel(vcpu)) {
+				/*
+				 * A vCPU running in userspace can get to kernel mode via
+				 * an interrupt.  That's a worse choice than a CPU already
+				 * in kernel mode so only do it on a second pass.
+				 */
+				if (!vcpu_dy_runnable(vcpu))
+					continue;
+				if (pass == 0) {
+					num_passes = 2;
+					continue;
+				}
+			}
+
 			yielded = kvm_vcpu_yield_to(vcpu);
 			if (yielded > 0) {
 				kvm->last_boosted_vcpu = i;
-				break;
+				goto done;
 			} else if (yielded < 0) {
 				try--;
 				if (!try)
-					break;
+					goto done;
 			}
 		}
 	}
+done:

We just tested the above post against a 96-vCPU VM in an over-subscribed scenario; the score of pbzip2 fluctuated drastically. Sometimes it is worse than vanilla, but the average improvement is around 2.2%. The new version of my post is around 9.3%, the original posted patch is around 10%, which is totally as expected since now both IPI receivers in user mode and lock waiters are second-class citizens.

Fair enough. Of the two patches you posted I prefer the original, so I'll go with that one.

Paolo
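The two-pass selection being discussed can be modeled as a small standalone function. The bitmask encoding of vCPU state and all names here are invented for this sketch; pass 0 only accepts vCPUs preempted in kernel mode, and pass 1 (entered only if a user-mode candidate with a pending event was seen) also accepts those second-class candidates:

```c
#include <stddef.h>

/*
 * v ready/in_kernel/irq_pending are bitmasks, bit i describing vCPU i.
 * Returns the index of the chosen yield target, or -1 if none.
 */
static int pick_yield_target(unsigned ready, unsigned in_kernel,
                             unsigned irq_pending, unsigned n, unsigned self)
{
    int num_passes = 1;

    for (int pass = 0; pass < num_passes; pass++) {
        for (unsigned i = 0; i < n; i++) {
            unsigned bit = 1u << i;

            if (i == self || !(ready & bit))
                continue;
            if (!(in_kernel & bit)) {
                if (!(irq_pending & bit))
                    continue;          /* never a useful target */
                if (pass == 0) {
                    num_passes = 2;    /* retry, admitting this one */
                    continue;
                }
            }
            return (int)i;             /* found a candidate */
        }
    }
    return -1;
}
```

A kernel-mode vCPU always wins over a user-mode vCPU with a pending interrupt; the latter is picked only when no kernel-mode candidate exists.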
Re: [PATCH v13 10/12] KVM: x86: Introduce new KVM_FEATURE_SEV_LIVE_MIGRATION feature & Custom MSR.
On 15/04/21 17:58, Ashish Kalra wrote: From: Ashish Kalra Add new KVM_FEATURE_SEV_LIVE_MIGRATION feature for guest to check for host-side support for SEV live migration. Also add a new custom MSR_KVM_SEV_LIVE_MIGRATION for guest to enable the SEV live migration feature. MSR is handled by userspace using MSR filters. Signed-off-by: Ashish Kalra Reviewed-by: Steve Rutherford Let's leave the MSR out for now and rename the feature to KVM_FEATURE_HC_PAGE_ENC_STATUS. Paolo --- Documentation/virt/kvm/cpuid.rst | 5 + Documentation/virt/kvm/msr.rst | 12 arch/x86/include/uapi/asm/kvm_para.h | 4 arch/x86/kvm/cpuid.c | 3 ++- 4 files changed, 23 insertions(+), 1 deletion(-) diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst index cf62162d4be2..0bdb6cdb12d3 100644 --- a/Documentation/virt/kvm/cpuid.rst +++ b/Documentation/virt/kvm/cpuid.rst @@ -96,6 +96,11 @@ KVM_FEATURE_MSI_EXT_DEST_ID15 guest checks this feature bit before using extended destination ID bits in MSI address bits 11-5. +KVM_FEATURE_SEV_LIVE_MIGRATION 16 guest checks this feature bit before + using the page encryption state + hypercall to notify the page state + change + KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24 host will warn if no guest-side per-cpu warps are expected in kvmclock diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst index e37a14c323d2..020245d16087 100644 --- a/Documentation/virt/kvm/msr.rst +++ b/Documentation/virt/kvm/msr.rst @@ -376,3 +376,15 @@ data: write '1' to bit 0 of the MSR, this causes the host to re-scan its queue and check if there are more notifications pending. The MSR is available if KVM_FEATURE_ASYNC_PF_INT is present in CPUID. + +MSR_KVM_SEV_LIVE_MIGRATION: +0x4b564d08 + + Control SEV Live Migration features. + +data: +Bit 0 enables (1) or disables (0) host-side SEV Live Migration feature, +in other words, this is guest->host communication that it's properly +handling the shared pages list. + +All other bits are reserved. 
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 950afebfba88..f6bfa138874f 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -33,6 +33,7 @@ #define KVM_FEATURE_PV_SCHED_YIELD13 #define KVM_FEATURE_ASYNC_PF_INT 14 #define KVM_FEATURE_MSI_EXT_DEST_ID 15 +#define KVM_FEATURE_SEV_LIVE_MIGRATION 16 #define KVM_HINTS_REALTIME 0 @@ -54,6 +55,7 @@ #define MSR_KVM_POLL_CONTROL 0x4b564d05 #define MSR_KVM_ASYNC_PF_INT 0x4b564d06 #define MSR_KVM_ASYNC_PF_ACK 0x4b564d07 +#define MSR_KVM_SEV_LIVE_MIGRATION 0x4b564d08 struct kvm_steal_time { __u64 steal; @@ -136,4 +138,6 @@ struct kvm_vcpu_pv_apf_data { #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK #define KVM_PV_EOI_DISABLED 0x0 +#define KVM_SEV_LIVE_MIGRATION_ENABLED BIT_ULL(0) + #endif /* _UAPI_ASM_X86_KVM_PARA_H */ diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 6bd2f8b830e4..4e2e69a692aa 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -812,7 +812,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) (1 << KVM_FEATURE_PV_SEND_IPI) | (1 << KVM_FEATURE_POLL_CONTROL) | (1 << KVM_FEATURE_PV_SCHED_YIELD) | -(1 << KVM_FEATURE_ASYNC_PF_INT); +(1 << KVM_FEATURE_ASYNC_PF_INT) | +(1 << KVM_FEATURE_SEV_LIVE_MIGRATION); if (sched_info_on()) entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
Re: [PATCH v13 09/12] mm: x86: Invoke hypercall when page encryption status is changed
On 15/04/21 17:57, Ashish Kalra wrote: From: Brijesh Singh Invoke a hypercall when a memory region is changed from encrypted -> decrypted and vice versa. Hypervisor needs to know the page encryption status during the guest migration. Boris, can you ack this patch? Paolo Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Paolo Bonzini Cc: Joerg Roedel Cc: Borislav Petkov Cc: Tom Lendacky Cc: x...@kernel.org Cc: k...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Steve Rutherford Reviewed-by: Venu Busireddy Signed-off-by: Brijesh Singh Signed-off-by: Ashish Kalra --- arch/x86/include/asm/paravirt.h | 10 + arch/x86/include/asm/paravirt_types.h | 2 + arch/x86/kernel/paravirt.c| 1 + arch/x86/mm/mem_encrypt.c | 57 ++- arch/x86/mm/pat/set_memory.c | 7 5 files changed, 76 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 4abf110e2243..efaa3e628967 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -84,6 +84,12 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct *mm) PVOP_VCALL1(mmu.exit_mmap, mm); } +static inline void page_encryption_changed(unsigned long vaddr, int npages, + bool enc) +{ + PVOP_VCALL3(mmu.page_encryption_changed, vaddr, npages, enc); +} + #ifdef CONFIG_PARAVIRT_XXL static inline void load_sp0(unsigned long sp0) { @@ -799,6 +805,10 @@ static inline void paravirt_arch_dup_mmap(struct mm_struct *oldmm, static inline void paravirt_arch_exit_mmap(struct mm_struct *mm) { } + +static inline void page_encryption_changed(unsigned long vaddr, int npages, bool enc) +{ +} #endif #endif /* __ASSEMBLY__ */ #endif /* _ASM_X86_PARAVIRT_H */ diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index de87087d3bde..69ef9c207b38 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -195,6 +195,8 @@ struct pv_mmu_ops { /* Hook for intercepting the destruction of 
an mm_struct. */ void (*exit_mmap)(struct mm_struct *mm); + void (*page_encryption_changed)(unsigned long vaddr, int npages, + bool enc); #ifdef CONFIG_PARAVIRT_XXL struct paravirt_callee_save read_cr2; diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index c60222ab8ab9..9f206e192f6b 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -335,6 +335,7 @@ struct paravirt_patch_template pv_ops = { (void (*)(struct mmu_gather *, void *))tlb_remove_page, .mmu.exit_mmap = paravirt_nop, + .mmu.page_encryption_changed= paravirt_nop, #ifdef CONFIG_PARAVIRT_XXL .mmu.read_cr2 = __PV_IS_CALLEE_SAVE(native_read_cr2), diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c index ae78cef79980..fae9ccbd0da7 100644 --- a/arch/x86/mm/mem_encrypt.c +++ b/arch/x86/mm/mem_encrypt.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include @@ -29,6 +30,7 @@ #include #include #include +#include #include "mm_internal.h" @@ -229,6 +231,47 @@ void __init sev_setup_arch(void) swiotlb_adjust_size(size); } +static void set_memory_enc_dec_hypercall(unsigned long vaddr, int npages, + bool enc) +{ + unsigned long sz = npages << PAGE_SHIFT; + unsigned long vaddr_end, vaddr_next; + + vaddr_end = vaddr + sz; + + for (; vaddr < vaddr_end; vaddr = vaddr_next) { + int psize, pmask, level; + unsigned long pfn; + pte_t *kpte; + + kpte = lookup_address(vaddr, ); + if (!kpte || pte_none(*kpte)) + return; + + switch (level) { + case PG_LEVEL_4K: + pfn = pte_pfn(*kpte); + break; + case PG_LEVEL_2M: + pfn = pmd_pfn(*(pmd_t *)kpte); + break; + case PG_LEVEL_1G: + pfn = pud_pfn(*(pud_t *)kpte); + break; + default: + return; + } + + psize = page_level_size(level); + pmask = page_level_mask(level); + + kvm_sev_hypercall3(KVM_HC_PAGE_ENC_STATUS, + pfn << PAGE_SHIFT, psize >> PAGE_SHIFT, enc); + + vaddr_next = (vaddr & pmask) + psize; + } +} + static void __init __set_clr_pte_enc(pte_t *kpte, int level, bool enc) { pgprot_t old_prot, new_prot; @@ 
-286,12 +329,13 @@ static void __init __set_clr_pte_enc(pte_t
[PATCH] KVM: x86: document behavior of measurement ioctls with len==0
Signed-off-by: Paolo Bonzini --- Documentation/virt/kvm/amd-memory-encryption.rst | 6 ++ 1 file changed, 6 insertions(+) diff --git a/Documentation/virt/kvm/amd-memory-encryption.rst b/Documentation/virt/kvm/amd-memory-encryption.rst index 469a6308765b..34ce2d1fcb89 100644 --- a/Documentation/virt/kvm/amd-memory-encryption.rst +++ b/Documentation/virt/kvm/amd-memory-encryption.rst @@ -148,6 +148,9 @@ measurement. Since the guest owner knows the initial contents of the guest at boot, the measurement can be verified by comparing it to what the guest owner expects. +If len is zero on entry, the measurement blob length is written to len and +uaddr is unused. + Parameters (in): struct kvm_sev_launch_measure Returns: 0 on success, -negative on error @@ -271,6 +274,9 @@ report containing the SHA-256 digest of the guest memory and VMSA passed through commands and signed with the PEK. The digest returned by the command should match the digest used by the guest owner with the KVM_SEV_LAUNCH_MEASURE. +If len is zero on entry, the measurement blob length is written to len and +uaddr is unused. + Parameters (in): struct kvm_sev_attestation Returns: 0 on success, -negative on error -- 2.26.2
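The documented len==0 convention is the usual two-call query pattern: first ask for the required size, then call again with a large-enough buffer. A toy model of the decision logic follows — real callers go through the KVM_MEMORY_ENCRYPT_OP ioctl, and the 48-byte blob length below is just an example value:

```c
#include <stdint.h>

#define EXAMPLE_BLOB_LEN 48  /* placeholder measurement blob size */

/*
 * Model of the ioctl's handling of the caller-supplied length:
 * len == 0 reports the required size (uaddr would be unused),
 * a short buffer is rejected, otherwise the blob is copied out.
 */
static int measure_result(uint32_t len_in, uint32_t blob_len)
{
    if (len_in == 0)
        return (int)blob_len;   /* query: report size, copy nothing */
    if (len_in < blob_len)
        return -1;              /* stand-in for -EINVAL: too small */
    return 0;                   /* success: blob copied to uaddr */
}
```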
Re: [PATCH v13 04/12] KVM: SVM: Add support for KVM_SEV_RECEIVE_START command
On 20/04/21 10:38, Paolo Bonzini wrote: On 15/04/21 17:54, Ashish Kalra wrote: + } + + sev->handle = start->handle; + sev->fd = argp->sev_fd; These two lines are spurious, I'll delete them. And this is wrong as well. My apologies. Paolo
Re: [PATCH v13 01/12] KVM: SVM: Add KVM_SEV SEND_START command
On 15/04/21 17:53, Ashish Kalra wrote: From: Brijesh Singh The command is used to create an outgoing SEV guest encryption context. Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Paolo Bonzini Cc: Joerg Roedel Cc: Borislav Petkov Cc: Tom Lendacky Cc: x...@kernel.org Cc: k...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Steve Rutherford Reviewed-by: Venu Busireddy Signed-off-by: Brijesh Singh Signed-off-by: Ashish Kalra --- .../virt/kvm/amd-memory-encryption.rst| 27 arch/x86/kvm/svm/sev.c| 125 ++ include/linux/psp-sev.h | 8 +- include/uapi/linux/kvm.h | 12 ++ 4 files changed, 168 insertions(+), 4 deletions(-) diff --git a/Documentation/virt/kvm/amd-memory-encryption.rst b/Documentation/virt/kvm/amd-memory-encryption.rst index 469a6308765b..ac799dd7a618 100644 --- a/Documentation/virt/kvm/amd-memory-encryption.rst +++ b/Documentation/virt/kvm/amd-memory-encryption.rst @@ -284,6 +284,33 @@ Returns: 0 on success, -negative on error __u32 len; }; +10. KVM_SEV_SEND_START +-- + +The KVM_SEV_SEND_START command can be used by the hypervisor to create an +outgoing guest encryption context. + +Parameters (in): struct kvm_sev_send_start + +Returns: 0 on success, -negative on error + +:: +struct kvm_sev_send_start { +__u32 policy; /* guest policy */ + +__u64 pdh_cert_uaddr; /* platform Diffie-Hellman certificate */ +__u32 pdh_cert_len; + +__u64 plat_certs_uaddr;/* platform certificate chain */ +__u32 plat_certs_len; + +__u64 amd_certs_uaddr;/* AMD certificate */ +__u32 amd_certs_len; + +__u64 session_uaddr; /* Guest session information */ +__u32 session_len; +}; + References == diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 874ea309279f..2b65900c05d6 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -1110,6 +1110,128 @@ static int sev_get_attestation_report(struct kvm *kvm, struct kvm_sev_cmd *argp) return ret; } +/* Userspace wants to query session length. 
*/ +static int +__sev_send_start_query_session_length(struct kvm *kvm, struct kvm_sev_cmd *argp, + struct kvm_sev_send_start *params) +{ + struct kvm_sev_info *sev = _kvm_svm(kvm)->sev_info; + struct sev_data_send_start *data; + int ret; + + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); + if (data == NULL) + return -ENOMEM; + + data->handle = sev->handle; + ret = sev_issue_cmd(kvm, SEV_CMD_SEND_START, data, >error); This is missing an "if (ret < 0)" (and this time I'm pretty sure it's indeed the case :)), otherwise you miss for example the EBADF return code if the SEV file descriptor is closed or reused. Same for KVM_SEND_UPDATE_DATA. Also, the length==0 case is not documented. Paolo + params->session_len = data->session_len; + if (copy_to_user((void __user *)(uintptr_t)argp->data, params, + sizeof(struct kvm_sev_send_start))) + ret = -EFAULT; + + kfree(data); + return ret; +} + +static int sev_send_start(struct kvm *kvm, struct kvm_sev_cmd *argp) +{ + struct kvm_sev_info *sev = _kvm_svm(kvm)->sev_info; + struct sev_data_send_start *data; + struct kvm_sev_send_start params; + void *amd_certs, *session_data; + void *pdh_cert, *plat_certs; + int ret; + + if (!sev_guest(kvm)) + return -ENOTTY; + + if (copy_from_user(, (void __user *)(uintptr_t)argp->data, + sizeof(struct kvm_sev_send_start))) + return -EFAULT; + + /* if session_len is zero, userspace wants to query the session length */ + if (!params.session_len) + return __sev_send_start_query_session_length(kvm, argp, + ); + + /* some sanity checks */ + if (!params.pdh_cert_uaddr || !params.pdh_cert_len || + !params.session_uaddr || params.session_len > SEV_FW_BLOB_MAX_SIZE) + return -EINVAL; + + /* allocate the memory to hold the session data blob */ + session_data = kmalloc(params.session_len, GFP_KERNEL_ACCOUNT); + if (!session_data) + return -ENOMEM; + + /* copy the certificate blobs from userspace */ + pdh_cert = psp_copy_user_blob(params.pdh_cert_uaddr, + params.pdh_cert_len); + if (IS_ERR(pdh_cert)) { + 
ret = PTR_ERR(pdh_cert); + goto e_free_session; + } + + pl
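The session_len == 0 query convention discussed in the review above can be exercised from userspace with a call-twice pattern: a first call that only reports the required buffer size, and a second call that produces the blob. The sketch below is not the real selftest or QEMU code; issue_send_start() is an invented stub standing in for the KVM_MEMORY_ENCRYPT_OP ioctl, and the 64-byte session length is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct send_start_args {
	uint64_t session_uaddr;
	uint32_t session_len;
};

#define REAL_SESSION_LEN 64u	/* invented for this sketch */

/* Stand-in for ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, KVM_SEV_SEND_START). */
static int issue_send_start(struct send_start_args *args)
{
	if (args->session_len == 0) {
		/* Query mode: only report the required buffer size. */
		args->session_len = REAL_SESSION_LEN;
		return 0;
	}
	if (args->session_len < REAL_SESSION_LEN || !args->session_uaddr)
		return -22;	/* -EINVAL */
	/* Pretend the PSP produced a session blob. */
	memset((void *)(uintptr_t)args->session_uaddr, 0xaa, REAL_SESSION_LEN);
	return 0;
}

/* Returns a malloc'd session blob of *len bytes, or NULL on error. */
static void *get_session_blob(uint32_t *len)
{
	struct send_start_args args = { 0, 0 };
	void *buf;

	if (issue_send_start(&args))		/* first call: query length */
		return NULL;
	buf = malloc(args.session_len);
	if (!buf)
		return NULL;
	args.session_uaddr = (uint64_t)(uintptr_t)buf;
	if (issue_send_start(&args)) {		/* second call: fetch blob */
		free(buf);
		return NULL;
	}
	*len = args.session_len;
	return buf;
}
```

This also illustrates why the review asks for the length==0 case to be documented: callers are expected to rely on it to size their buffers.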
Re: [PATCH v13 05/12] KVM: SVM: Add KVM_SEV_RECEIVE_UPDATE_DATA command
On 20/04/21 10:40, Paolo Bonzini wrote:
> On 15/04/21 17:55, Ashish Kalra wrote:
>> +	if (!guest_page)
>> +		goto e_free;
>> +
> Missing unpin on error (but it won't be needed with Sean's patches that
> move the data block to the stack, so I can fix this too).

No, sorry---the initialization order is different between
send_update_data and receive_update_data, so it's okay.

Paolo
Re: [PATCH v13 05/12] KVM: SVM: Add KVM_SEV_RECEIVE_UPDATE_DATA command
On 15/04/21 17:55, Ashish Kalra wrote:
> +	if (!guest_page)
> +		goto e_free;
> +
Missing unpin on error (but it won't be needed with Sean's patches that
move the data block to the stack, so I can fix this too).

Paolo
Re: [PATCH v13 04/12] KVM: SVM: Add support for KVM_SEV_RECEIVE_START command
On 15/04/21 17:54, Ashish Kalra wrote:
> +	}
> +
> +	sev->handle = start->handle;
> +	sev->fd = argp->sev_fd;
These two lines are spurious, I'll delete them.

Paolo
[PATCH] KVM: selftests: Always run vCPU thread with blocked SIG_IPI
The main thread could start to send SIG_IPI at any time, even before signal
blocked on vcpu thread.  Therefore, start the vcpu thread with the signal
blocked.

Without this patch, on very busy cores the dirty_log_test could fail directly
on receiving a SIGUSR1 without a handler (when vcpu runs far slower than main).

Reported-by: Peter Xu
Cc: sta...@vger.kernel.org
Signed-off-by: Paolo Bonzini
---
 tools/testing/selftests/kvm/dirty_log_test.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index ffa4e2791926..81edbd23d371 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -527,9 +527,8 @@ static void *vcpu_worker(void *data)
 	 */
 	sigmask->len = 8;
 	pthread_sigmask(0, NULL, sigset);
+	sigdelset(sigset, SIG_IPI);
 	vcpu_ioctl(vm, VCPU_ID, KVM_SET_SIGNAL_MASK, sigmask);
-	sigaddset(sigset, SIG_IPI);
-	pthread_sigmask(SIG_BLOCK, sigset, NULL);
 
 	sigemptyset(sigset);
 	sigaddset(sigset, SIG_IPI);
@@ -858,6 +857,7 @@ int main(int argc, char *argv[])
 		.interval = TEST_HOST_LOOP_INTERVAL,
 	};
 	int opt, i;
+	sigset_t sigset;
 
 	sem_init(&sem_vcpu_stop, 0, 0);
 	sem_init(&sem_vcpu_cont, 0, 0);
@@ -916,6 +916,11 @@ int main(int argc, char *argv[])
 
 	srandom(time(0));
 
+	/* Ensure that vCPU threads start with SIG_IPI blocked.  */
+	sigemptyset(&sigset);
+	sigaddset(&sigset, SIG_IPI);
+	pthread_sigmask(SIG_BLOCK, &sigset, NULL);
+
 	if (host_log_mode_option == LOG_MODE_ALL) {
 		/* Run each log mode */
 		for (i = 0; i < LOG_MODE_NUM; i++) {
-- 
2.26.2
Re: [PATCH v3 2/2] KVM: selftests: Wait for vcpu thread before signal setup
On 17/04/21 16:36, Peter Xu wrote:
> The main thread could start to send SIG_IPI at any time, even before
> signal blocked on vcpu thread.  Reuse the sem_vcpu_stop to sync on
> that, so when SIG_IPI is sent the signal will always land correctly as
> an -EINTR.
>
> Without this patch, on very busy cores the dirty_log_test could fail
> directly on receiving a SIG_USR1 without a handler (when vcpu runs far
> slower than main).
>
> Signed-off-by: Peter Xu

This should be a better way to do the same:

--- 8< ---
From: Paolo Bonzini
Subject: [PATCH] KVM: selftests: Wait for vcpu thread before signal setup

The main thread could start to send SIG_IPI at any time, even before signal
blocked on vcpu thread.  Therefore, start the vcpu thread with the signal
blocked.

Without this patch, on very busy cores the dirty_log_test could fail directly
on receiving a SIGUSR1 without a handler (when vcpu runs far slower than main).

Reported-by: Peter Xu
Cc: sta...@vger.kernel.org
Signed-off-by: Paolo Bonzini

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index ffa4e2791926..048973d50a45 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -527,9 +527,8 @@ static void *vcpu_worker(void *data)
 	 */
 	sigmask->len = 8;
 	pthread_sigmask(0, NULL, sigset);
+	sigdelset(sigset, SIG_IPI);
 	vcpu_ioctl(vm, VCPU_ID, KVM_SET_SIGNAL_MASK, sigmask);
-	sigaddset(sigset, SIG_IPI);
-	pthread_sigmask(SIG_BLOCK, sigset, NULL);
 
 	sigemptyset(sigset);
 	sigaddset(sigset, SIG_IPI);
@@ -858,6 +857,7 @@ int main(int argc, char *argv[])
 		.interval = TEST_HOST_LOOP_INTERVAL,
 	};
 	int opt, i;
+	sigset_t sigset;
 
 	sem_init(&sem_vcpu_stop, 0, 0);
 	sem_init(&sem_vcpu_cont, 0, 0);
@@ -916,6 +916,11 @@ int main(int argc, char *argv[])
 
 	srandom(time(0));
 
+	/* Ensure that vCPU threads start with SIG_IPI blocked.  */
+	sigemptyset(&sigset);
+	sigaddset(&sigset, SIG_IPI);
+	pthread_sigmask(SIG_BLOCK, &sigset, NULL);
+
 	if (host_log_mode_option == LOG_MODE_ALL) {
 		/* Run each log mode */
 		for (i = 0; i < LOG_MODE_NUM; i++) {

> ---
>  tools/testing/selftests/kvm/dirty_log_test.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
> index 510884f0eab8..25230e799bc4 100644
> --- a/tools/testing/selftests/kvm/dirty_log_test.c
> +++ b/tools/testing/selftests/kvm/dirty_log_test.c
> @@ -534,6 +534,12 @@ static void *vcpu_worker(void *data)
>  	sigemptyset(sigset);
>  	sigaddset(sigset, SIG_IPI);
>  
> +	/*
> +	 * Tell the main thread that signals are setup already; let's borrow
> +	 * sem_vcpu_stop even if it's not for it.
> +	 */
> +	sem_post(&sem_vcpu_stop);
> +
>  	guest_array = addr_gva2hva(vm, (vm_vaddr_t)random_array);
>  
>  	while (!READ_ONCE(host_quit)) {
> @@ -785,6 +791,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>  
>  	pthread_create(&vcpu_thread, NULL, vcpu_worker, vm);
>  
> +	sem_wait_until(&sem_vcpu_stop);
> +
>  	while (iteration < p->iterations) {
>  		/* Give the vcpu thread some time to dirty some pages */
>  		usleep(p->interval * 1000);
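Peter's sem_vcpu_stop handshake generalizes to any "wait until the child has finished its setup" problem: the child posts a semaphore once it is past the racy window, and the parent blocks on it before proceeding. A minimal userspace sketch of the same pattern (names invented; this is not the selftest code):

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t sem_ready;
static int setup_done;

static void *vcpu_thread(void *arg)
{
	setup_done = 1;		/* stand-in for the signal setup */
	sem_post(&sem_ready);	/* tell main we are past the racy window */
	/* ... the real work would only start here ... */
	return NULL;
}

/* Returns the value of setup_done as observed by main after sem_wait. */
static int run_handshake(void)
{
	pthread_t t;
	int ret;

	sem_init(&sem_ready, 0, 0);
	pthread_create(&t, NULL, vcpu_thread, NULL);
	sem_wait(&sem_ready);	/* main blocks until setup finished */
	ret = setup_done;
	pthread_join(t, NULL);
	return ret;
}
```

The sem_post/sem_wait pair also gives the needed memory ordering, so main is guaranteed to observe everything the child wrote before posting.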
Re: [PATCH v3 1/2] KVM: selftests: Sync data verify of dirty logging with guest sync
On 18/04/21 14:43, Peter Xu wrote:
> 8<-------------------------------------------
> diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
> index 25230e799bc4..d3050d1c2cd0 100644
> --- a/tools/testing/selftests/kvm/dirty_log_test.c
> +++ b/tools/testing/selftests/kvm/dirty_log_test.c
> @@ -377,7 +377,7 @@ static void dirty_ring_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
>  	/* A ucall-sync or ring-full event is allowed */
>  	if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
>  		/* We should allow this to continue */
> -		;
> +		vcpu_handle_sync_stop();
>  	} else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL ||
>  		   (ret == -1 && err == EINTR)) {
>  		/* Update the flag first before pause */
> 8<-------------------------------------------
>
> That's my intention when I introduce vcpu_handle_sync_stop(), but forgot
> to add...

And possibly even this (untested though):

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index ffa4e2791926..918954f01cef 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -383,6 +383,7 @@ static void dirty_ring_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
 		/* Update the flag first before pause */
 		WRITE_ONCE(dirty_ring_vcpu_ring_full,
 			   run->exit_reason == KVM_EXIT_DIRTY_RING_FULL);
+		atomic_set(&vcpu_sync_stop_requested, false);
 		sem_post(&sem_vcpu_stop);
 		pr_info("vcpu stops because %s...\n",
 			dirty_ring_vcpu_ring_full ?
@@ -804,8 +805,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	 * the flush of the last page, and since we handle the last
 	 * page specially verification will succeed anyway.
 	 */
-	assert(host_log_mode == LOG_MODE_DIRTY_RING ||
-	       atomic_read(&vcpu_sync_stop_requested) == false);
+	assert(atomic_read(&vcpu_sync_stop_requested) == false);
 	vm_dirty_log_verify(mode, bmap);
 	sem_post(&sem_vcpu_cont);

You can submit all these as a separate patch.

Paolo
Re: [PATCH] KVM: Boost vCPU candidiate in user mode which is delivering interrupt
On 20/04/21 08:08, Wanpeng Li wrote:
> On Tue, 20 Apr 2021 at 14:02, Wanpeng Li wrote:
>> On Tue, 20 Apr 2021 at 00:59, Paolo Bonzini wrote:
>>> On 19/04/21 18:32, Sean Christopherson wrote:
>>>> If false positives are a big concern, what about adding another pass
>>>> to the loop and only yielding to usermode vCPUs with interrupts in
>>>> the second full pass?  I.e. give vCPUs that are already in kernel
>>>> mode priority, and only yield to handle an interrupt if there are no
>>>> vCPUs in kernel mode.
>>>> kvm_arch_dy_runnable() pulls in pv_unhalted, which seems like a good
>>>> thing.
>>> pv_unhalted won't help if you're waiting for a kernel spinlock
>>> though, would it?  Doing two passes (or looking for a "best"
>>> candidate that prefers kernel mode vCPUs to user mode vCPUs waiting
>>> for an interrupt) seems like the best choice overall.
>> How about something like this:

I was thinking of something simpler:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9b8e30dd5b9b..455c648f9adc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3198,10 +3198,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 {
 	struct kvm *kvm = me->kvm;
 	struct kvm_vcpu *vcpu;
-	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
 	int yielded = 0;
 	int try = 3;
-	int pass;
+	int pass, num_passes = 1;
 	int i;
 
 	kvm_vcpu_set_in_spin_loop(me, true);
@@ -3212,13 +3211,14 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
 	 */
-	for (pass = 0; pass < 2 && !yielded && try; pass++) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i <= last_boosted_vcpu) {
-				i = last_boosted_vcpu;
-				continue;
-			} else if (pass && i > last_boosted_vcpu)
-				break;
+	for (pass = 0; pass < num_passes; pass++) {
+		int idx = me->kvm->last_boosted_vcpu;
+		int n = atomic_read(&kvm->online_vcpus);
+		for (i = 0; i < n; i++, idx++) {
+			if (idx == n)
+				idx = 0;
+
+			vcpu = kvm_get_vcpu(kvm, idx);
 			if (!READ_ONCE(vcpu->ready))
 				continue;
 			if (vcpu == me)
@@ -3226,23 +3226,36 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 			if (rcuwait_active(&vcpu->wait) &&
 			    !vcpu_dy_runnable(vcpu))
 				continue;
-			if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
-			    !kvm_arch_vcpu_in_kernel(vcpu))
-				continue;
 			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
 				continue;
 
+			if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
+			    !kvm_arch_vcpu_in_kernel(vcpu)) {
+				/*
+				 * A vCPU running in userspace can get to kernel mode via
+				 * an interrupt.  That's a worse choice than a CPU already
+				 * in kernel mode so only do it on a second pass.
+				 */
+				if (!vcpu_dy_runnable(vcpu))
+					continue;
+				if (pass == 0) {
+					num_passes = 2;
+					continue;
+				}
+			}
+
 			yielded = kvm_vcpu_yield_to(vcpu);
 			if (yielded > 0) {
 				kvm->last_boosted_vcpu = i;
-				break;
+				goto done;
 			} else if (yielded < 0) {
 				try--;
 				if (!try)
-					goto done;
+					goto done;
 			}
 		}
 	}
+done:
 	kvm_vcpu_set_in_spin_loop(me, false);
 
 	/* Ensure vcpu is not eligible during next spinloop */

Paolo
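The core of the rewrite is the wrap-around scan that starts at the remembered index and visits every vCPU exactly once, replacing the old two-pass kvm_for_each_vcpu trick. A standalone sketch of just that iteration, with eligibility reduced to a boolean array for illustration:

```c
#include <assert.h>

/*
 * Round-robin candidate selection: scan all n entries exactly once,
 * starting at last_boosted and wrapping around, and return the first
 * eligible index.  This mirrors the inner loop of the rewritten
 * kvm_vcpu_on_spin above, minus the KVM-specific eligibility checks.
 */
static int pick_next(const int *eligible, int n, int last_boosted)
{
	int i, idx = last_boosted;

	for (i = 0; i < n; i++, idx++) {
		if (idx == n)
			idx = 0;	/* wrap around to the start */
		if (eligible[idx])
			return idx;
	}
	return -1;	/* no eligible candidate this round */
}
```

Compared with the original "skip up to last_boosted on pass 0, stop there on pass 1" logic, the single wrapped loop is easier to reason about and to extend with the two-pass user/kernel-mode preference.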
Re: Doubt regarding memory allocation in KVM
On 20/04/21 07:45, Shivank Garg wrote:
> Hi,
> I'm learning about qemu KVM, looking into code and experimenting on it.
> I have the following doubts regarding it, I would be grateful if you
> help me to get some idea on them.
>
> 1. I observe that KVM allocates memory to guests when it needs it but
> doesn't take it back (except for ballooning case). Also, the Qemu/KVM
> process does not free the memory even when the guest is rebooted. In
> this case, does the Guest VM get access to memory already pre-filled
> with some garbage from the previous run?

Yes.  (The host allocates zeroed pages the first time the guest touches
them, but after that the contents are up to the guest.)

> Can it be a security issue?

No, it's the same that happens on a non-virtual machine.

> 2. How does the KVM know if GPFN (guest physical frame number) is
> backed by an actual machine frame number in host? If not mapped, then
> it faults in the host and allocates a physical frame for guests in the
> host. (kvm_mmu_page_fault)

It's all handled by Linux; KVM only does a call to get_user_pages.  See
the functions whose names start with hva_to_pfn in virt/kvm/kvm_main.c.

Given a GPA, the GFN is simply the guest physical address minus bits
0:11, i.e. shifted right by 12.

Paolo
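The GPA-to-GFN arithmetic mentioned above can be stated in two lines of C (assuming 4 KiB pages, hence a 12-bit page offset):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* 4 KiB pages */

/* The GFN is the GPA with the low 12 offset bits dropped. */
static uint64_t gpa_to_gfn(uint64_t gpa)
{
	return gpa >> PAGE_SHIFT;
}

/* The offset within the page: the bits 0:11 that the GFN discards. */
static uint64_t gpa_offset(uint64_t gpa)
{
	return gpa & ((1ull << PAGE_SHIFT) - 1);
}
```

The same split applies on the host side: an HVA's page-aligned part maps to a PFN, and the offset carries over unchanged.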
Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary
On 19/04/21 17:09, Sean Christopherson wrote: - this loses the rwsem fairness. On the other hand, mm/mmu_notifier.c's own interval-tree-based filter is also using a similar mechanism that is likewise not fair, so it should be okay. The one concern I had with an unfair mechanism of this nature is that, in theory, the memslot update could be blocked indefinitely. Yep, that's why I mentioned it. @@ -1333,9 +1351,22 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm, WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS); slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS; - down_write(>mmu_notifier_slots_lock); + /* +* This cannot be an rwsem because the MMU notifier must not run +* inside the critical section. A sleeping rwsem cannot exclude +* that. How on earth did you decipher that from the splat? I stared at it for a good five minutes and was completely befuddled. Just scratch that, it makes no sense. It's much simpler, but you have to look at include/linux/mmu_notifier.h to figure it out: invalidate_range_start take pseudo lock down_read() (*) release pseudo lock invalidate_range_end take pseudo lock (**) up_read() release pseudo lock At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock; at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock. This could cause a deadlock (ignoring for a second that the pseudo lock is not a lock): - invalidate_range_start waits on down_read(), because the rwsem is held by install_new_memslots - install_new_memslots waits on down_write(), because the rwsem is held till (another) invalidate_range_end finishes - invalidate_range_end sits waits on the pseudo lock, held by invalidate_range_start. Removing the fairness of the rwsem breaks the cycle (in lockdep terms, it would change the *shared* rwsem readers into *shared recursive* readers). This also means that there's no need for a raw spinlock. 
Given this simple explanation, I think it's okay to include this patch in the merge window pull request, with the fix after my signature squashed in. The fix actually undoes a lot of the changes to __kvm_handle_hva_range that this patch made, so the result is relatively simple. You can already find the result in kvm/queue. Paolo From daefeeb229ba8be5bd819a51875bc1fd5e74fc85 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Mon, 19 Apr 2021 09:01:46 -0400 Subject: [PATCH] KVM: avoid "deadlock" between install_new_memslots and MMU notifier Wanpeng Li is reporting this splat: == WARNING: possible circular locking dependency detected 5.12.0-rc3+ #6 Tainted: G OE -- qemu-system-x86/3069 is trying to acquire lock: 9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190 but task is already holding lock: aff7410a9160 (>mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm] which lock already depends on the new lock. This corresponds to the following MMU notifier logic: invalidate_range_start take pseudo lock down_read() (*) release pseudo lock invalidate_range_end take pseudo lock (**) up_read() release pseudo lock At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock; at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock. This could cause a deadlock (ignoring for a second that the pseudo lock is not a lock): - invalidate_range_start waits on down_read(), because the rwsem is held by install_new_memslots - install_new_memslots waits on down_write(), because the rwsem is held till (another) invalidate_range_end finishes - invalidate_range_end sits waits on the pseudo lock, held by invalidate_range_start. Removing the fairness of the rwsem breaks the cycle (in lockdep terms, it would change the *shared* rwsem readers into *shared recursive* readers), so open-code the wait using a readers count and a spinlock. 
This also allows handling blockable and non-blockable critical section in the same way. Losing the rwsem fairness does theoretically allow MMU notifiers to block install_new_memslots forever. Note that mm/mmu_notifier.c's own retry scheme in mmu_interval_read_begin also uses wait/wake_up and is likewise not fair. Signed-off-by: Paolo Bonzini --- Documentation/virt/kvm/locking.rst | 9 +-- include/linux/kvm_host.h | 8 +- virt/kvm/kvm_main.c| 119 ++--- 3 files changed, 67 insertions(+), 69 deletions(-) diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst index 8f5d5bcf5689..e628f48dfdda 100644 --- a/Documentation/virt/kvm/locking.
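The readers-count-plus-lock scheme from the squashed fix can be sketched in userspace, with a pthread mutex and condition variable standing in for the kernel's spinlock and rcuwait. This is an analogue for illustration only, not the kvm_main.c code: readers (the invalidate_range_start/end pairs) bump and drop a counter, and the writer (install_new_memslots) waits until the counter is zero before publishing, so both callbacks of any in-flight pair see the same memslots array.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t mn_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t mn_cv = PTHREAD_COND_INITIALIZER;
static unsigned long mn_active_invalidate_count;
static int published_generation;

static void invalidate_range_start(void)
{
	pthread_mutex_lock(&mn_lock);
	mn_active_invalidate_count++;	/* block memslot updates */
	pthread_mutex_unlock(&mn_lock);
}

static void invalidate_range_end(void)
{
	pthread_mutex_lock(&mn_lock);
	if (--mn_active_invalidate_count == 0)
		pthread_cond_broadcast(&mn_cv);	/* wake the waiting writer */
	pthread_mutex_unlock(&mn_lock);
}

static void install_new_memslots(int gen)
{
	pthread_mutex_lock(&mn_lock);
	while (mn_active_invalidate_count != 0)
		pthread_cond_wait(&mn_cv, &mn_lock);
	published_generation = gen;	/* no start/end pair can straddle this */
	pthread_mutex_unlock(&mn_lock);
}
```

As in the commit message, this waiting is unfair: a steady stream of readers can starve the writer, which is acceptable here because mm/mmu_notifier.c's own interval-tree filter has the same property.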
[PATCH fixed] KVM: x86/mmu: Fast invalidation for TDP MMU
From: Ben Gardon

Provide a real mechanism for fast invalidation by marking roots as
invalid so that their reference count will quickly fall to zero
and they will be torn down.

One negative side affect of this approach is that a vCPU thread will
likely drop the last reference to a root and be saddled with the work of
tearing down an entire paging structure. This issue will be resolved in
a later commit.

Signed-off-by: Ben Gardon
Message-Id: <20210401233736.638171-13-bgar...@google.com>
[Move the loop to tdp_mmu.c, otherwise compilation fails on 32-bit. - Paolo]
Signed-off-by: Paolo Bonzini
---
 arch/x86/kvm/mmu/mmu.c     | 12 +++++++++---
 arch/x86/kvm/mmu/tdp_mmu.c | 17 +++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h |  5 +++++
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 61c0d1091fc5..3323ab281f34 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5421,6 +5421,15 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 	 */
 	kvm->arch.mmu_valid_gen = kvm->arch.mmu_valid_gen ? 0 : 1;
 
+	/*
+	 * In order to ensure all threads see this change when
+	 * handling the MMU reload signal, this must happen in the
+	 * same critical section as kvm_reload_remote_mmus, and
+	 * before kvm_zap_obsolete_pages as kvm_zap_obsolete_pages
+	 * could drop the MMU lock and yield.
+	 */
+	if (is_tdp_mmu_enabled(kvm))
+		kvm_tdp_mmu_invalidate_all_roots(kvm);
+
 	/*
 	 * Notify all vcpus to reload its shadow page table and flush TLB.
 	 * Then all vcpus will switch to new shadow page table with the new
@@ -5433,9 +5442,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 
 	kvm_zap_obsolete_pages(kvm);
 
-	if (is_tdp_mmu_enabled(kvm))
-		kvm_tdp_mmu_zap_all(kvm);
-
 	write_unlock(&kvm->mmu_lock);
 }
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 335c126d9860..bc9308a104b1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -797,6 +797,23 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
 		kvm_flush_remote_tlbs(kvm);
 }
 
+/*
+ * Mark each TDP MMU root as invalid so that other threads
+ * will drop their references and allow the root count to
+ * go to 0.
+ *
+ * This has essentially the same effect for the TDP MMU
+ * as updating mmu_valid_gen does for the shadow MMU.
+ */
+void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
+{
+	struct kvm_mmu_page *root;
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link)
+		root->role.invalid = true;
+}
+
 /*
  * Installs a last-level SPTE to handle a TDP page fault.
  * (NPT/EPT violation/misconfiguration)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 2e1913bbc0ba..25ec0173700e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -10,6 +10,9 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
 __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm *kvm,
 						     struct kvm_mmu_page *root)
 {
+	if (root->role.invalid)
+		return false;
+
 	return refcount_inc_not_zero(&root->tdp_mmu_root_count);
 }
 
@@ -43,7 +46,9 @@ static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 	return __kvm_tdp_mmu_zap_gfn_range(kvm, kvm_mmu_page_as_id(sp),
 					   sp->gfn, end, false, false, false);
 }
+
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
+void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
 
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		    int map_writable, int max_level, kvm_pfn_t pfn,
-- 
2.26.2
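The kvm_tdp_mmu_get_root() gate above combines an invalid flag with a refcount_inc_not_zero()-style acquisition: once a root is marked invalid, no new references can be taken, so the count can only fall to zero. A minimal userspace model using C11 atomics (not the kernel implementation; refcount saturation and the exact memory ordering are omitted for clarity):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct root {
	atomic_int refcount;
	bool invalid;
};

static bool get_root(struct root *r)
{
	int old;

	if (r->invalid)		/* invalidated roots refuse new references */
		return false;

	/* refcount_inc_not_zero(): CAS loop that refuses to revive 0. */
	old = atomic_load(&r->refcount);
	do {
		if (old == 0)
			return false;
	} while (!atomic_compare_exchange_weak(&r->refcount, &old, old + 1));
	return true;
}

static void put_root(struct root *r)
{
	/* The real code tears the root down when this hits zero. */
	atomic_fetch_sub(&r->refcount, 1);
}
```

This is why marking all roots invalid is a "fast" invalidation: the marking itself is O(number of roots), and the actual teardown happens lazily as the last reference to each root is dropped.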
Re: [PATCH 00/13] [RFC] Rust support
On 19/04/21 19:14, Linus Torvalds wrote:
> On Mon, Apr 19, 2021 at 2:36 AM Peter Zijlstra wrote:
>> I also don't see how this is better than seq_cst.
>>
>> But yes, not broken, but also very much not optimal.
>
> I continue to feel like kernel people should just entirely ignore the
> C++ memory ordering standard.
>
> It's inferior to what we already have, and simply not helpful.  It
> doesn't actually solve any problems as far as the kernel is concerned,
> and it generates its own set of issues (ie assuming that the compiler
> supports it, and assuming the compiler gets it right).
>
> The really subtle cases that it could have been helpful for (eg RCU,
> or the load-store control dependencies) were _too_ subtle for the
> standard.
>
> And I do not believe Rust changes _any_ of that.
>
> Any kernel Rust code will simply have to follow the LKMM rules, and
> use the kernel model for the interfaces.  Things like the C++ memory
> model is simply not _relevant_ to the kernel.

It changes it for the worse, in that access to fields that are shared
across threads *must* either use atomic types (which boil down to the
same compiler intrinsics as the C/C++ memory model) or synchronization
primitives.  LKMM operates in the grey area between the C standard and
what gcc/clang actually implement, but there's no such grey area in
Rust unless somebody wants to rewrite the arch/*/ atomic access
primitives and memory barriers in Rust.

Of course it's possible to say Rust code just uses the C/C++/Rust model
and C code follows the LKMM, but that really only delays the inevitable
until a driver is written part in C part in Rust, and needs to perform
accesses outside synchronization primitives.

Paolo
Re: [PATCH] KVM: Boost vCPU candidiate in user mode which is delivering interrupt
On 19/04/21 18:32, Sean Christopherson wrote:
> If false positives are a big concern, what about adding another pass to
> the loop and only yielding to usermode vCPUs with interrupts in the
> second full pass?  I.e. give vCPUs that are already in kernel mode
> priority, and only yield to handle an interrupt if there are no vCPUs
> in kernel mode.
>
> kvm_arch_dy_runnable() pulls in pv_unhalted, which seems like a good
> thing.

pv_unhalted won't help if you're waiting for a kernel spinlock though,
would it?  Doing two passes (or looking for a "best" candidate that
prefers kernel mode vCPUs to user mode vCPUs waiting for an interrupt)
seems like the best choice overall.

Paolo
Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary
On 19/04/21 10:49, Wanpeng Li wrote: I saw this splatting: == WARNING: possible circular locking dependency detected 5.12.0-rc3+ #6 Tainted: G OE -- qemu-system-x86/3069 is trying to acquire lock: 9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190 but task is already holding lock: aff7410a9160 (>mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm] I guess it is possible to open-code the wait using a readers count and a spinlock (see patch after signature). This allows including the rcu_assign_pointer in the same critical section that checks the number of readers. Also on the plus side, the init_rwsem() is replaced by slightly nicer code. IIUC this could be extended to non-sleeping invalidations too, but I am not really sure about that. There are some issues with the patch though: - I am not sure if this should be a raw spin lock to avoid the same issue on PREEMPT_RT kernel. That said the critical section is so tiny that using a raw spin lock may make sense anyway - this loses the rwsem fairness. On the other hand, mm/mmu_notifier.c's own interval-tree-based filter is also using a similar mechanism that is likewise not fair, so it should be okay. Any opinions? For now I placed the change below in kvm/queue, but I'm leaning towards delaying this optimization to the next merge window. Paolo diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst index 8f5d5bcf5689..e628f48dfdda 100644 --- a/Documentation/virt/kvm/locking.rst +++ b/Documentation/virt/kvm/locking.rst @@ -16,12 +16,11 @@ The acquisition orders for mutexes are as follows: - kvm->slots_lock is taken outside kvm->irq_lock, though acquiring them together is quite rare. 
-- The kvm->mmu_notifier_slots_lock rwsem ensures that pairs of +- kvm->mn_active_invalidate_count ensures that pairs of invalidate_range_start() and invalidate_range_end() callbacks - use the same memslots array. kvm->slots_lock is taken outside the - write-side critical section of kvm->mmu_notifier_slots_lock, so - MMU notifiers must not take kvm->slots_lock. No other write-side - critical sections should be added. + use the same memslots array. kvm->slots_lock is taken on the + waiting side in install_new_memslots, so MMU notifiers must not + take kvm->slots_lock. On x86: diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 76b340dd6981..44a4a0c5148a 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -472,11 +472,15 @@ struct kvm { #endif /* KVM_HAVE_MMU_RWLOCK */ struct mutex slots_lock; - struct rw_semaphore mmu_notifier_slots_lock; struct mm_struct *mm; /* userspace tied to this vm */ struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM]; struct kvm_vcpu *vcpus[KVM_MAX_VCPUS]; + /* Used to wait for completion of MMU notifiers. */ + spinlock_t mn_invalidate_lock; + unsigned long mn_active_invalidate_count; + struct rcuwait mn_memslots_update_rcuwait; + /* * created_vcpus is protected by kvm->lock, and is incremented * at the beginning of KVM_CREATE_VCPU. 
online_vcpus is only @@ -662,7 +666,7 @@ static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id) as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM); return srcu_dereference_check(kvm->memslots[as_id], >srcu, lockdep_is_held(>slots_lock) || - lockdep_is_held(>mmu_notifier_slots_lock) || + READ_ONCE(kvm->mn_active_invalidate_count) || !refcount_read(>users_count)); } diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index ff9e95eb6960..cdaa1841e725 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -624,7 +624,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn, * otherwise, mmu_notifier_count is incremented unconditionally. */ if (!kvm->mmu_notifier_count) { - lockdep_assert_held(>mmu_notifier_slots_lock); + WARN_ON(!READ_ONCE(kvm->mn_active_invalidate_count)); return; } @@ -689,10 +689,13 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, * The complexity required to handle conditional locking for this case * is not worth the marginal benefits, the VM is likely doomed anyways. * -* Pairs with the up_read in range_end(). +* Pairs with the decrement in range_end(). */ - if (blockable) - down_read(>mmu_notifier_slots_lock); + if (blockable) { +
Re: [PATCH 00/13] [RFC] Rust support
On 19/04/21 11:36, Peter Zijlstra wrote:
> On Mon, Apr 19, 2021 at 11:02:12AM +0200, Paolo Bonzini wrote:
>>> void writer(void)
>>> {
>>> 	atomic_store_explicit(&seq, seq+1, memory_order_relaxed);
>>> 	atomic_thread_fence(memory_order_acquire);
>> This needs to be memory_order_release.  The only change in the
>> resulting assembly is that "dmb ishld" becomes "dmb ish", which is
>> not as good as the "dmb ishst" you get from smp_wmb() but not buggy
>> either.
>
> Yuck!  And that is what requires the insides to be
> atomic_store_explicit(), otherwise this fence doesn't have to affect
> them.

Not just that, even the write needs to be atomic_store_explicit in
order to avoid a data race.

> I also don't see how this is better than seq_cst.

It is better than seq_cst on TSO architectures.  Another possibility is
to use release stores for everything (both increments and the stores
between them).

> But yes, not broken, but also very much not optimal.

Agreed on that, just like RCU/memory_order_consume.

Paolo
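Putting the thread's conclusions together, a C11 seqlock would use a relaxed increment followed by a release fence on the write side, relaxed atomic accesses for the protected data (so the fences order them and no data race occurs), and an acquire fence before the re-read of the sequence on the read side. The sketch below illustrates that idiom under the assumption of a single writer; it is a reading of the discussion, not code from any tree:

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_uint seq;
static atomic_int x, y;	/* the seqlock-protected data */

/* Write side (single writer assumed). */
static void write_pair(int a, int b)
{
	unsigned s = atomic_load_explicit(&seq, memory_order_relaxed);

	atomic_store_explicit(&seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* not acquire! */
	atomic_store_explicit(&x, a, memory_order_relaxed);
	atomic_store_explicit(&y, b, memory_order_relaxed);
	atomic_store_explicit(&seq, s + 2, memory_order_release);
}

/* Read side: retry while a writer is, or was, active. */
static void read_pair(int *a, int *b)
{
	unsigned s1, s2;

	do {
		s1 = atomic_load_explicit(&seq, memory_order_acquire);
		*a = atomic_load_explicit(&x, memory_order_relaxed);
		*b = atomic_load_explicit(&y, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s2 = atomic_load_explicit(&seq, memory_order_relaxed);
	} while (s1 != s2 || (s1 & 1));	/* odd seq = writer in progress */
}
```

The release fence after the first increment is the point Paolo makes above: it orders the odd sequence value before the subsequent (relaxed atomic) data stores, which an acquire fence would not.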
Re: [PATCH 00/13] [RFC] Rust support
On 19/04/21 10:26, Peter Zijlstra wrote:
> On Mon, Apr 19, 2021 at 09:53:06AM +0200, Paolo Bonzini wrote:
>> On 19/04/21 09:32, Peter Zijlstra wrote:
>>> On Sat, Apr 17, 2021 at 04:51:58PM +0200, Paolo Bonzini wrote:
>>>> On 16/04/21 09:09, Peter Zijlstra wrote:
>>>>> Well, the obvious example would be seqlocks. C11 can't do them
>>>> Sure it can.  C11 requires annotating with (the equivalent of)
>>>> READ_ONCE all reads of seqlock-protected fields, but the memory
>>>> model supports seqlocks just fine.
>>> How does that help?
>>>
>>> IIRC there's two problems, one on each side the lock. On the write
>>> side we have:
>>>
>>> 	seq++;
>>> 	smp_wmb();
>>> 	X = r;
>>> 	Y = r;
>>> 	smp_wmb();
>>> 	seq++;
>>>
>>> Which C11 simply cannot do right because it does't have wmb.
>> It has atomic_thread_fence(memory_order_release), and
>> atomic_thread_fence(memory_order_acquire) on the read side.
>
> https://godbolt.org/z/85xoPxeE5
>
> void writer(void)
> {
> 	atomic_store_explicit(&seq, seq+1, memory_order_relaxed);
> 	atomic_thread_fence(memory_order_acquire);

This needs to be memory_order_release.  The only change in the
resulting assembly is that "dmb ishld" becomes "dmb ish", which is not
as good as the "dmb ishst" you get from smp_wmb() but not buggy either.
The read side can use "dmb ishld" so it gets the same code as Linux.

> LWN needs a "C11 memory model for kernel folks" article.

In the meanwhile there is
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0124r4.html
which is the opposite (Linux kernel memory model for C11 folks).

Paolo

> 	X = 1;
> 	Y = 2;
> 	atomic_store_explicit(&seq, seq+1, memory_order_release);
> }
>
> gives:
>
> writer:
> 	adrp	x1, .LANCHOR0
> 	add	x0, x1, :lo12:.LANCHOR0
> 	ldr	w2, [x1, #:lo12:.LANCHOR0]
> 	add	w2, w2, 1
> 	str	w2, [x0]
> 	dmb	ishld
> 	ldr	w1, [x1, #:lo12:.LANCHOR0]
> 	mov	w3, 1
> 	mov	w2, 2
> 	stp	w3, w2, [x0, 4]
> 	add	w1, w1, w3
> 	stlr	w1, [x0]
> 	ret
>
> Which, afaict, is completely buggered.
>
> What it seems to be doing is turning the seq load into a load-acquire,
> but what we really need is to make sure the seq store (increment) is
> ordered before the other stores.
Re: [PATCH 00/13] [RFC] Rust support
On 19/04/21 09:32, Peter Zijlstra wrote: On Sat, Apr 17, 2021 at 04:51:58PM +0200, Paolo Bonzini wrote: On 16/04/21 09:09, Peter Zijlstra wrote: Well, the obvious example would be seqlocks. C11 can't do them Sure it can. C11 requires annotating with (the equivalent of) READ_ONCE all reads of seqlock-protected fields, but the memory model supports seqlocks just fine. How does that help? IIRC there's two problems, one on each side of the lock. On the write side we have: seq++; smp_wmb(); X = r; Y = r; smp_wmb(); seq++; Which C11 simply cannot do right because it doesn't have wmb. It has atomic_thread_fence(memory_order_release), and atomic_thread_fence(memory_order_acquire) on the read side. You end up having to use seq_cst for the first wmb or make both X and Y (on top of the last seq) a store-release, both options are sub-optimal. seq_cst (except for the fence which is just smp_mb) is a pile of manure, no doubt about that. :) Paolo
Re: [PATCH 00/13] [RFC] Rust support
On 16/04/21 17:58, Theodore Ts'o wrote: Another fairly common use case is a lockless, racy test of a particular field, as an optimization before we take the lock and test it for realsies. In this particular case, we can't allocate memory while holding a spinlock, so we check, without taking the spinlock, whether we should allocate memory (which is expensive, and unnecessary most of the time): alloc_transaction: /* * This check is racy but it is just an optimization of allocating new * transaction early if there are high chances we'll need it. If we * guess wrong, we'll retry or free the unused transaction. */ if (!data_race(journal->j_running_transaction)) { /* * If __GFP_FS is not present, then we may be being called from * inside the fs writeback layer, so we MUST NOT fail. */ if ((gfp_mask & __GFP_FS) == 0) gfp_mask |= __GFP_NOFAIL; new_transaction = kmem_cache_zalloc(transaction_cache, gfp_mask); if (!new_transaction) return -ENOMEM; } From my limited experience with Rust, things like these are a bit annoying indeed, sooner or later Mutex<> just doesn't cut it and you have to deal with its limitations. In this particular case you would use an AtomicBool field, place it outside the Mutex-protected struct, and make sure that is only accessed under the lock just like in C. One easy way out is to make the Mutex protect (officially) nothing, i.e. Mutex<()>, and handle the mutable fields yourself using RefCell (which gives you run-time checking but has some space cost) or UnsafeCell (which is unsafe as the name says). Rust makes it pretty easy to write smart pointers (Mutex<>'s lock guard itself is a smart pointer) so you also have the possibility of writing a safe wrapper for the combination of Mutex<()> and UnsafeCell. Another example is when you have a list of XYZ objects and use the same mutex for both the list of XYZ and a field in struct XYZ. 
You could place that field in an UnsafeCell and write a function that receives a guard for the list lock and returns the field, or something like that. It *is* quite ugly though. As an aside, from a teaching standpoint associating a Mutex with a specific data structure is bad IMNSHO, because it encourages too fine-grained locking. Sometimes the easiest path to scalability is to use a more coarse lock and ensure that contention is extremely rare. But it does work for most simple use cases (and device drivers would qualify as simple more often than not). Paolo
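The racy pre-check pattern Ted quotes can be sketched in portable C11, with a relaxed atomic load playing the role of data_race(). Everything here is an illustrative stand-in for the jbd2 code above (the names, the lock, the 64-byte allocation size), not real kernel code; the real code also retries if the guess turns out wrong:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

static atomic_flag journal_lock = ATOMIC_FLAG_INIT;	/* stand-in spinlock */
static _Atomic(void *) running_transaction;

static void lock_journal(void)
{
	while (atomic_flag_test_and_set_explicit(&journal_lock, memory_order_acquire))
		;	/* spin */
}

static void unlock_journal(void)
{
	atomic_flag_clear_explicit(&journal_lock, memory_order_release);
}

static int start_transaction(void)
{
	void *new_txn = NULL;

	/*
	 * Racy, lockless peek, equivalent to the data_race() read above:
	 * if a transaction already looks running, skip the expensive
	 * allocation.  A wrong guess costs a retry or a free, never
	 * correctness.
	 */
	if (!atomic_load_explicit(&running_transaction, memory_order_relaxed)) {
		new_txn = malloc(64);	/* size is arbitrary for the sketch */
		if (!new_txn)
			return -1;
	}

	lock_journal();
	/* re-check under the lock; this read is the authoritative one */
	if (new_txn &&
	    !atomic_load_explicit(&running_transaction, memory_order_relaxed)) {
		atomic_store_explicit(&running_transaction, new_txn,
				      memory_order_relaxed);
		new_txn = NULL;
	}
	unlock_journal();

	free(new_txn);	/* guessed wrong (or lost the race): discard */
	return 0;
}
```

This is exactly the shape that a bare Rust Mutex<T> makes awkward: the optimistic read happens outside the lock, so the flag has to live outside the Mutex-protected data, as described above.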
Re: [PATCH 00/13] [RFC] Rust support
On 16/04/21 09:09, Peter Zijlstra wrote: Well, the obvious example would be seqlocks. C11 can't do them Sure it can. C11 requires annotating with (the equivalent of) READ_ONCE all reads of seqlock-protected fields, but the memory model supports seqlocks just fine. Similar thing for RCU; C11 can't optimally do that Technically if you know what you're doing (i.e. that you're not on Alpha) you can do RCU using a relaxed load followed by an atomic_signal_fence(memory_order_consume). Which I agree is horrible and not entirely within the standard, but it works in practice. The Linux implementation of memory barriers, atomic RMW primitives, load-acquire/store-release etc. is also completely outside the standard, so it's not much different and more portable. The only thing that I really, really miss when programming with C11 atomics is smp_mb__{before,after}_atomic(). Paolo
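The rcu_dereference()-style trick described above, sketched in user-space C11. As the email says, this leans on the hardware preserving data dependencies (i.e. not Alpha) and is not strictly within the standard; the struct and names are illustrative:

```c
#include <assert.h>
#include <stdatomic.h>

struct conf {
	int a, b;
};

static struct conf conf0 = { 1, 2 };
static _Atomic(struct conf *) cur_conf;

/* publisher side: the store-release pairs with the reader's dependency */
static void publish(struct conf *c)
{
	atomic_store_explicit(&cur_conf, c, memory_order_release);
}

/*
 * rcu_dereference()-like read: a relaxed load followed by a compiler-only
 * consume fence.  atomic_signal_fence() emits no instruction; it merely
 * stops the compiler from breaking the address dependency.
 */
static struct conf *read_conf(void)
{
	struct conf *c = atomic_load_explicit(&cur_conf, memory_order_relaxed);

	atomic_signal_fence(memory_order_consume);
	return c;
}
```

The portable-but-slower alternative is simply memory_order_acquire on the load, which is what memory_order_consume degrades to in every current compiler anyway.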
Re: [PATCH v6 03/10] KVM: selftests: Use flag CLOCK_MONOTONIC_RAW for timing
On 30/03/21 10:08, Yanan Wang wrote: In addition to the function of CLOCK_MONOTONIC, flag CLOCK_MONOTONIC_RAW can also shield the possible impact of NTP, which can provide more robustness. Suggested-by: Vitaly Kuznetsov Signed-off-by: Yanan Wang Reviewed-by: Ben Gardon Reviewed-by: Andrew Jones I'm not sure about this one, is the effect visible? Paolo
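For context, the difference between the two clocks is that CLOCK_MONOTONIC can be slewed by NTP rate adjustment while CLOCK_MONOTONIC_RAW reads raw hardware time. A minimal timing helper in the style the selftests use (illustrative, not the selftest's actual code):

```c
#define _GNU_SOURCE	/* for CLOCK_MONOTONIC_RAW on older glibc */
#include <assert.h>
#include <time.h>

/* elapsed time between two CLOCK_MONOTONIC_RAW samples, in seconds */
static double elapsed_seconds(const struct timespec *start,
			      const struct timespec *end)
{
	return (end->tv_sec - start->tv_sec) +
	       (end->tv_nsec - start->tv_nsec) / 1e9;
}
```

Whether the change is measurable is exactly Paolo's question: for a test that runs a few seconds, NTP slew is bounded to parts in ten thousand, typically well under run-to-run noise.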
Re: [PATCH v6 00/10] KVM: selftests: some improvement and a new test for kvm page table
On 06/04/21 05:05, wangyanan (Y) wrote: Kindly ping... Hi Paolo, Will this series be picked up soon, or is there any other work for me to do? Regards, Yanan On 2021/3/30 16:08, Yanan Wang wrote: Hi, This v6 series can mainly include two parts. Rebased on kvm queue branch: https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=queue In the first part, all the known hugetlb backing src types specified with different hugepage sizes are listed, so that we can specify use of hugetlb source of the exact granularity that we want, instead of the system default ones. And as all the known hugetlb page sizes are listed, it's appropriate for all architectures. Besides, a helper that can get granularity of different backing src types (anonymous/thp/hugetlb) is added, so that we can use the accurate backing src granularity for kinds of alignment or guest memory accessing of vcpus. In the second part, a new test is added: This test is added to serve as a performance tester and a bug reproducer for kvm page table code (GPA->HPA mappings); it gives guidance for the people trying to make some improvement for kvm. And the following explains what we can exactly do through this test. The function guest_code() can cover the conditions where a single vcpu or multiple vcpus access guest pages within the same memory region, in three VM stages (before dirty logging, during dirty logging, after dirty logging). Besides, the backing src memory type (ANONYMOUS/THP/HUGETLB) of the tested memory region can be specified by users, which means normal page mappings or block mappings can be chosen by users to be created in the test. If ANONYMOUS memory is specified, kvm will create normal page mappings for the tested memory region before dirty logging, and update attributes of the page mappings from RO to RW during dirty logging. 
If THP/HUGETLB memory is specified, kvm will create block mappings for the tested memory region before dirty logging, split the block mappings into normal page mappings during dirty logging, and coalesce the page mappings back into block mappings after dirty logging is stopped. So in summary, as a performance tester, this test can present the performance of kvm creating/updating normal page mappings, or the performance of kvm creating/splitting/recovering block mappings, through execution time. When we need to coalesce the page mappings back to block mappings after dirty logging is stopped, we first have to invalidate *all* the TLB entries for the page mappings right before installation of the block entry, because a TLB conflict abort error could occur if we can't invalidate the TLB entries fully. We have hit this TLB conflict twice on aarch64 software implementation and fixed it. As this test can simulate the process of a VM with block mappings going from dirty logging enabled to dirty logging stopped, it can also reproduce this TLB conflict abort due to inadequate TLB invalidation when coalescing tables. 
Links about the TLB conflict abort: https://lore.kernel.org/lkml/20201201201034.116760-3-wangyana...@huawei.com/ --- Change logs: v5->v6: - Address Andrew Jones's comments for v5 series - Add Andrew Jones's R-b tags in some patches - Rebased on newest kvm/queue tree - v5: https://lore.kernel.org/lkml/20210323135231.24948-1-wangyana...@huawei.com/ v4->v5: - Use synchronization (sem_wait) for time measurement - Add a new patch about TEST_ASSERT (patch 4) - Address Andrew Jones's comments for v4 series - Add Andrew Jones's R-b tags in some patches - v4: https://lore.kernel.org/lkml/20210302125751.19080-1-wangyana...@huawei.com/ v3->v4: - Add a helper to get system default hugetlb page size - Add tags of Reviewed-by of Ben in the patches - v3: https://lore.kernel.org/lkml/20210301065916.11484-1-wangyana...@huawei.com/ v2->v3: - Add tags of Suggested-by, Reviewed-by in the patches - Add a generic macro to get hugetlb page sizes - Some changes for suggestions about v2 series - v2: https://lore.kernel.org/lkml/20210225055940.18748-1-wangyana...@huawei.com/ v1->v2: - Add a patch to sync header files - Add helpers to get granularity of different backing src types - Some changes for suggestions about v1 series - v1: https://lore.kernel.org/lkml/20210208090841.333724-1-wangyana...@huawei.com/ --- Yanan Wang (10): tools headers: sync headers of asm-generic/hugetlb_encode.h mm/hugetlb: Add a macro to get HUGETLB page sizes for mmap KVM: selftests: Use flag CLOCK_MONOTONIC_RAW for timing KVM: selftests: Print the errno besides error-string in TEST_ASSERT KVM: selftests: Make a generic helper to get vm guest mode strings KVM: selftests: Add a helper to get system configured THP page size KVM: selftests: Add a helper to get system default hugetlb page size KVM: selftests: List all hugetlb src types specified with page sizes KVM: selftests: Adapt vm_userspace_mem_region_add to new helpers KVM: selftests: Add a test for kvm page table code
Re: [PATCH v2] KVM: vmx: add mismatched size assertions in vmcs_check32()
On 09/04/21 04:24, lihaiwei.ker...@gmail.com wrote: From: Haiwei Li Add compile-time assertions in vmcs_check32() to disallow accesses to 64-bit and 64-bit high fields via vmcs_{read,write}32(). Upper level KVM code should never do partial accesses to VMCS fields. KVM handles the split accesses automatically in vmcs_{read,write}64() when running as a 32-bit kernel. Reviewed-and-tested-by: Sean Christopherson Signed-off-by: Haiwei Li --- v1 -> v2: * Improve the changelog arch/x86/kvm/vmx/vmx_ops.h | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/arch/x86/kvm/vmx/vmx_ops.h b/arch/x86/kvm/vmx/vmx_ops.h index 692b0c3..164b64f 100644 --- a/arch/x86/kvm/vmx/vmx_ops.h +++ b/arch/x86/kvm/vmx/vmx_ops.h @@ -37,6 +37,10 @@ static __always_inline void vmcs_check32(unsigned long field) { BUILD_BUG_ON_MSG(__builtin_constant_p(field) && ((field) & 0x6000) == 0, "32-bit accessor invalid for 16-bit field"); + BUILD_BUG_ON_MSG(__builtin_constant_p(field) && ((field) & 0x6001) == 0x2000, +"32-bit accessor invalid for 64-bit field"); + BUILD_BUG_ON_MSG(__builtin_constant_p(field) && ((field) & 0x6001) == 0x2001, +"32-bit accessor invalid for 64-bit high field"); BUILD_BUG_ON_MSG(__builtin_constant_p(field) && ((field) & 0x6000) == 0x6000, "32-bit accessor invalid for natural width field"); } Queued, thanks. Paolo
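The masks in the patch key off the VMCS field encoding, where bits 14:13 select the field width and bit 0 selects the high half of a 64-bit field. The same decoding, rendered as plain runtime C purely for illustration (the kernel does this at compile time with BUILD_BUG_ON_MSG):

```c
#include <assert.h>

enum vmcs_width { W16, W32, W64, W64_HIGH, WNATURAL };

/* decode a VMCS field encoding's width bits, mirroring vmcs_check32() */
static enum vmcs_width vmcs_field_width(unsigned long field)
{
	if ((field & 0x6000) == 0)
		return W16;
	if ((field & 0x6001) == 0x2000)
		return W64;
	if ((field & 0x6001) == 0x2001)
		return W64_HIGH;
	if ((field & 0x6000) == 0x6000)
		return WNATURAL;
	return W32;	/* (field & 0x6000) == 0x4000 */
}
```

For example, the natural-width GUEST_RIP encoding 0x681e has bits 14:13 set and so must not go through the 32-bit accessors.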
Re: [PATCH 0/3] KVM: Fixes and a cleanup for coalesced MMIO
On 13/04/21 00:20, Sean Christopherson wrote: Fix two bugs that are exposed if unregistering a device on an I/O bus fails due to OOM. Tack on opportunistic cleanup in the related code. Sean Christopherson (3): KVM: Destroy I/O bus devices on unregister failure _after_ sync'ing SRCU KVM: Stop looking for coalesced MMIO zones if the bus is destroyed KVM: Add proper lockdep assertion in I/O bus unregister include/linux/kvm_host.h | 4 ++-- virt/kvm/coalesced_mmio.c | 19 +-- virt/kvm/kvm_main.c | 26 -- 3 files changed, 35 insertions(+), 14 deletions(-) Queued, thanks. Paolo
Re: [PATCH] KVM: Boost vCPU candidate in user mode which is delivering interrupt
On 16/04/21 05:08, Wanpeng Li wrote: From: Wanpeng Li Both lock holder vCPU and IPI receiver that has halted are candidates for boost. However, the PLE handler was originally designed to deal with the lock holder preemption problem. The Intel PLE occurs when the spinlock waiter is in kernel mode. This assumption doesn't hold for IPI receivers: they can be in either kernel or user mode. The vCPU candidate in user mode will not be boosted even if they should respond to IPIs. Some benchmarks like pbzip2, swaptions etc do the TLB shootdown in kernel mode and most of the time they are running in user mode. It can lead to a large number of continuous PLE events because the IPI sender causes PLE events repeatedly until the receiver is scheduled while the receiver is not a candidate for a boost. This patch boosts the vCPU candidate in user mode which is delivering an interrupt. We can observe the speed of pbzip2 improves 10% in 96 vCPUs VM in over-subscribe scenario (The host machine is 2 socket, 48 cores, 96 HTs Intel CLX box). There is no performance regression for other benchmarks like Unixbench spawn (most of the time contend read/write lock in kernel mode), ebizzy (most of the time contend read/write sem and TLB shootdown in kernel mode). +bool kvm_arch_interrupt_delivery(struct kvm_vcpu *vcpu) +{ + if (vcpu->arch.apicv_active && static_call(kvm_x86_dy_apicv_has_pending_interrupt)(vcpu)) + return true; + + return false; +} Can you reuse vcpu_dy_runnable instead of this new function? 
Paolo bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu) { return vcpu->arch.preempted_in_kernel; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 3b06d12..5012fc4 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -954,6 +954,7 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu); bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu); int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu); bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu); +bool kvm_arch_interrupt_delivery(struct kvm_vcpu *vcpu); int kvm_arch_post_init_vm(struct kvm *kvm); void kvm_arch_pre_destroy_vm(struct kvm *kvm); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 0a481e7..781d2db 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -3012,6 +3012,11 @@ static bool vcpu_dy_runnable(struct kvm_vcpu *vcpu) return false; } +bool __weak kvm_arch_interrupt_delivery(struct kvm_vcpu *vcpu) +{ + return false; +} + void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode) { struct kvm *kvm = me->kvm; @@ -3045,6 +3050,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode) !vcpu_dy_runnable(vcpu)) continue; if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode && + !kvm_arch_interrupt_delivery(vcpu) && !kvm_arch_vcpu_in_kernel(vcpu)) continue; if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
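The effect of the patch on kvm_vcpu_on_spin()'s candidate filter can be distilled into a plain predicate (purely illustrative; the real code operates on struct kvm_vcpu and the helpers quoted above): a preempted user-mode vCPU is no longer skipped when it has an interrupt to deliver.

```c
#include <assert.h>
#include <stdbool.h>

/* hypothetical flattening of the "continue" condition in the patched loop:
 * skip a preempted vCPU when we want kernel-mode yields, it is in user
 * mode, and it has no interrupt pending */
static bool skip_candidate(bool preempted, bool yield_to_kernel_mode,
			   bool in_kernel, bool delivering_interrupt)
{
	return preempted && yield_to_kernel_mode &&
	       !delivering_interrupt && !in_kernel;
}
```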
Re: [PATCH v2] kvm/selftests: Fix race condition with dirty_log_test
On 13/04/21 23:36, Peter Xu wrote: This patch closes this race by allowing the main thread to give the vcpu thread a chance to do a VMENTER to complete that write operation. It's done by adding a vcpu loop counter (must be defined as volatile as main thread will do read loop), then the main thread can guarantee the vcpu got at least another VMENTER by making sure the guest_vcpu_loops increases by 2. Dirty ring does not need this since dirty_ring_last_page would already help avoid this specific race condition. Just a nit, the comment and commit message should mention KVM_RUN rather than vmentry; it's possible to be preempted many times in vcpu_enter_guest without making progress, but those wouldn't return to userspace and thus would not update guest_vcpu_loops. Also, volatile is considered harmful even in userspace/test code[1]. Technically rather than volatile one should use an atomic load (even a relaxed one), but in practice it's okay to use volatile too *for this specific use* (READ_ONCE/WRITE_ONCE are volatile reads and writes as well). If the selftests gained 32-bit support, one should not use volatile because neither reads nor writes to uint64_t variables would be guaranteed to be atomic. Queued, thanks. Paolo [1] Documentation/process/volatile-considered-harmful.rst
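The atomic alternative Paolo describes can be sketched like this: the vCPU thread bumps a relaxed atomic counter once per completed KVM_RUN iteration, and the main thread snapshots it and waits until it has advanced by two. The counter name comes from the discussion; the helpers are illustrative:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t guest_vcpu_loops;

/* vcpu thread: call once per completed KVM_RUN iteration */
static void vcpu_loop_done(void)
{
	atomic_fetch_add_explicit(&guest_vcpu_loops, 1, memory_order_relaxed);
}

/* main thread: take a snapshot of the counter */
static uint64_t loops_snapshot(void)
{
	return atomic_load_explicit(&guest_vcpu_loops, memory_order_relaxed);
}

/* main thread: has the vcpu completed at least two more iterations? */
static int vcpu_has_advanced(uint64_t snapshot)
{
	return loops_snapshot() >= snapshot + 2;
}
```

Unlike a bare volatile uint64_t, this stays atomic even on a 32-bit build, which is the portability concern raised above.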
Re: [PATCH] doc/virt/kvm: move KVM_X86_SET_MSR_FILTER in section 8
On 16/03/21 18:08, Emanuele Giuseppe Esposito wrote: KVM_X86_SET_MSR_FILTER is a capability, not an ioctl. Therefore move it from section 4.97 to the new 8.31 (other capabilities). Here and in the subject I think KVM_X86_SET_MSR_FILTER should be replaced by KVM_CAP_PPC_MULTITCE. Otherwise looks good, so I queued it. Thanks, Paolo To fill the gap, move KVM_X86_SET_MSR_FILTER (was 4.126) to 4.97, and shifted Xen-related ioctl (were 4.127 - 4.130) by one place (4.126 - 4.129). Also fixed minor typo in KVM_GET_MSR_INDEX_LIST ioctl description (section 4.3).
Re: [PATCH] KVM: x86: Remove unused function declaration
On 06/04/21 08:35, Keqian Zhu wrote: kvm_mmu_slot_largepage_remove_write_access() is declared but not used, just remove it. Signed-off-by: Keqian Zhu --- arch/x86/include/asm/kvm_host.h | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 3768819693e5..9c0af0971c9f 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1440,8 +1440,6 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, const struct kvm_memory_slot *memslot); void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm, struct kvm_memory_slot *memslot); -void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm, - struct kvm_memory_slot *memslot); void kvm_mmu_zap_all(struct kvm *kvm); void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen); unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm); Queued, thanks. Paolo
Re: [PATCH 0/4] KVM: SVM: A fix and cleanups for vmcb tracking
On 06/04/21 19:18, Sean Christopherson wrote: Belated code review for the vmcb changes that are queued for 5.13. Sean Christopherson (4): KVM: SVM: Don't set current_vmcb->cpu when switching vmcb KVM: SVM: Drop vcpu_svm.vmcb_pa KVM: SVM: Add a comment to clarify what vcpu_svm.vmcb points at KVM: SVM: Enhance and clean up the vmcb tracking comment in pre_svm_run() arch/x86/kvm/svm/svm.c | 29 + arch/x86/kvm/svm/svm.h | 2 +- 2 files changed, 14 insertions(+), 17 deletions(-) Queued, thanks -- especially for the bug in patch 1, which avoided review. Paolo
Re: [PATCH v2 0/8] ccp: KVM: SVM: Use stack for SEV command buffers
On 07/04/21 00:49, Sean Christopherson wrote: This series teaches __sev_do_cmd_locked() to gracefully handle vmalloc'd command buffers by copying _all_ incoming data pointers to an internal buffer before sending the command to the PSP. The SEV driver and KVM are then converted to use the stack for all command buffers. Tested everything except sev_ioctl_do_pek_import(), I don't know anywhere near enough about the PSP to give it the right input. v2: - Rebase to kvm/queue, commit f96be2deac9b ("KVM: x86: Support KVM VMs sharing SEV context"). - Unconditionally copy @data to the internal buffer. [Christophe, Brijesh] - Allocate a full page for the buffer. [Brijesh] - Drop one set of the "!"s. [Christophe] - Use virt_addr_valid() instead of is_vmalloc_addr() for the temporary patch (definitely feel free to drop the patch if it's not worth backporting). [Christophe] - s/intput/input/. [Tom] - Add a patch to free "sev" if init fails. This is not strictly necessary (I think; I suck horribly when it comes to the driver framework). But it felt wrong to not free cmd_buf on failure, and even more wrong to free cmd_buf but not sev. v1: - https://lkml.kernel.org/r/20210402233702.3291792-1-sea...@google.com Sean Christopherson (8): crypto: ccp: Free SEV device if SEV init fails crypto: ccp: Detect and reject "invalid" addresses destined for PSP crypto: ccp: Reject SEV commands with mismatching command buffer crypto: ccp: Play nice with vmalloc'd memory for SEV command structs crypto: ccp: Use the stack for small SEV command buffers crypto: ccp: Use the stack and common buffer for status commands crypto: ccp: Use the stack and common buffer for INIT command KVM: SVM: Allocate SEV command structures on local stack arch/x86/kvm/svm/sev.c | 262 +-- drivers/crypto/ccp/sev-dev.c | 197 +- drivers/crypto/ccp/sev-dev.h | 4 +- 3 files changed, 196 insertions(+), 267 deletions(-) Queued, thanks. Paolo
Re: [PATCH v2 8/8] KVM: SVM: Allocate SEV command structures on local stack
On 07/04/21 19:34, Borislav Petkov wrote: On Wed, Apr 07, 2021 at 05:05:07PM +0000, Sean Christopherson wrote: I used memset() to defer initialization until after the various sanity checks, I'd actually vote for that too - I don't like doing stuff which is not going to be used. I.e., don't change what you have. It's three votes for that then. :) Sean, I squashed in this change diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index ec4c01807272..a4d0ca8c4710 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -1039,21 +1039,14 @@ static int sev_get_attestation_report(struct kvm *kvm, struct kvm_sev_cmd *argp) static int sev_send_cancel(struct kvm *kvm, struct kvm_sev_cmd *argp) { struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct sev_data_send_cancel *data; + struct sev_data_send_cancel data; int ret; if (!sev_guest(kvm)) return -ENOTTY; - data = kzalloc(sizeof(*data), GFP_KERNEL); - if (!data) - return -ENOMEM; - - data->handle = sev->handle; - ret = sev_issue_cmd(kvm, SEV_CMD_SEND_CANCEL, data, &argp->error); - - kfree(data); - return ret; + data.handle = sev->handle; + return sev_issue_cmd(kvm, SEV_CMD_SEND_CANCEL, &data, &argp->error); } int svm_mem_enc_op(struct kvm *kvm, void __user *argp) to handle SEV_CMD_SEND_CANCEL which I merged before this series. Paolo
Re: [PATCH v2 7/8] crypto: ccp: Use the stack and common buffer for INIT command
On 07/04/21 07:20, Christophe Leroy wrote: + struct sev_data_init data; struct sev_data_init data = {0, 0, 0, 0}; Having to count the number of items is suboptimal. The alternative could be {} (which however is technically not standard C), {0} (a bit mysterious, but it works) and memset. I kept the latter to avoid touching the submitter's patch too much. Paolo
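The three zeroing options weighed above, side by side in plain C. The struct is a stand-in (the real sev_data_init layout lives in the kernel headers), and as noted, `{}` without any element is a GNU extension prior to C23:

```c
#include <string.h>

/* stand-in struct; not the real sev_data_init layout */
struct init_data {
	unsigned int flags;
	unsigned long tmr_address;
	unsigned int tmr_len;
};

static void zero_three_ways(struct init_data *a, struct init_data *b,
			    struct init_data *c)
{
	/* 1. memset: zeroes padding bytes too, and can be deferred until
	 *    after sanity checks — the option kept in the merged patch */
	memset(a, 0, sizeof(*a));

	/* 2. "= {0}": terse but a bit mysterious; remaining members are
	 *    zeroed implicitly, padding is not guaranteed */
	struct init_data zero = {0};
	*b = zero;

	/* 3. explicit per-member zeros, "{0, 0, 0}": requires counting the
	 *    members and is brittle when the struct changes */
	struct init_data explicit_zero = {0, 0, 0};
	*c = explicit_zero;
}
```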
Re: [PATCH v2 5/8] crypto: ccp: Use the stack for small SEV command buffers
On 07/04/21 00:49, Sean Christopherson wrote: For commands with small input/output buffers, use the local stack to "allocate" the structures used to communicate with the PSP. Now that __sev_do_cmd_locked() gracefully handles vmalloc'd buffers, there's no reason to avoid using the stack, e.g. CONFIG_VMAP_STACK=y will just work. Signed-off-by: Sean Christopherson Squashing this in (inspired by Christophe's review, though not quite matching his suggestion). diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c index 0f5644a3b138..246b281b6376 100644 --- a/drivers/crypto/ccp/sev-dev.c +++ b/drivers/crypto/ccp/sev-dev.c @@ -408,12 +408,11 @@ static int sev_ioctl_do_pek_csr(struct sev_issue_cmd *argp, bool writable) if (copy_from_user(&input, (void __user *)argp->data, sizeof(input))) return -EFAULT; + memset(&data, 0, sizeof(data)); + /* userspace wants to query CSR length */ - if (!input.address || !input.length) { - data.address = 0; - data.len = 0; + if (!input.address || !input.length) goto cmd; - } /* allocate a physically contiguous buffer to store the CSR blob */ input_address = (void __user *)input.address;
Re: [PATCH v2 1/3] x86/kvm: Don't bother __pv_cpu_mask when !CONFIG_SMP
On 09/04/21 06:18, Wanpeng Li wrote: From: Wanpeng Li Enabling PV TLB shootdown when !CONFIG_SMP doesn't make sense. Let's move it inside CONFIG_SMP. In addition, we can avoid defining and allocating __pv_cpu_mask when !CONFIG_SMP and get rid of the 'alloc' variable in kvm_alloc_cpumask. Signed-off-by: Wanpeng Li --- v1 -> v2: * shuffle things around a bit more arch/x86/kernel/kvm.c | 118 +++--- 1 file changed, 55 insertions(+), 63 deletions(-) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 5e78e01..224a7a1 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -451,6 +451,10 @@ static void __init sev_map_percpu_data(void) } } +#ifdef CONFIG_SMP + +static DEFINE_PER_CPU(cpumask_var_t, __pv_cpu_mask); + static bool pv_tlb_flush_supported(void) { return (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) && @@ -458,10 +462,6 @@ static bool pv_tlb_flush_supported(void) kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)); } -static DEFINE_PER_CPU(cpumask_var_t, __pv_cpu_mask); - -#ifdef CONFIG_SMP - static bool pv_ipi_supported(void) { return kvm_para_has_feature(KVM_FEATURE_PV_SEND_IPI); @@ -574,6 +574,49 @@ static void kvm_smp_send_call_func_ipi(const struct cpumask *mask) } } +static void kvm_flush_tlb_others(const struct cpumask *cpumask, + const struct flush_tlb_info *info) +{ + u8 state; + int cpu; + struct kvm_steal_time *src; + struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_cpu_mask); + + cpumask_copy(flushmask, cpumask); + /* +* We have to call flush only on online vCPUs. And +* queue flush_on_enter for pre-empted vCPUs +*/ + for_each_cpu(cpu, flushmask) { + src = &per_cpu(steal_time, cpu); + state = READ_ONCE(src->preempted); + if ((state & KVM_VCPU_PREEMPTED)) { + if (try_cmpxchg(&src->preempted, &state, + state | KVM_VCPU_FLUSH_TLB)) + __cpumask_clear_cpu(cpu, flushmask); + } + } + + native_flush_tlb_others(flushmask, info); +} + +static __init int kvm_alloc_cpumask(void) +{ + int cpu; + + if (!kvm_para_available() || nopv) + return 0; + + if (pv_tlb_flush_supported() || pv_ipi_supported()) + for_each_possible_cpu(cpu) { + zalloc_cpumask_var_node(per_cpu_ptr(&__pv_cpu_mask, cpu), + GFP_KERNEL, cpu_to_node(cpu)); + } + + return 0; +} +arch_initcall(kvm_alloc_cpumask); + static void __init kvm_smp_prepare_boot_cpu(void) { /* @@ -611,33 +654,8 @@ static int kvm_cpu_down_prepare(unsigned int cpu) local_irq_enable(); return 0; } -#endif - -static void kvm_flush_tlb_others(const struct cpumask *cpumask, - const struct flush_tlb_info *info) -{ - u8 state; - int cpu; - struct kvm_steal_time *src; - struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_cpu_mask); - - cpumask_copy(flushmask, cpumask); - /* -* We have to call flush only on online vCPUs. And -* queue flush_on_enter for pre-empted vCPUs -*/ - for_each_cpu(cpu, flushmask) { - src = &per_cpu(steal_time, cpu); - state = READ_ONCE(src->preempted); - if ((state & KVM_VCPU_PREEMPTED)) { - if (try_cmpxchg(&src->preempted, &state, - state | KVM_VCPU_FLUSH_TLB)) - __cpumask_clear_cpu(cpu, flushmask); - } - } - native_flush_tlb_others(flushmask, info); -} +#endif static void __init kvm_guest_init(void) { @@ -653,12 +671,6 @@ static void __init kvm_guest_init(void) pv_ops.time.steal_clock = kvm_steal_clock; } - if (pv_tlb_flush_supported()) { - pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others; - pv_ops.mmu.tlb_remove_table = tlb_remove_table; - pr_info("KVM setup pv remote TLB flush\n"); - } - if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) apic_set_eoi_write(kvm_guest_apic_eoi_write); @@ -668,6 +680,12 @@ static void __init kvm_guest_init(void) } #ifdef CONFIG_SMP + if (pv_tlb_flush_supported()) { + pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others; + pv_ops.mmu.tlb_remove_table = tlb_remove_table; + pr_info("KVM setup pv remote TLB flush\n"); + } + smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu; if (pv_sched_yield_supported()) { smp_ops.send_call_func_ipi = kvm_smp_send_call_func_ipi; @@ -734,7 +752,7 @@ static uint32_t __init kvm_detect(void) static void __init kvm_apic_init(void)
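The core trick in kvm_flush_tlb_others() above, sketched in user-space C11: for each preempted vCPU, try to set a FLUSH_TLB bit in its steal-time flags with a compare-and-exchange, and on success drop it from the IPI mask so it flushes on its next entry instead. Names mirror the patch, but this is an illustration, not kernel code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define KVM_VCPU_PREEMPTED	(1 << 0)
#define KVM_VCPU_FLUSH_TLB	(1 << 1)

#define NCPUS 4

static _Atomic uint8_t preempted_flags[NCPUS];	/* per-vCPU steal-time flags */
static int flushmask[NCPUS];			/* 1 = send a flush IPI */

static void flush_tlb_others_sketch(void)
{
	for (int cpu = 0; cpu < NCPUS; cpu++) {
		if (!flushmask[cpu])
			continue;
		uint8_t state = atomic_load_explicit(&preempted_flags[cpu],
						     memory_order_relaxed);
		if (!(state & KVM_VCPU_PREEMPTED))
			continue;
		/* queue flush-on-enter instead of IPIing a preempted vCPU;
		 * the cmpxchg fails (and we IPI anyway) if the vCPU was
		 * re-scheduled concurrently */
		if (atomic_compare_exchange_strong(&preempted_flags[cpu],
						   &state,
						   state | KVM_VCPU_FLUSH_TLB))
			flushmask[cpu] = 0;
	}
	/* the real code now IPIs the remaining CPUs in flushmask */
}
```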
Re: linux-next: Fixes tag needs some work in the kvm tree
On 16/04/21 14:38, Christian Borntraeger wrote: On 16.04.21 14:27, Stephen Rothwell wrote: Hi all, In commit c3171e94cc1c ("KVM: s390: VSIE: fix MVPG handling for prefixing and MSO") Fixes tag Fixes: bdf7509bbefa ("s390/kvm: VSIE: correctly handle MVPG when in VSIE") has these problem(s): - Subject does not match target commit subject Just use git log -1 --format='Fixes: %h ("%s")' Hmm, this has been sitting in kvms390/next for some time now. Is this a new check? Maybe you just missed it when it was reported for kvms390? https://www.spinics.net/lists/linux-next/msg59652.html The SHA1 is stable now so it's too late. It's not a big deal I guess. Paolo
Re: [PATCH v2] tools: do not include scripts/Kbuild.include
On 16/04/21 15:26, Christian Borntraeger wrote: On 16.04.21 15:00, Masahiro Yamada wrote: Since commit d9f4ff50d2aa ("kbuild: spilt cc-option and friends to scripts/Makefile.compiler"), some kselftests fail to build. The tools/ directory opted out Kbuild, and went in a different direction. They copy any kind of files to the tools/ directory in order to do whatever they want in their world. tools/build/Build.include mimics scripts/Kbuild.include, but some tool Makefiles included the Kbuild one to import a feature that is missing in tools/build/Build.include: - Commit ec04aa3ae87b ("tools/thermal: tmon: use "-fstack-protector" only if supported") included scripts/Kbuild.include from tools/thermal/tmon/Makefile to import the cc-option macro. - Commit c2390f16fc5b ("selftests: kvm: fix for compilers that do not support -no-pie") included scripts/Kbuild.include from tools/testing/selftests/kvm/Makefile to import the try-run macro. - Commit 9cae4ace80ef ("selftests/bpf: do not ignore clang failures") included scripts/Kbuild.include from tools/testing/selftests/bpf/Makefile to import the .DELETE_ON_ERROR target. - Commit 0695f8bca93e ("selftests/powerpc: Handle Makefile for unrecognized option") included scripts/Kbuild.include from tools/testing/selftests/powerpc/pmu/ebb/Makefile to import the try-run macro. Copy what they need into tools/build/Build.include, and make them include it instead of scripts/Kbuild.include. Link: https://lore.kernel.org/lkml/86dadf33-70f7-a5ac-cb8c-64966d2f4...@linux.ibm.com/ Fixes: d9f4ff50d2aa ("kbuild: spilt cc-option and friends to scripts/Makefile.compiler") Reported-by: Janosch Frank Reported-by: Christian Borntraeger Signed-off-by: Masahiro Yamada looks better. Tested-by: Christian Borntraeger Thank you very much Masahiro, this look great. Paolo
Re: [PATCH v2 3/7] KVM: x86: hyper-v: Move the remote TLB flush logic out of vmx
On 16/04/21 10:36, Vitaly Kuznetsov wrote: - Create a dedicated set of files, e.g. 'kvmonhyperv.[ch]' (I also thought about 'hyperv_host.[ch]' but then I realized it's equally misleading as one can read this as 'KVM is acting as Hyper-V host'). Personally, I'd vote for the latter. Besides eliminating confusion, the benefit of having dedicated files is that we can avoid compiling them completely when !IS_ENABLED(CONFIG_HYPERV) (#ifdefs in C are ugly). Indeed. For the file, kvm-on-hv.[ch] can do. Paolo
Re: [PATCH v3 5/8] docs: vcpu-requests.rst: fix reference for atomic ops
On 09/04/21 14:47, Mauro Carvalho Chehab wrote: Changeset f0400a77ebdc ("atomic: Delete obsolete documentation") got rid of atomic_ops.rst, pointing that this was superseded by Documentation/atomic_*.txt. Update its reference accordingly. Fixes: f0400a77ebdc ("atomic: Delete obsolete documentation") Signed-off-by: Mauro Carvalho Chehab --- Documentation/virt/kvm/vcpu-requests.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/virt/kvm/vcpu-requests.rst b/Documentation/virt/kvm/vcpu-requests.rst index 5feb3706a7ae..5f8798e7fdf8 100644 --- a/Documentation/virt/kvm/vcpu-requests.rst +++ b/Documentation/virt/kvm/vcpu-requests.rst @@ -302,6 +302,6 @@ VCPU returns from the call. References == -.. [atomic-ops] Documentation/core-api/atomic_ops.rst +.. [atomic-ops] Documentation/atomic_bitops.txt and Documentation/atomic_t.txt .. [memory-barriers] Documentation/memory-barriers.txt .. [lwn-mb] https://lwn.net/Articles/573436/ Acked-by: Paolo Bonzini
Re: [PATCH v2 0/8] ccp: KVM: SVM: Use stack for SEV command buffers
On 07/04/21 20:00, Tom Lendacky wrote: For the series: Acked-by: Tom Lendacky Shall I take this as a request (or permission, whatever :)) to merge it through the KVM tree? Paolo
Re: [PATCH 2/2] tools: do not include scripts/Kbuild.include
On 15/04/21 10:04, Masahiro Yamada wrote: On Thu, Apr 15, 2021 at 4:40 PM Paolo Bonzini wrote: I think it would make sense to add try-run, cc-option and .DELETE_ON_ERROR to tools/build/Build.include? To be safe, I just copy-pasted what the makefiles need. If someone wants to refactor the tool build system, that is fine, but, to me, I do not see consistent rules or policy under tools/. "Please put this in a common file instead of introducing duplication" is not asking for wholesale refactoring. Paolo
Re: [PATCH 2/2] tools: do not include scripts/Kbuild.include
On 15/04/21 09:27, Masahiro Yamada wrote: Since commit d9f4ff50d2aa ("kbuild: spilt cc-option and friends to scripts/Makefile.compiler"), some kselftests fail to build. The tools/ directory opted out of Kbuild, and went in a different direction. They copy any kind of files to the tools/ directory in order to do whatever they want to do in their world. tools/build/Build.include mimics scripts/Kbuild.include, but some tool Makefiles included the Kbuild one to import a feature that is missing in tools/build/Build.include: - Commit ec04aa3ae87b ("tools/thermal: tmon: use "-fstack-protector" only if supported") included scripts/Kbuild.include from tools/thermal/tmon/Makefile to import the cc-option macro. - Commit c2390f16fc5b ("selftests: kvm: fix for compilers that do not support -no-pie") included scripts/Kbuild.include from tools/testing/selftests/kvm/Makefile to import the try-run macro. - Commit 9cae4ace80ef ("selftests/bpf: do not ignore clang failures") included scripts/Kbuild.include from tools/testing/selftests/bpf/Makefile to import the .DELETE_ON_ERROR target. - Commit 0695f8bca93e ("selftests/powerpc: Handle Makefile for unrecognized option") included scripts/Kbuild.include from tools/testing/selftests/powerpc/pmu/ebb/Makefile to import the try-run macro. Copy what they want there, and stop including scripts/Kbuild.include from the tool Makefiles. I think it would make sense to add try-run, cc-option and .DELETE_ON_ERROR to tools/build/Build.include? 
Paolo Link: https://lore.kernel.org/lkml/86dadf33-70f7-a5ac-cb8c-64966d2f4...@linux.ibm.com/ Fixes: d9f4ff50d2aa ("kbuild: spilt cc-option and friends to scripts/Makefile.compiler") Reported-by: Janosch Frank Reported-by: Christian Borntraeger Signed-off-by: Masahiro Yamada --- tools/testing/selftests/bpf/Makefile | 3 ++- tools/testing/selftests/kvm/Makefile | 12 +++- .../selftests/powerpc/pmu/ebb/Makefile| 11 ++- tools/thermal/tmon/Makefile | 19 +-- 4 files changed, 40 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 044bfdcf5b74..d872b9f41543 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -1,5 +1,4 @@ # SPDX-License-Identifier: GPL-2.0 -include ../../../../scripts/Kbuild.include include ../../../scripts/Makefile.arch include ../../../scripts/Makefile.include @@ -476,3 +475,5 @@ EXTRA_CLEAN := $(TEST_CUSTOM_PROGS) $(SCRATCH_DIR) $(HOST_SCRATCH_DIR) \ prog_tests/tests.h map_tests/tests.h verifier/tests.h \ feature \ $(addprefix $(OUTPUT)/,*.o *.skel.h no_alu32 bpf_gcc bpf_testmod.ko) + +.DELETE_ON_ERROR: diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index a6d61f451f88..8b45bc417d83 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -1,5 +1,15 @@ # SPDX-License-Identifier: GPL-2.0-only -include ../../../../scripts/Kbuild.include + +TMPOUT = .tmp_ + +try-run = $(shell set -e; \ + TMP=$(TMPOUT)/tmp; \ + mkdir -p $(TMPOUT); \ + trap "rm -rf $(TMPOUT)" EXIT; \ + if ($(1)) >/dev/null 2>&1;\ + then echo "$(2)"; \ + else echo "$(3)"; \ + fi) all: diff --git a/tools/testing/selftests/powerpc/pmu/ebb/Makefile b/tools/testing/selftests/powerpc/pmu/ebb/Makefile index af3df79d8163..d5d3e869df93 100644 --- a/tools/testing/selftests/powerpc/pmu/ebb/Makefile +++ b/tools/testing/selftests/powerpc/pmu/ebb/Makefile @@ -1,5 +1,4 @@ # SPDX-License-Identifier: GPL-2.0 -include 
../../../../../../scripts/Kbuild.include noarg: $(MAKE) -C ../../ @@ -8,6 +7,16 @@ noarg: CFLAGS += -m64 TMPOUT = $(OUTPUT)/TMPDIR/ + +try-run = $(shell set -e; \ + TMP=$(TMPOUT)/tmp; \ + mkdir -p $(TMPOUT); \ + trap "rm -rf $(TMPOUT)" EXIT; \ + if ($(1)) >/dev/null 2>&1;\ + then echo "$(2)"; \ + else echo "$(3)"; \ + fi) + # Toolchains may build PIE by default which breaks the assembly no-pie-option := $(call try-run, echo 'int main() { return 0; }' | \ $(CC) -Werror $(KBUILD_CPPFLAGS) $(CC_OPTION_CFLAGS) -no-pie -x c - -o "$$TMP", -no-pie) diff --git a/tools/thermal/tmon/Makefile b/tools/thermal/tmon/Makefile index 59e417ec3e13..92a683e4866c 100644 --- a/tools/thermal/tmon/Makefile +++ b/tools/thermal/tmon/Makefile @@ -1,6 +1,21 @@ # SPDX-License-Identifier: GPL-2.0 -# We need this for the "cc-option" macro. -include ../../../scripts/Kbuild.include + +TMPOUT = .tmp_ + +try-run = $(shell set -e; \ + TMP=$(TMPOUT)/tmp; \ + mkdir -p
Re: [PATCH 2/2] KVM: x86: Fix split-irqchip vs interrupt injection window request
On 15/04/21 02:59, Lai Jiangshan wrote: The next call to inject_pending_event() will reach here AT FIRST with vcpu->arch.exception.injected==false and vcpu->arch.exception.pending==false ... if (!vcpu->arch.exception.pending) { if (vcpu->arch.nmi_injected) { static_call(kvm_x86_set_nmi)(vcpu); can_inject = false; } else if (vcpu->arch.interrupt.injected) { static_call(kvm_x86_set_irq)(vcpu); can_inject = false; And comes here, and vcpu->arch.interrupt.injected is true because there is an interrupt queued by KVM_INTERRUPT for pure user irqchip. It then does the injection of the interrupt without checking EFLAGS.IF. Ok, understood now. Yeah, that could be a problem for userspace irqchip so we should switch it to use pending_external_vector instead. Are you going to write the patch or should I? Thanks! Paolo My question is: what stops the next call to inject_pending_event() from reaching here when KVM_INTERRUPT is called with an exception pending?
Re: [PATCH 2/2] KVM: x86: Fix split-irqchip vs interrupt injection window request
On 14/04/21 04:28, Lai Jiangshan wrote: On Tue, Apr 13, 2021 at 8:15 PM Paolo Bonzini wrote: On 13/04/21 13:03, Lai Jiangshan wrote: This patch claims that it has a place to stash the IRQ when EFLAGS.IF=0, but inject_pending_event() seems to ignore EFLAGS.IF and queues the IRQ to the guest directly in the first branch of using "kvm_x86_ops.set_irq(vcpu)". This is only true for pure-userspace irqchip. For split-irqchip, the "place to stash" the interrupt is vcpu->arch.pending_external_vector. For pure-userspace irqchip, KVM_INTERRUPT only cares about being able to stash the interrupt in vcpu->arch.interrupt.injected. It is indeed wrong for userspace to call KVM_INTERRUPT if the vCPU is not ready for interrupt injection, but KVM_INTERRUPT does not return an error. Thanks for the reply. May I ask what is the correct/practical way of using the KVM_INTERRUPT ABI for a pure-userspace irqchip? gVisor is indeed a pure-userspace irqchip: it will call KVM_INTERRUPT when kvm_run->ready_for_interrupt_injection=1 (along with other conditions unrelated to our discussion). https://github.com/google/gvisor/blob/a9441aea2780da8c93da1c73da860219f98438de/pkg/sentry/platform/kvm/bluepill_amd64_unsafe.go#L105 If kvm_run->ready_for_interrupt_injection=1 when an exception is pending or EFLAGS.IF=0, it would be unexpected for gVisor. Not with EFLAGS.IF=0. For pending exception, there is code to handle it in inject_pending_event: ... if (!vcpu->arch.exception.pending) { if (vcpu->arch.nmi_injected) { static_call(kvm_x86_set_nmi)(vcpu); can_inject = false; } else if (vcpu->arch.interrupt.injected) { static_call(kvm_x86_set_irq)(vcpu); can_inject = false; } } ... if (vcpu->arch.exception.pending) { ... can_inject = false; } // this is vcpu->arch.interrupt.injected for userspace LAPIC if (kvm_cpu_has_injectable_intr(vcpu)) { r = can_inject ? static_call(kvm_x86_interrupt_allowed)(vcpu, true) : -EBUSY; if (r < 0) goto busy; ... 
} so what happens is: - the interrupt will not be injected before the exception - KVM will schedule an immediate vmexit to inject the interrupt as well - if (as is likely) the exception has turned off interrupts, the next call to inject_pending_event will reach static_call(kvm_x86_enable_irq_window) and the interrupt will only be injected when IF becomes 1 again. Paolo
[GIT PULL] KVM changes for 5.12-rc8 or final
Linus, The following changes since commit 315f02c60d9425b38eb8ad7f21b8a35e40db23f9: KVM: x86/mmu: preserve pending TLB flush across calls to kvm_tdp_mmu_zap_sp (2021-04-08 07:48:18 -0400) are available in the Git repository at: https://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus for you to fetch changes up to 04c4f2ee3f68c9a4bf1653d15f1a9a435ae33f7a: KVM: VMX: Don't use vcpu->run->internal.ndata as an array index (2021-04-13 18:23:41 -0400) Fix for a possible out-of-bounds access. Reiji Watanabe (1): KVM: VMX: Don't use vcpu->run->internal.ndata as an array index arch/x86/kvm/vmx/vmx.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-)
Re: [PATCH] doc/virt/kvm: move KVM_X86_SET_MSR_FILTER in section 8
On 13/04/21 23:20, Jonathan Corbet wrote: Emanuele Giuseppe Esposito writes: KVM_X86_SET_MSR_FILTER is a capability, not an ioctl. Therefore move it from section 4.97 to the new 8.31 (other capabilities). To fill the gap, move KVM_X86_SET_MSR_FILTER (was 4.126) to 4.97, and shifted Xen-related ioctl (were 4.127 - 4.130) by one place (4.126 - 4.129). Also fixed minor typo in KVM_GET_MSR_INDEX_LIST ioctl description (section 4.3). Signed-off-by: Emanuele Giuseppe Esposito --- Documentation/virt/kvm/api.rst | 250 - 1 file changed, 125 insertions(+), 125 deletions(-) Paolo, what's your thought on this one? If it's OK should I pick it up? I missed the patch, I'll queue it up for 5.13. Paolo Thanks, jon diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 1a2b5210cdbf..a230140d6a7f 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -201,7 +201,7 @@ Errors: == EFAULT the msr index list cannot be read from or written to - E2BIG the msr index list is to be to fit in the array specified by + E2BIG the msr index list is too big to fit in the array specified by the user. == @@ -3686,31 +3686,105 @@ which is the maximum number of possibly pending cpu-local interrupts. Queues an SMI on the thread's vcpu. -4.97 KVM_CAP_PPC_MULTITCE -- +4.97 KVM_X86_SET_MSR_FILTER + -:Capability: KVM_CAP_PPC_MULTITCE -:Architectures: ppc -:Type: vm +:Capability: KVM_X86_SET_MSR_FILTER +:Architectures: x86 +:Type: vm ioctl +:Parameters: struct kvm_msr_filter +:Returns: 0 on success, < 0 on error -This capability means the kernel is capable of handling hypercalls -H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user -space. This significantly accelerates DMA operations for PPC KVM guests. -User space should expect that its handlers for these hypercalls -are not going to be called if user space previously registered LIOBN -in KVM (via KVM_CREATE_SPAPR_TCE or similar calls). 
+:: -In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest, -user space might have to advertise it for the guest. For example, -IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is -present in the "ibm,hypertas-functions" device-tree property. + struct kvm_msr_filter_range { + #define KVM_MSR_FILTER_READ (1 << 0) + #define KVM_MSR_FILTER_WRITE (1 << 1) + __u32 flags; + __u32 nmsrs; /* number of msrs in bitmap */ + __u32 base; /* MSR index the bitmap starts at */ + __u8 *bitmap; /* a 1 bit allows the operations in flags, 0 denies */ + }; -The hypercalls mentioned above may or may not be processed successfully -in the kernel based fast path. If they can not be handled by the kernel, -they will get passed on to user space. So user space still has to have -an implementation for these despite the in kernel acceleration. + #define KVM_MSR_FILTER_MAX_RANGES 16 + struct kvm_msr_filter { + #define KVM_MSR_FILTER_DEFAULT_ALLOW (0 << 0) + #define KVM_MSR_FILTER_DEFAULT_DENY (1 << 0) + __u32 flags; + struct kvm_msr_filter_range ranges[KVM_MSR_FILTER_MAX_RANGES]; + }; -This capability is always enabled. +flags values for ``struct kvm_msr_filter_range``: + +``KVM_MSR_FILTER_READ`` + + Filter read accesses to MSRs using the given bitmap. A 0 in the bitmap + indicates that a read should immediately fail, while a 1 indicates that + a read for a particular MSR should be handled regardless of the default + filter action. + +``KVM_MSR_FILTER_WRITE`` + + Filter write accesses to MSRs using the given bitmap. A 0 in the bitmap + indicates that a write should immediately fail, while a 1 indicates that + a write for a particular MSR should be handled regardless of the default + filter action. + +``KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE`` + + Filter both read and write accesses to MSRs using the given bitmap. 
A 0 + in the bitmap indicates that both reads and writes should immediately fail, + while a 1 indicates that reads and writes for a particular MSR are not + filtered by this range. + +flags values for ``struct kvm_msr_filter``: + +``KVM_MSR_FILTER_DEFAULT_ALLOW`` + + If no filter range matches an MSR index that is getting accessed, KVM will + fall back to allowing access to the MSR. + +``KVM_MSR_FILTER_DEFAULT_DENY`` + + If no filter range matches an MSR index that is getting accessed, KVM will + fall back to rejecting access to the MSR. In this mode, all MSRs that should + be processed by KVM need to explicitly be marked as allowed in the bitmaps. + +This ioctl allows user space to define up to 16 bitmaps of MSR ranges to +specify whether a certain MSR access should be explicitly filtered for
Re: [PATCH 2/2] KVM: x86: Fix split-irqchip vs interrupt injection window request
On 13/04/21 13:03, Lai Jiangshan wrote: This patch claims that it has a place to stash the IRQ when EFLAGS.IF=0, but inject_pending_event() seems to ignore EFLAGS.IF and queues the IRQ to the guest directly in the first branch of using "kvm_x86_ops.set_irq(vcpu)". This is only true for pure-userspace irqchip. For split-irqchip, the "place to stash" the interrupt is vcpu->arch.pending_external_vector. For pure-userspace irqchip, KVM_INTERRUPT only cares about being able to stash the interrupt in vcpu->arch.interrupt.injected. It is indeed wrong for userspace to call KVM_INTERRUPT if the vCPU is not ready for interrupt injection, but KVM_INTERRUPT does not return an error. Ignoring the fact that this would be incorrect use of the API, are you saying that the incorrect injection was not possible before this patch? Paolo
Re: [PATCH 2/2] KVM: x86: Fix split-irqchip vs interrupt injection window request
On 12/04/21 23:43, Sean Christopherson wrote: where kvm_arch_interrupt_allowed() checks EFLAGS.IF (and an edge case related to nested virtualization). KVM also captures EFLAGS.IF in vcpu->run->if_flag. For whatever reason, QEMU checks both vcpu->run flags before injecting an IRQ, maybe to handle a case where QEMU itself clears EFLAGS.IF? It's mostly obsolete code (that will be deprecated in the next version and removed in about a year) so I wouldn't read much into it. if_flag itself is obsolete; it is not provided by SEV-ES, and a useless subset of ready_for_interrupt_injection. Paolo
Re: [RFC v2] KVM: x86: Support KVM VMs sharing SEV context
On 09/04/21 03:18, James Bottomley wrote: If you want to share ASIDs you have to share the firmware that the running VM has been attested to. Once the VM moves from LAUNCH to RUNNING, the PSP won't allow the VMM to inject any more firmware or do any more attestations. I think Steve is suggesting to just change the RIP of the mirror VM, which would work for SEV but not SEV-ES (the RAM migration helper won't *suffice* for SEV-ES, but perhaps you could use the PSP to migrate the VMSA and the migration helper for the rest?). If you want to use a single firmware binary, SEC does almost no I/O accesses (the exception being the library constructor from SourceLevelDebugPkg's SecPeiDebugAgentLib), so you probably can: - detect the migration helper hardware in PlatformPei, either from fw_cfg or based on the lack of it - either divert execution to the migration helper through gEfiEndOfPeiSignalPpiGuid, or if it's too late add a new boot mode and PPI to DxeLoadCore. Paolo What you mirror after this point can thus only contain what has already been measured or what the guest added. This is why we think there has to be a new entry path into the VM for the mirror vCPU.
Re: [PATCH v2 07/17] KVM: x86/mmu: Check PDPTRs before allocating PAE roots
On 08/04/21 18:27, Sean Christopherson wrote: For your approach, can we put the out label after the success path? Setting mmu->root_pgd isn't wrong per se, but doing so might mislead future readers into thinking that it's functionally necessary. Indeed, thanks for the speedy review. I'll get it queued tomorrow. Paolo
Re: [PATCH] KVM: vmx: add mismatched size in vmcs_check32
On 08/04/21 18:05, Sean Christopherson wrote: Add compile-time assertions in vmcs_check32() to disallow accesses to 64-bit and 64-bit high fields via vmcs_{read,write}32(). Upper level KVM code should never do partial accesses to VMCS fields. KVM handles the split accesses automatically in vmcs_{read,write}64() when running as a 32-bit kernel. KVM also uses raw vmread/vmwrite (__vmcs_readl/__vmcs_writel) when copying to and from the shadow VMCS, so that path will not go through vmcs_check32 either. Paolo
Re: [PATCH] KVM: SVM: Make sure GHCB is mapped before updating
On 08/04/21 18:04, Tom Lendacky wrote: + if (!err || !sev_es_guest(vcpu->kvm) || !WARN_ON_ONCE(svm->ghcb)) This should be WARN_ON_ONCE(!svm->ghcb), otherwise you'll get the right result, but get a stack trace immediately. Doh, yep. Actually, because of the "or's", this needs to be: if (!err || !sev_es_guest(vcpu->kvm) || (sev_es_guest(vcpu->kvm) && WARN_ON_ONCE(!svm->ghcb))) No, || cuts the right-hand side if the left-hand side is true. So: - if err == 0, the rest is not evaluated - if !sev_es_guest(vcpu->kvm), WARN_ON_ONCE(!svm->ghcb) is not evaluated Paolo
Re: [PATCH v2 07/17] KVM: x86/mmu: Check PDPTRs before allocating PAE roots
On 08/04/21 17:48, Sean Christopherson wrote: Freaking PDPTRs. I was really hoping we could keep the lock and pages_available() logic outside of the helpers. What if kvm_mmu_load() reads the PDPTRs and passes them into mmu_alloc_shadow_roots()? Or is that too ugly? The patch I have posted (though untested) tries to do that in a slightly less ugly way by pushing make_mmu_pages_available down to mmu_alloc_*_roots. Paolo diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index efb41f31e80a..e3c4938cd665 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3275,11 +3275,11 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu) return 0; } -static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) +static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu, u64 pdptrs[4]) { struct kvm_mmu *mmu = vcpu->arch.mmu; - u64 pdptrs[4], pm_mask; gfn_t root_gfn, root_pgd; + u64 pm_mask; hpa_t root; int i; @@ -3291,11 +3291,8 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) if (mmu->root_level == PT32E_ROOT_LEVEL) { for (i = 0; i < 4; ++i) { - pdptrs[i] = mmu->get_pdptr(vcpu, i); - if (!(pdptrs[i] & PT_PRESENT_MASK)) - continue; - - if (mmu_check_root(vcpu, pdptrs[i] >> PAGE_SHIFT)) + if ((pdptrs[i] & PT_PRESENT_MASK) && + mmu_check_root(vcpu, pdptrs[i] >> PAGE_SHIFT)) return 1; } } @@ -4844,21 +4841,33 @@ EXPORT_SYMBOL_GPL(kvm_mmu_reset_context); int kvm_mmu_load(struct kvm_vcpu *vcpu) { - int r; + struct kvm_mmu *mmu = vcpu->arch.mmu; + u64 pdptrs[4]; + int r, i; - r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->direct_map); + r = mmu_topup_memory_caches(vcpu, !mmu->direct_map); if (r) goto out; r = mmu_alloc_special_roots(vcpu); if (r) goto out; + + /* +* On SVM, reading PDPTRs might access guest memory, which might fault +* and thus might sleep. Grab the PDPTRs before acquiring mmu_lock. 
+*/ + if (!mmu->direct_map && mmu->root_level == PT32E_ROOT_LEVEL) { + for (i = 0; i < 4; ++i) + pdptrs[i] = mmu->get_pdptr(vcpu, i); + } + write_lock(&vcpu->kvm->mmu_lock); if (make_mmu_pages_available(vcpu)) r = -ENOSPC; else if (vcpu->arch.mmu->direct_map) r = mmu_alloc_direct_roots(vcpu); else - r = mmu_alloc_shadow_roots(vcpu); + r = mmu_alloc_shadow_roots(vcpu, pdptrs); write_unlock(&vcpu->kvm->mmu_lock); if (r) goto out;
Re: [PATCH 0/4] Add support for XMM fast hypercalls
On 08/04/21 17:40, Siddharth Chandrasekaran wrote: Although the Hyper-v TLFS mentions that a guest cannot use this feature unless the hypervisor advertises support for it, some hypercalls which we plan on upstreaming in future use them anyway. No, please don't do this. Check the feature bit(s) before you issue hypercalls which rely on the extended interface. Perhaps Siddharth should clarify this, but I read it as Hyper-V being buggy and using XMM arguments unconditionally. The guest is at fault here as it expects Hyper-V to consume arguments from XMM registers for certain hypercalls (that we are working on) even if we didn't expose the feature via CPUID bits. What guest is that? Paolo
Re: [PATCH 6/7] KVM: SVM: hyper-v: Enlightened MSR-Bitmap support
On 07/04/21 16:41, Vineeth Pillai wrote: +#if IS_ENABLED(CONFIG_HYPERV) +static inline void hv_vmcb_dirty_nested_enlightenments(struct kvm_vcpu *vcpu) +{ + struct vmcb *vmcb = to_svm(vcpu)->vmcb; + + /* +* vmcb can be NULL if called during early vcpu init. +* And its okay not to mark vmcb dirty during vcpu init +* as we mark it dirty unconditionally towards end of vcpu +* init phase. +*/ + if (vmcb && vmcb_is_clean(vmcb, VMCB_HV_NESTED_ENLIGHTENMENTS) && + vmcb->hv_enlightenments.hv_enlightenments_control.msr_bitmap) + vmcb_mark_dirty(vmcb, VMCB_HV_NESTED_ENLIGHTENMENTS); +} In addition to what Vitaly said, can svm->vmcb really be NULL? If so it might be better to reorder initializations and skip the NULL check. Paolo
Re: [PATCH 4/7] KVM: SVM: hyper-v: Nested enlightenments in VMCB
On 07/04/21 16:41, Vineeth Pillai wrote: +#define VMCB_ALL_CLEAN_MASK (__CLEAN_MASK | (1U << VMCB_HV_NESTED_ENLIGHTENMENTS)) +#else +#define VMCB_ALL_CLEAN_MASK __CLEAN_MASK +#endif I think this should depend on whether KVM is running on top of Hyper-V; not on whether KVM is *compiled* with Hyper-V support. So you should turn VMCB_ALL_CLEAN_MASK into a __read_mostly variable. Paolo /* TPR and CR2 are always written before VMRUN */ #define VMCB_ALWAYS_DIRTY_MASK((1U << VMCB_INTR) | (1U << VMCB_CR2)) @@ -230,7 +251,7 @@ static inline void vmcb_mark_all_dirty(struct vmcb *vmcb) static inline void vmcb_mark_all_clean(struct vmcb *vmcb) { - vmcb->control.clean = ((1 << VMCB_DIRTY_MAX) - 1) + vmcb->control.clean = VMCB_ALL_CLEAN_MASK & ~VMCB_ALWAYS_DIRTY_MASK; }
Re: [PATCH 0/4] Add support for XMM fast hypercalls
On 08/04/21 17:28, Wei Liu wrote: Although the Hyper-v TLFS mentions that a guest cannot use this feature unless the hypervisor advertises support for it, some hypercalls which we plan on upstreaming in future use them anyway. No, please don't do this. Check the feature bit(s) before you issue hypercalls which rely on the extended interface. Perhaps Siddharth should clarify this, but I read it as Hyper-V being buggy and using XMM arguments unconditionally. Paolo
Re: [PATCH 3/4] KVM: x86: kvm_hv_flush_tlb use inputs from XMM registers
On 08/04/21 14:01, Vitaly Kuznetsov wrote: Also, we can probably defer kvm_hv_hypercall_read_xmm() until we know how many regs we actually need to not read them all (we will always need xmm[0] I guess so we can as well read it here). The cost is get/put FPU, so I think there's not much to gain from that. Paolo
Re: [PATCH 1/2] KVM: x86: reduce pvclock_gtod_sync_lock critical sections
On 08/04/21 14:00, Marcelo Tosatti wrote: KVM_REQ_MCLOCK_INPROGRESS is only needed to kick running vCPUs out of the execution loop; We do not want vcpus with different system_timestamp/tsc_timestamp pair: * To avoid that problem, do not allow visibility of distinct * system_timestamp/tsc_timestamp values simultaneously: use a master * copy of host monotonic time values. Update that master copy * in lockstep. So KVM_REQ_MCLOCK_INPROGRESS also ensures that no vcpu enters guest mode (via vcpu->requests check before VM-entry) with a different system_timestamp/tsc_timestamp pair. Yes this is what KVM_REQ_MCLOCK_INPROGRESS does, but it does not have to be done that way. All you really need is the IPI with KVM_REQUEST_WAIT, which ensures that updates happen after the vCPUs have exited guest mode. You don't need to loop on vcpu->requests for example, because kvm_guest_time_update could just spin on pvclock_gtod_sync_lock until pvclock_update_vm_gtod_copy is done. So this morning I tried protecting the kvm->arch fields for kvmclock using a seqcount, which is nice also because get_kvmclock_ns() does not have to bounce the cacheline of pvclock_gtod_sync_lock anymore. I'll post it tomorrow or next week. Paolo
[GIT PULL] KVM changes for 5.12-rc7
Linus, The following changes since commit 55626ca9c6909d077eca71bccbe15fef6e5ad917: selftests: kvm: Check that TSC page value is small after KVM_SET_CLOCK(0) (2021-04-01 05:14:19 -0400) are available in the Git repository at: https://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus for you to fetch changes up to 315f02c60d9425b38eb8ad7f21b8a35e40db23f9: KVM: x86/mmu: preserve pending TLB flush across calls to kvm_tdp_mmu_zap_sp (2021-04-08 07:48:18 -0400) A lone x86 patch, for a bug found while developing a backport to stable versions. Paolo Bonzini (1): KVM: x86/mmu: preserve pending TLB flush across calls to kvm_tdp_mmu_zap_sp arch/x86/kvm/mmu/mmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
Re: [PATCH v2 07/17] KVM: x86/mmu: Check PDPTRs before allocating PAE roots
On 08/04/21 13:15, Wanpeng Li wrote: I saw this splatting: BUG: sleeping function called from invalid context at arch/x86/kvm/kvm_cache_regs.h:115 kvm_pdptr_read+0x20/0x60 [kvm] kvm_mmu_load+0x3bd/0x540 [kvm] There is a might_sleep() in kvm_pdptr_read(), however, the original commit didn't explain more. I can send a formal one if the below fix is acceptable. I think we can just push make_mmu_pages_available down into kvm_mmu_load's callees. This way it's not necessary to hold the lock until after the PDPTR check: diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 0d92a269c5fa..f92c3695bfeb 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3244,6 +3244,12 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu) u8 shadow_root_level = mmu->shadow_root_level; hpa_t root; unsigned i; + int r; + + write_lock(&vcpu->kvm->mmu_lock); + r = make_mmu_pages_available(vcpu); + if (r < 0) + goto out_unlock; if (is_tdp_mmu_enabled(vcpu->kvm)) { root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu); @@ -3266,13 +3272,16 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu) mmu->root_hpa = __pa(mmu->pae_root); } else { WARN_ONCE(1, "Bad TDP root level = %d\n", shadow_root_level); - return -EIO; + r = -EIO; } +out_unlock: + write_unlock(&vcpu->kvm->mmu_lock); + /* root_pgd is ignored for direct MMUs. */ mmu->root_pgd = 0; - return 0; + return r; } static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) @@ -3282,6 +3291,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) gfn_t root_gfn, root_pgd; hpa_t root; int i; + int r; root_pgd = mmu->get_guest_pgd(vcpu); root_gfn = root_pgd >> PAGE_SHIFT; @@ -3300,6 +3310,11 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) } } + write_lock(&vcpu->kvm->mmu_lock); + r = make_mmu_pages_available(vcpu); + if (r < 0) + goto out_unlock; + /* * Do we shadow a long mode page table? If so we need to * write-protect the guests page table root. 
@@ -3308,7 +3323,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) root = mmu_alloc_root(vcpu, root_gfn, 0, mmu->shadow_root_level, false); mmu->root_hpa = root; - goto set_root_pgd; + goto out_unlock; } if (WARN_ON_ONCE(!mmu->pae_root)) @@ -3350,7 +3365,8 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) else mmu->root_hpa = __pa(mmu->pae_root); -set_root_pgd: +out_unlock: + write_unlock(&vcpu->kvm->mmu_lock); mmu->root_pgd = root_pgd; return 0; @@ -4852,14 +4868,10 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu) r = mmu_alloc_special_roots(vcpu); if (r) goto out; - write_lock(&vcpu->kvm->mmu_lock); - if (make_mmu_pages_available(vcpu)) - r = -ENOSPC; - else if (vcpu->arch.mmu->direct_map) + if (vcpu->arch.mmu->direct_map) r = mmu_alloc_direct_roots(vcpu); else r = mmu_alloc_shadow_roots(vcpu); - write_unlock(&vcpu->kvm->mmu_lock); if (r) goto out; Paolo
Re: [PATCH v2] KVM: Explicitly use GFP_KERNEL_ACCOUNT for 'struct kvm_vcpu' allocations
On 06/04/21 21:07, Sean Christopherson wrote: Use GFP_KERNEL_ACCOUNT when allocating vCPUs to make it more obvious that the allocations are accounted, to make it easier to audit KVM's allocations in the future, and to be consistent with other cache usage in KVM. When using SLAB/SLUB, this is a nop as the cache itself is created with SLAB_ACCOUNT. When using SLOB, there are caveats within caveats. SLOB doesn't honor SLAB_ACCOUNT, so passing GFP_KERNEL_ACCOUNT will result in vCPU allocations now being accounted. But, even that depends on internal SLOB details as SLOB will only go to the page allocator when its cache is depleted. That just happens to be extremely likely for vCPUs because the size of kvm_vcpu is larger than a page for almost all combinations of architecture and page size. Whether or not the SLOB behavior is by design is unknown; it's just as likely that no SLOB users care about accounting and so no one has bothered to implement support in SLOB. Regardless, accounting vCPU allocations will not break SLOB+KVM+cgroup users, if any exist. Cc: Wanpeng Li Signed-off-by: Sean Christopherson --- v2: Drop the Fixes tag and rewrite the changelog since this is a nop when using SLUB or SLAB. [Wanpeng] virt/kvm/kvm_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 0a481e7780f0..580f98386b42 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -3192,7 +3192,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id) if (r) goto vcpu_decrement; - vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL); + vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT); if (!vcpu) { r = -ENOMEM; goto vcpu_decrement; Queued, thanks. Paolo
Re: [PATCH 3/3] mm: unexport follow_pfn
On 08/04/21 12:05, Daniel Vetter wrote: On Mon, Mar 29, 2021 at 10:31:01AM -0300, Jason Gunthorpe wrote: On Tue, Mar 16, 2021 at 04:33:03PM +0100, Daniel Vetter wrote: Both kvm (in bd2fae8da794 ("KVM: do not assume PTE is writable after follow_pfn")) and vfio (in 07956b6269d3 ("vfio/type1: Use follow_pte()")) have lost their callsites of follow_pfn(). All the other ones have been switched over to unsafe_follow_pfn because they cannot be fixed without breaking userspace api. Arguably the vfio code is still racy, but that's kinda a bigger vfio and kvm Hm I thought kvm is non-racy due to the mmu notifiers catching races? No, but the plan is indeed to have some struct for each page that uses follow_pfn and update it from the MMU notifiers. Paolo picture. But since it does leak the pte beyond where it drops the pt lock, without anything else like an mmu notifier guaranteeing coherence, the problem is at least clearly visible in the vfio code. So good enough with me. I've decided to keep the explanation that after dropping the pt lock you must have an mmu notifier if you keep using the pte somehow by adjusting it and moving it into the kerneldoc for the new follow_pte() function. Cc: 3...@google.com Cc: Jann Horn Cc: Paolo Bonzini Cc: Jason Gunthorpe Cc: Cornelia Huck Cc: Peter Xu Cc: Alex Williamson Cc: linux...@kvack.org Cc: linux-arm-ker...@lists.infradead.org Cc: linux-samsung-...@vger.kernel.org Cc: linux-me...@vger.kernel.org Cc: k...@vger.kernel.org Signed-off-by: Daniel Vetter --- include/linux/mm.h | 2 -- mm/memory.c| 26 +- mm/nommu.c | 13 + 3 files changed, 6 insertions(+), 35 deletions(-) Reviewed-by: Jason Gunthorpe Thanks for your r-b tags, I'll add them. -Daniel Jason
Re: [PATCH 1/2] KVM: x86: reduce pvclock_gtod_sync_lock critical sections
On 07/04/21 19:40, Marcelo Tosatti wrote: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fe806e894212..0a83eff40b43 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2562,10 +2562,12 @@ static void kvm_gen_update_masterclock(struct kvm *kvm) kvm_hv_invalidate_tsc_page(kvm); - spin_lock(&ka->pvclock_gtod_sync_lock); kvm_make_mclock_inprogress_request(kvm); + Might be good to serialize against two kvm_gen_update_masterclock callers? Otherwise one caller could clear KVM_REQ_MCLOCK_INPROGRESS, while the other is still at pvclock_update_vm_gtod_copy(). Makes sense, but this stuff has always seemed unnecessarily complicated to me. KVM_REQ_MCLOCK_INPROGRESS is only needed to kick running vCPUs out of the execution loop; clearing it in kvm_gen_update_masterclock is unnecessary, because KVM_REQ_CLOCK_UPDATE takes pvclock_gtod_sync_lock too and thus will already wait for pvclock_update_vm_gtod_copy to end. I think it's possible to use a seqcount in KVM_REQ_CLOCK_UPDATE instead of KVM_REQ_MCLOCK_INPROGRESS. Both cause the vCPUs to spin. I'll take a look. Paolo
Re: [PATCH 5.10 096/126] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
On 06/04/21 21:44, Sasha Levin wrote:
On Tue, Apr 06, 2021 at 08:28:27PM +0200, Paolo Bonzini wrote:

If a patch doesn't (more or less trivially) apply, the maintainer should take action. Distro maintainers can also jump in and post the backport to subsystem mailing lists. If the stable kernel loses a patch because a maintainer doesn't have the time to do a backport, it's not the end of the world.

This quickly went from a "world class" to "fuck it".

Known bugs are better than unknown bugs. If something is reported on 4.19 and the stable@ backports were only done up to 5.4 because the backport was a bit more messy, it's okay. If a user comes up with a weird bug that I've never seen and that is caused by a botched backport, it's much worse.

In this specific case we're talking of 5.10; but in many cases users of really old kernels, let's say 4.4-4.9 at this point, won't care about having *all* bugfixes. Some newer (and thus more buggy) features may be so distant from the mainline in those kernels, or so immature, that no one will consider them while staying on such an old kernel. Again, kernel necrophilia pays my bills, so I have some experience there. :)

It's understandable that maintainers don't have all the time in the world for this, but are you really asking not to backport fixes to stable trees because you don't have the time for it and don't want anyone else to do it instead?

If a bug is serious I *will* do the backport; I literally did this specific backport on the first working day after the failure report. But not all bugs are created equal and neither are all stable@-worthy bugs. I try to use the "Fixes" tag correctly, but sometimes a bug that *technically* is 10-years-old may not be worthwhile or even possible to fix even in 5.4. That said...
one thing that would be really, really awesome would be a website where you navigate a Linux checkout and for each directory you can choose to get a list of stable patches that were Cc-ed to stable@ and failed to apply. A pipedream maybe, but also a great internship project. :) Paolo Maybe consider designating someone who knows the subsystem well and does have time for this?
Re: [PATCH] KVM: x86/mmu: preserve pending TLB flush across calls to kvm_tdp_mmu_zap_sp
On 06/04/21 20:25, Greg KH wrote:
On Tue, Apr 06, 2021 at 12:25:50PM -0400, Paolo Bonzini wrote:

Right now, if a call to kvm_tdp_mmu_zap_sp returns false, the caller will skip the TLB flush, which is wrong. There are two ways to fix it:

- since kvm_tdp_mmu_zap_sp will not yield and therefore will not flush the TLB itself, we could change the call to kvm_tdp_mmu_zap_sp to use "flush |= ..."

- or we can chain the flush argument through kvm_tdp_mmu_zap_sp down to __kvm_tdp_mmu_zap_gfn_range.

This patch does the former to simplify application to stable kernels.

Cc: sea...@google.com
Fixes: 048f49809c526 ("KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping")
Cc: # 5.10.x: 048f49809c: KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping
Cc: # 5.10.x: 33a3164161: KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages
Cc: 
Signed-off-by: Paolo Bonzini
---
 arch/x86/kvm/mmu/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Is this for only the stable kernels, or is it addressed toward upstream merges? Confused,

It's for upstream. I'll include it (with the expected "[ Upstream commit abcd ]" header) when I post the complete backport. I'll send this patch to Linus as soon as I get a review even if I don't have anything else in the queue, so (as a general idea) the full backport should be sent and tested on Thursday-Friday.

Paolo
Re: [PATCH 5.10 096/126] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
On 06/04/21 20:01, Sasha Levin wrote:
On Tue, Apr 06, 2021 at 05:48:50PM +0200, Paolo Bonzini wrote:
On 06/04/21 15:49, Sasha Levin wrote:

Yup. Is there anything wrong with those patches?

The big issue, and the one that you ignored every time we discuss this topic, is that this particular subset of 17 has AFAIK never been tested by anyone.

Few of the CI systems that run on stable(-rc) releases run kvm-unit-tests, which passed. So yes, this was tested.

The code didn't run unless those CI systems set the module parameter that gates the experimental code (they don't). A compile test is better than nothing I guess, but code that didn't run cannot be considered tested. Again, I don't expect that anyone would notice a botched backport to 5.10 or 5.11 of this code, but that isn't an excuse for a poor process.

Right, I looked at what needed to be backported, took it back to 5.4, and ran kvm-unit-tests on it.

I guess that's a typo since the code was added in 5.10, but anyway...

What other hoops should we jump through so we won't need to "hope" anymore?

... you should jump through _fewer_ hoops. You are not expected to know the status of the code in each and every subsystem, not even Linus does. If a patch doesn't (more or less trivially) apply, the maintainer should take action. Distro maintainers can also jump in and post the backport to subsystem mailing lists. If the stable kernel loses a patch because a maintainer doesn't have the time to do a backport, it's not the end of the world.

Paolo
[PATCH] KVM: x86/mmu: preserve pending TLB flush across calls to kvm_tdp_mmu_zap_sp
Right now, if a call to kvm_tdp_mmu_zap_sp returns false, the caller will skip the TLB flush, which is wrong. There are two ways to fix it:

- since kvm_tdp_mmu_zap_sp will not yield and therefore will not flush the TLB itself, we could change the call to kvm_tdp_mmu_zap_sp to use "flush |= ..."

- or we can chain the flush argument through kvm_tdp_mmu_zap_sp down to __kvm_tdp_mmu_zap_gfn_range.

This patch does the former to simplify application to stable kernels.

Cc: sea...@google.com
Fixes: 048f49809c526 ("KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping")
Cc: # 5.10.x: 048f49809c: KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping
Cc: # 5.10.x: 33a3164161: KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages
Cc: 
Signed-off-by: Paolo Bonzini
---
 arch/x86/kvm/mmu/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 486aa94ecf1d..951dae4e7175 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5906,7 +5906,7 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
 				      lpage_disallowed_link);
 		WARN_ON_ONCE(!sp->lpage_disallowed);
 		if (is_tdp_mmu_page(sp)) {
-			flush = kvm_tdp_mmu_zap_sp(kvm, sp);
+			flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
 		} else {
 			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
 			WARN_ON_ONCE(sp->lpage_disallowed);
-- 
2.26.2
Re: [PATCH 5.10 096/126] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
On 06/04/21 15:49, Sasha Levin wrote:
On Tue, Apr 06, 2021 at 08:09:26AM +0200, Paolo Bonzini wrote:

Whoa no, you have included basically a whole new feature, except for the final patch that actually enables the feature. The whole new MMU

Right, we would usually grab dependencies rather than modifying the patch. It means we diverge less with upstream, and custom backports tend to be buggier than just grabbing dependencies.

In general I can't disagree. However, you *are* modifying at least commit 048f49809c in your backport, so it's not clear where you draw the line and why you didn't ask for help (more on this below). Only the first five patches here are actual prerequisites for an easy backport of the two commits that actually matter (a835429cda91, "KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap"; and 048f49809c52, "KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping", 2021-03-30). Everything after "KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed" could be dropped without making it any harder to backport those two.

is still not meant to be used in production and development is still happening as of 5.13.

Unrelated to this discussion, but how are folks supposed to know which feature can and which feature can't be used in production? If it's a released kernel, in theory anyone can pick up 5.12 and use it in production.

It's not enabled by default and requires turning on a module parameter. That also means that something like CKI will not notice if anything's wrong with these patches. It also means that I could just shrug and hope that no one ever runs this code in 5.10 and 5.11 (which is actually the most likely case), but it is the process that is *just wrong*.

Were all these patches (82-97) included just to enable patch 98 ("KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping")? Same for 105-120 in 5.11.

Yup. Is there anything wrong with those patches?
The big issue, and the one that you ignored every time we discuss this topic, is that this particular subset of 17 has AFAIK never been tested by anyone. There's plenty of locking changes in here; one patch that you didn't backport has this in its commit message:

    This isn't technically a bug fix in the current code [...] but
    that is all very, very subtle, and will break at the slightest sneeze,

meaning that the locking in 5.10 and 5.11 was also less robust to changes elsewhere in the code.

Let's also talk about the process and the timing. I got the "failed to apply" automated message last Friday and I was going to work on the backport today since yesterday was a holiday here. I was *never* CCed on a post of this backport for maintainers to review; you guys *literally* took random subsets of patches from a feature that is new and in active development, and hoped that they worked on a past release.

I could be happy because you just provided me with a perfect example of why to use my employer's franken-kernel instead of upstream stable kernels... ;) but this is not how a world-class operating system is developed. Who cares if a VM breaks or even if my laptop panics; but I'd seriously fear for my data if you applied the same attitude to XFS or ext4fs.

For now, please drop all 17 patches from 5.10 and 5.11. I'll send a tested backport as soon as possible.

Paolo
Re: [PATCH v11 00/13] Add AMD SEV guest live migration support
On 05/04/21 20:27, Steve Rutherford wrote: On Mon, Apr 5, 2021 at 8:17 AM Peter Gonda wrote: Could this patch set include support for the SEND_CANCEL command? That's separate from this patchset. I sent up an implementation last week. And I was going to queue it today. Paolo
Re: [PATCH v11 10/13] KVM: x86: Introduce new KVM_FEATURE_SEV_LIVE_MIGRATION feature & Custom MSR.
On 06/04/21 15:26, Ashish Kalra wrote:

It's a little unintuitive to see KVM_MSR_RET_FILTERED here, since userspace can make this happen on its own without having an entry in this switch statement (by setting it in the msr filter bitmaps). When using MSR filters, I would only expect to get MSR filter exits for MSRs I specifically asked for. Not a huge deal, just a little unintuitive. I'm not sure other options are much better (you could put KVM_MSR_RET_INVALID, or you could just not have these entries in svm_{get,set}_msr).

Actually KVM_MSR_RET_FILTERED seems more logical to use, especially in comparison with KVM_MSR_RET_INVALID. Also, hooking this msr in svm_{get,set}_msr allows some in-kernel error pre-processing before doing the pass-through to userspace.

I agree that it should be up to userspace to set up the filter since we now have that functionality. Let me read the whole threads for the past versions to see what the objections were...

Paolo
Re: [PATCH 2/4] KVM: MIPS: rework flush_shadow_* callbacks into one that prepares the flush
On 06/04/21 13:39, Thomas Bogendoerfer wrote: On Tue, Apr 06, 2021 at 08:05:40AM +0200, Paolo Bonzini wrote: On 06/04/21 03:36, Huacai Chen wrote: I tried the merge and it will be enough for Linus to remove arch/mips/kvm/trap_emul.c. So I will leave it as is, but next time I'd prefer KVM MIPS changes to go through either my tree or a common topic branch. Emmm, the TE removal series is done by Thomas, not me.:) Sure, sorry if the sentence sounded like it was directed to you. No matter who wrote the code, synchronization between trees is only the maintainers' task. :) Sorry about the mess. I'll leave arch/mips/kvm to go via your tree then. No problem. In this specific case, at some point I'll pull from mips-next, prior to sending the pull request to Linus. It will look just like other KVM submaintainer pulls. Paolo
Re: [PATCH 5.10 096/126] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
On 05/04/21 10:54, Greg Kroah-Hartman wrote:

From: Ben Gardon

[ Upstream commit 9a77daacc87dee9fd63e31243f21894132ed8407 ]

To prepare for handling page faults in parallel, change the TDP MMU page fault handler to use atomic operations to set SPTEs so that changes are not lost if multiple threads attempt to modify the same SPTE.

Reviewed-by: Peter Feiner
Signed-off-by: Ben Gardon
Message-Id: <20210202185734.1680553-21-bgar...@google.com>
[Document new locking rules. - Paolo]
Signed-off-by: Paolo Bonzini
Signed-off-by: Sasha Levin
---
 Documentation/virt/kvm/locking.rst |   9 +-
 arch/x86/include/asm/kvm_host.h    |  13 +++
 arch/x86/kvm/mmu/tdp_mmu.c         | 142 ++---
 3 files changed, 130 insertions(+), 34 deletions(-)

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index b21a34c34a21..0aa4817b466d 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -16,7 +16,14 @@ The acquisition orders for mutexes are as follows:
 - kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
   them together is quite rare.
 
-On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.
+On x86:
+
+- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock
+
+- kvm->arch.mmu_lock is an rwlock.  kvm->arch.tdp_mmu_pages_lock is
+  taken inside kvm->arch.mmu_lock, and cannot be taken without already
+  holding kvm->arch.mmu_lock (typically with ``read_lock``, otherwise
+  there's no need to take kvm->arch.tdp_mmu_pages_lock at all).
 
 Everything else is a leaf: no other lock is taken inside the critical
 sections.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 02d4c74d30e2..47cd8f9b3fe7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1014,6 +1014,19 @@ struct kvm_arch {
 	struct list_head tdp_mmu_roots;
 
 	/* List of struct tdp_mmu_pages not being used as roots */
 	struct list_head tdp_mmu_pages;
+
+	/*
+	 * Protects accesses to the following fields when the MMU lock
+	 * is held in read mode:
+	 *  - tdp_mmu_pages (above)
+	 *  - the link field of struct kvm_mmu_pages used by the TDP MMU
+	 *  - lpage_disallowed_mmu_pages
+	 *  - the lpage_disallowed_link field of struct kvm_mmu_pages used
+	 *    by the TDP MMU
+	 * It is acceptable, but not necessary, to acquire this lock when
+	 * the thread holds the MMU lock in write mode.
+	 */
+	spinlock_t tdp_mmu_pages_lock;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 14d69c01c710..eb38f74af3f2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -7,6 +7,7 @@
 #include "tdp_mmu.h"
 #include "spte.h"
 
+#include
 #include
 
 #ifdef CONFIG_X86_64
@@ -33,6 +34,7 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
 	kvm->arch.tdp_mmu_enabled = true;
 
 	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
+	spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
 	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
 }
 
@@ -225,7 +227,8 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
 }
 
 static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				u64 old_spte, u64 new_spte, int level);
+				u64 old_spte, u64 new_spte, int level,
+				bool shared);
 
 static int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
 {
@@ -267,17 +270,26 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
  *
  * @kvm: kvm instance
  * @sp: the new page
+ * @shared: This operation may not be running under the exclusive use of
+ *	    the MMU lock and the operation must synchronize with other
+ *	    threads that might be adding or removing pages.
  * @account_nx: This page replaces a NX large page and should be marked for
  *		eventual reclaim.
  */
 static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp,
-			      bool account_nx)
+			      bool shared, bool account_nx)
 {
-	lockdep_assert_held_write(&kvm->mmu_lock);
+	if (shared)
+		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+	else
+		lockdep_assert_held_write(&kvm->mmu_lock);
 
 	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
 	if (account_nx)
 		account_huge_nx_page(kvm, sp);
+
+	if (shared)
+		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 }
 
 /**
@@ -285,14 +297,24 @@ static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp,
  *
  * @kvm: kvm instance
  * @sp: the page to be removed
+ * @shared: This operation may not be running under the exclusive use of
+ *	    the MMU lock and the operati
Re: [PATCH 2/4] KVM: MIPS: rework flush_shadow_* callbacks into one that prepares the flush
On 06/04/21 03:36, Huacai Chen wrote: I tried the merge and it will be enough for Linus to remove arch/mips/kvm/trap_emul.c. So I will leave it as is, but next time I'd prefer KVM MIPS changes to go through either my tree or a common topic branch. Emmm, the TE removal series is done by Thomas, not me.:) Sure, sorry if the sentence sounded like it was directed to you. No matter who wrote the code, synchronization between trees is only the maintainers' task. :) Paolo
Re: [PATCH 2/4] KVM: MIPS: rework flush_shadow_* callbacks into one that prepares the flush
On 03/04/21 04:31, Huacai Chen wrote: Hi, Paolo, TE mode has been removed in the MIPS tree, can we also remove it in KVM tree before this rework? I tried the merge and it will be enough for Linus to remove arch/mips/kvm/trap_emul.c. So I will leave it as is, but next time I'd prefer KVM MIPS changes to go through either my tree or a common topic branch. Paolo
Re: [PATCH 2/4] KVM: MIPS: rework flush_shadow_* callbacks into one that prepares the flush
On 03/04/21 04:31, Huacai Chen wrote: Hi, Paolo, TE mode has been removed in the MIPS tree, can we also remove it in KVM tree before this rework? Fortunately I can pull the exact commit that was applied to the MIPS tree, as it was the first patch that was applied to the tree, but next time please send KVM changes through the KVM tree. Paolo
Re: [PATCH v2 0/9] KVM: my debug patch queue
On 01/04/21 15:54, Maxim Levitsky wrote:

Hi! I would like to publish two debug features which were needed for other stuff I work on. One is the reworked lx-symbols script which now actually works on at least gdb 9.1 (gdb 9.2 was reported to fail to load the debug symbols from the kernel for some reason, not related to this patch) and upstream qemu.

Queued patches 2-5 for now. 6 is okay, but it needs a selftest (e.g. using KVM_VCPU_SET_EVENTS) and the correct name for the constant.

Paolo

The other feature is the ability to trap all guest exceptions (on SVM for now) and see them in kvmtrace prior to potential merge to double/triple fault. This can be very useful and I already had to manually patch KVM a few times for this. I will, once time permits, implement this feature on Intel as well.

V2:

* Some more refactoring and workarounds for lx-symbols script

* added KVM_GUESTDBG_BLOCKEVENTS flag to enable 'block interrupts on single step' together with KVM_CAP_SET_GUEST_DEBUG2 capability to indicate which guest debug flags are supported. This is a replacement for the unconditional block of interrupts on single step that was done in the previous version of this patch set. Patches to qemu to use that feature will be sent soon.

* Reworked the 'intercept all exceptions for debug' feature according to the review feedback:
  - renamed the parameter that enables the feature and moved it to the common kvm module (only SVM part is currently implemented though)
  - disable the feature for SEV guests as was suggested during the review
  - made the vmexit table const again, as was suggested in the review as well.
Best regards,
Maxim Levitsky

Maxim Levitsky (9):
  scripts/gdb: rework lx-symbols gdb script
  KVM: introduce KVM_CAP_SET_GUEST_DEBUG2
  KVM: x86: implement KVM_CAP_SET_GUEST_DEBUG2
  KVM: aarch64: implement KVM_CAP_SET_GUEST_DEBUG2
  KVM: s390x: implement KVM_CAP_SET_GUEST_DEBUG2
  KVM: x86: implement KVM_GUESTDBG_BLOCKEVENTS
  KVM: SVM: split svm_handle_invalid_exit
  KVM: x86: add force_intercept_exceptions_mask
  KVM: SVM: implement force_intercept_exceptions_mask

 Documentation/virt/kvm/api.rst    |   4 +
 arch/arm64/include/asm/kvm_host.h |   4 +
 arch/arm64/kvm/arm.c              |   2 +
 arch/arm64/kvm/guest.c            |   5 -
 arch/s390/include/asm/kvm_host.h  |   4 +
 arch/s390/kvm/kvm-s390.c          |   3 +
 arch/x86/include/asm/kvm_host.h   |  12 ++
 arch/x86/include/uapi/asm/kvm.h   |   1 +
 arch/x86/kvm/svm/svm.c            |  87 +++--
 arch/x86/kvm/svm/svm.h            |   6 +-
 arch/x86/kvm/x86.c                |  14 ++-
 arch/x86/kvm/x86.h                |   2 +
 include/uapi/linux/kvm.h          |   1 +
 kernel/module.c                   |   8 +-
 scripts/gdb/linux/symbols.py      | 203 --
 15 files changed, 272 insertions(+), 84 deletions(-)