Re: [PATCH v2 00/50] KVM: Rework kvm_init() and hardware enabling
On 12/28/22 12:22, Marc Zyngier wrote: Queued, thanks. I will leave this in kvm/queue after testing everything else and moving it to kvm/next; this way, we can wait for test results on other architectures. Can you please make this a topic branch, and if possible based on a released -rc? It would make it a lot easier for everyone. This is now refs/heads/kvm-hw-enable-refactor in https://git.kernel.org/pub/scm/virt/kvm/kvm.git. The commits for this series start at hash fc471e831016. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH v2 00/50] KVM: Rework kvm_init() and hardware enabling
On 12/28/22 12:22, Marc Zyngier wrote: Queued, thanks. I will leave this in kvm/queue after testing everything else and moving it to kvm/next; this way, we can wait for test results on other architectures. Can you please make this a topic branch, and if possible based on a released -rc? It would make it a lot easier for everyone. Yes, I will (it will be based on 6.2-rc1 + pull request for rc2 that I'm preparing + x86 changes that this conflicts with). Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH v2 00/50] KVM: Rework kvm_init() and hardware enabling
Queued, thanks. I will leave this in kvm/queue after testing everything else and moving it to kvm/next; this way, we can wait for test results on other architectures. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 12/14] KVM: selftests: Use wildcards to find library source files
On 12/13/22 01:16, Sean Christopherson wrote: Use $(wildcard ...) to find the library source files instead of manually defining the inputs, which is a maintenance burden and error prone. No, please don't. This leads to weird errors, for example when "git checkout" is interrupted with ^C. I have applied patches 1-11. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 10/14] KVM: selftests: Include lib.mk before consuming $(CC)
On 12/13/22 01:16, Sean Christopherson wrote: Include lib.mk before consuming $(CC) and document that lib.mk overwrites $(CC) unless make was invoked with -e or $(CC) was specified after make (which apparently makes the environment override the Makefile?!?!). Yes, it does. In projects that don't use configure or similar, you might have seen CFLAGS = -O2 -g to be overridden with "make CFLAGS=-g" if optimization is undesirable. Paolo Including lib.mk after using it for probing, e.g. for -no-pie, can lead to weirdness. ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 09/14] KVM: selftests: Explicitly disable builtins for mem*() overrides
On 12/13/22 01:16, Sean Christopherson wrote: Explicitly disable the compiler's builtin memcmp(), memcpy(), and memset(). Because only lib/string_override.c is built with -ffreestanding, the compiler reserves the right to do what it wants and can try to link the non-freestanding code to its own crud. /usr/bin/x86_64-linux-gnu-ld: /lib/x86_64-linux-gnu/libc.a(memcmp.o): in function `memcmp_ifunc': (.text+0x0): multiple definition of `memcmp'; tools/testing/selftests/kvm/lib/string_override.o: tools/testing/selftests/kvm/lib/string_override.c:15: first defined here clang: error: linker command failed with exit code 1 (use -v to see invocation) Hmm, that's weird though. I think it's an effect of ifunc and maybe even a linker bug. The patch makes sense anyway. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 06/14] KVM: selftests: Rename UNAME_M to ARCH_DIR, fill explicitly for x86
On 12/13/22 01:16, Sean Christopherson wrote: -ifeq ($(ARCH),riscv) - UNAME_M := riscv +ifeq ($(ARCH),x86) + ARCH_DIR := x86_64 +else ifeq ($(ARCH),arm64) + ARCH_DIR := aarch64 +else ifeq ($(ARCH),s390) + ARCH_DIR := s390x +else ifeq ($(ARCH),riscv) + ARCH_DIR := riscv +else +$(error Unknown architecture '$(ARCH)') endif $(error) would break compiling via tools/testing/selftests/Makefile, so I am squashing this: diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index d761a77c3a80..59f3eb53c932 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -13,10 +13,8 @@ else ifeq ($(ARCH),arm64) ARCH_DIR := aarch64 else ifeq ($(ARCH),s390) ARCH_DIR := s390x -else ifeq ($(ARCH),riscv) - ARCH_DIR := riscv else -$(error Unknown architecture '$(ARCH)') + ARCH_DIR := $(ARCH) endif LIBKVM += lib/assert.c Then the aarch64 and s390x directories can be renamed---x86 too, but the ifeq needs to stay (just changed to do x86_64->x86 instead of the other way round). Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [RFC PATCH 00/37] KVM: Refactor the KVM/x86 TDP MMU into common code
On 12/13/22 00:26, Sean Christopherson wrote: I would strongly be in favor of discarding the shadow paging residue if x86 folks are willing to part ways with it Yes, absolutely. Something like to_shadow_page->to_mmu_data, sp->md, spt->hpt, spte->spte, sptep->hptep. "host" will be confusing if when the primary MMU is involved though, e.g. I always think of the primary MMU's page tables as the "host page tables". What if we keep the "S" for SPT(E) and repurpose it to mean Secondary PTE? Makes sense, so just to_shadow_page->to_secmmu_page? Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [RFC PATCH 04/37] KVM: x86/mmu: Invert sp->tdp_mmu_page to sp->shadow_mmu_page
On 12/8/22 20:38, David Matlack wrote: Invert the meaning of sp->tdp_mmu_page and rename it accordingly. This allows the TDP MMU code to not care about this field, which will be used in a subsequent commit to move the TDP MMU to common code. No functional change intended. Let's use a bit of the role instead. Paolo Signed-off-by: David Matlack --- arch/x86/kvm/mmu/mmu.c | 1 + arch/x86/kvm/mmu/mmu_internal.h | 2 +- arch/x86/kvm/mmu/tdp_mmu.c | 3 --- arch/x86/kvm/mmu/tdp_mmu.h | 5 - 4 files changed, 6 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 355548603960..f7668a32721d 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -2180,6 +2180,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm, sp->gfn = gfn; sp->role = role; + sp->shadow_mmu_page = true; hlist_add_head(>hash_link, sp_list); if (sp_has_gptes(sp)) account_shadowed(kvm, sp); diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index e32379c5b1ad..c1a379fba24d 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -52,7 +52,7 @@ struct kvm_mmu_page { struct list_head link; struct hlist_node hash_link; - bool tdp_mmu_page; + bool shadow_mmu_page; bool unsync; u8 mmu_valid_gen; diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 7ccac1aa8df6..fc0b87ceb1ea 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -133,8 +133,6 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root, if (!refcount_dec_and_test(>tdp_mmu_root_count)) return; - WARN_ON(!is_tdp_mmu_page(root)); - /* * The root now has refcount=0. It is valid, but readers already * cannot acquire a reference to it because kvm_tdp_mmu_get_root() @@ -279,7 +277,6 @@ static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep, sp->role = role; sp->gfn = gfn; sp->ptep = sptep; - sp->tdp_mmu_page = true; trace_kvm_mmu_get_page(sp, true); } diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 0a63b1afabd3..18d3719f14ea 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -71,7 +71,10 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr, u64 *spte); #ifdef CONFIG_X86_64 -static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu_page; } +static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) +{ + return !sp->shadow_mmu_page; +} #else static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; } #endif ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [RFC PATCH 02/37] KVM: MMU: Move struct kvm_mmu_page_role into common code
On 12/8/22 20:38, David Matlack wrote: +/* + * kvm_mmu_page_role tracks the properties of a shadow page (where shadow page + * also includes TDP pages) to determine whether or not a page can be used in + * the given MMU context. + */ +union kvm_mmu_page_role { + u32 word; + struct { + struct { + /* The address space ID mapped by the page. */ + u16 as_id:8; + + /* The level of the page in the page table hierarchy. */ + u16 level:4; + + /* Whether the page is invalid, i.e. pending destruction. */ + u16 invalid:1; + }; + + /* Architecture-specific properties. */ + struct kvm_mmu_page_role_arch arch; + }; +}; + Have you considered adding a tdp_mmu:1 field to the arch-independent part? I think that as long as _that_ field is the same, there's no need to have any overlap between TDP MMU and shadow MMU roles. I'm not even sure if the x86 TDP MMU needs _any_ other role bit. It needs of course the above three, and it also needs "direct" but it is used exactly to mean "is this a TDP MMU page". So we could have union { struct { u32 tdp_mmu:1; u32 invalid:1; u32 :6; u32 level:8; u32 arch:8; u32 :8; } tdp; /* the first field must be "u32 tdp_mmu:1;" */ struct kvm_mmu_page_role_arch shadow; }; Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [RFC PATCH 00/37] KVM: Refactor the KVM/x86 TDP MMU into common code
On 12/9/22 20:07, Oliver Upton wrote: - Naming. This series does not change the names of any existing code. So all the KVM/x86 Shadow MMU-style terminology like "shadow_page"/"sp"/"spte" persists. Should we keep that style in common code or move toward something less shadow-paging-specific? e.g. "page_table"/"pt"/"pte". I would strongly be in favor of discarding the shadow paging residue if x86 folks are willing to part ways with it Yes, absolutely. Something like to_shadow_page->to_mmu_data, sp->md, spt->hpt, spte->spte, sptep->hptep. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [RFC PATCH 01/37] KVM: x86/mmu: Store the address space ID directly in kvm_mmu_page_role
On 12/12/22 18:39, Sean Christopherson wrote: The notion of address spaces is already existing architecture-neutral concept in KVM (e.g. see uses of KVM_ADDRESS_SPACE_NUM in virt/kvm/kvm_main.c), although SMM is the only use-case I'm aware of. Yes, SMM is currently the only use-case. It's possible that in the future Hyper-V VTLs will also have per-level protections. It wouldn't use as_id, but it would likely be recorded in the upper byte of the role. I'm not sure if Microsoft intends to port those to ARM as well. My preference would be to leave .smm in x86's page role What about defining a byte of arch_role and a macro to build it? Unless I'm missing something, e.g. a need to map GPAs differently for SMM vs. non-SMM, SMM could have been implemented with a simple flag in a memslot to mark the memslot as SMM-only. Unfortunately not, because there can be another region (for example video RAM at 0Ah) underneath SMRAM. In fact, KVM_MEM_X86_SMRAM was the first idea. It was uglier than multiple address spaces (https://lore.kernel.org/lkml/1431084034-8425-1-git-send-email-pbonz...@redhat.com). In retrospect there were probably ways to mix the best of the two designs, but it wasn't generic enough. And separate address spaces become truly nasty if the CPU can access multiple protected regions, e.g. if the CPU can access type X and type Y at the same time, then there would need to be memslots for "regular", X, Y, and X+Y. Without a usecase that's hard to say. It's just as possible that there would be a natural hierarchy of levels. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [RFC PATCH 06/37] KVM: MMU: Move struct kvm_mmu_page to common code
On 12/8/22 20:38, David Matlack wrote: This commit increases the size of struct kvm_mmu_page by 64 bytes on x86_64 (184 bytes -> 248 bytes). The size of this struct can be reduced in future commits by moving TDP MMU root fields into a separate struct and by dynamically allocating fields only used by the Shadow MMU. I think it's already possible to use a union like union { struct kvm_mmu_page_arch arch; struct { struct work_struct work; void *data; }; }; Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [RFC PATCH 10/37] KVM: MMU: Move struct kvm_page_fault to common code
On 12/8/22 20:38, David Matlack wrote: + + /* Derived from mmu and global state. */ + const bool is_tdp; I think this could stay in the architecture-independent part. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH v2 7/7] KVM: selftests: Avoid infinite loop if ucall_alloc() fails
On 12/9/22 22:03, Sean Christopherson wrote: From: Sean Christopherson Date: Fri, 9 Dec 2022 12:55:44 -0800 Subject: [PATCH] KVM: selftests: Use magic value to signal ucall_alloc() failure Use a magic value to signal a ucall_alloc() failure instead of simply doing GUEST_ASSERT(). GUEST_ASSERT() relies on ucall_alloc() and so a failure puts the guest into an infinite loop. Use -1 as the magic value, as a real ucall struct should never wrap. Reported-by: Oliver Upton Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/lib/ucall_common.c | 16 ++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/kvm/lib/ucall_common.c b/tools/testing/selftests/kvm/lib/ucall_common.c index 0cc0971ce60e..2f0e2ea941cc 100644 --- a/tools/testing/selftests/kvm/lib/ucall_common.c +++ b/tools/testing/selftests/kvm/lib/ucall_common.c @@ -4,6 +4,8 @@ #include "linux/bitmap.h" #include "linux/atomic.h" +#define GUEST_UCALL_FAILED -1 + struct ucall_header { DECLARE_BITMAP(in_use, KVM_MAX_VCPUS); struct ucall ucalls[KVM_MAX_VCPUS]; @@ -41,7 +43,8 @@ static struct ucall *ucall_alloc(void) struct ucall *uc; int i; - GUEST_ASSERT(ucall_pool); + if (!ucall_pool) + goto ucall_failed; for (i = 0; i < KVM_MAX_VCPUS; ++i) { if (!test_and_set_bit(i, ucall_pool->in_use)) { @@ -51,7 +54,13 @@ static struct ucall *ucall_alloc(void) } } - GUEST_ASSERT(0); +ucall_failed: + /* +* If the vCPU cannot grab a ucall structure, make a bare ucall with a +* magic value to signal to get_ucall() that things went sideways. +* GUEST_ASSERT() depends on ucall_alloc() and so cannot be used here. +*/ + ucall_arch_do_ucall(GUEST_UCALL_FAILED); return NULL; } @@ -93,6 +102,9 @@ uint64_t get_ucall(struct kvm_vcpu *vcpu, struct ucall *uc) addr = ucall_arch_get_ucall(vcpu); if (addr) { + TEST_ASSERT(addr != (void *)GUEST_UCALL_FAILED, + "Guest failed to allocate ucall struct"); + memcpy(uc, addr, sizeof(*uc)); vcpu_run_complete_io(vcpu); } else { base-commit: dc2efbe4813e0dc4368779bc36c5f0e636cb8eb2 -- Queued, thanks. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH v2 0/7] KVM: selftests: Fixes for ucall pool + page_fault_test
On 12/9/22 02:52, Oliver Upton wrote: The combination of the pool-based ucall implementation + page_fault_test resulted in some 'fun' bugs. As has always been the case, KVM selftests is a house of cards. Small series to fix up the issues on kvm/queue. Patches 1-2 can probably be squashed into Paolo's merge resolution, if desired. Tested on Ampere Altra and a Skylake box, since there was a decent amount of munging in architecture-generic code. v1 -> v2: - Collect R-b from Sean (thanks!) - Use a common routine for split and contiguous VA spaces, with commentary on why arm64 is different since we all get to look at it now. (Sean) - Don't identity map the ucall MMIO hole - Fix an off-by-one issue in the accounting of virtual memory, discovered in fighting with #2 - Fix an infinite loop in ucall_alloc(), discovered fighting with the ucall_init() v. kvm_vm_elf_load() ordering issue Queued 3+5, thanks. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH v2 4/7] KVM: selftests: Correctly initialize the VA space for TTBR0_EL1
On 12/9/22 02:53, Oliver Upton wrote: @@ -268,17 +305,17 @@ struct kvm_vm *vm_create(enum vm_guest_mode mode) #ifdef __aarch64__ if (vm->pa_bits != 40) vm->type = KVM_VM_TYPE_ARM_IPA_SIZE(vm->pa_bits); + + /* selftests use TTBR0 only, meaning there is a single VA region. */ + vm->has_split_va_space = false; +#else + vm->has_split_va_space = true; #endif Even though there happens to be already a suitable #ifdef, I don't really like them and don't think there should be a new bool unless something else uses it. However, the new comment is very useful, so I added it to kvm_util.c as follows: /testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c index 759a45540108..56d5ea949cbb 100644 --- a/tools/testing/selftests/kvm/lib/kvm_util.c +++ b/tools/testing/selftests/kvm/lib/kvm_util.c @@ -186,6 +186,15 @@ const struct vm_guest_mode_params vm_guest_mode_params[] = { _Static_assert(sizeof(vm_guest_mode_params)/sizeof(struct vm_guest_mode_params) == NUM_VM_MODES, "Missing new mode params?"); +/* + * Initializes vm->vpages_valid to match the canonical VA space of the + * architecture. + * + * The default implementation is valid for architectures which split the + * range addressed by a single page table into a low and high region + * based on the MSB of the VA. On architectures with this behavior + * the VA region spans [0, 2^(va_bits - 1)), [-(2^(va_bits - 1), -1]. + */ __weak void vm_vaddr_populate_bitmap(struct kvm_vm *vm) { sparsebit_set_num(vm->vpages_valid, ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [GIT PULL] KVM/arm64 updates for 6.2
On Fri, Dec 9, 2022 at 6:05 PM Oliver Upton wrote: > Mind dumping what I had in v1 and applying this instead? > > https://lore.kernel.org/kvm/20221209015307.1781352-1-oliver.up...@linux.dev/ Ouch, five minutes too late... I can take care of the difference but it'll have to wait for Monday. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [GIT PULL] KVM/arm64 updates for 6.2
On 12/7/22 22:51, Oliver Upton wrote: I haven't pushed to kvm/next yet to give you time to check, so the merge is currently in kvm/queue only. Have a look at this series, which gets things building and actually passing again: https://lore.kernel.org/kvm/20221207214809.489070-1-oliver.up...@linux.dev/ Thank you! Squashed 1-2 and applied 3-4. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [GIT PULL] KVM/arm64 updates for 6.2
On 12/7/22 08:49, Marc Zyngier wrote: On Tue, 06 Dec 2022 21:43:43 +, Paolo Bonzini wrote: On 12/6/22 19:20, Mark Brown wrote: I almost suggested doing that on multiple occasions this cycle, but ultimately decided not to because it would effectively mean splitting series that touch KVM and selftests into different trees, which would create a different kind of dependency hell. Or maybe a hybrid approach where series that only (or mostly?) touch selftests go into a dedicated tree? Some other subsystems do have a separate branch for kselftests. One fairly common occurrence is that the selftests branch ends up failing to build independently because someone adds new ABI together with a selftest but the patches adding the ABI don't end up on the same branch as the tests which try to use them. That is of course resolvable but it's a common friction point. Yeah, the right solution is simply to merge selftests changes separately from the rest and use topic branches. Don't know if this is what you have in mind, but I think that we should use topic branches for *everything*. The only things for which I don't use a separate branch are the odd drive-by patches, of the spelling fix persuasion. Yeah, I just wish we had better tools to manage them... Paolo That's what we do for arm64 and the IRQ subsystem. It is a bit more involved at queuing time, but makes dropping series from -next extremely easy, without affecting the history. And crucially, it gives everyone a hint to base their stuff on a stable commit, not a random "tip of kvm/queue as of three days ago". M. ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [GIT PULL] KVM/arm64 updates for 6.2
On 12/6/22 19:20, Mark Brown wrote: I almost suggested doing that on multiple occasions this cycle, but ultimately decided not to because it would effectively mean splitting series that touch KVM and selftests into different trees, which would create a different kind of dependency hell. Or maybe a hybrid approach where series that only (or mostly?) touch selftests go into a dedicated tree? Some other subsystems do have a separate branch for kselftests. One fairly common occurrence is that the selftests branch ends up failing to build independently because someone adds new ABI together with a selftest but the patches adding the ABI don't end up on the same branch as the tests which try to use them. That is of course resolvable but it's a common friction point. Yeah, the right solution is simply to merge selftests changes separately from the rest and use topic branches. We will have more friction of this kind if we succeed in making more KVM code multi-architecture, so let's just treat selftests as the more innocuous drill... Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [GIT PULL] KVM/arm64 updates for 6.2
On 12/5/22 16:58, Marc Zyngier wrote: - There is a lot of selftest conflicts with your own branch, see: https://lore.kernel.org/r/20221201112432.4cb9a...@canb.auug.org.au https://lore.kernel.org/r/20221201113626.438f1...@canb.auug.org.au https://lore.kernel.org/r/20221201115741.7de32...@canb.auug.org.au https://lore.kernel.org/r/20221201120939.3c19f...@canb.auug.org.au https://lore.kernel.org/r/20221201131623.18ebc...@canb.auug.org.au for a rather exhaustive collection. Yeah, I saw them in Stephen's messages but missed your reply. In retrospect, at least Gavin's series for memslot_perf_test should have been applied by both of us with a topic branch, but there's so many conflicts all over the place that it's hard to single out one series. It just happens. The only conflict in non-x86 code is the following one, please check if I got it right. diff --git a/tools/testing/selftests/kvm/aarch64/page_fault_test.c b/tools/testing/selftests/kvm/aarch64/page_fault_test.c index 05bb6a6369c2..0cda70bef5d5 100644 --- a/tools/testing/selftests/kvm/aarch64/page_fault_test.c +++ b/tools/testing/selftests/kvm/aarch64/page_fault_test.c @@ -609,6 +609,8 @@ static void setup_memslots(struct kvm_vm *vm, struct test_params *p) data_size / guest_page_size, p->test_desc->data_memslot_flags); vm->memslots[MEM_REGION_TEST_DATA] = TEST_DATA_MEMSLOT; + + ucall_init(vm, data_gpa + data_size); } static void setup_default_handlers(struct test_desc *test) @@ -704,8 +706,6 @@ static void run_test(enum vm_guest_mode mode, void *arg) setup_gva_maps(vm); - ucall_init(vm, NULL); - reset_event_counts(); /* Special care is needed here because the test uses vm_create(). I haven't pushed to kvm/next yet to give you time to check, so the merge is currently in kvm/queue only. - For the 6.3 cycle, we are going to experiment with Oliver taking care of most of the patch herding. I'm sure he'll do a great job, but if there is the odd mistake, please cut him some slack and blame me instead. Absolutely - you both have all the slack you need, synchronization is harder than it seems. Paolo Please pull, M. The following changes since commit f0c4d9fc9cc9462659728d168387191387e903cc: Linux 6.1-rc4 (2022-11-06 15:07:11 -0800) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git tags/kvmarm-6.2 for you to fetch changes up to 753d734f3f347e7fc49b819472bbf61dcfc1a16f: Merge remote-tracking branch 'arm64/for-next/sysregs' into kvmarm-master/next (2022-12-05 14:39:53 +) KVM/arm64 updates for 6.2 - Enable the per-vcpu dirty-ring tracking mechanism, together with an option to keep the good old dirty log around for pages that are dirtied by something other than a vcpu. - Switch to the relaxed parallel fault handling, using RCU to delay page table reclaim and giving better performance under load. - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option, which multi-process VMMs such as crosvm rely on. - Merge the pKVM shadow vcpu state tracking that allows the hypervisor to have its own view of a vcpu, keeping that state private. - Add support for the PMUv3p5 architecture revision, bringing support for 64bit counters on systems that support it, and fix the no-quite-compliant CHAIN-ed counter support for the machines that actually exist out there. - Fix a handful of minor issues around 52bit VA/PA support (64kB pages only) as a prefix of the oncoming support for 4kB and 16kB pages. - Add/Enable/Fix a bunch of selftests covering memslots, breakpoints, stage-2 faults and access tracking. You name it, we got it, we probably broke it. - Pick a small set of documentation and spelling fixes, because no good merge window would be complete without those. As a side effect, this tag also drags: - The 'kvmarm-fixes-6.1-3' tag as a dependency to the dirty-ring series - A shared branch with the arm64 tree that repaints all the system registers to match the ARM ARM's naming, and resulting in interesting conflicts Anshuman Khandual (1): KVM: arm64: PMU: Replace version number '0' with ID_AA64DFR0_EL1_PMUVer_NI Catalin Marinas (4): mm: Do not enable PG_arch_2 for all 64-bit architectures arm64: mte: Fix/clarify the PG_mte_tagged semantics KVM: arm64: Simplify the sanitise_mte_tags() logic arm64: mte: Lock a page for MTE tag initialisation Fuad Tabba (3): KVM: arm64: Add hyp_spinlock_t static initializer KVM: arm64: Add infrastructure to create and track pKVM instances at EL2 KVM: arm64: Instantiate pKVM hypervisor VM and vCPU structures from EL1 Gavin Shan (14): KVM: x86:
Re: [PATCH 0/9] tools: Make {clear,set}_bit() atomic for reals
On 11/19/22 02:34, Sean Christopherson wrote: For obvious reasons I'd like to route the this through Paolo's tree. In theory, taking just patch 5 through tip would work, but creating a topic branch seems like the way to go, though maybe I'm being overly paranoid. The current tip/perf/core doesn't have any conflicts, nor does it have new set_bit() or clear_bit() users. Code sitting in kvm/queue for 6.2 adds functionality that relies on clear_bit() being an atomic operation. Unfortunately, despite being implemented in atomic.h (among other strong hits that they should be atomic), clear_bit() and set_bit() aren't actually atomic (and of course I realized this _just_ after everything got queued up). Move current tools/ users of clear_bit() and set_bit() to the double-underscore versions (which tools/ already provides and documents as being non-atomic), and then implement clear_bit() and set_bit() as actual atomics to fix the KVM selftests bug. Perf and KVM are the only affected users. NVDIMM also has test code in tools/, but that builds against the kernel proper. The KVM code is well tested and fully audited. The perf code is lightly tested; if I understand the build system, it's probably not even fully compile tested. Patches 1 and 2 are completely unrelated and are fixes for patches sitting in kvm/queue. Paolo, they can be squashed if you want to rewrite history. Patch 3 fixes a hilarious collision in a KVM ARM selftest that will arise when clear_bit() is converted to an atomic. Patch 4 changes clear_bit() and set_bit() to take an "unsigned long" instead of an "int" so that patches 5-6 aren't accompanied by functional changes. I.e. if something in perf is somehow relying on "bit" being a signed int, failures will bisect to patch 4 and not to the supposed-to-be-a-nop conversion to __clear_bit() and __set_bit(). Patch 5-9 switch perf+KVM and complete the conversion. Applies on: git://git.kernel.org/pub/scm/virt/kvm/kvm.git queue Queued, thanks Namhyung for the ACK! Paolo Sean Christopherson (9): KVM: selftests: Add rdmsr_from_l2() implementation in Hyper-V eVMCS test KVM: selftests: Remove unused "vcpu" param to fix build error KVM: arm64: selftests: Enable single-step without a "full" ucall() tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers perf tools: Use dedicated non-atomic clear/set bit helpers KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers tools: Drop "atomic_" prefix from atomic test_and_set_bit() tools: KVM: selftests: Convert clear/set_bit() to actual atomics tools/arch/x86/include/asm/atomic.h | 6 +++- tools/include/asm-generic/atomic-gcc.h| 13 ++- tools/include/asm-generic/bitops/atomic.h | 15 tools/include/linux/bitmap.h | 34 --- tools/perf/bench/find-bit-bench.c | 2 +- tools/perf/builtin-c2c.c | 6 ++-- tools/perf/builtin-kwork.c| 6 ++-- tools/perf/builtin-record.c | 6 ++-- tools/perf/builtin-sched.c| 2 +- tools/perf/tests/bitmap.c | 2 +- tools/perf/tests/mem2node.c | 2 +- tools/perf/util/affinity.c| 4 +-- tools/perf/util/header.c | 8 ++--- tools/perf/util/mmap.c| 6 ++-- tools/perf/util/pmu.c | 2 +- .../util/scripting-engines/trace-event-perl.c | 2 +- .../scripting-engines/trace-event-python.c| 2 +- tools/perf/util/session.c | 2 +- tools/perf/util/svghelper.c | 2 +- .../selftests/kvm/aarch64/arch_timer.c| 2 +- .../selftests/kvm/aarch64/debug-exceptions.c | 21 ++-- tools/testing/selftests/kvm/dirty_log_test.c | 34 +-- .../selftests/kvm/include/ucall_common.h | 8 + .../testing/selftests/kvm/lib/ucall_common.c | 2 +- .../selftests/kvm/x86_64/hyperv_evmcs.c | 13 +-- .../selftests/kvm/x86_64/hyperv_svm_test.c| 4 +-- .../selftests/kvm/x86_64/hyperv_tlb_flush.c | 2 +- 27 files changed, 102 insertions(+), 106 deletions(-) base-commit: 3321eef4acb51c303f0598d8a8493ca58528a054 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 33/44] KVM: x86: Do VMX/SVM support checks directly in vendor code
On 11/3/22 19:58, Sean Christopherson wrote: diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 3e508f239098..ebe617ab0b37 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -191,6 +191,8 @@ static void default_init(struct cpuinfo_x86 *c) strcpy(c->x86_model_id, "386"); } #endif + + clear_cpu_cap(c, X86_FEATURE_MSR_IA32_FEAT_CTL); } static const struct cpu_dev default_cpu = { Not needed I think? default_init does not call init_ia32_feat_ctl. diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c index c881bcafba7d..3a7ae67f5a5e 100644 --- a/arch/x86/kernel/cpu/cpuid-deps.c +++ b/arch/x86/kernel/cpu/cpuid-deps.c @@ -72,6 +72,8 @@ static const struct cpuid_dep cpuid_deps[] = { { X86_FEATURE_AVX512_FP16, X86_FEATURE_AVX512BW }, { X86_FEATURE_ENQCMD, X86_FEATURE_XSAVES}, { X86_FEATURE_PER_THREAD_MBA, X86_FEATURE_MBA }, + { X86_FEATURE_VMX, X86_FEATURE_MSR_IA32_FEAT_CTL }, + { X86_FEATURE_SGX, X86_FEATURE_MSR_IA32_FEAT_CTL }, { X86_FEATURE_SGX_LC, X86_FEATURE_SGX }, { X86_FEATURE_SGX1, X86_FEATURE_SGX }, { X86_FEATURE_SGX2, X86_FEATURE_SGX1 }, Yes, good idea. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 00/44] KVM: Rework kvm_init() and hardware enabling
On 11/4/22 08:17, Isaku Yamahata wrote: On Wed, Nov 02, 2022 at 11:18:27PM +, Sean Christopherson wrote: Non-x86 folks, please test on hardware when possible. I made a _lot_ of mistakes when moving code around. Thankfully, x86 was the trickiest code to deal with, and I'm fairly confident that I found all the bugs I introduced via testing. But the number of mistakes I made and found on x86 makes me more than a bit worried that I screwed something up in other arch code. This is a continuation of Chao's series to do x86 CPU compatibility checks during virtualization hardware enabling[1], and of Isaku's series to try and clean up the hardware enabling paths so that x86 (Intel specifically) can temporarily enable hardware during module initialization without causing undue pain for other architectures[2]. It also includes one patch from another mini-series from Isaku that provides the less controversial patches[3]. The main theme of this series is to kill off kvm_arch_init(), kvm_arch_hardware_(un)setup(), and kvm_arch_check_processor_compat(), which all originated in x86 code from way back when, and needlessly complicate both common KVM code and architecture code. E.g. many architectures don't mark functions/data as __init/__ro_after_init purely because kvm_init() isn't marked __init to support x86's separate vendor modules. The idea/hope is that with those hooks gone (moved to arch code), it will be easier for x86 (and other architectures) to modify their module init sequences as needed without having to fight common KVM code. E.g. I'm hoping that ARM can build on this to simplify its hardware enabling logic, especially the pKVM side of things. There are bug fixes throughout this series. They are more scattered than I would usually prefer, but getting the sequencing correct was a gigantic pain for many of the x86 fixes due to needing to fix common code in order for the x86 fix to have any meaning. And while the bugs are often fatal, they aren't all that interesting for most users as they either require a malicious admin or broken hardware, i.e. aren't likely to be encountered by the vast majority of KVM users. So unless someone _really_ wants a particular fix isolated for backporting, I'm not planning on shuffling patches. Tested on x86. Lightly tested on arm64. Compile tested only on all other architectures. Thanks for the patch series. I the rebased TDX KVM patch series and it worked. Since cpu offline needs to be rejected in some cases(To keep at least one cpu on a package), arch hook for cpu offline is needed. I can keep it in TDX KVM patch series. Yes, this patch looks good. Paolo diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h index 23c0f4bc63f1..ef7bcb845d42 100644 --- a/arch/x86/include/asm/kvm-x86-ops.h +++ b/arch/x86/include/asm/kvm-x86-ops.h @@ -17,6 +17,7 @@ BUILD_BUG_ON(1) KVM_X86_OP(hardware_enable) KVM_X86_OP(hardware_disable) KVM_X86_OP(hardware_unsetup) +KVM_X86_OP_OPTIONAL_RET0(offline_cpu) KVM_X86_OP(has_emulated_msr) KVM_X86_OP(vcpu_after_set_cpuid) KVM_X86_OP(is_vm_type_supported) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 496c7c6eaff9..c420409aa96f 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1468,6 +1468,7 @@ struct kvm_x86_ops { int (*hardware_enable)(void); void (*hardware_disable)(void); void (*hardware_unsetup)(void); + int (*offline_cpu)(void); bool (*has_emulated_msr)(struct kvm *kvm, u32 index); void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2ed5a017f7bc..17c5d6a76c93 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -12039,6 +12039,11 @@ void kvm_arch_hardware_disable(void) drop_user_return_notifiers(); } +int kvm_arch_offline_cpu(unsigned int cpu) +{ + return static_call(kvm_x86_offline_cpu)(); +} + bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu) { return vcpu->kvm->arch.bsp_vcpu_id == vcpu->vcpu_id; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 620489b9aa93..4df79443fd11 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1460,6 +1460,7 @@ static inline void kvm_create_vcpu_debugfs(struct kvm_vcpu *vcpu) {} int kvm_arch_hardware_enable(void); void kvm_arch_hardware_disable(void); #endif +int kvm_arch_offline_cpu(unsigned int cpu); int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu); bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu); int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index f6b6dcedaa0a..f770fdc662d0 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -5396,16 +5396,24 @@ static void hardware_disable_nolock(void *junk) __this_cpu_write(hardware_enabled, false); } +__weak int kvm_arch_offline_cpu(unsigned
Re: [PATCH 33/44] KVM: x86: Do VMX/SVM support checks directly in vendor code
On 11/3/22 19:35, Sean Christopherson wrote: It's technically required. IA32_FEAT_CTL and thus KVM_INTEL depends on any of CPU_SUP_{INTEL,CENATUR,ZHAOXIN}, but init_ia32_feat_ctl() is invoked if and only if the actual CPU type matches one of the aforementioned CPU_SUP_*. E.g. running a kernel built with CONFIG_CPU_SUP_INTEL=y CONFIG_CPU_SUP_AMD=y # CONFIG_CPU_SUP_HYGON is not set # CONFIG_CPU_SUP_CENTAUR is not set # CONFIG_CPU_SUP_ZHAOXIN is not set on a Cenatur or Zhaoxin CPU will leave X86_FEATURE_VMX set but not set X86_FEATURE_MSR_IA32_FEAT_CTL. If VMX isn't enabled in MSR_IA32_FEAT_CTL, KVM will get unexpected #UDs when trying to enable VMX. Oh, I see. Perhaps X86_FEATURE_VMX and X86_FEATURE_SGX should be moved to one of the software words instead of using cpuid. Nothing that you should care about for this series though. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 36/44] KVM: x86: Do compatibility checks when onlining CPU
On 11/3/22 18:44, Sean Christopherson wrote: Do compatibility checks when enabling hardware to effectively add compatibility checks when onlining a CPU. Abort enabling, i.e. the online process, if the (hotplugged) CPU is incompatible with the known good setup. This paragraph is not true with this patch being before "KVM: Rename and move CPUHP_AP_KVM_STARTING to ONLINE section". Argh, good eyes. Getting the ordering correct in this series has been quite the struggle. Assuming there are no subtle dependencies between x86 and common KVM, the ordering should be something like this: It's not a problem to keep the ordering in this v1, just fix the commit message like "Do compatibility checks when enabling hardware to effectively add compatibility checks on CPU hotplug. For now KVM is using a STARTING hook, which makes it impossible to abort the hotplug if the new CPU is incompatible with the known good setup; switching to an ONLINE hook will fix this." Paolo KVM: Opt out of generic hardware enabling on s390 and PPC KVM: Register syscore (suspend/resume) ops early in kvm_init() KVM: x86: Do compatibility checks when onlining CPU KVM: SVM: Check for SVM support in CPU compatibility checks KVM: VMX: Shuffle support checks and hardware enabling code around KVM: x86: Do VMX/SVM support checks directly in vendor code KVM: x86: Unify pr_fmt to use module name for all KVM modules KVM: x86: Use KBUILD_MODNAME to specify vendor module name KVM: Make hardware_enable_failed a local variable in the "enable all" path KVM: Use a per-CPU variable to track which CPUs have enabled virtualization KVM: Remove on_each_cpu(hardware_disable_nolock) in kvm_exit() KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock KVM: Disable CPU hotplug during hardware enabling KVM: Rename and move CPUHP_AP_KVM_STARTING to ONLINE section KVM: Drop kvm_arch_check_processor_compat() hook ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 00/44] KVM: Rework kvm_init() and hardware enabling
On 11/3/22 13:08, Christian Borntraeger wrote: There are bug fixes throughout this series. They are more scattered than I would usually prefer, but getting the sequencing correct was a gigantic pain for many of the x86 fixes due to needing to fix common code in order for the x86 fix to have any meaning. And while the bugs are often fatal, they aren't all that interesting for most users as they either require a malicious admin or broken hardware, i.e. aren't likely to be encountered by the vast majority of KVM users. So unless someone _really_ wants a particular fix isolated for backporting, I'm not planning on shuffling patches. Tested on x86. Lightly tested on arm64. Compile tested only on all other architectures. Some sniff tests seem to work ok on s390. Thanks. There are just a couple nits, and MIPS/PPC/RISC-V have very small changes. Feel free to send me a pull request once Marc acks. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 39/44] KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock
On 11/3/22 00:19, Sean Christopherson wrote: +- kvm_lock is taken outside kvm->mmu_lock Not surprising since one is a mutex and one is an rwlock. :) You can drop this hunk as well as the "Opportunistically update KVM's locking documentation" sentence in the commit message. - vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock - kvm->arch.mmu_lock is an rwlock. kvm->arch.tdp_mmu_pages_lock and @@ -216,15 +220,11 @@ time it will be set using the Dirty tracking mechanism described above. :Type:mutex :Arch:any :Protects:- vm_list - -``kvm_count_lock`` -^^ - -:Type: raw_spinlock_t -:Arch: any -:Protects: - hardware virtualization enable/disable -:Comment: 'raw' because hardware enabling/disabling must be atomic /wrt - migration. + - kvm_usage_count + - hardware virtualization enable/disable + - module probing (x86 only) What do you mean exactly by "module probing"? Is it anything else than what is serialized by vendor_module_lock? Paolo +:Comment: KVM also disables CPU hotplug via cpus_read_lock() during + enable/disable. ``kvm->mn_invalidate_lock`` ^^^ diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4e765ef9f4bd..c8d92e6c3922 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -100,7 +100,6 @@ EXPORT_SYMBOL_GPL(halt_poll_ns_shrink); */ DEFINE_MUTEX(kvm_lock); -static DEFINE_RAW_SPINLOCK(kvm_count_lock); LIST_HEAD(vm_list); static cpumask_var_t cpus_hardware_enabled; @@ -5028,9 +5027,10 @@ static void hardware_enable_nolock(void *junk) static int kvm_online_cpu(unsigned int cpu) { + unsigned long flags; int ret = 0; - raw_spin_lock(_count_lock); + mutex_lock(_lock); /* * Abort the CPU online process if hardware virtualization cannot * be enabled. Otherwise running VMs would encounter unrecoverable @@ -5039,13 +5039,16 @@ static int kvm_online_cpu(unsigned int cpu) if (kvm_usage_count) { WARN_ON_ONCE(atomic_read(_enable_failed)); + local_irq_save(flags); hardware_enable_nolock(NULL); + local_irq_restore(flags); + if (atomic_read(_enable_failed)) { atomic_set(_enable_failed, 0); ret = -EIO; } } - raw_spin_unlock(_count_lock); + mutex_unlock(_lock); return ret; } @@ -5061,10 +5064,13 @@ static void hardware_disable_nolock(void *junk) static int kvm_offline_cpu(unsigned int cpu) { - raw_spin_lock(_count_lock); - if (kvm_usage_count) + mutex_lock(_lock); + if (kvm_usage_count) { + preempt_disable(); hardware_disable_nolock(NULL); - raw_spin_unlock(_count_lock); + preempt_enable(); + } + mutex_unlock(_lock); return 0; } @@ -5079,9 +5085,11 @@ static void hardware_disable_all_nolock(void) static void hardware_disable_all(void) { - raw_spin_lock(_count_lock); + cpus_read_lock(); + mutex_lock(_lock); hardware_disable_all_nolock(); - raw_spin_unlock(_count_lock); + mutex_unlock(_lock); + cpus_read_unlock(); } static int hardware_enable_all(void) @@ -5097,7 +5105,7 @@ static int hardware_enable_all(void) * Disable CPU hotplug to prevent scenarios where KVM sees */ cpus_read_lock(); - raw_spin_lock(_count_lock); + mutex_lock(_lock); kvm_usage_count++; if (kvm_usage_count == 1) { @@ -5110,7 +5118,7 @@ static int hardware_enable_all(void) } } - raw_spin_unlock(_count_lock); + mutex_unlock(_lock); cpus_read_unlock(); return r; @@ -5716,6 +5724,15 @@ static void kvm_init_debug(void) static int kvm_suspend(void) { + /* +* Secondary CPUs and CPU hotplug are disabled across the suspend/resume +* callbacks, i.e. no need to acquire kvm_lock to ensure the usage count +* is stable. Assert that kvm_lock is not held as a paranoid sanity +* check that the system isn't suspended when KVM is enabling hardware. +*/ + lockdep_assert_not_held(_lock); + lockdep_assert_irqs_disabled(); + if (kvm_usage_count) hardware_disable_nolock(NULL); return 0; @@ -5723,10 +5740,11 @@ static int kvm_suspend(void) static void kvm_resume(void) { - if (kvm_usage_count) { - lockdep_assert_not_held(_count_lock); + lockdep_assert_not_held(_lock); + lockdep_assert_irqs_disabled(); + + if (kvm_usage_count) hardware_enable_nolock(NULL); - } } static struct syscore_ops kvm_syscore_ops = { ___ kvmarm
Re: [PATCH 36/44] KVM: x86: Do compatibility checks when onlining CPU
On 11/3/22 00:19, Sean Christopherson wrote: From: Chao Gao Do compatibility checks when enabling hardware to effectively add compatibility checks when onlining a CPU. Abort enabling, i.e. the online process, if the (hotplugged) CPU is incompatible with the known good setup. This paragraph is not true with this patch being before "KVM: Rename and move CPUHP_AP_KVM_STARTING to ONLINE section". Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 33/44] KVM: x86: Do VMX/SVM support checks directly in vendor code
On 11/3/22 00:19, Sean Christopherson wrote: + if (!boot_cpu_has(X86_FEATURE_MSR_IA32_FEAT_CTL) || + !boot_cpu_has(X86_FEATURE_VMX)) { + pr_err("VMX not enabled in MSR_IA32_FEAT_CTL\n"); + return false; I think the reference to the BIOS should remain in these messages and in svm.c (even though these days it's much less common for vendors to default to disabled virtualization in the system setup). The check for X86_FEATURE_MSR_IA32_FEAT_CTL is not needed because init_ia32_feat_ctl() will clear X86_FEATURE_VMX if the rdmsr fail (and not set X86_FEATURE_MSR_IA32_FEAT_CTL). Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 10/44] KVM: VMX: Clean up eVMCS enabling if KVM initialization fails
On Thu, Nov 3, 2022 at 3:01 PM Paolo Bonzini wrote: > > On 11/3/22 00:18, Sean Christopherson wrote: > > +static void hv_cleanup_evmcs(void) > > This needs to be __init. Error: brain temporarily disconnected. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 10/44] KVM: VMX: Clean up eVMCS enabling if KVM initialization fails
On 11/3/22 00:18, Sean Christopherson wrote: +static void hv_cleanup_evmcs(void) This needs to be __init. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [GIT PULL] KVM/arm64 fixes for 6.1, take #2
On Thu, Oct 20, 2022 at 12:02 PM Marc Zyngier wrote: > > Paolo, > > Here's a couple of additional fixes for 6.1. The ITS one is pretty > annoying as it prevents a VM from being restored if it has a > convoluted device topology. Definitely a stable candidate. > > Note that I can't see that you have pulled the first set of fixes > which I sent last week[1]. In order to avoid any problem, the current > pull-request is a suffix of the previous one. But you may want to pull > them individually in order to preserve the tag descriptions. Yes, that's why I did. Pulled now, thanks. Paolo > > Please pull, > > M. > > [1] https://lore.kernel.org/r/20221013132830.1304947-1-...@kernel.org > > The following changes since commit 05c2224d4b049406b0545a10be05280ff4b8ba0a: > > KVM: selftests: Fix number of pages for memory slot in > memslot_modification_stress_test (2022-10-13 11:46:51 +0100) > > are available in the Git repository at: > > git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git > tags/kvmarm-fixes-6.1-2 > > for you to fetch changes up to c000a2607145d28b06c697f968491372ea56c23a: > > KVM: arm64: vgic: Fix exit condition in scan_its_table() (2022-10-15 > 12:10:54 +0100) > > > KVM/arm64 fixes for 6.1, take #2 > > - Fix a bug preventing restoring an ITS containing mappings > for very large and very sparse device topology > > - Work around a relocation handling error when compiling > the nVHE object with profile optimisation > > > Denis Nikitin (1): > KVM: arm64: nvhe: Fix build with profile optimization > > Eric Ren (1): > KVM: arm64: vgic: Fix exit condition in scan_its_table() > > arch/arm64/kvm/hyp/nvhe/Makefile | 4 > arch/arm64/kvm/vgic/vgic-its.c | 5 - > 2 files changed, 8 insertions(+), 1 deletion(-) > ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [GIT PULL] KVM/arm64 fixes for 6.1, take #1
On Thu, Oct 13, 2022 at 3:28 PM Marc Zyngier wrote: > Paolo, > > Here's the first set of fixes for 6.1. The most interesting bit is > Oliver's fix limiting the S2 invalidation batch size the the largest > block mapping, solving (at least for now) the RCU stall problems we > have been seeing for a while. We may have to find another solution > when (and if) we decide to allow 4TB mapping at S2... > > The rest is a set of minor selftest fixes as well as enabling stack > protection and profiling in the VHE code. > > Please pull, Done, thanks. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [GIT PULL] KVM/arm64 updates for 6.1
Pulled, thanks! Paolo On Sun, Oct 2, 2022 at 2:42 PM Marc Zyngier wrote: > > Paolo, > > Please find below the rather small set of KVM/arm64 updates > for 6.1. This is mostly a set of fixes for existing features, > the most interesting one being Reiji's really good work tracking > an annoying set of bugs in our single-stepping implementation. > Also, I've included the changes making it possible to enable > the dirty-ring tracking on arm64. Full details in the tag. > > Note that this pull-request comes with a branch shared with the > arm64 tree, in order to avoid some bad conflicts due to the > ongoing repainting of all the system registers. > > Please pull, > > M. > > The following changes since commit b90cb1053190353cc30f0fef0ef1f378ccc063c5: > > Linux 6.0-rc3 (2022-08-28 15:05:29 -0700) > > are available in the Git repository at: > > git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git > tags/kvmarm-6.1 > > for you to fetch changes up to b302ca52ba8235ff0e18c0fa1fa92b51784aef6a: > > Merge branch kvm-arm64/misc-6.1 into kvmarm-master/next (2022-10-01 > 10:19:36 +0100) > > > KVM/arm64 updates for v6.1 > > - Fixes for single-stepping in the presence of an async > exception as well as the preservation of PSTATE.SS > > - Better handling of AArch32 ID registers on AArch64-only > systems > > - Fixes for the dirty-ring API, allowing it to work on > architectures with relaxed memory ordering > > - Advertise the new kvmarm mailing list > > - Various minor cleanups and spelling fixes > > > Elliot Berman (1): > KVM: arm64: Ignore kvm-arm.mode if !is_hyp_mode_available() > > Gavin Shan (1): > KVM: arm64: vgic: Remove duplicate check in update_affinity_collection() > > Kristina Martsenko (3): > arm64: cache: Remove unused CTR_CACHE_MINLINE_MASK > arm64/sysreg: Standardise naming for ID_AA64MMFR1_EL1 fields > arm64/sysreg: Convert ID_AA64MMFR1_EL1 to automatic generation > > Marc Zyngier (12): > Merge branch kvm-arm64/aarch32-raz-idregs into kvmarm-master/next > Merge remote-tracking branch 'arm64/for-next/sysreg' into > kvmarm-master/next > Merge branch kvm-arm64/single-step-async-exception into > kvmarm-master/next > KVM: Use acquire/release semantics when accessing dirty ring GFN state > KVM: Add KVM_CAP_DIRTY_LOG_RING_ACQ_REL capability and config option > KVM: x86: Select CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL > KVM: Document weakly ordered architecture requirements for dirty ring > KVM: selftests: dirty-log: Upgrade flag accesses to acquire/release > semantics > KVM: selftests: dirty-log: Use KVM_CAP_DIRTY_LOG_RING_ACQ_REL if > available > KVM: arm64: Advertise new kvmarm mailing list > Merge branch kvm-arm64/dirty-log-ordered into kvmarm-master/next > Merge branch kvm-arm64/misc-6.1 into kvmarm-master/next > > Mark Brown (31): > arm64/sysreg: Remove stray SMIDR_EL1 defines > arm64/sysreg: Describe ID_AA64SMFR0_EL1.SMEVer as an enumeration > arm64/sysreg: Add _EL1 into ID_AA64MMFR0_EL1 definition names > arm64/sysreg: Add _EL1 into ID_AA64MMFR2_EL1 definition names > arm64/sysreg: Add _EL1 into ID_AA64PFR0_EL1 definition names > arm64/sysreg: Add _EL1 into ID_AA64PFR1_EL1 constant names > arm64/sysreg: Standardise naming of ID_AA64MMFR0_EL1.BigEnd > arm64/sysreg: Standardise naming of ID_AA64MMFR0_EL1.ASIDBits > arm64/sysreg: Standardise naming for ID_AA64MMFR2_EL1.VARange > arm64/sysreg: Standardise naming for ID_AA64MMFR2_EL1.CnP > arm64/sysreg: Standardise naming for ID_AA64PFR0_EL1 constants > arm64/sysreg: Standardise naming for ID_AA64PFR0_EL1.AdvSIMD constants > arm64/sysreg: Standardise naming for SSBS feature enumeration > arm64/sysreg: Standardise naming for MTE feature enumeration > arm64/sysreg: Standardise naming of ID_AA64PFR1_EL1 fractional version > fields > arm64/sysreg: Standardise naming of ID_AA64PFR1_EL1 BTI enumeration > arm64/sysreg: Standardise naming of ID_AA64PFR1_EL1 SME enumeration > arm64/sysreg: Convert HCRX_EL2 to automatic generation > arm64/sysreg: Convert ID_AA64MMFR0_EL1 to automatic generation > arm64/sysreg: Convert ID_AA64MMFR2_EL1 to automatic generation > arm64/sysreg: Convert ID_AA64PFR0_EL1 to automatic generation > arm64/sysreg: Convert ID_AA64PFR1_EL1 to automatic generation > arm64/sysreg: Convert TIPDR_EL1 to automatic generation > arm64/sysreg: Convert SCXTNUM_EL1 to automatic generation > arm64/sysreg: Add defintion for ALLINT > arm64/sysreg: Align field names in ID_AA64DFR0_EL1 with architecture > arm64/sysreg: Add _EL1 into ID_AA64DFR0_EL1 definition names > arm64/sysreg: Use feature numbering for PMU and SPE revisions >
Re: [PATCH 2/6] KVM: Add KVM_CAP_DIRTY_LOG_RING_ORDERED capability and config option
Il ven 23 set 2022, 20:26 Peter Xu ha scritto: > > > Someone will show up with an old userspace which probes for the sole > > existing capability, and things start failing subtly. It is quite > > likely that the userspace code is built for all architectures, > > I didn't quite follow here. Since both kvm/qemu dirty ring was only > supported on x86, I don't see the risk. Say you run a new ARM kernel on old userspace, and the new kernel uses KVM_CAP_DIRTY_LOG_RING. Userspace will try to use the dirty page ring buffer even though it lacks the memory barriers that were just introduced in QEMU. The new capability means "the dirty page ring buffer is supported and, by the way, you're supposed to do everything right with respect to ordering of loads and stores; you can't get away without it like you could on x86". Paolo > > Assuming we've the old binary. > > If to run on old kernel, it'll work like before. > > If to run on new kernel, the kernel will behave stricter on memory barriers > but should still be compatible with the old behavior (not vice versa, so > I'll understand if we're loosing the ordering, but we're not..). > > Any further elaboration would be greatly helpful. > > Thanks, > > > and we > > want to make sure that userspace actively buys into the new ordering > > requirements. A simple way to do this is to expose a new capability, > > making the new requirement obvious. Architectures with relaxed > > ordering semantics will only implement the new one, while x86 will > > implement both. > > -- > Peter Xu > ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 1/6] KVM: Use acquire/release semantics when accessing dirty ring GFN state
On Fri, Sep 23, 2022 at 4:20 PM Marc Zyngier wrote: > > > This is only a partial fix as the userspace side also need upgrading. > > > > Paolo has one fix 4802bf910e ("KVM: dirty ring: add missing memory > > barrier", 2022-09-01) which has already landed. > > What is this commit? It doesn't exist in the kernel as far as I can see. That's the load_acquire in QEMU, and the store_release part is in 7.2 as well (commit 52281c6d11, "KVM: use store-release to mark dirty pages as harvested", 2022-09-18). So all that QEMU is missing is the new capability. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 5/6] KVM: selftests: dirty-log: Upgrade dirty_gfn_set_collected() to store-release
On Thu, Sep 22, 2022 at 7:02 PM Marc Zyngier wrote: > To make sure that all the writes to the log marking the entries > as being in need of reset are observed in order, use a > smp_store_release() when updating the log entry flags. > > Signed-off-by: Marc Zyngier You also need a load-acquire on the load of gfn->flags in dirty_gfn_is_dirtied. Otherwise reading cur->slot or cur->offset might see a stale value. Paolo > --- > tools/testing/selftests/kvm/dirty_log_test.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/tools/testing/selftests/kvm/dirty_log_test.c > b/tools/testing/selftests/kvm/dirty_log_test.c > index 9c883c94d478..3d29f4bf4f9c 100644 > --- a/tools/testing/selftests/kvm/dirty_log_test.c > +++ b/tools/testing/selftests/kvm/dirty_log_test.c > @@ -17,6 +17,7 @@ > #include > #include > #include > +#include > > #include "kvm_util.h" > #include "test_util.h" > @@ -284,7 +285,7 @@ static inline bool dirty_gfn_is_dirtied(struct > kvm_dirty_gfn *gfn) > > static inline void dirty_gfn_set_collected(struct kvm_dirty_gfn *gfn) > { > - gfn->flags = KVM_DIRTY_GFN_F_RESET; > + smp_store_release(>flags, KVM_DIRTY_GFN_F_RESET); > } > > static uint32_t dirty_ring_collect_one(struct kvm_dirty_gfn *dirty_gfns, > -- > 2.34.1 > ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [GIT PULL] KVM/arm64 fixes for 6.0, take #2
Pulled, thanks. Paolo On Mon, Sep 19, 2022 at 7:19 PM Marc Zyngier wrote: > > Paolo, > > Here's the last KVM/arm64 pull request for this cycle, with > a small fix for pKVM and kmemleak. > > Please pull, > > M. > > The following changes since commit 1c23f9e627a7b412978b4e852793c5e3c3efc555: > > Linux 6.0-rc2 (2022-08-21 17:32:54 -0700) > > are available in the Git repository at: > > git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git > tags/kvmarm-fixes-6.0-2 > > for you to fetch changes up to 522c9a64c7049f50c7b1299741c13fac3f231cd4: > > KVM: arm64: Use kmemleak_free_part_phys() to unregister hyp_mem_base > (2022-09-19 17:59:48 +0100) > > > KVM/arm64 fixes for 6.0, take #2 > > - Fix kmemleak usage in Protected KVM (again) > > > Zenghui Yu (1): > KVM: arm64: Use kmemleak_free_part_phys() to unregister hyp_mem_base > > arch/arm64/kvm/arm.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking
On 8/30/22 16:42, Peter Xu wrote: Marc, I thought we won't hit this as long as we properly take care of other orderings of (a) gfn push, and (b) gfn collect, but after a second thought I think it's indeed logically possible that with a reversed ordering here we can be reading some garbage gfn before (a) happens butt also read the valid flag after (b). It seems we must have all the barriers correctly applied always. If that's correct, do you perhaps mean something like this to just add the last piece of barrier? Okay, so I thought about it some more and it's quite tricky. Strictly speaking, the synchronization is just between userspace and kernel. The fact that the actual producer of dirty pages is in another CPU is a red herring, because reset only cares about harvested pages. In other words, the dirty page ring is essentially two ring buffers in one and we only care about the "harvested ring", not the "produced ring". On the other hand, it may happen that userspace has set more RESET flags while the ioctl is ongoing: CPU0 CPU1 CPU2 fill gfn0 store-rel flags for gfn0 fill gfn1 store-rel flags for gfn1 load-acq flags for gfn0 set RESET for gfn0 load-acq flags for gfn1 set RESET for gfn1 do ioctl! ---> ioctl(RESET_RINGS) fill gfn2 store-rel flags for gfn2 load-acq flags for gfn2 set RESET for gfn2 process gfn0 process gfn1 process gfn2 do ioctl! etc. The three load-acquire in CPU0 synchronize with the three store-release in CPU2, but CPU0 and CPU1 are only synchronized up to gfn1 and CPU1 may miss gfn2's fields other than flags. The kernel must be able to cope with invalid values of the fields, and userspace will invoke the ioctl once more. However, once the RESET flag is cleared on gfn2, it is lost forever, therefore in the above scenario CPU1 must read the correct value of gfn2's fields. Therefore RESET must be set with a store-release, that will synchronize with a load-acquire in CPU1 as you suggested. Paolo diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c index f4c2a6eb1666..ea620bfb012d 100644 --- a/virt/kvm/dirty_ring.c +++ b/virt/kvm/dirty_ring.c @@ -84,7 +84,7 @@ static inline void kvm_dirty_gfn_set_dirtied(struct kvm_dirty_gfn *gfn) static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn) { - return gfn->flags & KVM_DIRTY_GFN_F_RESET; + return smp_load_acquire(>flags) & KVM_DIRTY_GFN_F_RESET; } int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring) ===8<=== Thanks, -- Peter Xu ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking
Il ven 26 ago 2022, 17:50 Marc Zyngier ha scritto: > userspace has no choice. It cannot order on its own the reads that the > kernel will do to *other* rings. > > The problem isn't on CPU0 The problem is that CPU1 does observe > inconsistent data on arm64, and I don't think this difference in > behaviour is acceptable. Nothing documents this, and there is a baked > in assumption that there is a strong ordering between writes as well > as between writes and read. > Nevermind, what I wrote last Saturday doesn't make sense. I will try to put together a litmus test but IMO the synchronization is just between userspace and kernel, because the dirty page ring is essentially two ring buffers in one. The actual producer of dirty pages is a red herring. I am HTML-only for a couple days, I will resend to the mailing list on Thursday. Paolo > ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking
On 8/26/22 17:49, Marc Zyngier wrote: Agreed, but that's a problem for userspace to solve. If userspace wants to reset the fields in different CPUs, it has to synchronize with its own invoking of the ioctl. userspace has no choice. It cannot order on its own the reads that the kernel will do to *other* rings. Those reads will never see KVM_DIRTY_GFN_F_RESET in the flags however, if userspace has never interacted with the ring. So there will be exactly one read on those rings, and there's nothing to reorder. If that's too tricky and you want to add a load-acquire I have no objection though. It also helps avoiding read-read reordering between one entry's flags to the next one's, so it's a good idea to have it anyway. The main reason why I preferred a global KVM_RESET_DIRTY_RINGS ioctl was because it takes kvm->slots_lock so the execution would be serialized anyway. Turning slots_lock into an rwsem would be even worse because it also takes kvm->mmu_lock (since slots_lock is a mutex, at least two concurrent invocations won't clash with each other on the mmu_lock). Whatever the reason, the behaviour should be identical on all architectures. As is is, it only really works on x86, and I contend this is a bug that needs fixing. Thankfully, this can be done at zero cost for x86, and at that of a set of load-acquires on other architectures. Yes, the global-ness of the API is orthogonal to the memory ordering issue. I just wanted to explain why a per-vCPU API probably isn't going to work great. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking
On 8/23/22 22:35, Marc Zyngier wrote: Heh, yeah I need to get that out the door. I'll also note that Gavin's changes are still relevant without that series, as we do write unprotect in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64: Add fast path to handle permission relaxation during dirty logging"). Ah, true. Now if only someone could explain how the whole producer-consumer thing works without a trace of a barrier, that'd be great... Do you mean this? void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset) { struct kvm_dirty_gfn *entry; /* It should never get full */ WARN_ON_ONCE(kvm_dirty_ring_full(ring)); entry = >dirty_gfns[ring->dirty_index & (ring->size - 1)]; entry->slot = slot; entry->offset = offset; /* * Make sure the data is filled in before we publish this to * the userspace program. There's no paired kernel-side reader. */ smp_wmb(); kvm_dirty_gfn_set_dirtied(entry); ring->dirty_index++; trace_kvm_dirty_ring_push(ring, slot, offset); } The matching smp_rmb() is in userspace. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking
On 8/24/22 00:47, Marc Zyngier wrote: I definitely don't think I 100% understand all the ordering things since they're complicated.. but my understanding is that the reset procedure didn't need memory barrier (unlike pushing, where we have explicit wmb), because we assumed the userapp is not hostile so logically it should only modify the flags which is a 32bit field, assuming atomicity guaranteed. Atomicity doesn't guarantee ordering, unfortunately. Take the following example: CPU0 is changing a bunch of flags for GFNs A, B, C, D that exist in the ring in that order, and CPU1 performs an ioctl to reset the page state. CPU0: write_flag(A, KVM_DIRTY_GFN_F_RESET) write_flag(B, KVM_DIRTY_GFN_F_RESET) write_flag(C, KVM_DIRTY_GFN_F_RESET) write_flag(D, KVM_DIRTY_GFN_F_RESET) [...] CPU1: ioctl(KVM_RESET_DIRTY_RINGS) Since CPU0 writes do not have any ordering, CPU1 can observe the writes in a sequence that have nothing to do with program order, and could for example observe that GFN A and D have been reset, but not B and C. This in turn breaks the logic in the reset code (B, C, and D don't get reset), despite userspace having followed the spec to the letter. If each was a store-release (which is the case on x86), it wouldn't be a problem, but nothing calls it in the documentation. Maybe that's not a big deal if it is expected that each CPU will issue a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own writes. But expecting this to work across CPUs without any barrier is wishful thinking. Agreed, but that's a problem for userspace to solve. If userspace wants to reset the fields in different CPUs, it has to synchronize with its own invoking of the ioctl. That is, CPU0 must ensure that a ioctl(KVM_RESET_DIRTY_RINGS) is done after (in the memory-ordering sense) its last write_flag(D, KVM_DIRTY_GFN_F_RESET). If there's no such ordering, there's no guarantee that the write_flag will have any effect. The main reason why I preferred a global KVM_RESET_DIRTY_RINGS ioctl was because it takes kvm->slots_lock so the execution would be serialized anyway. Turning slots_lock into an rwsem would be even worse because it also takes kvm->mmu_lock (since slots_lock is a mutex, at least two concurrent invocations won't clash with each other on the mmu_lock). Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH] mailmap: Update Oliver's email address
On 8/19/22 21:01, Oliver Upton wrote: While I'm still at Google, I've since switched to a linux.dev account for working upstream. Add an alias to the new address. Signed-off-by: Oliver Upton Queued, thanks. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [GIT PULL] KVM/arm64 fixes for 6.0, take #1
Pulled, thanks. Paolo On Thu, Aug 18, 2022 at 4:09 PM Marc Zyngier wrote: > > Paolo, > > Here's a small set of KVM/arm64 fixes for 6.0, the most visible thing > being a better handling of systems with asymmetric AArch32 support. > > Please pull, > > M. > > The following changes since commit 0982c8d859f8f7022b9fd44d421c7ec721bb41f9: > > Merge branch kvm-arm64/nvhe-stacktrace into kvmarm-master/next (2022-07-27 > 18:33:27 +0100) > > are available in the Git repository at: > > git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git > tags/kvmarm-fixes-6.0-1 > > for you to fetch changes up to b10d86fb8e46cc812171728bcd326df2f34e9ed5: > > KVM: arm64: Reject 32bit user PSTATE on asymmetric systems (2022-08-17 > 10:29:07 +0100) > > > KVM/arm64 fixes for 6.0, take #1 > > - Fix unexpected sign extension of KVM_ARM_DEVICE_ID_MASK > > - Tidy-up handling of AArch32 on asymmetric systems > > > Oliver Upton (2): > KVM: arm64: Treat PMCR_EL1.LC as RES1 on asymmetric systems > KVM: arm64: Reject 32bit user PSTATE on asymmetric systems > > Yang Yingliang (1): > KVM: arm64: Fix compile error due to sign extension > > arch/arm64/include/asm/kvm_host.h | 4 > arch/arm64/include/uapi/asm/kvm.h | 6 -- > arch/arm64/kvm/arm.c | 3 +-- > arch/arm64/kvm/guest.c| 2 +- > arch/arm64/kvm/sys_regs.c | 4 ++-- > 5 files changed, 12 insertions(+), 7 deletions(-) > ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 1/2] KVM: selftests: Make rseq compatible with glibc-2.35
On 8/10/22 14:29, Mathieu Desnoyers wrote: - By design, selftests/rseq and selftests/kvm are parallel. It's going to introduce unnecessary dependency for selftests/kvm to use selftests/rseq/librseq.so. To me, it makes the maintainability even harder. In terms of build system, yes, selftests/rseq and selftests/kvm are side-by-side, and I agree it is odd to have a cross-dependency. That's where moving rseq.c to tools/lib/ makes sense. - What selftests/kvm needs is rseq-thread-pointer.h, which accounts for ~5% of functionalities, provided by selftests/rseq/librseq.so. I've never seen this type of argument used to prevent using a library before, except on extremely memory-constrained devices, which is not our target here. I agree. To me, the main argument against moving librseq to tools/lib is a variant of the build-system argument, namely that recursive Make sucks[1] and selftests/kvm right now does not use tools/lib. So, for a single-file library, it may be simply not worth the hassle. On the other hand, if "somebody else" does the work, I would have no problem with having selftests/kvm depend on tools/lib, not at all. Thanks, Paolo [1] Kbuild is a marvel that makes it work, but it works because there are no such cross-subdirectory dependencies and anyway tools/testing/selftests does not use Kbuild. ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH v2 1/2] KVM: selftests: Make rseq compatible with glibc-2.35
On 8/10/22 14:22, Mathieu Desnoyers wrote: /* * Create and run a dummy VM that immediately exits to userspace via @@ -256,7 +244,7 @@ int main(int argc, char *argv[]) */ smp_rmb(); cpu = sched_getcpu(); - rseq_cpu = READ_ONCE(__rseq.cpu_id); + rseq_cpu = READ_ONCE(__rseq->cpu_id); #include and use rseq_current_cpu_raw(). Thanks, I squashed it and queued it for -rc1 (tested on both glibc 2.34 and 2.35). diff --git a/tools/testing/selftests/kvm/rseq_test.c b/tools/testing/selftests/kvm/rseq_test.c index 84e8425edc2c..987a76674f4f 100644 --- a/tools/testing/selftests/kvm/rseq_test.c +++ b/tools/testing/selftests/kvm/rseq_test.c @@ -29,7 +29,6 @@ #define NR_TASK_MIGRATIONS 10 static pthread_t migration_thread; -static struct rseq_abi *__rseq; static cpu_set_t possible_mask; static int min_cpu, max_cpu; static bool done; @@ -218,7 +217,6 @@ int main(int argc, char *argv[]) r = rseq_register_current_thread(); TEST_ASSERT(!r, "rseq_register_current_thread failed, errno = %d (%s)", errno, strerror(errno)); - __rseq = rseq_get_abi(); /* * Create and run a dummy VM that immediately exits to userspace via @@ -256,7 +254,7 @@ int main(int argc, char *argv[]) */ smp_rmb(); cpu = sched_getcpu(); - rseq_cpu = READ_ONCE(__rseq->cpu_id); + rseq_cpu = rseq_current_cpu_raw(); smp_rmb(); } while (snapshot != atomic_read(_cnt)); Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 1/2] KVM: selftests: Make rseq compatible with glibc-2.35
On 8/10/22 14:17, Mathieu Desnoyers wrote: Indeed, this hack seems to be a good approach to immediately fix things without moving around all source files and headers. In the longer term, I'd prefer Sean's proposal to move rseq.c to tools/lib/ (and to move rseq headers to tools/include/rseq/). This can be done in a follow up phase though. I'll put a note on my todo list for after I come back from vacation. Great, Gavin, are you going to repost using librseq? Yeah, rseq_test should reuse librseq code. The simplest way, if slightly hackish, is to do something like diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index 690b499c3471..6c192b0ec304 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -37,6 +37,7 @@ ifeq ($(ARCH),riscv) UNAME_M := riscv endif LIBKVM += lib/assert.c LIBKVM += lib/elf.c LIBKVM += lib/guest_modes.c @@ -198,7 +199,7 @@ endif CFLAGS += -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99 \ -fno-stack-protector -fno-PIE -I$(LINUX_TOOL_INCLUDE) \ -I$(LINUX_TOOL_ARCH_INCLUDE) -I$(LINUX_HDR_PATH) -Iinclude \ - -I$( no-pie-option := $(call try-run, echo 'int main() { return 0; }' | \ $(CC) -Werror -no-pie -x c - -o "$$TMP", -no-pie) and just #include "../rseq/rseq.c" in rseq_test.c. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 1/2] KVM: selftests: Make rseq compatible with glibc-2.35
On 8/9/22 14:21, Mathieu Desnoyers wrote: For kvm/selftests, there are 3 architectures involved actually. So we just need consider 4 cases: aarch64, x86, s390 and other. For other case, we just use __builtin_thread_pointer() to maintain code's integrity, but it's not called at all. I think kvm/selftest is always relying on glibc if I'm correct. All those are handled in the rseq selftests and in librseq. Why duplicate all that logic again? Yeah, rseq_test should reuse librseq code. The simplest way, if slightly hackish, is to do something like diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index 690b499c3471..6c192b0ec304 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -37,6 +37,7 @@ ifeq ($(ARCH),riscv) UNAME_M := riscv endif LIBKVM += lib/assert.c LIBKVM += lib/elf.c LIBKVM += lib/guest_modes.c @@ -198,7 +199,7 @@ endif CFLAGS += -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99 \ -fno-stack-protector -fno-PIE -I$(LINUX_TOOL_INCLUDE) \ -I$(LINUX_TOOL_ARCH_INCLUDE) -I$(LINUX_HDR_PATH) -Iinclude \ - -I$( no-pie-option := $(call try-run, echo 'int main() { return 0; }' | \ $(CC) -Werror -no-pie -x c - -o "$$TMP", -no-pie) and just #include "../rseq/rseq.c" in rseq_test.c. Thanks, Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [GIT PULL] KVM/arm64 updates for 5.20
On 7/29/22 10:43, Marc Zyngier wrote: Paolo, Here's the bulk of the KVM/arm64 updates for 5.20. Major feature is the new unwinder for the nVHE modes. The rest is mostly rewriting some unloved aspects of the arm64 port (sysregs, flags). Further details in the tag description. Note that this PR contains a shared branch with the arm64 tree containing some of the stacktrace updates. There is also a minor conflict with your tree in one of the selftests, already resolved in -next. Please pull, M. Pulled, thanks. Because it's Friday and the RISC-V pull brought in all the new x86 RETbleed stuff, it may be a couple days before I finish retesting and do push it out to kvm.git. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH v4] KVM: selftests: Fix target thread to be migrated in rseq_test
On 7/19/22 04:08, Gavin Shan wrote: In rseq_test, there are two threads, which are vCPU thread and migration worker separately. Unfortunately, the test has the wrong PID passed to sched_setaffinity() in the migration worker. It forces migration on the migration worker because zeroed PID represents the calling thread, which is the migration worker itself. It means the vCPU thread is never enforced to migration and it can migrate at any time, which eventually leads to failure as the following logs show. host# uname -r 5.19.0-rc6-gavin+ host# # cat /proc/cpuinfo | grep processor | tail -n 1 processor: 223 host# pwd /home/gavin/sandbox/linux.main/tools/testing/selftests/kvm host# for i in `seq 1 100`; do \ echo "> $i"; ./rseq_test; done > 1 > 2 > 3 > 4 > 5 > 6 Test Assertion Failure rseq_test.c:265: rseq_cpu == cpu pid=3925 tid=3925 errno=4 - Interrupted system call 1 0x00401963: main at rseq_test.c:265 (discriminator 2) 2 0xb044affb: ?? ??:0 3 0xb044b0c7: ?? ??:0 4 0x00401a6f: _start at ??:? rseq CPU = 4, sched CPU = 27 Fix the issue by passing correct parameter, TID of the vCPU thread, to sched_setaffinity() in the migration worker. Fixes: 61e52f1630f5 ("KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs") Suggested-by: Sean Christopherson Signed-off-by: Gavin Shan Reviewed-by: Oliver Upton --- v4: Pick the code change as Sean suggested. --- tools/testing/selftests/kvm/rseq_test.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/kvm/rseq_test.c b/tools/testing/selftests/kvm/rseq_test.c index 4158da0da2bb..2237d1aac801 100644 --- a/tools/testing/selftests/kvm/rseq_test.c +++ b/tools/testing/selftests/kvm/rseq_test.c @@ -82,8 +82,9 @@ static int next_cpu(int cpu) return cpu; } -static void *migration_worker(void *ign) +static void *migration_worker(void *__rseq_tid) { + pid_t rseq_tid = (pid_t)(unsigned long)__rseq_tid; cpu_set_t allowed_mask; int r, i, cpu; @@ -106,7 +107,7 @@ static void *migration_worker(void *ign) * stable, i.e. while changing affinity is in-progress. */ smp_wmb(); - r = sched_setaffinity(0, sizeof(allowed_mask), _mask); + r = sched_setaffinity(rseq_tid, sizeof(allowed_mask), _mask); TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)", errno, strerror(errno)); smp_wmb(); @@ -231,7 +232,8 @@ int main(int argc, char *argv[]) vm = vm_create_default(VCPU_ID, 0, guest_code); ucall_init(vm, NULL); - pthread_create(_thread, NULL, migration_worker, 0); + pthread_create(_thread, NULL, migration_worker, + (void *)(unsigned long)gettid()); for (i = 0; !done; i++) { vcpu_run(vm, VCPU_ID); Queued, thanks. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH] KVM: selftests: Double check on the current CPU in rseq_test
On 7/14/22 17:35, Sean Christopherson wrote: Can you check that smp_rmb() and smp_wmb() generate correct instructions on arm64? That seems like the most likely scenario (or a kernel bug), I distinctly remember the barriers provided by tools/ being rather bizarre. Maybe we should bite the bait and use C11 atomics in tools/. I've long planned an article "C11 atomics for kernel programmers", especially because this will also be an issue when Rust gets into the mix... Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH] KVM: selftests: Double check on the current CPU in rseq_test
On 7/14/22 10:06, Gavin Shan wrote: In rseq_test, there are two threads created. Those two threads are 'main' and 'migration_thread' separately. We also have the assumption that non-migration status on 'migration-worker' thread guarantees the same non-migration status on 'main' thread. Unfortunately, the assumption isn't true. The 'main' thread can be migrated from one CPU to another one between the calls to sched_getcpu() and READ_ONCE(__rseq.cpu_id). The following assert is raised eventually because of the mismatched CPU numbers. The issue can be reproduced on arm64 system occasionally. Hmm, this does not seem a correct patch - the threads are already synchronizing using seq_cnt, like this: migration main -- seq_cnt = 1 smp_wmb() snapshot = 0 smp_rmb() cpu = sched_getcpu() reads 23 sched_setaffinity() rseq_cpu = __rseq.cpuid reads 35 smp_rmb() snapshot != seq_cnt -> retry smp_wmb() seq_cnt = 2 sched_setaffinity() is guaranteed to block until the task is enqueued on an allowed CPU. Can you check that smp_rmb() and smp_wmb() generate correct instructions on arm64? Paolo host# uname -r 5.19.0-rc6-gavin+ host# # cat /proc/cpuinfo | grep processor | tail -n 1 processor: 223 host# pwd /home/gavin/sandbox/linux.main/tools/testing/selftests/kvm host# for i in `seq 1 100`; \ do echo "> $i"; \ ./rseq_test; sleep 3; \ done > 1 > 2 > 3 > 4 > 5 > 6 Test Assertion Failure rseq_test.c:265: rseq_cpu == cpu pid=3925 tid=3925 errno=4 - Interrupted system call 1 0x00401963: main at rseq_test.c:265 (discriminator 2) 2 0xb044affb: ?? ??:0 3 0xb044b0c7: ?? ??:0 4 0x00401a6f: _start at ??:? rseq CPU = 4, sched CPU = 27 This fixes the issue by double-checking on the current CPU after call to READ_ONCE(__rseq.cpu_id) and restarting the test if the two consecutive CPU numbers aren't euqal. Fixes: 61e52f1630f5 ("KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs") Signed-off-by: Gavin Shan --- tools/testing/selftests/kvm/rseq_test.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/kvm/rseq_test.c b/tools/testing/selftests/kvm/rseq_test.c index 4158da0da2bb..74709dd9f5b2 100644 --- a/tools/testing/selftests/kvm/rseq_test.c +++ b/tools/testing/selftests/kvm/rseq_test.c @@ -207,7 +207,7 @@ int main(int argc, char *argv[]) { int r, i, snapshot; struct kvm_vm *vm; - u32 cpu, rseq_cpu; + u32 cpu, rseq_cpu, last_cpu; /* Tell stdout not to buffer its content */ setbuf(stdout, NULL); @@ -259,8 +259,9 @@ int main(int argc, char *argv[]) smp_rmb(); cpu = sched_getcpu(); rseq_cpu = READ_ONCE(__rseq.cpu_id); + last_cpu = sched_getcpu(); smp_rmb(); - } while (snapshot != atomic_read(_cnt)); + } while (snapshot != atomic_read(_cnt) || cpu != last_cpu); TEST_ASSERT(rseq_cpu == cpu, "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu); ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH] KVM: selftests: Double check on the current CPU in rseq_test
On 7/14/22 10:06, Gavin Shan wrote: In rseq_test, there are two threads created. Those two threads are 'main' and 'migration_thread' separately. We also have the assumption that non-migration status on 'migration-worker' thread guarantees the same non-migration status on 'main' thread. Unfortunately, the assumption isn't true. The 'main' thread can be migrated from one CPU to another one between the calls to sched_getcpu() and READ_ONCE(__rseq.cpu_id). The following assert is raised eventually because of the mismatched CPU numbers. The issue can be reproduced on arm64 system occasionally. host# uname -r 5.19.0-rc6-gavin+ host# # cat /proc/cpuinfo | grep processor | tail -n 1 processor: 223 host# pwd /home/gavin/sandbox/linux.main/tools/testing/selftests/kvm host# for i in `seq 1 100`; \ do echo "> $i"; \ ./rseq_test; sleep 3; \ done > 1 > 2 > 3 > 4 > 5 > 6 Test Assertion Failure rseq_test.c:265: rseq_cpu == cpu pid=3925 tid=3925 errno=4 - Interrupted system call 1 0x00401963: main at rseq_test.c:265 (discriminator 2) 2 0xb044affb: ?? ??:0 3 0xb044b0c7: ?? ??:0 4 0x00401a6f: _start at ??:? rseq CPU = 4, sched CPU = 27 This fixes the issue by double-checking on the current CPU after call to READ_ONCE(__rseq.cpu_id) and restarting the test if the two consecutive CPU numbers aren't euqal. Fixes: 61e52f1630f5 ("KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs") Signed-off-by: Gavin Shan --- tools/testing/selftests/kvm/rseq_test.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/kvm/rseq_test.c b/tools/testing/selftests/kvm/rseq_test.c index 4158da0da2bb..74709dd9f5b2 100644 --- a/tools/testing/selftests/kvm/rseq_test.c +++ b/tools/testing/selftests/kvm/rseq_test.c @@ -207,7 +207,7 @@ int main(int argc, char *argv[]) { int r, i, snapshot; struct kvm_vm *vm; - u32 cpu, rseq_cpu; + u32 cpu, rseq_cpu, last_cpu; /* Tell stdout not to buffer its content */ setbuf(stdout, NULL); @@ -259,8 +259,9 @@ int main(int argc, char *argv[]) smp_rmb(); cpu = sched_getcpu(); rseq_cpu = READ_ONCE(__rseq.cpu_id); + last_cpu = sched_getcpu(); smp_rmb(); - } while (snapshot != atomic_read(_cnt)); + } while (snapshot != atomic_read(_cnt) || cpu != last_cpu); TEST_ASSERT(rseq_cpu == cpu, "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu); Queued for -rc, thanks. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH v7 00/22] Support SDEI Virtualization
On 6/24/22 15:12, Marc Zyngier wrote: - as far as I know, the core Linux/arm64 maintainers have no plan to support APF. Without it, this is a pointless exercise. And even with it, this introduces a Linux specific behaviour in an otherwise architectural hypervisor (something I'm quite keen on avoiding) Regarding non-architectural behavior, isn't that the same already for PTP? I understand that the PTP hypercall is a much smaller implementation than SDEI+APF, but it goes to show that KVM is already not "architectural". There are other cases where paravirtualized solutions can be useful. PTP is one but there are more where KVM/ARM does not have a solution yet, for example lock holder preemption. Unless ARM (the company) has a way to receive input from developers and standardize the interface, similar to the RISC-V SIGs, vendor-specific hypercalls are a sad fact of life. It just happened that until now KVM/ARM hasn't seen much use in some cases (such as desktop virtualization) where overcommitted hosts are more common. Async page faults per se are not KVM specific, in fact Linux supported them for the IBM s390 hypervisor long before KVM added support. They didn't exist on x86 and ARM, so the developers came up with a new hypercall API and for x86 honestly it wasn't great. For ARM we learnt from the mistakes and it seems to me that SDEI is a good match for the feature. If ARM wants to produce a standard interface for APF, whether based on SDEI or something else, we're all ears. Regarding plans of core arm64 maintainers to support async page fault, can you provide a pointer to the discussion? I agree that if there's a hard NACK for APF for whatever reason, the whole host-side code is pointless (including SDEI virtualization); but I would like to read more about it. - It gives an incentive to other hypervisor vendors to add random crap to the Linux mm subsystem, which is even worse. At this stage, we might as well go back to the Xen PV days altogether. return -EGREGIOUS; Since you mention hypervisor vendors and there's only one hypervisor in Linux, I guess you're not talking about the host mm/ subsystem (otherwise yeah, FOLL_NOWAIT is only used by KVM async page faults). So I suppose you're talking about the guest, and then yeah, it sucks to have multiple hypervisors providing the same functionality in different ways (or multiple hypervisors providing different subsets of PV functionality). It happens on x86 with Hyper-V and KVM, and to a lesser extent Xen and VMware. But again, KVM/ARM has already crossed that bridge with PTP support, and the guest needs exactly zero code in the Linux mm subsystem (both generic and arch-specific) to support asynchronous page faults. There are 20 lines of code in do_notify_resume(), and the rest is just SDEI gunk. Again, I would be happy to get a pointer to concrete objections from the Linux ARM64 maintainers. Maybe a different implementation is possible, I don't know. In any case it's absolutely not comparable to Xen PV, and you know it. - I haven't seen any of the KVM/arm64 users actually asking for the APF horror, and the cloud vendors I directly asked had no plan to use it, and not using it on their x86 systems either Please define "horror" in more technical terms. And since this is the second time I'm calling you out on this, I'm also asking you to avoid hyperboles and similar rhetorical gimmicks in the future. That said: Peter, Sean, Google uses or used postcopy extensively on GCE (https://dl.acm.org/doi/pdf/10.1145/3296975.3186415). If it doesn't use it on x86, do you have any insights on why? - no performance data nor workloads that could help making an informed decision have been disclosed, and the only argument in its favour seems to be "but x86 has it" (hardly a compelling one) Again this is just false, numbers have been posted (https://lwn.net/ml/linux-kernel/20210209050403.103143-1-gs...@redhat.com/ was the first result that came up from a quick mailing list search). If they are not enough, please be more specific. Thanks, Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH kvm-unit-tests] MAINTAINERS: Change drew's email address
Queued, thanks. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [GIT PULL] KVM/arm64 fixes for 5.19, take #2
On 6/23/22 09:41, Marc Zyngier wrote: git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git tags/kvmarm-fixes-5.19-2 Pulled, thanks. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
[PATCH v7 21/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
From: David Matlack Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at declaration time rather than being fixed for all declarations. This will be used in a follow-up commit to declare an cache in x86 with a capacity of 512+ objects without having to increase the capacity of all caches in KVM. This change requires each cache now specify its capacity at runtime, since the cache struct itself no longer has a fixed capacity known at compile time. To protect against someone accidentally defining a kvm_mmu_memory_cache struct directly (without the extra storage), this commit includes a WARN_ON() in kvm_mmu_topup_memory_cache(). In order to support different capacities, this commit changes the objects pointer array to be dynamically allocated the first time the cache is topped-up. While here, opportunistically clean up the stack-allocated kvm_mmu_memory_cache structs in riscv and arm64 to use designated initializers. No functional change intended. Reviewed-by: Marc Zyngier Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-22-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/arm64/kvm/mmu.c | 2 +- arch/riscv/kvm/mmu.c | 5 + include/linux/kvm_host.h | 1 + include/linux/kvm_types.h | 6 +- virt/kvm/kvm_main.c | 33 ++--- 5 files changed, 38 insertions(+), 9 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index f5651a05b6a8..87f1cd0df36e 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -786,7 +786,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa, { phys_addr_t addr; int ret = 0; - struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, }; + struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO }; struct kvm_pgtable *pgt = kvm->arch.mmu.pgt; enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_R | diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c index 1c00695ebee7..081f8d2b9cf3 100644 --- a/arch/riscv/kvm/mmu.c +++ b/arch/riscv/kvm/mmu.c @@ -350,10 +350,7 @@ static int gstage_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa, int ret = 0; unsigned long pfn; phys_addr_t addr, end; - struct kvm_mmu_memory_cache pcache; - - memset(, 0, sizeof(pcache)); - pcache.gfp_zero = __GFP_ZERO; + struct kvm_mmu_memory_cache pcache = { .gfp_zero = __GFP_ZERO }; end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK; pfn = __phys_to_pfn(hpa); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index a2bbdf3ab086..3554e48406e4 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1356,6 +1356,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm); #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min); +int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min); int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc); void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc); void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc); diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h index f328a01db4fe..4d933518060f 100644 --- a/include/linux/kvm_types.h +++ b/include/linux/kvm_types.h @@ -85,12 +85,16 @@ struct gfn_to_pfn_cache { * MMU flows is problematic, as is triggering reclaim, I/O, etc... while * holding MMU locks. Note, these caches act more like prefetch buffers than * classical caches, i.e. objects are not returned to the cache on being freed. + * + * The @capacity field and @objects array are lazily initialized when the cache + * is topped up (__kvm_mmu_topup_memory_cache()). */ struct kvm_mmu_memory_cache { int nobjs; gfp_t gfp_zero; struct kmem_cache *kmem_cache; - void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE]; + int capacity; + void **objects; }; #endif diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 5b8ae83e09d7..45188d11812c 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -396,14 +396,31 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc, return (void *)__get_free_page(gfp_flags); } -int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min) +int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min) { + gfp_t gfp = GFP_KERNEL_ACCOUNT; void *obj; if (mc->nobjs >= min) return 0; - while (mc->nobjs < ARRAY_SIZE(mc->objects)) { - obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT); + + if (unlikely(!mc->objects)) { + if (WARN_ON_ONCE(!capacity)) + return -EIO; + + mc->ob
[PATCH v7 23/23] KVM: x86/mmu: Avoid unnecessary flush on eager page split
The TLB flush before installing the newly-populated lower level page table is unnecessary if the lower-level page table maps the huge page identically. KVM knows it is if it did not reuse an existing shadow page table, tell drop_large_spte() to skip the flush in that case. Extracted from a patch by David Matlack. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 32 1 file changed, 20 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 22681931921f..79c6a821ea0d 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1135,7 +1135,7 @@ static void drop_spte(struct kvm *kvm, u64 *sptep) rmap_remove(kvm, sptep); } -static void drop_large_spte(struct kvm *kvm, u64 *sptep) +static void drop_large_spte(struct kvm *kvm, u64 *sptep, bool flush) { struct kvm_mmu_page *sp; @@ -1143,7 +1143,9 @@ static void drop_large_spte(struct kvm *kvm, u64 *sptep) WARN_ON(sp->role.level == PG_LEVEL_4K); drop_spte(kvm, sptep); - kvm_flush_remote_tlbs_with_address(kvm, sp->gfn, + + if (flush) + kvm_flush_remote_tlbs_with_address(kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level)); } @@ -2283,7 +2285,7 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator) static void __link_shadow_page(struct kvm *kvm, struct kvm_mmu_memory_cache *cache, u64 *sptep, - struct kvm_mmu_page *sp) + struct kvm_mmu_page *sp, bool flush) { u64 spte; @@ -2291,10 +2293,11 @@ static void __link_shadow_page(struct kvm *kvm, /* * If an SPTE is present already, it must be a leaf and therefore -* a large one. Drop it and flush the TLB before installing sp. +* a large one. Drop it, and flush the TLB if needed, before +* installing sp. */ if (is_shadow_present_pte(*sptep)) - drop_large_spte(kvm, sptep); + drop_large_spte(kvm, sptep, flush); spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp)); @@ -2309,7 +2312,7 @@ static void __link_shadow_page(struct kvm *kvm, static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, struct kvm_mmu_page *sp) { - __link_shadow_page(vcpu->kvm, >arch.mmu_pte_list_desc_cache, sptep, sp); + __link_shadow_page(vcpu->kvm, >arch.mmu_pte_list_desc_cache, sptep, sp, true); } static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, @@ -6172,6 +6175,7 @@ static void shadow_mmu_split_huge_page(struct kvm *kvm, struct kvm_mmu_memory_cache *cache = >arch.split_desc_cache; u64 huge_spte = READ_ONCE(*huge_sptep); struct kvm_mmu_page *sp; + bool flush = false; u64 *sptep, spte; gfn_t gfn; int index; @@ -6189,20 +6193,24 @@ static void shadow_mmu_split_huge_page(struct kvm *kvm, * gfn-to-pfn translation since the SP is direct, so no need to * modify them. * -* If a given SPTE points to a lower level page table, installing -* such SPTEs would effectively unmap a potion of the huge page. -* This is not an issue because __link_shadow_page() flushes the TLB -* when the passed sp replaces a large SPTE. +* However, if a given SPTE points to a lower level page table, +* that lower level page table may only be partially populated. +* Installing such SPTEs would effectively unmap a potion of the +* huge page. Unmapping guest memory always requires a TLB flush +* since a subsequent operation on the unmapped regions would +* fail to detect the need to flush. */ - if (is_shadow_present_pte(*sptep)) + if (is_shadow_present_pte(*sptep)) { + flush |= !is_last_spte(*sptep, sp->role.level); continue; + } spte = make_huge_page_split_spte(kvm, huge_spte, sp->role, index); mmu_spte_set(sptep, spte); __rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access); } - __link_shadow_page(kvm, cache, huge_sptep, sp); + __link_shadow_page(kvm, cache, huge_sptep, sp, flush); } static int shadow_mmu_try_split_huge_page(struct kvm *kvm, -- 2.31.1 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
[PATCH v7 20/23] KVM: x86/mmu: pull call to drop_large_spte() into __link_shadow_page()
Before allocating a child shadow page table, all callers check whether the parent already points to a huge page and, if so, they drop that SPTE. This is done by drop_large_spte(). However, the act that requires dropping the large SPTE is the installation of the sp that is returned by kvm_mmu_get_child_sp(), which happens in __link_shadow_page(). Move the call there instead of having it in each and every caller. To ensure that the shadow page is not linked twice if it was present, do _not_ opportunistically make kvm_mmu_get_child_sp() idempotent: instead, return an error value if the shadow page already existed. This is a bit more verbose, but clearer than NULL. Now that the drop_large_spte() name is not taken anymore, remove the two underscores in front of __drop_large_spte(). Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 43 +- arch/x86/kvm/mmu/paging_tmpl.h | 31 +++- 2 files changed, 35 insertions(+), 39 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 36bc49f08d60..bf1ae5ebf41b 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1135,26 +1135,16 @@ static void drop_spte(struct kvm *kvm, u64 *sptep) rmap_remove(kvm, sptep); } - -static bool __drop_large_spte(struct kvm *kvm, u64 *sptep) +static void drop_large_spte(struct kvm *kvm, u64 *sptep) { - if (is_large_pte(*sptep)) { - WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K); - drop_spte(kvm, sptep); - return true; - } - - return false; -} + struct kvm_mmu_page *sp; -static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep) -{ - if (__drop_large_spte(vcpu->kvm, sptep)) { - struct kvm_mmu_page *sp = sptep_to_sp(sptep); + sp = sptep_to_sp(sptep); + WARN_ON(sp->role.level == PG_LEVEL_4K); - kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn, + drop_spte(kvm, sptep); + kvm_flush_remote_tlbs_with_address(kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level)); - } } /* @@ -2221,6 +2211,9 @@ static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu, { union kvm_mmu_page_role role; + if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) + return ERR_PTR(-EEXIST); + role = kvm_mmu_child_role(sptep, direct, access); return kvm_mmu_get_shadow_page(vcpu, gfn, role); } @@ -2288,13 +2281,21 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator) __shadow_walk_next(iterator, *iterator->sptep); } -static void __link_shadow_page(struct kvm_mmu_memory_cache *cache, u64 *sptep, +static void __link_shadow_page(struct kvm *kvm, + struct kvm_mmu_memory_cache *cache, u64 *sptep, struct kvm_mmu_page *sp) { u64 spte; BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK); + /* +* If an SPTE is present already, it must be a leaf and therefore +* a large one. Drop it and flush the TLB before installing sp. +*/ + if (is_shadow_present_pte(*sptep)) + drop_large_spte(kvm, sptep); + spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp)); mmu_spte_set(sptep, spte); @@ -2308,7 +2309,7 @@ static void __link_shadow_page(struct kvm_mmu_memory_cache *cache, u64 *sptep, static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, struct kvm_mmu_page *sp) { - __link_shadow_page(>arch.mmu_pte_list_desc_cache, sptep, sp); + __link_shadow_page(vcpu->kvm, >arch.mmu_pte_list_desc_cache, sptep, sp); } static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, @@ -3080,11 +3081,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) if (it.level == fault->goal_level) break; - drop_large_spte(vcpu, it.sptep); - if (is_shadow_present_pte(*it.sptep)) - continue; - sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL); + if (sp == ERR_PTR(-EEXIST)) + continue; link_shadow_page(vcpu, it.sptep, sp); if (fault->is_tdp && fault->huge_page_disallowed && diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 24f292f3f93f..2448fa8d8438 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -648,15 +648,13 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, gfn_t table_gfn; clear_sp_write_flooding_count(it.sptep); - drop_large_spte(vcpu, it.sptep); -
[PATCH v7 16/23] KVM: x86/mmu: Update page stats in __rmap_add()
From: David Matlack Update the page stats in __rmap_add() rather than at the call site. This will avoid having to manually update page stats when splitting huge pages in a subsequent commit. No functional change intended. Reviewed-by: Ben Gardon Reviewed-by: Peter Xu Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-17-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index a8cdbe2958d9..7cca28d89a85 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1562,6 +1562,8 @@ static void __rmap_add(struct kvm *kvm, sp = sptep_to_sp(spte); kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn); + kvm_update_page_stats(kvm, sp->role.level, 1); + rmap_head = gfn_to_rmap(gfn, sp->role.level, slot); rmap_count = pte_list_add(cache, spte, rmap_head); @@ -2783,7 +2785,6 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot, if (!was_rmapped) { WARN_ON_ONCE(ret == RET_PF_SPURIOUS); - kvm_update_page_stats(vcpu->kvm, level, 1); rmap_add(vcpu, slot, sptep, gfn); } -- 2.31.1 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
[PATCH v7 22/23] KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs
From: David Matlack Add support for Eager Page Splitting pages that are mapped by nested MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB pages, and then splitting all 2MiB pages to 4KiB pages. Note, Eager Page Splitting is limited to nested MMUs as a policy rather than due to any technical reason (the sp->role.guest_mode check could just be deleted and Eager Page Splitting would work correctly for all shadow MMU pages). There is really no reason to support Eager Page Splitting for tdp_mmu=N, since such support will eventually be phased out, and there is no current use case supporting Eager Page Splitting on hosts where TDP is either disabled or unavailable in hardware. Furthermore, future improvements to nested MMU scalability may diverge the code from the legacy shadow paging implementation. These improvements will be simpler to make if Eager Page Splitting does not have to worry about legacy shadow paging. Splitting huge pages mapped by nested MMUs requires dealing with some extra complexity beyond that of the TDP MMU: (1) The shadow MMU has a limit on the number of shadow pages that are allowed to be allocated. So, as a policy, Eager Page Splitting refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer pages available. (2) Splitting a huge page may end up re-using an existing lower level shadow page tables. This is unlike the TDP MMU which always allocates new shadow page tables when splitting. (3) When installing the lower level SPTEs, they must be added to the rmap which may require allocating additional pte_list_desc structs. Case (2) is especially interesting since it may require a TLB flush, unlike the TDP MMU which can fully split huge pages without any TLB flushes. Specifically, an existing lower level page table may point to even lower level page tables that are not fully populated, effectively unmapping a portion of the huge page, which requires a flush. As of this commit, a flush is always done always after dropping the huge page and before installing the lower level page table. This TLB flush could instead be delayed until the MMU lock is about to be dropped, which would batch flushes for multiple splits. However these flushes should be rare in practice (a huge page must be aliased in multiple SPTEs and have been split for NX Huge Pages in only some of them). Flushing immediately is simpler to plumb and also reduces the chances of tripping over a CPU bug (e.g. see iTLB multihit). Suggested-by: Peter Feiner [ This commit is based off of the original implementation of Eager Page Splitting from Peter in Google's kernel from 2016. ] Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-23-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- .../admin-guide/kernel-parameters.txt | 3 +- arch/x86/include/asm/kvm_host.h | 22 ++ arch/x86/kvm/mmu/mmu.c| 261 +- 3 files changed, 277 insertions(+), 9 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 97c16aa2f53f..329f0f274e2b 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2418,8 +2418,7 @@ the KVM_CLEAR_DIRTY ioctl, and only for the pages being cleared. - Eager page splitting currently only supports splitting - huge pages mapped by the TDP MMU. + Eager page splitting is only supported when kvm.tdp_mmu=Y. Default is Y (on). diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 64efe8c90c31..665667d61caf 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1338,6 +1338,28 @@ struct kvm_arch { u32 max_vcpu_ids; bool disable_nx_huge_pages; + + /* +* Memory caches used to allocate shadow pages when performing eager +* page splitting. No need for a shadowed_info_cache since eager page +* splitting only allocates direct shadow pages. +* +* Protected by kvm->slots_lock. +*/ + struct kvm_mmu_memory_cache split_shadow_page_cache; + struct kvm_mmu_memory_cache split_page_header_cache; + + /* +* Memory cache used to allocate pte_list_desc structs while splitting +* huge pages. In the worst case, to split one huge page, 512 +* pte_list_desc structs are needed to add each lower level leaf sptep +* to the rmap plus 1 to extend the parent_ptes rmap of the lower level +* page table. +* +* Protected by kvm->slots_lock. +*/ +#define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1) + struct kvm_mmu_memory_cache split_desc_cache; }; struct kvm_vm_stat { diff --git a/arch/x86/kvm/mmu/mmu.c
[PATCH v7 11/23] KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page()
From: David Matlack The vcpu pointer in kvm_mmu_alloc_shadow_page() is only used to get the kvm pointer. So drop the vcpu pointer and just pass in the kvm pointer. No functional change intended. Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-12-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index fab417e7bf6c..c5a88e8d1b53 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -2056,7 +2056,7 @@ struct shadow_page_caches { struct kvm_mmu_memory_cache *gfn_array_cache; }; -static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, +static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm, struct shadow_page_caches *caches, gfn_t gfn, struct hlist_head *sp_list, @@ -2076,15 +2076,15 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, * depends on valid pages being added to the head of the list. See * comments in kvm_zap_obsolete_pages(). */ - sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen; - list_add(>link, >kvm->arch.active_mmu_pages); - kvm_mod_used_mmu_pages(vcpu->kvm, +1); + sp->mmu_valid_gen = kvm->arch.mmu_valid_gen; + list_add(>link, >arch.active_mmu_pages); + kvm_mod_used_mmu_pages(kvm, +1); sp->gfn = gfn; sp->role = role; hlist_add_head(>hash_link, sp_list); if (sp_has_gptes(sp)) - account_shadowed(vcpu->kvm, sp); + account_shadowed(kvm, sp); return sp; } @@ -2103,7 +2103,7 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu, sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role); if (!sp) { created = true; - sp = kvm_mmu_alloc_shadow_page(vcpu, caches, gfn, sp_list, role); + sp = kvm_mmu_alloc_shadow_page(vcpu->kvm, caches, gfn, sp_list, role); } trace_kvm_mmu_get_page(sp, created); -- 2.31.1 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
[PATCH v7 17/23] KVM: x86/mmu: Cache the access bits of shadowed translations
From: David Matlack Splitting huge pages requires allocating/finding shadow pages to replace the huge page. Shadow pages are keyed, in part, off the guest access permissions they are shadowing. For fully direct MMUs, there is no shadowing so the access bits in the shadow page role are always ACC_ALL. But during shadow paging, the guest can enforce whatever access permissions it wants. In particular, eager page splitting needs to know the permissions to use for the subpages, but KVM cannot retrieve them from the guest page tables because eager page splitting does not have a vCPU. Fortunately, the guest access permissions are easy to cache whenever page faults or FNAME(sync_page) update the shadow page tables; this is an extension of the existing cache of the shadowed GFNs in the gfns array of the shadow page. The access bits only take up 3 bits, which leaves 61 bits left over for gfns, which is more than enough. Now that the gfns array caches more information than just GFNs, rename it to shadowed_translation. While here, preemptively fix up the WARN_ON() that detects gfn mismatches in direct SPs. The WARN_ON() was paired with a pr_err_ratelimited(), which means that users could sometimes see the WARN without the accompanying error message. Fix this by outputting the error message as part of the WARN splat, and opportunistically make them WARN_ONCE() because if these ever fire, they are all but guaranteed to fire a lot and will bring down the kernel. Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-18-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/mmu/mmu.c | 85 +++-- arch/x86/kvm/mmu/mmu_internal.h | 17 ++- arch/x86/kvm/mmu/paging_tmpl.h | 9 +++- 4 files changed, 84 insertions(+), 29 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 7e4c31b57a75..64efe8c90c31 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -713,7 +713,7 @@ struct kvm_vcpu_arch { struct kvm_mmu_memory_cache mmu_pte_list_desc_cache; struct kvm_mmu_memory_cache mmu_shadow_page_cache; - struct kvm_mmu_memory_cache mmu_gfn_array_cache; + struct kvm_mmu_memory_cache mmu_shadowed_info_cache; struct kvm_mmu_memory_cache mmu_page_header_cache; /* diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 7cca28d89a85..13a059ad5dc7 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -656,7 +656,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect) if (r) return r; if (maybe_indirect) { - r = kvm_mmu_topup_memory_cache(>arch.mmu_gfn_array_cache, + r = kvm_mmu_topup_memory_cache(>arch.mmu_shadowed_info_cache, PT64_ROOT_MAX_LEVEL); if (r) return r; @@ -669,7 +669,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu) { kvm_mmu_free_memory_cache(>arch.mmu_pte_list_desc_cache); kvm_mmu_free_memory_cache(>arch.mmu_shadow_page_cache); - kvm_mmu_free_memory_cache(>arch.mmu_gfn_array_cache); + kvm_mmu_free_memory_cache(>arch.mmu_shadowed_info_cache); kvm_mmu_free_memory_cache(>arch.mmu_page_header_cache); } @@ -678,34 +678,68 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc) kmem_cache_free(pte_list_desc_cache, pte_list_desc); } +static bool sp_has_gptes(struct kvm_mmu_page *sp); + static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index) { if (sp->role.passthrough) return sp->gfn; if (!sp->role.direct) - return sp->gfns[index]; + return sp->shadowed_translation[index] >> PAGE_SHIFT; return sp->gfn + (index << ((sp->role.level - 1) * SPTE_LEVEL_BITS)); } -static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn) +/* + * For leaf SPTEs, fetch the *guest* access permissions being shadowed. Note + * that the SPTE itself may have a more constrained access permissions that + * what the guest enforces. For example, a guest may create an executable + * huge PTE but KVM may disallow execution to mitigate iTLB multihit. + */ +static u32 kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index) { - if (sp->role.passthrough) { - WARN_ON_ONCE(gfn != sp->gfn); - return; - } + if (sp_has_gptes(sp)) + return sp->shadowed_translation[index] & ACC_ALL; - if (!sp->role.direct) { - sp->gfns[index] = gfn; + /* +* For direct MMUs (e.g. TDP or non-paging guests) or passthrough SPs, +* KVM is not shadowing any g
[PATCH v7 18/23] KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU
From: David Matlack Currently make_huge_page_split_spte() assumes execute permissions can be granted to any 4K SPTE when splitting huge pages. This is true for the TDP MMU but is not necessarily true for the shadow MMU, since KVM may be shadowing a non-executable huge page. To fix this, pass in the role of the child shadow page where the huge page will be split and derive the execution permission from that. This is correct because huge pages are always split with direct shadow page and thus the shadow page role contains the correct access permissions. No functional change intended. Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-19-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/spte.c| 15 +++ arch/x86/kvm/mmu/spte.h| 4 ++-- arch/x86/kvm/mmu/tdp_mmu.c | 2 +- 3 files changed, 10 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c index db294c1beea2..fb1f17504138 100644 --- a/arch/x86/kvm/mmu/spte.c +++ b/arch/x86/kvm/mmu/spte.c @@ -246,11 +246,10 @@ static u64 make_spte_executable(u64 spte) * This is used during huge page splitting to build the SPTEs that make up the * new page table. */ -u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte, int huge_level, +u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte, union kvm_mmu_page_role role, int index) { u64 child_spte; - int child_level; if (WARN_ON_ONCE(!is_shadow_present_pte(huge_spte))) return 0; @@ -259,23 +258,23 @@ u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte, int huge_level, return 0; child_spte = huge_spte; - child_level = huge_level - 1; /* * The child_spte already has the base address of the huge page being * split. So we just have to OR in the offset to the page at the next * lower level for the given index. */ - child_spte |= (index * KVM_PAGES_PER_HPAGE(child_level)) << PAGE_SHIFT; + child_spte |= (index * KVM_PAGES_PER_HPAGE(role.level)) << PAGE_SHIFT; - if (child_level == PG_LEVEL_4K) { + if (role.level == PG_LEVEL_4K) { child_spte &= ~PT_PAGE_SIZE_MASK; /* -* When splitting to a 4K page, mark the page executable as the -* NX hugepage mitigation no longer applies. +* When splitting to a 4K page where execution is allowed, mark +* the page executable as the NX hugepage mitigation no longer +* applies. */ - if (is_nx_huge_page_enabled(kvm)) + if ((role.access & ACC_EXEC_MASK) && is_nx_huge_page_enabled(kvm)) child_spte = make_spte_executable(child_spte); } diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h index 256f90587e8d..b5c855f5514f 100644 --- a/arch/x86/kvm/mmu/spte.h +++ b/arch/x86/kvm/mmu/spte.h @@ -421,8 +421,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch, bool can_unsync, bool host_writable, u64 *new_spte); -u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte, int huge_level, - int index); +u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte, + union kvm_mmu_page_role role, int index); u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled); u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access); u64 mark_spte_for_access_track(u64 spte); diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 522e2532343b..f3a430d64975 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1478,7 +1478,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter, * not been linked in yet and thus is not reachable from any other CPU. */ for (i = 0; i < SPTE_ENT_PER_PAGE; i++) - sp->spt[i] = make_huge_page_split_spte(kvm, huge_spte, level, i); + sp->spt[i] = make_huge_page_split_spte(kvm, huge_spte, sp->role, i); /* * Replace the huge spte with a pointer to the populated lower level -- 2.31.1 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
[PATCH v7 19/23] KVM: x86/mmu: Zap collapsible SPTEs in shadow MMU at all possible levels
From: David Matlack Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU. This is fine for now since KVM never creates intermediate huge pages during dirty logging. In other words, KVM always replaces 1GiB pages directly with 4KiB pages, so there is no reason to look for collapsible 2MiB pages. However, this will stop being true once the shadow MMU participates in eager page splitting. During eager page splitting, each 1GiB is first split into 2MiB pages and then those are split into 4KiB pages. The intermediate 2MiB pages may be left behind if an error condition causes eager page splitting to bail early. No functional change intended. Reviewed-by: Peter Xu Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-20-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 21 ++--- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 13a059ad5dc7..36bc49f08d60 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -6154,18 +6154,25 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm, return need_tlb_flush; } +static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm, + const struct kvm_memory_slot *slot) +{ + /* +* Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap +* pages that are already mapped at the maximum possible level. +*/ + if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte, + PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, + true)) + kvm_arch_flush_remote_tlbs_memslot(kvm, slot); +} + void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, const struct kvm_memory_slot *slot) { if (kvm_memslots_have_rmaps(kvm)) { write_lock(>mmu_lock); - /* -* Zap only 4k SPTEs since the legacy MMU only supports dirty -* logging at a 4k granularity and never creates collapsible -* 2m SPTEs during dirty logging. -*/ - if (slot_handle_level_4k(kvm, slot, kvm_mmu_zap_collapsible_spte, true)) - kvm_arch_flush_remote_tlbs_memslot(kvm, slot); + kvm_rmap_zap_collapsible_sptes(kvm, slot); write_unlock(>mmu_lock); } -- 2.31.1 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
[PATCH v7 14/23] KVM: x86/mmu: Pass const memslot to rmap_add()
From: David Matlack Constify rmap_add()'s @slot parameter; it is simply passed on to gfn_to_rmap(), which takes a const memslot. No functional change intended. Reviewed-by: Ben Gardon Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-15-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index a7748c5a2385..45a4e85c0b2c 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1556,7 +1556,7 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head, #define RMAP_RECYCLE_THRESHOLD 1000 -static void rmap_add(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot, +static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot, u64 *spte, gfn_t gfn) { struct kvm_mmu_page *sp; -- 2.31.1 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
[PATCH v7 13/23] KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page()
From: David Matlack Allow @vcpu to be NULL in kvm_mmu_find_shadow_page() (and its only caller __kvm_mmu_get_shadow_page()). @vcpu is only required to sync indirect shadow pages, so it's safe to pass in NULL when looking up direct shadow pages. This will be used for doing eager page splitting, which allocates direct shadow pages from the context of a VM ioctl without access to a vCPU pointer. Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-14-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 88b3f3c2c8b1..a7748c5a2385 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1975,6 +1975,12 @@ static void clear_sp_write_flooding_count(u64 *spte) __clear_sp_write_flooding_count(sptep_to_sp(spte)); } +/* + * The vCPU is required when finding indirect shadow pages; the shadow + * page may already exist and syncing it needs the vCPU pointer in + * order to read guest page tables. Direct shadow pages are never + * unsync, thus @vcpu can be NULL if @role.direct is true. + */ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm, struct kvm_vcpu *vcpu, gfn_t gfn, @@ -2013,6 +2019,9 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm, goto out; if (sp->unsync) { + if (KVM_BUG_ON(!vcpu, kvm)) + break; + /* * The page is good, but is stale. kvm_sync_page does * get the latest guest state, but (unlike mmu_unsync_children) @@ -2090,6 +2099,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm, return sp; } +/* Note, @vcpu may be NULL if @role.direct is true; see kvm_mmu_find_shadow_page. */ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm, struct kvm_vcpu *vcpu, struct shadow_page_caches *caches, -- 2.31.1 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
[PATCH v7 06/23] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
From: David Matlack Decompose kvm_mmu_get_page() into separate helper functions to increase readability and prepare for allocating shadow pages without a vcpu pointer. Specifically, pull the guts of kvm_mmu_get_page() into 2 helper functions: kvm_mmu_find_shadow_page() - Walks the page hash checking for any existing mmu pages that match the given gfn and role. kvm_mmu_alloc_shadow_page() Allocates and initializes an entirely new kvm_mmu_page. This currently requries a vcpu pointer for allocation and looking up the memslot but that will be removed in a future commit. No functional change intended. Reviewed-by: Sean Christopherson Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-7-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 52 +++--- 1 file changed, 39 insertions(+), 13 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index f4e7978a6c6a..a59fe860da29 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1993,16 +1993,16 @@ static void clear_sp_write_flooding_count(u64 *spte) __clear_sp_write_flooding_count(sptep_to_sp(spte)); } -static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn, -union kvm_mmu_page_role role) +static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu, +gfn_t gfn, +struct hlist_head *sp_list, +union kvm_mmu_page_role role) { - struct hlist_head *sp_list; struct kvm_mmu_page *sp; int ret; int collisions = 0; LIST_HEAD(invalid_list); - sp_list = >kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)]; for_each_valid_sp(vcpu->kvm, sp, sp_list) { if (sp->gfn != gfn) { collisions++; @@ -2027,7 +2027,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn, /* unsync and write-flooding only apply to indirect SPs. */ if (sp->role.direct) - goto trace_get_page; + goto out; if (sp->unsync) { /* @@ -2053,14 +2053,26 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn, __clear_sp_write_flooding_count(sp); -trace_get_page: - trace_kvm_mmu_get_page(sp, false); goto out; } + sp = NULL; ++vcpu->kvm->stat.mmu_cache_miss; - sp = kvm_mmu_alloc_page(vcpu, role.direct); +out: + kvm_mmu_commit_zap_page(vcpu->kvm, _list); + + if (collisions > vcpu->kvm->stat.max_mmu_page_hash_collisions) + vcpu->kvm->stat.max_mmu_page_hash_collisions = collisions; + return sp; +} + +static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, + gfn_t gfn, + struct hlist_head *sp_list, + union kvm_mmu_page_role role) +{ + struct kvm_mmu_page *sp = kvm_mmu_alloc_page(vcpu, role.direct); sp->gfn = gfn; sp->role = role; @@ -2070,12 +2082,26 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn, if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn)) kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1); } - trace_kvm_mmu_get_page(sp, true); -out: - kvm_mmu_commit_zap_page(vcpu->kvm, _list); - if (collisions > vcpu->kvm->stat.max_mmu_page_hash_collisions) - vcpu->kvm->stat.max_mmu_page_hash_collisions = collisions; + return sp; +} + +static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn, +union kvm_mmu_page_role role) +{ + struct hlist_head *sp_list; + struct kvm_mmu_page *sp; + bool created = false; + + sp_list = >kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)]; + + sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role); + if (!sp) { + created = true; + sp = kvm_mmu_alloc_shadow_page(vcpu, gfn, sp_list, role); + } + + trace_kvm_mmu_get_page(sp, created); return sp; } -- 2.31.1 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
[PATCH v7 15/23] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
From: David Matlack Allow adding new entries to the rmap and linking shadow pages without a struct kvm_vcpu pointer by moving the implementation of rmap_add() and link_shadow_page() into inner helper functions. No functional change intended. Reviewed-by: Ben Gardon Reviewed-by: Peter Xu Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-16-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 47 ++ 1 file changed, 29 insertions(+), 18 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 45a4e85c0b2c..a8cdbe2958d9 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -673,11 +673,6 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu) kvm_mmu_free_memory_cache(>arch.mmu_page_header_cache); } -static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu) -{ - return kvm_mmu_memory_cache_alloc(>arch.mmu_pte_list_desc_cache); -} - static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc) { kmem_cache_free(pte_list_desc_cache, pte_list_desc); @@ -832,7 +827,7 @@ gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn, /* * Returns the number of pointers in the rmap chain, not counting the new one. */ -static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte, +static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte, struct kvm_rmap_head *rmap_head) { struct pte_list_desc *desc; @@ -843,7 +838,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte, rmap_head->val = (unsigned long)spte; } else if (!(rmap_head->val & 1)) { rmap_printk("%p %llx 1->many\n", spte, *spte); - desc = mmu_alloc_pte_list_desc(vcpu); + desc = kvm_mmu_memory_cache_alloc(cache); desc->sptes[0] = (u64 *)rmap_head->val; desc->sptes[1] = spte; desc->spte_count = 2; @@ -855,7 +850,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte, while (desc->spte_count == PTE_LIST_EXT) { count += PTE_LIST_EXT; if (!desc->more) { - desc->more = mmu_alloc_pte_list_desc(vcpu); + desc->more = kvm_mmu_memory_cache_alloc(cache); desc = desc->more; desc->spte_count = 0; break; @@ -1556,8 +1551,10 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head, #define RMAP_RECYCLE_THRESHOLD 1000 -static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot, -u64 *spte, gfn_t gfn) +static void __rmap_add(struct kvm *kvm, + struct kvm_mmu_memory_cache *cache, + const struct kvm_memory_slot *slot, + u64 *spte, gfn_t gfn) { struct kvm_mmu_page *sp; struct kvm_rmap_head *rmap_head; @@ -1566,15 +1563,23 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot, sp = sptep_to_sp(spte); kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn); rmap_head = gfn_to_rmap(gfn, sp->role.level, slot); - rmap_count = pte_list_add(vcpu, spte, rmap_head); + rmap_count = pte_list_add(cache, spte, rmap_head); if (rmap_count > RMAP_RECYCLE_THRESHOLD) { - kvm_unmap_rmapp(vcpu->kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0)); + kvm_unmap_rmapp(kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0)); kvm_flush_remote_tlbs_with_address( - vcpu->kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level)); + kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level)); } } +static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot, +u64 *spte, gfn_t gfn) +{ + struct kvm_mmu_memory_cache *cache = >arch.mmu_pte_list_desc_cache; + + __rmap_add(vcpu->kvm, cache, slot, spte, gfn); +} + bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { bool young = false; @@ -1645,13 +1650,13 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn) return hash_64(gfn, KVM_MMU_HASH_SHIFT); } -static void mmu_page_add_parent_pte(struct kvm_vcpu *vcpu, +static void mmu_page_add_parent_pte(struct kvm_mmu_memory_cache *cache, struct kvm_mmu_page *sp, u64 *parent_pte) { if (!parent_pte) return; - pte_list_add(vcpu, parent_pte, >parent_ptes); + pte_list_add(cache, parent_pte, >parent_ptes); } static void mmu_page_remove_parent_pte(struct k
[PATCH v7 09/23] KVM: x86/mmu: Move guest PT write-protection to account_shadowed()
From: David Matlack Move the code that write-protects newly-shadowed guest page tables into account_shadowed(). This avoids a extra gfn-to-memslot lookup and is a more logical place for this code to live. But most importantly, this reduces kvm_mmu_alloc_shadow_page()'s reliance on having a struct kvm_vcpu pointer, which will be necessary when creating new shadow pages during VM ioctls for eager page splitting. Note, it is safe to drop the role.level == PG_LEVEL_4K check since account_shadowed() returns early if role.level > PG_LEVEL_4K. No functional change intended. Reviewed-by: Sean Christopherson Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-10-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index bd45364bf465..2602c3642f23 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -766,6 +766,9 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp) KVM_PAGE_TRACK_WRITE); kvm_mmu_gfn_disallow_lpage(slot, gfn); + + if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K)) + kvm_flush_remote_tlbs_with_address(kvm, gfn, 1); } void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp) @@ -2072,11 +2075,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, sp->gfn = gfn; sp->role = role; hlist_add_head(>hash_link, sp_list); - if (sp_has_gptes(sp)) { + if (sp_has_gptes(sp)) account_shadowed(vcpu->kvm, sp); - if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn)) - kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1); - } return sp; } -- 2.31.1 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
[PATCH v7 10/23] KVM: x86/mmu: Pass memory caches to allocate SPs separately
From: David Matlack Refactor kvm_mmu_alloc_shadow_page() to receive the caches from which it will allocate the various pieces of memory for shadow pages as a parameter, rather than deriving them from the vcpu pointer. This will be useful in a future commit where shadow pages are allocated during VM ioctls for eager page splitting, and thus will use a different set of caches. Preemptively pull the caches out all the way to kvm_mmu_get_shadow_page() since eager page splitting will not be calling kvm_mmu_alloc_shadow_page() directly. No functional change intended. Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-11-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 36 +--- 1 file changed, 29 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 2602c3642f23..fab417e7bf6c 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -2049,17 +2049,25 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu, return sp; } +/* Caches used when allocating a new shadow page. */ +struct shadow_page_caches { + struct kvm_mmu_memory_cache *page_header_cache; + struct kvm_mmu_memory_cache *shadow_page_cache; + struct kvm_mmu_memory_cache *gfn_array_cache; +}; + static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, + struct shadow_page_caches *caches, gfn_t gfn, struct hlist_head *sp_list, union kvm_mmu_page_role role) { struct kvm_mmu_page *sp; - sp = kvm_mmu_memory_cache_alloc(>arch.mmu_page_header_cache); - sp->spt = kvm_mmu_memory_cache_alloc(>arch.mmu_shadow_page_cache); + sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache); + sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache); if (!role.direct) - sp->gfns = kvm_mmu_memory_cache_alloc(>arch.mmu_gfn_array_cache); + sp->gfns = kvm_mmu_memory_cache_alloc(caches->gfn_array_cache); set_page_private(virt_to_page(sp->spt), (unsigned long)sp); @@ -2081,9 +2089,10 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, return sp; } -static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu, - gfn_t gfn, - union kvm_mmu_page_role role) +static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu, + struct shadow_page_caches *caches, + gfn_t gfn, + union kvm_mmu_page_role role) { struct hlist_head *sp_list; struct kvm_mmu_page *sp; @@ -2094,13 +2103,26 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu, sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role); if (!sp) { created = true; - sp = kvm_mmu_alloc_shadow_page(vcpu, gfn, sp_list, role); + sp = kvm_mmu_alloc_shadow_page(vcpu, caches, gfn, sp_list, role); } trace_kvm_mmu_get_page(sp, created); return sp; } +static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu, + gfn_t gfn, + union kvm_mmu_page_role role) +{ + struct shadow_page_caches caches = { + .page_header_cache = >arch.mmu_page_header_cache, + .shadow_page_cache = >arch.mmu_shadow_page_cache, + .gfn_array_cache = >arch.mmu_gfn_array_cache, + }; + + return __kvm_mmu_get_shadow_page(vcpu, , gfn, role); +} + static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, unsigned int access) { struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep); -- 2.31.1 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
[PATCH v7 08/23] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
From: David Matlack Rename 2 functions: kvm_mmu_get_page() -> kvm_mmu_get_shadow_page() kvm_mmu_free_page() -> kvm_mmu_free_shadow_page() This change makes it clear that these functions deal with shadow pages rather than struct pages. It also aligns these functions with the naming scheme for kvm_mmu_find_shadow_page() and kvm_mmu_alloc_shadow_page(). Prefer "shadow_page" over the shorter "sp" since these are core functions and the line lengths aren't terrible. No functional change intended. Reviewed-by: Sean Christopherson Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-9-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 8b84cdd8c6cd..bd45364bf465 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1626,7 +1626,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr) percpu_counter_add(_total_used_mmu_pages, nr); } -static void kvm_mmu_free_page(struct kvm_mmu_page *sp) +static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp) { MMU_WARN_ON(!is_empty_shadow_page(sp->spt)); hlist_del(>hash_link); @@ -2081,8 +2081,9 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, return sp; } -static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn, -union kvm_mmu_page_role role) +static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu, + gfn_t gfn, + union kvm_mmu_page_role role) { struct hlist_head *sp_list; struct kvm_mmu_page *sp; @@ -2146,7 +2147,7 @@ static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu, union kvm_mmu_page_role role; role = kvm_mmu_child_role(sptep, direct, access); - return kvm_mmu_get_page(vcpu, gfn, role); + return kvm_mmu_get_shadow_page(vcpu, gfn, role); } static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator, @@ -2422,7 +2423,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm, list_for_each_entry_safe(sp, nsp, invalid_list, link) { WARN_ON(!sp->role.invalid || sp->root_count); - kvm_mmu_free_page(sp); + kvm_mmu_free_shadow_page(sp); } } @@ -3415,7 +3416,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte); WARN_ON_ONCE(role.direct && role.has_4_byte_gpte); - sp = kvm_mmu_get_page(vcpu, gfn, role); + sp = kvm_mmu_get_shadow_page(vcpu, gfn, role); ++sp->root_count; return __pa(sp->spt); -- 2.31.1 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
[PATCH v7 03/23] KVM: x86/mmu: Stop passing "direct" to mmu_alloc_root()
From: David Matlack The "direct" argument is vcpu->arch.mmu->root_role.direct, because unlike non-root page tables, it's impossible to have a direct root in an indirect MMU. So just use that. Suggested-by: Lai Jiangshan Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-4-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 844b58ddb3bb..2e30398fe59f 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3369,8 +3369,9 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn) } static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva, - u8 level, bool direct) + u8 level) { + bool direct = vcpu->arch.mmu->root_role.direct; struct kvm_mmu_page *sp; sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL); @@ -3396,7 +3397,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu) root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu); mmu->root.hpa = root; } else if (shadow_root_level >= PT64_ROOT_4LEVEL) { - root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true); + root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level); mmu->root.hpa = root; } else if (shadow_root_level == PT32E_ROOT_LEVEL) { if (WARN_ON_ONCE(!mmu->pae_root)) { @@ -3408,7 +3409,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu) WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i])); root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), - i << 30, PT32_ROOT_LEVEL, true); + i << 30, PT32_ROOT_LEVEL); mmu->pae_root[i] = root | PT_PRESENT_MASK | shadow_me_value; } @@ -3532,7 +3533,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) */ if (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) { root = mmu_alloc_root(vcpu, root_gfn, 0, - mmu->root_role.level, false); + mmu->root_role.level); mmu->root.hpa = root; goto set_root_pgd; } @@ -3578,7 +3579,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) } root = mmu_alloc_root(vcpu, root_gfn, i << 30, - PT32_ROOT_LEVEL, false); + PT32_ROOT_LEVEL); mmu->pae_root[i] = root | pm_mask; } -- 2.31.1 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
[PATCH v7 12/23] KVM: x86/mmu: Pass kvm pointer separately from vcpu to kvm_mmu_find_shadow_page()
From: David Matlack Get the kvm pointer from the caller, rather than deriving it from vcpu->kvm, and plumb the kvm pointer all the way from kvm_mmu_get_shadow_page(). With this change in place, the vcpu pointer is only needed to sync indirect shadow pages. In other words, __kvm_mmu_get_shadow_page() can now be used to get *direct* shadow pages without a vcpu pointer. This enables eager page splitting, which needs to allocate direct shadow pages during VM ioctls. No functional change intended. Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-13-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 28 +++- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index c5a88e8d1b53..88b3f3c2c8b1 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1975,7 +1975,8 @@ static void clear_sp_write_flooding_count(u64 *spte) __clear_sp_write_flooding_count(sptep_to_sp(spte)); } -static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu, +static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm, +struct kvm_vcpu *vcpu, gfn_t gfn, struct hlist_head *sp_list, union kvm_mmu_page_role role) @@ -1985,7 +1986,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu, int collisions = 0; LIST_HEAD(invalid_list); - for_each_valid_sp(vcpu->kvm, sp, sp_list) { + for_each_valid_sp(kvm, sp, sp_list) { if (sp->gfn != gfn) { collisions++; continue; @@ -2002,7 +2003,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu, * upper-level page will be write-protected. */ if (role.level > PG_LEVEL_4K && sp->unsync) - kvm_mmu_prepare_zap_page(vcpu->kvm, sp, + kvm_mmu_prepare_zap_page(kvm, sp, _list); continue; } @@ -2030,7 +2031,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu, WARN_ON(!list_empty(_list)); if (ret > 0) - kvm_flush_remote_tlbs(vcpu->kvm); + kvm_flush_remote_tlbs(kvm); } __clear_sp_write_flooding_count(sp); @@ -2039,13 +2040,13 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu, } sp = NULL; - ++vcpu->kvm->stat.mmu_cache_miss; + ++kvm->stat.mmu_cache_miss; out: - kvm_mmu_commit_zap_page(vcpu->kvm, _list); + kvm_mmu_commit_zap_page(kvm, _list); - if (collisions > vcpu->kvm->stat.max_mmu_page_hash_collisions) - vcpu->kvm->stat.max_mmu_page_hash_collisions = collisions; + if (collisions > kvm->stat.max_mmu_page_hash_collisions) + kvm->stat.max_mmu_page_hash_collisions = collisions; return sp; } @@ -2089,7 +2090,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm, return sp; } -static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu, +static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm, + struct kvm_vcpu *vcpu, struct shadow_page_caches *caches, gfn_t gfn, union kvm_mmu_page_role role) @@ -2098,12 +2100,12 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp; bool created = false; - sp_list = >kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)]; + sp_list = >arch.mmu_page_hash[kvm_page_table_hashfn(gfn)]; - sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role); + sp = kvm_mmu_find_shadow_page(kvm, vcpu, gfn, sp_list, role); if (!sp) { created = true; - sp = kvm_mmu_alloc_shadow_page(vcpu->kvm, caches, gfn, sp_list, role); + sp = kvm_mmu_alloc_shadow_page(kvm, caches, gfn, sp_list, role); } trace_kvm_mmu_get_page(sp, created); @@ -2120,7 +2122,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu, .gfn_array_cache = >arch.mmu_gfn_array_cache, }; - return __kvm_mmu_get_shadow_page(
[PATCH v7 04/23] KVM: x86/mmu: Derive shadow MMU page role from parent
From: David Matlack Instead of computing the shadow page role from scratch for every new page, derive most of the information from the parent shadow page. This eliminates the dependency on the vCPU root role to allocate shadow page tables, and reduces the number of parameters to kvm_mmu_get_page(). Preemptively split out the role calculation to a separate function for use in a following commit. Note that when calculating the MMU root role, we can take @role.passthrough, @role.direct, and @role.access directly from @vcpu->arch.mmu->root_role. Only @role.level and @role.quadrant still must be overridden for PAE page directories, when shadowing 32-bit guest page tables with PAE page tables. No functional change intended. Reviewed-by: Peter Xu Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-5-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 114 +++-- arch/x86/kvm/mmu/paging_tmpl.h | 9 +-- 2 files changed, 71 insertions(+), 52 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 2e30398fe59f..fd1b479bf7fc 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1993,49 +1993,15 @@ static void clear_sp_write_flooding_count(u64 *spte) __clear_sp_write_flooding_count(sptep_to_sp(spte)); } -static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, -gfn_t gfn, -gva_t gaddr, -unsigned level, -bool direct, -unsigned int access) +static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn, +union kvm_mmu_page_role role) { - union kvm_mmu_page_role role; struct hlist_head *sp_list; - unsigned quadrant; struct kvm_mmu_page *sp; int ret; int collisions = 0; LIST_HEAD(invalid_list); - role = vcpu->arch.mmu->root_role; - role.level = level; - role.direct = direct; - role.access = access; - if (role.has_4_byte_gpte) { - /* -* If the guest has 4-byte PTEs then that means it's using 32-bit, -* 2-level, non-PAE paging. KVM shadows such guests with PAE paging -* (i.e. 8-byte PTEs). The difference in PTE size means that KVM must -* shadow each guest page table with multiple shadow page tables, which -* requires extra bookkeeping in the role. -* -* Specifically, to shadow the guest's page directory (which covers a -* 4GiB address space), KVM uses 4 PAE page directories, each mapping -* 1GiB of the address space. @role.quadrant encodes which quarter of -* the address space each maps. -* -* To shadow the guest's page tables (which each map a 4MiB region), KVM -* uses 2 PAE page tables, each mapping a 2MiB region. For these, -* @role.quadrant encodes which half of the region they map. -*/ - quadrant = gaddr >> (PAGE_SHIFT + (SPTE_LEVEL_BITS * level)); - quadrant &= (1 << level) - 1; - role.quadrant = quadrant; - } - if (level <= vcpu->arch.mmu->cpu_role.base.level) - role.passthrough = 0; - sp_list = >kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)]; for_each_valid_sp(vcpu->kvm, sp, sp_list) { if (sp->gfn != gfn) { @@ -2053,7 +2019,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, * Unsync pages must not be left as is, because the new * upper-level page will be write-protected. */ - if (level > PG_LEVEL_4K && sp->unsync) + if (role.level > PG_LEVEL_4K && sp->unsync) kvm_mmu_prepare_zap_page(vcpu->kvm, sp, _list); continue; @@ -2094,14 +2060,14 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, ++vcpu->kvm->stat.mmu_cache_miss; - sp = kvm_mmu_alloc_page(vcpu, direct); + sp = kvm_mmu_alloc_page(vcpu, role.direct); sp->gfn = gfn; sp->role = role; hlist_add_head(>hash_link, sp_list); if (sp_has_gptes(sp)) { account_shadowed(vcpu->kvm, sp); - if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn)) + if (role.level == PG_LEVEL_4K &&a
[PATCH v7 07/23] KVM: x86/mmu: Consolidate shadow page allocation and initialization
From: David Matlack Consolidate kvm_mmu_alloc_page() and kvm_mmu_alloc_shadow_page() under the latter so that all shadow page allocation and initialization happens in one place. No functional change intended. Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-8-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 39 +-- 1 file changed, 17 insertions(+), 22 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index a59fe860da29..8b84cdd8c6cd 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1664,27 +1664,6 @@ static void drop_parent_pte(struct kvm_mmu_page *sp, mmu_spte_clear_no_track(parent_pte); } -static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, bool direct) -{ - struct kvm_mmu_page *sp; - - sp = kvm_mmu_memory_cache_alloc(>arch.mmu_page_header_cache); - sp->spt = kvm_mmu_memory_cache_alloc(>arch.mmu_shadow_page_cache); - if (!direct) - sp->gfns = kvm_mmu_memory_cache_alloc(>arch.mmu_gfn_array_cache); - set_page_private(virt_to_page(sp->spt), (unsigned long)sp); - - /* -* active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages() -* depends on valid pages being added to the head of the list. See -* comments in kvm_zap_obsolete_pages(). -*/ - sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen; - list_add(>link, >kvm->arch.active_mmu_pages); - kvm_mod_used_mmu_pages(vcpu->kvm, +1); - return sp; -} - static void mark_unsync(u64 *spte); static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp) { @@ -2072,7 +2051,23 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, struct hlist_head *sp_list, union kvm_mmu_page_role role) { - struct kvm_mmu_page *sp = kvm_mmu_alloc_page(vcpu, role.direct); + struct kvm_mmu_page *sp; + + sp = kvm_mmu_memory_cache_alloc(>arch.mmu_page_header_cache); + sp->spt = kvm_mmu_memory_cache_alloc(>arch.mmu_shadow_page_cache); + if (!role.direct) + sp->gfns = kvm_mmu_memory_cache_alloc(>arch.mmu_gfn_array_cache); + + set_page_private(virt_to_page(sp->spt), (unsigned long)sp); + + /* +* active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages() +* depends on valid pages being added to the head of the list. See +* comments in kvm_zap_obsolete_pages(). +*/ + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen; + list_add(>link, >kvm->arch.active_mmu_pages); + kvm_mod_used_mmu_pages(vcpu->kvm, +1); sp->gfn = gfn; sp->role = role; -- 2.31.1 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
[PATCH v7 00/23] KVM: Extend Eager Page Splitting to the shadow MMU
For the description of the "why" of this patch, I'll just direct you to David's excellent cover letter from v6, which can be found at https://lore.kernel.org/r/20220516232138.1783324-1-dmatl...@google.com. This version mostly does the following: - apply the feedback from Sean and other reviewers, which is mostly aesthetic - replace the refactoring of drop_large_spte()/__drop_large_spte() with my own version. The insight there is that drop_large_spte() is always followed by {,__}link_shadow_page(), so the call is moved there - split the TLB flush optimization into a separate patch, mostly to perform the previous refactoring independent of the optional TLB flush - rename a few functions from *nested_mmu* to *shadow_mmu* David Matlack (21): KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs KVM: x86/mmu: Use a bool for direct KVM: x86/mmu: Stop passing "direct" to mmu_alloc_root() KVM: x86/mmu: Derive shadow MMU page role from parent KVM: x86/mmu: Always pass 0 for @quadrant when gptes are 8 bytes KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions KVM: x86/mmu: Consolidate shadow page allocation and initialization KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages KVM: x86/mmu: Move guest PT write-protection to account_shadowed() KVM: x86/mmu: Pass memory caches to allocate SPs separately KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page() KVM: x86/mmu: Pass kvm pointer separately from vcpu to kvm_mmu_find_shadow_page() KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page() KVM: x86/mmu: Pass const memslot to rmap_add() KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu KVM: x86/mmu: Update page stats in __rmap_add() KVM: x86/mmu: Cache the access bits of shadowed translations KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU KVM: x86/mmu: Zap collapsible SPTEs in shadow MMU at all possible levels KVM: Allow for different capacities in kvm_mmu_memory_cache structs KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs Paolo Bonzini (2): KVM: x86/mmu: pull call to drop_large_spte() into __link_shadow_page() KVM: x86/mmu: Avoid unnecessary flush on eager page split .../admin-guide/kernel-parameters.txt | 3 +- arch/arm64/kvm/mmu.c | 2 +- arch/riscv/kvm/mmu.c | 5 +- arch/x86/include/asm/kvm_host.h | 24 +- arch/x86/kvm/mmu/mmu.c| 719 ++ arch/x86/kvm/mmu/mmu_internal.h | 17 +- arch/x86/kvm/mmu/paging_tmpl.h| 43 +- arch/x86/kvm/mmu/spte.c | 15 +- arch/x86/kvm/mmu/spte.h | 4 +- arch/x86/kvm/mmu/tdp_mmu.c| 2 +- include/linux/kvm_host.h | 1 + include/linux/kvm_types.h | 6 +- virt/kvm/kvm_main.c | 33 +- 13 files changed, 666 insertions(+), 208 deletions(-) -- 2.31.1 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
[PATCH v7 02/23] KVM: x86/mmu: Use a bool for direct
From: David Matlack The parameter "direct" can either be true or false, and all of the callers pass in a bool variable or true/false literal, so just use the type bool. No functional change intended. Reviewed-by: Lai Jiangshan Reviewed-by: Sean Christopherson Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-3-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index c0afb4f1c8ae..844b58ddb3bb 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1664,7 +1664,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp, mmu_spte_clear_no_track(parent_pte); } -static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct) +static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, bool direct) { struct kvm_mmu_page *sp; @@ -1997,7 +1997,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gaddr, unsigned level, -int direct, +bool direct, unsigned int access) { union kvm_mmu_page_role role; -- 2.31.1 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
[PATCH v7 05/23] KVM: x86/mmu: Always pass 0 for @quadrant when gptes are 8 bytes
From: David Matlack The quadrant is only used when gptes are 4 bytes, but mmu_alloc_{direct,shadow}_roots() pass in a non-zero quadrant for PAE page directories regardless. Make this less confusing by only passing in a non-zero quadrant when it is actually necessary. Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-6-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 20 ++-- 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index fd1b479bf7fc..f4e7978a6c6a 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3389,9 +3389,10 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, struct kvm_mmu_page *sp; role.level = level; + role.quadrant = quadrant; - if (role.has_4_byte_gpte) - role.quadrant = quadrant; + WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte); + WARN_ON_ONCE(role.direct && role.has_4_byte_gpte); sp = kvm_mmu_get_page(vcpu, gfn, role); ++sp->root_count; @@ -3427,7 +3428,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu) for (i = 0; i < 4; ++i) { WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i])); - root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), i, + root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), 0, PT32_ROOT_LEVEL); mmu->pae_root[i] = root | PT_PRESENT_MASK | shadow_me_value; @@ -3512,9 +3513,8 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) struct kvm_mmu *mmu = vcpu->arch.mmu; u64 pdptrs[4], pm_mask; gfn_t root_gfn, root_pgd; + int quadrant, i, r; hpa_t root; - unsigned i; - int r; root_pgd = mmu->get_guest_pgd(vcpu); root_gfn = root_pgd >> PAGE_SHIFT; @@ -3597,7 +3597,15 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) root_gfn = pdptrs[i] >> PAGE_SHIFT; } - root = mmu_alloc_root(vcpu, root_gfn, i, PT32_ROOT_LEVEL); + /* +* If shadowing 32-bit non-PAE page tables, each PAE page +* directory maps one quarter of the guest's non-PAE page +* directory. Othwerise each PAE page direct shadows one guest +* PAE page directory so that quadrant should be 0. +*/ + quadrant = (mmu->cpu_role.base.level == PT32_ROOT_LEVEL) ? i : 0; + + root = mmu_alloc_root(vcpu, root_gfn, quadrant, PT32_ROOT_LEVEL); mmu->pae_root[i] = root | pm_mask; } -- 2.31.1 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
[PATCH v7 01/23] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
From: David Matlack Commit fb58a9c345f6 ("KVM: x86/mmu: Optimize MMU page cache lookup for fully direct MMUs") skipped the unsync checks and write flood clearing for full direct MMUs. We can extend this further to skip the checks for all direct shadow pages. Direct shadow pages in indirect MMUs (i.e. shadow paging) are used when shadowing a guest huge page with smaller pages. Such direct shadow pages, like their counterparts in fully direct MMUs, are never marked unsynced or have a non-zero write-flooding count. Checking sp->role.direct also generates better code than checking direct_map because, due to register pressure, direct_map has to get shoved onto the stack and then pulled back off. No functional change intended. Reviewed-by: Lai Jiangshan Reviewed-by: Sean Christopherson Reviewed-by: Peter Xu Signed-off-by: David Matlack Message-Id: <20220516232138.1783324-2-dmatl...@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 27b2a5603496..c0afb4f1c8ae 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -2000,7 +2000,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, int direct, unsigned int access) { - bool direct_mmu = vcpu->arch.mmu->root_role.direct; union kvm_mmu_page_role role; struct hlist_head *sp_list; unsigned quadrant; @@ -2060,7 +2059,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, continue; } - if (direct_mmu) + /* unsync and write-flooding only apply to indirect SPs. */ + if (sp->role.direct) goto trace_get_page; if (sp->unsync) { -- 2.31.1 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH v6 20/22] KVM: x86/mmu: Refactor drop_large_spte()
On 6/22/22 18:13, Paolo Bonzini wrote: Even better, drop_large_spte() is always called right before kvm_mmu_get_child_sp(), so: Actually, we can even include the call from eager page splitting if __link_shadow_page() is the one that takes care of dropping the large SPTE: From bea344e409bb8329ca69aca0a63f97537a7ec798 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Wed, 22 Jun 2022 12:11:44 -0400 Subject: [PATCH] KVM: MMU: pull call to drop_large_spte() into __link_shadow_page() Before allocating a child shadow page table, all callers check whether the parent already points to a huge page and, if so, they drop that SPTE. This is done by drop_large_spte(). However, the act that requires dropping the large SPTE is the installation of the sp that is returned by kvm_mmu_get_child_sp(), which happens in __link_shadow_page(). Move the call there instead of having it in each and every caller. To ensure that the shadow page is not linked twice if it was present, do _not_ opportunistically make kvm_mmu_get_child_sp() idempotent: instead, return an error value if the shadow page already existed. This is a bit more verbose, but clearer than NULL. Now that the drop_large_spte() name is not taken anymore, remove the two underscores in front of __drop_large_spte(). Signed-off-by: Paolo Bonzini diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 36bc49f08d60..64c1191be4ae 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1135,26 +1135,16 @@ static void drop_spte(struct kvm *kvm, u64 *sptep) rmap_remove(kvm, sptep); } - -static bool __drop_large_spte(struct kvm *kvm, u64 *sptep) +static void drop_large_spte(struct kvm *kvm, u64 *sptep) { - if (is_large_pte(*sptep)) { - WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K); - drop_spte(kvm, sptep); - return true; - } - - return false; -} + struct kvm_mmu_page *sp; -static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep) -{ - if (__drop_large_spte(vcpu->kvm, sptep)) { - struct kvm_mmu_page *sp = sptep_to_sp(sptep); + sp = sptep_to_sp(sptep); + WARN_ON(sp->role.level == PG_LEVEL_4K); - kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn, + drop_spte(kvm, sptep); + kvm_flush_remote_tlbs_with_address(kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level)); - } } /* @@ -2221,6 +2211,9 @@ static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu, { union kvm_mmu_page_role role; + if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) + return ERR_PTR(-EEXIST); + role = kvm_mmu_child_role(sptep, direct, access); return kvm_mmu_get_shadow_page(vcpu, gfn, role); } @@ -2295,6 +2288,13 @@ static void __link_shadow_page(struct kvm_mmu_memory_cache *cache, u64 *sptep, BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK); + /* +* If an SPTE is present already, it must be a leaf and therefore +* a large one. Drop it and flush the TLB before installing sp. +*/ + if (is_shadow_present_pte(*sptep) + drop_large_spte(vcpu->kvm, sptep); + spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp)); mmu_spte_set(sptep, spte); @@ -3080,11 +3080,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) if (it.level == fault->goal_level) break; - drop_large_spte(vcpu, it.sptep); - if (is_shadow_present_pte(*it.sptep)) - continue; - sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL); + if (sp == ERR_PTR(-EEXIST)) + continue; link_shadow_page(vcpu, it.sptep, sp); if (fault->is_tdp && fault->huge_page_disallowed && diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 24f292f3f93f..2448fa8d8438 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -648,15 +648,13 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, gfn_t table_gfn; clear_sp_write_flooding_count(it.sptep); - drop_large_spte(vcpu, it.sptep); - sp = NULL; - if (!is_shadow_present_pte(*it.sptep)) { - table_gfn = gw->table_gfn[it.level - 2]; - access = gw->pt_access[it.level - 2]; - sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn, - false, access); + table_gfn = gw->table_gfn[it.level - 2]; + access = gw->pt_access[it.level - 2]; + sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn, +
Re: [PATCH v6 20/22] KVM: x86/mmu: Refactor drop_large_spte()
On 6/17/22 19:11, Sean Christopherson wrote: since the shortlog is already a somewhat vague "do a refactor", I vote to opportunistically: - rename drop_large_spte() to drop_spte_if_huge() - rename __drop_large_spte() to drop_huge_spte() - move "if (!is_large_pte(*sptep))" to drop_spte_if_huge() since the split path should never pass in a non-huge SPTE. That last point will also clean up an oddity with with "flush" parameter; given the command-like name of "flush", it's a bit weird that __drop_large_spte() doesn't flush when the SPTE is large. Even better, drop_large_spte() is always called right before kvm_mmu_get_child_sp(), so: From 86a9490972a1e959a4df114678719494b5475720 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Wed, 22 Jun 2022 12:11:44 -0400 Subject: [PATCH] KVM: MMU: pull drop_large_spte into kvm_mmu_get_child_sp Before allocating a child shadow page table, all callers need to check whether the parent already points to a huge page and, if so, drop it. This is done by drop_large_spte(), but it can be moved to kvm_mmu_get_child_sp(). To ensure that the shadow page is not linked twice if it was present, do _not_ opportunistically make kvm_mmu_get_child_sp() idempotent: instead, return an error value if the shadow page already existed. This is a bit more verbose, but clearer than NULL. Now that the drop_large_spte() name is not taken anymore, remove the two underscores in front of __drop_large_spte(). Signed-off-by: Paolo Bonzini diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 36bc49f08d60..7f52870ee062 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1135,26 +1135,16 @@ static void drop_spte(struct kvm *kvm, u64 *sptep) rmap_remove(kvm, sptep); } - -static bool __drop_large_spte(struct kvm *kvm, u64 *sptep) +static void drop_large_spte(struct kvm *kvm, u64 *sptep) { - if (is_large_pte(*sptep)) { - WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K); - drop_spte(kvm, sptep); - return true; - } - - return false; -} + struct kvm_mmu_page *sp; -static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep) -{ - if (__drop_large_spte(vcpu->kvm, sptep)) { - struct kvm_mmu_page *sp = sptep_to_sp(sptep); + sp = sptep_to_sp(sptep); + WARN_ON(sp->role.level == PG_LEVEL_4K); - kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn, + drop_spte(kvm, sptep); + kvm_flush_remote_tlbs_with_address(kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level)); - } } /* @@ -2221,6 +2211,13 @@ static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu, { union kvm_mmu_page_role role; + if (is_shadow_present_pte(*sptep)) { + if (!is_large_pte(*sptep)) + return ERR_PTR(-EEXIST); + + drop_large_spte(vcpu->kvm, sptep, true); + } + role = kvm_mmu_child_role(sptep, direct, access); return kvm_mmu_get_shadow_page(vcpu, gfn, role); } @@ -3080,11 +3077,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) if (it.level == fault->goal_level) break; - drop_large_spte(vcpu, it.sptep); - if (is_shadow_present_pte(*it.sptep)) - continue; - sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL); + if (sp == ERR_PTR(-EEXIST)) + continue; link_shadow_page(vcpu, it.sptep, sp); if (fault->is_tdp && fault->huge_page_disallowed && diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 24f292f3f93f..2448fa8d8438 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -648,15 +648,13 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, gfn_t table_gfn; clear_sp_write_flooding_count(it.sptep); - drop_large_spte(vcpu, it.sptep); - sp = NULL; - if (!is_shadow_present_pte(*it.sptep)) { - table_gfn = gw->table_gfn[it.level - 2]; - access = gw->pt_access[it.level - 2]; - sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn, - false, access); + table_gfn = gw->table_gfn[it.level - 2]; + access = gw->pt_access[it.level - 2]; + sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn, + false, access); + if (sp != ERR_PTR(-EEXIST)) { /* * We must synchronize the pagetable before li
Re: [PATCH v6 13/22] KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page()
On 6/17/22 17:28, Sean Christopherson wrote: On Mon, May 16, 2022, David Matlack wrote: Allow @vcpu to be NULL in kvm_mmu_find_shadow_page() (and its only caller __kvm_mmu_get_shadow_page()). @vcpu is only required to sync indirect shadow pages, so it's safe to pass in NULL when looking up direct shadow pages. This will be used for doing eager page splitting, which allocates direct "hugepage" again, because I need constant reminders :-) shadow pages from the context of a VM ioctl without access to a vCPU pointer. Signed-off-by: David Matlack --- With nits addressed, Reviewed-by: Sean Christopherson arch/x86/kvm/mmu/mmu.c | 13 + 1 file changed, 13 insertions(+) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 4fbc2da47428..acb54d6e0ea5 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1850,6 +1850,7 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, if (ret < 0) kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list); + Unrelated whitespace change leftover from the previous approach. return ret; } @@ -2001,6 +2002,7 @@ static void clear_sp_write_flooding_count(u64 *spte) __clear_sp_write_flooding_count(sptep_to_sp(spte)); } +/* Note, @vcpu may be NULL if @role.direct is true. */ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm, struct kvm_vcpu *vcpu, gfn_t gfn, @@ -2039,6 +2041,16 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm, goto out; if (sp->unsync) { + /* +* A vCPU pointer should always be provided when finding s/should/must, and "be provided" in unnecessarily ambiguous, simply state that "@vcpu must be non-NULL". E.g. if a caller provides a pointer, but that pointer happens to be NULL. +* indirect shadow pages, as that shadow page may +* already exist and need to be synced using the vCPU +* pointer. Direct shadow pages are never unsync and +* thus do not require a vCPU pointer. +*/ "vCPU pointer" over and over is a bit versbose, and I prefer to refer to vCPUs/VMs as objects themselves. E.g. "XYZ requires a vCPU" versus "XYZ requires a vCPU pointer" since it's not the pointer itself that's required, it's all the context of the vCPU that is needed. /* * @vcpu must be non-NULL when finding indirect shadow * pages, as such pages may already exist and need to * be synced, which requires a vCPU. Direct pages are * never unsync and thus do not require a vCPU. */ My own take: diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index d7987420bb26..a7748c5a2385 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1975,7 +1975,12 @@ static void clear_sp_write_flooding_count(u64 *spte) __clear_sp_write_flooding_count(sptep_to_sp(spte)); } -/* Note, @vcpu may be NULL if @role.direct is true. */ +/* + * The vCPU is required when finding indirect shadow pages; the shadow + * page may already exist and syncing it needs the vCPU pointer in + * order to read guest page tables. Direct shadow pages are never + * unsync, thus @vcpu can be NULL if @role.direct is true. + */ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm, struct kvm_vcpu *vcpu, gfn_t gfn, @@ -2014,13 +2019,6 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm, goto out; if (sp->unsync) { - /* -* The vCPU pointer is required when finding indirect -* shadow pages, as that shadow page may already exist -* exist and need to be synced using the vCPU pointer. -* Direct shadow pages are never unsync and thus do not -* require a vCPU pointer. -*/ if (KVM_BUG_ON(!vcpu, kvm)) break; @@ -2101,7 +2099,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm, return sp; } -/* Note, @vcpu may be NULL if @role.direct is true. */ +/* Note, @vcpu may be NULL if @role.direct is true; see kvm_mmu_find_shadow_page. */ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm, struct kvm_vcpu *vcpu, struct shadow_page_caches *caches,
Re: [PATCH v6 03/22] KVM: x86/mmu: Stop passing @direct to mmu_alloc_root()
On 6/16/22 20:47, Sean Christopherson wrote: The argument @direct is vcpu->arch.mmu->root_role.direct, so just use that. It's worth calling out that, unlike non-root page tables, it's impossible to have a direct root in an indirect MMU. I.e. provide a hint as to why there's a need to pass @direct in the first place. I suppose there's *no* need to pass direct? Also, there's the trivial (but less interesting) justification that kvm_mmu_load does if (vcpu->arch.mmu->root_role.direct) r = mmu_alloc_direct_roots(vcpu); else r = mmu_alloc_shadow_roots(vcpu); and those are the only callers of mmu_alloc_root. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 4/4] KVM: selftests: Fix filename reporting in guest asserts
On 6/16/22 14:45, Andrew Jones wrote: +#define __GUEST_ASSERT(_condition, _condstr, _nargs, _args...) do {\ + if (!(_condition)) \ + ucall(UCALL_ABORT, GUEST_ASSERT_BUILTIN_NARGS + _nargs, \ + "Failed guest assert: " _condstr, \ + __FILE__, \ + __LINE__, \ + ##_args); \ We don't need another level of indentation nor the ## operator on _args. The ## is needed to drop the comma if there are no _args. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 0/3] KVM: selftests: Consolidate ucall code
On 6/18/22 02:16, Sean Christopherson wrote: Consolidate the code for making and getting ucalls. All architectures pass the ucall struct via memory, so filling and copying the struct is 100% generic. The only per-arch code is sending and receiving the address of said struct. Tested on x86 and arm, compile tested on s390 and RISC-V. I'm not sure about doing this yet. The SEV tests added multiple implementations of the ucalls in one architecture. I have rebased those recently (not the SEV part) to get more familiar with the new kvm_vcpu API for selftests, and was going to look at your old review next... Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [Question] remote_tlb_flush statistic is missed from kvm_flush_remote_tlbs() ?
On 6/17/22 12:33, Andrew Jones wrote: I don't see the change for commit 38f703663d4c as of an upstream pull right now $ git show 47700948a4ab:arch/arm64/kvm/mmu.c | grep -A4 'void kvm_flush_remote_tlbs' void kvm_flush_remote_tlbs(struct kvm *kvm) { ++kvm->stat.generic.remote_tlb_flush_requests; kvm_call_hyp(__kvm_tlb_flush_vmid, >arch.mmu); } and I do see it got dropped with merge commit e99314a340d2. $ git diff 419025b3b419 0d0a19395baa -- arch/arm64/kvm/mmu.c | grep -A5 'void kvm_flush_remote_tlbs' void kvm_flush_remote_tlbs(struct kvm *kvm) { + ++kvm->stat.generic.remote_tlb_flush_requests; kvm_call_hyp(__kvm_tlb_flush_vmid, >arch.mmu); - ++kvm->stat.generic.remote_tlb_flush; } Hi, on ARM it makes little sense to split remote_tlb_flush_requests and remote_tlb_flush. On x86 the latter means "a vmexit was forced in order to flush the TLB", and in fact this common code: if (!kvm_arch_flush_remote_tlb(kvm) || kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH)) ++kvm->stat.generic.remote_tlb_flush; should probably be written if (!kvm_arch_flush_remote_tlb(kvm)) return; if (kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH)) ++kvm->stat.generic.remote_tlb_flush; Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH] selftests: KVM: Handle compiler optimizations in ucall
On 6/17/22 09:28, Andrew Jones wrote: On Thu, Jun 16, 2022 at 09:54:16PM +, David Laight wrote: From: oliver.up...@linux.dev Sent: 16 June 2022 19:45 June 16, 2022 11:48 AM, "David Laight" wrote: No wonder I was confused. It's not surprising the compiler optimises it all away. It doesn't seem right to be 'abusing' WRITE_ONCE() here. Just adding barrier() should be enough and much more descriptive. I had the same thought, although I do not believe barrier() is sufficient on its own. barrier_data() with a pointer to uc passed through is required to keep clang from eliminating the dead store. A barrier() (full memory clobber) ought to be stronger than the partial one than barrier_data() generates. I can't quite decide whether you need a barrier() both sides of the 'magic write'. Plausibly the compiler could discard the on-stack data after the barrier() and before the 'magic write'. Certainly putting the 'magic write' inside a asm block that has a memory clobber is a more correct solution. Indeed, since the magic write is actually a guest MMIO write, then it should be using writeq(). It doesn't need to use writeq() because no special precautions are needed with respect to cacheability or instruction reordering (as is the case with hardware registers). WRITE_ONCE is okay, especially since the code never reads it (and if it did it would also use READ_ONCE). Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH] selftests: KVM: Handle compiler optimizations in ucall
Queued, thanks. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [GIT PULL] KVM/arm64 fixes for 5.19, take #1
On 6/9/22 16:17, Marc Zyngier wrote: git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git tags/kvmarm-fixes-5.19-1 Pulled, thanks. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [GIT PULL] KVM/arm64 updates for 5.19
On 5/19/22 11:27, Marc Zyngier wrote: Hi Paolo, Here's the bulk of the KVM/arm64 updates for 5.19. Major features are guard pages for the EL2 stacks, save/restore of the guest-visible hypercall configuration and PSCI suspend support. Further details in the tag description. Note that this PR contains a shared branch with the arm64 tree containing the SME patches to resolve conflicts with the WFxT support branch. Pulled, thanks. I solved the conflict for KVM_EXIT_SYSTEM_EVENT values in your favor. Paolo Please pull, M. The following changes since commit 672c0c5173427e6b3e2a9bbb7be51ceeec78093a: Linux 5.18-rc5 (2022-05-01 13:57:58 -0700) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git tags/kvmarm-5.19 for you to fetch changes up to 5c0ad551e9aa6188f2bda0977c1cb6768a2b74ef: Merge branch kvm-arm64/its-save-restore-fixes-5.19 into kvmarm-master/next (2022-05-16 17:48:36 +0100) KVM/arm64 updates for 5.19 - Add support for the ARMv8.6 WFxT extension - Guard pages for the EL2 stacks - Trap and emulate AArch32 ID registers to hide unsupported features - Ability to select and save/restore the set of hypercalls exposed to the guest - Support for PSCI-initiated suspend in collaboration with userspace - GICv3 register-based LPI invalidation support - Move host PMU event merging into the vcpu data structure - GICv3 ITS save/restore fixes - The usual set of small-scale cleanups and fixes Alexandru Elisei (3): KVM: arm64: Hide AArch32 PMU registers when not available KVM: arm64: Don't BUG_ON() if emulated register table is unsorted KVM: arm64: Print emulated register table name when it is unsorted Ard Biesheuvel (1): KVM: arm64: Avoid unnecessary absolute addressing via literals Fuad Tabba (4): KVM: arm64: Wrapper for getting pmu_events KVM: arm64: Repack struct kvm_pmu to reduce size KVM: arm64: Pass pmu events to hyp via vcpu KVM: arm64: Reenable pmu in Protected Mode Kalesh Singh (6): KVM: arm64: Introduce hyp_alloc_private_va_range() KVM: arm64: Introduce pkvm_alloc_private_va_range() KVM: arm64: Add guard pages for KVM nVHE hypervisor stack KVM: arm64: Add guard pages for pKVM (protected nVHE) hypervisor stack KVM: arm64: Detect and handle hypervisor stack overflows KVM: arm64: Symbolize the nVHE HYP addresses Marc Zyngier (30): arm64: Expand ESR_ELx_WFx_ISS_TI to match its ARMv8.7 definition arm64: Add RV and RN fields for ESR_ELx_WFx_ISS arm64: Add HWCAP advertising FEAT_WFXT arm64: Add wfet()/wfit() helpers arm64: Use WFxT for __delay() when possible KVM: arm64: Simplify kvm_cpu_has_pending_timer() KVM: arm64: Introduce kvm_counter_compute_delta() helper KVM: arm64: Handle blocking WFIT instruction KVM: arm64: Offer early resume for non-blocking WFxT instructions KVM: arm64: Expose the WFXT feature to guests KVM: arm64: Fix new instances of 32bit ESRs Merge remote-tracking branch 'arm64/for-next/sme' into kvmarm-master/next Merge branch kvm-arm64/wfxt into kvmarm-master/next Merge branch kvm-arm64/hyp-stack-guard into kvmarm-master/next Merge branch kvm-arm64/aarch32-idreg-trap into kvmarm-master/next Documentation: Fix index.rst after psci.rst renaming irqchip/gic-v3: Exposes bit values for GICR_CTLR.{IR, CES} KVM: arm64: vgic-v3: Expose GICR_CTLR.RWP when disabling LPIs KVM: arm64: vgic-v3: Implement MMIO-based LPI invalidation KVM: arm64: vgic-v3: Advertise GICR_CTLR.{IR, CES} as a new GICD_IIDR revision KVM: arm64: vgic-v3: List M1 Pro/Max as requiring the SEIS workaround KVM: arm64: Hide KVM_REG_ARM_*_BMAP_BIT_COUNT from userspace KVM: arm64: pmu: Restore compilation when HW_PERF_EVENTS isn't selected KVM: arm64: Fix hypercall bitmap writeback when vcpus have already run Merge branch kvm-arm64/hcall-selection into kvmarm-master/next Merge branch kvm-arm64/psci-suspend into kvmarm-master/next Merge branch kvm-arm64/vgic-invlpir into kvmarm-master/next Merge branch kvm-arm64/per-vcpu-host-pmu-data into kvmarm-master/next Merge branch kvm-arm64/misc-5.19 into kvmarm-master/next Merge branch kvm-arm64/its-save-restore-fixes-5.19 into kvmarm-master/next Mark Brown (25): arm64/sme: Provide ABI documentation for SME arm64/sme: System register and exception syndrome definitions arm64/sme: Manually encode SME instructions arm64/sme: Early CPU setup for SME arm64/sme: Basic enumeration support arm64/sme: Identify supported SME vector lengths at boot arm64/sme: Implement sysctl to set the default vector length arm64/sme:
[PATCH] Documentation: kvm: reorder ARM-specific section about KVM_SYSTEM_EVENT_SUSPEND
Signed-off-by: Paolo Bonzini --- Documentation/virt/kvm/api.rst | 52 +- 1 file changed, 26 insertions(+), 26 deletions(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 3201933e52d9..6890c04ccde2 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -6170,32 +6170,6 @@ Valid values for 'type' are: - KVM_SYSTEM_EVENT_SUSPEND -- the guest has requested a suspension of the VM. -For arm/arm64: --- - - KVM_SYSTEM_EVENT_SUSPEND exits are enabled with the - KVM_CAP_ARM_SYSTEM_SUSPEND VM capability. If a guest invokes the PSCI - SYSTEM_SUSPEND function, KVM will exit to userspace with this event - type. - - It is the sole responsibility of userspace to implement the PSCI - SYSTEM_SUSPEND call according to ARM DEN0022D.b 5.19 "SYSTEM_SUSPEND". - KVM does not change the vCPU's state before exiting to userspace, so - the call parameters are left in-place in the vCPU registers. - - Userspace is _required_ to take action for such an exit. It must - either: - -- Honor the guest request to suspend the VM. Userspace can request - in-kernel emulation of suspension by setting the calling vCPU's - state to KVM_MP_STATE_SUSPENDED. Userspace must configure the vCPU's - state according to the parameters passed to the PSCI function when - the calling vCPU is resumed. See ARM DEN0022D.b 5.19.1 "Intended use" - for details on the function parameters. - -- Deny the guest request to suspend the VM. See ARM DEN0022D.b 5.19.2 - "Caller responsibilities" for possible return values. - If KVM_CAP_SYSTEM_EVENT_DATA is present, the 'data' field can contain architecture specific information for the system-level event. Only the first `ndata` items (possibly zero) of the data array are valid. @@ -6211,6 +6185,32 @@ Previous versions of Linux defined a `flags` member in this struct. The field is now aliased to `data[0]`. Userspace can assume that it is only written if ndata is greater than 0. +For arm/arm64: +-- + +KVM_SYSTEM_EVENT_SUSPEND exits are enabled with the +KVM_CAP_ARM_SYSTEM_SUSPEND VM capability. If a guest invokes the PSCI +SYSTEM_SUSPEND function, KVM will exit to userspace with this event +type. + +It is the sole responsibility of userspace to implement the PSCI +SYSTEM_SUSPEND call according to ARM DEN0022D.b 5.19 "SYSTEM_SUSPEND". +KVM does not change the vCPU's state before exiting to userspace, so +the call parameters are left in-place in the vCPU registers. + +Userspace is _required_ to take action for such an exit. It must +either: + + - Honor the guest request to suspend the VM. Userspace can request + in-kernel emulation of suspension by setting the calling vCPU's + state to KVM_MP_STATE_SUSPENDED. Userspace must configure the vCPU's + state according to the parameters passed to the PSCI function when + the calling vCPU is resumed. See ARM DEN0022D.b 5.19.1 "Intended use" + for details on the function parameters. + + - Deny the guest request to suspend the VM. See ARM DEN0022D.b 5.19.2 + "Caller responsibilities" for possible return values. + :: /* KVM_EXIT_IOAPIC_EOI */ -- 2.31.1 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [GIT PULL] KVM/arm64 fixes for 5.18, take #3
On 5/16/22 14:51, Marc Zyngier wrote: Paolo, Here's the third (and hopefully final) set of fixes for 5.18. Two rather simple patches: one addressing a userspace-visible change in behaviour with GICv3, and a fix for pKVM in combination with CPUs affected by Spectre-v3a. Please pull, M. The following changes since commit 85ea6b1ec915c9dd90caf3674b203999d8c7e062: KVM: arm64: Inject exception on out-of-IPA-range translation fault (2022-04-27 23:02:23 +0100) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git tags/kvmarm-fixes-5.18-3 for you to fetch changes up to 2e40316753ee552fb598e8da8ca0d20a04e67453: KVM: arm64: Don't hypercall before EL2 init (2022-05-15 12:14:14 +0100) KVM/arm64 fixes for 5.18, take #3 - Correctly expose GICv3 support even if no irqchip is created so that userspace doesn't observe it changing pointlessly (fixing a regression with QEMU) - Don't issue a hypercall to set the id-mapped vectors when protected mode is enabled Marc Zyngier (1): KVM: arm64: vgic-v3: Consistently populate ID_AA64PFR0_EL1.GIC Quentin Perret (1): KVM: arm64: Don't hypercall before EL2 init arch/arm64/kvm/arm.c | 3 ++- arch/arm64/kvm/sys_regs.c | 3 +-- 2 files changed, 3 insertions(+), 3 deletions(-) Pulled, thanks. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH] selftests: kvm: replace ternary operator with min()
On 5/11/22 14:05, Guo Zhengkui wrote: Fix the following coccicheck warnings: tools/testing/selftests/kvm/lib/s390x/ucall.c:25:15-17: WARNING opportunity for min() tools/testing/selftests/kvm/lib/x86_64/ucall.c:27:15-17: WARNING opportunity for min() tools/testing/selftests/kvm/lib/riscv/ucall.c:56:15-17: WARNING opportunity for min() tools/testing/selftests/kvm/lib/aarch64/ucall.c:82:15-17: WARNING opportunity for min() tools/testing/selftests/kvm/lib/aarch64/ucall.c:55:20-21: WARNING opportunity for min() min() is defined in tools/include/linux/kernel.h. Signed-off-by: Guo Zhengkui --- tools/testing/selftests/kvm/lib/aarch64/ucall.c | 4 ++-- tools/testing/selftests/kvm/lib/riscv/ucall.c | 2 +- tools/testing/selftests/kvm/lib/s390x/ucall.c | 2 +- tools/testing/selftests/kvm/lib/x86_64/ucall.c | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/kvm/lib/aarch64/ucall.c b/tools/testing/selftests/kvm/lib/aarch64/ucall.c index e0b0164e9af8..00be3ef195ca 100644 --- a/tools/testing/selftests/kvm/lib/aarch64/ucall.c +++ b/tools/testing/selftests/kvm/lib/aarch64/ucall.c @@ -52,7 +52,7 @@ void ucall_init(struct kvm_vm *vm, void *arg) * lower and won't match physical addresses. */ bits = vm->va_bits - 1; - bits = vm->pa_bits < bits ? vm->pa_bits : bits; + bits = min(vm->pa_bits, bits); end = 1ul << bits; start = end * 5 / 8; step = end / 16; @@ -79,7 +79,7 @@ void ucall(uint64_t cmd, int nargs, ...) va_list va; int i; - nargs = nargs <= UCALL_MAX_ARGS ? nargs : UCALL_MAX_ARGS; + nargs = min(nargs, UCALL_MAX_ARGS); va_start(va, nargs); for (i = 0; i < nargs; ++i) diff --git a/tools/testing/selftests/kvm/lib/riscv/ucall.c b/tools/testing/selftests/kvm/lib/riscv/ucall.c index 9e42d8248fa6..34f16fe70ce8 100644 --- a/tools/testing/selftests/kvm/lib/riscv/ucall.c +++ b/tools/testing/selftests/kvm/lib/riscv/ucall.c @@ -53,7 +53,7 @@ void ucall(uint64_t cmd, int nargs, ...) va_list va; int i; - nargs = nargs <= UCALL_MAX_ARGS ? nargs : UCALL_MAX_ARGS; + nargs = min(nargs, UCALL_MAX_ARGS); va_start(va, nargs); for (i = 0; i < nargs; ++i) diff --git a/tools/testing/selftests/kvm/lib/s390x/ucall.c b/tools/testing/selftests/kvm/lib/s390x/ucall.c index 9d3b0f15249a..665267c1135d 100644 --- a/tools/testing/selftests/kvm/lib/s390x/ucall.c +++ b/tools/testing/selftests/kvm/lib/s390x/ucall.c @@ -22,7 +22,7 @@ void ucall(uint64_t cmd, int nargs, ...) va_list va; int i; - nargs = nargs <= UCALL_MAX_ARGS ? nargs : UCALL_MAX_ARGS; + nargs = min(nargs, UCALL_MAX_ARGS); va_start(va, nargs); for (i = 0; i < nargs; ++i) diff --git a/tools/testing/selftests/kvm/lib/x86_64/ucall.c b/tools/testing/selftests/kvm/lib/x86_64/ucall.c index a3489973e290..2ea31a0ebe30 100644 --- a/tools/testing/selftests/kvm/lib/x86_64/ucall.c +++ b/tools/testing/selftests/kvm/lib/x86_64/ucall.c @@ -24,7 +24,7 @@ void ucall(uint64_t cmd, int nargs, ...) va_list va; int i; - nargs = nargs <= UCALL_MAX_ARGS ? nargs : UCALL_MAX_ARGS; + nargs = min(nargs, UCALL_MAX_ARGS); va_start(va, nargs); for (i = 0; i < nargs; ++i) Queued, thanks. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH] selftests: KVM: aarch64: Let hypercalls use UAPI *_BIT_COUNT
On 5/5/22 14:04, Marc Zyngier wrote: diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h index e523bb6eac67..3cde9f958eee 100644 --- a/arch/arm64/include/uapi/asm/kvm.h +++ b/arch/arm64/include/uapi/asm/kvm.h @@ -342,6 +342,10 @@ struct kvm_arm_copy_mte_tags { enum { KVM_REG_ARM_STD_BIT_TRNG_V1_0 = 0, + /* +* KVM_REG_ARM_STD_BMAP_BIT_COUNT will vary as new services +* are added, and is explicitely not part of the ABI. +*/ KVM_REG_ARM_STD_BMAP_BIT_COUNT, }; That seems like a bad idea. Perhaps you can wrap it in #ifdef __KERNEL_OR_SELFTESTS__? I can't find any prior art. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [GIT PULL] KVM/arm64 fixes for 5.18, take #2
On 4/29/22 17:36, Marc Zyngier wrote: Paolo, Here's a trio of fixes for 5.18. Nothing terribly interesting, but nonetheless important fixes (two of the bugs are related to AArch32). Cool, will pull soon. Please take a quick look at the flags->data ABI fix, it's one patch on top of 5.18 as you requested and if I hear nothing I'll send it ~Sunday morning to Linus. Thanks, Paolo Please pull, M. The following changes since commit 21db83846683d3987666505a3ec38f367708199a: selftests: KVM: Free the GIC FD when cleaning up in arch_timer (2022-04-07 08:46:13 +0100) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git tags/kvmarm-fixes-5.18-2 for you to fetch changes up to 85ea6b1ec915c9dd90caf3674b203999d8c7e062: KVM: arm64: Inject exception on out-of-IPA-range translation fault (2022-04-27 23:02:23 +0100) KVM/arm64 fixes for 5.18, take #2 - Take care of faults occuring between the PARange and IPA range by injecting an exception - Fix S2 faults taken from a host EL0 in protected mode - Work around Oops caused by a PMU access from a 32bit guest when PMU has been created. This is a temporary bodge until we fix it for good. Alexandru Elisei (1): KVM/arm64: Don't emulate a PMU for 32-bit guests if feature not set Marc Zyngier (1): KVM: arm64: Inject exception on out-of-IPA-range translation fault Will Deacon (1): KVM: arm64: Handle host stage-2 faults from 32-bit EL0 arch/arm64/include/asm/kvm_emulate.h | 1 + arch/arm64/kvm/hyp/nvhe/host.S | 18 +- arch/arm64/kvm/inject_fault.c| 28 arch/arm64/kvm/mmu.c | 19 +++ arch/arm64/kvm/pmu-emul.c| 23 ++- 5 files changed, 79 insertions(+), 10 deletions(-) ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [RFC PATCH 00/17] KVM: arm64: Parallelize stage 2 fault handling
On 4/21/22 18:30, Ben Gardon wrote: Completely agree. I'm surprised that ARM doesn't have a need for a metadata structure associated with each page of the stage 2 paging structure, but if you don't need it, that definitely makes things simpler. The uses of struct kvm_mmu_page in the TDP MMU are all relatively new, for the work_struct and the roots' reference count. sp->ptep is only used to in a very specific path, kvm_recover_nx_lpages. I wouldn't be surprised if ARM grows more metadata later, but in fact it's not _that_ surprising that it doesn't need it yet! Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm