Re: [PATCH v2 00/50] KVM: Rework kvm_init() and hardware enabling

2022-12-29 Thread Paolo Bonzini

On 12/28/22 12:22, Marc Zyngier wrote:



Queued, thanks.  I will leave this in kvm/queue after testing everything
else and moving it to kvm/next; this way, we can wait for test results
on other architectures.


Can you please make this a topic branch, and if possible based
on a released -rc? It would make it a lot easier for everyone.


This is now refs/heads/kvm-hw-enable-refactor in 
https://git.kernel.org/pub/scm/virt/kvm/kvm.git.


The commits for this series start at hash fc471e831016.

Paolo

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v2 00/50] KVM: Rework kvm_init() and hardware enabling

2022-12-28 Thread Paolo Bonzini

On 12/28/22 12:22, Marc Zyngier wrote:



Queued, thanks.  I will leave this in kvm/queue after testing everything
else and moving it to kvm/next; this way, we can wait for test results
on other architectures.


Can you please make this a topic branch, and if possible based
on a released -rc? It would make it a lot easier for everyone.


Yes, I will (it will be based on 6.2-rc1 + pull request for rc2 that I'm 
preparing + x86 changes that this conflicts with).


Paolo



Re: [PATCH v2 00/50] KVM: Rework kvm_init() and hardware enabling

2022-12-27 Thread Paolo Bonzini
Queued, thanks.  I will leave this in kvm/queue after testing everything
else and moving it to kvm/next; this way, we can wait for test results
on other architectures.

Paolo




Re: [PATCH 12/14] KVM: selftests: Use wildcards to find library source files

2022-12-24 Thread Paolo Bonzini

On 12/13/22 01:16, Sean Christopherson wrote:

Use $(wildcard ...) to find the library source files instead of manually
defining the inputs, which is a maintenance burden and error prone.


No, please don't.  This leads to weird errors, for example when "git 
checkout" is interrupted with ^C.   I have applied patches 1-11.


Paolo



Re: [PATCH 10/14] KVM: selftests: Include lib.mk before consuming $(CC)

2022-12-24 Thread Paolo Bonzini

On 12/13/22 01:16, Sean Christopherson wrote:

Include lib.mk before consuming $(CC) and document that lib.mk overwrites
$(CC) unless make was invoked with -e or $(CC) was specified after make
(which apparently makes the environment override the Makefile?!?!).


Yes, it does.  In projects that don't use configure or similar, you
might have seen

CFLAGS = -O2 -g

being overridden with "make CFLAGS=-g" when optimization is undesirable.

Paolo


Including lib.mk after using it for probing, e.g. for -no-pie, can lead
to weirdness.




Re: [PATCH 09/14] KVM: selftests: Explicitly disable builtins for mem*() overrides

2022-12-24 Thread Paolo Bonzini

On 12/13/22 01:16, Sean Christopherson wrote:

Explicitly disable the compiler's builtin memcmp(), memcpy(), and
memset().  Because only lib/string_override.c is built with -ffreestanding,
the compiler reserves the right to do what it wants and can try to link the
non-freestanding code to its own crud.

   /usr/bin/x86_64-linux-gnu-ld: /lib/x86_64-linux-gnu/libc.a(memcmp.o): in function `memcmp_ifunc':
   (.text+0x0): multiple definition of `memcmp'; tools/testing/selftests/kvm/lib/string_override.o:tools/testing/selftests/kvm/lib/string_override.c:15: first defined here
   clang: error: linker command failed with exit code 1 (use -v to see invocation)


Hmm, that's weird though.  I think it's an effect of ifunc and maybe 
even a linker bug.  The patch makes sense anyway.


Paolo



Re: [PATCH 06/14] KVM: selftests: Rename UNAME_M to ARCH_DIR, fill explicitly for x86

2022-12-24 Thread Paolo Bonzini

On 12/13/22 01:16, Sean Christopherson wrote:

-ifeq ($(ARCH),riscv)
-   UNAME_M := riscv
+ifeq ($(ARCH),x86)
+   ARCH_DIR := x86_64
+else ifeq ($(ARCH),arm64)
+   ARCH_DIR := aarch64
+else ifeq ($(ARCH),s390)
+   ARCH_DIR := s390x
+else ifeq ($(ARCH),riscv)
+   ARCH_DIR := riscv
+else
+$(error Unknown architecture '$(ARCH)')
  endif


$(error) would break compiling via tools/testing/selftests/Makefile, so 
I am squashing this:


diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile

index d761a77c3a80..59f3eb53c932 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -13,10 +13,8 @@ else ifeq ($(ARCH),arm64)
ARCH_DIR := aarch64
 else ifeq ($(ARCH),s390)
ARCH_DIR := s390x
-else ifeq ($(ARCH),riscv)
-   ARCH_DIR := riscv
 else
-$(error Unknown architecture '$(ARCH)')
+   ARCH_DIR := $(ARCH)
 endif

 LIBKVM += lib/assert.c

Then the aarch64 and s390x directories can be renamed---x86 too, but the 
ifeq needs to stay (just changed to do x86_64->x86 instead of the other 
way round).


Paolo



Re: [RFC PATCH 00/37] KVM: Refactor the KVM/x86 TDP MMU into common code

2022-12-12 Thread Paolo Bonzini

On 12/13/22 00:26, Sean Christopherson wrote:

I would strongly be in favor of discarding the shadow paging residue if
x86 folks are willing to part ways with it 

Yes, absolutely.  Something like to_shadow_page->to_mmu_data, sp->md,
spt->hpt, spte->spte, sptep->hptep.

"host" will be confusing if when the primary MMU is involved though, e.g. I 
always
think of the primary MMU's page tables as the "host page tables".

What if we keep the "S" for SPT(E) and repurpose it to mean Secondary PTE?


Makes sense, so just to_shadow_page->to_secmmu_page?

Paolo



Re: [RFC PATCH 04/37] KVM: x86/mmu: Invert sp->tdp_mmu_page to sp->shadow_mmu_page

2022-12-12 Thread Paolo Bonzini

On 12/8/22 20:38, David Matlack wrote:

Invert the meaning of sp->tdp_mmu_page and rename it accordingly. This
allows the TDP MMU code to not care about this field, which will be used
in a subsequent commit to move the TDP MMU to common code.

No functional change intended.


Let's use a bit of the role instead.
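
For illustration, the check would then look something like this (just a
sketch, assuming a new bit is added to kvm_mmu_page_role; the field name is
hypothetical, not the actual patch):

static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp)
{
	/* The role bit replaces the separate tdp_mmu_page/shadow_mmu_page bool. */
	return sp->role.tdp_mmu;
}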

Paolo


Signed-off-by: David Matlack 
---
  arch/x86/kvm/mmu/mmu.c  | 1 +
  arch/x86/kvm/mmu/mmu_internal.h | 2 +-
  arch/x86/kvm/mmu/tdp_mmu.c  | 3 ---
  arch/x86/kvm/mmu/tdp_mmu.h  | 5 -
  4 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 355548603960..f7668a32721d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2180,6 +2180,7 @@ static struct kvm_mmu_page 
*kvm_mmu_alloc_shadow_page(struct kvm *kvm,
  
  	sp->gfn = gfn;

sp->role = role;
+   sp->shadow_mmu_page = true;
hlist_add_head(&sp->hash_link, sp_list);
if (sp_has_gptes(sp))
account_shadowed(kvm, sp);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index e32379c5b1ad..c1a379fba24d 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -52,7 +52,7 @@ struct kvm_mmu_page {
struct list_head link;
struct hlist_node hash_link;
  
-	bool tdp_mmu_page;

+   bool shadow_mmu_page;
bool unsync;
u8 mmu_valid_gen;
  
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c

index 7ccac1aa8df6..fc0b87ceb1ea 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -133,8 +133,6 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct 
kvm_mmu_page *root,
if (!refcount_dec_and_test(>tdp_mmu_root_count))
return;
  
-	WARN_ON(!is_tdp_mmu_page(root));

-
/*
 * The root now has refcount=0.  It is valid, but readers already
 * cannot acquire a reference to it because kvm_tdp_mmu_get_root()
@@ -279,7 +277,6 @@ static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, 
tdp_ptep_t sptep,
sp->role = role;
sp->gfn = gfn;
sp->ptep = sptep;
-   sp->tdp_mmu_page = true;
  
  	trace_kvm_mmu_get_page(sp, true);

  }
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 0a63b1afabd3..18d3719f14ea 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -71,7 +71,10 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu 
*vcpu, u64 addr,
u64 *spte);
  
  #ifdef CONFIG_X86_64

-static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return 
sp->tdp_mmu_page; }
+static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp)
+{
+   return !sp->shadow_mmu_page;
+}
  #else
  static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
  #endif




Re: [RFC PATCH 02/37] KVM: MMU: Move struct kvm_mmu_page_role into common code

2022-12-12 Thread Paolo Bonzini

On 12/8/22 20:38, David Matlack wrote:

+/*
+ * kvm_mmu_page_role tracks the properties of a shadow page (where shadow page
+ * also includes TDP pages) to determine whether or not a page can be used in
+ * the given MMU context.
+ */
+union kvm_mmu_page_role {
+   u32 word;
+   struct {
+   struct {
+   /* The address space ID mapped by the page. */
+   u16 as_id:8;
+
+   /* The level of the page in the page table hierarchy. */
+   u16 level:4;
+
+   /* Whether the page is invalid, i.e. pending 
destruction. */
+   u16 invalid:1;
+   };
+
+   /* Architecture-specific properties. */
+   struct kvm_mmu_page_role_arch arch;
+   };
+};
+


Have you considered adding a tdp_mmu:1 field to the arch-independent 
part?  I think that as long as _that_ field is the same, there's no need 
to have any overlap between TDP MMU and shadow MMU roles.


I'm not even sure if the x86 TDP MMU needs _any_ other role bit.  It 
needs of course the above three, and it also needs "direct" but it is 
used exactly to mean "is this a TDP MMU page".  So we could have


union {
	struct {
		u32 tdp_mmu:1;
		u32 invalid:1;
		u32 :6;
		u32 level:8;
		u32 arch:8;
		u32 :8;
	} tdp;
	/* the first field must be "u32 tdp_mmu:1;" */
	struct kvm_mmu_page_role_arch shadow;
};

Paolo



Re: [RFC PATCH 00/37] KVM: Refactor the KVM/x86 TDP MMU into common code

2022-12-12 Thread Paolo Bonzini

On 12/9/22 20:07, Oliver Upton wrote:

  - Naming. This series does not change the names of any existing code.
So all the KVM/x86 Shadow MMU-style terminology like
"shadow_page"/"sp"/"spte" persists. Should we keep that style in
common code or move toward something less shadow-paging-specific?
e.g. "page_table"/"pt"/"pte".


I would strongly be in favor of discarding the shadow paging residue if
x86 folks are willing to part ways with it 


Yes, absolutely.  Something like to_shadow_page->to_mmu_data, sp->md, 
spt->hpt, spte->spte, sptep->hptep.


Paolo



Re: [RFC PATCH 01/37] KVM: x86/mmu: Store the address space ID directly in kvm_mmu_page_role

2022-12-12 Thread Paolo Bonzini

On 12/12/22 18:39, Sean Christopherson wrote:

The notion of address spaces is already existing architecture-neutral
concept in KVM (e.g. see uses of KVM_ADDRESS_SPACE_NUM in
virt/kvm/kvm_main.c), although SMM is the only use-case I'm aware of.


Yes, SMM is currently the only use-case.


It's possible that in the future Hyper-V VTLs will also have per-level 
protections.  It wouldn't use as_id, but it would likely be recorded in 
the upper byte of the role.


I'm not sure if Microsoft intends to port those to ARM as well.


My preference would be to leave .smm in x86's page role


What about defining a byte of arch_role and a macro to build it?
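
For illustration, something along these lines (hypothetical field and macro
names; this is only a sketch of the layout, not an actual patch):

union kvm_mmu_page_role {
	u32 word;
	struct {
		u16 as_id:8;
		u16 level:4;
		u16 invalid:1;
		u16 :3;
		/* Architecture-defined bits, e.g. SMM (or a Hyper-V VTL) on x86. */
		u8 arch_role;
		u8 :8;
	};
};

/* x86 could then build its byte from the bits it cares about: */
#define KVM_MMU_ARCH_ROLE(smm)	((u8)!!(smm))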



Unless I'm missing something, e.g. a need to map GPAs differently for
SMM vs. non-SMM, SMM could have been implemented with a simple flag
in a memslot to mark the memslot as SMM-only.


Unfortunately not, because there can be another region (for example
video RAM at 0xa0000) underneath SMRAM.


In fact, KVM_MEM_X86_SMRAM was the first idea.  It was uglier than 
multiple address spaces 
(https://lore.kernel.org/lkml/1431084034-8425-1-git-send-email-pbonz...@redhat.com). 
 In retrospect there were probably ways to mix the best of the two 
designs, but it wasn't generic enough.



And separate address spaces become truly nasty if the CPU can access multiple
protected regions, e.g. if the CPU can access type X and type Y at the same 
time,
then there would need to be memslots for "regular", X, Y, and X+Y.


Without a usecase that's hard to say.  It's just as possible that there 
would be a natural hierarchy of levels.


Paolo



Re: [RFC PATCH 06/37] KVM: MMU: Move struct kvm_mmu_page to common code

2022-12-12 Thread Paolo Bonzini

On 12/8/22 20:38, David Matlack wrote:

This commit increases the size of struct kvm_mmu_page by 64 bytes on
x86_64 (184 bytes -> 248 bytes). The size of this struct can be reduced
in future commits by moving TDP MMU root fields into a separate struct
and by dynamically allocating fields only used by the Shadow MMU.


I think it's already possible to use a union like

union {
	struct kvm_mmu_page_arch arch;
	struct {
		struct work_struct work;
		void *data;
	};
};

Paolo



Re: [RFC PATCH 10/37] KVM: MMU: Move struct kvm_page_fault to common code

2022-12-12 Thread Paolo Bonzini

On 12/8/22 20:38, David Matlack wrote:

+
+   /* Derived from mmu and global state.  */
+   const bool is_tdp;


I think this could stay in the architecture-independent part.

Paolo



Re: [PATCH v2 7/7] KVM: selftests: Avoid infinite loop if ucall_alloc() fails

2022-12-12 Thread Paolo Bonzini

On 12/9/22 22:03, Sean Christopherson wrote:

From: Sean Christopherson
Date: Fri, 9 Dec 2022 12:55:44 -0800
Subject: [PATCH] KVM: selftests: Use magic value to signal ucall_alloc()
  failure

Use a magic value to signal a ucall_alloc() failure instead of simply
doing GUEST_ASSERT().  GUEST_ASSERT() relies on ucall_alloc() and so a
failure puts the guest into an infinite loop.

Use -1 as the magic value, as a real ucall struct should never wrap.

Reported-by: Oliver Upton
Signed-off-by: Sean Christopherson
---
  tools/testing/selftests/kvm/lib/ucall_common.c | 16 ++--
  1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/lib/ucall_common.c 
b/tools/testing/selftests/kvm/lib/ucall_common.c
index 0cc0971ce60e..2f0e2ea941cc 100644
--- a/tools/testing/selftests/kvm/lib/ucall_common.c
+++ b/tools/testing/selftests/kvm/lib/ucall_common.c
@@ -4,6 +4,8 @@
  #include "linux/bitmap.h"
  #include "linux/atomic.h"
  
+#define GUEST_UCALL_FAILED -1

+
  struct ucall_header {
DECLARE_BITMAP(in_use, KVM_MAX_VCPUS);
struct ucall ucalls[KVM_MAX_VCPUS];
@@ -41,7 +43,8 @@ static struct ucall *ucall_alloc(void)
struct ucall *uc;
int i;
  
-	GUEST_ASSERT(ucall_pool);

+   if (!ucall_pool)
+   goto ucall_failed;
  
  	for (i = 0; i < KVM_MAX_VCPUS; ++i) {

if (!test_and_set_bit(i, ucall_pool->in_use)) {
@@ -51,7 +54,13 @@ static struct ucall *ucall_alloc(void)
}
}
  
-	GUEST_ASSERT(0);

+ucall_failed:
+   /*
+* If the vCPU cannot grab a ucall structure, make a bare ucall with a
+* magic value to signal to get_ucall() that things went sideways.
+* GUEST_ASSERT() depends on ucall_alloc() and so cannot be used here.
+*/
+   ucall_arch_do_ucall(GUEST_UCALL_FAILED);
return NULL;
  }
  
@@ -93,6 +102,9 @@ uint64_t get_ucall(struct kvm_vcpu *vcpu, struct ucall *uc)
  
  	addr = ucall_arch_get_ucall(vcpu);

if (addr) {
+   TEST_ASSERT(addr != (void *)GUEST_UCALL_FAILED,
+   "Guest failed to allocate ucall struct");
+
memcpy(uc, addr, sizeof(*uc));
vcpu_run_complete_io(vcpu);
} else {

base-commit: dc2efbe4813e0dc4368779bc36c5f0e636cb8eb2
--


Queued, thanks.

Paolo



Re: [PATCH v2 0/7] KVM: selftests: Fixes for ucall pool + page_fault_test

2022-12-12 Thread Paolo Bonzini

On 12/9/22 02:52, Oliver Upton wrote:

The combination of the pool-based ucall implementation + page_fault_test
resulted in some 'fun' bugs. As has always been the case, KVM selftests
is a house of cards.

Small series to fix up the issues on kvm/queue. Patches 1-2 can probably
be squashed into Paolo's merge resolution, if desired.

Tested on Ampere Altra and a Skylake box, since there was a decent
amount of munging in architecture-generic code.

v1 -> v2:
  - Collect R-b from Sean (thanks!)
  - Use a common routine for split and contiguous VA spaces, with
commentary on why arm64 is different since we all get to look at it
now. (Sean)
  - Don't identity map the ucall MMIO hole
  - Fix an off-by-one issue in the accounting of virtual memory,
discovered in fighting with #2
  - Fix an infinite loop in ucall_alloc(), discovered fighting with the
ucall_init() v. kvm_vm_elf_load() ordering issue


Queued 3+5, thanks.

Paolo



Re: [PATCH v2 4/7] KVM: selftests: Correctly initialize the VA space for TTBR0_EL1

2022-12-12 Thread Paolo Bonzini

On 12/9/22 02:53, Oliver Upton wrote:

@@ -268,17 +305,17 @@ struct kvm_vm *vm_create(enum vm_guest_mode mode)
  #ifdef __aarch64__
if (vm->pa_bits != 40)
vm->type = KVM_VM_TYPE_ARM_IPA_SIZE(vm->pa_bits);
+
+   /* selftests use TTBR0 only, meaning there is a single VA region. */
+   vm->has_split_va_space = false;
+#else
+   vm->has_split_va_space = true;
  #endif
  


Even though there happens to be already a suitable #ifdef, I don't
really like them and don't think there should be a new bool unless
something else uses it.

However, the new comment is very useful, so I added it to kvm_util.c as
follows:

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 759a45540108..56d5ea949cbb 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -186,6 +186,15 @@ const struct vm_guest_mode_params vm_guest_mode_params[] = 
{
 _Static_assert(sizeof(vm_guest_mode_params)/sizeof(struct 
vm_guest_mode_params) == NUM_VM_MODES,
   "Missing new mode params?");
 
+/*

+ * Initializes vm->vpages_valid to match the canonical VA space of the
+ * architecture.
+ *
+ * The default implementation is valid for architectures which split the
+ * range addressed by a single page table into a low and high region
+ * based on the MSB of the VA. On architectures with this behavior
+ * the VA region spans [0, 2^(va_bits - 1)), [-(2^(va_bits - 1)), -1].
+ */
 __weak void vm_vaddr_populate_bitmap(struct kvm_vm *vm)
 {
sparsebit_set_num(vm->vpages_valid,



Re: [GIT PULL] KVM/arm64 updates for 6.2

2022-12-09 Thread Paolo Bonzini
On Fri, Dec 9, 2022 at 6:05 PM Oliver Upton  wrote:
> Mind dumping what I had in v1 and applying this instead?
>
> https://lore.kernel.org/kvm/20221209015307.1781352-1-oliver.up...@linux.dev/

Ouch, five minutes too late... I can take care of the difference but
it'll have to wait for Monday.

Paolo



Re: [GIT PULL] KVM/arm64 updates for 6.2

2022-12-09 Thread Paolo Bonzini

On 12/7/22 22:51, Oliver Upton wrote:


I haven't pushed to kvm/next yet to give you time to check, so the
merge is currently in kvm/queue only.

Have a look at this series, which gets things building and actually
passing again:

https://lore.kernel.org/kvm/20221207214809.489070-1-oliver.up...@linux.dev/


Thank you!  Squashed 1-2 and applied 3-4.

Paolo



Re: [GIT PULL] KVM/arm64 updates for 6.2

2022-12-09 Thread Paolo Bonzini

On 12/7/22 08:49, Marc Zyngier wrote:

On Tue, 06 Dec 2022 21:43:43 +,
Paolo Bonzini  wrote:


On 12/6/22 19:20, Mark Brown wrote:

I almost suggested doing that on multiple occasions this cycle, but ultimately
decided not to because it would effectively mean splitting series that touch KVM
and selftests into different trees, which would create a different kind of
dependency hell.  Or maybe a hybrid approach where series that only (or mostly?)
touch selftests go into a dedicated tree?


Some other subsystems do have a separate branch for kselftests.  One
fairly common occurrence is that the selftests branch ends up failing to
build independently because someone adds new ABI together with a
selftest but the patches adding the ABI don't end up on the same branch
as the tests which try to use them.  That is of course resolvable but
it's a common friction point.


Yeah, the right solution is simply to merge selftests changes
separately from the rest and use topic branches.


Don't know if this is what you have in mind, but I think that we
should use topic branches for *everything*. The only things for which
I don't use a separate branch are the odd drive-by patches, of the
spelling fix persuasion.


Yeah, I just wish we had better tools to manage them...

Paolo


That's what we do for arm64 and the IRQ subsystem. It is a bit more
involved at queuing time, but makes dropping series from -next
extremely easy, without affecting the history. And crucially, it gives
everyone a hint to base their stuff on a stable commit, not a random
"tip of kvm/queue as of three days ago".

M.





Re: [GIT PULL] KVM/arm64 updates for 6.2

2022-12-06 Thread Paolo Bonzini

On 12/6/22 19:20, Mark Brown wrote:

I almost suggested doing that on multiple occasions this cycle, but ultimately
decided not to because it would effectively mean splitting series that touch KVM
and selftests into different trees, which would create a different kind of
dependency hell.  Or maybe a hybrid approach where series that only (or mostly?)
touch selftests go into a dedicated tree?


Some other subsystems do have a separate branch for kselftests.  One
fairly common occurrence is that the selftests branch ends up failing to
build independently because someone adds new ABI together with a
selftest but the patches adding the ABI don't end up on the same branch
as the tests which try to use them.  That is of course resolvable but
it's a common friction point.


Yeah, the right solution is simply to merge selftests changes separately 
from the rest and use topic branches.


We will have more friction of this kind if we succeed in making more KVM 
code multi-architecture, so let's just treat selftests as the more 
innocuous drill...


Paolo



Re: [GIT PULL] KVM/arm64 updates for 6.2

2022-12-06 Thread Paolo Bonzini

On 12/5/22 16:58, Marc Zyngier wrote:

- There is a lot of selftest conflicts with your own branch, see:

   https://lore.kernel.org/r/20221201112432.4cb9a...@canb.auug.org.au
   https://lore.kernel.org/r/20221201113626.438f1...@canb.auug.org.au
   https://lore.kernel.org/r/20221201115741.7de32...@canb.auug.org.au
   https://lore.kernel.org/r/20221201120939.3c19f...@canb.auug.org.au
   https://lore.kernel.org/r/20221201131623.18ebc...@canb.auug.org.au

   for a rather exhaustive collection.


Yeah, I saw them in Stephen's messages but missed your reply.

In retrospect, at least Gavin's series for memslot_perf_test should have
been applied by both of us with a topic branch, but there's so many
conflicts all over the place that it's hard to single out one series.
It just happens.

The only conflict in non-x86 code is the following one, please check
if I got it right.

diff --git a/tools/testing/selftests/kvm/aarch64/page_fault_test.c 
b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
index 05bb6a6369c2..0cda70bef5d5 100644
--- a/tools/testing/selftests/kvm/aarch64/page_fault_test.c
+++ b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
@@ -609,6 +609,8 @@ static void setup_memslots(struct kvm_vm *vm, struct 
test_params *p)
data_size / guest_page_size,
p->test_desc->data_memslot_flags);
vm->memslots[MEM_REGION_TEST_DATA] = TEST_DATA_MEMSLOT;
+
+   ucall_init(vm, data_gpa + data_size);
 }
 
 static void setup_default_handlers(struct test_desc *test)

@@ -704,8 +706,6 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 
 	setup_gva_maps(vm);
 
-	ucall_init(vm, NULL);

-
reset_event_counts();
 
 	/*



Special care is needed here because the test uses vm_create().

I haven't pushed to kvm/next yet to give you time to check, so the
merge is currently in kvm/queue only.


- For the 6.3 cycle, we are going to experiment with Oliver taking
   care of most of the patch herding. I'm sure he'll do a great job,
   but if there is the odd mistake, please cut him some slack and blame
   me instead.


Absolutely - you both have all the slack you need, synchronization
is harder than it seems.

Paolo


Please pull,

M.

The following changes since commit f0c4d9fc9cc9462659728d168387191387e903cc:

   Linux 6.1-rc4 (2022-11-06 15:07:11 -0800)

are available in the Git repository at:

   git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git 
tags/kvmarm-6.2

for you to fetch changes up to 753d734f3f347e7fc49b819472bbf61dcfc1a16f:

   Merge remote-tracking branch 'arm64/for-next/sysregs' into 
kvmarm-master/next (2022-12-05 14:39:53 +)


KVM/arm64 updates for 6.2

- Enable the per-vcpu dirty-ring tracking mechanism, together with an
   option to keep the good old dirty log around for pages that are
   dirtied by something other than a vcpu.

- Switch to the relaxed parallel fault handling, using RCU to delay
   page table reclaim and giving better performance under load.

- Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
   option, which multi-process VMMs such as crosvm rely on.

- Merge the pKVM shadow vcpu state tracking that allows the hypervisor
   to have its own view of a vcpu, keeping that state private.

- Add support for the PMUv3p5 architecture revision, bringing support
   for 64bit counters on systems that support it, and fix the
   no-quite-compliant CHAIN-ed counter support for the machines that
   actually exist out there.

- Fix a handful of minor issues around 52bit VA/PA support (64kB pages
   only) as a prefix of the oncoming support for 4kB and 16kB pages.

- Add/Enable/Fix a bunch of selftests covering memslots, breakpoints,
   stage-2 faults and access tracking. You name it, we got it, we
   probably broke it.

- Pick a small set of documentation and spelling fixes, because no
   good merge window would be complete without those.

As a side effect, this tag also drags:

- The 'kvmarm-fixes-6.1-3' tag as a dependency to the dirty-ring
   series

- A shared branch with the arm64 tree that repaints all the system
   registers to match the ARM ARM's naming, and resulting in
   interesting conflicts


Anshuman Khandual (1):
   KVM: arm64: PMU: Replace version number '0' with 
ID_AA64DFR0_EL1_PMUVer_NI

Catalin Marinas (4):
   mm: Do not enable PG_arch_2 for all 64-bit architectures
   arm64: mte: Fix/clarify the PG_mte_tagged semantics
   KVM: arm64: Simplify the sanitise_mte_tags() logic
   arm64: mte: Lock a page for MTE tag initialisation

Fuad Tabba (3):
   KVM: arm64: Add hyp_spinlock_t static initializer
   KVM: arm64: Add infrastructure to create and track pKVM instances at EL2
   KVM: arm64: Instantiate pKVM hypervisor VM and vCPU structures from EL1

Gavin Shan (14):
   KVM: x86: 

Re: [PATCH 0/9] tools: Make {clear,set}_bit() atomic for reals

2022-12-02 Thread Paolo Bonzini

On 11/19/22 02:34, Sean Christopherson wrote:

For obvious reasons I'd like to route the this through Paolo's tree.
In theory, taking just patch 5 through tip would work, but creating a
topic branch seems like the way to go, though maybe I'm being overly
paranoid.  The current tip/perf/core doesn't have any conflicts, nor does
it have new set_bit() or clear_bit() users.

  
Code sitting in kvm/queue for 6.2 adds functionality that relies on

clear_bit() being an atomic operation.  Unfortunately, despite being
implemented in atomic.h (among other strong hits that they should be
atomic), clear_bit() and set_bit() aren't actually atomic (and of course
I realized this _just_ after everything got queued up).

Move current tools/ users of clear_bit() and set_bit() to the
double-underscore versions (which tools/ already provides and documents
as being non-atomic), and then implement clear_bit() and set_bit() as
actual atomics to fix the KVM selftests bug.

Perf and KVM are the only affected users.  NVDIMM also has test code
in tools/, but that builds against the kernel proper.  The KVM code is
well tested and fully audited.  The perf code is lightly tested; if I
understand the build system, it's probably not even fully compile tested.

Patches 1 and 2 are completely unrelated and are fixes for patches
sitting in kvm/queue.  Paolo, they can be squashed if you want to rewrite
history.

Patch 3 fixes a hilarious collision in a KVM ARM selftest that will arise
when clear_bit() is converted to an atomic.

Patch 4 changes clear_bit() and set_bit() to take an "unsigned long"
instead of an "int" so that patches 5-6 aren't accompanied by functional
changes.  I.e. if something in perf is somehow relying on "bit" being a
signed int, failures will bisect to patch 4 and not to the
supposed-to-be-a-nop conversion to __clear_bit() and __set_bit().

Patch 5-9 switch perf+KVM and complete the conversion.

Applies on:
   
   git://git.kernel.org/pub/scm/virt/kvm/kvm.git queue


Queued, thanks Namhyung for the ACK!

Paolo



Sean Christopherson (9):
   KVM: selftests: Add rdmsr_from_l2() implementation in Hyper-V eVMCS
 test
   KVM: selftests: Remove unused "vcpu" param to fix build error
   KVM: arm64: selftests: Enable single-step without a "full" ucall()
   tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers
   perf tools: Use dedicated non-atomic clear/set bit helpers
   KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests
   tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers
   tools: Drop "atomic_" prefix from atomic test_and_set_bit()
   tools: KVM: selftests: Convert clear/set_bit() to actual atomics

  tools/arch/x86/include/asm/atomic.h   |  6 +++-
  tools/include/asm-generic/atomic-gcc.h| 13 ++-
  tools/include/asm-generic/bitops/atomic.h | 15 
  tools/include/linux/bitmap.h  | 34 ---
  tools/perf/bench/find-bit-bench.c |  2 +-
  tools/perf/builtin-c2c.c  |  6 ++--
  tools/perf/builtin-kwork.c|  6 ++--
  tools/perf/builtin-record.c   |  6 ++--
  tools/perf/builtin-sched.c|  2 +-
  tools/perf/tests/bitmap.c |  2 +-
  tools/perf/tests/mem2node.c   |  2 +-
  tools/perf/util/affinity.c|  4 +--
  tools/perf/util/header.c  |  8 ++---
  tools/perf/util/mmap.c|  6 ++--
  tools/perf/util/pmu.c |  2 +-
  .../util/scripting-engines/trace-event-perl.c |  2 +-
  .../scripting-engines/trace-event-python.c|  2 +-
  tools/perf/util/session.c |  2 +-
  tools/perf/util/svghelper.c   |  2 +-
  .../selftests/kvm/aarch64/arch_timer.c|  2 +-
  .../selftests/kvm/aarch64/debug-exceptions.c  | 21 ++--
  tools/testing/selftests/kvm/dirty_log_test.c  | 34 +--
  .../selftests/kvm/include/ucall_common.h  |  8 +
  .../testing/selftests/kvm/lib/ucall_common.c  |  2 +-
  .../selftests/kvm/x86_64/hyperv_evmcs.c   | 13 +--
  .../selftests/kvm/x86_64/hyperv_svm_test.c|  4 +--
  .../selftests/kvm/x86_64/hyperv_tlb_flush.c   |  2 +-
  27 files changed, 102 insertions(+), 106 deletions(-)


base-commit: 3321eef4acb51c303f0598d8a8493ca58528a054




Re: [PATCH 33/44] KVM: x86: Do VMX/SVM support checks directly in vendor code

2022-11-04 Thread Paolo Bonzini

On 11/3/22 19:58, Sean Christopherson wrote:


diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 3e508f239098..ebe617ab0b37 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -191,6 +191,8 @@ static void default_init(struct cpuinfo_x86 *c)
 strcpy(c->x86_model_id, "386");
 }
  #endif
+
+   clear_cpu_cap(c, X86_FEATURE_MSR_IA32_FEAT_CTL);
  }
  
  static const struct cpu_dev default_cpu = {


Not needed I think?  default_init does not call init_ia32_feat_ctl.


diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index c881bcafba7d..3a7ae67f5a5e 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -72,6 +72,8 @@ static const struct cpuid_dep cpuid_deps[] = {
 { X86_FEATURE_AVX512_FP16,  X86_FEATURE_AVX512BW  },
 { X86_FEATURE_ENQCMD,   X86_FEATURE_XSAVES},
 { X86_FEATURE_PER_THREAD_MBA,   X86_FEATURE_MBA   },
+   { X86_FEATURE_VMX,  X86_FEATURE_MSR_IA32_FEAT_CTL   
  },
+   { X86_FEATURE_SGX,  X86_FEATURE_MSR_IA32_FEAT_CTL   
  },
 { X86_FEATURE_SGX_LC,   X86_FEATURE_SGX   },
 { X86_FEATURE_SGX1, X86_FEATURE_SGX   },
 { X86_FEATURE_SGX2, X86_FEATURE_SGX1  },



Yes, good idea.

Paolo



Re: [PATCH 00/44] KVM: Rework kvm_init() and hardware enabling

2022-11-04 Thread Paolo Bonzini

On 11/4/22 08:17, Isaku Yamahata wrote:

On Wed, Nov 02, 2022 at 11:18:27PM +,
Sean Christopherson  wrote:


Non-x86 folks, please test on hardware when possible.  I made a _lot_ of
mistakes when moving code around.  Thankfully, x86 was the trickiest code
to deal with, and I'm fairly confident that I found all the bugs I
introduced via testing.  But the number of mistakes I made and found on
x86 makes me more than a bit worried that I screwed something up in other
arch code.

This is a continuation of Chao's series to do x86 CPU compatibility checks
during virtualization hardware enabling[1], and of Isaku's series to try
and clean up the hardware enabling paths so that x86 (Intel specifically)
can temporarily enable hardware during module initialization without
causing undue pain for other architectures[2].  It also includes one patch
from another mini-series from Isaku that provides the less controversial
patches[3].

The main theme of this series is to kill off kvm_arch_init(),
kvm_arch_hardware_(un)setup(), and kvm_arch_check_processor_compat(), which
all originated in x86 code from way back when, and needlessly complicate
both common KVM code and architecture code.  E.g. many architectures don't
mark functions/data as __init/__ro_after_init purely because kvm_init()
isn't marked __init to support x86's separate vendor modules.

The idea/hope is that with those hooks gone (moved to arch code), it will
be easier for x86 (and other architectures) to modify their module init
sequences as needed without having to fight common KVM code.  E.g. I'm
hoping that ARM can build on this to simplify its hardware enabling logic,
especially the pKVM side of things.

There are bug fixes throughout this series.  They are more scattered than
I would usually prefer, but getting the sequencing correct was a gigantic
pain for many of the x86 fixes due to needing to fix common code in order
for the x86 fix to have any meaning.  And while the bugs are often fatal,
they aren't all that interesting for most users as they either require a
malicious admin or broken hardware, i.e. aren't likely to be encountered
by the vast majority of KVM users.  So unless someone _really_ wants a
particular fix isolated for backporting, I'm not planning on shuffling
patches.

Tested on x86.  Lightly tested on arm64.  Compile tested only on all other
architectures.


Thanks for the patch series. I the rebased TDX KVM patch series and it worked.
Since cpu offline needs to be rejected in some cases(To keep at least one cpu
on a package), arch hook for cpu offline is needed.
I can keep it in TDX KVM patch series.


Yes, this patch looks good.

Paolo


diff --git a/arch/x86/include/asm/kvm-x86-ops.h 
b/arch/x86/include/asm/kvm-x86-ops.h
index 23c0f4bc63f1..ef7bcb845d42 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -17,6 +17,7 @@ BUILD_BUG_ON(1)
  KVM_X86_OP(hardware_enable)
  KVM_X86_OP(hardware_disable)
  KVM_X86_OP(hardware_unsetup)
+KVM_X86_OP_OPTIONAL_RET0(offline_cpu)
  KVM_X86_OP(has_emulated_msr)
  KVM_X86_OP(vcpu_after_set_cpuid)
  KVM_X86_OP(is_vm_type_supported)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 496c7c6eaff9..c420409aa96f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1468,6 +1468,7 @@ struct kvm_x86_ops {
int (*hardware_enable)(void);
void (*hardware_disable)(void);
void (*hardware_unsetup)(void);
+   int (*offline_cpu)(void);
bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
  
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c

index 2ed5a017f7bc..17c5d6a76c93 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12039,6 +12039,11 @@ void kvm_arch_hardware_disable(void)
drop_user_return_notifiers();
  }
  
+int kvm_arch_offline_cpu(unsigned int cpu)

+{
+   return static_call(kvm_x86_offline_cpu)();
+}
+
  bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
  {
return vcpu->kvm->arch.bsp_vcpu_id == vcpu->vcpu_id;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 620489b9aa93..4df79443fd11 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1460,6 +1460,7 @@ static inline void kvm_create_vcpu_debugfs(struct 
kvm_vcpu *vcpu) {}
  int kvm_arch_hardware_enable(void);
  void kvm_arch_hardware_disable(void);
  #endif
+int kvm_arch_offline_cpu(unsigned int cpu);
  int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
  bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu);
  int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f6b6dcedaa0a..f770fdc662d0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5396,16 +5396,24 @@ static void hardware_disable_nolock(void *junk)
__this_cpu_write(hardware_enabled, false);
  }
  
+__weak int kvm_arch_offline_cpu(unsigned 

Re: [PATCH 33/44] KVM: x86: Do VMX/SVM support checks directly in vendor code

2022-11-03 Thread Paolo Bonzini

On 11/3/22 19:35, Sean Christopherson wrote:

It's technically required.  IA32_FEAT_CTL and thus KVM_INTEL depends on any of
CPU_SUP_{INTEL,CENATUR,ZHAOXIN}, but init_ia32_feat_ctl() is invoked if and only
if the actual CPU type matches one of the aforementioned CPU_SUP_*.

E.g. running a kernel built with

   CONFIG_CPU_SUP_INTEL=y
   CONFIG_CPU_SUP_AMD=y
   # CONFIG_CPU_SUP_HYGON is not set
   # CONFIG_CPU_SUP_CENTAUR is not set
   # CONFIG_CPU_SUP_ZHAOXIN is not set

on a Cenatur or Zhaoxin CPU will leave X86_FEATURE_VMX set but not set
X86_FEATURE_MSR_IA32_FEAT_CTL.  If VMX isn't enabled in MSR_IA32_FEAT_CTL, KVM
will get unexpected #UDs when trying to enable VMX.


Oh, I see.  Perhaps X86_FEATURE_VMX and X86_FEATURE_SGX should be moved 
to one of the software words instead of using cpuid.  Nothing that you 
should care about for this series though.


Paolo



Re: [PATCH 36/44] KVM: x86: Do compatibility checks when onlining CPU

2022-11-03 Thread Paolo Bonzini

On 11/3/22 18:44, Sean Christopherson wrote:

Do compatibility checks when enabling hardware to effectively add
compatibility checks when onlining a CPU.  Abort enabling, i.e. the
online process, if the (hotplugged) CPU is incompatible with the known
good setup.


This paragraph is not true with this patch being before "KVM: Rename and
move CPUHP_AP_KVM_STARTING to ONLINE section".


Argh, good eyes.  Getting the ordering correct in this series has been quite the
struggle.  Assuming there are no subtle dependencies between x86 and common KVM,
the ordering should be something like this:


It's not a problem to keep the ordering in this v1, just fix the commit 
message like "Do compatibility checks when enabling hardware to 
effectively add compatibility checks on CPU hotplug.  For now KVM is 
using a STARTING hook, which makes it impossible to abort the hotplug if 
the new CPU is incompatible with the known good setup; switching to an 
ONLINE hook will fix this."


Paolo


   KVM: Opt out of generic hardware enabling on s390 and PPC
   KVM: Register syscore (suspend/resume) ops early in kvm_init()
   KVM: x86: Do compatibility checks when onlining CPU
   KVM: SVM: Check for SVM support in CPU compatibility checks
   KVM: VMX: Shuffle support checks and hardware enabling code around
   KVM: x86: Do VMX/SVM support checks directly in vendor code
   KVM: x86: Unify pr_fmt to use module name for all KVM modules
   KVM: x86: Use KBUILD_MODNAME to specify vendor module name
   KVM: Make hardware_enable_failed a local variable in the "enable all" path
   KVM: Use a per-CPU variable to track which CPUs have enabled virtualization
   KVM: Remove on_each_cpu(hardware_disable_nolock) in kvm_exit()
   KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock
   KVM: Disable CPU hotplug during hardware enabling
   KVM: Rename and move CPUHP_AP_KVM_STARTING to ONLINE section
   KVM: Drop kvm_arch_check_processor_compat() hook





Re: [PATCH 00/44] KVM: Rework kvm_init() and hardware enabling

2022-11-03 Thread Paolo Bonzini

On 11/3/22 13:08, Christian Borntraeger wrote:

There are bug fixes throughout this series.  They are more scattered than
I would usually prefer, but getting the sequencing correct was a gigantic
pain for many of the x86 fixes due to needing to fix common code in order
for the x86 fix to have any meaning.  And while the bugs are often fatal,
they aren't all that interesting for most users as they either require a
malicious admin or broken hardware, i.e. aren't likely to be encountered
by the vast majority of KVM users.  So unless someone _really_ wants a
particular fix isolated for backporting, I'm not planning on shuffling
patches.

Tested on x86.  Lightly tested on arm64.  Compile tested only on all 
other architectures.


Some sniff tests seem to work ok on s390.


Thanks.  There are just a couple nits, and MIPS/PPC/RISC-V have very 
small changes.  Feel free to send me a pull request once Marc acks.


Paolo



Re: [PATCH 39/44] KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock

2022-11-03 Thread Paolo Bonzini

On 11/3/22 00:19, Sean Christopherson wrote:

+- kvm_lock is taken outside kvm->mmu_lock


Not surprising since one is a mutex and one is an rwlock. :)  You can 
drop this hunk as well as the "Opportunistically update KVM's locking 
documentation" sentence in the commit message.



  - vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock
  
  - kvm->arch.mmu_lock is an rwlock.  kvm->arch.tdp_mmu_pages_lock and

@@ -216,15 +220,11 @@ time it will be set using the Dirty tracking mechanism 
described above.
  :Type:mutex
  :Arch:any
  :Protects:- vm_list
-
-``kvm_count_lock``
-^^
-
-:Type: raw_spinlock_t
-:Arch: any
-:Protects: - hardware virtualization enable/disable
-:Comment:  'raw' because hardware enabling/disabling must be atomic /wrt
-   migration.
+   - kvm_usage_count
+   - hardware virtualization enable/disable
+   - module probing (x86 only)


What do you mean exactly by "module probing"?  Is it anything other than
what is serialized by vendor_module_lock?


Paolo


+:Comment:  KVM also disables CPU hotplug via cpus_read_lock() during
+   enable/disable.
  
  ``kvm->mn_invalidate_lock``

  ^^^
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4e765ef9f4bd..c8d92e6c3922 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -100,7 +100,6 @@ EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);
   */
  
  DEFINE_MUTEX(kvm_lock);

-static DEFINE_RAW_SPINLOCK(kvm_count_lock);
  LIST_HEAD(vm_list);
  
  static cpumask_var_t cpus_hardware_enabled;

@@ -5028,9 +5027,10 @@ static void hardware_enable_nolock(void *junk)
  
  static int kvm_online_cpu(unsigned int cpu)

  {
+   unsigned long flags;
int ret = 0;
  
-	raw_spin_lock(&kvm_count_lock);

+   mutex_lock(&kvm_lock);
/*
 * Abort the CPU online process if hardware virtualization cannot
 * be enabled. Otherwise running VMs would encounter unrecoverable
@@ -5039,13 +5039,16 @@ static int kvm_online_cpu(unsigned int cpu)
if (kvm_usage_count) {
WARN_ON_ONCE(atomic_read(&hardware_enable_failed));
  
+		local_irq_save(flags);

hardware_enable_nolock(NULL);
+   local_irq_restore(flags);
+
if (atomic_read(&hardware_enable_failed)) {
atomic_set(&hardware_enable_failed, 0);
ret = -EIO;
}
}
-   raw_spin_unlock(&kvm_count_lock);
+   mutex_unlock(&kvm_lock);
return ret;
  }
  
@@ -5061,10 +5064,13 @@ static void hardware_disable_nolock(void *junk)
  
  static int kvm_offline_cpu(unsigned int cpu)

  {
-   raw_spin_lock(&kvm_count_lock);
-   if (kvm_usage_count)
+   mutex_lock(&kvm_lock);
+   if (kvm_usage_count) {
+   preempt_disable();
hardware_disable_nolock(NULL);
-   raw_spin_unlock(&kvm_count_lock);
+   preempt_enable();
+   }
+   mutex_unlock(&kvm_lock);
return 0;
  }
  
@@ -5079,9 +5085,11 @@ static void hardware_disable_all_nolock(void)
  
  static void hardware_disable_all(void)

  {
-   raw_spin_lock(&kvm_count_lock);
+   cpus_read_lock();
+   mutex_lock(&kvm_lock);
hardware_disable_all_nolock();
-   raw_spin_unlock(&kvm_count_lock);
+   mutex_unlock(&kvm_lock);
+   cpus_read_unlock();
  }
  
  static int hardware_enable_all(void)

@@ -5097,7 +5105,7 @@ static int hardware_enable_all(void)
 * Disable CPU hotplug to prevent scenarios where KVM sees
 */
cpus_read_lock();
-   raw_spin_lock(&kvm_count_lock);
+   mutex_lock(&kvm_lock);
  
  	kvm_usage_count++;

if (kvm_usage_count == 1) {
@@ -5110,7 +5118,7 @@ static int hardware_enable_all(void)
}
}
  
-	raw_spin_unlock(&kvm_count_lock);

+   mutex_unlock(&kvm_lock);
cpus_read_unlock();
  
  	return r;

@@ -5716,6 +5724,15 @@ static void kvm_init_debug(void)
  
  static int kvm_suspend(void)

  {
+   /*
+* Secondary CPUs and CPU hotplug are disabled across the suspend/resume
+* callbacks, i.e. no need to acquire kvm_lock to ensure the usage count
+* is stable.  Assert that kvm_lock is not held as a paranoid sanity
+* check that the system isn't suspended when KVM is enabling hardware.
+*/
+   lockdep_assert_not_held(&kvm_lock);
+   lockdep_assert_irqs_disabled();
+
if (kvm_usage_count)
hardware_disable_nolock(NULL);
return 0;
@@ -5723,10 +5740,11 @@ static int kvm_suspend(void)
  
  static void kvm_resume(void)

  {
-   if (kvm_usage_count) {
-   lockdep_assert_not_held(&kvm_count_lock);
+   lockdep_assert_not_held(&kvm_lock);
+   lockdep_assert_irqs_disabled();
+
+   if (kvm_usage_count)
hardware_enable_nolock(NULL);
-   }
  }
  
  static struct syscore_ops kvm_syscore_ops = {



Re: [PATCH 36/44] KVM: x86: Do compatibility checks when onlining CPU

2022-11-03 Thread Paolo Bonzini

On 11/3/22 00:19, Sean Christopherson wrote:

From: Chao Gao

Do compatibility checks when enabling hardware to effectively add
compatibility checks when onlining a CPU.  Abort enabling, i.e. the
online process, if the (hotplugged) CPU is incompatible with the known
good setup.


This paragraph is not true with this patch being before "KVM: Rename and 
move CPUHP_AP_KVM_STARTING to ONLINE section".


Paolo



Re: [PATCH 33/44] KVM: x86: Do VMX/SVM support checks directly in vendor code

2022-11-03 Thread Paolo Bonzini

On 11/3/22 00:19, Sean Christopherson wrote:

+   if (!boot_cpu_has(X86_FEATURE_MSR_IA32_FEAT_CTL) ||
+   !boot_cpu_has(X86_FEATURE_VMX)) {
+   pr_err("VMX not enabled in MSR_IA32_FEAT_CTL\n");
+   return false;


I think the reference to the BIOS should remain in these messages and in 
svm.c (even though these days it's much less common for vendors to 
default to disabled virtualization in the system setup).


The check for X86_FEATURE_MSR_IA32_FEAT_CTL is not needed because
init_ia32_feat_ctl() will clear X86_FEATURE_VMX if the rdmsr fails (and
not set X86_FEATURE_MSR_IA32_FEAT_CTL).


Paolo



Re: [PATCH 10/44] KVM: VMX: Clean up eVMCS enabling if KVM initialization fails

2022-11-03 Thread Paolo Bonzini
On Thu, Nov 3, 2022 at 3:01 PM Paolo Bonzini  wrote:
>
> On 11/3/22 00:18, Sean Christopherson wrote:
> > +static void hv_cleanup_evmcs(void)
>
> This needs to be __init.

Error: brain temporarily disconnected.

Paolo



Re: [PATCH 10/44] KVM: VMX: Clean up eVMCS enabling if KVM initialization fails

2022-11-03 Thread Paolo Bonzini

On 11/3/22 00:18, Sean Christopherson wrote:

+static void hv_cleanup_evmcs(void)


This needs to be __init.

Paolo



Re: [GIT PULL] KVM/arm64 fixes for 6.1, take #2

2022-10-22 Thread Paolo Bonzini
On Thu, Oct 20, 2022 at 12:02 PM Marc Zyngier  wrote:
>
> Paolo,
>
> Here's a couple of additional fixes for 6.1. The ITS one is pretty
> annoying as it prevents a VM from being restored if it has a
> convoluted device topology. Definitely a stable candidate.
>
> Note that I can't see that you have pulled the first set of fixes
> which I sent last week[1]. In order to avoid any problem, the current
> pull-request is a suffix of the previous one. But you may want to pull
> them individually in order to preserve the tag descriptions.

Yes, that's what I did. Pulled now, thanks.

Paolo

>
> Please pull,
>
> M.
>
> [1] https://lore.kernel.org/r/20221013132830.1304947-1-...@kernel.org
>
> The following changes since commit 05c2224d4b049406b0545a10be05280ff4b8ba0a:
>
>   KVM: selftests: Fix number of pages for memory slot in 
> memslot_modification_stress_test (2022-10-13 11:46:51 +0100)
>
> are available in the Git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git 
> tags/kvmarm-fixes-6.1-2
>
> for you to fetch changes up to c000a2607145d28b06c697f968491372ea56c23a:
>
>   KVM: arm64: vgic: Fix exit condition in scan_its_table() (2022-10-15 
> 12:10:54 +0100)
>
> 
> KVM/arm64 fixes for 6.1, take #2
>
> - Fix a bug preventing restoring an ITS containing mappings
>   for very large and very sparse device topology
>
> - Work around a relocation handling error when compiling
>   the nVHE object with profile optimisation
>
> 
> Denis Nikitin (1):
>   KVM: arm64: nvhe: Fix build with profile optimization
>
> Eric Ren (1):
>   KVM: arm64: vgic: Fix exit condition in scan_its_table()
>
>  arch/arm64/kvm/hyp/nvhe/Makefile | 4 
>  arch/arm64/kvm/vgic/vgic-its.c   | 5 -
>  2 files changed, 8 insertions(+), 1 deletion(-)
>



Re: [GIT PULL] KVM/arm64 fixes for 6.1, take #1

2022-10-22 Thread Paolo Bonzini
On Thu, Oct 13, 2022 at 3:28 PM Marc Zyngier  wrote:
> Paolo,
>
> Here's the first set of fixes for 6.1. The most interesting bit is
> Oliver's fix limiting the S2 invalidation batch size to the largest
> block mapping, solving (at least for now) the RCU stall problems we
> have been seeing for a while. We may have to find another solution
> when (and if) we decide to allow 4TB mapping at S2...
>
> The rest is a set of minor selftest fixes as well as enabling stack
> protection and profiling in the VHE code.
>
> Please pull,

Done, thanks.

Paolo



Re: [GIT PULL] KVM/arm64 updates for 6.1

2022-10-03 Thread Paolo Bonzini
Pulled, thanks!

Paolo

On Sun, Oct 2, 2022 at 2:42 PM Marc Zyngier  wrote:
>
> Paolo,
>
> Please find below the rather small set of KVM/arm64 updates
> for 6.1. This is mostly a set of fixes for existing features,
> the most interesting one being Reiji's really good work tracking
> an annoying set of bugs in our single-stepping implementation.
> Also, I've included the changes making it possible to enable
> the dirty-ring tracking on arm64. Full details in the tag.
>
> Note that this pull-request comes with a branch shared with the
> arm64 tree, in order to avoid some bad conflicts due to the
> ongoing repainting of all the system registers.
>
> Please pull,
>
> M.
>
> The following changes since commit b90cb1053190353cc30f0fef0ef1f378ccc063c5:
>
>   Linux 6.0-rc3 (2022-08-28 15:05:29 -0700)
>
> are available in the Git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git 
> tags/kvmarm-6.1
>
> for you to fetch changes up to b302ca52ba8235ff0e18c0fa1fa92b51784aef6a:
>
>   Merge branch kvm-arm64/misc-6.1 into kvmarm-master/next (2022-10-01 
> 10:19:36 +0100)
>
> 
> KVM/arm64 updates for v6.1
>
> - Fixes for single-stepping in the presence of an async
>   exception as well as the preservation of PSTATE.SS
>
> - Better handling of AArch32 ID registers on AArch64-only
>   systems
>
> - Fixes for the dirty-ring API, allowing it to work on
>   architectures with relaxed memory ordering
>
> - Advertise the new kvmarm mailing list
>
> - Various minor cleanups and spelling fixes
>
> 
> Elliot Berman (1):
>   KVM: arm64: Ignore kvm-arm.mode if !is_hyp_mode_available()
>
> Gavin Shan (1):
>   KVM: arm64: vgic: Remove duplicate check in update_affinity_collection()
>
> Kristina Martsenko (3):
>   arm64: cache: Remove unused CTR_CACHE_MINLINE_MASK
>   arm64/sysreg: Standardise naming for ID_AA64MMFR1_EL1 fields
>   arm64/sysreg: Convert ID_AA64MMFR1_EL1 to automatic generation
>
> Marc Zyngier (12):
>   Merge branch kvm-arm64/aarch32-raz-idregs into kvmarm-master/next
>   Merge remote-tracking branch 'arm64/for-next/sysreg' into 
> kvmarm-master/next
>   Merge branch kvm-arm64/single-step-async-exception into 
> kvmarm-master/next
>   KVM: Use acquire/release semantics when accessing dirty ring GFN state
>   KVM: Add KVM_CAP_DIRTY_LOG_RING_ACQ_REL capability and config option
>   KVM: x86: Select CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL
>   KVM: Document weakly ordered architecture requirements for dirty ring
>   KVM: selftests: dirty-log: Upgrade flag accesses to acquire/release 
> semantics
>   KVM: selftests: dirty-log: Use KVM_CAP_DIRTY_LOG_RING_ACQ_REL if 
> available
>   KVM: arm64: Advertise new kvmarm mailing list
>   Merge branch kvm-arm64/dirty-log-ordered into kvmarm-master/next
>   Merge branch kvm-arm64/misc-6.1 into kvmarm-master/next
>
> Mark Brown (31):
>   arm64/sysreg: Remove stray SMIDR_EL1 defines
>   arm64/sysreg: Describe ID_AA64SMFR0_EL1.SMEVer as an enumeration
>   arm64/sysreg: Add _EL1 into ID_AA64MMFR0_EL1 definition names
>   arm64/sysreg: Add _EL1 into ID_AA64MMFR2_EL1 definition names
>   arm64/sysreg: Add _EL1 into ID_AA64PFR0_EL1 definition names
>   arm64/sysreg: Add _EL1 into ID_AA64PFR1_EL1 constant names
>   arm64/sysreg: Standardise naming of ID_AA64MMFR0_EL1.BigEnd
>   arm64/sysreg: Standardise naming of ID_AA64MMFR0_EL1.ASIDBits
>   arm64/sysreg: Standardise naming for ID_AA64MMFR2_EL1.VARange
>   arm64/sysreg: Standardise naming for ID_AA64MMFR2_EL1.CnP
>   arm64/sysreg: Standardise naming for ID_AA64PFR0_EL1 constants
>   arm64/sysreg: Standardise naming for ID_AA64PFR0_EL1.AdvSIMD constants
>   arm64/sysreg: Standardise naming for SSBS feature enumeration
>   arm64/sysreg: Standardise naming for MTE feature enumeration
>   arm64/sysreg: Standardise naming of ID_AA64PFR1_EL1 fractional version 
> fields
>   arm64/sysreg: Standardise naming of ID_AA64PFR1_EL1 BTI enumeration
>   arm64/sysreg: Standardise naming of ID_AA64PFR1_EL1 SME enumeration
>   arm64/sysreg: Convert HCRX_EL2 to automatic generation
>   arm64/sysreg: Convert ID_AA64MMFR0_EL1 to automatic generation
>   arm64/sysreg: Convert ID_AA64MMFR2_EL1 to automatic generation
>   arm64/sysreg: Convert ID_AA64PFR0_EL1 to automatic generation
>   arm64/sysreg: Convert ID_AA64PFR1_EL1 to automatic generation
>   arm64/sysreg: Convert TIPDR_EL1 to automatic generation
>   arm64/sysreg: Convert SCXTNUM_EL1 to automatic generation
>   arm64/sysreg: Add defintion for ALLINT
>   arm64/sysreg: Align field names in ID_AA64DFR0_EL1 with architecture
>   arm64/sysreg: Add _EL1 into ID_AA64DFR0_EL1 definition names
>   arm64/sysreg: Use feature numbering for PMU and SPE revisions
>   

Re: [PATCH 2/6] KVM: Add KVM_CAP_DIRTY_LOG_RING_ORDERED capability and config option

2022-09-23 Thread Paolo Bonzini
On Fri, Sep 23, 2022 at 20:26, Peter Xu wrote:
>
> > Someone will show up with an old userspace which probes for the sole
> > existing capability, and things start failing subtly. It is quite
> > likely that the userspace code is built for all architectures,
>
> I didn't quite follow here.  Since the dirty ring was only supported
> on x86, in both kvm and qemu, I don't see the risk.

Say you run a new ARM kernel with an old userspace, and the new kernel
advertises KVM_CAP_DIRTY_LOG_RING. Userspace will try to use the dirty
page ring buffer even though it lacks the memory barriers that were only
just introduced in QEMU.

The new capability means "the dirty page ring buffer is supported and,
by the way, you're supposed to do everything right with respect to
ordering of loads and stores; you can't get away without it like you
could on x86".
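
To make that concrete, a new userspace is expected to probe for the
stricter capability first and only fall back to the old one where the
weaker ordering is known to be enough.  A minimal sketch, using the
capability name from this series (ring size and error handling are the
VMM's business):

#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int enable_dirty_ring(int vm_fd, uint32_t ring_bytes)
{
        struct kvm_enable_cap cap = { .args = { ring_bytes } };

        if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_DIRTY_LOG_RING_ORDERED) > 0) {
                cap.cap = KVM_CAP_DIRTY_LOG_RING_ORDERED;
        } else if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_DIRTY_LOG_RING) > 0) {
                /* Only safe on strongly ordered architectures, i.e. x86. */
                cap.cap = KVM_CAP_DIRTY_LOG_RING;
        } else {
                return -ENOTSUP;
        }

        return ioctl(vm_fd, KVM_ENABLE_CAP, &cap) ? -errno : 0;
}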

Paolo

>
> Assuming we've the old binary.
>
> If to run on old kernel, it'll work like before.
>
> If to run on new kernel, the kernel will behave stricter on memory barriers
> but should still be compatible with the old behavior (not vice versa, so
> I'll understand if we're losing the ordering, but we're not..).
>
> Any further elaboration would be greatly helpful.
>
> Thanks,
>
> > and we
> > want to make sure that userspace actively buys into the new ordering
> > requirements. A simple way to do this is to expose a new capability,
> > making the new requirement obvious. Architectures with relaxed
> > ordering semantics will only implement the new one, while x86 will
> > implement both.
>
> --
> Peter Xu
>



Re: [PATCH 1/6] KVM: Use acquire/release semantics when accessing dirty ring GFN state

2022-09-23 Thread Paolo Bonzini
On Fri, Sep 23, 2022 at 4:20 PM Marc Zyngier  wrote:
> > > This is only a partial fix as the userspace side also need upgrading.
> >
> > Paolo has one fix 4802bf910e ("KVM: dirty ring: add missing memory
> > barrier", 2022-09-01) which has already landed.
>
> What is this commit? It doesn't exist in the kernel as far as I can see.

That's the load_acquire in QEMU, and the store_release part is in 7.2
as well (commit 52281c6d11, "KVM: use store-release to mark dirty
pages as harvested", 2022-09-18).

So all that QEMU is missing is the new capability.

Paolo



Re: [PATCH 5/6] KVM: selftests: dirty-log: Upgrade dirty_gfn_set_collected() to store-release

2022-09-22 Thread Paolo Bonzini
On Thu, Sep 22, 2022 at 7:02 PM Marc Zyngier  wrote:
> To make sure that all the writes to the log marking the entries
> as being in need of reset are observed in order, use a
> smp_store_release() when updating the log entry flags.
>
> Signed-off-by: Marc Zyngier 

You also need a load-acquire on the load of gfn->flags in
dirty_gfn_is_dirtied. Otherwise reading cur->slot or cur->offset might
see a stale value.
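
Concretely, the selftest helper would end up looking something like this
(a sketch of the suggestion, not the final patch):

static inline bool dirty_gfn_is_dirtied(struct kvm_dirty_gfn *gfn)
{
        /*
         * Acquire semantics pair with the kernel's publication of the
         * entry, so cur->slot and cur->offset are only consumed after
         * the DIRTY flag has been observed.
         */
        return smp_load_acquire(&gfn->flags) == KVM_DIRTY_GFN_F_DIRTY;
}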

Paolo

> ---
>  tools/testing/selftests/kvm/dirty_log_test.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/kvm/dirty_log_test.c 
> b/tools/testing/selftests/kvm/dirty_log_test.c
> index 9c883c94d478..3d29f4bf4f9c 100644
> --- a/tools/testing/selftests/kvm/dirty_log_test.c
> +++ b/tools/testing/selftests/kvm/dirty_log_test.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include "kvm_util.h"
>  #include "test_util.h"
> @@ -284,7 +285,7 @@ static inline bool dirty_gfn_is_dirtied(struct 
> kvm_dirty_gfn *gfn)
>
>  static inline void dirty_gfn_set_collected(struct kvm_dirty_gfn *gfn)
>  {
> -   gfn->flags = KVM_DIRTY_GFN_F_RESET;
> +   smp_store_release(&gfn->flags, KVM_DIRTY_GFN_F_RESET);
>  }
>
>  static uint32_t dirty_ring_collect_one(struct kvm_dirty_gfn *dirty_gfns,
> --
> 2.34.1
>



Re: [GIT PULL] KVM/arm64 fixes for 6.0, take #2

2022-09-22 Thread Paolo Bonzini
Pulled, thanks.

Paolo

On Mon, Sep 19, 2022 at 7:19 PM Marc Zyngier  wrote:
>
> Paolo,
>
> Here's the last KVM/arm64 pull request for this cycle, with
> a small fix for pKVM and kmemleak.
>
> Please pull,
>
> M.
>
> The following changes since commit 1c23f9e627a7b412978b4e852793c5e3c3efc555:
>
>   Linux 6.0-rc2 (2022-08-21 17:32:54 -0700)
>
> are available in the Git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git 
> tags/kvmarm-fixes-6.0-2
>
> for you to fetch changes up to 522c9a64c7049f50c7b1299741c13fac3f231cd4:
>
>   KVM: arm64: Use kmemleak_free_part_phys() to unregister hyp_mem_base 
> (2022-09-19 17:59:48 +0100)
>
> 
> KVM/arm64 fixes for 6.0, take #2
>
> - Fix kmemleak usage in Protected KVM (again)
>
> 
> Zenghui Yu (1):
>   KVM: arm64: Use kmemleak_free_part_phys() to unregister hyp_mem_base
>
>  arch/arm64/kvm/arm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>



Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

2022-09-01 Thread Paolo Bonzini

On 8/30/22 16:42, Peter Xu wrote:

Marc,

I thought we won't hit this as long as we properly take care of other
orderings of (a) gfn push, and (b) gfn collect, but after a second thought
I think it's indeed logically possible that with a reversed ordering here
we can be reading some garbage gfn before (a) happens but also read the
valid flag after (b).

It seems we must have all the barriers correctly applied always.  If that's
correct, do you perhaps mean something like this to just add the last piece
of barrier?


Okay, so I thought about it some more and it's quite tricky.

Strictly speaking, the synchronization is just between userspace and 
kernel. The fact that the actual producer of dirty pages is in another 
CPU is a red herring, because reset only cares about harvested pages.


In other words, the dirty page ring is essentially two ring buffers in 
one and we only care about the "harvested ring", not the "produced ring".
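
To make the "two ring buffers in one" picture concrete, here is a purely
illustrative view of the bookkeeping (the field names are made up, they
are not KVM's):

/*
 * Entries cycle through: free -> DIRTY (written by the producer) ->
 * RESET (written by the harvester) -> reclaimed by the reset ioctl.
 */
struct dirty_ring_view {
        struct kvm_dirty_gfn *gfns;     /* shared array of entries */
        uint32_t produced;      /* kernel: next entry to publish as DIRTY    */
        uint32_t harvested;     /* userspace: next DIRTY entry to mark RESET */
        uint32_t reclaimed;     /* kernel: next RESET entry to reclaim       */
};

Only the harvested-but-not-reclaimed entries matter to the reset ioctl,
which is why the producer is a red herring.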


On the other hand, it may happen that userspace has set more RESET flags 
while the ioctl is ongoing:



CPU0 (userspace, harvest)     CPU1 (kernel, reset ioctl)    CPU2 (producer)
                                                            fill gfn0
                                                            store-rel flags for gfn0
                                                            fill gfn1
                                                            store-rel flags for gfn1
load-acq flags for gfn0
set RESET for gfn0
load-acq flags for gfn1
set RESET for gfn1
do ioctl! --->
                              ioctl(RESET_RINGS)
                                                            fill gfn2
                                                            store-rel flags for gfn2
load-acq flags for gfn2
set RESET for gfn2
                              process gfn0
                              process gfn1
                              process gfn2
do ioctl!
etc.

The three load-acquires in CPU0 synchronize with the three store-releases
in CPU2, but CPU0 and CPU1 are only synchronized up to gfn1, so CPU1 may
see stale values for gfn2's fields other than the flags.


The kernel must be able to cope with invalid values of the fields, and 
userspace will invoke the ioctl once more.  However, once the RESET flag 
is cleared on gfn2 it is lost forever; therefore, in the above scenario, 
CPU1 must read the correct values of gfn2's fields.


Therefore RESET must be set with a store-release, which will synchronize 
with a load-acquire in CPU1, as you suggested.
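
On the userspace side that boils down to marking the entry with a
store-release rather than a plain store, roughly like this sketch
(using the selftest's helper name):

static inline void dirty_gfn_set_collected(struct kvm_dirty_gfn *gfn)
{
        /*
         * The release makes everything this thread observed before the
         * store (including the producer's slot/offset writes, via the
         * earlier load-acquire of the DIRTY flag) visible to whoever
         * load-acquires the RESET flag, i.e. the reset ioctl.
         */
        smp_store_release(&gfn->flags, KVM_DIRTY_GFN_F_RESET);
}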


Paolo


diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index f4c2a6eb1666..ea620bfb012d 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -84,7 +84,7 @@ static inline void kvm_dirty_gfn_set_dirtied(struct 
kvm_dirty_gfn *gfn)
 
 static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)

 {
-   return gfn->flags & KVM_DIRTY_GFN_F_RESET;
+   return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
 }
 
 int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)

===8<===

Thanks,

--
Peter Xu





Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

2022-08-29 Thread Paolo Bonzini
On Fri, Aug 26, 2022 at 17:50, Marc Zyngier wrote:

> userspace has no choice. It cannot order on its own the reads that the
> kernel will do to *other* rings.
>
> The problem isn't on CPU0. The problem is that CPU1 does observe
> inconsistent data on arm64, and I don't think this difference in
> behaviour is acceptable. Nothing documents this, and there is a baked-in
> assumption that there is a strong ordering between writes as well
> as between writes and reads.
>

Nevermind, what I wrote last Saturday doesn't make sense. I will try to put
together a litmus test but IMO the synchronization is just between
userspace and kernel, because the dirty page ring is essentially two ring
buffers in one. The actual producer of dirty pages is a red herring.

I am HTML-only for a couple days, I will resend to the mailing list on
Thursday.

Paolo



Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

2022-08-27 Thread Paolo Bonzini

On 8/26/22 17:49, Marc Zyngier wrote:

Agreed, but that's a problem for userspace to solve.  If userspace
wants to reset the fields in different CPUs, it has to synchronize
with its own invoking of the ioctl.


userspace has no choice. It cannot order on its own the reads that the
kernel will do to *other* rings.


However, those reads will never see KVM_DIRTY_GFN_F_RESET in the flags 
if userspace has never interacted with the ring.  So there will be 
exactly one read on those rings, and there's nothing to reorder.


If that's too tricky and you want to add a load-acquire, I have no 
objection.  It also helps avoid read-read reordering between one entry's 
flags and the next one's, so it's a good idea to have it anyway.



The main reason why I preferred a global KVM_RESET_DIRTY_RINGS ioctl
was because it takes kvm->slots_lock so the execution would be
serialized anyway.  Turning slots_lock into an rwsem would be even
worse because it also takes kvm->mmu_lock (since slots_lock is a
mutex, at least two concurrent invocations won't clash with each other
on the mmu_lock).


Whatever the reason, the behaviour should be identical on all
architectures. As it is, it only really works on x86, and I contend
this is a bug that needs fixing.

Thankfully, this can be done at zero cost for x86, and at that of a
set of load-acquires on other architectures.


Yes, the global-ness of the API is orthogonal to the memory ordering 
issue.  I just wanted to explain why a per-vCPU API probably isn't going 
to work great.


Paolo


Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

2022-08-26 Thread Paolo Bonzini

On 8/23/22 22:35, Marc Zyngier wrote:

Heh, yeah I need to get that out the door. I'll also note that Gavin's
changes are still relevant without that series, as we do write unprotect
in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64:
Add fast path to handle permission relaxation during dirty logging").


Ah, true. Now if only someone could explain how the whole
producer-consumer thing works without a trace of a barrier, that'd be
great...


Do you mean this?

void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
{
struct kvm_dirty_gfn *entry;

/* It should never get full */
WARN_ON_ONCE(kvm_dirty_ring_full(ring));

entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];

entry->slot = slot;
entry->offset = offset;
/*
 * Make sure the data is filled in before we publish this to
 * the userspace program.  There's no paired kernel-side reader.
 */
smp_wmb();
kvm_dirty_gfn_set_dirtied(entry);
ring->dirty_index++;
trace_kvm_dirty_ring_push(ring, slot, offset);
}

The matching smp_rmb() is in userspace.
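
For completeness, the userspace consumer looks roughly like this (an
illustrative sketch; the helper name is made up, QEMU and the selftests
each have their own version):

static bool collect_one(struct kvm_dirty_gfn *cur)
{
        if (!(READ_ONCE(cur->flags) & KVM_DIRTY_GFN_F_DIRTY))
                return false;           /* nothing published here yet */

        smp_rmb();                      /* pairs with the smp_wmb() above */

        /* Only now are slot and offset guaranteed to be valid. */
        handle_dirty_gfn(cur->slot, cur->offset);
        return true;
}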

Paolo



Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

2022-08-26 Thread Paolo Bonzini

On 8/24/22 00:47, Marc Zyngier wrote:

I definitely don't think I 100% understand all the ordering things since
they're complicated.. but my understanding is that the reset procedure
didn't need memory barrier (unlike pushing, where we have explicit wmb),
because we assumed the userapp is not hostile so logically it should only
modify the flags which is a 32bit field, assuming atomicity guaranteed.

Atomicity doesn't guarantee ordering, unfortunately. Take the
following example: CPU0 is changing a bunch of flags for GFNs A, B, C,
D that exist in the ring in that order, and CPU1 performs an ioctl to
reset the page state.

CPU0:
 write_flag(A, KVM_DIRTY_GFN_F_RESET)
 write_flag(B, KVM_DIRTY_GFN_F_RESET)
 write_flag(C, KVM_DIRTY_GFN_F_RESET)
 write_flag(D, KVM_DIRTY_GFN_F_RESET)
 [...]

CPU1:
ioctl(KVM_RESET_DIRTY_RINGS)

Since CPU0 writes do not have any ordering, CPU1 can observe the
writes in a sequence that have nothing to do with program order, and
could for example observe that GFN A and D have been reset, but not B
and C. This in turn breaks the logic in the reset code (B, C, and D
don't get reset), despite userspace having followed the spec to the
letter. If each was a store-release (which is the case on x86), it
wouldn't be a problem, but nothing calls it in the documentation.

Maybe that's not a big deal if it is expected that each CPU will issue
a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own
writes. But expecting this to work across CPUs without any barrier is
wishful thinking.


Agreed, but that's a problem for userspace to solve.  If userspace wants 
to reset the fields from different CPUs, it has to synchronize those 
writes with its own invocation of the ioctl.


That is, CPU0 must ensure that an ioctl(KVM_RESET_DIRTY_RINGS) is done 
after (in the memory-ordering sense) its last write_flag(D, 
KVM_DIRTY_GFN_F_RESET).  If there's no such ordering, there's no 
guarantee that the write_flag will have any effect.
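
For instance, if the flags are reset on CPU0 and the ioctl is issued from
CPU1, userspace needs its own release/acquire handshake (or a lock).
Reusing write_flag() and the GFNs from the example above, a C11 sketch:

#include <stdatomic.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static _Atomic int harvest_done;

static void cpu0_harvest(void)
{
        write_flag(A, KVM_DIRTY_GFN_F_RESET);
        /* ... B, C ... */
        write_flag(D, KVM_DIRTY_GFN_F_RESET);
        atomic_store_explicit(&harvest_done, 1, memory_order_release);
}

static void cpu1_reset(int vm_fd)
{
        while (!atomic_load_explicit(&harvest_done, memory_order_acquire))
                ;
        ioctl(vm_fd, KVM_RESET_DIRTY_RINGS);
}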


The main reason why I preferred a global KVM_RESET_DIRTY_RINGS ioctl was 
because it takes kvm->slots_lock so the execution would be serialized 
anyway.  Turning slots_lock into an rwsem would be even worse because it 
also takes kvm->mmu_lock (since slots_lock is a mutex, at least two 
concurrent invocations won't clash with each other on the mmu_lock).


Paolo



Re: [PATCH] mailmap: Update Oliver's email address

2022-08-22 Thread Paolo Bonzini

On 8/19/22 21:01, Oliver Upton wrote:

While I'm still at Google, I've since switched to a linux.dev account
for working upstream.

Add an alias to the new address.

Signed-off-by: Oliver Upton


Queued, thanks.

Paolo



Re: [GIT PULL] KVM/arm64 fixes for 6.0, take #1

2022-08-19 Thread Paolo Bonzini
Pulled, thanks.

Paolo

On Thu, Aug 18, 2022 at 4:09 PM Marc Zyngier  wrote:
>
> Paolo,
>
> Here's a small set of KVM/arm64 fixes for 6.0, the most visible thing
> being a better handling of systems with asymmetric AArch32 support.
>
> Please pull,
>
> M.
>
> The following changes since commit 0982c8d859f8f7022b9fd44d421c7ec721bb41f9:
>
>   Merge branch kvm-arm64/nvhe-stacktrace into kvmarm-master/next (2022-07-27 
> 18:33:27 +0100)
>
> are available in the Git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git 
> tags/kvmarm-fixes-6.0-1
>
> for you to fetch changes up to b10d86fb8e46cc812171728bcd326df2f34e9ed5:
>
>   KVM: arm64: Reject 32bit user PSTATE on asymmetric systems (2022-08-17 
> 10:29:07 +0100)
>
> 
> KVM/arm64 fixes for 6.0, take #1
>
> - Fix unexpected sign extension of KVM_ARM_DEVICE_ID_MASK
>
> - Tidy-up handling of AArch32 on asymmetric systems
>
> 
> Oliver Upton (2):
>   KVM: arm64: Treat PMCR_EL1.LC as RES1 on asymmetric systems
>   KVM: arm64: Reject 32bit user PSTATE on asymmetric systems
>
> Yang Yingliang (1):
>   KVM: arm64: Fix compile error due to sign extension
>
>  arch/arm64/include/asm/kvm_host.h | 4 
>  arch/arm64/include/uapi/asm/kvm.h | 6 --
>  arch/arm64/kvm/arm.c  | 3 +--
>  arch/arm64/kvm/guest.c| 2 +-
>  arch/arm64/kvm/sys_regs.c | 4 ++--
>  5 files changed, 12 insertions(+), 7 deletions(-)
>



Re: [PATCH 1/2] KVM: selftests: Make rseq compatible with glibc-2.35

2022-08-10 Thread Paolo Bonzini

On 8/10/22 14:29, Mathieu Desnoyers wrote:

- By design, selftests/rseq and selftests/kvm are parallel. It's going to
introduce
   unnecessary dependency for selftests/kvm to use selftests/rseq/librseq.so. To
   me,
   it makes the maintainability even harder.

In terms of build system, yes, selftests/rseq and selftests/kvm are 
side-by-side,
and I agree it is odd to have a cross-dependency.

That's where moving rseq.c to tools/lib/ makes sense.


- What selftests/kvm needs is rseq-thread-pointer.h, which accounts for ~5% of
   functionalities, provided by selftests/rseq/librseq.so.

I've never seen this type of argument used to prevent using a library before, 
except
on extremely memory-constrained devices, which is not our target here.


I agree.

To me, the main argument against moving librseq to tools/lib is a 
variant of the build-system argument, namely that recursive Make 
sucks[1] and selftests/kvm right now does not use tools/lib.  So, for a 
single-file library, it may be simply not worth the hassle.


On the other hand, if "somebody else" does the work, I would have no 
problem with having selftests/kvm depend on tools/lib, not at all.


Thanks,

Paolo

[1] Kbuild is a marvel that makes it work, but it works because there 
are no such cross-subdirectory dependencies and anyway 
tools/testing/selftests does not use Kbuild.




Re: [PATCH v2 1/2] KVM: selftests: Make rseq compatible with glibc-2.35

2022-08-10 Thread Paolo Bonzini

On 8/10/22 14:22, Mathieu Desnoyers wrote:


/*
 * Create and run a dummy VM that immediately exits to userspace via
@@ -256,7 +244,7 @@ int main(int argc, char *argv[])
 */
smp_rmb();
cpu = sched_getcpu();
-   rseq_cpu = READ_ONCE(__rseq.cpu_id);
+   rseq_cpu = READ_ONCE(__rseq->cpu_id);

#include 

and use

rseq_current_cpu_raw().


Thanks, I squashed it and queued it for -rc1 (tested on both
glibc 2.34 and 2.35).

diff --git a/tools/testing/selftests/kvm/rseq_test.c 
b/tools/testing/selftests/kvm/rseq_test.c
index 84e8425edc2c..987a76674f4f 100644
--- a/tools/testing/selftests/kvm/rseq_test.c
+++ b/tools/testing/selftests/kvm/rseq_test.c
@@ -29,7 +29,6 @@
 #define NR_TASK_MIGRATIONS 10
 
 static pthread_t migration_thread;

-static struct rseq_abi *__rseq;
 static cpu_set_t possible_mask;
 static int min_cpu, max_cpu;
 static bool done;
@@ -218,7 +217,6 @@ int main(int argc, char *argv[])
r = rseq_register_current_thread();
TEST_ASSERT(!r, "rseq_register_current_thread failed, errno = %d (%s)",
errno, strerror(errno));
-   __rseq = rseq_get_abi();
 
 	/*

 * Create and run a dummy VM that immediately exits to userspace via
@@ -256,7 +254,7 @@ int main(int argc, char *argv[])
 */
smp_rmb();
cpu = sched_getcpu();
-   rseq_cpu = READ_ONCE(__rseq->cpu_id);
+   rseq_cpu = rseq_current_cpu_raw();
smp_rmb();
} while (snapshot != atomic_read(&seq_cnt));
 


Paolo



Re: [PATCH 1/2] KVM: selftests: Make rseq compatible with glibc-2.35

2022-08-10 Thread Paolo Bonzini

On 8/10/22 14:17, Mathieu Desnoyers wrote:

Indeed, this hack seems to be a good approach to immediately fix things without
moving around all source files and headers. In the longer term, I'd prefer 
Sean's
proposal to move rseq.c to tools/lib/ (and to move rseq headers to 
tools/include/rseq/).
This can be done in a follow up phase though. I'll put a note on my todo list
for after I come back from vacation.


Great, Gavin, are you going to repost using librseq?


Yeah, rseq_test should reuse librseq code.  The simplest way,
if slightly hackish, is to do something like

diff --git a/tools/testing/selftests/kvm/Makefile
b/tools/testing/selftests/kvm/Makefile
index 690b499c3471..6c192b0ec304 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -37,6 +37,7 @@ ifeq ($(ARCH),riscv)
UNAME_M := riscv
 endif
 
 LIBKVM += lib/assert.c

 LIBKVM += lib/elf.c
 LIBKVM += lib/guest_modes.c
@@ -198,7 +199,7 @@ endif
 CFLAGS += -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99 \
-fno-stack-protector -fno-PIE -I$(LINUX_TOOL_INCLUDE) \
-I$(LINUX_TOOL_ARCH_INCLUDE) -I$(LINUX_HDR_PATH) -Iinclude \
-   -I$( 
 no-pie-option := $(call try-run, echo 'int main() { return 0; }' | \

 $(CC) -Werror -no-pie -x c - -o "$$TMP", -no-pie)


and just #include "../rseq/rseq.c" in rseq_test.c.
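
In rseq_test.c itself the hack is just an extra include at the top,
something like this sketch:

#include "kvm_util.h"
#include "test_util.h"

/* Compile librseq as part of the test instead of linking against it. */
#include "../rseq/rseq.c"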




Paolo



Re: [PATCH 1/2] KVM: selftests: Make rseq compatible with glibc-2.35

2022-08-10 Thread Paolo Bonzini

On 8/9/22 14:21, Mathieu Desnoyers wrote:

For kvm/selftests, there are 3 architectures involved actually. So we
just need consider 4 cases: aarch64, x86, s390 and other. For other
case, we just use __builtin_thread_pointer() to maintain code's
integrity, but it's not called at all.

I think kvm/selftest is always relying on glibc if I'm correct.

All those are handled in the rseq selftests and in librseq. Why duplicate all 
that logic again?


Yeah, rseq_test should reuse librseq code.  The simplest way,
if slightly hackish, is to do something like

diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index 690b499c3471..6c192b0ec304 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -37,6 +37,7 @@ ifeq ($(ARCH),riscv)
UNAME_M := riscv
 endif
 
 LIBKVM += lib/assert.c

 LIBKVM += lib/elf.c
 LIBKVM += lib/guest_modes.c
@@ -198,7 +199,7 @@ endif
 CFLAGS += -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99 \
-fno-stack-protector -fno-PIE -I$(LINUX_TOOL_INCLUDE) \
-I$(LINUX_TOOL_ARCH_INCLUDE) -I$(LINUX_HDR_PATH) -Iinclude \
-   -I$( 
 no-pie-option := $(call try-run, echo 'int main() { return 0; }' | \

 $(CC) -Werror -no-pie -x c - -o "$$TMP", -no-pie)


and just #include "../rseq/rseq.c" in rseq_test.c.

Thanks,

Paolo



Re: [GIT PULL] KVM/arm64 updates for 5.20

2022-07-29 Thread Paolo Bonzini

On 7/29/22 10:43, Marc Zyngier wrote:

Paolo,

Here's the bulk of the KVM/arm64 updates for 5.20. Major feature is
the new unwinder for the nVHE modes. The rest is mostly rewriting some
unloved aspects of the arm64 port (sysregs, flags). Further details in
the tag description.

Note that this PR contains a shared branch with the arm64 tree
containing some of the stacktrace updates. There is also a minor
conflict with your tree in one of the selftests, already resolved in
-next.

Please pull,

M.


Pulled, thanks.  Because it's Friday and the RISC-V pull brought in all 
the new x86 RETbleed stuff, it may be a couple days before I finish 
retesting and do push it out to kvm.git.


Paolo



Re: [PATCH v4] KVM: selftests: Fix target thread to be migrated in rseq_test

2022-07-19 Thread Paolo Bonzini

On 7/19/22 04:08, Gavin Shan wrote:

In rseq_test, there are two threads, which are vCPU thread and migration
worker separately. Unfortunately, the test has the wrong PID passed to
sched_setaffinity() in the migration worker. It forces migration on the
migration worker because zeroed PID represents the calling thread, which
is the migration worker itself. It means the vCPU thread is never enforced
to migration and it can migrate at any time, which eventually leads to
failure as the following logs show.

   host# uname -r
   5.19.0-rc6-gavin+
   host# # cat /proc/cpuinfo | grep processor | tail -n 1
   processor: 223
   host# pwd
   /home/gavin/sandbox/linux.main/tools/testing/selftests/kvm
   host# for i in `seq 1 100`; do \
 echo "> $i"; ./rseq_test; done
   > 1
   > 2
   > 3
   > 4
   > 5
   > 6
    Test Assertion Failure 
 rseq_test.c:265: rseq_cpu == cpu
 pid=3925 tid=3925 errno=4 - Interrupted system call
1  0x00401963: main at rseq_test.c:265 (discriminator 2)
2  0xb044affb: ?? ??:0
3  0xb044b0c7: ?? ??:0
4  0x00401a6f: _start at ??:?
 rseq CPU = 4, sched CPU = 27

Fix the issue by passing correct parameter, TID of the vCPU thread, to
sched_setaffinity() in the migration worker.

Fixes: 61e52f1630f5 ("KVM: selftests: Add a test for KVM_RUN+rseq to detect task 
migration bugs")
Suggested-by: Sean Christopherson 
Signed-off-by: Gavin Shan 
Reviewed-by: Oliver Upton 
---
v4: Pick the code change as Sean suggested.
---
  tools/testing/selftests/kvm/rseq_test.c | 8 +---
  1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/kvm/rseq_test.c 
b/tools/testing/selftests/kvm/rseq_test.c
index 4158da0da2bb..2237d1aac801 100644
--- a/tools/testing/selftests/kvm/rseq_test.c
+++ b/tools/testing/selftests/kvm/rseq_test.c
@@ -82,8 +82,9 @@ static int next_cpu(int cpu)
return cpu;
  }
  
-static void *migration_worker(void *ign)

+static void *migration_worker(void *__rseq_tid)
  {
+   pid_t rseq_tid = (pid_t)(unsigned long)__rseq_tid;
cpu_set_t allowed_mask;
int r, i, cpu;
  
@@ -106,7 +107,7 @@ static void *migration_worker(void *ign)

 * stable, i.e. while changing affinity is in-progress.
 */
smp_wmb();
-   r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
+   r = sched_setaffinity(rseq_tid, sizeof(allowed_mask), 
&allowed_mask);
TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
errno, strerror(errno));
smp_wmb();
@@ -231,7 +232,8 @@ int main(int argc, char *argv[])
vm = vm_create_default(VCPU_ID, 0, guest_code);
ucall_init(vm, NULL);
  
-	pthread_create(&migration_thread, NULL, migration_worker, 0);

+   pthread_create(&migration_thread, NULL, migration_worker,
+  (void *)(unsigned long)gettid());
  
  	for (i = 0; !done; i++) {

vcpu_run(vm, VCPU_ID);


Queued, thanks.

Paolo



Re: [PATCH] KVM: selftests: Double check on the current CPU in rseq_test

2022-07-14 Thread Paolo Bonzini

On 7/14/22 17:35, Sean Christopherson wrote:

Can you check that smp_rmb() and smp_wmb() generate correct instructions on
arm64?


That seems like the most likely scenario (or a kernel bug), I distinctly 
remember
the barriers provided by tools/ being rather bizarre.


Maybe we should bite the bullet and use C11 atomics in tools/.  I've long 
planned an article "C11 atomics for kernel programmers", especially 
because this will also be an issue when Rust gets into the mix...
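
For the seq_cnt handshake in rseq_test the translation would be fairly
mechanical; a sketch, with fences chosen to be at least as strong as the
smp_*mb() calls they replace:

#include <stdatomic.h>

static atomic_int seq_cnt;

static void writer_side(void)                   /* migration worker */
{
        atomic_fetch_add_explicit(&seq_cnt, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_release);      /* was smp_wmb() */
        /* ... sched_setaffinity() ... */
        atomic_thread_fence(memory_order_release);      /* was smp_wmb() */
}

static void reader_side(void)                   /* main loop body */
{
        int snapshot;

        do {
                snapshot = atomic_load_explicit(&seq_cnt, memory_order_relaxed);
                atomic_thread_fence(memory_order_acquire);      /* was smp_rmb() */
                /* ... sched_getcpu() and __rseq.cpu_id reads ... */
                atomic_thread_fence(memory_order_acquire);      /* was smp_rmb() */
        } while (snapshot != atomic_load_explicit(&seq_cnt, memory_order_relaxed));
}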


Paolo



Re: [PATCH] KVM: selftests: Double check on the current CPU in rseq_test

2022-07-14 Thread Paolo Bonzini

On 7/14/22 10:06, Gavin Shan wrote:

In rseq_test, there are two threads created. Those two threads are
'main' and 'migration_thread' separately. We also have the assumption
that non-migration status on 'migration-worker' thread guarantees the
same non-migration status on 'main' thread. Unfortunately, the assumption
isn't true. The 'main' thread can be migrated from one CPU to another
one between the calls to sched_getcpu() and READ_ONCE(__rseq.cpu_id).
The following assert is raised eventually because of the mismatched
CPU numbers.

The issue can be reproduced on arm64 system occasionally.


Hmm, this does not seem to be a correct patch - the threads are already 
synchronizing using seq_cnt, like this:


migration                           main
---------                           ----
seq_cnt = 1
smp_wmb()
                                    snapshot = 0
                                    smp_rmb()
                                    cpu = sched_getcpu() reads 23
sched_setaffinity()
                                    rseq_cpu = __rseq.cpuid reads 35
                                    smp_rmb()
                                    snapshot != seq_cnt -> retry
smp_wmb()
seq_cnt = 2

sched_setaffinity() is guaranteed to block until the task is enqueued on 
an allowed CPU.


Can you check that smp_rmb() and smp_wmb() generate correct instructions 
on arm64?
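
For reference, on arm64 the tools/ barriers should expand to DMB
instructions along these lines (quoting from memory, so please
double-check tools/arch/arm64/include/asm/barrier.h and the disassembly
of the test binary):

#define smp_wmb()       asm volatile("dmb ishst" ::: "memory")
#define smp_rmb()       asm volatile("dmb ishld" ::: "memory")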


Paolo


   host# uname -r
   5.19.0-rc6-gavin+
   host# # cat /proc/cpuinfo | grep processor | tail -n 1
   processor: 223
   host# pwd
   /home/gavin/sandbox/linux.main/tools/testing/selftests/kvm
   host# for i in `seq 1 100`;   \
 do echo "> $i"; \
 ./rseq_test; sleep 3;   \
 done
   > 1
   > 2
   > 3
   > 4
   > 5
   > 6
    Test Assertion Failure 
 rseq_test.c:265: rseq_cpu == cpu
 pid=3925 tid=3925 errno=4 - Interrupted system call
1  0x00401963: main at rseq_test.c:265 (discriminator 2)
2  0xb044affb: ?? ??:0
3  0xb044b0c7: ?? ??:0
4  0x00401a6f: _start at ??:?
 rseq CPU = 4, sched CPU = 27

This fixes the issue by double-checking on the current CPU after
call to READ_ONCE(__rseq.cpu_id) and restarting the test if the
two consecutive CPU numbers aren't equal.

Fixes: 61e52f1630f5 ("KVM: selftests: Add a test for KVM_RUN+rseq to detect task 
migration bugs")
Signed-off-by: Gavin Shan 
---
  tools/testing/selftests/kvm/rseq_test.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/rseq_test.c 
b/tools/testing/selftests/kvm/rseq_test.c
index 4158da0da2bb..74709dd9f5b2 100644
--- a/tools/testing/selftests/kvm/rseq_test.c
+++ b/tools/testing/selftests/kvm/rseq_test.c
@@ -207,7 +207,7 @@ int main(int argc, char *argv[])
  {
int r, i, snapshot;
struct kvm_vm *vm;
-   u32 cpu, rseq_cpu;
+   u32 cpu, rseq_cpu, last_cpu;
  
  	/* Tell stdout not to buffer its content */

setbuf(stdout, NULL);
@@ -259,8 +259,9 @@ int main(int argc, char *argv[])
smp_rmb();
cpu = sched_getcpu();
rseq_cpu = READ_ONCE(__rseq.cpu_id);
+   last_cpu = sched_getcpu();
smp_rmb();
-   } while (snapshot != atomic_read(&seq_cnt));
+   } while (snapshot != atomic_read(&seq_cnt) || cpu != last_cpu);
  
  		TEST_ASSERT(rseq_cpu == cpu,

"rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu);




Re: [PATCH] KVM: selftests: Double check on the current CPU in rseq_test

2022-07-14 Thread Paolo Bonzini

On 7/14/22 10:06, Gavin Shan wrote:

In rseq_test, there are two threads created. Those two threads are
'main' and 'migration_thread' separately. We also have the assumption
that non-migration status on 'migration-worker' thread guarantees the
same non-migration status on 'main' thread. Unfortunately, the assumption
isn't true. The 'main' thread can be migrated from one CPU to another
one between the calls to sched_getcpu() and READ_ONCE(__rseq.cpu_id).
The following assert is raised eventually because of the mismatched
CPU numbers.

The issue can be reproduced on arm64 system occasionally.

   host# uname -r
   5.19.0-rc6-gavin+
   host# # cat /proc/cpuinfo | grep processor | tail -n 1
   processor: 223
   host# pwd
   /home/gavin/sandbox/linux.main/tools/testing/selftests/kvm
   host# for i in `seq 1 100`;   \
 do echo "> $i"; \
 ./rseq_test; sleep 3;   \
 done
   > 1
   > 2
   > 3
   > 4
   > 5
   > 6
    Test Assertion Failure 
 rseq_test.c:265: rseq_cpu == cpu
 pid=3925 tid=3925 errno=4 - Interrupted system call
1  0x00401963: main at rseq_test.c:265 (discriminator 2)
2  0xb044affb: ?? ??:0
3  0xb044b0c7: ?? ??:0
4  0x00401a6f: _start at ??:?
 rseq CPU = 4, sched CPU = 27

This fixes the issue by double-checking on the current CPU after
call to READ_ONCE(__rseq.cpu_id) and restarting the test if the
two consecutive CPU numbers aren't equal.

Fixes: 61e52f1630f5 ("KVM: selftests: Add a test for KVM_RUN+rseq to detect task 
migration bugs")
Signed-off-by: Gavin Shan 
---
  tools/testing/selftests/kvm/rseq_test.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/rseq_test.c 
b/tools/testing/selftests/kvm/rseq_test.c
index 4158da0da2bb..74709dd9f5b2 100644
--- a/tools/testing/selftests/kvm/rseq_test.c
+++ b/tools/testing/selftests/kvm/rseq_test.c
@@ -207,7 +207,7 @@ int main(int argc, char *argv[])
  {
int r, i, snapshot;
struct kvm_vm *vm;
-   u32 cpu, rseq_cpu;
+   u32 cpu, rseq_cpu, last_cpu;
  
  	/* Tell stdout not to buffer its content */

setbuf(stdout, NULL);
@@ -259,8 +259,9 @@ int main(int argc, char *argv[])
smp_rmb();
cpu = sched_getcpu();
rseq_cpu = READ_ONCE(__rseq.cpu_id);
+   last_cpu = sched_getcpu();
smp_rmb();
-   } while (snapshot != atomic_read(&seq_cnt));
+   } while (snapshot != atomic_read(&seq_cnt) || cpu != last_cpu);
  
  		TEST_ASSERT(rseq_cpu == cpu,

"rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu);


Queued for -rc, thanks.

Paolo



Re: [PATCH v7 00/22] Support SDEI Virtualization

2022-07-05 Thread Paolo Bonzini

On 6/24/22 15:12, Marc Zyngier wrote:

- as far as I know, the core Linux/arm64 maintainers have no plan to
   support APF. Without it, this is a pointless exercise. And even with
   it, this introduces a Linux specific behaviour in an otherwise
   architectural hypervisor (something I'm quite keen on avoiding)


Regarding non-architectural behavior, isn't that the same already for 
PTP?  I understand that the PTP hypercall is a much smaller 
implementation than SDEI+APF, but it goes to show that KVM is already 
not "architectural".


There are other cases where paravirtualized solutions can be useful. 
PTP is one but there are more where KVM/ARM does not have a solution 
yet, for example lock holder preemption.  Unless ARM (the company) has a 
way to receive input from developers and standardize the interface, 
similar to the RISC-V SIGs, vendor-specific hypercalls are a sad fact of 
life.  It just happened that until now KVM/ARM hasn't seen much use in 
some cases (such as desktop virtualization) where overcommitted hosts 
are more common.


Async page faults per se are not KVM specific, in fact Linux supported 
them for the IBM s390 hypervisor long before KVM added support.  They 
didn't exist on x86 and ARM, so the developers came up with a new 
hypercall API and for x86 honestly it wasn't great.  For ARM we learnt 
from the mistakes and it seems to me that SDEI is a good match for the 
feature.  If ARM wants to produce a standard interface for APF, whether 
based on SDEI or something else, we're all ears.


Regarding plans of core arm64 maintainers to support async page fault, 
can you provide a pointer to the discussion?  I agree that if there's a 
hard NACK for APF for whatever reason, the whole host-side code is 
pointless (including SDEI virtualization); but I would like to read more 
about it.



- It gives an incentive to other hypervisor vendors to add random crap
   to the Linux mm subsystem, which is even worse. At this stage, we
   might as well go back to the Xen PV days altogether.


return -EGREGIOUS;

Since you mention hypervisor vendors and there's only one hypervisor in 
Linux, I guess you're not talking about the host mm/ subsystem 
(otherwise yeah, FOLL_NOWAIT is only used by KVM async page faults).


So I suppose you're talking about the guest, and then yeah, it sucks to 
have multiple hypervisors providing the same functionality in different 
ways (or multiple hypervisors providing different subsets of PV 
functionality).  It happens on x86 with Hyper-V and KVM, and to a lesser 
extent Xen and VMware.


But again, KVM/ARM has already crossed that bridge with PTP support, and 
the guest needs exactly zero code in the Linux mm subsystem (both 
generic and arch-specific) to support asynchronous page faults.  There 
are 20 lines of code in do_notify_resume(), and the rest is just SDEI 
gunk.  Again, I would be happy to get a pointer to concrete objections 
from the Linux ARM64 maintainers.  Maybe a different implementation is 
possible, I don't know.


In any case it's absolutely not comparable to Xen PV, and you know it.


- I haven't seen any of the KVM/arm64 users actually asking for the
   APF horror, and the cloud vendors I directly asked had no plan to
   use it, and not using it on their x86 systems either


Please define "horror" in more technical terms.  And since this is the 
second time I'm calling you out on this, I'm also asking you to avoid 
hyperboles and similar rhetorical gimmicks in the future.


That said: Peter, Sean, Google uses or used postcopy extensively on GCE 
(https://dl.acm.org/doi/pdf/10.1145/3296975.3186415).  If it doesn't use 
it on x86, do you have any insights on why?



- no performance data nor workloads that could help making an informed
   decision have been disclosed, and the only argument in its favour
   seems to be "but x86 has it" (hardly a compelling one)


Again, this is just false: numbers have been posted 
(https://lwn.net/ml/linux-kernel/20210209050403.103143-1-gs...@redhat.com/ 
was the first result that came up from a quick mailing list search).  If 
they are not enough, please be more specific.


Thanks,

Paolo


Re: [PATCH kvm-unit-tests] MAINTAINERS: Change drew's email address

2022-06-24 Thread Paolo Bonzini
Queued, thanks.

Paolo




Re: [GIT PULL] KVM/arm64 fixes for 5.19, take #2

2022-06-23 Thread Paolo Bonzini

On 6/23/22 09:41, Marc Zyngier wrote:

   git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git 
tags/kvmarm-fixes-5.19-2


Pulled, thanks.

Paolo



[PATCH v7 21/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
declaration time rather than being fixed for all declarations. This will
be used in a follow-up commit to declare a cache in x86 with a capacity
of 512+ objects without having to increase the capacity of all caches in
KVM.

This change requires each cache now specify its capacity at runtime,
since the cache struct itself no longer has a fixed capacity known at
compile time. To protect against someone accidentally defining a
kvm_mmu_memory_cache struct directly (without the extra storage), this
commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().

In order to support different capacities, this commit changes the
objects pointer array to be dynamically allocated the first time the
cache is topped-up.

While here, opportunistically clean up the stack-allocated
kvm_mmu_memory_cache structs in riscv and arm64 to use designated
initializers.

No functional change intended.

Reviewed-by: Marc Zyngier 
Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-22-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/arm64/kvm/mmu.c  |  2 +-
 arch/riscv/kvm/mmu.c  |  5 +
 include/linux/kvm_host.h  |  1 +
 include/linux/kvm_types.h |  6 +-
 virt/kvm/kvm_main.c   | 33 ++---
 5 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index f5651a05b6a8..87f1cd0df36e 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -786,7 +786,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t 
guest_ipa,
 {
phys_addr_t addr;
int ret = 0;
-   struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
+   struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
 KVM_PGTABLE_PROT_R |
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 1c00695ebee7..081f8d2b9cf3 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -350,10 +350,7 @@ static int gstage_ioremap(struct kvm *kvm, gpa_t gpa, 
phys_addr_t hpa,
int ret = 0;
unsigned long pfn;
phys_addr_t addr, end;
-   struct kvm_mmu_memory_cache pcache;
-
-   memset(&pcache, 0, sizeof(pcache));
-   pcache.gfp_zero = __GFP_ZERO;
+   struct kvm_mmu_memory_cache pcache = { .gfp_zero = __GFP_ZERO };
 
end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
pfn = __phys_to_pfn(hpa);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a2bbdf3ab086..3554e48406e4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1356,6 +1356,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
 
 #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
 int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
+int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int 
capacity, int min);
 int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
 void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index f328a01db4fe..4d933518060f 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -85,12 +85,16 @@ struct gfn_to_pfn_cache {
  * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
  * holding MMU locks.  Note, these caches act more like prefetch buffers than
  * classical caches, i.e. objects are not returned to the cache on being freed.
+ *
+ * The @capacity field and @objects array are lazily initialized when the cache
+ * is topped up (__kvm_mmu_topup_memory_cache()).
  */
 struct kvm_mmu_memory_cache {
int nobjs;
gfp_t gfp_zero;
struct kmem_cache *kmem_cache;
-   void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
+   int capacity;
+   void **objects;
 };
 #endif
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5b8ae83e09d7..45188d11812c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -396,14 +396,31 @@ static inline void *mmu_memory_cache_alloc_obj(struct 
kvm_mmu_memory_cache *mc,
return (void *)__get_free_page(gfp_flags);
 }
 
-int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
+int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int 
capacity, int min)
 {
+   gfp_t gfp = GFP_KERNEL_ACCOUNT;
void *obj;
 
if (mc->nobjs >= min)
return 0;
-   while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
-   obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
+
+   if (unlikely(!mc->objects)) {
+   if (WARN_ON_ONCE(!capacity))
+   return -EIO;
+
+   mc->ob

[PATCH v7 23/23] KVM: x86/mmu: Avoid unnecessary flush on eager page split

2022-06-22 Thread Paolo Bonzini
The TLB flush before installing the newly-populated lower level
page table is unnecessary if the lower-level page table maps
the huge page identically.  KVM knows that this is the case if it
did not reuse an existing shadow page table; tell drop_large_spte()
to skip the flush in that case.

Extracted from a patch by David Matlack.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 32 
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 22681931921f..79c6a821ea0d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1135,7 +1135,7 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
rmap_remove(kvm, sptep);
 }
 
-static void drop_large_spte(struct kvm *kvm, u64 *sptep)
+static void drop_large_spte(struct kvm *kvm, u64 *sptep, bool flush)
 {
struct kvm_mmu_page *sp;
 
@@ -1143,7 +1143,9 @@ static void drop_large_spte(struct kvm *kvm, u64 *sptep)
WARN_ON(sp->role.level == PG_LEVEL_4K);
 
drop_spte(kvm, sptep);
-   kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
+
+   if (flush)
+   kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
KVM_PAGES_PER_HPAGE(sp->role.level));
 }
 
@@ -2283,7 +2285,7 @@ static void shadow_walk_next(struct 
kvm_shadow_walk_iterator *iterator)
 
 static void __link_shadow_page(struct kvm *kvm,
   struct kvm_mmu_memory_cache *cache, u64 *sptep,
-  struct kvm_mmu_page *sp)
+  struct kvm_mmu_page *sp, bool flush)
 {
u64 spte;
 
@@ -2291,10 +2293,11 @@ static void __link_shadow_page(struct kvm *kvm,
 
/*
 * If an SPTE is present already, it must be a leaf and therefore
-* a large one.  Drop it and flush the TLB before installing sp.
+* a large one.  Drop it, and flush the TLB if needed, before
+* installing sp.
 */
if (is_shadow_present_pte(*sptep))
-   drop_large_spte(kvm, sptep);
+   drop_large_spte(kvm, sptep, flush);
 
spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
 
@@ -2309,7 +2312,7 @@ static void __link_shadow_page(struct kvm *kvm,
 static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
 struct kvm_mmu_page *sp)
 {
-   __link_shadow_page(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, 
sptep, sp);
+   __link_shadow_page(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, 
sptep, sp, true);
 }
 
 static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
@@ -6172,6 +6175,7 @@ static void shadow_mmu_split_huge_page(struct kvm *kvm,
struct kvm_mmu_memory_cache *cache = &kvm->arch.split_desc_cache;
u64 huge_spte = READ_ONCE(*huge_sptep);
struct kvm_mmu_page *sp;
+   bool flush = false;
u64 *sptep, spte;
gfn_t gfn;
int index;
@@ -6189,20 +6193,24 @@ static void shadow_mmu_split_huge_page(struct kvm *kvm,
 * gfn-to-pfn translation since the SP is direct, so no need to
 * modify them.
 *
-* If a given SPTE points to a lower level page table, 
installing
-* such SPTEs would effectively unmap a potion of the huge page.
-* This is not an issue because __link_shadow_page() flushes 
the TLB
-* when the passed sp replaces a large SPTE.
+* However, if a given SPTE points to a lower level page table,
+* that lower level page table may only be partially populated.
+* Installing such SPTEs would effectively unmap a potion of the
+* huge page. Unmapping guest memory always requires a TLB flush
+* since a subsequent operation on the unmapped regions would
+* fail to detect the need to flush.
 */
-   if (is_shadow_present_pte(*sptep))
+   if (is_shadow_present_pte(*sptep)) {
+   flush |= !is_last_spte(*sptep, sp->role.level);
continue;
+   }
 
spte = make_huge_page_split_spte(kvm, huge_spte, sp->role, 
index);
mmu_spte_set(sptep, spte);
__rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access);
}
 
-   __link_shadow_page(kvm, cache, huge_sptep, sp);
+   __link_shadow_page(kvm, cache, huge_sptep, sp, flush);
 }
 
 static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
-- 
2.31.1



[PATCH v7 20/23] KVM: x86/mmu: pull call to drop_large_spte() into __link_shadow_page()

2022-06-22 Thread Paolo Bonzini
Before allocating a child shadow page table, all callers check
whether the parent already points to a huge page and, if so, they
drop that SPTE.  This is done by drop_large_spte().

However, the act that requires dropping the large SPTE is the
installation of the sp that is returned by kvm_mmu_get_child_sp(),
which happens in __link_shadow_page().  Move the call there
instead of having it in each and every caller.

To ensure that the shadow page is not linked twice if it was
present, do _not_ opportunistically make kvm_mmu_get_child_sp()
idempotent: instead, return an error value if the shadow page
already existed.  This is a bit more verbose, but clearer than
NULL.

Now that the drop_large_spte() name is not taken anymore,
remove the two underscores in front of __drop_large_spte().

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 43 +-
 arch/x86/kvm/mmu/paging_tmpl.h | 31 +++-
 2 files changed, 35 insertions(+), 39 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 36bc49f08d60..bf1ae5ebf41b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1135,26 +1135,16 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
rmap_remove(kvm, sptep);
 }
 
-
-static bool __drop_large_spte(struct kvm *kvm, u64 *sptep)
+static void drop_large_spte(struct kvm *kvm, u64 *sptep)
 {
-   if (is_large_pte(*sptep)) {
-   WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K);
-   drop_spte(kvm, sptep);
-   return true;
-   }
-
-   return false;
-}
+   struct kvm_mmu_page *sp;
 
-static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
-{
-   if (__drop_large_spte(vcpu->kvm, sptep)) {
-   struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+   sp = sptep_to_sp(sptep);
+   WARN_ON(sp->role.level == PG_LEVEL_4K);
 
-   kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
+   drop_spte(kvm, sptep);
+   kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
KVM_PAGES_PER_HPAGE(sp->role.level));
-   }
 }
 
 /*
@@ -2221,6 +2211,9 @@ static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct 
kvm_vcpu *vcpu,
 {
union kvm_mmu_page_role role;
 
+   if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
+   return ERR_PTR(-EEXIST);
+
role = kvm_mmu_child_role(sptep, direct, access);
return kvm_mmu_get_shadow_page(vcpu, gfn, role);
 }
@@ -2288,13 +2281,21 @@ static void shadow_walk_next(struct 
kvm_shadow_walk_iterator *iterator)
__shadow_walk_next(iterator, *iterator->sptep);
 }
 
-static void __link_shadow_page(struct kvm_mmu_memory_cache *cache, u64 *sptep,
+static void __link_shadow_page(struct kvm *kvm,
+  struct kvm_mmu_memory_cache *cache, u64 *sptep,
   struct kvm_mmu_page *sp)
 {
u64 spte;
 
BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
 
+   /*
+* If an SPTE is present already, it must be a leaf and therefore
+* a large one.  Drop it and flush the TLB before installing sp.
+*/
+   if (is_shadow_present_pte(*sptep))
+   drop_large_spte(kvm, sptep);
+
spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
 
mmu_spte_set(sptep, spte);
@@ -2308,7 +2309,7 @@ static void __link_shadow_page(struct 
kvm_mmu_memory_cache *cache, u64 *sptep,
 static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
 struct kvm_mmu_page *sp)
 {
-   __link_shadow_page(&vcpu->arch.mmu_pte_list_desc_cache, sptep, sp);
+   __link_shadow_page(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, 
sptep, sp);
 }
 
 static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
@@ -3080,11 +3081,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct 
kvm_page_fault *fault)
if (it.level == fault->goal_level)
break;
 
-   drop_large_spte(vcpu, it.sptep);
-   if (is_shadow_present_pte(*it.sptep))
-   continue;
-
sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, 
ACC_ALL);
+   if (sp == ERR_PTR(-EEXIST))
+   continue;
 
link_shadow_page(vcpu, it.sptep, sp);
if (fault->is_tdp && fault->huge_page_disallowed &&
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 24f292f3f93f..2448fa8d8438 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -648,15 +648,13 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct 
kvm_page_fault *fault,
gfn_t table_gfn;
 
clear_sp_write_flooding_count(it.sptep);
-   drop_large_spte(vcpu, it.sptep);
 
- 

[PATCH v7 16/23] KVM: x86/mmu: Update page stats in __rmap_add()

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

Update the page stats in __rmap_add() rather than at the call site. This
will avoid having to manually update page stats when splitting huge
pages in a subsequent commit.

No functional change intended.

Reviewed-by: Ben Gardon 
Reviewed-by: Peter Xu 
Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-17-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a8cdbe2958d9..7cca28d89a85 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1562,6 +1562,8 @@ static void __rmap_add(struct kvm *kvm,
 
sp = sptep_to_sp(spte);
kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
+   kvm_update_page_stats(kvm, sp->role.level, 1);
+
rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
rmap_count = pte_list_add(cache, spte, rmap_head);
 
@@ -2783,7 +2785,6 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct 
kvm_memory_slot *slot,
 
if (!was_rmapped) {
WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
-   kvm_update_page_stats(vcpu->kvm, level, 1);
rmap_add(vcpu, slot, sptep, gfn);
}
 
-- 
2.31.1




[PATCH v7 22/23] KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

Add support for Eager Page Splitting pages that are mapped by nested
MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB
pages, and then splitting all 2MiB pages to 4KiB pages.

Note, Eager Page Splitting is limited to nested MMUs as a policy rather
than due to any technical reason (the sp->role.guest_mode check could
just be deleted and Eager Page Splitting would work correctly for all
shadow MMU pages). There is really no reason to support Eager Page
Splitting for tdp_mmu=N, since such support will eventually be phased
out, and there is no current use case supporting Eager Page Splitting on
hosts where TDP is either disabled or unavailable in hardware.
Furthermore, future improvements to nested MMU scalability may diverge
the code from the legacy shadow paging implementation. These
improvements will be simpler to make if Eager Page Splitting does not
have to worry about legacy shadow paging.

Splitting huge pages mapped by nested MMUs requires dealing with some
extra complexity beyond that of the TDP MMU:

(1) The shadow MMU has a limit on the number of shadow pages that are
allowed to be allocated. So, as a policy, Eager Page Splitting
refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
pages available.

(2) Splitting a huge page may end up re-using an existing lower level
shadow page tables. This is unlike the TDP MMU which always allocates
new shadow page tables when splitting.

(3) When installing the lower level SPTEs, they must be added to the
rmap which may require allocating additional pte_list_desc structs.

Case (2) is especially interesting since it may require a TLB flush,
unlike the TDP MMU which can fully split huge pages without any TLB
flushes. Specifically, an existing lower level page table may point to
even lower level page tables that are not fully populated, effectively
unmapping a portion of the huge page, which requires a flush.  As of
this commit, a flush is always done after dropping the huge page
and before installing the lower level page table.

This TLB flush could instead be delayed until the MMU lock is about to be
dropped, which would batch flushes for multiple splits.  However these
flushes should be rare in practice (a huge page must be aliased in
multiple SPTEs and have been split for NX Huge Pages in only some of
them). Flushing immediately is simpler to plumb and also reduces the
chances of tripping over a CPU bug (e.g. see iTLB multihit).

Suggested-by: Peter Feiner 
[ This commit is based off of the original implementation of Eager Page
  Splitting from Peter in Google's kernel from 2016. ]
Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-23-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 .../admin-guide/kernel-parameters.txt |   3 +-
 arch/x86/include/asm/kvm_host.h   |  22 ++
 arch/x86/kvm/mmu/mmu.c| 261 +-
 3 files changed, 277 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 97c16aa2f53f..329f0f274e2b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2418,8 +2418,7 @@
the KVM_CLEAR_DIRTY ioctl, and only for the pages being
cleared.
 
-   Eager page splitting currently only supports splitting
-   huge pages mapped by the TDP MMU.
+   Eager page splitting is only supported when 
kvm.tdp_mmu=Y.
 
Default is Y (on).
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 64efe8c90c31..665667d61caf 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1338,6 +1338,28 @@ struct kvm_arch {
u32 max_vcpu_ids;
 
bool disable_nx_huge_pages;
+
+   /*
+* Memory caches used to allocate shadow pages when performing eager
+* page splitting. No need for a shadowed_info_cache since eager page
+* splitting only allocates direct shadow pages.
+*
+* Protected by kvm->slots_lock.
+*/
+   struct kvm_mmu_memory_cache split_shadow_page_cache;
+   struct kvm_mmu_memory_cache split_page_header_cache;
+
+   /*
+* Memory cache used to allocate pte_list_desc structs while splitting
+* huge pages. In the worst case, to split one huge page, 512
+* pte_list_desc structs are needed to add each lower level leaf sptep
+* to the rmap plus 1 to extend the parent_ptes rmap of the lower level
+* page table.
+*
+* Protected by kvm->slots_lock.
+*/
+#define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
+   struct kvm_mmu_memory_cache split_desc_cache;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu/mmu.c

[PATCH v7 11/23] KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page()

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

The vcpu pointer in kvm_mmu_alloc_shadow_page() is only used to get the
kvm pointer. So drop the vcpu pointer and just pass in the kvm pointer.

No functional change intended.

Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-12-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fab417e7bf6c..c5a88e8d1b53 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2056,7 +2056,7 @@ struct shadow_page_caches {
struct kvm_mmu_memory_cache *gfn_array_cache;
 };
 
-static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
+static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
  struct shadow_page_caches 
*caches,
  gfn_t gfn,
  struct hlist_head 
*sp_list,
@@ -2076,15 +2076,15 @@ static struct kvm_mmu_page 
*kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
 * depends on valid pages being added to the head of the list.  See
 * comments in kvm_zap_obsolete_pages().
 */
-   sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
-   list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
-   kvm_mod_used_mmu_pages(vcpu->kvm, +1);
+   sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;
+   list_add(&sp->link, &kvm->arch.active_mmu_pages);
+   kvm_mod_used_mmu_pages(kvm, +1);
 
sp->gfn = gfn;
sp->role = role;
hlist_add_head(&sp->hash_link, sp_list);
if (sp_has_gptes(sp))
-   account_shadowed(vcpu->kvm, sp);
+   account_shadowed(kvm, sp);
 
return sp;
 }
@@ -2103,7 +2103,7 @@ static struct kvm_mmu_page 
*__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role);
if (!sp) {
created = true;
-   sp = kvm_mmu_alloc_shadow_page(vcpu, caches, gfn, sp_list, 
role);
+   sp = kvm_mmu_alloc_shadow_page(vcpu->kvm, caches, gfn, sp_list, 
role);
}
 
trace_kvm_mmu_get_page(sp, created);
-- 
2.31.1




[PATCH v7 17/23] KVM: x86/mmu: Cache the access bits of shadowed translations

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

Splitting huge pages requires allocating/finding shadow pages to replace
the huge page. Shadow pages are keyed, in part, off the guest access
permissions they are shadowing. For fully direct MMUs, there is no
shadowing so the access bits in the shadow page role are always ACC_ALL.
But during shadow paging, the guest can enforce whatever access
permissions it wants.

In particular, eager page splitting needs to know the permissions to use
for the subpages, but KVM cannot retrieve them from the guest page
tables because eager page splitting does not have a vCPU.  Fortunately,
the guest access permissions are easy to cache whenever page faults or
FNAME(sync_page) update the shadow page tables; this is an extension of
the existing cache of the shadowed GFNs in the gfns array of the shadow
page.  The access bits only take up 3 bits, which leaves 61 bits left
over for gfns, which is more than enough.
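
As a standalone illustration of that packing (invented helper names; the
real accessors are kvm_mmu_page_get_gfn() and kvm_mmu_page_get_access()
in the diff below), each 64-bit entry keeps the gfn in the upper bits
and the 3-bit guest access mask in the low bits:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT	12
#define ACC_ALL		0x7ULL	/* exec | write | user */

static uint64_t pack_entry(uint64_t gfn, uint64_t access)
{
	return (gfn << PAGE_SHIFT) | (access & ACC_ALL);
}

static uint64_t entry_gfn(uint64_t entry)    { return entry >> PAGE_SHIFT; }
static uint64_t entry_access(uint64_t entry) { return entry & ACC_ALL; }

int main(void)
{
	uint64_t e = pack_entry(0x1234567, 0x5);

	printf("gfn=%#llx access=%#llx\n",
	       (unsigned long long)entry_gfn(e),
	       (unsigned long long)entry_access(e));
	return 0;
}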

Now that the gfns array caches more information than just GFNs, rename
it to shadowed_translation.

While here, preemptively fix up the WARN_ON() that detects gfn
mismatches in direct SPs. The WARN_ON() was paired with a
pr_err_ratelimited(), which means that users could sometimes see the
WARN without the accompanying error message. Fix this by outputting the
error message as part of the WARN splat, and opportunistically make
them WARN_ONCE() because if these ever fire, they are all but guaranteed
to fire a lot and will bring down the kernel.

Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-18-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu/mmu.c  | 85 +++--
 arch/x86/kvm/mmu/mmu_internal.h | 17 ++-
 arch/x86/kvm/mmu/paging_tmpl.h  |  9 +++-
 4 files changed, 84 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7e4c31b57a75..64efe8c90c31 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -713,7 +713,7 @@ struct kvm_vcpu_arch {
 
struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
struct kvm_mmu_memory_cache mmu_shadow_page_cache;
-   struct kvm_mmu_memory_cache mmu_gfn_array_cache;
+   struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
 
/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7cca28d89a85..13a059ad5dc7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -656,7 +656,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, 
bool maybe_indirect)
if (r)
return r;
if (maybe_indirect) {
-   r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_gfn_array_cache,
+   r = 
kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
   PT64_ROOT_MAX_LEVEL);
if (r)
return r;
@@ -669,7 +669,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
-   kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
+   kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
@@ -678,34 +678,68 @@ static void mmu_free_pte_list_desc(struct pte_list_desc 
*pte_list_desc)
kmem_cache_free(pte_list_desc_cache, pte_list_desc);
 }
 
+static bool sp_has_gptes(struct kvm_mmu_page *sp);
+
 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 {
if (sp->role.passthrough)
return sp->gfn;
 
if (!sp->role.direct)
-   return sp->gfns[index];
+   return sp->shadowed_translation[index] >> PAGE_SHIFT;
 
return sp->gfn + (index << ((sp->role.level - 1) * SPTE_LEVEL_BITS));
 }
 
-static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
+/*
+ * For leaf SPTEs, fetch the *guest* access permissions being shadowed. Note
+ * that the SPTE itself may have a more constrained access permissions that
+ * what the guest enforces. For example, a guest may create an executable
+ * huge PTE but KVM may disallow execution to mitigate iTLB multihit.
+ */
+static u32 kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
 {
-   if (sp->role.passthrough) {
-   WARN_ON_ONCE(gfn != sp->gfn);
-   return;
-   }
+   if (sp_has_gptes(sp))
+   return sp->shadowed_translation[index] & ACC_ALL;
 
-   if (!sp->role.direct) {
-   sp->gfns[index] = gfn;
+   /*
+* For direct MMUs (e.g. TDP or non-paging guests) or passthrough SPs,
+* KVM is not shadowing any g

[PATCH v7 18/23] KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

Currently make_huge_page_split_spte() assumes execute permissions can be
granted to any 4K SPTE when splitting huge pages. This is true for the
TDP MMU but is not necessarily true for the shadow MMU, since KVM may be
shadowing a non-executable huge page.

To fix this, pass in the role of the child shadow page where the huge
page will be split and derive the execution permission from that.  This
is correct because huge pages are always split with direct shadow page
and thus the shadow page role contains the correct access permissions.

No functional change intended.

Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-19-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/spte.c| 15 +++
 arch/x86/kvm/mmu/spte.h|  4 ++--
 arch/x86/kvm/mmu/tdp_mmu.c |  2 +-
 3 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index db294c1beea2..fb1f17504138 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -246,11 +246,10 @@ static u64 make_spte_executable(u64 spte)
  * This is used during huge page splitting to build the SPTEs that make up the
  * new page table.
  */
-u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte, int huge_level,
+u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte, union 
kvm_mmu_page_role role,
  int index)
 {
u64 child_spte;
-   int child_level;
 
if (WARN_ON_ONCE(!is_shadow_present_pte(huge_spte)))
return 0;
@@ -259,23 +258,23 @@ u64 make_huge_page_split_spte(struct kvm *kvm, u64 
huge_spte, int huge_level,
return 0;
 
child_spte = huge_spte;
-   child_level = huge_level - 1;
 
/*
 * The child_spte already has the base address of the huge page being
 * split. So we just have to OR in the offset to the page at the next
 * lower level for the given index.
 */
-   child_spte |= (index * KVM_PAGES_PER_HPAGE(child_level)) << PAGE_SHIFT;
+   child_spte |= (index * KVM_PAGES_PER_HPAGE(role.level)) << PAGE_SHIFT;
 
-   if (child_level == PG_LEVEL_4K) {
+   if (role.level == PG_LEVEL_4K) {
child_spte &= ~PT_PAGE_SIZE_MASK;
 
/*
-* When splitting to a 4K page, mark the page executable as the
-* NX hugepage mitigation no longer applies.
+* When splitting to a 4K page where execution is allowed, mark
+* the page executable as the NX hugepage mitigation no longer
+* applies.
 */
-   if (is_nx_huge_page_enabled(kvm))
+   if ((role.access & ACC_EXEC_MASK) && 
is_nx_huge_page_enabled(kvm))
child_spte = make_spte_executable(child_spte);
}
 
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 256f90587e8d..b5c855f5514f 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -421,8 +421,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page 
*sp,
   unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
   u64 old_spte, bool prefetch, bool can_unsync,
   bool host_writable, u64 *new_spte);
-u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte, int huge_level,
- int index);
+u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
+ union kvm_mmu_page_role role, int index);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
 u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
 u64 mark_spte_for_access_track(u64 spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 522e2532343b..f3a430d64975 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1478,7 +1478,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, 
struct tdp_iter *iter,
 * not been linked in yet and thus is not reachable from any other CPU.
 */
for (i = 0; i < SPTE_ENT_PER_PAGE; i++)
-   sp->spt[i] = make_huge_page_split_spte(kvm, huge_spte, level, 
i);
+   sp->spt[i] = make_huge_page_split_spte(kvm, huge_spte, 
sp->role, i);
 
/*
 * Replace the huge spte with a pointer to the populated lower level
-- 
2.31.1




[PATCH v7 19/23] KVM: x86/mmu: Zap collapsible SPTEs in shadow MMU at all possible levels

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU. This
is fine for now since KVM never creates intermediate huge pages during
dirty logging. In other words, KVM always replaces 1GiB pages directly
with 4KiB pages, so there is no reason to look for collapsible 2MiB
pages.

However, this will stop being true once the shadow MMU participates in
eager page splitting. During eager page splitting, each 1GiB page is first
split into 2MiB pages and then those are split into 4KiB pages. The
intermediate 2MiB pages may be left behind if an error condition causes
eager page splitting to bail early.

No functional change intended.

Reviewed-by: Peter Xu 
Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-20-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 13a059ad5dc7..36bc49f08d60 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6154,18 +6154,25 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm 
*kvm,
return need_tlb_flush;
 }
 
+static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
+  const struct kvm_memory_slot *slot)
+{
+   /*
+* Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
+* pages that are already mapped at the maximum possible level.
+*/
+   if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
+ PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1,
+ true))
+   kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+}
+
 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
   const struct kvm_memory_slot *slot)
 {
if (kvm_memslots_have_rmaps(kvm)) {
write_lock(&kvm->mmu_lock);
-   /*
-* Zap only 4k SPTEs since the legacy MMU only supports dirty
-* logging at a 4k granularity and never creates collapsible
-* 2m SPTEs during dirty logging.
-*/
-   if (slot_handle_level_4k(kvm, slot, 
kvm_mmu_zap_collapsible_spte, true))
-   kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+   kvm_rmap_zap_collapsible_sptes(kvm, slot);
write_unlock(&kvm->mmu_lock);
}
 
-- 
2.31.1




[PATCH v7 14/23] KVM: x86/mmu: Pass const memslot to rmap_add()

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

Constify rmap_add()'s @slot parameter; it is simply passed on to
gfn_to_rmap(), which takes a const memslot.

No functional change intended.

Reviewed-by: Ben Gardon 
Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-15-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a7748c5a2385..45a4e85c0b2c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1556,7 +1556,7 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct 
kvm_rmap_head *rmap_head,
 
 #define RMAP_RECYCLE_THRESHOLD 1000
 
-static void rmap_add(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
+static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
 u64 *spte, gfn_t gfn)
 {
struct kvm_mmu_page *sp;
-- 
2.31.1




[PATCH v7 13/23] KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page()

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

Allow @vcpu to be NULL in kvm_mmu_find_shadow_page() (and its only
caller __kvm_mmu_get_shadow_page()). @vcpu is only required to sync
indirect shadow pages, so it's safe to pass in NULL when looking up
direct shadow pages.

This will be used for doing eager page splitting, which allocates direct
shadow pages from the context of a VM ioctl without access to a vCPU
pointer.

Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-14-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 88b3f3c2c8b1..a7748c5a2385 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1975,6 +1975,12 @@ static void clear_sp_write_flooding_count(u64 *spte)
__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
+/*
+ * The vCPU is required when finding indirect shadow pages; the shadow
+ * page may already exist and syncing it needs the vCPU pointer in
+ * order to read guest page tables.  Direct shadow pages are never
+ * unsync, thus @vcpu can be NULL if @role.direct is true.
+ */
 static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
 struct kvm_vcpu *vcpu,
 gfn_t gfn,
@@ -2013,6 +2019,9 @@ static struct kvm_mmu_page 
*kvm_mmu_find_shadow_page(struct kvm *kvm,
goto out;
 
if (sp->unsync) {
+   if (KVM_BUG_ON(!vcpu, kvm))
+   break;
+
/*
 * The page is good, but is stale.  kvm_sync_page does
 * get the latest guest state, but (unlike 
mmu_unsync_children)
@@ -2090,6 +2099,7 @@ static struct kvm_mmu_page 
*kvm_mmu_alloc_shadow_page(struct kvm *kvm,
return sp;
 }
 
+/* Note, @vcpu may be NULL if @role.direct is true; see 
kvm_mmu_find_shadow_page. */
 static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
  struct kvm_vcpu *vcpu,
  struct shadow_page_caches 
*caches,
-- 
2.31.1




[PATCH v7 06/23] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

Decompose kvm_mmu_get_page() into separate helper functions to increase
readability and prepare for allocating shadow pages without a vcpu
pointer.

Specifically, pull the guts of kvm_mmu_get_page() into 2 helper
functions:

kvm_mmu_find_shadow_page() -
  Walks the page hash checking for any existing mmu pages that match the
  given gfn and role.

kvm_mmu_alloc_shadow_page()
  Allocates and initializes an entirely new kvm_mmu_page. This currently
  requires a vcpu pointer for allocation and looking up the memslot but
  that will be removed in a future commit.

No functional change intended.

Reviewed-by: Sean Christopherson 
Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-7-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 52 +++---
 1 file changed, 39 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f4e7978a6c6a..a59fe860da29 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1993,16 +1993,16 @@ static void clear_sp_write_flooding_count(u64 *spte)
__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-union kvm_mmu_page_role role)
+static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
+gfn_t gfn,
+struct hlist_head *sp_list,
+union kvm_mmu_page_role 
role)
 {
-   struct hlist_head *sp_list;
struct kvm_mmu_page *sp;
int ret;
int collisions = 0;
LIST_HEAD(invalid_list);
 
-   sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
for_each_valid_sp(vcpu->kvm, sp, sp_list) {
if (sp->gfn != gfn) {
collisions++;
@@ -2027,7 +2027,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu, gfn_t gfn,
 
/* unsync and write-flooding only apply to indirect SPs. */
if (sp->role.direct)
-   goto trace_get_page;
+   goto out;
 
if (sp->unsync) {
/*
@@ -2053,14 +2053,26 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu, gfn_t gfn,
 
__clear_sp_write_flooding_count(sp);
 
-trace_get_page:
-   trace_kvm_mmu_get_page(sp, false);
goto out;
}
 
+   sp = NULL;
++vcpu->kvm->stat.mmu_cache_miss;
 
-   sp = kvm_mmu_alloc_page(vcpu, role.direct);
+out:
+   kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+
+   if (collisions > vcpu->kvm->stat.max_mmu_page_hash_collisions)
+   vcpu->kvm->stat.max_mmu_page_hash_collisions = collisions;
+   return sp;
+}
+
+static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
+ gfn_t gfn,
+ struct hlist_head 
*sp_list,
+ union kvm_mmu_page_role 
role)
+{
+   struct kvm_mmu_page *sp = kvm_mmu_alloc_page(vcpu, role.direct);
 
sp->gfn = gfn;
sp->role = role;
@@ -2070,12 +2082,26 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu, gfn_t gfn,
if (role.level == PG_LEVEL_4K && 
kvm_vcpu_write_protect_gfn(vcpu, gfn))
kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
}
-   trace_kvm_mmu_get_page(sp, true);
-out:
-   kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
 
-   if (collisions > vcpu->kvm->stat.max_mmu_page_hash_collisions)
-   vcpu->kvm->stat.max_mmu_page_hash_collisions = collisions;
+   return sp;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+union kvm_mmu_page_role role)
+{
+   struct hlist_head *sp_list;
+   struct kvm_mmu_page *sp;
+   bool created = false;
+
+   sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+
+   sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role);
+   if (!sp) {
+   created = true;
+   sp = kvm_mmu_alloc_shadow_page(vcpu, gfn, sp_list, role);
+   }
+
+   trace_kvm_mmu_get_page(sp, created);
return sp;
 }
 
-- 
2.31.1




[PATCH v7 15/23] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

Allow adding new entries to the rmap and linking shadow pages without a
struct kvm_vcpu pointer by moving the implementation of rmap_add() and
link_shadow_page() into inner helper functions.

No functional change intended.

Reviewed-by: Ben Gardon 
Reviewed-by: Peter Xu 
Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-16-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 47 ++
 1 file changed, 29 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 45a4e85c0b2c..a8cdbe2958d9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -673,11 +673,6 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
-static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
-{
-   return kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_list_desc_cache);
-}
-
 static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 {
kmem_cache_free(pte_list_desc_cache, pte_list_desc);
@@ -832,7 +827,7 @@ gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t 
gfn,
 /*
  * Returns the number of pointers in the rmap chain, not counting the new one.
  */
-static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
+static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
struct kvm_rmap_head *rmap_head)
 {
struct pte_list_desc *desc;
@@ -843,7 +838,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
rmap_head->val = (unsigned long)spte;
} else if (!(rmap_head->val & 1)) {
rmap_printk("%p %llx 1->many\n", spte, *spte);
-   desc = mmu_alloc_pte_list_desc(vcpu);
+   desc = kvm_mmu_memory_cache_alloc(cache);
desc->sptes[0] = (u64 *)rmap_head->val;
desc->sptes[1] = spte;
desc->spte_count = 2;
@@ -855,7 +850,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
while (desc->spte_count == PTE_LIST_EXT) {
count += PTE_LIST_EXT;
if (!desc->more) {
-   desc->more = mmu_alloc_pte_list_desc(vcpu);
+   desc->more = kvm_mmu_memory_cache_alloc(cache);
desc = desc->more;
desc->spte_count = 0;
break;
@@ -1556,8 +1551,10 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct 
kvm_rmap_head *rmap_head,
 
 #define RMAP_RECYCLE_THRESHOLD 1000
 
-static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
-u64 *spte, gfn_t gfn)
+static void __rmap_add(struct kvm *kvm,
+  struct kvm_mmu_memory_cache *cache,
+  const struct kvm_memory_slot *slot,
+  u64 *spte, gfn_t gfn)
 {
struct kvm_mmu_page *sp;
struct kvm_rmap_head *rmap_head;
@@ -1566,15 +1563,23 @@ static void rmap_add(struct kvm_vcpu *vcpu, const 
struct kvm_memory_slot *slot,
sp = sptep_to_sp(spte);
kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
-   rmap_count = pte_list_add(vcpu, spte, rmap_head);
+   rmap_count = pte_list_add(cache, spte, rmap_head);
 
if (rmap_count > RMAP_RECYCLE_THRESHOLD) {
-   kvm_unmap_rmapp(vcpu->kvm, rmap_head, NULL, gfn, 
sp->role.level, __pte(0));
+   kvm_unmap_rmapp(kvm, rmap_head, NULL, gfn, sp->role.level, 
__pte(0));
kvm_flush_remote_tlbs_with_address(
-   vcpu->kvm, sp->gfn, 
KVM_PAGES_PER_HPAGE(sp->role.level));
+   kvm, sp->gfn, 
KVM_PAGES_PER_HPAGE(sp->role.level));
}
 }
 
+static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
+u64 *spte, gfn_t gfn)
+{
+   struct kvm_mmu_memory_cache *cache = 
&vcpu->arch.mmu_pte_list_desc_cache;
+
+   __rmap_add(vcpu->kvm, cache, slot, spte, gfn);
+}
+
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
bool young = false;
@@ -1645,13 +1650,13 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn)
return hash_64(gfn, KVM_MMU_HASH_SHIFT);
 }
 
-static void mmu_page_add_parent_pte(struct kvm_vcpu *vcpu,
+static void mmu_page_add_parent_pte(struct kvm_mmu_memory_cache *cache,
struct kvm_mmu_page *sp, u64 *parent_pte)
 {
if (!parent_pte)
return;
 
-   pte_list_add(vcpu, parent_pte, &sp->parent_ptes);
+   pte_list_add(cache, parent_pte, &sp->parent_ptes);
 }
 
 static void mmu_page_remove_parent_pte(struct k

[PATCH v7 09/23] KVM: x86/mmu: Move guest PT write-protection to account_shadowed()

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

Move the code that write-protects newly-shadowed guest page tables into
account_shadowed(). This avoids an extra gfn-to-memslot lookup and is a
more logical place for this code to live. But most importantly, this
reduces kvm_mmu_alloc_shadow_page()'s reliance on having a struct
kvm_vcpu pointer, which will be necessary when creating new shadow pages
during VM ioctls for eager page splitting.

Note, it is safe to drop the role.level == PG_LEVEL_4K check since
account_shadowed() returns early if role.level > PG_LEVEL_4K.

No functional change intended.

Reviewed-by: Sean Christopherson 
Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-10-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bd45364bf465..2602c3642f23 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -766,6 +766,9 @@ static void account_shadowed(struct kvm *kvm, struct 
kvm_mmu_page *sp)
KVM_PAGE_TRACK_WRITE);
 
kvm_mmu_gfn_disallow_lpage(slot, gfn);
+
+   if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
+   kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
 }
 
 void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
@@ -2072,11 +2075,8 @@ static struct kvm_mmu_page 
*kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
sp->gfn = gfn;
sp->role = role;
hlist_add_head(&sp->hash_link, sp_list);
-   if (sp_has_gptes(sp)) {
+   if (sp_has_gptes(sp))
account_shadowed(vcpu->kvm, sp);
-   if (role.level == PG_LEVEL_4K && 
kvm_vcpu_write_protect_gfn(vcpu, gfn))
-   kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
-   }
 
return sp;
 }
-- 
2.31.1




[PATCH v7 10/23] KVM: x86/mmu: Pass memory caches to allocate SPs separately

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

Refactor kvm_mmu_alloc_shadow_page() to receive the caches from which it
will allocate the various pieces of memory for shadow pages as a
parameter, rather than deriving them from the vcpu pointer. This will be
useful in a future commit where shadow pages are allocated during VM
ioctls for eager page splitting, and thus will use a different set of
caches.

Preemptively pull the caches out all the way to
kvm_mmu_get_shadow_page() since eager page splitting will not be calling
kvm_mmu_alloc_shadow_page() directly.

No functional change intended.

Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-11-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 36 +---
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2602c3642f23..fab417e7bf6c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2049,17 +2049,25 @@ static struct kvm_mmu_page 
*kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
return sp;
 }
 
+/* Caches used when allocating a new shadow page. */
+struct shadow_page_caches {
+   struct kvm_mmu_memory_cache *page_header_cache;
+   struct kvm_mmu_memory_cache *shadow_page_cache;
+   struct kvm_mmu_memory_cache *gfn_array_cache;
+};
+
 static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
+ struct shadow_page_caches 
*caches,
  gfn_t gfn,
  struct hlist_head 
*sp_list,
  union kvm_mmu_page_role 
role)
 {
struct kvm_mmu_page *sp;
 
-   sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-   sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+   sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
+   sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
if (!role.direct)
-   sp->gfns = 
kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
+   sp->gfns = kvm_mmu_memory_cache_alloc(caches->gfn_array_cache);
 
set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
@@ -2081,9 +2089,10 @@ static struct kvm_mmu_page 
*kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
return sp;
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
-   gfn_t gfn,
-   union kvm_mmu_page_role 
role)
+static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+ struct shadow_page_caches 
*caches,
+ gfn_t gfn,
+ union kvm_mmu_page_role 
role)
 {
struct hlist_head *sp_list;
struct kvm_mmu_page *sp;
@@ -2094,13 +2103,26 @@ static struct kvm_mmu_page 
*kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role);
if (!sp) {
created = true;
-   sp = kvm_mmu_alloc_shadow_page(vcpu, gfn, sp_list, role);
+   sp = kvm_mmu_alloc_shadow_page(vcpu, caches, gfn, sp_list, 
role);
}
 
trace_kvm_mmu_get_page(sp, created);
return sp;
 }
 
+static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+   gfn_t gfn,
+   union kvm_mmu_page_role 
role)
+{
+   struct shadow_page_caches caches = {
+   .page_header_cache = &vcpu->arch.mmu_page_header_cache,
+   .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
+   .gfn_array_cache = &vcpu->arch.mmu_gfn_array_cache,
+   };
+
+   return __kvm_mmu_get_shadow_page(vcpu, &caches, gfn, role);
+}
+
 static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, 
unsigned int access)
 {
struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
-- 
2.31.1




[PATCH v7 08/23] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

Rename 2 functions:

  kvm_mmu_get_page() -> kvm_mmu_get_shadow_page()
  kvm_mmu_free_page() -> kvm_mmu_free_shadow_page()

This change makes it clear that these functions deal with shadow pages
rather than struct pages. It also aligns these functions with the naming
scheme for kvm_mmu_find_shadow_page() and kvm_mmu_alloc_shadow_page().

Prefer "shadow_page" over the shorter "sp" since these are core
functions and the line lengths aren't terrible.

No functional change intended.

Reviewed-by: Sean Christopherson 
Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-9-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8b84cdd8c6cd..bd45364bf465 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1626,7 +1626,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm 
*kvm, long nr)
percpu_counter_add(_total_used_mmu_pages, nr);
 }
 
-static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
+static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
 {
MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
hlist_del(&sp->hash_link);
@@ -2081,8 +2081,9 @@ static struct kvm_mmu_page 
*kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
return sp;
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-union kvm_mmu_page_role role)
+static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+   gfn_t gfn,
+   union kvm_mmu_page_role 
role)
 {
struct hlist_head *sp_list;
struct kvm_mmu_page *sp;
@@ -2146,7 +2147,7 @@ static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct 
kvm_vcpu *vcpu,
union kvm_mmu_page_role role;
 
role = kvm_mmu_child_role(sptep, direct, access);
-   return kvm_mmu_get_page(vcpu, gfn, role);
+   return kvm_mmu_get_shadow_page(vcpu, gfn, role);
 }
 
 static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator 
*iterator,
@@ -2422,7 +2423,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 
list_for_each_entry_safe(sp, nsp, invalid_list, link) {
WARN_ON(!sp->role.invalid || sp->root_count);
-   kvm_mmu_free_page(sp);
+   kvm_mmu_free_shadow_page(sp);
}
 }
 
@@ -3415,7 +3416,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t 
gfn, int quadrant,
WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
 
-   sp = kvm_mmu_get_page(vcpu, gfn, role);
+   sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
++sp->root_count;
 
return __pa(sp->spt);
-- 
2.31.1




[PATCH v7 03/23] KVM: x86/mmu: Stop passing "direct" to mmu_alloc_root()

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

The "direct" argument is vcpu->arch.mmu->root_role.direct,
because unlike non-root page tables, it's impossible to have
a direct root in an indirect MMU.  So just use that.

Suggested-by: Lai Jiangshan 
Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-4-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 844b58ddb3bb..2e30398fe59f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3369,8 +3369,9 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t 
root_gfn)
 }
 
 static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
-   u8 level, bool direct)
+   u8 level)
 {
+   bool direct = vcpu->arch.mmu->root_role.direct;
struct kvm_mmu_page *sp;
 
sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
@@ -3396,7 +3397,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
mmu->root.hpa = root;
} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
-   root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
+   root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
mmu->root.hpa = root;
} else if (shadow_root_level == PT32E_ROOT_LEVEL) {
if (WARN_ON_ONCE(!mmu->pae_root)) {
@@ -3408,7 +3409,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
 
root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
- i << 30, PT32_ROOT_LEVEL, true);
+ i << 30, PT32_ROOT_LEVEL);
mmu->pae_root[i] = root | PT_PRESENT_MASK |
   shadow_me_value;
}
@@ -3532,7 +3533,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 */
if (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
root = mmu_alloc_root(vcpu, root_gfn, 0,
- mmu->root_role.level, false);
+ mmu->root_role.level);
mmu->root.hpa = root;
goto set_root_pgd;
}
@@ -3578,7 +3579,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
}
 
root = mmu_alloc_root(vcpu, root_gfn, i << 30,
- PT32_ROOT_LEVEL, false);
+ PT32_ROOT_LEVEL);
mmu->pae_root[i] = root | pm_mask;
}
 
-- 
2.31.1




[PATCH v7 12/23] KVM: x86/mmu: Pass kvm pointer separately from vcpu to kvm_mmu_find_shadow_page()

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

Get the kvm pointer from the caller, rather than deriving it from
vcpu->kvm, and plumb the kvm pointer all the way from
kvm_mmu_get_shadow_page(). With this change in place, the vcpu pointer
is only needed to sync indirect shadow pages. In other words,
__kvm_mmu_get_shadow_page() can now be used to get *direct* shadow pages
without a vcpu pointer. This enables eager page splitting, which needs
to allocate direct shadow pages during VM ioctls.

No functional change intended.

Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-13-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 28 +++-
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c5a88e8d1b53..88b3f3c2c8b1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1975,7 +1975,8 @@ static void clear_sp_write_flooding_count(u64 *spte)
__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
+static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
+struct kvm_vcpu *vcpu,
 gfn_t gfn,
 struct hlist_head *sp_list,
 union kvm_mmu_page_role 
role)
@@ -1985,7 +1986,7 @@ static struct kvm_mmu_page 
*kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
int collisions = 0;
LIST_HEAD(invalid_list);
 
-   for_each_valid_sp(vcpu->kvm, sp, sp_list) {
+   for_each_valid_sp(kvm, sp, sp_list) {
if (sp->gfn != gfn) {
collisions++;
continue;
@@ -2002,7 +2003,7 @@ static struct kvm_mmu_page 
*kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
 * upper-level page will be write-protected.
 */
if (role.level > PG_LEVEL_4K && sp->unsync)
-   kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
+   kvm_mmu_prepare_zap_page(kvm, sp,
 &invalid_list);
continue;
}
@@ -2030,7 +2031,7 @@ static struct kvm_mmu_page 
*kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
 
WARN_ON(!list_empty(&invalid_list));
if (ret > 0)
-   kvm_flush_remote_tlbs(vcpu->kvm);
+   kvm_flush_remote_tlbs(kvm);
}
 
__clear_sp_write_flooding_count(sp);
@@ -2039,13 +2040,13 @@ static struct kvm_mmu_page 
*kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
}
 
sp = NULL;
-   ++vcpu->kvm->stat.mmu_cache_miss;
+   ++kvm->stat.mmu_cache_miss;
 
 out:
-   kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+   kvm_mmu_commit_zap_page(kvm, &invalid_list);
 
-   if (collisions > vcpu->kvm->stat.max_mmu_page_hash_collisions)
-   vcpu->kvm->stat.max_mmu_page_hash_collisions = collisions;
+   if (collisions > kvm->stat.max_mmu_page_hash_collisions)
+   kvm->stat.max_mmu_page_hash_collisions = collisions;
return sp;
 }
 
@@ -2089,7 +2090,8 @@ static struct kvm_mmu_page 
*kvm_mmu_alloc_shadow_page(struct kvm *kvm,
return sp;
 }
 
-static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
+ struct kvm_vcpu *vcpu,
  struct shadow_page_caches 
*caches,
  gfn_t gfn,
  union kvm_mmu_page_role 
role)
@@ -2098,12 +2100,12 @@ static struct kvm_mmu_page 
*__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp;
bool created = false;
 
-   sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+   sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 
-   sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role);
+   sp = kvm_mmu_find_shadow_page(kvm, vcpu, gfn, sp_list, role);
if (!sp) {
created = true;
-   sp = kvm_mmu_alloc_shadow_page(vcpu->kvm, caches, gfn, sp_list, 
role);
+   sp = kvm_mmu_alloc_shadow_page(kvm, caches, gfn, sp_list, role);
}
 
trace_kvm_mmu_get_page(sp, created);
@@ -2120,7 +2122,7 @@ static struct kvm_mmu_page 
*kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
.gfn_array_cache = &vcpu->arch.mmu_gfn_array_cache,
};
 
-   return __kvm_mmu_get_shadow_page(

[PATCH v7 04/23] KVM: x86/mmu: Derive shadow MMU page role from parent

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

Instead of computing the shadow page role from scratch for every new
page, derive most of the information from the parent shadow page.  This
eliminates the dependency on the vCPU root role to allocate shadow page
tables, and reduces the number of parameters to kvm_mmu_get_page().

Preemptively split out the role calculation to a separate function for
use in a following commit.

Note that when calculating the MMU root role, we can take
@role.passthrough, @role.direct, and @role.access directly from
@vcpu->arch.mmu->root_role. Only @role.level and @role.quadrant still
must be overridden for PAE page directories, when shadowing 32-bit
guest page tables with PAE page tables.

No functional change intended.
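
A minimal standalone sketch of what "derive from the parent" means here
(toy types and names; the 4-byte-gpte quadrant handling is omitted, and
the real logic is kvm_mmu_child_role() in the full patch):

#include <stdio.h>

struct toy_role {
	unsigned int level;
	unsigned int access;
	unsigned int direct;
	unsigned int passthrough;
};

/* The child inherits the parent's role; only a few fields change. */
static struct toy_role toy_child_role(struct toy_role parent,
				      unsigned int direct, unsigned int access)
{
	struct toy_role child = parent;

	child.level = parent.level - 1;
	child.direct = direct;
	child.access = access;
	child.passthrough = 0;
	return child;
}

int main(void)
{
	struct toy_role root = { .level = 4, .access = 0x7 };
	struct toy_role child = toy_child_role(root, 0, 0x3);

	printf("child: level=%u access=%#x\n", child.level, child.access);
	return 0;
}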

Reviewed-by: Peter Xu 
Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-5-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 114 +++--
 arch/x86/kvm/mmu/paging_tmpl.h |   9 +--
 2 files changed, 71 insertions(+), 52 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2e30398fe59f..fd1b479bf7fc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1993,49 +1993,15 @@ static void clear_sp_write_flooding_count(u64 *spte)
__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
-gfn_t gfn,
-gva_t gaddr,
-unsigned level,
-bool direct,
-unsigned int access)
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+union kvm_mmu_page_role role)
 {
-   union kvm_mmu_page_role role;
struct hlist_head *sp_list;
-   unsigned quadrant;
struct kvm_mmu_page *sp;
int ret;
int collisions = 0;
LIST_HEAD(invalid_list);
 
-   role = vcpu->arch.mmu->root_role;
-   role.level = level;
-   role.direct = direct;
-   role.access = access;
-   if (role.has_4_byte_gpte) {
-   /*
-* If the guest has 4-byte PTEs then that means it's using 
32-bit,
-* 2-level, non-PAE paging. KVM shadows such guests with PAE 
paging
-* (i.e. 8-byte PTEs). The difference in PTE size means that 
KVM must
-* shadow each guest page table with multiple shadow page 
tables, which
-* requires extra bookkeeping in the role.
-*
-* Specifically, to shadow the guest's page directory (which 
covers a
-* 4GiB address space), KVM uses 4 PAE page directories, each 
mapping
-* 1GiB of the address space. @role.quadrant encodes which 
quarter of
-* the address space each maps.
-*
-* To shadow the guest's page tables (which each map a 4MiB 
region), KVM
-* uses 2 PAE page tables, each mapping a 2MiB region. For 
these,
-* @role.quadrant encodes which half of the region they map.
-*/
-   quadrant = gaddr >> (PAGE_SHIFT + (SPTE_LEVEL_BITS * level));
-   quadrant &= (1 << level) - 1;
-   role.quadrant = quadrant;
-   }
-   if (level <= vcpu->arch.mmu->cpu_role.base.level)
-   role.passthrough = 0;
-
sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
for_each_valid_sp(vcpu->kvm, sp, sp_list) {
if (sp->gfn != gfn) {
@@ -2053,7 +2019,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
 * Unsync pages must not be left as is, because the new
 * upper-level page will be write-protected.
 */
-   if (level > PG_LEVEL_4K && sp->unsync)
+   if (role.level > PG_LEVEL_4K && sp->unsync)
kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
 _list);
continue;
@@ -2094,14 +2060,14 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
 
++vcpu->kvm->stat.mmu_cache_miss;
 
-   sp = kvm_mmu_alloc_page(vcpu, direct);
+   sp = kvm_mmu_alloc_page(vcpu, role.direct);
 
sp->gfn = gfn;
sp->role = role;
hlist_add_head(&sp->hash_link, sp_list);
if (sp_has_gptes(sp)) {
account_shadowed(vcpu->kvm, sp);
-   if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, 
gfn))
+   if (role.level == PG_LEVEL_4K &&a

[PATCH v7 07/23] KVM: x86/mmu: Consolidate shadow page allocation and initialization

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

Consolidate kvm_mmu_alloc_page() and kvm_mmu_alloc_shadow_page() under
the latter so that all shadow page allocation and initialization happens
in one place.

No functional change intended.

Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-8-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 39 +--
 1 file changed, 17 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a59fe860da29..8b84cdd8c6cd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1664,27 +1664,6 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
mmu_spte_clear_no_track(parent_pte);
 }
 
-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, bool 
direct)
-{
-   struct kvm_mmu_page *sp;
-
-   sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-   sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
-   if (!direct)
-   sp->gfns = 
kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
-   set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
-
-   /*
-* active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
-* depends on valid pages being added to the head of the list.  See
-* comments in kvm_zap_obsolete_pages().
-*/
-   sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
-   list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
-   kvm_mod_used_mmu_pages(vcpu->kvm, +1);
-   return sp;
-}
-
 static void mark_unsync(u64 *spte);
 static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
 {
@@ -2072,7 +2051,23 @@ static struct kvm_mmu_page 
*kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
  struct hlist_head 
*sp_list,
  union kvm_mmu_page_role 
role)
 {
-   struct kvm_mmu_page *sp = kvm_mmu_alloc_page(vcpu, role.direct);
+   struct kvm_mmu_page *sp;
+
+   sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
+   sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+   if (!role.direct)
+   sp->gfns = 
kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
+
+   set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+
+   /*
+* active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
+* depends on valid pages being added to the head of the list.  See
+* comments in kvm_zap_obsolete_pages().
+*/
+   sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
+   list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
+   kvm_mod_used_mmu_pages(vcpu->kvm, +1);
 
sp->gfn = gfn;
sp->role = role;
-- 
2.31.1




[PATCH v7 00/23] KVM: Extend Eager Page Splitting to the shadow MMU

2022-06-22 Thread Paolo Bonzini
For the description of the "why" of this patch, I'll just direct you to
David's excellent cover letter from v6, which can be found at
https://lore.kernel.org/r/20220516232138.1783324-1-dmatl...@google.com.

This version mostly does the following:

- apply the feedback from Sean and other reviewers, which is mostly
  aesthetic

- replace the refactoring of drop_large_spte()/__drop_large_spte()
  with my own version.  The insight there is that drop_large_spte()
  is always followed by {,__}link_shadow_page(), so the call is
  moved there

- split the TLB flush optimization into a separate patch, mostly
  to perform the previous refactoring independent of the optional
  TLB flush

- rename a few functions from *nested_mmu* to *shadow_mmu*

David Matlack (21):
  KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
  KVM: x86/mmu: Use a bool for direct
  KVM: x86/mmu: Stop passing "direct" to mmu_alloc_root()
  KVM: x86/mmu: Derive shadow MMU page role from parent
  KVM: x86/mmu: Always pass 0 for @quadrant when gptes are 8 bytes
  KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
  KVM: x86/mmu: Consolidate shadow page allocation and initialization
  KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
  KVM: x86/mmu: Move guest PT write-protection to account_shadowed()
  KVM: x86/mmu: Pass memory caches to allocate SPs separately
  KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page()
  KVM: x86/mmu: Pass kvm pointer separately from vcpu to
kvm_mmu_find_shadow_page()
  KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page()
  KVM: x86/mmu: Pass const memslot to rmap_add()
  KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
  KVM: x86/mmu: Update page stats in __rmap_add()
  KVM: x86/mmu: Cache the access bits of shadowed translations
  KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU
  KVM: x86/mmu: Zap collapsible SPTEs in shadow MMU at all possible
levels
  KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs

Paolo Bonzini (2):
  KVM: x86/mmu: pull call to drop_large_spte() into __link_shadow_page()
  KVM: x86/mmu: Avoid unnecessary flush on eager page split

 .../admin-guide/kernel-parameters.txt |   3 +-
 arch/arm64/kvm/mmu.c  |   2 +-
 arch/riscv/kvm/mmu.c  |   5 +-
 arch/x86/include/asm/kvm_host.h   |  24 +-
 arch/x86/kvm/mmu/mmu.c| 719 ++
 arch/x86/kvm/mmu/mmu_internal.h   |  17 +-
 arch/x86/kvm/mmu/paging_tmpl.h|  43 +-
 arch/x86/kvm/mmu/spte.c   |  15 +-
 arch/x86/kvm/mmu/spte.h   |   4 +-
 arch/x86/kvm/mmu/tdp_mmu.c|   2 +-
 include/linux/kvm_host.h  |   1 +
 include/linux/kvm_types.h |   6 +-
 virt/kvm/kvm_main.c   |  33 +-
 13 files changed, 666 insertions(+), 208 deletions(-)

-- 
2.31.1



[PATCH v7 02/23] KVM: x86/mmu: Use a bool for direct

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

The parameter "direct" can either be true or false, and all of the
callers pass in a bool variable or true/false literal, so just use the
type bool.

No functional change intended.

Reviewed-by: Lai Jiangshan 
Reviewed-by: Sean Christopherson 
Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-3-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c0afb4f1c8ae..844b58ddb3bb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1664,7 +1664,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
mmu_spte_clear_no_track(parent_pte);
 }
 
-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int 
direct)
+static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, bool 
direct)
 {
struct kvm_mmu_page *sp;
 
@@ -1997,7 +1997,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
 gfn_t gfn,
 gva_t gaddr,
 unsigned level,
-int direct,
+bool direct,
 unsigned int access)
 {
union kvm_mmu_page_role role;
-- 
2.31.1




[PATCH v7 05/23] KVM: x86/mmu: Always pass 0 for @quadrant when gptes are 8 bytes

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

The quadrant is only used when gptes are 4 bytes, but
mmu_alloc_{direct,shadow}_roots() pass in a non-zero quadrant for PAE
page directories regardless. Make this less confusing by only passing in
a non-zero quadrant when it is actually necessary.
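
As a standalone illustration of when a non-zero quadrant is needed
(shadowing a 32-bit non-PAE guest with PAE paging, where each of the
four PAE page directories covers one 1GiB quarter of the 4GiB guest
address space; a model, not the kernel code):

#include <stdint.h>
#include <stdio.h>

/* Which of the four PAE page directories shadows this guest address. */
static int pae_quadrant(uint32_t gaddr)
{
	return gaddr >> 30;	/* 0..3, one 1GiB quarter each */
}

int main(void)
{
	uint32_t addrs[] = { 0x00400000u, 0x7fe00000u, 0xc0a00000u };
	unsigned int i;

	for (i = 0; i < sizeof(addrs) / sizeof(addrs[0]); i++)
		printf("gaddr %#x -> quadrant %d\n",
		       (unsigned int)addrs[i], pae_quadrant(addrs[i]));
	return 0;
}

When gptes are 8 bytes there is no such aliasing, so the quadrant can
always be 0, which is what this change makes explicit.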

Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-6-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fd1b479bf7fc..f4e7978a6c6a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3389,9 +3389,10 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t 
gfn, int quadrant,
struct kvm_mmu_page *sp;
 
role.level = level;
+   role.quadrant = quadrant;
 
-   if (role.has_4_byte_gpte)
-   role.quadrant = quadrant;
+   WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
+   WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
 
sp = kvm_mmu_get_page(vcpu, gfn, role);
++sp->root_count;
@@ -3427,7 +3428,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
for (i = 0; i < 4; ++i) {
WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
 
-   root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), i,
+   root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), 0,
  PT32_ROOT_LEVEL);
mmu->pae_root[i] = root | PT_PRESENT_MASK |
   shadow_me_value;
@@ -3512,9 +3513,8 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
struct kvm_mmu *mmu = vcpu->arch.mmu;
u64 pdptrs[4], pm_mask;
gfn_t root_gfn, root_pgd;
+   int quadrant, i, r;
hpa_t root;
-   unsigned i;
-   int r;
 
root_pgd = mmu->get_guest_pgd(vcpu);
root_gfn = root_pgd >> PAGE_SHIFT;
@@ -3597,7 +3597,15 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
root_gfn = pdptrs[i] >> PAGE_SHIFT;
}
 
-   root = mmu_alloc_root(vcpu, root_gfn, i, PT32_ROOT_LEVEL);
+   /*
+* If shadowing 32-bit non-PAE page tables, each PAE page
+* directory maps one quarter of the guest's non-PAE page
+* directory. Othwerise each PAE page direct shadows one guest
+* PAE page directory so that quadrant should be 0.
+*/
+   quadrant = (mmu->cpu_role.base.level == PT32_ROOT_LEVEL) ? i : 
0;
+
+   root = mmu_alloc_root(vcpu, root_gfn, quadrant, 
PT32_ROOT_LEVEL);
mmu->pae_root[i] = root | pm_mask;
}
 
-- 
2.31.1




[PATCH v7 01/23] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs

2022-06-22 Thread Paolo Bonzini
From: David Matlack 

Commit fb58a9c345f6 ("KVM: x86/mmu: Optimize MMU page cache lookup for
fully direct MMUs") skipped the unsync checks and write flood clearing
for full direct MMUs. We can extend this further to skip the checks for
all direct shadow pages. Direct shadow pages in indirect MMUs (i.e.
shadow paging) are used when shadowing a guest huge page with smaller
pages. Such direct shadow pages, like their counterparts in fully direct
MMUs, are never marked unsynced or have a non-zero write-flooding count.

Checking sp->role.direct also generates better code than checking
direct_map because, due to register pressure, direct_map has to get
shoved onto the stack and then pulled back off.

No functional change intended.

Reviewed-by: Lai Jiangshan 
Reviewed-by: Sean Christopherson 
Reviewed-by: Peter Xu 
Signed-off-by: David Matlack 
Message-Id: <20220516232138.1783324-2-dmatl...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 27b2a5603496..c0afb4f1c8ae 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2000,7 +2000,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
 int direct,
 unsigned int access)
 {
-   bool direct_mmu = vcpu->arch.mmu->root_role.direct;
union kvm_mmu_page_role role;
struct hlist_head *sp_list;
unsigned quadrant;
@@ -2060,7 +2059,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
continue;
}
 
-   if (direct_mmu)
+   /* unsync and write-flooding only apply to indirect SPs. */
+   if (sp->role.direct)
goto trace_get_page;
 
if (sp->unsync) {
-- 
2.31.1




Re: [PATCH v6 20/22] KVM: x86/mmu: Refactor drop_large_spte()

2022-06-22 Thread Paolo Bonzini

On 6/22/22 18:13, Paolo Bonzini wrote:
Even better, drop_large_spte() is always called right before 
kvm_mmu_get_child_sp(), so:


Actually, we can even include the call from eager page splitting if
__link_shadow_page() is the one that takes care of dropping the large
SPTE:

From bea344e409bb8329ca69aca0a63f97537a7ec798 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini 
Date: Wed, 22 Jun 2022 12:11:44 -0400
Subject: [PATCH] KVM: MMU: pull call to drop_large_spte() into
 __link_shadow_page()

Before allocating a child shadow page table, all callers check
whether the parent already points to a huge page and, if so, they
drop that SPTE.  This is done by drop_large_spte().

However, the act that requires dropping the large SPTE is the
installation of the sp that is returned by kvm_mmu_get_child_sp(),
which happens in __link_shadow_page().  Move the call there
instead of having it in each and every caller.

To ensure that the shadow page is not linked twice if it was
present, do _not_ opportunistically make kvm_mmu_get_child_sp()
idempotent: instead, return an error value if the shadow page
already existed.  This is a bit more verbose, but clearer than
NULL.

Now that the drop_large_spte() name is not taken anymore,
remove the two underscores in front of __drop_large_spte().

Signed-off-by: Paolo Bonzini 

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 36bc49f08d60..64c1191be4ae 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1135,26 +1135,16 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
rmap_remove(kvm, sptep);
 }
 
-

-static bool __drop_large_spte(struct kvm *kvm, u64 *sptep)
+static void drop_large_spte(struct kvm *kvm, u64 *sptep)
 {
-   if (is_large_pte(*sptep)) {
-   WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K);
-   drop_spte(kvm, sptep);
-   return true;
-   }
-
-   return false;
-}
+   struct kvm_mmu_page *sp;
 
-static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)

-{
-   if (__drop_large_spte(vcpu->kvm, sptep)) {
-   struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+   sp = sptep_to_sp(sptep);
+   WARN_ON(sp->role.level == PG_LEVEL_4K);
 
-		kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,

+   drop_spte(kvm, sptep);
+   kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
KVM_PAGES_PER_HPAGE(sp->role.level));
-   }
 }
 
 /*

@@ -2221,6 +2211,9 @@ static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct 
kvm_vcpu *vcpu,
 {
union kvm_mmu_page_role role;
 
+	if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))

+   return ERR_PTR(-EEXIST);
+
role = kvm_mmu_child_role(sptep, direct, access);
return kvm_mmu_get_shadow_page(vcpu, gfn, role);
 }
@@ -2295,6 +2288,13 @@ static void __link_shadow_page(struct 
kvm_mmu_memory_cache *cache, u64 *sptep,
 
 	BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
 
+	/*

+* If an SPTE is present already, it must be a leaf and therefore
+* a large one.  Drop it and flush the TLB before installing sp.
+*/
+   if (is_shadow_present_pte(*sptep))
+   drop_large_spte(vcpu->kvm, sptep);
+
spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
 
 	mmu_spte_set(sptep, spte);

@@ -3080,11 +3080,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct 
kvm_page_fault *fault)
if (it.level == fault->goal_level)
break;
 
-		drop_large_spte(vcpu, it.sptep);

-   if (is_shadow_present_pte(*it.sptep))
-   continue;
-
sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
+   if (sp == ERR_PTR(-EEXIST))
+   continue;
 
 		link_shadow_page(vcpu, it.sptep, sp);

if (fault->is_tdp && fault->huge_page_disallowed &&
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 24f292f3f93f..2448fa8d8438 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -648,15 +648,13 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct 
kvm_page_fault *fault,
gfn_t table_gfn;
 
 		clear_sp_write_flooding_count(it.sptep);

-   drop_large_spte(vcpu, it.sptep);
 
-		sp = NULL;

-   if (!is_shadow_present_pte(*it.sptep)) {
-   table_gfn = gw->table_gfn[it.level - 2];
-   access = gw->pt_access[it.level - 2];
-   sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
- false, access);
+   table_gfn = gw->table_gfn[it.level - 2];
+   access = gw->pt_access[it.level - 2];
+   sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
+   

Re: [PATCH v6 20/22] KVM: x86/mmu: Refactor drop_large_spte()

2022-06-22 Thread Paolo Bonzini

On 6/17/22 19:11, Sean Christopherson wrote:

since the shortlog is already
a somewhat vague "do a refactor", I vote to opportunistically:

   - rename drop_large_spte() to drop_spte_if_huge()
   - rename __drop_large_spte() to drop_huge_spte()
   - move "if (!is_large_pte(*sptep))" to drop_spte_if_huge() since the split 
path
 should never pass in a non-huge SPTE.

That last point will also clean up an oddity with with "flush" parameter; given
the command-like name of "flush", it's a bit weird that __drop_large_spte() 
doesn't
flush when the SPTE is large.


Even better, drop_large_spte() is always called right before 
kvm_mmu_get_child_sp(), so:


From 86a9490972a1e959a4df114678719494b5475720 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini 
Date: Wed, 22 Jun 2022 12:11:44 -0400
Subject: [PATCH] KVM: MMU: pull drop_large_spte into kvm_mmu_get_child_sp

Before allocating a child shadow page table, all callers need to
check whether the parent already points to a huge page and, if so,
drop it.  This is done by drop_large_spte(), but it can be moved
to kvm_mmu_get_child_sp().

To ensure that the shadow page is not linked twice if it was
present, do _not_ opportunistically make kvm_mmu_get_child_sp()
idempotent: instead, return an error value if the shadow page
already existed.  This is a bit more verbose, but clearer than
NULL.

Now that the drop_large_spte() name is not taken anymore,
remove the two underscores in front of __drop_large_spte().

Signed-off-by: Paolo Bonzini 

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 36bc49f08d60..7f52870ee062 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1135,26 +1135,16 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
rmap_remove(kvm, sptep);
 }

-
-static bool __drop_large_spte(struct kvm *kvm, u64 *sptep)
+static void drop_large_spte(struct kvm *kvm, u64 *sptep)
 {
-   if (is_large_pte(*sptep)) {
-   WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K);
-   drop_spte(kvm, sptep);
-   return true;
-   }
-
-   return false;
-}
+   struct kvm_mmu_page *sp;

-static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
-{
-   if (__drop_large_spte(vcpu->kvm, sptep)) {
-   struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+   sp = sptep_to_sp(sptep);
+   WARN_ON(sp->role.level == PG_LEVEL_4K);

-   kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
+   drop_spte(kvm, sptep);
+   kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
KVM_PAGES_PER_HPAGE(sp->role.level));
-   }
 }

 /*
@@ -2221,6 +2211,13 @@ static struct kvm_mmu_page 
*kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,

 {
union kvm_mmu_page_role role;

+   if (is_shadow_present_pte(*sptep)) {
+   if (!is_large_pte(*sptep))
+   return ERR_PTR(-EEXIST);
+
+   drop_large_spte(vcpu->kvm, sptep, true);
+   }
+
role = kvm_mmu_child_role(sptep, direct, access);
return kvm_mmu_get_shadow_page(vcpu, gfn, role);
 }
@@ -3080,11 +3077,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, 
struct kvm_page_fault *fault)

if (it.level == fault->goal_level)
break;

-   drop_large_spte(vcpu, it.sptep);
-   if (is_shadow_present_pte(*it.sptep))
-   continue;
-
sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
+   if (sp == ERR_PTR(-EEXIST))
+   continue;

link_shadow_page(vcpu, it.sptep, sp);
if (fault->is_tdp && fault->huge_page_disallowed &&
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 24f292f3f93f..2448fa8d8438 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -648,15 +648,13 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, 
struct kvm_page_fault *fault,

gfn_t table_gfn;

clear_sp_write_flooding_count(it.sptep);
-   drop_large_spte(vcpu, it.sptep);

-   sp = NULL;
-   if (!is_shadow_present_pte(*it.sptep)) {
-   table_gfn = gw->table_gfn[it.level - 2];
-   access = gw->pt_access[it.level - 2];
-   sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
- false, access);
+   table_gfn = gw->table_gfn[it.level - 2];
+   access = gw->pt_access[it.level - 2];
+   sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
+ false, access);

+   if (sp != ERR_PTR(-EEXIST)) {
/*
 * We must synchronize the pagetable before li

Re: [PATCH v6 13/22] KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page()

2022-06-22 Thread Paolo Bonzini

On 6/17/22 17:28, Sean Christopherson wrote:

On Mon, May 16, 2022, David Matlack wrote:

Allow @vcpu to be NULL in kvm_mmu_find_shadow_page() (and its only
caller __kvm_mmu_get_shadow_page()). @vcpu is only required to sync
indirect shadow pages, so it's safe to pass in NULL when looking up
direct shadow pages.

This will be used for doing eager page splitting, which allocates direct


"hugepage" again, because I need constant reminders :-)


shadow pages from the context of a VM ioctl without access to a vCPU
pointer.

Signed-off-by: David Matlack 
---


With nits addressed,

Reviewed-by: Sean Christopherson 


  arch/x86/kvm/mmu/mmu.c | 13 +
  1 file changed, 13 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4fbc2da47428..acb54d6e0ea5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1850,6 +1850,7 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp,
  
  	if (ret < 0)

kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
+


Unrelated whitespace change leftover from the previous approach.


return ret;
  }
  
@@ -2001,6 +2002,7 @@ static void clear_sp_write_flooding_count(u64 *spte)

__clear_sp_write_flooding_count(sptep_to_sp(spte));
  }
  
+/* Note, @vcpu may be NULL if @role.direct is true. */

  static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
 struct kvm_vcpu *vcpu,
 gfn_t gfn,
@@ -2039,6 +2041,16 @@ static struct kvm_mmu_page 
*kvm_mmu_find_shadow_page(struct kvm *kvm,
goto out;
  
  		if (sp->unsync) {

+   /*
+* A vCPU pointer should always be provided when finding


s/should/must, and "be provided" is unnecessarily ambiguous; simply state that
"@vcpu must be non-NULL".  E.g. if a caller provides a pointer, but that pointer
happens to be NULL.


+* indirect shadow pages, as that shadow page may
+* already exist and need to be synced using the vCPU
+* pointer. Direct shadow pages are never unsync and
+* thus do not require a vCPU pointer.
+*/


"vCPU pointer" over and over is a bit versbose, and I prefer to refer to 
vCPUs/VMs
as objects themselves.  E.g. "XYZ requires a vCPU" versus "XYZ requires a vCPU
pointer" since it's not the pointer itself that's required, it's all the context
of the vCPU that is needed.

/*
 * @vcpu must be non-NULL when finding indirect shadow
 * pages, as such pages may already exist and need to
 * be synced, which requires a vCPU.  Direct pages are
 * never unsync and thus do not require a vCPU.
 */


My own take:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d7987420bb26..a7748c5a2385 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1975,7 +1975,12 @@ static void clear_sp_write_flooding_count(u64 *spte)
__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-/* Note, @vcpu may be NULL if @role.direct is true. */

+/*
+ * The vCPU is required when finding indirect shadow pages; the shadow
+ * page may already exist and syncing it needs the vCPU pointer in
+ * order to read guest page tables.  Direct shadow pages are never
+ * unsync, thus @vcpu can be NULL if @role.direct is true.
+ */
 static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
 struct kvm_vcpu *vcpu,
 gfn_t gfn,
@@ -2014,13 +2019,6 @@ static struct kvm_mmu_page 
*kvm_mmu_find_shadow_page(struct kvm *kvm,
goto out;
 
 		if (sp->unsync) {

-   /*
-* The vCPU pointer is required when finding indirect
-* shadow pages, as that shadow page may already exist
-* exist and need to be synced using the vCPU pointer.
-* Direct shadow pages are never unsync and thus do not
-* require a vCPU pointer.
-*/
if (KVM_BUG_ON(!vcpu, kvm))
break;
 
@@ -2101,7 +2099,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,

return sp;
 }
 
-/* Note, @vcpu may be NULL if @role.direct is true. */

+/* Note, @vcpu may be NULL if @role.direct is true; see kvm_mmu_find_shadow_page. */
 static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
  struct kvm_vcpu *vcpu,
  struct shadow_page_caches 
*caches,


Re: [PATCH v6 03/22] KVM: x86/mmu: Stop passing @direct to mmu_alloc_root()

2022-06-22 Thread Paolo Bonzini

On 6/16/22 20:47, Sean Christopherson wrote:

The argument @direct is vcpu->arch.mmu->root_role.direct, so just use
that.

It's worth calling out that, unlike non-root page tables, it's impossible to 
have
a direct root in an indirect MMU.  I.e. provide a hint as to why there's a need 
to
pass @direct in the first place.



I suppose there's *no* need to pass direct?  Also, there's the trivial 
(but less interesting) justification that kvm_mmu_load does


if (vcpu->arch.mmu->root_role.direct)
r = mmu_alloc_direct_roots(vcpu);
else
r = mmu_alloc_shadow_roots(vcpu);

and those are the only callers of mmu_alloc_root.

Paolo
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 4/4] KVM: selftests: Fix filename reporting in guest asserts

2022-06-20 Thread Paolo Bonzini

On 6/16/22 14:45, Andrew Jones wrote:

+#define __GUEST_ASSERT(_condition, _condstr, _nargs, _args...) do {\
+   if (!(_condition))  \
+   ucall(UCALL_ABORT, GUEST_ASSERT_BUILTIN_NARGS + _nargs,   \
+ "Failed guest assert: " _condstr,   \
+ __FILE__, \
+ __LINE__, \
+ ##_args); \

We don't need another level of indentation nor the ## operator on _args.



The ## is needed to drop the comma if there are no _args.

Paolo
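
(A minimal standalone illustration of the GNU ", ##" extension, separate
from the selftests code, compiles with gcc or clang:

	#include <stdio.h>

	/* The ## after the comma drops the comma when no variadic args are given. */
	#define pr(fmt, _args...)	printf(fmt, ##_args)

	int main(void)
	{
		pr("no extra args\n");		/* expands to printf("no extra args\n");  */
		pr("one extra arg: %d\n", 42);	/* expands to printf("one extra arg: %d\n", 42); */
		return 0;
	}

Without the ##, the first call would expand to printf("no extra args\n", )
and fail to compile.)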

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 0/3] KVM: selftests: Consolidate ucall code

2022-06-20 Thread Paolo Bonzini

On 6/18/22 02:16, Sean Christopherson wrote:

Consolidate the code for making and getting ucalls.  All architectures pass
the ucall struct via memory, so filling and copying the struct is 100%
generic.  The only per-arch code is sending and receiving the address of
said struct.

Tested on x86 and arm, compile tested on s390 and RISC-V.


I'm not sure about doing this yet.  The SEV tests added multiple 
implementations of the ucalls in one architecture.  I have rebased those 
recently (not the SEV part) to get more familiar with the new kvm_vcpu 
API for selftests, and was going to look at your old review next...


Paolo

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [Question] remote_tlb_flush statistic is missed from kvm_flush_remote_tlbs() ?

2022-06-17 Thread Paolo Bonzini

On 6/17/22 12:33, Andrew Jones wrote:

I don't see the change for commit 38f703663d4c as of an upstream pull
right now

$ git show 47700948a4ab:arch/arm64/kvm/mmu.c | grep -A4 'void 
kvm_flush_remote_tlbs'
void kvm_flush_remote_tlbs(struct kvm *kvm)
{
++kvm->stat.generic.remote_tlb_flush_requests;
kvm_call_hyp(__kvm_tlb_flush_vmid, &kvm->arch.mmu);
}

and I do see it got dropped with merge commit e99314a340d2.

$ git diff 419025b3b419 0d0a19395baa -- arch/arm64/kvm/mmu.c | grep -A5 'void 
kvm_flush_remote_tlbs'
  void kvm_flush_remote_tlbs(struct kvm *kvm)
  {
+   ++kvm->stat.generic.remote_tlb_flush_requests;
kvm_call_hyp(__kvm_tlb_flush_vmid, &kvm->arch.mmu);
-   ++kvm->stat.generic.remote_tlb_flush;
  }


Hi,

on ARM it makes little sense to split remote_tlb_flush_requests and 
remote_tlb_flush.  On x86 the latter means "a vmexit was forced in order 
to flush the TLB", and in fact this common code:


if (!kvm_arch_flush_remote_tlb(kvm)
|| kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
++kvm->stat.generic.remote_tlb_flush;

should probably be written

if (!kvm_arch_flush_remote_tlb(kvm))
return;

if (kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
++kvm->stat.generic.remote_tlb_flush;

Paolo

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH] selftests: KVM: Handle compiler optimizations in ucall

2022-06-17 Thread Paolo Bonzini

On 6/17/22 09:28, Andrew Jones wrote:

On Thu, Jun 16, 2022 at 09:54:16PM +, David Laight wrote:

From: oliver.up...@linux.dev

Sent: 16 June 2022 19:45




June 16, 2022 11:48 AM, "David Laight"  wrote:

No wonder I was confused.
It's not surprising the compiler optimises it all away.

It doesn't seem right to be 'abusing' WRITE_ONCE() here.
Just adding barrier() should be enough and much more descriptive.


I had the same thought, although I do not believe barrier() is sufficient
on its own. barrier_data() with a pointer to uc passed through
is required to keep clang from eliminating the dead store.


A barrier() (full memory clobber) ought to be stronger than
the partial one that barrier_data() generates.

I can't quite decide whether you need a barrier() both sides
of the 'magic write'.
Plausibly the compiler could discard the on-stack data
after the barrier() and before the 'magic write'.

Certainly putting the 'magic write' inside a asm block
that has a memory clobber is a more correct solution.


Indeed, since the magic write is actually a guest MMIO write, then
it should be using writeq().


It doesn't need to use writeq() because no special precautions are 
needed with respect to cacheability or instruction reordering (as is the 
case with hardware registers).


WRITE_ONCE is okay, especially since the code never reads it (and if it 
did it would also use READ_ONCE).


Paolo
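
(For reference, a rough sketch of the aarch64 ucall() with the volatile
accesses applied; illustrative only, not the exact queued patch:

	void ucall(uint64_t cmd, int nargs, ...)
	{
		struct ucall uc = {};
		va_list va;
		int i;

		WRITE_ONCE(uc.cmd, cmd);
		nargs = nargs <= UCALL_MAX_ARGS ? nargs : UCALL_MAX_ARGS;

		va_start(va, nargs);
		for (i = 0; i < nargs; ++i)
			WRITE_ONCE(uc.args[i], va_arg(va, uint64_t));
		va_end(va);

		/* Guest MMIO write that traps to the host; WRITE_ONCE keeps the
		 * compiler from eliding the stores to uc as dead. */
		WRITE_ONCE(*ucall_exit_mmio_addr, (vm_vaddr_t)&uc);
	}

The point is simply that every store the host later reads back goes
through a volatile access; no writeq() or full barrier() is needed.)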


___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH] selftests: KVM: Handle compiler optimizations in ucall

2022-06-16 Thread Paolo Bonzini
Queued, thanks.

Paolo


___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [GIT PULL] KVM/arm64 fixes for 5.19, take #1

2022-06-09 Thread Paolo Bonzini

On 6/9/22 16:17, Marc Zyngier wrote:

   git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git 
tags/kvmarm-fixes-5.19-1


Pulled, thanks.

Paolo

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [GIT PULL] KVM/arm64 updates for 5.19

2022-05-20 Thread Paolo Bonzini

On 5/19/22 11:27, Marc Zyngier wrote:

Hi Paolo,

Here's the bulk of the KVM/arm64 updates for 5.19. Major features are
guard pages for the EL2 stacks, save/restore of the guest-visible
hypercall configuration and PSCI suspend support. Further details in
the tag description.

Note that this PR contains a shared branch with the arm64 tree
containing the SME patches to resolve conflicts with the WFxT support
branch.


Pulled, thanks.  I solved the conflict for KVM_EXIT_SYSTEM_EVENT values 
in your favor.


Paolo


Please pull,

M.

The following changes since commit 672c0c5173427e6b3e2a9bbb7be51ceeec78093a:

   Linux 5.18-rc5 (2022-05-01 13:57:58 -0700)

are available in the Git repository at:

   git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git 
tags/kvmarm-5.19

for you to fetch changes up to 5c0ad551e9aa6188f2bda0977c1cb6768a2b74ef:

   Merge branch kvm-arm64/its-save-restore-fixes-5.19 into kvmarm-master/next 
(2022-05-16 17:48:36 +0100)


KVM/arm64 updates for 5.19

- Add support for the ARMv8.6 WFxT extension

- Guard pages for the EL2 stacks

- Trap and emulate AArch32 ID registers to hide unsupported features

- Ability to select and save/restore the set of hypercalls exposed
   to the guest

- Support for PSCI-initiated suspend in collaboration with userspace

- GICv3 register-based LPI invalidation support

- Move host PMU event merging into the vcpu data structure

- GICv3 ITS save/restore fixes

- The usual set of small-scale cleanups and fixes


Alexandru Elisei (3):
   KVM: arm64: Hide AArch32 PMU registers when not available
   KVM: arm64: Don't BUG_ON() if emulated register table is unsorted
   KVM: arm64: Print emulated register table name when it is unsorted

Ard Biesheuvel (1):
   KVM: arm64: Avoid unnecessary absolute addressing via literals

Fuad Tabba (4):
   KVM: arm64: Wrapper for getting pmu_events
   KVM: arm64: Repack struct kvm_pmu to reduce size
   KVM: arm64: Pass pmu events to hyp via vcpu
   KVM: arm64: Reenable pmu in Protected Mode

Kalesh Singh (6):
   KVM: arm64: Introduce hyp_alloc_private_va_range()
   KVM: arm64: Introduce pkvm_alloc_private_va_range()
   KVM: arm64: Add guard pages for KVM nVHE hypervisor stack
   KVM: arm64: Add guard pages for pKVM (protected nVHE) hypervisor stack
   KVM: arm64: Detect and handle hypervisor stack overflows
   KVM: arm64: Symbolize the nVHE HYP addresses

Marc Zyngier (30):
   arm64: Expand ESR_ELx_WFx_ISS_TI to match its ARMv8.7 definition
   arm64: Add RV and RN fields for ESR_ELx_WFx_ISS
   arm64: Add HWCAP advertising FEAT_WFXT
   arm64: Add wfet()/wfit() helpers
   arm64: Use WFxT for __delay() when possible
   KVM: arm64: Simplify kvm_cpu_has_pending_timer()
   KVM: arm64: Introduce kvm_counter_compute_delta() helper
   KVM: arm64: Handle blocking WFIT instruction
   KVM: arm64: Offer early resume for non-blocking WFxT instructions
   KVM: arm64: Expose the WFXT feature to guests
   KVM: arm64: Fix new instances of 32bit ESRs
   Merge remote-tracking branch 'arm64/for-next/sme' into kvmarm-master/next
   Merge branch kvm-arm64/wfxt into kvmarm-master/next
   Merge branch kvm-arm64/hyp-stack-guard into kvmarm-master/next
   Merge branch kvm-arm64/aarch32-idreg-trap into kvmarm-master/next
   Documentation: Fix index.rst after psci.rst renaming
   irqchip/gic-v3: Exposes bit values for GICR_CTLR.{IR, CES}
   KVM: arm64: vgic-v3: Expose GICR_CTLR.RWP when disabling LPIs
   KVM: arm64: vgic-v3: Implement MMIO-based LPI invalidation
   KVM: arm64: vgic-v3: Advertise GICR_CTLR.{IR, CES} as a new GICD_IIDR 
revision
   KVM: arm64: vgic-v3: List M1 Pro/Max as requiring the SEIS workaround
   KVM: arm64: Hide KVM_REG_ARM_*_BMAP_BIT_COUNT from userspace
   KVM: arm64: pmu: Restore compilation when HW_PERF_EVENTS isn't selected
   KVM: arm64: Fix hypercall bitmap writeback when vcpus have already run
   Merge branch kvm-arm64/hcall-selection into kvmarm-master/next
   Merge branch kvm-arm64/psci-suspend into kvmarm-master/next
   Merge branch kvm-arm64/vgic-invlpir into kvmarm-master/next
   Merge branch kvm-arm64/per-vcpu-host-pmu-data into kvmarm-master/next
   Merge branch kvm-arm64/misc-5.19 into kvmarm-master/next
   Merge branch kvm-arm64/its-save-restore-fixes-5.19 into 
kvmarm-master/next

Mark Brown (25):
   arm64/sme: Provide ABI documentation for SME
   arm64/sme: System register and exception syndrome definitions
   arm64/sme: Manually encode SME instructions
   arm64/sme: Early CPU setup for SME
   arm64/sme: Basic enumeration support
   arm64/sme: Identify supported SME vector lengths at boot
   arm64/sme: Implement sysctl to set the default vector length
   arm64/sme: 

[PATCH] Documentation: kvm: reorder ARM-specific section about KVM_SYSTEM_EVENT_SUSPEND

2022-05-20 Thread Paolo Bonzini
Signed-off-by: Paolo Bonzini 
---
 Documentation/virt/kvm/api.rst | 52 +-
 1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 3201933e52d9..6890c04ccde2 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6170,32 +6170,6 @@ Valid values for 'type' are:
  - KVM_SYSTEM_EVENT_SUSPEND -- the guest has requested a suspension of
the VM.
 
-For arm/arm64:
---
-
-   KVM_SYSTEM_EVENT_SUSPEND exits are enabled with the
-   KVM_CAP_ARM_SYSTEM_SUSPEND VM capability. If a guest invokes the PSCI
-   SYSTEM_SUSPEND function, KVM will exit to userspace with this event
-   type.
-
-   It is the sole responsibility of userspace to implement the PSCI
-   SYSTEM_SUSPEND call according to ARM DEN0022D.b 5.19 "SYSTEM_SUSPEND".
-   KVM does not change the vCPU's state before exiting to userspace, so
-   the call parameters are left in-place in the vCPU registers.
-
-   Userspace is _required_ to take action for such an exit. It must
-   either:
-
-- Honor the guest request to suspend the VM. Userspace can request
-  in-kernel emulation of suspension by setting the calling vCPU's
-  state to KVM_MP_STATE_SUSPENDED. Userspace must configure the vCPU's
-  state according to the parameters passed to the PSCI function when
-  the calling vCPU is resumed. See ARM DEN0022D.b 5.19.1 "Intended use"
-  for details on the function parameters.
-
-- Deny the guest request to suspend the VM. See ARM DEN0022D.b 5.19.2
-  "Caller responsibilities" for possible return values.
-
 If KVM_CAP_SYSTEM_EVENT_DATA is present, the 'data' field can contain
 architecture specific information for the system-level event.  Only
 the first `ndata` items (possibly zero) of the data array are valid.
@@ -6211,6 +6185,32 @@ Previous versions of Linux defined a `flags` member in 
this struct.  The
 field is now aliased to `data[0]`.  Userspace can assume that it is only
 written if ndata is greater than 0.
 
+For arm/arm64:
+--
+
+KVM_SYSTEM_EVENT_SUSPEND exits are enabled with the
+KVM_CAP_ARM_SYSTEM_SUSPEND VM capability. If a guest invokes the PSCI
+SYSTEM_SUSPEND function, KVM will exit to userspace with this event
+type.
+
+It is the sole responsibility of userspace to implement the PSCI
+SYSTEM_SUSPEND call according to ARM DEN0022D.b 5.19 "SYSTEM_SUSPEND".
+KVM does not change the vCPU's state before exiting to userspace, so
+the call parameters are left in-place in the vCPU registers.
+
+Userspace is _required_ to take action for such an exit. It must
+either:
+
+ - Honor the guest request to suspend the VM. Userspace can request
+   in-kernel emulation of suspension by setting the calling vCPU's
+   state to KVM_MP_STATE_SUSPENDED. Userspace must configure the vCPU's
+   state according to the parameters passed to the PSCI function when
+   the calling vCPU is resumed. See ARM DEN0022D.b 5.19.1 "Intended use"
+   for details on the function parameters.
+
+ - Deny the guest request to suspend the VM. See ARM DEN0022D.b 5.19.2
+   "Caller responsibilities" for possible return values.
+
 ::
 
/* KVM_EXIT_IOAPIC_EOI */
-- 
2.31.1

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [GIT PULL] KVM/arm64 fixes for 5.18, take #3

2022-05-17 Thread Paolo Bonzini

On 5/16/22 14:51, Marc Zyngier wrote:

Paolo,

Here's the third (and hopefully final) set of fixes for 5.18. Two
rather simple patches: one addressing a userspace-visible change in
behaviour with GICv3, and a fix for pKVM in combination with CPUs
affected by Spectre-v3a.

Please pull,

M.

The following changes since commit 85ea6b1ec915c9dd90caf3674b203999d8c7e062:

   KVM: arm64: Inject exception on out-of-IPA-range translation fault 
(2022-04-27 23:02:23 +0100)

are available in the Git repository at:

   git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git 
tags/kvmarm-fixes-5.18-3

for you to fetch changes up to 2e40316753ee552fb598e8da8ca0d20a04e67453:

   KVM: arm64: Don't hypercall before EL2 init (2022-05-15 12:14:14 +0100)


KVM/arm64 fixes for 5.18, take #3

- Correctly expose GICv3 support even if no irqchip is created
   so that userspace doesn't observe it changing pointlessly
   (fixing a regression with QEMU)

- Don't issue a hypercall to set the id-mapped vectors when
   protected mode is enabled


Marc Zyngier (1):
   KVM: arm64: vgic-v3: Consistently populate ID_AA64PFR0_EL1.GIC

Quentin Perret (1):
   KVM: arm64: Don't hypercall before EL2 init

  arch/arm64/kvm/arm.c  | 3 ++-
  arch/arm64/kvm/sys_regs.c | 3 +--
  2 files changed, 3 insertions(+), 3 deletions(-)



Pulled, thanks.

Paolo

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH] selftests: kvm: replace ternary operator with min()

2022-05-12 Thread Paolo Bonzini

On 5/11/22 14:05, Guo Zhengkui wrote:

Fix the following coccicheck warnings:

tools/testing/selftests/kvm/lib/s390x/ucall.c:25:15-17: WARNING
opportunity for min()
tools/testing/selftests/kvm/lib/x86_64/ucall.c:27:15-17: WARNING
opportunity for min()
tools/testing/selftests/kvm/lib/riscv/ucall.c:56:15-17: WARNING
opportunity for min()
tools/testing/selftests/kvm/lib/aarch64/ucall.c:82:15-17: WARNING
opportunity for min()
tools/testing/selftests/kvm/lib/aarch64/ucall.c:55:20-21: WARNING
opportunity for min()

min() is defined in tools/include/linux/kernel.h.

Signed-off-by: Guo Zhengkui 
---
  tools/testing/selftests/kvm/lib/aarch64/ucall.c | 4 ++--
  tools/testing/selftests/kvm/lib/riscv/ucall.c   | 2 +-
  tools/testing/selftests/kvm/lib/s390x/ucall.c   | 2 +-
  tools/testing/selftests/kvm/lib/x86_64/ucall.c  | 2 +-
  4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/kvm/lib/aarch64/ucall.c 
b/tools/testing/selftests/kvm/lib/aarch64/ucall.c
index e0b0164e9af8..00be3ef195ca 100644
--- a/tools/testing/selftests/kvm/lib/aarch64/ucall.c
+++ b/tools/testing/selftests/kvm/lib/aarch64/ucall.c
@@ -52,7 +52,7 @@ void ucall_init(struct kvm_vm *vm, void *arg)
 * lower and won't match physical addresses.
 */
bits = vm->va_bits - 1;
-   bits = vm->pa_bits < bits ? vm->pa_bits : bits;
+   bits = min(vm->pa_bits, bits);
end = 1ul << bits;
start = end * 5 / 8;
step = end / 16;
@@ -79,7 +79,7 @@ void ucall(uint64_t cmd, int nargs, ...)
va_list va;
int i;
  
-	nargs = nargs <= UCALL_MAX_ARGS ? nargs : UCALL_MAX_ARGS;

+   nargs = min(nargs, UCALL_MAX_ARGS);
  
  	va_start(va, nargs);

for (i = 0; i < nargs; ++i)
diff --git a/tools/testing/selftests/kvm/lib/riscv/ucall.c 
b/tools/testing/selftests/kvm/lib/riscv/ucall.c
index 9e42d8248fa6..34f16fe70ce8 100644
--- a/tools/testing/selftests/kvm/lib/riscv/ucall.c
+++ b/tools/testing/selftests/kvm/lib/riscv/ucall.c
@@ -53,7 +53,7 @@ void ucall(uint64_t cmd, int nargs, ...)
va_list va;
int i;
  
-	nargs = nargs <= UCALL_MAX_ARGS ? nargs : UCALL_MAX_ARGS;

+   nargs = min(nargs, UCALL_MAX_ARGS);
  
  	va_start(va, nargs);

for (i = 0; i < nargs; ++i)
diff --git a/tools/testing/selftests/kvm/lib/s390x/ucall.c 
b/tools/testing/selftests/kvm/lib/s390x/ucall.c
index 9d3b0f15249a..665267c1135d 100644
--- a/tools/testing/selftests/kvm/lib/s390x/ucall.c
+++ b/tools/testing/selftests/kvm/lib/s390x/ucall.c
@@ -22,7 +22,7 @@ void ucall(uint64_t cmd, int nargs, ...)
va_list va;
int i;
  
-	nargs = nargs <= UCALL_MAX_ARGS ? nargs : UCALL_MAX_ARGS;

+   nargs = min(nargs, UCALL_MAX_ARGS);
  
  	va_start(va, nargs);

for (i = 0; i < nargs; ++i)
diff --git a/tools/testing/selftests/kvm/lib/x86_64/ucall.c 
b/tools/testing/selftests/kvm/lib/x86_64/ucall.c
index a3489973e290..2ea31a0ebe30 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/ucall.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/ucall.c
@@ -24,7 +24,7 @@ void ucall(uint64_t cmd, int nargs, ...)
va_list va;
int i;
  
-	nargs = nargs <= UCALL_MAX_ARGS ? nargs : UCALL_MAX_ARGS;

+   nargs = min(nargs, UCALL_MAX_ARGS);
  
  	va_start(va, nargs);

for (i = 0; i < nargs; ++i)


Queued, thanks.

Paolo
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH] selftests: KVM: aarch64: Let hypercalls use UAPI *_BIT_COUNT

2022-05-05 Thread Paolo Bonzini

On 5/5/22 14:04, Marc Zyngier wrote:

diff --git a/arch/arm64/include/uapi/asm/kvm.h 
b/arch/arm64/include/uapi/asm/kvm.h
index e523bb6eac67..3cde9f958eee 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -342,6 +342,10 @@ struct kvm_arm_copy_mte_tags {
  
  enum {

KVM_REG_ARM_STD_BIT_TRNG_V1_0   = 0,
+   /*
+* KVM_REG_ARM_STD_BMAP_BIT_COUNT will vary as new services
+* are added, and is explicitly not part of the ABI.
+*/
KVM_REG_ARM_STD_BMAP_BIT_COUNT,
  };


That seems like a bad idea.  Perhaps you can wrap it in #ifdef 
__KERNEL_OR_SELFTESTS__?  I can't find any prior art.


Paolo
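
(A sketch of the kind of guard being suggested; the macro name below is a
placeholder invented here, precisely because there is no prior art:

	enum {
		KVM_REG_ARM_STD_BIT_TRNG_V1_0	= 0,
	#ifdef __KERNEL_OR_SELFTESTS__	/* hypothetical guard macro */
		/* Not part of the ABI: will grow as new services are added. */
		KVM_REG_ARM_STD_BMAP_BIT_COUNT,
	#endif
	};

Normal userspace including the UAPI header would then never see the
_BIT_COUNT enumerator, while the kernel and the selftests, which would
define the macro, still get it.)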

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [GIT PULL] KVM/arm64 fixes for 5.18, take #2

2022-04-29 Thread Paolo Bonzini

On 4/29/22 17:36, Marc Zyngier wrote:

Paolo,

Here's a trio of fixes for 5.18. Nothing terribly interesting, but
nonetheless important fixes (two of the bugs are related to AArch32).


Cool, will pull soon.  Please take a quick look at the flags->data ABI 
fix; it's one patch on top of 5.18, as you requested, and if I hear 
nothing I'll send it to Linus around Sunday morning.


Thanks,

Paolo


Please pull,

M.

The following changes since commit 21db83846683d3987666505a3ec38f367708199a:

   selftests: KVM: Free the GIC FD when cleaning up in arch_timer (2022-04-07 
08:46:13 +0100)

are available in the Git repository at:

   git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git 
tags/kvmarm-fixes-5.18-2

for you to fetch changes up to 85ea6b1ec915c9dd90caf3674b203999d8c7e062:

   KVM: arm64: Inject exception on out-of-IPA-range translation fault 
(2022-04-27 23:02:23 +0100)


KVM/arm64 fixes for 5.18, take #2

- Take care of faults occurring between the PARange and
   IPA range by injecting an exception

- Fix S2 faults taken from a host EL0 in protected mode

- Work around Oops caused by a PMU access from a 32bit
   guest when PMU has been created. This is a temporary
   bodge until we fix it for good.


Alexandru Elisei (1):
   KVM/arm64: Don't emulate a PMU for 32-bit guests if feature not set

Marc Zyngier (1):
   KVM: arm64: Inject exception on out-of-IPA-range translation fault

Will Deacon (1):
   KVM: arm64: Handle host stage-2 faults from 32-bit EL0

  arch/arm64/include/asm/kvm_emulate.h |  1 +
  arch/arm64/kvm/hyp/nvhe/host.S   | 18 +-
  arch/arm64/kvm/inject_fault.c| 28 
  arch/arm64/kvm/mmu.c | 19 +++
  arch/arm64/kvm/pmu-emul.c| 23 ++-
  5 files changed, 79 insertions(+), 10 deletions(-)



___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC PATCH 00/17] KVM: arm64: Parallelize stage 2 fault handling

2022-04-21 Thread Paolo Bonzini

On 4/21/22 18:30, Ben Gardon wrote:

Completely agree. I'm surprised that ARM doesn't have a need for a
metadata structure associated with each page of the stage 2 paging
structure, but if you don't need it, that definitely makes things
simpler.


The uses of struct kvm_mmu_page in the TDP MMU are all relatively new, 
for the work_struct and the roots' reference count.  sp->ptep is only 
used in a very specific path, kvm_recover_nx_lpages.


I wouldn't be surprised if ARM grows more metadata later, but in fact 
it's not _that_ surprising that it doesn't need it yet!


Paolo

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


  1   2   3   4   5   6   7   >