Re: [RFC PATCH v2 2/2] KVM: x86: Not wr-protect huge page with init_all_set dirty log
On Tue, Apr 20, 2021 at 12:49 AM Keqian Zhu wrote: > > Hi Ben, > > On 2021/4/20 3:20, Ben Gardon wrote: > > On Fri, Apr 16, 2021 at 1:25 AM Keqian Zhu wrote: > >> > >> Currently during start dirty logging, if we're with init-all-set, > >> we write protect huge pages and leave normal pages untouched, for > >> that we can enable dirty logging for these pages lazily. > >> > >> Actually enable dirty logging lazily for huge pages is feasible > >> too, which not only reduces the time of start dirty logging, also > >> greatly reduces side-effect on guest when there is high dirty rate. > >> > >> Signed-off-by: Keqian Zhu > >> --- > >> arch/x86/kvm/mmu/mmu.c | 48 ++ > >> arch/x86/kvm/x86.c | 37 +--- > >> 2 files changed, 54 insertions(+), 31 deletions(-) > >> > >> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > >> index 2ce5bc2ea46d..98fa25172b9a 100644 > >> --- a/arch/x86/kvm/mmu/mmu.c > >> +++ b/arch/x86/kvm/mmu/mmu.c > >> @@ -1188,8 +1188,7 @@ static bool __rmap_clear_dirty(struct kvm *kvm, > >> struct kvm_rmap_head *rmap_head, > >> * @gfn_offset: start of the BITS_PER_LONG pages we care about > >> * @mask: indicates which pages we should protect > >> * > >> - * Used when we do not need to care about huge page mappings: e.g. during > >> dirty > >> - * logging we do not have any such mappings. > >> + * Used when we do not need to care about huge page mappings. > >> */ > >> static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm, > >> struct kvm_memory_slot *slot, > >> @@ -1246,13 +1245,54 @@ static void kvm_mmu_clear_dirty_pt_masked(struct > >> kvm *kvm, > >> * It calls kvm_mmu_write_protect_pt_masked to write protect selected > >> pages to > >> * enable dirty logging for them. > >> * > >> - * Used when we do not need to care about huge page mappings: e.g. during > >> dirty > >> - * logging we do not have any such mappings. > >> + * We need to care about huge page mappings: e.g. during dirty logging we > >> may > >> + * have any such mappings. > >> */ > >> void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm, > >> struct kvm_memory_slot *slot, > >> gfn_t gfn_offset, unsigned long mask) > >> { > >> + gfn_t start, end; > >> + > >> + /* > >> +* Huge pages are NOT write protected when we start dirty log with > >> +* init-all-set, so we must write protect them at here. > >> +* > >> +* The gfn_offset is guaranteed to be aligned to 64, but the > >> base_gfn > >> +* of memslot has no such restriction, so the range can cross two > >> large > >> +* pages. > >> +*/ > >> + if (kvm_dirty_log_manual_protect_and_init_set(kvm)) { > >> + start = slot->base_gfn + gfn_offset + __ffs(mask); > >> + end = slot->base_gfn + gfn_offset + __fls(mask); > >> + kvm_mmu_slot_gfn_write_protect(kvm, slot, start, > >> PG_LEVEL_2M); > >> + > >> + /* Cross two large pages? */ > >> + if (ALIGN(start << PAGE_SHIFT, PMD_SIZE) != > >> + ALIGN(end << PAGE_SHIFT, PMD_SIZE)) > >> + kvm_mmu_slot_gfn_write_protect(kvm, slot, end, > >> + PG_LEVEL_2M); > >> + } > >> + > >> + /* > >> +* RFC: > >> +* > >> +* 1. I don't return early when kvm_mmu_slot_gfn_write_protect() > >> returns > >> +* true, because I am not very clear about the relationship between > >> +* legacy mmu and tdp mmu. AFAICS, the code logic is NOT an if/else > >> +* manner. > >> +* > >> +* The kvm_mmu_slot_gfn_write_protect() returns true when we hit a > >> +* writable large page mapping in legacy mmu mapping or tdp mmu > >> mapping. > >> +* Do we still have normal mapping in that case? (e.g. 
We have > >> large > >> +* mapping in legacy mmu and normal mapping in tdp mmu). > > > > Right, we can&
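To illustrate the two-huge-page case the comment above describes (numbers chosen purely for illustration): take a memslot with base_gfn = 32 and gfn_offset = 448, with all 64 mask bits set. Then start = 32 + 448 + __ffs(mask) = 480 and end = 32 + 448 + __fls(mask) = 543. A 2M huge page spans 512 gfns, so gfn 480 lies in the huge page covering gfns [0, 511] while gfn 543 lies in the one covering [512, 1023]: ALIGN(start << PAGE_SHIFT, PMD_SIZE) is 0x200000 while ALIGN(end << PAGE_SHIFT, PMD_SIZE) is 0x400000, so kvm_mmu_slot_gfn_write_protect() must be called for both huge pages. If base_gfn were 64-aligned (as gfn_offset already is), the 64-page range could never straddle a 2M boundary and the second call would never be needed.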
Re: [PATCH 09/15] KVM: selftests: Move per-VM GPA into perf_test_args
On Thu, Feb 11, 2021 at 7:58 AM Sean Christopherson wrote: > > On Thu, Feb 11, 2021, Paolo Bonzini wrote: > > On 11/02/21 02:56, Sean Christopherson wrote: > > > > > + pta->gpa = (vm_get_max_gfn(vm) - guest_num_pages) * > > > > > pta->guest_page_size; > > > > > + pta->gpa &= ~(pta->host_page_size - 1); > > > > Also not related to this patch, but another case for align. > > > > > > > > > if (backing_src == VM_MEM_SRC_ANONYMOUS_THP || > > > > > backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB) > > > > > - guest_test_phys_mem &= ~(KVM_UTIL_HUGEPAGE_ALIGNMENT > > > > > - 1); > > > > > - > > > > > + pta->gpa &= ~(KVM_UTIL_HUGEPAGE_ALIGNMENT - 1); > > > > also align > > > > > > > > > #ifdef __s390x__ > > > > > /* Align to 1M (segment size) */ > > > > > - guest_test_phys_mem &= ~((1 << 20) - 1); > > > > > + pta->gpa &= ~((1 << 20) - 1); > > > > And here again (oof) > > > > > > Yep, I'll fix all these and the align() comment in v2. > > > > This is not exactly align in fact; it is x & ~y rather than (x + y) & ~y. > > Are you going to introduce a round-down macro or is it a bug? (I am > > lazy...). > > Good question. I, too, was lazy. I didn't look at the guts of align() when I > moved it, and I didn't look closely at Ben's suggestion. I'll take a closer > look today and make sure everything is doing what it's supposed to do. Ooh, great point Paolo, that helper is indeed rounding up. My comment in patch #2 was totally wrong. I forgot anyone would ever want to round up. :/ My misunderstanding and the above use cases are probably good evidence that it would be helpful to have both align_up and align_down helpers.
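A minimal sketch of the two helpers suggested above, following the shape of the existing align() from patch 02 (exact names and placement are only a suggestion):

/* Round x up to the next multiple of size. Size must be a power of 2. */
static inline uint64_t align_up(uint64_t x, uint64_t size)
{
	uint64_t mask = size - 1;

	TEST_ASSERT(size != 0 && !(size & (size - 1)),
		    "size not a power of 2: %lu", size);
	return (x + mask) & ~mask;
}

/* Round x down to the previous multiple of size. Size must be a power of 2. */
static inline uint64_t align_down(uint64_t x, uint64_t size)
{
	uint64_t mask = size - 1;

	TEST_ASSERT(size != 0 && !(size & (size - 1)),
		    "size not a power of 2: %lu", size);
	return x & ~mask;
}

With these, the open-coded round-downs in perf_test_util.c, e.g. pta->gpa &= ~(pta->host_page_size - 1), become pta->gpa = align_down(pta->gpa, pta->host_page_size), while the existing align() call sites keep their round-up behavior via align_up().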
Re: [PATCH] locking/arch: Move qrwlock.h include after qspinlock.h
On Wed, Feb 10, 2021 at 7:54 AM Waiman Long wrote: > > On 2/10/21 10:05 AM, Guenter Roeck wrote: > > On 2/10/21 6:45 AM, Waiman Long wrote: > >> The queued rwlock code has a dependency on the current spinlock > >> implementation (likely to be qspinlock), but not vice versa. Including > >> qrwlock.h before qspinlock.h can be problematic when expanding qrwlock > >> functionality. > >> > >> If both qspinlock.h and qrwlock.h are to be included, the qrwlock.h > >> include should always be after qspinlock.h. Update the current set of > >> asm/spinlock.h files to enforce that. > >> > >> Signed-off-by: Waiman Long > > There should be a Fixes: tag here. If the SHA of the offending commit is not > > stable, there should be a better reference than "The queued rwlock code". > I originally have a Fixes tag when I was modifying the mips' > asm/spinlock.h file. After I realize that there are more files to > modify, I take that out. Anyway, the problem was exposed by Ben's > qrwlock patch. So existing stable releases should still be fine without > this patch. > > > > This patch fixes the build problem I had observed on mips. I also tested > > xtensa:defconfig and arm64:defconfig with no problems observed. > > > > Tested-by: Guenter Roeck > > Thanks for the testing as I don't have a build environment to verify that. > > Cheers, > Longman > Thanks Longman and Guenter for developing and testing this fix! I don't have the environment to test this either, but the patch looks good to me. Reviewed-by: Ben Gardon
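For illustration, the ordering the patch enforces in each touched asm/spinlock.h boils down to the following (exact surrounding context varies per architecture; shown here only to make the dependency direction explicit):

#include <asm/qspinlock.h>
#include <asm/qrwlock.h>

i.e. qspinlock.h always comes first, so qrwlock.h can rely on the queued spinlock definitions once the qrwlock code grows such a dependency.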
Re: [PATCH 02/15] KVM: selftests: Expose align() helpers to tests
On Wed, Feb 10, 2021 at 3:06 PM Sean Christopherson wrote: > > Refactor align() to work with non-pointers, add align_ptr() for use with > pointers, and expose both helpers so that they can be used by tests > and/or other utilities. The align() helper in particular will be used > to ensure gpa alignment for hugepages. > > No functional change intended. > > Signed-off-by: Sean Christopherson Reviewed-by: Ben Gardon > --- > tools/testing/selftests/kvm/include/kvm_util.h | 15 +++ > tools/testing/selftests/kvm/lib/kvm_util.c | 11 +-- > 2 files changed, 16 insertions(+), 10 deletions(-) > > diff --git a/tools/testing/selftests/kvm/include/kvm_util.h > b/tools/testing/selftests/kvm/include/kvm_util.h > index 2d7eb6989e83..4b5d2362a68a 100644 > --- a/tools/testing/selftests/kvm/include/kvm_util.h > +++ b/tools/testing/selftests/kvm/include/kvm_util.h > @@ -79,6 +79,21 @@ struct vm_guest_mode_params { > }; > extern const struct vm_guest_mode_params vm_guest_mode_params[]; > > +/* Aligns x up to the next multiple of size. Size must be a power of 2. */ It might also be worth updating this comment to clarify that the function rounds down, not up. > +static inline uint64_t align(uint64_t x, uint64_t size) > +{ > + uint64_t mask = size - 1; > + > + TEST_ASSERT(size != 0 && !(size & (size - 1)), > + "size not a power of 2: %lu", size); > + return ((x + mask) & ~mask); > +} > + > +static inline void *align_ptr(void *x, size_t size) > +{ > + return (void *)align((unsigned long)x, size); > +} > + > int kvm_check_cap(long cap); > int vm_enable_cap(struct kvm_vm *vm, struct kvm_enable_cap *cap); > int vcpu_enable_cap(struct kvm_vm *vm, uint32_t vcpu_id, > diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c > b/tools/testing/selftests/kvm/lib/kvm_util.c > index 960f4c5129ff..584167c6dbc7 100644 > --- a/tools/testing/selftests/kvm/lib/kvm_util.c > +++ b/tools/testing/selftests/kvm/lib/kvm_util.c > @@ -21,15 +21,6 @@ > #define KVM_UTIL_PGS_PER_HUGEPG 512 > #define KVM_UTIL_MIN_PFN 2 > > -/* Aligns x up to the next multiple of size. Size must be a power of 2. */ > -static void *align(void *x, size_t size) > -{ > - size_t mask = size - 1; > - TEST_ASSERT(size != 0 && !(size & (size - 1)), > - "size not a power of 2: %lu", size); > - return (void *) (((size_t) x + mask) & ~mask); > -} > - > /* > * Capability > * > @@ -757,7 +748,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm, > region->mmap_start, errno); > > /* Align host address */ > - region->host_mem = align(region->mmap_start, alignment); > + region->host_mem = align_ptr(region->mmap_start, alignment); > > /* As needed perform madvise */ > if (src_type == VM_MEM_SRC_ANONYMOUS || src_type == > VM_MEM_SRC_ANONYMOUS_THP) { > -- > 2.30.0.478.g8a0d178c01-goog >
Re: [PATCH 01/15] KVM: selftests: Explicitly state indices for vm_guest_mode_params array
On Wed, Feb 10, 2021 at 3:06 PM Sean Christopherson wrote: > > Explicitly state the indices when populating vm_guest_mode_params to > make it marginally easier to visualize what's going on. > > No functional change intended. > > Signed-off-by: Sean Christopherson Reviewed-by: Ben Gardon > --- > tools/testing/selftests/kvm/lib/kvm_util.c | 14 +++--- > 1 file changed, 7 insertions(+), 7 deletions(-) > > diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c > b/tools/testing/selftests/kvm/lib/kvm_util.c > index d787cb802b4a..960f4c5129ff 100644 > --- a/tools/testing/selftests/kvm/lib/kvm_util.c > +++ b/tools/testing/selftests/kvm/lib/kvm_util.c > @@ -154,13 +154,13 @@ _Static_assert(sizeof(vm_guest_mode_string)/sizeof(char > *) == NUM_VM_MODES, >"Missing new mode strings?"); > > const struct vm_guest_mode_params vm_guest_mode_params[] = { > - { 52, 48, 0x1000, 12 }, > - { 52, 48, 0x1, 16 }, > - { 48, 48, 0x1000, 12 }, > - { 48, 48, 0x1, 16 }, > - { 40, 48, 0x1000, 12 }, > - { 40, 48, 0x1, 16 }, > - { 0, 0, 0x1000, 12 }, > + [VM_MODE_P52V48_4K] = { 52, 48, 0x1000, 12 }, > + [VM_MODE_P52V48_64K]= { 52, 48, 0x1, 16 }, > + [VM_MODE_P48V48_4K] = { 48, 48, 0x1000, 12 }, > + [VM_MODE_P48V48_64K]= { 48, 48, 0x1, 16 }, > + [VM_MODE_P40V48_4K] = { 40, 48, 0x1000, 12 }, > + [VM_MODE_P40V48_64K]= { 40, 48, 0x1, 16 }, > + [VM_MODE_PXXV48_4K] = { 0, 0, 0x1000, 12 }, > }; > _Static_assert(sizeof(vm_guest_mode_params)/sizeof(struct > vm_guest_mode_params) == NUM_VM_MODES, >"Missing new mode params?"); > -- > 2.30.0.478.g8a0d178c01-goog >
Re: [PATCH 03/15] KVM: selftests: Align HVA for HugeTLB-backed memslots
On Wed, Feb 10, 2021 at 3:06 PM Sean Christopherson wrote: > > Align the HVA for HugeTLB memslots, not just THP memslots. Add an > assert so any future backing types are forced to assess whether or not > they need to be aligned. > > Cc: Ben Gardon > Cc: Yanan Wang > Cc: Andrew Jones > Cc: Peter Xu > Cc: Aaron Lewis > Signed-off-by: Sean Christopherson Reviewed-by: Ben Gardon > --- > tools/testing/selftests/kvm/lib/kvm_util.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c > b/tools/testing/selftests/kvm/lib/kvm_util.c > index 584167c6dbc7..deaeb47b5a6d 100644 > --- a/tools/testing/selftests/kvm/lib/kvm_util.c > +++ b/tools/testing/selftests/kvm/lib/kvm_util.c > @@ -731,8 +731,11 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm, > alignment = 1; > #endif > > - if (src_type == VM_MEM_SRC_ANONYMOUS_THP) > + if (src_type == VM_MEM_SRC_ANONYMOUS_THP || > + src_type == VM_MEM_SRC_ANONYMOUS_HUGETLB) > alignment = max(huge_page_size, alignment); > + else > + ASSERT_EQ(src_type, VM_MEM_SRC_ANONYMOUS); > > /* Add enough memory to align up if necessary */ > if (alignment > 1) > -- > 2.30.0.478.g8a0d178c01-goog >
Re: [PATCH 05/15] KVM: selftests: Require GPA to be aligned when backed by hugepages
On Wed, Feb 10, 2021 at 3:06 PM Sean Christopherson wrote: > > Assert that the GPA for a memslot backed by a hugepage is 1gb aligned, > and fix perf_test_util accordingly. Lack of GPA alignment prevents KVM > from backing the guest with hugepages, e.g. x86's write-protection of > hugepages when dirty logging is activated is otherwise not exercised. > > Add a comment explaining that guest_page_size is for non-huge pages to > try and avoid confusion about what it actually tracks. > > Cc: Ben Gardon > Cc: Yanan Wang > Cc: Andrew Jones > Cc: Peter Xu > Cc: Aaron Lewis > Signed-off-by: Sean Christopherson > --- > tools/testing/selftests/kvm/lib/kvm_util.c | 2 ++ > tools/testing/selftests/kvm/lib/perf_test_util.c | 9 + > 2 files changed, 11 insertions(+) > > diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c > b/tools/testing/selftests/kvm/lib/kvm_util.c > index 2e497fbab6ae..855d20784ba7 100644 > --- a/tools/testing/selftests/kvm/lib/kvm_util.c > +++ b/tools/testing/selftests/kvm/lib/kvm_util.c > @@ -735,6 +735,8 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm, > else > ASSERT_EQ(src_type, VM_MEM_SRC_ANONYMOUS); > > + ASSERT_EQ(guest_paddr, align(guest_paddr, alignment)); > + > /* Add enough memory to align up if necessary */ > if (alignment > 1) > region->mmap_size += alignment; > diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c > b/tools/testing/selftests/kvm/lib/perf_test_util.c > index 81490b9b4e32..f187b86f2e14 100644 > --- a/tools/testing/selftests/kvm/lib/perf_test_util.c > +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c > @@ -58,6 +58,11 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode > mode, int vcpus, > pr_info("Testing guest mode: %s\n", vm_guest_mode_string(mode)); > > perf_test_args.host_page_size = getpagesize(); > + > + /* > +* Snapshot the non-huge page size. This is used by the guest code to > +* access/dirty pages at the logging granularity. > +*/ > perf_test_args.guest_page_size = vm_guest_mode_params[mode].page_size; > > guest_num_pages = vm_adjust_num_guest_pages(mode, > @@ -87,6 +92,10 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode > mode, int vcpus, > guest_test_phys_mem = (vm_get_max_gfn(vm) - guest_num_pages) * > perf_test_args.guest_page_size; > guest_test_phys_mem &= ~(perf_test_args.host_page_size - 1); > + if (backing_src == VM_MEM_SRC_ANONYMOUS_THP || > + backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB) > + guest_test_phys_mem &= ~(KVM_UTIL_HUGEPAGE_ALIGNMENT - 1); You could use the align helper here as well. That would make this a little easier for me to read. > + > #ifdef __s390x__ > /* Align to 1M (segment size) */ > guest_test_phys_mem &= ~((1 << 20) - 1); > -- > 2.30.0.478.g8a0d178c01-goog >
Re: [PATCH 06/15] KVM: selftests: Use shorthand local var to access struct perf_tests_args
On Wed, Feb 10, 2021 at 3:06 PM Sean Christopherson wrote: > > Use 'pta' as a local pointer to the global perf_tests_args in order to > shorten line lengths and make the code borderline readable. > > No functional change intended. > > Signed-off-by: Sean Christopherson Reviewed-by: Ben Gardon > --- > .../selftests/kvm/lib/perf_test_util.c| 36 ++- > 1 file changed, 19 insertions(+), 17 deletions(-) > > diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c > b/tools/testing/selftests/kvm/lib/perf_test_util.c > index f187b86f2e14..73b0fccc28b9 100644 > --- a/tools/testing/selftests/kvm/lib/perf_test_util.c > +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c > @@ -23,7 +23,8 @@ static uint64_t guest_test_virt_mem = > DEFAULT_GUEST_TEST_MEM; > */ > static void guest_code(uint32_t vcpu_id) > { > - struct perf_test_vcpu_args *vcpu_args = > &perf_test_args.vcpu_args[vcpu_id]; > + struct perf_test_args *pta = &perf_test_args; > + struct perf_test_vcpu_args *vcpu_args = &pta->vcpu_args[vcpu_id]; > uint64_t gva; > uint64_t pages; > int i; > @@ -36,9 +37,9 @@ static void guest_code(uint32_t vcpu_id) > > while (true) { > for (i = 0; i < pages; i++) { > - uint64_t addr = gva + (i * > perf_test_args.guest_page_size); > + uint64_t addr = gva + (i * pta->guest_page_size); > > - if (i % perf_test_args.wr_fract == 0) > + if (i % pta->wr_fract == 0) > *(uint64_t *)addr = 0x0123456789ABCDEF; > else > READ_ONCE(*(uint64_t *)addr); > @@ -52,32 +53,32 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode > mode, int vcpus, >uint64_t vcpu_memory_bytes, >enum vm_mem_backing_src_type backing_src) > { > + struct perf_test_args *pta = &perf_test_args; > struct kvm_vm *vm; > uint64_t guest_num_pages; > > pr_info("Testing guest mode: %s\n", vm_guest_mode_string(mode)); > > - perf_test_args.host_page_size = getpagesize(); > + pta->host_page_size = getpagesize(); > > /* > * Snapshot the non-huge page size. This is used by the guest code to > * access/dirty pages at the logging granularity. > */ > - perf_test_args.guest_page_size = vm_guest_mode_params[mode].page_size; > + pta->guest_page_size = vm_guest_mode_params[mode].page_size; > > guest_num_pages = vm_adjust_num_guest_pages(mode, > - (vcpus * vcpu_memory_bytes) / > perf_test_args.guest_page_size); > + (vcpus * vcpu_memory_bytes) / > pta->guest_page_size); > > - TEST_ASSERT(vcpu_memory_bytes % perf_test_args.host_page_size == 0, > + TEST_ASSERT(vcpu_memory_bytes % pta->host_page_size == 0, > "Guest memory size is not host page size aligned."); > - TEST_ASSERT(vcpu_memory_bytes % perf_test_args.guest_page_size == 0, > + TEST_ASSERT(vcpu_memory_bytes % pta->guest_page_size == 0, > "Guest memory size is not guest page size aligned."); > > vm = vm_create_with_vcpus(mode, vcpus, > - (vcpus * vcpu_memory_bytes) / > perf_test_args.guest_page_size, > + (vcpus * vcpu_memory_bytes) / > pta->guest_page_size, > 0, guest_code, NULL); > - > - perf_test_args.vm = vm; > + pta->vm = vm; > > /* > * If there should be more memory in the guest test region than there > @@ -90,8 +91,8 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, > int vcpus, > vcpu_memory_bytes); > > guest_test_phys_mem = (vm_get_max_gfn(vm) - guest_num_pages) * > - perf_test_args.guest_page_size; > - guest_test_phys_mem &= ~(perf_test_args.host_page_size - 1); > + pta->guest_page_size; > + guest_test_phys_mem &= ~(pta->host_page_size - 1); Not really germane to this patch, but the align macro could be used here as well. 
> if (backing_src == VM_MEM_SRC_ANONYMOUS_THP || > backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB) > guest_test_phys_mem &= ~(KVM_UTIL_HUGEPAGE_ALIGNMENT - 1); > @@ -125,30 +126,31 @@ void perf_test_setup_vcpus(struct kvm_vm *vm, int vcpus, &
Re: [PATCH 09/15] KVM: selftests: Move per-VM GPA into perf_test_args
On Wed, Feb 10, 2021 at 3:06 PM Sean Christopherson wrote: > > Move the per-VM GPA into perf_test_args instead of storing it as a > separate global variable. It's not obvious that guest_test_phys_mem > holds a GPA, nor that it's connected/coupled with per_vcpu->gpa. > > No functional change intended. > > Signed-off-by: Sean Christopherson Reviewed-by: Ben Gardon > --- > .../selftests/kvm/include/perf_test_util.h| 8 +- > .../selftests/kvm/lib/perf_test_util.c| 28 --- > .../kvm/memslot_modification_stress_test.c| 2 +- > 3 files changed, 13 insertions(+), 25 deletions(-) > > diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h > b/tools/testing/selftests/kvm/include/perf_test_util.h > index 4d53238b139f..cccf1c44bddb 100644 > --- a/tools/testing/selftests/kvm/include/perf_test_util.h > +++ b/tools/testing/selftests/kvm/include/perf_test_util.h > @@ -29,6 +29,7 @@ struct perf_test_vcpu_args { > struct perf_test_args { > struct kvm_vm *vm; > uint64_t host_page_size; > + uint64_t gpa; > uint64_t guest_page_size; > int wr_fract; > > @@ -37,13 +38,6 @@ struct perf_test_args { > > extern struct perf_test_args perf_test_args; > > -/* > - * Guest physical memory offset of the testing memory slot. > - * This will be set to the topmost valid physical address minus > - * the test memory size. > - */ > -extern uint64_t guest_test_phys_mem; > - > struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus, >uint64_t vcpu_memory_bytes, >enum vm_mem_backing_src_type backing_src); > diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c > b/tools/testing/selftests/kvm/lib/perf_test_util.c > index f22ce1836547..03f125236021 100644 > --- a/tools/testing/selftests/kvm/lib/perf_test_util.c > +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c > @@ -9,8 +9,6 @@ > > struct perf_test_args perf_test_args; > > -uint64_t guest_test_phys_mem; > - > /* > * Guest virtual memory offset of the testing memory slot. > * Must not conflict with identity mapped test code. > @@ -87,29 +85,25 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode > mode, int vcpus, > TEST_ASSERT(guest_num_pages < vm_get_max_gfn(vm), > "Requested more guest memory than address space allows.\n" > "guest pages: %lx max gfn: %x vcpus: %d wss: %lx]\n", > - guest_num_pages, vm_get_max_gfn(vm), vcpus, > - vcpu_memory_bytes); > + guest_num_pages, vm_get_max_gfn(vm), vcpus, > vcpu_memory_bytes); > > - guest_test_phys_mem = (vm_get_max_gfn(vm) - guest_num_pages) * > - pta->guest_page_size; > - guest_test_phys_mem &= ~(pta->host_page_size - 1); > + pta->gpa = (vm_get_max_gfn(vm) - guest_num_pages) * > pta->guest_page_size; > + pta->gpa &= ~(pta->host_page_size - 1); Also not related to this patch, but another case for align. 
> if (backing_src == VM_MEM_SRC_ANONYMOUS_THP || > backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB) > - guest_test_phys_mem &= ~(KVM_UTIL_HUGEPAGE_ALIGNMENT - 1); > - > + pta->gpa &= ~(KVM_UTIL_HUGEPAGE_ALIGNMENT - 1); also align > #ifdef __s390x__ > /* Align to 1M (segment size) */ > - guest_test_phys_mem &= ~((1 << 20) - 1); > + pta->gpa &= ~((1 << 20) - 1); And here again (oof) > #endif > - pr_info("guest physical test memory offset: 0x%lx\n", > guest_test_phys_mem); > + pr_info("guest physical test memory offset: 0x%lx\n", pta->gpa); > > /* Add an extra memory slot for testing */ > - vm_userspace_mem_region_add(vm, backing_src, guest_test_phys_mem, > - PERF_TEST_MEM_SLOT_INDEX, > - guest_num_pages, 0); > + vm_userspace_mem_region_add(vm, backing_src, pta->gpa, > + PERF_TEST_MEM_SLOT_INDEX, > guest_num_pages, 0); > > /* Do mapping for the demand paging memory slot */ > - virt_map(vm, guest_test_virt_mem, guest_test_phys_mem, > guest_num_pages, 0); > + virt_map(vm, guest_test_virt_mem, pta->gpa, guest_num_pages, 0); > > ucall_init(vm, NULL); > > @@ -139,13 +133,13 @@ void perf_test_setup_vcpus(struct kvm_vm *vm, int vcpus, > (vcpu_id * vcpu_memory_bytes); > vcpu_ar
Re: [PATCH 07/15] KVM: selftests: Capture per-vCPU GPA in perf_test_vcpu_args
On Wed, Feb 10, 2021 at 3:06 PM Sean Christopherson wrote: > > Capture the per-vCPU GPA in perf_test_vcpu_args so that tests can get > the GPA without having to calculate the GPA on their own. > > No functional change intended. > > Signed-off-by: Sean Christopherson Reviewed-by: Ben Gardon > --- > tools/testing/selftests/kvm/include/perf_test_util.h | 1 + > tools/testing/selftests/kvm/lib/perf_test_util.c | 9 - > 2 files changed, 5 insertions(+), 5 deletions(-) > > diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h > b/tools/testing/selftests/kvm/include/perf_test_util.h > index 005f2143adeb..4d53238b139f 100644 > --- a/tools/testing/selftests/kvm/include/perf_test_util.h > +++ b/tools/testing/selftests/kvm/include/perf_test_util.h > @@ -18,6 +18,7 @@ > #define PERF_TEST_MEM_SLOT_INDEX 1 > > struct perf_test_vcpu_args { > + uint64_t gpa; > uint64_t gva; > uint64_t pages; > > diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c > b/tools/testing/selftests/kvm/lib/perf_test_util.c > index 73b0fccc28b9..f22ce1836547 100644 > --- a/tools/testing/selftests/kvm/lib/perf_test_util.c > +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c > @@ -127,7 +127,6 @@ void perf_test_setup_vcpus(struct kvm_vm *vm, int vcpus, >bool partition_vcpu_memory_access) > { > struct perf_test_args *pta = &perf_test_args; > - vm_paddr_t vcpu_gpa; > struct perf_test_vcpu_args *vcpu_args; > int vcpu_id; > > @@ -140,17 +139,17 @@ void perf_test_setup_vcpus(struct kvm_vm *vm, int vcpus, > (vcpu_id * vcpu_memory_bytes); > vcpu_args->pages = vcpu_memory_bytes / >pta->guest_page_size; > - vcpu_gpa = guest_test_phys_mem + > - (vcpu_id * vcpu_memory_bytes); > + vcpu_args->gpa = guest_test_phys_mem + > +(vcpu_id * vcpu_memory_bytes); > } else { > vcpu_args->gva = guest_test_virt_mem; > vcpu_args->pages = (vcpus * vcpu_memory_bytes) / >pta->guest_page_size; > - vcpu_gpa = guest_test_phys_mem; > + vcpu_args->gpa = guest_test_phys_mem; > } > > pr_debug("Added VCPU %d with test mem gpa [%lx, %lx)\n", > -vcpu_id, vcpu_gpa, vcpu_gpa + > +vcpu_id, vcpu_args->gpa, vcpu_args->gpa + > (vcpu_args->pages * pta->guest_page_size)); > } > } > -- > 2.30.0.478.g8a0d178c01-goog >
Re: [PATCH 08/15] KVM: selftests: Use perf util's per-vCPU GPA/pages in demand paging test
On Wed, Feb 10, 2021 at 3:06 PM Sean Christopherson wrote: > > Grab the per-vCPU GPA and number of pages from perf_util in the demand > paging test instead of duplicating perf_util's calculations. > > Note, this may or may not result in a functional change. It's not clear > that the test's calculations are guaranteed to yield the same value as > perf_util, e.g. if guest_percpu_mem_size != vcpu_args->pages. > > Signed-off-by: Sean Christopherson Reviewed-by: Ben Gardon > --- > .../selftests/kvm/demand_paging_test.c| 20 +-- > 1 file changed, 5 insertions(+), 15 deletions(-) > > diff --git a/tools/testing/selftests/kvm/demand_paging_test.c > b/tools/testing/selftests/kvm/demand_paging_test.c > index 5f7a229c3af1..0cbf111e6c21 100644 > --- a/tools/testing/selftests/kvm/demand_paging_test.c > +++ b/tools/testing/selftests/kvm/demand_paging_test.c > @@ -294,24 +294,13 @@ static void run_test(enum vm_guest_mode mode, void *arg) > TEST_ASSERT(pipefds, "Unable to allocate memory for pipefd"); > > for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) { > - vm_paddr_t vcpu_gpa; > + struct perf_test_vcpu_args *vcpu_args; > void *vcpu_hva; > - uint64_t vcpu_mem_size; > > - > - if (p->partition_vcpu_memory_access) { > - vcpu_gpa = guest_test_phys_mem + > - (vcpu_id * guest_percpu_mem_size); > - vcpu_mem_size = guest_percpu_mem_size; > - } else { > - vcpu_gpa = guest_test_phys_mem; > - vcpu_mem_size = guest_percpu_mem_size * > nr_vcpus; > - } > - PER_VCPU_DEBUG("Added VCPU %d with test mem gpa [%lx, > %lx)\n", > - vcpu_id, vcpu_gpa, vcpu_gpa + > vcpu_mem_size); > + vcpu_args = &perf_test_args.vcpu_args[vcpu_id]; > > /* Cache the HVA pointer of the region */ > - vcpu_hva = addr_gpa2hva(vm, vcpu_gpa); > + vcpu_hva = addr_gpa2hva(vm, vcpu_args->gpa); > > /* > * Set up user fault fd to handle demand paging > @@ -325,7 +314,8 @@ static void run_test(enum vm_guest_mode mode, void *arg) > > &uffd_handler_threads[vcpu_id], > pipefds[vcpu_id * 2], > p->uffd_delay, > &uffd_args[vcpu_id], > - vcpu_hva, vcpu_mem_size); > + vcpu_hva, > + vcpu_args->pages * > perf_test_args.guest_page_size); > if (r < 0) > exit(-r); > } > -- > 2.30.0.478.g8a0d178c01-goog >
Re: [PATCH 10/15] KVM: selftests: Remove perf_test_args.host_page_size
On Wed, Feb 10, 2021 at 3:06 PM Sean Christopherson wrote: > > Remove perf_test_args.host_page_size and instead use getpagesize() so > that it's somewhat obvious that, for tests that care about the host page > size, they care about the system page size, not the hardware page size, > e.g. that the logic is unchanged if hugepages are in play. > > No functional change intended. > > Signed-off-by: Sean Christopherson Reviewed-by: Ben Gardon > --- > tools/testing/selftests/kvm/demand_paging_test.c | 8 > tools/testing/selftests/kvm/include/perf_test_util.h | 1 - > tools/testing/selftests/kvm/lib/perf_test_util.c | 6 ++ > .../selftests/kvm/memslot_modification_stress_test.c | 2 +- > 4 files changed, 7 insertions(+), 10 deletions(-) > > diff --git a/tools/testing/selftests/kvm/demand_paging_test.c > b/tools/testing/selftests/kvm/demand_paging_test.c > index 0cbf111e6c21..b937a65b0e6d 100644 > --- a/tools/testing/selftests/kvm/demand_paging_test.c > +++ b/tools/testing/selftests/kvm/demand_paging_test.c > @@ -83,7 +83,7 @@ static int handle_uffd_page_request(int uffd, uint64_t addr) > > copy.src = (uint64_t)guest_data_prototype; > copy.dst = addr; > - copy.len = perf_test_args.host_page_size; > + copy.len = getpagesize(); > copy.mode = 0; > > clock_gettime(CLOCK_MONOTONIC, &start); > @@ -100,7 +100,7 @@ static int handle_uffd_page_request(int uffd, uint64_t > addr) > PER_PAGE_DEBUG("UFFDIO_COPY %d \t%ld ns\n", tid, >timespec_to_ns(ts_diff)); > PER_PAGE_DEBUG("Paged in %ld bytes at 0x%lx from thread %d\n", > - perf_test_args.host_page_size, addr, tid); > + getpagesize(), addr, tid); > > return 0; > } > @@ -271,10 +271,10 @@ static void run_test(enum vm_guest_mode mode, void *arg) > > perf_test_args.wr_fract = 1; > > - guest_data_prototype = malloc(perf_test_args.host_page_size); > + guest_data_prototype = malloc(getpagesize()); > TEST_ASSERT(guest_data_prototype, > "Failed to allocate buffer for guest data pattern"); > - memset(guest_data_prototype, 0xAB, perf_test_args.host_page_size); > + memset(guest_data_prototype, 0xAB, getpagesize()); > > vcpu_threads = malloc(nr_vcpus * sizeof(*vcpu_threads)); > TEST_ASSERT(vcpu_threads, "Memory allocation failed"); > diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h > b/tools/testing/selftests/kvm/include/perf_test_util.h > index cccf1c44bddb..223fe6b79a04 100644 > --- a/tools/testing/selftests/kvm/include/perf_test_util.h > +++ b/tools/testing/selftests/kvm/include/perf_test_util.h > @@ -28,7 +28,6 @@ struct perf_test_vcpu_args { > > struct perf_test_args { > struct kvm_vm *vm; > - uint64_t host_page_size; > uint64_t gpa; > uint64_t guest_page_size; > int wr_fract; > diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c > b/tools/testing/selftests/kvm/lib/perf_test_util.c > index 03f125236021..982a86c8eeaa 100644 > --- a/tools/testing/selftests/kvm/lib/perf_test_util.c > +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c > @@ -57,8 +57,6 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, > int vcpus, > > pr_info("Testing guest mode: %s\n", vm_guest_mode_string(mode)); > > - pta->host_page_size = getpagesize(); > - > /* > * Snapshot the non-huge page size. This is used by the guest code to > * access/dirty pages at the logging granularity. 
> @@ -68,7 +66,7 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, > int vcpus, > guest_num_pages = vm_adjust_num_guest_pages(mode, > (vcpus * vcpu_memory_bytes) / > pta->guest_page_size); > > - TEST_ASSERT(vcpu_memory_bytes % pta->host_page_size == 0, > + TEST_ASSERT(vcpu_memory_bytes % getpagesize() == 0, > "Guest memory size is not host page size aligned."); > TEST_ASSERT(vcpu_memory_bytes % pta->guest_page_size == 0, > "Guest memory size is not guest page size aligned."); > @@ -88,7 +86,7 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, > int vcpus, > guest_num_pages, vm_get_max_gfn(vm), vcpus, > vcpu_memory_bytes); > > pta->gpa = (vm_get_max_gfn(vm) - guest_num_pages) * > pta->guest_page_size; > - pta->gpa &= ~(pta->host_page_size - 1); > + pta->gpa &= ~(getpages
Re: [PATCH 11/15] KVM: selftests: Create VM with adjusted number of guest pages for perf tests
On Wed, Feb 10, 2021 at 3:06 PM Sean Christopherson wrote: > > Use the already computed guest_num_pages when creating the so called > extra VM pages for a perf test, and add a comment explaining why the > pages are allocated as extra pages. > > Signed-off-by: Sean Christopherson Reviewed-by: Ben Gardon > --- > tools/testing/selftests/kvm/lib/perf_test_util.c | 9 ++--- > 1 file changed, 6 insertions(+), 3 deletions(-) > > diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c > b/tools/testing/selftests/kvm/lib/perf_test_util.c > index 982a86c8eeaa..9b0cfdf10772 100644 > --- a/tools/testing/selftests/kvm/lib/perf_test_util.c > +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c > @@ -71,9 +71,12 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode > mode, int vcpus, > TEST_ASSERT(vcpu_memory_bytes % pta->guest_page_size == 0, > "Guest memory size is not guest page size aligned."); > > - vm = vm_create_with_vcpus(mode, vcpus, > - (vcpus * vcpu_memory_bytes) / > pta->guest_page_size, > - 0, guest_code, NULL); > + /* > +* Pass guest_num_pages to populate the page tables for test memory. > +* The memory is also added to memslot 0, but that's a benign side > +* effect as KVM allows aliasing HVAs in memslots. > +*/ > + vm = vm_create_with_vcpus(mode, vcpus, 0, guest_num_pages, > guest_code, NULL); > pta->vm = vm; > > /* > -- > 2.30.0.478.g8a0d178c01-goog >
[PATCH v2 00/28] Allow parallel MMU operations with TDP MMU
The TDP MMU was implemented to simplify and improve the performance of KVM's memory management on modern hardware with TDP (EPT / NPT). To build on the existing performance improvements of the TDP MMU, add the ability to handle vCPU page faults, enabling and disabling dirty logging, and removing mappings, in parallel. In the current implementation, vCPU page faults (actually EPT/NPT violations/misconfigurations) are the largest source of MMU lock contention on VMs with many vCPUs. This contention, and the resulting page fault latency, can soft-lock guests and degrade performance. Handling page faults in parallel is especially useful when booting VMs, enabling dirty logging, and handling demand paging. In all these cases vCPUs are constantly incurring page faults on each new page accessed. Broadly, the following changes were required to allow parallel page faults (and other MMU operations): -- Contention detection and yielding added to rwlocks to bring them up to feature parity with spin locks, at least as far as the use of the MMU lock is concerned. -- TDP MMU page table memory is protected with RCU and freed in RCU callbacks to allow multiple threads to operate on that memory concurrently. -- The MMU lock was changed to an rwlock on x86. This allows the page fault handlers to acquire the MMU lock in read mode and handle page faults in parallel, and other operations to maintain exclusive use of the lock by acquiring it in write mode. -- An additional lock is added to protect some data structures needed by the page fault handlers, for relatively infrequent operations. -- The page fault handler is modified to use atomic cmpxchgs to set SPTEs and some page fault handler operations are modified slightly to work concurrently with other threads. This series also contains a few bug fixes and optimizations, related to the above, but not strictly part of enabling parallel page fault handling. Correctness testing: The following tests were performed with an SMP kernel and DBX kernel on an Intel Skylake machine. The tests were run both with and without the TDP MMU enabled. -- This series introduces no new failures in kvm-unit-tests SMP + no TDP MMU no new failures SMP + TDP MMU no new failures DBX + no TDP MMU no new failures DBX + TDP MMU no new failures -- All KVM selftests behave as expected SMP + no TDP MMU all pass except ./x86_64/vmx_preemption_timer_test SMP + TDP MMU all pass except ./x86_64/vmx_preemption_timer_test (./x86_64/vmx_preemption_timer_test also fails without this patch set, both with the TDP MMU on and off.) 
DBX + no TDP MMU all pass DBX + TDP MMU all pass -- A VM can be booted running a Debian 9 and all memory accessed SMP + no TDP MMU works SMP + TDP MMU works DBX + no TDP MMU works DBX + TDP MMU works This series can be viewed in Gerrit at: https://linux-review.googlesource.com/c/linux/kernel/git/torvalds/linux/+/7172 Changelog v1 -> v2: - Removed the MMU lock union + using a spinlock when the TDP MMU is disabled - Merged RCU commits - Extended additional MMU operations to operate in parallel - Ammended dirty log perf test to cover newly parallelized code paths - Misc refactorings (see changelogs for individual commits) - Big thanks to Sean and Paolo for their thorough review of v1 Ben Gardon (28): KVM: x86/mmu: change TDP MMU yield function returns to match cond_resched KVM: x86/mmu: Add comment on __tdp_mmu_set_spte KVM: x86/mmu: Add lockdep when setting a TDP MMU SPTE KVM: x86/mmu: Don't redundantly clear TDP MMU pt memory KVM: x86/mmu: Factor out handling of removed page tables locking/rwlocks: Add contention detection for rwlocks sched: Add needbreak for rwlocks sched: Add cond_resched_rwlock KVM: x86/mmu: Fix braces in kvm_recover_nx_lpages KVM: x86/mmu: Fix TDP MMU zap collapsible SPTEs KVM: x86/mmu: Merge flush and non-flush tdp_mmu_iter_cond_resched KVM: x86/mmu: Rename goal_gfn to next_last_level_gfn KVM: x86/mmu: Ensure forward progress when yielding in TDP MMU iter KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed KVM: x86/mmu: Skip no-op changes in TDP MMU functions KVM: x86/mmu: Clear dirtied pages mask bit before early break KVM: x86/mmu: Protect TDP MMU page table memory with RCU KVM: x86/mmu: Use an rwlock for the x86 MMU KVM: x86/mmu: Factor out functions to add/remove TDP MMU pages KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler KVM: x86/mmu: Mark SPTEs in disconnected pages as removed KVM: x86/mmu: Allow parallel page faults for the TDP MMU KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock KVM: x86/mmu: Allow zapping collapsible SPTEs to use MMU read lock KVM: x86/mmu: Allow enabling / disabling dirty logging under MMU read lock KVM: selftests: Add backing src parameter to dirty_log_perf_test KVM: selftests: Disable dirty logging with vCPUs running arch/x8
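To make the locking and atomicity changes concrete, the core pattern used on the parallelized page fault path is roughly the following. This is a simplified sketch, not the code from any single patch in the series: faults hold the MMU lock in read mode and install SPTEs with an atomic compare-and-exchange, retrying the fault if another thread raced, while operations that need exclusivity still take the lock in write mode.

/*
 * Simplified, illustrative sketch of installing an SPTE under the MMU
 * read lock; the body does not match the series' code exactly.
 */
static bool tdp_mmu_set_spte_atomic_sketch(struct kvm *kvm,
					   struct tdp_iter *iter,
					   u64 new_spte)
{
	struct kvm_mmu_page *root = sptep_to_sp(tdp_iter_root_pt(iter));

	lockdep_assert_held_read(&kvm->mmu_lock);

	/*
	 * If another vCPU changed the SPTE since it was read, give up;
	 * the faulting vCPU simply re-takes the fault and retries.
	 */
	if (cmpxchg64(iter->sptep, iter->old_spte, new_spte) != iter->old_spte)
		return false;

	handle_changed_spte(kvm, kvm_mmu_page_as_id(root), iter->gfn,
			    iter->old_spte, new_spte, iter->level);
	return true;
}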
[PATCH v2 02/28] KVM: x86/mmu: Add comment on __tdp_mmu_set_spte
__tdp_mmu_set_spte is a very important function in the TDP MMU which already accepts several arguments and will take more in future commits. To offset this complexity, add a comment to the function describing each of the arguemnts. No functional change intended. Reviewed-by: Peter Feiner Acked-by: Paolo Bonzini Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 16 1 file changed, 16 insertions(+) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index e9f9ff81a38e..3d8cca238eba 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -357,6 +357,22 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, new_spte, level); } +/* + * __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping + * @kvm: kvm instance + * @iter: a tdp_iter instance currently on the SPTE that should be set + * @new_spte: The value the SPTE should be set to + * @record_acc_track: Notify the MM subsystem of changes to the accessed state + * of the page. Should be set unless handling an MMU + * notifier for access tracking. Leaving record_acc_track + * unset in that case prevents page accesses from being + * double counted. + * @record_dirty_log: Record the page as dirty in the dirty bitmap if + * appropriate for the change being made. Should be set + * unless performing certain dirty logging operations. + * Leaving record_dirty_log unset in that case prevents page + * writes from being double counted. + */ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter, u64 new_spte, bool record_acc_track, bool record_dirty_log) -- 2.30.0.365.g02bc693789-goog
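For context on how the two flags are consumed, the setter is typically used through thin wrappers along the following lines (tdp_mmu_set_spte() and tdp_mmu_set_spte_no_dirty_log() appear in later patches of this series; the no-acc-track variant is shown here as an assumed illustration):

static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
				    u64 new_spte)
{
	/* Common case: record both accessed and dirty state. */
	__tdp_mmu_set_spte(kvm, iter, new_spte, true, true);
}

static inline void tdp_mmu_set_spte_no_acc_track(struct kvm *kvm,
						 struct tdp_iter *iter,
						 u64 new_spte)
{
	/* MMU notifier access tracking: don't double count accesses. */
	__tdp_mmu_set_spte(kvm, iter, new_spte, false, true);
}

static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
						 struct tdp_iter *iter,
						 u64 new_spte)
{
	/* Dirty logging operations: don't double count writes. */
	__tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
}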
[PATCH v2 03/28] KVM: x86/mmu: Add lockdep when setting a TDP MMU SPTE
Add lockdep to __tdp_mmu_set_spte to ensure that SPTEs are only modified under the MMU lock. No functional change intended. Reviewed-by: Peter Feiner Reviewed-by: Sean Christopherson Acked-by: Paolo Bonzini Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 3d8cca238eba..b83a6a3ad29c 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -381,6 +381,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter, struct kvm_mmu_page *root = sptep_to_sp(root_pt); int as_id = kvm_mmu_page_as_id(root); + lockdep_assert_held(&kvm->mmu_lock); + WRITE_ONCE(*iter->sptep, new_spte); __handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte, -- 2.30.0.365.g02bc693789-goog
[PATCH v2 01/28] KVM: x86/mmu: change TDP MMU yield function returns to match cond_resched
Currently the TDP MMU yield / cond_resched functions either return nothing or return true if the TLBs were not flushed. These are confusing semantics, especially when making control flow decisions in calling functions. To clean things up, change both functions to have the same return value semantics as cond_resched: true if the thread yielded, false if it did not. If the function yielded in the _flush_ version, then the TLBs will have been flushed. Reviewed-by: Peter Feiner Acked-by: Paolo Bonzini Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 39 -- 1 file changed, 29 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 2ef8615f9dba..e9f9ff81a38e 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -413,8 +413,15 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm, _mmu->shadow_root_level, _start, _end) /* - * Flush the TLB if the process should drop kvm->mmu_lock. - * Return whether the caller still needs to flush the tlb. + * Flush the TLB and yield if the MMU lock is contended or this thread needs to + * return control to the scheduler. + * + * If this function yields, it will also reset the tdp_iter's walk over the + * paging structure and the calling function should allow the iterator to + * continue its traversal from the paging structure root. + * + * Return true if this function yielded, the TLBs were flushed, and the + * iterator's traversal was reset. Return false if a yield was not needed. */ static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *iter) { @@ -422,18 +429,32 @@ static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *it kvm_flush_remote_tlbs(kvm); cond_resched_lock(&kvm->mmu_lock); tdp_iter_refresh_walk(iter); - return false; - } else { return true; } + + return false; } -static void tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter) +/* + * Yield if the MMU lock is contended or this thread needs to return control + * to the scheduler. + * + * If this function yields, it will also reset the tdp_iter's walk over the + * paging structure and the calling function should allow the iterator to + * continue its traversal from the paging structure root. + * + * Return true if this function yielded and the iterator's traversal was reset. + * Return false if a yield was not needed. + */ +static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter) { if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { cond_resched_lock(&kvm->mmu_lock); tdp_iter_refresh_walk(iter); + return true; } + + return false; } /* @@ -469,10 +490,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, tdp_mmu_set_spte(kvm, &iter, 0); - if (can_yield) - flush_needed = tdp_mmu_iter_flush_cond_resched(kvm, &iter); - else - flush_needed = true; + flush_needed = !can_yield || + !tdp_mmu_iter_flush_cond_resched(kvm, &iter); } return flush_needed; } @@ -1072,7 +1091,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm, tdp_mmu_set_spte(kvm, &iter, 0); - spte_set = tdp_mmu_iter_flush_cond_resched(kvm, &iter); + spte_set = !tdp_mmu_iter_flush_cond_resched(kvm, &iter); } if (spte_set) -- 2.30.0.365.g02bc693789-goog
[PATCH v2 04/28] KVM: x86/mmu: Don't redundantly clear TDP MMU pt memory
The KVM MMU caches already guarantee that shadow page table memory will be zeroed, so there is no reason to re-zero the page in the TDP MMU page fault handler. No functional change intended. Reviewed-by: Peter Feiner Reviewed-by: Sean Christopherson Acked-by: Paolo Bonzini Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index b83a6a3ad29c..3828c0e83466 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -655,7 +655,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level); list_add(&sp->link, &vcpu->kvm->arch.tdp_mmu_pages); child_pt = sp->spt; - clear_page(child_pt); new_spte = make_nonleaf_spte(child_pt, !shadow_accessed_mask); -- 2.30.0.365.g02bc693789-goog
[PATCH v2 06/28] locking/rwlocks: Add contention detection for rwlocks
rwlocks do not currently have any facility to detect contention like spinlocks do. In order to allow users of rwlocks to better manage latency, add contention detection for queued rwlocks. CC: Ingo Molnar CC: Will Deacon Acked-by: Peter Zijlstra Acked-by: Davidlohr Bueso Acked-by: Waiman Long Acked-by: Paolo Bonzini Signed-off-by: Ben Gardon --- include/asm-generic/qrwlock.h | 24 ++-- include/linux/rwlock.h| 7 +++ 2 files changed, 25 insertions(+), 6 deletions(-) diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h index 84ce841ce735..0020d3b820a7 100644 --- a/include/asm-generic/qrwlock.h +++ b/include/asm-generic/qrwlock.h @@ -14,6 +14,7 @@ #include #include +#include /* * Writer states & reader shift and bias. @@ -116,15 +117,26 @@ static inline void queued_write_unlock(struct qrwlock *lock) smp_store_release(&lock->wlocked, 0); } +/** + * queued_rwlock_is_contended - check if the lock is contended + * @lock : Pointer to queue rwlock structure + * Return: 1 if lock contended, 0 otherwise + */ +static inline int queued_rwlock_is_contended(struct qrwlock *lock) +{ + return arch_spin_is_locked(&lock->wait_lock); +} + /* * Remapping rwlock architecture specific functions to the corresponding * queue rwlock functions. */ -#define arch_read_lock(l) queued_read_lock(l) -#define arch_write_lock(l) queued_write_lock(l) -#define arch_read_trylock(l) queued_read_trylock(l) -#define arch_write_trylock(l) queued_write_trylock(l) -#define arch_read_unlock(l)queued_read_unlock(l) -#define arch_write_unlock(l) queued_write_unlock(l) +#define arch_read_lock(l) queued_read_lock(l) +#define arch_write_lock(l) queued_write_lock(l) +#define arch_read_trylock(l) queued_read_trylock(l) +#define arch_write_trylock(l) queued_write_trylock(l) +#define arch_read_unlock(l)queued_read_unlock(l) +#define arch_write_unlock(l) queued_write_unlock(l) +#define arch_rwlock_is_contended(l)queued_rwlock_is_contended(l) #endif /* __ASM_GENERIC_QRWLOCK_H */ diff --git a/include/linux/rwlock.h b/include/linux/rwlock.h index 3dcd617e65ae..7ce9a51ae5c0 100644 --- a/include/linux/rwlock.h +++ b/include/linux/rwlock.h @@ -128,4 +128,11 @@ do { \ 1 : ({ local_irq_restore(flags); 0; }); \ }) +#ifdef arch_rwlock_is_contended +#define rwlock_is_contended(lock) \ +arch_rwlock_is_contended(&(lock)->raw_lock) +#else +#define rwlock_is_contended(lock) ((void)(lock), 0) +#endif /* arch_rwlock_is_contended */ + #endif /* __LINUX_RWLOCK_H */ -- 2.30.0.365.g02bc693789-goog
[PATCH v2 07/28] sched: Add needbreak for rwlocks
Contention awareness while holding a spin lock is essential for reducing latency when long running kernel operations can hold that lock. Add the same contention detection interface for read/write spin locks. CC: Ingo Molnar CC: Will Deacon Acked-by: Peter Zijlstra Acked-by: Davidlohr Bueso Acked-by: Waiman Long Acked-by: Paolo Bonzini Signed-off-by: Ben Gardon --- include/linux/sched.h | 17 + 1 file changed, 17 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 6e3a5eeec509..5d1378e5a040 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1912,6 +1912,23 @@ static inline int spin_needbreak(spinlock_t *lock) #endif } +/* + * Check if a rwlock is contended. + * Returns non-zero if there is another task waiting on the rwlock. + * Returns zero if the lock is not contended or the system / underlying + * rwlock implementation does not support contention detection. + * Technically does not depend on CONFIG_PREEMPTION, but a general need + * for low latency. + */ +static inline int rwlock_needbreak(rwlock_t *lock) +{ +#ifdef CONFIG_PREEMPTION + return rwlock_is_contended(lock); +#else + return 0; +#endif +} + static __always_inline bool need_resched(void) { return unlikely(tif_need_resched()); -- 2.30.0.365.g02bc693789-goog
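As an illustration of how this is meant to be consumed (a later patch in this series adds a cond_resched_rwlock helper; the function below is a sketch with an assumed name, not that patch's code), a long-running read-side critical section can use rwlock_needbreak() to drop the lock when a writer is waiting or a reschedule is due:

/* Illustrative only: yield a read-held rwlock under contention. */
static int example_cond_resched_rwlock_read(rwlock_t *lock)
{
	int resched = 0;

	if (rwlock_needbreak(lock) || need_resched()) {
		read_unlock(lock);
		cond_resched();
		read_lock(lock);
		resched = 1;
	}

	return resched;
}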
[PATCH v2 14/28] KVM: x86/mmu: Yield in TDP MMU iter even if no SPTEs changed
Given certain conditions, some TDP MMU functions may not yield reliably / frequently enough. For example, if a paging structure was very large but had few, if any writable entries, wrprot_gfn_range could traverse many entries before finding a writable entry and yielding because the check for yielding only happens after an SPTE is modified. Fix this issue by moving the yield to the beginning of the loop. Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- v1 -> v2 - Split patch into three arch/x86/kvm/mmu/tdp_mmu.c | 32 ++-- 1 file changed, 22 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 7cfc0639b1ef..c8a1149cb229 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -501,6 +501,12 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, bool flush_needed = false; tdp_root_for_each_pte(iter, root, start, end) { + if (can_yield && + tdp_mmu_iter_cond_resched(kvm, &iter, flush_needed)) { + flush_needed = false; + continue; + } + if (!is_shadow_present_pte(iter.old_spte)) continue; @@ -515,9 +521,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, continue; tdp_mmu_set_spte(kvm, &iter, 0); - - flush_needed = !(can_yield && -tdp_mmu_iter_cond_resched(kvm, &iter, true)); + flush_needed = true; } return flush_needed; } @@ -880,6 +884,9 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, for_each_tdp_pte_min_level(iter, root->spt, root->role.level, min_level, start, end) { + if (tdp_mmu_iter_cond_resched(kvm, &iter, false)) + continue; + if (!is_shadow_present_pte(iter.old_spte) || !is_last_spte(iter.old_spte, iter.level)) continue; @@ -888,8 +895,6 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte); spte_set = true; - - tdp_mmu_iter_cond_resched(kvm, &iter, false); } return spte_set; } @@ -933,6 +938,9 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, bool spte_set = false; tdp_root_for_each_leaf_pte(iter, root, start, end) { + if (tdp_mmu_iter_cond_resched(kvm, &iter, false)) + continue; + if (spte_ad_need_write_protect(iter.old_spte)) { if (is_writable_pte(iter.old_spte)) new_spte = iter.old_spte & ~PT_WRITABLE_MASK; @@ -947,8 +955,6 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte); spte_set = true; - - tdp_mmu_iter_cond_resched(kvm, &iter, false); } return spte_set; } @@ -1056,6 +1062,9 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, bool spte_set = false; tdp_root_for_each_pte(iter, root, start, end) { + if (tdp_mmu_iter_cond_resched(kvm, &iter, false)) + continue; + if (!is_shadow_present_pte(iter.old_spte)) continue; @@ -1063,8 +1072,6 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, tdp_mmu_set_spte(kvm, &iter, new_spte); spte_set = true; - - tdp_mmu_iter_cond_resched(kvm, &iter, false); } return spte_set; @@ -1105,6 +1112,11 @@ static void zap_collapsible_spte_range(struct kvm *kvm, bool spte_set = false; tdp_root_for_each_pte(iter, root, start, end) { + if (tdp_mmu_iter_cond_resched(kvm, &iter, spte_set)) { + spte_set = false; + continue; + } + if (!is_shadow_present_pte(iter.old_spte) || !is_last_spte(iter.old_spte, iter.level)) continue; @@ -1116,7 +1128,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm, 
tdp_mmu_set_spte(kvm, &iter, 0); - spte_set = !tdp_mmu_iter_cond_resched(kvm, &iter, true); + spte_set = true; } if (spte_set) -- 2.30.0.365.g02bc693789-goog
[PATCH v2 12/28] KVM: x86/mmu: Rename goal_gfn to next_last_level_gfn
The goal_gfn field in tdp_iter can be misleading as it implies that it is the iterator's final goal. It is really a taget for the lowest gfn mapped by the leaf level SPTE the iterator will traverse towards. Change the field's name to be more precise. Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_iter.c | 20 ++-- arch/x86/kvm/mmu/tdp_iter.h | 4 ++-- 2 files changed, 12 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c index 87b7e16911db..9917c55b7d24 100644 --- a/arch/x86/kvm/mmu/tdp_iter.c +++ b/arch/x86/kvm/mmu/tdp_iter.c @@ -22,21 +22,21 @@ static gfn_t round_gfn_for_level(gfn_t gfn, int level) /* * Sets a TDP iterator to walk a pre-order traversal of the paging structure - * rooted at root_pt, starting with the walk to translate goal_gfn. + * rooted at root_pt, starting with the walk to translate next_last_level_gfn. */ void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level, - int min_level, gfn_t goal_gfn) + int min_level, gfn_t next_last_level_gfn) { WARN_ON(root_level < 1); WARN_ON(root_level > PT64_ROOT_MAX_LEVEL); - iter->goal_gfn = goal_gfn; + iter->next_last_level_gfn = next_last_level_gfn; iter->root_level = root_level; iter->min_level = min_level; iter->level = root_level; iter->pt_path[iter->level - 1] = root_pt; - iter->gfn = round_gfn_for_level(iter->goal_gfn, iter->level); + iter->gfn = round_gfn_for_level(iter->next_last_level_gfn, iter->level); tdp_iter_refresh_sptep(iter); iter->valid = true; @@ -82,7 +82,7 @@ static bool try_step_down(struct tdp_iter *iter) iter->level--; iter->pt_path[iter->level - 1] = child_pt; - iter->gfn = round_gfn_for_level(iter->goal_gfn, iter->level); + iter->gfn = round_gfn_for_level(iter->next_last_level_gfn, iter->level); tdp_iter_refresh_sptep(iter); return true; @@ -106,7 +106,7 @@ static bool try_step_side(struct tdp_iter *iter) return false; iter->gfn += KVM_PAGES_PER_HPAGE(iter->level); - iter->goal_gfn = iter->gfn; + iter->next_last_level_gfn = iter->gfn; iter->sptep++; iter->old_spte = READ_ONCE(*iter->sptep); @@ -166,13 +166,13 @@ void tdp_iter_next(struct tdp_iter *iter) */ void tdp_iter_refresh_walk(struct tdp_iter *iter) { - gfn_t goal_gfn = iter->goal_gfn; + gfn_t next_last_level_gfn = iter->next_last_level_gfn; - if (iter->gfn > goal_gfn) - goal_gfn = iter->gfn; + if (iter->gfn > next_last_level_gfn) + next_last_level_gfn = iter->gfn; tdp_iter_start(iter, iter->pt_path[iter->root_level - 1], - iter->root_level, iter->min_level, goal_gfn); + iter->root_level, iter->min_level, next_last_level_gfn); } u64 *tdp_iter_root_pt(struct tdp_iter *iter) diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h index 47170d0dc98e..b2dd269c631f 100644 --- a/arch/x86/kvm/mmu/tdp_iter.h +++ b/arch/x86/kvm/mmu/tdp_iter.h @@ -15,7 +15,7 @@ struct tdp_iter { * The iterator will traverse the paging structure towards the mapping * for this GFN. */ - gfn_t goal_gfn; + gfn_t next_last_level_gfn; /* Pointers to the page tables traversed to reach the current SPTE */ u64 *pt_path[PT64_ROOT_MAX_LEVEL]; /* A pointer to the current SPTE */ @@ -52,7 +52,7 @@ struct tdp_iter { u64 *spte_to_child_pt(u64 pte, int level); void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level, - int min_level, gfn_t goal_gfn); + int min_level, gfn_t next_last_level_gfn); void tdp_iter_next(struct tdp_iter *iter); void tdp_iter_refresh_walk(struct tdp_iter *iter); u64 *tdp_iter_root_pt(struct tdp_iter *iter); -- 2.30.0.365.g02bc693789-goog
[PATCH v2 05/28] KVM: x86/mmu: Factor out handling of removed page tables
Factor out the code to handle a disconnected subtree of the TDP paging structure from the code to handle the change to an individual SPTE. Future commits will build on this to allow asynchronous page freeing. No functional change intended. Reviewed-by: Peter Feiner Acked-by: Paolo Bonzini Signed-off-by: Ben Gardon --- v1 -> v2 - Replaced "disconnected" with "removed" updated derivative comments and code arch/x86/kvm/mmu/tdp_mmu.c | 71 ++ 1 file changed, 42 insertions(+), 29 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 3828c0e83466..c3075fb568eb 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -234,6 +234,45 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn, } } +/** + * handle_removed_tdp_mmu_page - handle a pt removed from the TDP structure + * + * @kvm: kvm instance + * @pt: the page removed from the paging structure + * + * Given a page table that has been removed from the TDP paging structure, + * iterates through the page table to clear SPTEs and free child page tables. + */ +static void handle_removed_tdp_mmu_page(struct kvm *kvm, u64 *pt) +{ + struct kvm_mmu_page *sp = sptep_to_sp(pt); + int level = sp->role.level; + gfn_t gfn = sp->gfn; + u64 old_child_spte; + int i; + + trace_kvm_mmu_prepare_zap_page(sp); + + list_del(&sp->link); + + if (sp->lpage_disallowed) + unaccount_huge_nx_page(kvm, sp); + + for (i = 0; i < PT64_ENT_PER_PAGE; i++) { + old_child_spte = READ_ONCE(*(pt + i)); + WRITE_ONCE(*(pt + i), 0); + handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), + gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)), + old_child_spte, 0, level - 1); + } + + kvm_flush_remote_tlbs_with_address(kvm, gfn, + KVM_PAGES_PER_HPAGE(level)); + + free_page((unsigned long)pt); + kmem_cache_free(mmu_page_header_cache, sp); +} + /** * handle_changed_spte - handle bookkeeping associated with an SPTE change * @kvm: kvm instance @@ -254,10 +293,6 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, bool was_leaf = was_present && is_last_spte(old_spte, level); bool is_leaf = is_present && is_last_spte(new_spte, level); bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte); - u64 *pt; - struct kvm_mmu_page *sp; - u64 old_child_spte; - int i; WARN_ON(level > PT64_ROOT_MAX_LEVEL); WARN_ON(level < PG_LEVEL_4K); @@ -321,31 +356,9 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, * Recursively handle child PTs if the change removed a subtree from * the paging structure. */ - if (was_present && !was_leaf && (pfn_changed || !is_present)) { - pt = spte_to_child_pt(old_spte, level); - sp = sptep_to_sp(pt); - - trace_kvm_mmu_prepare_zap_page(sp); - - list_del(&sp->link); - - if (sp->lpage_disallowed) - unaccount_huge_nx_page(kvm, sp); - - for (i = 0; i < PT64_ENT_PER_PAGE; i++) { - old_child_spte = READ_ONCE(*(pt + i)); - WRITE_ONCE(*(pt + i), 0); - handle_changed_spte(kvm, as_id, - gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)), - old_child_spte, 0, level - 1); - } - - kvm_flush_remote_tlbs_with_address(kvm, gfn, - KVM_PAGES_PER_HPAGE(level)); - - free_page((unsigned long)pt); - kmem_cache_free(mmu_page_header_cache, sp); - } + if (was_present && !was_leaf && (pfn_changed || !is_present)) + handle_removed_tdp_mmu_page(kvm, + spte_to_child_pt(old_spte, level)); } static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, -- 2.30.0.365.g02bc693789-goog
[PATCH v2 15/28] KVM: x86/mmu: Skip no-op changes in TDP MMU functions
Skip setting SPTEs if no change is expected. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- v1 -> v2 - Merged no-op checks into exiting old_spte check arch/x86/kvm/mmu/tdp_mmu.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index c8a1149cb229..aeb05f626b55 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -888,7 +888,8 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, continue; if (!is_shadow_present_pte(iter.old_spte) || - !is_last_spte(iter.old_spte, iter.level)) + !is_last_spte(iter.old_spte, iter.level) || + !(iter.old_spte & PT_WRITABLE_MASK)) continue; new_spte = iter.old_spte & ~PT_WRITABLE_MASK; @@ -1065,7 +1066,8 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, if (tdp_mmu_iter_cond_resched(kvm, &iter, false)) continue; - if (!is_shadow_present_pte(iter.old_spte)) + if (!is_shadow_present_pte(iter.old_spte) || + iter.old_spte & shadow_dirty_mask) continue; new_spte = iter.old_spte | shadow_dirty_mask; -- 2.30.0.365.g02bc693789-goog
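The intent of the new checks, reduced to two predicates: write protection is a no-op for an SPTE that is not present or already read-only, and setting the dirty bit is a no-op for an SPTE that is not present or already dirty. A small user-space sketch of the bit tests follows; the mask values and helper names are stand-ins rather than KVM's spte.h definitions, and the leaf-level check from the real code is omitted.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in bit positions; the real masks live in KVM's spte.h. */
#define SPTE_PRESENT	(1ULL << 0)
#define SPTE_WRITABLE	(1ULL << 1)
#define SPTE_DIRTY	(1ULL << 9)

/* Write protecting is a no-op unless the SPTE is present and writable. */
static bool wrprot_is_noop(uint64_t spte)
{
	return !(spte & SPTE_PRESENT) || !(spte & SPTE_WRITABLE);
}

/* Setting the dirty bit is a no-op if the SPTE is absent or already dirty. */
static bool set_dirty_is_noop(uint64_t spte)
{
	return !(spte & SPTE_PRESENT) || (spte & SPTE_DIRTY);
}

int main(void)
{
	uint64_t spte = SPTE_PRESENT | SPTE_DIRTY;	/* present, read-only, dirty */

	printf("skip wrprot: %d, skip set-dirty: %d\n",
	       wrprot_is_noop(spte), set_dirty_is_noop(spte));
	return 0;
}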
[PATCH v2 19/28] KVM: x86/mmu: Factor out functions to add/remove TDP MMU pages
Move the work of adding and removing TDP MMU pages to/from "secondary" data structures to helper functions. These functions will be built on in future commits to enable MMU operations to proceed (mostly) in parallel. No functional change expected. Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 47 +++--- 1 file changed, 39 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index f1fbed72e149..5a9e964e0178 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -262,6 +262,39 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn, } } +/** + * tdp_mmu_link_page - Add a new page to the list of pages used by the TDP MMU + * + * @kvm: kvm instance + * @sp: the new page + * @account_nx: This page replaces a NX large page and should be marked for + * eventual reclaim. + */ +static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp, + bool account_nx) +{ + lockdep_assert_held_write(&kvm->mmu_lock); + + list_add(&sp->link, &kvm->arch.tdp_mmu_pages); + if (account_nx) + account_huge_nx_page(kvm, sp); +} + +/** + * tdp_mmu_unlink_page - Remove page from the list of pages used by the TDP MMU + * + * @kvm: kvm instance + * @sp: the page to be removed + */ +static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp) +{ + lockdep_assert_held_write(&kvm->mmu_lock); + + list_del(&sp->link); + if (sp->lpage_disallowed) + unaccount_huge_nx_page(kvm, sp); +} + /** * handle_removed_tdp_mmu_page - handle a pt removed from the TDP structure * @@ -281,10 +314,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, u64 *pt) trace_kvm_mmu_prepare_zap_page(sp); - list_del(&sp->link); - - if (sp->lpage_disallowed) - unaccount_huge_nx_page(kvm, sp); + tdp_mmu_unlink_page(kvm, sp); for (i = 0; i < PT64_ENT_PER_PAGE; i++) { old_child_spte = READ_ONCE(*(pt + i)); @@ -705,15 +735,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, if (!is_shadow_present_pte(iter.old_spte)) { sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level); - list_add(&sp->link, &vcpu->kvm->arch.tdp_mmu_pages); child_pt = sp->spt; + + tdp_mmu_link_page(vcpu->kvm, sp, + huge_page_disallowed && + req_level >= iter.level); + new_spte = make_nonleaf_spte(child_pt, !shadow_accessed_mask); trace_kvm_mmu_get_page(sp, true); - if (huge_page_disallowed && req_level >= iter.level) - account_huge_nx_page(vcpu->kvm, sp); - tdp_mmu_set_spte(vcpu->kvm, &iter, new_spte); } } -- 2.30.0.365.g02bc693789-goog
[PATCH v2 16/28] KVM: x86/mmu: Clear dirtied pages mask bit before early break
In clear_dirty_pt_masked, the loop is intended to exit early after processing each of the GFNs with corresponding bits set in mask. This does not work as intended if another thread has already cleared the dirty bit or writable bit on the SPTE. In that case, the loop would proceed to the next iteration early and the bit in mask would not be cleared. As a result the loop could not exit early and would proceed uselessly. Move the unsetting of the mask bit before the check for a no-op SPTE change. Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Suggested-by: Sean Christopherson Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index aeb05f626b55..a75e92164a8b 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1007,6 +1007,8 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root, !(mask & (1UL << (iter.gfn - gfn continue; + mask &= ~(1UL << (iter.gfn - gfn)); + if (wrprot || spte_ad_need_write_protect(iter.old_spte)) { if (is_writable_pte(iter.old_spte)) new_spte = iter.old_spte & ~PT_WRITABLE_MASK; @@ -1020,8 +1022,6 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root, } tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte); - - mask &= ~(1UL << (iter.gfn - gfn)); } } -- 2.30.0.365.g02bc693789-goog
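The bug being fixed is a generic loop-shape problem: when the exit condition is "every requested bit of mask has been visited", the bit must be cleared before any early continue, not only on the path that does real work. A runnable user-space sketch of the corrected shape, with illustrative stand-ins for the no-op test and the actual clearing:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-ins for "SPTE already clean" and "clear the dirty state". */
static bool entry_is_noop(unsigned int gfn_offset)
{
	return gfn_offset % 2 == 0;
}

static void do_work(unsigned int gfn_offset)
{
	printf("cleared dirty state for offset %u\n", gfn_offset);
}

/* Visit each set bit of @mask exactly once, stopping as soon as mask is empty. */
static void process_mask(uint64_t mask)
{
	for (unsigned int gfn_offset = 0; mask && gfn_offset < 64; gfn_offset++) {
		if (!(mask & (1ULL << gfn_offset)))
			continue;

		/*
		 * Clear the bit before any other early continue: the early
		 * exit depends on mask reaching zero even when some entries
		 * turn out to need no work.
		 */
		mask &= ~(1ULL << gfn_offset);

		if (entry_is_noop(gfn_offset))
			continue;

		do_work(gfn_offset);
	}
}

int main(void)
{
	process_mask(0x35);	/* bits 0, 2, 4 and 5 */
	return 0;
}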
[PATCH v2 22/28] KVM: x86/mmu: Mark SPTEs in disconnected pages as removed
When clearing TDP MMU pages what have been disconnected from the paging structure root, set the SPTEs to a special non-present value which will not be overwritten by other threads. This is needed to prevent races in which a thread is clearing a disconnected page table, but another thread has already acquired a pointer to that memory and installs a mapping in an already cleared entry. This can lead to memory leaks and accounting errors. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 36 ++-- 1 file changed, 30 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 7a2cdfeac4d2..0dd27e000dd0 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -334,9 +334,10 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, u64 *pt, { struct kvm_mmu_page *sp = sptep_to_sp(pt); int level = sp->role.level; - gfn_t gfn = sp->gfn; + gfn_t base_gfn = sp->gfn; u64 old_child_spte; u64 *sptep; + gfn_t gfn; int i; trace_kvm_mmu_prepare_zap_page(sp); @@ -345,16 +346,39 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, u64 *pt, for (i = 0; i < PT64_ENT_PER_PAGE; i++) { sptep = pt + i; + gfn = base_gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)); if (shared) { - old_child_spte = xchg(sptep, 0); + /* +* Set the SPTE to a nonpresent value that other +* threads will not overwrite. If the SPTE was +* already marked as removed then another thread +* handling a page fault could overwrite it, so +* set the SPTE until it is set from some other +* value to the removed SPTE value. +*/ + for (;;) { + old_child_spte = xchg(sptep, REMOVED_SPTE); + if (!is_removed_spte(old_child_spte)) + break; + cpu_relax(); + } } else { old_child_spte = READ_ONCE(*sptep); - WRITE_ONCE(*sptep, 0); + + /* +* Marking the SPTE as a removed SPTE is not +* strictly necessary here as the MMU lock should +* stop other threads from concurrentrly modifying +* this SPTE. Using the removed SPTE value keeps +* the shared and non-atomic cases consistent and +* simplifies the function. +*/ + WRITE_ONCE(*sptep, REMOVED_SPTE); } - handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), - gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)), - old_child_spte, 0, level - 1, shared); + handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn, + old_child_spte, REMOVED_SPTE, level - 1, + shared); } kvm_flush_remote_tlbs_with_address(kvm, gfn, -- 2.30.0.365.g02bc693789-goog
[PATCH v2 24/28] KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock
To reduce lock contention and interference with page fault handlers, allow the TDP MMU function to zap a GFN range to operate under the MMU read lock. Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/mmu.c | 13 ++- arch/x86/kvm/mmu/mmu_internal.h | 6 +- arch/x86/kvm/mmu/tdp_mmu.c | 165 +--- arch/x86/kvm/mmu/tdp_mmu.h | 3 +- 4 files changed, 145 insertions(+), 42 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 3d181a2b2485..254ff87d2a61 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5518,13 +5518,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) } } + kvm_mmu_unlock(kvm); + if (kvm->arch.tdp_mmu_enabled) { - flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end); + read_lock(&kvm->mmu_lock); + flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end, + true); if (flush) kvm_flush_remote_tlbs(kvm); - } - write_unlock(&kvm->mmu_lock); + read_unlock(&kvm->mmu_lock); + } } static bool slot_rmap_write_protect(struct kvm *kvm, @@ -6015,7 +6019,8 @@ static void kvm_recover_nx_lpages(struct kvm *kvm) WARN_ON_ONCE(!sp->lpage_disallowed); if (sp->tdp_mmu_page) { kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, - sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level)); + sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level), + false); } else { kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list); WARN_ON_ONCE(sp->lpage_disallowed); diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index 7f599cc64178..7df209fb8051 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -40,7 +40,11 @@ struct kvm_mmu_page { u64 *spt; /* hold the gfn of each spte inside spt */ gfn_t *gfns; - int root_count; /* Currently serving as active root */ + /* Currently serving as active root */ + union { + int root_count; + refcount_t tdp_mmu_root_count; + }; unsigned int unsync_children; struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */ DECLARE_BITMAP(unsync_child_bitmap, 512); diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 0dd27e000dd0..de26762433ea 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -52,46 +52,104 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm) rcu_barrier(); } -static void tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root) +static __always_inline __must_check bool tdp_mmu_get_root(struct kvm *kvm, + struct kvm_mmu_page *root) { - if (kvm_mmu_put_root(kvm, root)) - kvm_tdp_mmu_free_root(kvm, root); + return refcount_inc_not_zero(&root->tdp_mmu_root_count); } -static inline bool tdp_mmu_next_root_valid(struct kvm *kvm, - struct kvm_mmu_page *root) +static __always_inline void tdp_mmu_put_root(struct kvm *kvm, +struct kvm_mmu_page *root, +bool shared) { - lockdep_assert_held_write(&kvm->mmu_lock); + int root_count; + int r; - if (list_entry_is_head(root, &kvm->arch.tdp_mmu_roots, link)) - return false; + if (shared) { + lockdep_assert_held_read(&kvm->mmu_lock); - kvm_mmu_get_root(kvm, root); - return true; + root_count = atomic_read(&root->tdp_mmu_root_count.refs); + + /* +* If this is not the last reference on the root, it can be +* dropped under the MMU read lock. +*/ + if (root_count > 1) { + r = atomic_cmpxchg(&root->tdp_mmu_root_count.refs, + root_count, root_count - 1); + if (r == root_count) + return; + } + + /* +* If the cmpxchg failed because of a race or this is the +* last reference on the root, drop the read lock, and +* reacquire the MMU lock in write mode. 
+*/ + read_unlock(&kvm->mmu_lock); + write_lock(&kvm->mmu_lock); + } else { + lockdep_assert_held_write(&kvm->mmu_lock); + } + +
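The root reference scheme above (partially truncated in this archive) rests on two lock-free steps: take a reference only if the count has not already hit zero, and drop a reference under the shared lock only when it cannot be the last one. A condensed user-space model with C11 atomics follows; the names are illustrative, and it assumes the final reference is always dropped with the lock held exclusively.

#include <stdatomic.h>
#include <stdbool.h>

struct root {
	atomic_int refcount;
};

/* Succeeds only if the root is still live (refcount != 0). */
static bool root_try_get(struct root *r)
{
	int old = atomic_load(&r->refcount);

	do {
		if (old == 0)
			return false;
	} while (!atomic_compare_exchange_weak(&r->refcount, &old, old + 1));

	return true;
}

/*
 * Drop a reference while only holding a shared lock. Returns false if this
 * might be the last reference, in which case the caller must retry with the
 * lock held exclusively so the root can be torn down safely.
 */
static bool root_put_shared(struct root *r)
{
	int old = atomic_load(&r->refcount);

	while (old > 1) {
		if (atomic_compare_exchange_weak(&r->refcount, &old, old - 1))
			return true;
	}
	return false;
}

int main(void)
{
	struct root r = { .refcount = 2 };

	if (root_try_get(&r))		/* refcount: 2 -> 3 */
		root_put_shared(&r);	/* refcount: 3 -> 2 */
	return 0;
}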
[PATCH v2 26/28] KVM: x86/mmu: Allow enabling / disabling dirty logging under MMU read lock
To reduce lock contention and interference with page fault handlers, allow the TDP MMU functions which enable and disable dirty logging to operate under the MMU read lock. Extend dirty logging enable disable functions read lock-ness Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/mmu.c | 14 +++--- arch/x86/kvm/mmu/tdp_mmu.c | 93 ++ arch/x86/kvm/mmu/tdp_mmu.h | 2 +- 3 files changed, 84 insertions(+), 25 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index e3cf868be6bd..6ba2a72d4330 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5638,9 +5638,10 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm, write_lock(&kvm->mmu_lock); flush = slot_handle_leaf(kvm, memslot, __rmap_clear_dirty, false); + write_unlock(&kvm->mmu_lock); + if (kvm->arch.tdp_mmu_enabled) flush |= kvm_tdp_mmu_clear_dirty_slot(kvm, memslot); - write_unlock(&kvm->mmu_lock); /* * It's also safe to flush TLBs out of mmu lock here as currently this @@ -5661,9 +5662,10 @@ void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm, write_lock(&kvm->mmu_lock); flush = slot_handle_large_level(kvm, memslot, slot_rmap_write_protect, false); + write_unlock(&kvm->mmu_lock); + if (kvm->arch.tdp_mmu_enabled) flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, PG_LEVEL_2M); - write_unlock(&kvm->mmu_lock); if (flush) kvm_arch_flush_remote_tlbs_memslot(kvm, memslot); @@ -5677,12 +5679,12 @@ void kvm_mmu_slot_set_dirty(struct kvm *kvm, write_lock(&kvm->mmu_lock); flush = slot_handle_all_level(kvm, memslot, __rmap_set_dirty, false); - if (kvm->arch.tdp_mmu_enabled) - flush |= kvm_tdp_mmu_slot_set_dirty(kvm, memslot); - write_unlock(&kvm->mmu_lock); - if (flush) kvm_arch_flush_remote_tlbs_memslot(kvm, memslot); + write_unlock(&kvm->mmu_lock); + + if (kvm->arch.tdp_mmu_enabled) + kvm_tdp_mmu_slot_set_dirty(kvm, memslot); } EXPORT_SYMBOL_GPL(kvm_mmu_slot_set_dirty); diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index cfe66b8d39fa..6093926a6bc5 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -553,18 +553,22 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, } /* - * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically and handle the + * __tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically and handle the * associated bookkeeping * * @kvm: kvm instance * @iter: a tdp_iter instance currently on the SPTE that should be set * @new_spte: The value the SPTE should be set to + * @record_dirty_log: Record the page as dirty in the dirty bitmap if + * appropriate for the change being made. Should be set + * unless performing certain dirty logging operations. + * Leaving record_dirty_log unset in that case prevents page + * writes from being double counted. * Returns: true if the SPTE was set, false if it was not. If false is returned, * this function will have no side-effects. 
*/ -static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm, - struct tdp_iter *iter, - u64 new_spte) +static inline bool __tdp_mmu_set_spte_atomic(struct kvm *kvm, + struct tdp_iter *iter, u64 new_spte, bool record_dirty_log) { u64 *root_pt = tdp_iter_root_pt(iter); struct kvm_mmu_page *root = sptep_to_sp(root_pt); @@ -583,12 +587,31 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm, new_spte) != iter->old_spte) return false; - handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte, - iter->level, true); + __handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte, + iter->level, true); + handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level); + if (record_dirty_log) + handle_changed_spte_dirty_log(kvm, as_id, iter->gfn, + iter->old_spte, new_spte, + iter->level); return true; } +static inline bool tdp_mmu_set_spte_atomic_no_dirty_log(struct kvm *kvm, + struct tdp_iter *iter, + u64 new_spte) +{ + return __tdp_mmu_set_spte_atomic(kvm, iter, new_spte, false); +} + +static inline bool tdp_mmu_set_spte_atomic(struct kvm *k
[PATCH v2 23/28] KVM: x86/mmu: Allow parallel page faults for the TDP MMU
Make the last few changes necessary to enable the TDP MMU to handle page faults in parallel while holding the mmu_lock in read mode. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/mmu.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index b4d6709c240e..3d181a2b2485 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3724,7 +3724,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, return r; r = RET_PF_RETRY; - write_lock(&vcpu->kvm->mmu_lock); + + if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)) + read_lock(&vcpu->kvm->mmu_lock); + else + write_lock(&vcpu->kvm->mmu_lock); + if (mmu_notifier_retry(vcpu->kvm, mmu_seq)) goto out_unlock; r = make_mmu_pages_available(vcpu); @@ -3739,7 +3744,10 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, prefault, is_tdp); out_unlock: - write_unlock(&vcpu->kvm->mmu_lock); + if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)) + read_unlock(&vcpu->kvm->mmu_lock); + else + write_unlock(&vcpu->kvm->mmu_lock); kvm_release_pfn_clean(pfn); return r; } -- 2.30.0.365.g02bc693789-goog
[PATCH v2 25/28] KVM: x86/mmu: Allow zapping collapsible SPTEs to use MMU read lock
To speed the process of disabling dirty logging, change the TDP MMU function which zaps collapsible SPTEs to run under the MMU read lock. Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/mmu.c | 5 ++--- arch/x86/kvm/mmu/tdp_mmu.c | 22 +++--- 2 files changed, 17 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 254ff87d2a61..e3cf868be6bd 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5517,8 +5517,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) start, end - 1, true); } } - - kvm_mmu_unlock(kvm); + write_unlock(&kvm->mmu_lock); if (kvm->arch.tdp_mmu_enabled) { read_lock(&kvm->mmu_lock); @@ -5611,10 +5610,10 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, write_lock(&kvm->mmu_lock); slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot, kvm_mmu_zap_collapsible_spte, true); + write_unlock(&kvm->mmu_lock); if (kvm->arch.tdp_mmu_enabled) kvm_tdp_mmu_zap_collapsible_sptes(kvm, memslot); - write_unlock(&kvm->mmu_lock); } void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm, diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index de26762433ea..cfe66b8d39fa 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1451,10 +1451,9 @@ static void zap_collapsible_spte_range(struct kvm *kvm, rcu_read_lock(); tdp_root_for_each_pte(iter, root, start, end) { - if (tdp_mmu_iter_cond_resched(kvm, &iter, false, false)) { - spte_set = false; +retry: + if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true)) continue; - } if (!is_shadow_present_pte(iter.old_spte) || !is_last_spte(iter.old_spte, iter.level)) @@ -1465,9 +1464,14 @@ static void zap_collapsible_spte_range(struct kvm *kvm, !PageTransCompoundMap(pfn_to_page(pfn))) continue; - tdp_mmu_set_spte(kvm, &iter, 0); - - spte_set = true; + if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) { + /* +* The iter must explicitly re-read the SPTE because +* the atomic cmpxchg failed. +*/ + iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep)); + goto retry; + } } rcu_read_unlock(); @@ -1485,7 +1489,9 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm, struct kvm_mmu_page *root; int root_as_id; - for_each_tdp_mmu_root_yield_safe(kvm, root, false) { + read_lock(&kvm->mmu_lock); + + for_each_tdp_mmu_root_yield_safe(kvm, root, true) { root_as_id = kvm_mmu_page_as_id(root); if (root_as_id != slot->as_id) continue; @@ -1493,6 +1499,8 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm, zap_collapsible_spte_range(kvm, root, slot->base_gfn, slot->base_gfn + slot->npages); } + + read_unlock(&kvm->mmu_lock); } /* -- 2.30.0.365.g02bc693789-goog
[PATCH v2 27/28] KVM: selftests: Add backing src parameter to dirty_log_perf_test
Add a parameter to control the backing memory type for dirty_log_perf_test so that the test can be run with hugepages. To: linux-kselft...@vger.kernel.org CC: Peter Xu CC: Andrew Jones CC: Thomas Huth Signed-off-by: Ben Gardon --- .../selftests/kvm/demand_paging_test.c| 3 +- .../selftests/kvm/dirty_log_perf_test.c | 15 -- .../testing/selftests/kvm/include/kvm_util.h | 6 .../selftests/kvm/include/perf_test_util.h| 3 +- .../testing/selftests/kvm/include/test_util.h | 14 + .../selftests/kvm/lib/perf_test_util.c| 6 ++-- tools/testing/selftests/kvm/lib/test_util.c | 29 +++ 7 files changed, 62 insertions(+), 14 deletions(-) diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c index cdad1eca72f7..9e3254ff0821 100644 --- a/tools/testing/selftests/kvm/demand_paging_test.c +++ b/tools/testing/selftests/kvm/demand_paging_test.c @@ -265,7 +265,8 @@ static void run_test(enum vm_guest_mode mode, void *arg) int vcpu_id; int r; - vm = perf_test_create_vm(mode, nr_vcpus, guest_percpu_mem_size); + vm = perf_test_create_vm(mode, nr_vcpus, guest_percpu_mem_size, +VM_MEM_SRC_ANONYMOUS); perf_test_args.wr_fract = 1; diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c index 2283a0ec74a9..604ccefd6e76 100644 --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c @@ -92,6 +92,7 @@ struct test_params { unsigned long iterations; uint64_t phys_offset; int wr_fract; + enum vm_mem_backing_src_type backing_src; }; static void run_test(enum vm_guest_mode mode, void *arg) @@ -111,7 +112,8 @@ static void run_test(enum vm_guest_mode mode, void *arg) struct kvm_enable_cap cap = {}; struct timespec clear_dirty_log_total = (struct timespec){0}; - vm = perf_test_create_vm(mode, nr_vcpus, guest_percpu_mem_size); + vm = perf_test_create_vm(mode, nr_vcpus, guest_percpu_mem_size, +p->backing_src); perf_test_args.wr_fract = p->wr_fract; @@ -236,7 +238,7 @@ static void help(char *name) { puts(""); printf("usage: %s [-h] [-i iterations] [-p offset] " - "[-m mode] [-b vcpu bytes] [-v vcpus]\n", name); + "[-m mode] [-b vcpu bytes] [-v vcpus] [-s mem type]\n", name); puts(""); printf(" -i: specify iteration counts (default: %"PRIu64")\n", TEST_HOST_LOOP_N); @@ -251,6 +253,9 @@ static void help(char *name) " 1/.\n" " (default: 1 i.e. 
all pages are written to.)\n"); printf(" -v: specify the number of vCPUs to run.\n"); + printf(" -s: specify the type of memory that should be used to\n" + " back the guest data region.\n"); + backing_src_help(); puts(""); exit(0); } @@ -261,6 +266,7 @@ int main(int argc, char *argv[]) struct test_params p = { .iterations = TEST_HOST_LOOP_N, .wr_fract = 1, + .backing_src = VM_MEM_SRC_ANONYMOUS, }; int opt; @@ -271,7 +277,7 @@ int main(int argc, char *argv[]) guest_modes_append_default(); - while ((opt = getopt(argc, argv, "hi:p:m:b:f:v:")) != -1) { + while ((opt = getopt(argc, argv, "hi:p:m:b:f:v:s:")) != -1) { switch (opt) { case 'i': p.iterations = strtol(optarg, NULL, 10); @@ -295,6 +301,9 @@ int main(int argc, char *argv[]) TEST_ASSERT(nr_vcpus > 0 && nr_vcpus <= max_vcpus, "Invalid number of vcpus, must be between 1 and %d", max_vcpus); break; + case 's': + p.backing_src = parse_backing_src_type(optarg); + break; case 'h': default: help(argv[0]); diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h index 5cbb861525ed..2d7eb6989e83 100644 --- a/tools/testing/selftests/kvm/include/kvm_util.h +++ b/tools/testing/selftests/kvm/include/kvm_util.h @@ -79,12 +79,6 @@ struct vm_guest_mode_params { }; extern const struct vm_guest_mode_params vm_guest_mode_params[]; -enum vm_mem_backing_src_type { - VM_MEM_SRC_ANONYMOUS, - VM_MEM_SRC_ANONYMOUS_THP, - VM_MEM_SRC_ANONYMOUS_HUGETLB, -}; - int kvm_check_cap(long cap); int vm_enable_cap(struct kvm_vm *vm, struct kvm_enable_cap *cap); int vcpu_enabl
[PATCH v2 28/28] KVM: selftests: Disable dirty logging with vCPUs running
Disabling dirty logging is much more interesting from a testing perspective if the vCPUs are still running. This also exercises the code-path in which collapsible SPTEs must be faulted back in at a higher level after disabling dirty logging. To: linux-kselft...@vger.kernel.org CC: Peter Xu CC: Andrew Jones CC: Thomas Huth Signed-off-by: Ben Gardon --- tools/testing/selftests/kvm/dirty_log_perf_test.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c index 604ccefd6e76..d44a5b8ef232 100644 --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c @@ -205,11 +205,6 @@ static void run_test(enum vm_guest_mode mode, void *arg) } } - /* Tell the vcpu thread to quit */ - host_quit = true; - for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) - pthread_join(vcpu_threads[vcpu_id], NULL); - /* Disable dirty logging */ clock_gettime(CLOCK_MONOTONIC, &start); vm_mem_region_set_flags(vm, PERF_TEST_MEM_SLOT_INDEX, 0); @@ -217,6 +212,11 @@ static void run_test(enum vm_guest_mode mode, void *arg) pr_info("Disabling dirty logging time: %ld.%.9lds\n", ts_diff.tv_sec, ts_diff.tv_nsec); + /* Tell the vcpu thread to quit */ + host_quit = true; + for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) + pthread_join(vcpu_threads[vcpu_id], NULL); + avg = timespec_div(get_dirty_log_total, p->iterations); pr_info("Get dirty log over %lu iterations took %ld.%.9lds. (Avg %ld.%.9lds/iteration)\n", p->iterations, get_dirty_log_total.tv_sec, -- 2.30.0.365.g02bc693789-goog
[PATCH v2 21/28] KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler
When the TDP MMU is allowed to handle page faults in parallel there is the possiblity of a race where an SPTE is cleared and then imediately replaced with a present SPTE pointing to a different PFN, before the TLBs can be flushed. This race would violate architectural specs. Ensure that the TLBs are flushed properly before other threads are allowed to install any present value for the SPTE. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- v1 -> v2 - Renamed "FROZEN_SPTE" to "REMOVED_SPTE" and updated derivative comments and code arch/x86/kvm/mmu/spte.h| 21 - arch/x86/kvm/mmu/tdp_mmu.c | 63 -- 2 files changed, 74 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h index 2b3a30bd38b0..3f974006cfb6 100644 --- a/arch/x86/kvm/mmu/spte.h +++ b/arch/x86/kvm/mmu/spte.h @@ -130,6 +130,25 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask; PT64_EPT_EXECUTABLE_MASK) #define SHADOW_ACC_TRACK_SAVED_BITS_SHIFT PT64_SECOND_AVAIL_BITS_SHIFT +/* + * If a thread running without exclusive control of the MMU lock must perform a + * multi-part operation on an SPTE, it can set the SPTE to REMOVED_SPTE as a + * non-present intermediate value. Other threads which encounter this value + * should not modify the SPTE. + * + * This constant works because it is considered non-present on both AMD and + * Intel CPUs and does not create a L1TF vulnerability because the pfn section + * is zeroed out. + * + * Only used by the TDP MMU. + */ +#define REMOVED_SPTE (1ull << 59) + +static inline bool is_removed_spte(u64 spte) +{ + return spte == REMOVED_SPTE; +} + /* * In some cases, we need to preserve the GFN of a non-present or reserved * SPTE when we usurp the upper five bits of the physical address space to @@ -187,7 +206,7 @@ static inline bool is_access_track_spte(u64 spte) static inline int is_shadow_present_pte(u64 pte) { - return (pte != 0) && !is_mmio_spte(pte); + return (pte != 0) && !is_mmio_spte(pte) && !is_removed_spte(pte); } static inline int is_large_pte(u64 pte) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 0b5a9339ac55..7a2cdfeac4d2 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -427,15 +427,19 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, */ if (!was_present && !is_present) { /* -* If this change does not involve a MMIO SPTE, it is -* unexpected. Log the change, though it should not impact the -* guest since both the former and current SPTEs are nonpresent. +* If this change does not involve a MMIO SPTE or removed SPTE, +* it is unexpected. Log the change, though it should not +* impact the guest since both the former and current SPTEs +* are nonpresent. */ - if (WARN_ON(!is_mmio_spte(old_spte) && !is_mmio_spte(new_spte))) + if (WARN_ON(!is_mmio_spte(old_spte) && + !is_mmio_spte(new_spte) && + !is_removed_spte(new_spte))) pr_err("Unexpected SPTE change! Nonpresent SPTEs\n" "should not be replaced with another,\n" "different nonpresent SPTE, unless one or both\n" - "are MMIO SPTEs.\n" + "are MMIO SPTEs, or the new SPTE is\n" + "a temporary removed SPTE.\n" "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d", as_id, gfn, old_spte, new_spte, level); return; @@ -486,6 +490,13 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm, lockdep_assert_held_read(&kvm->mmu_lock); + /* +* Do not change removed SPTEs. Only the thread that froze the SPTE +* may modify it. 
+*/ + if (iter->old_spte == REMOVED_SPTE) + return false; + if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte, new_spte) != iter->old_spte) return false; @@ -496,6 +507,34 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm, return true; } +static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm, + struct tdp_iter *iter) +{ + /* +* Freeze the SPTE by setting it to a special, +* non-present value. This will stop other threads from +* immediately installing a present entry in its place +* before the TLB
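In the abstract, the zap sequence this patch enables is "freeze, flush, clear": replace the SPTE with the non-present REMOVED_SPTE marker so no other thread installs a new mapping, flush the TLBs, and only then clear it for real. A small user-space model with C11 atomics and a stubbed flush; REMOVED_SPTE's value mirrors the patch, everything else is illustrative.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define REMOVED_SPTE	(1ULL << 59)	/* non-present "frozen" marker, as in the patch */

static void flush_remote_tlbs(void)
{
	/* Stand-in for the remote TLB flush. */
	printf("flush\n");
}

/*
 * Zap an SPTE without the exclusive lock: freeze it so concurrent faults
 * cannot install a new mapping, flush, then clear it for real.
 */
static bool zap_spte_atomic(_Atomic uint64_t *sptep, uint64_t old_spte)
{
	/* Freeze the SPTE; fail if someone else changed it first. */
	if (!atomic_compare_exchange_strong(sptep, &old_spte, REMOVED_SPTE))
		return false;

	/*
	 * No new mapping can be installed over REMOVED_SPTE, so this flush
	 * covers every translation the old SPTE could have produced.
	 */
	flush_remote_tlbs();

	/* Only this thread may overwrite the frozen value. */
	atomic_store(sptep, 0);
	return true;
}

int main(void)
{
	_Atomic uint64_t spte = 0xabc;

	return zap_spte_atomic(&spte, 0xabc) ? 0 : 1;
}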
[PATCH v2 20/28] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
To prepare for handling page faults in parallel, change the TDP MMU page fault handler to use atomic operations to set SPTEs so that changes are not lost if multiple threads attempt to modify the same SPTE. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- v1 -> v2 - Rename "atomic" arg to "shared" in multiple functions - Merged the commit that protects the lists of TDP MMU pages with a new lock - Merged the commits to add an atomic option for setting SPTEs and to use that option in the TDP MMU page fault handler arch/x86/include/asm/kvm_host.h | 13 +++ arch/x86/kvm/mmu/tdp_mmu.c | 142 2 files changed, 122 insertions(+), 33 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index b6ebf2558386..78ebf56f2b37 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1028,6 +1028,19 @@ struct kvm_arch { * tdp_mmu_page set and a root_count of 0. */ struct list_head tdp_mmu_pages; + + /* +* Protects accesses to the following fields when the MMU lock +* is held in read mode: +* - tdp_mmu_pages (above) +* - the link field of struct kvm_mmu_pages used by the TDP MMU +* - lpage_disallowed_mmu_pages +* - the lpage_disallowed_link field of struct kvm_mmu_pages used +*by the TDP MMU +* It is acceptable, but not necessary, to acquire this lock when +* the thread holds the MMU lock in write mode. +*/ + spinlock_t tdp_mmu_pages_lock; }; struct kvm_vm_stat { diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 5a9e964e0178..0b5a9339ac55 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -7,6 +7,7 @@ #include "tdp_mmu.h" #include "spte.h" +#include #include #ifdef CONFIG_X86_64 @@ -33,6 +34,7 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm) kvm->arch.tdp_mmu_enabled = true; INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots); + spin_lock_init(&kvm->arch.tdp_mmu_pages_lock); INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages); } @@ -225,7 +227,8 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head) } static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, - u64 old_spte, u64 new_spte, int level); + u64 old_spte, u64 new_spte, int level, + bool shared); static int kvm_mmu_page_as_id(struct kvm_mmu_page *sp) { @@ -267,17 +270,26 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn, * * @kvm: kvm instance * @sp: the new page + * @shared: This operation may not be running under the exclusive use of + * the MMU lock and the operation must synchronize with other + * threads that might be adding or removing pages. * @account_nx: This page replaces a NX large page and should be marked for * eventual reclaim. */ static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp, - bool account_nx) + bool shared, bool account_nx) { - lockdep_assert_held_write(&kvm->mmu_lock); + if (shared) + spin_lock(&kvm->arch.tdp_mmu_pages_lock); + else + lockdep_assert_held_write(&kvm->mmu_lock); list_add(&sp->link, &kvm->arch.tdp_mmu_pages); if (account_nx) account_huge_nx_page(kvm, sp); + + if (shared) + spin_unlock(&kvm->arch.tdp_mmu_pages_lock); } /** @@ -285,14 +297,24 @@ static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp, * * @kvm: kvm instance * @sp: the page to be removed + * @shared: This operation may not be running under the exclusive use of + * the MMU lock and the operation must synchronize with other + * threads that might be adding or removing pages. 
*/ -static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp) +static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp, + bool shared) { - lockdep_assert_held_write(&kvm->mmu_lock); + if (shared) + spin_lock(&kvm->arch.tdp_mmu_pages_lock); + else + lockdep_assert_held_write(&kvm->mmu_lock); list_del(&sp->link); if (sp->lpage_disallowed) unaccount_huge_nx_page(kvm, sp); + + if (shared) + spin_unlock(&kvm->arch.tdp_mmu_pages_lock); } /** @@ -300,28 +322,39 @@ static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp) * * @kvm: kvm instance * @pt: the page removed from the paging structure + * @shared: This operation may not be running
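The rule encoded in the link/unlink helpers: when the caller holds the MMU lock only for read ("shared"), take tdp_mmu_pages_lock around the list update; when it holds the lock for write, skip the extra lock and merely assert the stronger lock with lockdep. A toy pthread analogue of that rule, purely illustrative:

#include <pthread.h>
#include <stdbool.h>

struct mmu {
	pthread_rwlock_t mmu_lock;	/* analogue of kvm->mmu_lock */
	pthread_mutex_t pages_lock;	/* analogue of tdp_mmu_pages_lock */
	int nr_pages;			/* stand-in for the tdp_mmu_pages list */
};

/*
 * @shared: the caller holds mmu_lock only for read, so other threads may be
 * linking or unlinking pages concurrently and the list needs its own lock.
 * With mmu_lock held for write the extra lock is unnecessary (the kernel
 * version asserts that with lockdep instead).
 */
static void link_page(struct mmu *mmu, bool shared)
{
	if (shared)
		pthread_mutex_lock(&mmu->pages_lock);

	mmu->nr_pages++;

	if (shared)
		pthread_mutex_unlock(&mmu->pages_lock);
}

int main(void)
{
	struct mmu mmu = {
		.mmu_lock = PTHREAD_RWLOCK_INITIALIZER,
		.pages_lock = PTHREAD_MUTEX_INITIALIZER,
	};

	pthread_rwlock_rdlock(&mmu.mmu_lock);
	link_page(&mmu, true);		/* shared: must take pages_lock */
	pthread_rwlock_unlock(&mmu.mmu_lock);
	return 0;
}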
[PATCH v2 18/28] KVM: x86/mmu: Use an rwlock for the x86 MMU
Add a read / write lock to be used in place of the MMU spinlock on x86. The rwlock will enable the TDP MMU to handle page faults, and other operations in parallel in future commits. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- v1 -> v2 - Removed MMU lock wrappers - Completely replaced the MMU spinlock with an rwlock for x86 arch/x86/include/asm/kvm_host.h | 2 + arch/x86/kvm/mmu/mmu.c | 90 - arch/x86/kvm/mmu/page_track.c | 8 +-- arch/x86/kvm/mmu/paging_tmpl.h | 8 +-- arch/x86/kvm/mmu/tdp_mmu.c | 20 arch/x86/kvm/x86.c | 4 +- include/linux/kvm_host.h| 5 ++ virt/kvm/dirty_ring.c | 10 virt/kvm/kvm_main.c | 46 +++-- 9 files changed, 112 insertions(+), 81 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 3d6616f6f6ef..b6ebf2558386 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -337,6 +337,8 @@ struct kvm_mmu_root_info { #define KVM_MMU_NUM_PREV_ROOTS 3 +#define KVM_HAVE_MMU_RWLOCK + struct kvm_mmu_page; /* diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 60ff6837655a..b4d6709c240e 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -2016,9 +2016,9 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu, flush |= kvm_sync_page(vcpu, sp, &invalid_list); mmu_pages_clear_parents(&parents); } - if (need_resched() || spin_needbreak(&vcpu->kvm->mmu_lock)) { + if (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock)) { kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush); - cond_resched_lock(&vcpu->kvm->mmu_lock); + cond_resched_rwlock_write(&vcpu->kvm->mmu_lock); flush = false; } } @@ -2470,7 +2470,7 @@ static int make_mmu_pages_available(struct kvm_vcpu *vcpu) */ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long goal_nr_mmu_pages) { - spin_lock(&kvm->mmu_lock); + write_lock(&kvm->mmu_lock); if (kvm->arch.n_used_mmu_pages > goal_nr_mmu_pages) { kvm_mmu_zap_oldest_mmu_pages(kvm, kvm->arch.n_used_mmu_pages - @@ -2481,7 +2481,7 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long goal_nr_mmu_pages) kvm->arch.n_max_mmu_pages = goal_nr_mmu_pages; - spin_unlock(&kvm->mmu_lock); + write_unlock(&kvm->mmu_lock); } int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn) @@ -2492,7 +2492,7 @@ int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn) pgprintk("%s: looking for gfn %llx\n", __func__, gfn); r = 0; - spin_lock(&kvm->mmu_lock); + write_lock(&kvm->mmu_lock); for_each_gfn_indirect_valid_sp(kvm, sp, gfn) { pgprintk("%s: gfn %llx role %x\n", __func__, gfn, sp->role.word); @@ -2500,7 +2500,7 @@ int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn) kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list); } kvm_mmu_commit_zap_page(kvm, &invalid_list); - spin_unlock(&kvm->mmu_lock); + write_unlock(&kvm->mmu_lock); return r; } @@ -3192,7 +3192,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, return; } - spin_lock(&kvm->mmu_lock); + write_lock(&kvm->mmu_lock); for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) if (roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) @@ -3215,7 +3215,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, } kvm_mmu_commit_zap_page(kvm, &invalid_list); - spin_unlock(&kvm->mmu_lock); + write_unlock(&kvm->mmu_lock); } EXPORT_SYMBOL_GPL(kvm_mmu_free_roots); @@ -3236,16 +3236,16 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva, { struct kvm_mmu_page *sp; - spin_lock(&vcpu->kvm->mmu_lock); + write_lock(&vcpu->kvm->mmu_lock); if (make_mmu_pages_available(vcpu)) { - 
spin_unlock(&vcpu->kvm->mmu_lock); + write_unlock(&vcpu->kvm->mmu_lock); return INVALID_PAGE; } sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL); ++sp->root_count; - spin_unlock(&vcpu->kvm->mmu_lock); + write_unlock(&vcpu->kvm->mmu_lock); return __pa(sp->spt); } @@ -3416,17 +3416,17 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu) !smp_load_acquire(&sp->unsync_children)
[PATCH v2 17/28] KVM: x86/mmu: Protect TDP MMU page table memory with RCU
In order to enable concurrent modifications to the paging structures in the TDP MMU, threads must be able to safely remove pages of page table memory while other threads are traversing the same memory. To ensure threads do not access PT memory after it is freed, protect PT memory with RCU. Protecting concurrent accesses to page table memory from use-after-free bugs could also have been acomplished using walk_shadow_page_lockless_begin/end() and READING_SHADOW_PAGE_TABLES, coupling with the barriers in a TLB flush. The use of RCU for this case has several distinct advantages over that approach. 1. Disabling interrupts for long running operations is not desirable. Future commits will allow operations besides page faults to operate without the exclusive protection of the MMU lock and those operations are too long to disable iterrupts for their duration. 2. The use of RCU here avoids long blocking / spinning operations in perfromance critical paths. By freeing memory with an asynchronous RCU API we avoid the longer wait times TLB flushes experience when overlapping with a thread in walk_shadow_page_lockless_begin/end(). 3. RCU provides a separation of concerns when removing memory from the paging structure. Because the RCU callback to free memory can be scheduled immediately after a TLB flush, there's no need for the thread to manually free a queue of pages later, as commit_zap_pages does. Fixes: 95fb5b0258b7 ("kvm: x86/mmu: Support MMIO in the TDP MMU") Reviewed-by: Peter Feiner Suggested-by: Sean Christopherson Signed-off-by: Ben Gardon --- v1 -> v2 - Moved RCU read unlock before the TLB flush - Merged the RCU commits from v1 into a single commit - Changed the way accesses to page table memory are annotated with RCU in the TDP iterator arch/x86/kvm/mmu/mmu_internal.h | 3 ++ arch/x86/kvm/mmu/tdp_iter.c | 16 +++--- arch/x86/kvm/mmu/tdp_iter.h | 10 ++-- arch/x86/kvm/mmu/tdp_mmu.c | 95 + 4 files changed, 103 insertions(+), 21 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index bfc6389edc28..7f599cc64178 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -57,6 +57,9 @@ struct kvm_mmu_page { atomic_t write_flooding_count; bool tdp_mmu_page; + + /* Used for freeing the page asyncronously if it is a TDP MMU page. */ + struct rcu_head rcu_head; }; extern struct kmem_cache *mmu_page_header_cache; diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c index 1a09d212186b..e5f148106e20 100644 --- a/arch/x86/kvm/mmu/tdp_iter.c +++ b/arch/x86/kvm/mmu/tdp_iter.c @@ -12,7 +12,7 @@ static void tdp_iter_refresh_sptep(struct tdp_iter *iter) { iter->sptep = iter->pt_path[iter->level - 1] + SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level); - iter->old_spte = READ_ONCE(*iter->sptep); + iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep)); } static gfn_t round_gfn_for_level(gfn_t gfn, int level) @@ -35,7 +35,7 @@ void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level, iter->root_level = root_level; iter->min_level = min_level; iter->level = root_level; - iter->pt_path[iter->level - 1] = root_pt; + iter->pt_path[iter->level - 1] = (tdp_ptep_t)root_pt; iter->gfn = round_gfn_for_level(iter->next_last_level_gfn, iter->level); tdp_iter_refresh_sptep(iter); @@ -48,7 +48,7 @@ void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level, * address of the child page table referenced by the SPTE. Returns null if * there is no such entry. 
*/ -u64 *spte_to_child_pt(u64 spte, int level) +tdp_ptep_t spte_to_child_pt(u64 spte, int level) { /* * There's no child entry if this entry isn't present or is a @@ -57,7 +57,7 @@ u64 *spte_to_child_pt(u64 spte, int level) if (!is_shadow_present_pte(spte) || is_last_spte(spte, level)) return NULL; - return __va(spte_to_pfn(spte) << PAGE_SHIFT); + return (tdp_ptep_t)__va(spte_to_pfn(spte) << PAGE_SHIFT); } /* @@ -66,7 +66,7 @@ u64 *spte_to_child_pt(u64 spte, int level) */ static bool try_step_down(struct tdp_iter *iter) { - u64 *child_pt; + tdp_ptep_t child_pt; if (iter->level == iter->min_level) return false; @@ -75,7 +75,7 @@ static bool try_step_down(struct tdp_iter *iter) * Reread the SPTE before stepping down to avoid traversing into page * tables that are no longer linked from this entry. */ - iter->old_spte = READ_ONCE(*iter->sptep); + iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep)); child_pt = spte_to_child_pt(iter->old_spte, iter->level); if (!chil
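The essence of the RCU scheme, stripped of the KVM plumbing: lockless walkers read page-table memory inside an RCU read-side critical section and load SPTE pointers via rcu_dereference(), while the zap path unlinks a table, flushes TLBs, and hands the memory to call_rcu() so it is freed only after all pre-existing walkers have finished. The kernel-style sketch below will not build outside a kernel tree; the callback body is reconstructed from the surrounding patches, and queue_removed_page_table()/read_spte_locklessly() are illustrative names, not functions in the series.

/*
 * Free the page table memory once every walker that might still be using it
 * has left its RCU read-side critical section.
 */
static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
{
	struct kvm_mmu_page *sp = container_of(head, struct kvm_mmu_page,
					       rcu_head);

	free_page((unsigned long)sp->spt);
	kmem_cache_free(mmu_page_header_cache, sp);
}

/* Zapping side: unlink the table, flush TLBs, then defer the actual free. */
static void queue_removed_page_table(struct kvm_mmu_page *sp)
{
	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
}

/*
 * Walking side: lockless walkers bracket the walk with rcu_read_lock() and
 * always load SPTE pointers through rcu_dereference(). In the real code the
 * entire traversal, not a single load, sits inside one read-side section.
 */
static u64 read_spte_locklessly(u64 __rcu *sptep)
{
	u64 spte;

	rcu_read_lock();
	spte = READ_ONCE(*rcu_dereference(sptep));
	rcu_read_unlock();

	return spte;
}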
[PATCH v2 13/28] KVM: x86/mmu: Ensure forward progress when yielding in TDP MMU iter
In some functions the TDP iter risks not making forward progress if two threads livelock yielding to one another. This is possible if two threads are trying to execute wrprot_gfn_range. Each could write protect an entry and then yield. This would reset the tdp_iter's walk over the paging structure and the loop would end up repeating the same entry over and over, preventing either thread from making forward progress. Fix this issue by only yielding if the loop has made forward progress since the last yield. Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- v1 -> v2 - Moved forward progress check into tdp_mmu_iter_cond_resched - Folded tdp_iter_refresh_walk into tdp_mmu_iter_cond_resched - Split patch into three and renamed all arch/x86/kvm/mmu/tdp_iter.c | 18 +- arch/x86/kvm/mmu/tdp_iter.h | 7 ++- arch/x86/kvm/mmu/tdp_mmu.c | 21 - 3 files changed, 23 insertions(+), 23 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c index 9917c55b7d24..1a09d212186b 100644 --- a/arch/x86/kvm/mmu/tdp_iter.c +++ b/arch/x86/kvm/mmu/tdp_iter.c @@ -31,6 +31,7 @@ void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level, WARN_ON(root_level > PT64_ROOT_MAX_LEVEL); iter->next_last_level_gfn = next_last_level_gfn; + iter->yielded_gfn = iter->next_last_level_gfn; iter->root_level = root_level; iter->min_level = min_level; iter->level = root_level; @@ -158,23 +159,6 @@ void tdp_iter_next(struct tdp_iter *iter) iter->valid = false; } -/* - * Restart the walk over the paging structure from the root, starting from the - * highest gfn the iterator had previously reached. Assumes that the entire - * paging structure, except the root page, may have been completely torn down - * and rebuilt. - */ -void tdp_iter_refresh_walk(struct tdp_iter *iter) -{ - gfn_t next_last_level_gfn = iter->next_last_level_gfn; - - if (iter->gfn > next_last_level_gfn) - next_last_level_gfn = iter->gfn; - - tdp_iter_start(iter, iter->pt_path[iter->root_level - 1], - iter->root_level, iter->min_level, next_last_level_gfn); -} - u64 *tdp_iter_root_pt(struct tdp_iter *iter) { return iter->pt_path[iter->root_level - 1]; diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h index b2dd269c631f..d480c540ee27 100644 --- a/arch/x86/kvm/mmu/tdp_iter.h +++ b/arch/x86/kvm/mmu/tdp_iter.h @@ -16,6 +16,12 @@ struct tdp_iter { * for this GFN. */ gfn_t next_last_level_gfn; + /* +* The next_last_level_gfn at the time when the thread last +* yielded. Only yielding when the next_last_level_gfn != +* yielded_gfn helps ensure forward progress. +*/ + gfn_t yielded_gfn; /* Pointers to the page tables traversed to reach the current SPTE */ u64 *pt_path[PT64_ROOT_MAX_LEVEL]; /* A pointer to the current SPTE */ @@ -54,7 +60,6 @@ u64 *spte_to_child_pt(u64 pte, int level); void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level, int min_level, gfn_t next_last_level_gfn); void tdp_iter_next(struct tdp_iter *iter); -void tdp_iter_refresh_walk(struct tdp_iter *iter); u64 *tdp_iter_root_pt(struct tdp_iter *iter); #endif /* __KVM_X86_MMU_TDP_ITER_H */ diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 8f7b120597f3..7cfc0639b1ef 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -451,8 +451,9 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm, * TLB flush before yielding. 
* * If this function yields, it will also reset the tdp_iter's walk over the - * paging structure and the calling function should allow the iterator to - * continue its traversal from the paging structure root. + * paging structure and the calling function should skip to the next + * iteration to allow the iterator to continue its traversal from the + * paging structure root. * * Return true if this function yielded and the iterator's traversal was reset. * Return false if a yield was not needed. @@ -460,12 +461,22 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm, static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter, bool flush) { + /* Ensure forward progress has been made before yielding. */ + if (iter->next_last_level_gfn == iter->yielded_gfn) + return false; + if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { if (flush) kvm_flush_remote_tlbs(kvm); cond_resched_lock(&kvm->mmu_lock); -
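The forward-progress rule amounts to: remember where the walk stood the last time this thread yielded and refuse to yield again until the walk has moved past that point. A minimal user-space sketch of the check, with illustrative names rather than the KVM code:

#include <stdbool.h>
#include <stdint.h>

struct walk_state {
	uint64_t next_last_level_gfn;	/* where the walk will resume */
	uint64_t yielded_gfn;		/* position at the last yield */
};

/* Only yield if the walk has advanced since the previous yield. */
static bool maybe_yield(struct walk_state *w, bool contended)
{
	if (w->next_last_level_gfn == w->yielded_gfn)
		return false;

	if (!contended)
		return false;

	/* ... drop and reacquire the lock, restart the walk ... */
	w->yielded_gfn = w->next_last_level_gfn;
	return true;
}

int main(void)
{
	struct walk_state w = { .next_last_level_gfn = 0x200, .yielded_gfn = 0x100 };

	return maybe_yield(&w, true) ? 0 : 1;
}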
[PATCH v2 11/28] KVM: x86/mmu: Merge flush and non-flush tdp_mmu_iter_cond_resched
The flushing and non-flushing variants of tdp_mmu_iter_cond_resched have almost identical implementations. Merge the two functions and add a flush parameter. Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 42 -- 1 file changed, 13 insertions(+), 29 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index e3066d08c1dc..8f7b120597f3 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -443,33 +443,13 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm, for_each_tdp_pte(_iter, __va(_mmu->root_hpa), \ _mmu->shadow_root_level, _start, _end) -/* - * Flush the TLB and yield if the MMU lock is contended or this thread needs to - * return control to the scheduler. - * - * If this function yields, it will also reset the tdp_iter's walk over the - * paging structure and the calling function should allow the iterator to - * continue its traversal from the paging structure root. - * - * Return true if this function yielded, the TLBs were flushed, and the - * iterator's traversal was reset. Return false if a yield was not needed. - */ -static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *iter) -{ - if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { - kvm_flush_remote_tlbs(kvm); - cond_resched_lock(&kvm->mmu_lock); - tdp_iter_refresh_walk(iter); - return true; - } - - return false; -} - /* * Yield if the MMU lock is contended or this thread needs to return control * to the scheduler. * + * If this function should yield and flush is set, it will perform a remote + * TLB flush before yielding. + * * If this function yields, it will also reset the tdp_iter's walk over the * paging structure and the calling function should allow the iterator to * continue its traversal from the paging structure root. @@ -477,9 +457,13 @@ static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *it * Return true if this function yielded and the iterator's traversal was reset. * Return false if a yield was not needed. 
*/ -static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter) +static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm, +struct tdp_iter *iter, bool flush) { if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { + if (flush) + kvm_flush_remote_tlbs(kvm); + cond_resched_lock(&kvm->mmu_lock); tdp_iter_refresh_walk(iter); return true; @@ -522,7 +506,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, tdp_mmu_set_spte(kvm, &iter, 0); flush_needed = !can_yield || - !tdp_mmu_iter_flush_cond_resched(kvm, &iter); + !tdp_mmu_iter_cond_resched(kvm, &iter, true); } return flush_needed; } @@ -894,7 +878,7 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte); spte_set = true; - tdp_mmu_iter_cond_resched(kvm, &iter); + tdp_mmu_iter_cond_resched(kvm, &iter, false); } return spte_set; } @@ -953,7 +937,7 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte); spte_set = true; - tdp_mmu_iter_cond_resched(kvm, &iter); + tdp_mmu_iter_cond_resched(kvm, &iter, false); } return spte_set; } @@ -1069,7 +1053,7 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, tdp_mmu_set_spte(kvm, &iter, new_spte); spte_set = true; - tdp_mmu_iter_cond_resched(kvm, &iter); + tdp_mmu_iter_cond_resched(kvm, &iter, false); } return spte_set; @@ -1121,7 +1105,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm, tdp_mmu_set_spte(kvm, &iter, 0); - spte_set = !tdp_mmu_iter_flush_cond_resched(kvm, &iter); + spte_set = !tdp_mmu_iter_cond_resched(kvm, &iter, true); } if (spte_set) -- 2.30.0.365.g02bc693789-goog
[PATCH v2 10/28] KVM: x86/mmu: Fix TDP MMU zap collapsible SPTEs
The TDP MMU function to zap SPTEs which could be replaced with a larger mapping has its last-level check inverted: it skips exactly the leaf SPTEs it should zap, so in practice it never zaps anything. Fix this by correctly zapping the last level SPTEs. Fixes: 14881998566d ("kvm: x86/mmu: Support disabling dirty logging for the tdp MMU") Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index c3075fb568eb..e3066d08c1dc 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1098,8 +1098,8 @@ bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot) } /* - * Clear non-leaf entries (and free associated page tables) which could - * be replaced by large mappings, for GFNs within the slot. + * Clear leaf entries which could be replaced by large mappings, for + * GFNs within the slot. */ static void zap_collapsible_spte_range(struct kvm *kvm, struct kvm_mmu_page *root, @@ -,7 +,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm, tdp_root_for_each_pte(iter, root, start, end) { if (!is_shadow_present_pte(iter.old_spte) || - is_last_spte(iter.old_spte, iter.level)) + !is_last_spte(iter.old_spte, iter.level)) continue; pfn = spte_to_pfn(iter.old_spte); -- 2.30.0.365.g02bc693789-goog
[PATCH v2 09/28] KVM: x86/mmu: Fix braces in kvm_recover_nx_lpages
No functional change intended. Fixes: 29cf0f5007a2 ("kvm: x86/mmu: NX largepage recovery for TDP MMU") Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/mmu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 6d16481aa29d..60ff6837655a 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -6005,10 +6005,10 @@ static void kvm_recover_nx_lpages(struct kvm *kvm) struct kvm_mmu_page, lpage_disallowed_link); WARN_ON_ONCE(!sp->lpage_disallowed); - if (sp->tdp_mmu_page) + if (sp->tdp_mmu_page) { kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level)); - else { + } else { kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list); WARN_ON_ONCE(sp->lpage_disallowed); } -- 2.30.0.365.g02bc693789-goog
[PATCH v2 08/28] sched: Add cond_resched_rwlock
Safely rescheduling while holding a spin lock is essential for keeping long running kernel operations running smoothly. Add the facility to cond_resched rwlocks. CC: Ingo Molnar CC: Will Deacon Acked-by: Peter Zijlstra Acked-by: Davidlohr Bueso Acked-by: Waiman Long Acked-by: Paolo Bonzini Signed-off-by: Ben Gardon --- include/linux/sched.h | 12 kernel/sched/core.c | 40 2 files changed, 52 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 5d1378e5a040..3052d16da3cf 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1883,12 +1883,24 @@ static inline int _cond_resched(void) { return 0; } }) extern int __cond_resched_lock(spinlock_t *lock); +extern int __cond_resched_rwlock_read(rwlock_t *lock); +extern int __cond_resched_rwlock_write(rwlock_t *lock); #define cond_resched_lock(lock) ({ \ ___might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);\ __cond_resched_lock(lock); \ }) +#define cond_resched_rwlock_read(lock) ({ \ + __might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET); \ + __cond_resched_rwlock_read(lock); \ +}) + +#define cond_resched_rwlock_write(lock) ({ \ + __might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET); \ + __cond_resched_rwlock_write(lock); \ +}) + static inline void cond_resched_rcu(void) { #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ff74fca39ed2..efed1bf202d1 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6709,6 +6709,46 @@ int __cond_resched_lock(spinlock_t *lock) } EXPORT_SYMBOL(__cond_resched_lock); +int __cond_resched_rwlock_read(rwlock_t *lock) +{ + int resched = should_resched(PREEMPT_LOCK_OFFSET); + int ret = 0; + + lockdep_assert_held_read(lock); + + if (rwlock_needbreak(lock) || resched) { + read_unlock(lock); + if (resched) + preempt_schedule_common(); + else + cpu_relax(); + ret = 1; + read_lock(lock); + } + return ret; +} +EXPORT_SYMBOL(__cond_resched_rwlock_read); + +int __cond_resched_rwlock_write(rwlock_t *lock) +{ + int resched = should_resched(PREEMPT_LOCK_OFFSET); + int ret = 0; + + lockdep_assert_held_write(lock); + + if (rwlock_needbreak(lock) || resched) { + write_unlock(lock); + if (resched) + preempt_schedule_common(); + else + cpu_relax(); + ret = 1; + write_lock(lock); + } + return ret; +} +EXPORT_SYMBOL(__cond_resched_rwlock_write); + /** * yield - yield the current processor to other threads. * -- 2.30.0.365.g02bc693789-goog
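For readers unfamiliar with these primitives, here is a kernel-style usage sketch of cond_resched_rwlock_write() in a long-running loop; the loop body and zap_one_gfn() are illustrative, but the call pattern matches how later patches in this series use the helper.

/* Long-running zap over a GFN range, holding the MMU lock for write. */
static void zap_range_example(struct kvm *kvm, gfn_t start, gfn_t end)
{
	gfn_t gfn;

	write_lock(&kvm->mmu_lock);
	for (gfn = start; gfn < end; gfn++) {
		zap_one_gfn(kvm, gfn);	/* illustrative helper */

		/*
		 * Drop the lock and reschedule if another thread is waiting
		 * for it or this thread's slice is up, then retake the lock
		 * and continue.
		 */
		cond_resched_rwlock_write(&kvm->mmu_lock);
	}
	write_unlock(&kvm->mmu_lock);
}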
Re: [PATCH v2 23/28] KVM: x86/mmu: Allow parallel page faults for the TDP MMU
On Wed, Feb 3, 2021 at 4:40 AM Paolo Bonzini wrote: > > On 02/02/21 19:57, Ben Gardon wrote: > > > > - write_lock(&vcpu->kvm->mmu_lock); > > + > > + if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)) > > + read_lock(&vcpu->kvm->mmu_lock); > > + else > > + write_lock(&vcpu->kvm->mmu_lock); > > + > > I'd like to make this into two helper functions, but I'm not sure about > the naming: > > - kvm_mmu_read_lock_for_root/kvm_mmu_read_unlock_for_root: not precise > because it's really write-locked for shadow MMU roots > > - kvm_mmu_lock_for_root/kvm_mmu_unlock_for_root: not clear that TDP MMU > operations will need to operate in shared-lock mode > > I prefer the first because at least it's the conservative option, but > I'm open to other opinions and suggestions. > > Paolo > Of the above two options, I like the second one, though I'd be happy with either. I agree the first is more conservative, in that it's clear the MMU lock could be shared. It feels a little misleading, though, to have read in the name of the function but then acquire the write lock, especially since there's code below that which expects the write lock. I don't know of a good way to abstract this into a helper without some comments to make it clear what's going on, but maybe there's a slightly more open-coded compromise: if (!kvm_mmu_read_lock_for_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)) write_lock(&vcpu->kvm->mmu_lock); or enum kvm_mmu_lock_mode lock_mode = get_mmu_lock_mode_for_root(vcpu->kvm, vcpu->arch.mmu->root_hpa); kvm_mmu_lock_for_mode(lock_mode); Not sure if either of those is actually clearer, but the latter trends in the direction the RFC took, having an enum to capture read/write and whether or not to yield in a lock mode parameter.
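For concreteness, the open-coded compromise suggested above could look roughly like the following; this is a sketch of the proposal in the thread, not code that exists in the tree, and the helper name is exactly the open question.

/*
 * Take the MMU lock for read and return true if @root_hpa is a TDP MMU root;
 * otherwise take no lock and return false so the caller can take the lock
 * for write explicitly.
 */
static bool kvm_mmu_read_lock_for_root(struct kvm *kvm, hpa_t root_hpa)
{
	if (!is_tdp_mmu_root(kvm, root_hpa))
		return false;

	read_lock(&kvm->mmu_lock);
	return true;
}

/* Call site, as proposed above:
 *
 *	if (!kvm_mmu_read_lock_for_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
 *		write_lock(&vcpu->kvm->mmu_lock);
 */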
Re: [PATCH v2 24/28] KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock
On Wed, Feb 3, 2021 at 3:26 AM Paolo Bonzini wrote: > > On 02/02/21 19:57, Ben Gardon wrote: > > +#ifdef CONFIG_LOCKDEP > > + if (shared) > > + lockdep_assert_held_read(&kvm->mmu_lock); > > + else > > + lockdep_assert_held_write(&kvm->mmu_lock); > > +#endif /* CONFIG_LOCKDEP */ > > Also, there's no need for the #ifdef here. I agree, I must have misinterpreted some feedback on a previous commit and gone overboard with it. > Do we want a helper > kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm, bool shared)? There are only two places that try to assert both ways as far as I can see on a cursory check, but it couldn't hurt. > > Paolo >
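A sketch of the helper Paolo floats above (an assumed shape, not necessarily what landed upstream). No #ifdef is needed because the lockdep_assert_held_*() macros already compile away to harmless no-ops when CONFIG_LOCKDEP is off, which is also why the #ifdef in the quoted hunk can be dropped.

static inline void kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
                                                    bool shared)
{
        if (shared)
                lockdep_assert_held_read(&kvm->mmu_lock);
        else
                lockdep_assert_held_write(&kvm->mmu_lock);
}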
Re: [PATCH v2 25/28] KVM: x86/mmu: Allow zapping collapsible SPTEs to use MMU read lock
On Wed, Feb 3, 2021 at 3:34 AM Paolo Bonzini wrote: > > On 02/02/21 19:57, Ben Gardon wrote: > > @@ -1485,7 +1489,9 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm > > *kvm, > > struct kvm_mmu_page *root; > > int root_as_id; > > > > - for_each_tdp_mmu_root_yield_safe(kvm, root, false) { > > + read_lock(&kvm->mmu_lock); > > + > > + for_each_tdp_mmu_root_yield_safe(kvm, root, true) { > > root_as_id = kvm_mmu_page_as_id(root); > > if (root_as_id != slot->as_id) > > continue; > > @@ -1493,6 +1499,8 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm > > *kvm, > > zap_collapsible_spte_range(kvm, root, slot->base_gfn, > > slot->base_gfn + slot->npages); > > } > > + > > + read_unlock(&kvm->mmu_lock); > > } > > > I'd prefer the functions to be consistent about who takes the lock, > either mmu.c or tdp_mmu.c. Since everywhere else you're doing it in > mmu.c, that would be: > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 0554d9c5c5d4..386ee4b703d9 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -5567,10 +5567,13 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, > write_lock(&kvm->mmu_lock); > slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot, > kvm_mmu_zap_collapsible_spte, true); > + write_unlock(&kvm->mmu_lock); > > - if (kvm->arch.tdp_mmu_enabled) > + if (kvm->arch.tdp_mmu_enabled) { > + read_lock(&kvm->mmu_lock); > kvm_tdp_mmu_zap_collapsible_sptes(kvm, memslot); > - write_unlock(&kvm->mmu_lock); > + read_unlock(&kvm->mmu_lock); > + } > } > > void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm, > > and just lockdep_assert_held_read here. That makes sense to me, I agree keeping it consistent is probably a good idea. > > > - tdp_mmu_set_spte(kvm, &iter, 0); > > - > > - spte_set = true; > > Is it correct to remove this assignment? No, it was not correct to remove it. Thank you for catching that. > > Paolo >
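To make the locking suggestion concrete, the tdp_mmu.c side might end up looking like the sketch below if mmu.c takes the lock as in Paolo's diff: the body is lifted from the function quoted above, with the read_lock/read_unlock pair replaced by an assertion. This is an assumption about the eventual shape, not the committed code.

void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
                                       struct kvm_memory_slot *slot)
{
        struct kvm_mmu_page *root;
        int root_as_id;

        /* The caller (mmu.c) now takes mmu_lock for read around this call. */
        lockdep_assert_held_read(&kvm->mmu_lock);

        for_each_tdp_mmu_root_yield_safe(kvm, root, true) {
                root_as_id = kvm_mmu_page_as_id(root);
                if (root_as_id != slot->as_id)
                        continue;

                zap_collapsible_spte_range(kvm, root, slot->base_gfn,
                                           slot->base_gfn + slot->npages);
        }
}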
[PATCH] KVM: VMX: Optimize flushing the PML buffer
vmx_flush_pml_buffer repeatedly calls kvm_vcpu_mark_page_dirty, which SRCU-dereferences kvm->memslots. In order to give the compiler more freedom to optimize the function, SRCU-dereference the pointer kvm->memslots only once. Reviewed-by: Makarand Sonare Signed-off-by: Ben Gardon --- Tested by running the dirty_log_perf_test selftest on a dual socket Intel Skylake machine: ./dirty_log_perf_test -v 4 -b 30G -i 5 The test was run 5 times with and without this patch and the dirty memory time for iterations 2-5 was averaged across the 5 runs. Iteration 1 was discarded for this analysis because it is still dominated by the time spent populating memory. The average time for each run demonstrated a strange bimodal distribution, with clusters around 2 seconds and 2.5 seconds. This may have been a result of vCPU migration between NUMA nodes. In any case, the get dirty times with this patch averaged to 2.07 seconds, a 7% savings from the 2.22 second average without this patch. While these savings may be partly a result of the patched runs having one more 2 second clustered run, the patched runs in the higher cluster were also 7-8% shorter than those in the unpatched case. Below is the raw data for anyone interested in visualizing the results with a graph:
Iteration  Baseline     Patched
2          2.038562907  2.045226614
3          2.037363248  2.045033709
4          2.037176331  1.999783966
5          1.999891981  2.007849104
2          2.569526298  2.001252504
3          2.579110209  2.008541897
4          2.585883731  2.005317983
5          2.588692727  2.007100987
2          2.01191437   2.006953735
3          2.012972236  2.04540153
4          1.968836017  2.005035246
5          1.967915154  2.003859551
2          2.037533296  1.991275846
3          2.501480125  2.391886691
4          2.454382587  2.391904789
5          2.461046772  2.398767963
2          2.036991484  2.011331436
3          2.002954418  2.002635687
4          2.053342717  2.006769959
5          2.522539759  2.006470059
Average    2.223405818  2.069119963
arch/x86/kvm/vmx/vmx.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index cc60b1fc3ee7..46c54802dfdb 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -5692,6 +5692,7 @@ static void vmx_destroy_pml_buffer(struct vcpu_vmx *vmx) static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); + struct kvm_memslots *memslots; u64 *pml_buf; u16 pml_idx; @@ -5707,13 +5708,18 @@ static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu) else pml_idx++; + memslots = kvm_vcpu_memslots(vcpu); + pml_buf = page_address(vmx->pml_pg); for (; pml_idx < PML_ENTITY_NUM; pml_idx++) { + struct kvm_memory_slot *memslot; u64 gpa; gpa = pml_buf[pml_idx]; WARN_ON(gpa & (PAGE_SIZE - 1)); - kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT); + + memslot = __gfn_to_memslot(memslots, gpa >> PAGE_SHIFT); + mark_page_dirty_in_slot(vcpu->kvm, memslot, gpa >> PAGE_SHIFT); } /* reset PML index */ -- 2.30.0.365.g02bc693789-goog
Re: [PATCH] KVM: VMX: Optimize flushing the PML buffer
On Thu, Feb 4, 2021 at 2:51 PM Peter Xu wrote: > > Hi, Ben, > > On Thu, Feb 04, 2021 at 02:19:59PM -0800, Ben Gardon wrote: > > The average time for each run demonstrated a strange bimodal distribution, > > with clusters around 2 seconds and 2.5 seconds. This may have been a > > result of vCPU migration between NUMA nodes. > > Have you thought about using numactl or similar technique to verify your idea > (force both vcpu threads binding, and memory allocations)? > > From the numbers it already shows improvements indeed, but just curious since > you raised this up. :) Frustratingly, the test machines I have don't have numactl installed but I've been meaning to add cpu pinning to the selftests perf tests anyway, so maybe this is a good reason to do it. > > > @@ -5707,13 +5708,18 @@ static void vmx_flush_pml_buffer(struct kvm_vcpu > > *vcpu) > > else > > pml_idx++; > > > > + memslots = kvm_vcpu_memslots(vcpu); > > + > > pml_buf = page_address(vmx->pml_pg); > > for (; pml_idx < PML_ENTITY_NUM; pml_idx++) { > > + struct kvm_memory_slot *memslot; > > u64 gpa; > > > > gpa = pml_buf[pml_idx]; > > WARN_ON(gpa & (PAGE_SIZE - 1)); > > - kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT); > > + > > + memslot = __gfn_to_memslot(memslots, gpa >> PAGE_SHIFT); > > + mark_page_dirty_in_slot(vcpu->kvm, memslot, gpa >> > > PAGE_SHIFT); > > Since at it: make "gpa >> PAGE_SHIFT" a temp var too? That's a good idea, I'll try it. > > Thanks, > > -- > Peter Xu >
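Peter's suggestion would amount to something like the reworked loop body below (a sketch on top of the patch above, not a posted revision): hoist gpa >> PAGE_SHIFT into a gfn local so it is computed once per PML entry and reused for both the memslot lookup and the dirty marking.

        pml_buf = page_address(vmx->pml_pg);
        for (; pml_idx < PML_ENTITY_NUM; pml_idx++) {
                struct kvm_memory_slot *memslot;
                u64 gpa = pml_buf[pml_idx];
                /* Compute the gfn once; reuse it for the lookup and the mark. */
                gfn_t gfn = gpa >> PAGE_SHIFT;

                WARN_ON(gpa & (PAGE_SIZE - 1));

                memslot = __gfn_to_memslot(memslots, gfn);
                mark_page_dirty_in_slot(vcpu->kvm, memslot, gfn);
        }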
Re: [RFC PATCH 1/2] KVM: selftests: Add a macro to get string of vm_mem_backing_src_type
On Mon, Feb 8, 2021 at 1:08 AM Yanan Wang wrote: > > Add a macro to get string of the backing source memory type, so that > application can add choices for source types in the help() function, > and users can specify which type to use for testing. Coincidentally, I sent out a change last week to do the same thing: "KVM: selftests: Add backing src parameter to dirty_log_perf_test" (https://lkml.org/lkml/2021/2/2/1430) Whichever way this ends up being implemented, I'm happy to see others interested in testing different backing source types too. > > Signed-off-by: Yanan Wang > --- > tools/testing/selftests/kvm/include/kvm_util.h | 3 +++ > tools/testing/selftests/kvm/lib/kvm_util.c | 8 > 2 files changed, 11 insertions(+) > > diff --git a/tools/testing/selftests/kvm/include/kvm_util.h > b/tools/testing/selftests/kvm/include/kvm_util.h > index 5cbb861525ed..f5fc29dc9ee6 100644 > --- a/tools/testing/selftests/kvm/include/kvm_util.h > +++ b/tools/testing/selftests/kvm/include/kvm_util.h > @@ -69,7 +69,9 @@ enum vm_guest_mode { > #define PTES_PER_MIN_PAGE ptes_per_page(MIN_PAGE_SIZE) > > #define vm_guest_mode_string(m) vm_guest_mode_string[m] > +#define vm_mem_backing_src_type_string(s) vm_mem_backing_src_type_string[s] > extern const char * const vm_guest_mode_string[]; > +extern const char * const vm_mem_backing_src_type_string[]; > > struct vm_guest_mode_params { > unsigned int pa_bits; > @@ -83,6 +85,7 @@ enum vm_mem_backing_src_type { > VM_MEM_SRC_ANONYMOUS, > VM_MEM_SRC_ANONYMOUS_THP, > VM_MEM_SRC_ANONYMOUS_HUGETLB, > + NUM_VM_BACKING_SRC_TYPES, > }; > > int kvm_check_cap(long cap); > diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c > b/tools/testing/selftests/kvm/lib/kvm_util.c > index fa5a90e6c6f0..a9b651c7f866 100644 > --- a/tools/testing/selftests/kvm/lib/kvm_util.c > +++ b/tools/testing/selftests/kvm/lib/kvm_util.c > @@ -165,6 +165,14 @@ const struct vm_guest_mode_params vm_guest_mode_params[] > = { > _Static_assert(sizeof(vm_guest_mode_params)/sizeof(struct > vm_guest_mode_params) == NUM_VM_MODES, >"Missing new mode params?"); > > +const char * const vm_mem_backing_src_type_string[] = { > + "VM_MEM_SRC_ANONYMOUS", > + "VM_MEM_SRC_ANONYMOUS_THP", > + "VM_MEM_SRC_ANONYMOUS_HUGETLB", > +}; > +_Static_assert(sizeof(vm_mem_backing_src_type_string)/sizeof(char *) == > NUM_VM_BACKING_SRC_TYPES, > + "Missing new source type strings?"); > + > /* > * VM Create > * > -- > 2.23.0 >
Re: [RFC PATCH 2/2] KVM: selftests: Add a test for kvm page table code
On Mon, Feb 8, 2021 at 1:08 AM Yanan Wang wrote: > > This test serves as a performance tester and a bug reproducer for > kvm page table code (GPA->HPA mappings), so it gives guidance for > people trying to make some improvement for kvm. > > The function guest_code() is designed to cover conditions where a single vcpu > or multiple vcpus access guest pages within the same memory range, in three > VM stages(before dirty-logging, during dirty-logging, after dirty-logging). > Besides, the backing source memory type(ANONYMOUS/THP/HUGETLB) of the tested > memory region can be specified by users, which means normal page mappings or > block mappings can be chosen by users to be created in the test. > > If use of ANONYMOUS memory is specified, kvm will create page mappings for the > tested memory region before dirty-logging, and update attributes of the page > mappings from RO to RW during dirty-logging. If use of THP/HUGETLB memory is > specified, kvm will create block mappings for the tested memory region before > dirty-logging, and split the blcok mappings into page mappings during > dirty-logging, and coalesce the page mappings back into block mappings after > dirty-logging is stopped. > > So in summary, as a performance tester, this test can present the performance > of kvm creating/updating normal page mappings, or the performance of kvm > creating/splitting/recovering block mappings, through execution time. > > When we need to coalesce the page mappings back to block mappings after dirty > logging is stopped, we have to firstly invalidate *all* the TLB entries for > the > page mappings right before installation of the block entry, because a TLB > conflict > abort error could occur if we can't invalidate the TLB entries fully. We have > hit this TLB conflict twice on aarch64 software implementation and fixed it. > As this test can imulate process from dirty-logging enabled to dirty-logging > stopped of a VM with block mappings, so it can also reproduce this TLB > conflict > abort due to inadequate TLB invalidation when coalescing tables. > > Signed-off-by: Yanan Wang Thanks for sending this! Happy to see more tests for weird TLB flushing edge cases and races. Just out of curiosity, were you unable to replicate the bug with the dirty_log_perf_test and setting the wr_fract option? With "KVM: selftests: Disable dirty logging with vCPUs running" (https://lkml.org/lkml/2021/2/2/1431), the dirty_log_perf_test has most of the same features as this one. Please correct me if I'm wrong, but it seems like the major difference here is a more careful pattern of which pages are dirtied when. Within Google we have a system for pre-specifying sets of arguments to e.g. the dirty_log_perf_test. I wonder if something similar, even as simple as a script that just runs dirty_log_perf_test several times would be helpful for cases where different arguments are needed for the test to cover different specific cases. Even with this test, for example, I assume the test doesn't work very well with just 1 vCPU, but it's still a good default in the test, so having some kind of configuration (lite) file would be useful. 
> --- > tools/testing/selftests/kvm/Makefile | 3 + > .../selftests/kvm/kvm_page_table_test.c | 518 ++ > 2 files changed, 521 insertions(+) > create mode 100644 tools/testing/selftests/kvm/kvm_page_table_test.c > > diff --git a/tools/testing/selftests/kvm/Makefile > b/tools/testing/selftests/kvm/Makefile > index fe41c6a0fa67..697318019bd4 100644 > --- a/tools/testing/selftests/kvm/Makefile > +++ b/tools/testing/selftests/kvm/Makefile > @@ -62,6 +62,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/tsc_msrs_test > TEST_GEN_PROGS_x86_64 += demand_paging_test > TEST_GEN_PROGS_x86_64 += dirty_log_test > TEST_GEN_PROGS_x86_64 += dirty_log_perf_test > +TEST_GEN_PROGS_x86_64 += kvm_page_table_test > TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus > TEST_GEN_PROGS_x86_64 += set_memory_region_test > TEST_GEN_PROGS_x86_64 += steal_time > @@ -71,6 +72,7 @@ TEST_GEN_PROGS_aarch64 += aarch64/get-reg-list-sve > TEST_GEN_PROGS_aarch64 += demand_paging_test > TEST_GEN_PROGS_aarch64 += dirty_log_test > TEST_GEN_PROGS_aarch64 += dirty_log_perf_test > +TEST_GEN_PROGS_aarch64 += kvm_page_table_test > TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus > TEST_GEN_PROGS_aarch64 += set_memory_region_test > TEST_GEN_PROGS_aarch64 += steal_time > @@ -80,6 +82,7 @@ TEST_GEN_PROGS_s390x += s390x/resets > TEST_GEN_PROGS_s390x += s390x/sync_regs_test > TEST_GEN_PROGS_s390x += demand_paging_test > TEST_GEN_PROGS_s390x += dirty_log_test > +TEST_GEN_PROGS_s390x += kvm_page_table_test > TEST_GEN_PROGS_s390x += kvm_create_max_vcpus > TEST_GEN_PROGS_s390x += set_memory_region_test > > diff --git a/tools/testing/selftests/kvm/kvm_page_table_test.c > b/tools/testing/selftests/kvm/kvm_page_table_test.c > new file mode 100644 > index ..b09c05288937 > --- /dev/null > +++ b/tools/testing/selftests/kvm/kvm_pag
Re: [RFC PATCH 1/2] KVM: selftests: Add a macro to get string of vm_mem_backing_src_type
On Tue, Feb 9, 2021 at 3:21 AM wangyanan (Y) wrote: > > > On 2021/2/9 2:13, Ben Gardon wrote: > > On Mon, Feb 8, 2021 at 1:08 AM Yanan Wang wrote: > >> Add a macro to get string of the backing source memory type, so that > >> application can add choices for source types in the help() function, > >> and users can specify which type to use for testing. > > Coincidentally, I sent out a change last week to do the same thing: > > "KVM: selftests: Add backing src parameter to dirty_log_perf_test" > > (https://lkml.org/lkml/2021/2/2/1430) > > Whichever way this ends up being implemented, I'm happy to see others > > interested in testing different backing source types too. > > Thanks Ben! I have a little question here. > > Can we just present three IDs (0/1/2) but not strings for users to > choose which backing_src_type to use like the way of guest modes, That would be fine with me. The string names are easier for me to read than an ID number (especially if you were to add additional options e.g. 1G hugetlb or file backed / shared memory) but it's mostly an aesthetic preference, so I don't have strong feelings either way. > > which I think can make cmdlines more consise and easier to print. And is > it better to make a universal API to get backing_src_strings > > like Sean have suggested, so that the API can be used elsewhere ? Definitely. This should be as easy as possible to incorporate into all selftests. > > >> Signed-off-by: Yanan Wang > >> --- > >> tools/testing/selftests/kvm/include/kvm_util.h | 3 +++ > >> tools/testing/selftests/kvm/lib/kvm_util.c | 8 > >> 2 files changed, 11 insertions(+) > >> > >> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h > >> b/tools/testing/selftests/kvm/include/kvm_util.h > >> index 5cbb861525ed..f5fc29dc9ee6 100644 > >> --- a/tools/testing/selftests/kvm/include/kvm_util.h > >> +++ b/tools/testing/selftests/kvm/include/kvm_util.h > >> @@ -69,7 +69,9 @@ enum vm_guest_mode { > >> #define PTES_PER_MIN_PAGE ptes_per_page(MIN_PAGE_SIZE) > >> > >> #define vm_guest_mode_string(m) vm_guest_mode_string[m] > >> +#define vm_mem_backing_src_type_string(s) > >> vm_mem_backing_src_type_string[s] > >> extern const char * const vm_guest_mode_string[]; > >> +extern const char * const vm_mem_backing_src_type_string[]; > >> > >> struct vm_guest_mode_params { > >> unsigned int pa_bits; > >> @@ -83,6 +85,7 @@ enum vm_mem_backing_src_type { > >> VM_MEM_SRC_ANONYMOUS, > >> VM_MEM_SRC_ANONYMOUS_THP, > >> VM_MEM_SRC_ANONYMOUS_HUGETLB, > >> + NUM_VM_BACKING_SRC_TYPES, > >> }; > >> > >> int kvm_check_cap(long cap); > >> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c > >> b/tools/testing/selftests/kvm/lib/kvm_util.c > >> index fa5a90e6c6f0..a9b651c7f866 100644 > >> --- a/tools/testing/selftests/kvm/lib/kvm_util.c > >> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c > >> @@ -165,6 +165,14 @@ const struct vm_guest_mode_params > >> vm_guest_mode_params[] = { > >> _Static_assert(sizeof(vm_guest_mode_params)/sizeof(struct > >> vm_guest_mode_params) == NUM_VM_MODES, > >> "Missing new mode params?"); > >> > >> +const char * const vm_mem_backing_src_type_string[] = { > >> + "VM_MEM_SRC_ANONYMOUS", > >> + "VM_MEM_SRC_ANONYMOUS_THP", > >> + "VM_MEM_SRC_ANONYMOUS_HUGETLB", > >> +}; > >> +_Static_assert(sizeof(vm_mem_backing_src_type_string)/sizeof(char *) == > >> NUM_VM_BACKING_SRC_TYPES, > >> + "Missing new source type strings?"); > >> + > >> /* > >>* VM Create > >>* > >> -- > >> 2.23.0 > >> > > .
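As a rough sketch of what the universal string API discussed above could look like (the names here are placeholders, not what was eventually merged): a designated-initializer table keyed by the existing enum plus a small accessor, so every selftest can print the backing source by name and missing entries are caught by the NULL check.

static const char * const backing_src_names[NUM_VM_BACKING_SRC_TYPES] = {
        [VM_MEM_SRC_ANONYMOUS]          = "anonymous",
        [VM_MEM_SRC_ANONYMOUS_THP]      = "anonymous_thp",
        [VM_MEM_SRC_ANONYMOUS_HUGETLB]  = "anonymous_hugetlb",
};

const char *vm_mem_backing_src_name(enum vm_mem_backing_src_type type)
{
        TEST_ASSERT(type < NUM_VM_BACKING_SRC_TYPES && backing_src_names[type],
                    "Unknown backing src type: %d", type);

        return backing_src_names[type];
}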
Re: [RFC PATCH 2/2] KVM: selftests: Add a test for kvm page table code
On Mon, Feb 8, 2021 at 11:22 PM wangyanan (Y) wrote: > > Hi Ben, > > On 2021/2/9 4:29, Ben Gardon wrote: > > On Mon, Feb 8, 2021 at 1:08 AM Yanan Wang wrote: > >> This test serves as a performance tester and a bug reproducer for > >> kvm page table code (GPA->HPA mappings), so it gives guidance for > >> people trying to make some improvement for kvm. > >> > >> The function guest_code() is designed to cover conditions where a single > >> vcpu > >> or multiple vcpus access guest pages within the same memory range, in three > >> VM stages(before dirty-logging, during dirty-logging, after dirty-logging). > >> Besides, the backing source memory type(ANONYMOUS/THP/HUGETLB) of the > >> tested > >> memory region can be specified by users, which means normal page mappings > >> or > >> block mappings can be chosen by users to be created in the test. > >> > >> If use of ANONYMOUS memory is specified, kvm will create page mappings for > >> the > >> tested memory region before dirty-logging, and update attributes of the > >> page > >> mappings from RO to RW during dirty-logging. If use of THP/HUGETLB memory > >> is > >> specified, kvm will create block mappings for the tested memory region > >> before > >> dirty-logging, and split the blcok mappings into page mappings during > >> dirty-logging, and coalesce the page mappings back into block mappings > >> after > >> dirty-logging is stopped. > >> > >> So in summary, as a performance tester, this test can present the > >> performance > >> of kvm creating/updating normal page mappings, or the performance of kvm > >> creating/splitting/recovering block mappings, through execution time. > >> > >> When we need to coalesce the page mappings back to block mappings after > >> dirty > >> logging is stopped, we have to firstly invalidate *all* the TLB entries > >> for the > >> page mappings right before installation of the block entry, because a TLB > >> conflict > >> abort error could occur if we can't invalidate the TLB entries fully. We > >> have > >> hit this TLB conflict twice on aarch64 software implementation and fixed > >> it. > >> As this test can imulate process from dirty-logging enabled to > >> dirty-logging > >> stopped of a VM with block mappings, so it can also reproduce this TLB > >> conflict > >> abort due to inadequate TLB invalidation when coalescing tables. > >> > >> Signed-off-by: Yanan Wang > > Thanks for sending this! Happy to see more tests for weird TLB > > flushing edge cases and races. > > > > Just out of curiosity, were you unable to replicate the bug with the > > dirty_log_perf_test and setting the wr_fract option? > > With "KVM: selftests: Disable dirty logging with vCPUs running" > > (https://lkml.org/lkml/2021/2/2/1431), the dirty_log_perf_test has > > most of the same features as this one. > > Please correct me if I'm wrong, but it seems like the major difference > > here is a more careful pattern of which pages are dirtied when. > Actually the procedures in KVM_UPDATE_MAPPINGS stage are specially > designed for > reproduce of the TLB conflict bug. The following explains why. > In x86 implementation, the related page mappings will be all destroyed > in advance when > stopping dirty logging while vcpus are still running. So after dirty > logging is successfully > stopped, there will certainly be page faults when accessing memory, and > KVM will handle > the faults and create block mappings once again. (Is this right?) > So in this case, dirty_log_perf_test can replicate the bug theoretically. 
> > But there is difference in ARM implementation. The related page mappings > will not be > destroyed immediately when stopping dirty logging and will be kept > instead. And after > dirty logging, KVM will destroy these mappings together with creation of > block mappings > when handling a guest fault (page fault or permission fault). So based > on guest_code() in > dirty_log_perf_test, there will not be any page faults after dirty > logging because all the > page mappings have been created and KVM has no chance to recover block > mappings > at all. So this is why I left half of the pages clean and another half > dirtied. Ah okay, I'm sorry. I shouldn't have assumed that ARM does the same thing as x86 when disabling dirty logging. It makes sense then why your guest code is so carefully structured. Does that mean that if
Re: [RFC PATCH 2/2] KVM: selftests: Add a test for kvm page table code
On Tue, Feb 9, 2021 at 1:43 AM wangyanan (Y) wrote: > > > On 2021/2/9 4:29, Ben Gardon wrote: > > On Mon, Feb 8, 2021 at 1:08 AM Yanan Wang wrote: > >> This test serves as a performance tester and a bug reproducer for > >> kvm page table code (GPA->HPA mappings), so it gives guidance for > >> people trying to make some improvement for kvm. > >> > >> The function guest_code() is designed to cover conditions where a single > >> vcpu > >> or multiple vcpus access guest pages within the same memory range, in three > >> VM stages(before dirty-logging, during dirty-logging, after dirty-logging). > >> Besides, the backing source memory type(ANONYMOUS/THP/HUGETLB) of the > >> tested > >> memory region can be specified by users, which means normal page mappings > >> or > >> block mappings can be chosen by users to be created in the test. > >> > >> If use of ANONYMOUS memory is specified, kvm will create page mappings for > >> the > >> tested memory region before dirty-logging, and update attributes of the > >> page > >> mappings from RO to RW during dirty-logging. If use of THP/HUGETLB memory > >> is > >> specified, kvm will create block mappings for the tested memory region > >> before > >> dirty-logging, and split the blcok mappings into page mappings during > >> dirty-logging, and coalesce the page mappings back into block mappings > >> after > >> dirty-logging is stopped. > >> > >> So in summary, as a performance tester, this test can present the > >> performance > >> of kvm creating/updating normal page mappings, or the performance of kvm > >> creating/splitting/recovering block mappings, through execution time. > >> > >> When we need to coalesce the page mappings back to block mappings after > >> dirty > >> logging is stopped, we have to firstly invalidate *all* the TLB entries > >> for the > >> page mappings right before installation of the block entry, because a TLB > >> conflict > >> abort error could occur if we can't invalidate the TLB entries fully. We > >> have > >> hit this TLB conflict twice on aarch64 software implementation and fixed > >> it. > >> As this test can imulate process from dirty-logging enabled to > >> dirty-logging > >> stopped of a VM with block mappings, so it can also reproduce this TLB > >> conflict > >> abort due to inadequate TLB invalidation when coalescing tables. > >> > >> Signed-off-by: Yanan Wang > > Thanks for sending this! Happy to see more tests for weird TLB > > flushing edge cases and races. > > > > Just out of curiosity, were you unable to replicate the bug with the > > dirty_log_perf_test and setting the wr_fract option? > > With "KVM: selftests: Disable dirty logging with vCPUs running" > > (https://lkml.org/lkml/2021/2/2/1431), the dirty_log_perf_test has > > most of the same features as this one. > > Please correct me if I'm wrong, but it seems like the major difference > > here is a more careful pattern of which pages are dirtied when. > > > > Within Google we have a system for pre-specifying sets of arguments to > > e.g. the dirty_log_perf_test. I wonder if something similar, even as > > simple as a script that just runs dirty_log_perf_test several times > > would be helpful for cases where different arguments are needed for > > the test to cover different specific cases. Even with this test, for > > example, I assume the test doesn't work very well with just 1 vCPU, > > but it's still a good default in the test, so having some kind of > > configuration (lite) file would be useful. 
> > > >> --- > >> tools/testing/selftests/kvm/Makefile | 3 + > >> .../selftests/kvm/kvm_page_table_test.c | 518 ++ > >> 2 files changed, 521 insertions(+) > >> create mode 100644 tools/testing/selftests/kvm/kvm_page_table_test.c > >> > >> diff --git a/tools/testing/selftests/kvm/Makefile > >> b/tools/testing/selftests/kvm/Makefile > >> index fe41c6a0fa67..697318019bd4 100644 > >> --- a/tools/testing/selftests/kvm/Makefile > >> +++ b/tools/testing/selftests/kvm/Makefile > >> @@ -62,6 +62,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/tsc_msrs_test > >> TEST_GEN_PROGS_x86_64 += demand_paging_test > >> TEST_GEN_PROGS_x86_64 += dirty_log_test > >> TEST_GEN_PROGS_x86_64 += dirty_log_perf_test &g
Re: [RFC PATCH v2 4/7] KVM: selftests: Add a helper to get system configured THP page size
On Wed, Feb 24, 2021 at 10:00 PM Yanan Wang wrote: > > If we want to have some tests about transparent hugepages, the system > configured THP hugepage size should better be known by the tests, which > can be used for kinds of alignment or guest memory accessing of vcpus... > So it makes sense to add a helper to get the transparent hugepage size. > > With VM_MEM_SRC_ANONYMOUS_THP specified in vm_userspace_mem_region_add(), > we now stat /sys/kernel/mm/transparent_hugepage to check whether THP is > configured in the host kernel before madvise(). Based on this, we can also > read file /sys/kernel/mm/transparent_hugepage/hpage_pmd_size to get THP > hugepage size. > > Signed-off-by: Yanan Wang Reviewed-by: Ben Gardon > --- > .../testing/selftests/kvm/include/test_util.h | 2 ++ > tools/testing/selftests/kvm/lib/test_util.c | 36 +++ > 2 files changed, 38 insertions(+) > > diff --git a/tools/testing/selftests/kvm/include/test_util.h > b/tools/testing/selftests/kvm/include/test_util.h > index b7f41399f22c..ef24c76ba89a 100644 > --- a/tools/testing/selftests/kvm/include/test_util.h > +++ b/tools/testing/selftests/kvm/include/test_util.h > @@ -78,6 +78,8 @@ struct vm_mem_backing_src_alias { > enum vm_mem_backing_src_type type; > }; > > +bool thp_configured(void); > +size_t get_trans_hugepagesz(void); > void backing_src_help(void); > enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name); > > diff --git a/tools/testing/selftests/kvm/lib/test_util.c > b/tools/testing/selftests/kvm/lib/test_util.c > index c7c0627c6842..f2d133f76c67 100644 > --- a/tools/testing/selftests/kvm/lib/test_util.c > +++ b/tools/testing/selftests/kvm/lib/test_util.c > @@ -10,6 +10,7 @@ > #include > #include > #include > +#include > #include "linux/kernel.h" > > #include "test_util.h" > @@ -117,6 +118,41 @@ const struct vm_mem_backing_src_alias > backing_src_aliases[] = { > {"anonymous_hugetlb", VM_MEM_SRC_ANONYMOUS_HUGETLB,}, > }; > > +bool thp_configured(void) > +{ > + int ret; > + struct stat statbuf; > + > + ret = stat("/sys/kernel/mm/transparent_hugepage", &statbuf); > + TEST_ASSERT(ret == 0 || (ret == -1 && errno == ENOENT), > + "Error in stating /sys/kernel/mm/transparent_hugepage: > %d", > + errno); > + > + return ret == 0; > +} > + > +size_t get_trans_hugepagesz(void) > +{ > + size_t size; > + char buf[16]; > + FILE *f; > + > + TEST_ASSERT(thp_configured(), "THP is not configured in host kernel"); > + > + f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r"); > + TEST_ASSERT(f != NULL, > + "Error in opening transparent_hugepage/hpage_pmd_size: > %d", > + errno); > + > + if (fread(buf, sizeof(char), sizeof(buf), f) == 0) { > + fclose(f); > + TEST_FAIL("Unable to read > transparent_hugepage/hpage_pmd_size"); > + } > + > + size = strtoull(buf, NULL, 10); > + return size; > +} > + > void backing_src_help(void) > { > int i; > -- > 2.19.1 >
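As a usage illustration (assumed, not from the series): a test could use the two helpers above to pick a THP-friendly alignment before creating its memory region. The src_type and npages variables are assumed to exist in the surrounding test code.

        size_t mmap_size = npages * getpagesize();
        size_t alignment = getpagesize();

        /*
         * If THP is configured in the host, align to the configured PMD
         * hugepage size so an MADV_HUGEPAGE region is eligible for THP.
         */
        if (src_type == VM_MEM_SRC_ANONYMOUS_THP && thp_configured())
                alignment = get_trans_hugepagesz();

        /* Round the mapping size up to a multiple of the chosen alignment. */
        mmap_size = (mmap_size + alignment - 1) & ~(alignment - 1);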
Re: [RFC PATCH v2 5/7] KVM: selftests: List all hugetlb src types specified with page sizes
On Wed, Feb 24, 2021 at 10:03 PM Yanan Wang wrote: > > With VM_MEM_SRC_ANONYMOUS_HUGETLB, we currently can only use system > default hugetlb pages to back the testing guest memory. In order to > add flexibility, now list all the known hugetlb backing src types with > different page sizes, so that we can specify use of hugetlb pages of the > exact granularity that we want. And as all the known hugetlb page sizes > are listed, it's appropriate for all architectures. > > Besides, the helper get_backing_src_pagesz() is added to get the > granularity of different backing src types(anonumous, thp, hugetlb). > > Signed-off-by: Yanan Wang > --- > .../testing/selftests/kvm/include/test_util.h | 19 ++- > tools/testing/selftests/kvm/lib/kvm_util.c| 2 +- > tools/testing/selftests/kvm/lib/test_util.c | 56 +++ > 3 files changed, 63 insertions(+), 14 deletions(-) > > diff --git a/tools/testing/selftests/kvm/include/test_util.h > b/tools/testing/selftests/kvm/include/test_util.h > index ef24c76ba89a..be5d08bcdca7 100644 > --- a/tools/testing/selftests/kvm/include/test_util.h > +++ b/tools/testing/selftests/kvm/include/test_util.h > @@ -70,16 +70,31 @@ struct timespec timespec_div(struct timespec ts, int > divisor); > enum vm_mem_backing_src_type { > VM_MEM_SRC_ANONYMOUS, > VM_MEM_SRC_ANONYMOUS_THP, > - VM_MEM_SRC_ANONYMOUS_HUGETLB, > + VM_MEM_SRC_ANONYMOUS_HUGETLB_16KB, > + VM_MEM_SRC_ANONYMOUS_HUGETLB_64KB, > + VM_MEM_SRC_ANONYMOUS_HUGETLB_512KB, > + VM_MEM_SRC_ANONYMOUS_HUGETLB_1MB, > + VM_MEM_SRC_ANONYMOUS_HUGETLB_2MB, > + VM_MEM_SRC_ANONYMOUS_HUGETLB_8MB, > + VM_MEM_SRC_ANONYMOUS_HUGETLB_16MB, > + VM_MEM_SRC_ANONYMOUS_HUGETLB_32MB, > + VM_MEM_SRC_ANONYMOUS_HUGETLB_256MB, > + VM_MEM_SRC_ANONYMOUS_HUGETLB_512MB, > + VM_MEM_SRC_ANONYMOUS_HUGETLB_1GB, > + VM_MEM_SRC_ANONYMOUS_HUGETLB_2GB, > + VM_MEM_SRC_ANONYMOUS_HUGETLB_16GB, > + NUM_SRC_TYPES, > }; > > struct vm_mem_backing_src_alias { > const char *name; > - enum vm_mem_backing_src_type type; > + uint32_t flag; > }; > > bool thp_configured(void); > size_t get_trans_hugepagesz(void); > +const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i); > +size_t get_backing_src_pagesz(uint32_t i); > void backing_src_help(void); > enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name); > > diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c > b/tools/testing/selftests/kvm/lib/kvm_util.c > index cc22c4ab7d67..b91c8e3a7ee1 100644 > --- a/tools/testing/selftests/kvm/lib/kvm_util.c > +++ b/tools/testing/selftests/kvm/lib/kvm_util.c > @@ -757,7 +757,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm, > region->mmap_start = mmap(NULL, region->mmap_size, > PROT_READ | PROT_WRITE, > MAP_PRIVATE | MAP_ANONYMOUS > - | (src_type == VM_MEM_SRC_ANONYMOUS_HUGETLB > ? MAP_HUGETLB : 0), > + | vm_mem_backing_src_alias(src_type)->flag, > -1, 0); > TEST_ASSERT(region->mmap_start != MAP_FAILED, > "test_malloc failed, mmap_start: %p errno: %i", > diff --git a/tools/testing/selftests/kvm/lib/test_util.c > b/tools/testing/selftests/kvm/lib/test_util.c > index f2d133f76c67..6780aa058f35 100644 > --- a/tools/testing/selftests/kvm/lib/test_util.c > +++ b/tools/testing/selftests/kvm/lib/test_util.c > @@ -11,6 +11,7 @@ > #include > #include > #include > +#include > #include "linux/kernel.h" > > #include "test_util.h" > @@ -112,12 +113,6 @@ void print_skip(const char *fmt, ...) 
> puts(", skipping test"); > } > > -const struct vm_mem_backing_src_alias backing_src_aliases[] = { > - {"anonymous", VM_MEM_SRC_ANONYMOUS,}, > - {"anonymous_thp", VM_MEM_SRC_ANONYMOUS_THP,}, > - {"anonymous_hugetlb", VM_MEM_SRC_ANONYMOUS_HUGETLB,}, > -}; > - > bool thp_configured(void) > { > int ret; > @@ -153,22 +148,61 @@ size_t get_trans_hugepagesz(void) > return size; > } > > +const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i) > +{ > + static const struct vm_mem_backing_src_alias aliases[] = { > + { "anonymous", 0}, > + { "anonymous_thp", 0}, > + { "anonymous_hugetlb_16kb", MAP_HUGETLB | MAP_HUGE_16KB }, > + { "anonymous_hugetlb_64kb", MAP_HUGETLB | MAP_HUGE_64KB }, > + { "anonymous_hugetlb_512kb", MAP_HUGETLB | MAP_HUGE_512KB }, > + { "anonymous_hugetlb_1mb", MAP_HUGETLB | MAP_HUGE_1MB }, > + { "anonymous_hugetlb_2mb", MAP_HUGETLB | MAP_HUGE_2MB }, > + { "anonymous_hugetlb_8mb", MAP_HUGETLB | MAP_HUGE_8MB }, > +
Re: [RFC PATCH v2 6/7] KVM: selftests: Adapt vm_userspace_mem_region_add to new helpers
On Wed, Feb 24, 2021 at 10:03 PM Yanan Wang wrote: > > With VM_MEM_SRC_ANONYMOUS_THP specified in vm_userspace_mem_region_add(), > we have to get the transparent hugepage size for HVA alignment. With the > new helpers, we can use get_backing_src_pagesz() to check whether THP is > configured and then get the exact configured hugepage size. > > As different architectures may have different THP page sizes configured, > this can get the accurate THP page sizes on any platform. > > Signed-off-by: Yanan Wang > --- > tools/testing/selftests/kvm/lib/kvm_util.c | 27 +++--- > 1 file changed, 8 insertions(+), 19 deletions(-) > > diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c > b/tools/testing/selftests/kvm/lib/kvm_util.c > index b91c8e3a7ee1..0105fbfed036 100644 > --- a/tools/testing/selftests/kvm/lib/kvm_util.c > +++ b/tools/testing/selftests/kvm/lib/kvm_util.c > @@ -18,7 +18,6 @@ > #include > #include > > -#define KVM_UTIL_PGS_PER_HUGEPG 512 > #define KVM_UTIL_MIN_PFN 2 > > /* Aligns x up to the next multiple of size. Size must be a power of 2. */ > @@ -686,7 +685,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm, > { > int ret; > struct userspace_mem_region *region; > - size_t huge_page_size = KVM_UTIL_PGS_PER_HUGEPG * vm->page_size; > + size_t backing_src_pagesz = get_backing_src_pagesz(src_type); > size_t alignment; > > TEST_ASSERT(vm_adjust_num_guest_pages(vm->mode, npages) == npages, > @@ -748,7 +747,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm, > #endif > > if (src_type == VM_MEM_SRC_ANONYMOUS_THP) > - alignment = max(huge_page_size, alignment); > + alignment = max(backing_src_pagesz, alignment); > > /* Add enough memory to align up if necessary */ > if (alignment > 1) > @@ -767,22 +766,12 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm, > region->host_mem = align(region->mmap_start, alignment); > > /* As needed perform madvise */ > - if (src_type == VM_MEM_SRC_ANONYMOUS || src_type == > VM_MEM_SRC_ANONYMOUS_THP) { > - struct stat statbuf; > - > - ret = stat("/sys/kernel/mm/transparent_hugepage", &statbuf); > - TEST_ASSERT(ret == 0 || (ret == -1 && errno == ENOENT), > - "stat /sys/kernel/mm/transparent_hugepage"); > - > - TEST_ASSERT(ret == 0 || src_type != VM_MEM_SRC_ANONYMOUS_THP, > - "VM_MEM_SRC_ANONYMOUS_THP requires THP to be > configured in the host kernel"); > - > - if (ret == 0) { > - ret = madvise(region->host_mem, npages * > vm->page_size, > - src_type == VM_MEM_SRC_ANONYMOUS ? > MADV_NOHUGEPAGE : MADV_HUGEPAGE); > - TEST_ASSERT(ret == 0, "madvise failed, addr: %p > length: 0x%lx src_type: %x", > - region->host_mem, npages * vm->page_size, > src_type); > - } > + if (src_type <= VM_MEM_SRC_ANONYMOUS_THP && thp_configured()) { This check relies on an unstated property of the backing src type enums where VM_MEM_SRC_ANONYMOUS and VM_MEM_SRC_ANONYMOUS_THP are declared first. It would probably be more readable for folks if the check was explicit: if ((src_type == VM_MEM_SRC_ANONYMOUS || src_type == VM_MEM_SRC_ANONYMOUS_THP) && thp_configured()) { > + ret = madvise(region->host_mem, npages * vm->page_size, > + src_type == VM_MEM_SRC_ANONYMOUS ? > MADV_NOHUGEPAGE : MADV_HUGEPAGE); > + TEST_ASSERT(ret == 0, "madvise failed, addr: %p length: 0x%lx > src_type: %s", > + region->host_mem, npages * vm->page_size, > + vm_mem_backing_src_alias(src_type)->name); > } > > region->unused_phy_pages = sparsebit_alloc(); > -- > 2.19.1 >
Re: [RFC PATCH v2 0/7] Some improvement and a new test for kvm page table
On Wed, Feb 24, 2021 at 9:59 PM Yanan Wang wrote: > > Hi, > This v2 series can mainly include two parts. > Based on kvm queue branch: > https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=queue > Links of v1: > https://lore.kernel.org/lkml/20210208090841.333724-1-wangyana...@huawei.com/ > > In the first part, all the known hugetlb backing src types specified > with different hugepage sizes are listed, so that we can specify use > of hugetlb source of the exact granularity that we want, instead of > the system default ones. And as all the known hugetlb page sizes are > listed, it's appropriate for all architectures. Besides, a helper that > can get granularity of different backing src types(anonumous/thp/hugetlb) > is added, so that we can use the accurate backing src granularity for > kinds of alignment or guest memory accessing of vcpus. > > In the second part, a new test is added: > This test is added to serve as a performance tester and a bug reproducer > for kvm page table code (GPA->HPA mappings), it gives guidance for the > people trying to make some improvement for kvm. And the following explains > what we can exactly do through this test. > > The function guest_code() can cover the conditions where a single vcpu or > multiple vcpus access guest pages within the same memory region, in three > VM stages(before dirty logging, during dirty logging, after dirty logging). > Besides, the backing src memory type(ANONYMOUS/THP/HUGETLB) of the tested > memory region can be specified by users, which means normal page mappings > or block mappings can be chosen by users to be created in the test. > > If ANONYMOUS memory is specified, kvm will create normal page mappings > for the tested memory region before dirty logging, and update attributes > of the page mappings from RO to RW during dirty logging. If THP/HUGETLB > memory is specified, kvm will create block mappings for the tested memory > region before dirty logging, and split the blcok mappings into normal page > mappings during dirty logging, and coalesce the page mappings back into > block mappings after dirty logging is stopped. > > So in summary, as a performance tester, this test can present the > performance of kvm creating/updating normal page mappings, or the > performance of kvm creating/splitting/recovering block mappings, > through execution time. > > When we need to coalesce the page mappings back to block mappings after > dirty logging is stopped, we have to firstly invalidate *all* the TLB > entries for the page mappings right before installation of the block entry, > because a TLB conflict abort error could occur if we can't invalidate the > TLB entries fully. We have hit this TLB conflict twice on aarch64 software > implementation and fixed it. As this test can imulate process from dirty > logging enabled to dirty logging stopped of a VM with block mappings, > so it can also reproduce this TLB conflict abort due to inadequate TLB > invalidation when coalescing tables. > > Links about the TLB conflict abort: > https://lore.kernel.org/lkml/20201201201034.116760-3-wangyana...@huawei.com/ Besides a few style / readability comments, this series looks good to me. Thanks for generalizing the way these selftests handle different hugeTLB sizes! 
> > Yanan Wang (7): > tools include: sync head files of mmap flag encodings about hugetlb > KVM: selftests: Use flag CLOCK_MONOTONIC_RAW for timing > KVM: selftests: Make a generic helper to get vm guest mode strings > KVM: selftests: Add a helper to get system configured THP page size > KVM: selftests: List all hugetlb src types specified with page sizes > KVM: selftests: Adapt vm_userspace_mem_region_add to new helpers > KVM: selftests: Add a test for kvm page table code > > tools/include/asm-generic/hugetlb_encode.h| 3 + > tools/testing/selftests/kvm/Makefile | 3 + > .../selftests/kvm/demand_paging_test.c| 8 +- > .../selftests/kvm/dirty_log_perf_test.c | 14 +- > .../testing/selftests/kvm/include/kvm_util.h | 4 +- > .../testing/selftests/kvm/include/test_util.h | 21 +- > .../selftests/kvm/kvm_page_table_test.c | 476 ++ > tools/testing/selftests/kvm/lib/kvm_util.c| 58 +-- > tools/testing/selftests/kvm/lib/test_util.c | 92 +++- > tools/testing/selftests/kvm/steal_time.c | 4 +- > 10 files changed, 623 insertions(+), 60 deletions(-) > create mode 100644 tools/testing/selftests/kvm/kvm_page_table_test.c > > -- > 2.19.1 >
Re: [PATCH v2 0/2] KVM: x86/mmu: Zap orphaned kids for nested TDP MMU
On Wed, Aug 12, 2020 at 12:28 PM Sean Christopherson wrote: > > As promised, albeit a few days late. > > Ben, I kept your performance numbers even though it this version has > non-trivial differences relative to what you tested. I assume we'll need > a v3 anyways if this doesn't provide the advertised performance benefits. > > Ben Gardon (1): > KVM: x86/MMU: Recursively zap nested TDP SPs when zapping last/only > parent > > Sean Christopherson (1): > KVM: x86/mmu: Move flush logic from mmu_page_zap_pte() to > FNAME(invlpg) > > arch/x86/kvm/mmu/mmu.c | 38 ++ > arch/x86/kvm/mmu/paging_tmpl.h | 7 +-- > 2 files changed, 30 insertions(+), 15 deletions(-) > Thanks for sending this revised series Sean. This all looks good to me. I think the main performance difference between this series and the original patch I sent is only zapping nested TDP shadow pages, but I expect it to behave more or less the same since the number of direct TDP pages is pretty bounded. > > -- > 2.28.0 >
Re: [PATCH 1/1] kvm: mmu: zap pages when zapping only parent
On Tue, Aug 4, 2020 at 2:14 PM Sean Christopherson wrote: > > On Mon, Jul 27, 2020 at 01:33:24PM -0700, Ben Gardon wrote: > > When the KVM MMU zaps a page, it will recursively zap the unsynced child > > pages, but not the synced ones. This can create problems over time when > > running many nested guests because it leaves unlinked pages which will not > > be freed until the page quota is hit. With the default page quota of 20 > > shadow pages per 1000 guest pages, this looks like a memory leak and can > > degrade MMU performance. > > > > In a recent benchmark, substantial performance degradation was observed: > > An L1 guest was booted with 64G memory. > > 2G nested Windows guests were booted, 10 at a time for 20 > > iterations. (200 total boots) > > Windows was used in this benchmark because they touch all of their > > memory on startup. > > By the end of the benchmark, the nested guests were taking ~10% longer > > to boot. With this patch there is no degradation in boot time. > > Without this patch the benchmark ends with hundreds of thousands of > > stale EPT02 pages cluttering up rmaps and the page hash map. As a > > result, VM shutdown is also much slower: deleting memslot 0 was > > observed to take over a minute. With this patch it takes just a > > few miliseconds. > > > > If TDP is enabled, zap child shadow pages when zapping the only parent > > shadow page. > > Comments on the mechanics below. For the approach itself, I wonder if we > could/should go even further, i.e. be even more aggressive in reaping nested > TDP shadow pages. > > For this to work, KVM is effectively relying on the write flooding detection > in kvm_mmu_pte_write() to kick in, i.e. KVM needs the L1 VMM to overwrite > the TDP tables that L1 was using for L2. In particular, L1 needs to write > the upper level TDP entries in order for L0 to effeciently reclaim memory. > > For HyperV as L1, I believe that will happen sooner than later as HyperV > maintains a pool of zeroed pages, i.e. L1 will quickly zero out the old TDP > entries and trigger the zap. > > For KVM as L1, that may not hold true for all scenarios due to lazy zeroing > of memory. If L1 is creating and destroying VMs (as in the benchmark), then > it will work as expected. But if L1 creates and destroys a large L2 without > reallocating all memory used for L2's TDP tables, the write flooding will > never happen and L0 will keep the stale SPs even though L2 is dead. > > The above scenario may or may not be problematic in practice. I would > assume any reasonably active L1 will quickly do something with the old TDP > memory and trigger write flooding, but at the same time it's plausible that > L1 could leave pages unused for a decent amount of time. > > One thought would be to track nested TDP PGDs and periodically purge PGDs > that haven't been used in some arbitrary amount of time and/or an arbitrary > threshold for the number of nested TDP PGDs is reached. That being said, > either of those is probably overkill without more analysis on the below > approach. We thought about this as well, but in the absence of a workload which doesn't get sufficient reclaim from write flooding, it didn't seem worth implementing. > > > Tested by running the kvm-unit-tests suite on an Intel Haswell machine. > > No regressions versus > > commit c34b26b98cac ("KVM: MIPS: clean up redundant 'kvm_run' parameters"), > > or warnings. 
> > > > Reviewed-by: Peter Shier > > Signed-off-by: Ben Gardon > > --- > > arch/x86/kvm/mmu/mmu.c | 49 +- > > 1 file changed, 44 insertions(+), 5 deletions(-) > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > > index fa506aaaf0194..c550bc3831dcc 100644 > > --- a/arch/x86/kvm/mmu/mmu.c > > +++ b/arch/x86/kvm/mmu/mmu.c > > @@ -2626,13 +2626,52 @@ static bool mmu_page_zap_pte(struct kvm *kvm, > > struct kvm_mmu_page *sp, > > return false; > > } > > > > -static void kvm_mmu_page_unlink_children(struct kvm *kvm, > > - struct kvm_mmu_page *sp) > > +static int kvm_mmu_page_unlink_children(struct kvm *kvm, > > + struct kvm_mmu_page *sp, > > + struct list_head *invalid_list) > > { > > unsigned i; > > + int zapped = 0; > > + > > + for (i = 0; i < PT64_ENT_PER_PAGE; ++i) { > > + u64 *sptep = sp->spt + i; > > + u64 spte = *sptep; > > + struct kvm_mmu_page *child_
Re: [RFC 0/9] KVM:x86/mmu:Introduce parallel memory virtualization to boost performance
On Wed, Aug 5, 2020 at 9:53 AM Yulei Zhang wrote: > > From: Yulei Zhang > > Currently in KVM memory virtulization we relay on mmu_lock to synchronize > the memory mapping update, which make vCPUs work in serialize mode and > slow down the execution, especially after migration to do substantial > memory mapping setup, and performance get worse if increase vCPU numbers > and guest memories. > > The idea we present in this patch set is to mitigate the issue with > pre-constructed memory mapping table. We will fast pin the guest memory > to build up a global memory mapping table according to the guest memslots > changes and apply it to cr3, so that after guest starts up all the vCPUs > would be able to update the memory concurrently, thus the performance > improvement is expected. Is a re-implementation of the various MMU functions in this series necessary to pre-populate the EPT/NPT? I realize the approach you took is probably the fastest way to pre-populate an EPT, but it seems like similar pre-population could be achieved with some changes to the PF handler's prefault scheme or, from user space by adding a dummy vCPU to touch memory before loading the actual guest image. I think this series is taking a similar approach to the direct MMU RFC I sent out a little less than a year ago. (I will send another version of that series in the next month.) I'm not sure this level of complexity is worth it if you're only interested in EPT pre-population. Is pre-population your goal? You mention "parallel memory virtualization," does that refer to parallel page fault handling you intend to implement in a future series? There are a number of features I see you've chosen to leave behind in this series which might work for your use case, but I think they're necessary. These include handling vCPUs with different roles (SMM, VMX non root mode, etc.), MMU notifiers (which I realize matter less for pinned memory), demand paging through UFFD, fast EPT invalidation/teardown and others. > > And after test the initial patch with memory dirty pattern workload, we > have seen positive results even with huge page enabled. For example, > guest with 32 vCPUs and 64G memories, in 2M/1G huge page mode we would get > more than 50% improvement. > > > Yulei Zhang (9): > Introduce new fields in kvm_arch/vcpu_arch struct for direct build EPT > support > Introduce page table population function for direct build EPT feature > Introduce page table remove function for direct build EPT feature > Add release function for direct build ept when guest VM exit > Modify the page fault path to meet the direct build EPT requirement > Apply the direct build EPT according to the memory slots change > Add migration support when using direct build EPT > Introduce kvm module parameter global_tdp to turn on the direct build > EPT mode > Handle certain mmu exposed functions properly while turn on direct > build EPT mode > > arch/mips/kvm/mips.c| 13 + > arch/powerpc/kvm/powerpc.c | 13 + > arch/s390/kvm/kvm-s390.c| 13 + > arch/x86/include/asm/kvm_host.h | 13 +- > arch/x86/kvm/mmu/mmu.c | 537 ++-- > arch/x86/kvm/svm/svm.c | 2 +- > arch/x86/kvm/vmx/vmx.c | 17 +- > arch/x86/kvm/x86.c | 55 ++-- > include/linux/kvm_host.h| 7 +- > virt/kvm/kvm_main.c | 43 ++- > 10 files changed, 648 insertions(+), 65 deletions(-) > > -- > 2.17.1 >
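To illustrate the userspace-driven pre-population idea mentioned in the reply above (purely a sketch, not code from this thread): in the KVM selftests style, a dummy vCPU could run guest code that writes one byte per page so every GPA->HPA mapping is faulted in before the real workload starts.

/* Guest code: touch each page once so KVM populates a mapping for it. */
static void prefault_guest_memory(uint64_t base_gva, uint64_t size,
                                  uint64_t guest_page_size)
{
        uint64_t gva;

        for (gva = base_gva; gva < base_gva + size; gva += guest_page_size)
                *(volatile uint8_t *)gva = 0;
}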
[PATCH 00/24] Allow parallel page faults with TDP MMU
ping accesses, we can see that this parallel page faults series actually reduces performance when populating memory. In profiling, it appeared that most of the time was spent in get_user_pages, so it's possible the extra concurrency hit the main MM subsystem harder, creating contention there. Does this series degrade performance with the TDP MMU disabled?

Baseline, TDP MMU disabled, partitioned accesses:
Populate memory time (s)          110.193
Enabling dirty logging time (s)   4.829
Dirty memory time (s)             3.949
Get dirty log time (s)            0.822
Disabling dirty logging time (s)  2.995

Parallel PFs series, TDP MMU disabled, partitioned accesses:
Populate memory time (s)          110.917
Enabling dirty logging time (s)   5.196
Dirty memory time (s)             4.559
Get dirty log time (s)            0.879
Disabling dirty logging time (s)  3.278

Here we can see that the parallel PFs series appears to have made enabling and disabling dirty logging, and dirtying memory slightly slower. It's possible that this is a result of additional checks around MMU lock acquisition.

Baseline, TDP MMU disabled, overlapping accesses:
Populate memory time (s)          103.115
Enabling dirty logging time (s)   0.222
Dirty memory time (s)             0.189
Get dirty log time (s)            2.341
Disabling dirty logging time (s)  0.126

Parallel PFs series, TDP MMU disabled, overlapping accesses:
Populate memory time (s)          85.392
Enabling dirty logging time (s)   0.224
Dirty memory time (s)             0.201
Get dirty log time (s)            2.363
Disabling dirty logging time (s)  0.131

From the above results we can see that the parallel PF series only had a significant effect on the population time, with overlapping accesses and the TDP MMU disabled. It is not currently known what in this series caused the improvement.

Correctness testing: The following tests were performed with an SMP kernel and DBX kernel on an Intel Skylake machine. The tests were run both with and without the TDP MMU enabled.

-- This series introduces no new failures in kvm-unit-tests
SMP + no TDP MMU   no new failures
SMP + TDP MMU      no new failures
DBX + no TDP MMU   no new failures
DBX + TDP MMU      no new failures

-- All KVM selftests behave as expected
SMP + no TDP MMU   all pass except ./x86_64/vmx_preemption_timer_test
SMP + TDP MMU      all pass except ./x86_64/vmx_preemption_timer_test
(./x86_64/vmx_preemption_timer_test also fails without this patch set, both with the TDP MMU on and off.)
DBX + no TDP MMU   all pass
DBX + TDP MMU      all pass

-- A VM can be booted running a Debian 9 and all memory accessed
SMP + no TDP MMU   works
SMP + TDP MMU      works
DBX + no TDP MMU   works
DBX + TDP MMU      works

Cross-compilation was also checked for PowerPC and ARM64.
This series can be viewed in Gerrit at: https://linux-review.googlesource.com/c/linux/kernel/git/torvalds/linux/+/7172 Ben Gardon (24): locking/rwlocks: Add contention detection for rwlocks sched: Add needbreak for rwlocks sched: Add cond_resched_rwlock kvm: x86/mmu: change TDP MMU yield function returns to match cond_resched kvm: x86/mmu: Fix yielding in TDP MMU kvm: x86/mmu: Skip no-op changes in TDP MMU functions kvm: x86/mmu: Add comment on __tdp_mmu_set_spte kvm: x86/mmu: Add lockdep when setting a TDP MMU SPTE kvm: x86/mmu: Don't redundantly clear TDP MMU pt memory kvm: x86/mmu: Factor out handle disconnected pt kvm: x86/mmu: Put TDP MMU PT walks in RCU read-critical section kvm: x86/kvm: RCU dereference tdp mmu page table links kvm: x86/mmu: Only free tdp_mmu pages after a grace period kvm: mmu: Wrap mmu_lock lock / unlock in a function kvm: mmu: Wrap mmu_lock cond_resched and needbreak kvm: mmu: Wrap mmu_lock assertions kvm: mmu: Move mmu_lock to struct kvm_arch kvm: x86/mmu: Use an rwlock for the x86 TDP MMU kvm: x86/mmu: Protect tdp_mmu_pages with a lock kvm: x86/mmu: Add atomic option for setting SPTEs kvm: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map kvm: x86/mmu: Flush TLBs after zap in TDP MMU PF handler kvm: x86/mmu: Freeze SPTEs in disconnected pages kvm: x86/mmu: Allow parallel page faults for the TDP MMU Documentation/virt/kvm/locking.rst | 2 +- arch/arm64/include/asm/kvm_host.h| 2 + arch/arm64/kvm/arm.c | 2 + arch/arm64/kvm/mmu.c | 40 +- arch/mips/include/asm/kvm_host.h | 2 + arch/mips/kvm/mips.c | 10 +- arch/mips/kvm/mmu.c | 20 +- arch/powerpc/include/asm/kvm_book3s_64.h | 7 +- arch/powerpc/include/asm/kvm_host.h | 2 + arch/powerpc/kvm/book3s_64_mmu_host.c| 4 +- arch/powerpc/kvm/book3s_64_mmu_hv.c | 12 +- arch/powerpc/kvm/book3s_64_mmu_radix.c | 32 +- arch/powerpc/kvm/book3s_64_vio_hv.c | 4 +- arch/powerpc/kvm/book3s_
[PATCH 02/24] sched: Add needbreak for rwlocks
Contention awareness while holding a spin lock is essential for reducing latency when long running kernel operations can hold that lock. Add the same contention detection interface for read/write spin locks. CC: Ingo Molnar CC: Will Deacon Acked-by: Peter Zijlstra Acked-by: Davidlohr Bueso Acked-by: Waiman Long Acked-by: Paolo Bonzini Signed-off-by: Ben Gardon --- include/linux/sched.h | 17 + 1 file changed, 17 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 6e3a5eeec509..5d1378e5a040 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1912,6 +1912,23 @@ static inline int spin_needbreak(spinlock_t *lock) #endif } +/* + * Check if a rwlock is contended. + * Returns non-zero if there is another task waiting on the rwlock. + * Returns zero if the lock is not contended or the system / underlying + * rwlock implementation does not support contention detection. + * Technically does not depend on CONFIG_PREEMPTION, but a general need + * for low latency. + */ +static inline int rwlock_needbreak(rwlock_t *lock) +{ +#ifdef CONFIG_PREEMPTION + return rwlock_is_contended(lock); +#else + return 0; +#endif +} + static __always_inline bool need_resched(void) { return unlikely(tif_need_resched()); -- 2.30.0.284.gd98b1dd5eaa7-goog
[PATCH 05/24] kvm: x86/mmu: Fix yielding in TDP MMU
There are two problems with the way the TDP MMU yields in long running functions. 1.) Given certain conditions, the function may not yield reliably / frequently enough. 2.) In some functions the TDP iter risks not making forward progress if two threads livelock yielding to one another. Case 1 is possible if for example, a paging structure was very large but had few, if any writable entries. wrprot_gfn_range could traverse many entries before finding a writable entry and yielding. Case 2 is possible if two threads were trying to execute wrprot_gfn_range. Each could write protect an entry and then yield. This would reset the tdp_iter's walk over the paging structure and the loop would end up repeating the same entry over and over, preventing either thread from making forward progress. Fix these issues by moving the yield to the beginning of the loop, before other checks and only yielding if the loop has made forward progress since the last yield. Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 83 +++--- 1 file changed, 69 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index b2784514ca2d..1987da0da66e 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -470,9 +470,23 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, gfn_t start, gfn_t end, bool can_yield) { struct tdp_iter iter; + gfn_t last_goal_gfn = start; bool flush_needed = false; tdp_root_for_each_pte(iter, root, start, end) { + /* Ensure forward progress has been made before yielding. */ + if (can_yield && iter.goal_gfn != last_goal_gfn && + tdp_mmu_iter_flush_cond_resched(kvm, &iter)) { + last_goal_gfn = iter.goal_gfn; + flush_needed = false; + /* +* Yielding caused the paging structure walk to be +* reset so skip to the next iteration to continue the +* walk from the root. +*/ + continue; + } + if (!is_shadow_present_pte(iter.old_spte)) continue; @@ -487,12 +501,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, continue; tdp_mmu_set_spte(kvm, &iter, 0); - - if (can_yield) - flush_needed = !tdp_mmu_iter_flush_cond_resched(kvm, - &iter); - else - flush_needed = true; + flush_needed = true; } return flush_needed; } @@ -850,12 +859,25 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, { struct tdp_iter iter; u64 new_spte; + gfn_t last_goal_gfn = start; bool spte_set = false; BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL); for_each_tdp_pte_min_level(iter, root->spt, root->role.level, min_level, start, end) { + /* Ensure forward progress has been made before yielding. */ + if (iter.goal_gfn != last_goal_gfn && + tdp_mmu_iter_cond_resched(kvm, &iter)) { + last_goal_gfn = iter.goal_gfn; + /* +* Yielding caused the paging structure walk to be +* reset so skip to the next iteration to continue the +* walk from the root. 
+*/ + continue; + } + if (!is_shadow_present_pte(iter.old_spte) || !is_last_spte(iter.old_spte, iter.level)) continue; @@ -864,8 +886,6 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte); spte_set = true; - - tdp_mmu_iter_cond_resched(kvm, &iter); } return spte_set; } @@ -906,9 +926,22 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, { struct tdp_iter iter; u64 new_spte; + gfn_t last_goal_gfn = start; bool spte_set = false; tdp_root_for_each_leaf_pte(iter, root, start, end) { + /* Ensure forward progress has been made before yielding. */ + if (iter.goal_gfn != last_goal_gfn && + tdp_mmu_iter_cond_resched(kvm, &iter)) { + last_goal_gfn = iter.goal_gfn; +
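Distilled, every long-running TDP MMU walk now follows the same shape. A minimal sketch using the names from the hunks above, with the per-SPTE work elided:

	gfn_t last_goal_gfn = start;

	tdp_root_for_each_pte(iter, root, start, end) {
		/*
		 * Only yield once the walk has advanced past the point
		 * reached before the previous yield; otherwise two threads
		 * can livelock by repeatedly resetting each other's walks.
		 */
		if (iter.goal_gfn != last_goal_gfn &&
		    tdp_mmu_iter_cond_resched(kvm, &iter)) {
			last_goal_gfn = iter.goal_gfn;
			/* The yield reset the walk; resume from the root. */
			continue;
		}

		/* ... per-SPTE work ... */
	}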
[PATCH 07/24] kvm: x86/mmu: Add comment on __tdp_mmu_set_spte
__tdp_mmu_set_spte is a very important function in the TDP MMU which already accepts several arguments and will take more in future commits. To offset this complexity, add a comment to the function describing each of the arguemnts. No functional change intended. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 16 1 file changed, 16 insertions(+) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 2650fa9fe066..b033da8243fc 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -357,6 +357,22 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, new_spte, level); } +/* + * __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping + * @kvm: kvm instance + * @iter: a tdp_iter instance currently on the SPTE that should be set + * @new_spte: The value the SPTE should be set to + * @record_acc_track: Notify the MM subsystem of changes to the accessed state + * of the page. Should be set unless handling an MMU + * notifier for access tracking. Leaving record_acc_track + * unset in that case prevents page accesses from being + * double counted. + * @record_dirty_log: Record the page as dirty in the dirty bitmap if + * appropriate for the change being made. Should be set + * unless performing certain dirty logging operations. + * Leaving record_dirty_log unset in that case prevents page + * writes from being double counted. + */ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter, u64 new_spte, bool record_acc_track, bool record_dirty_log) -- 2.30.0.284.gd98b1dd5eaa7-goog
[PATCH 09/24] kvm: x86/mmu: Don't redundantly clear TDP MMU pt memory
The KVM MMU caches already guarantee that shadow page table memory will be zeroed, so there is no reason to re-zero the page in the TDP MMU page fault handler. No functional change intended. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 411938e97a00..55df596696c7 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -665,7 +665,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level); list_add(&sp->link, &vcpu->kvm->arch.tdp_mmu_pages); child_pt = sp->spt; - clear_page(child_pt); new_spte = make_nonleaf_spte(child_pt, !shadow_accessed_mask); -- 2.30.0.284.gd98b1dd5eaa7-goog
[PATCH 04/24] kvm: x86/mmu: change TDP MMU yield function returns to match cond_resched
Currently the TDP MMU yield / cond_resched functions either return nothing or return true if the TLBs were not flushed. These are confusing semantics, especially when making control flow decisions in calling functions. To clean things up, change both functions to have the same return value semantics as cond_resched: true if the thread yielded, false if it did not. If the function yielded in the _flush_ version, then the TLBs will have been flushed. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 38 +- 1 file changed, 29 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 2ef8615f9dba..b2784514ca2d 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -413,8 +413,15 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm, _mmu->shadow_root_level, _start, _end) /* - * Flush the TLB if the process should drop kvm->mmu_lock. - * Return whether the caller still needs to flush the tlb. + * Flush the TLB and yield if the MMU lock is contended or this thread needs to + * return control to the scheduler. + * + * If this function yields, it will also reset the tdp_iter's walk over the + * paging structure and the calling function should allow the iterator to + * continue its traversal from the paging structure root. + * + * Return true if this function yielded, the TLBs were flushed, and the + * iterator's traversal was reset. Return false if a yield was not needed. */ static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *iter) { @@ -422,18 +429,30 @@ static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *it kvm_flush_remote_tlbs(kvm); cond_resched_lock(&kvm->mmu_lock); tdp_iter_refresh_walk(iter); - return false; - } else { return true; - } + } else + return false; } -static void tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter) +/* + * Yield if the MMU lock is contended or this thread needs to return control + * to the scheduler. + * + * If this function yields, it will also reset the tdp_iter's walk over the + * paging structure and the calling function should allow the iterator to + * continue its traversal from the paging structure root. + * + * Return true if this function yielded and the iterator's traversal was reset. + * Return false if a yield was not needed. + */ +static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter) { if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { cond_resched_lock(&kvm->mmu_lock); tdp_iter_refresh_walk(iter); - } + return true; + } else + return false; } /* @@ -470,7 +489,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, tdp_mmu_set_spte(kvm, &iter, 0); if (can_yield) - flush_needed = tdp_mmu_iter_flush_cond_resched(kvm, &iter); + flush_needed = !tdp_mmu_iter_flush_cond_resched(kvm, + &iter); else flush_needed = true; } @@ -1072,7 +1092,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm, tdp_mmu_set_spte(kvm, &iter, 0); - spte_set = tdp_mmu_iter_flush_cond_resched(kvm, &iter); + spte_set = !tdp_mmu_iter_flush_cond_resched(kvm, &iter); } if (spte_set) -- 2.30.0.284.gd98b1dd5eaa7-goog
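For clarity, the call site in zap_gfn_range now encodes the new convention like this (a restatement of the hunk above, not additional code):

	/*
	 * The _flush_ variant returns true only if it yielded, and a yield
	 * implies the TLBs were flushed before the lock was dropped, so the
	 * pending flush obligation is cleared; otherwise the caller still
	 * owes a flush for the SPTE it just zapped.
	 */
	if (can_yield)
		flush_needed = !tdp_mmu_iter_flush_cond_resched(kvm, &iter);
	else
		flush_needed = true;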
[PATCH 08/24] kvm: x86/mmu: Add lockdep when setting a TDP MMU SPTE
Add lockdep to __tdp_mmu_set_spte to ensure that SPTEs are only modified under the MMU lock. This lockdep will be updated in future commits to reflect and validate changes to the TDP MMU's synchronization strategy. No functional change intended. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index b033da8243fc..411938e97a00 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -381,6 +381,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter, struct kvm_mmu_page *root = sptep_to_sp(root_pt); int as_id = kvm_mmu_page_as_id(root); + lockdep_assert_held(&kvm->mmu_lock); + WRITE_ONCE(*iter->sptep, new_spte); __handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte, -- 2.30.0.284.gd98b1dd5eaa7-goog
[PATCH 14/24] kvm: mmu: Wrap mmu_lock lock / unlock in a function
Wrap locking and unlocking the mmu_lock in a function. This will facilitate future logging and stat collection for the lock and more immediately support a refactoring to move the lock into the struct kvm_arch(s) so that x86 can change the spinlock to a rwlock without affecting the performance of other archs. No functional change intended. Signed-off-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/arm64/kvm/mmu.c | 36 ++--- arch/mips/kvm/mips.c | 8 +-- arch/mips/kvm/mmu.c| 14 ++--- arch/powerpc/kvm/book3s_64_mmu_host.c | 4 +- arch/powerpc/kvm/book3s_64_mmu_hv.c| 12 ++--- arch/powerpc/kvm/book3s_64_mmu_radix.c | 22 arch/powerpc/kvm/book3s_hv.c | 8 +-- arch/powerpc/kvm/book3s_hv_nested.c| 52 +- arch/powerpc/kvm/book3s_mmu_hpte.c | 10 ++-- arch/powerpc/kvm/e500_mmu_host.c | 4 +- arch/x86/kvm/mmu/mmu.c | 74 +- arch/x86/kvm/mmu/page_track.c | 8 +-- arch/x86/kvm/mmu/paging_tmpl.h | 8 +-- arch/x86/kvm/mmu/tdp_mmu.c | 6 +-- arch/x86/kvm/x86.c | 4 +- drivers/gpu/drm/i915/gvt/kvmgt.c | 12 ++--- include/linux/kvm_host.h | 3 ++ virt/kvm/dirty_ring.c | 4 +- virt/kvm/kvm_main.c| 42 +-- 19 files changed, 172 insertions(+), 159 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 7d2257cc5438..402b1642c944 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -164,13 +164,13 @@ static void stage2_flush_vm(struct kvm *kvm) int idx; idx = srcu_read_lock(&kvm->srcu); - spin_lock(&kvm->mmu_lock); + kvm_mmu_lock(kvm); slots = kvm_memslots(kvm); kvm_for_each_memslot(memslot, slots) stage2_flush_memslot(kvm, memslot); - spin_unlock(&kvm->mmu_lock); + kvm_mmu_unlock(kvm); srcu_read_unlock(&kvm->srcu, idx); } @@ -456,13 +456,13 @@ void stage2_unmap_vm(struct kvm *kvm) idx = srcu_read_lock(&kvm->srcu); mmap_read_lock(current->mm); - spin_lock(&kvm->mmu_lock); + kvm_mmu_lock(kvm); slots = kvm_memslots(kvm); kvm_for_each_memslot(memslot, slots) stage2_unmap_memslot(kvm, memslot); - spin_unlock(&kvm->mmu_lock); + kvm_mmu_unlock(kvm); mmap_read_unlock(current->mm); srcu_read_unlock(&kvm->srcu, idx); } @@ -472,14 +472,14 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu) struct kvm *kvm = mmu->kvm; struct kvm_pgtable *pgt = NULL; - spin_lock(&kvm->mmu_lock); + kvm_mmu_lock(kvm); pgt = mmu->pgt; if (pgt) { mmu->pgd_phys = 0; mmu->pgt = NULL; free_percpu(mmu->last_vcpu_ran); } - spin_unlock(&kvm->mmu_lock); + kvm_mmu_unlock(kvm); if (pgt) { kvm_pgtable_stage2_destroy(pgt); @@ -516,10 +516,10 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa, if (ret) break; - spin_lock(&kvm->mmu_lock); + kvm_mmu_lock(kvm); ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot, &cache); - spin_unlock(&kvm->mmu_lock); + kvm_mmu_unlock(kvm); if (ret) break; @@ -567,9 +567,9 @@ void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot) start = memslot->base_gfn << PAGE_SHIFT; end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT; - spin_lock(&kvm->mmu_lock); + kvm_mmu_lock(kvm); stage2_wp_range(&kvm->arch.mmu, start, end); - spin_unlock(&kvm->mmu_lock); + kvm_mmu_unlock(kvm); kvm_flush_remote_tlbs(kvm); } @@ -867,7 +867,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, if (exec_fault && device) return -ENOEXEC; - spin_lock(&kvm->mmu_lock); + kvm_mmu_lock(kvm); pgt = vcpu->arch.hw_mmu->pgt; if (mmu_notifier_retry(kvm, mmu_seq)) goto out_unlock; @@ -912,7 +912,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, } out_unlock: - spin_unlock(&kvm->mmu_lock); + kvm_mmu_unlock(kvm); kvm_set_pfn_accessed(pfn); kvm_release_pfn_clean(pfn); 
return ret; @@ -927,10 +927,10 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa) trace_kvm_access_fault(fault_ipa); - spin_lock(&vcpu->kvm->mmu_lock); + kvm_mmu_lock(vcpu->kv
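The generic definitions land in include/linux/kvm_host.h and virt/kvm/kvm_main.c; that part of the diff is cut off above, but at this point in the series the wrappers presumably reduce to nothing more than:

	void kvm_mmu_lock(struct kvm *kvm)
	{
		spin_lock(&kvm->mmu_lock);
	}

	void kvm_mmu_unlock(struct kvm *kvm)
	{
		spin_unlock(&kvm->mmu_lock);
	}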
[PATCH 17/24] kvm: mmu: Move mmu_lock to struct kvm_arch
Move the mmu_lock to struct kvm_arch so that it can be replaced with a rwlock on x86 without affecting the performance of other archs. No functional change intended. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- Documentation/virt/kvm/locking.rst | 2 +- arch/arm64/include/asm/kvm_host.h | 2 ++ arch/arm64/kvm/arm.c | 2 ++ arch/mips/include/asm/kvm_host.h | 2 ++ arch/mips/kvm/mips.c | 2 ++ arch/mips/kvm/mmu.c| 6 +++--- arch/powerpc/include/asm/kvm_host.h| 2 ++ arch/powerpc/kvm/book3s_64_mmu_radix.c | 10 +- arch/powerpc/kvm/book3s_64_vio_hv.c| 4 ++-- arch/powerpc/kvm/book3s_hv_nested.c| 4 ++-- arch/powerpc/kvm/book3s_hv_rm_mmu.c| 14 +++--- arch/powerpc/kvm/e500_mmu_host.c | 2 +- arch/powerpc/kvm/powerpc.c | 2 ++ arch/s390/include/asm/kvm_host.h | 2 ++ arch/s390/kvm/kvm-s390.c | 2 ++ arch/x86/include/asm/kvm_host.h| 2 ++ arch/x86/kvm/mmu/mmu.c | 2 +- arch/x86/kvm/x86.c | 2 ++ include/linux/kvm_host.h | 1 - virt/kvm/kvm_main.c| 11 +-- 20 files changed, 47 insertions(+), 29 deletions(-) diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst index b21a34c34a21..06c006c73c4b 100644 --- a/Documentation/virt/kvm/locking.rst +++ b/Documentation/virt/kvm/locking.rst @@ -212,7 +212,7 @@ which time it will be set using the Dirty tracking mechanism described above. - tsc offset in vmcb :Comment: 'raw' because updating the tsc offsets must not be preempted. -:Name: kvm->mmu_lock +:Name: kvm_arch::mmu_lock :Type: spinlock_t :Arch: any :Protects: -shadow page/shadow tlb entry diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 8fcfab0c2567..6fd4d64eb202 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -102,6 +102,8 @@ struct kvm_arch_memory_slot { }; struct kvm_arch { + spinlock_t mmu_lock; + struct kvm_s2_mmu mmu; /* VTCR_EL2 value for this VM */ diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 04c44853b103..90f4fcd84bb5 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -130,6 +130,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) { int ret; + spin_lock_init(&kvm->arch.mmu_lock); + ret = kvm_arm_setup_stage2(kvm, type); if (ret) return ret; diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h index 24f3d0f9996b..eb3caeffaf91 100644 --- a/arch/mips/include/asm/kvm_host.h +++ b/arch/mips/include/asm/kvm_host.h @@ -216,6 +216,8 @@ struct loongson_kvm_ipi { #endif struct kvm_arch { + spinlock_t mmu_lock; + /* Guest physical mm */ struct mm_struct gpa_mm; /* Mask of CPUs needing GPA ASID flush */ diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c index 4e393d93c1aa..7b8d65d8c863 100644 --- a/arch/mips/kvm/mips.c +++ b/arch/mips/kvm/mips.c @@ -150,6 +150,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) return -EINVAL; }; + spin_lock_init(&kvm->arch.mmu_lock); + /* Allocate page table to map GPA -> RPA */ kvm->arch.gpa_mm.pgd = kvm_pgd_alloc(); if (!kvm->arch.gpa_mm.pgd) diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c index 449663152b3c..68fcda1e48f9 100644 --- a/arch/mips/kvm/mmu.c +++ b/arch/mips/kvm/mmu.c @@ -263,7 +263,7 @@ static bool kvm_mips_flush_gpa_pgd(pgd_t *pgd, unsigned long start_gpa, * * Flushes a range of GPA mappings from the GPA page tables. * - * The caller must hold the @kvm->mmu_lock spinlock. + * The caller must hold the @kvm->arch.mmu_lock spinlock. * * Returns:Whether its safe to remove the top level page directory because * all lower levels have been removed. 
@@ -388,7 +388,7 @@ BUILD_PTE_RANGE_OP(mkclean, pte_mkclean) * Make a range of GPA mappings clean so that guest writes will fault and * trigger dirty page logging. * - * The caller must hold the @kvm->mmu_lock spinlock. + * The caller must hold the @kvm->arch.mmu_lock spinlock. * * Returns:Whether any GPA mappings were modified, which would require * derived mappings (GVA page tables & TLB enties) to be @@ -410,7 +410,7 @@ int kvm_mips_mkclean_gpa_pt(struct kvm *kvm, gfn_t start_gfn, gfn_t end_gfn) * slot to be write protected * * Walks bits set in mask write protects the associated pte's. Caller must - * acquire @kvm->mmu_lock. + * acquire @kvm->arch.mmu_lock. */ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm, struct kvm_memory_slot *slot, diff --git a/arch/powerpc/incl
[PATCH 18/24] kvm: x86/mmu: Use an rwlock for the x86 TDP MMU
Add a read / write lock to be used in place of the MMU spinlock when the TDP MMU is enabled. The rwlock will enable the TDP MMU to handle page faults in parallel in a future commit. In cases where the TDP MMU is not in use, no operation would be acquiring the lock in read mode, so a regular spin lock is still used as locking and unlocking a spin lock is slightly faster. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/x86/include/asm/kvm_host.h | 8 ++- arch/x86/kvm/mmu/mmu.c | 89 + arch/x86/kvm/mmu/mmu_internal.h | 9 arch/x86/kvm/mmu/tdp_mmu.c | 10 ++-- arch/x86/kvm/x86.c | 2 - virt/kvm/kvm_main.c | 10 ++-- 6 files changed, 115 insertions(+), 13 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 3087de84fad3..92d5340842c8 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -902,7 +902,13 @@ enum kvm_irqchip_mode { #define APICV_INHIBIT_REASON_X2APIC5 struct kvm_arch { - spinlock_t mmu_lock; + union { + /* Used if the TDP MMU is enabled. */ + rwlock_t mmu_rwlock; + + /* Used if the TDP MMU is not enabled. */ + spinlock_t mmu_lock; + }; unsigned long n_used_mmu_pages; unsigned long n_requested_mmu_pages; diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index ba296ad051c3..280d7cd6f94b 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5471,6 +5471,11 @@ void kvm_mmu_init_vm(struct kvm *kvm) kvm_mmu_init_tdp_mmu(kvm); + if (kvm->arch.tdp_mmu_enabled) + rwlock_init(&kvm->arch.mmu_rwlock); + else + spin_lock_init(&kvm->arch.mmu_lock); + node->track_write = kvm_mmu_pte_write; node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot; kvm_page_track_register_notifier(kvm, node); @@ -6074,3 +6079,87 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm) if (kvm->arch.nx_lpage_recovery_thread) kthread_stop(kvm->arch.nx_lpage_recovery_thread); } + +void kvm_mmu_lock_shared(struct kvm *kvm) +{ + WARN_ON(!kvm->arch.tdp_mmu_enabled); + read_lock(&kvm->arch.mmu_rwlock); +} + +void kvm_mmu_unlock_shared(struct kvm *kvm) +{ + WARN_ON(!kvm->arch.tdp_mmu_enabled); + read_unlock(&kvm->arch.mmu_rwlock); +} + +void kvm_mmu_lock_exclusive(struct kvm *kvm) +{ + WARN_ON(!kvm->arch.tdp_mmu_enabled); + write_lock(&kvm->arch.mmu_rwlock); +} + +void kvm_mmu_unlock_exclusive(struct kvm *kvm) +{ + WARN_ON(!kvm->arch.tdp_mmu_enabled); + write_unlock(&kvm->arch.mmu_rwlock); +} + +void kvm_mmu_lock(struct kvm *kvm) +{ + if (kvm->arch.tdp_mmu_enabled) + kvm_mmu_lock_exclusive(kvm); + else + spin_lock(&kvm->arch.mmu_lock); +} +EXPORT_SYMBOL_GPL(kvm_mmu_lock); + +void kvm_mmu_unlock(struct kvm *kvm) +{ + if (kvm->arch.tdp_mmu_enabled) + kvm_mmu_unlock_exclusive(kvm); + else + spin_unlock(&kvm->arch.mmu_lock); +} +EXPORT_SYMBOL_GPL(kvm_mmu_unlock); + +int kvm_mmu_lock_needbreak(struct kvm *kvm) +{ + if (kvm->arch.tdp_mmu_enabled) + return rwlock_needbreak(&kvm->arch.mmu_rwlock); + else + return spin_needbreak(&kvm->arch.mmu_lock); +} + +int kvm_mmu_lock_cond_resched_exclusive(struct kvm *kvm) +{ + WARN_ON(!kvm->arch.tdp_mmu_enabled); + return cond_resched_rwlock_write(&kvm->arch.mmu_rwlock); +} + +int kvm_mmu_lock_cond_resched(struct kvm *kvm) +{ + if (kvm->arch.tdp_mmu_enabled) + return kvm_mmu_lock_cond_resched_exclusive(kvm); + else + return cond_resched_lock(&kvm->arch.mmu_lock); +} + +void kvm_mmu_lock_assert_held_shared(struct kvm *kvm) +{ + WARN_ON(!kvm->arch.tdp_mmu_enabled); + lockdep_assert_held_read(&kvm->arch.mmu_rwlock); +} + +void kvm_mmu_lock_assert_held_exclusive(struct kvm *kvm) +{ + 
WARN_ON(!kvm->arch.tdp_mmu_enabled); + lockdep_assert_held_write(&kvm->arch.mmu_rwlock); +} + +void kvm_mmu_lock_assert_held(struct kvm *kvm) +{ + if (kvm->arch.tdp_mmu_enabled) + lockdep_assert_held(&kvm->arch.mmu_rwlock); + else + lockdep_assert_held(&kvm->arch.mmu_lock); +} diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index cc8268cf28d2..53a789b8a820 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -149,4 +149,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc); void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp); void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp); +voi
[PATCH 10/24] kvm: x86/mmu: Factor out handle disconnected pt
Factor out the code to handle a disconnected subtree of the TDP paging structure from the code to handle the change to an individual SPTE. Future commits will build on this to allow asynchronous page freeing. No functional change intended. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 75 +++--- 1 file changed, 46 insertions(+), 29 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 55df596696c7..e8f35cd46b4c 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -234,6 +234,49 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn, } } +/** + * handle_disconnected_tdp_mmu_page - handle a pt removed from the TDP structure + * + * @kvm: kvm instance + * @pt: the page removed from the paging structure + * + * Given a page table that has been removed from the TDP paging structure, + * iterates through the page table to clear SPTEs and free child page tables. + */ +static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt) +{ + struct kvm_mmu_page *sp; + gfn_t gfn; + int level; + u64 old_child_spte; + int i; + + sp = sptep_to_sp(pt); + gfn = sp->gfn; + level = sp->role.level; + + trace_kvm_mmu_prepare_zap_page(sp); + + list_del(&sp->link); + + if (sp->lpage_disallowed) + unaccount_huge_nx_page(kvm, sp); + + for (i = 0; i < PT64_ENT_PER_PAGE; i++) { + old_child_spte = READ_ONCE(*(pt + i)); + WRITE_ONCE(*(pt + i), 0); + handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), + gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)), + old_child_spte, 0, level - 1); + } + + kvm_flush_remote_tlbs_with_address(kvm, gfn, + KVM_PAGES_PER_HPAGE(level)); + + free_page((unsigned long)pt); + kmem_cache_free(mmu_page_header_cache, sp); +} + /** * handle_changed_spte - handle bookkeeping associated with an SPTE change * @kvm: kvm instance @@ -254,10 +297,6 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, bool was_leaf = was_present && is_last_spte(old_spte, level); bool is_leaf = is_present && is_last_spte(new_spte, level); bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte); - u64 *pt; - struct kvm_mmu_page *sp; - u64 old_child_spte; - int i; WARN_ON(level > PT64_ROOT_MAX_LEVEL); WARN_ON(level < PG_LEVEL_4K); @@ -321,31 +360,9 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, * Recursively handle child PTs if the change removed a subtree from * the paging structure. */ - if (was_present && !was_leaf && (pfn_changed || !is_present)) { - pt = spte_to_child_pt(old_spte, level); - sp = sptep_to_sp(pt); - - trace_kvm_mmu_prepare_zap_page(sp); - - list_del(&sp->link); - - if (sp->lpage_disallowed) - unaccount_huge_nx_page(kvm, sp); - - for (i = 0; i < PT64_ENT_PER_PAGE; i++) { - old_child_spte = READ_ONCE(*(pt + i)); - WRITE_ONCE(*(pt + i), 0); - handle_changed_spte(kvm, as_id, - gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)), - old_child_spte, 0, level - 1); - } - - kvm_flush_remote_tlbs_with_address(kvm, gfn, - KVM_PAGES_PER_HPAGE(level)); - - free_page((unsigned long)pt); - kmem_cache_free(mmu_page_header_cache, sp); - } + if (was_present && !was_leaf && (pfn_changed || !is_present)) + handle_disconnected_tdp_mmu_page(kvm, + spte_to_child_pt(old_spte, level)); } static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, -- 2.30.0.284.gd98b1dd5eaa7-goog
[PATCH 22/24] kvm: x86/mmu: Flush TLBs after zap in TDP MMU PF handler
When the TDP MMU is allowed to handle page faults in parallel there is the possiblity of a race where an SPTE is cleared and then imediately replaced with a present SPTE pointing to a different PFN, before the TLBs can be flushed. This race would violate architectural specs. Ensure that the TLBs are flushed properly before other threads are allowed to install any present value for the SPTE. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/spte.h| 16 +- arch/x86/kvm/mmu/tdp_mmu.c | 62 -- 2 files changed, 68 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h index 2b3a30bd38b0..ecd9bfbccef4 100644 --- a/arch/x86/kvm/mmu/spte.h +++ b/arch/x86/kvm/mmu/spte.h @@ -130,6 +130,20 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask; PT64_EPT_EXECUTABLE_MASK) #define SHADOW_ACC_TRACK_SAVED_BITS_SHIFT PT64_SECOND_AVAIL_BITS_SHIFT +/* + * If a thread running without exclusive control of the MMU lock must perform a + * multi-part operation on an SPTE, it can set the SPTE to FROZEN_SPTE as a + * non-present intermediate value. This will guarantee that other threads will + * not modify the spte. + * + * This constant works because it is considered non-present on both AMD and + * Intel CPUs and does not create a L1TF vulnerability because the pfn section + * is zeroed out. + * + * Only used by the TDP MMU. + */ +#define FROZEN_SPTE (1ull << 59) + /* * In some cases, we need to preserve the GFN of a non-present or reserved * SPTE when we usurp the upper five bits of the physical address space to @@ -187,7 +201,7 @@ static inline bool is_access_track_spte(u64 spte) static inline int is_shadow_present_pte(u64 pte) { - return (pte != 0) && !is_mmio_spte(pte); + return (pte != 0) && !is_mmio_spte(pte) && (pte != FROZEN_SPTE); } static inline int is_large_pte(u64 pte) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 7b12a87a4124..5c9d053000ad 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -429,15 +429,19 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, */ if (!was_present && !is_present) { /* -* If this change does not involve a MMIO SPTE, it is -* unexpected. Log the change, though it should not impact the -* guest since both the former and current SPTEs are nonpresent. +* If this change does not involve a MMIO SPTE or FROZEN_SPTE, +* it is unexpected. Log the change, though it should not +* impact the guest since both the former and current SPTEs +* are nonpresent. */ - if (WARN_ON(!is_mmio_spte(old_spte) && !is_mmio_spte(new_spte))) + if (WARN_ON(!is_mmio_spte(old_spte) && + !is_mmio_spte(new_spte) && + new_spte != FROZEN_SPTE)) pr_err("Unexpected SPTE change! Nonpresent SPTEs\n" "should not be replaced with another,\n" "different nonpresent SPTE, unless one or both\n" - "are MMIO SPTEs.\n" + "are MMIO SPTEs, or the new SPTE is\n" + "FROZEN_SPTE.\n" "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d", as_id, gfn, old_spte, new_spte, level); return; @@ -488,6 +492,13 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm, kvm_mmu_lock_assert_held_shared(kvm); + /* +* Do not change FROZEN_SPTEs. Only the thread that froze the SPTE +* may modify it. 
+*/ + if (iter->old_spte == FROZEN_SPTE) + return false; + if (cmpxchg64(iter->sptep, iter->old_spte, new_spte) != iter->old_spte) return false; @@ -497,6 +508,34 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm, return true; } +static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm, + struct tdp_iter *iter) +{ + /* +* Freeze the SPTE by setting it to a special, +* non-present value. This will stop other threads from +* immediately installing a present entry in its place +* before the TLBs are flushed. +*/ + if (!tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE)) + return false; + + kvm_flush_remote_tlbs_with_address(kvm, iter->gfn, + KVM_PAGES_PER_HPAGE(iter->level)); + + /* +*
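Conceptually, tdp_mmu_zap_spte_atomic() has to finish by clearing the frozen SPTE only after the flush, roughly along these lines (a sketch of the remaining steps, not the verbatim remainder of the function):

	/*
	 * No other thread will overwrite a FROZEN_SPTE, so once the TLB
	 * flush above has completed the SPTE can be cleared with a plain
	 * write; only then may a present entry be installed for the GFN.
	 */
	WRITE_ONCE(*iter->sptep, 0);

	return true;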
[PATCH 01/24] locking/rwlocks: Add contention detection for rwlocks
rwlocks do not currently have any facility to detect contention like spinlocks do. In order to allow users of rwlocks to better manage latency, add contention detection for queued rwlocks. CC: Ingo Molnar CC: Will Deacon Acked-by: Peter Zijlstra Acked-by: Davidlohr Bueso Acked-by: Waiman Long Acked-by: Paolo Bonzini Signed-off-by: Ben Gardon --- include/asm-generic/qrwlock.h | 24 ++-- include/linux/rwlock.h| 7 +++ 2 files changed, 25 insertions(+), 6 deletions(-) diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h index 84ce841ce735..0020d3b820a7 100644 --- a/include/asm-generic/qrwlock.h +++ b/include/asm-generic/qrwlock.h @@ -14,6 +14,7 @@ #include #include +#include /* * Writer states & reader shift and bias. @@ -116,15 +117,26 @@ static inline void queued_write_unlock(struct qrwlock *lock) smp_store_release(&lock->wlocked, 0); } +/** + * queued_rwlock_is_contended - check if the lock is contended + * @lock : Pointer to queue rwlock structure + * Return: 1 if lock contended, 0 otherwise + */ +static inline int queued_rwlock_is_contended(struct qrwlock *lock) +{ + return arch_spin_is_locked(&lock->wait_lock); +} + /* * Remapping rwlock architecture specific functions to the corresponding * queue rwlock functions. */ -#define arch_read_lock(l) queued_read_lock(l) -#define arch_write_lock(l) queued_write_lock(l) -#define arch_read_trylock(l) queued_read_trylock(l) -#define arch_write_trylock(l) queued_write_trylock(l) -#define arch_read_unlock(l)queued_read_unlock(l) -#define arch_write_unlock(l) queued_write_unlock(l) +#define arch_read_lock(l) queued_read_lock(l) +#define arch_write_lock(l) queued_write_lock(l) +#define arch_read_trylock(l) queued_read_trylock(l) +#define arch_write_trylock(l) queued_write_trylock(l) +#define arch_read_unlock(l)queued_read_unlock(l) +#define arch_write_unlock(l) queued_write_unlock(l) +#define arch_rwlock_is_contended(l)queued_rwlock_is_contended(l) #endif /* __ASM_GENERIC_QRWLOCK_H */ diff --git a/include/linux/rwlock.h b/include/linux/rwlock.h index 3dcd617e65ae..7ce9a51ae5c0 100644 --- a/include/linux/rwlock.h +++ b/include/linux/rwlock.h @@ -128,4 +128,11 @@ do { \ 1 : ({ local_irq_restore(flags); 0; }); \ }) +#ifdef arch_rwlock_is_contended +#define rwlock_is_contended(lock) \ +arch_rwlock_is_contended(&(lock)->raw_lock) +#else +#define rwlock_is_contended(lock) ((void)(lock), 0) +#endif /* arch_rwlock_is_contended */ + #endif /* __LINUX_RWLOCK_H */ -- 2.30.0.284.gd98b1dd5eaa7-goog
[PATCH 19/24] kvm: x86/mmu: Protect tdp_mmu_pages with a lock
Add a lock to protect the data structures that track the page table memory used by the TDP MMU. In order to handle multiple TDP MMU operations in parallel, pages of PT memory must be added and removed without the exclusive protection of the MMU lock. A new lock to protect the list(s) of in-use pages will cause some serialization, but only on non-leaf page table entries, so the lock is not expected to be very contended. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/x86/include/asm/kvm_host.h | 15 arch/x86/kvm/mmu/tdp_mmu.c | 67 + 2 files changed, 74 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 92d5340842c8..f8dccb27c722 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1034,6 +1034,21 @@ struct kvm_arch { * tdp_mmu_page set and a root_count of 0. */ struct list_head tdp_mmu_pages; + + /* +* Protects accesses to the following fields when the MMU lock is +* not held exclusively: +* - tdp_mmu_pages (above) +* - the link field of struct kvm_mmu_pages used by the TDP MMU +*when they are part of tdp_mmu_pages (but not when they are part +*of the tdp_mmu_free_list or tdp_mmu_disconnected_list) +* - lpage_disallowed_mmu_pages +* - the lpage_disallowed_link field of struct kvm_mmu_pages used +*by the TDP MMU +* May be acquired under the MMU lock in read mode or non-overlapping +* with the MMU lock. +*/ + spinlock_t tdp_mmu_pages_lock; }; struct kvm_vm_stat { diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 8b61bdb391a0..264594947c3b 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -33,6 +33,7 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm) kvm->arch.tdp_mmu_enabled = true; INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots); + spin_lock_init(&kvm->arch.tdp_mmu_pages_lock); INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages); } @@ -262,6 +263,58 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn, } } +/** + * tdp_mmu_link_page - Add a new page to the list of pages used by the TDP MMU + * + * @kvm: kvm instance + * @sp: the new page + * @atomic: This operation is not running under the exclusive use of the MMU + * lock and the operation must be atomic with respect to ther threads + * that might be adding or removing pages. + * @account_nx: This page replaces a NX large page and should be marked for + * eventual reclaim. + */ +static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp, + bool atomic, bool account_nx) +{ + if (atomic) + spin_lock(&kvm->arch.tdp_mmu_pages_lock); + else + kvm_mmu_lock_assert_held_exclusive(kvm); + + list_add(&sp->link, &kvm->arch.tdp_mmu_pages); + if (account_nx) + account_huge_nx_page(kvm, sp); + + if (atomic) + spin_unlock(&kvm->arch.tdp_mmu_pages_lock); +} + +/** + * tdp_mmu_unlink_page - Remove page from the list of pages used by the TDP MMU + * + * @kvm: kvm instance + * @sp: the page to be removed + * @atomic: This operation is not running under the exclusive use of the MMU + * lock and the operation must be atomic with respect to ther threads + * that might be adding or removing pages. 
+ */ +static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp, + bool atomic) +{ + if (atomic) + spin_lock(&kvm->arch.tdp_mmu_pages_lock); + else + kvm_mmu_lock_assert_held_exclusive(kvm); + + list_del(&sp->link); + if (sp->lpage_disallowed) + unaccount_huge_nx_page(kvm, sp); + + if (atomic) + spin_unlock(&kvm->arch.tdp_mmu_pages_lock); +} + /** * handle_disconnected_tdp_mmu_page - handle a pt removed from the TDP structure * @@ -285,10 +338,7 @@ static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt) trace_kvm_mmu_prepare_zap_page(sp); - list_del(&sp->link); - - if (sp->lpage_disallowed) - unaccount_huge_nx_page(kvm, sp); + tdp_mmu_unlink_page(kvm, sp, atomic); for (i = 0; i < PT64_ENT_PER_PAGE; i++) { old_child_spte = READ_ONCE(*(pt + i)); @@ -719,15 +769,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, if (!is_shadow_present_pte(iter.old_spte)) { sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level); - list_add(&sp->link, &vcpu->kvm->arch.tdp_mmu_pages);
[PATCH 13/24] kvm: x86/mmu: Only free tdp_mmu pages after a grace period
By waiting until an RCU grace period has elapsed to free TDP MMU PT memory, the system can ensure that no kernel threads access the memory after it has been freed. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/mmu_internal.h | 3 +++ arch/x86/kvm/mmu/tdp_mmu.c | 31 +-- 2 files changed, 32 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index bfc6389edc28..7f599cc64178 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -57,6 +57,9 @@ struct kvm_mmu_page { atomic_t write_flooding_count; bool tdp_mmu_page; + + /* Used for freeing the page asyncronously if it is a TDP MMU page. */ + struct rcu_head rcu_head; }; extern struct kmem_cache *mmu_page_header_cache; diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 662907d374b3..dc5b4bf34ca2 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -42,6 +42,12 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm) return; WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots)); + + /* +* Ensure that all the outstanding RCU callbacks to free shadow pages +* can run before the VM is torn down. +*/ + rcu_barrier(); } static void tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root) @@ -196,6 +202,28 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu) return __pa(root->spt); } +static void tdp_mmu_free_sp(struct kvm_mmu_page *sp) +{ + free_page((unsigned long)sp->spt); + kmem_cache_free(mmu_page_header_cache, sp); +} + +/* + * This is called through call_rcu in order to free TDP page table memory + * safely with respect to other kernel threads that may be operating on + * the memory. + * By only accessing TDP MMU page table memory in an RCU read critical + * section, and freeing it after a grace period, lockless access to that + * memory won't use it after it is freed. + */ +static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head) +{ + struct kvm_mmu_page *sp = container_of(head, struct kvm_mmu_page, + rcu_head); + + tdp_mmu_free_sp(sp); +} + static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, u64 old_spte, u64 new_spte, int level); @@ -273,8 +301,7 @@ static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt) kvm_flush_remote_tlbs_with_address(kvm, gfn, KVM_PAGES_PER_HPAGE(level)); - free_page((unsigned long)pt); - kmem_cache_free(mmu_page_header_cache, sp); + call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback); } /** -- 2.30.0.284.gd98b1dd5eaa7-goog
[PATCH 15/24] kvm: mmu: Wrap mmu_lock cond_resched and needbreak
Wrap the MMU lock cond_reseched and needbreak operations in a function. This will support a refactoring to move the lock into the struct kvm_arch(s) so that x86 can change the spinlock to a rwlock without affecting the performance of other archs. No functional change intended. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/arm64/kvm/mmu.c | 2 +- arch/x86/kvm/mmu/mmu.c | 16 arch/x86/kvm/mmu/tdp_mmu.c | 8 include/linux/kvm_host.h | 2 ++ virt/kvm/kvm_main.c| 10 ++ 5 files changed, 25 insertions(+), 13 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 402b1642c944..57ef1ec23b56 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -58,7 +58,7 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr, break; if (resched && next != end) - cond_resched_lock(&kvm->mmu_lock); + kvm_mmu_lock_cond_resched(kvm); } while (addr = next, addr != end); return ret; diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 5a4577830606..659ed0a2875f 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -2016,9 +2016,9 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu, flush |= kvm_sync_page(vcpu, sp, &invalid_list); mmu_pages_clear_parents(&parents); } - if (need_resched() || spin_needbreak(&vcpu->kvm->mmu_lock)) { + if (need_resched() || kvm_mmu_lock_needbreak(vcpu->kvm)) { kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush); - cond_resched_lock(&vcpu->kvm->mmu_lock); + kvm_mmu_lock_cond_resched(vcpu->kvm); flush = false; } } @@ -5233,14 +5233,14 @@ slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot, if (iterator.rmap) flush |= fn(kvm, iterator.rmap); - if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { + if (need_resched() || kvm_mmu_lock_needbreak(kvm)) { if (flush && lock_flush_tlb) { kvm_flush_remote_tlbs_with_address(kvm, start_gfn, iterator.gfn - start_gfn + 1); flush = false; } - cond_resched_lock(&kvm->mmu_lock); + kvm_mmu_lock_cond_resched(kvm); } } @@ -5390,7 +5390,7 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm) * be in active use by the guest. */ if (batch >= BATCH_ZAP_PAGES && - cond_resched_lock(&kvm->mmu_lock)) { + kvm_mmu_lock_cond_resched(kvm)) { batch = 0; goto restart; } @@ -5688,7 +5688,7 @@ void kvm_mmu_zap_all(struct kvm *kvm) continue; if (__kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign)) goto restart; - if (cond_resched_lock(&kvm->mmu_lock)) + if (kvm_mmu_lock_cond_resched(kvm)) goto restart; } @@ -6013,9 +6013,9 @@ static void kvm_recover_nx_lpages(struct kvm *kvm) WARN_ON_ONCE(sp->lpage_disallowed); } - if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { + if (need_resched() || kvm_mmu_lock_needbreak(kvm)) { kvm_mmu_commit_zap_page(kvm, &invalid_list); - cond_resched_lock(&kvm->mmu_lock); + kvm_mmu_lock_cond_resched(kvm); } } kvm_mmu_commit_zap_page(kvm, &invalid_list); diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 90807f2d928f..fb911ca428b2 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -488,10 +488,10 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm, static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *iter) { - if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { + if (need_resched() || kvm_mmu_lock_needbreak(kvm)) { kvm_flush_remote_tlbs(kvm); rcu_read_unlock(); - cond_resched_lock(&kvm->mmu_lock); + kvm_mmu_lock_cond_resched(kvm); rcu_read_lock(); tdp_iter_refresh_walk(iter); return true; @@ -512,9 +512,9 @@ static bool
[PATCH 24/24] kvm: x86/mmu: Allow parallel page faults for the TDP MMU
Make the last few changes necessary to enable the TDP MMU to handle page faults in parallel while holding the mmu_lock in read mode. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/mmu.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 280d7cd6f94b..fa111ceb67d4 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3724,7 +3724,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, return r; r = RET_PF_RETRY; - kvm_mmu_lock(vcpu->kvm); + + if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)) + kvm_mmu_lock_shared(vcpu->kvm); + else + kvm_mmu_lock(vcpu->kvm); + if (mmu_notifier_retry(vcpu->kvm, mmu_seq)) goto out_unlock; r = make_mmu_pages_available(vcpu); @@ -3739,7 +3744,10 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, prefault, is_tdp); out_unlock: - kvm_mmu_unlock(vcpu->kvm); + if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)) + kvm_mmu_unlock_shared(vcpu->kvm); + else + kvm_mmu_unlock(vcpu->kvm); kvm_release_pfn_clean(pfn); return r; } -- 2.30.0.284.gd98b1dd5eaa7-goog
[PATCH 21/24] kvm: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
To prepare for handling page faults in parallel, change the TDP MMU page fault handler to use atomic operations to set SPTEs so that changes are not lost if multiple threads attempt to modify the same SPTE. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 38 ++ 1 file changed, 22 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 1380ed313476..7b12a87a4124 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -714,21 +714,18 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write, int ret = 0; int make_spte_ret = 0; - if (unlikely(is_noslot_pfn(pfn))) { + if (unlikely(is_noslot_pfn(pfn))) new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL); - trace_mark_mmio_spte(iter->sptep, iter->gfn, new_spte); - } else { + else make_spte_ret = make_spte(vcpu, ACC_ALL, iter->level, iter->gfn, pfn, iter->old_spte, prefault, true, map_writable, !shadow_accessed_mask, &new_spte); - trace_kvm_mmu_set_spte(iter->level, iter->gfn, iter->sptep); - } if (new_spte == iter->old_spte) ret = RET_PF_SPURIOUS; - else - tdp_mmu_set_spte(vcpu->kvm, iter, new_spte); + else if (!tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte)) + return RET_PF_RETRY; /* * If the page fault was caused by a write but the page is write @@ -742,8 +739,11 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write, } /* If a MMIO SPTE is installed, the MMIO will need to be emulated. */ - if (unlikely(is_mmio_spte(new_spte))) + if (unlikely(is_mmio_spte(new_spte))) { + trace_mark_mmio_spte(iter->sptep, iter->gfn, new_spte); ret = RET_PF_EMULATE; + } else + trace_kvm_mmu_set_spte(iter->level, iter->gfn, iter->sptep); trace_kvm_mmu_set_spte(iter->level, iter->gfn, iter->sptep); if (!prefault) @@ -801,7 +801,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, */ if (is_shadow_present_pte(iter.old_spte) && is_large_pte(iter.old_spte)) { - tdp_mmu_set_spte(vcpu->kvm, &iter, 0); + if (!tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, 0)) + break; kvm_flush_remote_tlbs_with_address(vcpu->kvm, iter.gfn, KVM_PAGES_PER_HPAGE(iter.level)); @@ -818,19 +819,24 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level); child_pt = sp->spt; - tdp_mmu_link_page(vcpu->kvm, sp, false, - huge_page_disallowed && - req_level >= iter.level); - new_spte = make_nonleaf_spte(child_pt, !shadow_accessed_mask); - trace_kvm_mmu_get_page(sp, true); - tdp_mmu_set_spte(vcpu->kvm, &iter, new_spte); + if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, + new_spte)) { + tdp_mmu_link_page(vcpu->kvm, sp, true, + huge_page_disallowed && + req_level >= iter.level); + + trace_kvm_mmu_get_page(sp, true); + } else { + tdp_mmu_free_sp(sp); + break; + } } } - if (WARN_ON(iter.level != level)) { + if (iter.level != level) { rcu_read_unlock(); return RET_PF_RETRY; } -- 2.30.0.284.gd98b1dd5eaa7-goog
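To make the race this guards against concrete (an illustrative scenario, not from the changelog): two vCPUs fault on the same unmapped GFN, both read old_spte == 0, and both build a new SPTE. Only one cmpxchg64() succeeds; the loser's tdp_mmu_set_spte_atomic() returns false, its fault returns RET_PF_RETRY, and on the re-fault the losing vCPU observes the winner's SPTE, typically completing as RET_PF_SPURIOUS.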
[PATCH 16/24] kvm: mmu: Wrap mmu_lock assertions
Wrap assertions and warnings checking the MMU lock state in a function which uses lockdep_assert_held. While the existing checks use a few different functions to check the lock state, they are all better off using lockdep_assert_held. This will support a refactoring to move the mmu_lock to struct kvm_arch so that it can be replaced with an rwlock for x86. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/arm64/kvm/mmu.c | 2 +- arch/powerpc/include/asm/kvm_book3s_64.h | 7 +++ arch/powerpc/kvm/book3s_hv_nested.c | 3 +-- arch/x86/kvm/mmu/mmu_internal.h | 4 ++-- arch/x86/kvm/mmu/tdp_mmu.c | 8 include/linux/kvm_host.h | 1 + virt/kvm/kvm_main.c | 5 + 7 files changed, 17 insertions(+), 13 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 57ef1ec23b56..8b54eb58bf47 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -130,7 +130,7 @@ static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 struct kvm *kvm = mmu->kvm; phys_addr_t end = start + size; - assert_spin_locked(&kvm->mmu_lock); + kvm_mmu_lock_assert_held(kvm); WARN_ON(size & ~PAGE_MASK); WARN_ON(stage2_apply_range(kvm, start, end, kvm_pgtable_stage2_unmap, may_block)); diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h index 9bb9bb370b53..db2e437cd97c 100644 --- a/arch/powerpc/include/asm/kvm_book3s_64.h +++ b/arch/powerpc/include/asm/kvm_book3s_64.h @@ -650,8 +650,8 @@ static inline pte_t *find_kvm_secondary_pte(struct kvm *kvm, unsigned long ea, { pte_t *pte; - VM_WARN(!spin_is_locked(&kvm->mmu_lock), - "%s called with kvm mmu_lock not held \n", __func__); + kvm_mmu_lock_assert_held(kvm); + pte = __find_linux_pte(kvm->arch.pgtable, ea, NULL, hshift); return pte; @@ -662,8 +662,7 @@ static inline pte_t *find_kvm_host_pte(struct kvm *kvm, unsigned long mmu_seq, { pte_t *pte; - VM_WARN(!spin_is_locked(&kvm->mmu_lock), - "%s called with kvm mmu_lock not held \n", __func__); + kvm_mmu_lock_assert_held(kvm); if (mmu_notifier_retry(kvm, mmu_seq)) return NULL; diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c index 18890dca9476..6d5987d1eee7 100644 --- a/arch/powerpc/kvm/book3s_hv_nested.c +++ b/arch/powerpc/kvm/book3s_hv_nested.c @@ -767,8 +767,7 @@ pte_t *find_kvm_nested_guest_pte(struct kvm *kvm, unsigned long lpid, if (!gp) return NULL; - VM_WARN(!spin_is_locked(&kvm->mmu_lock), - "%s called with kvm mmu_lock not held \n", __func__); + kvm_mmu_lock_assert_held(kvm); pte = __find_linux_pte(gp->shadow_pgtable, ea, NULL, hshift); return pte; diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index 7f599cc64178..cc8268cf28d2 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -101,14 +101,14 @@ void kvm_flush_remote_tlbs_with_address(struct kvm *kvm, static inline void kvm_mmu_get_root(struct kvm *kvm, struct kvm_mmu_page *sp) { BUG_ON(!sp->root_count); - lockdep_assert_held(&kvm->mmu_lock); + kvm_mmu_lock_assert_held(kvm); ++sp->root_count; } static inline bool kvm_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *sp) { - lockdep_assert_held(&kvm->mmu_lock); + kvm_mmu_lock_assert_held(kvm); --sp->root_count; return !sp->root_count; diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index fb911ca428b2..1d7c01300495 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -117,7 +117,7 @@ void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root) { gfn_t max_gfn = 1ULL << 
(shadow_phys_bits - PAGE_SHIFT); - lockdep_assert_held(&kvm->mmu_lock); + kvm_mmu_lock_assert_held(kvm); WARN_ON(root->root_count); WARN_ON(!root->tdp_mmu_page); @@ -425,7 +425,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter, struct kvm_mmu_page *root = sptep_to_sp(root_pt); int as_id = kvm_mmu_page_as_id(root); - lockdep_assert_held(&kvm->mmu_lock); + kvm_mmu_lock_assert_held(kvm); WRITE_ONCE(*iter->sptep, new_spte); @@ -1139,7 +1139,7 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root; int root_as_id; - lockdep_assert_held(&kvm->mmu_lock); + kvm_mmu_lock_assert_held(kvm); for_each_tdp_mmu_root(kvm, root) { root_as_id = kvm_mmu_page_as_id(root);
[PATCH 12/24] kvm: x86/kvm: RCU dereference tdp mmu page table links
In order to protect TDP MMU PT memory with RCU, ensure that page table links are properly rcu_derefenced. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_iter.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c index 87b7e16911db..82855613ffa0 100644 --- a/arch/x86/kvm/mmu/tdp_iter.c +++ b/arch/x86/kvm/mmu/tdp_iter.c @@ -49,6 +49,8 @@ void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level, */ u64 *spte_to_child_pt(u64 spte, int level) { + u64 *child_pt; + /* * There's no child entry if this entry isn't present or is a * last-level entry. @@ -56,7 +58,9 @@ u64 *spte_to_child_pt(u64 spte, int level) if (!is_shadow_present_pte(spte) || is_last_spte(spte, level)) return NULL; - return __va(spte_to_pfn(spte) << PAGE_SHIFT); + child_pt = __va(spte_to_pfn(spte) << PAGE_SHIFT); + + return rcu_dereference(child_pt); } /* -- 2.30.0.284.gd98b1dd5eaa7-goog
[PATCH 23/24] kvm: x86/mmu: Freeze SPTEs in disconnected pages
When clearing TDP MMU pages what have been disconnected from the paging structure root, set the SPTEs to a special non-present value which will not be overwritten by other threads. This is needed to prevent races in which a thread is clearing a disconnected page table, but another thread has already acquired a pointer to that memory and installs a mapping in an already cleared entry. This can lead to memory leaks and accounting errors. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 35 +-- 1 file changed, 29 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 5c9d053000ad..45160ff84e91 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -333,13 +333,14 @@ static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt, { struct kvm_mmu_page *sp; gfn_t gfn; + gfn_t base_gfn; int level; u64 *sptep; u64 old_child_spte; int i; sp = sptep_to_sp(pt); - gfn = sp->gfn; + base_gfn = sp->gfn; level = sp->role.level; trace_kvm_mmu_prepare_zap_page(sp); @@ -348,16 +349,38 @@ static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt, for (i = 0; i < PT64_ENT_PER_PAGE; i++) { sptep = pt + i; + gfn = base_gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)); if (atomic) { - old_child_spte = xchg(sptep, 0); + /* +* Set the SPTE to a nonpresent value that other +* threads will not overwrite. If the SPTE was already +* frozen then another thread handling a page fault +* could overwrite it, so set the SPTE until it is set +* from nonfrozen -> frozen. +*/ + for (;;) { + old_child_spte = xchg(sptep, FROZEN_SPTE); + if (old_child_spte != FROZEN_SPTE) + break; + cpu_relax(); + } } else { old_child_spte = READ_ONCE(*sptep); - WRITE_ONCE(*sptep, 0); + + /* +* Setting the SPTE to FROZEN_SPTE is not strictly +* necessary here as the MMU lock should stop other +* threads from concurrentrly modifying this SPTE. +* Using FROZEN_SPTE keeps the atomic and +* non-atomic cases consistent and simplifies the +* function. +*/ + WRITE_ONCE(*sptep, FROZEN_SPTE); } - handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), - gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)), - old_child_spte, 0, level - 1, atomic); + handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn, + old_child_spte, FROZEN_SPTE, level - 1, + atomic); } kvm_flush_remote_tlbs_with_address(kvm, gfn, -- 2.30.0.284.gd98b1dd5eaa7-goog
[PATCH 20/24] kvm: x86/mmu: Add atomic option for setting SPTEs
In order to allow multiple TDP MMU operations to proceed in parallel, there must be an option to modify SPTEs atomically so that changes are not lost. Add that option to __tdp_mmu_set_spte and __handle_changed_spte. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 67 -- 1 file changed, 57 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 264594947c3b..1380ed313476 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -7,6 +7,7 @@ #include "tdp_mmu.h" #include "spte.h" +#include #include #ifdef CONFIG_X86_64 @@ -226,7 +227,8 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head) } static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, - u64 old_spte, u64 new_spte, int level); + u64 old_spte, u64 new_spte, int level, + bool atomic); static int kvm_mmu_page_as_id(struct kvm_mmu_page *sp) { @@ -320,15 +322,19 @@ static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp, * * @kvm: kvm instance * @pt: the page removed from the paging structure + * @atomic: Use atomic operations to clear the SPTEs in any disconnected + * pages of memory. * * Given a page table that has been removed from the TDP paging structure, * iterates through the page table to clear SPTEs and free child page tables. */ -static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt) +static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt, +bool atomic) { struct kvm_mmu_page *sp; gfn_t gfn; int level; + u64 *sptep; u64 old_child_spte; int i; @@ -341,11 +347,17 @@ static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt) tdp_mmu_unlink_page(kvm, sp, atomic); for (i = 0; i < PT64_ENT_PER_PAGE; i++) { - old_child_spte = READ_ONCE(*(pt + i)); - WRITE_ONCE(*(pt + i), 0); + sptep = pt + i; + + if (atomic) { + old_child_spte = xchg(sptep, 0); + } else { + old_child_spte = READ_ONCE(*sptep); + WRITE_ONCE(*sptep, 0); + } handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)), - old_child_spte, 0, level - 1); + old_child_spte, 0, level - 1, atomic); } kvm_flush_remote_tlbs_with_address(kvm, gfn, @@ -362,12 +374,15 @@ static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt) * @old_spte: The value of the SPTE before the change * @new_spte: The value of the SPTE after the change * @level: the level of the PT the SPTE is part of in the paging structure + * @atomic: Use atomic operations to clear the SPTEs in any disconnected + * pages of memory. * * Handle bookkeeping that might result from the modification of a SPTE. * This function must be called for all TDP SPTE modifications. 
*/ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, - u64 old_spte, u64 new_spte, int level) + u64 old_spte, u64 new_spte, int level, + bool atomic) { bool was_present = is_shadow_present_pte(old_spte); bool is_present = is_shadow_present_pte(new_spte); @@ -439,18 +454,50 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, */ if (was_present && !was_leaf && (pfn_changed || !is_present)) handle_disconnected_tdp_mmu_page(kvm, - spte_to_child_pt(old_spte, level)); + spte_to_child_pt(old_spte, level), atomic); } static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, - u64 old_spte, u64 new_spte, int level) + u64 old_spte, u64 new_spte, int level, + bool atomic) { - __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level); + __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, + atomic); handle_changed_spte_acc_track(old_spte, new_spte, level); handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte, new_spte, level); } +/* + * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically and handle the + * associated bookkeeping + * + * @kvm: kvm instance + * @iter: a tdp_iter instance currently on the SPTE that should be set + * @new_spte: The value the SPTE should be set to + * Returns
[PATCH 11/24] kvm: x86/mmu: Put TDP MMU PT walks in RCU read-critical section
In order to enable concurrent modifications to the paging structures in the TDP MMU, threads must be able to safely remove pages of page table memory while other threads are traversing the same memory. To ensure threads do not access PT memory after it is freed, protect PT memory with RCU. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 53 -- 1 file changed, 51 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index e8f35cd46b4c..662907d374b3 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -458,11 +458,14 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm, * Return true if this function yielded, the TLBs were flushed, and the * iterator's traversal was reset. Return false if a yield was not needed. */ -static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *iter) +static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, + struct tdp_iter *iter) { if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { kvm_flush_remote_tlbs(kvm); + rcu_read_unlock(); cond_resched_lock(&kvm->mmu_lock); + rcu_read_lock(); tdp_iter_refresh_walk(iter); return true; } else @@ -483,7 +486,9 @@ static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *it static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter) { if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { + rcu_read_unlock(); cond_resched_lock(&kvm->mmu_lock); + rcu_read_lock(); tdp_iter_refresh_walk(iter); return true; } else @@ -508,6 +513,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, gfn_t last_goal_gfn = start; bool flush_needed = false; + rcu_read_lock(); + tdp_root_for_each_pte(iter, root, start, end) { /* Ensure forward progress has been made before yielding. 
*/ if (can_yield && iter.goal_gfn != last_goal_gfn && @@ -538,6 +545,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, tdp_mmu_set_spte(kvm, &iter, 0); flush_needed = true; } + + rcu_read_unlock(); return flush_needed; } @@ -650,6 +659,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, huge_page_disallowed, &req_level); trace_kvm_mmu_spte_requested(gpa, level, pfn); + + rcu_read_lock(); + tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) { if (nx_huge_page_workaround_enabled) disallowed_hugepage_adjust(iter.old_spte, gfn, @@ -693,11 +705,14 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, } } - if (WARN_ON(iter.level != level)) + if (WARN_ON(iter.level != level)) { + rcu_read_unlock(); return RET_PF_RETRY; + } ret = tdp_mmu_map_handle_target_level(vcpu, write, map_writable, &iter, pfn, prefault); + rcu_read_unlock(); return ret; } @@ -768,6 +783,8 @@ static int age_gfn_range(struct kvm *kvm, struct kvm_memory_slot *slot, int young = 0; u64 new_spte = 0; + rcu_read_lock(); + tdp_root_for_each_leaf_pte(iter, root, start, end) { /* * If we have a non-accessed entry we don't need to change the @@ -799,6 +816,8 @@ static int age_gfn_range(struct kvm *kvm, struct kvm_memory_slot *slot, trace_kvm_age_page(iter.gfn, iter.level, slot, young); } + rcu_read_unlock(); + return young; } @@ -844,6 +863,8 @@ static int set_tdp_spte(struct kvm *kvm, struct kvm_memory_slot *slot, u64 new_spte; int need_flush = 0; + rcu_read_lock(); + WARN_ON(pte_huge(*ptep)); new_pfn = pte_pfn(*ptep); @@ -872,6 +893,8 @@ static int set_tdp_spte(struct kvm *kvm, struct kvm_memory_slot *slot, if (need_flush) kvm_flush_remote_tlbs_with_address(kvm, gfn, 1); + rcu_read_unlock(); + return 0; } @@ -896,6 +919,8 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, gfn_t last_goal_gfn = start; bool spte_set = false; + rcu_read_lock(); + BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL); for_each_tdp_pte_min_level(iter, root->spt, root->role.level, @@ -924,6 +949,8 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, tdp_mmu_s
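The shape of the change is the same at every call site in the patch above: take the RCU read lock before walking page-table memory, and if the walk has to reschedule, drop the read lock across the cond_resched and re-take it before restarting the iterator. A kernel-context sketch of that pattern (not standalone-compilable; walk_one_pte() and restart_walk() are placeholders, not the real tdp_iter API):

/* Sketch only: placeholders stand in for the tdp_iter machinery. */
static void walk_range_rcu(struct kvm *kvm, gfn_t start, gfn_t end)
{
	gfn_t gfn;

	rcu_read_lock();

	for (gfn = start; gfn < end; gfn++) {
		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
			/*
			 * Page table memory may be freed by other threads as
			 * soon as we leave the read-side critical section, so
			 * the walk must be restarted from the root after the
			 * RCU read lock is re-acquired.
			 */
			rcu_read_unlock();
			cond_resched_lock(&kvm->mmu_lock);
			rcu_read_lock();
			/* restart_walk(&gfn); placeholder */
			continue;
		}

		/* walk_one_pte(kvm, gfn); placeholder */
	}

	rcu_read_unlock();
}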
[PATCH 03/24] sched: Add cond_resched_rwlock
Safely rescheduling while holding a spin lock is essential for keeping long running kernel operations running smoothly. Add the facility to cond_resched rwlocks. CC: Ingo Molnar CC: Will Deacon Acked-by: Peter Zijlstra Acked-by: Davidlohr Bueso Acked-by: Waiman Long Acked-by: Paolo Bonzini Signed-off-by: Ben Gardon --- include/linux/sched.h | 12 kernel/sched/core.c | 40 2 files changed, 52 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 5d1378e5a040..3052d16da3cf 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1883,12 +1883,24 @@ static inline int _cond_resched(void) { return 0; } }) extern int __cond_resched_lock(spinlock_t *lock); +extern int __cond_resched_rwlock_read(rwlock_t *lock); +extern int __cond_resched_rwlock_write(rwlock_t *lock); #define cond_resched_lock(lock) ({ \ ___might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);\ __cond_resched_lock(lock); \ }) +#define cond_resched_rwlock_read(lock) ({ \ + __might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET); \ + __cond_resched_rwlock_read(lock); \ +}) + +#define cond_resched_rwlock_write(lock) ({ \ + __might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET); \ + __cond_resched_rwlock_write(lock); \ +}) + static inline void cond_resched_rcu(void) { #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 15d2562118d1..ade357642279 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6695,6 +6695,46 @@ int __cond_resched_lock(spinlock_t *lock) } EXPORT_SYMBOL(__cond_resched_lock); +int __cond_resched_rwlock_read(rwlock_t *lock) +{ + int resched = should_resched(PREEMPT_LOCK_OFFSET); + int ret = 0; + + lockdep_assert_held_read(lock); + + if (rwlock_needbreak(lock) || resched) { + read_unlock(lock); + if (resched) + preempt_schedule_common(); + else + cpu_relax(); + ret = 1; + read_lock(lock); + } + return ret; +} +EXPORT_SYMBOL(__cond_resched_rwlock_read); + +int __cond_resched_rwlock_write(rwlock_t *lock) +{ + int resched = should_resched(PREEMPT_LOCK_OFFSET); + int ret = 0; + + lockdep_assert_held_write(lock); + + if (rwlock_needbreak(lock) || resched) { + write_unlock(lock); + if (resched) + preempt_schedule_common(); + else + cpu_relax(); + ret = 1; + write_lock(lock); + } + return ret; +} +EXPORT_SYMBOL(__cond_resched_rwlock_write); + /** * yield - yield the current processor to other threads. * -- 2.30.0.284.gd98b1dd5eaa7-goog
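As a usage illustration (a hypothetical caller, not taken from the patch), a long-running pass that holds an rwlock for write can now break the lock without open-coding the unlock/schedule/relock dance:

/* Kernel-context sketch: process many items under 'lock' held for write. */
static void process_all(rwlock_t *lock, unsigned long nr_items)
{
	unsigned long i;

	write_lock(lock);
	for (i = 0; i < nr_items; i++) {
		/* handle_item(i); placeholder for the real work */

		/*
		 * Drops and re-takes 'lock' only if rescheduling is needed or
		 * another reader/writer is contending; a nonzero return means
		 * the lock was dropped, so any state derived under the lock
		 * must be revalidated, exactly as with cond_resched_lock().
		 */
		if (cond_resched_rwlock_write(lock)) {
			/* revalidate(); placeholder */
		}
	}
	write_unlock(lock);
}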
[PATCH 06/24] kvm: x86/mmu: Skip no-op changes in TDP MMU functions
Skip setting SPTEs if no change is expected. Reviewed-by: Peter Feiner Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 1987da0da66e..2650fa9fe066 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -882,6 +882,9 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, !is_last_spte(iter.old_spte, iter.level)) continue; + if (!(iter.old_spte & PT_WRITABLE_MASK)) + continue; + new_spte = iter.old_spte & ~PT_WRITABLE_MASK; tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte); @@ -1079,6 +1082,9 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, if (!is_shadow_present_pte(iter.old_spte)) continue; + if (iter.old_spte & shadow_dirty_mask) + continue; + new_spte = iter.old_spte | shadow_dirty_mask; tdp_mmu_set_spte(kvm, &iter, new_spte); -- 2.30.0.284.gd98b1dd5eaa7-goog
[PATCH 2/6] KVM: selftests: Avoid flooding debug log while populating memory
Peter Xu pointed out that a log message printed while waiting for the memory population phase of the dirty_log_perf_test will flood the debug logs as there is no delay after printing the message. Since the message does not provide much value anyway, remove it. Reviewed-by: Jacob Xu Signed-off-by: Ben Gardon --- tools/testing/selftests/kvm/dirty_log_perf_test.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c index 16efe6589b43..15a9c45bdb5f 100644 --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c @@ -146,8 +146,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) /* Allow the vCPU to populate memory */ pr_debug("Starting iteration %lu - Populating\n", iteration); while (READ_ONCE(vcpu_last_completed_iteration[vcpu_id]) != iteration) - pr_debug("Waiting for vcpu_last_completed_iteration == %lu\n", - iteration); + ; ts_diff = timespec_elapsed(start); pr_info("Populate memory time: %ld.%.9lds\n", @@ -171,9 +170,9 @@ static void run_test(enum vm_guest_mode mode, void *arg) pr_debug("Starting iteration %lu\n", iteration); for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) { - while (READ_ONCE(vcpu_last_completed_iteration[vcpu_id]) != iteration) - pr_debug("Waiting for vCPU %d vcpu_last_completed_iteration == %lu\n", -vcpu_id, iteration); + while (READ_ONCE(vcpu_last_completed_iteration[vcpu_id]) + != iteration) + ; } ts_diff = timespec_elapsed(start); -- 2.30.0.284.gd98b1dd5eaa7-goog
[PATCH 5/6] KVM: selftests: Add option to overlap vCPU memory access
Add an option to overlap the ranges of memory each vCPU accesses instead of partitioning them. This option will increase the probability of multiple vCPUs faulting on the same page at the same time, and causing interesting races, if there are bugs in the page fault handler or elsewhere in the kernel. Reviewed-by: Jacob Xu Reviewed-by: Makarand Sonare Signed-off-by: Ben Gardon --- .../selftests/kvm/demand_paging_test.c| 32 +++ .../selftests/kvm/dirty_log_perf_test.c | 14 ++-- .../selftests/kvm/include/perf_test_util.h| 4 ++- .../selftests/kvm/lib/perf_test_util.c| 25 +++ 4 files changed, 57 insertions(+), 18 deletions(-) diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c index a1cd234e6f5e..e8fda95f8389 100644 --- a/tools/testing/selftests/kvm/demand_paging_test.c +++ b/tools/testing/selftests/kvm/demand_paging_test.c @@ -250,6 +250,7 @@ static int setup_demand_paging(struct kvm_vm *vm, struct test_params { bool use_uffd; useconds_t uffd_delay; + bool partition_vcpu_memory_access; }; static void run_test(enum vm_guest_mode mode, void *arg) @@ -277,7 +278,8 @@ static void run_test(enum vm_guest_mode mode, void *arg) vcpu_threads = malloc(nr_vcpus * sizeof(*vcpu_threads)); TEST_ASSERT(vcpu_threads, "Memory allocation failed"); - perf_test_setup_vcpus(vm, nr_vcpus, guest_percpu_mem_size); + perf_test_setup_vcpus(vm, nr_vcpus, guest_percpu_mem_size, + p->partition_vcpu_memory_access); if (p->use_uffd) { uffd_handler_threads = @@ -293,10 +295,19 @@ static void run_test(enum vm_guest_mode mode, void *arg) for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) { vm_paddr_t vcpu_gpa; void *vcpu_hva; + uint64_t vcpu_mem_size; - vcpu_gpa = guest_test_phys_mem + (vcpu_id * guest_percpu_mem_size); + + if (p->partition_vcpu_memory_access) { + vcpu_gpa = guest_test_phys_mem + + (vcpu_id * guest_percpu_mem_size); + vcpu_mem_size = guest_percpu_mem_size; + } else { + vcpu_gpa = guest_test_phys_mem; + vcpu_mem_size = guest_percpu_mem_size * nr_vcpus; + } PER_VCPU_DEBUG("Added VCPU %d with test mem gpa [%lx, %lx)\n", - vcpu_id, vcpu_gpa, vcpu_gpa + guest_percpu_mem_size); + vcpu_id, vcpu_gpa, vcpu_gpa + vcpu_mem_size); /* Cache the HVA pointer of the region */ vcpu_hva = addr_gpa2hva(vm, vcpu_gpa); @@ -313,7 +324,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) &uffd_handler_threads[vcpu_id], pipefds[vcpu_id * 2], p->uffd_delay, &uffd_args[vcpu_id], - vcpu_hva, guest_percpu_mem_size); + vcpu_hva, vcpu_mem_size); if (r < 0) exit(-r); } @@ -376,7 +387,7 @@ static void help(char *name) { puts(""); printf("usage: %s [-h] [-m mode] [-u] [-d uffd_delay_usec]\n" - " [-b memory] [-v vcpus]\n", name); + " [-b memory] [-v vcpus] [-o]\n", name); guest_modes_help(); printf(" -u: use User Fault FD to handle vCPU page\n" " faults.\n"); @@ -387,6 +398,8 @@ static void help(char *name) " demand paged by each vCPU. e.g. 10M or 3G.\n" " Default: 1G\n"); printf(" -v: specify the number of vCPUs to run.\n"); + printf(" -o: Overlap guest memory accesses instead of partitioning\n" + " them into a separate region of memory for each vCPU.\n"); puts(""); exit(0); } @@ -394,12 +407,14 @@ static void help(char *name) int main(int argc, char *argv[]) { int max_vcpus = kvm_check_cap(KVM_CAP_MAX_VCPUS); - struct test_params p = {}; + struct test_params p = { + .partition_vcpu_memory_access = true, + }; int opt; guest_modes_append_default(); - while ((opt = getopt(argc, argv, "hm:ud:b:v:")) != -1) { + while ((opt = getopt(argc, argv, "hm:ud:b:v:o")) != -1) { switch (o
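The effect of the new -o option on the per-vCPU ranges can be summarized with a small standalone program (the constants are illustrative; the real test derives them from its arguments and perf_test_util):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define GUEST_TEST_PHYS_MEM  0x100000000ULL   /* hypothetical base GPA */
#define PER_VCPU_MEM_SIZE    (64ULL << 20)    /* 64M per vCPU */
#define NR_VCPUS             4

int main(void)
{
	bool partition = false;   /* false models the overlap (-o) mode */
	int vcpu_id;

	for (vcpu_id = 0; vcpu_id < NR_VCPUS; vcpu_id++) {
		uint64_t gpa, size;

		if (partition) {
			/* Each vCPU gets its own disjoint slice. */
			gpa  = GUEST_TEST_PHYS_MEM + vcpu_id * PER_VCPU_MEM_SIZE;
			size = PER_VCPU_MEM_SIZE;
		} else {
			/* All vCPUs hammer the same region, maximizing the
			 * chance of concurrent faults on the same pages. */
			gpa  = GUEST_TEST_PHYS_MEM;
			size = PER_VCPU_MEM_SIZE * NR_VCPUS;
		}
		printf("vCPU %d: gpa [0x%llx, 0x%llx)\n", vcpu_id,
		       (unsigned long long)gpa,
		       (unsigned long long)(gpa + size));
	}
	return 0;
}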
[PATCH 1/6] KVM: selftests: Rename timespec_diff_now to timespec_elapsed
In response to some earlier comments from Peter Xu, rename timespec_diff_now to the much more sensible timespec_elapsed. No functional change intended. Reviewed-by: Jacob Xu Reviewed-by: Makarand Sonare Signed-off-by: Ben Gardon --- tools/testing/selftests/kvm/demand_paging_test.c | 8 tools/testing/selftests/kvm/dirty_log_perf_test.c | 14 +++--- tools/testing/selftests/kvm/include/test_util.h | 2 +- tools/testing/selftests/kvm/lib/test_util.c | 2 +- 4 files changed, 13 insertions(+), 13 deletions(-) diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c index cdad1eca72f7..a1cd234e6f5e 100644 --- a/tools/testing/selftests/kvm/demand_paging_test.c +++ b/tools/testing/selftests/kvm/demand_paging_test.c @@ -64,7 +64,7 @@ static void *vcpu_worker(void *data) exit_reason_str(run->exit_reason)); } - ts_diff = timespec_diff_now(start); + ts_diff = timespec_elapsed(start); PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_id, ts_diff.tv_sec, ts_diff.tv_nsec); @@ -95,7 +95,7 @@ static int handle_uffd_page_request(int uffd, uint64_t addr) return r; } - ts_diff = timespec_diff_now(start); + ts_diff = timespec_elapsed(start); PER_PAGE_DEBUG("UFFDIO_COPY %d \t%ld ns\n", tid, timespec_to_ns(ts_diff)); @@ -190,7 +190,7 @@ static void *uffd_handler_thread_fn(void *arg) pages++; } - ts_diff = timespec_diff_now(start); + ts_diff = timespec_elapsed(start); PER_VCPU_DEBUG("userfaulted %ld pages over %ld.%.9lds. (%f/sec)\n", pages, ts_diff.tv_sec, ts_diff.tv_nsec, pages / ((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / 1.0)); @@ -339,7 +339,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) PER_VCPU_DEBUG("Joined thread for vCPU %d\n", vcpu_id); } - ts_diff = timespec_diff_now(start); + ts_diff = timespec_elapsed(start); pr_info("All vCPU threads joined\n"); diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c index 2283a0ec74a9..16efe6589b43 100644 --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c @@ -52,7 +52,7 @@ static void *vcpu_worker(void *data) clock_gettime(CLOCK_MONOTONIC, &start); ret = _vcpu_run(vm, vcpu_id); - ts_diff = timespec_diff_now(start); + ts_diff = timespec_elapsed(start); TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret); TEST_ASSERT(get_ucall(vm, vcpu_id, NULL) == UCALL_SYNC, @@ -149,7 +149,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) pr_debug("Waiting for vcpu_last_completed_iteration == %lu\n", iteration); - ts_diff = timespec_diff_now(start); + ts_diff = timespec_elapsed(start); pr_info("Populate memory time: %ld.%.9lds\n", ts_diff.tv_sec, ts_diff.tv_nsec); @@ -157,7 +157,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) clock_gettime(CLOCK_MONOTONIC, &start); vm_mem_region_set_flags(vm, PERF_TEST_MEM_SLOT_INDEX, KVM_MEM_LOG_DIRTY_PAGES); - ts_diff = timespec_diff_now(start); + ts_diff = timespec_elapsed(start); pr_info("Enabling dirty logging time: %ld.%.9lds\n\n", ts_diff.tv_sec, ts_diff.tv_nsec); @@ -176,7 +176,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) vcpu_id, iteration); } - ts_diff = timespec_diff_now(start); + ts_diff = timespec_elapsed(start); vcpu_dirty_total = timespec_add(vcpu_dirty_total, ts_diff); pr_info("Iteration %lu dirty memory time: %ld.%.9lds\n", iteration, ts_diff.tv_sec, ts_diff.tv_nsec); @@ -184,7 +184,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) clock_gettime(CLOCK_MONOTONIC, &start); 
kvm_vm_get_dirty_log(vm, PERF_TEST_MEM_SLOT_INDEX, bmap); - ts_diff = timespec_diff_now(start); + ts_diff = timespec_elapsed(start); get_dirty_log_total = timespec_add(get_dirty_log_total, ts_diff); pr_info("Iteration %lu get dirty log time: %ld.%.9lds\n", @@ -195,7 +195,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) kvm_vm_clear_dirty_log(vm, PERF_TEST_MEM_SLOT_INDEX, bmap, 0
[PATCH 6/6] KVM: selftests: Add memslot modification stress test
Add a memslot modification stress test in which a memslot is repeatedly created and removed while vCPUs access memory in another memslot. Most userspaces do not create or remove memslots on running VMs which makes it hard to test races in adding and removing memslots without a dedicated test. Adding and removing a memslot also has the effect of tearing down the entire paging structure, which leads to more page faults and pressure on the page fault handling path than a one-and-done memory population test. Reviewed-by: Jacob Xu Signed-off-by: Ben Gardon --- tools/testing/selftests/kvm/.gitignore| 1 + tools/testing/selftests/kvm/Makefile | 1 + .../kvm/memslot_modification_stress_test.c| 211 ++ 3 files changed, 213 insertions(+) create mode 100644 tools/testing/selftests/kvm/memslot_modification_stress_test.c diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore index ce8f4ad39684..5a9aebfd5e01 100644 --- a/tools/testing/selftests/kvm/.gitignore +++ b/tools/testing/selftests/kvm/.gitignore @@ -29,5 +29,6 @@ /dirty_log_test /dirty_log_perf_test /kvm_create_max_vcpus +/memslot_modification_stress_test /set_memory_region_test /steal_time diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index fe41c6a0fa67..df208dc4f2ed 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -63,6 +63,7 @@ TEST_GEN_PROGS_x86_64 += demand_paging_test TEST_GEN_PROGS_x86_64 += dirty_log_test TEST_GEN_PROGS_x86_64 += dirty_log_perf_test TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus +TEST_GEN_PROGS_x86_64 += memslot_modification_stress_test TEST_GEN_PROGS_x86_64 += set_memory_region_test TEST_GEN_PROGS_x86_64 += steal_time diff --git a/tools/testing/selftests/kvm/memslot_modification_stress_test.c b/tools/testing/selftests/kvm/memslot_modification_stress_test.c new file mode 100644 index ..cae1b90cb63f --- /dev/null +++ b/tools/testing/selftests/kvm/memslot_modification_stress_test.c @@ -0,0 +1,211 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * KVM memslot modification stress test + * Adapted from demand_paging_test.c + * + * Copyright (C) 2018, Red Hat, Inc. + * Copyright (C) 2020, Google, Inc. 
+ */ + +#define _GNU_SOURCE /* for program_invocation_name */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "perf_test_util.h" +#include "processor.h" +#include "test_util.h" +#include "guest_modes.h" + +#define DUMMY_MEMSLOT_INDEX 7 + +#define DEFAULT_MEMSLOT_MODIFICATION_ITERATIONS 10 + + +static int nr_vcpus = 1; +static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE; + +static bool run_vcpus = true; + +static void *vcpu_worker(void *data) +{ + int ret; + struct perf_test_vcpu_args *vcpu_args = + (struct perf_test_vcpu_args *)data; + int vcpu_id = vcpu_args->vcpu_id; + struct kvm_vm *vm = perf_test_args.vm; + struct kvm_run *run; + + vcpu_args_set(vm, vcpu_id, 1, vcpu_id); + run = vcpu_state(vm, vcpu_id); + + /* Let the guest access its memory until a stop signal is received */ + while (READ_ONCE(run_vcpus)) { + ret = _vcpu_run(vm, vcpu_id); + TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret); + + if (get_ucall(vm, vcpu_id, NULL) == UCALL_SYNC) + continue; + + TEST_ASSERT(false, + "Invalid guest sync status: exit_reason=%s\n", + exit_reason_str(run->exit_reason)); + } + + return NULL; +} + +struct memslot_antagonist_args { + struct kvm_vm *vm; + useconds_t delay; + uint64_t nr_modifications; +}; + +static void add_remove_memslot(struct kvm_vm *vm, useconds_t delay, + uint64_t nr_modifications, uint64_t gpa) +{ + int i; + + for (i = 0; i < nr_modifications; i++) { + usleep(delay); + vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, gpa, + DUMMY_MEMSLOT_INDEX, 1, 0); + + vm_mem_region_delete(vm, DUMMY_MEMSLOT_INDEX); + } +} + +struct test_params { + useconds_t memslot_modification_delay; + uint64_t nr_memslot_modifications; + bool partition_vcpu_memory_access; +}; + +static void run_test(enum vm_guest_mode mode, void *arg) +{ + struct test_params *p = arg; + pthread_t *vcpu_threads; + struct kvm_vm *vm; + int vcpu_id; + + vm = perf_test_create_vm(mode, nr_vcpus, guest_percpu_mem_size); + + perf_test_args.wr_fract = 1; + + vcpu_threads = malloc(nr_vcpus * sizeof(*vcpu_threads)); + TEST_ASSERT(vcpu_threads, "Memory allocation fai
[PATCH 4/6] KVM: selftests: Fix population stage in dirty_log_perf_test
Currently the population stage in the dirty_log_perf_test does nothing as the per-vCPU iteration counters are not initialized and the loop does not wait for each vCPU. Remedy those errors. Reviewed-by: Jacob Xu Reviewed-by: Makarand Sonare Signed-off-by: Ben Gardon --- tools/testing/selftests/kvm/dirty_log_perf_test.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c index 3875f22d7283..fb6eb7fa0b45 100644 --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c @@ -139,14 +139,19 @@ static void run_test(enum vm_guest_mode mode, void *arg) clock_gettime(CLOCK_MONOTONIC, &start); for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) { + vcpu_last_completed_iteration[vcpu_id] = -1; + pthread_create(&vcpu_threads[vcpu_id], NULL, vcpu_worker, &perf_test_args.vcpu_args[vcpu_id]); } - /* Allow the vCPU to populate memory */ + /* Allow the vCPUs to populate memory */ pr_debug("Starting iteration %d - Populating\n", iteration); - while (READ_ONCE(vcpu_last_completed_iteration[vcpu_id]) != iteration) - ; + for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) { + while (READ_ONCE(vcpu_last_completed_iteration[vcpu_id]) != + iteration) + ; + } ts_diff = timespec_elapsed(start); pr_info("Populate memory time: %ld.%.9lds\n", -- 2.30.0.284.gd98b1dd5eaa7-goog
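A small userspace model of the fixed flow (illustrative only, not the selftest code): the per-vCPU counters start at -1 so that "iteration 0 complete" is distinguishable from "never ran", and the main thread waits on every vCPU rather than only the last one created. Build with -pthread.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define NR_VCPUS 4

static _Atomic int last_completed[NR_VCPUS];

static void *vcpu_worker(void *arg)
{
	int id = (int)(long)arg;

	usleep(1000 * (id + 1));               /* pretend to populate memory */
	atomic_store(&last_completed[id], 0);  /* iteration 0 done */
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_VCPUS];
	int i;

	for (i = 0; i < NR_VCPUS; i++) {
		atomic_store(&last_completed[i], -1);  /* not yet populated */
		pthread_create(&threads[i], NULL, vcpu_worker, (void *)(long)i);
	}

	/* Wait for *every* vCPU, not just the last one created. */
	for (i = 0; i < NR_VCPUS; i++)
		while (atomic_load(&last_completed[i]) != 0)
			;

	printf("population complete on all %d vCPUs\n", NR_VCPUS);
	for (i = 0; i < NR_VCPUS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}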
[PATCH 3/6] KVM: selftests: Convert iterations to int in dirty_log_perf_test
In order to add an iteration -1 to indicate that the memory population phase has not yet completed, convert the interations counters to ints. No functional change intended. Reviewed-by: Jacob Xu Signed-off-by: Ben Gardon --- .../selftests/kvm/dirty_log_perf_test.c | 26 +-- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c index 15a9c45bdb5f..3875f22d7283 100644 --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c @@ -28,8 +28,8 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE; /* Host variables */ static u64 dirty_log_manual_caps; static bool host_quit; -static uint64_t iteration; -static uint64_t vcpu_last_completed_iteration[KVM_MAX_VCPUS]; +static int iteration; +static int vcpu_last_completed_iteration[KVM_MAX_VCPUS]; static void *vcpu_worker(void *data) { @@ -48,7 +48,7 @@ static void *vcpu_worker(void *data) run = vcpu_state(vm, vcpu_id); while (!READ_ONCE(host_quit)) { - uint64_t current_iteration = READ_ONCE(iteration); + int current_iteration = READ_ONCE(iteration); clock_gettime(CLOCK_MONOTONIC, &start); ret = _vcpu_run(vm, vcpu_id); @@ -61,17 +61,17 @@ static void *vcpu_worker(void *data) pr_debug("Got sync event from vCPU %d\n", vcpu_id); vcpu_last_completed_iteration[vcpu_id] = current_iteration; - pr_debug("vCPU %d updated last completed iteration to %lu\n", + pr_debug("vCPU %d updated last completed iteration to %d\n", vcpu_id, vcpu_last_completed_iteration[vcpu_id]); if (current_iteration) { pages_count += vcpu_args->pages; total = timespec_add(total, ts_diff); - pr_debug("vCPU %d iteration %lu dirty memory time: %ld.%.9lds\n", + pr_debug("vCPU %d iteration %d dirty memory time: %ld.%.9lds\n", vcpu_id, current_iteration, ts_diff.tv_sec, ts_diff.tv_nsec); } else { - pr_debug("vCPU %d iteration %lu populate memory time: %ld.%.9lds\n", + pr_debug("vCPU %d iteration %d populate memory time: %ld.%.9lds\n", vcpu_id, current_iteration, ts_diff.tv_sec, ts_diff.tv_nsec); } @@ -81,7 +81,7 @@ static void *vcpu_worker(void *data) } avg = timespec_div(total, vcpu_last_completed_iteration[vcpu_id]); - pr_debug("\nvCPU %d dirtied 0x%lx pages over %lu iterations in %ld.%.9lds. (Avg %ld.%.9lds/iteration)\n", + pr_debug("\nvCPU %d dirtied 0x%lx pages over %d iterations in %ld.%.9lds. 
(Avg %ld.%.9lds/iteration)\n", vcpu_id, pages_count, vcpu_last_completed_iteration[vcpu_id], total.tv_sec, total.tv_nsec, avg.tv_sec, avg.tv_nsec); @@ -144,7 +144,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) } /* Allow the vCPU to populate memory */ - pr_debug("Starting iteration %lu - Populating\n", iteration); + pr_debug("Starting iteration %d - Populating\n", iteration); while (READ_ONCE(vcpu_last_completed_iteration[vcpu_id]) != iteration) ; @@ -168,7 +168,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) clock_gettime(CLOCK_MONOTONIC, &start); iteration++; - pr_debug("Starting iteration %lu\n", iteration); + pr_debug("Starting iteration %d\n", iteration); for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) { while (READ_ONCE(vcpu_last_completed_iteration[vcpu_id]) != iteration) @@ -177,7 +177,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) ts_diff = timespec_elapsed(start); vcpu_dirty_total = timespec_add(vcpu_dirty_total, ts_diff); - pr_info("Iteration %lu dirty memory time: %ld.%.9lds\n", + pr_info("Iteration %d dirty memory time: %ld.%.9lds\n", iteration, ts_diff.tv_sec, ts_diff.tv_nsec); clock_gettime(CLOCK_MONOTONIC, &start); @@ -186,7 +186,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) ts_diff = timespec_elapsed(start); get_dirty_log_total = timespec_add(get_dirty_log_total, ts_diff); - pr_info("Iteration %lu get dirty log time: %ld.
[PATCH 0/6] KVM: selftests: Perf test cleanups and memslot modification test
This series contains a few cleanups that didn't make it into previous series, including some cosmetic changes and small bug fixes. The series also lays the groundwork for a memslot modification test which stresses the memslot update and page fault code paths in an attempt to expose races. Tested: dirty_log_perf_test, memslot_modification_stress_test, and demand_paging_test were run, with all the patches in this series applied, on an Intel Skylake machine. echo Y > /sys/module/kvm/parameters/tdp_mmu; \ ./memslot_modification_stress_test -i 1000 -v 64 -b 1G; \ ./memslot_modification_stress_test -i 1000 -v 64 -b 64M -o; \ ./dirty_log_perf_test -v 64 -b 1G; \ ./dirty_log_perf_test -v 64 -b 64M -o; \ ./demand_paging_test -v 64 -b 1G; \ ./demand_paging_test -v 64 -b 64M -o; \ echo N > /sys/module/kvm/parameters/tdp_mmu; \ ./memslot_modification_stress_test -i 1000 -v 64 -b 1G; \ ./memslot_modification_stress_test -i 1000 -v 64 -b 64M -o; \ ./dirty_log_perf_test -v 64 -b 1G; \ ./dirty_log_perf_test -v 64 -b 64M -o; \ ./demand_paging_test -v 64 -b 1G; \ ./demand_paging_test -v 64 -b 64M -o The tests behaved as expected, and fixed the problem of the population stage being skipped in dirty_log_perf_test. This can be seen in the output, with the population stage taking about the time dirty pass 1 took and dirty pass 1 falling closer to the times for the other passes. Note that when running these tests, the -o option causes the test to take much longer as the work each vCPU must do increases proportional to the number of vCPUs. You can view this series in Gerrit at: https://linux-review.googlesource.com/c/linux/kernel/git/torvalds/linux/+/7216 Ben Gardon (6): KVM: selftests: Rename timespec_diff_now to timespec_elapsed KVM: selftests: Avoid flooding debug log while populating memory KVM: selftests: Convert iterations to int in dirty_log_perf_test KVM: selftests: Fix population stage in dirty_log_perf_test KVM: selftests: Add option to overlap vCPU memory access KVM: selftests: Add memslot modification stress test tools/testing/selftests/kvm/.gitignore| 1 + tools/testing/selftests/kvm/Makefile | 1 + .../selftests/kvm/demand_paging_test.c| 40 +++- .../selftests/kvm/dirty_log_perf_test.c | 72 +++--- .../selftests/kvm/include/perf_test_util.h| 4 +- .../testing/selftests/kvm/include/test_util.h | 2 +- .../selftests/kvm/lib/perf_test_util.c| 25 ++- tools/testing/selftests/kvm/lib/test_util.c | 2 +- .../kvm/memslot_modification_stress_test.c| 211 ++ 9 files changed, 307 insertions(+), 51 deletions(-) create mode 100644 tools/testing/selftests/kvm/memslot_modification_stress_test.c -- 2.30.0.284.gd98b1dd5eaa7-goog
Re: [PATCH v2 3/3] KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages
On Thu, Mar 25, 2021 at 1:01 PM Sean Christopherson wrote: > > Prevent the TDP MMU from yielding when zapping a gfn range during NX > page recovery. If a flush is pending from a previous invocation of the > zapping helper, either in the TDP MMU or the legacy MMU, but the TDP MMU > has not accumulated a flush for the current invocation, then yielding > will release mmu_lock with stale TLB entriesr Extra r here. > > That being said, this isn't technically a bug fix in the current code, as > the TDP MMU will never yield in this case. tdp_mmu_iter_cond_resched() > will yield if and only if it has made forward progress, as defined by the > current gfn vs. the last yielded (or starting) gfn. Because zapping a > single shadow page is guaranteed to (a) find that page and (b) step > sideways at the level of the shadow page, the TDP iter will break its loop > before getting a chance to yield. > > But that is all very, very subtle, and will break at the slightest sneeze, > e.g. zapping while holding mmu_lock for read would break as the TDP MMU > wouldn't be guaranteed to see the present shadow page, and thus could step > sideways at a lower level. > > Cc: Ben Gardon > Signed-off-by: Sean Christopherson > --- > arch/x86/kvm/mmu/mmu.c | 4 +--- > arch/x86/kvm/mmu/tdp_mmu.c | 5 +++-- > arch/x86/kvm/mmu/tdp_mmu.h | 23 ++- > 3 files changed, 26 insertions(+), 6 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 5a53743b37bc..7a99e59c8c1c 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -5940,7 +5940,6 @@ static void kvm_recover_nx_lpages(struct kvm *kvm) > unsigned int ratio; > LIST_HEAD(invalid_list); > bool flush = false; > - gfn_t gfn_end; > ulong to_zap; > > rcu_idx = srcu_read_lock(&kvm->srcu); > @@ -5962,8 +5961,7 @@ static void kvm_recover_nx_lpages(struct kvm *kvm) > lpage_disallowed_link); > WARN_ON_ONCE(!sp->lpage_disallowed); > if (is_tdp_mmu_page(sp)) { > - gfn_end = sp->gfn + > KVM_PAGES_PER_HPAGE(sp->role.level); > - flush = kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, > gfn_end); > + flush = kvm_tdp_mmu_zap_sp(kvm, sp); > } else { > kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list); > WARN_ON_ONCE(sp->lpage_disallowed); > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 6cf08c3c537f..08667e3cf091 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -709,13 +709,14 @@ static bool zap_gfn_range(struct kvm *kvm, struct > kvm_mmu_page *root, > * SPTEs have been cleared and a TLB flush is needed before releasing the > * MMU lock. 
> */ > -bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end) > +bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end, > +bool can_yield) > { > struct kvm_mmu_page *root; > bool flush = false; > > for_each_tdp_mmu_root_yield_safe(kvm, root) > - flush = zap_gfn_range(kvm, root, start, end, true, flush); > + flush = zap_gfn_range(kvm, root, start, end, can_yield, > flush); > > return flush; > } > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h > index 3b761c111bff..715aa4e0196d 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.h > +++ b/arch/x86/kvm/mmu/tdp_mmu.h > @@ -8,7 +8,28 @@ > hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu); > void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root); > > -bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end); > +bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end, > +bool can_yield); > +static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, > +gfn_t end) > +{ > + return __kvm_tdp_mmu_zap_gfn_range(kvm, start, end, true); > +} > +static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page > *sp) I'm a little leary of adding an interface which takes a non-root struct kvm_mmu_page as an argument to the TDP MMU. In the TDP MMU, the struct kvm_mmu_pages are protected rather subtly. I agree this is safe because we hold the MMU lock in write mode here, but if we ever wanted to convert to holding it in read mode things could get complicated fast. Maybe this is more of a concern if the function started to be used elsewhe
Re: [PATCH v2 2/3] KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping
On Thu, Mar 25, 2021 at 1:01 PM Sean Christopherson wrote: > > Honor the "flush needed" return from kvm_tdp_mmu_zap_gfn_range(), which > does the flush itself if and only if it yields (which it will never do in > this particular scenario), and otherwise expects the caller to do the > flush. If pages are zapped from the TDP MMU but not the legacy MMU, then > no flush will occur. > > Fixes: 29cf0f5007a2 ("kvm: x86/mmu: NX largepage recovery for TDP MMU") > Cc: sta...@vger.kernel.org > Cc: Ben Gardon > Signed-off-by: Sean Christopherson Reviewed-by: Ben Gardon > --- > arch/x86/kvm/mmu/mmu.c | 11 +++ > 1 file changed, 7 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index c6ed633594a2..5a53743b37bc 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -5939,6 +5939,8 @@ static void kvm_recover_nx_lpages(struct kvm *kvm) > struct kvm_mmu_page *sp; > unsigned int ratio; > LIST_HEAD(invalid_list); > + bool flush = false; > + gfn_t gfn_end; > ulong to_zap; > > rcu_idx = srcu_read_lock(&kvm->srcu); > @@ -5960,19 +5962,20 @@ static void kvm_recover_nx_lpages(struct kvm *kvm) > lpage_disallowed_link); > WARN_ON_ONCE(!sp->lpage_disallowed); > if (is_tdp_mmu_page(sp)) { > - kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, > - sp->gfn + > KVM_PAGES_PER_HPAGE(sp->role.level)); > + gfn_end = sp->gfn + > KVM_PAGES_PER_HPAGE(sp->role.level); > + flush = kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, > gfn_end); > } else { > kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list); > WARN_ON_ONCE(sp->lpage_disallowed); > } > > if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) { > - kvm_mmu_commit_zap_page(kvm, &invalid_list); > + kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, > flush); > cond_resched_rwlock_write(&kvm->mmu_lock); > + flush = false; > } > } > - kvm_mmu_commit_zap_page(kvm, &invalid_list); > + kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush); > > write_unlock(&kvm->mmu_lock); > srcu_read_unlock(&kvm->srcu, rcu_idx); > -- > 2.31.0.291.g576ba9dcdaf-goog >
Re: [PATCH 1/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap
On Fri, Mar 19, 2021 at 4:20 PM Sean Christopherson wrote: > > When flushing a range of GFNs across multiple roots, ensure any pending > flush from a previous root is honored before yielding while walking the > tables of the current root. > > Note, kvm_tdp_mmu_zap_gfn_range() now intentionally overwrites it local > "flush" with the result to avoid redundant flushes. zap_gfn_range() > preserves and return the incoming "flush", unless of course the flush was > performed prior to yielding and no new flush was triggered. > > Fixes: 1af4a96025b3 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES > changed") > Cc: sta...@vger.kernel.org > Cc: Ben Gardon > Signed-off-by: Sean Christopherson Reviewed-By: Ben Gardon > --- > arch/x86/kvm/mmu/tdp_mmu.c | 23 --- > 1 file changed, 12 insertions(+), 11 deletions(-) > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index f0c99fa04ef2..6cf08c3c537f 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -86,7 +86,7 @@ static inline struct kvm_mmu_page *tdp_mmu_next_root(struct > kvm *kvm, > list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) > > static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, > - gfn_t start, gfn_t end, bool can_yield); > + gfn_t start, gfn_t end, bool can_yield, bool flush); This function is going to acquire so many arguments. Don't need to do anything about it here, but this is going to need some kind of cleanup at some point. I'll have to add another "shared" type arg for running this function under the read lock in a series I'm prepping. > > void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root) > { > @@ -99,7 +99,7 @@ void kvm_tdp_mmu_free_root(struct kvm *kvm, struct > kvm_mmu_page *root) > > list_del(&root->link); > > - zap_gfn_range(kvm, root, 0, max_gfn, false); > + zap_gfn_range(kvm, root, 0, max_gfn, false, false); > > free_page((unsigned long)root->spt); > kmem_cache_free(mmu_page_header_cache, root); > @@ -664,20 +664,21 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm > *kvm, > * scheduler needs the CPU or there is contention on the MMU lock. If this > * function cannot yield, it will not release the MMU lock or reschedule and > * the caller must ensure it does not supply too large a GFN range, or the > - * operation can cause a soft lockup. > + * operation can cause a soft lockup. Note, in some use cases a flush may be > + * required by prior actions. Ensure the pending flush is performed prior to > + * yielding. 
> */ > static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, > - gfn_t start, gfn_t end, bool can_yield) > + gfn_t start, gfn_t end, bool can_yield, bool flush) > { > struct tdp_iter iter; > - bool flush_needed = false; > > rcu_read_lock(); > > tdp_root_for_each_pte(iter, root, start, end) { > if (can_yield && > - tdp_mmu_iter_cond_resched(kvm, &iter, flush_needed)) { > - flush_needed = false; > + tdp_mmu_iter_cond_resched(kvm, &iter, flush)) { > + flush = false; > continue; > } > > @@ -695,11 +696,11 @@ static bool zap_gfn_range(struct kvm *kvm, struct > kvm_mmu_page *root, > continue; > > tdp_mmu_set_spte(kvm, &iter, 0); > - flush_needed = true; > + flush = true; > } > > rcu_read_unlock(); > - return flush_needed; > + return flush; > } > > /* > @@ -714,7 +715,7 @@ bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t > start, gfn_t end) > bool flush = false; > > for_each_tdp_mmu_root_yield_safe(kvm, root) > - flush |= zap_gfn_range(kvm, root, start, end, true); > + flush = zap_gfn_range(kvm, root, start, end, true, flush); > > return flush; > } > @@ -931,7 +932,7 @@ static int zap_gfn_range_hva_wrapper(struct kvm *kvm, > struct kvm_mmu_page *root, gfn_t start, > gfn_t end, unsigned long unused) > { > - return zap_gfn_range(kvm, root, start, end, false); > + return zap_gfn_range(kvm, root, start, end, false, false); > } > > int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start, > -- > 2.31.0.rc2.261.g7f71774620-goog >
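The flush-threading idiom under discussion reduces to a small standalone model (the names and the yield trigger are made up) that shows why the pending flag has to be carried into the next root's zap rather than OR'd in after the fact: a flush owed by an earlier root is then performed before yielding in a later one.

#include <stdbool.h>
#include <stdio.h>

static void tlb_flush(void)       { puts("  TLB flush"); }
static bool yield_needed(int gfn) { return gfn == 2; }  /* fake trigger */

/* Toy stand-in for zap_gfn_range(): takes and returns the pending flush. */
static bool zap_root(int root, int nr_gfns, bool flush)
{
	for (int gfn = 0; gfn < nr_gfns; gfn++) {
		if (yield_needed(gfn)) {
			if (flush)
				tlb_flush();   /* honor the pending flush */
			flush = false;
			/* cond_resched() would happen here */
			continue;
		}
		printf("  root %d: zap gfn %d\n", root, gfn);
		flush = true;                  /* something was zapped */
	}
	return flush;
}

int main(void)
{
	bool flush = false;

	for (int root = 0; root < 2; root++)
		flush = zap_root(root, 4, flush);  /* carry, don't OR-in */

	if (flush)
		tlb_flush();
	return 0;
}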
Re: [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping
On Fri, Mar 19, 2021 at 4:20 PM Sean Christopherson wrote: > > Fix two intertwined bugs in the NX huge page zapping that were introduced > by the incorporation of the TDP MMU. Because there is a unified list of > NX huge pages, zapping can encounter both TDP MMU and legacy MMU pages, > and the two MMUs have different tracking for TLB flushing. If one flavor > needs a flush, but the code for the other flavor yields, KVM will fail to > flush before yielding. > > First, honor the "flush needed" return from kvm_tdp_mmu_zap_gfn_range(), > which does the flush itself if and only if it yields, and otherwise > expects the caller to do the flush. This requires feeding the result > into kvm_mmu_remote_flush_or_zap(), and so also fixes the case where the > TDP MMU needs a flush, the legacy MMU does not, and the main loop yields. > > Second, tell the TDP MMU a flush is pending if the list of zapped pages > from legacy MMUs is not empty, i.e. the legacy MMU needs a flush. This > fixes the case where the TDP MMU yields, but it iteslf does not require a > flush. > > Fixes: 29cf0f5007a2 ("kvm: x86/mmu: NX largepage recovery for TDP MMU") > Cc: sta...@vger.kernel.org > Cc: Ben Gardon > Signed-off-by: Sean Christopherson Reviewed-By: Ben Gardon This preserves an extremely unlikely degenerate case, which could cause an unexpected delay. The scenario is described below, but I don't think this change needs to be blocked on it. > --- > arch/x86/kvm/mmu/mmu.c | 15 ++- > arch/x86/kvm/mmu/tdp_mmu.c | 6 +++--- > arch/x86/kvm/mmu/tdp_mmu.h | 3 ++- > 3 files changed, 15 insertions(+), 9 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index c6ed633594a2..413d6259340e 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -5517,7 +5517,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t > gfn_start, gfn_t gfn_end) > } > > if (is_tdp_mmu_enabled(kvm)) { > - flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end); > + flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end, > + false); > if (flush) > kvm_flush_remote_tlbs(kvm); > } > @@ -5939,6 +5940,8 @@ static void kvm_recover_nx_lpages(struct kvm *kvm) > struct kvm_mmu_page *sp; > unsigned int ratio; > LIST_HEAD(invalid_list); > + bool flush = false; > + gfn_t gfn_end; > ulong to_zap; > > rcu_idx = srcu_read_lock(&kvm->srcu); > @@ -5960,19 +5963,21 @@ static void kvm_recover_nx_lpages(struct kvm *kvm) > lpage_disallowed_link); > WARN_ON_ONCE(!sp->lpage_disallowed); > if (is_tdp_mmu_page(sp)) { > - kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, > - sp->gfn + > KVM_PAGES_PER_HPAGE(sp->role.level)); > + gfn_end = sp->gfn + > KVM_PAGES_PER_HPAGE(sp->role.level); > + flush = kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, > gfn_end, > + flush || > !list_empty(&invalid_list)); > } else { > kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list); > WARN_ON_ONCE(sp->lpage_disallowed); > } > > if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) { > - kvm_mmu_commit_zap_page(kvm, &invalid_list); > + kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, > flush); This pattern of waiting until a yield is needed or lock contention is detected has always been a little suspect to me because kvm_mmu_commit_zap_page does work proportional to the work done before the yield was needed. That seems like more work than we should like to be doing at that point. The yield in kvm_tdp_mmu_zap_gfn_range makes that phenomenon even worse. 
Because we can satisfy the need to yield without clearing out the invalid list, we can potentially queue many more pages which will then all need to have their zaps committed at once. This is an admittedly contrived case which could only be hit in a high load nested scenario. It could be fixed by forbidding kvm_tdp_mmu_zap_gfn_range from yielding. Since we should only need to zap one SPTE, the yield should not be needed within the kvm_tdp_mmu_zap_gfn_range call. To ensure that only one SPTE is zapped we would have to specify the root though. Otherwise we could end up zapping all the entries for the same GFN range under an unrelated root. An
Re: [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping
On Mon, Mar 22, 2021 at 5:15 PM Sean Christopherson wrote: > > On Mon, Mar 22, 2021, Ben Gardon wrote: > > On Fri, Mar 19, 2021 at 4:20 PM Sean Christopherson > > wrote: > > > @@ -5960,19 +5963,21 @@ static void kvm_recover_nx_lpages(struct kvm *kvm) > > > lpage_disallowed_link); > > > WARN_ON_ONCE(!sp->lpage_disallowed); > > > if (is_tdp_mmu_page(sp)) { > > > - kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, > > > - sp->gfn + > > > KVM_PAGES_PER_HPAGE(sp->role.level)); > > > + gfn_end = sp->gfn + > > > KVM_PAGES_PER_HPAGE(sp->role.level); > > > + flush = kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, > > > gfn_end, > > > + flush || > > > !list_empty(&invalid_list)); > > > } else { > > > kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list); > > > WARN_ON_ONCE(sp->lpage_disallowed); > > > } > > > > > > if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) { > > > - kvm_mmu_commit_zap_page(kvm, &invalid_list); > > > + kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, > > > flush); > > > > This pattern of waiting until a yield is needed or lock contention is > > detected has always been a little suspect to me because > > kvm_mmu_commit_zap_page does work proportional to the work done before > > the yield was needed. That seems like more work than we should like to > > be doing at that point. > > > > The yield in kvm_tdp_mmu_zap_gfn_range makes that phenomenon even > > worse. Because we can satisfy the need to yield without clearing out > > the invalid list, we can potentially queue many more pages which will > > then all need to have their zaps committed at once. This is an > > admittedly contrived case which could only be hit in a high load > > nested scenario. > > > > It could be fixed by forbidding kvm_tdp_mmu_zap_gfn_range from > > yielding. Since we should only need to zap one SPTE, the yield should > > not be needed within the kvm_tdp_mmu_zap_gfn_range call. To ensure > > that only one SPTE is zapped we would have to specify the root though. > > Otherwise we could end up zapping all the entries for the same GFN > > range under an unrelated root. > > Hmm, I originally did exactly that, but changed my mind because this zaps far > more than 1 SPTE. This is zapping a SP that could be huge, but is not, which > means it's guaranteed to have a non-zero number of child SPTEs. The worst > case > scenario is that SP is a PUD (potential 1gb page) and the leafs are 4k SPTEs. It's true that there are potentially 512^2 child sptes, but the code to clear those after the single PUD spte is cleared doesn't yield anyway. If the TDP MMU is only operating with one root (as we would expect in most cases), there should only be one chance for it to yield. I've considered how we could allow the recursive changed spte handlers to yield, but it gets complicated quite fast because the caller needs to know if it yielded and reset the TDP iterator to the root, and there are some cases (mmu notifiers + vCPU path) where yielding is not desirable. > > But, I didn't consider the interplay between invalid_list and the TDP MMU > yielding. Hrm.
Re: [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping
On Tue, Mar 23, 2021 at 11:58 AM Sean Christopherson wrote: > > On Tue, Mar 23, 2021, Ben Gardon wrote: > > On Mon, Mar 22, 2021 at 5:15 PM Sean Christopherson > > wrote: > > > > > > On Mon, Mar 22, 2021, Ben Gardon wrote: > > > > It could be fixed by forbidding kvm_tdp_mmu_zap_gfn_range from > > > > yielding. Since we should only need to zap one SPTE, the yield should > > > > not be needed within the kvm_tdp_mmu_zap_gfn_range call. To ensure > > > > that only one SPTE is zapped we would have to specify the root though. > > > > Otherwise we could end up zapping all the entries for the same GFN > > > > range under an unrelated root. > > > > > > Hmm, I originally did exactly that, but changed my mind because this zaps > > > far > > > more than 1 SPTE. This is zapping a SP that could be huge, but is not, > > > which > > > means it's guaranteed to have a non-zero number of child SPTEs. The > > > worst case > > > scenario is that SP is a PUD (potential 1gb page) and the leafs are 4k > > > SPTEs. > > > > It's true that there are potentially 512^2 child sptes, but the code > > to clear those after the single PUD spte is cleared doesn't yield > > anyway. If the TDP MMU is only operating with one root (as we would > > expect in most cases), there should only be one chance for it to > > yield. > > Ah, right, I was thinking all the iterative flows yielded. Disallowing > kvm_tdp_mmu_zap_gfn_range() from yielding in this case does seem like the best > fix. Any objection to me sending v2 with that? That sounds good to me. > > > I've considered how we could allow the recursive changed spte handlers > > to yield, but it gets complicated quite fast because the caller needs > > to know if it yielded and reset the TDP iterator to the root, and > > there are some cases (mmu notifiers + vCPU path) where yielding is not > > desirable. > > Urgh, yeah, seems like we'd quickly end up with a mess resembling the legacy > MMU > iterators. > > > > > > > But, I didn't consider the interplay between invalid_list and the TDP MMU > > > yielding. Hrm.
Re: [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers
On Thu, Mar 25, 2021 at 7:20 PM Sean Christopherson wrote: > > The end goal of this series is to optimize the MMU notifiers to take > mmu_lock if and only if the notification is relevant to KVM, i.e. the hva > range overlaps a memslot. Large VMs (hundreds of vCPUs) are very > sensitive to mmu_lock being taken for write at inopportune times, and > such VMs also tend to be "static", e.g. backed by HugeTLB with minimal > page shenanigans. The vast majority of notifications for these VMs will > be spurious (for KVM), and eliding mmu_lock for spurious notifications > avoids an otherwise unacceptable disruption to the guest. > > To get there without potentially degrading performance, e.g. due to > multiple memslot lookups, especially on non-x86 where the use cases are > largely unknown (from my perspective), first consolidate the MMU notifier > logic by moving the hva->gfn lookups into common KVM. > > Applies on my TDP MMU TLB flushing bug fixes[*], which conflict horribly > with the TDP MMU changes in this series. That code applies on kvm/queue > (commit 4a98623d5d90, "KVM: x86/mmu: Mark the PAE roots as decrypted for > shadow paging"). > > Speaking of conflicts, Ben will soon be posting a series to convert a > bunch of TDP MMU flows to take mmu_lock only for read. Presumably there > will be an absurd number of conflicts; Ben and I will sort out the > conflicts in whichever series loses the race. > > Well tested on Intel and AMD. Compile tested for arm64, MIPS, PPC, > PPC e500, and s390. Absolutely needs to be tested for real on non-x86, > I give it even odds that I introduced an off-by-one bug somewhere. > > [*] https://lkml.kernel.org/r/20210325200119.1359384-1-sea...@google.com > > > Patches 1-7 are x86 specific prep patches to play nice with moving > the hva->gfn memslot lookups into common code. There ended up being waaay > more of these than I expected/wanted, but I had a hell of a time getting > the flushing logic right when shuffling the memslot and address space > loops. In the end, I was more confident I got things correct by batching > the flushes. > > Patch 8 moves the existing API prototypes into common code. It could > technically be dropped since the old APIs are gone in the end, but I > thought the switch to the new APIs would suck a bit less this way. Patches 1-8 look good to me. Feel free to add my Reviewed-by tag to those. I appreciate the care you took to make all those changes tiny and reviewable. > > Patch 9 moves arm64's MMU notifier tracepoints into common code so that > they are not lost when arm64 is converted to the new APIs, and so that all > architectures can benefit. > > Patch 10 moves x86's memslot walkers into common KVM. I chose x86 purely > because I could actually test it. All architectures use nearly identical > code, so I don't think it actually matters in the end. I'm still reviewing 10 and 14-18. 10 is a huge change and the diff is pretty hard to parse. > > Patches 11-13 move arm64, MIPS, and PPC to the new APIs. > > Patch 14 yanks out the old APIs. > > Patch 15 adds the mmu_lock elision, but only for unpaired notifications. Reading through all this code and considering the changes I'm preparing for the TDP MMU have me wondering if it might help to have a more general purpose MMU lock context struct which could be embedded in the structs added in this patch. 
I'm thinking something like: enum kvm_mmu_lock_mode { KVM_MMU_LOCK_NONE, KVM_MMU_LOCK_READ, KVM_MMU_LOCK_WRITE, }; struct kvm_mmu_lock_context { enum kvm_mmu_lock_mode lock_mode; bool can_block; bool can_yield; bool flush; }; This could yield some grossly long lines, but it would also have potential to unify a bunch of ad-hoc handling. The above struct could also fit into a single byte, so it'd be pretty easy to pass it around. > > Patch 16 adds mmu_lock elision for paired .invalidate_range_{start,end}(). > This is quite nasty and no small part of me thinks the patch should be > burned with fire (I won't spoil it any further), but it's also the most > problematic scenario for our particular use case. :-/ > > Patches 17-18 are additional x86 cleanups. > > Sean Christopherson (18): > KVM: x86/mmu: Coalesce TDP MMU TLB flushes when zapping collapsible > SPTEs > KVM: x86/mmu: Move flushing for "slot" handlers to caller for legacy > MMU > KVM: x86/mmu: Coalesce TLB flushes when zapping collapsible SPTEs > KVM: x86/mmu: Coalesce TLB flushes across address spaces for gfn range > zap > KVM: x86/mmu: Pass address space ID to __kvm_tdp_mmu_zap_gfn_range() > KVM: x86/mmu: Pass address space ID to TDP MMU root walkers > KVM: x86/mmu: Use leaf-only loop for walking TDP SPTEs when changing > SPTE > KVM: Move prototypes for MMU notifier callbacks to generic code > KVM: Move arm64's MMU notifier trace events to generic code > KVM: Move x86's MMU notifier memslot walkers to generic code > KVM: arm64: Convert to the gfn-based MMU notifier callbacks >
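Spelled out as compilable C, purely as a sketch of the idea above (walker_may_yield() is a hypothetical consumer, not an existing KVM function):

#include <stdbool.h>
#include <stdio.h>

enum kvm_mmu_lock_mode {
	KVM_MMU_LOCK_NONE,
	KVM_MMU_LOCK_READ,
	KVM_MMU_LOCK_WRITE,
};

struct kvm_mmu_lock_context {
	enum kvm_mmu_lock_mode lock_mode;
	bool can_block;
	bool can_yield;
	bool flush;
};

/*
 * Hypothetical consumer: a walker asks one context object whether it may
 * reschedule, instead of taking separate can_yield/shared/flush bools.
 */
static bool walker_may_yield(const struct kvm_mmu_lock_context *ctx)
{
	return ctx->can_yield && ctx->lock_mode != KVM_MMU_LOCK_NONE;
}

int main(void)
{
	struct kvm_mmu_lock_context ctx = {
		.lock_mode = KVM_MMU_LOCK_WRITE,
		.can_block = true,
		.can_yield = true,
		.flush     = false,
	};

	printf("walker may yield: %s\n", walker_may_yield(&ctx) ? "yes" : "no");
	return 0;
}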
Re: [PATCH 03/15] KVM: x86/mmu: Ensure MMU pages are available when allocating roots
On Tue, Mar 2, 2021 at 10:46 AM Sean Christopherson wrote: > > Hold the mmu_lock for write for the entire duration of allocating and > initializing an MMU's roots. This ensures there are MMU pages available > and thus prevents root allocations from failing. That in turn fixes a > bug where KVM would fail to free valid PAE roots if a one of the later > roots failed to allocate. > > Note, KVM still leaks the PAE roots if the lm_root allocation fails. > This will be addressed in a future commit. > > Cc: Ben Gardon > Signed-off-by: Sean Christopherson Reviewed-by: Ben Gardon Very tidy cleanup! > --- > arch/x86/kvm/mmu/mmu.c | 41 -- > arch/x86/kvm/mmu/tdp_mmu.c | 23 + > 2 files changed, 18 insertions(+), 46 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 2ed3fac1244e..1f129001a30c 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -2398,6 +2398,9 @@ static int make_mmu_pages_available(struct kvm_vcpu > *vcpu) > { > unsigned long avail = kvm_mmu_available_pages(vcpu->kvm); > > + /* Ensure all four PAE roots can be allocated in a single pass. */ > + BUILD_BUG_ON(KVM_MIN_FREE_MMU_PAGES < 4); > + For a second I thought that this should be 5 since a page is needed to hold the 4 PAE roots, but that page is allocated at vCPU creation and reused, so no need to check for it here. > if (likely(avail >= KVM_MIN_FREE_MMU_PAGES)) > return 0; > > @@ -3220,16 +3223,9 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, > gfn_t gfn, gva_t gva, > { > struct kvm_mmu_page *sp; > > - write_lock(&vcpu->kvm->mmu_lock); > - > - if (make_mmu_pages_available(vcpu)) { > - write_unlock(&vcpu->kvm->mmu_lock); > - return INVALID_PAGE; > - } > sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL); > ++sp->root_count; > > - write_unlock(&vcpu->kvm->mmu_lock); > return __pa(sp->spt); > } > > @@ -3241,16 +3237,10 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu > *vcpu) > > if (is_tdp_mmu_enabled(vcpu->kvm)) { > root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu); > - > - if (!VALID_PAGE(root)) > - return -ENOSPC; > vcpu->arch.mmu->root_hpa = root; > } else if (shadow_root_level >= PT64_ROOT_4LEVEL) { > root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, > true); > - > - if (!VALID_PAGE(root)) > - return -ENOSPC; There's so much going on in mmu_alloc_root that removing this check makes me nervous, but I think it should be safe. I checked though the function because I was worried it might yield somewhere in there, which could result in the page cache being emptied and the allocation failing, but I don't think mmu_alloc_root this function will yield. 
> vcpu->arch.mmu->root_hpa = root;
> } else if (shadow_root_level == PT32E_ROOT_LEVEL) {
> for (i = 0; i < 4; ++i) {
> @@ -3258,8 +3248,6 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>
> root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
> i << 30, PT32_ROOT_LEVEL, true);
> - if (!VALID_PAGE(root))
> - return -ENOSPC;
> vcpu->arch.mmu->pae_root[i] = root | PT_PRESENT_MASK;
> }
> vcpu->arch.mmu->root_hpa = __pa(vcpu->arch.mmu->pae_root);
> @@ -3294,8 +3282,6 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>
> root = mmu_alloc_root(vcpu, root_gfn, 0,
> vcpu->arch.mmu->shadow_root_level,
> false);
> - if (!VALID_PAGE(root))
> - return -ENOSPC;
> vcpu->arch.mmu->root_hpa = root;
> goto set_root_pgd;
> }
> @@ -3325,6 +3311,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>
> for (i = 0; i < 4; ++i) {
> MMU_WARN_ON(VALID_PAGE(vcpu->arch.mmu->pae_root[i]));
> +
> if (vcpu->arch.mmu->root_level == PT32E_ROOT_LEVEL) {
> pdptr = vcpu->arch.mmu->get_pdptr(vcpu, i);
> if (!(pdptr & PT_PRESENT_MASK)) {
> @@ -3338,8 +3325,6 @@ static int mmu_alloc_shadow_roots(struct kvm_v
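As a side note for anyone skimming the thread: condensed down, the PAE path
this patch ends up with looks roughly like the sketch below. The wrapper name
mmu_alloc_pae_roots_sketch() is made up, and the exact placement of the
locking in the real patch isn't visible in the quoted hunks, but the commit
message's point is that make_mmu_pages_available() runs once, under mmu_lock,
before any roots are allocated:

        /* Illustrative sketch only, condensed from the quoted hunks. */
        static int mmu_alloc_pae_roots_sketch(struct kvm_vcpu *vcpu)
        {
                struct kvm_mmu *mmu = vcpu->arch.mmu;
                unsigned int i;
                hpa_t root;
                int r;

                write_lock(&vcpu->kvm->mmu_lock);

                /*
                 * Top up the per-vCPU MMU page caches; per the BUILD_BUG_ON()
                 * in the quoted hunk, success guarantees at least four free
                 * pages.
                 */
                r = make_mmu_pages_available(vcpu);
                if (r)
                        goto out_unlock;

                /* With pages guaranteed, none of these allocations can fail. */
                for (i = 0; i < 4; ++i) {
                        root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
                                              i << 30, PT32_ROOT_LEVEL, true);
                        mmu->pae_root[i] = root | PT_PRESENT_MASK;
                }
                mmu->root_hpa = __pa(mmu->pae_root);

        out_unlock:
                write_unlock(&vcpu->kvm->mmu_lock);
                return r;
        }

With that ordering, the four mmu_alloc_root() calls cannot fail, which is what
lets the VALID_PAGE()/-ENOSPC checks above be deleted.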
Re: [PATCH 02/15] KVM: x86/mmu: Alloc page for PDPTEs when shadowing 32-bit NPT with 64-bit
On Tue, Mar 2, 2021 at 10:45 AM Sean Christopherson wrote:
>
> Allocate the so called pae_root page on-demand, along with the lm_root
> page, when shadowing 32-bit NPT with 64-bit NPT, i.e. when running a
> 32-bit L1. KVM currently only allocates the page when NPT is disabled,
> or when L0 is 32-bit (using PAE paging).
>
> Note, there is an existing memory leak involving the MMU roots, as KVM
> fails to free the PAE roots on failure. This will be addressed in a
> future commit.
>
> Fixes: ee6268ba3a68 ("KVM: x86: Skip pae_root shadow allocation if tdp enabled")
> Fixes: b6b80c78af83 ("KVM: x86/mmu: Allocate PAE root array when using SVM's 32-bit NPT")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Sean Christopherson

Reviewed-by: Ben Gardon

> ---
> arch/x86/kvm/mmu/mmu.c | 44 --
> 1 file changed, 29 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0987cc1d53eb..2ed3fac1244e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3187,14 +3187,14 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
> (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
> mmu_free_root_page(kvm, &mmu->root_hpa,
> &invalid_list);
> - } else {
> + } else if (mmu->pae_root) {
> for (i = 0; i < 4; ++i)
> if (mmu->pae_root[i] != 0)

I was about to comment on how weird this check is since pae_root can also be
INVALID_PAGE but that case is handled in mmu_free_root_page... but then I
realized that you're already addressing that problem in patch 7.

> mmu_free_root_page(kvm,
> &mmu->pae_root[i],
> &invalid_list);
> - mmu->root_hpa = INVALID_PAGE;
> }
> + mmu->root_hpa = INVALID_PAGE;
> mmu->root_pgd = 0;
> }
>
> @@ -3306,9 +3306,23 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
> * the shadow page table may be a PAE or a long mode page table.
> */
> pm_mask = PT_PRESENT_MASK;
> - if (vcpu->arch.mmu->shadow_root_level == PT64_ROOT_4LEVEL)
> + if (vcpu->arch.mmu->shadow_root_level == PT64_ROOT_4LEVEL) {
> pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
>
> + /*
> +* Allocate the page for the PDPTEs when shadowing 32-bit NPT
> +* with 64-bit only when needed. Unlike 32-bit NPT, it doesn't
> +* need to be in low mem. See also lm_root below.
> +*/
> + if (!vcpu->arch.mmu->pae_root) {
> + WARN_ON_ONCE(!tdp_enabled);
> +
> + vcpu->arch.mmu->pae_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
> + if (!vcpu->arch.mmu->pae_root)
> + return -ENOMEM;
> + }
> + }
> +
> for (i = 0; i < 4; ++i) {
> MMU_WARN_ON(VALID_PAGE(vcpu->arch.mmu->pae_root[i]));
> if (vcpu->arch.mmu->root_level == PT32E_ROOT_LEVEL) {
> @@ -3331,21 +3345,19 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
> vcpu->arch.mmu->root_hpa = __pa(vcpu->arch.mmu->pae_root);
>
> /*
> -* If we shadow a 32 bit page table with a long mode page
> -* table we enter this path.
> +* When shadowing 32-bit or PAE NPT with 64-bit NPT, the PML4 and PDP
> +* tables are allocated and initialized at MMU creation as there is no
> +* equivalent level in the guest's NPT to shadow. Allocate the tables
> +* on demand, as running a 32-bit L1 VMM is very rare. The PDP is
> +* handled above (to share logic with PAE), deal with the PML4 here.
> */
> if (vcpu->arch.mmu->shadow_root_level == PT64_ROOT_4LEVEL) {
> if (vcpu->arch.mmu->lm_root == NULL) {
> - /*
> -* The additional page necessary for this is only
> -* allocated on demand.
> -*/
> -
> u64 *lm_root;
>
> lm_root = (void*)ge
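For completeness, the on-demand allocation in the quoted hunk boils down to
the pattern below. The helper name mmu_alloc_pae_root_if_needed() is made up
for illustration; it just isolates the check-then-allocate step:

        /* Illustrative sketch only, not the patch's actual structure. */
        static int mmu_alloc_pae_root_if_needed(struct kvm_vcpu *vcpu)
        {
                struct kvm_mmu *mmu = vcpu->arch.mmu;

                if (mmu->pae_root)
                        return 0;

                /* Only expected when shadowing NPT, per the quoted WARN_ON_ONCE(). */
                WARN_ON_ONCE(!tdp_enabled);

                mmu->pae_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
                if (!mmu->pae_root)
                        return -ENOMEM;

                return 0;
        }

Per the quoted comment, this page doesn't need to be in low memory when
shadowing 32-bit NPT with 64-bit NPT, so a plain zeroed, accounted page is
sufficient.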