Re: disabling halt polling broken? (was Re: [PATCH 00/14] KVM: Halt-polling fixes, cleanups and a new stat)

2022-11-14 Thread wangyanan (Y)

Hi Sean, Paolo,

I also recently noticed the behavior change of the halt_poll_ns param.
It has now lost the ability to:
1) dynamically disable halt polling for all the running VMs
by `echo 0 > /sys`
2) dynamically adjust the halt polling interval for all the
running VMs by `echo * > /sys`

In our use cases, we usually rely on the above two abilities, and
KVM_CAP_HALT_POLL is not used yet.
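
For reference, setting a per-VM limit with KVM_CAP_HALT_POLL is done by the
VMM through KVM_ENABLE_CAP on the VM fd. A minimal userspace sketch (the
helper name is ours; error handling omitted):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Sketch: cap halt polling for a single VM at @ns nanoseconds; 0 disables
 * halt polling for this VM only. @vm_fd is the fd returned by KVM_CREATE_VM. */
static int set_vm_halt_poll_ns(int vm_fd, __u64 ns)
{
        struct kvm_enable_cap cap = {
                .cap = KVM_CAP_HALT_POLL,
                .args = { ns },
        };

        return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}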

On 2021/9/28 1:33, Sean Christopherson wrote:

On Mon, Sep 27, 2021, Paolo Bonzini wrote:

On Mon, Sep 27, 2021 at 5:17 PM Christian Borntraeger
 wrote:

So I think there are two possibilities that make sense:

* track what is using KVM_CAP_HALT_POLL, and make writes to halt_poll_ns follow 
that

What about using halt_poll_ns for those VMs that did not use KVM_CAP_HALT_POLL
and the private number for those that did?

Yes, that's what I meant.  David pointed out that doesn't allow you to
disable halt polling altogether, but for that you can always ask each
VM's userspace one by one, or just not use KVM_CAP_HALT_POLL. (Also, I
don't know about Google's usecase, but mine was actually more about
using KVM_CAP_HALT_POLL to *disable* halt polling on some VMs!).

I kinda like the idea of special-casing halt_poll_ns=0, e.g. for testing or
in-the-field mitigation if halt-polling is broken.  It'd be trivial to support, 
e.g.

Do we have any plan to repost the diff as a fix?
It would be very nice if this issue could be solved.

Besides, I think we may need some documentation for users describing
how halt_poll_ns interacts with KVM_CAP_HALT_POLL, e.g. in
"Documentation/virt/guest-halt-polling.rst".

@@ -3304,19 +3304,23 @@ void kvm_vcpu_halt(struct kvm_vcpu *vcpu)
 update_halt_poll_stats(vcpu, start, poll_end, !waited);

 if (halt_poll_allowed) {
+   max_halt_poll_ns = vcpu->kvm->max_halt_poll_ns;
+   if (!max_halt_poll_ns || !halt_poll_ns)  <-- squish the max if halt_poll_ns==0
+   max_halt_poll_ns = halt_poll_ns;
+

Does this mean that KVM_CAP_HALT_POLL will not be able to
disable halt polling for a VM individually when halt_poll_ns !=0?

 if (!vcpu_valid_wakeup(vcpu)) {
 shrink_halt_poll_ns(vcpu);
-   } else if (vcpu->kvm->max_halt_poll_ns) {
+   } else if (max_halt_poll_ns) {
 if (halt_ns <= vcpu->halt_poll_ns)
 ;
 /* we had a long block, shrink polling */
 else if (vcpu->halt_poll_ns &&
-halt_ns > vcpu->kvm->max_halt_poll_ns)
+halt_ns > max_halt_poll_ns)
 shrink_halt_poll_ns(vcpu);
 /* we had a short halt and our poll time is too small */
-   else if (vcpu->halt_poll_ns < vcpu->kvm->max_halt_poll_ns &&
-halt_ns < vcpu->kvm->max_halt_poll_ns)
-   grow_halt_poll_ns(vcpu);
+   else if (vcpu->halt_poll_ns < max_halt_poll_ns &&
+halt_ns < max_halt_poll_ns)
+   grow_halt_poll_ns(vcpu, max_halt_poll_ns);
 } else {
 vcpu->halt_poll_ns = 0;
 }

Thanks,
Yanan


Re: [PATCH 1/2] KVM: selftests: Allow >1 guest mode in access_tracking_perf_test

2022-11-14 Thread Oliver Upton
On Mon, Nov 14, 2022 at 05:12:48PM +, Sean Christopherson wrote:
> On Fri, Nov 11, 2022, Oliver Upton wrote:
> > As the name implies, for_each_guest_mode() will run the test case for
> > all supported guest addressing modes. On x86 that doesn't amount to
> > anything, but arm64 can handle 4K, 16K, and 64K page sizes on supporting
> > hardware.
> > 
> > Blindly attempting to run access_tracking_perf_test on arm64 stalls on
> > the second test case, as the 'done' global remains set between test
> > iterations. Clear it after VM teardown in anticipation of a subsequent
> > test case.
> > 
> > Signed-off-by: Oliver Upton 
> > ---
> >  tools/testing/selftests/kvm/access_tracking_perf_test.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c 
> > b/tools/testing/selftests/kvm/access_tracking_perf_test.c
> > index 76c583a07ea2..4da066479e0a 100644
> > --- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
> > +++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
> > @@ -326,6 +326,9 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> >  
> > perf_test_join_vcpu_threads(nr_vcpus);
> > perf_test_destroy_vm(vm);
> > +
> > +   /* Clear done in anticipation of testing another guest mode */
> > +   done = false;
> 
> Can we fix this in the so called "perf_test" infrastructure?
> memslot_modification_stress_test.c has the same bug just inverted (see 
> run_vcpus).

Yeah, sounds good to me. I'll wrap your diff up in a patch here in a
bit.

--
Thanks,
Oliver

> ---
>  .../selftests/kvm/access_tracking_perf_test.c | 28 +--
>  .../selftests/kvm/include/perf_test_util.h|  3 ++
>  .../selftests/kvm/lib/perf_test_util.c|  3 ++
>  .../kvm/memslot_modification_stress_test.c|  6 +---
>  4 files changed, 14 insertions(+), 26 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c 
> b/tools/testing/selftests/kvm/access_tracking_perf_test.c
> index 76c583a07ea2..786bc62a2c79 100644
> --- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
> +++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
> @@ -58,9 +58,6 @@ static enum {
>   ITERATION_MARK_IDLE,
>  } iteration_work;
>  
> -/* Set to true when vCPU threads should exit. */
> -static bool done;
> -
>  /* The iteration that was last completed by each vCPU. */
>  static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
>  
> @@ -206,28 +203,20 @@ static void assert_ucall(struct kvm_vcpu *vcpu, 
> uint64_t expected_ucall)
>   expected_ucall, actual_ucall);
>  }
>  
> -static bool spin_wait_for_next_iteration(int *current_iteration)
> -{
> - int last_iteration = *current_iteration;
> -
> - do {
> - if (READ_ONCE(done))
> - return false;
> -
> - *current_iteration = READ_ONCE(iteration);
> - } while (last_iteration == *current_iteration);
> -
> - return true;
> -}
> -
>  static void vcpu_thread_main(struct perf_test_vcpu_args *vcpu_args)
>  {
>   struct kvm_vcpu *vcpu = vcpu_args->vcpu;
>   struct kvm_vm *vm = perf_test_args.vm;
>   int vcpu_idx = vcpu_args->vcpu_idx;
>   int current_iteration = 0;
> + int last_iteration;
> +
> + while (!READ_ONCE(perf_test_args.stop_vcpus)) {
> + last_iteration = current_iteration;
> + do {
> + current_iteration = READ_ONCE(iteration);
> + } while (current_iteration == last_iteration);
>  
> - while (spin_wait_for_next_iteration(&current_iteration)) {
>   switch (READ_ONCE(iteration_work)) {
>   case ITERATION_ACCESS_MEMORY:
>   vcpu_run(vcpu);
> @@ -321,9 +310,6 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>   mark_memory_idle(vm, nr_vcpus);
>   access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory");
>  
> - /* Set done to signal the vCPU threads to exit */
> - done = true;
> -
>   perf_test_join_vcpu_threads(nr_vcpus);
>   perf_test_destroy_vm(vm);
>  }
> diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h 
> b/tools/testing/selftests/kvm/include/perf_test_util.h
> index eaa88df0555a..536d7c3c3f14 100644
> --- a/tools/testing/selftests/kvm/include/perf_test_util.h
> +++ b/tools/testing/selftests/kvm/include/perf_test_util.h
> @@ -40,6 +40,9 @@ struct perf_test_args {
>   /* Run vCPUs in L2 instead of L1, if the architecture supports it. */
>   bool nested;
>  
> + /* Test is done, stop running vCPUs. */
> + bool stop_vcpus;
> +
>   struct perf_test_vcpu_args vcpu_args[KVM_MAX_VCPUS];
>  };
>  
> diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c 
> b/tools/testing/selftests/kvm/lib/perf_test_util.c
> index 9618b37c66f7..ee3f499ccbd2 100644
> --- a/tools/testing/selftests/kvm/lib/perf_test_util.c
> +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
> @@ -267,6 +267,7 

Re: [RFC PATCH 01/12] KVM: arm64: Relax WARN check in stage2_make_pte()

2022-11-14 Thread Oliver Upton
Hi Ricardo,

On Sat, Nov 12, 2022 at 08:17:03AM +, Ricardo Koller wrote:
> stage2_make_pte() throws a warning when used in a non-shared walk, as PTEs
> are not "locked" when walking non-shared. Add a check so it can be used
> non-shared.
> 
> Signed-off-by: Ricardo Koller 

I would very much prefer to leave this WARN as-is. Correct me if I am
wrong, but I do not believe this warning is firing with the existing
code.

While the locking portion doesn't make a whole lot of sense for a
non-shared walk, it is also a magic value that indicates we've already
done the break side of break-before-make. If the warning fires then that
would suggest our break-before-make implementation isn't working as
expected.
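
For reference, the sentinel works roughly as below (an illustrative sketch;
the exact bit chosen for the locked encoding is an assumption here, not
copied from the series):

/* An invalid-PTE encoding reserved to mean "break done, make pending". */
#define KVM_INVALID_PTE_LOCKED          BIT(10)         /* illustrative value */

static bool stage2_pte_is_locked(kvm_pte_t pte)
{
        return !(pte & KVM_PTE_VALID) && (pte & KVM_INVALID_PTE_LOCKED);
}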

--
Thanks,
Oliver

> ---
>  arch/arm64/kvm/hyp/pgtable.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index c12462439e70..b16107bf917c 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -733,7 +733,8 @@ static void stage2_make_pte(const struct 
> kvm_pgtable_visit_ctx *ctx, kvm_pte_t n
>  {
>   struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
>  
> - WARN_ON(!stage2_pte_is_locked(*ctx->ptep));
> + if (kvm_pgtable_walk_shared(ctx))
> + WARN_ON(!stage2_pte_is_locked(*ctx->ptep));
>  
>   if (stage2_pte_is_counted(new))
>   mm_ops->get_page(ctx->ptep);
> -- 
> 2.38.1.431.g37b22c650d-goog
> 
> 


Re: [RFC PATCH 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()

2022-11-14 Thread Oliver Upton
Hi Ricardo,

On Sat, Nov 12, 2022 at 08:17:06AM +, Ricardo Koller wrote:

[...]

> +/**
> + * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs 
> pointing
> + *   to PAGE_SIZE guest pages.
> + * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
> + * @addr:Intermediate physical address from which to split.
> + * @size:Size of the range.
> + * @mc:  Cache of pre-allocated and zeroed memory from which to 
> allocate
> + *   page-table pages.
> + *
> + * @addr and the end (@addr + @size) are effectively aligned down and up to
> + * the top level huge-page block size. This is an example using 1GB
> + * huge-pages and 4KB granules.
> + *
> + *  [---input range---]
> + *  : :
> + * [--1G block pte--][--1G block pte--][--1G block pte--][--1G block pte--]
> + *  : :
> + *   [--2MB--][--2MB--][--2MB--][--2MB--]
> + *  : :
> + *   [ ][ ][:][ ][ ][ ][ ][ ][:][ ][ ][ ]
> + *  : :
> + *
> + * Return: 0 on success, negative error code on failure. Note that
> + * kvm_pgtable_stage2_split() is best effort: it tries to break as many
> + * blocks in the input range as allowed by the size of the memcache. It
> + * will fail if it wasn't able to break any block.
> + */
> +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size, 
> void *mc);
> +
>  /**
>   * kvm_pgtable_walk() - Walk a page-table.
>   * @pgt: Page-table structure initialised by kvm_pgtable_*_init().
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index d1f309128118..9c42eff6d42e 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1267,6 +1267,80 @@ static int stage2_create_removed(kvm_pte_t *ptep, u64 
> phys, u32 level,
>   return __kvm_pgtable_visit(, mm_ops, ptep, level);
>  }
>  
> +struct stage2_split_data {
> + struct kvm_s2_mmu   *mmu;
> + void*memcache;
> + struct kvm_pgtable_mm_ops   *mm_ops;

You can also get at mm_ops through kvm_pgtable_visit_ctx

> +};
> +
> +static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
> +enum kvm_pgtable_walk_flags visit)
> +{
> + struct stage2_split_data *data = ctx->arg;
> + struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
> + kvm_pte_t pte = ctx->old, attr, new;
> + enum kvm_pgtable_prot prot;
> + void *mc = data->memcache;
> + u32 level = ctx->level;
> + u64 phys;
> +
> + if (WARN_ON_ONCE(kvm_pgtable_walk_shared(ctx)))
> + return -EINVAL;
> +
> + /* Nothing to split at the last level */
> + if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> + return 0;
> +
> + /* We only split valid block mappings */
> + if (!kvm_pte_valid(pte) || kvm_pte_table(pte, ctx->level))
> + return 0;
> +
> + phys = kvm_pte_to_phys(pte);
> + prot = kvm_pgtable_stage2_pte_prot(pte);
> + stage2_set_prot_attr(data->mmu->pgt, prot, );
> +
> + /*
> +  * Eager page splitting is best-effort, so we can ignore the error.
> +  * The returned PTE (new) will be valid even if this call returns
> +  * error: new will be a single (big) block PTE.  The only issue is
> +  * that it will affect dirty logging performance, as the huge-pages
> +  * will have to be split on fault, and so we WARN.
> +  */
> + WARN_ON(stage2_create_removed(, phys, level, attr, mc, mm_ops));

I don't believe we should warn in this case, at least not
unconditionally. ENOMEM is an expected outcome, for example.

Additionally, I believe you'll want to bail out at this point to avoid
installing a potentially garbage PTE as well.

> + stage2_put_pte(ctx, data->mmu, mm_ops);

Ah, I see why you've relaxed the WARN in patch 1 now.

I would recommend you follow the break-before-make pattern and use the
helpers here as well. stage2_try_break_pte() will demote the store to
WRITE_ONCE() if called from a non-shared context.

Then the WARN will behave as expected in stage2_make_pte().
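
Roughly, assuming the helper signatures from the parallel faults series (a
sketch only; error handling trimmed):

        if (!stage2_try_break_pte(ctx, data->mmu))
                return -EAGAIN;

        /* Build the replacement table while the old PTE holds the locked marker. */
        ret = stage2_create_removed(&new, phys, level, attr, mc, mm_ops);
        if (ret)
                return ret;

        /* Publishes the new table with smp_store_release(), and the WARN is happy. */
        stage2_make_pte(ctx, new);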

> + /*
> +  * Note, the contents of the page table are guaranteed to be made
> +  * visible before the new PTE is assigned because
> +  * stage2_make__pte() writes the PTE using smp_store_release().

typo: stage2_make_pte()

--
Thanks,
Oliver


[PATCH 1/1] KVM: arm64: Use a separate function for hyp stage-1 walks

2022-11-14 Thread Oliver Upton
A subsequent change to the page table walkers adds RCU protection for
walking stage-2 page tables. KVM uses a global lock to serialize hyp
stage-1 walks, so RCU protection buys nothing for hyp stage-1 walkers.

Add a new helper, kvm_pgtable_hyp_walk(), for use when walking hyp
stage-1 tables. Call directly into __kvm_pgtable_walk() as table
concatenation is not a supported feature at stage-1.

No functional change intended.

Signed-off-by: Oliver Upton 
---
 arch/arm64/include/asm/kvm_pgtable.h | 24 
 arch/arm64/kvm/hyp/nvhe/setup.c  |  2 +-
 arch/arm64/kvm/hyp/pgtable.c | 18 +++---
 3 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h 
b/arch/arm64/include/asm/kvm_pgtable.h
index a874ce0ce7b5..43b2f1882e11 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -596,6 +596,30 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 
addr, u64 size);
 int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
 struct kvm_pgtable_walker *walker);
 
+/**
+ * kvm_pgtable_hyp_walk() - Walk a hyp stage-1 page-table.
+ * @pgt:   Page-table structure initialized by kvm_pgtable_hyp_init().
+ * @addr:  Input address for the start of the walk.
+ * @size:  Size of the range to walk.
+ * @walker:Walker callback description.
+ *
+ * The offset of @addr within a page is ignored and @size is rounded-up to
+ * the next page boundary.
+ *
+ * The walker will walk the page-table entries corresponding to the input
+ * address range specified, visiting entries according to the walker flags.
+ * Invalid entries are treated as leaf entries. Leaf entries are reloaded
+ * after invoking the walker callback, allowing the walker to descend into
+ * a newly installed table.
+ *
+ * Returning a negative error code from the walker callback function will
+ * terminate the walk immediately with the same error code.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int kvm_pgtable_hyp_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
+struct kvm_pgtable_walker *walker);
+
 /**
  * kvm_pgtable_get_leaf() - Walk a page-table and retrieve the leaf entry
  * with its level.
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index 1068338d77f3..55eeb3ed1891 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -246,7 +246,7 @@ static int finalize_host_mappings(void)
struct memblock_region *reg = _memory[i];
u64 start = (u64)hyp_phys_to_virt(reg->base);
 
-   ret = kvm_pgtable_walk(_pgtable, start, reg->size, 
);
+   ret = kvm_pgtable_hyp_walk(_pgtable, start, reg->size, 
);
if (ret)
return ret;
}
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 5bca9610d040..385fa1051b5d 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -335,6 +335,18 @@ int kvm_pgtable_get_leaf(struct kvm_pgtable *pgt, u64 addr,
return ret;
 }
 
+int kvm_pgtable_hyp_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
+struct kvm_pgtable_walker *walker)
+{
+   struct kvm_pgtable_walk_data data = {
+   .walker = walker,
+   .addr   = ALIGN_DOWN(addr, PAGE_SIZE),
+   .end= PAGE_ALIGN(addr + size),
+   };
+
+   return __kvm_pgtable_walk(&data, pgt->mm_ops, pgt->pgd, pgt->start_level);
+}
+
 struct hyp_map_data {
u64 phys;
kvm_pte_t   attr;
@@ -454,7 +466,7 @@ int kvm_pgtable_hyp_map(struct kvm_pgtable *pgt, u64 addr, 
u64 size, u64 phys,
if (ret)
return ret;
 
-   ret = kvm_pgtable_walk(pgt, addr, size, );
+   ret = kvm_pgtable_hyp_walk(pgt, addr, size, );
dsb(ishst);
isb();
return ret;
@@ -512,7 +524,7 @@ u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 
addr, u64 size)
if (!pgt->mm_ops->page_count)
return 0;
 
-   kvm_pgtable_walk(pgt, addr, size, );
+   kvm_pgtable_hyp_walk(pgt, addr, size, );
return unmapped;
 }
 
@@ -557,7 +569,7 @@ void kvm_pgtable_hyp_destroy(struct kvm_pgtable *pgt)
.flags  = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
};
 
-   WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), ));
+   WARN_ON(kvm_pgtable_hyp_walk(pgt, 0, BIT(pgt->ia_bits), ));
pgt->mm_ops->put_page(kvm_dereference_pteref(pgt->pgd, false));
pgt->pgd = NULL;
 }
-- 
2.38.1.431.g37b22c650d-goog



[PATCH 0/1] KVM: arm64: Skip RCU protection for hyp stage-1

2022-11-14 Thread Oliver Upton
Whelp, that was quick.

Marek reports [1] that the parallel faults series leads to a kernel BUG
when initializing the hyp stage-1 page tables. Work around the issue by
never acquiring the RCU read lock when walking hyp stage-1. This is safe
because hyp stage-1 is protected by a spinlock (pKVM) or mutex (regular
nVHE).

The included patch applies to the parallel faults series. To avoid
breaking bisection, the patch should immediately precede commit
c3119ae45dfb ("KVM: arm64: Protect stage-2 traversal with RCU"). Or, if
preferred, I can respin the whole series in the correct order.

Tested with the pKVM isolated vCPU state series [2] merged on top, w/
kvm-arm.mode={nvhe,protected} on an Ampere Altra system.

Cc: Marek Szyprowski 

[1]: 
https://lore.kernel.org/kvmarm/d9854277-0411-8169-9e8b-68d15e4c0...@samsung.com/
[2]: 
https://lore.kernel.org/linux-arm-kernel/20221110190259.26861-1-w...@kernel.org/

Oliver Upton (1):
  KVM: arm64: Use a separate function for hyp stage-1 walks

 arch/arm64/include/asm/kvm_pgtable.h | 24 
 arch/arm64/kvm/hyp/nvhe/setup.c  |  2 +-
 arch/arm64/kvm/hyp/pgtable.c | 18 +++---
 3 files changed, 40 insertions(+), 4 deletions(-)

-- 
2.38.1.431.g37b22c650d-goog



Re: [RFC PATCH 06/12] KVM: arm64: Split block PTEs without using break-before-make

2022-11-14 Thread Oliver Upton
On Sat, Nov 12, 2022 at 08:17:08AM +, Ricardo Koller wrote:
> Breaking a huge-page block PTE into an equivalent table of smaller PTEs
> does not require using break-before-make (BBM) when FEAT_BBM level 2 is
> implemented. Add the respective check for eager page splitting and avoid
> using BBM.
> 
> Also take care of possible Conflict aborts.  According to the rules
> specified in the Arm ARM (DDI 0487H.a) section "Support levels for changing
> block size" D5.10.1, this can result in a Conflict abort. So, handle it by
> clearing all VM TLB entries.
> 
> Signed-off-by: Ricardo Koller 

I'd suggest adding the TLB conflict abort handler as a separate commit
prior to actually relaxing break-before-make requirements.

> ---
>  arch/arm64/include/asm/esr.h |  1 +
>  arch/arm64/include/asm/kvm_arm.h |  1 +
>  arch/arm64/kvm/hyp/pgtable.c | 10 +-
>  arch/arm64/kvm/mmu.c |  6 ++
>  4 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
> index 15b34fbfca66..6f5b976396e7 100644
> --- a/arch/arm64/include/asm/esr.h
> +++ b/arch/arm64/include/asm/esr.h
> @@ -114,6 +114,7 @@
>  #define ESR_ELx_FSC_ACCESS   (0x08)
>  #define ESR_ELx_FSC_FAULT(0x04)
>  #define ESR_ELx_FSC_PERM (0x0C)
> +#define ESR_ELx_FSC_CONFLICT (0x30)
>  
>  /* ISS field definitions for Data Aborts */
>  #define ESR_ELx_ISV_SHIFT(24)
> diff --git a/arch/arm64/include/asm/kvm_arm.h 
> b/arch/arm64/include/asm/kvm_arm.h
> index 0df3fc3a0173..58e7cbe3c250 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -333,6 +333,7 @@
>  #define FSC_SECC_TTW1(0x1d)
>  #define FSC_SECC_TTW2(0x1e)
>  #define FSC_SECC_TTW3(0x1f)
> +#define FSC_CONFLICT ESR_ELx_FSC_CONFLICT
>  
>  /* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
>  #define HPFAR_MASK   (~UL(0xf))
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 9c42eff6d42e..36b81df5687e 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1267,6 +1267,11 @@ static int stage2_create_removed(kvm_pte_t *ptep, u64 
> phys, u32 level,
>   return __kvm_pgtable_visit(, mm_ops, ptep, level);
>  }
>  
> +static bool stage2_has_bbm_level2(void)
> +{
> + return cpus_have_const_cap(ARM64_HAS_STAGE2_BBM2);
> +}
> +
>  struct stage2_split_data {
>   struct kvm_s2_mmu   *mmu;
>   void*memcache;
> @@ -1308,7 +1313,10 @@ static int stage2_split_walker(const struct 
> kvm_pgtable_visit_ctx *ctx,
>*/
>   WARN_ON(stage2_create_removed(, phys, level, attr, mc, mm_ops));
>  
> - stage2_put_pte(ctx, data->mmu, mm_ops);
> + if (stage2_has_bbm_level2())
> + mm_ops->put_page(ctx->ptep);
> + else
> + stage2_put_pte(ctx, data->mmu, mm_ops);
>  
>   /*
>* Note, the contents of the page table are guaranteed to be made
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 8f26c65693a9..318f7b0aa20b 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1481,6 +1481,12 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>   return 1;
>   }
>  
> + /* Conflict abort? */
> + if (fault_status == FSC_CONFLICT) {
> + kvm_flush_remote_tlbs(vcpu->kvm);

You don't need to perform a broadcasted invalidation in this case. A
local invalidation using the guest's VMID should suffice.

--
Thanks,
Oliver


Re: [RFC PATCH 02/12] KVM: arm64: Allow visiting block PTEs in post-order

2022-11-14 Thread Oliver Upton
On Sat, Nov 12, 2022 at 08:17:04AM +, Ricardo Koller wrote:
> The page table walker does not visit block PTEs in post-order. But there
> are some cases where doing so would be beneficial, for example: breaking a
> 1G block PTE into a full tree in post-order avoids visiting the new tree.
> 
> Allow post order visits of block PTEs. This will be used in a subsequent
> commit for eagerly breaking huge pages.
> 
> Signed-off-by: Ricardo Koller 
> ---
>  arch/arm64/include/asm/kvm_pgtable.h |  4 ++--
>  arch/arm64/kvm/hyp/nvhe/setup.c  |  2 +-
>  arch/arm64/kvm/hyp/pgtable.c | 25 -
>  3 files changed, 15 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h 
> b/arch/arm64/include/asm/kvm_pgtable.h
> index e2edeed462e8..d2e4a5032146 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -255,7 +255,7 @@ struct kvm_pgtable {
>   *   entries.
>   * @KVM_PGTABLE_WALK_TABLE_PRE:  Visit table entries before their
>   *   children.
> - * @KVM_PGTABLE_WALK_TABLE_POST: Visit table entries after their
> + * @KVM_PGTABLE_WALK_POST:   Visit leaf or table entries after their
>   *   children.

It is not immediately obvious from this change alone that promoting the
post-order traversal of every walker to cover leaf + table PTEs is safe.

Have you considered using a flag for just leaf post-order visits?
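
For instance, something along these lines (hypothetical naming and bit
value, not taken from the posted series):

#define KVM_PGTABLE_WALK_LEAF_POST      BIT(4)  /* illustrative only */

static bool kvm_pgtable_walk_continue_post(const struct kvm_pgtable_walker *walker,
                                           kvm_pte_t pte, u32 level)
{
        if (kvm_pte_table(pte, level))
                return walker->flags & KVM_PGTABLE_WALK_TABLE_POST;

        return walker->flags & KVM_PGTABLE_WALK_LEAF_POST;
}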

--
Thanks,
Oliver


Re: [RFC PATCH 00/12] KVM: arm64: Eager huge-page splitting for dirty-logging

2022-11-14 Thread Oliver Upton
Hi Ricardo,

On Sat, Nov 12, 2022 at 08:17:02AM +, Ricardo Koller wrote:
> Hi,
> 
> I'm sending this RFC mainly to get some early feedback on the approach used
> for implementing "Eager Page Splitting" on ARM.  "Eager Page Splitting"
> improves the performance of dirty-logging (used in live migrations) when
> guest memory is backed by huge-pages.  It's an optimization used in Google
> Cloud since 2016 on x86, and for the last couple of months on ARM.
> 
> I tried multiple ways of implementing this optimization on ARM: from
> completely reusing the stage2 mapper, to implementing a new walker from
> scratch, and some versions in between. This RFC is one of those in
> between. They all have similar performance benefits, based on some light
> performance testing (mainly dirty_log_perf_test).
> 
> Background and motivation
> =
> Dirty logging is typically used for live-migration iterative copying.  KVM
> implements dirty-logging at the PAGE_SIZE granularity (will refer to 4K
> pages from now on).  It does it by faulting on write-protected 4K pages.
> Therefore, enabling dirty-logging on a huge-page requires breaking it into
> 4K pages in the first place.  KVM does this breaking on fault, and because
> it's in the critical path it only maps the 4K page that faulted; every
> other 4K page is left unmapped.  This is not great for performance on ARM
> for a couple of reasons:
> 
> - Splitting on fault can halt vcpus for milliseconds in some
>   implementations. Splitting a block PTE requires using a broadcasted TLB
>   invalidation (TLBI) for every huge-page (due to the break-before-make
>   requirement). Note that x86 doesn't need this. We observed some
>   implementations that take milliseconds to complete broadcasted TLBIs
>   when done in parallel from multiple vcpus.  And that's exactly what
>   happens when doing it on fault: multiple vcpus fault at the same time
>   triggering TLBIs in parallel.
> 
> - Read intensive guest workloads end up paying for dirty-logging.  Only
>   mapping the faulting 4K page means that all the other pages that were
>   part of the huge-page will now be unmapped. The effect is that any
>   access, including reads, now has to fault.
> 
> Eager Page Splitting (on ARM)
> =
> Eager Page Splitting fixes the above two issues by eagerly splitting
> huge-pages when enabling dirty logging. The goal is to avoid doing it while
> faulting on write-protected pages. This is what the TDP MMU does for x86
> [0], except that x86 does it for different reasons: to avoid grabbing the
> MMU lock on fault. Note that taking care of write-protection faults still
> requires grabbing the MMU lock on ARM, but not on x86 (with the
> fast_page_fault path).
> 
> An additional benefit of eagerly splitting huge-pages is that it can be
> done in a controlled way (e.g., via an IOCTL). This series provides two
> knobs for doing it, just like its x86 counterpart: when enabling dirty
> logging, and when using the KVM_CLEAR_DIRTY_LOG ioctl. The benefit of doing
> it on KVM_CLEAR_DIRTY_LOG is that this ioctl takes ranges, and not complete
> memslots like when enabling dirty logging. This means that the cost of
> splitting (mainly broadcasted TLBIs) can be throttled: split a range, wait
> for a bit, split another range, etc. The benefits of this approach were
> presented by Oliver Upton at KVM Forum 2022 [1].
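
For illustration, the throttled, range-based use of KVM_CLEAR_DIRTY_LOG
described above could look like this on the VMM side; the helper, chunk size
and pause are example values, not taken from this series:

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Sketch only: clear dirty state (and hence split/write-protect) in fixed
 * chunks with a pause in between, rather than one call for a whole memslot.
 * The caller passes a bitmap whose bit 0 corresponds to each chunk's
 * first_page. */
#define CHUNK_PAGES     (256ULL * 1024)         /* 1GiB of 4KiB pages */

static void clear_dirty_log_throttled(int vm_fd, __u32 slot, __u64 slot_pages,
                                      void *chunk_bitmap)
{
        struct kvm_clear_dirty_log clear = { .slot = slot };
        __u64 first;

        for (first = 0; first < slot_pages; first += CHUNK_PAGES) {
                clear.first_page = first;
                clear.num_pages = slot_pages - first < CHUNK_PAGES ?
                                  slot_pages - first : CHUNK_PAGES;
                clear.dirty_bitmap = chunk_bitmap;

                ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
                usleep(1000);   /* spread out the broadcast TLBIs */
        }
}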
> 
> Implementation
> ==
> Patches 1-4 add a pgtable utility function for splitting huge block PTEs:
> kvm_pgtable_stage2_split(). Patches 5-6 add support for not doing
> break-before-make on huge-page breaking when FEAT_BBM level 2 is supported.

I would suggest you split up FEAT_BBM=2 and eager page splitting into
two separate series, if possible. IMO, the eager page split is easier to
reason about if it follows the existing pattern of break-before-make.

--
Thanks,
Oliver


Re: [PATCH v5 08/14] KVM: arm64: Protect stage-2 traversal with RCU

2022-11-14 Thread Oliver Upton
Hi Marek,

On Mon, Nov 14, 2022 at 03:29:14PM +0100, Marek Szyprowski wrote:
> This patch landed in today's linux-next (20221114) as commit 
> c3119ae45dfb ("KVM: arm64: Protect stage-2 traversal with RCU"). 
> Unfortunately it introduces the following warning:

Thanks for the bug report :) I had failed to test nVHE in the past few
revisions of this series.

> --->8---
> 
> kvm [1]: IPA Size Limit: 40 bits
> BUG: sleeping function called from invalid context at 
> include/linux/sched/mm.h:274
> in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
> preempt_count: 0, expected: 0
> RCU nest depth: 1, expected: 0
> 2 locks held by swapper/0/1:
>   #0: 8a8a44d0 (kvm_hyp_pgd_mutex){+.+.}-{3:3}, at: 
> __create_hyp_mappings+0x80/0xc4
>   #1: 8a927720 (rcu_read_lock){}-{1:2}, at: 
> kvm_pgtable_walk+0x0/0x1f4
> CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.1.0-rc3+ #5918
> Hardware name: Raspberry Pi 3 Model B (DT)
> Call trace:
>   dump_backtrace.part.0+0xe4/0xf0
>   show_stack+0x18/0x40
>   dump_stack_lvl+0x8c/0xb8
>   dump_stack+0x18/0x34
>   __might_resched+0x178/0x220
>   __might_sleep+0x48/0xa0
>   prepare_alloc_pages+0x178/0x1a0
>   __alloc_pages+0x9c/0x109c
>   alloc_page_interleave+0x1c/0xc4
>   alloc_pages+0xec/0x160
>   get_zeroed_page+0x1c/0x44
>   kvm_hyp_zalloc_page+0x14/0x20
>   hyp_map_walker+0xd4/0x134
>   kvm_pgtable_visitor_cb.isra.0+0x38/0x5c
>   __kvm_pgtable_walk+0x1a4/0x220
>   kvm_pgtable_walk+0x104/0x1f4
>   kvm_pgtable_hyp_map+0x80/0xc4
>   __create_hyp_mappings+0x9c/0xc4
>   kvm_mmu_init+0x144/0x1cc
>   kvm_arch_init+0xe4/0xef4
>   kvm_init+0x3c/0x3d0
>   arm_init+0x20/0x30
>   do_one_initcall+0x74/0x400
>   kernel_init_freeable+0x2e0/0x350
>   kernel_init+0x24/0x130
>   ret_from_fork+0x10/0x20
> kvm [1]: Hyp mode initialized successfully
> 
> --->8
> 
> It looks like more changes in the KVM code are needed to use RCU for that
> code.

Right, the specific issue is that while the stage-2 walkers preallocate
any table memory they may need, the hyp walkers do not and allocate
inline.

As hyp stage-1 is protected by a spinlock there is no actual need for
RCU in that case. I'll post something later on today that addresses the
issue.

--
Thanks,
Oliver


Re: [PATCH 1/2] KVM: selftests: Allow >1 guest mode in access_tracking_perf_test

2022-11-14 Thread Sean Christopherson
On Fri, Nov 11, 2022, Oliver Upton wrote:
> As the name implies, for_each_guest_mode() will run the test case for
> all supported guest addressing modes. On x86 that doesn't amount to
> anything, but arm64 can handle 4K, 16K, and 64K page sizes on supporting
> hardware.
> 
> Blindly attempting to run access_tracking_perf_test on arm64 stalls on
> the second test case, as the 'done' global remains set between test
> iterations. Clear it after VM teardown in anticipation of a subsequent
> test case.
> 
> Signed-off-by: Oliver Upton 
> ---
>  tools/testing/selftests/kvm/access_tracking_perf_test.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c 
> b/tools/testing/selftests/kvm/access_tracking_perf_test.c
> index 76c583a07ea2..4da066479e0a 100644
> --- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
> +++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
> @@ -326,6 +326,9 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>  
>   perf_test_join_vcpu_threads(nr_vcpus);
>   perf_test_destroy_vm(vm);
> +
> + /* Clear done in anticipation of testing another guest mode */
> + done = false;

Can we fix this in the so called "perf_test" infrastructure?
memslot_modification_stress_test.c has the same bug just inverted (see 
run_vcpus).

E.g. (compile tested only)

---
 .../selftests/kvm/access_tracking_perf_test.c | 28 +--
 .../selftests/kvm/include/perf_test_util.h|  3 ++
 .../selftests/kvm/lib/perf_test_util.c|  3 ++
 .../kvm/memslot_modification_stress_test.c|  6 +---
 4 files changed, 14 insertions(+), 26 deletions(-)

diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c 
b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 76c583a07ea2..786bc62a2c79 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -58,9 +58,6 @@ static enum {
ITERATION_MARK_IDLE,
 } iteration_work;
 
-/* Set to true when vCPU threads should exit. */
-static bool done;
-
 /* The iteration that was last completed by each vCPU. */
 static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
 
@@ -206,28 +203,20 @@ static void assert_ucall(struct kvm_vcpu *vcpu, uint64_t 
expected_ucall)
expected_ucall, actual_ucall);
 }
 
-static bool spin_wait_for_next_iteration(int *current_iteration)
-{
-   int last_iteration = *current_iteration;
-
-   do {
-   if (READ_ONCE(done))
-   return false;
-
-   *current_iteration = READ_ONCE(iteration);
-   } while (last_iteration == *current_iteration);
-
-   return true;
-}
-
 static void vcpu_thread_main(struct perf_test_vcpu_args *vcpu_args)
 {
struct kvm_vcpu *vcpu = vcpu_args->vcpu;
struct kvm_vm *vm = perf_test_args.vm;
int vcpu_idx = vcpu_args->vcpu_idx;
int current_iteration = 0;
+   int last_iteration;
+
+   while (!READ_ONCE(perf_test_args.stop_vcpus)) {
+   last_iteration = current_iteration;
+   do {
+   current_iteration = READ_ONCE(iteration);
+   } while (current_iteration == last_iteration);
 
-   while (spin_wait_for_next_iteration(&current_iteration)) {
switch (READ_ONCE(iteration_work)) {
case ITERATION_ACCESS_MEMORY:
vcpu_run(vcpu);
@@ -321,9 +310,6 @@ static void run_test(enum vm_guest_mode mode, void *arg)
mark_memory_idle(vm, nr_vcpus);
access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory");
 
-   /* Set done to signal the vCPU threads to exit */
-   done = true;
-
perf_test_join_vcpu_threads(nr_vcpus);
perf_test_destroy_vm(vm);
 }
diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h 
b/tools/testing/selftests/kvm/include/perf_test_util.h
index eaa88df0555a..536d7c3c3f14 100644
--- a/tools/testing/selftests/kvm/include/perf_test_util.h
+++ b/tools/testing/selftests/kvm/include/perf_test_util.h
@@ -40,6 +40,9 @@ struct perf_test_args {
/* Run vCPUs in L2 instead of L1, if the architecture supports it. */
bool nested;
 
+   /* Test is done, stop running vCPUs. */
+   bool stop_vcpus;
+
struct perf_test_vcpu_args vcpu_args[KVM_MAX_VCPUS];
 };
 
diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c 
b/tools/testing/selftests/kvm/lib/perf_test_util.c
index 9618b37c66f7..ee3f499ccbd2 100644
--- a/tools/testing/selftests/kvm/lib/perf_test_util.c
+++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
@@ -267,6 +267,7 @@ void perf_test_start_vcpu_threads(int nr_vcpus,
 
vcpu_thread_fn = vcpu_fn;
WRITE_ONCE(all_vcpu_threads_running, false);
+   WRITE_ONCE(perf_test_args.stop_vcpus, false);
 
for (i = 0; i < nr_vcpus; i++) {
struct vcpu_thread *vcpu = _threads[i];
@@ -289,6 

Re: [PATCH v5 08/14] KVM: arm64: Protect stage-2 traversal with RCU

2022-11-14 Thread Marek Szyprowski
Hi Oliver,

On 07.11.2022 22:56, Oliver Upton wrote:
> Use RCU to safely walk the stage-2 page tables in parallel. Acquire and
> release the RCU read lock when traversing the page tables. Defer the
> freeing of table memory to an RCU callback. Indirect the calls into RCU
> and provide stubs for hypervisor code, as RCU is not available in such a
> context.
>
> The RCU protection doesn't amount to much at the moment, as readers are
> already protected by the read-write lock (all walkers that free table
> memory take the write lock). Nonetheless, a subsequent change will
> further relax the locking requirements around the stage-2 MMU, thereby
> depending on RCU.
>
> Signed-off-by: Oliver Upton 

This patch landed in today's linux-next (20221114) as commit 
c3119ae45dfb ("KVM: arm64: Protect stage-2 traversal with RCU"). 
Unfortunately it introduces the following warning:

--->8---

kvm [1]: IPA Size Limit: 40 bits
BUG: sleeping function called from invalid context at 
include/linux/sched/mm.h:274
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
2 locks held by swapper/0/1:
  #0: 8a8a44d0 (kvm_hyp_pgd_mutex){+.+.}-{3:3}, at: 
__create_hyp_mappings+0x80/0xc4
  #1: 8a927720 (rcu_read_lock){}-{1:2}, at: 
kvm_pgtable_walk+0x0/0x1f4
CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.1.0-rc3+ #5918
Hardware name: Raspberry Pi 3 Model B (DT)
Call trace:
  dump_backtrace.part.0+0xe4/0xf0
  show_stack+0x18/0x40
  dump_stack_lvl+0x8c/0xb8
  dump_stack+0x18/0x34
  __might_resched+0x178/0x220
  __might_sleep+0x48/0xa0
  prepare_alloc_pages+0x178/0x1a0
  __alloc_pages+0x9c/0x109c
  alloc_page_interleave+0x1c/0xc4
  alloc_pages+0xec/0x160
  get_zeroed_page+0x1c/0x44
  kvm_hyp_zalloc_page+0x14/0x20
  hyp_map_walker+0xd4/0x134
  kvm_pgtable_visitor_cb.isra.0+0x38/0x5c
  __kvm_pgtable_walk+0x1a4/0x220
  kvm_pgtable_walk+0x104/0x1f4
  kvm_pgtable_hyp_map+0x80/0xc4
  __create_hyp_mappings+0x9c/0xc4
  kvm_mmu_init+0x144/0x1cc
  kvm_arch_init+0xe4/0xef4
  kvm_init+0x3c/0x3d0
  arm_init+0x20/0x30
  do_one_initcall+0x74/0x400
  kernel_init_freeable+0x2e0/0x350
  kernel_init+0x24/0x130
  ret_from_fork+0x10/0x20
kvm [1]: Hyp mode initialized successfully

--->8

It looks like more changes in the KVM code are needed to use RCU for that
code.

> ---
>   arch/arm64/include/asm/kvm_pgtable.h | 49 
>   arch/arm64/kvm/hyp/pgtable.c | 10 +-
>   arch/arm64/kvm/mmu.c | 14 +++-
>   3 files changed, 71 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h 
> b/arch/arm64/include/asm/kvm_pgtable.h
> index e70cf57b719e..7634b6964779 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -37,6 +37,13 @@ static inline u64 kvm_get_parange(u64 mmfr0)
>   
>   typedef u64 kvm_pte_t;
>   
> +/*
> + * RCU cannot be used in a non-kernel context such as the hyp. As such, page
> + * table walkers used in hyp do not call into RCU and instead use other
> + * synchronization mechanisms (such as a spinlock).
> + */
> +#if defined(__KVM_NVHE_HYPERVISOR__) || defined(__KVM_VHE_HYPERVISOR__)
> +
>   typedef kvm_pte_t *kvm_pteref_t;
>   
>   static inline kvm_pte_t *kvm_dereference_pteref(kvm_pteref_t pteref, bool 
> shared)
> @@ -44,6 +51,40 @@ static inline kvm_pte_t 
> *kvm_dereference_pteref(kvm_pteref_t pteref, bool shared
>   return pteref;
>   }
>   
> +static inline void kvm_pgtable_walk_begin(void) {}
> +static inline void kvm_pgtable_walk_end(void) {}
> +
> +static inline bool kvm_pgtable_walk_lock_held(void)
> +{
> + return true;
> +}
> +
> +#else
> +
> +typedef kvm_pte_t __rcu *kvm_pteref_t;
> +
> +static inline kvm_pte_t *kvm_dereference_pteref(kvm_pteref_t pteref, bool 
> shared)
> +{
> + return rcu_dereference_check(pteref, !shared);
> +}
> +
> +static inline void kvm_pgtable_walk_begin(void)
> +{
> + rcu_read_lock();
> +}
> +
> +static inline void kvm_pgtable_walk_end(void)
> +{
> + rcu_read_unlock();
> +}
> +
> +static inline bool kvm_pgtable_walk_lock_held(void)
> +{
> + return rcu_read_lock_held();
> +}
> +
> +#endif
> +
>   #define KVM_PTE_VALID   BIT(0)
>   
>   #define KVM_PTE_ADDR_MASK   GENMASK(47, PAGE_SHIFT)
> @@ -202,11 +243,14 @@ struct kvm_pgtable {
>*  children.
>* @KVM_PGTABLE_WALK_TABLE_POST:Visit table entries after their
>*  children.
> + * @KVM_PGTABLE_WALK_SHARED: Indicates the page-tables may be shared
> + *   with other software 

Re: [RFC PATCH 2/3] KVM: arm64: Allow userspace to trap SMCCC sub-ranges

2022-11-14 Thread Marc Zyngier
On Fri, 11 Nov 2022 23:39:09 +,
Oliver Upton  wrote:
> 
> On Fri, Nov 11, 2022 at 08:26:02AM +, Marc Zyngier wrote:
> > On Thu, 10 Nov 2022 21:13:54 +, Oliver Upton  
> > wrote:
> > > The goal of what I was trying to get at is that either the kernel or
> > > userspace takes ownership of a range that has an ABI, but not both. i.e.
> > > you really wouldn't want some VMM or cloud provider trapping portions of
> > > KVM's vendor-specific range while still reporting a 'vanilla' ABI at the
> > > time of discovery. Same goes for PSCI, TRNG, etc.
> > 
> > But I definitely think this is one of the major use cases. For
> > example, there is value in taking PSCI to userspace in order to
> > implement a newer version of the spec, or to support sub-features that
> > KVM doesn't (want to) implement. I don't think this changes the ABI from
> > the guest perspective.
> 
> I disagree for the implications of partially trapping the 'Vendor
> Specific Hypervisor Service'. If the UID for the range still reports KVM
> but userspace decided to add some new widget, then from the guest
> perspective that widget is now part of KVM's own ABI with the guest.

But that's what I mean by "I don't think this changes the ABI from the
guest perspective". The guest cannot know who is doing the emulation
anyway, so it is userspace's duty to preserve the illusion. At the
end of the day, this is only a configuration mechanism, and it is no
different from all other configuration bits (i.e. they need to be
identical on both side for migration).

> Trapping the whole range is a bit of a hack to work around the need for
> more complicated verification of a hypercall filter.

We already need these things for architected hypercalls. Once we have
the infrastructure, it doesn't matter anymore which range this is for.

> 
> But for everything else, I'm fine with arbitrary function filtering.
> Userspace is always welcome to shoot itself in the foot.
> 
> > pKVM also has a use case for this where userspace gets a notification
> > of the hypercall that a guest has performed to share memory.
> 
> Is that hypercall in the 'Vendor Specific Hypervisor Service' range?

Yes. It is yet another KVM hypercall.

> 
> > Communication with a TEE also is on the cards, as would be a FFA
> > implementation. All of this could be implemented in KVM, or in
> > userspace, depending what users of these misfeatures want to do.
> 
> I'm very hopeful that by forwarding all of this to userspace we can get
> out of the business of implementing every darn spec that comes along.

Good luck. All the TEEs have private, home grown APIs, and every
vendor will want to implement their own crap (i.e. there is no spec).

M.

-- 
Without deviation from the norm, progress is not possible.