Re: [PATCH v3 0/6] KVM: selftests: memslot_perf_test: aarch64 cleanup/fixes

2022-11-04 Thread Gavin Shan

On 10/25/22 7:18 AM, Maciej S. Szmigiero wrote:

On 20.10.2022 09:12, Gavin Shan wrote:

kvm/selftests/memslot_perf_test doesn't work with a 64KB-page-size host
and a 4KB-page-size guest on aarch64. In the implementation, the host and
guest page sizes have been hardcoded to 4KB. That obviously doesn't work
on aarch64, where the host and guest can each use 4KB, 16KB or 64KB pages.

This series tries to fix it. After the series is applied, the test runs
successfully with 64KB-page-size-host and 4KB-page-size-guest.

    # ./memslot_perf_test -v -s 512

While we're here, the code is cleaned up a bit in PATCH[1-3]. The
other patches are fixes to handle mismatched host/guest page
sizes.

v1: https://lore.kernel.org/kvmarm/20221014071914.227134-1-gs...@redhat.com/T/#t
v2: https://lore.kernel.org/kvmarm/20221018040454.405719-1-gs...@redhat.com/T/#t

Changelog
=========
v3:
  * Improved comments about MEM_TEST_MOVE_SIZE, which is set
    to 64KB in PATCH[v3 4/6] and finally fixed to 192KB in
    PATCH[v3 5/6].                                              (Maciej)
  * Use size instead of pages to do the comparison in
    test_memslot_move_prepare()                                 (Maciej)
  * Use tools/include/linux/sizes.h instead of inventing
    our own macros.                                             (Oliver)
v2:
  * Pick the smaller value between the ones specified by
    user or probed from KVM_CAP_NR_MEMSLOTS in PATCH[v2 3/6]    (Maciej)
  * Improved comments about MEM_TEST_MOVE_SIZE in
    PATCH[v2 4/6]                                               (Maciej)
  * Avoid mismatched guest page size after VM is started in
    prepare_vm() in PATCH[v2 4/6]                               (Maciej)
  * Fix condition to check MEM_TEST_{UNMAP, UNMAP_CHUNK}_SIZE
    in check_memory_size() in PATCH[v2 4/6]                     (Maciej)
  * Define base and huge page size in kvm_util_base.h in
    PATCH[v2 5/6]                                               (Sean)
  * Add checks on host/guest page size in check_memory_size()
    and fail early if any of them exceeds 64KB in PATCH[v2 5/6] (Maciej)


Gavin Shan (6):
   KVM: selftests: memslot_perf_test: Use data->nslots in prepare_vm()
   KVM: selftests: memslot_perf_test: Consolidate loop conditions in
 prepare_vm()
   KVM: selftests: memslot_perf_test: Probe memory slots for once
   KVM: selftests: memslot_perf_test: Support variable guest page size
   KVM: selftests: memslot_perf_test: Consolidate memory
   KVM: selftests: memslot_perf_test: Report optimal memory slots



This patch set now looks good to me, so for the whole series:
Reviewed-by: Maciej S. Szmigiero 



Thanks for your time on the reviews, Maciej. The broken test case was reported
against our downstream Linux, which means our downstream kernel needs these
improvements and fixes to make the test case work.

Thanks,
Gavin



[PATCH v8 7/7] KVM: selftests: Automate choosing dirty ring size in dirty_log_test

2022-11-04 Thread Gavin Shan
In the dirty ring case, we rely on the vcpu exiting due to a full dirty
ring. On an ARM64 system, there are 4096 host pages when the host page
size is 64KB. In this case, the vcpu never exits due to a full dirty
ring. A similar case is a 4KB page size on the host and a 64KB page size
in the guest: the vcpu keeps dirtying the same set of host pages, but the
dirty page information isn't collected in the main thread. This leads to
an infinite loop, as the following log shows.

  # ./dirty_log_test -M dirty-ring -c 65536 -m 5
  Setting log mode to: 'dirty-ring'
  Test iterations: 32, interval: 10 (ms)
  Testing guest mode: PA-bits:40,  VA-bits:48,  4K pages
  guest physical test memory offset: 0xffbffe
  vcpu stops because vcpu is kicked out...
  Notifying vcpu to continue
  vcpu continues now.
  Iteration 1 collected 576 pages
  

Fix the issue by automatically choosing the best dirty ring size, to
ensure the vcpu exits due to a full dirty ring. The option '-c' becomes
a hint for the dirty ring count, instead of its exact value.
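
As a rough worked example (not from the patch; assuming the default 1GB
test region, the default ring hint of 65536 entries, 64KB guest pages
(page_shift = 16), and ignoring the vm_adjust_num_guest_pages() rounding):

  pages = (1ul << (30 - 16)) + 3             /* 16387 test pages        */
  limit = 1 << (31 - __builtin_clz(16387))   /* 16384                   */
  count = min(16384, 65536)                  /* ring gets 16384 entries */

so the ring always ends up with fewer entries than the number of pages
being dirtied and is guaranteed to reach the full state.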

Signed-off-by: Gavin Shan 
---
 tools/testing/selftests/kvm/dirty_log_test.c | 26 +---
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c 
b/tools/testing/selftests/kvm/dirty_log_test.c
index 8758c10ec850..a87e5f78ebf1 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -24,6 +24,9 @@
 #include "guest_modes.h"
 #include "processor.h"
 
+#define DIRTY_MEM_BITS 30 /* 1G */
+#define PAGE_SHIFT_4K  12
+
 /* The memory slot index to track dirty pages */
 #define TEST_MEM_SLOT_INDEX		1
 
@@ -273,6 +276,24 @@ static bool dirty_ring_supported(void)
 
 static void dirty_ring_create_vm_done(struct kvm_vm *vm)
 {
+   uint64_t pages;
+   uint32_t limit;
+
+   /*
+* We rely on vcpu exit due to full dirty ring state. Adjust
+* the ring buffer size to ensure we're able to reach the
+* full dirty ring state.
+*/
+   pages = (1ul << (DIRTY_MEM_BITS - vm->page_shift)) + 3;
+   pages = vm_adjust_num_guest_pages(vm->mode, pages);
+   if (vm->page_size < getpagesize())
+   pages = vm_num_host_pages(vm->mode, pages);
+
+   limit = 1 << (31 - __builtin_clz(pages));
+   test_dirty_ring_count = 1 << (31 - __builtin_clz(test_dirty_ring_count));
+   test_dirty_ring_count = min(limit, test_dirty_ring_count);
+   pr_info("dirty ring count: 0x%x\n", test_dirty_ring_count);
+
/*
 * Switch to dirty ring mode after VM creation but before any
 * of the vcpu creation.
@@ -685,9 +706,6 @@ static struct kvm_vm *create_vm(enum vm_guest_mode mode, 
struct kvm_vcpu **vcpu,
return vm;
 }
 
-#define DIRTY_MEM_BITS 30 /* 1G */
-#define PAGE_SHIFT_4K  12
-
 struct test_params {
unsigned long iterations;
unsigned long interval;
@@ -830,7 +848,7 @@ static void help(char *name)
printf("usage: %s [-h] [-i iterations] [-I interval] "
   "[-p offset] [-m mode]\n", name);
puts("");
-   printf(" -c: specify dirty ring size, in number of entries\n");
+   printf(" -c: hint to dirty ring size, in number of entries\n");
printf(" (only useful for dirty-ring test; default: %"PRIu32")\n",
   TEST_DIRTY_RING_COUNT);
printf(" -i: specify iteration counts (default: %"PRIu64")\n",
-- 
2.23.0



[PATCH v8 6/7] KVM: selftests: Clear dirty ring states between two modes in dirty_log_test

2022-11-04 Thread Gavin Shan
There are two pieces of state that need to be cleared before the next
mode is executed. Otherwise, we hit the failure shown by the messages
below.

- The variable 'dirty_ring_vcpu_ring_full' is shared by the main and
  vcpu threads. It indicates whether the vcpu exited due to a full ring
  buffer. The value can be carried over from the previous mode
  (VM_MODE_P40V48_4K) to the current one (VM_MODE_P40V48_64K) when
  VM_MODE_P40V48_16K isn't supported.

- The current ring buffer index needs to be reset before the next mode
  (VM_MODE_P40V48_64K) is executed. Otherwise, a stale value is carried
  over from the previous mode (VM_MODE_P40V48_4K).

  # ./dirty_log_test -M dirty-ring
  Setting log mode to: 'dirty-ring'
  Test iterations: 32, interval: 10 (ms)
  Testing guest mode: PA-bits:40,  VA-bits:48,  4K pages
  guest physical test memory offset: 0xffbfffc000
:
  Dirtied 995328 pages
  Total bits checked: dirty (1012434), clear (7114123), track_next (966700)
  Testing guest mode: PA-bits:40,  VA-bits:48, 64K pages
  guest physical test memory offset: 0xffbffc
  vcpu stops because vcpu is kicked out...
  vcpu continues now.
  Notifying vcpu to continue
  Iteration 1 collected 0 pages
  vcpu stops because dirty ring is full...
  vcpu continues now.
  vcpu stops because dirty ring is full...
  vcpu continues now.
  vcpu stops because dirty ring is full...
   Test Assertion Failure 
  dirty_log_test.c:369: cleared == count
  pid=10541 tid=10541 errno=22 - Invalid argument
 1  0x00403087: dirty_ring_collect_dirty_pages at 
dirty_log_test.c:369
 2  0x00402a0b: log_mode_collect_dirty_pages at dirty_log_test.c:492
 3   (inlined by) run_test at dirty_log_test.c:795
 4   (inlined by) run_test at dirty_log_test.c:705
 5  0x00403a37: for_each_guest_mode at guest_modes.c:100
 6  0x00401ccf: main at dirty_log_test.c:938
 7  0x9ecd279b: ?? ??:0
 8  0x9ecd286b: ?? ??:0
 9  0x00401def: _start at ??:?
  Reset dirty pages (0) mismatch with collected (35566)

Fix the issues by clearing 'dirty_ring_vcpu_ring_full' and the ring
buffer index before the next mode is executed.
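
Since the tail of the diff below is cut off by the archive, here is a
minimal sketch of the idea (a hypothetical helper for illustration only;
the actual patch resets the state directly in run_test() and threads
'ring_buf_idx' through log_mode_collect_dirty_pages()):

	static void reset_dirty_ring_state(uint32_t *ring_buf_idx)
	{
		/* Forget any full-ring indication left over from the previous mode. */
		WRITE_ONCE(dirty_ring_vcpu_ring_full, false);

		/* Start harvesting the next vcpu's dirty ring from index 0 again. */
		*ring_buf_idx = 0;
	}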

Signed-off-by: Gavin Shan 
---
 tools/testing/selftests/kvm/dirty_log_test.c | 27 
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c 
b/tools/testing/selftests/kvm/dirty_log_test.c
index b5234d6efbe1..8758c10ec850 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -226,13 +226,15 @@ static void clear_log_create_vm_done(struct kvm_vm *vm)
 }
 
 static void dirty_log_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
- void *bitmap, uint32_t num_pages)
+ void *bitmap, uint32_t num_pages,
+ uint32_t *unused)
 {
kvm_vm_get_dirty_log(vcpu->vm, slot, bitmap);
 }
 
 static void clear_log_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
- void *bitmap, uint32_t num_pages)
+ void *bitmap, uint32_t num_pages,
+ uint32_t *unused)
 {
kvm_vm_get_dirty_log(vcpu->vm, slot, bitmap);
kvm_vm_clear_dirty_log(vcpu->vm, slot, bitmap, 0, num_pages);
@@ -329,10 +331,9 @@ static void dirty_ring_continue_vcpu(void)
 }
 
 static void dirty_ring_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
-  void *bitmap, uint32_t num_pages)
+  void *bitmap, uint32_t num_pages,
+  uint32_t *ring_buf_idx)
 {
-   /* We only have one vcpu */
-   static uint32_t fetch_index = 0;
uint32_t count = 0, cleared;
bool continued_vcpu = false;
 
@@ -349,7 +350,8 @@ static void dirty_ring_collect_dirty_pages(struct kvm_vcpu 
*vcpu, int slot,
 
/* Only have one vcpu */
count = dirty_ring_collect_one(vcpu_map_dirty_ring(vcpu),
-  slot, bitmap, num_pages, &fetch_index);
+  slot, bitmap, num_pages,
+  ring_buf_idx);
 
cleared = kvm_vm_reset_dirty_ring(vcpu->vm);
 
@@ -406,7 +408,8 @@ struct log_mode {
void (*create_vm_done)(struct kvm_vm *vm);
/* Hook to collect the dirty pages into the bitmap provided */
void (*collect_dirty_pages) (struct kvm_vcpu *vcpu, int slot,
-void *bitmap, uint32_t num_pages);
+void *bitmap, uint32_t num_pages,
+uint32_t *ring_buf_idx);
/* Hook to call when after each vcpu run */
void (*after_vcpu_run)(struct kvm_vcpu *vcpu, int ret, int err);
void (*before_vcpu_join) (void);
@@ -471,13 

[PATCH v8 5/7] KVM: selftests: Use host page size to map ring buffer in dirty_log_test

2022-11-04 Thread Gavin Shan
In vcpu_map_dirty_ring(), the guest's page size is used to figure out
the offset into the virtual area. That works fine when the host and guest
have the same page size. However, it fails on arm64 when the host and
guest page sizes differ, as the error messages below indicate.

  # ./dirty_log_test -M dirty-ring -m 7
  Setting log mode to: 'dirty-ring'
  Test iterations: 32, interval: 10 (ms)
  Testing guest mode: PA-bits:40,  VA-bits:48, 64K pages
  guest physical test memory offset: 0xffbffc
  vcpu stops because vcpu is kicked out...
  Notifying vcpu to continue
  vcpu continues now.
   Test Assertion Failure 
  lib/kvm_util.c:1477: addr == MAP_FAILED
  pid=9000 tid=9000 errno=0 - Success
  1  0x00405f5b: vcpu_map_dirty_ring at kvm_util.c:1477
  2  0x00402ebb: dirty_ring_collect_dirty_pages at dirty_log_test.c:349
  3  0x004029b3: log_mode_collect_dirty_pages at dirty_log_test.c:478
  4  (inlined by) run_test at dirty_log_test.c:778
  5  (inlined by) run_test at dirty_log_test.c:691
  6  0x00403a57: for_each_guest_mode at guest_modes.c:105
  7  0x00401ccf: main at dirty_log_test.c:921
  8  0xb06ec79b: ?? ??:0
  9  0xb06ec86b: ?? ??:0
  10 0x00401def: _start at ??:?
  Dirty ring mapped private

Fix the issue by using the host's page size to map the ring buffer.
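
For reference, a sketch of the mapping being fixed (assumed shape; the
real code is the helper in lib/kvm_util.c changed below): mmap() offsets
and lengths are in host page granularity, so the ring must be mapped as

	addr = mmap(NULL, vcpu->vm->dirty_ring_size, PROT_READ | PROT_WRITE,
		    MAP_SHARED, vcpu->fd,
		    getpagesize() * KVM_DIRTY_LOG_PAGE_OFFSET);

rather than scaling the offset by the guest page size.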

Signed-off-by: Gavin Shan 
---
 tools/testing/selftests/kvm/lib/kvm_util.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index f1cb1627161f..89a1a420ebd5 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1506,7 +1506,7 @@ struct kvm_reg_list *vcpu_get_reg_list(struct kvm_vcpu 
*vcpu)
 
 void *vcpu_map_dirty_ring(struct kvm_vcpu *vcpu)
 {
-   uint32_t page_size = vcpu->vm->page_size;
+   uint32_t page_size = getpagesize();
uint32_t size = vcpu->vm->dirty_ring_size;
 
TEST_ASSERT(size > 0, "Should enable dirty ring first");
-- 
2.23.0



[PATCH v8 4/7] KVM: arm64: Enable ring-based dirty memory tracking

2022-11-04 Thread Gavin Shan
Enable ring-based dirty memory tracking on arm64 by selecting
CONFIG_HAVE_KVM_DIRTY_{RING_ACQ_REL, RING_WITH_BITMAP} and providing
the ring buffer's physical page offset (KVM_DIRTY_LOG_PAGE_OFFSET).

Besides, the helper kvm_vgic_save_its_tables_in_progress() is added to
indicate whether the vgic/its tables are being saved. The helper is used
in ARM64's kvm_arch_allow_write_without_running_vcpu() to keep the
vgic/its table saving path off the no-running-vcpu radar.
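
The vgic-mmio-v3.c hunk is cut off at the end of this message; based on
the flag toggled in vgic-its.c below, the new helper is essentially
(approximate, not the verbatim patch):

	bool kvm_vgic_save_its_tables_in_progress(struct kvm *kvm)
	{
		struct vgic_dist *dist = &kvm->arch.vgic;

		return dist->save_its_tables_in_progress;
	}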

Signed-off-by: Gavin Shan 
---
 Documentation/virt/kvm/api.rst |  2 +-
 arch/arm64/include/uapi/asm/kvm.h  |  1 +
 arch/arm64/kvm/Kconfig |  2 ++
 arch/arm64/kvm/arm.c   |  3 +++
 arch/arm64/kvm/mmu.c   | 15 +++
 arch/arm64/kvm/vgic/vgic-its.c |  3 +++
 arch/arm64/kvm/vgic/vgic-mmio-v3.c |  7 +++
 include/kvm/arm_vgic.h |  2 ++
 8 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 2ec32bd41792..2fc68f684ad8 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7921,7 +7921,7 @@ regardless of what has actually been exposed through the 
CPUID leaf.
 8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
 --
 
-:Architectures: x86
+:Architectures: x86, arm64
 :Parameters: args[0] - size of the dirty log ring
 
 KVM is capable of tracking dirty memory using ring buffers that are
diff --git a/arch/arm64/include/uapi/asm/kvm.h 
b/arch/arm64/include/uapi/asm/kvm.h
index 316917b98707..a7a857f1784d 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -43,6 +43,7 @@
 #define __KVM_HAVE_VCPU_EVENTS
 
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
+#define KVM_DIRTY_LOG_PAGE_OFFSET 64
 
 #define KVM_REG_SIZE(id)   \
(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 815cc118c675..066b053e9eb9 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -32,6 +32,8 @@ menuconfig KVM
select KVM_VFIO
select HAVE_KVM_EVENTFD
select HAVE_KVM_IRQFD
+   select HAVE_KVM_DIRTY_RING_ACQ_REL
+   select HAVE_KVM_DIRTY_RING_WITH_BITMAP
select HAVE_KVM_MSI
select HAVE_KVM_IRQCHIP
select HAVE_KVM_IRQ_ROUTING
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 94d33e296e10..6b097605e38c 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -746,6 +746,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 
if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
return kvm_vcpu_suspend(vcpu);
+
+   if (kvm_dirty_ring_check_request(vcpu))
+   return 0;
}
 
return 1;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 60ee3d9f01f8..fbeb55e45f53 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -932,6 +932,21 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm 
*kvm,
kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
+/*
+ * kvm_arch_allow_write_without_running_vcpu - allow writing guest memory
+ * without the running VCPU when dirty ring is enabled.
+ *
+ * The running VCPU is required to track dirty guest pages when dirty ring
+ * is enabled. Otherwise, the backup bitmap should be used to track the
+ * dirty guest pages. When vgic/its tables are being saved, the backup
+ * bitmap is used to track the dirty guest pages due to the missed running
+ * VCPU in the period.
+ */
+bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
+{
+   return kvm_vgic_save_its_tables_in_progress(kvm);
+}
+
 static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
 {
send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
diff --git a/arch/arm64/kvm/vgic/vgic-its.c b/arch/arm64/kvm/vgic/vgic-its.c
index 733b53055f97..dd340bb812bd 100644
--- a/arch/arm64/kvm/vgic/vgic-its.c
+++ b/arch/arm64/kvm/vgic/vgic-its.c
@@ -2743,6 +2743,7 @@ static int vgic_its_has_attr(struct kvm_device *dev,
 static int vgic_its_ctrl(struct kvm *kvm, struct vgic_its *its, u64 attr)
 {
const struct vgic_its_abi *abi = vgic_its_get_abi(its);
+   struct vgic_dist *dist = &kvm->arch.vgic;
int ret = 0;
 
if (attr == KVM_DEV_ARM_VGIC_CTRL_INIT) /* Nothing to do */
@@ -2762,7 +2763,9 @@ static int vgic_its_ctrl(struct kvm *kvm, struct vgic_its 
*its, u64 attr)
vgic_its_reset(kvm, its);
break;
case KVM_DEV_ARM_ITS_SAVE_TABLES:
+   dist->save_its_tables_in_progress = true;
ret = abi->save_tables(its);
+   dist->save_its_tables_in_progress = false;
break;
case KVM_DEV_ARM_ITS_RESTORE_TABLES:
ret = abi->restore_tables(its);
diff 

[PATCH v8 3/7] KVM: Support dirty ring in conjunction with bitmap

2022-11-04 Thread Gavin Shan
ARM64 needs to dirty memory outside of a VCPU context when VGIC/ITS is
enabled. This conflicts with the fact that ring-based dirty page tracking
always requires a running VCPU context.

Introduce a new flavor of dirty ring that requires the use of both VCPU
dirty rings and a dirty bitmap. The expectation is that for non-VCPU
sources of dirty memory (such as the VGIC/ITS on arm64), KVM writes to
the dirty bitmap. Userspace should scan the dirty bitmap before migrating
the VM to the target.

Use an additional capability to advertise this behavior. The newly added
capability (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) can't be enabled before
KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64. In this way, the newly added
capability is treated as an extension of KVM_CAP_DIRTY_LOG_RING_ACQ_REL.
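
As a rough userspace sketch of the expected enabling order (assumptions:
a VM fd in 'vm_fd', both capabilities reported by KVM_CHECK_EXTENSION and
'ring_size' being a valid dirty ring size in bytes; this is not part of
the patch itself):

	struct kvm_enable_cap cap = { .cap = KVM_CAP_DIRTY_LOG_RING_ACQ_REL };

	cap.args[0] = ring_size;
	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

	memset(&cap, 0, sizeof(cap));
	cap.cap = KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP;
	cap.args[0] = 1;
	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);	/* before any memslot is created */

	/* ... create memslots, run vcpus, harvest the per-vcpu rings ... */

	/* Final migration phase: collect the backup bitmap of a slot. */
	struct kvm_dirty_log log = { .slot = slot, .dirty_bitmap = bitmap };
	ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);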

Suggested-by: Marc Zyngier 
Suggested-by: Peter Xu 
Co-developed-by: Oliver Upton 
Signed-off-by: Oliver Upton 
Signed-off-by: Gavin Shan 
Acked-by: Peter Xu 
---
 Documentation/virt/kvm/api.rst | 33 ++-
 include/linux/kvm_dirty_ring.h |  7 +
 include/linux/kvm_host.h   |  1 +
 include/uapi/linux/kvm.h   |  1 +
 virt/kvm/Kconfig   |  8 ++
 virt/kvm/dirty_ring.c  | 10 +++
 virt/kvm/kvm_main.c| 49 +++---
 7 files changed, 93 insertions(+), 16 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index eee9f857a986..2ec32bd41792 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8003,13 +8003,6 @@ flushing is done by the KVM_GET_DIRTY_LOG ioctl).  To 
achieve that, one
 needs to kick the vcpu out of KVM_RUN using a signal.  The resulting
 vmexit ensures that all dirty GFNs are flushed to the dirty rings.
 
-NOTE: the capability KVM_CAP_DIRTY_LOG_RING and the corresponding
-ioctl KVM_RESET_DIRTY_RINGS are mutual exclusive to the existing ioctls
-KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG.  After enabling
-KVM_CAP_DIRTY_LOG_RING with an acceptable dirty ring size, the virtual
-machine will switch to ring-buffer dirty page tracking and further
-KVM_GET_DIRTY_LOG or KVM_CLEAR_DIRTY_LOG ioctls will fail.
-
 NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that
 should be exposed by weakly ordered architecture, in order to indicate
 the additional memory ordering requirements imposed on userspace when
@@ -8018,6 +8011,32 @@ Architecture with TSO-like ordering (such as x86) are 
allowed to
 expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL
 to userspace.
 
+After using the dirty rings, the userspace needs to detect the capability
+of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the ring structures
+need to be backed by per-slot bitmaps. With this capability advertised
+and supported, it means the architecture can dirty guest pages without
+vcpu/ring context, so that some of the dirty information will still be
+maintained in the bitmap structure. KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
+can't be enabled until the capability of KVM_CAP_DIRTY_LOG_RING_ACQ_REL
+has been enabled.
+
+Note that the bitmap here is only a backup of the ring structure, and
+normally should only contain a very small amount of dirty pages, which
+needs to be transferred during VM downtime. Collecting the dirty bitmap
+should be the very last thing that the VMM does before transmitting state
+to the target VM. VMM needs to ensure that the dirty state is final and
+avoid missing dirty pages from another ioctl ordered after the bitmap
+collection.
+
+To collect dirty bits in the backup bitmap, the userspace can use the
+same KVM_GET_DIRTY_LOG ioctl. KVM_CLEAR_DIRTY_LOG shouldn't be needed
+and its behavior is undefined since collecting the dirty bitmap always
+happens in the last phase of VM's migration.
+
+NOTE: One example of using the backup bitmap is saving arm64 vgic/its
+tables through KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_SAVE_TABLES} command on
+KVM device "kvm-arm-vgic-its" during VM's migration.
+
 8.30 KVM_CAP_XEN_HVM
 
 
diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
index 199ead37b104..4862c98d80d3 100644
--- a/include/linux/kvm_dirty_ring.h
+++ b/include/linux/kvm_dirty_ring.h
@@ -37,6 +37,11 @@ static inline u32 kvm_dirty_ring_get_rsvd_entries(void)
return 0;
 }
 
+static inline bool kvm_use_dirty_bitmap(struct kvm *kvm)
+{
+   return true;
+}
+
 static inline int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
   int index, u32 size)
 {
@@ -67,6 +72,8 @@ static inline void kvm_dirty_ring_free(struct kvm_dirty_ring 
*ring)
 #else /* CONFIG_HAVE_KVM_DIRTY_RING */
 
 int kvm_cpu_dirty_log_size(void);
+bool kvm_use_dirty_bitmap(struct kvm *kvm);
+bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm);
 u32 kvm_dirty_ring_get_rsvd_entries(void);
 int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
 
diff --git a/include/linux/kvm_host.h 

[PATCH v8 1/7] KVM: x86: Introduce KVM_REQ_DIRTY_RING_SOFT_FULL

2022-11-04 Thread Gavin Shan
The VCPU isn't expected to be runnable when the dirty ring becomes soft
full, until the dirty pages are harvested and the dirty ring is reset
from userspace. So there is a check on each guest entry to see whether
the dirty ring is soft full. The VCPU is stopped from running if its
dirty ring is soft full. A similar check will be needed when the feature
is supported on ARM64. As Marc Zyngier suggested, a new event avoids the
pointless overhead of checking the size of the dirty ring
('vcpu->kvm->dirty_ring_size') on each guest entry.

Add KVM_REQ_DIRTY_RING_SOFT_FULL. The event is raised when the dirty ring
becomes soft full in kvm_dirty_ring_push(). The event is only cleared in
the check, done in the newly added helper kvm_dirty_ring_check_request().
Since the VCPU is not runnable when the dirty ring becomes soft full, the
KVM_REQ_DIRTY_RING_SOFT_FULL event is always set to prevent the VCPU from
running until the dirty pages are harvested and the dirty ring is reset by
userspace.

kvm_dirty_ring_soft_full() becomes a private function with the newly added
helper kvm_dirty_ring_check_request(). The alignment of the various event
definitions in kvm_host.h is changed to tab characters along the way. In
order to avoid using 'container_of()', the argument @ring is replaced by
@vcpu in kvm_dirty_ring_push().
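
Since the virt/kvm/dirty_ring.c hunks are cut off at the end of this
message, a sketch of what the new helper looks like per the description
above (approximate, not the verbatim patch):

	bool kvm_dirty_ring_check_request(struct kvm_vcpu *vcpu)
	{
		/*
		 * The request stays pending (and the vcpu exits to userspace
		 * with KVM_EXIT_DIRTY_RING_FULL) as long as the ring is still
		 * soft full, i.e. until the dirty pages are harvested and the
		 * ring is reset by userspace.
		 */
		if (kvm_check_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu) &&
		    kvm_dirty_ring_soft_full(&vcpu->dirty_ring)) {
			kvm_make_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu);
			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
			trace_kvm_dirty_ring_exit(vcpu);
			return true;
		}

		return false;
	}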

Link: https://lore.kernel.org/kvmarm/87lerkwtm5.wl-...@kernel.org
Suggested-by: Marc Zyngier 
Signed-off-by: Gavin Shan 
Reviewed-by: Peter Xu 
Reviewed-by: Sean Christopherson 
---
 arch/x86/kvm/x86.c | 15 ++-
 include/linux/kvm_dirty_ring.h | 12 
 include/linux/kvm_host.h   |  9 +
 virt/kvm/dirty_ring.c  | 32 ++--
 virt/kvm/kvm_main.c|  3 +--
 5 files changed, 46 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 521b433f978c..fd3347e31f5b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10512,20 +10512,17 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
bool req_immediate_exit = false;
 
-   /* Forbid vmenter if vcpu dirty ring is soft-full */
-   if (unlikely(vcpu->kvm->dirty_ring_size &&
-     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
-   vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
-   trace_kvm_dirty_ring_exit(vcpu);
-   r = 0;
-   goto out;
-   }
-
if (kvm_request_pending(vcpu)) {
if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
r = -EIO;
goto out;
}
+
+   if (kvm_dirty_ring_check_request(vcpu)) {
+   r = 0;
+   goto out;
+   }
+
if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
if 
(unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
r = 0;
diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
index 906f899813dc..9c13c4c3d30c 100644
--- a/include/linux/kvm_dirty_ring.h
+++ b/include/linux/kvm_dirty_ring.h
@@ -49,7 +49,7 @@ static inline int kvm_dirty_ring_reset(struct kvm *kvm,
return 0;
 }
 
-static inline void kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
+static inline void kvm_dirty_ring_push(struct kvm_vcpu *vcpu,
   u32 slot, u64 offset)
 {
 }
@@ -64,11 +64,6 @@ static inline void kvm_dirty_ring_free(struct kvm_dirty_ring 
*ring)
 {
 }
 
-static inline bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
-{
-   return true;
-}
-
 #else /* CONFIG_HAVE_KVM_DIRTY_RING */
 
 u32 kvm_dirty_ring_get_rsvd_entries(void);
@@ -84,13 +79,14 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct 
kvm_dirty_ring *ring);
  * returns =0: successfully pushed
  * <0: unable to push, need to wait
  */
-void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset);
+void kvm_dirty_ring_push(struct kvm_vcpu *vcpu, u32 slot, u64 offset);
+
+bool kvm_dirty_ring_check_request(struct kvm_vcpu *vcpu);
 
 /* for use in vm_operations_struct */
 struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset);
 
 void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
-bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring);
 
 #endif /* CONFIG_HAVE_KVM_DIRTY_RING */
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 18592bdf4c1b..6fab55e58111 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -153,10 +153,11 @@ static inline bool is_error_page(struct page *page)
  * Architecture-independent vcpu->requests bit members
  * Bits 3-7 are reserved for more arch-independent bits.
  */
-#define KVM_REQ_TLB_FLUSH (0 | KVM_REQUEST_WAIT | 
KVM_REQUEST_NO_WAKEUP)
-#define KVM_REQ_VM_DEAD   (1 | KVM_REQUEST_WAIT | 
KVM_REQUEST_NO_WAKEUP)
-#define 

[PATCH v8 2/7] KVM: Move declaration of kvm_cpu_dirty_log_size() to kvm_dirty_ring.h

2022-11-04 Thread Gavin Shan
Not all architectures need to override the function; ARM64, for example,
does not. Move its declaration to kvm_dirty_ring.h to avoid the following
compile warning on ARM64 when the feature is enabled.

  arch/arm64/kvm/../../../virt/kvm/dirty_ring.c:14:12:\
  warning: no previous prototype for 'kvm_cpu_dirty_log_size' \
  [-Wmissing-prototypes]  \
  int __weak kvm_cpu_dirty_log_size(void)

Reported-by: kernel test robot 
Signed-off-by: Gavin Shan 
Reviewed-by: Peter Xu 
---
 arch/x86/include/asm/kvm_host.h | 2 --
 include/linux/kvm_dirty_ring.h  | 1 +
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7551b6f9c31c..b4dbde7d9eb1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2090,8 +2090,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
 #define GET_SMSTATE(type, buf, offset) \
(*(type *)((buf) + (offset) - 0x7e00))
 
-int kvm_cpu_dirty_log_size(void);
-
 int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
 
 #define KVM_CLOCK_VALID_FLAGS  \
diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
index 9c13c4c3d30c..199ead37b104 100644
--- a/include/linux/kvm_dirty_ring.h
+++ b/include/linux/kvm_dirty_ring.h
@@ -66,6 +66,7 @@ static inline void kvm_dirty_ring_free(struct kvm_dirty_ring 
*ring)
 
 #else /* CONFIG_HAVE_KVM_DIRTY_RING */
 
+int kvm_cpu_dirty_log_size(void);
 u32 kvm_dirty_ring_get_rsvd_entries(void);
 int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
 
-- 
2.23.0



[PATCH v8 0/7] KVM: arm64: Enable ring-based dirty memory tracking

2022-11-04 Thread Gavin Shan
This series enables ring-based dirty memory tracking for ARM64. The
feature has been available and enabled on x86 for a while. It is
beneficial when the number of dirty pages is small in a checkpointing
system or live migration scenario. More details can be found in commit
fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking").

This series is applied to v6.1-rc3, plus commit c227590467cb ("KVM:
Check KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL} prior to enabling them").
The commit is currently in Marc's 'fixes' branch, targeting v6.1-rc4/5.

v7: https://lore.kernel.org/kvmarm/20221031003621.164306-1-gs...@redhat.com/
v6: https://lore.kernel.org/kvmarm/20221011061447.131531-1-gs...@redhat.com/
v5: https://lore.kernel.org/all/20221005004154.83502-1-gs...@redhat.com/
v4: https://lore.kernel.org/kvmarm/20220927005439.21130-1-gs...@redhat.com/
v3: https://lore.kernel.org/r/20220922003214.276736-1-gs...@redhat.com
v2: https://lore.kernel.org/lkml/YyiV%2Fl7O23aw5aaO@xz-m1.local/T/
v1: https://lore.kernel.org/lkml/20220819005601.198436-1-gs...@redhat.com

Testing
=======
(1) kvm/selftests/dirty_log_test
(2) Live migration by QEMU

Changelog
=========
v8:
  * Pick review-by and ack-by   (Peter/Sean)
  * Drop chunk of code to clear KVM_REQ_DIRTY_RING_SOFT_FULL
in kvm_dirty_ring_reset(). Add comments to say the event
will be cleared by the VCPU thread next time when it enters
the guest. All other changes related to kvm_dirty_ring_reset()
are dropped in PATCH[v8 1/7].   
(Sean/Peter/Marc)
  * Drop PATCH[v7 3/7] since it has been merged (Marc/Oliver)
  * Document the order of DIRTY_RING_{ACQ_REL, WITH_BITMAP},
add check to ensure no memslots are created when
DIRTY_RING_WITH_BITMAP is enabled, and add weak function
kvm_arch_allow_write_without_running_vcpu() in PATCH[v8 3/7] (Oliver)
  * Only keep ourself out of non-running-vcpu radar when vgic/its
tables are being saved in PATCH[v8 4/7]  (Marc/Sean)
v7:
  * Cut down #ifdef, avoid using 'container_of()', move the
dirty-ring check after KVM_REQ_VM_DEAD, add comments
for kvm_dirty_ring_check_request(), use tab character
for KVM event definitions in kvm_host.h in PATCH[v7 01](Sean)
  * Add PATCH[v7 03] to recheck if the capability has
been advertised prior to enable RING/RING_ACEL_REL (Sean)
  * Improve the description about capability RING_WITH_BITMAP,
rename kvm_dirty_ring_exclusive() to kvm_use_dirty_bitmap()
in PATCH[v7 04/09] 
(Peter/Oliver/Sean)
  * Add PATCH[v7 05/09] to improve no-running-vcpu report  (Marc/Sean)
  * Improve commit messages(Sean/Oliver)
v6:
  * Add CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP, for arm64
to advertise KVM_CAP_DIRTY_RING_WITH_BITMAP in
PATCH[v6 3/8]  (Oliver/Peter)
  * Add helper kvm_dirty_ring_exclusive() to check if
traditional bitmap-based dirty log tracking is
exclusive to dirty-ring in PATCH[v6 3/8]   (Peter)
  * Enable KVM_CAP_DIRTY_RING_WITH_BITMAP in PATCH[v6 5/8] (Gavin)
v5:
  * Drop empty stub kvm_dirty_ring_check_request() (Marc/Peter)
  * Add PATCH[v5 3/7] to allow using bitmap, indicated by
KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP(Marc/Peter)
v4:
  * Commit log improvement (Marc)
  * Add helper kvm_dirty_ring_check_request()  (Marc)
  * Drop ifdef for kvm_cpu_dirty_log_size()(Marc)
v3:
  * Check KVM_REQ_RING_SOFT_FULL inside kvm_request_pending()  (Peter)
  * Move declaration of kvm_cpu_dirty_log_size()   (test-robot)
v2:
  * Introduce KVM_REQ_RING_SOFT_FULL   (Marc)
  * Changelog improvement  (Marc)
  * Fix dirty_log_test without knowing host page size  (Drew)

Gavin Shan (7):
  KVM: x86: Introduce KVM_REQ_DIRTY_RING_SOFT_FULL
  KVM: Move declaration of kvm_cpu_dirty_log_size() to kvm_dirty_ring.h
  KVM: Support dirty ring in conjunction with bitmap
  KVM: arm64: Enable ring-based dirty memory tracking
  KVM: selftests: Use host page size to map ring buffer in
dirty_log_test
  KVM: selftests: Clear dirty ring states between two modes in
dirty_log_test
  KVM: selftests: Automate choosing dirty ring size in dirty_log_test

 Documentation/virt/kvm/api.rst   | 35 ++---
 arch/arm64/include/uapi/asm/kvm.h|  1 +
 arch/arm64/kvm/Kconfig   |  2 +
 arch/arm64/kvm/arm.c |  3 ++
 arch/arm64/kvm/mmu.c | 15 ++
 arch/arm64/kvm/vgic/vgic-its.c   |  3 ++
 arch/arm64/kvm/vgic/vgic-mmio-v3.c   |  7 +++
 arch/x86/include/asm/kvm_host.h  |  2 -
 arch/x86/kvm/x86.c   | 15 

Re: [PATCH v7 4/9] KVM: Support dirty ring in conjunction with bitmap

2022-11-04 Thread Oliver Upton
On Sat, Nov 05, 2022 at 05:57:33AM +0800, Gavin Shan wrote:
> On 11/5/22 4:12 AM, Oliver Upton wrote:
> > I agree that we should address (1) like this, but in (2) requiring that
> > no memslots were created before enabling the existing capabilities would
> > be a change in ABI. If we can get away with that, great, but otherwise
> > we may need to delete the bitmaps associated with all memslots when the
> > cap is enabled.
> > 
> 
> I had the assumption QEMU and kvm/selftests are the only consumers to
> use DIRTY_RING. In this case, requiring that no memslots were created
> to enable DIRTY_RING won't break userspace.
> Following your thoughts, the tracked dirty pages in the bitmap also
> need to be synchronized to the per-vcpu-ring before the bitmap can be
> destroyed.

Eh, I don't think we'd need to go that far. No matter what, any dirty
bits that were present in the bitmap could never be read again anyway,
as we reject KVM_GET_DIRTY_LOG if kvm->dirty_ring_size != 0.

> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 91cf51a25394..420cc101a16e 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -4588,6 +4588,32 @@ static int kvm_vm_ioctl_enable_cap_generic(struct 
> > kvm *kvm,
> > return -EINVAL;
> > return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
> > +
> > +   case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP: {
> > +   struct kvm_memslots *slots;
> > +   int r = -EINVAL;
> > +
> > +   if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
> > +   !kvm->dirty_ring_size)
> > +   return r;
> > +
> > +   mutex_lock(&kvm->slots_lock);
> > +
> > +   slots = kvm_memslots(kvm);
> > +
> > +   /*
> > +* Avoid a race between memslot creation and enabling the ring +
> > +* bitmap capability to guarantee that no memslots have been
> > +* created without a bitmap.
> > +*/
> > +   if (kvm_memslots_empty(slots)) {
> > +   kvm->dirty_ring_with_bitmap = cap->args[0];
> > +   r = 0;
> > +   }
> > +
> > +   mutex_unlock(&kvm->slots_lock);
> > +   return r;
> > +   }
> > default:
> > return kvm_vm_ioctl_enable_cap(kvm, cap);
> > }
> > 
> 
> The proposed changes look good to me. It will be integrated to PATCH[v8 4/9].
> By the way, v8 will be posted shortly.

Excellent, thanks!

--
Best,
Oliver


Re: [PATCH v7 4/9] KVM: Support dirty ring in conjunction with bitmap

2022-11-04 Thread Gavin Shan

Hi Oliver,

On 11/5/22 4:12 AM, Oliver Upton wrote:

On Fri, Nov 04, 2022 at 02:57:15PM +0800, Gavin Shan wrote:

On 11/4/22 9:06 AM, Oliver Upton wrote:


[...]


Just to make sure we're on the same page, there's two issues:

   (1) If DIRTY_LOG_RING is enabled before memslot creation and
   RING_WITH_BITMAP is enabled after memslots have been created w/
   dirty logging enabled, memslot->dirty_bitmap == NULL and the
   kernel will fault when attempting to save the ITS tables.

   (2) Not your code, but a similar issue. If DIRTY_LOG_RING[_ACQ_REL] is
   enabled after memslots have been created w/ dirty logging enabled,
   memslot->dirty_bitmap != NULL and that memory is wasted until the
   memslot is freed.

I don't expect you to fix #2, though I've mentioned it because using the
same approach to #1 and #2 would be nice.



Yes, I got your points. Case (2) is still possible to happen with QEMU
excluded. However, QEMU is always enabling DIRTY_LOG_RING[_ACQ_REL] before
any memory slot is created. I agree that we need to ensure there are no
memory slots when DIRTY_LOG_RING[_ACQ_REL] is enabled.

For case (1), we can ensure RING_WITH_BITMAP is enabled before any memory
slot is added, as below. QEMU needs a new helper (as above) to enable it
on board's level.

Lets fix both with a new helper in PATCH[v8 4/9] like below?


I agree that we should address (1) like this, but in (2) requiring that
no memslots were created before enabling the existing capabilities would
be a change in ABI. If we can get away with that, great, but otherwise
we may need to delete the bitmaps associated with all memslots when the
cap is enabled.



I had the assumption QEMU and kvm/selftests are the only consumers to
use DIRTY_RING. In this case, requiring that no memslots were created
to enable DIRTY_RING won't break userspace. Following your thoughts,
the tracked dirty pages in the bitmap also need to be synchronized to
the per-vcpu-ring before the bitmap can be destroyed. We don't have
per-vcpu-ring at this stage.


   static inline bool kvm_vm_has_memslot_pages(struct kvm *kvm)
   {
   bool has_memslot_pages;

    mutex_lock(&kvm->slots_lock);

   has_memslot_pages = !!kvm->nr_memslot_pages;

    mutex_unlock(&kvm->slots_lock);

   return has_memslot_pages;
   }


Do we need to build another helper for this? kvm_memslots_empty() will
tell you whether or not a memslot has been created by checking the gfn
tree.



The helper was introduced to be shared when DIRTY_RING[_ACQ_REL] and
DIRTY_RING_WITH_BITMAP are enabled. Since issue (2) isn't a concern for
us, let's put it aside; the helper isn't needed. kvm_memslots_empty() has
the same effect as checking 'kvm->nr_memslot_pages', so it's fine to use
kvm_memslots_empty(), which is more generic.


On top of that, the memslot check and setting
kvm->dirty_ring_with_bitmap must happen behind the slots_lock. Otherwise
you could still wind up creating memslots w/o bitmaps.



Agree.



Something like:


diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 91cf51a25394..420cc101a16e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4588,6 +4588,32 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm 
*kvm,
return -EINVAL;
  
  		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);

+
+   case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP: {
+   struct kvm_memslots *slots;
+   int r = -EINVAL;
+
+   if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
+   !kvm->dirty_ring_size)
+   return r;
+
+   mutex_lock(&kvm->slots_lock);
+
+   slots = kvm_memslots(kvm);
+
+   /*
+* Avoid a race between memslot creation and enabling the ring +
+* bitmap capability to guarantee that no memslots have been
+* created without a bitmap.
+*/
+   if (kvm_memslots_empty(slots)) {
+   kvm->dirty_ring_with_bitmap = cap->args[0];
+   r = 0;
+   }
+
+   mutex_unlock(&kvm->slots_lock);
+   return r;
+   }
default:
return kvm_vm_ioctl_enable_cap(kvm, cap);
}



The proposed changes look good to me. It will be integrated to PATCH[v8 4/9].
By the way, v8 will be posted shortly.

Thanks,
Gavin



Re: [PATCH v2 01/14] arm64: Add ID_DFR0_EL1.PerfMon values for PMUv3p7 and IMP_DEF

2022-11-04 Thread Oliver Upton
On Fri, Oct 28, 2022 at 11:53:49AM +0100, Marc Zyngier wrote:
> Align the ID_DFR0_EL1.PerfMon values with ID_AA64DFR0_EL1.PMUver.
> 
> Signed-off-by: Marc Zyngier 

Reviewed-by: Oliver Upton 

FYI, another pile of ID reg changes is on the way that'll move DFR0 to a
generated definition.

https://lore.kernel.org/linux-arm-kernel/20220930140211.3215348-1-james.mo...@arm.com/

--
Thanks,
Oliver

> ---
>  arch/arm64/include/asm/sysreg.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> index 7d301700d1a9..84f59ce1dc6d 100644
> --- a/arch/arm64/include/asm/sysreg.h
> +++ b/arch/arm64/include/asm/sysreg.h
> @@ -698,6 +698,8 @@
>  #define ID_DFR0_PERFMON_8_1  0x4
>  #define ID_DFR0_PERFMON_8_4  0x5
>  #define ID_DFR0_PERFMON_8_5  0x6
> +#define ID_DFR0_PERFMON_8_7  0x7
> +#define ID_DFR0_PERFMON_IMP_DEF  0xf
>  
>  #define ID_ISAR4_SWP_FRAC_SHIFT  28
>  #define ID_ISAR4_PSR_M_SHIFT 24
> -- 
> 2.34.1
> 


Re: [PATCH 00/44] KVM: Rework kvm_init() and hardware enabling

2022-11-04 Thread Sean Christopherson
On Fri, Nov 04, 2022, Isaku Yamahata wrote:
> Thanks for the patch series. I rebased the TDX KVM patch series on it and it worked.
> Since cpu offline needs to be rejected in some cases (to keep at least one cpu
> on a package), an arch hook for cpu offline is needed.

I hate to bring this up because I doubt there's a real use case for SUSPEND with
TDX, but the CPU offline path isn't just for true offlining of CPUs.  When the
system enters SUSPEND, only the initiating CPU goes through kvm_suspend() +
kvm_resume(); all responding CPUs go through CPU offline+online.  I.e. disallowing
all CPUs from going "offline" will prevent suspending the system.

I don't see anything in the TDX series or the specs that suggests suspend+resume
is disallowed when TDX is enabled, so blocking that seems just as wrong as
preventing software from soft-offlining CPUs.


Re: [PATCH 02/44] KVM: Initialize IRQ FD after arch hardware setup

2022-11-04 Thread Sean Christopherson
On Fri, Nov 04, 2022, Chao Gao wrote:
> On Wed, Nov 02, 2022 at 11:18:29PM +, Sean Christopherson wrote:
> > 
> >+r = kvm_irqfd_init();
> >+if (r)
> >+goto err_irqfd;
> >+
> > r = kvm_async_pf_init();
> > if (r)
> >-goto out_free_4;
> >+goto err_async_pf;
> > 
> > kvm_chardev_ops.owner = module;
> > 
> >@@ -5927,6 +5926,9 @@ int kvm_init(void *opaque, unsigned vcpu_size, 
> >unsigned vcpu_align,
> > kvm_vfio_ops_exit();
> > err_vfio:
> > kvm_async_pf_deinit();
> >+err_async_pf:
> >+kvm_irqfd_exit();
> 
> >+err_irqfd:
> > out_free_4:
> 
> Do you mind removing one of the two labels?

Ah, I meant to tack on a patch at the very end to clean up these labels once the
dust had settled, e.g. to also resolve the "err" vs. "out" mess I created (on
purpose, because trying to describe the "out" path was frustrating and generated
too much churn).


Re: [PATCH v7 4/9] KVM: Support dirty ring in conjunction with bitmap

2022-11-04 Thread Oliver Upton
On Fri, Nov 04, 2022 at 02:57:15PM +0800, Gavin Shan wrote:
> On 11/4/22 9:06 AM, Oliver Upton wrote:

[...]

> > Just to make sure we're on the same page, there's two issues:
> > 
> >   (1) If DIRTY_LOG_RING is enabled before memslot creation and
> >   RING_WITH_BITMAP is enabled after memslots have been created w/
> >   dirty logging enabled, memslot->dirty_bitmap == NULL and the
> >   kernel will fault when attempting to save the ITS tables.
> > 
> >   (2) Not your code, but a similar issue. If DIRTY_LOG_RING[_ACQ_REL] is
> >   enabled after memslots have been created w/ dirty logging enabled,
> >   memslot->dirty_bitmap != NULL and that memory is wasted until the
> >   memslot is freed.
> > 
> > I don't expect you to fix #2, though I've mentioned it because using the
> > same approach to #1 and #2 would be nice.
> > 
> 
> Yes, I got your points. Case (2) is still possible to happen with QEMU
> excluded. However, QEMU is always enabling DIRTY_LOG_RING[_ACQ_REL] before
> any memory slot is created. I agree that we need to ensure there are no
> memory slots when DIRTY_LOG_RING[_ACQ_REL] is enabled.
> 
> For case (1), we can ensure RING_WTIH_BITMAP is enabled before any memory
> slot is added, as below. QEMU needs a new helper (as above) to enable it
> on board's level.
> 
> Lets fix both with a new helper in PATCH[v8 4/9] like below?

I agree that we should address (1) like this, but in (2) requiring that
no memslots were created before enabling the existing capabilities would
be a change in ABI. If we can get away with that, great, but otherwise
we may need to delete the bitmaps associated with all memslots when the
cap is enabled.

>   static inline bool kvm_vm_has_memslot_pages(struct kvm *kvm)
>   {
>   bool has_memslot_pages;
> 
>   mutex_lock(&kvm->slots_lock);
> 
>   has_memslot_pages = !!kvm->nr_memslot_pages;
> 
>   mutex_unlock(&kvm->slots_lock);
> 
>   return has_memslot_pages;
>   }

Do we need to build another helper for this? kvm_memslots_empty() will
tell you whether or not a memslot has been created by checking the gfn
tree.

On top of that, the memslot check and setting
kvm->dirty_ring_with_bitmap must happen behind the slots_lock. Otherwise
you could still wind up creating memslots w/o bitmaps.


Something like:


diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 91cf51a25394..420cc101a16e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4588,6 +4588,32 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm 
*kvm,
return -EINVAL;
 
return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
+
+   case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP: {
+   struct kvm_memslots *slots;
+   int r = -EINVAL;
+
+   if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
+   !kvm->dirty_ring_size)
+   return r;
+
+   mutex_lock(&kvm->slots_lock);
+
+   slots = kvm_memslots(kvm);
+
+   /*
+* Avoid a race between memslot creation and enabling the ring +
+* bitmap capability to guarantee that no memslots have been
+* created without a bitmap.
+*/
+   if (kvm_memslots_empty(slots)) {
+   kvm->dirty_ring_with_bitmap = cap->args[0];
+   r = 0;
+   }
+
+   mutex_unlock(&kvm->slots_lock);
+   return r;
+   }
default:
return kvm_vm_ioctl_enable_cap(kvm, cap);
}

--
Thanks,
Oliver


Re: [PATCH v5 0/8] KVM: arm64: permit MAP_SHARED mappings with MTE enabled

2022-11-04 Thread Peter Collingbourne
On Fri, Nov 4, 2022 at 9:23 AM Marc Zyngier  wrote:
>
> On Fri, 04 Nov 2022 01:10:33 +,
> Peter Collingbourne  wrote:
> >
> > Hi,
> >
> > This patch series allows VMMs to use shared mappings in MTE enabled
> > guests. The first five patches were taken from Catalin's tree [1] which
> > addressed some review feedback from when they were previously sent out
> > as v3 of this series. The first patch from Catalin's tree makes room
> > for an additional PG_arch_3 flag by making the newer PG_arch_* flags
> > arch-dependent. The next four patches are based on a series that
> > Catalin sent out prior to v3, whose cover letter [2] I quote from below:
> >
> > > This series aims to fix the races between initialising the tags on a
> > > page and setting the PG_mte_tagged flag. Currently the flag is set
> > > either before or after that tag initialisation and this can lead to CoW
> > > copying stale tags. The first patch moves the flag setting after the
> > > tags have been initialised, solving the CoW issue. However, concurrent
> > > mprotect() on a shared mapping may (very rarely) lead to valid tags
> > > being zeroed.
> > >
> > > The second skips the sanitise_mte_tags() call in kvm_set_spte_gfn(),
> > > deferring it to user_mem_abort(). The outcome is that now
> > > sanitise_mte_tags() can be simplified to skip the pfn_to_online_page()
> > > check and only rely on VM_MTE_ALLOWED vma flag that can be checked in
> > > user_mem_abort().
> > >
> > > The third and fourth patches use PG_arch_3 as a lock for page tagging,
> > > based on Peter Collingbourne's idea of a two-bit lock.
> > >
> > > I think the first patch can be queued but the rest needs some in depth
> > > review and test. With this series (if correct) we could allow MAP_SHARED
> > > on KVM guest memory but this is to be discussed separately as there are
> > > some KVM ABI implications.
> >
> > In this v5 I rebased Catalin's tree onto -next again. Please double check
>
> Please don't use -next as a base. In-flight series should be based
> on a *stable* tag, either 6.0 or one of the early -RCs. If there is a
> known conflict with -next, do mention it in the cover letter and
> provide a resolution.

Okay, I will keep that in mind.

> > my rebase, which resolved the conflict with commit a8e5e5146ad0 ("arm64:
> > mte: Avoid setting PG_mte_tagged if no tags cleared or restored").
>
> This commit seems part of -rc1, so I guess the patches directly apply
> on top of that tag?

Yes, sorry, this also applies cleanly to -rc1.

> > I now have Reviewed-by for all patches except for the last one, which adds
> > the documentation. Thanks for the reviews so far, and please take a look!
>
> I'd really like the MM folks (list now cc'd) to look at the relevant
> patches (1 and 5) and ack them before I take this.

Okay, here are the lore links for the convenience of the MM folks:
https://lore.kernel.org/all/20221104011041.290951-2-...@google.com/
https://lore.kernel.org/all/20221104011041.290951-6-...@google.com/

Peter


Re: [PATCH 08/44] KVM: x86: Move hardware setup/unsetup to init/exit

2022-11-04 Thread Sean Christopherson
On Fri, Nov 04, 2022, Yuan Yao wrote:
> On Wed, Nov 02, 2022 at 11:18:35PM +, Sean Christopherson wrote:
> > To avoid having to unwind various setup, e.g registration of several
> > notifiers, slot in the vendor hardware setup before the registration of
> > said notifiers and callbacks.  Introducing a functional change while
> > moving code is less than ideal, but the alternative is adding a pile of
> > unwinding code, which is much more error prone, e.g. several attempts to
> > move the setup code verbatim all introduced bugs.

...

> > @@ -9325,6 +9343,24 @@ int kvm_arch_init(void *opaque)
> > kvm_caps.supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0;
> > }
> >
> > +   rdmsrl_safe(MSR_EFER, &host_efer);
> > +
> > +   if (boot_cpu_has(X86_FEATURE_XSAVES))
> > +   rdmsrl(MSR_IA32_XSS, host_xss);
> > +
> > +   kvm_init_pmu_capability();
> > +
> > +   r = ops->hardware_setup();
> > +   if (r != 0)
> > +   goto out_mmu_exit;
> 
> The failure case of ops->hardware_setup() is unwound
> by kvm_arch_exit() before this patch; do we need to
> keep that old behavior?

As called out in the changelog, the call to ops->hardware_setup() was
deliberately slotted in before the call to kvm_timer_init() so that
kvm_arch_init() wouldn't need to unwind more stuff if hardware_setup()
fails.

> > +   /*
> > +* Point of no return!  DO NOT add error paths below this point unless
> > +* absolutely necessary, as most operations from this point forward
> > +* require unwinding.
> > +*/
> > +   kvm_ops_update(ops);
> > +
> > kvm_timer_init();
> >
> > if (pi_inject_timer == -1)
> > @@ -9336,8 +9372,32 @@ int kvm_arch_init(void *opaque)
> > set_hv_tscchange_cb(kvm_hyperv_tsc_notifier);
> >  #endif
> >
> > +   kvm_register_perf_callbacks(ops->handle_intel_pt_intr);
> > +
> > +   if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
> > +   kvm_caps.supported_xss = 0;
> > +
> > +#define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f)
> > +   cr4_reserved_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_);
> > +#undef __kvm_cpu_cap_has
> > +
> > +   if (kvm_caps.has_tsc_control) {
> > +   /*
> > +* Make sure the user can only configure tsc_khz values that
> > +* fit into a signed integer.
> > +* A min value is not calculated because it will always
> > +* be 1 on all machines.
> > +*/
> > +   u64 max = min(0x7fffffffULL,
> > +                 __scale_tsc(kvm_caps.max_tsc_scaling_ratio, tsc_khz));
> > +   kvm_caps.max_guest_tsc_khz = max;
> > +   }
> > +   kvm_caps.default_tsc_scaling_ratio = 1ULL << 
> > kvm_caps.tsc_scaling_ratio_frac_bits;
> > +   kvm_init_msr_list();
> > return 0;
> >
> > +out_mmu_exit:
> > +   kvm_mmu_vendor_module_exit();
> >  out_free_percpu:
> > free_percpu(user_return_msrs);
> >  out_free_x86_emulator_cache:


Re: [PATCH v5 0/8] KVM: arm64: permit MAP_SHARED mappings with MTE enabled

2022-11-04 Thread Marc Zyngier
On Fri, 04 Nov 2022 01:10:33 +,
Peter Collingbourne  wrote:
> 
> Hi,
> 
> This patch series allows VMMs to use shared mappings in MTE enabled
> guests. The first five patches were taken from Catalin's tree [1] which
> addressed some review feedback from when they were previously sent out
> as v3 of this series. The first patch from Catalin's tree makes room
> for an additional PG_arch_3 flag by making the newer PG_arch_* flags
> arch-dependent. The next four patches are based on a series that
> Catalin sent out prior to v3, whose cover letter [2] I quote from below:
> 
> > This series aims to fix the races between initialising the tags on a
> > page and setting the PG_mte_tagged flag. Currently the flag is set
> > either before or after that tag initialisation and this can lead to CoW
> > copying stale tags. The first patch moves the flag setting after the
> > tags have been initialised, solving the CoW issue. However, concurrent
> > mprotect() on a shared mapping may (very rarely) lead to valid tags
> > being zeroed.
> >
> > The second skips the sanitise_mte_tags() call in kvm_set_spte_gfn(),
> > deferring it to user_mem_abort(). The outcome is that now
> > sanitise_mte_tags() can be simplified to skip the pfn_to_online_page()
> > check and only rely on VM_MTE_ALLOWED vma flag that can be checked in
> > user_mem_abort().
> >
> > The third and fourth patches use PG_arch_3 as a lock for page tagging,
> > based on Peter Collingbourne's idea of a two-bit lock.
> >
> > I think the first patch can be queued but the rest needs some in depth
> > review and test. With this series (if correct) we could allow MAP_SHARED
> > on KVM guest memory but this is to be discussed separately as there are
> > some KVM ABI implications.
> 
> In this v5 I rebased Catalin's tree onto -next again. Please double check

Please don't use -next as a base. In-flight series should be based
on a *stable* tag, either 6.0 or one of the early -RCs. If there is a
known conflict with -next, do mention it in the cover letter and
provide a resolution.

> my rebase, which resolved the conflict with commit a8e5e5146ad0 ("arm64:
> mte: Avoid setting PG_mte_tagged if no tags cleared or restored").

This commit seems part of -rc1, so I guess the patches directly apply
on top of that tag?

> I now have Reviewed-by for all patches except for the last one, which adds
> the documentation. Thanks for the reviews so far, and please take a look!

I'd really like the MM folks (list now cc'd) to look at the relevant
patches (1 and 5) and ack them before I take this.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


[PATCH v3 7/8] perf: Add perf_event_attr::config3

2022-11-04 Thread Rob Herring
Arm SPEv1.2 adds another 64 bits of event filtering control. As the
existing perf_event_attr::configN fields are all used up for the SPE PMU,
an additional field is needed. Add a new 'config3' field.

Tested-by: James Clark 
Signed-off-by: Rob Herring 
---
v3:
 - No change
v2:
 - Drop tools/ side update
---
 include/uapi/linux/perf_event.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 85be78e0e7f6..b2b1d7b54097 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -374,6 +374,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER5	112	/* add: aux_watermark */
 #define PERF_ATTR_SIZE_VER6	120	/* add: aux_sample_size */
 #define PERF_ATTR_SIZE_VER7	128	/* add: sig_data */
+#define PERF_ATTR_SIZE_VER8	136	/* add: config3 */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -515,6 +516,8 @@ struct perf_event_attr {
 * truncated accordingly on 32 bit architectures.
 */
__u64   sig_data;
+
+   __u64   config3; /* extension of config2 */
 };
 
 /*

-- 
b4 0.11.0-dev


[PATCH v3 1/8] perf: arm_spe: Use feature numbering for PMSEVFR_EL1 defines

2022-11-04 Thread Rob Herring
Similar to commit 121a8fc088f1 ("arm64/sysreg: Use feature numbering for
PMU and SPE revisions") use feature numbering instead of architecture
versions for the PMSEVFR_EL1 Res0 defines.

Tested-by: James Clark 
Signed-off-by: Rob Herring 
---
v3:
 - No change
v2:
 - New patch
---
 arch/arm64/include/asm/sysreg.h | 6 +++---
 drivers/perf/arm_spe_pmu.c  | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 7d301700d1a9..9a4cf12e3e16 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -294,11 +294,11 @@
 #define SYS_PMSFCR_EL1_ST_SHIFT		18
 
 #define SYS_PMSEVFR_EL1			sys_reg(3, 0, 9, 9, 5)
-#define SYS_PMSEVFR_EL1_RES0_8_2   \
+#define PMSEVFR_EL1_RES0_IMP   \
(GENMASK_ULL(47, 32) | GENMASK_ULL(23, 16) | GENMASK_ULL(11, 8) |\
 BIT_ULL(6) | BIT_ULL(4) | BIT_ULL(2) | BIT_ULL(0))
-#define SYS_PMSEVFR_EL1_RES0_8_3   \
-   (SYS_PMSEVFR_EL1_RES0_8_2 & ~(BIT_ULL(18) | BIT_ULL(17) | BIT_ULL(11)))
+#define PMSEVFR_EL1_RES0_V1P1  \
+   (PMSEVFR_EL1_RES0_IMP & ~(BIT_ULL(18) | BIT_ULL(17) | BIT_ULL(11)))
 
 #define SYS_PMSLATFR_EL1   sys_reg(3, 0, 9, 9, 6)
 #define SYS_PMSLATFR_EL1_MINLAT_SHIFT  0
diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
index 00e3a637f7b6..65cf93dcc8ee 100644
--- a/drivers/perf/arm_spe_pmu.c
+++ b/drivers/perf/arm_spe_pmu.c
@@ -677,11 +677,11 @@ static u64 arm_spe_pmsevfr_res0(u16 pmsver)
 {
switch (pmsver) {
case ID_AA64DFR0_EL1_PMSVer_IMP:
-   return SYS_PMSEVFR_EL1_RES0_8_2;
+   return PMSEVFR_EL1_RES0_IMP;
case ID_AA64DFR0_EL1_PMSVer_V1P1:
/* Return the highest version we support in default */
default:
-   return SYS_PMSEVFR_EL1_RES0_8_3;
+   return PMSEVFR_EL1_RES0_V1P1;
}
 }
 

-- 
b4 0.11.0-dev
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v3 8/8] perf: arm_spe: Add support for SPEv1.2 inverted event filtering

2022-11-04 Thread Rob Herring
Arm SPEv1.2 (Arm v8.7/v9.2) adds a new feature called Inverted Event
Filter which excludes samples matching the event filter. The feature
mirrors the existing event filter in PMSEVFR_EL1 adding a new register,
PMSNEVFR_EL1, which has the same event bit assignments.

Tested-by: James Clark 
Signed-off-by: Rob Herring 
---
v3:
 - No change
v2:
 - Update for auto generated register defines
 - Avoid accessing SYS_PMSNEVFR_EL1 on < v8.7
---
 drivers/perf/arm_spe_pmu.c | 45 +
 1 file changed, 45 insertions(+)

diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
index 82f67e941bc4..573db4211acd 100644
--- a/drivers/perf/arm_spe_pmu.c
+++ b/drivers/perf/arm_spe_pmu.c
@@ -85,6 +85,7 @@ struct arm_spe_pmu {
 #define SPE_PMU_FEAT_ARCH_INST (1UL << 3)
 #define SPE_PMU_FEAT_LDS   (1UL << 4)
 #define SPE_PMU_FEAT_ERND  (1UL << 5)
+#define SPE_PMU_FEAT_INV_FILT_EVT  (1UL << 6)
 #define SPE_PMU_FEAT_DEV_PROBED(1UL << 63)
u64 features;
 
@@ -202,6 +203,10 @@ static const struct attribute_group arm_spe_pmu_cap_group 
= {
 #define ATTR_CFG_FLD_min_latency_LO0
 #define ATTR_CFG_FLD_min_latency_HI11
 
+#define ATTR_CFG_FLD_inv_event_filter_CFG  config3 /* PMSNEVFR_EL1 */
+#define ATTR_CFG_FLD_inv_event_filter_LO   0
+#define ATTR_CFG_FLD_inv_event_filter_HI   63
+
 /* Why does everything I do descend into this? */
 #define __GEN_PMU_FORMAT_ATTR(cfg, lo, hi) \
(lo) == (hi) ? #cfg ":" #lo "\n" : #cfg ":" #lo "-" #hi
@@ -232,6 +237,7 @@ GEN_PMU_FORMAT_ATTR(branch_filter);
 GEN_PMU_FORMAT_ATTR(load_filter);
 GEN_PMU_FORMAT_ATTR(store_filter);
 GEN_PMU_FORMAT_ATTR(event_filter);
+GEN_PMU_FORMAT_ATTR(inv_event_filter);
 GEN_PMU_FORMAT_ATTR(min_latency);
 
 static struct attribute *arm_spe_pmu_formats_attr[] = {
@@ -243,12 +249,27 @@ static struct attribute *arm_spe_pmu_formats_attr[] = {
 	&format_attr_load_filter.attr,
 	&format_attr_store_filter.attr,
 	&format_attr_event_filter.attr,
+	&format_attr_inv_event_filter.attr,
 	&format_attr_min_latency.attr,
NULL,
 };
 
+static umode_t arm_spe_pmu_format_attr_is_visible(struct kobject *kobj,
+ struct attribute *attr,
+ int unused)
+   {
+   struct device *dev = kobj_to_dev(kobj);
+   struct arm_spe_pmu *spe_pmu = dev_get_drvdata(dev);
+
+	if (attr == &format_attr_inv_event_filter.attr && !(spe_pmu->features & SPE_PMU_FEAT_INV_FILT_EVT))
+   return 0;
+
+   return attr->mode;
+}
+
 static const struct attribute_group arm_spe_pmu_format_group = {
.name   = "format",
+   .is_visible = arm_spe_pmu_format_attr_is_visible,
.attrs  = arm_spe_pmu_formats_attr,
 };
 
@@ -343,6 +364,9 @@ static u64 arm_spe_event_to_pmsfcr(struct perf_event *event)
if (ATTR_CFG_GET_FLD(attr, event_filter))
reg |= PMSFCR_EL1_FE;
 
+   if (ATTR_CFG_GET_FLD(attr, inv_event_filter))
+   reg |= PMSFCR_EL1_FnE;
+
if (ATTR_CFG_GET_FLD(attr, min_latency))
reg |= PMSFCR_EL1_FL;
 
@@ -355,6 +379,12 @@ static u64 arm_spe_event_to_pmsevfr(struct perf_event 
*event)
return ATTR_CFG_GET_FLD(attr, event_filter);
 }
 
+static u64 arm_spe_event_to_pmsnevfr(struct perf_event *event)
+{
+	struct perf_event_attr *attr = &event->attr;
+   return ATTR_CFG_GET_FLD(attr, inv_event_filter);
+}
+
 static u64 arm_spe_event_to_pmslatfr(struct perf_event *event)
 {
 	struct perf_event_attr *attr = &event->attr;
@@ -703,6 +733,9 @@ static int arm_spe_pmu_event_init(struct perf_event *event)
if (arm_spe_event_to_pmsevfr(event) & 
arm_spe_pmsevfr_res0(spe_pmu->pmsver))
return -EOPNOTSUPP;
 
+   if (arm_spe_event_to_pmsnevfr(event) & 
arm_spe_pmsevfr_res0(spe_pmu->pmsver))
+   return -EOPNOTSUPP;
+
if (attr->exclude_idle)
return -EOPNOTSUPP;
 
@@ -721,6 +754,10 @@ static int arm_spe_pmu_event_init(struct perf_event *event)
!(spe_pmu->features & SPE_PMU_FEAT_FILT_EVT))
return -EOPNOTSUPP;
 
+   if ((FIELD_GET(PMSFCR_EL1_FnE, reg)) &&
+   !(spe_pmu->features & SPE_PMU_FEAT_INV_FILT_EVT))
+   return -EOPNOTSUPP;
+
if ((FIELD_GET(PMSFCR_EL1_FT, reg)) &&
!(spe_pmu->features & SPE_PMU_FEAT_FILT_TYP))
return -EOPNOTSUPP;
@@ -756,6 +793,11 @@ static void arm_spe_pmu_start(struct perf_event *event, 
int flags)
reg = arm_spe_event_to_pmsevfr(event);
write_sysreg_s(reg, SYS_PMSEVFR_EL1);
 
+   if (spe_pmu->features & SPE_PMU_FEAT_INV_FILT_EVT) {
+   reg = arm_spe_event_to_pmsnevfr(event);
+   write_sysreg_s(reg, SYS_PMSNEVFR_EL1);
+   }
+
reg = 

[PATCH v3 4/8] perf: arm_spe: Drop BIT() and use FIELD_GET/PREP accessors

2022-11-04 Thread Rob Herring
Now that generated sysregs are in place, update the register field
accesses. The use of BIT() is no longer needed with the new defines. Use
FIELD_GET and FIELD_PREP instead of open coding masking and shifting.

No functional change.

Tested-by: James Clark 
Signed-off-by: Rob Herring 
---
v3:
 - no change
---
 drivers/perf/arm_spe_pmu.c | 70 ++
 1 file changed, 34 insertions(+), 36 deletions(-)

diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
index 814ed18346b6..9b4bd72087ea 100644
--- a/drivers/perf/arm_spe_pmu.c
+++ b/drivers/perf/arm_spe_pmu.c
@@ -283,18 +283,18 @@ static u64 arm_spe_event_to_pmscr(struct perf_event 
*event)
 	struct perf_event_attr *attr = &event->attr;
u64 reg = 0;
 
-   reg |= ATTR_CFG_GET_FLD(attr, ts_enable) << PMSCR_EL1_TS_SHIFT;
-   reg |= ATTR_CFG_GET_FLD(attr, pa_enable) << PMSCR_EL1_PA_SHIFT;
-   reg |= ATTR_CFG_GET_FLD(attr, pct_enable) << PMSCR_EL1_PCT_SHIFT;
+   reg |= FIELD_PREP(PMSCR_EL1_TS, ATTR_CFG_GET_FLD(attr, ts_enable));
+   reg |= FIELD_PREP(PMSCR_EL1_PA, ATTR_CFG_GET_FLD(attr, pa_enable));
+   reg |= FIELD_PREP(PMSCR_EL1_PCT, ATTR_CFG_GET_FLD(attr, pct_enable));
 
if (!attr->exclude_user)
-   reg |= BIT(PMSCR_EL1_E0SPE_SHIFT);
+   reg |= PMSCR_EL1_E0SPE;
 
if (!attr->exclude_kernel)
-   reg |= BIT(PMSCR_EL1_E1SPE_SHIFT);
+   reg |= PMSCR_EL1_E1SPE;
 
if (get_spe_event_has_cx(event))
-   reg |= BIT(PMSCR_EL1_CX_SHIFT);
+   reg |= PMSCR_EL1_CX;
 
return reg;
 }
@@ -322,7 +322,7 @@ static u64 arm_spe_event_to_pmsirr(struct perf_event *event)
 
arm_spe_event_sanitise_period(event);
 
-   reg |= ATTR_CFG_GET_FLD(attr, jitter) << PMSIRR_EL1_RND_SHIFT;
+   reg |= FIELD_PREP(PMSIRR_EL1_RND, ATTR_CFG_GET_FLD(attr, jitter));
reg |= event->hw.sample_period;
 
return reg;
@@ -333,18 +333,18 @@ static u64 arm_spe_event_to_pmsfcr(struct perf_event 
*event)
 	struct perf_event_attr *attr = &event->attr;
u64 reg = 0;
 
-   reg |= ATTR_CFG_GET_FLD(attr, load_filter) << PMSFCR_EL1_LD_SHIFT;
-   reg |= ATTR_CFG_GET_FLD(attr, store_filter) << PMSFCR_EL1_ST_SHIFT;
-   reg |= ATTR_CFG_GET_FLD(attr, branch_filter) << PMSFCR_EL1_B_SHIFT;
+   reg |= FIELD_PREP(PMSFCR_EL1_LD, ATTR_CFG_GET_FLD(attr, load_filter));
+   reg |= FIELD_PREP(PMSFCR_EL1_ST, ATTR_CFG_GET_FLD(attr, store_filter));
+   reg |= FIELD_PREP(PMSFCR_EL1_B, ATTR_CFG_GET_FLD(attr, branch_filter));
 
if (reg)
-   reg |= BIT(PMSFCR_EL1_FT_SHIFT);
+   reg |= PMSFCR_EL1_FT;
 
if (ATTR_CFG_GET_FLD(attr, event_filter))
-   reg |= BIT(PMSFCR_EL1_FE_SHIFT);
+   reg |= PMSFCR_EL1_FE;
 
if (ATTR_CFG_GET_FLD(attr, min_latency))
-   reg |= BIT(PMSFCR_EL1_FL_SHIFT);
+   reg |= PMSFCR_EL1_FL;
 
return reg;
 }
@@ -358,8 +358,7 @@ static u64 arm_spe_event_to_pmsevfr(struct perf_event 
*event)
 static u64 arm_spe_event_to_pmslatfr(struct perf_event *event)
 {
 	struct perf_event_attr *attr = &event->attr;
-   return ATTR_CFG_GET_FLD(attr, min_latency)
-  << PMSLATFR_EL1_MINLAT_SHIFT;
+   return FIELD_PREP(PMSLATFR_EL1_MINLAT, ATTR_CFG_GET_FLD(attr, 
min_latency));
 }
 
 static void arm_spe_pmu_pad_buf(struct perf_output_handle *handle, int len)
@@ -511,7 +510,7 @@ static void arm_spe_perf_aux_output_begin(struct 
perf_output_handle *handle,
limit = buf->snapshot ? arm_spe_pmu_next_snapshot_off(handle)
  : arm_spe_pmu_next_off(handle);
if (limit)
-   limit |= BIT(PMBLIMITR_EL1_E_SHIFT);
+   limit |= PMBLIMITR_EL1_E;
 
limit += (u64)buf->base;
base = (u64)buf->base + PERF_IDX2OFF(handle->head, buf);
@@ -570,23 +569,23 @@ arm_spe_pmu_buf_get_fault_act(struct perf_output_handle 
*handle)
 
/* Service required? */
pmbsr = read_sysreg_s(SYS_PMBSR_EL1);
-   if (!(pmbsr & BIT(PMBSR_EL1_S_SHIFT)))
+   if (!FIELD_GET(PMBSR_EL1_S, pmbsr))
return SPE_PMU_BUF_FAULT_ACT_SPURIOUS;
 
/*
 * If we've lost data, disable profiling and also set the PARTIAL
 * flag to indicate that the last record is corrupted.
 */
-   if (pmbsr & BIT(PMBSR_EL1_DL_SHIFT))
+   if (FIELD_GET(PMBSR_EL1_DL, pmbsr))
perf_aux_output_flag(handle, PERF_AUX_FLAG_TRUNCATED |
 PERF_AUX_FLAG_PARTIAL);
 
/* Report collisions to userspace so that it can up the period */
-   if (pmbsr & BIT(PMBSR_EL1_COLL_SHIFT))
+   if (FIELD_GET(PMBSR_EL1_COLL, pmbsr))
perf_aux_output_flag(handle, PERF_AUX_FLAG_COLLISION);
 
/* We only expect buffer management events */
-   switch (FIELD_GET(PMBSR_EL1_EC_MASK, pmbsr)) {
+   switch (FIELD_GET(PMBSR_EL1_EC, pmbsr)) 
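
To make the conversion concrete, a small standalone illustration of the
accessor semantics follows. The helpers below are simplified userspace
stand-ins for the kernel's <linux/bitfield.h> macros, written only to show
that the shift is derived from a mask defined at its final bit position; they
are not the kernel implementations, and the field name is just an example.

#include <stdint.h>
#include <stdio.h>

/* simplified stand-ins for GENMASK_ULL()/FIELD_PREP()/FIELD_GET() */
#define GENMASK_ULL(h, l)	(((~0ULL) << (l)) & (~0ULL >> (63 - (h))))
#define FIELD_SHIFT(mask)	__builtin_ctzll(mask)
#define FIELD_PREP(mask, val)	(((uint64_t)(val) << FIELD_SHIFT(mask)) & (mask))
#define FIELD_GET(mask, reg)	(((reg) & (mask)) >> FIELD_SHIFT(mask))

#define PMSIDR_EL1_INTERVAL	GENMASK_ULL(11, 8)	/* example field */

int main(void)
{
	uint64_t reg = 0;

	/* old style: reg |= val << PMSIDR_EL1_INTERVAL_SHIFT (unshifted value) */
	/* new style: the field position comes from the mask itself */
	reg |= FIELD_PREP(PMSIDR_EL1_INTERVAL, 5);

	printf("reg=%#llx interval=%llu\n",
	       (unsigned long long)reg,
	       (unsigned long long)FIELD_GET(PMSIDR_EL1_INTERVAL, reg));
	return 0;
}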

[PATCH v3 5/8] perf: arm_spe: Use new PMSIDR_EL1 register enums

2022-11-04 Thread Rob Herring
Now that the SPE register definitions include enums for some PMSIDR_EL1
fields, use them in the driver in place of magic values.

Signed-off-by: Rob Herring 
---
v3: New patch
---
 drivers/perf/arm_spe_pmu.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
index 9b4bd72087ea..af6d3867c3e7 100644
--- a/drivers/perf/arm_spe_pmu.c
+++ b/drivers/perf/arm_spe_pmu.c
@@ -1006,32 +1006,32 @@ static void __arm_spe_pmu_dev_probe(void *info)
/* This field has a spaced out encoding, so just use a look-up */
fld = FIELD_GET(PMSIDR_EL1_INTERVAL, reg);
switch (fld) {
-   case 0:
+   case PMSIDR_EL1_INTERVAL_256:
spe_pmu->min_period = 256;
break;
-   case 2:
+   case PMSIDR_EL1_INTERVAL_512:
spe_pmu->min_period = 512;
break;
-   case 3:
+   case PMSIDR_EL1_INTERVAL_768:
spe_pmu->min_period = 768;
break;
-   case 4:
+   case PMSIDR_EL1_INTERVAL_1024:
spe_pmu->min_period = 1024;
break;
-   case 5:
+   case PMSIDR_EL1_INTERVAL_1536:
spe_pmu->min_period = 1536;
break;
-   case 6:
+   case PMSIDR_EL1_INTERVAL_2048:
spe_pmu->min_period = 2048;
break;
-   case 7:
+   case PMSIDR_EL1_INTERVAL_3072:
spe_pmu->min_period = 3072;
break;
default:
dev_warn(dev, "unknown PMSIDR_EL1.Interval [%d]; assuming 8\n",
 fld);
fallthrough;
-   case 8:
+   case PMSIDR_EL1_INTERVAL_4096:
spe_pmu->min_period = 4096;
}
 
@@ -1050,10 +1050,10 @@ static void __arm_spe_pmu_dev_probe(void *info)
dev_warn(dev, "unknown PMSIDR_EL1.CountSize [%d]; assuming 2\n",
 fld);
fallthrough;
-   case 2:
+   case PMSIDR_EL1_COUNTSIZE_12_BIT_SAT:
spe_pmu->counter_sz = 12;
break;
-   case 3:
+   case PMSIDR_EL1_COUNTSIZE_16_BIT_SAT:
spe_pmu->counter_sz = 16;
}
 

-- 
b4 0.11.0-dev
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v3 0/8] perf: Arm SPEv1.2 support

2022-11-04 Thread Rob Herring
This series adds support for Arm SPEv1.2 which is part of the
Armv8.7/Armv9.2 architecture. There's 2 new features that affect the 
kernel: a new event filter bit, branch 'not taken', and an inverted 
event filter register. 

Since this support adds new registers and fields, first the SPE register 
defines are converted to automatic generation.

Note that the 'config3' addition in sysfs format files causes SPE to 
break. A stable fix e552b7be12ed ("perf: Skip and warn on unknown format 
'configN' attrs") landed in v6.1-rc1.

The perf tool side changes are available here[1].

Tested on FVP.

[1] 
https://lore.kernel.org/all/20220914-arm-perf-tool-spe1-2-v2-v4-0-83c098e62...@kernel.org/

Signed-off-by: Rob Herring 
---
Changes in v3:
- Add some more missing SPE register fields and use Enums for some 
  fields
- Use the new PMSIDR_EL1 register Enum defines in the SPE driver
- Link to v2: 
https://lore.kernel.org/r/20220825-arm-spe-v8-7-v2-0-e37322d68...@kernel.org

Changes in v2:
- Convert the SPE register defines to automatic generation
- Fixed access to SYS_PMSNEVFR_EL1 when not present
- Rebase on v6.1-rc1
- Link to v1: 
https://lore.kernel.org/r/20220825-arm-spe-v8-7-v1-0-c75b8d92e...@kernel.org

---
Rob Herring (8):
  perf: arm_spe: Use feature numbering for PMSEVFR_EL1 defines
  arm64: Drop SYS_ from SPE register defines
  arm64/sysreg: Convert SPE registers to automatic generation
  perf: arm_spe: Drop BIT() and use FIELD_GET/PREP accessors
  perf: arm_spe: Use new PMSIDR_EL1 register enums
  perf: arm_spe: Support new SPEv1.2/v8.7 'not taken' event
  perf: Add perf_event_attr::config3
  perf: arm_spe: Add support for SPEv1.2 inverted event filtering

 arch/arm64/include/asm/el2_setup.h |   6 +-
 arch/arm64/include/asm/sysreg.h|  99 +++
 arch/arm64/kvm/debug.c |   2 +-
 arch/arm64/kvm/hyp/nvhe/debug-sr.c |   2 +-
 arch/arm64/tools/sysreg| 139 +
 drivers/perf/arm_spe_pmu.c | 156 -
 include/uapi/linux/perf_event.h|   3 +
 7 files changed, 257 insertions(+), 150 deletions(-)
---
base-commit: 9abf2313adc1ca1b6180c508c25f22f9395cc780
change-id: 20220825-arm-spe-v8-7-fedf04e16f23

Best regards,
-- 
Rob Herring 
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v3 2/8] arm64: Drop SYS_ from SPE register defines

2022-11-04 Thread Rob Herring
We currently have a non-standard SYS_ prefix in the constants generated
for the SPE register bitfields. Drop this in preparation for automatic
register definition generation.

The SPE mask defines were unshifted, and the SPE register field
enumerations were shifted. The autogenerated defines are the opposite,
so make the necessary adjustments.

No functional changes.

Tested-by: James Clark 
Signed-off-by: Rob Herring 
---
v3:
 - No change
v2:
 - New patch
---
 arch/arm64/include/asm/el2_setup.h |   6 +-
 arch/arm64/include/asm/sysreg.h| 112 ++---
 arch/arm64/kvm/debug.c |   2 +-
 arch/arm64/kvm/hyp/nvhe/debug-sr.c |   2 +-
 drivers/perf/arm_spe_pmu.c |  85 ++--
 5 files changed, 103 insertions(+), 104 deletions(-)

diff --git a/arch/arm64/include/asm/el2_setup.h 
b/arch/arm64/include/asm/el2_setup.h
index 668569adf4d3..f9da43e53cdb 100644
--- a/arch/arm64/include/asm/el2_setup.h
+++ b/arch/arm64/include/asm/el2_setup.h
@@ -53,10 +53,10 @@
cbz x0, .Lskip_spe_\@   // Skip if SPE not present
 
mrs_s   x0, SYS_PMBIDR_EL1  // If SPE available at EL2,
-   and x0, x0, #(1 << SYS_PMBIDR_EL1_P_SHIFT)
+   and x0, x0, #(1 << PMBIDR_EL1_P_SHIFT)
 	cbnz	x0, .Lskip_spe_el2_\@		// then permit sampling of physical
-   mov x0, #(1 << SYS_PMSCR_EL2_PCT_SHIFT | \
- 1 << SYS_PMSCR_EL2_PA_SHIFT)
+   mov x0, #(1 << PMSCR_EL2_PCT_SHIFT | \
+ 1 << PMSCR_EL2_PA_SHIFT)
 	msr_s	SYS_PMSCR_EL2, x0		// addresses and physical counter
 .Lskip_spe_el2_\@:
mov x0, #(MDCR_EL2_E2PB_MASK << MDCR_EL2_E2PB_SHIFT)
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 9a4cf12e3e16..8df8a0a51273 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -239,59 +239,59 @@
 /*** Statistical Profiling Extension ***/
 /* ID registers */
 #define SYS_PMSIDR_EL1 sys_reg(3, 0, 9, 9, 7)
-#define SYS_PMSIDR_EL1_FE_SHIFT0
-#define SYS_PMSIDR_EL1_FT_SHIFT1
-#define SYS_PMSIDR_EL1_FL_SHIFT2
-#define SYS_PMSIDR_EL1_ARCHINST_SHIFT  3
-#define SYS_PMSIDR_EL1_LDS_SHIFT   4
-#define SYS_PMSIDR_EL1_ERND_SHIFT  5
-#define SYS_PMSIDR_EL1_INTERVAL_SHIFT  8
-#define SYS_PMSIDR_EL1_INTERVAL_MASK   0xfUL
-#define SYS_PMSIDR_EL1_MAXSIZE_SHIFT   12
-#define SYS_PMSIDR_EL1_MAXSIZE_MASK0xfUL
-#define SYS_PMSIDR_EL1_COUNTSIZE_SHIFT 16
-#define SYS_PMSIDR_EL1_COUNTSIZE_MASK  0xfUL
+#define PMSIDR_EL1_FE_SHIFT0
+#define PMSIDR_EL1_FT_SHIFT1
+#define PMSIDR_EL1_FL_SHIFT2
+#define PMSIDR_EL1_ARCHINST_SHIFT  3
+#define PMSIDR_EL1_LDS_SHIFT   4
+#define PMSIDR_EL1_ERND_SHIFT  5
+#define PMSIDR_EL1_INTERVAL_SHIFT  8
+#define PMSIDR_EL1_INTERVAL_MASK   GENMASK_ULL(11, 8)
+#define PMSIDR_EL1_MAXSIZE_SHIFT   12
+#define PMSIDR_EL1_MAXSIZE_MASKGENMASK_ULL(15, 12)
+#define PMSIDR_EL1_COUNTSIZE_SHIFT 16
+#define PMSIDR_EL1_COUNTSIZE_MASK  GENMASK_ULL(19, 16)
 
 #define SYS_PMBIDR_EL1 sys_reg(3, 0, 9, 10, 7)
-#define SYS_PMBIDR_EL1_ALIGN_SHIFT 0
-#define SYS_PMBIDR_EL1_ALIGN_MASK  0xfU
-#define SYS_PMBIDR_EL1_P_SHIFT 4
-#define SYS_PMBIDR_EL1_F_SHIFT 5
+#define PMBIDR_EL1_ALIGN_SHIFT 0
+#define PMBIDR_EL1_ALIGN_MASK  0xfU
+#define PMBIDR_EL1_P_SHIFT 4
+#define PMBIDR_EL1_F_SHIFT 5
 
 /* Sampling controls */
 #define SYS_PMSCR_EL1  sys_reg(3, 0, 9, 9, 0)
-#define SYS_PMSCR_EL1_E0SPE_SHIFT  0
-#define SYS_PMSCR_EL1_E1SPE_SHIFT  1
-#define SYS_PMSCR_EL1_CX_SHIFT 3
-#define SYS_PMSCR_EL1_PA_SHIFT 4
-#define SYS_PMSCR_EL1_TS_SHIFT 5
-#define SYS_PMSCR_EL1_PCT_SHIFT6
+#define PMSCR_EL1_E0SPE_SHIFT  0
+#define PMSCR_EL1_E1SPE_SHIFT  1
+#define PMSCR_EL1_CX_SHIFT 3
+#define PMSCR_EL1_PA_SHIFT 4
+#define PMSCR_EL1_TS_SHIFT 5
+#define PMSCR_EL1_PCT_SHIFT6
 
 #define SYS_PMSCR_EL2  sys_reg(3, 4, 9, 9, 0)
-#define SYS_PMSCR_EL2_E0HSPE_SHIFT 0
-#define SYS_PMSCR_EL2_E2SPE_SHIFT  1
-#define SYS_PMSCR_EL2_CX_SHIFT 3
-#define SYS_PMSCR_EL2_PA_SHIFT 4
-#define SYS_PMSCR_EL2_TS_SHIFT 5
-#define SYS_PMSCR_EL2_PCT_SHIFT6
+#define PMSCR_EL2_E0HSPE_SHIFT 0
+#define PMSCR_EL2_E2SPE_SHIFT  1
+#define PMSCR_EL2_CX_SHIFT 3
+#define PMSCR_EL2_PA_SHIFT 4
+#define PMSCR_EL2_TS_SHIFT 5
+#define PMSCR_EL2_PCT_SHIFT6
 
 #define SYS_PMSICR_EL1 sys_reg(3, 0, 9, 9, 2)
 
 #define SYS_PMSIRR_EL1 sys_reg(3, 0, 9, 9, 3)
-#define SYS_PMSIRR_EL1_RND_SHIFT   0
-#define SYS_PMSIRR_EL1_INTERVAL_SHIFT  8
-#define 

[PATCH v3 6/8] perf: arm_spe: Support new SPEv1.2/v8.7 'not taken' event

2022-11-04 Thread Rob Herring
Arm SPEv1.2 (Armv8.7/v9.2) adds a new event, 'not taken', in bit 6 of
the PMSEVFR_EL1 register. Update arm_spe_pmsevfr_res0() to support the
additional event.

Tested-by: James Clark 
Signed-off-by: Rob Herring 
---
v3:
 - No change
v2:
 - Update for v6.1 sysreg generated header changes
---
 arch/arm64/include/asm/sysreg.h | 2 ++
 drivers/perf/arm_spe_pmu.c  | 4 +++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index d002dd00e53e..06231e896832 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -242,6 +242,8 @@
 BIT_ULL(6) | BIT_ULL(4) | BIT_ULL(2) | BIT_ULL(0))
 #define PMSEVFR_EL1_RES0_V1P1  \
(PMSEVFR_EL1_RES0_IMP & ~(BIT_ULL(18) | BIT_ULL(17) | BIT_ULL(11)))
+#define PMSEVFR_EL1_RES0_V1P2  \
+   (PMSEVFR_EL1_RES0_V1P1 & ~BIT_ULL(6))
 
 /* Buffer error reporting */
 #define PMBSR_EL1_FAULT_FSC_SHIFT  PMBSR_EL1_MSS_SHIFT
diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
index af6d3867c3e7..82f67e941bc4 100644
--- a/drivers/perf/arm_spe_pmu.c
+++ b/drivers/perf/arm_spe_pmu.c
@@ -677,9 +677,11 @@ static u64 arm_spe_pmsevfr_res0(u16 pmsver)
case ID_AA64DFR0_EL1_PMSVer_IMP:
return PMSEVFR_EL1_RES0_IMP;
case ID_AA64DFR0_EL1_PMSVer_V1P1:
+   return PMSEVFR_EL1_RES0_V1P1;
+   case ID_AA64DFR0_EL1_PMSVer_V1P2:
/* Return the highest version we support in default */
default:
-   return PMSEVFR_EL1_RES0_V1P1;
+   return PMSEVFR_EL1_RES0_V1P2;
}
 }
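
For anyone wiring this up from userspace, a hedged sketch follows: it
discovers where the driver's existing event_filter knob lives by parsing its
sysfs format file and then sets bit 6 ("not taken", per this patch). The PMU
instance name arm_spe_0, the SPE_EVT_NOT_TAKEN constant name, and the simple
"configN:lo-hi" parsing are assumptions made for illustration only.

#include <linux/perf_event.h>
#include <stdio.h>

#define SPE_EVT_NOT_TAKEN	(1ULL << 6)	/* PMSEVFR_EL1 bit 6; name made up */

static int set_event_filter(struct perf_event_attr *attr, unsigned long long bits)
{
	const char *path =
		"/sys/bus/event_source/devices/arm_spe_0/format/event_filter";
	char spec[64];
	unsigned int cfg, lo, hi;
	FILE *f;
	int ok;

	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fgets(spec, sizeof(spec), f) != NULL;
	fclose(f);
	if (!ok)
		return -1;

	/* format files look like "config1:0-63"; only that shape is handled */
	if (sscanf(spec, "config%u:%u-%u", &cfg, &lo, &hi) != 3)
		return -1;

	switch (cfg) {
	case 1: attr->config1 |= bits << lo; break;
	case 2: attr->config2 |= bits << lo; break;
	case 3: attr->config3 |= bits << lo; break;	/* needs updated headers */
	default: return -1;
	}
	return 0;
}

/* usage: set_event_filter(&attr, SPE_EVT_NOT_TAKEN); */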
 

-- 
b4 0.11.0-dev
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v3 3/8] arm64/sysreg: Convert SPE registers to automatic generation

2022-11-04 Thread Rob Herring
Convert all the SPE register defines to automatic generation. No
functional changes.

New registers and fields for SPEv1.2 are added with the conversion.

Some of the PMBSR MSS field defines are kept as the automatic generation
has no way to create multiple names for the same register bits. The
meaning of the MSS field depends on other bits.

Tested-by: James Clark 
Signed-off-by: Rob Herring 
---
v3:
 - Make some fields enums and add some missing fields
v2:
 - New patch
---
 arch/arm64/include/asm/sysreg.h |  91 ++
 arch/arm64/tools/sysreg | 139 
 2 files changed, 144 insertions(+), 86 deletions(-)

diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 8df8a0a51273..d002dd00e53e 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -237,99 +237,18 @@
 #define SYS_PAR_EL1_FST		GENMASK(6, 1)
 
 /*** Statistical Profiling Extension ***/
-/* ID registers */
-#define SYS_PMSIDR_EL1 sys_reg(3, 0, 9, 9, 7)
-#define PMSIDR_EL1_FE_SHIFT0
-#define PMSIDR_EL1_FT_SHIFT1
-#define PMSIDR_EL1_FL_SHIFT2
-#define PMSIDR_EL1_ARCHINST_SHIFT  3
-#define PMSIDR_EL1_LDS_SHIFT   4
-#define PMSIDR_EL1_ERND_SHIFT  5
-#define PMSIDR_EL1_INTERVAL_SHIFT  8
-#define PMSIDR_EL1_INTERVAL_MASK   GENMASK_ULL(11, 8)
-#define PMSIDR_EL1_MAXSIZE_SHIFT   12
-#define PMSIDR_EL1_MAXSIZE_MASKGENMASK_ULL(15, 12)
-#define PMSIDR_EL1_COUNTSIZE_SHIFT 16
-#define PMSIDR_EL1_COUNTSIZE_MASK  GENMASK_ULL(19, 16)
-
-#define SYS_PMBIDR_EL1 sys_reg(3, 0, 9, 10, 7)
-#define PMBIDR_EL1_ALIGN_SHIFT 0
-#define PMBIDR_EL1_ALIGN_MASK  0xfU
-#define PMBIDR_EL1_P_SHIFT 4
-#define PMBIDR_EL1_F_SHIFT 5
-
-/* Sampling controls */
-#define SYS_PMSCR_EL1  sys_reg(3, 0, 9, 9, 0)
-#define PMSCR_EL1_E0SPE_SHIFT  0
-#define PMSCR_EL1_E1SPE_SHIFT  1
-#define PMSCR_EL1_CX_SHIFT 3
-#define PMSCR_EL1_PA_SHIFT 4
-#define PMSCR_EL1_TS_SHIFT 5
-#define PMSCR_EL1_PCT_SHIFT6
-
-#define SYS_PMSCR_EL2  sys_reg(3, 4, 9, 9, 0)
-#define PMSCR_EL2_E0HSPE_SHIFT 0
-#define PMSCR_EL2_E2SPE_SHIFT  1
-#define PMSCR_EL2_CX_SHIFT 3
-#define PMSCR_EL2_PA_SHIFT 4
-#define PMSCR_EL2_TS_SHIFT 5
-#define PMSCR_EL2_PCT_SHIFT6
-
-#define SYS_PMSICR_EL1 sys_reg(3, 0, 9, 9, 2)
-
-#define SYS_PMSIRR_EL1 sys_reg(3, 0, 9, 9, 3)
-#define PMSIRR_EL1_RND_SHIFT   0
-#define PMSIRR_EL1_INTERVAL_SHIFT  8
-#define PMSIRR_EL1_INTERVAL_MASK   GENMASK_ULL(31, 8)
-
-/* Filtering controls */
-#define SYS_PMSNEVFR_EL1   sys_reg(3, 0, 9, 9, 1)
-
-#define SYS_PMSFCR_EL1 sys_reg(3, 0, 9, 9, 4)
-#define PMSFCR_EL1_FE_SHIFT0
-#define PMSFCR_EL1_FT_SHIFT1
-#define PMSFCR_EL1_FL_SHIFT2
-#define PMSFCR_EL1_B_SHIFT 16
-#define PMSFCR_EL1_LD_SHIFT17
-#define PMSFCR_EL1_ST_SHIFT18
-
-#define SYS_PMSEVFR_EL1			sys_reg(3, 0, 9, 9, 5)
 #define PMSEVFR_EL1_RES0_IMP   \
(GENMASK_ULL(47, 32) | GENMASK_ULL(23, 16) | GENMASK_ULL(11, 8) |\
 BIT_ULL(6) | BIT_ULL(4) | BIT_ULL(2) | BIT_ULL(0))
 #define PMSEVFR_EL1_RES0_V1P1  \
(PMSEVFR_EL1_RES0_IMP & ~(BIT_ULL(18) | BIT_ULL(17) | BIT_ULL(11)))
 
-#define SYS_PMSLATFR_EL1   sys_reg(3, 0, 9, 9, 6)
-#define PMSLATFR_EL1_MINLAT_SHIFT  0
-
-/* Buffer controls */
-#define SYS_PMBLIMITR_EL1  sys_reg(3, 0, 9, 10, 0)
-#define PMBLIMITR_EL1_E_SHIFT  0
-#define PMBLIMITR_EL1_FM_SHIFT 1
-#define PMBLIMITR_EL1_FM_MASK  GENMASK_ULL(2, 1)
-#define PMBLIMITR_EL1_FM_STOP_IRQ  0
-
-#define SYS_PMBPTR_EL1 sys_reg(3, 0, 9, 10, 1)
-
 /* Buffer error reporting */
-#define SYS_PMBSR_EL1  sys_reg(3, 0, 9, 10, 3)
-#define PMBSR_EL1_COLL_SHIFT   16
-#define PMBSR_EL1_S_SHIFT  17
-#define PMBSR_EL1_EA_SHIFT 18
-#define PMBSR_EL1_DL_SHIFT 19
-#define PMBSR_EL1_EC_SHIFT 26
-#define PMBSR_EL1_EC_MASK  GENMASK_ULL(31, 26)
-
-#define PMBSR_EL1_EC_BUF   0x0UL
-#define PMBSR_EL1_EC_FAULT_S1  0x24UL
-#define PMBSR_EL1_EC_FAULT_S2  0x25UL
-
-#define PMBSR_EL1_FAULT_FSC_SHIFT  0
-#define PMBSR_EL1_FAULT_FSC_MASK   0x3fUL
-
-#define PMBSR_EL1_BUF_BSC_SHIFT0
-#define PMBSR_EL1_BUF_BSC_MASK 0x3fUL
+#define PMBSR_EL1_FAULT_FSC_SHIFT  PMBSR_EL1_MSS_SHIFT
+#define PMBSR_EL1_FAULT_FSC_MASK   PMBSR_EL1_MSS_MASK
+
+#define PMBSR_EL1_BUF_BSC_SHIFTPMBSR_EL1_MSS_SHIFT
+#define PMBSR_EL1_BUF_BSC_MASK PMBSR_EL1_MSS_MASK
 
 #define PMBSR_EL1_BUF_BSC_FULL   

Re: [PATCH v2 11/14] KVM: arm64: PMU: Allow ID_AA64DFR0_EL1.PMUver to be set from userspace

2022-11-04 Thread Reiji Watanabe
Hi Marc,

On Fri, Nov 4, 2022 at 5:21 AM Marc Zyngier  wrote:
>
> Hi Reiji,
>
> On Fri, 04 Nov 2022 07:00:22 +,
> Reiji Watanabe  wrote:
> >
> > On Thu, Nov 3, 2022 at 3:25 AM Marc Zyngier  wrote:
> > >
> > > On Thu, 03 Nov 2022 05:31:56 +,
> > > Reiji Watanabe  wrote:
> > > >
> > > > It appears the patch allows userspace to set IMPDEF even
> > > > when host_pmuver == 0.  Shouldn't it be allowed only when
> > > > host_pmuver == IMPDEF (as before)?
> > > > Probably, it may not cause any real problems though.
> > >
> > > Given that we don't treat the two cases any differently, I thought it
> > > would be reasonable to relax this particular case, and I can't see any
> > > reason why we shouldn't tolerate this sort of migration.
> >
> > That's true. I assume it won't cause any functional issues.
> >
> > I have another comment related to this.
> > KVM allows userspace to create a guest with a mix of vCPUs with and
> > without PMU.  For such a guest, if the register for the vCPU without
> > PMU is set last, I think the PMUVER value for vCPUs with PMU could
> > become no PMU (0) or IMPDEF (0xf).
> > Also, with the current patch, userspace can set PMUv3 support value
> > (non-zero or non-IMPDEF) for vCPUs without the PMU.
> > IMHO, KVM shouldn't allow userspace to set PMUVER to the value that
> > is inconsistent with PMU configuration for the vCPU.
> > What do you think ?
>
> Yes, this seems sensible, and we only do it one way at the moment.
>
> > I'm thinking of the following code (not tested).
> >
> > diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> > index 4fa14b4ae2a6..ddd849027cc3 100644
> > --- a/arch/arm64/kvm/sys_regs.c
> > +++ b/arch/arm64/kvm/sys_regs.c
> > @@ -1265,10 +1265,17 @@ static int set_id_aa64dfr0_el1(struct kvm_vcpu 
> > *vcpu,
> > if (pmuver != ID_AA64DFR0_EL1_PMUVer_IMP_DEF && pmuver > 
> > host_pmuver)
> > return -EINVAL;
> >
> > -   /* We already have a PMU, don't try to disable it... */
> > -   if (kvm_vcpu_has_pmu(vcpu) &&
> > -   (pmuver == 0 || pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF))
> > -   return -EINVAL;
> > +   if (kvm_vcpu_has_pmu(vcpu)) {
> > +   /* We already have a PMU, don't try to disable it... */
> > +   if (pmuver == 0 || pmuver == 
> > ID_AA64DFR0_EL1_PMUVer_IMP_DEF) {
> > +   return -EINVAL;
> > +   }
> > +   } else {
> > +   /* We don't have a PMU, don't try to enable it... */
> > +   if (pmuver > 0 && pmuver != ID_AA64DFR0_EL1_PMUVer_IMP_DEF) 
> > {
> > +   return -EINVAL;
> > +   }
> > +   }
>
> This is a bit ugly. I came up with this instead:
>
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 3b28ef48a525..e104fde1a0ee 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -1273,6 +1273,7 @@ static int set_id_aa64dfr0_el1(struct kvm_vcpu *vcpu,
>u64 val)
>  {
> u8 pmuver, host_pmuver;
> +   bool valid_pmu;
>
> host_pmuver = kvm_arm_pmu_get_pmuver_limit();
>
> @@ -1286,9 +1287,10 @@ static int set_id_aa64dfr0_el1(struct kvm_vcpu *vcpu,
> if (pmuver != ID_AA64DFR0_EL1_PMUVer_IMP_DEF && pmuver > host_pmuver)
> return -EINVAL;
>
> -   /* We already have a PMU, don't try to disable it... */
> -   if (kvm_vcpu_has_pmu(vcpu) &&
> -   (pmuver == 0 || pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF))
> +   valid_pmu = (pmuver != 0 && pmuver != ID_AA64DFR0_EL1_PMUVer_IMP_DEF);
> +
> +   /* Make sure view register and PMU support do match */
> +   if (kvm_vcpu_has_pmu(vcpu) != valid_pmu)
> return -EINVAL;

Thanks, much better!

>
> /* We can only differ with PMUver, and anything else is an error */
>
> and the similar check for the 32bit counterpart.
>
> >
> > /* We can only differ with PMUver, and anything else is an error */
> > val ^= read_id_reg(vcpu, rd);
> > @@ -1276,7 +1283,8 @@ static int set_id_aa64dfr0_el1(struct kvm_vcpu *vcpu,
> > if (val)
> > return -EINVAL;
> >
> > -   vcpu->kvm->arch.dfr0_pmuver = pmuver;
> > +   if (kvm_vcpu_has_pmu(vcpu))
> > +   vcpu->kvm->arch.dfr0_pmuver = pmuver;
>
> We need to update this unconditionally if we want to be able to
> restore an IMPDEF PMU view to the guest.

Yes, right.
BTW, if we have no intention of supporting a mix of vCPUs with and
without PMU, I think it would be nice if we have a clear comment on
that in the code.  Or I'm hoping to disallow it if possible though.

Thank you,
Reiji
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 33/44] KVM: x86: Do VMX/SVM support checks directly in vendor code

2022-11-04 Thread Sean Christopherson
On Fri, Nov 04, 2022, Paolo Bonzini wrote:
> On 11/3/22 19:58, Sean Christopherson wrote:
> > 
> > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > index 3e508f239098..ebe617ab0b37 100644
> > --- a/arch/x86/kernel/cpu/common.c
> > +++ b/arch/x86/kernel/cpu/common.c
> > @@ -191,6 +191,8 @@ static void default_init(struct cpuinfo_x86 *c)
> >  strcpy(c->x86_model_id, "386");
> >  }
> >   #endif
> > +
> > +   clear_cpu_cap(c, X86_FEATURE_MSR_IA32_FEAT_CTL);
> >   }
> >   static const struct cpu_dev default_cpu = {
> 
> Not needed I think?  default_init does not call init_ia32_feat_ctl.

cpuid_deps is only processed by do_clear_cpu_cap(), so unless there's an 
explicit
"clear" action, the dependencies will not be updated.  It kinda makes sense 
since
hardware-based features shouldn't end up with scenarios where a dependent 
feature
exists but the base feature does not (barring bad KVM setups :-) ).

That said, this seems like a bug waiting to happen, and unless I'm missing 
something
it's quite straightforward to process all dependencies during setup.  Time to 
find
out if Boris and co. agree :-)

diff --git a/arch/x86/include/asm/cpufeature.h 
b/arch/x86/include/asm/cpufeature.h
index 1a85e1fb0922..c4408d03b180 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -147,6 +147,7 @@ extern const char * const x86_bug_flags[NBUGINTS*32];
 
 extern void setup_clear_cpu_cap(unsigned int bit);
 extern void clear_cpu_cap(struct cpuinfo_x86 *c, unsigned int bit);
+extern void apply_cpuid_deps(struct cpuinfo_x86 *c);
 
 #define setup_force_cpu_cap(bit) do { \
 	set_cpu_cap(&boot_cpu_data, bit);	\
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 3e508f239098..28ce31dadd7f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1884,6 +1884,8 @@ static void identify_cpu(struct cpuinfo_x86 *c)
c->x86_capability[i] |= boot_cpu_data.x86_capability[i];
}
 
+   apply_cpuid_deps(c);
+
ppin_init(c);
 
/* Init Machine Check Exception if available. */
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index c881bcafba7d..7e91e97973ca 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -138,3 +138,13 @@ void setup_clear_cpu_cap(unsigned int feature)
 {
do_clear_cpu_cap(NULL, feature);
 }
+
+void apply_cpuid_deps(struct cpuinfo_x86 *c)
+{
+   const struct cpuid_dep *d;
+
+   for (d = cpuid_deps; d->feature; d++) {
+   if (!cpu_has(c, d->feature))
+   clear_cpu_cap(c, d->feature);
+   }
+}

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v2 0/7] perf: Arm SPEv1.2 support

2022-11-04 Thread Rob Herring
On Fri, Nov 4, 2022 at 6:30 AM James Clark  wrote:
>
>
>
> On 19/10/2022 20:11, Rob Herring wrote:
> > This series adds support for Arm SPEv1.2 which is part of the
> > Armv8.7/Armv9.2 architecture. There's 2 new features that affect the
> > kernel: a new event filter bit, branch 'not taken', and an inverted
> > event filter register.
> >
> > Since this support adds new registers and fields, first the SPE register
> > defines are converted to automatic generation.
> >
> > Note that the 'config3' addition in sysfs format files causes SPE to
> > break. A stable fix e552b7be12ed ("perf: Skip and warn on unknown format
> > 'configN' attrs") landed in v6.1-rc1.
> >
> > The perf tool side changes are available here[1].
> >
> > Tested on FVP.
> >
> > [1] 
> > https://lore.kernel.org/all/20220914-arm-perf-tool-spe1-2-v2-v4-0-83c098e62...@kernel.org/
> >
>
> LGTM. Tested with [1] applied and on N1SDP (where it isn't supported),
> and on the FVP. Enabling all the inverted filters results in no trace
> and other combinations work as expected.

Thanks, I'll take that as a 'Tested-by' tag if that's okay.

In the future, please be explicit with tags. Tools such as b4 and
patchwork will recognize them and add them automatically.

Rob
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v2 11/14] KVM: arm64: PMU: Allow ID_AA64DFR0_EL1.PMUver to be set from userspace

2022-11-04 Thread Marc Zyngier
Hi Reiji,

On Fri, 04 Nov 2022 07:00:22 +,
Reiji Watanabe  wrote:
>
> On Thu, Nov 3, 2022 at 3:25 AM Marc Zyngier  wrote:
> >
> > On Thu, 03 Nov 2022 05:31:56 +,
> > Reiji Watanabe  wrote:
> > >
> > > It appears the patch allows userspace to set IMPDEF even
> > > when host_pmuver == 0.  Shouldn't it be allowed only when
> > > host_pmuver == IMPDEF (as before)?
> > > Probably, it may not cause any real problems though.
> >
> > Given that we don't treat the two cases any differently, I thought it
> > would be reasonable to relax this particular case, and I can't see any
> > reason why we shouldn't tolerate this sort of migration.
>
> That's true. I assume it won't cause any functional issues.
> 
> I have another comment related to this.
> KVM allows userspace to create a guest with a mix of vCPUs with and
> without PMU.  For such a guest, if the register for the vCPU without
> PMU is set last, I think the PMUVER value for vCPUs with PMU could
> become no PMU (0) or IMPDEF (0xf).
> Also, with the current patch, userspace can set PMUv3 support value
> (non-zero or non-IMPDEF) for vCPUs without the PMU.
> IMHO, KVM shouldn't allow userspace to set PMUVER to the value that
> is inconsistent with PMU configuration for the vCPU.
> What do you think ?

Yes, this seems sensible, and we only do it one way at the moment.

> I'm thinking of the following code (not tested).
> 
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 4fa14b4ae2a6..ddd849027cc3 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -1265,10 +1265,17 @@ static int set_id_aa64dfr0_el1(struct kvm_vcpu *vcpu,
> if (pmuver != ID_AA64DFR0_EL1_PMUVer_IMP_DEF && pmuver > host_pmuver)
> return -EINVAL;
> 
> -   /* We already have a PMU, don't try to disable it... */
> -   if (kvm_vcpu_has_pmu(vcpu) &&
> -   (pmuver == 0 || pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF))
> -   return -EINVAL;
> +   if (kvm_vcpu_has_pmu(vcpu)) {
> +   /* We already have a PMU, don't try to disable it... */
> +   if (pmuver == 0 || pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF) {
> +   return -EINVAL;
> +   }
> +   } else {
> +   /* We don't have a PMU, don't try to enable it... */
> +   if (pmuver > 0 && pmuver != ID_AA64DFR0_EL1_PMUVer_IMP_DEF) {
> +   return -EINVAL;
> +   }
> +   }

This is a bit ugly. I came up with this instead:

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 3b28ef48a525..e104fde1a0ee 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1273,6 +1273,7 @@ static int set_id_aa64dfr0_el1(struct kvm_vcpu *vcpu,
   u64 val)
 {
u8 pmuver, host_pmuver;
+   bool valid_pmu;
 
host_pmuver = kvm_arm_pmu_get_pmuver_limit();
 
@@ -1286,9 +1287,10 @@ static int set_id_aa64dfr0_el1(struct kvm_vcpu *vcpu,
if (pmuver != ID_AA64DFR0_EL1_PMUVer_IMP_DEF && pmuver > host_pmuver)
return -EINVAL;
 
-   /* We already have a PMU, don't try to disable it... */
-   if (kvm_vcpu_has_pmu(vcpu) &&
-   (pmuver == 0 || pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF))
+   valid_pmu = (pmuver != 0 && pmuver != ID_AA64DFR0_EL1_PMUVer_IMP_DEF);
+
+   /* Make sure view register and PMU support do match */
+   if (kvm_vcpu_has_pmu(vcpu) != valid_pmu)
return -EINVAL;
 
/* We can only differ with PMUver, and anything else is an error */

and the similar check for the 32bit counterpart.
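
For reference, a sketch of what that 32-bit counterpart could look like; the
AArch32 field and constant names below are approximations, not the final code,
and "perfmon" is assumed to be the ID_DFR0_EL1.PerfMon value already extracted
from the written register:

	/* sketch: mirror the check in set_id_dfr0_el1() */
	valid_pmu = (perfmon != 0 && perfmon != ID_DFR0_EL1_PerfMon_IMPDEF);

	/* Make sure the AArch32 view and PMU support match as well */
	if (kvm_vcpu_has_pmu(vcpu) != valid_pmu)
		return -EINVAL;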

> 
> /* We can only differ with PMUver, and anything else is an error */
> val ^= read_id_reg(vcpu, rd);
> @@ -1276,7 +1283,8 @@ static int set_id_aa64dfr0_el1(struct kvm_vcpu *vcpu,
> if (val)
> return -EINVAL;
> 
> -   vcpu->kvm->arch.dfr0_pmuver = pmuver;
> +   if (kvm_vcpu_has_pmu(vcpu))
> +   vcpu->kvm->arch.dfr0_pmuver = pmuver;

We need to update this unconditionally if we want to be able to
restore an IMPDEF PMU view to the guest.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH] kvm/arm: Fix pvtime documentation

2022-11-04 Thread Bagas Sanjaya
On 11/4/22 18:06, Marc Zyngier wrote:
> On Fri, 04 Nov 2022 01:48:21 +,
> Bagas Sanjaya  wrote:
>>
>> On 11/3/22 22:42, Marc Zyngier wrote:
>>> No, this is the correct course of action. There isn't any point in
>>> having an *unrelated* change in a separate series. This is a
>>> standalone change, posted as a standalone patch.
>>>
 Please reroll your series [2] with suggestion applied.
>>>
>>> Or not.
>>>
>>
>> You mean the series before this patch have already been applied,
>> right?
> 
> This change is 100% independent from the series you quoted. Why should
> there be a dependency between the two?
> 
> As for respinning the series at this stage for a documentation
> formatting issue, this is pretty pointless, and only clutters people's
> Inbox with redundant versions...
> 
>   M.
> 

OK, thanks!

-- 
An old man doll... just what I always wanted! - Clara

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 00/44] KVM: Rework kvm_init() and hardware enabling

2022-11-04 Thread Isaku Yamahata
On Wed, Nov 02, 2022 at 11:18:27PM +,
Sean Christopherson  wrote:

> Non-x86 folks, please test on hardware when possible.  I made a _lot_ of
> mistakes when moving code around.  Thankfully, x86 was the trickiest code
> to deal with, and I'm fairly confident that I found all the bugs I
> introduced via testing.  But the number of mistakes I made and found on
> x86 makes me more than a bit worried that I screwed something up in other
> arch code.
> 
> This is a continuation of Chao's series to do x86 CPU compatibility checks
> during virtualization hardware enabling[1], and of Isaku's series to try
> and clean up the hardware enabling paths so that x86 (Intel specifically)
> can temporarily enable hardware during module initialization without
> causing undue pain for other architectures[2].  It also includes one patch
> from another mini-series from Isaku that provides the less controversial
> patches[3].
> 
> The main theme of this series is to kill off kvm_arch_init(),
> kvm_arch_hardware_(un)setup(), and kvm_arch_check_processor_compat(), which
> all originated in x86 code from way back when, and needlessly complicate
> both common KVM code and architecture code.  E.g. many architectures don't
> mark functions/data as __init/__ro_after_init purely because kvm_init()
> isn't marked __init to support x86's separate vendor modules.
> 
> The idea/hope is that with those hooks gone (moved to arch code), it will
> be easier for x86 (and other architectures) to modify their module init
> sequences as needed without having to fight common KVM code.  E.g. I'm
> hoping that ARM can build on this to simplify its hardware enabling logic,
> especially the pKVM side of things.
> 
> There are bug fixes throughout this series.  They are more scattered than
> I would usually prefer, but getting the sequencing correct was a gigantic
> pain for many of the x86 fixes due to needing to fix common code in order
> for the x86 fix to have any meaning.  And while the bugs are often fatal,
> they aren't all that interesting for most users as they either require a
> malicious admin or broken hardware, i.e. aren't likely to be encountered
> by the vast majority of KVM users.  So unless someone _really_ wants a
> particular fix isolated for backporting, I'm not planning on shuffling
> patches.
> 
> Tested on x86.  Lightly tested on arm64.  Compile tested only on all other
> architectures.

Thanks for the patch series. I tried the rebased TDX KVM patch series and it worked.
Since cpu offline needs to be rejected in some cases (to keep at least one cpu
online per package), an arch hook for cpu offline is needed.
I can keep it in the TDX KVM patch series.

diff --git a/arch/x86/include/asm/kvm-x86-ops.h 
b/arch/x86/include/asm/kvm-x86-ops.h
index 23c0f4bc63f1..ef7bcb845d42 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -17,6 +17,7 @@ BUILD_BUG_ON(1)
 KVM_X86_OP(hardware_enable)
 KVM_X86_OP(hardware_disable)
 KVM_X86_OP(hardware_unsetup)
+KVM_X86_OP_OPTIONAL_RET0(offline_cpu)
 KVM_X86_OP(has_emulated_msr)
 KVM_X86_OP(vcpu_after_set_cpuid)
 KVM_X86_OP(is_vm_type_supported)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 496c7c6eaff9..c420409aa96f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1468,6 +1468,7 @@ struct kvm_x86_ops {
int (*hardware_enable)(void);
void (*hardware_disable)(void);
void (*hardware_unsetup)(void);
+   int (*offline_cpu)(void);
bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2ed5a017f7bc..17c5d6a76c93 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12039,6 +12039,11 @@ void kvm_arch_hardware_disable(void)
drop_user_return_notifiers();
 }
 
+int kvm_arch_offline_cpu(unsigned int cpu)
+{
+   return static_call(kvm_x86_offline_cpu)();
+}
+
 bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
 {
return vcpu->kvm->arch.bsp_vcpu_id == vcpu->vcpu_id;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 620489b9aa93..4df79443fd11 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1460,6 +1460,7 @@ static inline void kvm_create_vcpu_debugfs(struct 
kvm_vcpu *vcpu) {}
 int kvm_arch_hardware_enable(void);
 void kvm_arch_hardware_disable(void);
 #endif
+int kvm_arch_offline_cpu(unsigned int cpu);
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
 bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f6b6dcedaa0a..f770fdc662d0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5396,16 +5396,24 @@ static void hardware_disable_nolock(void *junk)
__this_cpu_write(hardware_enabled, false);
 }
 
+__weak int kvm_arch_offline_cpu(unsigned int cpu)
+{
+   

Re: [PATCH 36/44] KVM: x86: Do compatibility checks when onlining CPU

2022-11-04 Thread Isaku Yamahata
On Thu, Nov 03, 2022 at 10:34:10PM +,
Sean Christopherson  wrote:

> On Thu, Nov 03, 2022, Isaku Yamahata wrote:
> > On Wed, Nov 02, 2022 at 11:19:03PM +,
> > Sean Christopherson  wrote:
> > > diff --git a/arch/x86/include/asm/kvm_host.h 
> > > b/arch/x86/include/asm/kvm_host.h
> > > index f223c845ed6e..c99222b71fcc 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -1666,7 +1666,7 @@ struct kvm_x86_nested_ops {
> > >  };
> > >  
> > >  struct kvm_x86_init_ops {
> > > - int (*check_processor_compatibility)(void);
> > > + int (*check_processor_compatibility)(int cpu);
> > 
> > Is this cpu argument used only for the error message, to include the cpu
> > number while avoiding repeated raw_smp_processor_id() calls in pr_err()?
> 
> Yep.
> 
> > The actual check is done on the current executing cpu.
> > 
> > If cpu != raw_smp_processor_id(), cpu is wrong. Although the function is
> > called in a non-preemptive context, it's a bit confusing. So I'm voting to
> > remove it and to use raw_smp_processor_id() directly.
> 
> What if I rename the param to this_cpu?  I 100% agree the argument is
> confusing as-is, but forcing all the helpers to manually grab the cpu is quite
> annoying.

Makes sense. Let's settle it with this_cpu.
-- 
Isaku Yamahata 
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v2 0/7] perf: Arm SPEv1.2 support

2022-11-04 Thread James Clark



On 19/10/2022 20:11, Rob Herring wrote:
> This series adds support for Arm SPEv1.2 which is part of the
> Armv8.7/Armv9.2 architecture. There's 2 new features that affect the 
> kernel: a new event filter bit, branch 'not taken', and an inverted 
> event filter register. 
> 
> Since this support adds new registers and fields, first the SPE register 
> defines are converted to automatic generation.
> 
> Note that the 'config3' addition in sysfs format files causes SPE to 
> break. A stable fix e552b7be12ed ("perf: Skip and warn on unknown format 
> 'configN' attrs") landed in v6.1-rc1.
> 
> The perf tool side changes are available here[1].
> 
> Tested on FVP.
> 
> [1] 
> https://lore.kernel.org/all/20220914-arm-perf-tool-spe1-2-v2-v4-0-83c098e62...@kernel.org/
> 

LGTM. Tested with [1] applied and on N1SDP (where it isn't supported),
and on the FVP. Enabling all the inverted filters results in no trace
and other combinations work as expected.

James

> Signed-off-by: Rob Herring 
> ---
> Changes in v2:
> - Convert the SPE register defines to automatic generation
> - Fixed access to SYS_PMSNEVFR_EL1 when not present
> - Rebase on v6.1-rc1
> - Link to v1: 
> https://lore.kernel.org/r/20220825-arm-spe-v8-7-v1-0-c75b8d92e...@kernel.org
> 
> ---
> Rob Herring (7):
>   perf: arm_spe: Use feature numbering for PMSEVFR_EL1 defines
>   arm64: Drop SYS_ from SPE register defines
>   arm64/sysreg: Convert SPE registers to automatic generation
>   perf: arm_spe: Drop BIT() and use FIELD_GET/PREP accessors
>   perf: arm_spe: Support new SPEv1.2/v8.7 'not taken' event
>   perf: Add perf_event_attr::config3
>   perf: arm_spe: Add support for SPEv1.2 inverted event filtering
> 
>  arch/arm64/include/asm/el2_setup.h |   6 +-
>  arch/arm64/include/asm/sysreg.h|  99 +++
>  arch/arm64/kvm/debug.c |   2 +-
>  arch/arm64/kvm/hyp/nvhe/debug-sr.c |   2 +-
>  arch/arm64/tools/sysreg| 116 +++
>  drivers/perf/arm_spe_pmu.c | 136 
> -
>  include/uapi/linux/perf_event.h|   3 +
>  7 files changed, 224 insertions(+), 140 deletions(-)
> ---
> base-commit: 9abf2313adc1ca1b6180c508c25f22f9395cc780
> change-id: 20220825-arm-spe-v8-7-fedf04e16f23
> 
> Best regards,
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 02/44] KVM: Initialize IRQ FD after arch hardware setup

2022-11-04 Thread Chao Gao
On Wed, Nov 02, 2022 at 11:18:29PM +, Sean Christopherson wrote:
> 
>+  r = kvm_irqfd_init();
>+  if (r)
>+  goto err_irqfd;
>+
>   r = kvm_async_pf_init();
>   if (r)
>-  goto out_free_4;
>+  goto err_async_pf;
> 
>   kvm_chardev_ops.owner = module;
> 
>@@ -5927,6 +5926,9 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned 
>vcpu_align,
>   kvm_vfio_ops_exit();
> err_vfio:
>   kvm_async_pf_deinit();
>+err_async_pf:
>+  kvm_irqfd_exit();

>+err_irqfd:
> out_free_4:

Do you mind removing one of the two labels?
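
i.e. (sketch only) the new call could jump to the existing label instead of
introducing err_irqfd; the names below come straight from the quoted hunk:

	r = kvm_irqfd_init();
	if (r)
		goto out_free_4;	/* reuse the existing label, drop err_irqfd */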
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 03/44] KVM: Allocate cpus_hardware_enabled after arch hardware setup

2022-11-04 Thread Yuan Yao
On Wed, Nov 02, 2022 at 11:18:30PM +, Sean Christopherson wrote:
> Allocate cpus_hardware_enabled after arch hardware setup so that arch
> "init" and "hardware setup" are called back-to-back and thus can be
> combined in a future patch.  cpus_hardware_enabled is never used before
> kvm_create_vm(), i.e. doesn't have a dependency with hardware setup and
> only needs to be allocated before /dev/kvm is exposed to userspace.
>
> Free the object before the arch hooks are invoked to maintain symmetry,
> and so that arch code can move away from the hooks without having to
> worry about ordering changes.
>
> Signed-off-by: Sean Christopherson 
> ---
>  virt/kvm/kvm_main.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index e0424af52acc..8b7534cc953b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -5843,15 +5843,15 @@ int kvm_init(void *opaque, unsigned vcpu_size, 
> unsigned vcpu_align,
>   if (r)
>   return r;
>
> + r = kvm_arch_hardware_setup(opaque);
> + if (r < 0)
> + goto err_hw_setup;
> +
>   if (!zalloc_cpumask_var(&cpus_hardware_enabled, GFP_KERNEL)) {
>   r = -ENOMEM;
>   goto err_hw_enabled;
>   }
>
> - r = kvm_arch_hardware_setup(opaque);
> - if (r < 0)
> - goto out_free_1;
> -
>   c.ret = &r;
>   c.opaque = opaque;
>   for_each_online_cpu(cpu) {
> @@ -5937,10 +5937,10 @@ int kvm_init(void *opaque, unsigned vcpu_size, 
> unsigned vcpu_align,
>   unregister_reboot_notifier(&kvm_reboot_notifier);
>   cpuhp_remove_state_nocalls(CPUHP_AP_KVM_STARTING);
>  out_free_2:
> - kvm_arch_hardware_unsetup();
> -out_free_1:
>   free_cpumask_var(cpus_hardware_enabled);
>  err_hw_enabled:
> + kvm_arch_hardware_unsetup();
> +err_hw_setup:
>   kvm_arch_exit();
>   return r;
>  }
> @@ -5967,9 +5967,9 @@ void kvm_exit(void)
>   cpuhp_remove_state_nocalls(CPUHP_AP_KVM_STARTING);
>   on_each_cpu(hardware_disable_nolock, NULL, 1);
>   kvm_irqfd_exit();
> + free_cpumask_var(cpus_hardware_enabled);
>   kvm_arch_hardware_unsetup();
>   kvm_arch_exit();
> - free_cpumask_var(cpus_hardware_enabled);
>   kvm_vfio_ops_exit();

Looks good to me.

Reviewed-by: Yuan Yao 

>  }
>  EXPORT_SYMBOL_GPL(kvm_exit);
> --
> 2.38.1.431.g37b22c650d-goog
>
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 36/44] KVM: x86: Do compatibility checks when onlining CPU

2022-11-04 Thread Isaku Yamahata
On Wed, Nov 02, 2022 at 11:19:03PM +,
Sean Christopherson  wrote:

> From: Chao Gao 
> 
> Do compatibility checks when enabling hardware to effectively add
> compatibility checks when onlining a CPU.  Abort enabling, i.e. the
> online process, if the (hotplugged) CPU is incompatible with the known
> good setup.
> 
> At init time, KVM does compatibility checks to ensure that all online
> CPUs support hardware virtualization and a common set of features. But
> KVM uses hotplugged CPUs without such compatibility checks. On Intel
> CPUs, this leads to #GP if the hotplugged CPU doesn't support VMX, or
> VM-Entry failure if the hotplugged CPU doesn't support all features
> enabled by KVM.
> 
> Note, this is little more than a NOP on SVM, as SVM already checks for
> full SVM support during hardware enabling.
> 
> Opportunistically add a pr_err() if setup_vmcs_config() fails, and
> tweak all error messages to output which CPU failed.
> 
> Signed-off-by: Chao Gao 
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/include/asm/kvm_host.h |  2 +-
>  arch/x86/kvm/svm/svm.c  | 20 +++-
>  arch/x86/kvm/vmx/vmx.c  | 33 +++--
>  arch/x86/kvm/x86.c  |  5 +++--
>  4 files changed, 34 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f223c845ed6e..c99222b71fcc 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1666,7 +1666,7 @@ struct kvm_x86_nested_ops {
>  };
>  
>  struct kvm_x86_init_ops {
> - int (*check_processor_compatibility)(void);
> + int (*check_processor_compatibility)(int cpu);

Is this cpu argument used only for the error message, to include the cpu number
while avoiding repeated raw_smp_processor_id() calls in pr_err()?
The actual check is done on the currently executing cpu.

If cpu != raw_smp_processor_id(), cpu is wrong. Although the function is called
in a non-preemptive context, it's a bit confusing. So I'm voting to remove it
and to use raw_smp_processor_id() directly.

Thanks,
-- 
Isaku Yamahata 
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 08/44] KVM: x86: Move hardware setup/unsetup to init/exit

2022-11-04 Thread Yuan Yao
On Wed, Nov 02, 2022 at 11:18:35PM +, Sean Christopherson wrote:
> Now that kvm_arch_hardware_setup() is called immediately after
> kvm_arch_init(), fold the guts of kvm_arch_hardware_(un)setup() into
> kvm_arch_{init,exit}() as a step towards dropping one of the hooks.
>
> To avoid having to unwind various setup, e.g registration of several
> notifiers, slot in the vendor hardware setup before the registration of
> said notifiers and callbacks.  Introducing a functional change while
> moving code is less than ideal, but the alternative is adding a pile of
> unwinding code, which is much more error prone, e.g. several attempts to
> move the setup code verbatim all introduced bugs.
>
> Add a comment to document that kvm_ops_update() is effectively the point
> of no return, e.g. it sets the kvm_x86_ops.hardware_enable canary and so
> needs to be unwound.
>
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/kvm/x86.c | 121 +++--
>  1 file changed, 63 insertions(+), 58 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9a7702b1c563..80ee580a9cd4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9252,6 +9252,24 @@ static struct notifier_block pvclock_gtod_notifier = {
>  };
>  #endif
>
> +static inline void kvm_ops_update(struct kvm_x86_init_ops *ops)
> +{
> + memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
> +
> +#define __KVM_X86_OP(func) \
> + static_call_update(kvm_x86_##func, kvm_x86_ops.func);
> +#define KVM_X86_OP(func) \
> + WARN_ON(!kvm_x86_ops.func); __KVM_X86_OP(func)
> +#define KVM_X86_OP_OPTIONAL __KVM_X86_OP
> +#define KVM_X86_OP_OPTIONAL_RET0(func) \
> + static_call_update(kvm_x86_##func, (void *)kvm_x86_ops.func ? : \
> +(void *)__static_call_return0);
> +#include 
> +#undef __KVM_X86_OP
> +
> + kvm_pmu_ops_update(ops->pmu_ops);
> +}
> +
>  int kvm_arch_init(void *opaque)
>  {
>   struct kvm_x86_init_ops *ops = opaque;
> @@ -9325,6 +9343,24 @@ int kvm_arch_init(void *opaque)
>   kvm_caps.supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0;
>   }
>
> + rdmsrl_safe(MSR_EFER, &host_efer);
> +
> + if (boot_cpu_has(X86_FEATURE_XSAVES))
> + rdmsrl(MSR_IA32_XSS, host_xss);
> +
> + kvm_init_pmu_capability();
> +
> + r = ops->hardware_setup();
> + if (r != 0)
> + goto out_mmu_exit;

Before this patch, the failure case of ops->hardware_setup() was unwound
by kvm_arch_exit(); do we need to keep that old behavior?

> +
> + /*
> +  * Point of no return!  DO NOT add error paths below this point unless
> +  * absolutely necessary, as most operations from this point forward
> +  * require unwinding.
> +  */
> + kvm_ops_update(ops);
> +
>   kvm_timer_init();
>
>   if (pi_inject_timer == -1)
> @@ -9336,8 +9372,32 @@ int kvm_arch_init(void *opaque)
>   set_hv_tscchange_cb(kvm_hyperv_tsc_notifier);
>  #endif
>
> + kvm_register_perf_callbacks(ops->handle_intel_pt_intr);
> +
> + if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
> + kvm_caps.supported_xss = 0;
> +
> +#define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f)
> + cr4_reserved_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_);
> +#undef __kvm_cpu_cap_has
> +
> + if (kvm_caps.has_tsc_control) {
> + /*
> +  * Make sure the user can only configure tsc_khz values that
> +  * fit into a signed integer.
> +  * A min value is not calculated because it will always
> +  * be 1 on all machines.
> +  */
> + u64 max = min(0x7fffffffULL,
> +   __scale_tsc(kvm_caps.max_tsc_scaling_ratio, tsc_khz));
> + kvm_caps.max_guest_tsc_khz = max;
> + }
> + kvm_caps.default_tsc_scaling_ratio = 1ULL << kvm_caps.tsc_scaling_ratio_frac_bits;
> + kvm_init_msr_list();
>   return 0;
>
> +out_mmu_exit:
> + kvm_mmu_vendor_module_exit();
>  out_free_percpu:
>   free_percpu(user_return_msrs);
>  out_free_x86_emulator_cache:
> @@ -9347,6 +9407,8 @@ int kvm_arch_init(void *opaque)
>
>  void kvm_arch_exit(void)
>  {
> + kvm_unregister_perf_callbacks();
> +
>  #ifdef CONFIG_X86_64
>   if (hypervisor_is_type(X86_HYPER_MS_HYPERV))
>   clear_hv_tscchange_cb();
> @@ -9362,6 +9424,7 @@ void kvm_arch_exit(void)
>   irq_work_sync(&pvclock_irq_work);
>   cancel_work_sync(&pvclock_gtod_work);
>  #endif
> + static_call(kvm_x86_hardware_unsetup)();
>   kvm_x86_ops.hardware_enable = NULL;
>   kvm_mmu_vendor_module_exit();
>   free_percpu(user_return_msrs);
> @@ -11922,72 +11985,14 @@ void kvm_arch_hardware_disable(void)
>   drop_user_return_notifiers();
>  }
>
> -static inline void kvm_ops_update(struct kvm_x86_init_ops *ops)
> -{
> - memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
> -
> -#define __KVM_X86_OP(func) \
> - 

Re: [PATCH] kvm/arm: Fix pvtime documentation

2022-11-04 Thread Marc Zyngier
On Fri, 04 Nov 2022 01:48:21 +,
Bagas Sanjaya  wrote:
> 
> On 11/3/22 22:42, Marc Zyngier wrote:
> > No, this is the correct course of action. There isn't any point in
> > having an *unrelated* change in a separate series. This is a
> > standalone change, posted as a standalone patch.
> > 
> >> Please reroll your series [2] with suggestion applied.
> > 
> > Or not.
> > 
> 
> You mean the series before this patch has already been applied,
> right?

This change is 100% independent from the series you quoted. Why should
there be a dependency between the two?

As for respinning the series at this stage for a documentation
formatting issue, this is pretty pointless, and only clutters people's
Inbox with redundant versions...

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH] arm64: paravirt: remove conduit check in has_pv_steal_clock

2022-11-04 Thread Steven Price
On 04/11/2022 06:16, Usama Arif wrote:
> arm_smccc_1_1_invoke() which is called later on in the function
> will return failure if there's no conduit (or pre-SMCCC 1.1),
> hence the check is unnecessary.
> 
> Suggested-by: Steven Price 
> Signed-off-by: Usama Arif 

Reviewed-by: Steven Price 

> ---
>  arch/arm64/kernel/paravirt.c | 4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/arch/arm64/kernel/paravirt.c b/arch/arm64/kernel/paravirt.c
> index 57c7c211f8c7..aa718d6a9274 100644
> --- a/arch/arm64/kernel/paravirt.c
> +++ b/arch/arm64/kernel/paravirt.c
> @@ -141,10 +141,6 @@ static bool __init has_pv_steal_clock(void)
>  {
>   struct arm_smccc_res res;
>  
> - /* To detect the presence of PV time support we require SMCCC 1.1+ */
> - if (arm_smccc_1_1_get_conduit() == SMCCC_CONDUIT_NONE)
> - return false;
> -
>   arm_smccc_1_1_invoke(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
>ARM_SMCCC_HV_PV_TIME_FEATURES, &res);
>  



Re: [v2 0/6] KVM: arm64: implement vcpu_is_preempted check

2022-11-04 Thread Marc Zyngier
On Fri, 04 Nov 2022 06:20:59 +,
Usama Arif  wrote:
> 
> This patchset adds support for vcpu_is_preempted in arm64, which allows the 
> guest
> to check if a vcpu was scheduled out, which is useful to know in case it was
> holding a lock. vcpu_is_preempted can be used to improve
> performance in locking (see owner_on_cpu usage in mutex_spin_on_owner,
> mutex_can_spin_on_owner, rtmutex_spin_on_owner and osq_lock) and scheduling
> (see available_idle_cpu which is used in several places in kernel/sched/fair.c
> for e.g. in wake_affine to determine which CPU can run soonest):

Please refrain from reposting a series only two days after the initial
one. One week is a minimum, and only if there are enough review
comments to justify a respin (there were no valuable comments so far).

Reposting more often only results in the review process being
exponentially delayed.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH 33/44] KVM: x86: Do VMX/SVM support checks directly in vendor code

2022-11-04 Thread Paolo Bonzini

On 11/3/22 19:58, Sean Christopherson wrote:


diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 3e508f239098..ebe617ab0b37 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -191,6 +191,8 @@ static void default_init(struct cpuinfo_x86 *c)
 strcpy(c->x86_model_id, "386");
 }
  #endif
+
+   clear_cpu_cap(c, X86_FEATURE_MSR_IA32_FEAT_CTL);
  }
  
  static const struct cpu_dev default_cpu = {


Not needed I think?  default_init does not call init_ia32_feat_ctl.


diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index c881bcafba7d..3a7ae67f5a5e 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -72,6 +72,8 @@ static const struct cpuid_dep cpuid_deps[] = {
 { X86_FEATURE_AVX512_FP16,  X86_FEATURE_AVX512BW  },
 { X86_FEATURE_ENQCMD,   X86_FEATURE_XSAVES},
 { X86_FEATURE_PER_THREAD_MBA,   X86_FEATURE_MBA   },
+   { X86_FEATURE_VMX,  X86_FEATURE_MSR_IA32_FEAT_CTL },
+   { X86_FEATURE_SGX,  X86_FEATURE_MSR_IA32_FEAT_CTL },
 { X86_FEATURE_SGX_LC,   X86_FEATURE_SGX   },
 { X86_FEATURE_SGX1, X86_FEATURE_SGX   },
 { X86_FEATURE_SGX2, X86_FEATURE_SGX1  },



Yes, good idea.

Paolo



Re: [PATCH 00/44] KVM: Rework kvm_init() and hardware enabling

2022-11-04 Thread Paolo Bonzini

On 11/4/22 08:17, Isaku Yamahata wrote:

On Wed, Nov 02, 2022 at 11:18:27PM +,
Sean Christopherson  wrote:


Non-x86 folks, please test on hardware when possible.  I made a _lot_ of
mistakes when moving code around.  Thankfully, x86 was the trickiest code
to deal with, and I'm fairly confident that I found all the bugs I
introduced via testing.  But the number of mistakes I made and found on
x86 makes me more than a bit worried that I screwed something up in other
arch code.

This is a continuation of Chao's series to do x86 CPU compatibility checks
during virtualization hardware enabling[1], and of Isaku's series to try
and clean up the hardware enabling paths so that x86 (Intel specifically)
can temporarily enable hardware during module initialization without
causing undue pain for other architectures[2].  It also includes one patch
from another mini-series from Isaku that provides the less controversial
patches[3].

The main theme of this series is to kill off kvm_arch_init(),
kvm_arch_hardware_(un)setup(), and kvm_arch_check_processor_compat(), which
all originated in x86 code from way back when, and needlessly complicate
both common KVM code and architecture code.  E.g. many architectures don't
mark functions/data as __init/__ro_after_init purely because kvm_init()
isn't marked __init to support x86's separate vendor modules.

The idea/hope is that with those hooks gone (moved to arch code), it will
be easier for x86 (and other architectures) to modify their module init
sequences as needed without having to fight common KVM code.  E.g. I'm
hoping that ARM can build on this to simplify its hardware enabling logic,
especially the pKVM side of things.

There are bug fixes throughout this series.  They are more scattered than
I would usually prefer, but getting the sequencing correct was a gigantic
pain for many of the x86 fixes due to needing to fix common code in order
for the x86 fix to have any meaning.  And while the bugs are often fatal,
they aren't all that interesting for most users as they either require a
malicious admin or broken hardware, i.e. aren't likely to be encountered
by the vast majority of KVM users.  So unless someone _really_ wants a
particular fix isolated for backporting, I'm not planning on shuffling
patches.

Tested on x86.  Lightly tested on arm64.  Compile tested only on all other
architectures.


Thanks for the patch series. I rebased the TDX KVM patch series on top of it and it worked.
Since CPU offlining needs to be rejected in some cases (to keep at least one CPU
online per package), an arch hook for CPU offline is needed.
I can keep it in the TDX KVM patch series.


Yes, this patch looks good.

Paolo


diff --git a/arch/x86/include/asm/kvm-x86-ops.h 
b/arch/x86/include/asm/kvm-x86-ops.h
index 23c0f4bc63f1..ef7bcb845d42 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -17,6 +17,7 @@ BUILD_BUG_ON(1)
  KVM_X86_OP(hardware_enable)
  KVM_X86_OP(hardware_disable)
  KVM_X86_OP(hardware_unsetup)
+KVM_X86_OP_OPTIONAL_RET0(offline_cpu)
  KVM_X86_OP(has_emulated_msr)
  KVM_X86_OP(vcpu_after_set_cpuid)
  KVM_X86_OP(is_vm_type_supported)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 496c7c6eaff9..c420409aa96f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1468,6 +1468,7 @@ struct kvm_x86_ops {
int (*hardware_enable)(void);
void (*hardware_disable)(void);
void (*hardware_unsetup)(void);
+   int (*offline_cpu)(void);
bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
  
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c

index 2ed5a017f7bc..17c5d6a76c93 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12039,6 +12039,11 @@ void kvm_arch_hardware_disable(void)
drop_user_return_notifiers();
  }
  
+int kvm_arch_offline_cpu(unsigned int cpu)

+{
+   return static_call(kvm_x86_offline_cpu)();
+}
+
  bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
  {
return vcpu->kvm->arch.bsp_vcpu_id == vcpu->vcpu_id;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 620489b9aa93..4df79443fd11 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1460,6 +1460,7 @@ static inline void kvm_create_vcpu_debugfs(struct 
kvm_vcpu *vcpu) {}
  int kvm_arch_hardware_enable(void);
  void kvm_arch_hardware_disable(void);
  #endif
+int kvm_arch_offline_cpu(unsigned int cpu);
  int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
  bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu);
  int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f6b6dcedaa0a..f770fdc662d0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5396,16 +5396,24 @@ static void hardware_disable_nolock(void *junk)
__this_cpu_write(hardware_enabled, false);
  }
  
+__weak int kvm_arch_offline_cpu(unsigned 

Re: [PATCH v2 11/14] KVM: arm64: PMU: Allow ID_AA64DFR0_EL1.PMUver to be set from userspace

2022-11-04 Thread Reiji Watanabe
Hi Marc,

On Thu, Nov 3, 2022 at 3:25 AM Marc Zyngier  wrote:
>
> On Thu, 03 Nov 2022 05:31:56 +,
> Reiji Watanabe  wrote:
> >
> > Hi Marc,
> >
> > On Fri, Oct 28, 2022 at 4:16 AM Marc Zyngier  wrote:
> > >
> > > Allow userspace to write ID_AA64DFR0_EL1, on the condition that only
> > > the PMUver field can be altered and be at most the one that was
> > > initially computed for the guest.
> > >
> > > Signed-off-by: Marc Zyngier 
> > > ---
> > >  arch/arm64/kvm/sys_regs.c | 37 -
> > >  1 file changed, 36 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> > > index 7a4cd644b9c0..4fa14b4ae2a6 100644
> > > --- a/arch/arm64/kvm/sys_regs.c
> > > +++ b/arch/arm64/kvm/sys_regs.c
> > > @@ -1247,6 +1247,40 @@ static int set_id_aa64pfr0_el1(struct kvm_vcpu 
> > > *vcpu,
> > > return 0;
> > >  }
> > >
> > > +static int set_id_aa64dfr0_el1(struct kvm_vcpu *vcpu,
> > > +  const struct sys_reg_desc *rd,
> > > +  u64 val)
> > > +{
> > > +   u8 pmuver, host_pmuver;
> > > +
> > > +   host_pmuver = kvm_arm_pmu_get_pmuver_limit();
> > > +
> > > +   /*
> > > +* Allow AA64DFR0_EL1.PMUver to be set from userspace as long
> > > +* as it doesn't promise more than what the HW gives us. We
> > > +* allow an IMPDEF PMU though, only if no PMU is supported
> > > +* (KVM backward compatibility handling).
> > > +*/
> >
> > It appears the patch allows userspace to set IMPDEF even
> > when host_pmuver == 0.  Shouldn't it be allowed only when
> > host_pmuver == IMPDEF (as before)?
> > Probably, it may not cause any real problems though.
>
> Given that we don't treat the two cases any differently, I thought it
> would be reasonable to relax this particular case, and I can't see any
> reason why we shouldn't tolerate this sort of migration.

That's true. I assume it won't cause any functional issues.

I have another comment related to this.
KVM allows userspace to create a guest with a mix of vCPUs with and
without PMU.  For such a guest, if the register for the vCPU without
PMU is set last, I think the PMUVER value for vCPUs with PMU could
become no PMU (0) or IMPDEF (0xf).
Also, with the current patch, userspace can set PMUv3 support value
(non-zero or non-IMPDEF) for vCPUs without the PMU.
IMHO, KVM shouldn't allow userspace to set PMUVER to the value that
is inconsistent with PMU configuration for the vCPU.
What do you think ?

I'm thinking of the following code (not tested).

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 4fa14b4ae2a6..ddd849027cc3 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1265,10 +1265,17 @@ static int set_id_aa64dfr0_el1(struct kvm_vcpu *vcpu,
if (pmuver != ID_AA64DFR0_EL1_PMUVer_IMP_DEF && pmuver > host_pmuver)
return -EINVAL;

-   /* We already have a PMU, don't try to disable it... */
-   if (kvm_vcpu_has_pmu(vcpu) &&
-   (pmuver == 0 || pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF))
-   return -EINVAL;
+   if (kvm_vcpu_has_pmu(vcpu)) {
+   /* We already have a PMU, don't try to disable it... */
+   if (pmuver == 0 || pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF) {
+   return -EINVAL;
+   }
+   } else {
+   /* We don't have a PMU, don't try to enable it... */
+   if (pmuver > 0 && pmuver != ID_AA64DFR0_EL1_PMUVer_IMP_DEF) {
+   return -EINVAL;
+   }
+   }

/* We can only differ with PMUver, and anything else is an error */
val ^= read_id_reg(vcpu, rd);
@@ -1276,7 +1283,8 @@ static int set_id_aa64dfr0_el1(struct kvm_vcpu *vcpu,
if (val)
return -EINVAL;

-   vcpu->kvm->arch.dfr0_pmuver = pmuver;
+   if (kvm_vcpu_has_pmu(vcpu))
+   vcpu->kvm->arch.dfr0_pmuver = pmuver;

return 0;
 }

Thank you,
Reiji



>
> >
> >
> > > +   pmuver = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_PMUVer), 
> > > val);
> > > +   if (pmuver != ID_AA64DFR0_EL1_PMUVer_IMP_DEF && pmuver > 
> > > host_pmuver)
> > > +   return -EINVAL;
> > > +
> > > +   /* We already have a PMU, don't try to disable it... */
> > > +   if (kvm_vcpu_has_pmu(vcpu) &&
> > > +   (pmuver == 0 || pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF))
> > > +   return -EINVAL;
> >
> > Nit: Perhaps it might be useful to return a different error code for the
> > above two (new) error cases (I plan to use -E2BIG and -EPERM
> > respectively for those cases with my ID register series).
>
> My worry in doing so is that we don't have an 

Re: [PATCH v7 4/9] KVM: Support dirty ring in conjunction with bitmap

2022-11-04 Thread Gavin Shan

Hi Oliver,

On 11/4/22 9:06 AM, Oliver Upton wrote:

On Fri, Nov 04, 2022 at 08:12:21AM +0800, Gavin Shan wrote:

On 11/4/22 7:32 AM, Oliver Upton wrote:

On Mon, Oct 31, 2022 at 08:36:16AM +0800, Gavin Shan wrote:

ARM64 needs to dirty memory outside of a VCPU context when VGIC/ITS is
enabled. This conflicts with ring-based dirty page tracking, which always
requires a running VCPU context.

Introduce a new flavor of dirty ring that requires the use of both VCPU
dirty rings and a dirty bitmap. The expectation is that for non-VCPU
sources of dirty memory (such as the VGIC/ITS on arm64), KVM writes to
the dirty bitmap. Userspace should scan the dirty bitmap before migrating
the VM to the target.

Use an additional capability to advertise this behavior. The newly added
capability (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) can't be enabled before
KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64. In this way, the newly added
capability is treated as an extension of KVM_CAP_DIRTY_LOG_RING_ACQ_REL.


Whatever ordering requirements we settle on between these capabilities
needs to be documented as well.

[...]



It's mentioned in 'Documentation/virt/kvm/api.rst' as below.

   After using the dirty rings, the userspace needs to detect the capability
   of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the ring structures
   need to be backed by per-slot bitmaps. With this capability advertised
   and supported, it means the architecture can dirty guest pages without
   vcpu/ring context, so that some of the dirty information will still be
   maintained in the bitmap structure.

The description may not make the ordering obvious. For this, I can
add the following sentence at the end of the section.

   The capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP can't be enabled
   until the capability of KVM_CAP_DIRTY_LOG_RING_ACQ_REL has been enabled.
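
For illustration, the flow described above would look roughly like this on the
userspace side (collect_dirty_pages(), struct vcpu and harvest_dirty_ring() are
made-up names for the sketch; only KVM_GET_DIRTY_LOG is the real ioctl):

    /* Collect dirty pages from both sources before migrating memory. */
    static void collect_dirty_pages(int vm_fd, struct vcpu *vcpus, int nr_vcpus,
                                    struct kvm_dirty_log *log)
    {
        int i;

        /* 1. Harvest the per-vCPU dirty rings as usual. */
        for (i = 0; i < nr_vcpus; i++)
            harvest_dirty_ring(&vcpus[i]);

        /* 2. Additionally fetch the per-slot bitmap, which holds pages
         *    dirtied without a vCPU context (e.g. vGIC/ITS table saves). */
        ioctl(vm_fd, KVM_GET_DIRTY_LOG, log);
    }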


@@ -4588,6 +4594,13 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm 
*kvm,
return -EINVAL;
return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
+   case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
+   if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
+   !kvm->dirty_ring_size)


I believe this ordering requirement is problematic, as it piles on top
of an existing problem w.r.t. KVM_CAP_DIRTY_LOG_RING v. memslot
creation.

Example:
   - Enable KVM_CAP_DIRTY_LOG_RING
   - Create some memslots w/ dirty logging enabled (note that the bitmap
 is _not_ allocated)
   - Enable KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
   - Save ITS tables and get a NULL dereference in
 mark_page_dirty_in_slot():

  if (vcpu && kvm->dirty_ring_size)
  kvm_dirty_ring_push(&vcpu->dirty_ring,
  slot, rel_gfn);
  else
---> set_bit_le(rel_gfn, memslot->dirty_bitmap);

Similarly, KVM may unnecessarily allocate bitmaps if dirty logging is
enabled on memslots before KVM_CAP_DIRTY_LOG_RING is enabled.

You could paper over this issue by disallowing DIRTY_RING_WITH_BITMAP if
DIRTY_LOG_RING has already been enabled, but the better approach would
be to explicitly check kvm_memslots_empty() such that the real
dependency is obvious. Peter, hadn't you mentioned something about
checking against memslots in an earlier revision?
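
For illustration, a rough sketch of how such a check could look in
kvm_vm_ioctl_enable_cap_generic() (the helper name, the flag field and the
locking are assumptions, not the final implementation):

	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP: {
		int r = -EINVAL;

		if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
		    !kvm->dirty_ring_size)
			return r;

		mutex_lock(&kvm->slots_lock);
		/* Refuse the cap once memslots exist, so that dirty bitmaps
		 * are guaranteed to be allocated for every slot created
		 * after the cap has been enabled. */
		if (kvm_memslots_empty(kvm)) {
			kvm->dirty_ring_with_bitmap = true;
			r = 0;
		}
		mutex_unlock(&kvm->slots_lock);

		return r;
	}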



The userspace (QEMU) needs to ensure that no dirty bitmap is created
before the KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP capability is enabled.
QEMU doesn't yet know whether vgic/its will be used at the point where
KVM_CAP_DIRTY_LOG_RING_ACQ_REL is enabled.


I'm not worried about what QEMU (or any particular VMM for that matter)
does with the UAPI. The problem is this patch provides a trivial way for
userspace to cause a NULL dereference in the kernel. Imposing ordering
between the cap and memslot creation avoids the problem altogether.

So, looking at your example:


kvm_initialization
  enable_KVM_CAP_DIRTY_LOG_RING_ACQ_REL  // Where KVM_CAP_DIRTY_LOG_RING is enabled
board_initialization                     // Where QEMU knows if vgic/its is used


Is it possible that QEMU could hoist enabling RING_WITH_BITMAP here?
Based on your description QEMU has decided to use the vGIC ITS but
hasn't yet added any memslots.



It's possible to add an ARM-specific helper, kvm_arm_enable_dirty_ring_with_bitmap(),
in qemu/target/arm.c to enable RING_WITH_BITMAP if needed. The newly added
function can be called here when vgic/its is used, as sketched below.
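
A rough sketch of what that QEMU-side helper could look like (the use of
kvm_check_extension()/kvm_vm_enable_cap() mirrors existing QEMU patterns, but
the exact shape and call site are only assumptions):

static int kvm_arm_enable_dirty_ring_with_bitmap(KVMState *s)
{
    /* Only needed when the vGIC/ITS will dirty guest memory without a
     * running vCPU context. */
    if (!kvm_check_extension(s, KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP)) {
        return -ENOSYS;
    }

    return kvm_vm_enable_cap(s, KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP, 0);
}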


  add_memory_slots
kvm_post_initialization
  enable_KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
:
start_migration
  enable_dirty_page_tracking
    create_dirty_bitmap                  // With KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP enabled


Just to make sure we're on the same page, there's two issues:

  (1) If DIRTY_LOG_RING is enabled before memslot creation and
  RING_WITH_BITMAP is enabled after memslots have been created w/
  dirty logging enabled, 

Re: [External] Re: [RFC 5/6] KVM: arm64: Support the VCPU preemption check

2022-11-04 Thread Usama Arif




On 03/11/2022 13:25, Steven Price wrote:

On 02/11/2022 16:13, Usama Arif wrote:

Support the vcpu_is_preempted() functionality under KVM/arm64. This will
enhance lock performance on overcommitted hosts (more runnable VCPUs
than physical CPUs in the system) as doing busy waits for preempted
VCPUs will hurt system performance far worse than early yielding.

Signed-off-by: Zengruan Ye 
Signed-off-by: Usama Arif 
---
  arch/arm64/include/asm/paravirt.h |   2 +
  arch/arm64/include/asm/spinlock.h |  16 +++-
  arch/arm64/kernel/paravirt.c  | 126 ++
  arch/arm64/kernel/setup.c |   3 +
  include/linux/cpuhotplug.h|   1 +
  5 files changed, 147 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/paravirt.h 
b/arch/arm64/include/asm/paravirt.h
index 9aa193e0e8f2..4ccb4356c56b 100644
--- a/arch/arm64/include/asm/paravirt.h
+++ b/arch/arm64/include/asm/paravirt.h
@@ -19,10 +19,12 @@ static inline u64 paravirt_steal_clock(int cpu)
  }
  
  int __init pv_time_init(void);

+int __init pv_lock_init(void);
  
  #else
  
  #define pv_time_init() do {} while (0)

+#define pv_lock_init() do {} while (0)
  
  #endif // CONFIG_PARAVIRT
  
diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h

index 0525c0b089ed..7023efa4de96 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -10,7 +10,20 @@
  
  /* See include/linux/spinlock.h */

  #define smp_mb__after_spinlock()  smp_mb()
+#define vcpu_is_preempted vcpu_is_preempted
+
+#ifdef CONFIG_PARAVIRT
+#include 
+
+bool dummy_vcpu_is_preempted(int cpu);
  
+DECLARE_STATIC_CALL(pv_vcpu_is_preempted, dummy_vcpu_is_preempted);

+static inline bool vcpu_is_preempted(int cpu)
+{
+   return static_call(pv_vcpu_is_preempted)(cpu);
+}
+
+#else
  /*
   * Changing this will break osq_lock() thanks to the call inside
   * smp_cond_load_relaxed().
@@ -18,10 +31,11 @@
   * See:
   * 
https://lore.kernel.org/lkml/20200110100612.gc2...@hirez.programming.kicks-ass.net
   */
-#define vcpu_is_preempted vcpu_is_preempted
  static inline bool vcpu_is_preempted(int cpu)
  {
return false;
  }
  
+#endif /* CONFIG_PARAVIRT */

+
  #endif /* __ASM_SPINLOCK_H */
diff --git a/arch/arm64/kernel/paravirt.c b/arch/arm64/kernel/paravirt.c
index 57c7c211f8c7..45bcca87bed7 100644
--- a/arch/arm64/kernel/paravirt.c
+++ b/arch/arm64/kernel/paravirt.c
@@ -22,6 +22,7 @@
  
  #include 

  #include 
+#include 
  #include 
  
  struct static_key paravirt_steal_enabled;

@@ -38,7 +39,12 @@ struct pv_time_stolen_time_region {
struct pvclock_vcpu_stolen_time __rcu *kaddr;
  };
  
+struct pv_lock_state_region {

+   struct pvlock_vcpu_state __rcu *kaddr;
+};
+
  static DEFINE_PER_CPU(struct pv_time_stolen_time_region, stolen_time_region);
+static DEFINE_PER_CPU(struct pv_lock_state_region, lock_state_region);
  
  static bool steal_acc = true;

  static int __init parse_no_stealacc(char *arg)
@@ -178,3 +184,123 @@ int __init pv_time_init(void)
  
  	return 0;

  }
+
+static bool native_vcpu_is_preempted(int cpu)
+{
+   return false;
+}
+
+DEFINE_STATIC_CALL(pv_vcpu_is_preempted, native_vcpu_is_preempted);
+
+static bool para_vcpu_is_preempted(int cpu)
+{
+   struct pv_lock_state_region *reg;
+   __le64 preempted_le;
+
+   reg = per_cpu_ptr(&lock_state_region, cpu);
+   if (!reg->kaddr) {
+   pr_warn_once("PV lock enabled but not configured for cpu %d\n",
+cpu);
+   return false;
+   }
+
+   preempted_le = le64_to_cpu(READ_ONCE(reg->kaddr->preempted));
+
+   return !!(preempted_le);
+}
+
+static int pvlock_vcpu_state_dying_cpu(unsigned int cpu)
+{
+   struct pv_lock_state_region *reg;
+
+   reg = this_cpu_ptr(&lock_state_region);
+   if (!reg->kaddr)
+   return 0;
+
+   memunmap(reg->kaddr);
+   memset(reg, 0, sizeof(*reg));
+
+   return 0;
+}
+
+static int init_pvlock_vcpu_state(unsigned int cpu)
+{
+   struct pv_lock_state_region *reg;
+   struct arm_smccc_res res;
+
+   reg = this_cpu_ptr(&lock_state_region);
+
+   arm_smccc_1_1_invoke(ARM_SMCCC_HV_PV_LOCK_PREEMPTED, &res);
+
+   if (res.a0 == SMCCC_RET_NOT_SUPPORTED) {
+   pr_warn("Failed to init PV lock data structure\n");
+   return -EINVAL;
+   }
+
+   reg->kaddr = memremap(res.a0,
+ sizeof(struct pvlock_vcpu_state),
+ MEMREMAP_WB);
+
+   if (!reg->kaddr) {
+   pr_warn("Failed to map PV lock data structure\n");
+   return -ENOMEM;
+   }
+
+   return 0;
+}
+
+static int kvm_arm_init_pvlock(void)
+{
+   int ret;
+
+   ret = cpuhp_setup_state(CPUHP_AP_ARM_KVM_PVLOCK_STARTING,
+   "hypervisor/arm/pvlock:starting",
+   init_pvlock_vcpu_state,
+   pvlock_vcpu_state_dying_cpu);
+   

[v2 6/6] KVM: selftests: add tests for PV time specific hypercall

2022-11-04 Thread Usama Arif
This is a vendor specific hypercall.

Signed-off-by: Usama Arif 
---
 tools/testing/selftests/kvm/aarch64/hypercalls.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/kvm/aarch64/hypercalls.c 
b/tools/testing/selftests/kvm/aarch64/hypercalls.c
index a39da3fe4952..c5c304e886a5 100644
--- a/tools/testing/selftests/kvm/aarch64/hypercalls.c
+++ b/tools/testing/selftests/kvm/aarch64/hypercalls.c
@@ -78,6 +78,8 @@ static const struct test_hvc_info hvc_info[] = {
TEST_HVC_INFO(ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID,
ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID),
TEST_HVC_INFO(ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID, 0),
+   TEST_HVC_INFO(ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID,
+   ARM_SMCCC_VENDOR_HYP_KVM_PV_LOCK_FUNC_ID),
TEST_HVC_INFO(ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID, 
KVM_PTP_VIRT_COUNTER),
 };
 
-- 
2.25.1



[v2 5/6] KVM: arm64: Support the VCPU preemption check

2022-11-04 Thread Usama Arif
Support the vcpu_is_preempted() functionality under KVM/arm64. This will
enhance lock performance on overcommitted hosts (more runnable VCPUs
than physical CPUs in the system) as doing busy waits for preempted
VCPUs will hurt system performance far worse than early yielding.

Signed-off-by: Zengruan Ye 
Signed-off-by: Usama Arif 
---
 arch/arm64/include/asm/paravirt.h |   2 +
 arch/arm64/include/asm/spinlock.h |  16 -
 arch/arm64/kernel/paravirt.c  | 112 ++
 arch/arm64/kernel/setup.c |   3 +
 include/linux/cpuhotplug.h|   1 +
 5 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/paravirt.h 
b/arch/arm64/include/asm/paravirt.h
index 9aa193e0e8f2..4ccb4356c56b 100644
--- a/arch/arm64/include/asm/paravirt.h
+++ b/arch/arm64/include/asm/paravirt.h
@@ -19,10 +19,12 @@ static inline u64 paravirt_steal_clock(int cpu)
 }
 
 int __init pv_time_init(void);
+int __init pv_lock_init(void);
 
 #else
 
 #define pv_time_init() do {} while (0)
+#define pv_lock_init() do {} while (0)
 
 #endif // CONFIG_PARAVIRT
 
diff --git a/arch/arm64/include/asm/spinlock.h 
b/arch/arm64/include/asm/spinlock.h
index 0525c0b089ed..7023efa4de96 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -10,7 +10,20 @@
 
 /* See include/linux/spinlock.h */
 #define smp_mb__after_spinlock()   smp_mb()
+#define vcpu_is_preempted vcpu_is_preempted
+
+#ifdef CONFIG_PARAVIRT
+#include 
+
+bool dummy_vcpu_is_preempted(int cpu);
 
+DECLARE_STATIC_CALL(pv_vcpu_is_preempted, dummy_vcpu_is_preempted);
+static inline bool vcpu_is_preempted(int cpu)
+{
+   return static_call(pv_vcpu_is_preempted)(cpu);
+}
+
+#else
 /*
  * Changing this will break osq_lock() thanks to the call inside
  * smp_cond_load_relaxed().
@@ -18,10 +31,11 @@
  * See:
  * 
https://lore.kernel.org/lkml/20200110100612.gc2...@hirez.programming.kicks-ass.net
  */
-#define vcpu_is_preempted vcpu_is_preempted
 static inline bool vcpu_is_preempted(int cpu)
 {
return false;
 }
 
+#endif /* CONFIG_PARAVIRT */
+
 #endif /* __ASM_SPINLOCK_H */
diff --git a/arch/arm64/kernel/paravirt.c b/arch/arm64/kernel/paravirt.c
index 57c7c211f8c7..a340f5f4327f 100644
--- a/arch/arm64/kernel/paravirt.c
+++ b/arch/arm64/kernel/paravirt.c
@@ -20,8 +20,10 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
+#include 
 #include 
 
 struct static_key paravirt_steal_enabled;
@@ -38,7 +40,12 @@ struct pv_time_stolen_time_region {
struct pvclock_vcpu_stolen_time __rcu *kaddr;
 };
 
+struct pv_lock_state_region {
+   struct pvlock_vcpu_state __rcu *kaddr;
+};
+
 static DEFINE_PER_CPU(struct pv_time_stolen_time_region, stolen_time_region);
+static DEFINE_PER_CPU(struct pv_lock_state_region, lock_state_region);
 
 static bool steal_acc = true;
 static int __init parse_no_stealacc(char *arg)
@@ -178,3 +185,108 @@ int __init pv_time_init(void)
 
return 0;
 }
+
+static bool native_vcpu_is_preempted(int cpu)
+{
+   return false;
+}
+
+DEFINE_STATIC_CALL(pv_vcpu_is_preempted, native_vcpu_is_preempted);
+
+static bool para_vcpu_is_preempted(int cpu)
+{
+   struct pv_lock_state_region *reg;
+   __le64 preempted_le;
+
+   reg = per_cpu_ptr(&lock_state_region, cpu);
+   if (!reg->kaddr) {
+   pr_warn_once("PV lock enabled but not configured for cpu %d\n",
+cpu);
+   return false;
+   }
+
+   preempted_le = le64_to_cpu(READ_ONCE(reg->kaddr->preempted));
+
+   return !!(preempted_le);
+}
+
+static int pvlock_vcpu_state_dying_cpu(unsigned int cpu)
+{
+   struct pv_lock_state_region *reg;
+
+   reg = this_cpu_ptr(&lock_state_region);
+   if (!reg->kaddr)
+   return 0;
+
+   memunmap(reg->kaddr);
+   memset(reg, 0, sizeof(*reg));
+
+   return 0;
+}
+
+static int init_pvlock_vcpu_state(unsigned int cpu)
+{
+   struct pv_lock_state_region *reg;
+   struct arm_smccc_res res;
+
+   reg = this_cpu_ptr(&lock_state_region);
+
+   arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_PV_LOCK_FUNC_ID, &res);
+
+   if (res.a0 == SMCCC_RET_NOT_SUPPORTED) {
+   pr_warn("Failed to init PV lock data structure\n");
+   return -EINVAL;
+   }
+
+   reg->kaddr = memremap(res.a0,
+ sizeof(struct pvlock_vcpu_state),
+ MEMREMAP_WB);
+
+   if (!reg->kaddr) {
+   pr_warn("Failed to map PV lock data structure\n");
+   return -ENOMEM;
+   }
+
+   return 0;
+}
+
+static int kvm_arm_init_pvlock(void)
+{
+   int ret;
+
+   ret = cpuhp_setup_state(CPUHP_AP_ARM_KVM_PVLOCK_STARTING,
+   "hypervisor/arm/pvlock:starting",
+   init_pvlock_vcpu_state,
+   pvlock_vcpu_state_dying_cpu);
+   if (ret < 0) {
+   pr_warn("PV-lock init failed\n");
+   return 

[v2 4/6] KVM: arm64: Provide VCPU attributes for PV lock

2022-11-04 Thread Usama Arif
Allow user space to inform the KVM host where in the physical memory
map the paravirtualized lock structures should be located.

User space can set an attribute on the VCPU providing the IPA base
address of the PV lock structure for that VCPU. This must be
repeated for every VCPU in the VM.

The address is given in terms of the physical address visible to
the guest and must be 64 byte aligned. The guest will discover the
address via a hypercall.

Signed-off-by: Zengruan Ye 
Signed-off-by: Usama Arif 
---
 arch/arm64/include/asm/kvm_host.h |  7 
 arch/arm64/include/uapi/asm/kvm.h |  2 ++
 arch/arm64/kvm/guest.c|  9 +
 arch/arm64/kvm/pvlock.c   | 57 +++
 include/uapi/linux/kvm.h  |  2 ++
 5 files changed, 77 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 18303b30b7e9..86aeca8a4393 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -829,6 +829,13 @@ int kvm_arm_pvtime_get_attr(struct kvm_vcpu *vcpu,
 int kvm_arm_pvtime_has_attr(struct kvm_vcpu *vcpu,
struct kvm_device_attr *attr);
 
+int kvm_arm_pvlock_set_attr(struct kvm_vcpu *vcpu,
+   struct kvm_device_attr *attr);
+int kvm_arm_pvlock_get_attr(struct kvm_vcpu *vcpu,
+   struct kvm_device_attr *attr);
+int kvm_arm_pvlock_has_attr(struct kvm_vcpu *vcpu,
+   struct kvm_device_attr *attr);
+
 extern unsigned int kvm_arm_vmid_bits;
 int kvm_arm_vmid_alloc_init(void);
 void kvm_arm_vmid_alloc_free(void);
diff --git a/arch/arm64/include/uapi/asm/kvm.h 
b/arch/arm64/include/uapi/asm/kvm.h
index bd05ece5c590..71010bacaaab 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -412,6 +412,8 @@ enum {
 #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER1
 #define KVM_ARM_VCPU_PVTIME_CTRL   2
 #define   KVM_ARM_VCPU_PVTIME_IPA  0
+#define KVM_ARM_VCPU_PVLOCK_CTRL   3
+#define   KVM_ARM_VCPU_PVLOCK_IPA  0
 
 /* KVM_IRQ_LINE irq field index values */
 #define KVM_ARM_IRQ_VCPU2_SHIFT28
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 2ff13a3f8479..7e765e3256ef 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -959,6 +959,9 @@ int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
case KVM_ARM_VCPU_PVTIME_CTRL:
ret = kvm_arm_pvtime_set_attr(vcpu, attr);
break;
+   case KVM_ARM_VCPU_PVLOCK_CTRL:
+   ret = kvm_arm_pvlock_set_attr(vcpu, attr);
+   break;
default:
ret = -ENXIO;
break;
@@ -982,6 +985,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
case KVM_ARM_VCPU_PVTIME_CTRL:
ret = kvm_arm_pvtime_get_attr(vcpu, attr);
break;
+   case KVM_ARM_VCPU_PVLOCK_CTRL:
+   ret = kvm_arm_pvlock_get_attr(vcpu, attr);
+   break;
default:
ret = -ENXIO;
break;
@@ -1005,6 +1011,9 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
case KVM_ARM_VCPU_PVTIME_CTRL:
ret = kvm_arm_pvtime_has_attr(vcpu, attr);
break;
+   case KVM_ARM_VCPU_PVLOCK_CTRL:
+   ret = kvm_arm_pvlock_has_attr(vcpu, attr);
+   break;
default:
ret = -ENXIO;
break;
diff --git a/arch/arm64/kvm/pvlock.c b/arch/arm64/kvm/pvlock.c
index 3eb35ab31481..b08a287a4811 100644
--- a/arch/arm64/kvm/pvlock.c
+++ b/arch/arm64/kvm/pvlock.c
@@ -41,3 +41,60 @@ void kvm_update_pvlock_preempted(struct kvm_vcpu *vcpu, u64 
preempted)
kvm_put_guest(kvm, base + offset, cpu_to_le64(preempted));
 srcu_read_unlock(&kvm->srcu, idx);
 }
+
+int kvm_arm_pvlock_set_attr(struct kvm_vcpu *vcpu,
+   struct kvm_device_attr *attr)
+{
+   u64 __user *user = (u64 __user *)attr->addr;
+   struct kvm *kvm = vcpu->kvm;
+   u64 ipa;
+   int ret = 0;
+   int idx;
+
+   if (attr->attr != KVM_ARM_VCPU_PVLOCK_IPA)
+   return -ENXIO;
+
+   if (get_user(ipa, user))
+   return -EFAULT;
+   if (!IS_ALIGNED(ipa, 64))
+   return -EINVAL;
+   if (vcpu->arch.pv.base != GPA_INVALID)
+   return -EEXIST;
+
+   /* Check the address is in a valid memslot */
+   idx = srcu_read_lock(&kvm->srcu);
+   if (kvm_is_error_hva(gfn_to_hva(kvm, ipa >> PAGE_SHIFT)))
+   ret = -EINVAL;
+   srcu_read_unlock(&kvm->srcu, idx);
+
+   if (!ret)
+   vcpu->arch.pv.base = ipa;
+
+   return ret;
+}
+
+int kvm_arm_pvlock_get_attr(struct kvm_vcpu *vcpu,
+   struct kvm_device_attr *attr)
+{
+   u64 __user *user = (u64 __user *)attr->addr;
+   u64 ipa;
+
+   if (attr->attr != KVM_ARM_VCPU_PVLOCK_IPA)
+   return 

[v2 0/6] KVM: arm64: implement vcpu_is_preempted check

2022-11-04 Thread Usama Arif
This patchset adds support for vcpu_is_preempted in arm64, which allows the guest
to check if a vcpu was scheduled out, which is useful to know in case it was
holding a lock. vcpu_is_preempted can be used to improve
performance in locking (see owner_on_cpu usage in mutex_spin_on_owner,
mutex_can_spin_on_owner, rtmutex_spin_on_owner and osq_lock) and scheduling
(see available_idle_cpu which is used in several places in kernel/sched/fair.c
for e.g. in wake_affine to determine which CPU can run soonest):

This patchset shows improvement on overcommitted hosts (vCPUs > pCPUs), as waiting
for preempted vCPUs reduces performance.

This patchset is inspired from the para_steal_clock implementation and from the
work originally done by Zengruan Ye:
https://lore.kernel.org/linux-arm-kernel/20191226135833.1052-1-yezengr...@huawei.com/.

All the results in the below experiments are done on an aws r6g.metal instance
which has 64 pCPUs.

The following table shows the index results of UnixBench running on a 128 vCPU
VM with (6.0.0+vcpu_is_preempted) and without (6.0.0 base) the patchset.

TestName                               6.0.0 base  6.0.0+vcpu_is_preempted  % improvement for vcpu_is_preempted
Dhrystone 2 using register variables   187761      191274.7                 1.871368389
Double-Precision Whetstone             96743.6     98414.4                  1.727039308
Execl Throughput                       689.3       10426                    1412.548963
File Copy 1024 bufsize 2000 maxblocks  549.5       3165                     475.978162
File Copy 256 bufsize 500 maxblocks    400.7       2084.7                   420.2645371
File Copy 4096 bufsize 8000 maxblocks  894.3       5003.2                   459.4543218
Pipe Throughput                        76819.5     78601.5                  2.319723508
Pipe-based Context Switching           3444.8      13414.5                  289.4130283
Process Creation                       301.1       293.4                    -2.557289937
Shell Scripts (1 concurrent)           1248.1      28300.6                  2167.494592
Shell Scripts (8 concurrent)           781.2       26222.3                  3256.669227
System Call Overhead                   3426        3729.4                   8.855808523

System Benchmarks Index Score          3053        11534                    277.7923354

This shows a 277% overall improvement using these patches.

The biggest improvement is in the shell scripts benchmark, which forks a lot of 
processes.
This acquires an rwsem lock, where a large chunk of time is spent in the base
6.0.0 kernel.
This can be seen from one of the callstacks in the perf output of the shell
scripts benchmark on 6.0.0 base (pseudo NMI enabled for perf numbers below):
- 33.79% el0_svc
   - 33.43% do_el0_svc
  - 33.43% el0_svc_common.constprop.3
 - 33.30% invoke_syscall
- 17.27% __arm64_sys_clone
   - 17.27% __do_sys_clone
  - 17.26% kernel_clone
 - 16.73% copy_process
- 11.91% dup_mm
   - 11.82% dup_mmap
  - 9.15% down_write
 - 8.87% rwsem_down_write_slowpath
- 8.48% osq_lock

Just under 50% of the total time in the shell script benchmarks ends up being
spent in osq_lock in the base 6.0.0 kernel:
  Children  Self  Command   Shared ObjectSymbol
   17.19%10.71%  sh  [kernel.kallsyms]  [k] osq_lock
6.17% 4.04%  sort[kernel.kallsyms]  [k] osq_lock
4.20% 2.60%  multi.  [kernel.kallsyms]  [k] osq_lock
3.77% 2.47%  grep[kernel.kallsyms]  [k] osq_lock
3.50% 2.24%  expr[kernel.kallsyms]  [k] osq_lock
3.41% 2.23%  od  [kernel.kallsyms]  [k] osq_lock
3.36% 2.15%  rm  [kernel.kallsyms]  [k] osq_lock
3.28% 2.12%  tee [kernel.kallsyms]  [k] osq_lock
3.16% 2.02%  wc  [kernel.kallsyms]  [k] osq_lock
0.21% 0.13%  looper  [kernel.kallsyms]  [k] osq_lock
0.01% 0.00%  Run [kernel.kallsyms]  [k] osq_lock

and this comes down to less than 1% total with 6.0.0+vcpu_is_preempted kernel:
  Children  Self  Command   Shared ObjectSymbol
 0.26% 0.21%  sh  [kernel.kallsyms]  [k] osq_lock
 0.10% 0.08%  multi.  [kernel.kallsyms]  [k] osq_lock
 0.04% 0.04%  sort[kernel.kallsyms]  [k] osq_lock
 0.02% 0.01%  grep[kernel.kallsyms]  [k] osq_lock
 0.02% 0.02%  od  [kernel.kallsyms]  [k] osq_lock
 0.01% 0.01%  tee [kernel.kallsyms]  [k] osq_lock
 0.01% 0.00%  expr[kernel.kallsyms]  [k] osq_lock
 0.01% 0.01%  looper  [kernel.kallsyms]  [k] osq_lock
 0.00% 0.00%  wc  [kernel.kallsyms]  [k] osq_lock
 0.00% 0.00%  rm  [kernel.kallsyms]  [k] osq_lock

To make sure, there is no change in performance when vCPUs 

[v2 3/6] KVM: arm64: Support pvlock preempted via shared structure

2022-11-04 Thread Usama Arif
Implement the service call for configuring a shared structure between a
VCPU and the hypervisor in which the hypervisor can tell whether the
VCPU is running or not.

The preempted field is zero if the VCPU is not preempted.
Any other value means the VCPU has been preempted.

Signed-off-by: Zengruan Ye 
Signed-off-by: Usama Arif 
---
 Documentation/virt/kvm/arm/hypercalls.rst |  3 ++
 arch/arm64/include/asm/kvm_host.h | 18 ++
 arch/arm64/include/uapi/asm/kvm.h |  1 +
 arch/arm64/kvm/Makefile   |  2 +-
 arch/arm64/kvm/arm.c  |  8 +
 arch/arm64/kvm/hypercalls.c   |  8 +
 arch/arm64/kvm/pvlock.c   | 43 +++
 tools/arch/arm64/include/uapi/asm/kvm.h   |  1 +
 8 files changed, 83 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/kvm/pvlock.c

diff --git a/Documentation/virt/kvm/arm/hypercalls.rst 
b/Documentation/virt/kvm/arm/hypercalls.rst
index 3e23084644ba..872a16226ace 100644
--- a/Documentation/virt/kvm/arm/hypercalls.rst
+++ b/Documentation/virt/kvm/arm/hypercalls.rst
@@ -127,6 +127,9 @@ The pseudo-firmware bitmap register are as follows:
 Bit-1: KVM_REG_ARM_VENDOR_HYP_BIT_PTP:
   The bit represents the Precision Time Protocol KVM service.
 
+Bit-2: KVM_REG_ARM_VENDOR_HYP_BIT_PV_LOCK:
+  The bit represents the Paravirtualized lock service.
+
 Errors:
 
 ===  =
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 45e2136322ba..18303b30b7e9 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -417,6 +417,11 @@ struct kvm_vcpu_arch {
u64 last_steal;
gpa_t base;
} steal;
+
+   /* Guest PV lock state */
+   struct {
+   gpa_t base;
+   } pv;
 };
 
 /*
@@ -840,6 +845,19 @@ static inline bool kvm_arm_is_pvtime_enabled(struct 
kvm_vcpu_arch *vcpu_arch)
return (vcpu_arch->steal.base != GPA_INVALID);
 }
 
+static inline void kvm_arm_pvlock_preempted_init(struct kvm_vcpu_arch 
*vcpu_arch)
+{
+   vcpu_arch->pv.base = GPA_INVALID;
+}
+
+static inline bool kvm_arm_is_pvlock_preempted_ready(struct kvm_vcpu_arch 
*vcpu_arch)
+{
+   return (vcpu_arch->pv.base != GPA_INVALID);
+}
+
+gpa_t kvm_init_pvlock(struct kvm_vcpu *vcpu);
+void kvm_update_pvlock_preempted(struct kvm_vcpu *vcpu, u64 preempted);
+
 void kvm_set_sei_esr(struct kvm_vcpu *vcpu, u64 syndrome);
 
 struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr);
diff --git a/arch/arm64/include/uapi/asm/kvm.h 
b/arch/arm64/include/uapi/asm/kvm.h
index 316917b98707..bd05ece5c590 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -365,6 +365,7 @@ enum {
 enum {
KVM_REG_ARM_VENDOR_HYP_BIT_FUNC_FEAT= 0,
KVM_REG_ARM_VENDOR_HYP_BIT_PTP  = 1,
+   KVM_REG_ARM_VENDOR_HYP_BIT_PV_LOCK  = 2,
 #ifdef __KERNEL__
KVM_REG_ARM_VENDOR_HYP_BMAP_BIT_COUNT,
 #endif
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 5e33c2d4645a..e1f711885916 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -10,7 +10,7 @@ include $(srctree)/virt/kvm/Makefile.kvm
 obj-$(CONFIG_KVM) += kvm.o
 obj-$(CONFIG_KVM) += hyp/
 
-kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
+kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o pvlock.o \
 inject_fault.o va_layout.o handle_exit.o \
 guest.o debug.o reset.o sys_regs.o stacktrace.o \
 vgic-sys-reg-v3.o fpsimd.o pkvm.o \
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 94d33e296e10..73da4ac859fd 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -345,6 +345,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 
 kvm_arm_pvtime_vcpu_init(&vcpu->arch);
 
+   kvm_arm_pvlock_preempted_init(&vcpu->arch);
+
 vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu;
 
err = kvm_vgic_vcpu_init(vcpu);
@@ -420,6 +422,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
if (vcpu_has_ptrauth(vcpu))
vcpu_ptrauth_disable(vcpu);
+
+   if (kvm_arm_is_pvlock_preempted_ready(&vcpu->arch))
+   kvm_update_pvlock_preempted(vcpu, 0);
+
kvm_arch_vcpu_load_debug_state_flags(vcpu);
 
if (!cpumask_test_cpu(smp_processor_id(), 
vcpu->kvm->arch.supported_cpus))
@@ -433,6 +439,8 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
if (has_vhe())
kvm_vcpu_put_sysregs_vhe(vcpu);
kvm_timer_vcpu_put(vcpu);
+   if (kvm_arm_is_pvlock_preempted_ready(&vcpu->arch))
+   kvm_update_pvlock_preempted(vcpu, 1);
kvm_vgic_put(vcpu);
kvm_vcpu_pmu_restore_host(vcpu);
kvm_arm_vmid_clear_active();
diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
index c9f401fa01a9..ec85b4b2a272 100644
--- a/arch/arm64/kvm/hypercalls.c
+++ 

[v2 2/6] KVM: arm64: Add SMCCC paravirtualised lock calls

2022-11-04 Thread Usama Arif
Add a new SMCCC compatible hypercall for PV lock features:
  ARM_SMCCC_KVM_FUNC_PV_LOCK:   0xC6000002

Also add the header file which defines the ABI for the paravirtualized
lock features we're about to add.

Signed-off-by: Zengruan Ye 
Signed-off-by: Usama Arif 
---
 arch/arm64/include/asm/pvlock-abi.h | 17 +
 include/linux/arm-smccc.h   |  8 
 tools/include/linux/arm-smccc.h |  8 
 3 files changed, 33 insertions(+)
 create mode 100644 arch/arm64/include/asm/pvlock-abi.h

diff --git a/arch/arm64/include/asm/pvlock-abi.h 
b/arch/arm64/include/asm/pvlock-abi.h
new file mode 100644
index ..3f4574071679
--- /dev/null
+++ b/arch/arm64/include/asm/pvlock-abi.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright(c) 2019 Huawei Technologies Co., Ltd
+ * Author: Zengruan Ye 
+ * Usama Arif 
+ */
+
+#ifndef __ASM_PVLOCK_ABI_H
+#define __ASM_PVLOCK_ABI_H
+
+struct pvlock_vcpu_state {
+   __le64 preempted;
+   /* Structure must be 64 byte aligned, pad to that size */
+   u8 padding[56];
+} __packed;
+
+#endif
diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index 220c8c60e021..104c10035b10 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -112,6 +112,7 @@
 /* KVM "vendor specific" services */
 #define ARM_SMCCC_KVM_FUNC_FEATURES0
 #define ARM_SMCCC_KVM_FUNC_PTP 1
+#define ARM_SMCCC_KVM_FUNC_PV_LOCK 2
 #define ARM_SMCCC_KVM_FUNC_FEATURES_2  127
 #define ARM_SMCCC_KVM_NUM_FUNCS128
 
@@ -151,6 +152,13 @@
   ARM_SMCCC_OWNER_STANDARD_HYP,\
   0x21)
 
+/* Paravirtualised lock calls */
+#define ARM_SMCCC_VENDOR_HYP_KVM_PV_LOCK_FUNC_ID   \
+   ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
+  ARM_SMCCC_SMC_64,\
+  ARM_SMCCC_OWNER_VENDOR_HYP,  \
+  ARM_SMCCC_KVM_FUNC_PV_LOCK)
+
 /* TRNG entropy source calls (defined by ARM DEN0098) */
 #define ARM_SMCCC_TRNG_VERSION \
ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
diff --git a/tools/include/linux/arm-smccc.h b/tools/include/linux/arm-smccc.h
index 63ce9bebccd3..c21e539c0228 100644
--- a/tools/include/linux/arm-smccc.h
+++ b/tools/include/linux/arm-smccc.h
@@ -111,6 +111,7 @@
 /* KVM "vendor specific" services */
 #define ARM_SMCCC_KVM_FUNC_FEATURES0
 #define ARM_SMCCC_KVM_FUNC_PTP 1
+#define ARM_SMCCC_KVM_FUNC_PV_LOCK 2
 #define ARM_SMCCC_KVM_FUNC_FEATURES_2  127
 #define ARM_SMCCC_KVM_NUM_FUNCS128
 
@@ -150,6 +151,13 @@
   ARM_SMCCC_OWNER_STANDARD_HYP,\
   0x21)
 
+/* Paravirtualised lock calls */
+#define ARM_SMCCC_VENDOR_HYP_KVM_PV_LOCK_FUNC_ID   \
+   ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
+  ARM_SMCCC_SMC_64,\
+  ARM_SMCCC_OWNER_VENDOR_HYP,  \
+  ARM_SMCCC_KVM_FUNC_PV_LOCK)
+
 /* TRNG entropy source calls (defined by ARM DEN0098) */
 #define ARM_SMCCC_TRNG_VERSION \
ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
-- 
2.25.1



[v2 1/6] KVM: arm64: Document PV-lock interface

2022-11-04 Thread Usama Arif
Introduce a paravirtualization interface for KVM/arm64 to obtain whether
the VCPU is currently running or not.

The PV lock structure of the guest is allocated by user space.

A hypercall interface is provided for the guest to interrogate the
location of the shared memory structures.

Signed-off-by: Zengruan Ye 
Signed-off-by: Usama Arif 
---
 Documentation/virt/kvm/arm/index.rst|  1 +
 Documentation/virt/kvm/arm/pvlock.rst   | 52 +
 Documentation/virt/kvm/devices/vcpu.rst | 25 
 3 files changed, 78 insertions(+)
 create mode 100644 Documentation/virt/kvm/arm/pvlock.rst

diff --git a/Documentation/virt/kvm/arm/index.rst 
b/Documentation/virt/kvm/arm/index.rst
index e84848432158..b8499dc00a6a 100644
--- a/Documentation/virt/kvm/arm/index.rst
+++ b/Documentation/virt/kvm/arm/index.rst
@@ -10,4 +10,5 @@ ARM
hyp-abi
hypercalls
pvtime
+   pvlock
ptp_kvm
diff --git a/Documentation/virt/kvm/arm/pvlock.rst 
b/Documentation/virt/kvm/arm/pvlock.rst
new file mode 100644
index ..d3c391b16d36
--- /dev/null
+++ b/Documentation/virt/kvm/arm/pvlock.rst
@@ -0,0 +1,52 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Paravirtualized lock support for arm64
+==
+
+KVM/arm64 provides a hypervisor service call for paravirtualized guests to
+determine whether a VCPU is currently running or not.
+
+A new SMCCC compatible hypercall is defined:
+
+* ARM_SMCCC_VENDOR_HYP_KVM_PV_LOCK_FUNC_ID:   0xC6000002
+
+ARM_SMCCC_VENDOR_HYP_KVM_PV_LOCK_FUNC_ID
+
+= ==
+Function ID:  (uint32)    0xC6000002
+Return value: (int64) IPA of the pv lock data structure for this
+  VCPU. On failure:
+  NOT_SUPPORTED (-1)
+= ==
+
+The IPA returned by PV_LOCK_PREEMPTED should be mapped by the guest as normal
+memory with inner and outer write back caching attributes, in the inner
+shareable domain.
+
+PV_LOCK_PREEMPTED returns the structure for the calling VCPU.
+
+PV lock state
+-
+
+The structure pointed to by the PV_LOCK_PREEMPTED hypercall is as follows:
+
++---+-+-+-+
+| Field | Byte Length | Byte Offset | Description |
++===+=+=+=+
+| preempted |  8  |  0  | Indicate if the VCPU that owns  |
+|   | | | this struct is running or not.  |
+|   | | | Non-zero values mean the VCPU   |
+|   | | | has been preempted. Zero means  |
+|   | | | the VCPU is not preempted.  |
++---+-+-+-+
+
+The preempted field will be updated to 0 by the hypervisor prior to scheduling
+a VCPU in. When the VCPU is scheduled out, the preempted field will be updated
+to 1 by the hypervisor.
+
+The structure will be present within a reserved region of the normal memory
+given to the guest. The guest should not attempt to write into this memory.
+There is a structure per VCPU of the guest.
+
+For the user space interface see
+:ref:`Documentation/virt/kvm/devices/vcpu.rst `.
\ No newline at end of file
diff --git a/Documentation/virt/kvm/devices/vcpu.rst 
b/Documentation/virt/kvm/devices/vcpu.rst
index 716aa3edae14..ea81b98f0e19 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -263,3 +263,28 @@ From the destination VMM process:
 
 7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
respective value derived in the previous step.
+
+.. _kvm_arm_vcpu_pvlock_ctrl:
+
+5. GROUP: KVM_ARM_VCPU_PVLOCK_CTRL
+==
+
+:Architectures: ARM64
+
+5.1 ATTRIBUTE: KVM_ARM_VCPU_PVLOCK_IPA
+--
+
+:Parameters: 64-bit base address
+
+Returns:
+
+===  ==
+-ENXIO   PV lock not implemented
+-EEXIST  Base address already set for this VCPU
+-EINVAL  Base address not 64 byte aligned
+===  ==
+
+Specifies the base address of the pv lock structure for this VCPU. The
+base address must be 64 byte aligned and exist within a valid guest memory
+region. See Documentation/virt/kvm/arm/pvlock.rst for more information
+including the layout of the pv lock structure.
-- 
2.25.1



[PATCH] arm64: paravirt: remove conduit check in has_pv_steal_clock

2022-11-04 Thread Usama Arif
arm_smccc_1_1_invoke() which is called later on in the function
will return failure if there's no conduit (or pre-SMCCC 1.1),
hence the check is unnecessary.
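
For context, a simplified model of the invoke path (a paraphrased sketch, not
the exact kernel macro) showing why the removed check is already covered:

static void smccc_1_1_invoke_sketch(u32 fn, u32 arg, struct arm_smccc_res *res)
{
	switch (arm_smccc_1_1_get_conduit()) {
	case SMCCC_CONDUIT_HVC:
		arm_smccc_1_1_hvc(fn, arg, res);
		break;
	case SMCCC_CONDUIT_SMC:
		arm_smccc_1_1_smc(fn, arg, res);
		break;
	default:	/* SMCCC_CONDUIT_NONE, i.e. no conduit or pre-SMCCC 1.1 */
		res->a0 = SMCCC_RET_NOT_SUPPORTED;
		break;
	}
}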

Suggested-by: Steven Price 
Signed-off-by: Usama Arif 
---
 arch/arm64/kernel/paravirt.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/arch/arm64/kernel/paravirt.c b/arch/arm64/kernel/paravirt.c
index 57c7c211f8c7..aa718d6a9274 100644
--- a/arch/arm64/kernel/paravirt.c
+++ b/arch/arm64/kernel/paravirt.c
@@ -141,10 +141,6 @@ static bool __init has_pv_steal_clock(void)
 {
struct arm_smccc_res res;
 
-   /* To detect the presence of PV time support we require SMCCC 1.1+ */
-   if (arm_smccc_1_1_get_conduit() == SMCCC_CONDUIT_NONE)
-   return false;
-
arm_smccc_1_1_invoke(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
 ARM_SMCCC_HV_PV_TIME_FEATURES, &res);
 
-- 
2.25.1
