Re: [PATCH v15 10/12] swiotlb: Add restricted DMA pool initialization

2021-08-26 Thread Claire Chang
On Tue, Aug 24, 2021 at 10:26 PM Guenter Roeck  wrote:
>
> Hi Claire,
>
> On Thu, Jun 24, 2021 at 11:55:24PM +0800, Claire Chang wrote:
> > Add the initialization function to create restricted DMA pools from
> > matching reserved-memory nodes.
> >
> > Regardless of swiotlb setting, the restricted DMA pool is preferred if
> > available.
> >
> > The restricted DMA pools provide a basic level of protection against the
> > DMA overwriting buffer contents at unexpected times. However, to protect
> > against general data leakage and system memory corruption, the system
> > needs to provide a way to lock down the memory access, e.g., MPU.
> >
> > Signed-off-by: Claire Chang 
> > Reviewed-by: Christoph Hellwig 
> > Tested-by: Stefano Stabellini 
> > Tested-by: Will Deacon 
> > ---
> >  include/linux/swiotlb.h |  3 ++-
> >  kernel/dma/Kconfig      | 14 ++++++++++++++
> >  kernel/dma/swiotlb.c    | 76 ++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 92 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> > index 3b9454d1e498..39284ff2a6cd 100644
> > --- a/include/linux/swiotlb.h
> > +++ b/include/linux/swiotlb.h
> > @@ -73,7 +73,8 @@ extern enum swiotlb_force swiotlb_force;
> >   *   range check to see if the memory was in fact allocated by this
> >   *   API.
> >   * @nslabs:  The number of IO TLB blocks (in groups of 64) between @start and
> > - *   @end. This is command line adjustable via setup_io_tlb_npages.
> > + *   @end. For default swiotlb, this is command line adjustable via
> > + *   setup_io_tlb_npages.
> >   * @used:   The number of used IO TLB block.
> >   * @list:   The free list describing the number of free entries available
> >   *   from each index.
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index 77b405508743..3e961dc39634 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -80,6 +80,20 @@ config SWIOTLB
> >   bool
> >   select NEED_DMA_MAP_STATE
> >
> > +config DMA_RESTRICTED_POOL
> > + bool "DMA Restricted Pool"
> > + depends on OF && OF_RESERVED_MEM
> > + select SWIOTLB
>
> This makes SWIOTLB user configurable, which in turn results in
>
> mips64-linux-ld: arch/mips/kernel/setup.o: in function `arch_mem_init':
> setup.c:(.init.text+0x19c8): undefined reference to `plat_swiotlb_setup'
> make[1]: *** [Makefile:1280: vmlinux] Error 1
>
> when building mips:allmodconfig.
>
> Should this possibly be "depends on SWIOTLB" ?

Patch is sent here: https://lkml.org/lkml/2021/8/26/932
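
(For reference, the fix amounts to making DMA_RESTRICTED_POOL depend on
SWIOTLB rather than select it. A sketch of the resulting Kconfig entry, not
a quote from the linked patch:

    config DMA_RESTRICTED_POOL
        bool "DMA Restricted Pool"
        depends on OF && OF_RESERVED_MEM && SWIOTLB

With "depends on", the option is only offered when the architecture already
enables SWIOTLB, so mips platforms that never provide plat_swiotlb_setup()
are no longer forced to build the swiotlb setup path.)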

>
> Thanks,
> Guenter

Thanks,
Claire
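
(For context, the reserved-memory nodes that this initialization matches use
the "restricted-dma-pool" compatible from the binding added earlier in the
series. A sketch, with illustrative node names, addresses and sizes:

    reserved-memory {
        #address-cells = <2>;
        #size-cells = <2>;
        ranges;

        restricted_dma: restricted-dma@50000000 {
            compatible = "restricted-dma-pool";
            reg = <0x0 0x50000000 0x0 0x400000>;
        };
    };

A device then references the pool via "memory-region = <&restricted_dma>;"
and has its streaming DMA bounced through that pool.)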



RE: [PATCH] VT-d: fix caching mode IOTLB flushing

2021-08-26 Thread Tian, Kevin
> From: Jan Beulich 
> Sent: Thursday, August 19, 2021 4:06 PM
> 
> While for context cache entry flushing use of did 0 is indeed correct
> (after all upon reading the context entry the IOMMU wouldn't know any
> domain ID if the entry is not present, and hence a surrogate one needs
> to be used), for IOTLB entries the normal domain ID (from the [present]
> context entry) gets used. See sub-section "IOTLB" of section "Address
> Translation Caches" in the VT-d spec.
> 
> Signed-off-by: Jan Beulich 

Reviewed-by: Kevin Tian 

> ---
> Luckily this is supposed to be an issue only when running on emulated
> IOMMUs; hardware implementations are expected to have CAP.CM=0.
> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -474,17 +474,10 @@ int vtd_flush_iotlb_reg(struct vtd_iommu
> 
>  /*
>   * In the non-present entry flush case, if hardware doesn't cache
> - * non-present entry we do nothing and if hardware cache non-present
> - * entry, we flush entries of domain 0 (the domain id is used to cache
> - * any non-present entries)
> + * non-present entries we do nothing.
>   */
> -if ( flush_non_present_entry )
> -{
> -if ( !cap_caching_mode(iommu->cap) )
> -return 1;
> -else
> -did = 0;
> -}
> +if ( flush_non_present_entry && !cap_caching_mode(iommu->cap) )
> +return 1;
> 
>  /* use register invalidation */
>  switch ( type )
> --- a/xen/drivers/passthrough/vtd/qinval.c
> +++ b/xen/drivers/passthrough/vtd/qinval.c
> @@ -362,17 +362,10 @@ static int __must_check flush_iotlb_qi(s
> 
>  /*
>   * In the non-present entry flush case, if hardware doesn't cache
> - * non-present entry we do nothing and if hardware cache non-present
> - * entry, we flush entries of domain 0 (the domain id is used to cache
> - * any non-present entries)
> + * non-present entries we do nothing.
>   */
> -if ( flush_non_present_entry )
> -{
> -if ( !cap_caching_mode(iommu->cap) )
> -return 1;
> -else
> -did = 0;
> -}
> +if ( flush_non_present_entry && !cap_caching_mode(iommu->cap) )
> +return 1;
> 
>  /* use queued invalidation */
>  if (cap_write_drain(iommu->cap))



[xen-4.14-testing test] 164493: regressions - FAIL

2021-08-26 Thread osstest service owner
flight 164493 xen-4.14-testing real [real]
flight 164505 xen-4.14-testing real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/164493/
http://logs.test-lab.xenproject.org/osstest/logs/164505/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-amd64-dom0pvh-xl-amd    8 xen-boot  fail REGR. vs. 163750
 test-amd64-amd64-dom0pvh-xl-intel  8 xen-boot  fail REGR. vs. 163750

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-xl-qemut-debianhvm-i386-xsm 12 debian-hvm-install fail pass in 164505-retest

Tests which did not succeed, but are not blocking:
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 163750
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop fail like 163750
 test-armhf-armhf-libvirt 16 saverestore-support-check fail like 163750
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 163750
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop fail like 163750
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop fail like 163750
 test-armhf-armhf-libvirt-raw 15 saverestore-support-check fail like 163750
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 163750
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 163750
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 163750
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop fail like 163750
 test-amd64-amd64-libvirt 15 migrate-support-check fail never pass
 test-amd64-i386-xl-pvshim 14 guest-start fail never pass
 test-arm64-arm64-xl-seattle 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-seattle 16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check fail never pass
 test-amd64-i386-libvirt-xsm 15 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-i386-libvirt 15 migrate-support-check fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit1 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit1 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-xsm 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit2 15 migrate-support-check fail never pass
 test-arm64-arm64-xl 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-credit2 16 saverestore-support-check fail never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-arndale 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-arndale 16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit2 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit2 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt 15 migrate-support-check fail never pass
 test-armhf-armhf-xl 15 migrate-support-check fail never pass
 test-armhf-armhf-xl 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit1 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit1 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-vhd 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-vhd 15 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-check fail never pass

version targeted for testing:
 xen  74e93071826fe3aaab32e469280a3253a39147f6
baseline version:
 xen  49299c4813b7847d29df07bf790f5489060f2a9c

Last test of basis   163750  2021-07-16 

RE: [XEN RFC PATCH 16/40] xen/arm: Create a fake NUMA node to use common code

2021-08-26 Thread Wei Chen
Hi Stefano,

> -Original Message-
> From: Stefano Stabellini 
> Sent: 27 August 2021 7:10
> To: Wei Chen 
> Cc: xen-devel@lists.xenproject.org; sstabell...@kernel.org; jul...@xen.org;
> jbeul...@suse.com; Bertrand Marquis 
> Subject: Re: [XEN RFC PATCH 16/40] xen/arm: Create a fake NUMA node to use
> common code
> 
> On Wed, 11 Aug 2021, Wei Chen wrote:
> > When CONFIG_NUMA is enabled for Arm, Xen will switch to the common
> > NUMA API instead of the previous fake NUMA API. Before we parse NUMA
> > information from the device tree or ACPI SRAT table, we need to
> > initialize the NUMA-related variables, like cpu_to_node, as a
> > single-node NUMA system.
> >
> > So in this patch, we introduce a numa_init function to initialize
> > these data structures as if all resources belong to node#0. This
> > will make the new API return the same values as the fake API did.
> >
> > Signed-off-by: Wei Chen 
> > ---
> >  xen/arch/arm/numa.c| 53 ++
> >  xen/arch/arm/setup.c   |  8 ++
> >  xen/include/asm-arm/numa.h | 11 
> >  3 files changed, 72 insertions(+)
> >
> > diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
> > index 1e30c5bb13..566ad1e52b 100644
> > --- a/xen/arch/arm/numa.c
> > +++ b/xen/arch/arm/numa.c
> > @@ -20,6 +20,8 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> >
> >  void numa_set_node(int cpu, nodeid_t nid)
> >  {
> > @@ -29,3 +31,54 @@ void numa_set_node(int cpu, nodeid_t nid)
> >
> >  cpu_to_node[cpu] = nid;
> >  }
> > +
> > +void __init numa_init(bool acpi_off)
> > +{
> > +uint32_t idx;
> > +paddr_t ram_start = ~0;
> > +paddr_t ram_size = 0;
> > +paddr_t ram_end = 0;
> > +
> > +printk(XENLOG_WARNING
> > +"NUMA has not been supported yet, NUMA off!\n");
> 
> NIT: please align
> 


OK

> 
> > +/* Arm NUMA has not been implemented until this patch */
> 
> "Arm NUMA is not implemented yet"
> 

OK

> 
> > +numa_off = true;
> > +
> > +/*
> > + * Set all cpu_to_node mapping to 0, this will make cpu_to_node
> > + * function return 0 as previous fake cpu_to_node API.
> > + */
> > +for ( idx = 0; idx < NR_CPUS; idx++ )
> > +cpu_to_node[idx] = 0;
> > +
> > +/*
> > + * Make node_to_cpumask, node_spanned_pages and node_start_pfn
> > + * return as previous fake APIs.
> > + */
> > +for ( idx = 0; idx < MAX_NUMNODES; idx++ ) {
> > +node_to_cpumask[idx] = cpu_online_map;
> > +node_spanned_pages(idx) = (max_page - mfn_x(first_valid_mfn));
> > +node_start_pfn(idx) = (mfn_x(first_valid_mfn));
> > +}
> 
> I just want to note that this works because MAX_NUMNODES is 1. If
> MAX_NUMNODES was > 1 then it would be wrong to set node_to_cpumask,
> node_spanned_pages and node_start_pfn for all nodes to the same values.
> 
> It might be worth writing something about it in the in-code comment.
> 

OK, I will do it.
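
(A sketch of what such an in-code comment might say; the wording below is
illustrative, not from the thread:

/*
 * Setting every node's node_to_cpumask, node_spanned_pages and
 * node_start_pfn to span all of RAM and the online cpumask is only
 * correct because MAX_NUMNODES == 1 here; with MAX_NUMNODES > 1 these
 * values would have to be derived per node.
 */
)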

> 
> > +/*
> > + * Find the minimal and maximum address of RAM, NUMA will
> > + * build a memory to node mapping table for the whole range.
> > + */
> > +ram_start = bootinfo.mem.bank[0].start;
> > +ram_size  = bootinfo.mem.bank[0].size;
> > +ram_end   = ram_start + ram_size;
> > +for ( idx = 1 ; idx < bootinfo.mem.nr_banks; idx++ )
> > +{
> > +paddr_t bank_start = bootinfo.mem.bank[idx].start;
> > +paddr_t bank_size = bootinfo.mem.bank[idx].size;
> > +paddr_t bank_end = bank_start + bank_size;
> > +
> > +ram_size  = ram_size + bank_size;
> 
> ram_size is updated but not utilized
> 

Ok, I will remove it.

> 
> > +ram_start = min(ram_start, bank_start);
> > +ram_end   = max(ram_end, bank_end);
> > +}
> > +
> > +numa_initmem_init(PFN_UP(ram_start), PFN_DOWN(ram_end));
> > +return;
> > +}
> > diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
> > index 63a908e325..3c58d2d441 100644
> > --- a/xen/arch/arm/setup.c
> > +++ b/xen/arch/arm/setup.c
> > @@ -30,6 +30,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -874,6 +875,13 @@ void __init start_xen(unsigned long
> boot_phys_offset,
> >  /* Parse the ACPI tables for possible boot-time configuration */
> >  acpi_boot_table_init();
> >
> > +/*
> > + * Try to initialize NUMA system, if failed, the system will
> > + * fallback to uniform system which means system has only 1
> > + * NUMA node.
> > + */
> > +numa_init(acpi_disabled);
> > +
> >  end_boot_allocator();
> >
> >  /*
> > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> > index b2982f9053..bb495a24e1 100644
> > --- a/xen/include/asm-arm/numa.h
> > +++ b/xen/include/asm-arm/numa.h
> > @@ -13,6 +13,16 @@ typedef u8 nodeid_t;
> >   */
> >  #define NODES_SHIFT  6
> >
> > +extern void numa_init(bool acpi_off);
> > +
> > +/*
> > + * Temporary for fake NUMA node, when CPU, memory 

[PATCH 14/15] perf: Disallow bulk unregistering of guest callbacks and do cleanup

2021-08-26 Thread Sean Christopherson
Drop the helper that allows bulk unregistering of the per-CPU callbacks
now that KVM, the only entity that actually unregisters callbacks, uses
the per-CPU helpers.  Bulk unregistering is inherently unsafe as there
are no protections against nullifying a pointer for a CPU that is using
said pointer in a PMI handler.

Opportunistically tweak names to better reflect reality.
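
(After this patch the intended usage, as sketched from the diff below, is:
KVM registers and unregisters on the current CPU around guest NMI handling,

    perf_register_guest_info_callbacks(&kvm_guest_cbs);
    ...
    perf_unregister_guest_info_callbacks();

while Xen, which really does want every CPU, calls
perf_register_guest_info_callbacks_all_cpus(&xen_guest_cbs) once at init and
never unregisters.)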

Signed-off-by: Sean Christopherson 
---
 arch/x86/xen/pmu.c         |  2 +-
 include/linux/kvm_host.h   |  2 +-
 include/linux/perf_event.h |  9 +++------
 kernel/events/core.c       | 31 +++++++++++--------------------
 virt/kvm/kvm_main.c        |  2 +-
 5 files changed, 17 insertions(+), 29 deletions(-)

diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c
index e13b0b49fcdf..57834de043c3 100644
--- a/arch/x86/xen/pmu.c
+++ b/arch/x86/xen/pmu.c
@@ -548,7 +548,7 @@ void xen_pmu_init(int cpu)
per_cpu(xenpmu_shared, cpu).flags = 0;
 
if (cpu == 0) {
-   perf_register_guest_info_callbacks(&xen_guest_cbs);
+   perf_register_guest_info_callbacks_all_cpus(&xen_guest_cbs);
xen_pmu_arch_init();
}
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0db9af0b628c..d68a49d5fc53 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1171,7 +1171,7 @@ unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu);
 void kvm_register_perf_callbacks(void);
 static inline void kvm_unregister_perf_callbacks(void)
 {
-   __perf_unregister_guest_info_callbacks();
+   perf_unregister_guest_info_callbacks();
 }
 #endif
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 7a367bf1b78d..db701409a62f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1238,10 +1238,9 @@ extern void perf_event_bpf_event(struct bpf_prog *prog,
 
 #ifdef CONFIG_HAVE_GUEST_PERF_EVENTS
 DECLARE_PER_CPU(struct perf_guest_info_callbacks *, perf_guest_cbs);
-extern void __perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs);
-extern void __perf_unregister_guest_info_callbacks(void);
-extern void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
+extern void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs);
 extern void perf_unregister_guest_info_callbacks(void);
+extern void perf_register_guest_info_callbacks_all_cpus(struct perf_guest_info_callbacks *cbs);
 #endif /* CONFIG_HAVE_GUEST_PERF_EVENTS */
 
 extern void perf_event_exec(void);
@@ -1486,9 +1485,7 @@ static inline void
 perf_bp_event(struct perf_event *event, void *data){ }
 
 #ifdef CONFIG_HAVE_GUEST_PERF_EVENTS
-static inline void perf_register_guest_info_callbacks
-(struct perf_guest_info_callbacks *callbacks)  { }
-static inline void perf_unregister_guest_info_callbacks(void)  { }
+extern void perf_register_guest_info_callbacks_all_cpus(struct perf_guest_info_callbacks *cbs);
 #endif
 
 static inline void perf_event_mmap(struct vm_area_struct *vma) { }
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2f28d9d8dc94..f1964096c4c2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6485,35 +6485,26 @@ static void perf_pending_event(struct irq_work *entry)
 #ifdef CONFIG_HAVE_GUEST_PERF_EVENTS
 DEFINE_PER_CPU(struct perf_guest_info_callbacks *, perf_guest_cbs);
 
-void __perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
-{
-   __this_cpu_write(perf_guest_cbs, cbs);
-}
-EXPORT_SYMBOL_GPL(__perf_register_guest_info_callbacks);
-
-void __perf_unregister_guest_info_callbacks(void)
-{
-   __this_cpu_write(perf_guest_cbs, NULL);
-}
-EXPORT_SYMBOL_GPL(__perf_unregister_guest_info_callbacks);
-
 void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
 {
-   int cpu;
-
-   for_each_possible_cpu(cpu)
-   per_cpu(perf_guest_cbs, cpu) = cbs;
+   __this_cpu_write(perf_guest_cbs, cbs);
 }
 EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);
 
 void perf_unregister_guest_info_callbacks(void)
 {
-   int cpu;
-
-   for_each_possible_cpu(cpu)
-   per_cpu(perf_guest_cbs, cpu) = NULL;
+   __this_cpu_write(perf_guest_cbs, NULL);
 }
 EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);
+
+void perf_register_guest_info_callbacks_all_cpus(struct perf_guest_info_callbacks *cbs)
+{
+   int cpu;
+
+   for_each_possible_cpu(cpu)
+   per_cpu(perf_guest_cbs, cpu) = cbs;
+}
+EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks_all_cpus);
 #endif
 
 static void
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e0b1c9386926..1bcc3eab510b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5502,7 +5502,7 @@ EXPORT_SYMBOL_GPL(kvm_set_intel_pt_intr_handler);
 
 void kvm_register_perf_callbacks(void)
 {
-   __perf_register_guest_info_callbacks(&kvm_guest_cbs);
+   perf_register_guest_info_callbacks(&kvm_guest_cbs);

[PATCH 13/15] KVM: arm64: Drop perf.c and fold its tiny bit of code into pmu.c

2021-08-26 Thread Sean Christopherson
Fold that last few remnants of perf.c into pmu.c and rename the init
helper as appropriate.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h |  2 --
 arch/arm64/kvm/Makefile           |  2 +-
 arch/arm64/kvm/arm.c              |  3 ++-
 arch/arm64/kvm/perf.c             | 20 --------------------
 arch/arm64/kvm/pmu.c              |  8 ++++++++
 include/kvm/arm_pmu.h             |  1 +
 6 files changed, 12 insertions(+), 24 deletions(-)
 delete mode 100644 arch/arm64/kvm/perf.c

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 12e8d789e1ac..86c0fdd11ad2 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -670,8 +670,6 @@ unsigned long kvm_mmio_read_buf(const void *buf, unsigned int len);
 int kvm_handle_mmio_return(struct kvm_vcpu *vcpu);
 int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa);
 
-void kvm_perf_init(void);
-
 #ifdef CONFIG_PERF_EVENTS
 #define __KVM_WANT_PERF_CALLBACKS
 #else
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 989bb5dad2c8..0bcc378b7961 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -12,7 +12,7 @@ obj-$(CONFIG_KVM) += hyp/
 
 kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
 $(KVM)/vfio.o $(KVM)/irqchip.o $(KVM)/binary_stats.o \
-arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \
+arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
 inject_fault.o va_layout.o handle_exit.o \
 guest.o debug.o reset.o sys_regs.o \
 vgic-sys-reg-v3.o fpsimd.o pmu.o \
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index dfc8078dd4f9..57e637dee71d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1747,7 +1747,8 @@ static int init_subsystems(void)
if (err)
goto out;
 
-   kvm_perf_init();
+   kvm_pmu_init();
+
kvm_sys_reg_table_init();
 
 out:
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
deleted file mode 100644
index ad9fdc2f2f70..000000000000
--- a/arch/arm64/kvm/perf.c
+++ /dev/null
@@ -1,20 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * Based on the x86 implementation.
- *
- * Copyright (C) 2012 ARM Ltd.
- * Author: Marc Zyngier 
- */
-
-#include 
-#include 
-
-#include 
-
-DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
-
-void kvm_perf_init(void)
-{
-   if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
-   static_branch_enable(&kvm_arm_pmu_available);
-}
diff --git a/arch/arm64/kvm/pmu.c b/arch/arm64/kvm/pmu.c
index 03a6c1f4a09a..d98b57a17043 100644
--- a/arch/arm64/kvm/pmu.c
+++ b/arch/arm64/kvm/pmu.c
@@ -7,6 +7,14 @@
 #include 
 #include 
 
+DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
+
+void kvm_pmu_init(void)
+{
+   if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
+   static_branch_enable(&kvm_arm_pmu_available);
+}
+
 /*
  * Given the perf event attributes and system type, determine
  * if we are going to need to switch counters at guest entry/exit.
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index 864b9997efb2..42270676498d 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -14,6 +14,7 @@
 #define ARMV8_PMU_MAX_COUNTER_PAIRS((ARMV8_PMU_MAX_COUNTERS + 1) >> 1)
 
 DECLARE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
+void kvm_pmu_init(void);
 
 static __always_inline bool kvm_arm_support_pmu_v3(void)
 {
-- 
2.33.0.259.gc128427fd7-goog




[PATCH 12/15] KVM: arm64: Convert to the generic perf callbacks

2021-08-26 Thread Sean Christopherson
Drop arm64's version of the callbacks in favor of the callbacks provided
by generic KVM, which are semantically identical.  Implement the "get ip"
hook as needed.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h |  6 +-----
 arch/arm64/kvm/arm.c              |  5 +++++
 arch/arm64/kvm/perf.c             | 38 --------------------------------------
 3 files changed, 6 insertions(+), 43 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 007c38d77fd9..12e8d789e1ac 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -673,11 +673,7 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa);
 void kvm_perf_init(void);
 
 #ifdef CONFIG_PERF_EVENTS
-void kvm_register_perf_callbacks(void);
-static inline void kvm_unregister_perf_callbacks(void)
-{
-   __perf_unregister_guest_info_callbacks();
-}
+#define __KVM_WANT_PERF_CALLBACKS
 #else
 static inline void kvm_register_perf_callbacks(void) {}
 static inline void kvm_unregister_perf_callbacks(void) {}
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index ec386971030d..dfc8078dd4f9 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -503,6 +503,11 @@ bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
return vcpu_mode_priv(vcpu);
 }
 
+unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu)
+{
+   return *vcpu_pc(vcpu);
+}
+
 /* Just ensure a guest exit from a particular CPU */
 static void exit_vm_noop(void *info)
 {
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
index 2556b0a3b096..ad9fdc2f2f70 100644
--- a/arch/arm64/kvm/perf.c
+++ b/arch/arm64/kvm/perf.c
@@ -13,44 +13,6 @@
 
 DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
 
-#ifdef CONFIG_PERF_EVENTS
-static int kvm_is_in_guest(void)
-{
-   return true;
-}
-
-static int kvm_is_user_mode(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-   if (WARN_ON_ONCE(!vcpu))
-   return 0;
-
-   return !vcpu_mode_priv(vcpu);
-}
-
-static unsigned long kvm_get_guest_ip(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-   if (WARN_ON_ONCE(!vcpu))
-   return 0;
-
-   return *vcpu_pc(vcpu);
-}
-
-static struct perf_guest_info_callbacks kvm_guest_cbs = {
-   .is_in_guest= kvm_is_in_guest,
-   .is_user_mode   = kvm_is_user_mode,
-   .get_guest_ip   = kvm_get_guest_ip,
-};
-
-void kvm_register_perf_callbacks(void)
-{
-   __perf_register_guest_info_callbacks(&kvm_guest_cbs);
-}
-#endif /* CONFIG_PERF_EVENTS*/
-
 void kvm_perf_init(void)
 {
if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
-- 
2.33.0.259.gc128427fd7-goog




[PATCH 10/15] KVM: Move x86's perf guest info callbacks to generic KVM

2021-08-26 Thread Sean Christopherson
Move x86's perf guest callbacks into common KVM, as they are semantically
identical to arm64's callbacks (the only other such KVM callbacks).
arm64 will convert to the common versions in a future patch.

Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 48 +++++++-----------------------------------------
 arch/x86/kvm/x86.h              |  6 ------
 include/linux/kvm_host.h        | 12 ++++++++++++
 virt/kvm/kvm_main.c             | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 66 insertions(+), 47 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 465b35736d9b..63553a1f43ee 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -36,6 +36,7 @@
 #include 
 
 #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
+#define __KVM_WANT_PERF_CALLBACKS
 
 #define KVM_MAX_VCPUS 288
 #define KVM_SOFT_MAX_VCPUS 240
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e337aef60793..7cb0f04e24ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8264,32 +8264,6 @@ static void kvm_timer_init(void)
  kvmclock_cpu_online, kvmclock_cpu_down_prep);
 }
 
-static int kvm_is_in_guest(void)
-{
-   /* x86's callbacks are registered only when handling a guest NMI. */
-   return true;
-}
-
-static int kvm_is_user_mode(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-   if (WARN_ON_ONCE(!vcpu))
-   return 0;
-
-   return static_call(kvm_x86_get_cpl)(vcpu) != 0;
-}
-
-static unsigned long kvm_get_guest_ip(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-   if (WARN_ON_ONCE(!vcpu))
-   return 0;
-
-   return kvm_rip_read(vcpu);
-}
-
 static void kvm_handle_intel_pt_intr(void)
 {
struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
@@ -8302,19 +8276,6 @@ static void kvm_handle_intel_pt_intr(void)
	(unsigned long *)&vcpu->arch.pmu.global_status);
 }
 
-static struct perf_guest_info_callbacks kvm_guest_cbs = {
-   .is_in_guest= kvm_is_in_guest,
-   .is_user_mode   = kvm_is_user_mode,
-   .get_guest_ip   = kvm_get_guest_ip,
-   .handle_intel_pt_intr   = NULL,
-};
-
-void kvm_register_perf_callbacks(void)
-{
-   __perf_register_guest_info_callbacks(&kvm_guest_cbs);
-}
-EXPORT_SYMBOL_GPL(kvm_register_perf_callbacks);
-
 #ifdef CONFIG_X86_64
 static void pvclock_gtod_update_fn(struct work_struct *work)
 {
@@ -11069,7 +11030,7 @@ int kvm_arch_hardware_setup(void *opaque)
kvm_ops_static_call_update();
 
if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest())
-   kvm_guest_cbs.handle_intel_pt_intr = kvm_handle_intel_pt_intr;
+   kvm_set_intel_pt_intr_handler(kvm_handle_intel_pt_intr);
 
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
supported_xss = 0;
@@ -11098,7 +11059,7 @@ int kvm_arch_hardware_setup(void *opaque)
 
 void kvm_arch_hardware_unsetup(void)
 {
-   kvm_guest_cbs.handle_intel_pt_intr = NULL;
+   kvm_set_intel_pt_intr_handler(NULL);
 
static_call(kvm_x86_hardware_unsetup)();
 }
@@ -11725,6 +11686,11 @@ bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
return vcpu->arch.preempted_in_kernel;
 }
 
+unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu)
+{
+   return kvm_rip_read(vcpu);
+}
+
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
 {
return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index f13f15d2fab8..e1fe738c3827 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -387,12 +387,6 @@ static inline bool kvm_cstate_in_guest(struct kvm *kvm)
return kvm->arch.cstate_in_guest;
 }
 
-void kvm_register_perf_callbacks(void);
-static inline void kvm_unregister_perf_callbacks(void)
-{
-   __perf_unregister_guest_info_callbacks();
-}
-
 static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu, bool is_nmi)
 {
WRITE_ONCE(vcpu->arch.handling_nmi_from_guest, is_nmi);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e4d712e9f760..0db9af0b628c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1163,6 +1163,18 @@ static inline bool kvm_arch_intc_initialized(struct kvm *kvm)
 }
 #endif
 
+#ifdef __KVM_WANT_PERF_CALLBACKS
+
+void kvm_set_intel_pt_intr_handler(void (*handler)(void));
+unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu);
+
+void kvm_register_perf_callbacks(void);
+static inline void kvm_unregister_perf_callbacks(void)
+{
+   __perf_unregister_guest_info_callbacks();
+}
+#endif
+
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type);
 void kvm_arch_destroy_vm(struct kvm *kvm);
 void kvm_arch_sync_events(struct kvm *kvm);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3e67c93ca403..13c4f58a75e5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c

[PATCH 15/15] perf: KVM: Indicate "in guest" via NULL ->is_in_guest callback

2021-08-26 Thread Sean Christopherson
Interpret a null ->is_in_guest callback as meaning "in guest" and use
the new semantics in KVM, which currently returns 'true' unconditionally
in its implementation of ->is_in_guest().  This avoids a retpoline on
the indirect call for PMIs that arrive in a KVM guest, and also provides
a handy excuse for a wrapper, perf_get_guest_cbs(), around retrieval of
perf_guest_cbs, e.g. to reduce the probability of an errant direct read
of perf_guest_cbs.

Signed-off-by: Sean Christopherson 
---
 arch/x86/events/core.c       | 16 ++++++++--------
 arch/x86/events/intel/core.c |  5 ++---
 include/linux/perf_event.h   | 17 +++++++++++++++++
 virt/kvm/kvm_main.c          |  9 ++-------
 4 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 34155a52e498..b60c339ae06b 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2761,11 +2761,11 @@ static bool perf_hw_regs(struct pt_regs *regs)
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
+   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct unwind_state state;
unsigned long addr;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
+   if (guest_cbs) {
/* TODO: We don't support guest os callchain now */
return;
}
@@ -2865,11 +2865,11 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
 void
 perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
+   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct stack_frame frame;
const struct stack_frame __user *fp;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
+   if (guest_cbs) {
/* TODO: We don't support guest os callchain now */
return;
}
@@ -2946,9 +2946,9 @@ static unsigned long code_segment_base(struct pt_regs *regs)
 
 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
+   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
 
-   if (guest_cbs && guest_cbs->is_in_guest())
+   if (guest_cbs)
return guest_cbs->get_guest_ip();
 
return regs->ip + code_segment_base(regs);
@@ -2956,10 +2956,10 @@ unsigned long perf_instruction_pointer(struct pt_regs *regs)
 
 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
+   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
int misc = 0;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
+   if (guest_cbs) {
if (guest_cbs->is_user_mode())
misc |= PERF_RECORD_MISC_GUEST_USER;
else
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 96001962c24d..9a8c18b51a96 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2853,9 +2853,8 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 */
	if (__test_and_clear_bit(GLOBAL_STATUS_TRACE_TOPAPMI_BIT, (unsigned long *)&status)) {
handled++;
-   guest_cbs = this_cpu_read(perf_guest_cbs);
-   if (unlikely(guest_cbs && guest_cbs->is_in_guest() && guest_cbs->handle_intel_pt_intr))
+   guest_cbs = perf_get_guest_cbs();
+   if (unlikely(guest_cbs && guest_cbs->handle_intel_pt_intr))
guest_cbs->handle_intel_pt_intr();
else
intel_pt_interrupt();
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index db701409a62f..6e3a10784d24 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1241,6 +1241,23 @@ DECLARE_PER_CPU(struct perf_guest_info_callbacks *, perf_guest_cbs);
 extern void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs);
 extern void perf_unregister_guest_info_callbacks(void);
 extern void perf_register_guest_info_callbacks_all_cpus(struct perf_guest_info_callbacks *cbs);
+/*
+ * Returns guest callbacks for the current CPU if callbacks are registered and
+ * the PMI fired while a guest was running, otherwise returns NULL.
+ */
+static inline struct perf_guest_info_callbacks *perf_get_guest_cbs(void)
+{
+   struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
+
+   /*
+* Implementing is_in_guest is optional if the callbacks are registered
+* only when "in guest".
+*/
+   if (guest_cbs && (!guest_cbs->is_in_guest || guest_cbs->is_in_guest()))
+

[PATCH 11/15] KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c

2021-08-26 Thread Sean Christopherson
Now that all state needed for VMX's PT interrupt handler is exposed to
vmx.c (specifically the currently running vCPU), move the handler into
vmx.c where it belongs.

Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h |  1 -
 arch/x86/kvm/vmx/vmx.c          | 24 +++++++++++++++++++++---
 arch/x86/kvm/x86.c              | 17 -----------------
 virt/kvm/kvm_main.c             |  1 +
 4 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 63553a1f43ee..daa33147650a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1496,7 +1496,6 @@ struct kvm_x86_init_ops {
int (*disabled_by_bios)(void);
int (*check_processor_compatibility)(void);
int (*hardware_setup)(void);
-   bool (*intel_pt_intr_in_guest)(void);
 
struct kvm_x86_ops *runtime_ops;
 };
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f08980ef7c44..4665a272249a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7535,6 +7535,8 @@ static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
 
 static void hardware_unsetup(void)
 {
+   kvm_set_intel_pt_intr_handler(NULL);
+
if (nested)
nested_vmx_hardware_unsetup();
 
@@ -7685,6 +7687,18 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
 };
 
+static void vmx_handle_intel_pt_intr(void)
+{
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+   if (WARN_ON_ONCE(!vcpu))
+   return;
+
+   kvm_make_request(KVM_REQ_PMI, vcpu);
+   __set_bit(MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT,
+   (unsigned long *)&vcpu->arch.pmu.global_status);
+}
+
 static __init void vmx_setup_user_return_msrs(void)
 {
 
@@ -7886,9 +7900,14 @@ static __init int hardware_setup(void)
vmx_set_cpu_caps();
 
r = alloc_kvm_area();
-   if (r)
+   if (r) {
nested_vmx_hardware_unsetup();
-   return r;
+   return r;
+   }
+
+   if (pt_mode == PT_MODE_HOST_GUEST)
+   kvm_set_intel_pt_intr_handler(vmx_handle_intel_pt_intr);
+   return 0;
 }
 
 static struct kvm_x86_init_ops vmx_init_ops __initdata = {
@@ -7896,7 +7915,6 @@ static struct kvm_x86_init_ops vmx_init_ops __initdata = {
.disabled_by_bios = vmx_disabled_by_bios,
.check_processor_compatibility = vmx_check_processor_compat,
.hardware_setup = hardware_setup,
-   .intel_pt_intr_in_guest = vmx_pt_mode_is_host_guest,
 
	.runtime_ops = &vmx_x86_ops,
 };
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7cb0f04e24ee..11c7a02f839c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8264,18 +8264,6 @@ static void kvm_timer_init(void)
  kvmclock_cpu_online, kvmclock_cpu_down_prep);
 }
 
-static void kvm_handle_intel_pt_intr(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-   if (WARN_ON_ONCE(!vcpu))
-   return;
-
-   kvm_make_request(KVM_REQ_PMI, vcpu);
-   __set_bit(MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT,
-   (unsigned long *)&vcpu->arch.pmu.global_status);
-}
-
 #ifdef CONFIG_X86_64
 static void pvclock_gtod_update_fn(struct work_struct *work)
 {
@@ -11029,9 +11017,6 @@ int kvm_arch_hardware_setup(void *opaque)
memcpy(_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
kvm_ops_static_call_update();
 
-   if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest())
-   kvm_set_intel_pt_intr_handler(kvm_handle_intel_pt_intr);
-
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
supported_xss = 0;
 
@@ -11059,8 +11044,6 @@ int kvm_arch_hardware_setup(void *opaque)
 
 void kvm_arch_hardware_unsetup(void)
 {
-   kvm_set_intel_pt_intr_handler(NULL);
-
static_call(kvm_x86_hardware_unsetup)();
 }
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 13c4f58a75e5..e0b1c9386926 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5498,6 +5498,7 @@ void kvm_set_intel_pt_intr_handler(void (*handler)(void))
 {
kvm_guest_cbs.handle_intel_pt_intr = handler;
 }
+EXPORT_SYMBOL_GPL(kvm_set_intel_pt_intr_handler);
 
 void kvm_register_perf_callbacks(void)
 {
-- 
2.33.0.259.gc128427fd7-goog




[PATCH 08/15] KVM: x86: Drop current_vcpu in favor of kvm_running_vcpu

2021-08-26 Thread Sean Christopherson
Now that KVM registers perf callbacks only when the CPU is "in guest",
use kvm_running_vcpu instead of current_vcpu to retrieve the associated
vCPU and drop current_vcpu.

Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/x86.c | 12 ++++--------
 arch/x86/kvm/x86.h |  4 +---
 2 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d4d91944fde7..e337aef60793 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8264,17 +8264,15 @@ static void kvm_timer_init(void)
  kvmclock_cpu_online, kvmclock_cpu_down_prep);
 }
 
-DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
-EXPORT_PER_CPU_SYMBOL_GPL(current_vcpu);
-
 static int kvm_is_in_guest(void)
 {
-   return __this_cpu_read(current_vcpu) != NULL;
+   /* x86's callbacks are registered only when handling a guest NMI. */
+   return true;
 }
 
 static int kvm_is_user_mode(void)
 {
-   struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
if (WARN_ON_ONCE(!vcpu))
return 0;
@@ -8284,7 +8282,7 @@ static int kvm_is_user_mode(void)
 
 static unsigned long kvm_get_guest_ip(void)
 {
-   struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
if (WARN_ON_ONCE(!vcpu))
return 0;
@@ -8294,7 +8292,7 @@ static unsigned long kvm_get_guest_ip(void)
 
 static void kvm_handle_intel_pt_intr(void)
 {
-   struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
if (WARN_ON_ONCE(!vcpu))
return;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 4c5ba4128b38..f13f15d2fab8 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -393,11 +393,8 @@ static inline void kvm_unregister_perf_callbacks(void)
__perf_unregister_guest_info_callbacks();
 }
 
-DECLARE_PER_CPU(struct kvm_vcpu *, current_vcpu);
-
 static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu, bool is_nmi)
 {
-   __this_cpu_write(current_vcpu, vcpu);
WRITE_ONCE(vcpu->arch.handling_nmi_from_guest, is_nmi);
 
kvm_register_perf_callbacks();
@@ -408,7 +405,6 @@ static inline void kvm_after_interrupt(struct kvm_vcpu *vcpu)
kvm_unregister_perf_callbacks();
 
WRITE_ONCE(vcpu->arch.handling_nmi_from_guest, false);
-   __this_cpu_write(current_vcpu, NULL);
 }
 
 
-- 
2.33.0.259.gc128427fd7-goog




[PATCH 09/15] KVM: arm64: Register/unregister perf callbacks at vcpu load/put

2021-08-26 Thread Sean Christopherson
Register/unregister perf callbacks at vcpu_load()/vcpu_put() instead of
keeping the callbacks registered for all eternity after loading KVM.
This will allow future cleanups and optimizations as the registration
of the callbacks signifies "in guest".  This will also allow moving the
callbacks into common KVM, as arm64 and x86 now have semantically
identical callback implementations.

Note, KVM could likely be more precise in its registration, but that's a
cleanup for the future.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h | 12 +++++++++++-
 arch/arm64/kvm/arm.c              |  5 ++++-
 arch/arm64/kvm/perf.c             | 36 ++++++++++++++++--------------------
 3 files changed, 31 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index ed940aec89e0..007c38d77fd9 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -671,7 +671,17 @@ int kvm_handle_mmio_return(struct kvm_vcpu *vcpu);
 int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa);
 
 void kvm_perf_init(void);
-void kvm_perf_teardown(void);
+
+#ifdef CONFIG_PERF_EVENTS
+void kvm_register_perf_callbacks(void);
+static inline void kvm_unregister_perf_callbacks(void)
+{
+   __perf_unregister_guest_info_callbacks();
+}
+#else
+static inline void kvm_register_perf_callbacks(void) {}
+static inline void kvm_unregister_perf_callbacks(void) {}
+#endif
 
 long kvm_hypercall_pv_features(struct kvm_vcpu *vcpu);
 gpa_t kvm_init_stolen_time(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e9a2b8f27792..ec386971030d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -429,10 +429,13 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (vcpu_has_ptrauth(vcpu))
vcpu_ptrauth_disable(vcpu);
kvm_arch_vcpu_load_debug_state_flags(vcpu);
+
+   kvm_register_perf_callbacks();
 }
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
+   kvm_unregister_perf_callbacks();
kvm_arch_vcpu_put_debug_state_flags(vcpu);
kvm_arch_vcpu_put_fp(vcpu);
if (has_vhe())
@@ -2155,7 +2158,7 @@ int kvm_arch_init(void *opaque)
 /* NOP: Compiling as a module not supported */
 void kvm_arch_exit(void)
 {
-   kvm_perf_teardown();
+
 }
 
 static int __init early_kvm_mode_cfg(char *arg)
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
index 039fe59399a2..2556b0a3b096 100644
--- a/arch/arm64/kvm/perf.c
+++ b/arch/arm64/kvm/perf.c
@@ -13,33 +13,30 @@
 
 DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
 
+#ifdef CONFIG_PERF_EVENTS
 static int kvm_is_in_guest(void)
 {
-return kvm_get_running_vcpu() != NULL;
+   return true;
 }
 
 static int kvm_is_user_mode(void)
 {
-   struct kvm_vcpu *vcpu;
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
-   vcpu = kvm_get_running_vcpu();
+   if (WARN_ON_ONCE(!vcpu))
+   return 0;
 
-   if (vcpu)
-   return !vcpu_mode_priv(vcpu);
-
-   return 0;
+   return !vcpu_mode_priv(vcpu);
 }
 
 static unsigned long kvm_get_guest_ip(void)
 {
-   struct kvm_vcpu *vcpu;
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
-   vcpu = kvm_get_running_vcpu();
+   if (WARN_ON_ONCE(!vcpu))
+   return 0;
 
-   if (vcpu)
-   return *vcpu_pc(vcpu);
-
-   return 0;
+   return *vcpu_pc(vcpu);
 }
 
 static struct perf_guest_info_callbacks kvm_guest_cbs = {
@@ -48,15 +45,14 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
.get_guest_ip   = kvm_get_guest_ip,
 };
 
+void kvm_register_perf_callbacks(void)
+{
+   __perf_register_guest_info_callbacks(&kvm_guest_cbs);
+}
+#endif /* CONFIG_PERF_EVENTS*/
+
 void kvm_perf_init(void)
 {
if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
		static_branch_enable(&kvm_arm_pmu_available);
-
-   perf_register_guest_info_callbacks(&kvm_guest_cbs);
-}
-
-void kvm_perf_teardown(void)
-{
-   perf_unregister_guest_info_callbacks();
 }
-- 
2.33.0.259.gc128427fd7-goog




[PATCH 07/15] KVM: Use dedicated flag to track if KVM is handling an NMI from guest

2021-08-26 Thread Sean Christopherson
Add a dedicated flag to detect the case where KVM's PMC overflow
callback was originally invoked in response to an NMI that arrived while
the guest was running.  Using current_vcpu is less precise as IRQs also
set current_vcpu (though presumably KVM's callback should not be reached
in that case), and more importantly, this will allow dropping
current_vcpu as the perf callbacks can switch to kvm_running_vcpu now
that the perf callbacks are precisely registered, i.e. kvm_running_vcpu
doesn't need to be used to detect if a PMI arrived in the guest.

Fixes: dd60d217062f ("KVM: x86: Fix perf timer mode IP reporting")
Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h | 3 +--
 arch/x86/kvm/pmu.c  | 2 +-
 arch/x86/kvm/svm/svm.c  | 2 +-
 arch/x86/kvm/vmx/vmx.c  | 2 +-
 arch/x86/kvm/x86.c  | 4 ++--
 arch/x86/kvm/x86.h  | 4 +++-
 6 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1ea4943a73d7..465b35736d9b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -763,6 +763,7 @@ struct kvm_vcpu_arch {
unsigned nmi_pending; /* NMI queued after currently running handler */
bool nmi_injected;/* Trying to inject an NMI this entry */
bool smi_pending;/* SMI queued after currently running handler */
+   bool handling_nmi_from_guest;
 
struct kvm_mtrr mtrr_state;
u64 pat;
@@ -1874,8 +1875,6 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu);
 int kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err);
 void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu);
 
-int kvm_is_in_guest(void);
-
 void __user *__x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 u32 size);
 bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 0772bad9165c..2b8934b452ea 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -87,7 +87,7 @@ static void kvm_perf_overflow_intr(struct perf_event *perf_event,
 * woken up. So we should wake it, but this is impossible from
 * NMI context. Do it from irq work instead.
 */
-   if (!kvm_is_in_guest())
+   if (!pmc->vcpu->arch.handling_nmi_from_guest)
		irq_work_queue(&pmc_to_pmu(pmc)->irq_work);
else
kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 1a70e11f0487..3fc6767e5fd8 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3843,7 +3843,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
}
 
if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
-   kvm_before_interrupt(vcpu);
+   kvm_before_interrupt(vcpu, true);
 
kvm_load_host_xsave_state(vcpu);
stgi();
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f19d72136f77..f08980ef7c44 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6344,7 +6344,7 @@ void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
 static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
unsigned long entry)
 {
-   kvm_before_interrupt(vcpu);
+   kvm_before_interrupt(vcpu, entry == (unsigned long)asm_exc_nmi_noist);
vmx_do_interrupt_nmi_irqoff(entry);
kvm_after_interrupt(vcpu);
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bc4ee6ea7752..d4d91944fde7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8267,7 +8267,7 @@ static void kvm_timer_init(void)
 DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
 EXPORT_PER_CPU_SYMBOL_GPL(current_vcpu);
 
-int kvm_is_in_guest(void)
+static int kvm_is_in_guest(void)
 {
return __this_cpu_read(current_vcpu) != NULL;
 }
@@ -9678,7 +9678,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 * interrupts on processors that implement an interrupt shadow, the
 * stat.exits increment will do nicely.
 */
-   kvm_before_interrupt(vcpu);
+   kvm_before_interrupt(vcpu, false);
local_irq_enable();
++vcpu->stat.exits;
local_irq_disable();
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 5cedc0e8a5d5..4c5ba4128b38 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -395,9 +395,10 @@ static inline void kvm_unregister_perf_callbacks(void)
 
 DECLARE_PER_CPU(struct kvm_vcpu *, current_vcpu);
 
-static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu)
+static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu, bool is_nmi)
 {
__this_cpu_write(current_vcpu, vcpu);
+   WRITE_ONCE(vcpu->arch.handling_nmi_from_guest, is_nmi);
 
kvm_register_perf_callbacks();
 }
@@ -406,6 +407,7 @@ static inline void 

[PATCH 01/15] KVM: x86: Register perf callbacks after calling vendor's hardware_setup()

2021-08-26 Thread Sean Christopherson
Wait to register perf callbacks until after doing vendor hardware setup.
VMX's hardware_setup() configures Intel Processor Trace (PT) mode, and a
future fix to register the Intel PT guest interrupt hook if and only if
Intel PT is exposed to the guest will consume the configured PT mode.

Delaying registration to hardware setup is effectively a nop as KVM's perf
hooks all pivot on the per-CPU current_vcpu, which is non-NULL only when
KVM is handling an IRQ/NMI in a VM-Exit path.  I.e. current_vcpu will be
NULL throughout both kvm_arch_init() and kvm_arch_hardware_setup().

Cc: Alexander Shishkin 
Cc: Artem Kashkanov 
Cc: sta...@vger.kernel.org
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/x86.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 86539c1686fa..fb6015f97f9e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8426,8 +8426,6 @@ int kvm_arch_init(void *opaque)
 
kvm_timer_init();
 
-   perf_register_guest_info_callbacks(&kvm_guest_cbs);
-
if (boot_cpu_has(X86_FEATURE_XSAVE)) {
host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0;
@@ -8461,7 +8459,6 @@ void kvm_arch_exit(void)
clear_hv_tscchange_cb();
 #endif
kvm_lapic_exit();
-   perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
 
if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
cpufreq_unregister_notifier(_cpufreq_notifier_block,
@@ -11064,6 +11061,8 @@ int kvm_arch_hardware_setup(void *opaque)
memcpy(_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
kvm_ops_static_call_update();
 
+   perf_register_guest_info_callbacks(&kvm_guest_cbs);
+
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
supported_xss = 0;
 
@@ -11091,6 +11090,8 @@ int kvm_arch_hardware_setup(void *opaque)
 
 void kvm_arch_hardware_unsetup(void)
 {
+   perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+
static_call(kvm_x86_hardware_unsetup)();
 }
 
-- 
2.33.0.259.gc128427fd7-goog




Re: [PATCH V10 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks

2021-08-26 Thread Sean Christopherson
TL;DR: Please don't merge this patch, it's broken and is also built on a shoddy
   foundation that I would like to fix.

On Fri, Aug 06, 2021, Zhu Lingshan wrote:
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 464917096e73..e466fc8176e1 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6489,9 +6489,18 @@ static void perf_pending_event(struct irq_work *entry)
>   */
>  struct perf_guest_info_callbacks *perf_guest_cbs;
>  
> +/* explicitly use __weak to fix duplicate symbol error */
> +void __weak arch_perf_update_guest_cbs(void)
> +{
> +}
> +
>  int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
>  {
> + if (WARN_ON_ONCE(perf_guest_cbs))
> + return -EBUSY;
> +
>   perf_guest_cbs = cbs;
> + arch_perf_update_guest_cbs();

This is horribly broken: it fails to clean up the static calls when KVM
unregisters the callbacks, which happens when the vendor module, e.g.
kvm_intel, is unloaded.  The explosion doesn't happen until 'kvm' is
unloaded because the functions are implemented in 'kvm', i.e. the
use-after-free is deferred a bit.

  BUG: unable to handle page fault for address: a011bb90
  #PF: supervisor instruction fetch in kernel mode
  #PF: error_code(0x0010) - not-present page
  PGD 6211067 P4D 6211067 PUD 6212063 PMD 102b99067 PTE 0
  Oops: 0010 [#1] PREEMPT SMP
  CPU: 0 PID: 1047 Comm: rmmod Not tainted 5.14.0-rc2+ #460
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:0xa011bb90
  Code: Unable to access opcode bytes at RIP 0xa011bb66.
  Call Trace:
   
   ? perf_misc_flags+0xe/0x50
   ? perf_prepare_sample+0x53/0x6b0
   ? perf_event_output_forward+0x67/0x160
   ? kvm_clock_read+0x14/0x30
   ? kvm_sched_clock_read+0x5/0x10
   ? sched_clock_cpu+0xd/0xd0
   ? __perf_event_overflow+0x52/0xf0
   ? handle_pmi_common+0x1f2/0x2d0
   ? __flush_tlb_all+0x30/0x30
   ? intel_pmu_handle_irq+0xcf/0x410
   ? nmi_handle+0x5/0x260
   ? perf_event_nmi_handler+0x28/0x50
   ? nmi_handle+0xc7/0x260
   ? lock_release+0x2b0/0x2b0
   ? default_do_nmi+0x6b/0x170
   ? exc_nmi+0x103/0x130
   ? end_repeat_nmi+0x16/0x1f
   ? lock_release+0x2b0/0x2b0
   ? lock_release+0x2b0/0x2b0
   ? lock_release+0x2b0/0x2b0
   
  Modules linked in: irqbypass [last unloaded: kvm]

Even more fun, the existing perf_guest_cbs framework is also broken, though
it's much harder to get it to fail, and probably impossible to get it to
fail without some help.  The issue is that perf_guest_cbs is global, which
means that it can be nullified by KVM (during module unload) while the
callbacks are being accessed by a PMI handler on a different CPU.

The bug has escaped notice because all dereferences of perf_guest_cbs follow
the same "perf_guest_cbs && perf_guest_cbs->is_in_guest()" pattern, and
AFAICT the compiler never reloads perf_guest_cbs in this sequence.  The
compiler does reload perf_guest_cbs for any future dereferences, but the
->is_in_guest() guard all but guarantees the PMI handler will win the race,
e.g. to nullify perf_guest_cbs, KVM has to completely exit the guest and
tear down all VMs before it can be unloaded.
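
Schematically, the race described above (an illustration, not kernel code):

  CPU0 (PMI handler)                       CPU1 (unloading kvm)
  ------------------                       --------------------
  if (perf_guest_cbs &&
      perf_guest_cbs->is_in_guest())
                                           perf_guest_cbs = NULL;
          perf_guest_cbs->is_user_mode();  /* reloads NULL, faults */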

But with a bit of help, e.g. READ_ONCE(perf_guest_cbs), unloading kvm_intel
can trigger a NULL pointer dereference, e.g. this tweak

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 1eb45139fcc6..202e5ad97f82 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2954,7 +2954,7 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
 {
int misc = 0;

-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+   if (READ_ONCE(perf_guest_cbs) && READ_ONCE(perf_guest_cbs)->is_in_guest()) {
if (perf_guest_cbs->is_user_mode())
misc |= PERF_RECORD_MISC_GUEST_USER;
else


while spamming module load/unload leads to:

  BUG: kernel NULL pointer dereference, address: 
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x) - not-present page
  PGD 0 P4D 0
  Oops:  [#1] PREEMPT SMP
  CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:perf_misc_flags+0x1c/0x70
  Call Trace:
   perf_prepare_sample+0x53/0x6b0
   perf_event_output_forward+0x67/0x160
   __perf_event_overflow+0x52/0xf0
   handle_pmi_common+0x207/0x300
   intel_pmu_handle_irq+0xcf/0x410
   perf_event_nmi_handler+0x28/0x50
   nmi_handle+0xc7/0x260
   default_do_nmi+0x6b/0x170
   exc_nmi+0x103/0x130
   asm_exc_nmi+0x76/0xbf


The good news is that I have a series that should fix both the existing NULL 
pointer
bug and mostly obviate the need for static calls.  The bad news is that my 
approach,
making perf_guest_cbs per-CPU, likely complicates turning these into static 
calls,
though I'm guessing it's still a solvable problem.

Tangentially related, IMO we should make architectures opt-in to getting

[PATCH 04/15] perf: Force architectures to opt-in to guest callbacks

2021-08-26 Thread Sean Christopherson
Introduce HAVE_GUEST_PERF_EVENTS and require architectures to select it
to allow registering guest callbacks in perf.  Future patches will convert
the callbacks to per-CPU definitions.  Rather than churn a bunch of arch
code (that was presumably copy+pasted from x86), remove it wholesale as
it's useless and at best wastes cycles.

Wrap even the stubs with an #ifdef to avoid an arch sneaking in a bogus
registration with CONFIG_PERF_EVENTS=n.

Signed-off-by: Sean Christopherson 
---
 arch/arm/kernel/perf_callchain.c   | 28 
 arch/arm64/Kconfig |  1 +
 arch/csky/kernel/perf_callchain.c  | 10 --
 arch/nds32/kernel/perf_event_cpu.c | 29 -
 arch/riscv/kernel/perf_callchain.c | 10 --
 arch/x86/Kconfig   |  1 +
 include/linux/perf_event.h |  4 
 init/Kconfig   |  3 +++
 kernel/events/core.c   |  2 ++
 9 files changed, 19 insertions(+), 69 deletions(-)

diff --git a/arch/arm/kernel/perf_callchain.c b/arch/arm/kernel/perf_callchain.c
index 3b69a76d341e..bc6b246ab55e 100644
--- a/arch/arm/kernel/perf_callchain.c
+++ b/arch/arm/kernel/perf_callchain.c
@@ -64,11 +64,6 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
 {
struct frame_tail __user *tail;
 
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
-   /* We don't support guest os callchain now */
-   return;
-   }
-
perf_callchain_store(entry, regs->ARM_pc);
 
if (!current->mm)
@@ -100,20 +95,12 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
 {
struct stackframe fr;
 
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
-   /* We don't support guest os callchain now */
-   return;
-   }
-
	arm_get_current_stackframe(regs, &fr);
	walk_stackframe(&fr, callchain_trace, entry);
 }
 
 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
-   return perf_guest_cbs->get_guest_ip();
-
return instruction_pointer(regs);
 }
 
@@ -121,17 +108,10 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
 {
int misc = 0;
 
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
-   if (perf_guest_cbs->is_user_mode())
-   misc |= PERF_RECORD_MISC_GUEST_USER;
-   else
-   misc |= PERF_RECORD_MISC_GUEST_KERNEL;
-   } else {
-   if (user_mode(regs))
-   misc |= PERF_RECORD_MISC_USER;
-   else
-   misc |= PERF_RECORD_MISC_KERNEL;
-   }
+   if (user_mode(regs))
+   misc |= PERF_RECORD_MISC_USER;
+   else
+   misc |= PERF_RECORD_MISC_KERNEL;
 
return misc;
 }
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index b5b13a932561..72a201a686c5 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -190,6 +190,7 @@ config ARM64
select HAVE_NMI
select HAVE_PATA_PLATFORM
select HAVE_PERF_EVENTS
+   select HAVE_GUEST_PERF_EVENTS
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_REGS_AND_STACK_ACCESS_API
diff --git a/arch/csky/kernel/perf_callchain.c b/arch/csky/kernel/perf_callchain.c
index ab55e98ee8f6..92057de08f4f 100644
--- a/arch/csky/kernel/perf_callchain.c
+++ b/arch/csky/kernel/perf_callchain.c
@@ -88,10 +88,6 @@ void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 {
unsigned long fp = 0;
 
-   /* C-SKY does not support virtualization. */
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
-   return;
-
fp = regs->regs[4];
perf_callchain_store(entry, regs->pc);
 
@@ -112,12 +108,6 @@ void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
 {
struct stackframe fr;
 
-   /* C-SKY does not support virtualization. */
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
-   pr_warn("C-SKY does not support perf in guest mode!");
-   return;
-   }
-
fr.fp = regs->regs[4];
fr.lr = regs->lr;
	walk_stackframe(&fr, entry);
diff --git a/arch/nds32/kernel/perf_event_cpu.c b/arch/nds32/kernel/perf_event_cpu.c
index 0ce6f9f307e6..a78a879e7ef1 100644
--- a/arch/nds32/kernel/perf_event_cpu.c
+++ b/arch/nds32/kernel/perf_event_cpu.c
@@ -1371,11 +1371,6 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 
leaf_fp = 0;
 
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
-   /* We don't support guest os callchain now */
-   return;
-   }
-
perf_callchain_store(entry, regs->ipc);
fp = regs->fp;
gp = regs->gp;
@@ -1481,10 +1476,6 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
 

[PATCH 06/15] KVM: x86: Register perf callbacks only when actively handling interrupt

2021-08-26 Thread Sean Christopherson
Register KVM's perf callback only when handling an interrupt that may be
a PMI (sadly this includes IRQs), and unregister the callback immediately
after handling the interrupt (or closing the window).  Registering the
callback on a per-CPU basis (with preemption disabled!), fixes a mostly
theoretical bug where perf could dereference a NULL pointer due to KVM
unloading and unregistering the callbacks in between perf queries of the
callback functions.  The precise registration will also allow for future
cleanups and optimizations, e.g. the existence of the callbacks can serve
as the "in guest" check.

Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/x86.c | 27 +--
 arch/x86/kvm/x86.h | 10 ++
 include/linux/perf_event.h |  2 ++
 kernel/events/core.c   | 12 
 4 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bae951344e28..bc4ee6ea7752 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8274,28 +8274,31 @@ int kvm_is_in_guest(void)
 
 static int kvm_is_user_mode(void)
 {
-   int user_mode = 3;
+   struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
 
-   if (__this_cpu_read(current_vcpu))
-   user_mode = static_call(kvm_x86_get_cpl)(__this_cpu_read(current_vcpu));
+   if (WARN_ON_ONCE(!vcpu))
+   return 0;
 
-   return user_mode != 0;
+   return static_call(kvm_x86_get_cpl)(vcpu) != 0;
 }
 
 static unsigned long kvm_get_guest_ip(void)
 {
-   unsigned long ip = 0;
+   struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
 
-   if (__this_cpu_read(current_vcpu))
-   ip = kvm_rip_read(__this_cpu_read(current_vcpu));
+   if (WARN_ON_ONCE(!vcpu))
+   return 0;
 
-   return ip;
+   return kvm_rip_read(vcpu);
 }
 
 static void kvm_handle_intel_pt_intr(void)
 {
struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
 
+   if (WARN_ON_ONCE(!vcpu))
+   return;
+
kvm_make_request(KVM_REQ_PMI, vcpu);
__set_bit(MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT,
(unsigned long *)&vcpu->arch.pmu.global_status);
@@ -8308,6 +8311,12 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
.handle_intel_pt_intr   = NULL,
 };
 
+void kvm_register_perf_callbacks(void)
+{
+   __perf_register_guest_info_callbacks(&kvm_guest_cbs);
+}
+EXPORT_SYMBOL_GPL(kvm_register_perf_callbacks);
+
 #ifdef CONFIG_X86_64
 static void pvclock_gtod_update_fn(struct work_struct *work)
 {
@@ -11063,7 +11072,6 @@ int kvm_arch_hardware_setup(void *opaque)
 
if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest())
kvm_guest_cbs.handle_intel_pt_intr = kvm_handle_intel_pt_intr;
-   perf_register_guest_info_callbacks(_guest_cbs);
 
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
supported_xss = 0;
@@ -11092,7 +11100,6 @@ int kvm_arch_hardware_setup(void *opaque)
 
 void kvm_arch_hardware_unsetup(void)
 {
-   perf_unregister_guest_info_callbacks();
kvm_guest_cbs.handle_intel_pt_intr = NULL;
 
static_call(kvm_x86_hardware_unsetup)();
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 7d66d63dc55a..5cedc0e8a5d5 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -387,15 +387,25 @@ static inline bool kvm_cstate_in_guest(struct kvm *kvm)
return kvm->arch.cstate_in_guest;
 }
 
+void kvm_register_perf_callbacks(void);
+static inline void kvm_unregister_perf_callbacks(void)
+{
+   __perf_unregister_guest_info_callbacks();
+}
+
 DECLARE_PER_CPU(struct kvm_vcpu *, current_vcpu);
 
 static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu)
 {
__this_cpu_write(current_vcpu, vcpu);
+
+   kvm_register_perf_callbacks();
 }
 
 static inline void kvm_after_interrupt(struct kvm_vcpu *vcpu)
 {
+   kvm_unregister_perf_callbacks();
+
__this_cpu_write(current_vcpu, NULL);
 }
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c98253dae037..7a367bf1b78d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1238,6 +1238,8 @@ extern void perf_event_bpf_event(struct bpf_prog *prog,
 
 #ifdef CONFIG_HAVE_GUEST_PERF_EVENTS
 DECLARE_PER_CPU(struct perf_guest_info_callbacks *, perf_guest_cbs);
+extern void __perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs);
+extern void __perf_unregister_guest_info_callbacks(void);
 extern void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
 extern void perf_unregister_guest_info_callbacks(void);
 #endif /* CONFIG_HAVE_GUEST_PERF_EVENTS */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9bc1375d6ed9..2f28d9d8dc94 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6485,6 +6485,18 @@ static void perf_pending_event(struct irq_work *entry)
 #ifdef CONFIG_HAVE_GUEST_PERF_EVENTS
 

[PATCH 05/15] perf: Track guest callbacks on a per-CPU basis

2021-08-26 Thread Sean Christopherson
Use a per-CPU pointer to track perf's guest callbacks so that KVM can set
the callbacks more precisely and avoid a lurking NULL pointer dereference.
On x86, KVM supports being built as a module and thus can be unloaded.
And because the shared callbacks are referenced from IRQ/NMI context,
unloading KVM can run concurrently with perf, and thus all of perf's
checks for a NULL perf_guest_cbs are flawed as perf_guest_cbs could be
nullified between the check and dereference.

In practice, this has not been problematic because the callbacks are
always guarded with a "perf_guest_cbs && perf_guest_cbs->is_in_guest()"
pattern, and it's extremely unlikely the compiler will choose to reload
perf_guest_cbs in that particular sequence.  Because is_in_guest() is
obviously true only when KVM is running a guest, perf always wins the
race to the guarded code (which does often reload perf_guest_cbs) as KVM
has to stop running all guests and do a heavy teardown before unloading.

Cc: Zhu Lingshan 
Signed-off-by: Sean Christopherson 
---
 arch/arm64/kernel/perf_callchain.c | 18 --
 arch/x86/events/core.c | 17 +++--
 arch/x86/events/intel/core.c   |  8 +---
 include/linux/perf_event.h |  2 +-
 kernel/events/core.c   | 12 +---
 5 files changed, 38 insertions(+), 19 deletions(-)

diff --git a/arch/arm64/kernel/perf_callchain.c b/arch/arm64/kernel/perf_callchain.c
index 4a72c2727309..38555275c6a2 100644
--- a/arch/arm64/kernel/perf_callchain.c
+++ b/arch/arm64/kernel/perf_callchain.c
@@ -102,7 +102,9 @@ compat_user_backtrace(struct compat_frame_tail __user *tail,
 void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 struct pt_regs *regs)
 {
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+   struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
+
+   if (guest_cbs && guest_cbs->is_in_guest()) {
/* We don't support guest os callchain now */
return;
}
@@ -147,9 +149,10 @@ static bool callchain_trace(void *data, unsigned long pc)
 void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
   struct pt_regs *regs)
 {
+   struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
struct stackframe frame;
 
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+   if (guest_cbs && guest_cbs->is_in_guest()) {
/* We don't support guest os callchain now */
return;
}
@@ -160,18 +163,21 @@ void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
 
 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
-   return perf_guest_cbs->get_guest_ip();
+   struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
+
+   if (guest_cbs && guest_cbs->is_in_guest())
+   return guest_cbs->get_guest_ip();
 
return instruction_pointer(regs);
 }
 
 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
+   struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
int misc = 0;
 
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
-   if (perf_guest_cbs->is_user_mode())
+   if (guest_cbs && guest_cbs->is_in_guest()) {
+   if (guest_cbs->is_user_mode())
misc |= PERF_RECORD_MISC_GUEST_USER;
else
misc |= PERF_RECORD_MISC_GUEST_KERNEL;
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 1eb45139fcc6..34155a52e498 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2761,10 +2761,11 @@ static bool perf_hw_regs(struct pt_regs *regs)
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
 {
+   struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
struct unwind_state state;
unsigned long addr;
 
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+   if (guest_cbs && guest_cbs->is_in_guest()) {
/* TODO: We don't support guest os callchain now */
return;
}
@@ -2864,10 +2865,11 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
 void
 perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
 {
+   struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
struct stack_frame frame;
const struct stack_frame __user *fp;
 
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+   if (guest_cbs && guest_cbs->is_in_guest()) {
/* TODO: We don't support guest os callchain now */
return;
}
@@ -2944,18 +2946,21 @@ static unsigned long code_segment_base(struct pt_regs *regs)
 
 

[PATCH 00/15] perf: KVM: Fix, optimize, and clean up callbacks

2021-08-26 Thread Sean Christopherson
This started out as a small series[1] to fix a KVM bug related to Intel PT
interrupt handling and snowballed horribly.

The main problem being addressed is that the perf_guest_cbs are shared by
all CPUs, can be nullified by KVM during module unload, and are not
protected against concurrent access from NMI context.

The bug has escaped notice because all dereferences of perf_guest_cbs
follow the same "perf_guest_cbs && perf_guest_cbs->is_in_guest()" pattern,
and AFAICT the compiler never reloads perf_guest_cbs in this sequence.
The compiler does reload perf_guest_cbs for any future dereferences, but
the ->is_in_guest() guard all but guarantees the PMI handler will win the
race, e.g. to nullify perf_guest_cbs, KVM has to completely exit the guest
and tear down all VMs before it can be unloaded.

But with help, e.g. READ_ONCE(perf_guest_cbs), unloading kvm_intel can
trigger a NULL pointer dereference (see below).  Manual intervention aside,
the bug is a bit of a time bomb, e.g. my patch 3 from the original PT
handling series would have omitted the ->is_in_guest() guard.

This series fixes the problem by making the callbacks per-CPU, and
registering/unregistering the callbacks only with preemption disabled
(except for the Xen case, which doesn't unregister).
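
As a rough sketch of the per-CPU scheme (the names follow the patches in this
series, but this is an illustration, not a verbatim copy of the code):

    static DEFINE_PER_CPU(struct perf_guest_info_callbacks *, perf_guest_cbs);

    /* Callers must have preemption disabled, e.g. KVM's IRQ/NMI windows. */
    void __perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
    {
            __this_cpu_write(perf_guest_cbs, cbs);
    }

    void __perf_unregister_guest_info_callbacks(void)
    {
            __this_cpu_write(perf_guest_cbs, NULL);
    }

A PMI handler then does this_cpu_read(perf_guest_cbs), so it can never observe
a registration that is being torn down on a different CPU.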

This approach also allows for several nice cleanups in this series.
KVM x86 and arm64 can share callbacks, KVM x86 drops its somewhat
redundant current_vcpu, and the retpoline that is currently hit when KVM
is loaded (due to always checking ->is_in_guest()) goes away (it's still
there when running as Xen Dom0).

Changing to per-CPU callbacks also provides a good excuse to excise
copy+paste code from architectures that can't possibly have guest
callbacks.

This series conflicts horribly with a proposed patch[2] to use static
calls for perf_guest_cbs.  But that patch is broken as it completely
fails to handle unregister, and it's not clear to me whether or not
it can correctly handle unregister without fixing the underlying race
(I don't know enough about the code patching for static calls).

This tweak

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 1eb45139fcc6..202e5ad97f82 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2954,7 +2954,7 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
 {
int misc = 0;

-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+   if (READ_ONCE(perf_guest_cbs) && READ_ONCE(perf_guest_cbs)->is_in_guest()) {
if (perf_guest_cbs->is_user_mode())
misc |= PERF_RECORD_MISC_GUEST_USER;
else

while spamming module load/unload leads to:

  BUG: kernel NULL pointer dereference, address: 
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x) - not-present page
  PGD 0 P4D 0
  Oops:  [#1] PREEMPT SMP
  CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:perf_misc_flags+0x1c/0x70
  Call Trace:
   perf_prepare_sample+0x53/0x6b0
   perf_event_output_forward+0x67/0x160
   __perf_event_overflow+0x52/0xf0
   handle_pmi_common+0x207/0x300
   intel_pmu_handle_irq+0xcf/0x410
   perf_event_nmi_handler+0x28/0x50
   nmi_handle+0xc7/0x260
   default_do_nmi+0x6b/0x170
   exc_nmi+0x103/0x130
   asm_exc_nmi+0x76/0xbf

[1] https://lkml.kernel.org/r/20210823193709.55886-1-sea...@google.com
[2] https://lkml.kernel.org/r/20210806133802.3528-2-lingshan@intel.com

Sean Christopherson (15):
  KVM: x86: Register perf callbacks after calling vendor's
hardware_setup()
  KVM: x86: Register Processor Trace interrupt hook iff PT enabled in
guest
  perf: Stop pretending that perf can handle multiple guest callbacks
  perf: Force architectures to opt-in to guest callbacks
  perf: Track guest callbacks on a per-CPU basis
  KVM: x86: Register perf callbacks only when actively handling
interrupt
  KVM: Use dedicated flag to track if KVM is handling an NMI from guest
  KVM: x86: Drop current_vcpu in favor of kvm_running_vcpu
  KVM: arm64: Register/unregister perf callbacks at vcpu load/put
  KVM: Move x86's perf guest info callbacks to generic KVM
  KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c
  KVM: arm64: Convert to the generic perf callbacks
  KVM: arm64: Drop perf.c and fold its tiny bit of code into pmu.c
  perf: Disallow bulk unregistering of guest callbacks and do cleanup
  perf: KVM: Indicate "in guest" via NULL ->is_in_guest callback

 arch/arm/kernel/perf_callchain.c   | 28 ++
 arch/arm64/Kconfig |  1 +
 arch/arm64/include/asm/kvm_host.h  |  8 +++-
 arch/arm64/kernel/perf_callchain.c | 18 ++---
 arch/arm64/kvm/Makefile|  2 +-
 arch/arm64/kvm/arm.c   | 13 ++-
 arch/arm64/kvm/perf.c  | 62 --
 arch/arm64/kvm/pmu.c   |  8 
 arch/csky/kernel/perf_callchain.c  | 10 

[PATCH 02/15] KVM: x86: Register Processor Trace interrupt hook iff PT enabled in guest

2021-08-26 Thread Sean Christopherson
Override the Processor Trace (PT) interrupt handler for guest mode if and
only if PT is configured for host+guest mode, i.e. is being used
independently by both host and guest.  If PT is configured for system
mode, the host fully controls PT and must handle all events.

Fixes: 8479e04e7d6b ("KVM: x86: Inject PMI for KVM guest")
Cc: sta...@vger.kernel.org
Cc: Like Xu 
Reported-by: Alexander Shishkin 
Reported-by: Artem Kashkanov 
Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/pmu.h  | 1 +
 arch/x86/kvm/vmx/vmx.c  | 1 +
 arch/x86/kvm/x86.c  | 5 -
 4 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09b256db394a..1ea4943a73d7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1494,6 +1494,7 @@ struct kvm_x86_init_ops {
int (*disabled_by_bios)(void);
int (*check_processor_compatibility)(void);
int (*hardware_setup)(void);
+   bool (*intel_pt_intr_in_guest)(void);
 
struct kvm_x86_ops *runtime_ops;
 };
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 0e4f2b1fa9fb..b06dbbd7eeeb 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -41,6 +41,7 @@ struct kvm_pmu_ops {
void (*reset)(struct kvm_vcpu *vcpu);
void (*deliver_pmi)(struct kvm_vcpu *vcpu);
void (*cleanup)(struct kvm_vcpu *vcpu);
+   void (*handle_intel_pt_intr)(void);
 };
 
 static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fada1055f325..f19d72136f77 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7896,6 +7896,7 @@ static struct kvm_x86_init_ops vmx_init_ops __initdata = {
.disabled_by_bios = vmx_disabled_by_bios,
.check_processor_compatibility = vmx_check_processor_compat,
.hardware_setup = hardware_setup,
+   .intel_pt_intr_in_guest = vmx_pt_mode_is_host_guest,
 
	.runtime_ops = &vmx_x86_ops,
 };
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fb6015f97f9e..ffc6c2d73508 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8305,7 +8305,7 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
	.is_in_guest    = kvm_is_in_guest,
.is_user_mode   = kvm_is_user_mode,
.get_guest_ip   = kvm_get_guest_ip,
-   .handle_intel_pt_intr   = kvm_handle_intel_pt_intr,
+   .handle_intel_pt_intr   = NULL,
 };
 
 #ifdef CONFIG_X86_64
@@ -11061,6 +11061,8 @@ int kvm_arch_hardware_setup(void *opaque)
memcpy(_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
kvm_ops_static_call_update();
 
+   if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest())
+   kvm_guest_cbs.handle_intel_pt_intr = kvm_handle_intel_pt_intr;
 	perf_register_guest_info_callbacks(&kvm_guest_cbs);
 
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
@@ -11091,6 +11093,7 @@ int kvm_arch_hardware_setup(void *opaque)
 void kvm_arch_hardware_unsetup(void)
 {
 	perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+   kvm_guest_cbs.handle_intel_pt_intr = NULL;
 
static_call(kvm_x86_hardware_unsetup)();
 }
-- 
2.33.0.259.gc128427fd7-goog




[PATCH 03/15] perf: Stop pretending that perf can handle multiple guest callbacks

2021-08-26 Thread Sean Christopherson
Drop the 'int' return value from the perf (un)register callbacks helpers
and stop pretending perf can support multiple callbacks.  The 'int'
returns are not future proofing anything as none of the callers take
action on an error.  It's also not obvious that there will ever be
cotenant hypervisors, and if there are, that allowing multiple callbacks
to be registered is desirable or even correct.

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h |  4 ++--
 arch/arm64/kvm/perf.c |  8 
 arch/x86/kvm/x86.c|  2 +-
 include/linux/perf_event.h| 11 +--
 kernel/events/core.c  | 11 ++-
 5 files changed, 14 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 41911585ae0c..ed940aec89e0 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -670,8 +670,8 @@ unsigned long kvm_mmio_read_buf(const void *buf, unsigned int len);
 int kvm_handle_mmio_return(struct kvm_vcpu *vcpu);
 int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa);
 
-int kvm_perf_init(void);
-int kvm_perf_teardown(void);
+void kvm_perf_init(void);
+void kvm_perf_teardown(void);
 
 long kvm_hypercall_pv_features(struct kvm_vcpu *vcpu);
 gpa_t kvm_init_stolen_time(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
index 151c31fb9860..039fe59399a2 100644
--- a/arch/arm64/kvm/perf.c
+++ b/arch/arm64/kvm/perf.c
@@ -48,15 +48,15 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
.get_guest_ip   = kvm_get_guest_ip,
 };
 
-int kvm_perf_init(void)
+void kvm_perf_init(void)
 {
if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
static_branch_enable(_arm_pmu_available);
 
-   return perf_register_guest_info_callbacks(&kvm_guest_cbs);
+   perf_register_guest_info_callbacks(&kvm_guest_cbs);
 }
 
-int kvm_perf_teardown(void)
+void kvm_perf_teardown(void)
 {
-   return perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+   perf_unregister_guest_info_callbacks();
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ffc6c2d73508..bae951344e28 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11092,7 +11092,7 @@ int kvm_arch_hardware_setup(void *opaque)
 
 void kvm_arch_hardware_unsetup(void)
 {
-   perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+   perf_unregister_guest_info_callbacks();
kvm_guest_cbs.handle_intel_pt_intr = NULL;
 
static_call(kvm_x86_hardware_unsetup)();
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2d510ad750ed..05c0efba3cd1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1237,8 +1237,8 @@ extern void perf_event_bpf_event(struct bpf_prog *prog,
 u16 flags);
 
 extern struct perf_guest_info_callbacks *perf_guest_cbs;
-extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
-extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
+extern void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
+extern void perf_unregister_guest_info_callbacks(void);
 
 extern void perf_event_exec(void);
 extern void perf_event_comm(struct task_struct *tsk, bool exec);
@@ -1481,10 +1481,9 @@ perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)  { }
 static inline void
 perf_bp_event(struct perf_event *event, void *data){ }
 
-static inline int perf_register_guest_info_callbacks
-(struct perf_guest_info_callbacks *callbacks)  { return 0; }
-static inline int perf_unregister_guest_info_callbacks
-(struct perf_guest_info_callbacks *callbacks)  { return 0; }
+static inline void perf_register_guest_info_callbacks
+(struct perf_guest_info_callbacks *callbacks)  { }
+static inline void perf_unregister_guest_info_callbacks(void)  { }
 
 static inline void perf_event_mmap(struct vm_area_struct *vma) { }
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 464917096e73..baae796612b9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6482,24 +6482,17 @@ static void perf_pending_event(struct irq_work *entry)
perf_swevent_put_recursion_context(rctx);
 }
 
-/*
- * We assume there is only KVM supporting the callbacks.
- * Later on, we might change it to a list if there is
- * another virtualization implementation supporting the callbacks.
- */
 struct perf_guest_info_callbacks *perf_guest_cbs;
 
-int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
+void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
 {
perf_guest_cbs = cbs;
-   return 0;
 }
 

Re: [XEN RFC PATCH 22/40] xen/arm: introduce a helper to parse device tree processor node

2021-08-26 Thread Stefano Stabellini
On Wed, 11 Aug 2021, Wei Chen wrote:
> Processor NUMA ID information is stored in device tree's processor
> node as "numa-node-id". We need a new helper to parse this ID from
> the processor node. If we get this ID from a processor node, the ID's
> validity still needs to be checked. Once we get an invalid NUMA ID
> from any processor node, the device tree will be marked as having
> invalid NUMA information.
> 
> Signed-off-by: Wei Chen 
> ---
>  xen/arch/arm/numa_device_tree.c | 41 +++--
>  1 file changed, 39 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/arm/numa_device_tree.c b/xen/arch/arm/numa_device_tree.c
> index 1c74ad135d..37cc56acf3 100644
> --- a/xen/arch/arm/numa_device_tree.c
> +++ b/xen/arch/arm/numa_device_tree.c
> @@ -20,16 +20,53 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  s8 device_tree_numa = 0;
> +static nodemask_t processor_nodes_parsed __initdata;
>  
> -int srat_disabled(void)
> +static int srat_disabled(void)
>  {
>  return numa_off || device_tree_numa < 0;
>  }
>  
> -void __init bad_srat(void)
> +static __init void bad_srat(void)
>  {
>  printk(KERN_ERR "DT: NUMA information is not used.\n");
>  device_tree_numa = -1;
>  }
> +
> +/* Callback for device tree processor affinity */
> +static int __init dtb_numa_processor_affinity_init(nodeid_t node)
> +{
> +if ( srat_disabled() )
> +return -EINVAL;
> +else if ( node == NUMA_NO_NODE || node >= MAX_NUMNODES ) {
> + bad_srat();
> + return -EINVAL;
> + }
> +
> +node_set(node, processor_nodes_parsed);
> +
> +device_tree_numa = 1;
> +printk(KERN_INFO "DT: NUMA node %u processor parsed\n", node);
> +
> +return 0;
> +}
> +
> +/* Parse CPU NUMA node info */
> +int __init device_tree_parse_numa_cpu_node(const void *fdt, int node)
> +{
> +uint32_t nid;
> +
> +nid = device_tree_get_u32(fdt, node, "numa-node-id", MAX_NUMNODES);
> +printk(XENLOG_WARNING "CPU on NUMA node:%u\n", nid);

Given that this is not actually a warning (is it?) then I would move it
to XENLOG_INFO


> +if ( nid >= MAX_NUMNODES )
> +{
> +printk(XENLOG_WARNING "Node id %u exceeds maximum value\n", nid);

This could be XENLOG_ERR


> +return -EINVAL;
> +}
> +
> +return dtb_numa_processor_affinity_init(nid);
> +}
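
For context, the "numa-node-id" property parsed above comes from the standard
NUMA device tree binding; a processor node carrying it looks roughly like this
(illustrative fragment, not taken from this series):

    cpu@0 {
        device_type = "cpu";
        compatible = "arm,cortex-a53";
        reg = <0x0 0x0>;
        numa-node-id = <0>;
    };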




Re: [XEN RFC PATCH 20/40] xen/arm: implement node distance helpers for Arm64

2021-08-26 Thread Stefano Stabellini
On Wed, 11 Aug 2021, Wei Chen wrote:
> In current Xen code, __node_distance is a fake API, it always
> returns NUMA_REMOTE_DISTANCE(20). Now we use a matrix to record
> the distance between any two nodes. Accordingly, we provide a
> set_node_distance API to set the distance for any two nodes in
> this patch.
> 
> Signed-off-by: Wei Chen 
> ---
>  xen/arch/arm/numa.c| 44 ++
>  xen/include/asm-arm/numa.h | 12 ++-
>  xen/include/asm-x86/numa.h |  1 -
>  xen/include/xen/numa.h |  2 +-
>  4 files changed, 56 insertions(+), 3 deletions(-)
> 
> diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
> index 566ad1e52b..f61a8df645 100644
> --- a/xen/arch/arm/numa.c
> +++ b/xen/arch/arm/numa.c
> @@ -23,6 +23,11 @@
>  #include 
>  #include 
>  
> +static uint8_t __read_mostly
> +node_distance_map[MAX_NUMNODES][MAX_NUMNODES] = {
> +{ NUMA_REMOTE_DISTANCE }
> +};
> +
>  void numa_set_node(int cpu, nodeid_t nid)
>  {
>  if ( nid >= MAX_NUMNODES ||
> @@ -32,6 +37,45 @@ void numa_set_node(int cpu, nodeid_t nid)
>  cpu_to_node[cpu] = nid;
>  }
>  
> +void __init numa_set_distance(nodeid_t from, nodeid_t to, uint32_t distance)
> +{
> +if ( from >= MAX_NUMNODES || to >= MAX_NUMNODES )
> +{
> +printk(KERN_WARNING
> +"NUMA nodes are out of matrix, from=%u to=%u distance=%u\n",
> +from, to, distance);

NIT: please align. Example:

printk(KERN_WARNING
   "NUMA nodes are out of matrix, from=%u to=%u distance=%u\n",

Also please use PRIu32 for uint32_t. Probably should use PRIu8 for
nodeids.


> +return;
> +}
> +
> +/* NUMA defines 0xff as an unreachable node and 0-9 are undefined */
> +if ( distance >= NUMA_NO_DISTANCE ||
> +(distance >= NUMA_DISTANCE_UDF_MIN &&
> + distance <= NUMA_DISTANCE_UDF_MAX) ||
> +(from == to && distance != NUMA_LOCAL_DISTANCE) )
> +{
> +printk(KERN_WARNING
> +"Invalid NUMA node distance, from:%d to:%d distance=%d\n",
> +from, to, distance);

NIT: please align

Also you used %u before for nodeids, which is better because from and to
are unsigned. Distance should be uint32_t.


> +return;
> +}
> +
> +node_distance_map[from][to] = distance;

Shouldn't we also be setting:

node_distance_map[to][from] = distance;

?
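
i.e. something along these lines (sketch only, assuming the distances are
meant to be symmetric):

    node_distance_map[from][to] = distance;
    if ( from != to )
        node_distance_map[to][from] = distance;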


> +}
> +
> +uint8_t __node_distance(nodeid_t from, nodeid_t to)
> +{
> +/*
> + * Check whether the nodes are in the matrix range.
> + * When any node is out of range, except from and to nodes are the
> + * same, we treat them as unreachable (return 0xFF)
> + */
> +if ( from >= MAX_NUMNODES || to >= MAX_NUMNODES )
> +return from == to ? NUMA_LOCAL_DISTANCE : NUMA_NO_DISTANCE;
> +
> +return node_distance_map[from][to];
> +}
> +EXPORT_SYMBOL(__node_distance);
> +
>  void __init numa_init(bool acpi_off)
>  {
>  uint32_t idx;
> diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> index bb495a24e1..559b028a01 100644
> --- a/xen/include/asm-arm/numa.h
> +++ b/xen/include/asm-arm/numa.h
> @@ -12,8 +12,19 @@ typedef u8 nodeid_t;
>   * set the number of NUMA memory block number to 128.
>   */
>  #define NODES_SHIFT  6
> +/*
> + * In ACPI spec, 0-9 are the reserved values for node distance,
> + * 10 indicates local node distance, 20 indicates remote node
> + * distance. Set node distance map in device tree will follow
> + * the ACPI's definition.
> + */
> +#define NUMA_DISTANCE_UDF_MIN   0
> +#define NUMA_DISTANCE_UDF_MAX   9
> +#define NUMA_LOCAL_DISTANCE 10
> +#define NUMA_REMOTE_DISTANCE    20
>  
>  extern void numa_init(bool acpi_off);
> +extern void numa_set_distance(nodeid_t from, nodeid_t to, uint32_t distance);
>  
>  /*
>   * Temporary for fake NUMA node, when CPU, memory and distance
> @@ -21,7 +32,6 @@ extern void numa_init(bool acpi_off);
>   * symbols will be removed.
>   */
>  extern mfn_t first_valid_mfn;
> -#define __node_distance(a, b) (20)
>  
>  #else
>  
> diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
> index 5a57a51e26..e0253c20b7 100644
> --- a/xen/include/asm-x86/numa.h
> +++ b/xen/include/asm-x86/numa.h
> @@ -21,7 +21,6 @@ extern nodeid_t apicid_to_node[];
>  extern void init_cpu_to_node(void);
>  
>  void srat_parse_regions(u64 addr);
> -extern u8 __node_distance(nodeid_t a, nodeid_t b);
>  unsigned int arch_get_dma_bitsize(void);
>  
>  #endif
> diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
> index cb08d2eca9..0475823b13 100644
> --- a/xen/include/xen/numa.h
> +++ b/xen/include/xen/numa.h
> @@ -58,7 +58,7 @@ static inline __attribute__((pure)) nodeid_t 
> phys_to_nid(paddr_t addr)
>  #define node_spanned_pages(nid)  (NODE_DATA(nid)->node_spanned_pages)
>  #define node_end_pfn(nid)   (NODE_DATA(nid)->node_start_pfn + \
>NODE_DATA(nid)->node_spanned_pages)
> -
> +extern u8 __node_distance(nodeid_t a, nodeid_t b);
>  extern void 

Re: HVM guest only bring up a single vCPU

2021-08-26 Thread Andrew Cooper
On 26/08/2021 22:00, Julien Grall wrote:
> Hi Andrew,
>
> While doing more testing today, I noticed that only one vCPU would be
> brought up with HVM guest with Xen 4.16 on my setup (QEMU):
>
> [    1.122180]
> 
> [    1.122180] UBSAN: shift-out-of-bounds in
> oss/linux/arch/x86/kernel/apic/apic.c:2362:13
> [    1.122180] shift exponent -1 is negative
> [    1.122180] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-rc7+ #304
> [    1.122180] Hardware name: Xen HVM domU, BIOS 4.16-unstable 06/07/2021
> [    1.122180] Call Trace:
> [    1.122180]  dump_stack_lvl+0x56/0x6c
> [    1.122180]  ubsan_epilogue+0x5/0x50
> [    1.122180]  __ubsan_handle_shift_out_of_bounds+0xfa/0x140
> [    1.122180]  ? cgroup_kill_write+0x4d/0x150
> [    1.122180]  ? cpu_up+0x6e/0x100
> [    1.122180]  ? _raw_spin_unlock_irqrestore+0x30/0x50
> [    1.122180]  ? rcu_read_lock_held_common+0xe/0x40
> [    1.122180]  ? irq_shutdown_and_deactivate+0x11/0x30
> [    1.122180]  ? lock_release+0xc7/0x2a0
> [    1.122180]  ? apic_id_is_primary_thread+0x56/0x60
> [    1.122180]  apic_id_is_primary_thread+0x56/0x60
> [    1.122180]  cpu_up+0xbd/0x100
> [    1.122180]  bringup_nonboot_cpus+0x4f/0x60
> [    1.122180]  smp_init+0x26/0x74
> [    1.122180]  kernel_init_freeable+0x183/0x32d
> [    1.122180]  ? _raw_spin_unlock_irq+0x24/0x40
> [    1.122180]  ? rest_init+0x330/0x330
> [    1.122180]  kernel_init+0x17/0x140
> [    1.122180]  ? rest_init+0x330/0x330
> [    1.122180]  ret_from_fork+0x22/0x30
> [    1.122244]
> 
> [    1.123176] installing Xen timer for CPU 1
> [    1.123369] x86: Booting SMP configuration:
> [    1.123409]  node  #0, CPUs:  #1
> [    1.154400] Callback from call_rcu_tasks_trace() invoked.
> [    1.154491] smp: Brought up 1 node, 1 CPU
> [    1.154526] smpboot: Max logical packages: 2
> [    1.154570] smpboot: Total of 1 processors activated (5999.99
> BogoMIPS)
>
> I have tried a PV guest (same setup) and the kernel could bring up all
> the vCPUs.
>
> Digging down, Linux will set smp_num_siblings to 0 (via
> detect_ht_early()) and as a result will skip all the CPUs. The value
> is retrieved from a CPUID leaf. So it sounds like we don't set the
> leaf correctly.
>
> FWIW, I have also tried on Xen 4.11 and could spot the same issue.
> Does this ring any bell to you?

The CPUID data we give to guests is generally nonsense when it comes to
topology.  By any chance does the hardware you're booting this on not
have hyperthreading enabled/active to begin with?

Fixing this is on the todo list, but it needs libxl to start using
policy objects (series for the next phase of this still pending on
xen-devel).  Exactly how you represent the topology to the guest
correctly depends on the vendor and rough generation - I believe there
are 5 different algorithms to use, and for AMD in particular, it even
depends on how many IO-APICs are visible in the guest.

~Andrew




Re: [XEN RFC PATCH 18/40] xen/arm: Keep memory nodes in dtb for NUMA when boot from EFI

2021-08-26 Thread Stefano Stabellini
On Wed, 11 Aug 2021, Wei Chen wrote:
> EFI can get memory map from EFI system table. But EFI system
> table doesn't contain memory NUMA information, EFI depends on
> ACPI SRAT or device tree memory node to parse memory blocks'
> NUMA mapping.
> 
> But in current code, when Xen is booting from EFI, it will
> delete all memory nodes in device tree. So in UEFI + DTB
> boot, we don't have numa-node-id for memory blocks any more.
> 
> So in this patch, we will keep memory nodes in device tree for
> NUMA code to parse memory numa-node-id later.
> 
> As a side effect, if we still parse boot memory information in
> early_scan_node, bootinfo.mem will calculate memory ranges in
> memory nodes twice. So we have to prevent early_scan_node from
> parsing memory nodes in an EFI boot.
> 
> As EFI APIs can only be used on Arm64, we introduced a wrapper
> in the header file to avoid #ifdef CONFIG_ARM_64/32 blocks in the code.
> 
> Signed-off-by: Wei Chen 
> ---
>  xen/arch/arm/bootfdt.c  |  8 +++-
>  xen/arch/arm/efi/efi-boot.h | 25 -
>  xen/include/asm-arm/setup.h |  6 ++
>  3 files changed, 13 insertions(+), 26 deletions(-)
> 
> diff --git a/xen/arch/arm/bootfdt.c b/xen/arch/arm/bootfdt.c
> index 476e32e0f5..7df149dbca 100644
> --- a/xen/arch/arm/bootfdt.c
> +++ b/xen/arch/arm/bootfdt.c
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -335,7 +336,12 @@ static int __init early_scan_node(const void *fdt,
>  {
>  int rc = 0;
>  
> -if ( device_tree_node_matches(fdt, node, "memory") )
> +/*
> + * If system boot from EFI, bootinfo.mem has been set by EFI,
> + * so we don't need to parse memory node from DTB.
> + */
> +if ( device_tree_node_matches(fdt, node, "memory") &&
> + !arch_efi_enabled(EFI_BOOT) )
>  rc = process_memory_node(fdt, node, name, depth,
>   address_cells, size_cells, );
>  else if ( depth == 1 && !dt_node_cmp(name, "reserved-memory") )


If we are going to use the device tree info for the numa nodes (and
related memory) does it make sense to still rely on the EFI tables for
the memory map?

I wonder if we should just use device tree for memory and ignore EFI
instead. Do you know what Linux does in this regard?



Re: [XEN RFC PATCH 16/40] xen/arm: Create a fake NUMA node to use common code

2021-08-26 Thread Stefano Stabellini
On Wed, 11 Aug 2021, Wei Chen wrote:
> When CONFIG_NUMA is enabled for Arm, Xen will switch to use common
> NUMA API instead of previous fake NUMA API. Before we parse NUMA
> information from device tree or ACPI SRAT table, we need to init
> the NUMA related variables, like cpu_to_node, as single node NUMA
> system.
> 
> So in this patch, we introduce a numa_init function to initialize
> these data structures as if all resources belonged to node#0. This
> will make the new API return the same values as the fake API did.
> 
> Signed-off-by: Wei Chen 
> ---
>  xen/arch/arm/numa.c| 53 ++
>  xen/arch/arm/setup.c   |  8 ++
>  xen/include/asm-arm/numa.h | 11 
>  3 files changed, 72 insertions(+)
> 
> diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
> index 1e30c5bb13..566ad1e52b 100644
> --- a/xen/arch/arm/numa.c
> +++ b/xen/arch/arm/numa.c
> @@ -20,6 +20,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  void numa_set_node(int cpu, nodeid_t nid)
>  {
> @@ -29,3 +31,54 @@ void numa_set_node(int cpu, nodeid_t nid)
>  
>  cpu_to_node[cpu] = nid;
>  }
> +
> +void __init numa_init(bool acpi_off)
> +{
> +uint32_t idx;
> +paddr_t ram_start = ~0;
> +paddr_t ram_size = 0;
> +paddr_t ram_end = 0;
> +
> +printk(XENLOG_WARNING
> +"NUMA has not been supported yet, NUMA off!\n");

NIT: please align


> +/* Arm NUMA has not been implemented until this patch */

"Arm NUMA is not implemented yet"


> +numa_off = true;
> +
> +/*
> + * Set all cpu_to_node mapping to 0, this will make cpu_to_node
> + * function return 0 as previous fake cpu_to_node API.
> + */
> +for ( idx = 0; idx < NR_CPUS; idx++ )
> +cpu_to_node[idx] = 0;
> +
> +/*
> + * Make node_to_cpumask, node_spanned_pages and node_start_pfn
> + * return as previous fake APIs.
> + */
> +for ( idx = 0; idx < MAX_NUMNODES; idx++ ) {
> +node_to_cpumask[idx] = cpu_online_map;
> +node_spanned_pages(idx) = (max_page - mfn_x(first_valid_mfn));
> +node_start_pfn(idx) = (mfn_x(first_valid_mfn));
> +}

I just want to note that this works because MAX_NUMNODES is 1. If
MAX_NUMNODES was > 1 then it would be wrong to set node_to_cpumask,
node_spanned_pages and node_start_pfn for all nodes to the same values.

It might be worth writing something about it in the in-code comment.


> +/*
> + * Find the minimal and maximum address of RAM, NUMA will
> + * build a memory to node mapping table for the whole range.
> + */
> +ram_start = bootinfo.mem.bank[0].start;
> +ram_size  = bootinfo.mem.bank[0].size;
> +ram_end   = ram_start + ram_size;
> +for ( idx = 1 ; idx < bootinfo.mem.nr_banks; idx++ )
> +{
> +paddr_t bank_start = bootinfo.mem.bank[idx].start;
> +paddr_t bank_size = bootinfo.mem.bank[idx].size;
> +paddr_t bank_end = bank_start + bank_size;
> +
> +ram_size  = ram_size + bank_size;

ram_size is updated but not utilized


> +ram_start = min(ram_start, bank_start);
> +ram_end   = max(ram_end, bank_end);
> +}
> +
> +numa_initmem_init(PFN_UP(ram_start), PFN_DOWN(ram_end));
> +return;
> +}
> diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
> index 63a908e325..3c58d2d441 100644
> --- a/xen/arch/arm/setup.c
> +++ b/xen/arch/arm/setup.c
> @@ -30,6 +30,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -874,6 +875,13 @@ void __init start_xen(unsigned long boot_phys_offset,
>  /* Parse the ACPI tables for possible boot-time configuration */
>  acpi_boot_table_init();
>  
> +/*
> + * Try to initialize NUMA system, if failed, the system will
> + * fallback to uniform system which means system has only 1
> + * NUMA node.
> + */
> +numa_init(acpi_disabled);
> +
>  end_boot_allocator();
>  
>  /*
> diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> index b2982f9053..bb495a24e1 100644
> --- a/xen/include/asm-arm/numa.h
> +++ b/xen/include/asm-arm/numa.h
> @@ -13,6 +13,16 @@ typedef u8 nodeid_t;
>   */
>  #define NODES_SHIFT  6
>  
> +extern void numa_init(bool acpi_off);
> +
> +/*
> + * Temporary for fake NUMA node, when CPU, memory and distance
> + * matrix will be read from DTB or ACPI SRAT. The following
> + * symbols will be removed.
> + */
> +extern mfn_t first_valid_mfn;
> +#define __node_distance(a, b) (20)
> +
>  #else
>  
>  /* Fake one node for now. See also node_online_map. */
> @@ -35,6 +45,7 @@ extern mfn_t first_valid_mfn;
>  #define node_start_pfn(nid) (mfn_x(first_valid_mfn))
>  #define __node_distance(a, b) (20)
>  
> +#define numa_init(x) do { } while (0)
>  #define numa_set_node(x, y) do { } while (0)
>  
>  #endif
> -- 
> 2.25.1
> 



Re: HVM guest only bring up a single vCPU

2021-08-26 Thread Marek Marczykowski-Górecki
On Thu, Aug 26, 2021 at 10:00:58PM +0100, Julien Grall wrote:
> Hi Andrew,
> 
> While doing more testing today, I noticed that only one vCPU would be
> brought up with HVM guest with Xen 4.16 on my setup (QEMU):
> 
> [1.122180] 
> 
> [1.122180] UBSAN: shift-out-of-bounds in
> oss/linux/arch/x86/kernel/apic/apic.c:2362:13
> [1.122180] shift exponent -1 is negative
> [1.122180] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-rc7+ #304
> [1.122180] Hardware name: Xen HVM domU, BIOS 4.16-unstable 06/07/2021
> [1.122180] Call Trace:
> [1.122180]  dump_stack_lvl+0x56/0x6c
> [1.122180]  ubsan_epilogue+0x5/0x50
> [1.122180]  __ubsan_handle_shift_out_of_bounds+0xfa/0x140
> [1.122180]  ? cgroup_kill_write+0x4d/0x150
> [1.122180]  ? cpu_up+0x6e/0x100
> [1.122180]  ? _raw_spin_unlock_irqrestore+0x30/0x50
> [1.122180]  ? rcu_read_lock_held_common+0xe/0x40
> [1.122180]  ? irq_shutdown_and_deactivate+0x11/0x30
> [1.122180]  ? lock_release+0xc7/0x2a0
> [1.122180]  ? apic_id_is_primary_thread+0x56/0x60
> [1.122180]  apic_id_is_primary_thread+0x56/0x60
> [1.122180]  cpu_up+0xbd/0x100
> [1.122180]  bringup_nonboot_cpus+0x4f/0x60
> [1.122180]  smp_init+0x26/0x74
> [1.122180]  kernel_init_freeable+0x183/0x32d
> [1.122180]  ? _raw_spin_unlock_irq+0x24/0x40
> [1.122180]  ? rest_init+0x330/0x330
> [1.122180]  kernel_init+0x17/0x140
> [1.122180]  ? rest_init+0x330/0x330
> [1.122180]  ret_from_fork+0x22/0x30
> [1.122244] 
> 
> [1.123176] installing Xen timer for CPU 1
> [1.123369] x86: Booting SMP configuration:
> [1.123409]  node  #0, CPUs:  #1
> [1.154400] Callback from call_rcu_tasks_trace() invoked.
> [1.154491] smp: Brought up 1 node, 1 CPU
> [1.154526] smpboot: Max logical packages: 2
> [1.154570] smpboot: Total of 1 processors activated (5999.99 BogoMIPS)
> 
> I have tried a PV guest (same setup) and the kernel could bring up all the
> vCPUs.
> 
> Digging down, Linux will set smp_num_siblings to 0 (via detect_ht_early())
> and as a result will skip all the CPUs. The value is retrieved from a CPUID
> leaf. So it sounds like we don't set the leaf correctly.
> 
> FWIW, I have also tried on Xen 4.11 and could spot the same issue. Does this
> ring any bell to you?

Is it maybe this:
https://lore.kernel.org/xen-devel/20201106003529.391649-1-bmas...@redhat.com/T/#u
?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


HVM guest only bring up a single vCPU

2021-08-26 Thread Julien Grall

Hi Andrew,

While doing more testing today, I noticed that only one vCPU would be 
brought up with HVM guest with Xen 4.16 on my setup (QEMU):


[1.122180] 

[1.122180] UBSAN: shift-out-of-bounds in 
oss/linux/arch/x86/kernel/apic/apic.c:2362:13

[1.122180] shift exponent -1 is negative
[1.122180] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-rc7+ #304
[1.122180] Hardware name: Xen HVM domU, BIOS 4.16-unstable 06/07/2021
[1.122180] Call Trace:
[1.122180]  dump_stack_lvl+0x56/0x6c
[1.122180]  ubsan_epilogue+0x5/0x50
[1.122180]  __ubsan_handle_shift_out_of_bounds+0xfa/0x140
[1.122180]  ? cgroup_kill_write+0x4d/0x150
[1.122180]  ? cpu_up+0x6e/0x100
[1.122180]  ? _raw_spin_unlock_irqrestore+0x30/0x50
[1.122180]  ? rcu_read_lock_held_common+0xe/0x40
[1.122180]  ? irq_shutdown_and_deactivate+0x11/0x30
[1.122180]  ? lock_release+0xc7/0x2a0
[1.122180]  ? apic_id_is_primary_thread+0x56/0x60
[1.122180]  apic_id_is_primary_thread+0x56/0x60
[1.122180]  cpu_up+0xbd/0x100
[1.122180]  bringup_nonboot_cpus+0x4f/0x60
[1.122180]  smp_init+0x26/0x74
[1.122180]  kernel_init_freeable+0x183/0x32d
[1.122180]  ? _raw_spin_unlock_irq+0x24/0x40
[1.122180]  ? rest_init+0x330/0x330
[1.122180]  kernel_init+0x17/0x140
[1.122180]  ? rest_init+0x330/0x330
[1.122180]  ret_from_fork+0x22/0x30
[1.122244] 


[1.123176] installing Xen timer for CPU 1
[1.123369] x86: Booting SMP configuration:
[1.123409]  node  #0, CPUs:  #1
[1.154400] Callback from call_rcu_tasks_trace() invoked.
[1.154491] smp: Brought up 1 node, 1 CPU
[1.154526] smpboot: Max logical packages: 2
[1.154570] smpboot: Total of 1 processors activated (5999.99 BogoMIPS)

I have tried a PV guest (same setup) and the kernel could bring up all 
the vCPUs.


Digging down, Linux will set smp_num_siblings to 0 (via 
detect_ht_early()) and as a result will skip all the CPUs. The value is 
retrieved from a CPUID leaf. So it sounds like we don't set the leaf 
correctly.
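
Roughly the relevant logic, paraphrased from the kernel sources (a sketch,
not a verbatim copy):

    /* detect_ht_early(): sibling count from CPUID leaf 1, EBX[23:16] */
    smp_num_siblings = (cpuid_ebx(1) >> 16) & 0xff;

    /*
     * apic_id_is_primary_thread(): with smp_num_siblings == 0,
     * fls(0) - 1 == -1, which is the "shift exponent -1" UBSAN
     * splat in apic.c above, and the bogus mask then causes the
     * secondary CPUs to be skipped.
     */
    mask = (1U << (fls(smp_num_siblings) - 1)) - 1;
    return !(apicid & mask);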


FWIW, I have also tried on Xen 4.11 and could spot the same issue. Does 
this ring any bell to you?


Cheers,

--
Julien Grall



Re: [PATCH v2] PCI/MSI: Skip masking MSI-X on Xen PV

2021-08-26 Thread Bjorn Helgaas
On Thu, Aug 26, 2021 at 07:03:42PM +0200, Marek Marczykowski-Górecki wrote:
> When running as a Xen PV guest, masking MSI-X is the responsibility of the
> hypervisor. The guest has no write access to the relevant BAR at all - when
> it tries to, it results in a crash like this:
> 
> BUG: unable to handle page fault for address: c9004069100c
> #PF: supervisor write access in kernel mode
> #PF: error_code(0x0003) - permissions violation
> PGD 18f1c067 P4D 18f1c067 PUD 4dbd067 PMD 4fba067 PTE 8010febd4075
> Oops: 0003 [#1] SMP NOPTI
> CPU: 0 PID: 234 Comm: kworker/0:2 Tainted: GW 
> 5.14.0-rc7-1.fc32.qubes.x86_64 #15
> Workqueue: events work_for_cpu_fn
> RIP: e030:__pci_enable_msix_range.part.0+0x26b/0x5f0
> Code: 2f 96 ff 48 89 44 24 28 48 89 c7 48 85 c0 0f 84 f6 01 00 00 45 0f 
> b7 f6 48 8d 40 0c ba 01 00 00 00 49 c1 e6 04 4a 8d 4c 37 1c <89> 10 48 83 c0 
> 10 48 39 c1 75 f5 41 0f b6 44 24 6a 84 c0 0f 84 48
> RSP: e02b:c9004018bd50 EFLAGS: 00010212
> RAX: c9004069100c RBX: 88800ed412f8 RCX: c9004069105c
> RDX: 0001 RSI: 000febd4 RDI: c90040691000
> RBP: 0003 R08:  R09: febd404f
> R10: 7ff0 R11: 88800ee8ae40 R12: 88800ed41000
> R13:  R14: 0040 R15: feba
> FS:  () GS:88801840() 
> knlGS:
> CS:  e030 DS:  ES:  CR0: 80050033
> CR2: 807f5ea0 CR3: 12f6a000 CR4: 0660
> Call Trace:
>  e1000e_set_interrupt_capability+0xbf/0xd0 [e1000e]
>  e1000_probe+0x41f/0xdb0 [e1000e]
>  local_pci_probe+0x42/0x80
> (...)
> 
> There is a pci_msi_ignore_mask variable for bypassing MSI(-X) masking on Xen
> PV, but msix_mask_all() missed checking it. Add the check there too.
> 
> Fixes: 7d5ec3d36123 ("PCI/MSI: Mask all unused MSI-X entries")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Marek Marczykowski-Górecki 

Acked-by: Bjorn Helgaas 

> ---
> Cc: xen-devel@lists.xenproject.org
> 
> Changes in v2:
> - update commit message (MSI -> MSI-X)
> ---
>  drivers/pci/msi.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index e5e75331b415..3a9f4f8ad8f9 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -776,6 +776,9 @@ static void msix_mask_all(void __iomem *base, int tsize)
>   u32 ctrl = PCI_MSIX_ENTRY_CTRL_MASKBIT;
>   int i;
>  
> + if (pci_msi_ignore_mask)
> + return;
> +
>   for (i = 0; i < tsize; i++, base += PCI_MSIX_ENTRY_SIZE)
>   writel(ctrl, base + PCI_MSIX_ENTRY_VECTOR_CTRL);
>  }
> -- 
> 2.31.1
> 



[xen-4.12-testing test] 164489: tolerable FAIL - PUSHED

2021-08-26 Thread osstest service owner
flight 164489 xen-4.12-testing real [real]
http://logs.test-lab.xenproject.org/osstest/logs/164489/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-qcow2   19 guest-localmigrate/x10   fail  like 162549
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 162549
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 162549
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 162549
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 162549
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 162549
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 162549
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 162549
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 162549
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 162549
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 162549
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 162549
 test-amd64-i386-xl-pvshim    14 guest-start  fail   never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt  15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass

version targeted for testing:
 xen  35ba323378d05509f2e0dc049520e140be183003
baseline version:
 xen  ea20eee97e9e0861127a8070cc7b9ae3557b09fb

Last test of basis   162549  2021-06-08 18:37:01 Z   79 days
Failing since164259  2021-08-19 17:07:29 Z7 days5 attempts
Testing same since   164489  2021-08-25 17:57:16 Z1 days1 attempts


People who touched revisions under test:
  Andrew Cooper 
  Anthony PERARD 
  Ian Jackson 
  Jan Beulich 
  Jason Andryuk 
  Julien Grall 
  Roger Pau Monné 
  Stefano 

[PATCH v2] PCI/MSI: Skip masking MSI-X on Xen PV

2021-08-26 Thread Marek Marczykowski-Górecki
When running as a Xen PV guest, masking MSI-X is the responsibility of the
hypervisor. The guest has no write access to the relevant BAR at all - when
it tries to, it results in a crash like this:

BUG: unable to handle page fault for address: c9004069100c
#PF: supervisor write access in kernel mode
#PF: error_code(0x0003) - permissions violation
PGD 18f1c067 P4D 18f1c067 PUD 4dbd067 PMD 4fba067 PTE 8010febd4075
Oops: 0003 [#1] SMP NOPTI
CPU: 0 PID: 234 Comm: kworker/0:2 Tainted: G        W 5.14.0-rc7-1.fc32.qubes.x86_64 #15
Workqueue: events work_for_cpu_fn
RIP: e030:__pci_enable_msix_range.part.0+0x26b/0x5f0
Code: 2f 96 ff 48 89 44 24 28 48 89 c7 48 85 c0 0f 84 f6 01 00 00 45 0f b7 f6 48 8d 40 0c ba 01 00 00 00 49 c1 e6 04 4a 8d 4c 37 1c <89> 10 48 83 c0 10 48 39 c1 75 f5 41 0f b6 44 24 6a 84 c0 0f 84 48
RSP: e02b:c9004018bd50 EFLAGS: 00010212
RAX: c9004069100c RBX: 88800ed412f8 RCX: c9004069105c
RDX: 0001 RSI: 000febd4 RDI: c90040691000
RBP: 0003 R08:  R09: febd404f
R10: 7ff0 R11: 88800ee8ae40 R12: 88800ed41000
R13:  R14: 0040 R15: feba
FS:  () GS:88801840() knlGS:
CS:  e030 DS:  ES:  CR0: 80050033
CR2: 807f5ea0 CR3: 12f6a000 CR4: 0660
Call Trace:
 e1000e_set_interrupt_capability+0xbf/0xd0 [e1000e]
 e1000_probe+0x41f/0xdb0 [e1000e]
 local_pci_probe+0x42/0x80
(...)

There is a pci_msi_ignore_mask variable for bypassing MSI(-X) masking on Xen
PV, but msix_mask_all() missed checking it. Add the check there too.

Fixes: 7d5ec3d36123 ("PCI/MSI: Mask all unused MSI-X entries")
Cc: sta...@vger.kernel.org
Signed-off-by: Marek Marczykowski-Górecki 
---
Cc: xen-devel@lists.xenproject.org

Changes in v2:
- update commit message (MSI -> MSI-X)
---
 drivers/pci/msi.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index e5e75331b415..3a9f4f8ad8f9 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -776,6 +776,9 @@ static void msix_mask_all(void __iomem *base, int tsize)
u32 ctrl = PCI_MSIX_ENTRY_CTRL_MASKBIT;
int i;
 
+   if (pci_msi_ignore_mask)
+   return;
+
for (i = 0; i < tsize; i++, base += PCI_MSIX_ENTRY_SIZE)
writel(ctrl, base + PCI_MSIX_ENTRY_VECTOR_CTRL);
 }
-- 
2.31.1




Re: [PATCH] PCI/MSI: skip masking MSI on Xen PV

2021-08-26 Thread Bjorn Helgaas
On Thu, Aug 26, 2021 at 06:36:49PM +0200, Marek Marczykowski-Górecki wrote:
> On Thu, Aug 26, 2021 at 09:55:32AM -0500, Bjorn Helgaas wrote:
> > If/when you repost this, please run "git log --oneline
> > drivers/pci/msi.c" and follow the convention of capitalizing the
> > subject line.
> > 
> > Also, I think this patch refers specifically to MSI-X, not MSI, so
> > please update the subject line and the "masking MSI" below to reflect
> > that.
> 
> Sure, thanks for pointing this out. Is the patch otherwise ok? Should I
> post v2 with just updated commit message?

Wouldn't hurt to post a v2.  I don't have any objections to the patch,
but ultimately up to Thomas.

> > On Thu, Aug 26, 2021 at 03:43:37PM +0200, Marek Marczykowski-Górecki wrote:
> > > When running as Xen PV guest, masking MSI is a responsibility of the
> > > hypervisor. Guest has no write access to relevant BAR at all - when it
> > > tries to, it results in a crash like this:
> > > 
> > > BUG: unable to handle page fault for address: c9004069100c
> > > #PF: supervisor write access in kernel mode
> > > #PF: error_code(0x0003) - permissions violation
> > > PGD 18f1c067 P4D 18f1c067 PUD 4dbd067 PMD 4fba067 PTE 8010febd4075
> > > Oops: 0003 [#1] SMP NOPTI
> > > CPU: 0 PID: 234 Comm: kworker/0:2 Tainted: GW 
> > > 5.14.0-rc7-1.fc32.qubes.x86_64 #15
> > > Workqueue: events work_for_cpu_fn
> > > RIP: e030:__pci_enable_msix_range.part.0+0x26b/0x5f0
> > > Code: 2f 96 ff 48 89 44 24 28 48 89 c7 48 85 c0 0f 84 f6 01 00 00 45 
> > > 0f b7 f6 48 8d 40 0c ba 01 00 00 00 49 c1 e6 04 4a 8d 4c 37 1c <89> 10 48 
> > > 83 c0 10 48 39 c1 75 f5 41 0f b6 44 24 6a 84 c0 0f 84 48
> > > RSP: e02b:c9004018bd50 EFLAGS: 00010212
> > > RAX: c9004069100c RBX: 88800ed412f8 RCX: c9004069105c
> > > RDX: 0001 RSI: 000febd4 RDI: c90040691000
> > > RBP: 0003 R08:  R09: febd404f
> > > R10: 7ff0 R11: 88800ee8ae40 R12: 88800ed41000
> > > R13:  R14: 0040 R15: feba
> > > FS:  () GS:88801840() 
> > > knlGS:
> > > CS:  e030 DS:  ES:  CR0: 80050033
> > > CR2: 807f5ea0 CR3: 12f6a000 CR4: 0660
> > > Call Trace:
> > >  e1000e_set_interrupt_capability+0xbf/0xd0 [e1000e]
> > >  e1000_probe+0x41f/0xdb0 [e1000e]
> > >  local_pci_probe+0x42/0x80
> > > (...)
> > > 
> > > There is pci_msi_ignore_mask variable for bypassing MSI masking on Xen
> > > PV, but msix_mask_all() missed checking it. Add the check there too.
> > > 
> > > Fixes: 7d5ec3d36123 ("PCI/MSI: Mask all unused MSI-X entries")
> > > Cc: sta...@vger.kernel.org
> > 
> > 7d5ec3d36123 appeared in v5.14-rc6, so if this fix is merged before
> > v5.14, the stable tag will be unnecessary.  But we are running out of
> > time there.
> 
> 7d5ec3d36123 was already backported to stable branches (at least 5.10
> and 5.4), and in fact this is how I discovered the issue...

Oh, right, of course.  Sorry :)

> > > Signed-off-by: Marek Marczykowski-Górecki 
> > > 
> > > ---
> > > Cc: xen-devel@lists.xenproject.org
> > > ---
> > >  drivers/pci/msi.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > > index e5e75331b415..3a9f4f8ad8f9 100644
> > > --- a/drivers/pci/msi.c
> > > +++ b/drivers/pci/msi.c
> > > @@ -776,6 +776,9 @@ static void msix_mask_all(void __iomem *base, int 
> > > tsize)
> > >   u32 ctrl = PCI_MSIX_ENTRY_CTRL_MASKBIT;
> > >   int i;
> > >  
> > > + if (pci_msi_ignore_mask)
> > > + return;
> > > +
> > >   for (i = 0; i < tsize; i++, base += PCI_MSIX_ENTRY_SIZE)
> > >   writel(ctrl, base + PCI_MSIX_ENTRY_VECTOR_CTRL);
> > >  }
> > > -- 
> > > 2.31.1
> > > 
> 
> -- 
> Best Regards,
> Marek Marczykowski-Górecki
> Invisible Things Lab





Re: [PATCH] PCI/MSI: skip masking MSI on Xen PV

2021-08-26 Thread Marek Marczykowski-Górecki
On Thu, Aug 26, 2021 at 09:55:32AM -0500, Bjorn Helgaas wrote:
> If/when you repost this, please run "git log --oneline
> drivers/pci/msi.c" and follow the convention of capitalizing the
> subject line.
> 
> Also, I think this patch refers specifically to MSI-X, not MSI, so
> please update the subject line and the "masking MSI" below to reflect
> that.

Sure, thanks for pointing this out. Is the patch otherwise ok? Should I
post v2 with just updated commit message?

> On Thu, Aug 26, 2021 at 03:43:37PM +0200, Marek Marczykowski-Górecki wrote:
> > When running as Xen PV guest, masking MSI is a responsibility of the
> > hypervisor. Guest has no write access to relevant BAR at all - when it
> > tries to, it results in a crash like this:
> > 
> > BUG: unable to handle page fault for address: c9004069100c
> > #PF: supervisor write access in kernel mode
> > #PF: error_code(0x0003) - permissions violation
> > PGD 18f1c067 P4D 18f1c067 PUD 4dbd067 PMD 4fba067 PTE 8010febd4075
> > Oops: 0003 [#1] SMP NOPTI
> > CPU: 0 PID: 234 Comm: kworker/0:2 Tainted: GW 
> > 5.14.0-rc7-1.fc32.qubes.x86_64 #15
> > Workqueue: events work_for_cpu_fn
> > RIP: e030:__pci_enable_msix_range.part.0+0x26b/0x5f0
> > Code: 2f 96 ff 48 89 44 24 28 48 89 c7 48 85 c0 0f 84 f6 01 00 00 45 0f 
> > b7 f6 48 8d 40 0c ba 01 00 00 00 49 c1 e6 04 4a 8d 4c 37 1c <89> 10 48 83 
> > c0 10 48 39 c1 75 f5 41 0f b6 44 24 6a 84 c0 0f 84 48
> > RSP: e02b:c9004018bd50 EFLAGS: 00010212
> > RAX: c9004069100c RBX: 88800ed412f8 RCX: c9004069105c
> > RDX: 0001 RSI: 000febd4 RDI: c90040691000
> > RBP: 0003 R08:  R09: febd404f
> > R10: 7ff0 R11: 88800ee8ae40 R12: 88800ed41000
> > R13:  R14: 0040 R15: feba
> > FS:  () GS:88801840() 
> > knlGS:
> > CS:  e030 DS:  ES:  CR0: 80050033
> > CR2: 807f5ea0 CR3: 12f6a000 CR4: 0660
> > Call Trace:
> >  e1000e_set_interrupt_capability+0xbf/0xd0 [e1000e]
> >  e1000_probe+0x41f/0xdb0 [e1000e]
> >  local_pci_probe+0x42/0x80
> > (...)
> > 
> > There is pci_msi_ignore_mask variable for bypassing MSI masking on Xen
> > PV, but msix_mask_all() missed checking it. Add the check there too.
> > 
> > Fixes: 7d5ec3d36123 ("PCI/MSI: Mask all unused MSI-X entries")
> > Cc: sta...@vger.kernel.org
> 
> 7d5ec3d36123 appeared in v5.14-rc6, so if this fix is merged before
> v5.14, the stable tag will be unnecessary.  But we are running out of
> time there.

7d5ec3d36123 was already backported to stable branches (at least 5.10
and 5.4), and in fact this is how I discovered the issue...

> > Signed-off-by: Marek Marczykowski-Górecki 
> > ---
> > Cc: xen-devel@lists.xenproject.org
> > ---
> >  drivers/pci/msi.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index e5e75331b415..3a9f4f8ad8f9 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -776,6 +776,9 @@ static void msix_mask_all(void __iomem *base, int tsize)
> > u32 ctrl = PCI_MSIX_ENTRY_CTRL_MASKBIT;
> > int i;
> >  
> > +   if (pci_msi_ignore_mask)
> > +   return;
> > +
> > for (i = 0; i < tsize; i++, base += PCI_MSIX_ENTRY_SIZE)
> > writel(ctrl, base + PCI_MSIX_ENTRY_VECTOR_CTRL);
> >  }
> > -- 
> > 2.31.1
> > 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




[xen-4.11-testing test] 164486: regressions - FAIL

2021-08-26 Thread osstest service owner
flight 164486 xen-4.11-testing real [real]
flight 164502 xen-4.11-testing real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/164486/
http://logs.test-lab.xenproject.org/osstest/logs/164502/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-i386-xl-qemuu-ovmf-amd64 12 debian-hvm-install fail REGR. vs. 162548
 test-amd64-amd64-xl-qemuu-ovmf-amd64 12 debian-hvm-install fail REGR. vs. 162548

Tests which are failing intermittently (not blocking):
 test-arm64-arm64-xl-seattle   8 xen-boot fail pass in 164502-retest
 test-amd64-amd64-xl-qemut-debianhvm-i386-xsm 12 debian-hvm-install fail pass in 164502-retest

Tests which did not succeed, but are not blocking:
 test-arm64-arm64-xl-seattle 15 migrate-support-check fail in 164502 never pass
 test-arm64-arm64-xl-seattle 16 saverestore-support-check fail in 164502 never pass
 test-amd64-amd64-xl-qemuu-dmrestrict-amd64-dmrestrict 12 debian-hvm-install fail like 162548
 test-amd64-i386-xl-qemuu-dmrestrict-amd64-dmrestrict 12 debian-hvm-install fail like 162548
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 162548
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop fail like 162548
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 162548
 test-armhf-armhf-libvirt 16 saverestore-support-check fail  like 162548
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 162548
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop fail like 162548
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop fail like 162548
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 162548
 test-armhf-armhf-libvirt-raw 15 saverestore-support-check fail  like 162548
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop fail like 162548
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 162548
 test-arm64-arm64-xl-thunderx 15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check fail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail  never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-check fail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check fail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-check fail   never pass
 test-amd64-i386-xl-pvshim 14 guest-start  fail   never pass
 test-amd64-i386-libvirt  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-check fail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-check fail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-check fail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-check fail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-check fail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check fail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-check fail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-arm64-arm64-xl  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl  16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl  16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-check fail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-check fail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-check fail   never pass
 

Re: [PATCH] PCI/MSI: skip masking MSI on Xen PV

2021-08-26 Thread Bjorn Helgaas
If/when you repost this, please run "git log --oneline
drivers/pci/msi.c" and follow the convention of capitalizing the
subject line.

Also, I think this patch refers specifically to MSI-X, not MSI, so
please update the subject line and the "masking MSI" below to reflect
that.

On Thu, Aug 26, 2021 at 03:43:37PM +0200, Marek Marczykowski-Górecki wrote:
> When running as Xen PV guest, masking MSI is a responsibility of the
> hypervisor. Guest has no write access to relevant BAR at all - when it
> tries to, it results in a crash like this:
> 
> BUG: unable to handle page fault for address: c9004069100c
> #PF: supervisor write access in kernel mode
> #PF: error_code(0x0003) - permissions violation
> PGD 18f1c067 P4D 18f1c067 PUD 4dbd067 PMD 4fba067 PTE 8010febd4075
> Oops: 0003 [#1] SMP NOPTI
> CPU: 0 PID: 234 Comm: kworker/0:2 Tainted: GW 
> 5.14.0-rc7-1.fc32.qubes.x86_64 #15
> Workqueue: events work_for_cpu_fn
> RIP: e030:__pci_enable_msix_range.part.0+0x26b/0x5f0
> Code: 2f 96 ff 48 89 44 24 28 48 89 c7 48 85 c0 0f 84 f6 01 00 00 45 0f 
> b7 f6 48 8d 40 0c ba 01 00 00 00 49 c1 e6 04 4a 8d 4c 37 1c <89> 10 48 83 c0 
> 10 48 39 c1 75 f5 41 0f b6 44 24 6a 84 c0 0f 84 48
> RSP: e02b:c9004018bd50 EFLAGS: 00010212
> RAX: c9004069100c RBX: 88800ed412f8 RCX: c9004069105c
> RDX: 0001 RSI: 000febd4 RDI: c90040691000
> RBP: 0003 R08:  R09: febd404f
> R10: 7ff0 R11: 88800ee8ae40 R12: 88800ed41000
> R13:  R14: 0040 R15: feba
> FS:  () GS:88801840() 
> knlGS:
> CS:  e030 DS:  ES:  CR0: 80050033
> CR2: 807f5ea0 CR3: 12f6a000 CR4: 0660
> Call Trace:
>  e1000e_set_interrupt_capability+0xbf/0xd0 [e1000e]
>  e1000_probe+0x41f/0xdb0 [e1000e]
>  local_pci_probe+0x42/0x80
> (...)
> 
> There is pci_msi_ignore_mask variable for bypassing MSI masking on Xen
> PV, but msix_mask_all() missed checking it. Add the check there too.
> 
> Fixes: 7d5ec3d36123 ("PCI/MSI: Mask all unused MSI-X entries")
> Cc: sta...@vger.kernel.org

7d5ec3d36123 appeared in v5.14-rc6, so if this fix is merged before
v5.14, the stable tag will be unnecessary.  But we are running out of
time there.

> Signed-off-by: Marek Marczykowski-Górecki 
> ---
> Cc: xen-devel@lists.xenproject.org
> ---
>  drivers/pci/msi.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index e5e75331b415..3a9f4f8ad8f9 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -776,6 +776,9 @@ static void msix_mask_all(void __iomem *base, int tsize)
>   u32 ctrl = PCI_MSIX_ENTRY_CTRL_MASKBIT;
>   int i;
>  
> + if (pci_msi_ignore_mask)
> + return;
> +
>   for (i = 0; i < tsize; i++, base += PCI_MSIX_ENTRY_SIZE)
>   writel(ctrl, base + PCI_MSIX_ENTRY_VECTOR_CTRL);
>  }
> -- 
> 2.31.1
> 



Re: [PATCH v7 8/8] AMD/IOMMU: respect AtsDisabled device flag

2021-08-26 Thread Jan Beulich
On 26.08.2021 16:27, Andrew Cooper wrote:
> On 26/08/2021 08:26, Jan Beulich wrote:
>> TBD: I find the ordering in amd_iommu_disable_domain_device()
>>  suspicious: amd_iommu_enable_domain_device() sets up the DTE first
>>  and then enables ATS on the device. It would seem to me that
>>  disabling would better be done the other way around (disable ATS on
>>  device, then adjust DTE).
> 
> I'd hope that the worst which goes wrong, given the problematic order,
> is a master abort.
> 
> But yes - ATS wants disabling on the device first, before the DTE is
> updated.

Okay, I'll add another patch.

Jan




Re: [PATCH v7 7/8] AMD/IOMMU: add "ivmd=" command line option

2021-08-26 Thread Jan Beulich
On 26.08.2021 16:08, Andrew Cooper wrote:
> On 26/08/2021 08:25, Jan Beulich wrote:
>> @@ -1523,6 +1523,31 @@ _dom0-iommu=map-inclusive_ - using both
>>  > `= `
>>  
>>  ### irq_vector_map (x86)
>> +
>> +### ivmd (x86)
>> +> `= 
>> [-][=[-][,[-][,...]]][;...]`
>> +
>> +Define IVMD-like ranges that are missing from ACPI tables along with the
>> +device(s) they belong to, and use them for 1:1 mapping.  End addresses can 
>> be
>> +omitted when exactly one page is meant.  The ranges are inclusive when start
>> +and end are specified.  Note that only PCI segment 0 is supported at this 
>> time,
>> +but it is fine to specify it explicitly.
>> +
>> +'start' and 'end' values are page numbers (not full physical addresses),
>> +in hexadecimal format (can optionally be preceded by "0x").
>> +
>> +Omitting the optional (range of) BDF specifiers signals that the range is to
>> +be applied to all devices.
>> +
>> +Usage example: If device 0:0:1d.0 requires one page (0xd5d45) to be
>> +reserved, and devices 0:0:1a.0...0:0:1a.3 collectively require three pages
>> +(0xd5d46 thru 0xd5d48) to be reserved, one usage would be:
>> +
>> +ivmd=d5d45=0:1d.0;0xd5d46-0xd5d48=0:1a.0-0:1a.3
>> +
>> +Note: grub2 requires escaping or quoting special characters, like ';' when
>> +multiple ranges are specified - refer to the grub2 documentation.
> 
> I'm slightly concerned that we're putting in syntax which the most
> widely used bootloader can't handle.

This matches RMRR handling, and I'd really like to keep the two as
similar as possible. Plus you can avoid the use of ; also by having
more than one "ivmd=" on the command line.
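
For instance, the example from the documentation could equally be written
without ';' as two separate options:

    ivmd=d5d45=0:1d.0 ivmd=0xd5d46-0xd5d48=0:1a.0-0:1a.3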

> A real IVMD entry in hardware doesn't have the concept of multiple
> device ranges, so I think comma ought to be the main list separator, and
> I don't think we need ; at all.

Firmware would need to present two IVMD entries in such a case. On
the command line I think we should allow a more compact representation.
Plus again - this is similar to "rmrr=".

Jan




Re: [PATCH v7 8/8] AMD/IOMMU: respect AtsDisabled device flag

2021-08-26 Thread Andrew Cooper
On 26/08/2021 08:26, Jan Beulich wrote:
> IVHD entries may specify that ATS is to be blocked for a device or range
> of devices. Honor firmware telling us so.

It would be helpful if there was any explanation as to the purpose of
this bit.

It is presumably to do with SecureATS, but that works by accepting the
ATS translation and doing the pagewalk anyway.

>
> While adding respective checks I noticed that the 2nd conditional in
> amd_iommu_setup_domain_device() failed to check the IOMMU's capability.
> Add the missing part of the condition there, as no good can come from
> enabling ATS on a device when the IOMMU is not capable of dealing with
> ATS requests.
>
> For actually using ACPI_IVHD_ATS_DISABLED, make its expansion no longer
> exhibit UB.
>
> Signed-off-by: Jan Beulich 
> ---
> TBD: I find the ordering in amd_iommu_disable_domain_device()
>  suspicious: amd_iommu_enable_domain_device() sets up the DTE first
>  and then enables ATS on the device. It would seem to me that
>  disabling would better be done the other way around (disable ATS on
>  device, then adjust DTE).

I'd hope that the worst which goes wrong, given the problematic order,
is a master abort.

But yes - ATS wants disabling on the device first, before the DTE is
updated.
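
As a rough sketch of that teardown ordering (the helper names below are
illustrative, not necessarily the actual Xen ones):

    /* Illustrative sketch - the reverse of the enable path's
     * DTE-then-ATS order: */
    ats_disable(pdev);                   /* 1. stop the device issuing ATS */
    clear_dte(iommu, pdev);              /* 2. only then adjust the DTE */
    flush_device_iotlb(iommu, pdev);     /* 3. flush any stale translations */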

~Andrew




Re: [PATCH v7 7/8] AMD/IOMMU: add "ivmd=" command line option

2021-08-26 Thread Andrew Cooper
On 26/08/2021 08:25, Jan Beulich wrote:
> Just like VT-d's "rmrr=" it can be used to cover for firmware omissions.
> Since systems surfacing IVMDs seem to be rare, it is also meant to allow
> testing of the involved code.
>
> Only the IVMD flavors actually understood by the IVMD parsing logic can
> be generated, and for this initial implementation there's also no way to
> control the flags field - unity r/w mappings are assumed.
>
> Signed-off-by: Jan Beulich 
> Reviewed-by: Paul Durrant 
> ---
> v5: New.
>
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -836,12 +836,12 @@ Controls for the dom0 IOMMU setup.
>  
>  Typically, some devices in a system use bits of RAM for communication, 
> and
>  these areas should be listed as reserved in the E820 table and identified
> -via RMRR or IVMD entries in the APCI tables, so Xen can ensure that they
> +via RMRR or IVMD entries in the ACPI tables, so Xen can ensure that they

Oops.

> @@ -1523,6 +1523,31 @@ _dom0-iommu=map-inclusive_ - using both
>  > `= `
>  
>  ### irq_vector_map (x86)
> +
> +### ivmd (x86)
> +> `= 
> [-][=[-][,[-][,...]]][;...]`
> +
> +Define IVMD-like ranges that are missing from ACPI tables along with the
> +device(s) they belong to, and use them for 1:1 mapping.  End addresses can be
> +omitted when exactly one page is meant.  The ranges are inclusive when start
> +and end are specified.  Note that only PCI segment 0 is supported at this 
> time,
> +but it is fine to specify it explicitly.
> +
> +'start' and 'end' values are page numbers (not full physical addresses),
> +in hexadecimal format (can optionally be preceded by "0x").
> +
> +Omitting the optional (range of) BDF specifiers signals that the range is to
> +be applied to all devices.
> +
> +Usage example: If device 0:0:1d.0 requires one page (0xd5d45) to be
> +reserved, and devices 0:0:1a.0...0:0:1a.3 collectively require three pages
> +(0xd5d46 thru 0xd5d48) to be reserved, one usage would be:
> +
> +ivmd=d5d45=0:1d.0;0xd5d46-0xd5d48=0:1a.0-0:1a.3
> +
> +Note: grub2 requires escaping or quoting special characters, like ';' when
> +multiple ranges are specified - refer to the grub2 documentation.

I'm slightly concerned that we're putting in syntax which the most
widely used bootloader can't handle.

A real IVMD entry in hardware doesn't have the concept of multiple
device ranges, so I think comma ought to be the main list separator, and
I don't think we need ; at all.

~Andrew




Re: [PATCH v7 6/8] AMD/IOMMU: provide function backing XENMEM_reserved_device_memory_map

2021-08-26 Thread Jan Beulich
On 26.08.2021 15:24, Andrew Cooper wrote:
> On 26/08/2021 08:25, Jan Beulich wrote:
>> --- a/xen/drivers/passthrough/amd/iommu_map.c
>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
>> @@ -467,6 +467,81 @@ int amd_iommu_reserve_domain_unity_unmap
>>  return rc;
>>  }
>>  
>> +int amd_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
>> +{
>> +unsigned int seg = 0 /* XXX */, bdf;
> 
> Is this XXX intended to stay?

Yes - we've already got 3 similar instances plus at least two where the
hardcoding of segment 0 is not otherwise marked.

Jan

> I can't say I'm fussed about multi-segment handling, given the absence
> of support on any hardware I've ever encountered.
> 
> ~Andrew
> 




Re: [PATCH v7 4/8] AMD/IOMMU: check IVMD ranges against host implementation limits

2021-08-26 Thread Jan Beulich
On 26.08.2021 15:16, Andrew Cooper wrote:
> On 26/08/2021 08:24, Jan Beulich wrote:
>> When such ranges can't be represented as 1:1 mappings in page tables,
>> reject them as presumably bogus. Note that when we detect features late
>> (because of EFRSup being clear in the ACPI tables), it would be quite a
>> bit of work to check for (and drop) out of range IVMD ranges, so IOMMU
>> initialization gets failed in this case instead.
>>
>> Signed-off-by: Jan Beulich 
>> Reviewed-by: Paul Durrant 
> 
> I'm not certain this is correct in combination with memory encryption.

Which we don't enable. I don't follow why you put this up as an
extra requirement. I'm adding checks based on IOMMU-specific data
alone. I think that's a fair and consistent step in the right
direction, no matter that there may be another step to go. Plus ...

> The upper bits are the KeyID, but we shouldn't find any of those set in
> an IVMD range.  I think at a minimum, we need to reduce the address
> check by the stolen bits for encryption, which gives a lower bound.

... aren't you asking here to consume data I'm only making
available in the still pending "x86/AMD: make HT range dynamic for
Fam17 and up", where I (now, i.e. v2) adjust paddr_bits accordingly?
Else I'm afraid I don't even know what you're talking about.

Jan




[PATCH] PCI/MSI: skip masking MSI on Xen PV

2021-08-26 Thread Marek Marczykowski-Górecki
When running as Xen PV guest, masking MSI is a responsibility of the
hypervisor. Guest has no write access to relevant BAR at all - when it
tries to, it results in a crash like this:

BUG: unable to handle page fault for address: c9004069100c
#PF: supervisor write access in kernel mode
#PF: error_code(0x0003) - permissions violation
PGD 18f1c067 P4D 18f1c067 PUD 4dbd067 PMD 4fba067 PTE 8010febd4075
Oops: 0003 [#1] SMP NOPTI
CPU: 0 PID: 234 Comm: kworker/0:2 Tainted: GW 
5.14.0-rc7-1.fc32.qubes.x86_64 #15
Workqueue: events work_for_cpu_fn
RIP: e030:__pci_enable_msix_range.part.0+0x26b/0x5f0
Code: 2f 96 ff 48 89 44 24 28 48 89 c7 48 85 c0 0f 84 f6 01 00 00 45 0f b7 
f6 48 8d 40 0c ba 01 00 00 00 49 c1 e6 04 4a 8d 4c 37 1c <89> 10 48 83 c0 10 48 
39 c1 75 f5 41 0f b6 44 24 6a 84 c0 0f 84 48
RSP: e02b:c9004018bd50 EFLAGS: 00010212
RAX: c9004069100c RBX: 88800ed412f8 RCX: c9004069105c
RDX: 0001 RSI: 000febd4 RDI: c90040691000
RBP: 0003 R08:  R09: febd404f
R10: 7ff0 R11: 88800ee8ae40 R12: 88800ed41000
R13:  R14: 0040 R15: feba
FS:  () GS:88801840() knlGS:
CS:  e030 DS:  ES:  CR0: 80050033
CR2: 807f5ea0 CR3: 12f6a000 CR4: 0660
Call Trace:
 e1000e_set_interrupt_capability+0xbf/0xd0 [e1000e]
 e1000_probe+0x41f/0xdb0 [e1000e]
 local_pci_probe+0x42/0x80
(...)

There is pci_msi_ignore_mask variable for bypassing MSI masking on Xen
PV, but msix_mask_all() missed checking it. Add the check there too.

Fixes: 7d5ec3d36123 ("PCI/MSI: Mask all unused MSI-X entries")
Cc: sta...@vger.kernel.org
Signed-off-by: Marek Marczykowski-Górecki 
---
Cc: xen-devel@lists.xenproject.org
---
 drivers/pci/msi.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index e5e75331b415..3a9f4f8ad8f9 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -776,6 +776,9 @@ static void msix_mask_all(void __iomem *base, int tsize)
u32 ctrl = PCI_MSIX_ENTRY_CTRL_MASKBIT;
int i;
 
+   if (pci_msi_ignore_mask)
+   return;
+
for (i = 0; i < tsize; i++, base += PCI_MSIX_ENTRY_SIZE)
writel(ctrl, base + PCI_MSIX_ENTRY_VECTOR_CTRL);
 }
-- 
2.31.1




Re: [PATCH v7 6/8] AMD/IOMMU: provide function backing XENMEM_reserved_device_memory_map

2021-08-26 Thread Andrew Cooper
On 26/08/2021 08:25, Jan Beulich wrote:
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -467,6 +467,81 @@ int amd_iommu_reserve_domain_unity_unmap
>  return rc;
>  }
>  
> +int amd_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
> +{
> +unsigned int seg = 0 /* XXX */, bdf;

Is this XXX intended to stay?

I can't say I'm fussed about multi-segment handling, given the absence
of support on any hardware I've ever encountered.

~Andrew




Re: [PATCH v1 01/14] xen/pci: Refactor MSI code that implements MSI functionality within XEN

2021-08-26 Thread Daniel P. Smith
On 8/19/21 8:02 AM, Rahul Singh wrote:
> MSI code that implements MSI functionality to support MSI within XEN is
> not usable on ARM. Move the code under CONFIG_HAS_PCI_MSI flag to gate
> the code for ARM.
> 
> No functional change intended.
> 
> Signed-off-by: Rahul Singh 
> ---
>  xen/arch/x86/Kconfig |  1 +
>  xen/drivers/passthrough/Makefile |  1 +
>  xen/drivers/passthrough/msi.c| 96 
>  xen/drivers/passthrough/pci.c| 54 +-
>  xen/drivers/pci/Kconfig  |  4 ++
>  xen/include/xen/msi.h| 56 +++
>  xen/xsm/flask/hooks.c|  8 +--
>  7 files changed, 175 insertions(+), 45 deletions(-)
>  create mode 100644 xen/drivers/passthrough/msi.c
>  create mode 100644 xen/include/xen/msi.h
> 

...

> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
> index f1a1217c98..fdcfeb984c 100644
> --- a/xen/xsm/flask/hooks.c
> +++ b/xen/xsm/flask/hooks.c
> @@ -21,7 +21,7 @@
>  #include 
>  #include 
>  #include 
> -#ifdef CONFIG_HAS_PCI
> +#ifdef CONFIG_HAS_PCI_MSI
>  #include 
>  #endif
>  #include 
> @@ -114,7 +114,7 @@ static int get_irq_sid(int irq, u32 *sid, struct 
> avc_audit_data *ad)
>  }
>  return security_irq_sid(irq, sid);
>  }
> -#ifdef CONFIG_HAS_PCI
> +#ifdef CONFIG_HAS_PCI_MSI
>  {
>  struct irq_desc *desc = irq_to_desc(irq);
>  if ( desc->msi_desc && desc->msi_desc->dev ) {
> @@ -874,7 +874,7 @@ static int flask_map_domain_pirq (struct domain *d)
>  static int flask_map_domain_msi (struct domain *d, int irq, const void *data,
>   u32 *sid, struct avc_audit_data *ad)
>  {
> -#ifdef CONFIG_HAS_PCI
> +#ifdef CONFIG_HAS_PCI_MSI
>  const struct msi_info *msi = data;
>  u32 machine_bdf = (msi->seg << 16) | (msi->bus << 8) | msi->devfn;
>  
> @@ -940,7 +940,7 @@ static int flask_unmap_domain_pirq (struct domain *d)
>  static int flask_unmap_domain_msi (struct domain *d, int irq, const void 
> *data,
> u32 *sid, struct avc_audit_data *ad)
>  {
> -#ifdef CONFIG_HAS_PCI
> +#ifdef CONFIG_HAS_PCI_MSI
>  const struct pci_dev *pdev = data;
>  u32 machine_bdf = (pdev->seg << 16) | (pdev->bus << 8) | pdev->devfn;
>  
> 

Straightforward, so I see no issue with the flask related changes.

Reviewed-by: Daniel P. Smith 


v/r
dps



Re: [PATCH 4/9] gnttab: drop GNTMAP_can_fail

2021-08-26 Thread Andrew Cooper
On 26/08/2021 14:03, Jan Beulich wrote:
> On 26.08.2021 13:45, Andrew Cooper wrote:
>> On 26/08/2021 11:13, Jan Beulich wrote:
>>> There's neither documentation of what this flag is supposed to mean, nor
>>> any implementation. With this, don't even bother enclosing the #define-s
>>> in a __XEN_INTERFACE_VERSION__ conditional, but drop them altogether.
>>>
>>> Signed-off-by: Jan Beulich 
>> It was introduced in 4d45702cf0398 along with GNTST_eagain, and the
>> commit message hints that it is for xen-paging
>>
>> Furthermore, the first use of GNTST_eagain was in ecb35ecb79e0 for
>> trying to map/copy a paged-out frame.
>>
>> Therefore I think it is reasonable to conclude that there was a planned
>> interaction between GNTMAP_can_fail and paging, which presumably would
>> have been "don't pull this in from disk if it is paged out".
> I suppose that's (far fetched?) guesswork.

Not really - the phrase "far fetched" loosely translates as implausible
or unlikely, and I wouldn't characterise the suggestion as implausible.

It is guesswork, and the most plausible guess I can think of.  It is
clear that GNTMAP_can_fail is related to paging somehow, which means
there is some interaction with the gref not being mappable right now.

>
>> I think it is fine to drop, as no implementation has ever existed in
>> Xen, but it would be helpful to have the history summarised in the
>> commit message.
> I've extended it to
>
> "There's neither documentation of what this flag is supposed to mean, nor
>  any implementation. Commit 4d45702cf0398 ("paging: Updates to public
>  grant table header file") suggests there might have been plans to use it
>  for interaction with mem-paging, but no such functionality has ever
>  materialized. With this, don't even bother enclosing the #define-s in a
>  __XEN_INTERFACE_VERSION__ conditional, but drop them altogether."

LGTM.  Reviewed-by: Andrew Cooper 

>
> Jan
>





Re: [PATCH v7 4/8] AMD/IOMMU: check IVMD ranges against host implementation limits

2021-08-26 Thread Andrew Cooper
On 26/08/2021 08:24, Jan Beulich wrote:
> When such ranges can't be represented as 1:1 mappings in page tables,
> reject them as presumably bogus. Note that when we detect features late
> (because of EFRSup being clear in the ACPI tables), it would be quite a
> bit of work to check for (and drop) out of range IVMD ranges, so IOMMU
> initialization gets failed in this case instead.
>
> Signed-off-by: Jan Beulich 
> Reviewed-by: Paul Durrant 

I'm not certain this is correct in combination with memory encryption.

The upper bits are the KeyID, but we shouldn't find any of those set in
an IVMD range.  I think at a minimum, we need to reduce the address
check by the stolen bits for encryption, which gives a lower bound.
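
As a rough illustration of that adjustment (keyid_bits is an assumed
quantity here, not something the current code knows about):

    /* With memory encryption, the top keyid_bits of a physical address
     * carry the KeyID, so an IVMD range must fit below the remaining bits. */
    unsigned int usable_bits = paddr_bits - keyid_bits;
    paddr_t max_ivmd_addr = ((paddr_t)1 << usable_bits) - 1;

    if ( limit > max_ivmd_addr )
        return -EIO;  /* not representable once the stolen bits are excluded */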

~Andrew




Re: [PATCH v7 3/8] AMD/IOMMU: improve (extended) feature detection

2021-08-26 Thread Jan Beulich
On 26.08.2021 15:02, Andrew Cooper wrote:
> On 26/08/2021 08:23, Jan Beulich wrote:
>> First of all the documentation is very clear about ACPI table data
>> superseding raw register data. Use raw register data only if EFRSup is
>> clear in the ACPI tables (which may still go too far). Additionally if
>> this flag is clear, the IVRS type 11H table is reserved and hence may
>> not be recognized.
> 
> The spec says:
> 
> Software Implementation Note: Information conveyed in the IVRS overrides
> the corresponding information available through the IOMMU hardware
> registers. System software is required to honor the ACPI settings.
> 
> This suggests that if there is an ACPI table, the hardware registers
> shouldn't be followed.
> 
> Given what else is broken when there is no ACPI table, I think we can
> (and should) not support this configuration.

Well, we don't. We do require ACPI tables. But that's not the question
here. Instead what this is about is whether the ACPI table has EFRSup
clear. This flag being clear when the register actually exists is imo
more likely a firmware bug.

Plus I didn't want to go too far in one step; I do acknowledge that
we may want to be even more strict down the road by saying "(which may
still go too far)".

>> Furthermore propagate IVRS type 10H data into the feature flags
>> recorded, as the full extended features field is available in type 11H
>> only.
>>
>> Note that this also makes necessary to stop the bad practice of us
>> finding a type 11H IVHD entry, but still processing the type 10H one
>> in detect_iommu_acpi()'s invocation of amd_iommu_detect_one_acpi().
> 
> I could have sworn I read in the spec that if 11H is present, 10H should
> be ignored, but I can't actually locate a statement to this effect.

What I can find is "It is recommended for system software to detect
IOMMU features from the fields in the IVHD Type11h structure
information, superseding information in Type10h block and MMIO
registers."

Jan




Re: [PATCH 4/9] gnttab: drop GNTMAP_can_fail

2021-08-26 Thread Jan Beulich
On 26.08.2021 13:45, Andrew Cooper wrote:
> On 26/08/2021 11:13, Jan Beulich wrote:
>> There's neither documentation of what this flag is supposed to mean, nor
>> any implementation. With this, don't even bother enclosing the #define-s
>> in a __XEN_INTERFACE_VERSION__ conditional, but drop them altogether.
>>
>> Signed-off-by: Jan Beulich 
> 
> It was introduced in 4d45702cf0398 along with GNTST_eagain, and the
> commit message hints that it is for xen-paging
> 
> Furthermore, the first use of GNTST_eagain was in ecb35ecb79e0 for
> trying to map/copy a paged-out frame.
> 
> Therefore I think it is reasonable to conclude that there was a planned
> interaction between GNTMAP_can_fail and paging, which presumably would
> have been "don't pull this in from disk if it is paged out".

I suppose that's (far fetched?) guesswork.

> I think it is fine to drop, as no implementation has ever existed in
> Xen, but it would be helpful to have the history summarised in the
> commit message.

I've extended it to

"There's neither documentation of what this flag is supposed to mean, nor
 any implementation. Commit 4d45702cf0398 ("paging: Updates to public
 grant table header file") suggests there might have been plans to use it
 for interaction with mem-paging, but no such functionality has ever
 materialized. With this, don't even bother enclosing the #define-s in a
 __XEN_INTERFACE_VERSION__ conditional, but drop them altogether."

Jan




Re: [PATCH v7 3/8] AMD/IOMMU: improve (extended) feature detection

2021-08-26 Thread Andrew Cooper
On 26/08/2021 08:23, Jan Beulich wrote:
> First of all the documentation is very clear about ACPI table data
> superseding raw register data. Use raw register data only if EFRSup is
> clear in the ACPI tables (which may still go too far). Additionally if
> this flag is clear, the IVRS type 11H table is reserved and hence may
> not be recognized.

The spec says:

Software Implementation Note: Information conveyed in the IVRS overrides
the corresponding information available through the IOMMU hardware
registers. System software is required to honor the ACPI settings.

This suggests that if there is an ACPI table, the hardware registers
shouldn't be followed.

Given what else is broken when there is no ACPI table, I think we can
(and should) not support this configuration.

> Furthermore propagate IVRS type 10H data into the feature flags
> recorded, as the full extended features field is available in type 11H
> only.
>
> Note that this also makes necessary to stop the bad practice of us
> finding a type 11H IVHD entry, but still processing the type 10H one
> in detect_iommu_acpi()'s invocation of amd_iommu_detect_one_acpi().

I could have sworn I read in the spec that if 11H is present, 10H should
be ignored, but I can't actually locate a statement to this effect.

~Andrew

>
> Note also that the features.raw check in amd_iommu_prepare_one() needs
> replacing, now that the field can also be populated by different means.
> Key IOMMUv2 availability off of IVHD type not being 10H, and then move
> it a function layer up, so that it would be set only once all IOMMUs
> have been successfully prepared.
>
> Signed-off-by: Jan Beulich 
> Reviewed-by: Paul Durrant 





Re: [PATCH 07/17] IOMMU/x86: restrict IO-APIC mappings for PV Dom0

2021-08-26 Thread Jan Beulich
On 26.08.2021 13:57, Andrew Cooper wrote:
> On 24/08/2021 15:21, Jan Beulich wrote:
>> While already the case for PVH, there's no reason to treat PV
>> differently here, though of course the addresses get taken from another
>> source in this case. Except that, to match CPU side mappings, by default
>> we permit r/o ones. This then also means we now deal consistently with
>> IO-APICs whose MMIO is or is not covered by E820 reserved regions.
>>
>> Signed-off-by: Jan Beulich 
> 
> Why do we give PV dom0 a mapping of the IO-APIC?  Having thought about
> it, it cannot possibly be usable.
> 
> IO-APICs use an indirect window, and Xen doesn't perform any
> write-emulation (that I can see), so the guest can't even read the data
> register and work out which register it represents.  It also can't do an
> atomic 64bit read across the index and data registers, as that is
> explicitly undefined behaviour in the IO-APIC spec.
> 
> On the other hand, we do have PHYSDEVOP_apic_{read,write} which, despite
> the name, is for the IO-APIC not the LAPIC, and I bet these hypercalls
> were introduced upon discovering that a read-only mapping of the
> IO-APIC is totally useless.
> 
> I think we can safely not expose the IO-APICs to PV dom0 at all, which
> simplifies the memory handling further.

The reason we do expose it r/o is that some ACPI implementations access
(read and write) some RTEs from AML. If we don't allow Dom0 to map the
IO-APIC, Dom0 will typically fail to boot on such systems. What we have
right now seems to be good enough for those systems, no matter that the
data they get to read makes little sense. If we learned of systems
where this isn't sufficient, we'd have to implement more complete read
emulation (i.e. latching writes to the window register while still
discarding writes to the data register).
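
A sketch of what that fuller emulation might look like (simplified, ignoring
synchronization with Xen's own IO-APIC accesses; the function is hypothetical):

    #define IOAPIC_REGSEL 0x00
    #define IOAPIC_IOWIN  0x10

    /* Invoked when Dom0 write-faults on its r/o IO-APIC mapping. */
    static void ioapic_ro_write_fault(void __iomem *base, unsigned int off,
                                      uint32_t val)
    {
        if ( off == IOAPIC_REGSEL )
            /* Latch the index, so a subsequent IOWIN read through the r/o
             * mapping returns the RTE the guest actually selected. */
            writel(val, base + IOAPIC_REGSEL);
        /* Writes to IOAPIC_IOWIN (the data register) are discarded. */
    }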

If we produced a not-present PTE instead of a r/o one for such mapping
requests, I'm afraid we'd actually further complicate memory handling,
because we'd then need to consider for emulation also n/p #PF, not just
writes to r/o mappings.

This said - I'd also be fine with consistently not mapping the IO-APICs
in the IOMMU page tables. It was you to request CPU and IOMMU mappings
to match. What I want to do away with is the present mixture, derived
from the E820 type covering the IO-APIC space.

Jan




Re: [PATCH 17/17] IOMMU/x86: drop pointless NULL checks

2021-08-26 Thread Jan Beulich
On 26.08.2021 14:05, Andrew Cooper wrote:
> On 24/08/2021 15:27, Jan Beulich wrote:
>> --- a/xen/drivers/passthrough/vtd/utils.c
>> +++ b/xen/drivers/passthrough/vtd/utils.c
>> @@ -106,11 +106,6 @@ void print_vtd_entries(struct vtd_iommu
>>  }
>>  
>>  root_entry = (struct root_entry 
>> *)map_vtd_domain_page(iommu->root_maddr);
> 
> Seeing as you're actually cleaning up mapping calls, drop this cast too?

There are more of these, so this would be yet another cleanup patch.
I didn't get annoyed enough by these to put one together; instead I
was hoping that we might touch these lines anyway at some point. E.g.
when doing away with the funny map_vtd_domain_page() wrapper around
map_domain_page() ...

> Either way, Acked-by: Andrew Cooper 

Thanks.

Jan




Re: [PATCH v7 2/8] AMD/IOMMU: obtain IVHD type to use earlier

2021-08-26 Thread Jan Beulich
On 26.08.2021 14:30, Andrew Cooper wrote:
> On 26/08/2021 08:23, Jan Beulich wrote:
>> Doing this in amd_iommu_prepare() is too late for it, in particular, to
>> be used in amd_iommu_detect_one_acpi(), as a subsequent change will want
>> to do. Moving it immediately ahead of amd_iommu_detect_acpi() is
>> (luckily) pretty simple, (pretty importantly) without breaking
>> amd_iommu_prepare()'s logic to prevent multiple processing.
>>
>> This involves moving table checksumming, as
>> amd_iommu_get_supported_ivhd_type() ->  get_supported_ivhd_type() will
>> now be invoked before amd_iommu_detect_acpi()  -> detect_iommu_acpi(). In
>> the course of dojng so stop open-coding acpi_tb_checksum(), seeing that
> 
> doing.
> 
>> --- a/xen/drivers/passthrough/amd/iommu_acpi.c
>> +++ b/xen/drivers/passthrough/amd/iommu_acpi.c
>> @@ -1150,20 +1152,7 @@ static int __init parse_ivrs_table(struc
>>  static int __init detect_iommu_acpi(struct acpi_table_header *table)
>>  {
>>  const struct acpi_ivrs_header *ivrs_block;
>> -unsigned long i;
>>  unsigned long length = sizeof(struct acpi_table_ivrs);
>> -u8 checksum, *raw_table;
>> -
>> -/* validate checksum: sum of entire table == 0 */
>> -checksum = 0;
>> -raw_table = (u8 *)table;
>> -for ( i = 0; i < table->length; i++ )
>> -checksum += raw_table[i];
>> -if ( checksum )
>> -{
>> -AMD_IOMMU_DEBUG("IVRS Error: Invalid Checksum %#x\n", checksum);
>> -return -ENODEV;
>> -}
>>  
>>  while ( table->length > (length + sizeof(*ivrs_block)) )
>>  {
>> @@ -1300,6 +1289,15 @@ get_supported_ivhd_type(struct acpi_tabl
>>  {
>>  size_t length = sizeof(struct acpi_table_ivrs);
>>  const struct acpi_ivrs_header *ivrs_block, *blk = NULL;
>> +uint8_t checksum;
>> +
>> +/* Validate checksum: Sum of entire table == 0. */
>> +checksum = acpi_tb_checksum(ACPI_CAST_PTR(uint8_t, table), 
>> table->length);
>> +if ( checksum )
>> +{
>> +AMD_IOMMU_DEBUG("IVRS Error: Invalid Checksum %#x\n", checksum);
> 
> I know you're just moving code, but this really needs to be a visible
> error.  It's "I'm turning off the IOMMU because the ACPI table is bad",
> which is about as serious as errors come.

I'll wait for us settling on patch 1 in this regard, and then follow
the same model here.

Jan




Re: [PATCH v7 1/8] AMD/IOMMU: check / convert IVMD ranges for being / to be reserved

2021-08-26 Thread Jan Beulich
On 26.08.2021 14:10, Andrew Cooper wrote:
> On 26/08/2021 08:23, Jan Beulich wrote:
>> While the specification doesn't say so, just like for VT-d's RMRRs no
>> good can come from these ranges being e.g. conventional RAM or entirely
>> unmarked and hence usable for placing e.g. PCI device BARs. Check
>> whether they are, and put in some limited effort to convert to reserved.
>> (More advanced logic can be added if actual problems are found with this
>> simplistic variant.)
>>
>> Signed-off-by: Jan Beulich 
>> Reviewed-by: Paul Durrant 
>> ---
>> v7: Re-base.
>> v5: New.
>>
>> --- a/xen/drivers/passthrough/amd/iommu_acpi.c
>> +++ b/xen/drivers/passthrough/amd/iommu_acpi.c
>> @@ -384,6 +384,38 @@ static int __init parse_ivmd_block(const
>>  AMD_IOMMU_DEBUG("IVMD Block: type %#x phys %#lx len %#lx\n",
>>  ivmd_block->header.type, start_addr, mem_length);
>>  
>> +if ( !e820_all_mapped(base, limit + PAGE_SIZE, E820_RESERVED) )
>> +{
>> +paddr_t addr;
>> +
>> +AMD_IOMMU_DEBUG("IVMD: [%lx,%lx) is not (entirely) in reserved 
>> memory\n",
>> +base, limit + PAGE_SIZE);
>> +
>> +for ( addr = base; addr <= limit; addr += PAGE_SIZE )
>> +{
>> +unsigned int type = page_get_ram_type(maddr_to_mfn(addr));
>> +
>> +if ( type == RAM_TYPE_UNKNOWN )
>> +{
>> +if ( e820_add_range(, addr, addr + PAGE_SIZE,
>> +E820_RESERVED) )
>> +continue;
>> +AMD_IOMMU_DEBUG("IVMD Error: Page at %lx couldn't be 
>> reserved\n",
>> +addr);
>> +return -EIO;
>> +}
>> +
>> +/* Types which won't be handed out are considered good enough. 
>> */
>> +if ( !(type & (RAM_TYPE_RESERVED | RAM_TYPE_ACPI |
>> +   RAM_TYPE_UNUSABLE)) )
>> +continue;
>> +
>> +AMD_IOMMU_DEBUG("IVMD Error: Page at %lx can't be converted\n",
>> +addr);
> 
> I think these print messages need to be more than just debug.  The first
> one is a warning, whereas the final two are hard errors liable to impact
> the correct running of the system.

Well, people would observe IOMMUs not getting put in use. I was following
existing style in this regard on the assumption that in such an event
people would (be told to) enable "iommu=debug". Hence ...

> Especially as you're putting them in to try and spot problem cases, they
> should be visible by default for when we inevitably get bug reports to
> xen-devel saying "something changed with passthrough in Xen 4.16".

... I can convert to ordinary printk(), provided you're convinced the
described model isn't reasonable and introducing a logging inconsistency
is worth it.

Jan




Re: [PATCH v7 2/8] AMD/IOMMU: obtain IVHD type to use earlier

2021-08-26 Thread Andrew Cooper
On 26/08/2021 08:23, Jan Beulich wrote:
> Doing this in amd_iommu_prepare() is too late for it, in particular, to
> be used in amd_iommu_detect_one_acpi(), as a subsequent change will want
> to do. Moving it immediately ahead of amd_iommu_detect_acpi() is
> (luckily) pretty simple, (pretty importantly) without breaking
> amd_iommu_prepare()'s logic to prevent multiple processing.
>
> This involves moving table checksumming, as
> amd_iommu_get_supported_ivhd_type() ->  get_supported_ivhd_type() will
> now be invoked before amd_iommu_detect_acpi()  -> detect_iommu_acpi(). In
> the course of dojng so stop open-coding acpi_tb_checksum(), seeing that

doing.

> --- a/xen/drivers/passthrough/amd/iommu_acpi.c
> +++ b/xen/drivers/passthrough/amd/iommu_acpi.c
> @@ -1150,20 +1152,7 @@ static int __init parse_ivrs_table(struc
>  static int __init detect_iommu_acpi(struct acpi_table_header *table)
>  {
>  const struct acpi_ivrs_header *ivrs_block;
> -unsigned long i;
>  unsigned long length = sizeof(struct acpi_table_ivrs);
> -u8 checksum, *raw_table;
> -
> -/* validate checksum: sum of entire table == 0 */
> -checksum = 0;
> -raw_table = (u8 *)table;
> -for ( i = 0; i < table->length; i++ )
> -checksum += raw_table[i];
> -if ( checksum )
> -{
> -AMD_IOMMU_DEBUG("IVRS Error: Invalid Checksum %#x\n", checksum);
> -return -ENODEV;
> -}
>  
>  while ( table->length > (length + sizeof(*ivrs_block)) )
>  {
> @@ -1300,6 +1289,15 @@ get_supported_ivhd_type(struct acpi_tabl
>  {
>  size_t length = sizeof(struct acpi_table_ivrs);
>  const struct acpi_ivrs_header *ivrs_block, *blk = NULL;
> +uint8_t checksum;
> +
> +/* Validate checksum: Sum of entire table == 0. */
> +checksum = acpi_tb_checksum(ACPI_CAST_PTR(uint8_t, table), 
> table->length);
> +if ( checksum )
> +{
> +AMD_IOMMU_DEBUG("IVRS Error: Invalid Checksum %#x\n", checksum);

I know you're just moving code, but this really needs to be a visible
error.  It's "I'm turning off the IOMMU because the ACPI table is bad",
which is about as serious as errors come.

~Andrew




Re: [XEN RFC PATCH 26/40] xen/arm: Add boot and secondary CPU to NUMA system

2021-08-26 Thread Jan Beulich
On 26.08.2021 14:08, Wei Chen wrote:
>> From: Jan Beulich 
>> Sent: 26 August 2021 17:40
>>
>> On 26.08.2021 10:49, Julien Grall wrote:
>>> Right, but again, why do you want to solve the problem on Arm and not
>>> x86? After all, NUMA is not architecture specific (in fact you move most
>>> of the code in common).
>>>
> 
> I am not very familiar with x86, so when I was composing this patch series,
> I always thought that if I could solve a problem inside the Arm arch code,
> I would solve it there. That seems a bit conservative, and not the right
> way to solve this problem.
> 
>>> In fact, the risk, is someone may read arch/x86 and doesn't realize the
>>> CPU is not in the node until late on Arm.
>>>
>>> So I think we should call numa_add_cpu() around the same place on all
>>> the architectures.
>>
>> FWIW: +1
> 
> I agree, as does Jan in this discussion. How about following the current
> x86 numa_add_cpu behavior in __start_xen, but adding some code to revert
> numa_add_cpu when cpu_up fails (both Arm and x86)?

Sure - if we don't clean up properly on x86 on a failure path, I'm all
for having that fixed.

Jan




RE: Enabling hypervisor agnosticism for VirtIO backends

2021-08-26 Thread Wei Chen
Hi Akashi,

> -Original Message-
> From: AKASHI Takahiro 
> Sent: 26 August 2021 17:41
> To: Wei Chen 
> Cc: Oleksandr Tyshchenko ; Stefano Stabellini
> ; Alex Bennée ; Kaly Xin
> ; Stratos Mailing List ;
> virtio-...@lists.oasis-open.org; Arnd Bergmann ;
> Viresh Kumar ; Stefano Stabellini
> ; stefa...@redhat.com; Jan Kiszka
> ; Carl van Schaik ;
> prat...@quicinc.com; Srivatsa Vaddagiri ; Jean-
> Philippe Brucker ; Mathieu Poirier
> ; Oleksandr Tyshchenko
> ; Bertrand Marquis
> ; Artem Mygaiev ; Julien
> Grall ; Juergen Gross ; Paul Durrant
> ; Xen Devel 
> Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
>
> Hi Wei,
>
> On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
> > On Wed, Aug 18, 2021 at 08:35:51AM +, Wei Chen wrote:
> > > Hi Akashi,
> > >
> > > > -Original Message-
> > > > From: AKASHI Takahiro 
> > > > Sent: 18 August 2021 13:39
> > > > To: Wei Chen 
> > > > Cc: Oleksandr Tyshchenko ; Stefano Stabellini
> > > > ; Alex Bennée ;
> Stratos
> > > > Mailing List ; virtio-
> dev@lists.oasis-
> > > > open.org; Arnd Bergmann ; Viresh Kumar
> > > > ; Stefano Stabellini
> > > > ; stefa...@redhat.com; Jan Kiszka
> > > > ; Carl van Schaik
> ;
> > > > prat...@quicinc.com; Srivatsa Vaddagiri ;
> Jean-
> > > > Philippe Brucker ; Mathieu Poirier
> > > > ; Oleksandr Tyshchenko
> > > > ; Bertrand Marquis
> > > > ; Artem Mygaiev ;
> Julien
> > > > Grall ; Juergen Gross ; Paul
> Durrant
> > > > ; Xen Devel 
> > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > >
> > > > On Tue, Aug 17, 2021 at 08:39:09AM +, Wei Chen wrote:
> > > > > Hi Akashi,
> > > > >
> > > > > > -Original Message-
> > > > > > From: AKASHI Takahiro 
> > > > > > Sent: 17 August 2021 16:08
> > > > > > To: Wei Chen 
> > > > > > Cc: Oleksandr Tyshchenko ; Stefano
> Stabellini
> > > > > > ; Alex Bennée ;
> > > > Stratos
> > > > > > Mailing List ; virtio-
> > > > dev@lists.oasis-
> > > > > > open.org; Arnd Bergmann ; Viresh Kumar
> > > > > > ; Stefano Stabellini
> > > > > > ; stefa...@redhat.com; Jan Kiszka
> > > > > > ; Carl van Schaik
> ;
> > > > > > prat...@quicinc.com; Srivatsa Vaddagiri ;
> Jean-
> > > > > > Philippe Brucker ; Mathieu Poirier
> > > > > > ; Oleksandr Tyshchenko
> > > > > > ; Bertrand Marquis
> > > > > > ; Artem Mygaiev
> ;
> > > > Julien
> > > > > > Grall ; Juergen Gross ; Paul
> Durrant
> > > > > > ; Xen Devel 
> > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > > > >
> > > > > > Hi Wei, Oleksandr,
> > > > > >
> > > > > > On Mon, Aug 16, 2021 at 10:04:03AM +, Wei Chen wrote:
> > > > > > > Hi All,
> > > > > > >
> > > > > > > Thanks to Stefano for linking my kvmtool-for-Xen proposal here.
> > > > > > > This proposal is still being discussed in the Xen and KVM communities.
> > > > > > > The main work is to decouple kvmtool from KVM so that
> > > > > > > other hypervisors can reuse the virtual device implementations.
> > > > > > >
> > > > > > > In this case, we need to introduce an intermediate hypervisor
> > > > > > > layer for VMM abstraction, which is, I think, very close
> > > > > > > to stratos' virtio hypervisor agnosticism work.
> > > > > >
> > > > > > # My proposal[1] comes from my own idea and doesn't always
> represent
> > > > > > # Linaro's view on this subject nor reflect Alex's concerns.
> > > > Nevertheless,
> > > > > >
> > > > > > Your idea and my proposal seem to share the same background.
> > > > > > Both have the similar goal and currently start with, at first,
> Xen
> > > > > > and are based on kvm-tool. (Actually, my work is derived from
> > > > > > EPAM's virtio-disk, which is also based on kvm-tool.)
> > > > > >
> > > > > > In particular, the abstraction of hypervisor interfaces has a
> same
> > > > > > set of interfaces (for your "struct vmm_impl" and my "RPC
> interfaces").
> > > > > > This is not co-incident as we both share the same origin as I
> said
> > > > above.
> > > > > > And so we will also share the same issues. One of them is a way
> of
> > > > > > "sharing/mapping FE's memory". There is some trade-off between
> > > > > > the portability and the performance impact.
> > > > > > So we can discuss the topic here in this ML, too.
> > > > > > (See Alex's original email, too).
> > > > > >
> > > > > Yes, I agree.
> > > > >
> > > > > > On the other hand, my approach aims to create a "single-binary"
> > > > solution
> > > > > > in which the same binary of BE vm could run on any hypervisors.
> > > > > > Somehow similar to your "proposal-#2" in [2], but in my solution,
> all
> > > > > > the hypervisor-specific code would be put into another entity
> (VM),
> > > > > > named "virtio-proxy" and the abstracted operations are served
> via RPC.
> > > > > > (In this sense, BE is hypervisor-agnostic but might have OS
> > > > dependency.)
> > > > > > But I know that we need discuss if this is a requirement even
> > > > > > in Stratos project or not. (Maybe not)
> > > > > >
> > > > >
> > > > > Sorry, I 

Re: [PATCH v7 1/8] AMD/IOMMU: check / convert IVMD ranges for being / to be reserved

2021-08-26 Thread Andrew Cooper
On 26/08/2021 08:23, Jan Beulich wrote:
> While the specification doesn't say so, just like for VT-d's RMRRs no
> good can come from these ranges being e.g. conventional RAM or entirely
> unmarked and hence usable for placing e.g. PCI device BARs. Check
> whether they are, and put in some limited effort to convert to reserved.
> (More advanced logic can be added if actual problems are found with this
> simplistic variant.)
>
> Signed-off-by: Jan Beulich 
> Reviewed-by: Paul Durrant 
> ---
> v7: Re-base.
> v5: New.
>
> --- a/xen/drivers/passthrough/amd/iommu_acpi.c
> +++ b/xen/drivers/passthrough/amd/iommu_acpi.c
> @@ -384,6 +384,38 @@ static int __init parse_ivmd_block(const
>  AMD_IOMMU_DEBUG("IVMD Block: type %#x phys %#lx len %#lx\n",
>  ivmd_block->header.type, start_addr, mem_length);
>  
> +if ( !e820_all_mapped(base, limit + PAGE_SIZE, E820_RESERVED) )
> +{
> +paddr_t addr;
> +
> +AMD_IOMMU_DEBUG("IVMD: [%lx,%lx) is not (entirely) in reserved 
> memory\n",
> +base, limit + PAGE_SIZE);
> +
> +for ( addr = base; addr <= limit; addr += PAGE_SIZE )
> +{
> +unsigned int type = page_get_ram_type(maddr_to_mfn(addr));
> +
> +if ( type == RAM_TYPE_UNKNOWN )
> +{
> +if ( e820_add_range(, addr, addr + PAGE_SIZE,
> +E820_RESERVED) )
> +continue;
> +AMD_IOMMU_DEBUG("IVMD Error: Page at %lx couldn't be 
> reserved\n",
> +addr);
> +return -EIO;
> +}
> +
> +/* Types which won't be handed out are considered good enough. */
> +if ( !(type & (RAM_TYPE_RESERVED | RAM_TYPE_ACPI |
> +   RAM_TYPE_UNUSABLE)) )
> +continue;
> +
> +AMD_IOMMU_DEBUG("IVMD Error: Page at %lx can't be converted\n",
> +addr);

I think these print messages need to be more than just debug.  The first
one is a warning, whereas the final two are hard errors liable to impact
the correct running of the system.

Especially as you're putting them in to try and spot problem cases, they
should be visible by default for when we inevitably get bug reports to
xen-devel saying "something changed with passthrough in Xen 4.16".

~Andrew


> +return -EIO;
> +}
> +}
> +
>  if ( ivmd_block->header.flags & ACPI_IVMD_EXCLUSION_RANGE )
>  exclusion = true;
>  else if ( ivmd_block->header.flags & ACPI_IVMD_UNITY )
>




RE: [XEN RFC PATCH 26/40] xen/arm: Add boot and secondary CPU to NUMA system

2021-08-26 Thread Wei Chen
Hi Jan, Julien,

> -Original Message-
> From: Jan Beulich 
> Sent: 26 August 2021 17:40
> To: Julien Grall ; Wei Chen 
> Cc: Bertrand Marquis ; xen-
> de...@lists.xenproject.org; sstabell...@kernel.org
> Subject: Re: [XEN RFC PATCH 26/40] xen/arm: Add boot and secondary CPU to
> NUMA system
> 
> On 26.08.2021 10:49, Julien Grall wrote:
> > On 26/08/2021 08:24, Wei Chen wrote:
> >>> -Original Message-
> >>> From: Julien Grall 
> >>> Sent: 26 August 2021 0:58
> >>> On 11/08/2021 11:24, Wei Chen wrote:
>  --- a/xen/arch/arm/smpboot.c
>  +++ b/xen/arch/arm/smpboot.c
>  @@ -358,6 +358,12 @@ void start_secondary(void)
>  */
> smp_wmb();
> 
>  +/*
>  + * If Xen is running on a NUMA off system, there will
>  + * be a node#0 at least.
>  + */
>  +numa_add_cpu(cpuid);
>  +
> >>>
> >>> On x86, numa_add_cpu() will be called before the pCPU is brought up. I
> >>> am not quite too sure why we are doing it differently here. Can you
> >>> clarify it?
> >>
> >> Of course we can invoke numa_add_cpu before cpu_up as x86. But in my
> tests,
> >> I found when cpu bring up failed, this cpu still be add to NUMA.
> Although
> >> this does not affect the execution of the code (because CPU is offline),
> >> But I don't think adding a offline CPU to NUMA makes sense.
> >
> > Right, but again, why do you want to solve the problem on Arm and not
> > x86? After all, NUMA is not architecture specific (in fact you move most
> > of the code in common).
> >

I am not very familiar with x86, so when I was composing this patch series,
my instinct was that whatever I could solve inside the Arm arch code, I
would solve inside the Arm arch code. That is a bit too conservative, and
not the right way to solve this problem.

> > In fact, the risk, is someone may read arch/x86 and doesn't realize the
> > CPU is not in the node until late on Arm.
> >
> > So I think we should call numa_add_cpu() around the same place on all
> > the architectures.
> 
> FWIW: +1

I agree, as does Jan in this discussion. How about following the
current x86 numa_add_cpu() behavior in __start_xen(), but adding some
code to revert numa_add_cpu() when cpu_up() fails (on both Arm and x86)?
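
Something like this rough sketch in __start_xen() (numa_remove_cpu() is
my assumption here - such a helper would need to be introduced if it
doesn't exist yet):

    int ret;

    for_each_present_cpu ( i )
    {
        numa_add_cpu(i);
        if ( (ret = cpu_up(i)) != 0 )
        {
            /* Revert the NUMA bookkeeping for the CPU that failed to come up. */
            numa_remove_cpu(i);
            printk("Failed to bring up CPU %u (error %d)\n", i, ret);
        }
    }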

> 
> Jan



Re: [PATCH 17/17] IOMMU/x86: drop pointless NULL checks

2021-08-26 Thread Andrew Cooper
On 24/08/2021 15:27, Jan Beulich wrote:
> --- a/xen/drivers/passthrough/vtd/utils.c
> +++ b/xen/drivers/passthrough/vtd/utils.c
> @@ -106,11 +106,6 @@ void print_vtd_entries(struct vtd_iommu
>  }
>  
>  root_entry = (struct root_entry *)map_vtd_domain_page(iommu->root_maddr);

Seeing as you're actually cleaning up mapping calls, drop this cast too?

Either way, Acked-by: Andrew Cooper 

> -if ( root_entry == NULL )
> -{
> -printk("root_entry == NULL\n");
> -return;
> -}
>  
>  printk("root_entry[%02x] = %"PRIx64"\n", bus, root_entry[bus].val);
>  if ( !root_present(root_entry[bus]) )
>




Re: [PATCH 07/17] IOMMU/x86: restrict IO-APIC mappings for PV Dom0

2021-08-26 Thread Andrew Cooper
On 24/08/2021 15:21, Jan Beulich wrote:
> While already the case for PVH, there's no reason to treat PV
> differently here, though of course the addresses get taken from another
> source in this case. Except that, to match CPU side mappings, by default
> we permit r/o ones. This then also means we now deal consistently with
> IO-APICs whose MMIO is or is not covered by E820 reserved regions.
>
> Signed-off-by: Jan Beulich 

Why do we give PV dom0 a mapping of the IO-APIC?  Having thought about
it, it cannot possibly be usable.

IO-APICs use an indirect window, and Xen doesn't perform any
write-emulation (that I can see), so the guest can't even read the data
register and work out which register it represents.  It also can't do an
atomic 64bit read across the index and data registers, as that is
explicitly undefined behaviour in the IO-APIC spec.
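
(For reference, the indirect window in question - a sketch of the usual
access pattern, register offsets per the IO-APIC datasheet:

    writel(reg, ioapic_base + 0x00);          /* IOREGSEL: select register index */
    uint32_t val = readl(ioapic_base + 0x10); /* IOWIN: read the selected register */

With a read-only mapping the IOREGSEL write is discarded, so the guest
cannot know which register a subsequent IOWIN read returns.)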

On the other hand, we do have PHYSDEVOP_apic_{read,write} which, despite
the name, are for the IO-APIC not the LAPIC, and I bet these hypercalls
were introduced upon discovering that a read-only mapping of the
IO-APIC is totally useless.

I think we can safely not expose the IO-APICs to PV dom0 at all, which
simplifies the memory handling further.

~Andrew




RE: [XEN RFC PATCH 23/40] xen/arm: introduce a helper to parse device tree memory node

2021-08-26 Thread Wei Chen
Hi Julien,

> -Original Message-
> From: Julien Grall 
> Sent: 2021年8月26日 16:22
> To: Wei Chen ; xen-devel@lists.xenproject.org;
> sstabell...@kernel.org
> Cc: Bertrand Marquis 
> Subject: Re: [XEN RFC PATCH 23/40] xen/arm: introduce a helper to parse
> device tree memory node
> 
> 
> 
> On 26/08/2021 07:35, Wei Chen wrote:
> > Hi Julien,
> 
> Hi Wei,
> 
> >> -Original Message-
> >> From: Julien Grall 
> >> Sent: 2021年8月25日 21:49
> >> To: Wei Chen ; xen-devel@lists.xenproject.org;
> >> sstabell...@kernel.org; jbeul...@suse.com
> >> Cc: Bertrand Marquis 
> >> Subject: Re: [XEN RFC PATCH 23/40] xen/arm: introduce a helper to parse
> >> device tree memory node
> >>
> >> Hi Wei,
> >>
> >> On 11/08/2021 11:24, Wei Chen wrote:
> >>> Memory blocks' NUMA ID information is stored in device tree's
> >>> memory nodes as "numa-node-id". We need a new helper to parse
> >>> and verify this ID from memory nodes.
> >>>
> >>> In order to support memory affinity in later use, the valid
> >>> memory ranges and NUMA ID will be saved to tables.
> >>>
> >>> Signed-off-by: Wei Chen 
> >>> ---
> >>>xen/arch/arm/numa_device_tree.c | 130
> 
> >>>1 file changed, 130 insertions(+)
> >>>
> >>> diff --git a/xen/arch/arm/numa_device_tree.c
> >> b/xen/arch/arm/numa_device_tree.c
> >>> index 37cc56acf3..bbe081dcd1 100644
> >>> --- a/xen/arch/arm/numa_device_tree.c
> >>> +++ b/xen/arch/arm/numa_device_tree.c
> >>> @@ -20,11 +20,13 @@
> >>>#include 
> >>>#include 
> >>>#include 
> >>> +#include 
> >>>#include 
> >>>#include 
> >>>
> >>>s8 device_tree_numa = 0;
> >>>static nodemask_t processor_nodes_parsed __initdata;
> >>> +static nodemask_t memory_nodes_parsed __initdata;
> >>>
> >>>static int srat_disabled(void)
> >>>{
> >>> @@ -55,6 +57,79 @@ static int __init
> >> dtb_numa_processor_affinity_init(nodeid_t node)
> >>>return 0;
> >>>}
> >>>
> >>> +/* Callback for parsing of the memory regions affinity */
> >>> +static int __init dtb_numa_memory_affinity_init(nodeid_t node,
> >>> +paddr_t start, paddr_t size)
> >>> +{
> >>
> >> The implementation of this function is quite similar to the ACPI
> >> version. Can this be abstracted?
> >
> > In my draft, I had tried to merge ACPI and DTB versions in one
> > function. I introduced a number of "if else" to distinguish ACPI
> > from DTB, especially ACPI hotplug. The function seems very messy.
> > Not enough benefits to make up for the mess, so I gave up.
> 
> I think you can get away from distinguishing between ACPI and DT in
> that helper:
>* ma->flags & ACPI_SRAT_MEM_HOTPLUGGABLE could be replaced by an
> argument indicating whether the region is hotpluggable (this would
> always be false for DT)
>* Access to memblk_hotplug can be stubbed (in the future we may want
> to consider memory hotplug even on Arm).
> 
> Do you still have the "if else" version? If so can you post it?
> 

I only tried that while drafting; because I was not satisfied with the
changes, I didn't save them as a patch.

I think your suggestions are worth trying again; I will do it
in the next version.
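
Maybe something along these lines (only a sketch - the name and exact
parameters are my assumption, not settled):

    /*
     * Shared by the ACPI (SRAT) and DT parsers. "hotplug" is always
     * false when called from the DT side.
     */
    static int __init numa_update_memory_affinity(nodeid_t node,
                                                  paddr_t start,
                                                  paddr_t size,
                                                  bool hotplug);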


> Cheers,
> 
> --
> Julien Grall


Re: [PATCH 4/9] gnttab: drop GNTMAP_can_fail

2021-08-26 Thread Andrew Cooper
On 26/08/2021 11:13, Jan Beulich wrote:
> There's neither documentation of what this flag is supposed to mean, nor
> any implementation. With this, don't even bother enclosing the #define-s
> in a __XEN_INTERFACE_VERSION__ conditional, but drop them altogether.
>
> Signed-off-by: Jan Beulich 

It was introduced in 4d45702cf0398 along with GNTST_eagain, and the
commit message hints that it is for xen-paging.

Furthermore, the first use of GNTST_eagain was in ecb35ecb79e0 for
trying to map/copy a paged-out frame.

Therefore I think it is reasonable to conclude that there was a planned
interaction between GNTMAP_can_fail and paging, which presumably would
have been "don't pull this in from disk if it is paged out".


I think it is fine to drop, as no implementation has ever existed in
Xen, but it would be helpful to have the history summarised in the
commit message.

~Andrew




[linux-linus test] 164481: regressions - FAIL

2021-08-26 Thread osstest service owner
flight 164481 linux-linus real [real]
http://logs.test-lab.xenproject.org/osstest/logs/164481/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-i386-xl-xsm7 xen-install  fail REGR. vs. 152332
 test-amd64-i386-qemut-rhel6hvm-intel  7 xen-install  fail REGR. vs. 152332
 test-amd64-i386-xl-qemuu-dmrestrict-amd64-dmrestrict 7 xen-install fail REGR. vs. 152332
 test-amd64-i386-xl-qemut-debianhvm-amd64  7 xen-install  fail REGR. vs. 152332
 test-amd64-i386-xl-qemuu-ws16-amd64  7 xen-install   fail REGR. vs. 152332
 test-amd64-i386-xl-qemuu-debianhvm-amd64-shadow 7 xen-install fail REGR. vs. 152332
 test-amd64-i386-qemuu-rhel6hvm-intel  7 xen-install  fail REGR. vs. 152332
 test-amd64-i386-examine   6 xen-install  fail REGR. vs. 152332
 test-amd64-i386-libvirt   7 xen-install  fail REGR. vs. 152332
 test-amd64-i386-xl-qemuu-debianhvm-i386-xsm 7 xen-install fail REGR. vs. 152332
 test-amd64-coresched-i386-xl  7 xen-install  fail REGR. vs. 152332
 test-amd64-i386-xl-qemuu-debianhvm-amd64  7 xen-install  fail REGR. vs. 152332
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 7 xen-install fail REGR. vs. 152332
 test-amd64-i386-xl7 xen-install  fail REGR. vs. 152332
 test-amd64-i386-qemut-rhel6hvm-amd  7 xen-installfail REGR. vs. 152332
 test-amd64-i386-xl-qemut-ws16-amd64  7 xen-install   fail REGR. vs. 152332
 test-amd64-i386-qemuu-rhel6hvm-amd  7 xen-installfail REGR. vs. 152332
 test-amd64-i386-pair 10 xen-install/src_host fail REGR. vs. 152332
 test-amd64-i386-pair 11 xen-install/dst_host fail REGR. vs. 152332
 test-amd64-i386-libvirt-xsm   7 xen-install  fail REGR. vs. 152332
 test-amd64-i386-xl-raw7 xen-install  fail REGR. vs. 152332
 test-amd64-i386-freebsd10-amd64  7 xen-install   fail REGR. vs. 152332
 test-amd64-i386-xl-pvshim 7 xen-install  fail REGR. vs. 152332
 test-amd64-i386-freebsd10-i386  7 xen-installfail REGR. vs. 152332
 test-amd64-i386-xl-qemut-debianhvm-i386-xsm 7 xen-install fail REGR. vs. 152332
 test-amd64-i386-xl-shadow 7 xen-install  fail REGR. vs. 152332
 test-amd64-i386-xl-qemut-win7-amd64  7 xen-install   fail REGR. vs. 152332
 test-amd64-i386-xl-qemuu-ovmf-amd64  7 xen-install   fail REGR. vs. 152332
 test-amd64-i386-xl-qemuu-win7-amd64  7 xen-install   fail REGR. vs. 152332
 test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm 7 xen-install fail REGR. vs. 152332
 test-amd64-i386-libvirt-pair 10 xen-install/src_host fail REGR. vs. 152332
 test-amd64-i386-libvirt-pair 11 xen-install/dst_host fail REGR. vs. 152332
 test-arm64-arm64-xl-credit1  13 debian-fixup fail REGR. vs. 152332
 test-arm64-arm64-xl  13 debian-fixup fail REGR. vs. 152332
 test-arm64-arm64-xl-thunderx 13 debian-fixup fail REGR. vs. 152332
 test-arm64-arm64-xl-credit2  13 debian-fixup fail REGR. vs. 152332
 test-arm64-arm64-xl-xsm  13 debian-fixup fail REGR. vs. 152332
 test-arm64-arm64-libvirt-xsm 13 debian-fixup fail REGR. vs. 152332

Regressions which are regarded as allowable (not blocking):
 test-amd64-amd64-xl-rtds 20 guest-localmigrate/x10   fail REGR. vs. 152332

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 152332
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 152332
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 152332
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 152332
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 152332
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 152332
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 152332
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass

Re: [PATCH MINI-OS] gnttab: drop GNTMAP_can_fail

2021-08-26 Thread Samuel Thibault
Jan Beulich, le jeu. 26 août 2021 12:20:26 +0200, a ecrit:
> There's neither documentation of what this flag is supposed to mean, nor
> any implementation in the hypervisor.
> 
> Signed-off-by: Jan Beulich 

Reviewed-by: Samuel Thibault 

> --- a/include/xen/grant_table.h
> +++ b/include/xen/grant_table.h
> @@ -627,9 +627,6 @@ DEFINE_XEN_GUEST_HANDLE(gnttab_cache_flu
>  #define _GNTMAP_contains_pte(4)
>  #define GNTMAP_contains_pte (1<<_GNTMAP_contains_pte)
>  
> -#define _GNTMAP_can_fail(5)
> -#define GNTMAP_can_fail (1<<_GNTMAP_can_fail)
> -
>  /*
>   * Bits to be placed in guest kernel available PTE bits (architecture
>   * dependent; only supported when XENFEAT_gnttab_map_avail_bits is set).
> 



[PATCH XTF] gnttab: drop GNTMAP_can_fail

2021-08-26 Thread Jan Beulich
There's neither documentation of what this flag is supposed to mean, nor
any implementation in the hypervisor.

Signed-off-by: Jan Beulich 

--- a/include/xen/grant_table.h
+++ b/include/xen/grant_table.h
@@ -196,9 +196,6 @@ typedef union {
 #define _GNTMAP_contains_pte4
 #define GNTMAP_contains_pte (1 << _GNTMAP_contains_pte)
 
-#define _GNTMAP_can_fail5
-#define GNTMAP_can_fail (1 << _GNTMAP_can_fail)
-
 /*
  * Bits to be placed in guest kernel available PTE bits (architecture
  * dependent; only supported when XENFEAT_gnttab_map_avail_bits is set).




[PATCH MINI-OS] gnttab: drop GNTMAP_can_fail

2021-08-26 Thread Jan Beulich
There's neither documentation of what this flag is supposed to mean, nor
any implementation in the hypervisor.

Signed-off-by: Jan Beulich 

--- a/include/xen/grant_table.h
+++ b/include/xen/grant_table.h
@@ -627,9 +627,6 @@ DEFINE_XEN_GUEST_HANDLE(gnttab_cache_flu
 #define _GNTMAP_contains_pte(4)
 #define GNTMAP_contains_pte (1<<_GNTMAP_contains_pte)
 
-#define _GNTMAP_can_fail(5)
-#define GNTMAP_can_fail (1<<_GNTMAP_can_fail)
-
 /*
  * Bits to be placed in guest kernel available PTE bits (architecture
  * dependent; only supported when XENFEAT_gnttab_map_avail_bits is set).




[xen-unstable test] 164477: tolerable FAIL - PUSHED

2021-08-26 Thread osstest service owner
flight 164477 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/164477/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 164405
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 164405
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 164405
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 164405
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 164405
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 164405
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 164405
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 164405
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 164405
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 164405
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 164405
 test-amd64-i386-libvirt-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt  15 migrate-support-checkfail   never pass
 test-amd64-i386-xl-pvshim14 guest-start  fail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass

version targeted for testing:
 xen  a931e8e64af07bd333a31f3b71a3f8f3e7910857
baseline version:
 xen  93713f444b3f29d6848527506db69cf78976b32d

Last test of basis   164405  2021-08-23 08:02:08 Z3 days
Testing same since   164477  2021-08-25 09:09:48 Z1 days1 attempts


People who touched revisions under test:
  Julien Grall 
  Michal Orzel 
  Oleksandr Andrushchenko 
  Oleksandr Tyshchenko 

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   

[PATCH 9/9] gnttab: don't silently truncate GFNs in compat setup-table handling

2021-08-26 Thread Jan Beulich
Returning truncated frame numbers is unhelpful: quite likely
they're not owned by the domain (if it's PV), or we may misguide the
guest into writing grant entries into a page that it actually uses for
other purposes.
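
(To illustrate the truncation being fixed - not part of the change
itself: with compat_pfn_t being 32 bits wide,

    unsigned long gfn = 0x123456789UL; /* frame number wider than 32 bits */
    compat_pfn_t frame = gfn;          /* silently truncates to 0x23456789 */

the guest would be handed a frame number possibly belonging to a
different page entirely.)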

Signed-off-by: Jan Beulich 
---
RFC: Arguably in the 32-bit PV case it may be necessary to instead put
 in place an explicit address restriction when allocating
 ->shared_raw[N]. This is currently implicit by alloc_xenheap_page()
 only returning memory covered by the direct-map.

--- a/xen/common/compat/grant_table.c
+++ b/xen/common/compat/grant_table.c
@@ -175,8 +175,15 @@ int compat_grant_table_op(unsigned int c
  i < (_s_)->nr_frames; ++i ) \
 { \
 compat_pfn_t frame = (_s_)->frame_list.p[i]; \
-if ( __copy_to_compat_offset((_d_)->frame_list, \
- i, &frame, 1) ) \
+if ( frame != (_s_)->frame_list.p[i] ) \
+{ \
+if ( VALID_M2P((_s_)->frame_list.p[i]) ) \
+(_s_)->status = GNTST_address_too_big; \
+else \
+frame |= 0x80000000U; \
+} \
+else if ( __copy_to_compat_offset((_d_)->frame_list, \
+  i, &frame, 1) ) \
 (_s_)->status = GNTST_bad_virt_addr; \
 } \
 } while (0)




[PATCH 8/9] gnttab: bail from GFN-storing loops early in case of error

2021-08-26 Thread Jan Beulich
The contents of the output arrays are undefined in both cases anyway
when the operation itself gets marked as failed. There's no value in
trying to continue after a guest memory access failure.

Signed-off-by: Jan Beulich 
---
There's also a curious difference between the two sub-ops wrt the use of
SHARED_M2P().

--- a/xen/common/compat/grant_table.c
+++ b/xen/common/compat/grant_table.c
@@ -170,17 +170,14 @@ int compat_grant_table_op(unsigned int c
 if ( rc == 0 )
 {
 #define XLAT_gnttab_setup_table_HNDL_frame_list(_d_, _s_) \
-do \
-{ \
-if ( (_s_)->status == GNTST_okay ) \
+do { \
+for ( i = 0; (_s_)->status == GNTST_okay && \
+ i < (_s_)->nr_frames; ++i ) \
 { \
-for ( i = 0; i < (_s_)->nr_frames; ++i ) \
-{ \
-unsigned int frame = (_s_)->frame_list.p[i]; \
-if ( __copy_to_compat_offset((_d_)->frame_list, \
- i, &frame, 1) ) \
-(_s_)->status = GNTST_bad_virt_addr; \
-} \
+compat_pfn_t frame = (_s_)->frame_list.p[i]; \
+if ( __copy_to_compat_offset((_d_)->frame_list, \
+ i, &frame, 1) ) \
+(_s_)->status = GNTST_bad_virt_addr; \
 } \
 } while (0)
 XLAT_gnttab_setup_table(&cmp.setup, nat.setup);
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -2103,7 +2103,10 @@ gnttab_setup_table(
 BUG_ON(SHARED_M2P(gmfn));
 
 if ( __copy_to_guest_offset(op.frame_list, i, &gmfn, 1) )
+{
 op.status = GNTST_bad_virt_addr;
+break;
+}
 }
 
  unlock:
@@ -3289,17 +3292,15 @@ gnttab_get_status_frames(XEN_GUEST_HANDL
  "status frames, but has only %u\n",
  d->domain_id, op.nr_frames, nr_status_frames(gt));
 op.status = GNTST_general_error;
-goto unlock;
 }
 
-for ( i = 0; i < op.nr_frames; i++ )
+for ( i = 0; op.status == GNTST_okay && i < op.nr_frames; i++ )
 {
 gmfn = gfn_x(gnttab_status_gfn(d, gt, i));
 if ( __copy_to_guest_offset(op.frame_list, i, &gmfn, 1) )
 op.status = GNTST_bad_virt_addr;
 }
 
- unlock:
 grant_read_unlock(gt);
  out2:
 rcu_unlock_domain(d);




[PATCH 7/9] gnttab: no need to translate handle for gnttab_get_status_frames()

2021-08-26 Thread Jan Beulich
Unlike for GNTTABOP_setup_table native and compat frame lists are arrays
of the same type (uint64_t). Hence there's no need to translate the frame
values. This then also renders unnecessary the limit_max parameter of
gnttab_get_status_frames().

Signed-off-by: Jan Beulich 

--- a/xen/common/compat/grant_table.c
+++ b/xen/common/compat/grant_table.c
@@ -271,10 +271,7 @@ int compat_grant_table_op(unsigned int c
 }
 break;
 
-case GNTTABOP_get_status_frames: {
-unsigned int max_frame_list_size_in_pages =
-(COMPAT_ARG_XLAT_SIZE - sizeof(*nat.get_status)) /
-sizeof(*nat.get_status->frame_list.p);
+case GNTTABOP_get_status_frames:
 if ( count != 1)
 {
 rc = -EINVAL;
@@ -289,38 +286,25 @@ int compat_grant_table_op(unsigned int c
 }
 
 #define XLAT_gnttab_get_status_frames_HNDL_frame_list(_d_, _s_) \
-set_xen_guest_handle((_d_)->frame_list, (uint64_t *)(nat.get_status + 1))
+guest_from_compat_handle((_d_)->frame_list, (_s_)->frame_list)
 XLAT_gnttab_get_status_frames(nat.get_status, &cmp.get_status);
 #undef XLAT_gnttab_get_status_frames_HNDL_frame_list
 
 rc = gnttab_get_status_frames(
-guest_handle_cast(nat.uop, gnttab_get_status_frames_t),
-count, max_frame_list_size_in_pages);
+guest_handle_cast(nat.uop, gnttab_get_status_frames_t), count);
 if ( rc >= 0 )
 {
-#define XLAT_gnttab_get_status_frames_HNDL_frame_list(_d_, _s_) \
-do \
-{ \
-if ( (_s_)->status == GNTST_okay ) \
-{ \
-for ( i = 0; i < (_s_)->nr_frames; ++i ) \
-{ \
-uint64_t frame = (_s_)->frame_list.p[i]; \
-if ( __copy_to_compat_offset((_d_)->frame_list, \
- i, &frame, 1) ) \
-(_s_)->status = GNTST_bad_virt_addr; \
-} \
-} \
-} while (0)
-XLAT_gnttab_get_status_frames(&cmp.get_status, nat.get_status);
-#undef XLAT_gnttab_get_status_frames_HNDL_frame_list
-if ( unlikely(__copy_to_guest(cmp_uop, &cmp.get_status, 1)) )
+XEN_GUEST_HANDLE_PARAM(gnttab_get_status_frames_compat_t) get =
+guest_handle_cast(cmp_uop,
+  gnttab_get_status_frames_compat_t);
+
+if ( unlikely(__copy_field_to_guest(get, nat.get_status,
+status)) )
 rc = -EFAULT;
 else
 i = 1;
 }
 break;
-}
 
 default:
 domain_crash(current->domain);
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -3242,7 +3242,7 @@ gnttab_set_version(XEN_GUEST_HANDLE_PARA
 
 static long
gnttab_get_status_frames(XEN_GUEST_HANDLE_PARAM(gnttab_get_status_frames_t) uop,
- unsigned int count, unsigned int limit_max)
+ unsigned int count)
 {
 gnttab_get_status_frames_t op;
 struct domain *d;
@@ -3292,15 +3292,6 @@ gnttab_get_status_frames(XEN_GUEST_HANDL
 goto unlock;
 }
 
-if ( unlikely(limit_max < op.nr_frames) )
-{
-gdprintk(XENLOG_WARNING,
- "nr_status_frames for %pd is too large (%u,%u)\n",
- d, op.nr_frames, limit_max);
-op.status = GNTST_general_error;
-goto unlock;
-}
-
 for ( i = 0; i < op.nr_frames; i++ )
 {
 gmfn = gfn_x(gnttab_status_gfn(d, gt, i));
@@ -3664,8 +3655,7 @@ do_grant_table_op(
 
 case GNTTABOP_get_status_frames:
 rc = gnttab_get_status_frames(
-guest_handle_cast(uop, gnttab_get_status_frames_t), count,
-  UINT_MAX);
+guest_handle_cast(uop, gnttab_get_status_frames_t), count);
 break;
 
 case GNTTABOP_get_version:




[PATCH 6/9] gnttab: check handle early in gnttab_get_status_frames()

2021-08-26 Thread Jan Beulich
Like done in gnttab_setup_table(), check the handle once early in the
function and use the lighter-weight (for PV) copying function in the
loop.

Signed-off-by: Jan Beulich 

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -3261,6 +3261,9 @@ gnttab_get_status_frames(XEN_GUEST_HANDL
 return -EFAULT;
 }
 
+if ( !guest_handle_okay(op.frame_list, op.nr_frames) )
+return -EFAULT;
+
 d = rcu_lock_domain_by_any_id(op.dom);
 if ( d == NULL )
 {
@@ -3301,7 +3304,7 @@ gnttab_get_status_frames(XEN_GUEST_HANDL
 for ( i = 0; i < op.nr_frames; i++ )
 {
 gmfn = gfn_x(gnttab_status_gfn(d, gt, i));
-if ( copy_to_guest_offset(op.frame_list, i, &gmfn, 1) )
+if ( __copy_to_guest_offset(op.frame_list, i, &gmfn, 1) )
 op.status = GNTST_bad_virt_addr;
 }
 




[PATCH 5/9] gnttab: defer allocation of status frame tracking array

2021-08-26 Thread Jan Beulich
This array can be large when many grant frames are permitted; avoid
allocating it when it's not going to be used anyway, by doing this only
in gnttab_populate_status_frames(). While the delaying of the respective
memory allocation adds possible reasons for failure of the respective
enclosing operations, there are other memory allocations there already,
so callers can't expect these operations to always succeed anyway.

As to the re-ordering at the end of gnttab_unpopulate_status_frames(),
this is merely to represent intended order of actions (shrink array
bound, then free higher array entries).

Signed-off-by: Jan Beulich 
Reviewed-by: Julien Grall 
---
v1: Fold into series.
[standalone history]
v4: Add a comment. Add a few blank lines. Extend description.
v3: Drop smp_wmb(). Re-base.
v2: Defer allocation to when a domain actually switches to the v2 grant
API.

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -1774,6 +1774,17 @@ gnttab_populate_status_frames(struct dom
 /* Make sure, prior version checks are architectural visible */
 block_speculation();
 
+if ( gt->status == ZERO_BLOCK_PTR )
+{
+gt->status = xzalloc_array(grant_status_t *,
+   grant_to_status_frames(gt->max_grant_frames));
+if ( !gt->status )
+{
+gt->status = ZERO_BLOCK_PTR;
+return -ENOMEM;
+}
+}
+
 for ( i = nr_status_frames(gt); i < req_status_frames; i++ )
 {
 if ( (gt->status[i] = alloc_xenheap_page()) == NULL )
@@ -1794,18 +1805,25 @@ status_alloc_failed:
 free_xenheap_page(gt->status[i]);
 gt->status[i] = NULL;
 }
+
+if ( !nr_status_frames(gt) )
+{
+xfree(gt->status);
+gt->status = ZERO_BLOCK_PTR;
+}
+
 return -ENOMEM;
 }
 
 static int
 gnttab_unpopulate_status_frames(struct domain *d, struct grant_table *gt)
 {
-unsigned int i;
+unsigned int i, n = nr_status_frames(gt);
 
 /* Make sure, prior version checks are architectural visible */
 block_speculation();
 
-for ( i = 0; i < nr_status_frames(gt); i++ )
+for ( i = 0; i < n; i++ )
 {
 struct page_info *pg = virt_to_page(gt->status[i]);
 gfn_t gfn = gnttab_get_frame_gfn(gt, true, i);
@@ -1860,12 +1878,11 @@ gnttab_unpopulate_status_frames(struct d
 page_set_owner(pg, NULL);
 }
 
-for ( i = 0; i < nr_status_frames(gt); i++ )
-{
-free_xenheap_page(gt->status[i]);
-gt->status[i] = NULL;
-}
 gt->nr_status_frames = 0;
+for ( i = 0; i < n; i++ )
+free_xenheap_page(gt->status[i]);
+xfree(gt->status);
+gt->status = ZERO_BLOCK_PTR;
 
 return 0;
 }
@@ -1988,11 +2005,11 @@ int grant_table_init(struct domain *d, i
 if ( gt->shared_raw == NULL )
 goto out;
 
-/* Status pages for grant table - for version 2 */
-gt->status = xzalloc_array(grant_status_t *,
-   grant_to_status_frames(gt->max_grant_frames));
-if ( gt->status == NULL )
-goto out;
+/*
+ * Status page tracking array for v2 gets allocated on demand. But don't
+ * leave a NULL pointer there.
+ */
+gt->status = ZERO_BLOCK_PTR;
 
 grant_write_lock(gt);
 
@@ -4103,11 +4120,13 @@ int gnttab_acquire_resource(
 if ( gt->gt_version != 2 )
 break;
 
+/* This may change gt->status, so has to happen before setting vaddrs. */
+rc = gnttab_get_status_frame_mfn(d, final_frame, &mfn);
+
 /* Check that void ** is a suitable representation for gt->status. */
 BUILD_BUG_ON(!__builtin_types_compatible_p(
  typeof(gt->status), grant_status_t **));
 vaddrs = (void **)gt->status;
-rc = gnttab_get_status_frame_mfn(d, final_frame, &mfn);
 break;
 }
 




[PATCH 4/9] gnttab: drop GNTMAP_can_fail

2021-08-26 Thread Jan Beulich
There's neither documentation of what this flag is supposed to mean, nor
any implementation. With this, don't even bother enclosing the #define-s
in a __XEN_INTERFACE_VERSION__ conditional, but drop them altogether.

Signed-off-by: Jan Beulich 

--- a/xen/include/public/grant_table.h
+++ b/xen/include/public/grant_table.h
@@ -628,9 +628,6 @@ DEFINE_XEN_GUEST_HANDLE(gnttab_cache_flu
 #define _GNTMAP_contains_pte(4)
 #define GNTMAP_contains_pte (1<<_GNTMAP_contains_pte)
 
-#define _GNTMAP_can_fail(5)
-#define GNTMAP_can_fail (1<<_GNTMAP_can_fail)
-
 /*
  * Bits to be placed in guest kernel available PTE bits (architecture
  * dependent; only supported when XENFEAT_gnttab_map_avail_bits is set).




[PATCH 3/9] gnttab: fold recurring is_iomem_page()

2021-08-26 Thread Jan Beulich
In all cases call the function just once instead of up to four times, at
the same time avoiding to store a dangling pointer in a local variable.

Signed-off-by: Jan Beulich 

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -1587,11 +1587,11 @@ unmap_common_complete(struct gnttab_unma
 else
 status = &status_entry(rgt, op->ref);
 
-pg = mfn_to_page(op->mfn);
+pg = !is_iomem_page(act->mfn) ? mfn_to_page(op->mfn) : NULL;
 
 if ( op->done & GNTMAP_device_map )
 {
-if ( !is_iomem_page(act->mfn) )
+if ( pg )
 {
 if ( op->done & GNTMAP_readonly )
 put_page(pg);
@@ -1608,7 +1608,7 @@ unmap_common_complete(struct gnttab_unma
 
 if ( op->done & GNTMAP_host_map )
 {
-if ( !is_iomem_page(op->mfn) )
+if ( pg )
 {
 if ( gnttab_host_mapping_get_page_type(op->done & GNTMAP_readonly,
ld, rd) )
@@ -3778,7 +3778,7 @@ int gnttab_release_mappings(struct domai
 else
 status = &status_entry(rgt, ref);
 
-pg = mfn_to_page(act->mfn);
+pg = !is_iomem_page(act->mfn) ? mfn_to_page(act->mfn) : NULL;
 
 if ( map->flags & GNTMAP_readonly )
 {
@@ -3786,7 +3786,7 @@ int gnttab_release_mappings(struct domai
 {
 BUG_ON(!(act->pin & GNTPIN_devr_mask));
 act->pin -= GNTPIN_devr_inc;
-if ( !is_iomem_page(act->mfn) )
+if ( pg )
 put_page(pg);
 }
 
@@ -3794,8 +3794,7 @@ int gnttab_release_mappings(struct domai
 {
 BUG_ON(!(act->pin & GNTPIN_hstr_mask));
 act->pin -= GNTPIN_hstr_inc;
-if ( gnttab_release_host_mappings(d) &&
- !is_iomem_page(act->mfn) )
+if ( pg && gnttab_release_host_mappings(d) )
 put_page(pg);
 }
 }
@@ -3805,7 +3804,7 @@ int gnttab_release_mappings(struct domai
 {
 BUG_ON(!(act->pin & GNTPIN_devw_mask));
 act->pin -= GNTPIN_devw_inc;
-if ( !is_iomem_page(act->mfn) )
+if ( pg )
 put_page_and_type(pg);
 }
 
@@ -3813,8 +3812,7 @@ int gnttab_release_mappings(struct domai
 {
 BUG_ON(!(act->pin & GNTPIN_hstw_mask));
 act->pin -= GNTPIN_hstw_inc;
-if ( gnttab_release_host_mappings(d) &&
- !is_iomem_page(act->mfn) )
+if ( pg && gnttab_release_host_mappings(d) )
 {
 if ( gnttab_host_mapping_get_page_type(false, d, rd) )
 put_page_type(pg);

[PATCH 2/9] gnttab: drop a redundant expression from gnttab_release_mappings()

2021-08-26 Thread Jan Beulich
This gnttab_host_mapping_get_page_type() invocation sits in the "else"
path of a conditional controlled by "map->flags & GNTMAP_readonly".

Signed-off-by: Jan Beulich 

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -3816,9 +3816,7 @@ int gnttab_release_mappings(struct domai
 if ( gnttab_release_host_mappings(d) &&
  !is_iomem_page(act->mfn) )
 {
-if ( gnttab_host_mapping_get_page_type((map->flags &
-GNTMAP_readonly),
-   d, rd) )
+if ( gnttab_host_mapping_get_page_type(false, d, rd) )
 put_page_type(pg);
 put_page(pg);
 }




[PATCH 1/9] gnttab: defer allocation of maptrack frames table

2021-08-26 Thread Jan Beulich
By default all guests are permitted to have up to 1024 maptrack frames,
which on 64-bit means an 8k frame table. Yet except for driver domains
guests normally don't make use of grant mappings. Defer allocating the
table until a map track handle is first requested.
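
(For reference, the 8k figure: the table is an array of struct
grant_mapping pointers, so 1024 frames x 8 bytes per pointer = 8192
bytes on 64-bit.)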

Signed-off-by: Jan Beulich 
---
I continue to be unconvinced that it is a good idea to allow all DomU-s
1024 maptrack frames by default. While I'm still of the opinion that a
hypervisor enforced upper bound is okay, I question this upper bound
also getting used as the default value - this is perhaps okay for Dom0,
but not elsewhere.

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -633,6 +633,34 @@ get_maptrack_handle(
 if ( likely(handle != INVALID_MAPTRACK_HANDLE) )
 return handle;
 
+if ( unlikely(!read_atomic(&lgt->maptrack)) )
+{
+struct grant_mapping **maptrack = NULL;
+
+if ( lgt->max_maptrack_frames )
+maptrack = vzalloc(lgt->max_maptrack_frames * sizeof(*maptrack));
+
+spin_lock(&lgt->maptrack_lock);
+
+if ( !lgt->maptrack )
+{
+if ( !maptrack )
+{
+spin_unlock(&lgt->maptrack_lock);
+return INVALID_MAPTRACK_HANDLE;
+}
+
+write_atomic(&lgt->maptrack, maptrack);
+maptrack = NULL;
+
+radix_tree_init(&lgt->maptrack_tree);
+}
+
+spin_unlock(&lgt->maptrack_lock);
+
+vfree(maptrack);
+}
+
 spin_lock(&lgt->maptrack_lock);
 
 /*
@@ -1955,16 +1983,6 @@ int grant_table_init(struct domain *d, i
 if ( gt->active == NULL )
 goto out;
 
-/* Tracking of mapped foreign frames table */
-if ( gt->max_maptrack_frames )
-{
-gt->maptrack = vzalloc(gt->max_maptrack_frames * sizeof(*gt->maptrack));
-if ( gt->maptrack == NULL )
-goto out;
-
-radix_tree_init(&gt->maptrack_tree);
-}
-
 /* Shared grant table. */
 gt->shared_raw = xzalloc_array(void *, gt->max_grant_frames);
 if ( gt->shared_raw == NULL )




[PATCH 0/9] gnttab: further work from XSA-380 / -382 context

2021-08-26 Thread Jan Beulich
The first four patches can be attributed to the former, the last four
patches to the latter. The middle patch had been submitted standalone
before, has a suitable Reviewed-by tag, but also has an objection by
Andrew pending, which unfortunately has lead to this patch now being
stuck. Short of Andrew being willing to settle the disagreement more
with Julien than with me (although I'm on Julien's side), I have no
idea what to do here.

There's probably not much interrelation between the patches, so they
can perhaps go in about any order.

1: defer allocation of maptrack frames table
2: drop a redundant expression from gnttab_release_mappings()
3: fold recurring is_iomem_page()
4: drop GNTMAP_can_fail
5: defer allocation of status frame tracking array
6: check handle early in gnttab_get_status_frames()
7: no need to translate handle for gnttab_get_status_frames()
8: bail from GFN-storing loops early in case of error
9: don't silently truncate GFNs in compat setup-table handling

Jan




Re: [PATCH v2 08/10] xsm: remove xsm_default_t from hook definitions

2021-08-26 Thread Jan Beulich
On 06.08.2021 23:41, Daniel P. Smith wrote:
> While not all of the points of contention nor all of my concerns are
> all addressed, I would like to hope that v3 is seen as an attempt at
> compromise, that those compromises are acceptable, and that I can begin to
> bring the next patch set forward. Thank you and looking forward to
> responses.

Having gone through the series I've been happy to see the adjustments
that have been made. There are still further requests I have spelled
out, but I think (hope) those aren't as controversial anymore.

Jan




Re: Enabling hypervisor agnosticism for VirtIO backends

2021-08-26 Thread AKASHI Takahiro
Hi Wei,

On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
> On Wed, Aug 18, 2021 at 08:35:51AM +, Wei Chen wrote:
> > Hi Akashi,
> > 
> > > -Original Message-
> > > From: AKASHI Takahiro 
> > > Sent: 2021年8月18日 13:39
> > > To: Wei Chen 
> > > Cc: Oleksandr Tyshchenko ; Stefano Stabellini
> > > ; Alex Benn??e ; Stratos
> > > Mailing List ; virtio-dev@lists.oasis-
> > > open.org; Arnd Bergmann ; Viresh Kumar
> > > ; Stefano Stabellini
> > > ; stefa...@redhat.com; Jan Kiszka
> > > ; Carl van Schaik ;
> > > prat...@quicinc.com; Srivatsa Vaddagiri ; Jean-
> > > Philippe Brucker ; Mathieu Poirier
> > > ; Oleksandr Tyshchenko
> > > ; Bertrand Marquis
> > > ; Artem Mygaiev ; Julien
> > > Grall ; Juergen Gross ; Paul Durrant
> > > ; Xen Devel 
> > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > >
> > > On Tue, Aug 17, 2021 at 08:39:09AM +, Wei Chen wrote:
> > > > Hi Akashi,
> > > >
> > > > > -Original Message-
> > > > > From: AKASHI Takahiro 
> > > > > Sent: 2021年8月17日 16:08
> > > > > To: Wei Chen 
> > > > > Cc: Oleksandr Tyshchenko ; Stefano Stabellini
> > > > > ; Alex Benn??e ;
> > > Stratos
> > > > > Mailing List ; virtio-
> > > dev@lists.oasis-
> > > > > open.org; Arnd Bergmann ; Viresh Kumar
> > > > > ; Stefano Stabellini
> > > > > ; stefa...@redhat.com; Jan Kiszka
> > > > > ; Carl van Schaik ;
> > > > > prat...@quicinc.com; Srivatsa Vaddagiri ; Jean-
> > > > > Philippe Brucker ; Mathieu Poirier
> > > > > ; Oleksandr Tyshchenko
> > > > > ; Bertrand Marquis
> > > > > ; Artem Mygaiev ;
> > > Julien
> > > > > Grall ; Juergen Gross ; Paul Durrant
> > > > > ; Xen Devel 
> > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > > >
> > > > > Hi Wei, Oleksandr,
> > > > >
> > > > > On Mon, Aug 16, 2021 at 10:04:03AM +, Wei Chen wrote:
> > > > > > Hi All,
> > > > > >
> > > > > > Thanks for Stefano to link my kvmtool for Xen proposal here.
> > > > > > This proposal is still discussing in Xen and KVM communities.
> > > > > > The main work is to decouple the kvmtool from KVM and make
> > > > > > other hypervisors can reuse the virtual device implementations.
> > > > > >
> > > > > > In this case, we need to introduce an intermediate hypervisor
> > > > > > layer for VMM abstraction, Which is, I think it's very close
> > > > > > to stratos' virtio hypervisor agnosticism work.
> > > > >
> > > > > # My proposal[1] comes from my own idea and doesn't always represent
> > > > > # Linaro's view on this subject nor reflect Alex's concerns.
> > > Nevertheless,
> > > > >
> > > > > Your idea and my proposal seem to share the same background.
> > > > > Both have the similar goal and currently start with, at first, Xen
> > > > > and are based on kvm-tool. (Actually, my work is derived from
> > > > > EPAM's virtio-disk, which is also based on kvm-tool.)
> > > > >
> > > > > In particular, the abstraction of hypervisor interfaces has a same
> > > > > set of interfaces (for your "struct vmm_impl" and my "RPC 
> > > > > interfaces").
> > > > > This is not co-incident as we both share the same origin as I said
> > > above.
> > > > > And so we will also share the same issues. One of them is a way of
> > > > > "sharing/mapping FE's memory". There is some trade-off between
> > > > > the portability and the performance impact.
> > > > > So we can discuss the topic here in this ML, too.
> > > > > (See Alex's original email, too).
> > > > >
> > > > Yes, I agree.
> > > >
> > > > > On the other hand, my approach aims to create a "single-binary"
> > > solution
> > > > > in which the same binary of BE vm could run on any hypervisors.
> > > > > Somehow similar to your "proposal-#2" in [2], but in my solution, all
> > > > > the hypervisor-specific code would be put into another entity (VM),
> > > > > named "virtio-proxy" and the abstracted operations are served via RPC.
> > > > > (In this sense, BE is hypervisor-agnostic but might have OS
> > > dependency.)
> > > > > But I know that we need discuss if this is a requirement even
> > > > > in Stratos project or not. (Maybe not)
> > > > >
> > > >
> > > > Sorry, I haven't had time to finish reading your virtio-proxy completely
> > > > (I will do it ASAP). But from your description, it seems we need a
> > > > 3rd VM between FE and BE? My concern is that, if my assumption is right,
> > > > will it increase the latency in data transport path? Even if we're
> > > > using some lightweight guest like RTOS or Unikernel,
> > >
> > > Yes, you're right. But I'm afraid that it is a matter of degree.
> > > As far as we execute 'mapping' operations at every fetch of payload,
> > > we will see latency issue (even in your case) and if we have some solution
> > > for it, we won't see it neither in my proposal :)
> > >
> > 
> > Oleksandr has sent a proposal to Xen mailing list to reduce this kind
> > of "mapping/unmapping" operations. So the latency caused by this behavior
> > on Xen may eventually be eliminated, and Linux-KVM doesn't 

Re: [XEN RFC PATCH 26/40] xen/arm: Add boot and secondary CPU to NUMA system

2021-08-26 Thread Jan Beulich
On 26.08.2021 10:49, Julien Grall wrote:
> On 26/08/2021 08:24, Wei Chen wrote:
>>> -Original Message-
>>> From: Julien Grall 
>>> Sent: 2021年8月26日 0:58
>>> On 11/08/2021 11:24, Wei Chen wrote:
 --- a/xen/arch/arm/smpboot.c
 +++ b/xen/arch/arm/smpboot.c
 @@ -358,6 +358,12 @@ void start_secondary(void)
 */
smp_wmb();

 +/*
 + * If Xen is running on a NUMA off system, there will
 + * be a node#0 at least.
 + */
 +numa_add_cpu(cpuid);
 +
>>>
>>> On x86, numa_add_cpu() will be called before the pCPU is brought up. I
>>> am not quite too sure why we are doing it differently here. Can you
>>> clarify it?
>>
>> Of course we can invoke numa_add_cpu before cpu_up as x86. But in my tests,
>> I found when cpu bring up failed, this cpu still be add to NUMA. Although
>> this does not affect the execution of the code (because CPU is offline),
>> But I don't think adding a offline CPU to NUMA makes sense.
> 
> Right, but again, why do you want to solve the problem on Arm and not 
> x86? After all, NUMA is not architecture specific (in fact you move most 
> of the code in common).
> 
> In fact, the risk, is someone may read arch/x86 and doesn't realize the 
> CPU is not in the node until late on Arm.
> 
> So I think we should call numa_add_cpu() around the same place on all 
> the architectures.

FWIW: +1

Jan




Re: [PATCH v3 7/7] xsm: removing facade that XSM can be enabled/disabled

2021-08-26 Thread Jan Beulich
On 05.08.2021 16:06, Daniel P. Smith wrote:
> The XSM facilities are always in use by Xen with the facade of being able to
> turn XSM on and off. This option is in fact about allowing the selection of
> which policies are available and which are used at runtime.  To provide this
> facade a complicated serious of #ifdef's are used to selectively include

Nit: It took me a moment to realize that the sentence reads oddly because
you likely mean "series", not "serious".

> different headers or portions of headers. This series of #ifdef gyrations
> switches between two different versions of the XSM hook interfaces and their
> respective backing implementation.  All of this is done to provide a minimal
> size/performance optimization for when alternative policies are disabled.
> 
> To unwind the #ifdef gyrations a series of changes were necessary,
> * replace CONFIG_XSM with XSM_CONFIGURABLE to allow visibility of
>   selecting alternate XSM policy modules to those that require it
> * adjusted CONFIG_XSM_SILO, CONFIG_XSM_FLASK, and the default module
>   selection to sensible defaults
> * collapsed the "dummy/defualt" XSM interface and implementation with the
>   "multiple policy" interface to provide a single inlined implementation
>   that attempts to use a registered hook and falls back to the check from
>   the dummy implementation
> * the collapse to a single interface broke code relying on the alternate
>   interface, specifically SILO, this was reworked to remove the
>   indirection/abstraction making SILO explicit in its access control
>   decisions
> * with the change of the XSM hooks to fall back to enforcing the dummy
>   policy, it is no longer necessary to fill NULL entries in the struct
>   xsm_ops returned by an XSM module's init

It would be nice if some of this could be split. Is this really close to
impossible?

> --- a/xen/common/Kconfig
> +++ b/xen/common/Kconfig
> @@ -200,23 +200,15 @@ config XENOPROF
>  
> If unsure, say Y.
>  
> -config XSM
> - bool "Xen Security Modules support"
> - default ARM
> - ---help---
> -   Enables the security framework known as Xen Security Modules which
> -   allows administrators fine-grained control over a Xen domain and
> -   its capabilities by defining permissible interactions between domains,
> -   the hypervisor itself, and related resources such as memory and
> -   devices.
> -
> -   If unsure, say N.
> +config XSM_CONFIGURABLE
> +bool "Enable Configuring Xen Security Modules"

Is there a reason to change not only the prompt, but also the name of
the Kconfig setting? This alone is the reason for some otherwise
unnecessary code churn.

Also please correct indentation here.

>  config XSM_FLASK
> - def_bool y
> - prompt "FLux Advanced Security Kernel support"
> - depends on XSM
> - ---help---
> + bool "FLux Advanced Security Kernel support"
> + default n

I don't understand this change in default (and as an aside, a default
of "n" doesn't need spelling out): In the description you say "adjusted
CONFIG_XSM_SILO, CONFIG_XSM_FLASK, and the default module selection to
sensible defaults". If that's to describe this change, then I'm afraid
I don't see why defaulting to "n" is more sensible once the person
configuring Xen has chosen the configure XSM's (or XSM_CONFIGURABLE's)
sub-options. If that's unrelated to the change here, then I'm afraid
I'm missing justification altogether. (Same for SILO then.)

> + depends on XSM_CONFIGURABLE
> + select XSM_EVTCHN_LABELING

Neither this nor any prior patch introduces an option of this name,
and there's also none in the present tree. All afaics; I may have
overlooked something or typo-ed a "grep" command.

> @@ -265,14 +258,14 @@ config XSM_SILO
> If unsure, say Y.
>  
>  choice
> - prompt "Default XSM implementation"
> - depends on XSM
> + prompt "Default XSM module"
>   default XSM_SILO_DEFAULT if XSM_SILO && ARM
>   default XSM_FLASK_DEFAULT if XSM_FLASK
>   default XSM_SILO_DEFAULT if XSM_SILO
>   default XSM_DUMMY_DEFAULT
> + depends on XSM_CONFIGURABLE

With the larger set of "default" lines I'd like to suggest to keep
"depends on" ahead of them.

> @@ -282,7 +275,7 @@ endchoice
>  config LATE_HWDOM
>   bool "Dedicated hardware domain"
>   default n
> - depends on XSM && X86
> + depends on XSM_FLASK && X86

This change is not mentioned or justified in the description. In fact
I think it is unrelated to the change here and hence would want breaking
out.

>   ---help---

As you're changing these elsewhere, any chance of you also changing
this one to just "help"?

> --- a/xen/include/xsm/xsm.h
> +++ b/xen/include/xsm/xsm.h
> @@ -19,545 +19,1023 @@
>  #include 
>  #include 
>  #include 
> -
> -#ifdef CONFIG_XSM
> +#include 
> +#include 
>  
>  extern struct xsm_ops xsm_ops;
>  
> -static inline void xsm_security_domaininfo 

Re: [PATCH] x86/xstate: reset cached register values on resume

2021-08-26 Thread Andrew Cooper
On 26/08/2021 08:40, Jan Beulich wrote:
> On 25.08.2021 18:49, Andrew Cooper wrote:
>> On 25/08/2021 16:02, Jan Beulich wrote:
>>> On 24.08.2021 23:11, Andrew Cooper wrote:
>>>  If
>>> the register started out non-zero (the default on AMD iirc, as it's
>>> not really masks there) but the first value to be written was zero,
>>> we'd skip the write.
>> There is cpuidmask_defaults which does get filled from the MSRs on boot.
>>
>> AMD are real CPUID settings, while Intel is an and-mask.  i.e. they're
>> both non-zero (unless someone does something silly with the command line
>> arguments, and I don't expect Xen to be happy booting if the leaves all
>> read 0).
> Surely not all of them together, but I don't think it's completely
> unreasonable for one (say the XSAVE one, if e.g. XSAVE is to be turned
> off altogether for guests) to be all zero.
>
>> Each early_init_*() has an explicit ctxt_switch_levelling(NULL) call
>> which, given non-zero content in cpuidmask_defaults should latch each
>> one appropriately in the the per-cpu variable.
> Well, as you say - provided the individual fields are all non-zero.

The MSRs only exist on CPUs which have non-zero features in the relevant
leaves.

The XSAVE and Therm registers could plausibly be 0.  Dom0 is first to
boot and won't expect to have XSAVE hidden, but we do zero all of leaf 6
in recalculate_misc()

There are certainly corner cases here to improve, but I think all
registers will latch on the first early_init_*(), except for Therm on
AMD which will latch on the first context switch from a PV guest back to
idle.

 cpu/common.c:120:static DEFINE_PER_CPU(uint64_t, msr_misc_features);
>>> Almost the same here - we only initialize the variable on the BSP
>>> afaics.
>> No - way way way worse, I think.
>>
>> For all APs, we write 0 or MSR_MISC_FEATURES_CPUID_FAULTING into
>> MSR_INTEL_MISC_FEATURES_ENABLES, which amongst other things turns off
>> Fast String Enable.
> Urgh, indeed. Prior to 6e2fdc0f8902 there was a read-modify-write
> operation. With the introduction of the cache variable this went
> away, while the cache gets filled for BSP only.

Yeah - I really screwed up on that one...  It was also part of the PV
Shim work done in a hurry in the lead up to Spectre/Meltdown.

MSR_INTEL_MISC_FEATURES_ENABLES is a lot like
MSR_{TSX_FORCE_ABORT,TSX_CTRL,MCU_OPT_CTRL} and the MTRRs.

Most of the content wants to be identical on all cores, so we do want to
fix up AP values with the BSP value if they differ, but we also want a
software cache covering at least the CPUID_FAULTING bit so we don't have
an unnecessary MSR read on the context switch path.
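
Roughly (names illustrative only, not necessarily what the eventual fix
will look like):

    static DEFINE_PER_CPU(uint64_t, msr_misc_features);

    void set_cpuid_faulting(bool enable)
    {
        uint64_t *this_misc_features = &this_cpu(msr_misc_features);
        uint64_t val = *this_misc_features;

        if ( !!(val & MSR_MISC_FEATURES_CPUID_FAULTING) == enable )
            return; /* cache hit - no MSR access on this path */

        val ^= MSR_MISC_FEATURES_CPUID_FAULTING;
        wrmsrl(MSR_INTEL_MISC_FEATURES_ENABLES, val);
        *this_misc_features = val;
    }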

I'll try to do something better.

~Andrew




Re: [XEN RFC PATCH 26/40] xen/arm: Add boot and secondary CPU to NUMA system

2021-08-26 Thread Julien Grall




On 26/08/2021 08:24, Wei Chen wrote:

Hi Julien,


Hi Wei,


-Original Message-
From: Julien Grall 
Sent: 2021年8月26日 0:58
To: Wei Chen ; xen-devel@lists.xenproject.org;
sstabell...@kernel.org; jbeul...@suse.com
Cc: Bertrand Marquis 
Subject: Re: [XEN RFC PATCH 26/40] xen/arm: Add boot and secondary CPU to
NUMA system

Hi Wei,

On 11/08/2021 11:24, Wei Chen wrote:

When CPUs boot up, we add them to the NUMA system. At the current
stage, we have not parsed the NUMA data, but we have created
a fake NUMA node. So, in this patch, all CPUs will be added
to NUMA node#0. After the NUMA data has been parsed from the device
tree, each CPU will be added to the correct NUMA node as the NUMA
data describes.

Signed-off-by: Wei Chen 
---
   xen/arch/arm/setup.c   | 6 ++
   xen/arch/arm/smpboot.c | 6 ++
   xen/include/asm-arm/numa.h | 1 +
   3 files changed, 13 insertions(+)

diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index 3c58d2d441..7531989f21 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -918,6 +918,12 @@ void __init start_xen(unsigned long boot_phys_offset,
 
   processor_id();

+/*
+ * If Xen is running on a NUMA off system, there will
+ * be a node#0 at least.
+ */
+numa_add_cpu(0);
+
   smp_init_cpus();
   cpus = smp_get_max_cpus();
   printk(XENLOG_INFO "SMP: Allowing %u CPUs\n", cpus);
diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c
index a1ee3146ef..aa78958c07 100644
--- a/xen/arch/arm/smpboot.c
+++ b/xen/arch/arm/smpboot.c
@@ -358,6 +358,12 @@ void start_secondary(void)
*/
   smp_wmb();

+/*
+ * If Xen is running on a NUMA off system, there will
+ * be a node#0 at least.
+ */
+numa_add_cpu(cpuid);
+


On x86, numa_add_cpu() will be called before the pCPU is brought up. I
am not quite too sure why we are doing it differently here. Can you
clarify it?


Of course we can invoke numa_add_cpu before cpu_up as x86. But in my tests,
I found when cpu bring up failed, this cpu still be add to NUMA. Although
this does not affect the execution of the code (because CPU is offline),
But I don't think adding a offline CPU to NUMA makes sense.


Right, but again, why do you want to solve the problem on Arm and not 
x86? After all, NUMA is not architecture specific (in fact you move most 
of the code in common).


In fact, the risk, is someone may read arch/x86 and doesn't realize the 
CPU is not in the node until late on Arm.


So I think we should call numa_add_cpu() around the same place on all 
the architectures.


If you think the current position on x86 is not correct, then it should 
be changed at as well. However, I don't know the story behind the 
position of the call on x86. You may want to ask the x86 maintainers.


Cheers,

--
Julien Grall



Re: [XEN RFC PATCH 23/40] xen/arm: introduce a helper to parse device tree memory node

2021-08-26 Thread Julien Grall




On 26/08/2021 07:35, Wei Chen wrote:

Hi Julien,


Hi Wei,


-Original Message-
From: Julien Grall 
Sent: 2021年8月25日 21:49
To: Wei Chen ; xen-devel@lists.xenproject.org;
sstabell...@kernel.org; jbeul...@suse.com
Cc: Bertrand Marquis 
Subject: Re: [XEN RFC PATCH 23/40] xen/arm: introduce a helper to parse
device tree memory node

Hi Wei,

On 11/08/2021 11:24, Wei Chen wrote:

Memory blocks' NUMA ID information is stored in device tree's
memory nodes as "numa-node-id". We need a new helper to parse
and verify this ID from memory nodes.

In order to support memory affinity in later use, the valid
memory ranges and NUMA ID will be saved to tables.

Signed-off-by: Wei Chen 
---
   xen/arch/arm/numa_device_tree.c | 130 
   1 file changed, 130 insertions(+)

diff --git a/xen/arch/arm/numa_device_tree.c

b/xen/arch/arm/numa_device_tree.c

index 37cc56acf3..bbe081dcd1 100644
--- a/xen/arch/arm/numa_device_tree.c
+++ b/xen/arch/arm/numa_device_tree.c
@@ -20,11 +20,13 @@
   #include 
   #include 
   #include 
+#include 
   #include 
   #include 

   s8 device_tree_numa = 0;
   static nodemask_t processor_nodes_parsed __initdata;
+static nodemask_t memory_nodes_parsed __initdata;

   static int srat_disabled(void)
   {
@@ -55,6 +57,79 @@ static int __init

dtb_numa_processor_affinity_init(nodeid_t node)

   return 0;
   }

+/* Callback for parsing of the memory regions affinity */
+static int __init dtb_numa_memory_affinity_init(nodeid_t node,
+paddr_t start, paddr_t size)
+{


The implementation of this function is quite similar to the ACPI
version. Can this be abstracted?


In my draft, I had tried to merge the ACPI and DTB versions into one
function. I introduced a number of "if else" branches to distinguish ACPI
from DTB, especially for ACPI hotplug. The function seemed very messy.
There were not enough benefits to make up for the mess, so I gave up.


I think you can get away without distinguishing between ACPI and DT in
that helper (see the sketch below):
  * ma->flags & ACPI_SRAT_MEM_HOTPLUGGABLE could be replaced by an
argument indicating whether the region is hotpluggable (this would
always be false for DT)
  * Access to memblk_hotplug can be stubbed (in the future we may want
to consider memory hotplug even on Arm).
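
For illustration, the abstracted helper could take roughly this shape (a
sketch only; the helper name and exact parameter list are assumptions, not
code from the series):

/* Common helper shared by the ACPI (SRAT) and DT parsers -- sketch only;
 * range/overlap bookkeeping is elided here. */
static int __init numa_update_memory_affinity(nodeid_t node, paddr_t start,
                                              paddr_t size, bool hotpluggable)
{
    if ( srat_disabled() )
        return -EINVAL;

    /* DT callers would always pass hotpluggable == false; the ACPI-only
     * memblk_hotplug handling would be stubbed out behind this flag. */

    node_set(node, memory_nodes_parsed);
    return 0;
}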


Do you still have the "if else" version? If so can you post it?

Cheers,

--
Julien Grall



Re: [PATCH v3 5/7] xsm: decouple xsm header inclusion selection

2021-08-26 Thread Jan Beulich
On 05.08.2021 16:06, Daniel P. Smith wrote:
> --- /dev/null
> +++ b/xen/include/xsm/xsm-core.h
> @@ -0,0 +1,273 @@
> +/*
> + *  This file contains the XSM hook definitions for Xen.
> + *
> + *  This work is based on the LSM implementation in Linux 2.6.13.4.
> + *
> + *  Author:  George Coker, 
> + *
> + *  Contributors: Michael LeMay, 
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License version 2,
> + *  as published by the Free Software Foundation.
> + */
> +
> +#ifndef __XSM_CORE_H__
> +#define __XSM_CORE_H__
> +
> +#include 
> +#include 

I was going to ask to invert the order (as we try to arrange #include-s
alphabetically), but it looks like multiboot.h isn't fit for this.

> +typedef void xsm_op_t;
> +DEFINE_XEN_GUEST_HANDLE(xsm_op_t);

Just FTR - I consider this dubious. If void is meant, I don't see why
a void handle can't be used.

> +/* policy magic number (defined by XSM_MAGIC) */
> +typedef uint32_t xsm_magic_t;
> +
> +#ifdef CONFIG_XSM_FLASK
> +#define XSM_MAGIC 0xf97cff8c
> +#else
> +#define XSM_MAGIC 0x0
> +#endif
> +
> +/* These annotations are used by callers and in dummy.h to document the
> + * default actions of XSM hooks. They should be compiled out otherwise.
> + */

I realize you only move code, but like e.g. the u32 -> uint32_t change
in context above I'd like to encourage you to also address other style
issues in the newly introduced file. Here I'm talking about comment
style, requiring /* to be on its own line.
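
That is, the requested form would be:

/*
 * These annotations are used by callers and in dummy.h to document the
 * default actions of XSM hooks.  They should be compiled out otherwise.
 */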

> +enum xsm_default {
> +XSM_HOOK, /* Guests can normally access the hypercall */
> +XSM_DM_PRIV,  /* Device model can perform on its target domain */
> +XSM_TARGET,   /* Can perform on self or your target domain */
> +XSM_PRIV, /* Privileged - normally restricted to dom0 */
> +XSM_XS_PRIV,  /* Xenstore domain - can do some privileged operations */
> +XSM_OTHER /* Something more complex */
> +};
> +typedef enum xsm_default xsm_default_t;
> +
> +struct xsm_ops {
> +void (*security_domaininfo) (struct domain *d,

Similarly here (and below) - we don't normally put a blank between
the closing and opening parentheses in function pointer declarations.
The majority does so here, but ...

>[...]
> +int (*page_offline)(uint32_t cmd);
> +int (*hypfs_op)(void);

... there are exceptions.

>[...]
> +int (*platform_op) (uint32_t cmd);
> +
> +#ifdef CONFIG_X86
> +int (*do_mca) (void);
> +int (*shadow_control) (struct domain *d, uint32_t op);
> +int (*mem_sharing_op) (struct domain *d, struct domain *cd, int op);
> +int (*apic) (struct domain *d, int cmd);
> +int (*memtype) (uint32_t access);
> +int (*machine_memory_map) (void);
> +int (*domain_memory_map) (struct domain *d);
> +#define XSM_MMU_UPDATE_READ  1
> +#define XSM_MMU_UPDATE_WRITE 2
> +#define XSM_MMU_NORMAL_UPDATE4
> +#define XSM_MMU_MACHPHYS_UPDATE  8
> +int (*mmu_update) (struct domain *d, struct domain *t,
> +   struct domain *f, uint32_t flags);
> +int (*mmuext_op) (struct domain *d, struct domain *f);
> +int (*update_va_mapping) (struct domain *d, struct domain *f,
> +  l1_pgentry_t pte);
> +int (*priv_mapping) (struct domain *d, struct domain *t);
> +int (*ioport_permission) (struct domain *d, uint32_t s, uint32_t e,
> +  uint8_t allow);
> +int (*ioport_mapping) (struct domain *d, uint32_t s, uint32_t e,
> +   uint8_t allow);
> +int (*pmu_op) (struct domain *d, unsigned int op);
> +#endif
> +int (*dm_op) (struct domain *d);

To match grouping elsewhere, a blank line above here, ...

> +int (*xen_version) (uint32_t cmd);
> +int (*domain_resource_map) (struct domain *d);
> +#ifdef CONFIG_ARGO

... and here would be nice.

> +int (*argo_enable) (const struct domain *d);
> +int (*argo_register_single_source) (const struct domain *d,
> +const struct domain *t);
> +int (*argo_register_any_source) (const struct domain *d);
> +int (*argo_send) (const struct domain *d, const struct domain *t);
> +#endif
> +};
> +
> +extern void xsm_fixup_ops(struct xsm_ops *ops);
> +
> +#ifdef CONFIG_XSM
> +
> +#ifdef CONFIG_MULTIBOOT
> +extern int xsm_multiboot_init(unsigned long *module_map,
> +  const multiboot_info_t *mbi);
> +extern int xsm_multiboot_policy_init(unsigned long *module_map,
> + const multiboot_info_t *mbi,
> + void **policy_buffer,
> + size_t *policy_size);
> +#endif
> +
> +#ifdef CONFIG_HAS_DEVICE_TREE
> +/*
> + * Initialize XSM
> + *
> + * On success, return 1 if using SILO mode else 0.
> + */
> +extern int xsm_dt_init(void);
> +extern int xsm_dt_policy_init(void **policy_buffer, size_t *policy_size);
> +extern bool 

Re: Xen C-state Issues

2021-08-26 Thread Jan Beulich
On 26.08.2021 03:18, Elliott Mitchell wrote:
> On Tue, Aug 24, 2021 at 08:14:41AM +0200, Jan Beulich wrote:
>> On 24.08.2021 07:37, Elliott Mitchell wrote:
>>> On Mon, Aug 23, 2021 at 09:12:52AM +0200, Jan Beulich wrote:
 On 21.08.2021 18:25, Elliott Mitchell wrote:
> ACPI C-state support might not see too much use, but it does see some.
>
> With Xen 4.11 and Linux kernel 4.19, I found higher C-states only got
> enabled for physical cores for which Domain 0 had a corresponding vCPU.
> On a machine where Domain 0 has 5 vCPUs, but 8 reported cores, the
> additional C-states would only be enabled on cores 0-4.
>
> This can be worked around by giving Domain 0 vCPUs equal to cores, but
> then offlining the extra vCPUs.  I'm guessing this is a bug with the
> Linux 4.19 xen_acpi_processor module.
>
>
>
> Appears Xen 4.14 doesn't work at all with Linux kernel 4.19's ACPI
> C-state support.  This combination is unable to enable higher C-states
> on any core.  Since Xen 4.14 and Linux 4.19 are *both* *presently*
> supported it seems patch(es) are needed somewhere for this combination.

 Hmm, having observed the same quite some time ago, I thought I had
 dealt with these problems. Albeit surely not in Xen 4.11 or Linux 4.19.
 Any chance you could check up-to-date versions of both Xen and Linux
 (together)?
>>>
>>> I can believe you got this fixed, but the Linux fixes never got
>>> backported.
>>>
>>> Of the two, higher C-states working with Linux 4.19 and Xen 4.11, but
>>> not Linux 4.19 and Xen 4.14 is more concerning to me.
>>
>> I'm afraid without you providing detail (full verbosity logs) and
>> ideally checking with 4.15 or yet better -unstable it's going to be
>> hard to judge whether that's a bug, and if so where it might sit.
> 
> That would be a very different sort of bug report if that was found to
> be an issue.  This report is likely a problem of fixes not being
> backported to stable branches.

As you say - likely. I'd like to be sure.

> What you're writing about would be looking for bugs in development
> branches.

Very much so, in case there are issues left, or ones have got
reintroduced. That's what the primary purpose of this list is.

If you were suspecting missing fixes in the kernel, I guess xen-devel
isn't the preferred channel anyway. Otoh the stable maintainers there
would likely want concrete commits pointed out ...

Jan




Re: [PATCH] x86/xstate: reset cached register values on resume

2021-08-26 Thread Jan Beulich
On 25.08.2021 18:49, Andrew Cooper wrote:
> On 25/08/2021 16:02, Jan Beulich wrote:
>> On 24.08.2021 23:11, Andrew Cooper wrote:
>>> On 18/08/2021 13:44, Andrew Cooper wrote:
 On 18/08/2021 12:30, Marek Marczykowski-Górecki wrote:
> set_xcr0() and set_msr_xss() use cached value to avoid setting the
> register to the same value over and over. But suspend/resume implicitly
> reset the registers and since percpu areas are not deallocated on
> suspend anymore, the cache gets stale.
> Reset the cache on resume, to ensure the next write will really hit the
> hardware. Choose value 0, as it will never be a legitimate write to
> those registers - and so, will force write (and cache update).
>
> Note the cache is used in get_xcr0() and get_msr_xss() too, but:
> - set_xcr0() is called a few lines below in xstate_init(), so it will
>   update the cache with the appropriate value
> - get_msr_xss() is not used anywhere - and thus not before any
>   set_msr_xss() that will fill the cache
>
> Fixes: aca2a985a55a "xen: don't free percpu areas during suspend"
> Signed-off-by: Marek Marczykowski-Górecki 
> 
 I'd prefer to do this differently.  As I said in the thread, there are
 other registers such as MSR_TSC_AUX which fall into the same category,
 and I'd like to make something which works systematically.
>>> Ok - after some searching, I think we have problems with:
>>>
>>> cpu/common.c:47:DEFINE_PER_CPU(struct cpuidmasks, cpuidmasks);
>> Don't we have a problem here even during initial boot? I can't see
>> the per-CPU variable getting filled with what the registers hold.
> 
> No, I don't think so, but it is a roundabout route.

So where do you see it getting filled?

>>  If
>> the register started out non-zero (the default on AMD iirc, as it's
>> not really masks there) but the first value to be written was zero,
>> we'd skip the write.
> 
> There is cpuidmask_defaults which does get filled from the MSRs on boot.
> 
> AMD are real CPUID settings, while Intel is an and-mask.  i.e. they're
> both non-zero (unless someone does something silly with the command line
> arguments, and I don't expect Xen to be happy booting if the leaves all
> read 0).

Surely not all of them together, but I don't think it's completely
unreasonable for one (say the XSAVE one, if e.g. XSAVE is to be turned
off altogether for guests) to be all zero.

> Each early_init_*() has an explicit ctxt_switch_levelling(NULL) call
> which, given non-zero content in cpuidmask_defaults should latch each
> one appropriately in the the per-cpu variable.

Well, as you say - provided the individual fields are all non-zero.

>>> cpu/common.c:120:static DEFINE_PER_CPU(uint64_t, msr_misc_features);
>> Almost the same here - we only initialize the variable on the BSP
>> afaics.
> 
> No - way way way worse, I think.
> 
> For all APs, we write 0 or MSR_MISC_FEATURES_CPUID_FAULTING into
> MSR_INTEL_MISC_FEATURES_ENABLES, which amongst other things turns off
> Fast String Enable.

Urgh, indeed. Prior to 6e2fdc0f8902 there was a read-modify-write
operation. With the introduction of the cache variable this went
away, while the cache gets filled for BSP only.
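
One way to address the AP side would be to seed each CPU's cache from
hardware instead of writing a made-up value (a sketch only, not the
committed fix; the feature predicate and exact placement are assumptions):

/* In the per-CPU bring-up path (sketch): */
if ( boot_cpu_has(X86_FEATURE_CPUID_FAULTING) )
    rdmsrl(MSR_INTEL_MISC_FEATURES_ENABLES, this_cpu(msr_misc_features));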

Jan




RE: [XEN RFC PATCH 00/40] Add device tree based NUMA support to Arm64

2021-08-26 Thread Wei Chen
Hi Stefano,

> -Original Message-
> From: Stefano Stabellini 
> Sent: 2021年8月26日 8:09
> To: Wei Chen 
> Cc: xen-devel@lists.xenproject.org; sstabell...@kernel.org; jul...@xen.org;
> jbeul...@suse.com; Bertrand Marquis ;
> andrew.coop...@citrix.com
> Subject: Re: [XEN RFC PATCH 00/40] Add device tree based NUMA support to
> Arm64
> 
> Thanks for the big contribution!
> 
> I just wanted to let you know that the series passed all the gitlab-ci
> build tests without issues.
> 
> The runtime tests originally failed due to unrelated problems (there was
> a Debian testing upgrade that broke Gitlab-CI.) I fixed the underlying
> issue and restarted the failed tests, and now they pass.
> 
> This is the pipeline:
> https://gitlab.com/xen-project/patchew/xen/-/pipelines/351484940
> 
> There are still two runtime x86 tests that fail but I don't think the
> failures are related to your series.
> 
> 

Thanks for testing this series : )

> On Wed, 11 Aug 2021, Wei Chen wrote:
> > Xen memory allocation and scheduler modules are NUMA aware.
> > But actually, only x86 has implemented the architecture APIs
> > to support NUMA. Arm was providing a set of fake architecture
> > APIs to make it compatible with NUMA-aware memory allocation
> > and the scheduler.
> >
> > Arm systems were working well as single node NUMA systems with
> > these fake APIs, because we didn't have multi-node NUMA systems
> > on Arm. But in recent years, more and more Arm devices support
> > multi-node NUMA, like TX2, some Hisilicon chips and the Ampere
> > Altra.
> >
> > So now we have a new problem. When Xen is running on these Arm
> > devices, Xen still treats them as single node SMP systems. The
> > NUMA affinity capability of Xen memory allocation and the scheduler
> > becomes meaningless, because they rely on input data that does
> > not reflect the real NUMA layout.
> >
> > Xen still thinks the access time for all of the memory is the
> > same for all CPUs. However, Xen may allocate memory to a VM
> > from different NUMA nodes with different access speeds. This
> > difference can be amplified in workloads inside the VM, causing
> > performance instability and timeouts.
> >
> > So in this patch series, we implement a set of NUMA API to use
> > device tree to describe the NUMA layout. We reuse most of the
> > code of x86 NUMA to create and maintain the mapping between
> > memory and CPU, create the matrix between any two NUMA nodes.
> > Except ACPI and some x86 specified code, we have moved other
> > code to common. In next stage, when we implement ACPI based
> > NUMA for Arm64, we may move the ACPI NUMA code to common too,
> > but in current stage, we keep it as x86 only.
> >
> > This patch series has been tested and booted well on one
> > Arm64 NUMA machine and one HPE x86 NUMA machine.
> >
> > Hongda Deng (2):
> >   xen/arm: return default DMA bit width when platform is not set
> >   xen/arm: Fix lowmem_bitsize when arch_get_dma_bitsize return 0
> >
> > Wei Chen (38):
> >   tools: Fix -Werror=maybe-uninitialized for xlu_pci_parse_bdf
> >   xen/arm: Print a 64-bit number in hex from early uart
> >   xen/x86: Initialize memnodemapsize while faking NUMA node
> >   xen: decouple NUMA from ACPI in Kconfig
> >   xen/arm: use !CONFIG_NUMA to keep fake NUMA API
> >   xen/x86: Move NUMA memory node map functions to common
> >   xen/x86: Move numa_add_cpu_node to common
> >   xen/x86: Move NR_NODE_MEMBLKS macro to common
> >   xen/x86: Move NUMA nodes and memory block ranges to common
> >   xen/x86: Move numa_initmem_init to common
> >   xen/arm: introduce numa_set_node for Arm
> >   xen/arm: set NUMA nodes max number to 64 by default
> >   xen/x86: move NUMA API from x86 header to common header
> >   xen/arm: Create a fake NUMA node to use common code
> >   xen/arm: Introduce DEVICE_TREE_NUMA Kconfig for arm64
> >   xen/arm: Keep memory nodes in dtb for NUMA when boot from EFI
> >   xen: fdt: Introduce a helper to check fdt node type
> >   xen/arm: implement node distance helpers for Arm64
> >   xen/arm: introduce device_tree_numa as a switch for device tree NUMA
> >   xen/arm: introduce a helper to parse device tree processor node
> >   xen/arm: introduce a helper to parse device tree memory node
> >   xen/arm: introduce a helper to parse device tree NUMA distance map
> >   xen/arm: unified entry to parse all NUMA data from device tree
> >   xen/arm: Add boot and secondary CPU to NUMA system
> >   xen/arm: build CPU NUMA node map while creating cpu_logical_map
> >   xen/x86: decouple nodes_cover_memory with E820 map
> >   xen/arm: implement Arm arch helpers Arm to get memory map info
> >   xen: move NUMA memory and CPU parsed nodemasks to common
> >   xen/x86: move nodes_cover_memory to common
> >   xen/x86: make acpi_scan_nodes to be neutral
> >   xen: export bad_srat and srat_disabled to extern
> >   xen: move numa_scan_nodes from x86 to common
> >   xen: enable numa_scan_nodes for device tree based NUMA
> >   xen/arm: keep guest still be NUMA 

RE: [XEN RFC PATCH 30/40] xen: move NUMA memory and CPU parsed nodemasks to common

2021-08-26 Thread Wei Chen
Hi Juilien,

> -Original Message-
> From: Julien Grall 
> Sent: 2021年8月26日 1:17
> To: Wei Chen ; xen-devel@lists.xenproject.org;
> sstabell...@kernel.org; jbeul...@suse.com
> Cc: Bertrand Marquis 
> Subject: Re: [XEN RFC PATCH 30/40] xen: move NUMA memory and CPU parsed
> nodemasks to common
> 
> Hi Wei,
> 
> On 11/08/2021 11:24, Wei Chen wrote:
> > Both memory_nodes_parsed and processor_nodes_parsed are used by
> > Arm and x86 to record parsed NUMA memory and CPUs. So we
> > move them to common.
> 
> Looking at the usage, they both call:
> 
> numa_set...(..., bitmap)
> 
> So rather than exporting the two helpers, could we simply add helpers to
> abstract it?
> 

I will try to fix it in next version.
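
For illustration, such wrappers might look like this (a sketch; the helper
names are assumptions, not code from the series):

/* In xen/common/numa.c -- keep the nodemasks static, expose setters. */
void __init numa_set_memory_node_parsed(nodeid_t node)
{
    node_set(node, memory_nodes_parsed);
}

void __init numa_set_processor_node_parsed(nodeid_t node)
{
    node_set(node, processor_nodes_parsed);
}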

> 
> >
> > Signed-off-by: Wei Chen 
> > ---
> >   xen/arch/arm/numa_device_tree.c | 2 --
> >   xen/arch/x86/srat.c | 3 ---
> >   xen/common/numa.c   | 3 +++
> >   xen/include/xen/nodemask.h  | 2 ++
> >   4 files changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/xen/arch/arm/numa_device_tree.c b/xen/arch/arm/numa_device_tree.c
> > index 27ffb72f7b..f74b7f6427 100644
> > --- a/xen/arch/arm/numa_device_tree.c
> > +++ b/xen/arch/arm/numa_device_tree.c
> > @@ -25,8 +25,6 @@
> >   #include 
> >
> >   s8 device_tree_numa = 0;
> > -static nodemask_t processor_nodes_parsed __initdata;
> > -static nodemask_t memory_nodes_parsed __initdata;
> 
> This is code that was introduced in a previous patch. In general, it is
> better to do the rework first and then add the new code. This makes
> easier to follow series as the code added is not changed.
> 

Yes, I will fix it in next version.

> >
> >   static int srat_disabled(void)
> >   {
> > diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
> > index 2298353846..dd3aa30843 100644
> > --- a/xen/arch/x86/srat.c
> > +++ b/xen/arch/x86/srat.c
> > @@ -24,9 +24,6 @@
> >
> >   static struct acpi_table_slit *__read_mostly acpi_slit;
> >
> > -static nodemask_t memory_nodes_parsed __initdata;
> > -static nodemask_t processor_nodes_parsed __initdata;
> > -
> >   struct pxm2node {
> > unsigned pxm;
> > nodeid_t node;
> > diff --git a/xen/common/numa.c b/xen/common/numa.c
> > index 26c0006d04..79ab250543 100644
> > --- a/xen/common/numa.c
> > +++ b/xen/common/numa.c
> > @@ -35,6 +35,9 @@ int num_node_memblks;
> >   struct node node_memblk_range[NR_NODE_MEMBLKS];
> >   nodeid_t memblk_nodeid[NR_NODE_MEMBLKS];
> >
> > +nodemask_t memory_nodes_parsed __initdata;
> > +nodemask_t processor_nodes_parsed __initdata;
> > +
> >   bool numa_off;
> >
> >   /*
> > diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
> > index 1dd6c7458e..29ce5e28e7 100644
> > --- a/xen/include/xen/nodemask.h
> > +++ b/xen/include/xen/nodemask.h
> > @@ -276,6 +276,8 @@ static inline int __cycle_node(int n, const nodemask_t *maskp, int nbits)
> >*/
> >
> >   extern nodemask_t node_online_map;
> > +extern nodemask_t memory_nodes_parsed;
> > +extern nodemask_t processor_nodes_parsed;
> >
> >   #if MAX_NUMNODES > 1
> >   #define num_online_nodes()nodes_weight(node_online_map)
> >
> 
> Cheers,
> 
> --
> Julien Grall


RE: [XEN RFC PATCH 29/40] xen/arm: implement Arm arch helpers Arm to get memory map info

2021-08-26 Thread Wei Chen
Hi Julien,

> -Original Message-
> From: Julien Grall 
> Sent: 2021年8月26日 1:10
> To: Wei Chen ; xen-devel@lists.xenproject.org;
> sstabell...@kernel.org; jbeul...@suse.com
> Cc: Bertrand Marquis 
> Subject: Re: [XEN RFC PATCH 29/40] xen/arm: implement Arm arch helpers Arm
> to get memory map info
> 
> Hi Wei,
> 
> On 11/08/2021 11:24, Wei Chen wrote:
> > These two helpers are architecture APIs that are required by
> > nodes_cover_memory.
> >
> > Signed-off-by: Wei Chen 
> > ---
> >   xen/arch/arm/numa.c | 14 ++
> >   1 file changed, 14 insertions(+)
> >
> > diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
> > index f61a8df645..6eebf8e8bc 100644
> > --- a/xen/arch/arm/numa.c
> > +++ b/xen/arch/arm/numa.c
> > @@ -126,3 +126,17 @@ void __init numa_init(bool acpi_off)
> >   numa_initmem_init(PFN_UP(ram_start), PFN_DOWN(ram_end));
> >   return;
> >   }
> > +
> > +uint32_t __init arch_meminfo_get_nr_bank(void)
> > +{
> > +   return bootinfo.mem.nr_banks;
> > +}
> > +
> > +int __init arch_meminfo_get_ram_bank_range(int bank,
> > +   unsigned long long *start, unsigned long long *end)
> 
> They are physical addresses, so we should use "paddr_t": on systems such
> as 32-bit Arm, "unsigned long" is not enough to cover all the physical
> addresses.
> 
> As you change the type, I would also suggest changing the bank from an
> int to an unsigned int.
> 

I will fix them in next version.
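
For illustration, the helper with both review points applied might read
(a sketch of the suggested rework, not the committed code):

uint32_t __init arch_meminfo_get_nr_bank(void)
{
    return bootinfo.mem.nr_banks;
}

int __init arch_meminfo_get_ram_bank_range(unsigned int bank,
                                           paddr_t *start, paddr_t *end)
{
    *start = bootinfo.mem.bank[bank].start;
    *end = bootinfo.mem.bank[bank].start + bootinfo.mem.bank[bank].size;

    return 0;
}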

> > +{
> > +   *start = bootinfo.mem.bank[bank].start;
> > +   *end = bootinfo.mem.bank[bank].start + bootinfo.mem.bank[bank].size;
> > +
> > +   return 0;
> > +}
> >
> 
> Cheers,
> 
> --
> Julien Grall


RE: [XEN RFC PATCH 27/40] xen/arm: build CPU NUMA node map while creating cpu_logical_map

2021-08-26 Thread Wei Chen
Hi Julien,

> -Original Message-
> From: Julien Grall 
> Sent: 2021年8月26日 1:07
> To: Wei Chen ; xen-devel@lists.xenproject.org;
> sstabell...@kernel.org
> Cc: Bertrand Marquis ; Jan Beulich
> 
> Subject: Re: [XEN RFC PATCH 27/40] xen/arm: build CPU NUMA node map while
> creating cpu_logical_map
> 
> Hi Wei,
> 
> On 11/08/2021 11:24, Wei Chen wrote:
> > Sometimes, the CPU logical ID may be different from the physical CPU ID.
> > Xen is using the CPU logical ID at runtime, so we should use the
> > CPU logical ID to create the map between NUMA nodes and CPUs.
> 
> This commit message gives the impression that you are trying to fix a
> bug. However, what you are explaining is the reason why the code will
> use the logical ID rather than physical ID.
> 
> I think the commit message should explain what the patch is doing. You
> can then add an explanation why you are using the CPU logical ID.
> Something like "Note we are storing the CPU logical ID because...".
> 
> 

Ok

> >
> > Signed-off-by: Wei Chen 
> > ---
> >   xen/arch/arm/smpboot.c | 31 ++-
> >   1 file changed, 30 insertions(+), 1 deletion(-)
> >
> > diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c
> > index aa78958c07..dd5a45bffc 100644
> > --- a/xen/arch/arm/smpboot.c
> > +++ b/xen/arch/arm/smpboot.c
> > @@ -121,7 +121,12 @@ static void __init dt_smp_init_cpus(void)
> >   {
> >   [0 ... NR_CPUS - 1] = MPIDR_INVALID
> >   };
> > +static nodeid_t node_map[NR_CPUS] __initdata =
> > +{
> > +[0 ... NR_CPUS - 1] = NUMA_NO_NODE
> > +};
> >   bool bootcpu_valid = false;
> > +uint32_t nid = 0;
> >   int rc;
> >
> >   mpidr = boot_cpu_data.mpidr.bits & MPIDR_HWID_MASK;
> > @@ -172,6 +177,26 @@ static void __init dt_smp_init_cpus(void)
> >   continue;
> >   }
> >
> > +#ifdef CONFIG_DEVICE_TREE_NUMA
> > +/*
> > + *  When CONFIG_DEVICE_TREE_NUMA is set, try to fetch numa infomation
> > + * from CPU dts node, otherwise the nid is always 0.
> > + */
> > +if ( !dt_property_read_u32(cpu, "numa-node-id", &nid) )
> 
> You can avoid the #ifdef by writing:
> 
> if ( IS_ENABLED(CONFIG_DEVICE_TREE_NUMA) && ... )
> 
> However, I would use CONFIG_NUMA because this code is already DT
> specific. So we can shorten the name a bit.
> 

OK
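
For illustration, the suggested form would be roughly (a sketch of the
review suggestion, not the committed code):

if ( IS_ENABLED(CONFIG_NUMA) &&
     !dt_property_read_u32(cpu, "numa-node-id", &nid) )
{
    /* ... warn and fall back to nid = 0, as before ... */
}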

> > +{
> > +printk(XENLOG_WARNING
> > +"cpu[%d] dts path: %s: doesn't have numa infomation!\n",
> 
> s/infomation/information/

OK

> 
> > +cpuidx, dt_node_full_name(cpu));
> > +/*
> > + * The the early stage of NUMA initialization, when Xen found any
> 
> s/The/During/?

Oh, yes, I will fix it.

> 
> > + * CPU dts node doesn't have numa-node-id info, the NUMA will be
> > + * treated as off, all CPU will be set to a FAKE node 0. So if we
> > + * get numa-node-id failed here, we should set nid to 0.
> > + */
> > +nid = 0;
> > +}
> > +#endif
> > +
> >   /*
> >* 8 MSBs must be set to 0 in the DT since the reg property
> >* defines the MPIDR[23:0]
> > @@ -231,9 +256,12 @@ static void __init dt_smp_init_cpus(void)
> >   {
> >   printk("cpu%d init failed (hwid %"PRIregister"): %d\n", i,
> hwid, rc);
> >   tmp_map[i] = MPIDR_INVALID;
> > +node_map[i] = NUMA_NO_NODE;
> >   }
> > -else
> > +else {
> >   tmp_map[i] = hwid;
> > +node_map[i] = nid;
> > +}
> >   }
> >
> >   if ( !bootcpu_valid )
> > @@ -249,6 +277,7 @@ static void __init dt_smp_init_cpus(void)
> >   continue;
> >   cpumask_set_cpu(i, &cpu_possible_map);
> >   cpu_logical_map(i) = tmp_map[i];
> > +numa_set_node(i, node_map[i]);
> >   }
> >   }
> >>
> 
> Cheers,
> 
> --
> Julien Grall


[PATCH v7 8/8] AMD/IOMMU: respect AtsDisabled device flag

2021-08-26 Thread Jan Beulich
IVHD entries may specify that ATS is to be blocked for a device or range
of devices. Honor firmware telling us so.

While adding respective checks I noticed that the 2nd conditional in
amd_iommu_setup_domain_device() failed to check the IOMMU's capability.
Add the missing part of the condition there, as no good can come from
enabling ATS on a device when the IOMMU is not capable of dealing with
ATS requests.

For actually using ACPI_IVHD_ATS_DISABLED, make its expansion no longer
exhibit UB.

Signed-off-by: Jan Beulich 
---
TBD: I find the ordering in amd_iommu_disable_domain_device()
 suspicious: amd_iommu_enable_domain_device() sets up the DTE first
 and then enables ATS on the device. It would seem to me that
 disabling would better be done the other way around (disable ATS on
 device, then adjust DTE).
TBD: As an alternative to adding the missing IOMMU capability check, we
 may want to consider simply using dte->i in the 2nd conditional in
 amd_iommu_enable_domain_device().
For both of these, while ATS enabling/disabling gets invoked without any
locks held, the two functions should not be possible to race with one
another for any individual device (or else we'd be in trouble already,
as ATS might then get re-enabled immediately after it was disabled, with
the DTE out of sync with this setting).
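
For reference, the UB referred to above is presumably the signed left shift
in the ACPI header's definition; the adjustment would then be along these
lines (an assumption, not quoted from the patch):

/* was: (1<<31) -- shifting 1 into the sign bit of an int is UB */
#define ACPI_IVHD_ATS_DISABLED  (1u << 31)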
---
v7: New.

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -120,6 +120,7 @@ struct ivrs_mappings {
 uint16_t dte_requestor_id;
 bool valid:1;
 bool dte_allow_exclusion:1;
+bool block_ats:1;
 
 /* ivhd device data settings */
 uint8_t device_flags;
--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -55,8 +55,8 @@ union acpi_ivhd_device {
 };
 
 static void __init add_ivrs_mapping_entry(
-uint16_t bdf, uint16_t alias_id, uint8_t flags, bool alloc_irt,
-struct amd_iommu *iommu)
+uint16_t bdf, uint16_t alias_id, uint8_t flags, unsigned int ext_flags,
+bool alloc_irt, struct amd_iommu *iommu)
 {
 struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(iommu->seg);
 
@@ -66,6 +66,7 @@ static void __init add_ivrs_mapping_entr
 ivrs_mappings[bdf].dte_requestor_id = alias_id;
 
 /* override flags for range of devices */
+ivrs_mappings[bdf].block_ats = ext_flags & ACPI_IVHD_ATS_DISABLED;
 ivrs_mappings[bdf].device_flags = flags;
 
 /* Don't map an IOMMU by itself. */
@@ -499,7 +500,7 @@ static u16 __init parse_ivhd_device_sele
 return 0;
 }
 
-add_ivrs_mapping_entry(bdf, bdf, select->header.data_setting, false,
+add_ivrs_mapping_entry(bdf, bdf, select->header.data_setting, 0, false,
iommu);
 
 return sizeof(*select);
@@ -545,7 +546,7 @@ static u16 __init parse_ivhd_device_rang
 AMD_IOMMU_DEBUG(" Dev_Id Range: %#x -> %#x\n", first_bdf, last_bdf);
 
 for ( bdf = first_bdf; bdf <= last_bdf; bdf++ )
-add_ivrs_mapping_entry(bdf, bdf, range->start.header.data_setting,
+add_ivrs_mapping_entry(bdf, bdf, range->start.header.data_setting, 0,
false, iommu);
 
 return dev_length;
@@ -580,7 +581,7 @@ static u16 __init parse_ivhd_device_alia
 
 AMD_IOMMU_DEBUG(" Dev_Id Alias: %#x\n", alias_id);
 
-add_ivrs_mapping_entry(bdf, alias_id, alias->header.data_setting, true,
+add_ivrs_mapping_entry(bdf, alias_id, alias->header.data_setting, 0, true,
iommu);
 
 return dev_length;
@@ -636,7 +637,7 @@ static u16 __init parse_ivhd_device_alia
 
 for ( bdf = first_bdf; bdf <= last_bdf; bdf++ )
 add_ivrs_mapping_entry(bdf, alias_id, range->alias.header.data_setting,
-   true, iommu);
+   0, true, iommu);
 
 return dev_length;
 }
@@ -661,7 +662,8 @@ static u16 __init parse_ivhd_device_exte
 return 0;
 }
 
-add_ivrs_mapping_entry(bdf, bdf, ext->header.data_setting, false, iommu);
+add_ivrs_mapping_entry(bdf, bdf, ext->header.data_setting,
+   ext->extended_data, false, iommu);
 
 return dev_length;
 }
@@ -708,7 +710,7 @@ static u16 __init parse_ivhd_device_exte
 
 for ( bdf = first_bdf; bdf <= last_bdf; bdf++ )
 add_ivrs_mapping_entry(bdf, bdf, range->extended.header.data_setting,
-   false, iommu);
+   range->extended.extended_data, false, iommu);
 
 return dev_length;
 }
@@ -800,7 +802,7 @@ static u16 __init parse_ivhd_device_spec
 
 AMD_IOMMU_DEBUG("IVHD Special: %pp variety %#x handle %#x\n",
&PCI_SBDF2(seg, bdf), special->variety, special->handle);
-add_ivrs_mapping_entry(bdf, bdf, special->header.data_setting, true,
+add_ivrs_mapping_entry(bdf, bdf, special->header.data_setting, 0, true,
iommu);
 
 switch ( special->variety )
--- 

[PATCH v7 7/8] AMD/IOMMU: add "ivmd=" command line option

2021-08-26 Thread Jan Beulich
Just like VT-d's "rmrr=" it can be used to cover for firmware omissions.
Since systems surfacing IVMDs seem to be rare, it is also meant to allow
testing of the involved code.

Only the IVMD flavors actually understood by the IVMD parsing logic can
be generated, and for this initial implementation there's also no way to
control the flags field - unity r/w mappings are assumed.

Signed-off-by: Jan Beulich 
Reviewed-by: Paul Durrant 
---
v5: New.

--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -836,12 +836,12 @@ Controls for the dom0 IOMMU setup.
 
 Typically, some devices in a system use bits of RAM for communication, and
 these areas should be listed as reserved in the E820 table and identified
-via RMRR or IVMD entries in the APCI tables, so Xen can ensure that they
+via RMRR or IVMD entries in the ACPI tables, so Xen can ensure that they
 are identity-mapped in the IOMMU.  However, some firmware makes mistakes,
 and this option is a coarse-grain workaround for those errors.
 
 Where possible, finer grain corrections should be made with the `rmrr=`,
-`ivrs_hpet=` or `ivrs_ioapic=` command line options.
+`ivmd=`, `ivrs_hpet[]=`, or `ivrs_ioapic[]=` command line options.
 
 This option is disabled by default, and deprecated and intended for
 removal in future versions of Xen.  If specifying `map-inclusive` is the
@@ -1523,6 +1523,31 @@ _dom0-iommu=map-inclusive_ - using both
 > `= `
 
 ### irq_vector_map (x86)
+
+### ivmd (x86)
+> `= <start>[-<end>][=<bdf1>[-<bdf1'>][,<bdf2>[-<bdf2'>][,...]]][;...]`
+
+Define IVMD-like ranges that are missing from ACPI tables along with the
+device(s) they belong to, and use them for 1:1 mapping.  End addresses can be
+omitted when exactly one page is meant.  The ranges are inclusive when start
+and end are specified.  Note that only PCI segment 0 is supported at this time,
+but it is fine to specify it explicitly.
+
+'start' and 'end' values are page numbers (not full physical addresses),
+in hexadecimal format (can optionally be preceded by "0x").
+
+Omitting the optional (range of) BDF specifiers signals that the range is to
+be applied to all devices.
+
+Usage example: If device 0:0:1d.0 requires one page (0xd5d45) to be
+reserved, and devices 0:0:1a.0...0:0:1a.3 collectively require three pages
+(0xd5d46 thru 0xd5d48) to be reserved, one usage would be:
+
+ivmd=d5d45=0:1d.0;0xd5d46-0xd5d48=0:1a.0-0:1a.3
+
+Note: grub2 requires escaping or quoting special characters, like ';', when
+multiple ranges are specified - refer to the grub2 documentation.
+
 ### ivrs_hpet[`<hpet>`] (AMD)
 > `=[<seg>:]<bus>:<device>.<func>`
 
--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -1063,6 +1063,9 @@ static void __init dump_acpi_table_heade
 
 }
 
+static struct acpi_ivrs_memory __initdata user_ivmds[8];
+static unsigned int __initdata nr_ivmd;
+
 #define to_ivhd_block(hdr) \
 container_of(hdr, const struct acpi_ivrs_hardware, header)
 #define to_ivmd_block(hdr) \
@@ -1087,7 +1090,7 @@ static int __init parse_ivrs_table(struc
 {
 const struct acpi_ivrs_header *ivrs_block;
 unsigned long length;
-unsigned int apic;
+unsigned int apic, i;
 bool_t sb_ioapic = !iommu_intremap;
 int error = 0;
 
@@ -1122,6 +1125,12 @@ static int __init parse_ivrs_table(struc
 length += ivrs_block->length;
 }
 
+/* Add command line specified IVMD-equivalents. */
+if ( nr_ivmd )
+AMD_IOMMU_DEBUG("IVMD: %u command line provided entries\n", nr_ivmd);
+for ( i = 0; !error && i < nr_ivmd; ++i )
+error = parse_ivmd_block(user_ivmds + i);
+
 /* Each IO-APIC must have been mentioned in the table. */
 for ( apic = 0; !error && iommu_intremap && apic < nr_ioapics; ++apic )
 {
@@ -1362,3 +1371,80 @@ int __init amd_iommu_get_supported_ivhd_
 {
 return acpi_table_parse(ACPI_SIG_IVRS, get_supported_ivhd_type);
 }
+
+/*
+ * Parse "ivmd" command line option to later add the parsed devices / regions
+ * into unity mapping lists, just like IVMDs parsed from ACPI.
+ * Format:
+ * ivmd=<start>[-<end>][=<bdf1>[-<bdf1'>][,<bdf2>[-<bdf2'>][,...]]][;...]
+ */
+static int __init parse_ivmd_param(const char *s)
+{
+do {
+unsigned long start, end;
+const char *cur;
+
+if ( nr_ivmd >= ARRAY_SIZE(user_ivmds) )
+return -E2BIG;
+
+start = simple_strtoul(cur = s, &s, 16);
+if ( cur == s )
+return -EINVAL;
+
+if ( *s == '-' )
+{
+end = simple_strtoul(cur = s + 1, &s, 16);
+if ( cur == s || end < start )
+return -EINVAL;
+}
+else
+end = start;
+
+if ( *s != '=' )
+{
+user_ivmds[nr_ivmd].start_address = start << PAGE_SHIFT;
+user_ivmds[nr_ivmd].memory_length = (end - start + 1) << PAGE_SHIFT;
+user_ivmds[nr_ivmd].header.flags = ACPI_IVMD_UNITY |
+   ACPI_IVMD_READ | ACPI_IVMD_WRITE;
+

[PATCH v7 6/8] AMD/IOMMU: provide function backing XENMEM_reserved_device_memory_map

2021-08-26 Thread Jan Beulich
Just like for VT-d, exclusion / unity map ranges would better be
reflected in e.g. the guest's E820 map. The reporting infrastructure
was put in place still pretty tailored to VT-d's needs; extend
get_reserved_device_memory() to allow vendor specific code to probe
whether a particular (seg,bus,dev,func) tuple would get its data
actually recorded. I admit the de-duplication of entries is quite
limited for now, but considering our trouble to find a system
surfacing _any_ IVMD this is likely not a critical issue for this
initial implementation.

Signed-off-by: Jan Beulich 
Reviewed-by: Paul Durrant 
---
v7: Re-base.
v5: New.
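
For context, the probing convention added here can be illustrated as
follows (a sketch; iommu_grdm_t keeps its existing signature, while the
exact call site shown is an assumption):

/* Vendor code asks "would this device get entries recorded?" by invoking
 * the callback with nr == 0; a return of 1 means yes, and no entry is
 * emitted for the probe itself. */
rc = func(0, 0, PCI_SBDF2(seg, bdf).sbdf, ctxt);
if ( rc == 1 )
    /* device has reserved ranges worth reporting */;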

--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -1042,6 +1042,9 @@ static int get_reserved_device_memory(xe
 if ( !(grdm->map.flags & XENMEM_RDM_ALL) && (sbdf != id) )
 return 0;
 
+if ( !nr )
+return 1;
+
 if ( grdm->used_entries < grdm->map.nr_entries )
 {
 struct xen_reserved_device_memory rdm = {
--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -110,6 +110,7 @@ struct amd_iommu {
 struct ivrs_unity_map {
 bool read:1;
 bool write:1;
+bool global:1;
 paddr_t addr;
 unsigned long length;
 struct ivrs_unity_map *next;
@@ -236,6 +237,7 @@ int amd_iommu_reserve_domain_unity_map(s
unsigned int flag);
 int amd_iommu_reserve_domain_unity_unmap(struct domain *d,
  const struct ivrs_unity_map *map);
+int amd_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
 int __must_check amd_iommu_flush_iotlb_pages(struct domain *d, dfn_t dfn,
  unsigned long page_count,
  unsigned int flush_flags);
--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -145,7 +145,7 @@ static int __init reserve_iommu_exclusio
 
 static int __init reserve_unity_map_for_device(
 uint16_t seg, uint16_t bdf, unsigned long base,
-unsigned long length, bool iw, bool ir)
+unsigned long length, bool iw, bool ir, bool global)
 {
 struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
 struct ivrs_unity_map *unity_map = ivrs_mappings[bdf].unity_map;
@@ -164,7 +164,11 @@ static int __init reserve_unity_map_for_
  */
 if ( base == unity_map->addr && length == unity_map->length &&
  ir == unity_map->read && iw == unity_map->write )
+{
+if ( global )
+unity_map->global = true;
 return 0;
+}
 
 if ( unity_map->addr + unity_map->length > base &&
  base + length > unity_map->addr )
@@ -183,6 +187,7 @@ static int __init reserve_unity_map_for_
 
 unity_map->read = ir;
 unity_map->write = iw;
+unity_map->global = global;
 unity_map->addr = base;
 unity_map->length = length;
 unity_map->next = ivrs_mappings[bdf].unity_map;
@@ -222,7 +227,8 @@ static int __init register_range_for_all
 
 /* reserve r/w unity-mapped page entries for devices */
 for ( bdf = rc = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
-rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
+rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir,
+  true);
 }
 
 return rc;
@@ -255,8 +261,10 @@ static int __init register_range_for_dev
 paddr_t length = limit + PAGE_SIZE - base;
 
 /* reserve unity-mapped page entries for device */
-rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir) ?:
- reserve_unity_map_for_device(seg, req, base, length, iw, ir);
+rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir,
+  false) ?:
+ reserve_unity_map_for_device(seg, req, base, length, iw, ir,
+  false);
 }
 else
 {
@@ -292,9 +300,9 @@ static int __init register_range_for_iom
 
 req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id;
 rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length,
-  iw, ir) ?:
+  iw, ir, false) ?:
  reserve_unity_map_for_device(iommu->seg, req, base, length,
-  iw, ir);
+  iw, ir, false);
 }
 
 return rc;
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -467,6 +467,81 @@ int amd_iommu_reserve_domain_unity_unmap
 return rc;
 }
 
+int amd_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
+{
+unsigned int seg = 0 /* XXX */, bdf;
+const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
+/* At least for global entries, 

[PATCH v7 5/8] AMD/IOMMU: also insert IVMD ranges into Dom0's page tables

2021-08-26 Thread Jan Beulich
So far only one region would be taken care of, if it can be placed in
the exclusion range registers of the IOMMU. Take care of further ranges
as well. Seeing that we've been doing fine without this, make both
insertion and removal best effort only.

Signed-off-by: Jan Beulich 
Reviewed-by: Paul Durrant 
---
v5: New.

--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -522,6 +522,14 @@ static int amd_iommu_add_device(u8 devfn
 amd_iommu_flush_device(iommu, bdf);
 }
 
+if ( amd_iommu_reserve_domain_unity_map(
+ pdev->domain,
+ ivrs_mappings[ivrs_mappings[bdf].dte_requestor_id].unity_map,
+ 0) )
+AMD_IOMMU_DEBUG("%pd: unity mapping failed for %04x:%02x:%02x.%u\n",
+pdev->domain, pdev->seg, pdev->bus, PCI_SLOT(devfn),
+PCI_FUNC(devfn));
+
 return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
 }
 
@@ -547,6 +555,14 @@ static int amd_iommu_remove_device(u8 de
 
 ivrs_mappings = get_ivrs_mappings(pdev->seg);
 bdf = PCI_BDF2(pdev->bus, devfn);
+
+if ( amd_iommu_reserve_domain_unity_unmap(
+ pdev->domain,
+ ivrs_mappings[ivrs_mappings[bdf].dte_requestor_id].unity_map) )
+AMD_IOMMU_DEBUG("%pd: unity unmapping failed for %04x:%02x:%02x.%u\n",
+pdev->domain, pdev->seg, pdev->bus, PCI_SLOT(devfn),
+PCI_FUNC(devfn));
+
 if ( amd_iommu_perdev_intremap &&
  ivrs_mappings[bdf].dte_requestor_id == bdf &&
  ivrs_mappings[bdf].intremap_table )




RE: [XEN RFC PATCH 26/40] xen/arm: Add boot and secondary CPU to NUMA system

2021-08-26 Thread Wei Chen
Hi Julien,

> -Original Message-
> From: Julien Grall 
> Sent: 2021年8月26日 0:58
> To: Wei Chen ; xen-devel@lists.xenproject.org;
> sstabell...@kernel.org; jbeul...@suse.com
> Cc: Bertrand Marquis 
> Subject: Re: [XEN RFC PATCH 26/40] xen/arm: Add boot and secondary CPU to
> NUMA system
> 
> Hi Wei,
> 
> On 11/08/2021 11:24, Wei Chen wrote:
> > When cpu boot up, we have add them to NUMA system. In current
> > stage, we have not parsed the NUMA data, but we have created
> > a fake NUMA node. So, in this patch, all CPU will be added
> > to NUMA node#0. After the NUMA data has been parsed from device
> > tree, the CPU will be added to correct NUMA node as the NUMA
> > data described.
> >
> > Signed-off-by: Wei Chen 
> > ---
> >   xen/arch/arm/setup.c   | 6 ++
> >   xen/arch/arm/smpboot.c | 6 ++
> >   xen/include/asm-arm/numa.h | 1 +
> >   3 files changed, 13 insertions(+)
> >
> > diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
> > index 3c58d2d441..7531989f21 100644
> > --- a/xen/arch/arm/setup.c
> > +++ b/xen/arch/arm/setup.c
> > @@ -918,6 +918,12 @@ void __init start_xen(unsigned long boot_phys_offset,
> >
> >   processor_id();
> >
> > +/*
> > + * If Xen is running on a NUMA off system, there will
> > + * be a node#0 at least.
> > + */
> > +numa_add_cpu(0);
> > +
> >   smp_init_cpus();
> >   cpus = smp_get_max_cpus();
> >   printk(XENLOG_INFO "SMP: Allowing %u CPUs\n", cpus);
> > diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c
> > index a1ee3146ef..aa78958c07 100644
> > --- a/xen/arch/arm/smpboot.c
> > +++ b/xen/arch/arm/smpboot.c
> > @@ -358,6 +358,12 @@ void start_secondary(void)
> >*/
> >   smp_wmb();
> >
> > +/*
> > + * If Xen is running on a NUMA off system, there will
> > + * be a node#0 at least.
> > + */
> > +numa_add_cpu(cpuid);
> > +
> 
> On x86, numa_add_cpu() will be called before the pCPU is brought up. I
> am not quite too sure why we are doing it differently here. Can you
> clarify it?

Of course we can invoke numa_add_cpu() before cpu_up() as x86 does. But in my
tests, I found that when a CPU failed to be brought up, it was still added to
NUMA. Although this does not affect the execution of the code (because the CPU
is offline), I don't think adding an offline CPU to NUMA makes sense.
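
For comparison, mirroring the x86 placement would look roughly like the
sketch below; note that numa_remove_cpu() is hypothetical here, not
existing code:

numa_add_cpu(cpu);          /* register before bring-up, as x86 does */
ret = cpu_up(cpu);
if ( ret )
    numa_remove_cpu(cpu);   /* hypothetical cleanup on bring-up failure */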



> 
> >   /* Now report this CPU is up */
> >   cpumask_set_cpu(cpuid, &cpu_online_map);
> >
> > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> > index 7a3588ac7f..dd31324b0b 100644
> > --- a/xen/include/asm-arm/numa.h
> > +++ b/xen/include/asm-arm/numa.h
> > @@ -59,6 +59,7 @@ extern mfn_t first_valid_mfn;
> >   #define __node_distance(a, b) (20)
> >
> >   #define numa_init(x) do { } while (0)
> > +#define numa_add_cpu(x) do { } while (0)
> 
> This is a stubs for a common helper. So I think this wants to be moved
> in the !CONFIG_NUMA in xen/numa.h.
> 

OK

> >   #define numa_set_node(x, y) do { } while (0)
> >
> >   #endif
> >
> 
> Cheers,
> 
> --
> Julien Grall


[PATCH v7 4/8] AMD/IOMMU: check IVMD ranges against host implementation limits

2021-08-26 Thread Jan Beulich
When such ranges can't be represented as 1:1 mappings in page tables,
reject them as presumably bogus. Note that when we detect features late
(because of EFRSup being clear in the ACPI tables), it would be quite a
bit of work to check for (and drop) out of range IVMD ranges, so IOMMU
initialization gets failed in this case instead.

Signed-off-by: Jan Beulich 
Reviewed-by: Paul Durrant 
---
v7: Re-base.
v6: Re-base.
v5: New.

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -305,6 +305,7 @@ extern struct hpet_sbdf {
 } hpet_sbdf;
 
 extern unsigned int amd_iommu_acpi_info;
+extern unsigned int amd_iommu_max_paging_mode;
 extern int amd_iommu_min_paging_mode;
 
 extern void *shared_intremap_table;
@@ -358,7 +359,7 @@ static inline int amd_iommu_get_paging_m
 while ( max_frames > PTE_PER_TABLE_SIZE )
 {
 max_frames = PTE_PER_TABLE_ALIGN(max_frames) >> PTE_PER_TABLE_SHIFT;
-if ( ++level > 6 )
+if ( ++level > amd_iommu_max_paging_mode )
 return -ENOMEM;
 }
 
--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -370,6 +370,7 @@ static int __init parse_ivmd_device_iomm
 static int __init parse_ivmd_block(const struct acpi_ivrs_memory *ivmd_block)
 {
 unsigned long start_addr, mem_length, base, limit;
+unsigned int addr_bits;
 bool iw = true, ir = true, exclusion = false;
 
 if ( ivmd_block->header.length < sizeof(*ivmd_block) )
@@ -386,6 +387,17 @@ static int __init parse_ivmd_block(const
 AMD_IOMMU_DEBUG("IVMD Block: type %#x phys %#lx len %#lx\n",
 ivmd_block->header.type, start_addr, mem_length);
 
+addr_bits = min(MASK_EXTR(amd_iommu_acpi_info, ACPI_IVRS_PHYSICAL_SIZE),
+MASK_EXTR(amd_iommu_acpi_info, ACPI_IVRS_VIRTUAL_SIZE));
+if ( amd_iommu_get_paging_mode(PFN_UP(start_addr + mem_length)) < 0 ||
+ (addr_bits < BITS_PER_LONG &&
+  ((start_addr + mem_length - 1) >> addr_bits)) )
+{
+AMD_IOMMU_DEBUG("IVMD: [%lx,%lx) is not IOMMU addressable\n",
+start_addr, start_addr + mem_length);
+return 0;
+}
+
 if ( !e820_all_mapped(base, limit + PAGE_SIZE, E820_RESERVED) )
 {
 paddr_t addr;
--- a/xen/drivers/passthrough/amd/iommu_detect.c
+++ b/xen/drivers/passthrough/amd/iommu_detect.c
@@ -67,6 +67,9 @@ void __init get_iommu_features(struct am
 
 iommu->features.raw =
 readq(iommu->mmio_base + IOMMU_EXT_FEATURE_MMIO_OFFSET);
+
+if ( 4 + iommu->features.flds.hats < amd_iommu_max_paging_mode )
+amd_iommu_max_paging_mode = 4 + iommu->features.flds.hats;
 }
 
 /* Don't log the same set of features over and over. */
@@ -200,6 +203,10 @@ int __init amd_iommu_detect_one_acpi(
 else if ( list_empty(&amd_iommu_head) )
 AMD_IOMMU_DEBUG("EFRSup not set in ACPI table; will fall back to hardware\n");
 
+if ( (amd_iommu_acpi_info & ACPI_IVRS_EFR_SUP) &&
+ 4 + iommu->features.flds.hats < amd_iommu_max_paging_mode )
+amd_iommu_max_paging_mode = 4 + iommu->features.flds.hats;
+
 /* override IOMMU HT flags */
 iommu->ht_flags = ivhd_block->header.flags;
 
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -1376,6 +1376,13 @@ static int __init amd_iommu_prepare_one(
 
 get_iommu_features(iommu);
 
+/*
+ * Late extended feature determination may cause previously mappable
+ * IVMD ranges to become unmappable.
+ */
+if ( amd_iommu_max_paging_mode < amd_iommu_min_paging_mode )
+return -ERANGE;
+
 return 0;
 }
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -254,6 +254,7 @@ int amd_iommu_alloc_root(struct domain *
 return 0;
 }
 
+unsigned int __read_mostly amd_iommu_max_paging_mode = 6;
 int __read_mostly amd_iommu_min_paging_mode = 1;
 
 static int amd_iommu_domain_init(struct domain *d)




[PATCH v7 3/8] AMD/IOMMU: improve (extended) feature detection

2021-08-26 Thread Jan Beulich
First of all the documentation is very clear about ACPI table data
superseding raw register data. Use raw register data only if EFRSup is
clear in the ACPI tables (which may still go too far). Additionally if
this flag is clear, the IVRS type 11H table is reserved and hence may
not be recognized.

Furthermore propagate IVRS type 10H data into the feature flags
recorded, as the full extended features field is available in type 11H
only.

Note that this also makes it necessary to stop the bad practice of
finding a type 11H IVHD entry, but still processing the type 10H one
in detect_iommu_acpi()'s invocation of amd_iommu_detect_one_acpi().

Note also that the features.raw check in amd_iommu_prepare_one() needs
replacing, now that the field can also be populated by different means.
Key IOMMUv2 availability off of IVHD type not being 10H, and then move
it a function layer up, so that it would be set only once all IOMMUs
have been successfully prepared.

Signed-off-by: Jan Beulich 
Reviewed-by: Paul Durrant 
---
v7: Re-base.
v5: New.
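
The "move it a function layer up" part could be pictured as follows (a
sketch; the flag name is an assumption, not quoted from the patch):

/* In amd_iommu_prepare(), only once all IOMMUs were prepared successfully: */
if ( ivhd_type != ACPI_IVRS_TYPE_HARDWARE )
    iommuv2_enabled = true;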

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -304,6 +304,7 @@ extern struct hpet_sbdf {
 } init;
 } hpet_sbdf;
 
+extern unsigned int amd_iommu_acpi_info;
 extern int amd_iommu_min_paging_mode;
 
 extern void *shared_intremap_table;
--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -1051,7 +1051,8 @@ static void __init dump_acpi_table_heade
 static inline bool_t is_ivhd_block(u8 type)
 {
 return (type == ACPI_IVRS_TYPE_HARDWARE ||
-type == ACPI_IVRS_TYPE_HARDWARE_11H);
+((amd_iommu_acpi_info & ACPI_IVRS_EFR_SUP) &&
+ type == ACPI_IVRS_TYPE_HARDWARE_11H));
 }
 
 static inline bool_t is_ivmd_block(u8 type)
@@ -1159,7 +1160,7 @@ static int __init detect_iommu_acpi(stru
 ivrs_block = (struct acpi_ivrs_header *)((u8 *)table + length);
 if ( table->length < (length + ivrs_block->length) )
 return -ENODEV;
-if ( ivrs_block->type == ACPI_IVRS_TYPE_HARDWARE &&
+if ( ivrs_block->type == ivhd_type &&
  amd_iommu_detect_one_acpi(to_ivhd_block(ivrs_block)) != 0 )
 return -ENODEV;
 length += ivrs_block->length;
@@ -1299,6 +1300,9 @@ get_supported_ivhd_type(struct acpi_tabl
 return -ENODEV;
 }
 
+amd_iommu_acpi_info = container_of(table, const struct acpi_table_ivrs,
+   header)->info;
+
 while ( table->length > (length + sizeof(*ivrs_block)) )
 {
 ivrs_block = (struct acpi_ivrs_header *)((u8 *)table + length);
--- a/xen/drivers/passthrough/amd/iommu_detect.c
+++ b/xen/drivers/passthrough/amd/iommu_detect.c
@@ -60,14 +60,14 @@ void __init get_iommu_features(struct am
 const struct amd_iommu *first;
 ASSERT( iommu->mmio_base );
 
-if ( !iommu_has_cap(iommu, PCI_CAP_EFRSUP_SHIFT) )
+if ( !(amd_iommu_acpi_info & ACPI_IVRS_EFR_SUP) )
 {
-iommu->features.raw = 0;
-return;
-}
+if ( !iommu_has_cap(iommu, PCI_CAP_EFRSUP_SHIFT) )
+return;
 
-iommu->features.raw =
-readq(iommu->mmio_base + IOMMU_EXT_FEATURE_MMIO_OFFSET);
+iommu->features.raw =
+readq(iommu->mmio_base + IOMMU_EXT_FEATURE_MMIO_OFFSET);
+}
 
 /* Don't log the same set of features over and over. */
 first = list_first_entry(&amd_iommu_head, struct amd_iommu, list);
@@ -164,6 +164,42 @@ int __init amd_iommu_detect_one_acpi(
 iommu->cap_offset = ivhd_block->capability_offset;
 iommu->mmio_base_phys = ivhd_block->base_address;
 
+if ( ivhd_type != ACPI_IVRS_TYPE_HARDWARE )
+iommu->features.raw = ivhd_block->efr_image;
+else if ( amd_iommu_acpi_info & ACPI_IVRS_EFR_SUP )
+{
+union {
+uint32_t raw;
+struct {
+unsigned int xt_sup:1;
+unsigned int nx_sup:1;
+unsigned int gt_sup:1;
+unsigned int glx_sup:2;
+unsigned int ia_sup:1;
+unsigned int ga_sup:1;
+unsigned int he_sup:1;
+unsigned int pas_max:5;
+unsigned int pn_counters:4;
+unsigned int pn_banks:6;
+unsigned int msi_num_ppr:5;
+unsigned int gats:2;
+unsigned int hats:2;
+};
+} attr = { .raw = ivhd_block->iommu_attr };
+
+iommu->features.flds.xt_sup = attr.xt_sup;
+iommu->features.flds.nx_sup = attr.nx_sup;
+iommu->features.flds.gt_sup = attr.gt_sup;
+iommu->features.flds.glx_sup = attr.glx_sup;
+iommu->features.flds.ia_sup = attr.ia_sup;
+iommu->features.flds.ga_sup = attr.ga_sup;
+iommu->features.flds.pas_max = attr.pas_max;
+iommu->features.flds.gats = attr.gats;
+iommu->features.flds.hats = attr.hats;
+}
+else if ( list_empty(&amd_iommu_head) )
+

[PATCH v7 2/8] AMD/IOMMU: obtain IVHD type to use earlier

2021-08-26 Thread Jan Beulich
Doing this in amd_iommu_prepare() is too late for it, in particular, to
be used in amd_iommu_detect_one_acpi(), as a subsequent change will want
to do. Moving it immediately ahead of amd_iommu_detect_acpi() is
(luckily) pretty simple, (pretty importantly) without breaking
amd_iommu_prepare()'s logic to prevent multiple processing.

This involves moving table checksumming, as
amd_iommu_get_supported_ivhd_type() -> get_supported_ivhd_type() will
now be invoked before amd_iommu_detect_acpi() -> detect_iommu_acpi(). In
the course of doing so, stop open-coding acpi_tb_checksum(), seeing that
we have other uses of this originally ACPI-private function elsewhere in
the tree.

Signed-off-by: Jan Beulich 
---
v7: Move table checksumming.
v5: New.

--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -22,6 +22,8 @@
 
 #include 
 
+#include 
+
 #include "iommu.h"
 
 /* Some helper structures, particularly to deal with ranges. */
@@ -1150,20 +1152,7 @@ static int __init parse_ivrs_table(struc
 static int __init detect_iommu_acpi(struct acpi_table_header *table)
 {
 const struct acpi_ivrs_header *ivrs_block;
-unsigned long i;
 unsigned long length = sizeof(struct acpi_table_ivrs);
-u8 checksum, *raw_table;
-
-/* validate checksum: sum of entire table == 0 */
-checksum = 0;
-raw_table = (u8 *)table;
-for ( i = 0; i < table->length; i++ )
-checksum += raw_table[i];
-if ( checksum )
-{
-AMD_IOMMU_DEBUG("IVRS Error: Invalid Checksum %#x\n", checksum);
-return -ENODEV;
-}
 
 while ( table->length > (length + sizeof(*ivrs_block)) )
 {
@@ -1300,6 +1289,15 @@ get_supported_ivhd_type(struct acpi_tabl
 {
 size_t length = sizeof(struct acpi_table_ivrs);
 const struct acpi_ivrs_header *ivrs_block, *blk = NULL;
+uint8_t checksum;
+
+/* Validate checksum: Sum of entire table == 0. */
+checksum = acpi_tb_checksum(ACPI_CAST_PTR(uint8_t, table), table->length);
+if ( checksum )
+{
+AMD_IOMMU_DEBUG("IVRS Error: Invalid Checksum %#x\n", checksum);
+return -ENODEV;
+}
 
 while ( table->length > (length + sizeof(*ivrs_block)) )
 {
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -1398,15 +1398,9 @@ int __init amd_iommu_prepare(bool xt)
 goto error_out;
 
 /* Have we been here before? */
-if ( ivhd_type )
+if ( ivrs_bdf_entries )
 return 0;
 
-rc = amd_iommu_get_supported_ivhd_type();
-if ( rc < 0 )
-goto error_out;
-BUG_ON(!rc);
-ivhd_type = rc;
-
 rc = amd_iommu_get_ivrs_dev_entries();
 if ( !rc )
 rc = -ENODEV;
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -179,9 +179,17 @@ static int __must_check amd_iommu_setup_
 
 int __init acpi_ivrs_init(void)
 {
+int rc;
+
 if ( !iommu_enable && !iommu_intremap )
 return 0;
 
+rc = amd_iommu_get_supported_ivhd_type();
+if ( rc < 0 )
+return rc;
+BUG_ON(!rc);
+ivhd_type = rc;
+
 if ( (amd_iommu_detect_acpi() !=0) || (iommu_found() == 0) )
 {
 iommu_intremap = iommu_intremap_off;




[PATCH v7 1/8] AMD/IOMMU: check / convert IVMD ranges for being / to be reserved

2021-08-26 Thread Jan Beulich
While the specification doesn't say so, just like for VT-d's RMRRs no
good can come from these ranges being e.g. conventional RAM or entirely
unmarked and hence usable for placing e.g. PCI device BARs. Check
whether they are, and put in some limited effort to convert to reserved.
(More advanced logic can be added if actual problems are found with this
simplistic variant.)

Signed-off-by: Jan Beulich 
Reviewed-by: Paul Durrant 
---
v7: Re-base.
v5: New.

--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -384,6 +384,38 @@ static int __init parse_ivmd_block(const
 AMD_IOMMU_DEBUG("IVMD Block: type %#x phys %#lx len %#lx\n",
 ivmd_block->header.type, start_addr, mem_length);
 
+if ( !e820_all_mapped(base, limit + PAGE_SIZE, E820_RESERVED) )
+{
+paddr_t addr;
+
+AMD_IOMMU_DEBUG("IVMD: [%lx,%lx) is not (entirely) in reserved 
memory\n",
+base, limit + PAGE_SIZE);
+
+for ( addr = base; addr <= limit; addr += PAGE_SIZE )
+{
+unsigned int type = page_get_ram_type(maddr_to_mfn(addr));
+
+if ( type == RAM_TYPE_UNKNOWN )
+{
+if ( e820_add_range(&e820, addr, addr + PAGE_SIZE,
+E820_RESERVED) )
+continue;
+AMD_IOMMU_DEBUG("IVMD Error: Page at %lx couldn't be 
reserved\n",
+addr);
+return -EIO;
+}
+
+/* Types which won't be handed out are considered good enough. */
+if ( !(type & (RAM_TYPE_RESERVED | RAM_TYPE_ACPI |
+   RAM_TYPE_UNUSABLE)) )
+continue;
+
+AMD_IOMMU_DEBUG("IVMD Error: Page at %lx can't be converted\n",
+addr);
+return -EIO;
+}
+}
+
 if ( ivmd_block->header.flags & ACPI_IVMD_EXCLUSION_RANGE )
 exclusion = true;
 else if ( ivmd_block->header.flags & ACPI_IVMD_UNITY )



