Re: [PATCH v15 10/12] swiotlb: Add restricted DMA pool initialization
On Tue, Aug 24, 2021 at 10:26 PM Guenter Roeck wrote:
>
> Hi Claire,
>
> On Thu, Jun 24, 2021 at 11:55:24PM +0800, Claire Chang wrote:
> > Add the initialization function to create restricted DMA pools from
> > matching reserved-memory nodes.
> >
> > Regardless of swiotlb setting, the restricted DMA pool is preferred if
> > available.
> >
> > The restricted DMA pools provide a basic level of protection against the
> > DMA overwriting buffer contents at unexpected times. However, to protect
> > against general data leakage and system memory corruption, the system
> > needs to provide a way to lock down the memory access, e.g., MPU.
> >
> > Signed-off-by: Claire Chang
> > Reviewed-by: Christoph Hellwig
> > Tested-by: Stefano Stabellini
> > Tested-by: Will Deacon
> > ---
> >  include/linux/swiotlb.h |  3 +-
> >  kernel/dma/Kconfig      | 14
> >  kernel/dma/swiotlb.c    | 76 +
> >  3 files changed, 92 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> > index 3b9454d1e498..39284ff2a6cd 100644
> > --- a/include/linux/swiotlb.h
> > +++ b/include/linux/swiotlb.h
> > @@ -73,7 +73,8 @@ extern enum swiotlb_force swiotlb_force;
> >   *		range check to see if the memory was in fact allocated by this
> >   *		API.
> >   * @nslabs:	The number of IO TLB blocks (in groups of 64) between @start and
> > - *		@end. This is command line adjustable via setup_io_tlb_npages.
> > + *		@end. For default swiotlb, this is command line adjustable via
> > + *		setup_io_tlb_npages.
> >   * @used:	The number of used IO TLB block.
> >   * @list:	The free list describing the number of free entries available
> >   *		from each index.
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index 77b405508743..3e961dc39634 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -80,6 +80,20 @@ config SWIOTLB
> >  	bool
> >  	select NEED_DMA_MAP_STATE
> >
> > +config DMA_RESTRICTED_POOL
> > +	bool "DMA Restricted Pool"
> > +	depends on OF && OF_RESERVED_MEM
> > +	select SWIOTLB
>
> This makes SWIOTLB user configurable, which in turn results in
>
> mips64-linux-ld: arch/mips/kernel/setup.o: in function `arch_mem_init':
> setup.c:(.init.text+0x19c8): undefined reference to `plat_swiotlb_setup'
> make[1]: *** [Makefile:1280: vmlinux] Error 1
>
> when building mips:allmodconfig.
>
> Should this possibly be "depends on SWIOTLB" ?

Patch is sent here: https://lkml.org/lkml/2021/8/26/932

> Thanks,
> Guenter

Thanks,
Claire
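For reference, a sketch of the change Guenter suggests (illustrative only — the actual posted fix is the patch linked above and may differ):

```kconfig
config DMA_RESTRICTED_POOL
	bool "DMA Restricted Pool"
	depends on OF && OF_RESERVED_MEM
	depends on SWIOTLB
```

With `depends on SWIOTLB`, the option is only offered when the architecture already enables SWIOTLB itself, whereas `select SWIOTLB` force-enables SWIOTLB even on configurations (such as the mips:allmodconfig build above) that do not provide the required platform hooks like `plat_swiotlb_setup`.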
RE: [PATCH] VT-d: fix caching mode IOTLB flushing
> From: Jan Beulich
> Sent: Thursday, August 19, 2021 4:06 PM
>
> While for context cache entry flushing use of did 0 is indeed correct
> (after all upon reading the context entry the IOMMU wouldn't know any
> domain ID if the entry is not present, and hence a surrogate one needs
> to be used), for IOTLB entries the normal domain ID (from the [present]
> context entry) gets used. See sub-section "IOTLB" of section "Address
> Translation Caches" in the VT-d spec.
>
> Signed-off-by: Jan Beulich

Reviewed-by: Kevin Tian

> ---
> Luckily this is supposed to be an issue only when running on emulated
> IOMMUs; hardware implementations are expected to have CAP.CM=0.
>
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -474,17 +474,10 @@ int vtd_flush_iotlb_reg(struct vtd_iommu
>
>      /*
>       * In the non-present entry flush case, if hardware doesn't cache
> -     * non-present entry we do nothing and if hardware cache non-present
> -     * entry, we flush entries of domain 0 (the domain id is used to cache
> -     * any non-present entries)
> +     * non-present entries we do nothing.
>       */
> -    if ( flush_non_present_entry )
> -    {
> -        if ( !cap_caching_mode(iommu->cap) )
> -            return 1;
> -        else
> -            did = 0;
> -    }
> +    if ( flush_non_present_entry && !cap_caching_mode(iommu->cap) )
> +        return 1;
>
>      /* use register invalidation */
>      switch ( type )
> --- a/xen/drivers/passthrough/vtd/qinval.c
> +++ b/xen/drivers/passthrough/vtd/qinval.c
> @@ -362,17 +362,10 @@ static int __must_check flush_iotlb_qi(s
>
>      /*
>       * In the non-present entry flush case, if hardware doesn't cache
> -     * non-present entry we do nothing and if hardware cache non-present
> -     * entry, we flush entries of domain 0 (the domain id is used to cache
> -     * any non-present entries)
> +     * non-present entries we do nothing.
>       */
> -    if ( flush_non_present_entry )
> -    {
> -        if ( !cap_caching_mode(iommu->cap) )
> -            return 1;
> -        else
> -            did = 0;
> -    }
> +    if ( flush_non_present_entry && !cap_caching_mode(iommu->cap) )
> +        return 1;
>
>      /* use queued invalidation */
>      if ( cap_write_drain(iommu->cap) )
[xen-4.14-testing test] 164493: regressions - FAIL
flight 164493 xen-4.14-testing real [real]
flight 164505 xen-4.14-testing real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/164493/
http://logs.test-lab.xenproject.org/osstest/logs/164505/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-amd64-dom0pvh-xl-amd    8 xen-boot                 fail REGR. vs. 163750
 test-amd64-amd64-dom0pvh-xl-intel  8 xen-boot                 fail REGR. vs. 163750

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-xl-qemut-debianhvm-i386-xsm 12 debian-hvm-install fail pass in 164505-retest

Tests which did not succeed, but are not blocking:
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop              fail like 163750
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop             fail like 163750
 test-armhf-armhf-libvirt     16 saverestore-support-check      fail like 163750
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2  fail like 163750
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop             fail like 163750
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop             fail like 163750
 test-armhf-armhf-libvirt-raw 15 saverestore-support-check      fail like 163750
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop              fail like 163750
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop              fail like 163750
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop              fail like 163750
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop             fail like 163750
 test-amd64-amd64-libvirt     15 migrate-support-check          fail never pass
 test-amd64-i386-xl-pvshim    14 guest-start                    fail never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-check          fail never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-check      fail never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check          fail never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-check          fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-i386-libvirt      15 migrate-support-check          fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-check          fail never pass
 test-arm64-arm64-xl-xsm      15 migrate-support-check          fail never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-check      fail never pass
 test-arm64-arm64-xl-xsm      16 saverestore-support-check      fail never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-check          fail never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check      fail never pass
 test-arm64-arm64-xl          15 migrate-support-check          fail never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-check          fail never pass
 test-arm64-arm64-xl          16 saverestore-support-check      fail never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-check      fail never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-check          fail never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-check      fail never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-check          fail never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-check      fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check          fail never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-check          fail never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-check      fail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check         fail never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check     fail never pass
 test-armhf-armhf-libvirt     15 migrate-support-check          fail never pass
 test-armhf-armhf-xl          15 migrate-support-check          fail never pass
 test-armhf-armhf-xl          16 saverestore-support-check      fail never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-check          fail never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-check      fail never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-check        fail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-check    fail never pass
 test-armhf-armhf-xl-rtds     15 migrate-support-check          fail never pass
 test-armhf-armhf-xl-rtds     16 saverestore-support-check      fail never pass
 test-armhf-armhf-xl-vhd      14 migrate-support-check          fail never pass
 test-armhf-armhf-xl-vhd      15 saverestore-support-check      fail never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-check          fail never pass

version targeted for testing:
 xen                  74e93071826fe3aaab32e469280a3253a39147f6
baseline version:
 xen                  49299c4813b7847d29df07bf790f5489060f2a9c

Last test of basis 163750 2021-07-16
RE: [XEN RFC PATCH 16/40] xen/arm: Create a fake NUMA node to use common code
Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini
> Sent: August 27, 2021 7:10
> To: Wei Chen
> Cc: xen-devel@lists.xenproject.org; sstabell...@kernel.org; jul...@xen.org;
> jbeul...@suse.com; Bertrand Marquis
> Subject: Re: [XEN RFC PATCH 16/40] xen/arm: Create a fake NUMA node to use
> common code
>
> On Wed, 11 Aug 2021, Wei Chen wrote:
> > When CONFIG_NUMA is enabled for Arm, Xen will switch to use common
> > NUMA API instead of previous fake NUMA API. Before we parse NUMA
> > information from device tree or ACPI SRAT table, we need to init
> > the NUMA related variables, like cpu_to_node, as single node NUMA
> > system.
> >
> > So in this patch, we introduce a numa_init function for to
> > initialize these data structures as all resources belongs to node#0.
> > This will make the new API returns the same values as the fake API
> > has done.
> >
> > Signed-off-by: Wei Chen
> > ---
> >  xen/arch/arm/numa.c        | 53 ++
> >  xen/arch/arm/setup.c       |  8 ++
> >  xen/include/asm-arm/numa.h | 11
> >  3 files changed, 72 insertions(+)
> >
> > diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
> > index 1e30c5bb13..566ad1e52b 100644
> > --- a/xen/arch/arm/numa.c
> > +++ b/xen/arch/arm/numa.c
> > @@ -20,6 +20,8 @@
> >  #include
> >  #include
> >  #include
> > +#include
> > +#include
> >
> >  void numa_set_node(int cpu, nodeid_t nid)
> >  {
> > @@ -29,3 +31,54 @@ void numa_set_node(int cpu, nodeid_t nid)
> >
> >      cpu_to_node[cpu] = nid;
> >  }
> > +
> > +void __init numa_init(bool acpi_off)
> > +{
> > +    uint32_t idx;
> > +    paddr_t ram_start = ~0;
> > +    paddr_t ram_size = 0;
> > +    paddr_t ram_end = 0;
> > +
> > +    printk(XENLOG_WARNING
> > +        "NUMA has not been supported yet, NUMA off!\n");
>
> NIT: please align
>

OK

> > +    /* Arm NUMA has not been implemented until this patch */
>
> "Arm NUMA is not implemented yet"
>

OK

> > +    numa_off = true;
> > +
> > +    /*
> > +     * Set all cpu_to_node mapping to 0, this will make cpu_to_node
> > +     * function return 0 as previous fake cpu_to_node API.
> > +     */
> > +    for ( idx = 0; idx < NR_CPUS; idx++ )
> > +        cpu_to_node[idx] = 0;
> > +
> > +    /*
> > +     * Make node_to_cpumask, node_spanned_pages and node_start_pfn
> > +     * return as previous fake APIs.
> > +     */
> > +    for ( idx = 0; idx < MAX_NUMNODES; idx++ ) {
> > +        node_to_cpumask[idx] = cpu_online_map;
> > +        node_spanned_pages(idx) = (max_page - mfn_x(first_valid_mfn));
> > +        node_start_pfn(idx) = (mfn_x(first_valid_mfn));
> > +    }
>
> I just want to note that this works because MAX_NUMNODES is 1. If
> MAX_NUMNODES was > 1 then it would be wrong to set node_to_cpumask,
> node_spanned_pages and node_start_pfn for all nodes to the same values.
>
> It might be worth writing something about it in the in-code comment.
>

OK, I will do it.

> > +    /*
> > +     * Find the minimal and maximum address of RAM, NUMA will
> > +     * build a memory to node mapping table for the whole range.
> > +     */
> > +    ram_start = bootinfo.mem.bank[0].start;
> > +    ram_size  = bootinfo.mem.bank[0].size;
> > +    ram_end   = ram_start + ram_size;
> > +    for ( idx = 1 ; idx < bootinfo.mem.nr_banks; idx++ )
> > +    {
> > +        paddr_t bank_start = bootinfo.mem.bank[idx].start;
> > +        paddr_t bank_size = bootinfo.mem.bank[idx].size;
> > +        paddr_t bank_end = bank_start + bank_size;
> > +
> > +        ram_size = ram_size + bank_size;
>
> ram_size is updated but not utilized
>

Ok, I will remove it.

> > +        ram_start = min(ram_start, bank_start);
> > +        ram_end = max(ram_end, bank_end);
> > +    }
> > +
> > +    numa_initmem_init(PFN_UP(ram_start), PFN_DOWN(ram_end));
> > +    return;
> > +}
> > diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
> > index 63a908e325..3c58d2d441 100644
> > --- a/xen/arch/arm/setup.c
> > +++ b/xen/arch/arm/setup.c
> > @@ -30,6 +30,7 @@
> >  #include
> >  #include
> >  #include
> > +#include
> >  #include
> >  #include
> >  #include
> > @@ -874,6 +875,13 @@ void __init start_xen(unsigned long boot_phys_offset,
> >      /* Parse the ACPI tables for possible boot-time configuration */
> >      acpi_boot_table_init();
> >
> > +    /*
> > +     * Try to initialize NUMA system, if failed, the system will
> > +     * fallback to uniform system which means system has only 1
> > +     * NUMA node.
> > +     */
> > +    numa_init(acpi_disabled);
> > +
> >      end_boot_allocator();
> >
> >      /*
> > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> > index b2982f9053..bb495a24e1 100644
> > --- a/xen/include/asm-arm/numa.h
> > +++ b/xen/include/asm-arm/numa.h
> > @@ -13,6 +13,16 @@ typedef u8 nodeid_t;
> >   */
> >  #define NODES_SHIFT 6
> >
> > +extern void numa_init(bool acpi_off);
> > +
> > +/*
> > + * Temporary for fake NUMA node, when CPU, memory
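The bank scan discussed above can be modeled in isolation: find the lowest start and highest end across all boot-time memory banks. The struct and function names here are stand-ins for Xen's `bootinfo.mem` layout, not the actual Xen API.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t paddr_t;

/* Stand-in for one entry of Xen's bootinfo.mem.bank[] array. */
struct membank {
    paddr_t start;
    paddr_t size;
};

/*
 * Compute the minimal start and maximal end address over all banks,
 * mirroring the loop in the patch's numa_init(). Requires nr_banks >= 1.
 */
static void ram_range(const struct membank *bank, size_t nr_banks,
                      paddr_t *ram_start, paddr_t *ram_end)
{
    paddr_t start = bank[0].start;
    paddr_t end = bank[0].start + bank[0].size;

    for (size_t i = 1; i < nr_banks; i++) {
        paddr_t bank_end = bank[i].start + bank[i].size;

        if (bank[i].start < start)
            start = bank[i].start;      /* min(ram_start, bank_start) */
        if (bank_end > end)
            end = bank_end;             /* max(ram_end, bank_end) */
    }
    *ram_start = start;
    *ram_end = end;
}
```

The resulting [start, end) range is then handed to `numa_initmem_init()` via `PFN_UP`/`PFN_DOWN`, which round the start up and the end down to page-frame boundaries so only fully covered pages are included.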
[PATCH 14/15] perf: Disallow bulk unregistering of guest callbacks and do cleanup
Drop the helper that allows bulk unregistering of the per-CPU callbacks
now that KVM, the only entity that actually unregisters callbacks, uses
the per-CPU helpers. Bulk unregistering is inherently unsafe as there
are no protections against nullifying a pointer for a CPU that is using
said pointer in a PMI handler.

Opportunistically tweak names to better reflect reality.

Signed-off-by: Sean Christopherson
---
 arch/x86/xen/pmu.c         |  2 +-
 include/linux/kvm_host.h   |  2 +-
 include/linux/perf_event.h |  9 +++--
 kernel/events/core.c       | 31 +++
 virt/kvm/kvm_main.c        |  2 +-
 5 files changed, 17 insertions(+), 29 deletions(-)

diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c
index e13b0b49fcdf..57834de043c3 100644
--- a/arch/x86/xen/pmu.c
+++ b/arch/x86/xen/pmu.c
@@ -548,7 +548,7 @@ void xen_pmu_init(int cpu)
 	per_cpu(xenpmu_shared, cpu).flags = 0;
 
 	if (cpu == 0) {
-		perf_register_guest_info_callbacks(&xen_guest_cbs);
+		perf_register_guest_info_callbacks_all_cpus(&xen_guest_cbs);
 		xen_pmu_arch_init();
 	}
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0db9af0b628c..d68a49d5fc53 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1171,7 +1171,7 @@ unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu);
 void kvm_register_perf_callbacks(void);
 static inline void kvm_unregister_perf_callbacks(void)
 {
-	__perf_unregister_guest_info_callbacks();
+	perf_unregister_guest_info_callbacks();
 }
 
 #endif
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 7a367bf1b78d..db701409a62f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1238,10 +1238,9 @@ extern void perf_event_bpf_event(struct bpf_prog *prog,
 #ifdef CONFIG_HAVE_GUEST_PERF_EVENTS
 DECLARE_PER_CPU(struct perf_guest_info_callbacks *, perf_guest_cbs);
-extern void __perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs);
-extern void __perf_unregister_guest_info_callbacks(void);
-extern void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
+extern void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs);
 extern void perf_unregister_guest_info_callbacks(void);
+extern void perf_register_guest_info_callbacks_all_cpus(struct perf_guest_info_callbacks *cbs);
 #endif /* CONFIG_HAVE_GUEST_PERF_EVENTS */
 
 extern void perf_event_exec(void);
@@ -1486,9 +1485,7 @@ static inline void perf_bp_event(struct perf_event *event, void *data)	{ }
 
 #ifdef CONFIG_HAVE_GUEST_PERF_EVENTS
-static inline void perf_register_guest_info_callbacks
-(struct perf_guest_info_callbacks *callbacks) { }
-static inline void perf_unregister_guest_info_callbacks(void) { }
+extern void perf_register_guest_info_callbacks_all_cpus(struct perf_guest_info_callbacks *cbs);
 #endif
 
 static inline void perf_event_mmap(struct vm_area_struct *vma)	{ }
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2f28d9d8dc94..f1964096c4c2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6485,35 +6485,26 @@ static void perf_pending_event(struct irq_work *entry)
 #ifdef CONFIG_HAVE_GUEST_PERF_EVENTS
 DEFINE_PER_CPU(struct perf_guest_info_callbacks *, perf_guest_cbs);
 
-void __perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
-{
-	__this_cpu_write(perf_guest_cbs, cbs);
-}
-EXPORT_SYMBOL_GPL(__perf_register_guest_info_callbacks);
-
-void __perf_unregister_guest_info_callbacks(void)
-{
-	__this_cpu_write(perf_guest_cbs, NULL);
-}
-EXPORT_SYMBOL_GPL(__perf_unregister_guest_info_callbacks);
-
 void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
 {
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		per_cpu(perf_guest_cbs, cpu) = cbs;
+	__this_cpu_write(perf_guest_cbs, cbs);
 }
 EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);
 
 void perf_unregister_guest_info_callbacks(void)
 {
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		per_cpu(perf_guest_cbs, cpu) = NULL;
+	__this_cpu_write(perf_guest_cbs, NULL);
 }
 EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);
+
+void perf_register_guest_info_callbacks_all_cpus(struct perf_guest_info_callbacks *cbs)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		per_cpu(perf_guest_cbs, cpu) = cbs;
+}
+EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks_all_cpus);
 #endif
 
 static void
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e0b1c9386926..1bcc3eab510b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5502,7 +5502,7 @@ EXPORT_SYMBOL_GPL(kvm_set_intel_pt_intr_handler);
 
 void kvm_register_perf_callbacks(void)
 {
-	__perf_register_guest_info_callbacks(&kvm_guest_cbs);
+
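The distinction this patch draws can be shown with a toy model: a plain array stands in for the per-CPU variable, and the two registration flavours write either one slot or all of them. `NR_CPUS` and the function names here are illustrative, not the kernel API.

```c
#include <stddef.h>

#define NR_CPUS 4

/* Stand-in for the per-CPU perf_guest_cbs pointer. */
static const void *perf_guest_cbs[NR_CPUS];

/*
 * This-CPU registration: after the patch, KVM registers/unregisters
 * only the slot of the CPU it is currently running on, so a callback
 * pointer is never yanked out from under a remote CPU's PMI handler.
 */
static void register_this_cpu(int cpu, const void *cbs)
{
    perf_guest_cbs[cpu] = cbs;
}

/*
 * All-CPU registration: kept only for Xen's one-time boot setup, where
 * the callbacks are installed once on every possible CPU and never
 * unregistered.
 */
static void register_all_cpus(const void *cbs)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        perf_guest_cbs[cpu] = cbs;
}
```

The unsafe operation the patch removes is the bulk *unregister*: clearing every slot at once races with a remote CPU dereferencing its slot in NMI context, which per-CPU unregistration avoids by construction.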
[PATCH 13/15] KVM: arm64: Drop perf.c and fold its tiny bit of code into pmu.c
Fold the last few remnants of perf.c into pmu.c and rename the init
helper as appropriate.

Signed-off-by: Sean Christopherson
---
 arch/arm64/include/asm/kvm_host.h |  2 --
 arch/arm64/kvm/Makefile           |  2 +-
 arch/arm64/kvm/arm.c              |  3 ++-
 arch/arm64/kvm/perf.c             | 20 --------------------
 arch/arm64/kvm/pmu.c              |  8 ++++++++
 include/kvm/arm_pmu.h             |  1 +
 6 files changed, 12 insertions(+), 24 deletions(-)
 delete mode 100644 arch/arm64/kvm/perf.c

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 12e8d789e1ac..86c0fdd11ad2 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -670,8 +670,6 @@ unsigned long kvm_mmio_read_buf(const void *buf, unsigned int len);
 int kvm_handle_mmio_return(struct kvm_vcpu *vcpu);
 int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa);
 
-void kvm_perf_init(void);
-
 #ifdef CONFIG_PERF_EVENTS
 #define __KVM_WANT_PERF_CALLBACKS
 #else
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 989bb5dad2c8..0bcc378b7961 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -12,7 +12,7 @@ obj-$(CONFIG_KVM) += hyp/
 kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
 	 $(KVM)/vfio.o $(KVM)/irqchip.o $(KVM)/binary_stats.o \
-	 arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \
+	 arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
 	 inject_fault.o va_layout.o handle_exit.o \
 	 guest.o debug.o reset.o sys_regs.o \
 	 vgic-sys-reg-v3.o fpsimd.o pmu.o \
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index dfc8078dd4f9..57e637dee71d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1747,7 +1747,8 @@ static int init_subsystems(void)
 	if (err)
 		goto out;
 
-	kvm_perf_init();
+	kvm_pmu_init();
+
 	kvm_sys_reg_table_init();
 
 out:
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
deleted file mode 100644
index ad9fdc2f2f70..
--- a/arch/arm64/kvm/perf.c
+++ /dev/null
@@ -1,20 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * Based on the x86 implementation.
- *
- * Copyright (C) 2012 ARM Ltd.
- * Author: Marc Zyngier
- */
-
-#include
-#include
-
-#include
-
-DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
-
-void kvm_perf_init(void)
-{
-	if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
-		static_branch_enable(&kvm_arm_pmu_available);
-}
diff --git a/arch/arm64/kvm/pmu.c b/arch/arm64/kvm/pmu.c
index 03a6c1f4a09a..d98b57a17043 100644
--- a/arch/arm64/kvm/pmu.c
+++ b/arch/arm64/kvm/pmu.c
@@ -7,6 +7,14 @@
 #include
 #include
 
+DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
+
+void kvm_pmu_init(void)
+{
+	if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
+		static_branch_enable(&kvm_arm_pmu_available);
+}
+
 /*
  * Given the perf event attributes and system type, determine
  * if we are going to need to switch counters at guest entry/exit.
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index 864b9997efb2..42270676498d 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -14,6 +14,7 @@
 #define ARMV8_PMU_MAX_COUNTER_PAIRS	((ARMV8_PMU_MAX_COUNTERS + 1) >> 1)
 
 DECLARE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
+void kvm_pmu_init(void);
 
 static __always_inline bool kvm_arm_support_pmu_v3(void)
 {
-- 
2.33.0.259.gc128427fd7-goog
[PATCH 12/15] KVM: arm64: Convert to the generic perf callbacks
Drop arm64's version of the callbacks in favor of the callbacks provided
by generic KVM, which are semantically identical. Implement the "get ip"
hook as needed.

Signed-off-by: Sean Christopherson
---
 arch/arm64/include/asm/kvm_host.h |  6 +----
 arch/arm64/kvm/arm.c              |  5 ++++
 arch/arm64/kvm/perf.c             | 38 ---------------------------
 3 files changed, 6 insertions(+), 43 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 007c38d77fd9..12e8d789e1ac 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -673,11 +673,7 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa);
 void kvm_perf_init(void);
 
 #ifdef CONFIG_PERF_EVENTS
-void kvm_register_perf_callbacks(void);
-static inline void kvm_unregister_perf_callbacks(void)
-{
-	__perf_unregister_guest_info_callbacks();
-}
+#define __KVM_WANT_PERF_CALLBACKS
 #else
 static inline void kvm_register_perf_callbacks(void) {}
 static inline void kvm_unregister_perf_callbacks(void) {}
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index ec386971030d..dfc8078dd4f9 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -503,6 +503,11 @@ bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
 	return vcpu_mode_priv(vcpu);
 }
 
+unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu)
+{
+	return *vcpu_pc(vcpu);
+}
+
 /* Just ensure a guest exit from a particular CPU */
 static void exit_vm_noop(void *info)
 {
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
index 2556b0a3b096..ad9fdc2f2f70 100644
--- a/arch/arm64/kvm/perf.c
+++ b/arch/arm64/kvm/perf.c
@@ -13,44 +13,6 @@
 
 DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
 
-#ifdef CONFIG_PERF_EVENTS
-static int kvm_is_in_guest(void)
-{
-	return true;
-}
-
-static int kvm_is_user_mode(void)
-{
-	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-	if (WARN_ON_ONCE(!vcpu))
-		return 0;
-
-	return !vcpu_mode_priv(vcpu);
-}
-
-static unsigned long kvm_get_guest_ip(void)
-{
-	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-	if (WARN_ON_ONCE(!vcpu))
-		return 0;
-
-	return *vcpu_pc(vcpu);
-}
-
-static struct perf_guest_info_callbacks kvm_guest_cbs = {
-	.is_in_guest	= kvm_is_in_guest,
-	.is_user_mode	= kvm_is_user_mode,
-	.get_guest_ip	= kvm_get_guest_ip,
-};
-
-void kvm_register_perf_callbacks(void)
-{
-	__perf_register_guest_info_callbacks(&kvm_guest_cbs);
-}
-#endif /* CONFIG_PERF_EVENTS*/
-
 void kvm_perf_init(void)
 {
 	if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
-- 
2.33.0.259.gc128427fd7-goog
[PATCH 10/15] KVM: Move x86's perf guest info callbacks to generic KVM
Move x86's perf guest callbacks into common KVM, as they are semantically
identical to arm64's callbacks (the only other such KVM callbacks).
arm64 will convert to the common versions in a future patch.

Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 48 +
 arch/x86/kvm/x86.h              |  6 -
 include/linux/kvm_host.h        | 12 +
 virt/kvm/kvm_main.c             | 46 +++
 5 files changed, 66 insertions(+), 47 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 465b35736d9b..63553a1f43ee 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -36,6 +36,7 @@
 #include
 
 #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
+#define __KVM_WANT_PERF_CALLBACKS
 
 #define KVM_MAX_VCPUS 288
 #define KVM_SOFT_MAX_VCPUS 240
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e337aef60793..7cb0f04e24ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8264,32 +8264,6 @@ static void kvm_timer_init(void)
 			  kvmclock_cpu_online, kvmclock_cpu_down_prep);
 }
 
-static int kvm_is_in_guest(void)
-{
-	/* x86's callbacks are registered only when handling a guest NMI. */
-	return true;
-}
-
-static int kvm_is_user_mode(void)
-{
-	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-	if (WARN_ON_ONCE(!vcpu))
-		return 0;
-
-	return static_call(kvm_x86_get_cpl)(vcpu) != 0;
-}
-
-static unsigned long kvm_get_guest_ip(void)
-{
-	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-	if (WARN_ON_ONCE(!vcpu))
-		return 0;
-
-	return kvm_rip_read(vcpu);
-}
-
 static void kvm_handle_intel_pt_intr(void)
 {
 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
@@ -8302,19 +8276,6 @@ static void kvm_handle_intel_pt_intr(void)
 			(unsigned long *)&vcpu->arch.pmu.global_status);
 }
 
-static struct perf_guest_info_callbacks kvm_guest_cbs = {
-	.is_in_guest		= kvm_is_in_guest,
-	.is_user_mode		= kvm_is_user_mode,
-	.get_guest_ip		= kvm_get_guest_ip,
-	.handle_intel_pt_intr	= NULL,
-};
-
-void kvm_register_perf_callbacks(void)
-{
-	__perf_register_guest_info_callbacks(&kvm_guest_cbs);
-}
-EXPORT_SYMBOL_GPL(kvm_register_perf_callbacks);
-
 #ifdef CONFIG_X86_64
 static void pvclock_gtod_update_fn(struct work_struct *work)
 {
@@ -11069,7 +11030,7 @@ int kvm_arch_hardware_setup(void *opaque)
 	kvm_ops_static_call_update();
 
 	if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest())
-		kvm_guest_cbs.handle_intel_pt_intr = kvm_handle_intel_pt_intr;
+		kvm_set_intel_pt_intr_handler(kvm_handle_intel_pt_intr);
 
 	if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
 		supported_xss = 0;
@@ -11098,7 +11059,7 @@ int kvm_arch_hardware_setup(void *opaque)
 
 void kvm_arch_hardware_unsetup(void)
 {
-	kvm_guest_cbs.handle_intel_pt_intr = NULL;
+	kvm_set_intel_pt_intr_handler(NULL);
 
 	static_call(kvm_x86_hardware_unsetup)();
 }
@@ -11725,6 +11686,11 @@ bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
 	return vcpu->arch.preempted_in_kernel;
 }
 
+unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu)
+{
+	return kvm_rip_read(vcpu);
+}
+
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
 {
 	return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index f13f15d2fab8..e1fe738c3827 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -387,12 +387,6 @@ static inline bool kvm_cstate_in_guest(struct kvm *kvm)
 	return kvm->arch.cstate_in_guest;
 }
 
-void kvm_register_perf_callbacks(void);
-static inline void kvm_unregister_perf_callbacks(void)
-{
-	__perf_unregister_guest_info_callbacks();
-}
-
 static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu, bool is_nmi)
 {
 	WRITE_ONCE(vcpu->arch.handling_nmi_from_guest, is_nmi);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e4d712e9f760..0db9af0b628c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1163,6 +1163,18 @@ static inline bool kvm_arch_intc_initialized(struct kvm *kvm)
 }
 #endif
 
+#ifdef __KVM_WANT_PERF_CALLBACKS
+
+void kvm_set_intel_pt_intr_handler(void (*handler)(void));
+unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu);
+
+void kvm_register_perf_callbacks(void);
+static inline void kvm_unregister_perf_callbacks(void)
+{
+	__perf_unregister_guest_info_callbacks();
+}
+#endif
+
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type);
 void kvm_arch_destroy_vm(struct kvm *kvm);
 void kvm_arch_sync_events(struct kvm *kvm);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3e67c93ca403..13c4f58a75e5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
[PATCH 15/15] perf: KVM: Indicate "in guest" via NULL ->is_in_guest callback
Interpret a null ->is_in_guest callback as meaning "in guest" and use the new semantics in KVM, which currently returns 'true' unconditionally in its implementation of ->is_in_guest(). This avoids a retpoline on the indirect call for PMIs that arrive in a KVM guest, and also provides a handy excuse for a wrapper around retrieval of perf_get_guest_cbs, e.g. to reduce the probability of an errant direct read of perf_guest_cbs. Signed-off-by: Sean Christopherson --- arch/x86/events/core.c | 16 arch/x86/events/intel/core.c | 5 ++--- include/linux/perf_event.h | 17 + virt/kvm/kvm_main.c | 9 ++--- 4 files changed, 29 insertions(+), 18 deletions(-) diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 34155a52e498..b60c339ae06b 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -2761,11 +2761,11 @@ static bool perf_hw_regs(struct pt_regs *regs) void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs) { - struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs); + struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs(); struct unwind_state state; unsigned long addr; - if (guest_cbs && guest_cbs->is_in_guest()) { + if (guest_cbs) { /* TODO: We don't support guest os callchain now */ return; } @@ -2865,11 +2865,11 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs) { - struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs); + struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs(); struct stack_frame frame; const struct stack_frame __user *fp; - if (guest_cbs && guest_cbs->is_in_guest()) { + if (guest_cbs) { /* TODO: We don't support guest os callchain now */ return; } @@ -2946,9 +2946,9 @@ static unsigned long code_segment_base(struct pt_regs *regs) unsigned long perf_instruction_pointer(struct pt_regs *regs) { - struct 
perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs); + struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs(); - if (guest_cbs && guest_cbs->is_in_guest()) + if (guest_cbs) return guest_cbs->get_guest_ip(); return regs->ip + code_segment_base(regs); @@ -2956,10 +2956,10 @@ unsigned long perf_instruction_pointer(struct pt_regs *regs) unsigned long perf_misc_flags(struct pt_regs *regs) { - struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs); + struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs(); int misc = 0; - if (guest_cbs && guest_cbs->is_in_guest()) { + if (guest_cbs) { if (guest_cbs->is_user_mode()) misc |= PERF_RECORD_MISC_GUEST_USER; else diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 96001962c24d..9a8c18b51a96 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -2853,9 +2853,8 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status) */ if (__test_and_clear_bit(GLOBAL_STATUS_TRACE_TOPAPMI_BIT, (unsigned long *))) { handled++; - guest_cbs = this_cpu_read(perf_guest_cbs); - if (unlikely(guest_cbs && guest_cbs->is_in_guest() && -guest_cbs->handle_intel_pt_intr)) + guest_cbs = perf_get_guest_cbs(); + if (unlikely(guest_cbs && guest_cbs->handle_intel_pt_intr)) guest_cbs->handle_intel_pt_intr(); else intel_pt_interrupt(); diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index db701409a62f..6e3a10784d24 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1241,6 +1241,23 @@ DECLARE_PER_CPU(struct perf_guest_info_callbacks *, perf_guest_cbs); extern void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs); extern void perf_unregister_guest_info_callbacks(void); extern void perf_register_guest_info_callbacks_all_cpus(struct perf_guest_info_callbacks *cbs); +/* + * Returns guest callbacks for the current CPU if callbacks are registered and + * the PMI fired while a guest 
was running, otherwise returns NULL. + */ +static inline struct perf_guest_info_callbacks *perf_get_guest_cbs(void) +{ + struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs); + + /* +* Implementing is_in_guest is optional if the callbacks are registered +* only when "in guest". +*/ + if (guest_cbs && (!guest_cbs->is_in_guest || guest_cbs->is_in_guest())) +
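The optional-is_in_guest() logic of perf_get_guest_cbs() above can be modeled in plain userspace C. Everything below (the struct layout, the cpu_slot variable, the callback names) is a stand-in for the kernel's per-CPU machinery, not the real perf API — a minimal sketch of the accessor pattern only:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types. */
struct guest_cbs {
    int (*is_in_guest)(void);            /* optional, may be NULL */
    unsigned long (*get_guest_ip)(void);
};

/* Simulates the per-CPU slot that this_cpu_read(perf_guest_cbs) reads. */
static struct guest_cbs *cpu_slot;

/* Mirrors perf_get_guest_cbs(): NULL unless callbacks are registered and
 * either is_in_guest is unimplemented (registration implies "in guest")
 * or it reports true. */
static struct guest_cbs *get_guest_cbs(void)
{
    struct guest_cbs *cbs = cpu_slot;

    if (cbs && (!cbs->is_in_guest || cbs->is_in_guest()))
        return cbs;
    return NULL;
}

static int always_out(void) { return 0; }
static unsigned long ip(void) { return 0xfee1; }

static struct guest_cbs precise = { .is_in_guest = NULL, .get_guest_ip = ip };
static struct guest_cbs legacy  = { .is_in_guest = always_out, .get_guest_ip = ip };

static int run_demo(void)
{
    int ok = 1;

    cpu_slot = NULL;
    ok &= (get_guest_cbs() == NULL);       /* nothing registered */
    cpu_slot = &precise;
    ok &= (get_guest_cbs() == &precise);   /* registration implies in-guest */
    cpu_slot = &legacy;
    ok &= (get_guest_cbs() == NULL);       /* legacy callback says "not in guest" */
    cpu_slot = NULL;
    return ok;
}
```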
[PATCH 11/15] KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c
Now that all state needed for VMX's PT interrupt handler is exposed to vmx.c (specifically the currently running vCPU), move the handler into vmx.c where it belongs. Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 1 - arch/x86/kvm/vmx/vmx.c | 24 +--- arch/x86/kvm/x86.c | 17 - virt/kvm/kvm_main.c | 1 + 4 files changed, 22 insertions(+), 21 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 63553a1f43ee..daa33147650a 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1496,7 +1496,6 @@ struct kvm_x86_init_ops { int (*disabled_by_bios)(void); int (*check_processor_compatibility)(void); int (*hardware_setup)(void); - bool (*intel_pt_intr_in_guest)(void); struct kvm_x86_ops *runtime_ops; }; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index f08980ef7c44..4665a272249a 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7535,6 +7535,8 @@ static void vmx_migrate_timers(struct kvm_vcpu *vcpu) static void hardware_unsetup(void) { + kvm_set_intel_pt_intr_handler(NULL); + if (nested) nested_vmx_hardware_unsetup(); @@ -7685,6 +7687,18 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = { .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector, }; +static void vmx_handle_intel_pt_intr(void) +{ + struct kvm_vcpu *vcpu = kvm_get_running_vcpu(); + + if (WARN_ON_ONCE(!vcpu)) + return; + + kvm_make_request(KVM_REQ_PMI, vcpu); + __set_bit(MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT, + (unsigned long *)>arch.pmu.global_status); +} + static __init void vmx_setup_user_return_msrs(void) { @@ -7886,9 +7900,14 @@ static __init int hardware_setup(void) vmx_set_cpu_caps(); r = alloc_kvm_area(); - if (r) + if (r) { nested_vmx_hardware_unsetup(); - return r; + return r; + } + + if (pt_mode == PT_MODE_HOST_GUEST) + kvm_set_intel_pt_intr_handler(vmx_handle_intel_pt_intr); + return 0; } static struct kvm_x86_init_ops vmx_init_ops __initdata = { @@ 
-7896,7 +7915,6 @@ static struct kvm_x86_init_ops vmx_init_ops __initdata = { .disabled_by_bios = vmx_disabled_by_bios, .check_processor_compatibility = vmx_check_processor_compat, .hardware_setup = hardware_setup, - .intel_pt_intr_in_guest = vmx_pt_mode_is_host_guest, .runtime_ops = _x86_ops, }; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7cb0f04e24ee..11c7a02f839c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8264,18 +8264,6 @@ static void kvm_timer_init(void) kvmclock_cpu_online, kvmclock_cpu_down_prep); } -static void kvm_handle_intel_pt_intr(void) -{ - struct kvm_vcpu *vcpu = kvm_get_running_vcpu(); - - if (WARN_ON_ONCE(!vcpu)) - return; - - kvm_make_request(KVM_REQ_PMI, vcpu); - __set_bit(MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT, - (unsigned long *)>arch.pmu.global_status); -} - #ifdef CONFIG_X86_64 static void pvclock_gtod_update_fn(struct work_struct *work) { @@ -11029,9 +11017,6 @@ int kvm_arch_hardware_setup(void *opaque) memcpy(_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops)); kvm_ops_static_call_update(); - if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest()) - kvm_set_intel_pt_intr_handler(kvm_handle_intel_pt_intr); - if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES)) supported_xss = 0; @@ -11059,8 +11044,6 @@ int kvm_arch_hardware_setup(void *opaque) void kvm_arch_hardware_unsetup(void) { - kvm_set_intel_pt_intr_handler(NULL); - static_call(kvm_x86_hardware_unsetup)(); } diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 13c4f58a75e5..e0b1c9386926 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -5498,6 +5498,7 @@ void kvm_set_intel_pt_intr_handler(void (*handler)(void)) { kvm_guest_cbs.handle_intel_pt_intr = handler; } +EXPORT_SYMBOL_GPL(kvm_set_intel_pt_intr_handler); void kvm_register_perf_callbacks(void) { -- 2.33.0.259.gc128427fd7-goog
[PATCH 08/15] KVM: x86: Drop current_vcpu in favor of kvm_running_vcpu
Now that KVM registers perf callbacks only when the CPU is "in guest", use kvm_running_vcpu instead of current_vcpu to retrieve the associated vCPU, and drop current_vcpu. Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 12 +--- arch/x86/kvm/x86.h | 4 2 files changed, 5 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d4d91944fde7..e337aef60793 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8264,17 +8264,15 @@ static void kvm_timer_init(void) kvmclock_cpu_online, kvmclock_cpu_down_prep); } -DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu); -EXPORT_PER_CPU_SYMBOL_GPL(current_vcpu); - static int kvm_is_in_guest(void) { - return __this_cpu_read(current_vcpu) != NULL; + /* x86's callbacks are registered only when handling a guest NMI. */ + return true; } static int kvm_is_user_mode(void) { - struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu); + struct kvm_vcpu *vcpu = kvm_get_running_vcpu(); if (WARN_ON_ONCE(!vcpu)) return 0; @@ -8284,7 +8282,7 @@ static int kvm_is_user_mode(void) static unsigned long kvm_get_guest_ip(void) { - struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu); + struct kvm_vcpu *vcpu = kvm_get_running_vcpu(); if (WARN_ON_ONCE(!vcpu)) return 0; @@ -8294,7 +8292,7 @@ static unsigned long kvm_get_guest_ip(void) static void kvm_handle_intel_pt_intr(void) { - struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu); + struct kvm_vcpu *vcpu = kvm_get_running_vcpu(); if (WARN_ON_ONCE(!vcpu)) return; diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 4c5ba4128b38..f13f15d2fab8 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -393,11 +393,8 @@ static inline void kvm_unregister_perf_callbacks(void) __perf_unregister_guest_info_callbacks(); } -DECLARE_PER_CPU(struct kvm_vcpu *, current_vcpu); - static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu, bool is_nmi) { - __this_cpu_write(current_vcpu, vcpu); WRITE_ONCE(vcpu->arch.handling_nmi_from_guest, is_nmi); 
kvm_register_perf_callbacks(); @@ -408,7 +405,6 @@ static inline void kvm_after_interrupt(struct kvm_vcpu *vcpu) kvm_unregister_perf_callbacks(); WRITE_ONCE(vcpu->arch.handling_nmi_from_guest, false); - __this_cpu_write(current_vcpu, NULL); } -- 2.33.0.259.gc128427fd7-goog
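A toy model of the simplification this patch makes: before it, KVM tracked the same fact twice (current_vcpu and kvm_running_vcpu); since the callbacks are now live only while handling a guest NMI, they can lean on the single kvm_running_vcpu source of truth and is_in_guest() collapses to true. All names below are userspace stand-ins, not the kernel API:

```c
#include <assert.h>
#include <stddef.h>

struct vcpu { unsigned long rip; };

static struct vcpu *running_vcpu;   /* models kvm_get_running_vcpu() */

static unsigned long get_guest_ip(void)
{
    struct vcpu *vcpu = running_vcpu;

    if (!vcpu)          /* WARN_ON_ONCE(!vcpu) in the real code */
        return 0;
    return vcpu->rip;
}

/* Registered only from the NMI-handling window, so trivially true. */
static int is_in_guest(void)
{
    return 1;
}

static int demo(void)
{
    struct vcpu v = { .rip = 0x4000 };
    int ok;

    running_vcpu = &v;                 /* set while the vCPU is loaded */
    ok = is_in_guest() && get_guest_ip() == 0x4000;
    running_vcpu = NULL;
    ok &= get_guest_ip() == 0;         /* guarded, no crash */
    return ok;
}
```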
[PATCH 09/15] KVM: arm64: Register/unregister perf callbacks at vcpu load/put
Register/unregister perf callbacks at vcpu_load()/vcpu_put() instead of keeping the callbacks registered for all eternity after loading KVM. This will allow future cleanups and optimizations as the registration of the callbacks signifies "in guest". This will also allow moving the callbacks into common KVM as arm64 and x86 now have semantically identical callback implementations. Note, KVM could likely be more precise in its registration, but that's a cleanup for the future. Signed-off-by: Sean Christopherson --- arch/arm64/include/asm/kvm_host.h | 12 ++- arch/arm64/kvm/arm.c | 5 - arch/arm64/kvm/perf.c | 36 ++- 3 files changed, 31 insertions(+), 22 deletions(-) diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index ed940aec89e0..007c38d77fd9 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -671,7 +671,17 @@ int kvm_handle_mmio_return(struct kvm_vcpu *vcpu); int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa); void kvm_perf_init(void); -void kvm_perf_teardown(void); + +#ifdef CONFIG_PERF_EVENTS +void kvm_register_perf_callbacks(void); +static inline void kvm_unregister_perf_callbacks(void) +{ + __perf_unregister_guest_info_callbacks(); +} +#else +static inline void kvm_register_perf_callbacks(void) {} +static inline void kvm_unregister_perf_callbacks(void) {} +#endif long kvm_hypercall_pv_features(struct kvm_vcpu *vcpu); gpa_t kvm_init_stolen_time(struct kvm_vcpu *vcpu); diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index e9a2b8f27792..ec386971030d 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -429,10 +429,13 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) if (vcpu_has_ptrauth(vcpu)) vcpu_ptrauth_disable(vcpu); kvm_arch_vcpu_load_debug_state_flags(vcpu); + + kvm_register_perf_callbacks(); } void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) { + kvm_unregister_perf_callbacks(); kvm_arch_vcpu_put_debug_state_flags(vcpu); 
kvm_arch_vcpu_put_fp(vcpu); if (has_vhe()) @@ -2155,7 +2158,7 @@ int kvm_arch_init(void *opaque) /* NOP: Compiling as a module not supported */ void kvm_arch_exit(void) { - kvm_perf_teardown(); + } static int __init early_kvm_mode_cfg(char *arg) diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c index 039fe59399a2..2556b0a3b096 100644 --- a/arch/arm64/kvm/perf.c +++ b/arch/arm64/kvm/perf.c @@ -13,33 +13,30 @@ DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available); +#ifdef CONFIG_PERF_EVENTS static int kvm_is_in_guest(void) { -return kvm_get_running_vcpu() != NULL; + return true; } static int kvm_is_user_mode(void) { - struct kvm_vcpu *vcpu; + struct kvm_vcpu *vcpu = kvm_get_running_vcpu(); - vcpu = kvm_get_running_vcpu(); + if (WARN_ON_ONCE(!vcpu)) + return 0; - if (vcpu) - return !vcpu_mode_priv(vcpu); - - return 0; + return !vcpu_mode_priv(vcpu); } static unsigned long kvm_get_guest_ip(void) { - struct kvm_vcpu *vcpu; + struct kvm_vcpu *vcpu = kvm_get_running_vcpu(); - vcpu = kvm_get_running_vcpu(); + if (WARN_ON_ONCE(!vcpu)) + return 0; - if (vcpu) - return *vcpu_pc(vcpu); - - return 0; + return *vcpu_pc(vcpu); } static struct perf_guest_info_callbacks kvm_guest_cbs = { @@ -48,15 +45,14 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = { .get_guest_ip = kvm_get_guest_ip, }; +void kvm_register_perf_callbacks(void) +{ + __perf_register_guest_info_callbacks(_guest_cbs); +} +#endif /* CONFIG_PERF_EVENTS*/ + void kvm_perf_init(void) { if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled()) static_branch_enable(_arm_pmu_available); - - perf_register_guest_info_callbacks(_guest_cbs); -} - -void kvm_perf_teardown(void) -{ - perf_unregister_guest_info_callbacks(); } -- 2.33.0.259.gc128427fd7-goog
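The vcpu_load()/vcpu_put() scoping above can be sketched in a few lines of userspace C. The callback slot is populated exactly while a vCPU is loaded rather than for the module's whole lifetime; names are illustrative stand-ins, not the kernel API:

```c
#include <assert.h>
#include <stddef.h>

struct cbs { int dummy; };

static struct cbs guest_cbs;
static struct cbs *slot;      /* stands in for the per-CPU registration */

static void vcpu_load(void)  { slot = &guest_cbs; }  /* kvm_register_perf_callbacks() */
static void vcpu_put(void)   { slot = NULL; }        /* kvm_unregister_perf_callbacks() */

/* What kvm_is_user_mode()/kvm_get_guest_ip() now rely on: a PMI with no
 * loaded vCPU is a bug, hence the WARN_ON_ONCE(!vcpu) in the patch. */
static int pmi_sees_guest(void) { return slot != NULL; }

static int demo(void)
{
    int ok = !pmi_sees_guest();   /* before load: not "in guest" */

    vcpu_load();
    ok &= pmi_sees_guest();       /* between load and put: "in guest" */
    vcpu_put();
    ok &= !pmi_sees_guest();
    return ok;
}
```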
[PATCH 07/15] KVM: Use dedicated flag to track if KVM is handling an NMI from guest
Add a dedicated flag to detect the case where KVM's PMC overflow callback was originally invoked in response to an NMI that arrived while the guest was running. Using current_vcpu is less precise as IRQs also set current_vcpu (though presumably KVM's callback should not be reached in that case), and more importantly, this will allow dropping current_vcpu as the perf callbacks can switch to kvm_running_vcpu now that the perf callbacks are precisely registered, i.e. kvm_running_vcpu doesn't need to be used to detect if a PMI arrived in the guest. Fixes: dd60d217062f ("KVM: x86: Fix perf timer mode IP reporting") Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 3 +-- arch/x86/kvm/pmu.c | 2 +- arch/x86/kvm/svm/svm.c | 2 +- arch/x86/kvm/vmx/vmx.c | 2 +- arch/x86/kvm/x86.c | 4 ++-- arch/x86/kvm/x86.h | 4 +++- 6 files changed, 9 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 1ea4943a73d7..465b35736d9b 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -763,6 +763,7 @@ struct kvm_vcpu_arch { unsigned nmi_pending; /* NMI queued after currently running handler */ bool nmi_injected;/* Trying to inject an NMI this entry */ bool smi_pending;/* SMI queued after currently running handler */ + bool handling_nmi_from_guest; struct kvm_mtrr mtrr_state; u64 pat; @@ -1874,8 +1875,6 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu); int kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err); void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu); -int kvm_is_in_guest(void); - void __user *__x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size); bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 0772bad9165c..2b8934b452ea 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -87,7 +87,7 @@ static void kvm_perf_overflow_intr(struct perf_event *perf_event, * woken up. 
So we should wake it, but this is impossible from * NMI context. Do it from irq work instead. */ - if (!kvm_is_in_guest()) + if (!pmc->vcpu->arch.handling_nmi_from_guest) irq_work_queue(&pmc_to_pmu(pmc)->irq_work); else kvm_make_request(KVM_REQ_PMI, pmc->vcpu); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 1a70e11f0487..3fc6767e5fd8 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3843,7 +3843,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu) } if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI)) - kvm_before_interrupt(vcpu); + kvm_before_interrupt(vcpu, true); kvm_load_host_xsave_state(vcpu); stgi(); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index f19d72136f77..f08980ef7c44 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6344,7 +6344,7 @@ void vmx_do_interrupt_nmi_irqoff(unsigned long entry); static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu, unsigned long entry) { - kvm_before_interrupt(vcpu); + kvm_before_interrupt(vcpu, entry == (unsigned long)asm_exc_nmi_noist); vmx_do_interrupt_nmi_irqoff(entry); kvm_after_interrupt(vcpu); } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index bc4ee6ea7752..d4d91944fde7 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8267,7 +8267,7 @@ static void kvm_timer_init(void) DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu); EXPORT_PER_CPU_SYMBOL_GPL(current_vcpu); -int kvm_is_in_guest(void) +static int kvm_is_in_guest(void) { return __this_cpu_read(current_vcpu) != NULL; } @@ -9678,7 +9678,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) * interrupts on processors that implement an interrupt shadow, the * stat.exits increment will do nicely. 
*/ - kvm_before_interrupt(vcpu); + kvm_before_interrupt(vcpu, false); local_irq_enable(); ++vcpu->stat.exits; local_irq_disable(); diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 5cedc0e8a5d5..4c5ba4128b38 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -395,9 +395,10 @@ static inline void kvm_unregister_perf_callbacks(void) DECLARE_PER_CPU(struct kvm_vcpu *, current_vcpu); -static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu) +static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu, bool is_nmi) { __this_cpu_write(current_vcpu, vcpu); + WRITE_ONCE(vcpu->arch.handling_nmi_from_guest, is_nmi); kvm_register_perf_callbacks(); } @@ -406,6 +407,7 @@ static inline void
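The handling_nmi_from_guest flag introduced by this patch can be modeled with a small userspace sketch. The names mirror the kernel code, but everything here is an illustrative stand-in; the point is only how the flag selects the direct-PMI path versus the irq_work fallback:

```c
#include <assert.h>
#include <stdbool.h>

struct vcpu { bool handling_nmi_from_guest; };

static void before_interrupt(struct vcpu *v, bool is_nmi)
{
    v->handling_nmi_from_guest = is_nmi;   /* WRITE_ONCE() in the kernel */
}

static void after_interrupt(struct vcpu *v)
{
    v->handling_nmi_from_guest = false;
}

/* Returns 1 when the overflow handler may raise the PMI request directly
 * (NMI that arrived in the guest), 0 when it must defer to irq_work. */
static int overflow_path(const struct vcpu *v)
{
    return v->handling_nmi_from_guest ? 1 : 0;
}

static int demo(void)
{
    struct vcpu v = { false };
    int direct;

    before_interrupt(&v, true);   /* guest NMI, e.g. SVM_EXIT_NMI */
    direct = overflow_path(&v);
    after_interrupt(&v);

    /* IRQs pass is_nmi=false, so they take the irq_work path. */
    return direct == 1 && overflow_path(&v) == 0;
}
```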
[PATCH 01/15] KVM: x86: Register perf callbacks after calling vendor's hardware_setup()
Wait to register perf callbacks until after doing vendor hardware setup. VMX's hardware_setup() configures Intel Processor Trace (PT) mode, and a future fix to register the Intel PT guest interrupt hook if and only if Intel PT is exposed to the guest will consume the configured PT mode. Delaying registration to hardware setup is effectively a nop as KVM's perf hooks all pivot on the per-CPU current_vcpu, which is non-NULL only when KVM is handling an IRQ/NMI in a VM-Exit path. I.e. current_vcpu will be NULL throughout both kvm_arch_init() and kvm_arch_hardware_setup(). Cc: Alexander Shishkin Cc: Artem Kashkanov Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 86539c1686fa..fb6015f97f9e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8426,8 +8426,6 @@ int kvm_arch_init(void *opaque) kvm_timer_init(); - perf_register_guest_info_callbacks(&kvm_guest_cbs); - if (boot_cpu_has(X86_FEATURE_XSAVE)) { host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK); supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0; @@ -8461,7 +8459,6 @@ void kvm_arch_exit(void) clear_hv_tscchange_cb(); #endif kvm_lapic_exit(); - perf_unregister_guest_info_callbacks(&kvm_guest_cbs); if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block, @@ -11064,6 +11061,8 @@ int kvm_arch_hardware_setup(void *opaque) memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops)); kvm_ops_static_call_update(); + perf_register_guest_info_callbacks(&kvm_guest_cbs); + if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES)) supported_xss = 0; @@ -11091,6 +11090,8 @@ int kvm_arch_hardware_setup(void *opaque) void kvm_arch_hardware_unsetup(void) { + perf_unregister_guest_info_callbacks(&kvm_guest_cbs); + static_call(kvm_x86_hardware_unsetup)(); } -- 2.33.0.259.gc128427fd7-goog
Re: [PATCH V10 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks
TL;DR: Please don't merge this patch, it's broken and is also built on a shoddy foundation that I would like to fix. On Fri, Aug 06, 2021, Zhu Lingshan wrote: > diff --git a/kernel/events/core.c b/kernel/events/core.c > index 464917096e73..e466fc8176e1 100644 > --- a/kernel/events/core.c > +++ b/kernel/events/core.c > @@ -6489,9 +6489,18 @@ static void perf_pending_event(struct irq_work *entry) > */ > struct perf_guest_info_callbacks *perf_guest_cbs; > > +/* explicitly use __weak to fix duplicate symbol error */ > +void __weak arch_perf_update_guest_cbs(void) > +{ > +} > + > int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs) > { > + if (WARN_ON_ONCE(perf_guest_cbs)) > + return -EBUSY; > + > perf_guest_cbs = cbs; > + arch_perf_update_guest_cbs(); This is horribly broken, it fails to cleanup the static calls when KVM unregisters the callbacks, which happens when the vendor module, e.g. kvm_intel, is unloaded. The explosion doesn't happen until 'kvm' is unloaded because the functions are implemented in 'kvm', i.e. the use-after-free is deferred a bit. BUG: unable to handle page fault for address: a011bb90 #PF: supervisor instruction fetch in kernel mode #PF: error_code(0x0010) - not-present page PGD 6211067 P4D 6211067 PUD 6212063 PMD 102b99067 PTE 0 Oops: 0010 [#1] PREEMPT SMP CPU: 0 PID: 1047 Comm: rmmod Not tainted 5.14.0-rc2+ #460 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:0xa011bb90 Code: Unable to access opcode bytes at RIP 0xa011bb66. Call Trace: ? perf_misc_flags+0xe/0x50 ? perf_prepare_sample+0x53/0x6b0 ? perf_event_output_forward+0x67/0x160 ? kvm_clock_read+0x14/0x30 ? kvm_sched_clock_read+0x5/0x10 ? sched_clock_cpu+0xd/0xd0 ? __perf_event_overflow+0x52/0xf0 ? handle_pmi_common+0x1f2/0x2d0 ? __flush_tlb_all+0x30/0x30 ? intel_pmu_handle_irq+0xcf/0x410 ? nmi_handle+0x5/0x260 ? perf_event_nmi_handler+0x28/0x50 ? nmi_handle+0xc7/0x260 ? lock_release+0x2b0/0x2b0 ? default_do_nmi+0x6b/0x170 ? 
exc_nmi+0x103/0x130 ? end_repeat_nmi+0x16/0x1f ? lock_release+0x2b0/0x2b0 ? lock_release+0x2b0/0x2b0 ? lock_release+0x2b0/0x2b0 Modules linked in: irqbypass [last unloaded: kvm] Even more fun, the existing perf_guest_cbs framework is also broken, though it's much harder to get it to fail, and probably impossible to get it to fail without some help. The issue is that perf_guest_cbs is global, which means that it can be nullified by KVM (during module unload) while the callbacks are being accessed by a PMI handler on a different CPU. The bug has escaped notice because all dereferences of perf_guest_cbs follow the same "perf_guest_cbs && perf_guest_cbs->is_in_guest()" pattern, and AFAICT the compiler never reloads perf_guest_cbs in this sequence. The compiler does reload perf_guest_cbs for any future dereferences, but the ->is_in_guest() guard all but guarantees the PMI handler will win the race, e.g. to nullify perf_guest_cbs, KVM has to completely exit the guest and tear down all VMs before it can be unloaded. But with some help, e.g. READ_ONCE(perf_guest_cbs), unloading kvm_intel can trigger a NULL pointer dereference, e.g. 
this tweak diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 1eb45139fcc6..202e5ad97f82 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -2954,7 +2954,7 @@ unsigned long perf_misc_flags(struct pt_regs *regs) { int misc = 0; - if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) { + if (READ_ONCE(perf_guest_cbs) && READ_ONCE(perf_guest_cbs)->is_in_guest()) { if (perf_guest_cbs->is_user_mode()) misc |= PERF_RECORD_MISC_GUEST_USER; else while spamming module load/unload leads to: BUG: kernel NULL pointer dereference, address: #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 0 P4D 0 Oops: [#1] PREEMPT SMP CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:perf_misc_flags+0x1c/0x70 Call Trace: perf_prepare_sample+0x53/0x6b0 perf_event_output_forward+0x67/0x160 __perf_event_overflow+0x52/0xf0 handle_pmi_common+0x207/0x300 intel_pmu_handle_irq+0xcf/0x410 perf_event_nmi_handler+0x28/0x50 nmi_handle+0xc7/0x260 default_do_nmi+0x6b/0x170 exc_nmi+0x103/0x130 asm_exc_nmi+0x76/0xbf The good news is that I have a series that should fix both the existing NULL pointer bug and mostly obviate the need for static calls. The bad news is that my approach, making perf_guest_cbs per-CPU, likely complicates turning these into static calls, though I'm guessing it's still a solvable problem. Tangentially related, IMO we should make architectures opt-in to getting
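The check-then-reload race described above boils down to a classic time-of-check/time-of-use shape, which can be sketched in userspace C. Plain globals stand in for the shared perf_guest_cbs pointer; the single-threaded demo cannot actually exhibit the race, it only contrasts the two code shapes:

```c
#include <assert.h>
#include <stddef.h>

struct cbs { int (*is_in_guest)(void); };

static struct cbs *shared;   /* models the global perf_guest_cbs */

static int yes(void) { return 1; }
static struct cbs kvm_like = { .is_in_guest = yes };

/* Buggy shape from the existing code: the compiler may re-read `shared`
 * between the NULL check and the call, so a concurrent unregister can
 * turn the second read into a NULL dereference. */
static int racy_in_guest(void)
{
    return shared && shared->is_in_guest();
}

/* Fixed shape: snapshot the pointer once, then use only the local copy. */
static int safe_in_guest(void)
{
    struct cbs *local = shared;

    return local && local->is_in_guest();
}

static int demo(void)
{
    int ok;

    shared = &kvm_like;
    ok = racy_in_guest() && safe_in_guest();   /* both fine single-threaded */
    shared = NULL;                             /* models module unload */
    ok = ok && !safe_in_guest();               /* snapshot stays safe */
    return ok;
}
```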
[PATCH 04/15] perf: Force architectures to opt-in to guest callbacks
Introduce HAVE_GUEST_PERF_EVENTS and require architectures to select it to allow registering guest callbacks in perf. Future patches will convert the callbacks to per-CPU definitions. Rather than churn a bunch of arch code (that was presumably copy+pasted from x86), remove it wholesale as it's useless and at best wasting cycles. Wrap even the stubs with an #ifdef to avoid an arch sneaking in a bogus registration with CONFIG_PERF_EVENTS=n. Signed-off-by: Sean Christopherson --- arch/arm/kernel/perf_callchain.c | 28 arch/arm64/Kconfig | 1 + arch/csky/kernel/perf_callchain.c | 10 -- arch/nds32/kernel/perf_event_cpu.c | 29 - arch/riscv/kernel/perf_callchain.c | 10 -- arch/x86/Kconfig | 1 + include/linux/perf_event.h | 4 init/Kconfig | 3 +++ kernel/events/core.c | 2 ++ 9 files changed, 19 insertions(+), 69 deletions(-) diff --git a/arch/arm/kernel/perf_callchain.c b/arch/arm/kernel/perf_callchain.c index 3b69a76d341e..bc6b246ab55e 100644 --- a/arch/arm/kernel/perf_callchain.c +++ b/arch/arm/kernel/perf_callchain.c @@ -64,11 +64,6 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs { struct frame_tail __user *tail; - if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) { - /* We don't support guest os callchain now */ - return; - } - perf_callchain_store(entry, regs->ARM_pc); if (!current->mm) @@ -100,20 +95,12 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re { struct stackframe fr; - if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) { - /* We don't support guest os callchain now */ - return; - } - arm_get_current_stackframe(regs, &fr); walk_stackframe(&fr, callchain_trace, entry); } unsigned long perf_instruction_pointer(struct pt_regs *regs) { - if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) - return perf_guest_cbs->get_guest_ip(); - return instruction_pointer(regs); } @@ -121,17 +108,10 @@ unsigned long perf_misc_flags(struct pt_regs *regs) { int misc = 0; - if (perf_guest_cbs && 
perf_guest_cbs->is_in_guest()) { - if (perf_guest_cbs->is_user_mode()) - misc |= PERF_RECORD_MISC_GUEST_USER; - else - misc |= PERF_RECORD_MISC_GUEST_KERNEL; - } else { - if (user_mode(regs)) - misc |= PERF_RECORD_MISC_USER; - else - misc |= PERF_RECORD_MISC_KERNEL; - } + if (user_mode(regs)) + misc |= PERF_RECORD_MISC_USER; + else + misc |= PERF_RECORD_MISC_KERNEL; return misc; } diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index b5b13a932561..72a201a686c5 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -190,6 +190,7 @@ config ARM64 select HAVE_NMI select HAVE_PATA_PLATFORM select HAVE_PERF_EVENTS + select HAVE_GUEST_PERF_EVENTS select HAVE_PERF_REGS select HAVE_PERF_USER_STACK_DUMP select HAVE_REGS_AND_STACK_ACCESS_API diff --git a/arch/csky/kernel/perf_callchain.c b/arch/csky/kernel/perf_callchain.c index ab55e98ee8f6..92057de08f4f 100644 --- a/arch/csky/kernel/perf_callchain.c +++ b/arch/csky/kernel/perf_callchain.c @@ -88,10 +88,6 @@ void perf_callchain_user(struct perf_callchain_entry_ctx *entry, { unsigned long fp = 0; - /* C-SKY does not support virtualization. */ - if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) - return; - fp = regs->regs[4]; perf_callchain_store(entry, regs->pc); @@ -112,12 +108,6 @@ void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, { struct stackframe fr; - /* C-SKY does not support virtualization. 
*/ - if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) { - pr_warn("C-SKY does not support perf in guest mode!"); - return; - } - fr.fp = regs->regs[4]; fr.lr = regs->lr; walk_stackframe(&fr, entry); diff --git a/arch/nds32/kernel/perf_event_cpu.c b/arch/nds32/kernel/perf_event_cpu.c index 0ce6f9f307e6..a78a879e7ef1 100644 --- a/arch/nds32/kernel/perf_event_cpu.c +++ b/arch/nds32/kernel/perf_event_cpu.c @@ -1371,11 +1371,6 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, leaf_fp = 0; - if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) { - /* We don't support guest os callchain now */ - return; - } - perf_callchain_store(entry, regs->ipc); fp = regs->fp; gp = regs->gp; @@ -1481,10 +1476,6 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
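The opt-in idea of patch 04 — real registration only behind a config symbol, empty stubs everywhere else so a bogus registration cannot sneak in — can be sketched with a preprocessor guard. The macro and function names below are illustrative, not the kernel's:

```c
#include <assert.h>

/* Sketch of the HAVE_GUEST_PERF_EVENTS-style opt-in: arches that define
 * the symbol get real registration, everyone else gets a refusing stub. */
#ifdef CONFIG_HAVE_GUEST_PERF_EVENTS
static const void *registered;
static int register_guest_cbs(const void *cbs)
{
    registered = cbs;
    return 0;
}
#else
static int register_guest_cbs(const void *cbs)
{
    (void)cbs;
    return -1;   /* EOPNOTSUPP-like refusal on non-opted-in arches */
}
#endif

static int demo(void)
{
    int dummy;

    /* This translation unit does not define the config symbol, so the
     * stub path is compiled in and registration is refused. */
    return register_guest_cbs(&dummy);
}
```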
[PATCH 06/15] KVM: x86: Register perf callbacks only when actively handling interrupt
Register KVM's perf callback only when handling an interrupt that may be a PMI (sadly this includes IRQs), and unregister the callback immediately after handling the interrupt (or closing the window). Registering the callback on a per-CPU basis (with preemption disabled!), fixes a mostly theoretical bug where perf could dereference a NULL pointer due to KVM unloading and unregistering the callbacks in between perf queries of the callback functions. The precise registration will also allow for future cleanups and optimizations, e.g. the existence of the callbacks can serve as the "in guest" check. Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 27 +-- arch/x86/kvm/x86.h | 10 ++ include/linux/perf_event.h | 2 ++ kernel/events/core.c | 12 4 files changed, 41 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index bae951344e28..bc4ee6ea7752 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8274,28 +8274,31 @@ int kvm_is_in_guest(void) static int kvm_is_user_mode(void) { - int user_mode = 3; + struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu); - if (__this_cpu_read(current_vcpu)) - user_mode = static_call(kvm_x86_get_cpl)(__this_cpu_read(current_vcpu)); + if (WARN_ON_ONCE(!vcpu)) + return 0; - return user_mode != 0; + return static_call(kvm_x86_get_cpl)(vcpu) != 0; } static unsigned long kvm_get_guest_ip(void) { - unsigned long ip = 0; + struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu); - if (__this_cpu_read(current_vcpu)) - ip = kvm_rip_read(__this_cpu_read(current_vcpu)); + if (WARN_ON_ONCE(!vcpu)) + return 0; - return ip; + return kvm_rip_read(vcpu); } static void kvm_handle_intel_pt_intr(void) { struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu); + if (WARN_ON_ONCE(!vcpu)) + return; + kvm_make_request(KVM_REQ_PMI, vcpu); __set_bit(MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT, (unsigned long *)>arch.pmu.global_status); @@ -8308,6 +8311,12 @@ static struct perf_guest_info_callbacks kvm_guest_cbs 
= { .handle_intel_pt_intr = NULL, }; +void kvm_register_perf_callbacks(void) +{ + __perf_register_guest_info_callbacks(_guest_cbs); +} +EXPORT_SYMBOL_GPL(kvm_register_perf_callbacks); + #ifdef CONFIG_X86_64 static void pvclock_gtod_update_fn(struct work_struct *work) { @@ -11063,7 +11072,6 @@ int kvm_arch_hardware_setup(void *opaque) if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest()) kvm_guest_cbs.handle_intel_pt_intr = kvm_handle_intel_pt_intr; - perf_register_guest_info_callbacks(_guest_cbs); if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES)) supported_xss = 0; @@ -11092,7 +11100,6 @@ int kvm_arch_hardware_setup(void *opaque) void kvm_arch_hardware_unsetup(void) { - perf_unregister_guest_info_callbacks(); kvm_guest_cbs.handle_intel_pt_intr = NULL; static_call(kvm_x86_hardware_unsetup)(); diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 7d66d63dc55a..5cedc0e8a5d5 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -387,15 +387,25 @@ static inline bool kvm_cstate_in_guest(struct kvm *kvm) return kvm->arch.cstate_in_guest; } +void kvm_register_perf_callbacks(void); +static inline void kvm_unregister_perf_callbacks(void) +{ + __perf_unregister_guest_info_callbacks(); +} + DECLARE_PER_CPU(struct kvm_vcpu *, current_vcpu); static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu) { __this_cpu_write(current_vcpu, vcpu); + + kvm_register_perf_callbacks(); } static inline void kvm_after_interrupt(struct kvm_vcpu *vcpu) { + kvm_unregister_perf_callbacks(); + __this_cpu_write(current_vcpu, NULL); } diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index c98253dae037..7a367bf1b78d 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1238,6 +1238,8 @@ extern void perf_event_bpf_event(struct bpf_prog *prog, #ifdef CONFIG_HAVE_GUEST_PERF_EVENTS DECLARE_PER_CPU(struct perf_guest_info_callbacks *, perf_guest_cbs); +extern void __perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs); 
+extern void __perf_unregister_guest_info_callbacks(void); extern void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks); extern void perf_unregister_guest_info_callbacks(void); #endif /* CONFIG_HAVE_GUEST_PERF_EVENTS */ diff --git a/kernel/events/core.c b/kernel/events/core.c index 9bc1375d6ed9..2f28d9d8dc94 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -6485,6 +6485,18 @@ static void perf_pending_event(struct irq_work *entry) #ifdef CONFIG_HAVE_GUEST_PERF_EVENTS
[PATCH 05/15] perf: Track guest callbacks on a per-CPU basis
Use a per-CPU pointer to track perf's guest callbacks so that KVM can set the callbacks more precisely and avoid a lurking NULL pointer dereference. On x86, KVM supports being built as a module and thus can be unloaded. And because the shared callbacks are referenced from IRQ/NMI context, unloading KVM can run concurrently with perf, and thus all of perf's checks for a NULL perf_guest_cbs are flawed as perf_guest_cbs could be nullified between the check and dereference. In practice, this has not been problematic because the callbacks are always guarded with a "perf_guest_cbs && perf_guest_cbs->is_in_guest()" pattern, and it's extremely unlikely the compiler will choose to reload perf_guest_cbs in that particular sequence. Because is_in_guest() is obviously true only when KVM is running a guest, perf always wins the race to the guarded code (which does often reload perf_guest_cbs) as KVM has to stop running all guests and do a heavy teardown before unloading. Cc: Zhu Lingshan Signed-off-by: Sean Christopherson --- arch/arm64/kernel/perf_callchain.c | 18 -- arch/x86/events/core.c | 17 +++-- arch/x86/events/intel/core.c | 8 +--- include/linux/perf_event.h | 2 +- kernel/events/core.c | 12 +--- 5 files changed, 38 insertions(+), 19 deletions(-) diff --git a/arch/arm64/kernel/perf_callchain.c b/arch/arm64/kernel/perf_callchain.c index 4a72c2727309..38555275c6a2 100644 --- a/arch/arm64/kernel/perf_callchain.c +++ b/arch/arm64/kernel/perf_callchain.c @@ -102,7 +102,9 @@ compat_user_backtrace(struct compat_frame_tail __user *tail, void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs) { - if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) { + struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs); + + if (guest_cbs && guest_cbs->is_in_guest()) { /* We don't support guest os callchain now */ return; } @@ -147,9 +149,10 @@ static bool callchain_trace(void *data, unsigned long pc) void perf_callchain_kernel(struct
perf_callchain_entry_ctx *entry, struct pt_regs *regs) { + struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs); struct stackframe frame; - if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) { + if (guest_cbs && guest_cbs->is_in_guest()) { /* We don't support guest os callchain now */ return; } @@ -160,18 +163,21 @@ void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, unsigned long perf_instruction_pointer(struct pt_regs *regs) { - if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) - return perf_guest_cbs->get_guest_ip(); + struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs); + + if (guest_cbs && guest_cbs->is_in_guest()) + return guest_cbs->get_guest_ip(); return instruction_pointer(regs); } unsigned long perf_misc_flags(struct pt_regs *regs) { + struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs); int misc = 0; - if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) { - if (perf_guest_cbs->is_user_mode()) + if (guest_cbs && guest_cbs->is_in_guest()) { + if (guest_cbs->is_user_mode()) misc |= PERF_RECORD_MISC_GUEST_USER; else misc |= PERF_RECORD_MISC_GUEST_KERNEL; diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 1eb45139fcc6..34155a52e498 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -2761,10 +2761,11 @@ static bool perf_hw_regs(struct pt_regs *regs) void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs) { + struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs); struct unwind_state state; unsigned long addr; - if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) { + if (guest_cbs && guest_cbs->is_in_guest()) { /* TODO: We don't support guest os callchain now */ return; } @@ -2864,10 +2865,11 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs) { + struct 
perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs); struct stack_frame frame; const struct stack_frame __user *fp; - if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) { + if (guest_cbs && guest_cbs->is_in_guest()) { /* TODO: We don't support guest os callchain now */ return; } @@ -2944,18 +2946,21 @@ static unsigned long code_segment_base(struct pt_regs *regs)
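The transformation repeated throughout these hunks — snapshot the callback pointer into a local, then test and use only the local — can be sketched in freestanding C. All names below (guest_cbs_slot, the demo callbacks) are invented stand-ins; the real code reads a per-CPU variable via this_cpu_read():

```c
#include <stddef.h>

/* Hypothetical, simplified model of struct perf_guest_info_callbacks. */
struct guest_info_cbs {
    int (*is_in_guest)(void);
    unsigned long (*get_guest_ip)(void);
};

/* Stand-in for the per-CPU slot read via this_cpu_read() in the patch. */
struct guest_info_cbs *guest_cbs_slot;

static unsigned long host_instruction_pointer(void)
{
    return 0xdead; /* placeholder for instruction_pointer(regs) */
}

/*
 * Read the pointer exactly once. A concurrent unregister that NULLs the
 * slot can no longer split the check from the use: either the local copy
 * stays valid for both, or the host fallback is taken.
 */
unsigned long perf_instruction_pointer_sketch(void)
{
    struct guest_info_cbs *cbs = guest_cbs_slot;

    if (cbs && cbs->is_in_guest())
        return cbs->get_guest_ip();
    return host_instruction_pointer();
}

/* Toy callbacks so the sketch can be exercised. */
static int demo_in_guest(void) { return 1; }
static unsigned long demo_guest_ip(void) { return 0x1234; }
struct guest_info_cbs demo_cbs = { demo_in_guest, demo_guest_ip };
```

The old "perf_guest_cbs && perf_guest_cbs->is_in_guest()" pattern performed two independent loads of the shared pointer, which is exactly the window the cover letter's READ_ONCE experiment widens.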
[PATCH 00/15] perf: KVM: Fix, optimize, and clean up callbacks
This started out as a small series[1] to fix a KVM bug related to Intel PT interrupt handling and snowballed horribly. The main problem being addressed is that the perf_guest_cbs are shared by all CPUs, can be nullified by KVM during module unload, and are not protected against concurrent access from NMI context. The bug has escaped notice because all dereferences of perf_guest_cbs follow the same "perf_guest_cbs && perf_guest_cbs->is_in_guest()" pattern, and AFAICT the compiler never reloads perf_guest_cbs in this sequence. The compiler does reload perf_guest_cbs for any future dereferences, but the ->is_in_guest() guard all but guarantees the PMI handler will win the race, e.g. to nullify perf_guest_cbs, KVM has to completely exit the guest and tear down all VMs before it can be unloaded. But with help, e.g. READ_ONCE(perf_guest_cbs), unloading kvm_intel can trigger a NULL pointer dereference (see below). Manual intervention aside, the bug is a bit of a time bomb, e.g. my patch 3 from the original PT handling series would have omitted the ->is_in_guest() guard. This series fixes the problem by making the callbacks per-CPU, and registering/unregistering the callbacks only with preemption disabled (except for the Xen case, which doesn't unregister). This approach also allows for several nice cleanups in this series. KVM x86 and arm64 can share callbacks, KVM x86 drops its somewhat redundant current_vcpu, and the retpoline that is currently hit when KVM is loaded (due to always checking ->is_in_guest()) goes away (it's still there when running as Xen Dom0). Changing to per-CPU callbacks also provides a good excuse to excise copy+paste code from architectures that can't possibly have guest callbacks. This series conflicts horribly with a proposed patch[2] to use static calls for perf_guest_cbs.
But that patch is broken as it completely fails to handle unregister, and it's not clear to me whether or not it can correctly handle unregister without fixing the underlying race (I don't know enough about the code patching for static calls). This tweak diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 1eb45139fcc6..202e5ad97f82 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -2954,7 +2954,7 @@ unsigned long perf_misc_flags(struct pt_regs *regs) { int misc = 0; - if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) { + if (READ_ONCE(perf_guest_cbs) && READ_ONCE(perf_guest_cbs)->is_in_guest()) { if (perf_guest_cbs->is_user_mode()) misc |= PERF_RECORD_MISC_GUEST_USER; else while spamming module load/unload leads to: BUG: kernel NULL pointer dereference, address: #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 0 P4D 0 Oops: [#1] PREEMPT SMP CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:perf_misc_flags+0x1c/0x70 Call Trace: perf_prepare_sample+0x53/0x6b0 perf_event_output_forward+0x67/0x160 __perf_event_overflow+0x52/0xf0 handle_pmi_common+0x207/0x300 intel_pmu_handle_irq+0xcf/0x410 perf_event_nmi_handler+0x28/0x50 nmi_handle+0xc7/0x260 default_do_nmi+0x6b/0x170 exc_nmi+0x103/0x130 asm_exc_nmi+0x76/0xbf [1] https://lkml.kernel.org/r/20210823193709.55886-1-sea...@google.com [2] https://lkml.kernel.org/r/20210806133802.3528-2-lingshan@intel.com Sean Christopherson (15): KVM: x86: Register perf callbacks after calling vendor's hardware_setup() KVM: x86: Register Processor Trace interrupt hook iff PT enabled in guest perf: Stop pretending that perf can handle multiple guest callbacks perf: Force architectures to opt-in to guest callbacks perf: Track guest callbacks on a per-CPU basis KVM: x86: Register perf callbacks only when actively handling interrupt KVM: Use dedicated flag to track if KVM is 
handling an NMI from guest KVM: x86: Drop current_vcpu in favor of kvm_running_vcpu KVM: arm64: Register/unregister perf callbacks at vcpu load/put KVM: Move x86's perf guest info callbacks to generic KVM KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c KVM: arm64: Convert to the generic perf callbacks KVM: arm64: Drop perf.c and fold its tiny bit of code into pmu.c perf: Disallow bulk unregistering of guest callbacks and do cleanup perf: KVM: Indicate "in guest" via NULL ->is_in_guest callback arch/arm/kernel/perf_callchain.c | 28 ++ arch/arm64/Kconfig | 1 + arch/arm64/include/asm/kvm_host.h | 8 +++- arch/arm64/kernel/perf_callchain.c | 18 ++--- arch/arm64/kvm/Makefile| 2 +- arch/arm64/kvm/arm.c | 13 ++- arch/arm64/kvm/perf.c | 62 -- arch/arm64/kvm/pmu.c | 8 arch/csky/kernel/perf_callchain.c | 10
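As a rough illustration of the per-CPU scheme the cover letter describes (a plain array indexed by CPU stands in for real per-CPU variables, and all names are invented for the sketch):

```c
#include <stddef.h>

#define NR_CPUS_SKETCH 4

struct guest_cbs_sketch {
    int (*is_in_guest)(void);
};

/* One slot per CPU; registration only ever touches the local slot. */
struct guest_cbs_sketch *percpu_guest_cbs[NR_CPUS_SKETCH];

void register_guest_cbs(unsigned int cpu, struct guest_cbs_sketch *cbs)
{
    percpu_guest_cbs[cpu] = cbs;
}

void unregister_guest_cbs(unsigned int cpu)
{
    percpu_guest_cbs[cpu] = NULL;
}

/*
 * A PMI handler first checks its own CPU's slot. On CPUs that are not
 * running a vCPU the slot is NULL, so the expensive (retpoline) indirect
 * ->is_in_guest() call is skipped entirely.
 */
int pmi_sees_guest(unsigned int cpu)
{
    struct guest_cbs_sketch *cbs = percpu_guest_cbs[cpu];

    return cbs && cbs->is_in_guest();
}

static int sketch_in_guest(void) { return 1; }
struct guest_cbs_sketch sketch_cbs = { sketch_in_guest };
```

The kernel's actual registration additionally runs with preemption disabled so the writing CPU cannot migrate mid-update; the sketch omits that machinery.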
[PATCH 02/15] KVM: x86: Register Processor Trace interrupt hook iff PT enabled in guest
Override the Processor Trace (PT) interrupt handler for guest mode if and only if PT is configured for host+guest mode, i.e. is being used independently by both host and guest. If PT is configured for system mode, the host fully controls PT and must handle all events. Fixes: 8479e04e7d6b ("KVM: x86: Inject PMI for KVM guest") Cc: sta...@vger.kernel.org Cc: Like Xu Reported-by: Alexander Shishkin Reported-by: Artem Kashkanov Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/pmu.h | 1 + arch/x86/kvm/vmx/vmx.c | 1 + arch/x86/kvm/x86.c | 5 - 4 files changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 09b256db394a..1ea4943a73d7 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1494,6 +1494,7 @@ struct kvm_x86_init_ops { int (*disabled_by_bios)(void); int (*check_processor_compatibility)(void); int (*hardware_setup)(void); + bool (*intel_pt_intr_in_guest)(void); struct kvm_x86_ops *runtime_ops; }; diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index 0e4f2b1fa9fb..b06dbbd7eeeb 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -41,6 +41,7 @@ struct kvm_pmu_ops { void (*reset)(struct kvm_vcpu *vcpu); void (*deliver_pmi)(struct kvm_vcpu *vcpu); void (*cleanup)(struct kvm_vcpu *vcpu); + void (*handle_intel_pt_intr)(void); }; static inline u64 pmc_bitmask(struct kvm_pmc *pmc) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index fada1055f325..f19d72136f77 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7896,6 +7896,7 @@ static struct kvm_x86_init_ops vmx_init_ops __initdata = { .disabled_by_bios = vmx_disabled_by_bios, .check_processor_compatibility = vmx_check_processor_compat, .hardware_setup = hardware_setup, + .intel_pt_intr_in_guest = vmx_pt_mode_is_host_guest, .runtime_ops = _x86_ops, }; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fb6015f97f9e..ffc6c2d73508 
100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8305,7 +8305,7 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = { .is_in_guest = kvm_is_in_guest, .is_user_mode = kvm_is_user_mode, .get_guest_ip = kvm_get_guest_ip, - .handle_intel_pt_intr = kvm_handle_intel_pt_intr, + .handle_intel_pt_intr = NULL, }; #ifdef CONFIG_X86_64 @@ -11061,6 +11061,8 @@ int kvm_arch_hardware_setup(void *opaque) memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops)); kvm_ops_static_call_update(); + if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest()) + kvm_guest_cbs.handle_intel_pt_intr = kvm_handle_intel_pt_intr; perf_register_guest_info_callbacks(&kvm_guest_cbs); if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES)) @@ -11091,6 +11093,7 @@ int kvm_arch_hardware_setup(void *opaque) void kvm_arch_hardware_unsetup(void) { perf_unregister_guest_info_callbacks(&kvm_guest_cbs); + kvm_guest_cbs.handle_intel_pt_intr = NULL; static_call(kvm_x86_hardware_unsetup)(); } -- 2.33.0.259.gc128427fd7-goog
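The gating added in kvm_arch_hardware_setup() boils down to a null-safe capability query; a minimal sketch with invented names (in the real patch the query is vmx_pt_mode_is_host_guest(), reached through the new init op):

```c
#include <stddef.h>

typedef void (*pt_intr_handler_t)(void);

struct init_ops_sketch {
    /* NULL on vendors without the op (e.g. SVM). */
    int (*intel_pt_intr_in_guest)(void);
};

struct guest_cbs_pt_sketch {
    pt_intr_handler_t handle_intel_pt_intr;
};

static void pt_intr_sketch(void)
{
    /* would forward the PT PMI to the guest */
}

/*
 * Install the PT hook only when the vendor op exists *and* reports
 * host+guest mode; in system mode the host owns all PT events and the
 * hook stays NULL.
 */
void hardware_setup_sketch(struct guest_cbs_pt_sketch *cbs,
                           const struct init_ops_sketch *ops)
{
    if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest())
        cbs->handle_intel_pt_intr = pt_intr_sketch;
    else
        cbs->handle_intel_pt_intr = NULL;
}

/* Demo queries standing in for the two PT configurations. */
int pt_mode_host_guest(void) { return 1; }
int pt_mode_system(void) { return 0; }
```

Clearing the hook again in hardware_unsetup mirrors the last hunk of the patch, so a stale function pointer never survives a reload with a different PT mode.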
[PATCH 03/15] perf: Stop pretending that perf can handle multiple guest callbacks
Drop the 'int' return value from the perf (un)register callbacks helpers and stop pretending perf can support multiple callbacks. The 'int' returns are not future proofing anything as none of the callers take action on an error. It's also not obvious that there will ever be cotenant hypervisors, and if there are, that allowing multiple callbacks to be registered is desirable or even correct. No functional change intended. Signed-off-by: Sean Christopherson --- arch/arm64/include/asm/kvm_host.h | 4 ++-- arch/arm64/kvm/perf.c | 8 arch/x86/kvm/x86.c | 2 +- include/linux/perf_event.h | 11 +-- kernel/events/core.c | 11 ++- 5 files changed, 14 insertions(+), 22 deletions(-) diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 41911585ae0c..ed940aec89e0 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -670,8 +670,8 @@ unsigned long kvm_mmio_read_buf(const void *buf, unsigned int len); int kvm_handle_mmio_return(struct kvm_vcpu *vcpu); int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa); -int kvm_perf_init(void); -int kvm_perf_teardown(void); +void kvm_perf_init(void); +void kvm_perf_teardown(void); long kvm_hypercall_pv_features(struct kvm_vcpu *vcpu); gpa_t kvm_init_stolen_time(struct kvm_vcpu *vcpu); diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c index 151c31fb9860..039fe59399a2 100644 --- a/arch/arm64/kvm/perf.c +++ b/arch/arm64/kvm/perf.c @@ -48,15 +48,15 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = { .get_guest_ip = kvm_get_guest_ip, }; -int kvm_perf_init(void) +void kvm_perf_init(void) { if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled()) static_branch_enable(&kvm_arm_pmu_available); - return perf_register_guest_info_callbacks(&kvm_guest_cbs); + perf_register_guest_info_callbacks(&kvm_guest_cbs); } -int kvm_perf_teardown(void) +void kvm_perf_teardown(void) { - return perf_unregister_guest_info_callbacks(&kvm_guest_cbs); +
perf_unregister_guest_info_callbacks(); } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ffc6c2d73508..bae951344e28 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -11092,7 +11092,7 @@ int kvm_arch_hardware_setup(void *opaque) void kvm_arch_hardware_unsetup(void) { - perf_unregister_guest_info_callbacks(&kvm_guest_cbs); + perf_unregister_guest_info_callbacks(); kvm_guest_cbs.handle_intel_pt_intr = NULL; static_call(kvm_x86_hardware_unsetup)(); diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 2d510ad750ed..05c0efba3cd1 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1237,8 +1237,8 @@ extern void perf_event_bpf_event(struct bpf_prog *prog, u16 flags); extern struct perf_guest_info_callbacks *perf_guest_cbs; -extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks); -extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks); +extern void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks); +extern void perf_unregister_guest_info_callbacks(void); extern void perf_event_exec(void); extern void perf_event_comm(struct task_struct *tsk, bool exec); @@ -1481,10 +1481,9 @@ perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr) { } static inline void perf_bp_event(struct perf_event *event, void *data){ } -static inline int perf_register_guest_info_callbacks -(struct perf_guest_info_callbacks *callbacks) { return 0; } -static inline int perf_unregister_guest_info_callbacks -(struct perf_guest_info_callbacks *callbacks) { return 0; } +static inline void perf_register_guest_info_callbacks +(struct perf_guest_info_callbacks *callbacks) { } +static inline void perf_unregister_guest_info_callbacks(void) { } static inline void perf_event_mmap(struct vm_area_struct *vma) { } diff --git a/kernel/events/core.c b/kernel/events/core.c index 464917096e73..baae796612b9 100644 ---
b/kernel/events/core.c @@ -6482,24 +6482,17 @@ static void perf_pending_event(struct irq_work *entry) perf_swevent_put_recursion_context(rctx); } -/* - * We assume there is only KVM supporting the callbacks. - * Later on, we might change it to a list if there is - * another virtualization implementation supporting the callbacks. - */ struct perf_guest_info_callbacks *perf_guest_cbs; -int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs) +void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs) { perf_guest_cbs = cbs; - return 0; }
Re: [XEN RFC PATCH 22/40] xen/arm: introduce a helper to parse device tree processor node
On Wed, 11 Aug 2021, Wei Chen wrote: > Processor NUMA ID information is stored in device tree's processor > node as "numa-node-id". We need a new helper to parse this ID from > processor node. If we get this ID from processor node, this ID's > validity still need to be checked. Once we got a invalid NUMA ID > from any processor node, the device tree will be marked as NUMA > information invalid. > > Signed-off-by: Wei Chen > --- > xen/arch/arm/numa_device_tree.c | 41 +++-- > 1 file changed, 39 insertions(+), 2 deletions(-) > > diff --git a/xen/arch/arm/numa_device_tree.c b/xen/arch/arm/numa_device_tree.c > index 1c74ad135d..37cc56acf3 100644 > --- a/xen/arch/arm/numa_device_tree.c > +++ b/xen/arch/arm/numa_device_tree.c > @@ -20,16 +20,53 @@ > #include > #include > #include > +#include > +#include > > s8 device_tree_numa = 0; > +static nodemask_t processor_nodes_parsed __initdata; > > -int srat_disabled(void) > +static int srat_disabled(void) > { > return numa_off || device_tree_numa < 0; > } > > -void __init bad_srat(void) > +static __init void bad_srat(void) > { > printk(KERN_ERR "DT: NUMA information is not used.\n"); > device_tree_numa = -1; > } > + > +/* Callback for device tree processor affinity */ > +static int __init dtb_numa_processor_affinity_init(nodeid_t node) > +{ > +if ( srat_disabled() ) > +return -EINVAL; > +else if ( node == NUMA_NO_NODE || node >= MAX_NUMNODES ) { > + bad_srat(); > + return -EINVAL; > + } > + > +node_set(node, processor_nodes_parsed); > + > +device_tree_numa = 1; > +printk(KERN_INFO "DT: NUMA node %u processor parsed\n", node); > + > +return 0; > +} > + > +/* Parse CPU NUMA node info */ > +int __init device_tree_parse_numa_cpu_node(const void *fdt, int node) > +{ > +uint32_t nid; > + > +nid = device_tree_get_u32(fdt, node, "numa-node-id", MAX_NUMNODES); > +printk(XENLOG_WARNING "CPU on NUMA node:%u\n", nid); Given that this is not actually a warning (is it?) 
then I would move it to XENLOG_INFO > +if ( nid >= MAX_NUMNODES ) > +{ > +printk(XENLOG_WARNING "Node id %u exceeds maximum value\n", nid); This could be XENLOG_ERR > +return -EINVAL; > +} > + > +return dtb_numa_processor_affinity_init(nid); > +}
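The parse/validate flow in the quoted patch — read "numa-node-id" with an out-of-range default, reject bad IDs, and poison all DT NUMA state on the first failure — condenses to the following (constants mirror the patch; names and the int8_t state are simplified stand-ins):

```c
#include <stdint.h>

#define MAX_NUMNODES 64
#define NUMA_NO_NODE 0xff

/* Mirrors device_tree_numa in the patch: 0 = unknown, 1 = valid, -1 = bad. */
static int8_t dt_numa_state;

static void bad_srat(void)
{
    dt_numa_state = -1; /* DT NUMA information is ignored from here on */
}

/*
 * nid comes from device_tree_get_u32(..., "numa-node-id", MAX_NUMNODES),
 * i.e. a missing property yields the out-of-range default and is rejected.
 */
int parse_numa_cpu_node(uint32_t nid)
{
    if (dt_numa_state < 0)
        return -1;               /* srat_disabled() */
    if (nid == NUMA_NO_NODE || nid >= MAX_NUMNODES) {
        bad_srat();
        return -1;
    }
    dt_numa_state = 1;
    return 0;
}
```

Note the sticky failure: once bad_srat() fires, later well-formed processor nodes are also refused, which is exactly the "mark the device tree as NUMA information invalid" behaviour described in the commit message.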
Re: [XEN RFC PATCH 20/40] xen/arm: implement node distance helpers for Arm64
On Wed, 11 Aug 2021, Wei Chen wrote: > In current Xen code, __node_distance is a fake API, it always > returns NUMA_REMOTE_DISTANCE(20). Now we use a matrix to record > the distance between any two nodes. Accordingly, we provide a > set_node_distance API to set the distance for any two nodes in > this patch. > > Signed-off-by: Wei Chen > --- > xen/arch/arm/numa.c| 44 ++ > xen/include/asm-arm/numa.h | 12 ++- > xen/include/asm-x86/numa.h | 1 - > xen/include/xen/numa.h | 2 +- > 4 files changed, 56 insertions(+), 3 deletions(-) > > diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c > index 566ad1e52b..f61a8df645 100644 > --- a/xen/arch/arm/numa.c > +++ b/xen/arch/arm/numa.c > @@ -23,6 +23,11 @@ > #include > #include > > +static uint8_t __read_mostly > +node_distance_map[MAX_NUMNODES][MAX_NUMNODES] = { > +{ NUMA_REMOTE_DISTANCE } > +}; > + > void numa_set_node(int cpu, nodeid_t nid) > { > if ( nid >= MAX_NUMNODES || > @@ -32,6 +37,45 @@ void numa_set_node(int cpu, nodeid_t nid) > cpu_to_node[cpu] = nid; > } > > +void __init numa_set_distance(nodeid_t from, nodeid_t to, uint32_t distance) > +{ > +if ( from >= MAX_NUMNODES || to >= MAX_NUMNODES ) > +{ > +printk(KERN_WARNING > +"NUMA nodes are out of matrix, from=%u to=%u distance=%u\n", > +from, to, distance); NIT: please align. Example: printk(KERN_WARNING "NUMA nodes are out of matrix, from=%u to=%u distance=%u\n", Also please use PRIu32 for uint32_t. Probably should use PRIu8 for nodeids. > +return; > +} > + > +/* NUMA defines 0xff as an unreachable node and 0-9 are undefined */ > +if ( distance >= NUMA_NO_DISTANCE || > +(distance >= NUMA_DISTANCE_UDF_MIN && > + distance <= NUMA_DISTANCE_UDF_MAX) || > +(from == to && distance != NUMA_LOCAL_DISTANCE) ) > +{ > +printk(KERN_WARNING > +"Invalid NUMA node distance, from:%d to:%d distance=%d\n", > +from, to, distance); NIT: please align Also you used %u before for nodeids, which is better because from and to are unsigned. Distance should be uint32_t. 
> +return; > +} > + > +node_distance_map[from][to] = distance; Shouldn't we also be setting: node_distance_map[to][from] = distance; ? > +} > + > +uint8_t __node_distance(nodeid_t from, nodeid_t to) > +{ > +/* > + * Check whether the nodes are in the matrix range. > + * When any node is out of range, except from and to nodes are the > + * same, we treat them as unreachable (return 0xFF) > + */ > +if ( from >= MAX_NUMNODES || to >= MAX_NUMNODES ) > +return from == to ? NUMA_LOCAL_DISTANCE : NUMA_NO_DISTANCE; > + > +return node_distance_map[from][to]; > +} > +EXPORT_SYMBOL(__node_distance); > + > void __init numa_init(bool acpi_off) > { > uint32_t idx; > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h > index bb495a24e1..559b028a01 100644 > --- a/xen/include/asm-arm/numa.h > +++ b/xen/include/asm-arm/numa.h > @@ -12,8 +12,19 @@ typedef u8 nodeid_t; > * set the number of NUMA memory block number to 128. > */ > #define NODES_SHIFT 6 > +/* > + * In ACPI spec, 0-9 are the reserved values for node distance, > + * 10 indicates local node distance, 20 indicates remote node > + * distance. Set node distance map in device tree will follow > + * the ACPI's definition. > + */ > +#define NUMA_DISTANCE_UDF_MIN 0 > +#define NUMA_DISTANCE_UDF_MAX 9 > +#define NUMA_LOCAL_DISTANCE 10 > +#define NUMA_REMOTE_DISTANCE20 > > extern void numa_init(bool acpi_off); > +extern void numa_set_distance(nodeid_t from, nodeid_t to, uint32_t distance); > > /* > * Temporary for fake NUMA node, when CPU, memory and distance > @@ -21,7 +32,6 @@ extern void numa_init(bool acpi_off); > * symbols will be removed. 
> */ > extern mfn_t first_valid_mfn; > -#define __node_distance(a, b) (20) > > #else > > diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h > index 5a57a51e26..e0253c20b7 100644 > --- a/xen/include/asm-x86/numa.h > +++ b/xen/include/asm-x86/numa.h > @@ -21,7 +21,6 @@ extern nodeid_t apicid_to_node[]; > extern void init_cpu_to_node(void); > > void srat_parse_regions(u64 addr); > -extern u8 __node_distance(nodeid_t a, nodeid_t b); > unsigned int arch_get_dma_bitsize(void); > > #endif > diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h > index cb08d2eca9..0475823b13 100644 > --- a/xen/include/xen/numa.h > +++ b/xen/include/xen/numa.h > @@ -58,7 +58,7 @@ static inline __attribute__((pure)) nodeid_t > phys_to_nid(paddr_t addr) > #define node_spanned_pages(nid) (NODE_DATA(nid)->node_spanned_pages) > #define node_end_pfn(nid) (NODE_DATA(nid)->node_start_pfn + \ >NODE_DATA(nid)->node_spanned_pages) > - > +extern u8 __node_distance(nodeid_t a, nodeid_t b); > extern void
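A compact model of the distance matrix, including the symmetric store suggested in the review (writing node_distance_map[to][from] as well), under the quoted ACPI-derived constants; names and the small MAX_NUMNODES are illustrative only:

```c
#include <stdint.h>

#define MAX_NUMNODES          8
#define NUMA_DISTANCE_UDF_MAX 9    /* 0-9 are reserved/undefined */
#define NUMA_LOCAL_DISTANCE   10
#define NUMA_REMOTE_DISTANCE  20
#define NUMA_NO_DISTANCE      0xff

static uint8_t node_distance_map[MAX_NUMNODES][MAX_NUMNODES];

int numa_set_distance(uint8_t from, uint8_t to, uint32_t distance)
{
    if (from >= MAX_NUMNODES || to >= MAX_NUMNODES)
        return -1;
    /* 0xff means unreachable, 0-9 are undefined, and a node's distance
     * to itself must be NUMA_LOCAL_DISTANCE. */
    if (distance >= NUMA_NO_DISTANCE ||
        distance <= NUMA_DISTANCE_UDF_MAX ||
        (from == to && distance != NUMA_LOCAL_DISTANCE))
        return -1;
    node_distance_map[from][to] = (uint8_t)distance;
    node_distance_map[to][from] = (uint8_t)distance; /* per the review */
    return 0;
}

uint8_t node_distance(uint8_t from, uint8_t to)
{
    /* Out-of-range nodes are unreachable unless they are the same node. */
    if (from >= MAX_NUMNODES || to >= MAX_NUMNODES)
        return from == to ? NUMA_LOCAL_DISTANCE : NUMA_NO_DISTANCE;
    return node_distance_map[from][to];
}
```

Whether the symmetric store is always correct depends on the platform description: ACPI SLIT and the DT "distance-map" binding both allow asymmetric distances, so the single-direction store in the patch may be deliberate.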
Re: HVM guest only bring up a single vCPU
On 26/08/2021 22:00, Julien Grall wrote: > Hi Andrew, > > While doing more testing today, I noticed that only one vCPU would be > brought up with HVM guest with Xen 4.16 on my setup (QEMU): > > [ 1.122180] > > [ 1.122180] UBSAN: shift-out-of-bounds in > oss/linux/arch/x86/kernel/apic/apic.c:2362:13 > [ 1.122180] shift exponent -1 is negative > [ 1.122180] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-rc7+ #304 > [ 1.122180] Hardware name: Xen HVM domU, BIOS 4.16-unstable 06/07/2021 > [ 1.122180] Call Trace: > [ 1.122180] dump_stack_lvl+0x56/0x6c > [ 1.122180] ubsan_epilogue+0x5/0x50 > [ 1.122180] __ubsan_handle_shift_out_of_bounds+0xfa/0x140 > [ 1.122180] ? cgroup_kill_write+0x4d/0x150 > [ 1.122180] ? cpu_up+0x6e/0x100 > [ 1.122180] ? _raw_spin_unlock_irqrestore+0x30/0x50 > [ 1.122180] ? rcu_read_lock_held_common+0xe/0x40 > [ 1.122180] ? irq_shutdown_and_deactivate+0x11/0x30 > [ 1.122180] ? lock_release+0xc7/0x2a0 > [ 1.122180] ? apic_id_is_primary_thread+0x56/0x60 > [ 1.122180] apic_id_is_primary_thread+0x56/0x60 > [ 1.122180] cpu_up+0xbd/0x100 > [ 1.122180] bringup_nonboot_cpus+0x4f/0x60 > [ 1.122180] smp_init+0x26/0x74 > [ 1.122180] kernel_init_freeable+0x183/0x32d > [ 1.122180] ? _raw_spin_unlock_irq+0x24/0x40 > [ 1.122180] ? rest_init+0x330/0x330 > [ 1.122180] kernel_init+0x17/0x140 > [ 1.122180] ? rest_init+0x330/0x330 > [ 1.122180] ret_from_fork+0x22/0x30 > [ 1.122244] > > [ 1.123176] installing Xen timer for CPU 1 > [ 1.123369] x86: Booting SMP configuration: > [ 1.123409] node #0, CPUs: #1 > [ 1.154400] Callback from call_rcu_tasks_trace() invoked. > [ 1.154491] smp: Brought up 1 node, 1 CPU > [ 1.154526] smpboot: Max logical packages: 2 > [ 1.154570] smpboot: Total of 1 processors activated (5999.99 > BogoMIPS) > > I have tried a PV guest (same setup) and the kernel could bring up all > the vCPUs. > > Digging down, Linux will set smp_num_siblings to 0 (via > detect_ht_early()) and as a result will skip all the CPUs. 
The value > is retrieve from a CPUID leaf. So it sounds like we don't set the > leaft correctly. > > FWIW, I have also tried on Xen 4.11 and could spot the same issue. > Does this ring any bell to you? The CPUID data we give to guests is generally nonsense when it comes to topology. By any chance does the hardware you're booting this on not have hyperthreading enabled/active to begin with? Fixing this is on the todo list, but it needs libxl to start using policy objects (series for the next phase of this still pending on xen-devel). Exactly how you represent the topology to the guest correctly depends on the vendor and rough generation - I believe there are 5 different algorithms to use, and for AMD in particular, it even depends on how many IO-APICs are visible in the guest. ~Andrew
Re: [XEN RFC PATCH 18/40] xen/arm: Keep memory nodes in dtb for NUMA when boot from EFI
On Wed, 11 Aug 2021, Wei Chen wrote: > EFI can get memory map from EFI system table. But EFI system > table doesn't contain memory NUMA information, EFI depends on > ACPI SRAT or device tree memory node to parse memory blocks' > NUMA mapping. > > But in current code, when Xen is booting from EFI, it will > delete all memory nodes in device tree. So in UEFI + DTB > boot, we don't have numa-node-id for memory blocks any more. > > So in this patch, we will keep memory nodes in device tree for > NUMA code to parse memory numa-node-id later. > > As a side effect, if we still parse boot memory information in > early_scan_node, bootmem.info will calculate memory ranges in > memory nodes twice. So we have to prvent early_scan_node to > parse memory nodes in EFI boot. > > As EFI APIs only can be used in Arm64, so we introduced a wrapper > in header file to prevent #ifdef CONFIG_ARM_64/32 in code block. > > Signed-off-by: Wei Chen > --- > xen/arch/arm/bootfdt.c | 8 +++- > xen/arch/arm/efi/efi-boot.h | 25 - > xen/include/asm-arm/setup.h | 6 ++ > 3 files changed, 13 insertions(+), 26 deletions(-) > > diff --git a/xen/arch/arm/bootfdt.c b/xen/arch/arm/bootfdt.c > index 476e32e0f5..7df149dbca 100644 > --- a/xen/arch/arm/bootfdt.c > +++ b/xen/arch/arm/bootfdt.c > @@ -11,6 +11,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -335,7 +336,12 @@ static int __init early_scan_node(const void *fdt, > { > int rc = 0; > > -if ( device_tree_node_matches(fdt, node, "memory") ) > +/* > + * If system boot from EFI, bootinfo.mem has been set by EFI, > + * so we don't need to parse memory node from DTB. 
> + */ > +if ( device_tree_node_matches(fdt, node, "memory") && > + !arch_efi_enabled(EFI_BOOT) ) > rc = process_memory_node(fdt, node, name, depth, > address_cells, size_cells, ); > else if ( depth == 1 && !dt_node_cmp(name, "reserved-memory") ) If we are going to use the device tree info for the numa nodes (and related memory) does it make sense to still rely on the EFI tables for the memory map? I wonder if we should just use device tree for memory and ignore EFI instead. Do you know what Linux does in this regard?
Re: [XEN RFC PATCH 16/40] xen/arm: Create a fake NUMA node to use common code
On Wed, 11 Aug 2021, Wei Chen wrote: > When CONFIG_NUMA is enabled for Arm, Xen will switch to use common > NUMA API instead of previous fake NUMA API. Before we parse NUMA > information from device tree or ACPI SRAT table, we need to init > the NUMA related variables, like cpu_to_node, as single node NUMA > system. > > So in this patch, we introduce a numa_init function for to > initialize these data structures as all resources belongs to node#0. > This will make the new API returns the same values as the fake API > has done. > > Signed-off-by: Wei Chen > --- > xen/arch/arm/numa.c| 53 ++ > xen/arch/arm/setup.c | 8 ++ > xen/include/asm-arm/numa.h | 11 > 3 files changed, 72 insertions(+) > > diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c > index 1e30c5bb13..566ad1e52b 100644 > --- a/xen/arch/arm/numa.c > +++ b/xen/arch/arm/numa.c > @@ -20,6 +20,8 @@ > #include > #include > #include > +#include > +#include > > void numa_set_node(int cpu, nodeid_t nid) > { > @@ -29,3 +31,54 @@ void numa_set_node(int cpu, nodeid_t nid) > > cpu_to_node[cpu] = nid; > } > + > +void __init numa_init(bool acpi_off) > +{ > +uint32_t idx; > +paddr_t ram_start = ~0; > +paddr_t ram_size = 0; > +paddr_t ram_end = 0; > + > +printk(XENLOG_WARNING > +"NUMA has not been supported yet, NUMA off!\n"); NIT: please align > +/* Arm NUMA has not been implemented until this patch */ "Arm NUMA is not implemented yet" > +numa_off = true; > + > +/* > + * Set all cpu_to_node mapping to 0, this will make cpu_to_node > + * function return 0 as previous fake cpu_to_node API. > + */ > +for ( idx = 0; idx < NR_CPUS; idx++ ) > +cpu_to_node[idx] = 0; > + > +/* > + * Make node_to_cpumask, node_spanned_pages and node_start_pfn > + * return as previous fake APIs. 
> + */ > +for ( idx = 0; idx < MAX_NUMNODES; idx++ ) { > +node_to_cpumask[idx] = cpu_online_map; > +node_spanned_pages(idx) = (max_page - mfn_x(first_valid_mfn)); > +node_start_pfn(idx) = (mfn_x(first_valid_mfn)); > +} I just want to note that this works because MAX_NUMNODES is 1. If MAX_NUMNODES was > 1 then it would be wrong to set node_to_cpumask, node_spanned_pages and node_start_pfn for all nodes to the same values. It might be worth writing something about it in the in-code comment. > +/* > + * Find the minimal and maximum address of RAM, NUMA will > + * build a memory to node mapping table for the whole range. > + */ > +ram_start = bootinfo.mem.bank[0].start; > +ram_size = bootinfo.mem.bank[0].size; > +ram_end = ram_start + ram_size; > +for ( idx = 1 ; idx < bootinfo.mem.nr_banks; idx++ ) > +{ > +paddr_t bank_start = bootinfo.mem.bank[idx].start; > +paddr_t bank_size = bootinfo.mem.bank[idx].size; > +paddr_t bank_end = bank_start + bank_size; > + > +ram_size = ram_size + bank_size; ram_size is updated but not utilized > +ram_start = min(ram_start, bank_start); > +ram_end = max(ram_end, bank_end); > +} > + > +numa_initmem_init(PFN_UP(ram_start), PFN_DOWN(ram_end)); > +return; > +} > diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c > index 63a908e325..3c58d2d441 100644 > --- a/xen/arch/arm/setup.c > +++ b/xen/arch/arm/setup.c > @@ -30,6 +30,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -874,6 +875,13 @@ void __init start_xen(unsigned long boot_phys_offset, > /* Parse the ACPI tables for possible boot-time configuration */ > acpi_boot_table_init(); > > +/* > + * Try to initialize NUMA system, if failed, the system will > + * fallback to uniform system which means system has only 1 > + * NUMA node. 
> + */ > +numa_init(acpi_disabled); > + > end_boot_allocator(); > > /* > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h > index b2982f9053..bb495a24e1 100644 > --- a/xen/include/asm-arm/numa.h > +++ b/xen/include/asm-arm/numa.h > @@ -13,6 +13,16 @@ typedef u8 nodeid_t; > */ > #define NODES_SHIFT 6 > > +extern void numa_init(bool acpi_off); > + > +/* > + * Temporary for fake NUMA node, when CPU, memory and distance > + * matrix will be read from DTB or ACPI SRAT. The following > + * symbols will be removed. > + */ > +extern mfn_t first_valid_mfn; > +#define __node_distance(a, b) (20) > + > #else > > /* Fake one node for now. See also node_online_map. */ > @@ -35,6 +45,7 @@ extern mfn_t first_valid_mfn; > #define node_start_pfn(nid) (mfn_x(first_valid_mfn)) > #define __node_distance(a, b) (20) > > +#define numa_init(x) do { } while (0) > #define numa_set_node(x, y) do { } while (0) > > #endif > -- > 2.25.1 >
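The min/max scan over bootinfo memory banks in the quoted numa_init() amounts to the helper below (the bank layout is a simplified stand-in; as the review notes, ram_size is accumulated in the patch but only the start/end pair feeds numa_initmem_init()):

```c
#include <stdint.h>

typedef uint64_t paddr_t;

struct membank {
    paddr_t start;
    paddr_t size;
};

/*
 * Compute the covering [start, end) range of all RAM banks; holes between
 * banks are included in the span, exactly as in the quoted numa_init().
 */
void ram_range(const struct membank *bank, unsigned int nr_banks,
               paddr_t *ram_start, paddr_t *ram_end)
{
    *ram_start = bank[0].start;
    *ram_end   = bank[0].start + bank[0].size;

    for (unsigned int i = 1; i < nr_banks; i++) {
        paddr_t bank_start = bank[i].start;
        paddr_t bank_end   = bank_start + bank[i].size;

        if (bank_start < *ram_start)
            *ram_start = bank_start;
        if (bank_end > *ram_end)
            *ram_end = bank_end;
    }
}
```

With a single fake node this whole range is handed to numa_initmem_init(), so every page lands on node 0, matching the old fake API.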
Re: HVM guest only bring up a single vCPU
On Thu, Aug 26, 2021 at 10:00:58PM +0100, Julien Grall wrote: > Hi Andrew, > > While doing more testing today, I noticed that only one vCPU would be > brought up with HVM guest with Xen 4.16 on my setup (QEMU): > > [1.122180] > > [1.122180] UBSAN: shift-out-of-bounds in > oss/linux/arch/x86/kernel/apic/apic.c:2362:13 > [1.122180] shift exponent -1 is negative > [1.122180] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-rc7+ #304 > [1.122180] Hardware name: Xen HVM domU, BIOS 4.16-unstable 06/07/2021 > [1.122180] Call Trace: > [1.122180] dump_stack_lvl+0x56/0x6c > [1.122180] ubsan_epilogue+0x5/0x50 > [1.122180] __ubsan_handle_shift_out_of_bounds+0xfa/0x140 > [1.122180] ? cgroup_kill_write+0x4d/0x150 > [1.122180] ? cpu_up+0x6e/0x100 > [1.122180] ? _raw_spin_unlock_irqrestore+0x30/0x50 > [1.122180] ? rcu_read_lock_held_common+0xe/0x40 > [1.122180] ? irq_shutdown_and_deactivate+0x11/0x30 > [1.122180] ? lock_release+0xc7/0x2a0 > [1.122180] ? apic_id_is_primary_thread+0x56/0x60 > [1.122180] apic_id_is_primary_thread+0x56/0x60 > [1.122180] cpu_up+0xbd/0x100 > [1.122180] bringup_nonboot_cpus+0x4f/0x60 > [1.122180] smp_init+0x26/0x74 > [1.122180] kernel_init_freeable+0x183/0x32d > [1.122180] ? _raw_spin_unlock_irq+0x24/0x40 > [1.122180] ? rest_init+0x330/0x330 > [1.122180] kernel_init+0x17/0x140 > [1.122180] ? rest_init+0x330/0x330 > [1.122180] ret_from_fork+0x22/0x30 > [1.122244] > > [1.123176] installing Xen timer for CPU 1 > [1.123369] x86: Booting SMP configuration: > [1.123409] node #0, CPUs: #1 > [1.154400] Callback from call_rcu_tasks_trace() invoked. > [1.154491] smp: Brought up 1 node, 1 CPU > [1.154526] smpboot: Max logical packages: 2 > [1.154570] smpboot: Total of 1 processors activated (5999.99 BogoMIPS) > > I have tried a PV guest (same setup) and the kernel could bring up all the > vCPUs. > > Digging down, Linux will set smp_num_siblings to 0 (via detect_ht_early()) > and as a result will skip all the CPUs. The value is retrieve from a CPUID > leaf. 
So it sounds like we don't set the leaf correctly. > > FWIW, I have also tried on Xen 4.11 and could spot the same issue. Does this > ring any bell to you? Is it maybe this: https://lore.kernel.org/xen-devel/20201106003529.391649-1-bmas...@redhat.com/T/#u ? -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab
HVM guest only bring up a single vCPU
Hi Andrew,

While doing more testing today, I noticed that only one vCPU would be brought up with an HVM guest with Xen 4.16 on my setup (QEMU):

[1.122180]
[1.122180] UBSAN: shift-out-of-bounds in oss/linux/arch/x86/kernel/apic/apic.c:2362:13
[1.122180] shift exponent -1 is negative
[1.122180] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-rc7+ #304
[1.122180] Hardware name: Xen HVM domU, BIOS 4.16-unstable 06/07/2021
[1.122180] Call Trace:
[1.122180]  dump_stack_lvl+0x56/0x6c
[1.122180]  ubsan_epilogue+0x5/0x50
[1.122180]  __ubsan_handle_shift_out_of_bounds+0xfa/0x140
[1.122180]  ? cgroup_kill_write+0x4d/0x150
[1.122180]  ? cpu_up+0x6e/0x100
[1.122180]  ? _raw_spin_unlock_irqrestore+0x30/0x50
[1.122180]  ? rcu_read_lock_held_common+0xe/0x40
[1.122180]  ? irq_shutdown_and_deactivate+0x11/0x30
[1.122180]  ? lock_release+0xc7/0x2a0
[1.122180]  ? apic_id_is_primary_thread+0x56/0x60
[1.122180]  apic_id_is_primary_thread+0x56/0x60
[1.122180]  cpu_up+0xbd/0x100
[1.122180]  bringup_nonboot_cpus+0x4f/0x60
[1.122180]  smp_init+0x26/0x74
[1.122180]  kernel_init_freeable+0x183/0x32d
[1.122180]  ? _raw_spin_unlock_irq+0x24/0x40
[1.122180]  ? rest_init+0x330/0x330
[1.122180]  kernel_init+0x17/0x140
[1.122180]  ? rest_init+0x330/0x330
[1.122180]  ret_from_fork+0x22/0x30
[1.122244]
[1.123176] installing Xen timer for CPU 1
[1.123369] x86: Booting SMP configuration:
[1.123409] node #0, CPUs: #1
[1.154400] Callback from call_rcu_tasks_trace() invoked.
[1.154491] smp: Brought up 1 node, 1 CPU
[1.154526] smpboot: Max logical packages: 2
[1.154570] smpboot: Total of 1 processors activated (5999.99 BogoMIPS)

I have tried a PV guest (same setup) and the kernel could bring up all the vCPUs.

Digging down, Linux will set smp_num_siblings to 0 (via detect_ht_early()) and as a result will skip all the CPUs. The value is retrieved from a CPUID leaf. So it sounds like we don't set the leaf correctly.

FWIW, I have also tried on Xen 4.11 and could spot the same issue. Does this ring any bell to you?
Cheers, -- Julien Grall
Re: [PATCH v2] PCI/MSI: Skip masking MSI-X on Xen PV
On Thu, Aug 26, 2021 at 07:03:42PM +0200, Marek Marczykowski-Górecki wrote: > When running as Xen PV guest, masking MSI-X is a responsibility of the > hypervisor. Guest has no write access to relevant BAR at all - when it > tries to, it results in a crash like this: > > BUG: unable to handle page fault for address: c9004069100c > #PF: supervisor write access in kernel mode > #PF: error_code(0x0003) - permissions violation > PGD 18f1c067 P4D 18f1c067 PUD 4dbd067 PMD 4fba067 PTE 8010febd4075 > Oops: 0003 [#1] SMP NOPTI > CPU: 0 PID: 234 Comm: kworker/0:2 Tainted: GW > 5.14.0-rc7-1.fc32.qubes.x86_64 #15 > Workqueue: events work_for_cpu_fn > RIP: e030:__pci_enable_msix_range.part.0+0x26b/0x5f0 > Code: 2f 96 ff 48 89 44 24 28 48 89 c7 48 85 c0 0f 84 f6 01 00 00 45 0f > b7 f6 48 8d 40 0c ba 01 00 00 00 49 c1 e6 04 4a 8d 4c 37 1c <89> 10 48 83 c0 > 10 48 39 c1 75 f5 41 0f b6 44 24 6a 84 c0 0f 84 48 > RSP: e02b:c9004018bd50 EFLAGS: 00010212 > RAX: c9004069100c RBX: 88800ed412f8 RCX: c9004069105c > RDX: 0001 RSI: 000febd4 RDI: c90040691000 > RBP: 0003 R08: R09: febd404f > R10: 7ff0 R11: 88800ee8ae40 R12: 88800ed41000 > R13: R14: 0040 R15: feba > FS: () GS:88801840() > knlGS: > CS: e030 DS: ES: CR0: 80050033 > CR2: 807f5ea0 CR3: 12f6a000 CR4: 0660 > Call Trace: > e1000e_set_interrupt_capability+0xbf/0xd0 [e1000e] > e1000_probe+0x41f/0xdb0 [e1000e] > local_pci_probe+0x42/0x80 > (...) > > There is pci_msi_ignore_mask variable for bypassing MSI(-X) masking on Xen > PV, but msix_mask_all() missed checking it. Add the check there too. 
> > Fixes: 7d5ec3d36123 ("PCI/MSI: Mask all unused MSI-X entries") > Cc: sta...@vger.kernel.org > Signed-off-by: Marek Marczykowski-Górecki Acked-by: Bjorn Helgaas > --- > Cc: xen-devel@lists.xenproject.org > > Changes in v2: > - update commit message (MSI -> MSI-X) > --- > drivers/pci/msi.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c > index e5e75331b415..3a9f4f8ad8f9 100644 > --- a/drivers/pci/msi.c > +++ b/drivers/pci/msi.c > @@ -776,6 +776,9 @@ static void msix_mask_all(void __iomem *base, int tsize) > u32 ctrl = PCI_MSIX_ENTRY_CTRL_MASKBIT; > int i; > > + if (pci_msi_ignore_mask) > + return; > + > for (i = 0; i < tsize; i++, base += PCI_MSIX_ENTRY_SIZE) > writel(ctrl, base + PCI_MSIX_ENTRY_VECTOR_CTRL); > } > -- > 2.31.1 >
[xen-4.12-testing test] 164489: tolerable FAIL - PUSHED
flight 164489 xen-4.12-testing real [real]
http://logs.test-lab.xenproject.org/osstest/logs/164489/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-qcow2 19 guest-localmigrate/x10 fail like 162549
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 162549
 test-armhf-armhf-libvirt 16 saverestore-support-check fail like 162549
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 162549
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop fail like 162549
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop fail like 162549
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 162549
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop fail like 162549
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop fail like 162549
 test-armhf-armhf-libvirt-raw 15 saverestore-support-check fail like 162549
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 162549
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 162549
 test-amd64-i386-xl-pvshim 14 guest-start fail never pass
 test-amd64-i386-libvirt-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-seattle 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-seattle 16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt 15 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check fail never pass
 test-amd64-i386-libvirt 15 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-arm64-arm64-xl 15 migrate-support-check fail never pass
 test-arm64-arm64-xl 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-credit2 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit2 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-credit1 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit1 16 saverestore-support-check fail never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-xsm 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-vhd 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-vhd 15 saverestore-support-check fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit1 15 migrate-support-check fail never pass
 test-armhf-armhf-xl 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit1 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit2 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit2 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt 15 migrate-support-check fail never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-arndale 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-arndale 16 saverestore-support-check fail never pass

version targeted for testing:
 xen 35ba323378d05509f2e0dc049520e140be183003
baseline version:
 xen ea20eee97e9e0861127a8070cc7b9ae3557b09fb

Last test of basis 162549 2021-06-08 18:37:01 Z 79 days
Failing since 164259 2021-08-19 17:07:29 Z 7 days 5 attempts
Testing same since 164489 2021-08-25 17:57:16 Z 1 days 1 attempts

People who touched revisions under test:
 Andrew Cooper
 Anthony PERARD
 Ian Jackson
 Jan Beulich
 Jason Andryuk
 Julien Grall
 Roger Pau Monné
 Stefano
[PATCH v2] PCI/MSI: Skip masking MSI-X on Xen PV
When running as a Xen PV guest, masking MSI-X is the responsibility of the hypervisor. The guest has no write access to the relevant BAR at all - when it tries to write, it results in a crash like this:

BUG: unable to handle page fault for address: c9004069100c
#PF: supervisor write access in kernel mode
#PF: error_code(0x0003) - permissions violation
PGD 18f1c067 P4D 18f1c067 PUD 4dbd067 PMD 4fba067 PTE 8010febd4075
Oops: 0003 [#1] SMP NOPTI
CPU: 0 PID: 234 Comm: kworker/0:2 Tainted: GW 5.14.0-rc7-1.fc32.qubes.x86_64 #15
Workqueue: events work_for_cpu_fn
RIP: e030:__pci_enable_msix_range.part.0+0x26b/0x5f0
Code: 2f 96 ff 48 89 44 24 28 48 89 c7 48 85 c0 0f 84 f6 01 00 00 45 0f b7 f6 48 8d 40 0c ba 01 00 00 00 49 c1 e6 04 4a 8d 4c 37 1c <89> 10 48 83 c0 10 48 39 c1 75 f5 41 0f b6 44 24 6a 84 c0 0f 84 48
RSP: e02b:c9004018bd50 EFLAGS: 00010212
RAX: c9004069100c RBX: 88800ed412f8 RCX: c9004069105c
RDX: 0001 RSI: 000febd4 RDI: c90040691000
RBP: 0003 R08: R09: febd404f
R10: 7ff0 R11: 88800ee8ae40 R12: 88800ed41000
R13: R14: 0040 R15: feba
FS: () GS:88801840() knlGS:
CS: e030 DS: ES: CR0: 80050033
CR2: 807f5ea0 CR3: 12f6a000 CR4: 0660
Call Trace:
 e1000e_set_interrupt_capability+0xbf/0xd0 [e1000e]
 e1000_probe+0x41f/0xdb0 [e1000e]
 local_pci_probe+0x42/0x80
 (...)

There is a pci_msi_ignore_mask variable for bypassing MSI(-X) masking on Xen PV, but msix_mask_all() missed checking it. Add the check there too.
Fixes: 7d5ec3d36123 ("PCI/MSI: Mask all unused MSI-X entries")
Cc: sta...@vger.kernel.org
Signed-off-by: Marek Marczykowski-Górecki
---
Cc: xen-devel@lists.xenproject.org

Changes in v2:
- update commit message (MSI -> MSI-X)
---
 drivers/pci/msi.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index e5e75331b415..3a9f4f8ad8f9 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -776,6 +776,9 @@ static void msix_mask_all(void __iomem *base, int tsize)
 	u32 ctrl = PCI_MSIX_ENTRY_CTRL_MASKBIT;
 	int i;
 
+	if (pci_msi_ignore_mask)
+		return;
+
 	for (i = 0; i < tsize; i++, base += PCI_MSIX_ENTRY_SIZE)
 		writel(ctrl, base + PCI_MSIX_ENTRY_VECTOR_CTRL);
 }
-- 
2.31.1
Re: [PATCH] PCI/MSI: skip masking MSI on Xen PV
On Thu, Aug 26, 2021 at 06:36:49PM +0200, Marek Marczykowski-Górecki wrote: > On Thu, Aug 26, 2021 at 09:55:32AM -0500, Bjorn Helgaas wrote: > > If/when you repost this, please run "git log --oneline > > drivers/pci/msi.c" and follow the convention of capitalizing the > > subject line. > > > > Also, I think this patch refers specifically to MSI-X, not MSI, so > > please update the subject line and the "masking MSI" below to reflect > > that. > > Sure, thanks for pointing this out. Is the patch otherwise ok? Should I > post v2 with just updated commit message? Wouldn't hurt to post a v2. I don't have any objections to the patch, but ultimately up to Thomas. > > On Thu, Aug 26, 2021 at 03:43:37PM +0200, Marek Marczykowski-Górecki wrote: > > > When running as Xen PV guest, masking MSI is a responsibility of the > > > hypervisor. Guest has no write access to relevant BAR at all - when it > > > tries to, it results in a crash like this: > > > > > > BUG: unable to handle page fault for address: c9004069100c > > > #PF: supervisor write access in kernel mode > > > #PF: error_code(0x0003) - permissions violation > > > PGD 18f1c067 P4D 18f1c067 PUD 4dbd067 PMD 4fba067 PTE 8010febd4075 > > > Oops: 0003 [#1] SMP NOPTI > > > CPU: 0 PID: 234 Comm: kworker/0:2 Tainted: GW > > > 5.14.0-rc7-1.fc32.qubes.x86_64 #15 > > > Workqueue: events work_for_cpu_fn > > > RIP: e030:__pci_enable_msix_range.part.0+0x26b/0x5f0 > > > Code: 2f 96 ff 48 89 44 24 28 48 89 c7 48 85 c0 0f 84 f6 01 00 00 45 > > > 0f b7 f6 48 8d 40 0c ba 01 00 00 00 49 c1 e6 04 4a 8d 4c 37 1c <89> 10 48 > > > 83 c0 10 48 39 c1 75 f5 41 0f b6 44 24 6a 84 c0 0f 84 48 > > > RSP: e02b:c9004018bd50 EFLAGS: 00010212 > > > RAX: c9004069100c RBX: 88800ed412f8 RCX: c9004069105c > > > RDX: 0001 RSI: 000febd4 RDI: c90040691000 > > > RBP: 0003 R08: R09: febd404f > > > R10: 7ff0 R11: 88800ee8ae40 R12: 88800ed41000 > > > R13: R14: 0040 R15: feba > > > FS: () GS:88801840() > > > knlGS: > > > CS: e030 DS: ES: CR0: 80050033 > > > CR2: 
807f5ea0 CR3: 12f6a000 CR4: 0660 > > > Call Trace: > > > e1000e_set_interrupt_capability+0xbf/0xd0 [e1000e] > > > e1000_probe+0x41f/0xdb0 [e1000e] > > > local_pci_probe+0x42/0x80 > > > (...) > > > > > > There is pci_msi_ignore_mask variable for bypassing MSI masking on Xen > > > PV, but msix_mask_all() missed checking it. Add the check there too. > > > > > > Fixes: 7d5ec3d36123 ("PCI/MSI: Mask all unused MSI-X entries") > > > Cc: sta...@vger.kernel.org > > > > 7d5ec3d36123 appeared in v5.14-rc6, so if this fix is merged before > > v5.14, the stable tag will be unnecessary. But we are running out of > > time there. > > 7d5ec3d36123 was already backported to stable branches (at least 5.10 > and 5.4), and in fact this is how I discovered the issue... Oh, right, of course. Sorry :) > > > Signed-off-by: Marek Marczykowski-Górecki > > > > > > --- > > > Cc: xen-devel@lists.xenproject.org > > > --- > > > drivers/pci/msi.c | 3 +++ > > > 1 file changed, 3 insertions(+) > > > > > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c > > > index e5e75331b415..3a9f4f8ad8f9 100644 > > > --- a/drivers/pci/msi.c > > > +++ b/drivers/pci/msi.c > > > @@ -776,6 +776,9 @@ static void msix_mask_all(void __iomem *base, int > > > tsize) > > > u32 ctrl = PCI_MSIX_ENTRY_CTRL_MASKBIT; > > > int i; > > > > > > + if (pci_msi_ignore_mask) > > > + return; > > > + > > > for (i = 0; i < tsize; i++, base += PCI_MSIX_ENTRY_SIZE) > > > writel(ctrl, base + PCI_MSIX_ENTRY_VECTOR_CTRL); > > > } > > > -- > > > 2.31.1 > > > > > -- > Best Regards, > Marek Marczykowski-Górecki > Invisible Things Lab
Re: [PATCH] PCI/MSI: skip masking MSI on Xen PV
On Thu, Aug 26, 2021 at 09:55:32AM -0500, Bjorn Helgaas wrote: > If/when you repost this, please run "git log --oneline > drivers/pci/msi.c" and follow the convention of capitalizing the > subject line. > > Also, I think this patch refers specifically to MSI-X, not MSI, so > please update the subject line and the "masking MSI" below to reflect > that. Sure, thanks for pointing this out. Is the patch otherwise ok? Should I post v2 with just updated commit message? > On Thu, Aug 26, 2021 at 03:43:37PM +0200, Marek Marczykowski-Górecki wrote: > > When running as Xen PV guest, masking MSI is a responsibility of the > > hypervisor. Guest has no write access to relevant BAR at all - when it > > tries to, it results in a crash like this: > > > > BUG: unable to handle page fault for address: c9004069100c > > #PF: supervisor write access in kernel mode > > #PF: error_code(0x0003) - permissions violation > > PGD 18f1c067 P4D 18f1c067 PUD 4dbd067 PMD 4fba067 PTE 8010febd4075 > > Oops: 0003 [#1] SMP NOPTI > > CPU: 0 PID: 234 Comm: kworker/0:2 Tainted: GW > > 5.14.0-rc7-1.fc32.qubes.x86_64 #15 > > Workqueue: events work_for_cpu_fn > > RIP: e030:__pci_enable_msix_range.part.0+0x26b/0x5f0 > > Code: 2f 96 ff 48 89 44 24 28 48 89 c7 48 85 c0 0f 84 f6 01 00 00 45 0f > > b7 f6 48 8d 40 0c ba 01 00 00 00 49 c1 e6 04 4a 8d 4c 37 1c <89> 10 48 83 > > c0 10 48 39 c1 75 f5 41 0f b6 44 24 6a 84 c0 0f 84 48 > > RSP: e02b:c9004018bd50 EFLAGS: 00010212 > > RAX: c9004069100c RBX: 88800ed412f8 RCX: c9004069105c > > RDX: 0001 RSI: 000febd4 RDI: c90040691000 > > RBP: 0003 R08: R09: febd404f > > R10: 7ff0 R11: 88800ee8ae40 R12: 88800ed41000 > > R13: R14: 0040 R15: feba > > FS: () GS:88801840() > > knlGS: > > CS: e030 DS: ES: CR0: 80050033 > > CR2: 807f5ea0 CR3: 12f6a000 CR4: 0660 > > Call Trace: > > e1000e_set_interrupt_capability+0xbf/0xd0 [e1000e] > > e1000_probe+0x41f/0xdb0 [e1000e] > > local_pci_probe+0x42/0x80 > > (...) 
> > > > There is pci_msi_ignore_mask variable for bypassing MSI masking on Xen > > PV, but msix_mask_all() missed checking it. Add the check there too. > > > > Fixes: 7d5ec3d36123 ("PCI/MSI: Mask all unused MSI-X entries") > > Cc: sta...@vger.kernel.org > > 7d5ec3d36123 appeared in v5.14-rc6, so if this fix is merged before > v5.14, the stable tag will be unnecessary. But we are running out of > time there. 7d5ec3d36123 was already backported to stable branches (at least 5.10 and 5.4), and in fact this is how I discovered the issue... > > Signed-off-by: Marek Marczykowski-Górecki > > --- > > Cc: xen-devel@lists.xenproject.org > > --- > > drivers/pci/msi.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c > > index e5e75331b415..3a9f4f8ad8f9 100644 > > --- a/drivers/pci/msi.c > > +++ b/drivers/pci/msi.c > > @@ -776,6 +776,9 @@ static void msix_mask_all(void __iomem *base, int tsize) > > u32 ctrl = PCI_MSIX_ENTRY_CTRL_MASKBIT; > > int i; > > > > + if (pci_msi_ignore_mask) > > + return; > > + > > for (i = 0; i < tsize; i++, base += PCI_MSIX_ENTRY_SIZE) > > writel(ctrl, base + PCI_MSIX_ENTRY_VECTOR_CTRL); > > } > > -- > > 2.31.1 > > -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab
[xen-4.11-testing test] 164486: regressions - FAIL
flight 164486 xen-4.11-testing real [real]
flight 164502 xen-4.11-testing real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/164486/
http://logs.test-lab.xenproject.org/osstest/logs/164502/

Regressions :-(

Tests which did not succeed and are blocking, including tests which could not be run:
 test-amd64-i386-xl-qemuu-ovmf-amd64 12 debian-hvm-install fail REGR. vs. 162548
 test-amd64-amd64-xl-qemuu-ovmf-amd64 12 debian-hvm-install fail REGR. vs. 162548

Tests which are failing intermittently (not blocking):
 test-arm64-arm64-xl-seattle 8 xen-boot fail pass in 164502-retest
 test-amd64-amd64-xl-qemut-debianhvm-i386-xsm 12 debian-hvm-install fail pass in 164502-retest

Tests which did not succeed, but are not blocking:
 test-arm64-arm64-xl-seattle 15 migrate-support-check fail in 164502 never pass
 test-arm64-arm64-xl-seattle 16 saverestore-support-check fail in 164502 never pass
 test-amd64-amd64-xl-qemuu-dmrestrict-amd64-dmrestrict 12 debian-hvm-install fail like 162548
 test-amd64-i386-xl-qemuu-dmrestrict-amd64-dmrestrict 12 debian-hvm-install fail like 162548
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 162548
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop fail like 162548
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 162548
 test-armhf-armhf-libvirt 16 saverestore-support-check fail like 162548
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 162548
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop fail like 162548
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop fail like 162548
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 162548
 test-armhf-armhf-libvirt-raw 15 saverestore-support-check fail like 162548
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop fail like 162548
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 162548
 test-arm64-arm64-xl-thunderx 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-i386-libvirt-xsm 15 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check fail never pass
 test-amd64-amd64-libvirt 15 migrate-support-check fail never pass
 test-amd64-i386-xl-pvshim 14 guest-start fail never pass
 test-amd64-i386-libvirt 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit2 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit2 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-xsm 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-credit1 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit1 16 saverestore-support-check fail never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-arndale 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-arndale 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit1 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit1 16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-arm64-arm64-xl 15 migrate-support-check fail never pass
 test-arm64-arm64-xl 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl 15 migrate-support-check fail never pass
 test-armhf-armhf-xl 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit2 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit2 16 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-vhd 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-vhd 15 saverestore-support-check fail never pass
Re: [PATCH] PCI/MSI: skip masking MSI on Xen PV
If/when you repost this, please run "git log --oneline drivers/pci/msi.c" and follow the convention of capitalizing the subject line. Also, I think this patch refers specifically to MSI-X, not MSI, so please update the subject line and the "masking MSI" below to reflect that. On Thu, Aug 26, 2021 at 03:43:37PM +0200, Marek Marczykowski-Górecki wrote: > When running as Xen PV guest, masking MSI is a responsibility of the > hypervisor. Guest has no write access to relevant BAR at all - when it > tries to, it results in a crash like this: > > BUG: unable to handle page fault for address: c9004069100c > #PF: supervisor write access in kernel mode > #PF: error_code(0x0003) - permissions violation > PGD 18f1c067 P4D 18f1c067 PUD 4dbd067 PMD 4fba067 PTE 8010febd4075 > Oops: 0003 [#1] SMP NOPTI > CPU: 0 PID: 234 Comm: kworker/0:2 Tainted: GW > 5.14.0-rc7-1.fc32.qubes.x86_64 #15 > Workqueue: events work_for_cpu_fn > RIP: e030:__pci_enable_msix_range.part.0+0x26b/0x5f0 > Code: 2f 96 ff 48 89 44 24 28 48 89 c7 48 85 c0 0f 84 f6 01 00 00 45 0f > b7 f6 48 8d 40 0c ba 01 00 00 00 49 c1 e6 04 4a 8d 4c 37 1c <89> 10 48 83 c0 > 10 48 39 c1 75 f5 41 0f b6 44 24 6a 84 c0 0f 84 48 > RSP: e02b:c9004018bd50 EFLAGS: 00010212 > RAX: c9004069100c RBX: 88800ed412f8 RCX: c9004069105c > RDX: 0001 RSI: 000febd4 RDI: c90040691000 > RBP: 0003 R08: R09: febd404f > R10: 7ff0 R11: 88800ee8ae40 R12: 88800ed41000 > R13: R14: 0040 R15: feba > FS: () GS:88801840() > knlGS: > CS: e030 DS: ES: CR0: 80050033 > CR2: 807f5ea0 CR3: 12f6a000 CR4: 0660 > Call Trace: > e1000e_set_interrupt_capability+0xbf/0xd0 [e1000e] > e1000_probe+0x41f/0xdb0 [e1000e] > local_pci_probe+0x42/0x80 > (...) > > There is pci_msi_ignore_mask variable for bypassing MSI masking on Xen > PV, but msix_mask_all() missed checking it. Add the check there too. 
> > Fixes: 7d5ec3d36123 ("PCI/MSI: Mask all unused MSI-X entries") > Cc: sta...@vger.kernel.org 7d5ec3d36123 appeared in v5.14-rc6, so if this fix is merged before v5.14, the stable tag will be unnecessary. But we are running out of time there. > Signed-off-by: Marek Marczykowski-Górecki > --- > Cc: xen-devel@lists.xenproject.org > --- > drivers/pci/msi.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c > index e5e75331b415..3a9f4f8ad8f9 100644 > --- a/drivers/pci/msi.c > +++ b/drivers/pci/msi.c > @@ -776,6 +776,9 @@ static void msix_mask_all(void __iomem *base, int tsize) > u32 ctrl = PCI_MSIX_ENTRY_CTRL_MASKBIT; > int i; > > + if (pci_msi_ignore_mask) > + return; > + > for (i = 0; i < tsize; i++, base += PCI_MSIX_ENTRY_SIZE) > writel(ctrl, base + PCI_MSIX_ENTRY_VECTOR_CTRL); > } > -- > 2.31.1 >
Re: [PATCH v7 8/8] AMD/IOMMU: respect AtsDisabled device flag
On 26.08.2021 16:27, Andrew Cooper wrote: > On 26/08/2021 08:26, Jan Beulich wrote: >> TBD: I find the ordering in amd_iommu_disable_domain_device() >> suspicious: amd_iommu_enable_domain_device() sets up the DTE first >> and then enables ATS on the device. It would seem to me that >> disabling would better be done the other way around (disable ATS on >> device, then adjust DTE). > > I'd hope that the worst which goes wrong, given the problematic order, > is a master abort. > > But yes - ATS wants disabling on the device first, before the DTE is > updated. Okay, I'll add another patch. Jan
Re: [PATCH v7 7/8] AMD/IOMMU: add "ivmd=" command line option
On 26.08.2021 16:08, Andrew Cooper wrote: > On 26/08/2021 08:25, Jan Beulich wrote: >> @@ -1523,6 +1523,31 @@ _dom0-iommu=map-inclusive_ - using both >> > `= ` >> >> ### irq_vector_map (x86) >> + >> +### ivmd (x86) >> +> `= >> [-][=[-][,[-][,...]]][;...]` >> + >> +Define IVMD-like ranges that are missing from ACPI tables along with the >> +device(s) they belong to, and use them for 1:1 mapping. End addresses can >> be >> +omitted when exactly one page is meant. The ranges are inclusive when start >> +and end are specified. Note that only PCI segment 0 is supported at this >> time, >> +but it is fine to specify it explicitly. >> + >> +'start' and 'end' values are page numbers (not full physical addresses), >> +in hexadecimal format (can optionally be preceded by "0x"). >> + >> +Omitting the optional (range of) BDF spcifiers signals that the range is to >> +be applied to all devices. >> + >> +Usage example: If device 0:0:1d.0 requires one page (0xd5d45) to be >> +reserved, and devices 0:0:1a.0...0:0:1a.3 collectively require three pages >> +(0xd5d46 thru 0xd5d48) to be reserved, one usage would be: >> + >> +ivmd=d5d45=0:1d.0;0xd5d46-0xd5d48=0:1a.0-0:1a.3 >> + >> +Note: grub2 requires to escape or quote special characters, like ';' when >> +multiple ranges are specified - refer to the grub2 documentation. > > I'm slightly concerned that we're putting in syntax which the majority > bootloader in use can't handle. This matches RMRR handling, and I'd really like to keep the two as similar as possible. Plus you can avoid the use of ; also by having more than one "ivmd=" on the command line. > A real IVMD entry in hardware doesn't have the concept of multiple > device ranges, so I think comma ought to be the main list separator, and > I don't think we need ; at all. Firmware would need to present two IVMD entries in such a case. On the command line I think we should allow more compact representation. Plus again - this is similar to "rmrr=". Jan
Re: [PATCH v7 8/8] AMD/IOMMU: respect AtsDisabled device flag
On 26/08/2021 08:26, Jan Beulich wrote: > IVHD entries may specify that ATS is to be blocked for a device or range > of devices. Honor firmware telling us so. It would be helpful if there was any explanation as to the purpose of this bit. It is presumably to do with SecureATS, but that works by accepting the ATS translation and doing the pagewalk anyway. > > While adding respective checks I noticed that the 2nd conditional in > amd_iommu_setup_domain_device() failed to check the IOMMU's capability. > Add the missing part of the condition there, as no good can come from > enabling ATS on a device when the IOMMU is not capable of dealing with > ATS requests. > > For actually using ACPI_IVHD_ATS_DISABLED, make its expansion no longer > exhibit UB. > > Signed-off-by: Jan Beulich > --- > TBD: I find the ordering in amd_iommu_disable_domain_device() > suspicious: amd_iommu_enable_domain_device() sets up the DTE first > and then enables ATS on the device. It would seem to me that > disabling would better be done the other way around (disable ATS on > device, then adjust DTE). I'd hope that the worst which goes wrong, given the problematic order, is a master abort. But yes - ATS wants disabling on the device first, before the DTE is updated. ~Andrew
Re: [PATCH v7 7/8] AMD/IOMMU: add "ivmd=" command line option
On 26/08/2021 08:25, Jan Beulich wrote: > Just like VT-d's "rmrr=" it can be used to cover for firmware omissions. > Since systems surfacing IVMDs seem to be rare, it is also meant to allow > testing of the involved code. > > Only the IVMD flavors actually understood by the IVMD parsing logic can > be generated, and for this initial implementation there's also no way to > control the flags field - unity r/w mappings are assumed. > > Signed-off-by: Jan Beulich > Reviewed-by: Paul Durrant > --- > v5: New. > > --- a/docs/misc/xen-command-line.pandoc > +++ b/docs/misc/xen-command-line.pandoc > @@ -836,12 +836,12 @@ Controls for the dom0 IOMMU setup. > > Typically, some devices in a system use bits of RAM for communication, > and > these areas should be listed as reserved in the E820 table and identified > -via RMRR or IVMD entries in the APCI tables, so Xen can ensure that they > +via RMRR or IVMD entries in the ACPI tables, so Xen can ensure that they Oops. > @@ -1523,6 +1523,31 @@ _dom0-iommu=map-inclusive_ - using both > > `= ` > > ### irq_vector_map (x86) > + > +### ivmd (x86) > +> `= > [-][=[-][,[-][,...]]][;...]` > + > +Define IVMD-like ranges that are missing from ACPI tables along with the > +device(s) they belong to, and use them for 1:1 mapping. End addresses can be > +omitted when exactly one page is meant. The ranges are inclusive when start > +and end are specified. Note that only PCI segment 0 is supported at this > time, > +but it is fine to specify it explicitly. > + > +'start' and 'end' values are page numbers (not full physical addresses), > +in hexadecimal format (can optionally be preceded by "0x"). > + > +Omitting the optional (range of) BDF specifiers signals that the range is to > +be applied to all devices.
> + > +Usage example: If device 0:0:1d.0 requires one page (0xd5d45) to be > +reserved, and devices 0:0:1a.0...0:0:1a.3 collectively require three pages > +(0xd5d46 thru 0xd5d48) to be reserved, one usage would be: > + > +ivmd=d5d45=0:1d.0;0xd5d46-0xd5d48=0:1a.0-0:1a.3 > + > +Note: grub2 requires to escape or quote special characters, like ';' when > +multiple ranges are specified - refer to the grub2 documentation. I'm slightly concerned that we're putting in syntax which the majority bootloader in use can't handle. A real IVMD entry in hardware doesn't have the concept of multiple device ranges, so I think comma ought to be the main list separator, and I don't think we need ; at all. ~Andrew
Re: [PATCH v7 6/8] AMD/IOMMU: provide function backing XENMEM_reserved_device_memory_map
On 26.08.2021 15:24, Andrew Cooper wrote: > On 26/08/2021 08:25, Jan Beulich wrote: >> --- a/xen/drivers/passthrough/amd/iommu_map.c >> +++ b/xen/drivers/passthrough/amd/iommu_map.c >> @@ -467,6 +467,81 @@ int amd_iommu_reserve_domain_unity_unmap >> return rc; >> } >> >> +int amd_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt) >> +{ >> +unsigned int seg = 0 /* XXX */, bdf; > > Is this XXX intended to stay? Yes - we've already got 3 similar instances plus at least two where the hardcoding of segment 0 is not otherwise marked. Jan > I can't say I'm fussed about multi-segment handling, given the absence > of support on any hardware I've ever encountered. > > ~Andrew >
Re: [PATCH v7 4/8] AMD/IOMMU: check IVMD ranges against host implementation limits
On 26.08.2021 15:16, Andrew Cooper wrote: > On 26/08/2021 08:24, Jan Beulich wrote: >> When such ranges can't be represented as 1:1 mappings in page tables, >> reject them as presumably bogus. Note that when we detect features late >> (because of EFRSup being clear in the ACPI tables), it would be quite a >> bit of work to check for (and drop) out of range IVMD ranges, so IOMMU >> initialization gets failed in this case instead. >> >> Signed-off-by: Jan Beulich >> Reviewed-by: Paul Durrant > > I'm not certain this is correct in combination with memory encryption. Which we don't enable. I don't follow why you put this up as an extra requirement. I'm adding checks based on IOMMU-specific data alone. I think that's a fair and consistent step in the right direction, no matter that there may be another step to go. Plus ... > The upper bits are the KeyID, but we shouldn't find any of those set in > an IVMD range. I think at a minimum, we need to reduce the address > check by the stolen bits for encryption, which gives a lower bound. ... aren't you asking here to consume data I'm only making available in the still pending "x86/AMD: make HT range dynamic for Fam17 and up", where I (now, i.e. v2) adjust paddr_bits accordingly? Else I'm afraid I don't even know what you're talking about. Jan
[PATCH] PCI/MSI: skip masking MSI on Xen PV
When running as a Xen PV guest, masking MSI is the responsibility of the hypervisor. The guest has no write access to the relevant BAR at all - when it tries to write, it results in a crash like this: BUG: unable to handle page fault for address: c9004069100c #PF: supervisor write access in kernel mode #PF: error_code(0x0003) - permissions violation PGD 18f1c067 P4D 18f1c067 PUD 4dbd067 PMD 4fba067 PTE 8010febd4075 Oops: 0003 [#1] SMP NOPTI CPU: 0 PID: 234 Comm: kworker/0:2 Tainted: GW 5.14.0-rc7-1.fc32.qubes.x86_64 #15 Workqueue: events work_for_cpu_fn RIP: e030:__pci_enable_msix_range.part.0+0x26b/0x5f0 Code: 2f 96 ff 48 89 44 24 28 48 89 c7 48 85 c0 0f 84 f6 01 00 00 45 0f b7 f6 48 8d 40 0c ba 01 00 00 00 49 c1 e6 04 4a 8d 4c 37 1c <89> 10 48 83 c0 10 48 39 c1 75 f5 41 0f b6 44 24 6a 84 c0 0f 84 48 RSP: e02b:c9004018bd50 EFLAGS: 00010212 RAX: c9004069100c RBX: 88800ed412f8 RCX: c9004069105c RDX: 0001 RSI: 000febd4 RDI: c90040691000 RBP: 0003 R08: R09: febd404f R10: 7ff0 R11: 88800ee8ae40 R12: 88800ed41000 R13: R14: 0040 R15: feba FS: () GS:88801840() knlGS: CS: e030 DS: ES: CR0: 80050033 CR2: 807f5ea0 CR3: 12f6a000 CR4: 0660 Call Trace: e1000e_set_interrupt_capability+0xbf/0xd0 [e1000e] e1000_probe+0x41f/0xdb0 [e1000e] local_pci_probe+0x42/0x80 (...) There is a pci_msi_ignore_mask variable for bypassing MSI masking on Xen PV, but msix_mask_all() missed checking it. Add the check there too. 
Fixes: 7d5ec3d36123 ("PCI/MSI: Mask all unused MSI-X entries") Cc: sta...@vger.kernel.org Signed-off-by: Marek Marczykowski-Górecki --- Cc: xen-devel@lists.xenproject.org --- drivers/pci/msi.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index e5e75331b415..3a9f4f8ad8f9 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -776,6 +776,9 @@ static void msix_mask_all(void __iomem *base, int tsize) u32 ctrl = PCI_MSIX_ENTRY_CTRL_MASKBIT; int i; + if (pci_msi_ignore_mask) + return; + for (i = 0; i < tsize; i++, base += PCI_MSIX_ENTRY_SIZE) writel(ctrl, base + PCI_MSIX_ENTRY_VECTOR_CTRL); } -- 2.31.1
Re: [PATCH v7 6/8] AMD/IOMMU: provide function backing XENMEM_reserved_device_memory_map
On 26/08/2021 08:25, Jan Beulich wrote: > --- a/xen/drivers/passthrough/amd/iommu_map.c > +++ b/xen/drivers/passthrough/amd/iommu_map.c > @@ -467,6 +467,81 @@ int amd_iommu_reserve_domain_unity_unmap > return rc; > } > > +int amd_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt) > +{ > +unsigned int seg = 0 /* XXX */, bdf; Is this XXX intended to stay? I can't say I'm fussed about multi-segment handling, given the absence of support on any hardware I've ever encountered. ~Andrew
Re: [PATCH v1 01/14] xen/pci: Refactor MSI code that implements MSI functionality within XEN
On 8/19/21 8:02 AM, Rahul Singh wrote: > MSI code that implements MSI functionality to support MSI within XEN is > not usable on ARM. Move the code under CONFIG_HAS_PCI_MSI flag to gate > the code for ARM. > > No functional change intended. > > Signed-off-by: Rahul Singh > --- > xen/arch/x86/Kconfig | 1 + > xen/drivers/passthrough/Makefile | 1 + > xen/drivers/passthrough/msi.c| 96 > xen/drivers/passthrough/pci.c| 54 +- > xen/drivers/pci/Kconfig | 4 ++ > xen/include/xen/msi.h| 56 +++ > xen/xsm/flask/hooks.c| 8 +-- > 7 files changed, 175 insertions(+), 45 deletions(-) > create mode 100644 xen/drivers/passthrough/msi.c > create mode 100644 xen/include/xen/msi.h > ... > diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c > index f1a1217c98..fdcfeb984c 100644 > --- a/xen/xsm/flask/hooks.c > +++ b/xen/xsm/flask/hooks.c > @@ -21,7 +21,7 @@ > #include > #include > #include > -#ifdef CONFIG_HAS_PCI > +#ifdef CONFIG_HAS_PCI_MSI > #include > #endif > #include > @@ -114,7 +114,7 @@ static int get_irq_sid(int irq, u32 *sid, struct > avc_audit_data *ad) > } > return security_irq_sid(irq, sid); > } > -#ifdef CONFIG_HAS_PCI > +#ifdef CONFIG_HAS_PCI_MSI > { > struct irq_desc *desc = irq_to_desc(irq); > if ( desc->msi_desc && desc->msi_desc->dev ) { > @@ -874,7 +874,7 @@ static int flask_map_domain_pirq (struct domain *d) > static int flask_map_domain_msi (struct domain *d, int irq, const void *data, > u32 *sid, struct avc_audit_data *ad) > { > -#ifdef CONFIG_HAS_PCI > +#ifdef CONFIG_HAS_PCI_MSI > const struct msi_info *msi = data; > u32 machine_bdf = (msi->seg << 16) | (msi->bus << 8) | msi->devfn; > > @@ -940,7 +940,7 @@ static int flask_unmap_domain_pirq (struct domain *d) > static int flask_unmap_domain_msi (struct domain *d, int irq, const void > *data, > u32 *sid, struct avc_audit_data *ad) > { > -#ifdef CONFIG_HAS_PCI > +#ifdef CONFIG_HAS_PCI_MSI > const struct pci_dev *pdev = data; > u32 machine_bdf = (pdev->seg << 16) | (pdev->bus << 8) | pdev->devfn; > > 
Straightforward, so I see no issue with the flask related changes. Reviewed-by: Daniel P. Smith v/r dps
Re: [PATCH 4/9] gnttab: drop GNTMAP_can_fail
On 26/08/2021 14:03, Jan Beulich wrote: > On 26.08.2021 13:45, Andrew Cooper wrote: >> On 26/08/2021 11:13, Jan Beulich wrote: >>> There's neither documentation of what this flag is supposed to mean, nor >>> any implementation. With this, don't even bother enclosing the #define-s >>> in a __XEN_INTERFACE_VERSION__ conditional, but drop them altogether. >>> >>> Signed-off-by: Jan Beulich >> It was introduced in 4d45702cf0398 along with GNTST_eagain, and the >> commit message hints that it is for xen-paging >> >> Furthermore, the first use of GNTST_eagain was in ecb35ecb79e0 for >> trying to map/copy a paged-out frame. >> >> Therefore I think it is reasonable to conclude that there was a planned >> interaction between GNTMAP_can_fail and paging, which presumably would >> have been "don't pull this in from disk if it is paged out". > I suppose that's (far fetched?) guesswork. Not really - the phrase "far fetched" loosely translates as implausible or unlikely, and I wouldn't characterise the suggestion as implausible. It is guesswork, and the most plausible guess I can think of. It is clear that GNTMAP_can_fail is related to paging somehow, which means there is some interaction with the gref not being mappable right now. > >> I think it is fine to drop, as no implementation has ever existed in >> Xen, but it would be helpful to have the history summarised in the >> commit message. > I've extended it to > > "There's neither documentation of what this flag is supposed to mean, nor > any implementation. Commit 4d45702cf0398 ("paging: Updates to public > grant table header file") suggests there might have been plans to use it > for interaction with mem-paging, but no such functionality has ever > materialized. With this, don't even bother enclosing the #define-s in a > __XEN_INTERFACE_VERSION__ conditional, but drop them altogether." LGTM. Reviewed-by: Andrew Cooper > > Jan >
Re: [PATCH v7 4/8] AMD/IOMMU: check IVMD ranges against host implementation limits
On 26/08/2021 08:24, Jan Beulich wrote: > When such ranges can't be represented as 1:1 mappings in page tables, > reject them as presumably bogus. Note that when we detect features late > (because of EFRSup being clear in the ACPI tables), it would be quite a > bit of work to check for (and drop) out of range IVMD ranges, so IOMMU > initialization gets failed in this case instead. > > Signed-off-by: Jan Beulich > Reviewed-by: Paul Durrant I'm not certain this is correct in combination with memory encryption. The upper bits are the KeyID, but we shouldn't find any of those set in an IVMD range. I think at a minimum, we need to reduce the address check by the stolen bits for encryption, which gives a lower bound. ~Andrew
Re: [PATCH v7 3/8] AMD/IOMMU: improve (extended) feature detection
On 26.08.2021 15:02, Andrew Cooper wrote: > On 26/08/2021 08:23, Jan Beulich wrote: >> First of all the documentation is very clear about ACPI table data >> superseding raw register data. Use raw register data only if EFRSup is >> clear in the ACPI tables (which may still go too far). Additionally if >> this flag is clear, the IVRS type 11H table is reserved and hence may >> not be recognized. > > The spec says: > > Software Implementation Note: Information conveyed in the IVRS overrides > the corresponding > information available through the IOMMU hardware registers. System > software is required to honor > the ACPI settings. > > This suggests that if there is an ACPI table, the hardware registers > shouldn't be followed. > > Given what else is broken when there is no ACPI table, I think we can > (and should) not support this configuration. Well, we don't. We do require ACPI tables. But that's not the question here. Instead what this is about is whether the ACPI table has EFRSup clear. This flag being clear when the register actually exists is imo more likely a firmware bug. Plus I didn't want to go too far in one step; I do acknowledge that we may want to be even more strict down the road by saying "(which may still go too far)". >> Furthermore propagate IVRS type 10H data into the feature flags >> recorded, as the full extended features field is available in type 11H >> only. >> >> Note that this also makes it necessary to stop the bad practice of us >> finding a type 11H IVHD entry, but still processing the type 10H one >> in detect_iommu_acpi()'s invocation of amd_iommu_detect_one_acpi(). > > I could have sworn I read in the spec that if 11H is present, 10H should > be ignored, but I can't actually locate a statement to this effect. What I can find is "It is recommended for system software to detect IOMMU features from the fields in the IVHD Type11h structure information, superseding information in Type10h block and MMIO registers." Jan
Re: [PATCH 4/9] gnttab: drop GNTMAP_can_fail
On 26.08.2021 13:45, Andrew Cooper wrote: > On 26/08/2021 11:13, Jan Beulich wrote: >> There's neither documentation of what this flag is supposed to mean, nor >> any implementation. With this, don't even bother enclosing the #define-s >> in a __XEN_INTERFACE_VERSION__ conditional, but drop them altogether. >> >> Signed-off-by: Jan Beulich > > It was introduced in 4d45702cf0398 along with GNTST_eagain, and the > commit message hints that it is for xen-paging > > Furthermore, the first use of GNTST_eagain was in ecb35ecb79e0 for > trying to map/copy a paged-out frame. > > Therefore I think it is reasonable to conclude that there was a planned > interaction between GNTMAP_can_fail and paging, which presumably would > have been "don't pull this in from disk if it is paged out". I suppose that's (far fetched?) guesswork. > I think it is fine to drop, as no implementation has ever existed in > Xen, but it would be helpful to have the history summarised in the > commit message. I've extended it to "There's neither documentation of what this flag is supposed to mean, nor any implementation. Commit 4d45702cf0398 ("paging: Updates to public grant table header file") suggests there might have been plans to use it for interaction with mem-paging, but no such functionality has ever materialized. With this, don't even bother enclosing the #define-s in a __XEN_INTERFACE_VERSION__ conditional, but drop them altogether." Jan
Re: [PATCH v7 3/8] AMD/IOMMU: improve (extended) feature detection
On 26/08/2021 08:23, Jan Beulich wrote: > First of all the documentation is very clear about ACPI table data > superseding raw register data. Use raw register data only if EFRSup is > clear in the ACPI tables (which may still go too far). Additionally if > this flag is clear, the IVRS type 11H table is reserved and hence may > not be recognized. The spec says: Software Implementation Note: Information conveyed in the IVRS overrides the corresponding information available through the IOMMU hardware registers. System software is required to honor the ACPI settings. This suggests that if there is an ACPI table, the hardware registers shouldn't be followed. Given what else is broken when there is no ACPI table, I think we can (and should) not support this configuration. > Furthermore propagate IVRS type 10H data into the feature flags > recorded, as the full extended features field is available in type 11H > only. > > Note that this also makes it necessary to stop the bad practice of us > finding a type 11H IVHD entry, but still processing the type 10H one > in detect_iommu_acpi()'s invocation of amd_iommu_detect_one_acpi(). I could have sworn I read in the spec that if 11H is present, 10H should be ignored, but I can't actually locate a statement to this effect. ~Andrew > > Note also that the features.raw check in amd_iommu_prepare_one() needs > replacing, now that the field can also be populated by different means. > Key IOMMUv2 availability off of IVHD type not being 10H, and then move > it a function layer up, so that it would be set only once all IOMMUs > have been successfully prepared. > > Signed-off-by: Jan Beulich > Reviewed-by: Paul Durrant
Re: [PATCH 07/17] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
On 26.08.2021 13:57, Andrew Cooper wrote: > On 24/08/2021 15:21, Jan Beulich wrote: >> While already the case for PVH, there's no reason to treat PV >> differently here, though of course the addresses get taken from another >> source in this case. Except that, to match CPU side mappings, by default >> we permit r/o ones. This then also means we now deal consistently with >> IO-APICs whose MMIO is or is not covered by E820 reserved regions. >> >> Signed-off-by: Jan Beulich > > Why do we give PV dom0 a mapping of the IO-APIC? Having thought about > it, it cannot possibly be usable. > > IO-APICs use an indirect window, and Xen doesn't perform any > write-emulation (that I can see), so the guest can't even read the data > register and work out which register it represents. It also can't do an > atomic 64bit read across the index and data registers, as that is > explicitly undefined behaviour in the IO-APIC spec. > > On the other hand, we do have PHYSDEVOP_apic_{read,write} which, despite > the name, is for the IO-APIC not the LAPIC, and I bet these hypercalls > were introduced upon discovering that a read-only mapping of the > IO-APIC is totally useless. > > I think we can safely not expose the IO-APICs to PV dom0 at all, which > simplifies the memory handling further. The reason we do expose it r/o is that some ACPI implementations access (read and write) some RTEs from AML. If we don't allow Dom0 to map the IO-APIC, Dom0 will typically fail to boot on such systems. What we have right now seems to be good enough for those systems, no matter that the data they get to read makes little sense. If we learned of systems where this isn't sufficient, we'd have to implement more complete read emulation (i.e. latching writes to the window register while still discarding writes to the data register). 
If we produced a not-present PTE instead of a r/o one for such mapping requests, I'm afraid we'd actually further complicate memory handling, because we'd then need to consider for emulation also n/p #PF, not just writes to r/o mappings. This said - I'd also be fine with consistently not mapping the IO-APICs in the IOMMU page tables. It was you to request CPU and IOMMU mappings to match. What I want to do away with is the present mixture, derived from the E820 type covering the IO-APIC space. Jan
Re: [PATCH 17/17] IOMMU/x86: drop pointless NULL checks
On 26.08.2021 14:05, Andrew Cooper wrote: > On 24/08/2021 15:27, Jan Beulich wrote: >> --- a/xen/drivers/passthrough/vtd/utils.c >> +++ b/xen/drivers/passthrough/vtd/utils.c >> @@ -106,11 +106,6 @@ void print_vtd_entries(struct vtd_iommu >> } >> >> root_entry = (struct root_entry >> *)map_vtd_domain_page(iommu->root_maddr); > > Seeing as you're actually cleaning up mapping calls, drop this cast too? There are more of these, so this would be yet another cleanup patch. I didn't get annoyed enough by these to put one together; instead I was hoping that we might touch these lines anyway at some point. E.g. when doing away with the funny map_vtd_domain_page() wrapper around map_domain_page() ... > Either way, Acked-by: Andrew Cooper Thanks. Jan
Re: [PATCH v7 2/8] AMD/IOMMU: obtain IVHD type to use earlier
On 26.08.2021 14:30, Andrew Cooper wrote: > On 26/08/2021 08:23, Jan Beulich wrote: >> Doing this in amd_iommu_prepare() is too late for it, in particular, to >> be used in amd_iommu_detect_one_acpi(), as a subsequent change will want >> to do. Moving it immediately ahead of amd_iommu_detect_acpi() is >> (luckily) pretty simple, (pretty importantly) without breaking >> amd_iommu_prepare()'s logic to prevent multiple processing. >> >> This involves moving table checksumming, as >> amd_iommu_get_supported_ivhd_type() -> get_supported_ivhd_type() will >> now be invoked before amd_iommu_detect_acpi() -> detect_iommu_acpi(). In >> the course of dojng so stop open-coding acpi_tb_checksum(), seeing that > > doing. > >> --- a/xen/drivers/passthrough/amd/iommu_acpi.c >> +++ b/xen/drivers/passthrough/amd/iommu_acpi.c >> @@ -1150,20 +1152,7 @@ static int __init parse_ivrs_table(struc >> static int __init detect_iommu_acpi(struct acpi_table_header *table) >> { >> const struct acpi_ivrs_header *ivrs_block; >> -unsigned long i; >> unsigned long length = sizeof(struct acpi_table_ivrs); >> -u8 checksum, *raw_table; >> - >> -/* validate checksum: sum of entire table == 0 */ >> -checksum = 0; >> -raw_table = (u8 *)table; >> -for ( i = 0; i < table->length; i++ ) >> -checksum += raw_table[i]; >> -if ( checksum ) >> -{ >> -AMD_IOMMU_DEBUG("IVRS Error: Invalid Checksum %#x\n", checksum); >> -return -ENODEV; >> -} >> >> while ( table->length > (length + sizeof(*ivrs_block)) ) >> { >> @@ -1300,6 +1289,15 @@ get_supported_ivhd_type(struct acpi_tabl >> { >> size_t length = sizeof(struct acpi_table_ivrs); >> const struct acpi_ivrs_header *ivrs_block, *blk = NULL; >> +uint8_t checksum; >> + >> +/* Validate checksum: Sum of entire table == 0. 
*/ >> +checksum = acpi_tb_checksum(ACPI_CAST_PTR(uint8_t, table), >> table->length); >> +if ( checksum ) >> +{ >> +AMD_IOMMU_DEBUG("IVRS Error: Invalid Checksum %#x\n", checksum); > > I know you're just moving code, but this really needs to be a visible > error. It's "I'm turning off the IOMMU because the ACPI table is bad", > which is about as serious as errors come. I'll wait for us settling on patch 1 in this regard, and then follow the same model here. Jan
Re: [PATCH v7 1/8] AMD/IOMMU: check / convert IVMD ranges for being / to be reserved
On 26.08.2021 14:10, Andrew Cooper wrote: > On 26/08/2021 08:23, Jan Beulich wrote: >> While the specification doesn't say so, just like for VT-d's RMRRs no >> good can come from these ranges being e.g. conventional RAM or entirely >> unmarked and hence usable for placing e.g. PCI device BARs. Check >> whether they are, and put in some limited effort to convert to reserved. >> (More advanced logic can be added if actual problems are found with this >> simplistic variant.) >> >> Signed-off-by: Jan Beulich >> Reviewed-by: Paul Durrant >> --- >> v7: Re-base. >> v5: New. >> >> --- a/xen/drivers/passthrough/amd/iommu_acpi.c >> +++ b/xen/drivers/passthrough/amd/iommu_acpi.c >> @@ -384,6 +384,38 @@ static int __init parse_ivmd_block(const >> AMD_IOMMU_DEBUG("IVMD Block: type %#x phys %#lx len %#lx\n", >> ivmd_block->header.type, start_addr, mem_length); >> >> +if ( !e820_all_mapped(base, limit + PAGE_SIZE, E820_RESERVED) ) >> +{ >> +paddr_t addr; >> + >> +AMD_IOMMU_DEBUG("IVMD: [%lx,%lx) is not (entirely) in reserved >> memory\n", >> +base, limit + PAGE_SIZE); >> + >> +for ( addr = base; addr <= limit; addr += PAGE_SIZE ) >> +{ >> +unsigned int type = page_get_ram_type(maddr_to_mfn(addr)); >> + >> +if ( type == RAM_TYPE_UNKNOWN ) >> +{ >> +if ( e820_add_range(, addr, addr + PAGE_SIZE, >> +E820_RESERVED) ) >> +continue; >> +AMD_IOMMU_DEBUG("IVMD Error: Page at %lx couldn't be >> reserved\n", >> +addr); >> +return -EIO; >> +} >> + >> +/* Types which won't be handed out are considered good enough. >> */ >> +if ( !(type & (RAM_TYPE_RESERVED | RAM_TYPE_ACPI | >> + RAM_TYPE_UNUSABLE)) ) >> +continue; >> + >> +AMD_IOMMU_DEBUG("IVMD Error: Page at %lx can't be converted\n", >> +addr); > > I think these print messages need to more than just debug. The first > one is a warning, whereas the final two are hard errors liable to impact > the correct running of the system. Well, people would observe IOMMUs not getting put in use. 
I was following existing style in this regard on the assumption that in such an event people would (be told to) enable "iommu=debug". Hence ... > Especially as you're putting them in to try and spot problem cases, they > should be visible by default for when we inevitably get bug reports to > xen-devel saying "something changed with passthrough in Xen 4.16". ... I can convert to ordinary printk(), provided you're convinced the described model isn't reasonable and introducing a logging inconsistency is worth it. Jan
Re: [PATCH v7 2/8] AMD/IOMMU: obtain IVHD type to use earlier
On 26/08/2021 08:23, Jan Beulich wrote: > Doing this in amd_iommu_prepare() is too late for it, in particular, to > be used in amd_iommu_detect_one_acpi(), as a subsequent change will want > to do. Moving it immediately ahead of amd_iommu_detect_acpi() is > (luckily) pretty simple, (pretty importantly) without breaking > amd_iommu_prepare()'s logic to prevent multiple processing. > > This involves moving table checksumming, as > amd_iommu_get_supported_ivhd_type() -> get_supported_ivhd_type() will > now be invoked before amd_iommu_detect_acpi() -> detect_iommu_acpi(). In > the course of dojng so stop open-coding acpi_tb_checksum(), seeing that doing. > --- a/xen/drivers/passthrough/amd/iommu_acpi.c > +++ b/xen/drivers/passthrough/amd/iommu_acpi.c > @@ -1150,20 +1152,7 @@ static int __init parse_ivrs_table(struc > static int __init detect_iommu_acpi(struct acpi_table_header *table) > { > const struct acpi_ivrs_header *ivrs_block; > -unsigned long i; > unsigned long length = sizeof(struct acpi_table_ivrs); > -u8 checksum, *raw_table; > - > -/* validate checksum: sum of entire table == 0 */ > -checksum = 0; > -raw_table = (u8 *)table; > -for ( i = 0; i < table->length; i++ ) > -checksum += raw_table[i]; > -if ( checksum ) > -{ > -AMD_IOMMU_DEBUG("IVRS Error: Invalid Checksum %#x\n", checksum); > -return -ENODEV; > -} > > while ( table->length > (length + sizeof(*ivrs_block)) ) > { > @@ -1300,6 +1289,15 @@ get_supported_ivhd_type(struct acpi_tabl > { > size_t length = sizeof(struct acpi_table_ivrs); > const struct acpi_ivrs_header *ivrs_block, *blk = NULL; > +uint8_t checksum; > + > +/* Validate checksum: Sum of entire table == 0. */ > +checksum = acpi_tb_checksum(ACPI_CAST_PTR(uint8_t, table), > table->length); > +if ( checksum ) > +{ > +AMD_IOMMU_DEBUG("IVRS Error: Invalid Checksum %#x\n", checksum); I know you're just moving code, but this really needs to be a visible error. 
It's "I'm turning off the IOMMU because the ACPI table is bad", which is about as serious as errors come. ~Andrew
Re: [XEN RFC PATCH 26/40] xen/arm: Add boot and secondary CPU to NUMA system
On 26.08.2021 14:08, Wei Chen wrote: >> From: Jan Beulich >> Sent: 2021年8月26日 17:40 >> >> On 26.08.2021 10:49, Julien Grall wrote: >>> Right, but again, why do you want to solve the problem on Arm and not >>> x86? After all, NUMA is not architecture specific (in fact you move most >>> of the code in common). >>> > > I am not very familiar with x86, so when I was composing this patch series, > I always thought that if I could solve it inside Arm Arch, I would solve it > inside Arm Arch. That seems a bit conservative, and inappropriate on solving > this problem. > >>> In fact, the risk, is someone may read arch/x86 and doesn't realize the >>> CPU is not in the node until late on Arm. >>> >>> So I think we should call numa_add_cpu() around the same place on all >>> the architectures. >> >> FWIW: +1 > > I agree. As Jan in this discussion. How about following current x86's > numa_add_cpu behaviors in __start_xen, but add some code to revert > numa_add_cpu when cpu_up failed (both Arm and x86)? Sure - if we don't clean up properly on x86 on a failure path, I'm all for having that fixed. Jan
RE: Enabling hypervisor agnosticism for VirtIO backends
Hi Akashi, > -Original Message- > From: AKASHI Takahiro > Sent: 2021年8月26日 17:41 > To: Wei Chen > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends > > Hi Wei, > > On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote: > > On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote: > > > Hi Akashi, > > > > -Original Message- > > > > From: AKASHI Takahiro > > > > Sent: 2021年8月18日 13:39 > > > > To: Wei Chen > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends > > > > > > > > On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote: > > > > > Hi Akashi, > > > > > > -Original Message- > > > > > > From: AKASHI Takahiro > > > > > > Sent: 2021年8月17日 16:08 > > > > > > To: Wei Chen > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends > > > > > > > > > > > > Hi Wei, Oleksandr, > > > > > > > > > > > > On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote: > > > > > > > Hi All, > > > > > > > > > > > > > > Thanks to Stefano for linking my kvmtool for Xen proposal here. > > > > > > > This proposal is still being discussed in the Xen and KVM communities. > > > > > > > The main work is to decouple kvmtool from KVM and let > > > > > > > other hypervisors reuse the virtual device implementations. > > > > > > > > > > > > > > In this case, we need to introduce an intermediate hypervisor > > > > > > > layer for VMM abstraction, which is, I think, very close > > > > > > > to Stratos' virtio hypervisor agnosticism work. > > > > > > > > > > > > # My proposal[1] comes from my own idea and doesn't always represent > > > > > > # Linaro's view on this subject nor reflect Alex's concerns. Nevertheless, > > > > > > > > > > > > your idea and my proposal seem to share the same background. > > > > > > Both have a similar goal and currently start with, at first, Xen > > > > > > and are based on kvm-tool. (Actually, my work is derived from > > > > > > EPAM's virtio-disk, which is also based on kvm-tool.) > > > > > > > > > > > > In particular, the abstraction of hypervisor interfaces has the same > > > > > > set of interfaces (for your "struct vmm_impl" and my "RPC interfaces"). > > > > > > This is not coincidental, as we both share the same origin, as I said > > > > > > above. And so we will also share the same issues. 
One of them is a way > of > > > > > > "sharing/mapping FE's memory". There is some trade-off between > > > > > > the portability and the performance impact. > > > > > > So we can discuss the topic here in this ML, too. > > > > > > (See Alex's original email, too). > > > > > > > > > > > Yes, I agree. > > > > > > > > > > > On the other hand, my approach aims to create a "single-binary" > > > > solution > > > > > > in which the same binary of BE vm could run on any hypervisors. > > > > > > Somehow similar to your "proposal-#2" in [2], but in my solution, > all > > > > > > the hypervisor-specific code would be put into another entity > (VM), > > > > > > named "virtio-proxy" and the abstracted operations are served > via RPC. > > > > > > (In this sense, BE is hypervisor-agnostic but might have OS > > > > dependency.) > > > > > > But I know that we need discuss if this is a requirement even > > > > > > in Stratos project or not. (Maybe not) > > > > > > > > > > > > > > > > Sorry, I
Re: [PATCH v7 1/8] AMD/IOMMU: check / convert IVMD ranges for being / to be reserved
On 26/08/2021 08:23, Jan Beulich wrote: > While the specification doesn't say so, just like for VT-d's RMRRs no > good can come from these ranges being e.g. conventional RAM or entirely > unmarked and hence usable for placing e.g. PCI device BARs. Check > whether they are, and put in some limited effort to convert to reserved. > (More advanced logic can be added if actual problems are found with this > simplistic variant.) > > Signed-off-by: Jan Beulich > Reviewed-by: Paul Durrant > --- > v7: Re-base. > v5: New. > > --- a/xen/drivers/passthrough/amd/iommu_acpi.c > +++ b/xen/drivers/passthrough/amd/iommu_acpi.c > @@ -384,6 +384,38 @@ static int __init parse_ivmd_block(const > AMD_IOMMU_DEBUG("IVMD Block: type %#x phys %#lx len %#lx\n", > ivmd_block->header.type, start_addr, mem_length); > > +if ( !e820_all_mapped(base, limit + PAGE_SIZE, E820_RESERVED) ) > +{ > +paddr_t addr; > + > +AMD_IOMMU_DEBUG("IVMD: [%lx,%lx) is not (entirely) in reserved > memory\n", > +base, limit + PAGE_SIZE); > + > +for ( addr = base; addr <= limit; addr += PAGE_SIZE ) > +{ > +unsigned int type = page_get_ram_type(maddr_to_mfn(addr)); > + > +if ( type == RAM_TYPE_UNKNOWN ) > +{ > +if ( e820_add_range(&e820, addr, addr + PAGE_SIZE, > +E820_RESERVED) ) > +continue; > +AMD_IOMMU_DEBUG("IVMD Error: Page at %lx couldn't be > reserved\n", > +addr); > +return -EIO; > +} > + > +/* Types which won't be handed out are considered good enough. */ > +if ( !(type & (RAM_TYPE_RESERVED | RAM_TYPE_ACPI | > + RAM_TYPE_UNUSABLE)) ) > +continue; > + > +AMD_IOMMU_DEBUG("IVMD Error: Page at %lx can't be converted\n", > +addr); I think these print messages need to be more than just debug. The first one is a warning, whereas the final two are hard errors liable to impact the correct running of the system. Especially as you're putting them in to try and spot problem cases, they should be visible by default for when we inevitably get bug reports to xen-devel saying "something changed with passthrough in Xen 4.16". 
~Andrew > +return -EIO; > +} > +} > + > if ( ivmd_block->header.flags & ACPI_IVMD_EXCLUSION_RANGE ) > exclusion = true; > else if ( ivmd_block->header.flags & ACPI_IVMD_UNITY ) >
RE: [XEN RFC PATCH 26/40] xen/arm: Add boot and secondary CPU to NUMA system
Hi Jan, Julien, > -Original Message- > From: Jan Beulich > Sent: 2021年8月26日 17:40 > To: Julien Grall ; Wei Chen > Cc: Bertrand Marquis ; xen- > de...@lists.xenproject.org; sstabell...@kernel.org > Subject: Re: [XEN RFC PATCH 26/40] xen/arm: Add boot and secondary CPU to > NUMA system > > On 26.08.2021 10:49, Julien Grall wrote: > > On 26/08/2021 08:24, Wei Chen wrote: > >>> -Original Message- > >>> From: Julien Grall > >>> Sent: 2021年8月26日 0:58 > >>> On 11/08/2021 11:24, Wei Chen wrote: > --- a/xen/arch/arm/smpboot.c > +++ b/xen/arch/arm/smpboot.c > @@ -358,6 +358,12 @@ void start_secondary(void) > */ > smp_wmb(); > > +/* > + * If Xen is running on a NUMA off system, there will > + * be a node#0 at least. > + */ > +numa_add_cpu(cpuid); > + > >>> > >>> On x86, numa_add_cpu() will be called before the pCPU is brought up. I > >>> am not quite too sure why we are doing it differently here. Can you > >>> clarify it? > >> > >> Of course we can invoke numa_add_cpu before cpu_up as x86. But in my > tests, > >> I found when cpu bring up failed, this cpu still be add to NUMA. > Although > >> this does not affect the execution of the code (because CPU is offline), > >> But I don't think adding a offline CPU to NUMA makes sense. > > > > Right, but again, why do you want to solve the problem on Arm and not > > x86? After all, NUMA is not architecture specific (in fact you move most > > of the code in common). > > I am not very familiar with x86, so when I was composing this patch series, I always thought that if I could solve it inside Arm Arch, I would solve it inside Arm Arch. That seems a bit conservative, and inappropriate on solving this problem. > > In fact, the risk, is someone may read arch/x86 and doesn't realize the > > CPU is not in the node until late on Arm. > > > > So I think we should call numa_add_cpu() around the same place on all > > the architectures. > > FWIW: +1 I agree. As Jan in this discussion. 
How about following the current x86 numa_add_cpu() behaviour in __start_xen, but adding some code to revert numa_add_cpu() when cpu_up fails (on both Arm and x86)?
>
> Jan
Re: [PATCH 17/17] IOMMU/x86: drop pointless NULL checks
On 24/08/2021 15:27, Jan Beulich wrote:
> --- a/xen/drivers/passthrough/vtd/utils.c
> +++ b/xen/drivers/passthrough/vtd/utils.c
> @@ -106,11 +106,6 @@ void print_vtd_entries(struct vtd_iommu
>      }
>
>      root_entry = (struct root_entry *)map_vtd_domain_page(iommu->root_maddr);

Seeing as you're actually cleaning up mapping calls, drop this cast too?

Either way, Acked-by: Andrew Cooper

> -    if ( root_entry == NULL )
> -    {
> -        printk("root_entry == NULL\n");
> -        return;
> -    }
>
>      printk("root_entry[%02x] = %"PRIx64"\n", bus, root_entry[bus].val);
>      if ( !root_present(root_entry[bus]) )
>
Re: [PATCH 07/17] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
On 24/08/2021 15:21, Jan Beulich wrote:
> While already the case for PVH, there's no reason to treat PV
> differently here, though of course the addresses get taken from another
> source in this case. Except that, to match CPU side mappings, by default
> we permit r/o ones. This then also means we now deal consistently with
> IO-APICs whose MMIO is or is not covered by E820 reserved regions.
>
> Signed-off-by: Jan Beulich

Why do we give PV dom0 a mapping of the IO-APIC?

Having thought about it, it cannot possibly be usable.  IO-APICs use an indirect window, and Xen doesn't perform any write-emulation (that I can see), so the guest can't even read the data register and work out which register it represents.  It also can't do an atomic 64bit read across the index and data registers, as that is explicitly undefined behaviour in the IO-APIC spec.

On the other hand, we do have PHYSDEVOP_apic_{read,write} which, despite the name, is for the IO-APIC not the LAPIC, and I bet these hypercalls were introduced upon discovering that a read-only mapping of the IO-APIC is totally useless.

I think we can safely not expose the IO-APICs to PV dom0 at all, which simplifies the memory handling further.

~Andrew
RE: [XEN RFC PATCH 23/40] xen/arm: introduce a helper to parse device tree memory node
Hi Julien, > -Original Message- > From: Julien Grall > Sent: 2021年8月26日 16:22 > To: Wei Chen ; xen-devel@lists.xenproject.org; > sstabell...@kernel.org > Cc: Bertrand Marquis > Subject: Re: [XEN RFC PATCH 23/40] xen/arm: introduce a helper to parse > device tree memory node > > > > On 26/08/2021 07:35, Wei Chen wrote: > > Hi Julien, > > Hi Wei, > > >> -Original Message- > >> From: Julien Grall > >> Sent: 2021年8月25日 21:49 > >> To: Wei Chen ; xen-devel@lists.xenproject.org; > >> sstabell...@kernel.org; jbeul...@suse.com > >> Cc: Bertrand Marquis > >> Subject: Re: [XEN RFC PATCH 23/40] xen/arm: introduce a helper to parse > >> device tree memory node > >> > >> Hi Wei, > >> > >> On 11/08/2021 11:24, Wei Chen wrote: > >>> Memory blocks' NUMA ID information is stored in device tree's > >>> memory nodes as "numa-node-id". We need a new helper to parse > >>> and verify this ID from memory nodes. > >>> > >>> In order to support memory affinity in later use, the valid > >>> memory ranges and NUMA ID will be saved to tables. 
> >>> > >>> Signed-off-by: Wei Chen > >>> --- > >>>xen/arch/arm/numa_device_tree.c | 130 > > >>>1 file changed, 130 insertions(+) > >>> > >>> diff --git a/xen/arch/arm/numa_device_tree.c > >> b/xen/arch/arm/numa_device_tree.c > >>> index 37cc56acf3..bbe081dcd1 100644 > >>> --- a/xen/arch/arm/numa_device_tree.c > >>> +++ b/xen/arch/arm/numa_device_tree.c > >>> @@ -20,11 +20,13 @@ > >>>#include > >>>#include > >>>#include > >>> +#include > >>>#include > >>>#include > >>> > >>>s8 device_tree_numa = 0; > >>>static nodemask_t processor_nodes_parsed __initdata; > >>> +static nodemask_t memory_nodes_parsed __initdata; > >>> > >>>static int srat_disabled(void) > >>>{ > >>> @@ -55,6 +57,79 @@ static int __init > >> dtb_numa_processor_affinity_init(nodeid_t node) > >>>return 0; > >>>} > >>> > >>> +/* Callback for parsing of the memory regions affinity */ > >>> +static int __init dtb_numa_memory_affinity_init(nodeid_t node, > >>> +paddr_t start, paddr_t size) > >>> +{ > >> > >> The implementation of this function is quite similar to the ACPI > >> version. Can this be abstracted? > > > > In my draft, I had tried to merge ACPI and DTB versions in one > > function. I introduced a number of "if else" to distinguish ACPI > > from DTB, especially ACPI hotplug. The function seems very messy. > > Not enough benefits to make up for the mess, so I gave up. > > I think you can get away from distinguishing between ACPI and DT in > that helper: >* ma->flags & ACPI_SRAT_MEM_HOTPLUGGABLE could be replaced by an > argument indicating whether the region is hotpluggable (this would > always be false for DT) >* Access to memblk_hotplug can be stubbed (in the future we may want > to consider memory hotplug even on Arm). > > Do you still have the "if else" version? If so can you post it? > I only tried that while drafting; because I was not satisfied with the changes, I haven't saved them as a patch. I think your suggestions are worth trying again, I will do it in the next version.
> Cheers, > > -- > Julien Grall
Re: [PATCH 4/9] gnttab: drop GNTMAP_can_fail
On 26/08/2021 11:13, Jan Beulich wrote:
> There's neither documentation of what this flag is supposed to mean, nor
> any implementation. With this, don't even bother enclosing the #define-s
> in a __XEN_INTERFACE_VERSION__ conditional, but drop them altogether.
>
> Signed-off-by: Jan Beulich

It was introduced in 4d45702cf0398 along with GNTST_eagain, and the commit message hints that it is for xen-paging.  Furthermore, the first use of GNTST_eagain was in ecb35ecb79e0 for trying to map/copy a paged-out frame.

Therefore I think it is reasonable to conclude that there was a planned interaction between GNTMAP_can_fail and paging, which presumably would have been "don't pull this in from disk if it is paged out".

I think it is fine to drop, as no implementation has ever existed in Xen, but it would be helpful to have the history summarised in the commit message.

~Andrew
[linux-linus test] 164481: regressions - FAIL
flight 164481 linux-linus real [real] http://logs.test-lab.xenproject.org/osstest/logs/164481/ Regressions :-( Tests which did not succeed and are blocking, including tests which could not be run: test-amd64-i386-xl-xsm7 xen-install fail REGR. vs. 152332 test-amd64-i386-qemut-rhel6hvm-intel 7 xen-install fail REGR. vs. 152332 test-amd64-i386-xl-qemuu-dmrestrict-amd64-dmrestrict 7 xen-install fail REGR. vs. 152332 test-amd64-i386-xl-qemut-debianhvm-amd64 7 xen-install fail REGR. vs. 152332 test-amd64-i386-xl-qemuu-ws16-amd64 7 xen-install fail REGR. vs. 152332 test-amd64-i386-xl-qemuu-debianhvm-amd64-shadow 7 xen-install fail REGR. vs. 152332 test-amd64-i386-qemuu-rhel6hvm-intel 7 xen-install fail REGR. vs. 152332 test-amd64-i386-examine 6 xen-install fail REGR. vs. 152332 test-amd64-i386-libvirt 7 xen-install fail REGR. vs. 152332 test-amd64-i386-xl-qemuu-debianhvm-i386-xsm 7 xen-install fail REGR. vs. 152332 test-amd64-coresched-i386-xl 7 xen-install fail REGR. vs. 152332 test-amd64-i386-xl-qemuu-debianhvm-amd64 7 xen-install fail REGR. vs. 152332 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 7 xen-install fail REGR. vs. 152332 test-amd64-i386-xl7 xen-install fail REGR. vs. 152332 test-amd64-i386-qemut-rhel6hvm-amd 7 xen-installfail REGR. vs. 152332 test-amd64-i386-xl-qemut-ws16-amd64 7 xen-install fail REGR. vs. 152332 test-amd64-i386-qemuu-rhel6hvm-amd 7 xen-installfail REGR. vs. 152332 test-amd64-i386-pair 10 xen-install/src_host fail REGR. vs. 152332 test-amd64-i386-pair 11 xen-install/dst_host fail REGR. vs. 152332 test-amd64-i386-libvirt-xsm 7 xen-install fail REGR. vs. 152332 test-amd64-i386-xl-raw7 xen-install fail REGR. vs. 152332 test-amd64-i386-freebsd10-amd64 7 xen-install fail REGR. vs. 152332 test-amd64-i386-xl-pvshim 7 xen-install fail REGR. vs. 152332 test-amd64-i386-freebsd10-i386 7 xen-installfail REGR. vs. 152332 test-amd64-i386-xl-qemut-debianhvm-i386-xsm 7 xen-install fail REGR. vs. 152332 test-amd64-i386-xl-shadow 7 xen-install fail REGR. 
vs. 152332 test-amd64-i386-xl-qemut-win7-amd64 7 xen-install fail REGR. vs. 152332 test-amd64-i386-xl-qemuu-ovmf-amd64 7 xen-install fail REGR. vs. 152332 test-amd64-i386-xl-qemuu-win7-amd64 7 xen-install fail REGR. vs. 152332 test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm 7 xen-install fail REGR. vs. 152332 test-amd64-i386-libvirt-pair 10 xen-install/src_host fail REGR. vs. 152332 test-amd64-i386-libvirt-pair 11 xen-install/dst_host fail REGR. vs. 152332 test-arm64-arm64-xl-credit1 13 debian-fixup fail REGR. vs. 152332 test-arm64-arm64-xl 13 debian-fixup fail REGR. vs. 152332 test-arm64-arm64-xl-thunderx 13 debian-fixup fail REGR. vs. 152332 test-arm64-arm64-xl-credit2 13 debian-fixup fail REGR. vs. 152332 test-arm64-arm64-xl-xsm 13 debian-fixup fail REGR. vs. 152332 test-arm64-arm64-libvirt-xsm 13 debian-fixup fail REGR. vs. 152332 Regressions which are regarded as allowable (not blocking): test-amd64-amd64-xl-rtds 20 guest-localmigrate/x10 fail REGR. vs. 152332 Tests which did not succeed, but are not blocking: test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 152332 test-armhf-armhf-libvirt 16 saverestore-support-checkfail like 152332 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 152332 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 152332 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 152332 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 152332 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail like 152332 test-amd64-amd64-libvirt 15 migrate-support-checkfail never pass test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail never pass test-arm64-arm64-xl-seattle 15 migrate-support-checkfail never pass test-arm64-arm64-xl-seattle 16 saverestore-support-checkfail never pass test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass test-armhf-armhf-xl-arndale 15 migrate-support-checkfail never pass test-armhf-armhf-xl-arndale 16 
saverestore-support-checkfail never pass test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail never pass test-armhf-armhf-xl-rtds 15 migrate-support-checkfail never pass test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail never pass test-armhf-armhf-xl 15 migrate-support-checkfail never pass test-armhf-armhf-xl 16 saverestore-support-checkfail never
Re: [PATCH MINI-OS] gnttab: drop GNTMAP_can_fail
Jan Beulich, on Thu 26 Aug 2021 12:20:26 +0200, wrote:
> There's neither documentation of what this flag is supposed to mean, nor
> any implementation in the hypervisor.
>
> Signed-off-by: Jan Beulich

Reviewed-by: Samuel Thibault

> --- a/include/xen/grant_table.h
> +++ b/include/xen/grant_table.h
> @@ -627,9 +627,6 @@ DEFINE_XEN_GUEST_HANDLE(gnttab_cache_flu
>  #define _GNTMAP_contains_pte    (4)
>  #define GNTMAP_contains_pte     (1<<_GNTMAP_contains_pte)
>
> -#define _GNTMAP_can_fail    (5)
> -#define GNTMAP_can_fail (1<<_GNTMAP_can_fail)
> -
>  /*
>   * Bits to be placed in guest kernel available PTE bits (architecture
>   * dependent; only supported when XENFEAT_gnttab_map_avail_bits is set).
>
[PATCH XTF] gnttab: drop GNTMAP_can_fail
There's neither documentation of what this flag is supposed to mean, nor any implementation in the hypervisor.

Signed-off-by: Jan Beulich

--- a/include/xen/grant_table.h
+++ b/include/xen/grant_table.h
@@ -196,9 +196,6 @@ typedef union {
 #define _GNTMAP_contains_pte    4
 #define GNTMAP_contains_pte (1 << _GNTMAP_contains_pte)
 
-#define _GNTMAP_can_fail    5
-#define GNTMAP_can_fail (1 << _GNTMAP_can_fail)
-
 /*
  * Bits to be placed in guest kernel available PTE bits (architecture
  * dependent; only supported when XENFEAT_gnttab_map_avail_bits is set).
[PATCH MINI-OS] gnttab: drop GNTMAP_can_fail
There's neither documentation of what this flag is supposed to mean, nor any implementation in the hypervisor.

Signed-off-by: Jan Beulich

--- a/include/xen/grant_table.h
+++ b/include/xen/grant_table.h
@@ -627,9 +627,6 @@ DEFINE_XEN_GUEST_HANDLE(gnttab_cache_flu
 #define _GNTMAP_contains_pte    (4)
 #define GNTMAP_contains_pte (1<<_GNTMAP_contains_pte)
 
-#define _GNTMAP_can_fail    (5)
-#define GNTMAP_can_fail (1<<_GNTMAP_can_fail)
-
 /*
  * Bits to be placed in guest kernel available PTE bits (architecture
  * dependent; only supported when XENFEAT_gnttab_map_avail_bits is set).
[xen-unstable test] 164477: tolerable FAIL - PUSHED
flight 164477 xen-unstable real [real] http://logs.test-lab.xenproject.org/osstest/logs/164477/ Failures :-/ but no regressions. Tests which did not succeed, but are not blocking: test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 164405 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 164405 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 164405 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 164405 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 164405 test-armhf-armhf-libvirt 16 saverestore-support-checkfail like 164405 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 164405 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail like 164405 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 164405 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 164405 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 164405 test-amd64-i386-libvirt-xsm 15 migrate-support-checkfail never pass test-arm64-arm64-xl-seattle 15 migrate-support-checkfail never pass test-arm64-arm64-xl-seattle 16 saverestore-support-checkfail never pass test-amd64-amd64-libvirt 15 migrate-support-checkfail never pass test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail never pass test-amd64-i386-libvirt 15 migrate-support-checkfail never pass test-amd64-i386-xl-pvshim14 guest-start fail never pass test-arm64-arm64-xl 15 migrate-support-checkfail never pass test-arm64-arm64-xl-credit2 15 migrate-support-checkfail never pass test-arm64-arm64-xl 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-credit2 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-credit1 15 migrate-support-checkfail never pass test-arm64-arm64-xl-credit1 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail never pass test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail never pass test-arm64-arm64-libvirt-xsm 15 
migrate-support-checkfail never pass test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail never pass test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass test-armhf-armhf-xl-arndale 15 migrate-support-checkfail never pass test-armhf-armhf-xl-arndale 16 saverestore-support-checkfail never pass test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail never pass test-armhf-armhf-xl 15 migrate-support-checkfail never pass test-armhf-armhf-xl 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-credit2 15 migrate-support-checkfail never pass test-armhf-armhf-xl-credit2 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-rtds 15 migrate-support-checkfail never pass test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-credit1 15 migrate-support-checkfail never pass test-armhf-armhf-xl-credit1 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail never pass test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-xsm 15 migrate-support-checkfail never pass test-arm64-arm64-xl-xsm 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-vhd 14 migrate-support-checkfail never pass test-armhf-armhf-xl-vhd 15 saverestore-support-checkfail never pass test-armhf-armhf-libvirt 15 migrate-support-checkfail never pass test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail never pass test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass version targeted for testing: xen a931e8e64af07bd333a31f3b71a3f8f3e7910857 baseline version: xen 93713f444b3f29d6848527506db69cf78976b32d Last test of basis 164405 2021-08-23 08:02:08 Z3 days Testing same since 164477 2021-08-25 09:09:48 Z1 days1 attempts People who 
touched revisions under test: Julien Grall Michal Orzel Oleksandr Andrushchenko Oleksandr Tyshchenko jobs: build-amd64-xsm pass build-arm64-xsm pass build-i386-xsm
[PATCH 9/9] gnttab: don't silently truncate GFNs in compat setup-table handling
Returning back truncated frame numbers is unhelpful: Quite likely they're not owned by the domain (if it's PV), or we may misguide the guest into writing grant entries into a page that it actually uses for other purposes.

Signed-off-by: Jan Beulich
---
RFC: Arguably in the 32-bit PV case it may be necessary to instead put in place an explicit address restriction when allocating ->shared_raw[N]. This is currently implicit by alloc_xenheap_page() only returning memory covered by the direct-map.

--- a/xen/common/compat/grant_table.c
+++ b/xen/common/compat/grant_table.c
@@ -175,8 +175,15 @@ int compat_grant_table_op(unsigned int c
               i < (_s_)->nr_frames; ++i )                        \
         {                                                        \
             compat_pfn_t frame = (_s_)->frame_list.p[i];         \
-            if ( __copy_to_compat_offset((_d_)->frame_list,      \
-                                         i, &frame, 1) )         \
+            if ( frame != (_s_)->frame_list.p[i] )               \
+            {                                                    \
+                if ( VALID_M2P((_s_)->frame_list.p[i]) )         \
+                    (_s_)->status = GNTST_address_too_big;       \
+                else                                             \
+                    frame |= 0x8000U;                            \
+            }                                                    \
+            else if ( __copy_to_compat_offset((_d_)->frame_list, \
+                                              i, &frame, 1) )    \
                 (_s_)->status = GNTST_bad_virt_addr;             \
         }                                                        \
     } while (0)
[PATCH 8/9] gnttab: bail from GFN-storing loops early in case of error
The contents of the output arrays are undefined in both cases anyway when the operation itself gets marked as failed. There's no value in trying to continue after a guest memory access failure.

Signed-off-by: Jan Beulich
---
There's also a curious difference between the two sub-ops wrt the use of SHARED_M2P().

--- a/xen/common/compat/grant_table.c
+++ b/xen/common/compat/grant_table.c
@@ -170,17 +170,14 @@ int compat_grant_table_op(unsigned int c
     if ( rc == 0 )
     {
 #define XLAT_gnttab_setup_table_HNDL_frame_list(_d_, _s_)        \
-    do                                                           \
-    {                                                            \
-        if ( (_s_)->status == GNTST_okay )                       \
+    do {                                                         \
+        for ( i = 0; (_s_)->status == GNTST_okay &&              \
+                     i < (_s_)->nr_frames; ++i )                 \
         {                                                        \
-            for ( i = 0; i < (_s_)->nr_frames; ++i )             \
-            {                                                    \
-                unsigned int frame = (_s_)->frame_list.p[i];     \
-                if ( __copy_to_compat_offset((_d_)->frame_list,  \
-                                             i, &frame, 1) )     \
-                    (_s_)->status = GNTST_bad_virt_addr;         \
-            }                                                    \
+            compat_pfn_t frame = (_s_)->frame_list.p[i];         \
+            if ( __copy_to_compat_offset((_d_)->frame_list,      \
+                                         i, &frame, 1) )         \
+                (_s_)->status = GNTST_bad_virt_addr;             \
         }                                                        \
     } while (0)
         XLAT_gnttab_setup_table(&cmp.setup, nat.setup);
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -2103,7 +2103,10 @@ gnttab_setup_table(
         BUG_ON(SHARED_M2P(gmfn));
 
         if ( __copy_to_guest_offset(op.frame_list, i, &gmfn, 1) )
+        {
             op.status = GNTST_bad_virt_addr;
+            break;
+        }
     }
 
  unlock:
@@ -3289,17 +3292,15 @@ gnttab_get_status_frames(XEN_GUEST_HANDL
                  "status frames, but has only %u\n",
                  d->domain_id, op.nr_frames, nr_status_frames(gt));
         op.status = GNTST_general_error;
-        goto unlock;
     }
 
-    for ( i = 0; i < op.nr_frames; i++ )
+    for ( i = 0; op.status == GNTST_okay && i < op.nr_frames; i++ )
     {
         gmfn = gfn_x(gnttab_status_gfn(d, gt, i));
         if ( __copy_to_guest_offset(op.frame_list, i, &gmfn, 1) )
             op.status = GNTST_bad_virt_addr;
     }
 
- unlock:
     grant_read_unlock(gt);
  out2:
     rcu_unlock_domain(d);
[PATCH 7/9] gnttab: no need to translate handle for gnttab_get_status_frames()
Unlike for GNTTABOP_setup_table native and compat frame lists are arrays of the same type (uint64_t). Hence there's no need to translate the frame values. This then also renders unnecessary the limit_max parameter of gnttab_get_status_frames(). Signed-off-by: Jan Beulich --- a/xen/common/compat/grant_table.c +++ b/xen/common/compat/grant_table.c @@ -271,10 +271,7 @@ int compat_grant_table_op(unsigned int c } break; -case GNTTABOP_get_status_frames: { -unsigned int max_frame_list_size_in_pages = -(COMPAT_ARG_XLAT_SIZE - sizeof(*nat.get_status)) / -sizeof(*nat.get_status->frame_list.p); +case GNTTABOP_get_status_frames: if ( count != 1) { rc = -EINVAL; @@ -289,38 +286,25 @@ int compat_grant_table_op(unsigned int c } #define XLAT_gnttab_get_status_frames_HNDL_frame_list(_d_, _s_) \ -set_xen_guest_handle((_d_)->frame_list, (uint64_t *)(nat.get_status + 1)) +guest_from_compat_handle((_d_)->frame_list, (_s_)->frame_list) XLAT_gnttab_get_status_frames(nat.get_status, _status); #undef XLAT_gnttab_get_status_frames_HNDL_frame_list rc = gnttab_get_status_frames( -guest_handle_cast(nat.uop, gnttab_get_status_frames_t), -count, max_frame_list_size_in_pages); +guest_handle_cast(nat.uop, gnttab_get_status_frames_t), count); if ( rc >= 0 ) { -#define XLAT_gnttab_get_status_frames_HNDL_frame_list(_d_, _s_) \ -do \ -{ \ -if ( (_s_)->status == GNTST_okay ) \ -{ \ -for ( i = 0; i < (_s_)->nr_frames; ++i ) \ -{ \ -uint64_t frame = (_s_)->frame_list.p[i]; \ -if ( __copy_to_compat_offset((_d_)->frame_list, \ - i, , 1) ) \ -(_s_)->status = GNTST_bad_virt_addr; \ -} \ -} \ -} while (0) -XLAT_gnttab_get_status_frames(_status, nat.get_status); -#undef XLAT_gnttab_get_status_frames_HNDL_frame_list -if ( unlikely(__copy_to_guest(cmp_uop, _status, 1)) ) +XEN_GUEST_HANDLE_PARAM(gnttab_get_status_frames_compat_t) get = +guest_handle_cast(cmp_uop, + gnttab_get_status_frames_compat_t); + +if ( unlikely(__copy_field_to_guest(get, nat.get_status, +status)) ) rc = -EFAULT; else i = 1; } break; -} 
default: domain_crash(current->domain); --- a/xen/common/grant_table.c +++ b/xen/common/grant_table.c @@ -3242,7 +3242,7 @@ gnttab_set_version(XEN_GUEST_HANDLE_PARA static long gnttab_get_status_frames(XEN_GUEST_HANDLE_PARAM(gnttab_get_status_frames_t) uop, - unsigned int count, unsigned int limit_max) + unsigned int count) { gnttab_get_status_frames_t op; struct domain *d; @@ -3292,15 +3292,6 @@ gnttab_get_status_frames(XEN_GUEST_HANDL goto unlock; } -if ( unlikely(limit_max < op.nr_frames) ) -{ -gdprintk(XENLOG_WARNING, - "nr_status_frames for %pd is too large (%u,%u)\n", - d, op.nr_frames, limit_max); -op.status = GNTST_general_error; -goto unlock; -} - for ( i = 0; i < op.nr_frames; i++ ) { gmfn = gfn_x(gnttab_status_gfn(d, gt, i)); @@ -3664,8 +3655,7 @@ do_grant_table_op( case GNTTABOP_get_status_frames: rc = gnttab_get_status_frames( -guest_handle_cast(uop, gnttab_get_status_frames_t), count, - UINT_MAX); +guest_handle_cast(uop, gnttab_get_status_frames_t), count); break; case GNTTABOP_get_version:
[PATCH 6/9] gnttab: check handle early in gnttab_get_status_frames()
Like done in gnttab_setup_table(), check the handle once early in the function and use the lighter-weight (for PV) copying function in the loop.

Signed-off-by: Jan Beulich

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -3261,6 +3261,9 @@ gnttab_get_status_frames(XEN_GUEST_HANDL
         return -EFAULT;
     }
 
+    if ( !guest_handle_okay(op.frame_list, op.nr_frames) )
+        return -EFAULT;
+
     d = rcu_lock_domain_by_any_id(op.dom);
     if ( d == NULL )
     {
@@ -3301,7 +3304,7 @@ gnttab_get_status_frames(XEN_GUEST_HANDL
     for ( i = 0; i < op.nr_frames; i++ )
     {
         gmfn = gfn_x(gnttab_status_gfn(d, gt, i));
-        if ( copy_to_guest_offset(op.frame_list, i, &gmfn, 1) )
+        if ( __copy_to_guest_offset(op.frame_list, i, &gmfn, 1) )
             op.status = GNTST_bad_virt_addr;
     }
[PATCH 5/9] gnttab: defer allocation of status frame tracking array
This array can be large when many grant frames are permitted; avoid allocating it when it's not going to be used anyway, by doing this only in gnttab_populate_status_frames(). While the delaying of the respective memory allocation adds possible reasons for failure of the respective enclosing operations, there are other memory allocations there already, so callers can't expect these operations to always succeed anyway. As to the re-ordering at the end of gnttab_unpopulate_status_frames(), this is merely to represent intended order of actions (shrink array bound, then free higher array entries). Signed-off-by: Jan Beulich Reviewed-by: Julien Grall --- v1: Fold into series. [standalone history] v4: Add a comment. Add a few blank lines. Extend description. v3: Drop smp_wmb(). Re-base. v2: Defer allocation to when a domain actually switches to the v2 grant API. --- a/xen/common/grant_table.c +++ b/xen/common/grant_table.c @@ -1774,6 +1774,17 @@ gnttab_populate_status_frames(struct dom /* Make sure, prior version checks are architectural visible */ block_speculation(); +if ( gt->status == ZERO_BLOCK_PTR ) +{ +gt->status = xzalloc_array(grant_status_t *, + grant_to_status_frames(gt->max_grant_frames)); +if ( !gt->status ) +{ +gt->status = ZERO_BLOCK_PTR; +return -ENOMEM; +} +} + for ( i = nr_status_frames(gt); i < req_status_frames; i++ ) { if ( (gt->status[i] = alloc_xenheap_page()) == NULL ) @@ -1794,18 +1805,25 @@ status_alloc_failed: free_xenheap_page(gt->status[i]); gt->status[i] = NULL; } + +if ( !nr_status_frames(gt) ) +{ +xfree(gt->status); +gt->status = ZERO_BLOCK_PTR; +} + return -ENOMEM; } static int gnttab_unpopulate_status_frames(struct domain *d, struct grant_table *gt) { -unsigned int i; +unsigned int i, n = nr_status_frames(gt); /* Make sure, prior version checks are architectural visible */ block_speculation(); -for ( i = 0; i < nr_status_frames(gt); i++ ) +for ( i = 0; i < n; i++ ) { struct page_info *pg = virt_to_page(gt->status[i]); gfn_t gfn = 
gnttab_get_frame_gfn(gt, true, i); @@ -1860,12 +1878,11 @@ gnttab_unpopulate_status_frames(struct d page_set_owner(pg, NULL); } -for ( i = 0; i < nr_status_frames(gt); i++ ) -{ -free_xenheap_page(gt->status[i]); -gt->status[i] = NULL; -} gt->nr_status_frames = 0; +for ( i = 0; i < n; i++ ) +free_xenheap_page(gt->status[i]); +xfree(gt->status); +gt->status = ZERO_BLOCK_PTR; return 0; } @@ -1988,11 +2005,11 @@ int grant_table_init(struct domain *d, i if ( gt->shared_raw == NULL ) goto out; -/* Status pages for grant table - for version 2 */ -gt->status = xzalloc_array(grant_status_t *, - grant_to_status_frames(gt->max_grant_frames)); -if ( gt->status == NULL ) -goto out; +/* + * Status page tracking array for v2 gets allocated on demand. But don't + * leave a NULL pointer there. + */ +gt->status = ZERO_BLOCK_PTR; grant_write_lock(gt); @@ -4103,11 +4120,13 @@ int gnttab_acquire_resource( if ( gt->gt_version != 2 ) break; +/* This may change gt->status, so has to happen before setting vaddrs. */ +rc = gnttab_get_status_frame_mfn(d, final_frame, ); + /* Check that void ** is a suitable representation for gt->status. */ BUILD_BUG_ON(!__builtin_types_compatible_p( typeof(gt->status), grant_status_t **)); vaddrs = (void **)gt->status; -rc = gnttab_get_status_frame_mfn(d, final_frame, ); break; }
[PATCH 4/9] gnttab: drop GNTMAP_can_fail
There's neither documentation of what this flag is supposed to mean, nor any implementation. With this, don't even bother enclosing the #define-s in a __XEN_INTERFACE_VERSION__ conditional, but drop them altogether.

Signed-off-by: Jan Beulich

--- a/xen/include/public/grant_table.h
+++ b/xen/include/public/grant_table.h
@@ -628,9 +628,6 @@ DEFINE_XEN_GUEST_HANDLE(gnttab_cache_flu
 #define _GNTMAP_contains_pte    (4)
 #define GNTMAP_contains_pte (1<<_GNTMAP_contains_pte)
 
-#define _GNTMAP_can_fail    (5)
-#define GNTMAP_can_fail (1<<_GNTMAP_can_fail)
-
 /*
  * Bits to be placed in guest kernel available PTE bits (architecture
  * dependent; only supported when XENFEAT_gnttab_map_avail_bits is set).
[PATCH 3/9] gnttab: fold recurring is_iomem_page()
In all cases call the function just once instead of up to four times, at the same time avoiding to store a dangling pointer in a local variable.

Signed-off-by: Jan Beulich

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -1587,11 +1587,11 @@ unmap_common_complete(struct gnttab_unma
     else
         status = &status_entry(rgt, op->ref);
 
-    pg = mfn_to_page(op->mfn);
+    pg = !is_iomem_page(act->mfn) ? mfn_to_page(op->mfn) : NULL;
 
     if ( op->done & GNTMAP_device_map )
     {
-        if ( !is_iomem_page(act->mfn) )
+        if ( pg )
         {
             if ( op->done & GNTMAP_readonly )
                 put_page(pg);
@@ -1608,7 +1608,7 @@ unmap_common_complete(struct gnttab_unma
 
     if ( op->done & GNTMAP_host_map )
     {
-        if ( !is_iomem_page(op->mfn) )
+        if ( pg )
         {
             if ( gnttab_host_mapping_get_page_type(op->done & GNTMAP_readonly,
                                                    ld, rd) )
@@ -3778,7 +3778,7 @@ int gnttab_release_mappings(struct domai
         else
             status = &status_entry(rgt, ref);
 
-        pg = mfn_to_page(act->mfn);
+        pg = !is_iomem_page(act->mfn) ? mfn_to_page(act->mfn) : NULL;
 
         if ( map->flags & GNTMAP_readonly )
         {
@@ -3786,7 +3786,7 @@ int gnttab_release_mappings(struct domai
             {
                 BUG_ON(!(act->pin & GNTPIN_devr_mask));
                 act->pin -= GNTPIN_devr_inc;
-                if ( !is_iomem_page(act->mfn) )
+                if ( pg )
                     put_page(pg);
             }
 
@@ -3794,8 +3794,7 @@ int gnttab_release_mappings(struct domai
             {
                 BUG_ON(!(act->pin & GNTPIN_hstr_mask));
                 act->pin -= GNTPIN_hstr_inc;
-                if ( gnttab_release_host_mappings(d) &&
-                     !is_iomem_page(act->mfn) )
+                if ( pg && gnttab_release_host_mappings(d) )
                     put_page(pg);
             }
         }
@@ -3805,7 +3804,7 @@ int gnttab_release_mappings(struct domai
             {
                 BUG_ON(!(act->pin & GNTPIN_devw_mask));
                 act->pin -= GNTPIN_devw_inc;
-                if ( !is_iomem_page(act->mfn) )
+                if ( pg )
                     put_page_and_type(pg);
             }
 
@@ -3813,8 +3812,7 @@ int gnttab_release_mappings(struct domai
             {
                 BUG_ON(!(act->pin & GNTPIN_hstw_mask));
                 act->pin -= GNTPIN_hstw_inc;
-                if ( gnttab_release_host_mappings(d) &&
-                     !is_iomem_page(act->mfn) )
+                if ( pg && gnttab_release_host_mappings(d) )
                 {
                     if ( gnttab_host_mapping_get_page_type(false, d, rd) )
                         put_page_type(pg);
[PATCH 2/9] gnttab: drop a redundant expression from gnttab_release_mappings()
This gnttab_host_mapping_get_page_type() invocation sits in the "else" path of a conditional controlled by "map->flags & GNTMAP_readonly".

Signed-off-by: Jan Beulich

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -3816,9 +3816,7 @@ int gnttab_release_mappings(struct domai
             if ( gnttab_release_host_mappings(d) &&
                  !is_iomem_page(act->mfn) )
             {
-                if ( gnttab_host_mapping_get_page_type((map->flags &
-                                                        GNTMAP_readonly),
-                                                       d, rd) )
+                if ( gnttab_host_mapping_get_page_type(false, d, rd) )
                     put_page_type(pg);
                 put_page(pg);
             }
[PATCH 1/9] gnttab: defer allocation of maptrack frames table
By default all guests are permitted to have up to 1024 maptrack frames, which on 64-bit means an 8k frame table. Yet except for driver domains, guests normally don't make use of grant mappings. Defer allocating the table until a maptrack handle is first requested.

Signed-off-by: Jan Beulich
---
I continue to be unconvinced that it is a good idea to allow all DomU-s 1024 maptrack frames by default. While I'm still of the opinion that a hypervisor-enforced upper bound is okay, I question this upper bound also getting used as the default value - this is perhaps okay for Dom0, but not elsewhere.

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -633,6 +633,34 @@ get_maptrack_handle(
     if ( likely(handle != INVALID_MAPTRACK_HANDLE) )
         return handle;
 
+    if ( unlikely(!read_atomic(&lgt->maptrack)) )
+    {
+        struct grant_mapping **maptrack = NULL;
+
+        if ( lgt->max_maptrack_frames )
+            maptrack = vzalloc(lgt->max_maptrack_frames * sizeof(*maptrack));
+
+        spin_lock(&lgt->maptrack_lock);
+
+        if ( !lgt->maptrack )
+        {
+            if ( !maptrack )
+            {
+                spin_unlock(&lgt->maptrack_lock);
+                return INVALID_MAPTRACK_HANDLE;
+            }
+
+            write_atomic(&lgt->maptrack, maptrack);
+            maptrack = NULL;
+
+            radix_tree_init(&lgt->maptrack_tree);
+        }
+
+        spin_unlock(&lgt->maptrack_lock);
+
+        vfree(maptrack);
+    }
+
     spin_lock(&lgt->maptrack_lock);
 
     /*
@@ -1955,16 +1983,6 @@ int grant_table_init(struct domain *d, i
     if ( gt->active == NULL )
         goto out;
 
-    /* Tracking of mapped foreign frames table */
-    if ( gt->max_maptrack_frames )
-    {
-        gt->maptrack = vzalloc(gt->max_maptrack_frames * sizeof(*gt->maptrack));
-        if ( gt->maptrack == NULL )
-            goto out;
-
-        radix_tree_init(&gt->maptrack_tree);
-    }
-
     /* Shared grant table. */
     gt->shared_raw = xzalloc_array(void *, gt->max_grant_frames);
     if ( gt->shared_raw == NULL )
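The get_maptrack_handle() change above uses a classic optimistic-allocation pattern: allocate outside the lock, install under the lock only if nobody else has, and discard the allocation if the race was lost. A minimal userspace sketch of the same pattern (a test-and-set spinlock and GCC atomics stand in for Xen's spinlocks and read_atomic()/write_atomic(); all names are illustrative):

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal spinlock standing in for Xen's maptrack_lock. */
static volatile int lock_word;
static void lock(void)   { while (__atomic_test_and_set(&lock_word, __ATOMIC_ACQUIRE)) ; }
static void unlock(void) { __atomic_clear(&lock_word, __ATOMIC_RELEASE); }

static void *table;   /* lazily allocated; read without the lock held */

/*
 * Return the table, allocating it on first use.  Mirrors the patch:
 * allocate optimistically outside the lock, install under the lock only
 * if still unset, and free the allocation if another caller won the race.
 */
static void *table_get(size_t bytes)
{
    void *t = __atomic_load_n(&table, __ATOMIC_ACQUIRE);
    void *fresh;

    if (t)
        return t;                      /* fast path: already installed */

    fresh = calloc(1, bytes);          /* may be NULL on allocation failure */

    lock();
    if (!table) {
        if (!fresh) {                  /* nobody installed it, and we have no memory */
            unlock();
            return NULL;
        }
        __atomic_store_n(&table, fresh, __ATOMIC_RELEASE);
        fresh = NULL;                  /* ownership transferred to 'table' */
    }
    t = table;
    unlock();

    free(fresh);                       /* non-NULL only if we lost the race */
    return t;
}
```

Doing the (potentially sleeping or slow) allocation outside the lock keeps the critical section to a pointer check and a store, exactly the property the patch wants for the maptrack frame table.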
[PATCH 0/9] gnttab: further work from XSA-380 / -382 context
The first four patches can be attributed to the former, the last four patches to the latter. The middle patch had been submitted standalone before and has a suitable Reviewed-by tag, but also has an objection from Andrew pending, which unfortunately has led to that patch now being stuck. Short of Andrew being willing to settle the disagreement (which is more with Julien than with me, although I'm on Julien's side), I have no idea what to do here. There's probably not much interrelation between the patches, so they can perhaps go in in almost any order.

1: defer allocation of maptrack frames table
2: drop a redundant expression from gnttab_release_mappings()
3: fold recurring is_iomem_page()
4: drop GNTMAP_can_fail
5: defer allocation of status frame tracking array
6: check handle early in gnttab_get_status_frames()
7: no need to translate handle for gnttab_get_status_frames()
8: bail from GFN-storing loops early in case of error
9: don't silently truncate GFNs in compat setup-table handling

Jan
Re: [PATCH v2 08/10] xsm: remove xsm_default_t from hook definitions
On 06.08.2021 23:41, Daniel P. Smith wrote: > While not all of the points of contention nor all of my concerns are > addressed, I would like to hope that v3 is seen as an attempt at > compromise, that those compromises are acceptable, and that I can begin to > bring the next patch set forward. Thank you, and looking forward to > responses. Having gone through the series I've been happy to see the adjustments that have been made. There are still further requests I have spelled out, but I think (hope) those aren't as controversial anymore. Jan
Re: Enabling hypervisor agnosticism for VirtIO backends
Hi Wei, On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote: > On Wed, Aug 18, 2021 at 08:35:51AM +, Wei Chen wrote: > > Hi Akashi, > > > > > -Original Message- > > > From: AKASHI Takahiro > > > Sent: August 18, 2021 13:39 > > > To: Wei Chen > > > Cc: Oleksandr Tyshchenko ; Stefano Stabellini > > > ; Alex Bennée ; Stratos > > > Mailing List ; virtio-dev@lists.oasis-open.org; Arnd Bergmann ; Viresh Kumar > > > ; Stefano Stabellini > > > ; stefa...@redhat.com; Jan Kiszka > > > ; Carl van Schaik ; > > > prat...@quicinc.com; Srivatsa Vaddagiri ; Jean-Philippe Brucker ; Mathieu Poirier > > > ; Oleksandr Tyshchenko > > > ; Bertrand Marquis > > > ; Artem Mygaiev ; Julien > > > Grall ; Juergen Gross ; Paul Durrant > > > ; Xen Devel > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends > > > > > > On Tue, Aug 17, 2021 at 08:39:09AM +, Wei Chen wrote: > > > > Hi Akashi, > > > > > > > > > -Original Message- > > > > > From: AKASHI Takahiro > > > > > Sent: August 17, 2021 16:08 > > > > > To: Wei Chen > > > > > Cc: Oleksandr Tyshchenko ; Stefano Stabellini > > > > > ; Alex Bennée ; > > > Stratos > > > > > Mailing List ; virtio-dev@lists.oasis-open.org; Arnd Bergmann ; Viresh Kumar > > > > > ; Stefano Stabellini > > > > > ; stefa...@redhat.com; Jan Kiszka > > > > > ; Carl van Schaik ; > > > > > prat...@quicinc.com; Srivatsa Vaddagiri ; Jean-Philippe Brucker ; Mathieu Poirier > > > > > ; Oleksandr Tyshchenko > > > > > ; Bertrand Marquis > > > > > ; Artem Mygaiev ; > > > Julien > > > > > Grall ; Juergen Gross ; Paul Durrant > > > > > ; Xen Devel > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends > > > > > > > > > > Hi Wei, Oleksandr, > > > > > > > > > > On Mon, Aug 16, 2021 at 10:04:03AM +, Wei Chen wrote: > > > > > > Hi All, > > > > > > > > > > > > Thanks to Stefano for linking my kvmtool-for-Xen proposal here. > > > > > > This proposal is still being discussed in the Xen and KVM communities.
> > > > > > The main work is to decouple the kvmtool from KVM and make > > > > > > other hypervisors can reuse the virtual device implementations. > > > > > > > > > > > > In this case, we need to introduce an intermediate hypervisor > > > > > > layer for VMM abstraction, Which is, I think it's very close > > > > > > to stratos' virtio hypervisor agnosticism work. > > > > > > > > > > # My proposal[1] comes from my own idea and doesn't always represent > > > > > # Linaro's view on this subject nor reflect Alex's concerns. > > > Nevertheless, > > > > > > > > > > Your idea and my proposal seem to share the same background. > > > > > Both have the similar goal and currently start with, at first, Xen > > > > > and are based on kvm-tool. (Actually, my work is derived from > > > > > EPAM's virtio-disk, which is also based on kvm-tool.) > > > > > > > > > > In particular, the abstraction of hypervisor interfaces has a same > > > > > set of interfaces (for your "struct vmm_impl" and my "RPC > > > > > interfaces"). > > > > > This is not co-incident as we both share the same origin as I said > > > above. > > > > > And so we will also share the same issues. One of them is a way of > > > > > "sharing/mapping FE's memory". There is some trade-off between > > > > > the portability and the performance impact. > > > > > So we can discuss the topic here in this ML, too. > > > > > (See Alex's original email, too). > > > > > > > > > Yes, I agree. > > > > > > > > > On the other hand, my approach aims to create a "single-binary" > > > solution > > > > > in which the same binary of BE vm could run on any hypervisors. > > > > > Somehow similar to your "proposal-#2" in [2], but in my solution, all > > > > > the hypervisor-specific code would be put into another entity (VM), > > > > > named "virtio-proxy" and the abstracted operations are served via RPC. > > > > > (In this sense, BE is hypervisor-agnostic but might have OS > > > dependency.) 
> > > > > But I know that we need discuss if this is a requirement even > > > > > in Stratos project or not. (Maybe not) > > > > > > > > > > > > > Sorry, I haven't had time to finish reading your virtio-proxy completely > > > > (I will do it ASAP). But from your description, it seems we need a > > > > 3rd VM between FE and BE? My concern is that, if my assumption is right, > > > > will it increase the latency in data transport path? Even if we're > > > > using some lightweight guest like RTOS or Unikernel, > > > > > > Yes, you're right. But I'm afraid that it is a matter of degree. > > > As far as we execute 'mapping' operations at every fetch of payload, > > > we will see latency issue (even in your case) and if we have some solution > > > for it, we won't see it neither in my proposal :) > > > > > > > Oleksandr has sent a proposal to Xen mailing list to reduce this kind > > of "mapping/unmapping" operations. So the latency caused by this behavior > > on Xen may eventually be eliminated, and Linux-KVM doesn't
Re: [XEN RFC PATCH 26/40] xen/arm: Add boot and secondary CPU to NUMA system
On 26.08.2021 10:49, Julien Grall wrote: > On 26/08/2021 08:24, Wei Chen wrote: >>> -Original Message- >>> From: Julien Grall >>> Sent: August 26, 2021 0:58 >>> On 11/08/2021 11:24, Wei Chen wrote: --- a/xen/arch/arm/smpboot.c +++ b/xen/arch/arm/smpboot.c @@ -358,6 +358,12 @@ void start_secondary(void) */ smp_wmb(); +/* + * If Xen is running on a NUMA off system, there will + * be a node#0 at least. + */ +numa_add_cpu(cpuid); + >>> >>> On x86, numa_add_cpu() will be called before the pCPU is brought up. I >>> am not quite sure why we are doing it differently here. Can you >>> clarify it? >> >> Of course we can invoke numa_add_cpu() before cpu_up() as on x86. But in my tests, >> I found that when CPU bring-up failed, the CPU still got added to the NUMA node. Although >> this does not affect the execution of the code (because the CPU is offline), >> I don't think adding an offline CPU to NUMA makes sense. > > Right, but again, why do you want to solve the problem on Arm and not > x86? After all, NUMA is not architecture-specific (in fact you move most > of the code into common code). > > In fact, the risk is that someone may read arch/x86 and not realize that on Arm the > CPU is not in the node until later. > > So I think we should call numa_add_cpu() around the same place on all > the architectures. FWIW: +1 Jan
Re: [PATCH v3 7/7] xsm: removing facade that XSM can be enabled/disabled
On 05.08.2021 16:06, Daniel P. Smith wrote: > The XSM facilities are always in use by Xen with the facade of being able to > turn XSM on and off. This option is in fact about allowing the selection of > which policies are available and which are used at runtime. To provide this > facade a complicated serious of #ifdef's are used to selective include Nit: It took me a moment to realize that the sentence reads oddly because you likely mean "series", not "serious". > different headers or portions of headers. This series of #ifdef gyrations > switches between two different versions of the XSM hook interfaces and their > respective backing implementation. All of this is done to provide a minimal > size/performance optimization for when alternative policies are disabled. > > To unwind the #ifdef gyrations a series of changes were necessary, > * replace CONFIG_XSM with XSM_CONFIGURABLE to allow visibility of > selecting alternate XSM policy modules to those that require it > * adjusted CONFIG_XSM_SILO, CONFIG_XSM_FLASK, and the default module > selection to sensible defaults > * collapsed the "dummy/defualt" XSM interface and implementation with the > "multiple policy" interface to provide a single inlined implementation > that attempts to use a registered hook and falls back to the check from > the dummy implementation > * the collapse to a single interface broke code relying on the alternate > interface, specifically SILO, this was reworked to remove the > indirection/abstraction making SILO explicit in its access control > decisions > * with the change of the XSM hooks to fall back to enforcing the dummy > policy, it is no longer necessary to fill NULL entries in the struct > xsm_ops returned by an XSM module's init It would be nice if some of this could be split. Is this really close to impossible? > --- a/xen/common/Kconfig > +++ b/xen/common/Kconfig > @@ -200,23 +200,15 @@ config XENOPROF > > If unsure, say Y. 
> > -config XSM > - bool "Xen Security Modules support" > - default ARM > - ---help--- > - Enables the security framework known as Xen Security Modules which > - allows administrators fine-grained control over a Xen domain and > - its capabilities by defining permissible interactions between domains, > - the hypervisor itself, and related resources such as memory and > - devices. > - > - If unsure, say N. > +config XSM_CONFIGURABLE > +bool "Enable Configuring Xen Security Modules" Is there a reason to change not only the prompt, but also the name of the Kconfig setting? This alone is the reason for some otherwise unnecessary code churn. Also please correct indentation here. > config XSM_FLASK > - def_bool y > - prompt "FLux Advanced Security Kernel support" > - depends on XSM > - ---help--- > + bool "FLux Advanced Security Kernel support" > + default n I don't understand this change in default (and as an aside, a default of "n" doesn't need spelling out): In the description you say "adjusted CONFIG_XSM_SILO, CONFIG_XSM_FLASK, and the default module selection to sensible defaults". If that's to describe this change, then I'm afraid I don't see why defaulting to "n" is more sensible once the person configuring Xen has chosen the configure XSM's (or XSM_CONFIGURABLE's) sub-options. If that's unrelated to the change here, then I'm afraid I'm missing justification altogether. (Same for SILO then.) > + depends on XSM_CONFIGURABLE > + select XSM_EVTCHN_LABELING Neither this nor any prior patch introduces an option of this name, and there's also none in the present tree. All afaics; I may have overlooked something or typo-ed a "grep" command. > @@ -265,14 +258,14 @@ config XSM_SILO > If unsure, say Y. 
> > choice > - prompt "Default XSM implementation" > - depends on XSM > + prompt "Default XSM module" > default XSM_SILO_DEFAULT if XSM_SILO && ARM > default XSM_FLASK_DEFAULT if XSM_FLASK > default XSM_SILO_DEFAULT if XSM_SILO > default XSM_DUMMY_DEFAULT > + depends on XSM_CONFIGURABLE With the larger set of "default" lines I'd like to suggest to keep "depends on" ahead of them. > @@ -282,7 +275,7 @@ endchoice > config LATE_HWDOM > bool "Dedicated hardware domain" > default n > - depends on XSM && X86 > + depends on XSM_FLASK && X86 This change is not mentioned or justified in the description. In fact I think it is unrelated to the change here and hence would want breaking out. > ---help--- As you're changing these elsewhere, any chance of you also changing this one to just "help"? > --- a/xen/include/xsm/xsm.h > +++ b/xen/include/xsm/xsm.h > @@ -19,545 +19,1023 @@ > #include > #include > #include > - > -#ifdef CONFIG_XSM > +#include > +#include > > extern struct xsm_ops xsm_ops; > > -static inline void xsm_security_domaininfo
Re: [PATCH] x86/xstate: reset cached register values on resume
On 26/08/2021 08:40, Jan Beulich wrote: > On 25.08.2021 18:49, Andrew Cooper wrote: >> On 25/08/2021 16:02, Jan Beulich wrote: >>> On 24.08.2021 23:11, Andrew Cooper wrote: >>> If >>> the register started out non-zero (the default on AMD iirc, as it's >>> not really masks there) but the first value to be written was zero, >>> we'd skip the write. >> There is cpuidmask_defaults which does get filled from the MSRs on boot. >> >> AMD are real CPUID settings, while Intel is an and-mask. i.e. they're >> both non-zero (unless someone does something silly with the command line >> arguments, and I don't expect Xen to be happy booting if the leaves all >> read 0). > Surely not all of them together, but I don't think it's completely > unreasonable for one (say the XSAVE one, if e.g. XSAVE is to be turned > off altogether for guests) to be all zero. > >> Each early_init_*() has an explicit ctxt_switch_levelling(NULL) call >> which, given non-zero content in cpuidmask_defaults, should latch each >> one appropriately in the per-cpu variable. > Well, as you say - provided the individual fields are all non-zero. The MSRs only exist on CPUs which have non-zero features in the relevant leaves. The XSAVE and Therm registers could plausibly be 0. Dom0 is first to boot and won't expect to have XSAVE hidden, but we do zero all of leaf 6 in recalculate_misc(). There are certainly corner cases here to improve, but I think all registers will latch on the first early_init_*(), except for Therm on AMD which will latch on the first context switch from a PV guest back to idle. cpu/common.c:120:static DEFINE_PER_CPU(uint64_t, msr_misc_features); >>> Almost the same here - we only initialize the variable on the BSP >>> afaics. >> No - way way way worse, I think. >> >> For all APs, we write 0 or MSR_MISC_FEATURES_CPUID_FAULTING into >> MSR_INTEL_MISC_FEATURES_ENABLES, which amongst other things turns off >> Fast String Enable. > Urgh, indeed. 
Prior to 6e2fdc0f8902 there was a read-modify-write > operation. With the introduction of the cache variable this went > away, while the cache gets filled for BSP only. Yeah - I really screwed up on that one... It was also part of the PV Shim work done in a hurry in the lead-up to Spectre/Meltdown. MSR_INTEL_MISC_FEATURES_ENABLES is a lot like MSR_{TSX_FORCE_ABORT,TSX_CTRL,MCU_OPT_CTRL} and the MTRRs. Most of the content wants to be identical on all cores, so we do want to fix up AP values with the BSP value if they differ, but we also want a software cache covering at least the CPUID_FAULTING bit so we don't have an unnecessary MSR read on the context switch path. I'll try to do something better. ~Andrew
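The bug class being discussed above — a "write only if changed" software cache over a hardware register that goes stale when the hardware is reset behind its back (S3 resume), or that is only seeded on the BSP — can be modelled in a few lines. This is a simplified illustration, not the Xen code; the INVALID_CACHE sentinel mirrors Marek's "reset the cache on resume" fix:

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_CACHE UINT64_MAX     /* a value the register can never hold */

static uint64_t hw_reg;              /* stands in for the real XCR0 / MSR */
static unsigned int hw_writes;       /* counts actual hardware writes */
static uint64_t cached = INVALID_CACHE;

static void wrmsr_model(uint64_t val) { hw_reg = val; hw_writes++; }

/* Write-if-changed wrapper, as used for XCR0 / MSR_XSS style registers. */
static void set_reg(uint64_t val)
{
    if (val == cached)
        return;                      /* skip the redundant hardware access */
    wrmsr_model(val);
    cached = val;
}

/*
 * On S3 resume the hardware resets but the per-CPU cache survives, so it
 * must be invalidated; otherwise the next set_reg(old_value) is wrongly
 * skipped and the register stays at its reset value.
 */
static void on_resume(void)
{
    hw_reg = 0;                      /* firmware reset the register */
    cached = INVALID_CACHE;
}
```

The same model shows why seeding the cache on only one CPU is dangerous: on any CPU whose cache never matched the hardware, the first "skip" decision is based on garbage.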
Re: [XEN RFC PATCH 26/40] xen/arm: Add boot and secondary CPU to NUMA system
On 26/08/2021 08:24, Wei Chen wrote: Hi Julien, Hi Wei, -Original Message- From: Julien Grall Sent: 2021年8月26日 0:58 To: Wei Chen ; xen-devel@lists.xenproject.org; sstabell...@kernel.org; jbeul...@suse.com Cc: Bertrand Marquis Subject: Re: [XEN RFC PATCH 26/40] xen/arm: Add boot and secondary CPU to NUMA system Hi Wei, On 11/08/2021 11:24, Wei Chen wrote: When cpu boot up, we have add them to NUMA system. In current stage, we have not parsed the NUMA data, but we have created a fake NUMA node. So, in this patch, all CPU will be added to NUMA node#0. After the NUMA data has been parsed from device tree, the CPU will be added to correct NUMA node as the NUMA data described. Signed-off-by: Wei Chen --- xen/arch/arm/setup.c | 6 ++ xen/arch/arm/smpboot.c | 6 ++ xen/include/asm-arm/numa.h | 1 + 3 files changed, 13 insertions(+) diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c index 3c58d2d441..7531989f21 100644 --- a/xen/arch/arm/setup.c +++ b/xen/arch/arm/setup.c @@ -918,6 +918,12 @@ void __init start_xen(unsigned long boot_phys_offset, processor_id(); +/* + * If Xen is running on a NUMA off system, there will + * be a node#0 at least. + */ +numa_add_cpu(0); + smp_init_cpus(); cpus = smp_get_max_cpus(); printk(XENLOG_INFO "SMP: Allowing %u CPUs\n", cpus); diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c index a1ee3146ef..aa78958c07 100644 --- a/xen/arch/arm/smpboot.c +++ b/xen/arch/arm/smpboot.c @@ -358,6 +358,12 @@ void start_secondary(void) */ smp_wmb(); +/* + * If Xen is running on a NUMA off system, there will + * be a node#0 at least. + */ +numa_add_cpu(cpuid); + On x86, numa_add_cpu() will be called before the pCPU is brought up. I am not quite too sure why we are doing it differently here. Can you clarify it? Of course we can invoke numa_add_cpu before cpu_up as x86. But in my tests, I found when cpu bring up failed, this cpu still be add to NUMA. 
Although this does not affect the execution of the code (because the CPU is offline), I don't think adding an offline CPU to NUMA makes sense. Right, but again, why do you want to solve the problem on Arm and not x86? After all, NUMA is not architecture-specific (in fact you move most of the code into common code). In fact, the risk is that someone may read arch/x86 and not realize that on Arm the CPU is not in the node until later. So I think we should call numa_add_cpu() around the same place on all the architectures. If you think the current position on x86 is not correct, then it should be changed as well. However, I don't know the story behind the position of the call on x86. You may want to ask the x86 maintainers. Cheers, -- Julien Grall
Re: [XEN RFC PATCH 23/40] xen/arm: introduce a helper to parse device tree memory node
On 26/08/2021 07:35, Wei Chen wrote: Hi Julien, Hi Wei, -Original Message- From: Julien Grall Sent: August 25, 2021 21:49 To: Wei Chen ; xen-devel@lists.xenproject.org; sstabell...@kernel.org; jbeul...@suse.com Cc: Bertrand Marquis Subject: Re: [XEN RFC PATCH 23/40] xen/arm: introduce a helper to parse device tree memory node Hi Wei, On 11/08/2021 11:24, Wei Chen wrote: Memory blocks' NUMA ID information is stored in the device tree's memory nodes as "numa-node-id". We need a new helper to parse and verify this ID from memory nodes. In order to support memory affinity for later use, the valid memory ranges and NUMA IDs will be saved to tables. Signed-off-by: Wei Chen --- xen/arch/arm/numa_device_tree.c | 130 1 file changed, 130 insertions(+) diff --git a/xen/arch/arm/numa_device_tree.c b/xen/arch/arm/numa_device_tree.c index 37cc56acf3..bbe081dcd1 100644 --- a/xen/arch/arm/numa_device_tree.c +++ b/xen/arch/arm/numa_device_tree.c @@ -20,11 +20,13 @@ #include #include #include +#include #include #include s8 device_tree_numa = 0; static nodemask_t processor_nodes_parsed __initdata; +static nodemask_t memory_nodes_parsed __initdata; static int srat_disabled(void) { @@ -55,6 +57,79 @@ static int __init dtb_numa_processor_affinity_init(nodeid_t node) return 0; } +/* Callback for parsing of the memory regions affinity */ +static int __init dtb_numa_memory_affinity_init(nodeid_t node, +paddr_t start, paddr_t size) +{ The implementation of this function is quite similar to the ACPI version. Can this be abstracted? In my draft, I had tried to merge the ACPI and DTB versions into one function. I introduced a number of "if else" branches to distinguish ACPI from DTB, especially for ACPI hotplug. The function seemed very messy. Not enough benefit to make up for the mess, so I gave up. 
I think you can get away without distinguishing between ACPI and DT in that helper: * ma->flags & ACPI_SRAT_MEM_HOTPLUGGABLE could be replaced by an argument indicating whether the region is hotpluggable (this would always be false for DT) * Access to memblk_hotplug can be stubbed (in the future we may want to consider memory hotplug even on Arm). Do you still have the "if else" version? If so, can you post it? Cheers, -- Julien Grall
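Julien's suggestion — one shared memory-affinity helper, with hotpluggability passed as an argument that the DT front end always sets to false — might look roughly like this (a sketch with made-up names and limits, not the actual Xen ACPI/DT parsers):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_BLOCKS 8

struct mem_blk { int node; uint64_t start, size; bool hotplug; };

static struct mem_blk blocks[MAX_BLOCKS];
static unsigned int nr_blocks;

/*
 * Common helper shared by the ACPI (SRAT) and device-tree parsers.
 * The only firmware-specific input - whether the range is hotpluggable -
 * is passed in explicitly instead of being read from SRAT flags.
 */
static int numa_memory_affinity(int node, uint64_t start, uint64_t size,
                                bool hotplug)
{
    if (!size || nr_blocks >= MAX_BLOCKS)
        return -1;
    blocks[nr_blocks++] = (struct mem_blk){ node, start, size, hotplug };
    return 0;
}

/* DT front end: the binding has no hotplug concept, so always pass false. */
static int dtb_memory_affinity(int node, uint64_t start, uint64_t size)
{
    return numa_memory_affinity(node, start, size, false);
}
```

The front ends then only translate their respective table formats; the range validation and bookkeeping live in one place.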
Re: [PATCH v3 5/7] xsm: decouple xsm header inclusion selection
On 05.08.2021 16:06, Daniel P. Smith wrote: > --- /dev/null > +++ b/xen/include/xsm/xsm-core.h > @@ -0,0 +1,273 @@ > +/* > + * This file contains the XSM hook definitions for Xen. > + * > + * This work is based on the LSM implementation in Linux 2.6.13.4. > + * > + * Author: George Coker, > + * > + * Contributors: Michael LeMay, > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License version 2, > + * as published by the Free Software Foundation. > + */ > + > +#ifndef __XSM_CORE_H__ > +#define __XSM_CORE_H__ > + > +#include > +#include I was going to ask to invert the order (as we try to arrange #include-s alphabetically), but it looks like multiboot.h isn't fit for this. > +typedef void xsm_op_t; > +DEFINE_XEN_GUEST_HANDLE(xsm_op_t); Just FTR - I consider this dubious. If void is meant, I don't see why a void handle can't be used. > +/* policy magic number (defined by XSM_MAGIC) */ > +typedef uint32_t xsm_magic_t; > + > +#ifdef CONFIG_XSM_FLASK > +#define XSM_MAGIC 0xf97cff8c > +#else > +#define XSM_MAGIC 0x0 > +#endif > + > +/* These annotations are used by callers and in dummy.h to document the > + * default actions of XSM hooks. They should be compiled out otherwise. > + */ I realize you only move code, but like e.g. the u32 -> uint32_t change in context above I'd like to encourage you to also address other style issues in the newly introduced file. Here I'm talking about comment style, requiring /* to be on its own line. 
> +enum xsm_default { > +XSM_HOOK, /* Guests can normally access the hypercall */ > +XSM_DM_PRIV, /* Device model can perform on its target domain */ > +XSM_TARGET, /* Can perform on self or your target domain */ > +XSM_PRIV, /* Privileged - normally restricted to dom0 */ > +XSM_XS_PRIV, /* Xenstore domain - can do some privileged operations */ > +XSM_OTHER /* Something more complex */ > +}; > +typedef enum xsm_default xsm_default_t; > + > +struct xsm_ops { > +void (*security_domaininfo) (struct domain *d, Similarly here (and below) - we don't normally put a blank between the closing and opening parentheses in function pointer declarations. The majority does so here, but ... >[...] > +int (*page_offline)(uint32_t cmd); > +int (*hypfs_op)(void); ... there are exceptions. >[...] > +int (*platform_op) (uint32_t cmd); > + > +#ifdef CONFIG_X86 > +int (*do_mca) (void); > +int (*shadow_control) (struct domain *d, uint32_t op); > +int (*mem_sharing_op) (struct domain *d, struct domain *cd, int op); > +int (*apic) (struct domain *d, int cmd); > +int (*memtype) (uint32_t access); > +int (*machine_memory_map) (void); > +int (*domain_memory_map) (struct domain *d); > +#define XSM_MMU_UPDATE_READ 1 > +#define XSM_MMU_UPDATE_WRITE 2 > +#define XSM_MMU_NORMAL_UPDATE4 > +#define XSM_MMU_MACHPHYS_UPDATE 8 > +int (*mmu_update) (struct domain *d, struct domain *t, > + struct domain *f, uint32_t flags); > +int (*mmuext_op) (struct domain *d, struct domain *f); > +int (*update_va_mapping) (struct domain *d, struct domain *f, > + l1_pgentry_t pte); > +int (*priv_mapping) (struct domain *d, struct domain *t); > +int (*ioport_permission) (struct domain *d, uint32_t s, uint32_t e, > + uint8_t allow); > +int (*ioport_mapping) (struct domain *d, uint32_t s, uint32_t e, > + uint8_t allow); > +int (*pmu_op) (struct domain *d, unsigned int op); > +#endif > +int (*dm_op) (struct domain *d); To match grouping elsewhere, a blank line above here, ... 
> +int (*xen_version) (uint32_t cmd); > +int (*domain_resource_map) (struct domain *d); > +#ifdef CONFIG_ARGO ... and here would be nice. > +int (*argo_enable) (const struct domain *d); > +int (*argo_register_single_source) (const struct domain *d, > +const struct domain *t); > +int (*argo_register_any_source) (const struct domain *d); > +int (*argo_send) (const struct domain *d, const struct domain *t); > +#endif > +}; > + > +extern void xsm_fixup_ops(struct xsm_ops *ops); > + > +#ifdef CONFIG_XSM > + > +#ifdef CONFIG_MULTIBOOT > +extern int xsm_multiboot_init(unsigned long *module_map, > + const multiboot_info_t *mbi); > +extern int xsm_multiboot_policy_init(unsigned long *module_map, > + const multiboot_info_t *mbi, > + void **policy_buffer, > + size_t *policy_size); > +#endif > + > +#ifdef CONFIG_HAS_DEVICE_TREE > +/* > + * Initialize XSM > + * > + * On success, return 1 if using SILO mode else 0. > + */ > +extern int xsm_dt_init(void); > +extern int xsm_dt_policy_init(void **policy_buffer, size_t *policy_size); > +extern bool
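The collapsed hook dispatch described in this series — a single struct of optional function pointers where each inline wrapper tries the registered hook and falls back to the dummy policy — can be sketched in miniature (illustrative names, not Xen's actual xsm_ops or hook signatures):

```c
#include <assert.h>
#include <stddef.h>

struct dom { int id; };

/* One ops table of optional hooks; NULL means "use the default policy". */
struct xsm_ops_model {
    int (*domain_create)(struct dom *d);
};

static struct xsm_ops_model xsm_ops_model;

/* The "dummy" policy the wrapper falls back to when no module registered. */
static int default_domain_create(struct dom *d)
{
    return d->id == 0 ? 0 : -1;        /* e.g. only dom0 is privileged */
}

/* Inline wrapper: try the registered hook, else the default check. */
static inline int xsm_domain_create(struct dom *d)
{
    if (xsm_ops_model.domain_create)
        return xsm_ops_model.domain_create(d);
    return default_domain_create(d);
}

/* A module (think FLASK/SILO) overrides the decision by registering. */
static int permissive_domain_create(struct dom *d)
{
    (void)d;
    return 0;
}
```

This removes the need for a compile-time switch between two hook interfaces: unregistered hooks cost one NULL test and then behave exactly like the dummy implementation.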
Re: Xen C-state Issues
On 26.08.2021 03:18, Elliott Mitchell wrote: > On Tue, Aug 24, 2021 at 08:14:41AM +0200, Jan Beulich wrote: >> On 24.08.2021 07:37, Elliott Mitchell wrote: >>> On Mon, Aug 23, 2021 at 09:12:52AM +0200, Jan Beulich wrote: On 21.08.2021 18:25, Elliott Mitchell wrote: > ACPI C-state support might not see too much use, but it does see some. > > With Xen 4.11 and Linux kernel 4.19, I found higher C-states only got > enabled for physical cores for which Domain 0 had a corresponding vCPU. > On a machine where Domain 0 has 5 vCPUs, but 8 reported cores, the > additional C-states would only be enabled on cores 0-4. > > This can be worked around by giving Domain 0 vCPUs equal to cores, but > then offlining the extra vCPUs. I'm guessing this is a bug with the > Linux 4.19 xen_acpi_processor module. > > > > Appears Xen 4.14 doesn't work at all with Linux kernel 4.19's ACPI > C-state support. This combination is unable to enable higher C-states > on any core. Since Xen 4.14 and Linux 4.19 are *both* *presently* > supported it seems patch(es) are needed somewhere for this combination. Hmm, having had observed the same quite some time ago, I thought I had dealt with these problems. Albeit surely not in Xen 4.11 or Linux 4.19. Any chance you could check up-to-date versions of both Xen and Linux (together)? >>> >>> I can believe you got this fixed, but the Linux fixes never got >>> backported. >>> >>> Of the two, higher C-states working with Linux 4.19 and Xen 4.11, but >>> not Linux 4.19 and Xen 4.14 is more concerning to me. >> >> I'm afraid without you providing detail (full verbosity logs) and >> ideally checking with 4.15 or yet better -unstable it's going to be >> hard to judge whether that's a bug, and if so where it might sit. > > That would be a very different sort of bug report if that was found to > be an issue. This report is likely a problem of fixes not being > backported to stable branches. As you say - likely. I'd like to be sure. 
> What you're writing about would be looking for bugs in development > branches. Very much so, in case there are issues left, or ones have got reintroduced. That's what the primary purpose of this list is. If you were suspecting missing fixes in the kernel, I guess xen-devel isn't the preferred channel anyway. Otoh the stable maintainers there would likely want concrete commits pointed out ... Jan
Re: [PATCH] x86/xstate: reset cached register values on resume
On 25.08.2021 18:49, Andrew Cooper wrote: > On 25/08/2021 16:02, Jan Beulich wrote: >> On 24.08.2021 23:11, Andrew Cooper wrote: >>> On 18/08/2021 13:44, Andrew Cooper wrote: On 18/08/2021 12:30, Marek Marczykowski-Górecki wrote: > set_xcr0() and set_msr_xss() use a cached value to avoid setting the > register to the same value over and over. But suspend/resume implicitly > resets the registers, and since percpu areas are not deallocated on > suspend anymore, the cache gets stale. > Reset the cache on resume, to ensure the next write will really hit the > hardware. Choose value 0, as it will never be a legitimate write to > those registers - and so, will force a write (and cache update). > > Note the cache is used in get_xcr0() and get_msr_xss() too, but: > - set_xcr0() is called a few lines below in xstate_init(), so it will > update the cache with the appropriate value > - get_msr_xss() is not used anywhere - and thus not before any > set_msr_xss() that will fill the cache > > Fixes: aca2a985a55a "xen: don't free percpu areas during suspend" > Signed-off-by: Marek Marczykowski-Górecki > I'd prefer to do this differently. As I said in the thread, there are other registers such as MSR_TSC_AUX which fall into the same category, and I'd like to make something which works systematically. >>> Ok - after some searching, I think we have problems with: >>> >>> cpu/common.c:47:DEFINE_PER_CPU(struct cpuidmasks, cpuidmasks); >> Don't we have a problem here even during initial boot? I can't see >> the per-CPU variable getting filled by what the registers hold. > No, I don't think so, but it is a roundabout route. So where do you see it getting filled? >> If >> the register started out non-zero (the default on AMD iirc, as it's >> not really masks there) but the first value to be written was zero, >> we'd skip the write. > There is cpuidmask_defaults which does get filled from the MSRs on boot. > > AMD are real CPUID settings, while Intel is an and-mask. i.e. 
they're > both non-zero (unless someone does something silly with the command line > arguments, and I don't expect Xen to be happy booting if the leaves all > read 0). Surely not all of them together, but I don't think it's completely unreasonable for one (say the XSAVE one, if e.g. XSAVE is to be turned off altogether for guests) to be all zero. > Each early_init_*() has an explicit ctxt_switch_levelling(NULL) call > which, given non-zero content in cpuidmask_defaults should latch each > one appropriately in the per-cpu variable. Well, as you say - provided the individual fields are all non-zero. >>> cpu/common.c:120:static DEFINE_PER_CPU(uint64_t, msr_misc_features); >> Almost the same here - we only initialize the variable on the BSP >> afaics. > > No - way way way worse, I think. > > For all APs, we write 0 or MSR_MISC_FEATURES_CPUID_FAULTING into > MSR_INTEL_MISC_FEATURES_ENABLES, which amongst other things turns off > Fast String Enable. Urgh, indeed. Prior to 6e2fdc0f8902 there was a read-modify-write operation. With the introduction of the cache variable this went away, while the cache gets filled for BSP only. Jan
RE: [XEN RFC PATCH 00/40] Add device tree based NUMA support to Arm64
Hi Stefano, > -Original Message- > From: Stefano Stabellini > Sent: 2021年8月26日 8:09 > To: Wei Chen > Cc: xen-devel@lists.xenproject.org; sstabell...@kernel.org; jul...@xen.org; > jbeul...@suse.com; Bertrand Marquis ; > andrew.coop...@citrix.com > Subject: Re: [XEN RFC PATCH 00/40] Add device tree based NUMA support to > Arm64 > > Thanks for the big contribution! > > I just wanted to let you know that the series passed all the gitlab-ci > build tests without issues. > > The runtime tests originally failed due to unrelated problems (there was > a Debian testing upgrade that broke Gitlab-CI.) I fix the underlying > issue and restarted the failed tests and now they passed. > > This is the pipeline: > https://gitlab.com/xen-project/patchew/xen/-/pipelines/351484940 > > There are still two runtime x86 tests that fail but I don't think the > failures are related to your series. > > Thanks for testing this series : ) > On Wed, 11 Aug 2021, Wei Chen wrote: > > Xen memory allocation and scheduler modules are NUMA aware. > > But actually, on x86 has implemented the architecture APIs > > to support NUMA. Arm was providing a set of fake architecture > > APIs to make it compatible with NUMA awared memory allocation > > and scheduler. > > > > Arm system was working well as a single node NUMA system with > > these fake APIs, because we didn't have multiple nodes NUMA > > system on Arm. But in recent years, more and more Arm devices > > support multiple nodes NUMA system. Like TX2, some Hisilicon > > chips and the Ampere Altra. > > > > So now we have a new problem. When Xen is running on these Arm > > devices, Xen still treat them as single node SMP systems. The > > NUMA affinity capability of Xen memory allocation and scheduler > > becomes meaningless. Because they rely on input data that does > > not reflect real NUMA layout. > > > > Xen still think the access time for all of the memory is the > > same for all CPUs. 
However, Xen may allocate memory to a VM > > from different NUMA nodes with different access speeds. This > > difference can be amplified in workloads inside VM, causing > > performance instability and timeouts. > > > > So in this patch series, we implement a set of NUMA API to use > > device tree to describe the NUMA layout. We reuse most of the > > code of x86 NUMA to create and maintain the mapping between > > memory and CPU, create the matrix between any two NUMA nodes. > > Except ACPI and some x86 specified code, we have moved other > > code to common. In next stage, when we implement ACPI based > > NUMA for Arm64, we may move the ACPI NUMA code to common too, > > but in current stage, we keep it as x86 only. > > > > This patch serires has been tested and booted well on one > > Arm64 NUMA machine and one HPE x86 NUMA machine. > > > > Hongda Deng (2): > > xen/arm: return default DMA bit width when platform is not set > > xen/arm: Fix lowmem_bitsize when arch_get_dma_bitsize return 0 > > > > Wei Chen (38): > > tools: Fix -Werror=maybe-uninitialized for xlu_pci_parse_bdf > > xen/arm: Print a 64-bit number in hex from early uart > > xen/x86: Initialize memnodemapsize while faking NUMA node > > xen: decouple NUMA from ACPI in Kconfig > > xen/arm: use !CONFIG_NUMA to keep fake NUMA API > > xen/x86: Move NUMA memory node map functions to common > > xen/x86: Move numa_add_cpu_node to common > > xen/x86: Move NR_NODE_MEMBLKS macro to common > > xen/x86: Move NUMA nodes and memory block ranges to common > > xen/x86: Move numa_initmem_init to common > > xen/arm: introduce numa_set_node for Arm > > xen/arm: set NUMA nodes max number to 64 by default > > xen/x86: move NUMA API from x86 header to common header > > xen/arm: Create a fake NUMA node to use common code > > xen/arm: Introduce DEVICE_TREE_NUMA Kconfig for arm64 > > xen/arm: Keep memory nodes in dtb for NUMA when boot from EFI > > xen: fdt: Introduce a helper to check fdt node type > > xen/arm: implement node 
distance helpers for Arm64 > > xen/arm: introduce device_tree_numa as a switch for device tree NUMA > > xen/arm: introduce a helper to parse device tree processor node > > xen/arm: introduce a helper to parse device tree memory node > > xen/arm: introduce a helper to parse device tree NUMA distance map > > xen/arm: unified entry to parse all NUMA data from device tree > > xen/arm: Add boot and secondary CPU to NUMA system > > xen/arm: build CPU NUMA node map while creating cpu_logical_map > > xen/x86: decouple nodes_cover_memory with E820 map > > xen/arm: implement Arm arch helpers Arm to get memory map info > > xen: move NUMA memory and CPU parsed nodemasks to common > > xen/x86: move nodes_cover_memory to common > > xen/x86: make acpi_scan_nodes to be neutral > > xen: export bad_srat and srat_disabled to extern > > xen: move numa_scan_nodes from x86 to common > > xen: enable numa_scan_nodes for device tree based NUMA > > xen/arm: keep guest still be NUMA
RE: [XEN RFC PATCH 30/40] xen: move NUMA memory and CPU parsed nodemasks to common
Hi Julien, > -Original Message- > From: Julien Grall > Sent: 2021年8月26日 1:17 > To: Wei Chen ; xen-devel@lists.xenproject.org; > sstabell...@kernel.org; jbeul...@suse.com > Cc: Bertrand Marquis > Subject: Re: [XEN RFC PATCH 30/40] xen: move NUMA memory and CPU parsed > nodemasks to common > > Hi Wei, > > On 11/08/2021 11:24, Wei Chen wrote: > > Both memory_nodes_parsed and processor_nodes_parsed are used by > > Arm and x86 to record parsed NUMA memory and CPU. So we > > move them to common. > > Looking at the usage, they both call: > > numa_set...(..., bitmap) > > So rather than exporting the two helpers, could we simply add helpers to > abstract it? > I will try to fix it in next version. > > > > > Signed-off-by: Wei Chen > > --- > > xen/arch/arm/numa_device_tree.c | 2 -- > > xen/arch/x86/srat.c | 3 --- > > xen/common/numa.c | 3 +++ > > xen/include/xen/nodemask.h | 2 ++ > > 4 files changed, 5 insertions(+), 5 deletions(-) > > > > diff --git a/xen/arch/arm/numa_device_tree.c > b/xen/arch/arm/numa_device_tree.c > > index 27ffb72f7b..f74b7f6427 100644 > > --- a/xen/arch/arm/numa_device_tree.c > > +++ b/xen/arch/arm/numa_device_tree.c > > @@ -25,8 +25,6 @@ > > #include > > > > s8 device_tree_numa = 0; > > -static nodemask_t processor_nodes_parsed __initdata; > > -static nodemask_t memory_nodes_parsed __initdata; > > This is code that was introduced in a previous patch. In general, it is > better to do the rework first and then add the new code. This makes > easier to follow series as the code added is not changed. > Yes, I will fix it in next version. 
> > > > static int srat_disabled(void) > > { > > diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c > > index 2298353846..dd3aa30843 100644 > > --- a/xen/arch/x86/srat.c > > +++ b/xen/arch/x86/srat.c > > @@ -24,9 +24,6 @@ > > > > static struct acpi_table_slit *__read_mostly acpi_slit; > > > > -static nodemask_t memory_nodes_parsed __initdata; > > -static nodemask_t processor_nodes_parsed __initdata; > > - > > struct pxm2node { > > unsigned pxm; > > nodeid_t node; > > diff --git a/xen/common/numa.c b/xen/common/numa.c > > index 26c0006d04..79ab250543 100644 > > --- a/xen/common/numa.c > > +++ b/xen/common/numa.c > > @@ -35,6 +35,9 @@ int num_node_memblks; > > struct node node_memblk_range[NR_NODE_MEMBLKS]; > > nodeid_t memblk_nodeid[NR_NODE_MEMBLKS]; > > > > +nodemask_t memory_nodes_parsed __initdata; > > +nodemask_t processor_nodes_parsed __initdata; > > + > > bool numa_off; > > > > /* > > diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h > > index 1dd6c7458e..29ce5e28e7 100644 > > --- a/xen/include/xen/nodemask.h > > +++ b/xen/include/xen/nodemask.h > > @@ -276,6 +276,8 @@ static inline int __cycle_node(int n, const > nodemask_t *maskp, int nbits) > >*/ > > > > extern nodemask_t node_online_map; > > +extern nodemask_t memory_nodes_parsed; > > +extern nodemask_t processor_nodes_parsed; > > > > #if MAX_NUMNODES > 1 > > #define num_online_nodes()nodes_weight(node_online_map) > > > > Cheers, > > -- > Julien Grall
RE: [XEN RFC PATCH 29/40] xen/arm: implement Arm arch helpers Arm to get memory map info
Hi Julien, > -Original Message- > From: Julien Grall > Sent: 2021年8月26日 1:10 > To: Wei Chen ; xen-devel@lists.xenproject.org; > sstabell...@kernel.org; jbeul...@suse.com > Cc: Bertrand Marquis > Subject: Re: [XEN RFC PATCH 29/40] xen/arm: implement Arm arch helpers Arm > to get memory map info > > Hi Wei, > > On 11/08/2021 11:24, Wei Chen wrote: > > These two helpers are architecture APIs that are required by > > nodes_cover_memory. > > > > Signed-off-by: Wei Chen > > --- > > xen/arch/arm/numa.c | 14 ++ > > 1 file changed, 14 insertions(+) > > > > diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c > > index f61a8df645..6eebf8e8bc 100644 > > --- a/xen/arch/arm/numa.c > > +++ b/xen/arch/arm/numa.c > > @@ -126,3 +126,17 @@ void __init numa_init(bool acpi_off) > > numa_initmem_init(PFN_UP(ram_start), PFN_DOWN(ram_end)); > > return; > > } > > + > > +uint32_t __init arch_meminfo_get_nr_bank(void) > > +{ > > + return bootinfo.mem.nr_banks; > > +} > > + > > +int __init arch_meminfo_get_ram_bank_range(int bank, > > + unsigned long long *start, unsigned long long *end) > > They are physical address, so we should use "paddr_t" as on system such > as 32-bit Arm, "unsigned long" is not enough to cover all the physical > address. > > As you change the type, I would also suggest to change the bank from an > int to an unsigned int. > I will fix them in next version. > > +{ > > + *start = bootinfo.mem.bank[bank].start; > > + *end = bootinfo.mem.bank[bank].start + bootinfo.mem.bank[bank].size; > > + > > + return 0; > > +} > > > > Cheers, > > -- > Julien Grall
RE: [XEN RFC PATCH 27/40] xen/arm: build CPU NUMA node map while creating cpu_logical_map
Hi Julien, > -Original Message- > From: Julien Grall > Sent: 2021年8月26日 1:07 > To: Wei Chen ; xen-devel@lists.xenproject.org; > sstabell...@kernel.org > Cc: Bertrand Marquis ; Jan Beulich > > Subject: Re: [XEN RFC PATCH 27/40] xen/arm: build CPU NUMA node map while > creating cpu_logical_map > > Hi Wei, > > On 11/08/2021 11:24, Wei Chen wrote: > > Sometimes, CPU logical ID maybe different with physical CPU ID. > > Xen is using CPU logial ID for runtime usage, so we should use > > CPU logical ID to create map between NUMA node and CPU. > > This commit message gives the impression that you are trying to fix a > bug. However, what you are explaining is the reason why the code will > use the logical ID rather than physical ID. > > I think the commit message should explain what the patch is doing. You > can then add an explanation why you are using the CPU logical ID. > Something like "Note we storing the CPU logical ID because...". > > Ok > > > > Signed-off-by: Wei Chen > > --- > > xen/arch/arm/smpboot.c | 31 ++- > > 1 file changed, 30 insertions(+), 1 deletion(-) > > > > diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c > > index aa78958c07..dd5a45bffc 100644 > > --- a/xen/arch/arm/smpboot.c > > +++ b/xen/arch/arm/smpboot.c > > @@ -121,7 +121,12 @@ static void __init dt_smp_init_cpus(void) > > { > > [0 ... NR_CPUS - 1] = MPIDR_INVALID > > }; > > +static nodeid_t node_map[NR_CPUS] __initdata = > > +{ > > +[0 ... NR_CPUS - 1] = NUMA_NO_NODE > > +}; > > bool bootcpu_valid = false; > > +uint32_t nid = 0; > > int rc; > > > > mpidr = boot_cpu_data.mpidr.bits & MPIDR_HWID_MASK; > > @@ -172,6 +177,26 @@ static void __init dt_smp_init_cpus(void) > > continue; > > } > > > > +#ifdef CONFIG_DEVICE_TREE_NUMA > > +/* > > + * When CONFIG_DEVICE_TREE_NUMA is set, try to fetch numa > infomation > > + * from CPU dts node, otherwise the nid is always 0. 
> > + */ > > +if ( !dt_property_read_u32(cpu, "numa-node-id", ) ) > > You can avoid the #ifdef by writing: > > if ( IS_ENABLED(CONFIG_DEVICE_TREE_NUMA) && ... ) > > However, I would suggest using CONFIG_NUMA because this code is already DT > specific. So we can shorten the name a bit. > OK > > +{ > > +printk(XENLOG_WARNING > > +"cpu[%d] dts path: %s: doesn't have numa infomation!\n", > > s/infomation/information/ OK > > > +cpuidx, dt_node_full_name(cpu)); > > +/* > > + * The the early stage of NUMA initialization, when Xen > found any > > s/The/During/? Oh, yes, I will fix it. > > > + * CPU dts node doesn't have numa-node-id info, the NUMA > will be > > + * treated as off, all CPU will be set to a FAKE node 0. So > if we > > + * get numa-node-id failed here, we should set nid to 0. > > + */ > > +nid = 0; > > +} > > +#endif > > + > > /* > >* 8 MSBs must be set to 0 in the DT since the reg property > >* defines the MPIDR[23:0] > > @@ -231,9 +256,12 @@ static void __init dt_smp_init_cpus(void) > > { > > printk("cpu%d init failed (hwid %"PRIregister"): %d\n", i, > hwid, rc); > > tmp_map[i] = MPIDR_INVALID; > > +node_map[i] = NUMA_NO_NODE; > > } > > -else > > +else { > > tmp_map[i] = hwid; > > +node_map[i] = nid; > > +} > > } > > > > if ( !bootcpu_valid ) > > @@ -249,6 +277,7 @@ static void __init dt_smp_init_cpus(void) > > continue; > > cpumask_set_cpu(i, _possible_map); > > cpu_logical_map(i) = tmp_map[i]; > > +numa_set_node(i, node_map[i]); > > } > > } > >> > > Cheers, > > -- > Julien Grall
[PATCH v7 8/8] AMD/IOMMU: respect AtsDisabled device flag
IVHD entries may specify that ATS is to be blocked for a device or range of devices. Honor firmware telling us so. While adding respective checks I noticed that the 2nd conditional in amd_iommu_setup_domain_device() failed to check the IOMMU's capability. Add the missing part of the condition there, as no good can come from enabling ATS on a device when the IOMMU is not capable of dealing with ATS requests. For actually using ACPI_IVHD_ATS_DISABLED, make its expansion no longer exhibit UB. Signed-off-by: Jan Beulich --- TBD: I find the ordering in amd_iommu_disable_domain_device() suspicious: amd_iommu_enable_domain_device() sets up the DTE first and then enables ATS on the device. It would seem to me that disabling would better be done the other way around (disable ATS on device, then adjust DTE). TBD: As an alternative to adding the missing IOMMU capability check, we may want to consider simply using dte->i in the 2nd conditional in amd_iommu_enable_domain_device(). For both of these, while ATS enabling/disabling gets invoked without any locks held, the two functions should not be possible to race with one another for any individual device (or else we'd be in trouble already, as ATS might then get re-enabled immediately after it was disabled, with the DTE out of sync with this setting). --- v7: New. 
--- a/xen/drivers/passthrough/amd/iommu.h +++ b/xen/drivers/passthrough/amd/iommu.h @@ -120,6 +120,7 @@ struct ivrs_mappings { uint16_t dte_requestor_id; bool valid:1; bool dte_allow_exclusion:1; +bool block_ats:1; /* ivhd device data settings */ uint8_t device_flags; --- a/xen/drivers/passthrough/amd/iommu_acpi.c +++ b/xen/drivers/passthrough/amd/iommu_acpi.c @@ -55,8 +55,8 @@ union acpi_ivhd_device { }; static void __init add_ivrs_mapping_entry( -uint16_t bdf, uint16_t alias_id, uint8_t flags, bool alloc_irt, -struct amd_iommu *iommu) +uint16_t bdf, uint16_t alias_id, uint8_t flags, unsigned int ext_flags, +bool alloc_irt, struct amd_iommu *iommu) { struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(iommu->seg); @@ -66,6 +66,7 @@ static void __init add_ivrs_mapping_entr ivrs_mappings[bdf].dte_requestor_id = alias_id; /* override flags for range of devices */ +ivrs_mappings[bdf].block_ats = ext_flags & ACPI_IVHD_ATS_DISABLED; ivrs_mappings[bdf].device_flags = flags; /* Don't map an IOMMU by itself. 
*/ @@ -499,7 +500,7 @@ static u16 __init parse_ivhd_device_sele return 0; } -add_ivrs_mapping_entry(bdf, bdf, select->header.data_setting, false, +add_ivrs_mapping_entry(bdf, bdf, select->header.data_setting, 0, false, iommu); return sizeof(*select); @@ -545,7 +546,7 @@ static u16 __init parse_ivhd_device_rang AMD_IOMMU_DEBUG(" Dev_Id Range: %#x -> %#x\n", first_bdf, last_bdf); for ( bdf = first_bdf; bdf <= last_bdf; bdf++ ) -add_ivrs_mapping_entry(bdf, bdf, range->start.header.data_setting, +add_ivrs_mapping_entry(bdf, bdf, range->start.header.data_setting, 0, false, iommu); return dev_length; @@ -580,7 +581,7 @@ static u16 __init parse_ivhd_device_alia AMD_IOMMU_DEBUG(" Dev_Id Alias: %#x\n", alias_id); -add_ivrs_mapping_entry(bdf, alias_id, alias->header.data_setting, true, +add_ivrs_mapping_entry(bdf, alias_id, alias->header.data_setting, 0, true, iommu); return dev_length; @@ -636,7 +637,7 @@ static u16 __init parse_ivhd_device_alia for ( bdf = first_bdf; bdf <= last_bdf; bdf++ ) add_ivrs_mapping_entry(bdf, alias_id, range->alias.header.data_setting, - true, iommu); + 0, true, iommu); return dev_length; } @@ -661,7 +662,8 @@ static u16 __init parse_ivhd_device_exte return 0; } -add_ivrs_mapping_entry(bdf, bdf, ext->header.data_setting, false, iommu); +add_ivrs_mapping_entry(bdf, bdf, ext->header.data_setting, + ext->extended_data, false, iommu); return dev_length; } @@ -708,7 +710,7 @@ static u16 __init parse_ivhd_device_exte for ( bdf = first_bdf; bdf <= last_bdf; bdf++ ) add_ivrs_mapping_entry(bdf, bdf, range->extended.header.data_setting, - false, iommu); + range->extended.extended_data, false, iommu); return dev_length; } @@ -800,7 +802,7 @@ static u16 __init parse_ivhd_device_spec AMD_IOMMU_DEBUG("IVHD Special: %pp variety %#x handle %#x\n", _SBDF2(seg, bdf), special->variety, special->handle); -add_ivrs_mapping_entry(bdf, bdf, special->header.data_setting, true, +add_ivrs_mapping_entry(bdf, bdf, special->header.data_setting, 0, true, iommu); switch ( 
special->variety ) ---
[PATCH v7 7/8] AMD/IOMMU: add "ivmd=" command line option
Just like VT-d's "rmrr=" it can be used to cover for firmware omissions. Since systems surfacing IVMDs seem to be rare, it is also meant to allow testing of the involved code. Only the IVMD flavors actually understood by the IVMD parsing logic can be generated, and for this initial implementation there's also no way to control the flags field - unity r/w mappings are assumed. Signed-off-by: Jan Beulich Reviewed-by: Paul Durrant --- v5: New. --- a/docs/misc/xen-command-line.pandoc +++ b/docs/misc/xen-command-line.pandoc @@ -836,12 +836,12 @@ Controls for the dom0 IOMMU setup. Typically, some devices in a system use bits of RAM for communication, and these areas should be listed as reserved in the E820 table and identified -via RMRR or IVMD entries in the APCI tables, so Xen can ensure that they +via RMRR or IVMD entries in the ACPI tables, so Xen can ensure that they are identity-mapped in the IOMMU. However, some firmware makes mistakes, and this option is a coarse-grain workaround for those errors. Where possible, finer grain corrections should be made with the `rmrr=`, -`ivrs_hpet=` or `ivrs_ioapic=` command line options. +`ivmd=`, `ivrs_hpet[]=`, or `ivrs_ioapic[]=` command line options. This option is disabled by default, and deprecated and intended for removal in future versions of Xen. If specifying `map-inclusive` is the @@ -1523,6 +1523,31 @@ _dom0-iommu=map-inclusive_ - using both > `= ` ### irq_vector_map (x86) + +### ivmd (x86) +> `= [-][=[-][,[-][,...]]][;...]` + +Define IVMD-like ranges that are missing from ACPI tables along with the +device(s) they belong to, and use them for 1:1 mapping. End addresses can be +omitted when exactly one page is meant. The ranges are inclusive when start +and end are specified. Note that only PCI segment 0 is supported at this time, +but it is fine to specify it explicitly. + +'start' and 'end' values are page numbers (not full physical addresses), +in hexadecimal format (can optionally be preceded by "0x"). 
+ +Omitting the optional (range of) BDF spcifiers signals that the range is to +be applied to all devices. + +Usage example: If device 0:0:1d.0 requires one page (0xd5d45) to be +reserved, and devices 0:0:1a.0...0:0:1a.3 collectively require three pages +(0xd5d46 thru 0xd5d48) to be reserved, one usage would be: + +ivmd=d5d45=0:1d.0;0xd5d46-0xd5d48=0:1a.0-0:1a.3 + +Note: grub2 requires to escape or quote special characters, like ';' when +multiple ranges are specified - refer to the grub2 documentation. + ### ivrs_hpet[``] (AMD) > `=[:]:.` --- a/xen/drivers/passthrough/amd/iommu_acpi.c +++ b/xen/drivers/passthrough/amd/iommu_acpi.c @@ -1063,6 +1063,9 @@ static void __init dump_acpi_table_heade } +static struct acpi_ivrs_memory __initdata user_ivmds[8]; +static unsigned int __initdata nr_ivmd; + #define to_ivhd_block(hdr) \ container_of(hdr, const struct acpi_ivrs_hardware, header) #define to_ivmd_block(hdr) \ @@ -1087,7 +1090,7 @@ static int __init parse_ivrs_table(struc { const struct acpi_ivrs_header *ivrs_block; unsigned long length; -unsigned int apic; +unsigned int apic, i; bool_t sb_ioapic = !iommu_intremap; int error = 0; @@ -1122,6 +1125,12 @@ static int __init parse_ivrs_table(struc length += ivrs_block->length; } +/* Add command line specified IVMD-equivalents. */ +if ( nr_ivmd ) +AMD_IOMMU_DEBUG("IVMD: %u command line provided entries\n", nr_ivmd); +for ( i = 0; !error && i < nr_ivmd; ++i ) +error = parse_ivmd_block(user_ivmds + i); + /* Each IO-APIC must have been mentioned in the table. */ for ( apic = 0; !error && iommu_intremap && apic < nr_ioapics; ++apic ) { @@ -1362,3 +1371,80 @@ int __init amd_iommu_get_supported_ivhd_ { return acpi_table_parse(ACPI_SIG_IVRS, get_supported_ivhd_type); } + +/* + * Parse "ivmd" command line option to later add the parsed devices / regions + * into unity mapping lists, just like IVMDs parsed from ACPI. + * Format: + * ivmd=[-][=[-'][,[-'][,...]]][;...] 
+ */ +static int __init parse_ivmd_param(const char *s) +{ +do { +unsigned long start, end; +const char *cur; + +if ( nr_ivmd >= ARRAY_SIZE(user_ivmds) ) +return -E2BIG; + +start = simple_strtoul(cur = s, , 16); +if ( cur == s ) +return -EINVAL; + +if ( *s == '-' ) +{ +end = simple_strtoul(cur = s + 1, , 16); +if ( cur == s || end < start ) +return -EINVAL; +} +else +end = start; + +if ( *s != '=' ) +{ +user_ivmds[nr_ivmd].start_address = start << PAGE_SHIFT; +user_ivmds[nr_ivmd].memory_length = (end - start + 1) << PAGE_SHIFT; +user_ivmds[nr_ivmd].header.flags = ACPI_IVMD_UNITY | + ACPI_IVMD_READ | ACPI_IVMD_WRITE; +
[PATCH v7 6/8] AMD/IOMMU: provide function backing XENMEM_reserved_device_memory_map
Just like for VT-d, exclusion / unity map ranges would better be reflected in e.g. the guest's E820 map. The reporting infrastructure was put in place still pretty tailored to VT-d's needs; extend get_reserved_device_memory() to allow vendor specific code to probe whether a particular (seg,bus,dev,func) tuple would get its data actually recorded. I admit the de-duplication of entries is quite limited for now, but considering our trouble to find a system surfacing _any_ IVMD this is likely not a critical issue for this initial implementation. Signed-off-by: Jan Beulich Reviewed-by: Paul Durrant --- v7: Re-base. v5: New. --- a/xen/common/memory.c +++ b/xen/common/memory.c @@ -1042,6 +1042,9 @@ static int get_reserved_device_memory(xe if ( !(grdm->map.flags & XENMEM_RDM_ALL) && (sbdf != id) ) return 0; +if ( !nr ) +return 1; + if ( grdm->used_entries < grdm->map.nr_entries ) { struct xen_reserved_device_memory rdm = { --- a/xen/drivers/passthrough/amd/iommu.h +++ b/xen/drivers/passthrough/amd/iommu.h @@ -110,6 +110,7 @@ struct amd_iommu { struct ivrs_unity_map { bool read:1; bool write:1; +bool global:1; paddr_t addr; unsigned long length; struct ivrs_unity_map *next; @@ -236,6 +237,7 @@ int amd_iommu_reserve_domain_unity_map(s unsigned int flag); int amd_iommu_reserve_domain_unity_unmap(struct domain *d, const struct ivrs_unity_map *map); +int amd_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt); int __must_check amd_iommu_flush_iotlb_pages(struct domain *d, dfn_t dfn, unsigned long page_count, unsigned int flush_flags); --- a/xen/drivers/passthrough/amd/iommu_acpi.c +++ b/xen/drivers/passthrough/amd/iommu_acpi.c @@ -145,7 +145,7 @@ static int __init reserve_iommu_exclusio static int __init reserve_unity_map_for_device( uint16_t seg, uint16_t bdf, unsigned long base, -unsigned long length, bool iw, bool ir) +unsigned long length, bool iw, bool ir, bool global) { struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg); struct ivrs_unity_map 
*unity_map = ivrs_mappings[bdf].unity_map; @@ -164,7 +164,11 @@ static int __init reserve_unity_map_for_ */ if ( base == unity_map->addr && length == unity_map->length && ir == unity_map->read && iw == unity_map->write ) +{ +if ( global ) +unity_map->global = true; return 0; +} if ( unity_map->addr + unity_map->length > base && base + length > unity_map->addr ) @@ -183,6 +187,7 @@ static int __init reserve_unity_map_for_ unity_map->read = ir; unity_map->write = iw; +unity_map->global = global; unity_map->addr = base; unity_map->length = length; unity_map->next = ivrs_mappings[bdf].unity_map; @@ -222,7 +227,8 @@ static int __init register_range_for_all /* reserve r/w unity-mapped page entries for devices */ for ( bdf = rc = 0; !rc && bdf < ivrs_bdf_entries; bdf++ ) -rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir); +rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir, + true); } return rc; @@ -255,8 +261,10 @@ static int __init register_range_for_dev paddr_t length = limit + PAGE_SIZE - base; /* reserve unity-mapped page entries for device */ -rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir) ?: - reserve_unity_map_for_device(seg, req, base, length, iw, ir); +rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir, + false) ?: + reserve_unity_map_for_device(seg, req, base, length, iw, ir, + false); } else { @@ -292,9 +300,9 @@ static int __init register_range_for_iom req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id; rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length, - iw, ir) ?: + iw, ir, false) ?: reserve_unity_map_for_device(iommu->seg, req, base, length, - iw, ir); + iw, ir, false); } return rc; --- a/xen/drivers/passthrough/amd/iommu_map.c +++ b/xen/drivers/passthrough/amd/iommu_map.c @@ -467,6 +467,81 @@ int amd_iommu_reserve_domain_unity_unmap return rc; } +int amd_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt) +{ +unsigned int seg = 0 /* XXX */, bdf; +const struct 
ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg); +/* At least for global entries,
[PATCH v7 5/8] AMD/IOMMU: also insert IVMD ranges into Dom0's page tables
So far only one region would be taken care of, if it can be placed in the exclusion range registers of the IOMMU. Take care of further ranges as well. Seeing that we've been doing fine without this, make both insertion and removal best effort only. Signed-off-by: Jan Beulich Reviewed-by: Paul Durrant --- v5: New. --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c @@ -522,6 +522,14 @@ static int amd_iommu_add_device(u8 devfn amd_iommu_flush_device(iommu, bdf); } +if ( amd_iommu_reserve_domain_unity_map( + pdev->domain, + ivrs_mappings[ivrs_mappings[bdf].dte_requestor_id].unity_map, + 0) ) +AMD_IOMMU_DEBUG("%pd: unity mapping failed for %04x:%02x:%02x.%u\n", +pdev->domain, pdev->seg, pdev->bus, PCI_SLOT(devfn), +PCI_FUNC(devfn)); + return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev); } @@ -547,6 +555,14 @@ static int amd_iommu_remove_device(u8 de ivrs_mappings = get_ivrs_mappings(pdev->seg); bdf = PCI_BDF2(pdev->bus, devfn); + +if ( amd_iommu_reserve_domain_unity_unmap( + pdev->domain, + ivrs_mappings[ivrs_mappings[bdf].dte_requestor_id].unity_map) ) +AMD_IOMMU_DEBUG("%pd: unity unmapping failed for %04x:%02x:%02x.%u\n", +pdev->domain, pdev->seg, pdev->bus, PCI_SLOT(devfn), +PCI_FUNC(devfn)); + if ( amd_iommu_perdev_intremap && ivrs_mappings[bdf].dte_requestor_id == bdf && ivrs_mappings[bdf].intremap_table )
RE: [XEN RFC PATCH 26/40] xen/arm: Add boot and secondary CPU to NUMA system
Hi Julien, > -Original Message- > From: Julien Grall > Sent: 2021年8月26日 0:58 > To: Wei Chen ; xen-devel@lists.xenproject.org; > sstabell...@kernel.org; jbeul...@suse.com > Cc: Bertrand Marquis > Subject: Re: [XEN RFC PATCH 26/40] xen/arm: Add boot and secondary CPU to > NUMA system > > Hi Wei, > > On 11/08/2021 11:24, Wei Chen wrote: > > When cpu boot up, we have add them to NUMA system. In current > > stage, we have not parsed the NUMA data, but we have created > > a fake NUMA node. So, in this patch, all CPU will be added > > to NUMA node#0. After the NUMA data has been parsed from device > > tree, the CPU will be added to correct NUMA node as the NUMA > > data described. > > > > Signed-off-by: Wei Chen > > --- > > xen/arch/arm/setup.c | 6 ++ > > xen/arch/arm/smpboot.c | 6 ++ > > xen/include/asm-arm/numa.h | 1 + > > 3 files changed, 13 insertions(+) > > > > diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c > > index 3c58d2d441..7531989f21 100644 > > --- a/xen/arch/arm/setup.c > > +++ b/xen/arch/arm/setup.c > > @@ -918,6 +918,12 @@ void __init start_xen(unsigned long > boot_phys_offset, > > > > processor_id(); > > > > +/* > > + * If Xen is running on a NUMA off system, there will > > + * be a node#0 at least. > > + */ > > +numa_add_cpu(0); > > + > > smp_init_cpus(); > > cpus = smp_get_max_cpus(); > > printk(XENLOG_INFO "SMP: Allowing %u CPUs\n", cpus); > > diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c > > index a1ee3146ef..aa78958c07 100644 > > --- a/xen/arch/arm/smpboot.c > > +++ b/xen/arch/arm/smpboot.c > > @@ -358,6 +358,12 @@ void start_secondary(void) > >*/ > > smp_wmb(); > > > > +/* > > + * If Xen is running on a NUMA off system, there will > > + * be a node#0 at least. > > + */ > > +numa_add_cpu(cpuid); > > + > > On x86, numa_add_cpu() will be called before the pCPU is brought up. I > am not quite too sure why we are doing it differently here. Can you > clarify it? Of course we can invoke numa_add_cpu before cpu_up as x86. 
But in my tests, I found that when a CPU failed to come up, it would still be added to NUMA. Although this does not affect the execution of the code (because the CPU is offline), I don't think adding an offline CPU to NUMA makes sense. > > > /* Now report this CPU is up */ > > cpumask_set_cpu(cpuid, _online_map); > > > > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h > > index 7a3588ac7f..dd31324b0b 100644 > > --- a/xen/include/asm-arm/numa.h > > +++ b/xen/include/asm-arm/numa.h > > @@ -59,6 +59,7 @@ extern mfn_t first_valid_mfn; > > #define __node_distance(a, b) (20) > > > > #define numa_init(x) do { } while (0) > > +#define numa_add_cpu(x) do { } while (0) > > This is a stub for a common helper. So I think this wants to be moved > in the !CONFIG_NUMA in xen/numa.h. > OK > > #define numa_set_node(x, y) do { } while (0) > > > > #endif > > > > Cheers, > > -- > Julien Grall
[PATCH v7 4/8] AMD/IOMMU: check IVMD ranges against host implementation limits
When such ranges can't be represented as 1:1 mappings in page tables, reject them as presumably bogus. Note that when we detect features late (because of EFRSup being clear in the ACPI tables), it would be quite a bit of work to check for (and drop) out of range IVMD ranges, so IOMMU initialization gets failed in this case instead. Signed-off-by: Jan Beulich Reviewed-by: Paul Durrant --- v7: Re-base. v6: Re-base. v5: New. --- a/xen/drivers/passthrough/amd/iommu.h +++ b/xen/drivers/passthrough/amd/iommu.h @@ -305,6 +305,7 @@ extern struct hpet_sbdf { } hpet_sbdf; extern unsigned int amd_iommu_acpi_info; +extern unsigned int amd_iommu_max_paging_mode; extern int amd_iommu_min_paging_mode; extern void *shared_intremap_table; @@ -358,7 +359,7 @@ static inline int amd_iommu_get_paging_m while ( max_frames > PTE_PER_TABLE_SIZE ) { max_frames = PTE_PER_TABLE_ALIGN(max_frames) >> PTE_PER_TABLE_SHIFT; -if ( ++level > 6 ) +if ( ++level > amd_iommu_max_paging_mode ) return -ENOMEM; } --- a/xen/drivers/passthrough/amd/iommu_acpi.c +++ b/xen/drivers/passthrough/amd/iommu_acpi.c @@ -370,6 +370,7 @@ static int __init parse_ivmd_device_iomm static int __init parse_ivmd_block(const struct acpi_ivrs_memory *ivmd_block) { unsigned long start_addr, mem_length, base, limit; +unsigned int addr_bits; bool iw = true, ir = true, exclusion = false; if ( ivmd_block->header.length < sizeof(*ivmd_block) ) @@ -386,6 +387,17 @@ static int __init parse_ivmd_block(const AMD_IOMMU_DEBUG("IVMD Block: type %#x phys %#lx len %#lx\n", ivmd_block->header.type, start_addr, mem_length); +addr_bits = min(MASK_EXTR(amd_iommu_acpi_info, ACPI_IVRS_PHYSICAL_SIZE), +MASK_EXTR(amd_iommu_acpi_info, ACPI_IVRS_VIRTUAL_SIZE)); +if ( amd_iommu_get_paging_mode(PFN_UP(start_addr + mem_length)) < 0 || + (addr_bits < BITS_PER_LONG && + ((start_addr + mem_length - 1) >> addr_bits)) ) +{ +AMD_IOMMU_DEBUG("IVMD: [%lx,%lx) is not IOMMU addressable\n", +start_addr, start_addr + mem_length); +return 0; +} + if ( 
!e820_all_mapped(base, limit + PAGE_SIZE, E820_RESERVED) ) { paddr_t addr; --- a/xen/drivers/passthrough/amd/iommu_detect.c +++ b/xen/drivers/passthrough/amd/iommu_detect.c @@ -67,6 +67,9 @@ void __init get_iommu_features(struct am iommu->features.raw = readq(iommu->mmio_base + IOMMU_EXT_FEATURE_MMIO_OFFSET); + +if ( 4 + iommu->features.flds.hats < amd_iommu_max_paging_mode ) +amd_iommu_max_paging_mode = 4 + iommu->features.flds.hats; } /* Don't log the same set of features over and over. */ @@ -200,6 +203,10 @@ int __init amd_iommu_detect_one_acpi( else if ( list_empty(&amd_iommu_head) ) AMD_IOMMU_DEBUG("EFRSup not set in ACPI table; will fall back to hardware\n"); +if ( (amd_iommu_acpi_info & ACPI_IVRS_EFR_SUP) && + 4 + iommu->features.flds.hats < amd_iommu_max_paging_mode ) +amd_iommu_max_paging_mode = 4 + iommu->features.flds.hats; + /* override IOMMU HT flags */ iommu->ht_flags = ivhd_block->header.flags; --- a/xen/drivers/passthrough/amd/iommu_init.c +++ b/xen/drivers/passthrough/amd/iommu_init.c @@ -1376,6 +1376,13 @@ static int __init amd_iommu_prepare_one( get_iommu_features(iommu); +/* + * Late extended feature determination may cause previously mappable + * IVMD ranges to become unmappable. + */ +if ( amd_iommu_max_paging_mode < amd_iommu_min_paging_mode ) +return -ERANGE; + return 0; } --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c @@ -254,6 +254,7 @@ int amd_iommu_alloc_root(struct domain * return 0; } +unsigned int __read_mostly amd_iommu_max_paging_mode = 6; int __read_mostly amd_iommu_min_paging_mode = 1; static int amd_iommu_domain_init(struct domain *d)
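The IVMD check in this patch relies on amd_iommu_get_paging_mode() failing once a range needs more page-table levels than the host supports. A simplified standalone version of that computation (the constants match the patch; the surrounding Xen scaffolding is omitted, and the variable here is file-local for illustration):

```c
#include <assert.h>
#include <errno.h>

#define PTE_PER_TABLE_SHIFT 9
#define PTE_PER_TABLE_SIZE  (1UL << PTE_PER_TABLE_SHIFT)
#define PTE_PER_TABLE_ALIGN(x) \
    (((x) + PTE_PER_TABLE_SIZE - 1) & ~(PTE_PER_TABLE_SIZE - 1))

/* Host limit as lowered by the detected HATS field; 6 is the default. */
static unsigned int amd_iommu_max_paging_mode = 6;

static int get_paging_mode(unsigned long max_frames)
{
    int level = 1;

    /* Each additional level multiplies the reachable frame count by 512. */
    while (max_frames > PTE_PER_TABLE_SIZE) {
        max_frames = PTE_PER_TABLE_ALIGN(max_frames) >> PTE_PER_TABLE_SHIFT;
        if (++level > amd_iommu_max_paging_mode)
            return -ENOMEM; /* range not representable in the page tables */
    }
    return level;
}
```

With the default limit of 6 the function behaves as before the patch; lowering amd_iommu_max_paging_mode (as get_iommu_features() now does from HATS) makes large IVMD ranges fail the check, which is exactly what parse_ivmd_block() tests for.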
[PATCH v7 3/8] AMD/IOMMU: improve (extended) feature detection
First of all, the documentation is very clear about ACPI table data superseding raw register data. Use raw register data only if EFRSup is clear in the ACPI tables (which may still go too far). Additionally, if this flag is clear, the IVRS type 11H table is reserved and hence may not be recognized. Furthermore, propagate IVRS type 10H data into the feature flags recorded, as the full extended features field is available in type 11H only. Note that this also makes it necessary to stop the bad practice of us finding a type 11H IVHD entry, but still processing the type 10H one in detect_iommu_acpi()'s invocation of amd_iommu_detect_one_acpi(). Note also that the features.raw check in amd_iommu_prepare_one() needs replacing, now that the field can also be populated by different means. Key IOMMUv2 availability off of IVHD type not being 10H, and then move it a function layer up, so that it would be set only once all IOMMUs have been successfully prepared. Signed-off-by: Jan Beulich Reviewed-by: Paul Durrant --- v7: Re-base. v5: New.
--- a/xen/drivers/passthrough/amd/iommu.h +++ b/xen/drivers/passthrough/amd/iommu.h @@ -304,6 +304,7 @@ extern struct hpet_sbdf { } init; } hpet_sbdf; +extern unsigned int amd_iommu_acpi_info; extern int amd_iommu_min_paging_mode; extern void *shared_intremap_table; --- a/xen/drivers/passthrough/amd/iommu_acpi.c +++ b/xen/drivers/passthrough/amd/iommu_acpi.c @@ -1051,7 +1051,8 @@ static void __init dump_acpi_table_heade static inline bool_t is_ivhd_block(u8 type) { return (type == ACPI_IVRS_TYPE_HARDWARE || -type == ACPI_IVRS_TYPE_HARDWARE_11H); +((amd_iommu_acpi_info & ACPI_IVRS_EFR_SUP) && + type == ACPI_IVRS_TYPE_HARDWARE_11H)); } static inline bool_t is_ivmd_block(u8 type) @@ -1159,7 +1160,7 @@ static int __init detect_iommu_acpi(stru ivrs_block = (struct acpi_ivrs_header *)((u8 *)table + length); if ( table->length < (length + ivrs_block->length) ) return -ENODEV; -if ( ivrs_block->type == ACPI_IVRS_TYPE_HARDWARE && +if ( ivrs_block->type == ivhd_type && amd_iommu_detect_one_acpi(to_ivhd_block(ivrs_block)) != 0 ) return -ENODEV; length += ivrs_block->length; @@ -1299,6 +1300,9 @@ get_supported_ivhd_type(struct acpi_tabl return -ENODEV; } +amd_iommu_acpi_info = container_of(table, const struct acpi_table_ivrs, + header)->info; + while ( table->length > (length + sizeof(*ivrs_block)) ) { ivrs_block = (struct acpi_ivrs_header *)((u8 *)table + length); --- a/xen/drivers/passthrough/amd/iommu_detect.c +++ b/xen/drivers/passthrough/amd/iommu_detect.c @@ -60,14 +60,14 @@ void __init get_iommu_features(struct am const struct amd_iommu *first; ASSERT( iommu->mmio_base ); -if ( !iommu_has_cap(iommu, PCI_CAP_EFRSUP_SHIFT) ) +if ( !(amd_iommu_acpi_info & ACPI_IVRS_EFR_SUP) ) { -iommu->features.raw = 0; -return; -} +if ( !iommu_has_cap(iommu, PCI_CAP_EFRSUP_SHIFT) ) +return; -iommu->features.raw = -readq(iommu->mmio_base + IOMMU_EXT_FEATURE_MMIO_OFFSET); +iommu->features.raw = +readq(iommu->mmio_base + IOMMU_EXT_FEATURE_MMIO_OFFSET); +} /* Don't log the same set of 
features over and over. */ first = list_first_entry(&amd_iommu_head, struct amd_iommu, list); @@ -164,6 +164,42 @@ int __init amd_iommu_detect_one_acpi( iommu->cap_offset = ivhd_block->capability_offset; iommu->mmio_base_phys = ivhd_block->base_address; +if ( ivhd_type != ACPI_IVRS_TYPE_HARDWARE ) +iommu->features.raw = ivhd_block->efr_image; +else if ( amd_iommu_acpi_info & ACPI_IVRS_EFR_SUP ) +{ +union { +uint32_t raw; +struct { +unsigned int xt_sup:1; +unsigned int nx_sup:1; +unsigned int gt_sup:1; +unsigned int glx_sup:2; +unsigned int ia_sup:1; +unsigned int ga_sup:1; +unsigned int he_sup:1; +unsigned int pas_max:5; +unsigned int pn_counters:4; +unsigned int pn_banks:6; +unsigned int msi_num_ppr:5; +unsigned int gats:2; +unsigned int hats:2; +}; +} attr = { .raw = ivhd_block->iommu_attr }; + +iommu->features.flds.xt_sup = attr.xt_sup; +iommu->features.flds.nx_sup = attr.nx_sup; +iommu->features.flds.gt_sup = attr.gt_sup; +iommu->features.flds.glx_sup = attr.glx_sup; +iommu->features.flds.ia_sup = attr.ia_sup; +iommu->features.flds.ga_sup = attr.ga_sup; +iommu->features.flds.pas_max = attr.pas_max; +iommu->features.flds.gats = attr.gats; +iommu->features.flds.hats = attr.hats; +} +else if ( list_empty(&amd_iommu_head) ) +
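Only a handful of the type-10H attribute bits map onto the extended-feature register, which is why the patch copies them field by field. A standalone sketch of the union involved (a hypothetical name mirroring the patch's layout; bitfield allocation from the least-significant bit upward is assumed, as on the usual little-endian GCC/Clang ABI the kernel code also relies on):

```c
#include <assert.h>
#include <stdint.h>

/* IVHD type-10H "IOMMU attributes" dword, laid out as in the patch above. */
union ivhd10_attr {
    uint32_t raw;
    struct {
        uint32_t xt_sup:1;       /* bit 0 */
        uint32_t nx_sup:1;
        uint32_t gt_sup:1;
        uint32_t glx_sup:2;
        uint32_t ia_sup:1;
        uint32_t ga_sup:1;
        uint32_t he_sup:1;
        uint32_t pas_max:5;      /* bits 8-12 */
        uint32_t pn_counters:4;
        uint32_t pn_banks:6;
        uint32_t msi_num_ppr:5;
        uint32_t gats:2;         /* bits 28-29 */
        uint32_t hats:2;         /* bits 30-31 */
    };
};
```

Reading a field then amounts to one masked shift done by the compiler, which keeps the per-field copies into iommu->features.flds straightforward.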
[PATCH v7 2/8] AMD/IOMMU: obtain IVHD type to use earlier
Doing this in amd_iommu_prepare() is too late for it, in particular, to be used in amd_iommu_detect_one_acpi(), as a subsequent change will want to do. Moving it immediately ahead of amd_iommu_detect_acpi() is (luckily) pretty simple, (pretty importantly) without breaking amd_iommu_prepare()'s logic to prevent multiple processing. This involves moving table checksumming, as amd_iommu_get_supported_ivhd_type() -> get_supported_ivhd_type() will now be invoked before amd_iommu_detect_acpi() -> detect_iommu_acpi(). In the course of doing so, stop open-coding acpi_tb_checksum(), seeing that we have other uses of this originally ACPI-private function elsewhere in the tree. Signed-off-by: Jan Beulich --- v7: Move table checksumming. v5: New. --- a/xen/drivers/passthrough/amd/iommu_acpi.c +++ b/xen/drivers/passthrough/amd/iommu_acpi.c @@ -22,6 +22,8 @@ #include +#include + #include "iommu.h" /* Some helper structures, particularly to deal with ranges. */ @@ -1150,20 +1152,7 @@ static int __init parse_ivrs_table(struc static int __init detect_iommu_acpi(struct acpi_table_header *table) { const struct acpi_ivrs_header *ivrs_block; -unsigned long i; unsigned long length = sizeof(struct acpi_table_ivrs); -u8 checksum, *raw_table; - -/* validate checksum: sum of entire table == 0 */ -checksum = 0; -raw_table = (u8 *)table; -for ( i = 0; i < table->length; i++ ) -checksum += raw_table[i]; -if ( checksum ) -{ -AMD_IOMMU_DEBUG("IVRS Error: Invalid Checksum %#x\n", checksum); -return -ENODEV; -} while ( table->length > (length + sizeof(*ivrs_block)) ) { @@ -1300,6 +1289,15 @@ get_supported_ivhd_type(struct acpi_tabl { size_t length = sizeof(struct acpi_table_ivrs); const struct acpi_ivrs_header *ivrs_block, *blk = NULL; +uint8_t checksum; + +/* Validate checksum: Sum of entire table == 0. 
*/ +checksum = acpi_tb_checksum(ACPI_CAST_PTR(uint8_t, table), table->length); +if ( checksum ) +{ +AMD_IOMMU_DEBUG("IVRS Error: Invalid Checksum %#x\n", checksum); +return -ENODEV; +} while ( table->length > (length + sizeof(*ivrs_block)) ) { --- a/xen/drivers/passthrough/amd/iommu_init.c +++ b/xen/drivers/passthrough/amd/iommu_init.c @@ -1398,15 +1398,9 @@ int __init amd_iommu_prepare(bool xt) goto error_out; /* Have we been here before? */ -if ( ivhd_type ) +if ( ivrs_bdf_entries ) return 0; -rc = amd_iommu_get_supported_ivhd_type(); -if ( rc < 0 ) -goto error_out; -BUG_ON(!rc); -ivhd_type = rc; - rc = amd_iommu_get_ivrs_dev_entries(); if ( !rc ) rc = -ENODEV; --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c @@ -179,9 +179,17 @@ static int __must_check amd_iommu_setup_ int __init acpi_ivrs_init(void) { +int rc; + if ( !iommu_enable && !iommu_intremap ) return 0; +rc = amd_iommu_get_supported_ivhd_type(); +if ( rc < 0 ) +return rc; +BUG_ON(!rc); +ivhd_type = rc; + if ( (amd_iommu_detect_acpi() !=0) || (iommu_found() == 0) ) { iommu_intremap = iommu_intremap_off;
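For reference, the checksum rule enforced here — by the removed open-coded loop and by acpi_tb_checksum() alike — is that all bytes of the table, including the checksum byte itself, sum to zero modulo 256. A minimal illustration (hypothetical helper name):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of every byte of the table, mod 256; a valid table yields 0. */
static uint8_t table_checksum(const uint8_t *table, size_t length)
{
    uint8_t sum = 0;
    size_t i;

    for (i = 0; i < length; i++)
        sum += table[i];
    return sum;
}
```

The firmware picks the checksum byte as the two's complement of the sum of all other bytes, so any single-byte corruption makes the total nonzero and the table gets rejected with -ENODEV.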
[PATCH v7 1/8] AMD/IOMMU: check / convert IVMD ranges for being / to be reserved
While the specification doesn't say so, just like for VT-d's RMRRs no good can come from these ranges being e.g. conventional RAM or entirely unmarked and hence usable for placing e.g. PCI device BARs. Check whether they are, and put in some limited effort to convert to reserved. (More advanced logic can be added if actual problems are found with this simplistic variant.) Signed-off-by: Jan Beulich Reviewed-by: Paul Durrant --- v7: Re-base. v5: New. --- a/xen/drivers/passthrough/amd/iommu_acpi.c +++ b/xen/drivers/passthrough/amd/iommu_acpi.c @@ -384,6 +384,38 @@ static int __init parse_ivmd_block(const AMD_IOMMU_DEBUG("IVMD Block: type %#x phys %#lx len %#lx\n", ivmd_block->header.type, start_addr, mem_length); +if ( !e820_all_mapped(base, limit + PAGE_SIZE, E820_RESERVED) ) +{ +paddr_t addr; + +AMD_IOMMU_DEBUG("IVMD: [%lx,%lx) is not (entirely) in reserved memory\n", +base, limit + PAGE_SIZE); + +for ( addr = base; addr <= limit; addr += PAGE_SIZE ) +{ +unsigned int type = page_get_ram_type(maddr_to_mfn(addr)); + +if ( type == RAM_TYPE_UNKNOWN ) +{ +if ( e820_add_range(&e820, addr, addr + PAGE_SIZE, +E820_RESERVED) ) +continue; +AMD_IOMMU_DEBUG("IVMD Error: Page at %lx couldn't be reserved\n", +addr); +return -EIO; +} + +/* Types which won't be handed out are considered good enough. */ +if ( !(type & (RAM_TYPE_RESERVED | RAM_TYPE_ACPI | + RAM_TYPE_UNUSABLE)) ) +continue; + +AMD_IOMMU_DEBUG("IVMD Error: Page at %lx can't be converted\n", +addr); +return -EIO; +} +} + if ( ivmd_block->header.flags & ACPI_IVMD_EXCLUSION_RANGE ) exclusion = true; else if ( ivmd_block->header.flags & ACPI_IVMD_UNITY )
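The page-by-page walk in this patch can be modelled in isolation: unmarked pages are converted to reserved, pages that will never be handed out pass, and anything else (e.g. conventional RAM) rejects the IVMD. A simplified sketch, with an array standing in for the e820 map and page-type lookups (all names here are illustrative, not Xen's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified page classifications standing in for Xen's RAM_TYPE_* flags. */
enum page_type { PT_UNKNOWN, PT_RESERVED, PT_ACPI, PT_UNUSABLE, PT_CONVENTIONAL };

/*
 * Walk an IVMD range page by page, mirroring the patch's policy:
 * unmarked pages get converted to reserved, pages that won't be handed
 * out (reserved/ACPI/unusable) pass, conventional RAM is rejected.
 * Returns true if the whole range is acceptable.
 */
static bool ivmd_range_ok(enum page_type *pages, size_t npages)
{
    size_t i;

    for (i = 0; i < npages; i++) {
        switch (pages[i]) {
        case PT_UNKNOWN:
            pages[i] = PT_RESERVED;   /* the e820_add_range() step */
            break;
        case PT_RESERVED:
        case PT_ACPI:
        case PT_UNUSABLE:
            break;                    /* won't be handed out: good enough */
        default:
            return false;             /* conventional RAM: reject the IVMD */
        }
    }
    return true;
}
```

The real code additionally has to cope with e820_add_range() itself failing, which this sketch leaves out.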