Re: [PATCH] xen: fix multicall debug data referencing
On 7/18/24 2:21 AM, Juergen Gross wrote: The recent adding of multicall debug mixed up the referencing of the debug data. A __percpu tagged pointer can't be initialized with a plain pointer, so use another percpu variable for the pointer and set it on each new cpu via a function. Fixes: 942d917cb92a ("xen: make multicall debug boot time selectable") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202407151106.5s7mnfpz-...@intel.com/ Signed-off-by: Juergen Gross IIUIC we only need data until SMP is initialized, so setting per-cpu pointer just on cpu0 may be sufficient. Not sure where to do this though. But this works as well. Reviewed-by: Boris Ostrovsky
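A minimal sketch of the pattern the fix describes (identifiers are illustrative, not copied from the actual patch): instead of statically initializing a __percpu tagged pointer with a plain pointer, keep a per-cpu pointer variable and set it from a function as each CPU is brought up.

static DEFINE_PER_CPU(struct mc_debug_data, mc_debug_data_buf);
static DEFINE_PER_CPU(struct mc_debug_data *, mc_debug_data);

/* Run once per CPU being brought up, including the boot CPU. */
void mc_percpu_init(unsigned int cpu)
{
        per_cpu(mc_debug_data, cpu) = &per_cpu(mc_debug_data_buf, cpu);
}

The multicall code would then use __this_cpu_read(mc_debug_data) rather than dereferencing the __percpu pointer directly.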
Re: [PATCH] x86/xen: remove deprecated xen_nopvspin boot parameter
On 7/10/24 7:01 AM, Juergen Gross wrote: The xen_nopvspin boot parameter has been deprecated since 2019. nopvspin can be used instead. Remove the xen_nopvspin boot parameter and replace the xen_pvspin variable use cases with nopvspin. This requires moving the nopvspin variable out of the .initdata section, as it needs to be accessed for cpu hotplug, too. Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: [PATCH 0/2] x86/xen: cleanup of xen private headers
On 7/10/24 5:37 AM, Juergen Gross wrote: Cleanup the private header files in arch/x86/xen by merging them into one file and by removing unneeded stuff. Juergen Gross (2): x86/xen: make some functions static x86/xen: eliminate some private header files arch/x86/xen/apic.c | 2 - arch/x86/xen/debugfs.c | 2 +- arch/x86/xen/debugfs.h | 7 -- arch/x86/xen/enlighten.c | 2 - arch/x86/xen/enlighten_hvm.c | 2 - arch/x86/xen/enlighten_pv.c | 4 - arch/x86/xen/mmu.c | 3 +- arch/x86/xen/mmu.h | 28 --- arch/x86/xen/mmu_hvm.c | 2 +- arch/x86/xen/mmu_pv.c| 15 ++-- arch/x86/xen/multicalls.c| 3 +- arch/x86/xen/multicalls.h| 69 arch/x86/xen/p2m.c | 2 - arch/x86/xen/pmu.c | 1 - arch/x86/xen/pmu.h | 22 -- arch/x86/xen/setup.c | 1 - arch/x86/xen/smp.c | 1 - arch/x86/xen/smp.h | 51 arch/x86/xen/smp_hvm.c | 2 - arch/x86/xen/smp_pv.c| 3 - arch/x86/xen/suspend.c | 2 - arch/x86/xen/xen-ops.h | 148 ++- include/xen/events.h | 2 + 23 files changed, 158 insertions(+), 216 deletions(-) delete mode 100644 arch/x86/xen/debugfs.h delete mode 100644 arch/x86/xen/mmu.h delete mode 100644 arch/x86/xen/multicalls.h delete mode 100644 arch/x86/xen/pmu.h delete mode 100644 arch/x86/xen/smp.h Reviewed-by: Boris Ostrovsky
Re: [PATCH v2] xen: make multicall debug boot time selectable
On 7/10/24 5:27 AM, Juergen Gross wrote: Today Xen multicall debugging needs to be enabled via modifying a define in a source file for getting debug data of multicall errors encountered by users. Switch multicall debugging to depend on a boot parameter "xen_mc_debug" instead, enabling affected users to boot with the new parameter set in order to get better diagnostics. With debugging enabled print the following information in case at least one of the batched calls failed: - all calls of the batch with operation, result and caller - all parameters of each call - all parameters stored in the multicall data for each call Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: [PATCH] xen: make multicall debug boot time selectable
On 7/8/24 5:04 AM, Jürgen Groß wrote: On 06.07.24 00:36, boris.ostrov...@oracle.com wrote: Also, would it be better to keep these fields as a struct of scalars and instead have the percpu array of this struct? Otherwise there is a whole bunch of [MC_BATCH] arrays, all of them really indexed by the same value. (And while at it, there is no reason to have callbacks[MC_BATCH] sized like that -- it has nothing to do with batch size and can probably be made smaller) As today the mc_buffer's entries are copied via a single memcpy(), there are 3 options: Ah yes, it's memcpy, I didn't think of that. Then leaving it as is is the best. - make mc_debug_data a percpu pointer to a single array, requiring to copy the mc_buffer's entries in a loop - let struct mc_debug_data contain two arrays (entries[] and struct foo {}[], with struct foo containing the other pointers/values) - keep the layout as in my patch Regarding the callbacks: I think the max number of callbacks is indeed MC_BATCH, as for each batch member one callback might be requested. So I'd rather keep it the way it is today. Right, I was trying to point out that it's the max number but I suspect it usually is smaller -- we currently ask for a callback in fewer than half of the cases where we submit a request. -boris
Re: [PATCH] xen: make multicall debug boot time selectable
On 7/3/24 7:56 AM, Juergen Gross wrote: #define MC_BATCH 32 -#define MC_DEBUG 0 - #define MC_ARGS (MC_BATCH * 16) struct mc_buffer { unsigned mcidx, argidx, cbidx; struct multicall_entry entries[MC_BATCH]; -#if MC_DEBUG - struct multicall_entry debug[MC_BATCH]; - void *caller[MC_BATCH]; -#endif unsigned char args[MC_ARGS]; struct callback { void (*fn)(void *); @@ -50,13 +46,84 @@ struct mc_buffer { } callbacks[MC_BATCH]; }; +struct mc_debug_data { + struct multicall_entry debug[MC_BATCH]; 'entries'? It's a mc_debug_data's copy of mc_buffer's entries. Also, would it be better to keep these fields as a struct of scalars and instead have the percpu array of this struct? Otherwise there is a whole bunch of [MC_BATCH] arrays, all of them really indexed by the same value. (And while at it, there is no reason to have callbacks[MC_BATCH] sized like that -- it has nothing to do with batch size and can probably be made smaller) + void *caller[MC_BATCH]; + size_t argsz[MC_BATCH]; +}; + static DEFINE_PER_CPU(struct mc_buffer, mc_buffer); +static struct mc_debug_data __percpu *mc_debug_data; +static struct mc_debug_data mc_debug_data_early __initdata; How about (I think this should work): static struct mc_debug_data __percpu *mc_debug_data __refdata = &mc_debug_data_early; Then you won't need get_mc_debug_ptr(). -boris
Re: [PATCH] x86/xen: fix percpu vcpu_info allocation
On 11/24/23 2:48 AM, Juergen Gross wrote: Today the percpu struct vcpu_info is allocated via DEFINE_PER_CPU(), meaning that it could cross a page boundary. In this case registering it with the hypervisor will fail, resulting in a panic(). This can easily be fixed by using DEFINE_PER_CPU_ALIGNED() instead, as struct vcpu_info is guaranteed to have a size of 64 bytes, matching the cache line size of x86 64-bit processors (Xen doesn't support 32-bit processors). Fixes: 5ead97c84fa7 ("xen: Core Xen implementation") Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky although I am not sure of the usefulness of the BUILD_BUG_ON --- 64 bytes is part of the ABI and the hypervisor already has its own BUILD_BUG_ON for this. -boris @@ -160,6 +163,7 @@ void xen_vcpu_setup(int cpu) int err; struct vcpu_info *vcpup; + BUILD_BUG_ON(sizeof(*vcpup) > SMP_CACHE_BYTES); BUG_ON(HYPERVISOR_shared_info == &xen_dummy_shared_info);
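For context, the allocation change under review amounts to something like the following (assuming the existing per-cpu variable is xen_vcpu_info; the diff is illustrative, not quoted from the patch):

-DEFINE_PER_CPU(struct vcpu_info, xen_vcpu_info);
+/* 64-byte alignment keeps the 64-byte struct inside a single page. */
+DEFINE_PER_CPU_ALIGNED(struct vcpu_info, xen_vcpu_info);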
Re: [PATCH RESEND 2/2] x86/percpu, xen: Correct PER_CPU_VAR usage to include symbol and its addend
On 10/15/23 12:08 PM, Uros Bizjak wrote: The PER_CPU_VAR macro should be applied to a symbol and its addend. Inconsistent usage is currently harmless, but needs to be corrected before %rip-relative addressing is introduced to the PER_CPU_VAR macro. No functional changes intended. Cc: Juergen Gross Cc: Boris Ostrovsky Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: "H. Peter Anvin" Signed-off-by: Uros Bizjak Reviewed-by: Boris Ostrovsky
Re: [PATCH 3/3] x86/xen: allow nesting of same lazy mode
On 9/13/23 7:38 AM, Juergen Gross wrote: When running as a paravirtualized guest under Xen, Linux is using "lazy mode" for issuing hypercalls which don't need to take immediate effect in order to improve performance (examples are e.g. multiple PTE changes). There are two different lazy modes defined: MMU and CPU lazy mode. Today it is not possible to nest multiple lazy mode sections, even if they are of the same kind. A recent change in memory management added nesting of MMU lazy mode sections, resulting in a regression when running as a Xen PV guest. Technically there is no reason why nesting of multiple sections of the same kind of lazy mode shouldn't be allowed. So add support for that in order to fix the regression. Fixes: bcc6cc832573 ("mm: add default definition of set_ptes()") Signed-off-by: Juergen Gross For patches 2 and 3 Reviewed-by: Boris Ostrovsky
Re: [PATCH] xen: simplify evtchn_do_upcall() call maze
On 8/24/23 11:41 AM, Juergen Gross wrote: There are several functions involved for performing the functionality of evtchn_do_upcall(): - __xen_evtchn_do_upcall() doing the real work - xen_hvm_evtchn_do_upcall() just being a wrapper for __xen_evtchn_do_upcall(), exposed for external callers - xen_evtchn_do_upcall() calling __xen_evtchn_do_upcall(), too, but without any user Simplify this maze by: - removing the unused xen_evtchn_do_upcall() - removing xen_hvm_evtchn_do_upcall() as the only left caller of __xen_evtchn_do_upcall(), while renaming __xen_evtchn_do_upcall() to xen_evtchn_do_upcall() Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: [PATCH v1] xen: remove a confusing comment on auto-translated guest I/O
On 8/2/23 12:31 PM, Petr Tesarik wrote: From: Petr Tesarik After removing the conditional return from xen_create_contiguous_region(), the accompanying comment was left in place, but it now precedes an unrelated conditional and confuses readers. Fixes: 989513a735f5 ("xen: cleanup pvh leftovers from pv-only sources") Signed-off-by: Petr Tesarik Reviewed-by: Boris Ostrovsky
Re: [PATCH] x86/xen: fix secondary processor fpu initialization
On 7/3/23 9:00 AM, Juergen Gross wrote: Moving the call of fpu__init_cpu() from cpu_init() to start_secondary() broke Xen PV guests, as those don't call start_secondary() for APs. Fix that by adding the call of fpu__init_cpu() to cpu_bringup(), which is the Xen PV replacement of start_secondary(). Fixes: b81fac906a8f ("x86/fpu: Move FPU initialization into arch_cpu_finalize_init()") Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: [PATCH] x86/xen: set default memory type for pv guests to WB
On 6/15/23 8:39 AM, Juergen Gross wrote: When running as an unprivileged PV-guest under Xen (not dom0), the default MTRR memory type should be write-back. Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: [PATCH v4 0/2] x86: xen: add missing prototypes
On 6/14/23 3:34 AM, Juergen Gross wrote: Avoid missing prototype warnings. Arnd Bergmann (1): x86: xen: add missing prototypes Juergen Gross (1): x86/xen: add prototypes for paravirt mmu functions arch/x86/xen/efi.c | 2 ++ arch/x86/xen/mmu_pv.c | 16 arch/x86/xen/smp.h | 4 arch/x86/xen/smp_pv.c | 1 - arch/x86/xen/xen-ops.h | 3 +++ include/xen/xen.h | 3 +++ 6 files changed, 28 insertions(+), 1 deletion(-) Reviewed-by: Boris Ostrovsky
Re: [PATCH] [v2] x86: xen: add missing prototypes
On 5/23/23 4:37 PM, Arnd Bergmann wrote: On Sat, May 20, 2023, at 00:24, Boris Ostrovsky wrote: On 5/19/23 5:28 AM, Arnd Bergmann wrote: diff --git a/arch/x86/xen/smp.h b/arch/x86/xen/smp.h index 22fb982ff971..81a7821dd07f 100644 --- a/arch/x86/xen/smp.h +++ b/arch/x86/xen/smp.h @@ -1,7 +1,11 @@ /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _XEN_SMP_H +void asm_cpu_bringup_and_idle(void); +asmlinkage void cpu_bringup_and_idle(void); These can go under CONFIG_SMP Not sure if there is much point for the second one, since it's only called from assembler, so the #else path is never seen, but I can do that for consistency if you like. I generally prefer to avoid the extra #if checks when there is no strict need for an empty stub. Do we need the empty stubs? Neither of these should be referenced if !SMP (or more precisely if !CONFIG_XEN_PV_SMP) I guess you also want me to add the empty stubs for the other functions that only have a bit in #ifdef but not #else then? +void xen_force_evtchn_callback(void); These ... +pteval_t xen_pte_val(pte_t pte); +pgdval_t xen_pgd_val(pgd_t pgd); +pte_t xen_make_pte(pteval_t pte); +pgd_t xen_make_pgd(pgdval_t pgd); +pmdval_t xen_pmd_val(pmd_t pmd); +pmd_t xen_make_pmd(pmdval_t pmd); +pudval_t xen_pud_val(pud_t pud); +pud_t xen_make_pud(pudval_t pud); +p4dval_t xen_p4d_val(p4d_t p4d); +p4d_t xen_make_p4d(p4dval_t p4d); +pte_t xen_make_pte_init(pteval_t pte); +void xen_start_kernel(struct start_info *si); ... should go under '#ifdef CONFIG_XEN_PV' What should the stubs do here? I guess just return the unmodified value? Same here -- these should only be referenced if CONFIG_XEN_PV is defined which is why I suggested moving them under ifdef. -boris
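A sketch of the arrangement being suggested, assuming the prototypes really are only referenced from PV-only translation units so no stubs are needed:

#ifdef CONFIG_XEN_PV
void xen_force_evtchn_callback(void);
pteval_t xen_pte_val(pte_t pte);
pte_t xen_make_pte(pteval_t pte);
/* ... remaining PV-only prototypes ... */
void xen_start_kernel(struct start_info *si);
#endif /* CONFIG_XEN_PV */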
Re: [PATCH] [v2] x86: xen: add missing prototypes
On 5/19/23 5:28 AM, Arnd Bergmann wrote: diff --git a/arch/x86/xen/smp.h b/arch/x86/xen/smp.h index 22fb982ff971..81a7821dd07f 100644 --- a/arch/x86/xen/smp.h +++ b/arch/x86/xen/smp.h @@ -1,7 +1,11 @@ /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _XEN_SMP_H +void asm_cpu_bringup_and_idle(void); +asmlinkage void cpu_bringup_and_idle(void); These can go under CONFIG_SMP + #ifdef CONFIG_SMP + extern void xen_send_IPI_mask(const struct cpumask *mask, int vector); extern void xen_send_IPI_mask_allbutself(const struct cpumask *mask, diff --git a/arch/x86/xen/smp_pv.c b/arch/x86/xen/smp_pv.c index a92e8002b5cf..d5ae5de2daa2 100644 diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h index 6d7f6318fc07..0f71ee3fe86b 100644 --- a/arch/x86/xen/xen-ops.h +++ b/arch/x86/xen/xen-ops.h @@ -160,4 +160,18 @@ void xen_hvm_post_suspend(int suspend_cancelled); static inline void xen_hvm_post_suspend(int suspend_cancelled) {} #endif +void xen_force_evtchn_callback(void); These ... +pteval_t xen_pte_val(pte_t pte); +pgdval_t xen_pgd_val(pgd_t pgd); +pte_t xen_make_pte(pteval_t pte); +pgd_t xen_make_pgd(pgdval_t pgd); +pmdval_t xen_pmd_val(pmd_t pmd); +pmd_t xen_make_pmd(pmdval_t pmd); +pudval_t xen_pud_val(pud_t pud); +pud_t xen_make_pud(pudval_t pud); +p4dval_t xen_p4d_val(p4d_t p4d); +p4d_t xen_make_p4d(p4dval_t p4d); +pte_t xen_make_pte_init(pteval_t pte); +void xen_start_kernel(struct start_info *si); ... should go under '#ifdef CONFIG_XEN_PV' -boris
Re: [patch 16/37] x86/xen/smp_pv: Remove wait for CPU online
On 4/14/23 7:44 PM, Thomas Gleixner wrote: Now that the core code drops sparse_irq_lock after the idle thread synchronized, it's pointless to wait for the AP to mark itself online. Whether the control CPU runs in a wait loop or sleeps in the core code waiting for the online operation to complete makes no difference. Signed-off-by: Thomas Gleixner Cc: Juergen Gross Cc: Boris Ostrovsky Cc: xen-devel@lists.xenproject.org --- arch/x86/xen/smp_pv.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) --- a/arch/x86/xen/smp_pv.c +++ b/arch/x86/xen/smp_pv.c @@ -340,11 +340,11 @@ static int xen_pv_cpu_up(unsigned int cp xen_pmu_init(cpu); - rc = HYPERVISOR_vcpu_op(VCPUOP_up, xen_vcpu_nr(cpu), NULL); - BUG_ON(rc); - - while (cpu_report_state(cpu) != CPU_ONLINE) - HYPERVISOR_sched_op(SCHEDOP_yield, NULL); + /* +* Why is this a BUG? If the hypercall fails then everything can be +* rolled back, no? +*/ In many cases this indicates either some sort of hypervisor internal error or broken logic in the guest, so it is, well, a bug. But I suppose it may also be some transient condition in the hypervisor (I don't see it now but it can happen in the future) so perhaps we should indeed try not to die on the spot. -boris + BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_up, xen_vcpu_nr(cpu), NULL)); return 0; }
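If the hypercall failure were treated as recoverable, the tail of xen_pv_cpu_up() could look roughly like this (a sketch only; the rollback step is an assumption, and the patch as posted keeps the BUG_ON):

        rc = HYPERVISOR_vcpu_op(VCPUOP_up, xen_vcpu_nr(cpu), NULL);
        if (rc) {
                /* Undo the per-cpu PMU setup done just above and report the failure. */
                xen_pmu_finish(cpu);
                return rc;
        }

        return 0;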
Re: [PATCH v4 05/12] x86/xen: set MTRR state when running as Xen PV initial domain
On 3/6/23 11:34 AM, Juergen Gross wrote: When running as Xen PV initial domain (aka dom0), MTRRs are disabled by the hypervisor, but the system should nevertheless use correct cache memory types. This has always kind of worked, as disabled MTRRs resulted in disabled PAT, too, so that the kernel avoided code paths resulting in inconsistencies. This bypassed all of the sanity checks the kernel is doing with enabled MTRRs in order to avoid memory mappings with conflicting memory types. This has been changed recently, leading to PAT being accepted to be enabled, while MTRRs stayed disabled. The result is that mtrr_type_lookup() no longer is accepting all memory type requests, but started to return WB even if UC- was requested. This led to driver failures during initialization of some devices. In reality MTRRs are still in effect, but they are under complete control of the Xen hypervisor. It is possible, however, to retrieve the MTRR settings from the hypervisor. In order to fix those problems, overwrite the MTRR state via mtrr_overwrite_state() with the MTRR data from the hypervisor, if the system is running as a Xen dom0. Fixes: 72cbc8f04fe2 ("x86/PAT: Have pat_enabled() properly reflect state when running on Xen") Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: [PATCH v3 05/12] x86/xen: set MTRR state when running as Xen PV initial domain
On 2/27/23 2:12 AM, Juergen Gross wrote: On 24.02.23 22:00, Boris Ostrovsky wrote: On 2/23/23 4:32 AM, Juergen Gross wrote: + + for (reg = 0; reg < MTRR_MAX_VAR_RANGES; reg++) { + op.u.read_memtype.reg = reg; + if (HYPERVISOR_platform_op(&op)) + break; If we fail on the first iteration, do we still want to mark MTRRs are enabled/set in mtrr_overwrite_state()? Hmm, good idea. I think we should just drop the call of mtrr_overwrite_state() in this case. TBH I am not sure what the right way is to handle errors here. What if the hypercall fails on second iteration? -boris
Re: [PATCH v3 05/12] x86/xen: set MTRR state when running as Xen PV initial domain
On 2/23/23 4:32 AM, Juergen Gross wrote: + + for (reg = 0; reg < MTRR_MAX_VAR_RANGES; reg++) { + op.u.read_memtype.reg = reg; + if (HYPERVISOR_platform_op(&op)) + break; If we fail on the first iteration, do we still want to mark MTRRs as enabled/set in mtrr_overwrite_state()? -boris + + /* +* Only called in dom0, which has all RAM PFNs mapped at +* RAM MFNs, and all PCI space etc. is identity mapped. +* This means we can treat MFN == PFN regarding MTRR settings. +*/ + var[reg].base_lo = op.u.read_memtype.type; + var[reg].base_lo |= op.u.read_memtype.mfn << PAGE_SHIFT; + var[reg].base_hi = op.u.read_memtype.mfn >> (32 - PAGE_SHIFT); + mask = ~((op.u.read_memtype.nr_mfns << PAGE_SHIFT) - 1); + mask &= (1UL << width) - 1; + if (mask) + mask |= 1 << 11; + var[reg].mask_lo = mask; + var[reg].mask_hi = mask >> 32; + } + + mtrr_overwrite_state(var, reg, MTRR_TYPE_UNCACHABLE); +#endif +}
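A sketch of the adjustment discussed in the follow-up above (only call mtrr_overwrite_state() if at least one range was actually read back; how failures beyond the first iteration should be treated is exactly the open question):

        for (reg = 0; reg < MTRR_MAX_VAR_RANGES; reg++) {
                op.u.read_memtype.reg = reg;
                if (HYPERVISOR_platform_op(&op))
                        break;          /* stop at the first failed read */
                /* ... fill var[reg] as in the patch ... */
        }

        /* Don't pretend the MTRR state is known if nothing could be read. */
        if (reg)
                mtrr_overwrite_state(var, reg, MTRR_TYPE_UNCACHABLE);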
Re: [PATCH] x86/Xen: make use of IBPB controlling VM assist
On 2/15/23 3:31 AM, Jan Beulich wrote: On 15.02.2023 01:07, Boris Ostrovsky wrote: On 2/14/23 6:53 PM, Boris Ostrovsky wrote: On 2/14/23 11:13 AM, Jan Beulich wrote: --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -18,6 +18,8 @@ #include #include +#include + #include #include #include @@ -32,6 +34,7 @@ #include #include #include +#include #include #include "cpu.h" @@ -934,7 +937,8 @@ do_cmd_auto: break; case RETBLEED_MITIGATION_IBPB: - setup_force_cpu_cap(X86_FEATURE_ENTRY_IBPB); + if (!xen_pv_domain() || xen_vm_assist_ibpb(true)) Is this going to compile without CONFIG_XEN? Yes. The declaration of xen_vm_assist_ibpb() is visible (satisfying the compiler) and DCE will eliminate the call to the function due to xen_pv_domain() being constant "false" in that case, avoiding any linking issues. The interesting case here really is building with XEN but without XEN_PV: That's why I needed to put the function in enlighten.c. This wouldn't be needed if xen_pv_domain() was also constant "false" in that case (just like xen_pvh_domain() is when !XEN_PVH). I also think these two conditions should be wrapped into something to limit exposure of non-Xen code to Xen-specific primitives. I would have done so, if I had any halfway sensible idea on how to go about doing so in this particular case. In the absence of that it looked okay-ish to me to reference Xen functions directly here. Oh, and this needs x86 maintainers. Eventually yes. But I would prefer to sort the above question first (which I'm sure would have been raised by them, in perhaps more harsh a way), hence the initially limited exposure. I also think there is a bit of a disconnect between how the mitigation is reported in the log/sysfs (retbleed_mitigation is RETBLEED_MITIGATION_IBPB, so "Mitigation: IBPB") and, for example, lscpu (since X86_FEATURE_ENTRY_IBPB is not set anymore). -boris
Re: [PATCH] x86/Xen: make use of IBPB controlling VM assist
On 2/14/23 6:53 PM, Boris Ostrovsky wrote: On 2/14/23 11:13 AM, Jan Beulich wrote: --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -18,6 +18,8 @@ #include #include +#include + #include #include #include @@ -32,6 +34,7 @@ #include #include #include +#include #include #include "cpu.h" @@ -934,7 +937,8 @@ do_cmd_auto: break; case RETBLEED_MITIGATION_IBPB: - setup_force_cpu_cap(X86_FEATURE_ENTRY_IBPB); + if (!xen_pv_domain() || xen_vm_assist_ibpb(true)) Is this going to compile without CONFIG_XEN? I also think these two conditions should be wrapped into something to limit exposure of non-Xen code to Xen-specific primitives. Oh, and this needs x86 maintainers. -boris
Re: [PATCH] x86/Xen: make use of IBPB controlling VM assist
On 2/14/23 11:13 AM, Jan Beulich wrote: --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -18,6 +18,8 @@ #include #include +#include + #include #include #include @@ -32,6 +34,7 @@ #include #include #include +#include #include #include "cpu.h" @@ -934,7 +937,8 @@ do_cmd_auto: break; case RETBLEED_MITIGATION_IBPB: - setup_force_cpu_cap(X86_FEATURE_ENTRY_IBPB); + if (!xen_pv_domain() || xen_vm_assist_ibpb(true)) Is this going to compile without CONFIG_XEN? I also think these two conditions should be wrapped into something to limit exposure of non-Xen code to Xen-specific primitives. -boris + setup_force_cpu_cap(X86_FEATURE_ENTRY_IBPB); mitigate_smt = true; break;
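The kind of wrapper being asked for might look like this (the helper name and its location are hypothetical; xen_vm_assist_ibpb() is the function added by the patch under review, assumed here to return 0 on success):

/* Somewhere Xen-specific, with a trivial "false" fallback when !CONFIG_XEN_PV. */
static inline bool xen_pv_ibpb_vm_assist(void)
{
        /* True if Xen's VM assist performs the IBPB on the kernel's behalf. */
        return xen_pv_domain() && !xen_vm_assist_ibpb(true);
}

        case RETBLEED_MITIGATION_IBPB:
                if (!xen_pv_ibpb_vm_assist())
                        setup_force_cpu_cap(X86_FEATURE_ENTRY_IBPB);
                mitigate_smt = true;
                break;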
Re: [RFC 01/10] x86: introduce AMD-V and Intel VT-x Kconfig options
On 2/14/23 2:48 AM, Jan Beulich wrote: On 13.02.2023 21:53, Boris Ostrovsky wrote: On 2/13/23 11:41 AM, Jan Beulich wrote: On 13.02.2023 17:30, Xenia Ragiadakou wrote: On 2/13/23 17:11, Jan Beulich wrote: On 13.02.2023 15:57, Xenia Ragiadakou wrote: --- a/xen/arch/x86/cpu/Makefile +++ b/xen/arch/x86/cpu/Makefile @@ -10,4 +10,6 @@ obj-y += intel.o obj-y += intel_cacheinfo.o obj-y += mwait-idle.o obj-y += shanghai.o -obj-y += vpmu.o vpmu_amd.o vpmu_intel.o +obj-y += vpmu.o +obj-$(CONFIG_AMD_SVM) += vpmu_amd.o +obj-$(CONFIG_INTEL_VMX) += vpmu_intel.o This code was moved from hvm/ to cpu/ for the specific reason of also being used by PV guests. (Sadly the comments at the top weren't corrected at that time.) Hmm, the init functions are prefixed with svm/vmx. Does vpmu depend on svm/vmx support? There are interactions, but I don't think there's a dependency. You may want to ask Boris (whom I have now added to Cc). MSR intercept bits need to be manipulated, which is SVM/VMX-specific. But that's "interaction", not "dependency" aiui: The intercept bits aren't relevant for PV guests, are they? Correct, they are not. All accesses to intercept bits are under is_hvm_vcpu(). So vpmu does not depend on presence of vmx/svm support. (And init routines shouldn't be prefixed with those) -boris
Re: [RFC 01/10] x86: introduce AMD-V and Intel VT-x Kconfig options
On 2/13/23 11:41 AM, Jan Beulich wrote: On 13.02.2023 17:30, Xenia Ragiadakou wrote: On 2/13/23 17:11, Jan Beulich wrote: On 13.02.2023 15:57, Xenia Ragiadakou wrote: --- a/xen/arch/x86/cpu/Makefile +++ b/xen/arch/x86/cpu/Makefile @@ -10,4 +10,6 @@ obj-y += intel.o obj-y += intel_cacheinfo.o obj-y += mwait-idle.o obj-y += shanghai.o -obj-y += vpmu.o vpmu_amd.o vpmu_intel.o +obj-y += vpmu.o +obj-$(CONFIG_AMD_SVM) += vpmu_amd.o +obj-$(CONFIG_INTEL_VMX) += vpmu_intel.o This code was moved from hvm/ to cpu/ for the specific reason of also being used by PV guests. (Sadly the comments at the top weren't corrected at that time.) Hmm, the init functions are prefixed with svm/vmx. Does vpmu depend on svm/vmx support? There are interactions, but I don't think there's a dependency. You may want to ask Boris (whom I have now added to Cc). MSR intercept bits need to be manipulated, which is SVM/VMX-specific. -boris
Re: [PATCH 0/2] x86/xen: don't return from xen_pv_play_dead()
On 1/16/23 2:49 AM, Juergen Gross wrote: On 25.11.22 07:32, Juergen Gross wrote: All play_dead() functions but xen_pv_play_dead() don't return to the caller. Adapt xen_pv_play_dead() to behave like the other play_dead() variants. Juergen Gross (2): x86/xen: don't let xen_pv_play_dead() return x86/xen: mark xen_pv_play_dead() as __noreturn arch/x86/xen/smp.h | 2 ++ arch/x86/xen/smp_pv.c | 17 - arch/x86/xen/xen-head.S | 7 +++ tools/objtool/check.c | 1 + 4 files changed, 14 insertions(+), 13 deletions(-) Ping? Reviewed-by: Boris Ostrovsky
Re: [PATCH] x86/paravirt: merge activate_mm and dup_mmap callbacks
On 1/12/23 10:21 AM, Juergen Gross wrote: The two paravirt callbacks .mmu.activate_mm and .mmu.dup_mmap are sharing the same implementations in all cases: for Xen PV guests they are pinning the PGD of the new mm_struct, and for all other cases they are a NOP. So merge them to a common callback .mmu.enter_mmap (in contrast to the corresponding already existing .mmu.exit_mmap). As the first parameter of the old callbacks isn't used, drop it from the replacement one. Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: [PATCH linux-next v2] x86/xen/time: prefer tsc as clocksource when it is invariant
On 12/14/22 1:01 PM, Krister Johansen wrote: On Tue, Dec 13, 2022 at 04:25:32PM -0500, Boris Ostrovsky wrote: On 12/12/22 5:09 PM, Krister Johansen wrote: On Mon, Dec 12, 2022 at 01:48:24PM -0500, Boris Ostrovsky wrote: On 12/12/22 11:05 AM, Krister Johansen wrote: diff --git a/arch/x86/include/asm/xen/cpuid.h b/arch/x86/include/asm/xen/cpuid.h index 6daa9b0c8d11..d9d7432481e9 100644 --- a/arch/x86/include/asm/xen/cpuid.h +++ b/arch/x86/include/asm/xen/cpuid.h @@ -88,6 +88,12 @@ * EDX: shift amount for tsc->ns conversion * Sub-leaf 2: EAX: host tsc frequency in kHz */ +#define XEN_CPUID_TSC_EMULATED (1u << 0) +#define XEN_CPUID_HOST_TSC_RELIABLE (1u << 1) +#define XEN_CPUID_RDTSCP_INSTR_AVAIL (1u << 2) +#define XEN_CPUID_TSC_MODE_DEFAULT (0) +#define XEN_CPUID_TSC_MODE_EMULATE (1u) +#define XEN_CPUID_TSC_MODE_NOEMULATE (2u) This file is a copy of Xen public interface so this change should go to Xen first. Ok, should I split this into a separate patch on the linux side too? Yes. Once the Xen patch has been accepted you will either submit the same patch for Linux or sync Linux file with Xen (if there are more differences). Thanks. Based upon the feedback I received from you and Jan, I may try to shrink the check in xen_tsc_safe_clocksource() down a bit. In that case, I may only need to refer to a single field in the leaf that provides this information. In that case, are you alright with dropping the change to the header and referring to the value directly, or would you prefer that I proceed with adding these to the public API? It would certainly be appreciated if you updated the header files but it's up to maintainers to decide whether it's required. +static int __init xen_tsc_safe_clocksource(void) +{ + u32 eax, ebx, ecx, edx; + + if (!(xen_hvm_domain() || xen_pvh_domain())) + return 0; + + if (!(boot_cpu_has(X86_FEATURE_CONSTANT_TSC))) + return 0; + + if (!(boot_cpu_has(X86_FEATURE_NONSTOP_TSC))) + return 0; + + if (check_tsc_unstable()) + return 0; + + cpuid(xen_cpuid_base() + 3, &eax, &ebx, &ecx, &edx); + + if (eax & XEN_CPUID_TSC_EMULATED) + return 0; + + if (ebx != XEN_CPUID_TSC_MODE_NOEMULATE) + return 0; Why is the last test needed? I was under the impression that if the mode was 0 (default) it would be possible for the tsc to become emulated in the future, perhaps after a migration. The presence of the tsc_mode noemulate meant that we could count on the falseneess of the XEN_CPUID_TSC_EMULATED check remaining constant. This will filter out most modern processors with TSC scaling support where in default mode we don't intercept RDTCS after migration. But I don't think we have proper interface to determine this so we don't have much choice but to indeed make this check. Yes, if this remains a single boot-time check, I'm not sure that knowing whether the processor supports tsc scaling helps us. If tsc_mode is default, there's always a possibility of the tsc becoming emulated later on as part of migration, correct? If the processor supports TSC scaling I don't think it's possible (it can happen in theory) but if it doesn't and you migrate to a CPU running at different frequency then yes, hypervisor will start emulating RDTSC. The other thing that might be possible here is to add a background timer that periodically checks if the tsc is still not emulated, and if it suddenly becomes so, change the rating again to prefer the xen clocksource. 
I had written this off initially as an impractical solution, since it seemed like a lot more mechanism and because it meant the performance characteristics of the system would change without user intervention. However, if this seems like a good idea, I'm not opposed to giving it a try. I don't think we should do it. Having the kernel suddenly change clocksource will probably be somewhat of a surprise to users. -boris
Re: [PATCH linux-next v2] x86/xen/time: prefer tsc as clocksource when it is invariant
On 12/12/22 5:09 PM, Krister Johansen wrote: On Mon, Dec 12, 2022 at 01:48:24PM -0500, Boris Ostrovsky wrote: On 12/12/22 11:05 AM, Krister Johansen wrote: diff --git a/arch/x86/include/asm/xen/cpuid.h b/arch/x86/include/asm/xen/cpuid.h index 6daa9b0c8d11..d9d7432481e9 100644 --- a/arch/x86/include/asm/xen/cpuid.h +++ b/arch/x86/include/asm/xen/cpuid.h @@ -88,6 +88,12 @@ * EDX: shift amount for tsc->ns conversion * Sub-leaf 2: EAX: host tsc frequency in kHz */ +#define XEN_CPUID_TSC_EMULATED (1u << 0) +#define XEN_CPUID_HOST_TSC_RELIABLE (1u << 1) +#define XEN_CPUID_RDTSCP_INSTR_AVAIL (1u << 2) +#define XEN_CPUID_TSC_MODE_DEFAULT (0) +#define XEN_CPUID_TSC_MODE_EMULATE (1u) +#define XEN_CPUID_TSC_MODE_NOEMULATE (2u) This file is a copy of Xen public interface so this change should go to Xen first. Ok, should I split this into a separate patch on the linux side too? Yes. Once the Xen patch has been accepted you will either submit the same patch for Linux or sync Linux file with Xen (if there are more differences). +static int __init xen_tsc_safe_clocksource(void) +{ + u32 eax, ebx, ecx, edx; + + if (!(xen_hvm_domain() || xen_pvh_domain())) + return 0; + + if (!(boot_cpu_has(X86_FEATURE_CONSTANT_TSC))) + return 0; + + if (!(boot_cpu_has(X86_FEATURE_NONSTOP_TSC))) + return 0; + + if (check_tsc_unstable()) + return 0; + + cpuid(xen_cpuid_base() + 3, &eax, &ebx, &ecx, &edx); + + if (eax & XEN_CPUID_TSC_EMULATED) + return 0; + + if (ebx != XEN_CPUID_TSC_MODE_NOEMULATE) + return 0; Why is the last test needed? I was under the impression that if the mode was 0 (default) it would be possible for the tsc to become emulated in the future, perhaps after a migration. The presence of the tsc_mode noemulate meant that we could count on the falseness of the XEN_CPUID_TSC_EMULATED check remaining constant. This will filter out most modern processors with TSC scaling support where in default mode we don't intercept RDTSC after migration. But I don't think we have a proper interface to determine this so we don't have much choice but to indeed make this check. -boris
Re: [PATCH linux-next v2] x86/xen/time: prefer tsc as clocksource when it is invariant
On 12/12/22 11:05 AM, Krister Johansen wrote: diff --git a/arch/x86/include/asm/xen/cpuid.h b/arch/x86/include/asm/xen/cpuid.h index 6daa9b0c8d11..d9d7432481e9 100644 --- a/arch/x86/include/asm/xen/cpuid.h +++ b/arch/x86/include/asm/xen/cpuid.h @@ -88,6 +88,12 @@ * EDX: shift amount for tsc->ns conversion * Sub-leaf 2: EAX: host tsc frequency in kHz */ +#define XEN_CPUID_TSC_EMULATED (1u << 0) +#define XEN_CPUID_HOST_TSC_RELIABLE (1u << 1) +#define XEN_CPUID_RDTSCP_INSTR_AVAIL (1u << 2) +#define XEN_CPUID_TSC_MODE_DEFAULT (0) +#define XEN_CPUID_TSC_MODE_EMULATE (1u) +#define XEN_CPUID_TSC_MODE_NOEMULATE (2u) This file is a copy of Xen public interface so this change should go to Xen first. +static int __init xen_tsc_safe_clocksource(void) +{ + u32 eax, ebx, ecx, edx; + + if (!(xen_hvm_domain() || xen_pvh_domain())) + return 0; + + if (!(boot_cpu_has(X86_FEATURE_CONSTANT_TSC))) + return 0; + + if (!(boot_cpu_has(X86_FEATURE_NONSTOP_TSC))) + return 0; + + if (check_tsc_unstable()) + return 0; + + cpuid(xen_cpuid_base() + 3, &eax, &ebx, &ecx, &edx); + + if (eax & XEN_CPUID_TSC_EMULATED) + return 0; + + if (ebx != XEN_CPUID_TSC_MODE_NOEMULATE) + return 0; Why is the last test needed? -boris
Re: [PATCH linux-next] x86/xen/time: prefer tsc as clocksource when it is invariant
On 12/8/22 11:36 AM, Krister Johansen wrote: + /* +* As Dom0 is never moved, no penalty on using TSC there. +* +* If the guest has invariant tsc, then set xen_clocksource rating +* below that of the tsc so that the system prefers tsc instead. This +* check excludes PV domains, because PV is unable to guarantee that the +* guest's cpuid call has been intercepted by the hypervisor. +*/ + if (xen_initial_domain()) { xen_clocksource.rating = 275; + } else if ((xen_hvm_domain() || xen_pvh_domain()) && + boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && + boot_cpu_has(X86_FEATURE_NONSTOP_TSC) && + !check_tsc_unstable()) { + xen_clocksource.rating = 299; + } What if RDTSC is intercepted? -boris
Re: [PATCH] drivers/xen/hypervisor: Expose VM SIF flags to userspace
On 11/29/22 10:00 AM, Per Bilse wrote: +static ssize_t flags_show(struct hyp_sysfs_attr *attr, char *buffer) +{ + static char const *const sifstr[SIFN_NUM_SIFN] = { + [SIFN_PRIV] = "privileged", + [SIFN_INIT] = "initdomain", + [SIFN_MULTI] = "multiboot", + [SIFN_PFN] = "mod_start_pfn", + [SIFN_VIRT] = "virtmap" + }; + unsigned sifnum, sifmask; + ssize_t ret = 0; + + sifmask = ~(~0U << SIFN_NUM_SIFN); // ...111... + if (xen_domain() && (xen_start_flags & sifmask) != 0) { + for (sifnum = 0; sifnum != SIFN_NUM_SIFN; sifnum++) { + if ((xen_start_flags & (1< Why not simply show the unprocessed xen_start_flags as a hex value? -boris
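A sketch of the simpler alternative being suggested (the sysfs attribute wiring is assumed to match the surrounding hypervisor sysfs code):

static ssize_t flags_show(struct hyp_sysfs_attr *attr, char *buffer)
{
        if (!xen_domain())
                return 0;

        /* Expose the raw SIF flags; decoding is left to userspace. */
        return sprintf(buffer, "%#x\n", xen_start_flags);
}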
Re: [PATCH 28/30] x86/xen: Use kstrtobool() instead of strtobool()
On 11/1/22 5:14 PM, Christophe JAILLET wrote: strtobool() is the same as kstrtobool(). However, the latter is more widely used within the kernel. In order to remove strtobool() and slightly simplify kstrtox.h, switch to the other function name. While at it, include the corresponding header file () Signed-off-by: Christophe JAILLET --- This patch is part of a series that axes all usages of strtobool(). Each patch can be applied independently from the other ones. The last patch of the series removes the definition of strtobool(). You may not have been copied on the cover letter. So, if needed, it is available at [1]. [1]: https://lore.kernel.org/all/cover.1667336095.git.christophe.jail...@wanadoo.fr/ Reviewed-by: Boris Ostrovsky
Re: [PATCH] x86/xen: silence smatch warning in pmu_msr_chk_emulated()
On 10/20/22 9:34 AM, Juergen Gross wrote: On 20.10.22 15:16, Jan Beulich wrote: On 20.10.2022 13:37, Juergen Gross wrote: Commit 8714f7bcd3c2 ("xen/pv: add fault recovery control to pmu msr accesses") introduced code resulting in a warning issued by the smatch static checker, claiming to use an uninitialized variable. This is a false positive, but work around the warning nevertheless. The risk of introducing a problem might be quite low here, but in general it exists: With the adjustment you remove any chance of the compiler spotting a missing initialization before use. And I'm not convinced using 0 in such a case would actually be ending up sufficiently benign. Hmm, an alternative would be to initialize it to -1 and add a test for the index to be >= 0 before using it. Or to live with the smatch warning with the chance that a compiler might warn for the same reason in the future. Is smatch complaining about both variables or just index? There are two cases in is_intel_pmu_msr() where it returns true but index is not set so perhaps that's what bothers smatch? It should not complain if is_intel_pmu_msr() returns false. -boris
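The alternative mentioned above (initialize to -1 and test before use) would look roughly like this; whether smatch only complains about 'index' is the open question, so this is a sketch rather than the posted fix:

        int type = 0, index = -1;
        ...
        if (is_intel_pmu_msr(msr, &type, &index) && index >= 0)
                emulated = xen_intel_pmu_emulate(msr, val, type, index, 1);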
Re: [PATCH v2 3/3] xen/virtio: enable grant based virtio on x86
On 10/7/22 10:00 AM, Oleksandr Tyshchenko wrote: On 07.10.22 09:41, Juergen Gross wrote: Hello Juergen Use an x86-specific virtio_check_mem_acc_cb() for Xen in order to setup the correct DMA ops. Signed-off-by: Juergen Gross --- V2: - add missing PV check in xen_virtio_mem_acc() (Oleksandr Tyshchenko) - add xen_virtio_restricted_mem_acc() stub (Oleksandr Tyshchenko) Reviewed-by: Oleksandr Tyshchenko # common code Reviewed-by: Boris Ostrovsky
Re: [PATCH v2 1/3] xen/pv: allow pmu msr accesses to cause GP
On 10/4/22 4:43 AM, Juergen Gross wrote: bool pmu_msr_read(unsigned int msr, uint64_t *val, int *err) { - if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL) { - if (is_amd_pmu_msr(msr)) { - if (!xen_amd_pmu_emulate(msr, val, 1)) - *val = native_read_msr_safe(msr, err); - return true; - } - } else { - int type, index; + int type, index; + bool emulated; - if (is_intel_pmu_msr(msr, &type, &index)) { - if (!xen_intel_pmu_emulate(msr, val, type, index, 1)) - *val = native_read_msr_safe(msr, err); - return true; - } + if (is_amd_pmu_msr(msr)) + emulated = xen_amd_pmu_emulate(msr, val, 1); + else if (is_intel_pmu_msr(msr, &type, &index)) + emulated = xen_intel_pmu_emulate(msr, val, type, index, 1); + else + return false; + + if (!emulated) { I think you can factor this out even further by moving the if/else if/else into a separate routine and then have 'if (!xen_emulate_pmu_msr(msr, val, 1))' (and pass zero from pmu_msr_write()) -boris
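A sketch of the further factoring suggested above (the helper name follows the suggestion; note that a real implementation still has to let the caller distinguish "not a PMU MSR" from "PMU MSR, but not emulated", which is what the three-way branch in the patch does):

static bool xen_emulate_pmu_msr(unsigned int msr, uint64_t *val, bool is_read)
{
        int type, index;

        if (is_amd_pmu_msr(msr))
                return xen_amd_pmu_emulate(msr, val, is_read);
        if (is_intel_pmu_msr(msr, &type, &index))
                return xen_intel_pmu_emulate(msr, val, type, index, is_read);

        return false;
}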
Re: [PATCH 1/3] xen/pv: allow pmu msr accesses to cause GP
On 9/26/22 10:18 AM, Juergen Gross wrote: bool pmu_msr_read(unsigned int msr, uint64_t *val, int *err) { if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL) { - if (is_amd_pmu_msr(msr)) { - if (!xen_amd_pmu_emulate(msr, val, 1)) - *val = native_read_msr_safe(msr, err); - return true; + if (!is_amd_pmu_msr(msr)) You should be able to move the vendor check inside is_*_pmu_msr(). -boris + return false; + if (!xen_amd_pmu_emulate(msr, val, 1)) { + *val = err ? native_read_msr_safe(msr, err) + : native_read_msr(msr); } + return true; } else {
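A sketch of what folding the vendor test into the helper could look like (is_amd_pmu_msr() is the existing helper; msr_in_amd_pmu_range() is a placeholder standing in for its existing MSR-range check):

static bool is_amd_pmu_msr(unsigned int msr)
{
        /* Vendor check moved here so callers no longer need to repeat it. */
        if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD &&
            boot_cpu_data.x86_vendor != X86_VENDOR_HYGON)
                return false;

        return msr_in_amd_pmu_range(msr);      /* existing range test, unchanged */
}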
Re: [PATCH 1/4] x86/entry: Work around Clang __bdos() bug
On 9/20/22 3:21 PM, Kees Cook wrote: After expanding bounds checking to use __builtin_dynamic_object_size(), Clang produces a false positive when building with CONFIG_FORTIFY_SOURCE=y and CONFIG_UBSAN_BOUNDS=y when operating on an array with a dynamic offset. Work around this by using a direct assignment of an empty instance. Avoids this warning: ../include/linux/fortify-string.h:309:4: warning: call to __write_overflow_field declared with 'warning' attribute: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning] __write_overflow_field(p_size_field, size); ^ which was isolated to the memset() call in xen_load_idt(). Note that this looks very much like another bug that was worked around: https://github.com/ClangBuiltLinux/linux/issues/1592 Cc: Juergen Gross Cc: Boris Ostrovsky Cc: Nathan Chancellor Cc: Nick Desaulniers Cc: xen-devel@lists.xenproject.org Cc: l...@lists.linux.dev Signed-off-by: Kees Cook Reviewed-by: Boris Ostrovsky
Re: [PATCH] x86/vpmu: fix race-condition in vpmu_load
On 9/20/22 10:54 AM, Jan Beulich wrote: On 20.09.2022 16:26, Boris Ostrovsky wrote: On 9/20/22 4:01 AM, Jan Beulich wrote: On 20.09.2022 00:42, Boris Ostrovsky wrote: It is saving vpmu data from current pcpu's MSRs for a remote vcpu so @v in vmx_find_msr() is not @current: vpmu_load() ... prev = per_cpu(last_vcpu, pcpu); vpmu_save_force(prev) core2_vpmu_save() __core2_vpmu_save() vmx_read_guest_msr() vmx_find_msr() The call to vmx_find_msr() was introduced by 755087eb9b10c. I wonder though whether this call is needed when the code path above is executed (i.e. when we are saving a remote vcpu) How could it not be needed? We need to obtain the guest value. The thing I don't understand is why this forced saving is necessary, when context_switch() unconditionally calls vpmu_switch_from(). IIRC the logic is: 1. vcpuA runs on pcpu0 2. vcpuA is de-scheduled and is selected to run on pcpu1. It has not yet called vpmu_load() from pcpu1 The calling of vpmu_load() shouldn't matter here. What does matter is that vpmu_save() was necessarily called already. I thought we don't always save MSRs on context switch because we optimize for the case when the same vcpu gets scheduled again. But I am not sure I see this now that I am looking at the code. -boris Therefore I'm having trouble seeing why ... 3. vcpuB is ready to run on pcpu0, calls vpmu_load() 4. vcpuB discovers that pcpu0's MSRs are still holding values from vcpuA 5. vcpuB calls vpmu_force_save(vcpuA) which stashes pcpu0's MSRs into vcpuA's vpmu context. ... forced saving would be necessary here. What's necessary at this point is only the loading of vcpuB's MSR values.
Re: [PATCH] x86/vpmu: fix race-condition in vpmu_load
On 9/20/22 4:01 AM, Jan Beulich wrote: On 20.09.2022 00:42, Boris Ostrovsky wrote: It is saving vpmu data from current pcpu's MSRs for a remote vcpu so @v in vmx_find_msr() is not @current: vpmu_load() ... prev = per_cpu(last_vcpu, pcpu); vpmu_save_force(prev) core2_vpmu_save() __core2_vpmu_save() vmx_read_guest_msr() vmx_find_msr() The call to vmx_find_msr() was introduced by 755087eb9b10c. I wonder though whether this call is needed when the code path above is executed (i.e. when we are saving a remote vcpu) How could it not be needed? We need to obtain the guest value. The thing I don't understand is why this forced saving is necessary, when context_switch() unconditionally calls vpmu_switch_from(). IIRC the logic is: 1. vcpuA runs on pcpu0 2. vcpuA is de-scheduled and is selected to run on pcpu1. It has not yet called vpmu_load() from pcpu1 3. vcpuB is ready to run on pcpu0, calls vpmu_load() 4. vcpuB discovers that pcpu0's MSRs are still holding values from vcpuA 5. vcpuB calls vpmu_force_save(vcpuA) which stashes pcpu0's MSRs into vcpuA's vpmu context. -boris
Re: [PATCH] x86/vpmu: fix race-condition in vpmu_load
On 9/19/22 10:56 AM, Jan Beulich wrote: On 19.09.2022 16:11, Tamas K Lengyel wrote: On Mon, Sep 19, 2022 at 9:58 AM Jan Beulich wrote: On 19.09.2022 15:24, Tamas K Lengyel wrote: On Mon, Sep 19, 2022 at 9:21 AM Jan Beulich wrote: On 19.09.2022 14:25, Tamas K Lengyel wrote: On Mon, Sep 19, 2022 at 5:28 AM Jan Beulich wrote: On 16.09.2022 23:35, Boris Ostrovsky wrote: On 9/16/22 8:52 AM, Jan Beulich wrote: On 15.09.2022 16:01, Tamas K Lengyel wrote: While experimenting with the vPMU subsystem an ASSERT failure was observed in vmx_find_msr because the vcpu_runnable state was true. The root cause of the bug appears to be the fact that the vPMU subsystem doesn't save its state on context_switch. For the further reply below - is this actually true? What is the vpmu_switch_from() (resolving to vpmu_save()) doing then early in context_switch()? The vpmu_load function will attempt to gather the PMU state if its still loaded two different ways: 1. if the current pcpu is not where the vcpu ran before doing a remote save 2. if the current pcpu had another vcpu active before doing a local save However, in case the prev vcpu is being rescheduled on another pcpu its state has already changed and vcpu_runnable is returning true, thus #2 will trip the ASSERT. The only way to avoid this race condition is to make sure the prev vcpu is paused while being checked and its context saved. Once the prev vcpu is resumed and does #1 it will find its state already saved. While I consider this explanation plausible, I'm worried: --- a/xen/arch/x86/cpu/vpmu.c +++ b/xen/arch/x86/cpu/vpmu.c @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t from_guest) vpmu = vcpu_vpmu(prev); /* Someone ran here before us */ +vcpu_pause(prev); vpmu_save_force(prev); vpmu_reset(vpmu, VPMU_CONTEXT_LOADED); +vcpu_unpause(prev); vpmu = vcpu_vpmu(v); } We're running with IRQs off here, yet vcpu_pause() waits for the vcpu to actually be de-scheduled. Even with IRQs on this is already a relatively heavy operation (also including its impact on the remote side). Additionally the function is called from context_switch(), and I'm unsure of the usability of vcpu_pause() on such a path. In particular: Is there a risk of two CPUs doing this mutually to one another? If so, is deadlocking excluded? Hence at the very least I think the description wants extending, to discuss the safety of the change. Boris - any chance you could comment here? Iirc that's code you did introduce. Is the assertion in vmx_find_msr() really needs to be for runnable vcpu or can it be a check on whether vcpu is actually running (e.g. RUNSTATE_running)? You cannot safely check for "running", as "runnable" may transition to/from "running" behind your back. The more I look at this the more I think the only sensible solution is to have the vPMU state be saved on vmexit for all vCPUs. Do you really mean vmexit? It would suffice if state was reliably saved during context-switch-out, wouldn't it? At that point the vCPU can't be resumed on another pCPU, yet. That way all this having to figure out where and when a context needs saving during scheduling goes away. Yes, it adds a bit of overhead for cases where the vCPU will resume on the same pCPU and that context saved could have been skipped, If you really mean vmexit, then I'm inclined to say that's more than just "a bit of overhead". I'd agree if you really meant context-switch-out, but as said further up it looks to me as if that was already occurring. Apparently I'm overlooking something crucial ... 
Yes, the current setup is doing exactly that, saving the vPMU context on context-switch-out, and that's where the ASSERT failure occurs because the vCPU it needs to save the context for is already runnable on another pCPU. Well, if that's the scenario (sorry for not understanding it that way earlier on), then the assertion is too strict: While in context switch, the vCPU may be runnable, but certainly won't actually run anywhere. Therefore I'd be inclined to suggest to relax the condition just enough to cover this case, without actually going to checking for "running". What ensures the vCPU won't actually run anywhere if it's in the runnable state? The fact that the vCPU is the subject of context_switch(). And how do I relax the condition just enough to cover just this case? That's the more difficult question. The immediate solution, passing a boolean or flag to vpmu_switch_from(), may not be practical, as it would likely mean passing this through many layers. Utilizing that I have Jürgen sitting next to me, I've discussed this with him, and he suggested to simply check for v == current. And indeed - set_current() in context_switch
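A sketch of the relaxation being converged on at the end of this exchange (whether this is the exact final form is not settled here):

        /*
         * The vCPU is either paused or is the subject of the context switch
         * currently running on this pCPU, so it may be formally "runnable"
         * without being able to actually run anywhere else yet.
         */
        ASSERT(v == current || !vcpu_runnable(v));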
Re: [PATCH] x86/vpmu: fix race-condition in vpmu_load
On 9/16/22 8:52 AM, Jan Beulich wrote: On 15.09.2022 16:01, Tamas K Lengyel wrote: While experimenting with the vPMU subsystem an ASSERT failure was observed in vmx_find_msr because the vcpu_runnable state was true. The root cause of the bug appears to be the fact that the vPMU subsystem doesn't save its state on context_switch. The vpmu_load function will attempt to gather the PMU state if it's still loaded, in two different ways: 1. if the current pcpu is not where the vcpu ran before doing a remote save 2. if the current pcpu had another vcpu active before doing a local save However, in case the prev vcpu is being rescheduled on another pcpu its state has already changed and vcpu_runnable is returning true, thus #2 will trip the ASSERT. The only way to avoid this race condition is to make sure the prev vcpu is paused while being checked and its context saved. Once the prev vcpu is resumed and does #1 it will find its state already saved. While I consider this explanation plausible, I'm worried: --- a/xen/arch/x86/cpu/vpmu.c +++ b/xen/arch/x86/cpu/vpmu.c @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t from_guest) vpmu = vcpu_vpmu(prev); /* Someone ran here before us */ +vcpu_pause(prev); vpmu_save_force(prev); vpmu_reset(vpmu, VPMU_CONTEXT_LOADED); +vcpu_unpause(prev); vpmu = vcpu_vpmu(v); } We're running with IRQs off here, yet vcpu_pause() waits for the vcpu to actually be de-scheduled. Even with IRQs on this is already a relatively heavy operation (also including its impact on the remote side). Additionally the function is called from context_switch(), and I'm unsure of the usability of vcpu_pause() on such a path. In particular: Is there a risk of two CPUs doing this mutually to one another? If so, is deadlocking excluded? Hence at the very least I think the description wants extending, to discuss the safety of the change. Boris - any chance you could comment here? Iirc that's code you did introduce. Does the assertion in vmx_find_msr() really need to be for a runnable vcpu, or can it be a check on whether the vcpu is actually running (e.g. RUNSTATE_running)? I think the call chain vpmu_load()->..->*_vpmu_save()->...->vmx_find_msr() is the only one where we are doing it for a non-current vcpu. If we can guarantee that the run state is set after the vpmu_load() call (maybe it is already, I haven't checked) then testing the state may avoid the assertion. -boris
Re: [PATCH v3] x86/xen: Add support for HVMOP_set_evtchn_upcall_vector
On 7/28/22 8:52 AM, Jane Malalane wrote: +/* + * Setup per-vCPU vector-type callbacks and trick toolstack to think The comment should be adjusted -- no need to mention toolstack now that that code has been factored out. Other than that Reviewed-by: Boris Ostrovsky + * we are enlightened. If this setup is unavailable, fallback to the + * global vector-type callback. + */ +static __init void xen_init_setup_upcall_vector(void) +{
Re: [PATCH v2] x86/xen: Add support for HVMOP_set_evtchn_upcall_vector
On 7/26/22 8:56 AM, Jane Malalane wrote: +/* Setup per-vCPU vector-type callbacks and trick toolstack to think + * we are enlightened. If this setup is unavailable, fallback to the + * global vector-type callback. */ Comment style. +static __init void xen_init_setup_upcall_vector(void) +{ + unsigned int cpu = 0; Unnecessary variable. + + if (!xen_have_vector_callback) + return; + + if ((cpuid_eax(xen_cpuid_base() + 4) & XEN_HVM_CPUID_UPCALL_VECTOR) && + !xen_set_upcall_vector(cpu) && !xen_set_callback_via(1)) + xen_percpu_upcall = true; + else if (xen_feature(XENFEAT_hvm_callback_vector)) + xen_setup_callback_vector(); + else + xen_have_vector_callback = false; +} + +int xen_set_upcall_vector(unsigned int cpu) +{ + int rc; + xen_hvm_evtchn_upcall_vector_t op = { + .vector = HYPERVISOR_CALLBACK_VECTOR, + .vcpu = per_cpu(xen_vcpu_id, cpu), + }; + + rc = HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector, &op); + if (rc) + return rc; + + if (!cpu) A comment (e.g. "Let toolstack know that we are enlightened." or something along these lines) would be useful here. -boris + rc = xen_set_callback_via(1); +
Re: [PATCH] x86/xen: Add support for HVMOP_set_evtchn_upcall_vector
On 7/25/22 6:03 AM, Jane Malalane wrote: On 18/07/2022 14:59, Boris Ostrovsky wrote: On 7/18/22 4:56 AM, Andrew Cooper wrote: On 15/07/2022 14:10, Boris Ostrovsky wrote: On 7/15/22 5:50 AM, Andrew Cooper wrote: On 15/07/2022 09:18, Jane Malalane wrote: On 14/07/2022 00:27, Boris Ostrovsky wrote: xen_hvm_smp_init(); WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm)); diff --git a/arch/x86/xen/suspend_hvm.c b/arch/x86/xen/suspend_hvm.c index 9d548b0c772f..be66e027ef28 100644 --- a/arch/x86/xen/suspend_hvm.c +++ b/arch/x86/xen/suspend_hvm.c @@ -5,6 +5,7 @@ #include #include #include +#include #include "xen-ops.h" @@ -14,6 +15,23 @@ void xen_hvm_post_suspend(int suspend_cancelled) xen_hvm_init_shared_info(); xen_vcpu_restore(); } - xen_setup_callback_vector(); + if (xen_ack_upcall) { + unsigned int cpu; + + for_each_online_cpu(cpu) { + xen_hvm_evtchn_upcall_vector_t op = { + .vector = HYPERVISOR_CALLBACK_VECTOR, + .vcpu = per_cpu(xen_vcpu_id, cpu), + }; + + BUG_ON(HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector, + &op)); + /* Trick toolstack to think we are enlightened. */ + if (!cpu) + BUG_ON(xen_set_callback_via(1)); What are you trying to make the toolstack aware of? That we have *a* callback (either global or percpu)? Yes, specifically for the check in libxl__domain_pvcontrol_available. And others. This is all a giant bodge, but basically a lot of tooling uses the non-zero-ness of the CALLBACK_VIA param to determine whether the VM has Xen-aware drivers loaded or not. The value 1 is a CALLBACK_VIA value which encodes GSI 1, and the only reason this doesn't explode everywhere is because the evtchn_upcall_vector registration takes priority over GSI delivery. This is decades of tech debt piled on top of tech debt. Feels like it (setting the callback parameter) is something that the hypervisor should do --- no need to expose guests to this. Sensible or not, it is the ABI. Linux still needs to work (nicely) with older Xen's in the world, and we can't just retrofit a change in the hypervisor which says "btw, this ABI we've just changed now has a side effect of modifying a field that you also logically own". The hypercall has been around for a while so I understand ABI concerns there but XEN_HVM_CPUID_UPCALL_VECTOR was introduced only a month ago. Why not tie presence of this bit to no longer having to explicitly set the callback field? Any other opinions on this? (i.e., calling xen_set_callback_via(1) after HVMOP_set_evtchn_upcall_vector OR not exposing this to guests and instead having Xen call this function (in hvmop_set_evtchn_upcall_vector maybe) and tieing its presense to XEN_HVM_CPUID_UPCALL_VECTOR which was recently added) CPUID won't help here, I wasn't thinking clearly. Can we wrap the HVMOP_set_evtchn_upcall_vector hypercall in a function that will decide whether or not to also do xen_set_callback_via(1)? -boris
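A sketch of the wrapper being asked about here (the name is illustrative; in the v2/v3 patches quoted earlier in this digest the decision ends up folded into xen_set_upcall_vector() itself):

static int xen_percpu_upcall_init(unsigned int cpu)
{
        int rc = xen_set_upcall_vector(cpu);

        if (rc)
                return rc;

        /* Only the boot CPU advertises a callback to the toolstack. */
        if (!cpu)
                rc = xen_set_callback_via(1);

        return rc;
}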
Re: [PATCH] x86/xen: Add support for HVMOP_set_evtchn_upcall_vector
On 7/18/22 4:56 AM, Andrew Cooper wrote: On 15/07/2022 14:10, Boris Ostrovsky wrote: On 7/15/22 5:50 AM, Andrew Cooper wrote: On 15/07/2022 09:18, Jane Malalane wrote: On 14/07/2022 00:27, Boris Ostrovsky wrote: xen_hvm_smp_init(); WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm)); diff --git a/arch/x86/xen/suspend_hvm.c b/arch/x86/xen/suspend_hvm.c index 9d548b0c772f..be66e027ef28 100644 --- a/arch/x86/xen/suspend_hvm.c +++ b/arch/x86/xen/suspend_hvm.c @@ -5,6 +5,7 @@ #include #include #include +#include #include "xen-ops.h" @@ -14,6 +15,23 @@ void xen_hvm_post_suspend(int suspend_cancelled) xen_hvm_init_shared_info(); xen_vcpu_restore(); } - xen_setup_callback_vector(); + if (xen_ack_upcall) { + unsigned int cpu; + + for_each_online_cpu(cpu) { + xen_hvm_evtchn_upcall_vector_t op = { + .vector = HYPERVISOR_CALLBACK_VECTOR, + .vcpu = per_cpu(xen_vcpu_id, cpu), + }; + + BUG_ON(HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector, + &op)); + /* Trick toolstack to think we are enlightened. */ + if (!cpu) + BUG_ON(xen_set_callback_via(1)); What are you trying to make the toolstack aware of? That we have *a* callback (either global or percpu)? Yes, specifically for the check in libxl__domain_pvcontrol_available. And others. This is all a giant bodge, but basically a lot of tooling uses the non-zero-ness of the CALLBACK_VIA param to determine whether the VM has Xen-aware drivers loaded or not. The value 1 is a CALLBACK_VIA value which encodes GSI 1, and the only reason this doesn't explode everywhere is because the evtchn_upcall_vector registration takes priority over GSI delivery. This is decades of tech debt piled on top of tech debt. Feels like it (setting the callback parameter) is something that the hypervisor should do --- no need to expose guests to this. Sensible or not, it is the ABI. Linux still needs to work (nicely) with older Xen's in the world, and we can't just retrofit a change in the hypervisor which says "btw, this ABI we've just changed now has a side effect of modifying a field that you also logically own". The hypercall has been around for a while so I understand ABI concerns there but XEN_HVM_CPUID_UPCALL_VECTOR was introduced only a month ago. Why not tie presence of this bit to no longer having to explicitly set the callback field? -boris
Re: Build warnings in Xen 5.15.y and 5.10.y with retbleed backports
On 7/17/22 1:20 AM, Juergen Gross wrote: What about filling the complete hypercall page just with "int 3" or "ud2"? Any attempt to do a hypercall before the hypercall page has been initialized is a bug anyway. What good can come from calling a function which will return a basically random value instead of doing a privileged operation? This is all about objtool, that's why 'ret' was added there originally by f4b4bc10b0b8 ("x86/xen: Support objtool vmlinux.o validation in xen-head.S"). Before that it was all 'nop' which is similar to what you are suggesting ('int3' or 'ud2' would of course be better) -boris
Re: Build warnings in Xen 5.15.y and 5.10.y with retbleed backports
On 7/16/22 12:35 PM, Nicolai Stange wrote: Hi, I see a patch for this has been queued up for 5.10 already ([1]), I'm just sharing my findings in support of this patch here -- it doesn't merely exchange one warning for another, but fixes a real issue and should perhaps get applied to other stable branches as well. TL;DR: for this particular warning, objtool would exit early and fail to create any .orc_unwind* ELF sections for head_64.o, which are consumed by the ORC unwinder at runtime. Boris Ostrovsky writes: On 7/12/22 3:31 PM, Greg KH wrote: On Tue, Jul 12, 2022 at 03:19:39PM -0400, Boris Ostrovsky wrote: On 7/12/22 12:38 PM, Greg KH wrote: Hi all, I'm seeing the following build warning: arch/x86/kernel/head_64.o: warning: objtool: xen_hypercall_mmu_update(): can't find starting instruction in the 5.15.y and 5.10.y retbleed backports. The reason for this is that with RET being multibyte, it can cross those "xen_hypecall_*" symbol boundaries, because ... I don't know why just this one hypercall is being called out by objtool, and this warning isn't in 5.18 and Linus's tree due to I think commit 5b2fc51576ef ("x86/ibt,xen: Sprinkle the ENDBR") being there. But, is this a ret call that we "forgot" here? It's a "real" ret in Linus's branch: .pushsection .noinstr.text, "ax" .balign PAGE_SIZE SYM_CODE_START(hypercall_page) .rept (PAGE_SIZE / 32) UNWIND_HINT_FUNC ANNOTATE_NOENDBR ANNOTATE_UNRET_SAFE ret /* * Xen will write the hypercall page, and sort out ENDBR. */ .skip 31, 0xcc .endr while 5.15.y and older has: .pushsection .text .balign PAGE_SIZE SYM_CODE_START(hypercall_page) .rept (PAGE_SIZE / 32) UNWIND_HINT_FUNC .skip 31, 0x90 ... the "31" is no longer correct, ... ANNOTATE_UNRET_SAFE RET ... as with RET occupying more than one byte, the resulting hypercall entry's total size won't add up to 32 anymore. Right! I haven't thought about that part. I think this has been broken since 14b476e07fab ("x86: Prepare asm files for straight-line-speculation"). It still shouldn't matter as far as correct execution is concerned which is probably why noone complained. Note that those xen_hypercall_* symbols' values are getting statically calculated as 'hypercall page + n * 32' in the HYPERCALL() #define from xen-head.S. So there's a mismatch and with RET == 'ret; int3', the resulting .text effectively becomes 101e: 90 nop 101f: c3 ret 1020 : 1020: cc int3 1021: 90 nop 1022: 90 nop This is probably already not what has been intended, but because 'ret' and 'int3' both are single-byte encoded, objtool would still be able to find at least some "starting instruction" at this point. But with RET == 'jmp __x86_return_thunk', it becomes 101e: 90 nop 101f: e9 .byte 0xe9 1020 : 1020: 00 00 add%al,(%rax) 1022: 00 00 add%al,(%rax) 1024: 90 nop Here the 'e9 00 00 00 00' jmp crosses the symbol boundary and objtool errors out. Ah, thanks for explanation. Then I think we need to replace .skip 31, 0x90 with something like #if defined(CONFIG_RETHUNK) && !defined(__DISABLE_EXPORTS) && !defined(BUILD_VDSO) #define SKIP_BYTES27/* RET is 'jmp __x86_return_thunk' (5 bytes) */ #else /* CONFIG_RETPOLINE */ #ifdef CONFIG_SLS #define SKIP_BYTES30/* RET is 'ret; int3' (2 bytes) */ #else #define SKIP_BYTES31/* RET is 'ret' (1 byte) */ #endif .skip SKIP_BYTES, 0x90 (I don't have patched 5.15 so I am going by what mainline looks like) Or replace RET with ret. (Although at least with unpatched 5.15 the warning below is still generated) -boris .endr So should the "ret" remain or be turned into "RET" in mainline right now? 
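For reference, the slot-size bookkeeping being sketched above with its invariant spelled out. This is a tidy-up of the quoted proposal, not a tested patch; the config names mirror the ones used in the mail.

	/* Each hypercall_page slot is 32 bytes: SKIP_BYTES + (bytes emitted by RET) == 32. */
	#if defined(CONFIG_RETHUNK) && !defined(__DISABLE_EXPORTS) && !defined(BUILD_VDSO)
	#define SKIP_BYTES	27	/* RET is 'jmp __x86_return_thunk' (5 bytes) */
	#elif defined(CONFIG_SLS)
	#define SKIP_BYTES	30	/* RET is 'ret; int3' (2 bytes) */
	#else
	#define SKIP_BYTES	31	/* RET is 'ret' (1 byte) */
	#endif
	/* used in xen-head.S as: .skip SKIP_BYTES, 0x90 followed by ANNOTATE_UNRET_SAFE; RET */

Keeping the sum at 32 bytes is what stops the final RET from spilling across the xen_hypercall_* symbol boundaries that the HYPERCALL() macro computes as hypercall_page + n * 32.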
It doesn't matter --- this is overwritten by the hypervisor during initialization when Xen fills in actual hypercall code. It does make a difference though: even though objtool reports only a warning, it still exits early in this particular case and won't create any of the .orc_unwind* or .return_sites sections for head_64.o as it's supposed to. The significance of not having .orc_unwind* for head_64.o is that the reliable
Re: [PATCH] x86/xen: Add support for HVMOP_set_evtchn_upcall_vector
On 7/15/22 5:50 AM, Andrew Cooper wrote: On 15/07/2022 09:18, Jane Malalane wrote: On 14/07/2022 00:27, Boris Ostrovsky wrote: xen_hvm_smp_init(); WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm)); diff --git a/arch/x86/xen/suspend_hvm.c b/arch/x86/xen/suspend_hvm.c index 9d548b0c772f..be66e027ef28 100644 --- a/arch/x86/xen/suspend_hvm.c +++ b/arch/x86/xen/suspend_hvm.c @@ -5,6 +5,7 @@ #include #include #include +#include #include "xen-ops.h" @@ -14,6 +15,23 @@ void xen_hvm_post_suspend(int suspend_cancelled) xen_hvm_init_shared_info(); xen_vcpu_restore(); } - xen_setup_callback_vector(); + if (xen_ack_upcall) { + unsigned int cpu; + + for_each_online_cpu(cpu) { + xen_hvm_evtchn_upcall_vector_t op = { + .vector = HYPERVISOR_CALLBACK_VECTOR, + .vcpu = per_cpu(xen_vcpu_id, cpu), + }; + + BUG_ON(HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector, + &op)); + /* Trick toolstack to think we are enlightened. */ + if (!cpu) + BUG_ON(xen_set_callback_via(1)); What are you trying to make the toolstack aware of? That we have *a* callback (either global or percpu)? Yes, specifically for the check in libxl__domain_pvcontrol_available. And others. This is all a giant bodge, but basically a lot of tooling uses the non-zero-ness of the CALLBACK_VIA param to determine whether the VM has Xen-aware drivers loaded or not. The value 1 is a CALLBACK_VIA value which encodes GSI 1, and the only reason this doesn't explode everywhere is because the evtchn_upcall_vector registration takes priority over GSI delivery. This is decades of tech debt piled on top of tech debt. Feels like it (setting the callback parameter) is something that the hypervisor should do --- no need to expose guests to this. -boris
Re: [PATCH] x86/xen: Add support for HVMOP_set_evtchn_upcall_vector
On 7/11/22 11:22 AM, Jane Malalane wrote: --- a/arch/x86/xen/enlighten_hvm.c +++ b/arch/x86/xen/enlighten_hvm.c @@ -7,6 +7,7 @@ #include #include +#include #include #include @@ -30,6 +31,9 @@ static unsigned long shared_info_pfn; +__ro_after_init bool xen_ack_upcall; +EXPORT_SYMBOL_GPL(xen_ack_upcall); Shouldn't this be called something like xen_percpu_upcall? + void xen_hvm_init_shared_info(void) { struct xen_add_to_physmap xatp; @@ -125,6 +129,9 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_xen_hvm_callback) { struct pt_regs *old_regs = set_irq_regs(regs); + if (xen_ack_upcall) + ack_APIC_irq(); + inc_irq_stat(irq_hv_callback_count); xen_hvm_evtchn_do_upcall(); @@ -168,6 +175,15 @@ static int xen_cpu_up_prepare_hvm(unsigned int cpu) if (!xen_have_vector_callback) return 0; + if (xen_ack_upcall) { + xen_hvm_evtchn_upcall_vector_t op = { + .vector = HYPERVISOR_CALLBACK_VECTOR, + .vcpu = per_cpu(xen_vcpu_id, cpu), + }; + + BUG_ON(HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector, &op)); Does this have to be fatal? Can't we just fail bringing this vcpu up? + } + if (xen_feature(XENFEAT_hvm_safe_pvclock)) xen_setup_timer(cpu); @@ -211,8 +227,7 @@ static void __init xen_hvm_guest_init(void) xen_panic_handler_init(); - if (!no_vector_callback && xen_feature(XENFEAT_hvm_callback_vector)) - xen_have_vector_callback = 1; + xen_have_vector_callback = !no_vector_callback; Can we get rid of one of those two variables then? xen_hvm_smp_init(); WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm)); diff --git a/arch/x86/xen/suspend_hvm.c b/arch/x86/xen/suspend_hvm.c index 9d548b0c772f..be66e027ef28 100644 --- a/arch/x86/xen/suspend_hvm.c +++ b/arch/x86/xen/suspend_hvm.c @@ -5,6 +5,7 @@ #include #include #include +#include #include "xen-ops.h" @@ -14,6 +15,23 @@ void xen_hvm_post_suspend(int suspend_cancelled) xen_hvm_init_shared_info(); xen_vcpu_restore(); } - xen_setup_callback_vector(); + if (xen_ack_upcall) { + unsigned int cpu; + + for_each_online_cpu(cpu) { + xen_hvm_evtchn_upcall_vector_t op = { + .vector = HYPERVISOR_CALLBACK_VECTOR, + .vcpu = per_cpu(xen_vcpu_id, cpu), + }; + + BUG_ON(HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector, +&op)); + /* Trick toolstack to think we are enlightened. */ + if (!cpu) + BUG_ON(xen_set_callback_via(1)); What are you trying to make the toolstack aware of? That we have *a* callback (either global or percpu)? BTW, you can take it out the loop. And maybe @op definition too, except for .vcpu assignment. + } + } else { + xen_setup_callback_vector(); + } xen_unplug_emulated_devices(); } -boris
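As an aside, a sketch of the loop restructuring Boris asks for at the end of this review: the hypercall argument and the callback_via trick hoisted out of the per-CPU loop, with only .vcpu updated per iteration. Illustrative only, and using the xen_percpu_upcall name he suggests rather than the posted xen_ack_upcall.

	if (xen_percpu_upcall) {
		xen_hvm_evtchn_upcall_vector_t op = {
			.vector = HYPERVISOR_CALLBACK_VECTOR,
		};
		unsigned int cpu;

		/* Trick the toolstack into thinking we are enlightened. */
		BUG_ON(xen_set_callback_via(1));

		for_each_online_cpu(cpu) {
			op.vcpu = per_cpu(xen_vcpu_id, cpu);
			BUG_ON(HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector,
						 &op));
		}
	} else {
		xen_setup_callback_vector();
	}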
Re: Build warnings in Xen 5.15.y and 5.10.y with retbleed backports
On 7/12/22 3:31 PM, Greg KH wrote: On Tue, Jul 12, 2022 at 03:19:39PM -0400, Boris Ostrovsky wrote: On 7/12/22 12:38 PM, Greg KH wrote: Hi all, I'm seeing the following build warning: arch/x86/kernel/head_64.o: warning: objtool: xen_hypercall_mmu_update(): can't find starting instruction in the 5.15.y and 5.10.y retbleed backports. I don't know why just this one hypercall is being called out by objtool, and this warning isn't in 5.18 and Linus's tree due to I think commit 5b2fc51576ef ("x86/ibt,xen: Sprinkle the ENDBR") being there. But, is this a ret call that we "forgot" here? It's a "real" ret in Linus's branch: .pushsection .noinstr.text, "ax" .balign PAGE_SIZE SYM_CODE_START(hypercall_page) .rept (PAGE_SIZE / 32) UNWIND_HINT_FUNC ANNOTATE_NOENDBR ANNOTATE_UNRET_SAFE ret /* * Xen will write the hypercall page, and sort out ENDBR. */ .skip 31, 0xcc .endr while 5.15.y and older has: .pushsection .text .balign PAGE_SIZE SYM_CODE_START(hypercall_page) .rept (PAGE_SIZE / 32) UNWIND_HINT_FUNC .skip 31, 0x90 ANNOTATE_UNRET_SAFE RET .endr So should the "ret" remain or be turned into "RET" in mainline right now? It doesn't matter --- this is overwritten by the hypervisor during initialization when Xen fills in actual hypercall code. So f4b4bc10b0b85ec66f1a9bf5dddf475e6695b6d2 added 'ret' to make objtool happy and then 14b476e07fab6 replaced 'ret' with RET as part of SLS fixes. The latter was not really necessary but harmless. So it can be 'ret', RET, or anything else that tools don't complain about. It will not be executed. Cool, thanks. But what about the objtool warning that I now see? Is that "real"? It's not real in the sense that the code there is not real, it will be overwritten. (Originally the whole page was 'nop's) I am getting a different error BTW: arch/x86/kernel/head_64.o: warning: objtool: .text+0x5: unreachable instruction I don't run any Xen systems, so I can't test any of this myself. You can't test any changes to that code --- it is rewritten when Xen guest is running. We probably do want to shut up objtool. Josh, any suggestions? -boris
Re: Build warnings in Xen 5.15.y and 5.10.y with retbleed backports
On 7/12/22 12:38 PM, Greg KH wrote: Hi all, I'm seeing the following build warning: arch/x86/kernel/head_64.o: warning: objtool: xen_hypercall_mmu_update(): can't find starting instruction in the 5.15.y and 5.10.y retbleed backports. I don't know why just this one hypercall is being called out by objtool, and this warning isn't in 5.18 and Linus's tree due to I think commit 5b2fc51576ef ("x86/ibt,xen: Sprinkle the ENDBR") being there. But, is this a ret call that we "forgot" here? It's a "real" ret in Linus's branch: .pushsection .noinstr.text, "ax" .balign PAGE_SIZE SYM_CODE_START(hypercall_page) .rept (PAGE_SIZE / 32) UNWIND_HINT_FUNC ANNOTATE_NOENDBR ANNOTATE_UNRET_SAFE ret /* * Xen will write the hypercall page, and sort out ENDBR. */ .skip 31, 0xcc .endr while 5.15.y and older has: .pushsection .text .balign PAGE_SIZE SYM_CODE_START(hypercall_page) .rept (PAGE_SIZE / 32) UNWIND_HINT_FUNC .skip 31, 0x90 ANNOTATE_UNRET_SAFE RET .endr So should the "ret" remain or be turned into "RET" in mainline right now? It doesn't matter --- this is overwritten by the hypervisor during initialization when Xen fills in actual hypercall code. So f4b4bc10b0b85ec66f1a9bf5dddf475e6695b6d2 added 'ret' to make objtool happy and then 14b476e07fab6 replaced 'ret' with RET as part of SLS fixes. The latter was not really necessary but harmless. So it can be 'ret', RET, or anything else that tools don't complain about. It will not be executed. -boris
Re: Reg. Tee init fail...
On 6/29/22 4:03 PM, Stefano Stabellini wrote: Adding Juergen and Boris because this is a Linux/x86 issue. As you can see from this Linux driver: https://elixir.bootlin.com/linux/latest/source/drivers/crypto/ccp/tee-dev.c#L132 Linux as dom0 on x86 is trying to communicate with firmware (TEE). Linux is calling __pa to pass a physical address to firmware. However, __pa returns a "fake" address not an mfn. I imagine that a quick workaround would be to call "virt_to_machine" instead of "__pa" in tee-dev.c. It's probably worth a try but it seems we may need to OR the result with C-bit (i.e. sme_me_mask). Or (for testing purposes) run with TSME on, I think C-bit is not set then. -boris Normally, if this was a device, the "right fix" would be to use swiotlb-xen:xen_swiotlb_map_page to get back a real physical address. However, xen_swiotlb_map_page is meant to be used as part of the dma_ops API and takes a struct device *dev as input parameter. Maybe xen_swiotlb_map_page can be used for tee-dev as well? Basically tee-dev would need to call dma_map_page before passing addresses to firmware, and dma_unmap_page when it is done. E.g.: cmd_buffer = dma_map_page(dev, virt_to_page(cmd), cmd & ~PAGE_MASK, ring_size, DMA_TO_DEVICE); Juergen, Boris, what do you think? On Fri, 24 Jun 2022, Julien Grall wrote: Hi, (moving the discussion to xen-devel as I think it is more appropriate) On 24/06/2022 10:53, SK, SivaSangeetha (Siva Sangeetha) wrote: [AMD Official Use Only - General] Not clear what this means. Hi Xen team, In TEE driver, We allocate a ring buffer, get its physical address from __pa() macro, pass the physical address to secure processor for mapping it and using in secure processor side. Source: https://elixir.bootlin.com/linux/latest/source/drivers/crypto/ccp/tee-dev.c#L132 This works good natively in Dom0 on the target. When we boot the same Dom0 kernel, with Xen hypervisor enabled, ring init fails. Do you have any error message or error code? We suspect that the address passed to secure processor, is not same when xen is enabled, and when xen is enabled, some level of address translation might be required to get exact physical address. If you are using Xen upstream, Dom0 will be mapped with IPA == PA. So there should be no need for translation. Can you provide more details on your setup (version of Xen, Linux...)? Cheers, -- Julien Grall
Re: [PATCH v2 0/3] x86: fix brk area initialization
On 6/29/22 10:10 AM, Juergen Gross wrote: On 23.06.22 11:46, Juergen Gross wrote: The brk area needs to be zeroed initially, like the .bss section. At the same time its memory should be covered by the ELF program headers. Juergen Gross (3): x86/xen: use clear_bss() for Xen PV guests x86: fix setup of brk area x86: fix .brk attribute in linker script arch/x86/include/asm/setup.h | 3 +++ arch/x86/kernel/head64.c | 4 +++- arch/x86/kernel/vmlinux.lds.S | 2 +- arch/x86/xen/enlighten_pv.c | 8 ++-- arch/x86/xen/xen-head.S | 10 +- 5 files changed, 14 insertions(+), 13 deletions(-) Could I please have some feedback? This series is fixing a major regression regarding running as a Xen PV guest (depending on the kernel configuration the system will crash very early). #regzbot ^introduced e32683c6f7d2 I don't think you need this for Xen bits as Jan had already reviewed it, but in case you do Reviewed-by: Boris Ostrovsky
Re: [PATCH V3 4/8] xen/virtio: Enable restricted memory access using Xen grant mappings
On 6/2/22 8:49 AM, Oleksandr wrote: On 31.05.22 00:00, Oleksandr Tyshchenko wrote: Hello all. From: Juergen Gross In order to support virtio in Xen guests add a config option XEN_VIRTIO enabling the user to specify whether in all Xen guests virtio should be able to access memory via Xen grant mappings only on the host side. Also set PLATFORM_VIRTIO_RESTRICTED_MEM_ACCESS feature from the guest initialization code on Arm and x86 if CONFIG_XEN_VIRTIO is enabled. Signed-off-by: Juergen Gross Signed-off-by: Oleksandr Tyshchenko Reviewed-by: Stefano Stabellini --- Changes V1 -> V2: - new patch, split required changes from commit: "[PATCH V1 3/6] xen/virtio: Add option to restrict memory access under Xen" - rework according to new platform_has() infrastructure Changes V2 -> V3: - add Stefano's R-b May I please ask for the ack or comments for x86 side here? Reviewed-by: Boris Ostrovsky
Re: [PATCH] xen: replace xen_remap() with memremap()
On 5/30/22 4:26 AM, Juergen Gross wrote: xen_remap() is used to establish mappings for frames not under direct control of the kernel: for Xenstore and console ring pages, and for grant pages of non-PV guests. Today xen_remap() is defined to use ioremap() on x86 (doing uncached mappings), and ioremap_cache() on Arm (doing cached mappings). Uncached mappings for those use cases are bad for performance, so they should be avoided if possible. As all use cases of xen_remap() don't require uncached mappings (the mapped area is always physical RAM), a mapping using the standard WB cache mode is fine. As sparse is flagging some of the xen_remap() use cases to be not appropriate for iomem(), as the result is not annotated with the __iomem modifier, eliminate xen_remap() completely and replace all use cases with memremap() specifying the MEMREMAP_WB caching mode. xen_unmap() can be replaced with memunmap(). Reported-by: kernel test robot Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
[PATCH] MAINTAINERS: Update Xen maintainership
Due to time constraints I am stepping down as maintainer. I will stay as reviewer for x86 code (for which I create a separate category). Stefano is now maintainer for Xen hypervisor interface and Oleksandr has graciously agreed to become a reviewer. Signed-off-by: Boris Ostrovsky --- MAINTAINERS | 18 -- 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index d6d879cb0afd..b879c4e6750e 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -21549,23 +21549,29 @@ F:arch/arm64/include/asm/xen/ F: arch/arm64/xen/ XEN HYPERVISOR INTERFACE -M: Boris Ostrovsky M: Juergen Gross -R: Stefano Stabellini +M: Stefano Stabellini +R: Oleksandr Tyshchenko L: xen-devel@lists.xenproject.org (moderated for non-subscribers) S: Supported T: git git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git F: Documentation/ABI/stable/sysfs-hypervisor-xen F: Documentation/ABI/testing/sysfs-hypervisor-xen -F: arch/x86/include/asm/pvclock-abi.h -F: arch/x86/include/asm/xen/ -F: arch/x86/platform/pvh/ -F: arch/x86/xen/ F: drivers/*/xen-*front.c F: drivers/xen/ F: include/uapi/xen/ F: include/xen/ +XEN HYPERVISOR X86 +M: Juergen Gross +R: Boris Ostrovsky +L: xen-devel@lists.xenproject.org (moderated for non-subscribers) +S: Supported +F: arch/x86/include/asm/pvclock-abi.h +F: arch/x86/include/asm/xen/ +F: arch/x86/platform/pvh/ +F: arch/x86/xen/ + XEN NETWORK BACKEND DRIVER M: Wei Liu M: Paul Durrant -- 1.8.3.1
Re: [PATCH V2 2/7] xen/grants: support allocating consecutive grants
On 5/16/22 2:30 PM, Oleksandr wrote: On 16.05.22 19:00, Boris Ostrovsky wrote: With the error handling in gnttab_init() fixed yes, this is a diff that I am going to apply for the next version: [snip] @@ -1596,19 +1601,20 @@ static int gnttab_expand(unsigned int req_entries) int gnttab_init(void) { int i; - unsigned long max_nr_grant_frames; + unsigned long max_nr_grant_frames, max_nr_grefs; unsigned int max_nr_glist_frames, nr_glist_frames; int ret; gnttab_request_version(); max_nr_grant_frames = gnttab_max_grant_frames(); + max_nr_grefs = max_nr_grant_frames * + gnttab_interface->grefs_per_grant_frame; nr_grant_frames = 1; /* Determine the maximum number of frames required for the * grant reference free list on the current hypervisor. */ - max_nr_glist_frames = (max_nr_grant_frames * - gnttab_interface->grefs_per_grant_frame / RPP); + max_nr_glist_frames = max_nr_grefs / RPP; gnttab_list = kmalloc_array(max_nr_glist_frames, sizeof(grant_ref_t *), @@ -1625,8 +1631,7 @@ int gnttab_init(void) } } - i = gnttab_interface->grefs_per_grant_frame * max_nr_grant_frames; - gnttab_free_bitmap = bitmap_zalloc(i, GFP_KERNEL); + gnttab_free_bitmap = bitmap_zalloc(max_nr_grefs, GFP_KERNEL); if (!gnttab_free_bitmap) { ret = -ENOMEM; goto ini_nomem; Looks good, thanks. -boris
Re: [PATCH LINUX v5 1/2] xen: sync xs_wire.h header with upstream xen
On 5/13/22 5:19 PM, Stefano Stabellini wrote: From: Stefano Stabellini Sync the xs_wire.h header file in Linux with the one in Xen. Signed-off-by: Stefano Stabellini --- Changes in v5: - add XSD_ERROR(E2BIG) - Boris gave his reviewed-by but due to this change I removed it Reviewed-by: Boris Ostrovsky
Re: [PATCH V2 2/7] xen/grants: support allocating consecutive grants
On 5/16/22 1:59 AM, Juergen Gross wrote: On 14.05.22 04:34, Boris Ostrovsky wrote: On 5/13/22 1:33 AM, Juergen Gross wrote: On 12.05.22 22:01, Boris Ostrovsky wrote: On 5/7/22 2:19 PM, Oleksandr Tyshchenko wrote: +/* Rebuilds the free grant list and tries to find count consecutive entries. */ +static int get_free_seq(unsigned int count) +{ + int ret = -ENOSPC; + unsigned int from, to; + grant_ref_t *last; + + gnttab_free_tail_ptr = &gnttab_free_head; + last = &gnttab_free_head; + + for (from = find_first_bit(gnttab_free_bitmap, gnttab_size); + from < gnttab_size; + from = find_next_bit(gnttab_free_bitmap, gnttab_size, to + 1)) { + to = find_next_zero_bit(gnttab_free_bitmap, gnttab_size, + from + 1); + if (ret < 0 && to - from >= count) { + ret = from; + bitmap_clear(gnttab_free_bitmap, ret, count); + from += count; + gnttab_free_count -= count; IIUIC we can have multiple passes over this, meaning that the gnttab_free_count may be decremented more than once. Is that intentional? After the first pass decrementing gnttab_free_cnt, ret will no longer be less than zero, so this can be hit only once. Oh, yes, of course. + if (from == to) + continue; + } + + while (from < to) { + *last = from; + last = __gnttab_entry(from); + gnttab_last_free = from; + from++; + } I have been looking at this loop and I can't understand what it is doing ;-( Can you enlighten me? It is recreating the free list in order to have it properly sorted. This is needed to make sure that the free tail has the maximum possible size (you can take the tail off the list without having to worry about breaking the linked list because of references into the tail). So let's say we have the (one-dimensional) table of length 13 idx .. 2 3 ... 10 11 12 grant 12 11 2 -1 3 and gnttab_free_head is 10. I.e. the free list is 2, 12, 3, 11. You meant 10, 2, 12, 3, 11, I guess? What will this look like after the 2 iterations of the outer loop? idx .. 2 3 ... 10 11 12 grant 3 10 11 12 -1 with gnttab_free_head being 2, i.e the free list is now 2, 3, 10, 11, 12. OK, thanks, that helped. I couldn't link the free chunks in my head With the error handling in gnttab_init() fixed Reviewed-by: Boris Ostrovsky -boris
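To make the relinking easier to follow, here is a small self-contained sketch of the same idea in user-space C: walk the free bitmap in ascending order and rewrite the per-entry links so the list ends up sorted. It mirrors the example in the thread (free indexes 2, 3, 10, 11, 12) but is not the kernel code itself.

	#include <stdio.h>

	#define GNTTAB_LIST_END	0xffffffffU
	#define SIZE		13

	static unsigned int next_ref[SIZE];	/* "grant entry" holding the successor link */
	static unsigned int free_head = GNTTAB_LIST_END;

	/* Rebuild the free list in ascending index order from the free-bitmap. */
	static void rebuild_free_list(const int *free_bit)
	{
		unsigned int *last = &free_head;
		unsigned int i;

		for (i = 0; i < SIZE; i++) {
			if (!free_bit[i])
				continue;
			*last = i;		/* append index i to the list ...        */
			last = &next_ref[i];	/* ... and remember where its link lives */
		}
		*last = GNTTAB_LIST_END;
	}

	int main(void)
	{
		/* the example from the thread: indexes 2, 3, 10, 11, 12 are free */
		int free_bit[SIZE] = { [2] = 1, [3] = 1, [10] = 1, [11] = 1, [12] = 1 };
		unsigned int i;

		rebuild_free_list(free_bit);
		for (i = free_head; i != GNTTAB_LIST_END; i = next_ref[i])
			printf("%u ", i);	/* prints: 2 3 10 11 12 */
		printf("\n");
		return 0;
	}

With the list sorted this way, gnttab_free_tail_ptr can reference the link that starts the trailing run, so a block of consecutive grants can be taken off the tail without breaking the chain.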
Re: [PATCH V2 2/7] xen/grants: support allocating consecutive grants
On 5/13/22 1:33 AM, Juergen Gross wrote: On 12.05.22 22:01, Boris Ostrovsky wrote: On 5/7/22 2:19 PM, Oleksandr Tyshchenko wrote: +/* Rebuilds the free grant list and tries to find count consecutive entries. */ +static int get_free_seq(unsigned int count) +{ + int ret = -ENOSPC; + unsigned int from, to; + grant_ref_t *last; + + gnttab_free_tail_ptr = &gnttab_free_head; + last = &gnttab_free_head; + + for (from = find_first_bit(gnttab_free_bitmap, gnttab_size); + from < gnttab_size; + from = find_next_bit(gnttab_free_bitmap, gnttab_size, to + 1)) { + to = find_next_zero_bit(gnttab_free_bitmap, gnttab_size, + from + 1); + if (ret < 0 && to - from >= count) { + ret = from; + bitmap_clear(gnttab_free_bitmap, ret, count); + from += count; + gnttab_free_count -= count; IIUIC we can have multiple passes over this, meaning that the gnttab_free_count may be decremented more than once. Is that intentional? After the first pass decrementing gnttab_free_cnt, ret will no longer be less than zero, so this can be hit only once. Oh, yes, of course. + if (from == to) + continue; + } + + while (from < to) { + *last = from; + last = __gnttab_entry(from); + gnttab_last_free = from; + from++; + } I have been looking at this loop and I can't understand what it is doing ;-( Can you enlighten me? It is recreating the free list in order to have it properly sorted. This is needed to make sure that the free tail has the maximum possible size (you can take the tail off the list without having to worry about breaking the linked list because of references into the tail). So let's say we have the (one-dimensional) table of length 13 idx..23 ... 10 11 12 grant 12 112 -1 3 and gnttab_free_head is 10. I.e. the free list is 2, 12, 3, 11. What will this look like after the 2 iterations of the outer loop? (I am really having a mental block on this). -boris
Re: [PATCH V2 2/7] xen/grants: support allocating consecutive grants
On 5/7/22 2:19 PM, Oleksandr Tyshchenko wrote: + +/* + * Handling of free grants: + * + * Free grants are in a simple list anchored in gnttab_free_head. They are + * linked by grant ref, the last element contains GNTTAB_LIST_END. The number + * of free entries is stored in gnttab_free_count. + * Additionally there is a bitmap of free entries anchored in + * gnttab_free_bitmap. This is being used for simplifying allocation of + * multiple consecutive grants, which is needed e.g. for support of virtio. + * gnttab_last_free is used to add free entries of new frames at the end of + * the free list. + * gnttab_free_tail_ptr specifies the variable which references the start If this references the start of the free interval, why is it called gnttab_free_*tail*_ptr? + * of consecutive free grants ending with gnttab_last_free. This pointer is + * updated in a rather defensive way, in order to avoid performance hits in + * hot paths. + * All those variables are protected by gnttab_list_lock. + */ static int gnttab_free_count; -static grant_ref_t gnttab_free_head; +static unsigned int gnttab_size; +static grant_ref_t gnttab_free_head = GNTTAB_LIST_END; +static grant_ref_t gnttab_last_free = GNTTAB_LIST_END; +static grant_ref_t *gnttab_free_tail_ptr; +static unsigned long *gnttab_free_bitmap; static DEFINE_SPINLOCK(gnttab_list_lock); + struct grant_frames xen_auto_xlat_grant_frames; static unsigned int xen_gnttab_version; module_param_named(version, xen_gnttab_version, uint, 0); @@ -170,16 +194,111 @@ static int get_free_entries(unsigned count) ref = head = gnttab_free_head; gnttab_free_count -= count; - while (count-- > 1) - head = gnttab_entry(head); + while (count--) { + bitmap_clear(gnttab_free_bitmap, head, 1); + if (gnttab_free_tail_ptr == __gnttab_entry(head)) + gnttab_free_tail_ptr = &gnttab_free_head; + if (count) + head = gnttab_entry(head); + } gnttab_free_head = gnttab_entry(head); gnttab_entry(head) = GNTTAB_LIST_END; + if (!gnttab_free_count) { + gnttab_last_free = GNTTAB_LIST_END; + gnttab_free_tail_ptr = NULL; + } + spin_unlock_irqrestore(&gnttab_list_lock, flags); return ref; } +static int get_seq_entry_count(void) +{ + if (gnttab_last_free == GNTTAB_LIST_END || !gnttab_free_tail_ptr || + *gnttab_free_tail_ptr == GNTTAB_LIST_END) + return 0; + + return gnttab_last_free - *gnttab_free_tail_ptr + 1; +} + +/* Rebuilds the free grant list and tries to find count consecutive entries. */ +static int get_free_seq(unsigned int count) +{ + int ret = -ENOSPC; + unsigned int from, to; + grant_ref_t *last; + + gnttab_free_tail_ptr = &gnttab_free_head; + last = &gnttab_free_head; + + for (from = find_first_bit(gnttab_free_bitmap, gnttab_size); +from < gnttab_size; +from = find_next_bit(gnttab_free_bitmap, gnttab_size, to + 1)) { + to = find_next_zero_bit(gnttab_free_bitmap, gnttab_size, + from + 1); + if (ret < 0 && to - from >= count) { + ret = from; + bitmap_clear(gnttab_free_bitmap, ret, count); + from += count; + gnttab_free_count -= count; IIUIC we can have multiple passes over this, meaning that the gnttab_free_count may be decremented more than once. Is that intentional? + if (from == to) + continue; + } + + while (from < to) { + *last = from; + last = __gnttab_entry(from); + gnttab_last_free = from; + from++; + } I have been looking at this loop and I can't understand what it is doing ;-( Can you enlighten me? 
-boris + if (to < gnttab_size) + gnttab_free_tail_ptr = __gnttab_entry(to - 1); + } + + *last = GNTTAB_LIST_END; + if (gnttab_last_free != gnttab_size - 1) + gnttab_free_tail_ptr = NULL; + + return ret; +} + +static int get_free_entries_seq(unsigned int count) +{ + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(&gnttab_list_lock, flags); + + if (gnttab_free_count < count) { + ret = gnttab_expand(count - gnttab_free_count); + if (ret < 0) + goto out; + } + + if (get_seq_entry_count() < count) { + ret = get_free_seq(count); + if (ret >= 0) + goto out; + ret = gnttab_expand(count - get_seq_entry_count()); + if (ret < 0) + goto out; + } + + ret = *gnttab_free_t
Re: [PATCH V2 2/7] xen/grants: support allocating consecutive grants
On 5/11/22 2:00 PM, Oleksandr wrote: On 07.05.22 21:19, Oleksandr Tyshchenko wrote: Hello Boris, Stefano From: Juergen Gross For support of virtio via grant mappings in rare cases larger mappings using consecutive grants are needed. Support those by adding a bitmap of free grants. As consecutive grants will be needed only in very rare cases (e.g. when configuring a virtio device with a multi-page ring), optimize for the normal case of non-consecutive allocations. Signed-off-by: Juergen Gross --- Changes RFC -> V1: - no changes Changes V1 -> V2: - no changes May I please ask for the review here? I had a quick look but I am stuck on get_free_seq(), I need to stare at it some more. Unless someone else reviews this, I will try to get to this in the next couple of days. One thing I did notice is @@ -1452,6 +1624,13 @@ int gnttab_init(void) } } + i = gnttab_interface->grefs_per_grant_frame * max_nr_grant_frames; + gnttab_free_bitmap = bitmap_zalloc(i, GFP_KERNEL); + if (!gnttab_free_bitmap) { + ret = -ENOMEM; + goto ini_nomem; + } This overwrites 'i' and will break error handling at ini_nomem. -boris
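A minimal sketch of the fix for the clobbered counter: size the bitmap from its own variable so that 'i' still holds the gnttab_list[] index the ini_nomem error path expects. The max_nr_grefs name is taken from the follow-up version of the patch quoted further up; the fragment below is illustrative, not the final diff.

	unsigned long max_nr_grefs = max_nr_grant_frames *
				     gnttab_interface->grefs_per_grant_frame;

	gnttab_free_bitmap = bitmap_zalloc(max_nr_grefs, GFP_KERNEL);
	if (!gnttab_free_bitmap) {
		ret = -ENOMEM;
		goto ini_nomem;		/* 'i' untouched, so the unwind loop stays correct */
	}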
Re: Attributing linux related patches on xen-devel
On 5/9/22 3:01 PM, Stefano Stabellini wrote: On Mon, 9 May 2022, Juergen Gross wrote: On IRC the question came up whether it would be possible to have a special marker for Linux patches on the xen-devel ML. I suggested to use xen-devel+li...@lists.xenprojext.org for those patches. With a patch for the kernel's MAINTAINERS file this would be quite easy to achieve. Any thoughts? Fine by me, as long as xen-devel+li...@lists.xenprojext.org works :-) The alternative would be to come up with a different subject tag (e.g. "PATCH LINUX") but that doesn't work as it is not supported by today's Linux MAINTAINERS file. So I think xen-devel+linux is fine. I'd prefer '-' instead of '+' but either way is fine. -boris
Re: [PATCH v3 00/21] xen: simplify frontend side ring setup
On 5/5/22 4:16 AM, Juergen Gross wrote: Many Xen PV frontends share similar code for setting up a ring page (allocating and granting access for the backend) and for tearing it down. Create new service functions doing all needed steps in one go. This requires all frontends to use a common value for an invalid grant reference in order to make the functions idempotent. Changes in V3: - new patches 1 and 2, comments addressed Changes in V2: - new patch 9 and related changes in patches 10-18 For the patches that I was explicitly copied on: Reviewed-by: Boris Ostrovsky
Re: [PATCH v4 2/2] xen: add support for initializing xenstore later as HVM domain
On 5/4/22 8:23 PM, Stefano Stabellini wrote: @@ -957,25 +989,44 @@ static int __init xenbus_init(void) * been properly initialized. Instead of attempting to map a * wrong guest physical address return error. * -* Also recognize all bits set as an invalid value. +* Also recognize all bits set as an invalid/uninitialized value. What I really meant (but not what I actually wrote I guess) was that now we are treating -1 differently than 0 and so that comment should go ... */ - if (!v || !~v) { + if (!v) { err = -ENOENT; goto out_error; } - /* Avoid truncation on 32-bit. */ ... here. But this is nitpicking so for the series Reviewed-by: Boris Ostrovsky + if (v == ~0ULL) { + wait = true; + } else { + /* Avoid truncation on 32-bit. */
Re: [LINUX PATCH v3] xen: add support for initializing xenstore later as HVM domain
On 5/2/22 8:36 PM, Stefano Stabellini wrote: I gave it a try and there is a problem with unbinding the IRQ here. If I do that, later when xb_init_comms calls bind_evtchn_to_irqhandler, the event channel doesn't fire anymore. I did some testing and debugging and as far as I can tell the issue is that if we call unbind_from_irqhandler here, we end up calling xen_evtchn_close. Then, when xb_init_comms calls bind_evtchn_to_irqhandler on the same evtchn, it is not enough to receive event notifications anymore because from Xen POV the evtchn is "closed". Ah, yes. That's unfortunate. If I comment out xen_evtchn_close() in __unbind_from_irq, then yes, I can call unbind_from_irqhandler here instead of free_irq and everything works. My suggestion is to keep the call to free_irq here (not unbind_from_irqhandler) and improve the in-code comment. OK. You could add an argument to unbind_from_irq() to keep the event channel open (I in fact am not sure it should be closing the channel since bind routines don't open it) but that looks pretty ugly too. -boris
Re: [LINUX PATCH v3] xen: add support for initializing xenstore later as HVM domain
On 4/29/22 5:10 PM, Stefano Stabellini wrote: From: Luca Miccio When running as dom0less guest (HVM domain on ARM) the xenstore event channel is available at domain creation but the shared xenstore interface page only becomes available later on. In that case, wait for a notification on the xenstore event channel, then complete the xenstore initialization later, when the shared page is actually available. The xenstore page has few extra field. Add them to the shared struct. One of the field is "connection", when the connection is ready, it is zero. If the connection is not-zero, wait for a notification. Signed-off-by: Luca Miccio Signed-off-by: Stefano Stabellini CC: jgr...@suse.com CC: boris.ostrov...@oracle.com --- Changes in v3: - check for the connection field, if it is not zero, wait for event Changes in v2: - remove XENFEAT_xenstore_late_init --- drivers/xen/xenbus/xenbus_probe.c | 86 +++--- include/xen/interface/io/xs_wire.h | 3 ++ 2 files changed, 70 insertions(+), 19 deletions(-) diff --git a/drivers/xen/xenbus/xenbus_probe.c b/drivers/xen/xenbus/xenbus_probe.c index fe360c33ce71..dc046d25789e 100644 --- a/drivers/xen/xenbus/xenbus_probe.c +++ b/drivers/xen/xenbus/xenbus_probe.c @@ -65,6 +65,7 @@ #include "xenbus.h" +static int xs_init_irq; int xen_store_evtchn; EXPORT_SYMBOL_GPL(xen_store_evtchn); @@ -750,6 +751,17 @@ static void xenbus_probe(void) { xenstored_ready = 1; + if (!xen_store_interface) { + xen_store_interface = xen_remap(xen_store_gfn << XEN_PAGE_SHIFT, + XEN_PAGE_SIZE); + /* +* Now it is safe to free the IRQ used for xenstore late +* initialization. No need to unbind: it is about to be +* bound again. This assumes knowledge of bind/unbind internals. I think we should unbind. +*/ + free_irq(xs_init_irq, &xb_waitq); + } + @@ -959,23 +988,42 @@ static int __init xenbus_init(void) * * Also recognize all bits set as an invalid value. Is this comment still correct? */ - if (!v || !~v) { + if (!v) { err = -ENOENT; goto out_error; } - /* Avoid truncation on 32-bit. */ + if (v == ~0ULL) { + wait = true; + } else { + /* Avoid truncation on 32-bit. */
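For context, a rough sketch of the late-init plumbing being reviewed here: bind a throwaway handler to the xenstore event channel whose only job is to wake the xenbus wait queue once xenstored populates the shared page; xenbus_probe() then maps the page and releases the temporary IRQ. The handler name and helper below are illustrative assumptions, not the exact hunks of the posted patch.

	static irqreturn_t xs_init_interrupt(int irq, void *dev_id)
	{
		wake_up(&xb_waitq);	/* xenstored may be up now, re-check the page */
		return IRQ_HANDLED;
	}

	static int xenbus_init_late(void)
	{
		int err;

		err = bind_evtchn_to_irqhandler(xen_store_evtchn, xs_init_interrupt,
						0, "xenbus_init", &xb_waitq);
		if (err < 0)
			return err;
		xs_init_irq = err;	/* freed again in xenbus_probe() */
		return 0;
	}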
Re: [PATCH] swiotlb-xen: fix DMA_ATTR_NO_KERNEL_MAPPING on arm
On 4/28/22 6:49 PM, Stefano Stabellini wrote: On Thu, 28 Apr 2022, Boris Ostrovsky wrote: On 4/28/22 5:49 PM, Stefano Stabellini wrote: On Thu, 28 Apr 2022, Christoph Hellwig wrote: On Tue, Apr 26, 2022 at 04:07:45PM -0700, Stefano Stabellini wrote: Reported-by: Rahul Singh Signed-off-by: Christoph Hellwig Reviewed-by: Stefano Stabellini Do you want to take this through the Xen tree or should I pick it up? Either way I'd love to see some testing on x86 as well. I agree on the x86 testing. Juergen, Boris? I'd say to take this patch via the Xen tree but Juergen has just sent a Xen pull request to Linux last Saturday. Juergen do you plan to send another one? Do you have something else lined up? If not, it might be better to let Christoph pick it up. We don't have anything pending. I can test it but at best tomorrow so not sure we can get this into rc5. Do you consider this an urgent fix or can we wait until 5.19? Because it's a bit too much IMO for rc6. On one hand, Linux doesn't boot on a platform without this fix. On the other hand, I totally see that this patch could introduce regressions on x86 so I think it is fair that we are careful with it. From my point of view, it might be better to wait for 5.19 and mark it as backport. No problems uncovered during testing. -boris
Re: [PATCH v2 0/4] xen/pv-scsi: update header and harden frontend
On 4/28/22 3:53 AM, Juergen Gross wrote: Update the Xen PV-scsi interface from the Xen tree and adapt the related drivers to use the new definitions. Harden the frontend driver to be no longer vulnerable to a malicious backend. Juergen Gross (4): xen: update vscsiif.h xen/scsiback: use new command result macros xen/scsifront: use new command result macros xen/scsifront: harden driver against malicious backend drivers/scsi/xen-scsifront.c | 168 ++--- drivers/xen/xen-scsiback.c | 82 +- include/xen/interface/io/vscsiif.h | 133 ++- 3 files changed, 340 insertions(+), 43 deletions(-) Reviewed-by: Boris Ostrovsky
Re: [PATCH] swiotlb-xen: fix DMA_ATTR_NO_KERNEL_MAPPING on arm
On 4/28/22 5:49 PM, Stefano Stabellini wrote: On Thu, 28 Apr 2022, Christoph Hellwig wrote: On Tue, Apr 26, 2022 at 04:07:45PM -0700, Stefano Stabellini wrote: Reported-by: Rahul Singh Signed-off-by: Christoph Hellwig Reviewed-by: Stefano Stabellini Do you want to take this through the Xen tree or should I pick it up? Either way I'd love to see some testing on x86 as well. I agree on the x86 testing. Juergen, Boris? I'd say to take this patch via the Xen tree but Juergen has just sent a Xen pull request to Linux last Saturday. Juergen do you plan to send another one? Do you have something else lined up? If not, it might be better to let Christoph pick it up. We don't have anything pending. I can test it but at best tomorrow so not sure we can get this into rc5. Do you consider this an urgent fix or can we wait until 5.19? Because it's a bit too much IMO for rc6. -boris
Re: x86/PV: (lack of) MTRR exposure
On 4/28/22 11:53 AM, Jan Beulich wrote: Hello, in the course of analyzing the i915 driver causing boot to fail in Linux 5.18 I found that Linux, for all the years, has been running in PV mode as if PAT was (mostly) disabled. This is a result of them tying PAT initialization to MTRR initialization, while we offer PAT but not MTRR in CPUID output. This was different before our moving to CPU featuresets, and as such one could view this behavior as a regression from that change. The first question here is whether not exposing MTRR as a feature was really intended, in particular also for PV Dom0. The XenoLinux kernel and its forward ports did make use of XENPF_*_memtype to deal with MTRRs. That's functionality which (maybe for a good reason) never made it into the pvops kernel. Note that PVH Dom0 does have access to the original settings, as the host values are used as initial state there. Initially MTRR was supposed to be supported but it was shot down by x86 maintainers who strongly suggested to use PAT. https://lists.xen.org/archives/html/xen-devel/2010-09/msg01634.html -boris The next question would be how we could go about improving the situation. For the particular issue in 5.18 I've found a relatively simple solution [1] (which also looks to help graphics performance on other systems, according to my initial observations with using the change), albeit its simplicity likely means it either is wrong in some way, or might not be liked for looking hacky and/or abusive. We can't, for example, simply turn on the MTRR bit in CPUID, as that would implicitly lead to the kernel trying to write the PAT MSR (if, see below, it didn't itself zap the bit). We also can't simply ignore PAT MSR writes, as the kernel won't check whether writes actually took effect. (All of that did work on top of old Xen apparently only because xen_init_capabilities() itself also forces the MTRR feature to off.) Jan [1] https://lists.xen.org/archives/html/xen-devel/2022-04/msg02392.html
Re: [PATCH V1 3/6] xen/virtio: Add option to restrict memory access under Xen
On 4/24/22 12:53 PM, Oleksandr wrote: On 23.04.22 19:40, Christoph Hellwig wrote: + +#ifdef CONFIG_ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS +int arch_has_restricted_virtio_memory_access(void) +{ + return (xen_has_restricted_virtio_memory_access() || + cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)); +} So instead of hardcoding Xen here, this seems like a candidate for another cc_platform_has flag. I have a limited knowledge of x86 and Xen on x86. Would the Xen specific bits fit into Confidential Computing Platform checks? I will let Juergen/Boris comment on this. This is unrelated to confidential so I don't think we can add another CC_ flag. Would arch/x86/kernel/cpu/hypervisor.c be a better home for this? -boris
Re: [PATCH v2] xen: Convert kmap() to kmap_local_page()
On 4/19/22 7:43 PM, Alaa Mohamed wrote: kmap() is being deprecated and these usages are all local to the thread so there is no reason kmap_local_page() can't be used. Replace kmap() calls with kmap_local_page(). Signed-off-by: Alaa Mohamed Applied to for-linus-5.18. (Juergen, I kept your R-b from v1) -boris
Re: [PATCH 2/4] xen/scsiback: use new command result macros
On 4/21/22 4:40 AM, Juergen Gross wrote: On 20.04.22 18:12, Boris Ostrovsky wrote: On 4/20/22 5:25 AM, Juergen Gross wrote: @@ -569,7 +645,7 @@ static void scsiback_device_action(struct vscsibk_pend *pending_req, wait_for_completion(&pending_req->tmr_done); err = (se_cmd->se_tmr_req->response == TMR_FUNCTION_COMPLETE) ? - SUCCESS : FAILED; + XEN_VSCSIIF_RSLT_RESET_SUCCESS : XEN_VSCSIIF_RSLT_RESET_FAILED; scsiback_do_resp_with_sense(NULL, err, 0, pending_req); transport_generic_free_cmd(&pending_req->se_cmd, 0); You also want to initialize err to XEN_VSCSIIF_RSLT_RESET_FAILED. I did that. Yes you did. I don't know what I was looking at. And also looking at invocations of scsiback_do_resp_with_sense() I think those may need to be adjusted as well. No, the invocations are fine, but scsiback_result() needs to pass through the lowest 16 bits instead of only the lowest 8 bits of the result value. What I was thinking was that this could use the reverse of XEN_VSCSIIF_RSLT_HOST(), i.e. something like #define RSLT_HOST_TO_XEN_VSCSIIF(x) ((x)<<16) to be explicit about namespaces. BTW, should scsiback_result() use XEN_VSCSIIF_RSLT_HOST() at the top? -boris
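Spelled out, the suggestion is a macro that is the exact inverse of XEN_VSCSIIF_RSLT_HOST(), so host-status values are shifted into place with the namespacing made explicit. The usage line below is illustrative; the variable names are not from the posted series.

	/* reverse of XEN_VSCSIIF_RSLT_HOST(): put a host status into the host byte */
	#define RSLT_HOST_TO_XEN_VSCSIIF(x)	((x) << 16)

	/* e.g. building the response value from a host status plus the
	 * lowest 16 bits passed through from the backend's result: */
	ring_res.rslt = RSLT_HOST_TO_XEN_VSCSIIF(host_status) | (result & 0xffff);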
Re: [PATCH 09/18] xen/xenbus: add xenbus_setup_ring() service function
On 4/20/22 11:09 AM, Juergen Gross wrote: +/* + * xenbus_setup_ring + * @dev: xenbus device + * @vaddr: pointer to starting virtual address of the ring + * @nr_pages: number of pages to be granted + * @grefs: grant reference array to be filled in + * + * Allocate physically contiguous pages for a shared ring buffer and grant it + * to the peer of the given device. The ring buffer is initially filled with + * zeroes. The virtual address of the ring is stored at @vaddr and the + * grant references are stored in the @grefs array. In case of error @vaddr + * will be set to NULL and @grefs will be filled with INVALID_GRANT_REF. + */ +int xenbus_setup_ring(struct xenbus_device *dev, gfp_t gfp, void **vaddr, + unsigned int nr_pages, grant_ref_t *grefs) +{ + unsigned long ring_size = nr_pages * XEN_PAGE_SIZE; + unsigned int i; + int ret; + + *vaddr = alloc_pages_exact(ring_size, gfp | __GFP_ZERO); + if (!*vaddr) { + ret = -ENOMEM; + goto err; + } + + ret = xenbus_grant_ring(dev, *vaddr, nr_pages, grefs); + if (ret) + goto err; + + return 0; + + err: + if (*vaddr) + free_pages_exact(*vaddr, ring_size); + for (i = 0; i < nr_pages; i++) + grefs[i] = INVALID_GRANT_REF; + *vaddr = NULL; + + return ret; +} We can create a wrapper around this function that will also call SHARED_RING_INIT() and FRONT_RING_INIT(). A bunch of drivers do that. -boris
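Something like the following is what such a frontend-side wrapper could look like, for a driver whose ring types come from DEFINE_RING_TYPES(foo, ...); the foo names and the GFP flags are placeholders, not an existing API.

	static int foo_setup_ring(struct xenbus_device *dev,
				  struct foo_front_ring *front_ring, grant_ref_t *gref)
	{
		struct foo_sring *sring;
		void *vaddr;
		int err;

		err = xenbus_setup_ring(dev, GFP_NOIO | __GFP_HIGH, &vaddr, 1, gref);
		if (err)
			return err;

		sring = vaddr;
		SHARED_RING_INIT(sring);
		FRONT_RING_INIT(front_ring, sring, XEN_PAGE_SIZE);
		return 0;
	}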
Re: [PATCH 4/4] xen/scsifront: harden driver against malicious backend
Just a couple of nits. On 4/20/22 5:25 AM, Juergen Gross wrote: -static int scsifront_ring_drain(struct vscsifrnt_info *info) +static int scsifront_ring_drain(struct vscsifrnt_info *info, + unsigned int *eoiflag) { - struct vscsiif_response *ring_rsp; + struct vscsiif_response ring_rsp; RING_IDX i, rp; int more_to_do = 0; - rp = info->ring.sring->rsp_prod; - rmb(); /* ordering required respective to dom0 */ + rp = READ_ONCE(info->ring.sring->rsp_prod); + virt_rmb(); /* ordering required respective to backend */ + if (RING_RESPONSE_PROD_OVERFLOW(&info->ring, rp)) { + scsifront_set_error(info, "illegal number of responses"); In net and block drivers we report number of such responses. (But not in usb) + return 0; + } for (i = info->ring.rsp_cons; i != rp; i++) { - ring_rsp = RING_GET_RESPONSE(&info->ring, i); - scsifront_do_response(info, ring_rsp); + RING_COPY_RESPONSE(&info->ring, i, &ring_rsp); + scsifront_do_response(info, &ring_rsp); + if (info->host_active == STATE_ERROR) + return 0; + *eoiflag = 0; *eoiflags &= ~XEN_EOI_FLAG_SPURIOUS; ? We also use eoi_flags name in other instances in this file. -boris
Re: [PATCH 2/4] xen/scsiback: use new command result macros
On 4/20/22 5:25 AM, Juergen Gross wrote: @@ -569,7 +645,7 @@ static void scsiback_device_action(struct vscsibk_pend *pending_req, wait_for_completion(&pending_req->tmr_done); err = (se_cmd->se_tmr_req->response == TMR_FUNCTION_COMPLETE) ? - SUCCESS : FAILED; + XEN_VSCSIIF_RSLT_RESET_SUCCESS : XEN_VSCSIIF_RSLT_RESET_FAILED; scsiback_do_resp_with_sense(NULL, err, 0, pending_req); transport_generic_free_cmd(&pending_req->se_cmd, 0); You also want to initialize err to XEN_VSCSIIF_RSLT_RESET_FAILED. And also looking at invocations of scsiback_do_resp_with_sense() I think those may need to be adjusted as well. -boris
Re: [PATCH] xen/balloon: don't use PV mode extra memory for zone device allocations
On 4/7/22 5:38 AM, Juergen Gross wrote: When running as a Xen PV guest use the extra memory (memory which isn't allocated for the guest at boot time) only for ballooning purposes and not for zone device allocations. This will remove some code without any lack of functionality. While at it move some code to get rid of another #ifdef. Remove a comment which is stale since some time now. Signed-off-by: Juergen Gross Applied to for-linus-5.18 -boris
Re: [PATCH] xen/balloon: don't use PV mode extra memory for zone device allocations
On 4/7/22 5:38 AM, Juergen Gross wrote: When running as a Xen PV guest use the extra memory (memory which isn't allocated for the guest at boot time) only for ballooning purposes and not for zone device allocations. This will remove some code without any lack of functionality. While at it move some code to get rid of another #ifdef. Remove a comment which is stale since some time now. Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: cleanup swiotlb initialization v8
On 4/4/22 1:05 AM, Christoph Hellwig wrote: Hi all, this series tries to clean up the swiotlb initialization, including that of swiotlb-xen. To get there is also removes the x86 iommu table infrastructure that massively obsfucates the initialization path. Git tree: git://git.infradead.org/users/hch/misc.git swiotlb-init-cleanup Gitweb: http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/swiotlb-init-cleanup Tested-by: Boris Ostrovsky
Re: [PATCH v2] xen: fix is_xen_pmu()
On 3/25/22 10:20 AM, Juergen Gross wrote: is_xen_pmu() is taking the cpu number as parameter, but it is not using it. Instead it just tests whether the Xen PMU initialization on the current cpu did succeed. As this test is done by checking a percpu pointer, preemption needs to be disabled in order to avoid switching the cpu while doing the test. While resuming from suspend() this seems not to be the case: [ 88.082751] ACPI: PM: Low-level resume complete [ 88.087933] ACPI: EC: EC started [ 88.091464] ACPI: PM: Restoring platform NVS memory [ 88.097166] xen_acpi_processor: Uploading Xen processor PM info [ 88.103850] Enabling non-boot CPUs ... [ 88.108128] installing Xen timer for CPU 1 [ 88.112763] BUG: using smp_processor_id() in preemptible [] code: systemd-sleep/7138 [ 88.122256] caller is is_xen_pmu+0x12/0x30 [ 88.126937] CPU: 0 PID: 7138 Comm: systemd-sleep Tainted: GW 5.16.13-2.fc32.qubes.x86_64 #1 [ 88.137939] Hardware name: Star Labs StarBook/StarBook, BIOS 7.97 03/21/2022 [ 88.145930] Call Trace: [ 88.148757] [ 88.151193] dump_stack_lvl+0x48/0x5e [ 88.155381] check_preemption_disabled+0xde/0xe0 [ 88.160641] is_xen_pmu+0x12/0x30 [ 88.164441] xen_smp_intr_init_pv+0x75/0x100 Fix that by replacing is_xen_pmu() by a simple boolean variable which reflects the Xen PMU initialization state on cpu 0. Modify xen_pmu_init() to return early in case it is being called for a cpu other than cpu 0 and the boolean variable not being set. Fixes: bf6dfb154d93 ("xen/PMU: PMU emulation code") Reported-by: Marek Marczykowski-Górecki Signed-off-by: Juergen Gross Applied to for-linus-5.18.
Re: [PATCH] xen: don't hang when resuming PCI device
On 3/22/22 9:21 PM, Jakub Kądziołka wrote: If a xen domain with at least two VCPUs has a PCI device attached which enters the D3hot state during suspend, the kernel may hang while resuming, depending on the core on which an async resume task gets scheduled. The bug occurs because xen's do_suspend calls dpm_resume_start while only the timer of the boot CPU has been resumed (when xen_suspend called syscore_resume), before calling xen_arch_suspend to resume the timers of the other CPUs. This breaks pci_dev_d3_sleep. Thus this patch moves the call to xen_arch_resume before the call to dpm_resume_start, eliminating the hangs and restoring the stack-like structure of the suspend/restore procedure. Signed-off-by: Jakub Kądziołka Applied to for-linus-5.18
Re: [PATCH v2] xen: fix is_xen_pmu()
On 3/25/22 10:20 AM, Juergen Gross wrote: is_xen_pmu() is taking the cpu number as parameter, but it is not using it. Instead it just tests whether the Xen PMU initialization on the current cpu did succeed. As this test is done by checking a percpu pointer, preemption needs to be disabled in order to avoid switching the cpu while doing the test. While resuming from suspend() this seems not to be the case: [ 88.082751] ACPI: PM: Low-level resume complete [ 88.087933] ACPI: EC: EC started [ 88.091464] ACPI: PM: Restoring platform NVS memory [ 88.097166] xen_acpi_processor: Uploading Xen processor PM info [ 88.103850] Enabling non-boot CPUs ... [ 88.108128] installing Xen timer for CPU 1 [ 88.112763] BUG: using smp_processor_id() in preemptible [] code: systemd-sleep/7138 [ 88.122256] caller is is_xen_pmu+0x12/0x30 [ 88.126937] CPU: 0 PID: 7138 Comm: systemd-sleep Tainted: GW 5.16.13-2.fc32.qubes.x86_64 #1 [ 88.137939] Hardware name: Star Labs StarBook/StarBook, BIOS 7.97 03/21/2022 [ 88.145930] Call Trace: [ 88.148757] [ 88.151193] dump_stack_lvl+0x48/0x5e [ 88.155381] check_preemption_disabled+0xde/0xe0 [ 88.160641] is_xen_pmu+0x12/0x30 [ 88.164441] xen_smp_intr_init_pv+0x75/0x100 Fix that by replacing is_xen_pmu() by a simple boolean variable which reflects the Xen PMU initialization state on cpu 0. Modify xen_pmu_init() to return early in case it is being called for a cpu other than cpu 0 and the boolean variable not being set. Fixes: bf6dfb154d93 ("xen/PMU: PMU emulation code") Reported-by: Marek Marczykowski-Górecki Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: [PATCH] xen: fix is_xen_pmu()
On 3/22/22 11:50 AM, Juergen Gross wrote: is_xen_pmu() is taking the cpu number as parameter, but it is not using it. Instead it just tests whether the Xen PMU initialization on the current cpu did succeed. As this test is done by checking a percpu pointer, preemption needs to be disabled in order to avoid switching the cpu while doing the test. While resuming from suspend() this seems not to be the case: [ 88.082751] ACPI: PM: Low-level resume complete [ 88.087933] ACPI: EC: EC started [ 88.091464] ACPI: PM: Restoring platform NVS memory [ 88.097166] xen_acpi_processor: Uploading Xen processor PM info [ 88.103850] Enabling non-boot CPUs ... [ 88.108128] installing Xen timer for CPU 1 [ 88.112763] BUG: using smp_processor_id() in preemptible [] code: systemd-sleep/7138 [ 88.122256] caller is is_xen_pmu+0x12/0x30 [ 88.126937] CPU: 0 PID: 7138 Comm: systemd-sleep Tainted: GW 5.16.13-2.fc32.qubes.x86_64 #1 [ 88.137939] Hardware name: Star Labs StarBook/StarBook, BIOS 7.97 03/21/2022 [ 88.145930] Call Trace: [ 88.148757] [ 88.151193] dump_stack_lvl+0x48/0x5e [ 88.155381] check_preemption_disabled+0xde/0xe0 [ 88.160641] is_xen_pmu+0x12/0x30 [ 88.164441] xen_smp_intr_init_pv+0x75/0x100 There is actually another PMU-related problem on restore which was caused (or, rather, highlighted) by ff083a2d972f56bebfd82409ca62e5dfce950961: [ 116.861637] [ cut here ] [ 116.861651] WARNING: CPU: 1 PID: 31 at kernel/events/core.c:6614 perf_register_guest_info_callbacks+0x68/0x70 [ 116.861673] Modules linked in: [ 116.861682] CPU: 1 PID: 31 Comm: xenwatch Not tainted 5.17.0-rc7ostr #103 [ 116.861695] RIP: e030:perf_register_guest_info_callbacks+0x68/0x70 [ 116.861706] Code: c7 c7 40 e1 86 82 e8 d7 e7 ff ff 48 8b 53 10 48 85 d2 74 14 48 c7 c6 f0 0a c0 81 48 c7 c7 30 e1 86 82 5b e9 ba e7 ff ff 5b c3 <0f> 0b c3 0f 1f 44 00 00 0f 1f 44 00 00 48 8b 05 54 fd 0b 02 48 39 [ 116.861747] RSP: e02b:c9004016fe18 EFLAGS: 00010286 [ 116.861758] RAX: 82432850 RBX: 0003 RCX: 888079c0 [ 116.861768] RDX: 888079c0 RSI: c9004016fe30 RDI: 82432850 [ 116.861778] RBP: R08: 1600 R09: ea0ed340 [ 116.861788] R10: 0758 R11: R12: 888003b4d000 [ 116.861797] R13: 0003 R14: 8162cf10 R15: [ 116.861819] FS: () GS:888079c8() knlGS: [ 116.861830] CS: e030 DS: ES: CR0: 80050033 [ 116.861839] CR2: CR3: 062b6000 CR4: 00040660 [ 116.861853] Call Trace: [ 116.861861] [ 116.861866] xen_pmu_init+0x187/0x280 [ 116.861879] xen_arch_resume+0x30/0x50 [ 116.861888] do_suspend.cold+0x132/0x147 [ 116.861899] shutdown_handler+0x12e/0x140 [ 116.861910] xenwatch_thread+0x94/0x180 [ 116.861919] ? finish_wait+0x80/0x80 [ 116.861928] kthread+0xe7/0x110 [ 116.861938] ? kthread_complete_and_exit+0x20/0x20 [ 116.861948] ret_from_fork+0x22/0x30 [ 116.861959] [ 116.861964] ---[ end trace ]--- I was going to send a patch but I think yours can be slightly modified to take care of this problem as well. @@ -542,6 +539,7 @@ void xen_pmu_init(int cpu) per_cpu(xenpmu_shared, cpu).flags = 0; if (cpu == 0) { if (!is_xen_pmu) + is_xen_pmu = true; perf_register_guest_info_callbacks(&xen_guest_cbs); xen_pmu_arch_init(); } @@ -572,4 +570,7 @@ void xen_pmu_finish(int cpu) free_pages((unsigned long)per_cpu(xenpmu_shared, cpu).xenpmu_data, 0); per_cpu(xenpmu_shared, cpu).xenpmu_data = NULL; + + if (cpu == 0) + is_xen_pmu = false; And drop this hunk. -boris
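Putting the two comments together, the cpu 0 path of xen_pmu_init() would end up along these lines (a sketch of the described behaviour, not the final diff): register the perf callbacks only once, and let secondary CPUs bail out early when cpu 0 never completed PMU setup.

	bool is_xen_pmu;	/* true once PMU setup succeeded on cpu 0 */

	/* early in xen_pmu_init(cpu): */
	if (cpu != 0 && !is_xen_pmu)
		return;

	/* ... per-cpu xenpmu_data setup as before ... */

	if (cpu == 0 && !is_xen_pmu) {
		is_xen_pmu = true;
		perf_register_guest_info_callbacks(&xen_guest_cbs);
		xen_pmu_arch_init();
	}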
Re: [PATCH] arch:x86:xen: Remove unnecessary assignment in xen_apic_read()
On 3/14/22 3:05 AM, jianchunfu wrote: In the function xen_apic_read(), the initialized value of 'ret' is unused because it will be assigned by the function HYPERVISOR_platform_op(), thus remove it. Signed-off-by: jianchunfu Reviewed-by: Boris Ostrovsky Applied to for-linus-5.18 (both patches)
Re: [PATCH 1/2] xen/grant-table: remove gnttab_*transfer*() functions
On 3/11/22 5:34 AM, Juergen Gross wrote: All grant table operations related to the "transfer" functionality are unused currently. There have been users in the old days of the "Xen-o-Linux" kernel, but those didn't make it upstream. So remove the "transfer" related functions. Signed-off-by: Juergen Gross Applied to for-linus-5.18 (both patches)
Re: [PATCH 1/2] xen/grant-table: remove gnttab_*transfer*() functions
On 3/15/22 2:10 AM, Juergen Gross wrote: On 15.03.22 00:37, Boris Ostrovsky wrote: On 3/11/22 5:34 AM, Juergen Gross wrote: All grant table operations related to the "transfer" functionality are unused currently. There have been users in the old days of the "Xen-o-Linux" kernel, but those didn't make it upstream. So remove the "transfer" related functions. Do we need to assert somewhere that transfer flags are not set? This would be an orthogonal change, right? My patch is just removing never called functions. I was thinking of having this done as part of code removal (maybe as a separate patch). In any case I believe checking those flags is the job of the hypervisor. If an operation is illegal due to a transfer flag being set, it needs to be rejected at hypervisor level. True. Reviewed-by: Boris Ostrovsky
Re: [PATCH 12/15] swiotlb: provide swiotlb_init variants that remap the buffer
On 3/15/22 2:36 AM, Christoph Hellwig wrote: @@ -271,12 +273,23 @@ void __init swiotlb_init(bool addressing_limit, unsigned int flags) * allow to pick a location everywhere for hypervisors with guest * memory encryption. */ +retry: + bytes = PAGE_ALIGN(default_nslabs << IO_TLB_SHIFT); if (flags & SWIOTLB_ANY) tlb = memblock_alloc(bytes, PAGE_SIZE); else tlb = memblock_alloc_low(bytes, PAGE_SIZE); if (!tlb) goto fail; + if (remap && remap(tlb, nslabs) < 0) { + memblock_free(tlb, PAGE_ALIGN(bytes)); + + if (nslabs <= IO_TLB_MIN_SLABS) + panic("%s: Failed to remap %zu bytes\n", + __func__, bytes); + nslabs = max(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE)); I spoke with Konrad (who wrote the original patch --- f4b2f07b2ed9b469ead87e06fc2fc3d12663a725) and apparently the reason for 2MB was to optimize for Xen's slab allocator, it had nothing to do with IO_TLB_MIN_SLABS. Since this is now common code we should not expose Xen-specific optimizations here and smaller values will still work so IO_TLB_MIN_SLABS is fine. I think this should be mentioned in the commit message though, probably best in the next patch where you switch to this code. As far as the hunk above, I don't think we need the max() here: with IO_TLB_MIN_SLABS being 512 we may get stuck in an infinite loop. Something like nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE); if (nslabs <= IO_TLB_MIN_SLABS) panic() should be sufficient. + goto retry; + } if (swiotlb_init_with_tbl(tlb, default_nslabs, flags)) goto fail_free_mem; return; @@ -287,12 +300,18 @@ void __init swiotlb_init(bool addressing_limit, unsigned int flags) pr_warn("Cannot allocate buffer"); } +void __init swiotlb_init(bool addressing_limit, unsigned int flags) +{ + return swiotlb_init_remap(addressing_limit, flags, NULL); +} + /* * Systems with larger DMA zones (those that don't support ISA) can * initialize the swiotlb later using the slab allocator if needed. * This should be just like above, but with some error catching. */ -int swiotlb_init_late(size_t size, gfp_t gfp_mask) +int swiotlb_init_late(size_t size, gfp_t gfp_mask, + int (*remap)(void *tlb, unsigned long nslabs)) { unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE); unsigned long bytes; @@ -303,6 +322,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask) if (swiotlb_force_disable) return 0; +retry: order = get_order(nslabs << IO_TLB_SHIFT); nslabs = SLABS_PER_PAGE << order; bytes = nslabs << IO_TLB_SHIFT; @@ -317,6 +337,16 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask) if (!vstart) return -ENOMEM; + if (remap) + rc = remap(vstart, nslabs); + if (rc) { + free_pages((unsigned long)vstart, order); + + if (IO_TLB_MIN_SLABS <= 1024) + return rc; + nslabs = max(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE)); Same here. (The 'if' check above is wrong anyway). Patches 13 and 14 look good. -boris + goto retry; + } if (order != get_order(bytes)) { pr_warn("only able to allocate %ld MB\n",
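The termination problem is easy to demonstrate; below is a self-contained userspace sketch of the two loop shapes, with IO_TLB_SEGSIZE=128 and IO_TLB_MIN_SLABS=512 mirroring the swiotlb constants and a remap() that is assumed to always fail.

#include <stdio.h>

#define IO_TLB_SEGSIZE		128UL
#define IO_TLB_MIN_SLABS	512UL
#define ALIGN(x, a)		(((x) + (a) - 1) & ~((a) - 1))
#define MAX(a, b)		((a) > (b) ? (a) : (b))

int main(void)
{
	unsigned long nslabs;
	int i;

	/* Shape in the posted patch: nslabs can never drop below 1024,
	 * so with IO_TLB_MIN_SLABS == 512 the panic() condition is
	 * unreachable and a persistently failing remap() retries
	 * forever (capped at 10 iterations here just to show it). */
	nslabs = 32768;
	for (i = 0; i < 10; i++) {
		if (nslabs <= IO_TLB_MIN_SLABS)
			break;			/* would panic() */
		nslabs = MAX(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE));
	}
	printf("with max(): %d iterations, nslabs stuck at %lu\n", i, nslabs);

	/* Suggested shape: halve first, then compare against the
	 * minimum, so a permanently failing remap() terminates. */
	nslabs = 32768;
	for (i = 0; ; i++) {
		nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
		if (nslabs <= IO_TLB_MIN_SLABS)
			break;			/* would panic() */
	}
	printf("without max(): stops after %d halvings at %lu slabs\n",
	       i + 1, nslabs);

	return 0;
}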
Re: [PATCH 1/2] xen/grant-table: remove gnttab_*transfer*() functions
On 3/11/22 5:34 AM, Juergen Gross wrote: All grant table operations related to the "transfer" functionality are unused currently. There have been users in the old days of the "Xen-o-Linux" kernel, but those didn't make it upstream. So remove the "transfer" related functions. Do we need to assert somewhere that transfer flags are not set? -boris
Re: [PATCH 14/15] swiotlb: remove swiotlb_init_with_tbl and swiotlb_init_late_with_tbl
On 3/14/22 3:31 AM, Christoph Hellwig wrote: @@ -314,6 +293,7 @@ void __init swiotlb_init(bool addressing_limit, unsigned int flags) int swiotlb_init_late(size_t size, gfp_t gfp_mask, int (*remap)(void *tlb, unsigned long nslabs)) { + struct io_tlb_mem *mem = &io_tlb_default_mem; unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE); unsigned long bytes; unsigned char *vstart = NULL; @@ -355,33 +335,28 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask, (PAGE_SIZE << order) >> 20); nslabs = SLABS_PER_PAGE << order; } - rc = swiotlb_late_init_with_tbl(vstart, nslabs); - if (rc) - free_pages((unsigned long)vstart, order); - - return rc; -} - -int -swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs) -{ - struct io_tlb_mem *mem = &io_tlb_default_mem; - unsigned long bytes = nslabs << IO_TLB_SHIFT; - if (swiotlb_force_disable) - return 0; + if (remap) + rc = remap(vstart, nslabs); + if (rc) { + free_pages((unsigned long)vstart, order); - /* protect against double initialization */ - if (WARN_ON_ONCE(mem->nslabs)) - return -ENOMEM; + /* Min is 2MB */ + if (nslabs <= 1024) + return rc; + nslabs = max(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE)); + goto retry; + } We now end up with two attempts to remap. I think this second one is what we want since it solves the problem I pointed in previous patch. -boris
Re: [PATCH 13/15] swiotlb: merge swiotlb-xen initialization into swiotlb
On 3/14/22 3:31 AM, Christoph Hellwig wrote: - static void __init pci_xen_swiotlb_init(void) { if (!xen_initial_domain() && !x86_swiotlb_enable) return; x86_swiotlb_enable = true; - xen_swiotlb = true; - xen_swiotlb_init_early(); + swiotlb_init_remap(true, x86_swiotlb_flags, xen_swiotlb_fixup); I think we need to have SWIOTLB_ANY set in x86_swiotlb_flags here. dma_ops = &xen_swiotlb_dma_ops; if (IS_ENABLED(CONFIG_PCI)) pci_request_acs(); @@ -88,14 +85,16 @@ static void __init pci_xen_swiotlb_init(void) int pci_xen_swiotlb_init_late(void) { - int rc; - - if (xen_swiotlb) + if (dma_ops == &xen_swiotlb_dma_ops) return 0; - rc = xen_swiotlb_init(); - if (rc) - return rc; + /* we can work with the default swiotlb */ + if (!io_tlb_default_mem.nslabs) { + int rc = swiotlb_init_late(swiotlb_size_or_default(), + GFP_KERNEL, xen_swiotlb_fixup); This may be comment for previous patch but looking at swiotlb_init_late(): retry: order = get_order(nslabs << IO_TLB_SHIFT); nslabs = SLABS_PER_PAGE << order; bytes = nslabs << IO_TLB_SHIFT; while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) { vstart = (void *)__get_free_pages(gfp_mask | __GFP_NOWARN, order); if (vstart) break; order--; } if (!vstart) return -ENOMEM; if (remap) rc = remap(vstart, nslabs); if (rc) { free_pages((unsigned long)vstart, order); /* Min is 2MB */ if (nslabs <= 1024) return rc; nslabs = max(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE)); goto retry; } if (order != get_order(bytes)) { pr_warn("only able to allocate %ld MB\n", (PAGE_SIZE << order) >> 20); nslabs = SLABS_PER_PAGE << order; <=== } rc = swiotlb_late_init_with_tbl(vstart, nslabs); Notice that we don't do remap() after final update to nslabs. We should. -boris
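To make the ordering concrete, here is a rough sketch (against the retry loop from this series, not the final upstream code) with the remap moved after the last point where nslabs can shrink; rc is assumed to have been initialised to 0 earlier in the function.

	/* Sketch only: remap after every adjustment of nslabs, including
	 * the smaller-allocation fallback, so the remapped size always
	 * matches the size handed to swiotlb. */
	if (order != get_order(bytes)) {
		pr_warn("only able to allocate %ld MB\n",
			(PAGE_SIZE << order) >> 20);
		nslabs = SLABS_PER_PAGE << order;
	}

	if (remap)
		rc = remap(vstart, nslabs);
	if (rc) {
		free_pages((unsigned long)vstart, order);
		nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
		if (nslabs <= IO_TLB_MIN_SLABS)
			return rc;
		goto retry;
	}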
Re: [PATCH 12/15] swiotlb: provide swiotlb_init variants that remap the buffer
On 3/14/22 3:31 AM, Christoph Hellwig wrote: -void __init swiotlb_init(bool addressing_limit, unsigned int flags) +void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags, + int (*remap)(void *tlb, unsigned long nslabs)) { - size_t bytes = PAGE_ALIGN(default_nslabs << IO_TLB_SHIFT); + unsigned long nslabs = default_nslabs; + size_t bytes; void *tlb; if (!addressing_limit && !swiotlb_force_bounce) @@ -271,12 +273,24 @@ void __init swiotlb_init(bool addressing_limit, unsigned int flags) * allow to pick a location everywhere for hypervisors with guest * memory encryption. */ +retry: + bytes = PAGE_ALIGN(default_nslabs << IO_TLB_SHIFT); if (flags & SWIOTLB_ANY) tlb = memblock_alloc(bytes, PAGE_SIZE); else tlb = memblock_alloc_low(bytes, PAGE_SIZE); if (!tlb) goto fail; + if (remap && remap(tlb, nslabs) < 0) { + memblock_free(tlb, PAGE_ALIGN(bytes)); + + /* Min is 2MB */ + if (nslabs <= 1024) This is IO_TLB_MIN_SLABS, isn't it? (Xen code didn't say so but that's what it meant to say I believe) + panic("%s: Failed to remap %zu bytes\n", + __func__, bytes); + nslabs = max(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE)); + goto retry; + } @@ -303,6 +323,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask) if (swiotlb_force_disable) return 0; +retry: order = get_order(nslabs << IO_TLB_SHIFT); nslabs = SLABS_PER_PAGE << order; bytes = nslabs << IO_TLB_SHIFT; @@ -317,6 +338,17 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask) if (!vstart) return -ENOMEM; + if (remap) + rc = remap(vstart, nslabs); + if (rc) { + free_pages((unsigned long)vstart, order); + + /* Min is 2MB */ + if (nslabs <= 1024) Same here. -boris + return rc; + nslabs = max(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE)); + goto retry; + } if (order != get_order(bytes)) { pr_warn("only able to allocate %ld MB\n",
Re: [PATCH] x86/xen: Fix kerneldoc warning
On 3/7/22 1:25 AM, Jiapeng Chong wrote: Fix the following W=1 kernel warning: arch/x86/xen/setup.c:725: warning: expecting prototype for machine_specific_memory_setup(). Prototype was for xen_memory_setup() instead. Reported-by: Abaci Robot Signed-off-by: Jiapeng Chong Reviewed-by: Boris Ostrovsky Applied to for-linus-5.18
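The warning just says the name in the kerneldoc header no longer matches the function it sits on; the fix is along these lines (comment text paraphrased):

-/**
- * machine_specific_memory_setup - Hook for machine specific memory setup.
- **/
+/**
+ * xen_memory_setup - Hook for machine specific memory setup.
+ **/
 char * __init xen_memory_setup(void)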
Re: [PATCH] drivers/xen: use helper macro __ATTR_RW
On 3/5/22 8:38 AM, zhanglianjie wrote: Use the helper macro __ATTR_RW to define HYPERVISOR_ATTR_RW, making the code clearer. Minor readability improvement. Signed-off-by: zhanglianjie Reviewed-by: Boris Ostrovsky Applied to for-linus-5.18 (with whitespace change noted in commit message)
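For reference, the shape of this kind of cleanup (hypothetical before/after, not the exact hunk): __ATTR_RW(_name) expands to __ATTR(_name, 0644, _name##_show, _name##_store), so spelling those arguments out by hand is redundant.

/* Before: mode, show and store spelled out by hand */
#define HYPERVISOR_ATTR_RW(_name) \
	static struct hyp_sysfs_attr _name##_attr = \
		__ATTR(_name, 0644, _name##_show, _name##_store)

/* After: __ATTR_RW() derives all three from the attribute name */
#define HYPERVISOR_ATTR_RW(_name) \
	static struct hyp_sysfs_attr _name##_attr = __ATTR_RW(_name)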
Re: [PATCH v4 2/2] xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32
On 3/2/22 11:40 AM, Dongli Zhang wrote: sched_clock() can be used very early since commit 857baa87b642 ("sched/clock: Enable sched clock early"). In addition, with commit 38669ba205d1 ("x86/xen/time: Output xen sched_clock time from 0"), the kdump kernel in a Xen HVM guest may panic at a very early stage when accessing &__this_cpu_read(xen_vcpu)->time, as in the following call chain: setup_arch() -> init_hypervisor_platform() -> x86_init.hyper.init_platform = xen_hvm_guest_init() -> xen_hvm_init_time_ops() -> xen_clocksource_read() -> src = &__this_cpu_read(xen_vcpu)->time; This is because Xen HVM provides at most MAX_VIRT_CPUS=32 'vcpu_info' structures embedded inside 'shared_info' during the early boot stage, until xen_vcpu_setup() is used to allocate/relocate the 'vcpu_info' for the boot cpu at an arbitrary address. When the Xen HVM guest panics on a vcpu >= 32 and the kdump kernel is entered on that vcpu, xen_vcpu_info_reset(0) leaves per_cpu(xen_vcpu, cpu) set to NULL for vcpus >= 32, so xen_clocksource_read() on such a vcpu panics. This patch calls xen_hvm_init_time_ops() again later, in xen_hvm_smp_prepare_boot_cpu(), after the 'vcpu_info' for the boot vcpu has been registered, covering the case where the boot vcpu is >= 32. The issue can be reproduced on purpose via the below command on the guest side when kdump/kexec is enabled: "taskset -c 33 echo c > /proc/sysrq-trigger" The bugfix for PVM is not implemented due to the lack of a testing environment. Cc: Joe Jin Signed-off-by: Dongli Zhang Reviewed-by: Boris Ostrovsky Applied to for-linus-5.18 (with return path adjusted)
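A condensed sketch of the deferral logic (abbreviated from the patch, not the complete function): the early call bails out while the boot vcpu's vcpu_info is still unreachable, and the later call from xen_hvm_smp_prepare_boot_cpu() finishes the switch once it has been registered.

void __init xen_hvm_init_time_ops(void)
{
	static bool hvm_time_initialized;

	if (hvm_time_initialized)
		return;

	/* Only MAX_VIRT_CPUS (32) vcpu_info structures are embedded in
	 * shared_info.  If the kdump kernel boots on a vcpu >= 32, the
	 * pointer is still NULL here and xen_clocksource_read() would
	 * dereference it, so defer until xen_vcpu_setup() has run. */
	if (__this_cpu_read(xen_vcpu) == NULL)
		return;

	/* switch clocksource/sched_clock over to the Xen PV clock
	 * (details elided in this sketch) */
	xen_init_time_common();

	hvm_time_initialized = true;
}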
Re: [PATCH] xen: use time_is_before_eq_jiffies() instead of open coding it
On 2/27/22 10:15 PM, Qing Wang wrote: From: Wang Qing Use the helper function time_is_{before,after}_jiffies() to improve code readability. Signed-off-by: Wang Qing Reviewed-by: Boris Ostrovsky Applied to for-linus-5.18
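As a quick illustration (the struct field and handler names here are made up): time_is_before_eq_jiffies(a) is simply time_after_eq(jiffies, a), so the converted check reads in the same direction as the English.

/* Open-coded form: "has the deadline already passed?" */
if (time_after_eq(jiffies, ctx->deadline))
	handle_timeout(ctx);

/* Helper form, same meaning, with the direction spelled out in the name */
if (time_is_before_eq_jiffies(ctx->deadline))
	handle_timeout(ctx);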
Re: [PATCH 11/12] swiotlb: merge swiotlb-xen initialization into swiotlb
On 3/9/22 1:18 AM, Christoph Hellwig wrote: On Tue, Mar 08, 2022 at 04:38:21PM -0500, Boris Ostrovsky wrote: On 3/1/22 5:53 AM, Christoph Hellwig wrote: Allow to pass a remap argument to the swiotlb initialization functions to handle the Xen/x86 remap case. ARM/ARM64 never did any remapping from xen_swiotlb_fixup, so we don't even need that quirk. Any chance this patch could be split? Lots of things are happening here and it's somewhat hard to review. (Patch 7 too BTW but I think I managed to get through it) What would be your preferred split? swiotlb_init() rework to be done separately? diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c index e0def4b1c3181..2f2c468acb955 100644 --- a/arch/x86/kernel/pci-dma.c +++ b/arch/x86/kernel/pci-dma.c @@ -71,15 +71,12 @@ static inline void __init pci_swiotlb_detect(void) #endif /* CONFIG_SWIOTLB */ #ifdef CONFIG_SWIOTLB_XEN -static bool xen_swiotlb; - static void __init pci_xen_swiotlb_init(void) { if (!xen_initial_domain() && !x86_swiotlb_enable) return; Now that there is a single call site for this routine I think this check can be dropped. We are only called here for xen_initial_domain()==true. The callsite just checks xen_pv_domain() and itself is called unconditionally during initialization. Oh, right, nevermind. *pv* domain. -boris
Re: [PATCH 11/12] swiotlb: merge swiotlb-xen initialization into swiotlb
On 3/1/22 5:53 AM, Christoph Hellwig wrote: Allow to pass a remap argument to the swiotlb initialization functions to handle the Xen/x86 remap case. ARM/ARM64 never did any remapping from xen_swiotlb_fixup, so we don't even need that quirk. Any chance this patch could be split? Lots of things are happening here and it's somewhat hard to review. (Patch 7 too BTW but I think I managed to get through it) diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c index e0def4b1c3181..2f2c468acb955 100644 --- a/arch/x86/kernel/pci-dma.c +++ b/arch/x86/kernel/pci-dma.c @@ -71,15 +71,12 @@ static inline void __init pci_swiotlb_detect(void) #endif /* CONFIG_SWIOTLB */ #ifdef CONFIG_SWIOTLB_XEN -static bool xen_swiotlb; - static void __init pci_xen_swiotlb_init(void) { if (!xen_initial_domain() && !x86_swiotlb_enable) return; Now that there is a single call site for this routine I think this check can be dropped. We are only called here for xen_initial_domain()==true. -boris
Re: [PATCH 11/12] swiotlb: merge swiotlb-xen initialization into swiotlb
On 3/4/22 12:43 PM, Christoph Hellwig wrote: On Fri, Mar 04, 2022 at 12:36:17PM -0500, Boris Ostrovsky wrote: I bisected it to "x86: remove the IOMMU table infrastructure" but haven't actually looked at the code yet. That looks like the swiotlb buffer did not get initialized at all, but I can't really explain why. Can you stick in a printk and see if xen_swiotlb_init_early gets called at all? Actually, that's the only thing I did do so far and yes, it does get called. So, specifically for "x86: remove the IOMMU table infrastructure" I think we need the one-liner below so that swiotlb_exit doesn't get called for the Xen case. But that should have been fixed up by the next patch already. This indeed allows dom0 to boot. Not sure I see where in the next patch this would have been fixed? (BTW, just noticed in iommu_setup() you set this variable to 1. Should be 'true') -boris