Re: [PATCH v6 00/12] x86: major paravirt cleanup
On 11.03.21 13:51, Borislav Petkov wrote:
> On Thu, Mar 11, 2021 at 01:50:26PM +0100, Borislav Petkov wrote:
>> and move the cleanup patches 13 and 14 to the beginning of the set?
>
> Yeah, 14 needs ALTERNATIVE_TERNARY so I guess after patch 5, that is.

I'm putting 13 at the beginning of the series and 14 after 5.

Juergen

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v6 02/12] x86/paravirt: switch time pvops functions to use static_call()
On 09.03.21 19:57, Borislav Petkov wrote:
> On Tue, Mar 09, 2021 at 02:48:03PM +0100, Juergen Gross wrote:
>> @@ -167,6 +168,17 @@ static u64 native_steal_clock(int cpu)
>>  	return 0;
>>  }
>> +DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
>> +DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
>> +
>> +bool paravirt_using_native_sched_clock = true;
>> +
>> +void paravirt_set_sched_clock(u64 (*func)(void))
>> +{
>> +	static_call_update(pv_sched_clock, func);
>> +	paravirt_using_native_sched_clock = (func == native_sched_clock);
>> +}
>
> What's the point of this function if there's a global
> paravirt_using_native_sched_clock variable now?

It is combining the two needed actions: update the static call and set the paravirt_using_native_sched_clock boolean.

> Looking how the bit of information whether native_sched_clock is used
> is needed in tsc.c, it probably would be cleaner if you add a
> set_sched_clock_native(void); or so, to tsc.c instead and call that
> here, and make that long var name a shorter and static one in tsc.c
> instead.

I need to transfer a boolean value, so it would need to be set_sched_clock_native(bool state); In the end the difference is only marginal IMO.

Just had another idea: I could add a function to static_call.h for querying the current function. This would avoid the double bookkeeping and could probably be used later when switching other pv_ops calls to static_call(), too (e.g. pv_is_native_spin_unlock()). What do you think?

Juergen
Re: [PATCH v5 11/12] x86/paravirt: switch functions with custom code to ALTERNATIVE
On 08.03.21 19:30, Borislav Petkov wrote:
> On Mon, Mar 08, 2021 at 01:28:43PM +0100, Juergen Gross wrote:
>> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
>> index 36cd71fa097f..04b3067f31b5 100644
>> --- a/arch/x86/include/asm/paravirt.h
>> +++ b/arch/x86/include/asm/paravirt.h
>> @@ -137,7 +137,8 @@ static inline void write_cr0(unsigned long x)
>>  static inline unsigned long read_cr2(void)
>>  {
>> -	return PVOP_CALLEE0(unsigned long, mmu.read_cr2);
>> +	return PVOP_ALT_CALLEE0(unsigned long, mmu.read_cr2,
>> +				"mov %%cr2, %%rax;", ~X86_FEATURE_XENPV);
>
> Just some cursory poking first - in-depth review later.
>
> Do I see this correctly that the negated feature can be expressed with, to use this example here:
>
> ALTERNATIVE_TERNARY(mmu.read_cr2, X86_FEATURE_XENPV, "", "mov %%cr2, %%rax;");
>
> ?

No. This would leave the Xen-pv case with a nop, while we need it to call mmu.read_cr2(). In the Xen-pv case there must be _no_ alternative patching in order to have the paravirt patching do its patching (indirect->direct call). This is exactly the reason why I need the "not feature".

The only other solution I can think of would be a "split static_call" handling using ALTERNATIVE_TERNARY():

ALTERNATIVE_TERNARY(initial_static_call(mmu.read_cr2),
		    X86_FEATURE_XENPV,
		    final_static_call(mmu.read_cr2),
		    "mov %%cr2, %%rax;");

with initial_static_call() doing an indirect call, while final_static_call() would do a direct call. Not sure we really want that.

Juergen
Re: [PATCH v5 02/12] x86/paravirt: switch time pvops functions to use static_call()
On 08.03.21 18:00, Boris Ostrovsky wrote:
> On 3/8/21 7:28 AM, Juergen Gross wrote:
>> --- a/arch/x86/xen/time.c
>> +++ b/arch/x86/xen/time.c
>> @@ -379,11 +379,6 @@ void xen_timer_resume(void)
>>  	}
>>  }
>> -static const struct pv_time_ops xen_time_ops __initconst = {
>> -	.sched_clock = xen_sched_clock,
>> -	.steal_clock = xen_steal_clock,
>> -};
>> -
>>  static struct pvclock_vsyscall_time_info *xen_clock __read_mostly;
>>  static u64 xen_clock_value_saved;
>> @@ -528,7 +523,8 @@ static void __init xen_time_init(void)
>>  void __init xen_init_time_ops(void)
>>  {
>>  	xen_sched_clock_offset = xen_clocksource_read();
>> -	pv_ops.time = xen_time_ops;
>> +	static_call_update(pv_steal_clock, xen_steal_clock);
>> +	paravirt_set_sched_clock(xen_sched_clock);
>>  	x86_init.timers.timer_init = xen_time_init;
>>  	x86_init.timers.setup_percpu_clockev = x86_init_noop;
>> @@ -570,7 +566,8 @@ void __init xen_hvm_init_time_ops(void)
>>  	}
>>  	xen_sched_clock_offset = xen_clocksource_read();
>> -	pv_ops.time = xen_time_ops;
>> +	static_call_update(pv_steal_clock, xen_steal_clock);
>> +	paravirt_set_sched_clock(xen_sched_clock);
>>  	x86_init.timers.setup_percpu_clockev = xen_time_init;
>>  	x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
>
> There is a bunch of stuff that's common between the two cases so it can be factored out.

Yes.

>> diff --git a/drivers/xen/time.c b/drivers/xen/time.c
>> index 108edbcbc040..152dd33bb223 100644
>> --- a/drivers/xen/time.c
>> +++ b/drivers/xen/time.c
>> @@ -7,6 +7,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>  #include
>>  #include
>> @@ -175,7 +176,7 @@ void __init xen_time_setup_guest(void)
>>  	xen_runstate_remote = !HYPERVISOR_vm_assist(VMASST_CMD_enable,
>>  					VMASST_TYPE_runstate_update_flag);
>> -	pv_ops.time.steal_clock = xen_steal_clock;
>> +	static_call_update(pv_steal_clock, xen_steal_clock);
>
> Do we actually need this? We've already set this up in xen_init_time_ops(). (But maybe for ARM).

Correct. Arm needs this.

Juergen
Re: [PATCH v3 06/15] x86/paravirt: switch time pvops functions to use static_call()
On 17.12.20 18:31, Michael Kelley wrote:
> From: Juergen Gross Sent: Thursday, December 17, 2020 1:31 AM
>> The time pvops functions are the only ones left which might be used in 32-bit mode and which return a 64-bit value. Switch them to use the static_call() mechanism instead of pvops, as this allows quite some simplification of the pvops implementation.
>>
>> Due to include hell this requires splitting out the time interfaces into a new header file.
>>
>> Signed-off-by: Juergen Gross
>> ---
>>  arch/x86/Kconfig                      |  1 +
>>  arch/x86/include/asm/mshyperv.h       | 11
>>  arch/x86/include/asm/paravirt.h       | 14 --
>>  arch/x86/include/asm/paravirt_time.h  | 38 +++
>>  arch/x86/include/asm/paravirt_types.h |  6 -
>>  arch/x86/kernel/cpu/vmware.c          |  5 ++--
>>  arch/x86/kernel/kvm.c                 |  3 ++-
>>  arch/x86/kernel/kvmclock.c            |  3 ++-
>>  arch/x86/kernel/paravirt.c            | 16 ---
>>  arch/x86/kernel/tsc.c                 |  3 ++-
>>  arch/x86/xen/time.c                   | 12 -
>>  drivers/clocksource/hyperv_timer.c    |  5 ++--
>>  drivers/xen/time.c                    |  3 ++-
>>  kernel/sched/sched.h                  |  1 +
>>  14 files changed, 71 insertions(+), 50 deletions(-)
>>  create mode 100644 arch/x86/include/asm/paravirt_time.h
>
> [snip]
>
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index ffc289992d1b..45942d420626 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -56,17 +56,6 @@ typedef int (*hyperv_fill_flush_list_func)(
>>  #define hv_get_raw_timer() rdtsc_ordered()
>>  #define hv_get_vector() HYPERVISOR_CALLBACK_VECTOR
>> -/*
>> - * Reference to pv_ops must be inline so objtool
>> - * detection of noinstr violations can work correctly.
>> - */
>> -static __always_inline void hv_setup_sched_clock(void *sched_clock)
>> -{
>> -#ifdef CONFIG_PARAVIRT
>> -	pv_ops.time.sched_clock = sched_clock;
>> -#endif
>> -}
>> -
>>  void hyperv_vector_handler(struct pt_regs *regs);
>>  static inline void hv_enable_stimer0_percpu_irq(int irq) {}
>
> [snip]
>
>> diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
>> index ba04cb381cd3..1ed79993fc50 100644
>> --- a/drivers/clocksource/hyperv_timer.c
>> +++ b/drivers/clocksource/hyperv_timer.c
>> @@ -21,6 +21,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>  static struct clock_event_device __percpu *hv_clock_event;
>>  static u64 hv_sched_clock_offset __ro_after_init;
>> @@ -445,7 +446,7 @@ static bool __init hv_init_tsc_clocksource(void)
>>  	clocksource_register_hz(_cs_tsc, NSEC_PER_SEC/100);
>>  	hv_sched_clock_offset = hv_read_reference_counter();
>> -	hv_setup_sched_clock(read_hv_sched_clock_tsc);
>> +	paravirt_set_sched_clock(read_hv_sched_clock_tsc);
>>  	return true;
>>  }
>> @@ -470,6 +471,6 @@ void __init hv_init_clocksource(void)
>>  	clocksource_register_hz(_cs_msr, NSEC_PER_SEC/100);
>>  	hv_sched_clock_offset = hv_read_reference_counter();
>> -	hv_setup_sched_clock(read_hv_sched_clock_msr);
>> +	static_call_update(pv_sched_clock, read_hv_sched_clock_msr);
>>  }
>>  EXPORT_SYMBOL_GPL(hv_init_clocksource);
>
> These Hyper-V changes are problematic as we want to keep hyperv_timer.c architecture independent. While only the code for x86/x64 is currently accepted upstream, code for ARM64 support is in progress. So we need to use hv_setup_sched_clock() in hyperv_timer.c, and have the per-arch implementation in mshyperv.h.

Okay, will switch back to old setup.

Juergen
Re: [PATCH v3 06/15] x86/paravirt: switch time pvops functions to use static_call()
On 06.01.21 11:03, Borislav Petkov wrote:
> On Thu, Dec 17, 2020 at 10:31:24AM +0100, Juergen Gross wrote:
>> The time pvops functions are the only ones left which might be used in 32-bit mode and which return a 64-bit value. Switch them to use the static_call() mechanism instead of pvops, as this allows quite some simplification of the pvops implementation.
>>
>> Due to include hell this requires splitting out the time interfaces into a new header file.
>
> I guess you can add Peter's patch to your set, no?
>
> https://lkml.kernel.org/r/20201110005609.40989-3-frede...@kernel.org

With a slight modification, yes. :-)

Juergen
Re: [PATCH v2 00/12] x86: major paravirt cleanup
On 15.12.20 15:54, Peter Zijlstra wrote:
> On Tue, Dec 15, 2020 at 03:18:34PM +0100, Peter Zijlstra wrote:
>> Ah, I was waiting for Josh to have an opinion (and then sorta forgot about the whole thing again). Let me refresh and provide at least a Changelog.
>
> How's this then?

Thanks, will add it into my series.

Juergen

---
Subject: objtool: Alternatives vs ORC, the hard way
From: Peter Zijlstra
Date: Mon, 23 Nov 2020 14:43:17 +0100

Alternatives pose an interesting problem for unwinders because from the unwinder's PoV we're just executing instructions; it has no idea the text is modified, nor any way of retrieving what with. Therefore the stance has been that alternatives must not change stack state, as encoded by commit 7117f16bf460 ("objtool: Fix ORC vs alternatives").

This obviously guarantees that whatever actual instructions end up in the text, the unwind information is correct. However, there is one additional source of text patching that isn't currently visible to objtool: paravirt immediate patching. And it turns out one of these violates the rule.

As part of cleaning that up, the unfortunate reality is that objtool now has to deal with alternatives modifying unwind state, validate the combination is valid, and generate ORC data to match.

The problem is that a single instance of unwind information (ORC) must capture and correctly unwind all alternatives. Since the trivially correct mandate is out, implement the straight forward brute-force approach:

 1) generate CFI information for each alternative

 2) unwind every alternative with the merge-sort of the previously
    generated CFI information -- O(n^2)

 3) for any possible conflict: yell.

 4) Generate ORC with merge-sort

Specifically for 3 there are two possible classes of conflicts:

 - the merge-sort itself could find conflicting CFI for the same offset.

 - the unwind can fail with the merged CFI.

In specific, this allows us to deal with:

	    Alt1                 Alt2             Alt3
 0x00  CALL *pv_ops.save_fl  CALL xen_save_fl  PUSHF
 0x01                                          POP %RAX
 0x02                                          NOP
 ...
 0x05                        NOP
 ...
 0x07

The unwind information for offset-0x00 is identical for all 3 alternatives. Similarly offset-0x05 and higher also are identical (and the same as 0x00). However offset-0x01 has deviating CFI, but that is only relevant for Alt3; neither of the other alternative instruction streams will ever hit that offset.

Signed-off-by: Peter Zijlstra (Intel)
---
 tools/objtool/check.c   | 180
 tools/objtool/check.h   |   5 +
 tools/objtool/orc_gen.c | 180 +++-
 3 files changed, 290 insertions(+), 75 deletions(-)

--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1096,6 +1096,32 @@ static int handle_group_alt(struct objto
 		return -1;
 	}

+	/*
+	 * Add the filler NOP, required for alternative CFI.
+	 */
+	if (special_alt->group && special_alt->new_len < special_alt->orig_len) {
+		struct instruction *nop = malloc(sizeof(*nop));
+		if (!nop) {
+			WARN("malloc failed");
+			return -1;
+		}
+		memset(nop, 0, sizeof(*nop));
+		INIT_LIST_HEAD(&nop->alts);
+		INIT_LIST_HEAD(&nop->stack_ops);
+		init_cfi_state(&nop->cfi);
+
+		nop->sec = last_new_insn->sec;
+		nop->ignore = last_new_insn->ignore;
+		nop->func = last_new_insn->func;
+		nop->alt_group = alt_group;
+		nop->offset = last_new_insn->offset + last_new_insn->len;
+		nop->type = INSN_NOP;
+		nop->len = special_alt->orig_len - special_alt->new_len;
+
+		list_add(&nop->list, &last_new_insn->list);
+		last_new_insn = nop;
+	}
+
 	if (fake_jump)
 		list_add(&fake_jump->list, &last_new_insn->list);
@@ -2237,18 +2263,12 @@ static int handle_insn_ops(struct instru
 	struct stack_op *op;

 	list_for_each_entry(op, &insn->stack_ops, list) {
-		struct cfi_state old_cfi = state->cfi;
 		int res;

 		res = update_cfi_state(insn, &state->cfi, op);
 		if (res)
 			return res;

-		if (insn->alt_group && memcmp(&state->cfi, &old_cfi, sizeof(struct cfi_state))) {
-			WARN_FUNC("alternative modifies stack", insn->sec, insn->offset);
-			return -1;
-		}
-
 		if (op->dest.type == OP_DEST_PUSHF) {
 			if (!state->uaccess_stack) {
 				state->uaccess_stack = 1;
@@ -2460,19 +2480,137 @@ static int validate_return(struct symbol
 * unreported (because they're NOPs), such holes would result in CFI_UNDEFINED
 *
Re: [PATCH v2 00/12] x86: major paravirt cleanup
Peter,

On 23.11.20 14:43, Peter Zijlstra wrote:
> On Fri, Nov 20, 2020 at 01:53:42PM +0100, Peter Zijlstra wrote:
>> On Fri, Nov 20, 2020 at 12:46:18PM +0100, Juergen Gross wrote:
>>> 30 files changed, 325 insertions(+), 598 deletions(-)
>>
>> Much awesome! I'll try and get that objtool thing sorted.
>
> This seems to work for me. It isn't 100% accurate, because it doesn't know about the direct call instruction, but I can either fudge that or switching to static_call() will cure that.
>
> It's not exactly pretty, but it should be straight forward.

Are you planning to send this out as an "official" patch, or should I include it in my series (in this case I'd need a variant with a proper commit message)?

I'd like to have this settled soon, as I'm going to send V2 of my series hopefully this week.

Juergen

> Index: linux-2.6/tools/objtool/check.c
> ===
> --- linux-2.6.orig/tools/objtool/check.c
> +++ linux-2.6/tools/objtool/check.c
> @@ -1090,6 +1090,32 @@ static int handle_group_alt(struct objto
>  		return -1;
>  	}
>
> +	/*
> +	 * Add the filler NOP, required for alternative CFI.
> +	 */
> +	if (special_alt->group && special_alt->new_len < special_alt->orig_len) {
> +		struct instruction *nop = malloc(sizeof(*nop));
> +		if (!nop) {
> +			WARN("malloc failed");
> +			return -1;
> +		}
> +		memset(nop, 0, sizeof(*nop));
> +		INIT_LIST_HEAD(&nop->alts);
> +		INIT_LIST_HEAD(&nop->stack_ops);
> +		init_cfi_state(&nop->cfi);
> +
> +		nop->sec = last_new_insn->sec;
> +		nop->ignore = last_new_insn->ignore;
> +		nop->func = last_new_insn->func;
> +		nop->alt_group = alt_group;
> +		nop->offset = last_new_insn->offset + last_new_insn->len;
> +		nop->type = INSN_NOP;
> +		nop->len = special_alt->orig_len - special_alt->new_len;
> +
> +		list_add(&nop->list, &last_new_insn->list);
> +		last_new_insn = nop;
> +	}
>
>  	if (fake_jump)
>  		list_add(&fake_jump->list, &last_new_insn->list);
> @@ -2190,18 +2216,12 @@ static int handle_insn_ops(struct instru
>  	struct stack_op *op;
>
>  	list_for_each_entry(op, &insn->stack_ops, list) {
> -		struct cfi_state old_cfi = state->cfi;
>  		int res;
>
>  		res = update_cfi_state(insn, &state->cfi, op);
>  		if (res)
>  			return res;
>
> -		if (insn->alt_group && memcmp(&state->cfi, &old_cfi, sizeof(struct cfi_state))) {
> -			WARN_FUNC("alternative modifies stack", insn->sec, insn->offset);
> -			return -1;
> -		}
> -
>  		if (op->dest.type == OP_DEST_PUSHF) {
>  			if (!state->uaccess_stack) {
>  				state->uaccess_stack = 1;
> @@ -2399,19 +2419,137 @@ static int validate_return(struct symbol
>  * unreported (because they're NOPs), such holes would result in CFI_UNDEFINED
>  * states which then results in ORC entries, which we just said we didn't want.
>  *
> - * Avoid them by copying the CFI entry of the first instruction into the whole
> - * alternative.
> + * Avoid them by copying the CFI entry of the first instruction into the hole.
>  */
> -static void fill_alternative_cfi(struct objtool_file *file, struct instruction *insn)
> +static void __fill_alt_cfi(struct objtool_file *file, struct instruction *insn)
>  {
>  	struct instruction *first_insn = insn;
>  	int alt_group = insn->alt_group;
>
> -	sec_for_each_insn_continue(file, insn) {
> +	sec_for_each_insn_from(file, insn) {
>  		if (insn->alt_group != alt_group)
>  			break;
> -		insn->cfi = first_insn->cfi;
> +
> +		if (!insn->visited)
> +			insn->cfi = first_insn->cfi;
> +	}
> +}
> +
> +static void fill_alt_cfi(struct objtool_file *file, struct instruction *alt_insn)
> +{
> +	struct alternative *alt;
> +
> +	__fill_alt_cfi(file, alt_insn);
> +
> +	list_for_each_entry(alt, &alt_insn->alts, list)
> +		__fill_alt_cfi(file, alt->insn);
> +}
> +
> +static struct instruction *
> +__find_unwind(struct objtool_file *file,
> +	      struct instruction *insn, unsigned long offset)
> +{
> +	int alt_group = insn->alt_group;
> +	struct instruction *next;
> +	unsigned long off = 0;
> +
> +	while ((off + insn->len) <= offset) {
> +		next = next_insn_same_sec(file, insn);
> +		if (next && next->alt_group != alt_group)
> +			next = NULL;
> +
> +		if (!next)
> +			break;
> +
> +		off += insn->len;
> +		insn = next;
> +	}
> +
> +	return insn;
> +}
> +
> +struct instruction *
> +find_alt_unwind(struct objtool_file *file,
> +		struct instruction *alt_insn, unsigned long offset)
> +{
> +	struct instruction *fit;
> +	struct alternative *alt;
> +
Re: x86/ioapic: Cleanup the timer_works() irqflags mess
On 10.12.20 21:15, Thomas Gleixner wrote:
> Mark tripped over the creative irqflags handling in the IO-APIC timer delivery check which ends up doing:
>
>     local_irq_save(flags);
>     local_irq_enable();
>     local_irq_restore(flags);
>
> which triggered a new consistency check he's working on, required for replacing the POPF based restore with a conditional STI.
>
> That code is a historical mess and none of this is needed. Make it straightforward and use local_irq_disable()/enable(), as that's all that is required. It is invoked from interrupt-enabled code nowadays.
>
> Reported-by: Mark Rutland
> Signed-off-by: Thomas Gleixner
> Tested-by: Mark Rutland

Reviewed-by: Juergen Gross

Juergen
Re: [PATCH v2 05/12] x86: rework arch_local_irq_restore() to not use popf
On 09.12.20 15:02, Mark Rutland wrote:
> On Wed, Dec 09, 2020 at 01:27:10PM +0000, Mark Rutland wrote:
>> On Sun, Nov 22, 2020 at 01:44:53PM -0800, Andy Lutomirski wrote:
>>> On Sat, Nov 21, 2020 at 10:55 PM Jürgen Groß wrote:
>>>> On 20.11.20 12:59, Peter Zijlstra wrote:
>>>>> If someone were to write horrible code like:
>>>>>
>>>>>     local_irq_disable();
>>>>>     local_irq_save(flags);
>>>>>     local_irq_enable();
>>>>>     local_irq_restore(flags);
>>>>>
>>>>> we'd be up some creek without a paddle... now I don't _think_ we have genius code like that, but I'd feel safer if we can haz an assertion in there somewhere...
>>>
>>> I was just talking to Peter on IRC about implementing the same thing for arm64, so could we put this in the generic irqflags code? IIUC we can use raw_irqs_disabled() to do the check.
>>
>> As this isn't really entry specific (and IIUC the cases this should catch would break lockdep today), maybe we should add a new DEBUG_IRQFLAGS for this, that DEBUG_LOCKDEP can also select? Something like:
>>
>> #define local_irq_restore(flags)				\
>> 	do {							\
>> 		if (!raw_irqs_disabled_flags(flags)) {		\
>> 			trace_hardirqs_on();			\
>> 		} else if (IS_ENABLED(CONFIG_DEBUG_IRQFLAGS) {	\
>> 			if (unlikely(raw_irqs_disabled())	\
>
> Whoops; that should be !raw_irqs_disabled().
>
>> 				warn_bogus_irqrestore();	\
>> 		}						\
>> 		raw_local_irq_restore(flags);			\
>> 	} while (0)
>>
>> ... perhaps? (ignoring however we deal with once-ness).
>
> If no-one shouts in the next day or two I'll spin this as its own patch.

Fine with me. So I'll just ignore a potential error case in my patch.

Thanks,

Juergen
Re: [PATCH v2 04/12] x86/xen: drop USERGS_SYSRET64 paravirt call
On 02.12.20 13:32, Borislav Petkov wrote:
> On Fri, Nov 20, 2020 at 12:46:22PM +0100, Juergen Gross wrote:
>> @@ -123,12 +115,15 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
>>  	 * Try to use SYSRET instead of IRET if we're returning to
>>  	 * a completely clean 64-bit userspace context. If we're not,
>>  	 * go to the slow exit path.
>> +	 * In the Xen PV case we must use iret anyway.
>>  	 */
>> -	movq	RCX(%rsp), %rcx
>> -	movq	RIP(%rsp), %r11
>> -	cmpq	%rcx, %r11	/* SYSRET requires RCX == RIP */
>> -	jne	swapgs_restore_regs_and_return_to_usermode
>> +	ALTERNATIVE __stringify( \
>> +		movq	RCX(%rsp), %rcx; \
>> +		movq	RIP(%rsp), %r11; \
>> +		cmpq	%rcx, %r11;	/* SYSRET requires RCX == RIP */ \
>> +		jne	swapgs_restore_regs_and_return_to_usermode), \
>> +		"jmp	swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
>
> Why such a big ALTERNATIVE when you can simply do:
>
> 	/*
> 	 * Try to use SYSRET instead of IRET if we're returning to
> 	 * a completely clean 64-bit userspace context. If we're not,
> 	 * go to the slow exit path.
> 	 * In the Xen PV case we must use iret anyway.
> 	 */
> 	ALTERNATIVE "", "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
>
> 	movq	RCX(%rsp), %rcx
> 	movq	RIP(%rsp), %r11
> 	cmpq	%rcx, %r11	/* SYSRET requires RCX == RIP */
> 	jne	swapgs_restore_regs_and_return_to_usermode
>
> ?

I wanted to avoid the additional NOPs for the bare metal case. If you don't mind them I can do as you are suggesting.

Juergen
Re: [PATCH v2 05/12] x86: rework arch_local_irq_restore() to not use popf
On 22.11.20 22:44, Andy Lutomirski wrote:
> On Sat, Nov 21, 2020 at 10:55 PM Jürgen Groß wrote:
>> On 20.11.20 12:59, Peter Zijlstra wrote:
>>> On Fri, Nov 20, 2020 at 12:46:23PM +0100, Juergen Gross wrote:
>>>> +static __always_inline void arch_local_irq_restore(unsigned long flags)
>>>> +{
>>>> +	if (!arch_irqs_disabled_flags(flags))
>>>> +		arch_local_irq_enable();
>>>> +}
>>>
>>> If someone were to write horrible code like:
>>>
>>>     local_irq_disable();
>>>     local_irq_save(flags);
>>>     local_irq_enable();
>>>     local_irq_restore(flags);
>>>
>>> we'd be up some creek without a paddle... now I don't _think_ we have genius code like that, but I'd feel safer if we can haz an assertion in there somewhere... Maybe something like:
>>>
>>> #ifdef CONFIG_DEBUG_ENTRY // for lack of something saner
>>>     WARN_ON_ONCE((arch_local_save_flags() ^ flags) & X86_EFLAGS_IF);
>>> #endif
>>>
>>> at the end?
>>
>> I'd like to, but using WARN_ON_ONCE() in include/asm/irqflags.h sounds like a perfect recipe for include dependency hell. We could use a plain asm("ud2") instead.
>
> How about out-of-lining it:
>
> #ifdef CONFIG_DEBUG_ENTRY
> extern void warn_bogus_irqrestore();
> #endif
>
> static __always_inline void arch_local_irq_restore(unsigned long flags)
> {
> 	if (!arch_irqs_disabled_flags(flags)) {
> 		arch_local_irq_enable();
> 	} else {
> #ifdef CONFIG_DEBUG_ENTRY
> 		if (unlikely(arch_local_irq_save() & X86_EFLAGS_IF))
> 			warn_bogus_irqrestore();
> #endif
> 	}
> }

This couldn't be a WARN_ON_ONCE() then (or it would be a catch-all).

Another approach might be to open-code the WARN_ON_ONCE(), like:

#ifdef CONFIG_DEBUG_ENTRY
extern void warn_bogus_irqrestore(bool *once);
#endif

static __always_inline void arch_local_irq_restore(unsigned long flags)
{
	if (!arch_irqs_disabled_flags(flags))
		arch_local_irq_enable();
#ifdef CONFIG_DEBUG_ENTRY
	{
		static bool once;

		if (unlikely(arch_local_irq_save() & X86_EFLAGS_IF))
			warn_bogus_irqrestore(&once);
	}
#endif
}

Juergen
Re: [PATCH v2 05/12] x86: rework arch_local_irq_restore() to not use popf
On 20.11.20 12:59, Peter Zijlstra wrote:
> On Fri, Nov 20, 2020 at 12:46:23PM +0100, Juergen Gross wrote:
>> +static __always_inline void arch_local_irq_restore(unsigned long flags)
>> +{
>> +	if (!arch_irqs_disabled_flags(flags))
>> +		arch_local_irq_enable();
>> +}
>
> If someone were to write horrible code like:
>
>     local_irq_disable();
>     local_irq_save(flags);
>     local_irq_enable();
>     local_irq_restore(flags);
>
> we'd be up some creek without a paddle... now I don't _think_ we have genius code like that, but I'd feel safer if we can haz an assertion in there somewhere... Maybe something like:
>
> #ifdef CONFIG_DEBUG_ENTRY // for lack of something saner
>     WARN_ON_ONCE((arch_local_save_flags() ^ flags) & X86_EFLAGS_IF);
> #endif
>
> at the end?

I'd like to, but using WARN_ON_ONCE() in include/asm/irqflags.h sounds like a perfect recipe for include dependency hell. We could use a plain asm("ud2") instead.

Juergen
Re: [PATCH v2 08/12] x86/paravirt: remove no longer needed 32-bit pvops cruft
On 20.11.20 13:08, Peter Zijlstra wrote:
> On Fri, Nov 20, 2020 at 12:46:26PM +0100, Juergen Gross wrote:
>> +#define PVOP_CALL(rettype, op, clbr, call_clbr, extra_clbr, ...)	\
>> ({									\
>> 	PVOP_CALL_ARGS;							\
>> 	PVOP_TEST_NULL(op);						\
>> +	BUILD_BUG_ON(sizeof(rettype) > sizeof(unsigned long));		\
>> +	asm volatile(paravirt_alt(PARAVIRT_CALL)			\
>> +		     : call_clbr, ASM_CALL_CONSTRAINT			\
>> +		     : paravirt_type(op),				\
>> +		       paravirt_clobber(clbr),				\
>> +		       ##__VA_ARGS__					\
>> +		     : "memory", "cc" extra_clbr);			\
>> +	(rettype)(__eax & PVOP_RETMASK(rettype));			\
>> })
>
> This is now very similar to PVOP_VCALL() (note how PVOP_CALL_ARGS is PVOP_VCALL_ARGS). Could we get away with doing something horrible like:
>
> #define PVOP_VCALL(X...) (void)PVOP_CALL(long, X)
>
> ?

Oh, indeed. And in patch 9 the same could be done for the ALT variants.

Juergen
Re: [PATCH v2 06/12] x86/paravirt: switch time pvops functions to use static_call()
On 20.11.20 13:01, Peter Zijlstra wrote:
> On Fri, Nov 20, 2020 at 12:46:24PM +0100, Juergen Gross wrote:
>> The time pvops functions are the only ones left which might be used in 32-bit mode and which return a 64-bit value. Switch them to use the static_call() mechanism instead of pvops, as this allows quite some simplification of the pvops implementation.
>>
>> Due to include hell this requires splitting out the time interfaces into a new header file.
>
> There's also this patch floating around; just in case that would come in handy:
>
> https://lkml.kernel.org/r/20201110005609.40989-3-frede...@kernel.org

Ah, yes. This would make life much easier.

Juergen
Re: [PATCH v2 05/12] x86: rework arch_local_irq_restore() to not use popf
On 20.11.20 12:59, Peter Zijlstra wrote:
> On Fri, Nov 20, 2020 at 12:46:23PM +0100, Juergen Gross wrote:
>> +static __always_inline void arch_local_irq_restore(unsigned long flags)
>> +{
>> +	if (!arch_irqs_disabled_flags(flags))
>> +		arch_local_irq_enable();
>> +}
>
> If someone were to write horrible code like:
>
>     local_irq_disable();
>     local_irq_save(flags);
>     local_irq_enable();
>     local_irq_restore(flags);
>
> we'd be up some creek without a paddle... now I don't _think_ we have genius code like that, but I'd feel safer if we can haz an assertion in there somewhere... Maybe something like:
>
> #ifdef CONFIG_DEBUG_ENTRY // for lack of something saner
>     WARN_ON_ONCE((arch_local_save_flags() ^ flags) & X86_EFLAGS_IF);
> #endif
>
> at the end?

I'd be fine with that. I didn't add something like that because I couldn't find a suitable CONFIG_ :-)

Juergen
Re: [PATCH 4/4] x86/xen: drop USERGS_SYSRET64 paravirt call
On 16.11.20 17:28, Andy Lutomirski wrote:
> On Mon, Nov 16, 2020 at 7:23 AM Juergen Gross wrote:
>> USERGS_SYSRET64 is used to return from a syscall via sysret, but a Xen PV guest will nevertheless use the iret hypercall, as there is no sysret PV hypercall defined.
>>
>> So instead of testing all the prerequisites for doing a sysret and then mangling the stack for Xen PV again for doing an iret, just use the iret exit from the beginning. This can easily be done via an ALTERNATIVE like it is done for the sysenter compat case already.
>>
>> While at it remove the stale sysret32 remnants.
>>
>> Signed-off-by: Juergen Gross
>
> Acked-by: Andy Lutomirski
>
> FWIW, you've lost the VGCF_in_syscall optimization. Let me see if I can give it back to you better.

Ah, right. Nevertheless a simple kernel build is about 0.5% faster with this patch.

Juergen
Re: [PATCH v2 6/7] xen/balloon: try to merge system ram resources
On 08.09.20 22:10, David Hildenbrand wrote:
> Let's try to merge system ram resources we add, to minimize the number of resources in /proc/iomem. We don't care about the boundaries of individual chunks we added.
>
> Cc: Andrew Morton
> Cc: Michal Hocko
> Cc: Boris Ostrovsky
> Cc: Juergen Gross
> Cc: Stefano Stabellini
> Cc: Roger Pau Monné
> Cc: Julien Grall
> Cc: Pankaj Gupta
> Cc: Baoquan He
> Cc: Wei Yang
> Signed-off-by: David Hildenbrand

Reviewed-by: Juergen Gross

Juergen

> ---
>  drivers/xen/balloon.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
> index 7bac38764513d..b57b2067ecbfb 100644
> --- a/drivers/xen/balloon.c
> +++ b/drivers/xen/balloon.c
> @@ -331,7 +331,7 @@ static enum bp_state reserve_additional_memory(void)
>  	mutex_unlock(&balloon_mutex);
>  	/* add_memory_resource() requires the device_hotplug lock */
>  	lock_device_hotplug();
> -	rc = add_memory_resource(nid, resource, 0);
> +	rc = add_memory_resource(nid, resource, MEMHP_MERGE_RESOURCE);
>  	unlock_device_hotplug();
>  	mutex_lock(&balloon_mutex);
Re: [PATCH v2 3/7] mm/memory_hotplug: prepare passing flags to add_memory() and friends
On 08.09.20 22:10, David Hildenbrand wrote: We soon want to pass flags, e.g., to mark added System RAM resources. mergeable. Prepare for that. This patch is based on a similar patch by Oscar Salvador: https://lkml.kernel.org/r/20190625075227.15193-3-osalva...@suse.de Acked-by: Wei Liu Cc: Andrew Morton Cc: Michal Hocko Cc: Dan Williams Cc: Jason Gunthorpe Cc: Pankaj Gupta Cc: Baoquan He Cc: Wei Yang Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: "Rafael J. Wysocki" Cc: Len Brown Cc: Greg Kroah-Hartman Cc: Vishal Verma Cc: Dave Jiang Cc: "K. Y. Srinivasan" Cc: Haiyang Zhang Cc: Stephen Hemminger Cc: Wei Liu Cc: Heiko Carstens Cc: Vasily Gorbik Cc: Christian Borntraeger Cc: David Hildenbrand Cc: "Michael S. Tsirkin" Cc: Jason Wang Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: "Oliver O'Halloran" Cc: Pingfan Liu Cc: Nathan Lynch Cc: Libor Pechacek Cc: Anton Blanchard Cc: Leonardo Bras Cc: linuxppc-...@lists.ozlabs.org Cc: linux-a...@vger.kernel.org Cc: linux-nvd...@lists.01.org Cc: linux-hyp...@vger.kernel.org Cc: linux-s...@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Cc: xen-de...@lists.xenproject.org Signed-off-by: David Hildenbrand Reviewed-by: Juergen Gross (Xen related part) Juergen --- arch/powerpc/platforms/powernv/memtrace.c | 2 +- arch/powerpc/platforms/pseries/hotplug-memory.c | 2 +- drivers/acpi/acpi_memhotplug.c | 2 +- drivers/base/memory.c | 2 +- drivers/dax/kmem.c | 2 +- drivers/hv/hv_balloon.c | 2 +- drivers/s390/char/sclp_cmd.c| 2 +- drivers/virtio/virtio_mem.c | 2 +- drivers/xen/balloon.c | 2 +- include/linux/memory_hotplug.h | 10 ++ mm/memory_hotplug.c | 15 --- 11 files changed, 23 insertions(+), 20 deletions(-) diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c index 13b369d2cc454..a7475d18c671c 100644 --- a/arch/powerpc/platforms/powernv/memtrace.c +++ b/arch/powerpc/platforms/powernv/memtrace.c @@ -224,7 +224,7 @@ static int 
memtrace_online(void) ent->mem = 0; } - if (add_memory(ent->nid, ent->start, ent->size)) { + if (add_memory(ent->nid, ent->start, ent->size, 0)) { pr_err("Failed to add trace memory to node %d\n", ent->nid); ret += 1; diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 5d545b78111f9..54a888ea7f751 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -606,7 +606,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb) block_sz = memory_block_size_bytes(); /* Add the memory */ - rc = __add_memory(lmb->nid, lmb->base_addr, block_sz); + rc = __add_memory(lmb->nid, lmb->base_addr, block_sz, 0); if (rc) { invalidate_lmb_associativity_index(lmb); return rc; diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c index e294f44a78504..d91b3584d4b2b 100644 --- a/drivers/acpi/acpi_memhotplug.c +++ b/drivers/acpi/acpi_memhotplug.c @@ -207,7 +207,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device) if (node < 0) node = memory_add_physaddr_to_nid(info->start_addr); - result = __add_memory(node, info->start_addr, info->length); + result = __add_memory(node, info->start_addr, info->length, 0); /* * If the memory block has been used by the kernel, add_memory() diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 4db3c660de831..2287bcf86480e 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -432,7 +432,7 @@ static ssize_t probe_store(struct device *dev, struct device_attribute *attr, nid = memory_add_physaddr_to_nid(phys_addr); ret = __add_memory(nid, phys_addr, - MIN_MEMORY_BLOCK_SIZE * sections_per_block); + MIN_MEMORY_BLOCK_SIZE * sections_per_block, 0); if (ret) goto out; diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 7dcb2902e9b1b..8e66b28ef5bc6 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -95,7 +95,7 @@ int dev_dax_kmem_probe(struct dev_dax *dev_dax) 
* this as RAM automatically. */ rc = add_memory_driver_managed(numa_node, range.start, - range_len(), kmem_name); + range_len(), kmem_name, 0); res->flags |= IORESOURCE_BUSY; if (rc) {
Re: [PATCH v1 4/5] xen/balloon: try to merge system ram resources
On 21.08.20 12:34, David Hildenbrand wrote: Let's reuse the new mechanism to merge system ram resources below the root. We are the only one hotplugging system ram (e.g., DIMMs don't apply), so this is safe to be used. Cc: Andrew Morton Cc: Michal Hocko Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: Roger Pau Monné Cc: Julien Grall Cc: Pankaj Gupta Cc: Baoquan He Cc: Wei Yang Signed-off-by: David Hildenbrand --- drivers/xen/balloon.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c index 37ffccda8bb87..5ec73f752b8a7 100644 --- a/drivers/xen/balloon.c +++ b/drivers/xen/balloon.c @@ -338,6 +338,10 @@ static enum bp_state reserve_additional_memory(void) if (rc) { pr_warn("Cannot add additional memory (%i)\n", rc); goto err; + } else { + resource = NULL; + /* Try to reduce the number of system ram resources. */ + merge_system_ram_resources(_resource); } I don't see the need for setting resource to NULL and to use an "else" clause here. Juergen ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers
On 11.08.20 10:12, Peter Zijlstra wrote: On Tue, Aug 11, 2020 at 09:57:55AM +0200, Jürgen Groß wrote: On 11.08.20 09:41, Peter Zijlstra wrote: On Fri, Aug 07, 2020 at 05:19:03PM +0200, Marco Elver wrote: My hypothesis here is simply that kvm_wait() may be called in a place where we get the same case I mentioned to Peter, raw_local_irq_save(); /* or other IRQs off without tracing */ ... kvm_wait() /* IRQ state tracing gets confused */ ... raw_local_irq_restore(); and therefore, using raw variants in kvm_wait() works. It's also safe because it doesn't call any other libraries that would result in corrupt Yes, this is definitely an issue. Tracing, we also musn't call into tracing when using raw_local_irq_*(). Because then we re-intoduce this same issue all over again. Both halt() and safe_halt() are more paravirt calls, but given we're in a KVM paravirt call already, I suppose we can directly use native_*() here. Something like so then... I suppose, but then the Xen variants need TLC too. Just to be sure I understand you correct: You mean that xen_qlock_kick() and xen_qlock_wait() and all functions called by those should gain the "notrace" attribute, right? I am not sure why the kick variants need it, though. IMO those are called only after the lock has been released, so they should be fine without notrace. The issue happens when someone uses arch_spinlock_t under raw_local_irq_*(). And again: we shouldn't forget the Hyper-V variants. Bah, my grep failed :/ Also *groan*, that's calling apic->send_IPI(). In case you don't want to do it I can send the patch for the Xen variants. Juergen ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers
On 11.08.20 10:12, Peter Zijlstra wrote: On Tue, Aug 11, 2020 at 09:57:55AM +0200, Jürgen Groß wrote: On 11.08.20 09:41, Peter Zijlstra wrote: On Fri, Aug 07, 2020 at 05:19:03PM +0200, Marco Elver wrote: My hypothesis here is simply that kvm_wait() may be called in a place where we get the same case I mentioned to Peter, raw_local_irq_save(); /* or other IRQs off without tracing */ ... kvm_wait() /* IRQ state tracing gets confused */ ... raw_local_irq_restore(); and therefore, using raw variants in kvm_wait() works. It's also safe because it doesn't call any other libraries that would result in corrupt Yes, this is definitely an issue. Tracing, we also musn't call into tracing when using raw_local_irq_*(). Because then we re-intoduce this same issue all over again. Both halt() and safe_halt() are more paravirt calls, but given we're in a KVM paravirt call already, I suppose we can directly use native_*() here. Something like so then... I suppose, but then the Xen variants need TLC too. Just to be sure I understand you correct: You mean that xen_qlock_kick() and xen_qlock_wait() and all functions called by those should gain the "notrace" attribute, right? I am not sure why the kick variants need it, though. IMO those are called only after the lock has been released, so they should be fine without notrace. The issue happens when someone uses arch_spinlock_t under raw_local_irq_*(). Ah, okay. And again: we shouldn't forget the Hyper-V variants. Bah, my grep failed :/ Also *groan*, that's calling apic->send_IPI(). I've seen that, too. :-( Juergen ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers
On 11.08.20 09:41, Peter Zijlstra wrote: On Fri, Aug 07, 2020 at 05:19:03PM +0200, Marco Elver wrote: My hypothesis here is simply that kvm_wait() may be called in a place where we get the same case I mentioned to Peter, raw_local_irq_save(); /* or other IRQs off without tracing */ ... kvm_wait() /* IRQ state tracing gets confused */ ... raw_local_irq_restore(); and therefore, using raw variants in kvm_wait() works. It's also safe because it doesn't call any other libraries that would result in corrupt Yes, this is definitely an issue. Tracing, we also musn't call into tracing when using raw_local_irq_*(). Because then we re-intoduce this same issue all over again. Both halt() and safe_halt() are more paravirt calls, but given we're in a KVM paravirt call already, I suppose we can directly use native_*() here. Something like so then... I suppose, but then the Xen variants need TLC too. Just to be sure I understand you correct: You mean that xen_qlock_kick() and xen_qlock_wait() and all functions called by those should gain the "notrace" attribute, right? I am not sure why the kick variants need it, though. IMO those are called only after the lock has been released, so they should be fine without notrace. And again: we shouldn't forget the Hyper-V variants. Juergen ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers
On 11.08.20 09:00, Marco Elver wrote: On Fri, 7 Aug 2020 at 17:19, Marco Elver wrote: On Fri, Aug 07, 2020 at 02:08PM +0200, Marco Elver wrote: On Fri, 7 Aug 2020 at 14:04, Jürgen Groß wrote: On 07.08.20 13:38, Marco Elver wrote: On Fri, Aug 07, 2020 at 12:35PM +0200, Jürgen Groß wrote: ... I think CONFIG_PARAVIRT_XXL shouldn't matter, but I'm not completely sure about that. CONFIG_PARAVIRT_SPINLOCKS would be my primary suspect. Yes, PARAVIRT_XXL doesn't make a different. When disabling PARAVIRT_SPINLOCKS, however, the warnings go away. Thanks for testing! I take it you are doing the tests in a KVM guest? Yes, correct. If so I have a gut feeling that the use of local_irq_save() and local_irq_restore() in kvm_wait() might be fishy. I might be completely wrong here, though. Happy to help debug more, although I might need patches or pointers what to play with. BTW, I think Xen's variant of pv spinlocks is fine (no playing with IRQ on/off). Hyper-V seems to do the same as KVM, and kicking another vcpu could be problematic as well, as it is just using IPI. I experimented a bit more, and the below patch seems to solve the warnings. However, that was based on your pointer about kvm_wait(), and I can't quite tell if it is the right solution. My hypothesis here is simply that kvm_wait() may be called in a place where we get the same case I mentioned to Peter, raw_local_irq_save(); /* or other IRQs off without tracing */ ... kvm_wait() /* IRQ state tracing gets confused */ ... raw_local_irq_restore(); and therefore, using raw variants in kvm_wait() works. It's also safe because it doesn't call any other libraries that would result in corrupt IRQ state AFAIK. Just to follow-up, it'd still be nice to fix this. Suggestions? I could send the below as a patch, but can only go off my above hypothesis and the fact that syzbot is happier, so not entirely convincing. Peter has told me via IRC he will look soon further into this. 
Your finding suggests that the pv-lock implementation for Hyper-V needs some tweaking, too. For that purpose I'm adding Wei to Cc. Juergen Thanks, -- Marco -- >8 -- diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 233c77d056c9..1d412d1466f0 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -797,7 +797,7 @@ static void kvm_wait(u8 *ptr, u8 val) if (in_nmi()) return; - local_irq_save(flags); + raw_local_irq_save(flags); if (READ_ONCE(*ptr) != val) goto out; @@ -810,10 +810,10 @@ static void kvm_wait(u8 *ptr, u8 val) if (arch_irqs_disabled_flags(flags)) halt(); else - safe_halt(); + raw_safe_halt(); out: - local_irq_restore(flags); + raw_local_irq_restore(flags); } #ifdef CONFIG_X86_32 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 4/7] x86/paravirt: remove 32-bit support from PARAVIRT_XXL
On 10.08.20 18:53, Boris Ostrovsky wrote: On 8/10/20 12:39 AM, Jürgen Groß wrote: On 09.08.20 04:34, Boris Ostrovsky wrote: On 8/7/20 4:38 AM, Juergen Gross wrote: @@ -377,10 +373,7 @@ static inline pte_t __pte(pteval_t val) { pteval_t ret; - if (sizeof(pteval_t) > sizeof(long)) - ret = PVOP_CALLEE2(pteval_t, mmu.make_pte, val, (u64)val >> 32); - else - ret = PVOP_CALLEE1(pteval_t, mmu.make_pte, val); + ret = PVOP_CALLEE1(pteval_t, mmu.make_pte, val); return (pte_t) { .pte = ret }; Can this now simply return (pte_t) ret? I don't think so, but I can turn it into return native_make_pte(PVOP_CALLEE1(...)); I thought that since now this is only built for 64-bit we don't have to worry about different pte_t definitions and can do what we do for example, for __pgd()? Yes, I did that: return (pte_t) { ret }; Juergen ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 4/7] x86/paravirt: remove 32-bit support from PARAVIRT_XXL
On 09.08.20 04:34, Boris Ostrovsky wrote: On 8/7/20 4:38 AM, Juergen Gross wrote: @@ -377,10 +373,7 @@ static inline pte_t __pte(pteval_t val) { pteval_t ret; - if (sizeof(pteval_t) > sizeof(long)) - ret = PVOP_CALLEE2(pteval_t, mmu.make_pte, val, (u64)val >> 32); - else - ret = PVOP_CALLEE1(pteval_t, mmu.make_pte, val); + ret = PVOP_CALLEE1(pteval_t, mmu.make_pte, val); return (pte_t) { .pte = ret }; Can this now simply return (pte_t) ret? I don't think so, but I can turn it into return native_make_pte(PVOP_CALLEE1(...)); Juergen ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers
On 07.08.20 13:38, Marco Elver wrote: On Fri, Aug 07, 2020 at 12:35PM +0200, Jürgen Groß wrote: On 07.08.20 11:50, Marco Elver wrote: On Fri, Aug 07, 2020 at 11:24AM +0200, Jürgen Groß wrote: On 07.08.20 11:01, Marco Elver wrote: On Thu, 6 Aug 2020 at 18:06, Marco Elver wrote: On Thu, 6 Aug 2020 at 15:17, Marco Elver wrote: On Thu, Aug 06, 2020 at 01:32PM +0200, pet...@infradead.org wrote: On Thu, Aug 06, 2020 at 09:47:23AM +0200, Marco Elver wrote: Testing my hypothesis that raw then nested non-raw local_irq_save/restore() breaks IRQ state tracking -- see the reproducer below. This is at least 1 case I can think of that we're bound to hit. ... /me goes ponder things... How's something like this then? --- include/linux/sched.h | 3 --- kernel/kcsan/core.c | 62 --- 2 files changed, 44 insertions(+), 21 deletions(-) Thank you! That approach seems to pass syzbot (also with CONFIG_PARAVIRT) and kcsan-test tests. I had to modify it some, so that report.c's use of the restore logic works and not mess up the IRQ trace printed on KCSAN reports (with CONFIG_KCSAN_VERBOSE). I still need to fully convince myself all is well now and we don't end up with more fixes. :-) If it passes further testing, I'll send it as a real patch (I want to add you as Co-developed-by, but would need your Signed-off-by for the code you pasted, I think.) I let it run on syzbot through the night, and it's fine without PARAVIRT (see below). I have sent the patch (need your Signed-off-by as it's based on your code, thank you!): https://lkml.kernel.org/r/20200807090031.3506555-1-el...@google.com With CONFIG_PARAVIRT=y (without the notrace->noinstr patch), I still get lockdep DEBUG_LOCKS_WARN_ON(!lockdep_hardirqs_enabled()), although it takes longer for syzbot to hit them. But I think that's expected because we can still get the recursion that I pointed out, and will need that patch. Never mind, I get these warnings even if I don't turn on KCSAN (CONFIG_KCSAN=n). 
Something else is going on with PARAVIRT=y that throws off IRQ state tracking. :-/ What are the settings of CONFIG_PARAVIRT_XXL and CONFIG_PARAVIRT_SPINLOCKS in this case? I attached a config. $> grep PARAVIRT .config CONFIG_PARAVIRT=y CONFIG_PARAVIRT_XXL=y # CONFIG_PARAVIRT_DEBUG is not set CONFIG_PARAVIRT_SPINLOCKS=y # CONFIG_PARAVIRT_TIME_ACCOUNTING is not set CONFIG_PARAVIRT_CLOCK=y Anything special I need to do to reproduce the problem? Or would you be willing to do some more rounds with different config settings? I can only test it with syzkaller, but that probably doesn't help if you don't already have it set up. It can't seem to find a C reproducer. I did some more rounds with different configs. I think CONFIG_PARAVIRT_XXL shouldn't matter, but I'm not completely sure about that. CONFIG_PARAVIRT_SPINLOCKS would be my primary suspect. Yes, PARAVIRT_XXL doesn't make a different. When disabling PARAVIRT_SPINLOCKS, however, the warnings go away. Thanks for testing! I take it you are doing the tests in a KVM guest? If so I have a gut feeling that the use of local_irq_save() and local_irq_restore() in kvm_wait() might be fishy. I might be completely wrong here, though. BTW, I think Xen's variant of pv spinlocks is fine (no playing with IRQ on/off). Hyper-V seems to do the same as KVM, and kicking another vcpu could be problematic as well, as it is just using IPI. Juergen ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers
On 07.08.20 11:50, Marco Elver wrote: On Fri, Aug 07, 2020 at 11:24AM +0200, Jürgen Groß wrote: On 07.08.20 11:01, Marco Elver wrote: On Thu, 6 Aug 2020 at 18:06, Marco Elver wrote: On Thu, 6 Aug 2020 at 15:17, Marco Elver wrote: On Thu, Aug 06, 2020 at 01:32PM +0200, pet...@infradead.org wrote: On Thu, Aug 06, 2020 at 09:47:23AM +0200, Marco Elver wrote: Testing my hypothesis that raw then nested non-raw local_irq_save/restore() breaks IRQ state tracking -- see the reproducer below. This is at least 1 case I can think of that we're bound to hit. ... /me goes ponder things... How's something like this then? --- include/linux/sched.h | 3 --- kernel/kcsan/core.c | 62 --- 2 files changed, 44 insertions(+), 21 deletions(-) Thank you! That approach seems to pass syzbot (also with CONFIG_PARAVIRT) and kcsan-test tests. I had to modify it some, so that report.c's use of the restore logic works and not mess up the IRQ trace printed on KCSAN reports (with CONFIG_KCSAN_VERBOSE). I still need to fully convince myself all is well now and we don't end up with more fixes. :-) If it passes further testing, I'll send it as a real patch (I want to add you as Co-developed-by, but would need your Signed-off-by for the code you pasted, I think.) I let it run on syzbot through the night, and it's fine without PARAVIRT (see below). I have sent the patch (need your Signed-off-by as it's based on your code, thank you!): https://lkml.kernel.org/r/20200807090031.3506555-1-el...@google.com With CONFIG_PARAVIRT=y (without the notrace->noinstr patch), I still get lockdep DEBUG_LOCKS_WARN_ON(!lockdep_hardirqs_enabled()), although it takes longer for syzbot to hit them. But I think that's expected because we can still get the recursion that I pointed out, and will need that patch. Never mind, I get these warnings even if I don't turn on KCSAN (CONFIG_KCSAN=n). Something else is going on with PARAVIRT=y that throws off IRQ state tracking. 
:-/ What are the settings of CONFIG_PARAVIRT_XXL and CONFIG_PARAVIRT_SPINLOCKS in this case? I attached a config. $> grep PARAVIRT .config CONFIG_PARAVIRT=y CONFIG_PARAVIRT_XXL=y # CONFIG_PARAVIRT_DEBUG is not set CONFIG_PARAVIRT_SPINLOCKS=y # CONFIG_PARAVIRT_TIME_ACCOUNTING is not set CONFIG_PARAVIRT_CLOCK=y Anything special I need to do to reproduce the problem? Or would you be willing to do some more rounds with different config settings? I think CONFIG_PARAVIRT_XXL shouldn't matter, but I'm not completely sure about that. CONFIG_PARAVIRT_SPINLOCKS would be my primary suspect. Juergen ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 4/7] x86/paravirt: remove 32-bit support from PARAVIRT_XXL
On 07.08.20 11:39, pet...@infradead.org wrote: On Fri, Aug 07, 2020 at 10:38:23AM +0200, Juergen Gross wrote: -# else - const unsigned char cpu_iret[1]; -# endif }; static const struct patch_xxl patch_data_xxl = { @@ -42,7 +38,6 @@ static const struct patch_xxl patch_data_xxl = { .irq_save_fl= { 0x9c, 0x58 }, // pushf; pop %[re]ax .mmu_read_cr2 = { 0x0f, 0x20, 0xd0 }, // mov %cr2, %[re]ax .mmu_read_cr3 = { 0x0f, 0x20, 0xd8 }, // mov %cr3, %[re]ax -# ifdef CONFIG_X86_64 .mmu_write_cr3 = { 0x0f, 0x22, 0xdf }, // mov %rdi, %cr3 .irq_restore_fl = { 0x57, 0x9d }, // push %rdi; popfq .cpu_wbinvd = { 0x0f, 0x09 }, // wbinvd @@ -50,19 +45,11 @@ static const struct patch_xxl patch_data_xxl = { 0x48, 0x0f, 0x07 }, // swapgs; sysretq .cpu_swapgs = { 0x0f, 0x01, 0xf8 }, // swapgs .mov64 = { 0x48, 0x89, 0xf8 }, // mov %rdi, %rax -# else - .mmu_write_cr3 = { 0x0f, 0x22, 0xd8 }, // mov %eax, %cr3 - .irq_restore_fl = { 0x50, 0x9d }, // push %eax; popf - .cpu_iret = { 0xcf }, // iret -# endif I was looking at x86_64 paravirt the other day and found we actually have pv_ops.cpu.iret users there.. On x86_64 we have (without PARAVIRT_XXL): #define INTERRUPT_RETURNjmp native_iret and with PARAVIRT_XXL this is basically a jmp *pv_ops.cpu.iret which will then be patched to either jmp native_iret or jmp xen_iret. On x86_32 INTERRUPT_RETURN was just "iret" for the non-paravirt case. This is the reason for above dropping of the static patch data. So we want to change the above to also patch iret on x86_64 or do we need to fix x86_64 to not have pv-iret? We want it to stay how it is. This will let both variants (PARVIRT y/n) continue to work. Juergen ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers
On 07.08.20 11:01, Marco Elver wrote: On Thu, 6 Aug 2020 at 18:06, Marco Elver wrote: On Thu, 6 Aug 2020 at 15:17, Marco Elver wrote: On Thu, Aug 06, 2020 at 01:32PM +0200, pet...@infradead.org wrote: On Thu, Aug 06, 2020 at 09:47:23AM +0200, Marco Elver wrote: Testing my hypothesis that raw then nested non-raw local_irq_save/restore() breaks IRQ state tracking -- see the reproducer below. This is at least 1 case I can think of that we're bound to hit. ... /me goes ponder things... How's something like this then? --- include/linux/sched.h | 3 --- kernel/kcsan/core.c | 62 --- 2 files changed, 44 insertions(+), 21 deletions(-) Thank you! That approach seems to pass syzbot (also with CONFIG_PARAVIRT) and kcsan-test tests. I had to modify it some, so that report.c's use of the restore logic works and not mess up the IRQ trace printed on KCSAN reports (with CONFIG_KCSAN_VERBOSE). I still need to fully convince myself all is well now and we don't end up with more fixes. :-) If it passes further testing, I'll send it as a real patch (I want to add you as Co-developed-by, but would need your Signed-off-by for the code you pasted, I think.) I let it run on syzbot through the night, and it's fine without PARAVIRT (see below). I have sent the patch (need your Signed-off-by as it's based on your code, thank you!): https://lkml.kernel.org/r/20200807090031.3506555-1-el...@google.com With CONFIG_PARAVIRT=y (without the notrace->noinstr patch), I still get lockdep DEBUG_LOCKS_WARN_ON(!lockdep_hardirqs_enabled()), although it takes longer for syzbot to hit them. But I think that's expected because we can still get the recursion that I pointed out, and will need that patch. Never mind, I get these warnings even if I don't turn on KCSAN (CONFIG_KCSAN=n). Something else is going on with PARAVIRT=y that throws off IRQ state tracking. :-/ What are the settings of CONFIG_PARAVIRT_XXL and CONFIG_PARAVIRT_SPINLOCKS in this case? 
Juergen
Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers
On 05.08.20 16:12, pet...@infradead.org wrote: On Wed, Aug 05, 2020 at 03:59:40PM +0200, Marco Elver wrote: On Wed, Aug 05, 2020 at 03:42PM +0200, pet...@infradead.org wrote: Shouldn't we __always_inline those? They're going to be really small. I can send a v2, and you can choose. For reference, though: 86271ee0 : 86271ee0: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 86271ee5: 48 83 3d 43 87 e4 01cmpq $0x0,0x1e48743(%rip) # 880ba630 86271eec: 00 86271eed: 74 0d je 86271efc 86271eef: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 86271ef4: ff 14 25 30 a6 0b 88callq *0x880ba630 86271efb: c3 retq 86271efc: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 86271f01: 0f 0b ud2 86271a90 : 86271a90: 53 push %rbx 86271a91: 48 89 fbmov%rdi,%rbx 86271a94: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 86271a99: 48 83 3d 97 8b e4 01cmpq $0x0,0x1e48b97(%rip) # 880ba638 86271aa0: 00 86271aa1: 74 11 je 86271ab4 86271aa3: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 86271aa8: 48 89 dfmov%rbx,%rdi 86271aab: ff 14 25 38 a6 0b 88callq *0x880ba638 86271ab2: 5b pop%rbx 86271ab3: c3 retq 86271ab4: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 86271ab9: 0f 0b ud2 Blergh, that's abysmall. In part I suspect because you have CONFIG_PARAVIRT_DEBUG, let me try and untangle that PV macro maze. Probably. I have found the following in my kernel: fff81540a5f : 81540a5f: ff 14 25 40 a4 23 82callq *0x8223a440 81540a66: c3 retq Juergen ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 0/4] Remove 32-bit Xen PV guest support
On 02.07.20 16:48, Brian Gerst wrote: On Wed, Jul 1, 2020 at 7:07 AM Juergen Gross wrote: The long term plan has been to replace Xen PV guests by PVH. The first victim of that plan are now 32-bit PV guests, as those are used only rather seldom these days. Xen on x86 requires 64-bit support and with Grub2 now supporting PVH officially since version 2.04 there is no need to keep 32-bit PV guest support alive in the Linux kernel. Additionally Meltdown mitigation is not available in the kernel running as 32-bit PV guest, so dropping this mode makes sense from security point of view, too. One thing that you missed is removing VDSO_NOTE_NONEGSEG_BIT from vdso32/note.S. With that removed there is no difference from the 64-bit version. Oh, this means we can probably remove arch/x86/xen/vdso.h completely. Otherwise this series looks good to me. Thanks, Juergen ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] MAINTAINERS: Update PARAVIRT_OPS_INTERFACE and VMWARE_HYPERVISOR_INTERFACE
On 17.04.20 01:45, Deep Shah wrote: Thomas Hellstrom will be handing over VMware's maintainership of these interfaces to Deep Shah. Signed-off-by: Deep Shah Acked-by: Thomas Hellstrom Pushed to xen/tip.git for-linus-5.8 Juergen
Re: [PATCH] MAINTAINERS: Update PARAVIRT_OPS_INTERFACE and VMWARE_HYPERVISOR_INTERFACE
On 17.04.20 08:22, Jürgen Groß wrote: On 17.04.20 01:45, Deep Shah wrote: Thomas Hellstrom will be handing over VMware's maintainership of these interfaces to Deep Shah. Signed-off-by: Deep Shah Acked-by: Thomas Hellstrom Acked-by: Juergen Gross BTW, I can carry this patch through the Xen tree if nobody else wants to take it... Juergen
Re: [PATCH] MAINTAINERS: Update PARAVIRT_OPS_INTERFACE and VMWARE_HYPERVISOR_INTERFACE
On 17.04.20 01:45, Deep Shah wrote: Thomas Hellstrom will be handing over VMware's maintainership of these interfaces to Deep Shah. Signed-off-by: Deep Shah Acked-by: Thomas Hellstrom Acked-by: Juergen Gross Juergen
Re: [RFC PATCH 00/26] Runtime paravirt patching
On 08.04.20 14:08, Peter Zijlstra wrote: On Tue, Apr 07, 2020 at 10:02:57PM -0700, Ankur Arora wrote: Mechanism: the patching itself is done using stop_machine(). That is not ideal -- text_poke_stop_machine() was replaced with INT3+emulation via text_poke_bp(), but I'm using this to address two issues: 1) emulation in text_poke() can only easily handle a small set of instructions and this is problematic for inlined pv-ops (and see a possible alternatives use-case below.) 2) paravirt patching might have inter-dependendent ops (ex. lock.queued_lock_slowpath, lock.queued_lock_unlock are paired and need to be updated atomically.) And then you hope that the spinlock state transfers.. That is that both implementations agree what an unlocked spinlock looks like. Suppose the native one was a ticket spinlock, where unlocked means 'head == tail' while the paravirt one is a test-and-set spinlock, where unlocked means 'val == 0'. That just happens to not be the case now, but it was for a fair while. Sure? This would mean that before spinlock-pvops are being set no lock is allowed to be used in the kernel, because this would block the boot time transition of the lock variant to use. Another problem I'm seeing is that runtime pvops patching would rely on the fact that stop_machine() isn't guarded by a spinlock. Juergen ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 00/26] Runtime paravirt patching
On 08.04.20 07:02, Ankur Arora wrote:
> A KVM host (or another hypervisor) might advertise paravirtualized
> features and optimization hints (ex KVM_HINTS_REALTIME) which might
> become stale over the lifetime of the guest. For instance, the

Then this hint is wrong if it can't be guaranteed.

> host might go from being undersubscribed to being oversubscribed (or
> the other way round) and it would make sense for the guest to switch
> pv-ops based on that.

I think using pv-ops for such a feature change is just wrong. What
comes next? Using pv-ops for being able to migrate a guest from an
Intel to an AMD machine?

> ...
> There are four main sets of patches in this series:
>
> 1. PV-ops management (patches 1-10, 20):
>    Mostly infrastructure and refactoring pieces to make paravirt
>    patching usable at runtime. For the most part scoped under
>    CONFIG_PARAVIRT_RUNTIME.
>
>    Patches 1-7, to persist part of parainstructions in memory:
>      "x86/paravirt: Specify subsection in PVOP macros"
>      "x86/paravirt: Allow paravirt patching post-init"
>      "x86/paravirt: PVRTOP macros for PARAVIRT_RUNTIME"
>      "x86/alternatives: Refactor alternatives_smp_module*"
>      "x86/alternatives: Rename alternatives_smp*, smp_alt_module"
>      "x86/alternatives: Remove stale symbols"
>      "x86/paravirt: Persist .parainstructions.runtime"
>
>    Patches 8-10, develop the interfaces to safely switch pv-ops:
>      "x86/paravirt: Stash native pv-ops"
>      "x86/paravirt: Add runtime_patch()"
>      "x86/paravirt: Add primitives to stage pv-ops"
>
>    Patch 20 enables switching of pv_lock_ops:
>      "x86/paravirt: Enable pv-spinlocks in runtime_patch()"
>
> 2. Non-emulated text poking (patches 11-19)
>
>    Patches 11-13 are mostly refactoring to split __text_poke() into
>    map, unmap and poke/memcpy phases, with the poke portion being
>    re-entrant:
>      "x86/alternatives: Remove return value of text_poke*()"
>      "x86/alternatives: Use __get_unlocked_pte() in text_poke()"
>      "x86/alternatives: Split __text_poke()"
>
>    Patches 15, 17 add the actual poking state-machine:
>      "x86/alternatives: Non-emulated text poking"
>      "x86/alternatives: Add patching logic in text_poke_site()"
>
>    with patches 14 and 18 containing the pieces for BP handling:
>      "x86/alternatives: Handle native insns in text_poke_loc*()"
>      "x86/alternatives: Handle BP in non-emulated text poking"
>
>    and patch 19 provides the ability to use the state-machine above
>    in an NMI context (fixes some potential deadlocks when handling
>    inter-dependent operations and multiple NMIs):
>      "x86/alternatives: NMI safe runtime patching"
>
>    Patch 16 provides the interface (paravirt_runtime_patch()) to use
>    the poking mechanism developed above, and patch 21 adds a selftest:
>      "x86/alternatives: Add paravirt patching at runtime"
>      "x86/alternatives: Paravirt runtime selftest"
>
> 3. KVM guest changes to be able to use this (patches 22-23, 25-26):
>      "kvm/paravirt: Encapsulate KVM pv switching logic"
>      "x86/kvm: Add worker to trigger runtime patching"
>      "x86/kvm: Guest support for dynamic hints"
>      "x86/kvm: Add hint change notifier for KVM_HINT_REALTIME"
>
> 4. KVM host changes to notify the guest of a change (patch 24):
>      "x86/kvm: Support dynamic CPUID hints"
>
> Testing:
> With paravirt patching, the code is mostly stable on Intel and AMD
> systems under kernbench and locktorture with paravirt toggling (with,
> without synthetic NMIs) in the background.
>
> Queued spinlock performance for locktorture is also on expected lines:
>
>   [ 1533.221563] Writes: Total: 1048759000 Max/Min: 0/0 Fail: 0
>   # toggle PV spinlocks
>   [ 1594.713699] Writes: Total: 660545 Max/Min: 0/0 Fail: 0
>   # PV spinlocks (in ~60 seconds) = 62,901,545
>   # toggle native spinlocks
>   [ 1656.117175] Writes: Total: 111340 Max/Min: 0/0 Fail: 0
>   # native spinlocks (in ~60 seconds) = 2,228,295
>
> The alternatives testing is more limited, with it being used to
> rewrite mostly harmless X86_FEATUREs with load in the background.
>
> Patches also at:
>   ssh://g...@github.com/terminus/linux.git alternatives-rfc-upstream-v1
>
> Please review.
>
> Thanks
> Ankur
>
> [1] The precise change in memory footprint depends on config options,
> but the following example inlines queued_spin_unlock() (which forms
> the bulk of the added state). The added footprint is the size of the
> .parainstructions.runtime section:
>
>   $ objdump -h vmlinux|grep .parainstructions
>   Idx Name                       Size      VMA       LMA       File-off  Algn
>    27 .parainstructions          0001013c  82895000  02895000  01c95000  2**3
>    28 .parainstructions.runtime  cd2c      828a5140  028a5140  01ca5140  2**3
>
>   $ size vmlinux
>   text     data     bss      dec      hex     filename
>   13726196 12302814 14094336 40123346 2643bd2 vmlinux
>
> Ankur Arora (26):
>   x86/paravirt: Specify subsection in PVOP macros
>   x86/paravirt: Allow paravirt patching post-init
>   x86/paravirt: PVRTOP macros for PARAVIRT_RUNTIME
>   x86/alternatives: Refactor
Re: VIRTIO adoption in other hypervisors
On 28.02.20 17:47, Alex Bennée wrote:
> Jan Kiszka writes:
>> On 28.02.20 11:30, Jan Kiszka wrote:
>>> On 28.02.20 11:16, Alex Bennée wrote:
>>>> Hi,
>>>> I believe there has been some development work for supporting
>>>> VIRTIO on Xen although it seems to have stalled according to:
>>>> https://wiki.xenproject.org/wiki/Virtio_On_Xen
>>>> Recently at KVM Forum there was Jan's talk about Inter-VM shared
>>>> memory which proposed ivshmemv2 as a VIRTIO transport:
>>>> https://events19.linuxfoundation.org/events/kvm-forum-2019/program/schedule/
>>>> As I understood it this would allow Xen (and other hypervisors) a
>>>> simple way to be able to carry virtio traffic between guest and
>>>> end point.
>> And to clarify the scope of this effort: virtio-over-ivshmem is not
>> the fastest option to offer virtio to a guest (static "DMA" window),
>> but it is the simplest one from the hypervisor PoV and, thus, also
>> likely the easiest one to argue over when it comes to security and
>> safety.
> So to drill down on this: is this a particular problem with type-1
> hypervisors? It seems to me any KVM-like run loop trivially supports
> a range of virtio devices by virtue of trapping accesses to the
> signalling area of a virtqueue and allowing the VMM to handle the
> transaction whichever way it sees fit. I've not quite understood the
> way Xen interfaces to QEMU, aside from it being different to
> everything else.
> Moreover it seems the type-1 hypervisors are more interested in
> providing better isolation between segments of a system, whereas
> VIRTIO currently assumes either the VMM or the hypervisor has full
> access to the full guest address space. I've seen quite a lot of
> slides that want to isolate sections of device emulation to separate
> processes or even separate guest VMs.

In Xen device emulation is done by other VMs. Normally the devices are
emulated via dom0, but it is possible to have other driver domains,
too (those need to get passed through the related PCI devices, of
course).

PV device backends get access only to the guest pages the PV frontends
allow. This is done via so-called "grants", which are per guest. So a
frontend can grant another Xen VM access to dedicated pages. The
backend uses the grants to map those pages via the hypervisor in order
to perform the I/O. After finishing the I/O, the I/O pages are
unmapped by the backend again.

For legacy device emulation via qemu, the guest running qemu needs to
get access to all of the guest's memory, as the guest won't grant any
pages to the emulating VM. It is possible to let qemu run in a small
stub guest using PV devices in order to isolate the legacy guest from
e.g. dom0.

Hope that makes it clearer,

Juergen
Re: [PATCH] x86/ioperm: add new paravirt function update_io_bitmap
Friendly ping...

On 18.02.20 16:47, Juergen Gross wrote:

Commit 111e7b15cf10f6 ("x86/ioperm: Extend IOPL config to control
ioperm() as well") reworked the iopl syscall to use I/O bitmaps.
Unfortunately this broke Xen PV domains using that syscall as there
is currently no I/O bitmap support in PV domains.

Add I/O bitmap support via a new paravirt function update_io_bitmap
which Xen PV domains can use to update their I/O bitmaps via a
hypercall.

Fixes: 111e7b15cf10f6 ("x86/ioperm: Extend IOPL config to control ioperm() as well")
Reported-by: Jan Beulich
Cc: # 5.5
Signed-off-by: Juergen Gross
Reviewed-by: Jan Beulich
Tested-by: Jan Beulich
---
 arch/x86/include/asm/io_bitmap.h      |  9 -
 arch/x86/include/asm/paravirt.h       |  7 +++
 arch/x86/include/asm/paravirt_types.h |  4
 arch/x86/kernel/paravirt.c            |  5 +
 arch/x86/kernel/process.c             |  2 +-
 arch/x86/xen/enlighten_pv.c           | 25 +
 6 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/io_bitmap.h b/arch/x86/include/asm/io_bitmap.h
index 02c6ef8f7667..07344d82e88e 100644
--- a/arch/x86/include/asm/io_bitmap.h
+++ b/arch/x86/include/asm/io_bitmap.h
@@ -19,7 +19,14 @@ struct task_struct;
 void io_bitmap_share(struct task_struct *tsk);
 void io_bitmap_exit(void);
-void tss_update_io_bitmap(void);
+void native_tss_update_io_bitmap(void);
+
+#ifdef CONFIG_PARAVIRT_XXL
+#include
+#else
+#define tss_update_io_bitmap native_tss_update_io_bitmap
+#endif
+
 #else
 static inline void io_bitmap_share(struct task_struct *tsk) { }
 static inline void io_bitmap_exit(void) { }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 86e7317eb31f..694d8daf4983 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -295,6 +295,13 @@ static inline void write_idt_entry(gate_desc *dt, int entry, const gate_desc *g)
 	PVOP_VCALL3(cpu.write_idt_entry, dt, entry, g);
 }
+#ifdef CONFIG_X86_IOPL_IOPERM
+static inline void tss_update_io_bitmap(void)
+{
+	PVOP_VCALL0(cpu.update_io_bitmap);
+}
+#endif
+
 static inline void paravirt_activate_mm(struct mm_struct *prev,
 					struct mm_struct *next)
 {
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 84812964d3dd..732f62e04ddb 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -140,6 +140,10 @@ struct pv_cpu_ops {
 	void (*load_sp0)(unsigned long sp0);
+#ifdef CONFIG_X86_IOPL_IOPERM
+	void (*update_io_bitmap)(void);
+#endif
+
 	void (*wbinvd)(void);
 	/* cpuid emulation, mostly so that caps bits can be disabled */
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 789f5e4f89de..c131ba4e70ef 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -30,6 +30,7 @@
 #include
 #include
 #include
+#include
 /*
  * nop stub, which must not clobber anything *including the stack* to
@@ -341,6 +342,10 @@ struct paravirt_patch_template pv_ops = {
 	.cpu.iret = native_iret,
 	.cpu.swapgs = native_swapgs,
+#ifdef CONFIG_X86_IOPL_IOPERM
+	.cpu.update_io_bitmap = native_tss_update_io_bitmap,
+#endif
+
 	.cpu.start_context_switch = paravirt_nop,
 	.cpu.end_context_switch = paravirt_nop,
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 839b5244e3b7..3053c85e0e42 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -374,7 +374,7 @@ static void tss_copy_io_bitmap(struct tss_struct *tss, struct io_bitmap *iobm)
 /**
  * tss_update_io_bitmap - Update I/O bitmap before exiting to usermode
  */
-void tss_update_io_bitmap(void)
+void native_tss_update_io_bitmap(void)
 {
 	struct tss_struct *tss = this_cpu_ptr(&cpu_tss_rw);
 	struct thread_struct *t = &current->thread;
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 1f756e8b..feaf2e68ee5c 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -72,6 +72,9 @@
 #include
 #include
 #include
+#ifdef CONFIG_X86_IOPL_IOPERM
+#include
+#endif
 #ifdef CONFIG_ACPI
 #include
@@ -837,6 +840,25 @@ static void xen_load_sp0(unsigned long sp0)
 	this_cpu_write(cpu_tss_rw.x86_tss.sp0, sp0);
 }
+#ifdef CONFIG_X86_IOPL_IOPERM
+static void xen_update_io_bitmap(void)
+{
+	struct physdev_set_iobitmap iobitmap;
+	struct tss_struct *tss = this_cpu_ptr(&cpu_tss_rw);
+
+	native_tss_update_io_bitmap();
+
+	iobitmap.bitmap = (uint8_t *)(&tss->x86_tss) +
+			  tss->x86_tss.io_bitmap_base;
+	if (tss->x86_tss.io_bitmap_base == IO_BITMAP_OFFSET_INVALID)
+		iobitmap.nr_ports = 0;
+	else
+		iobitmap.nr_ports = IO_BITMAP_BITS;
+
Re: [PATCH 23/62] x86/idt: Move IDT to data segment
On 19.02.20 11:42, Joerg Roedel wrote:
> Hi Jürgen,
>
> On Wed, Feb 12, 2020 at 05:28:21PM +0100, Jürgen Groß wrote:
>> Xen-PV is clearing BSS as the very first action.
>
> In the kernel image? Or in the ELF loader before jumping to the
> kernel image?

In the kernel image. See arch/x86/xen/xen-head.S - startup_xen is the
entry point of the kernel.

Juergen
Re: [PATCH] x86/ioperm: add new paravirt function update_io_bitmap
On 19.02.20 10:22, Thomas Gleixner wrote:
> Jürgen Groß writes:
>> On 18.02.20 22:03, Thomas Gleixner wrote:
>>> BTW, why isn't stuff like this caught during next or at least
>>> before the final release? Is nothing running CI on upstream with
>>> all that XEN muck active?
>> This problem showed up by not being able to start the X server
>> (probably not the freshest one) in dom0 on a moderately aged AMD
>> system. Our CI tests tend to be more text-console based for dom0.
>
> tools/testing/selftests/x86/io[perm|pl] should have caught that as
> well, right? If not, we need to fix the selftests.

Hmm, yes. Thanks for the pointer. Will ask our testing specialist what
is done in this regard and how it can be enhanced.

Juergen
Re: [PATCH] x86/ioperm: add new paravirt function update_io_bitmap
On 18.02.20 22:03, Thomas Gleixner wrote:
> Juergen Gross writes:
>> Commit 111e7b15cf10f6 ("x86/ioperm: Extend IOPL config to control
>> ioperm() as well") reworked the iopl syscall to use I/O bitmaps.
>> Unfortunately this broke Xen PV domains using that syscall as there
>> is currently no I/O bitmap support in PV domains.
>>
>> Add I/O bitmap support via a new paravirt function update_io_bitmap
>> which Xen PV domains can use to update their I/O bitmaps via a
>> hypercall.
>>
>> Fixes: 111e7b15cf10f6 ("x86/ioperm: Extend IOPL config to control ioperm() as well")
>> Reported-by: Jan Beulich
>> Cc: # 5.5
>> Signed-off-by: Juergen Gross
>> Reviewed-by: Jan Beulich
>> Tested-by: Jan Beulich
>
> Duh, sorry about that and thanks for fixing it.
>
> BTW, why isn't stuff like this caught during next or at least before
> the final release? Is nothing running CI on upstream with all that
> XEN muck active?

This problem showed up by not being able to start the X server
(probably not the freshest one) in dom0 on a moderately aged AMD
system. Our CI tests tend to be more text-console based for dom0.

Juergen
Re: [PATCH 23/62] x86/idt: Move IDT to data segment
On 12.02.20 17:23, Andy Lutomirski wrote:
>> On Feb 12, 2020, at 3:55 AM, Joerg Roedel wrote:
>> On Tue, Feb 11, 2020 at 02:41:25PM -0800, Andy Lutomirski wrote:
>>> On Tue, Feb 11, 2020 at 5:53 AM Joerg Roedel wrote:
>>>> From: Joerg Roedel
>>>> With SEV-ES, exception handling is needed very early, even before
>>>> the kernel has cleared the bss segment. In order to prevent
>>>> clearing the currently used IDT, move the IDT to the data segment.
>>> Ugh. At the very least this needs a comment in the code.
>> Yes, right, added a comment for that.
>>> I had a patch to fix the kernel ELF loader to clear BSS, which
>>> would fix this problem once and for all, but it didn't work due to
>>> the messy way that the decompressor handles memory. I never got
>>> around to fixing this, sadly.
>> Aren't there other ways of booting (Xen-PV?) which don't use the
>> kernel ELF loader?
>
> Dunno. I would hope that any sane loader would clear BSS before
> executing anything. This isn't currently the case, though. Oh well.

Xen-PV is clearing BSS as the very first action.

Juergen