Re: [PATCH v6 00/12] x86: major paravirt cleanup

2021-03-11 Thread Jürgen Groß via Virtualization

On 11.03.21 13:51, Borislav Petkov wrote:

On Thu, Mar 11, 2021 at 01:50:26PM +0100, Borislav Petkov wrote:

and move the cleanups patches 13 and 14 to the beginning of the set?


Yeah, 14 needs ALTERNATIVE_TERNARY so I guess after patch 5, that is.


I'm putting 13 at the beginning of the series and 14 after 5.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v6 02/12] x86/paravirt: switch time pvops functions to use static_call()

2021-03-09 Thread Jürgen Groß via Virtualization

On 09.03.21 19:57, Borislav Petkov wrote:

On Tue, Mar 09, 2021 at 02:48:03PM +0100, Juergen Gross wrote:

@@ -167,6 +168,17 @@ static u64 native_steal_clock(int cpu)
return 0;
  }
  
+DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);

+DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
+
+bool paravirt_using_native_sched_clock = true;
+
+void paravirt_set_sched_clock(u64 (*func)(void))
+{
+   static_call_update(pv_sched_clock, func);
+   paravirt_using_native_sched_clock = (func == native_sched_clock);
+}


What's the point of this function if there's a global
paravirt_using_native_sched_clock variable now?


It combines the two needed actions: updating the static call and
setting the paravirt_using_native_sched_clock boolean.


Looking at how the information whether native_sched_clock is in use
is needed in tsc.c, it would probably be cleaner if you add a

set_sched_clock_native(void);

or so, to tsc.c instead, call that here, and make that long variable
name a shorter, static one in tsc.c.


I need to transfer a boolean value, so it would need to be

set_sched_clock_native(bool state);

In the end the difference is only marginal IMO.

Just had another idea: I could add a function to static_call.h for
querying the current function. This would avoid the double bookkeeping
and could probably be used later when switching other pv_ops calls to
static_call, too (e.g. pv_is_native_spin_unlock()).

What do you think?


Juergen



Re: [PATCH v5 11/12] x86/paravirt: switch functions with custom code to ALTERNATIVE

2021-03-08 Thread Jürgen Groß via Virtualization

On 08.03.21 19:30, Borislav Petkov wrote:

On Mon, Mar 08, 2021 at 01:28:43PM +0100, Juergen Gross wrote:

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 36cd71fa097f..04b3067f31b5 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -137,7 +137,8 @@ static inline void write_cr0(unsigned long x)
  
  static inline unsigned long read_cr2(void)

  {
-   return PVOP_CALLEE0(unsigned long, mmu.read_cr2);
+   return PVOP_ALT_CALLEE0(unsigned long, mmu.read_cr2,
+   "mov %%cr2, %%rax;", ~X86_FEATURE_XENPV);


Just some cursory poking first - in-depth review later.

Do I see this correctly that the negated feature can be expressed with, to use
this example here:

ALTERNATIVE_TERNARY(mmu.read_cr2, X86_FEATURE_XENPV, "", "mov %%cr2, %%rax;");

?


No.

This would leave the Xen-pv case with a nop, while we need it to call
mmu.read_cr2().

In the Xen-pv case there must be _no_ alternative patching in order to
have the paravirt patching do its patching (indirect->direct call).

This is exactly the reason why I need the "not feature" (negated feature).

The only other solution I can think of would be a "split static_call"
handling using ALTERNATIVE_TERNARY():

ALTERNATIVE_TERNARY(initial_static_call(mmu.read_cr2),
X86_FEATURE_XENPV,
final_static_call(mmu.read_cr2),
"mov %%cr2, %%rax;");

with initial_static_call() doing an indirect call, while
final_static_call() would do a direct call.

Not sure we really want that.


Juergen



Re: [PATCH v5 02/12] x86/paravirt: switch time pvops functions to use static_call()

2021-03-08 Thread Jürgen Groß via Virtualization

On 08.03.21 18:00, Boris Ostrovsky wrote:


On 3/8/21 7:28 AM, Juergen Gross wrote:

--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -379,11 +379,6 @@ void xen_timer_resume(void)
}
  }
  
-static const struct pv_time_ops xen_time_ops __initconst = {

-   .sched_clock = xen_sched_clock,
-   .steal_clock = xen_steal_clock,
-};
-
  static struct pvclock_vsyscall_time_info *xen_clock __read_mostly;
  static u64 xen_clock_value_saved;
  
@@ -528,7 +523,8 @@ static void __init xen_time_init(void)

  void __init xen_init_time_ops(void)
  {
xen_sched_clock_offset = xen_clocksource_read();
-   pv_ops.time = xen_time_ops;
+   static_call_update(pv_steal_clock, xen_steal_clock);
+   paravirt_set_sched_clock(xen_sched_clock);
  
  	x86_init.timers.timer_init = xen_time_init;

x86_init.timers.setup_percpu_clockev = x86_init_noop;
@@ -570,7 +566,8 @@ void __init xen_hvm_init_time_ops(void)
}
  
  	xen_sched_clock_offset = xen_clocksource_read();

-   pv_ops.time = xen_time_ops;
+   static_call_update(pv_steal_clock, xen_steal_clock);
+   paravirt_set_sched_clock(xen_sched_clock);
x86_init.timers.setup_percpu_clockev = xen_time_init;
x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;



There is a bunch of stuff that's common between the two cases so it can be 
factored out.


Yes.




  
diff --git a/drivers/xen/time.c b/drivers/xen/time.c

index 108edbcbc040..152dd33bb223 100644
--- a/drivers/xen/time.c
+++ b/drivers/xen/time.c
@@ -7,6 +7,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include 

  #include 
@@ -175,7 +176,7 @@ void __init xen_time_setup_guest(void)
xen_runstate_remote = !HYPERVISOR_vm_assist(VMASST_CMD_enable,
VMASST_TYPE_runstate_update_flag);
  
-	pv_ops.time.steal_clock = xen_steal_clock;

+   static_call_update(pv_steal_clock, xen_steal_clock);
  



Do we actually need this? We've already set this up in xen_init_time_ops(). 
(But maybe for ARM).


Correct. Arm needs this.


Juergen



Re: [PATCH v3 06/15] x86/paravirt: switch time pvops functions to use static_call()

2021-01-19 Thread Jürgen Groß via Virtualization

On 17.12.20 18:31, Michael Kelley wrote:

From: Juergen Gross  Sent: Thursday, December 17, 2020 1:31 AM


The time pvops functions are the only ones left which might be
used in 32-bit mode and which return a 64-bit value.

Switch them to use the static_call() mechanism instead of pvops, as
this allows quite some simplification of the pvops implementation.

Due to include hell this requires splitting out the time interfaces
into a new header file.

Signed-off-by: Juergen Gross 
---
  arch/x86/Kconfig  |  1 +
  arch/x86/include/asm/mshyperv.h   | 11 
  arch/x86/include/asm/paravirt.h   | 14 --
  arch/x86/include/asm/paravirt_time.h  | 38 +++
  arch/x86/include/asm/paravirt_types.h |  6 -
  arch/x86/kernel/cpu/vmware.c  |  5 ++--
  arch/x86/kernel/kvm.c |  3 ++-
  arch/x86/kernel/kvmclock.c|  3 ++-
  arch/x86/kernel/paravirt.c| 16 ---
  arch/x86/kernel/tsc.c |  3 ++-
  arch/x86/xen/time.c   | 12 -
  drivers/clocksource/hyperv_timer.c|  5 ++--
  drivers/xen/time.c|  3 ++-
  kernel/sched/sched.h  |  1 +
  14 files changed, 71 insertions(+), 50 deletions(-)
  create mode 100644 arch/x86/include/asm/paravirt_time.h



[snip]
  

diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index ffc289992d1b..45942d420626 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -56,17 +56,6 @@ typedef int (*hyperv_fill_flush_list_func)(
  #define hv_get_raw_timer() rdtsc_ordered()
  #define hv_get_vector() HYPERVISOR_CALLBACK_VECTOR

-/*
- * Reference to pv_ops must be inline so objtool
- * detection of noinstr violations can work correctly.
- */
-static __always_inline void hv_setup_sched_clock(void *sched_clock)
-{
-#ifdef CONFIG_PARAVIRT
-   pv_ops.time.sched_clock = sched_clock;
-#endif
-}
-
  void hyperv_vector_handler(struct pt_regs *regs);

  static inline void hv_enable_stimer0_percpu_irq(int irq) {}


[snip]


diff --git a/drivers/clocksource/hyperv_timer.c 
b/drivers/clocksource/hyperv_timer.c
index ba04cb381cd3..1ed79993fc50 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -21,6 +21,7 @@
  #include 
  #include 
  #include 
+#include 

  static struct clock_event_device __percpu *hv_clock_event;
  static u64 hv_sched_clock_offset __ro_after_init;
@@ -445,7 +446,7 @@ static bool __init hv_init_tsc_clocksource(void)
	clocksource_register_hz(&hyperv_cs_tsc, NSEC_PER_SEC/100);

hv_sched_clock_offset = hv_read_reference_counter();
-   hv_setup_sched_clock(read_hv_sched_clock_tsc);
+   paravirt_set_sched_clock(read_hv_sched_clock_tsc);

return true;
  }
@@ -470,6 +471,6 @@ void __init hv_init_clocksource(void)
	clocksource_register_hz(&hyperv_cs_msr, NSEC_PER_SEC/100);

hv_sched_clock_offset = hv_read_reference_counter();
-   hv_setup_sched_clock(read_hv_sched_clock_msr);
+   static_call_update(pv_sched_clock, read_hv_sched_clock_msr);
  }
  EXPORT_SYMBOL_GPL(hv_init_clocksource);


These Hyper-V changes are problematic as we want to keep hyperv_timer.c
architecture independent.  While only the code for x86/x64 is currently
accepted upstream, code for ARM64 support is in progress.   So we need
to use hv_setup_sched_clock() in hyperv_timer.c, and have the per-arch
implementation in mshyperv.h.


Okay, will switch back to old setup.


Juergen



Re: [PATCH v3 06/15] x86/paravirt: switch time pvops functions to use static_call()

2021-01-19 Thread Jürgen Groß via Virtualization

On 06.01.21 11:03, Borislav Petkov wrote:

On Thu, Dec 17, 2020 at 10:31:24AM +0100, Juergen Gross wrote:

The time pvops functions are the only ones left which might be
used in 32-bit mode and which return a 64-bit value.

Switch them to use the static_call() mechanism instead of pvops, as
this allows quite some simplification of the pvops implementation.

Due to include hell this requires splitting out the time interfaces
into a new header file.


I guess you can add Peter's patch to your set, no?

https://lkml.kernel.org/r/20201110005609.40989-3-frede...@kernel.org


With a slight modification, yes. :-)


Juergen



Re: [PATCH v2 00/12] x86: major paravirt cleanup

2020-12-15 Thread Jürgen Groß via Virtualization

On 15.12.20 15:54, Peter Zijlstra wrote:

On Tue, Dec 15, 2020 at 03:18:34PM +0100, Peter Zijlstra wrote:

Ah, I was waiting for Josh to have an opinion (and then sorta forgot
about the whole thing again). Let me refresh and provide at least a
Changelog.


How's this then?


Thanks, will add it into my series.


Juergen



---
Subject: objtool: Alternatives vs ORC, the hard way
From: Peter Zijlstra 
Date: Mon, 23 Nov 2020 14:43:17 +0100

Alternatives pose an interesting problem for unwinders because from
the unwinder's PoV we're just executing instructions; it has no idea
the text is modified, nor any way of retrieving what with.

Therefore the stance has been that alternatives must not change stack
state, as encoded by commit: 7117f16bf460 ("objtool: Fix ORC vs
alternatives"). This obviously guarantees that whatever actual
instructions end up in the text, the unwind information is correct.

However, there is one additional source of text patching that isn't
currently visible to objtool: paravirt immediate patching. And it
turns out one of these violates the rule.

As part of cleaning that up, the unfortunate reality is that objtool
now has to deal with alternatives modifying unwind state and validate
the combination is valid and generate ORC data to match.

The problem is that a single instance of unwind information (ORC) must
capture and correctly unwind all alternatives. Since the trivially
correct mandate is out, implement the straightforward brute-force
approach:

  1) generate CFI information for each alternative

  2) unwind every alternative with the merge-sort of the previously
 generated CFI information -- O(n^2)

  3) for any possible conflict: yell.

  4) Generate ORC with merge-sort

Specifically for 3 there are two possible classes of conflicts:

  - the merge-sort itself could find conflicting CFI for the same
offset.

  - the unwind can fail with the merged CFI.

In specific, this allows us to deal with:

Alt1Alt2Alt3

  0x00	CALL *pv_ops.save_fl	CALL xen_save_fl	PUSHF
  0x01  POP %RAX
  0x02  NOP
  ...
  0x05  NOP
  ...
  0x07   

The unwind information for offset-0x00 is identical for all 3
alternatives. Similarly offset-0x05 and higher also are identical (and
the same as 0x00). However offset-0x01 has deviating CFI, but that is
only relevant for Alt3, neither of the other alternative instruction
streams will ever hit that offset.

Signed-off-by: Peter Zijlstra (Intel) 
---
  tools/objtool/check.c   |  180 
  tools/objtool/check.h   |    5 +
  tools/objtool/orc_gen.c |  180 +++-
  3 files changed, 290 insertions(+), 75 deletions(-)

--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1096,6 +1096,32 @@ static int handle_group_alt(struct objto
return -1;
}
  
+	/*
+	 * Add the filler NOP, required for alternative CFI.
+	 */
+	if (special_alt->group && special_alt->new_len < special_alt->orig_len) {
+		struct instruction *nop = malloc(sizeof(*nop));
+		if (!nop) {
+			WARN("malloc failed");
+			return -1;
+		}
+		memset(nop, 0, sizeof(*nop));
+		INIT_LIST_HEAD(&nop->alts);
+		INIT_LIST_HEAD(&nop->stack_ops);
+		init_cfi_state(&nop->cfi);
+
+		nop->sec = last_new_insn->sec;
+		nop->ignore = last_new_insn->ignore;
+		nop->func = last_new_insn->func;
+		nop->alt_group = alt_group;
+		nop->offset = last_new_insn->offset + last_new_insn->len;
+		nop->type = INSN_NOP;
+		nop->len = special_alt->orig_len - special_alt->new_len;
+
+		list_add(&nop->list, &last_new_insn->list);
+		last_new_insn = nop;
+	}
+
 	if (fake_jump)
 		list_add(&fake_jump->list, &last_new_insn->list);
  
@@ -2237,18 +2263,12 @@ static int handle_insn_ops(struct instru

struct stack_op *op;
  
 	list_for_each_entry(op, &insn->stack_ops, list) {
-		struct cfi_state old_cfi = state->cfi;
 		int res;
 
 		res = update_cfi_state(insn, &state->cfi, op);
 		if (res)
 			return res;
 
-		if (insn->alt_group && memcmp(&state->cfi, &old_cfi, sizeof(struct cfi_state))) {
-			WARN_FUNC("alternative modifies stack", insn->sec, insn->offset);
-			return -1;
-		}
-
 		if (op->dest.type == OP_DEST_PUSHF) {
 			if (!state->uaccess_stack) {
 				state->uaccess_stack = 1;
@@ -2460,19 +2480,137 @@ static int validate_return(struct symbol
   * unreported (because they're NOPs), such holes would result in CFI_UNDEFINED
   * 

Re: [PATCH v2 00/12] x86: major paravirt cleanup

2020-12-15 Thread Jürgen Groß via Virtualization

Peter,

On 23.11.20 14:43, Peter Zijlstra wrote:

On Fri, Nov 20, 2020 at 01:53:42PM +0100, Peter Zijlstra wrote:

On Fri, Nov 20, 2020 at 12:46:18PM +0100, Juergen Gross wrote:

  30 files changed, 325 insertions(+), 598 deletions(-)


Much awesome ! I'll try and get that objtool thing sorted.


This seems to work for me. It isn't 100% accurate, because it doesn't
know about the direct call instruction, but I can either fudge that or
switching to static_call() will cure that.

It's not exactly pretty, but it should be straight forward.


Are you planning to send this out as an "official" patch, or should I
include it in my series (in this case I'd need a variant with a proper
commit message)?

I'd like to have this settled soon, as I'm going to send V2 of my
series hopefully this week.


Juergen



Index: linux-2.6/tools/objtool/check.c
===
--- linux-2.6.orig/tools/objtool/check.c
+++ linux-2.6/tools/objtool/check.c
@@ -1090,6 +1090,32 @@ static int handle_group_alt(struct objto
return -1;
}
  
+	/*
+	 * Add the filler NOP, required for alternative CFI.
+	 */
+	if (special_alt->group && special_alt->new_len < special_alt->orig_len) {
+		struct instruction *nop = malloc(sizeof(*nop));
+		if (!nop) {
+			WARN("malloc failed");
+			return -1;
+		}
+		memset(nop, 0, sizeof(*nop));
+		INIT_LIST_HEAD(&nop->alts);
+		INIT_LIST_HEAD(&nop->stack_ops);
+		init_cfi_state(&nop->cfi);
+
+		nop->sec = last_new_insn->sec;
+		nop->ignore = last_new_insn->ignore;
+		nop->func = last_new_insn->func;
+		nop->alt_group = alt_group;
+		nop->offset = last_new_insn->offset + last_new_insn->len;
+		nop->type = INSN_NOP;
+		nop->len = special_alt->orig_len - special_alt->new_len;
+
+		list_add(&nop->list, &last_new_insn->list);
+		last_new_insn = nop;
+	}
+
 	if (fake_jump)
 		list_add(&fake_jump->list, &last_new_insn->list);
  
@@ -2190,18 +2216,12 @@ static int handle_insn_ops(struct instru

struct stack_op *op;
  
 	list_for_each_entry(op, &insn->stack_ops, list) {
-		struct cfi_state old_cfi = state->cfi;
 		int res;
 
 		res = update_cfi_state(insn, &state->cfi, op);
 		if (res)
 			return res;
 
-		if (insn->alt_group && memcmp(&state->cfi, &old_cfi, sizeof(struct cfi_state))) {
-			WARN_FUNC("alternative modifies stack", insn->sec, insn->offset);
-			return -1;
-		}
-
 		if (op->dest.type == OP_DEST_PUSHF) {
 			if (!state->uaccess_stack) {
 				state->uaccess_stack = 1;
@@ -2399,19 +2419,137 @@ static int validate_return(struct symbol
   * unreported (because they're NOPs), such holes would result in CFI_UNDEFINED
  * states which then results in ORC entries, which we just said we didn't want.
  *
- * Avoid them by copying the CFI entry of the first instruction into the whole
- * alternative.
+ * Avoid them by copying the CFI entry of the first instruction into the hole.
  */
-static void fill_alternative_cfi(struct objtool_file *file, struct instruction *insn)
+static void __fill_alt_cfi(struct objtool_file *file, struct instruction *insn)
  {
struct instruction *first_insn = insn;
int alt_group = insn->alt_group;
  
-	sec_for_each_insn_continue(file, insn) {
+	sec_for_each_insn_from(file, insn) {
 		if (insn->alt_group != alt_group)
 			break;
-		insn->cfi = first_insn->cfi;
+
+		if (!insn->visited)
+			insn->cfi = first_insn->cfi;
+	}
+}
+
+static void fill_alt_cfi(struct objtool_file *file, struct instruction *alt_insn)
+{
+	struct alternative *alt;
+
+	__fill_alt_cfi(file, alt_insn);
+
+	list_for_each_entry(alt, &alt_insn->alts, list)
+		__fill_alt_cfi(file, alt->insn);
+}
+
+static struct instruction *
+__find_unwind(struct objtool_file *file,
+ struct instruction *insn, unsigned long offset)
+{
+   int alt_group = insn->alt_group;
+   struct instruction *next;
+   unsigned long off = 0;
+
+   while ((off + insn->len) <= offset) {
+   next = next_insn_same_sec(file, insn);
+   if (next && next->alt_group != alt_group)
+   next = NULL;
+
+   if (!next)
+   break;
+
+   off += insn->len;
+   insn = next;
}
+
+   return insn;
+}
+
+struct instruction *
+find_alt_unwind(struct objtool_file *file,
+   struct instruction *alt_insn, unsigned long offset)
+{
+   struct instruction *fit;
+   struct alternative *alt;
+   

Re: x86/ioapic: Cleanup the timer_works() irqflags mess

2020-12-10 Thread Jürgen Groß via Virtualization

On 10.12.20 21:15, Thomas Gleixner wrote:

Mark tripped over the creative irqflags handling in the IO-APIC timer
delivery check which ends up doing:

 	local_irq_save(flags);
 	local_irq_enable();
 	local_irq_restore(flags);

which triggered a new consistency check he's working on required for
replacing the POPF based restore with a conditional STI.

That code is a historical mess and none of this is needed. Make it
use straightforward local_irq_disable()/enable(), as that's all that is
required. It is invoked from interrupt-enabled code nowadays.

Reported-by: Mark Rutland 
Signed-off-by: Thomas Gleixner 
Tested-by: Mark Rutland 


Reviewed-by: Juergen Gross 


Juergen



Re: [PATCH v2 05/12] x86: rework arch_local_irq_restore() to not use popf

2020-12-09 Thread Jürgen Groß via Virtualization

On 09.12.20 15:02, Mark Rutland wrote:

On Wed, Dec 09, 2020 at 01:27:10PM +, Mark Rutland wrote:

On Sun, Nov 22, 2020 at 01:44:53PM -0800, Andy Lutomirski wrote:

On Sat, Nov 21, 2020 at 10:55 PM Jürgen Groß  wrote:

On 20.11.20 12:59, Peter Zijlstra wrote:

If someone were to write horrible code like:

   local_irq_disable();
   local_irq_save(flags);
   local_irq_enable();
   local_irq_restore(flags);

we'd be up some creek without a paddle... now I don't _think_ we have
genius code like that, but I'd feel safer if we can haz an assertion in
there somewhere...



I was just talking to Peter on IRC about implementing the same thing for
arm64, so could we put this in the generic irqflags code? IIUC we can
use raw_irqs_disabled() to do the check.

As this isn't really entry specific (and IIUC the cases this should
catch would break lockdep today), maybe we should add a new
DEBUG_IRQFLAGS for this, that DEBUG_LOCKDEP can also select?

Something like:

#define local_irq_restore(flags)				\
	do {							\
		if (!raw_irqs_disabled_flags(flags)) {		\
			trace_hardirqs_on();			\
		} else if (IS_ENABLED(CONFIG_DEBUG_IRQFLAGS)) {	\
			if (unlikely(raw_irqs_disabled()))	\


Whoops; that should be !raw_irqs_disabled().


				warn_bogus_irqrestore();	\
		}						\
		raw_local_irq_restore(flags);			\
	} while (0)

... perhaps? (ignoring however we deal with once-ness).


If no-one shouts in the next day or two I'll spin this as its own patch.


Fine with me. So I'll just ignore a potential error case in my patch.

Thanks,


Juergen




Re: [PATCH v2 04/12] x86/xen: drop USERGS_SYSRET64 paravirt call

2020-12-02 Thread Jürgen Groß via Virtualization

On 02.12.20 13:32, Borislav Petkov wrote:

On Fri, Nov 20, 2020 at 12:46:22PM +0100, Juergen Gross wrote:

@@ -123,12 +115,15 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
 * Try to use SYSRET instead of IRET if we're returning to
 * a completely clean 64-bit userspace context.  If we're not,
 * go to the slow exit path.
+* In the Xen PV case we must use iret anyway.
 */
-	movq	RCX(%rsp), %rcx
-	movq	RIP(%rsp), %r11
-
-	cmpq	%rcx, %r11	/* SYSRET requires RCX == RIP */
-	jne	swapgs_restore_regs_and_return_to_usermode
+	ALTERNATIVE __stringify( \
+		movq	RCX(%rsp), %rcx; \
+		movq	RIP(%rsp), %r11; \
+		cmpq	%rcx, %r11;	/* SYSRET requires RCX == RIP */ \
+		jne	swapgs_restore_regs_and_return_to_usermode), \
+		"jmp	swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV


Why such a big ALTERNATIVE when you can simply do:

 /*
  * Try to use SYSRET instead of IRET if we're returning to
  * a completely clean 64-bit userspace context.  If we're not,
  * go to the slow exit path.
  * In the Xen PV case we must use iret anyway.
  */
 	ALTERNATIVE "", "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV

 	movq	RCX(%rsp), %rcx
 	movq	RIP(%rsp), %r11
 	cmpq	%rcx, %r11	/* SYSRET requires RCX == RIP */
 	jne	swapgs_restore_regs_and_return_to_usermode

?



I wanted to avoid the additional NOPs for the bare metal case.

If you don't mind them I can do as you are suggesting.


Juergen



Re: [PATCH v2 05/12] x86: rework arch_local_irq_restore() to not use popf

2020-11-22 Thread Jürgen Groß via Virtualization

On 22.11.20 22:44, Andy Lutomirski wrote:

On Sat, Nov 21, 2020 at 10:55 PM Jürgen Groß  wrote:


On 20.11.20 12:59, Peter Zijlstra wrote:

On Fri, Nov 20, 2020 at 12:46:23PM +0100, Juergen Gross wrote:

+static __always_inline void arch_local_irq_restore(unsigned long flags)
+{
+	if (!arch_irqs_disabled_flags(flags))
+		arch_local_irq_enable();
+}


If someone were to write horrible code like:

   local_irq_disable();
   local_irq_save(flags);
   local_irq_enable();
   local_irq_restore(flags);

we'd be up some creek without a paddle... now I don't _think_ we have
genius code like that, but I'd feel safer if we can haz an assertion in
there somewhere...

Maybe something like:

#ifdef CONFIG_DEBUG_ENTRY // for lack of something saner
   WARN_ON_ONCE((arch_local_save_flags() ^ flags) & X86_EFLAGS_IF);
#endif

At the end?


I'd like to, but using WARN_ON_ONCE() in include/asm/irqflags.h sounds
like a perfect recipe for include dependency hell.

We could use a plain asm("ud2") instead.


How about out-of-lining it:

#ifdef CONFIG_DEBUG_ENTRY
extern void warn_bogus_irqrestore();
#endif

static __always_inline void arch_local_irq_restore(unsigned long flags)
{
	if (!arch_irqs_disabled_flags(flags)) {
		arch_local_irq_enable();
	} else {
#ifdef CONFIG_DEBUG_ENTRY
		if (unlikely(arch_local_irq_save() & X86_EFLAGS_IF))
			warn_bogus_irqrestore();
#endif
	}
}



This couldn't be a WARN_ON_ONCE() then (or it would be a catch all).
Another approach might be to open-code the WARN_ON_ONCE(), like:

#ifdef CONFIG_DEBUG_ENTRY
extern void warn_bogus_irqrestore(bool *once);
#endif

static __always_inline void arch_local_irq_restore(unsigned long flags)
{
	if (!arch_irqs_disabled_flags(flags))
		arch_local_irq_enable();
#ifdef CONFIG_DEBUG_ENTRY
	{
		static bool once;

		if (unlikely(arch_local_irq_save() & X86_EFLAGS_IF))
			warn_bogus_irqrestore(&once);
	}
#endif
}


Juergen



Re: [PATCH v2 05/12] x86: rework arch_local_irq_restore() to not use popf

2020-11-21 Thread Jürgen Groß via Virtualization

On 20.11.20 12:59, Peter Zijlstra wrote:

On Fri, Nov 20, 2020 at 12:46:23PM +0100, Juergen Gross wrote:

+static __always_inline void arch_local_irq_restore(unsigned long flags)
+{
+   if (!arch_irqs_disabled_flags(flags))
+   arch_local_irq_enable();
+}


If someone were to write horrible code like:

local_irq_disable();
local_irq_save(flags);
local_irq_enable();
local_irq_restore(flags);

we'd be up some creek without a paddle... now I don't _think_ we have
genius code like that, but I'd feel safer if we can haz an assertion in
there somewhere...

Maybe something like:

#ifdef CONFIG_DEBUG_ENTRY // for lack of something saner
WARN_ON_ONCE((arch_local_save_flags() ^ flags) & X86_EFLAGS_IF);
#endif

At the end?


I'd like to, but using WARN_ON_ONCE() in include/asm/irqflags.h sounds
like a perfect recipe for include dependency hell.

We could use a plain asm("ud2") instead.


Juergen



Re: [PATCH v2 08/12] x86/paravirt: remove no longer needed 32-bit pvops cruft

2020-11-20 Thread Jürgen Groß via Virtualization

On 20.11.20 13:08, Peter Zijlstra wrote:

On Fri, Nov 20, 2020 at 12:46:26PM +0100, Juergen Gross wrote:

+#define PVOP_CALL(rettype, op, clbr, call_clbr, extra_clbr, ...)   \
({  \
PVOP_CALL_ARGS; \
PVOP_TEST_NULL(op); \
+   BUILD_BUG_ON(sizeof(rettype) > sizeof(unsigned long));   \
+   asm volatile(paravirt_alt(PARAVIRT_CALL)\
+: call_clbr, ASM_CALL_CONSTRAINT   \
+: paravirt_type(op),   \
+  paravirt_clobber(clbr),  \
+  ##__VA_ARGS__\
+: "memory", "cc" extra_clbr);  \
+   (rettype)(__eax & PVOP_RETMASK(rettype));   \
})


This is now very similar to PVOP_VCALL() (note how PVOP_CALL_ARGS is
PVOP_VCALL_ARGS).

Could we get away with doing something horrible like:

#define PVOP_VCALL(X...) (void)PVOP_CALL(long, X)

?


Oh, indeed. And in patch 9 the same could be done for the ALT variants.


Juergen



Re: [PATCH v2 06/12] x86/paravirt: switch time pvops functions to use static_call()

2020-11-20 Thread Jürgen Groß via Virtualization

On 20.11.20 13:01, Peter Zijlstra wrote:

On Fri, Nov 20, 2020 at 12:46:24PM +0100, Juergen Gross wrote:

The time pvops functions are the only ones left which might be
used in 32-bit mode and which return a 64-bit value.

Switch them to use the static_call() mechanism instead of pvops, as
this allows quite some simplification of the pvops implementation.

Due to include hell this requires splitting out the time interfaces
into a new header file.


There's also this patch floating around; just in case that would come in
handy:

   https://lkml.kernel.org/r/20201110005609.40989-3-frede...@kernel.org



Ah, yes. This would make life much easier.


Juergen



Re: [PATCH v2 05/12] x86: rework arch_local_irq_restore() to not use popf

2020-11-20 Thread Jürgen Groß via Virtualization

On 20.11.20 12:59, Peter Zijlstra wrote:

On Fri, Nov 20, 2020 at 12:46:23PM +0100, Juergen Gross wrote:

+static __always_inline void arch_local_irq_restore(unsigned long flags)
+{
+   if (!arch_irqs_disabled_flags(flags))
+   arch_local_irq_enable();
+}


If someone were to write horrible code like:

local_irq_disable();
local_irq_save(flags);
local_irq_enable();
local_irq_restore(flags);

we'd be up some creek without a paddle... now I don't _think_ we have
genius code like that, but I'd feel safer if we can haz an assertion in
there somewhere...

Maybe something like:

#ifdef CONFIG_DEBUG_ENTRY // for lack of something saner
WARN_ON_ONCE((arch_local_save_flags() ^ flags) & X86_EFLAGS_IF);
#endif

At the end?



I'd be fine with that. I didn't add something like that because I
couldn't find a suitable CONFIG_ :-)


Juergen



Re: [PATCH 4/4] x86/xen: drop USERGS_SYSRET64 paravirt call

2020-11-17 Thread Jürgen Groß via Virtualization

On 16.11.20 17:28, Andy Lutomirski wrote:

On Mon, Nov 16, 2020 at 7:23 AM Juergen Gross  wrote:


USERGS_SYSRET64 is used to return from a syscall via sysret, but
a Xen PV guest will nevertheless use the iret hypercall, as there
is no sysret PV hypercall defined.

So instead of testing all the prerequisites for doing a sysret and
then mangling the stack for Xen PV again for doing an iret just use
the iret exit from the beginning.

This can easily be done via an ALTERNATIVE like it is done for the
sysenter compat case already.

While at it remove the stale sysret32 remnants.

Signed-off-by: Juergen Gross 


Acked-by: Andy Lutomirski 

FWIW, you've lost the VGCF_in_syscall optimization.  Let me see if I
can give it back to you better.


Ah, right.

Nevertheless a simple kernel build is about 0.5% faster with this
patch.


Juergen



Re: [PATCH v2 6/7] xen/balloon: try to merge system ram resources

2020-09-08 Thread Jürgen Groß

On 08.09.20 22:10, David Hildenbrand wrote:

Let's try to merge system ram resources we add, to minimize the number
of resources in /proc/iomem. We don't care about the boundaries of
individual chunks we added.

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Roger Pau Monné 
Cc: Julien Grall 
Cc: Pankaj Gupta 
Cc: Baoquan He 
Cc: Wei Yang 
Signed-off-by: David Hildenbrand 


Reviewed-by: Juergen Gross 

Juergen


---
  drivers/xen/balloon.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 7bac38764513d..b57b2067ecbfb 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -331,7 +331,7 @@ static enum bp_state reserve_additional_memory(void)
mutex_unlock(&balloon_mutex);
/* add_memory_resource() requires the device_hotplug lock */
lock_device_hotplug();
-   rc = add_memory_resource(nid, resource, 0);
+   rc = add_memory_resource(nid, resource, MEMHP_MERGE_RESOURCE);
unlock_device_hotplug();
mutex_lock(&balloon_mutex);
  




Re: [PATCH v2 3/7] mm/memory_hotplug: prepare passing flags to add_memory() and friends

2020-09-08 Thread Jürgen Groß

On 08.09.20 22:10, David Hildenbrand wrote:

We soon want to pass flags, e.g., to mark added System RAM resources
mergeable. Prepare for that.

This patch is based on a similar patch by Oscar Salvador:

https://lkml.kernel.org/r/20190625075227.15193-3-osalva...@suse.de

Acked-by: Wei Liu 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Jason Gunthorpe 
Cc: Pankaj Gupta 
Cc: Baoquan He 
Cc: Wei Yang 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Greg Kroah-Hartman 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Wei Liu 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: David Hildenbrand 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: "Oliver O'Halloran" 
Cc: Pingfan Liu 
Cc: Nathan Lynch 
Cc: Libor Pechacek 
Cc: Anton Blanchard 
Cc: Leonardo Bras 
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-a...@vger.kernel.org
Cc: linux-nvd...@lists.01.org
Cc: linux-hyp...@vger.kernel.org
Cc: linux-s...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: xen-de...@lists.xenproject.org
Signed-off-by: David Hildenbrand 


Reviewed-by: Juergen Gross  (Xen related part)


Juergen


---
  arch/powerpc/platforms/powernv/memtrace.c   |  2 +-
  arch/powerpc/platforms/pseries/hotplug-memory.c |  2 +-
  drivers/acpi/acpi_memhotplug.c  |  2 +-
  drivers/base/memory.c   |  2 +-
  drivers/dax/kmem.c  |  2 +-
  drivers/hv/hv_balloon.c |  2 +-
  drivers/s390/char/sclp_cmd.c|  2 +-
  drivers/virtio/virtio_mem.c |  2 +-
  drivers/xen/balloon.c   |  2 +-
  include/linux/memory_hotplug.h  | 10 ++
  mm/memory_hotplug.c | 15 ---
  11 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index 13b369d2cc454..a7475d18c671c 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -224,7 +224,7 @@ static int memtrace_online(void)
ent->mem = 0;
}
  
-		if (add_memory(ent->nid, ent->start, ent->size)) {

+   if (add_memory(ent->nid, ent->start, ent->size, 0)) {
pr_err("Failed to add trace memory to node %d\n",
ent->nid);
ret += 1;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 5d545b78111f9..54a888ea7f751 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -606,7 +606,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
block_sz = memory_block_size_bytes();
  
  	/* Add the memory */

-   rc = __add_memory(lmb->nid, lmb->base_addr, block_sz);
+   rc = __add_memory(lmb->nid, lmb->base_addr, block_sz, 0);
if (rc) {
invalidate_lmb_associativity_index(lmb);
return rc;
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index e294f44a78504..d91b3584d4b2b 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -207,7 +207,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
if (node < 0)
node = memory_add_physaddr_to_nid(info->start_addr);
  
-		result = __add_memory(node, info->start_addr, info->length);

+   result = __add_memory(node, info->start_addr, info->length, 0);
  
  		/*

 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 4db3c660de831..2287bcf86480e 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -432,7 +432,7 @@ static ssize_t probe_store(struct device *dev, struct device_attribute *attr,
  
  	nid = memory_add_physaddr_to_nid(phys_addr);

ret = __add_memory(nid, phys_addr,
-  MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+  MIN_MEMORY_BLOCK_SIZE * sections_per_block, 0);
  
  	if (ret)

goto out;
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 7dcb2902e9b1b..8e66b28ef5bc6 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -95,7 +95,7 @@ int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 * this as RAM automatically.
 */
rc = add_memory_driver_managed(numa_node, range.start,
-   range_len(&range), kmem_name);
+   range_len(&range), kmem_name, 0);
  
  		res->flags |= IORESOURCE_BUSY;

if (rc) {

Re: [PATCH v1 4/5] xen/balloon: try to merge system ram resources

2020-09-02 Thread Jürgen Groß

On 21.08.20 12:34, David Hildenbrand wrote:

Let's reuse the new mechanism to merge system ram resources below the
root. We are the only one hotplugging system ram (e.g., DIMMs don't apply),
so this is safe to be used.

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Roger Pau Monné 
Cc: Julien Grall 
Cc: Pankaj Gupta 
Cc: Baoquan He 
Cc: Wei Yang 
Signed-off-by: David Hildenbrand 
---
  drivers/xen/balloon.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 37ffccda8bb87..5ec73f752b8a7 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -338,6 +338,10 @@ static enum bp_state reserve_additional_memory(void)
if (rc) {
pr_warn("Cannot add additional memory (%i)\n", rc);
goto err;
+   } else {
+   resource = NULL;
+   /* Try to reduce the number of system ram resources. */
+   merge_system_ram_resources(&iomem_resource);
}


I don't see the need for setting resource to NULL and to use an "else"
clause here.


Juergen

Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers

2020-08-11 Thread Jürgen Groß

On 11.08.20 10:12, Peter Zijlstra wrote:

On Tue, Aug 11, 2020 at 09:57:55AM +0200, Jürgen Groß wrote:

On 11.08.20 09:41, Peter Zijlstra wrote:

On Fri, Aug 07, 2020 at 05:19:03PM +0200, Marco Elver wrote:


My hypothesis here is simply that kvm_wait() may be called in a place
where we get the same case I mentioned to Peter,

raw_local_irq_save(); /* or other IRQs off without tracing */
...
kvm_wait() /* IRQ state tracing gets confused */
...
raw_local_irq_restore();

and therefore, using raw variants in kvm_wait() works. It's also safe
because it doesn't call any other libraries that would result in corrupt


Yes, this is definitely an issue.

Tracing, we also mustn't call into tracing when using raw_local_irq_*().
Because then we re-introduce this same issue all over again.

Both halt() and safe_halt() are more paravirt calls, but given we're in
a KVM paravirt call already, I suppose we can directly use native_*()
here.

Something like so then... I suppose, but then the Xen variants need TLC
too.


Just to be sure I understand you correct:

You mean that xen_qlock_kick() and xen_qlock_wait() and all functions
called by those should gain the "notrace" attribute, right?

I am not sure why the kick variants need it, though. IMO those are
called only after the lock has been released, so they should be fine
without notrace.


The issue happens when someone uses arch_spinlock_t under
raw_local_irq_*().


And again: we shouldn't forget the Hyper-V variants.


Bah, my grep failed :/ Also *groan*, that's calling apic->send_IPI().


In case you don't want to do it I can send the patch for the Xen
variants.


Juergen

Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers

2020-08-11 Thread Jürgen Groß

On 11.08.20 10:12, Peter Zijlstra wrote:

On Tue, Aug 11, 2020 at 09:57:55AM +0200, Jürgen Groß wrote:

On 11.08.20 09:41, Peter Zijlstra wrote:

On Fri, Aug 07, 2020 at 05:19:03PM +0200, Marco Elver wrote:


My hypothesis here is simply that kvm_wait() may be called in a place
where we get the same case I mentioned to Peter,

raw_local_irq_save(); /* or other IRQs off without tracing */
...
kvm_wait() /* IRQ state tracing gets confused */
...
raw_local_irq_restore();

and therefore, using raw variants in kvm_wait() works. It's also safe
because it doesn't call any other libraries that would result in corrupt


Yes, this is definitely an issue.

Tracing, we also mustn't call into tracing when using raw_local_irq_*().
Because then we re-introduce this same issue all over again.

Both halt() and safe_halt() are more paravirt calls, but given we're in
a KVM paravirt call already, I suppose we can directly use native_*()
here.

Something like so then... I suppose, but then the Xen variants need TLC
too.


Just to be sure I understand you correct:

You mean that xen_qlock_kick() and xen_qlock_wait() and all functions
called by those should gain the "notrace" attribute, right?

I am not sure why the kick variants need it, though. IMO those are
called only after the lock has been released, so they should be fine
without notrace.


The issue happens when someone uses arch_spinlock_t under
raw_local_irq_*().


Ah, okay.




And again: we shouldn't forget the Hyper-V variants.


Bah, my grep failed :/ Also *groan*, that's calling apic->send_IPI().


I've seen that, too. :-(


Juergen

Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers

2020-08-11 Thread Jürgen Groß

On 11.08.20 09:41, Peter Zijlstra wrote:

On Fri, Aug 07, 2020 at 05:19:03PM +0200, Marco Elver wrote:


My hypothesis here is simply that kvm_wait() may be called in a place
where we get the same case I mentioned to Peter,

raw_local_irq_save(); /* or other IRQs off without tracing */
...
kvm_wait() /* IRQ state tracing gets confused */
...
raw_local_irq_restore();

and therefore, using raw variants in kvm_wait() works. It's also safe
because it doesn't call any other libraries that would result in corrupt


Yes, this is definitely an issue.

Tracing, we also mustn't call into tracing when using raw_local_irq_*().
Because then we re-introduce this same issue all over again.

Both halt() and safe_halt() are more paravirt calls, but given we're in
a KVM paravirt call already, I suppose we can directly use native_*()
here.

Something like so then... I suppose, but then the Xen variants need TLC
too.


Just to be sure I understand you correct:

You mean that xen_qlock_kick() and xen_qlock_wait() and all functions
called by those should gain the "notrace" attribute, right?

I am not sure why the kick variants need it, though. IMO those are
called only after the lock has been released, so they should be fine
without notrace.

And again: we shouldn't forget the Hyper-V variants.


Juergen


Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers

2020-08-11 Thread Jürgen Groß

On 11.08.20 09:00, Marco Elver wrote:

On Fri, 7 Aug 2020 at 17:19, Marco Elver  wrote:

On Fri, Aug 07, 2020 at 02:08PM +0200, Marco Elver wrote:

On Fri, 7 Aug 2020 at 14:04, Jürgen Groß  wrote:


On 07.08.20 13:38, Marco Elver wrote:

On Fri, Aug 07, 2020 at 12:35PM +0200, Jürgen Groß wrote:

...

I think CONFIG_PARAVIRT_XXL shouldn't matter, but I'm not completely
sure about that. CONFIG_PARAVIRT_SPINLOCKS would be my primary suspect.


Yes, PARAVIRT_XXL doesn't make a difference. When disabling
PARAVIRT_SPINLOCKS, however, the warnings go away.


Thanks for testing!

I take it you are doing the tests in a KVM guest?


Yes, correct.


If so I have a gut feeling that the use of local_irq_save() and
local_irq_restore() in kvm_wait() might be fishy. I might be completely
wrong here, though.


Happy to help debug more, although I might need patches or pointers
what to play with.


BTW, I think Xen's variant of pv spinlocks is fine (no playing with IRQ
on/off).

Hyper-V seems to do the same as KVM, and kicking another vcpu could be
problematic as well, as it is just using IPI.


I experimented a bit more, and the below patch seems to solve the
warnings. However, that was based on your pointer about kvm_wait(), and
I can't quite tell if it is the right solution.

My hypothesis here is simply that kvm_wait() may be called in a place
where we get the same case I mentioned to Peter,

 raw_local_irq_save(); /* or other IRQs off without tracing */
 ...
 kvm_wait() /* IRQ state tracing gets confused */
 ...
 raw_local_irq_restore();

and therefore, using raw variants in kvm_wait() works. It's also safe
because it doesn't call any other libraries that would result in corrupt
IRQ state AFAIK.


Just to follow-up, it'd still be nice to fix this. Suggestions?

I could send the below as a patch, but can only go off my above
hypothesis and the fact that syzbot is happier, so not entirely
convincing.


Peter has told me via IRC he will look soon further into this.

Your finding suggests that the pv-lock implementation for Hyper-V
needs some tweaking, too. For that purpose I'm adding Wei to Cc.


Juergen



Thanks,
-- Marco


-- >8 --

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 233c77d056c9..1d412d1466f0 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -797,7 +797,7 @@ static void kvm_wait(u8 *ptr, u8 val)
 if (in_nmi())
 return;

-   local_irq_save(flags);
+   raw_local_irq_save(flags);

 if (READ_ONCE(*ptr) != val)
 goto out;
@@ -810,10 +810,10 @@ static void kvm_wait(u8 *ptr, u8 val)
 if (arch_irqs_disabled_flags(flags))
 halt();
 else
-   safe_halt();
+   raw_safe_halt();

  out:
-   local_irq_restore(flags);
+   raw_local_irq_restore(flags);
  }

  #ifdef CONFIG_X86_32





Re: [PATCH v3 4/7] x86/paravirt: remove 32-bit support from PARAVIRT_XXL

2020-08-10 Thread Jürgen Groß

On 10.08.20 18:53, Boris Ostrovsky wrote:

On 8/10/20 12:39 AM, Jürgen Groß wrote:

On 09.08.20 04:34, Boris Ostrovsky wrote:

On 8/7/20 4:38 AM, Juergen Gross wrote:

@@ -377,10 +373,7 @@ static inline pte_t __pte(pteval_t val)
   {
   pteval_t ret;
   -    if (sizeof(pteval_t) > sizeof(long))
-    ret = PVOP_CALLEE2(pteval_t, mmu.make_pte, val, (u64)val >> 32);
-    else
-    ret = PVOP_CALLEE1(pteval_t, mmu.make_pte, val);
+    ret = PVOP_CALLEE1(pteval_t, mmu.make_pte, val);
     return (pte_t) { .pte = ret };



Can this now simply return (pte_t) ret?


I don't think so, but I can turn it into

   return native_make_pte(PVOP_CALLEE1(...));



I thought that since now this is only built for 64-bit we don't have to
worry about different pte_t definitions and can do what we do for
example, for __pgd()?


Yes, I did that:

 return (pte_t) { ret };


Juergen


Re: [PATCH v3 4/7] x86/paravirt: remove 32-bit support from PARAVIRT_XXL

2020-08-09 Thread Jürgen Groß

On 09.08.20 04:34, Boris Ostrovsky wrote:

On 8/7/20 4:38 AM, Juergen Gross wrote:

@@ -377,10 +373,7 @@ static inline pte_t __pte(pteval_t val)
  {
pteval_t ret;
  
-	if (sizeof(pteval_t) > sizeof(long))
-   ret = PVOP_CALLEE2(pteval_t, mmu.make_pte, val, (u64)val >> 32);
-   else
-   ret = PVOP_CALLEE1(pteval_t, mmu.make_pte, val);
+   ret = PVOP_CALLEE1(pteval_t, mmu.make_pte, val);
  
  	return (pte_t) { .pte = ret };



Can this now simply return (pte_t) ret?


I don't think so, but I can turn it into

  return native_make_pte(PVOP_CALLEE1(...));


Juergen


Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers

2020-08-07 Thread Jürgen Groß

On 07.08.20 13:38, Marco Elver wrote:

On Fri, Aug 07, 2020 at 12:35PM +0200, Jürgen Groß wrote:

On 07.08.20 11:50, Marco Elver wrote:

On Fri, Aug 07, 2020 at 11:24AM +0200, Jürgen Groß wrote:

On 07.08.20 11:01, Marco Elver wrote:

On Thu, 6 Aug 2020 at 18:06, Marco Elver  wrote:

On Thu, 6 Aug 2020 at 15:17, Marco Elver  wrote:

On Thu, Aug 06, 2020 at 01:32PM +0200, pet...@infradead.org wrote:

On Thu, Aug 06, 2020 at 09:47:23AM +0200, Marco Elver wrote:

Testing my hypothesis that raw then nested non-raw
local_irq_save/restore() breaks IRQ state tracking -- see the reproducer
below. This is at least 1 case I can think of that we're bound to hit.

...


/me goes ponder things...

How's something like this then?

---
include/linux/sched.h |  3 ---
kernel/kcsan/core.c   | 62 ---
2 files changed, 44 insertions(+), 21 deletions(-)


Thank you! That approach seems to pass syzbot (also with
CONFIG_PARAVIRT) and kcsan-test tests.

I had to modify it some, so that report.c's use of the restore logic
works and doesn't mess up the IRQ trace printed on KCSAN reports (with
CONFIG_KCSAN_VERBOSE).

I still need to fully convince myself all is well now and we don't end
up with more fixes. :-) If it passes further testing, I'll send it as a
real patch (I want to add you as Co-developed-by, but would need your
Signed-off-by for the code you pasted, I think.)


I let it run on syzbot through the night, and it's fine without
PARAVIRT (see below). I have sent the patch (need your Signed-off-by
as it's based on your code, thank you!):
https://lkml.kernel.org/r/20200807090031.3506555-1-el...@google.com


With CONFIG_PARAVIRT=y (without the notrace->noinstr patch), I still
get lockdep DEBUG_LOCKS_WARN_ON(!lockdep_hardirqs_enabled()), although
it takes longer for syzbot to hit them. But I think that's expected
because we can still get the recursion that I pointed out, and will
need that patch.


Never mind, I get these warnings even if I don't turn on KCSAN
(CONFIG_KCSAN=n). Something else is going on with PARAVIRT=y that
throws off IRQ state tracking. :-/


What are the settings of CONFIG_PARAVIRT_XXL and
CONFIG_PARAVIRT_SPINLOCKS in this case?


I attached a config.

$> grep PARAVIRT .config
CONFIG_PARAVIRT=y
CONFIG_PARAVIRT_XXL=y
# CONFIG_PARAVIRT_DEBUG is not set
CONFIG_PARAVIRT_SPINLOCKS=y
# CONFIG_PARAVIRT_TIME_ACCOUNTING is not set
CONFIG_PARAVIRT_CLOCK=y


Anything special I need to do to reproduce the problem? Or would you be
willing to do some more rounds with different config settings?


I can only test it with syzkaller, but that probably doesn't help if you
don't already have it set up. It can't seem to find a C reproducer.

I did some more rounds with different configs.


I think CONFIG_PARAVIRT_XXL shouldn't matter, but I'm not completely
sure about that. CONFIG_PARAVIRT_SPINLOCKS would be my primary suspect.


Yes, PARAVIRT_XXL doesn't make a difference. When disabling
PARAVIRT_SPINLOCKS, however, the warnings go away.


Thanks for testing!

I take it you are doing the tests in a KVM guest?

If so I have a gut feeling that the use of local_irq_save() and
local_irq_restore() in kvm_wait() might be fishy. I might be completely
wrong here, though.

BTW, I think Xen's variant of pv spinlocks is fine (no playing with IRQ
on/off).

Hyper-V seems to do the same as KVM, and kicking another vcpu could be
problematic as well, as it is just using IPI.


Juergen

Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers

2020-08-07 Thread Jürgen Groß

On 07.08.20 11:50, Marco Elver wrote:

On Fri, Aug 07, 2020 at 11:24AM +0200, Jürgen Groß wrote:

On 07.08.20 11:01, Marco Elver wrote:

On Thu, 6 Aug 2020 at 18:06, Marco Elver  wrote:

On Thu, 6 Aug 2020 at 15:17, Marco Elver  wrote:

On Thu, Aug 06, 2020 at 01:32PM +0200, pet...@infradead.org wrote:

On Thu, Aug 06, 2020 at 09:47:23AM +0200, Marco Elver wrote:

Testing my hypothesis that raw then nested non-raw
local_irq_save/restore() breaks IRQ state tracking -- see the reproducer
below. This is at least 1 case I can think of that we're bound to hit.

...


/me goes ponder things...

How's something like this then?

---
   include/linux/sched.h |  3 ---
   kernel/kcsan/core.c   | 62 ---
   2 files changed, 44 insertions(+), 21 deletions(-)


Thank you! That approach seems to pass syzbot (also with
CONFIG_PARAVIRT) and kcsan-test tests.

I had to modify it some, so that report.c's use of the restore logic
works and doesn't mess up the IRQ trace printed on KCSAN reports (with
CONFIG_KCSAN_VERBOSE).

I still need to fully convince myself all is well now and we don't end
up with more fixes. :-) If it passes further testing, I'll send it as a
real patch (I want to add you as Co-developed-by, but would need your
Signed-off-by for the code you pasted, I think.)


I let it run on syzbot through the night, and it's fine without
PARAVIRT (see below). I have sent the patch (need your Signed-off-by
as it's based on your code, thank you!):
https://lkml.kernel.org/r/20200807090031.3506555-1-el...@google.com


With CONFIG_PARAVIRT=y (without the notrace->noinstr patch), I still
get lockdep DEBUG_LOCKS_WARN_ON(!lockdep_hardirqs_enabled()), although
it takes longer for syzbot to hit them. But I think that's expected
because we can still get the recursion that I pointed out, and will
need that patch.


Never mind, I get these warnings even if I don't turn on KCSAN
(CONFIG_KCSAN=n). Something else is going on with PARAVIRT=y that
throws off IRQ state tracking. :-/


What are the settings of CONFIG_PARAVIRT_XXL and
CONFIG_PARAVIRT_SPINLOCKS in this case?


I attached a config.

$> grep PARAVIRT .config
CONFIG_PARAVIRT=y
CONFIG_PARAVIRT_XXL=y
# CONFIG_PARAVIRT_DEBUG is not set
CONFIG_PARAVIRT_SPINLOCKS=y
# CONFIG_PARAVIRT_TIME_ACCOUNTING is not set
CONFIG_PARAVIRT_CLOCK=y


Anything special I need to do to reproduce the problem? Or would you be
willing to do some more rounds with different config settings?

I think CONFIG_PARAVIRT_XXL shouldn't matter, but I'm not completely
sure about that. CONFIG_PARAVIRT_SPINLOCKS would be my primary suspect.


Juergen


Re: [PATCH v3 4/7] x86/paravirt: remove 32-bit support from PARAVIRT_XXL

2020-08-07 Thread Jürgen Groß

On 07.08.20 11:39, pet...@infradead.org wrote:

On Fri, Aug 07, 2020 at 10:38:23AM +0200, Juergen Gross wrote:


-# else
-   const unsigned char cpu_iret[1];
-# endif
  };
  
  static const struct patch_xxl patch_data_xxl = {

@@ -42,7 +38,6 @@ static const struct patch_xxl patch_data_xxl = {
.irq_save_fl= { 0x9c, 0x58 },   // pushf; pop %[re]ax
.mmu_read_cr2   = { 0x0f, 0x20, 0xd0 }, // mov %cr2, %[re]ax
.mmu_read_cr3   = { 0x0f, 0x20, 0xd8 }, // mov %cr3, %[re]ax
-# ifdef CONFIG_X86_64
.mmu_write_cr3  = { 0x0f, 0x22, 0xdf }, // mov %rdi, %cr3
.irq_restore_fl = { 0x57, 0x9d },   // push %rdi; popfq
.cpu_wbinvd = { 0x0f, 0x09 },   // wbinvd
@@ -50,19 +45,11 @@ static const struct patch_xxl patch_data_xxl = {
0x48, 0x0f, 0x07 }, // swapgs; sysretq
.cpu_swapgs = { 0x0f, 0x01, 0xf8 }, // swapgs
.mov64  = { 0x48, 0x89, 0xf8 }, // mov %rdi, %rax
-# else
-   .mmu_write_cr3  = { 0x0f, 0x22, 0xd8 }, // mov %eax, %cr3
-   .irq_restore_fl = { 0x50, 0x9d },   // push %eax; popf
-   .cpu_iret   = { 0xcf }, // iret
-# endif


I was looking at x86_64 paravirt the other day and found we actually
have pv_ops.cpu.iret users there..


On x86_64 we have (without PARAVIRT_XXL):

#define INTERRUPT_RETURNjmp native_iret

and with PARAVIRT_XXL this is basically a jmp *pv_ops.cpu.iret which
will then be patched to either jmp native_iret or jmp xen_iret.

On x86_32 INTERRUPT_RETURN was just "iret" for the non-paravirt case.
This is the reason for above dropping of the static patch data.


So we want to change the above to also patch iret on x86_64 or do we
need to fix x86_64 to not have pv-iret?


We want it to stay how it is. This will let both variants (PARVIRT y/n)
continue to work.


Juergen
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers

2020-08-07 Thread Jürgen Groß

On 07.08.20 11:01, Marco Elver wrote:

On Thu, 6 Aug 2020 at 18:06, Marco Elver  wrote:

On Thu, 6 Aug 2020 at 15:17, Marco Elver  wrote:

On Thu, Aug 06, 2020 at 01:32PM +0200, pet...@infradead.org wrote:

On Thu, Aug 06, 2020 at 09:47:23AM +0200, Marco Elver wrote:

Testing my hypothesis that raw then nested non-raw
local_irq_save/restore() breaks IRQ state tracking -- see the reproducer
below. This is at least 1 case I can think of that we're bound to hit.

...


/me goes ponder things...

How's something like this then?

---
  include/linux/sched.h |  3 ---
  kernel/kcsan/core.c   | 62 ---
  2 files changed, 44 insertions(+), 21 deletions(-)


Thank you! That approach seems to pass syzbot (also with
CONFIG_PARAVIRT) and kcsan-test tests.

I had to modify it some, so that report.c's use of the restore logic
works and doesn't mess up the IRQ trace printed on KCSAN reports (with
CONFIG_KCSAN_VERBOSE).

I still need to fully convince myself all is well now and we don't end
up with more fixes. :-) If it passes further testing, I'll send it as a
real patch (I want to add you as Co-developed-by, but would need your
Signed-off-by for the code you pasted, I think.)


I let it run on syzbot through the night, and it's fine without
PARAVIRT (see below). I have sent the patch (need your Signed-off-by
as it's based on your code, thank you!):
https://lkml.kernel.org/r/20200807090031.3506555-1-el...@google.com


With CONFIG_PARAVIRT=y (without the notrace->noinstr patch), I still
get lockdep DEBUG_LOCKS_WARN_ON(!lockdep_hardirqs_enabled()), although
it takes longer for syzbot to hit them. But I think that's expected
because we can still get the recursion that I pointed out, and will
need that patch.


Never mind, I get these warnings even if I don't turn on KCSAN
(CONFIG_KCSAN=n). Something else is going on with PARAVIRT=y that
throws off IRQ state tracking. :-/


What are the settings of CONFIG_PARAVIRT_XXL and
CONFIG_PARAVIRT_SPINLOCKS in this case?


Juergen


Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers

2020-08-05 Thread Jürgen Groß

On 05.08.20 16:12, pet...@infradead.org wrote:

On Wed, Aug 05, 2020 at 03:59:40PM +0200, Marco Elver wrote:

On Wed, Aug 05, 2020 at 03:42PM +0200, pet...@infradead.org wrote:



Shouldn't we __always_inline those? They're going to be really small.


I can send a v2, and you can choose. For reference, though:

86271ee0 :
86271ee0:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
86271ee5:   48 83 3d 43 87 e4 01cmpq   $0x0,0x1e48743(%rip)# 880ba630 
86271eec:   00
86271eed:   74 0d   je 86271efc 

86271eef:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
86271ef4:   ff 14 25 30 a6 0b 88callq  *0x880ba630
86271efb:   c3  retq
86271efc:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
86271f01:   0f 0b   ud2



86271a90 :
86271a90:   53  push   %rbx
86271a91:   48 89 fbmov%rdi,%rbx
86271a94:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
86271a99:   48 83 3d 97 8b e4 01cmpq   $0x0,0x1e48b97(%rip)# 880ba638 
86271aa0:   00
86271aa1:   74 11   je 86271ab4 

86271aa3:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
86271aa8:   48 89 dfmov%rbx,%rdi
86271aab:   ff 14 25 38 a6 0b 88callq  *0x880ba638
86271ab2:   5b  pop%rbx
86271ab3:   c3  retq
86271ab4:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
86271ab9:   0f 0b   ud2



Blergh, that's abysmal. In part I suspect because you have
CONFIG_PARAVIRT_DEBUG, let me try and untangle that PV macro maze.



Probably. I have found the following in my kernel:

fff81540a5f :
81540a5f:   ff 14 25 40 a4 23 82callq  *0x8223a440
81540a66:   c3  retq


Juergen


Re: [PATCH v2 0/4] Remove 32-bit Xen PV guest support

2020-07-02 Thread Jürgen Groß

On 02.07.20 16:48, Brian Gerst wrote:

On Wed, Jul 1, 2020 at 7:07 AM Juergen Gross  wrote:


The long term plan has been to replace Xen PV guests by PVH. The first
victims of that plan are now 32-bit PV guests, as those are used only
rather seldom these days. Xen on x86 requires 64-bit support and with
Grub2 now supporting PVH officially since version 2.04 there is no
need to keep 32-bit PV guest support alive in the Linux kernel.
Additionally, Meltdown mitigation is not available in a kernel running
as a 32-bit PV guest, so dropping this mode makes sense from a security
point of view, too.


One thing that you missed is removing VDSO_NOTE_NONEGSEG_BIT from
vdso32/note.S.  With that removed there is no difference from the
64-bit version.


Oh, this means we can probably remove arch/x86/xen/vdso.h completely.



Otherwise this series looks good to me.


Thanks,


Juergen


Re: [PATCH] MAINTAINERS: Update PARAVIRT_OPS_INTERFACE and VMWARE_HYPERVISOR_INTERFACE

2020-06-07 Thread Jürgen Groß

On 17.04.20 01:45, Deep Shah wrote:

Thomas Hellstrom will be handing over VMware's maintainership of these
interfaces to Deep Shah.

Signed-off-by: Deep Shah 
Acked-by: Thomas Hellstrom 

Pushed to xen/tip.git for-linus-5.8


Juergen


Re: [PATCH] MAINTAINERS: Update PARAVIRT_OPS_INTERFACE and VMWARE_HYPERVISOR_INTERFACE

2020-04-17 Thread Jürgen Groß

On 17.04.20 08:22, Jürgen Groß wrote:

On 17.04.20 01:45, Deep Shah wrote:

Thomas Hellstrom will be handing over VMware's maintainership of these
interfaces to Deep Shah.

Signed-off-by: Deep Shah 
Acked-by: Thomas Hellstrom 


Acked-by: Juergen Gross 


BTW, I can carry this patch through the Xen tree if nobody else wants to
take it...


Juergen

Re: [PATCH] MAINTAINERS: Update PARAVIRT_OPS_INTERFACE and VMWARE_HYPERVISOR_INTERFACE

2020-04-17 Thread Jürgen Groß

On 17.04.20 01:45, Deep Shah wrote:

Thomas Hellstrom will be handing over VMware's maintainership of these
interfaces to Deep Shah.

Signed-off-by: Deep Shah 
Acked-by: Thomas Hellstrom 


Acked-by: Juergen Gross 


Juergen


Re: [RFC PATCH 00/26] Runtime paravirt patching

2020-04-08 Thread Jürgen Groß

On 08.04.20 14:08, Peter Zijlstra wrote:

On Tue, Apr 07, 2020 at 10:02:57PM -0700, Ankur Arora wrote:

Mechanism: the patching itself is done using stop_machine(). That is
not ideal -- text_poke_stop_machine() was replaced with INT3+emulation
via text_poke_bp(), but I'm using this to address two issues:
  1) emulation in text_poke() can only easily handle a small set
  of instructions and this is problematic for inlined pv-ops (and see
  a possible alternatives use-case below.)
  2) paravirt patching might have inter-dependent ops (ex.
  lock.queued_lock_slowpath, lock.queued_lock_unlock are paired and
  need to be updated atomically.)


And then you hope that the spinlock state transfers. That is, that both
implementations agree what an unlocked spinlock looks like.

Suppose the native one was a ticket spinlock, where unlocked means 'head
== tail' while the paravirt one is a test-and-set spinlock, where
unlocked means 'val == 0'.

That just happens to not be the case now, but it was for a fair while.


Sure? This would mean that before the spinlock pvops are set, no lock
is allowed to be used in the kernel, because that would block the
boot-time transition to the new lock variant.

Another problem I'm seeing is that runtime pvops patching would rely on
the fact that stop_machine() isn't guarded by a spinlock.


Juergen


Re: [RFC PATCH 00/26] Runtime paravirt patching

2020-04-08 Thread Jürgen Groß

On 08.04.20 07:02, Ankur Arora wrote:

A KVM host (or another hypervisor) might advertise paravirtualized
features and optimization hints (ex KVM_HINTS_REALTIME) which might
become stale over the lifetime of the guest. For instance, the


Then this hint is wrong if it can't be guaranteed.


host might go from being undersubscribed to being oversubscribed
(or the other way round), and it would make sense for the guest
to switch pv-ops based on that.


I think using pvops for such a feature change is just wrong.

What comes next? Using pvops for being able to migrate a guest from an
Intel to an AMD machine?

...


There are four main sets of patches in this series:

  1. PV-ops management (patches 1-10, 20): mostly infrastructure and
  refactoring pieces to make paravirt patching usable at runtime. For the
  most part scoped under CONFIG_PARAVIRT_RUNTIME.

  Patches 1-7, to persist part of parainstructions in memory:
   "x86/paravirt: Specify subsection in PVOP macros"
   "x86/paravirt: Allow paravirt patching post-init"
   "x86/paravirt: PVRTOP macros for PARAVIRT_RUNTIME"
   "x86/alternatives: Refactor alternatives_smp_module*
   "x86/alternatives: Rename alternatives_smp*, smp_alt_module
   "x86/alternatives: Remove stale symbols
   "x86/paravirt: Persist .parainstructions.runtime"

  Patches 8-10, develop the interfaces to safely switch pv-ops:
   "x86/paravirt: Stash native pv-ops"
   "x86/paravirt: Add runtime_patch()"
   "x86/paravirt: Add primitives to stage pv-ops"

  Patch 20 enables switching of pv_lock_ops:
   "x86/paravirt: Enable pv-spinlocks in runtime_patch()"

  2. Non-emulated text poking (patches 11-19)

  Patches 11-13 are mostly refactoring to split __text_poke() into map,
  unmap and poke/memcpy phases with the poke portion being re-entrant
   "x86/alternatives: Remove return value of text_poke*()"
   "x86/alternatives: Use __get_unlocked_pte() in text_poke()"
   "x86/alternatives: Split __text_poke()"

  Patches 15, 17 add the actual poking state-machine:
   "x86/alternatives: Non-emulated text poking"
   "x86/alternatives: Add patching logic in text_poke_site()"

  with patches 14 and 18 containing the pieces for BP handling:
   "x86/alternatives: Handle native insns in text_poke_loc*()"
   "x86/alternatives: Handle BP in non-emulated text poking"

  and patch 19 provides the ability to use the state-machine above in an
  NMI context (fixes some potential deadlocks when handling
  inter-dependent operations and multiple NMIs):
   "x86/alternatives: NMI safe runtime patching".

  Patch 16 provides the interface (paravirt_runtime_patch()) to use the
  poking mechanism developed above and patch 21 adds a selftest:
   "x86/alternatives: Add paravirt patching at runtime"
   "x86/alternatives: Paravirt runtime selftest"

  3. KVM guest changes to be able to use this (patches 22-23,25-26):
   "kvm/paravirt: Encapsulate KVM pv switching logic"
   "x86/kvm: Add worker to trigger runtime patching"
   "x86/kvm: Guest support for dynamic hints"
   "x86/kvm: Add hint change notifier for KVM_HINT_REALTIME".

  4. KVM host changes to notify the guest of a change (patch 24):
   "x86/kvm: Support dynamic CPUID hints"

Testing:
With paravirt patching, the code is mostly stable on Intel and AMD
systems under kernbench and locktorture with paravirt toggling (with,
without synthetic NMIs) in the background.

Queued spinlock performance for locktorture is also on expected lines:
  [ 1533.221563] Writes:  Total: 1048759000  Max/Min: 0/0   Fail: 0
  # toggle PV spinlocks

  [ 1594.713699] Writes:  Total: 660545  Max/Min: 0/0   Fail: 0
  # PV spinlocks (in ~60 seconds) = 62,901,545

  # toggle native spinlocks
  [ 1656.117175] Writes:  Total: 111340  Max/Min: 0/0   Fail: 0
   # native spinlocks (in ~60 seconds) = 2,228,295

The alternatives testing is more limited with it being used to rewrite
mostly harmless X86_FEATUREs with load in the background.

Patches also at:

ssh://g...@github.com/terminus/linux.git alternatives-rfc-upstream-v1

Please review.

Thanks
Ankur

[1] The precise change in memory footprint depends on config options
but the following example inlines queued_spin_unlock() (which forms
the bulk of the added state). The added footprint is the size of the
.parainstructions.runtime section:

  $ objdump -h vmlinux|grep .parainstructions
  Idx Name                       Size      VMA       LMA       File-off  Algn
   27 .parainstructions          0001013c  82895000  02895000  01c95000  2**3
   28 .parainstructions.runtime  cd2c      828a5140  028a5140  01ca5140  2**3

   $ size vmlinux
   text       data       bss        dec        hex      filename
   13726196   12302814   14094336   40123346   2643bd2  vmlinux

Ankur Arora (26):
   x86/paravirt: Specify subsection in PVOP macros
   x86/paravirt: Allow paravirt patching post-init
   x86/paravirt: PVRTOP macros for PARAVIRT_RUNTIME
   x86/alternatives: Refactor 

Re: VIRTIO adoption in other hypervisors

2020-02-28 Thread Jürgen Groß

On 28.02.20 17:47, Alex Bennée wrote:


Jan Kiszka  writes:


On 28.02.20 11:30, Jan Kiszka wrote:

On 28.02.20 11:16, Alex Bennée wrote:

Hi,




I believe there has been some development work for supporting VIRTIO on
Xen although it seems to have stalled according to:

https://wiki.xenproject.org/wiki/Virtio_On_Xen

Recently at KVM Forum there was Jan's talk about Inter-VM shared memory
which proposed ivshmemv2 as a VIRTIO transport:

https://events19.linuxfoundation.org/events/kvm-forum-2019/program/schedule/


As I understood it this would allow Xen (and other hypervisors) a simple
way to be able to carry virtio traffic between guest and end point.


And to clarify the scope of this effort: virtio-over-ivshmem is not
the fastest option to offer virtio to a guest (static "DMA" window),
but it is the simplest one from the hypervisor PoV and, thus, also
likely the easiest one to argue over when it comes to security and
safety.


So to drill down on this: is this a particular problem with type-1
hypervisors?

It seems to me any KVM-like run loop trivially supports a range of
virtio devices by virtue of trapping accesses to the signalling area of
a virtqueue and allowing the VMM to handle the transaction which ever
way it sees fit.

I've not quite understood the way Xen interfaces to QEMU, aside from the
fact that it's different from everything else. Moreover, it seems the
type-1 hypervisors are more interested in providing better isolation
between segments of a system, whereas VIRTIO currently assumes either
the VMM or the hypervisor has full access to the guest address space.
I've seen quite a lot of
slides that want to isolate sections of device emulation to separate
processes or even separate guest VMs.


In Xen device emulation is done by other VMs. Normally the devices are
emulated via dom0, but it is possible to have other driver domains, too
(those need to get passed through the related PCI devices, of course).

PV device backends get access only to the guest pages the PV frontends
allow. This is done via so called "grants", which are per guest. So a
frontend can grant another Xen VM access to dedicated pages. The backend
is using the grants to map those pages via the hypervisor in order to
perform the I/O. After finishing the I/O the I/O-pages are unmapped by
the backend again.

For legacy device emulation via qemu the guest running qemu needs to get
access to all the guests memory, as the guest won't grant any pages to
the emulating VM. It is possible to let qemu run in a small stub guest
using PV devices in order to isolate the legacy guest from e.g. dom0.


Hope that makes it clearer,


Juergen


Re: [PATCH] x86/ioperm: add new paravirt function update_io_bitmap

2020-02-28 Thread Jürgen Groß

Friendly ping...

On 18.02.20 16:47, Juergen Gross wrote:

Commit 111e7b15cf10f6 ("x86/ioperm: Extend IOPL config to control
ioperm() as well") reworked the iopl syscall to use I/O bitmaps.

Unfortunately this broke Xen PV domains using that syscall as there
is currently no I/O bitmap support in PV domains.

Add I/O bitmap support via a new paravirt function update_io_bitmap
which Xen PV domains can use to update their I/O bitmaps via a
hypercall.

Fixes: 111e7b15cf10f6 ("x86/ioperm: Extend IOPL config to control ioperm() as well")
Reported-by: Jan Beulich 
Cc:  # 5.5
Signed-off-by: Juergen Gross 
Reviewed-by: Jan Beulich 
Tested-by: Jan Beulich 
---
  arch/x86/include/asm/io_bitmap.h  |  9 -
  arch/x86/include/asm/paravirt.h   |  7 +++
  arch/x86/include/asm/paravirt_types.h |  4 
  arch/x86/kernel/paravirt.c|  5 +
  arch/x86/kernel/process.c |  2 +-
  arch/x86/xen/enlighten_pv.c   | 25 +
  6 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/io_bitmap.h b/arch/x86/include/asm/io_bitmap.h
index 02c6ef8f7667..07344d82e88e 100644
--- a/arch/x86/include/asm/io_bitmap.h
+++ b/arch/x86/include/asm/io_bitmap.h
@@ -19,7 +19,14 @@ struct task_struct;
  void io_bitmap_share(struct task_struct *tsk);
  void io_bitmap_exit(void);
  
-void tss_update_io_bitmap(void);

+void native_tss_update_io_bitmap(void);
+
+#ifdef CONFIG_PARAVIRT_XXL
+#include 
+#else
+#define tss_update_io_bitmap native_tss_update_io_bitmap
+#endif
+
  #else
  static inline void io_bitmap_share(struct task_struct *tsk) { }
  static inline void io_bitmap_exit(void) { }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 86e7317eb31f..694d8daf4983 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -295,6 +295,13 @@ static inline void write_idt_entry(gate_desc *dt, int entry, const gate_desc *g)
PVOP_VCALL3(cpu.write_idt_entry, dt, entry, g);
  }
  
+#ifdef CONFIG_X86_IOPL_IOPERM

+static inline void tss_update_io_bitmap(void)
+{
+   PVOP_VCALL0(cpu.update_io_bitmap);
+}
+#endif
+
  static inline void paravirt_activate_mm(struct mm_struct *prev,
struct mm_struct *next)
  {
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 84812964d3dd..732f62e04ddb 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -140,6 +140,10 @@ struct pv_cpu_ops {
  
  	void (*load_sp0)(unsigned long sp0);
  
+#ifdef CONFIG_X86_IOPL_IOPERM

+   void (*update_io_bitmap)(void);
+#endif
+
void (*wbinvd)(void);
  
  	/* cpuid emulation, mostly so that caps bits can be disabled */

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 789f5e4f89de..c131ba4e70ef 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -30,6 +30,7 @@
  #include 
  #include 
  #include 
+#include 
  
  /*

   * nop stub, which must not clobber anything *including the stack* to
@@ -341,6 +342,10 @@ struct paravirt_patch_template pv_ops = {
.cpu.iret   = native_iret,
.cpu.swapgs = native_swapgs,
  
+#ifdef CONFIG_X86_IOPL_IOPERM

+   .cpu.update_io_bitmap   = native_tss_update_io_bitmap,
+#endif
+
.cpu.start_context_switch   = paravirt_nop,
.cpu.end_context_switch = paravirt_nop,
  
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c

index 839b5244e3b7..3053c85e0e42 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -374,7 +374,7 @@ static void tss_copy_io_bitmap(struct tss_struct *tss, struct io_bitmap *iobm)
  /**
   * tss_update_io_bitmap - Update I/O bitmap before exiting to usermode
   */
-void tss_update_io_bitmap(void)
+void native_tss_update_io_bitmap(void)
  {
	struct tss_struct *tss = this_cpu_ptr(&cpu_tss_rw);
	struct thread_struct *t = &current->thread;
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 1f756e8b..feaf2e68ee5c 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -72,6 +72,9 @@
  #include 
  #include 
  #include 
+#ifdef CONFIG_X86_IOPL_IOPERM
+#include 
+#endif
  
  #ifdef CONFIG_ACPI

  #include 
@@ -837,6 +840,25 @@ static void xen_load_sp0(unsigned long sp0)
this_cpu_write(cpu_tss_rw.x86_tss.sp0, sp0);
  }
  
+#ifdef CONFIG_X86_IOPL_IOPERM

+static void xen_update_io_bitmap(void)
+{
+   struct physdev_set_iobitmap iobitmap;
+   struct tss_struct *tss = this_cpu_ptr(&cpu_tss_rw);
+
+   native_tss_update_io_bitmap();
+
+   iobitmap.bitmap = (uint8_t *)(&tss->x86_tss) +
+ tss->x86_tss.io_bitmap_base;
+   if (tss->x86_tss.io_bitmap_base == IO_BITMAP_OFFSET_INVALID)
+   iobitmap.nr_ports = 0;
+   else
+   iobitmap.nr_ports = IO_BITMAP_BITS;
+
+   

Re: [PATCH 23/62] x86/idt: Move IDT to data segment

2020-02-19 Thread Jürgen Groß

On 19.02.20 11:42, Joerg Roedel wrote:

Hi Jürgen,

On Wed, Feb 12, 2020 at 05:28:21PM +0100, Jürgen Groß wrote:

Xen-PV is clearing BSS as the very first action.


In the kernel image? Or in the ELF loader before jumping to the kernel
image?


In the kernel image.

See arch/x86/xen/xen-head.S - startup_xen is the entry point of the
kernel.


Juergen

Re: [PATCH] x86/ioperm: add new paravirt function update_io_bitmap

2020-02-19 Thread Jürgen Groß

On 19.02.20 10:22, Thomas Gleixner wrote:

Jürgen Groß  writes:

On 18.02.20 22:03, Thomas Gleixner wrote:

BTW, why isn't stuff like this not catched during next or at least
before the final release? Is nothing running CI on upstream with all
that XEN muck active?


This problem showed up by not being able to start the X server (probably
not the freshest one) in dom0 on a moderate aged AMD system.

Our CI tests tend to be more text-console based for dom0.


tools/testing/selftests/x86/io[perm|pl] should have caught that as well,
right? If not, we need to fix the selftests.


Hmm, yes. Thanks for the pointer.

Will ask our testing specialist what is done in this regard and how it
can be enhanced.


Juergen

Re: [PATCH] x86/ioperm: add new paravirt function update_io_bitmap

2020-02-18 Thread Jürgen Groß

On 18.02.20 22:03, Thomas Gleixner wrote:

Juergen Gross  writes:

Commit 111e7b15cf10f6 ("x86/ioperm: Extend IOPL config to control
ioperm() as well") reworked the iopl syscall to use I/O bitmaps.

Unfortunately this broke Xen PV domains using that syscall as there
is currently no I/O bitmap support in PV domains.

Add I/O bitmap support via a new paravirt function update_io_bitmap
which Xen PV domains can use to update their I/O bitmaps via a
hypercall.

Fixes: 111e7b15cf10f6 ("x86/ioperm: Extend IOPL config to control ioperm() as well")
Reported-by: Jan Beulich 
Cc:  # 5.5
Signed-off-by: Juergen Gross 
Reviewed-by: Jan Beulich 
Tested-by: Jan Beulich 


Duh, sorry about that and thanks for fixing it.

BTW, why isn't stuff like this not catched during next or at least
before the final release? Is nothing running CI on upstream with all
that XEN muck active?


This problem showed up by not being able to start the X server (probably
not the freshest one) in dom0 on a moderate aged AMD system.

Our CI tests tend to be more text-console based for dom0.


Juergen


Re: [PATCH 23/62] x86/idt: Move IDT to data segment

2020-02-12 Thread Jürgen Groß

On 12.02.20 17:23, Andy Lutomirski wrote:




On Feb 12, 2020, at 3:55 AM, Joerg Roedel  wrote:

On Tue, Feb 11, 2020 at 02:41:25PM -0800, Andy Lutomirski wrote:

On Tue, Feb 11, 2020 at 5:53 AM Joerg Roedel  wrote:

From: Joerg Roedel 

With SEV-ES, exception handling is needed very early, even before the
kernel has cleared the bss segment. In order to prevent clearing the
currently used IDT, move the IDT to the data segment.


Ugh.  At the very least this needs a comment in the code.


Yes, right, added a comment for that.


I had a patch to fix the kernel ELF loader to clear BSS, which would
fix this problem once and for all, but it didn't work due to the messy
way that the decompressor handles memory.  I never got around to
fixing this, sadly.


Aren't there other ways of booting (Xen-PV?) which don't use the kernel
ELF loader?


Dunno. I would hope that any sane loader would clear BSS before executing
anything. This isn't currently the case, though. Oh well.


Xen-PV is clearing BSS as the very first action.


Juergen