Re: [RFC PATCH 15/26] x86/alternatives: Non-emulated text poking

2020-04-08 Thread Peter Zijlstra
On Tue, Apr 07, 2020 at 10:03:12PM -0700, Ankur Arora wrote:
> +static int __maybe_unused text_poke_late(patch_worker_t worker, void *stage)
> +{
> +	int ret;
> +
> +	lockdep_assert_held(&text_mutex);
> +
> +	if (system_state != SYSTEM_RUNNING)
> +		return -EINVAL;
> +
> +	text_poke_state.stage = stage;
> +	text_poke_state.num_acks = cpumask_weight(cpu_online_mask);
> +	text_poke_state.head = &alt_modules;
> +
> +	text_poke_state.patch_worker = worker;
> +	text_poke_state.state = PATCH_SYNC_DONE; /* Start state */
> +	text_poke_state.primary_cpu = smp_processor_id();
> +
> +	/*
> +	 * Run the worker on all online CPUs. Don't need to do anything
> +	 * for offline CPUs as they come back online with a clean cache.
> +	 */
> +	ret = stop_machine(patch_worker, &text_poke_state, cpu_online_mask);

This.. that on its own is almost a reason to NAK the entire thing. We're
all working very hard to get rid of stop_machine() and you're adding
one.

Worse, stop_machine() is notoriously crap on over-committed virt, the
exact scenario where you want it.

> +
> +	return ret;
> +}



Re: [RFC PATCH 15/26] x86/alternatives: Non-emulated text poking

2020-04-08 Thread Peter Zijlstra
On Tue, Apr 07, 2020 at 10:03:12PM -0700, Ankur Arora wrote:
> +static void __maybe_unused sync_one(void)
> +{
> +	/*
> +	 * We might be executing in NMI context, and so cannot use
> +	 * IRET as a synchronizing instruction.
> +	 *
> +	 * We could use native_write_cr2() but that is not guaranteed
> +	 * to work on Xen-PV -- it is emulated by Xen and might not
> +	 * execute an iret (or similar synchronizing instruction)
> +	 * internally.
> +	 *
> +	 * cpuid() would trap as well. Unclear if that's a solution
> +	 * either.
> +	 */
> +	if (in_nmi())
> +		cpuid_eax(1);
> +	else
> +		sync_core();
> +}

That's not thinking straight; what do you think the INT3 does when it
happens inside an NMI?



[RFC PATCH 15/26] x86/alternatives: Non-emulated text poking

2020-04-07 Thread Ankur Arora
Patching at runtime needs to handle interdependent pv-ops: as an example,
lock.queued_spin_lock_slowpath(), lock.queued_spin_unlock() and the other
pv_lock_ops are paired and so need to be updated atomically. This is
difficult with emulation because non-patching CPUs could be executing in
critical sections.
(We could apply INT3 everywhere first and then use RCU to force a
barrier but given that spinlocks are everywhere, it still might mean a
lot of time in emulation.)

Second, locking operations can be called from interrupt handlers which
means we cannot trivially use IPIs to introduce a pipeline sync step on
non-patching CPUs.

Third, some pv-ops can be inlined and so we would need to emulate a
broader set of operations than CALL, JMP, NOP*.
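
As an example of such inlining (illustrative; the array name below is
hypothetical, the encoding follows the native lock.queued_spin_unlock
patch template):

	/* The native template that gets inline-patched over the CALL site: */
	static const unsigned char native_unlock_insn[] = {
		0xc6, 0x07, 0x00,	/* movb $0, (%rdi) */
	};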

Introduce the core state-machine with the actual poking and pipeline
sync stubbed out. This executes via stop_machine() with the primary CPU
carrying out a text_poke_bp() style three-staged algorithm.

The control flow diagram below shows CPU0 as the primary which does the
patching, while the rest of the CPUs (CPUx) execute the sync loop in
text_poke_sync_finish().

 CPU0                                CPUx
 ----                                ----

 patch_worker()                      patch_worker()

   /* Traversal, insn-gen */           text_poke_sync_finish()
   tps.patch_worker()                    /*
                                          * wait until:
     /* for each patch-site */            *  tps->state == PATCH_DONE
     text_poke_site()                     */
       poke_sync()

       ...                               ...

   smp_store_release(&tps->state, PATCH_DONE)
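
For concreteness, a sketch of the fleshed-out per-site step on the
primary (illustrative only: the real text_poke_site() body arrives in a
later patch, and poke_sync()'s signature here is assumed):

static void text_poke_site(struct text_poke_state *tps,
			   void *addr, const u8 *insn, size_t len)
{
	static const u8 int3 = 0xcc;

	text_poke(addr, &int3, 1);		/* 1) first byte -> INT3 */
	poke_sync(tps, PATCH_SYNC_0);		/* all CPUs must observe it */

	if (len > 1) {
		text_poke(addr + 1, insn + 1, len - 1);	/* 2) write the suffix */
		poke_sync(tps, PATCH_SYNC_1);
	}

	text_poke(addr, insn, 1);		/* 3) restore the first byte */
	poke_sync(tps, PATCH_SYNC_2);		/* site fully patched */
}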

Later commits in the series flesh out the rest of the code.

Signed-off-by: Ankur Arora 
---
sync_one() uses the following for pipeline synchronization:

+   if (in_nmi())
+   cpuid_eax(1);
+   else
+   sync_core();

The if (in_nmi()) clause is meant to be executed from NMI contexts.
Reading through past LKML discussions, cpuid_eax() is probably a
bad choice -- at least insofar as Xen PV is concerned. What
would be a good primitive to use instead?

Also, given that we do handle the nested NMI case, does it make sense
to just use native_iret() (via sync_core()) in NMI contexts as well?
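
For reference, sync_core() on x86_64 is an IRET-to-self; a simplified
sketch of the sequence in arch/x86/include/asm/processor.h (unwind
hints and the explicit stack-pointer constraint dropped):

static inline void sync_core(void)
{
	unsigned int tmp;

	/*
	 * Build an IRET frame that returns to the next instruction and
	 * execute IRETQ: IRET is architecturally serializing even when
	 * it does not change privilege level, so the pipeline re-fetches
	 * the (possibly just-patched) instruction stream.
	 */
	asm volatile (
		"mov %%ss, %0\n\t"
		"pushq %q0\n\t"			/* SS */
		"pushq %%rsp\n\t"		/* RSP */
		"addq $8, (%%rsp)\n\t"		/* fix saved RSP */
		"pushfq\n\t"			/* RFLAGS */
		"mov %%cs, %0\n\t"
		"pushq %q0\n\t"			/* CS */
		"pushq $1f\n\t"			/* RIP */
		"iretq\n\t"
		"1:"
		: "=&r" (tmp) : : "cc", "memory");
}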

---
 arch/x86/kernel/alternative.c | 247 ++
 1 file changed, 247 insertions(+)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 004fe86f463f..452d4081eded 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -979,6 +979,26 @@ void text_poke_sync(void)
 	on_each_cpu(do_sync_core, NULL, 1);
 }
 
+static void __maybe_unused sync_one(void)
+{
+	/*
+	 * We might be executing in NMI context, and so cannot use
+	 * IRET as a synchronizing instruction.
+	 *
+	 * We could use native_write_cr2() but that is not guaranteed
+	 * to work on Xen-PV -- it is emulated by Xen and might not
+	 * execute an iret (or similar synchronizing instruction)
+	 * internally.
+	 *
+	 * cpuid() would trap as well. Unclear if that's a solution
+	 * either.
+	 */
+	if (in_nmi())
+		cpuid_eax(1);
+	else
+		sync_core();
+}
+
 struct text_poke_loc {
 	s32 rel_addr; /* addr := _stext + rel_addr */
 	union {
@@ -1351,6 +1371,233 @@ void __ref text_poke_bp(void *addr, const void *opcode, size_t len, const void *
text_poke_bp_batch(&tp, 1);
 }
 
+struct text_poke_state;
+typedef void (*patch_worker_t)(struct text_poke_state *tps);
+
+/*
+ *                  +------------possible-BP------------+
+ *                  |                                   |
+ *  +--write-INT3--+|  +---suffix---+  +-insn-prefix-+  |
+ *  |              v|  |            v  |             v  v
+ * PATCH_SYNC_0    PATCH_SYNC_1     PATCH_SYNC_2    *PATCH_SYNC_DONE*
+ *  \                                                  |`---> PATCH_DONE
+ *   `-----<----------<---------<----------<-----------+
+ *
+ * We start in state PATCH_SYNC_DONE and loop through the PATCH_SYNC_*
+ * states to end at PATCH_DONE. The primary drives the transitions in
+ * text_poke_site(), with patch_worker() making the final transition
+ * to PATCH_DONE. All transitions but those in the last iteration need
+ * to be globally observed.
+ *
+ * On secondary CPUs, text_poke_sync_finish() spins in a cpu_relax()
+ * loop until the state moves to PATCH_SYNC_0, at which point it
+ * observes each transition until PATCH_SYNC_DONE. Eventually the
+ * primary moves to PATCH_DONE and the secondary CPUs finish.
+ */
+enum patch_state {
+	/*
+	 * Add an artificial state so that we can do a bitwise operation
+	 * over all the PATCH_SYNC_* states.
+	 */
+	PATCH_SYNC_x = 4,
+	PATCH_SYNC_0 = PATCH_SYNC_x | 0,	/* Serialize INT3 */
+	PATCH_SYNC_1 = PATCH_SYNC_x | 1,	/*