Patching at runtime needs to handle interdependent pv-ops: as an example,
lock.queued_spin_lock_slowpath(), lock.queued_spin_unlock() and the other
pv_lock_ops are paired and so need to be updated atomically. This is
difficult to do via BP emulation because non-patching CPUs could be
executing in critical sections while the ops are mid-switch.
(We could apply INT3 everywhere first and then use RCU to force a
barrier, but given that spinlocks are used everywhere, that could still
mean a lot of time spent in emulation.)
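To make the pairing concrete, here is a simplified model (hypothetical
names, not kernel code; the real ops live in pv_ops.lock):

	/*
	 * Simplified model, NOT kernel code: lock and unlock
	 * implementations come in matched pairs. If the two ops were
	 * switched one at a time, a waiter queued by the old slowpath
	 * could be missed by the new, mismatched unlock.
	 */
	struct pv_lock_ops_model {
		void (*lock_slowpath)(int *lock); /* e.g. queued_spin_lock_slowpath */
		void (*unlock)(int *lock);        /* e.g. queued_spin_unlock */
	};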
Second, locking operations can be called from interrupt handlers, which
means we cannot trivially use IPIs to introduce a pipeline sync step on
non-patching CPUs.
Third, some pv-ops can be inlined, and so we would need to emulate a
broader set of operations than just CALL, JMP and NOP*.
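As an illustration of the last point (cf. arch/x86/kernel/paravirt_patch.c),
the native versions of some ops are patched in as raw instruction bytes
rather than as calls:

	/*
	 * Illustration: native irq_disable/irq_enable are patched
	 * inline as single-byte insns, so a patch site may contain
	 * arbitrary native code rather than a CALL/JMP/NOP.
	 */
	static const unsigned char native_irq_disable[] = { 0xfa }; /* cli */
	static const unsigned char native_irq_enable[]  = { 0xfb }; /* sti */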
Introduce the core state-machine with the actual poking and pipeline
sync stubbed out. This executes via stop_machine(), with the primary CPU
carrying out a text_poke_bp() style three-stage algorithm.
The control flow diagram below shows CPU0 as the primary, which does the
patching, while the rest of the CPUs (CPUx) execute the sync loop in
text_poke_sync_finish().
CPU0                                    CPUx

patch_worker()                          patch_worker()

  /* Traversal, insn-gen */               text_poke_sync_finish()
  tps.patch_worker()                        /*
                                             * wait until:
    /* for each patch-site */                *   tps->state == PATCH_DONE
    text_poke_site()                         */
      poke_sync()

        ...                                  ...

smp_store_release(&tps->state, PATCH_DONE)
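Roughly, the primary's per-site three-stage flow looks like the sketch
below. This is illustrative only: poke_sync() and its signature are
placeholders, and the PATCH_SYNC_* states are the ones defined in the
hunk further down; the real text_poke_site() follows in later commits.

	static void text_poke_site_sketch(struct text_poke_state *tps,
					  u8 *addr, const u8 *insn, size_t len)
	{
		u8 int3 = INT3_INSN_OPCODE;

		/* 1) write INT3 at the insn start; all CPUs serialize */
		text_poke(addr, &int3, INT3_INSN_SIZE);
		poke_sync(tps, PATCH_SYNC_0);

		/* 2) write out the insn suffix */
		text_poke(addr + INT3_INSN_SIZE, insn + INT3_INSN_SIZE,
			  len - INT3_INSN_SIZE);
		poke_sync(tps, PATCH_SYNC_1);

		/* 3) restore the first byte; a BP is no longer possible */
		text_poke(addr, insn, INT3_INSN_SIZE);
		poke_sync(tps, PATCH_SYNC_2);
	}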
Subsequent commits flesh out the rest of the code.
Signed-off-by: Ankur Arora
---
sync_one() uses the following for pipeline synchronization:
+	if (in_nmi())
+		cpuid_eax(1);
+	else
+		sync_core();
The if (in_nmi()) clause is meant to be executed from NMI contexts.
Reading through past LKML discussions, cpuid_eax() is probably a
bad choice -- at least insofar as Xen PV is concerned. What
would be a good primitive to use instead?
Also, given that we do handle the nested-NMI case, does it make sense
to just use native_iret() (via sync_core()) in NMI contexts as well?
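For context, the sync loop on the secondaries -- the main caller of
sync_one() -- would look roughly like the sketch below. Assumptions: a
tps->state field of type enum patch_state; per-transition acking back
to the primary is elided.

	static void text_poke_sync_finish_sketch(struct text_poke_state *tps)
	{
		enum patch_state prev = PATCH_SYNC_DONE, state;

		/*
		 * Spin until the primary signals PATCH_DONE, serializing
		 * the pipeline (via sync_one()) on each PATCH_SYNC_*
		 * transition we observe.
		 */
		while ((state = smp_load_acquire(&tps->state)) != PATCH_DONE) {
			if (state != prev) {
				sync_one();
				prev = state;
			}
			cpu_relax();
		}
	}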
---
 arch/x86/kernel/alternative.c | 247 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 247 insertions(+)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 004fe86f463f..452d4081eded 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -979,6 +979,26 @@ void text_poke_sync(void)
 	on_each_cpu(do_sync_core, NULL, 1);
 }
 
+static void __maybe_unused sync_one(void)
+{
+	/*
+	 * We might be executing in NMI context, and so cannot use
+	 * IRET as a synchronizing instruction.
+	 *
+	 * We could use native_write_cr2() but that is not guaranteed
+	 * to work on Xen-PV -- it is emulated by Xen and might not
+	 * execute an iret (or similar synchronizing instruction)
+	 * internally.
+	 *
+	 * cpuid() would trap as well. Unclear if that's a solution
+	 * either.
+	 */
+	if (in_nmi())
+		cpuid_eax(1);
+	else
+		sync_core();
+}
+
 struct text_poke_loc {
 	s32 rel_addr; /* addr := _stext + rel_addr */
 	union {
@@ -1351,6 +1371,233 @@ void __ref text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
 	text_poke_bp_batch(&tp, 1);
 }
 
+struct text_poke_state;
+typedef void (*patch_worker_t)(struct text_poke_state *tps);
+
+/*
+ *                   +----------possible-BP----------+
+ *                   |                               |
+ *   +--write-INT3--+  +--suffix--+   +-insn-prefix-+
+ *  /               v /           v  /              v
+ * PATCH_SYNC_0    PATCH_SYNC_1    PATCH_SYNC_2    *PATCH_SYNC_DONE*
+ *  ^                                                 |     \
+ *  `-----<-----<-----<-----<-----<-----<-----<------'      `--> PATCH_DONE
+ *
+ * We start in state PATCH_SYNC_DONE and loop through the PATCH_SYNC_*
+ * states, ending at PATCH_DONE. The primary drives the transitions in
+ * text_poke_site(), with patch_worker() making the final transition
+ * to PATCH_DONE. Every transition except that final one needs to be
+ * globally observed.
+ *
+ * On secondary CPUs, text_poke_sync_finish() spins in a cpu_relax()
+ * loop until it observes a transition to PATCH_SYNC_0, at which point
+ * it starts observing transitions until PATCH_SYNC_DONE. Eventually
+ * the primary moves to PATCH_DONE and the secondary CPUs finish.
+ */
+enum patch_state {
+	/*
+	 * Add an artificial state so that we can do bitwise operations
+	 * over all the PATCH_SYNC_* states.
+	 */
+	PATCH_SYNC_x = 4,
+	PATCH_SYNC_0 = PATCH_SYNC_x | 0,	/* Serialize INT3 */
+	PATCH_SYNC_1 = PATCH_SYNC_x | 1,	/*