Re: x86/kprobes bug? (was: [PATCH 1/3] x86/alternatives: Teach text_poke_bp() to emulate instructions)

2019-10-17 Thread Peter Zijlstra
On Wed, Oct 09, 2019 at 10:26:35AM -0400, Mathieu Desnoyers wrote:

> Yes, I think you are right on both counts. synchronize_rcu() is not enough
> to guarantee that other cores have observed the required core serializing
> instructions.

*phew*, not crazy then ;-)

> I would also be more comfortable if we ensure core serialization for all
> cores after arming the kprobe with text_poke() (before doing the text_poke_bp
> to JMP32), and after the text_poke(old) in DISARM (before freeing, and 
> possibly
> re-using, the memory).
> 
> I think originally it might have been OK to text_poke the INT3 without core 
> serialization
> before introducing optimized kprobes, since it would only switch back and 
> forth between
> the original instruction { 0xAA, 0xBB, 0xCC, ... } and the breakpoint
> { INT3, 0xBB, 0xCC, ... }. But now that the optimized kprobes are adding
> additional states, we end up requiring core serialization in case a core
> observes the original instruction and the optimized kprobes jump without
> observing the INT3.
> 
> The follow up patch you propose at 
> https://lore.kernel.org/lkml/20191009132844.gg2...@hirez.programming.kicks-ass.net/
> makes sense.

Thanks; full patch with Changelog below.

> Now depending on whether we care mostly about speed or robustness in this
> code, there is a small tweak we could do. The approach you propose aims for
> robustness by issuing a text_poke_sync() after each ARM/DISARM, which
> effectively adds IPIs to all cores even in !opt cases. If we aim for speed
> in the !opt case, we might want to move the text_poke_sync() within the
> if (opt) branches so it only IPIs if the probe happens to be optimized.
> 
> In my personal opinion, I would prefer simple and robust over clever and fast
> for inserting kprobes, but you guys know more about the performance trade-offs
> than I do.

I chose this variant because it also provides the guarantee that the
kprobe is visible/active the moment we return from installing it. I
would personally find it really confusing to have to wait $random period
before being able to observe the probe on all CPUs.
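
That is, with the unconditional sync a caller gets roughly (sketch):

	register_kprobe(&kp);	/* arch_arm_kprobe(): text_poke(INT3) + text_poke_sync() */
	/* from here on, any CPU executing kp.addr hits the probe */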


---
Subject: x86/kprobes: Fix ordering
From: Peter Zijlstra 
Date: Wed Oct  9 21:15:28 CEST 2019

Kprobes does something like:

register:
arch_arm_kprobe()
  text_poke(INT3)
  /* guarantees nothing, INT3 will become visible at some point, maybe */

kprobe_optimizer()
  /* guarantees the bytes after INT3 are unused */
  synchronize_rcu_tasks();
  text_poke_bp(JMP32);
  /* implies IPI-sync, kprobe really is enabled */


unregister:
__disarm_kprobe()
  unoptimize_kprobe()
text_poke_bp(INT3 + tail);
/* implies IPI-sync, so tail is guaranteed visible */
  arch_disarm_kprobe()
text_poke(old);
/* guarantees nothing, old will maybe become visible */

synchronize_rcu()

free-stuff

Now the problem is that, on register, the synchronize_rcu_tasks() is not
sufficient to guarantee that all CPUs have already observed the INT3
(although in practice it is exceedingly unlikely they have not), similar
to how MEMBARRIER_CMD_PRIVATE_EXPEDITED does not imply
MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE.

Worse, even if it did, we'd have to do 2 synchronize calls to provide
the guarantee we're looking for: the first to ensure the INT3 is visible,
the second to guarantee nobody is then still using the instruction
bytes after the INT3.
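
That is, even if it did imply the required serialization, the register
side would hypothetically have to read something like (sketch, not what
this patch does):

  arch_arm_kprobe()
    text_poke(INT3)
    synchronize_rcu_tasks()	/* 1) INT3 visible everywhere */
    synchronize_rcu_tasks()	/* 2) bytes after INT3 unused */
  text_poke_bp(JMP32)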

Similarly, on unregister, the synchronize_rcu() between
__unregister_kprobe_top() and __unregister_kprobe_bottom() does not
guarantee that all CPUs are free of the INT3 (and observe the old text).

Therefore, sprinkle some IPI-sync love around. This guarantees that
all CPUs agree on the text and RCU once again provides the required
guarantees.

Signed-off-by: Peter Zijlstra (Intel) 
Cc: h...@zytor.com
Cc: paul...@kernel.org
Cc: mathieu.desnoy...@efficios.com
---
 arch/x86/include/asm/text-patching.h |    1 +
 arch/x86/kernel/alternative.c        |   11 ++++++++---
 arch/x86/kernel/kprobes/core.c       |    2 ++
 arch/x86/kernel/kprobes/opt.c        |   12 ++++--------
 4 files changed, 15 insertions(+), 11 deletions(-)

--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -42,6 +42,7 @@ extern void text_poke_early(void *addr,
  * an inconsistent instruction while you patch.
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
+extern void text_poke_sync(void);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void text_poke_bp(void *addr, const void *opcode, size_t len, const 
void *emulate);
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -936,6 +936,11 @@ static void do_sync_core(void *info)
sync_core();
 }

+void text_poke_sync(void)
+{
+   on_each_cpu(do_sync_core, NULL, 1);
+}
+
 struct 

Re: x86/kprobes bug? (was: [PATCH 1/3] x86/alternatives: Teach text_poke_bp() to emulate instructions)

2019-10-09 Thread Mathieu Desnoyers
+ hpa, paulmck

- On Oct 9, 2019, at 9:07 AM, Peter Zijlstra pet...@infradead.org wrote:

> On Fri, Oct 04, 2019 at 10:45:40PM +0900, Masami Hiramatsu wrote:
> 
>> > > > >  text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
>> > > > > - op->optinsn.insn);
>> > > > > + emulate_buff);
>> > > > >  }
>> > > 
>> > > As argued in a previous thread, text_poke_bp() is broken when it changes
>> > > more than a single instruction at a time.
>> > > 
>> > > Now, ISTR optimized kprobes does something like:
>> > > 
>> > >  poke INT3
>> > 
>> > Hmm, it does this using text_poke(), but lacks a
>> > on_each_cpu(do_sync_core, NULL, 1), which I suppose is OK-ish IFF you do
>> > that synchronize_rcu_tasks() after it, but less so if you don't.
>> > 
>> > That is, without either, you can't really tell if the kprobe is in
>> > effect or not.
>> 
>> Yes, it doesn't wait the change by design at this moment.
> 
> Right, this might surprise some, I suppose, and I might've found a small
> issue with it, see below.
> 
>> > >  synchronize_rcu_tasks() /* waits for all tasks to schedule
>> > > guarantees instructions after INT3
>> > > are unused */
>> > >  install optimized probe /* overwrites multiple instrctions with
>> > > JMP.d32 */
>> > > 
>> > > And the above then undoes that by:
>> > > 
>> > >  poke INT3 on top of the optimzed probe
>> > > 
>> > >  poke tail instructions back /* guaranteed safe because the
>> > > above INT3 poke ensures the
>> > > JMP.d32 instruction is unused */
>> > > 
>> > >  poke head byte back
>> 
>> Yes, anyway, the last poke should recover another INT3... (for kprobe)
> 
> It does indeed.
> 
>> > > Is this correct? If so, we should probably put a comment in there
>> > > explaining how all this is unusual but safe.
> 
> So from what I can tell of kernel/kprobes.c, what it does is something like:
> 
> ARM: (__arm_kprobe)
>   text_poke(INT3)
>   /* guarantees nothing, INT3 will become visible at some point, maybe */
> 
> (kprobe_optimizer)
>   if (opt) {
>   /* guarantees the bytes after INT3 are unused */
>   syncrhonize_rcu_tasks();
>   text_poke_bp(JMP32);
>   /* implies IPI-sync, kprobe really is enabled */
>   }
> 
> 
> DISARM: (__unregister_kprobe_top)
>   if (opt) {
>   text_poke_bp(INT3 + tail);
>   /* implies IPI-sync, so tail is guaranteed visible */
>   }
>   text_poke(old);
> 
> 
> FREE: (__unregister_kprobe_bottom)
>   /* guarantees 'old' is visible and the kprobe really is unused, maybe */
>   synchronize_rcu();
>   free();
> 
> 
> Now the problem is that I don't think the synchronize_rcu() at free
> implies enough to guarantee 'old' really is visible on all CPUs.
> Similarly, I don't think synchronize_rcu_tasks() is sufficient on the
> ARM side either. It only provides the guarantee -provided- the INT3 is
> actually visible. If it is not, all bets are off.
> 
> I'd feel much better if we switch arch_arm_kprobe() over to using
> text_poke_bp(). Or at the very least add the on_each_cpu(do_sync_core)
> to it.
> 
> Hmm?

Yes, I think you are right on both counts. synchronize_rcu() is not enough
to guarantee that other cores have observed the required core serializing
instructions.

I would also be more comfortable if we ensure core serialization for all
cores after arming the kprobe with text_poke() (before doing the text_poke_bp
to JMP32), and after the text_poke(old) in DISARM (before freeing, and possibly
re-using, the memory).

I think originally it might have been OK to text_poke the INT3 without core
serialization before introducing optimized kprobes, since it would only switch
back and forth between the original instruction { 0xAA, 0xBB, 0xCC, ... } and
the breakpoint { INT3, 0xBB, 0xCC, ... }. But now that the optimized kprobes
are adding additional states, we end up requiring core serialization in case a
core observes the original instruction and the optimized kprobes jump without
observing the INT3.
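
Spelling out the states at the probe address (sketch):

  original:  { 0xAA, 0xBB, 0xCC, ... }
  kprobe:    { INT3, 0xBB, 0xCC, ... }   /* INT3 + original tail */
  optprobe:  { JMP32, rel32 }            /* jump to op->optinsn.insn, tail overwritten */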

The follow up patch you propose at 
https://lore.kernel.org/lkml/20191009132844.gg2...@hirez.programming.kicks-ass.net/
makes sense.

Now depending on whether we care mostly about speed or robustness in this
code, there is a small tweak we could do. The approach you propose aims for
robustness by issuing a text_poke_sync() after each ARM/DISARM, which
effectively adds IPIs to all cores even in !opt cases. If we aim for speed
in the !opt case, we might want to move the text_poke_sync() within the
if (opt) branches so it only IPIs if the probe happens to be optimized.
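
I.e., roughly (sketch, reusing the ARM notation from your flow above; the
DISARM side would be reshuffled the same way):

  ARM:
    text_poke(INT3)              /* !opt case: no IPI here */
    if (opt) {
        text_poke_sync()         /* serialize before relying on the INT3 */
        synchronize_rcu_tasks()
        text_poke_bp(JMP32)      /* implies IPI-sync */
    }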

In my personal opinion, I would prefer simple and robust over clever and fast
for inserting kprobes, but you guys know more about the performance trade-offs
than I do.

hpa provided very insightful feedback in the original 

Re: x86/kprobes bug? (was: [PATCH 1/3] x86/alternatives: Teach text_poke_bp() to emulate instructions)

2019-10-09 Thread Peter Zijlstra
On Wed, Oct 09, 2019 at 03:26:39PM +0200, Peter Zijlstra wrote:
> So I suppose I'm suggesting we do something like the below on top of
> what I already have here:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git x86/ftrace
> 
> All it needs are a few comments ;-) Also note how this nicely gets rid
> of the one text_poke_bp(.emulate) user, so we can go and remove that as
> well.
> 
> ---
>  arch/x86/include/asm/text-patching.h |  1 +
>  arch/x86/kernel/alternative.c| 11 ---
>  arch/x86/kernel/kprobes/core.c   |  1 +
>  arch/x86/kernel/kprobes/opt.c| 12 
>  4 files changed, 14 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/include/asm/text-patching.h 
> b/arch/x86/include/asm/text-patching.h
> index d553175212b3..d3269558e5b5 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -42,6 +42,7 @@ extern void text_poke_early(void *addr, const void *opcode, 
> size_t len);
>   * an inconsistent instruction while you patch.
>   */
>  extern void *text_poke(void *addr, const void *opcode, size_t len);
> +extern void text_poke_sync(void);
>  extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
>  extern int poke_int3_handler(struct pt_regs *regs);
>  extern void text_poke_bp(void *addr, const void *opcode, size_t len, const 
> void *emulate);
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 34a08bc68e9a..9e81ab542190 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -936,6 +936,11 @@ static void do_sync_core(void *info)
>   sync_core();
>  }
>  
> +void text_poke_sync(void)
> +{
> + on_each_cpu(do_sync_core, NULL, 1);
> +}
> +
>  struct text_poke_loc {
>   s32 rel_addr; /* addr := _stext + rel_addr */
>   s32 rel32;
> @@ -1089,7 +1094,7 @@ static void text_poke_bp_batch(struct text_poke_loc 
> *tp, unsigned int nr_entries
>   for (i = 0; i < nr_entries; i++)
>   text_poke(text_poke_addr(&tp[i]), &int3, sizeof(int3));
>  
> - on_each_cpu(do_sync_core, NULL, 1);
> + text_poke_sync();
>  
>   /*
>* Second step: update all but the first byte of the patched range.
> @@ -1111,7 +1116,7 @@ static void text_poke_bp_batch(struct text_poke_loc 
> *tp, unsigned int nr_entries
>* not necessary and we'd be safe even without it. But
>* better safe than sorry (plus there's not only Intel).
>*/
> - on_each_cpu(do_sync_core, NULL, 1);
> + text_poke_sync();
>   }
>  
>   /*
> @@ -1127,7 +1132,7 @@ static void text_poke_bp_batch(struct text_poke_loc 
> *tp, unsigned int nr_entries
>   }
>  
>   if (do_sync)
> - on_each_cpu(do_sync_core, NULL, 1);
> + text_poke_sync();
>  
>   /*
>* sync_core() implies an smp_mb() and orders this store against

diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 7e4a8a1c9d9a..04858ea7cd76 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -498,11 +498,13 @@ int arch_prepare_kprobe(struct kprobe *p)
 void arch_arm_kprobe(struct kprobe *p)
 {
text_poke(p->addr, ((unsigned char []){INT3_INSN_OPCODE}), 1);
+   text_poke_sync();
 }
 
 void arch_disarm_kprobe(struct kprobe *p)
 {
text_poke(p->addr, &p->opcode, 1);
+   text_poke_sync();
 }
 
 void arch_remove_kprobe(struct kprobe *p)

>  void arch_remove_kprobe(struct kprobe *p)
> diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
> index 36d7249f2145..30a2646cfc8a 100644
> --- a/arch/x86/kernel/kprobes/opt.c
> +++ b/arch/x86/kernel/kprobes/opt.c
> @@ -446,14 +446,10 @@ void arch_optimize_kprobes(struct list_head *oplist)
>  /* Replace a relative jump with a breakpoint (int3).  */
>  void arch_unoptimize_kprobe(struct optimized_kprobe *op)
>  {
> - u8 insn_buff[JMP32_INSN_SIZE];
> -
> - /* Set int3 to first byte for kprobes */
> - insn_buff[0] = INT3_INSN_OPCODE;
> - memcpy(insn_buff + 1, op->optinsn.copied_insn, DISP32_SIZE);
> -
> - text_poke_bp(op->kp.addr, insn_buff, JMP32_INSN_SIZE,
> -  text_gen_insn(JMP32_INSN_OPCODE, op->kp.addr, 
> op->optinsn.insn));
> + arch_arm_kprobe(&op->kp);
> + text_poke(op->kp.addr + INT3_INSN_SIZE,
> +   op->optinsn.copied_insn, DISP32_SIZE);
> + text_poke_sync();
>  }
>  
>  /*


Re: x86/kprobes bug? (was: [PATCH 1/3] x86/alternatives: Teach text_poke_bp() to emulate instructions)

2019-10-09 Thread Peter Zijlstra
On Wed, Oct 09, 2019 at 03:07:54PM +0200, Peter Zijlstra wrote:
> So from what I can tell of kernel/kprobes.c, what it does is something like:
> 
> ARM: (__arm_kprobe)
>   text_poke(INT3)
>   /* guarantees nothing, INT3 will become visible at some point, maybe */
> 
>  (kprobe_optimizer)
>   if (opt) {
>   /* guarantees the bytes after INT3 are unused */
>   syncrhonize_rcu_tasks();
>   text_poke_bp(JMP32);
>   /* implies IPI-sync, kprobe really is enabled */
>   }
> 
> 
> DISARM: (__unregister_kprobe_top)
>   if (opt) {
>   text_poke_bp(INT3 + tail);
>   /* implies IPI-sync, so tail is guaranteed visible */
>   }
>   text_poke(old);
> 
> 
> FREE: (__unregister_kprobe_bottom)
>   /* guarantees 'old' is visible and the kprobe really is unused, maybe */
>   synchronize_rcu();
>   free();
> 
> 
> Now the problem is that I don't think the synchronize_rcu() at free
> implies enough to guarantee 'old' really is visible on all CPUs.
> Similarly, I don't think synchronize_rcu_tasks() is sufficient on the
> ARM side either. It only provides the guarantee -provided- the INT3 is
> actually visible. If it is not, all bets are off.
> 
> I'd feel much better if we switch arch_arm_kprobe() over to using
> text_poke_bp(). Or at the very least add the on_each_cpu(do_sync_core)
> to it.
> 
> Hmm?

So I suppose I'm suggesting we do something like the below on top of
what I already have here:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git x86/ftrace

All it needs are a few comments ;-) Also note how this nicely gets rid
of the one text_poke_bp(.emulate) user, so we can go and remove that as
well.

---
 arch/x86/include/asm/text-patching.h |  1 +
 arch/x86/kernel/alternative.c        | 11 ++++++++---
 arch/x86/kernel/kprobes/core.c       |  1 +
 arch/x86/kernel/kprobes/opt.c        | 12 ++++--------
 4 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h 
b/arch/x86/include/asm/text-patching.h
index d553175212b3..d3269558e5b5 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -42,6 +42,7 @@ extern void text_poke_early(void *addr, const void *opcode, 
size_t len);
  * an inconsistent instruction while you patch.
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
+extern void text_poke_sync(void);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void text_poke_bp(void *addr, const void *opcode, size_t len, const 
void *emulate);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 34a08bc68e9a..9e81ab542190 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -936,6 +936,11 @@ static void do_sync_core(void *info)
sync_core();
 }
 
+void text_poke_sync(void)
+{
+   on_each_cpu(do_sync_core, NULL, 1);
+}
+
 struct text_poke_loc {
s32 rel_addr; /* addr := _stext + rel_addr */
s32 rel32;
@@ -1089,7 +1094,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, 
unsigned int nr_entries
for (i = 0; i < nr_entries; i++)
text_poke(text_poke_addr(&tp[i]), &int3, sizeof(int3));
 
-   on_each_cpu(do_sync_core, NULL, 1);
+   text_poke_sync();
 
/*
 * Second step: update all but the first byte of the patched range.
@@ -1111,7 +1116,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, 
unsigned int nr_entries
 * not necessary and we'd be safe even without it. But
 * better safe than sorry (plus there's not only Intel).
 */
-   on_each_cpu(do_sync_core, NULL, 1);
+   text_poke_sync();
}
 
/*
@@ -1127,7 +1132,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, 
unsigned int nr_entries
}
 
if (do_sync)
-   on_each_cpu(do_sync_core, NULL, 1);
+   text_poke_sync();
 
/*
 * sync_core() implies an smp_mb() and orders this store against
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 7e4a8a1c9d9a..8ae5170207b2 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -503,6 +503,7 @@ void arch_arm_kprobe(struct kprobe *p)
 void arch_disarm_kprobe(struct kprobe *p)
 {
text_poke(p->addr, &p->opcode, 1);
+   text_poke_sync();
 }
 
 void arch_remove_kprobe(struct kprobe *p)
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
index 36d7249f2145..30a2646cfc8a 100644
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -446,14 +446,10 @@ void arch_optimize_kprobes(struct list_head *oplist)
 /* Replace a relative jump with a breakpoint (int3).  */
 void arch_unoptimize_kprobe(struct optimized_kprobe *op)
 {
-   u8 

x86/kprobes bug? (was: [PATCH 1/3] x86/alternatives: Teach text_poke_bp() to emulate instructions)

2019-10-09 Thread Peter Zijlstra
On Fri, Oct 04, 2019 at 10:45:40PM +0900, Masami Hiramatsu wrote:

> > > > >   text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
> > > > > -  op->optinsn.insn);
> > > > > +  emulate_buff);
> > > > >  }
> > > 
> > > As argued in a previous thread, text_poke_bp() is broken when it changes
> > > more than a single instruction at a time.
> > > 
> > > Now, ISTR optimized kprobes does something like:
> > > 
> > >   poke INT3
> > 
> > Hmm, it does this using text_poke(), but lacks a
> > on_each_cpu(do_sync_core, NULL, 1), which I suppose is OK-ish IFF you do
> > that synchronize_rcu_tasks() after it, but less so if you don't.
> > 
> > That is, without either, you can't really tell if the kprobe is in
> > effect or not.
> 
> Yes, it doesn't wait the change by design at this moment.

Right, this might surprise some, I suppose, and I might've found a small
issue with it, see below.

> > >   synchronize_rcu_tasks() /* waits for all tasks to schedule
> > >  guarantees instructions after INT3
> > >  are unused */
> > >   install optimized probe /* overwrites multiple instrctions with
> > >  JMP.d32 */
> > > 
> > > And the above then undoes that by:
> > > 
> > >   poke INT3 on top of the optimzed probe
> > > 
> > >   poke tail instructions back /* guaranteed safe because the
> > >  above INT3 poke ensures the
> > >  JMP.d32 instruction is unused */
> > > 
> > >   poke head byte back
> 
> Yes, anyway, the last poke should recover another INT3... (for kprobe)

It does indeed.

> > > Is this correct? If so, we should probably put a comment in there
> > > explaining how all this is unusual but safe.

So from what I can tell of kernel/kprobes.c, what it does is something like:

ARM: (__arm_kprobe)
text_poke(INT3)
/* guarantees nothing, INT3 will become visible at some point, maybe */

 (kprobe_optimizer)
if (opt) {
/* guarantees the bytes after INT3 are unused */
synchronize_rcu_tasks();
text_poke_bp(JMP32);
/* implies IPI-sync, kprobe really is enabled */
}


DISARM: (__unregister_kprobe_top)
if (opt) {
text_poke_bp(INT3 + tail);
/* implies IPI-sync, so tail is guaranteed visible */
}
text_poke(old);


FREE: (__unregister_kprobe_bottom)
/* guarantees 'old' is visible and the kprobe really is unused, maybe */
synchronize_rcu();
free();


Now the problem is that I don't think the synchronize_rcu() at free
implies enough to guarantee 'old' really is visible on all CPUs.
Similarly, I don't think synchronize_rcu_tasks() is sufficient on the
ARM side either. It only provides the guarantee -provided- the INT3 is
actually visible. If it is not, all bets are off.

I'd feel much better if we switch arch_arm_kprobe() over to using
text_poke_bp(). Or at the very least add the on_each_cpu(do_sync_core)
to it.
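
Something like (sketch; it would need a small exported helper, since
do_sync_core() is static to alternative.c):

 void arch_arm_kprobe(struct kprobe *p)
 {
 	text_poke(p->addr, ((unsigned char []){INT3_INSN_OPCODE}), 1);
+	on_each_cpu(do_sync_core, NULL, 1);	/* make the INT3 globally visible */
 }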

Hmm?


Re: [PATCH 1/3] x86/alternatives: Teach text_poke_bp() to emulate instructions

2019-10-07 Thread Peter Zijlstra
On Fri, Oct 04, 2019 at 10:45:40PM +0900, Masami Hiramatsu wrote:
> Hi Peter,
> 
> On Thu, 3 Oct 2019 13:01:06 +0200
> Peter Zijlstra  wrote:

> > > I'm halfway through a patch introducing:
> > > 
> > >   union text_poke_insn {
> > >   u8 code[POKE_MAX_OPCODE_SUZE];
> > >   struct {
> > >   u8 opcode;
> > >   s32 disp;
> > >   } __attribute__((packed));
> > >   };
> > > 
> > > to text-patching.h to unify all such custom unions we have all over the
> > > place. I'll mob up the above in that.
> 
> I think it is good to unify such unions, but I meant above was, it was
> also important to unify the opcode macro. Since poke_int3_handler()
> clasifies the opcode by your *_INSN_OPCODE macro, it is natual to use
> those opcode for text_poke_bp() interface.

Right, but I think we should do that as another patch, there's a lot of
instances in that file and just changing one or two over is 'weird'.

I can put it on the todo to fix that all up.


Re: [PATCH 1/3] x86/alternatives: Teach text_poke_bp() to emulate instructions

2019-10-04 Thread Masami Hiramatsu
Hi Peter,

On Thu, 3 Oct 2019 13:01:06 +0200
Peter Zijlstra  wrote:

> On Thu, Oct 03, 2019 at 10:27:51AM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 03, 2019 at 02:00:50PM +0900, Masami Hiramatsu wrote:
> > 
> > > > This fits almost all text_poke_bp() users, except
> > > > arch_unoptimize_kprobe() which restores random text, and for that site
> > > > we have to build an explicit emulate instruction.
> > > 
> > > OK, and in this case, I would like to change RELATIVEJUMP_OPCODE
> > > to JMP32_INSN_OPCODE for readability. (or at least
> > > making RELATIVEJUMP_OPCODE as an alias of JMP32_INSN_OPCODE)
> > 
> > > > @@ -448,12 +447,18 @@ void arch_optimize_kprobes(struct list_h
> > > >  void arch_unoptimize_kprobe(struct optimized_kprobe *op)
> > > >  {
> > > > u8 insn_buff[RELATIVEJUMP_SIZE];
> > > > +   u8 emulate_buff[RELATIVEJUMP_SIZE];
> > > >  
> > > > /* Set int3 to first byte for kprobes */
> > > > insn_buff[0] = BREAKPOINT_INSTRUCTION;
> > > > memcpy(insn_buff + 1, op->optinsn.copied_insn, 
> > > > RELATIVE_ADDR_SIZE);
> > > > +
> > > > +   emulate_buff[0] = RELATIVEJUMP_OPCODE;
> > > > +   *(s32 *)(&emulate_buff[1]) = (s32)((long)op->optinsn.insn -
> > > > +   ((long)op->kp.addr + RELATIVEJUMP_SIZE));
> > 
> > I'm halfway through a patch introducing:
> > 
> >   union text_poke_insn {
> > u8 code[POKE_MAX_OPCODE_SUZE];
> > struct {
> > u8 opcode;
> > s32 disp;
> > } __attribute__((packed));
> >   };
> > 
> > to text-patching.h to unify all such custom unions we have all over the
> > place. I'll mob up the above in that.

I think it is good to unify such unions, but what I meant above was that it is
also important to unify the opcode macros. Since poke_int3_handler()
classifies the opcode by your *_INSN_OPCODE macros, it is natural to use
those opcodes for the text_poke_bp() interface.

> > > > +
> > > > text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
> > > > -op->optinsn.insn);
> > > > +emulate_buff);
> > > >  }
> > 
> > As argued in a previous thread, text_poke_bp() is broken when it changes
> > more than a single instruction at a time.
> > 
> > Now, ISTR optimized kprobes does something like:
> > 
> > poke INT3
> 
> Hmm, it does this using text_poke(), but lacks a
> on_each_cpu(do_sync_core, NULL, 1), which I suppose is OK-ish IFF you do
> that synchronize_rcu_tasks() after it, but less so if you don't.
> 
> That is, without either, you can't really tell if the kprobe is in
> effect or not.

Yes, it doesn't wait for the change to become visible; that's by design at this moment.

> Also, I think text_poke_bp(INT3) is broken, although I don't think
> anybody actually does that. Still, let me fix that.

OK.

> 
> > synchronize_rcu_tasks() /* waits for all tasks to schedule
> >guarantees instructions after INT3
> >are unused */
> > install optimized probe /* overwrites multiple instrctions with
> >JMP.d32 */
> > 
> > And the above then undoes that by:
> > 
> > poke INT3 on top of the optimzed probe
> > 
> > poke tail instructions back /* guaranteed safe because the
> >above INT3 poke ensures the
> >JMP.d32 instruction is unused */
> > 
> > poke head byte back

Yes, anyway, the last poke should recover another INT3... (for kprobe)

> > 
> > Is this correct? If so, we should probably put a comment in there
> > explaining how all this is unusual but safe.

OK.

Thank you,

-- 
Masami Hiramatsu 


Re: [PATCH 1/3] x86/alternatives: Teach text_poke_bp() to emulate instructions

2019-10-03 Thread Peter Zijlstra
On Thu, Oct 03, 2019 at 10:27:51AM +0200, Peter Zijlstra wrote:
> On Thu, Oct 03, 2019 at 02:00:50PM +0900, Masami Hiramatsu wrote:
> 
> > > This fits almost all text_poke_bp() users, except
> > > arch_unoptimize_kprobe() which restores random text, and for that site
> > > we have to build an explicit emulate instruction.
> > 
> > OK, and in this case, I would like to change RELATIVEJUMP_OPCODE
> > to JMP32_INSN_OPCODE for readability. (or at least
> > making RELATIVEJUMP_OPCODE as an alias of JMP32_INSN_OPCODE)
> 
> > > @@ -448,12 +447,18 @@ void arch_optimize_kprobes(struct list_h
> > >  void arch_unoptimize_kprobe(struct optimized_kprobe *op)
> > >  {
> > >   u8 insn_buff[RELATIVEJUMP_SIZE];
> > > + u8 emulate_buff[RELATIVEJUMP_SIZE];
> > >  
> > >   /* Set int3 to first byte for kprobes */
> > >   insn_buff[0] = BREAKPOINT_INSTRUCTION;
> > >   memcpy(insn_buff + 1, op->optinsn.copied_insn, RELATIVE_ADDR_SIZE);
> > > +
> > > + emulate_buff[0] = RELATIVEJUMP_OPCODE;
> > > + *(s32 *)(&emulate_buff[1]) = (s32)((long)op->optinsn.insn -
> > > + ((long)op->kp.addr + RELATIVEJUMP_SIZE));
> 
> I'm halfway through a patch introducing:
> 
>   union text_poke_insn {
>   u8 code[POKE_MAX_OPCODE_SUZE];
>   struct {
>   u8 opcode;
>   s32 disp;
>   } __attribute__((packed));
>   };
> 
> to text-patching.h to unify all such custom unions we have all over the
> place. I'll mob up the above in that.

How's something like so?

--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -49,6 +49,8 @@ extern void text_poke_bp(void *addr, con
 extern void text_poke_queue(void *addr, const void *opcode, size_t len, const 
void *emulate);
 extern void text_poke_finish(void);
 
+extern void *text_gen_insn(u8 opcode, unsigned long addr, unsigned long dest);
+
 extern int after_bootmem;
 extern __ro_after_init struct mm_struct *poking_mm;
 extern __ro_after_init unsigned long poking_addr;
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1237,3 +1237,39 @@ void text_poke_bp(void *addr, const void
text_poke_loc_init(&tp, addr, opcode, len, emulate);
text_poke_bp_batch(&tp, 1);
 }
+
+union text_poke_insn {
+   u8 text[POKE_MAX_OPCODE_SIZE];
+   struct {
+   u8 opcode;
+   s32 disp;
+   } __attribute__((packed));
+};
+
+void *text_gen_insn(u8 opcode, unsigned long addr, unsigned long dest)
+{
+   static union text_poke_insn insn; /* text_mutex */
+   int size = 0;
+
+   lockdep_assert_held(&text_mutex);
+
+   insn.opcode = opcode;
+
+#define __CASE(insn)   \
+   case insn##_INSN_OPCODE: size = insn##_INSN_SIZE; break
+
+   switch(opcode) {
+   __CASE(INT3);
+   __CASE(CALL);
+   __CASE(JMP32);
+   __CASE(JMP8);
+   }
+
+   if (size > 1) {
+   insn.disp = dest - (addr + size);
+   if (size == 2)
+   BUG_ON((insn.disp >> 31) != (insn.disp >> 7));
+   }
+
+   return &insn;
+}
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -16,14 +16,6 @@
 #include 
 #include 
 
-union jump_code_union {
-   char code[JUMP_LABEL_NOP_SIZE];
-   struct {
-   char jump;
-   int offset;
-   } __attribute__((packed));
-};
-
 static void bug_at(unsigned char *ip, int line)
 {
/*
@@ -38,33 +30,29 @@ static void bug_at(unsigned char *ip, in
 static const void *
 __jump_label_set_jump_code(struct jump_entry *entry, enum jump_label_type 
type, int init)
 {
-   static union jump_code_union code; /* relies on text_mutex */
const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
-   const void *expect;
+   const void *expect, *code;
int line;
 
-   lockdep_assert_held(&text_mutex);
-
-   code.jump = JMP32_INSN_OPCODE;
-   code.offset = jump_entry_target(entry) -
-  (jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
+   code = text_gen_insn(JMP32_INSN_OPCODE, jump_entry_code(entry),
+   jump_entry_target(entry));
 
if (init) {
expect = default_nop; line = __LINE__;
} else if (type == JUMP_LABEL_JMP) {
expect = ideal_nop; line = __LINE__;
} else {
-   expect = code.code; line = __LINE__;
+   expect = code; line = __LINE__;
}
 
if (memcmp((void *)jump_entry_code(entry), expect, JUMP_LABEL_NOP_SIZE))
bug_at((void *)jump_entry_code(entry), line);
 
if (type == JUMP_LABEL_NOP)
-   memcpy(&code, ideal_nop, JUMP_LABEL_NOP_SIZE);
+   code = ideal_nop;
 
-   return &code;
+   return code;
 }
 
 static void inline __jump_label_transform(struct jump_entry *entry,
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -447,18 +447,13 @@ void 

Re: [PATCH 1/3] x86/alternatives: Teach text_poke_bp() to emulate instructions

2019-10-03 Thread Peter Zijlstra
On Thu, Oct 03, 2019 at 01:01:06PM +0200, Peter Zijlstra wrote:
> Also, I think text_poke_bp(INT3) is broken, although I don't think
> anybody actually does that. Still, let me fix that.

Something like so should allow text_poke_bp(INT3) to work as expected.
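
That is, a user could then do e.g. (sketch, with a hypothetical addr):

	unsigned char int3 = INT3_INSN_OPCODE;

	text_poke_bp(addr, &int3, INT3_INSN_SIZE, &int3);

and a CPU that hits the transient INT3 is not consumed by
poke_int3_handler() but left for whoever owns that breakpoint.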

---
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -999,6 +999,13 @@ int poke_int3_handler(struct pt_regs *re
ip += tp->len;
 
switch (tp->opcode) {
+   case INT3_INSN_OPCODE:
+   /*
+* Someone poked an explicit INT3, they'll want to handle it,
+* do not consume.
+*/
+   return 0;
+
case CALL_INSN_OPCODE:
int3_emulate_call(regs, (long)ip + tp->rel32);
break;
@@ -1040,8 +1047,8 @@ NOKPROBE_SYMBOL(poke_int3_handler);
 void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
 {
unsigned char int3 = INT3_INSN_OPCODE;
-   int patched_all_but_first = 0;
unsigned int i;
+   int do_sync;
 
lockdep_assert_held(_mutex);
 
@@ -1065,16 +1072,16 @@ void text_poke_bp_batch(struct text_poke
/*
 * Second step: update all but the first byte of the patched range.
 */
-   for (i = 0; i < nr_entries; i++) {
+   for (do_sync = 0, i = 0; i < nr_entries; i++) {
if (tp[i].len - sizeof(int3) > 0) {
text_poke((char *)tp[i].addr + sizeof(int3),
  (const char *)tp[i].text + sizeof(int3),
  tp[i].len - sizeof(int3));
-   patched_all_but_first++;
+   do_sync++;
}
}
 
-   if (patched_all_but_first) {
+   if (do_sync) {
/*
 * According to Intel, this core syncing is very likely
 * not necessary and we'd be safe even without it. But
@@ -1087,10 +1094,17 @@ void text_poke_bp_batch(struct text_poke
 * Third step: replace the first byte (int3) by the first byte of
 * replacing opcode.
 */
-   for (i = 0; i < nr_entries; i++)
+   for (do_sync = 0, i = 0; i < nr_entries; i++) {
+   if (tp[i].text[0] == INT3_INSN_OPCODE)
+   continue;
+
text_poke(tp[i].addr, tp[i].text, sizeof(int3));
+   do_sync++;
+   }
+
+   if (do_sync)
+   on_each_cpu(do_sync_core, NULL, 1);
 
-   on_each_cpu(do_sync_core, NULL, 1);
/*
 * sync_core() implies an smp_mb() and orders this store against
 * the writing of the new instruction.
@@ -1123,6 +1137,9 @@ void text_poke_loc_init(struct text_poke
tp->opcode = insn.opcode.bytes[0];
 
switch (tp->opcode) {
+   case INT3_INSN_OPCODE:
+   break;
+
case CALL_INSN_OPCODE:
case JMP32_INSN_OPCODE:
case JMP8_INSN_OPCODE:


Re: [PATCH 1/3] x86/alternatives: Teach text_poke_bp() to emulate instructions

2019-10-03 Thread Peter Zijlstra
On Thu, Oct 03, 2019 at 10:27:51AM +0200, Peter Zijlstra wrote:
> On Thu, Oct 03, 2019 at 02:00:50PM +0900, Masami Hiramatsu wrote:
> 
> > > This fits almost all text_poke_bp() users, except
> > > arch_unoptimize_kprobe() which restores random text, and for that site
> > > we have to build an explicit emulate instruction.
> > 
> > OK, and in this case, I would like to change RELATIVEJUMP_OPCODE
> > to JMP32_INSN_OPCODE for readability. (or at least
> > making RELATIVEJUMP_OPCODE as an alias of JMP32_INSN_OPCODE)
> 
> > > @@ -448,12 +447,18 @@ void arch_optimize_kprobes(struct list_h
> > >  void arch_unoptimize_kprobe(struct optimized_kprobe *op)
> > >  {
> > >   u8 insn_buff[RELATIVEJUMP_SIZE];
> > > + u8 emulate_buff[RELATIVEJUMP_SIZE];
> > >  
> > >   /* Set int3 to first byte for kprobes */
> > >   insn_buff[0] = BREAKPOINT_INSTRUCTION;
> > >   memcpy(insn_buff + 1, op->optinsn.copied_insn, RELATIVE_ADDR_SIZE);
> > > +
> > > + emulate_buff[0] = RELATIVEJUMP_OPCODE;
> > > + *(s32 *)(&emulate_buff[1]) = (s32)((long)op->optinsn.insn -
> > > + ((long)op->kp.addr + RELATIVEJUMP_SIZE));
> 
> I'm halfway through a patch introducing:
> 
>   union text_poke_insn {
>   u8 code[POKE_MAX_OPCODE_SUZE];
>   struct {
>   u8 opcode;
>   s32 disp;
>   } __attribute__((packed));
>   };
> 
> to text-patching.h to unify all such custom unions we have all over the
> place. I'll mob up the above in that.
> 
> > > +
> > >   text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
> > > -  op->optinsn.insn);
> > > +  emulate_buff);
> > >  }
> 
> As argued in a previous thread, text_poke_bp() is broken when it changes
> more than a single instruction at a time.
> 
> Now, ISTR optimized kprobes does something like:
> 
>   poke INT3

Hmm, it does this using text_poke(), but lacks a
on_each_cpu(do_sync_core, NULL, 1), which I suppose is OK-ish IFF you do
that synchronize_rcu_tasks() after it, but less so if you don't.

That is, without either, you can't really tell if the kprobe is in
effect or not.

Also, I think text_poke_bp(INT3) is broken, although I don't think
anybody actually does that. Still, let me fix that.

>   synchronize_rcu_tasks() /* waits for all tasks to schedule
>  guarantees instructions after INT3
>  are unused */
>   install optimized probe /* overwrites multiple instrctions with
>  JMP.d32 */
> 
> And the above then undoes that by:
> 
>   poke INT3 on top of the optimzed probe
> 
>   poke tail instructions back /* guaranteed safe because the
>  above INT3 poke ensures the
>  JMP.d32 instruction is unused */
> 
>   poke head byte back
> 
> Is this correct? If so, we should probably put a comment in there
> explaining how all this is unusual but safe.


Re: [PATCH 1/3] x86/alternatives: Teach text_poke_bp() to emulate instructions

2019-10-03 Thread Peter Zijlstra
On Thu, Oct 03, 2019 at 02:00:50PM +0900, Masami Hiramatsu wrote:

> > This fits almost all text_poke_bp() users, except
> > arch_unoptimize_kprobe() which restores random text, and for that site
> > we have to build an explicit emulate instruction.
> 
> OK, and in this case, I would like to change RELATIVEJUMP_OPCODE
> to JMP32_INSN_OPCODE for readability. (or at least
> making RELATIVEJUMP_OPCODE as an alias of JMP32_INSN_OPCODE)

> > @@ -448,12 +447,18 @@ void arch_optimize_kprobes(struct list_h
> >  void arch_unoptimize_kprobe(struct optimized_kprobe *op)
> >  {
> > u8 insn_buff[RELATIVEJUMP_SIZE];
> > +   u8 emulate_buff[RELATIVEJUMP_SIZE];
> >  
> > /* Set int3 to first byte for kprobes */
> > insn_buff[0] = BREAKPOINT_INSTRUCTION;
> > memcpy(insn_buff + 1, op->optinsn.copied_insn, RELATIVE_ADDR_SIZE);
> > +
> > +   emulate_buff[0] = RELATIVEJUMP_OPCODE;
> > +   *(s32 *)(&emulate_buff[1]) = (s32)((long)op->optinsn.insn -
> > +   ((long)op->kp.addr + RELATIVEJUMP_SIZE));

I'm halfway through a patch introducing:

  union text_poke_insn {
u8 code[POKE_MAX_OPCODE_SIZE];
struct {
u8 opcode;
s32 disp;
} __attribute__((packed));
  };

to text-patching.h to unify all such custom unions we have all over the
place. I'll mob up the above in that.

> > +
> > text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
> > -op->optinsn.insn);
> > +emulate_buff);
> >  }

As argued in a previous thread, text_poke_bp() is broken when it changes
more than a single instruction at a time.

Now, ISTR optimized kprobes does something like:

poke INT3
synchronize_rcu_tasks() /* waits for all tasks to schedule
   guarantees instructions after INT3
   are unused */
install optimized probe /* overwrites multiple instructions with
   JMP.d32 */

And the above then undoes that by:

poke INT3 on top of the optimized probe

poke tail instructions back /* guaranteed safe because the
   above INT3 poke ensures the
   JMP.d32 instruction is unused */

poke head byte back

Is this correct? If so, we should probably put a comment in there
explaining how all this is unusual but safe.
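
Something along these lines, perhaps (sketch):

	/*
	 * Unoptimizing works in the reverse order of optimizing: first poke
	 * an INT3 on top of the JMP.d32, which guarantees the JMP.d32 is no
	 * longer used, and only then restore the tail instruction bytes
	 * behind the INT3.  Unusual, but safe.
	 */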


Re: [PATCH 1/3] x86/alternatives: Teach text_poke_bp() to emulate instructions

2019-10-02 Thread Masami Hiramatsu
Hi Peter,

On Tue, 27 Aug 2019 20:06:23 +0200
Peter Zijlstra  wrote:

> In preparation for static_call and variable size jump_label support,
> teach text_poke_bp() to emulate instructions, namely:
> 
>   JMP32, JMP8, CALL, NOP2, NOP_ATOMIC5
> 
> The current text_poke_bp() takes a @handler argument which is used as
> a jump target when the temporary INT3 is hit by a different CPU.
> 
> When patching CALL instructions, this doesn't work because we'd miss
> the PUSH of the return address. Instead, teach poke_int3_handler() to
> emulate an instruction, typically the instruction we're patching in.
> 
> This fits almost all text_poke_bp() users, except
> arch_unoptimize_kprobe() which restores random text, and for that site
> we have to build an explicit emulate instruction.

OK, and in this case, I would like to change RELATIVEJUMP_OPCODE
to JMP32_INSN_OPCODE for readability. (or at least
make RELATIVEJUMP_OPCODE an alias of JMP32_INSN_OPCODE)
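
E.g. (sketch):

	#define RELATIVEJUMP_OPCODE	JMP32_INSN_OPCODE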

Except for that, this looks good to me.

Reviewed-by: Masami Hiramatsu 

Thank you,

> 
> Cc: Masami Hiramatsu 
> Cc: Steven Rostedt 
> Cc: Daniel Bristot de Oliveira 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  arch/x86/include/asm/text-patching.h |   24 ++--
>  arch/x86/kernel/alternative.c|   98 
> ++-
>  arch/x86/kernel/jump_label.c |9 +--
>  arch/x86/kernel/kprobes/opt.c|   11 ++-
>  4 files changed, 103 insertions(+), 39 deletions(-)
> 
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -26,10 +26,11 @@ static inline void apply_paravirt(struct
>  #define POKE_MAX_OPCODE_SIZE 5
>  
>  struct text_poke_loc {
> - void *detour;
>   void *addr;
> - size_t len;
> - const char opcode[POKE_MAX_OPCODE_SIZE];
> + int len;
> + s32 rel32;
> + u8 opcode;
> + const char text[POKE_MAX_OPCODE_SIZE];
>  };
>  
>  extern void text_poke_early(void *addr, const void *opcode, size_t len);
> @@ -51,8 +52,10 @@ extern void text_poke_early(void *addr,
>  extern void *text_poke(void *addr, const void *opcode, size_t len);
>  extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
>  extern int poke_int3_handler(struct pt_regs *regs);
> -extern void text_poke_bp(void *addr, const void *opcode, size_t len, void 
> *handler);
> +extern void text_poke_bp(void *addr, const void *opcode, size_t len, const 
> void *emulate);
>  extern void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int 
> nr_entries);
> +extern void text_poke_loc_init(struct text_poke_loc *tp, void *addr,
> +const void *opcode, size_t len, const void 
> *emulate);
>  extern int after_bootmem;
>  extern __ro_after_init struct mm_struct *poking_mm;
>  extern __ro_after_init unsigned long poking_addr;
> @@ -63,8 +66,17 @@ static inline void int3_emulate_jmp(stru
>   regs->ip = ip;
>  }
>  
> -#define INT3_INSN_SIZE 1
> -#define CALL_INSN_SIZE 5
> +#define INT3_INSN_SIZE   1
> +#define INT3_INSN_OPCODE 0xCC
> +
> +#define CALL_INSN_SIZE   5
> +#define CALL_INSN_OPCODE 0xE8
> +
> +#define JMP32_INSN_SIZE  5
> +#define JMP32_INSN_OPCODE0xE9
> +
> +#define JMP8_INSN_SIZE   2
> +#define JMP8_INSN_OPCODE 0xEB
>  
>  static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
>  {
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -956,16 +956,15 @@ NOKPROBE_SYMBOL(patch_cmp);
>  int poke_int3_handler(struct pt_regs *regs)
>  {
>   struct text_poke_loc *tp;
> - unsigned char int3 = 0xcc;
>   void *ip;
>  
>   /*
>* Having observed our INT3 instruction, we now must observe
>* bp_patching.nr_entries.
>*
> -  *  nr_entries != 0 INT3
> -  *  WMB RMB
> -  *  write INT3  if (nr_entries)
> +  *  nr_entries != 0 INT3
> +  *  WMB RMB
> +  *  write INT3  if (nr_entries)
>*
>* Idem for other elements in bp_patching.
>*/
> @@ -978,9 +977,9 @@ int poke_int3_handler(struct pt_regs *re
>   return 0;
>  
>   /*
> -  * Discount the sizeof(int3). See text_poke_bp_batch().
> +  * Discount the INT3. See text_poke_bp_batch().
>*/
> - ip = (void *) regs->ip - sizeof(int3);
> + ip = (void *) regs->ip - INT3_INSN_SIZE;
>  
>   /*
>* Skip the binary search if there is a single member in the vector.
> @@ -997,8 +996,22 @@ int poke_int3_handler(struct pt_regs *re
>   return 0;
>   }
>  
> - /* set up the specified breakpoint detour */
> - regs->ip = (unsigned long) tp->detour;
> + ip += tp->len;
> +
> + switch (tp->opcode) {
> + case CALL_INSN_OPCODE:
> + int3_emulate_call(regs, (long)ip + tp->rel32);
> +   

[PATCH 1/3] x86/alternatives: Teach text_poke_bp() to emulate instructions

2019-08-27 Thread Peter Zijlstra
In preparation for static_call and variable size jump_label support,
teach text_poke_bp() to emulate instructions, namely:

  JMP32, JMP8, CALL, NOP2, NOP_ATOMIC5

The current text_poke_bp() takes a @handler argument which is used as
a jump target when the temporary INT3 is hit by a different CPU.

When patching CALL instructions, this doesn't work because we'd miss
the PUSH of the return address. Instead, teach poke_int3_handler() to
emulate an instruction, typically the instruction we're patching in.
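
For reference, the emulation leans on the existing int3_emulate_*() helpers
in text-patching.h; emulating a CALL, for instance, is roughly:

static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
{
	int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
	int3_emulate_jmp(regs, func);
}

i.e. the return address gets pushed explicitly before redirecting to the call
target, which a plain jump-target @handler cannot express.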

This fits almost all text_poke_bp() users, except
arch_unoptimize_kprobe() which restores random text, and for that site
we have to build an explicit emulate instruction.

Cc: Masami Hiramatsu 
Cc: Steven Rostedt 
Cc: Daniel Bristot de Oliveira 
Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/x86/include/asm/text-patching.h |   24 ++--
 arch/x86/kernel/alternative.c|   98 ++-
 arch/x86/kernel/jump_label.c |9 +--
 arch/x86/kernel/kprobes/opt.c|   11 ++-
 4 files changed, 103 insertions(+), 39 deletions(-)

--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -26,10 +26,11 @@ static inline void apply_paravirt(struct
 #define POKE_MAX_OPCODE_SIZE   5
 
 struct text_poke_loc {
-   void *detour;
void *addr;
-   size_t len;
-   const char opcode[POKE_MAX_OPCODE_SIZE];
+   int len;
+   s32 rel32;
+   u8 opcode;
+   const char text[POKE_MAX_OPCODE_SIZE];
 };
 
 extern void text_poke_early(void *addr, const void *opcode, size_t len);
@@ -51,8 +52,10 @@ extern void text_poke_early(void *addr,
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
-extern void text_poke_bp(void *addr, const void *opcode, size_t len, void 
*handler);
+extern void text_poke_bp(void *addr, const void *opcode, size_t len, const 
void *emulate);
 extern void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int 
nr_entries);
+extern void text_poke_loc_init(struct text_poke_loc *tp, void *addr,
+  const void *opcode, size_t len, const void 
*emulate);
 extern int after_bootmem;
 extern __ro_after_init struct mm_struct *poking_mm;
 extern __ro_after_init unsigned long poking_addr;
@@ -63,8 +66,17 @@ static inline void int3_emulate_jmp(stru
regs->ip = ip;
 }
 
-#define INT3_INSN_SIZE 1
-#define CALL_INSN_SIZE 5
+#define INT3_INSN_SIZE 1
+#define INT3_INSN_OPCODE   0xCC
+
+#define CALL_INSN_SIZE 5
+#define CALL_INSN_OPCODE   0xE8
+
+#define JMP32_INSN_SIZE5
+#define JMP32_INSN_OPCODE  0xE9
+
+#define JMP8_INSN_SIZE 2
+#define JMP8_INSN_OPCODE   0xEB
 
 static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
 {
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -956,16 +956,15 @@ NOKPROBE_SYMBOL(patch_cmp);
 int poke_int3_handler(struct pt_regs *regs)
 {
struct text_poke_loc *tp;
-   unsigned char int3 = 0xcc;
void *ip;
 
/*
 * Having observed our INT3 instruction, we now must observe
 * bp_patching.nr_entries.
 *
-*  nr_entries != 0 INT3
-*  WMB RMB
-*  write INT3  if (nr_entries)
+*  nr_entries != 0 INT3
+*  WMB RMB
+*  write INT3  if (nr_entries)
 *
 * Idem for other elements in bp_patching.
 */
@@ -978,9 +977,9 @@ int poke_int3_handler(struct pt_regs *re
return 0;
 
/*
-* Discount the sizeof(int3). See text_poke_bp_batch().
+* Discount the INT3. See text_poke_bp_batch().
 */
-   ip = (void *) regs->ip - sizeof(int3);
+   ip = (void *) regs->ip - INT3_INSN_SIZE;
 
/*
 * Skip the binary search if there is a single member in the vector.
@@ -997,8 +996,22 @@ int poke_int3_handler(struct pt_regs *re
return 0;
}
 
-   /* set up the specified breakpoint detour */
-   regs->ip = (unsigned long) tp->detour;
+   ip += tp->len;
+
+   switch (tp->opcode) {
+   case CALL_INSN_OPCODE:
+   int3_emulate_call(regs, (long)ip + tp->rel32);
+   break;
+
+   case JMP32_INSN_OPCODE:
+   case JMP8_INSN_OPCODE:
+   int3_emulate_jmp(regs, (long)ip + tp->rel32);
+   break;
+
+   default: /* nop */
+   int3_emulate_jmp(regs, (long)ip);
+   break;
+   }
 
return 1;
 }
@@ -1027,8 +1040,8 @@ NOKPROBE_SYMBOL(poke_int3_handler);
  */
 void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
 {
+   unsigned char int3 = INT3_INSN_OPCODE;
int 

[PATCH 1/3] x86/alternatives: Teach text_poke_bp() to emulate instructions

2019-08-26 Thread Peter Zijlstra
In preparation for static_call and variable size jump_label support,
teach text_poke_bp() to emulate instructions, namely:

  JMP32, JMP8, CALL, NOP2, NOP_ATOMIC5

The current text_poke_bp() takes a @handler argument which is used as
a jump target when the temporary INT3 is hit by a different CPU.

When patching CALL instructions, this doesn't work because we'd miss
the PUSH of the return address. Instead, teach poke_int3_handler() to
emulate an instruction, typically the instruction we're patching in.

This fits almost all text_poke_bp() users, except
arch_unoptimize_kprobe() which restores random text, and for that site
we have to build an explicit emulate instruction.

Cc: Masami Hiramatsu 
Cc: Steven Rostedt 
Cc: Daniel Bristot de Oliveira 
Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/x86/include/asm/text-patching.h |   24 ++--
 arch/x86/kernel/alternative.c|   98 ++-
 arch/x86/kernel/jump_label.c |9 +--
 arch/x86/kernel/kprobes/opt.c|   11 ++-
 4 files changed, 103 insertions(+), 39 deletions(-)

--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -26,10 +26,11 @@ static inline void apply_paravirt(struct
 #define POKE_MAX_OPCODE_SIZE   5
 
 struct text_poke_loc {
-   void *detour;
void *addr;
-   size_t len;
-   const char opcode[POKE_MAX_OPCODE_SIZE];
+   int len;
+   s32 rel32;
+   u8 opcode;
+   const char text[POKE_MAX_OPCODE_SIZE];
 };
 
 extern void text_poke_early(void *addr, const void *opcode, size_t len);
@@ -51,8 +52,10 @@ extern void text_poke_early(void *addr,
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
-extern void text_poke_bp(void *addr, const void *opcode, size_t len, void 
*handler);
+extern void text_poke_bp(void *addr, const void *opcode, size_t len, const 
void *emulate);
 extern void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int 
nr_entries);
+extern void text_poke_loc_init(struct text_poke_loc *tp, void *addr,
+  const void *opcode, size_t len, const void 
*emulate);
 extern int after_bootmem;
 extern __ro_after_init struct mm_struct *poking_mm;
 extern __ro_after_init unsigned long poking_addr;
@@ -63,8 +66,17 @@ static inline void int3_emulate_jmp(stru
regs->ip = ip;
 }
 
-#define INT3_INSN_SIZE 1
-#define CALL_INSN_SIZE 5
+#define INT3_INSN_SIZE 1
+#define INT3_INSN_OPCODE   0xCC
+
+#define CALL_INSN_SIZE 5
+#define CALL_INSN_OPCODE   0xE8
+
+#define JMP32_INSN_SIZE5
+#define JMP32_INSN_OPCODE  0xE9
+
+#define JMP8_INSN_SIZE 2
+#define JMP8_INSN_OPCODE   0xEB
 
 static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
 {
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -956,16 +956,15 @@ NOKPROBE_SYMBOL(patch_cmp);
 int poke_int3_handler(struct pt_regs *regs)
 {
struct text_poke_loc *tp;
-   unsigned char int3 = 0xcc;
void *ip;
 
/*
 * Having observed our INT3 instruction, we now must observe
 * bp_patching.nr_entries.
 *
-*  nr_entries != 0 INT3
-*  WMB RMB
-*  write INT3  if (nr_entries)
+*  nr_entries != 0 INT3
+*  WMB RMB
+*  write INT3  if (nr_entries)
 *
 * Idem for other elements in bp_patching.
 */
@@ -978,9 +977,9 @@ int poke_int3_handler(struct pt_regs *re
return 0;
 
/*
-* Discount the sizeof(int3). See text_poke_bp_batch().
+* Discount the INT3. See text_poke_bp_batch().
 */
-   ip = (void *) regs->ip - sizeof(int3);
+   ip = (void *) regs->ip - INT3_INSN_SIZE;
 
/*
 * Skip the binary search if there is a single member in the vector.
@@ -997,8 +996,22 @@ int poke_int3_handler(struct pt_regs *re
return 0;
}
 
-   /* set up the specified breakpoint detour */
-   regs->ip = (unsigned long) tp->detour;
+   ip += tp->len;
+
+   switch (tp->opcode) {
+   case CALL_INSN_OPCODE:
+   int3_emulate_call(regs, (long)ip + tp->rel32);
+   break;
+
+   case JMP32_INSN_OPCODE:
+   case JMP8_INSN_OPCODE:
+   int3_emulate_jmp(regs, (long)ip + tp->rel32);
+   break;
+
+   default: /* nop */
+   int3_emulate_jmp(regs, (long)ip);
+   break;
+   }
 
return 1;
 }
@@ -1027,8 +1040,8 @@ NOKPROBE_SYMBOL(poke_int3_handler);
  */
 void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
 {
+   unsigned char int3 = INT3_INSN_OPCODE;
int