Re: [PATCH 05/15] perf: Track guest callbacks on a per-CPU basis

2021-08-27 Thread Sean Christopherson
On Fri, Aug 27, 2021, Peter Zijlstra wrote:
> On Fri, Aug 27, 2021 at 02:49:50PM +, Sean Christopherson wrote:
> > On Fri, Aug 27, 2021, Peter Zijlstra wrote:
> > > On Thu, Aug 26, 2021 at 05:57:08PM -0700, Sean Christopherson wrote:
> > > > Use a per-CPU pointer to track perf's guest callbacks so that KVM can set
> > > > the callbacks more precisely and avoid a lurking NULL pointer dereference.
> > > 
> > > I'm completely failing to see how per-cpu helps anything here...
> > 
> > It doesn't help until KVM is converted to set the per-cpu pointer in flows
> > that are protected against preemption, and more specifically when KVM only
> > writes to the pointer from the owning CPU.
> 
> So the 'problem' I have with this is that sane (!KVM using) people will
> still have to suffer that load, whereas with the static_call() we patch
> in an 'xor %rax,%rax' and only have immediate code flow.

Again, I've no objection to the static_call() approach.  I didn't even see the
patch until I had finished testing my series :-/

> > Ignoring static call for the moment, I don't see how the unreg side can be
> > safe using a bare single global pointer.  There is no way for KVM to prevent
> > an NMI from running in parallel on a different CPU.  If there's a more
> > elegant solution, especially something that can be backported, e.g. an
> > rcu-protected pointer, I'm all for it.  I went down the per-cpu path because
> > it allowed for cleanups in KVM, but similar cleanups can be done without
> > per-cpu perf callbacks.
> 
> If all the perf_guest_cbs dereferences are with preemption disabled
> (IRQs disabled, IRQ context, NMI context included), then the sequence:
> 
>   WRITE_ONCE(perf_guest_cbs, NULL);
>   synchronize_rcu();
> 
> Ensures that all prior observers of perf_guest_cbs will have completed
> and future observers must observe the NULL value.

That alone won't be sufficient, as the read side also needs to ensure it doesn't
reload perf_guest_cbs between NULL checks and dereferences.  But that's easy
enough to solve with a READ_ONCE and maybe a helper to make it more cumbersome
to use perf_guest_cbs directly.
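
A minimal sketch of what such a read-side helper could look like (the helper
name is made up for this sketch, it is not part of the series):

	static inline struct perf_guest_info_callbacks *perf_get_guest_cbs(void)
	{
		/*
		 * READ_ONCE() snapshots the pointer exactly once, so the
		 * NULL check and every subsequent callback invocation see
		 * the same value even if unregistration races with us.
		 */
		return READ_ONCE(perf_guest_cbs);
	}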

How about this for a series?

  1. Use READ_ONCE/WRITE_ONCE + synchronize_rcu() to fix the underlying bug
  2. Fix KVM PT interrupt handler bug
  3. Kill off perf_guest_cbs usage in architectures that don't need the callbacks
  4. Replace ->is_in_guest()/->is_user_mode() with ->state(), and s/get_guest_ip/get_ip
  5. Implement static_call() support
  6. Cleanups, if there are any
  7..N KVM cleanups, e.g. to eliminate current_vcpu and share x86+arm64 callbacks



Re: [PATCH 05/15] perf: Track guest callbacks on a per-CPU basis

2021-08-27 Thread Peter Zijlstra
On Fri, Aug 27, 2021 at 02:49:50PM +, Sean Christopherson wrote:
> On Fri, Aug 27, 2021, Peter Zijlstra wrote:
> > On Thu, Aug 26, 2021 at 05:57:08PM -0700, Sean Christopherson wrote:
> > > Use a per-CPU pointer to track perf's guest callbacks so that KVM can set
> > > the callbacks more precisely and avoid a lurking NULL pointer dereference.
> > 
> > I'm completely failing to see how per-cpu helps anything here...
> 
> It doesn't help until KVM is converted to set the per-cpu pointer in flows
> that are protected against preemption, and more specifically when KVM only
> writes to the pointer from the owning CPU.

So the 'problem' I have with this is that sane (!KVM using) people will
still have to suffer that load, whereas with the static_call() we patch
in an 'xor %rax,%rax' and only have immediate code flow.

> Ignoring static call for the moment, I don't see how the unreg side can be
> safe using a bare single global pointer.  There is no way for KVM to prevent
> an NMI from running in parallel on a different CPU.  If there's a more
> elegant solution, especially something that can be backported, e.g. an
> rcu-protected pointer, I'm all for it.  I went down the per-cpu path because
> it allowed for cleanups in KVM, but similar cleanups can be done without
> per-cpu perf callbacks.

If all the perf_guest_cbs dereferences are with preemption disabled
(IRQs disabled, IRQ context, NMI context included), then the sequence:

WRITE_ONCE(perf_guest_cbs, NULL);
synchronize_rcu();

Ensures that all prior observers of perf_guest_cbs will have completed
and future observers must observe the NULL value.
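
For concreteness, the unregister flow described above would look roughly like
this (a sketch only; the signature is simplified relative to the real
perf_unregister_guest_info_callbacks()):

	void perf_unregister_guest_info_callbacks(void)
	{
		WRITE_ONCE(perf_guest_cbs, NULL);

		/*
		 * All readers run with preemption disabled, so once this
		 * returns, no IRQ/NMI handler can still hold the old
		 * pointer and the module can be unloaded safely.
		 */
		synchronize_rcu();
	}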




Re: [PATCH 05/15] perf: Track guest callbacks on a per-CPU basis

2021-08-27 Thread Sean Christopherson
On Fri, Aug 27, 2021, Peter Zijlstra wrote:
> On Thu, Aug 26, 2021 at 05:57:08PM -0700, Sean Christopherson wrote:
> > Use a per-CPU pointer to track perf's guest callbacks so that KVM can set
> > the callbacks more precisely and avoid a lurking NULL pointer dereference.
> 
> I'm completely failing to see how per-cpu helps anything here...

It doesn't help until KVM is converted to set the per-cpu pointer in flows that
are protected against preemption, and more specifically when KVM only writes to
the pointer from the owning CPU.  

> > On x86, KVM supports being built as a module and thus can be unloaded.
> > And because the shared callbacks are referenced from IRQ/NMI context,
> > unloading KVM can run concurrently with perf, and thus all of perf's
> > checks for a NULL perf_guest_cbs are flawed as perf_guest_cbs could be
> > nullified between the check and dereference.
> 
> No longer allowing KVM to be a module would be *AWESOME*. I detest how
> much we have to export for KVM :/
> 
> Still, what stops KVM from doing a coherent unreg? Even with the
> static_call() proposed in the other patch, unreg can do
> static_call_update() + synchronize_rcu() to ensure everybody sees the
> updated pointer (would require a quick audit to see all users are with
> preempt disabled, but I think your use of per-cpu here already imposes
> the same).

Ignoring static call for the moment, I don't see how the unreg side can be safe
using a bare single global pointer.  There is no way for KVM to prevent an NMI
from running in parallel on a different CPU.  If there's a more elegant
solution, especially something that can be backported, e.g. an rcu-protected
pointer, I'm all for it.  I went down the per-cpu path because it allowed for
cleanups in KVM, but similar cleanups can be done without per-cpu perf
callbacks.

As for static calls, I certainly have no objection to employing static calls for
the callbacks, but IMO we should not be relying on static call for correctness,
i.e. the existing bug needs to be fixed first.
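
For illustration, a static_call()-based variant of the reg/unreg sequence
could look like this (the names and the single ->state() hook are assumptions
for the sketch, not the eventual API):

	#include <linux/static_call.h>

	/* Patched-in default for the !KVM case: never in a guest. */
	static unsigned int perf_guest_state_nop(void)
	{
		return 0;
	}

	DEFINE_STATIC_CALL(__perf_guest_state, perf_guest_state_nop);

	void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
	{
		static_call_update(__perf_guest_state, cbs->state);
	}

	void perf_unregister_guest_info_callbacks(void)
	{
		/* Patch callers back to the no-op... */
		static_call_update(__perf_guest_state, perf_guest_state_nop);
		/* ...and wait out any in-flight IRQ/NMI users. */
		synchronize_rcu();
	}

Readers would then use static_call(__perf_guest_state)() with no NULL check at
all, and the !KVM case collapses to the patched-in return-0.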



Re: [PATCH 05/15] perf: Track guest callbacks on a per-CPU basis

2021-08-27 Thread Peter Zijlstra
On Thu, Aug 26, 2021 at 05:57:08PM -0700, Sean Christopherson wrote:
> Use a per-CPU pointer to track perf's guest callbacks so that KVM can set
> the callbacks more precisely and avoid a lurking NULL pointer dereference.

I'm completely failing to see how per-cpu helps anything here...

> On x86, KVM supports being built as a module and thus can be unloaded.
> And because the shared callbacks are referenced from IRQ/NMI context,
> unloading KVM can run concurrently with perf, and thus all of perf's
> checks for a NULL perf_guest_cbs are flawed as perf_guest_cbs could be
> nullified between the check and dereference.

No longer allowing KVM to be a module would be *AWESOME*. I detest how
much we have to export for KVM :/

Still, what stops KVM from doing a coherent unreg? Even with the
static_call() proposed in the other patch, unreg can do
static_call_update() + synchronize_rcu() to ensure everybody sees the
updated pointer (would require a quick audit to see all users are with
preempt disabled, but I think your use of per-cpu here already imposes
the same).





[PATCH 05/15] perf: Track guest callbacks on a per-CPU basis

2021-08-26 Thread Sean Christopherson
Use a per-CPU pointer to track perf's guest callbacks so that KVM can set
the callbacks more precisely and avoid a lurking NULL pointer dereference.
On x86, KVM supports being built as a module and thus can be unloaded.
And because the shared callbacks are referenced from IRQ/NMI context,
unloading KVM can run concurrently with perf, and thus all of perf's
checks for a NULL perf_guest_cbs are flawed as perf_guest_cbs could be
nullified between the check and dereference.

In practice, this has not been problematic because the callbacks are
always guarded with a "perf_guest_cbs && perf_guest_cbs->is_in_guest()"
pattern, and it's extremely unlikely the compiler will choose to reload
perf_guest_cbs in that particular sequence.  Because is_in_guest() is
obviously true only when KVM is running a guest, perf always wins the
race to the guarded code (which does often reload perf_guest_cbs) as KVM
has to stop running all guests and do a heavy teardown before unloading.
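
To make the race concrete, the flawed pattern and the snapshot-based fix look
roughly like this (illustrative C; handle_guest_sample() is a stand-in, not a
real function):

	/*
	 * Flawed: the compiler may reload perf_guest_cbs between the
	 * NULL check and the indirect call, so an NMI racing with KVM
	 * unload can dereference NULL.
	 */
	if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
		handle_guest_sample();

	/*
	 * Fixed: snapshot the pointer once; the check and the call are
	 * then guaranteed to operate on the same value.
	 */
	struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);

	if (guest_cbs && guest_cbs->is_in_guest())
		handle_guest_sample();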

Cc: Zhu Lingshan 
Signed-off-by: Sean Christopherson 
---
 arch/arm64/kernel/perf_callchain.c | 18 ++++++++++++------
 arch/x86/events/core.c             | 17 +++++++++++------
 arch/x86/events/intel/core.c       |  8 +++++---
 include/linux/perf_event.h         |  2 +-
 kernel/events/core.c               | 12 +++++++---
 5 files changed, 38 insertions(+), 19 deletions(-)

diff --git a/arch/arm64/kernel/perf_callchain.c b/arch/arm64/kernel/perf_callchain.c
index 4a72c2727309..38555275c6a2 100644
--- a/arch/arm64/kernel/perf_callchain.c
+++ b/arch/arm64/kernel/perf_callchain.c
@@ -102,7 +102,9 @@ compat_user_backtrace(struct compat_frame_tail __user *tail,
 void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 			 struct pt_regs *regs)
 {
-	if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+	struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
+
+	if (guest_cbs && guest_cbs->is_in_guest()) {
 		/* We don't support guest os callchain now */
 		return;
 	}
@@ -147,9 +149,10 @@ static bool callchain_trace(void *data, unsigned long pc)
 void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
 			   struct pt_regs *regs)
 {
+	struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
 	struct stackframe frame;
 
-	if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+	if (guest_cbs && guest_cbs->is_in_guest()) {
 		/* We don't support guest os callchain now */
 		return;
 	}
@@ -160,18 +163,21 @@ void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
 
 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
-	if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
-		return perf_guest_cbs->get_guest_ip();
+	struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
+
+	if (guest_cbs && guest_cbs->is_in_guest())
+		return guest_cbs->get_guest_ip();
 
 	return instruction_pointer(regs);
 }
 
 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
+	struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
 	int misc = 0;
 
-	if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
-		if (perf_guest_cbs->is_user_mode())
+	if (guest_cbs && guest_cbs->is_in_guest()) {
+		if (guest_cbs->is_user_mode())
 			misc |= PERF_RECORD_MISC_GUEST_USER;
 		else
 			misc |= PERF_RECORD_MISC_GUEST_KERNEL;
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 1eb45139fcc6..34155a52e498 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2761,10 +2761,11 @@ static bool perf_hw_regs(struct pt_regs *regs)
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
 {
+	struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
 	struct unwind_state state;
 	unsigned long addr;
 
-	if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+	if (guest_cbs && guest_cbs->is_in_guest()) {
 		/* TODO: We don't support guest os callchain now */
 		return;
 	}
@@ -2864,10 +2865,11 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
 void
 perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
 {
+	struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
 	struct stack_frame frame;
 	const struct stack_frame __user *fp;
 
-	if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+	if (guest_cbs && guest_cbs->is_in_guest()) {
 		/* TODO: We don't support guest os callchain now */
 		return;
 	}
@@ -2944,18 +2946,21 @@ static unsigned long code_segment_base(struct pt_regs *regs)