Re: [PATCHv2] perf x86_64: Fix rsp register for system call fast path

2012-10-02 Thread Frederic Weisbecker
On Tue, Oct 02, 2012 at 06:06:26PM +0200, Jiri Olsa wrote:
> On Tue, Oct 02, 2012 at 05:49:26PM +0200, Frederic Weisbecker wrote:
> > On Tue, Oct 02, 2012 at 04:58:15PM +0200, Jiri Olsa wrote:
> > > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> > > index 915b876..11d62ff 100644
> > > --- a/arch/x86/kernel/cpu/perf_event.c
> > > +++ b/arch/x86/kernel/cpu/perf_event.c
> > > @@ -34,6 +34,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  
> > >  #include "perf_event.h"
> > >  
> > > @@ -1699,6 +1700,52 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
> > >   userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
> > >  }
> > >  
> > > +#ifdef CONFIG_X86_64
> > > +__weak int arch_sample_regs_user(struct pt_regs *oregs, struct pt_regs *regs)
> > > +{
> > > +	int kernel = !user_mode(regs);
> > > +
> > > +	if (kernel) {
> > > +		if (current->mm)
> > > +			regs = task_pt_regs(current);
> > > +		else
> > > +			regs = NULL;
> > > +	}
> > 
> > Shouldn't the above stay in generic code?
> 
> could be.. I guess I thought that having the regs retrieval
> plus the fixup at the same place feels better/compact ;)
> 
> but could change that if needed

Yeah please.

> > 
> > I'm trying to scratch my head to find a solution to detect the race and
> > bail out instead of recording erroneous values but I can't find one.
> > 
> > Anyway this is still better than what we have now.
> > 
> > > Another solution could be to force syscall slow path and have some variable
> > > set there that tells us we are in a syscall and all regs have been saved.
> > 
> > But we probably don't want to force syscall slow path...
> 
> I was trying something like that as well, but the one I sent looks
> far less hacky to me.. :)

Actually it's more hacky because it's less deterministic.
But it's simpler, and doesn't hurt performance.

Ok, let's start with that.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv2] perf x86_64: Fix rsp register for system call fast path

2012-10-02 Thread Jiri Olsa
On Tue, Oct 02, 2012 at 05:49:26PM +0200, Frederic Weisbecker wrote:
> On Tue, Oct 02, 2012 at 04:58:15PM +0200, Jiri Olsa wrote:
> > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> > index 915b876..11d62ff 100644
> > --- a/arch/x86/kernel/cpu/perf_event.c
> > +++ b/arch/x86/kernel/cpu/perf_event.c
> > @@ -34,6 +34,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include "perf_event.h"
> >  
> > @@ -1699,6 +1700,52 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
> > userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
> >  }
> >  
> > +#ifdef CONFIG_X86_64
> > +__weak int arch_sample_regs_user(struct pt_regs *oregs, struct pt_regs *regs)
> > +{
> > +	int kernel = !user_mode(regs);
> > +
> > +	if (kernel) {
> > +		if (current->mm)
> > +			regs = task_pt_regs(current);
> > +		else
> > +			regs = NULL;
> > +	}
> 
> Shouldn't the above stay in generic code?

could be.. I guess I thought that having the regs retrieval
plus the fixup at the same place feels better/compact ;)

but could change that if needed

SNIP

> 
> That said, a race is there already: if the syscall is interrupted before
> SAVE_ARGS and co.

yep

> 
> I'm trying to scratch my head to find a solution to detect the race and
> bail out instead of recording erroneous values but I can't find one.
> 
> Anyway this is still better than what we have now.
> 
> Another solution could be to force syscall slow path and have some variable
> set there that tells us we are in a syscall and all regs have been saved.
> 
> But we probably don't want to force syscall slow path...

I was trying something like that as well, but the one I sent looks
far less hacky to me.. :)

jirka


Re: [PATCHv2] perf x86_64: Fix rsp register for system call fast path

2012-10-02 Thread Frederic Weisbecker
On Tue, Oct 02, 2012 at 04:58:15PM +0200, Jiri Olsa wrote:
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 915b876..11d62ff 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -34,6 +34,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "perf_event.h"
>  
> @@ -1699,6 +1700,52 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
>   userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
>  }
>  
> +#ifdef CONFIG_X86_64
> +__weak int arch_sample_regs_user(struct pt_regs *oregs, struct pt_regs *regs)
> +{
> +	int kernel = !user_mode(regs);
> +
> +	if (kernel) {
> +		if (current->mm)
> +			regs = task_pt_regs(current);
> +		else
> +			regs = NULL;
> +	}

Shouldn't the above stay in generic code?

> +
> +	if (regs) {
> +		memcpy(oregs, regs, sizeof(*regs));
> +
> +		/*
> +		 * If the perf event was triggered within the kernel code
> +		 * path, then it was either syscall or interrupt. While
> +		 * interrupt stores almost all user registers, the syscall
> +		 * fast path does not. At this point we can at least set
> +		 * rsp register right, which is crucial for dwarf unwind.
> +		 *
> +		 * The syscall_get_nr function returns -1 (orig_ax) for
> +		 * interrupt, and positive value for syscall.
> +		 *
> +		 * We have two race windows in here:
> +		 *
> +		 * 1) Few instructions from syscall entry until old_rsp is
> +		 *    set.
> +		 *
> +		 * 2) In syscall/interrupt path from entry until the orig_ax
> +		 *    is set.
> +		 *
> +		 * Above described race windows are fractional opposed to
> +		 * the syscall fast path, so we get much better results
> +		 * fixing rsp this way.

That said, a race is there already: if the syscall is interrupted before
SAVE_ARGS and co.

I'm trying to scratch my head to find a solution to detect the race and
bail out instead of recording erroneous values but I can't find one.

Anyway this is still better than what we have now.

Another solution could be to force syscall slow path and have some variable
set there that tells us we are in a syscall and all regs have been saved.

But we probably don't want to force syscall slow path...


[PATCHv2] perf x86_64: Fix rsp register for system call fast path

2012-10-02 Thread Jiri Olsa
On Tue, Oct 02, 2012 at 12:44:04PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-10-01 at 19:31 +0200, Jiri Olsa wrote:
> > @@ -696,7 +696,7 @@ struct perf_branch_stack {
> >  
> >  struct perf_regs_user {
> > __u64   abi;
> > -   struct pt_regs  *regs;
> > +   struct pt_regs  regs;
> >  };
> 
> That's somewhat unfortunate but unavoidable I guess, can't go modify pt_regs. 
> 
> 
> > +   if (uregs->abi)
> > +   stack_size = 
> > perf_sample_ustack_size(sample_stack_user,
> > +header->size,
> > +
> 
> just a style nit, please add {} for all multi-line single stmt
> constructs like that, even though not strictly required.
> 
> It reduces the possible confusion between multi-line and multi-statement
> and reads easier.

fixed, new version is attached

thanks,
jirka
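
For reference, the brace convention Peter asks for above can be
illustrated with a small standalone sketch. None of these names come
from the patch: sketch_clamp(), sketch_stack_size() and SKETCH_LIMIT are
hypothetical stand-ins, and the clamping rule is only an assumption
chosen to give the call several arguments worth wrapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical upper bound on a record, for illustration only. */
#define SKETCH_LIMIT 65535u

/* Hypothetical helper: clamp a requested dump size to the room left
 * after the header. Not the kernel's perf_sample_ustack_size(). */
static size_t sketch_clamp(size_t requested, size_t header_size, size_t limit)
{
	size_t avail;

	assert(limit > 0);

	if (header_size >= limit)
		return 0;

	avail = limit - header_size;
	return requested < avail ? requested : avail;
}

static size_t sketch_stack_size(int abi, size_t requested, size_t header_size)
{
	size_t stack_size = 0;

	/*
	 * One statement, but it spans several lines, so it gets braces.
	 * This keeps "multi-line" visually distinct from "multi-statement".
	 */
	if (abi) {
		stack_size = sketch_clamp(requested,
					  header_size,
					  SKETCH_LIMIT);
	}

	return stack_size;
}
```

A one-line body such as "if (abi) return 0;" would still go without
braces under this convention.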


---
The user level rsp register value attached to the sample is crucial
for a proper user stack dump and for proper dwarf backtrace post unwind.

But currently, if the event happens within the system call fast path,
we don't store the proper rsp register value in the event sample.

The reason is that the syscall fast path stores just a minimal set of
registers to the task's struct pt_regs area. The rsp itself is stored
in the per cpu variable 'old_rsp'.

This patch fixes the rsp register value based on:
  - the 'old_rsp' per cpu variable
    (updated within the syscall fast path)
  - a guess on how we got into the kernel - syscall or interrupt
    (via the pt_regs::orig_ax value)

We can use the 'old_rsp' value only if we are inside the syscall.
Thanks to Oleg who outlined this solution!

The above guess introduces 2 race windows (fully described within the
patch comments), where we might get an incorrect user level rsp value
stored in the sample. However, since these windows are short compared
with the system call fast path length, we still get much more precise
rsp values than without the patch.

Note that as we are now changing the pt_regs, we use statically allocated
pt_regs inside the sample data instead of the task pt_regs pointer.
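
The selection logic described above can be sketched in plain userspace C.
This is only an illustration under stated assumptions, not the patch
itself: struct sketch_regs stands in for struct pt_regs, sketch_old_rsp
stands in for the per cpu 'old_rsp', and the in_kernel flag replaces the
!user_mode(regs) check. Treating any non-negative orig_ax as "syscall"
is also an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct pt_regs; only two fields modelled. */
struct sketch_regs {
	int64_t  orig_ax;	/* syscall number, or -1 after an interrupt */
	uint64_t sp;		/* user stack pointer saved on kernel entry */
};

/* Hypothetical stand-in for the per cpu 'old_rsp' variable. */
static uint64_t sketch_old_rsp;

/*
 * Copy the user registers into the caller's own storage (so the task's
 * saved registers are never modified), then patch sp when the kernel
 * was entered through a syscall.
 * Returns 0 on success, -1 when there are no user registers to sample.
 */
static int sketch_sample_regs_user(struct sketch_regs *oregs,
				   const struct sketch_regs *regs,
				   int in_kernel)
{
	assert(oregs != NULL);

	if (regs == NULL)
		return -1;

	memcpy(oregs, regs, sizeof(*oregs));

	/*
	 * The guess: orig_ax of -1 means interrupt entry, where sp was
	 * already saved correctly. Otherwise assume a syscall and take
	 * the user sp from the old_rsp stand-in.
	 */
	if (in_kernel && regs->orig_ax >= 0)
		oregs->sp = sketch_old_rsp;

	return 0;
}
```

Working on a copy is what makes the fixup safe: the value is rewritten
only in the sample, which is why the sample data has to carry its own
pt_regs rather than a pointer.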

Example of syscall fast path dwarf backtrace unwind:
(perf record -e cycles -g dwarf ls; perf report --stdio)

Before the patch applied:

  --23.76%-- preempt_schedule_irq
 retint_kernel
 tty_ldisc_deref
 tty_write
 vfs_write
 sys_write
 system_call_fastpath
 __GI___libc_write
 0x6

With the patch applied:

  --12.37%-- finish_task_switch
 __schedule
 preempt_schedule
 queue_work
 schedule_work
 tty_flip_buffer_push
 pty_write
 n_tty_write
 tty_write
 vfs_write
 sys_write
 system_call_fastpath
 __GI___libc_write
 _IO_file_write@@GLIBC_2.2.5
 new_do_write
 _IO_do_write@@GLIBC_2.2.5
 _IO_file_overflow@@GLIBC_2.2.5
 print_current_files
 main
 __libc_start_main
 _start

Signed-off-by: Jiri Olsa 
Cc: Frederic Weisbecker 
Cc: Ingo Molnar 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Arnaldo Carvalho de Melo 
Cc: Oleg Nesterov 
---
 arch/x86/kernel/cpu/perf_event.c |   47 ++
 include/linux/perf_event.h       |    6 +-
 kernel/events/core.c             |   81 +++--
 3 files changed, 92 insertions(+), 42 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 915b876..11d62ff 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "perf_event.h"
 
@@ -1699,6 +1700,52 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
 }
 
+#ifdef CONFIG_X86_64
+__weak int arch_sample_regs_user(struct pt_regs *oregs, struct pt_regs *regs)
+{
+	int kernel = !user_mode(regs);
+
+	if (kernel) {
+		if (current->mm)
+			regs = task_pt_regs(current);
+		else
+			regs = NULL;
+	}
+
+	if (regs) {
+		memcpy(oregs, regs, sizeof(*regs));
+
+		/*
+		 * If the perf event was triggered within the kernel code
+		 * path, then it was either syscall or interrupt. While
+		 * interrupt stores almost all user registers, the syscall
+		 * fast path does not. At this point we can at least set
+		 * rsp register right, which is crucial for dwarf unwind.
+		 *
+		 * The syscall_get_nr function returns -1 (orig_ax) for
+		 * interrupt, and positive value for syscall.
+		 *
+		 * We have two race windows in here:
+		 *
+